Most Recommended Negative Results
I originally planned this as a separate website or video series, but the project has stalled over the past few months, so I’ve decided to post it with only a few edits.
What happens to the experiments which don’t show improvement over our previous baseline? In the data science / machine learning community, we hear that negative results can bring balance and objectivity. Yet publication bias persists, and well-written negative results are hard to find. Here I’ve selected three papers which are recognized as standout examples of negative results, with added commentary or definitions.
If this interests you, I’d also recommend the 2020 and upcoming 2021 EMNLP Insights workshop which specifically calls for papers about negative results.
What are we reading?
This paper experiments with unnormalized neural networks (without a batch normalization step in between layers). The authors go beyond removing this step, by:
- identifying situations where batch normalization causes issues
- describing alternatives to replace the benefits of batch normalization
Their final network is still based on ResNet (a neural network architecture) and is competitive on ImageNet (a standard image classification benchmark).
Batch Normalization is a key component in almost all state-of-the-art image classifiers, but it also introduces practical challenges: it breaks the independence between training examples within a batch, can incur compute and memory overhead, and often results in unexpected bugs.
In the abstract, we already see a summarized list of issues with batch normalization. We want to avoid unexpected bugs!
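To make the "breaks the independence between training examples" point concrete, here is a minimal NumPy sketch. The `batch_norm` function is my own simplified version (no learned scale or shift, no running statistics), not code from the paper. It shows that the normalized output for one example changes when only the other examples in its batch change.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature with statistics computed across the batch axis."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(4, 3))      # 4 examples, 3 features

# Keep example 0 identical, but swap in different batch-mates.
batch_b = batch_a.copy()
batch_b[1:] = rng.normal(size=(3, 3)) * 5.0

out_a = batch_norm(batch_a)[0]
out_b = batch_norm(batch_b)[0]

# The input for example 0 is unchanged, yet its normalized output differs.
same_input = np.allclose(batch_a[0], batch_b[0])
same_output = np.allclose(out_a, out_b)
print(same_input, same_output)
```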
Building on recent theoretical analyses of deep ResNets at initialization, we propose a simple set of analysis tools to characterize signal propagation on the forward pass, and leverage these tools to design highly performant ResNets without activation normalization layers. Crucial to our success is an adapted version of the recently proposed Weight Standardization.
What recent analyses are they talking about, and what is the proposed Weight Standardization process?
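On the second half of that question: Weight Standardization (Qiao et al., 2019) re-centers and re-scales the weights feeding each output channel so they have zero mean and unit variance. Below is a rough NumPy sketch. The `(fan_in, out_channels)` layout, the function name, and the `scaled` / `gain` arguments are my assumptions for illustration; the scaled variant is only meant in the spirit of the paper's adapted version, not as its exact implementation.

```python
import numpy as np

def weight_standardize(w, gain=1.0, scaled=False, eps=1e-8):
    """Standardize a weight matrix of shape (fan_in, out_channels) per output channel.

    With scaled=True the result is also divided by sqrt(fan_in) and multiplied
    by `gain` (an activation-dependent constant; treated here as an assumption).
    """
    fan_in = w.shape[0]
    mean = w.mean(axis=0, keepdims=True)
    std = w.std(axis=0, keepdims=True)
    w_hat = (w - mean) / (std + eps)
    if scaled:
        w_hat = gain * w_hat / np.sqrt(fan_in)
    return w_hat

rng = np.random.default_rng(0)
w = rng.normal(loc=0.3, scale=2.0, size=(64, 8))   # deliberately off-center weights
w_hat = weight_standardize(w)

print(w_hat.mean(axis=0).round(6))   # per-channel means, close to 0
print(w_hat.std(axis=0).round(6))    # per-channel standard deviations, close to 1
```

In a "forward mode" setup, the network would call something like `weight_standardize(w)` every forward pass while the optimizer keeps updating the raw `w`.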
Our analysis tools show how this technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth. Across a range of FLOP budgets, our networks attain performance competitive with the state-of-the-art EfficientNets on ImageNet.
They are allowing two activation function types in the network.
By FLOP budgets, they mean that they conducted several training runs comparing the accuracy of different architectures at increasing levels of compute cost. The paper presents these comparisons as graphs.
Now let’s skip ahead to the negative results! They are written as a one-page appendix to the paper. The authors describe some other attempts to improve accuracy, and their current explanations for why these have not worked.
APPENDIX G NEGATIVE RESULTS
G.1 FORWARD MODE VS DECOUPLED WS
Parameterization methods like Weight Standardization (Qiao et al., 2019), Weight Normalization (Salimans & Kingma, 2016), and Spectral Normalization (Miyato et al., 2018) are typically proposed as “forward mode” modifications applied to parameters during the forward pass of a network. This has two consequences: first, this means that the gradients with respect to the underlying parameters are influenced by the parameterization, and that the weights which are optimized may differ substantially from the weights which are actually plugged into the network.
One alternative approach is to implement “decoupled” variants of these parameterizers, by applying them as a projection step in the optimizer.
The authors mentioned three papers which use “forward mode” modifications, but they had wanted to include a “decoupled” version as well.
For example, “Decoupled Weight Standardization” can be implemented atop any gradient based optimizer by replacing W with the normalized Wˆ after the update step. Most papers proposing parameterizations (including the above) argue that the parameterization’s gradient influence is helpful for learning, but this is typically argued with respect to simply ignoring the parameterization during the backward pass, rather than with respect to a strategy such as this.
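My reading of "replacing W with the normalized Wˆ after the update step" is sketched below with plain SGD. The function names, the column-wise layout, and the random stand-in gradient are all assumptions of mine; the paper does not give code for this.

```python
import numpy as np

def standardize(w, eps=1e-8):
    """Zero mean and unit std per column (assumed to be the output-channel axis)."""
    return (w - w.mean(axis=0, keepdims=True)) / (w.std(axis=0, keepdims=True) + eps)

def decoupled_ws_sgd_step(w, grad, lr=0.1):
    """One ordinary SGD update followed by a projection onto standardized weights."""
    w = w - lr * grad        # gradient step on the stored weights
    return standardize(w)    # projection: the stored weights are W-hat from now on

rng = np.random.default_rng(0)
w = standardize(rng.normal(size=(16, 4)))
grad = rng.normal(size=w.shape)          # stand-in for a real gradient
w_next = decoupled_ws_sgd_step(w, grad)

print(w_next.mean(axis=0).round(6), w_next.std(axis=0).round(6))
```

The difference from forward mode is that the gradient here is taken with respect to the weights that are actually used in the network, and the standardization never appears in the backward pass.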
Using a Forward-Mode parameterization may result in interesting interactions with moving averages or weight decay. For example, with WS, if one takes a moving average of the underlying weights, then applies the WS parameterization to the averaged weights, this will produce different results than if one took the EMA of the Weight-Standardized parameters. Weight decay will have a similar phenomenon: if one is weight decaying a parameter which is actually a proxy for a weight-standardized parameter, how does this change the behavior of the regularization?
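The moving-average point is easy to check numerically. In this sketch (my own toy setup, with random perturbations standing in for training updates), standardizing the averaged weights and averaging the standardized weights give different answers.

```python
import numpy as np

def standardize(w, eps=1e-8):
    """Zero mean and unit std per column (assumed to be the output-channel axis)."""
    return (w - w.mean(axis=0, keepdims=True)) / (w.std(axis=0, keepdims=True) + eps)

rng = np.random.default_rng(0)
decay = 0.9
w = rng.normal(size=(16, 4))
ema_of_raw = w.copy()          # EMA of the underlying weights
ema_of_ws = standardize(w)     # EMA of the weight-standardized weights

for _ in range(20):
    w = w + 0.3 * rng.normal(size=w.shape)   # stand-in for training updates
    ema_of_raw = decay * ema_of_raw + (1 - decay) * w
    ema_of_ws = decay * ema_of_ws + (1 - decay) * standardize(w)

ws_of_ema = standardize(ema_of_raw)          # standardize after averaging
max_gap = np.abs(ws_of_ema - ema_of_ws).max()
print(max_gap)
```

One visible difference: `ws_of_ema` has unit standard deviation per channel by construction, while an average of several different unit-variance weight matrices generally has a smaller one.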
So they looked into it — here’s what happened:
We experimented with Decoupled WS and found that it reduced sensitivity to weight decay (presumably because of the strength of the projection step) and often improved the accuracy of the EMA weights early in training, but ultimately led to worse performance than using the originally proposed “forward-mode” formulation. We emphasize that our experiments in this regime were only cursory, and suggest that future work might seek to analyze these interactions in more depth.
We also tried applying Scaled WS as a regularizer (“Soft WS”) by penalizing the mean squared error between the parameter W and its Scaled WS parameterization, Wˆ . We implemented this as a direct addition to the parameters following Loshchilov & Hutter (2017) rather than as a differentiated loss, with a scale hyperparameter controlling the strength of the regularization. We found that this scale could not be meaningfully decreased from its maximal value without drastic training instability, indicating that relaxing the WS constraint is better done through other means, such as the affine gains and biases we employ.
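Here is how I picture "Soft WS" as a direct addition to the parameters, in the style of decoupled weight decay. This is my interpretation, not the authors' code: the update rule, the `strength` name, and the use of plain (unscaled) standardization are all assumptions.

```python
import numpy as np

def standardize(w, eps=1e-8):
    """Zero mean and unit std per column (assumed to be the output-channel axis)."""
    return (w - w.mean(axis=0, keepdims=True)) / (w.std(axis=0, keepdims=True) + eps)

def soft_ws_sgd_step(w, grad, lr=0.1, strength=1.0):
    """SGD step, then pull the weights toward their standardized version.

    strength=0.0 leaves the SGD result untouched; strength=1.0 jumps all the
    way to the standardized weights.
    """
    w = w - lr * grad
    return w - strength * (w - standardize(w))

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4))
grad = rng.normal(size=w.shape)          # stand-in for a real gradient

full = soft_ws_sgd_step(w, grad, strength=1.0)
half = soft_ws_sgd_step(w, grad, strength=0.5)
none = soft_ws_sgd_step(w, grad, strength=0.0)
print(full.std(axis=0).round(3), half.std(axis=0).round(3))
```

Under this reading, the maximal strength is the same as projecting onto the standardized weights after every step, which would fit the authors' remark that the scale could not be lowered much from its maximum.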
For SPPs, we initially explored plotting activation mean (np.mean(h)) instead of the average squared channel mean, but found that this was less informative.
We also initially explored plotting the average pixel norm: the Frobenius norm of each pixel (reduced across the C axis) then averaged across the NHW axis, (np.mean(np.linalg.norm(h, axis=-1))). We found that this value did not add any information not already contained in the channel or residual variance measures, and was harder to interpret due to it varying with the channel count.
NHWC = array dimensions ordered by: num_images, num_rows, num_columns, num_channels
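To see why the plain activation mean can be less informative, here is a small NumPy sketch on a made-up NHWC tensor `h` whose channels alternate between a mean of +1 and -1. The tensor and variable names are mine; the three statistics follow the descriptions quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, height, width, c = 8, 16, 16, 32

# Toy activations: unit-variance noise plus a per-channel offset of +1 or -1.
offsets = np.where(np.arange(c) % 2 == 0, 1.0, -1.0)
h = rng.normal(size=(n, height, width, c)) + offsets

# Plain activation mean: positive and negative channel means cancel out.
activation_mean = np.mean(h)

# Average squared channel mean: mean over N, H, W, squared, then averaged over C.
avg_sq_channel_mean = np.mean(np.mean(h, axis=(0, 1, 2)) ** 2)

# Average pixel norm: norm over the C axis, averaged over N, H, W.
avg_pixel_norm = np.mean(np.linalg.norm(h, axis=-1))

print(activation_mean, avg_sq_channel_mean, avg_pixel_norm)
```

The first number is close to zero even though every channel is shifted, while the second is close to one. The pixel norm grows with the number of channels, which matches the authors' complaint that it is harder to interpret.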
We explored NF-ResNet variants which maintained constant signal variance, rather than mimicking Batch-Normalized ResNets with signal growth + resets. The first of two key components in this approach was making use of “rescaled sum junctions,” where the sum junction in a residual block was rewritten to downscale the shortcut path as y = (α ∗ f(x) + x) / α², which is approximately norm-preserving if f(x) is orthogonal to x (which we observed to generally hold in practice).
Instead of scaled WS, this variant employed SeLU (Klambauer et al., 2017) activations, which we found to work as-advertised in encouraging centering and good scaling. While these networks could be made to train stably, we found tuning them to be difficult and were not able to easily recover the performance of BN-ResNets as we were with the approach ultimately presented in this paper.
What am I reading?
This paper describes a model (“BigGAN”) which generates realistic images. For context, GANs (generative adversarial networks) are used to generate faces, art, and text. Two dueling nets, a generator and a discriminator, train against each other to create fakes and identify fakes.
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale.
We find that applying orthogonal regularization to the generator renders it amenable to a simple “truncation trick,” allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator’s input.
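As I understand the truncation trick, the generator's input vector is sampled as usual, and any entries whose magnitude exceeds a threshold are resampled until they fall inside it. Here is a minimal NumPy sketch; the function name, shapes, and threshold value are illustrative assumptions.

```python
import numpy as np

def truncated_z(rng, shape, threshold):
    """Sample z ~ N(0, 1), resampling entries with magnitude above `threshold`."""
    z = rng.normal(size=shape)
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.normal(size=mask.sum())   # redraw only the out-of-range entries
        mask = np.abs(z) > threshold
    return z

rng = np.random.default_rng(0)
z_full = rng.normal(size=(1000, 128))
z_trunc = truncated_z(rng, (1000, 128), threshold=0.5)

# A smaller threshold means lower input variance: higher fidelity, less variety.
print(z_full.std(), z_trunc.std())
```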
Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128×128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.65.
Their final network generates images at 128×128 pixels, based on the categories in ImageNet (a standard image classification benchmark).
Let’s jump ahead to the negative results, again written as an appendix.
APPENDIX H NEGATIVE RESULTS
We explored a range of novel and existing techniques which ended up degrading or otherwise not affecting performance in our setting. We report them here; our evaluations for this section are not as thorough as those for the main architectural choices. Our intention in reporting these results is to save time for future work, and to give a more complete picture of our attempts to improve performance or stability. We note, however, that these results must be understood to be specific to the particular setup we used. A pitfall of reporting negative results is that one might report that a particular technique doesn’t work, when the reality is that this technique did not have the desired effect when applied in a particular way to a particular problem. Drawing overly general conclusions might close off potentially fruitful avenues of research.
This is an excellent mission statement, and an important caveat about how to read the negative results in this experiment.
We found that doubling the depth (by inserting an additional Residual block after every up or down-sampling block) hampered performance.
We experimented with sharing class embeddings between both G and D (as opposed to just within G). This is accomplished by replacing D’s class embedding with a projection from G’s embeddings, as is done in G’s BatchNorm layers. In our initial experiments this seemed to help and accelerate training, but we found this trick scaled poorly and was sensitive to optimization hyperparameters, particularly the choice of number of D steps per G step.
We tried replacing BatchNorm in G with WeightNorm (Salimans & Kingma, 2016), but this crippled training. We also tried removing BatchNorm and only having Spectral Normalization, but this also crippled training.
This is interesting in retrospect, considering that the first paper above, later co-written by one of the same authors, removes BatchNorm.
We tried adding BatchNorm to D (both class-conditional and unconditional) in addition to Spectral Normalization, but this crippled training.
We tried varying the choice of location of the attention block in G and D (and inserting multiple attention blocks at different resolutions) but found that at 128×128 there was no noticeable benefit to doing so, and compute and memory costs increased substantially. We found a benefit to moving the attention block up one stage when moving to 256×256, which is in line with our expectations given the increased resolution.
We tried using filter sizes of 5 or 7 instead of 3 in either G or D or both. We found that having a filter size of 5 in G only provided a small improvement over the baseline but came at an unjustifiable compute cost. All other settings degraded performance.
We tried varying the dilation for convolutional filters in both G and D at 128×128, but found that even a small amount of dilation in either network degraded performance.
We tried bilinear upsampling in G in place of nearest-neighbors upsampling, but this degraded performance.
In some of our models, we observed class-conditional mode collapse, where the model would only output one or two samples for a subset of classes but was still able to generate samples for all other classes. We noticed that the collapsed classes had embeddings which had become very large relative to the other embeddings, and attempted to ameliorate this issue by applying weight decay to the shared embedding only. We found that small amounts of weight decay (10⁻⁶) instead degraded performance, and that only even smaller values (10⁻⁸) did not degrade performance, but these values were also too small to prevent the class vectors from exploding. Higher-resolution models appear to be more resilient to this problem, and none of our final models appear to suffer from this type of collapse.
We experimented with using MLPs instead of linear projections from G’s class embeddings to its BatchNorm gains and biases, but did not find any benefit to doing so. We also experimented with Spectrally Normalizing these MLPs, and with providing these (and the linear projections) with a bias at their output, but did not notice any benefit.
We tried gradient norm clipping (both the global variant typically used in recurrent networks, and a local version where the clipping value is determined on a per-parameter basis) but found this did not alleviate instability.
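The two clipping variants can be sketched in NumPy as below. The global version rescales all gradients by one shared factor; for the "local" version I am assuming that each parameter tensor is clipped against its own norm, which is only one possible reading of the paper's description.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

def clip_per_tensor(grads, max_norm):
    """Clip each gradient tensor against its own norm (assumed 'local' variant)."""
    clipped = []
    for g in grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, max_norm / (norm + 1e-12)))
    return clipped

# One large gradient and one small gradient.
grads = [np.full((2, 2), 3.0), np.full((3,), 0.1)]
global_clipped = clip_by_global_norm(grads, max_norm=1.0)
local_clipped = clip_per_tensor(grads, max_norm=1.0)

# Global clipping shrinks the small gradient too; local clipping leaves it alone.
print(global_clipped[1], local_clipped[1])
```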
YOLOv3: An Incremental Improvement
Joseph Redmon, Ali Farhadi
2018 — https://arxiv.org/abs/1804.02767
YOLO (You Only Look Once) is a series of object detection algorithms which draw multiple bounding boxes on top of an image. As a technical report, the YOLOv3 paper does not have the same formality as a conference or peer-reviewed journal paper.
In February 2020, the first author Joseph Redmon discussed ending computer vision research over ethical concerns. The YOLO name has been passed on to other object detection papers and libraries since.
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that’s pretty swell. It’s a little bigger than last time but more accurate. It’s still fast though, don’t worry.
At 320 × 320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 AP50 in 51 ms on a Titan X, compared to 57.5 AP50 in 198 ms by RetinaNet, similar performance but 3.8× faster.
As always, all the code is online at https://pjreddie.com/yolo/.
The negative results are in the body of the paper, the last section before the conclusions.
4. Things We Tried That Didn’t Work
We tried lots of stuff while we were working on YOLOv3. A lot of it didn’t work. Here’s the stuff we can remember.
Not every paper has to be written formally.
Anchor box x, y offset predictions. We tried using the normal anchor box prediction mechanism where you predict the x, y offset as a multiple of the box width or height using a linear activation. We found this formulation decreased model stability and didn’t work very well.
Linear x, y predictions instead of logistic. We tried using a linear activation to directly predict the x, y offset instead of the logistic activation. This led to a couple point drop in mAP.
mAP = mean Average Precision
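For the two x, y items above, it helps to compare the parameterizations side by side. YOLOv3 predicts the box center as a logistic offset inside a grid cell, while the anchor-box mechanism predicts an unbounded offset as a multiple of the anchor size. The sketch below is pure Python with hypothetical function and argument names.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def yolo_center(t_x, t_y, cell_x, cell_y):
    """Logistic offsets: the predicted center always stays inside its grid cell."""
    return cell_x + sigmoid(t_x), cell_y + sigmoid(t_y)

def anchor_offset_center(t_x, t_y, anchor_x, anchor_y, anchor_w, anchor_h):
    """Linear offsets as multiples of the anchor width and height (unbounded)."""
    return anchor_x + t_x * anchor_w, anchor_y + t_y * anchor_h

# The same extreme raw outputs under both parameterizations.
bounded = yolo_center(10.0, -10.0, cell_x=3, cell_y=4)
unbounded = anchor_offset_center(10.0, -10.0, 3.5, 4.5, anchor_w=2.0, anchor_h=2.0)
print(bounded, unbounded)
```

With the logistic form, even extreme raw outputs cannot push the center out of its cell, which is one plausible reason the linear variants were less stable.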
Focal loss. We tried using focal loss. It dropped our mAP about 2 points. YOLOv3 may already be robust to the problem focal loss is trying to solve because it has separate objectness predictions and conditional class predictions. Thus for most examples there is no loss from the class predictions? Or something? We aren’t totally sure.
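For readers who have not met focal loss: it is cross-entropy with an extra factor that shrinks the loss on examples the model already classifies confidently, so the many easy background examples do not dominate training. Here is a minimal pure-Python sketch for a single binary prediction. The default `gamma` and `alpha` values are common choices that I am assuming, not settings taken from the YOLOv3 report.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) and a hard positive (p = 0.1).
easy_ratio = focal_loss(0.9, 1) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)
print(easy_ratio, hard_ratio)
```

The easy example keeps a much smaller fraction of its cross-entropy loss than the hard one does.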
Dual IOU thresholds and truth assignment. Faster RCNN uses two IOU thresholds during training. If a prediction overlaps the ground truth by .7 it is as a positive example, by [.3−.7] it is ignored, less than .3 for all ground truth objects it is a negative example. We tried a similar strategy but couldn’t get good results.
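The dual-threshold rule described above can be sketched in pure Python. Boxes are `(x1, y1, x2, y2)` tuples, and the function names and label strings are my own. Faster R-CNN's full assignment has extra details that this sketch leaves out.

```python
def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def assign(pred, truths, hi=0.7, lo=0.3):
    """Label a prediction by its best overlap with any ground truth box."""
    best = max((iou(pred, t) for t in truths), default=0.0)
    if best >= hi:
        return "positive"
    if best < lo:
        return "negative"
    return "ignored"

pred = (0, 0, 10, 10)
print(assign(pred, [(0, 0, 10, 10)]))      # perfect overlap
print(assign(pred, [(0, 0, 10, 5)]))       # IOU of 0.5, between the thresholds
print(assign(pred, [(20, 20, 30, 30)]))    # no overlap
```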
We quite like our current formulation, it seems to be at a local optima at least. It is possible that some of these techniques could eventually produce good results, perhaps they just need some tuning to stabilize the training.
This is good advice to keep the door open for new approaches.
This article was posted in June 2021. For my latest recommendations, check this GitHub Readme.