ML Arxiv Haul #2

Feb 5, 2022

About two months ago, I burned through a backlog of ML articles that I had queued up to read ‘soon’. That list has gotten long again, so I’m going to try to write out short summaries as I skim through them.

SCROLLS: Standardized CompaRison Over Long Language Sequences

NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a…

arxiv.org

Combines several existing language benchmarks for long texts (the longest part being NarrativeQA, which has questions about whole books). Several of the datasets come from Project Gutenberg (public domain texts), so unless the books are obscure, I’d worry that language models trained on the public internet might know too much already. For example, if I asked you about Les Misérables, “what was Valjean sentenced to jail for stealing?”, you may have seen a summary and wouldn’t need to extract the answer from the original text.

Release Strategies and the Social Impacts of Language Models

Large language models have a range of beneficial uses: they can assist in prose, poetry, and programming; analyze…

arxiv.org

This is a late 2019 dive into OpenAI’s decision to release GPT-2 in slowly increasing model sizes. Unfortunately it predates the completely privatized and commercialized GPT-3 and DALL-E, and EleutherAI’s reconstructed versions of those models, which add complexity to this discussion.

When we talked about this in a 2020 class on international law and AI, the class learned how GPT-2 was withheld, but not that it had since been fully released and that we weren’t all trapped in a GPT-matrix-web. I would like to use some sociological research to place OpenAI’s truth somewhere in the space between ‘cautious AI safety research’, ‘doesn’t need to market itself with open source anymore’, ‘money’, and ‘only oligarchs and mega-corps could deploy GPT-3 anyway’, but this probably isn’t the most pressing AI & society issue.

A ConvNet for the 2020s

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly…

arxiv.org

Facebook/Meta’s paper about computer vision architectures. There’s a general sense that convolutional networks (such as ResNet) are being driven extinct by the rise of attention-based vision transformers. The thing is, all of these architectures could be better with some love and care. The researchers modernize a ResNet step by step into their ConvNeXt model, which competes with Swin Transformers on ImageNet. I was a bit concerned that this might be a purpose-built ImageNet model not applicable to other CV tasks, but they include a mention and an appendix on the robustness of the model.

Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning

Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a…

arxiv.org

I’ve previously developed BERT and GPT-style language models, fine-tuned them, and submitted one of the first ‘adapters’ to AdapterHub. The idea there was that the last stage of the neural net could be quickly specialized or swapped out for a specific task, like a drill bit. This Google paper takes a different approach, drawing on features from the intermediate layers as well. The authors find the results interesting, but they don’t claim SOTA results at this time.

I want to include my June 2020 Tweet to look smart about this intermediate layer topic.

OOD

VOS: Learning What You Don't Know by Virtual Outlier Synthesis

Out-of-distribution (OOD) detection has received much attention lately due to its importance in the safe deployment of…

arxiv.org

This was posted just recently by researchers at the University of Wisconsin. They train a model to be really good at object-recognition boxes and create a new benchmark around out-of-distribution objects which the model is unfamiliar with (best explained by the error below).

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test)…

arxiv.org

Predicts accuracy on a test set with a new metric (Average Thresholded Confidence), using only how the model reacts to batches of the new input, without knowing the labels. Their examples of OOD data are WILDS and some interesting ImageNet spinoffs where the test set is made of illustrations or other new formats.
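To make the Average Thresholded Confidence idea concrete, here is a minimal sketch (not the authors' code). It assumes the confidence score is the model's max softmax probability, and the function names `fit_atc_threshold` and `predict_target_accuracy` are hypothetical. The threshold is chosen on labeled source data so that the share of examples above it matches the source accuracy; the predicted target accuracy is then the share of unlabeled target examples above that same threshold.

```python
def fit_atc_threshold(source_confidences, source_correct):
    """Pick a threshold so that the share of labeled source examples with
    confidence above it matches the model's accuracy on those examples."""
    accuracy = sum(source_correct) / len(source_correct)
    ranked = sorted(source_confidences, reverse=True)
    n_above = round(accuracy * len(ranked))
    if n_above == 0:
        return ranked[0]       # nothing is strictly above the max confidence
    if n_above == len(ranked):
        return float("-inf")   # every example counts as above the threshold
    # Midpoint between the last confidence kept and the first one dropped.
    return (ranked[n_above - 1] + ranked[n_above]) / 2


def predict_target_accuracy(target_confidences, threshold):
    """Estimate accuracy on unlabeled data as the share of examples
    whose confidence exceeds the source-fitted threshold."""
    above = sum(1 for c in target_confidences if c > threshold)
    return above / len(target_confidences)


# Toy numbers for illustration only (not results from the paper).
threshold = fit_atc_threshold([0.9, 0.8, 0.6, 0.4], [1, 1, 1, 0])
estimate = predict_target_accuracy([0.95, 0.7, 0.45, 0.3], threshold)
```

Tied confidence values can make the match to source accuracy inexact; the sketch ignores that case.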

Model prompts

Memory-assisted prompt editing to improve GPT-3 after deployment

Large LMs such as GPT-3, while powerful, are not immune to mistakes, but are prohibitively costly to retrain. One…

arxiv.org

Uses feedback from the users about incorrect answers / misunderstood questions to re-prompt GPT-3 and provide better answers in the future. Researchers are from Carnegie Mellon and AllenAI. I think that OpenAI is more likely to take the InstructGPT route (using reinforcement learning from human feedback), but this is still interesting.
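A rough sketch of the general pattern, under stated assumptions: keep a memory of past questions and the feedback users gave, and when a similar question arrives, attach that feedback to the new prompt before sending it to the model. Everything here is hypothetical, including the `FeedbackMemory` class, the word-overlap similarity, and the prompt format. The paper's own retrieval and prompt layout may differ, and the call to the language model is left out.

```python
def similarity(a, b):
    """Jaccard overlap between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


class FeedbackMemory:
    """Stores (question, feedback) pairs collected from users."""

    def __init__(self, min_similarity=0.5):
        self.entries = []
        self.min_similarity = min_similarity

    def add(self, question, feedback):
        self.entries.append((question, feedback))

    def lookup(self, question):
        """Return feedback from the most similar past question, if close enough."""
        best = max(self.entries,
                   key=lambda entry: similarity(question, entry[0]),
                   default=None)
        if best is not None and similarity(question, best[0]) >= self.min_similarity:
            return best[1]
        return None


def build_prompt(question, memory):
    """Attach remembered feedback to the question before it goes to the model."""
    feedback = memory.lookup(question)
    if feedback is None:
        return question
    return f"{question} | clarification: {feedback}"


memory = FeedbackMemory()
memory.add("what word sounds like good?",
           "'sounds like' asks for a homophone, not a synonym")
prompt = build_prompt("what word sounds like bad?", memory)
```

The appeal of this design is that the model's weights never change; only the memory grows as users correct it.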

Other NLP

Grammatical cues are largely, but not completely, redundant with word meanings in natural language

The combinatorial power of language has historically been argued to be enabled by syntax: rules that allow words to…

arxiv.org

Basic but interesting analysis: if you scramble word order, models can figure out the right word order 87% of the time across language families. So the grammar is only particularly helpful in the remaining cases and unexpected cases (“man bites dog”).

Typical Decoding for Natural Language Generation

Despite achieving incredibly low perplexities on myriad natural language corpora, today's language models still often…

arxiv.org

Discusses an ongoing problem in NLG where the most likely text can be boring and repetitive. The researchers analyze human-generated text and discuss an expected information content at each new token. They cover two common methods to select tokens from a model’s output probabilities (top-k, nucleus) and come up with their own (‘typical sampling’) to fit this new system. The new method gets high scores on perplexity and a select number of tasks such as summarization. This was relevant to my GPT-NYC interests and I’m happy to see they have a PR open with HuggingFace Transformers.
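The difference between nucleus and typical sampling can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation or the HuggingFace one, and the function names are hypothetical. Nucleus sampling keeps the most probable tokens until their total probability reaches a chosen mass. Typical sampling instead ranks tokens by how close their information content (-log p) is to the entropy of the distribution, then keeps tokens in that order until the mass is reached.

```python
import math
import random


def _keep_until_mass(ranked, probs, mass):
    """Keep tokens in ranked order until their total probability
    reaches `mass`, then renormalize the kept probabilities."""
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= mass:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def nucleus_filter(probs, mass=0.9):
    """Nucleus (top-p): rank tokens from most to least probable."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    return _keep_until_mass(ranked, probs, mass)


def typical_filter(probs, mass=0.9):
    """Typical sampling: rank tokens by |(-log p) - entropy|, smallest first."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    candidates = [i for i, p in enumerate(probs) if p > 0]
    ranked = sorted(candidates,
                    key=lambda i: abs(-math.log(probs[i]) - entropy))
    return _keep_until_mass(ranked, probs, mass)


def sample(filtered_probs, rng=random):
    """Draw one token index from the filtered distribution."""
    return rng.choices(range(len(filtered_probs)), weights=filtered_probs, k=1)[0]


probs = [0.5, 0.25, 0.125, 0.125]   # toy next-token distribution
print(nucleus_filter(probs, mass=0.2))   # [1.0, 0.0, 0.0, 0.0]
print(typical_filter(probs, mass=0.2))   # [0.0, 1.0, 0.0, 0.0]
```

In the toy distribution, the second token's information content is closest to the entropy, so with a small mass typical sampling keeps it and drops the single most likely token, which is the behavior that is meant to avoid boring, repetitive text.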

Written by Nick Doiron
