ML Arxiv Haul #4

11 min readApr 16, 2022

Oh crud, this is too many papers; I should probably rethink this strategy of pasting an Arxiv link here instead of reading or saving it in a tab.
I decided to make a daily routine to finally get ahead on filtering and summarizing.

A Conversational Paradigm for Program Synthesis

Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a…

arxiv.org

This is a little interesting — a Salesforce team’s take on code-generation via a chat interface. I thought it was a little bizarre that the text from humans was more comments requesting the next sequence of code. It’s not something where you give feedback to fix or improve on the previous code.

Deep learning for language understanding of mental health concepts derived from Cognitive…

In recent years, we have seen deep learning and distributed representations of words and sentences make impact on a…

arxiv.org

A scientific writing of, deep learning NLP models are better than other methods on a dataset to classify cognitive-behavioral statements.

DeiT III: Revenge of the ViT

A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has…

arxiv.org

This continues a back-and-forth between the new architecture (visual transformers) and revivals of ResNet (specifically ResNeXt) showing that you can get similar accuracy scores with some improvements, data augmentation, etc.

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large…

arxiv.org

Prompt engineering → prompt search. The prompts generated by this method outperform human-edited prompts (but by who? apparently crowd workers?). In a baffling reveal, some ‘improved’ prompts remove the labels and/or coherence from the prompt. Maybe it’s understood that a Tweet-based task is generally positive or negative, or maybe some information is leaking through? (unlikely in a hosted service like InstructGPT).
I would like to see these results used to craft more confusing tasks (i.e. switching positive <-> negative labels, or labeling negative tweets ‘yes’).

Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting…

Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should…

arxiv.org

The researchers find an improvement in processing extremely-closely related languages (German and Swiss German, Romanian and Moldovan) by continuing pre-training on text including some examples with spelling errors. It’s difficult to guess why this works, other than maybe playing on different neurons or building more robust patterns rather than expecting direct matches.

InCoder: A Generative Model for Code Infilling and Synthesis

Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce…

arxiv.org

This paper is part of a new wave of code generation models which ‘infill’ gaps in code or edit in place. They have a demo on HuggingFace/Gradio: https://huggingface.co/spaces/facebook/incoder-demo
Interestingly they are still using a left-to-right style generative model, but replacing it in the code with a <MASK> token and then training the model to follow sample code prompts with its prediction of what the <MASK> was. They also can handle a mask being repeated (such as a variable name) or two masks in the same code prompt.

Jigsaw: Large Language Models meet Program Synthesis

Large pre-trained language models such as GPT-3, Codex, and Google's language model are now capable of generating code…

arxiv.org

This is a Microsoft Research paper (i.e. not Google’s Jigsaw team). In this case they’re setting up an experimental framework to generate code. The code-writer model is given a data task in the Pandas library and then its code is evaluated for correctness. This is interesting more for what it doesn’t do (generating tons of code samples and improving with reinforcement learning, as DeepMind does).

KinyaBERT: a Morphology-aware Kinyarwanda Language Model

Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks…

arxiv.org

A model for Kinyarwanda language. The authors add a morphology / part-of-speech tagging step before feeding data into a BERT-like model. This helps break down the components of agglutination in Kinyarwanda.

The sentence ’John twarahamusanze biradutangaza’ (We were surprised to find John there).

Language modeling via stochastic processes

Modern language models can generate high-quality short texts. However, they often meander or are incoherent when…

arxiv.org

This paper focuses on the coherence and quality of texts from generative models using ‘time control’. I immediately thought of the ‘typical decoding’ paper. I don’t think anyone has directly compared the two?

Machine Learning State-of-the-Art with Uncertainties

With the availability of data, hardware, software ecosystem and relevant skill sets, the machine learning community is…

arxiv.org

A paper that just came out this week (April 2022). The authors use timm, ResNeXt, and Imagenette (all good choices) and multiple runs to calculate the varying accuracy and give a better grounding for discussions around State-of-the-Art results. The authors point out How Do Vision Transformers Work claims significant results on a new architecture when the confidence interval includes a possibility that there was no improvement.

Multilingual Translation via Grafting Pre-trained Language Models

Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised…

arxiv.org

A scientific approach to something which I’ve tried to do — repurposing an existing model architecture and weights to work in another language with fewer resources.
The authors improve on quality of mBART and other models purpose-built for translation, so this is a promising direction.

Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

In recent years, large neural networks trained for language understanding and generation have achieved impressive…

ai.googleblog.com

Google’s enigmatic Pathways architecture posts were followed by this more practical (but still solely research-use) Pathways Language Model.
This is one of the first papers to use the crowdsourced BIG-bench tasks. It looks like they compare PaLM to human and pre-existing model performance, but do most analysis on a select 58 tasks, not including my own tasks here.

In a new development which is sure to be a confounding trend, the dataset and ethics sections of the paper cite Stochastic Parrots without heeding its warnings. The authors (including two expelled from Google over that paper) have all commented on this paper, but I wanted to highlight two threads:

Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial…

We proposes a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild…

arxiv.org

Algorithmic basis for realistic misspellings and typos (I find it annoying to talk about this as adversarial or an attack, but it’s useful to make models more robust).

Practical tradeoffs between memory, compute, and performance in learned optimizers

Optimization plays a costly and crucial role in developing machine learning systems. In learned optimizers, the few…

arxiv.org

Interesting project from Google Research about taking the optimizer component of the classical training loop (predict->loss ->optimize->) and making that learnable / trainable, instead of a fixed function. They show improvements, but admit a focus on only two tasks. I’m thinking maybe this won’t catch on, but could maybe find a better general-purpose optimizer.

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource…

arxiv.org

The researchers evaluate the translation ability of mBART and mT5. As might be expected, fine-tuning on languages unseen during pre-training does not yield a usable model. The project includes research from the Masakhane project and they evaluate Yoruba language, but the team also gets similar results on Irish/Gaelic, Assamese, and Kannada.

SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with…

Even though several methods have proposed to defend textual neural network (NN) models against black-box adversarial…

arxiv.org

Very much my jam — this “patches” the final layer of a text neural net. The reason for this is something which I didn’t expect: it is trying to vary the output (within probability-space) to prevent static analysis by typical model-attacking techniques.
I found the code for this project, which was just posted last week, and connected it on PapersWithCode.
There is a 2018 paper called SHIELD about another stochastic defensive technique against adversarial attacks, for images, from Georgia Tech.

Software-Supported Audits of Decision-Making Systems: Testing Google and Facebook's Political…

How can society understand and hold accountable complex human and algorithmic decision-making systems whose systematic…

arxiv.org

A test of Facebook advertising policy during the 2018 midterm elections, which claims to evaluate automated approvals but is unclear.

…our audit demonstrated that Facebook was systematically prohibiting nonelection material of importance to American civic life, including public holidays and government websites.

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective

As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes…

arxiv.org

Unconventional format and content for an ML paper — the authors interviewed 25 practitioners about how they used explainable ML tools and how they handled conflicting answers.

Three things everyone should know about Vision Transformers

After their initial success in natural language processing, transformer architectures have rapidly gained traction in…

arxiv.org

This seems to be generally good advice for vision transformers.

Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We…

arxiv.org

This modifies pre-training code to drop tokens with the smallest loss. If these tokens are truly superfluous or predictable, then we can save time/compute. The researchers show pre-training can be 25% faster.
What does it mean for tokens to be ‘dropped’ though, especially when they are restored in the final layer? I’ve included a figure from the experiment below… it looks as though ‘the’, prepositions, being verbs, and punctuation get dropped from the source text. I’m still confused about whether you train BERT for a bit and then start removing tokens from future steps/epochs, or the loss is measured using an existing full-text BERT model before pre-training a new abbreviated BERT model.

Towards Building an Open-Domain Dialogue System Incorporated with Internet Memes

In recent years, Internet memes have been widely used in online chatting. Compared with text-based communication…

arxiv.org

A little bit fun, this addresses a challenge from the DSTC10 AI Workshop to participate in WeChat (Chinese) conversations with the right memes and emotional tracking. The memes are stickers identified by their alt-text, which is a little disappointing as a true meme-bot would have some visual understanding, but it’s a fun competition.
More info and dataset links: https://github.com/lizekang/dstc10-mod

Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training a transformer language model under a given…

arxiv.org

Chinchilla is a language model from DeepMind which outperforms much larger models (including GPT-3). In a strange top-down approach, the paper graphs the ideal language model based on model and corpus statistics, hypothesizes that existing models are inefficient, and creates Chinchilla to fit their ideal. I was looking for some clever test training runs and scaling laws and things, but it seems like the main takeaway is for every doubling of model size the number of training tokens should also be doubled.
This paper mentions one of my BIG-Bench tasks (disambiguation_q) as an example task where Chinchilla outperforms Gopher.

True Few-Shot Learning with Language Models

Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses…

arxiv.org

I believe that this paper is critiquing prompt engineering (as most results are being reported without the process of choosing the right prompt or hyperparameter run).

Understanding Iterative Revision from Human-Written Text

Writing is, by nature, a strategic, adaptive, and more importantly, an iterative process. A crucial part of writing is…

arxiv.org

The ITERATER dataset includes revisions to text and annotations about why that change was made (fluency, clarity). They attain small improvements over human editors with BART, FELIX, and PEGASUS models.
This is the first time that I’ve seen FELIX (an alternative to seq2seq) appearing in new research, so I may consider it for the gender-reinflection model stuff.

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Recently, fine-tuning language models pre-trained on large text corpora have provided huge improvements on…

arxiv.org

This paper suggests ‘adapter’ layers before and after a frozen language model can learn to process text plus images from CLIP, without the difficulties of processing the whole model. They get good results.

When classifying grammatical role, BERT doesn't care about word order... except when it matters

Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural…

arxiv.org

In 2020–2021 there were a flurry of papers showing that BERT models would ignore word order, seemingly a step back from classical diagram-sentences and part-of-speech NLP. This 2022 paper looks inside the model to show some differences, and also makes the point that many of these word-order experiments are super-weird or impossible so BERT is ignoring them (‘except when it matters’).

"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience…

Feature attribution a.k.a. input salience methods which assign an importance score to a feature are abundant but may…

arxiv.org

A component of LIME and other black-box explainable AI methods is assigning importance to input images and text through salience. Here Google researchers take four common algorithms and find examples where the algorithms disagree on the importance of words.
The authors blame ‘shortcuts’ taken by language models, which can come in a variety of forms, but essentially mean techniques which fail on new data (not robust). They recommend avoiding shortcuts by using different explainability methods depending on whether you are using BERT or LSTM (do people still use LSTM?).

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

We present a novel task and dataset for evaluating the ability of vision and language models to conduct…

arxiv.org

As text+image (multi-modal) models become more popular, Winoground is an intriguing image-captioning benchmark which just came out this month (April 2022). Two images have a caption with the same bag of words, but different word order (e.g. a person sits and a dog stands, vs. a person stands and a dog sits). Frustratingly, no model does much better than chance.

ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training…

Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art…

arxiv.org

Adversarial machine learning finds inputs that trick conventional models, sometimes to discredit them or to improve robustness during training. This 2017 paper describes a black-box attack (where we don’t have transparency to the full model) which can create ‘attack’ inputs without developing its own shadow model.
This paper was the first time I read about feature squeezing and another 2017 paper Adversarial Examples Are Not Easily Detected. I wonder if I’m out of the loop on adversarial examples or this line of detecting/defending against adversarial examples has fallen out of favor.

ML Arxiv Haul #4

A Conversational Paradigm for Program Synthesis

Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a…

Deep learning for language understanding of mental health concepts derived from Cognitive…

In recent years, we have seen deep learning and distributed representations of words and sentences make impact on a…

DeiT III: Revenge of the ViT

A Vision Transformer (ViT) is a simple neural architecture amenable to serve several computer vision tasks. It has…

GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large…

Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting…

Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should…

InCoder: A Generative Model for Code Infilling and Synthesis

Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce…

Jigsaw: Large Language Models meet Program Synthesis

Large pre-trained language models such as GPT-3, Codex, and Google's language model are now capable of generating code…

KinyaBERT: a Morphology-aware Kinyarwanda Language Model

Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks…

Language modeling via stochastic processes

Modern language models can generate high-quality short texts. However, they often meander or are incoherent when…

Machine Learning State-of-the-Art with Uncertainties

With the availability of data, hardware, software ecosystem and relevant skill sets, the machine learning community is…

Multilingual Translation via Grafting Pre-trained Language Models

Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised…

Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance

In recent years, large neural networks trained for language understanding and generation have achieved impressive…

Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial…

We proposes a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild…

Practical tradeoffs between memory, compute, and performance in learned optimizers

Optimization plays a costly and crucial role in developing machine learning systems. In learned optimizers, the few…

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource…

SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with…

Even though several methods have proposed to defend textual neural network (NN) models against black-box adversarial…

Software-Supported Audits of Decision-Making Systems: Testing Google and Facebook's Political…

How can society understand and hold accountable complex human and algorithmic decision-making systems whose systematic…

The Disagreement Problem in Explainable Machine Learning: A Practitioner's Perspective

As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes…

Three things everyone should know about Vision Transformers

After their initial success in natural language processing, transformer architectures have rapidly gained traction in…

Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We…

Towards Building an Open-Domain Dialogue System Incorporated with Internet Memes

In recent years, Internet memes have been widely used in online chatting. Compared with text-based communication…

Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training a transformer language model under a given…

True Few-Shot Learning with Language Models

Pretrained language models (LMs) perform well on many tasks even when learning from a few examples, but prior work uses…

Understanding Iterative Revision from Human-Written Text

Writing is, by nature, a strategic, adaptive, and more importantly, an iterative process. A crucial part of writing is…

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Recently, fine-tuning language models pre-trained on large text corpora have provided huge improvements on…

When classifying grammatical role, BERT doesn't care about word order... except when it matters

Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural…

"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience…

Feature attribution a.k.a. input salience methods which assign an importance score to a feature are abundant but may…

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

We present a novel task and dataset for evaluating the ability of vision and language models to conduct…

ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks without Training…

Deep neural networks (DNNs) are one of the most prominent technologies of our time, as they achieve state-of-the-art…

Written by Nick Doiron