ML Arxiv Haul #6

Nick Doiron
12 min read · Jun 16, 2022


I had a metric ton of Arxiv links built up in the draft for this post, which I’ve tried to trim down to the papers which I hope can stick around as mental bookmarks for me.

The researched, modern take on the risks of assigning agency to a computer and eliminating accountability. There’s a legendary (apocryphal?) quote from the 1970s: “a computer can never be held accountable, therefore a computer must never make a management decision”.

In an effort to discuss African-American English (AAE) being flagged as hate speech by models, the paper develops a process which raises its own questions.
The authors bring in multiple datasets of hate speech, AAE, and Standard American English (including some AAE Tweets ‘translated’ in a 2020 paper?).

  • A classifier from 2016 (pre-BERT) is used to label Tweets from previously un-labeled sources as AAE or SAE.
  • One hate speech dataset (Davidson 2017) is labeled as 70% AAE, even though it is a general hate dataset.
    Another hate speech dataset (HateXplain 2021) is labeled as ~10% AAE. It includes content from Twitter and Gab (a right-wing network).
  • I’m concerned about de-duplication of Tweets, unless BERT is being fine-tuned and evaluated on each dataset individually?
  • BERT is a small model in the year 2022

Anyway, jump ahead to the conclusions:

Maybe it would be better to use this space to discuss the goals of a hate speech classifier. If you’re making it for Club Penguin or Sports Network Featured Tweets, the final product is not allowing any user or dialect to include the n-word. Meanwhile because the hate-labeling and AAE-labeling of each dataset is on such shaky ground, it feels like it’s never addressed whether true-AAE Tweets which would pass a profanity filter are being conflated with hate speech.

This was a cool paper from the Workshop on Insights from Negative Results in NLP. The authors tried out creating some multi-word / entity tokens (e.g. Statue of Liberty) instead of the typical word and sub-word tokens. Surprisingly, it lowered the accuracy of their translation model.
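A rough sketch of what multi-word tokens look like in practice, assuming a plain whitespace tokenizer and a hand-written entity list. The function name and the underscore joining are illustrative choices here, not the paper's method:

```python
def tokenize_with_entities(text, entities):
    """Whitespace-tokenize text, merging known multi-word entities into one token."""
    words = text.split()
    # Try longer entities first so "Statue of Liberty" wins over a shorter overlap.
    entity_words = sorted(
        (e.split() for e in entities if e.split()), key=len, reverse=True
    )
    tokens, i = [], 0
    while i < len(words):
        for ent in entity_words:
            if words[i:i + len(ent)] == ent:
                tokens.append("_".join(ent))  # one token for the whole entity
                i += len(ent)
                break
        else:
            # No entity starts at this position: keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


example = tokenize_with_entities(
    "We visited the Statue of Liberty today", ["Statue of Liberty"]
)
```

The trade-off is that each merged entity becomes a new, rare vocabulary item, which is one plausible reason a model could do worse with them.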

Humans wrote counterfactuals to change restaurant reviews into positive or negative along different factors. This makes it easier to parse real reviews and what could be improved for the diner.

I’m exploring neural search for a couple of reasons. The overall goal is that instead of searching for a literal text string or a fuzzy match of words, there should be a way to create document embeddings and find the most meaningfully similar document in the search tool.
I had assumed that ColBERT was a drop-in solution to get this working on any BERT model, so I could use this on another language BERT for a demo. I learned a few things from this:

  • The end goal of ColBERT models is document retrieval, i.e. instead of showing the answer to a search input, or the sentence with the answer to your search input, it’s going to return a full relevant document. This means that you might want to separate the original document into several passages for indexing.
  • ColBERT is its own thing where you would want to start pre-training from scratch, or use the existing weights for English.
  • The examples help you set up with existing weights and an existing index, but I wasn’t able to figure out starting from scratch in another language.
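To make the passage-splitting and scoring idea concrete, here is a minimal sketch of ColBERT-style "late interaction" scoring: for each query token, take the best-matching passage token and sum those maxima. The token vectors below are a deterministic toy stand-in (not a real BERT model), and every function name is hypothetical rather than part of the ColBERT codebase:

```python
import zlib

import numpy as np

DIM = 64  # toy embedding size, chosen arbitrarily


def toy_token_vec(token):
    """Deterministic unit vector per token; a stand-in for a real BERT embedding."""
    rng = np.random.default_rng(zlib.crc32(token.lower().encode("utf-8")))
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)


def embed(text):
    """One vector per whitespace token, shape (num_tokens, DIM)."""
    return np.stack([toy_token_vec(t) for t in text.split()])


def split_passages(doc, size=8):
    """Cut a document into fixed-size word windows for indexing."""
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def maxsim(query_emb, passage_emb):
    """Sum over query tokens of the max similarity to any passage token."""
    sims = query_emb @ passage_emb.T  # (query_tokens, passage_tokens)
    return float(sims.max(axis=1).sum())


def search(query, docs):
    """Return (score, doc_id, passage) for the best-scoring passage."""
    q = embed(query)
    scored = [
        (maxsim(q, embed(passage)), doc_id, passage)
        for doc_id, doc in enumerate(docs)
        for passage in split_passages(doc)
    ]
    return max(scored)


docs = [
    "the cat sat on the mat and looked at the bird outside the window",
    "stock markets fell sharply after the central bank raised rates again",
]
best = search("central bank rates", docs)
```

The search returns the passage, but the `doc_id` is what you would use to hand back the full document, which matches the retrieval goal described above.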

Facebook/Meta’s recent take on machine translation metrics. This paper points out “BLEU scores are notoriously meaningless” (a recent Google paper on their translation system used BLEU and ChrF as metrics and explored the ratio between them on multiple languages).

Their proposal is XSTS, a 1–5 point scale with design and instructions for the human based on a same-language paraphrase metric (STS).
Results concluded XSTS had the highest level of agreement between humans on English->Other, but fell 2nd to an English reference sentence (MSTS) on Other->English. They also show a way to ‘calibrate’ scores into correlation with BLEU, which seems odd considering that they trashed BLEU earlier.

I’ve been looking for some examples of this because when I was asking European projects about NLP to support refugees, one of the major issues is adapting official text and documents for people who are learning the language, basically a universal need if you think about it.
The models here were mT5 and mBART.
Though there is a focus on spoken language, the actual datasets are text.

This is another paper which I heard about through Rachael Tatman’s YouTube channel.
Researchers aimed to collect speech from Standard Kiswahili (according to the paper’s careful linguistic and historical record, this is understood to mean Kiswahili as spoken in Zanzibar) and 9 popular dialects. They have several sentences recorded in each dialect, but aim to get to 5,000 unique sentences in the future.

Note that a significant part of the CommonVoice team now works at

Twitter researchers find issues with group-level fairness measures, called ‘meta-metrics’ here, which are commonly used in the AI ethics world. Here we’re looking at False Positive Rate, True Positive Rate, and Selection Rate. The authors propose a double-corrected variance estimator, which is shown to disagree with the standard variance estimator.
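For reference, the three per-group rates named above can be computed like this. It is a plain sketch with made-up labels and a hypothetical function name, and it does not implement the paper's double-corrected variance estimator:

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group True Positive Rate, False Positive Rate, and Selection Rate."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction matched the label or not; "p"/"n" = predicted class.
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            "tpr": c["tp"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "selection_rate": (c["tp"] + c["fp"]) / (positives + negatives),
        }
    return rates


# Made-up example data for two groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = group_rates(y_true, y_pred, groups)
```

Comparing these rates across groups is the usual starting point; the paper's concern is how noisy those comparisons are when group sizes are small.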

I’ve had some half-baked ideas about Fairness Universes where different people create simulated populations, metrics, and statistics and hash them out. I’ve had the idea long enough that I thought maybe each universe would be an NFT. As always it’s great to see a scientific approach and actual results on a topic where I was curious.

This pre-print was hotly debated after its pre-pre-print went viral on Twitter in late May.
I was interested in DALLE-2’s text outputs from the start in April. It has a weird defect in text which other models (e.g. Google Imagen) do not, which led to some discussion about OpenAI manipulating the model.

The paper claims that the model has its own secret vocabulary words (the signature one being Apoploe vesrreaitais meaning bird) and some appear in the DALLE-2 text. Some can even be combined to mean bird-eating-bug. This raises questions about how DALLE works, and whether humans can filter out pornographic and offensive terms if DALLE has its own names for them.
To summarize the backlash:

  • text cues and image-text examples are ‘cherry-picked’
  • DALLE-2 text includes imagined characters — consider the image included in my Tweet above — how can the lower lines be transcribed?
  • vocabulary words (other than Apoploe) are difficult to reproduce
    Apoploe works because the model is confused and guesses that it is a bird species name (Latin)
  • The given words may trigger neurons inside of DALLE, but they are random noise. This appeals to linguists who want to talk about what is a language even and how it is not bird+bug. It is still bad news for text filters on a model.

This is one of my recurring favorite topics — finding how to have language models respond to our changing world. This is our first look at how the EvoNLP workshop is setting up a shared task. They will provide a handful of sentences/Tweets before and after a word adds a new word-sense (e.g. ‘folklore’ coming to mean the Taylor Swift album after its release).

A dataset of hate speech is generated with likely substitutions of emojis to avoid hate filters — replacing threatening verbs, names of groups, individual letters — and other common uses of emoji on top of speech examples.

Part of the ongoing research into human evaluation, fine-tuning, and/or reinforcement learning to improve outputs of large language models. Here the humans are asked to choose an output based on Preference, Humanness, and Interestingness.
I’m reminded of this now that LaMDA is in the news, and its paper measured quality by Sensibleness, Specificity, and Interestingness (in addition to quality, there were measures for Safety and Groundedness… there are multiple tiers of evaluation).

Adds long-term energy conservation to the criteria used for selecting hyperparameters. HP search is one of the most energy-consuming parts of modern ML training. Here they use a RoBERTa model and a news dataset (i.e. not an advanced problem) to train multiple models and plot them. They recommend the GELU activation function for its good energy efficiency and low perplexity. They report a total energy use of just under 700 kWh for this research.
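One simple way to put energy next to perplexity when picking hyperparameters is to keep only the runs that no other run beats on both axes (a Pareto front). This is an assumption about how such a selection could work, not the paper's procedure, and the run names and numbers below are invented for illustration:

```python
def pareto_front(runs):
    """Keep runs that no other run beats on both perplexity and energy (kWh)."""
    front = []
    for run in runs:
        dominated = any(
            other["perplexity"] <= run["perplexity"]
            and other["kwh"] <= run["kwh"]
            and (other["perplexity"] < run["perplexity"] or other["kwh"] < run["kwh"])
            for other in runs
        )
        if not dominated:
            front.append(run)
    return front


# Invented values, only to show the mechanics.
runs = [
    {"name": "a", "perplexity": 5.0, "kwh": 3.0},
    {"name": "b", "perplexity": 4.5, "kwh": 4.0},
    {"name": "c", "perplexity": 5.2, "kwh": 3.5},  # worse than "a" on both axes
    {"name": "d", "perplexity": 4.5, "kwh": 5.0},  # same perplexity as "b", more energy
]
front_names = [run["name"] for run in pareto_front(runs)]
```

Anything left on the front is a defensible choice; which one you pick depends on how much perplexity you are willing to trade for energy.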

A Facebook/Meta project to launch a HolisticBias dataset. It has sentences populated with 600 descriptors related to minority groups on “13 different demographic axes”.
In this case the study of bias is not limited to toxicity but includes biases in conversation bot responses (such as: “I’m sorry to hear that” to someone being autistic or hard-of-hearing).
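The "sentences populated with descriptors" idea is essentially template filling. A small sketch, where the templates, axis names, and the tiny descriptor list are placeholders rather than the dataset's actual contents:

```python
from itertools import product

# Placeholder templates and descriptors, not taken from the HolisticBias release.
TEMPLATES = ["Hi! I am {descriptor}.", "I have a friend who is {descriptor}."]
DESCRIPTORS = {
    "ability": ["autistic", "hard-of-hearing"],
    "age": ["retired"],
}


def build_prompts(templates, descriptors_by_axis):
    """Fill every template with every descriptor, keeping track of the axis."""
    prompts = []
    for axis, descriptors in descriptors_by_axis.items():
        for template, descriptor in product(templates, descriptors):
            prompts.append(
                {
                    "axis": axis,
                    "descriptor": descriptor,
                    "text": template.format(descriptor=descriptor),
                }
            )
    return prompts


prompts = build_prompts(TEMPLATES, DESCRIPTORS)
```

Each generated sentence can then be sent to a conversation bot, with the responses grouped by axis and descriptor to look for patterns like the sympathy response mentioned above.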

Interesting project fine-tuning GPT-3 to answer non-linguistic tasks, such as describing a row from the Iris dataset.
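The core trick is turning a table row into text before it reaches the language model. A minimal sketch, assuming a sentence format invented here (the wording the paper actually uses may differ):

```python
def row_to_prompt(row, target):
    """Serialize one tabular row into a natural-language question."""
    features = ", ".join(f"{name} is {value}" for name, value in row.items())
    return f"An iris plant where {features}. What is the {target}?"


prompt = row_to_prompt({"sepal length": 5.1, "sepal width": 3.5}, "species")
```

Fine-tuning pairs would then be this prompt plus the class name as the completion.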

Massive-scale project primarily from Facebook/Meta on balancing English text on age, race, and gender. The release includes a human-made dataset, a seq2seq “perturber” model, a RoBERTa model pre-trained on a rebalanced corpus, results of ‘fairtuning’ (finetuning on rebalanced data), and a fairness metric (based on the de-bias-ability of the model).

One reason I feel good about this is I have seq2seq models for gender reinflection in Spanish (somewhat useful) and Arabic (less successful) and results of those models on ‘fairtuned’ datasets. It’s good to know that this is on the right track. For example I was suspicious that big-tech would do this with a GPT prompt.

Facebook/Meta is one of a few companies which could provide the necessary resources to get human annotation, apply a seq2seq model across a huge corpus, and include these other axes. This and the HolisticBias dual release flew under the radar considering how useful they both are for the ethics/fairness-via-math school of thought.

Inserting adapter layers into the middle of a pre-trained model to train it on a related language. I’ve seen AdapterHub content for adding a layer at the end, but this is my first time seeing inserted layers.
Training a model on a related language has been an interest for me for a while (idea: Sinhala and Dhivehi) where I’ve tried adding Dhivehi tokens and continuing pre-training different models.
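For anyone who hasn't seen one, a generic bottleneck adapter is small: project down, apply a nonlinearity, project back up, and add the result to the input. The NumPy sketch below shows that generic shape only; the class name, sizes, and initialization are assumptions, and it is not the specific architecture from the paper:

```python
import numpy as np


class BottleneckAdapter:
    """Generic residual adapter: h + up(relu(down(h)))."""

    def __init__(self, hidden, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(scale=0.02, size=(hidden, bottleneck))
        # Zero-initialized up-projection: the adapter starts as an identity
        # function, so inserting it does not disturb the pre-trained model.
        self.w_up = np.zeros((bottleneck, hidden))

    def __call__(self, h):
        z = np.maximum(h @ self.w_down, 0.0)  # down-project, then ReLU
        return h + z @ self.w_up  # up-project and add the residual


adapter = BottleneckAdapter(hidden=8, bottleneck=2)
hidden_states = np.ones((2, 3, 8))  # (batch, tokens, hidden)
out = adapter(hidden_states)
```

During training only `w_down` and `w_up` would be updated, which is why adapters are cheap compared to continuing pre-training on the whole model.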

New image augmentation tactic, looks cool, gets good results on making the output model more robust.

Clever way of improving an LLM’s answers by prompting it with “Let’s think step by step”.
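In code this is about as small as a technique gets. A sketch, assuming a simple Q/A layout (the layout and function name are choices made here; only the trigger phrase comes from the paper):

```python
TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question):
    """Append the step-by-step trigger where the model's answer would begin."""
    return f"Q: {question.strip()}\nA: {TRIGGER}"


cot_prompt = zero_shot_cot_prompt("If I have 3 apples and eat one, how many are left? ")
```

The model's continuation then contains its reasoning, and the final answer has to be pulled out of that text afterwards.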

In the world of decoding methods for pulling text out of GPT’s next-token probability space, RankGen may be the first to outperform typical decoding.
The repo contains code to incorporate it into HF/Transformers beam search, but I haven’t seen any issues or discussions about merging it into the main repo.

Exercises for a team to work through while planning and writing the limitations of their model.

Thinking about whether IRBs have oversight, and ought to have more influence, over how researchers involve crowd workers in generating data. The researchers cover examples where workers are answering questions about themselves, or engaging in a conversation with another worker, to contribute data. It seems that these could be examples of data about the workers, where the IRB should apply.

In this dataset, the model is given a calendar date and a question, with the expectation of answering correctly as-of that date.

Plainly, this is about getting GPT-3 to output its confidence in its answers to math problems, and that stated confidence correlates with both the logits’ confidence and correctness. Interestingly, the researchers discuss several explanations: GPT may be reviewing its own logits’ confidence, the difficulty of the problem, or other less insightful measures.
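One common way to check whether stated confidence tracks correctness is the mean squared error between the two (a Brier score), where lower is better. The sketch below uses invented numbers and is offered as a generic calibration check, not as the paper's exact evaluation:

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0 to 1) and correctness (0 or 1)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


# Invented example: the model's stated confidence and whether each answer was right.
stated = [0.9, 0.6, 0.2, 0.8]
was_correct = [1, 1, 0, 0]
score = brier_score(stated, was_correct)
```

A model that always says 50% scores 0.25, so anything meaningfully below that suggests the stated confidence carries real signal.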

Criticism of research at AI Ethics conferences (such as FAccT) and how seriously they focus on real harms to marginalized groups.

The paper explores prompting GPT-3 with explanations of predicted classes, and the difficulty of getting GPT-3 to produce its own coherent explanation. Humans could identify some consistent explanations, which were indicative of GPT-3 choosing the correct class. It’s unclear if GPT-3 is choosing classes based on these kinds of poor signals or it just isn’t prepped to generate explanations.

One of the first things that I confirmed was that they had used InstructGPT, which has gone through a reinforcement learning stage to more accurately answer new prompts.

This benchmark of common misconceptions and inaccuracies resurfaced in the trades because the GPT-4Chan model scored higher than GPT-3 or GPT-J. As noted on all sides of the discussion, each model gets around random chance, and researchers have gotten to higher performance with few-shot learning.
It’d be nice to see whether fine-tuning your model on debunker sites (though not the same misconceptions used in the benchmark) would improve scores here, without adversely affecting other conversations and features of the model.

Compression methods for language models tend to exacerbate problems of bias. Here we see compressing a multilingual model (M2M-100) has the most negative effect on lower-resource languages, as well as gender bias.

This paper from multiple collaborators and organizations discusses the current state of multilingual bias evaluation. The largest language models are English-centric, not publicly accessible, and only occasionally subjected to rigorous bias testing. They specifically mention the difference in measuring bias in a language with grammatical gender, and how WinoMT was translated directly from US English examples and professions so it might not match necessary bias metrics in other countries.
I believe that this paper was posted before mGPT (multilingual GPT-style model), so it doesn’t mention any effort to check for bias there.
The paper addresses caste very briefly. Most Hindi-English translation models use whatever parallel corpora are available, such as bilingual news which likely includes colonial-era biases, including caste-related biases. Google’s MuRIL model paper does not mention caste at all.


