ML Arxiv Haul #6

12 min read · Jun 16, 2022

I had a metric ton of Arxiv links built up in the draft for this post, which I’ve tried to trim down to the papers which I hope can stick around as mental bookmarks for me.

Accountability in an Algorithmic Society: Relationality, Responsibility, and Robustness in Machine…

In 1996, Accountability in a Computerized Society [95] issued a clarion call concerning the erosion of accountability…

arxiv.org

A well-researched, modern take on the risks of assigning agency to a computer and eliminating accountability. There’s a legendary (apocryphal?) quote from the 1970s: “a computer can never be held accountable, therefore a computer must never make a management decision”.

Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes

To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it…

arxiv.org

In an effort to discuss African-American English (AAE) being flagged as hate speech by models, the paper develops a process which raises its own questions.
The authors bring in multiple datasets of hate speech, AAE, and Standard American English (including some AAE Tweets ‘translated’ in a 2020 paper?).

A classifier from 2016 (pre-BERT) is used to label Tweets from previously un-labeled sources as AAE or SAE.
One hate speech dataset (Davidson 2017) is labeled as 70% AAE, even though it is a general hate dataset.
Another hate speech dataset (HateXplain 2021) is labeled as ~10% AAE. It includes content from Twitter and Gab (a right-wing network).
I’m concerned about de-duplication of Tweets, unless BERT is being fine-tuned and evaluated on each dataset individually?
BERT is a small model by 2022 standards.

Anyway, jump ahead to the conclusions:

Maybe it would be better to use this space to discuss the goals of a hate speech classifier. If you’re making it for Club Penguin or Sports Network Featured Tweets, the final product will not allow any user, in any dialect, to include the n-word. Meanwhile, because the hate-labeling and AAE-labeling of each dataset are on such shaky ground, the paper never seems to address whether true-AAE Tweets which would pass a profanity filter are being conflated with hate speech.

BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation

Abstract BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns…

aclanthology.org

This was a cool paper from the Workshop on Insights from Negative Results in NLP. The authors tried creating some multi-word / entity tokens (e.g. Statue of Liberty) instead of the typical word and sub-word tokens. Surprisingly, it lowered the accuracy of their translation model.
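To make "BPE beyond word boundary" concrete, here is a minimal sketch of a toy BPE learner in Python. It is not the authors' code: the `allow_cross_word` flag, the `_` stand-in for spaces, and the tiny corpus are all assumptions made for illustration. With the flag off, each word is merged on its own; with it on, merges are free to swallow the space and build multi-word tokens.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Count adjacent token pairs across all sequences; return the most common."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace each occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe(text, num_merges, allow_cross_word=False):
    """Toy BPE learner (illustrative only).

    allow_cross_word=False: every word is its own sequence, so no merge
    can span a space. allow_cross_word=True: the whole text is one
    sequence, with '_' standing in for the space character.
    """
    if allow_cross_word:
        sequences = [list(text.replace(" ", "_"))]
    else:
        sequences = [list(word) for word in text.split()]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:  # nothing left to merge
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


if __name__ == "__main__":
    corpus = "new york new york new york"
    print(learn_bpe(corpus, 6)[1])
    print(learn_bpe(corpus, 6, allow_cross_word=True)[1])
```

On a real corpus the cross-word variant would spend part of its merge budget on frequent phrases, which is the kind of multi-word token the paper reports as hurting translation quality.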

CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior

The increasing size and complexity of modern ML systems has improved their predictive capabilities but made their…

arxiv.org

Humans wrote counterfactuals to change restaurant reviews into positive or negative along different factors. This makes it easier to parse real reviews and what could be improved for the diner.

ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction

Neural information retrieval (IR) has greatly advanced search and other knowledge-intensive language tasks. While many…

arxiv.org

I’m exploring neural search for a couple of reasons. The overall goal is that instead of searching for a literal text string or a fuzzy match of words, there should be a way to create document embeddings and find the most meaningfully similar document in the search tool.
I had assumed that ColBERT was a drop-in solution to get this working on any BERT model, so I could use it on a BERT for another language for a demo. I learned two things from this:

The end goal of ColBERT models is document retrieval, i.e. instead of showing the answer to a search input, or the sentence with the answer to your search input, it’s going to return a full relevant document. This means that you might want to separate the original document into several passages for indexing.
ColBERT is its own thing where you would want to start pre-training from scratch, or use the existing weights for English.
The examples help you get set up with existing weights and an existing index, but I wasn’t able to figure out how to start from scratch in another language.
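The passage-splitting step mentioned above can be sketched in a few lines of Python. This is not the ColBERT API; the function names, the 100-word window, and the 20-word overlap are assumptions chosen for illustration. Each passage keeps a pointer back to its source document, so a passage-level hit can still return the full document.

```python
def split_into_passages(document, max_words=100, overlap=20):
    """Split a document into overlapping word windows for indexing.

    max_words and overlap are illustrative defaults, not ColBERT settings.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = document.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the document
    return passages


def build_passage_index(documents, **kwargs):
    """Flatten {doc_id: text} into passage records that point back to doc_id."""
    index = []
    for doc_id, text in documents.items():
        for n, passage in enumerate(split_into_passages(text, **kwargs)):
            index.append({"doc_id": doc_id, "passage_id": n, "text": passage})
    return index


if __name__ == "__main__":
    docs = {"doc-a": " ".join(f"w{i}" for i in range(250))}
    for record in build_passage_index(docs):
        print(record["doc_id"], record["passage_id"], len(record["text"].split()))
```

The overlap is there so that a sentence straddling a window boundary still appears whole in at least one passage.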

Consistent Human Evaluation of Machine Translation across Language Pairs

Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge given…

arxiv.org

Facebook/Meta’s recent take on machine translation metrics. This paper points out “BLEU scores are notoriously meaningless” (a recent Google paper on their translation system used BLEU and ChrF as metrics and explored the ratio between them on multiple languages).

Their proposal is XSTS, a 1–5 point scale with design and instructions for the human based on a same-language paraphrase metric (STS).
Results concluded XSTS had the highest level of agreement between humans on English->Other, but fell 2nd to an English reference sentence (MSTS) on Other->English. They also show a way to ‘calibrate’ scores into correlation with BLEU, which seems odd considering that they trashed BLEU earlier.

Controlling Translation Formality Using Pre-trained Multilingual Language Models

This paper describes the University of Maryland's submission to the Special Task on Formality Control for Spoken…

arxiv.org

I’ve been looking for examples of this. When I was asking European projects about NLP to support refugees, one of the major issues was adapting official text and documents for people who are learning the language, which is basically a universal need if you think about it.
The models here were mT5 and mBART.
Though there is a focus on spoken language, the actual datasets are text.

Corpus Development of Kiswahili Speech Recognition Test and Evaluation sets, Preemptively…

@inproceedings{siminyu-etal-2022-corpus, title = "Corpus Development of Kiswahili Speech Recognition Test and…

aclanthology.org

This is another paper which I heard about through Rachael Tatman’s YouTube channel.
Researchers aimed to collect speech from Standard Kiswahili (according to the paper’s careful linguistic and historical record, this is understood to mean Kiswahili as spoken in Zanzibar) and 9 popular dialects. They have several sentences recorded in each dialect, but aim to get to 5,000 unique sentences in the future.

Note that a significant part of the CommonVoice team now works at Coqui.ai.

De-biasing "bias" measurement

When a model's performance differs across socially or culturally relevant groups--like race, gender, or the…

arxiv.org

Twitter researchers find issues with group-level fairness measures, called ‘meta-metrics’ here, which are commonly used in the AI ethics world. Here we’re looking at False Positive Rate, True Positive Rate, and Selection Rate. The authors propose a double-corrected variance estimator, which is shown to disagree with the standard variance estimator.
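For reference, the three base rates named above are easy to compute per group. The sketch below is an illustration in plain Python with hypothetical function names; the max-minus-min gap is just one simple example of a group-level summary, and the paper's double-corrected variance estimator is not implemented here.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group false positive rate, true positive rate, and selection rate.

    Labels and predictions are expected to be 0 or 1.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        # 't'/'f' = prediction was right/wrong, 'p'/'n' = predicted class
        key = ("t" if truth == pred else "f") + ("p" if pred == 1 else "n")
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]  # actual positives in this group
        negatives = c["fp"] + c["tn"]  # actual negatives in this group
        total = positives + negatives
        rates[group] = {
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "tpr": c["tp"] / positives if positives else float("nan"),
            "selection_rate": (c["tp"] + c["fp"]) / total,
        }
    return rates


def max_gap(rates, metric):
    """One simple group-level summary: the largest gap between any two groups."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    rates = group_rates(y_true, y_pred, groups)
    print(rates)
    print(max_gap(rates, "tpr"))
```

With small groups these gaps are noisy estimates, which is the reason the variance of such a summary matters in the first place.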

I’ve had some half-baked ideas about Fairness Universes where different people create simulated populations, metrics, and statistics and hash them out. I’ve had the idea long enough that I thought maybe each universe would be an NFT. As always it’s great to see a scientific approach and actual results on a topic where I was curious.

Discovering the Hidden Vocabulary of DALLE-2

We discover that DALLE-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. For…

arxiv.org

This pre-print was hotly debated after its pre-pre-print went viral on Twitter in late May.
I was interested in DALLE-2’s text outputs from the start in April. It has a weird defect in text which other models (e.g. Google Imagen) do not, which led to some discussion about OpenAI manipulating the model.

The paper claims that the model has its own secret vocabulary words (the signature one being Apoploe vesrreaitais meaning bird) and some appear in the DALLE-2 text. Some can even be combined to mean bird-eating-bug. This raises questions about how DALLE works, and whether humans can filter out pornographic and offensive terms if DALLE has its own names for them.
To summarize the backlash:

text cues and image-text examples are ‘cherry-picked’
DALLE-2 text includes imagined characters — consider the image included in my Tweet above — how can the lower lines be transcribed?
vocabulary words (other than Apoploe) are difficult to reproduce
Apoploe works because the model is confused and guesses that it is a bird species name (Latin)
The given words may trigger neurons inside of DALLE, but they are random noise. This appeals to linguists who want to talk about what is a language even and how it is not bird+bug. It is still bad news for text filters on a model.

EvoNLP - Shared Task

We present TempoWiC, a novel temporal meaning shift task. Given a pair of sentences (or, in this case, ) and a target…

sites.google.com

This is one of my recurring favorite topics — finding how to have language models respond to our changing world. This is our first look at how the EvoNLP workshop is setting up a shared task. They will provide a handful of sentences/Tweets before and after a word adds a new word-sense (e.g. ‘folklore’ coming to mean the Taylor Swift album after its release).

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting…

Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive…

arxiv.org

A dataset of hate speech is generated with likely substitutions of emojis to avoid hate filters — replacing threatening verbs, names of groups, individual letters — and other common uses of emoji on top of speech examples.

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods…

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic…

arxiv.org

Part of the ongoing research into human evaluation, fine-tuning, and/or reinforcement learning to improve outputs of large language models. Here the humans are asked to choose an output based on Preference, Humanness, and Interestingness.
I’m reminded of this now that LaMDA is in the news, and its paper measured quality by Sensibleness, Specificity, and Interestingness (in addition to quality, there were measures for Safety and Groundedness… there are multiple tiers of evaluation).

Hyperparameter Power Impact in Transformer Language Model Training

Lucas Høyberg Puvis de Chavannes, Mads Guldborg Kjeldgaard Kongsbak, Timmie Rantzau, Leon Derczynski. Proceedings of…

aclanthology.org

Adds long-term energy conservation to the criteria used for selecting hyperparameters. HP search is one of the most energy-consuming parts of modern ML training. Here they use a RoBERTa model and a news dataset (i.e. not an advanced problem) to train multiple models and plot them. They recommend the GELU activation function for its good energy efficiency and low perplexity. They report a total energy use of just under 700 kWh for this research.

"I'm sorry to hear that": finding bias in language models with a holistic descriptor dataset

As language models grow in popularity, their biases across all possible markers of demographic identity should be…

arxiv.org

A Facebook/Meta project to launch a HolisticBias dataset. It has sentences populated with 600 descriptors related to minority groups on “13 different demographic axes”.
In this case the study of bias is not limited to toxicity but includes biases in conversation bot responses (such as: “I’m sorry to hear that” to someone being autistic or hard-of-hearing).

GitHub - UW-Madison-Lee-Lab/LanguageInterfacedFineTuning: Code for Language-Interfaced FineTuning…

Tuan Dinh , Yuchen Zeng , Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris…

github.com

Interesting project fine-tuning GPT-3 to answer non-linguistic tasks, such as describing a row from the Iris dataset.

Perturbation Augmentation for Fairer NLP

Unwanted and often harmful social biases are becoming ever more salient in NLP research, affecting both models and…

arxiv.org

Massive-scale project primarily from Facebook/Meta on balancing English text on age, race, and gender. The release includes a human-made dataset, a seq2seq “perturber” model, a RoBERTa model pre-trained on a rebalanced corpus, results of ‘fairtuning’ (finetuning on rebalanced data), and a fairness metric (based on the de-bias-ability of the model).

One reason I feel good about this is I have seq2seq models for gender reinflection in Spanish (somewhat useful) and Arabic (less successful) and results of those models on ‘fairtuned’ datasets. It’s good to know that this is on the right track. For example I was suspicious that big-tech would do this with a GPT prompt.

Facebook/Meta is one of a few companies which could provide the necessary resources to get human annotation, apply a seq2seq model across a huge corpus, and include these other axes. This and the HolisticBias dual release flew under the radar considering how useful they both are for the ethics/fairness-via-math school of thought.

Phylogeny-Inspired Adaptation of Multilingual Models to New Languages

Large pretrained multilingual models, trained on dozens of languages, have delivered promising results due to…

arxiv.org

Inserting adapter layers into the middle of a pre-trained model to train it on a related language. I’ve seen AdapterHub content for adding a layer at the end, but this is my first time seeing inserted layers.
Training a model on a related language has been an interest for me for a while (idea: Sinhala and Dhivehi) where I’ve tried adding Dhivehi tokens and continuing pre-training different models.

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

In real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond…

arxiv.org

New image augmentation tactic, looks cool, gets good results on making the output model more robust.

Large Language Models are Zero-Shot Reasoners

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and…

arxiv.org

A clever way to improve an LLM’s answers: prompt it with “Let’s think step by step”.
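A minimal sketch of how such a prompt might be wired up in Python follows. The `generate` argument is a hypothetical callable standing in for whatever model API is in use, and the two-step layout (first elicit reasoning, then ask for the final answer) and the `answer_cue` wording are assumptions about one reasonable setup, not a claim about the paper's exact templates.

```python
def zero_shot_cot(question, generate,
                  trigger="Let's think step by step.",
                  answer_cue="Therefore, the answer is"):
    """Two-step zero-shot chain-of-thought prompting (illustrative sketch).

    `generate` is any function that takes a prompt string and returns the
    model's continuation as a string; it is a placeholder, not a real API.
    """
    # Step 1: append the trigger phrase so the model writes out its reasoning.
    reasoning_prompt = f"Q: {question}\nA: {trigger}"
    reasoning = generate(reasoning_prompt).strip()

    # Step 2: feed the reasoning back and cue the model for a short answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{answer_cue}"
    answer = generate(answer_prompt).strip()
    return reasoning, answer


if __name__ == "__main__":
    def echo_model(prompt):
        # Stand-in "model" so the sketch runs without any external service.
        return " (model output would go here)"

    print(zero_shot_cot("What is 3 + 4?", echo_model))
```

Swapping `echo_model` for a real completion call is the only change needed to try this against an actual model.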

RankGen: Improving Text Generation with Large Ranking Models

Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that…

arxiv.org

Among decoding methods that pull text out of GPT’s next-token probability space, RankGen may be the first to outperform typical decoding.
The repo contains code to incorporate it into HF/Transformers beam search, but I haven’t seen any issues or discussions about merging it into the main repo.

GitHub - jesmith14/REAL-ML: The Recognizing, Exploring, and Articulating Limitations in Machine…

The Recognizing, Exploring, and Articulating Limitations in Machine Learning research tool (REAL ML) is a set of guided…

github.com

Exercises for a team to work through while planning and writing the limitations of their model.

Resolving the Human Subjects Status of Machine Learning's Crowdworkers

In recent years, machine learning (ML) has come to rely more heavily on crowdworkers, both for building bigger datasets…

arxiv.org

Thinking about whether IRBs have oversight, and ought to have more influence, over how researchers involve crowd workers in generating data. The researchers cover examples where workers are answering questions about themselves, or engaging in a conversation with another worker, to contribute data. It seems that these could be examples of data about the workers, where the IRB should apply.

StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models

Knowledge and language understanding of models evaluated through question answering (QA) has been usually studied on…

arxiv.org

In this dataset, the model is given a calendar date and a question, with the expectation of answering correctly as-of that date.

Teaching Models to Express Their Uncertainty in Words

We show that a GPT-3 model can learn to express uncertainty about its own answers in natural language -- without use of…

arxiv.org

Plainly: the paper gets GPT-3 to output its confidence in its answers to math problems, and that stated confidence correlates with both the logits’ confidence and correctness. Interestingly, the researchers discuss several possibilities for what GPT may be doing: reviewing its own logits’ confidence, estimating the difficulty of the problem, or relying on other less insightful measures.
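One way to check whether stated confidence tracks correctness is sketched below in plain Python. The function names, the choice of mean squared error, and the five-bin table are assumptions for illustration; this is not the paper's evaluation code.

```python
def calibration_mse(confidences, correct):
    """Mean squared error between stated confidence (0 to 1) and correctness (0 or 1)."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    squared = [(c - float(k)) ** 2 for c, k in zip(confidences, correct)]
    return sum(squared) / len(squared)


def reliability_table(confidences, correct, num_bins=5):
    """Bucket answers by stated confidence and report accuracy per bucket.

    Returns (bin_low, bin_high, accuracy, count) for each non-empty bucket.
    A well-calibrated model has accuracy close to the middle of each bucket.
    """
    bins = [[] for _ in range(num_bins)]
    for c, k in zip(confidences, correct):
        idx = min(int(c * num_bins), num_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append(k)
    table = []
    for i, members in enumerate(bins):
        if members:
            accuracy = sum(members) / len(members)
            table.append((i / num_bins, (i + 1) / num_bins, accuracy, len(members)))
    return table


if __name__ == "__main__":
    stated = [0.9, 0.9, 0.1, 0.1, 0.5]
    was_right = [1, 1, 0, 0, 1]
    print(calibration_mse(stated, was_right))
    print(reliability_table(stated, was_right))
```

A low score here only says the numbers line up with accuracy; it does not say which signal the model is relying on to produce them.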

The Forgotten Margins of AI Ethics

How has recent AI Ethics literature addressed topics such as fairness and justice in the context of continued social…

arxiv.org

Criticism of research at AI Ethics conferences (such as FAccT) and how seriously they focus on real harms to marginalized groups.

The Unreliability of Explanations in Few-Shot In-Context Learning

How can prompting a large language model like GPT-3 with explanations improve in-context learning? We focus…

arxiv.org

The paper explores prompting GPT-3 with explanations of predicted classes, and the difficulty of getting GPT-3 to produce its own coherent explanation. Humans could identify some consistent explanations, which were indicative of GPT-3 choosing the correct class. It’s unclear if GPT-3 is choosing classes based on these kinds of poor signals or it just isn’t prepped to generate explanations.

One of the first things that I confirmed was that they had used InstructGPT, which has gone through a reinforcement learning stage to more accurately answer new prompts.

TruthfulQA: Measuring How Models Mimic Human Falsehoods

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The…

arxiv.org

This benchmark of common misconceptions and inaccuracies resurfaced in the trades because the GPT-4Chan model scored higher than GPT-3 or GPT-J. As noted on all sides of the discussion, each model gets around random chance, and researchers have gotten to higher performance with few-shot learning.
It’d be nice to see whether fine-tuning your model on debunker sites (though not the same misconceptions used in the benchmark) would improve scores here, without adversely affecting other conversations and features of the model.

What Do Compressed Multilingual Machine Translation Models Forget?

Recently, very large pre-trained models achieve state-of-the-art results in various natural language processing (NLP)…

arxiv.org

Compression methods for language models tend to exacerbate problems of bias. Here we see compressing a multilingual model (M2M-100) has the most negative effect on lower-resource languages, as well as gender bias.

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings

Zeerak Talat, Aurélie Névéol, Stella Biderman, Miruna Clinciu, Manan Dey, Shayne Longpre, Sasha Luccioni, Maraim…

aclanthology.org

This paper from multiple collaborators and organizations discusses the current state of multilingual bias evaluation. The largest language models are English-centric, not publicly accessible, and only occasionally subjected to rigorous bias testing. They specifically mention the difference in measuring bias in a language with grammatical gender, and how WinoMT was translated directly from US English examples and professions so it might not match necessary bias metrics in other countries.
I believe that this paper was posted before mGPT (multilingual GPT-style model), so it doesn’t mention any effort to check for bias there.
The paper addresses caste very briefly. Most Hindi-English translation models use whatever parallel corpora are available, such as bilingual news which likely includes colonial-era biases, including caste-related biases. Google’s MuRIL model paper does not mention caste at all.

Written by Nick Doiron
