ML Arxiv Haul #13

7 min read · Dec 7, 2022

Lucky / unlucky 13? I noticed that, like my reading blog, these are appearing on a roughly monthly schedule.
I saw that Cohere posted a digest of NLP papers this month (txt.cohere.ai/top-nlp-papers-november-2022), so maybe I can look at those in the future.

A Federated Approach to Predicting Emojis in Hindi Tweets

The use of emojis affords a visual modality to, often private, textual communication. The task of predicting emojis…

arxiv.org

This came up on my radar because they use one of the MonsoonNLP models. It’s about training a classification model with federated learning, and the challenges of picking a Hindi model. They find some success using an LSTM instead of a Transformer model.

I’m a little concerned that the emojis are an imbalanced dataset. It’s difficult to predict the right emoji well when a model can get by on always picking the most popular ones.
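To make the imbalance worry concrete, here is a minimal sketch with made-up label counts (not the paper’s data). It compares the accuracy of always predicting the most common emoji with macro-F1, which weights every emoji equally. The function names are my own.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare emojis count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical label counts, for illustration only
y_true = ["😂"] * 80 + ["🙏"] * 15 + ["🔥"] * 5
top_label, accuracy = majority_baseline(y_true)
y_pred = [top_label] * len(y_true)
score = macro_f1(y_true, y_pred)

print(f"always predict {top_label}: accuracy={accuracy:.2f}, macro-F1={score:.2f}")
```

Accuracy looks respectable here even though the baseline never predicts two of the three emojis, which is why I would want to see a per-class metric alongside it.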

Bean Machine: Composable, Fast Probabilistic Inference on PyTorch | Meta Research

By: December 15, 2021 Eric Lippert, JP Chen, Kinjal Shah, Michael Tingley, Sepehr Akhavan Masouleh, Xiaoyan Wang, Brad…

research.facebook.com

Eric Lippert wrote a post about the Probability team getting let go from Facebook / Meta. He says the team was quite successful and cut costs across the org so it’s understandably bittersweet. I wonder if someone in ML leadership is negative about Probabilistic AI and wants to double down on LLMs and doesn’t care about compute.

Dead or Murdered? Predicting Responsibility Perception in Femicide News Reports

Different linguistic expressions can conceptualize the same event from different viewpoints by emphasizing certain…

arxiv.org

Important topic on how violence is represented in Italian media. Content warning: the paper focuses on murder and gender-based violence, and cites a separate paper on domestic violence. There’s also a reference to the use of language in bicycle deaths (accident vs. collision).

Domain Adaptation under Open Set Label Shift

We introduce the problem of domain adaptation under Open Set Label Shift (OSLS) where the label distribution can change…

arxiv.org

A paper and dataset modeling label shift (for example, adding a new class). Based on CIFAR.
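A toy illustration of what that shift looks like, using class names and counts I made up (not the paper’s CIFAR setup): the class proportions change between source and target, and one class appears only in the target. In the real problem the target labels are not observed, so this only shows the situation a method has to detect.

```python
from collections import Counter

def label_proportions(labels):
    """Map each label to its share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

# Hypothetical labels: "truck" never appears in the source data
source = ["cat"] * 50 + ["dog"] * 50
target = ["cat"] * 20 + ["dog"] * 60 + ["truck"] * 20

p_source = label_proportions(source)
p_target = label_proportions(target)
novel = set(p_target) - set(p_source)

print(p_source, p_target, novel)
```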

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python…

arxiv.org

Programming question challenges using Pandas and other popular data science libraries.

Few-shot Learning with Multilingual Language Models

Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to…

arxiv.org

I somehow missed Facebook/Meta releasing a multilingual autoregressive model (basically their take on a multilingual GPT-3, separate from OPT, their reproduction of GPT-3). There are a lot of mono- and multi-lingual BERT type models, but few languages have a full GPT-2 and even fewer (outside of English and Chinese) have something like this. The main models would be mGPT, BLOOM, and this XGLM.

HERB: Measuring Hierarchical Regional Bias in Pre-trained Language Models

Fairness has become a trending topic in natural language processing (NLP), which addresses biases targeting certain…

arxiv.org

Evaluating regional bias of masked language models (i.e. BERT not GPT). There’s a lot of plotting embeddings of the regions.

Holistic Evaluation of Language Models

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities…

arxiv.org

Stanford announces a benchmark for all LLMs. Isn’t that BIG-Bench? Instead, this group rejects BIG-Bench. What they do deserve credit for is getting the various labs and companies to properly evaluate their models in this head-to-head competition.

Human-level play in the game of Diplomacy by combining language models with strategic reasoning

https://www.science.org/doi/10.1126/science.ade9097

Facebook/Meta got an exciting paper into Science about AI agents dominating in Diplomacy. From a livestream of one game which they posted, I understand this as a game sort of like Risk, where the players send messages to each other to forge alliances and dispel concerns about why they are moving units.
Meta already had a pre-print about Diplomacy in October which is a little confusing, right?

From watching that livestream, it seems like most messages come from a chatbot which encourages other players to be mutually peaceful. The human player seemed satisfied with their chats, which had great casual language (e.g. “sounds good”, “nah”, “yeah lets” instead of Accept/Deny). Supposedly the model “can’t lie”. There was one message describing a move/attack which the player hadn’t actually made, which looked like an error. So it’s getting things done through communication and not deception. I don’t know enough about Diplomacy to know if this is masterful gaming or what.

Just-DREAM-about-it: Figurative Language Understanding with DREAM-FLUTE

Figurative language (e.g., “he flew like the wind”) is challenging to understand, as it is hard to tell what implicit…

arxiv.org

Linguists love challenging computers to look at pretty language, so here ya go. Allen AI researchers use an existing model (DREAM) which adds text and possible context to existing stories. Then this experiment adds a test to demonstrate whether the model understands if two pieces of figurative language are in agreement (entailment / contradiction).

MaskTune: Mitigating Spurious Correlations by Forcing to Explore

A fundamental challenge of over-parameterized deep learning models is learning meaningful data representations that…

arxiv.org

Interesting appearance of Autodesk AI Lab in this paper’s authors. MaskTune is their method which identifies the parts of the input that the model has already learned to rely on, masks them out, and then fine-tunes for a single epoch on the masked data. The idea is that this improves performance on tasks such as The Background Challenge, where image models spuriously connect the object to the background (horses and zebras, flowers and bees).
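A rough sketch of just the masking step, as I understand it. The saliency scores below are hand-written stand-ins (the paper derives them from the trained model), and `mask_most_salient` and its `fraction` parameter are my own names, not the authors’ code.

```python
def mask_most_salient(image, saliency, fraction=0.25, fill=0.0):
    """Return a copy of `image` with the highest-saliency positions set to `fill`."""
    flat = [
        (saliency[r][c], r, c)
        for r in range(len(image))
        for c in range(len(image[0]))
    ]
    k = max(1, int(len(flat) * fraction))  # how many positions to hide
    top = sorted(flat, reverse=True)[:k]
    masked = [row[:] for row in image]     # copy so the original stays intact
    for _, r, c in top:
        masked[r][c] = fill
    return masked

# Hypothetical 2x4 "image" and saliency map
image = [[0.9, 0.8, 0.1, 0.2],
         [0.7, 0.9, 0.3, 0.1]]
saliency = [[0.9, 0.7, 0.0, 0.1],
            [0.2, 0.8, 0.1, 0.0]]

masked = mask_most_salient(image, saliency, fraction=0.25)
print(masked)
```

Fine-tuning on `masked` instead of `image` would then push the model to find signal somewhere other than the positions it already depends on.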

Memory-Based Model Editing at Scale

Even the largest neural networks make errors, and once-correct predictions can become invalid as the world changes…

arxiv.org

Back in February there was a model fact-editing tool, ROME. This group creates a method, SERAC, which keeps the original model frozen and builds a framework around it. I think what’s going on is that a new model detects whether an input is related to the stored edits, and then routes relevant queries to the new-facts system.

In their tests of counter-factual generations, I appreciated this line:

Banana example was not cherry-picked; it was the first topic attempted.
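Here is a toy sketch of the routing idea as I described it, not the SERAC implementation. Word overlap stands in for the learned scope classifier, a stored answer stands in for the model that generates edited outputs, and every name here (`EditRouter`, `threshold`, `frozen_model`) is hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set, used for a crude similarity score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class EditRouter:
    def __init__(self, base_model, threshold=0.5):
        self.base_model = base_model  # frozen; never modified
        self.threshold = threshold
        self.edits = []               # list of (edit question, new answer)

    def add_edit(self, question, answer):
        self.edits.append((question, answer))

    def _scope_score(self, query, edit_question):
        # Jaccard overlap as a stand-in for a learned scope classifier
        q, e = tokens(query), tokens(edit_question)
        return len(q & e) / len(q | e) if q | e else 0.0

    def answer(self, query):
        best = max(
            self.edits,
            key=lambda edit: self._scope_score(query, edit[0]),
            default=None,
        )
        if best and self._scope_score(query, best[0]) >= self.threshold:
            return best[1]             # query is in scope of a stored edit
        return self.base_model(query)  # otherwise defer to the frozen model

def frozen_model(query):
    return "base answer"

router = EditRouter(frozen_model)
router.add_edit("What color is a banana?", "blue")  # made-up counterfactual edit

in_scope = router.answer("What color is a ripe banana?")
out_of_scope = router.answer("What is the capital of France?")
print(in_scope, "|", out_of_scope)
```

The appeal of this design is that adding or removing a fact only touches the edit list, so the base model’s behavior on unrelated queries cannot drift.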

Mitigating Covertly Unsafe Text within Natural Language Systems

An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate…

arxiv.org

“Covertly unsafe” suggestions are ones where it takes world knowledge to know they are harmful (such as biting a ghost pepper, or putting water on a grease fire). The paper works on defining and describing this problem but doesn’t have models and results.

metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq

Here we have included our full logbook used while training the OPT-175B model, along with a series of notes written to…

github.com

When Facebook/Meta announced OPT, one unusual component was sharing 100+ pages of notes, now called ‘Chronicles of OPT Development’. This makes it easier to see what the team tried. They report the learning rate, the metrics which they were using during training, a deduplication issue, and early signs that they were matching GPT-3 performance.

ProsocialDialog: A Prosocial Backbone for Conversational Agents

Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or…

arxiv.org

ProsocialDialog is a dataset of conversations to help chatbots redirect conversations away from negative topics.

GitHub — AliOsm/shakkelha: Neural Arabic text diacritization

This repository contains the models, dataset, helpers, and systems’ comparison for our paper on Arabic Text…

github.com

Adding vowel marks to Arabic text with a neural network (see kaggle.com/datasets/linuxscout/tashkeela for a dataset). When I first visited the Unicode Conference (in 2015, I think..?), someone told me that Google had worked on adding tashkeel and failed, so it’s interesting to see how this develops.

Background info: in Arabic script, long vowels (aa, ii, uu) are written as distinct letters. The short forms of these vowels (or the absence of a vowel) appear as diacritic marks above or below letters (Wiki says that vowels are only a part of tashkeel; I am not a pro here). These marks are always noted in the Quran, but unlikely to be found on webpages, street signs, and textbooks. As a language learner, this can be frustrating because I know دبي / D-B-II is Dubai from memory and context, but if I look at a sign on the subway I see only a bunch of consonants. Words can be ambiguous and I see why it’s challenging for an NLP model to fill in the marks for text-to-speech or search indexes.
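A small sketch of why this is hard, using Python’s standard `unicodedata` module (this is my own illustration, not the shakkelha code). Stripping the combining marks from two different words leaves the same bare consonants, so a diacritization model has to recover the difference from context.

```python
import unicodedata

def strip_tashkeel(text):
    """Drop Unicode combining marks, which include the Arabic short-vowel diacritics."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # كَتَبَ "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # كُتُب "books"
bare = "\u0643\u062A\u0628"                      # كتب as it usually appears in print

print(strip_tashkeel(kataba) == strip_tashkeel(kutub) == bare)
```

Note that this removes every combining mark, not only the vowel signs, so it is a blunt tool for real preprocessing.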

Similarity of Pre-trained and Fine-tuned Representations

In transfer learning, only the last part of the networks - the so-called head - is often fine-tuned. Representation…

arxiv.org

There’s a question across ML of whether fine-tuning could be replaced by continuing pre-training. In image classification, fine-tuning focuses only on the later/head layers, but with Transformers, text fine-tuning affects the whole model somehow (though there is an option to freeze specific layers).

Task-aware Retrieval with Instructions

We study the problem of retrieval with instructions, where users of a retrieval system explicitly describe their intent…

arxiv.org

Lots of retrieval tasks and benchmarks.

Tracing Knowledge in Language Models Back to the Training Data

Neural language models (LMs) have been shown to memorize a great deal of factual knowledge. But when an LM generates an…

arxiv.org

Finding that the methods to connect fill-in-the-blank facts to specific training examples are still hit-or-miss.

VeLO: Training Versatile Learned Optimizers by Scaling Up

While deep learning models have replaced hand-designed features across many domains, these models are still trained…

arxiv.org

Another Google paper experimenting with the optimizer adapting with the learning/training process. This takes it further and makes the optimizer into a neural net. Particularly interesting tidbit from Jascha, one of the core members of BIG-Bench, describing testing out this optimizer with other researchers:

Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA

Resolving ambiguities in questions is key to successfully answering them. Focusing on questions about images, we create…

arxiv.org

Johns Hopkins researchers developed several image-based question-answering examples. The focus of this paper is finding ambiguous questions of different types, such as asking “what make of motorcycle is that?” on a picture of a dirt bike, or asking about an object when there are two in the image.

WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes…

arxiv.org

This is a fun one! Human players are tasked with choosing text-image pairs which are guessed correctly by other humans and confounding to the AI.

Written by Nick Doiron
