ML Arxiv Haul #13

Nick Doiron
7 min read, Dec 7, 2022


Lucky / unlucky 13? I noticed that, like my reading blog, these posts are appearing on a roughly monthly schedule.
I saw that Cohere posted a digest of NLP papers this month (txt.cohere.ai/top-nlp-papers-november-2022), so maybe I can look at those in the future.

This came up on my radar because they use one of the MonsoonNLP models. It’s about training an emoji classification model with federated learning, and the challenges of picking a Hindi model. They find some success using an LSTM instead of a Transformer model.

I’m a little concerned that the emojis are an imbalanced dataset.

It’s difficult to predict the right emojis well when a model can score easily by always guessing the more popular emojis.
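To make the imbalance worry concrete, here is a minimal sketch of how a "just guess the most popular emoji" baseline looks under plain accuracy versus a per-class average. The label counts are made up for illustration (they are not from the paper), and `majority_baseline` and `macro_recall` are hypothetical helper names.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def macro_recall(y_true, y_pred):
    """Average per-class recall, so rare classes count as much as common ones."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Made-up distribution: one emoji dominates the dataset.
labels = ["joy"] * 70 + ["heart"] * 20 + ["pray"] * 10
top, accuracy = majority_baseline(labels)
preds = [top] * len(labels)          # a "model" that never looks at the text
balanced = macro_recall(labels, preds)
print(top, accuracy, round(balanced, 3))
```

With these invented counts the lazy baseline looks decent on accuracy and poor on macro recall, which is why a per-class metric is worth checking on a dataset like this.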

Eric Lippert wrote a post about the Probability team getting let go from Facebook / Meta. He says the team was quite successful and cut costs across the org, so it’s understandably bittersweet. I wonder if someone in ML leadership is negative about Probabilistic AI, wants to double down on LLMs, and doesn’t care about compute.

Important topic on how violence is represented in Italian media. Content warning: the paper focuses on murder and gender-based violence, and cites a separate paper on domestic violence. There’s also a reference to the use of language in bicycle deaths (accident vs. collision).

A paper and dataset modeling label shift (for example, adding a new class). Based on CIFAR.

Programming question challenges using Pandas and other popular data science libraries.

I somehow missed Facebook/Meta releasing a multilingual autoregressive model (basically their take on a multilingual GPT-3, separate from OPT, their reproduction of GPT-3). There are a lot of mono- and multi-lingual BERT type models, but few languages have a full GPT-2 and even fewer (outside of English and Chinese) have something like this. The main models would be mGPT, BLOOM, and this XGLM.

Evaluating regional bias of masked language models (i.e. BERT not GPT). There’s a lot of plotting embeddings of the regions.

Stanford announces a benchmark for all LLMs. Isn’t that BIG-Bench? Instead this group rejects BIG-Bench. What they do deserve credit for is getting the various labs and companies to properly evaluate their models in this head-to-head competition.

Human-level play in the game of Diplomacy by combining language models with strategic reasoning

https://www.science.org/doi/10.1126/science.ade9097

Facebook/Meta got an exciting paper into Science about AI agents dominating in Diplomacy. From a livestream of one game which they posted, I understand this as a game sort of like Risk, where the players send messages to each other to forge alliances and dispel concerns about why they are moving units.
Meta already had a pre-print about Diplomacy in October which is a little confusing, right?

From watching that livestream, it seems like most of the chatbot’s messages encourage other players to be mutually peaceful. The human player seemed satisfied with their chats, which had great casual language (e.g. “sounds good” “nah” “yeah lets” instead of Accept/Deny). Supposedly the model “can’t lie”. There was one message describing a move/attack which the player hadn’t actually done, which looked like an error. So it’s getting things done through communication and not deception. I don’t know enough about Diplomacy to know if this is masterful gaming or what.

Linguists love challenging computers to look at pretty language, so here ya go. Allen AI researchers use an existing model (DREAM) which adds text and possible context to existing stories. Then this experiment adds a test to demonstrate whether the model understands if two pieces of figurative language are in agreement (entailment / contradiction).

Interesting appearance of Autodesk AI Lab in this paper’s authors. MaskTune is their method which, during a single epoch of fine-tuning, identifies the parts of the input which the model has already learned to rely on and masks them out. The idea is that this improves performance on tasks such as The Background Challenge, where image models spuriously connect the object to the background (horses and zebras, flowers and bees).
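Here is a rough sketch of the masking step as I understand it, assuming we already have a per-pixel saliency map from some attribution method. The function name `mask_salient`, the toy 2x2 "image", and the mean-plus-k-standard-deviations cutoff are all my own assumptions for illustration, not necessarily the paper's exact recipe.

```python
from statistics import mean, pstdev

def mask_salient(image, saliency, k=1.0):
    """Zero out pixels whose saliency is above mean + k * std.

    image and saliency are same-shaped lists of rows. The cutoff rule is
    an assumption for this sketch, not taken from the paper.
    """
    flat = [s for row in saliency for s in row]
    cutoff = mean(flat) + k * pstdev(flat)
    return [
        [0.0 if s > cutoff else px for px, s in zip(img_row, sal_row)]
        for img_row, sal_row in zip(image, saliency)
    ]

# Toy example: the model leans heavily on the bottom-right pixel.
image = [[1.0, 2.0], [3.0, 4.0]]
saliency = [[0.1, 0.1], [0.1, 0.9]]
masked = mask_salient(image, saliency)
print(masked)
```

Fine-tuning on the masked inputs would then push the model to find other evidence for the label, since the feature it relied on is gone.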

Back in February there was a model fact-editing tool, ROME. This group creates a method, SERAC, which keeps the original model frozen and then builds a framework around it. I think what’s going on is they have a new model detect if an input is related to its edits, and then route relevant queries to its new facts system.
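My reading of that routing idea, as a toy sketch: keep a list of edits, check whether an incoming query is "in scope" of one, and only then bypass the frozen model. Everything here is hypothetical. `EditRouter` and `token_overlap` are names I made up, word overlap stands in for a learned scope classifier, and returning the stored answer stands in for a second model that would generate one.

```python
def token_overlap(a, b):
    """Toy similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

class EditRouter:
    """Wraps a frozen base model and routes in-scope queries to stored edits."""

    def __init__(self, base_model, threshold=0.5):
        self.base_model = base_model   # never modified
        self.edits = []                # list of (query, new_answer) pairs
        self.threshold = threshold     # assumed cutoff for "in scope"

    def add_edit(self, query, answer):
        self.edits.append((query, answer))

    def __call__(self, query):
        # Find the stored edit most similar to the query, if any.
        best = max(self.edits, key=lambda e: token_overlap(query, e[0]), default=None)
        if best is not None and token_overlap(query, best[0]) >= self.threshold:
            return best[1]             # in scope: answer from the edit memory
        return self.base_model(query)  # out of scope: frozen model answers

router = EditRouter(base_model=lambda q: "base answer for: " + q)
router.add_edit("what color is a banana", "blue")
print(router("what color is a banana"))
print(router("what is the capital of France"))
```

The appeal of this design is that edits can be added or removed without touching the base model's weights.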

In their tests of counter-factual generations, I appreciated this line:

Banana example was not cherry-picked; it was the first topic attempted.

“Covertly unsafe” suggestions are ones which take world knowledge to recognize as harmful (such as biting a ghost pepper, or putting water on a grease fire). The paper works on defining and describing this problem but doesn’t have models and results.

When Facebook/Meta announced OPT, one unusual component was sharing 100+ pages of notes, now called ‘Chronicles of OPT Development’. This makes it easier to see what the team tried. They report the learning rate, the metrics which they were tracking during training, a deduplication issue, and early signs that they were matching GPT-3 performance.

ProsocialDialog is a dataset of conversations to help chatbots redirect conversations away from negative topics.

Adding vowel marks to Arabic text with a neural network (see kaggle.com/datasets/linuxscout/tashkeela for a dataset). When I first visited the Unicode Conference (in 2015, I think..?), someone told me that Google had worked on adding tashkeel and failed, so it’s interesting to see how this develops.

Background info: in Arabic script, long vowels (aa, ii, uu) are written as distinct letters. The short forms of these vowels (or the absence of a vowel) appear as diacritic marks above or below letters (Wiki says that vowels are only a part of tashkeel; I am not a pro here). These marks are always noted in the Quran, but unlikely to be found on webpages, street signs, and textbooks. As a language learner, this can be frustrating: I know دبي / D-B-II is Dubai from memory and context, but if I look at a sign on the subway I see only a bunch of consonants. Words can be ambiguous, and I see why it’s challenging for an NLP model to fill in the vowels for text-to-speech or search indexes.
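A small sketch of why the reverse direction is hard: removing the marks is a one-liner, but different vowelled words collapse to the same bare consonants. This assumes the marks are Unicode combining characters (category `Mn`); `strip_tashkeel` is a hypothetical helper name, and the two words are a common textbook example (kataba, "he wrote", and kutub, "books").

```python
import unicodedata

def strip_tashkeel(text):
    """Drop combining marks (Unicode category Mn), which covers the Arabic short-vowel marks."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Written with escapes so the diacritics are unambiguous in source:
# U+0643 kaf, U+062A ta, U+0628 ba, U+064E fatha, U+064F damma
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
kutub = "\u0643\u064F\u062A\u064F\u0628"         # "books"

bare_a = strip_tashkeel(kataba)
bare_b = strip_tashkeel(kutub)
print(bare_a, bare_b, bare_a == bare_b)
```

A diacritization model has to go the other way, picking one vowelling out of several valid ones using the surrounding context.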

There’s a question across ML of whether fine-tuning could be replaced by continuing pre-training. In image classification, fine-tuning often focuses only on the later/head layers, but fine-tuning a text Transformer typically affects the whole model (though there is an option to freeze specific layers).

Lots of retrieval tasks and benchmarks.

Finding that the methods to connect fill-in-the-blank facts to specific training examples are still hit-or-miss.

Another Google paper experimenting with the optimizer adapting with the learning/training process. This takes it further and makes the optimizer into a neural net. Particularly interesting tidbit from Jascha, one of the core members of BIG-Bench, describing testing out this optimizer with other researchers:

Johns Hopkins researchers developed several image-based question-answering examples. The focus of this paper is finding ambiguous questions of different types, such as asking “what make of motorcycle is that?” on a picture of a dirt bike, or asking about an object when there are two in the image.

This is a fun one! Human players are tasked with choosing text-image pairs which are guessed correctly by other humans and confounding to the AI.
