ML Arxiv Haul #12

Nick Doiron
9 min read · Nov 16, 2022


We’re seeing a bunch of NLP pre-prints ahead of conference dates or end of the semester, so I’m already ready for another 20 papers in this ML Arxiv Haul. Maybe I should turn these into a monthly email newsletter?

In the previous haul I was looking at ‘contrastive decoding’ using a smaller model, and here is a totally different process named ‘contrastive search’. ‘Contrastive’ is also in two other papers in this post. So hot right now. HuggingFace had a blog about this paper last week, so I added it to my future decoders project. The process seems to start during model training, where the token representations are pushed apart to occupy more of the vector space. At decoding time, the contrastive method selects from the most probable tokens, but subtracts a penalty based on how similar the candidate is to the tokens already in the context (i.e. the highest penalty if it is the same token repeated). I wonder how this would handle situations where you intentionally repeat a word or token (‘I did not know that that would happen’, ‘how does a fly fly’; incidentally, I am listening to a Grammar Girl podcast episode about double words now).

They use LCCC as a benchmark for Chinese text generation. I haven’t seen this before, and it’s encouraging to see decoder research outside of English. Here they found significant human preference over other decoding methods, but the English models needed both components of their algorithm (SimCTG) to improve.
One downside is that, though the decoding method is supposedly workable with other models, their examples (I think) all include a fine-tuning step first.
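
To make the scoring step concrete, here is a toy sketch in Python of how I understand one step of contrastive search: each top-k candidate is scored by its probability minus a penalty for its highest similarity to anything already in the context. The function names, the two-dimensional vectors, and the alpha value are my own placeholders; a real decoder would use the model's hidden states, not hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_pick(candidates, context_vectors, alpha=0.6):
    """Pick one token from the top-k candidates.

    candidates: list of (token, probability, vector) tuples (toy values here).
    context_vectors: vectors for the tokens generated so far.
    alpha: weight of the repetition penalty (hypothetical default).
    """
    best_token, best_score = None, -math.inf
    for token, prob, vec in candidates:
        # Penalty: the candidate's highest similarity to any earlier token.
        penalty = max((cosine(vec, c) for c in context_vectors), default=0.0)
        score = (1 - alpha) * prob - alpha * penalty
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Toy example: "fly" is more probable but identical to the context vector.
context = [[1.0, 0.0]]
candidates = [("fly", 0.6, [1.0, 0.0]), ("bird", 0.4, [0.0, 1.0])]
print(contrastive_pick(candidates, context, alpha=0.6))  # penalty applied
print(contrastive_pick(candidates, context, alpha=0.0))  # plain greedy choice
```

With the penalty switched off (alpha of 0) this reduces to greedy decoding, which is one way to see what the extra term is doing.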

The study looks at Winograd sentences (Angela tried to calm Carrie’s nerves at the airport because _ was scared of flying in airplanes.) and compares human-written explanations to ones written by GPT-3. I was expecting them to then do something with the explanations, but the explanation itself is accurate. They used the ‘Data Maps’ tool which came out in 2020 to pick out which examples are most difficult for a model, and the explanations from GPT-3 are still high quality.

BigScience / HuggingFace has released a bunch of papers about work on their collaborative multilingual model BLOOM — collecting the ROOTS dataset, carbon emissions, and an Instruct version of the model (BLOOMZ).
My Hindi-BERT model makes a tiny appearance here (!) when its tokenizer is compared to BLOOM’s.

The researchers hire crowd-workers to collect sentences from Wikipedia which include more complex negation. The crowd-workers create variations on the sentence. Then the QA task is interpreting what did and didn’t happen based on a human understanding.
The best results come from UnifiedQA (I’m unfamiliar with this one; it’s based on T5) and (when few-shot is allowed) InstructGPT with chain-of-thought.

This is an emerging fear in AI Ethics land which I’m still a bit skeptical of. Essentially if AIs create a significant % of the text and images on the internet, this creates a cycle where their generated content influences and passes on biases to future models. There is some interesting research on how that works and what might be done to make generation from sampling less likely to pass on biases.

New research from Stanford on GitHub Copilot. When students were given several encryption scripting tasks and Copilot, they were likely to create vulnerable code, and were more likely to create these vulnerabilities the more that they accepted suggestions and highly rated its output. It wouldn’t be enough to use static analysis tools listed in the paper, because the security bugs are often “conceptual”.

Most comprehensive external audit which I’ve seen of DALL-E image generation. The authors find several stereotypes in careers, emotions, roles, families, etc. It’s difficult to push the model to generate some scenarios.

The authors endorse a model robustness tool (AutoAttack) but instead decide to focus on robustness measures which occur after the model is trained (in other words: at test time). I’m not familiar with these techniques so this is a good overview. I believe what they’re saying is the tools modify the model weights or the input data before the test or between batches (?). Anyway, their conclusion is that this category of test-time tools is disappointing and computationally expensive.

The Twitter sales pitch for this paper was, as the model becomes more accurate, its technique becomes less aligned with how humans view images.

As I scanned this, I wondered where the human attention maps come from? They come from the ClickMe dataset where “Participants select important image parts with their mouse by “painting” translucent bubbles on screen.” As you might suspect, reporting based on clicking and dragging is not the same as eye-tracking or showing portions of the image to the user. The two examples which I’d highlight are the snake (where people have highlighted the closer, front half of the snake) and the ball (where models highlight the player). The humans seem to be highlighting the minimum area to recognize the object.

There is significant interest in NLP to ‘align’ language models to human intent as with InstructGPT, code models (which are easier to test by running their code), etc. The major research labs have been doing this with reinforcement learning, and Allen AI is letting us in on the process here. The repo for this paper is “RL4LMs” which is more optimistic than the paper’s title implies. Finally they introduce a new RL algorithm and a benchmark.
Unfortunately I failed to run these on a Colab GPU :( maybe next time.

It’d seem that neural networks could think through the basic rules of Conway’s Game of Life and predict the image a few rounds into the future. The researchers seemed interested in discussing the ‘lottery ticket’ phenomenon but instead were disappointed by neural networks’ general performance.
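
For reference, the rules the network has to learn fit in a few lines. This is a minimal sketch of one Game of Life step in Python, with the board stored as a set of live cells; that representation and the function name are my choices, not anything from the paper.

```python
from collections import Counter


def life_step(live):
    """Advance one generation. `live` is a set of (row, col) live cells."""
    # Count how many live neighbours every nearby cell has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next round with exactly 3 neighbours,
    # or with exactly 2 neighbours if it is already alive.
    return {cell for cell, n in counts.items() if n == 3 or (n == 2 and cell in live)}


# A horizontal "blinker" flips to vertical and back again.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(life_step(blinker)))
```

Predicting a few rounds ahead means composing this function with itself, which is the part the networks apparently struggle with.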

There are scripts for finding what facts neural models know, editing facts in a model, etc., but this paper brings it back to the training data. The researchers find that a model’s ability to recall a fact is related to the number of times it saw that fact in pre-training. The study uses TriviaQA to benchmark ability. I wish they had looked at categories of facts which are difficult for models; for example, a model may choose a well-known city (Rio) for ‘what is the capital of Brazil?’.

I am interested in game-generation algorithms in case they could be applied to language learning. Here they have a game level generator which works through evolution, and then attempt to train a neural network to mimic this evolutionary behavior.

Expanding on previous collaborations on Bahasa Indonesia language benchmarks, this paper announces sentiment analysis data for 10 other Indonesian languages. I’m intrigued by the community organizing and collaboration visible for Indonesia, plus this win for preserving minority languages.
From Twitter convos, I learned that there are two models which came out around the same time both called ‘IndoBERT’.
I am hopeful this could carry over to some regional cooperation on AI (i.e. with Bahasa Melayu).

Interviewing machine learning engineers and managers about their jobs. There’s an interest in speeding up validation/iteration, and tracking versions of models. Includes a list of anti-patterns. I thought that there would be more discussion of feature-engineering-type work, but there’s just one point made about changing tools to SparkSQL.

The researchers compare CLIP and a CLIP fine-tuned on a new image task or synthetic data, and use their PAINT code to improve the model based on that. They manage to do this without disrupting accuracy on previous tasks. They describe this as ‘patching’ the model, when it seems more like ‘fine-tune then amplify’?
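
My understanding is that the patching step is a linear interpolation between the original weights and the fine-tuned weights, with a mixing coefficient picked on held-out data; treat that reading as an assumption. Here is a toy sketch in Python, where the weights are plain lists and the function and layer names are hypothetical:

```python
def interpolate_weights(original, fine_tuned, alpha):
    """Blend two sets of weights; alpha=0 keeps the original, alpha=1 the fine-tuned.

    Both arguments map a layer name to a flat list of floats (toy stand-ins
    for real model tensors).
    """
    return {
        name: [(1 - alpha) * o + alpha * f for o, f in zip(original[name], fine_tuned[name])]
        for name in original
    }


original = {"layer.weight": [1.0, 2.0]}
fine_tuned = {"layer.weight": [3.0, 6.0]}
print(interpolate_weights(original, fine_tuned, 0.5))
```

Any alpha between 0 and 1 lands somewhere between the two models, which is why keeping accuracy on the previous tasks is plausible.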

The Humanitarian OpenStreetMap Team (HOT) announced that they are almost ready to release an AI mapping tool (named fAIr). This works as a competitor to Meta’s RapiD. RapiD was released in 2019 and is now previewing a v2.0 (the main changes are including layers from Esri and Microsoft, and rendering the map in a canvas element instead of SVG).

I’m a bit embarrassed that I hadn’t heard about HOT’s fAIr project until today. That project seems to have benefited from ramp, which the repo shows tracing buildings in several countries.

Researchers maintain the changing answers to a set of questions (how many home runs has <player> hit) and evaluate models which are ‘closed book’ or use a standardized retrieval process. As of November 12th, the best results come from GPT-3 and a custom search engine.
Interestingly, they even release a new set of questions each week, based on newspaper quizzes: github.com/realtimeqa/realtimeqa_public/tree/main/latest

Last year, a paper introduced Contrastive Authoring and Reviewing Pairing (CARP) as a way to evaluate whether humans would like a story. This paper fine-tunes GPT-2 to generate new stories based on CARP’s preference score. This finally explains why the Stable Diffusion team’s text and code team is called Carper AI! The paper brings in prompt-tuning (CoOp) and reinforcement learning to improve the stories.

I think what’s going on here is that they select the highest-loss examples of the text dataset, use k-means clustering, and then present these clusters together so ideally you can spot a trend. They study some examples where GPT-3 describes the clusters. Rather strange for them to use an ALBERT model for the process and GPT-3 for explaining the errors.
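
Here is a rough sketch in Python of the pipeline as I read it: keep the highest-loss examples, then run k-means on their embeddings. Everything concrete here is made up for illustration (the example ids, losses, two-dimensional embeddings, and the simple first-k initialization); the paper's actual features and settings may differ.

```python
def top_loss(examples, n):
    """Return the n examples with the highest loss."""
    return sorted(examples, key=lambda e: e["loss"], reverse=True)[:n]


def kmeans(points, k, iters=10):
    """Tiny k-means; returns a cluster index for each point."""
    centers = [list(p) for p in points[:k]]  # deterministic init: first k points
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Hypothetical examples: an id, the model's loss, and a 2-D embedding.
examples = [
    {"id": "a", "loss": 2.9, "emb": (0.0, 0.0)},
    {"id": "b", "loss": 2.5, "emb": (0.1, 0.2)},
    {"id": "c", "loss": 0.1, "emb": (5.0, 5.0)},
    {"id": "d", "loss": 3.1, "emb": (9.0, 9.0)},
    {"id": "e", "loss": 2.7, "emb": (9.2, 8.8)},
    {"id": "f", "loss": 0.2, "emb": (0.0, 1.0)},
]
hard = top_loss(examples, 4)
labels = kmeans([e["emb"] for e in hard], k=2)
for e, label in zip(hard, labels):
    print(e["id"], label)
```

The hope is that each printed cluster groups errors with a shared cause, which is the trend a human (or GPT-3) is then asked to describe.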

NLP researchers want to avoid models discovering ‘shortcuts’ on tasks instead of learning a more diverse and robust set of signals. In this paper, they look at explainability and other metrics to detect a shortcut in the process. Considering that their example of robustness is susceptibility to unrelated information in a text / summary, the easiest fix is adversarial training. They mention several other methods, such as ‘worst-group loss minimization’. This is a survey paper so you would have to track down repos for each method.
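
As a quick illustration of the ‘worst-group loss’ idea: instead of averaging the loss over everything, you average within each group and then optimize for whichever group is doing worst. This Python sketch only computes that quantity; the group names and loss values are invented, and a real setup would feed it into training.

```python
from collections import defaultdict


def worst_group_loss(losses, groups):
    """Return (group name, mean loss) for the group with the highest mean loss."""
    by_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        by_group[group].append(loss)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    worst = max(means, key=means.get)
    return worst, means[worst]


# Hypothetical split: examples where a shortcut works vs. where it fails.
losses = [0.2, 0.4, 1.0, 2.0]
groups = ["shortcut_works", "shortcut_works", "shortcut_fails", "shortcut_fails"]
print(worst_group_loss(losses, groups))
```

The overall average here would look fine, which is exactly how a shortcut hides.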
