ML Arxiv Haul #4
Oh crud, this is too many papers; I should probably rethink this strategy of pasting an Arxiv link here instead of reading or saving it in a tab.
I decided to make a daily routine to finally get ahead on filtering and summarizing.
This is a little interesting — a Salesforce team’s take on code-generation via a chat interface. I thought it was a little bizarre that the text from humans was more comments requesting the next sequence of code. It’s not something where you give feedback to fix or improve on the previous code.
A scientific writing of, deep learning NLP models are better than other methods on a dataset to classify cognitive-behavioral statements.
This continues a back-and-forth between the new architecture (visual transformers) and revivals of ResNet (specifically ResNeXt) showing that you can get similar accuracy scores with some improvements, data augmentation, etc.
Prompt engineering → prompt search. The prompts generated by this method outperform human-edited prompts (but by who? apparently crowd workers?). In a baffling reveal, some ‘improved’ prompts remove the labels and/or coherence from the prompt. Maybe it’s understood that a Tweet-based task is generally positive or negative, or maybe some information is leaking through? (unlikely in a hosted service like InstructGPT).
I would like to see these results used to craft more confusing tasks (i.e. switching positive <-> negative labels, or labeling negative tweets ‘yes’).
The researchers find an improvement in processing extremely-closely related languages (German and Swiss German, Romanian and Moldovan) by continuing pre-training on text including some examples with spelling errors. It’s difficult to guess why this works, other than maybe playing on different neurons or building more robust patterns rather than expecting direct matches.
This paper is part of a new wave of code generation models which ‘infill’ gaps in code or edit in place. They have a demo on HuggingFace/Gradio: https://huggingface.co/spaces/facebook/incoder-demo
Interestingly they are still using a left-to-right style generative model, but replacing it in the code with a <MASK> token and then training the model to follow sample code prompts with its prediction of what the <MASK> was. They also can handle a mask being repeated (such as a variable name) or two masks in the same code prompt.
This is a Microsoft Research paper (i.e. not Google’s Jigsaw team). In this case they’re setting up an experimental framework to generate code. The code-writer model is given a data task in the Pandas library and then its code is evaluated for correctness. This is interesting more for what it doesn’t do (generating tons of code samples and improving with reinforcement learning, as DeepMind does).
A model for Kinyarwanda language. The authors add a morphology / part-of-speech tagging step before feeding data into a BERT-like model. This helps break down the components of agglutination in Kinyarwanda.
The sentence ’John twarahamusanze biradutangaza’ (We were surprised to find John there).
This paper focuses on the coherence and quality of texts from generative models using ‘time control’. I immediately thought of the ‘typical decoding’ paper. I don’t think anyone has directly compared the two?
A paper that just came out this week (April 2022). The authors use timm, ResNeXt, and Imagenette (all good choices) and multiple runs to calculate the varying accuracy and give a better grounding for discussions around State-of-the-Art results. The authors point out How Do Vision Transformers Work claims significant results on a new architecture when the confidence interval includes a possibility that there was no improvement.
A scientific approach to something which I’ve tried to do — repurposing an existing model architecture and weights to work in another language with fewer resources.
The authors improve on quality of mBART and other models purpose-built for translation, so this is a promising direction.
Google’s enigmatic Pathways architecture posts were followed by this more practical (but still solely research-use) Pathways Language Model.
This is one of the first papers to use the crowdsourced BIG-bench tasks. It looks like they compare PaLM to human and pre-existing model performance, but do most analysis on a select 58 tasks, not including my own tasks here.
In a new development which is sure to be a confounding trend, the dataset and ethics sections of the paper cite Stochastic Parrots without heeding its warnings. The authors (including two expelled from Google over that paper) have all commented on this paper, but I wanted to highlight two threads:
Algorithmic basis for realistic misspellings and typos (I find it annoying to talk about this as adversarial or an attack, but it’s useful to make models more robust).
Interesting project from Google Research about taking the optimizer component of the classical training loop (predict->loss ->optimize->) and making that learnable / trainable, instead of a fixed function. They show improvements, but admit a focus on only two tasks. I’m thinking maybe this won’t catch on, but could maybe find a better general-purpose optimizer.
The researchers evaluate the translation ability of mBART and mT5. As might be expected, fine-tuning on languages unseen during pre-training does not yield a usable model. The project includes research from the Masakhane project and they evaluate Yoruba language, but the team also gets similar results on Irish/Gaelic, Assamese, and Kannada.
Very much my jam — this “patches” the final layer of a text neural net. The reason for this is something which I didn’t expect: it is trying to vary the output (within probability-space) to prevent static analysis by typical model-attacking techniques.
I found the code for this project, which was just posted last week, and connected it on PapersWithCode.
There is a 2018 paper called SHIELD about another stochastic defensive technique against adversarial attacks, for images, from Georgia Tech.
A test of Facebook advertising policy during the 2018 midterm elections, which claims to evaluate automated approvals but is unclear.
…our audit demonstrated that Facebook was systematically prohibiting nonelection material of importance to American civic life, including public holidays and government websites.
Unconventional format and content for an ML paper — the authors interviewed 25 practitioners about how they used explainable ML tools and how they handled conflicting answers.
This seems to be generally good advice for vision transformers.
This modifies pre-training code to drop tokens with the smallest loss. If these tokens are truly superfluous or predictable, then we can save time/compute. The researchers show pre-training can be 25% faster.
What does it mean for tokens to be ‘dropped’ though, especially when they are restored in the final layer? I’ve included a figure from the experiment below… it looks as though ‘the’, prepositions, being verbs, and punctuation get dropped from the source text. I’m still confused about whether you train BERT for a bit and then start removing tokens from future steps/epochs, or the loss is measured using an existing full-text BERT model before pre-training a new abbreviated BERT model.
A little bit fun, this addresses a challenge from the DSTC10 AI Workshop to participate in WeChat (Chinese) conversations with the right memes and emotional tracking. The memes are stickers identified by their alt-text, which is a little disappointing as a true meme-bot would have some visual understanding, but it’s a fun competition.
More info and dataset links: https://github.com/lizekang/dstc10-mod
Chinchilla is a language model from DeepMind which outperforms much larger models (including GPT-3). In a strange top-down approach, the paper graphs the ideal language model based on model and corpus statistics, hypothesizes that existing models are inefficient, and creates Chinchilla to fit their ideal. I was looking for some clever test training runs and scaling laws and things, but it seems like the main takeaway is for every doubling of model size the number of training tokens should also be doubled.
This paper mentions one of my BIG-Bench tasks (disambiguation_q) as an example task where Chinchilla outperforms Gopher.
I believe that this paper is critiquing prompt engineering (as most results are being reported without the process of choosing the right prompt or hyperparameter run).
The ITERATER dataset includes revisions to text and annotations about why that change was made (fluency, clarity). They attain small improvements over human editors with BART, FELIX, and PEGASUS models.
This is the first time that I’ve seen FELIX (an alternative to seq2seq) appearing in new research, so I may consider it for the gender-reinflection model stuff.
This paper suggests ‘adapter’ layers before and after a frozen language model can learn to process text plus images from CLIP, without the difficulties of processing the whole model. They get good results.
In 2020–2021 there were a flurry of papers showing that BERT models would ignore word order, seemingly a step back from classical diagram-sentences and part-of-speech NLP. This 2022 paper looks inside the model to show some differences, and also makes the point that many of these word-order experiments are super-weird or impossible so BERT is ignoring them (‘except when it matters’).
A component of LIME and other black-box explainable AI methods is assigning importance to input images and text through salience. Here Google researchers take four common algorithms and find examples where the algorithms disagree on the importance of words.
The authors blame ‘shortcuts’ taken by language models, which can come in a variety of forms, but essentially mean techniques which fail on new data (not robust). They recommend avoiding shortcuts by using different explainability methods depending on whether you are using BERT or LSTM (do people still use LSTM?).
As text+image (multi-modal) models become more popular, Winoground is an intriguing image-captioning benchmark which just came out this month (April 2022). Two images have a caption with the same bag of words, but different word order (e.g. a person sits and a dog stands, vs. a person stands and a dog sits). Frustratingly, no model does much better than chance.
Adversarial machine learning finds inputs that trick conventional models, sometimes to discredit them or to improve robustness during training. This 2017 paper describes a black-box attack (where we don’t have transparency to the full model) which can create ‘attack’ inputs without developing its own shadow model.
This paper was the first time I read about feature squeezing and another 2017 paper Adversarial Examples Are Not Easily Detected. I wonder if I’m out of the loop on adversarial examples or this line of detecting/defending against adversarial examples has fallen out of favor.