ML Arxiv Haul #10

Oct 22, 2022

A Hazard Analysis Framework for Code Synthesis Large Language Models

Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its…

arxiv.org

While preparing for DEF CON AI Village, and after joining the ‘Big Code Project’ Slack, I saw a couple of papers about code model security.
This summer 2022 paper from OpenAI talks about their Codex model (which also powers GitHub Copilot).
‘Hazard Analysis’ is a term of art in safety-critical industries which is being applied here. The authors note that none of the major languages learned by Codex support formal verification, so it’s difficult to consider all possible hazards and detect them in a large group of generated programs.
The team highlights a particular issue somewhat hidden in Copilot through good prompting:

one consequential word is often the difference between Codex producing correct or incorrect results

The paper is upfront about not addressing all risks, but it does include a wide-ranging spread of them, for example:

• Synthesis features are used to generate code for application with environmental impacts, exacerbating environmental hazards

The paper does not cover code explanation models.
In the longer term, the authors recommend removing deepfake and other harmful projects from the training data, fine-tuning with a technique like PALMS on a small set of moral rules or principles, detecting adversarial user behavior, and declining to complete malicious requests.

Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions

There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including…

arxiv.org

This code-gen + security paper is from August 2021, and some of the authors spoke on the topic at Black Hat 2022. The experiment uses Copilot to generate code in scenarios which touch on the top 25 vulnerabilities (CWEs). They then use CodeQL (new to me) to scan the code and see which vulnerabilities got generated.
I’ll have to check my Copilot Labs / Nightly plugins, but the authors have more access than I’ve had to confidence scores and parameters (top_p is particularly interesting, hinting at the sampling / decoder method used).
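For anyone who hasn't met top_p before, here is a minimal sketch of nucleus (top-p) sampling, the decoding method that parameter name usually refers to. This is not Copilot's actual decoder, which isn't public; the function names and the toy token probabilities are made up for illustration.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}

def sample_top_p(probs, top_p, rng=random):
    """Sample one token from the filtered (nucleus) distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, not real model output.
toy = {"strcpy": 0.5, "strncpy": 0.3, "memcpy": 0.15, "gets": 0.05}
print(top_p_filter(toy, 0.7))
print(sample_top_p(toy, 0.7))
```

A lower top_p trims the unlikely tail of completions, and a value near 1.0 keeps almost everything in play, which is why seeing the parameter exposed hints at how the suggestions are being drawn.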

This is a rather fun use of something I’ve struggled to define: header comments (in this case __author__) can stereotype the user’s code style.

The authors make plausible changes to the comment text, spaces vs. tabs, etc. and find more shifts in vulnerability.
In the conclusions, the authors have some apparent regrets about the experiment setup. Some of the highest risk CWEs involve memory safety or other issues which involve low-level C and Verilog, which Copilot/Codex is not as prepared for, and CodeQL was not able to detect the vulnerabilities. So if I were revisiting this for Big Code Project I would probably pick and choose the CWEs or security bugs.

A Taxonomy of Prompt Modifiers for Text-To-Image Generation

Text-to-image generation has seen an explosion of interest since 2021. Today, beautiful and intriguing digital images…

arxiv.org

Good survey of what’s going on in this wtf space of image prompt engineering. I have tended to refer to The-DALL·E-2-prompt-book but they recommend something new to me, Traveler’s Guide to the Latent Space.

Bias, Consistency, and Partisanship in U.S. Asylum Cases: A Machine Learning Analysis of Extraneous Factors in Immigration Court Decisions

Can machine learning measure the impact of bias on U.S. asylum decisions?

By Catherine Vera and Vyoma Raman

medium.com

Finally a paper where ML is helping analyze biases instead of creating new ones! Here the authors at UC Berkeley have a model which can predict 58% of outcomes based on judge assignments and a partisanship score which includes public opinion and elected officials at the time of the decision. I find this partisanship stuff interesting because few new immigration laws were enacted from Obama to Trump to Biden, but this New Yorker article from 2018 covers how the outcome of a traffic stop (getting a ticket, being detained, rapidly deported, or released with court dates) is up to the officers on the scene. So the attitude or dictums of an administration can swiftly change what’s happening on the ground, and apparently in the courts.

This reminds me of a thought experiment where a computer assigns people to a fair judge or to a judge who never grants asylum. It would be difficult to call the computer an ‘unethical AI’ because even a random assignment has unfair outcomes. An AGI or human assigner would presumably not be allowed to assign everyone to the fair judge. The only ethical output is for the programmer, program, and humans in the system to resist moving more people into that system.

It also reminds me of the bananas research finding that judges in Louisiana hand out harsher sentences after LSU loses a game.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the…

arxiv.org

Google Research and Stanford team up to improve scores on BIG-Bench tasks. They use the famous ‘Let’s think step by step.’ prompt, and select only tasks which seem solvable by English language models (e.g. not ASCII art, or my multilingual wiki task). My ‘Disambiguation QA’ task made the cut!
Two GPT models (including InstructGPT) show improvement with this technique.
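As a rough illustration of what the technique looks like on the wire, here is a sketch of appending the trigger phrase to a question. The Q/A layout, the function name, and the sample question are my assumptions, not the paper's actual prompt template.

```python
COT_TRIGGER = "Let's think step by step."

def build_prompt(question, use_cot=True):
    """Format a question, optionally ending with the chain-of-thought trigger."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        prompt += f" {COT_TRIGGER}"
    return prompt

# Made-up question in the spirit of a pronoun disambiguation task.
question = "In 'The nurse called the doctor because she was late', who was late?"
print(build_prompt(question))
print(build_prompt(question, use_cot=False))
```

The model then continues from the trigger phrase, so its reasoning appears before the final answer instead of the answer coming out cold.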

Classifier-Free Diffusion Guidance

Classifier guidance is a recently introduced method to trade off mode coverage and sample fidelity in conditional…

arxiv.org

I watched a video on diffusion models which referenced this paper. Essentially the latest generation of diffusion model, building up an image in steps out of random noise, has typically been assisted by an existing classifier for that image (the cited paper is from May 2021). I’m not really qualified to discuss their alternative method, but they explain it as the classifier being replaced by “[mixing] the score estimates of a conditional diffusion model and a jointly trained unconditional diffusion model”.
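A rough sketch of that mixing step, as I understand the paper's formulation: the guided prediction is (1 + w) times the conditional estimate minus w times the unconditional one, with w as the guidance weight. The short lists below are toy stand-ins for the two models' outputs, and the function name is my own.

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Mix conditional and unconditional noise predictions.

    w = 0 returns the conditional prediction unchanged; a larger w
    pushes the result further away from the unconditional prediction.
    """
    return [(1 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

# Toy values standing in for model outputs at one denoising step.
eps_cond = [0.2, -0.1, 0.4]
eps_uncond = [0.1, 0.0, 0.1]

print(guided_noise(eps_cond, eps_uncond, w=0.0))
print(guided_noise(eps_cond, eps_uncond, w=2.0))
```

No separate classifier appears anywhere in the calculation, which is the point: the gap between the two predictions plays the role the classifier's gradient used to play.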

Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed…

arxiv.org

HuggingFace wrote a paper about their platform for ML models and datasets (originally NLP and text data, but expanding into other media for some time now). The paper introduces the evaluate library, and a hosted evaluation-as-a-service (auto evaluator). This leads to a goal of allowing users to test any hosted model x dataset [if it’s reasonably compatible].

I’m hopeful that this work and HF leaderboards can surface models which score higher on benchmarks such as HornMT (translation for languages in Ethiopia, Eritrea, and Somalia) which don’t trend on ML Twitter. Previously people have been writing a one-off paper and repo, setting up a GitHub Pages site, or relying on more advanced platforms for ML competitions.

The authors also speak to some goals in shifting how metrics are used in AI/ML/NLP. They urge users to consider more markers of social bias, the difficulty of evaluating with more expensive or unreproducible methods, and keeping up with the trends in metrics (here they mention SacreBLEU over BLEU, which is too inside baseball for me, but I had heard before that BLEU is not the greatest).

Gender Biases and Where to Find Them: Exploring Gender Bias in Pre-Trained Transformer-based…

Language model debiasing has emerged as an important field of study in the NLP community. Numerous debiasing techniques…

arxiv.org

Two Japanese researchers edit a model’s internal architecture toward a de-biased version using ‘movement pruning’.
The metrics for bias are new to me (SEAT = Sentence Encoder Association Test, SS = StereoSet Stereotype Score).
Some anachronistic parts of the paper for 2022 — they use a standard BERT model and measure the general accuracy of their model using GLUE.
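To make the SS metric a little more concrete, here is a sketch of the idea as I understand it: the score is the percentage of sentence pairs where the model prefers the stereotypical version over the anti-stereotypical one, so 50 would be the unbiased ideal. The scores below are invented log-likelihoods, not numbers from the paper.

```python
def stereotype_score(pairs):
    """pairs: list of (stereotypical_score, anti_stereotypical_score),
    e.g. model log-likelihoods for the two versions of a sentence.
    Returns the percentage of pairs where the stereotype is preferred."""
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)

# Invented scores for four sentence pairs.
toy_pairs = [(-3.1, -3.5), (-2.0, -1.8), (-4.2, -4.9), (-2.7, -2.6)]
print(stereotype_score(toy_pairs))
```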

Large Language Models are Human-Level Prompt Engineers

We propose an algorithm for automatic instruction generation and selection for large language models with human level…

openreview.net

This paper is on OpenReview for the ICLR conference; I don’t know if we’re supposed to speculate who created it?
On a set of tasks, researchers inserted zero or a few examples into unprompted (greedy decoder), human-prompted, or AI-prompted model runs on InstructGPT. The AI did well, so there’s some debate about what it means for human ‘prompt engineering’ research. This is complicated by the authors testing instructions inside of what I might call a meta-prompt? (“I instructed my friend to <INSERT> The friend read the instruction and wrote an output for every one of the inputs.”)
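Here is a small sketch of how candidate instructions might be slotted into that meta-prompt. The template text is the one quoted above; the helper name and the candidate instructions are invented for illustration.

```python
META_PROMPT = (
    "I instructed my friend to <INSERT> The friend read the instruction "
    "and wrote an output for every one of the inputs."
)

def fill_meta_prompt(instruction):
    """Place one candidate instruction into the <INSERT> slot."""
    return META_PROMPT.replace("<INSERT>", instruction)

# Hypothetical candidate instructions, not taken from the paper.
candidates = ["write the antonym of each word.", "reverse each word."]
for candidate in candidates:
    print(fill_meta_prompt(candidate))
```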

The table shows the best AI prompts and there was only one which came out super-weird and even wrote its own one-shot example.

I’d like to see the #2 or #3 prompts because the famous “Let’s think step by step” prompt paper glossed over that its runner-up was “First, (*1)”.
Also discussed: how the rhyme task is poorly designed (allows response with same word).
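A sketch of how a stricter scorer could close that loophole. The rhyme check here is a deliberately crude stand-in (matching the last two letters), since a real one would need pronunciation data; the function names are my own.

```python
def crude_rhyme(a, b, n=2):
    """Very rough stand-in for a rhyme check: compare the last n letters."""
    return a[-n:].lower() == b[-n:].lower()

def score_rhyme(prompt_word, answer):
    """Count an answer only if it rhymes AND is a different word."""
    prompt_word, answer = prompt_word.strip(), answer.strip()
    if answer.lower() == prompt_word.lower():
        return False  # echoing the prompt word is not a rhyme
    return crude_rhyme(prompt_word, answer)

print(score_rhyme("sing", "ring"))
print(score_rhyme("sing", "sing"))
```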

Learning to Model Editing Processes

Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in…

arxiv.org

Most editing datasets cover one step, single-issue, or generally ‘atomic’ edits. The paper uses more complex editing steps from Wikipedia and GitHub, with transformers processing the whole process. Though a seq2seq model has good scores on the task, the paper credits that to copying, and presents their own models.

Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning

No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark…

arxiv.org

Facebook/Meta paper on no-press Diplomacy, a strategy game which mixes cooperation and competition. They develop an algorithm and then a reinforcement learning strategy which performs well.

NovGrid: A Flexible Grid World for Evaluating Agent Response to Novelty

A robust body of reinforcement learning techniques have been developed to solve complex sequential decision making…

arxiv.org

Reinforcement learning game from Georgia Tech, based on OpenAI Gym MiniGrid. GitHub link.

Pipelines for Social Bias Testing of Large Language Models

Debora Nozza, Federico Bianchi, Dirk Hovy. Proceedings of BigScience Episode #5 — Workshop on Challenges &…

aclanthology.org

Classifies existing social bias metrics for language models, and whether a CI/CD system can be applied. Interesting discussion of a badging system.
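To picture what a CI/CD check could look like, here is a minimal sketch of a gate that flags a build when a bias metric crosses its threshold. The metric names, values, and thresholds are all hypothetical and not taken from the paper.

```python
def bias_gate(metric_values, thresholds):
    """Return the names of metrics whose value exceeds the allowed threshold.
    Metrics without a configured threshold are never flagged."""
    return sorted(
        name
        for name, value in metric_values.items()
        if value > thresholds.get(name, float("inf"))
    )

# Hypothetical metric results and limits for one model build.
failures = bias_gate(
    {"toxicity_gap": 0.12, "stereotype_gap": 0.03},
    {"toxicity_gap": 0.10, "stereotype_gap": 0.05},
)
if failures:
    print("bias checks failed:", ", ".join(failures))
```

In a real pipeline the failing list would turn into a non-zero exit code or a failed test, and a passing run could feed the kind of badge the paper discusses.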

Prompt-to-Prompt Image Editing with Cross Attention Control

Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities…

arxiv.org

ML Twitter claims that this paper is a bit of a sleeper hit after the code was released. I’ll try to check this out. Basically right now in DALL-E or Stable Diffusion I’m expected to place items in an image by erasing a section and then requesting inpainting. This makes some cool insertions into my photos of Chicago / BetterStreetsAI. But the AI has to be creative, so it’s labor-intensive, and the infilled area may still look out of place. The prompt-to-prompt method is able to connect an unedited/unannotated image with your starter prompt, and then apply changes to your prompt (e.g. a RED car) to the image in a less jarring way.

Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer

Reading comprehension is a well studied task, with huge training datasets in English. This work focuses on building…

arxiv.org

The authors translated SQuAD datasets into Czech, though I found it tricky to obtain these when I looked earlier this year. The paper compares results on BERT and XLM/RoBERTa models, including when they are trained on English QA data and evaluated on Czech.

Speech-to-speech translation for a real-world unwritten language — Meta Research


research.facebook.com

Facebook/Meta keeps churning out these textless NLP projects. Here they have a Taiwanese Hokkien-English translation, which is ostensibly set up because of a lack of one written standard, but more likely a test case for low-resource language NLP where communication is mostly audio.
I guess when people are using Chinese characters and a few romanizations for Hokkien, it may be difficult for Facebook to know what Hokkien data it already has (compared to Korean Unicode characters = Korean language messages).
But if they have a Hokkien text-to-speech program, and can train a model on synthetic speech with realistic results, I would say existing Hokkien content is ahead of many low-resource languages.

State-of-the-art generalisation research in NLP: a taxonomy and review

The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what `good…

arxiv.org

Big paper but sort of underwhelming for me, personally? This covers several tasks which people would consider generalization. Is it trying to point out how broad the term is (cross-lingual, data shift, robustness)?

The CLEAR Benchmark: Continual LEArning on Real-World Imagery

Continual learning (CL) is widely regarded as crucial challenge for lifelong AI. However, existing CL benchmarks, e.g…

arxiv.org

Super-cool paper which is simulating computer vision continual learning by plotting out a Yahoo Images dataset from 2004 to 2014.

The People’s Ledger: How to Democratize Money and Finance the Economy

71 Pages Posted: 21 Oct 2020 Last revised: 15 Oct 2021 Date Written: October 20, 2020 The COVID-19 crisis underscored…

papers.ssrn.com

This paper came to public attention when the author, Saule T. Omarova, was nominated by Biden to lead the Office of the Comptroller of the Currency, and ultimately withdrew. This is a rather long paper about an American view of a central bank digital currency (CBDC), which terrified critics because most of the research and actual adoption happens in authoritarian countries. In The Bitcoin Standard, a libertarian take is that Bitcoin will be used for large transactions and settlement between large institutions (i.e. acknowledging fees as a barrier to everyday transactions, but solving a perceived problem with banking with fiat currencies).
This is way over my head, but the concept is gaming out what follows if the Federal Reserve held all bank deposit accounts.

On browsing this, I can see why the nomination would be contentious, because unfortunately they can’t have real conversations in nomination hearings, whether or not this was only a thought experiment.

Written by Nick Doiron
