ML Arxiv Haul #10

Nick Doiron
10 min read · Oct 22, 2022


As I prepared for DEF CON AI Village and when I joined the ‘Big Code Project’ Slack, I saw a couple of papers about code model security.
This summer 2022 paper from OpenAI talks about their Codex model (which also powers GitHub Copilot).
‘Hazard Analysis’ is a term of art in safety-critical industries which is being applied here. The authors note that none of the major languages learned by Codex support formal verification, so it’s difficult to consider all possible hazards and detect them in a large group of generated programs.
The team highlights a particular issue somewhat hidden in Copilot through good prompting:

one consequential word is often the difference between Codex producing correct or incorrect results

The paper is open about not addressing all risks, but it does include a very broad spread of them, for example:

• Synthesis features are used to generate code for application with environmental impacts, exacerbating environmental hazards

The paper does not cover code explanation models.
In the longer term, the authors recommend removing deepfake and other harmful projects from training, using a fine-tuning technique like PALMS on a small set of moral rules or principles, detecting adversarial user behavior, and avoiding completion when given malicious requests.

This code-gen + security paper is from August 2021, and some authors spoke on the topic at Black Hat 2022. The experimental process uses Copilot to generate code in scenarios which touch upon the top 25 vulnerabilities (CWEs). They then use CodeQL (new to me) to scan the code and see which vulnerabilities got generated.
I’ll have to check my Copilot Labs / Nightly plugins, but the authors have more access than I’ve had to confidence scores and parameters (top_p is particularly interesting, hinting at the sampling / decoder method used).
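For anyone who hasn't met top_p before, here is a rough sketch of nucleus (top-p) sampling in general. It is not Copilot's actual decoder, which the paper doesn't expose; the token names and probabilities are made up for illustration.

```python
import random

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}

def sample_top_p(probs, top_p, rng=random):
    """Sample one token from the truncated, renormalized distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution (not taken from the paper)
next_token_probs = {"execute": 0.5, "format": 0.3, "eval": 0.15, "exec": 0.05}
nucleus = top_p_filter(next_token_probs, top_p=0.75)
token = sample_top_p(next_token_probs, top_p=0.75)
```

A lower top_p trims the unlikely tail, so the completions get more conservative; a value near 1.0 samples from almost the full distribution.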

This is a rather fun use of something I’ve struggled to define: header comments (in this case __author__ ) stereotype the user’s code style.

The authors make plausible changes to the comment text, spaces vs. tabs, etc. and find more shifts in vulnerability.
In the conclusions, the authors have some apparent regrets about the experiment setup. Some of the highest risk CWEs involve memory safety or other issues which involve low-level C and Verilog, which Copilot/Codex is not as prepared for, and CodeQL was not able to detect the vulnerabilities. So if I were revisiting this for Big Code Project I would probably pick and choose the CWEs or security bugs.
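The prompt perturbations mentioned above (swapping the __author__ line, spaces vs. tabs) are easy to picture as small string transforms. This is only my sketch of the idea, not the authors' harness; the function names, the example prompt, and the author names are all hypothetical.

```python
def with_author(prompt, author):
    """Prepend an __author__ line so the prompt looks like that person's file."""
    return f'__author__ = "{author}"\n' + prompt

def tabs_for_spaces(prompt, width=4):
    """Convert leading runs of spaces into tabs, leaving the code itself alone."""
    lines = []
    for line in prompt.split("\n"):
        stripped = line.lstrip(" ")
        n_spaces = len(line) - len(stripped)
        indent = "\t" * (n_spaces // width) + " " * (n_spaces % width)
        lines.append(indent + stripped)
    return "\n".join(lines)

# Hypothetical base prompt for a completion model
base_prompt = "def get_user(db, name):\n    # look up the user by name\n"

variants = {
    "baseline": base_prompt,
    "author_a": with_author(base_prompt, "Author A"),
    "author_b": with_author(base_prompt, "Author B"),
    "tabs": tabs_for_spaces(base_prompt),
}
```

Each variant would then be sent to the model and the completions scanned, so any shift in vulnerability rate can be pinned on one small change.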

Good survey of what’s going on in this wtf space of image prompt engineering. I have tended to refer to The-DALL·E-2-prompt-book but they recommend something new to me, Traveler’s Guide to the Latent Space.

Bias, Consistency, and Partisanship in U.S. Asylum Cases: A Machine Learning Analysis of Extraneous Factors in Immigration Court Decisions

Finally a paper where ML is helping analyze biases instead of creating new ones! Here the authors at UC Berkeley have a model which can predict 58% of outcomes based on judge assignments and a partisanship score which includes public opinion and elected officials at the time of the decision. I find this partisanship stuff interesting because few new immigration laws were enacted from Obama to Trump to Biden, but this New Yorker article from 2018 covers how the outcome of a traffic stop (getting a ticket, being detained, rapidly deported, or released with court dates) is up to the officers on the scene. So the attitude or dictums of an administration can swiftly change what’s happening on the ground, and apparently in the courts.

This reminds me of a thought experiment where a computer assigns people to a fair judge or to a judge who never grants asylum. It would be difficult to call the computer an ‘unethical AI’ because even a random assignment has unfair outcomes. An AGI or human assigner would presumably not be allowed to assign everyone to the fair judge. The only ethical output is for the programmer, program, and humans in the system to resist moving more people into that system.

It also reminds me of bananas research that judges in Louisiana give out harsher sentences following LSU losing games.

Google Research and Stanford team up to improve scores on BIG-Bench tasks. They use the famous ‘Let’s think step by step.’ prompt, and select only tasks which seem solvable by English language models (e.g. not ASCII art, or my multilingual wiki task). My ‘Disambiguation QA’ task made the cut!
Two GPT models (including InstructGPT) show improvement with this technique.
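As a reminder of how small this trick is, here is a minimal sketch of adding the ‘Let’s think step by step.’ trigger to a prompt. The Q:/A: layout and the example question are my assumptions, not the exact format used in the paper, and no model is called here.

```python
COT_TRIGGER = "Let's think step by step."

def build_prompt(question, use_cot=True):
    """Format a question, optionally ending with the step-by-step trigger."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        prompt += " " + COT_TRIGGER
    return prompt

# Hypothetical question, only to show the two prompt shapes
question = "If I have 3 apples and eat one, how many are left?"
plain_prompt = build_prompt(question, use_cot=False)
cot_prompt = build_prompt(question)
```

The only difference between the two prompts is the trailing sentence, which nudges the model to write out its reasoning before giving an answer.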

I watched a video on diffusion models which referenced this paper. Essentially the latest generation of diffusion model, building up an image in steps out of random noise, has typically been assisted by an existing classifier for that image (the cited paper is from May 2021). I’m not really qualified to discuss their alternative method, but they explain it as the classifier being replaced by “[mixing] the score estimates of a conditional diffusion model and a jointly trained unconditional diffusion model”.
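To make the “mixing” a bit more concrete for myself, here is a toy sketch. I’m assuming the commonly cited weighting, (1 + w) * conditional - w * unconditional, with a guidance weight w; the vectors are made-up numbers standing in for the two models’ score (noise) estimates at one denoising step.

```python
def guided_score(cond, uncond, w):
    """Mix conditional and unconditional score estimates.

    w = 0 returns the conditional estimate unchanged; larger w pushes the
    result further away from the unconditional estimate.
    """
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]

# Made-up estimates for a two-dimensional toy "image"
cond_estimate = [1.0, 2.0]
uncond_estimate = [0.5, 1.0]

no_guidance = guided_score(cond_estimate, uncond_estimate, w=0.0)
with_guidance = guided_score(cond_estimate, uncond_estimate, w=1.0)
```

The point is that no separate classifier appears anywhere: the push toward the prompt comes from the difference between the two estimates.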

HuggingFace wrote a paper about their platform for ML models and datasets (originally NLP and text data, but expanding into other media for some time now). The paper introduces the evaluate library, and a hosted evaluation-as-a-service (auto evaluator). This leads to a goal of allowing users to test any hosted model x dataset [if it’s reasonably compatible].

I’m hopeful that this work and HF leaderboards can surface models which score higher on benchmarks such as HornMT (translation for languages in Ethiopia, Eritrea, and Somalia) which don’t trend on ML Twitter. Previously people have been writing a one-off paper and repo, setting up a GitHub Pages site, or relying on more advanced platforms for ML competitions.

The authors also speak to some goals in shifting how metrics are used in AI/ML/NLP. They urge users to consider more markers of social bias, the difficulty of evaluating with more expensive or unreproducible methods, and keeping up with the trends in metrics (here they mention SacreBLEU over BLEU, which is too inside baseball for me, but I had heard before that BLEU is not the greatest).

Two Japanese researchers edit a model’s internal architecture toward a de-biased version using ‘movement pruning’.
The metrics for bias are new to me (SEAT = Sentence Encoder Association Test, SS = StereoSet Stereotype Score).
Some anachronistic parts of the paper for 2022 — they use a standard BERT model and measure the general accuracy of their model using GLUE.

This paper is on OpenReview for the ICLR conference; I don’t know if we’re supposed to speculate who created it?
On a set of tasks, researchers inserted zero or a few examples into unprompted (greedy decoder), human-prompted, or AI-prompted model runs on InstructGPT. The AI did well, so there’s some debate about what it means for human ‘prompt engineering’ research. This is complicated by the authors testing instructions inside of what I might call a meta-prompt? (“I instructed my friend to <INSERT> The friend read the instruction and wrote an output for every one of the inputs.”)
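Here is how I picture an instruction being dropped into that meta-prompt. The template text is the one quoted above; the Input:/Output: layout, the candidate instruction, and the example inputs are my own assumptions for illustration.

```python
META_PROMPT = (
    "I instructed my friend to <INSERT> The friend read the instruction "
    "and wrote an output for every one of the inputs."
)

def fill_meta_prompt(instruction, inputs):
    """Insert a candidate instruction and list the inputs to be answered."""
    header = META_PROMPT.replace("<INSERT>", instruction.strip())
    body = "\n".join(f"Input: {x}\nOutput:" for x in inputs)
    return header + "\n" + body

# Hypothetical candidate instruction and task inputs
candidate = "write the antonym of each word."
prompt = fill_meta_prompt(candidate, ["hot", "tall"])
```

Swapping in a human-written or AI-written candidate is then a one-line change, which is roughly what makes the comparison possible.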

The table shows the best AI prompts and there was only one which came out super-weird and even wrote its own one-shot example.

I’d like to see the #2 or #3 prompts because the famous “Let’s think step by step” prompt paper glossed over that its runner-up was “First, (*1)”.
Also discussed: how the rhyme task is poorly designed (allows response with same word).

Most editing datasets cover one-step, single-issue, or generally ‘atomic’ edits. The paper uses more complex editing steps from Wikipedia and GitHub, with transformers handling the whole editing process. Though a seq2seq model has good scores on the task, the paper credits that to copying, and presents their own models.

Facebook/Meta paper on Diplomacy, a cooperative game. They develop an algorithm and then a reinforcement learning strategy which performs well.

Reinforcement learning game from Georgia Tech, based on OpenAI Gym MiniGrid. GitHub link.

Classifies existing social bias metrics for language models, and whether a CI/CD system can be applied. Interesting discussion of a badging system.

ML Twitter claims that this paper is a bit of a sleeper hit after the code was released. I’ll try to check this out. Basically right now in DALL-E or Stable Diffusion I’m expected to place items in an image by erasing a section and then requesting inpainting. This makes some cool insertions into my photos of Chicago / BetterStreetsAI. But the AI has to be creative, so it’s labor-intensive and the infilled area may still look out of place. The prompt-to-prompt method is able to connect an unedited/unannotated image with your starter prompt, and then apply changes to your prompt (e.g. a RED car) to the image in a less jarring change.

Translated SQuAD datasets into Czech, though I found it tricky to obtain these when I looked earlier this year. Compares results on BERT and XLM/RoBERTa models, even when trained on English QA data and evaluated on Czech.

Facebook/Meta keeps churning out these textless NLP projects. Here they have a Taiwanese Hokkien-English translation, which is ostensibly set up because of a lack of one written standard, but more likely a test case for low-resource language NLP where communication is mostly audio.
I guess when people are using Chinese characters and a few romanizations for Hokkien, it may be difficult for Facebook to know what Hokkien data it already has (compared to Korean Unicode characters = Korean language messages).
But if they have a Hokkien text-to-speech program, and can train a model on synthetic speech with realistic results, I would say existing Hokkien content is ahead of many low-resource languages.

Big paper but sort of underwhelming for me, personally? This covers several tasks which people would consider generalization. Is it trying to point out how broad the term is (cross-lingual, data shift, robustness)?

Super-cool paper which is simulating computer vision continual learning by plotting out a Yahoo Images dataset from 2004 to 2014.

This paper came to public attention when the author, Saule T. Omarova, was nominated by Biden to lead the Office of the Comptroller of the Currency, and ultimately withdrew. This is a rather long paper about an American view of a central bank digital currency (CBDC), which terrified critics because most of the research and actual adoption happens in authoritarian countries. In The Bitcoin Standard, a libertarian take is that Bitcoin will be used for large transactions and settlement between large institutions (i.e. acknowledging fees as a barrier to everyday transactions, but solving a perceived problem with banking with fiat currencies).
This is way over my head, but the concept is gaming out what follows if the Federal Reserve held all bank deposit accounts.

On browsing this, I can see why the nomination would be contentious, because unfortunately they can’t have real conversations in nomination hearings, whether or not this was only a thought experiment.
