Update on ML side projects
Apr 30, 2022
This is a relatively exciting week in the ML industry. Google/DeepMind released Flamingo, a massive pre-trained visual language model; Anthropic raised over half a billion dollars in its Series B; and a lot of cool papers came out of ICLR workshops (I'm particularly interested in this paper about generating puzzles to teach better code-generation models).
I’m left with this weird anxiety about continuing my ML side projects, which I would break down into:
- NLP in 2022 is dominated by a few labs with massive resources, to the point that a lot of research means pecking at an API, a paywall, or a limited beta (GPT-3, Codex, DALL-E 2), or at models that aren't accessible at all (PaLM). This also raises the bar for results to look current, interesting, and professional.
- My early work in this space isn't high quality, which makes sense, but it makes it difficult to say 'let's open up that old notebook and start hacking'. My newer projects haven't reached the point where I'm collaborating on a paper, and DIY-ing an ML paper is also intimidating.
- I work on an ML platform in my day job, mostly on Go/Python backend code, so I already spend a lot of time and effort on 'ML', but much less of it on models and analysis.
Some motivating thoughts:
- There is a cycle or treadmill or mechanism that is rapidly escalating what models can do on English-language tasks, Python/JS coding, and some multilingual tasks. Even though only a few labs publish the models, we're seeing these developments play out publicly.
The risk here is that AI is on an exponential growth curve while AI evaluation / monitoring / auditing continues to be a laborious process. I depend on and respect that rigorous academic / legal space, but damn it would be nice to throw a battery of adversarial, continually-morphing tests at models as they come off the line.
- We need better visualizations of text-generation models. One of my weirder code-generation results shows that changing the name and license at the top of a code file changes details (such as food emojis or city names) elsewhere in the file. It doesn't produce a 'wrong' answer, but it hints that ambiguous instructions float many options into the probability space, and the one which prints out of the model is determined by your decoder and some sketchy butterfly-effect stuff.
Generation models also aren't so great at probability at the very beginning of a piece of code or text, which makes it easy to miss surprisal/saliency signals there (there's a rough sketch of the per-token view I want after this list).
- Semantic search (searching by document vector similarity) has been getting a lot of attention and promises to be very cool. I would like to try it on /r/AskNYC (sketch below). I do run into issues with fuzzy search, like when I search 'chimichanga' and Google Maps returns every Mexican restaurant.
- On the other end, explainability and algorithmic recourse (i.e. to change the output, you should change this input) are underrated / under-explored (toy sketch below). Governments have spent millions on dowsing rods and will happily adopt random-noise AI/ML unless there is some receipt or recourse to inspect why a model works.
- If you talk with tech-for-good groups, they have NLP tasks which are still difficult to set up in non-English languages (text simplification, reverse curriculum, and open-domain QA). Even when a language needs more data and training, it should be easier to plug the libraries and datasets together for a demo (sketch below).
- I’m participating in the Probabilistic AI class in Finland in June, so I’m eager to learn more about that way of thinking.
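Below are a few rough sketches of what I mean in the bullets above. First, the per-token visualization: score the same code body under two different file headers and compare per-token surprisal. This is a minimal sketch with GPT-2 standing in for a real code model; the headers and snippet are invented.

```python
# Minimal sketch: per-token surprisal of the same code body under two headers.
# GPT-2 stands in for a real code model; everything here is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisal(text):
    """Return (token, surprisal in bits) for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log-probability of each actual next token, given everything before it
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    next_ids = ids[0, 1:]
    token_lp = log_probs[torch.arange(next_ids.shape[0]), next_ids]
    bits = (-token_lp / torch.log(torch.tensor(2.0))).tolist()
    return list(zip(tokenizer.convert_ids_to_tokens(next_ids), bits))

body = '\nfavorite_snack = "🌮"\ncity = "New York"\n'
for header in ["# Copyright Alice, MIT License", "# Copyright Bob, GPL-3.0"]:
    print(header)
    for tok, b in token_surprisal(header + body):
        print(f"  {tok!r:>14} {b:5.1f} bits")
```

The high-surprisal tokens in the body are exactly the details I'd expect to wobble when the header changes.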
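Second, the /r/AskNYC semantic search idea: embed the posts, embed the query, rank by cosine similarity. The posts here are made up, and all-MiniLM-L6-v2 is just a small general-purpose sentence encoder.

```python
# Rough sketch: semantic search over /r/AskNYC-style questions with
# sentence-transformers. The example posts are invented.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

posts = [
    "Where can I get a good chimichanga in Manhattan?",
    "Best rooftop bars in Brooklyn?",
    "Is the subway safe late at night?",
    "Cheap Mexican food near Washington Square Park?",
]
post_embeddings = model.encode(posts, convert_to_tensor=True)

query_embedding = model.encode("chimichanga", convert_to_tensor=True)

# Cosine similarity between the query and every post, highest first.
scores = util.cos_sim(query_embedding, post_embeddings)[0]
for score, post in sorted(zip(scores.tolist(), posts), reverse=True):
    print(f"{score:.3f}  {post}")
```

The nice and annoying part is the same fuzziness I complained about with Google Maps: the cheap-Mexican-food post will score nearly as high as the literal chimichanga post.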
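Third, a toy version of algorithmic recourse: given a linear model and a rejected applicant, search for the smallest single-feature change that flips the decision. The dataset and feature names are invented for illustration.

```python
# Toy sketch of algorithmic recourse on an invented two-feature dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# features: [income (k$), debt (k$)]
X = rng.normal(loc=[50, 20], scale=[15, 10], size=(500, 2))
y = (X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=5, size=500) > 10).astype(int)
clf = LogisticRegression(max_iter=1000).fit(X, y)

applicant = np.array([[35.0, 25.0]])
print("decision:", clf.predict(applicant)[0])

# Brute force: for each feature, find the smallest nudge that flips the
# decision, then report the cheapest option overall.
options = []
for i, name in enumerate(["income", "debt"]):
    for delta in np.arange(0.5, 100, 0.5):
        flipped = False
        for sign in (+1, -1):
            candidate = applicant.copy()
            candidate[0, i] += sign * delta
            if clf.predict(candidate)[0] == 1:
                options.append((delta, name, sign * delta))
                flipped = True
                break
        if flipped:
            break

if options:
    _, name, change = min(options)
    print(f"recourse: change {name} by {change:+.1f}")
```

Real recourse methods care about which features are actually actionable and how costly each change is, but even this toy version produces the kind of 'receipt' you can argue with.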
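Finally, the plug-it-together demo for a non-English task. The Hugging Face pipeline API is real; the specific multilingual checkpoint name is my assumption, so swap in whatever is current.

```python
# Sketch of a quick non-English QA demo with a Hugging Face pipeline.
# The model name is an assumption; substitute a current multilingual checkpoint.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/xlm-roberta-large-squad2")

# Finnish: "Helsinki is the capital of Finland. About 650,000 people live in the city."
context = "Helsinki on Suomen pääkaupunki. Kaupungissa asuu noin 650 000 ihmistä."
# Finnish: "What is the capital of Finland?"
result = qa(question="Mikä on Suomen pääkaupunki?", context=context)
print(result["answer"], result["score"])
```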