Looking at the future of BIG-Bench
As we reach the end of 2022, almost 6 months after BIG-Bench was posted on arXiv and 16 months after my contributions were merged, we’re seeing it used in research outside of Google and DeepMind. There are also some emerging hot takes! I thought it would be appropriate to review these and add my own retrospective thoughts.
Hot Takes Review
Professor Davis reviews the 48 tasks which were specifically tagged as common sense, and awards only 1/4 of them a ‘high quality’ label.
The median task has 232 examples, so the capacity to train/test is rather small (unless you are a zero-shot or few-shot model).
There is an ASCII maze game with poor instructions (I knew there was an ASCII art task, but this one is also annoying).
Choosing one correct answer for each prompt forces judgment calls, and people are likely to disagree. On the anachronism task:
the sentence “William the Conqueror enjoyed plenty of chile peppers to flavor his meals” is marked as “not an anachronism”
A reviewer brought this up, and the database author responded that, if the two entities existed at the same time, he wasn’t counting it as an anachronism.
And that reviewer was me! I appreciate that Davis looked through the examples and reviews closely to pick up on this.
My own task, disambiguation_qa, is labeled ‘flawed’ (not the worst rating) and derivative. He points out an example with an easy answer, and mentions others solvable by number agreement.
My work on this was certainly derivative. I made ‘they’ ambiguously singular or plural, and added some unambiguous examples and possessive-pronoun sentences for variety in structure and responses.
I included a common sense tag only for situations where the ambiguity is resolved by roles, for example: The worker and the pedestrian talked while he was repairing the sidewalk.
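To make that concrete, here is roughly how the role-resolved example could look as a task item. BIG-Bench tasks are JSON files of inputs with scored targets, but the field names and answer options below are reconstructed from memory and abbreviated, so treat this as a sketch rather than the task’s actual file:

```python
import json

# Sketch of one ambiguous-pronoun example in (approximate) BIG-Bench
# JSON task format; field names and answer choices are illustrative.
example = {
    "input": "The worker and the pedestrian talked while he was repairing "
             "the sidewalk. Who was repairing the sidewalk?",
    "target_scores": {
        "The worker": 1,      # role knowledge resolves the pronoun
        "The pedestrian": 0,
        "Ambiguous": 0,
    },
}

task = {"name": "disambiguation_qa", "examples": [example]}
print(json.dumps(task, indent=2))
```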
Goodman is saying that the BIG-Bench tasks are actually easy for a human. The PaLM paper showed 5-shot PaLM outperforming the average human score of around 50%, but, somewhat close to Goodman’s bet, the top human scored around 90% on a similar set of tasks:
I wonder if this is a crowd-worker issue, a cultural issue, timing, etc. Galactica was trained only on scientific papers, but unexpectedly outperformed the OPT and BLOOM models on these general-purpose tasks.
Stanford released the HELM benchmark and ran it on LLMs from many different companies. First, I’d like to credit them for using their role and access to compare models from multiple companies and under different restrictions.
But the rejection of BIG-Bench is unclear at best:
Previous language model benchmarks (e.g. SuperGLUE, EleutherAI LM Evaluation Harness, BIG-Bench) are collections of datasets, each with a standard task framing and canonical metric, usually accuracy... In comparison, in HELM we take a top-down approach of first explicitly stating what we want to evaluate (i.e. scenarios and metrics) by working through their underlying structure. Given this stated taxonomy, we make deliberate decisions on what subset we implement and evaluate, which makes explicit what we miss (e.g. coverage of languages beyond English).
The benchmarks should coexist, and possibly BIG-Bench is even part of HELM. But the implication that Stanford produced this benchmark from some higher vantage point? I wish that they had added something more, but I’ll comment on that at the end.
For my own take, I’ll start with petty details first. How do you cite a paper with 444 coauthors?
Shortly before the arXiv preprint, the PaLM paper used (BIG-bench collaboration, 2021) in text. More recently, especially in Google-affiliated papers, the go-to format is (Srivastava et al., 2022), and the References section reads:
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
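For anyone citing it themselves, that reference corresponds to a BibTeX entry along these lines (the key and the “and others” truncation are my own choices, not an official entry):

```bibtex
@article{srivastava2022beyond,
  title   = {Beyond the imitation game: Quantifying and extrapolating
             the capabilities of language models},
  author  = {Srivastava, Aarohi and Rastogi, Abhinav and Rao, Abhishek
             and others},
  journal = {arXiv preprint arXiv:2206.04615},
  year    = {2022}
}
```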
But there are a growing number of papers which include everyone’s names! Even Stanford’s HELM and Facebook’s Galactica paper do this.
I have been getting Google Scholar notifications for these, though four have ‘corrected’ my name to ‘Nicholas Doiron’ within the citation (??).
Was BIG-Bench accessible?
Coauthorship through a pull request was a route for people outside of ML academia to participate in research.
I did an unscientific count of email domains and affiliations for the co-authors. The most common were Google, Universiteit van Amsterdam, and Stanford. There was a wide spread of major research universities and company labs, and a few people with Gmail accounts, so I can’t trace them.
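That ‘unscientific count’ amounts to tallying the domain part of each address; a minimal sketch, with an invented author list standing in for the real one:

```python
from collections import Counter

# Invented addresses for illustration; the real tally used the
# co-author emails/affiliations from the paper.
emails = [
    "a@google.com", "b@google.com", "c@uva.nl",
    "d@stanford.edu", "e@gmail.com",
]

def domain_counts(addresses):
    """Tally email domains as a rough proxy for affiliation."""
    return Counter(addr.rsplit("@", 1)[-1].lower() for addr in addresses)

print(domain_counts(emails).most_common(3))
```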
It was nice to see universities as far afield as Iran and Hong Kong, less tech-oriented companies (such as Ford), and a family collaboration.
I wonder how the CFP could’ve been more open or geographically diverse. Though BIG-Bench tasks could be in any language, almost all were in English, and subsequent papers reinforce this when choosing subsets of tasks.
Yes I would like to make the novel thing
In the BIG-Bench process, I got pushback on disambiguation_qa (I’d started with the responses being clarifying questions) and on which_wiki_edit (I’d auto-generated wiki-diff / text matches in languages which I could not read).
This has been true at other times in my ML work: I’ve been frustrated by others not wanting to include my use case, accept a PR for an old model-distillation script, add features, log ambiguous behavior, etc.
Here’s why it matters for BIG-Bench: there was an open call, and over 200 tasks were accepted. If they all fit the typical mold or followed one consistent format, then they’re not new ideas, and they’re not a truly robust test of models. Calls for new tasks should be open to new formats and new ways of asking. Imperfect factual correctness of individual points on the benchmark might mean we shouldn’t seek 100% accuracy, but we can use these benchmarks en masse to compare models’ improvement.
Very 2021 / The Future
When I revisited BIG-Bench in a presentation in spring 2022, the obvious caveat was that the benchmark had become a little dated. Not that the tasks had become easy, as has happened with SuperGLUE, but the emphasis on few-shot performance rings false now that it is no longer the hottest topic.
If you were launching an open call for benchmarks today, you would definitely be considering:
- scripts which generate questions (with or without fixed random seed, with or without internet augmentation)
- labeling which examples to use in few-shot tasks, so different runs can cover more characteristic or difficult options
- all things prompting — chain-of-thought, automated prompt selection, which tasks benefit or suffer from prompting, instruction-tuned models
- generation — it’s debatable how you would evaluate text generation and/or decoding. For image/text multimodal generation (DALL-E, Stable Diffusion) this is the new hotness, but it would be even more difficult to evaluate.
- awareness of the many-minded aspect of the corpus: developing a list of common misconceptions (the capital of Nevada), meaningful word choice (Jerusalem or Al-Quds), ways to correct or query humans
- RealtimeQA responding to new information (maybe leveraging this to edit Wikipedia / WikiData / an internal knowledge graph)
- How come we never see amazing new jokes, baffling tongue twisters, etc. rolling out of these language models? I think that the New Yorker cartoon contest idea is on the right track.
- Citing the relevant cases for upcoming Supreme Court decisions x specific justices.
- tools to patrol Wikipedia, OpenStreetMap, Mastodon, etc.
- going into the global university lecture halls, NeurIPS sessions, Wikimedia lists, etc. to fund 200 benchmarks in languages where current NLP is not functional.
- models which can imitate different reading levels or difficulty levels, e.g. predicting the performance of a 3rd grader vs. a 6th grader, or generating coding-interview questions
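The first idea in the list — scripts which generate questions, with or without a fixed random seed — could be as simple as this sketch (the task format and field names are hypothetical):

```python
import random

def make_arithmetic_examples(n, seed=None):
    """Generate n two-operand addition questions. A fixed seed makes the
    benchmark reproducible; omitting it makes memorization harder."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        examples.append({"input": f"What is {a} + {b}?", "target": str(a + b)})
    return examples

# The same seed reproduces the same benchmark from run to run.
assert make_arithmetic_examples(3, seed=42) == make_arithmetic_examples(3, seed=42)
```

Without a seed, each evaluation run gets fresh questions, trading reproducibility for resistance to training-set contamination.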