Early this year, I uploaded a seq2seq model to generate gender counterfactuals in Spanish (el profesor viejo <-> la profesora vieja). Since then, I debugged some issues, made a general-purpose library, and created an initial seq2seq model for Arabic.
The goal is to see whether passing data through this process creates more generalized training data (data augmentation) that improves accuracy and fairness.
This post and notebook were updated in July 2021 with a new, fixed random_state and a more accurate version of SimpleTransformers.
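As a rough illustration of the augmentation idea, here is a minimal sketch in which a hand-written lookup table stands in for the seq2seq model. The table, the function names `counterfactual` and `augment`, and the label value are all hypothetical, made up for this example; the real model is learned and handles far more than three words.

```python
# Toy stand-in for the seq2seq model: a hand-written lookup table.
# This is an assumption for illustration only; the real model is learned.
SWAPS = {
    "el": "la", "la": "el",
    "profesor": "profesora", "profesora": "profesor",
    "viejo": "vieja", "vieja": "viejo",
}


def counterfactual(sentence: str) -> str:
    """Swap each known gendered token; leave unknown tokens unchanged."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split())


def augment(examples):
    """Return the original (text, label) pairs plus their counterfactuals."""
    augmented = list(examples)
    for text, label in examples:
        flipped = counterfactual(text)
        if flipped != text:  # skip sentences with nothing to swap
            augmented.append((flipped, label))
    return augmented


# Hypothetical training example: the label 1 is arbitrary.
train = [("el profesor viejo", 1)]
augmented_train = augment(train)
```

The counterfactual keeps the original label, so a classifier trained on `augmented_train` sees both gendered forms attached to the same target.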
Scott is better known for his book Seeing Like a State (which I have not read), but his writing is familiar to me through The Art of Not Being Governed (a Southeast Asian history). Against the Grain goes back further in history to discuss Neolithic peoples’ development of the first farms, domesticated plants and animals, and walled cities. This took place before written history, so we also hear about research methods and the biases toward what is preserved in the archaeological record. …
A Thai BERT model that I’d adapted recently appeared in a paper on cross-language learning in biomedicine. This same model is in the spaCy docs. People use it on GitHub and Kaggle. As in most NLP models, I had used a tokenizer which splits text into a stream of words and sub-word prefixes and suffixes, with words separated by spaces before being broken into prefixes and suffixes. But this assumption falls short in many global languages. Thai, for example, does not put spaces between [most] words.
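To make the whitespace assumption concrete, here is a small sketch using only plain Python. The function name `whitespace_pretokenize` and the two sample sentences are my own for illustration; this is not the tokenizer from the model, just the space-splitting step that many tokenizers run before breaking words into sub-word pieces.

```python
def whitespace_pretokenize(text: str) -> list:
    """Split on whitespace, the step many tokenizers run before sub-word splitting."""
    return text.split()


english = "I like to eat rice"
# The same sentence in Thai, written as usual with no spaces between words
# (the words are ฉัน / ชอบ / กิน / ข้าว).
thai = "ฉันชอบกินข้าว"

english_chunks = whitespace_pretokenize(english)
thai_chunks = whitespace_pretokenize(thai)

print(len(english_chunks))  # five separate words
print(len(thai_chunks))     # one undivided chunk
```

The English sentence arrives at the sub-word stage as five words, while the Thai sentence arrives as a single long string, so the sub-word vocabulary has to do all of the segmentation work on its own.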
WangchanBERTa, released by the AI Research Institute of Thailand in 2021, is a better…
This is a shorter (190 pages) book challenging the international development community to invest in agricultural technology and GMOs for Africa.
From the title I was imagining a picture of a GMO-friendly future. Instead much of the book was about US consumer awareness of GMOs, or tracking spending by USAID, NGOs, and African governments on conventional agricultural research. These figures were highly relevant and depressing. …
I originally planned this as a separate website or video series, but it has stalled over the past few months. I’ve decided to post it with only a few edits.
What happens to the experiments which don’t show improvement over our previous baseline? In the data science / machine learning community, we hear that negative results can bring balance and objectivity. Yet there is still a publication bias, and there are few sources of good negative-results content. Here I’ve selected three papers which are recognized as standout examples of negative results, with added commentary or definitions.
The WinoWhy dataset — presented at ACL 2020 by Hongming Zhang, Xinran Zhao, and Yangqiu Song — offers human explanations for ambiguous sentences.
Bill passed the half-empty plate to John because he was full.
We understand that the ‘he’ refers to Bill. Crowdsourced explanations include:
Bill was full, so he gave the rest of his food to John
Bill is full and couldn't eat more
The purpose of the dataset is to fine-tune better explanatory models. But for each of these prompts, an alternate anti-explanation must exist in the model’s probability-space (e.g. John was full, so he needed only half).
When I recently chatted with members of a fiction-generation AI startup, one of my pre-written questions was whether a model could be trained for a specific non-fiction location. This is based on my idea for a model trained on the AskNYC subreddit.
Note: I wrote up my experience, thoughts, and conversations on the days when I received doses of vaccine, and when I reached statistical immunity. Nothing dramatic happened, but I wanted to have a contemporary account to look back on later.
Compared to others in the US, I was lucky to receive my first dose when I did. For other countries where access is still being negotiated and a new wave of infections is starting, the wait is still longer. I know this isn’t a great time to read about vaccinations. Sorry.
For people who read this and feel that your personal…
In these first four months of 2021, a few researchers have used my language models or mentioned me in their papers. I feel encouraged and validated to be part of someone else’s work. Sharing models on HuggingFace made it possible for them to continue that work and extend it into new areas.
I’m confident that the next year of papers from researchers in their native languages will greatly surpass my 2020 work.
M. M. Rahman, M. Aktaruzzaman Pramanik, R. Sadik, M. Roy and P. Chakraborty
Compared Multilingual BERT and a Bangla-Electra model on three news topic classification tasks: BARD, OSBC, and…
I’m working through the books which I’ve been carrying across the country over the past several months. Soon I’m moving to Colorado and getting my second dose of vaccine. Though COVID has not disappeared here or elsewhere in the world, I’m retiring the title “Pandemic Reads”. Hopefully I can make reading and recommending books a more regular part of my life.
What do I think about reviewing 19 books during a pandemic year? I feel like a determined reader could read two or three books in a month and be very thoughtful on GoodReads. So I under-developed this; I watched…
Nomadic web developer and mapmaker.