Updates on: “Gender bias in Spanish BERT”

Nick Doiron
3 min read · Jul 10, 2020


Back in May, I wrote a long-ish post about measuring gender bias in language models, and transferring that analysis to Spanish, a language with grammatical gender (e.g. el actor / la actriz).

Here are three updates relevant to that project:

1. Placing on the LinCE Benchmark

The University of Houston set up LinCE, a set of NLP ‘code-switching’ challenges. The specific challenge I focused on was sentiment analysis of tweets with a mix of Spanish and English words.
The LinCE team’s original ELMo model scored 52.88% accuracy on this challenge, and my work with BETO reached 56.47%, putting it at the top of the leaderboard!

2. Limits of Data Augmentation

In my previous post, I used spaCy and BETO embeddings to ‘flip’ the gender of training sentences and significantly improve the accuracy of a model. The LinCE data’s use of mixed languages and slang made the spaCy sentence modeling unusable, so I made a new script to flip the gender or change the plurals of Spanish words in the training data. This allowed me to create ~40% more training examples.
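A rough sketch of what such a flip script can look like, assuming a hand-built dictionary of gendered pairs (the pairs and rules here are illustrative, not the project’s actual script):

```python
# Sketch of a rule-based gender "flip" for Spanish training sentences.
# FLIP_PAIRS is an illustrative assumption, not the project's real word list.

FLIP_PAIRS = {
    "el": "la", "la": "el",
    "actor": "actriz", "actriz": "actor",
    "niño": "niña", "niña": "niño",
}

def flip_gender(sentence: str) -> str:
    """Swap gendered articles/nouns token by token, preserving capitalization."""
    out = []
    for token in sentence.split():
        lower = token.lower()
        if lower in FLIP_PAIRS:
            flipped = FLIP_PAIRS[lower]
            if token[:1].isupper():
                flipped = flipped.capitalize()
            out.append(flipped)
        else:
            out.append(token)
    return " ".join(out)

def augment(sentences):
    """Keep the originals and add flipped copies that actually changed."""
    flipped = [flip_gender(s) for s in sentences]
    return sentences + [s for s in flipped if s not in sentences]

print(flip_gender("La actriz habló"))  # "El actor habló"
```

A real version also has to respect agreement between articles, nouns, and adjectives (a naive token swap would turn “un premio” into “una premio”), which is what makes this harder than simple dictionary lookups.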

In Colab notebooks, I compared the accuracy of Multilingual BERT (mBERT) and BETO, given the original and the augmented training data. The better option appears to be mBERT. In an earlier test, mBERT produced exactly the same count of wrong classifications whether it trained on the original or the augmented data. In the linked example, the accuracy of BETO falls by 1% with augmented data.
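The comparison comes down to counting wrong classifications per model and dataset; a small sketch of that bookkeeping, where the label lists are toy stand-ins for real model predictions:

```python
# Compare two fine-tuned models by counting wrong classifications.
# The label lists below are toy stand-ins for real mBERT predictions.

def count_errors(predictions, gold):
    """Number of examples a model classified incorrectly."""
    return sum(p != g for p, g in zip(predictions, gold))

gold            = ["pos", "neg", "neu", "pos", "neg"]
mbert_original  = ["pos", "neg", "pos", "pos", "neg"]  # trained on original data
mbert_augmented = ["pos", "pos", "neu", "pos", "neg"]  # trained on augmented data

# Identical error counts would suggest augmentation made no difference:
print(count_errors(mbert_original, gold) == count_errors(mbert_augmented, gold))  # True
```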

In retrospect, I see two main reasons why this technique stopped helping the language models:

  • the bilingual nature of LinCE makes it harder both to predict labels and to create meaningful augmented examples
  • my movie reviews dataset was much smaller, so augmenting the training data had a greater effect

3. The Professional Version

Danielle Saunders and Prof. Bill Byrne (both University of Cambridge, UK) had a paper accepted to this year’s ACL conference which shows how researchers approach the problem. I wish I had read their preprint before trying my own strategy!

“Regarding data, we suggest that a small, trusted gender-balanced set could allow more efficient and effective gender debiasing than a larger, noisier set. To explore this we create a tiny, handcrafted profession-based dataset for transfer learning. For contrast, we also consider fine-tuning on a counterfactual subset of the full dataset and propose a straightforward scheme for artificially gender-balancing parallel text for NMT [neural machine translation].”

The paper is a good read, and covers English translations to Spanish, German, and Hebrew.

Reflections

  • I should invest more time in research-y projects. I don’t use ML in my current job, so I was relieved to see that my ideas were similar to how a researcher would frame and address a problem in NLP.
  • I should practice on Kaggle challenges. The LinCE benchmark is something I saw on Twitter and took an interest in, but it isn’t competitive yet. Without the data augmentation bonus that I’d assumed would help, my submission is essentially an off-the-shelf model.

Updates?

This article is from July 2020. For updates on Spanish gender re-inflection and WEAT / bias from embeddings, see https://github.com/mapmeld/use-this-now/blob/main/README.md#gender-bias
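For reference, the WEAT mentioned above reduces to a statistic over embedding cosine similarities. A minimal sketch, assuming numpy and toy vectors (the effect-size formula follows Caliskan et al.; real runs use word sets like career/family terms and embeddings from the model):

```python
import numpy as np

# Minimal WEAT (Word Embedding Association Test) sketch.
# The vectors passed in are placeholders; in practice they are
# word embeddings taken from the model under test.

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus attribute set B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Difference of mean associations of target sets X and Y,
    divided by the pooled standard deviation."""
    s = [association(w, A, B) for w in X + Y]
    sx, sy = s[:len(X)], s[len(X):]
    return (np.mean(sx) - np.mean(sy)) / np.std(s, ddof=1)
```

A large positive effect size means the X targets sit closer to the A attributes (and Y to B) than chance would suggest.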
