ML Arxiv Haul #6

  • A classifier from 2016 (pre-BERT) is used to label Tweets from previously unlabeled sources as AAE (African-American English) or SAE (Standard American English).
  • One hate speech dataset (Davidson 2017) is labeled as 70% AAE, even though it is a general hate dataset.
    Another hate speech dataset (HateXplain 2021) is labeled as ~10% AAE. It includes content from Twitter and Gab (a right-wing network).
  • I’m concerned about de-duplication of Tweets (the same Tweet could appear in more than one dataset), unless BERT is being fine-tuned and evaluated on each dataset individually.
  • BERT is a small model by 2022 standards.
  • The end goal of ColBERT models is document retrieval, i.e. instead of showing the answer to a search input, or the sentence with the answer to your search input, it’s going to return a full relevant document. This means that you might want to separate the original document into several passages for indexing.
  • ColBERT is its own thing: you would either start pre-training from scratch or use the existing weights for English.
  • The examples help you get set up with existing weights and an existing index, but I wasn’t able to figure out how to start from scratch in another language.
  • Text cues and image-text examples are ‘cherry-picked’.
  • DALLE-2 text includes imagined characters. Consider the image included in my Tweet above: how can the lower lines be transcribed?
  • Vocabulary words (other than Apoploe) are difficult to reproduce.
    Apoploe works because the model is confused and guesses that it is a Latin bird species name.
  • The given words may trigger neurons inside DALLE, but they are random noise. This appeals to linguists who want to talk about what a language even is and how it is not bird+bug. It is still bad news for text filters on a model.
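
On the ColBERT note about separating a document into several passages for indexing: here is a minimal sketch of one way that could look, assuming a simple overlapping word-window split. The helper names split_into_passages and build_passage_index, and the max_words and overlap parameters, are hypothetical placeholders and not part of ColBERT; the window sizes are arbitrary.

```python
def split_into_passages(text, max_words=100, overlap=20):
    """Split text into overlapping word-window passages (hypothetical helper)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the document.
        if start + max_words >= len(words):
            break
    return passages


def build_passage_index(docs, max_words=100, overlap=20):
    """Flatten {doc_id: text} into passages plus a parallel list of doc ids.

    The parallel list lets a hit on a passage be mapped back to its
    full source document.
    """
    passages, passage_to_doc = [], []
    for doc_id, text in docs.items():
        for passage in split_into_passages(text, max_words, overlap):
            passages.append(passage)
            passage_to_doc.append(doc_id)
    return passages, passage_to_doc
```

The passages list is what would be handed to an indexer, and passage_to_doc[i] gives the document to return when passage i is the match.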




Web->ML developer and mapmaker.

Nick Doiron


