Thai NLP — with no spaces

Nick Doiron
Dec 14, 2020

One of the first steps in an NLP pipeline is dividing raw text into words or word-pieces, known as tokens. But what if you don’t have spaces to divide sentences into words?

Background

Sample Thai text from Wikipedia

People do write some spaces in Thai text, as you can see above, but they aren’t as universal as they are in English. There is also no set punctuation to end a Thai sentence. This can cause confusion (or poetry), but humans are good at separating words and sentences from context. The difficult part, then, is getting computers to pick up on that context.

When I first heard about this text-parsing problem in early 2016 at the Asia Foundation, the best solution was to find a rules-based segmentation tool such as LibThai from the Thai Linux Working Group.
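Rules-based segmenters of this kind rely largely on dictionary matching: scan the text and greedily take the longest dictionary word at each position. Here is a minimal sketch of that idea — the tiny dictionary and the sample sentence are my own illustration, not LibThai's actual wordlist or algorithm:

```python
def longest_match_segment(text, dictionary, max_len=20):
    """Greedy longest-match word segmentation.

    At each position, take the longest dictionary entry that matches;
    fall back to emitting a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character: emit it alone
        tokens.append(match)
        i += len(match)
    return tokens

# Toy dictionary: ผม ("I"), กิน ("eat"), ข้าว ("rice")
thai_dict = {"ผม", "กิน", "ข้าว"}
print(longest_match_segment("ผมกินข้าว", thai_dict))
# → ['ผม', 'กิน', 'ข้าว']
```

Real segmenters layer heuristics for unknown words and ambiguity on top of this, which is exactly where the neural approaches below try to do better.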
In summer 2016, Wannaphong Phatthiyaphaibun created the first release of PyThaiNLP, a library inspired by NLTK.
As we enter the deep learning phase of NLP, a new generation of libraries has emerged using neural networks. Pattarawat Chormai et al.’s 2019 paper uses a benchmark to compare AttaCut (built on PyTorch) with PyThaiNLP, DeepCut, and Sertis’s library (the last two use TensorFlow).

After tokenization, models

I found a few examples of recent Thai NLP models. There is a modern word2vec-style model and a fastText release. mBERT covers Thai, but only in its cased version.

I was curious how mBERT tokenizes Thai sentences:

tokenizing a Wikipedia article

I was surprised to see that mBERT dissolves Thai into almost character-level tokens. The exceptions are a handful of common words such as ‘born in’, ‘Thailand’, and ‘is’. These tiny tokens cannot be mapped to meaningful vectors for tasks such as separating positive from negative sentiment, formal from informal register, or sports from government topics. This convinced me that a proper {segmenter → tokenizer → transformer} pipeline could improve Thai NLP, as it has in Arabic.
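Why does mBERT fall back to characters? WordPiece tokenizers split each word greedily into the longest subwords found in their vocabulary, marking word-internal pieces with `##`; when the vocabulary contains little Thai, that greedy search bottoms out at single characters. A self-contained sketch of the algorithm with a toy vocabulary (not mBERT’s real vocab):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of one word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # word-internal pieces get the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no vocabulary entry matches at all
        tokens.append(piece)
        start = end
    return tokens

# Toy vocab: one whole word plus a few single characters.
vocab = {"ประเทศไทย",              # "Thailand" kept as one token
         "ก", "##ิ", "##น"}
print(wordpiece_tokenize("ประเทศไทย", vocab))  # one whole-word token
print(wordpiece_tokenize("กิน", vocab))        # dissolves into characters
```

With a vocabulary dominated by other languages, almost every Thai word takes the second path, which matches the near-character-level output in the screenshot above.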

This repo is a Thai-specific BERT model created in 2018, with tokens derived via {Wikipedia text → sentence segments → matching word-piece vocabulary → BERT}. Though I reached out to the original developer, I went ahead and adapted their TensorFlow model checkpoint to HuggingFace’s transformers library.
The model is now available at huggingface.co/monsoon-nlp/bert-base-thai, though unfortunately I have yet to prove an improvement over mBERT.

2021 Update

While researching this article, I learned about https://airesearch.in.th/, a government-funded research center. In early 2021 they released several high-quality Thai language models at https://huggingface.co/airesearch.

Pre-print: https://arxiv.org/abs/2101.09635

For the latest updates and my recommendations, see github.com/mapmeld/use-this-now/blob/main/README.md#thai-nlp
