Can written words be time series data?

Nick Doiron
2 min read · Oct 26, 2019


I’ve been thinking about this idea for a while: what if I applied a machine learning tool built for time series (stock markets, seasonal demand for candy, etc.) to language?

Choosing a library

I found four libraries on GitHub that would be interesting to apply here.

I decided on Gluon-TS: Prophet is more of a seasonal prediction tool that expects literal datetimes, and the Personae examples are all stock market data.

Setting up the training data

I downloaded Samuel Butler’s English text of the Odyssey and the Iliad from the MIT Classics website. The Iliad has over 156,000 words, which NLTK splits into word tokens. On Google CoLab, I installed the dependencies and downloaded the FastText pre-trained word embeddings for English, turning the token stream into 300 parallel time series, one per embedding dimension.
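The data prep can be sketched like this. The `text_to_series` helper and the toy embedding dict are my own illustrative stand-ins; the real run would tokenize the full Iliad with NLTK and look up actual 300-dimensional FastText vectors instead:

```python
import numpy as np

def text_to_series(tokens, embeddings):
    """Turn a token list into parallel time series:
    series[d][t] = dimension d of the embedding of token t."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    matrix = np.stack(vectors)   # shape: (num_tokens, 300)
    return matrix.T              # shape: (300, num_tokens)

# toy stand-in embeddings; the real FastText vectors are also 300-dimensional
rng = np.random.default_rng(0)
tokens = ["sing", "o", "goddess", "the", "anger", "of", "achilles"]
embeddings = {w: rng.normal(size=300) for w in tokens}

series = text_to_series(tokens, embeddings)
print(series.shape)   # (300, 7): one time series per embedding dimension
```

Each of the 300 rows is then an ordinary univariate time series that a forecasting library can consume.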

Running Gluon-TS 300 times

Here is part of my test data from the Iliad, and predicted continuation of the zeroth dimension of the word vector, using Gluon-TS:

The loss metric, and the wide uncertainty band in the prediction, suggest that almost any word would fit in the predicted range, while the line in the middle (the mean) never strays far from -0.05. That might be a useful way to show uncertainty in a stock movement, but here it doesn’t look promising.
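The “300 times” part is just a loop that fits an independent forecaster per embedding dimension. Here is a minimal sketch of that loop, with a naive mean forecaster standing in for the GluonTS estimator (which would return a probabilistic forecast with the uncertainty band described above):

```python
import numpy as np

def naive_forecast(history, horizon):
    """Stand-in predictor: repeat the historical mean for `horizon` steps.
    The real run would train a GluonTS model on `history` instead."""
    return np.full(horizon, history.mean())

# fake data shaped like the real input: 300 dims x 500 tokens
rng = np.random.default_rng(1)
series = rng.normal(scale=0.1, size=(300, 500))

horizon = 10
predictions = np.stack([
    naive_forecast(series[d], horizon)  # one independent model per dimension
    for d in range(300)
])
print(predictions.shape)   # (300, 10): 10 future steps per embedding dimension
```

Treating the dimensions as 300 unrelated series ignores any correlation between them, which is one plausible reason the forecasts collapse toward a flat mean.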

I decided to continue anyway, in case I later find a different prediction library that could reuse the same code.

Generating words

Once I’ve generated 10 predicted values for each dimension (sampled, not just the mean), I need to turn those numbers back into words. It looks a little like this:

import numpy as np

# vects holds the 300 per-dimension predictions; pick out step w from each
# to reassemble a single 300-dimensional word vector
word = []
for v in vects:
    word.append(v[w])

model_word_vector = np.array(word, dtype='f')
# en_src is the FastText word-vector model; look up the single nearest word
most_similar_words = en_src.most_similar([model_word_vector], [], 1)

