Esperanto NLP Part 1: Generating text with TensorFlow
Can code designed to generate English text learn the rules of Esperanto?
Lately I have been reading David Richardson’s Esperanto intro textbook and testing myself on Duolingo. When I went looking for examples of Esperanto NLP, or Esperanto and TensorFlow, I found hardly anything. This isn’t to say it hasn’t been done (Google Translate does an excellent job), but that information isn’t publicly available.
2020 Update
For a modern approach using Transformers, see how HuggingFace trained an Esperanto model:
https://huggingface.co/blog/how-to-train
Why work on NLP in Esperanto?
- I am interested in NLP, but don’t want to rehash the same examples that everyone else is doing.
- I want to understand how machine learning can be applied to languages other than English. Many of the impressive deep learning examples you see are pre-trained on huge datasets of English text. In my internationalization/localization work I’ve been covering Arabic, Burmese, Dhivehi, and other scripts, and I’d like to apply NLP and machine learning tools to these languages in the future.
- Without big companies and researchers covering the language already, the Esperanto experiments seem potentially useful and less like a toy.
- Esperanto feels like it should be easy to work with, as it is a constructed language with strictly regular grammar.
My first project is going to be the simplest — generating valid Esperanto sentences (without the system knowing meaning). People have used TensorFlow to “help” generate Harry Potter chapters, an episode of Seinfeld, and a short film which was actually performed (omg, the egg). It’s only a matter of time until someone uses “Neural Net Generates New Finale Episode of Firefly” to sneakily promote their fanfic. I’m going to start small, though.
Getting started with Shakespearean English
Martin Görner has a 3-hour course, “TensorFlow and deep learning, without a PhD”, and his code for generating Shakespeare text is the best example that I could find (a year ago I did this with TFLearn, but it has not been kept up to date). The script trains a Recurrent Neural Network (RNN) on the text, and its output gradually improves from random letters, to formatting that mimics the source, to character names and words which look sort-of right.
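To make the training setup a little more concrete, here is a minimal sketch of how text can be turned into training examples, assuming the network reads one character at a time (which the progression from random letters suggests). This is not Görner’s code: the function names `encode` and `make_training_pairs` and the window length are made up for illustration.

```python
def encode(text):
    """Map each character to an integer id and return the ids plus the vocabulary."""
    vocab = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return [char_to_id[ch] for ch in text], vocab


def make_training_pairs(ids, seq_len):
    """Cut the id stream into (input, target) windows.

    The target is the input shifted one character to the right, so at every
    position the network is asked to predict the next character.
    """
    pairs = []
    for start in range(0, len(ids) - seq_len, seq_len):
        inputs = ids[start:start + seq_len]
        targets = ids[start + 1:start + seq_len + 1]
        pairs.append((inputs, targets))
    return pairs


# seq_len=5 is an arbitrary illustrative value, not a setting from the course.
ids, vocab = encode("to be or not to be")
pairs = make_training_pairs(ids, seq_len=5)
first_inputs, first_targets = pairs[0]
print("".join(vocab[i] for i in first_inputs))   # the first five characters
print("".join(vocab[i] for i in first_targets))  # the same window shifted by one
```

Because the model only ever sees characters, nothing in this setup is specific to English, which is what makes it a reasonable starting point for Esperanto.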
I left the script running overnight on a MacBook Air:
TRAINING STATS: batch 564/856 in epoch 6, batch loss: 1.27156, batch accuracy: 0.60500
And here’s what text generated by this model looks like:
IANO Did I, to bear thee?
[Enter a Messenger]
Well so to thee, thou hast been thy heart.
[Exit]TROILUS AND CRESSIDA
ACT IV
SCENE III The father. A palace.[Enter the Green of Angelo, with a bostm of a bed,
and the Duke of Buckingham and Second Servants]
[Enter a Messenger]This is the banks of this to the season.
There is the proud of the bone of my heart,
The winds and the bosom of the world that should he be a
great and born that thou hast not been the world and
the beaute of the beauty. I will not be said.[Enter and a Second Gentleman]
[Re-enter CASSIO]
[Enter CLEOPATRA, CLEOPATRA, CASSAR DOMARDUS,
and DUCHESS OF YORK, and the Duke of Burgundy]KING HENRY VI Why, then, we say you were, as you are as the world
When they say, and to the king of true lordship than the state
To be as the sun of this face that hath been been to be saint.
[Exit BASSANIO]
What should I do thee from me to thee?
Interesting, and mostly valid structure, though “bostm” is not a word and sentences are more or less meaningless. I stopped the script and generated a play from Martin Görner’s final checkpoint files:
[Enter CLEOPATRA and Attendants]
CASSIO Marry, how do you to your master?
CORIOLANUS What says you that I have a man of the devil?
If I be born and so, my lord, I shall not know.[Exit ANTONIO, CARDINAL, and Attendants]
CASCI The point of Antonius, made me be the strange of this.
I am not to this prince and to the country’s promit,
That I may stay, and then I have been born than the effect
And shall I send my soul and that I have been born,
If I do love thee from my lord, I cannot be
That I have been assail’d, and she’ll be more
As they will see the common property of my beard.
I would not stand to be thy life that thou dost love,
That I have sent to the devil that I have seen.
If thou, thou art, I cannot see,
That therefore bear the compositions of my soul,
That I must see thy father’s body that I have been
The bone of this be thou a motion of men that thou hast
…
This continues on for a long while, and it feels more meaningful than my work-in-progress excerpt.
Checking assumptions with patent text
To make sure the code is general enough to work on other text examples, I decided to try another English example: patents. I’m inspired by a project last year which generated inventions including images, though I can’t seem to dig up a link for it.
I downloaded 139MB of text from patents approved in the year 2000, which is a small fraction of what you can get from the Google and Patent Office bulk download site. I divided it into three files with a roughly equal line count, trying to follow the Shakespeare script’s division of the text into separate files for plays, but I’m not sure if this was necessary.
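Splitting a corpus into files of roughly equal line count could look something like the sketch below. The helper name `split_lines` is hypothetical, and the stand-in list of ten lines replaces the real patent text.

```python
def split_lines(lines, parts):
    """Split a list of lines into `parts` consecutive chunks of near-equal length."""
    base, extra = divmod(len(lines), parts)
    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional line each.
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


# Stand-in data; a real run would read the lines from the downloaded file.
lines = [f"line {n}\n" for n in range(10)]
chunks = split_lines(lines, 3)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be written to its own file with `writelines`. Keeping the chunks consecutive preserves the original order of the text within each file.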
Here’s some text it generated after running from morning to afternoon:
144 to 405.degree. C. and 56 ml weight could be controlled by the structure of
the transmission control unit, a chea fold will be designed by cells to saip trough from the gotf and the second torm sensor and resistor and therefore, the chaing to the cost of power-on-drave attleast one, of the present invention as described in FIG. 8.The sequence is shown in FIG. 4 of the anount of the predetermined soucce data
as shown in FIG. 4. It also seruentiously, a diameter samples 14 which connected to the base 35.
Samples a positive region 11 of a thirdle, when the solder comprises a cingle angle is prior os the adapter in the same fuming member 20.In accordance with the present invention are detected to price the athector to the output signals from determining tubing through an order on a diselection resistance.
This looks… not so good, but I got the impression that the biggest problem might be that 139MB was too much text: the script never got beyond ‘epoch 0’, so it never finished a single pass over the data.
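A back-of-the-envelope calculation shows why a corpus this size is slow to get through. The sketch below assumes a batch size of 200 sequences and a sequence length of 30 characters; both values are assumptions for illustration and are not stated in this post.

```python
def batches_per_epoch(corpus_chars, batch_size, seq_len):
    """Number of training batches needed to pass over the corpus once."""
    return corpus_chars // (batch_size * seq_len)


# Assumed hyperparameters, not taken from the post.
BATCH_SIZE = 200
SEQ_LEN = 30

# 139MB of patent text, treating one byte as one character.
patent_chars = 139 * 1000 * 1000
patent_batches = batches_per_epoch(patent_chars, BATCH_SIZE, SEQ_LEN)
print(patent_batches)
```

Under these assumed settings, one epoch over the patent text is tens of thousands of batches, against the 856 batches per epoch in the Shakespeare log above. A run that lasts from morning to afternoon on a laptop may simply not reach the end of the first pass.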
Constructing the Esperanto source text
I decided to create my own source text by downloading articles from the Esperanto Wikipedia and formatting them into plain-text paragraphs. I wrote a small Node module which you can use to crawl a language’s Wikipedia and download a specified size of article text (10MB in my case).
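The Node module itself is not shown here, but the general idea of crawling until a size budget is met can be sketched in Python. Everything in this block is hypothetical: `crawl`, the `fetch_article` callable, and the tiny in-memory `FAKE_WIKI` stand in for real requests to Wikipedia.

```python
from collections import deque


def crawl(start_title, fetch_article, max_bytes):
    """Breadth-first crawl that stops once `max_bytes` of text is collected.

    `fetch_article(title)` is a hypothetical callable returning
    (plain_text, linked_titles); a real one would query Wikipedia.
    """
    queue = deque([start_title])
    seen = {start_title}
    collected = []
    total = 0
    while queue and total < max_bytes:
        title = queue.popleft()
        text, links = fetch_article(title)
        collected.append(text)
        total += len(text.encode("utf-8"))
        for link in links:
            if link not in seen:  # never queue the same article twice
                seen.add(link)
                queue.append(link)
    return collected


# Tiny in-memory stand-in for Wikipedia, for illustration only.
FAKE_WIKI = {
    "Esperanto": ("Esperanto estas lingvo.", ["Lingvo", "Zamenhof"]),
    "Lingvo": ("Lingvo estas sistemo.", ["Esperanto"]),
    "Zamenhof": ("Zamenhof kreis Esperanton.", []),
}

texts = crawl("Esperanto", lambda title: FAKE_WIKI[title], max_bytes=40)
print(len(texts))
```

A breadth-first order means the crawl drifts outward from the starting page, so the topics it collects depend heavily on where it starts and which links it meets first.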
Validating the output
I’m interested to see whether this RNN will generate text which follows Esperanto grammar rules.
A simple Esperanto sentence — “Mi manĝas pomon” = “I eat an apple” — is not obvious to an English speaker. What makes this language easy to learn?
- The first-person pronoun “mi” has no special I/me rules. For possessives, where English has me/my you/your, he/his, us/our, and they/their, Esperanto has “mi/mia”, “vi/via”, “li/lia”, and so on.
- This verb’s “-as” ending is consistent for all Esperanto verbs’ present tense, and doesn’t change based on a singular or plural subject.
- The word for apple, “pomo”, becomes “pomon” because it is the object of the sentence. Likewise the adjective “ruĝa” for red, when applied to this object, becomes “ruĝan.”
Plurals work the same way: the noun takes a “-j” ending, the adjective follows suit, and the object’s “-n” comes last. Check out this sentence: “Mi manĝas ruĝajn pomojn” = “I eat red apples”. The letter J here is pronounced more like Y in English, due to Esperanto’s Slavic roots.
Even though this RNN script will read in and output many words that I haven’t learned yet, I can use these grammar rules to validate output sentences.
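Because the endings are so regular, a rough first pass at reading them can be written in a few lines. The sketch below is only a suffix check under simplifying assumptions: the `describe` helper and the short `FUNCTION_WORDS` list are made up for illustration, and proper nouns or unlisted short words will be misread.

```python
# A deliberately short, incomplete list of words whose endings should be ignored.
FUNCTION_WORDS = {"la", "kaj", "en", "de", "mi", "vi", "li", "kiu", "pro", "por"}


def describe(word):
    """Guess (kind, plural, accusative) from an Esperanto word's ending."""
    w = word.lower().strip(".,;:!?\"'")
    if w in FUNCTION_WORDS:
        return "function word", False, False
    accusative = w.endswith("n")  # "-n" marks the direct object
    if accusative:
        w = w[:-1]
    plural = w.endswith("j")      # "-j" marks the plural
    if plural:
        w = w[:-1]
    for suffix, kind in (("as", "verb, present"), ("is", "verb, past"),
                         ("os", "verb, future"), ("o", "noun"), ("a", "adjective")):
        if w.endswith(suffix):
            return kind, plural, accusative
    return "unknown", plural, accusative


for word in "Mi manĝas ruĝajn pomojn".split():
    print(word, describe(word))
```

A check like this says nothing about whether a word actually exists, only whether its ending fits one of the expected patterns.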
Experiment A: 1MB
I was eager to start, so I did the first run once I had the first 1MB of source data (the articles linked from the Wikipedia main page). Early results were not readable:
La postano estis la landa andenas la langvo
[followed by a list of numbers]
Then the output became run-on sentences chaining mostly nouns and prepositional phrases, plus one “estas” (a being verb).
La lingva en la senta en la suna estas pre la lando en la lango de la lingvo de la sekva prencio de la senda porto de la sondo en la somero de la lindo, kaj la prencan pro sekve la lando en la suda distania kontraoj de la monta kaj la plij malatoj de la malalio de la lango de la sekva komenco estis la plej malatoj kiuj la plej mondoj.
Eventually these run-ons were obsessed with “plej” (most) and “malplej” (presumably least?).
La plej alta prezidento en 1880, la plej alta proventa kaj la plej alta kaj la plej grava parto da parto devenas en la suda lando estas pli altaj altoj en la pli malpri la termona kaj la plej granda provizanta estis kontra la plej grava portugalio kaj en la prezidento de la plej multaj pli malplej gravaj regaloj estas prenekta per la plej altaj paroj kaj ekzeptis la komuniko…
Experiment B: 10.3MB
Once my source builder script was able to crawl Wikipedia recursively, I downloaded 10MB of data. I was a little worried as the crawler dug into the Esperanto article about Wikipedia, and then stubs about other languages’ Wikipedias, then more technical articles on XML and SMTP, the space race, and somehow “La Fluganta Spagetmonstro”.
One potential issue with Esperanto Wikipedia as a source is that most articles are short (the one on George W. Bush, for example). To reach 10MB, the script downloaded about 1,400 articles.
The initial results were meaningless, but an improvement on the previous frenzied sentences:
Sed la sukcesa kontra kurturo de la reo de la sendecenta ekonomio kolora periodo por esti erkita kaj majekva, trankaj kaj sekviintaj iuj fortaj klisato.
I let the script run overnight and got more satisfying results:
La konsilio de la rivera konstruao de la Unua Mondmilito en 1840 la propreso de la Unuiinta Relando estis la urbo de la Mondo kaj la Usona Kongreso de Usono.
La regiono en 1941 en 1808 estis alia registaro de la Urbo de la Relando, kaj la plej granda parto de la mondo estis la irkaa loantaro en la mondo.
La registara patro de la Meza Imperio estis ankora esploritaj en 1940 pro tio la regiono de la Unuiintaj Nacioj konsideras la propran politikan signifon de la komunismo kiu estis limigita por la plej frua propra parto de la mondo.
La artikoloj estas komuna poste konstruita en la plej granda parto de la 19-a jarcento.
La prezidento de la marko de la Moderna Kantono la 1-a de Majeo, la riveroj de la Urbo de la Mondo estas la efaj religiaj kontraaj kaj propraj konsilioj.
La la regiono en 1940, la registaro estis la unua portugala regiono de la Universitato de la Unuiinta Nordo de la Unui
A few phrases from other excerpts:
La plej multaj landoj estis konsiderataj en la moderna sendependa lando.
En 1949, la plej granda urbo estis deklarita en 1944 per la reo de Aleksandro, kiu konsistis en la mezepoka kalendaro.
La prezidento de la Respubliko en Eropo estis la unua kaj la sudoriento de la mondo, li estis elektita en 1944 por la provinco de la ministerio kiel la regiona eksterlando.
Conclusions
There isn’t a coherent meaning in what’s written, but it was good to see:
- chains of plural words (such as “la efaj religiaj kontraaj”) have the -j ending on adjectives as well
- almost all the sentences in these excerpts use “estas” or “estis”, forms of the being verb, and correctly leave off the “-n” ending of a direct object
- one phrase which does have an action verb, “la Unuiintaj Nacioj konsideras la propran politikan signifon”, has the “-n” ending on the noun and its adjectives
- correct use of the past participle in “estis deklarita” = “was declared” and “li estis elektita en 1944” = “he was elected in 1944”
- correctly joining phrases such as “, kiu konsistis” = “, which consisted…”
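The agreement observations above can also be checked mechanically. The following is a minimal sketch, not a real grammar checker: it only looks at adjectives that come directly before a noun, it relies on endings alone, and the names `ending`, `agreement_errors`, and `SKIP` are invented for this example.

```python
import re

# Short words that end in -a but are not adjectives (an incomplete, assumed list).
SKIP = {"la", "da"}


def ending(word):
    """Return (vowel, plural, accusative) for words ending in -o/-a plus optional -j, -n."""
    match = re.fullmatch(r".+?([oa])(j?)(n?)", word.lower())
    if not match:
        return None
    return match.group(1), bool(match.group(2)), bool(match.group(3))


def agreement_errors(sentence):
    """List (adjective, noun) pairs whose -j / -n endings disagree."""
    errors = []
    pending = []  # adjectives waiting for the noun they describe
    for word in re.findall(r"[^\W\d_]+", sentence):
        info = ending(word)
        if word.lower() in SKIP or info is None:
            pending = []
            continue
        vowel, plural, accusative = info
        if vowel == "a":
            pending.append((word, plural, accusative))
        else:  # a noun closes the run; every waiting adjective must match it
            for adjective, adj_plural, adj_accusative in pending:
                if (adj_plural, adj_accusative) != (plural, accusative):
                    errors.append((adjective, word))
            pending = []
    return errors


print(agreement_errors("la Unuiintaj Nacioj konsideras la propran politikan signifon"))
print(agreement_errors("Mi manĝas ruĝa pomojn"))
```

The first sentence is the phrase quoted from the generated text; the second is a deliberately broken version of the apple example, where the adjective is missing both endings.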
The main problems are:
- meaningless or tautological output (a sample phrase translates to: “Most countries were considered in the modern independent country.”)
- reluctance to use any action verbs (though I wonder if this is inherited from the voice/tone used on short articles on Esperanto Wikipedia)
Going forward, I believe that this model knows enough to be useful to check if a sentence is grammatically correct, or to build a valid sentence given an initial prompt. I don’t think it’s going to generate play or TV episode scripts anytime soon… I would be better off pulling Esperanto text from various books and long-form text than from a selection of articles about countries and their histories.
—
Continue reading Part 2: Finishing Sentences and Part 3: Correcting Grammar