Esperanto NLP Part 3: Correcting Grammar
The student-bot becomes the teacher-bot
In Part 1 I trained a model on the Esperanto Wikipedia; it generates text that is grammatically correct and occasionally context-appropriate. In Part 2, I started completing input sentences and fixed a bug involving Esperanto’s alphabet. Some caveats about Wikipedia as a source: the bot tends to use “estas” (the verb “to be”) or the passive voice, as in “estis skribita” (was written), rather than action verbs. But when it does use action verbs, the subject and direct object take the correct forms, and the adjectives agree with the nouns they modify.
So this can generate grammatically correct nonsense. What’s the use? As an Esperanto beginner, I’d like to make a bot that can suggest correct spelling and grammar as I write.
2020 Update
For a modern approach using Transformers, see how HuggingFace trained an Esperanto model:
https://huggingface.co/blog/how-to-train
Setting up the workspace
I re-ran the code from the previous example and let it run for almost a full day so that it could finish building its model.
I created a static web UI that records my typing. I thought about replacing the conventional textbox so that I could highlight text in place, but it’s easier to show the corrections outside of the textarea. Following a common Esperanto convention (the x-system), I allow ‘gx’ as a shorthand for typing ĝ, and likewise for the other accented letters.
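The x-system shorthand can be sketched in a few lines. This is only an illustration: the post does not say whether the conversion happens in the browser or in the Python script, and the names X_SYSTEM and from_x_system are made up for this example.

```python
# Hypothetical helper: expand x-system digraphs into accented Esperanto letters.
X_SYSTEM = {
    "cx": "ĉ", "gx": "ĝ", "hx": "ĥ",
    "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
}


def from_x_system(text: str) -> str:
    """Replace each x-system digraph with its accented letter, keeping case."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        accented = X_SYSTEM.get(pair.lower())
        if accented is not None:
            # Capitalise the result when the base letter was capitalised.
            out.append(accented.upper() if pair[0].isupper() else accented)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(from_x_system("Mi mangxas rugxajn pomojn"))
```

One caveat of this sketch: it converts every matching digraph, so a foreign name that really contains “ux” or “sx” would be changed too.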
After setting up a quick, hacky server to display the page, pass text into the Python script, and pass its suggestion back to the client, I type “Mi manĝas pomo_” (I eat an apple). The suggestion is to complete it as “pomon”, because the -n ending marks a direct object.
I type “Mi manĝas ruĝajn pomo_” (I eat red apples) and the suggestion bot, seeing a plural adjective, completes the plural form (“pomojn”).
Making less obvious corrections
How will the sentence-completion bot react if I make a misleading error, for example, omitting the -n ending on the adjective and asking it to complete “Mi manĝas ruĝa pomo_”? I tried multiple times, and the suggestions varied randomly between “pomon” and accepting “pomo” by adding a space. It would be interesting to see the neural net battle that I have created here!
I think the best way to get started on a practical tool is to go through an input sentence letter by letter and notice when the model’s probability for the next letter in my startPhrase is surprisingly low. I can use some of my own sentences and some Wikipedia sentences as a baseline to measure how low counts as surprising. These surprises would be especially concerning if they happen in the suffixes. This isn’t a great approach, because I might be writing a brand new word, or the language might have important prefix grammar. But we can figure that out another time.
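Here is a minimal sketch of that letter-by-letter scan. It assumes the trained network can be wrapped as a function that takes the text typed so far and returns a probability for each possible next character. That wrapper and every name below are hypothetical; a toy counter that only looks at the previous character stands in for the real model so the example runs on its own.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed interface: text so far -> {next character: probability}.
NextCharProbs = Callable[[str], Dict[str, float]]


def make_toy_model(corpus: str) -> NextCharProbs:
    """Toy stand-in for the trained network: uses only the previous character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_char_probs(prefix: str) -> Dict[str, float]:
        followers = counts.get(prefix[-1:], Counter())
        total = sum(followers.values())
        if total == 0:
            return {}
        return {ch: n / total for ch, n in followers.items()}

    return next_char_probs


def letter_probabilities(sentence: str, model: NextCharProbs) -> List[Tuple[str, float]]:
    """For every letter after the first, the probability the model gave it."""
    return [
        (sentence[i], model(sentence[:i]).get(sentence[i], 0.0))
        for i in range(1, len(sentence))
    ]


model = make_toy_model("mi manĝas pomon. mi manĝas pomojn.")
for letter, p in letter_probabilities("mi manĝas pomo", model):
    print(repr(letter), round(100 * p, 1))
```

With the real network in place of the toy model, the low values in this list are the candidates for a warning.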
Given that the bot does not understand words’ meaning, I want to prevent my suggestions from correcting verb tense, changing ‘red’ to ‘green’, or adding endless prepositional phrases like we’ve seen in earlier examples.
Measuring the surprise
The sentence “Mi manĝas ruĝa pomo__” ideally should have two red flags. The end of “ruĝa” should not predict a space (even though it would be valid in another part of a sentence), and the end of “pomo” should (due to the previous error) be unusually conflicted.
We shouldn’t try to predict the first letters of words, because the computer isn’t deciding what we’re saying.
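Here is the probability, as a percentage, that the model assigns to each letter of “Mi manĝas pomon”. The first letter of each word is not scored, and _ stands for a space.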
M
i = 0.3%
_ = 5%
m
a = 35%
n = 9%
ĝ = 32%
a = 54%
s = 50%
_ = 96%
p
o = 23%
m = 0.5%
o = 15%
n = 76% (space instead of n: 2.8%)
Dropping the “n” and typing a space gets 2.8%, which is above the 0.3% and 0.5% lows seen here but far below the 76% given to “n”.
I tried a new score: P(user_letter) / [P(best_letter) + P(2nd_best_letter)]. The values below are on a 0 to 100 scale.
M
i = 2.8
_ = 36
m
a = 60
n = 13
ĝ = 66
a = 62
s = 65
_ = 99
p
o = 49
m = 0.8
o = 48
n = 88 (j instead of n = 12) (space instead of n = 3.3)
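The score can be sketched as a small function. The function name and the dictionary format are assumptions for illustration, and multiplying by 100 is my reading of the listed values rather than something stated outright. The 10.4% used for “j” below is not given in the post; it is inferred from the j score of 12, so treat it as illustrative.

```python
from typing import Dict


def surprise_score(probs: Dict[str, float], typed: str) -> float:
    """100 * P(typed letter) / (P(best letter) + P(second-best letter))."""
    top_two = sorted(probs.values(), reverse=True)[:2]
    denominator = sum(top_two)
    if denominator == 0:
        return 0.0
    return 100.0 * probs.get(typed, 0.0) / denominator


# Distribution after "Mi manĝas pomo": n and space are from the post,
# the value for j is inferred (an assumption).
after_pomo = {"n": 0.76, "j": 0.104, " ": 0.028}

for letter in ("n", "j", " "):
    print(repr(letter), round(surprise_score(after_pomo, letter), 1))
```

With these inputs “n” scores about 88 and “j” about 12, and the space scores about 3.2, close to the 3.3 listed (the inputs are rounded). Dividing by the top two probabilities rather than the top one keeps the score from collapsing when the model is split between two reasonable letters.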
I’m a little disappointed not to find a clear minimum alert threshold, even in this simple sentence. It looks like our model is punishing unfamiliar letters almost as much as it would incorrect ones.
I decide to use the score, and flag letters which get less than 5 points, if the most-probable letter has a probability of 50% or more.
As soon as I try my old troubled sentence “Mi manĝas ruĝa pomo”, the check misses the error, so I adjust the thresholds: a score below 9 points, and a most-probable letter above 53%.
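The flagging rule amounts to two comparisons. A sketch, with a hypothetical function name, that takes the score and the probability of the model’s favourite letter as plain numbers:

```python
def should_flag(score: float, top_prob: float,
                max_score: float = 9.0, min_top_prob: float = 0.53) -> bool:
    """Flag a letter when its score is low AND the model is confident in an alternative.

    Defaults are the adjusted thresholds from the post. The first attempt used
    max_score=5 and min_top_prob=0.50 (there the post says "50% or more",
    while this sketch uses a strict comparison).
    """
    return score < max_score and top_prob > min_top_prob


# A space typed after "pomo": score 3.3, while the model's favourite "n" has 76%.
print(should_flag(3.3, 0.76))   # low score, confident model
# The correct "n" itself: score 88.
print(should_flag(88, 0.76))    # high score
```

The second condition is what keeps the check quiet in places where the model has no strong opinion, such as the middle of an unfamiliar word.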
Time to check for overfitting. In the article about “lando” (country) a few sentences worked while this one failed:
The model mainly messes up by trying to put a space after “la” (the) or “de” (of) when those letters start a longer word. It doesn’t like a comma. And it really wants the word “kolonioj” to start with “kon”. I wrote little tweaks until I finally decided to ban changing the first three letters of each word. It still doesn’t like “folklorfestivalo” in the Oktoberfest article, but it is a lot quieter, and accepts many more phrases.
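One way to express the three-letter ban is to compute which character positions are still open to correction. This is a sketch under my own assumptions: any non-letter ends a word, and the character that ends a word counts as that word’s next position (so the space after “Mi” is protected, but the space after “ruĝa” is not). The function name is hypothetical.

```python
from typing import List


def checkable_indices(sentence: str, protected: int = 3) -> List[int]:
    """Indices of characters that come after the first `protected` letters of a word."""
    indices = []
    letters_in_word = 0  # letters seen so far in the current word
    for i, ch in enumerate(sentence):
        if letters_in_word >= protected:
            indices.append(i)
        # A letter extends the current word; anything else starts a new one.
        letters_in_word = letters_in_word + 1 if ch.isalpha() else 0
    return indices


sentence = "Mi manĝas ruĝa pomo "
print([(i, sentence[i]) for i in checkable_indices(sentence)])
```

In “Mi manĝas ruĝa pomo ” this leaves the endings open to checking, including the space after “ruĝa” and the space after “pomo”, while “la” followed by “n” at the start of “lando” would be left alone.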
I add a final feature to display the model’s most-probable letter in the editor for these cases.
Finding errors on Wikipedia
If the student really is to become the teacher, it should find errors for me to fix in its Wikipedia source articles. Even though the bot adopted the rules from this text, it has built a general model and ought to spy inconsistencies.
In an article about the Islamic calendar, I see these flagged:
- “ke la jaro esta suna aŭ luna” — the model says it should be “estas”, and I can’t find a tense which would allow a plain “esta”.
- “kalendara” — the spelling elsewhere in the text is “kalendaro” and our model expects that
In other articles, there are many, many false positives, so I hacked on the script for a while to turn it into a more general-purpose typo-finder.
In the ‘Londono’ article, the script raised a grammar issue:
Tiu kunurbaĵo kovras ankaŭ du anglajn graflandojn: la malgrandan distrikton de la Urbo Londono kaj la graflandon Granda Londono.
The model does not want to put a -n on graflandojn or graflandon (distrikton was not flagged, though it follows the same rule). The first one shouldn’t be changed; the second one is unclear to me since it is in this separate list.
The next flagged sentences were more clear-cut: I edited the article to make them plural.
Sed la originala Londono ensumigis nur kelkajn kvadratan mejlon de la Londona urbo-centro….
But the original London was made only a few square miles from the London city center
la japana pentristo Hokusajo kiu siavice montras aliajn mirindaĵon el Azio
the Japanese painter Hokusajo that also shows other wonders from Asia
This bot is now a better Esperanto student than I am. By working together we can catch errors in my writing and edit Wikipedia.
Future options
The Esperanto-bot was trained on Wikipedia data, so a future project might read books and come up with a larger variety of content.
I wonder whether the Wikipedia learning technique would work in a language with more irregular words, or with grapheme clusters (like the multiple accents and joined letters of Burmese, Hindi, Tamil, etc.).
A less one-sided project might make a chatbot for a low-resource language, as an internet friend suggested, by reading conversations on Twitter.