Alif: an Arabic word bank

Resources for puzzles and word games

Nick Doiron
Dec 9, 2018

Two years ago, I made a multilingual crossword puzzle generator, and wrote a post on that — Crosswords in Burmese. It took frustrating manual labor for users to build up their puzzles, so I made wiki-crossword, which grabs random articles from Wikipedia and generates puzzles and clues from there.

While adding support for Arabic script, I found it interesting to handle right-to-left text in the game, but I also hit roadblocks that I hadn’t expected. Let me put them in the context of word games and puzzles:

  • A simple game encourages you to fill in a missing letter. What does the partial word look like? Suppose you are removing it from العَرَبِيَّة.
    Take out one char and you will see العَرَبِ_ة ; by shaping the neighboring letters, we can preserve a more natural العَرَ بـ_ـة
  • A teacher wants to generate puzzles from a category (animals, countries, etc) or their own word list. Where can you pull up a list of 50 foods in Persian or Pashto? Categories exist on Wikipedia and DBpedia, but are difficult to find, parse, and rely on across multiple languages.
  • كرة is one word, but appears as two baseline groups. Does the programmer know which letters such as ر break the baseline? How can this be done without separating a letter from its diacritics?
  • A crossword puzzle has one space per glyph. What are all of the valid combinations of lam, alif, and diacritics which combine into one لا form?
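To make these four problems concrete, here is a minimal sketch of how they could be approached. It is not Alif-Toolkit’s actual API: every function name below is hypothetical, and it only classifies the basic Arabic letters U+0621 to U+064A, so the extra Persian, Pashto, or Kyrgyz letters would be treated as non-joining until the tables are extended.

```typescript
// Simplified sketch: only the basic Arabic letters U+0621..U+064A are classified.
const TATWEEL = "\u0640";
const LAM_CODE = 0x0644;
// Letters that join to the previous letter but never to the next one.
const RIGHT_JOINING_CODES = [0x0622, 0x0623, 0x0624, 0x0625, 0x0627, 0x0629,
  0x062f, 0x0630, 0x0631, 0x0632, 0x0648];
// Alef forms that fuse with a preceding lam into one glyph.
const ALEF_CODES = [0x0622, 0x0623, 0x0625, 0x0627];

type JoiningKind = "dual" | "right" | "none";

function isTashkeel(code: number): boolean {
  return (code >= 0x064b && code <= 0x065f) || code === 0x0670;
}

function joiningKind(cluster: string): JoiningKind {
  const code = cluster.charCodeAt(0);
  if (RIGHT_JOINING_CODES.indexOf(code) !== -1) return "right";
  if (code >= 0x0626 && code <= 0x064a) return "dual";
  return "none"; // hamza (U+0621), Latin text, and anything not covered here
}

// Split a word into clusters: one base letter plus the diacritics that follow it.
function toClusters(word: string): string[] {
  const clusters: string[] = [];
  for (let i = 0; i < word.length; i++) {
    if (isTashkeel(word.charCodeAt(i)) && clusters.length > 0) {
      clusters[clusters.length - 1] += word.charAt(i);
    } else {
      clusters.push(word.charAt(i));
    }
  }
  return clusters;
}

// Group clusters into runs that stay connected along the baseline.
function baselineGroups(word: string): string[] {
  const groups: string[] = [];
  const clusters = toClusters(word);
  let joinsNext = false; // does the previous cluster connect to the next one?
  for (let i = 0; i < clusters.length; i++) {
    const kind = joiningKind(clusters[i]);
    if (joinsNext && kind !== "none") {
      groups[groups.length - 1] += clusters[i];
    } else {
      groups.push(clusters[i]);
    }
    joinsNext = kind === "dual";
  }
  return groups;
}

// Crossword cells: like clusters, but lam + alef share one cell (the lam-alef ligature).
function crosswordCells(word: string): string[] {
  const cells: string[] = [];
  const clusters = toClusters(word);
  for (let i = 0; i < clusters.length; i++) {
    const prev = cells.length > 0 ? cells[cells.length - 1] : "";
    const isAlef = ALEF_CODES.indexOf(clusters[i].charCodeAt(0)) !== -1;
    if (isAlef && prev.charCodeAt(0) === LAM_CODE && toClusters(prev).length === 1) {
      cells[cells.length - 1] += clusters[i];
    } else {
      cells.push(clusters[i]);
    }
  }
  return cells;
}

// Replace one cluster with "_" and add tatweel so the neighbours keep their joined shapes.
function blankCluster(word: string, index: number): string {
  const clusters = toClusters(word);
  if (index < 0 || index >= clusters.length) return word;
  const removed = joiningKind(clusters[index]);
  const prevJoins = index > 0 && removed !== "none" &&
    joiningKind(clusters[index - 1]) === "dual";
  const nextJoins = index + 1 < clusters.length && removed === "dual" &&
    joiningKind(clusters[index + 1]) !== "none";
  return clusters.slice(0, index).join("") + (prevJoins ? TATWEEL : "") + "_" +
    (nextJoins ? TATWEEL : "") + clusters.slice(index + 1).join("");
}
```

With these definitions, baselineGroups splits كرة into two groups (كر and ة), and crosswordCells counts سلام as three cells because lam and alef share one. The tatweel (U+0640) trick in blankCluster is one way to keep the neighbouring letters in their joined forms; a real implementation could instead emit the presentation-form code points directly.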

A toolkit and a database

A game developer doesn’t want to stop and parse DBpedia, WikiData, and the Unicode spec before starting to write their game. So I made a thing:

Alif-Toolkit is a TypeScript library that supports all of those functions (and normalization) for any letters in Arabic’s Unicode blocks. All other libraries that I know of are GPL-licensed, so I pored over PDFs and hex codes to make this one both comprehensive and MIT-licensed.

Alif Word Bank is something that I only got running today, but it uses the toolkit, excerpts of articles, and category names to break down words in several ways. Here’s a Persian article on cookies which you can get as a response:

[Image] This is my concept for showing RTL JSON responses.

Here’s a sample API request which returns names of birds in Persian (Farsi = “fa”), with a presence on the Simple English Wikipedia (this could help cut down on obscure topics, or give you an easy-to-read resource).
alif-word-bank.herokuapp.com/topic/fa/en:bird?inSimple=true&count=20
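If you want to build that request from code, a small helper might look like the sketch below. Only the URL shape is taken from the sample request above; the helper name, the TopicQuery type, and the https scheme are my assumptions, and the response format is deliberately left out because it isn’t documented here.

```typescript
// Hypothetical helper: only the path and query parameters come from the sample request.
const WORD_BANK_BASE = "https://alif-word-bank.herokuapp.com"; // scheme is an assumption

interface TopicQuery {
  language: string;   // language of the returned words, e.g. "fa"
  topic: string;      // topic with its language prefix, e.g. "en:bird"
  inSimple?: boolean; // only topics that also exist on Simple English Wikipedia
  count?: number;     // how many words to ask for
}

function buildTopicUrl(query: TopicQuery): string {
  const params: string[] = [];
  if (query.inSimple !== undefined) {
    params.push("inSimple=" + String(query.inSimple));
  }
  if (query.count !== undefined) {
    params.push("count=" + String(query.count));
  }
  // encodeURI keeps the ":" in "en:bird" readable, as in the sample request.
  const path = WORD_BANK_BASE + "/topic/" +
    encodeURIComponent(query.language) + "/" + encodeURI(query.topic);
  return params.length > 0 ? path + "?" + params.join("&") : path;
}
```

Calling buildTopicUrl({ language: "fa", topic: "en:bird", inSimple: true, count: 20 }) reproduces the sample request with the assumed scheme in front.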

DBpedia and WikiData, working together

My DBpedia parser is a little marvel. First I pull in their list of all animals, then use a forEach to look up thousands of entries. I quickly got blocked by the server, so I added a random timeout to each call, spacing the requests out to about 0.8 seconds per animal.
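One way to read “random timeouts” is to give each lookup its own slot of about 0.8 seconds and pick a random moment inside that slot. The sketch below shows that idea; the function names and the lookUp callback are hypothetical, and the real parser may space its calls differently.

```typescript
// Delay (in ms) before each lookup: entry i lands at a random point inside
// its own slot of width spacingMs, so the average gap stays near spacingMs.
function staggeredDelays(count: number, spacingMs: number, random: () => number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < count; i++) {
    delays.push(Math.round((i + random()) * spacingMs));
  }
  return delays;
}

// Schedule one lookup per id instead of firing them all at once.
// lookUp is a placeholder for whatever fetches a single DBpedia entry.
function lookUpSlowly(ids: string[], lookUp: (id: string) => void, spacingMs: number = 800): void {
  const delays = staggeredDelays(ids.length, spacingMs, Math.random);
  ids.forEach(function (id, i) {
    setTimeout(function () { lookUp(id); }, delays[i]);
  });
}
```

Passing the random source as a parameter keeps staggeredDelays easy to check with a fixed value, while lookUpSlowly uses Math.random in practice.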

DBpedia lists categories (in English), names in a handful of languages, and a WikiData ID. To get Persian and the full range of Chinese names (such as Simplified vs. Traditional), I also load the WikiData entity. This is also where to note if there is a Simple English link or not.
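As a rough sketch of that second step, the function below picks labels and the Simple English link out of a WikiData entity. It assumes the entity has already been downloaded and follows the usual WikiData JSON layout, where labels are keyed by language code and the Simple English Wikipedia appears under the simplewiki sitelink; the type and function names are mine.

```typescript
// Minimal slice of a WikiData entity: just the fields this sketch reads.
interface WikidataEntity {
  labels?: { [lang: string]: { language: string; value: string } };
  sitelinks?: { [site: string]: { site: string; title: string } };
}

interface EntitySummary {
  labelsByLang: { [lang: string]: string };
  inSimple: boolean; // true when a Simple English Wikipedia article exists
}

// Keep only the requested languages and note whether a simplewiki link is present.
function summarizeEntity(entity: WikidataEntity, languages: string[]): EntitySummary {
  const labelsByLang: { [lang: string]: string } = {};
  const labels = entity.labels || {};
  for (let i = 0; i < languages.length; i++) {
    const label = labels[languages[i]];
    if (label) {
      labelsByLang[languages[i]] = label.value;
    }
  }
  const inSimple = !!(entity.sitelinks && entity.sitelinks["simplewiki"]);
  return { labelsByLang: labelsByLang, inSimple: inSimple };
}
```

Asking for ["fa", "zh-hans", "zh-hant"] would cover Persian plus the Simplified and Traditional Chinese names in one pass; languages without a label are simply left out of the result.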

DBpedia also provides a short blurb in each of their languages, but I have different rules for my crossword blurbs, so I prefer parsing the article directly. For example, the article titled “hyperlink” starts with:

In computing, a hyperlink, or simply a link, is a reference to data that the reader can directly follow either by clicking or tapping.

I use the bold markup to hide both “hyperlink” and “link” from the player, which I would’ve missed if I simply did find-and-replace with the title.
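Here is a minimal sketch of that masking step. It assumes the article text is raw wikitext, where bold is written with triple apostrophes ('''term'''); the function names are hypothetical, and the matching is a plain substring search, so a longer word that contains a hidden term would be partly masked as well.

```typescript
// Collect every term wrapped in wikitext bold markup ('''term''').
function boldTerms(wikitext: string): string[] {
  const terms: string[] = [];
  const pattern = /'''(.+?)'''/g;
  let match = pattern.exec(wikitext);
  while (match !== null) {
    terms.push(match[1]);
    match = pattern.exec(wikitext);
  }
  return terms;
}

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Strip the bold markup, then blank out every occurrence of each bold term.
function maskBlurb(wikitext: string, blank: string = "____"): string {
  // Longest terms first, so "hyperlink" is hidden before the shorter "link".
  const terms = boldTerms(wikitext).sort(function (a, b) { return b.length - a.length; });
  let blurb = wikitext.replace(/'''/g, "");
  for (let i = 0; i < terms.length; i++) {
    blurb = blurb.replace(new RegExp(escapeRegExp(terms[i]), "gi"), blank);
  }
  return blurb;
}
```

For the hyperlink sentence above, both bolded words end up as blanks, including any later unbolded mentions of them in the same blurb.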

Again, nothing super-advanced, but it saves a game developer from scraping and trudging through all of these sources just to get at some words.

Future goals

  • More categories of articles, including Arabic and Persian category names
  • Indicators for school-relevant and appropriate content, such as having an article in a UK project which curated Wikipedia articles, counting the supported languages on WikiData, or counting incoming links.
  • Flashcards for English and Chinese
  • Testing support for less common languages (such as Kyrgyz, which adds several letters to Arabic Unicode)
  • Adding/removing diacritics at different reading levels — I met an Arabic NLP postdoc and his advice was to look for dictionary sites which may include all of the tashkeel diacritics. I’ll see if PanLex can help.
  • I’d wanted to do some actual NLP, or perhaps train a system on the Arabic Wiki for use on other wikis, but the NLP expert explained how different these languages are from one another (even within Arabic, to the point that there is a separate Egyptian Arabic wiki).
