AWS Comprehend learns some Arabic

Nick Doiron
4 min read · Nov 11, 2019


On November 6th, Amazon doubled the number of languages in its NLP service Comprehend, taking a more global focus by adding Arabic, Chinese, and several more languages.

Which AWS features support Arabic now?

Comprehend joins Translate, Polly (one text-to-speech voice named Zeina), and Transcribe (Modern Standard Arabic audio files, not streaming) in supporting Arabic.
It’s a technical (and financial) improvement over AWS’s previous advice to use Translate to convert to English before using Comprehend.
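For a single message, the synchronous API now takes Arabic directly, with no Translate step. A minimal sketch; the `client` argument is my own injection point for testing and not part of the AWS API:

```python
def arabic_sentiment(text, client=None):
    """Score one Arabic message with Comprehend's synchronous API.

    By default a real boto3 Comprehend client is created, which
    requires AWS credentials; pass `client` to substitute a stub.
    """
    if client is None:
        import boto3  # only needed for real AWS calls
        client = boto3.client("comprehend")
    resp = client.detect_sentiment(Text=text, LanguageCode="ar")
    return resp["Sentiment"], resp["SentimentScore"]
```
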

Unfortunately I still can’t select Arabic in Comprehend’s Custom Classifiers, or Syntax feature.

On other AWS tools: Lex supports only American English (see Arabot for an Arabic chatbot platform), and Textract (OCR) supports only “Latin-script characters from the standard English alphabet and ASCII symbols”.

What can I do in Comprehend?

The litany of features in AWS Comprehend includes Sentiment Analysis, Topic Modeling, and Custom Classification. For this post, I’ll run through three basic features, which you can try on your own Arabic data:

  • Does sentiment analysis agree with the positive and negative labels used in existing repos?
  • Can I make a classifier with different dialects? This could reveal how broadly Amazon trained their system.
  • Topic Modeling of a few Sufi poems

A Word on Pricing

A minimum-length message costs only 3% of one cent on most Comprehend features. That’s excellent for small projects, but a million-Tweet dataset would push the price into the $250 range.
Verify your pipeline by processing a small file first.
If you have a technical background, you can likely save money by vectorizing the text with Transformers and training your own classifier in TensorFlow or PyTorch.
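A back-of-the-envelope sketch of that math, using the published per-unit rate at the time of writing; the real numbers are tiered and may change, so check the pricing page before trusting these:

```python
# Assumed rates for sentiment analysis (check current AWS pricing):
UNIT_PRICE = 0.0001   # dollars per unit
UNIT_CHARS = 100      # one unit = 100 characters
MIN_UNITS = 3         # every request is billed at least 3 units

def request_cost(num_chars):
    """Cost in dollars for one sentiment request of num_chars characters."""
    units = max(MIN_UNITS, -(-num_chars // UNIT_CHARS))  # ceiling division
    return units * UNIT_PRICE
```

A short Tweet hits the 3-unit minimum, i.e. $0.0003 per message, which is the “3% of one cent” above; multiply by a million messages and you are in the hundreds of dollars.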

Sentiment Analysis

I uploaded 10,000 positive and 10,000 negative lines from Prof. Motaz Saad’s Arabic Sentiment Analysis repo into S3. (Update: check out his Arabic-language NLP lectures).
The process should feel familiar if you’ve used AWS in the past. Reviewing the results manually was a little challenging: I had to match output filenames to tasks, and individual results to their corresponding lines in the source file.
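The async job output I got back had one JSON object per line, each carrying the source file name and zero-based line number, so a small parser can rejoin predictions with inputs. The field names below match my ONE_DOC_PER_LINE job; treat them as an assumption if your job config differs:

```python
import json

def sentiment_by_line(output_text):
    """Map (source file, line number) to the predicted sentiment label."""
    results = {}
    for row in output_text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        rec = json.loads(row)
        results[(rec["File"], rec["Line"])] = rec["Sentiment"]
    return results
```
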

On the negative training set, 25% were labeled positive, 48% were neutral and 1% mixed (categories missing from the original repo), and the remaining 25% were labeled negative.
About a third of these negative-input-but-positive-output messages had >90% confidence, including these two:

أجمل ما حل بي الوقوع بك 🥀 (roughly: “The most beautiful thing that ever happened to me was falling for you 🥀”)
حتى الحيوانات لها قلوب وتحب 💔 #صباح_الخميس (roughly: “Even animals have hearts and love 💔 #Thursday_morning”)

On the positive training set, 11% were labeled negative, 47% neutral, 0.5% mixed, and the remaining 42% positive. Only about an eighth of the positive-input-but-negative-output messages had >90% confidence. Here is one example:

أدري فقدتك ! بس لازلت أنا أبغيك وإذني عن #العذال ما تزال خرسى 💘 #منيف_الخمشي✒ (roughly: “I know I lost you! But I still want you, and my ears stay deaf to the blamers 💘 #Munif_AlKhamshi ✒”)

Interpreting this turned out to be more complex than I expected. It’s possible that my original source made labeling mistakes, and its choice of two polar opposite labels looks less plausible when Comprehend found roughly half of the Tweets to be neutral. It’s also possible that Comprehend is reluctant to classify a Tweet as negative.
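The percentage breakdowns above come from a simple tally of the predicted labels; a minimal sketch:

```python
from collections import Counter

def label_shares(labels):
    """Percentage share of each sentiment label, rounded to one decimal."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```
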

Dialect Classifier

My plan was to use data from the University of British Columbia. Their training dataset has examples of Levantine, Gulf, Egyptian, and Modern Standard Arabic (over 86k rows total). I’ll come back and update this section if the dataset becomes more accessible.
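If and when the data is available, Comprehend custom classifiers train on a headerless two-column CSV (label first, then text); the dialect labels below are my own placeholders, not anything from the UBC dataset:

```python
import csv

def write_training_csv(rows, path):
    """Write (label, text) pairs in the two-column, headerless CSV
    layout that Comprehend custom classifiers train on."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for label, text in rows:
            writer.writerow([label, text])
```
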

Topic Modeling and Key Phrases

I chose two poems from this Harvard blog.
Topic modeling, along with the ‘entities’ and ‘key phrases’ sections, did not pull out interesting phrases; for example, it surfaced وَلَسْتُ, which simply means ‘and I’m not’. Potential reasons this was not fun:

  • length of the content
  • nature of the content (poetry vs. a business news headline)
  • use of tashkeel / vowel signs which is atypical in written Arabic
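On the last point, stripping the tashkeel before submitting the text is cheap to try. This sketch removes the Arabic combining diacritics in the U+064B–U+0652 range:

```python
import re

# Arabic diacritic marks (tashkeel): fathatan through sukun.
TASHKEEL = re.compile("[\u064B-\u0652]")

def strip_tashkeel(text):
    """Return the text with Arabic vowel signs removed."""
    return TASHKEEL.sub("", text)
```
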

Why is Amazon forgotten in NLP?

AWS keeps expanding its features, and Alexa is a successful speech-recognition and question-answering device. Yet Amazon researchers aren’t as well known in NLP as their peers at Google, Facebook, OpenAI, or AllenNLP. They don’t have a model or mega-dataset in the NLP Muppet cinematic universe:

the essential NLP Muppet knowledge map, for serious adults

When I looked up whether engineers had blogged about Amazon’s Arabic text-to-speech or transcription features, I found very few examples. This is a good article about their translation service:

My best guess at why AWS plays a less talkative role in the market:

  • because AWS is so big, NLP products might not be given priority
  • Amazon can safely offer its NLP tools as an add-on feature for corporate customers, rather than marketing them widely
  • Corporate customers are quiet and people’s side projects are noisy; if you have a lot of corporate customers, you have few noisy promoters

2020 Update: check out https://arabic-nlp.herokuapp.com/
