AWS Comprehend learns some Arabic

Nick Doiron
4 min read · Nov 11, 2019


On November 6th, Amazon doubled the number of languages in its NLP service Comprehend, taking a more global focus by adding Arabic, Chinese, and several more languages.

Which AWS features support Arabic now?

Comprehend joins Translate, Polly (one text-to-speech voice named Zeina), and Transcribe (Modern Standard Arabic audio files, not streaming) in supporting Arabic.
It’s a technical (and financial) improvement over AWS’s previous advice to use Translate to convert to English before using Comprehend.
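For a single message, the synchronous API now takes Arabic directly, with no Translate step. A minimal sketch; the `client` argument is my own injection point for testing and not part of the AWS API:

```python
def arabic_sentiment(text, client=None):
    """Score one Arabic message with Comprehend's synchronous API.

    By default a real boto3 Comprehend client is created, which
    requires AWS credentials; pass `client` to substitute a stub.
    """
    if client is None:
        import boto3  # only needed for real AWS calls
        client = boto3.client("comprehend")
    resp = client.detect_sentiment(Text=text, LanguageCode="ar")
    return resp["Sentiment"], resp["SentimentScore"]
```
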

Unfortunately I still can’t select Arabic in Comprehend’s Custom Classifiers, or Syntax feature.

On other AWS tools: Lex supports only American English (see Arabot for an Arabic chatbot platform), and Textract (OCR) supports only “Latin-script characters from the standard English alphabet and ASCII symbols”.

What can I do in Comprehend?

The litany of features in AWS Comprehend includes Sentiment Analysis, Topic Modeling, and Custom Classification. For this post, I’ll run through three basic features, which you can try on your own Arabic data:

  • Does sentiment analysis agree with the positive and negative labels used in existing repos?
  • Can I make a classifier with different dialects? This could reveal how broadly Amazon trained their system.
  • Topic Modeling of a few Sufi poems

A Word on Pricing

A minimum-length message costs only 3% of one cent on most Comprehend features. That’s excellent for small projects, but a million-Tweet dataset would push the price into the $250 range.
Verify your pipeline by processing a small file first.
If you have a technical background, you can likely save money by vectorizing the text with Transformers and training your own classifier in TensorFlow or PyTorch.
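A back-of-the-envelope sketch of that math, using the published per-unit rate at the time of writing; the real numbers are tiered and may change, so check the pricing page before trusting these:

```python
# Assumed rates for sentiment analysis (check current AWS pricing):
UNIT_PRICE = 0.0001   # dollars per unit
UNIT_CHARS = 100      # one unit = 100 characters
MIN_UNITS = 3         # every request is billed at least 3 units

def request_cost(num_chars):
    """Cost in dollars for one sentiment request of num_chars characters."""
    units = max(MIN_UNITS, -(-num_chars // UNIT_CHARS))  # ceiling division
    return units * UNIT_PRICE
```

A short Tweet hits the 3-unit minimum, i.e. $0.0003 per message, which is the “3% of one cent” above; multiply by a million messages and you are in the hundreds of dollars.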

Sentiment Analysis

I uploaded 10,000 positive and 10,000 negative lines from Prof. Motaz Saad’s Arabic Sentiment Analysis repo into S3. (Update: check out his Arabic-language NLP lectures).
The process should feel familiar if you’ve used AWS in the past. Reviewing the results manually was a little challenging: I had to match output filenames to tasks, and individual results to their corresponding lines in the source file.
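The async job output I got back had one JSON object per line, each carrying the source file name and zero-based line number, so a small parser can rejoin predictions with inputs. The field names below match my ONE_DOC_PER_LINE job; treat them as an assumption if your job config differs:

```python
import json

def sentiment_by_line(output_text):
    """Map (source file, line number) to the predicted sentiment label."""
    results = {}
    for row in output_text.splitlines():
        if not row.strip():
            continue  # skip blank lines
        rec = json.loads(row)
        results[(rec["File"], rec["Line"])] = rec["Sentiment"]
    return results
```
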

On the negative training set, 25% were labeled positive, 48% were neutral and 1% mixed (categories missing from the original repo), and the remaining 25% were labeled negative.
About a third of these negative-input-but-positive-output messages had >90% confidence, including these two:

أجمل ما حل بي الوقوع بك 🥀 (roughly: “The most beautiful thing that ever happened to me was falling for you 🥀”)
حتى الحيوانات لها قلوب وتحب 💔 #صباح_الخميس (roughly: “Even animals have hearts and love 💔 #Thursday_morning”)

On the positive training set, 11% were labeled negative, 47% neutral, 0.5% mixed, and the remaining 42% positive. Only about an eighth of the positive-input-but-negative-output messages had >90% confidence. Here is one example:

أدري فقدتك ! بس لازلت أنا أبغيك وإذني عن #العذال ما تزال خرسى 💘 #منيف_الخمشي✒ (roughly: “I know I lost you! But I still want you, and my ears stay deaf to the blamers 💘 #Munif_AlKhamshi ✒”)

Interpreting this turned out to be more complex than I expected. It’s possible that my original source made labeling mistakes, and its choice of two polar opposite labels looks less plausible when Comprehend found roughly half of the Tweets to be neutral. It’s also possible that Comprehend is reluctant to classify a Tweet as negative.
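The percentage breakdowns above come from a simple tally of the predicted labels; a minimal sketch:

```python
from collections import Counter

def label_shares(labels):
    """Percentage share of each sentiment label, rounded to one decimal."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```
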

Dialect Classifier

My plan was to use data from the University of British Columbia. Their training dataset has examples of Levantine, Gulf, Egyptian, and Modern Standard Arabic (over 86k rows total). I’ll come back and update this section if the dataset becomes more accessible.
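If and when the data is available, Comprehend custom classifiers train on a headerless two-column CSV (label first, then text); the dialect labels below are my own placeholders, not anything from the UBC dataset:

```python
import csv

def write_training_csv(rows, path):
    """Write (label, text) pairs in the two-column, headerless CSV
    layout that Comprehend custom classifiers train on."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for label, text in rows:
            writer.writerow([label, text])
```
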

Topic Modeling and Key Phrases

I chose two poems from this Harvard blog.
Topic modeling, along with the ‘entities’ and ‘key phrases’ sections, did not pull out interesting phrases; for example, it surfaced وَلَسْتُ, which simply means ‘and I’m not’. Potential reasons this was not fun:

  • length of the content
  • nature of the content (poetry vs. a business news headline)
  • use of tashkeel / vowel signs which is atypical in written Arabic
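On the last point, stripping the tashkeel before submitting the text is cheap to try. This sketch removes the Arabic combining diacritics in the U+064B–U+0652 range:

```python
import re

# Arabic diacritic marks (tashkeel): fathatan through sukun.
TASHKEEL = re.compile("[\u064B-\u0652]")

def strip_tashkeel(text):
    """Return the text with Arabic vowel signs removed."""
    return TASHKEEL.sub("", text)
```
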

Why is Amazon forgotten in NLP?

AWS keeps expanding its features, and Alexa is a successful speech-recognition and question-answering device. Yet Amazon researchers aren’t as well known in NLP as their peers at Google, Facebook, OpenAI, or AllenNLP. They don’t have a model or mega-dataset in the NLP Muppet cinematic universe:

the essential NLP Muppet knowledge map, for serious adults

When I looked up whether engineers had blogged about Amazon’s Arabic text-to-speech or transcription features, I found very few examples. This is a good article about their translation service:

My best guess at why AWS plays a less talkative role in the market:

  • because AWS is so big, NLP products might not be given priority
  • Amazon can safely offer its NLP tools as an add-on feature for corporate customers, rather than marketing them widely
  • Corporate customers are quiet and people’s side projects are noisy; if you have a lot of corporate customers, you have few noisy promoters

2020 Update: check out https://arabic-nlp.herokuapp.com/
