Transparent data flow with Kedro
Language labeling for Twitter data
During my initial analysis of Twitter disinformation, I saw many Tweets with the wrong language label. To analyze that problem, I needed a data pipeline with these requirements:
- graph-structured: the pipeline is represented as a graph of linked steps A → B → C
- proportionally sized: each step shows how much data passes through it, proving that a step is working correctly and making its throughput comparable to other parts of the pipeline
- inspectable: I want to see the Tweets behind every relabel, for example Korean → Arabic, both to manually check my work and to start building a model of how these mix-ups occur
After paging through many options, I decided on Kedro, a data flow / pipeline framework open-sourced by QuantumBlack.
Filtering pipeline
While writing my pipeline, I’ll test with a million-line CSV (due to multiline content, this is around 312,000 Tweets).
- First, I filter out Retweet, link-only, and media-only content.
- Next, I use Unicode blocks to determine the script. ~99.5% of Twitter's labels were in six languages: Arabic, English, Russian, Japanese, Turkish, and Persian, so I can begin to split these languages by script.
- In Latin script, I should distinguish between English, Turkish, and other languages based on language-specific letters. In Arabic script, I would do the same for Persian, Urdu, and Arabic.
- ~93% of the original Tweets were labeled Arabic by Twitter, so a future neural network step will divide Arabic up into dialects.
Filtering out retweets and empty content
Twitter provides an `is_retweet` column. Here's our first filter node:
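(The original post shows this as a screenshot; the sketch below is my reconstruction, assuming the Tweets arrive as a pandas DataFrame with a boolean `is_retweet` column.)

```python
import pandas as pd


def filter_retweets(tweets: pd.DataFrame) -> pd.DataFrame:
    """Keep only original Tweets; Retweets add no new text to classify."""
    # is_retweet is assumed to be a boolean column in the source CSV
    return tweets[~tweets["is_retweet"]]
```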
In the pipeline file, I set the inputs and outputs, and start writing a new node to accept the original Tweets:
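A sketch of that wiring; the dataset names (`original_tweets` and so on, registered in Kedro's data catalog) are placeholders, since the real names aren't shown here:

```python
from kedro.pipeline import Pipeline, node

from .nodes import filter_retweets, remove_empty


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            # Each node reads a named dataset from the catalog and
            # writes its output under a new name for the next step
            node(filter_retweets, inputs="original_tweets", outputs="no_retweets"),
            node(remove_empty, inputs="no_retweets", outputs="filtered_tweets"),
        ]
    )
```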
The `remove_empty` node is more complex, because I need a RegEx to remove URLs from each Tweet individually.
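A minimal sketch of that node; the `text` column name and the exact URL pattern are my assumptions:

```python
import re

import pandas as pd

# Matches http(s) URLs, so a link-only Tweet reduces to an empty string
URL_PATTERN = re.compile(r"https?://\S+")


def remove_empty(tweets: pd.DataFrame) -> pd.DataFrame:
    """Drop Tweets with no text left after stripping URLs and whitespace."""
    remaining = (
        tweets["text"]
        .str.replace(URL_PATTERN, "", regex=True)
        .str.strip()
    )
    return tweets[remaining != ""]
```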
Unicode blocks
I `pip install unicodeblock` to detect Unicode scripts. After ignoring cross-language blocks such as emoji and symbols, I programmed in a list of blocks for each target script. For example: `Arabic`, `Arabic Supplement`, `Arabic Presentation Forms-A`, and `Arabic Presentation Forms-B` are all parts of the Arabic script.
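`unicodeblock` exposes a per-character lookup, `blocks.of`, which returns a block-name string (e.g. `'ARABIC_PRESENTATION_FORMS_A'`; check the package's block list for exact spellings). A minimal sketch of how such a lookup can pick a dominant script; the block lists and the `Other` fallback here are simplified:

```python
from unicodeblock import blocks

# Partial mapping from script name to unicodeblock block names;
# the full version also covers Cyrillic, Japanese, and more
SCRIPT_BLOCKS = {
    "Arabic": {
        "ARABIC",
        "ARABIC_SUPPLEMENT",
        "ARABIC_PRESENTATION_FORMS_A",
        "ARABIC_PRESENTATION_FORMS_B",
    },
    "Latin": {"BASIC_LATIN", "LATIN_1_SUPPLEMENT", "LATIN_EXTENDED_A"},
}


def dominant_script(text: str) -> str:
    """Count letters per target script and return the most common one."""
    counts = dict.fromkeys(SCRIPT_BLOCKS, 0)
    for char in text:
        if not char.isalpha():
            continue  # skip digits, punctuation, emoji, whitespace
        block = blocks.of(char)  # uppercase block name, or None
        for script, members in SCRIPT_BLOCKS.items():
            if block in members:
                counts[script] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] else "Other"
```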
Hashtags were a difficult decision. Currently I think an English one such as #relationshipgoals is not proof of the user’s language, but a hashtag #أردوغان is a strong hint.
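One way to implement that rule is to strip ASCII-only hashtags before counting scripts, while leaving non-Latin hashtags in place; this is a sketch of the idea, not my exact code:

```python
import re

HASHTAG = re.compile(r"#(\w+)", re.UNICODE)


def strip_ascii_hashtags(text: str) -> str:
    """Drop hashtags like #relationshipgoals, keep ones like #أردوغان."""
    def replace(match: re.Match) -> str:
        # ASCII-only tags carry no script evidence, so delete them
        return "" if match.group(1).isascii() else match.group(0)

    return HASHTAG.sub(replace, text)
```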
Customizing the visualization
At this point, let's see my `kedro-viz` diagram.
We can see our data flow structure, but with the visualization separated from execution, I have an incomplete picture of how data flows through. Is Arabic still >90% of my content? Is removing Retweets throwing out too much? Does my Turkish detection have a bug that will filter out everything?
The flow chart is a D3 visualization, and the frontend framework is React. I bring in hardcoded row counts to see how the diagram looks with proportional area or a logarithmic scale:
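The actual change lives in kedro-viz's D3 code, but the scaling trade-off is easy to show in Python: with proportional area, a 300,000-row node dwarfs a 100-row node entirely, while a log scale keeps both visible. The radius bounds below are arbitrary:

```python
import math


def node_radius(row_count, max_count, min_r=5.0, max_r=50.0):
    """Map a row count to a circle radius on a logarithmic scale."""
    if row_count <= 0:
        return min_r
    fraction = math.log1p(row_count) / math.log1p(max_count)
    return min_r + fraction * (max_r - min_r)


for count in (100, 10_000, 312_000):
    print(count, round(node_radius(count, 312_000), 1))
```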
I’m going to continue with the logarithmic scale — it’s not perfect, but we can still determine a successful pass-through and compare sizes of circles.
- I notice that ‘Other’ is much larger than I’d expected (larger than Latin script). I will review this in future work, but it is mostly emojis, symbols, and Latin hashtags.
- Only 1 or 2 Korean Tweets make it to the end (I previously saw that almost all Korean in this dataset comes from Retweets and/or Arabic text art).
Next I add numbers to the middle of the SVG path. This is a little tricky: you need a `textPath`, and paths going from right to left make text appear upside-down. Flipping the origin and destination on these paths means I must also change the arrow pointers.
Eventually I tweak the styles and offsets to see this:
Language-specific letters
- Turkish has letters which rarely appear in English Tweets: ÇŞĞİÖÜ (plus most of their lower-case equivalents and ı, a dotless i).
- Persian has different code points for digits, and these letters: پ, چ, ژ, گ
- Arabic has a non-Persian diacritic: ْ (sukun)
- Urdu has Persian letters, but can be further separated with these: ٹ, ڈ, ڑ, ں, ے, ھ (a sketch using these letter sets follows below)
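A minimal sketch of how these letter sets might drive the split; checking Urdu before Persian before Arabic is my assumed precedence:

```python
# Letter sets from the list above; lower-case Turkish forms included
TURKISH_LETTERS = set("çşğıöüÇŞĞİÖÜ")
PERSIAN_LETTERS = set("پچژگ")
URDU_LETTERS = set("ٹڈڑںےھ")


def split_latin(text: str) -> str:
    """Within Latin script: Turkish if any Turkish-only letter appears."""
    return "Turkish" if TURKISH_LETTERS & set(text) else "English/other"


def split_arabic_script(text: str) -> str:
    """Within Arabic script: check Urdu-only letters first (Urdu also
    uses the Persian letters), then Persian, then default to Arabic."""
    chars = set(text)
    if URDU_LETTERS & chars:
        return "Urdu"
    if PERSIAN_LETTERS & chars:
        return "Persian"
    return "Arabic"
```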
Inspecting disagreement with Twitter's results
My initial concept was to see Tweets when hovering over a node, but later I thought I might prefer a table, with a row for each disagreement between Twitter's label and my label. To make things easy, I can print this table in the command prompt at the end of each run.
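A sketch of that report; the column names `twitter_lang`, `my_lang`, and `text` are placeholders for whatever the real dataset uses:

```python
import pandas as pd


def print_disagreements(tweets: pd.DataFrame, samples: int = 3) -> None:
    """Print sample Tweets for every (Twitter label, my label) pair
    where the two labels disagree."""
    differing = tweets[tweets["twitter_lang"] != tweets["my_lang"]]
    for (theirs, mine), group in differing.groupby(["twitter_lang", "my_lang"]):
        print(f"\nTwitter: {theirs} / mine: {mine} ({len(group)} Tweets)")
        for text in group["text"].head(samples):
            print(" ", text)
```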
These are samples where I labeled the Tweet as Arabic, Persian, or Urdu, and Twitter disagreed (Undefined/Other, English, Sindhi, Kurdish, and Korean):
There are mistakes here which I want to fix and improve upon. For example, Google Translate and some searching agree that my labels need work (especially my Persian and Urdu examples, which were just more Arabic).
At the end of this post I list some more reliable ways to detect language differences within a script. As I improve the pipeline, I plan to keep this tabular format to quickly review whether the final relabeling results make sense.
Dialects
The University of British Columbia has a dataset labeling Modern Standard Arabic (MSA), Gulf, Levantine, and Egyptian dialects. I used it before in a Google Colab notebook.
This dataset lacks examples of the Maghrebi dialect, which is heard across Morocco, Algeria, and Tunisia. I merged in two other dialect datasets, from Johns Hopkins (github.com/ryancotterell/arabic_dialect_annotation) and Qatar University (https://www.aclweb.org/anthology/L18-1579.pdf).
I organize the data into one CSV with a single format and a dialect label for each row. I get these combined counts for training data:
- Egyptian: 27,454
- Levantine: 20,509
- Maghrebi: 10,541
- Gulf: 67,088
- MSA: 120,257
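The merge itself is simple pandas; the file names and column mappings below are hypothetical, since each source dataset ships in its own format:

```python
import pandas as pd

# Hypothetical file names and column renames for the three sources
SOURCES = [
    ("ubc_dialects.csv", {"tweet": "text", "dialect": "dialect"}),
    ("jhu_dialects.csv", {"sentence": "text", "label": "dialect"}),
    ("qu_dialects.csv", {"content": "text", "class": "dialect"}),
]

frames = [
    pd.read_csv(path).rename(columns=mapping)[["text", "dialect"]]
    for path, mapping in SOURCES
]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("arabic_dialects_combined.csv", index=False)
print(combined["dialect"].value_counts())
```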
I’m having trouble getting a simple neural network classifier working on this data, so I’ll cap the post here and revisit dialect classification in the future.
Future plans
- Design and implement a tagging system for multilingual content that is dialect-sensitive and acknowledges hashtags, internet lingo, and emojis.
- Break up the ‘Arabic’ category into dialects, with a balanced number of messages from each region.
- Break up the ‘Other’ category, labeling Tweets as empty (no meaningful content), as internet emojis and memes, or as specific languages targeted for disinformation.
- Use dictionaries, place-name lists, and letter weighting to more accurately divide languages within the Cyrillic, Arabic, and Latin scripts. Some users use letters that they ‘shouldn’t’, such as گ in Arabic.
- When I am confident that relabeling is working, use a deeper analysis to identify what causes mislabeling by Twitter.