Transparent data flow with Kedro

Language labeling for Twitter data

Nick Doiron
6 min read · Feb 7, 2020

During my initial analysis of Twitter disinformation, I saw many Tweets with the wrong language label. To analyze that problem, I needed a data pipeline with these requirements:

  • data pipeline represented in a graph with linked steps A > B > C
  • proportionally sized: proving that a step is working correctly, and making its path comparable to other parts of the pipeline
  • inspectable — in this context, I want to see Tweets for every relabel, for example Korean→Arabic, to manually check my work, and to start building a model to explain how these mix-ups occur

After paging through many options, I decided on Kedro, a data flow / graph / pipeline framework open-sourced by QuantumBlack:

A Kedro data pipeline visualized in kedro-viz’s dark theme

Filtering pipeline

While writing my pipeline, I’ll test with a million-line CSV (due to multiline content, this is around 312,000 Tweets).

  • First, I filter out Retweets and link-only or media-only content.
  • Next, I use Unicode blocks to determine the script.
    ~99.5% of Twitter’s labels were in six languages: Arabic, English, Russian, Japanese, Turkish, and Persian, so I can begin to split these languages by script.
  • In Latin script, I should distinguish between English, Turkish, and other languages based on language-specific letters.
    In Arabic script, I would do the same for Persian, Urdu, and Arabic.
  • ~93% of the original Tweets were labeled Arabic by Twitter, so a future neural network step will divide Arabic up into dialects

Filtering out retweets and empty content

Twitter provides an is_retweet column. Here’s our first filter node:

In the pipeline file, I set the input and outputs, and start writing a new node to accept the original Tweets:
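
The embedded code for this step is not reproduced in this text, so here is a minimal sketch of what such a node and its pipeline entry might look like. It assumes the Tweets arrive as a list of dicts (one per CSV row), and the dataset names `raw_tweets` and `original_tweets` are hypothetical; only the `is_retweet` column comes from the post. The Kedro import is guarded so the filter itself can be tried without Kedro installed.

```python
# Kedro is optional here so the sketch still runs without it installed.
try:
    from kedro.pipeline import Pipeline, node
except ImportError:
    Pipeline = node = None


def remove_retweets(tweets):
    """Keep only rows whose is_retweet flag is false.

    `tweets` is assumed to be a list of dicts, one per CSV row, so the flag
    may arrive as a bool or as the string "True"/"False".
    """
    return [
        row for row in tweets
        if str(row.get("is_retweet", "")).strip().lower() != "true"
    ]


def create_pipeline():
    """Wire the filter into a one-node pipeline (dataset names are hypothetical)."""
    if Pipeline is None:
        raise RuntimeError("Kedro is not installed")
    return Pipeline([
        node(
            remove_retweets,
            inputs="raw_tweets",
            outputs="original_tweets",
            name="remove_retweets",
        ),
    ])
```

Because a Kedro node is a plain Python function plus named inputs and outputs, the filter can be unit-tested on its own before it is placed in the graph.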

The remove_empty node is more complex, because I need a RegEx to remove URLs from each Tweet individually.
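
A rough sketch of the idea, not the original node: strip anything that looks like a URL, then keep the Tweet only if some text is left. The column name `tweet_text` and the simple `https?://` pattern are assumptions. Media-only Tweets are assumed to show up as a bare link in the text, so the same check would cover them.

```python
import re

# Anything starting with http:// or https:// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs from one Tweet and trim the surrounding whitespace."""
    return URL_PATTERN.sub("", text).strip()


def remove_empty(tweets, text_field="tweet_text"):
    """Drop Tweets that have no text left once URLs are removed.

    `tweets` is assumed to be a list of dicts; `text_field` is a
    hypothetical column name. Kept rows are returned unmodified.
    """
    kept = []
    for tweet in tweets:
        if strip_urls(tweet.get(text_field) or ""):
            kept.append(tweet)
    return kept
```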

Unicode blocks

I pip install unicodeblock to detect Unicode scripts. After ignoring cross-language blocks such as emoji and symbols, I programmed in a list of blocks for each target script. For example: Arabic, Arabic Supplement, Arabic Presentation Forms A, and Arabic Presentation Forms B are all parts of the Arabic script.
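
To illustrate the block-to-script grouping without relying on the unicodeblock package's API, the sketch below hardcodes code point ranges from the Unicode block charts and picks the script with the most characters. The script list, the majority vote, and the choice to count CJK ideographs as Japanese are all assumptions for illustration, not the pipeline's actual rules.

```python
from collections import Counter

# Inclusive code point ranges per script. Characters outside these ranges
# (emoji, ASCII digits and punctuation, symbols) count toward no script.
SCRIPT_RANGES = {
    "Arabic": [
        (0x0600, 0x06FF),  # Arabic
        (0x0750, 0x077F),  # Arabic Supplement
        (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
        (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
    ],
    "Latin": [
        (0x0041, 0x005A), (0x0061, 0x007A),  # Basic Latin letters
        (0x00C0, 0x00D6), (0x00D8, 0x00F6),  # Latin-1 Supplement letters
        (0x00F8, 0x024F),  # rest of Latin-1, Latin Extended-A and -B
    ],
    "Cyrillic": [(0x0400, 0x052F)],  # Cyrillic, Cyrillic Supplement
    # Assumption: CJK ideographs are counted as Japanese in this dataset.
    "Japanese": [(0x3040, 0x30FF), (0x4E00, 0x9FFF)],
    "Korean": [(0x1100, 0x11FF), (0xAC00, 0xD7AF)],  # Hangul Jamo, Syllables
}


def script_of(char):
    """Return the script name for one character, or None if unlisted."""
    code = ord(char)
    for script, ranges in SCRIPT_RANGES.items():
        if any(low <= code <= high for low, high in ranges):
            return script
    return None


def detect_script(text):
    """Return the most frequent script in `text`, or "Other" if none match."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return "Other"
    return counts.most_common(1)[0][0]
```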

Hashtags were a difficult decision. Currently I think an English one such as #relationshipgoals is not proof of the user’s language, but a hashtag #أردوغان is a strong hint.
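
One way to encode that decision, sketched under the assumption that "English" hashtags can be approximated as ASCII-only ones: remove those before script detection and leave every other hashtag in place. The function name and the regex are illustrative.

```python
import re

# A hashtag made only of ASCII letters, digits, and underscores.
# The lookahead leaves tags alone when the ASCII run continues into
# other word characters (for example, a mixed-script hashtag).
ASCII_HASHTAG = re.compile(r"#[A-Za-z0-9_]+(?!\w)")


def drop_ascii_hashtags(text):
    """Remove ASCII-only hashtags and collapse the leftover whitespace."""
    without_tags = ASCII_HASHTAG.sub("", text)
    return " ".join(without_tags.split())
```

A hashtag written in Arabic script does not match the pattern, so it stays in the text and still counts toward the Arabic script total.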

Customizing the visualization

At this point, let’s see my kedro-viz diagram.

kedro-viz, light theme

We can see our data flow structure, but with the visualization separated from execution, I have an incomplete picture of how data flows through. Is Arabic still >90% of my content? Is removing Retweets throwing out too much? Does my Turkish detection have a bug that will filter out everything?

The flow chart is a D3 visualization, and the frontend framework is React. I bring in hardcoded row counts to see how the diagram looks with proportional area or a logarithmic scale:

Proportional area (left) and logarithmic (right)

I’m going to continue with the logarithmic scale — it’s not perfect, but we can still determine a successful pass-through and compare sizes of circles.

  • I notice that ‘Other’ is much larger than I’d expected (larger than Latin script). I will review this in future work, but it is mostly emojis, symbols, and Latin hashtags.
  • Only 1 or 2 Korean Tweets make it to the end (I previously saw that almost all Korean in this dataset comes from Retweets and/or Arabic text art).
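
The difference between the two scales can be sketched numerically. The real change was made in kedro-viz's JavaScript frontend, and its exact formulas are not shown in this post, so the functions below are an assumed Python illustration: one makes circle area proportional to row count, the other makes radius proportional to the logarithm of the count.

```python
import math


def radius_proportional_area(count, max_count, max_radius):
    """Circle area proportional to count, so radius grows with sqrt(count)."""
    if max_count <= 0:
        return 0.0
    return max_radius * math.sqrt(count / max_count)


def radius_log(count, max_count, max_radius):
    """Radius proportional to log(1 + count); a count of 0 gives radius 0."""
    if max_count <= 0:
        return 0.0
    return max_radius * math.log1p(count) / math.log1p(max_count)
```

With a maximum radius of 40, a node holding 2 rows next to one holding 312,000 gets a radius of roughly 0.1 under proportional area and roughly 3.5 under the log scale, which is why tiny nodes stay visible on the logarithmic diagram.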

Next I add numbers to the middle of the SVG path. This is a little tricky, because you need a textPath, and paths going from right-to-left make text appear upside-down; flipping the origin and destination on these means I must also change the arrow pointers.
Eventually I tweak the styles and offsets to see this:
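
The flipping logic can be sketched independently of D3. This is an assumed Python illustration with straight-line edges (kedro-viz draws curves): if an edge runs right-to-left, the path is drawn in reverse so the label reads upright, and the arrow moves from the SVG `marker-end` attribute to `marker-start`. The function name and return shape are hypothetical.

```python
def label_friendly_edge(source, target):
    """Return a left-to-right path plus the attribute that carries the arrow.

    `source` and `target` are (x, y) tuples. The arrow must always sit at
    `target`, whichever direction the path is drawn in.
    """
    if target[0] >= source[0]:
        # Already left-to-right: draw as-is, arrow at the end of the path.
        start, end, arrow = source, target, "marker-end"
    else:
        # Right-to-left: reverse the drawing direction so text on a
        # textPath is upright, and put the arrow at the path's start.
        start, end, arrow = target, source, "marker-start"
    path = "M {} {} L {} {}".format(start[0], start[1], end[0], end[1])
    return {"d": path, "arrow_attribute": arrow}
```

The arrow shape itself also has to point backwards along the path when it is attached to the start, which is the extra change to the arrow pointers mentioned above.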

log scale circles, with numbers

Language-specific letters

  • Turkish has letters which rarely appear in English Tweets: ÇŞĞİÖÜ (plus most of their lower-case equivalents and ı, a dotless i).
  • Persian has different code points for digits, and these letters: پ, چ, ژ, گ
    Arabic has a non-Persian diacritic: ْ
  • Urdu has Persian letters, but can be further separated with these:
    ٹ ,ڈ , ڑ ,ں ,ے , ھ
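
The letter lists above translate into a simple first-pass classifier. This is a naive sketch, not the pipeline's actual code: a single matching letter decides the label, and the order of the checks (Urdu, then the Arabic diacritic, then Persian) is an assumption. Arabic-script characters are written as escapes to avoid any ambiguity in how they render.

```python
# Upper-case letters from the list above, their lower-case forms except
# plain "i", and the dotless i.
TURKISH_LETTERS = set("ÇŞĞİÖÜçşğöüı")

PERSIAN_LETTERS = set("\u067e\u0686\u0698\u06af")  # pe, che, zhe, gaf
# Extended Arabic-Indic digits, the separate code points used for Persian.
PERSIAN_DIGITS = {chr(code) for code in range(0x06F0, 0x06FA)}
# tte, ddal, rre, noon ghunna, yeh barree, heh doachashmee
URDU_LETTERS = set("\u0679\u0688\u0691\u06ba\u06d2\u06be")
ARABIC_SUKUN = "\u0652"


def classify_latin(text):
    """Label Latin-script text as Turkish if any Turkish-specific letter appears."""
    return "Turkish" if TURKISH_LETTERS & set(text) else "English or other"


def classify_arabic_script(text):
    """Label Arabic-script text as Urdu, Persian, or Arabic (the default)."""
    chars = set(text)
    if chars & URDU_LETTERS:
        return "Urdu"
    if ARABIC_SUKUN in chars:
        return "Arabic"
    if chars & (PERSIAN_LETTERS | PERSIAN_DIGITS):
        return "Persian"
    return "Arabic"
```

A rule this blunt will mislabel Tweets whose authors use a letter from another language's list, so it is a starting point for inspection rather than a final answer.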

After adjusting my URL parser, the numbers change again. I later fixed the Turkish-English filter.

Inspecting Disagreement with Twitter’s results

My initial concept was to show Tweets when hovering over a node, but I later decided I would prefer a table with a row for each disagreement between Twitter's label and my label. To keep things easy, I can print this table to the command prompt at the end of each run.
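
A minimal sketch of such a table, assuming each Tweet is a dict with hypothetical `twitter_label`, `my_label`, and `text` fields: group the Tweets by label pair wherever the two labels differ, keep a count and a few samples per pair, and print the largest disagreements first.

```python
from collections import defaultdict


def disagreement_table(tweets, samples_per_pair=3):
    """Group Tweets by (Twitter label, my label) wherever the two differ."""
    table = defaultdict(lambda: {"count": 0, "samples": []})
    for tweet in tweets:
        pair = (tweet["twitter_label"], tweet["my_label"])
        if pair[0] == pair[1]:
            continue  # labels agree, nothing to review
        entry = table[pair]
        entry["count"] += 1
        if len(entry["samples"]) < samples_per_pair:
            entry["samples"].append(tweet["text"])
    return dict(table)


def print_table(table):
    """Print one block per label pair, largest disagreement first."""
    ordered = sorted(
        table.items(), key=lambda item: item[1]["count"], reverse=True
    )
    for (twitter_label, my_label), entry in ordered:
        print("{} -> {}: {} Tweets".format(twitter_label, my_label, entry["count"]))
        for text in entry["samples"]:
            print("    " + text)
```

Capping the samples per pair keeps the output short enough to read in a terminal while still showing what each relabel looks like.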

These are samples where I’d labeled Arabic, Persian, or Urdu language, and Twitter disagreed (Undefined/Other, English, Sindhi, Kurdish, and Korean):

Full Table Link

There are mistakes which I want to fix: Google Translate and web searches agree that my labels need work (especially my Persian and Urdu examples, which were just more Arabic).

At the end of this post I list some more reliable ways to detect language differences within a script in the future. As I continue to improve, I plan to continue with this tabular format to quickly review whether the final relabeling results make sense.

Dialects

The University of British Columbia has a dataset labeling Modern Standard Arabic (MSA), Gulf, Levantine, and Egyptian dialects. I used it before in a Google Colab notebook.

This dataset lacks examples of the Maghrebi dialect, which is heard across Morocco, Algeria, and Tunisia. I merged in two other dialect datasets, from Johns Hopkins (github.com/ryancotterell/arabic_dialect_annotation) and Qatar University (https://www.aclweb.org/anthology/L18-1579.pdf).

I organize the data into a single CSV with one consistent format, and count the examples per dialect. I get these combined counts for training data:

Egyptian: 27,454; Levantine: 20,509; Maghrebi: 10,541; Gulf: 67,088; MSA: 120,257

I’m having trouble with a simple neural network classifier, so I’ll cap the post here and revisit this in the future.

Future plans

  • Design and implement a tagging system for multilingual content, which is dialect-sensitive, and acknowledges hashtags, internet lingo, and emojis.
  • Break up the ‘Arabic’ category into dialects with a balanced number of messages from each region.
  • Break up the ‘Other’ category, to label as empty (no meaningful content), internet emojis and memes, or specific languages targeted for disinformation.
  • Use dictionaries, place name lists, and weighting of letters to more accurately divide languages within Cyrillic, Arabic, and Latin scripts.
    Some users tend to use letters that they ‘shouldn’t’, such as گ in Arabic.
  • When I am confident that relabeling is working, use a deeper analysis to identify what causes mislabeling by Twitter.
