Transparent data flow with Kedro
Language labeling for Twitter data
During my initial analysis of Twitter disinformation, I saw many Tweets with the wrong language label. To analyze that problem, I needed a data pipeline with these requirements:
- graph-structured: the pipeline is represented as a graph of linked steps A → B → C
- proportionally sized: each step shows how much data passes through it, proving that a step is working correctly and making its throughput comparable to other parts of the pipeline
- inspectable: I want to see the Tweets behind every relabel, for example Korean → Arabic, both to manually check my work and to start building a model of how these mix-ups occur
After paging through many options, I decided on Kedro, a data flow / pipeline framework open-sourced by QuantumBlack.
Filtering pipeline
While writing my pipeline, I’ll test with a million-line CSV (due to multiline content, this is around 312,000 Tweets).
- First, I filter out Retweet, link-only, and media-only content.
- Next, I use Unicode blocks to determine the script. ~99.5% of Twitter's labels were in six languages: Arabic, English, Russian, Japanese, Turkish, and Persian, so I can begin to split these languages by script.
- In Latin script, I should distinguish between English, Turkish, and other languages based on language-specific letters. In Arabic script, I would do the same for Persian, Urdu, and Arabic.
- ~93% of the original Tweets were labeled Arabic by Twitter, so a future neural network step will divide Arabic up into dialects.
Filtering out retweets and empty content
Twitter provides an `is_retweet` column. Here's our first filter node:
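(The original post shows this as a screenshot; the sketch below is my reconstruction, assuming the Tweets arrive as a pandas DataFrame with a boolean `is_retweet` column.)

```python
import pandas as pd


def filter_retweets(tweets: pd.DataFrame) -> pd.DataFrame:
    """Keep only original Tweets; Retweets add no new text to classify."""
    # is_retweet is assumed to be a boolean column in the source CSV
    return tweets[~tweets["is_retweet"]]
```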
In the pipeline file, I set the inputs and outputs, and start writing a new node to accept the original Tweets:
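A sketch of that wiring; the dataset names (`original_tweets` and so on, registered in Kedro's data catalog) are placeholders, since the real names aren't shown here:

```python
from kedro.pipeline import Pipeline, node

from .nodes import filter_retweets, remove_empty


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            # Each node reads a named dataset from the catalog and
            # writes its output under a new name for the next step
            node(filter_retweets, inputs="original_tweets", outputs="no_retweets"),
            node(remove_empty, inputs="no_retweets", outputs="filtered_tweets"),
        ]
    )
```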
The `remove_empty` node is more complex, because I need a RegEx to remove URLs from each Tweet individually.
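A minimal sketch of that node; the `text` column name and the exact URL pattern are my assumptions:

```python
import re

import pandas as pd

# Matches http(s) URLs, so a link-only Tweet reduces to an empty string
URL_PATTERN = re.compile(r"https?://\S+")


def remove_empty(tweets: pd.DataFrame) -> pd.DataFrame:
    """Drop Tweets with no text left after stripping URLs and whitespace."""
    remaining = (
        tweets["text"]
        .str.replace(URL_PATTERN, "", regex=True)
        .str.strip()
    )
    return tweets[remaining != ""]
```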
Unicode blocks
I `pip install unicodeblock` to detect Unicode scripts. After ignoring cross-language blocks such as emoji and symbols, I programmed in a list of blocks for each target script. For example: `Arabic`, `Arabic Supplement`, `Arabic Presentation Forms-A`, and `Arabic Presentation Forms-B` are all parts of the Arabic script.
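`unicodeblock` exposes a per-character lookup, `blocks.of`, which returns a block-name string (e.g. `'ARABIC_PRESENTATION_FORMS_A'`; check the package's block list for exact spellings). A minimal sketch of how such a lookup can pick a dominant script; the block lists and the `Other` fallback here are simplified:

```python
from unicodeblock import blocks

# Partial mapping from script name to unicodeblock block names;
# the full version also covers Cyrillic, Japanese, and more
SCRIPT_BLOCKS = {
    "Arabic": {
        "ARABIC",
        "ARABIC_SUPPLEMENT",
        "ARABIC_PRESENTATION_FORMS_A",
        "ARABIC_PRESENTATION_FORMS_B",
    },
    "Latin": {"BASIC_LATIN", "LATIN_1_SUPPLEMENT", "LATIN_EXTENDED_A"},
}


def dominant_script(text: str) -> str:
    """Count letters per target script and return the most common one."""
    counts = dict.fromkeys(SCRIPT_BLOCKS, 0)
    for char in text:
        if not char.isalpha():
            continue  # skip digits, punctuation, emoji, whitespace
        block = blocks.of(char)  # uppercase block name, or None
        for script, members in SCRIPT_BLOCKS.items():
            if block in members:
                counts[script] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] else "Other"
```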
Hashtags were a difficult decision. Currently I think an English one such as #relationshipgoals is not proof of the user’s language, but a hashtag #أردوغان is a strong hint.
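One way to implement that rule is to strip ASCII-only hashtags before counting scripts, while leaving non-Latin hashtags in place; this is a sketch of the idea, not my exact code:

```python
import re

HASHTAG = re.compile(r"#(\w+)", re.UNICODE)


def strip_ascii_hashtags(text: str) -> str:
    """Drop hashtags like #relationshipgoals, keep ones like #أردوغان."""
    def replace(match: re.Match) -> str:
        # ASCII-only tags carry no script evidence, so delete them
        return "" if match.group(1).isascii() else match.group(0)

    return HASHTAG.sub(replace, text)
```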
Customizing the visualization
At this point, let's see my `kedro-viz` diagram.
We can see our data flow structure, but with the visualization separated from execution, I have an incomplete picture of how data flows through. Is Arabic still >90% of my content? Is removing Retweets throwing out too much? Does my Turkish detection have a bug that will filter out everything?
The flow chart is a D3 visualization, and the frontend framework is React. I bring in hardcoded row counts to see how the diagram looks with proportional area or a logarithmic scale:
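The actual change lives in kedro-viz's D3 code, but the scaling trade-off is easy to show in Python: with proportional area, a 300,000-row node dwarfs a 100-row node entirely, while a log scale keeps both visible. The radius bounds below are arbitrary:

```python
import math


def node_radius(row_count, max_count, min_r=5.0, max_r=50.0):
    """Map a row count to a circle radius on a logarithmic scale."""
    if row_count <= 0:
        return min_r
    fraction = math.log1p(row_count) / math.log1p(max_count)
    return min_r + fraction * (max_r - min_r)


for count in (100, 10_000, 312_000):
    print(count, round(node_radius(count, 312_000), 1))
```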
I’m going to continue with the logarithmic scale — it’s not perfect, but we can still determine a successful pass-through and compare sizes of circles.
- I notice that ‘Other’ is much larger than I’d expected (larger than Latin script). I will review this in future work, but it is mostly emojis, symbols, and Latin hashtags.
- Only 1 or 2 Korean Tweets make it to the end (I previously saw that almost all Korean in this dataset comes from Retweets and/or Arabic text art).
Next I add numbers to the middle of the SVG path. This is a little tricky: you need a `textPath`, and paths going from right to left make text appear upside-down. Flipping the origin and destination on these paths means I must also change the arrow pointers.
Eventually I tweak the styles and offsets to see this:
Language-specific letters
- Turkish has letters which rarely appear in English Tweets: ÇŞĞİÖÜ (plus most of their lower-case equivalents and ı, a dotless i).
- Persian has different code points for digits, and these letters: پ, چ, ژ, گ
- Arabic has a non-Persian diacritic: ْ (sukun)
- Urdu has Persian letters, but can be further separated with these: ٹ, ڈ, ڑ, ں, ے, ھ (a sketch using these letter sets follows below)
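A minimal sketch of how these letter sets might drive the split; checking Urdu before Persian before Arabic is my assumed precedence:

```python
# Letter sets from the list above; lower-case Turkish forms included
TURKISH_LETTERS = set("çşğıöüÇŞĞİÖÜ")
PERSIAN_LETTERS = set("پچژگ")
URDU_LETTERS = set("ٹڈڑںےھ")


def split_latin(text: str) -> str:
    """Within Latin script: Turkish if any Turkish-only letter appears."""
    return "Turkish" if TURKISH_LETTERS & set(text) else "English/other"


def split_arabic_script(text: str) -> str:
    """Within Arabic script: check Urdu-only letters first (Urdu also
    uses the Persian letters), then Persian, then default to Arabic."""
    chars = set(text)
    if URDU_LETTERS & chars:
        return "Urdu"
    if PERSIAN_LETTERS & chars:
        return "Persian"
    return "Arabic"
```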
Inspecting disagreement with Twitter's results
My initial concept was to see Tweets when hovering over a node, but later I thought I might prefer a table, with a row for each disagreement between Twitter's label and my label. To make things easy, I can print this table in the command prompt at the end of each run.
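A sketch of that report; the column names `twitter_lang`, `my_lang`, and `text` are placeholders for whatever the real dataset uses:

```python
import pandas as pd


def print_disagreements(tweets: pd.DataFrame, samples: int = 3) -> None:
    """Print sample Tweets for every (Twitter label, my label) pair
    where the two labels disagree."""
    differing = tweets[tweets["twitter_lang"] != tweets["my_lang"]]
    for (theirs, mine), group in differing.groupby(["twitter_lang", "my_lang"]):
        print(f"\nTwitter: {theirs} / mine: {mine} ({len(group)} Tweets)")
        for text in group["text"].head(samples):
            print(" ", text)
```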
These are samples where I labeled the Tweet as Arabic, Persian, or Urdu, and Twitter disagreed (Undefined/Other, English, Sindhi, Kurdish, and Korean):
There are mistakes here which I want to fix and improve upon. For example, Google Translate and some searching agree that my labels need work (especially my Persian and Urdu examples, which were just more Arabic).
At the end of this post I list some more reliable ways to detect language differences within a script. As I improve the pipeline, I plan to keep this tabular format to quickly review whether the final relabeling results make sense.
Dialects
The University of British Columbia has a dataset labeling Modern Standard Arabic (MSA), Gulf, Levantine, and Egyptian dialects. I used it before in a Google Colab notebook.
This dataset lacks examples of the Maghrebi dialect, which is heard across Morocco, Algeria, and Tunisia. I merged in two other dialect datasets, from Johns Hopkins (github.com/ryancotterell/arabic_dialect_annotation) and Qatar University (https://www.aclweb.org/anthology/L18-1579.pdf).
I organize the data into one CSV with a single format and a dialect label for each row. I get these combined counts for training data:
- Egyptian: 27,454
- Levantine: 20,509
- Maghrebi: 10,541
- Gulf: 67,088
- MSA: 120,257
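The merge itself is simple pandas; the file names and column mappings below are hypothetical, since each source dataset ships in its own format:

```python
import pandas as pd

# Hypothetical file names and column renames for the three sources
SOURCES = [
    ("ubc_dialects.csv", {"tweet": "text", "dialect": "dialect"}),
    ("jhu_dialects.csv", {"sentence": "text", "label": "dialect"}),
    ("qu_dialects.csv", {"content": "text", "class": "dialect"}),
]

frames = [
    pd.read_csv(path).rename(columns=mapping)[["text", "dialect"]]
    for path, mapping in SOURCES
]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("arabic_dialects_combined.csv", index=False)
print(combined["dialect"].value_counts())
```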
I’m having trouble getting a simple neural network classifier working on this data, so I’ll cap the post here and revisit dialect classification in the future.
Future plans
- Design and implement a tagging system for multilingual content that is dialect-sensitive and acknowledges hashtags, internet lingo, and emojis.
- Break up the ‘Arabic’ category into dialects, with a balanced number of messages from each region.
- Break up the ‘Other’ category, labeling Tweets as empty (no meaningful content), as internet emojis and memes, or as specific languages targeted for disinformation.
- Use dictionaries, place-name lists, and letter weighting to more accurately divide languages within the Cyrillic, Arabic, and Latin scripts. Some users use letters that they ‘shouldn’t’, such as گ in Arabic.
- When I am confident that relabeling is working, use a deeper analysis to identify what causes mislabeling by Twitter.