On December 20, 2019, Twitter released data on nearly 6,000 accounts that it connected to Saudi information operations / disinformation. The 4.3GB zipfile of text (and over a terabyte of media) is Twitter’s largest disclosure, dwarfing the 1.2GB Twitter text dataset from Russia last year.
Back when I released a dataset of political Tweets, I heard whispers that bot networks were in part driven by Saudi Arabia, including US politics accounts. With a sample of their work now in hand, I wanted to see what I could learn.
1. Language breakdown of Tweets
How much of this info op was about Saudi internal politics, the diplomatic war with Qatar and Iran, and US politics? The first step to examine this would be counting Tweets in Arabic, English, Persian, and other languages.
I used Twitter’s tweet_language column as my source of language.
93% of Tweets were Arabic, and the second highest category, ‘undefined’, was another 4%. The next 15 languages take up nearly 3% of the dataset.
English is the most common language apart from Arabic, but it is a small part (1.5%) of the full dataset.
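A minimal sketch of this kind of language count, assuming the CSV has been loaded into a Pandas DataFrame. The tweet_language column is the one named above; the sample rows, the 'und' code for undefined, and the language_share helper are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for one of Twitter's CSVs; the real files are far larger.
# The language codes used here ('ar', 'en', 'und') are assumptions.
tweets = pd.DataFrame({
    "tweet_language": ["ar", "ar", "ar", "und", "en", "ar", None, "ar"],
})

def language_share(df: pd.DataFrame) -> pd.Series:
    """Return each language's share of Tweets as a percentage."""
    # Label rows with no language so they stay visible in the totals.
    languages = df["tweet_language"].fillna("missing")
    shares = languages.value_counts(normalize=True)
    return (shares * 100).round(1)

shares = language_share(tweets)
print(shares)
```

With the real files, the same function could be applied to each CSV after reading it with pd.read_csv, though per-file percentages would then need to be recomputed from combined counts.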
Disclaimer: I used the Pandas + operator to combine counts from multiple CSVs; if one CSV did not contain any Tweets in a language, that language's total became NaN, meaning I don’t have accurate counts for these.
I found these languages to be truly infrequent. Also, any language outside of the top 10–15 is likely mis-categorized. There are 440 ‘Icelandic’ Tweets with content like ‘Skskskskksk’ or app checkins from Spain.
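The NaN problem in the disclaimer comes from how Pandas aligns Series by index when adding them: a label missing on either side produces NaN. A short sketch with made-up per-file counts (counts_a and counts_b are hypothetical) shows the effect, and one possible workaround using Series.add with fill_value=0. This is not the code used for the charts.

```python
import pandas as pd

# Hypothetical per-file language counts; the second file has no English
# Tweets and the first has no Turkish Tweets.
counts_a = pd.Series({"ar": 900, "en": 15})
counts_b = pd.Series({"ar": 950, "tr": 4})

# Plain addition aligns on the index, so 'en' and 'tr' become NaN.
naive = counts_a + counts_b

# fill_value=0 treats a language missing from one file as a zero count.
safe = counts_a.add(counts_b, fill_value=0).astype(int)

print(naive)
print(safe)
```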
Original Content and Retweets
50% of the Tweets in the dataset are actually Retweets, so I reran the code without them to see only original content (or quote-tweets).
Arabic and ‘undefined’ remain dominant.
Russian, Japanese, and Ukrainian are all highly original (slightly >90%), while Portuguese and German fall in ranking (they are only 33% and 23% original).
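The "percent original" figures can be computed per language with a groupby. The sketch below assumes a boolean is_retweet column alongside tweet_language; that column name and the sample rows are assumptions, not taken from the article.

```python
import pandas as pd

# Made-up sample; 'is_retweet' as a boolean column is an assumption.
tweets = pd.DataFrame({
    "tweet_language": ["ru", "ru", "pt", "pt", "pt", "ar", "ar"],
    "is_retweet": [False, False, True, True, False, True, False],
})

def original_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of Tweets per language that are not Retweets."""
    is_original = ~df["is_retweet"].astype(bool)
    # The mean of a boolean Series is the fraction of True values.
    return is_original.groupby(df["tweet_language"]).mean() * 100

share = original_share(tweets)
print(share.round(1).sort_values(ascending=False))
```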
Tweets in Portuguese notably used phrases such as ‘📣 Projeto Follow Trick ™ 📣’ and ‘To Gain Followers Follow Me’, which look more like an attempt to build up a follower network than a disinformation campaign.
‘Korean’ Tweets were almost all actually Arabic retweets, marked incorrectly due to newlines, emojis, and Korean quotation marks. Here’s a genuine, not info-ops Tweet which was Retweeted by these accounts and labeled Korean:
2. Who were these Tweets for? Most popular mentions
In six top languages (Arabic, English, Russian, Japanese, Turkish, Persian), who were the top still-active accounts mentioned?
These accounts may be targets of disinformation, or they may have been included in conversations to fake genuine social media interactions, so don’t jump to conclusions.
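One way to rank mentioned accounts is to pull @handles out of the Tweet text and count them. This is a rough sketch under that assumption: the regular expression, the top_mentions helper, and the sample Tweets are illustrative, and it does not check whether an account is still active.

```python
import re
from collections import Counter

# Twitter handles are up to 15 characters of letters, digits, or underscore.
MENTION = re.compile(r"@([A-Za-z0-9_]{1,15})")

def top_mentions(texts, n=5):
    """Count @mentions across Tweet texts, ignoring letter case."""
    counts = Counter()
    for text in texts:
        # Lower-case so @Example and @example are counted together.
        counts.update(name.lower() for name in MENTION.findall(text))
    return counts.most_common(n)

# Made-up Tweets for illustration.
sample = [
    "RT @example_news: headline",
    "@example_news @someone_else hello",
    "no mentions here",
    "thanks @Someone_Else and @example_news",
]
top = top_mentions(sample, n=2)
print(top)
```

A pattern this simple also matches the part after the @ in an email address, so results on real text deserve a manual look.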
Not only were English Tweets not as common as you might think, the top accounts were not related to US politics.
The top accounts were related to social media followers, with the tag “#H0MEL3ND”. @LIONxCLAW and @AdryDrive get 241 Retweets on posts like this:
LARRY_B_CAMAPE (social media manager), kvazdopil (photojournalist), vanlovenowS (protected), and varlamov (journalist).
Unlike other languages, only one of the top five accounts was a suspended info ops account.
In Persian: After reviewing Twitter’s ‘Persian’ mentions, the top accounts were mostly Saudi-based accounts posting in Arabic: TopNewsSA featuring news, and SR___74 and 44__kk showing artsy photos and music. I would need to review the mentions for more information.
3. Does my AOC Reply Dataset include actors?
I captured replies to AOC’s account in March 2019. I assumed that info op accounts were created more recently, but reviewing the data, the first banned account was created in January 2007, and overall 88% of accounts were created before March 2019, so overlap was possible.
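A sketch of the creation-date check, assuming the account data has a date column such as account_creation_date (the column name, user IDs, and dates here are made up for illustration):

```python
import pandas as pd

# Hypothetical account rows; the real disclosure has thousands.
accounts = pd.DataFrame({
    "userid": ["a", "b", "c", "d"],
    "account_creation_date": [
        "2007-01-15", "2018-06-01", "2019-02-28", "2019-07-04",
    ],
})

created = pd.to_datetime(accounts["account_creation_date"])
cutoff = pd.Timestamp("2019-03-01")  # start of the month the replies were captured

earliest = created.min()
# Share of accounts that already existed before the cutoff, as a percentage.
share_before = (created < cutoff).mean() * 100

print(earliest.date(), share_before)
```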
Twitter’s disclosures include username hashes, not original usernames, unless the account had many followers. Instead I filtered the Saudi Tweets with a simple cat tweets.csv | grep -i '@aoc' (or RepAOC). She’s mentioned only 20–30 times, mostly to criticize Houthis in Yemen or praise a video, but one user replied to this conversation to recommend La Chula:
Disinfo bots have opinions on tacos, too 🌮 … who knew?
This article is from December 2019. For links to newer models and datasets read https://github.com/mapmeld/use-this-now#arabic-nlp