Nieman Journalism Lab
Pushing to the future of journalism — A project of the Nieman Foundation at Harvard
A 404 error on Sina Weibo

Reverse engineering Chinese censorship: When and why are controversial tweets deleted?

An MIT student is working to detect patterns in the disappearance of thousands of weibos from the Chinese Internet.

Censoring the Chinese Internet must be exhausting work, like trying to stem the flow of a fire hose with your thumb. Sina Weibo, a popular Twitter-like service, says its 300 million registered users post more than 100 million weibos, or tweet-like posts, a day. (In Chinese, weibo means microblog or microblog post.)

And of course the entire Chinese Internet isn’t as censored as some might think. So why are some tweets deleted, not others? Which topics are seen as the biggest threat to harmony?

Chi-Chu Tschang wants to unwrap the black box. Tschang, an MBA student at MIT’s Sloan School and a former China-based correspondent for BusinessWeek, is a student in Ethan Zuckerman’s class this semester, “News in the Age of Participatory Media.” For his final project, Tschang built on data harvested from thousands of deleted weibos in China to look for answers. (I summarized some other interesting ideas from students in a previous post.)

“We know that certain topics are censored from blogs hosted in China, Chinese search engines and Weibos,” Tschang writes in his paper. “But we don’t know where the line lies. Part of the reason is because the line is constantly moving.”

Tschang drew on the work of researchers at the University of Hong Kong’s Journalism and Media Studies Center. Cedric Sam and King-wa Fu helped build WeiboScope, which visualizes the most popular content on Sina Weibo in something close to real time. On top of that app, they built WeiboScope Search, which includes deleted weibos — more than 12,000 since Feb. 1 — in its huge archive.

Using the data visualization software Tableau, Tschang plotted those deleted weibos on a timeline, then superimposed politically sensitive events to provide context.

Chi-Chu Tschang's timeline of censored weibos

The day that saw the highest volume of deletions, in a dataset covering Feb. 1 to May 20, was March 8: the day rumors of Bo Xilai’s fall from power began to spread. Bo was a high-ranking party secretary who was under scrutiny for, among other things, his tremendous apparent wealth. Bo’s son, studying here at Harvard, attracted a lot of attention when he reportedly picked up Jon Huntsman’s daughter in a red Ferrari for a date.

The second-busiest censorship day was March 15, the day Bo was sacked.
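For readers who want to try a similar ranking, here is a minimal Python sketch of counting deletions per day and picking out the busiest ones. It is not Tschang's Tableau workflow. The function name `busiest_days` and the sample dates are hypothetical, and the input is assumed to be one date per weibo, recording the day the crawler first found it missing.

```python
from collections import Counter
from datetime import date


def busiest_days(first_seen_missing, top_n=2):
    """Rank calendar days by how many weibos were first seen missing on them."""
    counts = Counter(first_seen_missing)
    # Highest count first; ties go to the earlier date so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Invented placeholder dates, not values from the WeiboScope archive.
sample = [date(2012, 2, 1)] * 3 + [date(2012, 2, 2)] * 2 + [date(2012, 2, 3)]
print(busiest_days(sample))
```

A ranking like this only reflects when a deletion was detected, so a day with crawler outages can look quieter than it really was.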

Here’s one more interesting data point: On March 18, word spread of a deadly car accident involving a Ferrari (a black one, not a red one). Nearly all information about the crash disappeared from the Internet, fueling speculation about who was involved. Even the word “Ferrari” was censored. Tschang observed moderate deletion activity that day on Sina Weibo.

There is one day of missing data: April 22, the day civil-rights activist Chen Guangcheng escaped from his house arrest in Shandong. Why? An error message dated April 23, the day after, reports “load problems” that temporarily disabled data collection, which is disappointing timing. It could be that the Chinese Weibosphere was so jammed on that momentous day that the servers were crashing. Or it could be something else entirely. (Reader Samuel Wade notes that news of Chen’s escape was not widely known until days later.)1

Tschang crunched the raw data and generated a word cloud, to see which terms in deleted weibos appear most often.

Top 73 censored words from Weibo

Word clouds, though pretty, don’t provide a whole lot of context. Tschang said he wants to examine the list more carefully, filtering out words like the Chinese equivalents of “RT” and “ha ha.” He also wants to examine the relationships of the most censored Weibo users, creating, I don’t know, a Klout for civil disobedience?
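The filtering step Tschang describes could look something like the Python sketch below. It assumes the posts have already been split into tokens (Chinese word segmentation is a separate problem not covered here), and it also drops tokens wrapped in square brackets on the assumption that those are emoticon codes. The names `STOP_TERMS`, `EMOTICON` and `count_terms` are hypothetical.

```python
import re
from collections import Counter

# Noise terms mentioned in the article: the Chinese equivalents of "RT" and "ha ha".
STOP_TERMS = {"转发", "哈哈"}

# Assumption: emoticon codes appear as a single token in square brackets, e.g. "[哈哈]".
EMOTICON = re.compile(r"\[[^\[\]]+\]")


def count_terms(tokenized_posts):
    """Count tokens across posts, skipping stop terms and bracketed emoticon codes."""
    counts = Counter()
    for tokens in tokenized_posts:
        for token in tokens:
            if token in STOP_TERMS or EMOTICON.fullmatch(token):
                continue
            counts[token] += 1
    return counts


# Invented example posts, already tokenized.
posts = [["转发", "Ferrari", "[哈哈]"], ["哈哈", "Ferrari", "crash"]]
print(count_terms(posts).most_common())
```

The stop list is the part that needs human judgment; anything left off it will still dominate the cloud.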

Tschang’s hypothesis — that Sina Weibo deletions correlate highly with spikes in media coverage of sensitive stories — is consistent with the findings of a similar study from researchers at Carnegie Mellon University, who evaluated 56 million weibos, of which about 16 percent were deleted.

Those researchers found some key words were far more likely to get a weibo deleted: Ministry of Truth, Falun Gong, Ai Weiwei, Playboy, to name a few. “By revealing the variation that occurs in censorship both in response to current events and in different geographical areas,” the researchers wrote, “this work has the potential to actively monitor the state of social media censorship in China as it dynamically changes over time.”

Finally, Tschang also measured how long weibos stayed online before they were deleted. He wrote:

The fastest a post was deleted on Sina Weibo was just over 4 minutes. The longest time it took for the censor to get around deleting a message on Sina Weibo was over four months. For the posts created on May 20, 2012 and deleted on the same day, it took on average 11 hours for Weibo Scope Search to detect the deletion.

Tschang said he suspects some weibos get deleted months later because they are about topics that suddenly re-surface in Chinese media.
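As a rough illustration of how figures like these can be computed, the Python sketch below takes pairs of timestamps (when a post was created and when it was first found missing) and reports the fastest, slowest and mean lag. The record layout and the names `detection_lags` and `summarize` are assumptions, not the format of the WeiboScope data. The lag measures detection, not the moment of deletion itself.

```python
from datetime import datetime, timedelta


def detection_lags(records):
    """Return the time between creation and first detection of deletion for each post.

    records: iterable of (created_at, first_seen_missing) datetime pairs.
    """
    return [first_missing - created for created, first_missing in records]


def summarize(lags):
    """Report the fastest, slowest and mean lag for a non-empty list of timedeltas."""
    if not lags:
        raise ValueError("need at least one lag to summarize")
    return {
        "fastest": min(lags),
        "slowest": max(lags),
        "mean": sum(lags, timedelta()) / len(lags),
    }


# Invented timestamps for illustration only.
records = [
    (datetime(2012, 5, 20, 8, 0), datetime(2012, 5, 20, 8, 4)),
    (datetime(2012, 5, 20, 8, 0), datetime(2012, 5, 20, 10, 0)),
]
print(summarize(detection_lags(records)))
```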

Tschang even tried posting spare, scandalous messages to his own Sina Weibo account, just to see what would happen.2

  • Chen Guangcheng
  • Bo Xilai
  • Taiwanese independence

Here’s Tschang:

Less than 14 hours later, I received a message from Sina Weibo’s system administrator informing me that my two posts on “Chen Guangcheng” were “inappropriate” and had been censored. While I can still see the two “Chen Guangcheng” posts on my Sina Weibo account page, no one else can. Surprisingly, my posts on “Bo Xilai” and “Taiwan independence” were not censored.

One caveat: Tschang cannot be 100 percent sure that a deleted weibo was removed by Sina’s “monitoring editors” rather than by its creator. But Sina Weibo’s API makes a helpful distinction in the way it returns data for deleted weibos. The error message for a missing weibo comes back as either “Weibo does not exist” or “Permission denied.” So one could assume, as do Tschang and the HKU researchers, that “Permission denied” equals “censored.” (Sina could also delete spammy weibos from the system, a user-friendly form of censorship.)
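That heuristic is simple enough to write down. The Python sketch below encodes it using the two message strings as quoted in this article; the exact text Sina's API returns may differ, and the function name `classify_missing` and the labels are hypothetical. The labels are deliberately hedged, since the error message alone does not prove who removed a post.

```python
def classify_missing(error_message):
    """Guess why a weibo is missing, based only on the API error text.

    Assumes the two messages quoted in the article; anything else is "unknown".
    """
    message = error_message.strip().lower()
    if message == "permission denied":
        # Treated as censored by Tschang and the HKU researchers.
        return "likely censored"
    if message == "weibo does not exist":
        # Not flagged as censored; could be the author or spam cleanup.
        return "deleted, cause unclear"
    return "unknown"


for text in ["Permission denied", "Weibo does not exist", "timeout"]:
    print(text, "->", classify_missing(text))
```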

And the best time to weibo something politically sensitive in China? After 11 o’clock on a Friday night, according to the data.

“Interestingly, deletion of Sina Weibo messages tend to hit a low on Saturdays,” Tschang wrote. “I’m not too sure why that is, except that maybe censors want to take time off on weekends as well.”
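A day-of-week breakdown like this one is easy to reproduce, and it is worth checking against posting volume, since fewer posts on a given day would also mean fewer deletions. The Python sketch below counts deletions by weekday and, if per-weekday post totals are available, turns them into rates. The names `deletions_by_weekday` and `deletion_rate_by_weekday` and all numbers are invented for illustration.

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def deletions_by_weekday(deletion_dates):
    """Count deletions for each day of the week (date.weekday(): Monday is 0)."""
    counts = Counter(DAY_NAMES[d.weekday()] for d in deletion_dates)
    return {name: counts[name] for name in DAY_NAMES}


def deletion_rate_by_weekday(deletion_dates, posts_by_weekday):
    """Divide deletions by total posts per weekday, skipping days with no posts."""
    deleted = deletions_by_weekday(deletion_dates)
    return {
        name: deleted[name] / posts_by_weekday[name]
        for name in DAY_NAMES
        if posts_by_weekday.get(name)
    }


# Invented numbers: three deletions on a Friday, one on a Saturday.
deleted = [date(2012, 5, 18)] * 3 + [date(2012, 5, 19)]
posted = {"Friday": 30, "Saturday": 10}
print(deletions_by_weekday(deleted))
print(deletion_rate_by_weekday(deleted, posted))
```

In this made-up case Saturday has fewer deletions but the same deletion rate as Friday, which is the kind of effect raw counts alone would hide.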

Notes
  1. Update: It’s important to emphasize this data can be unreliable, as Cedric Sam points out in the comments. Sam says the program does not capture the time of deletion but the first time a weibo is seen as missing. The program runs three to four times a day, crawling the timelines of about 5,000 users, but the crawler is occasionally rate-limited by Sina’s API, or the server crashes under heavy load. So a weibo may be deleted hours or days before Sam’s crawler discovers it missing. That means we should not try to draw definitive conclusions about some days having higher deletion rates than others. There may well be correlations in the data, but there are too many factors to make decisive conclusions. Chi-Chu Tschang’s analysis is an interesting, early insight into how Chinese censorship might work; there is a lot more work to be done. If you’re interested in the topic, Sam talks about this and other Chinese data projects on his Rice Cooker blog.
  2. Post-script: After Tschang posted my story to his Weibo account, the account was deleted.
                                   
  • cedricsam

    Before someone else does, would like to point out that our method’s limitations may hinder some of the conclusions made here.

    We only have the posts that we are able to find. Our method is dependent on many factors — such as the speed at which you can download the data, etc. So, some conclusions may not (and probably are not) valid, such as for finding a “busy day” for the censors, or calculating the average time it takes to delete posts.

    Nonetheless, it is a great tool for flagging important topics deleted on Weibo, and can serve as a guide to finding out big topics that preoccupy the weibosphere in almost real-time.

    Kudos though to Chi-Chu for digging in the data and investigating it, something we didn’t have a chance to do yet.

    -Cedric (from JMSC)

  • ciantic

    The chart’s missing date “April 22, 2012” (though indicated in article) should be indicated IMO in the chart also, with “no data” or something.

    (Edit: The bubble is not a good indicator as other bubbles do have data in the days they explain.)

  • http://andrewphelps.com Andrew Phelps

    P.S., I posted this story to Sina Weibo just to see what happens: http://www.weibo.com/2817376024/ylGzOvuMR

    Alternatively: http://weibo.cn/andrewphelps

    Tschang also posted the story to Weibo and found his account is now gone.

  • anxiaostudio

    Just a note about the word cloud: the majority, perhaps all, of the words are emoticons (Sina Weibo uses words in brackets that are then automatically turned into animated smileys). So naturally those words are popular. I’d love to see a word cloud sans 转发 (RT), 哈哈 (haha) and the emoticons.

  • Elapsetulip

    Yes, definitely you are correct. I also think that their algorithm of extracting keywords should be improved…

  • Zlapse

    extract some “real” key words from the deleted tweets would be interesting, they might be used to understand what is really happening in China….

  • A M I

    my weibo  http://weibo.com/amiblog

  • Carla James

    In relation to this quote by Tschang that “deletion of Sina Weibo messages tend to hit a low on Saturdays.” I work for a circumvention company that designs and propagates tools to Internet users living in China and other restrictive environments. Based on traffic usage trends, we see a spike in activity on our tools over the weekend. One reason may be that China censors may be taking time off over the weekend at the same time that Chinese netizens have greater chances of accessing our tools.  We also see spikes in activity during politically sensitive events, both local and international.

    Cheers,
    Carla James
    @PsiphonChina

  • OLD CHINA HAND

    back in my day this was known as TRAFFIC ANALYSIS

  • cedricsam

    Carla:

    As the ones who compiled the data for Mr. Tschang’s analysis, my humble and simple explanation is that people generally post less on weekends. The trend is seen on Twitter, and on Sina Weibo as well. I doubt it can be due to something as dubious as censors taking weekends off… And we gotta be careful about how to quantify censorship. Are people writing more often, more posts during sensitive events? What is the sampling? Etc. The analysis Chi-Chu did was a nice effort. But because of the nature of our data (as documented on our blog), I wouldn’t endorse any of his conclusions without very large doses of salt.

    -Cedric (from JMSC/HKU)
