Nieman Journalism Lab
Pushing to the future of journalism — A project of the Nieman Foundation at Harvard
Twitter preserved

That plan to archive every tweet in the Library of Congress? Definitely still happening

It has turned out to be quite an undertaking, but the Library plans to make good on its promise to America.

A little more than two years ago, the Library of Congress announced it would preserve every public tweet, ever, for future generations.

That’s right. Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress. That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions.

Fifty million tweets a day. How cute. That number is now 400 million, according to Twitter CEO Dick Costolo. (The first comment on the project’s FAQ page sums up much of the Internet’s reaction: “It’s critical the future generations know what flavor burrito I had for lunch.”)

We hadn’t heard about this project in some time. Last week a story quoted a social-media researcher as saying the LoC “has quietly backed away from the commitment.”

False, said Library spokeswoman Jennifer Gavin; the project is very much still happening. Good librarianship, she said, moves more slowly than Twitter.

“The process of how to serve it out to researchers is still being worked out, but we’re getting a lot closer,” Gavin told me. “I couldn’t give you a date specific of when we’ll be ready to make the announcement.”

The Library first revealed its plans in a tweet on April 14, 2010, but apparently that was before sorting out with Twitter the logistics of acquiring all that data. Petabytes of data.

“We began receiving the material, portions of it, last year. We got that system down. Now we’re getting it almost daily,” Gavin said. “And of course, as I think is obvious to anyone who follows Twitter, it has ended up being a very large amount of material.”

Gavin said the archive will be made available to anyone with a library card, but only on the premises in Washington. “My understanding is that at this time we do not intend to make it available by web,” she said, but that may be subject to change. It’s not meant to be the Ultimate Twitter Search Box we’ve always dreamed of.

In fact, there will be a six-month embargo on fresh tweets (even though, obviously, the data is publicly available — if you can find it). That agreement has been in place since the deal was struck. Twitter said then the tweets could be used only “for internal library use, for non-commercial research, public display by the library itself, and preservation.”

The challenge now is finding ways to refine the raw data in useful ways. Sort by keywords? Date? Sentiment? Burrito flavor? Gavin said the Library is still figuring out the user interface.

  • JTDabbagian

    Anyone else think that this is a SERIOUSLY bad idea? I really think they should let it go… 
    A: All sorts of privacy/embarrassment issues
    B: The task is probably impossible. 

    I think the LoC should look for better things to do. 

  • Colin Rosenthal

    A. All tweets are public, so privacy is not an issue as such. There could be legal problems with, e.g., libellous material.
    B. It’s not impossible, but even if it were, shouldn’t one try to archive as much as possible?

    I think the main problems are a) preserving the link structure of tweets – replies, conversations and b) the most interesting tweets are based around links to web material, so a Twitter archive needs to be linked to a web archive to be really interesting.

  • Andrew Phelps

    A lot of people make the “future generations know what flavor burrito I had for lunch” joke. 

    But then consider this 102-year-old photograph of New York City, which I grabbed from the web at random:

    It is a static image, not particularly well-composed, and it would have been seen as dull in 1910. But look at how much the image communicates about the way we lived then. What was mundane then is archaeological treasure today.

    Likewise, imagine the density of linguistic information encoded in tweets about burritos that appear pointless now but will fascinate future us in the next century. I think people are most interesting when they don’t know they are being studied. Nothing gets closer to the way we really talk (at least online) and the way we think than Twitter.

    And disk space is cheap, so why not?

  • Lee Keels

    If Twitter already stores it, and it can already be searched, why does the Library of Congress need to duplicate it?

  • BanReee

    Sounds like a pretty cool job to me dude. Wow.  

  • Greg Kochanski

    It’s a seriously bad idea to make it available for the first 60 years. Then it’s an amusing way to skewer retired politicians, then after 100 years, it’s valuable history. Give people some privacy and forgetting!

    Put the disks in a climate controlled safe for 100 years. Then try to spin them up. If only 1% still work, that’s OK. We’ll still have gigabytes left.

  • Clyde Smith

    I do think it’s a misuse of the Library of Congress’ resources.

    Not that it should matter, but I say that not just as a blogger who’s on Twitter but as a PhD in Cultural Studies who understands the value of research archives and as a recipient of a Masters in Library Science.

    I’d much rather they focus on making more rare print resources available in digital form.

  • Fashion Cappuccino

    This is such a pointless endeavor. 

  • Athox

    Essentially this is like archiving newspapers. So what’s the problem? If you’re putting it on twitter, don’t effing complain about privacy…

  • Adamantios Koumpis

    The very nature of tweets is ephemeral – the same as much of the content communicated through social networks; of course, there are many good reasons why one *should* archive ephemeral information – especially if this is of the burrito type. Unfortunately a project like this may only be realised because the technology is there. Or because the research challenge is there. Otherwise, there is no good reason at all to spend public money on a project like this.

  • Fun Virginian

    Then you hit the nail on the head. Obviously Twitter uses some type of database to store this. Database mirrors are very common today. It could be mirrored at many sites. As such, the search would be as simple as making the search and then going to the site that is least used or closest to the user, etc. I am oversimplifying it quite a bit. But basically, you are right: the data is already in the database at Twitter, or stored at Twitter, and this can be made available. It is possible that Twitter does not have all this data online and readily available. Some could be off-line. After all, that is a massive amount of data. The problem with the rest of this is privacy concerns. The service never said that the tweets will be made available to the public forever. Once this goes online:
    Expect lawsuits.

  • z s

    Possibly the most useful thing the LoC has ever done. Or is it the other way around? There is no telling when we have official government branches who apparently do nothing but look at Twitter all day.