Feb. 25, 2013, 12:11 p.m.
Reporting & Production

Hiding in public: How the National Archives wants to open up its data to Americans

The agency, home to more than 500 terabytes of electronic files alone, faces some of the same problems that data journalists do.

The National Archives is sitting on massive amounts of information, from specs for NASA projects to geological surveys to letters from presidents. But there’s a problem: “These records are held hostage,” said Bill Mayer, executive for research services for the National Archives and Records Administration.

“Hostage” might be a strong word for an organization responsible for 4.5 million cubic feet of physical documents and more than 500 terabytes of data, most of which can be accessed online or by walking into one of its facilities around the country. But the challenge, Mayer explains, is making NARA’s vast stockpile more open and more discoverable. “They’re held hostage in a number of centers around the country; they’re held hostage by format,” Mayer said.

Mayer and other officials from the National Archives visited MIT recently to talk about how the agency is trying to increase access to records and deal with the challenges, and legal complications, of electronic documents. The archive is responsible for records from executive branch agencies, courts, Congress, and presidents. It preserves only 5 percent of the federal government’s records, and there’s a 15-year lag before records are available. But an estimated 30,000 linear feet of new records come in from agencies annually.

A visual summary of the National Archives’ MIT presentation by Willow Brugh (CC).

To deal with all of that, the archive has to be smarter, quicker, and more technologically savvy in the way it catalogs the nation’s paper trail. In a way, the biggest obstacle the archive faces is itself. “The issue at hand is setting free these records,” Mayer said. “At the heart of what the archive is about is promoting access.”

That’s one of the reasons the archives created an office of innovation last fall. After experimenting around the edges for several years, it was time to put more energy behind finding new ways to surface interesting material and involve the public in the record-keeping process, said Pamela Wright, the archive’s first chief innovation officer.

What started with a small project making archive photos available on Flickr has now expanded into more than 135 projects running on outside platforms, like the Today’s Document Tumblr. The archive works with companies like Ancestry.com, which helps digitize records in exchange for a brief window of exclusive access to the data. They also have a deep partnership with the Wikimedia Foundation. The National Archives has a Wikipedian in Residence who helps coordinate an open transcription project that lets the public transcribe physical documents online through a simple interface. Another project, the Citizen Archivist Dashboard, asks the public to help tag photos and other imagery, as well as contribute edits to a research wiki. It’s a focused approach to crowdsourcing, not unlike the open scientific surveys of the ocean floor or deep space.

The archive’s partnering and outreach are getting results, with an increase in visits to its website, more than 100,000 images in Wikimedia Commons, and almost 100,000 followers on Tumblr. But the goal of the National Archives’ strategy isn’t to chase social media metrics, Wright said: By working with partners and increasing its reach through social media, the archive is fulfilling its mission to make its collections available to the public. “It goes directly to the mission of our agency: You can get at participatory democracy in new ways,” she said. “You are helping your government provide access to the records of the people.”

As more federal records become available in electronic form, the archive faces a new set of complications. One, Mayer said, is that even though the archive can get records more quickly, custody of those records remains with the home agency. So even if that fisheries database you made a FOIA request for is technically at the National Archives, it may still belong to the Department of the Interior for several more years.

Another challenge — one that will come as no surprise to data journalists — is dealing with messy or incomplete federal data. The archive has to work around proprietary or outdated file formats just as newsrooms do, Mayer said. “This is actually the scary monster in the room in terms of format obsolescence,” he said. “We can maintain access to things that are currently available. But in the future? Who knows?” One solution: Work with outsiders. “We’re looking now at how do we work with the developer community,” Wright said, “working with people who want to do things with electronic datasets we can make available now.”

Wright said they want to follow in the footsteps of agencies like NASA that have held hack days and other events for coders. Finding life for the data beyond spreadsheets and XML files would be another way to accomplish their mission of openness and access, Wright said.

Photo of John F. Kennedy, J. Edgar Hoover, and Robert Kennedy from the National Archives’ Flickr account.
