The National Archives is sitting on massive amounts of information — from specs for NASA projects to geological surveys to letters from presidents. But there’s a problem: “These records are held hostage,” said Bill Mayer, executive for research services for the National Archives and Records Administration.
“Hostage” might be a strong word for an organization responsible for 4.5 million cubic feet of physical documents and more than 500 terabytes of data, most of which can be accessed online or by walking into one of its facilities around the country. But the challenge, Mayer explains, is making NARA’s vast stockpile more open and more discoverable. “They’re held hostage in a number of centers around the country; they’re held hostage by format,” Mayer said.
Mayer and other officials from the National Archives visited MIT recently to talk about how the agency is trying to increase access to records and deal with the challenges, and legal complications, of electronic documents. The archive is responsible for records from executive branch agencies, courts, Congress, and presidents. It preserves only 5 percent of the federal government’s records, and there’s a 15-year lag before records are available. Even so, an estimated 30,000 linear feet of new records come in from agencies annually.
To deal with all of that, the archive has to be smarter, quicker, and more technologically savvy in the way it catalogs the nation’s paper trail. In a way, the biggest obstacle the archive faces is itself. “The issue at hand is setting free these records,” Mayer said. “At the heart of what the archive is about is promoting access.”
That’s one of the reasons the archive created an office of innovation last fall. After experimenting around the edges for several years, it was time to put more energy behind finding new ways to surface interesting material and involve the public in the record-keeping process, said Pamela Wright, the archive’s first chief innovation officer.
What started with a small project making archive photos available on Flickr has now expanded into more than 135 projects running on outside platforms, like the Today’s Document Tumblr. The archive works with companies like Ancestry.com, which helps digitize records in exchange for a brief window of exclusive access to the data. It also has a deep partnership with the Wikimedia Foundation. The National Archives has a Wikipedian in Residence who helps coordinate an open transcription project that lets the public transcribe physical documents online through a simple interface. Another project, the Citizen Archivist Dashboard, asks the public to help tag photos and other imagery, as well as contribute edits to a research wiki. It’s a focused approach to crowdsourcing, not unlike the open scientific surveys of the ocean floor or deep space.
The archive’s partnering and outreach are getting results, with an increase in visits to its website, more than 100,000 images in Wikimedia Commons, and almost 100,000 followers on Tumblr. But the goal of the National Archives’ strategy isn’t to chase social media metrics, Wright said: By working with partners and increasing its reach through social media, the archive is fulfilling its mission to make its collections available to the public. “It goes directly to the mission of our agency: You can get at participatory democracy in new ways,” she said. “You are helping your government provide access to the records of the people.”
As more federal records become available in electronic form, a new set of complications arises for the archive. One, Mayer said, is that even though the archive can get records more quickly, custody of those records remains with the home agency. So even if that fisheries database you made a FOIA request for is technically at the National Archives, it may still belong to the Department of the Interior for several more years.
Another challenge — one that will come as no surprise to data journalists — is dealing with messy or incomplete federal data. The archive has to work around proprietary or outdated file formats just as newsrooms do, Mayer said. “This is actually the scary monster in the room in terms of format obsolescence,” he said. “We can maintain access to things that are currently available. But in the future? Who knows?” One solution: Work with outsiders. “We’re looking now at how do we work with the developer community,” Wright said, “working with people who want to do things with electronic datasets we can make available now.”
Wright said they want to follow in the footsteps of agencies like NASA that have held hack days and other events for coders. Finding life for the data beyond spreadsheets and XML files would be another way to accomplish their mission of openness and access, Wright said.
Photo of John F. Kennedy, J. Edgar Hoover, and Robert Kennedy from the National Archives’ Flickr account.