Four crowdsourcing lessons from the Guardian’s (spectacular) expenses-scandal experiment
Okay, question time: Imagine you’re a major national newspaper whose crosstown archrival has somehow obtained two million pages of explosive documents that outed your country’s biggest political scandal of the decade. They’ve had a team of professional journalists on the job for a month, slamming out a string of blockbuster stories as they find them in their huge stack of secrets.
How do you catch up?
If you’re the Guardian of London, you wait for the associated public-records dump, shovel it all on your Web site next to a simple feedback interface and enlist more than 20,000 volunteers to help you find the needles in the haystack.
Your cost for the operation? One full week from a software developer, a few days’ help from others in his department, and £50 to rent temporary servers.
Journalism has been crowdsourced before, but it’s the scale of the Guardian’s project — 170,000 documents reviewed in the first 80 hours, thanks to a visitor participation rate of 56 percent — that’s breathtaking. We wanted the details, so I rang up the developer, Simon Willison, for his tips about deadline-driven software, the future of public records requests, and how a well-placed mugshot can make a blacked-out PDF feel like a detective story.
He offered four big lessons:
— Your workers are unpaid, so make it fun. Willison started coding one week before the Thursday launch date, teamed with a designer on Tuesday and a system administrator on Wednesday, and leaned on everyone in his 15-person department for ad-hoc help on Thursday. But the bulk of the labor would come from Guardian readers.
How to lure them?
By making it feel like a game, said Willison, 28. The Guardian’s four-panel interface (“interesting,” “not interesting,” “interesting but known,” and “investigate this!”) made categorization easy. And the progress bar on the project’s front page immediately gave the community a goal to share.

But a video game needs more than an interface and a score. It needs a narrative — and this project offered that, too.
That was what Willison discovered when, on a whim, he added the Guardian’s mugshots of each MP to their pages in the database. Participation shot up, he said.
“There’s that wonderfully personal element, because everybody in the U.K. has an MP,” Willison said. “You’ve got this big smiling face looking at you while you’re digging through their expenses.”
On Monday, to add a competitive edge, Willison posted lists of the top-performing volunteers. By that point, the project had drawn 36,000 unique visitors and 20,440 participants.
“Any time that you’re trying to get people to give you stuff, to do stuff for you, the most important thing is that people know that what they’re doing is having an effect,” Willison said. “It’s kind of a fundamental tenet of social software. … If you’re not giving people the ‘I rock’ vibe, you’re not getting people to stick around.”
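For readers who want the mechanics spelled out, here is a minimal Python sketch of the pieces described above: the four review buttons, a progress figure for the front-page bar, and the participation rate. The names `Verdict`, `progress_percent` and `participation_rate` are hypothetical and the sample clicks are invented; this is not Willison's code, only an illustration under those assumptions.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    # The four button labels quoted in the article.
    INTERESTING = "interesting"
    NOT_INTERESTING = "not interesting"
    KNOWN = "interesting but known"
    INVESTIGATE = "investigate this!"


def progress_percent(reviewed: int, total: int) -> float:
    """Percentage of documents reviewed so far (what a progress bar would show)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * reviewed / total


def participation_rate(participants: int, visitors: int) -> float:
    """Fraction of unique visitors who reviewed at least one page."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    return participants / visitors


# Hypothetical clicks from a handful of volunteers (invented for illustration).
clicks = [
    Verdict.NOT_INTERESTING,
    Verdict.INTERESTING,
    Verdict.NOT_INTERESTING,
    Verdict.INVESTIGATE,
]
tally = Counter(clicks)

# Figures reported in the article: 20,440 participants out of 36,000 visitors.
rate = participation_rate(20_440, 36_000)
print(f"participation: {rate:.1%}")  # prints "participation: 56.8%"
```

The article's own numbers work out to just under 57 percent, in line with the 56 percent participation rate cited earlier.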
— Public attention is fickle, so launch immediately. Before Parliament released its records Thursday, Willison’s team thought they might be able to postpone their launch to Friday if necessary. When they saw Thursday’s news broadcasts, they realized they’d been wrong. The country’s imagination was caught.
“It became quickly clear on Thursday that it was a huge story, and if we failed to get it out on Thursday, we’d lose a lot of momentum,” Willison said.
The result: No time to load-test the program, perfect the interface, or even set up a system for Guardian reporters to view the vast amount of data that started pouring into their servers. (The first overview wasn’t ready for publication until Monday.)
Some programmers would be uncomfortable in those circumstances. Welcome to journalism, folks.
“We kind of load-tested it with our real audience, which guarantees that it’s going to work eventually,” Willison said impishly. “It’s a very realistic way of debugging the application.”
— Speed is mandatory, so use a framework. Willison’s project was built on Django, the custom Web framework “for perfectionists with deadlines” that he and Adrian Holovaty created for the Lawrence Journal-World. In the world of database programming, a framework is like an offset press: hard to build — Django 1.0 required three years of open-source development — but once it’s set up, there’s no faster way to churn out content. Hand-coding an application like the Guardian’s would have been like publishing a daily newspaper with movable type.
Other frameworks and languages would have worked, too. “You absolutely could build this in Ruby on Rails or in PHP,” Willison said, but “as far as I’m concerned, this is absolutely Django’s sweet spot. This is absolutely what Django is designed to do. … Once I had a designer and a client-side engineer working on the project, I could really just hand it over to them and I didn’t have to worry about the front-end code any more.”
— Participation will come in one big burst, so have servers ready. As well as the Guardian’s first Django joint, this was its first project with EC2, the Amazon contract-hosting service beloved by startups for its low capital costs.
Willison’s team knew they would get a huge burst of attention followed by a long, fading tail, so it wouldn’t make sense to prepare the Guardian’s own servers for the task. In any case, there wasn’t time.
“The Guardian has lead time of several weeks to get new hardware bought and so forth,” Willison said. “The project was only approved to go ahead less than a week before it launched.”
With EC2, the Guardian could order server time as needed, rapidly scaling it up for the launch date and down again afterward. Thanks to EC2, Willison guessed the Guardian’s full out-of-pocket cost for the whole project would be around £50.
As for the software, it was all open-source, freely available to the Guardian — and to anyone else who might want to imitate them. Willison hopes to organize his work in the next few weeks.
“There’s a lot of stuff in there that’s potentially reusable,” Willison said.
Photo of Willison by Matt Patterson used under a Creative Commons license.






ouch. “tenet” not “tenant”. dude. [Fixed. —Josh]
Great post, I’ve just expanded on this with a focus around FMCG & Online companies and how they are using Crowd Sourcing.
Thanks for the post. I’m hoping that other news organizations will copy and improve on this model now that its potential has been shown.
This sort of thing is broadly useful. The “sausage business” relies on the obscurity of information. In the US, the full text of bills is often not available in time for serious review by anyone. The “final” text of the huge US recovery bill was available for less than a day before it was signed into law, and it was in the form of scanned documents with handwritten annotations.
Newspapers can prove their relevance in the future by being ready to enlist readers in stories like these. They can do the hard and sometimes expensive work of securing access to important documents and making them available, and they can provide the systems readers can use to help transcribe it and flag anything that looks fishy. In exchange for their investment in securing the documents, and providing the systems for reviewing them, their professional journalists can then get first crack at the aggregate work of their volunteers.
eas’ point that “the full text of bills is often not available in time for serious review by anyone” is addressed, in the UK at least, by the Free Our Bills campaign.
great article. crowdsourcing is evolutionary.
I did some data entry for the Guardian – it would have been easier if they’d given some more guidelines: what is significant, what is known (other than duckhouses). Also date entry: not all line items have dates – expenses cover a range of dates, so should the date given be the submission date, start date or end date?
Adding these small things might have reduced participation, but I think not substantially, and they may even have increased it, as uncertainty about whether you’re doing it right tends to put one off. Also, settling the date issue would have given them better data.
Data entry could have been speeded up with more standardised options – nearly all MPs I viewed use “Banner” (initially I thought that was advertising) which receipts show is a stationery supplier. Perhaps matching line entries repeated X times could have been used as type-in suggestions.
Impressed they put it all together so quick though. Well done Guardian.
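The type-in suggestion idea in the comment above (reusing line entries that have been repeated X times) could be sketched roughly as follows in Python. The function names, the `min_count` threshold and the sample entries are all hypothetical; nothing here reflects how the Guardian's site actually stored or matched entries.

```python
from collections import Counter


def build_suggestions(entries, min_count=3):
    """Return normalized entries seen at least min_count times, most common first."""
    counts = Counter(e.strip().lower() for e in entries if e.strip())
    return [text for text, n in counts.most_common() if n >= min_count]


def suggest(prefix, suggestions, limit=5):
    """Return up to `limit` suggestions that start with the typed prefix."""
    p = prefix.strip().lower()
    return [s for s in suggestions if s.startswith(p)][:limit]


# Invented sample of supplier names typed in by volunteers.
typed = ["Banner", "banner ", "Banner", "Hotel", "Hotel", "Taxi"]
common = build_suggestions(typed, min_count=2)
print(suggest("ba", common))
```

Normalizing case and whitespace before counting is a design choice in this sketch, so that "Banner" and "banner " count as the same supplier.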
Thanks for the comments, folks.
@eas – I totally agree. As a working reporter who sometimes chases database projects of his own, I can even tell you that almost every substantial public records request includes the question, “but how long will this take for me to read?” I think projects like this are the beginning of a huge shift.
@David – Nice post. Left a comment there for you.
@pbhj – Good point. I was about to say that any further complexity would have cut participation, but I think you’re right about clarity: better definitions for the buttons would have made me (as a participant) more certain that I was being helpful. I was haunted by the fear that I might be messing up.
Simon mentioned another thing I couldn’t figure out how to fit into the post: the buttons are essentially votes, because many of the documents have been viewed multiple times. If people disagree on a document, his system double-flags it.
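To make the voting idea concrete, here is a minimal Python sketch, assuming each button click is stored as a (document, label) pair. The function name and data layout are hypothetical, not Willison's code, and "double-flags" is read here simply as marking a document for a second look when its reviewers disagree.

```python
from collections import defaultdict


def find_disputed(votes):
    """Return the ids of documents whose reviewers chose different labels.

    votes: iterable of (document_id, label) pairs, one per button click.
    """
    labels_by_doc = defaultdict(set)
    for doc_id, label in votes:
        labels_by_doc[doc_id].add(label)
    # A document is disputed when more than one distinct label was chosen.
    return sorted(doc for doc, labels in labels_by_doc.items() if len(labels) > 1)


# Invented sample votes for three documents.
sample_votes = [
    ("doc-1", "not interesting"),
    ("doc-1", "not interesting"),
    ("doc-2", "interesting"),
    ("doc-2", "not interesting"),
    ("doc-3", "investigate this!"),
]
print(find_disputed(sample_votes))
```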
Finally, to clarify my final paragraphs above: Willison says (on his Twitter feed) that though the software he used is open source (and can be used by imitators, etc.), the software he wrote is not.
Lesson 5: make sure you double check your ‘crowdsourced’ facts.
http://foiblesblog.wordpress.com/2009/06/21/guardian-mp-paler-than-we-might-have-led-you-to-believe/
Right on, and good show to the Guardian and Simon for a job well done. If I can add one thing, news organizations should pay special attention to Simon’s last point.
Even more than a framework (we use Ruby on Rails and a bit of Django), Amazon EC2 has allowed us to work miracles online. We’ve been able to go from a standing start to a fully deployed application in a matter of hours.
It takes technology largely (but not completely) out of the equation for these news apps, and allows us to focus on the important stuff.
“Guardian of London” – Manchester actually
I also felt that I might be doing it wrong. I wasn’t sure how to categorise expenses data that looked fine: if I’m categorising cover sheets as ‘not interesting’, should I really be categorising small expenses in the same way?
Same difficulty with expenses for durations, like hotel stays. Another difficulty was that landscape pages were a pain to work with so I tended to skip them.
I generally felt after doing a few pages that there was a real chance I wasn’t doing it right, and that stopped me doing more.
@ Pete – Ouch. Good call.
@ Duncan — originally, yep, but they relocated to the big city in the 60s. They’re definitely a Londocentric outlet now.
Willison says “There’s a lot of stuff in there that’s potentially reusable.”
But the genius of working this way, especially in frameworks like Django or Ruby on Rails, is that there’s no need to worry about stuff being reusable, since it’s so easy to build out bespoke applications tailored to specific projects. The reusable parts are already extracted out into the Django framework itself.
That visitor review idea is really brilliant.
It is an example of how the Internet is enabling the democratization of knowledge.
Bit of a damp squib in the end, don’t you think? The processing has dried up less than half way through the project and all they’ve got from it, as far as I can see, is the need to apologise for publishing some claims without checking them properly. If there are any Guardian people reading, could you give us an idea of what you’ve learned from this?
Why the dig at Movable Type? Last I heard, the Guardian was using Movable Type to power its massively successful “Comment is free” community.
No disagreement about using an application framework vs. using a blogging tool — but I don’t see the need to single out a single application unnecessarily.
Phillip.
Phillip, for the record, we weren’t knocking Movable Type, the blogging software. We were referring to movable type, the stuff Gutenberg invented for printing, which while useful for illuminated Bibles way back when, would make printing a daily newspaper today awfully painful.
Great point, Claire. I emailed it to Willison last week (along with some of the other critiques above), and he said the same.
He said he’d offer a longer response as soon as he’s finished working on version 2.0 of the crowdsourcing site. Stay tuned!
@Joshua Benton
Doh. My mistake. Many apologies, and thanks for the clarification. I guess I should have followed the link. :-)
Phillip.
Lies, Lies and Damn Lies.
(1) Tracing requests to mps-expenses.guardian.co.uk/ reveals that the site (or its content) is NOT hosted on EC2.
(2) 200k worth of documents would consume the bandwidth equivalent of 500 GB of space. I wonder how $50 can pay for that kind of bandwidth consumption. Without going into details, I reckon it should take at least 10-15 instances of EC2 to support the kind of traffic suggested by the article. http://aws.amazon.com/ec2/#pricing
(3) 1 week of django – really ?
I’ve seen tech PR spins before – but this one pwns all.
The lessons learnt about how to effectively crowdsource are really relevant for libraries and archives. I wish more libraries would take them up with the specific aim of enhancing existing content. Another good example of effective crowdsourcing is the Australian Newspapers Digitisation Program http://newspapers.nla.gov.au where the National Library of Australia encourages the public to correct the OCR text of historic newspapers which improves the data quality and therefore search results for everyone. So far 5 million lines of text have been corrected by the public. Read more about it at: http://www.nla.gov.au/ndp/project_details/documents/ANDP_ManyHands.pdf “Many Hands Make Light Work”.
” Imagine you’re a major national newspaper whose crosstown archrival has somehow obtained two million pages of explosive documents that outed your country’s biggest political scandal of the decade. ”
No… I’m a community manager !