Nieman Journalism Lab
Pushing to the future of journalism — A project of the Nieman Foundation at Harvard

Coming soon to journalism: Matt Thompson sees the “Speakularity” and universal instant transcription

Editor’s Note: We’re wrapping up 2010 by asking some of the smartest people in journalism what the new year will bring.

We also want to hear your predictions: Take our Lab reader poll and tell us what you think we’ll be talking about in 2011. We’ll share those results later this week.

Here’s Matt Thompson, he of Newsless, Snarkmarket, and NPR fame.

At some point in the near future, automatic speech transcription will become fast, free, and decent. And this moment — let’s call it the Speakularity — will be a watershed for journalism.

So much of the raw material of journalism consists of verbal exchanges — phone conversations, press conferences, meetings. One of journalism’s most significant production challenges, even for those who don’t work at a radio company, is translating these verbal exchanges into text to weave scripts and stories out of them.

After the Speakularity, much more of this raw material would become available. It would render audio recordings accessible to the blind and aid in translation of audio recordings into different languages. Obscure city meetings could be recorded and auto-transcribed; interviews could be published nearly instantly as Q&As; journalists covering events could focus their attention on analyzing rather than capturing the proceedings.

Because text is much more scannable than audio, recordings automatically indexed to a transcript would be much quicker to search through and edit. Jon Stewart’s crew for The Daily Show uses expensive technology to process and search through the hundreds of hours of video the various news programs air each week. Imagine if that capability were opened up to citizens — if every on-air utterance of every pundit, politician, or policy wonk were searchable on Google.
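The indexing idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (the word-and-timestamp format is invented for the example, not any real captioning standard): store each transcribed word with the second at which it was spoken, and a text search jumps straight to the matching points in the recording.

```python
# Minimal sketch of a timestamp-indexed transcript: each word is stored
# with the second at which it was spoken, so searching the text tells
# you exactly where to seek in the audio or video.

def build_index(words):
    """words: list of (second, word) pairs, e.g. from auto-captioning."""
    index = {}
    for second, word in words:
        index.setdefault(word.lower(), []).append(second)
    return index

def find(index, word):
    """Return every timestamp (in seconds) at which the word was spoken."""
    return index.get(word.lower(), [])

# A toy fragment of a pundit's transcript.
transcript = [(0.0, "Good"), (0.4, "evening"), (3.2, "the"),
              (3.5, "deficit"), (7.1, "deficit"), (7.8, "matters")]
idx = build_index(transcript)
print(find(idx, "deficit"))  # → [3.5, 7.1]
```

Scaled up, the same lookup across every transcribed broadcast is essentially what The Daily Show's expensive system does, and what a public search engine over transcripts would open to everyone.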

The likeliest path to the Speakularity runs through Google. The company has already taken significant steps in this direction. They’ve trained their speech processing algorithms through the millions of queries submitted to Google 411, so that now, my Android phone is already pretty good at recognizing my voice commands. They automatically add captions to YouTube videos and transcribe voicemails through Google Voice. Developers can already call on Google’s voice recognition system when developing apps for Android devices.

The Speakularity itself probably won’t happen in 2011, but I think a key moment might. Let’s say that sometime in 2011, Google unveils a product called Google Transcribe. Not for charity, of course; better transcription = more relevant ads. The core of the product is a speech transcription API: send it audio and get back text in return. But there’s a front end to Transcribe where non-techies can get their mp3s auto-transcribed. Crucially, that app allows the user to manually correct the transcription (highlight a passage and it plays automatically), enabling a human feedback loop that makes the machine better and better over time. In addition to captioning, YouTube videos appear by default next to an automatically generated transcript that users can use for navigation, Debate-Viewer-style.
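The human feedback loop is the crucial piece, and it can be sketched concretely. In this hypothetical example (Google Transcribe is imagined in the passage above, so everything here is an assumption), diffing the machine's transcript against the user's manual correction yields (heard, meant) pairs — exactly the training data that would make the recognizer better over time.

```python
import difflib

def harvest_corrections(machine_text, human_text):
    """Compare the machine transcript with a user's manual correction and
    return (heard, meant) word-span pairs for retraining the recognizer."""
    m_words = machine_text.split()
    h_words = human_text.split()
    pairs = []
    matcher = difflib.SequenceMatcher(a=m_words, b=h_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":  # spans the user rewrote
            pairs.append((" ".join(m_words[i1:i2]), " ".join(h_words[j1:j2])))
    return pairs

print(harvest_corrections(
    "the speak hilarity is coming",
    "the Speakularity is coming"))
# → [('speak hilarity', 'Speakularity')]
```

Each correction a user makes in the front end would flow back as a pair like this, so the app gets free labeled data every time someone fixes a transcript.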

Constant social feedback plus machine learning could improve automatic speech transcription to the point where it’s finally ready for prime time. And when that happens, the default expectation for recorded speech will be that it’s searchable and readable, nearly in the instant. I know this sounds totally retrograde, but I think it’s something like the future.

  • Sally J.

    As an audio archivist responsible for a large oral history collection, I would LOVE for this day to come. But I think it’s much more than a year away.

    Have you read Robert Fortner’s “Rest in Peas” post?

  • Knotty Bitz

    I think you meant to say: “make audio recordings accessible to the DEAF.”

  • Patrick

    I imagine software companies will do what they can to ensure this technology is never free.

    For starters, the best software – like Dragon – is only good for one user per software license. Each license runs from $400 to $5,500. Medical transcription, for example, costs around $22 per transcribed document, so by the end of the year hospitals can rack up massive bills in transcription charges. It’s far cheaper for hospitals to buy a software license for each of their physicians (about $1,200) than it is to have each document manually transcribed. It represents a huge cost savings for the hospitals, but it’s also the bread & butter for the software companies that have developed the technology.

    Hospitals may get the last say as to whether or not the technology will be made available FOC, but this technology is far more sophisticated than most people think, and giving it away won’t be easy for the companies who have invested so much in this type of software.

  • Matt Mireles, SpeakerText

    Hi Matt,

    You’re 100% correct on the need. But you’re wrong about the ability of machines to tackle this problem alone. What you’re asking for is genuine artificial intelligence. And we’re at least 15–20 yrs away from that.

    If you look at what actually comes out of the research labs and works, there’s always some human component––this includes Google, which, btw, was the first search engine to incorporate human link creation into its algorithm.

    SpeakerText combines speech recognition with crowdsourcing to provide the kind of on-demand speech-to-text that you want. It ain’t free, but unlike Google Voice, it’s reliable and it actually works. Check it out:

    -Matt Mireles
    CEO, SpeakerText

  • Tony

    Any thoughts about the ability of LiveScribe pens?

  • Jonathan Stray

    Matt, I think you’re absolutely correct, and I’ve speculated myself on the impact of cheap transcription. It’s going to change reporting, and it’s going to change search, and also archives, and…

    I’m guessing widespread use is probably five years out, and I agree it will be Google.

  • Danielle Desjardins

    I am curious: Is technology really that close to bypassing the quirks of human speech: talking too fast, mumbling, yelling over each other (as in a debate), etc.?

    I’ve worked in television, where speech recognition software is used to produce closed captioning but with the help of a “repeater” who rephrases the contents so they can be understood by a machine.

    I’ve changed paths so I’m not keeping track of the latest developments in that field anymore, hence the curiosity expressed at the beginning of my comment.

  • Pingback: A hypothetical path to the Speakularity « Snarkmarket

  • Brian Hayashi

    I call this the Translation Turing Test: at what point will translation technologies allow anyone to fully experience websites without having to know the provenance of the publisher?

    And then…what happens when English ceases to be the lingua franca? I suspect we’ll quickly learn which benefits have been taken for granted.

  • Alain Saffel

    Ah, more information for the public to wade through and even more they’re unlikely to care about.

    I don’t think this is anywhere close to happening in the next few years.

    It still won’t alter the value of journalists who are already sifting through this information for the valuable bits.

    Journalists and media organizations will remain the aggregators.


  • Pingback: Jonathan Stray » A computational journalism reading list

  • Pingback: Truly Freeing our Sources of News #jcarn » DigiDave - Journalism is a Process, Not a Product