Nieman Foundation at Harvard
Nov. 13, 2014, 1:16 p.m.
LINK: blog.pastpages.org  ➚   |   Posted by: Joshua Benton   |   November 13, 2014

Hopefully you know about PastPages, the tool built by L.A. Times data journalist Ben Welsh to record what some of the web’s most important news sites have on their homepages, hour by hour, every single day. Want to see what The Guardian’s homepage looked like Tuesday night? Here you go. Want to see how that Ebola patient first appeared on DallasNews.com in September? Try the small item here. It’s a valuable service, particularly for future researchers who will want to study how stories moved through new media. (For print media, we have physical archives; for digital news, work even a few years old has an alarming tendency to disappear.)

Anyway, Ben is back with a new tool called StoryTracker, “a set of open source tools for archiving and analyzing news homepages,” backed in part by the Reynolds Journalism Institute at Mizzou.

It offers a menu of options, documented here, for creating an orderly archive of HTML snapshots, extracting hyperlinks with a bonus set of metadata that captures each link’s prominence on the page and visualizing a page’s layout with animations that show changes over time.

The potential uses for researchers are obvious, but I could also imagine plenty of real-time uses. By tracking your own homepage, you could get good data on how the granular movement of stories there correlates with traffic. (To ask questions like: Is the top slot more or less valuable on weekends or overnight than during the day Monday to Friday?) You could track your competition’s homepages to get hard data on what stories they’re pushing hardest. And unlike the base PastPages, which saves screenshots of homepages, StoryTracker gets at the HTML to determine what stories are where. It’s all open source, so have at it. (Here’s a sample analysis to see what sources the Drudge Report links to most.)
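
To give a rough sense of what a "which sources does this homepage link to most" analysis involves, here is a minimal sketch using only Python's standard library. It does not use StoryTracker's own API; the `LinkCollector` class and `count_link_domains` function are hypothetical names invented for this illustration, and the sketch assumes you already have a saved HTML snapshot as a string.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the homepage's own URL.
            self.links.append(urljoin(self.base_url, href))


def count_link_domains(html, base_url):
    """Return a Counter mapping each linked domain to its number of links."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    domains = (urlparse(link).netloc for link in parser.links)
    # Skip links with no network location, such as mailto: addresses.
    return Counter(domain for domain in domains if domain)
```

Calling `count_link_domains(snapshot_html, homepage_url).most_common(10)` would list the ten most-linked domains in one snapshot. Running it over an archive of hourly snapshots and summing the counters is one way to approximate the kind of tally described above. Capturing each link's position on the page, as StoryTracker does, takes more than this sketch attempts.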

Ben presented StoryTracker at a conference at RJI earlier this week; here’s the video and his slide deck.
