The New York Times redesigned its website in 2014, introducing a new homepage, section fronts, and article pages. The Times publishes roughly 230 articles and videos per day, and while new stories began showing up in the revamped presentation, converting the Times’ archival stories to the new format was a challenge.
Starting this week, however, most Times stories published since 2004 are available in the newer article format, and the Times is taking steps to bring the rest of its archives into the new system.
In a blog post published on the Times’ site, software engineer Sofia van Valkenburg and Evan Sandhaus, the Times’ director for search, archives and semantics, wrote that “engineering and resource challenges prevented us from migrating previously published articles into this new design.”
As so often happens, the seemingly ordinary task of content migration quickly ballooned into a complex project involving a number of technical challenges. Turns out, converting the approximately 14 million articles published between 1851–2006 into a format compatible with our current CMS and reader experiences was not so straightforward.
The Times’ archives were in XML format, and the stories had to be converted into JSON to be compatible with its CMS. This process worked smoothly for the paper’s archives from 1851 through 1980. For more recent coverage, however, stories were missing from the archive, which included only the final print edition of stories. Staffers found that in 2004 alone, more than 60,000 stories were published online but not included in the XML archive. As a result, the Times scoured other databases, sitemaps, and analytics to try to capture, in raw HTML format, as many of the stories missing from the XML database as it could.
After locating the missing stories, the Times realized that there were a number of duplicates. Using a technique called shingling, which it initially used for its TimesMachine archive, the paper was able to eliminate many of the duplicates. Of the 60,000 articles from 2004 that weren’t initially found, it was able to match more than 70 percent.
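To make the idea concrete, here is a minimal sketch of shingling as it is commonly described: each text is broken into overlapping runs of words ("shingles"), and two texts are treated as probable duplicates when their shingle sets overlap heavily. The shingle size, the Jaccard similarity measure, the 0.8 threshold, and all function names are assumptions chosen for illustration; the Times has not said here which settings it used.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    # Lowercase and keep only word-like tokens so punctuation does not matter.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield a single short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_probable_duplicate(text_a: str, text_b: str,
                          k: int = 3, threshold: float = 0.8) -> bool:
    """Flag two texts as likely duplicates (threshold is an assumed value)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Comparing shingle sets rather than exact strings means that small differences, such as a corrected word or changed punctuation, still leave two versions of a story scoring as near-identical.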
Once it assembled a more complete list of stories, the Times undertook a six-step process to “derive structured data from raw HTML for items not present in our archive XML”:
1. Given the definitive list of URLs and archive XML for a given year, determine which URLs are missing from the XML.
2. Obtain raw HTML of the missing articles.
3. Compare archive XML and raw HTML to find duplicate data and output the “matches” between XML and HTML content.
4. Re-process the archive XML and convert into JSON for the CMS, taking into account extra metadata from corresponding HTML found in step 3.
5. Scrape and process the HTML that did not correspond to any XML from step 3 and convert into JSON for the CMS.
6. Combine the output from steps 4 + 5 to remove any duplicate URLs.
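The bookkeeping in steps 1 and 6 can be sketched roughly as follows. This is only an illustration: the record shape (a dict with a `url` key), the function names, and the rule that an XML-derived record wins over an HTML-derived one when both share a URL are all assumptions, not details from the Times' post.

```python
def find_missing_urls(all_urls, xml_urls):
    """Step 1: URLs on the definitive list that the archive XML does not cover."""
    return set(all_urls) - set(xml_urls)


def merge_records(xml_records, html_records):
    """Step 6: combine both outputs, keeping exactly one record per URL.

    Assumption: when the same URL appears in both inputs, the XML-derived
    record is kept, because it is written into the dict last.
    """
    merged = {}
    for record in html_records:
        merged[record["url"]] = record
    for record in xml_records:
        merged[record["url"]] = record
    # Sort by URL so the output order is deterministic.
    return sorted(merged.values(), key=lambda r: r["url"])


# Hypothetical usage with made-up URLs:
missing = find_missing_urls(["a.html", "b.html", "c.html"], ["a.html"])
combined = merge_records(
    [{"url": "a.html", "source": "xml"}],
    [{"url": "a.html", "source": "html"}, {"url": "b.html", "source": "html"}],
)
```

Keying the merged output on the URL is what guarantees that no duplicate URLs reach the CMS, whichever source a record came from.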
As part of this process, the Times also improved the SEO on old stories and made them easier to find:
For example, on Feb. 12, 2004, the article “San Francisco City Officials Perform Gay Marriages” appeared under a URL ending with “12CND-FRIS.html.” Realizing we could provide a much more informative link, we derived a new URL from the headline. Now this article is referenced by a URL ending with “san-francisco-city-officials-perform-gay-marriages.html,” a far more intuitive scheme.
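Deriving a URL ending from a headline, as in the example above, might look something like the sketch below. The Times did not publish its exact rules, so the normalization steps (dropping accents, lowercasing, collapsing anything that is not a letter or digit into a single hyphen) and the function names are assumptions.

```python
import re
import unicodedata


def slugify(headline: str) -> str:
    """Turn a headline into a lowercase, hyphen-separated slug."""
    # Decompose accented characters, then drop anything outside ASCII.
    normalized = unicodedata.normalize("NFKD", headline)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace each run of non-alphanumeric characters with one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


def url_ending(headline: str) -> str:
    """Build the final path segment of an article URL (assumed '.html' suffix)."""
    return slugify(headline) + ".html"


print(url_ending("San Francisco City Officials Perform Gay Marriages"))
# → san-francisco-city-officials-perform-gay-marriages.html
```

A real system would also need a policy for two articles that share a headline, which this sketch does not attempt.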
The full post, with more technical details, is available here.
Panic! People love blocking ads! Adblocking could cost the industry as much as $12 billion by 2020! But there’s hope yet, at least according to a new study conducted for the Interactive Advertising Bureau by C3Research released Tuesday. Two-thirds of users with adblockers might be convinced in the future to stop using them, the report suggests. For these users, the report recommended the following best practices (and recommended against certain ad types):
— Give users control: Video skip button, thumbs up/down ratings
— Assure users of site safety: Provide guarantees that site and ads are secure, malware and virus-free, and won’t slow down browsing
— Don’t disrupt their flow with: Ads that block content, long video ads before short video content, ads that follow down the page, autoplay, slow loading (especially on mobile), pop-ups, or full page ads
— In short, implement LEAN principles (Light, Encrypted, AdChoice supported, Noninvasive ads), which address a number of these key issues
Surprise, surprise: The types of ads readers dislike most are the ones that block or delay access to actual website content, overly long video ads before short videos, and ads that follow readers around the site as they read.
For the other, more resistant third of users, the report makes the following recommendations:
— Polite messaging to turn off their ad blocker in exchange for viewing content
— Block content from users of ad blockers who do not turn off their blockers
— In short, implement DEAL (Detect, Explain, Ask, and Lift or Limit)
(DEAL is the primer the IAB Tech Lab released earlier this year for actions publishers can take to mitigate adblock use among readers.) It’s worth noting that a recent study by adblocking company Adblock Plus and marketing company Hubspot found that 32 percent of people surveyed said they wouldn’t turn off their adblockers, and 28 percent said, if blocked from reading the site’s content unless they turned off adblockers, they were more likely to stop visiting that site entirely.
Other tidbits from the IAB-commissioned study:
— Some people in the study were totally confused about what adblockers actually are: 40 percent thought they were using an adblocker, but when asked to clarify and confirm the name of the adblocker they were using, it turned out that only 26 percent really were.
— 15 percent used adblockers on their phones (these users were not confused — they were asked to confirm the name of the blocker they were using).
— Adblock users were more likely to be men, between 18 and 34.
It’s official: Verizon has acquired Yahoo. The companies finally announced the $5 billion deal early Monday, after weeks of rumors all but confirmed the news. The move finally puts to rest the years of speculation about the web pioneer whose fate has looked increasingly grim. For Verizon, which acquired AOL last year, it marks yet another major investment in not only content but ad tech as well.
The move also has some implications for the digital publishing industry as a whole, which will no doubt see some significant shifts thanks to the deal. Here are a few takeaways.
Are 20-year-old tech companies really that different from legacy publishers who migrated to the web? Perhaps not.
Yahoo’s troubles also offer a reminder that scale isn’t everything. Yahoo site traffic trailed just behind traffic to Facebook in the U.S. in May, according to comScore. AOL head Tim Armstrong, however, argues that scale, on the contrary, is the only way AOL and Yahoo can keep up.
Verizon, which already owns a handful of big media sites via its AOL acquisition, now owns a few more. So far, though, Verizon plans to keep them separate. Still, there are residual concerns about the telecom giant having such a major stake in news, thanks to its mishandling of tech site SugarString.
Couldn’t watch the final night of the Republican National Convention last night? Mic had you covered. Through a push notification for mobile and desktop, Mic readers learned they could opt in to receive live SMS messages from the Mic team on happenings at the RNC. Referring to it as “live microjournalism,” this was Mic’s first experiment with live SMS to share stories. “Everyone has so many push alerts on their phone,” said Cory Haik, chief strategy officer for Mic News. “We wanted it to be one where people are saying ‘Yes, I want it.’”
.@Mic is live covering Donald Trump’s RNC speech via SMS starting around 9 EST. Want us to text you? Text TRUMP to 1-316-854-1629.
In a joint effort between Mic’s policy team at the RNC and the New York City editorial team, reporters communicated through Slack channels with the main office, which drafted and sent the official SMS messages. Users who opted in to the texts received a steady stream from the start to the end of the night, including quotes, actions, and links to additional content relevant to event happenings. “It was very helpful for people who aren’t glued to livestream or TV,” said Haik. “It gives them what they need if they’re not in the moment.”
Why use texting over, say, livetweeting updates? The key is context, according to Haik. “If you’re watching Twitter for updates, you have to be following the livestream to get the context, because Twitter is live commentary.” With the texts, “we were doing a quick micro-analysis context for our specific audience and sending that quick update.”
Mic is not the first news outlet to use SMS for news updates, of course. Startup Purple uses SMS as the core of its distribution model.
Before the live text project, Mic experimented with SMS updates through its 23 Ways You Could Be Killed If You Are Black in America story, by allowing users to opt in for an SMS series of the video. It was less than successful, according to Haik: “The content wasn’t compelling in a way for where you really wanted to follow along.” (Haik wouldn’t say how many people signed up for the RNC texts, but she said the number was significant enough to want to continue the SMS experimentation in the future.)
Haik plans for the Mic team to use the SMS update feature again at the upcoming Democratic National Convention and expand beyond just live events to utilize the tool. “When there’s something that people want to know, if you can get in there and provide it at that moment, that’s a better service and utility for SMS.”
The London-based investigative news website Exaro is shutting down, despite assurances less than a week ago that it was still “open for business.” The small outlet was founded in 2011, backed by investor Jerome Booth through his company New Sparta, and has broken several major stories, including releasing a secretly taped meeting between a Sun journalist and Rupert Murdoch in which he revealed that he knew for years that his tabloid’s journalists had been bribing public officials. It also (controversially) broke the news of allegations of child abuse by senior government officials, a story that later rippled through larger news outlets.
Last Friday, David Hencke and Mark Conrad had been appointed to run the site jointly. (The site’s former editor-in-chief Mark Watts was let go; he’d warned last month that the site would go on leaderless and “its small team of about five full-time journalists and researchers appears set to be cut back severely.”)
1/2 Changes at Exaro. @davidhencke is now Head of Exaro News. @markconradhack is Head of News. David and Mark will jointly manage the site.
A board meeting Wednesday, however, instead led to the sudden decision to shut down the site altogether:
Hencke said: “We are absolutely devastated. We were going ahead with plans and had only just put up a story the previous day, with a lot more in the pipeline, and suddenly we are told it’s closed just like that.”
Despite the sudden closure, Hencke praised owner Jerome Booth, saying: “He has funded it very generously for years and never interfered editorially.”
Former editor Mark Watts was fired last month. He told Press Gazette this morning: “The management has made itself a laughing stock by closing just days after announcing that it was ‘open for business’. To just shut it like that is an act of mindless vandalism.”
But, in a tale as old as time, despite financial support from a wealthy owner, the site couldn’t seem to find a sustainable business model, and “a number of attempts to build a business model around the coverage, such as charging for data services and events, have failed to take off,” according to The Guardian. (Anglosphere journalists may remember the similar surprise defunding of The Global Mail in Australia two years ago.)
Terrifically sad — Staff devastated as owners close investigative journalism website Exaro: https://t.co/wCkHUem7NL
On Wednesday, The Financial Times began testing blocking a tiny percentage of registered readers on desktop who have adblockers turned on, reports Jeremy Barr over at Ad Age. Instead of displaying a non-dismissable message to subscribe or a plea to turn off adblockers, the Financial Times is removing entire words from articles. The experiment is a real-life extension of the whimsical tactic of “disemvoweling,” put in place years ago on Boing Boing and Gawker Media sites and suggested not long ago by Washington Post owner Jeff Bezos as a way to extract a little money from interested readers.
According to Ad Age, the blocked words symbolize, roughly, “the percentage of the company’s revenue that comes from advertising.”
The proportion of words blocked isn’t scientific, and the Financial Times doesn’t break out the exact chunk of revenue that comes from ads, said global advertising sales director Dominic Good. “It’s more illustrative than specific,” he said.
The test group comprises registered desktop computer visitors who don’t pay for a subscription, about .075% of the company’s desktop traffic. Some ad-blocking members of this group won’t see any new messaging, some will be asked to whitelist the website’s ads but can still read regardless, some will see articles with many words blanked out if they won’t whitelist the site, and some will be blocked outright if they don’t whitelist the site.
The Financial Times is just the latest of many news organizations to begin campaigns against adblocking, which some forecasts suggest could cost the industry $12 billion in digital advertising revenue by 2020. The New York Times had been testing it, and is using some of what it’s learned to inform a potential “ad-free” digital Times subscription. About 20 publishers in Sweden are teaming up next month to collectively block adblockers, in an IAB Sweden-led initiative. And I’ve even encountered multiple “whitelist us” notices on Ad Age.
The Financial Times is in a relatively good place, making more money from readers than advertisers, and taking in more revenue from digital than print. With its adblocking tests, it’s also released an “advertising charter,” outlining its commitment to better advertising, which includes clearly labeling sponsored content, protecting readers’ privacy, and making sure ads are unobtrusive overall.
Many more people consume news via browsers than via native apps — but those who do use apps spend much more time using them each month, according to a new study out today from the University of Texas’ Engaging News Project.
The study analyzed comScore data (from September 2015) for 25 different news organizations. On average, people accessing sites via a mobile browser spent 3.4 minutes a month; via a desktop browser, 11.7 minutes a month. But for people using a news organization’s mobile app, time spent soared to 95.7 minutes, and for tablet app users, to a remarkable 111.7 minutes.
The flip side of this, though, is that there are many fewer of those app users. The study found roughly 18 desktop browser uniques and 12 mobile browser uniques for every 1 mobile app unique. It’s only logical that people with a strong enough connection to an outlet to download and use its app are more likely to be core, repeat readers than someone who sees a stray link on Facebook. (Nearly half of American adults get news on Facebook, according to a Pew study released earlier this year, and this analysis supports the idea that many visit a news site and then return quickly back to their feeds.)
“Our analysis shows that while questions may still remain about news apps, there’s no denying that users spend a significant amount of time on them,” Engaging News Project director Talia Stroud said in a release.
That question of building loyalty gains importance as Facebook takes steps to bring more content within its walls via Instant Articles, Facebook Live, and the preference it gives its native video player.
The study, meant to be a snapshot of the current digital performance of top news organizations, highlights a number of other trends in news consumption and presentation:
— More ads appear on desktop homepages than on mobile and tablet devices.
— Mobile and tablet apps are more likely to use “hamburger” navigation menus, indicated by a three-line graphic, than desktop versions.
— Apps offer a different news experience than browser-based news sites; they are more likely to give people the option to save content for later, less likely to require users to sign in, less likely to have comment sections, and less likely to include social media buttons.
— Those with low incomes and those who are Black / African-American tend to be underrepresented as news users, particularly on desktop and tablet browsers.
The dream of the chat bot is alive at CNN. On Monday, CNN launched its latest attempt at conversational news with a bot on chat app Kik. As with CNN’s previous effort on Facebook Messenger, the Kik bot will let users get the latest stories in a conversational format that’s meant to feel at home on mobile devices. Kik, which launched its Bot Shop back in April, already counts Yahoo News, Mic, and The Wall Street Journal as publishers developing bots for its platform.
CNN launched the Kik project with an interactive explainer feature about the ongoing Republican convention, which users can learn more about by tapping a series of conversation prompts that offer specific details about what goes on at the event, how long it lasts, who attends, and where the current one is being held.
That kind of information might be basic for seasoned political news junkies, but CNN is approaching the Kik project with the assumption that “the audience there is very young and might not understand as much of what is going on in the news and what it means,” said Masuma Ahuja, CNN’s social apps producer and lead on the news organization’s chat app efforts.
Like Snapchat, Kik has become a go-to chat app for teenagers, who represent around half of the app’s userbase. This understanding means that CNN plans to lean heavily on explainer formats that break big stories down to their basic parts — without assuming that users have been following the news for years prior.
Beyond the interactive features, CNN’s Kik bot will also recommend stories to readers based on previous stories that they’ve read.
CNN’s Kik effort is the latest in a string of uneven attempts at chatbots from news organizations. CNN in particular has gotten some flak for its Facebook Messenger bot, which users have critiqued for being spammy and, at times, sluggish. While publisher chat bots made a big splash on Facebook Messenger in April, few so far have lived up to the hype.
Ahuja said that CNN, like all news organizations, is still figuring out the best formula for chat bots. “It’s a constant learning process. This type of storytelling isn’t new in and of itself, but we’re finding new ways of doing it,” she said.