Nieman Foundation at Harvard
Sept. 3, 2009, 10 a.m.

SEO lessons from Google News: How to promote your stories, straight from the bot’s mouth

One of the keys to success in the online news game is making sure people who might be interested in your content can find it. And the most common path for those seekers goes straight through the multihued logo of search giant Google.

Google’s genius is using algorithms to determine the value of content — what search results best answer a user’s question, which ads are optimal to show on a particular page, and which articles most deserve the attention of a news consumer. The existence of those algorithms has spawned an entire industry dedicated to gaming those systems, in ways both approved and not.

Google would like to encourage the approved ways, of course, so on Tuesday the company posted the above 15-minute video of Googler Maile Ohye describing how news organizations can best ensure their stories are well represented in Google News. In case you don’t want to spend 15 minutes on it, I’ve posted a transcript of the video below. (You’ll want to see the slides Maile is referring to at several points, though.)

Here are five SEO facts I learned from it:

— Google News rates how “trusted” a news source is based on clickthrough data on its stories — another reason to create catchy headlines — but those trust levels are topic specific, so a newspaper could be more trusted as a source on some stories than on others.

— It can detect phrases like “the Los Angeles Times reported” in wire stories and promote the original L.A. Times piece among the many other versions of the story.

— While commentary and satire pieces are welcome inside Google News, they aren’t allowed to be the lead story on a given news subject — that’s a spot reserved for a hard-news story.

— Google News favors JPEGs over PNGs when selecting the pictures that go next to stories. Videos hosted on YouTube (which Google just happens to own) get a boost over videos hosted elsewhere. And you’re better off having at least three digits in the URLs of your stories.

— PageRank — the engine behind Google’s main search results — is used only “delicately” in Google News.

Here’s the full transcript:

Hi! My name is Maile Ohye, and I work at Google as a Developer Programs Tech Lead. I’m so glad to be speaking to you today, because for me, and on behalf of all my colleagues at Google, we understand how important it is to have a strong news ecosystem. So I hope you find something in this presentation that you find useful.

Today we’re going to talk about three main topics. First, the ranking factors of Google News search. Next we’re going to cover some of the frequently asked questions that we hear from publishers or from SEOs. And last, we’re going to talk more about the best practices when you publish articles.

So let’s take a first look at how your articles appear in a Google search result. There are several ways. First is obviously on Google’s main search page, where people might see a news OneBox. And this here, in the upper screenshot, shows you news results for a search like “obama medals,” where now the user is shown some news articles. That’s one way your article can appear in Google News.

On the second screenshot, this is from a user going directly to Google News. And here’s where they see a similar cluster of articles. But instead of the homepage, they’re seeing it on the News homepage.

So you might be asking yourself, “How did these articles appear?” Now, the way we gather these articles is by first crawling them, next grouping them, and then ranking all of the information. And we’ll cover each of these steps more in depth.

Let’s start with crawling. In the crawling stage, much like web search, we have Googlebot, who is going to go out to your news site to look for new articles. And there’s two ways that we retrieve these articles: One is through our discovery crawl, where Google sees new URLs and then crawls those articles. But in addition to that discovery crawl, you can also create news sitemaps. And news sitemaps are a way for you to list exactly what are your new URLs. So we can use that as well, in addition to our discovery crawl, to find your new information.

And of course, we respect the robots exclusion protocol. You can create a robots.txt file, or use HTTP headers, to let us know specifically what documents you want crawled, and what documents you want excluded from Google search results.
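To make the robots.txt point concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules, the `example.com` URLs, and the `allowed` helper are illustrative assumptions, not anything shown in the video. It only demonstrates how a crawler that honors the robots exclusion protocol could decide whether a URL may be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a news site (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /press-releases/

User-agent: *
Disallow: /drafts/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("Googlebot-News", "https://example.com/news/story-123"))
    print(allowed("Googlebot-News", "https://example.com/press-releases/1"))
```

A publisher could run a check like this against its own robots.txt before publishing, to confirm that article URLs are crawlable and excluded directories are not.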

Last, once we’ve crawled and made sure that we have only crawled where we’re allowed to crawl, we bring those articles back to Google. And that’s the end of the crawling phase.

So next, we get into that grouping phase. And here’s where we have this classification idea. In classification, what we’re doing is actually looking at each individual article’s contents. So you can see on this article, “The Millions Kozlowski Didn’t Steal.” We actually take out individual words, like business, Tyco, money, and CFO, and understand that this article pertains to the section of business. That’s how we populate those different sections of Google News, like business, health, and entertainment.

Another thing we’re doing is populating our editions, whether it’s U.K. or U.S. or India. And we can take that from the text as well. Here we’ve taken words like New York and Manhattan, and that led us to believe that this article pertains to the United States. So this is that grouping stage, where we understand what an article is about, and also what sections and editions it pertains to.
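As a rough illustration of this grouping step, the sketch below assigns a section and an edition by counting keyword hits. It is a toy under stated assumptions: the keyword lists and the `best_label` and `classify` helpers are hypothetical, and Google's actual classifier is not described in the talk beyond the idea of pulling out individual words.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a real system would presumably learn these.
SECTION_KEYWORDS = {
    "business": {"business", "tyco", "money", "cfo"},
    "health": {"hospital", "vaccine", "doctor"},
    "entertainment": {"movie", "album", "celebrity"},
}
EDITION_KEYWORDS = {
    "U.S.": {"new york", "manhattan"},
    "U.K.": {"london", "manchester"},
    "India": {"mumbai", "delhi"},
}


def best_label(text, keyword_map):
    """Return the label whose keywords occur most often in text, or None."""
    lowered = text.lower()
    scores = Counter()
    for label, keywords in keyword_map.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            scores[label] += len(re.findall(pattern, lowered))
    label, score = scores.most_common(1)[0]
    return label if score > 0 else None


def classify(text):
    """Guess the section and edition of an article from its words."""
    return {
        "section": best_label(text, SECTION_KEYWORDS),
        "edition": best_label(text, EDITION_KEYWORDS),
    }
```

For example, `classify("Tyco's former CFO moved money around in Manhattan")` would land in the business section and the U.S. edition under these keyword lists.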

So now that we’ve covered crawling and grouping, we now have ranking. And ranking is going to come in two phases. First, of course, is story ranking. Story ranking is much like what you see on the Google News page, where there’s a group of stories, whether it might be Obama and the medal ceremony, or it might be the death of Michael Jackson, or it might be rising oil prices. Story ranking is deciding which of these clusters of stories should be placed first, which second, which third, that type of idea. And we rank these story clusters according to aggregate editorial interest.

So let’s take a deeper look at what that means. In the upper diagram, you can see that a small story has a small effect on publishing activity. Let’s say in North Carolina, a man was giving out free cars to those who really needed them. It’s a great human-interest story. It might be covered in the local newspaper, and also picked up by a few wires. But this is still a relatively small story, not showing as much aggregate editorial interest as, say, a larger story like the death of Michael Jackson, which is not only published in the local newspaper but also in foreign and national papers, covered by many wires, and also includes op-ed articles and follow-up articles. You can see that due to all the editorial interest in this story, we will likely rank it higher than the human-interest story about a man giving out free cars in North Carolina. So that’s story ranking. We are actually ranking those clusters.
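One simple way to picture "aggregate editorial interest" is to count how many distinct sources covered each story cluster. The sketch below does exactly that and nothing more. The `rank_story_clusters` function and the sample data are assumptions for illustration, and the real signal is surely richer than a source count.

```python
from collections import defaultdict


def rank_story_clusters(articles):
    """Rank story clusters by how many distinct sources covered them.

    articles: iterable of (cluster_name, source_name) pairs.
    Returns cluster names, most widely covered first (ties broken by name).
    """
    sources_by_cluster = defaultdict(set)
    for cluster, source in articles:
        sources_by_cluster[cluster].add(source)
    return sorted(
        sources_by_cluster,
        key=lambda cluster: (-len(sources_by_cluster[cluster]), cluster),
    )


if __name__ == "__main__":
    sample = [
        ("free cars in North Carolina", "local paper"),
        ("free cars in North Carolina", "wire A"),
        ("death of Michael Jackson", "local paper"),
        ("death of Michael Jackson", "national paper"),
        ("death of Michael Jackson", "foreign paper"),
        ("death of Michael Jackson", "wire A"),
    ]
    print(rank_story_clusters(sample))
```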

The next part about ranking is the individual article ranking. Article ranking helps us take a cluster of stories — say the death of Michael Jackson — and helps us determine out of those 200 stories, which one should be ranked first for our users, which should be ranked second and so on. There are many signals that go into article ranking, but I am just going to cover four of the major ones for you here.

First is fresh and new. It’s important to us that an article contain recent substantial information about a news topic, and it needs to be objective news to lead this cluster of stories. So press releases, satire, op-eds aren’t eligible to lead clusters.

Another factor is duplication and novelty detection. And that’s where we try to determine an original source of content from those that are duplicating the information. So something that we use there is this idea of citation rank. So per article, we can see that if a news story was broken by the Los Angeles Times, and then later another article, say in Washington, cited the Los Angeles Times as having been a source of their information, we can start to see the citation rank taking place for this story — that this article from the Los Angeles Times might have higher ranking now, because other people are citing it as being an original story.
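The citation idea can be sketched as counting how many articles from other outlets credit a given source. This is a toy approximation, not Google's citation rank: the `citation_counts` function, the "<source> reported" phrase pattern, and the sample articles are all assumptions.

```python
import re
from collections import Counter


def citation_counts(articles, known_sources):
    """Count, per source, how many articles from OTHER sources credit it.

    articles: list of dicts with 'source' and 'text' keys.
    known_sources: list of publication names to look for.
    """
    counts = Counter()
    for article in articles:
        for source in known_sources:
            if source == article["source"]:
                continue  # a paper citing itself does not count
            pattern = re.escape(source) + r"\s+reported"
            if re.search(pattern, article["text"], flags=re.IGNORECASE):
                counts[source] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"source": "Los Angeles Times", "text": "An original investigation."},
        {"source": "Washington paper", "text": "As the Los Angeles Times reported, ..."},
    ]
    print(citation_counts(sample, ["Los Angeles Times", "Washington paper"]))
```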

Another factor is local and personal relevancy, and this applies to individual sections as well as editions of your publication. So what we want to do is actually give more weight to local sources that are likely more relevant to the news item. So if we take that idea of a man giving out free cars in North Carolina, it’s likely that we’d take a paper like the Charlotte Observer and know that that could be a higher authority for that story. And, therefore that article might be ranked higher in this cluster.

The last signal I want to cover in article ranking is the idea of trusted sources. For us, trusted sources doesn’t have to do with some arbitrary decision that we make, but it’s actually data driven. So, according to our data over time, did users start to look at your articles and then click on them? Let’s say that there were five articles being listed and a significant number of users chose the third article and went to that source. Then we might start to determine that this source is actually very trusted for that certain type of information, and over time we start to build out which publications are trusted sources, but not for their entire publication. It is done on a section and category basis, so something like The Sporting News could be very trusted for sports information, but maybe not so much for business. And likely something like the Wall Street Journal might be very trusted in the United States for business information, but maybe not in India. So again these trusted sources have to do with section and edition, so it’s a very specific thing that we’re looking for due to aggregate user behavior. So, those are just four of the signals that we use in news search article ranking.
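The key detail here is that trust is tracked per section and edition rather than per publication. A minimal sketch of that bookkeeping, assuming trust is approximated by a plain clickthrough rate, might look like the following. The `clickthrough_rates` function and the event format are hypothetical.

```python
from collections import defaultdict


def clickthrough_rates(events):
    """Compute clickthrough rate per (source, section, edition).

    events: iterable of (source, section, edition, was_clicked) tuples,
    one tuple per time an article from that source was shown to a user.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for source, section, edition, was_clicked in events:
        key = (source, section, edition)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {key: clicked[key] / shown[key] for key in shown}
```

Keying on the full (source, section, edition) tuple is what lets the same outlet score well for sports and poorly for business, as in the Sporting News example above.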

Next, let’s go into some of your frequently asked questions. You might be asking: What are the benefits of submitting a news sitemap? Well, we think that sitemaps are beneficial to us and to you as a publisher as well.

First of all they provide you greater control over which of your articles appear in Google News. And that is because, as I mentioned earlier, they help complement our discovery crawl, and tell us exactly what articles are new and which articles we should crawl.

Second, news sitemaps are great because they help you give us metainformation about your articles. So, rather than rely on our extractor, you can give us the publication date and rather than rely on just our extractor to determine the categories for your articles, you can give us good hints by using the keywords field. So, all in all we think news sitemaps provide a large benefit to publishers.
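As a hedged illustration, the Python sketch below builds a small news sitemap carrying the two pieces of metainformation mentioned here, a publication date and keywords. The namespace URIs and tag names reflect the news sitemap format as I understand it, and the sample article is made up. The real format has additional fields and has changed over time, so treat this as an incomplete sketch and check Google's current documentation before relying on it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed namespace for Google's news sitemap extension.
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_news_sitemap(articles):
    """Return sitemap XML text for a list of article dicts.

    Each dict needs 'loc', 'publication_date' and 'keywords' (a list).
    """
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("news", NEWS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for article in articles:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = article["loc"]
        news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
        date = ET.SubElement(news, f"{{{NEWS_NS}}}publication_date")
        date.text = article["publication_date"]
        keywords = ET.SubElement(news, f"{{{NEWS_NS}}}keywords")
        keywords.text = ", ".join(article["keywords"])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_news_sitemap([{
        "loc": "https://example.com/business/story-12345",
        "publication_date": "2009-09-03T10:00:00-04:00",
        "keywords": ["business", "Tyco"],
    }]))
```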

Another frequently asked question is: Can Googlebot visit our URLs more than once? And, the answer is yes, we can definitely recrawl URLs to check for updates. But just taking a step back, initially Googlebot can actually find your new content within a matter of minutes of when you published it. And, we find your new content through our discovery crawl or through news sitemaps, and after that initial discovery, we will definitely go back and recheck for new article content. The recrawl rate varies, but it’s pretty safe to say that we will probably go back and check for new content within 12 hours. So we’ll find it within a matter of minutes and we’ll recrawl for new content within 12 hours.

You might also be asking: How do I optimize my multimedia content? Well, that’s a great question. So we’re going to take a look at two types of content. First, let’s talk about videos. With videos, you can create a YouTube channel and submit that to us. We are looking to include other types of video hosters, but right now with YouTube, we have a pretty good idea of the user experience, that the video will load, etc. So YouTube is a trusted video hoster platform for us. And if you do use YouTube, remember that including textual descriptions and transcripts is also helpful because that helps us associate a specific video with the subject matter.

Now let’s talk about images. With images, we have five tips that will help you get your images included in Google News Search.

— First, you want to use a large-sized image with good aspect ratio.

— Second, you want descriptive captions and alt text.

— Third, you want to keep your good image near the title and that again helps us associate an image with the subject matter.

— Fourth, you want your good image to be inline and not a clickable version. So again, you want your good image near the title and in line.

— And last, we prefer JPEG. So if you use things like PNG images, that’s not as good for Google News as JPEG, so I would definitely stick with JPEG if you would like your images included in Google News.

So the last frequently asked question of course is: What about PageRank? PageRank is a lesser factor in Google News than it is in web search. And that makes sense, right, because the linking structure for an article that was only published minutes ago isn’t going to be the same as one that was published years or months ago. So we have to use PageRank delicately in Google News. So instead of using signals like PageRank, we actually use more of the signals we talked about earlier, things like timeliness (is it fresh and new?) or whether it has local or personal relevancy, those types of things.

So now that we have covered how Google crawls and groups and ranks articles, and we answered some of your frequently asked questions, let’s just get into some best practices.

First, it is important that you create permanent unique URLs with at least three digits. And the reason for this is that, traditionally, news publishers have used article IDs, and then equals a number, in their URL strings. And that has helped us to determine that it’s an article and not just a static HTML page. But if your news publishing system doesn’t include digits (at least three, for Google News), then you can actually submit a news sitemap, so that’s the workaround. If you don’t have three digits in your URLs, you can create a news sitemap and let us know which specific URLs belong in news.
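A publisher could check its own URLs against this rule with something like the sketch below. It assumes "at least three digits" means a run of three or more consecutive digits in the path or query string, which is an interpretation rather than something the talk spells out. The `has_three_digit_id` helper and the sample URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def has_three_digit_id(url):
    """Return True if the URL path or query contains 3+ consecutive digits.

    The host and port are ignored so that something like ':8080' does not
    count as an article ID.
    """
    parsed = urlparse(url)
    candidate = parsed.path + "?" + parsed.query
    return re.search(r"\d{3,}", candidate) is not None


if __name__ == "__main__":
    for sample in (
        "https://example.com/story?articleid=12345",
        "https://example.com/politics/budget-vote",
    ):
        print(sample, has_three_digit_id(sample))
```

URLs that fail a check like this are the ones worth listing explicitly in a news sitemap.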

The second best practice is to not break up the article body, so your news article should have sequential paragraphs that can all be included in Google News. You don’t want to break that up with user comments, or links to related posts, or even things like links to additional pages. That’s not as good for Google News. We’ll take all of the article on that first page. So look again to not break up the article body.

A third best practice is to put dates between the title and the body, and that will help our data extractor to have the correct publication date.

Fourth, titles matter. And this is to have a good HTML title as well as an article title, so you want your title to be extremely indicative of the story at hand.

Fifth, it’s best for Google News if you separate your original article content from your press releases. And you can do this in a directory structure. And this helps us to determine what is specifically a news article versus what might be satire or opinion or a press release.

And the last tip, of course, is to create unique and informative content and that’s always going to help you do well in rankings. So the more unique content that you create and the more users that enjoy that, the more users we’ll send there. And this is kind of converse to the idea of just publishing other people’s content or just having duplicate information.

So again, the greater information you put out for all of us to read, the more users you’ll attract to your site. If you have additional questions, please feel free to visit our News Publisher Help Center and thanks so much for watching.
