Click caps and crawlers: A simple look at two of Google’s recent moves

By Mac Slocum @macslocum Dec. 7, 2009, 2 p.m.

Discussions involving Google and news organizations took a technical turn this week. Robots.txt files, search crawlers, click caps … I’m guessing most people aren’t intimately familiar with these things (and if you are, this piece isn’t for you). I figured it might be useful to strip away the tech jargon and filter a couple of Google’s latest efforts through a journalism-centric lens.

The First Click Free program now has a five-click cap

Google’s First Click Free model was introduced years ago as a way to level the playing field for subscription-based websites.

Here’s a little background: A publisher who opts in to First Click Free allows a visitor from Google to see the full text of an article that’s housed behind a registration wall (here’s an example; click the “Oil prices” headline). That same user would encounter a login or subscription prompt if they tried to access the article through a different process, be it via the publisher’s site itself or through another search engine.

There’s upside to First Click Free for publishers and users alike. Publishers get the benefit of inbound Google traffic, a major source of page views and unique visitors. Users see all of the information in an article, not just a headline and snippet. (Dunder Mifflin employees take note.)

But there’s a loophole in First Click Free, and it didn’t take long for people to figure it out. If you can’t get to the full text of an article on a registration-based site, simply search for the article’s headline in Google and click through to see the entire piece. There’s no limit to First Click Free. Technically, every click is a first click. That means as long as you’re willing to do the grunt work, you can access all the content you want without registering or paying.

That’s how it used to be, at least.

Google just introduced a new cap that addresses the First Click Free loophole. Now, a publisher can restrict a visitor to a specific number of daily referrals from Google. A login prompt will block a visitor’s path once the cap is exceeded.

The minimum cap is five daily clicks, according to Chris Gaither, senior communications manager for Google News. “They [publishers] can choose to allow users to click to more than five free articles, whether it’s six or 10 or 50 or unlimited,” Gaither wrote in an email. “But they aren’t permitted to allow users to click to fewer than five.”

That eliminates a “first and only click is free” scenario, which might sound enticing to publishers but isn’t really user-friendly. A one-click limit would reduce Google’s utility as well, since no one wants to wait 24 hours to follow a search result.

Publishers are responsible for keeping track of visitors and clicks. Google doesn’t have any part in that, which is understandable since tracking and logging millions of visitors — and their associated clicks — is a quagmire waiting to happen. Google will run occasional spot checks to make sure a website is allowing a minimum of five daily clicks, Gaither said.
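For the curious, here’s a rough sketch in Python of the bookkeeping a publisher might do to enforce a cap. It is an illustration, not Google’s or any publisher’s actual code. The ClickCap class, the visitor ID and the referrer check are all assumptions made for the example; the only detail taken from Gaither is the five-click floor.

```python
from collections import defaultdict
from datetime import date
from urllib.parse import urlparse

MIN_CAP = 5  # the floor Gaither describes; publishers may set a higher cap


def is_google_referral(referrer: str) -> bool:
    """Assumed rule: the referring URL's host is google.com or a subdomain of it."""
    host = (urlparse(referrer).hostname or "").lower()
    return host == "google.com" or host.endswith(".google.com")


class ClickCap:
    """Hypothetical counter of each visitor's free Google referrals per day."""

    def __init__(self, daily_cap: int = MIN_CAP):
        if daily_cap < MIN_CAP:
            raise ValueError("cap may not be lower than five daily clicks")
        self.daily_cap = daily_cap
        self._counts = defaultdict(int)  # (visitor_id, day) -> clicks used

    def allow(self, visitor_id: str, referrer: str, today: date) -> bool:
        """Return True to show the full article, False to show the login prompt."""
        if not is_google_referral(referrer):
            return False  # non-Google visits meet the registration wall as before
        key = (visitor_id, today)
        if self._counts[key] >= self.daily_cap:
            return False  # cap reached for today
        self._counts[key] += 1
        return True
```

With the default cap, a visitor’s first five Google referrals on a given day return True and the sixth returns False; the count starts over the next day. A real site would also need a dependable way to recognize returning visitors and somewhere durable to keep the counts, neither of which this sketch attempts.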

Google News gets its own search crawler

There’s been a lot of talk lately about robots.txt files. These things sound more complicated and ominous than they actually are.

Basically, a robots.txt file is a simple document that’s placed on a web server. It tells search engines exactly what they can and cannot access. If you want to keep a specific folder away from the prying eyes of Google (and every other legitimate search engine), you can add brief instructions to your robots.txt file and be done with it. Google will steer clear.
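To make that concrete, here’s a small sketch that uses the robots.txt parser in Python’s standard library to read a two-line file the way a well-behaved crawler would. The /private/ folder and the example.com addresses are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The blocked folder is off limits; everything else stays open.
blocked = can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/private/memo.html")
allowed = can_crawl(ROBOTS_TXT, "Googlebot", "https://example.com/news/story.html")
```

Here blocked comes out False and allowed comes out True. Keep in mind that the file is a request, not a lock: it only works on crawlers that choose to honor it.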

More broadly — and aggressively — you’ll often see “robots.txt” offered as a rebuttal whenever Google is accused of theft. That’s because a web administrator can use one of these files to block Google whenever he or she wants. It’s like locking the door on a retail store; a shoplifter would need to break in to steal anything.

The robots.txt process works great if you want to block Google outright, but the process doesn’t always address more nuanced use cases. Google News is (well, was) an example.

Up until this week, a news organization that wanted its content to show up only in Google search results, not in Google News, needed to notify Google through a web-based form. That’s no longer the case. Google News now has its own crawler, so Google News indexing can be controlled through a robots.txt file. That means publishers can decide separately which content shows up in Google News and which shows up in Google search.
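Here’s a sketch of what that split could look like, again using the robots.txt parser in Python’s standard library. It assumes the news crawler answers to the name Googlebot-News and the regular search crawler to Googlebot; neither name comes from this article, so check Google’s own documentation before relying on them. The example.com address is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the whole site out of Google News only.
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /
"""

# Assumed crawler names; the article itself does not spell them out.
CRAWLERS = {"Google search": "Googlebot", "Google News": "Googlebot-News"}


def visibility(robots_txt: str, url: str) -> dict:
    """Report, per Google service, whether its crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(agent, url) for name, agent in CRAWLERS.items()}


report = visibility(ROBOTS_TXT, "https://example.com/story.html")
```

With these rules, report shows the story open to Google search and closed to Google News. Python’s parser matches crawler names loosely, so treat this as a way to reason about the rules, not as a guarantee of how Google’s crawlers read them.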

Both of these moves — the click cap and the Google News crawler — give publishers more control over how and where their content appears in Google. It’s unclear if recent pressure from publishers motivated Google to implement these features. Gaither said that wasn’t the case; I’m not so sure. Regardless, it’s always good to have more buttons to push and levers to flip in the search engine world.

