

News Archive
March 2012
February 2012
January 2012
December 2011
November 2011
October 2011
September 2011
August 2011
July 2011
June 2011
May 2011
April 2011
March 2011
February 2011
January 2011
December 2010
November 2010
October 2010
September 2010
August 2010
July 2010
June 2010
May 2010
April 2010
March 2010
February 2010
January 2010
December 2009
November 2009
October 2009
September 2009
August 2009
July 2009
June 2009
May 2009
April 2009
March 2009
February 2009
December 2008
November 2008
October 2008
September 2008
August 2008
July 2008
June 2008
May 2008
April 2008
March 2008
February 2008
January 2008
December 2007
November 2007
October 2007
September 2007
August 2007
July 2007
June 2007
May 2007
April 2007
March 2007
February 2007
Why Your Robots.txt Blocked URLs May Show up in Google
October 6, 2009, 2:22 pmMatt Cutts has appeared in yet another Google Webmaster Video, and this time he has a whiteboard with him so he can illustrate what he's talking about. What he's talking about this time are uncrawled URLs in search results.
Cutts says Google gets a lot of complaints from webmasters who say the search engine is violating their robots.txt files, with which they intend to keep Google from crawling certain pages. Sometimes those URLs still end up in search results.
According to Matt, what is happening in most cases is that when someone's saying "I blocked example.com/go" in robots.txt, it turns out that the snippet Google returns in search results just brings back a URL with no text for the snippet. The reason for this is that Google didn't actually crawl the page.
"It did abide by robots.txt. You told us this page is blocked, so we did not fetch this page," says Matt. It is a URL reference. "We saw a link to it, but we didn't fetch the page itself," he explains.
Google didn't actually fetch the page itself, and that's why there's no text snippet. In case you were wondering what the point of showing them at all is, Cutts breaks out an example looking at the California DMV, whose site is: www.dmv.ca.gov.
Cutts notes that at one point the California Department of Motor Vehicles had a robots.txt that blocked all search engines. "Now these days pretty much every site is savvy enough, you know, at one point the New York Times and eBay and a whole bunch of different sites would use robots.txt," he says.
If someone searches for "California DMV" in Google, there's pretty much only one answer, he says. So that is the answer that Google wants to return. Luckily for Google a lot of people were linking to that page with the anchor text "California DMV". That helps Google be able to return the result without having to crawl the page.
Cutts also says that they can get descriptions from a directory like the Open Directory Project (DMOZ). He cites Nissan and Metallica.com as examples of sites that used to block Google with robots.txt. They had been listed in the Open Directory Project, however, and Google went and got the information from there to include as the snippet.
When this type of thing happens, it looks like the page was crawled, when in fact it wasn't. "So we are able to return something that can be very helpful to users without violating robots.txt by not crawling that page," says Cutts.
He also notes that when you don't want pages to show up, you can use the "noindex" meta tag at the top of the page. When Google sees this tag, it drops the page from its search results completely. Another option is the URL removal tool.




