

How the Googlebot Sees and Ranks Different File Formats
February 11, 2010, 8:38 pm
Something that can be very helpful when you are designing and refining your website is knowing what it “looks like” to the bots that crawl the web and index your pages. If your site doesn’t have the information that the bots need to know what your content and graphics are all about, then they can’t do a very good job indexing your pages.
If you use Firefox, you can download and install the “User Agent Switcher” add-on. You’ll have to restart Firefox once you’ve installed it. After that, in Firefox, go to Tools, then User Agent Switcher, then Options, then Options again. In the User Agent Switcher window that comes up, select User Agents and click on “Add.”
In the Description box, type something like “Google Bot” and in the User Agent box, paste this:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
In the App Name box type Googlebot, then click OK. Now, any time you want to view one of your pages as if you were the Google bot, you go to Tools, User Agent Switcher, Googlebot.
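If you would rather script the same check than click through a browser, the idea can be sketched in Python with the standard library. This is only an illustration: the function names `build_googlebot_request` and `fetch_as_googlebot` are made up for this example, and the only thing taken from the steps above is the user-agent string itself.

```python
import urllib.request

# User-agent string quoted in the steps above.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_googlebot_request(url):
    """Return a Request for `url` that sends the Googlebot user-agent header."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_as_googlebot(url, timeout=10):
    """Fetch `url` and return (status code, decoded body). Needs network access."""
    request = build_googlebot_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body
```

Keep in mind that `urlopen` follows redirects on its own, so the status you get back belongs to the final page. Sending this header only changes how your request is labeled; a server that checks where requests come from can still tell you are not the real Googlebot.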
You might have to block cookies to view some sites, and you can do this in Tools, Options, Privacy, Exceptions (then add the URL).
Another thing you can do is use a text browser like Lynx to get a rough idea of how your site looks to Google. Google Webmaster Tools also has a feature that can help. On the Webmaster Tools dashboard, click on the “+” sign by the “Labs” link in the left-hand column. When you do, you’ll see an option called “Fetch as Googlebot,” as you can see in the first screen shot. Click on it, and it will download your site (or whatever URL you enter) as the Googlebot sees it.

As in the second screen shot, you’ll see the html source, just as you would when you click on “View Source” in your browser. You’ll get a response code, like 200, which means everything is peachy, or 301, which means “permanent redirect.” You’ll also see what kind of server your website is on and any CSS files or scripts that the page calls.
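To make the response codes a little more concrete, here is a small Python sketch that turns a status code and a set of response headers into a one-line summary. The function name `describe_response` and the sample header values are assumptions for the example; this is not what Webmaster Tools runs behind the scenes.

```python
from http import HTTPStatus


def describe_response(status_code, headers):
    """Summarize a status code and the Server header in one line."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Code is not in the standard library's list of known statuses.
        phrase = "Unknown status"
    server = headers.get("Server", "not reported")
    summary = f"{status_code} {phrase} (server: {server})"
    # For redirects, also show where the page now points.
    if status_code in (301, 302, 307, 308):
        summary += f" -> {headers.get('Location', 'no Location header')}"
    return summary


print(describe_response(200, {"Server": "Apache"}))
print(describe_response(301, {"Location": "http://example.com/new"}))
```

The lookup here expects headers in a plain dictionary with the usual capitalization, which is a simplification; real header names are not case sensitive.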

One caveat is that Fetch as Googlebot doesn’t always work with PDF files. Google insists it’s working on fixing the problem, and if your site looks OK in your browser, chances are it looks OK to the Googlebot (even if it’s a PDF).
If you run a lot of scripts or have lots of layers on your sites, this can be particularly handy. If your site is mostly simple html, your normal web browser will give you a pretty good idea of what Google sees on your site.
What Googlebot Sees as It Crawls Your Site
When the Googlebot crawls your site, it uses computer algorithms to determine which sites to crawl, how often to crawl them, and how many pages to get from each site. It starts with a list of URLs from earlier crawls and with sitemap data. The bot notes changes to existing sites, new sites, and dead links for the Google index. When the Googlebot processes each page it takes in content tags and things like ALT attributes and title tags. Googlebot can process a lot of content types, but not all. It cannot process contents of some dynamic pages or rich media files.
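For a rough feel of the text a crawler has to work with, the Python sketch below pulls the title, the image ALT text, and the links out of a page. It is a simplified illustration using the standard library's `HTMLParser`, not a description of how Googlebot actually processes pages, and the `PageSummary` class and `summarize` function are names invented for this example.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, image ALT text, and link targets from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.alt_texts = []
        self.images_missing_alt = 0
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            alt = (attrs.get("alt") or "").strip()
            if alt:
                self.alt_texts.append(alt)
            else:
                # An image with no ALT text gives a text-only reader nothing.
                self.images_missing_alt += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html):
    """Parse `html` and return the filled-in PageSummary."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser
```

Running `summarize` on your own page source and checking `images_missing_alt` is a quick way to spot graphics that carry no description at all.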
There has been plenty of talk about how to handle Flash on your site. Googlebot doesn’t cope well with Flash content and links that are contained within Flash elements. Google has made no secret about its dislike for Flash content, saying that it is too user-unfriendly and doesn’t render on devices like PDAs and phones.
You do have some options, however, such as replacing Flash elements with something more accessible like CSS/DHTML. Web design using “progressive enhancement,” where the site’s design is built up in layers on top of basic content that works everywhere, will allow all users, including the search bots, to access content and functions. Amazon has a “Create your Own Ring” tool for designing engagement rings that is a good example of this type of functionality. Also, something called sIFR, or Scalable Inman Flash Replacement, is an image replacement technique that uses CSS, Flash, and JavaScript together to display text in a font even if it isn’t on the user’s computer, as long as the user can display Flash. Google now officially approves of sIFR.
Google says that the bottom line is to show your users and Googlebot the same thing. Otherwise your site could look suspicious to the search algorithms. This rule takes care of a lot of potential problems, like the use of JavaScript redirects, cloaking, doorway pages, and hidden text, which Google strongly dislikes.
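One way to check this rule on your own site is to save the html your server sends to a normal browser and the html it sends to a Googlebot user agent, then compare the visible text. The Python sketch below does that comparison with the standard library. It is an assumption-laden illustration, not how Google detects cloaking, and the names `visible_text` and `similarity` are made up for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside script or style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with runs of whitespace collapsed to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def similarity(html_for_browser, html_for_bot):
    """Return a 0.0 to 1.0 score; 1.0 means the visible text is identical."""
    return SequenceMatcher(
        None, visible_text(html_for_browser), visible_text(html_for_bot)
    ).ratio()
```

A score well below 1.0 does not prove anything by itself, since pages with rotating content differ between any two requests, but it tells you where to look.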
Google support engineers say that Google looks at the content inside “noscript” tags, but that content should accurately reflect the Flash-based content it stands in for, or else Googlebot may think it’s cloaking.
According to Google engineer Matt Cutts, it’s difficult to pull text from a Flash file, but Google can do a fair job of it. It uses the Search Engine SDK tool that comes from Adobe / Macromedia. Most search engines are expected to make that tool the standard for pulling text out of Flash graphics. If you regularly use Flash, you might consider getting the tool as well and seeing for yourself what kind of text it pulls out of your graphics. In fact, Google may work with Adobe on updates to the tool.