27
Aug
2012

Googlebot Spider, Crawler in the Spotlight

Last week Google introduced a slight change into its Search Result Pages by attaching a short explanation snippet to all URLs blocked by the Robots.txt file or an on-page Robots Meta-tag. The announcement simply stated: “A description for this result is not available because of this site’s robots.txt – learn more”, with a link pointing to an explanation about how to block or remove pages using a robots.txt file.

Blocked by Robots.txt snippet

From Google’s standpoint this was probably a purely cosmetic and mostly insignificant modification, a change so “small” it wasn’t even mentioned on any of Google’s official channels. Still, as with all things Google, the sheer scope of affected users sparked various questions, blog posts and forum discussions by concerned Internet citizens who demanded to know just who this “robots” guy is, what its purpose is and why it is blocking their site.

More importantly, this also got many people talking about Google’s crawling methods and Googlebot's behavior. After answering many such questions, we decided to take this opportunity to cover a few of the most interesting facts you may or may not know about an SEO’s best friend, the almighty Googlebot.

Fact #1: Blocking URLs with robots.txt will not always remove them from SERP

From time to time we’ve seen website owners block URLs with robots.txt (e.g. admin panel URLs) and then be surprised to find them still appearing on Google. This is nothing new because, as officially stated by Google:

“While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.”

Basically this means that blocked pages can still appear in SERPs, but mostly for URL-specific search queries (e.g. searching for: site:www.incapsula.com/plugins/). This will keep them hidden from the eyes of the general public but still leave them accessible to someone who wants to dig a little deeper. Keep in mind that for these visitors a robots.txt file can even serve as a kind of roadmap, since it offers a list of all of your “hidden” folders.
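If you want to check which URLs a given set of robots.txt rules would block, Python’s standard library ships a parser for exactly this. A minimal sketch (the rules and example.com URLs below are hypothetical, echoing the /plugins/ example above):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules blocking a "hidden" folder
rules = [
    "User-agent: *",
    "Disallow: /plugins/",
]

parser = RobotFileParser()
parser.parse(rules)

# Crawling the blocked folder is disallowed...
print(parser.can_fetch("Googlebot", "https://www.example.com/plugins/admin.html"))  # False
# ...while everything else stays crawlable
print(parser.can_fetch("Googlebot", "https://www.example.com/index.html"))  # True
```

Remember that `can_fetch()` only answers whether a crawler may fetch the page; as explained above, a disallowed URL can still end up in the index if it is linked from elsewhere.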

Fact #2: Google will execute JavaScript

Some SEOs will advise their clients to stay away from JS objects, such as drop-down menus and dynamic content boxes, claiming that JS is not “Google-friendly” and that using it can lead to partial indexing, hidden content penalties and many other long and threatening adjectives.

Well, while we are all for the “better safe than sorry” approach, the fact is that Google can now understand and execute JS functions.

True, JS compatibility may have posed a problem several years ago, but the Web has moved on and Google has too, as is evident from a recent discussion on Hacker News in which Matt Cutts commented:

“Google continues to work on better/smarter ways to crawl websites, including getting better at executing JavaScript to discover content.”

This remark followed a “public announcement” made earlier this March, in which Matt asked website owners to allow Googlebot to crawl their JS and CSS. The plea was met with some mixed reactions, but the core message was loud and clear: Google can now understand your JS and use it for indexing decisions.

"Let us crawl the JavaScript. Let us crawl the CSS..."

Fact #3: Legitimate Googlebot visits can originate from Chinese IPs

Almost a month ago we released a Fake Googlebot study that shed some light on the Googlebot impersonation phenomenon. After analyzing traffic information from 1,000 different sites, we discovered that 21% of all Googlebot visits were made by various impersonators, some “just curious” and others outright malicious.

Many webmasters are aware of this issue and will use additional defense techniques to prevent Fake Googlebot access, most commonly by blocking access from all suspicious IP ranges. While this may seem like a good idea on paper, without proper professional research this solution may actually damage your website and your SEO efforts.

The most interesting example of this is a security rule-set that blocks Googlebot visits from Chinese IPs. Many will advocate this solution because, surely, the REAL Googlebot can never come from China, right? Wrong. In fact, Googlebot uses different sets of IP ranges, including Chinese IPs, and, just to spice things up, it will also use several different user-agents. So, before setting any rules, you should really double-check your information.
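Rather than blanket-blocking IP ranges, a safer approach (and the one Google itself documents) is to verify a claimed Googlebot visit with a reverse DNS lookup followed by a forward lookup. A minimal sketch in Python; the function names are our own, and error handling is deliberately simple:

```python
import socket

def is_google_hostname(hostname):
    # Genuine Googlebot reverse-DNS names end in googlebot.com or google.com
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Return True only if `ip` reverse-resolves to a Google hostname
    AND that hostname forward-resolves back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse lookup
        if not is_google_hostname(hostname):
            return False
        _, _, addresses = socket.gethostbyname_ex(hostname)  # forward lookup
        return ip in addresses
    except (socket.herror, socket.gaierror):
        return False
```

Because the check is based on DNS rather than a fixed IP list, it keeps working even when the real Googlebot shows up from an unexpected range, Chinese or otherwise.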

To learn more about Googlebot, you can refer to the Botopedia directory, which will provide you with full details, including a complete user-agent list and an IP verification feature, to help you better identify Googlebot and many other bots.

Of course you can also just use Incapsula, as even our Free plan will provide complete Bad Bot protection.

Googlebot: Your most important visitor

Googlebot is a website’s most loyal returning visitor and for most of us, also the most important one. Understanding Googlebot’s behavior, needs and wants is key to a website's success.

To learn more about Googlebot and its several “siblings” (Google Feedfetcher, GooglePlus bot and others) you can always visit Botopedia.org or refer to documentation on official Google channels.