We all want Google to visit our site and index our content as often as possible. By doing so, Google learns what is new on our site and can immediately share our updated content with anyone searching online. Google uses a crawler called “Googlebot” that crawls millions of sites simultaneously and indexes their content in Google’s databases. The more often Googlebot visits your site, the faster your site’s content updates will appear in Google’s search results. Consequently, it’s of the utmost importance to allow Googlebot to crawl your website without blocking or disturbing it. In fact, you want to give Googlebot the real VIP treatment. The problem? That VIP treatment has made Googlebot a popular target for impersonation by hackers.
In a recent study of 1,000 customer websites that we performed at Incapsula, we discovered the following:
- 16.3% of sites suffer from Googlebot impersonation attacks of some kind.
- Among those targeted sites, 21% of the visitors claiming to be Googlebot were impersonators.
- The vast majority of such impersonators post comment spam and also steal website content.
Fake Googlebot: How do the bad guys do it?
Googlebot has a very distinct way of identifying itself. It uses a specific user agent, it arrives from IP addresses that belong to Google, and it always adheres to robots.txt (the crawling instructions that website owners provide to such bots).
Here are the most common methods used by Googlebot impersonators and how you can protect your website:
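Since the real Googlebot adheres to robots.txt, a visitor that claims to be Googlebot but requests a disallowed path deserves a closer look. The snippet below is a minimal sketch of that idea using Python's standard `urllib.robotparser`. The function name `violates_robots` and the sample rules are hypothetical, and a violation is only a hint, not proof of impersonation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def violates_robots(robots_txt, path, user_agent="Googlebot"):
    """Return True if robots_txt forbids user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for requested in ("/blog/post", "/private/report.html"):
        print(requested, violates_robots(EXAMPLE_ROBOTS, requested))
```

In practice you would feed the function your own robots.txt and the paths taken from your access log for requests carrying a Googlebot user agent.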
Method #1: Not validating Google IPs
It is not trivial to validate whether a bot declaring itself as Googlebot is the real thing. Sure, it is easy enough to spot bots with fake or weird-looking user agents, but what about the more sophisticated bots? Google has a number of user agents and a very wide range of (non-public) IPs from which it can crawl a website. Do websites really validate IPs on the fly to check that they are associated with the Google network? Most likely not. And what about the rest of the traffic coming from these IPs, such as Google App Engine or Google employees: is it Googlebot or not? The lack of validation is the number one opening that lets the bad guys in.
Currently, Google’s search bot has two official user agents: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) and the less common Googlebot/2.1 (+http://www.google.com/bot.html). (Source: Webmaster Tools - Google Crawler)
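One way to validate an IP on the fly is a reverse DNS lookup followed by a forward lookup: resolve the IP to a hostname, check that the hostname sits under a Google crawler domain, then confirm that the hostname resolves back to the same IP. The sketch below assumes the domain suffixes `googlebot.com` and `google.com`, so check Google's current documentation before relying on that list. The function names are hypothetical, and the forward lookup shown here covers IPv4 only.

```python
import socket

# Assumption: hostnames of genuine Google crawlers end with one of these.
GOOGLE_CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_lookup(hostname):
    """Return the IPv4 addresses hostname resolves to (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_googlebot(ip, reverse_lookup=_reverse_lookup,
                          forward_lookup=_forward_lookup):
    """Reverse-then-forward DNS check for an IP claiming to be Googlebot."""
    hostname = reverse_lookup(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    # A PTR record alone can be set by whoever controls the IP block,
    # so the hostname must also resolve back to the original address.
    if not hostname.endswith(GOOGLE_CRAWLER_SUFFIXES):
        return False
    return ip in forward_lookup(hostname)
```

The two lookup functions are passed in as parameters so that results can be cached, or replaced with stubs in tests, without touching the network on every request.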
Method #2: Forging User-Agent strings
As mentioned, Google does not provide a list of IP addresses to whitelist since they change very often. The best way to identify Google’s crawlers is using the User-Agent string. Fortunately for the bad guys, User-Agent strings are very easy to forge.
There are various ways in which intruders impersonate Googlebot. The simple, unsophisticated impersonators copy and paste its user agent into the requests that their bot generates, often with obvious mistakes.
Many people use cURL in their bot’s code and simply replace the default cURL user agent with Google’s. Other, more sophisticated bots generate requests that seem identical to the original and can fool the naked eye. We have even seen bots that mimic Google’s crawling behavior, fetching the robots.txt first and taking a crawler-like method of browsing through the website.
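To see how little effort a forged user agent takes, consider the sketch below. It builds an HTTP request with Python's standard `urllib.request` (without sending it) and sets the User-Agent header to Googlebot's string. The helper names `claims_to_be_googlebot` and `build_request` are hypothetical, and the check is deliberately naive to show why the header alone cannot be trusted.

```python
from urllib.request import Request

GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def claims_to_be_googlebot(user_agent):
    """Naive check that trusts whatever the client put in the header."""
    return "googlebot" in (user_agent or "").lower()


def build_request(url, user_agent):
    """Build (but do not send) a request with an arbitrary User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Any client can set the header to any value it likes.
forged = build_request("http://example.com/", GOOGLEBOT_UA)

# urllib stores the header name capitalized as "User-agent".
print(claims_to_be_googlebot(forged.get_header("User-agent")))
```

The forged request passes the naive check, so a user agent match is best treated as a claim to be verified rather than as proof of identity.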
Examples from the wild
- MaMa Casper worm disguised as Googlebot - A worm that scans for vulnerable PHP code in Joomla and e107, which are very common Content Management Systems. This fake Googlebot will scan multiple domains and once a vulnerable site is found, this worm will infect it with malicious code.
- SEO tools - We have observed Googlebot impersonators originating from IPs that are registered to SEO companies. These visits are a byproduct of online SEO tools which check for competitor information and view it as it’s presented to the Google search crawler.
The effect of letting a Googlebot impersonator into your website could be devastating. These fake Googlebots can litter your blog with comment spam and copy your website’s content to be published elsewhere. Worst of all, they can suck up your website’s computing and bandwidth resources to the point that the server crashes or your hosting provider shuts you down for hogging too many resources.
Unfortunately, accidentally blocking the real Googlebot can be just as devastating. If Google cannot index your site, new content will not become “searchable” by the billions of people who use Google to search for content online. Prolonged blocking will also result in the loss of valuable search engine rankings, which are sometimes worth millions of dollars in brand equity and online awareness that took years to achieve.