24 Jul 2014

Dr. Crawlit - A Bot That Cares About the ‘Little Guy’

Working at Incapsula gives us a bird’s-eye view of the bot traffic landscape. Amongst the innumerable creatures roaming those fields, few are as intriguing as Googlebot – a web crawler that facilitates knowledge exchange between billions of humans, influencing our perceptions, preferences and imaginations in more ways than we can even comprehend.

Over the years, many efforts have been made to better understand Google’s behavior and motives. Today, we want to share with you some of our insights into Googlebot’s behavior, based on what we think is one of the most robust studies on the subject to date.

Methodology

For the purposes of this study, we observed over 400 million search engine visits to 10,000 sites, resulting in over 2.19 billion page crawls over a 30-day period.

Information about Googlebot impostors (a.k.a., Fake Googlebots) comes from inspection of more than 50 million Googlebot impostor visits, as well as findings from our ‘DDoS Threat Landscape’ report, published earlier this year.

60% of all Search Engine Traffic

The first interesting fact about Googlebot is just how active it really is.

It should come as no surprise that Googlebot is more thorough than any of its peers. However, it was interesting to see Googlebot actually crawling more pages than all other search engines combined.

Search Engine      Share of Total Page Crawls
Googlebot          60.5%
MSN/Bing Bot       24.5%
Baidu Spider       4.4%
Majestic12 Bot     3.0%
Yandex Bot         2.3%
Others             3.0%

There are few better indications of just how much information Google has at its disposal. This is also a good reminder of the responsibility that comes with access to all that collective knowledge.
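The "more than all other search engines combined" observation can be checked directly against the table above. The short Python sketch below only re-adds the published percentages; the variable names are our own and nothing here comes from the underlying dataset.

```python
# Shares of total page crawls, copied from the table above (percent).
crawl_share = {
    "Googlebot": 60.5,
    "MSN/Bing Bot": 24.5,
    "Baidu Spider": 4.4,
    "Majestic12 Bot": 3.0,
    "Yandex Bot": 2.3,
    "Others": 3.0,
}

googlebot = crawl_share["Googlebot"]

# Sum every row except Googlebot; round to one decimal to hide float noise.
everyone_else = round(
    sum(share for name, share in crawl_share.items() if name != "Googlebot"), 1
)

# Total of all listed rows, for reference.
listed_total = round(sum(crawl_share.values()), 1)

print(f"Googlebot:               {googlebot}%")
print(f"All others combined:     {everyone_else}%")
print(f"Listed shares add up to: {listed_total}%")
```

The listed rows add up to 97.7%, so Googlebot's 60.5% would still exceed the rest even if the entire remainder belonged to other crawlers.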

On a side note, we were also surprised to find Majestic12 Bot appears fourth on our “Most Active Search Engines” list, significantly outranking Yandex - a very popular Russian search engine.

Conspiracy theory buffs will recognize Majestic 12’s name for its connection to the (alleged) Roswell UFO landings. While the MJ12 bot is clearly non-human, it has far more earthly origins, and its own share of controversy.

Paying Attention to the “Little Guy”

Our first goal was to investigate an assumption, expressed by some of our clients, of a supposed connection between a website’s “popularity” and Googlebot’s crawl rates.

Simply put, the hypothesis here was “popular sites get more crawls.” To test this premise, we performed several correlation tests – first on the whole sample group of 10,000 websites and then on five sub-segments, categorized by the number of daily human visitors.

Results of correlation analysis (R²) between the number of human and Googlebot visits.

Analysis performed on six sample groups, segmented by the volume of daily human visits.

As evident from the graph above, we found no significant correlation between the volume of human visits and the number of Googlebot visits. As it turns out, Google really doesn’t play favorites – paying as much attention to the voice of the “little guy” as it does to that of some of the Web’s MVPs.
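For readers who want to run the same kind of test on their own logs, here is a minimal Python sketch of an R² calculation (the squared Pearson correlation) between daily human visits and daily Googlebot visits. The function name `r_squared` and the sample numbers are hypothetical illustrations; they are not the tooling or the data used in this study.

```python
from math import sqrt


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two series and the spread of each one.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    r = cov / sqrt(var_x * var_y)
    return r * r


# Hypothetical per-site daily counts, NOT data from the study.
human_visits = [120, 950, 4_300, 18_000, 75_000, 310_000]
googlebot_visits = [210, 95, 340, 60, 280, 150]

print(f"R^2 = {r_squared(human_visits, googlebot_visits):.3f}")
```

A value close to 1 would mean Googlebot visits rise and fall with human traffic, while a value close to 0 means one tells you little about the other. Repeating the calculation per traffic segment mirrors the segmented analysis described above.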

This data also hints at a disconnect between the rate of Googlebot crawls and a website’s SEO performance. Typically, organic search traffic accounts for 10%-30% (or more) of a website’s total visits. Thus, if a high crawl rate did indeed translate into a higher share of organic visits, we would expect to see that reflected (at least to some extent) in our data. As it stands, our numbers show no such correlation, giving web operators and SEO pros one less thing to worry about.

Crawl Patterns, Exceptions to the Average

One of our main goals in this study was to analyze Googlebot’s crawl patterns. This is a major focal point of most of the Google-related questions we get from our clients. Guided by those questions, we processed over 210 million Googlebot sessions and came out with the following key findings:

  • Googlebot’s average visit rate per website is 187 visits/day.
  • Googlebot’s average crawl rate is 4 pages/visit.

It should be noted that this data comes from a very diverse group of samples. At the high end, for instance, in 12.5% of cases we saw Googlebot visiting sites over 500 times/day (roughly 2.7 times the average). Crawl rates varied just as widely, with the most extreme case being 210,000 pages crawled during a single, exceptionally long 72-hour visit.
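To put these figures in perspective, the back-of-the-envelope Python sketch below combines them. It uses only the numbers quoted above. Multiplying the two averages assumes visit rate and crawl rate are independent, which the study does not claim, so treat the pages/day figure as a rough estimate only.

```python
AVG_VISITS_PER_DAY = 187   # average Googlebot visits per site per day (from above)
AVG_PAGES_PER_VISIT = 4    # average pages crawled per visit (from above)

# Rough pages crawled per site per day; a product of two averages is
# only an approximation of the true average.
pages_per_day = AVG_VISITS_PER_DAY * AVG_PAGES_PER_VISIT

# How the 500 visits/day high-end threshold compares with the average.
high_end_ratio = 500 / AVG_VISITS_PER_DAY

# Crawl pace of the most extreme visit: 210,000 pages over 72 hours.
extreme_pages_per_hour = 210_000 / 72

print(pages_per_day)                  # 748
print(round(high_end_ratio, 2))       # 2.67
print(round(extreme_pages_per_hour))  # 2917
```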

Through close inspection of dozens of websites, we spotted a possible explanation for this variation in Googlebot’s behavior. Drilling down, we saw that content-heavy and frequently updated websites were more thoroughly crawled. This behavior was most notable in the cases of big forums, news sites and large e-shops with a wide array of frequently updated products.

As an SEO professional, it didn’t surprise me to see Googlebot preferring fresh information and crawling websites in a non-uniform pattern based on content freshness and the site’s content structure.

Still, it was exciting and gratifying to see these notions being backed up by wide-scale empirical research. Seeing these numbers, I couldn’t help but wish that I had them up my sleeve all those times I was trying to convince past clients to stick to their content schedules.

Where Does Googlebot Come From?

Lastly, we wanted to share some lesser known facts about Googlebot’s origins. As one would expect, the bulk of Googlebot’s visits originate from the US.

However, we also see Googlebot coming from the UK, France, Belgium, Denmark and China.

Googlebots' Country of Origin     Share of Visits
US                                98.12%
UK                                1.17%
France                            0.38%
Belgium                           0.16%
Denmark                           0.09%
China                             0.07%

We thought this information could prove useful to website operators who rely on indiscriminate, geography-based security rules that may block Googlebot together with unwanted visitors.
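As a rough illustration, the Python sketch below estimates how much Googlebot traffic a country allowlist would turn away, using only the shares from the table above. The helper `googlebot_share_blocked` is a hypothetical name of our own, and the sketch assumes a rule that blocks every visitor from a country outside the allowlist, with no exception for verified crawlers.

```python
# Googlebot visit share by country of origin, copied from the table above (percent).
googlebot_origin_share = {
    "US": 98.12,
    "UK": 1.17,
    "France": 0.38,
    "Belgium": 0.16,
    "Denmark": 0.09,
    "China": 0.07,
}


def googlebot_share_blocked(allowed_countries):
    """Percent of Googlebot visits a country allowlist would turn away."""
    blocked = sum(
        share
        for country, share in googlebot_origin_share.items()
        if country not in allowed_countries
    )
    # Round to two decimals to match the table's precision.
    return round(blocked, 2)


print(googlebot_share_blocked({"US"}))        # 1.87
print(googlebot_share_blocked({"US", "UK"}))  # 0.7
```

Even a US-only rule would drop under 2% of Googlebot visits according to these numbers, but those visits may be the ones that reach a given page, so exempting legitimate crawlers from geography rules is the safer design.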


The second part of this post gets into the sketchier behaviors of Googlebot’s evil alter ego, Mr. Hack.