Tuesday, July 24, 2012

Is your web server load high?

A couple of weeks ago we started to worry about CIMLS.com's server. We were running a larger AWS instance and hitting 80%+ utilization on a continual basis. For a site of our size, serving simple HTML, that is ridiculous. So, what was the problem?


The bots that crawl for various search engines were furiously indexing CIMLS.com, accounting for almost 40% of our server load. About half of that was Google. Indexing is a worthy cause, but we cut that load in half by lowering Googlebot's crawl rate to 0.5 requests/second. This hasn't significantly reduced the number of indexed pages (let alone organic traffic or rank).
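To see where load like this is coming from, the access log is the place to look. Here's a rough sketch of counting crawler requests; the inline sample log is purely illustrative, so point LOG at your real Apache access log instead:

```shell
# Sketch: estimate each crawler's share of requests from an access log.
# LOG here is a tiny generated sample; substitute your real log path,
# e.g. /var/log/apache2/access.log (path varies by setup).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
1.2.3.4 - - [24/Jul/2012] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
5.6.7.8 - - [24/Jul/2012] "GET /a HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; YandexBot/3.0)"
9.9.9.9 - - [24/Jul/2012] "GET /b HTTP/1.1" 200 "-" "Mozilla/5.0"
EOF

# grep -c '' counts lines without the whitespace padding wc adds on some systems
total=$(grep -c '' "$LOG")
for bot in Googlebot Baiduspider YandexBot; do
  n=$(grep -c "$bot" "$LOG")
  echo "$bot: $n of $total requests"
done
```

Matching on the user-agent substring is crude (it also counts impostors spoofing a bot's user agent), but it's plenty accurate for deciding which crawlers deserve a closer look.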

Adjusting Googlebot's crawl rate can be accomplished by visiting google.com/webmastertools >> configuration >> settings >> crawl rate. We tried a number of options, but found 0.5 to 1 requests/second to be the sweet spot: our pages still got indexed without tanking server performance for real users.
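For some other crawlers, the non-standard Crawl-delay directive in robots.txt offers a similar knob (Googlebot ignores it, which is why the Webmaster Tools setting is needed for Google). A sketch, with an illustrative value in seconds between requests:

User-agent: bingbot
Crawl-delay: 2

Since the directive is non-standard, support and interpretation vary by crawler, so check each engine's documentation before relying on it.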

Another big surprise was that the bulk of the remaining load wasn't from other major crawlers like Bing or Yahoo!. Instead, two foreign crawlers, Yandex (Russian) and Baidu (Chinese), were hammering the system. As we are a US-oriented site, the traffic these crawlers actually send us is negligible (and probably irrelevant).
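In principle, robots.txt should be able to turn such crawlers away entirely. A sketch of what that looks like, using each engine's published user-agent token:

User-agent: Baiduspider
Disallow: /

User-agent: Yandex
Disallow: /

The catch, of course, is that robots.txt is purely advisory: it only works if the crawler chooses to obey it.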

We found that modifying robots.txt was not effective (these bots simply ignored it). We had to use mod_rewrite rules to block the bot traffic outright:

# In .htaccess (or the vhost config), with mod_rewrite enabled
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} YandexBot [NC]
RewriteRule ^.*$ - [F]

Well, that finally did the trick, removing nearly 20% of our server load overnight. Phew!
