Nginx + CDN + GoogleBot or how to avoid many useless Googlebot hits

If you’re like me and you’ve set up a CDN distribution for your website’s content (while waiting for SPDY to be widely adopted and available in mainstream distributions), you might have noticed that Googlebot frequently scans your CDN subdomains, and that this can overload your website a bit.

After all, the goal of a CDN is (among several others, but in my case only) to elegantly distribute content across subdomains so your browser loads the page resources faster (otherwise downloads get blocked by the HTTP limit on simultaneous downloads, or whatever higher limit your browser sets).

Hell, in my case, I looked at the number of page scans per day originating from Googlebot on only one of my CDN-enabled sites (I think there are like 5 different subdomains). Counting only the IPs that requested the site the most, it sums up to about 13,000 requests in just 24 hours. On the main site (the www-prefixed one), I still get 10,000 requests per day from Googlebot.

So, if you want to avoid that: fixing it in Apache is out of scope here, but you could easily do it with a RewriteCond line.
Doing it in Nginx should be relatively easy if you have separate virtual host files for your main site and the CDN (which is recommended, as they generally have different caching behaviour, etc.). Find the top “location” block in the Nginx configuration of your CDN virtual host. In my case, it looks like this:

        location / {
                index  index.php index.html index.htm;
                try_files $uri $uri/ @rewrite;
        }

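Change it to the following (replace www.example.com, used here as a placeholder, with the name of your main site):

        location / {
                index  index.php index.html index.htm;
                # Avoid Googlebot in here: send it to the main site instead
                if ($http_user_agent ~ Googlebot) {
                    return 301 http://www.example.com$request_uri;
                }
                try_files $uri $uri/ @rewrite;
        }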
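One caveat: a check placed inside "location /" only covers requests that end up in that block. If your CDN virtual host has other location blocks (a separate one for static files, say), a possible variant is to move the check up to the server level, where it is evaluated before any location is chosen. This is only a rough sketch under my own assumptions: cdn1.example.com and www.example.com are placeholder host names, and the case-insensitive "~*" operator is my substitution for the "~" used in the location block.

        server {
                # Placeholder CDN host name (assumption); use your own
                server_name cdn1.example.com;

                # Evaluated for every request to this virtual host,
                # whichever location block would have handled it
                if ($http_user_agent ~* googlebot) {
                        # Placeholder main site name (assumption)
                        return 301 http://www.example.com$request_uri;
                }

                # ... your existing location blocks, caching rules, etc. ...
        }

Whether you want this broader version depends on whether Googlebot should still be able to fetch images and other static files directly from the CDN subdomains.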

Reload your Nginx configuration and… done.

To test it, use the User Agent Switcher extension for Firefox. Beware that your browser generally caches aggressively, so if you have already loaded the page, you will probably have to restart your browser (or maybe use a new browser instance with firefox --no-remote and install the extension in that one *before* loading the page).

Once the extension is installed, choose one of the Googlebot user agents in Tools -> Default User Agent -> Spider – Search, then load your CDN page: you should get redirected to the www page straight away.
