How I stopped these search engine pests from stealing my server bandwidth

[Image: Majestic-12 (MJ12bot), one of the many pests]

On this Internet server I run many busy websites, so I am mindful of the extra load that unnecessary web-crawling traffic puts on the machine. One of these sites has 58,000 pages, so when a spider decides to crawl the whole site it has a real impact. When 20 of them do the same, the effect is sobering!

Over the last few years, much of the problem has come from seemingly badly behaved web spiders or robots. The problem is now exacerbated because newer robots attempt to download multiple versions of the same page by pulling different renderings of the same site: desktop, mobile (high resolution), mobile (low resolution), tablet and so on. So not only are these guys hammering my server, they do it four or five times at exactly the same time. At times the load average on the server can climb to 2 or 3. This costs me money: poor response times for my loyal visitors and a notable drop in AdSense revenue.

As the majority of these spiders belong to Chinese, Russian or Korean search engines, or to plain unwanted SEO marketers, it is highly unlikely that I would ever get any useful referrals from them anyway. Time to stop them.

Initially I used a very successful combination of fail2ban, iptables, ipset and Apache vhost rules. I certainly managed to prevent many of these search engines from using up my valuable resources, but I started to worry that my overzealous blocking might actually be preventing bona fide visitors from reaching my sites. That is potentially more of a problem. It was time to look for a better solution.
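
For the record, below is a minimal sketch of the kind of Apache 2.4 vhost rule I mean; example.com and the paths are placeholders, and the three user-agent patterns are just a sample from the list further down. It tags any request whose User-Agent matches a known pest and refuses it with a 403. (The fail2ban/iptables/ipset side did the equivalent at the firewall level.)

<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/example

    # Tag requests whose User-Agent matches a known pest...
    SetEnvIfNoCase User-Agent "MJ12bot|AhrefsBot|SemrushBot" bad_bot

    # ...and refuse them before they can touch the site.
    <Directory /var/www/example>
        <RequireAll>
            Require all granted
            Require not env bad_bot
        </RequireAll>
    </Directory>
</VirtualHost>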

Keep it simple, stupid

Because all of these robots appeared to be extremely badly behaved, I naively supposed that they would certainly not honour the robots exclusion standard. Often the explanatory URLs these robots leave in my logs pointed to pages in Chinese, Korean or Russian with no English equivalent. These guys were not going to leave my websites alone based on the preferences in my robots.txt file, were they?

Well, it seems I was wrong. They all honour the robots exclusion code, even those like Majestic and Yandex!

I was completely staggered when I added the robot exclusion rules listed below to all of my websites. The bothersome spidering ceased the moment they read the robots.txt. I know that's what should happen, but I assumed (incorrectly) that it wouldn't.

Feel free to copy and use my robots.txt file

User-agent: 008
Disallow: /
User-agent: 200PleaseBot
Disallow: /
User-agent: 360Spider
Disallow: /
User-agent: adbeat_bot
Disallow: /
User-agent: ADmantX Platform Semantic Analyzer - ADmantX Inc. - www.admantx.com - support@admantx.com
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: Attribot/1.1 (compatible; Attribot-Site; http://static.attribyte.com/robotreadme.txt)
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: Bot.AraTurka.com
Disallow: /
User-agent: Bumble Bee
Disallow: /
User-agent: Butterfly/1.0
Disallow: /
User-agent: ClarityDailyBot
Disallow: /
User-agent: CMS Crawler
Disallow: /
User-agent: CRAZYWEBCRAWLER
Disallow: /
User-agent: Daum
Disallow: /
User-agent: DeuSu
Disallow: /
User-agent: diffbot
Disallow: /
User-agent: Domain Re-Animator Bot
Disallow: /
User-agent: DomainAppender
Disallow: /
User-agent: DomainSigmaCrawler
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: Exabot
Disallow: /
User-agent: Ezooms Robot
Disallow: /
User-agent: FavOrg
Disallow: /
User-agent: Findxbot
Disallow: /
User-agent: FlipboardProxy
Disallow: /
User-agent: GigablastOpenSource/1.0
Disallow: /
User-agent: Grapeshot
Disallow: /
User-agent: Heritrix
Disallow: /
User-agent: heritrix/2.0.2 +http://www.adsafemedia.com
Disallow: /
User-agent: HTTrack 3
Disallow: /
User-agent: InAGist URL Resolver
Disallow: /
User-agent: Insitesbot
Disallow: /
User-agent: jack
Disallow: /
User-agent: James BOT
Disallow: /
User-agent: Java
Disallow: /
User-agent: JS-Kit URL Resolver, http://js-kit.com/
Disallow: /
User-agent: linkdexbot
Disallow: /
User-agent: LivelapBot/0.2 (http://site.livelap.com/crawler)
Disallow: /
User-agent: LS Session
Disallow: /
User-agent: ltx71
Disallow: /
User-agent: meanpathbot
Disallow: /
User-agent: MetaURI API/2.0 +metauri.com
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: Mozilla/4.0 (CMS Crawler: http://www.cmscrawler.com)
Disallow: /
User-agent: NaverBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: SMTBot
Disallow: /
User-agent: YandexBot
Disallow: /
User-agent: Yeti
Disallow: /
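
If you want to check that a given crawler really is covered, Python's standard-library robots.txt parser gives a quick offline test. This is a minimal sketch: example.com is a placeholder, and only two of the exclusion groups above are reproduced for brevity.

import urllib.robotparser

# Two of the exclusion groups from the file above, enough for a demo.
rules = """\
User-agent: MJ12bot
Disallow: /
User-agent: YandexBot
Disallow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A blocked crawler is refused everywhere on the site...
print(rp.can_fetch("MJ12bot", "https://example.com/some-page"))    # False
# ...while an agent with no matching group may still crawl.
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))  # True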

Other useful links