Our web servers were recently getting hit pretty hard by web crawler bots that simply ignore your robots.txt file. The worst offender was a bot called MJ12bot, which is supposed to feed a distributed search engine that they say lives in an alpha state at search.majestic12.co.uk (down at the time of writing). It seems they get people to run SETI@home-like software that uses your power, CPU and bandwidth to help them build what they say is a distributed search engine. What they are definitely doing with the data at present is running an SEO business that tells you how many back-links you have, etc., using what seems to be a huge index built from the computing resources people donate.

We use Windows Server 2008 R2, so the solution for us was to block these crawlers by their user agent (this is easily done with Apache as well). To do so, open your site in IIS 7 and select the URL Rewrite module (if it isn’t listed you will have to install it first; it is a separate download from Microsoft rather than something in the Add/Remove Windows Features tool). Select Add Rule (in the top right) and then select ‘Blank rule’. Set your new rule up as shown in the image below…

And add a condition like the one below…

Our regex string is “MJ12bot|soso|baidu|youdao|NaverBot|Yeti|ichiro|moget|sogou|Speedy”, although you may want to include Yandex in that as well. I removed Yandex from ours because they emailed us asking to be allowed to index our site and said that they obey robots.txt rules, such as crawl-delay and disallow.
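
If you want to check what that pattern will actually catch before you switch the rule on, here is a minimal Python sketch that runs it against a few User-Agent strings. It is not part of the IIS setup. The function name is_blocked and the sample strings are made up for illustration, and I have assumed the match is case-insensitive and can hit anywhere in the header, which is how I would expect the condition to behave with case ignored.

```python
import re

# Pattern copied from the post. IGNORECASE is an assumption: it mimics a
# rewrite condition that ignores case.
BLOCKED_AGENTS = re.compile(
    r"MJ12bot|soso|baidu|youdao|NaverBot|Yeti|ichiro|moget|sogou|Speedy",
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-Agent string."""
    return BLOCKED_AGENTS.search(user_agent) is not None


# Hypothetical sample User-Agent strings, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; MJ12bot/v1.4.0; http://www.majestic12.co.uk/bot.php?+)",
    "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
    "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
]

for ua in samples:
    # Print the verdict next to each string so false positives are easy to spot.
    print("BLOCK" if is_blocked(ua) else "allow", ua)
```

Keep in mind that short terms like Yeti and Speedy match as plain substrings, so it is worth running a sample of user agents from your own logs through this to make sure no real visitors get caught.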

UPDATE

Steve from Majestic12 has posted a detailed response regarding MJ12bot below.