#1
#2
Is all this even possible? Well, I certainly wouldn't try with my own server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste my bandwidth; it'd get me penalized).
#3
Philipp Lenssen wrote:
> Is all this even possible? Well, I certainly wouldn't try with my own server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste my bandwidth; it'd get me penalized).

It wouldn't be unethical: you put the spider trap in a directory excluded by robots.txt, so only disobedient bots will find it.

Wasting bandwidth is more of a problem, but it depends on what you do with the script. If you generate loads of random email addresses it's probably unethical, just because some real ones could be generated by accident. You could simply write a script to make a note of the IP and then block it, saving you bandwidth.

I'm not sure it would work either.
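To make the "note the IP, then block it" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical `/trap/` directory that robots.txt excludes (`User-agent: *` followed by `Disallow: /trap/`), and it keeps the block list in memory only. The class name, the path, and the status codes are illustrative choices, not anything a poster in this thread described.

```python
from dataclasses import dataclass, field

# Hypothetical trap directory; robots.txt is assumed to contain:
#   User-agent: *
#   Disallow: /trap/
TRAP_PREFIX = "/trap/"


@dataclass
class TrapGuard:
    """Remembers clients that requested the trap and refuses them afterwards."""

    blocked: set = field(default_factory=set)

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if ip in self.blocked:
            return 403  # caught earlier: refuse cheaply, no page is generated
        if path.startswith(TRAP_PREFIX):
            self.blocked.add(ip)  # make a note of the IP
            return 403
        return 200


guard = TrapGuard()
print(guard.handle("203.0.113.7", "/index.html"))   # ordinary page
print(guard.handle("203.0.113.7", "/trap/a.html"))  # ignored robots.txt
print(guard.handle("203.0.113.7", "/index.html"))   # blocked from now on
```

A well-behaved crawler never requests anything under `/trap/`, so it never ends up on the list. Blocking by IP alone can also catch unrelated visitors behind a shared proxy, so a real setup would probably want the entries to expire.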
#4
I heard this before. Someone posting, "I've built a spider trap". And the post got closed down at Webmasterworld.com as far as I can tell. So I'd like to know, is it possible to build a "spider trap" -- which if I understand correctly would automatically generate random pages for a searchbot to do infinite crawling on one's site?

What it would take:
- A searchbot that would never stop (e.g. Googlebot not saying "I crawled 10,000 from this domain, that's my limit")
- Automatically generated random pages (e.g. from a dictionary database backend)
- Automatically generated random links to more random pages from every random page (all on the same server, of course)
- Maybe some way to hide that it's always the same script, if a bot cares about such (e.g. by the use of Apache's htaccess file)
- ...?

Is all this even possible? Well, I certainly wouldn't try with my own server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste my bandwidth; it'd get me penalized).

--
Google Blogoscoped
http://blog.outer-court.com
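The page-generation part of that list can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the tiny `WORDS` list stands in for the "dictionary database backend", `/trap/` is a hypothetical URL prefix, and seeding the generator with the request path is just one way to make every URL return the same page each time, so it does not look like one script answering everything.

```python
import random

# Tiny stand-in for the "dictionary database backend" mentioned in the post.
WORDS = ["amber", "bridge", "cedar", "delta", "ember", "fjord", "grove", "harbor"]


def trap_page(path: str, n_words: int = 30, n_links: int = 5) -> str:
    """Build a pseudo-random HTML page whose links lead to more such pages."""
    rng = random.Random(path)  # seeded by the path, so each URL is stable
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = []
    for _ in range(n_links):
        slug = "-".join(rng.sample(WORDS, 2))
        number = rng.randrange(10**6)
        links.append(f'<a href="/trap/{slug}-{number}.html">{slug}</a>')
    return f"<html><body><p>{text}</p>\n" + "\n".join(links) + "\n</body></html>"


print(trap_page("/trap/start.html"))
```

Whether a crawler keeps following such links is, as the post says, up to the crawler. The sketch only covers the server side.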
#5
I suppose you could build a spider trap that adds the user agents of robots that don't obey your robots exclusion file to your .htaccess mod_rewrite list and then diverts the visitor to http://www.paostyle.tv/contents/check/getfuck.html

It would be possible, but it could bring your server down if the script that modifies .htaccess ever corrupts the file.

--
The Ultimate Search Engine Links Page
http://www.searchenginelinks.co.uk

"Philipp Lenssen" <info (AT) outer-court (DOT) com> wrote in message news:bjp9hr$lesqo$1 (AT) ID-203055 (DOT) news.uni-berlin.de...
> I heard this before. Someone posting, "I've built a spider trap". And the post got closed down at Webmasterworld.com as far as I can tell. So I'd like to know, is it possible to build a "spider trap" -- which if I understand correctly would automatically generate random pages for a searchbot to do infinite crawling on one's site?
>
> What it would take:
> - A searchbot that would never stop (e.g. Googlebot not saying "I crawled 10,000 from this domain, that's my limit")
> - Automatically generated random pages (e.g. from a dictionary database backend)
> - Automatically generated random links to more random pages from every random page (all on the same server, of course)
> - Maybe some way to hide that it's always the same script, if a bot cares about such (e.g. by the use of Apache's htaccess file)
> - ...?
>
> Is all this even possible? Well, I certainly wouldn't try with my own server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste my bandwidth; it'd get me penalized).
>
> --
> Google Blogoscoped
> http://blog.outer-court.com
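On the worry about a script corrupting .htaccess: one common way to reduce that risk is to write the new rules to a temporary file and rename it into place, so Apache never sees a half-written file. The Python sketch below assumes the script owns the whole file and that answering with 403 Forbidden (the `[F]` flag) is acceptable instead of diverting to another URL. The function names and sample user agents are made up for illustration.

```python
import os
import re
import tempfile


def build_rules(user_agents: list) -> str:
    """Render mod_rewrite directives that refuse the listed user agents."""
    if not user_agents:
        return ""
    lines = ["RewriteEngine On"]
    for i, agent in enumerate(user_agents):
        # Every condition but the last is chained to the next with OR.
        flags = "[NC]" if i == len(user_agents) - 1 else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {re.escape(agent)} {flags}")
    lines.append("RewriteRule ^ - [F]")
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".htaccess.")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp, 0o644)   # mkstemp creates 0600; the web server must read it
        os.replace(tmp, path)  # a rename within one filesystem is atomic
    except BaseException:
        os.unlink(tmp)
        raise


print(build_rules(["EvilSpider", "BadBot"]))
# write_atomically("/path/to/site/.htaccess", build_rules([...]))  # hypothetical path
```

This only protects against partially written files. A rule that is syntactically wrong would still break the site, so checking the output before swapping it in remains a good idea.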
#6
You could probably reduce the amount of traffic it uses by adding some sleeps to the script and bogging the bot way down :)
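A rough sketch of the "add some sleeps" idea, in Python: the response body is handed out in small chunks with a pause before each one. The chunk size, the delay, and the function name are arbitrary assumptions. The `sleep` parameter exists only so the pause can be swapped out for a demonstration.

```python
import time
from typing import Callable, Iterator


def drip(body: bytes, chunk_size: int = 64, delay: float = 2.0,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[bytes]:
    """Yield the body in small chunks, pausing before each one."""
    for start in range(0, len(body), chunk_size):
        sleep(delay)
        yield body[start:start + chunk_size]


# Demonstration with a recording stand-in for time.sleep, so nothing waits.
pauses = []
chunks = list(drip(b"x" * 200, chunk_size=64, delay=2.0, sleep=pauses.append))
print(len(chunks), sum(pauses))
```

Bear in mind that every connection held open this way also occupies a worker on your own server, so the saving in bandwidth is paid for elsewhere.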
#7
#8
In article <bjp9hr$lesqo$1 (AT) ID-203055 (DOT) news.uni-berlin.de>, Philipp Lenssen <info (AT) outer-court (DOT) com> writes
> I heard this before. Someone posting, "I've built a spider trap". And the post got closed down at Webmasterworld.com as far as I can tell. So I'd like to know, is it possible to build a "spider trap" -- which if I understand correctly would automatically generate random pages for a searchbot to do infinite crawling on one's site?

Why would you want to do this? I have enough problems with bots that appear to trap themselves.

> What it would take:

First, detecting that it's a bot. Usually it effectively tells you it is, but it doesn't have to.

> - A searchbot that would never stop (e.g. Googlebot not saying "I crawled 10,000 from this domain, that's my limit")

This is outside your control. Even the most bizarre bots eventually give up and stop or go elsewhere.

> - Automatically generated random pages (e.g. from a dictionary database backend)
> - Automatically generated random links to more random pages from every random page (all on the same server, of course)
> - Maybe some way to hide that it's always the same script, if a bot cares about such (e.g. by the use of Apache's htaccess file)
> - ...?
>
> Is all this even possible? Well, I certainly wouldn't try with my own server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste my bandwidth; it'd get me penalized).

-- Philip Baker
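The point about detection can be illustrated with a naive check of the User-Agent header, written here in Python. The hint list is an assumption and far from exhaustive, and as noted above a bot is free to send a browser-like string, in which case this check sees nothing.

```python
# Illustrative substrings only; real crawlers use many other names.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")


def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string announces itself as a robot."""
    ua = user_agent.lower()
    return any(hint in ua for hint in BOT_HINTS)


print(looks_like_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(looks_like_bot("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

This is why the robots.txt approach discussed earlier in the thread looks at behaviour (requesting an excluded path) instead of trusting what the client says about itself.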