HighDots Forums  

"I've built a spider trap"

Search Engine Optimization Discussion about SEO/Search Engine Optimization (alt.internet.search-engines)


#1
Philipp Lenssen

"I've built a spider trap" - 09-11-2003, 03:51 AM






I've heard this before: someone posting, "I've built a spider trap". And
the post got closed down at Webmasterworld.com, as far as I can tell.
So I'd like to know, is it possible to build a "spider trap" -- which,
if I understand correctly, would automatically generate random pages for
a searchbot to crawl infinitely on one's site?

What it would take:
- A searchbot that would never stop (e.g. Googlebot not saying "I
crawled 10,000 pages from this domain, that's my limit")
- Automatically generated random pages (e.g. from a dictionary database
backend)
- Automatically generated random links to more random pages from every
random page (all on the same server, of course)
- Maybe some way to hide that it's always the same script, if a bot
cares about that (e.g. by using Apache's .htaccess file)
- ...?
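
The page-generating part of the list above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post: the word list stands in for the dictionary backend, the /trap/ URL prefix is made up, and seeding the generator from the request path is one possible way to make every URL look like a stable static page.

```python
import hashlib
import random

# Tiny stand-in for the "dictionary database backend" (hypothetical word list).
WORDS = ["amber", "birch", "cobalt", "delta", "ember", "fjord", "garnet", "harbor"]

def trap_page(path, n_words=40, n_links=5):
    """Return an HTML page derived deterministically from the request path."""
    # Seed from the path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = []
    for _ in range(n_links):
        slug = "-".join(rng.choice(WORDS) for _ in range(3))
        # Static-looking URLs under /trap/ (an assumed prefix) hide the script.
        links.append('<a href="/trap/%s.html">%s</a>' % (slug, slug))
    return "<html><body><p>%s</p>\n%s\n</body></html>" % (text, "\n".join(links))
```

Every generated page links to five more generated pages, so a crawler with no per-domain limit never runs out of URLs. Whether any real searchbot lacks such a limit is a separate question.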

Is all this even possible? Well, I certainly wouldn't try with my own
server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste
my bandwidth; it'd get me penalized).

--
Google Blogoscoped
http://blog.outer-court.com

#2
Foxglove54321

Re: "I've built a spider trap" - 09-11-2003, 04:39 AM






Philipp Lenssen wrote:
Quote:
Is all this even possible? Well, I certainly wouldn't try with my own
server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste
my bandwidth; it'd get me penalized).

It wouldn't be unethical: you put the spider trap in a directory excluded by
robots.txt, so only disobedient bots will find it. Wasting bandwidth is more of
a problem, but it depends on what you do with the script. If you generate loads
of random email addresses, it's probably unethical, just because some real ones
could be generated by accident. You could simply write a script to make a note
of the IP and then block it, saving you bandwidth.

I'm not sure it would work either.
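
For what it's worth, the "exclude it in robots.txt, then note the IP and block it" idea might look roughly like this in Python. The /trap/ directory, the in-memory blocklist and the handle_request() function are all made up for illustration; a real setup would persist the list or hand it to the web server or firewall.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the trap directory is off limits to every bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

blocked_ips = set()  # illustration only; not persisted anywhere

def handle_request(ip, path):
    """Return an HTTP status code, blocking any client that enters the trap."""
    if ip in blocked_ips:
        return 403
    if path.startswith("/trap/"):
        # Only a client that ignored robots.txt should ever get here.
        blocked_ips.add(ip)
        return 403
    return 200

# A well-behaved crawler parses robots.txt first and never requests the trap.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
```

With these rules, parser.can_fetch("AnyBot", "/trap/page.html") is false, so an obedient bot never reaches handle_request() with a trap path, while a disobedient one gets a 403 on that request and on everything after it.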



--
Alice Woolley
http://www.insidethebubble.co.uk/
Inside the Bubble - autism information


#3
John A.

Re: "I've built a spider trap" - 09-11-2003, 06:14 PM



On 11 Sep 2003 08:39:02 GMT, foxglove54321 (AT) aol (DOT) comblahblah
(Foxglove54321) wrote:

Quote:
Philipp Lenssen wrote:
Is all this even possible? Well, I certainly wouldn't try with my own
server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste
my bandwidth; it'd get me penalized).

It wouldn't be unethical: you put the spider trap in a directory excluded by
robots.txt, so only disobedient bots will find it. Wasting bandwidth is more of
a problem, but it depends on what you do with the script. If you generate loads
of random email addresses, it's probably unethical, just because some real ones
could be generated by accident. You could simply write a script to make a note
of the IP and then block it, saving you bandwidth.

I'm not sure it would work either.

Generating random email addresses at the domain of the visiting spider
would come up with far fewer innocent addresses, but would still
probably be unethical. (Darn it.)

It's certainly within the capabilities of most web servers, if not
most webmasters, to do this. Probably not worth the bandwidth, though.


#4
Sparky

Re: "I've built a spider trap" - 09-12-2003, 03:25 AM



I suppose you could build a spider trap that adds the user agents of robots
that are not obeying your robots exclusion file to your .htaccess mod_rewrite
list and then diverts the visitor to
http://www.paostyle.tv/contents/check/getfuck.html

It would be possible but could cause your server to go down if the script
that mods .htaccess should corrupt it.
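
One way to lower the risk of a half-written .htaccess is to build the rules in memory, write them to a temporary file and rename it over the old one. The Python below is a rough sketch under a few assumptions: the function names are made up, the user agents are matched as literal text, and the rule answers 403 Forbidden rather than redirecting. It also writes the whole file, so any existing directives would have to be merged in first.

```python
import os
import re
import tempfile

def rewrite_block(user_agents):
    """Build mod_rewrite directives that answer 403 to the listed agents."""
    agents = sorted(set(user_agents))
    if not agents:
        return ""
    lines = ["RewriteEngine On"]
    for i, agent in enumerate(agents):
        # Every condition except the last is chained to the next with OR.
        flags = "[NC]" if i == len(agents) - 1 else "[NC,OR]"
        lines.append("RewriteCond %%{HTTP_USER_AGENT} %s %s" % (re.escape(agent), flags))
    lines.append("RewriteRule .* - [F,L]")
    return "\n".join(lines) + "\n"

def write_atomically(path, text):
    """Write to a temp file in the same directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp, 0o644)   # mkstemp creates the file readable by owner only
        os.replace(tmp, path)  # readers see the old file or the new one, never half
    except BaseException:
        os.unlink(tmp)
        raise
```

If the script dies while writing, only the temporary file is affected and the live .htaccess stays as it was. A syntactically wrong rule would still break the site, so checking the generated block before swapping it in remains a good idea.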
--
The Ultimate Search Engine Links Page
http://www.searchenginelinks.co.uk
#5
j

Re: "I've built a spider trap" - 09-12-2003, 09:38 AM



You could probably reduce the amount of traffic it uses by adding some
sleeps to the script and bogging the bot way down.
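
A minimal Python sketch of the "add some sleeps" idea, assuming the trap page is sent line by line. The slow_lines() name and the injectable sleep parameter are made up for illustration; the parameter only exists so the delay can be swapped out when trying the function.

```python
import time

def slow_lines(lines, delay_seconds=1.0, sleep=time.sleep):
    """Yield lines one at a time, pausing before each one."""
    for line in lines:
        sleep(delay_seconds)  # the bot waits while the server sends almost nothing
        yield line
```

Fed to a server that streams each yielded line to the client, a 1000-line page with a 1-second delay keeps the bot busy for roughly 17 minutes per request.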

J
#6
John A.

Re: "I've built a spider trap" - 09-12-2003, 02:54 PM



On Fri, 12 Sep 2003 09:38:00 -0400, "j" <jay (AT) tequila-stuff (DOT) com> wrote:

Quote:
You could probably reduce the amount of traffic it uses by adding some
sleeps in the scripts and bogging it way down

J

Some hosting packages limit the number of concurrent connections, so
extending the connection length still robs you of effective bandwidth.
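
One possible way around the connection limit is to cap how many slow responses run at once and answer any further trapped bots quickly. A rough Python sketch, where the cap of 2 and both function names are assumptions made for illustration:

```python
import threading

MAX_TARPIT_SLOTS = 2  # hypothetical cap, kept well below the host's connection limit
_slots = threading.BoundedSemaphore(MAX_TARPIT_SLOTS)

def try_enter_tarpit():
    """Claim a slow-response slot without waiting; False means none are free."""
    return _slots.acquire(blocking=False)

def leave_tarpit():
    """Release a slot claimed by a successful try_enter_tarpit()."""
    _slots.release()
```

A handler would call try_enter_tarpit() before starting a slow response, send a short error page instead when it returns False, and call leave_tarpit() in a finally block once the slow response ends.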


#7
Philip Baker

Re: "I've built a spider trap" - 09-13-2003, 11:22 PM



In article <bjp9hr$lesqo$1 (AT) ID-203055 (DOT) news.uni-berlin.de>, Philipp
Lenssen <info (AT) outer-court (DOT) com> writes
Quote:
I've heard this before: someone posting, "I've built a spider trap". And
the post got closed down at Webmasterworld.com, as far as I can tell.
So I'd like to know, is it possible to build a "spider trap" -- which,
if I understand correctly, would automatically generate random pages for
a searchbot to crawl infinitely on one's site?

Why would you want to do this? I have enough problems with bots that
appear to trap themselves.

Quote:
What it would take:

First, detecting that it's a bot. Usually it effectively tells you it is, but
it doesn't have to.
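
The "it effectively tells you" case usually means the User-Agent header. A naive Python check might look like the sketch below; the token list is illustrative rather than exhaustive, and a bot that sends a browser-like User-Agent passes straight through, which is exactly the "doesn't have to" case.

```python
BOT_TOKENS = ("bot", "crawler", "spider", "slurp")  # illustrative, not exhaustive

def declares_itself_bot(user_agent):
    """True if the User-Agent header contains a common crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BOT_TOKENS)
```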

Quote:
- A searchbot that would never stop (e.g. Googlebot not saying "I
crawled 10,000 pages from this domain, that's my limit")

This is outside your control. Even the most bizarre bots eventually give
up and stop or go elsewhere.

Quote:
- Automatically generated random pages (e.g. from a dictionary database
backend)
- Automatically generated random links to more random pages from every
random page (all on the same server, of course)
- Maybe some way to hide that it's always the same script, if a bot
cares about that (e.g. by using Apache's .htaccess file)
- ...?

Is all this even possible? Well, I certainly wouldn't try with my own
server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste
my bandwidth; it'd get me penalized).

--
Philip Baker


#8
j

Re: "I've built a spider trap" - 09-19-2003, 03:04 PM




"Philip Baker" <news0309 (AT) thalasson (DOT) com> wrote

Quote:
In article <bjp9hr$lesqo$1 (AT) ID-203055 (DOT) news.uni-berlin.de>, Philipp
Lenssen <info (AT) outer-court (DOT) com> writes
I've heard this before: someone posting, "I've built a spider trap". And
the post got closed down at Webmasterworld.com, as far as I can tell.
So I'd like to know, is it possible to build a "spider trap" -- which,
if I understand correctly, would automatically generate random pages for
a searchbot to crawl infinitely on one's site?

Why would you want to do this? I have enough problems with bots that
appear to trap themselves.

Revenge? :-)

Quote:
What it would take:

First detecting it's a bot. Usually it effectively tells you it is but
it doesn't have to.

If you tell people on the page flat out that it's not a link they should
click, they usually avoid it. You don't really have to detect anything;
the page can say right on it, "This is a spambot trap. Normal users should
not be here." It's only the automated spiders that ignore robots.txt and
can't read the notice that end up on the page, which feeds them hundreds of
totally bogus email addresses or something like that.

Quote:
- A searchbot that would never stop (e.g. Googlebot not saying "I
crawled 10,000 pages from this domain, that's my limit")

This is outside your control. Even the most bizarre bots eventually give
up and stop or go elsewhere.

Well, after I saw this thread I made one with Perl just for fun to see if I
could catch anything. It only took a couple of days before I had one locked in
real good. It's rather funny: I shoveled it 1000 totally random email
addresses per page, and at the bottom was a link to the same page. I also
had it email me so I could watch the fun! It kept going till I renamed the
HTML file (I didn't want to use too much bandwidth). It's not something I'm
going to do long term; I was just playing with it. Oh yeah, I tried adding a
1-second delay between each line and that worked quite well. It slowed the bot
right down and really didn't have much effect at all on my site.
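
The Perl script itself isn't posted here, so the following is not it, just a Python sketch of the same idea: a page of random addresses that ends with a link back to itself. The function names are made up, and putting every address under the reserved example.invalid domain is an added assumption, chosen so that no real mailbox can be generated by accident.

```python
import random
import string

def random_address(rng, domain="example.invalid"):
    """Make one bogus address; the .invalid TLD is reserved and never resolves."""
    length = rng.randint(5, 10)
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return "%s@%s" % (local, domain)

def email_trap_page(self_url, count=1000, rng=None):
    """Return an HTML page of bogus mailto links ending in a link to itself."""
    rng = rng or random.Random()
    rows = []
    for _ in range(count):
        address = random_address(rng)
        rows.append('<a href="mailto:%s">%s</a><br>' % (address, address))
    # The final link points back at the same page, so the bot keeps looping.
    rows.append('<a href="%s">more</a>' % self_url)
    return "<html><body>\n%s\n</body></html>" % "\n".join(rows)
```

A harvester that only accepts addresses at resolvable domains would throw these away, so this version trades some of the bait's realism for the guarantee that nobody's real address ends up on a spam list.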

Quote:
- Automatically generated random pages (e.g. from a dictionary database
backend)
- Automatically generated random links to more random pages from every
random page (all on the same server, of course)
- Maybe some way to hide that it's always the same script, if a bot
cares about that (e.g. by using Apache's .htaccess file)
- ...?

Is all this even possible? Well, I certainly wouldn't try with my own
server (for 4 reasons: I doubt it works; it'd be unethical; it'd waste
my bandwidth; it'd get me penalized).

It's possible; I can show you my log file. If you want to see it yourself,
send me an email.

jay6R-E-M-O-V-E (AT) cox (DOT) net

Quote:

--
Philip Baker


