HighDots Forums  

Re: Warning: robots.txt unreliable in Apache servers

Search Engine Optimization Discussion about SEO/Search Engine Optimization (alt.internet.search-engines)


#11 Philip Ronan - 10-31-2005, 08:42 AM

"Tim" wrote:

Quote:
Philip Ronan:

the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.

It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

What you're saying is that it's pointless putting absolute faith in
robots.txt files because they are ignored by some robots. I'm not disputing
that. What I'm saying is that even genuine well-behaved robots like
Googlebot can be made to crawl content prohibited by robots.txt files.

So for example, if you're using a honeypot to block badly behaved robots
from your website automatically, then I can *remove your site from Google*
and probably other search engines simply by publishing a link to your
honeypot directory with an extra slash inserted somewhere. That's why this
issue is important.
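
To make the mismatch concrete, here is a minimal Python sketch. It assumes a robot that applies Disallow rules as plain path-prefix matches and does not collapse repeated slashes before comparing. The function name is_disallowed and the /honeypot/ path are made up for illustration; this is not Googlebot's actual code.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow prefix."""
    # An empty Disallow value means "allow everything", so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# Hypothetical rule set, equivalent to the line "Disallow: /honeypot/".
RULES = ["/honeypot/"]

# The same file, requested with and without an extra leading slash.
normal_blocked = is_disallowed("/honeypot/trap.html", RULES)
doubled_blocked = is_disallowed("//honeypot/trap.html", RULES)

print(normal_blocked)   # True
print(doubled_blocked)  # False
```

The second path falls outside the rule, yet a server that treats repeated slashes as one will serve the same file for both requests.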

I hope you understand now.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/



#12 Guy Macon - 10-31-2005, 11:58 AM

Tim wrote:
Quote:
Philip Ronan:

the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.

It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

The robots.txt protocol has always been ineffective on bad
robots, but this is, as far as I know, the first example of
it being ineffective on good robots.

--
Guy Macon <http://www.guymacon.com>




#13 Tim - 11-01-2005, 07:35 AM

Tim:

Quote:
It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore them altogether,
others will deliberately look at what you tell them to ignore.

Guy Macon:

Quote:
The robots.txt protocol has always been ineffective on bad robots, but
this is, as far as I know, the first example of it being ineffective on
good robots.

I'm not so sure that it's a fault with robots.txt. After all,
strangeness notwithstanding, ///example isn't the same as /example.
Personally, I think this is an issue you'd need to deal with within the
server (e.g. filter requests to disallow access to URIs with multiple
consecutive slashes in them, rather than work around such conditions).
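
As a rough illustration of that kind of server-side filter, here is a minimal Python sketch written as WSGI middleware. It assumes the server hands the application the raw request path in PATH_INFO without collapsing slashes first, which not every server does. The names reject_multiple_slashes, hello_app and status_for are invented for this example; on Apache the same idea would more likely live in the server configuration.

```python
import re

# Two or more slashes in a row anywhere in the path.
MULTI_SLASH = re.compile(r"/{2,}")


def reject_multiple_slashes(app):
    """Wrap a WSGI app so paths with adjacent slashes get a 404."""
    def middleware(environ, start_response):
        # Assumption: PATH_INFO still holds the path as requested.
        path = environ.get("PATH_INFO", "")
        if MULTI_SLASH.search(path):
            body = b"Not Found"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in application that answers every request with 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def status_for(path):
    """Call the wrapped app directly and return the status line it sent."""
    captured = []

    def start_response(status, headers):
        captured.append(status)

    wrapped = reject_multiple_slashes(hello_app)
    wrapped({"PATH_INFO": path}, start_response)
    return captured[0]


print(status_for("/contact/"))
print(status_for("//contact/"))
```

Answering with a 404 keeps ///example and /example distinct, as argued above; redirecting to the single-slash URL would be the alternative if you would rather treat them as the same resource.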

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.



#14 Dave0x01 - 11-02-2005, 05:34 PM

Borek wrote:

Quote:
On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <ask (AT) example (DOT) com> wrote:


It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?
<snip>

Quote:
All of these generated 404 in last few weeks on my site.

No additional slashes inside of the url, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are the most prominent sources of errors. But it seems every error is possible.

Sorry, I should've been more clear. I wanted to know whether anyone
could point to an actual URL (e.g., a search query) demonstrating that
URLs with multiple adjacent forward slashes are actually being indexed
by any of the major search engines. I haven't seen one.
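
Short of a search query, one's own access log can at least show whether such URLs are being requested. A minimal Python sketch, assuming Common Log Format lines and ordinary path-only request targets; the function name paths_with_adjacent_slashes and the sample lines are made up for illustration and are not taken from any real log.

```python
import re

# Matches the request part of a log line, e.g. "GET /path HTTP/1.1".
REQUEST = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def paths_with_adjacent_slashes(log_lines):
    """Return requested paths that contain two or more adjacent slashes."""
    hits = []
    for line in log_lines:
        match = REQUEST.search(line)
        if match and "//" in match.group(1):
            hits.append(match.group(1))
    return hits


# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [02/Nov/2005:17:34:00 -0500] "GET /contact/ HTTP/1.1" 200 512',
    '192.0.2.2 - - [02/Nov/2005:17:35:00 -0500] "GET //contact/ HTTP/1.1" 200 512',
]

found = paths_with_adjacent_slashes(SAMPLE)
print(found)
```

This only shows what was crawled, not what ended up in an index, so it answers a narrower question than the one asked.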

However, I don't think that the original poster was concerned with
whether these multiple slashed URLs appear in the index as such, so it's
probably not terribly important.


Dave




#15 Dave0x01 - 11-02-2005, 05:45 PM

Guy Macon wrote:

Quote:
Dave0x1 wrote:


It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.


If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.

A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave


#16 Dave0x01 - 11-02-2005, 05:51 PM

Philip Ronan wrote:

Quote:
"Dave0x1" wrote:


I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.


OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.

Quote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?


Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:


Thank you for your note. We apologize for our delayed response.
We understand you're concerned about the inclusion of
http://###.####.###//contact/ in our index.

Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.

Dave







#17 Big Bill - 11-02-2005, 07:47 PM

On Wed, 02 Nov 2005 17:45:05 -0500, Dave0x01 <ask (AT) example (DOT) com> wrote:

Quote:
Guy Macon wrote:

Dave0x1 wrote:


It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.


If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.

A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

Dave

Which is a good way of putting it.

BB
--
www.kruse.co.uk/ seo (AT) kruse (DOT) demon.co.uk
Elvis does my SEO


#18 Guy Macon - 11-03-2005, 01:37 AM

Dave0x01 wrote:
Quote:
Guy Macon wrote:

Dave0x1 wrote:

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.

A robots.txt file doesn't make any decisions about which parts of a site
are indexed; it merely offers suggestions.

A robots.txt file most certainly does decide which parts of a site
are indexed - by good robots. It offers suggestions that every good
robot obeys. The effect we are discussing allows someone else on the
Internet to override your good-robot spidering decisions as defined in
robots.txt.




#19 Philip Ronan - 11-03-2005, 05:49 AM

"Dave0x01" wrote:

Quote:
Philip Ronan wrote:

"Dave0x1" wrote:

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.

I was being sarcastic. (You're American, right?)

Quote:
Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.

Then what the hell do you think this thread is all about??

For all you doubting Thomases out there:

Exhibit A: http://freespace.virgin.net/phil.ron...bad-google.png

Exhibit B: http://www.japanesetranslator.co.uk/robots.txt
(Last-Modified: Tue, 01 Mar 2005 08:45:29 GMT)

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/



#20 Dave0x01 - 11-15-2005, 10:00 PM

Philip Ronan wrote:

Quote:
"Dave0x01" wrote:


Philip Ronan wrote:

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

I wouldn't consider patching of the Apache source code either necessary
or desirable in this situation.


I was being sarcastic. (You're American, right?)

Yeah, I could tell. And I *wasn't* being sarcastic. What about my
comment do you think implies otherwise?

Quote:
Does the URL in question appear in the index as
<http://###.####.###//contact/>, or as <http://###.####.###/contact/>?
My assumption is the latter.


Then what the hell do you think this thread is all about??
[snip]

One could obviously be concerned about any number of things resulting
from the behavior described.






