HighDots Forums  

Re: Warning: robots.txt unreliable in Apache servers

Search Engine Optimization Discussion about SEO/Search Engine Optimization (alt.internet.search-engines)


#1
Guy Macon

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 01:46 PM








David Ross wrote:
Quote:
Philip Ronan wrote:

I recently discovered that robots.txt files aren't necessarily any use on
Apache servers.

For some reason, the Apache developers decided to treat multiple consecutive
forward slashes in a request URI as a single forward slash. So for example,
<http://apache.org/foundation/> and <http://apache.org//////foundation/>
both resolve to the same page.

Let's suppose the Apache website owners want to stop search engine robots
crawling through their "foundation" pages. They could put this rule in their
robots.txt file:

Disallow: /foundation/

But if I posted a link to //////foundation/ somewhere, the search engines
will be quite happy to index it because it isn't covered by this rule.
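
To make the mismatch concrete, here is a minimal Python sketch of plain prefix matching, which is how a Disallow line is normally applied. The function name `is_disallowed` is made up for illustration, and real crawlers handle details (per-agent groups, percent-decoding) that are left out here.

```python
def is_disallowed(path, disallow_rules):
    """Simplified robots.txt check: a path is blocked if it starts with a rule."""
    # An empty Disallow value means "nothing is disallowed", so skip it.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]

print(is_disallowed("/foundation/", rules))       # True: covered by the rule
print(is_disallowed("//////foundation/", rules))  # False: prefix does not match
```

The server treats both paths as the same page, but a crawler comparing strings sees two different URLs, and only one of them is covered.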

As a result of all this, Google is currently indexing a page on my website
that I specifically asked it to stay away from :-(

You might want to check the behaviour of your servers to see if you're
vulnerable to the same sort of problem.

If anyone's interested, I've put together a .htaccess rule and a PHP script
that seem to sort things out.
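
The .htaccess rule and PHP script mentioned above are not shown in this thread. As a rough idea of what such a fix has to do, the Python sketch below collapses runs of slashes in a request path and reports where to redirect. `canonical_redirect` is a hypothetical helper, and it only looks at the path, not the query string.

```python
import re


def canonical_redirect(path):
    """Return the single-slash form of path if it differs, else None."""
    collapsed = re.sub(r"/{2,}", "/", path)
    return collapsed if collapsed != path else None


print(canonical_redirect("//////foundation/"))  # /foundation/
print(canonical_redirect("/foundation/"))       # None
```

A server could answer the first request with a permanent redirect to the returned path, so that a well-behaved crawler ends up at the one URL that robots.txt actually covers.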

I thought that parsing and processing a robots.txt file is the
responsibility of the bot and not the Web server. All the Web
server has to do is deliver the robots.txt file to the bot.

If that is true, the problem lies within Google and not Apache.

I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

--
Guy Macon <http://www.guymacon.com/>





#2
Brian Wakem

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 02:14 PM






Guy Macon <http://www.guymacon.com/> wrote:

Quote:
I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?


Don't know, but it seems to be the case on Unix/Linux filesystems too.

If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2.

The web servers are probably mimicking this behaviour.
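
The same collapsing can be seen from Python's standard library, as a quick check rather than proof of what any web server does. `posixpath.normpath` applies POSIX-style rules regardless of the platform it runs on. Note that it deliberately keeps exactly two leading slashes, which POSIX allows to be treated specially.

```python
import posixpath

# Six leading slashes and the inner runs all collapse to single slashes.
print(posixpath.normpath("//////usr////////////local////apache2"))  # /usr/local/apache2

# Exactly two leading slashes are preserved.
print(posixpath.normpath("//usr/local"))  # //usr/local
```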


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png


#3
alain

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 02:51 PM



Brian Wakem wrote:
Quote:
I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

Don't know, but it seems to be the case on unix/linux filesystems too,

If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2

Same goes for Windows/DOS:
'cd ///windows///system32' brings you to '/windows/system32'.


--
Touched By His Noodly Appendage


#4
Borek

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 03:04 PM



On Sun, 30 Oct 2005 20:51:33 +0100, alain <alain (AT) spamcop (DOT) net> wrote:

Quote:
Don't know, but it seems to be the case on unix/linux filesystems too,
If I 'cd //////usr////////////local////apache2' I end up
in /usr/local/apache2

Same goes for Windows/DOS;
'cd ///windows///system32' brings you to '/windows/system32'.

Interesting, my Windows doesn't accept /, but it accepts \

Best,
Borek
--
http://www.chembuddy.com
http://www.chembuddy.com/?left=BATE&...ion_equilibria
http://www.chembuddy.com/?left=CASC&...n_calculator



#5
Jim Moe

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 03:18 PM



Guy Macon wrote:
Quote:
I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

You are referring to which specs?
This behavior for following paths is from unix and is how all C
compilers handle paths. It is simply applied to URLs as well. There may
even be a requirement in the C specification about paths.

--
jmm (hyphen) list (at) sohnen-moe (dot) com
(Remove .AXSPAMGN for email)


#6
Dave0x1

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 03:45 PM



Guy Macon wrote:


Quote:
I was about to opine that "http://apache.org//////" is not the same
as "http://apache.org/", but it appears that IIS has the same behavior:
See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ].
Is there something in the specs that says that treating "//////" and
"/" the same is proper behavior?

Hint: Read the documentation offered at either of the first two URLs.

I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Dave



#7
Philip Ronan

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 06:38 PM



"Dave0x1" wrote:

Quote:
I don't understand why this is a big deal. The issue can be addressed
by numerous methods, including patching of the Apache web server source
code.

OK, so as long as the robots.txt documentation includes a note saying that
you have to patch your server software to get reliable results, then we'll
all be fine.

Quote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

Which bit didn't I explain properly? I'm not going to post a link for you to
check, but here's the response I got from Google on the issue:

Quote:
Thank you for your note. We apologize for our delayed response.
We understand you're concerned about the inclusion of
http://###.####.###//contact/ in our index.

It's important to note that we visited the live page in question
and found that it currently exists on the web as listed above.
Because this page falls outside your robots.txt file, you may
want to use meta tags to remove this page from our index. For
more information about using meta tags, please visit
http://www.google.com/remove.html

[remainder snipped]

I didn't publish the link to //contact/, someone else did. So that means the
robots.txt protocol is ineffective on (probably) most servers because it can
be circumvented without your knowledge by a third party.

Hope that's all clear now.

--
phil [dot] ronan @ virgin [dot] net
http://vzone.virgin.net/phil.ronan/




#8
Borek

Re: Warning: robots.txt unreliable in Apache servers - 10-30-2005, 06:43 PM



On Sun, 30 Oct 2005 21:45:32 +0100, Dave0x1 <ask (AT) example (DOT) com> wrote:

Quote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results. Does
someone have an example?

/%3Fleft%3DpH-calculation%26right%3Dtoc&hl=pt-BR&lr=lang_pt&sa=G
/?left=BATE&amp%3Bright=phcalculation
/?left=BATE&amp;amp;right=dissociation_constants
/?left=BATE&right=basic_acid_titr
/?left=BATE&right=basic_acid_titration_equilbria
/?left=BATE&right=basic_acid_titration_equilibri
/?left=BATE&right=basic_acid_titration_equilibria"> pH
/?left=BATE&right=basic_acid_titration_equilibria%22%3EpH
/?left=BATE&right=basic_acid_titration_equilibria/////////////////////////////////////////////////////
/?left=BATE&right=dissociation_constants]</td></tr><tr>
/?left=casc&amp/
/?left=casc&amp;right=download
/?left=faq/
/?left=dave-is-great
/?left=BATE&right=basic_acid_titration_equilibria/
/index.php[left]BATE[right]overview[SiteID]simtel.net
/pHlecimg/3-f.png
/pHlecimg/3-g.png
/?left=pH-calculation
/?left=casc&right=concentration_and_solution_calculator
/?left=casc&right=density_tables
/files/CASCInstall.ziphttp:/www.chembuddy.com/files/CASCInstall.exe
/?left=bate&right=dissociation_constants
/?left=bate&right=download
/?left=bate&right=screenshots
/this_is_a_test_of_404_response
/?left=CASC&amp;right=buy
/?left=CASC&right=concentration_and_solution_calculator://
/?left=CASC&amp;right=density_tables
/?left=BATE&right=right=basic_acid_titration_equilibria

All of these generated 404s on my site in the last few weeks.

No additional slashes inside the URL, although several times
they were added at the end.

& vs &amp; and wrong capitalization (bate, casc instead of BATE, CASC)
are the most prominent sources of errors. But it seems every error is possible.


Best,
Borek
--
http://www.chembuddy.com - chemical calculators for labs and education
BATE - program for pH calculations
CASC - Concentration and Solution Calculator
pH lectures - guide to hand pH calculation with examples


#9
Guy Macon

Re: Warning: robots.txt unreliable in Apache servers - 10-31-2005, 01:07 AM





Dave0x1 wrote:

Quote:
It's not clear exactly what the problem *is*. I've never seen a URL
with multiple adjacent forward slashes in my search results.

If there exists a way for someone else on the Internet to override
your spidering decisions as defined in robots.txt, there will be
those who use that ability to inconvenience, harass or hurt others.





#10
Tim

Re: Warning: robots.txt unreliable in Apache servers - 10-31-2005, 06:33 AM



Philip Ronan:

Quote:
the robots.txt protocol is ineffective on (probably) most servers because
it can be circumvented without your knowledge by a third party.

It always has been, anyway. For numerous reasons. Your multiple slash
example is just one of them. Some robots will ignore robots.txt altogether,
others will deliberately look at what you tell them to ignore.

Likewise with Google's advice:

Quote:
Because this page falls outside your robots.txt file, you may want to
use meta tags to remove this page from our index.

In either case, such restrictions only help reduce the load on your server
from well meaning robots. If you want to truly restrict access, you need
to use some form of authentication.
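
As one illustration of what "some form of authentication" can mean, here is a minimal Python sketch that checks an HTTP Basic `Authorization` header against a known user name and password. The function name and the sample credentials are hypothetical, and a real site would normally rely on the web server's own authentication support over HTTPS rather than hand-rolled code.

```python
import base64
import hmac


def check_basic_auth(header_value, expected_user, expected_password):
    """Return True if a Basic 'Authorization' header matches the credentials."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok


# Hypothetical credentials, for illustration only.
token = base64.b64encode(b"alice:s3cret").decode("ascii")
print(check_basic_auth("Basic " + token, "alice", "s3cret"))  # True
print(check_basic_auth("Basic " + token, "alice", "wrong"))   # False
```

Unlike robots.txt, a check like this is enforced by the server, so it applies to robots that ignore the exclusion rules as well.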

There were moves to suggest the robots exclusion ought to let you specify
what you allow and disallow. For some cases it'd be easier to exclude
everything by default, only allowing what you want through. Though I
don't think that ever took off.

--
If you insist on e-mailing me, use the reply-to address (it's real but
temporary). But please reply to the group, like you're supposed to.

This message was sent without a virus, please destroy some files yourself.


