#1
Philip Ronan wrote:

> I recently discovered that robots.txt files aren't necessarily any use on Apache servers. For some reason, the Apache developers decided to treat multiple consecutive forward slashes in a request URI as a single forward slash. So for example, <http://apache.org/foundation/> and <http://apache.org//////foundation/> both resolve to the same page.
>
> Let's suppose the Apache website owners want to stop search engine robots crawling through their "foundation" pages. They could put this rule in their robots.txt file:
>
>     Disallow: /foundation/
>
> But if I posted a link to //////foundation/ somewhere, the search engines will be quite happy to index it because it isn't covered by this rule.
>
> As a result of all this, Google is currently indexing a page on my website that I specifically asked it to stay away from :-(
>
> You might want to check the behaviour of your servers to see if you're vulnerable to the same sort of problem. If anyone's interested, I've put together a .htaccess rule and a PHP script that seem to sort things out.

I thought that parsing and processing a robots.txt file was the responsibility of the bot and not the Web server. All the Web server has to do is deliver the robots.txt file to the bot. If that is true, the problem lies with Google and not Apache.
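To make the mismatch described above concrete, here is a minimal sketch of plain prefix matching, which is roughly how a crawler compares a URL path against Disallow rules. The function name `is_disallowed` is hypothetical, and this is not how Google or any particular crawler is implemented; it only illustrates why the extra slashes slip past the rule.

```python
def is_disallowed(path, disallow_rules):
    """Return True if the path starts with any non-empty Disallow prefix.

    Hypothetical helper: a simplified model of robots.txt prefix matching,
    with no wildcard or Allow handling.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/foundation/"]

# The path exactly as the rule was written is blocked.
print(is_disallowed("/foundation/", rules))       # True

# The same page reached through extra slashes is not covered by the rule.
print(is_disallowed("//////foundation/", rules))  # False
```

The server treats both paths as the same resource, but a matcher that compares strings literally sees two different paths.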
#2
I was about to opine that "http://apache.org//////" is not the same as "http://apache.org/", but it appears that IIS has the same behavior; see, for example, [ http://www.adsi4nt.com//////demo//////adviisprop.asp ]. Is there something in the specs that says that treating "//////" and "/" the same is proper behavior?
#3
> I was about to opine that "http://apache.org//////" is not the same as "http://apache.org/", but it appears that IIS has the same behavior: See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ]. Is there something in the specs that says that treating "//////" and "/" the same is proper behavior?

Don't know, but it seems to be the case on Unix/Linux filesystems too. If I 'cd //////usr////////////local////apache2', I end up in /usr/local/apache2.
#4
> Don't know, but it seems to be the case on unix/linux filesystems too, If I 'cd //////usr////////////local////apache2' I end up in /usr/local/apache2

Same goes for Windows/DOS; 'cd ///windows///system32' brings you to '/windows/system32'.
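The Unix behaviour mentioned above can be reproduced without touching a real filesystem. This is a small sketch using Python's `posixpath.normpath`, which normalizes a path purely as a string; it is an illustration of the same collapsing rule, not what the shell or Apache runs internally.

```python
import posixpath

messy = "//////usr////////////local////apache2"

# Runs of three or more leading slashes, and any run of slashes inside
# the path, collapse to a single slash.
print(posixpath.normpath(messy))          # /usr/local/apache2

# Exactly two leading slashes are preserved, because POSIX leaves their
# meaning implementation-defined.
print(posixpath.normpath("//usr/local"))  # //usr/local
```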

#5
> I was about to opine that "http://apache.org//////" is not the same as "http://apache.org/", but it appears that IIS has the same behavior: See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ]. Is there something in the specs that says that treating "//////" and "/" the same is proper behavior?

Which specs are you referring to?
#6
> I was about to opine that "http://apache.org//////" is not the same as "http://apache.org/", but it appears that IIS has the same behavior: See for example [ http://www.adsi4nt.com//////demo//////adviisprop.asp ]. Is there something in the specs that says that treating "//////" and "/" the same is proper behavior?
#7
I don't understand why this is a big deal. The issue can be addressed by numerous methods, including patching of the Apache web server source code.

It's not clear exactly what the problem *is*. I've never seen a URL with multiple adjacent forward slashes in my search results. Does someone have an example?

Thank you for your note. We apologize for our delayed response. We understand you're concerned about the inclusion of http://###.####.###//contact/ in our index. It's important to note that we visited the live page in question and found that it currently exists on the web as listed above. Because this page falls outside your robots.txt file, you may want to use meta tags to remove this page from our index. For more information about using meta tags, please visit http://www.google.com/remove.html [remainder snipped]
#8
> It's not clear exactly what the problem *is*. I've never seen a URL with multiple adjacent forward slashes in my search results. Does someone have an example?

#9
> It's not clear exactly what the problem *is*. I've never seen a URL with multiple adjacent forward slashes in my search results.
#10
The robots.txt protocol is ineffective on (probably) most servers because it can be circumvented without your knowledge by a third party.

> Because this page falls outside your robots.txt file, you may want to use meta tags to remove this page from our index.
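One server-side way to close this gap is to redirect any request whose path contains repeated slashes to the single-slash form, so that only URLs covered by robots.txt ever return content. The sketch below is an assumption about how such a check could look, not the .htaccess rule or PHP script mentioned earlier in the thread; the function name `redirect_target` is hypothetical.

```python
import re


def redirect_target(request_target):
    """Return the canonical target to redirect to (e.g. with a 301),
    or None if the path needs no change.

    Hypothetical helper: only the path part is collapsed; the query
    string, if any, is passed through untouched.
    """
    path, sep, query = request_target.partition("?")
    collapsed = re.sub(r"/{2,}", "/", path)
    if collapsed == path:
        return None
    return collapsed + sep + query


print(redirect_target("//contact/"))             # /contact/
print(redirect_target("//////foundation/?a=1"))  # /foundation/?a=1
print(redirect_target("/contact/?next=//x"))     # None
```

A redirect, rather than silently serving the page, means a crawler following a link with extra slashes ends up requesting the canonical URL, which the Disallow rule does cover.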