HighDots Forums  

Google does not obey robots.txt

Search Engine Optimization Discussion about SEO/Search Engine Optimization (alt.internet.search-engines)


#1
wd

Default Google does not obey robots.txt - 06-01-2006 , 01:26 PM






Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password
/tracker

Has anyone else seen this problem? (or is there a mistake in robots.txt?)
This is the second site that I have seen it on.
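One way to rule out a mistake in the file itself is to run the rules through a standard parser. The following is a minimal sketch using Python's `urllib.robotparser`, which applies plain prefix matching. It only shows how that parser reads the rules, not how Googlebot behaves. The `/node/1` path is a hypothetical example of a page no rule covers, and `check_paths` is a helper name made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin
"""

# Paths reported as indexed, plus /node/1 (hypothetical, not covered by any rule).
PATHS = [
    "/comment/reply/1",
    "/user/register",
    "/aggregator/sources/1",
    "/user/password",
    "/tracker",
    "/node/1",
]


def check_paths(robots_txt, paths, agent="Googlebot"):
    """Map each path to True if `agent` may fetch it under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


results = check_paths(ROBOTS_TXT, PATHS)
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'disallowed'}")
```

Under this parser every reported path comes back as disallowed and only `/node/1` is allowed, which suggests the rules are written correctly for a crawler that honors prefix matching.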

#2
Big Bill

Default Re: Google does not obey robots.txt - 06-01-2006 , 04:08 PM






On Thu, 01 Jun 2006 13:26:23 -0400, wd <mail (AT) mail (DOT) invalid> wrote:

Quote:
Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password
/tracker

Has anyone else seen this problem? (or is there a mistake in robots.txt?)
This is the second site that I have seen it on.
It could be, as can always be the case, that a less conscientious
spider is indexing the disallowed files and that Google is in turn
indexing them from there. If you don't want it indexed, don't put it
on the web.

BB
--

http://www.kruse.co.uk/seo-competition.htm
http://www.here-be-posters.co.uk/lin...an-posters.htm
http://www.crystal-liaison.com/angel...-me/index.html



#3
Charles C.

Default Re: Google does not obey robots.txt - 06-01-2006 , 04:33 PM



wd wrote:
Quote:
Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password
/tracker

Has anyone else seen this problem? (or is there a mistake in robots.txt?)
This is the second site that I have seen it on.
Try a trailing slash if you exclude directories. It may or may not help.
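To see what a trailing slash changes under plain prefix matching, here is a small sketch with Python's `urllib.robotparser`. It reflects that parser's reading of the rule only. The `allowed` helper and the `/users` path are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule_path, url_path, agent="Googlebot"):
    """Return True if `agent` may fetch `url_path` under one Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule_path}"])
    return parser.can_fetch(agent, url_path)


# Compare the rule with and without the trailing slash.
for rule in ("/user", "/user/"):
    for path in ("/user", "/user/register", "/users"):
        verdict = "allowed" if allowed(rule, path) else "disallowed"
        print(f"Disallow: {rule:<7} {path:<15} {verdict}")
```

With `Disallow: /user` all three paths are blocked, because each starts with `/user`. With `Disallow: /user/` only `/user/register` is blocked, so the trailing slash narrows the rule rather than widening it.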

Regards
Charles

--
Please remove _removeme_ to reply.


#4
Brian Wakem

Default Re: Google does not obey robots.txt - 06-01-2006 , 05:35 PM



wd wrote:

Quote:
Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password

Sounds interesting. What's in /user/password?


--
Brian Wakem
Email: http://homepage.ntlworld.com/b.wakem/myemail.png


#5
John Bokma

Default Re: Google does not obey robots.txt - 06-01-2006 , 06:23 PM



Brian Wakem <no (AT) email (DOT) com> wrote:

Quote:
wd wrote:

Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password


Sounds interesting. What's in /user/password?
151,000 answers:

http://www.google.com/search?q=inurl...er/password%22

--
John Skilled Perl programmer for hire: http://castleamber.com/

Fox noGO:http://johnbokma.com/firefox/removin...earch-bar.html


#6
Roy Schestowitz

Default Re: Google does not obey robots.txt - 06-02-2006 , 03:38 AM



__/ [ John Bokma ] on Thursday 01 June 2006 23:23 \__

Quote:
Brian Wakem <no (AT) email (DOT) com> wrote:

wd wrote:

Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password


Sounds interesting. What's in /user/password?

151,000 answers:

http://www.google.com/search?q=inurl...er/password%22
Seems like most answers are irrelevant, but in certain CMSs, this is a page
that serves a function such as password changes. Interestingly, if it were
/users/password, you'd probably get plenty of Web sites that had registered
users whose username is 'password'.

Best wishes,

Roy

PS - I think that's what they call Google hacking. I once found someone's
entire PDA data on some Webspace and reported this to him. He was an MIT
sysadmin, as ironic as it may seem.

--
Roy S. Schestowitz
http://Schestowitz.com | GNU is Not UNIX ¦ PGP-Key: 0x74572E8E
8:35am up 35 days 15:07, 16 users, load average: 3.22, 3.11, 3.03
http://iuron.com - proposing a non-profit search engine


#7
wd

Default Re: Google does not obey robots.txt - 06-02-2006 , 09:06 AM



On Thu, 01 Jun 2006 22:35:35 +0100, Brian Wakem wrote:

Quote:
wd wrote:

Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password


Sounds interesting. What's in /user/password?
A "request new password" screen. It's a standard page in the Drupal
content management system.



#8
wd

Default Re: Google does not obey robots.txt - 06-02-2006 , 09:09 AM



On Thu, 01 Jun 2006 21:33:25 +0100, Charles C. wrote:

Quote:
wd wrote:
Google is not obeying robots.txt at all now.

Here is the typical Drupal robots.txt with a few modifications:

User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
Disallow: /admin

Google has quite a few pages like the following indexed:
/comment/reply/1
/user/register
/aggregator/sources/1
/user/password
/tracker

Has anyone else seen this problem? (or is there a mistake in
robots.txt?) This is the second site that I have seen it on.

Try a trailing slash if you exclude directories. It may or may not help.
I will try it, but even if it helps, Google still has a serious robots.txt
bug. It should obey these rules -- anything starting with /user should be
ignored whether it is a directory or not.


#9
wd

Default Re: Google does not obey robots.txt - 06-02-2006 , 09:15 AM



On Thu, 01 Jun 2006 20:08:33 +0000, Big Bill wrote:

Quote:
It could be, as can always be the case, that a less conscientious spider
is indexing the disallowed files and that Google is in turn indexing them
from there. If you don't want it indexed, don't put it on the web.
From what I understand, Googlebot is supposed to fetch robots.txt
even if it enters the site from a page other than the home page.
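The file's location depends only on the scheme and host, not on the page a crawler arrives at, so any entry point leads to the same robots.txt. A minimal sketch of that lookup follows. The `robots_url` helper and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A deep link and the home page resolve to the same robots.txt.
print(robots_url("http://www.example.com/comment/reply/1?page=2"))
print(robots_url("http://www.example.com/"))
```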

There is no confidential information on the site. I just want to keep the
engines out of unnecessary places.

I can probably fix it with the robots.txt URL removal tool again. I'm
just pointing out that Googlebot is buggy.


#10
John Bokma

Default Re: Google does not obey robots.txt - 06-02-2006 , 09:58 AM



Roy Schestowitz <newsgroups (AT) schestowitz (DOT) com> wrote:

Quote:
PS - I think that's what they call Google hacking.
Yup. OTOH, I guess anything you do with Google involving one or two of the
advanced search operators is called Google hacking :-D.

--
John

Google Suggest Perl script http://johnbokma.com/perl/google-suggest.html

