HighDots Forums  

Guy Macon on the new Google/Yahoo/Microsoft extended ROBOTS.TXT standard

  #1  
Old   
Guy Macon
 
Posts: n/a

Default Guy Macon on the new Google/Yahoo/Microsoft extended ROBOTS.TXT standard - 06-16-2008 , 08:31 AM









Col Steve Austin Ret wrote:
Quote:
kenneth wrote:

Disallow: /*?*
Disallow: /*?

and here I thought robots.txt didn't include wildcards

You thought wrong.

Google, Yahoo!, and Microsoft have agreed upon a standard Robots
Exclusion Protocol with wildcard support. See references below.

Quote:
From [ http://www.google.com/support/webmas...y?answer=40367 ]:


I don't want to list every file that I want to block. Can
I use pattern matching?

Yes, Googlebot interprets some pattern matching. This is an
extension of the standard, so not all bots may follow it.

Matching a sequence of characters using *

You can use an asterisk (*) to match a sequence of
characters. For instance, to block access to all
subdirectories that begin with private, you could use the
following entry:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark
(?), you could use the following entry:

User-agent: *
Disallow: /*?

Matching the end characters of the URL using $

You can use the $ character to specify matching the end of
the URL. For instance, to block any URLs that end with .asp,
you could use the following entry:

User-agent: Googlebot
Disallow: /*.asp$

You can use this pattern matching in combination with the
Allow directive. For instance, if a ? indicates a session
ID, you may want to exclude all URLs that contain them to
ensure Googlebot doesn't crawl duplicate pages. But URLs
that end with a ? may be the version of the page that you do
want included. For this situation, you can set your
robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? line will block any URL that includes a ?
(more specifically, it will block any URL that begins with
your domain name, followed by any string, followed by a
question mark, followed by any string).

The Allow: /*?$ line will allow any URL that ends in a ?
(more specifically, it will allow any URL that begins with
your domain name, followed by a string, followed by a ?,
with no characters after the ?).
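
The matching rules quoted above can be sketched in a few lines of Python. This is only an illustration of the described semantics (`*` matches any sequence of characters, a trailing `$` anchors the end of the URL), not Googlebot's actual implementation; the function names `pattern_to_regex` and `matches` are made up for this example.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard robots.txt path pattern into a compiled regex.

    Assumed semantics, following the quoted description: '*' matches any
    sequence of characters, and a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the path (including any query string) matches the pattern."""
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/private*/", "/private-files/index.html"),
        ("/*?", "/page?id=1"),
        ("/*.asp$", "/shop/item.asp"),
        ("/*.asp$", "/shop/item.asp?x=1"),
        ("/*?$", "/page?"),
    ]:
        print(pattern, path, matches(pattern, path))
```

Note that a URL such as /page? matches both the Allow: /*?$ and the Disallow: /*? patterns. Which rule wins is decided by the crawler, and this sketch does not model that step.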


From the Google Webmaster Central Blog: Improving
on Robots Exclusion Protocol
[ http://googlewebmastercentral.blogsp...-protocol.html ]

From the Official Google Blog: Controlling how
search engines access and index your website
[ http://googleblog.blogspot.com/2007/...es-access.html ]
[ http://googleblog.blogspot.com/2007/...-protocol.html ]

From the Yahoo search blog: One Standard Fits All: Robots Exclusion
Protocol for Yahoo!, Google and Microsoft
[ http://www.ysearchblog.com/archives/000587.html ]

From the Microsoft Live Search Webmaster Center Blog:
Robots Exclusion Protocol: Joining Together to Provide
Better Documentation
[ http://blogs.msdn.com/webmaster/arch...mentation.aspx ]

From Google: How do I create a robots.txt file?
[ http://www.google.com/support/webmas...y?answer=40362 ]

SearchTools.com: About Robots.txt and Search Indexing Robots
[ http://www.searchtools.com/robots/robots-txt.html ]

Wikipedia: Robots.txt
[ http://en.wikipedia.org/wiki/Robots.txt ]

Who invented robots.txt and why is it so brain-dead?
[ http://yro.slashdot.org/comments.pl?...5&cid=21554125 ]

Checklist for Search Robot Crawling and Indexing
[ http://www.searchtools.com/robots/robot-checklist.html ]

Web robots and dynamic content issues
[ http://www.ghita.ro/article/23/web_r...nt_issues.html ]

Appendix B, section B.4.1 of the HTML 4.01 Specification
[ http://www.w3.org/TR/html4/appendix/...html#h-B.4.1.1 ].

A Standard for Robot Exclusion
[ http://www.robotstxt.org/orig.html ]
[ http://www.robotstxt.net/ ]
[ http://www.hirschle.ch/html-kurs/robots/robots.html ]

A Method for Web Robots Control
[ http://www.robotstxt.org/norobots-rfc.txt ]

Proposal: An Extended Standard for Robot Exclusion
[ http://www.conman.org/people/spc/robots2.html ]

Parasites.txt: Addressing The Need for Parasite Inclusion
[ http://www.parasitestxt.org/index.php?page=3 ]

Using Apache to stop bad robots
[ http://evolt.org/article/Using_Apach...126/index.html ]

BotSeer: a search engine of robots.txt files
[ http://botseer.ist.psu.edu/about.html ]
[ http://botseer.ist.psu.edu/ ]
[ http://botseer.ist.psu.edu/stat.jsp ]
[ http://botseer.ist.psu.edu/help.jsp ]

Robotcop: block robots that ignore your robots.txt
[ http://www.robotcop.org/ ]
[ http://www.robotcop.org/details.html ]







Guy Macon <http://www.guymacon.com/>



  #2  
Old   
Nikita the Spider
 
Posts: n/a

Default Re: Guy Macon on the new Google/Yahoo/Microsoft extended ROBOTS.TXT standard - 06-16-2008 , 08:53 PM






In article <O6mdnZUKW4sKxcvVRVn_vwA (AT) giganews (DOT) com>,
Guy Macon <"http://www.guymacon.com/"@-.-> wrote:

Quote:
Col Steve Austin Ret wrote:

kenneth wrote:

Disallow: /*?*
Disallow: /*?

and here I thought robots.txt didn't include wildcards

You thought wrong.

Google, Yahoo!, and Microsoft have agreed upon a standard Robots
Exclusion Protocol with wildcard support. See references below.

Not so fast... just because the biggest search engines have agreed on
something doesn't make it a universal standard. What's described on
robotstxt.org is the closest thing there is to a universal standard, and
that standard does *not* allow for wildcards. That is, * and ? will be
interpreted literally.

Popular libraries like those for Python and Perl still make no allowance
for wildcard extensions:
http://docs.python.org/lib/module-robotparser.html
http://perl.active-venture.com/lib/WWW/RobotRules.html

Anything written using those libraries won't respect wildcards in
robots.txt.
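
A quick way to check this for the Python library is sketched below. It uses the Python 3 name of the module (`urllib.robotparser`; the module linked above is the older `robotparser`), and it assumes the parser still does plain prefix matching, which is how the CPython versions I know of behave. The bot name and example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A rule that a wildcard-aware crawler would read as "block every .asp URL".
ROBOTS_TXT = """\
User-agent: *
Disallow: /*.asp$
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A wildcard-aware crawler would refuse this URL.
wildcard_target = parser.can_fetch("ExampleBot", "http://example.com/page.asp")

# A prefix-only parser blocks just the path that literally starts with "/*.asp$".
literal_target = parser.can_fetch("ExampleBot", "http://example.com/*.asp$")

print("page.asp allowed:", wildcard_target)
print("literal /*.asp$ allowed:", literal_target)
```

If the first result is True and the second is False, the parser is reading * and $ as literal characters rather than as pattern syntax.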

The practical upshot is that one can use wildcards in robots.txt, some
bots will respect them and some will not. I'd argue that neither is
terribly wrong.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more



  #3  
Old   
Guy Macon
 
Posts: n/a

Default Re: Guy Macon on the new Google/Yahoo/Microsoft extended ROBOTS.TXT standard - 06-17-2008 , 12:55 PM






Nikita the Spider wrote:

Quote:
Google, Yahoo!, and Microsoft have agreed upon a standard Robots
Exclusion Protocol with wildcard support. See references below.

...just because the biggest search engines have agreed on
something doesn't make it a universal standard. What's described on
robotstxt.org is the closest thing there is to a universal standard, and
that standard does *not* allow for wildcards. That is, * and ? will be
interpreted literally.

Popular libraries like those for Python and Perl still make no allowance
for wildcard extensions:
http://docs.python.org/lib/module-robotparser.html
http://perl.active-venture.com/lib/WWW/RobotRules.html

Anything written using those libraries won't respect wildcards in
robots.txt.

The practical upshot is that one can use wildcards in robots.txt,
some bots will respect them and some will not. I'd argue that
neither is terribly wrong.

The good news is that every version of robots.txt allows you to
specify which exclusion rules apply to which robots, and thus
you can make an exclusion rule with wildcards for Google, Yahoo,
and Microsoft, and an exclusion rule without wildcards (one that
conforms to Appendix B, section B.4.1 of the HTML 4.01 Spec) for
all other web crawling robots. I would put the sections with the
wildcards at the end so as to minimize the chances of confusing
the other robots.
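
Here is a rough sketch of that layout, checked with Python's standard `urllib.robotparser`. The paths, the bot name SomeOtherBot, and the example.com URLs are invented for illustration. One thing to keep in mind: a robot that finds a section naming it uses that section instead of the `*` section, so the plain rules are repeated in the Googlebot section.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: plain rules for everyone first, wildcard rules last.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /private/
Disallow: /*?
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A robot without its own section falls back to the "*" rules only,
# so the wildcard line never applies to it.
other_private = parser.can_fetch("SomeOtherBot", "http://example.com/private/report.html")
other_query = parser.can_fetch("SomeOtherBot", "http://example.com/page?id=1")

# Googlebot uses its own section, which is why /private/ is listed there too.
google_private = parser.can_fetch("Googlebot", "http://example.com/private/report.html")

print("SomeOtherBot /private/ allowed:", other_private)
print("SomeOtherBot /page?id=1 allowed:", other_query)
print("Googlebot /private/ allowed:", google_private)
```

The standard library parser only shows how the sections are selected per robot. It does not interpret the /*? line the way Google, Yahoo!, and Microsoft do, so it cannot be used to test the wildcard rule itself.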

What we really need is an updated version of robotcop that works
with the latest version of Apache and the latest robot exclusion
standards. I would argue that stopping spambots and other rude
robots is more important than accommodating older robots that do
not follow the new Google/Yahoo/Microsoft standard. Alas, the
robotcop project at www.robotcop.org appears to have been
abandoned by the developers back in 2002.






Guy Macon
<http://www.guymacon.com/>


