#1
kenneth wrote:
    Disallow: /*?*
    Disallow: /*?
    and here I thought robots.txt didn't include wildcards

From [ http://www.google.com/support/webmas...y?answer=40367 ]:

I don't want to list every file that I want to block. Can I use pattern matching?

Yes, Googlebot interprets some pattern matching. This is an extension of the standard, so not all bots may follow it.

Matching a sequence of characters using *

You can use an asterisk (*) to match a sequence of characters. For instance, to block access to all subdirectories that begin with private, you could use the following entry:

    User-agent: Googlebot
    Disallow: /private*/

To block access to all URLs that include a question mark (?), you could use the following entry:

    User-agent: *
    Disallow: /*?

Matching the end characters of the URL using $

You can use the $ character to specify matching the end of the URL. For instance, to block any URLs that end with .asp, you could use the following entry:

    User-agent: Googlebot
    Disallow: /*.asp$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

    User-agent: *
    Allow: /*?$
    Disallow: /*?

The Disallow: /*? line will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ line will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

From the Google Webmaster Central Blog: Improving on Robots Exclusion Protocol
From the Official Google Blog: Controlling how search engines access and index your website
From the Yahoo search blog: One Standard Fits All: Robots Exclusion Protocol for Yahoo!, Google and Microsoft
From the Microsoft Live Search Webmaster Center Blog: Robots Exclusion Protocol: Joining Together to Provide
From Google: How do I create a robots.txt file? [ http://www.google.com/support/webmas...y?answer=40362 ]
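For anyone who wants to see how the * and $ rules quoted above behave on concrete paths, here is a rough Python sketch. It is not Googlebot's actual code. The function names `pattern_to_regex` and `is_allowed` are made up for illustration, and the tie-breaking rule (longest matching pattern wins, Allow wins a tie) is an assumption that the quoted help text does not spell out.

```python
import re

def pattern_to_regex(pattern):
    """Translate a wildcard path pattern into a compiled regex.

    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL, and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) pairs, e.g. ("Disallow", "/*?").
    Assumption: the longest matching pattern wins and Allow wins ties.
    A path that matches no rule is allowed.
    """
    best_length = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern_to_regex(pattern).match(path):
            continue
        is_allow = directive.lower() == "allow"
        if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
            best_length = len(pattern)
            allowed = is_allow
    return allowed

# The session-ID example from the quoted help page.
rules = [("Allow", "/*?$"), ("Disallow", "/*?")]
for path in ("/page", "/page?", "/page?id=1"):
    print(path, "->", "allowed" if is_allowed(path, rules) else "blocked")
```

With those two rules, a path ending in a bare ? is allowed, a path with anything after the ? is blocked, and a path with no ? at all is untouched by either rule.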
#2
Col Steve Austin Ret wrote:
    kenneth wrote:
        Disallow: /*?*
        Disallow: /*?
        and here I thought robots.txt didn't include wildcards

    You thought wrong. Google, Yahoo!, and Microsoft have agreed upon a standard Robots Exclusion Protocol with wildcard support. See references below.
#3
    Google, Yahoo!, and Microsoft have agreed upon a standard Robots Exclusion Protocol with wildcard support. See references below.

...just because the biggest search engines have agreed on something doesn't make it a universal standard. What's described on robotstxt.org is the closest thing there is to a universal standard, and that standard does *not* allow for wildcards. That is, * and ? will be interpreted literally.

Popular libraries like those for Python and Perl still make no allowance for wildcard extensions:

http://docs.python.org/lib/module-robotparser.html
http://perl.active-venture.com/lib/WWW/RobotRules.html

Anything written using those libraries won't respect wildcards in robots.txt.

The practical upshot is that one can use wildcards in robots.txt, but some bots will respect them and some will not. I'd argue that neither is terribly wrong.
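To make the difference concrete, here is a small Python sketch of the two readings of the same Disallow path. Both functions are hypothetical helpers written for this post, not code from any of the libraries mentioned: `literal_prefix_match` treats the rule as a plain prefix, as the original robotstxt.org convention does, and `wildcard_match` treats * as "any run of characters", as the search-engine extension does.

```python
import re

def literal_prefix_match(rule_path, url_path):
    # Original convention: the rule is a plain string prefix,
    # so '*' and '?' carry no special meaning.
    return url_path.startswith(rule_path)

def wildcard_match(rule_path, url_path):
    # Extension: '*' stands for any run of characters; the rest is literal.
    # (The '$' end anchor is left out to keep the sketch short.)
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    return re.match("^" + body, url_path) is not None

rule = "/*?"
for path in ("/page?id=1", "/page", "/*?literal"):
    print(path,
          "| literal reading blocks:", literal_prefix_match(rule, path),
          "| wildcard reading blocks:", wildcard_match(rule, path))
```

A bot using the literal reading would only block paths that really start with the characters /*?, so /page?id=1 gets crawled anyway, while a bot using the wildcard reading blocks it.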
