HighDots Forums  

blocking robots.txt from non-robots

Search Engine Optimization Discussion about SEO/Search Engine Optimization (alt.internet.search-engines)





#31
Joe Fox
Re: blocking robots.txt from non-robots - 02-21-2008, 09:00 PM






John Bokma <john (AT) castleamber (DOT) com> wrote in
news:Xns9A4BB7830D344castleamber (AT) 130 (DOT) 133.1.4:

Quote:
Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

As I said in another post, Can we save discussion of *why* for
another time and talk about *how*?

Like I wrote in reply to that other post: we're trying to help you for
*free*. But even if you were paying me, I would ask you *why*.

How about that. A voice of reason on Usenet. Nearly an extinct species
these days. ;-)

Quote:
Too often people think they have an X problem, and try to find all
kinds of solutions to that, while the real problem is Y. If you ever
have been helping others on Usenet, you certainly know what I mean.

NOW I see what you're saying and because you're saying it so reasonably,
I'll go into why.

Quote:
It sounds to me like you're afraid to educate others (give away your
secrets) by hiding your robots.txt. If that is indeed the case you
*do* have an X -> Y problem.

You tell me. I'm one of the several hundred thousand bloggers that have
had their PageRank bitchslapped by Google, first to PR zero, then to
status "unranked".

There's a big problem with this because up until then PR had been a
determining factor in the income of these bloggers. Obviously a lot of
people want to get PageRank back and still be able to do the work
they enjoy and get paid for it as before... without Google hitting them
with PR zero or unranking them entirely. Getting advertisers and
intermediaries to stop using PR as one of their value assessment metrics
is being attempted, but simply put, advertisers want "link juice" and
visibility in search engines and they're never going to stop wanting the
links they pay for on pages with a certain PageRank.

Andy Beard's blog had an idea recently about using robots.txt to tell
Google not to index pages that contain the paid links that they're so
upset about. I'm thinking that this gives the best of both worlds: the
advertiser gets their paid link that doesn't have rel="nofollow" on it
and Google gets told "don't crawl this page".

Google gets to keep their index "pure" by not crawling (and thus
indexing) the page with the paid links, and the advertisers get some PR
because while the page won't be crawled, there will still be links to it
so that it can pass PageRank (though maybe not as much as otherwise).

The problem is with intermediaries that might decide this is no better
than putting rel="nofollow" on the link (which I don't see that it is).
The idea is to keep them from being able to read the robots.txt that is
being given to Google should they think of it.

Advertiser wants links without rel="nofollow"
Google doesn't like paid links unless they have rel="nofollow".

There are bloggers who need the money and must find a way to do both at
the same time. Seems to me that this method should work as long as
intermediaries don't get the robots.txt being given to Google. Thus the
need to ensure that ONLY Google or other search engines get the "real"
robots.txt.

Problem is, I'm not a coder. I'm trying to figure out how to do this
with .htaccess and it's very slow going. There are seemingly more
pitfalls than answers simply because I do not understand the language.
Thus I seek help from those who do know the language.

Please forgive my attitude earlier... Real life is being a problem.

I'm not trying to defraud, I just want to get back to earning a living.
I was doing pretty good until the "BitchSlap of '07".


#32
Don
Re: blocking robots.txt from non-robots - 02-21-2008, 09:21 PM






Don <lostinspace (AT) 123-universe (DOT) com> wrote in
news:Xns9A4BD3F6022E7lostinspace123univer (AT) 207 (DOT) 115.33.102:


Quote:
Perhaps you might survey two versions of robots.txt that are served
behind the scenes?


Here's an example of the method:
http://www.webmasterworld.com/apache...htm#msg3561497


#33
Don
Re: blocking robots.txt from non-robots - 02-21-2008, 09:24 PM



Joe Fox <ny152 (AT) none (DOT) invalid> wrote in news:Xns9A4BCB831B73891563@
127.0.0.1:

Quote:
John Bokma <john (AT) castleamber (DOT) com> wrote in
news:Xns9A4BB7830D344castleamber (AT) 130 (DOT) 133.1.4:

Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

As I said in another post, Can we save discussion of *why* for
another time and talk about *how*?

Like I wrote in reply to that other post: we're trying to help you for
*free*. But even if you were paying me, I would ask you *why*.

How about that. A voice of reason on Usenet. Nearly an extinct species
these days. ;-)

Too often people think they have an X problem, and try to find all
kinds of solutions to that, while the real problem is Y. If you ever
have been helping others on Usenet, you certainly know what I mean.

NOW I see what you're saying and because you're saying it so reasonably,
I'll go into why.

It sounds to me like you're afraid to educate others (give away your
secrets) by hiding your robots.txt. If that is indeed the case you
*do* have an X -> Y problem.

You tell me. I'm one of the several hundred thousand bloggers that have
had their PageRank bitchslapped by Google, first to PR zero, then to
status "unranked".

There's a big problem with this because up until then PR had been a
determining factor in the income of these bloggers. Obviously a lot of
people want to get PageRank back and still be able to do the work
they enjoy and get paid for it as before... without Google hitting them
with PR zero or unranking them entirely. Getting advertisers and
intermediaries to stop using PR as one of their value assessment metrics
is being attempted, but simply put, advertisers want "link juice" and
visibility in search engines and they're never going to stop wanting the
links they pay for on pages with a certain PageRank.

Andy Beard's blog had an idea recently about using robots.txt to tell
Google not to index pages that contain the paid links that they're so
upset about. I'm thinking that this gives the best of both worlds: the
advertiser gets their paid link that doesn't have rel="nofollow" on it
and Google gets told "don't crawl this page".

Google gets to keep their index "pure" by not crawling (and thus
indexing) the page with the paid links, and the advertisers get some PR
because while the page won't be crawled, there will still be links to it
so that it can pass PageRank (though maybe not as much as otherwise).

The problem is with intermediaries that might decide this is no better
than putting rel="nofollow" on the link (which I don't see that it is).
The idea is to keep them from being able to read the robots.txt that is
being given to Google should they think of it.

Advertiser wants links without rel="nofollow"
Google doesn't like paid links unless they have rel="nofollow".

There are bloggers who need the money and must find a way to do both at
the same time. Seems to me that this method should work as long as
intermediaries don't get the robots.txt being given to Google. Thus the
need to ensure that ONLY Google or other search engines get the "real"
robots.txt.

Problem is, I'm not a coder. I'm trying to figure out how to do this
with .htaccess and it's very slow going. There are seemingly more
pitfalls than answers simply because I do not understand the language.
Thus I seek help from those who Do know the language.

Please forgive my attitude earlier... Real life is being a problem.

I'm not trying to defraud, I just want to get back to earning a living.
I was doing pretty good until the "BitchSlap of '07".

There are hordes of bloggers attempting to grasp htaccess and
rewrites solely to benefit their existing pages and retain their rankings.

The WordPress tutorial examples seem to hold many misconceptions about
htaccess.


#34
Joe Fox
Re: blocking robots.txt from non-robots - 02-21-2008, 09:33 PM



Don <lostinspace (AT) 123-universe (DOT) com> wrote in
news:Xns9A4BD360F498Clostinspace123univer (AT) 207 (DOT) 115.33.102:

Quote:
Joe Fox <ny152 (AT) none (DOT) invalid> wrote in
news:Xns9A49EC6462EC5891563 (AT) 127 (DOT) 0.0.1:


I'm using a robots.txt file to control what is and is not crawled by
search engine bots, but I'd like to make sure that anything that isn't a
known search engine bot doesn't get the file I'm feeding to Google, Yahoo
and the others.

From what I've read this could be done with .htaccess but I've not been
able to make heads or tails out of that.

I'd really be grateful for some help here.

Thanks

Some tutorials
http://baremetal.com/gadgets/htaccess/
http://evolt.org/node/226
http://www.edginet.org/techie/website/htaccess.html
http://www.dimi.uniud.it/labs/docume.../Challenger1.2/User/htaccess/htaccess.html
http://www.webhelpinghand.com/htaccess_deny.htm
http://www.javascriptkit.com/howto/htaccess.shtml
http://www.serverwatch.com/tutorials...0825_1127711_1
http://www.verio.com/support/documen...fm?doc_id=3624

Some of those are familiar but I'll take a look at 'em anyway.

My big problem is I'm not a coder. Simple stuff I can handle, but
figuring out docs and help files takes forever.


#35
John Bokma
Re: blocking robots.txt from non-robots - 02-21-2008, 09:53 PM



Paul <noone (AT) houstoncrafts (DOT) com> wrote:

Quote:
You have email, John.

Thanks Paul, looking into it (got the Gecko one as well, haven't had time
to check it out, thanks).


--
John Bokma http://johnbokma.com/


#36
John Bokma
Re: blocking robots.txt from non-robots - 02-21-2008, 10:03 PM



Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

Quote:
John Bokma <john (AT) castleamber (DOT) com> wrote in
[..]

Quote:
That being said: there are two ways that might do what you want:

1. IP address based: you have to find out the IP address ranges of
   each bot you want to allow.
2. User-Agent string based: you have to find out the UA string of
   each bot you want to allow.

In .htaccess you can redirect internally using either 1 or 2 to the
right robots.txt.

Thank you very much for a useful answer.

Sorry if I've come off like an ass.

Thanks, no problem.

Like I said, a lot of people on Usenet think they have an X problem, while
the real one is Y, so people often assume this is the case.

I also still can't see why you want to do this, but like you wrote, it's
your server:

method 1: if you miss out spiders, you might lose traffic.
          hard to test (it can be done, with 2 computers + router)
method 2: if you miss out spiders, you might lose traffic.
          easy to test: you can either write a Perl program
          that changes the UA for each request, or check
          manually with Firefox + UA switcher add-on


Untested:

RewriteCond %{HTTP_USER_AGENT} =UA1 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA3 [OR]
RewriteRule ^robots.txt$ real-robots.txt [L]

with UA1..UAn the *exact* UA plain string, e.g.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

See: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
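
For method 1, the same idea might be sketched with REMOTE_ADDR instead of the User-Agent. Also untested, and only a sketch: the two address prefixes are documentation-only placeholders (not real crawler ranges), and it assumes mod_rewrite is enabled and that real-robots.txt sits in the same directory as the .htaccess.

```apache
RewriteEngine On

# Placeholder prefixes from the reserved documentation ranges; replace them
# with the address ranges you have verified for each bot you want to allow.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\. [OR]
RewriteCond %{REMOTE_ADDR} ^198\.51\.100\.

# Requests from those addresses are served real-robots.txt internally;
# everyone else gets the ordinary robots.txt.
RewriteRule ^robots\.txt$ real-robots.txt [L]
```

Every crawler range missing from the list gets the public file, which is the "miss out spiders" risk mentioned above.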

--
John Bokma http://johnbokma.com/


#37
John Bokma
Re: blocking robots.txt from non-robots - 02-21-2008, 10:06 PM



Don <lostinspace (AT) 123-universe (DOT) com> wrote:

Quote:
John Bokma <john (AT) castleamber (DOT) com> wrote in
news:Xns9A4B64E22EB17castleamber (AT) 130 (DOT) 133.1.4:

Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

Not really, or is it possible that they could also get my .htaccess?
I didn't think that was possible. If they ask for a robots.txt and
get one that's got nothing more than a pointer to a sitemap that will
satisfy 'em.

Let's assume for argument's sake that those people *want* to see your
robots.txt. If you feed Google something different from what you feed
them, they will notice as soon as they check Google, because if you
disallow Google some directories, while your robots.txt says allow, they
will wonder why all pages in some directory don't show up in Google, but
are available on your site.

You give the majority of the general public too much credit.
Comparing a website's robots.txt to Google results!

People who are interested in robots.txt certainly do *not* fall into what
you call the "general public". Furthermore, I have no doubt that most
people who *do* know to request robots.txt and need to be stopped from
seeing the one that the OP wants to feed to SEs have at least some basic
knowledge about robots.txt and how to use Google.

--
John Bokma http://johnbokma.com/


#38
John Bokma
Re: blocking robots.txt from non-robots - 02-21-2008, 10:10 PM



Don <lostinspace (AT) 123-universe (DOT) com> wrote:

Quote:
It's done ALL the time.
What matters is that it's done for an appropriate reason and is
accomplished server side.

To me, what matters is that a user doesn't click on a search result and
land on a page that doesn't make the expected data available.

And yes, even that does seem to be allowed by Google. There is a forum
that uses JS cloaking. Can't find a quick example, will post when I bump
into it again. The workaround is either to turn JS off or to click on the
cached link. But it's a sad practice and even sadder that Google allows it.

webmasterworld (IIRC) did use cloaking, maybe still does. I've reported
this several times to Google, but to no avail (unless it has been fixed).

--
John Bokma http://johnbokma.com/


#39
Don
Re: blocking robots.txt from non-robots - 02-21-2008, 10:10 PM



Joe Fox <ny152 (AT) none (DOT) invalid> wrote in news:Xns9A4BD1118343D891563@
127.0.0.1:

Quote:
Don <lostinspace (AT) 123-universe (DOT) com> wrote in
news:Xns9A4BD360F498Clostinspace123univer (AT) 207 (DOT) 115.33.102:

Joe Fox <ny152 (AT) none (DOT) invalid> wrote in
news:Xns9A49EC6462EC5891563 (AT) 127 (DOT) 0.0.1:


I'm using a robots.txt file to control what is and is not crawled by
search engine bots, but I'd like to make sure that anything that isn't a
known search engine bot doesn't get the file I'm feeding to Google, Yahoo
and the others.

From what I've read this could be done with .htaccess but I've not been
able to make heads or tails out of that.

I'd really be grateful for some help here.

Thanks

Some tutorials
http://baremetal.com/gadgets/htaccess/
http://evolt.org/node/226
http://www.edginet.org/techie/website/htaccess.html

http://www.dimi.uniud.it/labs/docume.../Challenger1.2/User/htaccess/htaccess.html
http://www.webhelpinghand.com/htaccess_deny.htm
http://www.javascriptkit.com/howto/htaccess.shtml
http://www.serverwatch.com/tutorials...0825_1127711_1
http://www.verio.com/support/documen...fm?doc_id=3624


Some of those are familiar but I'll take a look at 'em anyway.

My big problem is I'm not a coder. Simple stuff I can handle, but
figuring out docs and help files takes forever.

Joe,
There are more beneficial forums for htaccess and Apache.
The Apache Server forum at Webmaster World is excellent and the moderator
makes a superb effort to assist far too many people.

The Search Engine Spider ID forum was the predecessor to the Apache forum
as far as htaccess coding goes.

Registration is free at most forums.

I may be able to assist you; however, my extensive use of htaccess has
been limited to the "KISS" approach.
When it comes to simulated wildcards and complicated expressions, I'm
daft!


#40
Don
Re: blocking robots.txt from non-robots - 02-21-2008, 10:13 PM



John Bokma <john (AT) castleamber (DOT) com> wrote in
news:Xns9A4BD629B7E64castleamber (AT) 130 (DOT) 133.1.4:

Quote:
Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

John Bokma <john (AT) castleamber (DOT) com> wrote in

[..]

That being said: there are two ways that might do what you want:

1. IP address based: you have to find out the IP address ranges of
   each bot you want to allow.
2. User-Agent string based: you have to find out the UA string of
   each bot you want to allow.

In .htaccess you can redirect internally using either 1 or 2 to the
right robots.txt.

Thank you very much for a useful answer.

Sorry if I've come off like an ass.

Thanks, no problem.

Like I said, a lot of people on Usenet think they have an X problem,
while the real one is Y, so people often assume this is the case.

I also still can't see why you want to do this, but like you wrote,
it's your server:

method 1: if you miss out spiders, you might lose traffic.
          hard to test (it can be done, with 2 computers + router)
method 2: if you miss out spiders, you might lose traffic.
          easy to test: you can either write a Perl program
          that changes the UA for each request, or check
          manually with Firefox + UA switcher add-on


Untested:

RewriteCond %{HTTP_USER_AGENT} =UA1 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA3 [OR]
RewriteRule ^robots.txt$ real-robots.txt [L]

with UA1..UAn the *exact* UA plain string, e.g.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

See: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html

John,
Just a heads up (not critique).
The last "[OR]" is invalid.
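
With that trailing flag dropped, the rule set might look something like the sketch below. It is still untested and rests on a few assumptions that are not in the original: mod_rewrite is enabled, real-robots.txt sits in the same directory as the .htaccess, and UA2 and UA3 remain placeholders for the other exact strings. The quotes around the Googlebot string are added because the pattern contains spaces, and the dot in robots.txt is escaped so it only matches a literal dot.

```apache
RewriteEngine On

# "=" asks for an exact string match instead of a regular expression.
# Conditions are chained with [OR]; the final one carries no flag.
RewriteCond %{HTTP_USER_AGENT} "=Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" [OR]
RewriteCond %{HTTP_USER_AGENT} =UA2 [OR]
RewriteCond %{HTTP_USER_AGENT} =UA3

# Matching bots are served real-robots.txt internally (no redirect is sent);
# every other client gets the ordinary robots.txt.
RewriteRule ^robots\.txt$ real-robots.txt [L]
```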

