#31
Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

> As I said in another post, Can we save discussion of *why* for another time and talk about *how*?

Like I wrote in reply to that other post: we're trying to help you for *free*. But even if you were paying me, I would ask you *why*.

Too often people think they have an X problem, and try to find all kinds of solutions to that, while the real problem is Y. If you have ever helped others on Usenet, you certainly know what I mean.

It sounds to me like you're afraid to educate others (give away your secrets), so you're hiding your robots.txt. If that is indeed the case you *do* have an X -> Y problem.
#32
Perhaps you might survey two versions of robots.txt that are served behind the scenes?
#33
John Bokma <john (AT) castleamber (DOT) com> wrote in news:Xns9A4BB7830D344castleamber (AT) 130 (DOT) 133.1.4:

> Joe Fox <ny152 (AT) none (DOT) invalid> wrote:
>
>> As I said in another post, Can we save discussion of *why* for another time and talk about *how*?
>
> Like I wrote in reply to that other post: we're trying to help you for *free*. But even if you were paying me, I would ask you *why*.

How about that. A voice of reason on UseNet. Nearly an extinct species these days. ;-)

> Too often people think they have an X problem, and try to find all kinds of solutions to that, while the real problem is Y. If you have ever helped others on Usenet, you certainly know what I mean.

NOW I see what you're saying and because you're saying it so reasonably, I'll go into why.

> It sounds to me like you're afraid to educate others (give away your secrets), so you're hiding your robots.txt. If that is indeed the case you *do* have an X -> Y problem.

You tell me. I'm one of the several hundred thousand bloggers that have had their page rank bitchslapped by Google, first to zero PR, then to status "unranked". There's a big problem with this because up until then PR has been a determining factor in the income of these bloggers.

Obviously a lot of people want to get their pagerank back and still be able to do the work they enjoy and get paid for it as before... without Google hitting them with PR zero or unranking them entirely.

Getting advertisers and intermediaries to stop using PR as one of their value assessment metrics is being attempted, but simply put, advertisers want "link juice" and visibility in search engines, and they're never going to stop wanting the links they pay for to be on pages with a certain pagerank.

Andy Beard's blog had an idea recently about using robots.txt to tell Google not to index pages that contain the paid links that they're so upset about. I'm thinking that this gives the best of both worlds: the advertiser gets their paid link that doesn't have rel="nofollow" on it, and Google gets told "don't crawl this page". Google gets to keep their index "pure" by not crawling (and thus indexing) the page with the paid links, and the advertisers get some PR because, while the page won't be crawled, there will still be links to it so that it can pass pagerank (though maybe not as much as otherwise).

The problem is with intermediaries that might decide this is no better than putting rel="nofollow" on the link (which I don't see that it is). The idea is to keep them from being able to read the robots.txt that is being given to Google, should they think of it.

Advertisers want links without rel="nofollow". Google doesn't like paid links unless they have rel="nofollow". There are bloggers who need the money and must find a way to do both at the same time.

Seems to me that this method should work as long as intermediaries don't get the robots.txt being given to Google. Thus the need to ensure that ONLY Google or other search engines get the "real" robots.txt.

Problem is, I'm not a coder. I'm trying to figure out how to do this with .htaccess and it's very slow going. There are seemingly more pitfalls than answers, simply because I do not understand the language. Thus I seek help from those who DO know the language.

Please forgive my attitude earlier.... Real life is .... being a problem. I'm not trying to defraud, I just want to get back to earning a living. I was doing pretty good until the "BitchSlap of '07".
#34
Joe Fox <ny152 (AT) none (DOT) invalid> wrote in news:Xns9A49EC6462EC5891563 (AT) 127 (DOT) 0.0.1:

> I'm using a robots.txt file to control what is and is not crawled by search engine bots, but I'd like to make sure that anything that isn't a known search engine bot doesn't get the file I'm feeding to Google, Yahoo and the others.
>
> From what I've read this could be done with .htaccess but I've not been able to make heads or tails out of that.
>
> I'd really be grateful for some help here.
>
> Thanks

Some tutorials:

http://baremetal.com/gadgets/htaccess/
http://evolt.org/node/226
http://www.edginet.org/techie/website/htaccess.html
http://www.dimi.uniud.it/labs/docume.../Challenger1.2/User/htaccess/htaccess.html
http://www.webhelpinghand.com/htaccess_deny.htm
http://www.javascriptkit.com/howto/htaccess.shtml
http://www.serverwatch.com/tutorials...0825_1127711_1
http://www.verio.com/support/documen...fm?doc_id=3624
#35
You have email John. |
#36
John Bokma <john (AT) castleamber (DOT) com> wrote:

> That being said: there are two ways that might do what you want:
>
> 1 IP address based: you have to find out the IP address ranges of each bot you want to allow.
> 2 UserAgent string based: you have to find out each UA string for each bot you want to allow.
>
> In .htaccess you can redirect internally using either 1 or 2 to the right robots.txt.

Thank you very much for a useful answer. Sorry if I've come off like an ass.
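
For readers who want to see the decision logic behind the two approaches spelled out, here is a minimal Python sketch. It only models the choice "which file does this visitor get"; it is not how Apache does it. The network ranges are documentation placeholders (not real crawler ranges), and the names `ALLOWED_NETWORKS`, `ALLOWED_USER_AGENTS`, `robots_file_by_ip` and `robots_file_by_ua` are made up for illustration.

```python
import ipaddress

# ASSUMPTION: placeholder ranges reserved for documentation (RFC 5737).
# Real crawler ranges would have to be looked up for each bot.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Exact User-Agent strings to allow. The Googlebot string is the one
# quoted in this thread; any others would have to be collected by hand.
ALLOWED_USER_AGENTS = {
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}


def robots_file_by_ip(remote_addr: str) -> str:
    """Method 1: pick the file based on the visitor's IP address."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in network for network in ALLOWED_NETWORKS):
        return "real-robots.txt"
    return "robots.txt"


def robots_file_by_ua(user_agent: str) -> str:
    """Method 2: pick the file based on an exact User-Agent match."""
    if user_agent in ALLOWED_USER_AGENTS:
        return "real-robots.txt"
    return "robots.txt"
```

Note that a User-Agent string is whatever the client chooses to send, so method 2 only keeps out visitors who don't bother to change it.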
#37
John Bokma <john (AT) castleamber (DOT) com> wrote in news:Xns9A4B64E22EB17castleamber (AT) 130 (DOT) 133.1.4:

> Joe Fox <ny152 (AT) none (DOT) invalid> wrote:
>
>> Not really, or is it possible that they could also get my .htaccess? I didn't think that was possible. If they ask for a robots.txt and get one that's got nothing more than a pointer to a sitemap that will satisfy 'em.
>
> Let's assume for argument's sake that those people *want* to see your robots.txt. If you feed Google something different than them, they will notice as soon as they check Google, because if you disallow Google some directories, while your robots.txt says allow, they will wonder why all pages in some directory don't show up in Google, but are available on your site.

You give the majority of the general public too much credit. Comparing a website's robots.txt to Google results!
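
To make the point about differing rules concrete, here is a small Python sketch that takes two robots.txt texts and reports the paths they treat differently. It uses the standard library's `urllib.robotparser`; the helper names `parse_robots` and `differing_paths`, the sample paths, and the example.invalid URL are assumptions for illustration, not anything from the thread.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def differing_paths(public_text, crawler_text, paths, agent="*"):
    """Return the paths that the two robots.txt versions treat differently."""
    public = parse_robots(public_text)
    crawler = parse_robots(crawler_text)
    return [
        path
        for path in paths
        if public.can_fetch(agent, path) != crawler.can_fetch(agent, path)
    ]


# Hypothetical example: the public file only points at a sitemap,
# while the version served to crawlers blocks one directory.
public_version = "Sitemap: http://example.invalid/sitemap.xml\n"
crawler_version = "User-agent: *\nDisallow: /paid/\n"
blocked_only_for_crawlers = differing_paths(
    public_version, crawler_version, ["/paid/post.html", "/blog/post.html"]
)
```

Someone without access to the crawler version can't run this directly, but the same mismatch shows up indirectly when a directory that the public file allows never appears in search results.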
#38
It's done ALL the time. What matters is that it's done for an appropriate reason and is accomplished server side.
#39
Don <lostinspace (AT) 123-universe (DOT) com> wrote in news:Xns9A4BD360F498Clostinspace123univer (AT) 207 (DOT) 115.33.102:

> Joe Fox <ny152 (AT) none (DOT) invalid> wrote in news:Xns9A49EC6462EC5891563 (AT) 127 (DOT) 0.0.1:
>
>> I'm using a robots.txt file to control what is and is not crawled by search engine bots, but I'd like to make sure that anything that isn't a known search engine bot doesn't get the file I'm feeding to Google, Yahoo and the others.
>>
>> From what I've read this could be done with .htaccess but I've not been able to make heads or tails out of that.
>>
>> I'd really be grateful for some help here.
>>
>> Thanks
>
> Some tutorials:
>
> http://baremetal.com/gadgets/htaccess/
> http://evolt.org/node/226
> http://www.edginet.org/techie/website/htaccess.html
> http://www.dimi.uniud.it/labs/docume.../Challenger1.2/User/htaccess/htaccess.html
> http://www.webhelpinghand.com/htaccess_deny.htm
> http://www.javascriptkit.com/howto/htaccess.shtml
> http://www.serverwatch.com/tutorials...0825_1127711_1
> http://www.verio.com/support/documen...fm?doc_id=3624

Some of those are familiar but I'll take a look at 'em anyway. My big problem is I'm not a coder. Simple stuff I can handle, but figuring out docs and help files takes forever.
#40
Joe Fox <ny152 (AT) none (DOT) invalid> wrote:

> John Bokma <john (AT) castleamber (DOT) com> wrote in
> [..]
>
>> That being said: there are two ways that might do what you want:
>>
>> 1 IP address based: you have to find out the IP address ranges of each bot you want to allow.
>> 2 UserAgent string based: you have to find out each UA string for each bot you want to allow.
>>
>> In .htaccess you can redirect internally using either 1 or 2 to the right robots.txt.
>
> Thank you very much for a useful answer. Sorry if I've come off like an ass.

Thanks, no problem. Like I said, a lot of people on Usenet think they have an X problem, while the real one is Y, so people often assume this is the case.

I also still can't see why you want to do this, but like you wrote, it's your server:

method 1: if you miss out spiders, you might lose traffic.
          hard to test (it can be done, with 2 computers + router)

method 2: if you miss out spiders, you might lose traffic.
          easy to test: you can either write a Perl program that changes the UA for each request, or check manually with Firefox + UA switcher add-on

Untested:

    RewriteEngine On
    # Quote each pattern: real UA strings contain spaces.
    # The last condition must not carry [OR].
    RewriteCond %{HTTP_USER_AGENT} "=UA1" [OR]
    RewriteCond %{HTTP_USER_AGENT} "=UA2" [OR]
    RewriteCond %{HTTP_USER_AGENT} "=UA3"
    RewriteRule ^robots\.txt$ real-robots.txt [L]

with UA1..UAn the *exact* UA plain string, e.g.

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

See: http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
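
The "program that changes the UA for each request" idea for testing method 2 could look roughly like this in Python instead of Perl. This is a sketch, not a finished tool: `USER_AGENTS`, `build_request`, `fetch_robots` and `compare` are hypothetical names, only the Googlebot string comes from the thread, and the second UA string is invented.

```python
import urllib.request

# ASSUMPTION: strings to try. The first is the Googlebot UA quoted above;
# the second is a made-up stand-in for an ordinary visitor.
USER_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "ExampleBrowser/1.0",
]


def build_request(url, user_agent):
    """Return a GET request for url that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_robots(url, user_agent, timeout=10.0):
    """Fetch url with the given User-Agent and return the body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def compare(url):
    """Map each User-Agent to the robots.txt body the server returned for it."""
    return {ua: fetch_robots(url, ua) for ua in USER_AGENTS}
```

Calling `compare()` with the address of your own robots.txt should return two different bodies once the rewrite rules are in place, and identical ones if the conditions are not matching.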