HighDots Forums  

spidering Amazon website

Search Engine Optimization forum (alt.internet.search-engines)


  #1  
alexxx.magni@gmail.com
spidering Amazon website - 12-06-2007, 10:03 AM

I have a large list of books and want to spider data for them from
Amazon (ISBNs, reviews...).
Do you know whether Amazon tolerates this kind of behavior?
And, just in case, what delay between requests do you think is
acceptable?


Thanks for any info!

Alessandro


  #2  
Big Bill
Re: spidering Amazon website - 12-06-2007, 02:31 PM

I'm thinking you might have got this wrong somehow... what is it you
want to do exactly?

BB
--

http://www.kruse.co.uk/
http://www.fat-odin.com/
http://www.here-be-posters.co.uk/


  #3  
Don
Re: spidering Amazon website - 12-06-2007, 11:32 PM

Why not just run your software and see if the bot's user agent (UA) is
denied or allowed?

There are tons and tons of bots spidering sites, and many of the
spidered sites fail to install any protection against spidering.
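
One way to try this suggestion is to send a single request with the bot's User-Agent header and look at the HTTP status that comes back. The following is only a minimal Python sketch, not something taken from the thread: the function names, the user-agent string in the usage note, and the choice of 403, 429 and 503 as "probably denied" statuses are all assumptions for illustration.

```python
import urllib.error
import urllib.request

# Assumption: these statuses are treated as signs that the bot is being
# refused or throttled. Real sites may signal a block differently.
BLOCK_STATUSES = frozenset({403, 429, 503})


def looks_blocked(status: int) -> bool:
    """Return True if the HTTP status suggests the request was refused."""
    return status in BLOCK_STATUSES


def probe_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch url once with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still useful.
        return err.code
```

For example, `looks_blocked(probe_status("https://example.com/", "my-book-bot/0.1"))` would report whether that hypothetical bot name gets refused. A 200 only shows that this one request went through, not that the site welcomes the crawl.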



  #4  
alexxx.magni@gmail.com
Re: spidering Amazon website - 12-07-2007, 03:54 AM

Thank you for the reply...
The bot is already running, and their robots.txt does not deny it.
My problem was how to make it behave correctly and not hammer the
site too much (although, with a site as large as Amazon, I don't
think I'm doing much harm...).
Currently I wait a random time between 3 and 30 seconds between
requests - I don't know if that's too high or too low.


Alessandro
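
The random wait described in this post could look roughly like the Python sketch below. The 3 and 30 second bounds come from the post; everything else (the function names, the injectable `fetch` and `sleep` callables, the uniform distribution) is an assumption made so the idea is easy to try, not the poster's actual bot.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional

MIN_DELAY_S = 3.0   # lower bound mentioned in the post
MAX_DELAY_S = 30.0  # upper bound mentioned in the post


def random_delay(rng: Optional[random.Random] = None) -> float:
    """Pick a delay in seconds, uniformly between the two bounds."""
    source = rng if rng is not None else random
    return source.uniform(MIN_DELAY_S, MAX_DELAY_S)


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    sleep: Callable[[float], object] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Iterator[object]:
    """Yield fetch(url) for each URL, pausing a random time between requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between requests.
            sleep(random_delay(rng))
        yield fetch(url)
```

Passing `sleep` and `rng` as parameters is only a convenience: it lets the loop be exercised without actually waiting, and makes the delays reproducible when a seeded `random.Random` is supplied.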


  #5  
Don
Re: spidering Amazon website - 12-07-2007, 08:26 AM

robots.txt can include a crawl-delay directive:

User-agent: *
Crawl-delay: xx

Here's a Yahoo link that explains it:
http://help.yahoo.com/l/us/yahoo/sea.../slurp-03.html
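
A crawler can read this directive itself. The sketch below uses Python's standard `urllib.robotparser`, whose `crawl_delay()` method returns `None` when no delay applies. The helper names, the sample value of 10 seconds in the usage note (the post only says `xx`) and the fallback value are assumptions for illustration. Note also that Crawl-delay is a non-standard extension that only some crawlers honor.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def crawl_delay_from(robots_txt: str, user_agent: str = "*") -> Optional[float]:
    """Return the Crawl-delay that applies to user_agent, or None if absent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return None if delay is None else float(delay)


def seconds_to_wait(robots_txt: str, fallback: float, user_agent: str = "*") -> float:
    """Use the site's Crawl-delay when there is one, else the caller's fallback."""
    delay = crawl_delay_from(robots_txt, user_agent)
    return fallback if delay is None else delay
```

With the text `"User-agent: *\nCrawl-delay: 10\n"`, `seconds_to_wait(text, fallback=5.0)` gives the site's own value; with a robots.txt that has no Crawl-delay line it gives the fallback.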


  #6  
alexxx.magni@gmail.com
Re: spidering Amazon website - 12-07-2007, 09:25 AM

Where did you find it?
This is www.amazon.com/robots.txt:

# Disallow all crawlers access to certain pages.

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/customer-media
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/gurupamacro
Disallow: /gp/vote
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images


Alessandro
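
Rules like the ones quoted above can be checked in code before each request. This is a minimal Python sketch using the standard `urllib.robotparser`; it copies only three of the quoted `Disallow` lines, and the helper name and the book-page path in the usage note are hypothetical, not real Amazon URLs.

```python
from urllib.robotparser import RobotFileParser

# A few of the rules quoted in the post (not the full file).
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /gp/cart
Disallow: /gp/sign-in
Disallow: /gp/vote
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `is_allowed(ROBOTS_EXCERPT, "http://www.amazon.com/gp/cart")` is refused because the path starts with a disallowed prefix, while a path that matches none of the rules, such as the made-up `/hypothetical-book-page`, is allowed.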

  #7  
Don
Re: spidering Amazon website - 12-07-2007, 09:37 AM

I was merely suggesting that, since their robots.txt has no crawl
delay, the request rate is apparently not an issue for their website.


