#2
Having a large list of books, I wanted to spider data from Amazon (ISBNs, reviews...).

Do you know whether Amazon takes kindly to this behavior? And, just in case, what delay between requests do you think is acceptable?

Thanks for any info!

Alessandro
#4
"alexxx.ma... (AT) gmail (DOT) com" <alexxx.ma... (AT) gmail (DOT) com> wrote in news:df50f25d-6845-4c06-a900-e39801e86... (AT) e23g2000prf (DOT) googlegroups.com:

> Having a large list of books, I wanted to spider data from Amazon (ISBNs, reviews...).
> Do you know whether Amazon takes kindly to this behavior? And, just in case, what delay between requests do you think is acceptable?
> Thanks for any info!
> Alessandro

Why not just run your software and see whether the bot's UA is denied or allowed? There are tons and tons of bots spidering sites that fail to install any protection against spidering.
#5
Thank you for the reply...

The bot is running already, and their robots.txt does not deny it. My problem was finding a way to make it behave correctly and not hammer the site too much (although, Amazon being such a large site, I don't think I'm doing much harm...).

Currently I wait a random time between 3 and 30 seconds between requests - I don't know if that's too high or too low.

Alessandro

On Dec 7, 5:32 am, Don <lostinsp... (AT) 123-universe (DOT) com> wrote:

> Why not just run your software and see whether the bot's UA is denied or allowed? There are tons and tons of bots spidering sites that fail to install any protection against spidering.
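A minimal sketch of the random wait Alessandro describes, for anyone who wants to try it. The thread does not say what language his bot is written in, so Python is an assumption here, and the names `crawl`, `fetch`, and `sleep` are hypothetical: `fetch` stands in for whatever function downloads one page.

```python
import random
import time


def crawl(urls, fetch, low=3.0, high=30.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between requests.

    low/high are the bounds in seconds (3 and 30 are the values mentioned
    in the thread). sleep is injectable so the loop can be exercised
    without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, one pause before every later one.
            sleep(random.uniform(low, high))
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter is only a convenience for trying the loop out: something like `crawl(urls, fetch=my_download)` would use the real `time.sleep`, where `my_download` is again a placeholder for your own function.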
#6
"alexxx.magni (AT) gmail (DOT) com" <alexxx.magni (AT) gmail (DOT) com> wrote in news:66592aa2-8467-4416-b032-f16dd9e17405 (AT) j20g2000hsi (DOT) googlegroups.com:

> The bot is running already, and their robots.txt does not deny it. My problem was finding a way to make it behave correctly and not hammer the site too much (although, Amazon being such a large site, I don't think I'm doing much harm...).
> Currently I wait a random time between 3 and 30 seconds between requests - I don't know if that's too high or too low.
> Alessandro

There's a delay directive that a site can include in its robots.txt:

    User-agent: *
    Crawl-delay: xx

Here's a Yahoo link which provides that explanation:
http://help.yahoo.com/l/us/yahoo/sea.../slurp-03.html
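As a rough illustration of how a bot could honor that directive, here is a sketch using Python's standard `urllib.robotparser` module, whose `crawl_delay()` method returns the `Crawl-delay` value for a user agent or `None` when there is none. The sample robots.txt text, the value 10, the user agent name `ExampleBot`, and the helper `delay_for` are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the thread only shows "Crawl-delay: xx".
SAMPLE = """\
User-agent: *
Crawl-delay: 10
"""


def delay_for(robots_text, user_agent, default=10.0):
    """Return the Crawl-delay (seconds) for user_agent, or default if unset."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

With the sample text, `delay_for(SAMPLE, "ExampleBot")` gives the delay the site asked for; when the file has no `Crawl-delay` line, the caller's own `default` is used instead.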
#7
Where did you find it??

www.amazon.com/robots.txt:

    # Disallow all crawlers access to certain pages.
    User-agent: *
    Disallow: /exec/obidos/account-access-login
    [snip]
    Disallow: /gp/customer-images

Alessandro

Don wrote:

> There's a delay directive that a site can include in its robots.txt:
>
>     User-agent: *
>     Crawl-delay: xx
>
> Here's a Yahoo link which provides that explanation:
> http://help.yahoo.com/l/us/yahoo/sea.../slurp-03.html
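To make the situation concrete: a robots.txt like the excerpt Alessandro quotes can forbid some paths without saying anything about delay, in which case the bot has to pick its own wait. The sketch below, again in Python with the standard `urllib.robotparser`, uses only the two `Disallow` lines quoted in the post (the snipped lines are not reproduced). The user agent `ExampleBookBot`, the helper `plan_request`, the URL `/some/book/page`, and the fallback delay are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in the thread; everything under "snip" is omitted.
EXCERPT = """\
# Disallow all crawlers access to certain pages.
User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /gp/customer-images
"""


def plan_request(parser, user_agent, url, fallback_delay):
    """Return (allowed, delay_seconds) for one URL.

    Uses the site's Crawl-delay when present, otherwise fallback_delay.
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    return allowed, (delay if delay is not None else fallback_delay)


rules = RobotFileParser()
rules.parse(EXCERPT.splitlines())
```

Calling `plan_request(rules, "ExampleBookBot", url, 15)` for a URL under `/gp/customer-images` reports it as not allowed, while a path the excerpt does not mention is allowed with the fallback delay, since the excerpt carries no `Crawl-delay` line.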