I run a website with about 4 million pages (http://cxp.paterra.com).
Although the Google spiders are very active and have been pulling
100,000+ pages per day for the last 3 months, few pages show up on
Google (see http://www.paterra.com/GoogleVsAskVsBaidu.pdf). Google
indexing of this site essentially collapsed in January 2005 when the
number of pages was increased from about 1 million to 4 million.
AskJeeves, on the other hand, indexes 95% of the site.

My current working hypothesis as to why these pages don't show up on
Google centers on Google's repeated pulling of pages to test
stability and refresh its indexes. Suppose Google has to be able to
pull the same page twice over a two-week period before it posts the
page to the index. Suppose also that Google has a maximum pull rate
per site, and that it expires pages after a month. With more than
4 million pages, Google cannot do repeat pulls fast enough to keep the
pulled pages in the index.

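To put rough numbers on this hypothesis, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption taken from the guesses above: two pulls are needed per page, pages expire after 30 days, and the observed 100,000 pulls per day is treated as the per-site ceiling. The function names are made up for illustration, and none of this reflects how Google is known to work.

```python
def max_sustainable_pages(pulls_per_day, pulls_required, expiry_days):
    """Upper bound on pages that could stay indexed if every page needs
    `pulls_required` pulls within each `expiry_days` window (hypothetical rule)."""
    return pulls_per_day * expiry_days // pulls_required


def days_for_full_pass(total_pages, pulls_per_day):
    """Days needed to pull every page once at the assumed maximum rate."""
    return total_pages / pulls_per_day


# Figures from the post; the rule parameters are guesses, not known Google behavior.
PULLS_PER_DAY = 100_000    # observed spider activity, assumed to be the cap
PULLS_REQUIRED = 2         # assumed: two successful pulls before a page is indexed
EXPIRY_DAYS = 30           # assumed: pages drop out of the index after a month
TOTAL_PAGES = 4_000_000

if __name__ == "__main__":
    print(max_sustainable_pages(PULLS_PER_DAY, PULLS_REQUIRED, EXPIRY_DAYS))  # 1500000
    print(days_for_full_pass(TOTAL_PAGES, PULLS_PER_DAY))                     # 40.0
```

Under these assumptions the ceiling works out to about 1.5 million pages, which would be enough for the old 1 million page site but not for 4 million. A single pass over all 4 million pages would also take 40 days, longer than both the assumed two-week repeat window and the assumed one-month expiry.
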
Does this make sense to anyone intimately familiar with Google
indexing? If this hypothesis is correct, is there a way to get Google
to ease the repeatability requirements?