
Basic Concepts of Web Crawlers

Posted on 2010-06-11 09:48 by xuczhang

Difficulties

Three characteristics of the Web make crawling very difficult:

  • its large volume,
  • its fast rate of change, and
  • dynamic page generation.

The behavior of a Web crawler is the outcome of a combination of policies:

  • a selection policy that states which pages to download,
  • a re-visit policy that states when to check for changes to the pages,
  • a politeness policy that states how to avoid overloading Web sites, and
  • a parallelization policy that states how to coordinate distributed Web crawlers.
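As an illustration of one of these policies, the following is a minimal sketch of a politeness policy that enforces a minimum delay between requests to the same host. The class name `PolitenessPolicy`, its methods, and the `min_delay_seconds` parameter are hypothetical names chosen for this example, not part of any particular crawler. Time is passed in explicitly so the logic can be checked without sleeping.

```python
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Hypothetical helper: space out requests to the same host."""

    def __init__(self, min_delay_seconds=1.0):
        # Assumed default of one second; a real crawler would tune this per site.
        self.min_delay_seconds = min_delay_seconds
        self._last_fetch = {}  # host -> time of the most recent fetch

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching url at time now."""
        host = urlsplit(url).netloc
        last = self._last_fetch.get(host)
        if last is None:
            return 0.0  # this host has not been contacted yet
        return max(0.0, self.min_delay_seconds - (now - last))

    def record_fetch(self, url, now):
        """Remember that url's host was contacted at time now."""
        self._last_fetch[urlsplit(url).netloc] = now


policy = PolitenessPolicy(min_delay_seconds=2.0)
policy.record_fetch("http://example.com/a", now=10.0)
same_host_wait = policy.wait_time("http://example.com/b", now=10.5)
other_host_wait = policy.wait_time("http://other.example/", now=10.5)
```

In a real crawler `now` would typically come from `time.monotonic()`, and the crawler would sleep for the returned wait time or pick a URL from a different host in the meantime.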

Glossary

1. seeds: the initial list of URLs that the crawler starts from

2. crawl frontier: the list of URLs still to be visited. As the crawler visits the seed URLs, it identifies all the hyperlinks in each page and adds them to this list.
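The relationship between seeds and the crawl frontier can be sketched as a simple breadth-first loop. This is only an illustration under stated assumptions: the function `crawl`, its `get_links` callback, the `max_pages` limit, and the toy `LINKS` graph are all hypothetical, and an in-memory dictionary stands in for actually downloading and parsing pages.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    get_links(url) is an assumed callback that returns the hyperlinks
    found on the page at url.
    """
    frontier = deque()  # URLs waiting to be visited
    seen = set()        # URLs already added to the frontier at some point
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


# Toy link graph used instead of real HTTP requests.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

order = crawl(["http://example.com/"], lambda url: LINKS.get(url, []))
```

Using a FIFO queue gives breadth-first order; replacing it with a priority queue is one way a selection policy could decide which frontier URL to download next.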