

Abstract











The Research and Implementation of Distributed Web Crawler


With the rapid development of the Internet, search engines, as its main entry point, play an increasingly important role. The web crawler is a key component of a search engine: it collects web pages from the Internet, and these pages are used to build the index that supports the search engine. Because of the rapid growth of information on the Internet, a centralized, stand-alone web crawler can no longer keep up with the scale of the Internet. High-performance distributed web crawler systems have therefore become a focus of current research in the field of information collection.

This paper studies the principles of web crawler systems, distributed architecture design, key modules, and performance bottlenecks and their solutions. The main work is as follows:

1. This paper introduces a hashing algorithm called Consistent Hash, which is used to address URL partitioning, the hot-spot problem, and load balancing between web crawler nodes, and to ensure that the distributed crawler has good scalability, balance, and fault tolerance.

2. To meet the politeness and priority requirements of the web crawler, this paper designs and implements a URL queue based on the Mercator model.

3. Solutions are given for large-scale URL deduplication, DNS resolution, page crawling and parsing, and other key problems.

4. This paper designs and implements a thread pool model for efficient multi-threaded page collection.

5. A scheme for storing downloaded pages is given, which creates index files and data files to manage and store the downloaded data.
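
To make the first point concrete, here is a minimal sketch of a consistent hash ring in Python. It is not the implementation from the paper: the class name `ConsistentHashRing`, the use of MD5, the number of virtual nodes, and the choice to partition by host name rather than by full URL are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a ring position (first 64 bits of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hash ring with virtual nodes. Names and defaults are illustrative."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per crawler node (assumed value)
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> crawler node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `replicas` virtual nodes for this crawler node on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # skip the (extremely unlikely) position collision
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node: str) -> None:
        """Drop every virtual node that belongs to this crawler node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, host: str) -> str:
        """Return the crawler node responsible for a host name."""
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(host)) % len(self._points)
        return self._owner[self._points[idx]]
```

With a ring built as `ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])`, calling `remove_node("crawler-2")` reassigns only the hosts that `crawler-2` owned; every other host stays on its original node. That property is what lets nodes join or fail without a full repartition of the URL space, and the virtual nodes are what spread the load more evenly across crawler nodes.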

On the basis of the above work, this paper designs and implements a prototype of a high-performance distributed web crawler. The experiments at the end of this paper show that the crawler fetches pages efficiently, is highly configurable and stable, and also has good distributed properties such as scalability, fault tolerance, and load balancing.

Keywords: Web Crawler; Distributed; Consistent Hash Algorithm; Information Retrieval; Thread Pool



posted @ 2012-06-29 23:02  糖拌咸鱼