A Web Crawler Written in C# -- Simple Crawler using C# sockets

An interesting article I came across on CodeProject. I'm studying it for now and will translate it when I have time:

Source code: Download

Sample Image - Crawler.jpg

Introduction
A web crawler (also known as an ant or a spider) is a program that automatically fetches page data from the World Wide Web. Crawlers are generally used to fetch large numbers of pages for later processing by a search engine; the downloaded pages are indexed by dedicated programs (e.g. Lucene, DotLucene) to speed up searching. A crawler can also serve as a link checker or an HTML code validator, and a more recent use is checking e-mail addresses to guard against Trackback spam.
 
Crawler Overview
In this article I introduce a simple crawler written in C#. Given a target URL, the program crawls pages starting from that address. Usage is very simple: just enter the address of the site you want to crawl and press "GO".
Web Crawler Architecture from Wikipedia, the free encyclopedia
 
The crawler keeps a queue of URLs waiting to be fetched, the same basic design used by large search engines. Crawling is multi-threaded: threads take URLs from the queue, download the pages, and save them to a specified storage area (Storage in the figure). Web requests are made with the C# Socket library. The links found in the page currently being crawled are parsed out and added to the URL queue (the settings include an option that limits the crawl depth). A minimal sketch of this design follows.
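To make the design above concrete, here is a minimal sketch of the queue-plus-worker-threads loop. It is only an illustration of the architecture, not the article's actual code: FetchPage, StorePage and ExtractLinks are placeholders for the socket download, storage and link-parsing logic described later.

using System.Collections;
using System.Threading;

// Minimal sketch of the URL-queue / worker-thread design (illustrative only).
class CrawlerSketch
{
    Queue queueURLS = new Queue();              // shared FIFO of URLs waiting to be crawled

    void ThreadRunFunction()                    // body executed by every crawling thread
    {
        while (true)
        {
            string url = null;
            lock (queueURLS)
            {
                if (queueURLS.Count > 0)
                    url = (string)queueURLS.Dequeue();
            }
            if (url == null)
            {
                Thread.Sleep(1000);             // queue empty: sleep, then try again
                continue;
            }

            string page = FetchPage(url);       // download the page over a socket
            StorePage(url, page);               // save it to the storage folder
            foreach (string link in ExtractLinks(page))
                lock (queueURLS)
                    queueURLS.Enqueue(link);    // new refs go to the back of the queue
        }
    }

    // Placeholders for the download, storage and parsing code described in the article.
    string FetchPage(string url) { return ""; }
    void StorePage(string url, string page) { }
    string[] ExtractLinks(string page) { return new string[0]; }
}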
 
Status Views
The program provides three status views:
  1. A list of the crawling threads
  2. Details of each crawling thread
  3. An error log

Threads View
The thread list view shows all the working threads. Each thread takes a URI from the URI queue and connects to it.

Threads tab view.
Requests View
The requests view lists the most recently downloaded pages and shows the details of each HTTP header.
Requests tab view
 
The header of each request shows information similar to the following:
GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
 
The response header shows information similar to the following:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
 
There is also a list of the most recently downloaded pages:
Parsing page
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
 
Settings
The program exposes a number of settings, including:
  1. MIME types
  2. Destination storage folder
  3. Maximum number of crawling threads
  4. etc.

MIME types

MIME types are the file types the crawler is allowed to download, and a default set of types is included. The user can add, edit and delete MIME types, or choose to allow all MIME types, as in the following figure. A rough sketch of how such a filter could be applied follows the figure.

Files Matches Settings
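As a rough illustration of how such a filter might be applied before a page is stored (the names allowAllMimeTypes and allowedMimeTypes are assumptions for this sketch, not the article's actual members):

// Hypothetical MIME-type check against the Content-Type response header.
bool IsAllowedMimeType(string contentType)
{
    if (allowAllMimeTypes)                          // "allow all MIME types" option
        return true;
    foreach (string allowed in allowedMimeTypes)    // e.g. "text/html", "image/gif"
        if (contentType.StartsWith(allowed))
            return true;
    return false;                                   // not in the list: skip this download
}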

Output
Output settings include the download folder and the number of requests to keep in the requests view for reviewing request details.
Output Settings

Connections
Connections settings contain:
  • Thread count: the number of concurrent working threads in the crawler.
  • Thread sleep time when refs queue empty: how long each thread sleeps when the refs queue is empty.
  • Thread sleep time between two connections: how long each thread sleeps after handling a request; this value is very important, because it keeps hosts from blocking the crawler due to heavy load.
  • Connection timeout: the send and receive timeout applied to all crawler sockets.
  • Navigate through pages to a depth of: the depth of navigation in the crawling process.
  • Keep same URL server: limits the crawl to the host of the original URL.
  • Keep connection alive: keeps the socket connection open for subsequent requests to avoid reconnection time.
A rough sketch of how some of these settings are applied appears after the figure.
Connections Settings
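Roughly speaking, these settings map onto the sockets and the worker loop as in the sketch below; the variable names are illustrative, not the article's actual fields (requires System.Net.Sockets and System.Threading).

// Apply the connection timeout to a crawler socket.
Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
socket.SendTimeout = connectionTimeoutMs;       // "Connection timeout" setting
socket.ReceiveTimeout = connectionTimeoutMs;

// After handling each request, be polite to the host:
Thread.Sleep(sleepBetweenConnectionsMs);        // "Thread sleep time between two connections"

// When the refs queue is empty:
Thread.Sleep(sleepWhenQueueEmptyMs);            // "Thread sleep time when refs queue empty"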

Advanced
Advanced settings contain:
  • The code page used to encode downloaded text pages.
  • A user-defined list of restricted words, so the user can filter out unwanted pages.
  • A user-defined list of restricted host extensions, to avoid being blocked by those hosts.
  • A user-defined list of restricted file extensions, to avoid parsing non-text data.
Advanced Settings


Points of Interest

Keep Alive Connection:
Keep-Alive is a request from the client to the server to keep the connection open after the response finishes, so it can be reused for subsequent requests. This is done by adding an HTTP header to the request sent to the server, as in the following request:

GET /CNN/Programs/nancy.grace/ HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
 
The "Connection: Keep-Alive" tells the server to not close the connection, but the server has the option to keep it opened or close it, but it should reply to the client socket by its decision.
So the server can keep tell the client that he will keep it opened by include "Connection: Keep-Alive" in his replay as follows:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Vary: Accept-Encoding,User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
Or it can tell the client that it refuses as follows: 
 
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: Close
Server: Apache
Vary: Accept-Encoding,User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
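On the client side, the crawler then needs to look at the Connection header of the reply before reusing the socket. A minimal sketch of that check, assuming the raw response header has already been read into a string, could be:

// Returns true if the server agreed to keep the connection open.
// "header" is assumed to contain the raw response header text.
bool ServerKeepsAlive(string header)
{
    foreach (string line in header.Split('\n'))
    {
        string l = line.Trim().ToLower();
        if (l.StartsWith("connection:"))
            return l.IndexOf("keep-alive") >= 0;    // "Connection: Close" gives false
    }
    return false;   // no Connection header: assume the HTTP/1.0 server will close it
}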

WebRequest and WebResponse problems:

When I started the code for this article I was using the WebRequest and WebResponse classes, as in the following code:
WebRequest request = WebRequest.Create(uri);
WebResponse response = request.GetResponse();
Stream streamIn = response.GetResponseStream();
BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
{
    nTotalBytes += nBytes;
}
reader.Close();
streamIn.Close();
response.Close();
 
This code works well, but it has a very serious problem: WebRequest's GetResponse locks access for all other WebRequest instances until the retrieved response is closed, as in the last line of the previous code. So I noticed that only one thread was downloading at a time while the others waited in GetResponse. To solve this serious problem I implemented my own two classes, MyWebRequest and MyWebResponse.
MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse but support concurrent responses at the same time. In addition, MyWebRequest supports a built-in KeepAlive flag for Keep-Alive connections.
So, my new code looks like this:
request = MyWebRequest.Create(uri, request/*to Keep-Alive*/, KeepAlive);
MyWebResponse response = request.GetResponse();
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while((nBytes = response.socket.Receive(RecvBuffer, 0, 10240, SocketFlags.None)) > 0)
{
    nTotalBytes += nBytes;
    if(response.KeepAlive && nTotalBytes >= response.ContentLength && response.ContentLength > 0)
        break;
}
if(response.KeepAlive == false)
    response.Close();
This simply replaces GetResponseStream with direct access to the socket member of the MyWebResponse class. To make the socket's next read start right after the reply header, I used a simple trick: read one byte at a time until the header is complete, as in the following code:
/* reading response header */
Header = "";
byte[] bytes = new byte[10];
while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
{
    Header += Encoding.ASCII.GetString(bytes, 0, 1);
    if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
        break;
}
So the user of the MyWebResponse class just continues receiving from the first byte of the page.
Thread Management:
The number of threads in the crawler is user defined through the settings. Its default value is 10 threads, but it can be changed from the Connections settings tab.
The crawler code handles this change through the ThreadCount property, as in the following code:
private int ThreadCount
{
    get { return nThreadCount; }
    set
    {
        Monitor.Enter(this.listViewThreads);
        for(int nIndex = 0; nIndex < value; nIndex++)
        {
            if(threadsRun[nIndex] == null || threadsRun[nIndex].ThreadState != ThreadState.Suspended)
            {
                threadsRun[nIndex] = new Thread(new ThreadStart(ThreadRunFunction));
                threadsRun[nIndex].Name = nIndex.ToString();
                threadsRun[nIndex].Start();
                if(nIndex == this.listViewThreads.Items.Count)
                {
                    ListViewItem item = this.listViewThreads.Items.Add((nIndex + 1).ToString(), 0);
                    string[] subItems = { "", "", "", "0", "0%" };
                    item.SubItems.AddRange(subItems);
                }
            }
            else if(threadsRun[nIndex].ThreadState == ThreadState.Suspended)
            {
                ListViewItem item = this.listViewThreads.Items[nIndex];
                item.ImageIndex = 1;
                item.SubItems[2].Text = "Resume";
                threadsRun[nIndex].Resume();
            }
        }
        nThreadCount = value;
        Monitor.Exit(this.listViewThreads);
    }
}

If ThreadCount is increased by the user, the code creates new threads or resumes suspended ones. Otherwise, suspending the surplus working threads is left to the threads themselves, as follows: each working thread has a name equal to its index in the thread array, and if that index value exceeds ThreadCount, the thread finishes its current job and then goes into suspension. A sketch of that check follows.
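A sketch of that per-thread check (simplified; the exact comparison and member names in the article's code may differ) could be:

// Inside ThreadRunFunction, after a request has been handled:
// each thread's Name holds its index in the thread array.
int nIndex = int.Parse(Thread.CurrentThread.Name);
if (nIndex >= nThreadCount)          // this thread is beyond the new ThreadCount
    Thread.CurrentThread.Suspend();  // park itself until ThreadCount grows and Resume() is called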
Crawling Depth:
Crawling depth is how deep the crawler navigates in the crawling process. Each URL has a depth equal to its parent's depth plus one, with depth 0 for the first URL inserted by the user. The URLs fetched from any page are inserted at the end of the URL queue, i.e. first-in, first-out, and all threads can insert into the queue at any time, as in the following part of the code (a sketch of the depth bookkeeping follows the two snippets):
void EnqueueUri(MyUri uri)
{
    Monitor.Enter(queueURLS);
    try
    {
        queueURLS.Enqueue(uri);
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
}

  
And each thread can retrieve the first URL in the queue and request it, as in the following part of the code:
MyUri DequeueUri()
{
    Monitor.Enter(queueURLS);
    MyUri uri = null;
    try
    {
        uri = (MyUri)queueURLS.Dequeue();
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
    return uri;
}
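The depth bookkeeping itself is not shown in the two snippets above. A possible sketch, assuming a Depth field on MyUri and a maximum-depth value taken from the "Navigate through pages to a depth of" setting (both names are assumptions here, as is the ParseLinks placeholder), is:

// Hypothetical depth-limited enqueueing of the links found on a page.
foreach (MyUri child in ParseLinks(parentUri, page))   // ParseLinks is a placeholder
{
    child.Depth = parentUri.Depth + 1;   // one level deeper than the page it came from
    if (child.Depth <= nMaxDepth)        // honour the crawl-depth setting
        EnqueueUri(child);               // FIFO: appended to the back of the queue
}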

  
http://blog.handsbrain.com/leezjs/archive/2006/03/20/106368.html