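Running the Nutch Crawler Experiment and Analyzing the Crawled Data (Part 2)

Continued from "Running the Nutch Crawler Experiment and Analyzing the Crawled Data (Part 1)":

Having analyzed the WebDB, we now go on to the other result files that the Nutch crawler produced when it crawled the experimental network.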

Segments

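The crawler created three segments during the crawl. Each one is stored in its own subdirectory of the segments directory, and each subdirectory is named after a timestamp. A segment represents one generate/fetch/update cycle of the crawler. Nutch provides the following command to print a summary of the segments: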

bin/nutch segread -list -dir crawl-tinysite/segments/

The output of the command is:

PARSED? STARTED           FINISHED          COUNT DIR NAME
true    20051025-12:13:35 20051025-12:13:35 1     crawl-tinysite/segments/20051025121334
true    20051025-12:13:37 20051025-12:13:37 1     crawl-tinysite/segments/20051025121337
true    20051025-12:13:39 20051025-12:13:39 2     crawl-tinysite/segments/20051025121339
TOTAL: 4 entries in 3 segments.

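The PARSED? column shows whether parsing and indexing followed directly after the fetch; by default it is true. It is false only when you drive the crawl with the lower-level commands and run parsing and indexing as a separate step after fetching. The STARTED and FINISHED columns record when the cycle started and finished, which helps when you need to find out why some pages took too long to fetch. The COUNT column is the number of fetched pages in the segment. The last segment, for example, has a COUNT of 2, meaning it holds two fetched pages: page C and page C-dup.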
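To make the column layout concrete, here is a minimal Python sketch that parses the listing above into records and re-derives the TOTAL line. The `SegmentSummary` type and the `parse_listing` function are names we made up for illustration; they are not part of Nutch, and the parser assumes the whitespace-separated layout shown in this listing.

```python
from dataclasses import dataclass

# Sample `segread -list` output, copied from the listing above.
LISTING = """\
PARSED? STARTED           FINISHED          COUNT DIR NAME
true    20051025-12:13:35 20051025-12:13:35 1     crawl-tinysite/segments/20051025121334
true    20051025-12:13:37 20051025-12:13:37 1     crawl-tinysite/segments/20051025121337
true    20051025-12:13:39 20051025-12:13:39 2     crawl-tinysite/segments/20051025121339
TOTAL: 4 entries in 3 segments.
"""


@dataclass
class SegmentSummary:  # hypothetical record type, not a Nutch class
    parsed: bool
    started: str
    finished: str
    count: int
    dir_name: str


def parse_listing(text):
    """Return one SegmentSummary per data row; header and TOTAL lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five fields and start with true/false.
        if len(fields) != 5 or fields[0] not in ("true", "false"):
            continue
        parsed, started, finished, count, dir_name = fields
        rows.append(SegmentSummary(parsed == "true", started, finished,
                                   int(count), dir_name))
    return rows


segments = parse_listing(LISTING)
total_pages = sum(s.count for s in segments)
print(f"{total_pages} entries in {len(segments)} segments")
```

Summing the COUNT column gives the same figures as the TOTAL line, which is a quick sanity check when a crawl has many segments.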

This summary is not very detailed, though. The following commands show the full contents of a single segment; we take the first segment as an example:

s=`ls -d crawl-tinysite/segments/* | head -1`

bin/nutch segread -dump $s

The result is:

Recno:: 0

FetcherOutput::
FetchListEntry: version: 2
fetch: true
page: Version: 4
URL: http://keaton/tinysite/A.html
ID: 6cf980375ed1312a0ef1d77fd1760a3e
Next fetch: Tue Nov 01 11:13:34 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
anchors: 1
anchor: A
Fetch Result:
MD5Hash: fb8b9f0792e449cda72a9670b4ce833a
ProtocolStatus: success(1), lastModified=0
FetchDate: Tue Oct 25 12:13:35 BST 2005

Content::
url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1,
Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000",
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
Content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>'A' is for Alligator</title>
</head>
<body>
<p>
Alligators live in freshwater environments such as ponds,
marshes, rivers and swamps. Although alligators have
heavy bodies and slow metabolisms, they are capable of
short bursts of speed that can exceed 30 miles per hour.
Alligators' main prey are smaller animals that they can kill
and eat with a single bite. Alligators may kill larger prey
by grabbing it and dragging it in the water to drown.
Food items that can't be eaten in one bite are either allowed
to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From
<a href="http://en.wikipedia.org/wiki/Alligator">the
Wikipedia entry for Alligator</a>.)
</p>
<p><a href="B.html">B</a></p>
</body>
</html>

ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
Outlinks: 2
outlink: toUrl: http://en.wikipedia.org/wiki/Alligator anchor: the Wikipedia entry for Alligator
outlink: toUrl: http://keaton/tinysite/B.html anchor: B
Metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT,
CharEncodingForConversion=windows-1252, Server=Apache-Coyote/1.1,
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, ETag=W/"1106-1130238131000",
Content-Type=text/html, Connection=close, Content-Length=1106}

ParseText::
'A' is for Alligator Alligators live in freshwater environments such
as ponds, marshes, rivers and swamps. Although alligators have heavy
bodies and slow metabolisms, they are capable of short bursts of
speed that can exceed 30 miles per hour. Alligators' main prey are
smaller animals that they can kill and eat with a single bite.
Alligators may kill larger prey by grabbing it and dragging it in
the water to drown. Food items that can't be eaten in one bite are
either allowed to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From the Wikipedia entry for Alligator .) B

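The output contains several blocks of data, but make no mistake: they all belong to a single page entity, page A. The blocks are simply of different kinds, holding the data the crawler obtained for page A at different stages. There are three kinds in all: fetch data, raw content and parsed content. Fetch data sits in the FetcherOutput section. It is the data about page A that the crawler writes back to the WebDB during the update phase of the generate/fetch/update cycle.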

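Raw content sits in the Content section. It is the raw data that the fetcher retrieved from the network, including the complete HTTP headers and the page body. By default it is downloaded through the HTTP protocol plugin. This data is kept so that Nutch can serve cached copies (snapshots) of pages.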

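Finally, the raw content is run through a parser to produce the parsed content. Parsers in Nutch are implemented as plugins, and the format of the downloaded page decides which plugin is used. Parsed content sits in the ParseData and ParseText sections, and it is what the segment's index is built from.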
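The three kinds of data can be pulled apart mechanically, because each section of the dump begins with a header line ending in "::". The following Python sketch splits an abbreviated copy of the dump into its sections. The `split_sections` function and the shortened sample text are ours, for illustration only, and the sketch assumes a dump containing a single record.

```python
import re

# Abbreviated copy of the dump for page A; most lines are left out.
DUMP = """\
Recno:: 0
FetcherOutput::
FetchListEntry: version: 2
URL: http://keaton/tinysite/A.html
Content::
url: http://keaton/tinysite/A.html
contentType: text/html
Content:
<html>...</html>
ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
ParseText::
'A' is for Alligator Alligators live in freshwater environments
"""

# Only these four names open a section; "Recno::" and the single-colon
# "Content:" line inside the Content section must not match.
SECTION_HEADER = re.compile(r"^(FetcherOutput|Content|ParseData|ParseText)::\s*$")


def split_sections(dump):
    """Map each section name to the list of lines that follow its header."""
    sections = {}
    current = None
    for line in dump.splitlines():
        match = SECTION_HEADER.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


sections = split_sections(DUMP)
kinds = {
    "fetch data": sections["FetcherOutput"],
    "raw content": sections["Content"],
    "parsed content": sections["ParseData"] + sections["ParseText"],
}
for kind, lines in kinds.items():
    print(kind, "->", len(lines), "lines")
```

Grouping ParseData and ParseText under one key mirrors the description above: both come out of the parsing step, while FetcherOutput and Content come from the fetch.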

Index

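We use the Luke tool to examine the Nutch index. Luke can browse the individual entries in an index and can also run keyword queries. Figure 3 below shows the index that Nutch built in this experiment; the index is stored in the index directory.

Figure 3. Browsing the Nutch index with Luke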

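As the previous article explained, the final index is built by merging the segment indexes and removing duplicates. If you open the index of the last segment in Luke, you will find that page C-dup is not in it: its content duplicates page C, so it was removed while the index was being built. The final index therefore holds only three entries, for pages A, B and C.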
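The idea of removing pages with duplicate content can be sketched in a few lines of Python by hashing each page body, much as the MD5Hash field in the segment dump suggests. This is only an illustration of the idea, not Nutch's deduplication code. The page bodies are invented, and keeping the first page seen for each hash is our assumption.

```python
import hashlib

# Invented page bodies. The only fact taken from the article is that
# C and C-dup have identical content.
pages = {
    "A": "'A' is for Alligator",
    "B": "placeholder body for page B",
    "C": "placeholder body shared by C and C-dup",
    "C-dup": "placeholder body shared by C and C-dup",
}


def dedup_by_content(pages):
    """Keep one page name per distinct MD5 digest of the page body."""
    first_seen = {}
    for name, body in pages.items():
        digest = hashlib.md5(body.encode("utf-8")).hexdigest()
        # Assumption for this sketch: the first page with a given hash wins.
        first_seen.setdefault(digest, name)
    return sorted(first_seen.values())


unique_pages = dedup_by_content(pages)
print(unique_pages)
```

With four pages going in and two of them sharing a body, three names come out, matching the three entries that Luke shows in the final index.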

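Figure 3 shows all the index fields for page A. Most of them are self-explanatory, so we only explain boost. The value of the boost field is computed from the number of links that point to the page: the more inlinks, the larger the value. The relationship is logarithmic rather than linear. The formula is ln(e + n), where n is the number of inlinks. In this example only page B links to page A, so boost = ln(e + 1) = 1.3132616…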
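The formula is easy to check numerically. The short Python sketch below evaluates ln(e + n) for a few inlink counts; the function name `boost` is ours, and the sketch covers only the case where every page score is 1.0.

```python
import math


def boost(inlinks):
    """Boost from the inlink count alone: ln(e + n)."""
    return math.log(math.e + inlinks)


for n in (0, 1, 10, 100):
    print(n, boost(n))
```

For n = 1 this reproduces the 1.3132616… shown for page A. A page with no inlinks gets a boost of ln(e) = 1, and the logarithm keeps the value growing slowly even when the inlink count becomes large.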

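Strictly speaking, though, boost does not depend on the link count alone. It also depends on the scores of the pages that link to this page. As we saw when analyzing the WebDB and the segment, a score is stored for every page, and the boost computation is meant to take the scores of the linking pages into account. In an intranet crawl, however, page scoring is switched off by default, so every page keeps a score of 1.0 and the boost computation in effect considers only the number of inlinks.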

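So when does a page's score differ from 1.0? Nutch scores pages with the LinkAnalysisTool, which is based on Google's PageRank algorithm. We do not describe PageRank in detail here; interested readers can study it on their own. PageRank is defined recursively and computed by repeated iteration, so it costs a good deal of computation and time, and Nutch skips this step automatically for intranet crawls. Fortunately, searching pages on an intranet works well without it. For crawling and searching the whole web, however, link analysis is essential, and it is one of the main reasons for Google's great success.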
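To give a feel for why link analysis is iterative, here is a minimal textbook-style PageRank in Python. It is not the LinkAnalysisTool implementation. The three-page link graph, the damping factor of 0.85 and the fixed iteration count are all assumptions chosen for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph, not the tinysite graph from the experiment.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items()):
    print(page, score)
```

Each pass recomputes every page's rank from the ranks of the pages linking to it, so the whole link graph is visited once per iteration. On a graph the size of the web that cost adds up, which is why skipping the step on a small intranet crawl is a reasonable trade-off.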

Notes

Reference article: http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

This article is original work. If you repost it, please be sure to credit the source: http://blog.donews.com/52se.

posted @ 2006-07-08 15:00 kwklover
