
Introduction to Nutch, Part 1: Crawling

Tom White

Tue, 2006-01-10

 

Nutch is an open source Java implementation of a search engine. It provides all of the tools you need to run your own search engine. But why would anyone want to run their own search engine? After all, there's always Google. There are at least three reasons.

  1. Transparency. Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret, so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments, rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
  2. Understanding. We don't have the source code to Google, so Nutch is probably the best we have. It's interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
  3. Extensibility. Don't like the way other search engines display their results? Write your own search engine--using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism. For example, you can integrate it into your site to add a search capability.

Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In "Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:

... a complete system might cost anywhere between $800 per month for two-search-per-second performance over 100 million pages, to $30,000 per month for 50-page-per-second performance over 1 billion pages.

This series of two articles shows you how to use Nutch at the more modest intranet scale (note that you may see this term being used to cover sites that are actually on the public internet--the point is the size of the crawl being undertaken, which ranges from a single site to tens, or possibly hundreds, of sites). This first article concentrates on crawling: the architecture of the Nutch crawler, how to run a crawl, and understanding what it generates. The second looks at searching, and shows you how to run the Nutch search application, ways to customize it, and considerations for running a real-world system.

Nutch vs. Lucene

Nutch is built on top of Lucene, which is an API for text indexing and searching. A common question is: "Should I use Lucene or Nutch?" The simple answer is that you should use Lucene if you don't need a web crawler. A common scenario is that you have a web front end to a database that you want to make searchable. The best way to do this is to index the data directly from the database using the Lucene API, and then write code to do searches against the index, again using Lucene. Erik Hatcher and Otis Gospodnetić's Lucene in Action gives all of the details. Nutch is a better fit for sites where you don't have direct access to the underlying data, or it comes from disparate sources.

Architecture

Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users' search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments described below in order to produce page summaries and to provide access to cached pages.)

The main practical spin-off from this design is that the crawler and searcher systems can be scaled independently on separate hardware platforms. For instance, a highly trafficked search page that provides searching for a relatively modest set of sites may only need a correspondingly modest investment in the crawler infrastructure, while requiring more substantial resources for supporting the searcher.

We will look at the Nutch crawler here, and leave discussion of the searcher to part two.

The Crawler

The crawler system is driven by the Nutch crawl tool, and a family of related tools to build and maintain several types of data structures, including the web database, a set of segments, and the index. We describe all of these in more detail next.

The web database, or WebDB, is a specialized persistent data structure for mirroring the structure and properties of the web graph being crawled. It persists as long as the web graph that is being crawled (and re-crawled) exists, which may be months or years. The WebDB is used only by the crawler and does not play any role during searching. The WebDB stores two types of entities: pages and links. A page represents a page on the Web, and is indexed by its URL and the MD5 hash of its contents. Other pertinent information is stored, too, including the number of links in the page (also called outlinks); fetch information (such as when the page is due to be refetched); and the page's score, which is a measure of how important the page is (for example, one measure of importance awards high scores to pages that are linked to from many other pages). A link represents a link from one web page (the source) to another (the target). In the WebDB web graph, the nodes are pages and the edges are links.

A segment is a collection of pages fetched and indexed by the crawler in a single run. The fetchlist for a segment is a list of URLs for the crawler to fetch, and is generated from the WebDB. The fetcher output is the data retrieved from the pages in the fetchlist. The fetcher output for the segment is indexed and the index is stored in the segment. Any given segment has a limited lifespan, since it is obsolete as soon as all of its pages have been re-crawled. The default re-fetch interval is 30 days, so it is usually a good idea to delete segments older than this, particularly as they take up so much disk space. Segments are named by the date and time they were created, so it's easy to tell how old they are.
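
Once segments have passed the re-fetch interval, something like the following can clear them out. This is a hedged sketch rather than a Nutch tool: it uses the directory modification time as a rough proxy for segment age, and crawl-dir is a placeholder for the crawl directory (such as the crawl-tinysite directory used later in this article).

find crawl-dir/segments -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;   # remove segment directories older than 30 days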

The index is the inverted index of all of the pages the system has retrieved, and is created by merging all of the individual segment indexes. Nutch uses Lucene for its indexing, so all of the Lucene tools and APIs are available to interact with the generated index. Since this has the potential to cause confusion, it is worth mentioning that the Lucene index format has a concept of segments, too, and these are different from Nutch segments. A Lucene segment is a portion of a Lucene index, whereas a Nutch segment is a fetched and indexed portion of the WebDB.

The crawl tool

Now that we have some terminology, it is worth trying to understand the crawl tool, since it does a lot behind the scenes. Crawling is a cyclical process: the crawler generates a set of fetchlists from the WebDB, a set of fetchers downloads the content from the Web, the crawler updates the WebDB with new links that were found, and then the crawler generates a new set of fetchlists (for links that haven't been fetched for a given period, including the new links found in the previous cycle) and the cycle repeats. This cycle is often referred to as the generate/fetch/update cycle, and runs periodically as long as you want to keep your search index up to date.

URLs with the same host are always assigned to the same fetchlist. This is done for reasons of politeness, so that a web site is not overloaded with requests from multiple fetchers in rapid succession. Nutch observes the Robots Exclusion Protocol, which allows site owners to control which parts of their site may be crawled.

The crawl tool is actually a front end to other, lower-level tools, so it is possible to get the same results by running the lower-level tools in a particular sequence. Here is a breakdown of what crawl does, with the lower-level tool names in parentheses (a hedged command-line sketch of the same sequence appears after the list):

  1. Create a new WebDB (admin db -create).
  2. Inject root URLs into the WebDB (inject).
  3. Generate a fetchlist from the WebDB in a new segment (generate).
  4. Fetch content from URLs in the fetchlist (fetch).
  5. Update the WebDB with links from fetched pages (updatedb).
  6. Repeat steps 3-5 until the required depth is reached.
  7. Update segments with scores and links from the WebDB (updatesegs).
  8. Index the fetched pages (index).
  9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
  10. Merge the indexes into a single index for searching (merge).
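
For orientation, here is a minimal sketch of what the same sequence can look like when driven by hand. The tool names are those listed above, but the argument forms are assumptions for the 0.7 release and should be verified against the usage message each tool prints when run without arguments; db, segments, and urls are placeholder names rather than required ones.

# Assumed invocations -- check bin/nutch <toolname> for the exact argument forms.
bin/nutch admin db -create              # 1. create a new WebDB in the 'db' directory
bin/nutch inject db -urlfile urls       # 2. inject root URLs listed in the 'urls' file
bin/nutch generate db segments          # 3. generate a fetchlist in a new segment
s=`ls -d segments/* | tail -1`          #    pick up the segment just created
bin/nutch fetch $s                      # 4. fetch the pages in that fetchlist
bin/nutch updatedb db $s                # 5. update the WebDB with newly found links
# ...repeat steps 3-5 until the required depth is reached, then:
bin/nutch updatesegs db segments        # 7. push scores and links back into the segments (assumed args)
bin/nutch index $s                      # 8. index each fetched segment in turn
bin/nutch dedup segments                # 9. remove duplicate content and URLs (assumed args)
bin/nutch merge index segments/*        # 10. merge the segment indexes into one index (assumed args)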

After creating a new WebDB (step 1), the generate/fetch/update cycle (steps 3-6) is bootstrapped by populating the WebDB with some seed URLs (step 2). When this cycle has finished, the crawler goes on to create an index from all of the segments (steps 7-10). Each segment is indexed independently (step 8), before duplicate pages (that is, pages at different URLs with the same content) are removed (step 9). Finally, the individual indexes are combined into a single index (step 10).

The dedup tool can remove duplicate URLs from the segment indexes. This is not to remove multiple fetches of the same URL because the URL has been duplicated in the WebDB--this cannot happen, since the WebDB does not allow duplicate URL entries. Instead, duplicates can arise if a URL is re-fetched and the old segment for the previous fetch still exists (because it hasn't been deleted). This situation can't arise during a single run of the crawl tool, but it can during re-crawls, so this is why dedup also removes duplicate URLs.

While the crawl tool is a great way to get started with crawling websites, you will need to use the lower-level tools to perform re-crawls and other maintenance on the data structures built during the initial crawl. We shall see how to do this in the real-world example later, in part two of this series. Also, crawl is really aimed at intranet-scale crawling. To do a whole-web crawl, you should start with the lower-level tools. (See the "Resources" section for more information.)

Configuration and Customization

All of Nutch's configuration files are found in the conf subdirectory of the Nutch distribution. The main configuration file is conf/nutch-default.xml. As the name suggests, it contains the default settings, and should not be modified. To change a setting, you create conf/nutch-site.xml and add your site-specific overrides.

Nutch defines various extension points, which allow developers to customize Nutch's behavior by writing plugins, found in the plugins subdirectory. Nutch's parsing and indexing functionality is implemented almost entirely by plugins--it is not in the core code. For instance, the code for parsing HTML is provided by the HTML document parsing plugin, parse-html. You can control which plugins are available to Nutch with the plugin.includes and plugin.excludes properties in the main configuration file.
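
As a concrete illustration, an override in conf/nutch-site.xml follows the same property format as conf/nutch-default.xml. The sketch below is an assumption about the exact shape of that file for the 0.7 release (the root element name and the plugin list shown are illustrative); in practice, copy the real plugin.includes entry from conf/nutch-default.xml and edit it rather than typing a value from scratch.

cat > conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<!-- Site-specific overrides; property names must match those in conf/nutch-default.xml. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value only: start from the default list and add or remove plugins. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</nutch-conf>
EOF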

With this background, let's run a crawl on a toy site to get a feel for what the Nutch crawler does.

Running a Crawl

First, download the latest Nutch distribution and unpack it on your system (I used version 0.7.1). To use the Nutch tools, you will need to make sure the NUTCH_JAVA_HOME or JAVA_HOME environment variable is set to tell Nutch where Java is installed.
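
For example, in a Bourne-style shell this might look like the following (the path is a placeholder; point it at your own Java installation):

export NUTCH_JAVA_HOME=/usr/lib/jvm/java   # placeholder path to your JDK or JRE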

I created a contrived example with just four pages to understand the steps involved in the crawl process. Figure 1 illustrates the links between pages. C and C-dup (C-duplicate) have identical content.

Figure 1
Figure 1. The site structure for the site we are going to crawl

Before we run the crawler, create a file called urls that contains the root URLs from which to populate the initial fetchlist. In this case, we'll start from page A.

echo 'http://keaton/tinysite/A.html' > urls

The crawl tool uses a filter to decide which URLs go into the WebDB (in steps 2 and 5 in the breakdown of crawl above). This can be used to restrict the crawl to URLs that match any given pattern, specified by regular expressions. Here, we just restrict the domain to the server on my intranet (keaton), by changing the line in the configuration file conf/crawl-urlfilter.txt from

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

to

+^http://keaton/

Now we are ready to crawl, which we do with a singlecommand:

bin/nutch crawl urls -dir crawl-tinysite -depth 3 >& crawl.log

The crawl uses the root URLs in urls to start the crawl, and puts the results of the crawl in the directory crawl-tinysite. The crawler logs its activity to crawl.log. The -depth flag tells the crawler how many generate/fetch/update cycles to carry out to get full page coverage. Three is enough to reach all of the pages in this example, but for real sites it is best to start with five (the default), and increase it if you find some pages aren't being reached.

We shall now look in some detail at the data structures crawl has produced.

Examining the Results of the Crawl

If we peek into the crawl-tinysite directory, we find three subdirectories: db, segments, and index (see Figure 2). These contain the WebDB, the segments, and the Lucene index, respectively.

Figure 2
Figure 2. The directories and files created after running the crawl tool

Nutch comes with several tools for examining the data structures it builds, so let's use them to see what the crawl has created.

WebDB

The first thing to look at is the number of pages and links in the database. This is useful as a sanity check to give us some confidence that the crawler did indeed crawl the site, and how much of it. The readdb tool parses the WebDB and displays portions of it in human-readable form. We use the -stats option here:

bin/nutch readdb crawl-tinysite/db -stats

which displays:

Number of pages: 4
Number of links: 4

As expected, there are four pages in the WebDB (A, B, C, and C-duplicate) and four links between them. The links to Wikipedia are not in the WebDB, since they did not match the pattern in the URL filter file. Both C and C-duplicate are in the WebDB, since the WebDB doesn't de-duplicate pages by content, only by URL (which is why A isn't in there twice). Next, we can dump all of the pages, by using a different option for readdb:

bin/nutch readdb crawl-tinysite/db -dumppageurl

which gives:

Page 1: Version: 4
URL: http://keaton/tinysite/A.html
ID: fb8b9f0792e449cda72a9670b4ce833a
Next fetch: Thu Nov 24 11:13:35 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 1
Score: 1.0
NextScore: 1.0

 


Page 2: Version: 4
URL: http://keaton/tinysite/B.html
ID: 404db2bd139307b0e1b696d3a1a772b4
Next fetch: Thu Nov 24 11:13:37 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 3
Score: 1.0
NextScore: 1.0


Page 3: Version: 4
URL: http://keaton/tinysite/C-duplicate.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:39 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0


Page 4: Version: 4
URL: http://keaton/tinysite/C.html
ID: be7e0a5c7ad9d98dd3a518838afd5276
Next fetch: Thu Nov 24 11:13:40 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

Each page appears in a separate block, with one field per line. The ID field is the MD5 hash of the page contents: note that C and C-duplicate have the same ID. There is also information about when the pages should be next fetched (which defaults to 30 days), and page scores. It is easy to dump the structure of the web graph, too:

bin/nutch readdb crawl-tinysite/db -dumplinks

which produces:

from http://keaton/tinysite/B.html
to http://keaton/tinysite/A.html
to http://keaton/tinysite/C-duplicate.html
to http://keaton/tinysite/C.html

 

from http://keaton/tinysite/A.html
to http://keaton/tinysite/B.html

For sites larger than a few pages, it is less useful to dump the WebDB in full using these verbose formats. The readdb tool also supports extraction of an individual page or link by URL or MD5 hash. For example, to examine the links to page B, issue the command:

bin/nutch readdb crawl-tinysite/db -linkurl http://keaton/tinysite/B.html

to get:

Found 1 links.
Link 0: Version: 5
ID: fb8b9f0792e449cda72a9670b4ce833a
DomainID: 3625484895915226548
URL: http://keaton/tinysite/B.html
AnchorText: B
targetHasOutlink: true

Notice that the ID is the MD5 hash of the source page A.

There are other ways to inspect the WebDB. The admin tool can produce a dump of the whole database in plain-text tabular form, with one entry per line, using the -textdump option. This format is handy for processing with scripts. The most flexible way of reading the WebDB is through the Java interface. See the Nutch source code and API documentation for more details. A good starting point is org.apache.nutch.db.WebDBReader, which is the Java class that implements the functionality of the readdb tool (readdb is actually just a synonym for org.apache.nutch.db.WebDBReader).
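
As a rough sketch only: the -textdump option is taken from the paragraph above, but whether it expects an output name, and in what position, is an assumption to verify against the usage message that bin/nutch admin prints.

bin/nutch admin crawl-tinysite/db -textdump dbdump   # assumed form; 'dbdump' is a hypothetical output name
grep 'tinysite/B.html' dbdump*                       # the tabular dump is then easy to process with ordinary text tools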

Segments

The crawl created three segments in timestamped subdirectories in the segments directory, one for each generate/fetch/update cycle. The segread tool gives a useful summary of all of the segments:

bin/nutch segread -list -dir crawl-tinysite/segments/

giving the following tabular output (slightly reformatted to fit this page):

PARSED? STARTED           FINISHED          COUNT DIR NAME
true 20051025-12:13:35 20051025-12:13:35 1 crawl-tinysite/segments/20051025121334
true 20051025-12:13:37 20051025-12:13:37 1 crawl-tinysite/segments/20051025121337
true 20051025-12:13:39 20051025-12:13:39 2 crawl-tinysite/segments/20051025121339
TOTAL: 4 entries in 3 segments.

The PARSED? column is always true when using the crawl tool. This column is useful when running fetchers with parsing turned off, to be run later as a separate process. The STARTED and FINISHED columns indicate the times when fetching started and finished. This information is invaluable for bigger crawls, when tracking down why crawling is taking a long time. The COUNT column shows the number of fetched pages in the segment. The last segment, for example, has two entries, corresponding to pages C and C-duplicate.

Sometimes it is necessary to find out in more detail what is in a particular segment. This is done using the -dump option for segread. Here we dump the first segment (again, slightly reformatted to fit this page):

s=`ls -d crawl-tinysite/segments/* | head -1`
bin/nutch segread -dump $s


Recno:: 0
FetcherOutput::
FetchListEntry: version: 2
fetch: true
page: Version: 4
URL: http://keaton/tinysite/A.html
ID: 6cf980375ed1312a0ef1d77fd1760a3e
Next fetch: Tue Nov 01 11:13:34 GMT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0

 

anchors: 1
anchor: A
Fetch Result:
MD5Hash: fb8b9f0792e449cda72a9670b4ce833a
ProtocolStatus: success(1), lastModified=0
FetchDate: Tue Oct 25 12:13:35 BST 2005

Content::
url: http://keaton/tinysite/A.html
base: http://keaton/tinysite/A.html
contentType: text/html
metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT, Server=Apache-Coyote/1.1,
Connection=close, Content-Type=text/html, ETag=W/"1106-1130238131000",
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, Content-Length=1106}
Content:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>'A' is for Alligator</title>
</head>
<body>
<p>
Alligators live in freshwater environments such as ponds,
marshes, rivers and swamps. Although alligators have
heavy bodies and slow metabolisms, they are capable of
short bursts of speed that can exceed 30 miles per hour.
Alligators' main prey are smaller animals that they can kill
and eat with a single bite. Alligators may kill larger prey
by grabbing it and dragging it in the water to drown.
Food items that can't be eaten in one bite are either allowed
to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From
<a href="http://en.wikipedia.org/wiki/Alligator">the
Wikipedia entry for Alligator</a>.)
</p>
<p><a href="B.html">B</a></p>
</body>
</html>

ParseData::
Status: success(1,0)
Title: 'A' is for Alligator
Outlinks: 2
outlink: toUrl: http://en.wikipedia.org/wiki/Alligator
anchor: the Wikipedia entry for Alligator
outlink: toUrl: http://keaton/tinysite/B.html anchor: B
Metadata: {Date=Tue, 25 Oct 2005 11:13:34 GMT,
CharEncodingForConversion=windows-1252, Server=Apache-Coyote/1.1,
Last-Modified=Tue, 25 Oct 2005 11:02:11 GMT, ETag=W/"1106-1130238131000",
Content-Type=text/html, Connection=close, Content-Length=1106}

ParseText::
'A' is for Alligator Alligators live in freshwater environments such
as ponds, marshes, rivers and swamps. Although alligators have heavy
bodies and slow metabolisms, they are capable of short bursts of
speed that can exceed 30 miles per hour. Alligators' main prey are
smaller animals that they can kill and eat with a single bite.
Alligators may kill larger prey by grabbing it and dragging it in
the water to drown. Food items that can't be eaten in one bite are
either allowed to rot or are rendered by biting and then spinning or
convulsing wildly until bite size pieces are torn off.
(From the Wikipedia entry for Alligator .) B

There's a lot of data for each entry--remember this is just a single entry, for page A--but it breaks down into the following categories: fetch data, raw content, and parsed content. The fetch data, indicated by the FetcherOutput section, is data gathered by the fetcher to be propagated back to the WebDB during the update part of the generate/fetch/update cycle.

The raw content, indicated by the Content section, contains the page contents as retrieved by the fetcher, including HTTP headers and other metadata. (By default, the protocol-httpclient plugin is used to do this work.) This content is returned when you ask Nutch search for a cached copy of the page. You can see the HTML page for page A in this example.

Finally, the raw content is parsed using an appropriate parser plugin--determined by looking at the content type and then the file extension. In this case, parse-html was used, since the content type is text/html. The parsed content (indicated by the ParseData and ParseText sections) is used by the indexer to create the segment index.

Index

The tool of choice for examining Lucene indexes is Luke. Luke allows you to look at individual documents in an index, as well as perform ad hoc queries. Figure 3 shows the merged index for our example, found in the index directory.

Figure 3
Figure 3. Browsing the merged index in Luke

Recall that the merged index is created by combining all of the segment indexes after duplicate pages have been removed. In fact, if you use Luke to browse the index for the last segment (found in the index subdirectory of the segment), you will see that page C-duplicate has been removed from the index. Hence, the merged index only has three documents, corresponding to pages A, B, and C.

Figure 3 shows the fields for page A. Most are self-explanatory, but the boost field deserves a mention. It is calculated on the basis of the number of pages linking to this page--the more pages that link to the page, the higher the boost. The boost is not proportional to the number of inbound links; instead, it is damped logarithmically. The formula used is ln(e + n), where n is the number of inbound links. In our example, only page B links to page A, so there is only one inbound link, and the boost works out as ln(e + 1) = 1.3132616 ...
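
If you want to reproduce that number yourself, a quick command-line check (using awk's built-in log and exp functions; nothing Nutch-specific) gives the same value:

awk 'BEGIN { n = 1; print log(exp(1) + n) }'   # prints 1.31326..., the boost for a single inbound link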

You might be wondering how the boost field is related to the page score that is stored in the WebDB and the segment fetcher output. The boost field is actually calculated by multiplying the page score by the formula in the previous paragraph. For our crawl--indeed, for all crawls performed using the crawl tool--the page scores are always 1.0, so the boosts depend simply on the number of inbound links.

When are page scores not 1.0? Nutch comes with a tool for performing link analysis, LinkAnalysisTool, which uses an algorithm like Google's PageRank to assign a score to each page based on how many pages link to it (and is weighted according to the score of these linking pages). Notice that this is a recursive definition, and it is for this reason that link analysis is expensive to compute. Luckily, intranet search usually works fine without link analysis, which is why it is not a part of the crawl tool, but it is a key part of whole-web search--indeed, PageRank was crucial to Google's success.

Conclusion

In this article, we looked at the Nutch crawler in some detail. The second article will show how to get the Nutch search application running against the results of a crawl.

Resources

  • The Nutch project page is the place to start for more information on Nutch. The mailing lists for nutch-user and nutch-dev are worth searching if you have a question.
  • At the time of this writing, the MapReduce version of Nutch is in the main trunk and is not in a released version. This means that you need to build it yourself if you want to use it (or wait for version 0.8 to be released).
  • For more on whole-web crawling, see the Nutch tutorial.
  • For more information on Nutch plugins (which are based on the Eclipse 2.0 plugin architecture), a good starting point is PluginCentral on the Nutch Wiki.
  • Creative Commons provides a Nutch-powered search option for finding Creative-Commons-licensed content (see also this blog entry).
  • "Building Nutch: Open Source Search" (ACM Queue, vol. 2, no. 2, April 2004), by Mike Cafarella and Doug Cutting, is a good high-level introduction to Nutch.
  • "Nutch: A Flexible and Scalable Open Source Web Search Engine" (PDF) (CommerceNet Labs Technical Report 04-04, November 2004), by Rohit Khare, Doug Cutting, Kragen Sitaker, and Adam Rifkin, covers the filesystem scale of Nutch particularly well.

Acknowledgments

Thanks to Doug Cutting and Piotr Kosiorowski for their feedback and helpful suggestions.

Dedication

This article is for my younger daughter Charlotte, who learned to crawl while I was writing it.
