[root@ewanalysis ~]# nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.



crawl

Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
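As a rough illustration of the argument order in the usage line above, the sketch below only prints the crawl command it would run. The seed directory `urls`, the crawl ID `demo`, and the round count `2` are made-up values, and the optional Solr URL is left out.

```shell
#!/usr/bin/env bash
# Sketch: assemble a crawl invocation following the usage line
#   crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
# All argument values used here are hypothetical examples.

build_crawl_cmd() {
  # Accept 3 arguments (no Solr URL) or 4 arguments (with Solr URL).
  if [ "$#" -ne 3 ] && [ "$#" -ne 4 ]; then
    echo "expected: <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>" >&2
    return 1
  fi
  # Only print the command; nothing is executed.
  echo "crawl $*"
}

build_crawl_cmd urls demo 2
```

Printing the command first makes it easy to check the argument order before starting a real crawl.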






nutch inject

Usage: InjectorJob <url_dir> [-crawlId <id>]
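The `<url_dir>` argument is a directory of plain text files with one URL per line. A minimal sketch of preparing such a directory follows; the file name `seed.txt`, the example URL, and the crawl ID `demo` are assumptions for illustration, and the inject command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: prepare a seed directory and print the matching inject command.
# seed.txt, the URL, and the crawl ID "demo" are hypothetical.

make_seed_dir() {
  local dir="$1"
  mkdir -p "$dir"
  # One URL per line.
  printf '%s\n' "http://example.com/" > "$dir/seed.txt"
}

# Use a temporary location so the sketch does not touch existing files.
seed_dir="$(mktemp -d)/urls"
make_seed_dir "$seed_dir"

# Print the command that would inject these seeds.
echo "nutch inject $seed_dir -crawlId demo"
```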



nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]


[-topN N]: select only the top N links; the default is Long.MAX_VALUE


[-noNorm]: do not activate the normalizer plugin to normalize URLs; the default is true

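[-adddays numDays]: adds <numDays> to the current time, so that URLs already fetched become due for crawling sooner than db.default.fetch.interval; the default is 0. URLs whose fetch time falls before that time are selected.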
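To show how these options combine, here is a small sketch that builds a generate command line and prints it. The helper name `build_generate_cmd` and the values `1000`, `demo`, and `7` are hypothetical; only flags from the usage line above are used.

```shell
#!/usr/bin/env bash
# Sketch: build a "nutch generate" command line from the documented options.
# build_generate_cmd and all values are illustrative assumptions.

build_generate_cmd() {
  local top_n="$1" crawl_id="$2" add_days="${3:-0}"
  local cmd="nutch generate -topN $top_n -crawlId $crawl_id"
  # -adddays defaults to 0, so pass it only when a shift is requested.
  if [ "$add_days" -gt 0 ]; then
    cmd="$cmd -adddays $add_days"
  fi
  echo "$cmd"
}

build_generate_cmd 1000 demo      # limit the batch to the top 1000 links
build_generate_cmd 1000 demo 7    # also treat URLs due within 7 days as due now
```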

nutch fetch

Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]


[-crawlId <id>]:

[-threads N]: number of fetcher threads to run; the default comes from the configuration key fetcher.threads.fetch, which is 10


[-numTasks N]: if N > 0, use N reduce tasks for fetching (default:

nutch parse

Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]


[-crawlId <id>]:



nutch updatedb

Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
<batchId> - crawl identifier returned by Generator, or -all for all generated batchId-s
-crawlId <id> - the id to prefix the schemas to operate on,


nutch index

Usage: IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
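Taken together, the subcommands above form one crawl round: generate, fetch, parse, updatedb, index. The following is a minimal sketch of such a round, not the logic of the bundled crawl script. The crawl ID `demo`, the `-topN` value, and the `DRY_RUN` switch are assumptions; by default the commands are only printed, and `-all` is used instead of a specific batch ID.

```shell
#!/usr/bin/env bash
# Sketch: one crawl round using the subcommands listed above.
# DRY_RUN defaults to 1, so commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (requires nutch on the PATH).

CRAWL_ID="${CRAWL_ID:-demo}"   # hypothetical crawl ID
DRY_RUN="${DRY_RUN:-1}"

run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

one_round() {
  # Stop the round at the first failing step.
  run nutch generate -topN 1000 -crawlId "$CRAWL_ID" || return 1
  run nutch fetch -all -crawlId "$CRAWL_ID" || return 1
  run nutch parse -all -crawlId "$CRAWL_ID" || return 1
  run nutch updatedb -all -crawlId "$CRAWL_ID" || return 1
  run nutch index -all -crawlId "$CRAWL_ID" || return 1
}

one_round
```

Calling `one_round` in a loop gives several rounds, which is roughly what the `<numberOfRounds>` argument of crawl expresses.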



posted @ 2015-08-05 12:42  HuijunZhang