hbase-writer

http://code.google.com/p/hbase-writer/

 

What is HBase-Writer?

HBase-Writer is an extension to Heritrix, the open source crawler written by the Internet Archive (http://crawler.archive.org/). It enables Heritrix to store crawled content directly in HBase tables (http://hbase.org/) running on the Hadoop Distributed File System (http://hadoop.apache.org/core/). HBase-Writer writes crawled content into a given HBase table as individual records, each stored under its own rowkey. In turn, these tables are directly supported by the MapReduce framework via HBase and Hadoop. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.

News

March 29th, 2010

HBase-Writer 0.9-SNAPSHOT has now been released. This version is compatible with both Heritrix 2.X and Heritrix 3.X. Many thanks to Greg Lu for spearheading this effort and sending in the initial patch. Once Heritrix has an official 3.0.0-RELEASE, HBase-Writer will release version 0.9-RELEASE. Thanks again Greg!

October 18th, 2009

HBase-Writer 0.20.3 has now been released. Thanks to the patch sent in by Joost Ouwerkerk, a major bug was discovered and fixed: replayInputStream was not being closed properly after HBase-Writer wrote the raw content to the HBase table. As a result, the number of open stream objects in memory grew steadily as the crawler ran. This has been fixed by explicitly closing the stream object. Thanks again Joost!
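The kind of leak described above is usually avoided by closing the stream in a finally block, so that it is released whether or not the read succeeds. The following is only a minimal sketch of that pattern, not HBase-Writer's actual code: the class StreamCloseSketch, the method readAndClose, and the helper TrackingStream are hypothetical names, and a plain java.io.InputStream stands in for replayInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not taken from the HBase-Writer source.
class StreamCloseSketch {

    // Reads the whole stream into memory, then closes it.
    // The finally block runs even if read() throws, so the stream never stays open.
    static byte[] readAndClose(InputStream in) throws IOException {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        } finally {
            in.close();
        }
    }

    // Helper stream that records whether close() was called (for demonstration only).
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream =
                new TrackingStream("crawled content".getBytes(StandardCharsets.UTF_8));
        byte[] content = readAndClose(stream);
        System.out.println("bytes read: " + content.length);
        System.out.println("stream closed: " + stream.closed);
    }
}
```

On Java 7 and later, a try-with-resources statement would achieve the same effect with less code; the explicit finally form is shown here because it makes the close step visible.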

October 14th, 2009

HBase-Writer 0.20.2 has now been released. A new runtime jar dependency has been added: zookeeper.jar. This is because HBase 0.20.X now depends on the ZooKeeper service layer to manage connections to the HBase master. This change is significant on the client side: you no longer need the HBase master address to access your HBase tables, but instead need the list of hosts serving in the ZooKeeper quorum. The new input is a comma-separated list of these ZooKeeper quorum hosts, set in the global sheet for the HBaseWriter Processor configuration. These hosts are analogous to the values set for the "hbase.zookeeper.quorum" property in hbase-site.xml. I have also added support for the HBase property "hbase.zookeeper.property.clientPort", in case you run your ZooKeeper quorum on a non-standard port. Enjoy.
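To make the two settings concrete, the sketch below turns a comma-separated host list and a client port into the two property keys named above. It uses only java.util.Properties and does not call the HBase or Heritrix APIs; the class ZkQuorumConfig, its methods, and the example host names are hypothetical. The default of 2181 is ZooKeeper's standard client port.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical illustration; not part of HBase-Writer or HBase.
class ZkQuorumConfig {

    // ZooKeeper's standard client port.
    static final int DEFAULT_CLIENT_PORT = 2181;

    // Splits a comma-separated host list, trimming whitespace and skipping empty entries.
    static List<String> parseQuorum(String commaSeparatedHosts) {
        List<String> hosts = new ArrayList<>();
        for (String part : commaSeparatedHosts.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        if (hosts.isEmpty()) {
            throw new IllegalArgumentException("at least one quorum host is required");
        }
        return hosts;
    }

    // Builds the two client-side properties mentioned in the release note.
    static Properties toClientProperties(String commaSeparatedHosts, int clientPort) {
        Properties props = new Properties();
        props.setProperty("hbase.zookeeper.quorum",
                String.join(",", parseQuorum(commaSeparatedHosts)));
        props.setProperty("hbase.zookeeper.property.clientPort",
                Integer.toString(clientPort));
        return props;
    }

    public static void main(String[] args) {
        // Example host names are placeholders.
        Properties props = toClientProperties(
                "zk1.example.com, zk2.example.com,zk3.example.com", DEFAULT_CLIENT_PORT);
        System.out.println(props.getProperty("hbase.zookeeper.quorum"));
        System.out.println(props.getProperty("hbase.zookeeper.property.clientPort"));
    }
}
```

Normalizing the list before storing it means stray spaces or a trailing comma in the configured value do not end up in the quorum setting.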

 

posted @ 2010-04-27 21:40 by searchDM