Hama: BSP and Graph Tutorial
1. BSP
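Hama provides a pure BSP (Bulk Synchronous Parallel) model that supports message passing and global communication. A BSP program consists of a sequence of supersteps, and each superstep has three parts:
1) Local computation
2) Process communication
3) Barrier synchronization
For many scientific computing problems, the BSP model makes it possible to write high-performance parallel algorithms.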
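As a rough illustration of the three phases, the following plain-Java sketch simulates a single superstep with threads instead of Hama peers. It does not use the Hama API: the class name SuperstepSketch, the method sumOfSquares, and the choice of CyclicBarrier as the barrier are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Hypothetical illustration only: BSP peers are simulated with threads, not with Hama.
final class SuperstepSketch {

  /** Each simulated peer squares its input, sends it to every peer, then sums what it received. */
  static long[] sumOfSquares(long[] inputs) {
    final int n = inputs.length;
    final List<Queue<Long>> inboxes = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      inboxes.add(new ConcurrentLinkedQueue<>());
    }
    final CyclicBarrier barrier = new CyclicBarrier(n);
    final long[] results = new long[n];
    Thread[] peers = new Thread[n];

    for (int p = 0; p < n; p++) {
      final int id = p;
      peers[p] = new Thread(() -> {
        try {
          long local = inputs[id] * inputs[id];  // 1) local computation
          for (Queue<Long> inbox : inboxes) {
            inbox.add(local);                    // 2) process communication
          }
          barrier.await();                       // 3) barrier synchronization
          long sum = 0;                          // all messages have arrived by now
          for (long v : inboxes.get(id)) {
            sum += v;
          }
          results[id] = sum;
        } catch (Exception e) {
          throw new IllegalStateException(e);
        }
      });
      peers[p].start();
    }
    for (Thread t : peers) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return results;
  }

  public static void main(String[] args) {
    // prints [14, 14, 14] because 1 + 4 + 9 = 14 on every peer
    System.out.println(Arrays.toString(sumOfSquares(new long[] {1, 2, 3})));
  }
}
```

No peer reads its inbox before the barrier, so every peer sees the complete set of messages sent during the superstep.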
To write your own BSP class, extend org.apache.hama.bsp.BSP.
The subclass must implement the following method:
public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer) throws IOException, SyncException, InterruptedException;
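Every BSP program consists of a series of supersteps, but the bsp() method is called only once, which is different from MapReduce. You can optionally implement setup() and cleanup() to do extra processing before and after the computation. cleanup() is intended to run when the computation finishes and also when it fails.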
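The call order described here can be sketched in plain Java. This is not Hama code: LifecycleSketch, its Task class, and the run() and example() helpers are hypothetical names, and the try/finally only models the idea that cleanup() runs both after success and after failure.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a tiny stand-in for a BSP task runner, not Hama code.
final class LifecycleSketch {

  /** Minimal task shape: setup() and cleanup() are optional, bsp() is required. */
  abstract static class Task {
    void setup(List<String> log) { }
    abstract void bsp(List<String> log);
    void cleanup(List<String> log) { }
  }

  /** Calls setup() once, bsp() once, and cleanup() even if bsp() throws. */
  static List<String> run(Task task) {
    List<String> log = new ArrayList<>();
    try {
      task.setup(log);
      task.bsp(log);
    } catch (RuntimeException e) {
      log.add("failed: " + e.getMessage());
    } finally {
      task.cleanup(log);
    }
    return log;
  }

  /** Example task: a single bsp() call loops over three supersteps and can be told to fail. */
  static Task example(final boolean fail) {
    return new Task() {
      @Override
      void setup(List<String> log) {
        log.add("setup");
      }

      @Override
      void bsp(List<String> log) {
        for (int step = 0; step < 3; step++) {
          log.add("superstep " + step);
          if (fail && step == 1) {
            throw new IllegalStateException("boom");
          }
        }
      }

      @Override
      void cleanup(List<String> log) {
        log.add("cleanup");
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(run(example(false)));
    System.out.println(run(example(true)));
  }
}
```

All three supersteps happen inside one bsp() call, which is the contrast with MapReduce drawn above.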
Configuring the job:
HamaConfiguration conf = new HamaConfiguration();
BSPJob job = new BSPJob(conf, MyBSP.class);
job.setJobName("My BSP program");
job.setBspClass(MyBSP.class);
job.setInputFormat(NullInputFormat.class);
job.setOutputKeyClass(Text.class);
...
job.waitForCompletion(true);
User Interface
Input and Output
When configuring a BSPJob, the input and output paths are set as follows:
job.setInputPath(new Path("/tmp/sequence.dat"));
job.setInputFormat(org.apache.hama.bsp.SequenceFileInputFormat.class);
or,
SequenceFileInputFormat.addInputPath(job, new Path("/tmp/sequence.dat"));
or,
SequenceFileInputFormat.addInputPaths(job, "/tmp/seq1.dat,/tmp/seq2.dat,/tmp/seq3.dat");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormat(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/tmp/result"));
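Any one of the three forms above can be used to set the input.
Next comes reading the input and writing the output. The bsp() method takes a BSPPeer as its argument; the BSPPeer provides the communication, counter, and I/O interfaces. The following code reads a file: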
@Override
public final void bsp(
    BSPPeer<LongWritable, Text, Text, LongWritable, Text> peer)
    throws IOException, InterruptedException, SyncException {
  // this method reads the next key value record from file
  KeyValuePair<LongWritable, Text> pair = peer.readNext();

  // the following lines do the same:
  LongWritable key = new LongWritable();
  Text value = new Text();
  peer.readNext(key, value);

  // write
  peer.write(value, key);
}
The input can also be read repeatedly:
for (int i = 0; i < 5; i++) {
  LongWritable key = new LongWritable();
  Text value = new Text();
  while (peer.readNext(key, value)) {
    // read everything
  }
  // reopens the input
  peer.reopenInput();
}
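Communication:
| Method | Description |
| send(String peerName, M msg) | Sends a message to another peer |
| getCurrentMessage() | Returns a received message |
| getNumCurrentMessages() | Returns the number of received messages |
| sync() | Barrier synchronization |
| getPeerName() | Returns the name of this peer |
| getAllPeerNames() | Returns the names of all peers |
| getSuperstepCount() | Returns the number of supersteps |
These methods are quite flexible. The following code sends a message to every peer: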
@Override
public void bsp(
    BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, Text> peer)
    throws IOException, SyncException, InterruptedException {
  for (String peerName : peer.getAllPeerNames()) {
    // Text has no (String, long) constructor, so the timestamp is appended to the string
    peer.send(peerName,
        new Text("Hello from " + peer.getPeerName() + " at " + System.currentTimeMillis()));
  }
  peer.sync();
}
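Synchronization:
Once all processes have entered the barrier, the next superstep begins. Note that calling sync() does not end the BSP job. As noted above, the communication methods are flexible; for example, you can call sync() inside a for loop to control the order of the iterations.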
@Override
public void bsp(
    BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, Text> peer)
    throws IOException, SyncException, InterruptedException {
  for (int i = 0; i < 100; i++) {
    // send some messages
    peer.sync();
  }
}
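To show why a loop around sync() gives control over the iteration order, here is a small sequential simulation in plain Java. It is only a sketch and does not use Hama: RingMaxSketch and propagateMax are hypothetical names, the ring topology is an assumption, and the barrier is modeled by delivering all messages of a superstep before any peer reads them.

```java
import java.util.Arrays;

// Hypothetical illustration only: supersteps are simulated sequentially, not with Hama.
final class RingMaxSketch {

  /** After n - 1 supersteps on a ring of n peers, every peer knows the global maximum. */
  static int[] propagateMax(int[] initial) {
    int n = initial.length;
    int[] value = initial.clone();
    for (int step = 0; step < n - 1; step++) {
      int[] inbox = new int[n];
      // communication: every peer sends its current value to its right neighbor
      for (int p = 0; p < n; p++) {
        inbox[(p + 1) % n] = value[p];
      }
      // barrier: only now, with all messages delivered, do the peers read their inbox
      for (int p = 0; p < n; p++) {
        value[p] = Math.max(value[p], inbox[p]);
      }
    }
    return value;
  }

  public static void main(String[] args) {
    // prints [9, 9, 9, 9]
    System.out.println(Arrays.toString(propagateMax(new int[] {3, 9, 1, 4})));
  }
}
```

Each pass of the outer loop corresponds to one superstep, so a message sent in iteration i is consumed in iteration i and influences what is sent in iteration i + 1.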
Finally, here is a complete example that estimates the value of Pi:
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.HamaConfiguration;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPJob;
import org.apache.hama.bsp.BSPJobClient;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.ClusterStatus;
import org.apache.hama.bsp.FileOutputFormat;
import org.apache.hama.bsp.NullInputFormat;
import org.apache.hama.bsp.TextOutputFormat;
import org.apache.hama.bsp.sync.SyncException;

public class PiEstimator {
  private static Path TMP_OUTPUT = new Path("/tmp/pi-" + System.currentTimeMillis());

  public static class MyEstimator extends
      BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> {
    public static final Log LOG = LogFactory.getLog(MyEstimator.class);
    private String masterTask;
    private static final int iterations = 10000;

    @Override
    public void bsp(
        BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
        throws IOException, SyncException, InterruptedException {
      // Count the random points that fall inside the unit circle
      int in = 0;
      for (int i = 0; i < iterations; i++) {
        double x = 2.0 * Math.random() - 1.0, y = 2.0 * Math.random() - 1.0;
        if ((Math.sqrt(x * x + y * y) < 1.0)) {
          in++;
        }
      }

      // Send the local estimate to the master task
      double data = 4.0 * in / iterations;
      peer.send(masterTask, new DoubleWritable(data));
      peer.sync();
    }

    @Override
    public void setup(
        BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
        throws IOException {
      // Choose one as a master
      this.masterTask = peer.getPeerName(peer.getNumPeers() / 2);
    }

    @Override
    public void cleanup(
        BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
        throws IOException {
      // Only the master averages the estimates it received
      if (peer.getPeerName().equals(masterTask)) {
        double pi = 0.0;
        int numPeers = peer.getNumCurrentMessages();
        DoubleWritable received;
        while ((received = peer.getCurrentMessage()) != null) {
          pi += received.get();
        }

        pi = pi / numPeers;
        peer.write(new Text("Estimated value of PI is"), new DoubleWritable(pi));
      }
    }
  }

  static void printOutput(HamaConfiguration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FileStatus[] files = fs.listStatus(TMP_OUTPUT);
    for (int i = 0; i < files.length; i++) {
      if (files[i].getLen() > 0) {
        FSDataInputStream in = fs.open(files[i].getPath());
        IOUtils.copyBytes(in, System.out, conf, false);
        in.close();
        break;
      }
    }

    fs.delete(TMP_OUTPUT, true);
  }

  public static void main(String[] args) throws InterruptedException,
      IOException, ClassNotFoundException {
    // BSP job configuration
    HamaConfiguration conf = new HamaConfiguration();

    BSPJob bsp = new BSPJob(conf, PiEstimator.class);
    // Set the job name
    bsp.setJobName("Pi Estimation Example");
    bsp.setBspClass(MyEstimator.class);
    bsp.setInputFormat(NullInputFormat.class);
    bsp.setOutputKeyClass(Text.class);
    bsp.setOutputValueClass(DoubleWritable.class);
    bsp.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(bsp, TMP_OUTPUT);

    BSPJobClient jobClient = new BSPJobClient(conf);
    ClusterStatus cluster = jobClient.getClusterStatus(true);

    if (args.length > 0) {
      bsp.setNumBspTask(Integer.parseInt(args[0]));
    } else {
      // Set to maximum
      bsp.setNumBspTask(cluster.getMaxTasks());
    }

    long startTime = System.currentTimeMillis();
    if (bsp.waitForCompletion(true)) {
      printOutput(conf);
      System.out.println("Job Finished in "
          + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
    }
  }
}
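The arithmetic behind the example can be tried without a cluster. The sketch below is plain Java and not part of Hama: PiAverageSketch, estimateOnPeer, and estimate are hypothetical names, and the seeded java.util.Random (instead of Math.random()) is an assumption made so that runs are repeatable. Each simulated peer produces a local estimate, and the "master" averages them, as cleanup() does above.

```java
import java.util.Random;

// Hypothetical illustration only: the peers are simulated in a loop, not run by Hama.
final class PiAverageSketch {

  /** One peer's local estimate: 4 * (points inside the unit circle) / (points sampled). */
  static double estimateOnPeer(int iterations, Random random) {
    int in = 0;
    for (int i = 0; i < iterations; i++) {
      double x = 2.0 * random.nextDouble() - 1.0;
      double y = 2.0 * random.nextDouble() - 1.0;
      if (x * x + y * y < 1.0) {
        in++;
      }
    }
    return 4.0 * in / iterations;
  }

  /** The master's step: average the local estimates received from all peers. */
  static double estimate(int numPeers, int iterations, long seed) {
    double sum = 0.0;
    for (int p = 0; p < numPeers; p++) {
      sum += estimateOnPeer(iterations, new Random(seed + p));
    }
    return sum / numPeers;
  }

  public static void main(String[] args) {
    System.out.println("Estimated value of PI is " + estimate(8, 100000, 42L));
  }
}
```

The points are drawn uniformly from the square [-1, 1) x [-1, 1), whose area is 4, and the unit circle has area Pi, so the fraction of points inside the circle approximates Pi / 4.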
2. Graph
Hama provides a Graph package for vertex-centric graph computation, so that Google Pregel-style applications can be implemented with very little code.
Vertex API
Writing a Hama Graph application means subclassing the predefined Vertex class. Its template arguments cover three types: vertices, edges, and messages:
public abstract class Vertex<V extends Writable, E extends Writable, M extends Writable>
    implements VertexInterface<V, E, M> {
  public abstract void compute(Iterator<M> messages) throws IOException;
  ..
}
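The user overrides compute(), which is executed on every active vertex in each superstep. compute() can query information about the current vertex and its edges, and send messages to other vertices.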
VertexReader API
To read your own file format, create a VertexReader by extending org.apache.hama.graph.VertexInputReader, for example:
public static class PagerankTextReader extends
    VertexInputReader<LongWritable, Text, Text, NullWritable, DoubleWritable> {

  /**
   * Input file format. The text file essentially should look like: <br/>
   * VERTEX_ID\t(n-tab separated VERTEX_IDs)<br/>
   * E.G:<br/>
   * 1\t2\t3\t4<br/>
   * 2\t3\t1<br/>
   * etc.
   * <p>
   * Parses one vertex. As in Hadoop, the input is read one line at a time.
   * Each line is split on tab characters into a String array, which is then
   * converted into a Vertex instance: the first element is the vertex ID and
   * the remaining elements are its outgoing edges.
   */
  @Override
  public boolean parseVertex(LongWritable key, Text value,
      Vertex<Text, NullWritable, DoubleWritable> vertex) throws Exception {
    String[] split = value.toString().split("\t");
    for (int i = 0; i < split.length; i++) {
      if (i == 0) {
        vertex.setVertexID(new Text(split[i]));
      } else {
        vertex
            .addEdge(new Edge<Text, NullWritable>(new Text(split[i]), null));
      }
    }
    return true;
  }
}
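The line format handled by the reader can be checked in isolation. The following plain-Java sketch repeats only the string handling and does not depend on Hama or Hadoop: VertexLineSketch, ParsedVertex, and parse are hypothetical names, and plain strings stand in for Text and Edge.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: plain strings stand in for Hama's Vertex and Edge.
final class VertexLineSketch {

  /** Result of parsing one line: a vertex ID and the IDs of its neighbors. */
  static final class ParsedVertex {
    final String id;
    final List<String> neighbors;

    ParsedVertex(String id, List<String> neighbors) {
      this.id = id;
      this.neighbors = neighbors;
    }
  }

  /** Splits a line on tabs: the first field is the vertex ID, the rest are edge targets. */
  static ParsedVertex parse(String line) {
    String[] split = line.split("\t");
    List<String> neighbors = new ArrayList<>();
    for (int i = 1; i < split.length; i++) {
      neighbors.add(split[i]);
    }
    return new ParsedVertex(split[0], neighbors);
  }

  public static void main(String[] args) {
    ParsedVertex v = parse("1\t2\t3\t4");
    // prints 1 -> [2, 3, 4]
    System.out.println(v.id + " -> " + v.neighbors);
  }
}
```

A line that contains only a vertex ID yields a vertex with no outgoing edges.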
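A PageRank example follows. In superstep 0 every vertex starts with the value 1/N, where N is the number of vertices. In every later superstep it sums the incoming messages and sets its value to (1 - 0.85)/N + 0.85 * sum. As long as the maximum number of iterations has not been reached, it sends its value divided by its number of outgoing edges to all neighbors: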
public static class PageRankVertex extends
    Vertex<Text, NullWritable, DoubleWritable> {

  @Override
  public void compute(Iterator<DoubleWritable> messages) throws IOException {
    if (this.getSuperstepCount() == 0) {
      this.setValue(new DoubleWritable(1.0 / (double) this.getNumVertices()));
    }

    if (this.getSuperstepCount() >= 1) {
      double sum = 0;
      while (messages.hasNext()) {
        DoubleWritable msg = messages.next();
        sum += msg.get();
      }

      double ALPHA = (1 - 0.85) / (double) this.getNumVertices();
      this.setValue(new DoubleWritable(ALPHA + (0.85 * sum)));
    }

    if (this.getSuperstepCount() < this.getMaxIteration()) {
      int numEdges = this.getOutEdges().size();
      sendMessageToNeighbors(new DoubleWritable(this.getValue().get()
          / numEdges));
    }
  }
}
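The update rule in compute() can be followed on a small graph with a plain-Java simulation. This sketch does not use the Hama Graph API: PageRankSketch and run are hypothetical names, vertices are array indices, and it assumes that every vertex has at least one outgoing edge.

```java
import java.util.Arrays;

// Hypothetical illustration only: supersteps are simulated with arrays, not with Hama.
final class PageRankSketch {
  static final double DAMPING = 0.85;

  /** outEdges[v] lists the targets of v's outgoing edges (assumed non-empty for every v). */
  static double[] run(int[][] outEdges, int maxIterations) {
    int n = outEdges.length;
    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n);                       // superstep 0: initial value 1/N

    for (int step = 1; step <= maxIterations; step++) {
      double[] inbox = new double[n];
      // messages sent in the previous superstep: rank divided by the number of out-edges
      for (int v = 0; v < n; v++) {
        double share = rank[v] / outEdges[v].length;
        for (int target : outEdges[v]) {
          inbox[target] += share;
        }
      }
      // this superstep: every vertex sums its messages and applies the damping factor
      for (int v = 0; v < n; v++) {
        rank[v] = (1 - DAMPING) / n + DAMPING * inbox[v];
      }
    }
    return rank;
  }

  public static void main(String[] args) {
    int[][] graph = { {1, 2}, {2}, {0} };             // edges 0->1, 0->2, 1->2, 2->0
    System.out.println(Arrays.toString(run(graph, 30)));
  }
}
```

Under the stated assumption the ranks keep summing to 1, because each superstep redistributes 85% of the total rank along the edges and spreads the remaining 15% evenly.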
References:
1. http://hama.apache.org/hama_bsp_tutorial.html
2. http://hama.apache.org/hama_graph_tutorial.html
Please keep this link when reposting: http://www.cnblogs.com/Deron/archive/2013/06/09/3128135.html