Who Is the Protagonist of 天龙八部 (Demi-Gods and Semi-Devils)? (MapReduce Word-Frequency Count)

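天龙八部 mainly follows three characters: 段誉 (Duan Yu), 萧峰 (Xiao Feng), and 虚竹 (Xuzhu). So which of them is the real protagonist? For this exercise, let's simply assume that whoever appears most often in the novel is the protagonist.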

The experiment runs in a Linux environment.

First, download the text of 天龙八部:
wget http://labfile.oss.aliyuncs.com/hadoop/tlbbtestfile.txt
Install the jieba word segmenter:
sudo pip install jieba
Upload the text to HDFS as /tlbb.txt:
hdfs dfs -put tlbbtestfile.txt /tlbb.txt

# Create a directory for the code
mkdir tlbbwordcount
# Create the Mapper script
touch tlbbwordcount/mapper.py
# Create the Reducer script
touch tlbbwordcount/reducer.py
# Make all Python scripts executable
chmod a+x tlbbwordcount/*.py

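Mapper program:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written to run under both Python 2 and Python 3.
import io
import sys

# Import the jieba word-segmentation module
import jieba

# Read and write UTF-8 explicitly instead of relying on the locale
stdin = io.open(sys.stdin.fileno(), encoding="utf-8", closefd=False)
stdout = io.open(sys.stdout.fileno(), "w", encoding="utf-8", closefd=False)

# Read each line from standard input
for line in stdin:
    # Segment the line into words with jieba
    for word in jieba.cut(line.strip()):
        # Skip whitespace-only tokens, which would produce empty keys
        if not word.strip():
            continue
        # Map every word to a (word, 1) pair and write it to stdout
        stdout.write(u"%s\t1\n" % word)

stdout.flush()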
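jieba.cut also yields punctuation marks as tokens, so a raw count will include entries such as "，" and "。" alongside real words. The following is a minimal Python 3 sketch of a filter the mapper could apply before emitting a pair. The helper names is_countable and filter_tokens are hypothetical (they are not part of jieba), and the rule itself, keeping only tokens that contain at least one letter or digit, is an assumption rather than something the original lab prescribes.

```python
import unicodedata


def is_countable(token):
    """Hypothetical helper: True if the token has at least one letter or digit.

    CJK ideographs fall in Unicode category "Lo", so Chinese words pass,
    while punctuation ("P*") and spaces ("Z*") do not.
    """
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)


def filter_tokens(tokens):
    """Keep only the tokens worth counting, preserving their order."""
    return [t for t in tokens if is_countable(t)]


if __name__ == "__main__":
    # Hand-written token list standing in for jieba.cut() output
    sample = ["段誉", "，", " ", "萧峰", "。", "虚竹", "……"]
    print(filter_tokens(sample))
```

In the mapper, the same check could sit next to the whitespace test, so that only tokens passing is_countable are written out.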

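Reducer program:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written to run under both Python 2 and Python 3.
import io
import sys

# Read and write UTF-8 explicitly instead of relying on the locale
stdin = io.open(sys.stdin.fileno(), encoding="utf-8", closefd=False)
stdout = io.open(sys.stdout.fileno(), "w", encoding="utf-8", closefd=False)

# The word currently being accumulated and its running count
current_word, current_count = None, 0

# Read each line from standard input
for line in stdin:
    try:
        # Each line is a (word, count) pair separated by a tab
        word, count = line.rstrip("\n").split("\t", 1)
        count = int(count)
    except ValueError:
        # Skip malformed lines
        continue

    if word == current_word:
        # Same word as the previous line: add to its count
        current_count += count
    else:
        # A new word starts: output the finished word and its count
        if current_word is not None:
            stdout.write(u"%s\t%d\n" % (current_word, current_count))
        current_word, current_count = word, count

# Output the last word, which the loop above has not written yet
if current_word is not None:
    stdout.write(u"%s\t%d\n" % (current_word, current_count))

stdout.flush()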
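The reducer only compares each word with the previous one, which works because Hadoop sorts the mapper output by key before handing it to the reducer, so identical words arrive on consecutive lines. As an illustration of that dependency, here is a small Python 3 sketch that simulates map, sort, and reduce in memory with itertools.groupby. The function names reduce_counts and simulate are made up for this example, and the token list stands in for real jieba output.

```python
from itertools import groupby


def reduce_counts(sorted_pairs):
    """Sum the counts of consecutive pairs that share a key.

    Like the streaming reducer, this is only correct when the input
    is already sorted (or at least grouped) by key.
    """
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def simulate(tokens):
    """Map each token to (token, 1), sort by key, then reduce."""
    mapped = [(token, 1) for token in tokens]
    return list(reduce_counts(sorted(mapped)))


if __name__ == "__main__":
    print(simulate(["段誉", "萧峰", "段誉", "虚竹", "段誉"]))
    # Without the sort step, the same key can be emitted more than once:
    print(list(reduce_counts([("段誉", 1), ("萧峰", 1), ("段誉", 1)])))
```

The second call shows why the sort matters: fed unsorted pairs, the reducer logic reports 段誉 twice with a count of 1 each instead of once with a count of 2.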

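Run the job. The command refers to mapper.py and reducer.py by bare name, so run it from inside the tlbbwordcount directory. On a multi-node cluster the scripts may also have to be shipped to the task nodes, for example with the streaming -file option.

hadoop jar /opt/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /tlbb.txt \
    -output tlbbout \
    -jobconf mapred.map.tasks=4 \
    -jobconf mapred.reduce.tasks=2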

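Result: the job writes its output to the tlbbout directory on HDFS (the -output path of the job), one word and its count per line, separated by a tab.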
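To answer the original question, the counts for the three names have to be picked out of the full word list. Below is a minimal Python 3 sketch of that step, assuming the output has been copied to a local file first (for instance with hdfs dfs -getmerge tlbbout result.txt, where result.txt is a hypothetical file name). The functions parse_counts and rank_names are illustrative names, and the sample lines are made up to show the format; they are not results from the novel.

```python
def parse_counts(lines):
    """Parse "word<TAB>count" lines into a dict, skipping malformed lines."""
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        try:
            counts[parts[0]] = counts.get(parts[0], 0) + int(parts[1])
        except ValueError:
            continue
    return counts


def rank_names(counts, names):
    """Return (name, count) pairs for the given names, most frequent first."""
    pairs = [(name, counts.get(name, 0)) for name in names]
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample lines for illustration only, not real job output
    sample = ["段誉\t3\n", "萧峰\t2\n", "虚竹\t1\n", "的\t10\n"]
    print(rank_names(parse_counts(sample), ["段誉", "萧峰", "虚竹"]))
```

With real output, the sample list would be replaced by the lines of the merged file, opened with encoding="utf-8". Note that a character who goes by more than one name in the text (萧峰 is also called 乔峰 for much of the novel) is counted separately under each name, so the names may need to be summed before comparing.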

Lab link:

https://www.shiyanlou.com/courses/40/labs/305/document

Posted on 2017-11-20 11:20 by mycoding
