Daily Notes
1. Data collection (scrape at least 3,000 records spanning more than one week): (10 points)
(1) Source code:
# 胖虎Johnny
# 2020.11.10
import urllib.request
import json
import random
import time as time0
import re, os
import pandas as pd

# Pool of User-Agent strings to rotate between requests
agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

def product_reviews(product_id=None, p=0, maxPage=99):
    root_dir = '京东手机评论_详细字典'
    # Skip models that were already scraped (one model shares its reviews
    # across colors and memory sizes)
    os.makedirs(root_dir, exist_ok=True)
    phone_list = os.listdir(root_dir)
    phone_txt = str(product_id) + '.txt'
    if phone_txt in phone_list:
        print(product_id)
        return []
    # Crawl page by page; the response reports e.g. "maxPage": 45
    k_head = 0
    while p < maxPage:
        # Reviews across all SKUs of the brand:
        # url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}&score=0&sortType=5&page={}&pageSize={}&isShadowSku=0&rid=0&fold=1'
        # Reviews of the current product only:
        url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}&score=0&sortType=5&page={}&pageSize={}&isShadowSku=0&fold=1'
        url = url.format(product_id, p, 10)  # 10 reviews per page, as in the hand-tested URL
        # print(url)
        # Forge the request headers to pass as a real browser;
        # copy the cookie out of your own browser session
        cookie = ''
        headers = {
            'User-Agent': random.choice(agents),
            'Referer': 'https://item.jd.com/100016944073.html',
            'Cookie': cookie
        }
        # Send the request
        request = urllib.request.Request(url=url, headers=headers)
        time0.sleep(2.5)
        # Read the response
        try:
            content = urllib.request.urlopen(request).read().decode('gbk')
        except:
            print('Error fetching comment page %d' % p)
            p = p + 1
            continue
        # Strip the JSONP wrapper to leave plain JSON
        content = content.strip('fetchJSON_comment98vv995();')
        # Maximum number of comment pages
        try:
            maxPage = int(re.findall('"maxPage":(.*?),"', content, re.S)[0])
        except:
            pass
        try:
            obj = json.loads(content)
        except:
            print('Bad response, trying again!')
            print([content])
            print(url)
            continue
        comments = obj['comments']
        # Product review summary
        # productCommentSummary = obj['productCommentSummary']
        # dict_pars_info = {}
        # # Average score
        # dict_pars_info['平均分'] = str(productCommentSummary['averageScore'])
        # # Positive-review rate
        # dict_pars_info['好评率'] = str(productCommentSummary['goodRate'])
        # # Current total number of reviews
        # dict_pars_info['当前评论数'] = str(productCommentSummary['commentCount'])
        # # Number of default reviews
        # dict_pars_info['默认评论数'] = str(productCommentSummary['defaultGoodCount'])
        # # Follow-up, positive, neutral and negative review counts
        # dict_pars_info['追评数'] = str(productCommentSummary['afterCount'])
        # dict_pars_info['好评数'] = str(productCommentSummary['goodCount'])
        # dict_pars_info['中评数'] = str(productCommentSummary['generalCount'])
        # dict_pars_info['差评数'] = str(productCommentSummary['poorCount'])
        if len(comments) > 0:
            # print(comments)
            for comment in comments:
                # print(comment)
                name = comment['referenceName']
                id = comment['id']
                guid = comment['guid']
                content = comment['content']
                creationTime = comment['creationTime']
                isTop = comment['isTop']
                referenceTime = comment['referenceTime']
                firstCategory = comment['firstCategory']
                secondCategory = comment['secondCategory']
                thirdCategory = comment['thirdCategory']
                replyCount = comment['replyCount']
                score = comment['score']
                nickname = comment['nickname']
                userClient = comment['userClient']
                # userLevelName = comment['userLevelName']
                plusAvailable = comment['plusAvailable']
                productSales = comment['productSales']
                # userClientShow = comment['userClientShow']
                days = comment['days']
                afterDays = comment['afterDays']
                mobileVersion = comment['mobileVersion']
                try:
                    productColor = comment['productColor']
                except:
                    productColor = ''
                try:
                    productSize = comment['productSize']
                except:
                    productSize = ''
                likes = comment['usefulVoteCount']
                item = {
                    'name': name,
                    'id': id,
                    'guid': guid,
                    'content': content,
                    'creationTime': creationTime,
                    'isTop': isTop,
                    'referenceTime': referenceTime,
                    'firstCategory': firstCategory,
                    'secondCategory': secondCategory,
                    'thirdCategory': thirdCategory,
                    'replyCount': replyCount,
                    'score': score,
                    'nickname': nickname,
                    'userClient': userClient,
                    # 'userLevelName': userLevelName,
                    'plusAvailable': plusAvailable,
                    'productSales': productSales,
                    # 'userClientShow': userClientShow,
                    'days': days,
                    'afterDays': afterDays,
                    'mobileVersion': mobileVersion,
                    'productColor': productColor,
                    'productSize': productSize,
                    'likes': likes,
                }
                # item.update(dict_pars_info)
                # print(item)
                string = str(item)
                # 1. Append to a CSV file (header only on the first write)
                item_dataframe = pd.DataFrame([item])
                # print(item_dataframe)
                if k_head == 0:
                    item_dataframe.to_csv(root_dir + '/%d.csv' % product_id, mode='w',
                                          header=True, index=False, encoding='gbk')
                    k_head += 1
                else:
                    item_dataframe.to_csv(root_dir + '/%d.csv' % product_id, mode='a',
                                          header=False, index=False, encoding='gbk')
                # 2. Append to a txt file
                fp = open(root_dir + '/%d.txt' % product_id, 'a', encoding='gbk')
                fp.write(string + '\n')
                fp.close()
            print('%s-page---finish(%s/%s)' % (p, p, maxPage))
        else:
            return []
        p = p + 1

if __name__ == '__main__':
    phone_id = 100008348542
    product_reviews(product_id=phone_id)
(2) Screenshot of the data saved to a local file (showing a record count above 3,000 and the time span)
2. Data preprocessing: use MapReduce or Kettle to preprocess the source data, cleaning the large set of JSON files into a structured text file. (10 points)
(1) Screenshot of the result of removing duplicate records from the user review table;
Keep only the id field and delete the selected rows.
Open the "remove duplicate records" step; as in the previous step, get the fields, keep only the product ID, and delete the remaining fields.
In the content settings, set the separator to an ASCII comma "," and set the encoding to UTF-8/GBK.
(2) Screenshot of the file with the data fields extracted according to the cleaned data format. (A pandas sketch of the same cleaning steps follows below.)
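The same cleaning can be sanity-checked outside Kettle. Below is a minimal pandas sketch, assuming the CSV layout produced by the scraper in section 1; the input and output paths are placeholders, and the field list should be adjusted to the pinglun schema in section 3.

import pandas as pd

# Hypothetical paths; substitute the CSV produced by the scraper above
raw = pd.read_csv('100008348542.csv', encoding='gbk')

# Drop duplicate reviews by id (mirrors Kettle's "remove duplicate records" step)
deduped = raw.drop_duplicates(subset=['id'])

# Keep only the fields the cleaned format requires
# (adjust this list to match the pinglun schema in section 3)
fields = ['id', 'guid', 'content', 'creationTime', 'referenceTime',
          'score', 'nickname', 'days']
cleaned = deduped[fields]

# Comma-separated, no header row; pick UTF-8 or GBK to match the Hive side
cleaned.to_csv('file1.csv', index=False, header=False, encoding='utf-8')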
3. Data statistics: generate the Hive user review data: (15 points)
(1) Create a table in Hive named pinglun to hold the cleaned data (table-creation SQL statement), with a screenshot of the successful data import:
Create the database
Create the table
drop table if exists pinglun;
create table pinglun(
id string,
guid string,
content string,
creationTime string,
referenceTime string,
score string,
nickname string,
userlevelname string,
ismobile string,
userclientshow string,
days string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
);
Load the data
load data local inpath '/opt/module/hive/datas/file1.csv' into table pinglun;
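To confirm the load, here is a quick row-count check against HiveServer2 (a sketch assuming it runs on hadoop102:10000 and the database is named pinglun; PyHive is one client option, not part of the original toolchain):

from pyhive import hive  # pip install pyhive[hive]

# Assumed HiveServer2 endpoint and database name
conn = hive.Connection(host='hadoop102', port=10000,
                       username='root', database='pinglun')
cur = conn.cursor()
cur.execute('SELECT COUNT(*) FROM pinglun')
print('rows loaded:', cur.fetchone()[0])  # should exceed 3,000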
Requirement 1: analyze whether users buy on mobile or on PC, and the ratio of mobile to PC users; generate table ismobilehive to store the statistics. Table-creation SQL statement and screenshot of the successful import:
Create the result table
Create table ismobilehive(
buyloc string,
buysum string
) row format delimited fields terminated by ',';
Insert into table ismobilehive select p.ismobile ,count(p.ismobile) from pinglun p group by p.ismobile;
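The SQL above stores the raw counts; the mobile-to-PC ratio the requirement asks for can then be derived from ismobilehive, for example with a small PyHive sketch (same assumed endpoint as above):

from pyhive import hive

conn = hive.Connection(host='hadoop102', port=10000,
                       username='root', database='pinglun')
cur = conn.cursor()
cur.execute('SELECT buyloc, buysum FROM ismobilehive')
rows = cur.fetchall()
total = sum(int(n) for _, n in rows)
for loc, n in rows:
    # e.g. "1: 2710 (90.33%)" if 1 marks mobile purchases
    print('%s: %d (%.2f%%)' % (loc, int(n), 100.0 * int(n) / total))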
Requirement 2: analyze the review cycle (how long after receiving the goods users typically post a review); generate table dayssql to store the statistics. Table-creation SQL statement and screenshot of the successful import:
Create table dayssql(
days string,
dayssum string
)row format delimited fields terminated by ',';
Insert into table dayssql select p.days ,count(p.days) from pinglun p group by p.days;
Data display
Requirement 3: analyze member level (the level of the users buying this product); generate table userlevelname_out to store the statistics. Table-creation SQL statement and screenshot of the successful import:
Create table userlevelname_out (
userlevelname string,
userlevelnamesum string
)row format delimited fields terminated by ',';
Insert into table userlevelname_out select p.userlevelname,count(p.userlevelname) from pinglun p group by p.userlevelname;
Requirement 4: analyze the daily review volume; generate table creationtime_out to store the statistics. Table-creation SQL statement and screenshot of the successful import (note that the insert below reads from pinglun1, the date-normalized table built in Requirement 5, so that reviews group by calendar day):
drop table if exists creationtime_out;
Create table creationtime_out(
creationtime string,
creationtimesum string
) row format delimited fields terminated by ',';
Insert into table creationtime_out select p.creationtime,count(p.creationtime) from pinglun1 p group by p.creationtime;
Requirement 5: before/after screenshots of the table after date-format normalization
Create the table
drop table if exists pinglun1;
create table pinglun1(
id string,
guid string,
content string,
creationTime string,
referenceTime string,
score string,
nickname string,
userlevelname string,
ismobile string,
userclientshow string,
days string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
);
Normalize the date format
Insert into table pinglun1 select
p.id, p.guid, p.content, to_date(p.creationtime),
to_date(p.referenceTime), p.score, p.nickname, p.userLevelName, p.isMobile, p.userClientShow, p.days from pinglun p;
Data after normalization
Data before normalization
4. Use Sqoop to migrate the data to the MySQL database: (5 points)
Screenshots of the five Hive tables exported into five MySQL tables.
The ismobilehive table
bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table ismobilehive --export-dir /user/hive/warehouse/pinglun.db/ismobilehive --input-fields-terminated-by ',' --columns="buyloc,buysum";
The dayssql table
bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table dayssql --export-dir /user/hive/warehouse/pinglun.db/dayssql --input-fields-terminated-by ',' --columns="days,dayssum";
The userlevelname_out table
bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table userlevelname_out --export-dir /user/hive/warehouse/pinglun.db/userlevelname_out --input-fields-terminated-by ',' --columns="userlevelname,userlevelnamesum";
The creationtime_out table
bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table creationtime_out --export-dir /user/hive/warehouse/pinglun.db/creationtime_out --input-fields-terminated-by ',' --columns="creationtime,creationtimesum";
The pinglun1 table
bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table pinglun1 --export-dir /user/hive/warehouse/pinglun.db/pinglun1 --input-fields-terminated-by ',' ;
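After the five exports, row counts on the MySQL side can be compared against Hive. A minimal pymysql sketch, reusing the connection parameters that appear in the segmentation script in section 6:

import pymysql

conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                       passwd='000000', db='pinglun', charset='utf8')
cur = conn.cursor()
for table in ('ismobilehive', 'dayssql', 'userlevelname_out',
              'creationtime_out', 'pinglun1'):
    cur.execute('SELECT COUNT(*) FROM ' + table)
    print(table, cur.fetchone()[0])
conn.close()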
5. Data visualization: chart the results with JavaWeb + ECharts (20 points). (A sketch of the data feed behind the charts follows the list.)
Requirement 1: visualization screenshot
Requirement 2: visualization screenshot
Requirement 3: visualization screenshot
Requirement 4: visualization screenshot
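The JavaWeb layer essentially turns each two-column MySQL result table into the categories/values arrays that an ECharts bar or line chart consumes. That payload is illustrated here in Python (echarts_payload is a hypothetical helper, not part of the JavaWeb code):

import json
import pymysql

def echarts_payload(table):
    """Return the JSON an ECharts chart would consume for one result table."""
    conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                           passwd='000000', db='pinglun', charset='utf8')
    cur = conn.cursor()
    cur.execute('SELECT * FROM ' + table)
    rows = cur.fetchall()
    conn.close()
    return json.dumps({'categories': [r[0] for r in rows],
                       'values': [int(r[1]) for r in rows]},
                      ensure_ascii=False)

print(echarts_payload('ismobilehive'))       # Requirement 1
print(echarts_payload('dayssql'))            # Requirement 2
print(echarts_payload('userlevelname_out'))  # Requirement 3
print(echarts_payload('creationtime_out'))   # Requirement 4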
6. User review analysis via Chinese word segmentation. (20 points)
(1) This part analyzes the negative-review data in the product review table and filters out the points users complain about, so as to know both the product and its customers. (Screenshot of the filtered negative-review dataset)
(2) Use Python's jieba module to segment the Chinese review text and count word frequencies; (screenshot after segmentation)
Export the content field from MySQL to a file
After the export succeeds, run the Python script
Code
import jieba
import re
from collections import Counter
import pymysql

# Connect to MySQL; the word frequencies are written to table fencientent
conn = pymysql.Connect(host='hadoop102',
                       port=3306,
                       user='root',
                       passwd='000000',
                       db='pinglun',
                       charset='utf8')
cur = conn.cursor()
cut_words = ""
fileNameStr1 = r'D:\PythonProject\have2022data\pinglun1.csv'
for line in open(fileNameStr1, encoding='utf-8'):
    line = line.strip('\n')
    # Drop letters, digits and punctuation before segmenting
    line = re.sub("[A-Za-z0-9\:\·\—\,\。\“ \”]", "", line)
    seg_list = jieba.cut(line, cut_all=False)
    cut_words += (" ".join(seg_list))
all_words = cut_words.split()
print(all_words)
c = Counter()
for x in all_words:
    if len(x) > 1 and x != '\r\n':
        c[x] += 1
print('\nWord frequency results:')
for (k, v) in c.most_common():  # iterate words from most to least frequent
    into = "INSERT INTO fencientent(mainword,countnum) VALUES (%s,%s)"
    values = (k, v)
    cur.execute(into, values)
    conn.commit()
    print("%s:%d" % (k, v))
conn.close()
The segmented data displays successfully
(3) Create a word-frequency table in Hive and load the segmented data;
Requirements:
① segment the Chinese text of the user reviews;
② count the word frequencies after segmentation;
③ create a word-frequency table in Hive and load the segmented data;
Create the pingpin table
drop table if exists pingpin;
create table pingpin(
mainword string,
countnum string
)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
);
Import the data into Hive
First import it into HDFS
bin/sqoop import --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table fencientent --target-dir /user/hive/warehouse/pinglun.db/pingpin --fields-terminated-by ',';
Then load it into Hive
load data inpath '/user/hive/warehouse/pinglun.db/pingpin' into table pinglun.pingpin;
Import into Hive succeeded
④ Bar chart of the top ten word categories in users' negative reviews. (A plotting sketch follows the chaping tables below.)
Create the chaping table
drop table if exists chaping;
create table chaping(
id string,
guid string,
content string,
creationTime string,
referenceTime string,
score string,
nickname string,
userlevelname string,
ismobile string,
userclientshow string,
days string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
);
Build the negative-review table
Import the negative-review data
load data local inpath '/opt/module/hive/datas/file2.csv' into table chaping;
Create the chaping1 table
drop table if exists chaping1;
create table chaping1(
id string,
guid string,
content string,
creationTime string,
referenceTime string,
score string,
nickname string,
userlevelname string,
ismobile string,
userclientshow string,
days string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
);
Store the time-normalized data
Insert into table chaping1 select
p.id, p.guid, p.content, to_date(p.creationtime),
to_date(p.referenceTime), p.score, p.nickname, p.userLevelName, p.isMobile, p.userClientShow, p.days from chaping p;
Export for visualization
bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table chaping1 --export-dir /user/hive/warehouse/pinglun.db/chaping1 --input-fields-terminated-by ',' ;
Negative-review data
Export the negative-review comments
Segment the negative reviews
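For ④, the top ten can be read straight from the word-frequency table and plotted. A minimal matplotlib sketch, assuming the segmentation script above was run on the exported negative-review content so that fencientent holds the complaint vocabulary:

import pymysql
import matplotlib.pyplot as plt

conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                       passwd='000000', db='pinglun', charset='utf8')
cur = conn.cursor()
# countnum is stored as a string, so cast before sorting
cur.execute('SELECT mainword, countnum FROM fencientent '
            'ORDER BY CAST(countnum AS UNSIGNED) DESC LIMIT 10')
rows = cur.fetchall()
conn.close()

plt.rcParams['font.sans-serif'] = ['SimHei']  # a font with Chinese glyphs
plt.bar([r[0] for r in rows], [int(r[1]) for r in rows])
plt.title('Top 10 words in negative reviews')
plt.show()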
⑤ Word cloud of the segmented negative-review words.
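The word cloud can be generated from the same frequency table; here is a sketch using the third-party wordcloud package (not part of the original toolchain; font_path is an assumed local font file with Chinese glyphs):

import pymysql
from wordcloud import WordCloud  # pip install wordcloud

conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                       passwd='000000', db='pinglun', charset='utf8')
cur = conn.cursor()
cur.execute('SELECT mainword, countnum FROM fencientent')
freq = {word: int(num) for word, num in cur.fetchall()}
conn.close()

wc = WordCloud(font_path='simhei.ttf', background_color='white',
               width=800, height=600).generate_from_frequencies(freq)
wc.to_file('chaping_wordcloud.png')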
7. Real-time data analysis with Spark. (20 points)
This experiment targets JD product reviews. The architecture is crawler + Flume + Kafka + Spark Streaming + MySQL, giving dynamic, real-time collection, analysis and display of the data.
The workflow is shown in the figure below:
Screenshots of the steps
Start the Zookeeper service:
Start Kafka
Create the topic
List the topics
Create the configuration file under Flume's job directory
Start Flume
Write the script (a sketch of the log-writing side follows)
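A minimal sketch of what that script must produce, assuming Flume tails the log file with an exec source and ships each line to the Kafka topic "first"; the path is a placeholder. Each record is one line of twelve space-separated fields, matching the split(" ") in RealTimeApp below.

# Hypothetical log writer feeding the Flume exec source
def append_record(fields, path='/opt/module/data/ping.log'):
    assert len(fields) == 12  # Ping_log takes arr(0)..arr(11)
    # Strip inner spaces so split(" ") downstream stays aligned
    line = ' '.join(str(f).replace(' ', '') for f in fields)
    with open(path, 'a', encoding='utf-8') as fp:
        fp.write(line + '\n')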
Scraping in real time
Process the data
package show

import java.sql.Connection
import bean.Ping_log
import handle.PingHandler
import utils.{JdbcUtil, MyKafkaUtil, PropertiesUtil}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RealTimeApp {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("RealTimeApp").setMaster("local[2]")
    // Create the StreamingContext with a 3-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(3))
    // val topic: String = PropertiesUtil.load("config.properties").getProperty("kafka.topic")
    val topics = "first"
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] =
      MyKafkaUtil.getKafkaStream(topics, ssc)
    // 4. Convert each line into a case-class instance
    val adsLogDStream: DStream[Ping_log] = kafkaDStream.map(record => {
      // a. Take the value and split it on " "
      val arr: Array[String] = record.value().split(" ")
      Ping_log(arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6),
        arr(7), arr(8), arr(9), arr(10), arr(11))
    })
    // Count the daily review totals and save them to MySQL
    PingHandler.savePingCountToMysql(adsLogDStream)
    // 10. Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}
Database display
The table holds only the timestamp and the count
Real-time visualization