Daily Notes

 

1. Data collection (requirement: crawl at least 3,000 records spanning more than one week): (10 points)

(1) Source code:

 

# 胖虎Johnny
# 2020.11.10
import urllib.request
import json
import random
import time as time0
import re, os
import pandas as pd

# Pool of User-Agent strings to rotate between requests
agents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]


def product_reviews(product_id=None, p=0, maxPage=99):
    root_dir = '京东手机评论_详细字典'
    # Check whether this phone model's reviews were already crawled (variants of one model differ in color and memory, but share the same reviews)
    os.makedirs(root_dir, exist_ok=True)
    phone_list = os.listdir(root_dir)
    phone_txt = str(product_id) + '.txt'
    if phone_txt in phone_list:
        print(product_id)
        return []

    # Crawl the comments page by page
    # "maxPage": 45
    k_head = 0
    while p < maxPage:
        # Reviews across all variants of this product:
        # url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}&score=0&sortType=5&page={}&pageSize={}&isShadowSku=0&rid=0&fold=1'
        # Reviews for the current item only:
        url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1'
        url = url.format(product_id, p)
        # print(url)
        # Forge the request headers to look like a real browser
        # Paste in the Cookie value copied from your own browser
        cookie = ''
        headers = {
            'User-Agent': random.choice(agents),
            'Referer': 'https://item.jd.com/{}.html'.format(product_id),
            'Cookie': cookie
        }
        # Send the request
        request = urllib.request.Request(url=url, headers=headers)
        time0.sleep(2.5)
        # Read the response
        try:
            content = urllib.request.urlopen(request).read().decode('gbk')
        except Exception:
            print('Error fetching page %d of comments' % p)
            p = p + 1
            continue
        # Strip the JSONP wrapper (callback name may vary, e.g. fetchJSON_comment98vv995) to get plain JSON
        content = re.sub(r'^\s*fetchJSON_comment98\w*\(|\)\s*;?\s*$', '', content)

        # Maximum number of comment pages, updated from the response
        try:
            maxPage = int(re.findall('"maxPage":(.*?),"', content, re.S)[0])
        except Exception:
            pass

        try:
            obj = json.loads(content)
        except Exception:
            print('Bad response, trying again!')
            print([content])
            print(url)
            continue
        comments = obj['comments']
        # Product review summary
        # productCommentSummary = obj['productCommentSummary']
        # dict_pars_info = {}
        # # Average score
        # dict_pars_info['平均分'] = str(productCommentSummary['averageScore'])
        # # Positive-review rate
        # dict_pars_info['好评率'] = str(productCommentSummary['goodRate'])
        # # Current total number of reviews
        # dict_pars_info['当前评论数'] = str(productCommentSummary['commentCount'])
        # # Number of default reviews
        # dict_pars_info['默认评论数'] = str(productCommentSummary['defaultGoodCount'])
        # # Follow-up, positive, neutral, and negative review counts
        # dict_pars_info['追评数'] = str(productCommentSummary['afterCount'])
        # dict_pars_info['好评数'] = str(productCommentSummary['goodCount'])
        # dict_pars_info['中评数'] = str(productCommentSummary['generalCount'])
        # dict_pars_info['差评数'] = str(productCommentSummary['poorCount'])

        if len(comments) > 0:
            # print(comments)
            for comment in comments:
                # print(comment)
                name = comment['referenceName']
                id = comment['id']
                guid = comment['guid']
                content = comment['content']
                creationTime = comment['creationTime']
                isTop = comment['isTop']
                referenceTime = comment['referenceTime']
                firstCategory = comment['firstCategory']
                secondCategory = comment['secondCategory']
                thirdCategory = comment['thirdCategory']
                replyCount = comment['replyCount']
                score = comment['score']
                nickname = comment['nickname']
                userClient = comment['userClient']
                # userLevelName = comment['userLevelName']
                plusAvailable = comment['plusAvailable']
                productSales = comment['productSales']
                # userClientShow = comment['userClientShow']
                days = comment['days']
                afterDays = comment['afterDays']
                mobileVersion = comment['mobileVersion']

                # Some variants do not report color/size, so default to empty strings
                try:
                    productColor = comment['productColor']
                except KeyError:
                    productColor = ''
                try:
                    productSize = comment['productSize']
                except KeyError:
                    productSize = ''

                likes = comment['usefulVoteCount']
                item = {
                    'name': name,
                    'id': id,
                    'guid': guid,
                    'content': content,
                    'creationTime': creationTime,
                    'isTop': isTop,
                    'referenceTime': referenceTime,
                    'firstCategory': firstCategory,
                    'secondCategory': secondCategory,
                    'thirdCategory': thirdCategory,
                    'replyCount': replyCount,
                    'score': score,
                    'nickname': nickname,
                    'userClient': userClient,
                    # 'userLevelName': userLevelName,
                    'plusAvailable': plusAvailable,
                    'productSales': productSales,
                    # 'userClientShow': userClientShow,
                    'days': days,
                    'afterDays': afterDays,
                    'mobileVersion': mobileVersion,
                    'productColor': productColor,
                    'productSize': productSize,
                    'likes': likes,
                }
                # item.update(dict_pars_info)
                # print(item)
                string = str(item)

                # 1. Save as CSV
                item_dataframe = pd.DataFrame([item])
                # print(item_dataframe)
                if k_head == 0:
                    item_dataframe.to_csv(root_dir + '/%d.csv' % product_id, mode='w', header=True, index=False, encoding='gbk')
                    k_head += 1
                else:
                    item_dataframe.to_csv(root_dir + '/%d.csv' % product_id, mode='a', header=False, index=False, encoding='gbk')

                # 2. Save as txt
                fp = open(root_dir + '/%d.txt' % product_id, 'a', encoding='gbk')
                fp.write(string + '\n')
                fp.close()
            print('%s-page---finish(%s/%s)' % (p, p, maxPage))
        else:
            return []
        p = p + 1


if __name__ == '__main__':
    phone_id = 100008348542
    product_reviews(product_id=phone_id)
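To confirm the crawl meets the stated requirement (at least 3,000 records spanning more than a week), a quick local check can be run over the saved files. This is a minimal sketch, assuming every per-product CSV sits in the output directory created above:

import glob
import pandas as pd

# Merge every per-product CSV written by the crawler and report
# the total record count and the time span of the reviews.
frames = [pd.read_csv(f, encoding='gbk') for f in glob.glob('京东手机评论_详细字典/*.csv')]
df = pd.concat(frames, ignore_index=True)
times = pd.to_datetime(df['creationTime'])
print('records:', len(df))
print('time span:', times.max() - times.min())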

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(2) Screenshot of the data collected into local files (showing a record count above 3,000 and the time span)

 

 

 

 

2. Data preprocessing: use MapReduce or Kettle to preprocess the raw data, cleaning the large set of JSON files into structured text files. (10 points)

(1) Screenshots of removing duplicate records from the user review table;

 

 

Keep only the id field and delete the selected rows

 

 

Open the Unique Rows (去除重复记录) step; as in the previous step, get the fields, keep only the product id, and delete the remaining fields

 

 

In the content settings, set the delimiter to the ASCII comma "," and the encoding to UTF-8/GBK
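The Kettle flow above can be cross-checked in pandas. A minimal sketch, assuming the crawler's CSV as input, deduplication on the id field, and a headerless comma-separated output named file1.csv to match the Hive load in section 3:

import pandas as pd

# Drop duplicate reviews by id, mirroring Kettle's Unique Rows step.
df = pd.read_csv('京东手机评论_详细字典/100008348542.csv', encoding='gbk')
before = len(df)
df = df.drop_duplicates(subset=['id'])
print('duplicates removed:', before - len(df))
# Write comma-separated without a header row so Hive's OpenCSVSerde reads it cleanly.
df.to_csv('file1.csv', index=False, header=False, encoding='utf-8')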

 

 

 

 

 

(2) Screenshot of the file with the data fields extracted according to the cleaned-data format requirements.

 

 

 

3. Data statistics: generate Hive user review data: (15 points)

(1) Create a Hive table named pinglun to hold the cleaned data (table-creation SQL statement); screenshot of successful creation and data import:

Create the database

 

 

 

Create the table

drop table if exists pinglun;

create table pinglun(
    id string,
    guid string,
    content string,
    creationTime string,
    referenceTime string,
    score string,
    nickname string,
    userlevelname string,
    ismobile string,
    userclientshow string,
    days string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
);

Load the data

load data local inpath '/opt/module/hive/datas/file1.csv' into table pinglun;
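A row-count query right after loading confirms the import succeeded. A sketch, assuming PyHive is installed and HiveServer2 is listening on hadoop102:10000 (both assumptions; adjust to the cluster's actual settings):

from pyhive import hive

# Assumed connection details for HiveServer2.
conn = hive.Connection(host='hadoop102', port=10000, database='pinglun')
cur = conn.cursor()
cur.execute('select count(*) from pinglun')
print('rows loaded:', cur.fetchone()[0])
conn.close()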

 

 

 

 

Requirement 1: analyze whether users purchase on mobile or PC and the ratio of mobile to PC users; generate the ismobilehive table to store the statistics; table-creation SQL statement and screenshot of successful creation and data import

 

 

 

 

Create the table


create table ismobilehive(
    buyloc string,
    buysum string
) row format delimited fields terminated by ',';

insert into table ismobilehive select p.ismobile, count(p.ismobile) from pinglun p group by p.ismobile;
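The table stores raw counts per client type; the mobile-to-PC ratio the requirement asks for can be derived from those counts once the table is exported to MySQL in section 4. A sketch reusing the pymysql connection settings that appear later in this report:

import pymysql

conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                       passwd='000000', db='pinglun', charset='utf8')
cur = conn.cursor()
cur.execute('select buyloc, buysum from ismobilehive')
rows = cur.fetchall()
total = sum(int(n) for _, n in rows)
for loc, n in rows:
    # Print each client type's share of all reviews as a percentage.
    print('%s: %.2f%%' % (loc, 100.0 * int(n) / total))
conn.close()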

 

 

Requirement 2: analyze the review cycle (how long after receiving the goods users typically post a review); generate the dayssql table to store the statistics; table-creation SQL statement and screenshot of successful creation and data import

create table dayssql(
    days string,
    dayssum string
) row format delimited fields terminated by ',';

insert into table dayssql select p.days, count(p.days) from pinglun p group by p.days;

Data display

 

 

 

 

 

 

 

Requirement 3: analyze membership levels (the level of users buying this product); generate the userlevelname_out table to store the statistics; table-creation SQL statement and screenshot of successful creation and data import

 

create table userlevelname_out(
    userlevelname string,
    userlevelnamesum string
) row format delimited fields terminated by ',';

insert into table userlevelname_out select p.userlevelname, count(p.userlevelname) from pinglun p group by p.userlevelname;

 

 

 

 

 

 

Requirement 4: analyze the daily review volume; generate the creationtime_out table to store the statistics; table-creation SQL statement and screenshot of successful creation and data import

drop table if exists creationtime_out;

create table creationtime_out(
    creationtime string,
    creationtimesum string
) row format delimited fields terminated by ',';

insert into table creationtime_out select p.creationtime, count(p.creationtime) from pinglun1 p group by p.creationtime;

(The source here is pinglun1, the date-standardized table created under requirement 5 below, so the counts group by day rather than by full timestamp.)

 

 

 

 

 

Requirement 5: before/after screenshots of the data table after date-format standardization

 

Create the table

drop table if exists pinglun1;

create table pinglun1(
    id string,
    guid string,
    content string,
    creationTime string,
    referenceTime string,
    score string,
    nickname string,
    userlevelname string,
    ismobile string,
    userclientshow string,
    days string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
);

 

Standardize the date format

insert into table pinglun1 select
    p.id, p.guid, p.content, to_date(p.creationtime), to_date(p.referenceTime),
    p.score, p.nickname, p.userLevelName, p.isMobile, p.userClientShow, p.days
from pinglun p;
(Note: the select list follows the table's column order, with ismobile before userclientshow.)

Data after standardization

 

Data before standardization

 

 

 

 

 

 

 

4. Use Sqoop to migrate the data to a MySQL database: (5 points)

Screenshots of the five tables exported into five MySQL tables.

ismobilehive table

bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table ismobilehive  --export-dir /user/hive/warehouse/pinglun.db/ismobilehive --input-fields-terminated-by ','  --columns="buyloc,buysum";

 

  

dayssql table

bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table dayssql  --export-dir /user/hive/warehouse/pinglun.db/dayssql --input-fields-terminated-by ','  --columns="days,dayssum";

 

userlevelname_out table

bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table userlevelname_out  --export-dir /user/hive/warehouse/pinglun.db/userlevelname_out --input-fields-terminated-by ','  --columns="userlevelname,userlevelnamesum";

 

 

creationtime_out table

bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table creationtime_out --export-dir /user/hive/warehouse/pinglun.db/creationtime_out --input-fields-terminated-by ','  --columns="creationtime,creationtimesum";

 

 

pinglun1 table

bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table pinglun1 --export-dir /user/hive/warehouse/pinglun.db/pinglun1 --input-fields-terminated-by ',' ;
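After the five exports, a quick count on the MySQL side verifies that every table landed. A sketch using the same connection settings as elsewhere in this report:

import pymysql

conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                       passwd='000000', db='pinglun', charset='utf8')
cur = conn.cursor()
# Count the rows in each of the five exported tables.
for table in ('ismobilehive', 'dayssql', 'userlevelname_out',
              'creationtime_out', 'pinglun1'):
    cur.execute('select count(*) from ' + table)
    print(table, cur.fetchone()[0])
conn.close()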

 

 

5. Data visualization: chart the data with JavaWeb + ECharts (20 points)

Screenshot of the visualization for requirement 1

 

 

 

 

 

 

Screenshot of the visualization for requirement 2

 

 

 

 

 

 

Screenshot of the visualization for requirement 3

 

 

 

 

 

 

Screenshot of the visualization for requirement 4

 

 

 

 

 

6. Chinese word segmentation for user review analysis. (20 points)

(1) This section analyzes the negative-review data in the product review table and extracts users' main points of complaint. (Screenshot of the filtered negative-review dataset)

 

 

 

 

 

(2) Use Python's jieba segmentation to implement Chinese word segmentation and word-frequency statistics over the user review text; (screenshot after segmentation)

 

Export the content field from MySQL to a file

 

 

 

After that succeeds, run the Python script

 

 

 

Code

import jieba
import re
from collections import Counter
import pymysql

conn = pymysql.Connect(host='hadoop102',
                       port=3306,
                       user='root',
                       passwd='000000',
                       db='pinglun',
                       charset='utf8')

cut_words = ""
cur = conn.cursor()
fileNameStr1 = r'D:\PythonProject\have2022data\pinglun1.csv'
for line in open(fileNameStr1, encoding='utf-8'):
    line = line.strip('\n')
    # Remove letters, digits, and common punctuation before segmenting
    line = re.sub(r'[A-Za-z0-9:·—,。“ ”]', '', line)
    seg_list = jieba.cut(line, cut_all=False)
    cut_words += (" ".join(seg_list))
all_words = cut_words.split()
print(all_words)
c = Counter()
for x in all_words:
    if len(x) > 1 and x != '\r\n':
        c[x] += 1

print('\nWord frequency results:')
# The fencientent(mainword, countnum) table must already exist in the pinglun database.
for (k, v) in c.most_common():  # every word, most frequent first
    into = "INSERT INTO fencientent(mainword,countnum) VALUES (%s,%s)"
    values = (k, v)
    cur.execute(into, values)
    print("%s:%d" % (k, v))
conn.commit()

conn.close()

 

The segmented data is displayed successfully

(3) Create a word-frequency table in Hive and load the segmented data;

Requirements:

① Chinese word segmentation of the user review text;

② word-frequency statistics over the segmented text;

③ create a word-frequency table in Hive and load the segmented data;

Create the pingpin table

drop table if exists pingpin;

create table pingpin(
    mainword string,
    countnum string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
);

 

Import the data into Hive

First import to HDFS

bin/sqoop import --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table fencientent  --target-dir /user/hive/warehouse/pinglun.db/pingpin --fields-terminated-by ',';

Then load into Hive

load data inpath '/user/hive/warehouse/pinglun.db/pingpin' into table pinglun.pingpin;

 

Import into Hive succeeded

④ Bar chart of the top ten categories in users' negative reviews.

Create the chaping table

drop table if exists chaping;

create table chaping(
    id string,
    guid string,
    content string,
    creationTime string,
    referenceTime string,
    score string,
    nickname string,
    userlevelname string,
    ismobile string,
    userclientshow string,
    days string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
);

The negative-review table is created

 

Load the negative-review data

load data local inpath '/opt/module/hive/datas/file2.csv' into table chaping;

 

 

 

 

Create the chaping1 table

drop table if exists chaping1;

create table chaping1(
    id string,
    guid string,
    content string,
    creationTime string,
    referenceTime string,
    score string,
    nickname string,
    userlevelname string,
    ismobile string,
    userclientshow string,
    days string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",
    "quoteChar" = "\"",
    "escapeChar" = "\\"
);

 

Store the date-standardized data

insert into table chaping1 select
    p.id, p.guid, p.content, to_date(p.creationtime), to_date(p.referenceTime),
    p.score, p.nickname, p.userLevelName, p.isMobile, p.userClientShow, p.days
from chaping p;

 

Export for visualization

bin/sqoop export --connect jdbc:mysql://hadoop102:3306/pinglun --username root -P --table chaping1 --export-dir /user/hive/warehouse/pinglun.db/chaping1 --input-fields-terminated-by ',' ;

 

 

Negative-review data

 

Export the negative-review comments

 

 

 

Segmentation of the negative reviews

 

 

⑤ Word cloud of the segmented negative-review words.
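The cloud can be drawn straight from the word-frequency table. A minimal sketch, assuming the wordcloud package is installed, the negative-review frequencies were written to a table the same way fencientent was (fencientent is reused here for illustration), and a Chinese-capable font file is available locally (the font path is a placeholder):

import pymysql
from wordcloud import WordCloud

conn = pymysql.Connect(host='hadoop102', port=3306, user='root',
                       passwd='000000', db='pinglun', charset='utf8')
cur = conn.cursor()
cur.execute('select mainword, countnum from fencientent')
freqs = {word: int(count) for word, count in cur.fetchall()}
conn.close()

# font_path is an assumption; point it at any font that covers Chinese glyphs.
wc = WordCloud(font_path='simhei.ttf', width=800, height=600,
               background_color='white').generate_from_frequencies(freqs)
wc.to_file('chaping_wordcloud.png')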

 

 

 

 

 

 

7. Real-time data analysis with Spark. (20 points)

This experiment targets JD product reviews. The architecture is crawler + Flume + Kafka + Spark Streaming + MySQL, implementing dynamic, real-time collection, analysis, and display of the data.

 

 

The specific workflow is shown in the figure below:

 

Screenshots of the steps

Start the ZooKeeper service:

 

Start Kafka

 

 

Create the topic

 

List the topics

 

Create the configuration file under Flume's job directory

 

Start Flume

 

 

 

Write the crawler script
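The real-time crawler script itself appears only as a screenshot; as a sketch of its output side, each scraped review can be appended as one space-separated line of twelve fields, matching the split(" ") in the Spark code below. The log path and the field names are assumptions:

import time

# Hypothetical feeder for the Flume source: one review per line,
# twelve space-separated fields in the order the Ping_log case class expects.
LOG_PATH = '/opt/module/data/pinglun_realtime.log'  # assumed path watched by Flume

def emit(review):
    # review: dict with twelve fields; spaces are stripped from values
    # because a space is the field delimiter downstream.
    keys = ('id', 'guid', 'content', 'creationTime', 'referenceTime', 'score',
            'nickname', 'userLevelName', 'isMobile', 'userClientShow', 'days', 'afterDays')
    fields = [str(review[k]).replace(' ', '') for k in keys]
    with open(LOG_PATH, 'a', encoding='utf-8') as fp:
        fp.write(' '.join(fields) + '\n')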

Crawling in real time

 

 

Process the data

package show

import java.sql.Connection

import bean.Ping_log
import handle.PingHandler
import utils.{JdbcUtil, MyKafkaUtil, PropertiesUtil}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RealTimeApp {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("RealTimeApp").setMaster("local[2]")
    // Create the StreamingContext
    val ssc = new StreamingContext(sparkConf, Seconds(3))

    // val topic: String = PropertiesUtil.load("config.properties").getProperty("kafka.topic")
    val topics = "first"
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = MyKafkaUtil.getKafkaStream(topics, ssc)
    // Convert each line of data into a case class instance
    val adsLogDStream: DStream[Ping_log] = kafkaDStream.map(record => {
      // Take the value and split it on spaces
      val arr: Array[String] = record.value().split(" ")
      Ping_log(arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9), arr(10), arr(11))
    })

    // Count the daily number of reviews and save the result to MySQL
    PingHandler.savePingCountToMysql(adsLogDStream)
    // Start the job
    ssc.start()
    ssc.awaitTermination()
  }
}

 

Database view

 

Only the timestamp and count columns

 

 

 

Real-time visualization

 
