2023 级课堂测试试卷—数据分析

课程名称: 大型数据库应用技术 任课教师: 孙静 考试时间: 90 分钟

Result文件数据说明:
Ip:106.39.41.166,(城市)
Date:10/Nov/2016:00:01:02 +0800,(日期)
Day:10,(天数)
Traffic: 54 ,(流量)
Type: video,(类型:视频video或文章article)
Id: 8701(视频或者文章的id)
测试要求:
1、 数据清洗:按照进行数据清洗,并将清洗后的数据导入hive数据库中。
两阶段数据清洗:
(1)第一阶段:把需要的信息从原始日志中提取出来
ip: 199.30.25.88
time: 10/Nov/2016:00:01:03 +0800
traffic: 62
文章: article/11325
视频: video/3235
(2)第二阶段:根据提取出来的信息做精细化操作
ip--->城市 city(IP)
date--> time:2016-11-10 00:01:03
day: 10
traffic:62
type:article/video
id:11325
(3)hive数据库表结构:
create table data( ip string, time string , day string, traffic bigint,
type string, id string )
2、数据分析:在HIVE统计下列数据。
(1)统计最受欢迎的视频/文章的Top10访问次数 (video/article)
(2)按照地市统计最受欢迎的Top10课程 (ip)
(3)按照流量统计最受欢迎的Top10课程 (traffic)
3、数据可视化:
将统计结果倒入MySql数据库中,通过图形化展示的方式展现出来。

数据集已经给好了只需要简单清洗一下就可以

这是先把原数据中有用的信息提取出来,并以,隔开


import csv
from datetime import datetime
from pathlib import Path

RAW_FILE = Path("raw.csv")
OUT_FILE = Path("cleaned.csv")


def parse_time(raw_time: str) -> str:
    # 原始格式: 10/Nov/2016:00:01:02 +0800
    main = raw_time.split()[0]
    dt = datetime.strptime(main, "%d/%b/%Y:%H:%M:%S")
    return dt.strftime("%Y-%m-%d %H:%M:%S")


def clean(in_path: Path = RAW_FILE, out_path: Path = OUT_FILE) -> None:
    with in_path.open("r", encoding="utf-8", newline="") as fin, \
         out_path.open("w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            # 结构校验
            if len(row) != 6:
                continue
            ip, raw_time, day, traffic, typ, _id = (c.strip() for c in row)

            # 类型/id 过滤
            if typ not in ("video", "article"):
                continue
            if not _id.isdigit():
                continue

            # 解析时间与流量
            try:
                time_fmt = parse_time(raw_time)
                traffic_val = int(traffic.replace(" ", ""))
            except Exception:
                continue

            writer.writerow([ip, time_fmt, day, traffic_val, typ, _id])


if __name__ == "__main__":
    clean()
    print(f"done: {OUT_FILE}")



需要下载
GeoLite2-City.mmdb才可以把坐标转换为城市的名字
下载地址
https://gitcode.com/open-source-toolkit/94ce3/tree/main

再清洗一遍,把地址转换为城市的名字


import csv
from pathlib import Path
import geoip2.database

CLEANED_FILE = Path("cleaned.csv")
ENRICHED_FILE = Path("enriched.csv")
GEOIP_DB = Path("GeoLite2-City.mmdb")  # 请将库文件放在同目录或改为绝对路径

def ip_to_city(reader):
    try:
        return reader.city.name or "UNKNOWN"
    except Exception:
        return "UNKNOWN"

def enrich(cleaned=CLEANED_FILE, out=ENRICHED_FILE, geo_db=GEOIP_DB):
    if not geo_db.exists():
        raise FileNotFoundError(f"GeoIP DB not found: {geo_db}")
    reader = geoip2.database.Reader(str(geo_db))

    with cleaned.open("r", encoding="utf-8", newline="") as fin, \
         out.open("w", encoding="utf-8", newline="") as fout:
        r = csv.reader(fin)
        w = csv.writer(fout)
        for row in r:
            if len(row) != 6:
                continue
            ip, ts, day, traffic, typ, _id = row
            try:
                city = reader.city(ip).city.name or "UNKNOWN"
            except Exception:
                city = "UNKNOWN"
            w.writerow([ip, ts, day, traffic, typ, _id, city])

    reader.close()
    print(f"done -> {out}")

if __name__ == "__main__":
    enrich()


需要提前配置好hive,用datagrip将清洗的数据上传到hive中
image
根据需求生成sql语句,交给hive执行mapreduce语句,我这个虚拟机分配的内存比较少,主4GB,从1GB,从1GB
用了5-6分钟才计算出来,我还以为没连接好,可以把内存分配的多一些,这样就不需要等太长时间了,我没有装sqoop所以是手动导入mysql的,生成sql语句到mysql数据库,最后使用html可视化,因为数据量还是不大,所有可以这么做,数据大就必须用sqoop来传输运算结果了

295f98d5d37a249ed99892b6a7c48827
5b3c51fa1380bc1baf8909c79d76c2fe
3e93e3ab27140e176d46f55df2c78d76

image

posted @ 2025-12-22 14:51  向恦  阅读(0)  评论(0)    收藏  举报