2023 Cohort In-Class Test: Data Analysis
Course: Large-Scale Database Application Technology    Instructor: Sun Jing    Duration: 90 minutes
Field description for the Result file:
Ip: 106.39.41.166 (city)
Date: 10/Nov/2016:00:01:02 +0800 (date)
Day: 10 (day of month)
Traffic: 54 (traffic)
Type: video (type: video or article)
Id: 8701 (id of the video or article)
Test requirements:
1. Data cleaning: clean the data as described below and load the cleaned data into a Hive database.
Two-stage cleaning:
(1) Stage one: extract the required fields from the raw log:
ip: 199.30.25.88
time: 10/Nov/2016:00:01:03 +0800
traffic: 62
article: article/11325
video: video/3235
(2) Stage two: refine the extracted fields:
ip ---> city (city resolved from the IP)
date --> time: 2016-11-10 00:01:03
day: 10
traffic: 62
type: article/video
id: 11325
(3) Hive table schema:
create table data(ip string, time string, day string, traffic bigint, type string, id string);
2. Data analysis: compute the following statistics in Hive.
(1) Top 10 videos/articles by number of visits (video/article)
(2) Top 10 most popular courses by city (ip)
(3) Top 10 most popular courses by traffic (traffic)
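As a reference, here is a minimal HiveQL sketch of the three statistics. It assumes the data table defined above; for (2) it assumes an enriched table named data_city that also carries a city column, and reads "Top 10 by city" as the ten most-visited city/course pairs (both the table name and that reading are my assumptions):

-- (1) Top 10 videos/articles by visit count
select type, id, count(*) as visits
from data
group by type, id
order by visits desc
limit 10;

-- (2) Top 10 city/course pairs by visit count
-- (data_city is an assumed enriched table with a city column;
--  a strict per-city Top 10 would instead rank in a subquery with
--  row_number() over (partition by city order by visits desc))
select city, id, count(*) as visits
from data_city
group by city, id
order by visits desc
limit 10;

-- (3) Top 10 courses by total traffic
select type, id, sum(traffic) as total_traffic
from data
group by type, id
order by total_traffic desc
limit 10;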
3. Data visualization:
Import the statistics into a MySQL database and present them graphically.
The dataset is provided and only needs light cleaning.
The first pass extracts the useful fields from the raw data and writes them out comma-separated:
import csv
from datetime import datetime
from pathlib import Path

RAW_FILE = Path("raw.csv")
OUT_FILE = Path("cleaned.csv")

def parse_time(raw_time: str) -> str:
    # Raw format: 10/Nov/2016:00:01:02 +0800
    main = raw_time.split()[0]
    dt = datetime.strptime(main, "%d/%b/%Y:%H:%M:%S")
    return dt.strftime("%Y-%m-%d %H:%M:%S")

def clean(in_path: Path = RAW_FILE, out_path: Path = OUT_FILE) -> None:
    with in_path.open("r", encoding="utf-8", newline="") as fin, \
         out_path.open("w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            # Structural check: expect exactly six fields
            if len(row) != 6:
                continue
            ip, raw_time, day, traffic, typ, _id = (c.strip() for c in row)
            # Filter on type and id
            if typ not in ("video", "article"):
                continue
            if not _id.isdigit():
                continue
            # Parse timestamp and traffic; skip malformed rows
            try:
                time_fmt = parse_time(raw_time)
                traffic_val = int(traffic.replace(" ", ""))
            except Exception:
                continue
            writer.writerow([ip, time_fmt, day, traffic_val, typ, _id])

if __name__ == "__main__":
    clean()
    print(f"done: {OUT_FILE}")
You need to download
GeoLite2-City.mmdb before IP addresses can be resolved to city names.
Download:
https://gitcode.com/open-source-toolkit/94ce3/tree/main
A second cleaning pass then converts each IP address into a city name:
import csv
from pathlib import Path

import geoip2.database

CLEANED_FILE = Path("cleaned.csv")
ENRICHED_FILE = Path("enriched.csv")
GEOIP_DB = Path("GeoLite2-City.mmdb")  # put the database file next to this script, or use an absolute path

def ip_to_city(reader: geoip2.database.Reader, ip: str) -> str:
    # Look up the city name for an IP; fall back to UNKNOWN on any lookup error
    try:
        return reader.city(ip).city.name or "UNKNOWN"
    except Exception:
        return "UNKNOWN"

def enrich(cleaned=CLEANED_FILE, out=ENRICHED_FILE, geo_db=GEOIP_DB):
    if not geo_db.exists():
        raise FileNotFoundError(f"GeoIP DB not found: {geo_db}")
    reader = geoip2.database.Reader(str(geo_db))
    with cleaned.open("r", encoding="utf-8", newline="") as fin, \
         out.open("w", encoding="utf-8", newline="") as fout:
        r = csv.reader(fin)
        w = csv.writer(fout)
        for row in r:
            if len(row) != 6:
                continue
            ip, ts, day, traffic, typ, _id = row
            city = ip_to_city(reader, ip)
            w.writerow([ip, ts, day, traffic, typ, _id, city])
    reader.close()
    print(f"done -> {out}")

if __name__ == "__main__":
    enrich()
Hive must be configured in advance; I used DataGrip to upload the cleaned data into Hive.
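For the upload step, a minimal HiveQL sketch, assuming the cleaned CSV has been copied onto the Hive host (the file path is a placeholder); the schema extends the one above with a delimiter clause so Hive can read the comma-separated file:

create table if not exists data(
    ip string, time string, day string,
    traffic bigint, type string, id string
)
row format delimited fields terminated by ','
stored as textfile;

load data local inpath '/path/to/cleaned.csv' into table data;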

The SQL statements for each requirement are generated and handed to Hive, which executes them as MapReduce jobs. My virtual machines have fairly little memory allocated (4 GB for the master, 1 GB for each of the two workers),
so the jobs took 5-6 minutes to finish; at first I thought the connection had failed. Allocating more memory would cut the wait considerably. Since I don't have Sqoop installed, I imported the results into MySQL by hand: generate the SQL statements, run them against the MySQL database, and finally visualize with an HTML page. The data volume here is small, so this approach works; with large data you would have to use Sqoop to transfer the computation results.
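A sketch of the manual MySQL import for one of the result sets, assuming a hand-created table (the table name, columns, and the values shown are placeholders, not actual results):

create table top10_visits (
    type   varchar(16),
    id     varchar(32),
    visits bigint
);

-- one row per line of the Hive output; replace the placeholder values
insert into top10_visits (type, id, visits) values ('video', '3235', 0);

The HTML page then only has to read these few rows and render a chart, which is why the manual route is workable at this data size.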



