Data Collection: Assignment 4

Problem 1

Requirements:
Become proficient with Selenium: locating HTML elements, scraping Ajax-rendered pages, and waiting for elements to appear.
Use the Selenium framework together with MySQL storage to crawl stock data for the three boards "沪深A股", "上证A股", and "深证A股".
Candidate site: Eastmoney (东方财富网): http://quote.eastmoney.com/center/gridlist.html#hs_a_board
Output: store and display the data in a MySQL table in the format below; column names should be in English, e.g. id for the serial number, bStockNo for the stock code, etc., with the schema designed by each student:

(screenshot: page structure analysis)
The screenshot above shows the page-inspection step: examining the HTML structure to find exactly where each stock's fields are stored.

Code and results

Code
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pymysql

# Database settings
DB_SETTINGS = {
    "host": "localhost",
    "user": "fangpu",
    "password": "Ffp051208",
    "database": "dbstock",
    "charset": "utf8mb4"
}

# Open the database connection
connection = pymysql.connect(**DB_SETTINGS)
db_cursor = connection.cursor()

def store_data(record):
    """将数据存入数据库"""
    insert_sql = """
    INSERT INTO eastmoney_stock(
        board, xuhao, daima, mingcheng, zuixinbaojia,
        zhangdiefu, zhangdiee, chengjiaoliang,
        chengjiaoe, zhenfu, zuigao, zuidi,
        jinkai, zuoshou
    ) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
    ON DUPLICATE KEY UPDATE
        zuixinbaojia=VALUES(zuixinbaojia),
        zhangdiefu=VALUES(zhangdiefu),
        zhangdiee=VALUES(zhangdiee),
        chengjiaoliang=VALUES(chengjiaoliang),
        chengjiaoe=VALUES(chengjiaoe),
        zhenfu=VALUES(zhenfu),
        zuigao=VALUES(zuigao),
        zuidi=VALUES(zuidi),
        jinkai=VALUES(jinkai),
        zuoshou=VALUES(zuoshou)
    """
    db_cursor.execute(insert_sql, record)
    connection.commit()

def fetch_stock_data(section_name, page_url, page_limit=2):
    """Crawl one board's stock table and store it page by page."""
    print(f"Fetching data for board: {section_name}")
    driver.get(page_url)
    for current_page in range(1, page_limit + 1):
        print(f"  处理第 {current_page} 页")
        # 等待数据加载
        WebDriverWait(driver, 20).until(
            EC.presence_of_all_elements_located((By.XPATH, "//tbody/tr"))
        )
        # Extract the table rows
        table_rows = driver.find_elements(By.XPATH, "//tbody/tr")
        for row in table_rows:
            cells = row.find_elements(By.TAG_NAME, "td")
            if len(cells) < 13:
                continue
            stock_info = {
                'xuhao': cells[0].text.strip(),
                'daima': cells[1].text.strip(),
                'mingcheng': cells[2].text.strip(),
                'zuixinbaojia': cells[3].text.strip(),
                'zhangdiefu': cells[4].text.strip(),
                'zhangdiee': cells[5].text.strip(),
                'chengjiaoliang': cells[6].text.strip(),
                'chengjiaoe': cells[7].text.strip(),
                'zhenfu': cells[8].text.strip(),
                'zuigao': cells[9].text.strip(),
                'zuidi': cells[10].text.strip(),
                'jinkai': cells[11].text.strip(),
                'zuoshou': cells[12].text.strip()
            }
            # Skip rows whose code column is not a plain digit string
            if not stock_info['daima'].isdigit():
                continue
            # Assemble the record for storage
            data_record = (
                section_name,
                stock_info['xuhao'],
                stock_info['daima'],
                stock_info['mingcheng'],
                stock_info['zuixinbaojia'],
                stock_info['zhangdiefu'],
                stock_info['zhangdiee'],
                stock_info['chengjiaoliang'],
                stock_info['chengjiaoe'],
                stock_info['zhenfu'],
                stock_info['zuigao'],
                stock_info['zuidi'],
                stock_info['jinkai'],
                stock_info['zuoshou']
            )
            store_data(data_record)
            print(f"    已保存 {stock_info['daima']} {stock_info['mingcheng']}")
        # 翻页操作
        if current_page < page_limit:
            try:
                next_page = WebDriverWait(driver, 10).until(
                    EC.element_to_be_clickable((By.XPATH, '//a[@title="下一页"]'))
                )
                next_page.click()
                print("  翻到下一页")
                time.sleep(2)
            except:
                print("  已到最后一页")
                break

if __name__ == "__main__":
    # Initialize the browser
    chrome_options = webdriver.ChromeOptions()
    # chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)

    # Boards to crawl
    stock_sections = {
        "沪深A股": "https://quote.eastmoney.com/center/gridlist.html#hs_a_board",
        "上证A股": "https://quote.eastmoney.com/center/gridlist.html#sh_a_board",
        "深证A股": "https://quote.eastmoney.com/center/gridlist.html#sz_a_board",
    }
    # Crawl each board in turn
    for section, url in stock_sections.items():
        fetch_stock_data(section, url, page_limit=2)
    # Clean up
    driver.quit()
    db_cursor.close()
    connection.close()
    print("\nData collection finished!")
Results

(screenshot of the stored results)

Problem 2

Requirements:
Become proficient with Selenium: locating HTML elements, simulating user login, scraping Ajax-rendered pages, and waiting for elements to appear.
Use the Selenium framework + MySQL to crawl course information from China University MOOC (course id, course name, school, lead teacher, team members, number of participants, course schedule, course description).
Candidate site: China University MOOC: https://www.icourse163.org
Output: MySQL storage and output format

(screenshot)
The crawling in this problem is done by locating elements on the page via XPath.

Code and results

Code
import json
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pymysql

# Configuration
DRIVER_PATH = r"D:\python\pythonProject1\.venv\Scripts\chromedriver.exe"
COOKIE_FILE = "cookies.json"

# Database settings
DB_SETTINGS = {
    "host": "localhost",
    "user": "fangpu",
    "password": "Ffp051208",
    "database": "dbstock",
    "charset": "utf8mb4"
}

# Course page URLs to crawl
COURSE_LIST = [
    "https://www.icourse163.org/course/ZJU-199001",
    "https://www.icourse163.org/course/XJTU-1002529011",
    "https://www.icourse163.org/course/PKU-1206624828"
]

def initialize_database():
    """Open the database connection and create the table if it does not exist."""
    connection = pymysql.connect(**DB_SETTINGS)
    cursor = connection.cursor()
    create_table_sql = """
    CREATE TABLE IF NOT EXISTS course_info(
        id INT AUTO_INCREMENT PRIMARY KEY,
        url VARCHAR(255),
        cCourse VARCHAR(255),
        cCollege VARCHAR(255),
        cTeacher VARCHAR(255),
        cTeam TEXT,
        cCount VARCHAR(50),
        cProcess VARCHAR(255),
        cBrief TEXT
    )
    """
    cursor.execute(create_table_sql)
    connection.commit()
    return connection, cursor

def load_user_cookies(browser):
    """Load saved cookies to restore a logged-in session."""
    if not os.path.exists(COOKIE_FILE):
        print("cookies file not found; please generate it first!")
        exit()
    browser.get("https://www.icourse163.org/")
    time.sleep(2)
    with open(COOKIE_FILE, "r", encoding="utf-8") as file:
        cookies = json.load(file)
    for cookie in cookies:
        cookie.pop("sameSite", None)
        cookie.pop("expiry", None)
        try:
            browser.add_cookie(cookie)
        except Exception as e:
            print("添加cookie失败:", e)
    browser.get("https://www.icourse163.org/")
    time.sleep(2)
    print("使用cookie登录成功!")
def extract_element_text(browser, wait_time, element_xpath):
    """安全提取元素文本"""
    try:
        element = wait_time.until(
            EC.presence_of_element_located((By.XPATH, element_xpath))
        )
        return element.text.strip()
    except Exception:
        return ""

def extract_element_attribute(browser, wait_time, element_xpath, attribute_name):
    """Safely extract an element attribute; return an empty string on failure."""
    try:
        element = wait_time.until(
            EC.presence_of_element_located((By.XPATH, element_xpath))
        )
        return element.get_attribute(attribute_name)
    except Exception:
        return ""

def collect_course_teachers(browser):
    """收集课程教师信息"""
    teacher_list = []
    try:
        teacher_container = browser.find_element(
            By.XPATH, "//div[@class='m-teachers_teacher-list']"
        )
        slider_container = teacher_container.find_element(
            By.XPATH, ".//div[@class='um-list-slider_con']"
        )
        teacher_items = slider_container.find_elements(
            By.XPATH, ".//div[@class='um-list-slider_con_item']"
        )

        for item in teacher_items:
            try:
                img_element = item.find_element(By.XPATH, ".//img")
                teacher_name = img_element.get_attribute("alt")
                if teacher_name:
                    teacher_list.append(teacher_name)
            except Exception:
                continue
    except Exception as e:
        print("获取教师信息失败:", e)
    return teacher_list

def process_course_page(browser, wait_time, course_url, db_cursor, db_connection):
    """Crawl a single course page and save the extracted fields."""
    print(f"Processing: {course_url}")
    browser.get(course_url)
    time.sleep(2)
    # Course name and college (absolute XPaths copied from the rendered page)
    course_name = extract_element_text(
        browser, wait_time,
        "/html/body/div[5]/div[2]/div[1]/div/div/div/div[2]/div[2]/div/div[2]/div[1]/span[1]"
    )
    college_name = extract_element_attribute(
        browser, wait_time,
        "/html/body/div[5]/div[2]/div[2]/div[2]/div[2]/div[2]/div[2]/div/a/img",
        "alt"
    )
    # Teachers currently shown in the slider
    teachers = collect_course_teachers(browser)
    # Page through the teacher slider if paging buttons are present
    try:
        page_buttons = browser.find_elements(
            By.XPATH,
            "/html/body/div[5]/div[2]/div[2]/div[2]/div[2]/div[2]/div[2]/div/div/div[2]/div/div[1]/span"
        )
        if page_buttons:
            page_buttons[0].click()
            time.sleep(2)
            while True:
                additional_teachers = collect_course_teachers(browser)
                teachers.extend(additional_teachers)
                page_buttons = browser.find_elements(
                    By.XPATH,
                    "/html/body/div[5]/div[2]/div[2]/div[2]/div[2]/div[2]/div[2]/div/div/div[2]/div/div[1]/span"
                )
                if len(page_buttons) == 2:
                    page_buttons[1].click()
                    time.sleep(2)
                else:
                    break
    except Exception as e:
        print("Failed to page through teachers:", e)
    # The slider can surface the same teacher more than once, so de-duplicate
    teachers = list(dict.fromkeys(teachers))
    main_teacher = teachers[0] if teachers else ""
    teacher_team = ",".join(teachers)
    # Number of participants
    try:
        count_element = wait_time.until(EC.presence_of_element_located((
            By.XPATH,
            "/html/body/div[5]/div[2]/div[1]/div/div/div/div[2]/div[2]/div/div[3]/div/div[1]/div[4]/span[2]"
        )))
        participant_count = count_element.text.strip()
    except Exception:
        participant_count = ""
    # Course schedule / progress
    progress_info = extract_element_text(
        browser, wait_time,
        "/html/body/div[5]/div[2]/div[1]/div/div/div/div[2]/div[2]/div/div[3]/div/div[1]/div[2]/div/span[2]"
    )
    # Course description
    course_description = extract_element_text(
        browser, wait_time,
        "/html/body/div[5]/div[2]/div[2]/div[2]/div[1]/div[1]/div[2]/div[2]/div[1]"
    )
    # Save to the database
    insert_sql = """
    INSERT INTO course_info(url, cCourse, cCollege, cTeacher, cTeam, cCount, cProcess, cBrief)
    VALUES (%s,%s,%s,%s,%s,%s,%s,%s)
    """
    course_data = (
        course_url, course_name, college_name, main_teacher, 
        teacher_team, participant_count, progress_info, course_description
    )
    db_cursor.execute(insert_sql, course_data)
    db_connection.commit()
    
    print(f"完成处理:{course_name}")

def main():
    """主程序"""
    # 初始化数据库
    db_connection, db_cursor = initialize_database()
    # 设置浏览器
    browser_service = Service(DRIVER_PATH)
    browser = webdriver.Chrome(service=browser_service)
    wait_handler = WebDriverWait(browser, 20)
    # Log in by loading saved cookies
    load_user_cookies(browser)
    # Process each course page
    for course_link in COURSE_LIST:
        process_course_page(browser, wait_handler, course_link, db_cursor, db_connection)
    # Clean up
    browser.quit()
    db_cursor.close()
    db_connection.close()
    print("\n所有课程处理完成!")
if __name__ == "__main__":
    main()
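The crawler relies on a cookies.json file that is not produced by the listing itself; it has to be generated once beforehand by logging in manually. A minimal sketch of such a helper, assuming the same chromedriver path as above (log in by hand in the opened window, then press Enter in the console):

import json
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = r"D:\python\pythonProject1\.venv\Scripts\chromedriver.exe"
COOKIE_FILE = "cookies.json"

browser = webdriver.Chrome(service=Service(DRIVER_PATH))
browser.get("https://www.icourse163.org/")
input("Log in manually in the browser window, then press Enter here...")

# Dump the current session cookies so load_user_cookies() can restore them later
with open(COOKIE_FILE, "w", encoding="utf-8") as f:
    json.dump(browser.get_cookies(), f, ensure_ascii=False, indent=2)

browser.quit()
print("Saved cookies to", COOKIE_FILE)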
Results

(screenshot of the stored results)

Problem 3

Requirements:
Get familiar with the relevant big-data services and with using Xshell.
Complete the tasks in the document 华为云_大数据实时分析处理实验手册-Flume日志采集实验(部分)v2.docx, i.e. the five tasks below; the detailed steps are described in that manual.
Environment setup:
Task 1: Enable the MapReduce Service (MRS)
Real-time analysis development practice:
Task 1: Generate test data with a Python script
Task 2: Configure Kafka
Task 3: Install the Flume client
Task 4: Configure Flume to collect data

Enabling the MapReduce Service

(screenshot)

Task 1: Generating test data with a Python script

(screenshot)
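The generator script itself comes from the lab manual and only appears as a screenshot above. Purely as an illustration of the idea, here is a minimal sketch of a script that keeps appending random records to a log file for Flume to pick up; the output path, field layout, and interval are my assumptions, not the manual's:

import os
import random
import time

# Hypothetical output path; the manual specifies its own location
LOG_FILE = "/tmp/flume_test/access.log"

def make_record():
    """Build one fake log line: timestamp, user id, page, response time (ms)."""
    return "{},{},{},{}".format(
        time.strftime("%Y-%m-%d %H:%M:%S"),
        random.randint(10000, 99999),
        random.choice(["index", "detail", "cart", "pay"]),
        random.randint(10, 500)
    )

if __name__ == "__main__":
    os.makedirs(os.path.dirname(LOG_FILE), exist_ok=True)
    while True:
        with open(LOG_FILE, "a", encoding="utf-8") as f:
            f.write(make_record() + "\n")
        time.sleep(1)  # roughly one record per second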

Task 2: Configuring Kafka

(screenshots)

Task 3: Installing the Flume client

(screenshots)

Task 4: Configuring Flume to collect data

(screenshots)

Reflections:

Problem 1: Crawling stock data with Selenium and integrating with a MySQL database
In this experiment I used Selenium to crawl the dynamically rendered stock tables of the Shanghai and Shenzhen markets and stored the data in MySQL. Compared with the SQLite I had used before, deploying MySQL and configuring the connection gave me a much better understanding of working with a server-based database. In the code I focused on designing a proper table schema, configuring the database connection, and converting the unstructured data parsed from the page into structured records for insertion. Working with constantly changing quote data also showed me the practical value of combining web scraping with persistent storage in financial data analysis.
Problem 2: Simulated login and crawling MOOC course information from a complex page structure
The challenge in this experiment was handling China University MOOC's simulated login and its complex page structure. Besides dealing with the login box nested inside an iframe, I used cookie persistence to keep the session alive and avoid repeated logins. For information extraction, analysing XPath and CSS selectors let me pinpoint the course name, teaching team, course schedule and other key fields scattered across the page. The process deepened my understanding of Selenium's wait mechanisms, locating dynamic elements, and dealing with anti-crawling measures, and improved my ability to extract well-structured data from complex interactive sites.
Problem 3: Building a real-time big-data collection pipeline with Kafka and Flume
From writing a script to simulate log generation, through configuring Kafka as the buffering message queue, to deploying a Flume agent for reliable transfer from Kafka to HDFS, I built a complete pipeline by hand. Resolving the configuration coordination and port connectivity issues between the components gave me a concrete sense of how the data changes form at each stage and what reliability each stage demands. It left me with a much more specific picture of how real-time data processing is architected in production.

Gitee link: https://gitee.com/fang-pu666/fp888/tree/homework/
