数据采集与融合技术实践第四次实验作业

作业①:

1.题目

要求：熟练掌握 scrapy 中 Item、Pipeline 数据的序列化输出方法；Scrapy+Xpath+MySQL数据库存储技术路线爬取当当网站图书数据
候选网站：http://www.dangdang.com/
关键词：学生自由选择
输出信息:MySQL数据库存储和输出格式如下：

2.实验思路

class Test41Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    date = scrapy.Field()
    price = scrapy.Field()
    detail = scrapy.Field()
    pass

items.py中写上这次作业需要爬取的信息

def start_requests(self):
    url = "http://search.dangdang.com/?"
    for page in range(1, 4):
        params = {"key": "python",
                  "act": "input",
                  "page_index": str(page)}
        # 构造get请求
        yield scrapy.FormRequest(url=url, callback=self.parse, method="GET",
                                 headers=self.headers,formdata=params)

首先在start_requests函数中实现参数的设置来进行翻页操作，爬取3页的书本内容

def parse(self, response):
    try:
        data=response.body.decode("gbk")
        selector = scrapy.Selector(text=data)
        books = selector.xpath("//*[@id='search_nature_rg']/ul/li")
        for book in books:
            item = Test41Item()
            item["title"] = book.xpath("./a/@title").extract_first()
            item["author"] = book.xpath("./p[@class='search_book_author']/span")[0].xpath("./a/@title").extract_first()
            item["publisher"] = book.xpath("./p[@class='search_book_author']/span")[2].xpath("./a/text()").extract_first()
            item["date"] = book.xpath("./p[@class='search_book_author']/span")[1].xpath("./text()").extract_first()
            try:
                item["date"] = item["date"].split("/")[-1]
            except:
                item["date"] = " "
            item["price"] = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
            item["detail"] = book.xpath("./p[@class='detail']/text()").extract_first() if book.xpath("./p[@class='detail']/text()").extract_first() else ""
            print(item["title"])
            print(item["author"])
            print(item["publisher"])
            print(item["date"])
            print(item["price"])
            print(item["detail"])
            yield item

    except Exception as err:
        print(err)

在parse函数中进行所需数据的获取，其中date数据进行一个额外的处理，把年月日前的空格和“/”去掉

class Test41Pipeline:
    db = DB()
    count = 1
    def __init__(self):
        self.db.openDB() #创建并打开一个表
    def process_item(self, item, spider):
        try:
            self.db.insert(self.count, item['title'], item['author'], item['publisher'], item['date'],
                           item['price'], item['detail'])
            self.count += 1
        except Exception as err:
            print(err)
        return item

在pipelines脚本中，链接mysql数据库并创建一个表，将之前获取到的数据存入mysql数据库中

数据库中的信息如上所示，爬取每页60本书本信息共180本

3.心得体会

作业①重温了scrapy框架和xpath，使用scrapy框架爬取当当网的内容，使用xpath解析信息

作业②:

1.题目

要求：熟练掌握 scrapy 中 Item、Pipeline 数据的序列化输出方法；使用scrapy框架+Xpath+MySQL数据库存储技术路线爬取外汇网站数据。
候选网站：招商银行网：http://fx.cmbchina.com/hq/
输出信息:MYSQL数据库存储和输出格式

2.实验思路

class Test42Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    currency = scrapy.Field()
    tsp = scrapy.Field()
    csp = scrapy.Field()
    tbp = scrapy.Field()
    cbp = scrapy.Field()
    time = scrapy.Field()
    pass

items.py中写上这次作业需要爬取的信息

def start_requests(self):
    url = "http://fx.cmbchina.com/hq/"
    # 构造get请求
    yield scrapy.FormRequest(url=url, callback=self.parse, method="GET",
                                 headers=self.headers)

在start_requests函数中构造get请求

    def parse(self, response):
        try:
            data=response.body.decode("utf-8")
            selector = scrapy.Selector(text=data)
            #print(data)
            trs = selector.xpath("//*[@id='realRateInfo']/table/tr")
            for tr in trs[1:]:
                item = Test42Item()
                item["currency"] = tr.xpath("./td[position()=1]/text()").extract_first().replace("\n","").replace(" ","")
                item["tsp"] = tr.xpath("./td[position()=4]/text()").extract_first().replace("\n","").replace(" ","")
                item["csp"] = tr.xpath("./td[position()=5]/text()").extract_first().replace("\n","").replace(" ","")
                item["tbp"] = tr.xpath("./td[position()=6]/text()").extract_first().replace("\n","").replace(" ","")
                item["cbp"] = tr.xpath("./td[position()=7]/text()").extract_first().replace("\n","").replace(" ","")
                item["time"] = tr.xpath("./td[position()=8]/text()").extract_first().replace("\n","").replace(" ","")
                print(item["currency"])
                print(item["tsp"])
                print(item["csp"])
                print(item["tbp"])
                print(item["cbp"])
                print(item["time"])
                yield item

        except Exception as err:
            print(err)

观察后在parse函数中使用xpath获取自己需要的数据，并将数据中的换行符和空格去除

class Test42Pipeline:
    db = DB()
    count = 1
    def __init__(self):
        self.db.openDB()
    def process_item(self, item, spider):
        try:
            self.db.insert(self.count, item['currency'], item['tsp'], item['csp'], item['tbp'],
                           item['cbp'], item['time'])
            self.count += 1
        except Exception as err:
            print(err)
        return item

最后同作业①，将数据存入数据库中

数据库的信息如图所示

3.心得体会

作业②较简单，并再次熟悉了scrapy框架和xpath方法

作业③:

1.题目

要求：熟练掌握 Selenium 查找HTML元素、爬取Ajax网页数据、等待HTML元素等内容；

使用Selenium框架+ MySQL数据库存储技术路线爬取“沪深A股”、“上证A股”、“深证A股”3个板块的股票数据信息。
候选网站：东方财富网：http://quote.eastmoney.com/center/gridlist.html#hs_a_board

输出信息:MySQL数据库存储和输出格式如下

表头应是英文命名例如：序号id，股票代码：bStockNo……，由同学们自行定义设计表头

序号	股票代码	股票名称	最新报价	涨跌幅	涨跌额	成交量	成交额	振幅	最高	最低	今开	昨收
1	688093	N 世华	28.47	62.22%	10.92	26.13 万	7.6 亿	22.34	32.0	28.08	30.2	17.55
2......

2.实验思路

if __name__ == "__main__":
    url='http://quote.eastmoney.com/center/gridlist.html#hs_a_board'
    #type记录需要点击的两个按钮的xpath路径
    type = ['//*[@id="nav_sh_a_board"]/a','//*[@id="nav_sz_a_board"]/a']
    driver = OpenDriver(url)
    db = DB()
    db.openDB()
    #获取数据部分
    for page in range(1,10):
        GetInfo(page)
        #翻页部分
        if page==3:#翻页2次后第三次翻页改为换一个股票类型
            ToNextType(type[0])
        elif page==6:
            ToNextType(type[1])
        else:
            ToNextPage()#正常翻页
    db.closeDB()
    driver.close()
    print("爬取结束")

主函数部分，主要流程为打开浏览器、进入网页、获取数据、翻页/换股票类型继续获取数据、最后爬取结束。获取数据时将数据存入mysql数据库中

def OpenDriver(url):
    chrome_options = Options()
    # chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)  # 浏览器访问首页
    return driver

OpenDriver函数为使用webdriber打开浏览器

def GetInfo(page):
    time.sleep(3) #等待网页加载
    print(driver.find_element(By.XPATH, '//*[@id="table_wrapper-table"]/tbody').text)
    stocks = driver.find_element(By.XPATH, '//*[@id="table_wrapper-table"]/tbody').text.split("\n")
    #print(stocks)
    if (page-1)//3 == 0:
        type = '沪深A股'
    elif (page-1)//3 == 1:
        type = '上证A股'
    else:
        type = '深证A股'
    for stock in stocks:
        stock = stock.split(" ")
        no = int(stock[0])
        StockNo = stock[1]
        StockName = stock[2]
        LatestQuotation =stock[4]
        FluctuationRange =stock[5]
        RiseAndFall =stock[6]
        Turnover =stock[7]
        TurnoverMoney =stock[8]
        Amplitude =stock[9]
        Highest =stock[10]
        Minimum =stock[11]
        TodayOpen =stock[12]
        ReceivedYesterday =stock[13]
        db.insert(no,type,StockNo,StockName,LatestQuotation,FluctuationRange,RiseAndFall,Turnover,TurnoverMoney,Amplitude,Highest,Minimum,TodayOpen,ReceivedYesterday)

GetInfo函数根据传入的page参数判断现在获取的股票的种类并记录，然后依次获取所有需要的信息并存入数据库中

def ToNextPage():
    locator = (By.XPATH, '//*[@id="main-table_paginate"]/a[2]')
    WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located(locator))#等待按钮加载完毕
    NextPage = driver.find_element(By.XPATH, '//*[@id="main-table_paginate"]/a[2]')
    NextPage.click()

ToNextPage函数为点击按钮进入下一页，准备读取新的数据

def ToNextType(type):
    locator = (By.XPATH, type)
    WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located(locator))#等待按钮加载完毕
    NextType = driver.find_element(By.XPATH, type)
    NextType.click()

ToNextType按钮通过传入的按钮的Xpath路径进行追踪按钮并点击，进入另一种类的股票页面

数据库中存储的信息如图，爬取了三种股票各三页内容，共180行信息

3.心得体会

作业③主要复习了Selenium框架的使用，使用Selenium框架爬取Ajax网页数据比较简单便捷

码云上的代码：2019数据采集与融合技术: 数据采集与融合技术实践作业 - Gitee.com

posted @ 2021-11-10 17:37 zhuangxinpeng 阅读(80) 评论(0) 收藏举报

刷新页面返回顶部

zhuangxinpeng

数据采集与融合技术实践第四次实验作业

1.题目

2.实验思路

3.心得体会

1.题目

2.实验思路

3.心得体会

1.题目

2.实验思路

3.心得体会

公告