Data Acquisition and Fusion Technology: Lab 4

  • Assignment ①:

    1) Crawling book information from Dangdang

– Requirements: master the serialized output of Item and Pipeline data in Scrapy; crawl book data from the Dangdang site via the Scrapy + XPath + MySQL storage route.
– Candidate site: http://search.dangdang.com/?key=python&act=input
– Keyword: chosen freely by the student
– Output: the MySQL output format is as follows:

Process:
1. Create the database mydb and a books table to store the data:
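
A minimal sketch of this step via pymysql is shown below; the column names follow the INSERT statement in pipelines.py further down, while the types and sizes are assumptions.

    # assumed DDL: creates the database and the books table used by the pipeline
    import pymysql

    con = pymysql.connect(host="localhost", port=3306, user="root",
                          passwd="123456", charset="utf8")
    cursor = con.cursor()
    cursor.execute("create database if not exists mydb character set utf8")
    cursor.execute("use mydb")
    cursor.execute("""
        create table if not exists books (
            bTitle varchar(512),
            bAuthor varchar(256),
            bPublisher varchar(256),
            bDate varchar(32),
            bPrice varchar(32),
            bDetail text
        ) default charset=utf8
    """)
    con.commit()
    con.close()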

2. Write items.py

import scrapy


class Pro4Test1Item(scrapy.Item):
    # fields for one book record
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()
    price = scrapy.Field()

3. Modify settings.py

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'pro4_test1.pipelines.Pro4Test1Pipeline': 300,
}

4. Write the spider
1) Set the keyword and the initial URL

    name = "book"
    key = 'python'

    source_url = 'http://search.dangdang.com/'

    def start_requests(self):
        url = BookSpider.source_url + "?key=" + BookSpider.key
        yield scrapy.Request(url=url, callback=self.parse)
2) Use XPath to extract the text fields, and locate the next-page link to implement pagination
    def parse(self, response):
        try:
            dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            selector = scrapy.Selector(text=data)
            lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()
                # detail is sometimes missing, leaving the value None
                item = Pro4Test1Item()
                item["title"] = title.strip() if title else ""
                item["author"] = author.strip() if author else ""
                item["date"] = date.strip()[1:] if date else ""
                item["publisher"] = publisher.strip() if publisher else ""
                item["price"] = price.strip() if price else ""
                item["detail"] = detail.strip() if detail else ""
                yield item
            # follow the next-page link if one exists
            link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
            if link:
                url = response.urljoin(link)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)

5. Write pipelines.py

import pymysql


class Pro4Test1Pipeline:
    def open_spider(self, spider):
        print("opened")
        try:
            self.con = pymysql.connect(host="localhost", port=3306, user="root",
                                       passwd="123456", db="mydb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.cursor.execute("delete from books")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")
        print("总共爬取", self.count, "本书籍")

    def process_item(self, item, spider):
        try:
            print(item["title"])
            print(item["author"])
            print(item["publisher"])
            print(item["date"])
            print(item["price"])
            print(item["detail"])
            print()
            if self.opened:
                self.cursor.execute(
                    "insert into books (bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) values(%s,%s,%s,%s,%s,%s)",
                    (item["title"], item["author"], item["publisher"], item["date"], item["price"], item["detail"]))
                self.count += 1
        except Exception as err:
            print(err)
        return item

6. Database storage results:
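
To double-check what the pipeline wrote, a quick query sketch reusing the connection settings from pipelines.py (not part of the original project):

    import pymysql

    con = pymysql.connect(host="localhost", port=3306, user="root",
                          passwd="123456", db="mydb", charset="utf8")
    cursor = con.cursor()
    cursor.execute("select count(*) from books")
    print("rows stored:", cursor.fetchone()[0])
    cursor.execute("select bTitle, bAuthor, bPrice from books limit 3")
    for row in cursor.fetchall():
        print(row)
    con.close()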

7. Code link: https://gitee.com/huang-dunn/crawl_project/tree/master/实验四作业1

2) Reflections: Reproducing the textbook code made me more familiar with how the Scrapy framework is used and deepened my understanding of database operations.

  • Assignment ②

    1)

– Requirements: master the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage route.
– Candidate site: China Merchants Bank: http://fx.cmbchina.com/hq/
– Output: MySQL storage and output format

Id Currency TSP CSP TBP CBP Time
1 港币 86.60 86.60 86.26 85.65 15:36:30
2...

Process:
1. Create the database:
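
As in assignment 1, a sketch of the banks table: the column names mirror the INSERT in pipelines.py below, and the types are assumptions.

    # assumed DDL for the banks table
    import pymysql

    con = pymysql.connect(host="localhost", port=3306, user="root",
                          passwd="123456", db="mydb", charset="utf8")
    cursor = con.cursor()
    cursor.execute("""
        create table if not exists banks (
            name varchar(64),
            TSP varchar(32),
            CSP varchar(32),
            TBP varchar(32),
            CBP varchar(32),
            time varchar(32)
        ) default charset=utf8
    """)
    con.commit()
    con.close()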

2. Write items.py

import scrapy


class Pro4Test2Item(scrapy.Item):
    # one foreign-exchange quote row
    name = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    time = scrapy.Field()

3. Write the spider, using XPath to extract the fields:

    def start_requests(self):
        url = 'http://fx.cmbchina.com/hq/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        selector = scrapy.Selector(text=data)
        tr_list = selector.xpath('//tr')
        del (tr_list[0], tr_list[1])  # drop the leading header/placeholder rows
        item = Pro4Test2Item()
        flag = 0
        for tr in tr_list:
            # print(tr)
            if flag == 0:
                flag = flag+1
                continue
            item['name'] = tr.xpath('./td[1]/text()').extract_first().strip()
            item['TSP'] = tr.xpath('./td[4]/text()').extract_first()
            item['CSP'] = tr.xpath('./td[5]/text()').extract_first()
            item['TBP'] = tr.xpath('./td[6]/text()').extract_first()
            item['CBP'] = tr.xpath('./td[7]/text()').extract_first()
            item['time'] = tr.xpath('./td[8]/text()').extract_first()
            # strip whitespace and newlines
            item['TSP'] = str(item['TSP']).strip()
            item['CSP'] = str(item['CSP']).strip()
            item['TBP'] = str(item['TBP']).strip()
            item['CBP'] = str(item['CBP']).strip()
            item['time'] = str(item['time']).strip()
            print(item['name'], item['TSP'], item['CSP'], item['TBP'], item['CBP'], item['time'])
            yield item

4. Write pipelines.py to insert the scraped data into the database:

    def open_spider(self, spider):
        print("opened")
        try:
            self.con = pymysql.connect(host="localhost", port=3306, user="root",
                                       passwd="123456", db="mydb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.cursor.execute("delete from banks")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")

    def process_item(self, item, spider):
        try:
            if self.opened:
                self.cursor.execute(
                    "insert into banks (name,TSP,CSP,TBP,CBP,time) values(%s,%s,%s,%s,%s,%s)",
                    (item['name'], item['TSP'], item['CSP'], item['TBP'], item['CBP'], item['time']))
                self.count += 1
        except Exception as err:
            print(err)
        return item

5. Output:

6. Database storage:

7. Code link: https://gitee.com/huang-dunn/crawl_project/tree/master/实验四作业2

2) Reflections: I became more proficient with the Scrapy framework, gained a deeper understanding of XPath text matching, and learned more about insert/delete/update/select operations in MySQL.

  • Assignment ③

    1)

– Requirements: master finding HTML elements with Selenium, crawling Ajax-loaded pages, and waiting for HTML elements; use the Selenium framework + MySQL storage route to crawl stock data for the three boards 沪深A股 (SSE & SZSE A shares), 上证A股 (SSE A shares) and 深证A股 (SZSE A shares).
– Candidate site: East Money: http://quote.eastmoney.com/center/gridlist.html#hs_a_board
– Output: MySQL storage and output format as below; the column headers should be named in English (e.g. id for the serial number, bStockNo for the stock code, etc.) and are to be designed by the students themselves:

No.  Code    Name   Latest  Change%  Change  Volume   Turnover  Amplitude  High  Low    Open  Prev.Close
1    688093  N世华  28.47   62.22%   10.92   26.13万  7.6亿     22.34      32.0  28.08  30.2  17.55
2 ...

Process:
1. Analyze the page:
1) Location of the required information:

Each stock's information sits inside a tr element of the results table.

2) Location of the next-page link

Clicking the next-page button shows that the XPath of the pagination link is //*[@id="main-table_paginate"]/a[2]

2. Create the database:
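
A sketch of the informs table used below; the column names match the INSERT in the crawler function, and the types are assumptions.

    # assumed DDL for the informs table
    import pymysql

    con = pymysql.connect(host='localhost', port=3306, user='root',
                          password='123456', db='mydb', charset='utf8')
    cursor = con.cursor()
    cursor.execute("""
        create table if not exists informs (
            part varchar(32), id int, node varchar(16), name varchar(32),
            new_price varchar(16), rise_range varchar(16), rise_price varchar(16),
            change_num varchar(32), change_price varchar(32), rise varchar(16),
            highest varchar(16), lowest varchar(16), today varchar(16), yesterday varchar(16)
        ) default charset=utf8
    """)
    con.commit()
    con.close()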

3. Write the crawler function:

import pymysql
from time import sleep

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


def spider(url, name_):
    # crawl one board: walk up to 5 result pages and insert each row into MySQL
    global count
    driver.get(url)
    sleep(3)
    temp = 1
    while True:
        tr_list = driver.find_elements('xpath', '//*[@id="table_wrapper-table"]/tbody/tr')
        for tr in tr_list:
            data = tr.text
            # print(data)
            data = data.split(" ")
            print(count, data[1], data[2], data[6], data[7], data[8], data[9], data[10], data[11], data[12], data[13],
                  data[14], data[15])
            try:
                cursor.execute("insert into informs (part,id,node,name,new_price,rise_range,rise_price,change_num,"
                               "change_price,rise,highest,lowest,today,yesterday) "
                               "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                               (
                                   name_, count, data[1], data[2], data[6], data[7], data[8], data[9], data[10],
                                   data[11], data[12], data[13], data[14], data[15]))
                count += 1
            except Exception as err:
                print(err)
        try:  # pagination: locate the next-page button
            next_page = WebDriverWait(driver, 3, 0.2).until(
                lambda x: x.find_element('xpath', '//*[@id="main-table_paginate"]/a[2]'))
        except Exception as e:
            print(e)
            break
        if temp >= 5:  # crawl 5 pages for each board
            break
        else:
            next_page.click()
            sleep(3)
            driver.refresh()  # refresh the existing browser window (webdriver.Chrome() would open a new one)
            temp = temp + 1

4. Main code:

conn = pymysql.connect(host='localhost', port=3306, user='root', password='123456', db='mydb',
                       charset='utf8')
cursor = conn.cursor()
driver = webdriver.Chrome()  # the shared browser instance used inside spider()
count = 1
code = ["hs_a_board", "sh_a_board", "sz_a_board"]
name = ['沪深A股', '上证A股', '深证A股']
for i in range(0, 3):
    url_ = 'http://quote.eastmoney.com/center/gridlist.html#' + code[i]
    spider(url_, name[i])
    conn.commit()
    print(name[i] + ' data has been written to the database')

5. Output:

6. Database storage results:

7. Code link: https://gitee.com/huang-dunn/crawl_project/blob/master/实验四作业3/project_four_test3.py

2) Reflections: I now have a basic understanding of crawling data with a Selenium-driven browser; after clicking the next-page button, leave time for the page to load and then refresh it. Also, next time I should read the requirements more carefully (the table headers should be in English).
