Data Acquisition and Fusion Technology: Lab 4

  • Assignment ①:

    1) Crawling book information from Dangdang

– Requirements: master the serialized output of Item and Pipeline data in Scrapy; crawl book data from the Dangdang site via the Scrapy + XPath + MySQL storage route.
– Candidate site: http://search.dangdang.com/?key=python&act=input
– Keyword: chosen freely by the student
– Output: the MySQL output format is as follows:

Process:
1. Create the database mydb and a books table to store the data:
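
A minimal sketch of this step via pymysql is shown below; the column names follow the INSERT statement in pipelines.py further down, while the types and sizes are assumptions.

    # assumed DDL: creates the database and the books table used by the pipeline
    import pymysql

    con = pymysql.connect(host="localhost", port=3306, user="root",
                          passwd="123456", charset="utf8")
    cursor = con.cursor()
    cursor.execute("create database if not exists mydb character set utf8")
    cursor.execute("use mydb")
    cursor.execute("""
        create table if not exists books (
            bTitle varchar(512),
            bAuthor varchar(256),
            bPublisher varchar(256),
            bDate varchar(32),
            bPrice varchar(32),
            bDetail text
        ) default charset=utf8
    """)
    con.commit()
    con.close()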

2. Write items.py

import scrapy


class Pro4Test1Item(scrapy.Item):
    # fields for one book record
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()
    price = scrapy.Field()

3. Modify settings.py

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'pro4_test1.pipelines.Pro4Test1Pipeline': 300,
}

4. Write the spider
1) Set the keyword and the initial URL

    name = "book"
    key = 'python'

    source_url = 'http://search.dangdang.com/'

    def start_requests(self):
        url = BookSpider.source_url + "?key=" + BookSpider.key
        yield scrapy.Request(url=url, callback=self.parse)
2) Use XPath to extract the text fields, and locate the next-page link to implement pagination
    def parse(self, response):
        try:
            dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            selector = scrapy.Selector(text=data)
            lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
            for li in lis:
                title = li.xpath("./a[position()=1]/@title").extract_first()
                price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
                author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
                date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
                publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
                detail = li.xpath("./p[@class='detail']/text()").extract_first()
                # detail is sometimes missing, leaving the value None
                item = Pro4Test1Item()
                item["title"] = title.strip() if title else ""
                item["author"] = author.strip() if author else ""
                item["date"] = date.strip()[1:] if date else ""
                item["publisher"] = publisher.strip() if publisher else ""
                item["price"] = price.strip() if price else ""
                item["detail"] = detail.strip() if detail else ""
                yield item
            # follow the next-page link if one exists
            link = selector.xpath("//div[@class='paging']/ul[@name='Fy']/li[@class='next']/a/@href").extract_first()
            if link:
                url = response.urljoin(link)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)

5. Write pipelines.py

import pymysql


class Pro4Test1Pipeline:
    def open_spider(self, spider):
        print("opened")
        try:
            self.con = pymysql.connect(host="localhost", port=3306, user="root",
                                       passwd="123456", db="mydb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.cursor.execute("delete from books")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")
        print("总共爬取", self.count, "本书籍")

    def process_item(self, item, spider):
        try:
            print(item["title"])
            print(item["author"])
            print(item["publisher"])
            print(item["date"])
            print(item["price"])
            print(item["detail"])
            print()
            if self.opened:
                self.cursor.execute(
                    "insert into books (bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) values(%s,%s,%s,%s,%s,%s)",
                    (item["title"], item["author"], item["publisher"], item["date"], item["price"], item["detail"]))
                self.count += 1
        except Exception as err:
            print(err)
        return item

6. Database storage results:
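
To double-check what the pipeline wrote, a quick query sketch reusing the connection settings from pipelines.py (not part of the original project):

    import pymysql

    con = pymysql.connect(host="localhost", port=3306, user="root",
                          passwd="123456", db="mydb", charset="utf8")
    cursor = con.cursor()
    cursor.execute("select count(*) from books")
    print("rows stored:", cursor.fetchone()[0])
    cursor.execute("select bTitle, bAuthor, bPrice from books limit 3")
    for row in cursor.fetchall():
        print(row)
    con.close()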

7. Code link: https://gitee.com/huang-dunn/crawl_project/tree/master/实验四作业1

2) Reflections: Reproducing the textbook code made me more familiar with how the Scrapy framework is used and deepened my understanding of database operations.

  • Assignment ②

    1)

– Requirements: master the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage route.
– Candidate site: China Merchants Bank: http://fx.cmbchina.com/hq/
– Output: MySQL storage and output format

Id Currency TSP CSP TBP CBP Time
1 港币 86.60 86.60 86.26 85.65 15:36:30
2...

Process:
1. Create the database:
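
As in assignment 1, a sketch of the banks table: the column names mirror the INSERT in pipelines.py below, and the types are assumptions.

    # assumed DDL for the banks table
    import pymysql

    con = pymysql.connect(host="localhost", port=3306, user="root",
                          passwd="123456", db="mydb", charset="utf8")
    cursor = con.cursor()
    cursor.execute("""
        create table if not exists banks (
            name varchar(64),
            TSP varchar(32),
            CSP varchar(32),
            TBP varchar(32),
            CBP varchar(32),
            time varchar(32)
        ) default charset=utf8
    """)
    con.commit()
    con.close()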

2. Write items.py

import scrapy


class Pro4Test2Item(scrapy.Item):
    # one foreign-exchange quote row
    name = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    time = scrapy.Field()

3. Write the spider, using XPath to extract the fields:

    def start_requests(self):
        url = 'http://fx.cmbchina.com/hq/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        selector = scrapy.Selector(text=data)
        tr_list = selector.xpath('//tr')
        del (tr_list[0], tr_list[1])  # drop the leading header/placeholder rows
        item = Pro4Test2Item()
        flag = 0
        for tr in tr_list:
            # print(tr)
            if flag == 0:
                flag = flag+1
                continue
            item['name'] = tr.xpath('./td[1]/text()').extract_first().strip()
            item['TSP'] = tr.xpath('./td[4]/text()').extract_first()
            item['CSP'] = tr.xpath('./td[5]/text()').extract_first()
            item['TBP'] = tr.xpath('./td[6]/text()').extract_first()
            item['CBP'] = tr.xpath('./td[7]/text()').extract_first()
            item['time'] = tr.xpath('./td[8]/text()').extract_first()
            # strip whitespace and newlines
            item['TSP'] = str(item['TSP']).strip()
            item['CSP'] = str(item['CSP']).strip()
            item['TBP'] = str(item['TBP']).strip()
            item['CBP'] = str(item['CBP']).strip()
            item['time'] = str(item['time']).strip()
            print(item['name'], item['TSP'], item['CSP'], item['TBP'], item['CBP'], item['time'])
            yield item

4. Write pipelines.py to insert the scraped data into the database:

    def open_spider(self, spider):
        print("opened")
        try:
            self.con = pymysql.connect(host="localhost", port=3306, user="root",
                                       passwd="123456", db="mydb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.cursor.execute("delete from banks")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")

    def process_item(self, item, spider):
        try:
            if self.opened:
                self.cursor.execute(
                    "insert into banks (name,TSP,CSP,TBP,CBP,time) values(%s,%s,%s,%s,%s,%s)",
                    (item['name'], item['TSP'], item['CSP'], item['TBP'], item['CBP'], item['time']))
                self.count += 1
        except Exception as err:
            print(err)
        return item

5. Output:

6. Database storage:

7. Code link: https://gitee.com/huang-dunn/crawl_project/tree/master/实验四作业2

2) Reflections: I became more proficient with the Scrapy framework, gained a deeper understanding of XPath text matching, and learned more about insert/delete/update/select operations in MySQL.

  • Assignment ③

    1)

– Requirements: master finding HTML elements with Selenium, crawling Ajax-loaded pages, and waiting for HTML elements; use the Selenium framework + MySQL storage route to crawl stock data for the three boards 沪深A股 (SSE & SZSE A shares), 上证A股 (SSE A shares) and 深证A股 (SZSE A shares).
– Candidate site: East Money: http://quote.eastmoney.com/center/gridlist.html#hs_a_board
– Output: MySQL storage and output format as below; the column headers should be named in English (e.g. id for the serial number, bStockNo for the stock code, etc.) and are to be designed by the students themselves:

No.  Code    Name   Latest  Change%  Change  Volume   Turnover  Amplitude  High  Low    Open  Prev.Close
1    688093  N世华  28.47   62.22%   10.92   26.13万  7.6亿     22.34      32.0  28.08  30.2  17.55
2 ...

Process:
1. Analyze the page:
1) Location of the required information:

Each stock's information sits inside a tr element of the results table.

2) Location of the next-page link

Clicking the next-page button shows that the XPath of the pagination link is //*[@id="main-table_paginate"]/a[2]

2. Create the database:
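
A sketch of the informs table used below; the column names match the INSERT in the crawler function, and the types are assumptions.

    # assumed DDL for the informs table
    import pymysql

    con = pymysql.connect(host='localhost', port=3306, user='root',
                          password='123456', db='mydb', charset='utf8')
    cursor = con.cursor()
    cursor.execute("""
        create table if not exists informs (
            part varchar(32), id int, node varchar(16), name varchar(32),
            new_price varchar(16), rise_range varchar(16), rise_price varchar(16),
            change_num varchar(32), change_price varchar(32), rise varchar(16),
            highest varchar(16), lowest varchar(16), today varchar(16), yesterday varchar(16)
        ) default charset=utf8
    """)
    con.commit()
    con.close()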

3. Write the crawler function:

import pymysql
from time import sleep

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


def spider(url, name_):
    # crawl one board: walk up to 5 result pages and insert each row into MySQL
    global count
    driver.get(url)
    sleep(3)
    temp = 1
    while True:
        tr_list = driver.find_elements('xpath', '//*[@id="table_wrapper-table"]/tbody/tr')
        for tr in tr_list:
            data = tr.text
            # print(data)
            data = data.split(" ")
            print(count, data[1], data[2], data[6], data[7], data[8], data[9], data[10], data[11], data[12], data[13],
                  data[14], data[15])
            try:
                cursor.execute("insert into informs (part,id,node,name,new_price,rise_range,rise_price,change_num,"
                               "change_price,rise,highest,lowest,today,yesterday) "
                               "values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                               (
                                   name_, count, data[1], data[2], data[6], data[7], data[8], data[9], data[10],
                                   data[11], data[12], data[13], data[14], data[15]))
                count += 1
            except Exception as err:
                print(err)
        try:  # pagination: locate the next-page button
            next_page = WebDriverWait(driver, 3, 0.2).until(
                lambda x: x.find_element('xpath', '//*[@id="main-table_paginate"]/a[2]'))
        except Exception as e:
            print(e)
            break
        if temp >= 5:  # crawl 5 pages for each board
            break
        else:
            next_page.click()
            sleep(3)
            driver.refresh()  # refresh the existing browser window (webdriver.Chrome() would open a new one)
            temp = temp + 1

4. Main code:

conn = pymysql.connect(host='localhost', port=3306, user='root', password='123456', db='mydb',
                       charset='utf8')
cursor = conn.cursor()
driver = webdriver.Chrome()  # the shared browser instance used inside spider()
count = 1
code = ["hs_a_board", "sh_a_board", "sz_a_board"]
name = ['沪深A股', '上证A股', '深证A股']
for i in range(0, 3):
    url_ = 'http://quote.eastmoney.com/center/gridlist.html#' + code[i]
    spider(url_, name[i])
    conn.commit()
    print(name[i] + ' data has been written to the database')

5. Output:

6. Database storage results:

7. Code link: https://gitee.com/huang-dunn/crawl_project/blob/master/实验四作业3/project_four_test3.py

2) Reflections: I now have a basic understanding of crawling data with a Selenium-driven browser; after clicking the next-page button, leave time for the page to load and then refresh it. Also, next time I should read the requirements more carefully (the table headers should be in English).
