Data Collection: Assignment 4

1. Assignment ①:

Gitee repository: https://gitee.com/wjz51/wjz/tree/master/project_4/4_1

1.1 Requirements:

Become proficient with serialized output of Item and Pipeline data in Scrapy; use the Scrapy + XPath + MySQL technical route to crawl book data from the Dangdang website.

1.2 Approach:

1.2.1 spider

lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
for li in lis:
    title = li.xpath("./a[position()=1]/@title").extract_first()
    price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
    author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()
    date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()
    publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()
    detail = li.xpath("./p[@class='detail']/text()").extract_first()
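These selectors can be exercised offline with lxml (the library Scrapy's own selectors are built on). The HTML fragment below is a hypothetical, heavily simplified version of one Dangdang result item, used only to show how each expression picks out a field; note the predicate is `[@ddt-pit]` (attribute exists), not the string literal `['@ddt-pit']`, which would match every li:

```python
from lxml import etree

# Hypothetical, simplified markup for one result item (not the real page).
html = """
<ul>
  <li ddt-pit="1" class="line1">
    <a title="Example Book"></a>
    <p class="price"><span class="search_now_price">25.00</span></p>
    <p class="search_book_author">
      <span><a title="Example Author"></a></span>
      <span>/2020-01-01</span>
      <span><a title="Example Press"></a></span>
    </p>
    <p class="detail">A short description.</p>
  </li>
</ul>
"""

doc = etree.HTML(html)
# Same expressions as in the spider, run directly through lxml.
lis = doc.xpath("//li[@ddt-pit][starts-with(@class,'line')]")
for li in lis:
    title = li.xpath("./a[position()=1]/@title")[0]
    date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()")[0]
    publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title")[0]
    print(title, date.strip()[1:], publisher)
```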

Some books have no detail field, so detail comes back as None; this is handled as follows:

item = DangdangItem()
item["title"] = title.strip() if title else ""
item["author"] = author.strip() if author else ""
item["date"] = date.strip()[1:] if date else ""  # [1:] drops the leading "/"
item["publisher"] = publisher.strip() if publisher else ""
item["price"] = price.strip() if price else ""
item["detail"] = detail.strip() if detail else ""

1.2.2 items

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()
    price = scrapy.Field()

1.2.3 settings

Add the following:

ITEM_PIPELINES = {
	'dangdang.pipelines.DangdangPipeline': 300,
}
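The integer is the pipeline's order in the range 0–1000; when several pipelines are registered, items pass through them from the lowest value to the highest. A hypothetical example with a second pipeline (CleaningPipeline is invented for illustration):

```python
ITEM_PIPELINES = {
    'dangdang.pipelines.CleaningPipeline': 200,  # hypothetical: would run first
    'dangdang.pipelines.DangdangPipeline': 300,  # then the MySQL writer
}
```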

1.2.4 pipelines

import pymysql

class DangdangPipeline:
    def open_spider(self, spider):  # open the database connection
        print("opened")
        try:
            self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                       passwd="Wjz20010501", db="mydb", charset="utf8")
            self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
            self.cursor.execute("delete from books")
            self.opened = True
            self.count = 0
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):  # close the database connection
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("closed")
        print("Crawled", self.count, "books in total")

    def process_item(self, item, spider):  # insert one record into the database
        try:
            print(item["title"])
            print(item["author"])
            print(item["publisher"])
            print(item["date"])
            print(item["price"])
            print(item["detail"])
            print()
            if self.opened: