[Study Notes] Scrapy
Just getting started with Scrapy.
Working through an exercise from the book: scraping the book information from the site the book links to.
import scrapy

class FuzxSpider(scrapy.Spider):
    name = "Fuzx"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book on the page sits in an <article class="product_pod">
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }
        # Follow the "next" pagination link, if there is one
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
Command: scrapy crawl Fuzx -o fuzx.csv
This saves the scraped items to fuzx.csv.
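Before running the full crawl, the same selectors can be sanity-checked interactively in scrapy shell (a quick sketch; the calls mirror what parse() does):

scrapy shell 'http://books.toscrape.com/'
>>> book = response.css('article.product_pod')[0]
>>> book.xpath('./h3/a/@title').extract_first()      # title of the first book
>>> book.css('p.price_color::text').extract_first()  # its price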

Looks like the crawl succeeded.
Extending this a little: people online say Douban is an easy target for crawler practice.
I opened the fiction page of Douban Read: https://read.douban.com/kind/100
The goal is to scrape book titles and authors.

Then I changed the CSS selectors to match the new page.
That returned a 403 error,
so I added a User-Agent header.
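Passing headers on every Request works; another option I'm aware of is setting the User-Agent once for the whole spider via custom_settings (a minimal sketch, reusing the same idea with an abbreviated UA string):

class DoubSpider(scrapy.Spider):
    name = "doub"
    # applied by Scrapy to every request this spider sends
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; WOW64) ...',
    }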
The second snag was the item selector: the class attribute here contains a space (class="item store-item"), meaning the <li> carries two classes. Following someone else's article, the selector can chain both classes:

for book in response.css('li.item.store-item'):
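To convince myself that chaining classes really means "has both", here is a tiny standalone check with Scrapy's Selector (hypothetical HTML, not the real Douban markup):

from scrapy.selector import Selector

html = '<ul><li class="item store-item">hit</li><li class="item">miss</li></ul>'
sel = Selector(text=html)
# li.item.store-item only matches elements carrying both classes
print(sel.css('li.item.store-item::text').extract())  # ['hit']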
Full code:
import scrapy

class DoubSpider(scrapy.Spider):
    name = "doub"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    def start_requests(self):
        url = 'https://read.douban.com/kind/100'
        yield scrapy.Request(url, headers=self.headers)

    def parse(self, response):
        # class="item store-item" -> chain both classes in the selector
        for book in response.css('li.item.store-item'):
            name = book.css('div.title a::text').extract_first()
            author = book.css('p span.labeled-text a.author-item::text').extract_first()
            yield {
                'name': name,
                'author': author,
            }
        # Follow pagination, keeping the custom headers on every request
        next_url = response.css('div.pagination li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse)
Then run scrapy crawl doub -o db.csv.
Check db.csv after the crawl finishes.
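One thing to watch when the scraped fields contain Chinese text: if db.csv looks garbled when opened in Excel, Scrapy's FEED_EXPORT_ENCODING setting can write the file with a BOM (a hedged tip; this goes in settings.py):

# settings.py: emit a UTF-8 BOM so Excel picks the right encoding
FEED_EXPORT_ENCODING = 'utf-8-sig'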

