Learning Scrapy (《精通Python爬虫框架Scrapy》) 02: Some Typical Problems in the Sample Code

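After a few days with Scrapy I find it genuinely powerful, but running the code that ships with the book as-is keeps producing unexpected errors. They all come down to the following points: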

1. Scrapy version

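Although this book was published in 2018, it uses Scrapy 1.0 by default, while at the time of writing (December 2019) the current version is 1.8.0. I really have nothing to say about that! I looked it up: Scrapy 1.0.0 was released on 2015-06-19 and 1.8.0 on 2019-10-28, a gap of more than four years. Why? Reading the book more carefully, I found that the original author's operating system was Ubuntu 14.04.3 LTS, so evidently Scrapy was installed with apt from the python-scrapy package that Ubuntu provides, whereas I installed the latest release with pip install.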
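
Before running the book's code, it can help to confirm which Scrapy version is actually installed. The following is only a minimal sketch: it assumes Python 3.8 or later (for importlib.metadata), and the helper names installed_version and major_minor are my own, not from the book or from Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_minor(version_string):
    """Turn a version such as '1.8.0' into a comparable tuple such as (1, 8)."""
    parts = version_string.split(".")
    return int(parts[0]), int(parts[1])


if __name__ == "__main__":
    found = installed_version("scrapy")
    if found is None:
        print("Scrapy is not installed in this environment")
    elif major_minor(found) > (1, 0):
        print("Scrapy", found, "is newer than the 1.0 release the book targets")
    else:
        print("Scrapy", found, "is in the 1.0 series the book targets")
```

Comparing tuples rather than raw strings avoids the trap where "1.10" sorts before "1.8" as text.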

2. Python 2 vs. Python 3

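This one is more troublesome. The code in the first few chapters shows no difference, but problems start in chapter 5. Here is the first example of chapter 5, login.py (the code that accompanies the book, from http://github.com/scalingexcellence/scrapybook):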

 1 import datetime
 2 import urlparse
 3 import socket
 4 
 5 from scrapy.loader.processors import MapCompose, Join
 6 from scrapy.linkextractors import LinkExtractor
 7 from scrapy.spiders import CrawlSpider, Rule
 8 from scrapy.loader import ItemLoader
 9 from scrapy.http import FormRequest
10 
11 from properties.items import PropertiesItem
12 
13 
14 class LoginSpider(CrawlSpider):
15     name = 'login'
16     allowed_domains = ["web"]
17 
18     # Start with a login request
19     def start_requests(self):
20         return [
21             FormRequest(
22                 "http://web:9312/dynamic/login",
23                 formdata={"user": "user", "pass": "pass"}
24             )]
25 
26     # Rules for horizontal and vertical crawling
27     rules = (
28         Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
29         Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
30              callback='parse_item')
31     )
32 
33     def parse_item(self, response):
34         """ This function parses a property page.
35 
36         @url http://web:9312/properties/property_000000.html
37         @returns items 1
38         @scrapes title price description address image_urls
39         @scrapes url project spider server date
40         """
41 
42         # Create the loader using the response
43         l = ItemLoader(item=PropertiesItem(), response=response)
44 
45         # Load fields using XPath expressions
46         l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
47                     MapCompose(unicode.strip, unicode.title))
48         l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
49                     MapCompose(lambda i: i.replace(',', ''), float),
50                     re='[,.0-9]+')
51         l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
52                     MapCompose(unicode.strip), Join())
53         l.add_xpath('address',
54                     '//*[@itemtype="http://schema.org/Place"][1]/text()',
55                     MapCompose(unicode.strip))
56         l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
57                     MapCompose(lambda i: urlparse.urljoin(response.url, i)))
58 
59         # Housekeeping fields
60         l.add_value('url', response.url)
61         l.add_value('project', self.settings.get('BOT_NAME'))
62         l.add_value('spider', self.name)
63         l.add_value('server', socket.gethostname())
64         l.add_value('date', datetime.datetime.now())
65 
66         return l.load_item()

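Line 2: Python 2 has a urlparse module, but Python 3 moved its contents into urllib.parse. Even after changing the import to "from urllib.parse import urlparse", the spider still fails with "AttributeError: 'function' object has no attribute 'urljoin'". Why? The cause is line 57, which calls urlparse.urljoin(). In Python 2, urlparse is a module that contains urljoin(); after the new import, urlparse is a function, and in Python 3 urljoin() sits next to it as a sibling inside urllib.parse. So the code should be changed to: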

from urllib.parse import urljoin
...
                    MapCompose(lambda i: urljoin(response.url, i)))
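
To see what urljoin() does on its own under Python 3, here is a small standalone sketch. The base URL is the one from the docstring of the spider above; the image paths and the helper name to_absolute are made up purely for illustration.

```python
from urllib.parse import urljoin


def to_absolute(page_url, link):
    """Resolve a possibly relative link against the URL of the page it came from."""
    return urljoin(page_url, link)


page = "http://web:9312/properties/property_000000.html"

# A relative path is resolved against the directory of the page URL.
relative = to_absolute(page, "../images/i01.jpg")

# A link that is already absolute is returned unchanged.
absolute = to_absolute(page, "http://example.com/a.jpg")

print(relative)
print(absolute)
```

This is the same call that the lambda on line 57 makes for every extracted @src value.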

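Another place that may raise an error is lines 47, 52, and 55, which use unicode.strip and unicode.title. You may see "NameError: name 'unicode' is not defined". The reason is that Python 3 has no built-in unicode type (its str is already Unicode text), so simply replace unicode with str, i.e. str.strip and str.title.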
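
The replacement can be tried outside Scrapy. The sketch below is not Scrapy's MapCompose; apply_in_order is a hypothetical stand-in that just applies each function in turn to every value, which is enough to show that str.strip and str.title behave the way unicode.strip and unicode.title did. The sample titles are invented.

```python
try:
    text_type = unicode  # Python 2: the built-in Unicode text type
except NameError:
    text_type = str      # Python 3: str is already Unicode text


def apply_in_order(values, *functions):
    """Apply each function, left to right, to every value in the list."""
    results = []
    for value in values:
        for function in functions:
            value = function(value)
        results.append(value)
    return results


raw_titles = ["  nice flat in london \n", "\tset unique family home"]
titles = apply_in_order(raw_titles, text_type.strip, text_type.title)
print(titles)
```

Using the text_type alias keeps one source file working on both interpreters; if only Python 3 matters, writing str.strip and str.title directly is simpler.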

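Why do I say "may"? I am honestly not sure the error always appears: on my desktop, which has both Python 2.7 and Python 3.7 installed, nothing was reported, while on a laptop with only Python 3 the error did show up. I have not worked out the underlying cause. One thing worth checking is which interpreter Scrapy actually runs under on each machine.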
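
When two Python versions live on the same machine, a quick check of the running interpreter can narrow this kind of discrepancy down. This is a minimal sketch with helper names of my own choosing; it would need to be run with the same interpreter that the scrapy command uses for the result to mean anything.

```python
import sys


def describe_interpreter():
    """Summarize the version and path of the interpreter running this code."""
    major, minor = sys.version_info[:2]
    return "Python {}.{} at {}".format(major, minor, sys.executable)


def has_unicode_builtin():
    """Return True only where `unicode` is a defined name, as on Python 2."""
    try:
        unicode
    except NameError:
        return False
    return True


print(describe_interpreter())
print("unicode available:", has_unicode_builtin())
```

If the second line prints True, the code is running under Python 2, which would explain why unicode.strip raises no NameError there.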

One last thing: in the end, you have to work things out for yourself. Let's keep at it together!

posted @ 2019-12-17 11:10 cbdeng
