Ramblings on a Python Scraper

Recently my boss handed me a task: export all of the data from our unit's database and save it into local Excel spreadsheets. I thought, how hard can that be? Give me the database account and password and a few statements will do it.

It turned out to be a big letdown. The unit's database can only be viewed through the back-office management system, the platform offers no bulk export at all, and direct access to the database was out of the question; the higher-ups wouldn't approve it.

So the clumsy way it is: crawl it all down with a web scraper!

And so I picked up the Python I had dropped for half a year and started working out how to write a simple little scraper.

 

The idea behind writing a scraper in Python is actually simple. Briefly:

1) Simulate the login in Python, mainly to obtain the cookie.

2) Analyze the data carried in the HTTP packets exchanged with the platform, mainly the requests and the responses.

The odd thing about this platform is that the data cannot be extracted in one shot. First you fetch a big list, which is paginated, and then you click into every item on the list to see its details.

From analyzing the HTTP traffic back and forth, the flow is roughly:

simulated login -> request the list (POST, JSON) -> list data returned (JSON) -> request the details (GET) -> detail page returned (HTML)

A complete record is the list data and the detail-page data stitched together. The former only needs the JSON parsed; for the detail page, however, the HTML has to be parsed and the wanted fields extracted.
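As a minimal sketch of that flow with the requests library (the URLs, field names such as rows and id, and the login parameters are invented here for illustration; the real platform's endpoints naturally differ):

import requests

session = requests.Session()

# 1) simulated login -- the Session keeps the returned cookie for all later requests
session.post('http://example.com/login', data={'user': 'xxx', 'password': 'xxx'})

# 2) request one page of the list (POST with a JSON body); the response is JSON
resp = session.post('http://example.com/api/list', json={'page': 1, 'pageSize': 50})
items = resp.json()['rows']    # 'rows' is a hypothetical field name

# 3) for each list item, request the detail page (GET); the response is HTML
for item in items:
    html = session.get('http://example.com/detail', params={'id': item['id']}).text
    # ... parse the needed fields out of html and merge them with item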

The flow is not complicated, but writing it is full of traps. This post is devoted to the traps I stepped in, three big ones in particular.

 

Pitfall No. 1: Python's maddening encodings


 

This pitfall breaks down into a few smaller questions.

1) What is the relationship between Unicode and UTF-8?

A one-liner on Zhihu explains it nicely: UTF-8 is one way of encoding the Unicode character set.

The Unicode character set itself is a mapping: it ties every real-world character to a numeric value, a purely logical relationship. UTF-8 is a separate encoding on top of that, an algorithm for encoding the values that Unicode assigns.

Put simply: character -> Unicode -> UTF-8

For example: the Chinese “你好” -> \u4f60\u597d -> \xe4\xbd\xa0\xe5\xa5\xbd
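In Python 2 terms, a quick interactive check of that chain looks like this:

u = u'你好'
[hex(ord(c)) for c in u]        # ['0x4f60', '0x597d'] -- the Unicode code points
u.encode('utf-8')               # '\xe4\xbd\xa0\xe5\xa5\xbd' -- the UTF-8 byte stream
len(u), len(u.encode('utf-8'))  # (2, 6) -- 2 code points become 6 bytes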

 

2) Then what is the relationship between str and unicode?

str and unicode are Python 2.x concepts.

For example, s = u'你好'

The variable s is a unicode string, i.e. a unicode object (type(s) == unicode). Strictly speaking, unicode is a data type defined inside Python; it is an abstraction, not a storage entity.

The official Python language reference explains it as:

Unicode
The items of a Unicode object are Unicode code units. A Unicode code unit is represented by a Unicode object of one item and can hold either a 16-bit or 32-bit value representing a Unicode ordinal (the maximum value for the ordinal is given in sys.maxunicode, and depends on how Python is configured at compile time). Surrogate pairs may be present in the Unicode object, and will be reported as two separate items. The built-in functions unichr() and ord() convert between code units and nonnegative integers representing the Unicode ordinals as defined in the Unicode Standard 3.0. Conversion from and to other encodings are possible through the Unicode method encode() and the built-in function unicode().

Here len(s) == 2, and the stored value is \u4f60\u597d.

As for str, besides representing ordinary strings it can also represent Python's raw data stream; think of it as a byte stream, i.e. binary data.

The official Python language reference says:

Strings
The items of a string are characters. There is no separate character type; a character is represented by a string of one item. Characters represent (at least) 8-bit bytes. The built-in functions chr() and ord() convert between characters and nonnegative integers representing the byte values. Bytes with the values 0-127 usually represent the corresponding ASCII values, but the interpretation of values is up to the program. The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file.
(On systems whose native character set is not ASCII, strings may use EBCDIC in their internal representation, provided the functions chr() and ord() implement a mapping between ASCII and EBCDIC, and string comparison preserves the ASCII order. Or perhaps someone can propose a better rule?) 

In addition, there is this description:

Python has two different datatypes. One is 'unicode' and the other is 'str'.

Type 'unicode' is meant for working with codepoints of characters.

Type 'str' is meant for working with encoded binary representation of characters.

Taking the “你好” above as an example, its Unicode is \u4f60\u597d. That value can then be encoded once more, with UTF-8, into a new byte stream: \xe4\xbd\xa0\xe5\xa5\xbd

In Python 3 every str became unicode, and Python 3's bytes took over the role of 2.x's str.

There is a Stack Overflow answer that puts it well: http://stackoverflow.com/questions/18034272/python-str-vs-unicode-types

unicode, which is python 3's str, is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes (e.g. utf-8, latin-1...). Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.

On the contrary str is a plain sequence of bytes. It does not represent text! In fact, in python 3 str is called bytes.

You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str. 

Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. 

With that, things are pretty clear.

 

3) How to use encode and decode.

With the two points above in place, usage is not hard.

encode is unicode -> str: meaningful text turned into a byte stream.

decode is str -> unicode: a byte stream turned back into meaningful text.

decode is called on a str; encode is called on a unicode.

For example:

u_a = u'你好'              # a unicode string
u_a                        # outputs u'\u4f60\u597d'
s_a = u_a.encode('utf-8')  # UTF-8-encode u_a into a byte stream
s_a                        # outputs '\xe4\xbd\xa0\xe5\xa5\xbd'
u_a_ = s_a.decode('utf-8') # UTF-8-decode s_a back into unicode
u_a_                       # outputs u'\u4f60\u597d'

UTF-8 is one encoding method; besides it there are other common ones such as GBK.
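For instance, the same two characters come out as different byte streams under different codecs:

u'你好'.encode('utf-8')   # '\xe4\xbd\xa0\xe5\xa5\xbd'
u'你好'.encode('gbk')     # '\xc4\xe3\xba\xc3'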

 

4) What is the difference between #coding:utf-8 and setdefaultencoding?

#coding:utf-8 declares the encoding of the source file; without it, the source file may not contain Chinese characters.

setdefaultencoding sets the default encoding that unicode data uses while the code runs ("Set the current default string encoding used by the Unicode implementation."). The point is that unicode can be encoded in many ways, including UTF-8, UTF-16 and UTF-32, and for Chinese there is also GBK. When decode or encode is called without an explicit argument, this default codec is the one used.

Note that in IDLE on Windows, a literal without an explicit u prefix is encoded as GBK by default.

Here is an example:

a = '你好'           # in IDLE on Windows, a is GBK-encoded
a                    # outputs '\xc4\xe3\xba\xc3', which is GBK
b = a.decode('gbk')  # GBK-decode it into unicode
b                    # outputs u'\u4f60\u597d'
print b              # outputs 你好
b = a.decode()       # with no argument the default ascii codec is used, so this raises
                     # UnicodeDecodeError: 'ascii' codec can't decode byte
a = u'你好'
b = a.encode()       # likewise raises
                     # UnicodeEncodeError: 'ascii' codec can't encode characters

So when exactly does Python fall back on the default encoding?

I will not attempt a complete list, but from what I have seen in practice the following cases definitely involve an implicit conversion.

1. Calling encode on a str, or decode on a unicode.

Someone on Stack Overflow explains it: http://stackoverflow.com/questions/11339955/python-string-encode-decode

In the second case you do the reverse attempting to encode a byte string. Encoding is an operation that converts unicode to a byte string so Python helpfully attempts to convert your byte string to unicode first 

In other words, str.encode() is effectively equivalent to str.decode(sys.getdefaultencoding()).encode().

When you encode a str, Python first performs an implicit decode on it, and only then the encode().

If the system default encoding is ascii and the str contains code points outside the ASCII range, i.e. things like Chinese characters, then that implicit ascii decode is bound to fail.
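A minimal sketch of that hidden decode (assuming the default encoding is still ascii):

import sys

s = u'你好'.encode('utf-8')   # s is a str holding UTF-8 bytes
sys.getdefaultencoding()      # 'ascii'
# s.encode('utf-8') behaves like s.decode(sys.getdefaultencoding()).encode('utf-8'),
# so it is the implicit ascii decode that blows up:
s.encode('utf-8')             # UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 ...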

2. Anywhere str() may be called implicitly, for example when calling write on a file.

Look at the following example:

a = u'你好'

f = open('test.txt', 'w')

f.write(a)  # raises UnicodeEncodeError: 'ascii' codec can't
            # encode characters in position 0-1: ordinal not in range(128)

str(a)      # raises the same error

Since a is unicode, writing it to the file, or converting it to str, necessarily performs an encode. If the system default is ascii, it blows up.

In the snippet above, changing f.write(a) to f.write(a.encode('utf-8')) fixes it: just state the encoding explicitly.

If that feels like too much bother, adding sys.setdefaultencoding("utf-8") at the top of the script also does the trick.
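For the record, in Python 2 that function is deleted from sys during interpreter start-up, so scripts that take this route usually reload the module first; a sketch of the (discouraged, see the note right below) hack:

import sys
reload(sys)                      # setdefaultencoding is removed by site.py; reload restores it
sys.setdefaultencoding('utf-8')  # implicit conversions now use utf-8 instead of ascii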

[Note] With Python 3 on the rise, the sys.setdefaultencoding("utf-8") trick can be dropped. Even on 2.x it was never a recommended practice.

(See http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script

"Also, the use of sys.setdefaultencoding() has always been discouraged, and it has become a no-op in py3k. The encoding is hard-wired to utf-8 and changing it raises an error.")

So for all the discussion it attracted, it is of no use at all in the Python 3 era. Still, knowing a bit about how Python evolved and corrected itself gives a fresh view of the language.

 

Pitfall No. 2: bizarre user input and tangled regular expressions


 

I had only a rough acquaintance with regular expressions before and had never dug in and used them properly. This time, since the fields I needed had to be pulled out of HTML, there was no way around squaring off against regex for real.

By convention you would extract this sort of thing with something like BeautifulSoup. Unfortunately none of the key items carried useful div tags or name/class attributes, so in the end I had to yank them out with regular expressions.

I will not recount the basic syntax here; I only want to describe a few small problems I, as a beginner, ran into in actual use.

 

1) User input is bizarre; I had not considered every case.

Originally I wanted to extract Chinese text and assumed [\u4e00-\u9fa5] would be enough. I was wrong.

Not that the range is wrong for Chinese; it is that the user input was too bizarre.

For example, one teacher filled in the school address as: 北京市 海淀区 学院路(32)号

Yes, you read that right: there are spaces! Half-width parentheses! Full-width parentheses! Digits!

Traps like this are impossible to guard against. In the end I had to widen the character class, and that solved it.
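For illustration, a widened class along these lines copes with that kind of address (the exact set of allowed characters is my guess, not the original pattern):

# -*- coding: utf-8 -*-
import re

# CJK characters, ASCII letters and digits, whitespace, half- and full-width parentheses
addr_re = re.compile(ur'[\u4e00-\u9fa5A-Za-z0-9\s()（）]+')
print addr_re.search(u'北京市 海淀区 学院路(32)号').group(0)   # the whole address survives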

 

2) Understand the difference between (.*) and something like (b*)

The * means the preceding element is matched repeatedly (zero or more times); that element can be the wildcard . or a concrete character such as b.

The thing to remember is that once a match begins, it is the consecutive run that gets counted.

The reason .* keeps matching is that, to it, different characters count as repetitions of the same thing.

For example, .* matching abc yields the whole of abc, because a, b and c all qualify as '.', so here they all count as "consecutive repetitions" of '.'.

By the same logic the concrete b* matched against abcd yields b. Someone might say c matches too, repeated zero times, and so does d, zero times.

That reading is completely wrong. b* starts counting from the moment a b is met and keeps tallying consecutive b's; the counting begins only on hitting a b and stops as soon as a character that is not b appears. The characters that follow are not matched against b* any more, so there is no question of counting them as zero repetitions.

So get the concepts straight.
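Two quick checks make the distinction concrete:

import re

re.search(r'.*', 'abcd').group()      # 'abcd' -- every character counts as a repeat of '.'
re.search(r'ab*c', 'abbbcd').group()  # 'abbbc' -- b* only consumes the consecutive run of b's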

 

3) General rules of regex matching

By general rules I mostly mean the greedy-versus-lazy question.

We know .* follows the greedy rule and .*? the lazy rule, so what happens when the two are chained in one pattern? How is their precedence handled?

There is a very good English article on this; it is reproduced here.

Regular Expressions: The Rules
By hari on Jan 24, 2010
The following are the rules, a non-POSIX regular expression engine(such as in PERL, JAVA, etc ) would adhere to while attempting to match with the string,
Notation: the examples would list the given regex(pattern) , the string tested against (string) and the actual match happened in the string in between '<<<' and '>>>'.
1. The match that begins earliest/leftmost wins.
The intention is to match the cat at the end but the 'cat' in the catalogue won the match as it appears leftmost in the string.
pattern :cat
string :This catalogue has the names of different species of cat.
Matched: This <<< cat >>> alogue has the names of different species of cat.
1a.The leftmost match in the string wins, irrespective of the order a pattern appears in alternation
Though last in the alternation, 'catalogue' got the match as it appeared leftmost among the patterns in the alternation.
pattern :species|names|catalogue
string :This catalogue has the names of different species of cat.
Matched: This <<< catalogue >>>  has the names of different species of cat.
1b. If there are more than one plausible match occurs in the same position, then the order of the plausible matching patterns in the alternation counts.
All three patterns have a possible match at the same position, but 'over' is successful as it appeared first in the alternation.
pattern :over|o|overnight
string :Actually, I'm an overnight success. But it took twenty years.
Matched: Actually, I'm an <<< over >>> night success. But it took twenty years.
2. The standard quantifiers (*, +, ? and {m,n}) are greedy
Greediness (*, +, ?) would always try to match more before it tries to match the minimum characters needed for the match to be successful ('0' for *, ?; '1' for +)
The intention is to match the "Joy is prayer", though .* went past all the double quotes, grabbing all the strings, only to match the last double quote (").
pattern :".*"
string :"Joy is prayer"."Joy is strength"."Joy is Love".
Matched: <<< "Joy is prayer"."Joy is strength"."Joy is Love" >>> .
2a. Lazy quantifiers would favor the minimum match
Laziness (*?, +?, ??) would always try to settle with the minimum characters needed for the match to be successful before it tries to match the maximum.
The first double-quoted string that appears is what the lazy quantifier matched.
pattern :".*?"
string :"Joy is prayer"."Joy is strength"."Joy is Love".
Matched: <<< "Joy is prayer" >>> ."Joy is strength"."Joy is Love".
2b. The only time the greedy quantifiers would give up what they've matched earlier and settle for less is 'when matching too much ends up causing some later part of the regex to fail'.
The \w* would match the whole word 'regular_expressions' initially. Later, since 's' didn't have a character left to match and was about to fail, it would trigger the \w* to backtrack and match one character less. Thus the final 's' matches the 's' just released by \w* and the whole match succeeds.
Note: Though the pattern would work the same way without parentheses, I've used them to show the individual matches in $1, $2, etc.
pattern :(\w*)(s)
string :regular_expressions
Matched: <<< regular_expressions >>>
$1 = regular_expression
$2 = s
Similarly, the initial match 'x' by 'x*' was given up later in favor of the last 'x' in the pattern.
pattern :(x*)(x)
string :ox
Matched: o<<< x >>>
$1 =
$2 = x
2c. When more than one greedy quantifier appears in a pattern, the first greedy one gets the preference.
Though the .* initially matched the whole string, the [0-9]+ was able to grab just one digit, '5', back from the .*, and the [0-9]+ settles for it since that satisfies its minimum match criteria. Note that '+' is also a greedy quantifier, and here it can't grab beyond its minimum requirement, since another greedy quantifier already shares the same match.
Enter pattern :(.*)([0-9]+)
Enter string :Bangalore-560025
Matched: <<< Bangalore-560025 >>>
$1 = Bangalore-56002
$2 = 5
3. Overall match takes precedence.
The ability to report a successful match takes precedence. As shown in the previous example, if it is necessary for a successful match, the quantifiers (greedy or lazy) will work in harmony with the rest of the pattern.

Three rules in all; see the article for the details. If we abbreviate greedy as G and lazy as L, here is how the different combinations prioritize.

First, the highest priority is making the overall match succeed.

Then, provided the match can succeed, there are four combinations:

G + G: the first can be as greedy as it likes and match as much as possible; the second settles for the minimum it needs.

G + L: same as above, the first can be as greedy as it likes and match as much as possible; the second settles for the minimum.

L + G: the first matches as little as possible, the second as much as possible; the first's minimum is secured before the second is looked after.

L + L: both match as little as possible; the first is secured first, then the second is looked after.

Summed up it is actually simple: first make sure the overall pattern can match; after that, whoever comes first matches first and has the higher priority (whether it is the laziest or the greediest). That sounds dizzying, so here are some examples. They are a bit messy; take your time with them.

>>> t = u'城市:</td>    <td>   ="张好"; 参数";'
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]*)";', t)   
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+)";', t)   
>>> print temp.group(1)
数
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+)";', t)  
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]*)";', t) 
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]?)";', t)
>>> print temp.group(1)
好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]?)";', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]*?)";', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+?)";', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+?)', t)
>>> print temp.group(1)
张
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*?([\u4e00-\u9fa5]+)', t)
>>> print temp.group(1)
张好
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+)"', t)
>>> print temp.group(1)
数
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]?)"', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]+?)"', t)
>>> print temp.group(1)
数
>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]*)"', t)
>>> print temp.group(1)

>>> temp = re.search(ur'城市:</td>[\s\S]*?<td>[\s\S]*([\u4e00-\u9fa5]{2})"', t)
>>> print temp.group(1)
参数
>>> 

 

Pitfall No. 3: reading and writing files in Python


 

Again, a few small questions to go through.

1) What is the difference between w+ and r+?

r+ : Open for reading and writing. The stream is positioned at the beginning of the file.

w+ : Open for reading and writing. The file is created if it does not exist, otherwise it is truncated. The stream is positioned at the beginning of the file.

Essentially, r+ can be called "read first, then write", and w+ "write first, then read". r+ does not clear the file, whereas w+ truncates it the moment it is opened.

So when reading and writing a file, be clear about whether you read first or write first. Otherwise, if you need to read first but opened with w+, there will be nothing left to read.
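A quick sketch of the contrast (demo.txt is just a scratch file):

f = open('demo.txt', 'w')
f.write('hello')
f.close()

f = open('demo.txt', 'r+')   # r+ keeps the existing content
print repr(f.read())         # 'hello'
f.close()

f = open('demo.txt', 'w+')   # w+ truncates the file the moment it is opened
print repr(f.read())         # ''
f.close()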

2) Things to watch out for when using w+ and r+

An example: with r+ on an empty file, calling readline first and then write raises IOError: [Errno 0] Error, and you must add a seek(0) before the write to be able to keep writing; if the file is not empty, readline followed by write simply appends at the end without any error.

As for why, the official documentation explains it this way:

When the "r+", "w+", or "a+" access type is specified, both reading and writing are allowed (the file is said to be open for "update"). However, when you switch between reading and writing, there must be an intervening fflush, fsetpos, fseek, or rewind operation. The current position can be specified for the fsetpos or fseek operation, if desired.

In other words, between a read and a write you must add a seek to re-establish the current file position; otherwise you will be wrestling with all sorts of errors.
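A sketch of the rule (test.txt is assumed to exist already, since r+ will not create it):

f = open('test.txt', 'r+')
line = f.readline()     # read first ...
f.seek(0, 2)            # ... then reposition explicitly (here: jump to the end) before writing
f.write('appended\n')
f.close()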

3) Using the truncate function

If you need to keep writing data into a file but want to empty it first, call truncate.

Note that truncate() called without an argument cuts the file off from the current file position onward by default. So, to empty the file, either call seek(0) first and then truncate(), or call truncate(0) directly. Chew on the definition below:

The method truncate() truncates the file's size. If the optional size argument is present, the file is truncated to (at most) that size.

The size defaults to the current position. The current file position is not changed. Note that if a specified size exceeds the file's current size, the result is platform-dependent.

Note: This method would not work in case file is opened in read-only mode.
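So the two ways of emptying a file look like this (sketch; data.txt is assumed to exist):

f = open('data.txt', 'r+')
f.seek(0)
f.truncate()      # truncate at the current position (0): the file is now empty
# or, equivalently for the purpose of clearing the file:
# f.truncate(0)
f.close()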

4) What exactly do write and flush do?

Quoting an answer from Stack Overflow:

There's typically two levels of buffering involved:

  1. Internal buffers
  2. Operating system buffers

The internal buffers are buffers created by the runtime/library/language that you're programming against and is meant to speed things up by avoiding system calls for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.

However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.

If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.

So, in order to help with that you have the flush and fsync methods, on their respective objects.

The first, flush, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.

Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been "permanently" stored on disk.

To do that, you need to call the os.fsync method which ensures all operating system buffers are synchronized with the storage devices they're for, in other words, that method will copy data from the operating system buffers to the disk.

Typically you don't need to bother with either method, but if you're in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.

① write: the application writes into the program buffer;

② flush: data moves from the program buffer into the OS buffer;

③ os.fsync: data moves from the OS buffer onto the disk.

f.close actually includes a flush, and likewise it does not include writing through to the disk.
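Put into code (log.txt is just an illustrative file name), the three levels line up like this:

import os

f = open('log.txt', 'w')
f.write('one record\n')   # 1) into the program buffer
f.flush()                 # 2) program buffer -> OS buffer
os.fsync(f.fileno())      # 3) OS buffer -> disk
f.close()                 # implies flush(), but not fsync()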

Now, here comes a question. Consider the following scenario.

On Windows, open a cmd window and run a piece of Python code that writes to a file in a loop (once every 2 seconds, 20 times in total), wrapped in a try statement whose finally clause writes a marker string (haha, say; anything will do, just to show execution got there) and calls f.close(). Now suppose the program is halfway through, still in the middle of the writes:

1. Press Ctrl+C: the program effectively catches KeyboardInterrupt and runs the finally clause. Checking the file, haha was written successfully.

2. Close the cmd window: the file still ends up with the content written so far, but no haha. This shows that when the OS shuts the process down it does some cleanup work, such as closing file handles and writing the program buffer out to disk. But this is not an exception, so the finally clause was never run.

3. Kill the task directly in Task Manager: the file ends up empty, without a single character, even though part of it had already been written.

Why? Let's not dig into exactly what Windows does in each of the three ways of ending the program. But at the very least we know the hidden processes behind them must differ.

In scenario 2, that way of closing things down has the operating system help complete the program-buffer-to-OS-buffer step; in scenario 3 that step simply does not happen.

In fact, in scenario 3, if a flush is added after every write, the characters do get written to the file even though the process is killed by Task Manager.

To understand the reasons behind all this means learning a lot more about what the operating system actually does; that is a hole to fill some other day.

 

Likewise, when writing multi-process and multi-threaded Python programs, the handling of signals, the operating system's behavior behind the scenes, and the working modes and relationship of child and parent processes are all things I am not particularly clear about; plenty of pitfalls there too. I will chew them over properly when I have the time and energy.

 

A final gripe:

If you build a web system, standardize the front-end framework, validate and normalize user input before it goes into the database, and don't let the database be so bad that a single page takes 15 seconds to load. This gripe is aimed purely at that management platform, nothing more.

That's all.
