Spider 8: From Zero (Part 1): basic requests, regex scraping, and a quick look at BeautifulSoup
http://www.zhyea.com/2016/08/13/python-spider-4-multi-thread.html
1. Python web spider 1 – a simple HTTP request:
from urllib import urlopen

url = 'http://www.zhyea.com/2016/07/17/memory-analyzer-all.html'
response = urlopen(url)
content = response.read()
print type(content)
# out: <type 'str'>
Normally what gets printed at the command line is the page's HTML source. To pull out the information you actually need, you have to match and filter it, for example using regular expressions to grab the contents of the title and body:
import re
title = re.search(r"<title>.*</title>", content)
print "title>>>>>>>>>>>", title.group(0)
# out: title>>>>>>>>>>> <title>MemoryAnalyzer介绍及使用 | ZY笔记</title>

body = re.search(r"<body[\w\W]*</body>", content)
print "body>>>>>>>>>>>", body.group(0)
Putting it together: wrap get, post, and get_target into general-purpose functions (a usage sketch follows the code):
import re
from urllib import urlopen
from urllib import urlencode


def get(url):
    response = urlopen(url)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def post(url, **paras):
    import requests
    # param = urlencode(paras).encode('utf8')
    param = urlencode(paras)
    rep = requests.post(url, data=param)
    content = ""
    if rep:
        content = rep.content.decode("utf8")
        rep.close()
    return content


def get_target(pattern, content):
    m = re.search(pattern, content)
    target = ""
    if m:
        target = m.group(0)
    return target


def main(url):
    content = get(url)
    title = get_target(r"<title>.*</title>", content)
    body = get_target(r"<body[\w\W]*</body>", content)
    print "title>>>>>>>>>>>", title
    print "body>>>>>>>>>>>", body
2. Python web spider 2 – a few problems encountered while making requests:
Continuing to crawl with the functions above, an error appears:
url = "https://www.torrentkitty.tv/search/蝙蝠侠/" main(url)
Traceback (most recent call last):
  File "D:/PythonDevelop/spider/grab.py", line 22, in <module>
    main()
  File "D:/PythonDevelop/spider/grab.py", line 17, in main
    content = get(url)
  File "D:/PythonDevelop/spider/grab.py", line 7, in get
    response = request.urlopen(url)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "D:\Program Files\python\python35\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "D:\Program Files\python\python35\lib\http\client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "D:\Program Files\python\python35\lib\http\client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)
The stack trace shows the error happens while sending the HTTP request, and it is an encoding error. Chinese text in Python frequently runs into this kind of problem. Since the encoding error occurs inside the HTTP request, the fix is to URL-encode the Chinese part of the URL.
To URL-encode a string in Python, use quote from the urllib library:
from urllib import quote

print quote("蝙蝠侠")
# out: %E8%9D%99%E8%9D%A0%E4%BE%A0
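As a minimal sketch (Python 2; the URL pieces come from the failing example above, and the site may additionally require request headers such as a User-Agent), quoting just the non-ASCII path segment keeps the request line pure ASCII:

from urllib import quote, urlopen

# Percent-encode only the Chinese path segment; the rest of the URL is already ASCII.
url = "https://www.torrentkitty.tv/search/" + quote("蝙蝠侠") + "/"
response = urlopen(url)
print response.getcode()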
Summary:
1. First check what encoding the target site uses, e.g. utf8 or gb2312.
2. Then encode the text into that encoding.
3. Finally apply quote or urlencode.
Example:
from urllib import quote

test1 = '美国队长'.encode('gb2312')
test1_1 = quote(test1)
print(test1_1)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 0: ordinal not in range(128)
But this throws an exception. The reason: '美国队长' here is a utf8 byte string, and Python 2's intermediate form is unicode, so it cannot be encoded directly from utf8 to gb2312. It has to go through unicode in between: first decode the utf8 bytes to unicode, then encode the unicode to gb2312.
from urllib import quote

test1 = '美国队长'.decode('utf8').encode('gb2312')
test1_1 = quote(test1)
print(test1_1)
# OUT: %C3%C0%B9%FA%B6%D3%B3%A4
If gb2312 is not needed, i.e. the site is not gb2312-encoded but expects utf8, then even in Python 2 the utf8 string can be used directly, with no extra decode/encode round trip; just quote (or urlencode) it as-is:
test1 = '美国队长'
test1_1 = quote(test1)
print(test1_1)
# OUT: %E7%BE%8E%E5%9B%BD%E9%98%9F%E9%95%BF
But if the string is a unicode object, quoting it directly throws an exception, because before quote (or urlencode) the string must first be converted to utf8. That is rule 2 from the summary above.
test1 = u'美国队长'
test1_1 = quote(test1)
print(test1_1)
# OUT: KeyError: u'\u7f8e'
from urllib import quote, unquote, urlencode

test1 = u'美国队长'.encode('utf8')
test1_1 = quote(test1)
print(test1_1)
# OUT: %E7%BE%8E%E5%9B%BD%E9%98%9F%E9%95%BF
A string like %C3%C0%B9%FA%B6%D3%B3%A4 is the percent-encoded form used by web applications.
Conversely, a string taken from the web is decoded back into a normal string with unquote. As with encoding, pay attention to the site's charset: if the site uses gb2312, then after unquote the result still has to be decoded from gb2312 to unicode, otherwise you get mojibake. If the site uses utf8, then after unquote the result should likewise be decoded from utf8 to unicode (although leaving it as utf8 bytes will not actually produce mojibake).
print unquote("%C3%C0%B9%FA%B6%D3%B3%A4").decode('gb2312') # 由于这串字符串是gb2312编码,所以解码后,还需从gb2312解码,否则会是乱码。
print unquote("%E7%BE%8E%E5%9B%BD%E9%98%9F%E9%95%BF") # 这串web编码为utf8,直接unquote解码为utf8,正常,不会抛错。当然也可以再解码为unicode
print unquote("%E7%BE%8E%E5%9B%BD%E9%98%9F%E9%95%BF").decode('utf8')
Or use the urlencode function:
print urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
# OUT: eggs=2&bacon=0&spam=1
Likewise, the values must already be utf8 before urlencode, otherwise it throws the same exception that quote does (rule 2 of the summary again). To be precise: before calling urlencode or quote, the string or the dictionary values must first be encoded into the charset the target site expects; only then can urlencode or quote produce the web-encoded form.
Since web sites generally do not accept raw unicode, a Python 2 unicode string usually has to be encoded before this web-encoding step.
print urlencode({'spam': u'美国队长', 'eggs': 2, 'bacon': 0})
# OUT: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
print urlencode({'spam': u'\u7f8e\u56fd\u961f\u957f'.encode('utf8'), 'eggs': 2, 'bacon': 0})
# OUT: eggs=2&bacon=0&spam=%E7%BE%8E%E5%9B%BD%E9%98%9F%E9%95%BF
Or:
import json

print json.dumps({'spam': u'\u7f8e\u56fd\u961f\u957f', 'eggs': 2, 'bacon': 0}, ensure_ascii=False)
# serialize without escaping non-ASCII characters, i.e. keep the utf8 text
Putting it all together:
#!python
# encoding: utf-8
from urllib import request
from urllib import parse

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 120


def get(url):
    req = request.Request(url, headers=DEFAULT_HEADERS)
    response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def post(url, **paras):
    param = parse.urlencode(paras).encode("utf8")
    req = request.Request(url, param, headers=DEFAULT_HEADERS)
    response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def main():
    url = "https://www.torrentkitty.tv/search/"
    get_content = post(url, q=parse.quote("蝙蝠侠"))
    print(get_content)
    get_content = get(url)
    print(get_content)


if __name__ == "__main__":
    main()
3. Python web spider 3 – parsing pages with BeautifulSoup:
Part 1 showed how to extract page content with regular expressions. But HTML is a structured markup language, and getting at its content with regular expressions alone quickly becomes difficult.
This time we use a new tool: Python's BeautifulSoup library, a tool for extracting data from HTML and XML documents.
BeautifulSoup has to be installed before use (e.g. via pip install beautifulsoup4). For installation and usage, see the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/.
Chrome is the best browser for this, because its Developer Tools (similar to FireBug) can produce a CSS selector for any element (select the target element, then Copy -> Copy selector). This is very handy (the generated selector occasionally does not work, but it is still a useful reference).
1. First, try using BeautifulSoup to get the title:
from bs4 import BeautifulSoup

html_doc = get("https://www.douban.com/")
soup = BeautifulSoup(html_doc, "html.parser")
print soup.select("title")[0]
# OUT: <title>豆瓣</title>
This uses a CSS selector to pull content out of the HTML.
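A few more select() patterns on the same soup object, as a sketch (the class and id names below are hypothetical and only illustrate the selector syntax):

soup.select("a")          # all <a> tags
soup.select(".nav-item")  # tags with class "nav-item" (hypothetical class name)
soup.select("#content")   # the tag with id "content" (hypothetical id)
soup.select("div > a")    # <a> tags that are direct children of a <div>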
In the search results, clicking the "Open" button to the right of each result item starts the download. With Developer Tools you can see that the "Open" button is actually a hyperlink, and the hyperlink points to a magnet link. That magnet link is exactly what we want to collect. Use Chrome's Developer Tools to select any "Open" button, find the selected element's source in the Elements panel (easy to spot, since the selected element is highlighted in blue), then right-click -> Copy -> Copy selector to get that button's CSS selector:
#archiveResult > tbody > tr:nth-child(10) > td.action > a:nth-child(2)
But dropping this selector into code does not work:
def detect(html_doc):
    soup = BeautifulSoup(html_doc, "html.parser")
    print(len(soup.select("#archiveResult > tbody > tr:nth-child(10) > td.action > a:nth-child(2)")))
Running this prints 0: soup.select() returns a list, and the list has length 0, which says it all.
BeautifulSoup's select() does not support parts of that selector's syntax (such as nth-child), and the tbody element is typically inserted by the browser and does not exist in the raw HTML. Two small changes fix it:
Drop the tbody, and change nth-child(10) / nth-child(2) to nth-of-type(10) / nth-of-type(2):
print(soup.select("#archiveResult > tr:nth-of-type(10) > td.action > a:nth-of-type(2)"))
Running the code above directly yields the source of one hyperlink:
[<a title="[BT乐园·bt606.com]蝙蝠侠大战超人:正义黎明.BD1080P.X264.AAC.中英字幕"
href="magnet:?xt=urn:btih:DD6A680A7AE85F290A76826AA4D2E72194975EC8&dn=%5BBT%E4%B9%90%E5%9B%AD%C2%B7bt606.com%5D%E8%9D%99%E8%9D%A0%E4%BE%A0%E5%A4%A7%E6%88%98%E8%B6%85%E4%BA%BA%EF%BC%9A%E6%AD%A3%E4%B9%89%E9%BB%8E%E6%98%8E.BD1080P.X264.AAC.%E4%B8%AD%E8%8B%B1%E5%AD%97%E5%B9%95&tr=http%3A%2F%2Ftracker.ktxp.com%3A6868%2Fannounce&tr=http%3A%2F%2Ftracker.ktxp.com%3A7070%2Fannounce&tr=udp%3A%2F%2Ftracker.ktxp.com%3A6868%2Fannounce&tr=udp%3A%2F%2Ftracker.ktxp.com%3A7070%2Fannounce&tr=http%3A%2F%2Fbtfans.3322.org%3A8000%2Fannounce&tr=http%3A%2F%2Fbtfans.3322.org%3A8080%2Fannounce&tr=http%3A%2F%2Fbtfans.3322.org%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.bittorrent.am%2Fannounce&tr=udp%3A%2F%2Ftracker.bitcomet.net%3A8080%2Fannounce&tr=http%3A%2F%2Ftk3.5qzone.net%3A8080%2F&tr=http%3A%2F%2Ftracker.btzero.net%3A8080%2Fannounce&tr=http%3A%2F%2Fscubt.wjl.cn%3A8080%2Fannounce&tr=http%3A%2F%2Fbt.popgo.net%3A7456%2Fannounce&tr=http%3A%2F%2Fthetracker.org%2Fannounce&tr=http%3A%2F%2Ftracker.prq.to%2Fannounce&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=http%3A%2F%2Ftracker.dmhy.org%3A8000%2Fannounce&tr=http%3A%2F%2Fbt.titapark.com%3A2710%2Fannounce&tr=http%3A%2F%2Ftracker.tjgame.enorth.com.cn%3A8000%2Fannounce&"
rel="magnet">Open</a>]
The href and title attributes of that hyperlink are our targets. BeautifulSoup also provides a way to read attributes: every element returned by select() carries an attrs dictionary, from which the relevant attribute values can be read:
def detect(html_doc):
    html_soup = BeautifulSoup(html_doc, "html.parser")
    anchor = html_soup.select("#archiveResult > tr:nth-of-type(10) > td.action > a:nth-of-type(2)")[0]
    print(anchor.attrs['href'])
    print(anchor.attrs['title'])
That covers the basics.
The ugliest part of the program, though, is how the hyperlinks are located: fetching them one by one with positional selectors is not realistic. Fortunately BeautifulSoup supports selecting elements by attribute value, so the final version looks like this:
def detect(html_doc):
    html_soup = BeautifulSoup(html_doc, "html.parser")
    anchors = html_soup.select('a[href^="magnet:?xt"]')
    for i in range(len(anchors)):
        print(anchors[i].attrs['title'])
        print(anchors[i].attrs['href'])
In the code above, a[href^="magnet:?xt"] selects all <a> tags whose href attribute starts with "magnet:?xt" (the "^" may look familiar: it carries the same start-anchoring meaning as "^" in a regular expression). select() returns the list of matching <a> tags; we then iterate over the list and read the relevant attributes from each tag's attrs dictionary.
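As a side note, BeautifulSoup's select() also understands the other CSS attribute operators; a sketch on a hypothetical soup object:

soup.select('a[href$=".torrent"]')  # href ends with ".torrent"
soup.select('a[href*="magnet"]')    # href contains "magnet"
soup.select('a[title]')             # <a> tags that have a title attribute at all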
#!python
# encoding: utf-8
from urllib import request
from urllib import parse
from bs4 import BeautifulSoup

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 360


def get(url):
    req = request.Request(url, headers=DEFAULT_HEADERS)
    response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def post(url, **paras):
    param = parse.urlencode(paras).encode("utf8")
    req = request.Request(url, param, headers=DEFAULT_HEADERS)
    response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def detect(html_doc):
    html_soup = BeautifulSoup(html_doc, "html.parser")
    anchors = html_soup.select('a[href^="magnet:?xt"]')
    for i in range(len(anchors)):
        print(anchors[i].attrs['title'])
        print(anchors[i].attrs['href'])


def main():
    url = "https://www.torrentkitty.tv/search/"
    html_doc = post(url, q=parse.quote("超人"))
    detect(html_doc)


if __name__ == "__main__":
    main()