Web Scraping Notes

Scraping basics: the requests and BeautifulSoup modules
http://www.cnblogs.com/wupeiqi/articles/6283017.html

Scraping performance and the Scrapy framework
http://www.cnblogs.com/wupeiqi/articles/6283017.html

Python Development (Part 15): The Tornado web framework
http://www.cnblogs.com/wupeiqi/articles/5702910.html

A custom asynchronous, non-blocking web framework in 200 lines
http://www.cnblogs.com/wupeiqi/p/6536518.html

 

Modules

requests.get(url='URL path')

BeautifulSoup

soup = BeautifulSoup('HTML string', 'html.parser')

tag = soup.find(name='div', attrs={'id': 't'})

tags = soup.find_all(name='div', attrs={'id': 't'})

tag.find('h3').text

tag.find('h3').get('attribute name')  # e.g. get('href')

tag.find('h3').attrs
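
A minimal end-to-end sketch tying the calls above together; the URL and the div id 't' are placeholders, not taken from a real page:

import requests
from bs4 import BeautifulSoup

response = requests.get(url='http://example.com')    # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')

tag = soup.find(name='div', attrs={'id': 't'})       # first match, or None
if tag is not None:
    h3 = tag.find('h3')
    if h3 is not None:
        print(h3.text, h3.get('href'), h3.attrs)

tags = soup.find_all(name='div', attrs={'id': 't'})  # list of all matches
print(len(tags))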

HTTP request basics

requests
GET:
requests.get(url="http://www.oldboyedu.com")
# raw request: "GET / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n"

requests.get(url="http://www.oldboyedu.com/index.html?p=1")
# raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n"

requests.get(url="http://www.oldboyedu.com/index.html", params={'p': 1})
# raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n"

POST:
requests.post(url="http://www.oldboyedu.com", data={'name': 'alex', 'age': 18})
# default header: Content-Type: application/x-www-form-urlencoded
# raw request: "POST / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\nname=alex&age=18"

requests.post(url="http://www.oldboyedu.com", json={'name': 'alex', 'age': 18})
# default header: Content-Type: application/json
# raw request: 'POST / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n{"name": "alex", "age": 18}'

requests.post(
    url="http://www.oldboyedu.com",
    params={'p': 1},
    json={'name': 'alex', 'age': 18}
)
# default header: Content-Type: application/json
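
To verify what these calls actually put on the wire, you can build a prepared request and inspect its headers and body without sending anything; this uses the standard requests API, and the URL is just the placeholder from above:

import requests

req = requests.Request('POST', 'http://www.oldboyedu.com',
                       data={'name': 'alex', 'age': 18})
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # name=alex&age=18

req = requests.Request('POST', 'http://www.oldboyedu.com',
                       json={'name': 'alex', 'age': 18})
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # application/json
print(prepared.body)                     # b'{"name": "alex", "age": 18}'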

GET requests

Example with parameters:

import requests

payload = {'key1': 'v1', 'key2': 'v2'}

ret = requests.get("http://test.cn/get", params=payload)

print(ret.url)
print(ret.text)

POST requests

import requests

import json
  
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
  
ret = requests.post(url, data=json.dumps(payload), headers=headers)
  
print(ret.text)
print(ret.cookies)

Other requests (parameters accepted by requests.request):

1. method
2. url
3. params
4. data
5. json
6. headers
7. cookies
8. files
9. auth
10. timeout
11. allow_redirects
12. proxies
13. stream
14. cert

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )
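
The files parameter from the list above is not covered by these snippets; a minimal sketch of a multipart upload, reusing the same placeholder URL (the field name 'f1' is arbitrary):

import requests

def param_files():
    # files uploads as multipart/form-data; requests sets the boundary header
    with open('data_file.py', mode='rb') as f:
        ret = requests.post(url='http://127.0.0.1:8000/test/',
                            files={'f1': f})
    print(ret.status_code)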

def param_cookies():
    # send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )

def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)
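
The remaining parameters from the list (timeout, allow_redirects, proxies, stream) follow the same pattern; a sketch with placeholder values, all against the same test URL:

import requests

def param_misc():
    # timeout: (connect, read) in seconds; raises requests.Timeout on expiry
    ret = requests.get('http://127.0.0.1:8000/test/', timeout=(5, 10))

    # allow_redirects=False returns the 3xx response instead of following it
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)

    # proxies: route the request through an HTTP proxy (placeholder address)
    ret = requests.get('http://127.0.0.1:8000/test/',
                       proxies={'http': 'http://10.10.1.10:3128'})

    # stream=True defers the body download until it is iterated
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    for chunk in ret.iter_content(chunk_size=1024):
        pass  # process the response chunk by chunk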

The BeautifulSoup module

This module takes an HTML or XML string and parses it into a document tree; it then provides methods for quickly locating specific elements, which makes finding elements in HTML or XML straightforward.

Install:

pip3 install beautifulsoup4

Usage example:

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <div id='i1'>
            <a>sdfs</a>
        </div>
        <p class='c2'>asdfa</p>
    </body>
</html>
"""

Methods:

1. name, the tag name

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
name = tag.name  # get
print(name)
tag.name = 'span'  # set
print(soup)

2. attrs, the tag attributes

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
attrs = tag.attrs  # get
print(attrs)
tag.attrs['id'] = 'iiii'  # set
print(soup)
attrs = tag.attrs  # get again
print(attrs)

3. children, all direct child nodes

body = soup.find('body')
v = body.children
print(list(v))

4. descendants, all descendant nodes

body = soup.find('body')
v = body.descendants
print(list(v))
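
The difference between the two: children yields only direct children (text nodes included), while descendants recurses into nested tags. A quick check against html_doc (getattr is used because text nodes may not expose .name):

body = soup.find('body')
print([getattr(node, 'name', None) for node in body.children])
# direct children only: text nodes (None), 'div', 'p'
print([getattr(node, 'name', None) for node in body.descendants])
# recursive: also reaches the <a> nested inside the <div>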

5. clear, empty out all of a tag's children (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

6. decompose, recursively delete the tag and everything inside it

tag = soup.find('body')
tag.decompose()
print(soup)

7. extract, recursively remove the tag and return the removed tag

tag = soup.find('body')
tag.extract()
print(soup)
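
To make the difference concrete: clear empties a tag in place, decompose destroys the tag and returns nothing, while extract removes the tag but hands it back for reuse. A sketch reusing html_doc:

soup = BeautifulSoup(html_doc, 'html.parser')
removed = soup.find('a').extract()  # removed from the tree, but returned
print(removed)                      # <a>sdfs</a>
print(soup.find('a'))               # None: no longer in the soup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a').decompose())   # None: decompose returns nothing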

8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag); encode is the bytes counterpart

tag = soup.find('body')
v = tag.decode()   # tag object to string
v1 = tag.encode()  # tag object to bytes
print(v,v1)

9. find, get the first matching tag; find_all, get all matching tags

soup = BeautifulSoup(html_doc,'html.parser')
tag = soup.find('a')
tags = soup.find_all('body')
for i in tags:
    print(i)
v = soup.find_all(name=['a','div'])
print(v)

10. has_attr, check whether the tag has a given attribute

soup = BeautifulSoup(html_doc,'html.parser')
tag = soup.find('a')
v = tag.has_attr('class')
v1 = tag.has_attr('id')
print(v,v1)

11. get_text, get the text content inside the tag

soup = BeautifulSoup(html_doc,'html.parser')
tag = soup.find('a')
v = tag.get_text()
print(v)

Other methods:

index, get a tag's index position within a parent tag

is_empty_element, whether the tag is an empty (void) or self-closing tag

select, select_one: CSS selectors

soup.select("body a")
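
A few selector examples against html_doc (select returns a list of matches, select_one returns the first match or None):

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select('body a'))        # all <a> tags anywhere under <body>
print(soup.select('p.c2'))          # <p> tags with class "c2"
print(soup.select_one('div#i1 a'))  # first <a> inside <div id="i1">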

Tag content:

print(tag.string)

tag.string = 'new content'  # set

append: append a tag inside the current tag

insert: insert a tag at a given position inside the current tag

insert_after, insert_before: insert after or before the current tag

replace_with: replace the current tag with a given tag

wrap: wrap the current tag in a given tag

unwrap: remove the current tag, keeping what it wraps (a combined sketch follows below)
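
A combined sketch of the modification methods above, again against html_doc (new_tag creates a fresh tag attached to this soup):

soup = BeautifulSoup(html_doc, 'html.parser')
div = soup.find('div')

# append: the new tag becomes the last child of <div>
span = soup.new_tag('span')
span.string = 'appended'
div.append(span)

# insert_after: inserted as a sibling following <div>
div.insert_after(soup.new_tag('hr'))

# wrap, then unwrap: enclose <div> in a <section>, then remove the wrapper
div.wrap(soup.new_tag('section'))
div.parent.unwrap()

print(soup)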

 

 

