Web Scraping Notes

Scraping basics: the requests and BeautifulSoup modules
http://www.cnblogs.com/wupeiqi/articles/6283017.html

Scraping performance and the Scrapy framework
http://www.cnblogs.com/wupeiqi/articles/6283017.html

Python Development (Part 15): The Tornado web framework
http://www.cnblogs.com/wupeiqi/articles/5702910.html

A custom asynchronous, non-blocking web framework in 200 lines
http://www.cnblogs.com/wupeiqi/p/6536518.html

 

Modules

requests.get(url='URL path')

BeautifulSoup

soup = BeautifulSoup('HTML string', 'html.parser')

tag = soup.find(name='div', attrs={'id': 't'})

tags = soup.find_all(name='div', attrs={'id': 't'})

tag.find('h3').text

tag.find('h3').get('attribute name')  # e.g. get('href')

tag.find('h3').attrs
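
A minimal end-to-end sketch tying the calls above together; the URL and the div id 't' are placeholders, not taken from a real page:

import requests
from bs4 import BeautifulSoup

response = requests.get(url='http://example.com')    # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')

tag = soup.find(name='div', attrs={'id': 't'})       # first match, or None
if tag is not None:
    h3 = tag.find('h3')
    if h3 is not None:
        print(h3.text, h3.get('href'), h3.attrs)

tags = soup.find_all(name='div', attrs={'id': 't'})  # list of all matches
print(len(tags))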

HTTP request basics

requests
GET:
requests.get(url="http://www.oldboyedu.com")
# raw request: "GET / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n"

requests.get(url="http://www.oldboyedu.com/index.html?p=1")
# raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n"

requests.get(url="http://www.oldboyedu.com/index.html", params={'p': 1})
# raw request: "GET /index.html?p=1 HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n"

POST:
requests.post(url="http://www.oldboyedu.com", data={'name': 'alex', 'age': 18})
# default header: Content-Type: application/x-www-form-urlencoded
# raw request: "POST / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\nname=alex&age=18"

requests.post(url="http://www.oldboyedu.com", json={'name': 'alex', 'age': 18})
# default header: Content-Type: application/json
# raw request: 'POST / HTTP/1.1\r\nHost: www.oldboyedu.com\r\n...\r\n\r\n{"name": "alex", "age": 18}'

requests.post(
    url="http://www.oldboyedu.com",
    params={'p': 1},
    json={'name': 'alex', 'age': 18}
)
# default header: Content-Type: application/json
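
To verify what these calls actually put on the wire, you can build a prepared request and inspect its headers and body without sending anything; this uses the standard requests API, and the URL is just the placeholder from above:

import requests

req = requests.Request('POST', 'http://www.oldboyedu.com',
                       data={'name': 'alex', 'age': 18})
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # name=alex&age=18

req = requests.Request('POST', 'http://www.oldboyedu.com',
                       json={'name': 'alex', 'age': 18})
prepared = req.prepare()
print(prepared.headers['Content-Type'])  # application/json
print(prepared.body)                     # b'{"name": "alex", "age": 18}'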

GET requests

Example with parameters:

import requests

payload = {'key1': 'v1', 'key2': 'v2'}

ret = requests.get("http://test.cn/get", params=payload)

print(ret.url)
print(ret.text)

POST requests

import requests

import json
  
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
  
ret = requests.post(url, data=json.dumps(payload), headers=headers)
  
print(ret.text)
print(ret.cookies)

Other requests (parameters accepted by requests.request):

1. method
2. url
3. params
4. data
5. json
6. headers
7. cookies
8. files
9. auth
10. timeout
11. allow_redirects
12. proxies
13. stream
14. cert

requests.request(method='POST',
                 url='http://127.0.0.1:8000/test/',
                 data=open('data_file.py', mode='r', encoding='utf-8'),  # file contents: k1=v1;k2=v2;k3=v3;k3=v4
                 headers={'Content-Type': 'application/x-www-form-urlencoded'}
                 )
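
The files parameter from the list above is not covered by these snippets; a minimal sketch of a multipart upload, reusing the same placeholder URL (the field name 'f1' is arbitrary):

import requests

def param_files():
    # files uploads as multipart/form-data; requests sets the boundary header
    with open('data_file.py', mode='rb') as f:
        ret = requests.post(url='http://127.0.0.1:8000/test/',
                            files={'f1': f})
    print(ret.status_code)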

def param_cookies():
    # send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )

def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)
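
The remaining parameters from the list (timeout, allow_redirects, proxies, stream) follow the same pattern; a sketch with placeholder values, all against the same test URL:

import requests

def param_misc():
    # timeout: (connect, read) in seconds; raises requests.Timeout on expiry
    ret = requests.get('http://127.0.0.1:8000/test/', timeout=(5, 10))

    # allow_redirects=False returns the 3xx response instead of following it
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)

    # proxies: route the request through an HTTP proxy (placeholder address)
    ret = requests.get('http://127.0.0.1:8000/test/',
                       proxies={'http': 'http://10.10.1.10:3128'})

    # stream=True defers the body download until it is iterated
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    for chunk in ret.iter_content(chunk_size=1024):
        pass  # process the response chunk by chunk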

The BeautifulSoup module

This module takes an HTML or XML string and parses it into a document tree; it then provides methods for quickly locating specific elements, which makes finding elements in HTML or XML straightforward.

Install:

pip3 install beautifulsoup4

Usage example:

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <div id='i1'>
            <a>sdfs</a>
        </div>
        <p class='c2'>asdfa</p>
    </body>
</html>
"""

Methods:

1. name, the tag name

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
name = tag.name  # get
print(name)
tag.name = 'span'  # set
print(soup)

2. attrs, the tag attributes

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find('a')
attrs = tag.attrs  # get
print(attrs)
tag.attrs['id'] = 'iiii'  # set
print(soup)
attrs = tag.attrs  # get again
print(attrs)

3. children, all direct child nodes

body = soup.find('body')
v = body.children
print(list(v))

4. descendants, all descendant nodes

body = soup.find('body')
v = body.descendants
print(list(v))
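
The difference between the two: children yields only direct children (text nodes included), while descendants recurses into nested tags. A quick check against html_doc (getattr is used because text nodes may not expose .name):

body = soup.find('body')
print([getattr(node, 'name', None) for node in body.children])
# direct children only: text nodes (None), 'div', 'p'
print([getattr(node, 'name', None) for node in body.descendants])
# recursive: also reaches the <a> nested inside the <div>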

5. clear, empty out all of a tag's children (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

6. decompose, recursively delete the tag and everything inside it

tag = soup.find('body')
tag.decompose()
print(soup)

7. extract, recursively remove the tag and return the removed tag

tag = soup.find('body')
tag.extract()
print(soup)
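
To make the difference concrete: clear empties a tag in place, decompose destroys the tag and returns nothing, while extract removes the tag but hands it back for reuse. A sketch reusing html_doc:

soup = BeautifulSoup(html_doc, 'html.parser')
removed = soup.find('a').extract()  # removed from the tree, but returned
print(removed)                      # <a>sdfs</a>
print(soup.find('a'))               # None: no longer in the soup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a').decompose())   # None: decompose returns nothing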

8. decode, convert to a string (including the current tag); decode_contents (excluding the current tag); encode is the bytes counterpart

tag = soup.find('body')
v = tag.decode()   # tag object to string
v1 = tag.encode()  # tag object to bytes
print(v,v1)

9. find, get the first matching tag; find_all, get all matching tags

soup = BeautifulSoup(html_doc,'html.parser')
tag = soup.find('a')
tags = soup.find_all('body')
for i in tags:
    print(i)
v = soup.find_all(name=['a','div'])
print(v)

10. has_attr, check whether the tag has a given attribute

soup = BeautifulSoup(html_doc,'html.parser')
tag = soup.find('a')
v = tag.has_attr('class')
v1 = tag.has_attr('id')
print(v,v1)

11. get_text, get the text content inside the tag

soup = BeautifulSoup(html_doc,'html.parser')
tag = soup.find('a')
v = tag.get_text()
print(v)

Other methods:

index, get a tag's index position within a parent tag

is_empty_element, whether the tag is an empty (void) or self-closing tag

select, select_one: CSS selectors

soup.select("body a")
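
A few selector examples against html_doc (select returns a list of matches, select_one returns the first match or None):

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.select('body a'))        # all <a> tags anywhere under <body>
print(soup.select('p.c2'))          # <p> tags with class "c2"
print(soup.select_one('div#i1 a'))  # first <a> inside <div id="i1">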

Tag content:

print(tag.string)

tag.string = 'new content'  # set

append: append a tag inside the current tag

insert: insert a tag at a given position inside the current tag

insert_after, insert_before: insert after or before the current tag

replace_with: replace the current tag with a given tag

wrap: wrap the current tag in a given tag

unwrap: remove the current tag, keeping what it wraps (a combined sketch follows below)
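
A combined sketch of the modification methods above, again against html_doc (new_tag creates a fresh tag attached to this soup):

soup = BeautifulSoup(html_doc, 'html.parser')
div = soup.find('div')

# append: the new tag becomes the last child of <div>
span = soup.new_tag('span')
span.string = 'appended'
div.append(span)

# insert_after: inserted as a sibling following <div>
div.insert_after(soup.new_tag('hr'))

# wrap, then unwrap: enclose <div> in a <section>, then remove the wrapper
div.wrap(soup.new_tag('section'))
div.parent.unwrap()

print(soup)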

 

 

