Urllib: Web Crawlers
1. A simple crawler
from urllib import request

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body as bytes
    with open('url.html', 'wb') as out:  # save the page to a local file
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

f('http://www.cnblogs.com/alex3714/articles/5248247.html')
Output:
C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: http://www.cnblogs.com/alex3714/articles/5248247.html
91829 bytes received from http://www.cnblogs.com/alex3714/articles/5248247.html.
Process finished with exit code 0
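The function above assumes the request always succeeds. As a minimal sketch (not part of the original code), the variant below adds a timeout and catches urllib.error.URLError. The name fetch_to_file, its path parameter, and the 10-second default are assumptions made for illustration.

```python
from urllib import request, error

def fetch_to_file(url, path, timeout=10):
    """Download url and write the raw bytes to path.

    Returns the number of bytes written, or None if the request failed.
    fetch_to_file is a hypothetical helper, not part of urllib.
    """
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
    except error.URLError as exc:  # HTTPError is a subclass of URLError
        print('failed to fetch %s: %s' % (url, exc))
        return None
    with open(path, 'wb') as out:
        out.write(data)
    return len(data)
```

Calling fetch_to_file('http://www.cnblogs.com/alex3714/articles/5248247.html', 'url.html') would then behave like f above, except that a network error yields None instead of a traceback.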
2. Crawling multiple pages
from urllib import request
import gevent

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

# start three greenlets (coroutines), passing each one its URL
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
Output:
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
479631 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
Process finished with exit code 0
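gevent.spawn(func, arg) schedules func(arg) in a new greenlet, and gevent.joinall waits for all of them. If the function returns something instead of printing it, the result is available afterwards on the greenlet's value attribute. The sketch below shows this without any network access; page_size is a hypothetical stand-in for f that returns a fake byte count.

```python
import gevent

def page_size(url):
    """Stand-in for f(url): returns a fake byte count, no real request is made."""
    gevent.sleep(0)  # yield to the other greenlets, as a cooperative I/O call would
    return len(url)

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

# one greenlet per URL; spawn passes the URL through as the argument
jobs = [gevent.spawn(page_size, url) for url in urls]
gevent.joinall(jobs)

# after joinall, each finished greenlet holds its return value in .value
sizes = {url: job.value for url, job in zip(urls, jobs)}
print(sizes)
```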
3. Measuring the run time
from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

start_time = time.time()
# start three greenlets (coroutines), passing each one its URL
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('cost is %s:' % (time.time() - start_time))
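Output: the timing shows that the requests still ran serially. By default gevent cannot tell that urllib is doing I/O: urllib's blocking socket calls never yield to gevent, so each greenlet runs to completion before the next one starts.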
C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
488624 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
cost is 4.5304529666900635:
Process finished with exit code 0
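The serial behaviour can be reproduced without the network. In the sketch below, time.sleep stands in for the blocking urlopen call (an assumption made for illustration) and gevent.sleep for a call that gevent knows how to switch on. The helper name run_three and the 0.2 s delay are hypothetical.

```python
import time
import gevent

def run_three(sleep_func, delay=0.2):
    """Spawn three greenlets that each call sleep_func(delay); return elapsed seconds."""
    start = time.time()
    gevent.joinall([gevent.spawn(sleep_func, delay) for _ in range(3)])
    return time.time() - start

# plain blocking call: gevent cannot switch, so the greenlets run one after another
blocking = run_three(time.sleep)
# gevent-aware call: each greenlet yields while waiting, so the waits overlap
cooperative = run_three(gevent.sleep)

print('time.sleep:   %.2f s' % blocking)
print('gevent.sleep: %.2f s' % cooperative)
```

The blocking run should take roughly three times the delay and the cooperative run roughly one delay, which mirrors the difference between unpatched and patched urllib.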
4. Comparing synchronous and asynchronous run times
from urllib import request
import gevent
import time
# from gevent import monkey
# monkey.patch_all()  # patch all blocking I/O in this program (disabled here)

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start_time = time.time()
for url in urls:
    f(url)
print('sync cost is %s:' % (time.time() - start_time))

async_time_start = time.time()  # start time of the asynchronous run
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:' % (time.time() - async_time_start))
Run time: the two are roughly comparable. Both runs fetch the pages one after another, so the asynchronous version shows no real advantage.
C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
480499 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
sync cost is 7.112711191177368:
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
485666 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55390 bytes received from https://github.com/.
async cost is 4.510450839996338:
Process finished with exit code 0
5. By default gevent cannot detect that urllib is doing I/O. To connect the two, import gevent's monkey module and apply its patch:
from gevent import monkey
monkey.patch_all()
The full program, with the patch applied before the other imports:
from gevent import monkey
monkey.patch_all()  # patch all blocking I/O in this program so gevent can switch on it

from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body as bytes
    print('%d bytes received from %s.' % (len(data), url))

urls = ['https://www.python.org/', 'https://www.yahoo.com/', 'https://github.com/']

start_time = time.time()
for url in urls:
    f(url)
print('sync cost is %s:' % (time.time() - start_time))

async_time_start = time.time()  # start time of the asynchronous run
gevent.joinall([
    gevent.spawn(f, 'https://www.python.org/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:' % (time.time() - async_time_start))
Output: with the patch applied, all three requests are issued before any response arrives, and the asynchronous run is about three times faster than the synchronous one.
C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
487577 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55392 bytes received from https://github.com/.
sync cost is 5.784578323364258:
GET: https://www.python.org/
GET: https://www.yahoo.com/
GET: https://github.com/
480662 bytes received from https://www.yahoo.com/.
48751 bytes received from https://www.python.org/.
55394 bytes received from https://github.com/.
async cost is 1.8721871376037598:
Process finished with exit code 0
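The effect of the patch can also be checked without making any requests. The sketch below is an illustration under stated assumptions: time.sleep stands in for a blocking network call, and the helper name wait and the 0.2 s delay are hypothetical. monkey.is_module_patched reports whether gevent has replaced a given standard-library module.

```python
from gevent import monkey
monkey.patch_all()  # patch first, before importing modules that do blocking I/O

import time
import gevent

def wait(delay):
    # after patch_all(), time.sleep yields to other greenlets instead of blocking
    time.sleep(delay)

start = time.time()
gevent.joinall([gevent.spawn(wait, 0.2) for _ in range(3)])
elapsed = time.time() - start

# True once the socket module (used underneath urllib) has been patched
socket_patched = monkey.is_module_patched('socket')

print('socket patched: %s' % socket_patched)
print('three 0.2 s sleeps took %.2f s' % elapsed)
```

With the patch in place the three waits overlap, so the total should be close to a single delay rather than three of them.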