Python Crawler Notes (1)

Crawler Summary (1)

Crawler key points:

1. Website background research
	Check the site's robots.txt and Sitemap files
		
The builtwith module can identify the technology a website is built with:
    >>> import builtwith
    >>> builtwith.parse('http://example.webscraping.com')
    --> returns the technologies used to build the site
pip install python-whois

The WHOIS protocol can be used to look up who registered a domain; below, this module is used to run a WHOIS query on the appspot.com domain.
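A minimal usage sketch (assuming the python-whois package installed above is imported as whois; the fields returned depend on the registrar's WHOIS server):

    import whois

    # look up the registration record for the domain
    record = whois.whois('appspot.com')
    print(record)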

Crawling:

# Download a web page
import urllib2
def download(url):
    return urllib2.urlopen(url).read()

--> the function downloads the page and returns its HTML
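A quick usage sketch of the function above (the example domain is the one used throughout these notes):

    html = download('http://example.webscraping.com')
    print(html[:100])   # first 100 characters of the returned HTML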

Improvement:

import urllib2

def download(url):
    print('Downloading: ' + url)
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print('Download error: %s' % e.reason)
        html = None
    return html

Download errors:


1. Retrying downloads:
	Errors hit while downloading are often temporary; for example, an overloaded server returns 503 Service Unavailable, which is worth retrying.
	If the server returns 404 Not Found, the page simply does not exist, so retrying will not help.

--> HTTP status codes: https://tools.ietf.org/html/rfc7231#section-6
import urllib2

def download(url, num_retries=2):
    print('Downloading: ' + url)
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print('Download error: %s' % e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries - 1)
    return html
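A usage sketch of the retry behaviour; httpstat.us is a public test service that returns whatever status code appears in the path (an external assumption, not part of these notes):

    # prints the URL and the 500 error, then retries up to num_retries times
    html = download('http://httpstat.us/500')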

Configuring the user agent:

def download(url, user_agent='wsp', num_retries=2):
    print('Downloading: ' + url)
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print('Download error: %s' % e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, num_retries - 1)
    return html
                        ---> disguises the crawler as a regular browser
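urllib2 sends a default User-agent of the form Python-urllib/2.x, which some sites block; a usage sketch with a browser-like user agent (the string below is only illustrative):

    html = download('http://example.webscraping.com',
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)')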

Regex matching (sitemap crawler):

import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the links listed in the sitemap's <loc> tags
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each linked page
    for link in links:
        html = download(link)
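Usage sketch, assuming the example site advertises its sitemap at the standard location in robots.txt:

    crawl_sitemap('http://example.webscraping.com/sitemap.xml')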

Page URL pattern matching:

import itertools

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # the page does not exist - stop iterating
        break
    else:
        # success - process the html here
        pass

Improvement:

# maximum number of consecutive download errors allowed
max_errors = 5
# current number of consecutive download errors
num_errors = 0
for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        # received an error trying to download this webpage
        num_errors += 1
        if num_errors == max_errors:
            # too many consecutive errors - give up
            break
    else:
        # success - reset the consecutive error count
        num_errors = 0

Download throttling:

import time
import datetime
import urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""

    def __init__(self, delay):
        # minimum number of seconds between downloads of the same domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                # this domain was accessed recently, so wait before downloading again
                time.sleep(sleep_secs)
        self.domains[domain] = datetime.datetime.now()

# inside the crawl loop:
throttle = Throttle(delay)
throttle.wait(url)
result = download(url, headers, proxy=proxy, num_retries=num_retries)
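A self-contained variant using the simpler download() defined earlier (the one-second delay and the two example URLs are only illustrative):

    throttle = Throttle(1)
    for link in ['http://example.webscraping.com/view/1',
                 'http://example.webscraping.com/view/2']:
        throttle.wait(link)
        html = download(link)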

Crawler traps:

A crawler will follow every link it has not visited before. However, some sites generate page content dynamically, which can produce an effectively infinite number of pages.

Approach:
	1. Record how many links were followed to reach the current page, i.e. its depth; once the maximum depth is reached, the crawler stops adding that page's links to the queue (a fuller sketch follows the snippet below).
    
def link_crawl(..., max_depth=2):
    seen = {}
    ...
    depth = seen.get(url, 0)
    if depth != max_depth:
        # only queue this page's links while below the maximum depth
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)
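A fuller, self-contained sketch of the depth-limited crawler described above; the regex-based get_links() helper, the seed_url/link_regex parameters, and the reuse of download() from earlier are illustrative assumptions rather than part of the original notes:

import re
import urlparse

def get_links(html):
    # return all href values found in the page
    webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

def link_crawler(seed_url, link_regex, max_depth=2):
    crawl_queue = [seed_url]
    # map each seen URL to the depth at which it was discovered
    seen = {seed_url: 0}
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue
        depth = seen[url]
        if depth != max_depth:
            # only queue new links while below the maximum depth
            for link in get_links(html):
                if re.match(link_regex, link):
                    # resolve relative links against the seed URL
                    link = urlparse.urljoin(seed_url, link)
                    if link not in seen:
                        seen[link] = depth + 1
                        crawl_queue.append(link)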