Learning pyspider

1. Learn the core Python packages and implement the basic crawling workflow
2. Understand how to store unstructured data
3. Learn scrapy and build an engineered, project-level crawler
4. Learn database fundamentals to handle large-scale data storage and retrieval
5. Master techniques for dealing with the anti-crawling measures of tricky sites
6. Build distributed crawlers for large-scale concurrent collection and higher efficiency
---
Learn the core Python packages and implement the basic crawling workflow
Most Python crawlers follow the flow of "send a request, get the page, parse the page, extract and store the content", which essentially mimics how we use a browser to fetch information from a web page.
There are many crawler-related Python packages: urllib, requests, bs4, scrapy, pyspider and so on. It is recommended to start with requests + XPath: requests handles connecting to the site and returning the page, while XPath is used to parse the page and extract the data.
If you have used BeautifulSoup, you will find XPath saves a lot of work: the layer-by-layer element inspection is no longer needed. With this basic routine, ordinary static sites are straightforward; Douban, Qiushibaike, Tencent News and the like can all be handled.
Of course, if you need to crawl sites whose content is loaded asynchronously, you can learn to analyze the real requests with browser devtools (packet capture), or learn Selenium for browser automation; then dynamic sites such as Zhihu, Mtime and TripAdvisor can also be handled.
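A minimal sketch of the "send request, get page, parse, extract and store" flow described above; the target URL and the XPath expression are placeholders rather than anything taken from the linked articles:

import requests
from lxml import etree

def fetch_titles(url):
    # send the request and get the page back
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    resp.raise_for_status()
    # parse the page and extract data with XPath
    html = etree.HTML(resp.text)
    # hypothetical XPath; adjust it to the structure of the target site
    return html.xpath('//h2/a/text()')

if __name__ == '__main__':
    for title in fetch_titles('https://example.com/'):  # placeholder URL
        print(title)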

https://www.jianshu.com/p/39c7371dd6c2

https://zhidao.baidu.com/question/717742672064662045.html?fr=iks&word=pyspider%D7%A5%C8%A1%CD%BC%C6%AC%C1%B4%BD%D3&ie=gbk

https://zhidao.baidu.com/question/1696233978334322468.html?fr=iks&word=pyspider%D7%A5%C8%A1%CD%BC%C6%AC%C1%B4%BD%D3&ie=gbk


https://www.cnblogs.com/shiluoliming/p/8379402.html


Scraping image URLs
https://zhuanlan.zhihu.com/p/73660017
https://zhuanlan.zhihu.com/p/265326672
https://zhuanlan.zhihu.com/p/73844521

https://github.com/Archiewyq/Python3.6/blob/master/pyspider/mzitu.py
https://cuiqingcai.com/3179.html
https://blog.csdn.net/weixin_43552452/article/details/111987107


https://www.jianshu.com/p/dcfef5bc53ac
https://www.likecs.com/show-204747972.html
https://www.jb51.net/article/220574.htm
https://www.jb51.net/article/224034.htm


Regular expressions
https://baijiahao.baidu.com/s?id=1712491497500869183&wfr=spider&for=pc

https://xie.infoq.cn/article/8b572ccd02e5f11728f4b802a


https://www.jb51.net/article/205596.htm

To purchase (paid resource)
https://mbd.baidu.com/newspage/data/landingsuper?context=%7B%22nid%22%3A%22news_9159420957293366938%22%7D&n_type=1&p_from=4


https://baijiahao.baidu.com/s?id=1670801762859023601&wfr=spider&for=pc


This explanation is very detailed
https://www.cnblogs.com/weimingai/p/15086318.html
https://www.jianshu.com/p/5957ee924196?utm_campaign=maleskine&utm_content=note&utm_medium=writer_share&utm_source=weibo


This one is also explained in detail
https://www.cnblogs.com/shiluoliming/p/8379533.html
This one prints the results
http://www.360doc.com/content/20/0608/13/60959293_917178313.shtml

Follow along with this project
https://blog.csdn.net/CSDN__CPP/article/details/110525035

pyspider example: scraping data and saving it to a MySQL database
https://www.cnblogs.com/shiluoliming/p/8379579.html
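A minimal sketch of the save-to-MySQL pattern the linked article covers, using pymysql; the connection parameters, table name and column names are placeholders, not the article's code:

import pymysql

class SQL:
    def __init__(self):
        # placeholder connection parameters
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='secret',
                                    db='spider', charset='utf8mb4')

    def insert(self, table, **values):
        # build a parameterized INSERT from the result dict (table/column names are assumed trusted)
        cols = ', '.join(values.keys())
        marks = ', '.join(['%s'] * len(values))
        sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, cols, marks)
        with self.conn.cursor() as cur:
            cur.execute(sql, tuple(values.values()))
        self.conn.commit()

# In the pyspider Handler, override on_result so every record returned by
# detail_page is written to the database:
#     def on_result(self, result):
#         if result:
#             SQL().insert('articles', url=result['url'], title=result['title'])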


Site analysis and scraping with BeautifulSoup
https://www.jianshu.com/p/807d57a70e78


Storing results in a database
http://www.360doc.com/content/20/0313/22/9422167_898992730.shtml
https://blog.csdn.net/ityard/article/details/102763533

Example
I scraped Zhihu
https://ityard.blog.csdn.net/article/details/105757618
https://blog.csdn.net/ityard/category_9410986.html



 

元气壁纸 (Yuanqi wallpaper site)
https://blog.csdn.net/weixin_43552452/article/details/111987107
Detailed example
https://cuiqingcai.com/3179.html

https://www.jianshu.com/p/8cd25fe07363

https://zhuanlan.zhihu.com/p/73844521

https://zhuanlan.zhihu.com/p/265326672


https://wenku.baidu.com/view/677847cde309581b6bd97f19227916888486b936.html


Python string replacement: two methods
https://www.cnblogs.com/jacob-gn/p/16019591.html

 


Scraping Huxiu (huxiu.com) article data with pyspider
https://www.csdn.net/tags/MtTaMg4sNzM4MDM0LWJsb2cO0O0O.html

 

Also explained in detail
https://blog.csdn.net/weixin_43411585/article/details/97812462


Architecture practice
https://xie.infoq.cn/article/b6d2825a5a674776dd872f015

 

https://www.jb51.net/article/205596.htm

Python crawlers
https://baijiahao.baidu.com/s?id=1712491497500869183&wfr=spider&for=pc

Concise Python learning

Links
https://blog.csdn.net/qq_42783263/article/details/94196935
https://www.jianshu.com/p/3663623cedd5
https://www.jianshu.com/p/caea3ca21a2a
https://www.jianshu.com/p/bfae053441e6
https://blog.csdn.net/qqlixiao2014/article/details/75612124

This one is the most detailed
https://www.jb51.net/article/204912.htm


Manual
https://download.csdn.net/download/u011008678/10731272
Manual
https://www.cntofu.com/book/156/totorial/level3.md

 


Steps


Filtering prices and images

Proxy configuration
Database access configuration
Header configuration (user agent, proxy, certificate validation, timeout, retry interval); required imports (from ... import ...)

crawl_config
retry_delay: the retry interval after a failed fetch

@every(...): run on_start on a schedule
@config(priority=8): task priority
@config(age=...): how long a fetched page stays valid before it is re-crawled

Scraping rules

Parsing errors

Result data storage table → parsing the stored results

 

Code for each part

 

Scraping image links
Downloading the images

 

Scraping the content
Editing the content and filtering out tags (see the sketch after these notes)


Result data storage table


2022-4-15
Multi-page links (GET)
Collecting image links and downloading the images

Editing the collected content
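
For the "edit content / filter tags" step above, a minimal sketch using pyquery (the library that backs pyspider's response.doc); the list of tags to strip is a placeholder:

from pyquery import PyQuery as pq

def clean_content(html):
    # drop unwanted elements before taking the text
    doc = pq(html)
    doc('script, style, img, a').remove()   # placeholder list of tags to filter out
    # collapse the whitespace left behind by the removed tags
    return ' '.join(doc.text().split())

print(clean_content('<div>Price <b>1,000</b> yen <img src="x.jpg"><script>track()</script></div>'))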

 

 

# crawl_config (see the notes above): headers, proxy, certificate validation, timeouts, retries
crawl_config = {
    'headers': {
        'authority': 'www.amazon.com',
        'cache-control': 'max-age=0',
        'rtt': '250',
        'downlink': '10',
        'ect': '4g',
        'sec-ch-ua-mobile': '?0',
        'upgrade-insecure-requests': '1',
        #'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
        'user-agent': 'Mozilla/5.0',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'zh-CN,zh;q=0.9'
    },
    # proxy_ips_list_even is a user-defined proxy list; requires import random at the top
    'proxy': random.choice(proxy_ips_list_even[1::2]),
    'validate_cert': False,
    'connect_timeout': 200,
    'timeout': 200,
    'retries': 3
}
retry_delay = {  # retry interval (seconds) after a failed fetch, keyed by retry count
    0: 10,
    1: 1 * 60,
    2: 1 * 60,
    3: 1 * 60,
    '': 1 * 60   # default for any further retries
}


Code for collecting URLs
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="http"]').items():
        self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

Collecting a custom range of pages
@every(minutes=24 * 60)
def on_start(self):
    i = 2
    while i < 10:
        self.crawl('https://bizhi.ijinshan.com/' + str(i) + '/#', callback=self.index_page, validate_cert=False, fetch_type='js')
        i += 1

Collecting URLs by incrementing a page number
def __init__(self):
    self.base_url = ""        # base URL without the page number
    self.page_num = 1
    self.total_num = 5

def on_start(self):
    while self.page_num <= self.total_num:
        url = self.base_url + str(self.page_num)
        self.crawl(url, callback=self.index_page, validate_cert=False)
        self.page_num += 1

Code for filtering URLs

Modify the index_page function as follows (note: add import re at the top of the script):

def index_page(self, response):
    for each in response.doc('a[href^="http"]').items():
        if re.match(r'http://www.mzitu.com/\d.*$', each.attr.href):
            self.crawl(each.attr.href, callback=self.detail_page)

Another variant
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    for each in response.doc('a[href^="http"]').items():
        if re.match(r'http://www.yunjinet.com/sell/show.+', each.attr.href):
            self.crawl(each.attr.href, callback=self.detail_page)
        else:
            self.crawl(each.attr.href, callback=self.index_page)


Code for collecting image_urls
@config(priority=2)
def detail_page(self, response):
    # grab the images on the current page
    for each in response.doc('.xzoom-container > img.img-responsive').items():
        img_url = each.attr.src
    # grab the next-page URL
    for each in response.doc('div.xzoom-thumbs').items():
        self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

Code for collecting image_urls (variant 2)
def detail_page(self, response):
    for each in response.doc('div.fs-c-productMainImage__image > img').items():
        img_url = each.attr.src  # then print it out


Code for collecting image_urls (variant 3)
"v_products_image": response.doc('div.fs-c-productMainImage__image > img').attr.src,

Approach for collecting multiple images
"v_categories_name_1": response.doc('[itemprop="child"] *').attr.src,
If a replacement is needed, e.g. swapping the small image for the large one, add a replacement rule as well

Replacement code (direct replacement). Note that str.replace returns a new string, so assign the result:
v_products_image_1 = v_products_image_1.replace('-s.jpg', '-l.jpg')

Replacement code (regex replacement)
import re

# a is the source string; replace every occurrence of 'word' with 'python'
strinfo = re.compile('word')
b = strinfo.sub('python', a)
print(b)


Replacement code (in detail_page). Note: append .text() to the selector, otherwise the value may come back as HTML and the replacement will fail because of the format.

price = response.doc('div.p-goods-information__primary > div.p-goods-information__price').text()
price1 = re.sub(r'税込|,', '', price)  # strip the '税込' (tax included) label and the commas
Nested into one line:
price1 = re.sub(r'税込|,', '', response.doc('div.p-goods-information__primary > div.p-goods-information__price').text())


"v_products_model": re.sub(r'.*?did=|,', '', response.url),

posted @ 2022-04-19 15:54  cooler101