Using the Scrapy Framework

Crawler workflow

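Create the project (scrapy startproject xxx): set up a new crawler project
Write the spider (spiders/xxspider.py): write the spider that crawls the pages
Define the target (items.py): declare the fields you want to scrape
Store the content (pipelines.py): design a pipeline that stores the scraped content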

Creating the project

scrapy startproject project_name

Creating a spider template

cd project_name

scrapy genspider <spider name> <domain used to filter the spider's requests>, for example: scrapy genspider name xxx
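The domain passed to genspider ends up in the generated spider's allowed_domains list, and requests to hosts outside it are dropped. The following is a simplified illustration of that matching rule, not Scrapy's actual implementation; the function name is_allowed and the example.com domains are assumptions made for the example.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in (d.lower() for d in allowed_domains)
    )
```

Under this rule www.example.com passes for the domain example.com, while notexample.com does not, because the match is made on whole domain labels rather than on a plain suffix.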

Spider code

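import re

import scrapy


class ITEM(scrapy.Item):
    title = scrapy.Field()


class VersionSpider(scrapy.Spider):
    name = "version"  # spider name (not the project name), as used by `scrapy crawl version`
    # Request URLs; href stands for the page URL, which should be built from page
    start_urls = [href for page in range(1, 173)]

    # A spider has a single parse(); keep whichever variant matches the response type.

    # When the response is HTML
    def parse(self, response):
        # xpath() on a text node or attribute returns a selector list;
        # get() extracts the first match as a string (the XPath is left blank here)
        title = response.xpath('').get()
        title = re.sub(r"\s", "", title)  # data cleaning: strip all whitespace

        item = ITEM()
        item["title"] = title
        yield item

    # When the response is JSON
    def parse(self, response):
        data = response.json()  # json() parses the response body into a dict (Scrapy 2.2+)
        title = data["title"]  # plain dict access; no get() needed

        item = ITEM()
        item["title"] = title
        yield item  # parse() is a generator; each yielded item is passed to the pipelines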
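Two details in the spider above are easy to trip over: href has to be built from page, and get() returns None when the XPath matches nothing, in which case re.sub raises a TypeError. A minimal sketch of both follows. The URL pattern on example.com and the helper names build_start_urls and clean_title are assumptions for illustration, not part of the original project.

```python
import re

# Hypothetical URL pattern; substitute the real paginated address.
BASE_URL = "https://example.com/list?page={page}"


def build_start_urls(first=1, last=172):
    """Return one URL per page, from first to last inclusive."""
    return [BASE_URL.format(page=page) for page in range(first, last + 1)]


def clean_title(raw):
    """Strip all whitespace; return None when the selector matched nothing."""
    if raw is None:
        return None
    return re.sub(r"\s", "", raw)
```

Inside the spider, start_urls = build_start_urls() would cover the same pages as range(1, 173), and clean_title(response.xpath(...).get()) would let parse() skip items whose title is missing instead of crashing.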

Running the project

# Create a run.py file in the project to launch the spider
from scrapy import cmdline
cmdline.execute("scrapy crawl spider_name".split())

Storing data

The storage pipeline must be enabled in settings.py; otherwise its code never runs.

ITEM_PIPELINES = {  
    "bilibili.pipelines.MongoPipeline": 300,  
}
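The pipeline class named in ITEM_PIPELINES is not shown in this post. As a rough sketch of the shape such a class takes, the example below writes each item as one JSON line to a local file instead of MongoDB. The class name JsonLinesPipeline and the path items.jl are hypothetical. Scrapy calls open_spider, process_item and close_spider on an enabled pipeline that defines them.

```python
import json


class JsonLinesPipeline:
    """Stand-in for a storage pipeline: appends each item to a JSON Lines file."""

    def __init__(self, path="items.jl"):  # hypothetical output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict() works for both scrapy.Item objects and plain dicts.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to any later pipeline
```

A class like this would be enabled the same way as the entry above, by adding its import path to ITEM_PIPELINES. The number (300 above) sets the order: pipelines with lower values run first.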

posted @ 2023-06-13 21:35 向众神祈祷
