Assignment 1:
Pick a website and crawl all of the images on it, for example the China Weather Network (http://www.weather.com.cn). Implement the crawl with the Scrapy framework in both single-threaded and multi-threaded modes. Be sure to cap the crawl, e.g. limit the total number of pages (last 2 digits of the student ID) and the total number of downloaded images (last 3 digits). Output: print each downloaded URL to the console, store the downloaded images in an images subfolder, and provide screenshots.
Code and run results:
Spider code:
import scrapy
from urllib.parse import urljoin
from scrapy import Item, Field

class WeatherItem(Item):
    # Field name expected by Scrapy's built-in ImagesPipeline.
    image_urls = Field()

class Myspider31Spider(scrapy.Spider):
    name = "myspider31"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["https://weather.com.cn"]

    def parse(self, response):
        # Collect every <img src> on the page and resolve relative
        # paths against the page URL.
        image_urls = response.css('img::attr(src)').getall()
        full_image_urls = [urljoin(response.url, img_url) for img_url in image_urls]
        item = WeatherItem()
        item['image_urls'] = full_image_urls
        yield item
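The spider above does not yet enforce the assignment's caps on total pages and total images. A framework-free sketch of the counting logic the spider's parse() could mirror; the class name and the limit values are placeholders for the student-ID digits, not part of the original code:

```python
class CrawlLimiter:
    """Tracks page and image counts so a crawl stops at the caps."""

    def __init__(self, max_pages, max_images):
        self.max_pages = max_pages    # e.g. last 2 digits of the student ID
        self.max_images = max_images  # e.g. last 3 digits of the student ID
        self.pages = 0
        self.images = 0

    def allow_page(self):
        # Returns True and counts the page while under the page cap.
        if self.pages >= self.max_pages:
            return False
        self.pages += 1
        return True

    def take_images(self, urls):
        # Keeps only as many URLs as still fit under the image cap.
        room = max(self.max_images - self.images, 0)
        kept = urls[:room]
        self.images += len(kept)
        return kept

limiter = CrawlLimiter(max_pages=2, max_images=3)
print(limiter.allow_page())                     # → True
print(limiter.take_images(["a.jpg", "b.jpg"]))  # → ['a.jpg', 'b.jpg']
print(limiter.take_images(["c.jpg", "d.jpg"]))  # → ['c.jpg'] (one slot left)
```

In the spider, parse() would call allow_page() before processing a response and wrap the extracted URL list in take_images() before yielding the item.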
settings code:
ITEM_PIPELINES = {
    # "project31.pipelines.Project31Pipeline": 300,
    'scrapy.pipelines.images.ImagesPipeline': 300,
}
# Use a raw string so the backslashes in the Windows path are not
# interpreted as escape sequences.
IMAGES_STORE = r'D:\数据集\数据采集实践3-1'
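The page and image caps can also be enforced from settings alone via Scrapy's CloseSpider extension, and CONCURRENT_REQUESTS is the usual way to contrast a single-threaded-style run with a concurrent one (Scrapy itself is asynchronous rather than literally multi-threaded). A sketch of possible settings.py additions; the numeric values are placeholders for the student-ID digits:

```python
# Stop the spider once the caps are reached (values are placeholders).
CLOSESPIDER_PAGECOUNT = 23    # last 2 digits of the student ID
CLOSESPIDER_ITEMCOUNT = 123   # last 3 digits of the student ID

# Single-threaded-style crawl: one request in flight at a time.
CONCURRENT_REQUESTS = 1
# For the concurrent ("multi-threaded") run, raise this instead;
# Scrapy's default is 16.
# CONCURRENT_REQUESTS = 16
```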
pipelines code:
from itemadapter import ItemAdapter

class Project31Pipeline:
    def process_item(self, item, spider):
        return item
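The assignment also asks for each downloaded URL to be printed to the console. One way, sketched below, is a custom pipeline (the name UrlLoggingPipeline is hypothetical) that subclasses the stock ImagesPipeline and logs URLs in item_completed(); registering it in ITEM_PIPELINES in place of 'scrapy.pipelines.images.ImagesPipeline' would enable it. The URL-extraction helper is kept framework-free so it can be checked on its own:

```python
def successful_urls(results):
    """Pull the 'url' of each successful download from the
    (ok, info_or_failure) pairs Scrapy passes to item_completed()."""
    return [info["url"] for ok, info in results if ok]

try:
    from scrapy.pipelines.images import ImagesPipeline

    class UrlLoggingPipeline(ImagesPipeline):
        # Hypothetical pipeline: prints every downloaded URL to the console.
        def item_completed(self, results, item, info):
            for url in successful_urls(results):
                print("downloaded:", url)
            return super().item_completed(results, item, info)
except ImportError:
    # Scrapy not installed; the helper above still works standalone.
    pass

# Framework-free check of the helper with fake result pairs:
fake = [(True, {"url": "http://example.com/a.jpg", "path": "full/a.jpg"}),
        (False, Exception("download failed"))]
print(successful_urls(fake))  # → ['http://example.com/a.jpg']
```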
Run results: