Data Collection — Assignment 3

Assignment 1:

Pick a website and crawl all of the images on it, e.g. China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement both single-threaded and multi-threaded crawling. Be sure to limit the crawl: cap the total number of pages (last 2 digits of your student ID) and the total number of downloaded images (last 3 digits of your student ID). Output: print each downloaded URL to the console, store the downloaded images in an images subfolder, and provide screenshots.

Code and results:

Spider code:

import scrapy
from urllib.parse import urljoin
from scrapy import Item, Field


class WeatherItem(Item):
    # Field name expected by Scrapy's built-in ImagesPipeline
    image_urls = Field()


class Myspider31Spider(scrapy.Spider):
    name = "myspider31"
    allowed_domains = ["weather.com.cn"]
    start_urls = ["https://weather.com.cn"]

    def parse(self, response):
        # Collect every <img src=...> on the page and resolve
        # relative paths against the page URL.
        image_urls = response.css('img::attr(src)').getall()
        full_image_urls = [urljoin(response.url, img_url) for img_url in image_urls]

        item = WeatherItem()
        item['image_urls'] = full_image_urls
        yield item
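The spider above yields every URL it finds, but the assignment requires capping the total number of downloaded images. A minimal helper like the following could deduplicate and truncate the list before it reaches the pipeline (the function name and the cap value are assumptions for illustration; the cap should be the last 3 digits of the student ID):

```python
from urllib.parse import urljoin

# Hypothetical cap: last 3 digits of the student ID (value assumed here).
MAX_IMAGES = 120


def collect_image_urls(base_url, srcs, limit=MAX_IMAGES):
    """Resolve relative src attributes against the page URL,
    drop duplicates, and stop once `limit` URLs are collected."""
    seen = []
    for src in srcs:
        url = urljoin(base_url, src)
        if url not in seen:
            seen.append(url)
        if len(seen) >= limit:
            break
    return seen
```

Inside `parse()` this would replace the list comprehension, e.g. `item['image_urls'] = collect_image_urls(response.url, image_urls)`.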

Settings code:

ITEM_PIPELINES = {
    # "project31.pipelines.Project31Pipeline": 300,
    "scrapy.pipelines.images.ImagesPipeline": 300,
}
# Raw string so backslashes in the Windows path are not treated as escapes
IMAGES_STORE = r'D:\数据集\数据采集实践3-1'
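Scrapy is driven by an asynchronous event loop rather than OS threads, so the assignment's single-threaded vs. multi-threaded comparison is usually realized through the concurrency settings. A sketch of the two configurations, plus the hard crawl limits the assignment asks for (all numeric values are illustrative assumptions):

```python
# settings.py — sequential ("single-threaded") crawl:
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 0.5  # be polite to the site

# settings.py — concurrent ("multi-threaded") crawl; comment out the
# two lines above and enable these instead:
# CONCURRENT_REQUESTS = 16
# CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Hard limits enforced by the CloseSpider extension (values assumed):
CLOSESPIDER_PAGECOUNT = 20   # last 2 digits of the student ID
CLOSESPIDER_ITEMCOUNT = 120  # stop after enough items are scraped
```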

Pipelines code:

from itemadapter import ItemAdapter


class Project31Pipeline:
    # Default pass-through pipeline; it is left disabled in settings.py,
    # since the built-in ImagesPipeline handles the downloads.
    def process_item(self, item, spider):
        return item
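The built-in ImagesPipeline stores files under IMAGES_STORE, but the assignment also asks for the downloaded URLs on the console. A minimal pass-through pipeline that prints them could look like this (the class name is hypothetical; it would be added to ITEM_PIPELINES with a priority below 300 so it runs before the image download):

```python
class UrlLoggingPipeline:
    """Hypothetical pipeline: print every collected image URL to the
    console, then hand the item on unchanged."""

    def process_item(self, item, spider):
        for url in item.get("image_urls", []):
            print(url)
        return item
```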

Run results

posted @ 2024-11-12 00:07 关忆南北