Data Collection and Fusion Technology: Lab 3

Gitee link: Lab 3

1. Assignment ①

1.1 Assignment Description

  • Requirement: pick a website and crawl all of the images on it, e.g. the China Weather site (http://www.weather.com.cn). Crawl once with a single thread and once with multiple threads. (The number of images to crawl is capped at the last 4 digits of the student ID.)
  • Output: print the URL of each downloaded image to the console, save the images in the images subfolder, and include screenshots.

1.2 Approach

1.2.1 Inspect the home page: the images sit in img tags, and the links to sub-pages are in the href attributes of a tags
[screenshot: img tags and a-tag hrefs in the page source]
1.2.2 While traversing the site, a spider function collects a few image URLs from each page

from urllib import request
from bs4 import BeautifulSoup

imgs = []  # collected image URLs

def spider(url, deep):
    # stop once enough images are collected or the depth budget is used up
    if len(imgs) >= 121 or deep == 0:
        return
    data = request.urlopen(url).read()
    soup = BeautifulSoup(data, "lxml")
    images = soup.select("img")
    links = soup.select("a")
    # take at most 5 images from each page
    for img in images[:5]:
        src = img.get("src", "")
        if src.startswith("http") and src not in imgs:
            imgs.append(src)
    # follow absolute links with one less level of depth
    for link in links:
        if link.get("href", "").startswith("http"):
            spider(link["href"], deep - 1)
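
The entry point is not shown in the post; a minimal sketch of how the crawl might be started (the initial depth of 3 mirrors the meta={"deep": 3} used in Assignment ②):

if __name__ == "__main__":
    spider("http://www.weather.com.cn/", 3)  # an initial depth of 3 is an assumption
    print("collected", len(imgs), "image URLs")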

1.2.3 Single-threaded download

def download():
    # fetch the collected URLs one by one, in order (121 comes from the student ID)
    global imgs
    for i in range(121):
        request.urlretrieve(imgs[i], r"D:\数据采集\demo\实践\3\3.1\images\第" + str(i + 1) + "张图.jpg")
        print("downloaded ", imgs[i])

1.2.4 Multi-threaded download

import threading

threads = []

def download_one(url, n):
    # single-image worker, matching the (url, index) args passed below
    request.urlretrieve(url, r"D:\数据采集\demo\实践\3\3.1\images\第" + str(n) + "张图.jpg")
    print("downloaded ", url)

def download_threads():
    for i in range(121):
        T = threading.Thread(target=download_one, args=(imgs[i], i + 1))
        T.daemon = False  # setDaemon() is a deprecated alias
        T.start()
        threads.append(T)
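
The threads are non-daemon, so the interpreter waits for them anyway, but joining them explicitly marks the point where every download has finished (a small sketch using the threads list above):

for T in threads:
    T.join()
print("all", len(threads), "downloads finished")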

1.2.5 Results
[screenshot: console output and saved images]

1.3 Takeaways

  • Single-threaded downloading processes the images strictly in order and is simple and intuitive, but slow; the multi-threaded version gives no ordering guarantee, but downloads much faster
  • Got familiar with traversing a website and consolidated my understanding of multi-threaded crawlers

2. Assignment ②

2.1 Assignment Description

Requirement: reproduce Assignment ① with the Scrapy framework. Output: same as Assignment ①.

2.2 Approach

2.2.1 Give the Item a single field, src_url, to store the image URL, then write mySpider
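
The item definition itself is not shown; a minimal sketch matching the single src_url field used below:

import scrapy

class DemoItem(scrapy.Item):
    src_url = scrapy.Field()  # the image URL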

import scrapy
from demo.items import DemoItem  # the item sketched above

class mySpider(scrapy.Spider):
    name = "mySpider"
    count = 0
    imgs = []

    def start_requests(self):
        url = "http://www.weather.com.cn/"
        # carry the remaining crawl depth in the request meta
        yield scrapy.Request(url=url, callback=self.parse, meta={"deep": 3})

    def parse(self, response):
        try:
            deep = response.meta["deep"]
            if deep <= 0 or self.count >= 121:
                return
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            images = selector.xpath("//img/@src").extract()
            n = 0
            for i in images:
                if i.startswith("http") and i not in self.imgs:
                    item = DemoItem()
                    item["src_url"] = i
                    self.imgs.append(i)
                    self.count += 1
                    n += 1
                    yield item
                    if n >= 5:  # take at most 5 images per page
                        break
            links = selector.xpath("//a/@href").extract()
            for link in links:
                if link.startswith("http"):
                    url = response.urljoin(link)
                    yield scrapy.Request(url=url, callback=self.parse, meta={"deep": deep - 1})

        except Exception as err:
            print(err)

2.2.2 Download the images in the pipeline

import requests

class DemoPipeline:
    count = 1

    def process_item(self, item, spider):
        try:
            # stop writing files once the quota of 121 images is reached
            if self.count <= 121:
                with open("./images/第" + str(self.count) + "张图.jpg", "wb") as f:
                    print(str(self.count) + ":" + item["src_url"])
                    img = requests.get(item["src_url"]).content
                    f.write(img)
                    self.count += 1
        except Exception as err:
            print(err)
        return item
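
One caveat: open() raises FileNotFoundError when the ./images directory does not exist yet, so it is worth creating it up front, for example with an open_spider hook added to DemoPipeline (a sketch):

import os

    def open_spider(self, spider):
        # make sure the output directory exists before the first write
        os.makedirs("./images", exist_ok=True)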

2.2.3 Update settings.py: disable robots.txt and register the pipeline

BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   'demo.pipelines.DemoPipeline': 300,
}
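
With the pipeline registered, the spider can be launched from the project root with scrapy crawl mySpider; the downloaded files then appear under ./images/.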

2.2.4 Output
[screenshot: Scrapy console output and saved images]

2.3 Takeaways

  • Reviewed the Scrapy framework and learned to use XPath; once familiar, XPath feels even more convenient than CSS selectors
  • Learned how Scrapy passes parameters between requests and parse through the Request meta dict

3. Assignment ③

3.1 Assignment Description

  • Requirement: crawl Douban movie data with Scrapy and XPath, store the content in a database, and save the cover images under the imgs directory. Record the F12 developer-tools analysis as a GIF and include it in the blog post.

  • Output:

    No. | Title | Director | Actors | Synopsis | Rating | Cover
    1 | 肖申克的救赎 | 弗兰克·德拉邦特 | 蒂姆·罗宾斯 | 希望让人自由 | 9.7 | ./imgs/xsk.jpg
    2 | ...

3.2 Approach

3.2.1 Parse the page and locate the fields we need
[screenshot: locating the movie fields in the F12 developer tools]

3.2.2 Write MovieItem

import scrapy

class MovieItem(scrapy.Item):
    rank = scrapy.Field()      # ranking on the Top 250 list
    name = scrapy.Field()      # movie title
    director = scrapy.Field()
    actor = scrapy.Field()
    score = scrapy.Field()     # rating
    img = scrapy.Field()       # cover image URL
    desp = scrapy.Field()      # one-line quote / description

3.2.3 Write the spider

Crawl all 10 pages; the Top 250 list shows 25 movies per page, hence start=i*25:

    def start_requests(self):
        for i in range(10):
            url = "https://movie.douban.com/top250?start=" + str(i * 25)
            yield scrapy.Request(url=url,callback=self.parse)

The parse function:

    # needs "import re" and the MovieItem import at the top of the module
    def parse(self, response):
        try:
            data = response.body.decode()
            selector = scrapy.Selector(text=data)
            # each movie sits in an <li><div class="item"> block
            movies = selector.xpath("//li/div[@class='item']")
            for m in movies:
                rank = m.xpath("./div[@class='pic']/em/text()").extract_first()
                image = m.xpath("./div[@class='pic']/a/img/@src").extract_first()
                name = m.xpath("./div[@class='info']//span[@class='title']/text()").extract_first()
                members = m.xpath("./div[@class='info']//p[@class='']/text()").extract_first()
                desp = m.xpath("./div[@class='info']//p[@class='quote']/span/text()").extract_first()
                score = m.xpath("./div[@class='info']//span[@class='rating_num']/text()").extract_first()
                item = MovieItem()
                item['rank'] = rank
                item['name'] = name
                # split the "导演:... 主演:..." text node into director and actors
                director = re.search(r'导演:(.*?)\s主', members).group(1)
                actor = re.search(r'主演:(.*)', members)
                item['director'] = director
                # some animated films list no actors, so this match can be empty
                if actor is None:
                    item['actor'] = "null"
                else:
                    item["actor"] = actor.group(1)
                item['desp'] = desp
                item['score'] = score
                item['img'] = image
                yield item

        except Exception as err:
            print(err)
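
To make the two regular expressions concrete, here is what they pull out of a typical members string (the sample is illustrative, abridged from what the page renders):

import re

members = "导演:弗兰克·德拉邦特 Frank Darabont   主演:蒂姆·罗宾斯 Tim Robbins /..."
print(re.search(r'导演:(.*?)\s主', members).group(1))  # director segment (keeps stray spaces)
print(re.search(r'主演:(.*)', members).group(1))       # actor segment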

3.2.4 Write the database helper class

import sqlite3

class MovieDB:
    def openDB(self):
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            self.cursor.execute("create table movies (Rank int,Name varchar(32),Director varchar(32),"
                                "Actors varchar(64),Description varchar(64),Score varchar(8),ImgPath varchar(64))")
        except Exception:
            # the table already exists, so just clear out the old rows
            self.cursor.execute("delete from movies")

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, Rank, Name, Director, Actors, Description, Score, ImgPath):
        try:
            self.cursor.execute("insert into movies (Rank,Name,Director,Actors,Description,Score,ImgPath) "
                                "values (?,?,?,?,?,?,?)", (Rank, Name, Director, Actors, Description, Score, ImgPath))
        except Exception as err:
            print(err)
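
A quick standalone check of the helper, using the sample row from the assignment spec:

db = MovieDB()
db.openDB()
db.insert(1, "肖申克的救赎", "弗兰克·德拉邦特", "蒂姆·罗宾斯", "希望让人自由", "9.7", "./imgs/xsk.jpg")
db.closeDB()  # the commit happens here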

3.2.5 Write the pipeline

import requests

class MoviespiderPipeline:
    def open_spider(self, spider):
        # open the SQLite helper defined above
        self.db = MovieDB()
        self.db.openDB()

    def process_item(self, item, spider):
        # save the cover image, named after the movie
        path = r"./images/" + item['name'] + ".jpg"
        url = item['img']
        img = requests.get(url).content
        with open(path, "wb") as f:
            f.write(img)
            print("第" + item['rank'] + "张封面下载成功")  # cover downloaded
        self.db.insert(int(item['rank']), item['name'], item['director'], item['actor'],
                       item['desp'], item["score"], path)
        print("第" + item['rank'] + "部电影数据插入成功")  # row inserted
        return item

    def close_spider(self, spider):
        self.db.closeDB()
        print("结束爬取")  # crawl finished

3.2.6 Update settings.py
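
The post leaves this step out; presumably the same changes as in 2.2.3, plus a browser User-Agent, since Douban tends to reject Scrapy's default one (a sketch; the project name moviespider is an assumption):

ROBOTSTXT_OBEY = False

# Douban usually refuses the default Scrapy user agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

ITEM_PIPELINES = {
   'moviespider.pipelines.MoviespiderPipeline': 300,
}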
3.2.7 Output
[screenshot: database rows and downloaded covers]

3.3 Takeaways

  • Became more proficient with Scrapy and XPath
posted @ 2021-11-09 21:14 oxoxoox