Data Collection: Practice 3

Assignment 3

I. Assignment Content

  • Assignment ①:

    • Requirement: Pick a website and crawl all of the images on it, for example the China Weather Network (http://www.weather.com.cn). Implement both a single-threaded and a multi-threaded crawler. (The number of images to crawl is limited to the last 3 digits of the student ID.)

    • Output:

      Print the URLs of the downloaded images to the console, save the images in the result subfolder, and provide screenshots.

      1. Single-threaded implementation:

        ① Parse the page with BeautifulSoup:

        import urllib.request
        from bs4 import BeautifulSoup, UnicodeDammit

        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"}
        req = urllib.request.Request(url, headers=headers)      # build the request with a browser User-Agent
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data, ["utf-8", "gbk"])          # detect the encoding (utf-8 or gbk)
        data = dammit.unicode_markup
        soup = BeautifulSoup(data, "lxml")
        
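        The main program further down calls a getHTMLText helper that is never shown in the post. A minimal sketch, assuming it simply wraps the parsing code above and returns the parsed soup:

        # Hypothetical helper assumed by the main program (not shown in the original post):
        # fetch a page with the browser headers above and return a BeautifulSoup object.
        def getHTMLText(url):
            req = urllib.request.Request(url, headers=headers)
            data = urllib.request.urlopen(req).read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            return BeautifulSoup(dammit.unicode_markup, "lxml")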

        ② Page-navigation function:

        As the screenshot of the page source shows, the city names and their links are stored under the dd a nodes, so they can be collected into two lists:

        def getNextPage(data, num):      # handle moving on to the city pages
            urls = []       # urls of the city pages
            citys = []      # city names
            for i in range(num):
                city = data.select("dd a")[i].text      # city name
                url = data.select("dd a")[i]["href"]    # url of that city's page
                citys.append(city)
                urls.append(url)
            return citys, urls
        

        ③ Image-saving function:

        import requests

        '''
        Function: save an image to the local disk
        Parameters:
            url : url of the image
            path : local path to save it to
        '''
        def save(path, url):
            try:
                r = requests.get(url)
                with open(path, 'wb') as f:     # write the image bytes to the file
                    f.write(r.content)
            except Exception as err:
                print(err)
        

        ④ Main program:

        Url = "http://www.weather.com.cn"   # home page
        data = getHTMLText(Url)
        num = 11
        citys, urls = getNextPage(data, num)        # urls holds the links to the city pages
        count = 1
        for i in range(len(urls)):                  # visit every city page
            if count > 425:                         # the last 3 digits of the student ID are 425, so crawl at most 425 images
                break
            url = urls[i]
            city = citys[i]
            print("Crawling weather images for " + str(city))
            data = getHTMLText(url)
            images = data.select("img")             # all img nodes on the page; each carries an image url
            for j in range(len(images)):            # for every image
                if count > 425:                     # stop once the limit is reached
                    break
                image_url = images[j]["src"]        # the src attribute is the image url
                print(image_url)                    # print the image url
                file = r"E:/PycharmProjects/DataCollecction/test/3/1.1_result/" + str(count) + r".jpg"      # save path
                save(file, image_url)
                count += 1                          # one more image saved
        print("Crawled %d images in total" % (count - 1))
        

        ⑤ Results:

      2. Multi-threaded implementation:

        ① imageSpider function:

        def imageSpider(start_url):
            global threads
            global count
            try:
                urls = []                               # urls already seen on this page, to avoid duplicates
                req = urllib.request.Request(start_url, headers=headers)
                data = urllib.request.urlopen(req)
                data = data.read()
                dammit = UnicodeDammit(data, ["utf-8", "gbk"])
                data = dammit.unicode_markup
                soup = BeautifulSoup(data, "lxml")
                images = soup.select("img")             # all img nodes; each carries an image url
                for image in images:
                    try:
                        src = image["src"]              # the src attribute is the image url
                        url = urllib.request.urljoin(start_url, src)
                        if url not in urls:             # only download each url once
                            urls.append(url)
                            print(url)
                            count = count + 1
                            T = threading.Thread(target=download, args=(url, count))   # one download thread per image
                            T.setDaemon(False)
                            T.start()
                            threads.append(T)
                    except Exception as err:
                        print(err)
            except Exception as err:
                print(err)
        

        ② download function:

        def download(url, count):
            try:
                if url[len(url)-4] == ".":              # keep the original extension if the url ends with one (e.g. .jpg, .png)
                    ext = url[len(url)-4:]
                else:
                    ext = ""
                req = urllib.request.Request(url, headers=headers)
                data = urllib.request.urlopen(req, timeout=100)
                data = data.read()
                fobj = open("E:/PycharmProjects/DataCollecction/test/3/1.2_result/"+str(count)+ext, "wb")       # save path and file name
                fobj.write(data)
                fobj.close()
                print("downloaded "+str(count)+ext)
            except Exception as err:
                print(err)
        

        ③ Main program:

        start_url = "http://www.weather.com.cn"   # home page
        # request headers
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"}
        count = 0         # number of images crawled so far
        req = urllib.request.Request(start_url, headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data, "lxml")
        citys = soup.select("dd a")       # city names
        urls = soup.select("dd a")        # links to the city pages
        threads = []
        for i in range(len(urls)):
            if count > 425:               # stop once 425 images have been reached
                break
            else:
                url = urls[i]["href"]
                city = citys[i].text
                print("Crawling weather images for " + str(city))
                imageSpider(url)
        for t in threads:                 # wait for every download thread to finish
            t.join()
        print("The End")
        

        ④ Results:

        Because multiple threads download concurrently, the images are not saved in the order in which they were crawled.

  • Assignment ②

    • Requirement: Reproduce assignment ① with the Scrapy framework.

    • Output:

      Same as assignment ①.

      1. Implementation:

        ① items.py:

        import scrapy

        class Demo2Item(scrapy.Item):
            img_url = scrapy.Field()        # stores the image urls
            pass
        

        ② pipelines.py (serves the same purpose as the save function in assignment ①):

        import urllib.request

        class Demo2Pipeline:
            count = 0
            def process_item(self, item, spider):
                try:
                    for url in item["img_url"]:
                        print(url)
                        if url[len(url) - 4] == ".":    # keep the original extension if the url ends with one
                            ext = url[len(url) - 4:]
                        else:
                            ext = ""
                        req = urllib.request.Request(url)
                        data = urllib.request.urlopen(req, timeout=100)
                        data = data.read()
                        fobj = open("E:/PycharmProjects/DataCollecction/test/3/demo_2/result/" + str(Demo2Pipeline.count) + ext, "wb")
                        fobj.write(data)
                        fobj.close()
                        print("downloaded " + str(Demo2Pipeline.count) + ext)
                        Demo2Pipeline.count += 1
                except Exception as err:
                    print(err)
                return item
        

        ③ settings.py:

        DEFAULT_REQUEST_HEADERS = {     # request headers
            'accept': 'image/webp,*/*;q=0.8',
            'accept-language': 'zh-CN,zh;q=0.8',
            'referer': 'http://www.weather.com.cn',
            'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"
        }
        
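        For Demo2Pipeline to receive the items, it also has to be registered in settings.py. That part is not shown in the post; a minimal sketch, assuming the module path demo_2.pipelines by analogy with the Demo3Pipeline entry in assignment ③:

        # Assumed pipeline registration; the module path is a guess based on the assignment ③ settings.
        ITEM_PIPELINES = {
            'demo_2.pipelines.Demo2Pipeline': 300,
        }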

        ④ MySpider.py (plays the same role as the main program in assignment ①; it extracts the image information):

        Extracting the image urls:

        dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        selector = scrapy.Selector(text=data)
        
        img_url = selector.xpath("//img/@src").extract()    # src attributes of all img nodes on the page
        MySpider.count += len(img_url)
        item = Demo2Item()
        item["img_url"] = img_url
        yield item
        

        Moving on to the next page:

        if MySpider.count <= 425:
            url = selector.xpath("//dd/a/@href").extract()[MySpider.page]       # url of the next city page to visit
            city = selector.xpath("//dd/a/text()").extract()[MySpider.page]
            print("Crawling weather images for " + str(city))
            yield scrapy.Request(url=url, callback=self.parse)      # request the next page, parsed by the same callback
            MySpider.page += 1  # move to the next page index
        
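        The post shows MySpider.py only as fragments. A minimal sketch of the spider class they could sit in, assuming the spider name, the start url, and the initial values of the count and page attributes:

        # Hypothetical skeleton for the fragments above; the spider name, start url and the
        # initial values of count and page are assumptions, not taken from the post.
        import scrapy
        from bs4 import UnicodeDammit
        from demo_2.items import Demo2Item      # assumed module path for the item class

        class MySpider(scrapy.Spider):
            name = "MySpider"
            start_urls = ["http://www.weather.com.cn"]
            count = 0       # images collected so far
            page = 0        # index of the next city link to follow

            def parse(self, response):
                dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
                data = dammit.unicode_markup
                selector = scrapy.Selector(text=data)
                # ... image-extraction fragment above goes here ...
                # ... page-navigation fragment above goes here ...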

        ⑤ Results:

      2. Reflections:

  • Assignment ③:

    • Requirement: Use Scrapy and XPath to crawl Douban movie data, store the content in a database, and save the cover images under the imgs directory.

    • Candidate website: https://movie.douban.com/top250

    • Output:

      No.  Title                     Director        Actors       Synopsis            Rating  Cover
      1    The Shawshank Redemption  Frank Darabont  Tim Robbins  Hope sets you free  9.7     ./imgs/xsk.jpg
      2    ...
      1. Implementation:

        ① items.py:

        import scrapy
        
        class Demo3Item(scrapy.Item):
            num = scrapy.Field()            # serial number
            name = scrapy.Field()           # movie title
            director = scrapy.Field()       # director
            actor = scrapy.Field()          # actors
            introduction = scrapy.Field()   # synopsis
            score = scrapy.Field()          # rating
            cover = scrapy.Field()          # cover image url
            pass
        

        ② pipelines.py:

        Storing the movie information in the database:

        def open_spider(self, spider):
            print("opened")
            try:
                self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                           passwd="qwe1346790", db="mydb", charset="utf8")
                self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
                self.cursor.execute("delete from movies")      # clear the table before a new crawl
                self.opened = True
                self.count = 0
            except Exception as err:
                print(err)
                self.opened = False
        
        def close_spider(self, spider):
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
                print("closed")
                print("Crawled", self.count, "movie records in total")
        
        def process_item(self, item, spider):
            try:
                print("No.: ", end="")
                print(item["num"])              # serial number
                print("Title: ", end="")
                print(item["name"])             # movie title
                print("Director: ", end="")
                print(item["director"])         # director
                print("Actors: ", end="")
                print(item["actor"])            # actors
                print("Synopsis: ", end="")
                print(item["introduction"])     # synopsis
                print("Rating: ", end="")
                print(item["score"])            # rating
                print("Cover: ", end="")
                print(item["cover"])            # cover image url
                print()
                if self.opened:
                    self.cursor.execute("insert into movies (num, name, director, actor, introduction, score, cover) values(%s, %s, %s, %s, %s, %s, %s)",
                                        (item["num"], item["name"], item["director"], item["actor"], item["introduction"], item["score"], item["cover"]))
                    self.count += 1
            except Exception as err:
                print(err)
            return item
        
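        open_spider assumes that a movies table already exists in the mydb database; the post does not show how it was created. A minimal one-off setup sketch with pymysql, where the column types are guesses:

        # Hypothetical setup script; column names follow the insert statement above, the types are assumptions.
        import pymysql

        con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                              passwd="qwe1346790", db="mydb", charset="utf8")
        cursor = con.cursor()
        cursor.execute("""
            create table if not exists movies (
                num varchar(8),
                name varchar(256),
                director varchar(256),
                actor varchar(256),
                introduction varchar(512),
                score varchar(16),
                cover varchar(512)
            )
        """)
        con.commit()
        con.close()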

        Saving the cover images locally:

        def process_item(self, item, spider):
            try:
                url = item["cover"]
                req = urllib.request.Request(url)
                data = urllib.request.urlopen(req, timeout=100)
                data = data.read()
                fobj = open("E:/PycharmProjects/DataCollecction/test/3/demo_3/result/" + str(self.count) + ".jpg", "wb")
                fobj.write(data)
                fobj.close()
                print("downloaded " + str(self.count) + ".jpg")
                self.count += 1
            except Exception as err:
                print(err)
            return item
        

        ③ settings.py:

        ITEM_PIPELINES = {
            'demo_3.pipelines.Demo3Pipeline': 300,
        }
        DEFAULT_REQUEST_HEADERS = {     # request headers
            'accept': 'image/webp,*/*;q=0.8',
            'accept-language': 'zh-CN,zh;q=0.8',
            'referer': 'https://movie.douban.com/top250',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
            'Cookie': 'll="108300"; bid=suhTDj9VAic; douban-fav-remind=1; __gads=ID=91b3b86e72bb1bdc-22f833a798cb00ff:T=1631273458:RT=1631273458:S=ALNI_Mb_chXR7p5DDTuCl3S6BX87GbcBTw; viewed="4006425"; gr_user_id=125cd979-ed5d-41e6-9bb0-f4e248d2faab; __utmz=30149280.1631273897.1.1.utmcsr=edu.cnblogs.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmz=223695111.1635338543.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=30149280; __utmc=223695111; _pk_ses.100001.4cf6=*; __utma=30149280.2089450980.1631273897.1635427621.1635435215.5; __utmb=30149280.0.10.1635435215; __utma=223695111.1984107456.1635338543.1635427621.1635435215.4; __utmb=223695111.0.10.1635435215; ap_v=0,6.0; _pk_id.100001.4cf6=251801600b05f8b9.1635338543.4.1635437309.1635429350.',
        }
        

        ④ MySpider.py:

        def start_requests(self):
            url = MySpider.source_url    # url of the first page
            yield scrapy.Request(url=url, callback=self.parse)      # parse handles the response
        

        Parsing the page:

        dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        selector = scrapy.Selector(text=data)
        

        Locating the nodes that hold the movie entries:

        lis = selector.xpath("//ol[@class='grid_view']/li")   # nodes that store the movie information
        

        Extracting the fields we want from each node:

        for li in lis:
            num = li.xpath("./div[@class='item']/div[@class='pic']/em/text()").extract_first()      # serial number
            name = li.xpath("./div[@class='item']/div[@class='pic']/a/img/@alt").extract_first()    # movie title
            # this text node contains both the director and the lead actors
            source = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='']/text()").extract_first()
            if len(source.split(':')) > 2:      # more than 2 parts after splitting on ':' means the actor part is present
                director = source.split(':')[1].split('主演')[0].split(' ')[1].strip()    # director
                actor = source.split(':')[2].split('/')[0].split(' ')[1].strip()         # lead actors
            else:                               # 2 parts or fewer means there is no actor part
                director = source.split(':')[1].split('/')[0].split(' ')[1].strip()
                actor = ' '
            # synopsis
            introduction = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='quote']/span/text()").extract_first()
            # rating
            score = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract_first()
            # cover image url
            cover = li.xpath("./div[@class='item']/div[@class='pic']/a/img/@src").extract_first()
            item = Demo3Item()          # load the data into the item
            item["num"] = num
            item["name"] = name
            item["director"] = director
            item["actor"] = actor
            item["introduction"] = introduction
            item["score"] = score
            item["cover"] = cover
            yield item
            MySpider.count += 1
        

        Moving on to the next page:


        As the page source shows, the full url of the next page is the home-page url plus the href of the next-page link:

        next_link = selector.xpath("//div[@class='article']/div[@class='paginator']/a/@href").extract()     # hrefs of the page links
        next_url = MySpider.source_url + next_link[MySpider.page - 1]
        if MySpider.page < 5:
            print("Now crawling page " + str(MySpider.page) + ".......")
            print(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)      # request the next page with the same callback
            MySpider.page += 1  # move to the next page index
        
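        The fragments above read MySpider.source_url, MySpider.count and MySpider.page but never show where they are defined. A minimal sketch of the class-level attributes they assume (the spider name and the initial values are guesses; only the source_url value follows from the post):

        # Hypothetical class header for the MySpider fragments above.
        import scrapy

        class MySpider(scrapy.Spider):
            name = "MySpider"                                # assumed spider name
            source_url = "https://movie.douban.com/top250"   # first page of the Top 250 list
            count = 0                                        # movies collected so far
            page = 1                                         # index used to pick the next page link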

      Results:

      2. Reflections:

        In assignment ③, the main problem was that the director and the lead actors live in the same text node, so the string has to be split:

        First, locate the string that contains this text with XPath:

        source = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='']/text()").extract_first()
        

        Looking more closely, some movies list lead actors while others do not,


        so the two cases need to be handled separately:

        Split source with source.split(':'): for a movie with lead actors the split yields more than 2 parts, while for a movie without lead actors it yields at most 2 parts.

        if len(source.split(':')) > 2:      # more than 2 parts after splitting on ':' means the actor part is present
            director = source.split(':')[1].split('主演')[0].split(' ')[1].strip()    # director
            actor = source.split(':')[2].split('/')[0].split(' ')[1].strip()         # lead actors
        else:                               # 2 parts or fewer means there is no actor part
            director = source.split(':')[1].split('/')[0].split(' ')[1].strip()
            actor = ' '
        

        Note: split(' ')[1] separates the Chinese name from the English name and keeps the Chinese one.
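
        A quick worked example of this splitting logic, using a made-up string in the same shape as the Douban text (the real page text may differ slightly):

        # Hypothetical sample in the shape of the Douban "info" text; not copied from the live page.
        source = "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /..."
        parts = source.split(':')
        print(len(parts))                                          # 3, so the actor part is present
        print(parts[1].split('主演')[0].split(' ')[1].strip())     # 弗兰克·德拉邦特
        print(parts[2].split('/')[0].split(' ')[1].strip())        # 蒂姆·罗宾斯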
        Attachment: link to the full code
