
Data Collection and Fusion: Third Individual Assignment

Preface

This assignment is about applying multithreading to crawlers and writing a crawler with the scrapy framework. First, a small digression: back when I did the software engineering assignment I decided Python multithreading was pretty useless, because with the GIL in place multithreaded code runs about as fast as single-threaded code, and can even be slower. A multithreaded test on Wednesday overturned that view, and after reading up on why, I have dropped my prejudice against Python multithreading.
First, a quick look at IO-bound versus CPU-bound tasks:

  • IO-bound tasks
    • An IO-bound task is dominated by disk or network IO, with very little computation: requesting web pages, reading and writing files, and so on.
  • CPU-bound tasks
    • A CPU-bound task is dominated by CPU computation, keeping the CPU at full load: searching for an element in a huge list (not that you should), heavy arithmetic, and so on.

The short version: multithreading is for IO-bound tasks and multiprocessing is for CPU-bound tasks. The reason is plain operating-systems knowledge: switching between processes is expensive and each process carries its own memory, which makes sharing data awkward, but separate processes sidestep the GIL and can actually use multiple cores; threads share the data inside one process, so coordination is simple, but under the GIL only one thread runs Python bytecode at a time, so threads add no raw compute power.
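To make the difference concrete, here is a small timing sketch of my own (not part of the assignment code): the same four-worker thread pool barely helps a pure-Python loop because of the GIL, but it overlaps IO waits almost perfectly.

# Illustration only: threads against CPU-bound and IO-bound work under the GIL.
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n=2_000_000):
    # pure Python arithmetic holds the GIL, so threads cannot run it in parallel
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_task(delay=0.5):
    # time.sleep releases the GIL, so the waits overlap across threads
    time.sleep(delay)

def timed(label, func, jobs=4):
    start = time.time()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda _: func(), range(jobs)))
    print(label, round(time.time() - start, 2), "s")

timed("cpu-bound x4 in threads", cpu_task)   # roughly the same as running it 4 times in a row
timed("io-bound  x4 in threads", io_task)    # roughly the time of a single sleep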

As for the implementation, this lab again uses re and BeautifulSoup, since that is the combination I am comfortable with and the alternative holds little appeal for me, but XPath versions of Labs 1 and 2 have been added anyway.
Also, I do not find the scrapy framework very flexible to use: it is aimed at large crawling projects and speeds them up with asynchronous processing, while for small jobs plain requests plus BeautifulSoup gets results faster. After Friday's class I learned how to paginate in scrapy, and a scrapy version has been added to Lab 3.

Lab 1

  • This one is the experiment from the textbook: once the logic is clear, it is mostly a matter of typing it out. The key point is how the threads are hooked in: each thread runs the download function, i.e. the IO-bound part, which is exactly where threads have a natural advantage. The single-threaded run takes about 1.4 s and the multithreaded run about 0.6 s, and the gap grows as the amount of data to crawl increases. (On Wednesday, running over mobile data, the single-threaded version took just over 11 s while the multithreaded one needed a bit over 1 s, which makes the difference obvious.)

  • Results

  • Single-threaded code

from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import time

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38"}
url = "http://www.weather.com.cn"

def image_Spider(url):
    try:
        image_urls = []
        req = urllib.request.Request(url,headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data,["utf-8","gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data,"lxml")
        images = soup.select("img")
        for image in images:
            try:
                src = image["src"]
                image_url = urllib.request.urljoin(url,src)
                if image_url not in image_urls:
                    image_urls.append(image_url)
                    download(image_url)
            except Exception as e:
                print(e)
    except Exception as ee:
        print(ee)

def download(url):
    global count
    try:
        count += 1
        if (url[len(url)-4] == "."):
            ext = url[len(url)-4:]
        else:
            ext = ""

        req = urllib.request.Request(url,headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        fobj = open("./images/" + str(count) + ext,"wb")
        fobj.write(data)
        fobj.close()
    except Exception as eee:
        print(eee)

count = 0
start = time.time()
image_Spider(url)
end = time.time()
print("consuming_time : " + str(end-start))
  • Multithreaded code
from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import time
import threading

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38"}
url = "http://www.weather.com.cn"

def image_Spider(url):
    global threads
    global count
    try:
        image_urls = []
        req = urllib.request.Request(url,headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data,["utf-8","gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data,"lxml")
        images = soup.select("img")
        for image in images:
            try:
                src = image["src"]
                image_url = urllib.request.urljoin(url,src)
                if image_url not in image_urls:
                    count += 1
                    T = threading.Thread(target=download,args=(image_url,count))
                    T.daemon = False   # explicitly non-daemon; the threads are joined at the end of the script
                    T.start()
                    threads.append(T)
                    image_urls.append(image_url)
            except Exception as e:
                print(e)
    except Exception as ee:
        print(ee)

def download(url,count):
    try:
        if (url[len(url)-4] == "."):
            ext = url[len(url)-4:]
        else:
            ext = ""

        req = urllib.request.Request(url,headers=headers)
        data = urllib.request.urlopen(req,timeout=100)
        data = data.read()
        fobj = open("./images2/" + str(count) + ext,"wb")
        fobj.write(data)
        fobj.close()
    except Exception as eee:
        print(eee)

count = 0
threads = []
start = time.time()
image_Spider(url)
for t in threads:
    t.join()
end = time.time()
print("consuming_time : " + str(end - start))
  • Takeaways
    I now have a clear picture of which kinds of tasks Python threads and processes are suited to, and I learned to use Python multithreading for IO-bound work.
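    For reference, the thread bookkeeping above (start each thread, keep it in a list, join them all at the end) can also be written with concurrent.futures, which does the waiting for you. This is only a sketch that reuses the download(url, count) function from the multithreaded code; download_all is a name of my own, not something from the assignment.

from concurrent.futures import ThreadPoolExecutor

def download_all(image_urls):
    # image_urls: the de-duplicated list collected by image_Spider
    # leaving the `with` block waits for every submitted download,
    # which replaces the explicit threads list and the join() loop above
    with ThreadPoolExecutor(max_workers=8) as pool:
        for count, image_url in enumerate(image_urls, start=1):
            pool.submit(download, image_url, count)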

Lab 2

This lab reproduces the result of Lab 1 with the scrapy framework. The first things to understand are how scrapy works and how to run a spider written with it; the output is the same as above, so here is the code and how to use it. Since this lab scrapes very few fields, there is no need to customize the item or the pipeline at all.

  • Single-threaded code
import scrapy
from bs4 import UnicodeDammit
from bs4 import BeautifulSoup
import urllib.request
import time

class Scrapy_Weather_forecast(scrapy.Spider):
    name = "xrfSpider"

    def start_requests(self):
        self.url = "http://www.weather.com.cn"
        yield scrapy.Request(url=self.url,callback=self.parse)

    def parse(self, response, **kwargs):

        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38"}
        start = time.time()
        self.count = 0   # download() numbers the saved files with this counter
        self.image_Spider(self.url)
        end = time.time()
        print("consuming_time : " + str(end-start))

    def image_Spider(self,url):
        try:
            image_urls = []
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            images = soup.select("img")
            for image in images:
                try:
                    src = image["src"]
                    image_url = urllib.request.urljoin(url, src)
                    if image_url not in image_urls:
                        image_urls.append(image_url)
                        self.download(image_url)
                except Exception as e:
                    print(e)
        except Exception as ee:
            print(ee)

    def download(self,url):
        try:
            self.count += 1
            if (url[len(url) - 4] == "."):
                ext = url[len(url) - 4:]
            else:
                ext = ""

            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            fobj = open("./images/" + str(self.count) + ext, "wb")
            fobj.write(data)
            fobj.close()
        except Exception as eee:
            print(eee)
  • Multithreaded code
import scrapy
from bs4 import UnicodeDammit
from bs4 import BeautifulSoup
import urllib.request
import time
import threading

class Scrapy_Weather_forecast(scrapy.Spider):
    name = "xrfSpider"

    def start_requests(self):
        self.url = "http://www.weather.com.cn"
        yield scrapy.Request(url=self.url,callback=self.parse)

    def parse(self, response, **kwargs):

        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38"}
        start = time.time()
        url = "http://www.weather.com.cn"
        global threads
        threads = []
        count = 0
        self.image_Spider(url,headers,count)
        for t in threads:
            t.join()
        end = time.time()
        print("consuming_time : " + str(end-start))

    def image_Spider(self,url,headers,count):

        try:
            image_urls = []
            req = urllib.request.Request(url, headers=headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            images = soup.select("img")
            for image in images:
                try:
                    src = image["src"]
                    image_url = urllib.request.urljoin(url, src)
                    if image_url not in image_urls:
                        count += 1
                        T = threading.Thread(target=self.download, args=(image_url, count,headers))
                        T.daemon = False   # non-daemon; the join() loop in parse waits for these threads
                        T.start()
                        threads.append(T)
                        image_urls.append(image_url)
                except Exception as e:
                    print(e)
        except Exception as ee:
            print(ee)

    def download(self,url, count,headers):
        try:
            if (url[len(url) - 4] == "."):
                ext = url[len(url) - 4:]
            else:
                ext = ""

            req = urllib.request.Request(url, headers=headers)
            data = urllib.request.urlopen(req, timeout=100)
            data = data.read()
            fobj = open("./images2/" + str(count) + ext, "wb")
            fobj.write(data)
            fobj.close()
        except Exception as eee:
            print(eee)
  • Lab 2 supplement: fetching the image URLs with XPath and downloading them
from scrapy.selector import Selector

selector = Selector(text=data)
urls = selector.xpath("//img/@src").extract()   # every <img src> value on the page
for u in urls:
    if u not in image_urls:
        image_urls.append(u)
        self.download(u)   # relative srcs would need urljoin first, as in the versions above
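    As an aside, inside a scrapy callback the same extraction can be done directly on the response object, without building a Selector by hand. This is only a sketch that collects the URLs; they would then be handed to the download() method shown above.

def parse(self, response, **kwargs):
    # response.xpath / response.urljoin replace the hand-built Selector and urllib's urljoin
    image_urls = []
    for src in response.xpath("//img/@src").getall():
        image_url = response.urljoin(src)   # resolve relative srcs against the page URL
        if image_url not in image_urls:
            image_urls.append(image_url)
    self.log("found %d image urls" % len(image_urls))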
  • How to run
    First open cmd in the current directory and create a scrapy project with scrapy startproject [projectname], then go straight into the project's spiders folder, write the spider code there, and run it from cmd with scrapy runspider [pythonfilename].
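    Alternatively, a spider can be launched from a plain Python script with scrapy's CrawlerProcess instead of the command line. A minimal sketch; run.py and the weather_spider module name are my own placeholders, not files from the project.

# run.py: start the spider programmatically instead of calling scrapy runspider
from scrapy.crawler import CrawlerProcess

from weather_spider import Scrapy_Weather_forecast   # placeholder import path for the spider file

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(Scrapy_Weather_forecast)
process.start()   # blocks until the crawl finishes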

  • Takeaways
    Learned how to write a crawler with the scrapy framework and how to run one.

Lab 3

This lab uses the scrapy framework to redo the stock-information crawler. The approach is the same as last time, but this time I tried prettytable, which I had not used before, and it turns out to be really pleasant to use... (last time, to get neatly formatted output, I exported the results to a CSV file with pandas).
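In case it helps, this is about all there is to PrettyTable (a throwaway example of mine, not output from the assignment): set the header once, add rows, print.

from prettytable import PrettyTable

pt = PrettyTable()
pt.field_names = ["code", "name", "price"]   # column headers
pt.add_row(["00001", "demo-stock", 12.34])   # one call per row
print(pt)                                    # prints an aligned text table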

  • Results

  • items
import scrapy

class StockDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    idx = scrapy.Field()
    code = scrapy.Field()
    name = scrapy.Field()
    newest_price = scrapy.Field()
    up_down_extent = scrapy.Field()
    up_down_value = scrapy.Field()
    deal_volume = scrapy.Field()
    deal_value = scrapy.Field()
    freq = scrapy.Field()
    highest = scrapy.Field()
    lowest = scrapy.Field()
    opening = scrapy.Field()
    over = scrapy.Field()
    pass
  • pipeline
from prettytable import PrettyTable


class StockDemoPipeline:
    pt = PrettyTable()   # one shared table; rows are appended as items arrive
    pt.field_names = ["序号", "代码", "名称", "最新价", "涨跌幅", "涨跌额", "成交量", "成交额", "涨幅", "最高", "最低", "今开", "昨收"]
    def process_item(self, item, spider):
        self.pt.add_row([item["idx"], item["code"], item["name"], item["newest_price"], item["up_down_extent"],
                    item["up_down_value"], item["deal_volume"], item["deal_value"], item["freq"], item["highest"],
                    item["lowest"], item["opening"], item["over"]])
        if (item["idx"] == 100):   # 5 pages x 20 rows per page, so print the full table once the last item arrives
            print(self.pt)
        return item

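    One thing the listing does not show: the pipeline only runs if it is enabled in the project's settings.py, roughly as below. The module path depends on the project name chosen with scrapy startproject, so stock_demo here is just a placeholder.

# settings.py: enable the pipeline ("stock_demo" is a placeholder for the real project package)
ITEM_PIPELINES = {
    "stock_demo.pipelines.StockDemoPipeline": 300,   # the number is the pipeline's priority
}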
  • spider
import re
import scrapy
import sys, os
sys.path.append(os.path.dirname(os.path.dirname(__file__)))   # make items.py importable when launched via runspider
from items import StockDemoItem

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 Edg/85.0.564.63"}


class Scrapy_Stock(scrapy.Spider):
    name = "xrfSpider"
    s = 1
    e = 5
    def start_requests(self):
        url = ("http://31.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240033772650735816256_1601427948453"
               "&pn={:d}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3"
               "&fs=m:128+t:3,m:128+t:4,m:128+t:1,m:128+t:2"
               "&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f23,f24,f25,f26,f22,f33,f11,f62,f128,f136,f115,f152"
               "&_=1601427948454").format(1)   # pn is the page number; start from page 1

        yield scrapy.Request(url=url,callback=self.parse)

    def parse(self, response):
        # print("序号 代码   名称      最新价    涨跌幅    涨跌额    成交量(股)    成交额(港元)    涨幅    最高    最低    今开    昨收")
        try:
            html = response.text
            # print(html)
            res = re.findall(re.compile(r"\[(.+)\]"), html)[0]   # the JSON array inside the JSONP payload
            res = res.split("},")   # one string per stock; the closing braces are restored below

            for idx, info in enumerate(res):
                if idx != len(res) - 1:   # every piece except the last lost its closing brace in the split
                    info = info + "}"     # restore it so each piece is a complete dict literal
                info = eval(info)         # evaluate the dict literal into a Python dict
                item = StockDemoItem()
                item["idx"] = idx + 1 + 20 * (self.s - 1)
                item["code"] = info["f12"]
                item["name"] = info["f14"]
                item["newest_price"] = info["f2"]
                item["up_down_extent"] = info["f3"]
                item["up_down_value"] = info["f4"]
                item["deal_volume"] = info["f5"]
                item["deal_value"] = info["f6"]
                item["freq"] = info["f7"]
                item["highest"] = info['f15']
                item["lowest"] = info['f16']
                item["opening"] = info['f17']
                item["over"] = info['f18']
                yield item

            if (self.s != self.e):
                self.s += 1
                new_url = ("http://31.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240033772650735816256_1601427948453"
                           "&pn={:d}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3"
                           "&fs=m:128+t:3,m:128+t:4,m:128+t:1,m:128+t:2"
                           "&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f23,f24,f25,f26,f22,f33,f11,f62,f128,f136,f115,f152"
                           "&_=1601427948454").format(self.s)
                yield scrapy.Request(url=new_url, callback=self.parse)   # the same callback parses the next page


        except Exception as e:
            print(e)

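    A side note on the parsing in parse(): the response is JSONP, i.e. a jQuery callback wrapped around JSON, so instead of splitting on "}," and eval-ing each piece, the payload can be decoded in one go. This is only a sketch of the idea, and the data["data"]["diff"] path is my reading of the response layout, so check it against an actual response first.

import json

def extract_stock_rows(jsonp_text):
    # keep only what sits inside jQuery...( ... ); then parse it as JSON
    payload = jsonp_text[jsonp_text.index("(") + 1 : jsonp_text.rindex(")")]
    data = json.loads(payload)
    # assumption: the array that the regex in parse() captures lives under data["diff"]
    return data["data"]["diff"]   # a list of dicts with the f2/f3/... fields used above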
  • Pagination (updated after Friday's class)
if (self.s != self.e):
    self.s += 1   # advance to the next page number
    new_url = ("http://31.push2.eastmoney.com/api/qt/clist/get?cb=jQuery11240033772650735816256_1601427948453"
               "&pn={:d}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3"
               "&fs=m:128+t:3,m:128+t:4,m:128+t:1,m:128+t:2"
               "&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f23,f24,f25,f26,f22,f33,f11,f62,f128,f136,f115,f152"
               "&_=1601427948454").format(self.s)
    yield scrapy.Request(url=new_url, callback=self.parse)   # scrapy feeds the new page back into parse()
  • Takeaways
    Much the same as the previous lab, except that I also picked up how to use prettytable.
posted @ 2020-10-15 21:45  King_James23