A Simple Python Crawler

That said, it's barely practical; it merely works.

Known bugs:

Because of the GFW, it frequently blows up: the unstable connection causes all sorts of Connection Aborted / SSLError: EOF occurred in violation of protocol errors.

This introduced a new bug: errors weren't being logged, argh!

Solutions:

Fixed: added exception handling. Each request retries up to three times with the timeout set to 1 s; after three timeouts the crawler logs the error and moves on to the next page.
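
For reference, a minimal sketch of that retry-then-log strategy written as a flat loop (a hypothetical helper; the actual scripts below use nested try/except blocks instead):

import requests

def fetch_with_retry(sess, url, headers, retries=3, timeout=1):
    # try the request up to `retries` times; hand the caller None on failure
    for _ in range(retries):
        try:
            return sess.get(url, headers=headers, timeout=timeout)
        except requests.exceptions.RequestException:
            continue
    return None  # caller logs the ID and moves on to the next page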

Fixed: tweaked the code.

Crawling speed has improved a lot [about 3 pv/s per thread].

The bug has been fixed.

Hastily written + learned on the fly.

Acknowledgements: Baiduspider, thanks for its selfless contribution (Anti-Anti-Spider Technology: the scripts borrow Baiduspider's User-Agent to slip past anti-crawling checks).

Results (speed is not very stable, fluctuating between roughly 1 s/pv and 10 s/pv):

(This has since changed considerably.)

Multi-process script generator:

import sys
reload(sys)
sys.setdefaultencoding('utf8')  # Python 2: let plain open() write the UTF-8 template
for i in range(1,7000):
    # one standalone script per 10k-ID block: 10k.py, 20k.py, ..., 69990k.py
    f = open(str(i*10)+'k'+'.py','w')
    f.write('''# coding:utf-8
import re
import requests
import time
headers = {
'User-Agent':'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
}
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(max_retries = 20)

sess.mount('https://', adapter)
for i in range ('''+str(10000*i)+''','''+str((i+1)*10000)+'''):
    print(str(i))
    url = "https://www.pixiv.net/member_illust.php?mode=medium&illust_id="+str(i)
    try:
        r = sess.get(url,headers = headers,timeout = 1)
    except:
        try:
            r = sess.get(url,headers = headers,timeout = 1)
        except:
            try:
                r = sess.get(url,headers = headers,timeout = 1)
            except:
                err = open('err'+str(i)+'.log',"a")
                err.write(str(i)+"\\n")
                err.close()
                continue
    if r.status_code != 200:
        continue
    data = r.text
    pattern = u'class="text">初音ミク</a>'
    piclist = re.findall(pattern,data)
    if len(piclist):
        f = open("'''+str(i*10)+'k-'+str((i+1)*10)+'k'+'''.txt","a")
        f.write(str(i)+'\\n')
        f.close()''')
    f.close()
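
The generator emits one standalone script per 10k-ID block. A hypothetical launcher for a batch of them (the post itself drives the scripts from a bash wrapper, see the UPDATE below):

# Run the first ten generated scripts in parallel and wait for them all.
import subprocess
procs = [subprocess.Popen(['python', str(i*10)+'k.py']) for i in range(1, 11)]
for p in procs:
    p.wait()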

A generated instance:

# coding:utf-8
import re
import requests
import time
headers = {
'User-Agent':'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
}
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(max_retries = 20)

sess.mount('https://', adapter)
for i in range (69990000,70000000):
    print(str(i))
    url = "https://www.pixiv.net/member_illust.php?mode=medium&illust_id="+str(i)
    try:
        r = sess.get(url,headers = headers,timeout = 1)
    except:
        try:
            r = sess.get(url,headers = headers,timeout = 1)
        except:
            try:
                r = sess.get(url,headers = headers,timeout = 1)
            except:
                err = open('err'+str(i)+'.log',"a")
                err.write(str(i)+"\n")
                err.close()
                continue
    if r.status_code != 200:
        continue
    data = r.text
    pattern = u'class="text">初音ミク</a>'
    piclist = re.findall(pattern,data)
    if len(piclist):
        f = open("69990k-70000k.txt","a")
        f.write(str(i)+'\n')
        f.close()

UPDATE: After getting home I updated to the multi-process version; speed is about 250k pv/h. Combined with a bash script, it runs as an uninterrupted crawler. Still slow, though (shrug). The code has been heavily refactored and some information filled in (the run should finish once I'm home, heh). Stability has improved (10M pv with zero errors)???
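
The multi-process code itself isn't included in the post; here is a minimal sketch of the same idea using multiprocessing.Pool, assuming the same Baiduspider User-Agent trick and per-ID check as the scripts above (the names check_id and hits.txt are hypothetical):

# coding:utf-8
import re
import requests
from multiprocessing import Pool

headers = {
'User-Agent':'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
}

def check_id(i):
    # fetch one illust page; return the ID if the tag matches, else None
    url = "https://www.pixiv.net/member_illust.php?mode=medium&illust_id="+str(i)
    try:
        r = requests.get(url, headers=headers, timeout=1)
    except requests.exceptions.RequestException:
        return None
    if r.status_code == 200 and re.search(u'class="text">初音ミク</a>', r.text):
        return i
    return None

if __name__ == '__main__':
    pool = Pool(processes=100)  # the post found ~100 workers to be enough
    out = open('hits.txt', 'a')
    for hit in pool.imap_unordered(check_id, xrange(10000, 70000000)):
        if hit is not None:
            out.write(str(hit)+'\n')
    out.close()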

System requirements:

A mainstream Linux distribution

8 GB RAM (main process: ~2 GB virtual, ~2 GB resident; unsure about the worker processes, but it runs; total memory use is around 4.4 GB)

Quad-core CPU (utilization around 170%-200%+)

40 Mbps network (international link; throughput is not very stable, roughly 10-30 Mbps)

------------------------------------------------------

150 processes (though in practice 100 turned out to be plenty)???

New bug introduced: with a single file spanning the 70M-ID scale, the GC blows up. orz.
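
A guess at the culprit (an assumption; the post doesn't say): in Python 2, range() materializes the whole list up front, so a single file spanning all 70M IDs allocates tens of millions of int objects for the GC to track, whereas xrange() yields them lazily:

# Python 2: range() builds the entire list in memory at once;
# xrange() generates the values one at a time and stays O(1) in memory.
ids = range(0, 70000000)     # ~70M int objects allocated up front
ids = xrange(0, 70000000)    # lazy; safe at this scale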

posted @ 2017-05-17 20:48 baka