Simple Crawler, Part 2: Using Multiple Processes to Crawl the Titles, Authors, and Other Basic Information of Novels from 60,000+ Pages of Qidian (起点中文网) and Save Them to CSV
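Environment and libraries: Python 3.6 + requests + bs4 + csv + multiprocessing
Library notes
- requests: sends HTTP requests to the server, as a browser would
- bs4: parses the page and locates the specific content we need
- csv: writes the crawled content to a CSV file
- multiprocessing: runs the crawl in multiple processes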
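1. Prepare the URLs
Qidian (起点中文网) listing URL: https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=2

Changing the last number in the URL changes the page, and the home page shows that there are 61732 pages in total. The following statement builds a list of all page links for the worker processes to use later. The end value of range is exclusive, so it is set to 61733 to include page 61732:

    urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(k) for k in range(1, 61733)]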
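As a small sketch of the URL construction described above: `range` excludes its end value, so covering pages 1 through N needs `range(1, N + 1)`. The names `BASE_URL` and `build_urls` are illustrative and do not appear in the original code.

```python
# Listing URL without the page number (copied from the URL shown above)
BASE_URL = ('https://www.qidian.com/all?orderId=&style=1&pageSize=20'
            '&siteid=1&pubflag=0&hiddenField=0&page=')


def build_urls(last_page):
    """Return the listing URLs for pages 1..last_page, inclusive."""
    # range() stops before its end value, hence last_page + 1
    return [BASE_URL + str(k) for k in range(1, last_page + 1)]


urls = build_urls(61732)
print(len(urls))   # number of pages covered
print(urls[-1])    # URL of the last page
```

Keeping the page count in one argument makes it harder for the number in the prose and the number in the code to drift apart.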

2. Fetch the page with requests and parse it with bs4

    html = requests.get(url, headers=headers)
    selector = BeautifulSoup(html.text, 'lxml')
    # Common prefix shared by every selector: the info block of each book in the list
    base = ('body > div.wrap > div.all-pro-wrap.box-center.cf > '
            'div.main-content-wrap.fl > div.all-book-list > div > ul > li > '
            'div.book-mid-info')
    names = selector.select(base + ' > h4 > a')
    writers = selector.select(base + ' > p.author > a.name')
    sign1s = selector.select(base + ' > p.author > a:nth-child(4)')
    sign2s = selector.select(base + ' > p.author > a.go-sub-type')
    types = selector.select(base + ' > p.author > span')
    traductions = selector.select(base + ' > p.intro')
    words = selector.select(base + ' > p.update > span > span')

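One possible alternative to collecting seven parallel lists, shown here only as a sketch: walk over each book's block and look up the fields relative to it, so that a missing field becomes an empty string instead of shifting later entries out of alignment. The HTML below is a made-up fragment that merely borrows class names from the selectors above; it is not Qidian's real markup. `parse_books` and `text_or_empty` are hypothetical helper names, and the built-in `html.parser` is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample markup (an assumption, not the real page)
SAMPLE_HTML = """
<ul>
  <li><div class="book-mid-info">
    <h4><a>Book One</a></h4>
    <p class="author"><a class="name">Author A</a><span>serial</span></p>
    <p class="intro">First intro.</p>
  </div></li>
  <li><div class="book-mid-info">
    <h4><a>Book Two</a></h4>
    <p class="author"><a class="name">Author B</a></p>
    <p class="intro">Second intro.</p>
  </div></li>
</ul>
"""


def text_or_empty(node, css):
    """Return the stripped text of the first match under node, or ''."""
    found = node.select_one(css)
    return found.get_text(strip=True) if found else ''


def parse_books(html):
    """Return one (title, author, status, intro) tuple per book block."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for item in soup.select('li div.book-mid-info'):
        books.append((
            text_or_empty(item, 'h4 > a'),
            text_or_empty(item, 'p.author > a.name'),
            text_or_empty(item, 'p.author > span'),
            text_or_empty(item, 'p.intro'),
        ))
    return books


for book in parse_books(SAMPLE_HTML):
    print(book)
```

In the sample, the second book has no status `span`, and its tuple simply carries an empty string in that position.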
3. Store the information in a CSV file

    head = ['title', 'author', 'sign1', 'sign2', 'type', 'traduction', 'words']
    # newline='' leaves line endings to the csv module; utf-8 keeps the Chinese text intact
    f = open('_06_qidian.csv', 'a+', newline='', encoding='utf-8')
    csv_writer = csv.writer(f)
    csv_writer.writerow(head)
    # One row per book on the page
    for info in range(len(names)):
        csv_writer.writerow((
            names[info].get_text(), writers[info].get_text(),
            sign1s[info].get_text(), sign2s[info].get_text(),
            types[info].get_text(), traductions[info].get_text(),
            words[info].get_text(),
        ))

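The CSV step can be tried on its own with a minimal sketch like the following. The rows are made-up placeholders rather than real crawl results, and `write_rows` is a hypothetical helper name. The file is opened with `newline=''`, as the csv module expects, and read back to check that a comma or a line break inside an intro survives the round trip.

```python
import csv
import os
import tempfile

HEAD = ['title', 'author', 'sign1', 'sign2', 'type', 'traduction', 'words']


def write_rows(path, rows):
    """Write the header once, then all rows, to a fresh CSV file."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(HEAD)
        writer.writerows(rows)


# Placeholder data (an assumption for illustration only)
sample_rows = [
    ('Title A', 'Author A', 'sign1', 'sign2', 'serial', 'An intro, with a comma', '12.3'),
    ('Title B', 'Author B', 'sign1', 'sign2', 'finished', 'Two\nlines', '45.6'),
]

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
write_rows(demo_path, sample_rows)

# Read the file back; csv.reader undoes the quoting added by csv.writer
with open(demo_path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))

print(len(read_back))
```

The `with` block closes the file even if writing fails, which avoids losing buffered rows.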
4. Finally, run the crawl at full speed with multiple processes. The number of processes used here equals the number of CPU cores.

    pool = Pool(processes=multiprocessing.cpu_count())
    pool.map(get_info, urls)
    pool.close()
    pool.join()

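`Pool.map` returns the workers' return values in the same order as the inputs, so a worker can hand its result back to the parent process. The sketch below illustrates this without any network access: `page_number` is a stand-in worker and `run_pool` a hypothetical helper, and the example.com URLs are placeholders. The `if __name__ == '__main__':` guard matters on platforms that start workers by launching a fresh interpreter.

```python
import multiprocessing
from multiprocessing import Pool


def page_number(url):
    """Stand-in worker: return the page number at the end of the URL."""
    return int(url.rsplit('=', 1)[1])


def run_pool(worker, items, processes=None):
    """Apply worker to every item in a process pool; results keep input order."""
    processes = processes or multiprocessing.cpu_count()
    with Pool(processes=processes) as pool:
        # map() blocks until every item has been processed
        return pool.map(worker, items)


if __name__ == '__main__':
    # Placeholder URLs (an assumption, not the real site)
    demo_urls = ['https://example.com/all?page=' + str(k) for k in range(1, 21)]
    pages = run_pool(page_number, demo_urls, processes=4)
    print(pages[0], pages[-1], len(pages))
```

Collecting results in the parent this way means only one process ever needs to touch the output file.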
5. Complete code

    import csv
    import multiprocessing
    from multiprocessing import Pool

    import requests
    from bs4 import BeautifulSoup

    # Defined at module level so that every worker process can see them
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }
    # Common prefix shared by every selector: the info block of each book in the list
    base = ('body > div.wrap > div.all-pro-wrap.box-center.cf > '
            'div.main-content-wrap.fl > div.all-book-list > div > ul > li > '
            'div.book-mid-info')


    def get_info(url):
        """Fetch one listing page and return one tuple per book."""
        print(url)
        html = requests.get(url, headers=headers)
        selector = BeautifulSoup(html.text, 'lxml')
        names = selector.select(base + ' > h4 > a')
        writers = selector.select(base + ' > p.author > a.name')
        sign1s = selector.select(base + ' > p.author > a:nth-child(4)')
        sign2s = selector.select(base + ' > p.author > a.go-sub-type')
        types = selector.select(base + ' > p.author > span')
        traductions = selector.select(base + ' > p.intro')
        words = selector.select(base + ' > p.update > span > span')
        return [
            (names[info].get_text(), writers[info].get_text(),
             sign1s[info].get_text(), sign2s[info].get_text(),
             types[info].get_text(), traductions[info].get_text(),
             words[info].get_text())
            for info in range(len(names))
        ]


    if __name__ == '__main__':
        head = ['title', 'author', 'sign1', 'sign2', 'type', 'traduction', 'words']
        urls = ['https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(k) for k in range(1, 61733)]

        pool = Pool(processes=multiprocessing.cpu_count())
        # Workers only return rows; the parent process does all the writing,
        # so output from different processes cannot interleave in the file
        pages = pool.map(get_info, urls)
        pool.close()
        pool.join()

        f = open('_06_qidian.csv', 'a+', newline='', encoding='utf-8')
        csv_writer = csv.writer(f)
        csv_writer.writerow(head)
        for rows in pages:
            csv_writer.writerows(rows)
        f.close()
