CSV
Scraping CSV-formatted data works much the same way as scraping XML.
We will use the following table:
| name | sex | addr | email |
| ----- | ---- | ----------- | ------------------ |
| Alex | Boy | Los Angeles | alex@hotstone.com |
| Coy | Girl | Los Angeles | coy@hotstone.com |
| Couch | Boy | California | couch@hotstone.com |
| Tom | Girl | New York | tom@hotstone.com |
Create a project:

```shell
$ scrapy startproject mycsv
```
Generate a spider from the CSV template:

```shell
$ cd mycsv
$ scrapy genspider -t csvfeed mycsvspider localhost
```
Edit the items code:

```python
import scrapy


class MycsvItem(scrapy.Item):
    name = scrapy.Field()
    sex = scrapy.Field()
```
Edit the spider file:

```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider
from mycsv.items import MycsvItem


class MycsvspiderSpider(CSVFeedSpider):
    name = 'mycsvspider'
    allowed_domains = ['localhost']
    # feed.csv is served by the local HTTP server started below
    start_urls = ['http://localhost/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Define the column headers
    headers = ['name', 'sex', 'addr', 'email']
    # Define the field delimiter
    delimiter = ','

    # Do any adaptations you need here
    #def adapt_response(self, response):
    #    return response

    def parse_row(self, response, row):
        i = MycsvItem()
        #i['url'] = row['url']
        #i['name'] = row['name']
        #i['description'] = row['description']
        i['name'] = row['name'].encode()
        i['sex'] = row['sex'].encode()
        print("Name:")
        print(i['name'])
        print("Sex:")
        print(i['sex'])
        print("---------------------------")
        return i
```
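Under the hood, `CSVFeedSpider` splits the response body on `delimiter`, zips each line with `headers`, and hands the resulting dict to `parse_row()` one row at a time. A minimal plain-Python sketch of that row-to-dict step, using the standard `csv` module rather than Scrapy itself (the `feed` string is a hypothetical stand-in for the response body):

```python
import csv
import io

# Hypothetical feed content matching the table above.
feed = (
    "Alex,Boy,Los Angeles,alex@hotstone.com\n"
    "Coy,Girl,Los Angeles,coy@hotstone.com\n"
)

# Same headers and delimiter as the spider defines.
headers = ['name', 'sex', 'addr', 'email']
rows = list(csv.DictReader(io.StringIO(feed), fieldnames=headers, delimiter=','))

# Each row is a dict keyed by the headers, just like the `row`
# argument that parse_row() receives.
print(rows[0]['name'])
print(rows[0]['email'])
```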
Save the CSV file in the project directory as feed.csv, with its contents separated by commas.
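Based on the table above, feed.csv would look like this (data rows only, since the column names are already supplied by the spider's `headers` attribute):

```csv
Alex,Boy,Los Angeles,alex@hotstone.com
Coy,Girl,Los Angeles,coy@hotstone.com
Couch,Boy,California,couch@hotstone.com
Tom,Girl,New York,tom@hotstone.com
```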
Use Docker to start a local HTTP server, whose only job is to serve the csv file:

```shell
$ cd mycsv
$ docker run -d -w /data -p 80:8080 -v ${PWD}:/data slzcc/java-webserver:jenkins-java-webserver-14 java -jar /usr/src/app/app.jar 8080
```
Once the server is up, verify that the file is reachable:
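A quick check can be done with `curl http://localhost/feed.csv`, or with a small Python helper like this sketch (the URL assumes the Docker server above is running on port 80):

```python
from urllib.request import urlopen
from urllib.error import URLError


def is_reachable(url, timeout=3):
    """Return True if an HTTP GET of `url` succeeds with status 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.getcode() == 200
    except (URLError, OSError):
        return False


print(is_reachable("http://localhost/feed.csv"))
```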
Create a main.py file:

```python
from scrapy import cmdline

# Equivalent to running `scrapy crawl mycsvspider` on the command line
cmdline.execute("scrapy crawl mycsvspider".split())
```
The output is as follows:


