
Background:

I previously used the pyspider framework. I had solved all of its installation problems and even saved a cloud-server image with Python 3.6 + pyspider configured. To add a global proxy I modified the fetcher code, and to control task order I added scheduled tasks to the queue one by one in index_page, but I still couldn't get reliable control over execution order.

So I came over to take a look at Scrapy.

It turns out to be quite flexible.

1. Installation: pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple/ goes through without any problems. Run scrapy in cmd to verify the install succeeded.

2. Create the project: just cd to the target directory and run scrapy startproject tutorial to create the tutorial project, following https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
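For reference, startproject lays out roughly this structure (as described in the linked tutorial; newer Scrapy versions also add a middlewares.py):

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py       # ROBOTSTXT_OBEY, LOG_LEVEL etc. go here (step 8)
        spiders/          # spider files, e.g. dmoz_spider.py (step 3)
            __init__.py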

3. Then add the first spider file, dmoz_spider.py, under the tutorial/spiders directory:

import scrapy
from pyquery import PyQuery as pq


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Left over from the dmoz tutorial; note that the offsite middleware
    # may filter followed requests to domains outside this list.
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "https://www.baidu.com",
    ]

    def start_requests(self):
        urls = [
            "https://www.baidu.com",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Parse the raw bytes with pyquery instead of Scrapy's selectors.
        doc = pq(response.body)
        # '#u1 a' targets the nav links on the Baidu homepage.
        for i in doc('#u1 a').items():
            print(i.text())
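To try it out, run scrapy crawl dmoz from inside the project directory (steps 5 and 6 below set up running and debugging from VS Code instead).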

4. I skipped the default XPath selectors and stuck with the familiar pyquery:

from pyquery import PyQuery as pq

and then, inside parse, doc = pq(response.body) is all it takes.
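For comparison, the same extraction with Scrapy's built-in CSS selectors (no pyquery dependency) would look roughly like this sketch, using the standard response.css API:

    def parse(self, response):
        # Equivalent of the pyquery version: grab the text of the nav links.
        # (.getall() is .extract() in older Scrapy versions.)
        for text in response.css('#u1 a::text').getall():
            print(text)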

5. Add a run.py in VS Code so that Ctrl+Alt+N can run the Scrapy project; otherwise you have to start it from cmd each time with scrapy crawl dmoz. A minimal run.py is sketched below.
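The post doesn't show run.py itself; a minimal sketch using Scrapy's scrapy.cmdline.execute helper (placed in the project root, next to scrapy.cfg) could be:

# run.py -- lets an editor shortcut start the crawl instead of the cmd line
from scrapy.cmdline import execute

# Equivalent to running "scrapy crawl dmoz" in the project directory.
execute(['scrapy', 'crawl', 'dmoz'])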

6. For debugging, add a launch.json and launch the debugger from it; the key entries are "module", "cwd", and "args":

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Module",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${workspaceRoot}/tutorial",
            "args": [
                "crawl",
                "dmoz"
            ]
        }
    ]
}
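This works because "module": "scrapy" makes VS Code run python -m scrapy, while "cwd" points it at the project folder, so the debugger executes the equivalent of scrapy crawl dmoz from the right directory and breakpoints in the spider are hit.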

7. Proxy control. Scrapy's HttpProxyMiddleware picks the proxy up from the request's meta['proxy'], so each request can carry its own proxy:

import scrapy
from pyquery import PyQuery as pq
from tutorial.spiders.ProxyHelperPy import ProxyHelperClass


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Left over from the dmoz tutorial; the offsite middleware may filter
    # followed requests to domains outside this list.
    allowed_domains = ["dmoz.org"]
    # start_urls = [
    #     "https://www.baidu.com",
    # ]
    proxyObj = None

    def start_requests(self):
        urls = [
            "https://www.baidu.com",
        ]
        if self.proxyObj is None:
            self.proxyObj = ProxyHelperClass()

        for url in urls:
            proxy = self.get_proxy()
            if proxy != '':
                # HttpProxyMiddleware reads the proxy from meta['proxy'].
                yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': proxy})
            else:
                # No proxy available: fall back to a direct request.
                yield scrapy.Request(url=url, callback=self.parse)

    def get_proxy(self):
        """Return a proxy URL like 'http://ip:port', or '' on failure."""
        proxies = ''
        try:
            # get_proxy_ip() is my own helper: it returns the string
            # 'error' on failure, otherwise an (ip, port) pair.
            ret = self.proxyObj.get_proxy_ip()
            if ret == 'error':
                print('failed to fetch a proxy IP')
                return proxies
            ip = '{}:{}'.format(ret[0], ret[1])
            proxies = 'http://' + ip
            print(ret, proxies)
        except Exception as e:
            print('proxy IP fetch failed ---------------')
            print(e)
        return proxies

    def parse(self, response):
        doc = pq(response.body)
        for i in doc('#u1 a').items():
            print(i.text())
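ProxyHelperPy itself isn't shown here; a hypothetical minimal ProxyHelperClass matching the interface the spider expects (a get_proxy_ip() returning 'error' or an (ip, port) pair) might look like this — the proxy-pool URL is a placeholder, not a real service:

# ProxyHelperPy.py -- hypothetical sketch; the real helper is not in the post
import requests


class ProxyHelperClass:
    # Placeholder endpoint of a proxy-pool service; substitute your own.
    POOL_URL = 'http://127.0.0.1:5010/get'

    def get_proxy_ip(self):
        """Return (ip, port) on success, or the string 'error' on failure."""
        try:
            data = requests.get(self.POOL_URL, timeout=5).json()
            return data['ip'], data['port']
        except Exception:
            return 'error'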

8. To quiet all those console messages and disable the robots.txt check, in settings.py:

ROBOTSTXT_OBEY = False   # don't fetch or honor robots.txt
LOG_LEVEL = 'ERROR'      # only show errors in the console
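As a side note, the same two options can also be scoped to a single spider via Scrapy's standard custom_settings class attribute, instead of the project-wide settings.py:

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Per-spider overrides; these take precedence over settings.py.
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
        'LOG_LEVEL': 'ERROR',
    }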