python 爬虫 重复下载 二次请求

在写爬虫的时候,难免会遇到报错,比如 4XX ,5XX,有些可能是网络的原因,或者一些其他的原因,这个时候我们希望程序去做第二次下载,

有一种很low的解决方案,比如是用  try  except

  

try:
    -------
except:
    try:
        --------
    except:
        try:
            ------
        except:
            try:
                ------
            except:
                try:
                    ------
                except:
                    try:
                        ------
                    except:
                        ------

 

有没有看起来更舒服的写法呢?

我们可以用递归实现这个过程

代码如下

 

request_urls = [
"https://www.baidu.com/",
"https://www.baidu.com/",
"https://www.baidu.com/",
"https://www.ba111111idu.com/",
"https://www.baidu.com/",
"https://www.baidu.com/",
]

def down_load(url,request_max=3):
    print "正在请求的URL是:",url
    result_html = ""
    result_status_code = ""
    try:
        result = session.get(url=url)
        result_html = result.content
        result_status_code = result.status_code
        print result_status_code
    except Exception as e:
        print e
        if request_max >0:
            if result_status_code != 200:
                return down_load(url,request_max-1)
    return result_html

for url in request_urls:
    down_load(url=url,request_max=13)

 输出结果:

C:\Python27\python.exe C:/Users/xuchunlin/PycharmProjects/A9_25/auction/test.py
正在请求的URL是: https://www.baidu.com/
200
正在请求的URL是: https://www.baidu.com/
200
正在请求的URL是: https://www.baidu.com/
200
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6208>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6438>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA65F8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6828>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6A90>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA62E8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6D30>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003AA6DD8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B682B0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68080>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B685C0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B687F0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68A20>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.ba111111idu.com/
HTTPSConnectionPool(host='www.ba111111idu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000000003B68C50>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))
正在请求的URL是: https://www.baidu.com/
200
正在请求的URL是: https://www.baidu.com/
200

Process finished with exit code 0

 

posted @ 2018-03-14 10:50  淋哥  阅读(993)  评论(0编辑  收藏  举报