
use proxy in spider

http://love-python.blogspot.com/2008/03/use-proxy-in-your-spider.html

life is short - you need Python!

Using a proxy, you can minimize the chance of your crawlers/spiders getting blocked. Let me show you how to use proxy IP addresses in your Python spider. First, load the list from a file:

fileproxylist = open('proxylist.txt', 'r')
# strip trailing newlines, or the proxy addresses will be invalid
proxyList = [line.strip() for line in fileproxylist.readlines()]
fileproxylist.close()
indexproxy = 0
totalproxy = len(proxyList)
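The indexproxy / totalproxy counters hint at round-robin rotation over the list; here is a minimal sketch of that idea (the proxy addresses below are made-up placeholders):

```python
# Round-robin over a proxy list loaded as above (placeholder addresses).
proxyList = ['10.0.0.1:8080', '10.0.0.2:3128', '10.0.0.3:8000']
indexproxy = 0
totalproxy = len(proxyList)

def next_proxy():
    """Return the next proxy in the list, wrapping around at the end."""
    global indexproxy
    pip = proxyList[indexproxy % totalproxy]
    indexproxy += 1
    return pip
```

Each request then calls next_proxy() to pick its proxy, so no single address carries all the traffic.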

Now, for each proxy in the list, call the following function:

import urllib2

def get_source_html_proxy(url, pip):
    # pip is the proxy address, e.g. '1.2.3.4:8080'
    proxy_handler = urllib2.ProxyHandler({'http': pip})
    opener = urllib2.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)  # installs this proxy globally for urllib2
    req = urllib2.Request(url)
    sock = urllib2.urlopen(req)
    data = sock.read()
    sock.close()
    return data
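Note that urllib2 exists only in Python 2. In Python 3 the same pattern lives in urllib.request; a sketch of the equivalent function (using a per-request opener instead of install_opener, which avoids the global side effect):

```python
import urllib.request

def get_source_html_proxy(url, pip):
    """Fetch url through the HTTP proxy at pip, e.g. '1.2.3.4:8080'."""
    proxy_handler = urllib.request.ProxyHandler({'http': pip})
    opener = urllib.request.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    # opener.open routes only this request through the proxy
    return opener.open(url).read()
```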

Hope your spidering experience will be better with proxies :-)

Let me know about any alternate idea.
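As for alternate ideas: one refinement (my own sketch, not part of the original post) is to fail over to the next proxy when a request errors out. The fetch argument stands in for any downloader, such as the get_source_html_proxy function above:

```python
def fetch_with_failover(url, proxies, fetch):
    """Try each proxy in turn; return the first successful response.

    fetch(url, pip) is any function that raises on failure
    (e.g. with IOError / URLError for a dead proxy).
    """
    last_error = None
    for pip in proxies:
        try:
            return fetch(url, pip)
        except Exception as e:
            last_error = e  # remember the failure, move on to the next proxy
    raise last_error  # every proxy failed
```

Dead proxies are common in free lists, so a loop like this keeps the spider running instead of crashing on the first bad address.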

 

posted on 2011-11-18 20:21 by lexus