brady-wang - 博客园

November 30, 2016

Summary: # -*- coding: utf-8 -*- from bs4 import BeautifulSoup import re import os import urllib2 import urllib def download_img(urls,k): #urls = "http://tieba.baidu.com/p/4807867791" page = urllib2... Read more

posted @ 2016-11-30 15:01 brady-wang Views (442) Comments (0) Recommendations (0)

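Getting a file name in Python

Summary: This article shows how to extract a file name from a URL in Python. For example, given the address http://www.jb51.net/images/logo.gif, extracting logo.gif takes a single line of code: import os url = 'http://www.jb... Read more

posted @ 2016-11-30 14:06 brady-wang Views (858) Comments (0) Recommendations (0)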
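The excerpt above is cut off before the one-liner itself, so the following is only a minimal sketch of one common way to do it, assuming Python 3 and `os.path.basename`. The helper `filename_from_url` is a hypothetical addition, not taken from the post; it strips any query string or fragment first.

```python
import os
from urllib.parse import urlparse

url = "http://www.jb51.net/images/logo.gif"

# Treat the URL like a slash-separated path and keep the last component.
filename = os.path.basename(url)
print(filename)


def filename_from_url(url):
    """Hypothetical helper: ignore ?query and #fragment parts of the URL."""
    return os.path.basename(urlparse(url).path)


print(filename_from_url(url + "?v=2#top"))
```

The plain one-liner is enough for simple addresses; parsing the URL first is the safer choice when the address may carry parameters.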

November 29, 2016

Image crawler

Summary: # -*- coding: utf-8 -*- import re import urllib import os.path def getHtml(url): page = urllib.urlopen(url) html = page.read() return html def getImg(html,p): reg = r'<img src="(htt... Read more

posted @ 2016-11-29 23:13 brady-wang Views (352) Comments (0) Recommendations (0)
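The regular expression in the excerpt is truncated, so the sketch below does not reproduce it. It illustrates the general idea of pulling image addresses out of HTML with `re`, assuming Python 3. The pattern `IMG_SRC`, the function names `find_image_urls` and `save_images`, and the sample HTML are all assumptions made for illustration.

```python
import os
import re
from urllib.request import urlretrieve

# Assumed pattern: absolute http(s) URLs ending in .jpg inside a src attribute.
IMG_SRC = re.compile(r'<img[^>]*?\ssrc="(https?://[^"]+?\.jpg)"', re.IGNORECASE)


def find_image_urls(html):
    """Return every matching image URL in document order."""
    return IMG_SRC.findall(html)


def save_images(urls, directory):
    """Download each URL into directory as 0.jpg, 1.jpg, ... (needs network access)."""
    os.makedirs(directory, exist_ok=True)
    for index, url in enumerate(urls):
        urlretrieve(url, os.path.join(directory, "%d.jpg" % index))


sample = (
    '<p><img src="http://example.com/a.jpg" width="10">'
    '<img class="x" src="https://example.com/b.jpg"></p>'
    '<img src="/local.png">'
)
urls = find_image_urls(sample)
print(urls)
```

A regular expression is workable for a fixed page layout like this; for arbitrary HTML a real parser such as BeautifulSoup is usually more robust.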

Crawler 5: HTML downloader html_downloader.py

Summary: #coding:utf8 import urllib2 __author__ = 'wang' class HtmlDownloader(object): def download(self, url): if url is None: return None response = urllib2.urlopen(url) ... Read more

posted @ 2016-11-29 22:46 brady-wang Views (964) Comments (0) Recommendations (0)
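The excerpt uses Python 2's `urllib2` and is cut off after the `urlopen` call. Below is a rough Python 3 sketch of the same shape using `urllib.request`. The status check and the `timeout` parameter are assumptions, not something the excerpt shows.

```python
from urllib.request import urlopen


class HtmlDownloader(object):
    def download(self, url, timeout=10):
        """Return the raw response body as bytes, or None if it cannot be fetched."""
        if url is None:
            return None
        # Assumption: anything other than HTTP 200 counts as a failed download.
        # urlopen itself raises an exception for most 4xx/5xx responses.
        with urlopen(url, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()


downloader = HtmlDownloader()
print(downloader.download(None))
```

The method returns bytes, so a caller would typically decode the result (for example with `.decode("utf-8")`) or hand it to the parser together with an explicit encoding.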

Crawler 4: HTML outputter html_outputer.py

Summary: #coding:utf8 __author__ = 'wang' class HtmlOutputer(object): def __init__(self): self.datas = [] def collect_data(self, data): if data is None: return ... Read more

posted @ 2016-11-29 22:45 brady-wang Views (471) Comments (0) Recommendations (0)

Crawler 3: HTML parser html_parser.py

Summary: #coding:utf8 import urlparse from bs4 import BeautifulSoup import re __author__ = 'wang' class HtmlParser(object): def parse(self, page_url, html_cont): if page_url is None or html_con... Read more

posted @ 2016-11-29 22:44 brady-wang Views (695) Comments (0) Recommendations (0)

Crawler 1: scheduler

Summary: spider_main.py Read more

posted @ 2016-11-29 22:42 brady-wang Views (715) Comments (0) Recommendations (0)

Crawler 2: URL manager url_manager.py

Summary: #coding:utf8 class UrlManager(object): def __init__(self): self.new_urls = set() self.old_urls = set() def add_new_url(self, url): if url is None: return... Read more

posted @ 2016-11-29 22:42 brady-wang Views (904) Comments (0) Recommendations (0)
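The excerpt shows only the two sets and the start of `add_new_url`. The sketch below fills in one plausible shape for the rest so the idea is easier to follow: `new_urls` holds pages still to crawl and `old_urls` holds pages already handed out. The methods `add_new_urls`, `has_new_url` and `get_new_url` are assumptions, not taken from the excerpt.

```python
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already handed out for crawling

    def add_new_url(self, url):
        if url is None:
            return
        # Skip URLs that are already queued or already crawled.
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        # Assumed helper: queue a batch of URLs found on one page.
        if not urls:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        # Assumed helper: True while there is still something to crawl.
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Assumed helper: take an arbitrary pending URL and mark it as crawled.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(["http://example.com/a", "http://example.com/b", "http://example.com/a"])
first = manager.get_new_url()
manager.add_new_url(first)  # ignored: this URL was already handed out
print(len(manager.new_urls), len(manager.old_urls))
```

Using sets keeps the duplicate check cheap, at the cost of crawl order: `set.pop()` returns an arbitrary element, so pages are not visited in the order they were discovered.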

BeautifulSoup test

Summary: import re from bs4 import BeautifulSoup html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse' Read more

posted @ 2016-11-29 22:20 brady-wang Views (430) Comments (0) Recommendations (0)

Installing beautifulsoup4

Summary: In Python's Scripts directory, run pip install beautifulsoup4 Read more

posted @ 2016-11-29 22:00 brady-wang Views (216) Comments (0) Recommendations (0)

风行天下

Heaven and Earth are not benevolent; they treat all things as straw dogs
