Essay category - Python crawlers

Summary: 1. Configuration 3. Spider 4. Middleware 5. Pipeline (storing to MongoDB)
posted @ 2018-07-30 00:29 Ray_chen Views (331) Comments (0) Recommendations (0)
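The summary above lists a pipeline that stores items in MongoDB, but the excerpt does not show it. Here is a minimal sketch of what such a Scrapy item pipeline might look like. It is not the post's own code: the setting names `MONGO_URI` and `MONGO_DB`, their defaults, and the collection name `items` are all assumptions made for illustration.

```python
class MongoPipeline:
    """Sketch of a Scrapy item pipeline that writes each item to MongoDB."""

    def __init__(self, mongo_uri, mongo_db, collection_name="items"):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name  # hypothetical default name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DB are assumed setting names, not taken from the post.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGO_DB", "scrapy_demo"),
        )

    def open_spider(self, spider):
        # Imported here so the class can be loaded without pymongo installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.mongo_db][self.collection_name]

    def close_spider(self, spider):
        # Release the connection if one was opened.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike.
        self.collection.insert_one(dict(item))
        return item
```

To try it, the class would be registered under `ITEM_PIPELINES` in the project's settings, for example `{"myproject.pipelines.MongoPipeline": 300}`, where `myproject` is a placeholder module path.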
Summary:
    import re
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    fr...
posted @ 2018-07-29 17:10 Ray_chen Views (310) Comments (0) Recommendations (0)
Summary:
    from pyquery import PyQuery as pq

    # Initialize from a URL
    # html = ''
    # doc = pq(html)
    url = 'https://www.baidu.com'
    doc = pq(url=url)
    print(doc('hea...
posted @ 2018-07-27 10:10 Ray_chen Views (170) Comments (0) Recommendations (0)
Summary:
    import requests
    import re
    import json
    from requests.exceptions import RequestException
    from multiprocessing import Pool

    # Fetch the web page
    def get_one_page(url):
        headers = {
            ...
posted @ 2018-07-27 10:04 Ray_chen Views (226) Comments (0) Recommendations (0)
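The excerpt above stops inside `get_one_page`, so the rest of that post is not visible here. Going only by its imports (`requests`, `re`, `json`, `Pool`), the flow might look like the sketch below: fetch a page, pull fields out with a regular expression, and fan the URLs out over a process pool. The markup in `SAMPLE_HTML`, the `ITEM_PATTERN` regex, the `User-Agent` value, and the helper names `parse_one_page`, `crawl`, and `crawl_all` are all made up for illustration.

```python
import json
import re
from multiprocessing import Pool

# Hypothetical markup: the real page structure is not shown in the post.
ITEM_PATTERN = re.compile(
    r'<li class="item">\s*<span class="rank">(\d+)</span>\s*'
    r'<a href="(.*?)">(.*?)</a>\s*</li>',
    re.S,
)


def get_one_page(url):
    """Return the page body as text, or None if the request fails."""
    import requests  # third-party; imported lazily so the parser works without it

    headers = {"User-Agent": "Mozilla/5.0"}  # assumed header value
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except requests.exceptions.RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per item matched by ITEM_PATTERN."""
    for rank, link, title in ITEM_PATTERN.findall(html):
        yield {"rank": int(rank), "link": link, "title": title.strip()}


def crawl(url):
    """Fetch and parse one URL; returns a list so the result can be pickled."""
    html = get_one_page(url)
    return list(parse_one_page(html)) if html else []


def crawl_all(urls, processes=4):
    """Fan the URLs out over a process pool and flatten the results."""
    with Pool(processes) as pool:
        pages = pool.map(crawl, urls)
    return [item for page in pages for item in page]


SAMPLE_HTML = """
<ul>
  <li class="item"><span class="rank">1</span> <a href="/a">First </a></li>
  <li class="item"><span class="rank">2</span> <a href="/b">Second</a></li>
</ul>
"""

if __name__ == "__main__":
    # Offline demo: parse the sample markup only; no request is sent.
    for item in parse_one_page(SAMPLE_HTML):
        print(json.dumps(item, ensure_ascii=False))
```

`crawl_all` is left uncalled on purpose, since it needs real URLs and a regex written for the real page. `crawl` returns a list rather than a generator because `Pool.map` has to pickle each worker's result.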
Summary: 1. Implementation with re
    import requests
    from requests.exceptions import RequestException
    import re, json
    import xlwt, xlrd

    # Data
    DATA = []
    KEYWORD = 'pyth...
posted @ 2018-07-27 02:24 Ray_chen
Summary: 1. Implementation with re
    import re, os
    import requests
    from requests.exceptions import RequestException

    MAX_PAGE = 10  # maximum number of pages
    KEYWORD = 'python'
    headers = {
        ...
posted @ 2018-07-26 19:12 Ray_chen Views (318) Comments (0) Recommendations (0)