Post category - Crawlers and Scrapy

Summary:
import requests,json
from bs4 import BeautifulSoup
import pandas
aa=['''http://map.baidu.com/?newmap=1&reqflag=pcmap&biz=1&from=webmap&da_par=direct&pcevaname=pc4.1&qt=con&from=webmap&c=131&wd=%E5%8...
posted @ 2017-08-04 12:39 Erick-LONG Views (2661) Comments (0) Recommendations (0)
Summary:
import requests
from bs4 import BeautifulSoup
import pymongo
from multiprocessing.dummy import Pool as ThreadPool
headers = {'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) Apple...
posted @ 2017-06-16 16:13 Erick-LONG Views (734) Comments (0) Recommendations (0)
Summary: http://cuiqingcai.com/4020.html
posted @ 2017-06-12 18:09 Erick-LONG Views (167) Comments (0) Recommendations (0)
Summary:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import gevent
from gevent import monkey;monkey.patch_all()
import time
import re
import random
UA_list = [
    'Mozilla/5.0 (Windows NT ...
posted @ 2017-06-05 15:01 Erick-LONG Views (693) Comments (0) Recommendations (0)
Summary:
import scrapy
import urllib.request
from scrapy.http import Request,FormRequest
class LoginspdSpider(scrapy.Spider):
    name = "loginspd"
    allowed_domains = ["douban.com"]
    start_urls = ['htt...
posted @ 2017-05-11 16:10 Erick-LONG Views (293) Comments (1) Recommendations (0)
Summary: item.py, pipeline.py, spd.py
posted @ 2017-05-11 15:13 Erick-LONG Views (231) Comments (0) Recommendations (0)
Summary: pipeline, item
posted @ 2017-05-10 17:29 Erick-LONG Views (1732) Comments (0) Recommendations (0)
Summary:
rules = [
    Rule(SgmlLinkExtractor(allow=('/u012150179/article/details'), restrict_xpaths=('//li[@class="next_article"]')), callback='parse_ite...
posted @ 2017-05-10 16:05 Erick-LONG Views (786) Comments (0) Recommendations (0)
Summary: User-Agent (UA) pool
posted @ 2017-05-10 15:05 Erick-LONG Views (524) Comments (0) Recommendations (0)
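The post's own code is not shown in this summary. As a rough illustration of what a UA pool usually amounts to, here is a minimal sketch that picks a User-Agent at random for each request. The names `UA_POOL` and `random_headers` are hypothetical, and the User-Agent strings are illustrative placeholders rather than values from the original post.

```python
import random

# Hypothetical pool of User-Agent strings (illustrative placeholders,
# not taken from the original post).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 "
    "(KHTML, like Gecko) Version/10.1 Safari/603.1.30",
    "Mozilla/5.0 (X11; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0",
]


def random_headers(pool=UA_POOL, rng=random):
    """Return request headers with a User-Agent chosen at random from the pool."""
    if not pool:
        raise ValueError("User-Agent pool is empty")
    return {"User-Agent": rng.choice(pool)}


if __name__ == "__main__":
    # Each call may return a different User-Agent from the pool.
    print(random_headers())
```

The returned dict can be passed as the `headers` argument of whatever HTTP client the crawler uses, so that consecutive requests do not all carry the same User-Agent.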
Summary: Place it in the project directory and configure setting.py
posted @ 2017-05-10 14:19 Erick-LONG Views (660) Comments (0) Recommendations (0)
Summary:
class CsvspiderSpider(CSVFeedSpider):
    name = 'csvspider'
    allowed_domains = ['iqianyue.com']
    start_urls = ['http://iqianyue.com/feed.csv']
    headers = ['id', 'name', 'description', 'imag...
posted @ 2017-05-10 13:51 Erick-LONG Views (320) Comments (0) Recommendations (0)
Summary:
from scrapy.spiders import XMLFeedSpider
from myxml.items import MyxmlItem
class XmlspiderSpider(XMLFeedSpider):
    name = 'xmlspider'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http:...
posted @ 2017-05-10 13:35 Erick-LONG Views (217) Comments (0) Recommendations (0)
Summary:
import scrapy
from Autopjt.items import myItem
from scrapy.http import Request
class AutospdSpider(scrapy.Spider):
    name = "fulong_spider"
    start_urls =
posted @ 2017-05-10 13:15 Erick-LONG Views (1692) Comments (0) Recommendations (0)
Summary: pipeline section, item section
posted @ 2017-05-10 13:01 Erick-LONG Views (558) Comments (0) Recommendations (0)
Summary: Session operations
posted @ 2017-05-03 17:55 Erick-LONG Views (2500) Comments (0) Recommendations (0)
Summary: setting.py, main.py, items.py, dbbook.py
posted @ 2017-04-20 17:26 Erick-LONG Views (266) Comments (0) Recommendations (0)
Summary:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pandas
def gethousedetail(url):
    info ={}
    res = requests.get(url)
    ...
posted @ 2017-04-19 21:46 Erick-LONG Views (1033) Comments (0) Recommendations (0)
Summary:
import requests
import json
import time
url = 'http://weibo.com/aj/message/add?ajwvr=6'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, ...
posted @ 2017-04-19 13:52 Erick-LONG Views (2447) Comments (0) Recommendations (0)
Summary:
from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path="D:/phantomjs/bin/phantomjs.exe")
driver.get("http://study.163.com/course/courseMain.htm?course...
posted @ 2017-04-19 13:46 Erick-LONG Views (139) Comments (0) Recommendations (0)
Summary:
from email.header import Header
from email.mime.text import MIMEText
from email.utils import parseaddr,formataddr
import smtplib
from email.mime.multipart import MIMEMultipart
from e...
posted @ 2017-04-19 13:44 Erick-LONG Views (233) Comments (0) Recommendations (0)