Web Crawlers - Post Category - 胡辣汤王子

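Crawler 26 - Deploying a crawl spider

Summary: ① Change to the directory where the project will be created. ② Run scrapy startproject qsbk (project name), then create the spider with scrapy genspider -t crawl wxapp_spider (spider name) "http://www.wxapp-union.com/" (domain). ③ Open the project in PyCharm …

posted @ 2020-03-16 11:29 胡辣汤王子 Views(183) Comments(0) Recommended(0)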

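Crawler 25 - The Scrapy framework in detail

Summary: # 2-Quick start ## Installation and documentation: 1. Installation: install with `pip install scrapy`. 2. Official Scrapy documentation: http://doc.scrapy.org/en/latest 3. Scrapy documentation in Chinese: http://scrapy-chs.readthedocs.io/zh_C …

posted @ 2020-03-15 23:44 胡辣汤王子 Views(230) Comments(0) Recommended(0)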

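Crawler 24 - Setting up the Scrapy framework

Summary: 1. Install the Scrapy framework: pip install scrapy 2. Create the project from a cmd window: ① Change to the directory where the project will be created. ② Run scrapy startproject qsbk (project name), then create the spider with scrapy genspider qsbk_sqider ③ Open the newly created project in PyCharm. ④ Modify set …

posted @ 2020-03-15 21:18 胡辣汤王子 Views(163) Comments(0) Recommended(0)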

Crawler 23 - CAPTCHA recognition

Summary: 1. tesseract import pytesseract from PIL import Image pytesseract.pytesseract.tesseract_cmd=r"H:\Python\Tesseract_dev20170510\Tesseract-OCR\tesseract.e …

posted @ 2020-03-15 21:09 胡辣汤王子 Views(152) Comments(0) Recommended(0)

Crawler 22 - Scraping data with Selenium

Summary: 1. Scraping AJAX data from 拉勾网 (Lagou) the usual way, with cookies: import requests from lxml import etree import time import re headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) A …

posted @ 2020-03-15 21:08 胡辣汤王子 Views(687) Comments(0) Recommended(0)

Crawler 21 - Using Selenium

Summary: 1. Getting cookie information: from selenium import webdriver driver=webdriver.Firefox() driver.get("https://www.baidu.com") for cookie in driver.get_cookies(): print(c …

posted @ 2020-03-15 21:07 胡辣汤王子 Views(195) Comments(0) Recommended(0)

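Crawler 20 - A simple way to automate the browser

Summary: from selenium import webdriver from selenium.webdriver.common.by import By # Put the downloaded driver in the Firefox installation directory # Once the environment variable is set, the driver can be used driver=webdriver.Firefox() driver.get("htt …

posted @ 2020-03-15 21:05 胡辣汤王子 Views(353) Comments(0) Recommended(0)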

Crawler 19 - Threads: producers, consumers, and queues

Summary: import threading import random import time gMoney = 1000 gLock = threading.Lock() gTotalTimes = 10 gTimes = 0 class Producer(threading.Thread): def ru …

posted @ 2020-03-15 21:03 胡辣汤王子 Views(150) Comments(0) Recommended(0)

Crawler 18 - Multithreaded crawler

Summary: import requests from lxml import etree from urllib import request import os from queue import Queue import threading class Procuder(threading.Thread): …

posted @ 2020-03-15 21:02 胡辣汤王子 Views(136) Comments(0) Recommended(0)
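
The excerpt imports Queue and threading and starts a producer thread class, which suggests a queue-based pipeline. Below is a rough sketch of that structure under stated assumptions: fake_parse is a hypothetical stand-in for the requests + lxml work so the example runs without network access, and Consumer, crawl, and the sentinel-based shutdown are illustrative choices, not taken from the post.

```python
import threading
from queue import Queue, Empty


def fake_parse(page_url):
    """Hypothetical stand-in for fetching and parsing a page."""
    return [f"{page_url}/item{i}" for i in range(3)]


class Producer(threading.Thread):
    """Takes page URLs from page_queue and puts item URLs on item_queue."""

    def __init__(self, page_queue, item_queue):
        super().__init__()
        self.page_queue = page_queue
        self.item_queue = item_queue

    def run(self):
        while True:
            try:
                page_url = self.page_queue.get_nowait()
            except Empty:
                break  # all pages were queued up front, so empty means done
            for item_url in fake_parse(page_url):
                self.item_queue.put(item_url)


class Consumer(threading.Thread):
    """Takes item URLs from item_queue and records them until a sentinel."""

    def __init__(self, item_queue, results, lock):
        super().__init__()
        self.item_queue = item_queue
        self.results = results
        self.lock = lock

    def run(self):
        while True:
            item_url = self.item_queue.get()
            if item_url is None:  # sentinel: no more work
                break
            with self.lock:
                self.results.append(item_url)  # a real crawler would download here


def crawl(page_urls, n_producers=2, n_consumers=3):
    """Run the pipeline and return the collected item URLs, sorted."""
    page_queue, item_queue = Queue(), Queue()
    for url in page_urls:
        page_queue.put(url)
    results, lock = [], threading.Lock()
    producers = [Producer(page_queue, item_queue) for _ in range(n_producers)]
    consumers = [Consumer(item_queue, results, lock) for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        item_queue.put(None)  # one sentinel per consumer, after all real items
    for t in consumers:
        t.join()
    return sorted(results)
```

Queue handles its own locking, so the threads only need an explicit lock for the shared results list.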

Crawler 17 - Using json

Summary: 1. dump import json persons=[ { 'username':"wangchenyang", 'age':14, 'country':"china" }, { 'username':"王晨阳", 'age':14, 'country':"china" } ] # json_st …

posted @ 2020-03-14 00:43 胡辣汤王子 Views(179) Comments(0) Recommended(0)
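
The excerpt defines a persons list and is cut off before any serialization call. As a small sketch of what the standard json module offers for such data: the persons list is copied from the excerpt, while the StringIO buffer (standing in for a real file) and the variable names json_str, buffer, and restored are assumptions.

```python
import io
import json

persons = [
    {"username": "wangchenyang", "age": 14, "country": "china"},
    {"username": "王晨阳", "age": 14, "country": "china"},
]

# dumps: Python object -> JSON string. ensure_ascii=False keeps non-ASCII
# characters readable instead of escaping them as \uXXXX sequences.
json_str = json.dumps(persons, ensure_ascii=False)

# dump: write JSON directly to a file-like object. The StringIO buffer stands
# in for a file opened with open(path, "w", encoding="utf-8").
buffer = io.StringIO()
json.dump(persons, buffer, ensure_ascii=False)

# loads: JSON string -> Python object.
restored = json.loads(buffer.getvalue())
```

When writing to a real file with ensure_ascii=False, opening it with an explicit encoding="utf-8" avoids depending on the platform's default encoding.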

Crawler 16 - Using csv

Summary: 1. Reading: import csv def read_csv_demo1(): with open('stock.csv','r') as fp: # reader is an iterator reader=csv.reader(fp) next(reader) for x in reader: name=x[3] v …

posted @ 2020-03-14 00:41 胡辣汤王子 Views(245) Comments(0) Recommended(0)
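
The excerpt reads stock.csv with csv.reader, skips the header row with next(reader), and picks fields by position. A minimal sketch of that approach next to the name-based csv.DictReader alternative follows. The contents and column names of stock.csv are not shown in the post, so SAMPLE and both function names are hypothetical.

```python
import csv
import io

# Hypothetical sample data; the real stock.csv and its columns are not shown.
SAMPLE = (
    "id,date,code,name,volume\n"
    "0,2020-03-13,000001,alpha,100\n"
    "1,2020-03-13,000002,beta,250\n"
)


def read_by_index(fp):
    """Positional access, as in the excerpt: skip the header, index each row."""
    reader = csv.reader(fp)  # reader is an iterator over rows (lists of str)
    next(reader)             # consume the header row
    return [(row[3], row[4]) for row in reader]


def read_by_name(fp):
    """Name-based access: DictReader uses the header row as the keys."""
    reader = csv.DictReader(fp)
    return [(row["name"], row["volume"]) for row in reader]
```

Both functions accept any text file object, for example `open('stock.csv', 'r', newline='')`; the csv module documentation recommends newline='' when opening files for it. All values come back as strings, so numeric columns need an explicit conversion.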

Crawler 15 - Scraping 中国诗词网 (a Chinese poetry site) with regular expressions

Summary: import requests import re from lxml import etree headers={ "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) C …

posted @ 2020-03-14 00:40 胡辣汤王子 Views(294) Comments(0) Recommended(0)
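
The excerpt ends inside the request headers, before any pattern appears. To illustrate the general technique of extracting fields from HTML with re.findall and re.DOTALL, here is a sketch that runs on an inline snippet. The markup, the class names, and parse_poems are all hypothetical; the real site's structure is not shown in the post and may differ.

```python
import re

# Hypothetical markup; the real page structure is not shown in the excerpt.
HTML = """
<div class="cont">
  <p><a href="/p1"><b>Poem One</b></a></p>
  <p class="source"><a>Dynasty A</a><span>:</span><a>Author A</a></p>
  <div class="contson">first line<br />second line</div>
</div>
<div class="cont">
  <p><a href="/p2"><b>Poem Two</b></a></p>
  <p class="source"><a>Dynasty B</a><span>:</span><a>Author B</a></p>
  <div class="contson">only line</div>
</div>
"""


def parse_poems(html):
    """Extract title, author, and text for each poem block."""
    # re.DOTALL lets "." match newlines; ".*?" is non-greedy, so each match
    # stops at the first closing tag instead of running to the last one.
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html, re.DOTALL)
    authors = re.findall(
        r'<p class="source">.*?<a>.*?</a>.*?<a>(.*?)</a>', html, re.DOTALL
    )
    bodies = re.findall(r'<div class="contson">(.*?)</div>', html, re.DOTALL)

    poems = []
    for title, author, body in zip(titles, authors, bodies):
        text = re.sub(r"<.*?>", "\n", body).strip()  # turn tags into line breaks
        poems.append({"title": title, "author": author, "content": text})
    return poems
```

Regular expressions work for regular, predictable markup like this; for nested or irregular pages, a parser such as lxml (which the excerpt also imports) is usually the more robust choice.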

Crawler 14 - find_all: scraping 中国天气网 (China Weather)

Summary: from bs4 import BeautifulSoup import requests from pyecharts import Bar headers={ "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.3 …

posted @ 2020-03-13 11:28 胡辣汤王子 Views(276) Comments(0) Recommended(0)

Crawler 13 - select syntax

Summary: from bs4 import BeautifulSoup html=""" <html> <head> <title>表格标签学习</title> <meta charset="UTF-8"/> <pre> 表格标签学习: table :声明一个表格 tr:声明一行,设置行高及改行所有单元格的高度 …

posted @ 2020-03-13 11:27 胡辣汤王子 Views(249) Comments(0) Recommended(0)

Crawler 12 - find_all syntax

Summary: from bs4 import BeautifulSoup html=""" <html> <head> <title>表格标签学习</title> <meta charset="UTF-8"/> <pre> 表格标签学习: table :声明一个表格 tr:声明一行,设置行高及改行所有单元格的高度 …

posted @ 2020-03-13 11:25 胡辣汤王子 Views(168) Comments(0) Recommended(0)

Crawler 11 - Scraping a complex site with lxml: 电影天堂

Summary: import requests from lxml import etree url_domain="https://www.dytt8.net" headers={ "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537 …

posted @ 2020-03-12 12:10 胡辣汤王子 Views(226) Comments(0) Recommended(0)

Crawler 10 - Scraping 飘花电影网 with lxml

Summary: import requests from lxml import etree url="https://www.piaohua.com/" headers={ "User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 …

posted @ 2020-03-12 00:17 胡辣汤王子 Views(1051) Comments(0) Recommended(0)

Crawler 08 - XPath syntax practice

Summary: from lxml import etree parser=etree.HTMLParser(encoding="utf-8") html=etree.parse("test.html",parser=parser) html2=etree.parse("lagou.html",parser=par …

posted @ 2020-03-11 19:42 胡辣汤王子 Views(302) Comments(0) Recommended(0)

Crawler 08 - Reading HTML files with lxml

Summary: from lxml import etree text=""" <html> <head> <title>表格标签学习</title> <meta charset="UTF-8"/> <pre> 表格标签学习: table :声明一个表格 tr:声明一行,设置行高及改行所有单元格的高度. th:声明 …

posted @ 2020-03-11 19:41 胡辣汤王子 Views(193) Comments(0) Recommended(0)

Crawler 07 - Cookies and sessions in the requests library

Summary: import requests # 1. Get cookies resp=requests.get("http://www.baidu.com") print(resp.cookies.get_dict()) # 2. session dapeng_url="http://www.renren.com/88015124 …

posted @ 2020-03-11 19:39 胡辣汤王子 Views(239) Comments(0) Recommended(0)
