python - Post category - toloy

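Basic Python operations on Ubuntu

Summary: # List all installed Python versions ls /usr/bin/python* # Check the current Python version python --version # Change the default Python version vi ~/.bashrc # Edit the file and add this line alias python='/usr/bin/python3.4' # After adding the line, save and exit, then re ...

posted @ 2018-04-26 12:16 toloy Views (886) Comments (0) Recommendations (0)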
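
The shell commands in this entry can be cross-checked from inside Python. The following is a minimal sketch, assuming a Unix-like layout where interpreters live under /usr/bin; the helper names `list_interpreters` and `running_version` are illustrative and do not come from the post.

```python
import glob
import sys


def list_interpreters(pattern="/usr/bin/python*"):
    """Return the sorted paths matching the pattern, like `ls /usr/bin/python*`."""
    return sorted(glob.glob(pattern))


def running_version():
    """Return the version of the interpreter executing this code as 'X.Y.Z'."""
    return "{}.{}.{}".format(*sys.version_info[:3])


if __name__ == "__main__":
    print(list_interpreters())
    # sys.executable is the interpreter actually in use, alias or not.
    print(sys.executable, running_version())
```

Note that an alias in ~/.bashrc only applies to interactive shells. A script started through its shebang line does not see the alias, so printing `sys.executable` is a quick way to confirm which interpreter really ran.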

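10. Downloading and configuring tesseract, an image recognition library for Python

Summary: ''' 1. Download from the official site: https://github.com/tesseract-ocr/tessdata/tree/3.04.00. Installer .exe files are also available online. After installing, add the installation directory to the environment variables; the computer must be restarted. You can download the language training data you need and place it in the tessdata directory. Files beginning with chi_sim are for recognizing Chinese ...

posted @ 2018-03-23 16:27 toloy Views (768) Comments (0) Recommendations (0)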

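9. Using selenium + phantomjs to simulate a browser and log in to a website

Summary: ''' Selenium simulates a browser to crawl web page information. One kind is a real browser: when the program calls it, the corresponding browser opens and displays the page, e.g. chrome, ie, safari, firefox. The other kind is a headless browser with no browser interface that only handles html, js, and cookies, e.g. htmlunit, phantomjs

posted @ 2018-03-23 14:31 toloy Views (305) Comments (0) Recommendations (0)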

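8. Simple multithreaded crawling of web page data, parsed with xpath and saved locally

Summary: # Author:toloy # Import the queue package import queue # Import the threading package import threading # Import the JSON handling package import json # Import the xpath handling package from lxml import etree # Import the request handling package import requests cla ...

posted @ 2018-03-22 17:46 toloy Views (1040) Comments (0) Recommendations (0)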
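
The excerpt above stops before the crawler class, so the following is only a sketch of the general queue-plus-threads pattern it imports, not the author's code. The names `crawl`, `save_json`, `fetch`, and `parse` are assumptions; `fetch` and `parse` are passed in as plain callables so the sketch stays independent of requests and lxml.

```python
import json
import queue
import threading


def crawl(urls, fetch, parse, worker_count=4):
    """Fetch and parse every URL with a small pool of worker threads.

    fetch(url) returns the page text; parse(text) returns a list of items.
    Both are supplied by the caller (hypothetical hooks).
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = []
    lock = threading.Lock()  # guards the shared results list

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                items = parse(fetch(url))
                with lock:
                    results.extend(items)
            except Exception as exc:  # keep the other workers running
                print("failed:", url, exc)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def save_json(items, path):
    """Write the collected items to a local JSON file, keeping non-ASCII text."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(items, handle, ensure_ascii=False, indent=2)
```

With the libraries the post imports, `fetch` could be something like `lambda url: requests.get(url).text`, and `parse` could call `etree.HTML(text).xpath(...)` with an expression suited to the target page. The order of the returned items depends on thread scheduling.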

7. Example of logging in through a requests session and parsing the result with BeautifulSoup

Summary: # Author:toloy import requests import json from bs4 import BeautifulSoup # Create the session object sess = requests.session() # Login URL url = "http://www.dfenqi.cn/ ...

posted @ 2018-03-22 10:27 toloy Views (262) Comments (0) Recommendations (0)

6. Getting web page data with xpath

Summary: 1. Parsing the page source with xpath from urllib import request from lxml import etree # Request URL url = "http://www.dfenqi.cn/Product/Index" # Request headers headers = { "User-Agent" ...

posted @ 2018-03-21 16:45 toloy Views (1861) Comments (0) Recommendations (0)
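
The post uses lxml for this step. The sketch below swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup, so that it runs without third-party packages. The sample markup, the class names, and the helper `extract_products` are invented for illustration and are not taken from the target site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not the real site's markup).
SAMPLE_PAGE = """
<html>
  <body>
    <div class="product"><a href="/p/1">Phone</a><span class="price">1999</span></div>
    <div class="product"><a href="/p/2">Laptop</a><span class="price">5999</span></div>
    <div class="ad"><a href="/promo">Promo</a></div>
  </body>
</html>
"""


def extract_products(markup):
    """Select every div with class 'product' and read its link and price."""
    root = ET.fromstring(markup.strip())
    products = []
    for node in root.findall(".//div[@class='product']"):
        link = node.find("a")
        price = node.find("span[@class='price']")
        products.append(
            {"name": link.text, "href": link.get("href"), "price": price.text}
        )
    return products
```

With lxml, the equivalent lookup would be along the lines of `etree.HTML(html).xpath("//div[@class='product']/a/text()")`, which also tolerates HTML that is not well-formed.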

5. A first crawler: parsing page data with regular expressions

Summary: from urllib import request import re # Request URL url = "http://www.dfenqi.cn/Product/Index" # Request headers headers = { "User-Agent": "Mozilla/5.0 (Windows NT 1 ...

posted @ 2018-03-21 15:25 toloy Views (236) Comments (0) Recommendations (0)
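
The excerpt ends before the pattern itself, so here is a small sketch of what regex-based extraction can look like once the page source has been fetched. The sample HTML, `LINK_PATTERN`, and `extract_links` are assumptions made for illustration; a real page needs a pattern written against its actual markup.

```python
import re

# Hypothetical page fragment standing in for the downloaded source.
SAMPLE_PAGE = """
<ul>
  <li class="item"><a href="/Product/1">Phone</a></li>
  <li class="item"><a href="/Product/2"> Laptop </a></li>
</ul>
"""

# Capture the href attribute and the link text of each anchor tag.
LINK_PATTERN = re.compile(r'<a\s+href="(?P<href>[^"]+)"[^>]*>(?P<name>[^<]+)</a>')


def extract_links(html):
    """Return (href, name) pairs for every anchor matched by LINK_PATTERN."""
    return [
        (match.group("href"), match.group("name").strip())
        for match in LINK_PATTERN.finditer(html)
    ]
```

Regular expressions work for simple, regular markup like this, but they break easily when attributes are reordered or tags are nested, which is one reason the later posts move to xpath.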

4. Sending a request with a custom cookieHandler

Summary: from urllib import request # Import the package needed for cookies from http import cookiejar import urllib.parse # Request URL url = "http://www.jinri.com" # Request headers headers = { ...

posted @ 2018-03-21 15:00 toloy Views (196) Comments (0) Recommendations (0)
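
A minimal sketch of the cookie-handling setup the imports above point to, assuming the usual combination of `http.cookiejar.CookieJar` and `urllib.request.HTTPCookieProcessor`. The helper name `build_cookie_opener` is illustrative; the post's own code is cut off before this point.

```python
from http import cookiejar
from urllib import request


def build_cookie_opener():
    """Return (opener, jar); the jar collects cookies set by HTTP responses."""
    jar = cookiejar.CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(processor)
    return opener, jar
```

After a call such as `opener.open("http://www.jinri.com")`, iterating over `jar` yields the stored cookies, each with `name` and `value` attributes, and later requests through the same opener send them back automatically.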

3. Sending a request with a custom ProxyHandler

Summary: from urllib import request # Request URL url = "http://www.jinri.com" # Request headers headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/ ...

posted @ 2018-03-21 14:05 toloy Views (456) Comments (0) Recommendations (0)
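
A sketch of routing urllib requests through a proxy, assuming the standard `ProxyHandler` plus `build_opener` approach named in the title. The proxy address used in the usage note and the helper name `build_proxy_opener` are placeholders, not values from the post.

```python
from urllib import request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https traffic through proxy_url."""
    proxy_handler = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return request.build_opener(proxy_handler)
```

For example, `build_proxy_opener("http://127.0.0.1:8080").open(req)` would send `req` through a proxy listening locally on port 8080, assuming one is running there. Passing an empty dict to `ProxyHandler` disables proxies picked up from the environment.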

2. Sending a request with a custom HttpHandler

Summary: from urllib import request url = "http://www.jinri.com" # Custom handler handler = request.HTTPHandler() headers = { "User-Agent": "Mozilla/5.0 (Windows N ...

posted @ 2018-03-21 13:50 toloy Views (151) Comments (0) Recommendations (0)
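
The excerpt creates an `HTTPHandler` and a headers dict before it is cut off. A plausible continuation, offered here only as a sketch, builds an opener from the handler and wraps the URL in a `Request`. The helper names and the user-agent string in the usage note are placeholders.

```python
from urllib import request


def build_http_opener(debug=False):
    """Return an opener built on HTTPHandler; debug=True logs the HTTP exchange."""
    handler = request.HTTPHandler(debuglevel=1 if debug else 0)
    return request.build_opener(handler)


def make_request(url, user_agent):
    """Wrap the URL in a Request that carries a User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})
```

A call such as `build_http_opener().open(make_request("http://www.jinri.com", "example-agent/0.1"))` would then return the response object, whose `read()` method yields the raw bytes of the page.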

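1. Python crawler: fetching a page's source with request.urlopen

Summary: # Import the request package (python3) from urllib import request import sys import io # If print raises an exception when printing, set up the output environment first sys.stdout = io.TextIOWrapper(sys.stdout.buffer, e ...

posted @ 2018-03-20 18:28 toloy Views (494) Comments (0) Recommendations (0)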
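
A minimal sketch of the two steps this entry describes: fetching the source with `request.urlopen` and wrapping a binary stream so that printing non-ASCII text does not fail. The UTF-8 encoding, the timeout, and the helper names `fetch_source` and `utf8_text_stream` are assumptions, since the original line is truncated before its arguments.

```python
import io
from urllib import request


def fetch_source(url, encoding="utf-8", timeout=10):
    """Open the URL and return the response body decoded to text."""
    with request.urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding, errors="replace")


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", errors="replace")
```

To apply the second helper to standard output, one would import `sys` and assign `sys.stdout = utf8_text_stream(sys.stdout.buffer)` before printing the result of `fetch_source(...)`. This helps on consoles whose default encoding cannot represent the characters in the page.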
