Essay archive for May 20, 2016 - rongyux

May 20, 2016

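Abstract: This post covers crawling PDF pages, which requires the pdfminer module. General flow of the demo: 1) set the URL; 2) fetch the URL with the requests module; 3) write the response to a .pdf file; 4) process the file with the pdfminer module (for the API, see another post of mine: http://www.cnblogs.com/rongyux/p/5445723.html Read more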
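The four steps above can be sketched roughly as follows. This is a minimal sketch, not the post's own demo: it assumes the maintained `pdfminer.six` fork and its `pdfminer.high_level.extract_text` helper rather than the lower-level pdfminer API the linked post describes, and the helper names `pdf_name_from_url`, `download_pdf`, and `pdf_to_text` are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def pdf_name_from_url(url, default="download.pdf"):
    """Derive a local file name from the URL path (hypothetical helper)."""
    name = Path(urlparse(url).path).name
    return name if name.lower().endswith(".pdf") else default


def download_pdf(url, out_dir="."):
    """Steps 1-3: fetch the URL with requests and write the bytes to a .pdf file."""
    import requests  # third-party; imported here so the helpers above work without it

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    target = Path(out_dir) / pdf_name_from_url(url)
    target.write_bytes(resp.content)  # PDFs are binary, so write bytes, not text
    return target


def pdf_to_text(pdf_path):
    """Step 4: extract plain text (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_text

    return extract_text(str(pdf_path))
```

A call such as `pdf_to_text(download_pdf(url))` would chain the steps for a URL of your choosing; scanned PDFs without a text layer will generally yield little or no text.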

posted @ 2016-05-20 23:59 rongyux Views (1470) Comments (0) Recommendations (0)

Crawler 3: HTML pages + webdriver module + demo

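Abstract: For well-protected sites, page content cannot be fetched with a plain request. In that case the webdriver module can open a browser first and then crawl the information; it can even interact with the page through operations such as click before crawling. General flow of the demo: 1) import the selenium module; 2) choose the Firefox browser (Chrome also works); 3) open the page with get (for confidentiality, the ur Read more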
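The flow described above might look like the sketch below. It is an illustration under stated assumptions, not the post's demo: it assumes Selenium 4 with Firefox and geckodriver available, the names `fetch_rendered_html`, `extract_links`, and `click_css` are hypothetical, and collecting links with the standard-library `HTMLParser` is just one possible way to "crawl the information" from the rendered page.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, click_css=None, wait_seconds=10):
    """Open the page in a real browser, optionally click one element, return the HTML."""
    # Third-party imports kept local so extract_links() works without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # webdriver.Chrome() should work the same way
    try:
        driver.get(url)
        if click_css:
            # Wait until the element is clickable, then click it before scraping.
            WebDriverWait(driver, wait_seconds).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, click_css))
            ).click()
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

Because the browser executes the page's JavaScript, `page_source` reflects the rendered DOM, which is the main reason to reach for webdriver when a direct HTTP request returns nothing useful.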

posted @ 2016-05-20 23:36 rongyux Views (1139) Comments (0) Recommendations (0)

Crawler 2: HTML pages + BeautifulSoup module + POST method + demo

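Abstract: Crawling an HTML page sometimes requires sending a POST request with parameters, then generating JSON and saving it to a file. 1) import the modules; 2) set the parameters; 3) send the POST request; 4) set the encoding; 5) parse the response with BeautifulSoup; 6) filter with find_all; 7) pick elements by CSS selector with select. For the BeautifulSoup API, see https Read more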
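A rough sketch of the seven steps listed above, plus the JSON output, could look like this. The form parameters, the `h2` tag, the `div.item a[href]` selector, and the function names are all assumptions made for illustration; the real page structure is not given in the abstract.

```python
import json


def fetch_html(url, params):
    """Steps 2-4: POST the form parameters and fix the response encoding."""
    import requests  # third-party; imported here so save_json() works without it

    resp = requests.post(url, data=params, timeout=30)
    resp.raise_for_status()
    resp.encoding = "utf-8"  # assumed encoding; adjust to the target site
    return resp.text


def parse_items(html):
    """Steps 5-7: parse with BeautifulSoup, then filter with find_all and select."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    # find_all filters by tag name; the h2 tag is a hypothetical choice.
    titles = [h.get_text(strip=True) for h in soup.find_all("h2")]
    # select takes a CSS selector; this selector is also hypothetical.
    links = [a["href"] for a in soup.select("div.item a[href]")]
    return {"titles": titles, "links": links}


def save_json(data, path):
    """Serialize the result as UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)
```

Something like `save_json(parse_items(fetch_html(url, {"page": "1"})), "items.json")` would tie the pieces together, where the `page` parameter is again only a placeholder.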

posted @ 2016-05-20 23:18 rongyux Views (1389) Comments (0) Recommendations (0)
