Web Scraping - Post Category - st--st

Summary: https://segmentfault.com/a/1190000015826749

posted @ 2019-03-07 10:34 st--st Views(1026) Comments(0) Recommendations(0)

Summary: Uses the fake-useragent module, https://github.com/hellysmile/fake-useragent 1. Install the module 2. Configure...

posted @ 2019-02-27 16:47 st--st Views(1134) Comments(0) Recommendations(0)
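The install-then-configure steps in the fake-useragent summary above can be sketched roughly as follows. This is only a minimal sketch: `FALLBACK_UA` and `random_user_agent` are hypothetical names introduced here, and the fallback branch is an assumption so the snippet still runs when the package is not installed (`pip install fake-useragent`).

```python
# Fallback string used when fake-useragent cannot be used (an assumption for this sketch).
FALLBACK_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def random_user_agent():
    """Return a random User-Agent string, or FALLBACK_UA if the library is unavailable."""
    try:
        from fake_useragent import UserAgent  # third-party: pip install fake-useragent
        return UserAgent().random  # .random picks a User-Agent string at random
    except Exception:
        # Package missing or its data could not be loaded: use the fixed fallback.
        return FALLBACK_UA


# Build request headers with a randomized User-Agent.
headers = {"User-Agent": random_user_agent()}
```

The resulting `headers` dict can then be passed to whatever HTTP client the crawler uses.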

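Summary: How to improve Scrapy's crawling efficiency. Increase concurrency: by default Scrapy issues 16 concurrent requests, and this can be raised as appropriate. Setting CONCURRENT_REQUESTS = 100 in the settings file raises the limit to 100. Lower the log level: running Scrapy produces a large volume of log output; to reduce CPU usage...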

posted @ 2019-02-24 15:20 st--st Views(1017) Comments(0) Recommendations(0)
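As a rough illustration of the Scrapy tuning summary above, a `settings.py` fragment might look like this. `CONCURRENT_REQUESTS = 100` comes from the summary; the exact `LOG_LEVEL` value is an assumption, since the summary is cut off before naming one.

```python
# settings.py (fragment)

# Raise the number of concurrent requests, as the summary suggests.
CONCURRENT_REQUESTS = 100

# Lower log verbosity to cut logging overhead.
# "ERROR" is an assumed choice; the original post's value is not visible here.
LOG_LEVEL = "ERROR"
```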

[Web Scraping] Multithreaded scraping of Qiushibaike, written to a file

Summary: ''' Scrape jokes from Qiushibaike, collecting the content and links, and write them to CSV. Techniques used: multithreading, locks, queues, XPath, csv ''' import requests import csv from queue import Queue from lxml import etree import threading class Creeper(threading.Thread): def __ini...

posted @ 2019-02-21 16:46 st--st Views(163) Comments(0) Recommendations(0)
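The combination named in the summary above (threads, a queue, a lock, and csv) could be wired together roughly like this. It is a sketch, not the post's actual code: only the class name `Creeper` appears in the summary, so its constructor, `fetch_items` (a stand-in for the request and XPath step, with no network access), and `crawl` are all hypothetical.

```python
import csv
import io
import threading
from queue import Empty, Queue


def fetch_items(page):
    """Hypothetical stand-in for the download + XPath step: returns (content, link) rows."""
    return [(f"joke {page}-{i}", f"https://example.com/{page}/{i}") for i in range(2)]


class Creeper(threading.Thread):
    """Worker thread: takes page numbers from a queue and writes rows to a shared csv writer."""

    def __init__(self, page_queue, writer, lock):
        super().__init__()
        self.page_queue = page_queue
        self.writer = writer
        self.lock = lock

    def run(self):
        while True:
            try:
                page = self.page_queue.get_nowait()
            except Empty:
                break  # no pages left, so this worker is done
            rows = fetch_items(page)
            with self.lock:  # the csv writer is shared, so serialize writes
                self.writer.writerows(rows)


def crawl(pages, worker_count=3):
    """Fill the queue, run the workers to completion, and return the CSV text."""
    page_queue = Queue()
    for page in pages:
        page_queue.put(page)

    out = io.StringIO()  # in-memory file; a real run would open a .csv file instead
    writer = csv.writer(out)
    writer.writerow(["content", "link"])

    lock = threading.Lock()
    workers = [Creeper(page_queue, writer, lock) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return out.getvalue()


csv_text = crawl(range(1, 4))
```

Row order depends on thread scheduling, so only the header line is guaranteed to come first.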

[Web Scraping] Multithreaded scraping of meme images

Summary: ''' Scrape meme images using multithreading and queues. URL: http://www.bbsnet.com/doutu/page/1 ''' import requests from lxml import etree import os import re from urllib import request from queue import Queue import threading class Pr...

posted @ 2019-02-21 09:53 st--st Views(179) Comments(0) Recommendations(0)

[Web Scraping] Producer-consumer pattern, Condition version

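Summary: Producer-consumer pattern, Condition version. With threading.Condition, a thread can block while there is no data; once data is available, notify wakes a thread that is waiting. threading.Condition wraps an underlying lock (an RLock by default) rather than inheriting from threading.Lock, so it exposes the same calls. acquire: lock. release: unlock...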

posted @ 2019-02-20 20:38 st--st Views(179) Comments(0) Recommendations(0)
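The wait/notify behaviour described in the Condition summary above might look like this in a small, self-contained form. The names `run_demo`, `items`, and `consumed` are hypothetical; the post's own code is not visible here, so this is only a sketch of the pattern.

```python
import threading


def run_demo(count=5):
    """One producer hands `count` items to one consumer through a Condition."""
    condition = threading.Condition()
    items = []      # shared buffer, guarded by the condition's lock
    consumed = []   # what the consumer received, in order

    def producer():
        for i in range(count):
            with condition:          # acquire the underlying lock
                items.append(i)
                condition.notify()   # wake one waiting thread, if any

    def consumer():
        for _ in range(count):
            with condition:
                while not items:     # re-check after every wake-up
                    condition.wait() # releases the lock while blocked
                consumed.append(items.pop(0))

    threads = [threading.Thread(target=consumer), threading.Thread(target=producer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return consumed


result = run_demo()
```

Waiting inside a `while` loop rather than an `if` guards against waking up when the buffer is still empty.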

[Web Scraping] Producer-consumer pattern, Lock version

Summary: ''' Producer-consumer pattern, Lock version ''' import threading import random import time gMoney = 1000 # starting balance gLoad = threading.Lock() gTime = 0 # number of production rounds class Producer(threading.Thread): def run(self...

posted @ 2019-02-20 20:06 st--st Views(126) Comments(0) Recommendations(0)
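Since the summary above is cut off right at `run`, here is one way the Lock version could continue. It is a sketch under assumptions: `gMoney = 1000` and the `Producer` class name come from the summary, while the lock is named `gLock` here (the summary calls it `gLoad`), and the `Consumer` class, the amounts, the round counts, and the `deposits`/`withdrawals` lists are all hypothetical.

```python
import random
import threading

gMoney = 1000               # starting balance, as in the summary
gLock = threading.Lock()    # guards every read and write of gMoney
deposits = []               # bookkeeping for this sketch only
withdrawals = []


class Producer(threading.Thread):
    """Adds a random amount to the shared balance a fixed number of times."""

    def run(self):
        global gMoney
        for _ in range(10):
            amount = random.randint(100, 1000)
            with gLock:
                gMoney += amount
                deposits.append(amount)


class Consumer(threading.Thread):
    """Withdraws a random amount, but only when the balance covers it."""

    def run(self):
        global gMoney
        for _ in range(10):
            amount = random.randint(100, 500)
            with gLock:
                if gMoney >= amount:
                    gMoney -= amount
                    withdrawals.append(amount)


threads = [Producer() for _ in range(3)] + [Consumer() for _ in range(3)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because every update happens while holding the lock, the final balance always equals the starting balance plus deposits minus withdrawals.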

selenium

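Summary: 1. Installing selenium and chromedriver 2. Installing PhantomJS 3. Introduction. selenium started out as a testing tool; crawlers use it mainly because requests cannot execute JavaScript directly. selenium works by driving a browser, fully simulating browser actions such as navigating, typing, clicking, scrolling down...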

posted @ 2019-01-24 14:26 st--st Views(220) Comments(0) Recommendations(0)

BeautifulSoup

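Summary: Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands the user the data to be scraped; because it is simple, a complete application takes very little code. 1. Installation 2. Usage 3. Traversing the document tree: getting a tag's text, the tag object 4. The five kinds of filters fi...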

posted @ 2019-01-24 10:37 st--st Views(187) Comments(0) Recommendations(0)

Using XPath

Summary: Using the XPath module def get_page(url): import requests headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)...

posted @ 2019-01-23 11:03 st--st Views(445) Comments(0) Recommendations(0)
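To give a feel for the XPath step the post above covers, here is a sketch that swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset, in place of a full XPath engine such as lxml. The markup in `DOC` is made up for illustration, and real-world HTML is often not well-formed XML, so this stand-in would not parse it.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
DOC = """<html><body>
<div class="item"><a href="/p/1">first</a></div>
<div class="item"><a href="/p/2">second</a></div>
<div class="ad"><a href="/ad">skip</a></div>
</body></html>"""

root = ET.fromstring(DOC)

# Select every <a> directly inside a <div class="item">, anywhere in the tree.
anchors = root.findall(".//div[@class='item']/a")

# Pull out the attribute and the text of each match.
links = [(a.get("href"), a.text) for a in anchors]
```

The same path expression also works with a full XPath engine; only the parsing and query calls would differ.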

Response-related notes

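Summary: response is the response object. response.content # returns bytes, the raw content of the data. response.text # returns a string, decoded with the encoding requests infers for the response. Redirects: the browser sends a request, and the server returns a redirect status code and a Location header, with no response body. The browser then automatically sends another request to the Location URL, and only then...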

posted @ 2019-01-22 20:57 st--st Views(654) Comments(0) Recommendations(0)
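The bytes-versus-string distinction behind `response.content` and `response.text` can be shown without any network access. In this sketch `raw_body` merely stands in for `response.content` and `text_body` for `response.text`; both names, and the sample string, are hypothetical.

```python
# Raw bytes as they would arrive over the wire (stand-in for response.content).
raw_body = "糗事百科 crawler".encode("utf-8")

# Decoding with the right encoding gives the readable string
# (stand-in for response.text when the encoding is UTF-8).
text_body = raw_body.decode("utf-8")

# Decoding the same bytes with the wrong encoding still succeeds for ISO-8859-1,
# but produces mojibake instead of the original characters.
garbled = raw_body.decode("iso-8859-1")
```

This is why garbled text usually points at the encoding used for decoding rather than at the downloaded bytes.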

The Requests module

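Summary: 1. Content-Type. Content-Type (content type) generally refers to the Content-Type of a web page. It defines the type of a network file and the page's encoding, and determines in what form and with what encoding the browser reads the file. The Content-Type header specifies the HTTP content type of a request or response. If Content-Type is not specified, the default is t...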

posted @ 2019-01-21 20:39 st--st Views(171) Comments(0) Recommendations(0)
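To make the type-plus-encoding role of Content-Type in the summary above concrete, a header value can be split into its media type and charset using only the standard library. `parse_content_type` is a hypothetical helper name, and the sample header values are made up.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Both results are lower-cased; the charset is None when the header omits it.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
json_type, json_charset = parse_content_type("application/json")
```

A crawler can use the charset, when present, to choose how to decode the response body.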

Getting started with web scraping

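Summary: 1. Introduction to crawlers. A web crawler, or Web Spider, is aptly named. If the Internet is pictured as a spider's web, then the Spider is the spider crawling across it. A web spider finds pages by following link addresses: starting from one page of a site (usually the home page), it reads the page's content, finds the other link addresses on it, and then follows those links to the next page...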

posted @ 2019-01-21 20:12 st--st Views(216) Comments(0) Recommendations(0)
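The link-following loop described in the introduction above (start at one page, read it, collect its links, move on to the next page) can be sketched as a breadth-first traversal. To stay runnable offline, the "site" here is a made-up in-memory dict, `LINKS`, and `crawl_site` is a hypothetical name; a real crawler would download and parse each page instead.

```python
from collections import deque

# Made-up site map: each page lists the pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def crawl_site(start, get_links):
    """Visit every page reachable from `start`, each exactly once, in breadth-first order."""
    seen = {start}           # pages already queued, so loops are not followed twice
    order = []               # pages in the order they were visited
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl_site("/", lambda page: LINKS.get(page, []))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.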
