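Maintaining a free IP proxy pool on a schedule, wrapping a general-purpose crawler utility that picks a random IP proxy and UserAgent for every request, and building a simple traffic crawler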

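  Preface

  Our earlier crawlers simply imitated a browser and fetched pages directly, without rotating the IP proxy or the UserAgent string, so the server could easily ban our IP. An IP proxy is therefore needed, but we would rather not pay for one. Free IP proxies are available online, but most of them are unusable or unstable, so we have to scrape and validate them ourselves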

 

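  This post records how to maintain a free IP proxy pool on a schedule, how to wrap a general-purpose crawler utility that picks a random entry from the IP proxy pool and the UserAgent pool on every request, and how to build a simple traffic crawler that verifies both pools

   Main topics involved: crawling and SpringBoot. The project draws on several earlier posts: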

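  httpclient+jsoup: collecting and reading novels online

  htmlUnit to the rescue: the super evolution of the little web spider

  SpringBoot series: scheduled tasks

  SpringBoot series: elegant asynchronous calls with @Async

  SpringBoot series: Spring-Data-JPA

  SpringBoot series: WebSocket

  SpringBoot series: Thymeleaf templates

  SpringBoot series: Logback logging, output to file and real-time output to a web page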

 

  common-spider

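  Project structure

  The pom declares the parent project and adds the dependencies a basic crawler needs, along with the MySQL and JPA dependencies

        <!-- crawler -->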
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.9</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.4</version>
            <classifier>jdk15</classifier>
        </dependency>
        <dependency>
            <groupId>net.sourceforge.htmlunit</groupId>
            <artifactId>htmlunit</artifactId>
            <version>2.32</version>
        </dependency>

        <!-- Spring Data JPA dependency -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>

        <!-- MySQL driver dependency -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
        </dependency>

 

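  PS: the actual database connection settings are configured in each concrete crawler project

  common-spider can then serve as a shared module that concrete crawler projects pull in through their pom

  Unified response object

  The response object of an HttpClient request differs from that of WebClient, so for consistency we define a unified response object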

import lombok.Data;

/**
 * Unified response object
 */
@Data
public class ResultVo<E> {

    private ResultVo(Integer statusCode, String statusMessage, E page) {
        this.statusCode = statusCode;
        this.statusMessage = statusMessage;
        this.page = page;
    }

    // response status code
    private Integer statusCode;

    // response message
    private String statusMessage;

    // response payload (the fetched page)
    private E page;

    /**
     * Obtain an instance through the static factory method
     */
    public static <E> ResultVo<E> of(Integer statusCode, String statusMessage, E page) {
        return new ResultVo<>(statusCode, statusMessage, page);
    }
}
ResultVo.java

 

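  IP proxy pool

  There are quite a few free IP proxies, but most are unstable, so they have to be scraped and validated by ourselves. This post mainly scrapes the free proxies from 89ip (http://www.89ip.cn/index_1.html): the first ten pages yield 150 proxies, of which roughly 50 pass validation. Two scheduled asynchronous tasks do the work: one refreshes the IP proxy pool, currently once an hour, and one checks the IP proxy pool, currently every half hour. (Xici's free IP proxies have too few usable entries, so that source is commented out for now.)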
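
  The two schedules above can be sketched as follows. The project itself relies on SpringBoot's scheduling and @Async support; this is only a plain-JDK stand-in so that it runs without any dependency, and the class name `ProxyPoolScheduler` and its methods are hypothetical, not taken from the project:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the two scheduled tasks (refresh pool, check pool). */
final class ProxyPoolScheduler {

    // Two threads so a slow refresh does not delay the check task
    private final ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);

    /**
     * Starts both tasks at their own fixed rate.
     * The refresh runs immediately; the first check waits one full check period.
     */
    void start(Runnable refreshTask, long refreshPeriod,
               Runnable checkTask, long checkPeriod, TimeUnit unit) {
        executor.scheduleAtFixedRate(guard(refreshTask), 0, refreshPeriod, unit);
        executor.scheduleAtFixedRate(guard(checkTask), checkPeriod, checkPeriod, unit);
    }

    /** An uncaught exception would cancel all later runs, so log it and carry on. */
    private static Runnable guard(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                System.err.println("scheduled task failed: " + e);
            }
        };
    }

    /** Stops both tasks and interrupts any run in progress. */
    void stop() {
        executor.shutdownNow();
    }
}
```

  With the intervals mentioned above, the call would look like `scheduler.start(refresh, 60, check, 30, TimeUnit.MINUTES)`, where `refresh` and `check` are whatever Runnables do the scraping and validation.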

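  Scraped IP proxies are stored in the database with the IP address as the primary key, so an existing record is replaced and a new one is inserted. To check whether an IP proxy is usable, we use it to request a site that reports the caller's public IP address (http://pv.sohu.com/cityjson). If the request succeeds and the public IP returned matches the proxy's IP, the proxy works. Proxies that fail are removed from the pool in the database, and the IP proxy pool is updated once the check completes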
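
  The check itself boils down to comparing the IP that the echo service saw with the proxy's own IP. Below is a minimal sketch of that comparison only (no network access). It assumes the service answers with a JavaScript snippet of the form `var returnCitySN = {"cip": "1.2.3.4", ...};`; that response shape, the class name `ProxyCheck`, and its method names are assumptions for illustration, not the project's code:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: decides whether a proxy really masked our outbound IP. */
final class ProxyCheck {

    // Assumed response shape: var returnCitySN = {"cip": "1.2.3.4", "cid": "...", "cname": "..."};
    private static final Pattern CIP = Pattern.compile("\"cip\"\\s*:\\s*\"([^\"]+)\"");

    /** Extracts the IP address the echo service saw, if the body contains one. */
    static Optional<String> extractSeenIp(String body) {
        if (body == null) {
            return Optional.empty();
        }
        Matcher m = CIP.matcher(body);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    /** Usable only if the request succeeded and the echoed IP equals the proxy IP. */
    static boolean isProxyUsable(String proxyIp, int statusCode, String body) {
        return statusCode == 200
                && extractSeenIp(body).map(proxyIp::equals).orElse(false);
    }

    public static void main(String[] args) {
        String body = "var returnCitySN = {\"cip\": \"1.2.3.4\", \"cid\": \"CN\", \"cname\": \"CHINA\"};";
        System.out.println(isProxyUsable("1.2.3.4", 200, body));
        System.out.println(isProxyUsable("5.6.7.8", 200, body));
    }
}
```

  Keeping the comparison separate from the HTTP call makes it easy to test, and the same method can be reused whichever client (HttpClient or WebClient) performed the request.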

 

  SQL for the IP proxy table structure

/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 50528
 Source Host           : localhost:3306
 Source Schema         : test

 Target Server Type    : MySQL
 Target Server Version : 50528
 File Encoding         : 65001

 Date: 13/08/2019 15:55:59
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for spider_ip_proxy
-- ----------------------------
DROP TABLE IF EXISTS `spider_ip_proxy`;
CREATE TABLE `spider_ip_proxy`  (
  `ip` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'IP address',
  `port` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'port',
  `city` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'city',
  `operator` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'carrier',
  PRIMARY KEY (`ip`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

SET FOREIGN_KEY_CHECKS = 1;
spider_ip_proxy

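   JPA entity mapping

import javax.persistence.Entity;
import javax.persistence.Id;

import lombok.Data;

/**
 * Entity for the crawler's IP proxy pool
 */
@Data
@Entity(name = "spider_ip_proxy")
public class IpProxy {
    // IP address (primary key)
    @Id
    private String ip;
    // port
    private String port;
    // city
    private String city;
    // carrier
    private String operator;
}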
IpProxy.java

 

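  UserAgent pool

  I did not find any site that offers a ready-made UserAgent pool, so I collected a batch of UserAgent strings and stored them in the database to act as the pool. That many should be enough, so there is no scheduled task to update them

  SQL for the UserAgent table structure and data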

/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 50528
 Source Host           : localhost:3306
 Source Schema         : test

 Target Server Type    : MySQL
 Target Server Version : 50528
 File Encoding         : 65001

 Date: 13/08/2019 15:58:07
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for spider_user_agent
-- ----------------------------
DROP TABLE IF EXISTS `spider_user_agent`;
CREATE TABLE `spider_user_agent`  (
  `user_agent` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'User Agent',
  PRIMARY KEY (`user_agent`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

-- ----------------------------
-- Records of spider_user_agent
-- ----------------------------
INSERT INTO `spider_user_agent` VALUES ('Chrome/10.0.648.133 Safari/534.16');
INSERT INTO `spider_user_agent` VALUES ('Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; 360SE) ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.109 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.70 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; LGMS323 Build/KOT49I.MS32310c) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/67.0.3396.87 Mobile Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.109 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.70 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) ');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10');
INSERT INTO `spider_user_agent` VALUES ('Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6)');
INSERT INTO `spider_user_agent` VALUES ('MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1');
INSERT INTO `spider_user_agent` VALUES ('NOKIA5700/ UCWEB7.0.2.37/28/999');
INSERT INTO `spider_user_agent` VALUES ('Openwave/ UCWEB7.0.2.37/28/999');
INSERT INTO `spider_user_agent` VALUES ('Opera/8.0 (Windows NT 5.1; U; en)');
INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10');
INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52');
INSERT INTO `spider_user_agent` VALUES ('Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11');
INSERT INTO `spider_user_agent` VALUES ('UCWEB7.0.2.37/28/999');

SET FOREIGN_KEY_CHECKS = 1;
spider_user_agent

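   JPA entity mapping

import javax.persistence.Entity;
import javax.persistence.Id;

import lombok.Data;

/**
 * Entity for the crawler's User-Agent pool
 */
@Data
@Entity(name = "spider_user_agent")
public class UserAgent {
    // User-Agent string (primary key)
    @Id
    private String userAgent;
}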
UserAgent.java

 

  HttpClientUtil

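  HttpClient lives under the org.apache.http package (Apache HttpComponents). It can issue requests and fetch data with little effort, but it does not parse the DOM or execute JS and CSS, so Jsoup is needed to parse the HTML document. The utility class holds the IP proxy pool and the UserAgent pool: every request picks a random IP proxy from the former and a random UserAgent string from the latter. The IP proxy pool is refreshed by the scheduled task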
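
  The "pick a random entry on every request" part can be sketched independently of any HTTP client. The class below is a hypothetical in-memory pool, not the project's utility class; it assumes a scheduled task hands it fresh rows from the database through `refresh`:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical in-memory pool; works for IP proxies and UserAgent strings alike. */
final class RandomPool<T> {

    // volatile so request threads always see the list installed by the latest refresh
    private volatile List<T> items = new ArrayList<>();

    /** Replaces the pool contents with a private copy of the given list. */
    void refresh(List<T> fresh) {
        items = new ArrayList<>(fresh);
    }

    /** Returns a uniformly random element, or null when the pool is empty. */
    T pick() {
        List<T> snapshot = items; // read once so isEmpty(), size() and get() use the same list
        if (snapshot.isEmpty()) {
            return null;
        }
        return snapshot.get(ThreadLocalRandom.current().nextInt(snapshot.size()));
    }

    public static void main(String[] args) {
        RandomPool<String> userAgents = new RandomPool<>();
        userAgents.refresh(Arrays.asList(
                "Opera/8.0 (Windows NT 5.1; U; en)",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0"));
        System.out.println(userAgents.pick());
    }
}
```

  Swapping in a whole new list on refresh, rather than mutating the existing one, means a request that is picking an entry never observes a half-updated pool.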

   It provides a static method that returns an HttpClient object and supports bypassing SSL verification
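
  Bypassing SSL verification usually means building an SSLContext whose trust manager accepts every certificate. The sketch below uses only JDK classes and is an assumption about the general technique, not the project's implementation; the class name `TrustAllSsl` is hypothetical. Accepting every certificate removes protection against man-in-the-middle attacks, so it only suits crawling experiments like this one:

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

/** Hypothetical helper that builds an SSLContext which skips certificate validation. */
final class TrustAllSsl {

    static SSLContext create() throws GeneralSecurityException {
        // Trust manager whose checks never throw, so every certificate chain is accepted
        TrustManager trustAll = new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // intentionally empty: no validation
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // intentionally empty: no validation
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        };
        SSLContext context = SSLContext.getInstance("TLS");
        // null key managers: we present no client certificate
        context.init(null, new TrustManager[] {trustAll}, new SecureRandom());
        return context;
    }
}
```

  The resulting context would then be handed to whichever HTTP client builds the connection.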

 

  WebClientUtil

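  WebClient belongs to htmlunit. It emulates a browser: it parses the DOM and executes JS and CSS, and it lets us work with the HTML document much like manipulating DOM objects with jQuery. The utility class holds the IP proxy pool and the UserAgent pool: every request picks a random IP proxy from the former and a random UserAgent string from the latter. The IP proxy pool is refreshed by the scheduled task

   It provides a static method that returns a WebClient object with some features enabled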

 

  flow-spider

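  The traffic crawler currently covers the following projects:

  Boosting cnblogs read counts

  We pull in common-spider and start writing the traffic crawler. It mainly uses WebClient to visit blog posts on cnblogs, switching the IP proxy and the UserAgent string and enabling JS execution. Everything is randomized: the proxy IP, the UserAgent string, the interval between visits, and the post visited. We could even send random cookies (this requires careful analysis of exactly which cookies are sent and what rules their values follow; Firefox is recommended for the analysis). The goal is to imitate real user visits and thereby raise the post's read count, commonly called boosting the read count

  To watch the logs in real time, we reuse the trick from an earlier post (SpringBoot series: Logback logging, output to file and real-time output to a web page) and start setting up the project

  Project structure

  The pom declares the parent project and pulls in common-spider, plus thymeleaf and websocket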

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>flow-spider</artifactId>
    <version>0.0.1</version>
    <name>flow-spider</name>
    <description>traffic crawler</description>

    <!-- parent project -->
    <parent>
        <groupId>cn.huanzi.qch</groupId>
        <artifactId>parent</artifactId>
        <version>1.0.0</version>
    </parent>

    <dependencies>
        <dependency>
            <groupId>cn.huanzi.qch</groupId>
            <artifactId>common-spider</artifactId>
            <version>0.0.1</version>
        </dependency>

        <!-- springboot websocket -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-websocket</artifactId>
        </dependency>

        <!-- thymeleaf templates -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-thymeleaf</artifactId>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

</project>

  The configuration file holds the database settings

# Database settings
spring.datasource.url=jdbc:mysql://localhost:3306/test?serverTimezone=GMT%2B8&characterEncoding=utf-8
spring.datasource.username=root
spring.datasource.password=123456
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver

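  The steps needed for the real-time log are not repeated here; see the earlier post

  Blog entity object

  For convenience, we crawl the list of blog posts and store it in the database

  SQL for the table structure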

/*
 Navicat Premium Data Transfer

 Source Server         : localhost
 Source Server Type    : MySQL
 Source Server Version : 50528
 Source Host           : localhost:3306
 Source Schema         : test

 Target Server Type    : MySQL
 Target Server Version : 50528
 File Encoding         : 65001

 Date: 13/08/2019 16:48:14
*/

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for spider_blog
-- ----------------------------
DROP TABLE IF EXISTS `spider_blog`;
CREATE TABLE `spider_blog`  (
  `blog_url` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'blog post URL',
  `blog_name` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'blog post title',
  PRIMARY KEY (`blog_url`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

SET FOREIGN_KEY_CHECKS = 1;
spider_blog

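  JPA entity mapping

import javax.persistence.Entity;
import javax.persistence.Id;

import lombok.Data;

/**
 * Entity for a cnblogs blog post
 */
@Data
@Entity(name = "spider_blog")
public class Blog {
    @Id
    private String blogUrl;
    private String blogName;
}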
Blog.java

 

  controller

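  To save effort we skip the service layer entirely and write the business logic directly in the controller layer

  Application class

  The application class also needs some annotation configuration. By default SpringBoot only scans the current package and its subpackages, so we have to add annotations that specify the scan paths before Spring can pick up the annotated classes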

@Slf4j // Lombok's @Slf4j creates the Logger object for us, equivalent to obtaining the logger manually
@SpringBootApplication // by default only the current package and its subpackages are scanned
@EnableJpaRepositories(basePackages = {"cn.huanzi.qch.commonspider.repository","cn.huanzi.qch.flowspider.cnblogs.repository"}) // scans for @Repository
@EntityScan(basePackages = {"cn.huanzi.qch.commonspider.pojo","cn.huanzi.qch.flowspider.cnblogs.pojo"}) // scans for @Entity
@ComponentScan(basePackages = {"cn.huanzi.qch.commonspider.**","cn.huanzi.qch.flowspider.**"}) // scans for @Component-based annotations such as @Controller and @Service
@EnableScheduling // enables scheduled tasks
public class FlowSpiderApplication {

    // part of the code omitted...

}

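  Because we specify the scan paths through annotations, SpringBoot's default scan path no longer applies, so every path that needs scanning must be listed explicitly

  The project is now mostly configured. For convenience, we add a few buttons to the real-time log page to trigger these functions manually

   Results

  The page looks roughly like this

  So how well does this traffic crawler actually work? This is the result of leaving it running from a little after 6 p.m. until a little after 9 a.m. the next day. I kept only one post in the blog collection and deleted the rest, and that post's view count rose from 34 to 890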

 

 

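   Over a thousand successes for only about eight hundred extra views? And more than three thousand failures?? Isn't that efficiency a bit low?

  1. Free IP proxies are plentiful, but few actually work and even those are unstable: a proxy that passed validation a few minutes ago may fail by the time you use it. For stable IP proxies, paying for them is the more reliable option

  2. The error 400 The plain HTTP request was sent to HTTPS port shows up frequently, and I do not yet know how to fix it

  3. With small probability the same IP proxy is picked several times within the same period, and cnblogs does not count those visits

  4. Unknown reasons that affect how the read count increases...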

 

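  PS:

  As the saying goes, why should programmers make life hard for other programmers... Please do not set the random visit interval too short. We are doing this to learn, not to inflate traffic, and we should also consider the feelings of the cnblogs operations staff!

  (Just between us: a scheduled task could refresh the blog collection, making the traffic bot fully automatic. Judging by the current numbers it could contribute roughly 2000 reads a day, packaged and deployed to a cloud server, running unattended 24 hours a day [hidden smirk~~])

  Also, after you check out the code, please do not all test it on my blog; I am afraid of getting my account banned...

  Addendum:

  Well, this is awkward...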

 

 

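  Boosting WeChat votes

  For now only votes that do not require WeChat login authorization can be boosted, such as the example below. The reason is discussed later

  Let's briefly analyze this kind of WeChat vote as preparation. Find the web page of a project with a WeChat vote in progress (http://www.dzmshd.com/Home/index.php?m=Index&a=content&id=42&fid=8130&subscribe=1), right-click to view the source, and locate the request URL that casts the vote

  PS: WeChat is sneaky, the page can only be opened in WeChat's built-in browser...

  Open it in the WeChat desktop client, right-click the page, and view the source

     WeChat generates a TXT file at this location and opens it for us automatically. We then search it by keyword,

  search for this JS method, find the request URL, and append the parameters: http://www.dzmshd.com/Home/index.php?m=Index&a=vote&vid=8130&id=42&tp=

  With the URL found, we start writing code. As before, it simply goes in the controller

  Note that the UserAgent string must be a WeChat one, so the UserAgent pool from earlier cannot be used. I found a few online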

        // WeChat UserAgent strings
        String[] webKitUserAgent = {
                "Mozilla/5.0 (Linux; Android 7.1.1; MI 6 Build/NMF26X; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/6.2 TBS/043807 Mobile Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
                "Mozilla/5.0 (Linux; Android 7.1.1; OD103 Build/NMF26F; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/4G Language/zh_CN",
                "Mozilla/5.0 (Linux; Android 6.0.1; SM919 Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
                "Mozilla/5.0 (Linux; Android 5.1.1; vivo X6S A Build/LMY47V; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
                "Mozilla/5.0 (Linux; Android 5.1; HUAWEI TAG-AL00 Build/HUAWEITAG-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043622 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/4G Language/zh_CN",
                "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 MicroMessenger/6.6.1 NetType/4G Language/zh_CN",
                "Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_2 like Mac OS X) AppleWebKit/604.4.7 (KHTML, like Gecko) Mobile/15C202 MicroMessenger/6.6.1 NetType/4G Language/zh_CN",
                "Mozilla/5.0 (iPhone; CPU iPhone OS 11_1_1 like Mac OS X) AppleWebKit/604.3.5 (KHTML, like Gecko) Mobile/15B150 MicroMessenger/6.6.1 NetType/WIFI Language/zh_CN",
                "Mozilla/5.0 (iphone x Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043632 Safari/537.36 MicroMessenger/6.6.1.1220(0x26060135) NetType/WIFI Language/zh_CN",
        };

  controller

  This URL is fairly simple, a plain GET request, so HttpClientUtil is enough. Start the project and visit: http://localhost:10087/weChatVote/start

 

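  Result

  As before, we send requests at random intervals and switch the IP proxy; the UserAgent must be one of the WeChat strings. After running for a short while

  The log shows 13 successes. Checking the page, the count went from 2983 to 2997; the one extra vote was probably cast by someone else at the same time...

   To make verification easier, we pick an entry with zero votes and try again. Others have thousands of votes while it has none, poor thing, so let's give it some popularity (hehe~)

  First find the request URL: http://www.dzmshd.com/Home/index.php?m=Index&a=vote&vid=8679&id=42&tp=

  Start the project and visit: http://localhost:10087/weChatVote/start

  Result

  After running for a short while it had gained 137 votes and jumped to 12th place (facepalm)

  PS: many of the failures had this cause: our pool has too few proxy IPs, and some of them had already been used earlier to vote for the second-place entry, so the vote was rejected. Afterwards we refresh the IP proxy pool, validate it, and continue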

 

  

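  Votes that require login authorization are more troublesome. First look at the general flow of WeChat web page authorization (WeChat Official Accounts Platform: https://mp.weixin.qq.com/wiki?t=resource/res_main&id=mp1421140842)

  An ordinary browser cannot debug or inspect WeChat's links; a packet capture tool such as fiddler is needed for the analysis

  If the parameters are set incorrectly, even the authorization page cannot be reached

  Forcing a request to the link found in the source returns this error page: because parameters are missing, the redirect to the authorization page never happens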

  

     

 

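  Afterword

  A scheduled task refreshes the free IP proxies, and every request uses a random interval, a random IP, and a random UserAgent, optionally with random cookies, to imitate requests made by a real user in a browser

  That is all for this post. To be clear, the techniques are for learning and research only; please do not apply them anywhere that breaks the law. Discussion is welcome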

 

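  Update

  The two utility classes originally supported only GET requests and now support POST as well. One thing to note: in my testing, POST parameters must be set in one of two ways, depending on the server, for the backend to receive them

  1. The server side uses @RequestBody: set the request header Content-type=application/json; charset=UTF-8 and put the parameters in the body

  2. The server side does not use @RequestBody: set the request header Content-type=application/x-www-form-urlencoded; charset=UTF-8 and put the parameters in the URL parameters

  Both variants are currently in the code, with one commented out by default; adjust and extend as needed when you use it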
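
  The two cases differ only in the Content-type header and in how the parameters are serialized. The sketch below shows the serialization step for flat string parameters using JDK classes only. The class name `PostParams` and its methods are hypothetical, not the utility classes' actual code, and the hand-written JSON escaping covers only backslash and double quote, so a JSON library (the project already depends on json-lib) is the better choice for real data:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper showing how the same parameters are sent in each case. */
final class PostParams {

    static final String JSON_TYPE = "application/json; charset=UTF-8";
    static final String FORM_TYPE = "application/x-www-form-urlencoded; charset=UTF-8";

    /** Case 1 (@RequestBody): serialize the parameters as a JSON object for the body. */
    static String toJsonBody(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return joiner.toString();
    }

    /** Case 2 (no @RequestBody): encode the parameters as key=value pairs joined by '&'. */
    static String toQueryString(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return joiner.toString();
    }

    // Minimal escaping: only backslash and double quote are handled
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("name", "huanzi");
        params.put("q", "a b&c");
        System.out.println(JSON_TYPE + " -> " + toJsonBody(params));
        System.out.println(FORM_TYPE + " -> " + toQueryString(params));
    }
}
```

  In case 2 the resulting string is appended to the request URL after a `?`, matching the note above that the parameters go in the URL parameters.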

 

  

  Source code

  The code is open source and hosted on my GitHub and Gitee:

  GitHub:https://github.com/huanzi-qch/spider

  Gitee:https://gitee.com/huanzi-qch/spider

posted @ 2019-08-13 17:44  huanzi-qch