NetEase Cloud Music Crawler Walkthrough

  My boss asked me to crawl NetEase Cloud Music data and run similarity extraction over song comments to produce copy for multiple songs, so I built this crawler. Writing it down here!

I. Analyzing the NetEase Cloud Music API

    To reduce load on its servers, NetEase has anti-crawler measures. When I opened a song page and hit F12, the data I wanted was nowhere in the page source — so the page must issue separate requests for the lyrics and comment data after it loads. I went looking for the APIs online.

  I also studied how the API request parameters are encrypted; this write-up covers it well: https://www.zhanghuanglong.com/detail/csharp-version-of-netease-cloud-music-api-analysis-(with-source-code)

  A few of the APIs used in this project:

Song info (no lyrics): GET http://music.163.com/m/song?id=123
Lyrics: GET http://music.163.com/api/song/lyric?os=pc&lv=-1&kv=-1&tv=-1&id=123
Comments: POST http://music.163.com/weapi/v1/resource/comments/R_SO_4_123 (123 is the song id)

 

II. Deep Web Crawling

  Because NetEase protects its data, the usual crawl loop — fetch page → extract useful data → save it → push discovered links onto the task queue → fetch the next page — doesn't work here.

  So I crawl by id instead: push a range of ids, say 100000000~200000000, onto the task queue. For a given id such as 123, the song info, lyrics, and comments are all tied together by that same song_id.

  To speed up the crawl, a Java thread pool runs the tasks concurrently.
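The id-range plus thread-pool setup can be sketched as below. `SongInfoTask` and friends are the task classes shown later; here a counting stub stands in for them so the sketch is self-contained:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class CrawlerMain {
    // Submit one task per song id in [startId, endId] and wait for completion.
    public static long crawl(long startId, long endId, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong done = new AtomicLong();
        for (long id = startId; id <= endId; id++) {
            final long uid = id;
            // In the real crawler this would be pool.submit(new SongInfoTask(uid)); etc.
            pool.submit(() -> { done.incrementAndGet(); });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

A fixed pool keeps the number of concurrent requests bounded, which matters once every task goes through a shared proxy pool.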

   Here I'll only walk through the concrete implementation of the crawler.

 

III. Custom Task Classes

  A Java task class implements the Runnable interface and overrides its run() method; the fetching, parsing, and database writes all happen inside run().
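The listings below show only run(); the enclosing class is omitted in the post. A minimal skeleton — with the uid field and constructor assumed from how run() uses them, and the author's DbUtils helper left out — looks roughly like:

```java
// Skeleton of one crawl task; uid matches the field the run() bodies below rely on.
public class SongInfoTask implements Runnable {
    private final long uid;   // the song id this task is responsible for
    // private final DbUtils dbUtils;   // the author's persistence helper (not shown in the post)

    public SongInfoTask(long uid) {
        this.uid = uid;
    }

    @Override
    public void run() {
        // fetch page -> parse -> insert into DB; see the concrete run() bodies below
    }
}
```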

  

1. Song info task class

@Override
    public void run() {
        try {
            Response execute;
            // Alternate the User-Agent by id parity to look less like a single client.
            if (uid % 2 == 0) {
                execute = Jsoup.connect("http://music.163.com/m/song?id=" + uid)
                        .header("User-Agent",
                                "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                        .header("Cache-Control", "no-cache")
                        .timeout(20000)
//                      .proxy(IpProxy.ipEntitys.get(i).getIp(), IpProxy.ipEntitys.get(i).getPort())
                        .execute();
            } else {
                execute = Jsoup.connect("http://music.163.com/m/song?id=" + uid)
                        .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0")
                        .header("Cache-Control", "no-cache")
                        .timeout(20000)
                        .execute();
            }
            String body = execute.body();
            if (body.contains("很抱歉,你要查找的网页找不到")) {
                System.out.println("song id " + uid + ": page not found");
                return;
            }
            Document parse = execute.parse();

            // Song title
            Elements elementsByClass = parse.getElementsByClass("f-ff2");
            String song_name = elementsByClass.get(0).childNode(0).toString();

            // Artist name
            Elements elements = parse.getElementsByClass("s-fc7");
            String singer_name = elements.get(1).childNode(0).toString();

            // Album name
            String album_name = elements.get(2).childNode(0).toString();

            // Song URL
            String song_url = "http://music.163.com/m/song?id=" + uid;

            // Lyrics
            String lyric = getSongLyricBySongId(uid);

            // Persist the song
            dbUtils.insert_song(uid, song_name, singer_name, lyric, song_url, album_name);

        } catch (Exception e) {
            // Swallowing every exception hides real failures; at least log the id.
            System.err.println("song id " + uid + " failed: " + e.getMessage());
        }
    }

    /*
     *  Fetch the lyrics for a song id.
     */
    private String getSongLyricBySongId(long id) {
        try {
            Response data = Jsoup.connect("http://music.163.com/api/song/lyric?os=pc&lv=-1&kv=-1&tv=-1&id=" + id)
                            .header("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                            .header("Cache-Control", "no-cache")
                            .timeout(20000)
                            .execute();

            String body = data.body();

            JsonObject jsonObject = new Gson().fromJson(body, JsonObject.class);
            jsonObject = (JsonObject) jsonObject.get("lrc");
            String lyric = jsonObject.get("lyric").getAsString();

            // Strip timestamp tags such as [01:23.45] and [01:23]
            lyric = lyric.replaceAll("\\[\\d+:\\d+\\.\\d+\\]", "");
            lyric = lyric.replaceAll("\\[\\d+:\\d+\\]", "");
            // Strip quotes that would break the naive SQL insert
            lyric = lyric.replaceAll("'", "").replaceAll("\"", "");

            return lyric;
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

 

2. Hot-comments task class

  A song has roughly 0~20 hot comments, and a single POST request returns them all, so they are handled separately from regular comments. The params and encSecKey request parameters are encrypted; see the link above.

@Override
    public void run() {
        try {
            String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
            String data = CenterUrl.getDataByUrl(url, "{\"offset\":0,\"limit\":10};");
            System.out.println(data);
            JsonParseUtil<CommentBean> commentData = new JsonParseUtil<>();
            CommentBean jsonData = commentData.getJsonData(data, CommentBean.class);
            List<HotComments> hotComments = jsonData.getHotComments();
            for (HotComments comment : hotComments) {
                // Assemble the fields
                Long comment_id = comment.getCommentId();
                String comment_content = comment.getContent();
                comment_content = comment_content.replaceAll("'", "").replaceAll("\"", "");
                Long liked_count = comment.getLikedCount();
                String commenter_name = comment.getUser().getNickname();
                int is_hot_comment = 1;
                Long create_time = comment.getTime();
                // Insert into the DB
                dbUtils.insert_hot_comments(uid, comment_id, comment_content, liked_count, commenter_name, is_hot_comment, create_time);
            }
        } catch (Exception e) {
            logger.error(e.getMessage());
        }
    }

 

3. Regular-comments task class

  Regular comments are paginated, so the task loops over pages; you can configure how many regular comments to fetch per song.

    @Override
    public void run() {
        long pageSize = 0;
        int dynamicPage = 105;    // 105 pages x 10 comments ≈ 1050, a little extra in case some pages fail
        for (long i = 0; i <= pageSize && i < dynamicPage; i++) {
            try {
                String url = "http://music.163.com/weapi/v1/resource/comments/R_SO_4_" + uid;
                String data = CenterUrl.getDataByUrl(url, "{\"offset\":" + i * 10 + ",\"limit\":" + 10 + "};");

                if (data.trim().equals("HTTP/1.1 400 Bad Request") || data.contains("用户的数据无效")) {
                    // The request failed (network, dead proxy, ...): retry this page
                    i--;
                    if (pageSize == 0) {    // failed before we learned the real page count
                        pageSize = dynamicPage;
                    }
                    System.out.println("~~ song_id = " + uid + ", i(Page)=" + i + ", reason = " + data);
                    continue;
                }
                // This page errored out; skip it
                if (data.contains("网络超时") || data.equals("")) {
                    continue;
                }

                JsonParseUtil<CommentBean> commentData = new JsonParseUtil<>();
                CommentBean jsonData = commentData.getJsonData(data, CommentBean.class);
                long total = jsonData.getTotal();
                pageSize = total / 10;
                List<Comments> comments = jsonData.getComments();
                for (Comments comment : comments) {
                    try {
                        // Assemble the fields
                        Long comment_id = comment.getCommentId();
                        String comment_content = comment.getContent();
                        comment_content = comment_content.replaceAll("'", "").replaceAll("\"", "");
                        Long liked_count = comment.getLikedCount();
                        String commenter_name = comment.getUser().getNickname();
                        int is_hot_comment = 0;
                        Long create_time = comment.getTime();
                        // Insert into the DB
                        dbUtils.insert_tmp_comments(uid, comment_id, comment_content, liked_count, commenter_name, is_hot_comment, create_time);
                    } catch (Exception e) {
                        System.out.println(">>>>>>>> insert failed: " + uid);
                    }
                }
            } catch (Exception e) {
                System.err.println("^^^" + e.getMessage());
            }
        }
    }

 

4. The POST request

  Because the crawl volume is fairly large, my local IP got throttled: after a few minutes the browser could no longer load comments on music.163.com, and a few minutes later songs stopped loading too. So NetEase apparently blocks the relevant endpoints for IPs it judges to be crawlers. I therefore built a proxy IP pool: whenever a response contains "Cheating" or the like, that proxy is dropped and a fresh one is taken from the pool.
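For reference, the double-AES step that produces `params` can be sketched as below. This follows the analysis linked in section I rather than code from this post: the nonce "0CoJUm6Qyw8W8jud" does appear in getDataByUrl, but the CBC IV "0102030405060708" and the exact cipher mode are assumptions taken from that write-up, and the RSA step for encSecKey is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class WeApiCrypto {
    // IV assumed from the linked analysis of the weapi endpoints.
    private static final String IV = "0102030405060708";
    private static final String NONCE = "0CoJUm6Qyw8W8jud"; // appears in getDataByUrl

    public static String aesEncrypt(String text, String key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec(IV.getBytes(StandardCharsets.UTF_8)));
        return Base64.getEncoder().encodeToString(cipher.doFinal(text.getBytes(StandardCharsets.UTF_8)));
    }

    public static String aesDecrypt(String b64, String key) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "AES"),
                new IvParameterSpec(IV.getBytes(StandardCharsets.UTF_8)));
        return new String(cipher.doFinal(Base64.getDecoder().decode(b64)), StandardCharsets.UTF_8);
    }

    // params = AES(AES(json, NONCE), random 16-char secKey); encSecKey would be the
    // RSA encryption of secKey (not shown here).
    public static String buildParams(String json, String secKey) throws Exception {
        return aesEncrypt(aesEncrypt(json, NONCE), secKey);
    }
}
```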

    public static String getDataByUrl(String url, String encrypt) {
        try {
            System.out.println("**** using proxy ip: " + ip + ", port: " + port + " ****");
            String data = "";
            // Encrypt the request parameters (see the analysis linked above)
            String secKey = new BigInteger(100, new SecureRandom()).toString(32).substring(0, 16);
            String encText = EncryptUtils.aesEncrypt(EncryptUtils.aesEncrypt(encrypt, "0CoJUm6Qyw8W8jud"), secKey);
            String encSecKey = EncryptUtils.rsaEncrypt(secKey);
            // Build the POST request
            Response execute = Jsoup.connect(url + "?csrf_token=6b9af67aaac0a2d1deb5683987d059e1")
                    .header("User-Agent",
                            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.32 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                    .header("Cache-Control", "max-age=60").header("Accept", "*/*").header("Accept-Encoding", "gzip, deflate, br")
                    .header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8").header("Connection", "keep-alive")
                    .header("Referer", "https://music.163.com/song?id=1324447466")
                    .header("Origin", "https://music.163.com").header("Host", "music.163.com")
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .data("params", encText)
                    .data("encSecKey", encSecKey)
                    .method(Method.POST).ignoreContentType(true)
                    .timeout(20000)
                    .proxy(ip, port)
                    .execute();
            data = execute.body();
            // If this proxy has been blacklisted, rotate to a fresh one
            if (data.contains("Cheating") || data.contains("指定 product id") || data.contains("无效用户")) {
                // Drop the dead ipEntity
                if (IpProxy.ipEntitys.contains(ipEntity))
                    IpProxy.ipEntitys.remove(ipEntity);

                ipEntity = getIpEntityByRandom();
                ip = ipEntity.getIp();
                port = ipEntity.getPort();
                return "用户的数据无效!!!";   // the comment tasks match this literal, so it stays in Chinese
            }
            return data;
        } catch (Exception e) {
            // Drop the broken proxy and pick another
            if (IpProxy.ipEntitys.contains(ipEntity))
                IpProxy.ipEntitys.remove(ipEntity);
            ipEntity = getIpEntityByRandom();
            ip = ipEntity.getIp();
            port = ipEntity.getPort();
            System.err.println("request failed: " + e.getMessage());
            if (e.getMessage().contains("Connection refused: connect") || e.getMessage().contains("No route to host: connect")) {
                IpProxy.ipEntitys.clear();
                IpProxy.getZDaYeProxyIp();
            }
            return "网络超时";   // the comment tasks match this literal, so it stays in Chinese
        }
    }

    /*
     *  Pick a random ipEntity from the pool, refilling the pool when it runs dry.
     */
    private static IpEntity getIpEntityByRandom() {
        try {
            // Refill first, then re-read the size -- the original read the size once
            // before refilling and could return null from an already-refilled pool.
            if (IpProxy.ipEntitys.isEmpty()) {
                Thread.sleep(20000);
                IpProxy.getZDaYeProxyIp();
            }
            int size = IpProxy.ipEntitys.size();
            if (size > 0)
                return IpProxy.ipEntitys.get((int) (Math.random() * size));
        } catch (Exception e) {
            System.err.println("failed to pick a random proxy ip!");
        }
        return null;
    }

 

IV. The Proxy IP Pool

  Of the free proxy providers, 西刺代理 (Xici) is the most usable — the IPs are fresh. The downside is instability: the site itself goes down a lot.

  There is also this one: https://www.ip-adress.com/proxy-list

  I parse the Xici page and pull each entry out of the DOM into the proxy pool:

public static List<IpEntity> getProxyIp(String url) throws Exception {
        // The original request carried over music.163.com headers and session cookies
        // from another call; a plain GET with a browser User-Agent is enough here.
        Response execute = Jsoup.connect(url)
                .header("User-Agent",
                        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
                .method(Method.GET).ignoreContentType(true)
                .timeout(20000).execute();
        Document page = execute.parse();
        Element body = page.body();
        // Position-dependent walk of Xici's table layout (as of 2018); breaks if the page changes.
        List<Node> childNodes = body.childNode(11).childNode(3).childNode(5).childNode(1).childNodes();
//      ipEntitys.clear();    // optionally clear before refilling

        for (int i = 2; i < childNodes.size(); i += 2) {
            IpEntity ipEntity = new IpEntity();
            List<Node> nodes = childNodes.get(i).childNodes();
            String ip = nodes.get(3).childNode(0).toString();
            int port = Integer.parseInt(nodes.get(5).childNode(0).toString());
            ipEntity.setIp(ip);
            ipEntity.setPort(port);
            ipEntitys.add(ipEntity);
        }
        return ipEntitys;
    }

 

  But to save myself the trouble, I eventually bought the proxy service from 站大爷 — 17 RMB a day, and the service is pretty good!

 

V. Summary

  This crawler took quite a while, with plenty of problems along the way. For instance, the NetEase API kept returning 460 — it turned out the proxy IPs weren't being refreshed. The 站大爷 API returned nothing but dead IPs — because I hadn't bound my own public IP to the account. And the crawl would often stall after ten-odd minutes: the scheduled job kept refreshing the thread pool, yet the threads stopped making progress. My guess is that the retry loop in the regular-comments task blocked all the worker threads; still to be verified.

  A crawler like this is simple, but doing it well is still hard. Keep at it!

posted @ 2018-11-14 20:17  skillking2