[2]新闻

第二个看这个是因为百度APIStore里面免费+综合排序这个就排第二，也是因为公司之前需要新闻采集，自己做的不好。用一下这个看看能有多少有效新闻：

今天看一共有44个频道，每个频道第一页都可以取20条新闻。很强大。代码很简单：

 1 <html>
 2 <head><meta charset="utf-8"></head>
 3 <body>
 4 <?php
 5     $ch = curl_init();
 6     $url = 'http://apis.baidu.com/showapi_open_bus/channel_news/channel_news';
 7     $header = array(
 8         'apikey:百度API密钥',
 9     );
10     curl_setopt($ch, CURLOPT_HTTPHEADER  , $header);
11     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
12     curl_setopt($ch , CURLOPT_URL , $url);
13     $res = curl_exec($ch);
14     $data=json_decode($res,true);
15       echo '<pre>';
16     $channel=array();//频道
17     if(isset($data['showapi_res_code']) && $data['showapi_res_code']===0 && is_array($data['showapi_res_body']['channelList'])){
18         $channel=$data['showapi_res_body']['channelList'];         
19     }else{
20         echo 'error!';
21         exit;
22     }
23     //print_r($data);
24     foreach($channel as $v){
25         $id=$v['channelId'];
26         $name=$v['name'];
27 
28         $ch = curl_init();
29         $url = "http://apis.baidu.com/showapi_open_bus/channel_news/search_news?channelId={$id}&page=1&needContent=1&needHtml=1";
30         curl_setopt($ch, CURLOPT_HTTPHEADER  , $header);
31         curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
32         curl_setopt($ch , CURLOPT_URL , $url);
33         $res = curl_exec($ch);
34         $data=json_decode($res,true);
35         if(isset($data['showapi_res_body']['pagebean']['contentlist']) && is_array($data['showapi_res_body']['pagebean']['contentlist'])){
36             //$num=count($data['showapi_res_body']['pagebean']['contentlist']);
37             echo '<span style="color:red">'.$name.'</span>:';
38              foreach($data['showapi_res_body']['pagebean']['contentlist'] as $dk=>$dv){
39                  echo "<a href='{$dv['link']}' style='color:#646'>".$dv['title'].'</a>';
40                  foreach($dv['imageurls'] as $iv){
41                      echo '<img src="'.$iv['url'].'" style="width:20px;height:20px">';
42                  }    
43                  echo '&nbsp;&nbsp;&nbsp;&nbsp;';
44                  if($dk>3) break; //一类获取三条退出，只做初步查看用
45                  //print_r($dv);
46              }
47          }
48          echo '<br/>';
49          //break;     
50     }
51     echo '</pre>';
52 ?>
53 </body>
54 </html>

自己之前给公司做的新闻采集就比较弱，上一篇提到过，就是用file_get_contents()获取内容然后preg_match正则去取，正则用的也属于新手阶段。

说一下当时的新闻采集的思路，就是找新闻列表页，比如：人民网-北京-区县，然后分析出来所有链接，去链接里面去采集新闻标题，内容，图片。数据库存储：`tag`表存储标签，字段名称，链接地址，列表开始结束标签，新闻标题开始结束标签，内容开始结束标签。以后采集时候直接选取名称然后去查询。目标网站改版后得更新。

遇到的难点有：1.编码，GB2312转UTF-8，开始用iconv()后来改用mb_detect_encoding().2.就是正则了，获取图片等等。自己做的新闻采集的获取新闻的调试页面

代码特别乱，就不献丑。

posted @ 2016-11-03 16:01 姜小豆阅读(409) 评论(0) 收藏举报

刷新页面返回顶部

重剑无锋

领会和表达复杂世界背后的简单性的能力是一份难得而美妙的礼物 —— Van Jacobson

[2]新闻

公告