Lao Li Shares: A Java Web Crawler Implementation


1. Design Approach

 

(1) A link queue that collects every page on the target site, or on a specified subdomain

(2) A queue of URLs waiting to be visited (this overlaps somewhat with the above; it trades space for time to speed up crawling)

(3) A data structure that records the URLs already visited

With the data structures in place, the next step is the algorithm. A breadth-first crawl is generally recommended, so the crawler does not fall into anti-crawler traps built from circular links of unbounded depth.
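
In outline, a breadth-first crawl is just a FIFO frontier queue plus a visited set; the visited set is what breaks link cycles. A minimal single-threaded sketch (the names here are illustrative, not taken from the code below):

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class BfsOutline {

    // placeholder for "fetch the page and collect its hrefs"
    static List<String> extractLinks(String url) {
        return Collections.emptyList();
    }

    static void crawl(String seed) {
        Queue<String> frontier = new ArrayDeque<String>(); // FIFO => breadth-first
        Set<String> visited = new HashSet<String>();

        frontier.offer(seed);
        visited.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            for (String link : extractLinks(url)) {
                // Set.add returns false for duplicates, so circular links
                // can never drag the crawl into unbounded depth
                if (visited.add(link)) {
                    frontier.offer(link);
                }
            }
        }
    }
}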

The implementation uses jsoup (a library for parsing HTML elements) and httpclient (an HTTP request package) to keep the code simple.
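
For completeness, the snippets below assume roughly the following imports, plus a logger field (e.g. a Log4j Logger declared on the class); treat this as a sketch, since exact packages can shift between jsoup/HttpClient versions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;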

  

2. Code Implementation

The three data structures described above:

// visited URLs <URL, isAccess>
final static ConcurrentHashMap<String, Boolean> urlQueue = new ConcurrentHashMap<String, Boolean>();

// URLs waiting to be fetched
final static ConcurrentLinkedDeque<String> urlWaitingQueue = new ConcurrentLinkedDeque<String>();

// queue of page URLs waiting to be scanned for links
final static ConcurrentLinkedDeque<String> urlWaitingScanQueue = new ConcurrentLinkedDeque<String>();
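
One caveat: a containsKey check followed by an offer (as in the scanning code below) is not atomic, so two threads can both pass the check and enqueue the same URL. putIfAbsent can claim a URL in a single atomic step; a sketch of the pattern, using the Boolean value for the <URL, isAccess> flag:

// putIfAbsent returns null only for the first thread to insert the key,
// so exactly one thread wins the right to enqueue a given URL
static boolean tryClaim(String url) {
    return urlQueue.putIfAbsent(url, Boolean.FALSE) == null; // FALSE = discovered, not yet accessed
}

The scanning loop could then test if (tryClaim(suburl)) instead of if (!urlQueue.containsKey(suburl)).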

Enqueueing URLs to wait:

/**
 * Poll the next page URL off the scan queue, scan it for links, and store
 * newly found URLs in the waiting queues.
 * Note: originalUrl is kept from the original signature but is never read;
 * the method polls urlWaitingScanQueue instead.
 * @param originalUrl
 * @throws Exception
 */
private static void enterWaitingQueue(final String originalUrl) throws Exception {

    String url = urlWaitingScanQueue.poll();

    // if accessed, ignore the url
    /*while (urlQueue.containsKey(url)) {
        url = urlWaitingQueue.poll();
    }*/

    final String finalUrl = url;

    // crude throttling: pause before spawning the scanning thread
    Thread.sleep(600);

    new Thread(new Runnable() {

        public void run() {

            try {

                if (finalUrl != null) {

                    Connection conn = Jsoup.connect(finalUrl);
                    Document doc = conn.get();

                    //urlQueue.putIfAbsent(finalUrl, Boolean.TRUE); // accessed

                    logger.info("Scanning page URL: " + finalUrl);

                    Elements links = doc.select("a[href]");

                    for (int linkNum = 0; linkNum < links.size(); linkNum++) {
                        Element element = links.get(linkNum);

                        String suburl = element.attr("href");

                        // under whatever filter condition applies, and only if not seen before
                        if (!urlQueue.containsKey(suburl)) {
                            urlWaitingScanQueue.offer(suburl);
                            urlWaitingQueue.offer(suburl);
                            logger.info("URL enqueued to wait " + linkNum + ": " + suburl);
                        }
                    }
                }

            } catch (Exception ee) {
                logger.error("multi-thread executing error, url: " + finalUrl, ee);
            }
        }
    }).start();
}
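
If the target site rejects bare requests, the Jsoup.connect call above can also carry a user agent and a timeout. userAgent and timeout are standard jsoup Connection options, but the values below are placeholders, not settings from the original post:

Document doc = Jsoup.connect(finalUrl)
        .userAgent("Mozilla/5.0")   // placeholder user-agent string
        .timeout(5000)              // connect/read timeout, in milliseconds
        .get();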

Visiting pages:

private static void viewPages() throws Exception {

    Thread.sleep(500);

    new Thread(new Runnable() {

        @Override
        public void run() {
            try {

                while (!urlWaitingQueue.isEmpty()) {

                    // poll (not peek) so the URL actually leaves the queue;
                    // peeking here would loop forever on the same URL
                    String url = urlWaitingQueue.poll();
                    if (url == null) {
                        continue; // another thread drained the queue first
                    }

                    final String finalUrl = url;

                    // build a client, like opening a browser
                    CloseableHttpClient httpClient = HttpClients.createDefault();

                    // create the request, like typing the url into the browser
                    // (a POST is used so the commented-out form parameters below can be attached)
                    //HttpGet httpGet = new HttpGet("http://www.dxy.cn");
                    HttpPost httpPost = new HttpPost(finalUrl);

                    StringBuffer stringBuffer = new StringBuffer();
                    HttpResponse response;

                    //List<NameValuePair> keyValue = new ArrayList<NameValuePair>();

                    // POST parameters
                    //keyValue.add(new BasicNameValuePair("username", "zhu"));
                    //httpPost.setEntity(new UrlEncodedFormEntity(keyValue, "UTF-8"));

                    // execute the request and get the response
                    response = httpClient.execute(httpPost);

                    // record the accessed URL
                    urlQueue.putIfAbsent(finalUrl, Boolean.TRUE);

                    if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {

                        HttpEntity httpEntity = response.getEntity();
                        if (httpEntity != null) {
                            logger.info("viewPages visiting URL: " + finalUrl);
                            BufferedReader reader = new BufferedReader(
                                    new InputStreamReader(httpEntity.getContent(), "UTF-8"));

                            String line = null;
                            if (httpEntity.getContentLength() > 0) {

                                stringBuffer = new StringBuffer((int) httpEntity.getContentLength());

                                while ((line = reader.readLine()) != null) {
                                    stringBuffer.append(line);
                                }

                                System.out.println(finalUrl + " content: " + stringBuffer);
                            }
                            reader.close();
                        }
                    }

                    httpClient.close();
                }

            } catch (Exception e) {
                logger.error("view pages error", e);
            }
        }

    }).start();
}
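
A hypothetical driver wiring the pieces together; the seed URL, the null argument, and the loop shape are assumptions for illustration, not part of the original post:

public static void main(String[] args) throws Exception {
    String seed = "http://www.example.com"; // hypothetical seed URL

    // seed both queues so the scanner and the viewer have work to start with
    urlWaitingScanQueue.offer(seed);
    urlWaitingQueue.offer(seed);

    // start the page viewer; its worker thread loops until urlWaitingQueue drains
    viewPages();

    // repeatedly scan queued pages for new links
    while (!urlWaitingScanQueue.isEmpty()) {
        enterWaitingQueue(null); // the parameter is unused; the method polls the scan queue itself
    }
}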

3. Summary and Future Features

 

The above covers the core modules of a simple Java crawler; it can more or less be picked up and tested as-is.

Crawl-rate control (a scheduling module) and access through proxy IPs (a proxy-collection module) are left for you to add to your own version over time...
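
For the proxy piece, HttpClient 4.x supports a per-request proxy through RequestConfig; a minimal sketch, with a placeholder host and port:

// assumes: import org.apache.http.HttpHost;
//          import org.apache.http.client.config.RequestConfig;
HttpHost proxy = new HttpHost("127.0.0.1", 8080); // placeholder proxy address
RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
httpPost.setConfig(config); // attach the proxy to the existing HttpPost request

For crawl-rate control, the Thread.sleep calls above are the crudest form; a scheduler such as ScheduledExecutorService would be the cleaner replacement.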
