Hi_Amos
坚持每天都在进步!!

以下内容仅供学习交流使用,请勿做他用,否则后果自负。

 

一.什么是HttpClient?

HTTP 协议可能是现在 Internet 上使用得最多、最重要的协议了,越来越多的 Java 应用程序需要直接通过 HTTP 协议来访问网络资源。虽然在 JDK 的 java net包中已经提供了访问 HTTP 协议的基本功能,但是对于大部分应用程序来说,JDK 库本身提供的功能还不够丰富和灵活。HttpClient 是 Apache Jakarta Common 下的子项目,用来提供高效的、最新的、功能丰富的支持 HTTP 协议的客户端编程工具包,并且它支持 HTTP 协议最新的版本和建议。HttpClient 已经应用在很多的项目中,比如 Apache Jakarta 上很著名的另外两个开源项目 Cactus 和 HTMLUnit 都使用了 HttpClient。现在HttpClient最新版本为 HttpClient 4.3.4(2014-06-22).

-----引自百度百科

简单的说,HttpClient就是一个Apache的一个对于Http封装的一个jar包.

下面将介绍使用GET/POST请求,登录中国联通网站并抓取用户的基本信息和账单数据.

二.新建一个maven项目httpclient

我这里的环境是jdk1.7+Intelij idea 13.0+ubuntu12.04+maven+HttpClient 4.3.4 .下面首先建一个maven项目:

如图所示,选择quickstart

然后next下去即可.

建好项目后,如下图所示:

 

双击pom.xml文件并添加所需要的jar包:

    <dependency>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
          <version>4.3.4</version>
      </dependency>

maven会自动将需要的其它jar包下载好,实际上所需要的jar包如下图所示:

 

三.登录中国联通并抓取数据

1.使用Get模拟登录,抓取每月账单数据

中国联通有两种登录方式:

 

 

 

上面两图的区别一个是带验证码,一个是不带验证码,下面将先解决不带验证码的登录.

package com.amos;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

/**
 * @author amosli
 * 登录并抓取中国联通数据
 */

public class LoginChinaUnicom {
    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {

        String name = "中国联通手机号码";
        String pwd = "手机服务密码";

        String url = "https://uac.10010.com/portal/Service/MallLogin?callback=jQuery17202691898950318097_1403425938090&redirectURL=http%3A%2F%2Fwww.10010.com&userName=" + name + "&password=" + pwd + "&pwdType=01&productType=01&redirectType=01&rememberMe=1";

        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse loginResponse = httpClient.execute(httpGet);

        if (loginResponse.getStatusLine().getStatusCode() == 200) {
            for (Header head : loginResponse.getAllHeaders()) {
                System.out.println(head);
            }
            HttpEntity loginEntity = loginResponse.getEntity();
            String loginEntityContent = EntityUtils.toString(loginEntity);
            System.out.println("登录状态:" + loginEntityContent);
            //如果登录成功
            if (loginEntityContent.contains("resultCode:\"0000\"")) {

                //月份
                String months[] = new String[]{"201401", "201402", "201403", "201404", "201405"};

                for (String month : months) {
                    String billurl = "http://iservice.10010.com/ehallService/static/historyBiil/execute/YH102010002/QUERY_YH102010002.processData/QueryYH102010002_Data/" + month + "/undefined";

                    HttpPost httpPost = new HttpPost(billurl);
                    HttpResponse billresponse = httpClient.execute(httpPost);

                    if (billresponse.getStatusLine().getStatusCode() == 200) {
                        saveToLocal(billresponse.getEntity(), "chinaunicom.bill." + month + ".2.html");
                    }
                }
            }
        }

    }

找到要登录的url以及要传的参数,这里手机号码服务密码这里就不提供了.

new一个DefaultHttpClient,然后使用Get方式发出请求,如果登录成功,其返回代码是0000.

再用HttpPost方式将返回值写到本地.

/**
     * 写文件到本地
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {

        try {

            File dir = new File("/home/amosli/workspace/chinaunicom/");
            if (!dir.isDirectory()) {
                dir.mkdir();
            }

            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();

            if (!file.exists()) {
                file.createNewFile();
            }
            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

这里如果只是想输出一下可以使用EntityUtils.toString(HttpEntity entity)方法,其源码如下:

 public static String toString(
            final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
        Args.notNull(entity, "Entity");
        final InputStream instream = entity.getContent();
        if (instream == null) {
            return null;
        }
        try {
            Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                    "HTTP entity too large to be buffered in memory");
            int i = (int)entity.getContentLength();
            if (i < 0) {
                i = 4096;
            }
            Charset charset = null;
            try {
                final ContentType contentType = ContentType.get(entity);
                if (contentType != null) {
                    charset = contentType.getCharset();
                }
            } catch (final UnsupportedCharsetException ex) {
                throw new UnsupportedEncodingException(ex.getMessage());
            }
            if (charset == null) {
                charset = defaultCharset;
            }
            if (charset == null) {
                charset = HTTP.DEF_CONTENT_CHARSET;
            }
            final Reader reader = new InputStreamReader(instream, charset);
            final CharArrayBuffer buffer = new CharArrayBuffer(i);
            final char[] tmp = new char[1024];
            int l;
            while((l = reader.read(tmp)) != -1) {
                buffer.append(tmp, 0, l);
            }
            return buffer.toString();
        } finally {
            instream.close();
        }
    }

这里可以发现其实现方式还是比较容易看懂的,可以指定编码,也可以不指定.

2.带验证码的登录,抓取基本信息

package com.amos;


import org.apache.http.HttpResponse;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.*;
import org.apache.http.util.EntityUtils;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Created by amosli on 14-6-22.
 */
public class LoginWithCaptcha {

    public static void main(String args[]) throws Exception {

        //生成验证码的链接
        String createCaptchaUrl = "http://uac.10010.com/portal/Service/CreateImage";
        HttpClient httpClient = new DefaultHttpClient();

        String name = "中国联通手机号码";
        String pwd = "手机服务密码";

        //这里可自定义所需要的cookie
        CookieStore cookieStore = new BasicCookieStore();

        CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();

        //get captcha,获取验证码
        HttpGet captchaHttpGet = new HttpGet(createCaptchaUrl);
        HttpResponse capthcaResponse = httpClient.execute(captchaHttpGet);

        if (capthcaResponse.getStatusLine().getStatusCode() == 200) {
            //将验证码写入本地
            LoginChinaUnicom.saveToLocal(capthcaResponse.getEntity(), "chinaunicom.capthca." + System.currentTimeMillis());
        }


        //手工输入验证码并验证
        HttpResponse verifyResponse = null;
        String capthca = null;
        String uvc = null;

        do {
            //输入验证码,读入键盘输入
            //1)
            InputStream inputStream = System.in;
            BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
            System.out.println("请输入验证码:");
            capthca = bufferedReader.readLine();

            //2)
            //Scanner scanner = new Scanner(System.in);
            //capthca = scanner.next();

            String verifyCaptchaUrl = "http://uac.10010.com/portal/Service/CtaIdyChk?verifyCode=" + capthca + "&verifyType=1";
            HttpGet verifyCapthcaGet = new HttpGet(verifyCaptchaUrl);
            verifyResponse = httpClient.execute(verifyCapthcaGet);
            AbstractHttpClient abstractHttpClient = (AbstractHttpClient) httpClient;
            for (Cookie cookie : abstractHttpClient.getCookieStore().getCookies()) {
                System.out.println(cookie.getName() + ":" + cookie.getValue());
                if (cookie.getName().equals("uacverifykey")) {
                    uvc = cookie.getValue();
                }
            }
        } while (!EntityUtils.toString(verifyResponse.getEntity()).contains("true"));

        //登录
        String loginurl = "https://uac.10010.com/portal/Service/MallLogin?userName=" + name + "&password=" + pwd + "&pwdType=01&productType=01&verifyCode=" + capthca + "&redirectType=03&uvc=" + uvc;
        HttpGet loginGet = new HttpGet(loginurl);
        CloseableHttpResponse loginResponse = httpclient.execute(loginGet);
        System.out.print("loginResponse:" + EntityUtils.toString(loginResponse.getEntity()));

        //抓取基本信息数据
        HttpPost basicHttpGet = new HttpPost("http://iservice.10010.com/ehallService/static/acctBalance/execute/YH102010005/QUERY_AcctBalance.processData/Result");
        LoginChinaUnicom.saveToLocal(httpclient.execute(basicHttpGet).getEntity(), "chinaunicom.basic.html");

    }


}

这里有两个难点,一是验证码,二uvc码;

验证码,这里将其写到本地,然后人工输入,这个还比较好解决.

uvc码,很重要,这个是在cookie里的,httpclient操作cookie的方法网上找了很久都没有找到,后来看其源码才看到.

 

3.效果图

账单数据(这里是json格式的数据,可能不太方便查看):

  

 4.本文源码

https://github.com/amosli/crawl/tree/httpclient

 

posted on 2014-06-23 00:43  Hi_Amos  阅读(6845)  评论(6编辑  收藏  举报