Fetching a Web Page with HttpClient: Verifying a 301 Redirect

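While building a simple web tool that checks online whether a 301 permanent redirect has been set up correctly, I ran into a few problems. My notes follow: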

Getting the status code with DefaultHttpClient. The code snippet:

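import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

DefaultHttpClient client = new DefaultHttpClient();
// Issue a GET request
HttpGet httpget = new HttpGet("http://sxrczx.com");
// Execute the request and print the status code
try {
    HttpResponse response = client.execute(httpget);
    System.out.println("httpclient " + response.getStatusLine().getStatusCode());
} catch (ClientProtocolException e1) {
    e1.printStackTrace();
} catch (IOException e1) {
    e1.printStackTrace();
}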

The problem:

As long as the resource is reachable and the 301 redirect is configured correctly, the status code returned by response.getStatusLine().getStatusCode() is always 200.

Cause analysis:

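A status of 200 shows that HttpClient handled the request that follows the 301 for us: it followed the redirect and reported the status of the final response. Our goal, however, is only the redirect status code itself, so we do not want HttpClient to follow the redirect on our behalf.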
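The effect described above can be reproduced without any external dependency. The following is only a sketch under stated assumptions: it uses the JDK's HttpURLConnection (which also follows redirects by default) instead of HttpClient, and a throwaway local server whose /old and /new paths are hypothetical names made up for this illustration. It records which paths the server was asked for, so the hidden second request becomes visible.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class AutoFollowDemo {

    /** Paths requested from the local test server, in arrival order. */
    static final List<String> requestedPaths = new CopyOnWriteArrayList<>();

    /** Starts a local server: /old answers 301 with Location /new, /new answers 200. */
    static HttpServer startServer() {
        try {
            // Port 0 lets the operating system pick a free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/old", exchange -> {
                requestedPaths.add("/old");
                exchange.getResponseHeaders().add("Location", "/new");
                exchange.sendResponseHeaders(301, -1); // -1 means no response body
                exchange.close();
            });
            server.createContext("/new", exchange -> {
                requestedPaths.add("/new");
                exchange.sendResponseHeaders(200, -1);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Requests /old with default settings, i.e. with automatic redirect handling. */
    static int statusWithDefaults(int port) {
        try {
            URL url = new URL("http://127.0.0.1:" + port + "/old");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            int status = conn.getResponseCode();
            conn.disconnect();
            return status;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            int status = statusWithDefaults(server.getAddress().getPort());
            System.out.println("status = " + status + ", server saw " + requestedPaths);
        } finally {
            server.stop(0);
        }
    }
}
```

With the default settings the client should report 200 while the server sees two requests, one for /old and one for /new: the 301 was answered and followed before the caller ever saw a status code.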

Solution:

Looking at the DefaultHttpClient source, it inherits from AbstractHttpClient the ability to set the client's parameters:

public synchronized void setParams(HttpParams params) {
    defaultParams = params;
}

This lets us override the client's defaultParams. But which of the client's parameters has to do with redirects? defaultParams is held through the HttpParams interface:

// Reference typed as the interface
private HttpParams defaultParams;

At first this looks like a dead end: HttpParams is in effect a collection of String-to-Object key-value pairs, so how do we find out which key names exist?

Checking the API docs, there are 2 interfaces and 4 implementing classes related to params:

In AllClientPNames and ClientPNames we finally find the relevant parameter names:

    public static final String CONNECTION_MANAGER_FACTORY_CLASS_NAME = "http.connection-manager.factory-class-name";
    public static final String HANDLE_REDIRECTS = "http.protocol.handle-redirects";
    public static final String REJECT_RELATIVE_REDIRECT = "http.protocol.reject-relative-redirect";
    public static final String MAX_REDIRECTS = "http.protocol.max-redirects";
    public static final String ALLOW_CIRCULAR_REDIRECTS = "http.protocol.allow-circular-redirects";
    public static final String HANDLE_AUTHENTICATION = "http.protocol.handle-authentication";
    public static final String COOKIE_POLICY = "http.protocol.cookie-policy";
    public static final String VIRTUAL_HOST = "http.virtual-host";
    public static final String DEFAULT_HEADERS = "http.default-headers";
    public static final String DEFAULT_HOST = "http.default-host";
    public static final String CONN_MANAGER_TIMEOUT = "http.conn-manager.timeout";

The API docs describe HANDLE_REDIRECTS as follows:

ClientPNames.HANDLE_REDIRECTS='http.protocol.handle-redirects':  defines whether redirects should be handled automatically. This parameter expects a value of type java.lang.Boolean. If this parameter is not set HttpClient will handle redirects automatically.

It takes a Boolean value and, according to the last sentence, redirects are handled automatically when the parameter is not set.

That makes things easy. Change the code to:

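import org.apache.http.client.params.AllClientPNames;
import org.apache.http.params.HttpParams;

DefaultHttpClient client = new DefaultHttpClient();

// Issue a GET request; httpurl holds the address to check
String httpurl = "http://sxrczx.com";
HttpGet httpget = new HttpGet(httpurl);
// Tell the client not to handle redirects automatically
HttpParams params = client.getParams();
params.setParameter(AllClientPNames.HANDLE_REDIRECTS, false);

// Execute the request; a correctly configured redirect now reports 301
try {
    HttpResponse response = client.execute(httpget);
    System.out.println("httpclient " + response.getStatusLine().getStatusCode());
} catch (ClientProtocolException e1) {
    e1.printStackTrace();
} catch (IOException e1) {
    e1.printStackTrace();
}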

That's a wrap. The question above was raised by QQ group friend “懿紛孩子氣  553104594”.

As a supplement, here is how to stop java.net.HttpURLConnection from following 301 redirects automatically:

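import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

try {
    URL url = new URL("http://sxrczx.com");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Do not follow redirects on this connection
    conn.setInstanceFollowRedirects(false);
    System.out.println(conn.getResponseCode());
} catch (IOException e) {
    e.printStackTrace();
}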

The key is conn.setInstanceFollowRedirects(false), which sets instanceFollowRedirects to false for this one connection.
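For a 301 checker, the target of the redirect is usually as interesting as the status code. The sketch below is one possible way to capture both with HttpURLConnection; it is an illustration, not the original tool's code. The RedirectProbe and Result classes, the probe method, and the local test server with its /old and /new paths are all hypothetical names introduced here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

class RedirectProbe {

    /** First response of a request, captured without following any redirect. */
    static final class Result {
        final int status;
        final String location; // null when the response has no Location header

        Result(int status, String location) {
            this.status = status;
            this.location = location;
        }

        boolean isPermanentRedirect() {
            return status == HttpURLConnection.HTTP_MOVED_PERM; // 301
        }
    }

    /** Requests the address once and returns the status and Location header as received. */
    static Result probe(String address) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setInstanceFollowRedirects(false);
            Result result = new Result(conn.getResponseCode(), conn.getHeaderField("Location"));
            conn.disconnect();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Local stand-in for a site with a 301 configured: /old redirects to /new. */
    static HttpServer startServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/old", exchange -> {
                exchange.getResponseHeaders().add("Location", "/new");
                exchange.sendResponseHeaders(301, -1); // -1 means no response body
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startServer();
        try {
            Result r = probe("http://127.0.0.1:" + server.getAddress().getPort() + "/old");
            System.out.println(r.status + " -> " + r.location
                    + " (permanent redirect: " + r.isPermanentRedirect() + ")");
        } finally {
            server.stop(0);
        }
    }
}
```

Against a real site, probe("http://sxrczx.com") would be the equivalent call. The Location value is returned exactly as the server sent it, so a relative value such as /new is not resolved against the request URL here.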

Source code for reference:

    /* do we automatically follow redirects? The default is true. */
    private static boolean followRedirects = true;

    /**
     * If <code>true</code>, the protocol will automatically follow redirects.
     * If <code>false</code>, the protocol will not automatically follow 
     * redirects.
     * <p>
     * This field is set by the <code>setInstanceFollowRedirects</code> 
     * method. Its value is returned by the <code>getInstanceFollowRedirects</code> 
     * method.
     * <p>
     * Its default value is based on the value of the static followRedirects 
     * at HttpURLConnection construction time.
     *
     * @see     java.net.HttpURLConnection#setInstanceFollowRedirects(boolean)
     * @see     java.net.HttpURLConnection#getInstanceFollowRedirects()
     * @see     java.net.HttpURLConnection#setFollowRedirects(boolean)
     */
    protected boolean instanceFollowRedirects = followRedirects;

    /**
     * Sets whether HTTP redirects  (requests with response code 3xx) should 
     * be automatically followed by this class.  True by default.  Applets
     * cannot change this variable.
     * <p>
     * If there is a security manager, this method first calls
     * the security manager's <code>checkSetFactory</code> method 
     * to ensure the operation is allowed. 
     * This could result in a SecurityException.
     * 
     * @param set a <code>boolean</code> indicating whether or not
     * to follow HTTP redirects.
     * @exception  SecurityException  if a security manager exists and its  
     *             <code>checkSetFactory</code> method doesn't 
     *             allow the operation.
     * @see        SecurityManager#checkSetFactory
     * @see #getFollowRedirects()
     */
    public static void setFollowRedirects(boolean set) {
        SecurityManager sec = System.getSecurityManager();
        if (sec != null) {
            // seems to be the best check here...
            sec.checkSetFactory();
        }
        followRedirects = set;
    }

    /**
     * Sets whether HTTP redirects (requests with response code 3xx) should
     * be automatically followed by this <code>HttpURLConnection</code> 
     * instance.
     * <p>
     * The default value comes from followRedirects, which defaults to
     * true.
     *
     * @param followRedirects a <code>boolean</code> indicating 
     * whether or not to follow HTTP redirects.
     *
     * @see    java.net.HttpURLConnection#instanceFollowRedirects
     * @see #getInstanceFollowRedirects
     * @since 1.3
     */
    public void setInstanceFollowRedirects(boolean followRedirects) {
        instanceFollowRedirects = followRedirects;
    }
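The source above distinguishes a static, JVM-wide default (setFollowRedirects) from a per-connection flag (setInstanceFollowRedirects) that is copied from the static value when the connection object is constructed. The following sketch illustrates that relationship; the class and method names are hypothetical, and it relies on openConnection() only creating the connection object without contacting the server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

class FollowRedirectsDefaults {

    /** Creates a connection object; no network traffic happens at this point. */
    static HttpURLConnection open(String address) {
        try {
            return (HttpURLConnection) new URL(address).openConnection();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Sets the static default, creates a connection, restores the previous
     * static default, and returns the instance flag the connection started with.
     */
    static boolean instanceDefaultWhenStaticIs(boolean staticValue) {
        boolean previous = HttpURLConnection.getFollowRedirects();
        try {
            HttpURLConnection.setFollowRedirects(staticValue);
            HttpURLConnection conn = open("http://sxrczx.com");
            return conn.getInstanceFollowRedirects();
        } finally {
            // The static flag affects every later connection in the JVM, so put it back.
            HttpURLConnection.setFollowRedirects(previous);
        }
    }

    public static void main(String[] args) {
        System.out.println("static=false -> instance=" + instanceDefaultWhenStaticIs(false));
        System.out.println("static=true  -> instance=" + instanceDefaultWhenStaticIs(true));

        // The per-instance setter changes one connection and leaves the static default alone.
        HttpURLConnection conn = open("http://sxrczx.com");
        conn.setInstanceFollowRedirects(false);
        System.out.println("instance=" + conn.getInstanceFollowRedirects()
                + ", static=" + HttpURLConnection.getFollowRedirects());
    }
}
```

For a tool that checks a single URL, the per-instance setter is probably the safer choice, since changing the static default silently alters the behaviour of every other HttpURLConnection created afterwards in the same JVM.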
posted @ 2015-06-07 13:06  cornucopia