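Fetching a web page with HttpClient: verifying a 301 redirect
While building a simple web tool that checks online whether a permanent (301) redirect has been set up correctly, I ran into a few problems. Notes follow.
Getting the status code with DefaultHttpClient; the code snippet: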
DefaultHttpClient client = new DefaultHttpClient();
// Issue the request with GET
HttpGet httpget = new HttpGet("http://sxrczx.com");
// Execute the request
try {
    HttpResponse response = client.execute(httpget);
    System.out.println("httpclient " + response.getStatusLine().getStatusCode());
} catch (ClientProtocolException e1) {
    e1.printStackTrace();
} catch (IOException e1) {
    e1.printStackTrace();
}
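The problem:
The status code returned by response.getStatusLine().getStatusCode() is always 200 (given that the resource is reachable and the 301 redirect is configured correctly).
Cause:
A 200 status shows that HttpClient handled the 301 on our behalf and issued the follow-up request to the redirect target. Our goal, however, is only to obtain the redirect status code, so we do not want it to follow the redirect for us.
Solution:
Looking at the DefaultHttpClient source: it extends AbstractHttpClient and thereby inherits the ability to set client parameters: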
public synchronized void setParams(HttpParams params) { defaultParams = params; }
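This lets us override the client's defaultParams. But which of the client's parameters relates to redirects? defaultParams is referenced through the HttpParams interface:
// referenced through the interface
private HttpParams defaultParams;
That is a dead end: HttpParams is essentially a set of String-to-Object name/value pairs, so how do we find out which names exist?
A look at the API shows that two interfaces and four classes are related to params:
In AllClientPNames and ClientPNames I finally found the relevant parameter names: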
public static final String CONNECTION_MANAGER_FACTORY_CLASS_NAME = "http.connection-manager.factory-class-name";
public static final String HANDLE_REDIRECTS = "http.protocol.handle-redirects";
public static final String REJECT_RELATIVE_REDIRECT = "http.protocol.reject-relative-redirect";
public static final String MAX_REDIRECTS = "http.protocol.max-redirects";
public static final String ALLOW_CIRCULAR_REDIRECTS = "http.protocol.allow-circular-redirects";
public static final String HANDLE_AUTHENTICATION = "http.protocol.handle-authentication";
public static final String COOKIE_POLICY = "http.protocol.cookie-policy";
public static final String VIRTUAL_HOST = "http.virtual-host";
public static final String DEFAULT_HEADERS = "http.default-headers";
public static final String DEFAULT_HOST = "http.default-host";
public static final String CONN_MANAGER_TIMEOUT = "http.conn-manager.timeout";
The API describes HANDLE_REDIRECTS as follows:
ClientPNames.HANDLE_REDIRECTS='http.protocol.handle-redirects': defines whether redirects should be handled automatically. This parameter expects a value of type java.lang.Boolean. If this parameter is not set HttpClient will handle redirects automatically.
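It takes a Boolean value, and according to the last sentence, redirects are handled automatically when the parameter is not set.
That settles it. Change the code to: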
DefaultHttpClient client = new DefaultHttpClient();
// Issue the request with GET (httpurl is the URL to check)
HttpGet httpget = new HttpGet(httpurl);
// Turn off automatic redirect handling so the 301 itself is returned
HttpParams params = client.getParams();
params.setParameter(AllClientPNames.HANDLE_REDIRECTS, false);
// Execute the request
try {
    HttpResponse response = client.execute(httpget);
    System.out.println("httpclient " + response.getStatusLine().getStatusCode());
} catch (ClientProtocolException e1) {
    e1.printStackTrace();
} catch (IOException e1) {
    e1.printStackTrace();
}
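That wraps it up. The question above was raised by QQ group member “懿紛孩子氣 553104594”.
As a supplement, here is how to stop java.net.HttpURLConnection from following a 301 redirect, which it does by default: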
try {
    URL url = new URL("http://sxrczx.com");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setInstanceFollowRedirects(false);
    System.out.println(conn.getResponseCode());
} catch (IOException e) {
    e.printStackTrace();
}
The key line is conn.setInstanceFollowRedirects(false); which sets instanceFollowRedirects to false for this connection.
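To see the effect without depending on a real site, the following is a minimal sketch that starts a throwaway local server with the JDK's built-in com.sun.net.httpserver.HttpServer and requests the same URL twice, once following redirects and once not. The class name RedirectProbe, the helper names statusOf and demo, and the /old and /new paths are all made up for this illustration; they are not part of the original tool.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RedirectProbe {

    /** Sends a GET to the address and returns the status code of the final response seen. */
    static int statusOf(String address, boolean follow) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setInstanceFollowRedirects(follow);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns { status when following redirects, status when not following }. */
    static int[] demo() {
        try {
            // Port 0 lets the operating system pick any free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            // Hypothetical paths: /old answers 301 and points at /new, /new answers 200.
            server.createContext("/old", ex -> {
                String target = "http://127.0.0.1:" + ex.getLocalAddress().getPort() + "/new";
                ex.getResponseHeaders().add("Location", target);
                ex.sendResponseHeaders(301, -1); // -1 means: no response body
                ex.close();
            });
            server.createContext("/new", ex -> {
                byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
                ex.sendResponseHeaders(200, body.length);
                ex.getResponseBody().write(body);
                ex.close();
            });
            server.start();
            try {
                String old = "http://127.0.0.1:" + server.getAddress().getPort() + "/old";
                return new int[] { statusOf(old, true), statusOf(old, false) };
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] codes = demo();
        System.out.println("following redirects:     " + codes[0]);
        System.out.println("not following redirects: " + codes[1]);
    }
}
```

With redirects followed, the caller only sees the status of the redirect target. With them turned off, the 301 itself comes back, which is what a redirect checker needs.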
Source code for reference:
/* do we automatically follow redirects? The default is true. */
private static boolean followRedirects = true;
/**
* If <code>true</code>, the protocol will automatically follow redirects.
* If <code>false</code>, the protocol will not automatically follow
* redirects.
* <p>
* This field is set by the <code>setInstanceFollowRedirects</code>
* method. Its value is returned by the <code>getInstanceFollowRedirects</code>
* method.
* <p>
* Its default value is based on the value of the static followRedirects
* at HttpURLConnection construction time.
*
* @see java.net.HttpURLConnection#setInstanceFollowRedirects(boolean)
* @see java.net.HttpURLConnection#getInstanceFollowRedirects()
* @see java.net.HttpURLConnection#setFollowRedirects(boolean)
*/
protected boolean instanceFollowRedirects = followRedirects;
/**
* Sets whether HTTP redirects (requests with response code 3xx) should
* be automatically followed by this class. True by default. Applets
* cannot change this variable.
* <p>
* If there is a security manager, this method first calls
* the security manager's <code>checkSetFactory</code> method
* to ensure the operation is allowed.
* This could result in a SecurityException.
*
* @param set a <code>boolean</code> indicating whether or not
* to follow HTTP redirects.
* @exception SecurityException if a security manager exists and its
* <code>checkSetFactory</code> method doesn't
* allow the operation.
* @see SecurityManager#checkSetFactory
* @see #getFollowRedirects()
*/
public static void setFollowRedirects(boolean set) {
SecurityManager sec = System.getSecurityManager();
if (sec != null) {
// seems to be the best check here...
sec.checkSetFactory();
}
followRedirects = set;
}
/**
* Sets whether HTTP redirects (requests with response code 3xx) should
* be automatically followed by this <code>HttpURLConnection</code>
* instance.
* <p>
* The default value comes from followRedirects, which defaults to
* true.
*
* @param followRedirects a <code>boolean</code> indicating
* whether or not to follow HTTP redirects.
*
* @see java.net.HttpURLConnection#instanceFollowRedirects
* @see #getInstanceFollowRedirects
* @since 1.3
*/
public void setInstanceFollowRedirects(boolean followRedirects) {
instanceFollowRedirects = followRedirects;
}
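The source above implies that the static followRedirects flag is only read when a connection object is constructed, while setInstanceFollowRedirects changes a single connection. Below is a small sketch of that behaviour. The class name FollowRedirectDefaults, the helper names, and the host example.invalid are placeholders chosen for this illustration, and no network traffic is sent because the connections are never opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FollowRedirectDefaults {

    // Creates a connection object without connecting; the host name is a placeholder.
    static HttpURLConnection open() {
        try {
            return (HttpURLConnection) new URL("http://example.invalid/").openConnection();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns { older instance flag, newer instance flag, newer flag after override }. */
    static boolean[] demo() {
        boolean original = HttpURLConnection.getFollowRedirects();
        try {
            HttpURLConnection.setFollowRedirects(true);
            HttpURLConnection older = open(); // copies the static default: true

            HttpURLConnection.setFollowRedirects(false);
            HttpURLConnection newer = open(); // copies the static default: false

            // Changing the static flag afterwards does not touch existing instances.
            boolean olderFlag = older.getInstanceFollowRedirects();
            boolean newerFlag = newer.getInstanceFollowRedirects();

            // The per-instance setter overrides whatever was copied at construction.
            newer.setInstanceFollowRedirects(true);
            boolean overridden = newer.getInstanceFollowRedirects();

            return new boolean[] { olderFlag, newerFlag, overridden };
        } finally {
            // Restore the JVM-wide default so other code is unaffected.
            HttpURLConnection.setFollowRedirects(original);
        }
    }

    public static void main(String[] args) {
        boolean[] flags = demo();
        System.out.println("created while static default was true:  " + flags[0]);
        System.out.println("created while static default was false: " + flags[1]);
        System.out.println("after setInstanceFollowRedirects(true): " + flags[2]);
    }
}
```

Because setFollowRedirects changes a JVM-wide default, the per-instance setter is usually the safer choice for a tool that only wants to inspect one URL.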