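I have a website content scraper that had always worked fine, but for the past few days it has returned a 403 error every time it scrapes one particular site. The error message is:
"System.Net.WebException: The remote server returned an error: (403) Forbidden."
I suspected the site had changed something to block scraping. The scraper originally used WebClient. I tried adding some headers to the WebClient, but that made no difference. Out of frustration I switched to HttpWebRequest, and the page could then be fetched normally.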
The code before the change:
using (WebClient webc = new WebClient())
{
    // Take the host from the URL and send the site's own root as the Referer.
    // Uri.Host avoids the exception the manual Substring threw when the URL had no "/" after the host.
    string host = new Uri(url).Host;
    webc.Headers.Add(HttpRequestHeader.Referer, "http://" + host);
    // Download the raw bytes and decode them with the page's encoding (Enc).
    byte[] bstr = webc.DownloadData(new Uri(url));
    return Enc.GetString(bstr);
}
The code that fetches the page correctly after switching to HttpWebRequest:
HttpWebResponse res = null;
string strResult = "";
try
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    //req.Method = "POST";
    req.KeepAlive = true;
    req.ContentType = "application/x-www-form-urlencoded";
    // Send the Accept and User-Agent headers a browser would send.
    req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    req.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.2; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8";
    res = (HttpWebResponse)req.GetResponse();
    // Decode the response body with the page's encoding (Enc); dispose the reader when done.
    using (StreamReader reader = new StreamReader(res.GetResponseStream(), Enc))
    {
        strResult = reader.ReadToEnd();
    }
}
catch (Exception ex)
{
    // Log the failure; the method then returns an empty string.
    WWSoft.Often.LogClass.Write("PuBll.GetPageHtml(" + url + ")", "Fetch page content", ex);
}
finally
{
    if (res != null)
    {
        res.Close();
    }
}
return strResult;
As for why this makes a difference, I don't know.
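One plausible explanation, not verified against this particular site: WebClient sends no User-Agent header by default, and the WebClient code above only added a Referer, whereas the HttpWebRequest code sets both User-Agent and Accept. Sites that block scrapers often reject requests without a browser-like User-Agent. If that is the cause here, the WebClient version may work once it sends the same headers. The sketch below shows the idea; `PageFetcher`, `CreateBrowserLikeClient` and `BrowserUserAgent` are hypothetical names introduced for illustration, not part of the original scraper.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper for illustration; not part of the original scraper.
public static class PageFetcher
{
    // Same User-Agent string as in the working HttpWebRequest code.
    public const string BrowserUserAgent =
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8";

    // Builds a WebClient that sends browser-like request headers.
    public static WebClient CreateBrowserLikeClient(string url)
    {
        WebClient client = new WebClient();
        client.Headers.Add(HttpRequestHeader.UserAgent, BrowserUserAgent);
        client.Headers.Add(HttpRequestHeader.Accept,
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        // Referer is the scheme and host of the target URL, e.g. "http://example.com".
        client.Headers.Add(HttpRequestHeader.Referer,
            new Uri(url).GetLeftPart(UriPartial.Authority));
        return client;
    }

    // Downloads the page and decodes it with the given encoding.
    public static string GetPageHtml(string url, Encoding enc)
    {
        using (WebClient client = CreateBrowserLikeClient(url))
        {
            byte[] raw = client.DownloadData(new Uri(url));
            return enc.GetString(raw);
        }
    }
}
```

A call such as `PageFetcher.GetPageHtml(url, Enc)` would take the place of the original WebClient block. If the site still answers 403 with these headers, it is probably checking something else (cookies or request rate, for example), and this explanation does not apply.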