While writing a spider (web crawler) in Java, I put together a very simple HTML parser.
The main code is:
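// Removes every region that starts at Token1 and ends just after Token2
// (case-insensitive), for example everything from "script" through "/script".
public String parserForLan(String str, String Token1, String Token2)
{
    StringBuilder parseStr = new StringBuilder();
    int i = 0; // current scan position
    int j = 0; // start of the text that has not been copied yet
    while (i + Token1.length() <= str.length())
    {
        if (str.regionMatches(true, i, Token1, 0, Token1.length()))
        {
            // Copy the text before Token1, then skip ahead to just past Token2.
            // The token lengths replace the hard-coded 6 and 7.
            parseStr.append(str, j, i);
            i = i + Token1.length();
            while (i < str.length() && !str.regionMatches(true, i, Token2, 0, Token2.length()))
            {
                i++;
            }
            // If Token2 never appears, the rest of the input is dropped.
            i = Math.min(i + Token2.length(), str.length());
            j = i;
        }
        else
        {
            i++;
        }
    }
    // Copy whatever follows the last removed region, up to the end of the input.
    parseStr.append(str, j, str.length());
    return parseStr.toString();
}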
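// Strips every region from chF through chB (for example '<' ... '>') and returns
// the remaining text segments, each followed by a space. Text after the last
// region is not returned here; its bounds are saved in ForJ and ForI so that
// parser1 can append it. ForI and ForJ are assumed to be int fields of the
// enclosing class, which is not shown in this excerpt.
public String parser(String str, char chF, char chB)
{
    int i = 0; // current scan position
    int j = 0; // start of the current text segment
    StringBuilder parseStr = new StringBuilder();
    while (i < str.length())
    {
        if (str.charAt(i) == chF)
        {
            int close = str.indexOf(chB, i + 1);
            if (close < 0)
            {
                // No closing chB: keep the rest as plain trailing text.
                j = i;
                i = str.length();
                break;
            }
            // Skip the whole chF ... chB region.
            i = close + 1;
            j = i;
        }
        // Advance to the next chF; the text in front of it is kept.
        while (i < str.length() && str.charAt(i) != chF)
        {
            i++;
        }
        if (i < str.length())
        {
            parseStr.append(str, j, i).append(" ");
        }
    }
    ForI = i;
    ForJ = j;
    return parseStr.toString();
}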
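// Same as parser, but also appends the text that follows the last chF ... chB region.
public String parser1(String str, char chF, char chB)
{
    // Pass the caller's delimiters through instead of the hard-coded '&' and ';'.
    String parseStr = parser(str, chF, chB);
    parseStr = parseStr + str.substring(ForJ, ForI);
    return parseStr;
}
// Full pipeline: drop script blocks, then tags, then character entities such as &nbsp;.
public String allParser(String str)
{
    String BeParseStr = parserForLan(str, "script", "/script");
    // parser1 rather than parser, so that text after the last '>' is not lost.
    BeParseStr = parser1(BeParseStr, '<', '>');
    BeParseStr = parser1(BeParseStr, '&', ';');
    return BeParseStr;
}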
It works fine on ordinary pages, but it fails on source like the following:
<img src="http://www.sjzdaily.com.cn/tplimg/xscj0021.gif" border="0" onload="if(this.width>screen.width*0.7) {this.resized=true; this.width=screen.width*0.7; this.alt='Click here to open new window\nCTRL+Mouse wheel to zoom in/out';}" onmouseover="if(this.width>screen.width*0.7) {this.resized=true; this.width=screen.width*0.7; this.style.cursor='hand'; this.alt='Click here to open new window\nCTRL+Mouse wheel to zoom in/out';}" onclick="if(!this.resized) {return true;} else {window.open('http://www.sjzdaily.com.cn/tplimg/xscj0021.gif');}" onmousewheel="return imgzoom(this);">
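Because the attribute values contain many ">" characters (for example in this.width>screen.width*0.7), simply deleting everything between a "<" and the next ">" no longer works: the tag is cut off at the first ">" inside an attribute and the rest of it leaks into the output as text.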
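One direction that might help, offered only as a rough sketch: track whether the scanner is inside a quoted attribute value, and treat ">" as the end of the tag only when it is outside quotes. The class name QuoteAwareTagStripper and the method stripTags below are hypothetical and not part of the code above. The sketch assumes attribute quotes are balanced, and it does not handle comments, CDATA, or a bare "<" in ordinary text.

```java
final class QuoteAwareTagStripper {

    // Removes <...> tags, treating '>' inside a quoted attribute value as ordinary text.
    // Hypothetical helper: assumes quotes inside tags are balanced.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false; // true between '<' and the matching unquoted '>'
        char quote = 0;        // the open quote character inside a tag, or 0 if none
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    inTag = true;
                } else {
                    out.append(c);
                }
            } else if (quote != 0) {
                if (c == quote) {
                    quote = 0; // closing quote: the attribute value ends here
                }
            } else if (c == '"' || c == '\'') {
                quote = c;     // opening quote: ignore '>' until it closes
            } else if (c == '>') {
                inTag = false;
                out.append(' '); // emit a space in place of the removed tag
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "before<img src=\"a.gif\" "
                + "onload=\"if(this.width>screen.width*0.7) {this.resized=true;}\">after";
        System.out.println(stripTags(html)); // prints "before after"
    }
}
```

A state flag like this is probably the smallest change that gets past the example above. For pages with broken quoting, a dedicated HTML parser library would likely be more robust than any hand-written scanner.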
What are your thoughts on this? QQ: 185415255