Expanding a URL into a full absolute URL in PHP
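I have recently been writing a scraper in PHP. Because it has to cope with many different source sites, the URLs of images, hyperlinks, attachments and so on inside the scraped article content come in many forms, for example: (1) http://www.xxx.com/123/456.html; (2) 456.html; (3) /123/456.html; (4) ../456.html; (5) ../../456.html; and so on.
All of these have to be normalized to the http://... form before they are inserted into the database, which bothered me for quite a while. In the end I extracted a link-expansion method from the member methods of the Snoopy scraping class and modified it slightly to make it usable. The tidied-up code is as follows: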
1. Code for the link-expansion method:
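/**
 * Expand links into full URLs
 * @param string|array $links link(s) to expand (a single link or an array of links)
 * @param string $URI source address (the URL of the page being scraped)
 * @return string|array the expanded link(s)
 */
public static function expandLink($links, $URI){
    // Host of the source page, taken from $URI
    $host = (string) parse_url($URI, PHP_URL_HOST);
    // Drop the query string, then the trailing file name and any trailing slash,
    // which leaves the directory of the source page
    preg_match("/^[^\?]+/",$URI,$match);
    $match = preg_replace("|/[^\/\.]+\.[^\/\.]+$|","",$match[0]);
    $match = preg_replace("|/$|","",$match);
    // Site root, e.g. http://www.xxx.com
    $match_part = parse_url($match);
    $match_root = $match_part["scheme"]."://".$match_part["host"];
    // The patterns are applied to every link, in this order
    $search = array( "|^http://".preg_quote($host)."|i", // strip the source site's own root
        "|^(\/)|i",                      // root-relative link: prepend the site root
        "|^(?!https?://)(?!mailto:)|i",  // other relative link: prepend the base directory
        "|/\./|",                        // collapse "/./"
        "|/[^\/]+/\.\./|"                // collapse "dir/../" (single pass)
    );
    $replace = array( "",
        $match_root."/",
        $match."/",
        "/",
        "/"
    );
    $expandedLinks = preg_replace($search,$replace,$links);
    return $expandedLinks;
}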
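One thing to watch with the `|/[^\/]+/\.\./|` pattern: preg_replace scans the string once, so a link such as ../../456.html (case 5 above) can still contain a ".." after expansion. The sketch below shows one way to resolve nested dot segments in a post-processing step. `collapseDotSegments` is a hypothetical helper, not part of Snoopy or of the method above, and it assumes the input is already an absolute http/https URL.

```php
<?php
// Hypothetical helper (an assumption, not from Snoopy): resolve "." and ".."
// segments in the path of an absolute http(s) URL by walking the segments.
function collapseDotSegments(string $url): string
{
    // Split into "scheme://host", path, and the rest (query string / fragment)
    if (!preg_match('~^(https?://[^/?#]+)(/[^?#]*)?(.*)$~i', $url, $m) || $m[2] === '') {
        return $url; // not an absolute http(s) URL with a path: leave untouched
    }
    $out = [];
    foreach (explode('/', substr($m[2], 1)) as $segment) {
        if ($segment === '..') {
            array_pop($out);      // go up one level; extra ".." at the root are dropped
        } elseif ($segment !== '.') {
            $out[] = $segment;    // ordinary segment (empty segments are kept)
        }
    }
    return $m[1] . '/' . implode('/', $out) . $m[3];
}

$url = 'http://www.xxx.com/123/456/../../456.html';
// Single pass with the pattern from the method above: one ".." remains
echo preg_replace('|/[^\/]+/\.\./|', '/', $url), "\n"; // http://www.xxx.com/123/../456.html
// Segment walk: both ".." are resolved
echo collapseDotSegments($url), "\n";                  // http://www.xxx.com/456.html
```

Walking the segments avoids having to repeat the regex until nothing changes, and it keeps the host and the query string out of the path handling.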
2. Test code:
public function testUrl(){
    $links = array(
        '../asdasd/asdasd.html',
        'http://www.dianxun.com/asdasd/asdasd.html',
        'http://www.dia.com/asdasd/asdasd.html',
        'asdasd/asdasd.html',
        '/asdasdasd//asdasd/asd////asdasd/asdasd.html',
    );
    $URI = "http://www.dianxun.com/index.do/asdd/asdd.html";
    $exLinks = IndexAction::expandLink($links, $URI);
    dump($exLinks);
}
Output:
../asdasd/asdasd.html-->http://www.dianxun.com/index.do/asdasd/asdasd.html
http://www.dianxun.com/asdasd/asdasd.html-->http://www.dianxun.com/asdasd/asdasd.html
http://www.dia.com/asdasd/asdasd.html-->http://www.dia.com/asdasd/asdasd.html
asdasd/asdasd.html-->http://www.dianxun.com/index.do/asdd/asdasd/asdasd.html
/asdasdasd//asdasd/asd////asdasd/asdasd.html-->http://www.dianxun.com/asdasdasd//asdasd/asd////asdasd/asdasd.html
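The last result keeps the repeated slashes from the input path. If those should be collapsed as well, one possible post-processing step is sketched below. `collapseSlashes` is a hypothetical helper, not part of Snoopy or of the method above, and it assumes the only "//" worth keeping is the one right after the scheme.

```php
<?php
// Hypothetical helper (an assumption, not from Snoopy): collapse runs of "/"
// into a single "/", except directly after ":" so that "http://" survives.
// Note that it would also collapse a "//" inside a query string.
function collapseSlashes(string $url): string
{
    return preg_replace('~(?<!:)/{2,}~', '/', $url);
}

echo collapseSlashes('http://www.dianxun.com/asdasdasd//asdasd/asd////asdasd/asdasd.html'), "\n";
// http://www.dianxun.com/asdasdasd/asdasd/asd/asdasd/asdasd.html
```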