Simple spider with wget

Posted on February 2, 2012 by wangchen
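For one-off, highly customized scraping jobs, a full-blown crawler is often less efficient than a few lines of shell. Here are some ideas.
First, use a script to extract the download links and save them to a file. Then use split to cut that file into several pieces; the number of pieces depends on how many concurrent download processes you want.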
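The splitting step might look like the sketch below. It assumes the links are in links.txt, one URL per line; the helper name split_links and the choice of three processes are illustrative and not from the original post.

```shell
#!/usr/bin/env bash
# Sketch: split a link list into at most N chunks, one per download process.
# split_links is a hypothetical helper name; the input file is assumed to
# hold one newline-terminated URL per line.
split_links() {
    local file=$1 procs=$2
    local total per_chunk
    # The arithmetic expansion also strips the padding some wc builds emit.
    total=$(( $(wc -l < "$file") ))
    if [ "$total" -eq 0 ]; then
        echo "no links in $file" >&2
        return 1
    fi
    # Round up so that at most $procs chunks are produced.
    per_chunk=$(( (total + procs - 1) / procs ))
    # Writes $file.aa, $file.ab, ... next to the input file.
    split -l "$per_chunk" "$file" "$file."
}

# Usage (three download processes):
#   split_links links.txt 3
```

Rounding the chunk size up keeps the number of pieces at or below the requested process count; the last piece may simply be shorter than the others.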
Download with Wget:
nohup cat links.txt.aa |awk '{print "wget \""$0"\" --user-agent=\"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6\" -a wget.log -nv "}'| sh &
 
nohup cat links.txt.ab |awk '{print "wget \""$0"\" --user-agent=\"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6\" -a wget.log -nv "}'| sh &
 
nohup cat links.txt.ac |awk '{print "wget \""$0"\" --user-agent=\"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6\" -a wget.log -nv "}'| sh &
This spoofs the User-Agent. Spoofing cookies and the Referer header is just as easy; I will add that later.
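One possible way to send cookies and a Referer is sketched here. It is not the author's promised follow-up. Instead of the awk | sh pipeline it uses wget's -i option to read URLs from each chunk file, together with --referer and --load-cookies. The helper name start_downloads, the referer URL, and the cookies.txt file (assumed to be in the Netscape cookies.txt format that wget reads) are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: one background wget per chunk, sending a User-Agent, a Referer and
# cookies. start_downloads is a hypothetical helper; the referer URL and the
# cookie file are placeholders supplied by the caller.
UA='Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6'

start_downloads() {
    local referer=$1 cookies=$2 chunk
    shift 2
    for chunk in "$@"; do
        # -i reads the URLs from the chunk file, so no shell is needed
        # to build one command line per URL.
        nohup wget -i "$chunk" \
            --user-agent="$UA" \
            --referer="$referer" \
            --load-cookies="$cookies" \
            -a wget.log -nv > /dev/null 2>&1 &
    done
}

# Usage:
#   start_downloads 'http://a.b.com/' cookies.txt links.txt.a?
```

Each chunk still gets its own background process appending to the same wget.log, so the log format is unchanged.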
Estimate the download speed from wget.log:
Log format:
2012-02-01 17:23:32 URL:http://a.b.com/d.zip [196986/196986] -> "d.zip" [1]
2012-02-01 17:23:45 URL:http://a.b.com/e.zip [49455/49455] -> "e.zip" [1]
One-liner script:
cat ~/wget.log |grep "2012-02-02 17:..:.."|awk '{a+= substr($4, 2, index($4, "/")-2)}END{print a}'
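The one-liner sums the bytes downloaded in one chosen hour. A variant that groups the same log by minute could look like the sketch below; bytes_per_minute is a hypothetical helper name, and the script assumes the wget -nv log format shown above.

```shell
#!/usr/bin/env bash
# Sketch: total bytes downloaded per minute, from wget -nv log lines such as
#   2012-02-01 17:23:32 URL:http://a.b.com/d.zip [196986/196986] -> "d.zip" [1]
# bytes_per_minute is a hypothetical helper name.
bytes_per_minute() {
    awk '$3 ~ /^URL:/ {
        minute = $1 " " substr($2, 1, 5)     # e.g. "2012-02-01 17:23"
        split(substr($4, 2), parts, "/")     # "[196986/196986]" -> parts[1] = "196986"
        bytes[minute] += parts[1]
    }
    END { for (m in bytes) print m, bytes[m] }' "$@" | sort
}

# Usage:
#   bytes_per_minute ~/wget.log
```

On the two sample log lines above this prints 2012-02-01 17:23 246441; dividing a minute's total by 60 gives a rough average rate in bytes per second.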
Reposted on 2013-01-17 17:32 by lexus
