Data scraping (recurse two levels deep): wget -l 2 -c -k -r -np 'http://mobile.tool.la/sheng/'

Extract every phone-number prefix from the downloaded HTML pages and append it, with its province and city, to a text file:

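# Run from the directory that contains sheng/.
for file in $(find sheng -mindepth 3 -name 'index.html')
do
    # Path layout: sheng/<province>/<city>/index.html
    province=$(echo "$file" | awk -F '/' '{print $2}')
    city=$(echo "$file" | awk -F '/' '{print $3}')
    #way=sheng/$province/$city/$city.txt
    #sed 's/<[^<>]*>/\n/g' $file | grep -w '[1-9][0-9]\{6\}' > $way
    # Turn every tag into a newline, then keep the lines that hold a 7-digit word.
    for num in $(sed 's/<[^<>]*>/\n/g' "$file" | grep -w '[1-9][0-9]\{6\}')
    do
        echo "$num" "$province" "$city" >> phonefrom.txt
    done
done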
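
The tag-stripping step can be tried on its own. This is only a sketch: the sample markup is made up, extract_prefixes is a hypothetical helper name, and it assumes GNU sed (for \n in the replacement) and GNU grep. Unlike the loop above it adds -o, so only the matching digits are printed rather than the whole line.

```shell
# extract_prefixes FILE: print each standalone 7-digit number in FILE, one per line.
# Hypothetical helper; assumes GNU sed and GNU grep.
extract_prefixes() {
    sed 's/<[^<>]*>/\n/g' "$1" | grep -ow '[1-9][0-9]\{6\}'
}

# Made-up sample page, for illustration only.
sample=$(mktemp)
printf '%s\n' '<ul><li>1300000</li><li>1300001 </li><li>tel 13000021</li></ul>' > "$sample"
extract_prefixes "$sample"
rm -f "$sample"
```

The 8-digit 13000021 is skipped because -w only accepts a match that is a whole word.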

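Take the first three digits of each number, count how often each occurs, and sort by the second field (the three-digit prefix) in descending order: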

awk '{print substr($0,1,3)}' phonefrom.txt | sort -n | uniq -c | sort -rn -k2
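
A small run of the same pipeline on made-up data, assuming the "number province city" layout written by the loop earlier. count_prefixes is a hypothetical wrapper name.

```shell
# count_prefixes FILE: count 3-digit prefixes, largest prefix first.
# Hypothetical wrapper around the pipeline above.
count_prefixes() {
    awk '{print substr($0,1,3)}' "$1" | sort -n | uniq -c | sort -rn -k2
}

# Made-up sample rows, for illustration only.
sample=$(mktemp)
printf '%s\n' '1300000 beijing beijing' '1380013 guangdong shenzhen' '1300001 beijing beijing' > "$sample"
count_prefixes "$sample"
rm -f "$sample"
```

uniq -c puts the count in field 1 and the prefix in field 2, so 138 is listed before 130 here. To rank by frequency instead, sort on field 1 (sort -rn).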

(Province names, pinyin to Chinese) Split each line on the four delimiters " / > < and print fields 5 and 8:

cat index.html | grep '<dt>' | awk -F'["/><]' '{print $5,$8}'

(City names, pinyin to Chinese) Split each line on the four delimiters " / > < and print fields 5, 6 and 9:

cat index.html | grep '<dd>' | awk -F'["/><]' '{print $5,$6,$9}'
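
The field numbers depend on the exact markup of the page, so it can help to list every field with its index before choosing which ones to print. A sketch under assumptions: show_fields is a hypothetical helper, and the sample line is made up and may not match the real page.

```shell
# show_fields LINE: split LINE on " / > < and print each field with its index.
# Hypothetical helper for finding the right field numbers.
show_fields() {
    printf '%s\n' "$1" | awk -F'["/><]' '{for (i = 1; i <= NF; i++) print i ": " $i}'
}

# Made-up markup; the real page may be laid out differently.
show_fields '<dt><a href="/sheng/guangdong/">广东</a></dt>'
```

Two delimiters in a row produce an empty field, which is why the useful values land on seemingly arbitrary indexes.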

Merging multiple files with ARGIND (a gawk extension):

awk 'ARGIND==1{a[$1]=$2}ARGIND==2{b[$1]=$2}ARGIND==3{print $0,a[$2],b[$3]}' province2chinese.txt city2chinese.txt phonefrom.txt
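
Where gawk is not available, the same join can be sketched with a counter that is bumped on the first line of each input file. This swaps ARGIND for a hand-rolled counter; join_names, fileno and the sample rows are all hypothetical, and the two-column "pinyin name" layout of the mapping files is an assumption.

```shell
# join_names PROVINCE_MAP CITY_MAP PHONES: append the mapped names to each phone row.
# fileno counts input files and stands in for gawk's ARGIND.
join_names() {
    awk 'FNR == 1 { fileno++ }
         fileno == 1 { a[$1] = $2; next }   # province pinyin -> name
         fileno == 2 { b[$1] = $2; next }   # city pinyin -> name
         { print $0, a[$2], b[$3] }' "$1" "$2" "$3"
}

# Made-up sample files, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 'guangdong 广东' > "$tmp/province2chinese.txt"
printf '%s\n' 'shenzhen 深圳' > "$tmp/city2chinese.txt"
printf '%s\n' '1380013 guangdong shenzhen' > "$tmp/phonefrom.txt"
join_names "$tmp/province2chinese.txt" "$tmp/city2chinese.txt" "$tmp/phonefrom.txt"
rm -rf "$tmp"
```

An empty input file never triggers FNR == 1, so the counter falls out of step; ARGIND does not have that problem.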