[Knowledge Upwelling] sed, Further Advanced

Reposted from http://bbs.chinaunix.net/thread-1762006-1-1.html

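The author of that thread is really impressive; it took me quite a while to work through it. The post shows some clever uses of lookup tables in text processing.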

1. Labels

`b LABEL'
     Unconditionally branch to LABEL. The LABEL may be omitted, in
     which case the next cycle is started.

`t LABEL'
     Branch to LABEL only if there has been a successful `s'ubstitution
     since the last input line was read or conditional branch was taken.
     The LABEL may be omitted, in which case the next cycle is started.
`D'
     If pattern space contains no newline, start a normal new cycle as
     if the `d' command was issued. Otherwise, delete text in the
     pattern space up to the first newline, and restart cycle with the
     resultant pattern space, without reading a new line of input.

Example 1: use labels to append YES to lines that start with AA and NO to all other lines

$ cat urfile
AA
BC
AA
CB
CC
AA

$ sed '/^AA/s/$/ YES/;t;s/$/ NO/' urfile
AA YES
BC NO
AA YES
CB NO
CC NO
AA YES

$ sed '/^AA/ba;s/$/ NO/;b;:a;s/$/ YES/' urfile
AA YES
BC NO
AA YES
CB NO
CC NO
AA YES
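
The second command above can also be written with one command per -e fragment, which avoids relying on semicolons after labels. This is a minimal sketch, assuming GNU sed; the function name mark_lines and the label name yes are made up for illustration.

```shell
# Lines starting with AA jump straight to the label "yes";
# every other line gets " NO" and then ends the cycle with a bare b.
mark_lines() {
  sed -e '/^AA/b yes' \
      -e 's/$/ NO/' \
      -e 'b' \
      -e ':yes' \
      -e 's/$/ YES/'
}

printf 'AA\nBC\nAA\n' | mark_lines
```

The bare b is what keeps the NO lines from falling through into the YES substitution.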

Example 2: joining lines

$ cat urfile
114.113.144.2:
19ms
19ms
19ms
36ms
22ms
19ms
18ms
218.61.204.73:
0ms
0ms
0ms
0ms
0ms
0ms
0ms
$ sed ':a;$!N;/ms$/s/\n/ /;ta;P;D' urfile
114.113.144.2: 19ms 19ms 19ms 36ms 22ms 19ms 18ms
218.61.204.73: 0ms 0ms 0ms 0ms 0ms 0ms 0ms
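
The one-liner above is dense, so here is the same script spread over several lines with a comment per step. It is a sketch assuming GNU sed; the function name join_ms and the short sample input are made up.

```shell
join_ms() {
  sed '
    :a
    # Append the next input line, unless this is already the last one.
    $!N
    # If the pattern space now ends in ms, turn the newline into a space.
    /ms$/ s/\n/ /
    # The substitution worked: go back and try to append another line.
    ta
    # Otherwise a new header line was appended: print the finished first
    # line, delete it, and restart the cycle with the header that is left.
    P
    D
  '
}

printf 'h1:\n1ms\n2ms\nh2:\n3ms\n' | join_ms
```

P and D are used instead of p and d because only the part before the newline is finished; the header after it still has to collect its own values.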

2. Counting

Example 3: how to replace the 4th occurrence of a given string in the text

sed ':a ; N ; $!ba ;s/root/mmmm/4'
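
A small way to try this, assuming GNU sed and a made-up three-line input: the loop :a;N;$!ba collects the whole input into the pattern space, so the numeric flag 4 counts occurrences across the entire text rather than per line. The function names are hypothetical.

```shell
# The command from the text, reading standard input.
replace_4th() {
  sed ':a;N;$!ba;s/root/mmmm/4'
}

# Variant that only runs N when there is a next line, so a
# one-line input still reaches the s command.
replace_4th_safe() {
  sed -e ':a' -e '$!{N;ba' -e '}' -e 's/root/mmmm/4'
}

printf 'root a\nroot root\nb root root\n' | replace_4th
printf 'root root root root\n' | replace_4th_safe
```

The second form is worth considering because, as far as I know, GNU sed exits as soon as N finds no next line, so with a one-line input the original command would never reach the substitution.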

Replace a with b on the first line that contains a

$ cat urfile
a
a
a
a
a
a

$ sed '0,/a/{s//b/}' urfile
b
a
a
a
a
a
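
The address 0,/a/ is a GNU sed extension, and the 0 matters. A short comparison, assuming GNU sed and a made-up three-line input: with 1,/a/ the end pattern is only checked from line 2 onward, so when line 1 itself matches, the range runs on to the next matching line.

```shell
# Range that may end on line 1 itself: only the first line is changed.
first_only=$(printf 'a\na\na\n' | sed '0,/a/s/a/b/')

# Range that starts on line 1 and looks for its end from line 2 on:
# the first two lines are changed.
from_one=$(printf 'a\na\na\n' | sed '1,/a/s/a/b/')

printf '%s\n---\n%s\n' "$first_only" "$from_one"
```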

Replace a with b on the second line that contains a

$ sed '0,/a/b;s/a/b/;ta;b;:a;n;ba' urfile
a
b
a
a
a
a

Replace a with b on the third line that contains a

$ sed '0,/a/b;/a/ba;b;:a;n;s/a/b/;tb;ba;:b;n;bb' urfile
a
a
b
a
a
a

"Dot counting"
The main idea is to use sed's hold space: each time a matching line is encountered, one "." is added to the hold space, so the number of dots records how many matches have been seen so far. When the number of dots reaches the required count, the desired operation is performed. To target a later occurrence, only the number in the pattern needs to change.

$ sed '/a/{x;s/^/./;/^.\{3\}$/{x;s/a/b/;b};x}' urfile
a
a
b
a
a
a
$ sed '/a/{x;s/^/./;/^.\{4\}$/{x;s/a/b/;b};x}' urfile
a
a
a
b
a
a
$ sed '/a/{x;s/^/./;/^.\{5\}$/{x;s/a/b/;b};x}' urfile
a
a
a
a
b
a
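
Since only the number changes between the three commands, the dot-counting idea can be wrapped in a shell function that takes the occurrence number as an argument. This is a sketch assuming GNU sed; the function name replace_on_nth_match is made up, and the dot is escaped here (\.) so that the test matches literal dots only.

```shell
replace_on_nth_match() {
  n=$1
  # On every line containing a: swap in the hold space, add one dot,
  # and test whether exactly n dots have been collected.
  sed -e '/a/{' \
      -e 'x' \
      -e 's/^/./' \
      -e "/^\\.\\{$n\\}\$/{" \
      -e 'x' \
      -e 's/a/b/' \
      -e 'b' \
      -e '}' \
      -e 'x' \
      -e '}'
}

printf 'a\na\na\na\n' | replace_on_nth_match 3
```

Lines that do not contain a never touch the hold space, so they are passed through without being counted.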

3. Lookup tables

Example 4: how to get last month's number with date

$ date +%m
08

$ date +%m | sed 's/$/b12a01a02a03a04a05a06a07a08a09a10a11a12/;s/^\(..\)b.*\(..\)a\1.*/\2/'
07

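Let's look at how this command works:
1. Build a list in which the two digits to the left of each letter a are the month before the two digits to its right.
2. Use the lookup table to extract the previous month.

The pattern space initially holds:
08
After the first s command, the pattern space becomes:
08b12a01a02a03a04a05a06a07a08a09a10a11a12
Now let's focus on how the second s command works:
s/^\(..\)b.*\(..\)a\1.*/\2/
The regular expression splits the pattern space as follows:
^\(..\)            08
b.*                b12a01a02a03a04a05a06a
\(..\)             07
a\1                a08
.*                 a09a10a11a12
In short, the contents of the first group, 08, locate the a08 later in the string, and the two digits in front of it, 07, are captured by the second group and returned as \2.
This technique is called a lookup table.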
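
To check the table for months other than the current one, the same sed command can be fed fixed values instead of the output of date. A minimal sketch; the function name prev_month is made up.

```shell
# Takes a two-digit month and prints the two-digit month before it.
prev_month() {
  printf '%s\n' "$1" |
    sed 's/$/b12a01a02a03a04a05a06a07a08a09a10a11a12/;s/^\(..\)b.*\(..\)a\1.*/\2/'
}

for m in 01 08 10 12; do
  printf '%s -> %s\n' "$m" "$(prev_month "$m")"
done
```

The January case is the reason the table starts with 12: the entry 12a01 is what makes the lookup wrap around the year.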

Example 5: text processing

$ cat urfile
172.27.38.0&1=99&2=100
192.168.9.2&1=100&3=111
202.96.64.68&1=99&2=1&3=111
202.96.69.38&1=99&3=111&4=110
202.77.88.99&1=99&2=111&3=66&4=100&5=44

$ sed -r 's/&/\n1\n2\n3\n4\n5&/;:a;s/\n(.)(.*)&\1=([^&]+)/\t\3\2/;ta;s/\n./\t0/g' urfile
172.27.38.0     99      100     0       0       0
192.168.9.2     100     0       111     0       0
202.96.64.68    99      1       111     0       0
202.96.69.38    99      0       111     110     0
202.77.88.99    99      111     66      100     44

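Taking the first line of data as an example, here is how the command works (tabs are shown as spaces).
The pattern space initially holds:
172.27.38.0&1=99&2=100
After s/&/\n1\n2\n3\n4\n5&/:
172.27.38.0\n1\n2\n3\n4\n5&1=99&2=100
After s/\n(.)(.*)&\1=([^&]+)/\t\3\2/:
172.27.38.0    99\n2\n3\n4\n5&2=100
The s command succeeded, so the t command branches to label a and s/\n(.)(.*)&\1=([^&]+)/\t\3\2/ runs again:
172.27.38.0    99    100\n3\n4\n5
The s command succeeded again, so t branches to label a and s/\n(.)(.*)&\1=([^&]+)/\t\3\2/ runs once more; this time it fails and nothing is replaced.
The t command therefore does not branch, and s/\n./\t0/g runs:
172.27.38.0    99    100    0    0    0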
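In the steps above, the first and last s commands are easy to understand; the key is the one in the middle:
s/\n(.)(.*)&\1=([^&]+)/\t\3\2/
Taking its first execution as an example, the regular expression splits the pattern space as follows:
\n(.)               \n1
(.*)                \n2\n3\n4\n5
&\1=                &1=
([^&]+)             99
The digit 1 captured by the first group locates the digit 1 in the later &1=, and the number 99 after the = sign is extracted.
Before: 172.27.38.0 \n1\n2\n3\n4\n5&1=99 &2=100
After:  172.27.38.0    99\n2\n3\n4\n5 &2=100
While this regular expression is at work, the leading 172.27.38.0 and the trailing &2=100 are left untouched; only the middle part is processed.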
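
The same pattern works for any number of columns; only the placeholder list in the first s command changes. Below is a reduced three-column sketch, assuming GNU sed, with made-up input in which some keys are missing. The function name fill_slots is hypothetical.

```shell
# Three placeholders (\n1 \n2 \n3); each one is replaced by a tab plus the
# value of the matching key, and leftover placeholders become a tab plus 0.
fill_slots() {
  sed -r 's/&/\n1\n2\n3&/;:a;s/\n(.)(.*)&\1=([^&]+)/\t\3\2/;ta;s/\n./\t0/g'
}

printf '10.0.0.1&1=5&3=7\n10.0.0.2&2=8\n' | fill_slots
```

When the leftmost placeholder has no matching key, the regex simply matches at a later placeholder, which is why the value for key 3 still lands in the third column.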

Example 6: implementing wc -c

$ wc -c urfile
254 urfile

$ sed -nf test.sed urfile
254

$ cat test.sed
#! /usr/bin/sed -f
# Add one a per character to the hold space, plus one for the newline
s/./a/g
H
x
s/\n/a/
# Carry: ten of one letter become one of the next letter
: a;  s/aaaaaaaaaa/b/g; t b; b done
: b;  s/bbbbbbbbbb/c/g; t c; b done
: c;  s/cccccccccc/d/g; t d; b done
: d;  s/dddddddddd/e/g; t e; b done
: e;  s/eeeeeeeeee/f/g; t f; b done
: f;  s/ffffffffff/g/g; t g; b done
: g;  s/gggggggggg/h/g; t h; b done
: h;  s/hhhhhhhhhh//g
: done
# Not the last line yet: save the count and start the next cycle
$! {
        h
        b
}
# On the last line, convert back to decimal
: loop
/a/! s/[b-h]*/&0/
s/aaaaaaaaa/9/
s/aaaaaaaa/8/
s/aaaaaaa/7/
s/aaaaaa/6/
s/aaaaa/5/
s/aaaa/4/
s/aaa/3/
s/aa/2/
s/a/1/
y/bcdefgh/abcdefg/
/[a-h]/ b loop
p

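For each line read, every character is replaced with the letter a. Because sed strips the trailing newline (\n) when it reads a line,
the \n produced by the H command is used to put it back, and it too is replaced with an a, so that everything is counted uniformly.
To save memory and improve efficiency, carrying is performed: ten a's are replaced with one b, ten b's with one c, and so on.
At the end, the number of a's is the units digit, the number of b's is the tens digit, the number of c's is the hundreds digit, and so on.
If the string left at the end is:
ccbbbbbaaaa
then the total number of characters is 254.
This script has an upper limit: once the character count reaches 100 million, the result is no longer correct, because ten h's are simply deleted.
The limit can be raised by adding more substitution steps, such as s/hhhhhhhhhh/i/g, s/iiiiiiiiii/j/g, and so on, with the letter ranges in the final conversion ([b-h], the y command and [a-h]) extended to match.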
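
The carry and the conversion back to decimal can be tried in isolation on a string of a's. The sketch below, assuming GNU sed, keeps only two carry steps (enough for counts below 1000) and reuses the conversion loop from the script; the helper names unary and to_decimal are made up.

```shell
# Print a line of N letters a (N >= 1).
unary() {
  printf "%${1}s\n" '' | tr ' ' a
}

to_decimal() {
  sed -n '
    # Carry: ten of one letter become one of the next letter.
    s/aaaaaaaaaa/b/g
    s/bbbbbbbbbb/c/g
    :loop
    # No a left: this decimal place is 0.
    /a/! s/[b-h]*/&0/
    s/aaaaaaaaa/9/
    s/aaaaaaaa/8/
    s/aaaaaaa/7/
    s/aaaaaa/6/
    s/aaaaa/5/
    s/aaaa/4/
    s/aaa/3/
    s/aa/2/
    s/a/1/
    # Shift every remaining letter down one place and repeat.
    y/bcdefgh/abcdefg/
    /[a-h]/ b loop
    p
  '
}

unary 254 | to_decimal
printf 'ccbbbbbaaaa\n' | to_decimal
```

Both commands print 254: the first goes through the carry steps, the second starts from the already carried string used in the text.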

Example 7: implementing awk '{sum+=$1}END{print sum}'

$ seq 100 | awk '{sum+=$1}END{print sum}'
5050

$ seq 100 | sed -nf test.sed
5050

$ seq 1000 | awk '{sum+=$1}END{print sum}'
500500

$ seq 1000 | sed -nf test.sed
500500

$ cat test.sed
#! /usr/bin/sed -f

# This is an alternative approach to summing numbers,
# which works a digit at a time and hence has unlimited
# precision.  This time it is done with lookup tables,
# and uses only 10 commands.

G
s/\n/-/
s/$/-/
s/$/;9aaaaaaaaa98aaaaaaaa87aaaaaaa76aaaaaa65aaaaa54aaaa43aaa32aa21a100/

:loop
/^--[^a]/!{
        # Convert next digit from both terms into analog form
        # and put the two groups next to each other
        s/^\([0-9a]*\)\([0-9]\)-\([^-]*\)-\(.*;.*\2\(a*\)\2.*\)/\1-\3-\5\4/
        s/^\([^-]*\)-\([0-9a]*\)\([0-9]\)-\(.*;.*\3\(a*\)\3.*\)/\1-\2-\5\4/

        # Back to decimal, but keeping the carry in analog form
        # \2 matches an `a' if there are at least ten a's, else nothing
        s/-\(aaaaaaaaa\(a\)\)\{0,1\}\(a*\)\([0-9]*;.*\([0-9]\)\3\5\)/-\2\5\4/
        b loop
}
s/^--\([^;]*\);.*/\1/
h
$p

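This script is very clever. It first converts the two digits to be added into the corresponding numbers of the letter a, then joins the a's together,
and then converts them back into the digit of the sum; if the sum is ten or more, one a is kept to represent the carry.
For example, 123 + 456:
3 and 6 are first replaced with aaa and aaaaaa, which are then joined into aaaaaaaaa.
The lookup table then turns aaaaaaaaa into 9, which completes the addition.
For 125 + 456:
5 and 6 add up to a1 in the end, and the letter a is then included when 2 and 5 are added.
That is how the carry is implemented.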
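
The digit-by-digit idea can be tried on a single pair of digits. The following is a reduced sketch, assuming GNU sed, and is not the script above: it adds exactly two one-digit numbers using the same lookup table, and for display it finally writes the carry a as the digit 1. The function name add_digits and the x+y input format are made up.

```shell
add_digits() {
  printf '%s+%s\n' "$1" "$2" | sed '
    # Append the lookup table: each digit d, then d copies of a, then d again.
    s/$/;9aaaaaaaaa98aaaaaaaa87aaaaaaa76aaaaaa65aaaaa54aaaa43aaa32aa21a100/
    # Replace the first digit with its run of a.
    s/^\([0-9]\)+\(.*;.*\1\(a*\)\1.*\)/\3+\2/
    # Replace the second digit the same way and join the two runs.
    s/^\(a*\)+\([0-9]\)\(;.*\2\(a*\)\2.*\)/\1\4\3/
    # Back to decimal: ten a become one carry a, the rest is looked up.
    s/^\(aaaaaaaaa\(a\)\)\{0,1\}\(a*\);.*\([0-9]\)\3\4.*/\2\4/
    # For display only, write the carry a as the digit 1.
    s/^a/1/
  '
}

add_digits 3 6
add_digits 5 6
```

In the full script the carry a is not turned into a 1 but left in place, so that it is counted together with the next pair of digits.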

This really was an eye-opener!

posted @ 2012-10-11 20:09 poiu_elab
