Reading Notes: Sed & Awk

1. awk departs from sed by discarding the line-editor command set; in its place it offers a programming language modeled on C.

2. One option common to sed and awk is -f, which reads the script from a file: sed -f scriptfile inputfile. Both tools work on a copy of each input line and write the result to standard output, so the original file is never changed.

3. How sed works: it reads the input file line by line, applies each instruction in the script whose pattern matches the line, outputs the line, and then repeats with the next line.

sed 's/WA/Washington/' inputfile /*print all lines, replacing WA with Washington*/

sed 's/ WA/, Washington/; s/ PA/, Pennsylvania/' inputfile /*separate multiple instructions with semicolons*/

sed -n -e 's/WA/Washington/p' inputfile /*-n suppresses automatic output, so only the lines changed by the substitution are printed, by the p flag*/

4. How awk works: awk lets you do more programmable things with the matched lines.

awk -F, '/WA/ { print $1; print $2; print $3 }' inputfile /*print the first three comma-delimited fields of the lines containing WA*/
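
To make notes 3 and 4 concrete, here is a minimal sketch that runs the same kinds of commands on a tiny list. The names and cities in the data are made up for illustration; only the sed and awk invocations mirror the notes above.

```shell
# Hypothetical sample data: name, city, state (comma-separated).
list='Alice Ford, Seattle, WA
Bob Stone, Boston, MA
Carol Reed, Pittsburgh, PA'

# Substitute on every line; all three lines are printed.
all_lines=$(printf '%s\n' "$list" | sed 's/ WA/ Washington/')

# -n suppresses automatic output; the p flag prints only the changed line.
changed_only=$(printf '%s\n' "$list" | sed -n 's/ WA/ Washington/p')

# awk: split on commas and print the first field of lines containing WA.
names=$(printf '%s\n' "$list" | awk -F, '/WA/ { print $1 }')

printf '%s\n' "$changed_only" "$names"
```

The input is piped in rather than read from a file, so nothing on disk is touched.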

5. * means different things in the shell and in regular expressions. In the shell, * matches zero or more of any character, whereas in a regular expression it matches zero or more occurrences of the preceding item. So the regular expression ".*" matches any string, like * does in the shell.

6. Use [] to define a character class, e.g. grep '\.H[1234]' inputfile matches lines containing ".H1", ".H2", ".H3" or ".H4".

7. Use [0-1][0-9][-/][0-3][0-9][-/][0-9][0-9] to match strings like "MM-DD-YY" or "MM/DD/YY".

8. POSIX adds character classes for use inside bracket expressions, such as [[:alnum:]] for alphanumeric characters and [[:print:]] for printable ones. Additionally, [[.ch.]] is a collating symbol: it matches only the multi-character element ch, which is treated as a single unit (in locales that define such an element).

9. Repetition: * matches zero or more occurrences of the preceding regular expression, + matches one or more, and ? matches zero or one. (+ and ? belong to the extended set used by egrep and awk.) Don't confuse ? in regular expressions with ? in the shell, where it stands for any single character, equivalent to . in a regular expression.

10. Positional metacharacters: ^ anchors the beginning of a line and $ the end. To count blank lines, use grep -c '^$' file; to match an entire line, use grep '^.*$' file. Similarly, \< and \> match the beginning and end of a word, e.g. grep '\<book\>' file prints all lines containing the word "book".
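
A quick sketch of the anchors in note 10, run against a few made-up lines. The \< and \> word anchors are an extension rather than part of POSIX basic regular expressions, so this assumes a grep that supports them (GNU grep does).

```shell
# Hypothetical sample text: one line with the word "book", one blank line,
# and two lines where "book" is only part of a longer word.
text='a book

notebooks
bookkeeping'

# ^$ matches a line with nothing between its start and end; -c counts them.
blank_count=$(printf '%s\n' "$text" | grep -c '^$')

# \<book\> matches "book" only as a whole word.
whole_word=$(printf '%s\n' "$text" | grep '\<book\>')

printf '%s\n' "$blank_count" "$whole_word"
```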

11. To restrict the number of occurrences of a pattern, use \{n,m\}. For example, grep '10\{2,4\}1' file matches 1001, 10001 and 100001 (two to four zeros), but not 101.
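
The interval in note 11 is easy to check by feeding grep numbers with one to five zeros. In this sketch the ^ and $ anchors are my addition, so that each whole line is tested rather than a substring of it.

```shell
# One, two, three, four and five zeros between the two 1s.
matches=$(printf '%s\n' 101 1001 10001 100001 1000001 | grep '^10\{2,4\}1$')

# Only the lines with two to four zeros survive.
printf '%s\n' "$matches"
```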

12. Grouping with (): egrep 'compan(y|ies)?' matches "compan", "company" and "companies". Note that this is NOT available in the basic regular expressions of most versions of sed or grep; it is available only in egrep and awk.

13. With grep, either single quotes ' or double quotes " keep a group of words together as one pattern, e.g. grep "hello world" file instead of grep hello world. In a shell script, single quotes don't expand variables, so grep '\$VAR1' file1 searches for the literal string $VAR1, whereas grep "$VAR1" file1 expands the variable and searches for its value.
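
The quoting difference in note 13 can be seen with a small experiment. The variable value and the two lines of text are hypothetical.

```shell
VAR1=world

# Single quotes keep $VAR1 literal inside the sample text.
text='hello world
price is $VAR1'

# Single quotes: grep receives \$VAR1 and searches for the literal string $VAR1.
literal=$(printf '%s\n' "$text" | grep '\$VAR1')

# Double quotes: the shell expands $VAR1 first, so grep searches for "world".
expanded=$(printf '%s\n' "$text" | grep "$VAR1")

printf '%s\n' "$literal" "$expanded"
```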

14. sed internally keeps a pattern space, which holds a copy of the current input line. TODO

15. sed's three principles: a. editing commands are applied line by line; b. by default they apply to every line (globally); c. the original file is not modified.

16. sed commands with addresses: d (delete all lines), 1d (delete line one), /^$/d (delete empty lines; a regular expression address is enclosed in / /), 50,$d (delete from line 50 through the last line), /^\.TS/,/^\.TE/d (delete each block of lines from one matching .TS through the next one matching .TE, inclusive). {} can be used to nest addresses and commands, e.g. /^\.TS/,/^\.TE/{ /^$/d; s/^\.ps 10/.ps 8/; }; the commands inside {} are applied only to the lines selected by the outer address.
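
Here is a small sketch of the range and nested addresses from note 16. The troff-like input is invented for the test; the point is that the .ps line outside the .TS/.TE block is left alone.

```shell
# Hypothetical input: a .ps request outside the table and one inside it.
doc='.ps 10
.TS

.ps 10
.TE
after'

# Range address: delete everything from .TS through .TE.
no_table=$(printf '%s\n' "$doc" | sed '/^\.TS/,/^\.TE/d')

# Nested commands: only inside the table, drop blank lines and change .ps 10.
edited=$(printf '%s\n' "$doc" | sed '/^\.TS/,/^\.TE/{
    /^$/d
    s/^\.ps 10/.ps 8/
}')

printf '%s\n' "$no_table" "---" "$edited"
```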

17. Use sed to clean up a source file: sed 's/^$/.LP/;/^+ */d;s/^  *//;s/  */ /g' horsefeathers (replaces blank lines with .LP, deletes lines that begin with +, strips leading spaces, and collapses runs of spaces between words into one). Use sed to print only the wanted content: sed -n "/^\.deBL/,/^\.\.$/p" /usr/lib/macros/mmt (-n suppresses the default output, so only the lines selected for p are printed).

18. sed commands: commands can be grouped at the same address with {}; they are all applied to the same line, in order. It's recommended to put each command on a separate line: sed commands are already hard to read, and putting them on one line separated by ; only makes it worse.

19. use # to comment a script

20. sed substitution command: s/regular expression/replacement/flag. The flag can be n (replace only the nth occurrence of the pattern), g (replace every occurrence of the pattern), p (print the pattern space) or w file (write the pattern space to a file). To put a newline in the replacement, end the line with a backslash and continue on the next line, i.e. "s/pattern/\" followed by a line break; the \ escapes the newline. & is another useful metacharacter: it refers to the matched text. Executing "s/UNIX/\\s-2&\\s0/g" on "on the UNIX Operating System" prints "on the \s-2UNIX\s0 Operating System". \( \) saves a portion of the match, and \n (n being a digit) refers back to the nth saved portion. Executing 's/\(.*\):\(.*\)/\2:\1/' on second:first prints first:second.
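
The two examples from note 20 can be run as written. The third command is an extra illustration of the numeric flag, with a made-up input string.

```shell
# & stands for the whole matched text.
wrapped=$(printf '%s\n' 'on the UNIX Operating System' | sed 's/UNIX/\\s-2&\\s0/g')

# \( \) saves parts of the match; \1 and \2 recall them in the replacement.
swapped=$(printf '%s\n' 'second:first' | sed 's/\(.*\):\(.*\)/\2:\1/')

# A numeric flag replaces only that occurrence (here the second hyphen).
second_only=$(printf '%s\n' 'a-b-c-d' | sed 's/-/+/2')

printf '%s\n' "$wrapped" "$swapped" "$second_only"
```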

21. sed insert/append/change commands: the syntax is [line-address]a\ followed by the text on a separate line (likewise i\ and c\). Append puts the new text after the addressed line, whereas insert puts it before. Insert and append work only on a single line address, but change can replace a range of lines with a single copy of the specified text, e.g. /^From /,/^$/c\ followed by the text. But if the change command is one of several commands enclosed in {}, it is applied to each line in the range, so the text is output once per line.

22. sed transform command: the syntax is [line-address]y/abc/xyz/. The replacement is made character by character, by position, so it does not match the whole string "abc". Instead, every a is replaced with x, every b with y and every c with z. It is typically used to convert lowercase letters to uppercase.
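
A short sketch of the positional mapping in note 22. The input strings are arbitrary.

```shell
# Each a becomes x, each b becomes y, each c becomes z, wherever they occur.
mapped=$(printf '%s\n' 'abc cab' | sed 'y/abc/xyz/')

# The usual application: convert lowercase letters to uppercase.
upper=$(printf '%s\n' 'hello world' |
    sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

printf '%s\n' "$mapped" "$upper"
```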

23. sed read and write files: the syntax is [line-address]r filename or [line-address]w filename. r copies the contents of the file to the output after the addressed line (the text does not go through the pattern space, so later commands cannot edit it), whereas w writes the pattern space to a file.

24. sed multi-line commands (N, D, P). N appends the next input line to the pattern space, separated from the first line by an embedded '\n'. Note that once the next line has been read in, the script does not start over from the top for it, because it is already part of the current pattern space. That's how a multi-line pattern space works.
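
A minimal sketch of N on its own, with made-up input: each pair of lines is joined into one pattern space, and the embedded newline is then replaced by a space.

```shell
# N appends the next line; s/\n/ / then turns the embedded newline into a space.
joined=$(printf '%s\n' 'first' 'second' 'third' 'fourth' | sed 'N; s/\n/ /')

printf '%s\n' "$joined"
```

Because the second line of each pair is consumed by N, the script next starts at the top with the third line, not the second.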

<para>

This is a test paragraph in Interleaf style ASCII.  Another line 
in a paragraph.  Yet another. 

<Figure Begin>

v.1111111111111111111111100000000000000000001111111111111000000
100001000100100010001000001000000000000000000000000000000000000
000000

<Figure End>

<para>

More lines of text to be found after the figure.
These lines should print.

------------------------------
The goal is to replace <para> and the blank line that follows it with ".LP", write the bitmap data to a separate file, and replace the <Figure> tags with the .FG macro
-------------------------------

/<para>/{
    N
    # change the pattern space (<para> plus the next line) to .LP
    c\
.LP
}
/<Figure Begin>/,/<Figure End>/{
    # write each line of the figure to a separate file
    w fig.interleaf
    /<Figure End>/i\
.FG\
<Insert figure here>\
.FE
    # finally, delete the pattern space so the figure lines are not printed
    d
}

The delete command (d) deletes the contents of the pattern space and causes a new line to be read, with editing resuming at the top of the script. The D command, however, deletes only the portion of the pattern space up to the first embedded newline. It does not cause a new line to be read; instead, it returns to the top of the script and applies the instructions to what remains in the pattern space. (A typical example of changing the flow of control.) Together, N, P and D can maintain a loop that looks at multiple lines: N reads a line, P prints the first line, and D deletes the first line and returns to the start of the loop.

A simple case study:
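
To see how D differs from d, here is a sketch of a common idiom (not taken from the notes above) that squeezes runs of blank lines down to one. The input is made up: three blank lines between "a" and "b".

```shell
# On a blank line, pull in the next line. If the pattern space is then just a
# newline (two blank lines in a row), D drops the first one and reruns the
# script on the remainder instead of printing anything.
squeezed=$(printf '%s\n' 'a' '' '' '' 'b' | sed '/^$/{
    N
    /^\n$/D
}')

printf '%s\n' "$squeezed"
```

With d in place of D, the whole pattern space would be thrown away each time, so an even number of blank lines would disappear entirely.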

-----------------------

I want to see @f1(what will happen) if we put the
font change commands @f1(on a set of lines).  If I understand
things (correctly), the @f1(third) line causes problems. (No?).
Is this really the case, or is it (maybe) just something else?

Let's test having two on a line @f1(here) and @f1(there) as
well as one that begins on one line and ends @f1(somewhere 
on another line).  What if @f1(it is here) on the line?
Another @f1(one).

-----------------------

The task: replace @f1(...) with \fB...\fR.

A first attempt is: sed -n 's/@f1(\(.*\))/\\fB\1\\fR/gp' sample_file # doesn't handle multiple occurrences of @f1() on one line
A better one: sed -n 's/@f1(\([^)]*\))/\\fB\1\\fR/gp' sample_file # doesn't handle an @f1() broken across multiple lines
Eventually we have:
s/@f1(\([^)]*\))/\\fB\1\\fR/g
/@f1([^)]*/{ 
    N
    s/@f1(\([^)]*\n[^)]*\))/\\fB\1\\fR/g
    P
    D
}
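
The final script can be tried on a shortened version of the sample text. The three input lines below are adapted from the case study and are only an assumption about what a minimal test case looks like; the sed script itself is the one above.

```shell
# One @f1() split across two lines, one complete on a line, one on the last line.
sample='well as one that begins @f1(somewhere
on another line).  What if @f1(it is here)?
Another @f1(one).'

result=$(printf '%s\n' "$sample" | sed 's/@f1(\([^)]*\))/\\fB\1\\fR/g
/@f1([^)]*/{
    N
    s/@f1(\([^)]*\n[^)]*\))/\\fB\1\\fR/g
    P
    D
}')

printf '%s\n' "$result"
```

After P prints the first line and D removes it, the script restarts on the second line, which is how the @f1(it is here) left over there still gets converted by the first substitution.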

25. sed Hold (h, H), Get (g, G) and Exchange (x) commands: besides the pattern space, sed also keeps a temporary storage area called the hold space. h/H copy/append the contents of the pattern space to the hold space, and g/G copy/append the contents of the hold space to the pattern space. x exchanges the contents of the hold space and the pattern space.

A clever yet tedious script to capitalize statement names:
/the .* statement/{
    h
    s/.*the \(.*\) statement.*/\1/
    y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
    # append the original line from the hold space
    G
    # splice the capitalized name back into the original line
    s/\(.*\)\n\(.*the \).*\( statement.*\)/\2\1\3/
}
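
A sketch of the hold-space script in action, on a made-up sentence. The first substitution is written as \(.*\) without a stray parenthesis inside the group, which I assume is what was intended.

```shell
capitalized=$(printf '%s\n' 'find the Match statement' | sed '/the .* statement/{
    h
    s/.*the \(.*\) statement.*/\1/
    y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
    G
    s/\(.*\)\n\(.*the \).*\( statement.*\)/\2\1\3/
}')

printf '%s\n' "$capitalized"
```

After G the pattern space holds the capitalized name, a newline, and the original line; the last substitution puts the name back between "the " and " statement".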

FAQ:

1. How to replace tabs in a file: I tried many options like s/\t/,/g, but \t did not match a tab in my sed. Eventually I was able to match a tab by pressing Control-V and then Tab, which inserts a literal tab character into the command, and it works. According to the wiki, Unix interactive terminals use Control-V to mean "the next character should be treated literally".
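
In a script, where Control-V is not convenient, one alternative is to let printf produce the tab and put it into the sed command through a shell variable. This is a sketch of that approach, not what I originally did; the variable name tab is arbitrary.

```shell
# printf '\t' emits a real tab character; store it in a variable.
tab=$(printf '\t')

# Double quotes let the shell substitute the literal tab into the sed script.
csv=$(printf 'a\tb\tc\n' | sed "s/$tab/,/g")

printf '%s\n' "$csv"
```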

Posted on 2012-07-28 13:37 by 梁霄
