Reading Notes: Sed & Awk, the Awk Part

1. Awk's programming model: The main loop in awk is a routine that reads one line of input from a file and makes it available for processing. You can use BEGIN/END to execute commands before/after the loop.
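A minimal sketch of the three phases; the three-line input is made up for illustration:

```shell
out=$(printf 'a\nb\nc\n' | awk '
BEGIN { print "start" }        # runs once, before any input is read
      { n++ }                  # main loop: runs once per input line
END   { print n " lines" }     # runs once, after the last line
')
printf '%s\n' "$out"
```

This should print "start" followed by "3 lines".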

2. Similar to sed, printing is the default behavior: if a pattern has no action specified, awk prints every line that matches it.

3. The field separator can be specified in the BEGIN section of a script, like BEGIN { FS = "," } . The tilde ~ operator lets you test a regular expression against a field, like $5 ~ /MA/ { print $1 ", " $6 } . The field separator can also be a regular expression; the leftmost longest non-null and nonoverlapping substring that matches will be the separator.
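A small sketch of both ideas. The comma-separated records and the field positions are assumptions made for the example, not data from the book:

```shell
# Hypothetical sample data in the form name,city,state
out=$(printf 'John,Boston,MA\nJane,Albany,NY\n' | awk '
BEGIN { FS = "," }                 # split each record on commas
$3 ~ /MA/ { print $1 ", " $2 }     # act only when field 3 matches MA
')
printf '%s\n' "$out"

# A multi-character FS is treated as a regular expression.
out_re=$(printf 'a1b22c\n' | awk 'BEGIN { FS = "[0-9]+" } { print $1, $2, $3 }')
printf '%s\n' "$out_re"
```

The first command should print "John, Boston"; the second splits on each run of digits and should print "a b c".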

4. System variables: FS->input field separator; OFS->output field separator (defaults to a single space), OFS is emitted wherever a comma separates expressions in a print statement. NF->number of fields in the current input record, so print $NF will always print the last field in a record. RS->input record separator, defaults to newline (\n). ORS->the output equivalent of RS, defaults to newline (\n). NR->number of the current input record, i.e. how many records have been read so far. FILENAME->name of the current input file. CONVFMT->format used when a number is converted to a string, defaults to "%.6g" (the format print uses for numbers is OFMT, which has the same default).
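As a quick illustration of OFS, NR, and NF together (the two input lines are invented):

```shell
out=$(printf 'a b c\nd e\n' | awk '
BEGIN { OFS = "-" }            # commas in print now emit "-"
      { print NR, NF, $NF }    # record number, field count, last field
')
printf '%s\n' "$out"
```

This should print "1-3-c" and then "2-2-e".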

case: BEGIN { FS = "\n"; RS = "" } { print $1, $NF }  #redefine newline as the field separator and set the record separator to the empty string, which makes blank lines separate records.
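Run against a made-up address list where each record is a block of lines separated by a blank line, the case above behaves like this:

```shell
out=$(printf 'John Smith\n12 Main St\nBoston\n\nJane Doe\n5 Elm Rd\nAlbany\n' | awk '
BEGIN { FS = "\n"; RS = "" }   # one field per line, one record per paragraph
      { print $1, $NF }        # first and last line of each block
')
printf '%s\n' "$out"
```

It should print "John Smith Boston" and "Jane Doe Albany".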

5. Operators in awk: ~ means "matches the regular expression" while !~ means "does not match".

6. Formatted printing: Awk borrows printf from C, so you can do printf("%d\t%s\n", $3, $9). printf("|%-10s|\n", "hello") prints a 10-character field, left-adjusted and padded with spaces if the value is shorter than 10 characters.
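A short sketch of the width and alignment flags; the values are arbitrary:

```shell
out=$(awk 'BEGIN {
    printf("|%-10s|\n", "hello")    # left-adjusted, padded to 10 characters
    printf("|%10s|\n",  "hello")    # right-adjusted, padded to 10 characters
    printf("|%5.2f|\n", 3.14159)    # width 5, two decimal places
}')
printf '%s\n' "$out"
```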

7. Passing arguments to a script: awk -f scriptfile high=100 low=60 datafile. No spaces are allowed around the = in an argument assignment. Environment variables and the output of a command can be passed as well, like awk '{...}' directory=$cwd directory2=`pwd`

8. Command-line variable assignments such as high=100 are not available in the BEGIN procedure, because awk evaluates each assignment only when it reaches that argument while working through the file list, which happens after BEGIN has run. So you have to validate input parameters in the shell wrapper (if any) or use an NR == 1 procedure to do the work. Assignments made with the -v option (awk -v high=100 ...) are performed before BEGIN runs, so those variables are visible in the BEGIN procedure.
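The difference between the two kinds of assignment can be checked with a sketch like the following. The variable names early and late are hypothetical, and the trailing - tells awk to read standard input:

```shell
out=$(printf 'x\n' | awk -v early=1 '
BEGIN { print "BEGIN: early=[" early "] late=[" late "]" }   # late is not set yet
      { print "main: early=[" early "] late=[" late "]" }    # both are set here
' late=2 -)
printf '%s\n' "$out"
```

In BEGIN only the -v variable has a value; by the time the first record is processed, both do.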

9. Most of the control statements in Awk are borrowed from C: if/else, for, while, do/while, break and continue work exactly like their C counterparts. The difference is that awk has two keywords that control the main loop, "exit" and "next". The exit statement exits the main loop and passes control to the END procedure. next causes the next line of input to be read and resumes execution at the top of the script. Awk also introduces the "in" operator: you can use for (item in items) to iterate through the indices of an array, or use if (item in array) to test whether array[item] exists.
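A sketch of next, exit, and in working together; the input lines and the STOP marker are invented for the example:

```shell
out=$(printf 'apple\n#skip\npear\napple\nSTOP\nplum\n' | awk '
/^#/     { next }              # skip comment lines, go on to the next record
/^STOP$/ { exit }              # leave the main loop; END still runs
         { count[$0]++ }       # tally every other line
END {
    if ("apple" in count) print "apple", count["apple"]
    if (!("plum" in count)) print "plum not seen"
    for (k in count) n++       # iterate over the array indices
    print n " distinct"
}
')
printf '%s\n' "$out"
```

Because exit fires at STOP, plum is never read, so the output should be "apple 2", "plum not seen", "2 distinct". Note that the order in which for (k in count) visits the indices is not defined, which is why the sketch only counts them.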

10. n = split(string, array, separator): n is the number of elements in the resulting array, which is indexed from 1 to n. separator defaults to FS if not specified.
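For instance, splitting a date string on "-" (the string and the array name parts are just for illustration):

```shell
out=$(awk 'BEGIN {
    n = split("2013-01-02", parts, "-")   # parts[1]..parts[n] hold the pieces
    print n, parts[1], parts[n]
}')
printf '%s\n' "$out"
```

This should print "3 2013 02".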

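11. System arrays: ARGV is the array of command-line arguments and ARGC is the number of elements in it; ARGV[0] is the command name and the arguments run from ARGV[1] to ARGV[ARGC-1]. To pass the shell script's command-line args on to awk, put $* at the end of the awk command. ARGV is already available in the BEGIN procedure, so checking ARGV[i] there is a practical way to validate arguments.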
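A minimal sketch that lists the arguments from BEGIN. The arguments foo and bar=1 are placeholders; since the script has only a BEGIN procedure, awk never tries to open them as files:

```shell
out=$(awk 'BEGIN {
    for (i = 1; i < ARGC; i++)     # ARGV[0] is the command name, so start at 1
        print i, ARGV[i]
}' foo bar=1)
printf '%s\n' "$out"
```

This should print "1 foo" and "2 bar=1"; the program text itself is not part of ARGV.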

12. Useful system functions: rand() returns a random number r with 0 <= r < 1; int() truncates its argument (it does not round; to round, use the printf %.0f format). Awk also has a list of built-in string functions. To convert a number to an ASCII character: letter = sprintf("%c", 120). User-defined functions can be placed anywhere in the script where a pattern-action rule could go (but usually at the top), in the format function name(parameter-list) { statements }. What's weird in awk is that variables used inside a function are global unless they appear in the parameter list, so the usual idiom is to declare local variables as extra parameters; be careful with this. If you want to put functions in separate files, you have to specify each library file with its own -f option on the command line as well.
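The scoping rule is easiest to see in a sketch. The function names sum_to and leaky are hypothetical:

```shell
out=$(awk '
function sum_to(n,    i, total) {   # i and total are extra parameters, so they are local
    for (i = 1; i <= n; i++) total += i
    return total
}
function leaky(n) { g = n * 2 }     # g is not a parameter, so it is global
BEGIN {
    i = "untouched"
    s = sum_to(4)
    leaky(5)
    print s, i, g                   # the global i survives; g leaked out of leaky
}
')
printf '%s\n' "$out"

# Truncation versus rounding, and number-to-character conversion.
out_fn=$(awk 'BEGIN { print int(3.7), sprintf("%.0f", 3.7), sprintf("%c", 120) }')
printf '%s\n' "$out_fn"
```

The first command should print "10 untouched 10" and the second "3 4 x". The extra spaces in the parameter list are only a convention for marking where the real parameters end and the locals begin.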

13. The getline function reads a new line of input (which can be user input or a file stream as well) without returning control to the top of the script. It returns 1 if it was able to read a line, 0 if it reached the end of the file, and -1 if it encountered an error. getline is very useful when you want to process the line that comes right after the one your pattern matched. getline < "filename" reads input from a file; use a while loop to load all lines: while ( (getline < "file") > 0) print .... getline < "-" reads a line from standard input, e.g. what the user types. The input can be assigned to a variable as well, like getline name < "-". "who am I" | getline shows that getline can also read from the output of a command through a pipe, so it is very useful when you want to assign a command's output to a variable, like "date +'%a., %h %d, %Y'" | getline today, which assigns the date to the variable today. But we have to explicitly call close(cmd) to close the pipe.
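A sketch of the three forms discussed above, assuming a temporary file created with mktemp and a trivial echo command standing in for a real one:

```shell
tmp=$(mktemp)
printf 'one\ntwo\n' > "$tmp"

out=$(awk -v file="$tmp" '
BEGIN {
    while ((getline line < file) > 0) n++   # read every line of the file
    close(file)
    cmd = "echo hello"
    cmd | getline greeting                  # read one line of command output
    close(cmd)                              # close the pipe explicitly
    print n, greeting
}
')
printf '%s\n' "$out"
rm -f "$tmp"

# Plain getline: fetch the line that follows a matched line.
out_next=$(printf 'Name:\nAlice\n' | awk '/^Name:$/ { getline; print "name is " $0 }')
printf '%s\n' "$out_next"
```

The first command should print "2 hello" and the second "name is Alice". The parentheses around the getline expression in the while condition matter, because < would otherwise bind in an unintended way.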

14. #!/bin/awk -f as the first line lets you invoke an awk script directly from the shell, and command-line args are passed through to the awk script transparently.

Finally, here is the article on awk by Chen Hao (陈皓), which is well illustrated.

posted on 2013-01-02 14:32 by 梁霄