awk Study Notes

 1、Awk Command Syntax

Basic Awk Syntax:

awk -F fs '/pattern/ {action}' input_file

In the above syntax:

  • -F fs sets the field separator to fs. If you omit it, awk splits fields on runs of whitespace (spaces and tabs).
  • The /pattern/ and the {action} should be enclosed together inside single quotes.
  • /pattern/ is optional. If you omit it, awk processes every record of input_file. If you specify a pattern, awk processes only the records that match it.
  • {action} holds the awk commands to run, either one command or several. The whole action block must be enclosed between { and }. If you omit it, awk prints each matching record.
  • input_file is the input file to process.

Example:

Contents of input_file:

101,John Doe,CEO
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin
104,Anand Ram,Developer
105,Jane Miller,Sales Manager

Command:

awk -F"," '/CEO/ {print $2,$3}' input_file

Output:

John Doe CEO
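
Since both the pattern and the action are optional, the two shortened forms can be sketched as follows. This is only an illustration: the records are an inline copy of part of input_file, and the variable names data, no_pattern, and no_action are made up for this example.

```shell
# Inline copy of a few sample records (same layout as input_file).
data='101,John Doe,CEO
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin'

# No pattern: the action runs for every record.
no_pattern=$(printf '%s\n' "$data" | awk -F',' '{print $1}')
echo "$no_pattern"

# No action: each matching record is printed whole, as if by {print $0}.
no_action=$(printf '%s\n' "$data" | awk -F',' '/Manager/')
echo "$no_action"
```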

2、Awk Program Structure (BEGIN, body, END block)

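awk workflow: the BEGIN block runs once before any input is read, the body block runs once for every input record, and the END block runs once after the last record.

Example:

Save the awk commands in a script file named demo: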

#!/bin/awk -f
BEGIN{
    print "begin"
    FS=","
}
/CEO/ {print $2,$3}
END{
    print "end"
}

Run it against the file test, which holds the same records as input_file above:

awk -f demo test

Output:

begin
John Doe CEO
end

3、Print Command 

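By default, print with no arguments prints the whole record. Passing field references such as $1 prints only those fields, and adding a pattern restricts printing to the records that match it.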
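
The three uses of print (whole record, selected fields, and printing restricted by a pattern) can be sketched as follows. The records are an inline copy of two lines from section 1, and the variable names are chosen only for this example.

```shell
data='101,John Doe,CEO
102,Jason Smith,IT Manager'

# print with no arguments outputs the whole record ($0).
whole=$(printf '%s\n' "$data" | awk -F',' '{print}')
# Field numbers select individual fields; the comma inserts OFS (a space).
fields=$(printf '%s\n' "$data" | awk -F',' '{print $2, $3}')
# A pattern limits printing to the records that match it.
matched=$(printf '%s\n' "$data" | awk -F',' '/CEO/ {print $2}')
printf '%s\n---\n%s\n---\n%s\n' "$whole" "$fields" "$matched"
```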

4、FS - Input Field Separator

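By default awk splits fields on whitespace. Use the -F option to choose a different separator:

awk -F ',' '{print $2, $3}' test

You can also set the separator through the built-in variable FS, which should be done in the BEGIN block:

awk 'BEGIN {FS=","} {print $2, $3}' test

More than one separator can be in use at once. For example, each record in the following file uses three different separators: a comma, a colon, and a percent sign: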

101,John Doe:CEO%10000
102,Jason Smith:IT Manager%5000
103,Raj Reddy:Sysadmin%4500
104,Anand Ram:Developer%4500
105,Jane Miller:Sales Manager%3000

You can specify MULTIPLE field separators using a regular expression. For example, FS = "[,:%]" means that a field separator can be , or : or %.

Script demo:

#!/bin/awk -f
BEGIN{
    FS="[,:%]"
}
{print $2,$3}

Output:

John Doe CEO
Jason Smith IT Manager
Raj Reddy Sysadmin
Anand Ram Developer
Jane Miller Sales Manager

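# Note: simple separators can be handled by setting FS to a regular expression; for more complex separators, consider preprocessing the input (for example with Python) to turn them into spaces.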
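
As a sketch of how far a regular-expression FS can go before preprocessing is needed, the separator below is a "|" together with any spaces around it, so the fields come out already trimmed. The input line is a made-up example, not one of the files above.

```shell
# FS is a regular expression: optional spaces, a literal "|", optional spaces.
trimmed=$(printf '101 | John Doe |CEO\n' |
  awk 'BEGIN{FS=" *[|] *"} {print $2 "/" $3}')
echo "$trimmed"
```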

5、OFS - Output Field Separator

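OFS is the output field separator. It is written between the fields that a print statement lists separated by commas. The default is a single space.

Script demo: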

#!/bin/awk -f
BEGIN{
    FS="[,:%]"
    OFS="--"
}
{print $2,$3}
kl@ubuntu:~/scripts$ awk -f demo test 
John Doe--CEO
Jason Smith--IT Manager
Raj Reddy--Sysadmin
Anand Ram--Developer
Jane Miller--Sales Manager
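
OFS is also used when awk rebuilds the whole record after a field has been assigned to. A minimal sketch, assuming a single comma-separated record passed inline (the variable name rebuilt is only for this example):

```shell
# Assigning to any field (here the no-op $1=$1) makes awk rebuild $0
# with OFS between all fields, so a plain print shows the new separator.
rebuilt=$(printf '101,John Doe,CEO\n' |
  awk 'BEGIN{FS=","; OFS="--"} {$1=$1; print}')
echo "$rebuilt"
```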

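To print fields with nothing between them, concatenate them instead of separating them with a comma; OFS is then not used:

Script demo:

#!/bin/awk -f
BEGIN{
    FS="[,:%]"
    OFS="--"
}
{print $2$3}  # or: {print $2 $3}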
kl@ubuntu:~/scripts$ awk -f demo test 
John DoeCEO
Jason SmithIT Manager
Raj ReddySysadmin
Anand RamDeveloper
Jane MillerSales Manager

6、RS - Record Separator

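Suppose records in the following text are separated by colons instead of newlines, and fields are separated by commas:

101,John Doe:102,Jason Smith:103,Raj Reddy:104,Anand
Ram:105,Jane Miller

To extract the names, set the variable RS (the default is a newline). Because a newline is then no longer a record separator, the line break inside "Anand Ram" stays in the field and is printed as is: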

kl@ubuntu:~/scripts$ awk -F"," 'BEGIN { RS=":" } {print $2}' test 
John Doe
Jason Smith
Raj Reddy
Anand
Ram
Jane Miller
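
If the line breaks left inside the fields are unwanted, one possible cleanup is to replace them before printing. This is a sketch under the assumption that the input has the same layout as the text above (a line break inside "Anand Ram" and a trailing newline); the variable names are hypothetical.

```shell
names=$(printf '101,John Doe:102,Jason Smith:103,Raj Reddy:104,Anand\nRam:105,Jane Miller\n' |
  awk -F',' 'BEGIN{RS=":"} {
    name = $2
    gsub(/\n/, " ", name)   # turn embedded newlines into spaces
    sub(/ $/, "", name)     # drop the space left by the trailing newline
    print name
  }')
echo "$names"
```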

7、ORS - Output Record Separator

By default awk ends each output record with a newline. Set the variable ORS to choose the output record separator explicitly:

kl@ubuntu:~/scripts$ awk -F"," 'BEGIN {RS=":";ORS="--\n" } {print $2}' test 
John Doe--
Jason Smith--
Raj Reddy--
Anand
Ram--
Jane Miller
--
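
Because ORS replaces the newline that print normally appends, it can also join all records on one line. A small sketch with made-up input; note that the separator is written after the last record too.

```shell
# Every print ends with ORS instead of a newline.
joined=$(printf 'a\nb\nc\n' | awk 'BEGIN{ORS=", "} {print}')
echo "$joined"
```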

 

8、NR - Number of Records

NR is very helpful. Inside the body block it gives the number of the current record (the line number, with the default RS). In the END block it gives the total number of records read.

The following example shows how NR works in the body block and in the END block:

File test:

101,John Doe,CEO
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin
104,Anand Ram,Developer
105,Jane Miller,Sales Manager
kl@ubuntu:~/scripts$ awk 'BEGIN{FS=","} {print "Id of record number",NR,"is",$1} END{print "Total number:",NR}' test 
Id of record number 1 is 101
Id of record number 2 is 102
Id of record number 3 is 103
Id of record number 4 is 104
Id of record number 5 is 105
Total number: 5
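
NR can also be used in a pattern, for instance to select a range of records by number. A minimal sketch, assuming an inline copy of the first four records of test:

```shell
data='101,John Doe,CEO
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin
104,Anand Ram,Developer'

# Only records 2 and 3 satisfy the pattern, so only they are printed.
middle=$(printf '%s\n' "$data" | awk -F',' 'NR>=2 && NR<=3 {print NR ": " $2}')
echo "$middle"
```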

 

9、FNR - File "Number of Record"

NR keeps growing between multiple files. When the body block starts processing the 2nd file, NR will not be reset to 1, instead it will continue from the last NR number value of the previous file.

FNR gives the record number within the current file. When awk finishes the body block for the 1st file and starts on the next file, FNR starts from 1 again.

The following example shows both NR and FNR:

kl@ubuntu:~/scripts$ awk -F"," '{printf "%s---FILENAME=%s NR=%s FNR=%s\n",$1,FILENAME,NR,FNR}' test1 test2
this is test1 line1---FILENAME=test1 NR=1 FNR=1
this is test1 line2---FILENAME=test1 NR=2 FNR=2
this is test2 line1---FILENAME=test2 NR=3 FNR=1
this is test2 line2---FILENAME=test2 NR=4 FNR=2
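
A common use of the difference between NR and FNR is the test NR==FNR, which is true only while the first file is being read. The sketch below uses it to build a lookup table from one file and apply it to a second; the file names titles and names and their contents are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '101,CEO\n102,IT Manager\n' > "$tmpdir/titles"
printf '102,Jason Smith\n101,John Doe\n' > "$tmpdir/names"

# First file: NR==FNR, so fill the table and skip the second block with next.
# Second file: NR is larger than FNR, so only the second block runs.
joined=$(awk -F',' 'NR==FNR {title[$1]=$2; next} {print $2 " - " title[$1]}' \
  "$tmpdir/titles" "$tmpdir/names")
echo "$joined"
rm -rf "$tmpdir"
```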

10、FILENAME – Current File Name

FILENAME is helpful when you are specifying multiple input-files to the awk program. This will give you the name of the file Awk is currently processing.

kl@ubuntu:~/scripts$ awk '{print $0,"---",FILENAME}' test1 test2
this is test1 --- test1
this is test2 --- test2
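
FILENAME combines naturally with FNR, for example to print a header line whenever a new file starts. A sketch with two small throwaway files (their contents are made up):

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/test1"
printf 'c\n' > "$tmpdir/test2"

# FNR==1 is true for the first record of each file.
headed=$(cd "$tmpdir" && awk 'FNR==1 {print "== " FILENAME " =="} {print}' test1 test2)
echo "$headed"
rm -rf "$tmpdir"
```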

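11、ARGC, ARGV - Arguments

ARGC is an integer: the number of command-line arguments, not counting options such as -v and -f (with their values) or the program text. ARGV is an array of strings; ARGV[0] through ARGV[ARGC-1] hold those arguments, and ARGV[0] is the command name. The loop below sits in the BEGIN block so that it runs once rather than once per record:

kl@ubuntu:~/scripts$ awk -F"," 'BEGIN{for(i=0;i<ARGC;i++) printf "ARGV[%d]=%s\n",i,ARGV[i]}' test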
ARGV[0]=awk
ARGV[1]=test

12、 Awk Variables and Operators

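You don't need to declare a variable before using it. If you want to initialize an awk variable, it is best to do so in the BEGIN block, which is executed only once.

unary operators: plus (+), minus (-), increment (++), decrement (--)

arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), modulo (%)

string operator: concatenation (no symbol; write the operands next to each other, usually separated by a space)

comparison operators: greater than (>), greater than or equal (>=), less than (<), less than or equal (<=), equal (==), not equal (!=)

logical operators: and (&&), or (||)

regular expression operators: match (~), no match (!~)

Examples:

File test: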

101,John Doe,CEO,10000
102,Jason Smith,IT Manager,5000
103,Raj Reddy,Sysadmin,4500
104,Anand Ram,Developer,4500
105,Jane Miller,Sales Manager,3000

String concatenation:

kl@ubuntu:~/scripts$ awk -F"," '{print $1 $1 $2}' test
101101John Doe
102102Jason Smith
103103Raj Reddy
104104Anand Ram
105105Jane Miller

Match and no match:

Lines whose second field starts with J:

kl@ubuntu:~/scripts$ awk -F"," '$2~/^J/' test 
101,John Doe,CEO,10000
102,Jason Smith,IT Manager,5000
105,Jane Miller,Sales Manager,3000

Selected fields of lines whose second field does not start with J:

kl@ubuntu:~/scripts$ awk -F"," '$2!~/^J/ {print $1,$2}' test 
103 Raj Reddy
104 Anand Ram

Exact string comparison:

kl@ubuntu:~/scripts$ awk -F"," '$2=="John Doe"' test 
101,John Doe,CEO,10000

Using comparison operators:

kl@ubuntu:~/scripts$ awk -F"," '$4<4000 || $4>5000' test 
101,John Doe,CEO,10000
105,Jane Miller,Sales Manager,3000
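
The arithmetic operators and an initialized variable can be sketched together by summing the fourth field. The records are an inline copy of test; the variable names total, high, and stats are chosen for this example.

```shell
data='101,John Doe,CEO,10000
102,Jason Smith,IT Manager,5000
103,Raj Reddy,Sysadmin,4500
104,Anand Ram,Developer,4500
105,Jane Miller,Sales Manager,3000'

stats=$(printf '%s\n' "$data" | awk -F',' '
  BEGIN { total = 0 }        # initialize once, before any input
  $4 >= 4500 { high++ }      # comparison used as a pattern
  { total += $4 }            # arithmetic on a field
  END { print "total=" total, "avg=" total / NR, "high=" high }')
echo "$stats"
```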

13、Awk Conditional Statements and Loops

if statement:

Script demo:

#!/bin/awk -f
BEGIN{
    FS="[,]"
}
{
    if($2=="John Doe")
        print "Hello CEO"
    else if($1==104)
        print "Hello Developer"
    else
        print "Hello"
}
kl@ubuntu:~/scripts$ awk -f demo test 
Hello CEO
Hello
Hello
Hello Developer
Hello

while / do-while loops:

#!/bin/awk -f
BEGIN{
    i = 0
    while(1){
        print i;
        i++;
        if(i>3) break;
    }
}
kl@ubuntu:~/scripts$ awk -f demo  
0
1
2
3

The same output with do-while:

#!/bin/awk -f
BEGIN{
    i = 0
    do{
        print i;
        i++;
    }while(i<4)
}

for loop:

The same output again with a for loop:

#!/bin/awk -f
BEGIN{
    for(i=0;i<4;i++){
        print i;
    }
}

break/continue statements:

break is shown in the while example above. continue skips the rest of the current iteration:

#!/bin/awk -f
BEGIN{
for(i=0;i<4;i++){
    if(i==2) continue;
    print i;
}
}
kl@ubuntu:~/scripts$ awk -f demo 
0
1
3

exit statement:

exit stops the program; the statements after it are not executed:

#!/bin/awk -f
BEGIN{
    for(i=0;i<4;i++){
        if(i==2) exit;
        print i;
    }
}
kl@ubuntu:~/scripts$ awk -f demo 
0
1
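
The example above calls exit in the BEGIN block. As a complementary sketch, when exit is called from the body block, awk stops reading input but still runs the END block before terminating. The input numbers and messages here are made up.

```shell
# Stop at the first record whose first field is 102; END still runs.
found=$(printf '101\n102\n103\n' | awk '
  $1 == 102 { print "found at record " NR; exit }
  END { print "END ran after exit" }')
echo "$found"
```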

14、Awk Associative Arrays

In Awk, arrays are associative, i.e. an array contains multiple index/value pairs. The index doesn't need to be a continuous set of numbers; in fact it can be a string or a number, and you don't need to specify the size of the array.

Syntax:

arrayname[string]=value
  • arrayname is the name of the array.
  • string is the index of the element.
  • value is the value assigned to that element.

The index of an array is always a string. Even when you pass a number as the index, awk treats it as a string index. The two index forms in the following script refer to the same element:

#!/bin/awk -f
BEGIN {
  array[101]=3;
  print array["101"];
}
kl@ubuntu:~/scripts$ awk -f demo 
3

Reading an associative array:

{for (item in array)  print array[item]} # the iteration order is unspecified
{for(i=1;i<=len;i++)  print array[i]} # len is the number of elements; assumes the indices are 1 to len
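
A typical use of string indices is counting occurrences per key and then walking the array with for (item in array). In this sketch the third column (a department name) is made-up data, and the output is piped through sort because the iteration order is unspecified.

```shell
data='101,John Doe,IT
102,Jason Smith,Sales
103,Raj Reddy,IT'

# n[$3]++ creates the element on first use and counts each department.
counts=$(printf '%s\n' "$data" |
  awk -F',' '{ n[$3]++ } END { for (dept in n) print dept, n[dept] }' | sort)
echo "$counts"
```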

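Multidimensional arrays use the form array[index1,index2,...].

SUBSEP is the separator that joins the subscripts into a single index string; its default value is "\034". You can assign your own separator to SUBSEP: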

kl@ubuntu:~/scripts$ awk 'BEGIN{SUBSEP=":";array["a","b"]=1;for(i in array) printf "array[%s]=%d\n",i,array[i]}'
array[a:b]=1
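
Since a multidimensional index is really one string joined with SUBSEP, the original subscripts can be recovered with split. A minimal sketch using the default SUBSEP; the array name cell and its subscripts are hypothetical.

```shell
pairs=$(awk 'BEGIN {
  cell["row1", "colA"] = 5         # stored under the key "row1" SUBSEP "colA"
  for (key in cell) {
    split(key, part, SUBSEP)       # recover the two original subscripts
    print part[1], part[2], cell[key]
  }
}')
echo "$pairs"
```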

Use the delete statement to remove an array or an array element:

delete array                 # delete the whole array
delete array[item]           # delete one array element
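
The effect of delete can be checked with the in operator, which tests whether an index exists without creating it. The sketch below also empties the array with split("", a), a widely used idiom that has the same effect as deleting the whole array; the array contents are made up.

```shell
state=$(awk 'BEGIN {
  a["x"] = 1; a["y"] = 2
  delete a["x"]                        # remove one element
  print (("x" in a) ? "x present" : "x gone")
  print (("y" in a) ? "y present" : "y gone")
  split("", a)                         # idiom that empties the whole array
  n = 0; for (k in a) n++
  print "elements left: " n
}')
echo "$state"
```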

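Sorting functions (asort and asorti are gawk extensions):

asort sorts an array by its values. After sorting, the indices run from 1 to the length of the array. For example:

Sort the elements of the following file test: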

a
1
0
b
20
8
100
cc

Script demo:

#!/bin/awk -f
{a[$0]=$0} # build array a with $0 as both index and value
END{
len=asort(a) # sort a by value; asort returns the number of elements
for(i=1;i<=len;i++) print i "\t"a[i]  # print position and value
}
}
kl@ubuntu:~/scripts$ awk -f demo test 
1    0
2    1
3    8
4    20
5    100
6    a
7    b
8    cc

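asorti sorts an array by its indices. After asorti(array), the indices run from 1 to the length of the array and the values are the original indices, in sorted order:

File test: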

cd
ab
cd
cad
cd
sun
ab
kl@ubuntu:~/scripts$ awk '{a[$0]}END{l=asorti(a);for(i=1;i<=l;i++)print i,a[i]}' test 
1 ab
2 cad
3 cd
4 sun

asorti also accepts a second array, asorti(array1,array2). array1 is left unchanged, and array2 receives the sorted indices of array1 as its values:

kl@ubuntu:~/scripts$ awk '{a[$0]++}END{l=asorti(a,b);for(i=1;i<=l;i++)print i,b[i],a[b[i]]}' test 
1 ab 2
2 cad 1
3 cd 3
4 sun 1

Two array-based ways to remove duplicate lines from test:

awk '!($0 in a){a[$0];print}' test
awk '!a[$0]++' test
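
The same idea works on a single field instead of the whole line, which keeps only the first record for each key. A sketch with made-up records in which the id 101 appears twice; the array name seen is arbitrary.

```shell
data='101,John Doe
102,Jason Smith
101,John D.'

# seen[$1]++ is 0 (false) the first time an id appears, so !seen[$1]++ is true
# only for the first record with that id.
first_per_id=$(printf '%s\n' "$data" | awk -F',' '!seen[$1]++')
echo "$first_per_id"
```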

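15、Awk String Functions

gsub(r,s)            replace every match of r in $0 with s
gsub(r,s,t)          replace every match of r in t with s
index(s,t)           return the position of the first occurrence of string t in s, or 0 if t does not occur
length(s)            return the length of s
match(s,r)           return the position in s where the regular expression r first matches, or 0 if there is no match
split(s,a,fs)        split s into array a using fs as the separator; return the number of elements
sprintf(fmt,exp)     return exp formatted according to fmt
sub(r,s)             replace the first match of r in $0 with s
substr(s,p)          return the part of s that starts at position p
substr(s,p,n)        return the part of s that starts at position p and is n characters long

Examples:

File test: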

M.Tansley       05/99   48311   Green   8       40      44
J.Lulu          06/99   48317   green   9       24      26
P.Bunny         02/99   48      Yellow  12      35      28
J.Troll         07/99   4842    Brown-3 12      26      26
L.Tansley       05/99   4712    Brown-2 12      30      28

gsub(r,s):

kl@ubuntu:~/scripts$ awk 'gsub(4842,4899) {print $0}' test 
J.Troll         07/99   4899    Brown-3 12      26      26

gsub(r,s,t) (modifying $2 makes awk rebuild $0, so the fields in the output are joined by single spaces):

kl@ubuntu:~/scripts$ awk 'gsub(9,6,$2) {print $0}' test 
M.Tansley 05/66 48311 Green 8 40 44
J.Lulu 06/66 48317 green 9 24 26
P.Bunny 02/66 48 Yellow 12 35 28
J.Troll 07/66 4842 Brown-3 12 26 26
L.Tansley 05/66 4712 Brown-2 12 30 28

index(s,t): 

kl@ubuntu:~/scripts$ awk '{print index($0,"n")}' test 
5
37
5
37
5

 length(s):

kl@ubuntu:~/scripts$ awk '{print length($1)}' test 
9
6
7
7
9

 match(s,r):

match returns 0 when nothing matches; otherwise it returns the position in s where the pattern r first matches:

kl@ubuntu:~/scripts$ awk '$1=="J.Lulu" {print match($1,/u/)}' test 
4
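
Besides returning the position, match also sets the built-in variables RSTART and RLENGTH, which can be passed to substr to extract the matched text. A small sketch on a made-up input line:

```shell
# Find the first run of digits and cut it out of the record.
digits=$(printf 'Brown-3 and Brown-12\n' |
  awk '{ if (match($0, /[0-9]+/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
echo "$digits"
```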

split(s,a,fs)

kl@ubuntu:~/scripts$ awk 'BEGIN {print split("123#456#789",myarray,/#/);print myarray[1],myarray[2],myarray[3]}'
3
123 456 789

sprintf(fmt,exp):
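
sprintf takes the same format strings as printf but returns the formatted string instead of printing it. A minimal sketch with made-up values (a left-aligned string, a right-aligned integer, and a number rounded to two decimals):

```shell
line=$(awk 'BEGIN {
  s = sprintf("%-8s|%5d|%.2f", "J.Lulu", 24, 26 / 7)  # nothing is printed yet
  print s
}')
echo "$line"
```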

sub(r,s):

The following examples use the optional third argument, sub(r,s,t), which names the string to modify:

kl@ubuntu:~/scripts$ awk '$1=="J.Troll" {sub(/26/,29,$0)} {print $0}' test 
M.Tansley       05/99   48311   Green   8       40      44
J.Lulu          06/99   48317   green   9       24      26
P.Bunny         02/99   48      Yellow  12      35      28
J.Troll         07/99   4842    Brown-3 12      29      26
L.Tansley       05/99   4712    Brown-2 12      30      28
kl@ubuntu:~/scripts$ awk '$1=="J.Troll" {sub("26",29,$7)} {print $0}' test 
M.Tansley       05/99   48311   Green   8       40      44
J.Lulu          06/99   48317   green   9       24      26
P.Bunny         02/99   48      Yellow  12      35      28
J.Troll 07/99 4842 Brown-3 12 26 29
L.Tansley       05/99   4712    Brown-2 12      30      28

substr(s,p)  :

kl@ubuntu:~/scripts$ awk '$1=="L.Tansley" {print substr($1,1)}' test 
L.Tansley
kl@ubuntu:~/scripts$ awk '$1=="L.Tansley" {print substr($1,1,5)}' test 
L.Tan
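
Combined with length, substr can also take characters from the end of a string. A short sketch that extracts the last three characters of the name used above:

```shell
# length(s) - 2 is the position of the third character from the end.
tail3=$(awk 'BEGIN { s = "L.Tansley"; print substr(s, length(s) - 2) }')
echo "$tail3"
```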

 

posted @ 2013-11-02 09:20  brickisku