Linux awk: match three columns across two files and append the matching lines to a new file
I have a file that looks like this:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
tig00000005 28923 29432 XP_012148395.1 NW_003797444.1 LOC100881617 PREDICTED: eukaryotic translation initiation factor 4 gamma 3-like isoform X12
tig00000005 32415 34324 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
The second file looks like this:
tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker gene 16764 17237 . + . ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
tig00000005 maker gene 25472 26900 . + . ID=snap_masked-tig00000005-processed-gene-0.5;Name=snap_masked-tig00000005-processed-gene-0.5
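I want to match columns 1, 2 and 3 of the first file against columns 1, 4 and 5 of the second file. When all three match, the line from the second file should be appended to the corresponding line from the first file, like this: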
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
Solution:
awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2
Output:
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
tig00000005 24600 25138 XP_012143166.1 NW_003797196.1 LOC100881279 PREDICTED: ankyrin-2 isoform X2 tig00000005 maker gene 24600 25138 . - . ID=snap_masked-tig00000005-processed-gene-0.10;Name=snap_masked-tig00000005-processed-gene-0.10
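For readers who want to see how the one-liner works, here is a commented sketch of the same two-pass idea, run on a small excerpt of the sample data. The scratch directory `workdir` and the array name `seen` are illustrative names, not part of the original answer. The sketch also uses awk's comma subscripts (`seen[$1, $2, $3]`), which join the key parts with `SUBSEP` instead of a space; for whitespace-separated data like this it should behave the same as the space-joined key.

```shell
# Work in a scratch directory so the sample files do not clobber real data.
workdir=$(mktemp -d)

# Excerpt of the first file; the key is columns 1, 2 and 3.
cat > "$workdir/file1" <<'EOF'
tig00000005 15310 16162 XP_012153921.1 NW_003797090.1 LOC105664333 PREDICTED: elastin-like
tig00000005 23339 23974 XP_012152584.1 NW_003797083.1 LOC100878991 PREDICTED: LOW QUALITY PROTEIN
tig00000005 2685 4511 XP_012144644.1 NW_003797249.1 LOC105662970 PREDICTED: fibrinogen alpha chain-like isoform X2
EOF

# Excerpt of the second file; the key is columns 1, 4 and 5.
cat > "$workdir/file2" <<'EOF'
tig00000005 maker gene 15310 16162 . + . ID=snap_masked-tig00000005-processed-gene-0.2;Name=snap_masked-tig00000005-processed-gene-0.2
tig00000005 maker gene 16764 17237 . + . ID=snap_masked-tig00000005-processed-gene-0.3;Name=snap_masked-tig00000005-processed-gene-0.3
tig00000005 maker gene 23339 23974 . + . ID=snap_masked-tig00000005-processed-gene-0.4;Name=snap_masked-tig00000005-processed-gene-0.4
EOF

awk '
    # Pass 1: NR equals FNR only while the first file is being read.
    # Remember each whole line under its three-column key, then skip ahead.
    NR == FNR { seen[$1, $2, $3] = $0; next }

    # Pass 2: for each line of the second file, look up columns 1, 4 and 5.
    # On a hit, print the stored first-file line followed by the current line.
    ($1, $4, $5) in seen { print seen[$1, $4, $5], $0 }
' "$workdir/file1" "$workdir/file2" > "$workdir/file3"

cat "$workdir/file3"
```

Two behaviours follow from this structure: the output appears in the order of the second file, and if the first file contains the same key more than once, only the last such line is kept in the array.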
To write the result to a new file, just redirect the output:
awk 'NR==FNR{a[$1" "$2" "$3]=$0; next}; {if($1" "$4" "$5 in a){print a[$1" "$4" "$5],$0}}' file1 file2 > file3
