Do specific task for some lines, copy all other lines - awk

I have a set of specific lines in a file where I would like to make some changes, and I want to just copy all other lines. I imagine the code should look something like this:
awk -v imin=5 -v imax=10 -v shift=5.54545 '{
(NR==5){ print $1+5,$2; }
(NR==7){ print $1+shift,$2; }
((NR>imin)&&(NR<imax)){ print $1,$2,$3+shift; }
(NR == EVERY_OTHER_LINE){ print $0; }
}' input_data.dat
But I don't know how to express (NR == EVERY_OTHER_LINE), meaning every line except the ones handled above.
The best I have found is here, but it is not really what I want.
https://unix.stackexchange.com/questions/563455/awk-print-all-remaining-lines

I would follow the following approach:
(NR==5){ print $1+5,$2; next }
(NR==7){ print $1+shift,$2; next }
((NR>imin) && (NR<imax)){ print $1,$2,$3+shift; next}
1;
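We introduce the next statement so that the special lines are not printed a second time by the final 1 rule.
This is, however, a bit convoluted, so the following method might be better for this particular case. Note that here the last matching rule wins, whereas with next the first matching rule does: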
{line=$0}
(NR==5) { line=$1+5 OFS $2 }
(NR==7) { line=$1+shift OFS $2 }
((NR>imin)&&(NR<imax)){ line = $1 OFS $2 OFS $3+shift }
{print line}
Of course, if records 5 and 7 have only 2 fields, and the records between imin and imax (with imin>7) have 3 fields, then it is even easier:
(NR==5){ $1+=5 }
(NR==7){ $1+=shift }
(NR>imin)&&(NR<imax){ $3+=shift }
1
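As a quick check of the next-based approach, here is a minimal sketch on a made-up 4-line sample (the data, the special line numbers 2 and 3, and the value of shift are assumptions for illustration only):

```shell
# Hypothetical sample data; only the structure mirrors the answer above.
out=$(printf '1 10\n2 20\n3 30\n4 40\n' |
awk -v shift=0.5 '
NR==2 { print $1+5, $2; next }       # special line: print it, then skip the remaining rules
NR==3 { print $1+shift, $2; next }   # another special line
1                                    # every other line is copied unchanged
')
printf '%s\n' "$out"
```

Lines 2 and 3 come out modified, and lines 1 and 4 are copied as they are, each exactly once.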

Related

Multiple options of nf for identify duplicate in different positions awk?

I hope you are well. I am writing to ask whether it is possible to do something like this in awk.
I need something like several cases depending on NF:
for NF == 7 the PK is $1,$5, but for NF == 8 it is $1,$6.
INPUT
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
DESIRED OUTPUTS
file PK_OK_1
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3
file DUPLICATE_PK_1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4
file PK_OK_2
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
file DUPLICATE_PK_2
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
file INVALID_LENGHT
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
My code is something like this (NOM_ARCH is a variable):
BEGIN { FS="|";
OFS="|"
}
NF == 7 {
if (!seen[$1,$5]) {
print > NOM_ARCH".PK_OK_1"; seen[$1,$5]=1
}else{
print > NOM_ARCH".DUPLICATE_PK_1"
}
next
}
NF == 8 {
if (!seen[$1,$6]) {
print > NOM_ARCH".PK_OK_2"; seen[$1,$6]=1
}else{
print > NOM_ARCH".DUPLICATE_PK_2"
}
next
}
{ print > NOM_ARCH".INVALID_LENGHT" }
With your shown samples, please try the following awk code.
awk '
BEGIN{ FS=OFS="|" }
{
key=""    ## reset so that other NF values do not reuse the previous key
if(NF==7){ key=($1 FS $5) }
if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
arr1[key]++
next
}
NF==7{
outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
NF==8{
outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
NF<7 || NF>8{
outputFile="file_INVALID_LENGHTH"
}
{
print > (outputFile)
}
' Input_file Input_file
OR use following code without ternary operators as per OP's request:
awk '
BEGIN{ FS=OFS="|" }
{
key=""    ## reset so that other NF values do not reuse the previous key
if(NF==7){ key=($1 FS $5) }
if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
arr1[key]++
next
}
NF==7{
if(arr1[key]==1){ outputFile="file.PK_OK_1" }
else { outputFile="file_DUPLICATE_PK_1"}
}
NF==8{
if(arr1[key]==1){ outputFile="file.PK_OK_2" }
else { outputFile="file_DUPLICATE_PK_2"}
}
NF<7 || NF>8{
outputFile="file_INVALID_LENGHTH"
}
{
print > (outputFile)
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
## Starting awk program from here.
awk '
## Starting BEGIN section of this program from here, setting FS and OFS to | here.
BEGIN{ FS=OFS="|" }
##Starting main program from here.
{
##Resetting key so that a line with any other NF does not reuse the previous key.
key=""
##Checking condition if NF is 7 then set key to $1 FS $5.
if(NF==7){ key=($1 FS $5) }
##Checking condition if NF is 8 then set key to $1 FS $6.
if(NF==8){ key=($1 FS $6) }
}
##Checking condition FNR==NR which will be TRUE when 1st time Input_file is being read.
FNR==NR{
##Creating array arr1 with index of key and keep increasing same key value with 1 here.
arr1[key]++
##next will skip all further statements from here.
next
}
##Checking condition if NF==7 then do following.
NF==7{
##Setting outputFile(where contents will be written to), either file.PK_OK_1 OR file_DUPLICATE_PK_1 depending upon value of arr1.
##Basically it uses ternary operators ? and :
##Statements after ? will executed if condition arr1[key]==1 is TRUE.
##Statements after : will be executed if condition arr1[key]==1 is FALSE.
outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
##Checking condition if NF==8 then do following.
NF==8{
##Setting outputFile(where contents will be written to), either file.PK_OK_2 OR file_DUPLICATE_PK_2 depending upon value of arr1.
outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
##Checking condition if NF is less than 7 or greater than 8 then do following.
NF<7 || NF>8{
##Setting outputFile(where contents will be written to) to file_INVALID_LENGHTH here.
outputFile="file_INVALID_LENGHTH"
}
{
##Printing current line to outputFile(already set its value above)
print > (outputFile)
}
##Mentioning Input_file names here.
' Input_file Input_file
Normally I'd recommend a first pass with sort and uniq -c for efficiency, but I wrote most of this under a wrong assumption about the requirements and have only now tweaked it for the real ones, so here's how to do it all in one awk script:
$ cat tst.awk
BEGIN {
FS=OFS="|"
map[7] = 1
map[8] = 2
}
{ key = $1 FS $(NF-2) FS NF }
NR==FNR {
cnt[key]++
next
}
{
if ( NF in map ) {
sfx = ( cnt[key]>1 ? "DUPLICATE_PK" : "PK_OK" ) "_" map[NF]
}
else {
sfx = "INVALID_LENGTH"
}
print > (nom_arch "." sfx)
}
$ awk -v nom_arch='foo' -f tst.awk file file
$ head foo.*
==> foo.DUPLICATE_PK_1 <==
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4
==> foo.DUPLICATE_PK_2 <==
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
==> foo.INVALID_LENGTH <==
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
==> foo.PK_OK_1 <==
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3
==> foo.PK_OK_2 <==
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
I corrected the spelling of LENGTH above.
Note that NF is included in key = $1 FS $(NF-2) FS NF. This avoids a potential case pointed out by @rowboat, where a line with 7 fields has the same $1 and $(NF-2) as a line with 8 fields; without NF in the key we would count that key twice when it should be 2 separate counts of 1.
We could have used NF-6 instead of map[NF] when setting the sfx but the map[] is useful for identifying valid NF values too and there may be other values of NF in future for which the sfx can't be determined by just subtracting 6.
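To see why NF belongs in the key, here is a small sketch with two made-up records (hypothetical data, not from the question) whose $1 and $(NF-2) agree although one has 7 fields and the other 8:

```shell
out=$(printf 'A|b|c|d|K|x|y\nA|b|c|d|e|K|x|y\n' |
awk -F'|' '
{
    plain[$1 FS $(NF-2)]++          # key without NF: both records collide
    withnf[$1 FS $(NF-2) FS NF]++   # key with NF: one entry per record
}
END {
    for (k in plain)  p++
    for (k in withnf) w++
    print p, w                      # number of distinct keys in each table
}')
printf '%s\n' "$out"
```

The table keyed without NF ends up with a single entry counted twice, while the table keyed with NF keeps the two records apart.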
This uses GNU awk for multidimensional arrays:
# classify.awk
BEGIN {
FS = "|"
ok[7] = ".PK_OK_1"; dup[7] = ".DUPLICATE_PK_1"
ok[8] = ".PK_OK_2"; dup[8] = ".DUPLICATE_PK_2"
}
NF < 7 || NF > 8 {
print > nom_arch".INVALID_LENGTH"
next
}
{
pk = $1 SUBSEP (NF == 7 ? $5 : $6)
count[NF][pk]++
lines[NF][pk] = lines[NF][pk] $0 ORS
}
END {
for (nf in count)
for (pk in count[nf]) {
outfile = nom_arch (count[nf][pk] == 1 ? ok[nf] : dup[nf])
sub(ORS"$", "", lines[nf][pk])
print lines[nf][pk] > outfile
}
}
Then this will produce the desired output files
gawk -f classify.awk -v nom_arch="foo" file
The awk SUBSEP variable is used in array keys when you do something like
var[x,y] = 10
awk uses the value of SUBSEP to join the values of x and y.
The default SUBSEP value is octal value 034, an ASCII character unlikely to appear in text data.
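A minimal sketch of how such a composite key can be taken apart again (the array name var comes from the line above; part and n are hypothetical names):

```shell
out=$(awk 'BEGIN {
    var["x", "y"] = 10                  # stored under the single key "x" SUBSEP "y"
    for (k in var) {
        n = split(k, part, SUBSEP)      # recover the components of the key
        print n, part[1], part[2], var[k]
    }
    if (("x", "y") in var)              # membership test that does not create the entry
        print "found"
}')
printf '%s\n' "$out"
```

This relies only on standard awk features, so it should behave the same in gawk and in other POSIX awks.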
This version is more portable and does not require GNU awk:
BEGIN {
FS = "|"
ok[7] = ".PK_OK_1"; dup[7] = ".DUPLICATE_PK_1"
ok[8] = ".PK_OK_2"; dup[8] = ".DUPLICATE_PK_2"
}
NF < 7 || NF > 8 {
print > (nom_arch".INVALID_LENGTH")
next
}
{
pk = NF SUBSEP $1 SUBSEP (NF == 7 ? $5 : $6)
count[pk]++
lines[pk] = lines[pk] $0 ORS
}
END {
for (pk in count) {
sub(ORS"$", "", lines[pk])
nf = pk; sub(SUBSEP".*", "", nf)
outfile = nom_arch (count[pk] == 1 ? ok[nf] : dup[nf])
print lines[pk] > outfile
}
}
If it's ok to put the first occurrence of a dup in with the OK's, then one pass is easy.
NOM_ARCH=/tmp/mytest
awk -v nom_arch="$NOM_ARCH" ' BEGIN { FS=OFS="|" }
{ if (NF ~ /^[78]$/) { key=($1 FS $(NF-2)) } else { print > (nom_arch ".INVALID_LENGTH"); next; }
print > ( nom_arch "." ( seen[key]++ ? "DUPLICATE_PK" : "PK_OK" ) "_" NF-6 ) } ' file
Compare AAA|XXX|YYY|DDD|444|20210115|JONH2 and AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY, which land in the OK files as the first hit, while subsequent dups get seen and directed elsewhere. Note that it might still be faster to shift those records between smaller files on a second pass after the fact.
Personally, I'd probably just split the records to key-sorted files by NF first. Then the second pass each is easy.
NOM_ARCH=/tmp/mytest
# this pre-sort is likely the slow part, though smaller files and in parallel
awk 'BEGIN { FS=OFS="|" } { k2=NF-2; print | "sort -t\\| -k1,1 -k"k2","k2">NF"NF; }' file
shopt -s extglob; cat NF!([78]) > "$NOM_ARCH.INVALID_LENGTH" &
for f in NF[78]; do
awk -v nom_arch="$NOM_ARCH" '
BEGIN { FS=OFS="|"; lastkey=""; lastrec=""; }
END { if(""!=lastrec){print lastrec>f} }
{ key=($1 FS $(NF-2));
if ( key==lastkey ) {
f=(nom_arch".DUPLICATE_PK_"NF-6);
if(""!=lastrec){print lastrec>f}
print $0>f;
lastrec="";
} else {
if(""!=lastrec){print lastrec>f}
f=(nom_arch".PK_OK_"NF-6);
lastkey=($1 FS $(NF-2));
lastrec=$0;
}
}' "$f" &
done
wait
Now your data should be sorted to files. This likely reorders the records in those files (see below), so if that matters you should add sorts to those outputs as well.
mytest.PK_OK_1:
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|CCC|DDD|111|20220129|JONH1
mytest.PK_OK_2:
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
mytest.DUPLICATE_PK_1:
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|XXX|YYY|DDD|444|20210115|JONH2
mytest.DUPLICATE_PK_2:
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
mytest.INVALID_LENGTH:
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
This uses more disk space but less memory than an internal lookup table, and is likely a lot slower.
YMMV.

In AWK, skip the rest of the current action?

Thanks for looking.
I have an AWK script with something like this:
/^test/{
if ($2 == "2") {
# What goes here?
}
# Do some more stuff with lines that match test, but $2 != "2".
}
NR>1 {
print $0
}
I'd like to skip the rest of the action, but process the rest of the patterns/actions on the same line.
I've tried return but this isn't a function.
I've tried next but that skips the rest of the patterns/actions for the current line.
For now I've wrapped the rest of the ^test action in the if statement's else, but I was wondering if there was a better approach.
Not sure this matters but I am using gawk on OSX, installed via brew (for better compatibility with my target OS).
Update (w/solution):
Edits: Expanded code sample based on @karakfa's answer.
BEGIN{
keepLastLine = 1;
}
/^test/ && !keepLastLine{
printLine = 1;
print $0;
next;
}
/^test/ && keepLastLine{
printLine = 0;
next;
}
/^foo/{
# This is where I have the rest of my logic (approx 100 lines),
# including updates to printLine and keepLastLine
}
NR>1 {
if (printLine) {
print $0
}
}
This will work for me; I even like it better than what I was thinking of.
However, I do wonder: what if my keepLastLine condition was only accessible in a for loop?
I gather from what @karakfa has said that there isn't a control structure for exiting only an action and continuing with other patterns, so that would have to be implemented with a flag of some sort (not unlike @RavinderSingh13's answer).
If I understood correctly, could you please try the following. I am creating a variable named found here, which is set when the condition inside the test block (2nd field is 2) is TRUE. When it is set, the rest of the statements in the test block are not executed. The variable is also reset before each line is processed.
awk '
{
found=""
}
/^test/{
if ($2 == "2") {
# What goes here?
found=1
}
if(!found){
# Do some more stuff with lines that match test, but $2 != "2".
}
}
NR>1 {
print $0
}' Input_file
Testing of code here:
Let's say following is the Input_file:
cat Input_file
file
test 2 file
test
abcd
Running the following code gives the output below. Whether or not a line starting with test has $2==2, the statements outside the test block are still executed for it.
awk '
{
found=""
}
/^test/{
if ($2 == "2") {
print "# What goes here?"
found=1
}
if(!found){
print "Do some more stuff with lines that match test, but $2 != 2"
}
}
NR>1 {
print $0
}' Input_file
# What goes here?
test 2 file
Do some more stuff with lines that match test, but $2 != 2
test
abcd
The magic keyword you're looking for is else:
/^test/{ if($2==2) { } # do something
else { } # do something else
}
NR>1 # {print $0} is implied.
If for some reason you don't want to use else, just move the condition up one level (flatten the hierarchy):
/^test/ && $2==2 { } # do something
/^test/ && $2!=2 { } # do something else
# other action{statement}s
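For the case raised in the question's update, where the condition is only known inside a for loop, a flag combined with break is one possible sketch (skip is a hypothetical variable name and the sample lines are made up):

```shell
out=$(printf 'test 1 2 3\ntest 4 5 6\n' |
awk '
/^test/ {
    skip = 0
    for (i = 2; i <= NF; i++)
        if ($i == "2") { skip = 1; break }   # condition only known inside the loop
    if (!skip) print "processing: " $0       # rest of the action, guarded by the flag
}
{ print "seen: " $0 }                         # later rules still run for every line
')
printf '%s\n' "$out"
```

The first line trips the flag, so only the later rule reports it; the second line goes through both.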

awk: extract data from a column by name rather than position

I have a text file that is comma delimited. The first line is a list of field names, and subsequent lines contain data. I'll get new versions of the file, and I want to extract all the values from a particular column by name rather than by column number. (I.e. the column I want may be in different positions in different versions of the file.)
For example, here are two files:
foo,bar,interesting,junk
1,2,gold,ramjet
2,25,diamonds,superfluous
and
foo,bar,baz,interesting,junk,morejunk
5,3,smurf,platinum,garbage,scrap
6,2.5,mushroom,sodium,liverwurst,eew
I'd like a single script that will go through multiple files, extracting the minerals in the "interesting" column. :-)
What I've got so far is something that works on ONE file, but I know that awk is more elegant than this. How do I clean this up and make it work on multiple files at once?
BEGIN {
FS=",";
}
NR == 1 {
for(i=1; i<=NF; i++) {
if($i=="interesting") {
col=i;
}
}
}
NR > 1 {
print $col;
}
You're pretty darn close already. Just use FNR instead of NR, for "File NR".
#!/usr/bin/awk -f
BEGIN { FS="," }
FNR==1 {
for (col=1;col<=NF;col++)
if ($col=="interesting")
next
}
{ print $col }
Or if you like:
#!/usr/bin/awk -f
BEGIN { FS="," }
FNR==1 { for (col=1;$col!="interesting";col++); next }
{ print $col }
Or if you prefer one-liners:
$ awk -F, -v txt="interesting" 'FNR==1{for(c=1;$c!=txt;c++);next} {print $c}' file1 file2
Of course, be careful that you actually have the specified column, or you may find yourself in an endless loop. You can probably figure out the extra condition that saves you from that risk.
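One possible form of that extra condition, as a sketch: bound the loop by NF and remember whether the column was found (the message text and the use of /dev/stderr are assumptions; /dev/stderr is handled by gawk and most current awks):

```shell
# Hypothetical input that lacks the wanted column.
out=$(printf 'foo,bar,junk\n1,2,ramjet\n' |
awk -F, -v txt="interesting" '
FNR==1 {
    for (c=1; c<=NF && $c!=txt; c++);    # stop at NF, so a missing header cannot loop forever
    if (c > NF) {                        # not found in this file
        print "no column named " txt > "/dev/stderr"
        c = 0
    }
    next
}
c { print $c }                           # print only when the column exists
' 2>&1)
printf '%s\n' "$out"
```

With c set to 0 for a file that lacks the column, the data lines of that file are skipped instead of printing the wrong field.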
Note that in awk, you only need to terminate commands with semicolons if they are followed by another command. Thus, you would do this:
command1; command2
But you can drop the semicolon if you separate commands with newlines:
command1
command2
Do it this way:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { for (i=1;i<=NF;i++) f[$i]=i; next }
{ print $(f["interesting"]) }
$ awk -f tst.awk file1 file2
gold
diamonds
platinum
sodium
Creating a name->value array is always the best approach when it's applicable. It keeps every part of the code simple and decoupled from the rest of the code, and it sets you up for doing other things like changing the order of the fields when you output the results, e.g.:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { for (i=1;i<=NF;i++) f[$i]=i; next }
{ print $(f["junk"]), $(f["interesting"]), $(f["bar"]) }
$ awk -f tst.awk file1 file2
ramjet,gold,2
superfluous,diamonds,25
garbage,platinum,3
liverwurst,sodium,2.5

Sum up from line "A" to line "B" from a big file using awk

aNumber|bNumber|startDate|timeZone|duration|currencyType|cost|
22677512549|778|2014-07-02 10:16:35.000|NULL|NULL|localCurrency|0.00|
22675557361|76457227|2014-07-02 10:16:38.000|NULL|NULL|localCurrency|10.00|
22677521277|778|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|0.00|
22676099496|77250331|2014-07-02 10:16:42.000|NULL|NULL|localCurrency|1.00|
22667222160|22667262389|2014-07-02 10:16:43.000|NULL|NULL|localCurrency|10.00|
22665799922|70110055|2014-07-02 10:16:45.000|NULL|NULL|localCurrency|20.00|
22676239633|433|2014-07-02 10:16:48.000|NULL|NULL|localCurrency|0.00|
22677277255|76919167|2014-07-02 10:16:51.000|NULL|NULL|localCurrency|1.00|
This is the input I have in a csv file (a sample of millions of lines).
I want to sum up the duration based on date.
My concern is that I want to sum up only the first 1000000 lines.
The awk program I'm using is:
test.awk
BEGIN { FS = "|" }
NR>1 && NR<=1000000
FNR == 1{ next }
{
sub(/ .*/,"",$3)
key=sprintf("%10s",$3)
duration[key] += $5 } END {
printf "%-10s %16s,"dAccused","Duration"
for (i in duration) {
printf "%-4s %16.2f i,duration[i]
}}
I run my script as
$awk -f test.awk 'file'
The script doesn't honor my condition NR>1 && NR<=1000000.
Any suggestions, please?
You're looking for this:
BEGIN { FS = "|" }
1 < NR && NR <= 1000000 {
sub(/ .*/, "", $3)
key = sprintf("%10s",$3)
duration[key] += $5
}
END {
printf "%-10s %16s\n", "dAccused", "Duration"
for (i in duration) {
printf "%-4s %16.2f\n", i, duration[i]
}
}
A lot of errors become obvious with proper indentation.
The reason you saw 1,000,000 lines was due to this:
NR>1 && NR<=1000000
That is a condition with no action block. The default action is to print the current record if the condition is true. That's why you see a lot of awk one-liners end with the number 1.
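A small sketch of the difference, on made-up input: the bare condition prints the matching record, while attaching an action block (here a hypothetical counter n) makes it select records without printing them:

```shell
# Bare pattern: the default action prints the record.
printed=$(printf 'a\nb\nc\n' | awk 'NR>1 && NR<=2')
# Same pattern with an action block: nothing is printed per record.
counted=$(printf 'a\nb\nc\n' | awk 'NR>1 && NR<=2 { n++ } END { print n+0 }')
printf '%s\n%s\n' "$printed" "$counted"
```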
You didn't post any expected output and your duration field is always NULL so it's still not clear what you really want output, but this is probably the right approach:
$ cat tst.awk
BEGIN { FS = "|" }
NR==1 { for (i=1;i<NF;i++) f[$i] = i; next }
{
sub(/ .*/,"",$(f["startDate"]))
sum[$(f["startDate"])] += $(f["duration"])
}
NR==1000000 { exit }
END { for (date in sum) print date, sum[date] }
$ awk -f tst.awk file
2014-07-02 0
Instead of discarding your header line, it uses it to create an array f[] that maps the field names to their order in each line so instead of having to hard-code that duration is field 4 (or whatever) you just reference it as $(f["duration"]).
Any time your input file has a header line, don't discard it - use it so your script is not coupled to the order of fields in your input file.

Merging rows in a file | Performance Improvement

I have a file in which I have to merge 2 rows on the basis of:
- Common sessionID
- Immediate next matching pattern (GX with QG)
file1:
session=001,field01,name=GX1_TRANSACTION,field03,field04
session=001,field91,name=QG
session=001,field01,name=GX2_TRANSACTION,field03,field04
session=001,field92,name=QG
session=004,field01,name=GX1_TRANSACTION,field03,field04
session=002,field01,name=GX1_TRANSACTION,field03,field04
session=002,field01,name=GX2_TRANSACTION,field03,field04
session=002,field92,name=QG
session=003,field91,name=QG
session=003,field01,name=GX2_TRANSACTION,field03,field04
session=003,field92,name=QG
session=004,field91,name=QG
session=004,field01,name=GX2_TRANSACTION,field03,field04
session=004,field92,name=QG
I have created an awk script (I am new and learnt awk only from this portal) which creates my desired output.
Output1
session=001,field01,name=GX1_TRANSACTION,field03,field04,session=001,field91,name=QG
session=001,field01,name=GX2_TRANSACTION,field03,field04,session=001,field92,name=QG
session=002,field01,name=GX1_TRANSACTION,field03,field04,NOMATCH-QG
session=002,field01,name=GX2_TRANSACTION,field03,field04,session=002,field92,name=QG
session=003,field01,name=GX2_TRANSACTION,field03,field04,session=003,field92,name=QG
session=004,field01,name=GX1_TRANSACTION,field03,field04,session=004,field91,name=QG
session=004,field01,name=GX2_TRANSACTION,field03,field04,session=004,field92,name=QG
Output2: Pending
session=003,field91,name=QG
Awk:
{
if($0~/name=GX1_TRANSACTION/ || $0~/GX2_TRANSACTION/) {
if($1 in ccr)
print ccr[$1]",NOMATCH-QG";
ccr[$1]=$0;
}
if($0~/name=QG/) {
if($1 in ccr) {
print ccr[$1]","$0;
delete ccr[$1];
}
else {
print $0",NOUSER" >> Pending
}
}
}
END {
for (i in ccr)
print ccr[i]",NOMATCH-QG"
}
Command:
awk -F"," -v Pending=t -f a.awk file1
But the issue is that my "file1" is really big, so I want to improve the performance of this script. Is there any way I can improve its performance?
There are a couple of changes that may lead to small improvements in speed, and if not may give you some ideas for future awk scripts.
Don't "manually" test every line if you don't have to - raise the name= tests to the main awk loop. Currently your script checks $0 up to three times per line for a name= match.
Since you're using , as the FS, test the corresponding field ($3) instead of $0. It only saves a few leading chars of pattern matching in your example data.
Here's a refactored a.awk:
$3~/name=GX[12]_TRANSACTION/ {
if($1 in ccr)
print ccr[$1]",NOMATCH-QG";
ccr[$1]=$0;
}
$3~/name=QG/ {
if($1 in ccr) {
print ccr[$1]","$0;
delete ccr[$1];
}
else {
print $0",NOUSER" >> Pending
}
}
END { for (i in ccr) print ccr[i]",NOMATCH-QG" }
I've also condensed the GX pattern match to one regex. I get the same output as your example.
In any program, IO (e.g. print statements) is usually the most real-time intensive operation. In awk there's an operation that's even slower, though, and that's string concatenation. Because awk doesn't require you to pre-allocate memory for strings, the memory gets allocated dynamically so then when you increase the length of a string, it must get dynamically re-allocated. So, you can speed up your program by removing the string concatenations, e.g. for all those hard-coded ","s you're printing instead of just setting/using the OFS.
I haven't really thought about the logic of your overall approach but there's a couple of other tweaks you could try:
BEGIN{ FS=OFS="," }
NF {
if ($3 ~ /name=GX[12]_TRANSACTION/) {
if($1 in ccr) {
print ccr[$1], "NOMATCH-QG"
}
ccr[$1]=$0
}
else {
if($1 in ccr) {
print ccr[$1], $0
delete ccr[$1]
}
else {
print $0, "NOUSER" >> Pending
}
}
}
END {
for (i in ccr)
print ccr[i], "NOMATCH-QG"
}
Note that by setting FS in the script you no longer need to use -F"," on the command line.
Are you sure you want >> instead of > on the print to "Pending"? Those 2 constructs don't mean the same in awk as they do in shell.
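A sketch of that difference (the temp file and sample lines are made up): in awk, > truncates the file only when it is first opened during the run and later prints to the same name are appended, while >> never truncates, so content from earlier runs is kept:

```shell
tmp=$(mktemp)

printf 'old\n' > "$tmp"
printf 'a\nb\n' | awk -v out="$tmp" '{ print > out }'    # truncated once, then both lines written
first=$(cat "$tmp")

printf 'old\n' > "$tmp"
printf 'a\nb\n' | awk -v out="$tmp" '{ print >> out }'   # existing content is kept
second=$(cat "$tmp")

rm -f "$tmp"
printf '%s\n---\n%s\n' "$first" "$second"
```

So with >> the Pending file keeps growing across runs unless it is removed first, while with > each run starts it afresh.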