How to use multiple passes with gawk? - awk

I'm trying to use GAWK from CYGWIN to process a csv file. Pass 1 finds the max value, and pass 2 prints the records that match the max value. I'm using a .awk file as input. When I use the text in the manual, it matches on both passes. I can use the IF form as a workaround, but that forces me to use IF inside every pattern match, which is kind of a pain. Any idea what I'm doing wrong?
Here's my .awk file:
pass == 1
{
    print "pass1 is", pass;
}
pass == 2
{
    if (pass == 2)
        print "pass2 is", pass;
}
Here's my output (input file is just "hello"):
hello
pass1 is 1
pass1 is 2
hello
pass2 is 2
Here's my command line:
gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt
I'd appreciate any help.

A (g)awk solution might look like this:
awk 'FNR == NR { print "1st pass"; next }
     { print "second pass" }' x.txt x.txt
(Replace awk with gawk if necessary.)
Let's say you wanted to find the maximum value in the first column of file x.txt and then print all lines which have this value in the first column. Your program might look like this (thanks to Ed Morton for a tip, see comment):
awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
$1==max' x.txt x.txt
The output for x.txt:
6,5
2,6
5,7
6,9
is
6,5
6,9
How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
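You can watch the two counters diverge with a throwaway one-liner like this (it just prints both values for every input line):
awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' x.txt x.txt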

So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.
But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)
awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile
Or, spaced out for easier reading:
BEGIN {
    FS = ","
}
$1 > max {
    delete list        # empty the array
    n = 0              # reset the array counter
    max = $1           # set a new max
}
max == $1 {
    list[++n] = $0     # record the line in our array
}
END {
    for (i=1; i<=n; i++) {   # print the array in the order the lines were found
        print list[i]
    }
}
With the same input data that F.Knorr tested with, I get the same results.
The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.
This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on IO.
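If you save the spaced-out version as, say, onepass.awk (the name is arbitrary), the input file then only needs to appear once on the command line:
gawk -f onepass.awk x.txt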

The issue here is that newlines matter to awk.
# This does what I should have done:
pass==1 {print "pass1 is", pass;}
pass==2 {if (pass==2) print "pass2 is", pass;}
# This is the code in my question:
# When pass == 1, print the record (a bare pattern gets the default action: print)
pass==1
# A bare action has no pattern, so it runs for every record
{print "pass1 is", pass;}
# When pass == 2, print the record (a bare pattern gets the default action: print)
pass==2
# This bare action also runs for every record
{if (pass==2) print "pass2 is", pass;}
Using pass==1, pass==2 isn't as elegant, but it works.
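For what it's worth, the action can still span multiple lines; the rule is just that the opening brace must sit on the same line as its pattern:
pass == 2 {
    print "pass2 is", pass
}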

Related

AWK calculate percentages for each input file and output summary to single file

I have hundreds of csv files with the same format. I want to 'summarise' each file by counting occurrences of the word "Correct" in column 3 and calculating the percentage of "Correct"s per file (i.e. "Correct"s / total number of rows in that file). I am currently doing this for each file with a shell 'for-loop', but that isn't ideal: it launches a separate awk process for every file.
Minimal reproducible example:
cat file1.csv
id,prediction,evaluation
1,high,Correct
2,low,Correct
3,high,Incorrect
4,low,Incorrect
cat file2.csv
id,prediction,evaluation
1,high,Correct
2,low,Correct
3,high,Correct
4,low,Incorrect
Correct answer for each individual file:
awk 'BEGIN{FS=OFS=","; print "model,total_correct,accuracy"} NR>1{n++; if($3 == "Correct"){correct++}} END{print FILENAME, correct, correct / n}' file1.csv
model,total_correct,accuracy
file1.csv,2,0.5
awk 'BEGIN{FS=OFS=","; print "model,total_correct,accuracy"} NR>1{n++; if($3 == "Correct"){correct++}} END{print FILENAME, correct, correct / n}' file2.csv
model,total_correct,accuracy
file2.csv,3,0.75
My desired outcome:
model,total_correct,accuracy
file1.csv,2,0.5
file2.csv,3,0.75
Thanks for any advice.
With GNU awk you can try the following code, written and tested with the shown samples. It uses ENDFILE to make life easy. Two extra conditions are added: first, n is only incremented for non-empty lines; second, when computing the accuracy, a guard prints N/A instead of raising a division-by-zero error in case a file contains no data rows. NR>1 is also changed to FNR>1, since NR is a cumulative count across all files, while FNR restarts from the beginning of each input file.
awk '
BEGIN{
    FS=OFS=","
    print "model,total_correct,accuracy"
}
FNR>1{
    if(NF)              { n++ }
    if($3 == "Correct") { correct++ }
}
ENDFILE{
    if(n) { printf("%s,%d,%.2f\n", FILENAME, correct, correct / n) }
    else  { printf("%s,%d,N/A\n", FILENAME, correct) }
    correct=n=0
}
' *.csv
With the standard awk, you can increment counts in an array indexed by filename and whether the third column is "Correct", and iterate through the filenames in the end to output the statistics:
awk '
BEGIN{ FS=OFS=","; print "model,total_correct,accuracy" }
FNR>1{ ++r[FILENAME, $3=="Correct"] }
END{
    for(i=1; i<ARGC; ++i){
        f=ARGV[i]
        print f, c=r[f,1], c/(r[f,0]+c)
    }
}' *.csv

awk choose a line with $1 present in a file and output with a changed field

I've tried to use Awk to do the following:
I have a large txt file with first column the name of a gene and different values, essentially numeric, in each column.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great but now I need to keep all rows (in the same order) but modify these genes to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if it is different (the gene is not in my list), I want to keep the whole line, otherwise I want to keep the whole line but modify the value in one column by a given number.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace the value with 0 if it becomes negative, but that I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify: (I put commas as the field separator here, but the real file is tab-separated)
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output: (I subtract a tenth of the 4th column from the 3rd column when the gene in the first column matches a gene in my separate txt file; otherwise I keep the full line unchanged)
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rewritten as just something. So a bare 1 (which is always true) is a common idiom for "print all lines that reach this point in the script without hitting next".
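For example, a program consisting of nothing but 1 prints every input line, like cat:
awk '1' file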
Rounding a floating-point number can be done with sprintf("%1.0f", n) which correctly rounds up if the fraction is bigger than 0.5 (int(n) would always round down).
awk 'BEGIN { FS=OFS="\t" }
     FILENAME=="gene_list" { arr[$1]; next }
     $1 in arr { x = sprintf("%1.0f", $3-($4/10))
                 if (x+0 < 0) x = 0
                 print $1, $2, x, $4; next }
     1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
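If you want to convince yourself of the rounding behaviour, here is a throwaway comparison of sprintf and int:
awk 'BEGIN { print sprintf("%1.0f", 2.6), int(2.6) }'   # prints: 3 2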

Delete every line if occurrence found

I have a file with this format content:
1 6 8
1 6 9
1 12 20
1 6
2 8
2 9
2 12
2 20
2 35
I want to delete all the lines if the number (from the 2nd or 3rd column, but not from the 1st) is found in the following lines, whether it is in the 2nd or 3rd column, including the line where the initial number is found.
I should have this as an output:
2 35
I've tried using:
awk '{for(i=2;i<=NF;i++){if($i in a){next};a[$i]}} 1'
but it doesn't seem to work. What is wrong?
One-pass awk that hashes all the records to r[NR] and keeps another array a[$i] for the values seen in fields $2..$NF.
awk '{
    for(i=2;i<=NF;i++)      # iterate fields starting from the second
        if($i in a) {       # if field value was seen before
            delete r[a[$i]] # delete the related record
            a[$i]=""        # clear a
            f=1             # flag up
        } else {            # if it was not seen before
            a[$i]=NR        # add record number to a
            r[NR]=$0
        }
    if(f!=1)                # if flag was not raised
        r[NR]=$0            # store record by record number
    else                    # if it was raised
        f=""                # flag down
}
END {
    for(i=1;i<=NR;++i)
        if(i in r)
            print r[i]      # output remaining records
}' file
Output:
2 35
The simplest way is a double-pass algorithm where you read your file twice.
The idea is to store all values in an array a and count how many times they appear. If a value appears 2 or more times, it means you have found more than a single entry and you should not print the line.
awk '(NR==FNR){a[$2]++; if(NF>2) a[$3]++; next}
     (NF==2) && (a[$2]==1);
     (NF==3) && (a[$2]==1 && a[$3]==1)' <file> <file>
In practice, you should avoid expressions such as a[var]==1 if you are not sure whether var is in the array, as merely referencing a[var] creates that array element. However, since we never increment it afterwards, it is fine here.
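You can see that element-creation side effect directly: merely reading a["idx"] in the comparison below creates the element, so the in test afterwards succeeds:
awk 'BEGIN { x = (a["idx"] == 1); print ("idx" in a) }'   # prints 1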
If you want to achieve the same thing with more than three fields you can do:
awk '(NR==FNR){for(i=2;i<=NF;++i) a[$i]++; next }
{for(i=2;i<=NF;++i) if(a[$i]>1) next }
{print}' <file> <file>
While both these solutions read the file twice, you can also store the full file in memory and read the file only a single time. This, however, is exactly the same algorithm:
awk '{for(i=2;i<=NF;++i) a[$i]++; b[NR]=$0}
     END{ for(j=1;j<=NR;++j) {
            $0=b[j]; keep=1
            for(i=2;i<=NF;++i) if(a[$i]>1) { keep=0; break }
            if(keep) print
          }
     }' <file>
Comment: this single-pass solution is very simple and stores the full file in memory. The solution of James Brown is very clever: it removes entries from memory as soon as they are not needed anymore. A slightly shorter version is:
awk '{ for(i=2;i<=NF;++i) if ($i in a) delete b[a[$i]]; else { a[$i]=NR; b[NR]=$0 }}
END { for(n=1;n<=NR;++n) if(n in b) print b[n] }' <file>
Note: you should never strive for the shortest solution, but for the most readable one!
You could also try the following.
awk '
FNR==NR{
    for(i=2;i<=NF;i++){
        a[$i]++
    }
    next
}
(NF==2 && a[$2]==1) || (NF==3 && a[$2]==1 && a[$3]==1)
' Input_file Input_file
Output will be as follows.
2 35
$ cat tst.awk
NR==FNR {
cnt[$2]++
cnt[$3]++
next
}
cnt[$2]<2 && cnt[$NF]<2
$ awk -f tst.awk file file
2 35
This might work for you (GNU sed):
sed -r 'H;s/^[0-9]+ +//;G;s/\n(.*\n)/\1/;h;$!d;s/^([^\n]*)\n(.*)/\2\n \1/;:a;/^[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/^[0-9]+ +[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/\n/P;:b;s/^[^\n]*\n//;ta;d' file
This is not a serious solution however it demonstrates what can be achieved using only matching and substitution.
The solution makes a copy of the original file and, whilst doing so, accumulates all numbers from the second and possibly third fields of each record in a separate line, which it maintains at the head of the copy.
At the end of the file, the first line of the copy contains all the pertinent keys and if there are duplicate keys then any line in the file that contains such a key is deleted. This is achieved by moving the keys (the first line) to the end of the file and matching the second (and possibly third) fields of each record on those keys.

Gawk: Insert line using lookup data from another file

I'm trying to insert lines into a file, where the data being inserted is based on data stored in another file. I've tried this in both Sed and Awk, but can't figure out in either how to access the second file. This is probably a stretch for Sed, perhaps less so for Awk?
The main file:
# alpha --
some data
some more data
# beta --
some data
some more data
# gamma --
some data
some more data
The lookup file:
alpha This is a description of alpha
gamma This guys description
delta And a third description
The result should look like this:
# alpha --
Description = This is a description of alpha
some data
some more data
# beta --
some data
some more data
# gamma --
Description = This guys description
some data
some more data
Notice that the lookup file may not have a description for the item, and that's ok; the "Description = " line will just be omitted.
I figured this much out in Awk, but don't know how to reference the lookup file:
awk '{
    if ($0 ~ /^# [^ ]* --/) {
        print $0;
        print "Description = ";   # How to look up $2's description??
    } else {
        print $0;
    }
}' <file1.txt
How can I obtain the description from the second file using Awk? Or is there a better tool for this? Thanks!
Another similar awk:
$ awk 'NR==FNR {k=$1; sub(/^\S+\s+/,"Description = "); dict[k]=$0; next}
       1;
       /^#/ {if($2 in dict) print dict[$2]}' dict file
You could do something like this, supplying both files on the awk command line in the logical order (descriptions first, so it can read and store them, followed by the data that needs them inserted):
$ awk '(NR == FNR) {
      desc[$1] = $2
      for (i=3; i<=NF; i++) {
          desc[$1] = desc[$1] " " $i
      }
  }
  (NR > FNR) {
      print
      if (/^#/) {
          print "Description = " desc[$2]
      }
  }' desc.txt main.txt
Which produces this output given your sample file contents:
# alpha --
Description = This is a description of alpha
some data
some more data
# beta --
Description =
some data
some more data
# gamma --
Description = This guys description
some data
some more data
Explanation:
The awk variable NR contains the Number of Records seen so far. Normally a record is a line (although you can change the record separator), so this is effectively the current line number, counted continuously across all the files being processed. In this case its value will run from 1 to 12.
The variable FNR (File Number of Records) works the same way, but resets to 1 at the start of each new file. So in this case its value will run from 1 to 3 and then from 1 to 9.
By comparing these two values, the program can determine which file is currently being processed. If NR and FNR are the same, we know we're in the first file, and use the contents of the line to populate the associative array desc. The first field ($1) is the key; we concatenate the rest of the fields together to form the value.
If NR is not equal to FNR (it can only be greater, never less), we know we're in the second file. In that case, we first print the line (which we always do, so we make it unconditional instead of repeating the statement). Then we check whether we need to append a description; if so, we look it up in the desc array, using $2 (the second whitespace-separated field on the line, the first being the "#") as the lookup key.
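If you also want to honour the question's note that a missing description should be omitted entirely (the output above prints an empty "Description =" for beta), a small guard on the second block does it, reusing the same desc array:
(NR > FNR) {
    print
    if (/^#/ && ($2 in desc)) {   # only print a description when one exists
        print "Description = " desc[$2]
    }
}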

The meaning of "a" in an awk command?

I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using command line, so I'm just trying to figure things out, thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found among the keys. This makes storing $0 in a relatively expensive: it keeps a copy of the whole first file in memory (effectively twice if there's only one word on each line, since the keys then duplicate the values). It is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
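For example, this variant stores just a flag instead of the whole line (same output, less memory):
awk 'FNR==NR { a[$1] = 1; next } ($2 in a)' FILELIST.TXT FILEIN.*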
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
Here a is not a command but an awk array; it could just as well be named arr:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array; in your code,
FNR==NR{ a[$1]=$0;next }
creates an array called "a" with indexes taken from the first column of the first input file.
Each element's value is set to the whole current record ($0).
The next statement forces awk to immediately stop processing the current record and go on to the next record.
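A minimal illustration of next: in the one-liner below, next stops processing of the first record before the print rule can run, so the first line is skipped:
awk 'NR==1 { next } { print }' file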