AWK: Compare two columns conditionally in one file

I have a pipe (|) delimited file where $1 has IDs and there are values in $2 and $3. The file has ~5000 rows in it with each ID $1 repeated multiple times. The file looks like this
a|1|2
a|2|0
a|3|3
a|4|0
b|5|3
b|2|4
I am trying to print lines where the $2 on the current line is <= the maximum $3 seen for that ID ($1), so the output will be
a|1|2
a|2|0
a|3|3
b|2|4
Any lead on this would be highly appreciated! Thank you.

It sounds like you just want to, for each $1, print those lines where $2 is less than or equal to the max $3:
$ cat tst.awk
BEGIN { FS = "[|]" }
NR==FNR {                   # first pass: remember the largest $3 seen for each ID
    max[$1] = ( ($1 in max) && (max[$1] > $3) ? max[$1] : $3 )
    next
}
$2 <= max[$1]               # second pass: print lines whose $2 is within that per-ID max
$ awk -f tst.awk file file
a|1|2
a|2|0
a|3|3
b|2|4
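If you would rather not read the file twice, a single-pass variant is possible since ~5000 rows easily fit in memory. This is just an untested sketch of the same idea (buffer every line, then compare against the per-ID maximum at the end); the name tst1pass.awk is arbitrary:
$ cat tst1pass.awk
BEGIN { FS = "[|]" }
{
    lines[NR] = $0      # remember every line in input order
    ids[NR]   = $1
    vals[NR]  = $2 + 0  # force a numeric value for the comparison below
    max[$1]   = ( ($1 in max) && (max[$1] > $3) ? max[$1] : $3 )
}
END {
    for (i = 1; i <= NR; i++)
        if (vals[i] <= max[ids[i]])
            print lines[i]
}
$ awk -f tst1pass.awk file
The output should be the same as above; the trade-off is the memory used for the buffered lines versus reading the file a second time.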

Related

selecting columns in awk discarding corresponding header

How to properly select columns in awk after some processing. My file here:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
To get your expected output (the line number plus the original B column), use:
$ awk 'BEGIN {
    FS=OFS=";"
}
{
    print (FNR==1?"linenumber":FNR-1), $2
}' file
Output:
linenumber;B
1;6
2;5
3;2
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
    FS=OFS=";"
}
{
    print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why do you print three fields (FNR-1, $1 and $3)? Finally, the reason why your output field separators are spaces instead of the expected ; is simply that... you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also with the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods but they don't matter here).
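For quick reference, here is what the three methods look like side by side, applied to the same one-liner used below; they are interchangeable for this task and differ only in when the assignment takes effect:
# trailing command-line assignment: evaluated after BEGIN, before foo is read
awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
# -v option: evaluated before the program starts, so OFS is already set in BEGIN
awk -F\; -v OFS=\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' foo
# BEGIN block
awk 'BEGIN {FS=OFS=";"} FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' foo
All three should print the linenumber;B output shown below.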
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't need to keep the first field of the input file ($1), you might as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{
    printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
    for (i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
    printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
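As a usage example, passing cols='2' should reproduce the linenumber;B output asked for in the question:
$ awk -F\; -v cols='2' '
BEGIN { OFS = ";"; n = split(cols, c); }
{
    printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
    for (i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
    printf("\n");
}' foo
linenumber;B
1;6
2;5
3;2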

Trying to read data from two files, and subtract values from both files using awk

I have two files
0.975301988947238963 1.75276754663189283 2.00584
0.0457467532388459441 1.21307648993841410 1.21394
-0.664000617674924687 1.57872850852906366 1.71268
-0.812129324498058969 4.86617859243825635 4.93348
and
1.98005959631337536 -3.78935155011290536 4.27549
-1.04468782080821154 4.99192849476267053 5.10007
-1.47203672235857397 -3.15493073343947694 3.48145
2.68001948430755244 -0.0630730371855307004 2.68076
I want to subtract the column-3 values of the two files, row by row.
My first awk statement was
awk 'BEGIN {print "Test"} FNR>1 && FNR==NR { r[$1]=$3; next} FNR>1 { print $3, r[$1], (r[$1]-$3)}' zzz0.dat zzz1.dat
Test
5.10007 -5.10007
3.48145 -3.48145
2.68076 -2.68076
This suggests it does not recognize r[$1]=$3
I created an additional column xyz by
awk 'NR==1{$(NF+1)="xyz"} NR>1{$(NF+1)="xyz"}1' zzz0.dat
then
awk 'BEGIN {print "Test"} FNR>1 && FNR==NR { xyz[$4]=$3; next} FNR>1 { print $3, xyz[$4], (xyz[$4]-$3)}' zzz00.dat zzz11.dat
Test
5.10007 4.93348 -0.16659
3.48145 4.93348 1.45203
2.68076 4.93348 2.25272
This now shows three columns, but xyz[$4] prints only the value from the last row of the first file, instead of building up an array.
My real files have thousands of lines. How can I resolve this problem?
You can do it relatively easily using a numeric index for your array. For example:
awk 'NR==FNR {a[++n]=$3; next} o<n{++o; printf "%lf - %lf = %lf\n", a[o], $3, a[o]-$3}' file1 file2
That way you preserve the ordering of the records across files. Without a numeric index, the arrays are associative and there is no specific ordering preserved.
Example Use/Output
With your files in file1 and file2 respectively, you would have:
$ awk 'NR==FNR {a[++n]=$3; next} o<n{++o; printf "%lf - %lf = %lf\n", a[o], $3, a[o]-$3}' file1 file2
2.005840 - 4.275490 = -2.269650
1.213940 - 5.100070 = -3.886130
1.712680 - 3.481450 = -1.768770
4.933480 - 2.680760 = 2.252720
Let me know if that is what you intended or if you have any further questions. If I missed your intent, drop a comment and I will help further.
If the records are aligned in both files, the easiest is:
$ paste file1 file2 | awk '{print $3,$6,$3-$6}'
2.00584 4.27549 -2.26965
1.21394 5.10007 -3.88613
1.71268 3.48145 -1.76877
4.93348 2.68076 2.25272
If you're only interested in the difference, change the print to print $3-$6.

awk Compare 2 files, print match and print just 2 columns of the second file

I am a novice and I am sure it is a silly question, but I searched and didn't find an answer.
I want to select just 2 columns of my file2. I know how to select one column ($1) and all columns ($0). But if I want to show just columns 2, 3, ... from file2 in my file3, is that possible?
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $1; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
or
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $0; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
I just want $1 and $2 from file2, but this code doesn't work: I obtain a single column with the data from $1 and $2 concatenated.
awk -v RS='\r\n' 'BEGIN {FS=OFS=";"} FNR==NR {a[$2] = $1$2; next} {gsub(/_/,"-",$2);$2=toupper($2);print a[$2]?a[$2]:"NA",$0,a[$2]?a[$2]:"NA"}' $File2 $File1 > file3
Any solution??
awk -v RS='\r\n' ' # call awk and set the record separator (the input has CRLF line endings)
BEGIN {
    FS=OFS=";"     # set the input and output field separators
}
# While reading the first argument, that is File2
FNR==NR {
    # Save column2 and column3, separated by OFS (;),
    # from File2 (the first argument) in array a,
    # keyed by the second field/column of File2
    a[$2] = $2 OFS $3;
    # Stop processing this record and go to the next line of File2
    next
}
# From here onwards you are reading the second argument, that is File1
{
    # Global substitution:
    # replace every _ with a hyphen - in field2/column2
    gsub(/_/,"-",$2);
    # Uppercase field2/column2
    $2=toupper($2);
    # If field2 of the current file (File1) exists in array a,
    # which was built above from File2, print the saved value,
    # that is field2 and field3 of File2; otherwise print "NA".
    # Either way, the whole record of File1 follows.
    print ($2 in a ? a[$2] : "NA"), $0
}' $File2 $File1 > file3
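One detail to keep in mind: a[$2] carries two fields from File2, while the fallback is a single NA, so unmatched lines end up one column short. If you want a constant number of columns, you could print two placeholders instead, for example:
print ($2 in a ? a[$2] : "NA" OFS "NA"), $0
And if the two columns you actually want from file2 are $1 and $2 rather than $2 and $3, simply store a[$2] = $1 OFS $2 in the first block.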

How to get rows with values more than 2 in at least 2 columns?

I am trying to extract rows where the value is >=2 in at least two columns. My input file looks like this
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2
and my awk script looks like this
cat input_file | awk 'BEGIN{FS=",";OFS=","};{count>=0;for(i=2; i<4; i++) {if($i!=0) {count++}};if (count>=2){print $0}}'
which doesn't give me the expected output, which should be
gain,top1,sos1,pho1
ATC3,6,6,0
What is the problem with this script? Thanks.
(In your script, count>=0 is a comparison rather than a reset, so count is never set back to 0 for each line and keeps accumulating; the loop condition i<4 skips the last column; and you test $i!=0 instead of $i>=2.) Try:
awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
Or the one below, which prints and moves on to the next line as soon as it finds 2 qualifying values (reasonably faster):
awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
Explanation
awk -F, '                    # call awk and set the field separator to a comma
FNR>1{                       # skip the header: only test records after the first line of the current file
    f=0;                     # reset the per-line counter f
    for(i=2; i<=NF; i++)     # loop from the second field to the last field of the record/line/row
    {
        if($i>=2)f++;        # if the field value is greater than or equal to 2, increment f
        if(f>=2)             # as soon as we have found 2 such values
        {
            print;           # print the record/line/row
            next             # we have enough, move on to the next line
        }
    }
}FNR==1                      # FNR==1 is true only for the header line; the pattern's default action (print $0) prints it
' file
Input
$ cat file
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2
Output-1
$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
Output-2 (Reasonably faster)
$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
A hacky awk that handles the header as well:
$ awk -F, '($2>=2) + ($3>=2) + ($4>=2) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
or,
$ awk -F, 'function ge2(x) {return x>=2?1:0}
ge2($2) + ge2($3) + ge2($4) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
@pali: try this. Hopefully it is much faster:
awk '{Q=$0;}(gsub(/,[2-9]/,"",Q)>=2) || FNR==1' Input_file
Here I copy the line into a variable named Q (so the substitutions do not alter the line that gets printed), then use gsub on Q to globally delete every comma followed by a digit from 2 to 9; gsub returns the number of substitutions it made. If that count is at least 2, or if this is line 1 (the header), the line is printed. Note that this shortcut only works while the values are single digits: a value such as 10 or 12 starts with a 1 and would not be counted, even though it is >= 2.

How to select 2 fields if they meet requirements & length in awk or sed?

I want to select 2 fields and output them to a file:
field $1: select the line if it contains a # symbol (for email)
field $2: select the line if the field has a certain character length, i.e. 40
Only output if both requirements are met. How can I do this in awk or sed?
I was using this:
awk -F: '/\#/ {print $1 ":" $2 }' test.txt > file_output.txt
However, the /\#/ matches a # anywhere in the line, so it applies to both $1 and $2, which is not what I want.
Thanks,
Edit: here is an example:
email#some.com:123456789123456789123456789:blah:blah:blah
ignore:1234#56789
output needed:
email#some.com:123456789123456789123456789
You can use this:
awk -F: '{if ($1 ~ /\#/ && length($2) == 40) print $1 ":" $2 }' test.txt > file_output.txt
Test:
Sample file:
$ cat t
user#host1:0123456789012345678901234567890123456789
user#host2:0123456789012345678901234567890123456789
userhost3:0123456789012345678901234567890123456789
user#host4:012345677
awk output:
$ awk -F: '{if ($1 ~ /\#/ && length($2) == 40) print $0 }' t
user#host1:0123456789012345678901234567890123456789
user#host2:0123456789012345678901234567890123456789
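As a side note, the explicit if is not needed: the same test can be written as an awk pattern with an action, which should be equivalent here (and # does not need escaping in the regex):
awk -F: '$1 ~ /#/ && length($2) == 40 {print $1 ":" $2}' test.txt > file_output.txt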