How to get rows with values more than 2 in at least 2 columns? - awk

I am trying to extract rows where the value is >= 2 in at least two columns. My input file looks like this:
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2
and my awk script looks like this:
cat input_file | awk 'BEGIN{FS=",";OFS=","};{count>=0;for(i=2; i<4; i++) {if($i!=0) {count++}};if (count>=2){print $0}}'
which doesn't give me the expected output, which should be:
gain,top1,sos1,pho1
ATC3,6,6,0
What is the problem with this script? Thanks.

awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
Or the one below, which prints the line and moves on to the next one as soon as 2 values are found (reasonably faster):
awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
Explanation
awk -F, '                 # call awk and set the field separator to a comma
FNR>1{                    # skip the header: run this block only for records after the first one in the current file
  f=0;                    # reset the counter f for each record
  for(i=2; i<=NF; i++)    # loop from the second field to the last field of the record/line/row
  {
    if($i>=2)f++;         # if the field value is greater than or equal to 2, increment f
    if(f>=2)              # once 2 such values have been found
    {
      print;              # print the record/line/row
      next                # we have enough, so go to the next line
    }
  }
}FNR==1                   # for the first record the condition is true, so awk performs the default action, print $0, that is, the current record/line/row
' file
Input
$ cat file
gain,top1,sos1,pho1
ATC1,0,0,0
ATC2,1,2,1
ATC3,6,6,0
ATC4,1,1,2
Output-1
$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++)if($i>=2)f++}f>=2 || FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
Output-2 (Reasonably faster)
$ awk -F, 'FNR>1{f=0; for(i=2; i<=NF; i++){ if($i>=2)f++; if(f>=2){ print; next} } }FNR==1' file
gain,top1,sos1,pho1
ATC3,6,6,0
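As for what goes wrong in the script from the question, a few things stand out: `count>=0` is a comparison rather than an assignment, so the counter is never reset between rows; the loop condition `i<4` stops before the last column; and the test `$i!=0` counts any non-zero value rather than values of at least 2. A minimal corrected sketch that keeps the structure of that script (the inline sample data and the `result` shell variable are only there for illustration):

```shell
result=$(printf '%s\n' 'gain,top1,sos1,pho1' 'ATC1,0,0,0' 'ATC2,1,2,1' 'ATC3,6,6,0' 'ATC4,1,1,2' |
awk 'BEGIN { FS = OFS = "," }
NR == 1 { print; next }            # pass the header through unchanged
{
    count = 0                      # reset the counter for every row
    for (i = 2; i <= NF; i++)      # visit every column after the first
        if ($i >= 2) count++       # count values of at least 2
    if (count >= 2) print
}')
printf '%s\n' "$result"
```

With the sample input this should print the header and the ATC3 row only.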

hacky awk, handles the header as well
$ awk -F, '($2>=2) + ($3>=2) + ($4>=2) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
or,
$ awk -F, 'function ge2(x) {return x>=2?1:0}
ge2($2) + ge2($3) + ge2($4) > 1' file
gain,top1,sos1,pho1
ATC3,6,6,0
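The header gets through because its fields do not look like numbers, so awk compares them with 2 as strings, and a letter sorts after the digit "2". A small sketch to make that visible, assuming the C locale for the string comparison (the variable names `header_hits` and `row_hits` are made up for this example):

```shell
# Count how many of fields 2-4 satisfy ">= 2" for the header and for one data row.
header_hits=$(printf '%s\n' 'gain,top1,sos1,pho1' |
    LC_ALL=C awk -F, '{ n = ($2 >= 2) + ($3 >= 2) + ($4 >= 2); print n }')
row_hits=$(printf '%s\n' 'ATC2,1,2,1' |
    LC_ALL=C awk -F, '{ n = ($2 >= 2) + ($3 >= 2) + ($4 >= 2); print n }')
echo "header: $header_hits, ATC2: $row_hits"
```

So the trick relies on the header names starting with characters that sort after "2"; a header field such as 1abc would not count.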

@pali, try this; it should be much faster:
awk '{Q=$0;}(gsub(/,[2-9]/,"",Q)>=2) || FNR==1' Input_file
Here I copy the line into a variable named Q and then globally replace every match of a comma followed by a digit from 2 to 9 with the empty string. gsub() returns the number of substitutions it made, so the line is printed if that count is greater than or equal to 2, or if it is the first line (the header). Note that the regex only looks at the first digit after each comma, so it is reliable only for single-digit values: 10, for example, is not counted.
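Since the pattern `,[2-9]` only inspects the first digit after each comma, values such as 10 or 12 are not counted. A possible variant, assuming the fields are non-negative numbers without leading zeros (the sample rows A and B are invented for this sketch):

```shell
sample='gain,top1,sos1,pho1
A,10,12,0
B,1,1,9'
# Original pattern: only the first digit after each comma is inspected.
single_digit=$(printf '%s\n' "$sample" |
    awk '{ Q = $0 } (gsub(/,[2-9]/, "", Q) >= 2) || FNR == 1')
# Variant: also accept two or more digits starting with 1-9 (so 10, 12, 100, ...).
multi_digit=$(printf '%s\n' "$sample" |
    awk '{ Q = $0 } (gsub(/,([2-9]|[1-9][0-9])[0-9]*/, "", Q) >= 2) || FNR == 1')
printf 'original:\n%s\nvariant:\n%s\n' "$single_digit" "$multi_digit"
```

The original pattern should print only the header here, while the variant should also print row A.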

Related

assigning a var inside AWK for use outside awk

I am using ksh on AIX.
I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script.
The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values.
sample value of $principal_diagnosis0
R65.20|A41.9|G30.9|F02.80
I've tried:
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}'
but I get this message : awk: Field $i is not correct.
My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9).
echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}'
gets me the output of :
A41.9
G30.9
F02.80
So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk.
Thanks!
To answer your specific question:
$ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
$ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}')
$ echo "$foo"
A41.9
The above will work with any awk. You can do it more briefly with GNU awk if you have it:
foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
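For comparison, here is a sketch that stays close to the field loop from the question. awk has no `echo` statement (`print` is the equivalent), and a variable assigned inside awk exists only in the awk process, so the value is handed back to the shell through command substitution. The regex is anchored as `/^R/` because `!~ "R"` would reject a code with an R anywhere in it. The name `primdiag` is taken from the question:

```shell
principal_diagnosis0='R65.20|A41.9|G30.9|F02.80'
# Command substitution captures whatever awk prints; exit stops at the first hit.
primdiag=$(printf '%s\n' "$principal_diagnosis0" |
    awk -F'|' '{ for (i = 1; i <= NF; i++) if ($i !~ /^R/) { print $i; exit } }')
echo "$primdiag"
```

After this, `$primdiag` can be used anywhere in the rest of the ksh script.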
you can make FS and OFS do all the hard work :
echo "${principal_diagnosis0}" |
mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS=
A41.9
——————————————————————————————————————————
another slightly different variation of the same concept — overwriting fields but leaving OFS as is :
gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF'
A41.9
this works, because when you break it out :
gawk -F'^.*R[^|]+[|]|[|].+$' '
{ print NF
} $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }'
3
1 2 1 A41.9
you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2,
but since final NF is now reduced to just 1, it would print it out correctly instead of 2 copies of it
…… which means you can also make it $0 = $2, as such :
gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$--NF'
A41.9
——————————————————————————————————————————
a 3rd variation, this time using RS instead of FS :
mawk NR==2 RS='^.*R[^|]+[|]|[|].+$'
A41.9
——————————————————————————————————————————
and if you REALLY don't wanna mess with FS/OFS/RS, use gsub() instead :
nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)'
A41.9

selecting columns in awk discarding corresponding header

How do I properly select columns in awk after some processing? Here is my file:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but expected is (note B is now the third column as we added linenumber as first):
linenumber;B
1;6
2;5
3;2
[fixed and revised]
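To print the line number together with the third header field and the first data field (note that this is not the linenumber;B output shown in the question), use: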
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;C
1;9
2;8
3;1
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
FS=OFS=";"
}
{
print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And if you want only two columns in your output, why do you print three (FNR-1, $1 and $3)? Finally, the reason your output field separators are spaces instead of the expected ; is simply that you did not specify the output field separator (OFS). You can do this with a command-line variable assignment (OFS=\;), as shown in the second and third versions below, but also with the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods but they don't matter here).
[EDIT]: see a generic solution at the end.
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3
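To illustrate the three ways of setting OFS mentioned above, here is a small sketch with a one-line sample input (the shell variables `a` to `e` are just names for this example). All three give the same result for the main rules; one difference is that an assignment given as a command-line operand is not yet in effect when the BEGIN block runs, while a `-v` assignment is:

```shell
input='9;6;7'
a=$(printf '%s\n' "$input" | awk -F';' -v OFS=';' '{ print $1, $3 }')
b=$(printf '%s\n' "$input" | awk -F';' 'BEGIN { OFS = ";" } { print $1, $3 }')
c=$(printf '%s\n' "$input" | awk -F';' '{ print $1, $3 }' OFS=';')
# Inside BEGIN, the -v value is visible but the operand assignment is not.
d=$(awk -v OFS=';' 'BEGIN { print "x", "y" }' </dev/null)
e=$(awk 'BEGIN { print "x", "y" }' OFS=';' </dev/null)
printf '%s\n' "$a" "$b" "$c" "$d" "$e"
```

The first three commands should all print 9;7, the fourth x;y, and the last one x y with the default space.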

Substitute patterns using a correspondence file

I am trying to replace some words in a file with others using sed or awk.
My initial fileA has this format:
>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843
I have a second fileB with the correspondences like this:
NP_001006347.1 GeneA
XP_003643123.1 GeneB
I am trying to substitute the name in fileA to get this output:
>Genus_species|GeneA|transcript-60_2900.p1:1-843
I was thinking of using awk or sed to do something like
's/$patternA/$patternB/' with a while read l, but how do I indicate which of patterns 1 and 2 in fileB is which? I also tried this, but it does not work.
sed "$(sed 's/^\([^ ]*\) \(.*\)$/s#\1#\2#g/' fileB)" fileA
Awk may be able to do the job more easily?
Thanks
It is easier to do this in awk:
awk -v OFS='|' 'NR == FNR {
map[$1] = $2
next
}
{
for (i=1; i<=NF; ++i)
$i in map && $i = map[$i]
} 1' file2 FS='|' file1
>Genus_species|GeneA|transcript-60_2900.p1:1-843
Written and tested with your shown samples. Assuming there is only one NP_digits.digits entry per line in your Input_fileA, you could also try the following.
awk '
FNR==NR{
arr[$1]=$2
next
}
match($0,/NP_[0-9]+\.[0-9]+/) && ((val=substr($0,RSTART,RLENGTH)) in arr){
$0=substr($0,1,RSTART-1) arr[val] substr($0,RSTART+RLENGTH)
}
1
' Input_fileB Input_fileA
Using awk
awk -F'[| ]' -v OFS='|' 'NR==FNR { arr[$1]=$2; next } $2 in arr { $2=arr[$2] } 1' fileB fileA
Set the field delimiter to a space or |. fileB is processed first (NR==FNR), building an array called arr with the first field as the index and the second field as the value. For the second file, if the second field has an entry in arr, it is replaced with the value from the array (OFS is set to | so that the rebuilt line keeps its delimiters). The trailing 1 is shorthand for printing every line.
You are looking for the join command which can be used like this:
join -1 1 -2 2 -t'|' <(tr ' ' '|' < fileB | sort -t'|' -k1,1) <(sort -t'|' -k2,2 fileA)
This performs a join on column 1 of fileB with column 2 of fileA. The tr was used such that fileB also uses | as delimiter because join requires it to be equal on both files.
Note that the output columns are not in the order you specified. You can swap by piping the output into awk.
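A sketch of that final swap, using temporary files instead of process substitution so that it also runs in a plain POSIX shell. join prints the join field first, then the remaining fields of its first input, then those of its second input, so the awk step only has to reorder three fields. The temporary directory and the `swapped` variable are assumptions made for this example:

```shell
tmp=$(mktemp -d)
printf '%s\n' '>Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843' > "$tmp/fileA"
printf '%s\n' 'NP_001006347.1 GeneA' 'XP_003643123.1 GeneB' > "$tmp/fileB"
# Sort each input on its join field only.
tr ' ' '|' < "$tmp/fileB" | LC_ALL=C sort -t'|' -k1,1 > "$tmp/b.sorted"
LC_ALL=C sort -t'|' -k2,2 "$tmp/fileA" > "$tmp/a.sorted"
# Joined line: id|gene|>species|transcript, reordered to >species|gene|transcript.
swapped=$(LC_ALL=C join -t'|' -1 1 -2 2 "$tmp/b.sorted" "$tmp/a.sorted" |
    awk -F'|' -v OFS='|' '{ print $3, $2, $4 }')
printf '%s\n' "$swapped"
rm -rf "$tmp"
```

Keep in mind that join, used this way, drops fileA lines that have no match in fileB, and that the output follows the sorted order rather than the original order of fileA.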

Awk command to compare specific columns in file1 to file2 and display output

File1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
File2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
Desired output:
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
I tried an awk NR==FNR command, but it displays only the last match.
It reads every line and checks whether columns 1 and 3 of file1 exist in file2:
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0296,444,0.1
Using awk and sort
awk 'BEGIN{
# set input and output field separator
FS=OFS=","
}
# read first file f1
# index key field1 and field3 of file1 (f1)
{
k=$1 FS $3
}
# save 2nd and last field of file1 (f1) in array a, key being k
FNR==NR{
a[k]=(k in a ? a[k] RS:"") $2 OFS $NF;
# stop processing go to next line
next
}
# read 2nd file f2 from here
# the 2nd and 3rd fields of file2 (f2) are used as the key
{
k=$2 FS $3
}
# if key exists in array a
k in a{
# split array value by RS row separator, and put it in array t
split(a[k],t,RS);
# iterate array t, print and sort
for(i=1; i in t; i++)
print $0,t[i] | "sort -t, -nk5"
}
' f1 f2
Test Results:
$ cat f1
111,222,560,0.7
111,333,560,0.2
111,444,560,0.1
$ cat f2
2017,111,560,0.0537
2018,111,560,0.0296
2019,111,560,0.0624
$ awk 'BEGIN{FS=OFS=","}{k=$1 FS $3}FNR==NR{a[k]=(k in a ? a[k] RS:"") $2 OFS $NF; next}{k=$2 FS $3}k in a{split(a[k],t,RS); for(i=1; i in t; i++)print $0,t[i] | "sort -t, -nk5" }' f1 f2
2017,111,560,0.0537,222,0.7
2018,111,560,0.0296,222,0.7
2019,111,560,0.0624,222,0.7
2017,111,560,0.0537,333,0.2
2018,111,560,0.0296,333,0.2
2019,111,560,0.0624,333,0.2
2017,111,560,0.0537,444,0.1
2018,111,560,0.0296,444,0.1
2019,111,560,0.0624,444,0.1
The following awk may help you as well.
awk -F, '
FNR==NR{
a[FNR]=$0;
next
}
{
for(i=1;i<=length(a);i++){
print a[i] FS $2 FS $NF
}
}' Input_file2 Input_file1
An explanation of the code follows.
awk -F, '                       ##Set the field separator to a comma for all lines.
FNR==NR{                        ##FNR==NR is TRUE only while the first Input_file (File2) is being read.
                                ##Both FNR and NR count lines; FNR is RESET whenever a new file is opened, while NR keeps increasing until all Input_files are read.
  a[FNR]=$0;                    ##Store the current line in array a, indexed by FNR (the current line number).
  next                          ##next skips all further statements for this line.
}
{
  for(i=1;i<=length(a);i++){    ##Loop i from 1 to the length of array a; this runs while the 2nd Input_file is being read.
    print a[i] FS $2 FS $NF     ##Print element i of array a followed by the 2nd and last fields of the current line.
  }
}' File2 File1                  ##The Input_file names go here.
another one with join/awk
$ join -t, -j99 file2 file1 |
awk -F, -v OFS=, '$3==$6 && $4==$8 {print $2,$3,$4,$5,$7,$9}'
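The "only the last match" symptom usually appears when an array entry is overwritten for each line that shares a key (the asker's exact command is not shown, so this is an assumption). One way around it is to keep every line of file2 and loop over them for each line of file1, comparing the keys explicitly. A minimal sketch with the sample data from the question; the temporary directory, the array names `line` and `key`, and the `joined` variable are made up for the example:

```shell
tmp=$(mktemp -d)
printf '%s\n' '111,222,560,0.7' '111,333,560,0.2' '111,444,560,0.1' > "$tmp/file1"
printf '%s\n' '2017,111,560,0.0537' '2018,111,560,0.0296' '2019,111,560,0.0624' > "$tmp/file2"
joined=$(awk 'BEGIN { FS = OFS = "," }
NR == FNR {                       # first file on the command line (file2): keep every line
    line[++n] = $0
    key[n] = $2 FS $3             # its columns 2 and 3 form the key
    next
}
{                                 # second file (file1): emit one row per matching file2 line
    for (i = 1; i <= n; i++)
        if (key[i] == $1 FS $3)
            print line[i], $2, $NF
}' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$joined"
rm -rf "$tmp"
```

The rows come out grouped by file1 line, in the same order as the desired output, without needing a sort. The nested loop is quadratic, so for large files the keyed-array approach shown earlier is probably the better fit.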

awk to compare two file by identifier & output in a specific format

I have 2 large pipe-delimited files that I need to compare.
file 1
a||d||f||a
1||2||3||4
file 2
a||d||f||a
1||1||3||4
1||2||r||f
Now I want to compare the files and print the differences as follows: any value updated in file 2 should be printed as updated_value#oldvalue, and any new line added to file 2 should be printed as well.
So the desired output is: (only the updated & new data)
1||1#2||3||4
1||2||r||f
what I have tried so far is to get the separated changed values:
awk -F '[||]+' 'NR==FNR{for(i=1;i<=NF;i++)a[NR,i]=$i;next}{for(i=1;i<=NF;i++)if(a[FNR,i]!=$i)print $i"#"a[FNR,i]}' file1 file2 >output
But I want to print the whole line. How can I achieve that?
I would say:
awk 'BEGIN{FS=OFS="|"}
FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next}
{for (i=1; i<=NF; i+=2)
if (a[FNR,i] && a[FNR,i]!=$i)
$i=$i"#"a[FNR,i]
}1' f1 f2
This stores file1 in a matrix a[line number, column]. Then it compares each value with the corresponding value in file2.
Note that I am using the field separator | instead of || and looping in steps of two so that only the data fields are used (the fields in between are empty). I did this because, for example, gawk -F'||' '{print NF}' f1 printed just 1, meaning that FS was not interpreted as intended. I would be grateful if someone could point out the error here!
Test
$ awk 'BEGIN{FS=OFS="|"} FNR==NR {for (i=1;i<=NF;i+=2) a[FNR,i]=$i; next} {for (i=1; i<=NF; i+=2) if (a[FNR,i] && a[FNR,i]!=$i) $i=$i"#"a[FNR,i]}1' f1 f2
a||d||f||b#a
1||1#2||3||4
1||2||r||f
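Regarding the `-F'||'` puzzle: a field separator longer than one character is treated as a regular expression, and in a regex | is the alternation operator, so `||` does not mean two literal pipes. Putting each pipe in a bracket expression, `[|][|]`, makes them literal. The sketch below uses that separator and, as an assumption about what the question asks for, prints only lines that changed or are new in file2, comparing by line number as above (the `old` and `seen` arrays, the temporary directory and the shell variables are made up for this example):

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a||d||f||a' '1||2||3||4' > "$tmp/file1"
printf '%s\n' 'a||d||f||a' '1||1||3||4' '1||2||r||f' > "$tmp/file2"
# With literal double pipes as separator, each sample line has 4 fields.
nf=$(awk -F'[|][|]' '{ print NF; exit }' "$tmp/file1")
diff_out=$(awk 'BEGIN { FS = "[|][|]"; OFS = "||" }
NR == FNR {                                   # file1: remember every field by line and column
    for (i = 1; i <= NF; i++) old[FNR, i] = $i
    seen[FNR]
    next
}
!(FNR in seen) { print; next }                # line number not present in file1: new line
{
    changed = 0
    for (i = 1; i <= NF; i++)
        if ($i != old[FNR, i]) { $i = $i "#" old[FNR, i]; changed = 1 }
    if (changed) print                        # only lines with at least one update
}' "$tmp/file1" "$tmp/file2")
echo "fields per line: $nf"
printf '%s\n' "$diff_out"
rm -rf "$tmp"
```

With the sample files from the question this should give 4 fields per line and print only the updated second line and the new third line.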