using awk to match and sum a file of multiple lines


I am trying to combine the lines in file.txt that match on $1 and then display the sum of $2 for those matches. Thank you :).
File.txt
ENSMUSG00000000001:001
ENSMUSG00000000001:002
ENSMUSG00000000001:003
ENSMUSG00000000002:003
ENSMUSG00000000002:003
ENSMUSG00000000003:002
Desired output
ENSMUSG00000000001 6
ENSMUSG00000000002 6
ENSMUSG00000000003 2
awk -F':' -v OFS='\t' '{x=$1;$1="";a[x]=a[x]$0}END{for(x in a)print x,a[x]}' file > output.txt

$ awk -F':' -v OFS='\t' '{sum[$1]+=$2} END{for (key in sum) print key, sum[key]}' file
ENSMUSG00000000001 6
ENSMUSG00000000002 6
ENSMUSG00000000003 2

{x=$1;a[x]=a[x] + $2} END{for(x in a)print x,a[x]}
Just a typo, I guess: instead of appending $0, add $2. That gives me the expected output. The $1="" is not necessary either. To make sure $2 is treated as a number, you may consider 1.0*$2.
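
Because `for (key in sum)` visits keys in an unspecified order, the output lines may not come out sorted. A minimal sketch, assuming sorted output is wanted and using the sample data from the question; the function name `sum_by_key` is made up for illustration:

```shell
# Hypothetical helper: sum the second ':'-separated field for each key in the first field.
sum_by_key() {
    awk -F':' -v OFS='\t' '
        { sum[$1] += $2 }                          # "001" is coerced to the number 1
        END { for (key in sum) print key, sum[key] }
    ' "$@" | sort                                  # for-in order is unspecified, so sort
}

# Sample data copied from the question.
printf '%s\n' \
    'ENSMUSG00000000001:001' \
    'ENSMUSG00000000001:002' \
    'ENSMUSG00000000001:003' \
    'ENSMUSG00000000002:003' \
    'ENSMUSG00000000002:003' \
    'ENSMUSG00000000003:002' | sum_by_key
```

With no arguments the helper reads standard input; pass a file name instead to read from a file.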

Related

awk to format file using a specific order

I am trying to format a tab-delimited file using awk and the command runs but no output results. The output is also tab-delimited. The format of the output is $1 $2 $2 $3 REF=$4;OBS=$5 $6. Maybe the awk is not the best approach as it seems like it should work. Thank you :).
file (~370 lines all in the below format)
chr4 70501545 rs28560191 C A UGT2A1;UGT2A2
desired output
chr4 70501545 70501545 rs28560191 REF=C;OBS=A UGT2A1;UGT2A2
awk
awk -F'\t' -v OFS='\t' '{print $1,$2,$2,$3,"REF="$4";""OBS="$5,$6}' file

You are forgetting the print statement.
awk '{ print $1 "\t" $2 "\t" $2 "\t" $3 "\t" "REF="$4";""OBS="$5 "\t" $6}' file
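
The same layout can also be written with printf, which keeps the whole output format in one string. This is only a sketch, assuming tab-delimited input with exactly the six fields shown in the question; `reformat_variants` is a hypothetical name:

```shell
# Hypothetical helper: emit $1 $2 $2 $3 REF=$4;OBS=$5 $6, tab-separated.
reformat_variants() {
    awk -F'\t' '{
        printf "%s\t%s\t%s\t%s\tREF=%s;OBS=%s\t%s\n", $1, $2, $2, $3, $4, $5, $6
    }' "$@"
}

# One sample line from the question, rebuilt with explicit tabs.
printf 'chr4\t70501545\trs28560191\tC\tA\tUGT2A1;UGT2A2\n' | reformat_variants
```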

Print default value if index is not in awk array

$ cat file1 #It contains ID:Name
5:John
4:Michel
$ cat file2 #It contains ID
5
4
3
I want to Replace the IDs in file2 with Names from file1, output required
John
Michel
NO MATCH FOUND
I need to expand the code below so that it prints the text NO MATCH FOUND.
awk -F":" 'NR==FNR {a[$1]=$2;next} {print a[$1]}' file1 file2
My current result:
John
Michel
<< empty line
Thanks,

You can use a ternary operator for this: print ($1 in a)?a[$1]:"NO MATCH FOUND". That is, if $1 is in the array, print it; otherwise, print the text "NO MATCH FOUND".
All together:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} {print ($1 in a)?a[$1]:"NO MATCH FOUND"}' file1 file2
John
Michel
NO MATCH FOUND
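
For a version that can be run as-is, here is a minimal sketch that recreates the two sample files in a temporary directory. The helper name `lookup_names` and the temporary paths are assumptions for illustration. The whole conditional is wrapped in an extra pair of parentheses, which sidesteps any doubt about how `print (...) ? ... : ...` is parsed:

```shell
# Hypothetical helper: replace IDs (second file) with names (first file, ID:Name).
lookup_names() {
    awk -F':' '
        NR == FNR { name[$1] = $2; next }    # first file: build the ID -> name map
        { print (($1 in name) ? name[$1] : "NO MATCH FOUND") }   # second file: look up
    ' "$1" "$2"
}

# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '5:John\n4:Michel\n' > "$dir/file1"
printf '5\n4\n3\n' > "$dir/file2"
lookup_names "$dir/file1" "$dir/file2"
rm -r "$dir"
```

The `in` test checks for the key without creating an empty array entry, which a plain `a[$1]` reference would do.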

You can test whether the index occurs in the array:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} $1 in a {print a[$1]; next} {print "NOT FOUND"}' file1 file2
John
Michel
NOT FOUND

If file2 contains only digits (no space at the end of the line):
awk -F ':' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2
If not:
awk -F '[:[:blank:]]' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2

group of columns in awk

The following awk statement is working as expected.
awk '{print $1, $2, $3}' test.txt
But how do I say that I need all the columns after the second column?
awk '{print $1, $2, $3 to $NF}' test.txt
I need all columns from third column till end of that line. There can be 2 to 10 columns and all are considered as a part of the last column.

If you just want fields $3 to $NF, the standard way would be a loop (for/while),
but for your requirement you could use:
awk '{$1=$2="";}sub("^ *","")'
for example:
kent$ seq -s' ' 10|awk '{$1=$2="";}sub("^ *","")'
3 4 5 6 7 8 9 10
if you want to "group" 100 fields into 3 groups: 1,2, 3-100:
awk '{x=$0;sub($1FS$2,"",x);gsub(FS,"",x);print $1,$2,x}'
same example:
kent$ seq -s' ' 10|awk '{x=$0;sub($1FS$2,"",x);gsub(FS,"",x);print $1,$2,x}'
1 2 345678910
hope it is what you want.

The intuitive way.
awk 'BEGIN{ORS=""} {for(i=3; i<=NF; i++) if(i != NF){print $i " "} else {print $i "\n"}}' test.txt

Some more:
awk '{$1=$2=x; $0=$0; $1=$1}1' file
awk '{$1=$1; sub($1 FS $2 FS,x)}1' file
To keep the spacing intact:
awk 'sub($1 "[ \t]*" $2 "[ \t]*",x)' file
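
The loop approach mentioned in the answers can also build the third group as a single string. A minimal sketch, assuming whitespace-separated input and a tab between the three output groups; `rest_from_third` is a hypothetical name:

```shell
# Hypothetical helper: print field 1, field 2, then fields 3..NF as one group.
rest_from_third() {
    awk -v OFS='\t' '{
        rest = ""
        for (i = 3; i <= NF; i++)
            rest = rest (i > 3 ? " " : "") $i   # join remaining fields with single spaces
        print $1, $2, rest
    }' "$@"
}

printf '1 2 3 4 5 6 7 8 9 10\n' | rest_from_third
```

Unlike the sub-based variants, this normalizes any run of blanks inside the third group to a single space.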

awk and log2 divisions

I have a tab delimited file that looks something like this:
foo 0 4
boo 3 2
blah 4 0
flah 1 1
I am trying to calculate the log2 ratio of the two columns for each row. My problem is division by zero.
What I have tried is this:
cat file.txt | awk -v OFS='\t' '{print $1, log($3/$2)/log(2)}'
when there is a zero as the denominator, the awk will crash. What I would want to do is some sort of conditional statement that would print an "inf" as the result when the denominator is equal to 0.
I am really not sure how to go about this?
Any help would be appreciated
Thanks

You can implement that as follows (with a few additional tweaks):
awk 'BEGIN{OFS="\t"} {if ($2==0) {print $1, "inf"} else {print $1, log($3/$2)/log(2)}}' file.txt
Explanation:
if ($2==0) {print $1, "inf"} else {...} - First check to see if the 2nd field ($2) is zero. If so, print $1 and inf and move on to the next line; otherwise proceed as usual.
BEGIN{OFS="\t"} - Set OFS inside the awk script; mostly a preference thing.
... file.txt - awk can read from files when you specify it as an argument; this saves the use of a cat process. (See UUCA)

awk -F'\t' '{print $1,($2 ? log($3/$2)/log(2) : "inf")}' file.txt
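
The sample data also contains a row with a zero in the third column, where log(0) comes up instead of a division by zero. A minimal sketch that handles both cases with pattern rules; printing "-inf" for a zero numerator is my assumption, not something the question asks for, and `log2_ratio` is a hypothetical name:

```shell
# Hypothetical helper: print name and log2($3/$2), guarding the zero cases.
log2_ratio() {
    awk -F'\t' -v OFS='\t' '
        $2 == 0 { print $1, "inf"; next }     # zero denominator, as asked in the question
        $3 == 0 { print $1, "-inf"; next }    # assumption: report log2(0) as -inf
        { print $1, log($3 / $2) / log(2) }   # log2(x) = ln(x) / ln(2)
    ' "$@"
}

# Sample rows from the question, rebuilt with explicit tabs.
printf 'foo\t0\t4\nboo\t3\t2\nblah\t4\t0\nflah\t1\t1\n' | log2_ratio
```

When both columns are zero, the first rule wins and the row is reported as "inf".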

How to print last two columns using awk

All I want is the last two columns printed.

You can make use of variable NF which is set to the total number of fields in the input record:
awk '{print $(NF-1),"\t",$NF}' file
this assumes that you have at least 2 fields.

awk '{print $NF-1, $NF}' inputfile
Note: this works only if at least two columns exist. On records with one column you will get a spurious "-1 column1"

@jim mcnamara: try using parentheses around NF, i.e. $(NF-1) and $(NF) instead of $NF-1 and $NF (works on Mac OS X 10.6.8 for FreeBSD awk and gawk).
echo '
1 2
2 3
one
one two three
' | gawk '{if (NF >= 2) print $(NF-1), $(NF);}'
# output:
# 1 2
# 2 3
# two three

using gawk exhibits the problem:
gawk '{ print $NF-1, $NF}' filename
1 2
2 3
-1 one
-1 three
# cat filename
1 2
2 3
one
one two three
I just put gawk on a Solaris 10 M4000.
So gawk is the culprit on the $NF-1 vs. $(NF-1) issue. Next question: what does POSIX say?
Per:
http://www.opengroup.org/onlinepubs/009695399/utilities/awk.html
There is no direction one way or the other. Not good. gawk implies subtraction; other awks imply field number or subtraction. Hmm.
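
A quick way to see the difference on your own awk is to run both forms on a line whose last field is a number. As far as I know, the `$` operator binds more tightly than binary minus in the awks I have used (gawk, mawk, and the one true awk), so the first form subtracts; treat that as an observation about those implementations rather than a statement about every awk:

```shell
# $NF-1 parses as ($NF) - 1: the value of the last field minus one.
printf 'a b 10\n' | awk '{ print $NF-1 }'      # prints 9

# $(NF-1) is the field at index NF-1: the second-to-last field.
printf 'a b 10\n' | awk '{ print $(NF-1) }'    # prints b
```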

Please try this out to take into account all possible scenarios:
awk '{print $(NF-1)"\t"$NF}' file
or
awk 'BEGIN{OFS="\t"} {print $(NF-1), $NF}' file
or
awk '{print $(NF-1), $NF}' file

try with this
$ cat /tmp/topfs.txt
/dev/sda2 xfs 32G 10G 22G 32% /
awk print last column
$ cat /tmp/topfs.txt | awk '{print $NF}'
/
awk print before last column
$ cat /tmp/topfs.txt | awk '{print $(NF-1)}'
32%
awk - print last two columns
$ cat /tmp/topfs.txt | awk '{print $(NF-1), $NF}'
32% /
