awk: calculating sum from values in a single field with multiple delimiters

Related to another post of mine,
parsing a sql string for integer values with multiple delimiters,
in which I said I could easily accomplish the same with UNIX tools (ahem). It turned out to be a bit messier than expected. I'm looking for an awk solution. Any suggestions on the following?
Here is my original post, paraphrased:
#
I want to use awk to parse data sourced from a flat file that is pipe delimited. One of the fields is sub-formatted as follows. My end state is to sum the numeric values within the field, but my question here is to see ways to use awk to sum those values. The pattern of the sub-formatting will always be that each desired number is preceded by a tilde (~) and followed by an asterisk (*), except for the last one in the field. The number of sub-fields may vary too (my example has 5, but there could be more or fewer). The 4-char TAG name is of no importance.
So here is a sample:
|GADS~55.0*BILK~0.0*BOBB~81.0*HETT~32.0*IGGR~51.0|
From this example, all I would want for processing is the final number of 219. Again, I can work on the sum part as a further step; just interested in getting the numbers.
#
My solution currently entails two awk statements. The first uses gsub to replace the '~' with a '*' delimiter in my target field, 77:
awk -F'|' 'BEGIN {OFS="|"} { gsub("~", "*", $77) ; print }' file_1 > file_2
My second awk statement calculates the numeric sum of the target field, 77, which is the last field, and replaces it with the calculated value. It is built on the assumption that there will be no other asterisks (*) anywhere else in the file; I'm okay with that. It works for most examples, but not all, and my gut tells me this isn't that robust an answer. Any ideas? The suggestions on my other post for SQL were great, but I couldn't implement them for unrelated silly reasons.
awk -F'*' '{if (NF>=2) {s=0; for (i=1; i<=NF; i++) s=s+$i; print substr($1, 1, length($1)-4) s;} else print}' file_2 > file_3

To get the sum (219) from your example, you can use this (s is reset at the start of each line so sums don't accumulate across records):
awk -F'[^0-9.]+' '{s=0; for(i=1;i<=NF;i++) s+=$i; print s}' file
or the following for 219.00:
awk -F'[^0-9.]+' '{s=0; for(i=1;i<=NF;i++) s+=$i; printf "%.2f\n", s}' file
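If you'd rather do the whole thing in a single pass over the original pipe-delimited file instead of the two-step gsub/sum approach, a sketch along these lines should work (untested against your full format; it assumes field 77 is the only field carrying the tilde/asterisk sub-formatting, per your description):
awk -F'|' 'BEGIN { OFS="|" } {
    # split field 77 on both sub-delimiters; pieces alternate TAG, number, TAG, number, ...
    n = split($77, parts, "[~*]")
    s = 0
    for (i = 1; i <= n; i++)
        s += parts[i]     # non-numeric TAG pieces coerce to 0, so they add nothing
    $77 = s               # replace the sub-formatted field with its sum
    print
}' file_1 > file_3
This leans on awk's string-to-number coercion: "GADS" is 0 in a numeric context while "55.0" is 55, so summing every piece yields 219 for the sample field.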

Related

awk choose a line with $1 present in a file and output with a changed field

I've tried to use Awk to do the following:
I have a large txt file whose first column is the name of a gene, with different values, essentially numeric, in each of the other columns.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed those lines, using something I found in a forum:
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great, but now I need to keep all rows (in the same order) and modify the listed genes, to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So this time, if the gene is not in my list, I want to keep the whole line unchanged; otherwise I want to keep the whole line but modify the value in one column by a given amount.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace the value with 0 if it becomes negative, but that I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify (I put commas as the field separator here, but the real fields are tab-separated):
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output (I subtract a tenth of the 4th column from the 3rd column when the gene in the first column matches a gene in my separate txt file; otherwise I keep the full line unchanged):
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rearticulated as just something. So a lone 1 (which is always true) is a common idiom for "print all lines" (if you reach this point in the script before hitting next).
Rounding a floating-point number can be done with sprintf("%1.0f", n), which rounds to the nearest integer (int(n) would simply truncate the fractional part).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
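To try it against the comma-separated minimal example instead (the real data is tab-separated), just swap the separators. A sketch, assuming the gene list file is literally named gene_list and genes.csv is a stand-in name for the data file; the +0 forces a numeric comparison, since sprintf() returns a string:
awk 'BEGIN { FS=OFS="," }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10))
            if (x+0 < 0) x=0      # clamp negative results to 0
            print $1, $2, x, $4; next }
1' gene_list genes.csv
With the five sample rows, this produces the expected output shown above.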

How to add zero decimal points to all elements of a matrix in shell

I have a file containing integer and floating-point numbers in 3 columns and many lines. I want all the numbers in floating-point format with 6 decimal places, i.e. the number 2 becomes 2.000000. I already tried the script below, but it only works for one column. I'd appreciate help on how to do it for the next two columns too, or any other one-line script which can do the job. Thanks in advance.
awk '{printf "%.6f\n",$1}' file
The script you provided works for only one column because you are only printing the first column. If you have only 3 columns and the number of columns is fixed, you just need to add the other two columns to your printf statement to make it work.
For example:
awk '{printf "%.6f %.6f %.6f \n",$1,$2,$3}'
If the number of columns is unknown, you can use a loop inside awk to print all the columns. NF gives the total number of fields in the record; you can iterate through it and print the results.
awk '{ for (i=1; i<= NF; i++) printf "%.6f ",$i; print "" }' input
You also can use this script to print the three columns as required, if your file only contains three columns (NF is 3 in that case).
As pointed out by @kvantour in the comments, a more elegant way to write the above one-liner is:
awk '{for(i=1;i<=NF;++i) printf "%.6f" (i==NF?ORS:OFS), $i}' input
Essentially, both one-liners do the same thing but the ORS (Output Record Separator) and OFS (Output Field Separator) variables give the flexibility to change the output easily and quickly.
So what (i==NF?ORS:OFS) does is: if the field is not the last column, OFS is printed (a single space by default), and if it is the last column, ORS is printed (a newline by default). The advantage of using these separator variables is that the output format can be changed easily and quickly; for example, if you want two newlines instead of a single newline between rows, it can be set with ORS="\n\n".
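For example, to double-space the rows you only need to set ORS in a BEGIN block; the loop stays the same:
awk 'BEGIN { ORS="\n\n" } {for(i=1;i<=NF;++i) printf "%.6f" (i==NF?ORS:OFS), $i}' input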

Output field separators in awk after substitution in fields

Is it always the case, after modifying a specific field in awk, that the information about the original field separators is lost? What happens if there were multiple different field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script that takes account of the input field separator and prints each line in my file, such as
awk -F: '{print $0}' example
I will see the original line. If, however, I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line; rather, I see the fields separated by the default whitespace separator, i.e.:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
In the case, however, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS-matching strings because of how expensive it'd be, in time and memory, to store the string that matches FS every time you split a record into fields. Instead, the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That was the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and the gawk maintainers, who in the end agreed that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
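Spelled out with comments, the one-liner above does the following (same logic, just reformatted):
awk -F'[:;]' '{
    # gawk only: the 4th arg to split() captures, in s[], the text
    # that matched FS between each pair of fields
    split($0, f, FS, s)
    $2 = $2 "!"              # modify the field as usual
    r = s[0]                 # leading separator, if the record starts with one
    for (i = 1; i <= NF; i++)
        r = r $i s[i]        # reassemble: each field followed by its own separator
    $0 = r
} 1' file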

finding differences across a row with awk

I have a table in which most of the values in a given row are the same. What I want to pull out are any rows where at least one of the values is different. I’ve figured out how to do that with something like this
awk -F "\t" '{if (($4!=$5)&&($5!=$6)&&($6!=$7)) print $0;}'
The only problem is there are 40-some-odd columns to compare. Is there a more elegant way to compare multiple columns for differences? BTW, these are non-numerical values, so a fancy math trick won't work.
Thanks, all. I'm a newbie, so I have to admit that I don't understand all of the commands, etc., but I can look them up from here. Not sure whose suggestion I'll go with, but I learn more from concrete examples than from textbook explanations, so having these different solutions is a big help with my learning curve.
A fancy math trick might not work but how about:
$ cat file
one one one one two
two two two two two
three four four five
$ awk '{f=$0;gsub($1,"")}NF{print f}' file
one one one one two
three four four five
First we store the line in its original state (f=$0), then we do a global substitution on everything matching the first field. If all fields are the same then nothing will be left, therefore NF will be 0 and nothing will be printed; else we print the original line.
Your script starts at $4, which suggests you are only interested in changes from that field on, in which case:
$ awk '{f=$0;gsub($4,"")}NF>3{print f}' file
If any field differs from some other field, then either it differs from field 1, or field 1 differs from some other field (by definition). So just loop from 2 to NF (the number of fields), comparing each field against the first:
awk -F "\t" '{ for (i = 2; i <= NF ;i++) if ($i != $1) { print; next; }}'
You can tune this to ignore leading fields (e.g., start at 5 and compare against $4) as needed.
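For example, an untested sketch of that tuned version, skipping the first three fields and comparing everything from $5 on against $4:
awk -F "\t" '{ for (i = 5; i <= NF; i++) if ($i != $4) { print; next } }' file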
You could just use a for loop:
awk -F "\t" '{ for(i=4;i<NF;i++) if ($i != $(i+1)) { print; next } }' file
Adjust accordingly. HTH.

scripting in awk

I have a text file with contents as below:
1,A,100
2,A,200
3,B,150
4,B,100
5,B,250
I need the output as:
A,300
B,500
The logic here is to sum the 3rd field over all the lines whose 2nd field is A, and likewise for B.
How could we do it using awk?
You can do it using an associative array, like this:
awk -F"," '{cnt[$2]+=$3}END{for (x in cnt){printf "%s,%d\n",x,cnt[x]}}' file
Well, I'm not up for writing and debugging the code for you. However, the elements you need are:
You can use FS="," to change the field separator to a comma.
The fields you care about are obviously the second ($2) and third ($3) fields.
You can create your own variables to accumulate the values into.
I'd suggest an associative array variable, indexed by field two.
$ awk -F"," '{_[$2]+=$3}END{for(i in _)print i,_[i]}' OFS="," file
A,300
B,500
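One caveat applying to both answers: for (i in array) visits keys in an unspecified order. If you need the groups sorted in the output, one option is to pipe the result through sort:
$ awk -F"," '{sum[$2]+=$3} END{for (k in sum) print k, sum[k]}' OFS="," file | sort
A,300
B,500
(sum is just a more descriptive name for the _ array used in the answer above.)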