combine specific output and display in specific header - awk

I am trying to combine all lines whose text to the left of the | matches and output that text in the "Gene" column. The number of matching lines goes in the "Targets" column, the average of $3 in the "Average Depth" column, and the average of the numbers to the right of the = in the "Average GC" column. I am having some trouble doing this and need some expert help. Thank you :).
input
chr10:79793602-79793721 RPS24|gc=59.7 150.3
chr10:79795083-79795202 RPS24|gc=41.2 111.4
chr10:79797665-79797784 RPS24|gc=37 69.8
chr11:119077113-119077232 CBL|gc=67.9 27.3
chr11:119103143-119103420 CBL|gc=41.9 240.3
chr11:119142430-119142606 CBL|gc=42.6 177.1
chr11:119144563-119144749 CBL|gc=46.2 324.4
current output
Gene TargetsAverage DepthAverage GC
gc 803 0.0 0.0
desired output
ID times depth GC
RPS24 3 110.5 46.0
CBL 4 192.3 49.7
awk
awk -F'[ |=]' '
{
id[$2] += $4
value[$2] += $5
occur[$2]++
}
END{
printf "%-8s%8s%8s%8s\n", "Gene", "Targets", "Average Depth", "Average GC"
for (i in id)
printf "%-8s%8d%8.1f%8.1f\n", i, occur[i],value[i]/occur[i],id[i]/occur[i]
}' input

@Chris - Your editing of the question has not been very helpful, but I can confirm that, except for the first printf statement, the program runs as expected, which is in accordance with the "desired output". I have used three different awks; the only difference between the outputs is (as expected) the ordering of the rows. You may have to be more specific about the version of awk you are using.
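The row order differs between awks because for (i in id) visits keys in an unspecified order. A minimal sketch of one way to keep first-seen order and give the header enough room, assuming the fields are separated by single spaces as in the posted sample; the function name gene_summary and the order array are illustrative, not part of the original program:

```shell
# gene_summary FILE: per-gene line count, mean depth and mean GC,
# printed in the order each gene first appears (hypothetical helper name).
gene_summary() {
    awk -F'[ |=]' '
    {
        # with this separator: $2 = gene, $4 = GC value, $5 = depth
        if (!($2 in occur)) order[++n] = $2   # remember first-seen order
        occur[$2]++
        gc[$2]    += $4
        depth[$2] += $5
    }
    END {
        # column widths are wide enough for the two-word headers
        printf "%-8s%8s%15s%12s\n", "Gene", "Targets", "Average Depth", "Average GC"
        for (k = 1; k <= n; k++) {
            g = order[k]
            printf "%-8s%8d%15.1f%12.1f\n", g, occur[g], depth[g]/occur[g], gc[g]/occur[g]
        }
    }' "$1"
}
```

It would be called as gene_summary input. If the real file is tab-separated, the field separator would need a tab added to the bracket expression.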

Solution in TXR:
$ txr table2.txr data
ID times depth GC
RPS24 3 110.5 46.0
CBL 4 192.3 49.7
Code in table2.txr:
@(output)
ID times depth GC
@(end)
@(repeat)
@ (all)
@nil:@nil-@nil @id|@nil
@ (and)
@ (collect :gap 0)
@nil:@nil-@nil @id|gc=@gc @dep
@ (set (gc dep) (@(tofloat gc) @(tofloat dep)))
@ (end)
@ (bind n @(length gc))
@ (bind avg-gc @(format nil "~,1f"
(/ [reduce-left + gc] n)))
@ (bind avg-dep @(format nil "~,1f"
(/ [reduce-left + dep] n)))
@ (output)
@{id 9} @{n 6} @{avg-dep 13} @{avg-gc}
@ (end)
@ (end)
@(end)
What lumps together the entries with the same ID is the two parallel branches of the all directive. The first branch loosely matches the pattern of a single line, extracting the ID, binding it to the id variable. This variable is visible to the second branch, where its appearance introduces a back-referencing constraint. Here, multiple consecutive (:gap 0) lines are matched at the same position (thus including the one which was matched in the first branch of the all). Only lines with the matching id are processed; the collect ends when a non-matching id is encountered (due to the :gap 0 constraint) or when the input ends.

Related

AWK-Get total count of records for numerical grouped column

I have a variable which splits the results of a column based on a condition (group by in other programming languages).
I'm trying to have a variable that counts the NR of each group. If we sum all the groups we should have the NR of the file.
When I try to use NR in the calculation, for example NR[variable that splits], I get a fatal error "you tried to use scalar as matrix".
Any ideas how to use NR as a variable, but counting only the records from each group rather than all the records?
sex, weight
male,50
female,49
female,48
male,66
male,78
female,98
male,74
male,54
female,65
In this case the NR would be 9 BUT, in reality I want a way to get that NR of male is 5 and 4 for female.
I have the total sum of the weight column but struggle to get the average:
sex= $(f["sex"])
ccWeight[sex] += $(f["weight"])
avgWeight = ccWeight[sex] / ¿?
Important: I don't need to print the result as of now, just to store this number on a variable.
One awk idea:
awk -F, '
NR>1 { counts[$1]++ # keep count of each distinct sex
counts_total++ # replace dependency on NR
weight[$1]+=$2 # keep sum of weights by sex
}
END { for (i in counts) {
printf "%s: (count) %s of %s (%.2f%%)\n",i,counts[i],counts_total,(counts[i]/counts_total*100)
printf "%s: (avg weight) %.2f ( %s / %s )\n",i,(weight[i]/counts[i]),weight[i],counts[i]
}
}
' sample.dat
NOTE:
OP can add additional code to verify total counts and weights are not zero (so as to keep from generating a 'divide by zero' error)
perhaps print a different message if there are no (fe)male records to process?
This generates:
female: (count) 4 of 9 (44.44%)
female: (avg weight) 65.00 ( 260 / 4 )
male: (count) 5 of 9 (55.56%)
male: (avg weight) 64.40 ( 322 / 5 )
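The question asks for the per-group count and average to be stored rather than printed. One possible sketch, assuming the same two-column CSV with a header row; the array names cnt, total and avg and the wrapper group_avg are illustrative:

```shell
# group_avg FILE: keep a per-group record count and average in awk arrays
# (cnt, total and avg are illustrative names), then print them for inspection.
group_avg() {
    awk -F, '
    NR > 1 {
        cnt[$1]++            # per-group replacement for NR
        total[$1] += $2      # per-group sum of weights
    }
    END {
        for (g in cnt) {
            # cnt[g] is at least 1 for every key that exists, so no division by zero here
            avg[g] = total[g] / cnt[g]
            printf "%s %d %.2f\n", g, cnt[g], avg[g]
        }
    }' "$1" | sort
}
```

Inside the END block, avg["male"] and avg["female"] are then available for further calculations; the printf is only there to show the stored values.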
GNU datamash might be what you are looking for, e.g.:
<infile datamash -Hst, groupby 1 count 1 sum 2 mean 2 | column -s, -t
Output:
GroupBy(sex) count(sex) sum(weight) mean(weight)
female 4 260 65
male 5 322 64.4

sum 4th field data between the pattern

suppose my data is :
*dnet *1234 1.2
1 port *12 2.3
3 port1 *34 0.2
7 *15 0.1
*dnet *234 0.2
2 *12 0.1
4 *123 *234 1.2
fields are separated by space.
I want the sum of the 4th field over the lines inside each *dnet block. Some lines have a 4th field and some do not. I want a separate 4th-field sum for each *dnet.
I tried using awk but could not get it to work. I would be thankful if someone could help.
the output for above will look like
*dnet *1234 1.2 2.5
*dnet *234 0.2 1.2
Commented, slightly simplified, version of the comment...
awk '
# look for header line
$1=="*dnet" {
# print any previously calculated sum
if (header) print header, sum
# reset sum for next block of lines
sum = 0
# save new header line
header = $0
# skip remaining actions
next
}
# if we get here, we know this is not a header line
# if there is a 4th field, add it to the sum
$4 {
sum += $4
}
END {
# print the final sum
if (header) print header, sum
}
' datafile

Awk failing extraction

I have a huge file containing the xyz positions of some atoms from different molecules. The whole file contains ~ 10000 configurations. I have created a script that iterates over the total number of configurations and extracts the coordinates associated with a specific atomic species that is systematically repeated at a fixed position, along with each frame associated with each system. My code works perfectly, except in the case in which the atomic position coincides with the last position of the frame I have to process, skipping to grab it and print in the corresponding file.
Each frame contains 384 atoms. In the xyz format, we have to take into account two extra lines at the beginning: the number of atoms (in this case 384, line #1) and a blank/comment line (line #2).
The awk file with the list of atoms position lines is of the form:
{n = NR%386}
n == 1 {print "24"; next}
n == 2 ||
n == 91 ||
...
n == 378 ||
n == 380 ||
n == 381 ||
n == 386
where the n=NR%386 is the number of lines that awk has to account at every iteration in order to have the correct number of frames, in
n == 1 {print "24"; next}
the code prints the number of atoms I want to extract for each frame, in this case, 24.
The problem arises with the last value, in the last position of each frame before advancing to the next frame:
n == 386
When using the command
awk -f file.awk filename.xyz >> test.txt
the code will skip reading, extracting, and printing the last coordinate.
The filename.xyz I have to process is something like:
384
i = 3171, time = 3171.000, E = -3298.3005315786
C 6.66359796 19.29831718 16.63773520
C 6.19922671 19.83243350 15.35406226
C 7.73577004 21.24303011 16.94974860
C 7.32315891 21.77975003 15.67093925
N 5.08248005 17.55384984 15.51887635
N 7.75857672 23.00895664 15.43811018
N 8.58649028 22.07495287 17.61330368
N 7.45555304 19.97249138 17.42360101
...
...
...
N 3.62924684 23.22942656 15.38486984
N 4.52670891 22.25077226 17.55981432
N 3.17369677 20.23465407 17.45881199
N 2.28230853 21.30557433 14.86646780
S 1.48394488 18.18032187 17.21253664
S 0.70072709 19.13053602 14.60582837
S 4.67511560 23.53830074 16.57005901
Currently, just trying to extract only position 386
n == 386
produces something like:
1
i = 3171, time = 3171.000, E = -3298.3005315786
1
i = 3172, time = 3172.000, E = -3298.3023115390
1
i = 3173, time = 3173.000, E = -3298.3056102462
1
i = 3174, time = 3174.000, E = -3298.3101590395
These are just the comment lines, so awk is apparently skipping the line I want or not correctly interpreting which line to grab.
I would like to understand why awk is not able to extract the last line properly and how to solve the problem.
This appears to be a math problem. NR%386 will never be 386 because of the way the modulus operator works (there is no remainder when you divide 386 by 386). So your n==386 will never get executed. Try using (NR-1)%386 instead of NR%386 and shift all your conditionals accordingly:
n == 0 {print "24"; next}
etc. If you need n for calculations, add one to it.
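A variant of the same fix keeps the conditions 1-based, so that tests such as n == 386 can stay as written: compute n as (NR-1)%386 + 1. A small sketch with the frame size passed in as a parameter; the helper name frame_lines and its arguments are illustrative:

```shell
# frame_lines SIZE WANTED FILE: print every line whose 1-based position
# inside its SIZE-line frame equals WANTED (hypothetical helper name).
frame_lines() {
    awk -v size="$1" -v want="$2" '
    { n = (NR - 1) % size + 1 }   # n runs 1..size, never 0
    n == want                     # default action: print the line
    ' "$3"
}
```

With this numbering, frame_lines 386 386 filename.xyz would select the last line of every frame, whereas NR%386 yields 0 on those lines.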

Reading fields in previous lines for moving average

Main Question
What is the correct syntax for recursively calling AWK inside of another AWK program, and then saving the output to a (numeric) variable?
I want to call AWK using 2/3 variables:
N -> Can be read from Bash or from container AWK script.
Linenum -> Read from container AWK program
J -> Field that I would like to read
This is my attempt.
Container AWK program:
BEGIN {}
{
...
# Loop in j
...
k=NR
# Call to other instance of AWK
var=(awk -f -v n="$n_steps" linenum=k input-file 'linenum-n {printf "%5.4E", $j}'
...
}
END{}
Background for more general questions:
I have a file for which I would like to calculate a moving average of n (for example 2280) steps.
Ideally, for the first n rows the average is of the values 1 to k,
where k <= n.
For rows k > n the average would be of the last n values.
I will eventually execute the code in many large files, with several columns, and thousands to millions of rows, so I'm interested in streamlining the code as much as possible.
Code Excerpt and Description
The code I'm trying to develop looks something like this:
NR>1
{
# Loop over fields
for (j in columns)
{
# Rows before full moving average is done
if ( $1 <= n )
{
cumsum[j]=cumsum[j]+$j #Cumulative sum
$j=cumsum[j]/$1 # Average
}
#moving average
if ( $1 > n )
{
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}') # Obtain value that will get subtracted from moving average
cumsum[j]=cumsum[j]+$j-last[j] # Cumulative sum adds last step and deleted unwanted value
$j=cumsum[j]/n # Moving average
}
}
}
My input file contains several columns. The first column contains the row number, and the other columns contain values.
For the cumulative sum of the moving average: If I am in row k, I want to add it to the cumulative sum, but also start subtracting the first value that I don't need (k-n).
I don't want to have to create an array of cumulative sums for the last steps, because I feel it could impact performance. I prefer to directly select the values that I want to subtract.
For that I need to call AWK once again (but on a different line). I attempt to do it in this line:
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}'
I am sure that this code cannot be correct.
Discussion Questions
What is the best way to obtain information about a field in a previous line to the one that AWK is working on? Can it be then saved into a variable?
Is this recursive use of AWK allowed or even recommended?
If not, what could be the most efficient way to update the cumulative sum values so that I get an efficient enough code?
Sample input and Output
Here is a sample of the input (second column) and the desired output (third column). I'm using 3 as the number of averaging steps (n)
N VAL AVG_VAL
1 1 1
2 2 1.5
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
14 15 14
If you want to do a running average of a single column, you can do it this way:
BEGIN{n=2280; c=7}
{ s += $c - a[NR%n]; a[NR%n] = $c }
{ print $0, s /(NR < n ? NR : n) }
Here we store the last n values in an array a and keep track of their running sum s. Every time we update the sum, we subtract the value that drops out of the window (the one stored n rows earlier) before overwriting its slot with the new value.
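A runnable sketch of the same ring-buffer idea, with the window size and the column passed in as parameters so that it can be checked against the question's sample (n = 3); the wrapper name moving_avg is illustrative:

```shell
# moving_avg N COL: read rows on stdin and append the mean of the
# last N values of column COL to each row (hypothetical helper name).
moving_avg() {
    awk -v n="$1" -v c="$2" '
    {
        s += $c - a[NR % n]        # add the new value, drop the one leaving the window
        a[NR % n] = $c             # overwrite the slot that was just dropped
        print $0, s / (NR < n ? NR : n)   # divide by NR while the window is still filling
    }'
}
```

For the values 1 to 5 with n = 3, this gives the averages 1, 1.5, 2, 3, 4, which agrees with the AVG_VAL column in the question.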
If you want to do this for a couple of columns, you have to be a bit handy with keeping track of your arrays
BEGIN{n=2280; c[0]=7; c[1]=8; c[2]=9}
{ for(i in c) { s[i] += $c[i] - a[n*i + NR%n]; a[n*i + NR%n] = $c[i] } }
{ printf "%s", $0
for(i=0;i<length(c);++i) printf "%s%s", OFS, s[i]/(NR < n ? NR : n)
printf "%s", ORS
}
However, you mentioned that you have to add millions of entries. That is where it becomes a bit more tricky. Summing a lot of values introduces numeric errors, as you lose precision bit by bit when adding floats. In this case, I would suggest implementing Kahan summation.
For a single column you get:
BEGIN{n=2280; c=7}
{ y = $c - a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; a[NR%n] = $c }
{ print $0, s /(NR < n ? NR : n) }
or a bit more expanded as:
BEGIN{n=2280; c=7}
{ y = $c - k; t = s + y; k = (t - s) - y; s = t; }
{ y = -a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; }
{ a[NR%n] = $c }
{ print $0, s /(NR < n ? NR : n) }
For a multi-column problem, it is now straightforward to adjust the above script. All you need to know is that y and t are temporary values and k is the compensation term which needs to be stored in memory.
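One way the multi-column adjustment might look, as a sketch only: each column gets its own sum s[i], compensation term k[i] and ring buffer a[i, slot]. The wrapper kahan_moving_avg, the helper function kadd and the comma-separated column list are assumptions made for this illustration:

```shell
# kahan_moving_avg N COLS: COLS is a comma-separated list of column numbers;
# appends one Kahan-compensated moving average per listed column.
kahan_moving_avg() {
    awk -v n="$1" -v cols="$2" '
    BEGIN { m = split(cols, c, ",") }
    # add x to the running sum of column slot i with Kahan compensation;
    # y and t are temporaries, k[i] is the stored compensation term
    function kadd(i, x,    y, t) {
        y = x - k[i]
        t = s[i] + y
        k[i] = (t - s[i]) - y
        s[i] = t
    }
    {
        line = $0
        for (i = 1; i <= m; i++) {
            v = $(c[i])
            kadd(i, v)                 # value entering the window
            kadd(i, -a[i, NR % n])     # value leaving the window (0 while filling)
            a[i, NR % n] = v
            line = line OFS (s[i] / (NR < n ? NR : n))
        }
        print line
    }'
}
```

A call such as kahan_moving_avg 2280 7,8,9 < input-file would correspond to the column choice used in the earlier multi-column script.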

awk to print out lines for cumulative sum

I want to print out lines of a file until the cumulative sum of the third field reaches 0.99, that is, up to and including the first line for which the cumulative sum is greater than or equal to 0.99. However, if field 2 of that first line matches field 2 of the next line, then both lines should be printed.
My file looks like:
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
rs934367 -4.3376 3.06953e-06
Desired output:
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
In the above example, the cumulative sum of field 3 exceeds 0.99 at line 2, but I print line 3 as well since field 2 of lines 2 and 3 are equal. If these fields had not been equal, I would print out lines 1 and 2 only.
I have the following command, which works for the cumulative sum, but not for comparing field 2 between adjacent lines:
awk '{sum+=$3;print $0;if(sum>=0.99)exit}' file
Can someone modify this to incorporate the above requirements?
The following should work according to your specifications:
Given file containing
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
rs934367 -4.3376 3.06953e-06
The following awk-script
awk '{sum+=$3; print $0; if(sum >= 0.99 && prev_row == $2)exit;prev_row=$2}' file
will produce
rs76832595 -4.4524 0.501109
rs74660964 -4.9815 0.49886
rs12992037 -4.9815 9.8159e-06
The change in the script consisted of adding a prev_row=$2 at the end of the statement to keep track of the previous row, and incorporating prev_row into the if-statement.
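One case worth checking is a file where the line after the threshold line has a different field 2: because every line is printed before the test, the one-liner keeps printing until two adjacent lines happen to match. A sketch of an alternative that stops right after the threshold line unless the next line shares its field 2; the wrapper name cum_cutoff and the variable names reached and key are illustrative:

```shell
# cum_cutoff LIMIT FILE: print lines until the running sum of field 3 reaches LIMIT,
# plus the following line only if its field 2 equals that of the threshold line.
cum_cutoff() {
    awk -v limit="$1" '
    reached { if ($2 == key) print; exit }   # line right after the threshold line
    { print; sum += $3 }
    sum >= limit { reached = 1; key = $2 }   # remember field 2 of the threshold line
    ' "$2"
}
```

Called as cum_cutoff 0.99 file, this prints the first three lines of the sample, and only the first two if field 2 of the third line is changed.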