AWK to print min max values of unique value in column - awk

I am trying to use awk to do the following:
Input file:
6:28866209 NA NA NA 8.51368e-06 Y
6:28856689 1 0.007828 1 1.50247e-06 X
6:28856740 2 0.007828 1 1.50247e-06 Y
6:28856889 3 7.51E-08 3 1.50247e-06 X
I want to:
Get min and max of column 5 for each independent value in column 6
Append the min and max of column 5 (for that line's column 6 value) to the end of each line in the file
The file can have different N columns, but all have at least columns 1-8, which are the same in each of my files.
Output:
6:28866209 NA NA NA 8.51368e-06 Y 8.51368e-06 1.50247e-06
6:28856689 1 0.007828 1 1.50247e-06 X 1.50247e-06 1.50247e-06
6:28856740 2 0.007828 1 1.50247e-06 Y 8.51368e-06 1.50247e-06
6:28856889 3 7.51E-08 3 1.50247e-06 X 1.50247e-06 1.50247e-06
I have attempted this using the following awk command, but I am only getting back the first value in column 6...
awk 'BEGIN{OFS="\t";FS="\t"}{if (a[$6] == "") a[$6]=$5; if (a[$6] > $5) {a[$6]=$5}} {if (b[$6] == "") b[$6]=$5; if (b[$6] < $5) {b[$6]=$5}} END {if (i=$6) print $0,i,a[i],b[i]}' FILE

I believe the easiest way is to make two passes over the file:
awk '(NR==FNR) && !($6 in min) { min[$6] = $5; max[$6] = $5; next }
(NR==FNR) { m=min[$6]; M=max[$6];
min[$6] = $5<m ? $5 : m;
max[$6] = $5>M ? $5 : M;
next; }
{print $0,min[$6],max[$6] }' <file> <file>
Your original code has the following flaw: the END block runs only once, after the whole file has been read. You try to print the full file there, but no lines were stored while parsing, so only the last record is still available.
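A small sketch of that behaviour, using made-up three-line input rather than the file from the question. In POSIX awk (gawk and mawk behave this way) $0 in END still holds the last record, and nothing earlier:

```shell
# Feed three records to awk and look at $0 inside END.
# Only the last record survives; "first" and "second" are gone.
result=$(printf 'first\nsecond\nthird\n' |
    awk 'END { print "in END, $0 is:", $0 }')
printf '%s\n' "$result"
```

Some older awk implementations leave $0 empty in END, so it is safer not to rely on it at all.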
A correction to your original idea is :
awk 'BEGIN{OFS="\t";FS="\t"}
{if (a[$6] == "") a[$6]=$5;
if (a[$6] > $5) {a[$6]=$5}
}
{if (b[$6] == "") b[$6]=$5;
if (b[$6] < $5) {b[$6]=$5}
}
{ c[NR]=$0; d[NR]=$6 }
END { for (i=1;i<=NR;i++) print c[i],a[d[i]],b[d[i]] }' FILE
Here, I stored the full FILE in array c, which is indexed by the line number NR. I also stored the key $6 in array d. At the end, I loop through all the lines I stored and print what is expected.
The downside of this approach is that you have to store the full file in memory.
The downside of my proposal is that you have to read the full file twice from disk.
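A minimal sketch of the two-pass idea, run on the sample data from the question. The file name sample.txt is hypothetical, the input is assumed to be whitespace-separated, and the "+ 0" is an addition of mine that forces a numeric comparison even if a value would otherwise be compared as a string. It prints the min first and the max second:

```shell
# Sample data copied from the question (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
6:28866209 NA NA NA 8.51368e-06 Y
6:28856689 1 0.007828 1 1.50247e-06 X
6:28856740 2 0.007828 1 1.50247e-06 Y
6:28856889 3 7.51E-08 3 1.50247e-06 X
EOF

result=$(awk '
NR == FNR {                      # first pass: collect min/max per key
    if (!($6 in min) || $5 + 0 < min[$6] + 0) min[$6] = $5
    if (!($6 in max) || $5 + 0 > max[$6] + 0) max[$6] = $5
    next
}
{ print $0, min[$6], max[$6] }   # second pass: append min and max
' sample.txt sample.txt)
printf '%s\n' "$result"
```

Storing the original text of $5 (rather than the coerced number) keeps the values in the output formatted exactly as they were in the input.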

awk 'FNR<NR{$7=m[$6];$8=M[$6];print;next} (!M[$6])||$5>M[$6]{M[$6]=$5}(!m[$6])||$5<m[$6]{m[$6]=$5}' file file
With comments:
awk '
# optional format
BEGIN { OFS=FS="\t"}
# second pass (second read of the file)
FNR<NR{
# add columns 7 and 8 with the min and max of column 5 for this column 6 value
$7=m[$6];$8=M[$6]
# print it and read the next line (skip the rest of the script)
print;next}
# this point is only reached during the first read of the file
# if the max is unknown or column 5 is bigger than the max
(!M[$6])||$5>M[$6]{
# set new max
M[$6]=$5}
# do the same for min
(!m[$6])||$5<m[$6]{m[$6]=$5}
# read the same file twice (first to find min/max, second to print it)
' sample.txt sample.txt

Related

Divide largest value by second largest value

I have a file in the following format. Column one has ~20,000 unique entries, column 2 has ~120,000 different entries, and column 3 holds a count associated with column 2. A single entry in column 1 can have multiple entries in column 2. For each unique entry in column 1, I am trying to get the ratio of the maximum value to the second maximum value of column 3.
F1.txt
S1 S2 C1
A A1 1
A AA 10
A A6 5
A A0 4
B BB 12
B BC 11
B B1 19
B B9 4
Expected Output
S1 S2 C1
B B1 19 1.58333
A AA 10 2
I can do it in steps like below. But is there a smart way of doing it in one script?
awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r "}' F1.txt | awk '!seen[$1]++' >del1.txt
awk 'FNR==NR{a[$2]=1; next}FNR==1{print $0;}!a[$2]' del1.txt F1.txt | awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r"}' | awk '!seen[$1]++' >del2.txt
awk 'FNR==NR{a[$1]=$3; next}FNR==1{print $0"\t";"RT"}FNR>1 a[$1]{print $0"\t"$3/a[$1]}' del2.txt del1.txt
#!/usr/bin/awk -f
NR == 1 { print $1, $2, $3; next }
{ data[$1][$3] = $2 }
END {
for (key in data) {
asorti(data[key], s, "@ind_num_desc")
print key, data[key][s[1]], s[1], s[1] / s[2]
}
}
This^^^ assumes an arbitrary permutation of the lines (and requires gawk (which is pretty common) or another implementation with native multi-dimensional “arrays”).
If you can make more assumptions about the input — e.g. that it is always grouped by the first column —, then you can make it more memory-efficient and get rid of multi-dimensional arrays (by not delaying the evaluation until END and instead calculating it in a per-line block each time the first column’s value changes (and then one last time in END).)
To get a different handling of equal numeric values (e.g. to report the “subkey” (column 2) of the first (instead of last) encountered occurrence of a value), you could add if (!($3 in data[$1])) ... or the like.
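For awks without arrays of arrays, the same "arbitrary line order" case can be sketched by tracking only the largest and second-largest count per key. This is an illustration under my own assumptions, not the approach above: the array names are made up, input is whitespace-separated, keys are printed in order of first appearance, and a key with a single row prints NA instead of a ratio.

```shell
# Sample data copied from the question.
cat > F1.txt <<'EOF'
S1 S2 C1
A A1 1
A AA 10
A A6 5
A A0 4
B BB 12
B BC 11
B B1 19
B B9 4
EOF

result=$(awk '
NR == 1 { print; next }                       # header line
!($1 in max) {                                # first row seen for this key
    max[$1] = $3; rec[$1] = $0; order[++n] = $1
    next
}
$3 + 0 > max[$1] + 0 {                        # new maximum: old one becomes second
    second[$1] = max[$1]; max[$1] = $3; rec[$1] = $0
    next
}
!($1 in second) || $3 + 0 > second[$1] + 0 { second[$1] = $3 }
END {
    for (i = 1; i <= n; i++) {
        k = order[i]
        if (k in second) print rec[k], max[k] / second[k]
        else             print rec[k], "NA"
    }
}
' F1.txt)
printf '%s\n' "$result"
```

Memory use grows with the number of distinct keys in column 1 only, not with the number of rows.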
Whenever you find yourself creating a pipeline containing awk, there is a very good chance that what you are trying to do can be done in a single call to awk much more efficiently.
A non-GNU awk approach that presumes all field1 'A' records are together and all 'B' records are together (as you show in your sample data) could be:
awk '
FNR==1 { print; next } ## 1st line, output heading
$1 != n { ## 1st field changed
if (n) { ## if n set, output result of last block
printf "%s\t%s\n", rec, max / nextmax
}
rec = $0 ## initialize vars for next block
n = $1
max = $3
nextmax = 1
next ## skip to next record
}
{
if ($3 > max) { ## check if 3rd field > max
rec = $0 ## save record
nextmax = max ## update nextmax
max = $3 ## update max
}
else if ($3 > nextmax) { ## if 3rd field > nextmax
nextmax = $3 ## update nextmax
}
} ## output final block results
END { printf "%s\t%s\n", rec, max / nextmax }
' file
Example Use/Output
With your data in the file file, you would have:
$ awk '
> FNR==1 { print; next } ## 1st line, output heading
> $1 != n { ## 1st field changed
> if (n) { ## if n set, output result of last block
> printf "%s\t%s\n", rec, max / nextmax
> }
> rec = $0 ## initialize vars for next block
> n = $1
> max = $3
> nextmax = 1
> next ## skip to next record
> }
> {
> if ($3 > max) { ## check if 3rd field > max
> rec = $0 ## save record
> nextmax = max ## update nextmax
> max = $3 ## update max
> }
> else if ($3 > nextmax) { ## if 3rd field > nextmax
> nextmax = $3 ## update nextmax
> }
> } ## output final block results
> END { printf "%s\t%s\n", rec, max / nextmax }
> ' file
S1 S2 C1
A AA 10 2
B B1 19 1.58333
Using any awk in any shell on every Unix box and using almost no memory (important since your input file would be huge given your description of it):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 { print; next }
$1 != prev {
if ( prev != "" ) {
print prev, val, max, (preMax ? max/preMax : 0)
}
prev = $1
max = ""
}
(max == "") || ($3 > max) {
val = $2
preMax = max
max = $3
next
}
(preMax == "") || ($3 > preMax) { preMax = $3 }
END { print prev, val, max, (preMax ? max/preMax : 0) }
$ awk -f tst.awk F1.txt
S1 S2 C1
A AA 10 2
B B1 19 1.58333

Successive averaging of repeating data but different number of lines

I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
Data consists of data blocks. Each data block has the same format as follows:
The very first line is the header line. The header line contains the ID number and the total number of lines of each data block. For example, the first data block's ID is 1, and the total number of lines is 3. For the third data block, ID is 3, and the total number of lines is 2. All data blocks have this header line.
Then, the "real data" follows. As I explained, the number of lines of "real data" is designated in the second integer of the header line.
Accordingly, the total number of lines for each data block is number_of_lines+1. In this example, data block 1 takes 4 lines in total and data block 2 takes 6 lines...
This format repeats for all 10000 data blocks in my current data, but I can pass this 10000 to the bash or awk script as an input variable. I know the total number of data blocks.
Now, I want to get the average of each of the two data columns and print it along with the data block ID number and the total number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns with 6 decimal places format. The result will have 10000 lines, and each line will have ID, number of lines, avg of column 1 of data, and avg of column 2 of data for each data block. The result of this example will look like
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash. These are already answered in StackOverflow a lot of times. For example, I really favor using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash. But I really have no clue how to approach, or even start, averaging multiple repeating data blocks that each have a different number of lines.
I was trying based on awk, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how can I use NR or FNR for the average of data with a varying number of total lines of data, for each data block.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
To get a fixed number of decimals, set OFMT, which controls how print formats non-integer numbers (it is part of POSIX awk, so it is not limited to gnu awk):
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
s1 += $1
s2 += $2
next
}
{
if (id) {
print id, num, s1/num, s2/num
s1 = s2 = 0
}
id = $1
num = $2
}
END {
print id, num, s1/num, s2/num
}' file
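One caveat with OFMT, shown as a small sketch: print applies it only to non-integer values, so an average that happens to be a whole number is printed without decimals. If every value must carry six decimals, printf is the safer choice.

```shell
# 4/2 is integer-valued, so print ignores OFMT for it; printf does not.
result=$(awk -v OFMT='%.6f' 'BEGIN {
    print 4/2, 1/3                     # OFMT applies to 1/3 only
    printf "%.6f %.6f\n", 4/2, 1/3     # explicit format for both
}')
printf '%s\n' "$result"
```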
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
OFS = "\t"
OFMT = "%.6f"
}
num_lines == 0 {
id = $1
num_lines = $2
sum1 = sum2 = 0
next
}
lines_read < num_lines {
sum1 += $1
sum2 += $2
lines_read++
}
lines_read >= num_lines {
print id, num_lines,
sum1 / num_lines,
sum2 / num_lines
num_lines = lines_read = 0
num_blocks--;
}
num_blocks <= 0 {
exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows:
qnt == "none" {
id = $1;
qnt = $2;
s1 = s2 = line = 0;
next
}
{
s1 += $1;
s2 += $2;
++line
}
line == qnt {
printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt;
qnt="none"
}
After a data block is processed (or at the beginning), record header info.
Otherwise, add to the sums and print the result when we are done with all lines in this block.
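For the exact layout requested in the question (five spaces between columns, six decimal places), a count-driven sketch with printf could look like this. The file name blocks.txt and the variable names are hypothetical, and only blocks 1 and 3 of the sample are included to keep it short:

```shell
# Two of the data blocks from the question (blocks.txt is a hypothetical name).
cat > blocks.txt <<'EOF'
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
3 2
1.993951 0.6610000
1.796519 0.8090000
EOF

result=$(awk '
left == 0 {                      # header line: block ID and number of data lines
    id = $1; n = left = $2
    s1 = s2 = 0
    next
}
{ s1 += $1; s2 += $2 }           # data line: accumulate both columns
--left == 0 {                    # last data line of the block: print averages
    printf "%d     %d     %.6f     %.6f\n", id, n, s1 / n, s2 / n
}
' blocks.txt)
printf '%s\n' "$result"
```

Because the header's line count drives the state, this does not depend on what the data values look like, only on the headers being accurate.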

Compare values in two rows for specific column

I would like to print the lines of a file based on a condition with respect to the previous line. I would like to implement the following condition:
If the key (field 1 and field2) between the current line and the previous line is identical and the difference between field 8 and field 8 of the previous line is bigger than 1, print the current line and append the difference.
Input file:
47329,39785,2,12,10,351912.50,2533105.56,170.93,1
47329,39785,3,6,7,351912.82,2533105.07,170.89,1
47329,39785,2,12,28,351912.53,2533118.81,172.91,1
47329,39785,3,6,20,351913.03,2533117.41,170.93,1
47329,39797,2,12,10,352063.14,2533117.84,170.66,1
47329,39797,3,6,7,352064.11,2533119.32,170.64,1
47329,39797,2,12,28,352062.77,2533104.67,173.63,1
47329,39797,3,6,20,352063.50,2533107.10,170.69,1
Expected output file:
47329,39785,2,12,28,351912.53,2533118.81,172.91,1,1.98
47329,39797,2,12,28,352062.77,2533104.67,173.63,1,2.94
Lines 3 and 4 have an identical key (47329,39785) and the difference of the values in fields 8 is 172.91-170.93=1.98, so we print line 4. An identical reasoning goes for lines 6 and 7
attempt:
awk -F, 'NR%2{ab = $1 FS $2} ab == ob && $8 - O8 > 1; {ob = ab; O8 = $8}'
I've come up with this script, tested on gawk v5.0.0
BEGIN{
FS=","
}
{
if (NR == 1)
{
key1 = $1
key2 = $2
field = $8
# when on first record, there's nothing to compare with
next
}
if ($1 == key1)
{
if ($2 == key2)
{
if ($8 - field > 1)
{
print $0, $8-field
# uncomment following line to print line match number
# print "("NR")",$0, $8-field
}
}
}
# assign for next iteration
key1 = $1
key2 = $2
field = $8
}
tested on your input, found:
$ awk -f script.awk test.txt
47329,39785,2,12,28,351912.53,2533118.81,172.91,1 2.02
47329,39797,2,12,28,352062.77,2533104.67,173.63,1 2.99
Matches line 3 and 7.
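The same comparison against the previous line can be sketched more compactly with pattern rules. This version sets OFS to a comma so the appended difference is comma-separated like the rest of the line; the variable names are my own, and the difference is current minus previous, as in the script above.

```shell
# Input copied from the question.
cat > test.txt <<'EOF'
47329,39785,2,12,10,351912.50,2533105.56,170.93,1
47329,39785,3,6,7,351912.82,2533105.07,170.89,1
47329,39785,2,12,28,351912.53,2533118.81,172.91,1
47329,39785,3,6,20,351913.03,2533117.41,170.93,1
47329,39797,2,12,10,352063.14,2533117.84,170.66,1
47329,39797,3,6,7,352064.11,2533119.32,170.64,1
47329,39797,2,12,28,352062.77,2533104.67,173.63,1
47329,39797,3,6,20,352063.50,2533107.10,170.69,1
EOF

result=$(awk -F, -v OFS=, '
# same key as the previous line and field 8 grew by more than 1
NR > 1 && $1 == k1 && $2 == k2 && $8 - prev > 1 { print $0, $8 - prev }
# remember this line for the next comparison
{ k1 = $1; k2 = $2; prev = $8 }
' test.txt)
printf '%s\n' "$result"
```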

How to add numbers from files to computation?

I need to get results of this formula - a column of numbers
{x = ($1-T1)/Fi; print (x-int(x))}
from inputs file1
4 4
8 4
7 78
45 2
file2
0.2
3
2
1
These files should produce 4 outputs.
$1 is the first column of file1; T1 is the first line of the first column of file1 (the number 4), and it is always this number; Fi, where i = 1, 2, 3, 4, are the numbers from the second file. So I need a loop over i from 1 to 4 that computes the term once with F1=0.2, the second output with F2=3, the third output with F3=2 and the last output with F4=1. How can I express T1 and Fi this way, and how do I write the loop?
awk 'FNR == NR { F[++n] = $1; next } FNR == 1 { T1 = $1 } { for (i = 1; i <= n; ++i) { x = ($1 - T1)/F[i]; print x - int(x) >"output" FNR} }' file2 file1
This gives more than 4 outputs. What is wrong please?
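In the command as posted, FNR == 1 { T1 = $1 } only runs for file1, because the rule before it ends in next; without that next it would also run on the first line of file2 and set T1 to 0.2.
>"output" FNR is problematic: the output name is built from FNR, the line number in file1, so you get one output file per line of file1 rather than one per value in file2. You should also enclose the output name expression in parentheses, because an unparenthesized concatenation after > is parsed differently by different awks.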
Here's how I'd do it:
awk '
NR==1 {t1=$1}
NR==FNR {f[NR]=$1; n=NR; next}
{
fn="output"FNR
for(i=1; i<=n; i++) {
x=(f[i]-t1)/$1
print x-int(x) >fn
}
close(fn)
}
' file1 file2
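A sketch of the parenthesized redirection target, keeping the argument order from the question (file2 first, then file1). It writes output1 to output4 in the current directory, one file per value in file2, and keeps all of them open at once, which is fine for a handful of files but is an assumption that would not scale to thousands.

```shell
# Inputs copied from the question.
printf '4 4\n8 4\n7 78\n45 2\n' > file1
printf '0.2\n3\n2\n1\n' > file2

awk '
NR == FNR { F[++n] = $1; next }      # file2: remember every divisor Fi
FNR == 1  { T1 = $1 }                # first line of file1 gives T1
{
    for (i = 1; i <= n; i++) {
        x = ($1 - T1) / F[i]
        # parentheses make the concatenated file name unambiguous
        print x - int(x) > ("output" i)
    }
}
' file2 file1

cat output2    # fractional parts for F2 = 3
```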

awk one row subtracts the next row if their first two columns are the same

We would like to subtract $17 between rows whose $1 and $2 are the same. Input:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
The desired output should be (1.7511-(-1.5812)=3.3323); (-1-(-3)=2)
3.3323
2
First attempt:
awk -F, ' last != $1""$2 && last{ # ONLY When last key "TargetID + Cpd_number"
print C # differs from actual , print line + substraction
C=0} # reset acumulators
{ # This block process each line of infile
C -= $17 # C calc
line=$0 # Line will be actual line without activity
last=$1""$2} # Store the key in orther to track switching
END{ # This block triggers after the complete file read
# to print the last average that cannot be trigger during
# the previous block
print C}' input
It will give the output:
-0.1699
4
The second attempt:
#!/bin/bash
tail -n+2 test > test2 # remove the title/header
awk -F, '$1 == $1 && $2 == $2 {print $17}' test2 >> test3 # print $17 if the $1 and $2 are the same
awk 'NR==1{s=$1;next}{s-=$1}END{print s}' test3
rm test2 test3
test3 will be
1.7511
-1.5812
-1
-3
Output is
7.3323
Could any guru kindly give some comments? Thanks!
You could try the below awk command,
$ awk -F, 'NR==1{next} {var=$1; foo=$2; bar=$17; getline;} $1==var && $2==foo{xxx=bar-$17; print xxx}' file
3.3323
2
awk '
BEGIN { FS = "," }
NR == 1 { next } # skip header line
{ # accumulate totals
if ($1 SUBSEP $2 in a) # if key already exists
a[$1,$2] -= $17 # subtract $17 from value
else # if first appearance of this key
a[$1,$2] = $17 # set value to $17
}
END { # print results
for (x in a)
print a[x]
}
' file
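A pairwise variant without getline, sketched under the assumption that matching rows are adjacent and come in pairs, as in the sample. The file name input.csv and the variable names are hypothetical. Unlike the array version, it prints results in input order:

```shell
# Header and rows copied from the question (input.csv is a hypothetical name).
cat > input.csv <<'EOF'
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
EOF

result=$(awk -F, '
NR == 1 { next }                          # skip the header
($1 SUBSEP $2) == key {                   # second row of a pair
    print prev - $17
    key = ""                              # start looking for a new pair
    next
}
{ key = $1 SUBSEP $2; prev = $17 }        # first row of a possible pair
' input.csv)
printf '%s\n' "$result"
```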