Iterate awk function for every unique field in column - awk

I have written an awk script to analyse my table data; I am calculating a p-value and a log2 odds ratio.
This is an example of the data table I have.
Label Value1 Value2
Label1 9 6
Label1 7 6
Label1 1 6
Label2 5 7
Label2 3 7
Label2 8 7
For every label (Label1/2) I count how many times Value1 > Value2 and divide this number by the total number of times the label was observed; this gives the p-value.
In addition, I compute the log2 ratio.
This is my awk script.
awk '{a[$1]=$1}; ($2>=$3) {c++}; {sum+=$2} END {print c/NR,log($3/(sum/NR))/log(2),a[$1]}'
And this is the result I get:
0.666667 0.0824622 Label1
Column 1 is the p-value; column 2 is the odds ratio; column 3 is the label.
The problem is that I don't know how to apply this calculation to both labels; I am getting a result only for the first one.
My question is: how do I iterate such an awk function over every unique value in column 1 (Label1/2)?

I assume there are two lines before the first line of data, so I compare NR with 3. The program saves the previous label name ($1), and only when it changes ($1 != label) does it do the calculations and print. The other condition (NR >= 3) only accumulates data while processing the same label.
awk '
    NR == 3 { label = $1 }
    NR >= 3 && $1 != label {
        printf "%.6f %.6f %s\n", c/l, log( v / (sum/l) ) / log(2), label
        c = l = sum = 0
        label = $1
    }
    NR >= 3 {
        if ( $2 >= $3 ) { c++ }
        l++
        sum += $2
        v = $3
    }
    END {
        printf "%.6f %.6f %s\n", c/l, log( v / (sum/l) ) / log(2), label
    }
' infile
It yields:
0.666667 0.082462 Label1
0.333333 0.392317 Label2

Another way with awk (using arrays):
awk '
NR > 1 {                        # skip the header line
    if ($2 > $3) times[$1]++
    total[$1] += $2
    col3[$1] = $3
    seen[$1]++
}
END {
    # loop over seen, not times, so labels with zero hits are still printed
    for (label in seen) {
        print times[label]/seen[label], log(col3[label]/(total[label]/seen[label]))/log(2), label
    }
}' inputFile
Output:
0.666667 0.0824622 Label1
0.333333 0.392317 Label2
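One thing to keep in mind with the array approach: awk does not specify the order in which a for-in loop visits keys, so the labels may not come out in input order. A minimal sketch that records first-appearance order, assuming a single header line and the >= comparison from the question's script (the names `order`, `hits`, `ref` and the shell variable `stats` are only illustrative):

```shell
stats=$(printf '%s\n' 'Label Value1 Value2' \
    'Label1 9 6' 'Label1 7 6' 'Label1 1 6' \
    'Label2 5 7' 'Label2 3 7' 'Label2 8 7' |
awk '
NR > 1 {
    if (!($1 in seen)) order[++n] = $1   # remember labels in first-seen order
    seen[$1]++
    if ($2 >= $3) hits[$1]++
    total[$1] += $2
    ref[$1] = $3                         # last Value2 seen for this label
}
END {
    for (i = 1; i <= n; i++) {
        label = order[i]
        printf "%.6f %.6f %s\n", hits[label] / seen[label],
            log(ref[label] / (total[label] / seen[label])) / log(2), label
    }
}')
echo "$stats"
```

With the sample table this prints the Label1 row first and the Label2 row second, regardless of the awk implementation.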

Related

Changing a list of values in Awk

I am trying to change values in the following list:
A 0.702
B 0.868
C 3.467
D 2.152
If the second column is less than 0.5 I would like to change to -2, between 0.5-1 to -1, between 1-1.5 to 1 and if > 1.5 then to 2.
When I try the following:
awk '$2<0.9 || $2>2' | awk '{if ($2 < 0.5) print $1,-2;}{if($2>0.5 || $2<1) print $1,-1;}{if($2>1 || $2<1.5) print $1,1;}{if($2>2) print $1,2;}'
I get the following:
A -1
A 1
B -1
B 1
C 1
C 2
D 1
D 2
I know I am missing something, but for the life of me I can't figure out what. Any help gratefully received.
If you have multiple if statements and the current value can match more than one of them, you will print multiple outputs.
If you only want to print the output of the first match, you have to prevent the following if statements from running.
You can use a single awk and define non-overlapping matches by combining greater-than and less-than tests with &&.
Note that using only > and < you will not match the boundary values themselves, for example exactly 0.5.
awk '{
if($2 < 0.5) print($1, -2)
if($2 > 0.5 && $2<1) print($1,-1)
if($2 > 1 && $2<1.5) print($1, 1)
if($2 > 1.5) print($1 ,2)
}
' file
Output
A -1
B -1
C 2
D 2
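To make the boundary caveat concrete, here is a small sketch that runs the strict comparisons and an inclusive variant on a value of exactly 0.5. Treating the lower bound of each range as inclusive is an assumption; the question does not say which side a boundary belongs to. The rows E and F and the shell variable names are made up for illustration.

```shell
rows='E 0.5
F 0.7'

# Strict comparisons: 0.5 satisfies none of the tests, so row E vanishes.
strict=$(printf '%s\n' "$rows" | awk '{
    if ($2 < 0.5)            print $1, -2
    if ($2 > 0.5 && $2 < 1)  print $1, -1
    if ($2 > 1 && $2 < 1.5)  print $1, 1
    if ($2 > 1.5)            print $1, 2
}')

# Inclusive lower bounds (assumption): every value lands in exactly one bin.
inclusive=$(printf '%s\n' "$rows" | awk '{
    if ($2 < 0.5)             print $1, -2
    if ($2 >= 0.5 && $2 < 1)  print $1, -1
    if ($2 >= 1 && $2 < 1.5)  print $1, 1
    if ($2 >= 1.5)            print $1, 2
}')

echo "strict:";    echo "$strict"
echo "inclusive:"; echo "$inclusive"
```

The strict version prints only the F row, while the inclusive version prints both rows.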
With your shown samples only: here is one more solution, using ternary operators for the condition checking (for fun :) ).
awk '{print (NF?($2>1.5?($1 OFS 2):($2>1?($1 OFS 1):($2>0.5?($1 OFS "-1"):($1 OFS "-2")))):"")}' Input_file
A more readable form of the above awk code; since it is a one-liner, it is broken up across multiple lines here.
awk '
{
print \
(\
NF\
?\
($2>1.5\
?\
($1 OFS 2)\
:\
($2>1\
?\
($1 OFS 1)\
:\
($2>0.5\
?\
($1 OFS "-1")\
:\
($1 OFS "-2")\
)\
)\
)\
:\
""\
)
}
' Input_file
Explanation: the nested ternary operators evaluate the conditions and select the value to print accordingly (everything happens inside the print statement).
Another. Replace <s with <=s where needed:
$ awk '{
if($2<0.5) # from low to higher sets the lower limit
$2=-2
else if($2<1) # so only upper limit needs to be tested
$2=-1
else if($2<1.5)
$2=1
else
$2=2
}1' file
Output:
A -1
B -1
C 2
D 2
Probably overkill for your needs but here's a data-driven approach using GNU awk for arrays of arrays and +/-inf:
$ cat tst.awk
BEGIN {
range["-inf"][0.5] = -2
range[0.5][1] = -1
range[1][1.5] = 1
range[1.5]["+inf"] = 2
}
{
val = ""
for ( beg in range ) {
for ( end in range[beg] ) {
if ( (beg+0 < $2) && ($2 <= end+0) ) {
val = range[beg][end]
}
}
}
print $1, val
}
$ awk -f tst.awk file
A -1
B -1
C 2
D 2
I'm assuming above that "between" excludes the start of the range but includes the end of it. You could make it slightly more efficient with:
for ( beg in range ) {
if ( beg+0 < $2 ) {
for ( end in range[beg] ) {
if ( $2 <= end+0 ) {
val = range[beg][end]
}
}
}
}
but I just like having the range comparison all on 1 line and there's only 1 end for every begin so it doesn't make much difference.
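If GNU awk is not available, a similar table-driven idea can be sketched in any POSIX awk with two parallel arrays built by split(). This is only an illustration, not the answer's method: the names `upper` and `score` are made up, and it keeps the same assumption that each range includes its upper end.

```shell
binned=$(printf 'A 0.702\nB 0.868\nC 3.467\nD 2.152\n' | awk '
BEGIN {
    nb = split("0.5 1 1.5", upper, " ")   # inclusive upper bound of each bin
    split("-2 -1 1 2", score, " ")        # one score per bin, plus one for the rest
}
{
    val = score[nb + 1]                   # default: above the last bound
    for (i = 1; i <= nb; i++) {
        if ($2 <= upper[i] + 0) {         # first bound that is not exceeded wins
            val = score[i]
            break
        }
    }
    print $1, val
}')
echo "$binned"
```

Changing the bins then only means editing the two strings in BEGIN.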
UPDATE 1 : new equation should cover nearly all scenarios :
1st half equation handles the sign +/-
2nd half handles the magnitude of the binning
mawk '$NF = (-++_)^(+(__=$NF)<_) * ++_^(int(__+_--^-_)!=_--)'
X -1.25 -2
X -1.00 -2
X -0.75 -2
X -0.50 -2
X -0.25 -2
X 0.00 -2
X 0.25 -2
X 0.50 -1
X 0.75 -1
X 1.00 1
X 1.25 1
X 1.50 2
X 1.75 2
X 2.00 2
X 2.25 2
X 2.50 2
==============================
This may not cover every possible scenario, but if you want a one-liner that covers the samples shown:
mawk '$NF = 4 < (_=int(2*$NF)-2)^2 ? 1+(-3)^(_<-_) :_'
A -1
B -1
C 2
D 2

Divide largest value by second largest value

I have a file in the following format. Column 1 has ~20,000 unique entries, column 2 has ~120,000 different entries, and column 3 holds the count associated with column 2. A single entry in column 1 can have multiple entries in column 2. For each unique entry in column 1, I am trying to get the ratio of the maximum value to the second-largest value of column 3.
F1.txt
S1 S2 C1
A A1 1
A AA 10
A A6 5
A A0 4
B BB 12
B BC 11
B B1 19
B B9 4
Expected Output
S1 S2 C1
B B1 19 1.58333
A AA 10 2
I can do it in steps like below, but is there a smart way of doing it in one script?
awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r "}' F1.txt | awk '!seen[$1]++' >del1.txt
awk 'FNR==NR{a[$2]=1; next}FNR==1{print $0;}!a[$2]' del1.txt F1.txt | awk 'NR==1; NR > 1 {print $0 | "sort -k3 -n -r"}' | awk '!seen[$1]++' >del2.txt
awk 'FNR==NR{a[$1]=$3; next}FNR==1{print $0"\t";"RT"}FNR>1 a[$1]{print $0"\t"$3/a[$1]}' del2.txt del1.txt
#!/usr/bin/awk -f
NR == 1 { print $1, $2, $3; next }
{ data[$1][$3] = $2 }
END {
for (key in data) {
asorti(data[key], s, "@ind_num_desc")
print key, data[key][s[1]], s[1], s[1] / s[2]
}
}
This^^^ assumes an arbitrary permutation of the lines (and requires gawk (which is pretty common) or another implementation with native multi-dimensional “arrays”).
If you can make more assumptions about the input — e.g. that it is always grouped by the first column —, then you can make it more memory-efficient and get rid of multi-dimensional arrays (by not delaying the evaluation until END and instead calculating it in a per-line block each time the first column’s value changes (and then one last time in END).)
To get a different handling of equal numeric values (e.g. to report the “subkey” (column 2) of the first (instead of last) encountered occurrence of a value), you could add if (!($3 in data[$1])) ... or the like.
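For ungrouped input without gawk, one possible sketch keeps only the largest and second-largest count per key in ordinary one-dimensional arrays. The array names, the shuffled sample rows and the "NA" placeholder for keys with a single row are assumptions made for illustration.

```shell
ratios=$(printf '%s\n' 'S1 S2 C1' \
    'A A1 1' 'B BB 12' 'A AA 10' 'B BC 11' \
    'A A6 5' 'B B1 19' 'A A0 4' 'B B9 4' |
awk '
NR == 1 { next }                                   # skip the header
!($1 in max) {                                     # first row for this key
    order[++n] = $1; max[$1] = $3; name[$1] = $2
    next
}
$3 + 0 > max[$1] + 0 {                             # new maximum: old one becomes runner-up
    second[$1] = max[$1]; max[$1] = $3; name[$1] = $2
    next
}
!($1 in second) || $3 + 0 > second[$1] + 0 {       # new runner-up
    second[$1] = $3
}
END {
    for (i = 1; i <= n; i++) {
        k = order[i]
        print k, name[k], max[k], ((k in second) ? max[k] / second[k] : "NA")
    }
}')
echo "$ratios"
```

Memory use grows with the number of distinct keys in column 1 rather than with the number of rows.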
Whenever you find yourself creating a pipeline containing awk, there is a very good chance that what you are trying to do can be done in a single call to awk much more efficiently.
A non-GNU awk approach that presumes all field1 'A' records are together and all 'B' records are together (as you show in your sample data) could be:
awk '
FNR==1 { print; next } ## 1st line, output heading
$1 != n { ## 1st field changed
if (n) { ## if n set, output result of last block
printf "%s\t%s\n", rec, max / nextmax
}
rec = $0 ## initialize vars for next block
n = $1
max = $3
nextmax = 1
next ## skip to next record
}
{
if ($3 > max) { ## check if 3rd field > max
rec = $0 ## save record
nextmax = max ## update nextmax
max = $3 ## update max
}
else if ($3 > nextmax) { ## if 3rd field > nextmax
nextmax = $3 ## update nextmax
}
} ## output final block results
END { printf "%s\t%s\n", rec, max / nextmax }
' file
Example Use/Output
With your data in the file file, you would have:
$ awk '
> FNR==1 { print; next } ## 1st line, output heading
> $1 != n { ## 1st field changed
> if (n) { ## if n set, output result of last block
> printf "%s\t%s\n", rec, max / nextmax
> }
> rec = $0 ## initialize vars for next block
> n = $1
> max = $3
> nextmax = 1
> next ## skip to next record
> }
> {
> if ($3 > max) { ## check if 3rd field > max
> rec = $0 ## save record
> nextmax = max ## update nextmax
> max = $3 ## update max
> }
> else if ($3 > nextmax) { ## if 3rd field > nextmax
> nextmax = $3 ## update nextmax
> }
> } ## output final block results
> END { printf "%s\t%s\n", rec, max / nextmax }
> ' file
S1 S2 C1
A AA 10 2
B B1 19 1.58333
Using any awk in any shell on every Unix box and using almost no memory (important since your input file would be huge given your description of it):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 { print; next }
$1 != prev {
    if ( prev != "" ) {
        print prev, val, max, (preMax ? max/preMax : 0)
    }
    prev = $1
    max = preMax = ""
}
(max == "") || ($3 > max) {
    val = $2
    preMax = max
    max = $3
    next
}
(preMax == "") || ($3 > preMax) {    # second-largest seen after the max
    preMax = $3
}
END { print prev, val, max, (preMax ? max/preMax : 0) }
$ awk -f tst.awk F1.txt
S1 S2 C1
A AA 10 2
B B1 19 1.58333

Successive averaging of repeating data but different number of lines

I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
Data consists of data blocks. Each data block has the same format as follows:
The very first line is the header line. The header line contains the ID number and the total number of lines of each data block. For example, the first data block's ID is 1, and the total number of lines is 3. For the third data block, ID is 3, and the total number of lines is 2. All data blocks have this header line.
Then, the "real data" follows. As I explained, the number of lines of "real data" is designated in the second integer of the header line.
Accordingly, the total number of lines for each data block will be number_of_lines+1. In this example, the total number of lines for data block 1 is 4, and data block 2 costs 6 lines...
This format repeats all the way up to 10000 number of data blocks in my current data, but I can provide this 10000 as a variable in the bash or awk script as an input value. I know the total number of data blocks.
Now, what I wish to do is, I want to get the average of data of each two columns and print it out with data block ID number and a total number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns with 6 decimal places format. The result will have 10000 lines, and each line will have ID, number of lines, avg of column 1 of data, and avg of column 2 of data for each data block. The result of this example will look like
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash. These are already answered in StackOverflow a lot of times. For example, I really favor using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash. But I really have no clue how to approach this, or even how to start computing this kind of average over multiple repeating data blocks, each with a different number of lines.
I was trying based on awk, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how can I use NR or FNR for the average of data with a varying number of total lines of data, for each data block.
You may try this awk:
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
To get fixed-size decimal numbers, set OFMT (a standard awk variable, so GNU awk is not required) like this:
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
    s1 += $1
    s2 += $2
    next
}
{
    if (id) {
        print id, num, s1/num, s2/num
        s1 = s2 = 0
    }
    id = $1
    num = $2
}
END {
    print id, num, s1/num, s2/num
}' file
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
OFS = "\t"
OFMT = "%.6f"
}
num_lines == 0 {
id = $1
num_lines = $2
sum1 = sum2 = 0
next
}
lines_read < num_lines {
sum1 += $1
sum2 += $2
lines_read++
}
lines_read >= num_lines {
print id, num_lines,
sum1 / num_lines,
sum2 / num_lines
num_lines = lines_read = 0
num_blocks--;
}
num_blocks <= 0 {
exit
}' file
You could try
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows:
qnt == "none" {
    id = $1;
    qnt = $2;
    s1 = s2 = line = 0;
    next
}
{
    s1 += $1;
    s2 += $2;
    ++line
}
line == qnt {
    printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt;
    qnt="none"
}
After a data block is processed (or at the beginning), record header info.
Otherwise, add to sum and print the result when we've done with all lines in this block.
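The question also asked for 5 spaces between columns and 6 decimal places. A minimal sketch of that exact layout with printf, assuming every header line is followed by exactly as many data lines as it announces (the counter name `left` and the two-block sample are only illustrative):

```shell
averages=$(printf '%s\n' \
    '1 3' '1.723608 0.8490000' '1.743011 0.8390000' '1.835833 0.7830000' \
    '3 2' '1.993951 0.6610000' '1.796519 0.8090000' |
awk '
left == 0 {                      # no data lines pending: this is a header
    id = $1; n = left = $2
    s1 = s2 = 0
    next
}
{
    s1 += $1; s2 += $2
    if (--left == 0)             # last data line of the block
        printf "%d     %d     %.6f     %.6f\n", id, n, s1 / n, s2 / n
}')
echo "$averages"
```

Counting down from the header value avoids relying on the decimal point to tell data lines from header lines.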

How to add numbers from files to computation?

I need to get results of this formula - a column of numbers
{x = ($1-T1)/Fi; print (x-int(x))}
from inputs file1
4 4
8 4
7 78
45 2
file2
0.2
3
2
1
From these files there should be 4 outputs.
$1 is the first column of file1; T1 is the first line of the first column of file1 (the number 4), and it is always this number; Fi, where i = 1, 2, 3, 4, are the numbers from the second file. So I need a loop over i from 1 to 4 that computes the term once with F1=0.2, the second output with F2=3, the third output with F3=2, and the last output with F4=1. How do I express T1 and Fi in this way, and how do I write the loop?
awk 'FNR == NR { F[++n] = $1; next } FNR == 1 { T1 = $1 } { for (i = 1; i <= n; ++i) { x = ($1 - T1)/F[i]; print x - int(x) >"output" FNR} }' file2 file1
This gives more than 4 outputs. What is wrong please?
FNR == 1 { T1 = $1 } is being run twice; when file2 starts being read, T1 is set to 0.2.
>"output" FNR is problematic; you should enclose the output name expression in parentheses.
Here's how I'd do it:
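awk '
NR==1 {t1=$1}
NR==FNR {f[++n]=$1; next}      # file1: store column 1 in input order
{
    fn="output" FNR            # file2: one output file per Fi
    for(i=1; i<=n; i++) {      # counted loop keeps the file1 line order
        x=(f[i]-t1)/$1
        print x-int(x) > fn
    }
    close(fn)
}
' file1 file2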
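On the redirection point: wrapping the file name expression in parentheses makes the intended grouping explicit to every awk. A small sketch, using a throwaway directory and made-up input values so nothing is written into the current directory:

```shell
workdir=$(mktemp -d)
(
    cd "$workdir" || exit 1
    # One output file per input line: output1, output2, ...
    printf '10\n20\n' | awk '{
        print $1 * 2 > ("output" FNR)
        close("output" FNR)
    }'
)
first=$(cat "$workdir/output1")
second=$(cat "$workdir/output2")
echo "$first $second"
rm -rf "$workdir"
```

Without the parentheses, how `> "output" FNR` is parsed can differ between awk implementations.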

find the Max and Min with AWK in specific range

I have a file with three columns. I want to get the max of $3 and the min of $2, but only within a specific range of $1, using awk:
Col1 Col2 Col3
==============
X 1 2
X 3 4
Y 5 6
Y 7 8
E.g. I want to get the minimum value of Col2 and the maximum value of Col3 where Col1=X.
I can handle the max and min values, but I can't figure out how to restrict them to a specific range.
This is my code:
awk ' min=="" || $2 < min {min=$2; minline=$0} $3 > max {max=$3; maxline=$0};END {print $1,min,max}'
I tried to add {If ($1==X)} but it doesn't work well.
kent$ echo "X 1 2
X 3 4
Y 5 6
Y 7 8
"|awk '$1=="X"{min=$2<min||min==""?$2:min;max=$3>max||max==""?$3:max}END{print min,max}'
1 4
is this what you want?
What about:
awk 'BEGIN { c=1 }
$1 == "X" { if (c==1) { mmin=$2; mmax=$3 ;c++ }
if ($2<mmin) { mmin=$2 }
if ($3>mmax) { mmax=$3 }
}
END { print "X min: " mmin ", max: " mmax }' INPUTFILE
See it in action @ Ideone.
If you want to collect all the minima and maxima:
awk '
!($1 in min) || $2 < min[$1] {min[$1] = $2}
!($1 in max) || $3 > max[$1] {max[$1] = $3}
{col1[$1] = 1}
END {for (c in col1) {print c, min[c], max[c]}}
' file
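When collecting per-key minima, note that an unset array element compares as 0, so a positive value is never smaller than it; the first value for each key has to be stored explicitly. A hedged sketch on the sample data, assuming the two header lines shown in the question (the `order` array and the shell variable `extremes` are illustrative names):

```shell
extremes=$(printf '%s\n' 'Col1 Col2 Col3' '==============' \
    'X 1 2' 'X 3 4' 'Y 5 6' 'Y 7 8' |
awk '
NR <= 2 { next }                       # skip the two header lines
!($1 in min) {                         # first row of a key initialises both values
    order[++n] = $1
    min[$1] = $2; max[$1] = $3
    next
}
$2 + 0 < min[$1] + 0 { min[$1] = $2 }
$3 + 0 > max[$1] + 0 { max[$1] = $3 }
END {
    for (i = 1; i <= n; i++)
        print order[i], min[order[i]], max[order[i]]
}')
echo "$extremes"
```

For the sample this reports min 1 and max 4 for X, and min 5 and max 8 for Y.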