awk: divide odd columns by following even column - awk

I want to divide all the odd columns in a file by the next even column, e.g. column1/column2, column3/column4,......, columnN/columnN+1
test1.txt
1 4 1 2 1 3
1 2 4 2 3 9
desired output
0.25 0.5 0.333
0.5 2 0.333
I tried this:
awk 'BEGIN{OFS="\t"} { for (i=2; i<NF+2; i+=2) printf $(i-1)/i OFS; printf "\n"}'
but it doesn't work.
I would like to add that my actual files have a very large and variable (but always even) number of columns and I would like something that would work on all of them.

awk '{for(i=1;i<NF;i+=2)printf "%f%s",$i/$(i+1),OFS;print "";}' input.txt
Output:
0.250000 0.500000 0.333333
0.500000 2.000000 0.333333
You can adjust printing format to your needs see here for more info.

Related

How to calculate anomaly using awk

A have a file:
file.txt
1 32
2 34
3 32
4 43
5 25
6 34
7 65
8 34
9 23
10 44
I would like to find anomaly on send column:
my below script printing anomalies considering row-2 to row-10 values. It is not considering row-1 values.
awk 'FNR==NR{
f=1;
if($1 >= 1 && $1 <= 10){
count++;
SUM+=$2;
};
next
}
FNR==1 && f==1{
AVG=SUM/count;
next
}
($1 >= 1 && $1 <= 10){
print $1, $2-AVG
}
' file.txt file.txt
My desire output:
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
I got a solution of it:
awk '{f=$1>=1 && $1<=10}f && NR==FNR{sum+=$2; c++; next}f{ print $1, $2-(sum/c) }' file.txt file.txt
I am still wondering why the first script is not giving correct answer.
Since this is just 2 columns file, this can be done in a single pass awk also:
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i in map) print i, map[i] - mean}' file
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
The first script in the OP is not giving the correct value, because you skip the first line in the second pass of your file. This is seen in the statement (FNR==1 && f==1) { AVG=sum/count; next }. Due to the next statement, you skip the computation of the deviation from the mean value for the first record.
This is an efficient computation of the deviation from the mean in a double pass:
awk '(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
If file contains values bigger than 10 or smaller than 1 in the first, column, but you only want to see this for values in the range of [0,10], then you can do:
awk '($1<1 || $1>10) {next}
(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
There are still other optimizations that can be done, but these only become beneficial when working with extremely large files (many millions of lines).

Making every other row into a new column

So, I have an output that looks like this:
samples pops condition 1 condition 2 condition 3
A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
However, for the next analysis I need the input to look like this
samples pops condition 1 condition 1 condition 2 condition 2 condition 3 condition 3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
So, it is not just making it so that every other row is a new column, every other row in a given column would be in a new column assigned to that same condition, in a way that each sample has two columns for the same condition and not two rows for the same sample. For the example I put 2 samples and 3 conditions, however IRL I have over 100 samples and over 1000 conditions...
any thoughts? I am confident it can be done with awk, but I just can not figure it out.
3 condition columns
Taking the assertion 'the data is perfect' at face value and disregarding years of experience which indicates that data is seldom if ever perfect, then:
awk 'NR == 1 { printf "%s %s %s %s %s %s %s %s\n",
$1, $2, $3, $3, $4, $4, $5, $5; next }
NR == 2 { next }
NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
NR % 2 == 0 { printf "%s %d %d %d %d %d %d %d\n",
$1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$#"
Given the input file:
samples pops condition_1 condition_2 condition_3
A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
the script produces the output:
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
This code is more mechanical than interesting. If you have 10 columns in each line, you'd approach it differently. You'd probably use loops to save and print the data. If you want a blank line between the headings and the data, you can easily add one (NR == 2 { print; next } or use \n\n in place of \n in the first printf function). You can arrange for the output fields to be separated by tabs if you wish (they're separated by double spaces in this code).
The code does not depend on tabs separating the data fields; it only depends on there being no white space within a field.
Many condition columns
When there are many condition columns, you need to use arrays and loops to capture and print the data, like this:
awk 'NR == 1 { printf "%s %s", $1, $2
for (i = 3; i <= NF; i++) printf " %s %s", $i, $i
print ""
next
}
NR == 2 { next }
NR % 2 == 1 { for (i = 3; i <= NF; i++) c[i] = $i }
NR % 2 == 0 { printf "%s %d", $1, $2;
for (i = 3; i <= NF; i++) printf " %d %d", c[i], $i
print ""
}' "$#"
When run on the same data as before, it produces the same output as before, but the loops would allow it to read 1000 conditions per input line and generate 2000 conditions per output line. The only possible issue is whether your version of Awk handles such long input lines in the first place. If need be, upgrade to GNU Awk.
A simple solution (no output headers) with GNU datamash (which is a nice tool for "command-line statistical operations" on textual files):
$ grep -v ^$ file | datamash -W -g1 --header-in first 2 collapse 3-5 | tr ',' ' ' | column -t
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
First, skip all blank lines with grep, then with datamash group lines according to the first field (-g1), using whitespace(s) as field separators (-W), collapsing multiple rows in a group for fields 3, 4 and 5. Collapsed values are comma separated, that's why we have to break them with tr.
For a different number of columns, just adapt the range for collapse operation (e.g. collapse 3-1000). And due to grouping operation, any number of samples per group is already supported.
awk to the rescue!
awk '{k=$1 FS $2}
NR==1 {p0=$0; pk=k}
pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
pk!=k {p0=$0; pk=$1 FS $2}' file
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
will work for unspecified number of columns and records, as long as they are all well-formed (same number of columns) and grouped (same keys are in sequence).

Concatenate files based off unique titles in their first column

I have many files that are of two column format with a label in the first column and a number in the second column. The number is positive (never zero):
AGS 3
KET 45
WEGWET 12
FEW 56
Within each file, the labels are not repeated.
I would like to concatenate these many files into one file with many+1 columns, such that the first column includes the unique set of all labels across all files, and the last five columns include the number for each label of each file. If the label did not exist in a certain file (and hence there is no number for it), I would like it to default to zero. For instance, if the second file contains this:
AGS 5
KET 14
KJV 2
FEW 3
then the final output would look like:
AGS 3 5
KET 45 14
WEGWET 12 0
KJV 0 2
FEW 56 3
I am new to Linux, and have been playing around with sed and awk, but realize this probably requires multiple steps...
*Edit note: I had to change it from just 2 files to many files. Even though my example only shows 2 files, I would like to do this in case of >2 files as well. Thank you...
Here is one way using awk:
awk '
NR==FNR {a[$1]=$0;next}
{
print (($1 in a)?a[$1] FS $2: $1 FS "0" FS $2)
delete a[$1]
}
END{
for (x in a) print a[x],"0"
}' file1 file2 | column -t
AGS 3 5
KET 45 14
KJV 0 2
FEW 56 3
WEGWET 12 0
You read file1 in to an array indexed at column 1 and assign entire line as it's value
For the file2, check if column 1 is present in our array. If it is print the value from file1 along with value from file2. If it is not present print 0 as value for file1.
Delete the array element as we go along to get only what was unique in file1.
In the END block print what was unique in file1 and print 0 for file2.
Pipe the output to column -t for pretty format.
Assuming that your data are in files named file1 and file2:
$ awk 'FNR==NR {a[$1]=$2; b[$1]=0; next} {a[$1]+=0; b[$1]=$2} END{for (x in b) {printf "%-15s%3s%3s\n",x,a[x],b[x]}}' file1 file2
KJV 0 2
WEGWET 12 0
KET 45 14
AGS 3 5
FEW 56 3
To understand the above, we have to understand an awk trick.
In awk, NR is the number of records (lines) that have been processed and FNR is the number of records that we have processed in the current file. Consequently, the condition FNR==NR is true only when we are processing in the first file. In this case, the associative array a gets all the values from the first file and associative array b gets placeholder, i.e. zero, values. When we process the second file, its values go in array b and we make sure that array a at least has a placeholder value of zero. When we are done with the second file, the data is printed.
More than two files using GNU Awk
I created a file3:
$ cat file3
AGS 3
KET 45
WEGWET 12
FEW 56
AGS 17
ABC 100
The awk program extended to work with any number of files is:
$ awk 'FNR==1 {n+=1} {a[$1][n]=$2} END{for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) {printf "%5s",a[x][i]};print ""}}' file1 file2 file3
KJV 2
ABC 100
WEGWET 12 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56
This code works creates a file counter. We know that we are in a new file every time that FNR is 1 and a counter, n, is incremented. For every line we encounter, we put the data in a 2-D array. The first dimension of a is the label and the second is the number of the file that we encountered it in. In the end, we just loop over all the labels and all the files, from 1 to n and print the data.
More than 2 files without GNU Awk
Without requiring GNU's awk, we can solve the problem using simulated two-dimensional arrays:
$ awk 'FNR==1 {n+=1} {b[$1]=1; a[$1,":",n]=$2} END{for (x in b) {printf "%-15s",x; for (i=1;i<=n;i++) {q=a[x,":",i]+0; printf "%5s",q};print ""}}' file1 file2 file3
KJV 0 2 0
ABC 0 0 100
WEGWET 12 0 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56

Use awk to sum or average for each unique ID

Can anyone tell me how to use awk in order to calculate the sum of two individuals columns or the average of one column for each unique ID.
Input
chr1 3661532 3661533 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661534 3661535 0.2 5 1 chr1 3661529 3662079 NM_01011874
chr1 3661537 3661538 0.0 5 0 chr1 3661529 3662079 NM_01011874
chr1 3661559 3661560 0.0 6 0 chr1 3661529 3662079 NM_01011874
chr2 4661532 4661533 0.0 8 0 chr1 4661532 4661533 NM_00175642
chr2 6661534 6661535 0.2 5 2 chr1 6661534 6661535 NM_00175642
chr2 2661537 2661538 0.0 5 0 chr1 2661537 2661538 NM_00175642
chr2 9661559 9661560 0.0 7 0 chr1 9661559 9661560 NM_00175642
Output (sum $5 $6) for each unique ID
NM_01011874 21 1
NM_00175642 25 2
or average of $4 for each unique ID
NM_01011874 0.0476
NM_00175642 0.08
Also, if you could breakdown the components of the solution I would be grateful. I'm a molecular biologist with minimal bioinformatics training.
sum of columns 5 and 6 per id:
awk '{sum5[$10] += $5; sum6[$10] += $6}; END{ for (id in sum5) { print id, sum5[id], sum6[id] } }' < /tmp/input
NM_00175642 25 2
NM_01011874 21 1
Explained: $10 is the id field, $5 and $6 are columns 5 and 6. We build 2 arrays for summing columns 5 and 6 (which are indexed by strings, so we can use the id field). Once we've processed all the lines/records, we iterate through the array keys (id strings), and print the value at that array index.
average of column 4 per id:
awk '{sum4[$10] += $4; count4[$10]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' < /tmp/input
NM_00175642 0.05
NM_01011874 0.05
Explained: Very similar to the summing example. We keep a sum of column 4 per id, and a count of records seen for each id. At the end, we iterate through the ids and print the sum/count.
I don't do much with awk, I find Perl much better for small scripts. But this looks like a good starting point. There are links to more pages with example scripts.

How to Add Column with Percentage

I would like to calculate percentage of value in each line out of all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but when I calculated total for all lines I didn't know how to preserve rest of line unchanged. Thanks a lot for help!
Here you go, one pass step awk solution -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If tab is a required in output then just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakout of pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's in-built variable that keeps track of number of records (by default separated by a new line) in a given file. So FNR in our case would be 4. NR is similar to FNR but it does not get reset to 0. It continues to grow on. So NR in our case would be 8.
This pattern will be true only for the first 4 records and thats exactly what we want. After perusing through the 4 records, we are assign the total to a variable a. Notice that we did not initialize it. In awk we don't have to. However, this would break if entire column 2 is 0. So you can handle it by putting an if statement in the second action statement i.e do the division only if a > 0 else say division by 0 or something.
next is needed cause we don't really want second pattern {action} statement to execute. next tells awk to stop further actions and move to the next record.
Once the four records are parsed, the next pattern{action} begins, which is pretty straight forward. Doing the percentage and print column 1 and 2 along with percentage next to them.
Note: As #lhf mentioned in the comment, this one-liner will only work as long as you have the data set in a file. It won't work if you pass data through a pipe.
In the comments, there is a discussion going on ways to make this awk one-liner take input from a pipe instead of a file. Well the only way I could think of was to store the column values in array and then using for loop to spit each value out along with their percentage.
Now arrays in awk are associative and are never in order, i.e pulling the values out of arrays will not be in the same order as they went in. So if that is ok then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
You need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is better way but I would pass file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
## Tab as field separator.
FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
total += $2;
next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run the script in my linux box:
gawk -f script.awk infile infile
And result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00