Making every other row into a new column - awk

So, I have an output that looks like this:
samples pops condition 1 condition 2 condition 3

A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
However, for the next analysis I need the input to look like this
samples pops condition 1 condition 1 condition 2 condition 2 condition 3 condition 3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
So it is not just turning every other row into a new column: every other row's value in a given column should go into a new column assigned to that same condition, so that each sample ends up with two columns per condition rather than two rows per sample. For the example I used 2 samples and 3 conditions, but in real life I have over 100 samples and over 1000 conditions...
Any thoughts? I am confident it can be done with awk, but I just cannot figure it out.

3 condition columns
Taking the assertion 'the data is perfect' at face value, and disregarding years of experience indicating that data is seldom if ever perfect, then:
awk 'NR == 1 { printf "%s %s %s %s %s %s %s %s\n",
               $1, $2, $3, $3, $4, $4, $5, $5; next }
     NR == 2 { next }   # skip the blank line after the header
     NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
     NR % 2 == 0 { printf "%s %d %d %d %d %d %d %d\n",
                   $1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"
Given the input file (note the blank line after the header):
samples pops condition_1 condition_2 condition_3

A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
the script produces the output:
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
This code is more mechanical than interesting. If you had 10 columns in each line, you'd approach it differently; you'd probably use loops to save and print the data. If you want a blank line between the headings and the data, you can easily add one (NR == 2 { print; next }, or use \n\n in place of \n in the first printf). You can arrange for the output fields to be separated by tabs if you wish (they're separated by single spaces in this code).
The code does not depend on tabs separating the data fields; it only depends on there being no white space within a field.
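For example, a tab-separated variant of the same script, a sketch in which only the format strings change:
awk 'NR == 1 { printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
               $1, $2, $3, $3, $4, $4, $5, $5; next }
     NR == 2 { next }
     NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
     NR % 2 == 0 { printf "%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
                   $1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"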
Many condition columns
When there are many condition columns, you need to use arrays and loops to capture and print the data, like this:
awk 'NR == 1 { printf "%s %s", $1, $2
               for (i = 3; i <= NF; i++) printf " %s %s", $i, $i
               print ""
               next
             }
     NR == 2 { next }
     NR % 2 == 1 { for (i = 3; i <= NF; i++) c[i] = $i }
     NR % 2 == 0 { printf "%s %d", $1, $2
                   for (i = 3; i <= NF; i++) printf " %d %d", c[i], $i
                   print ""
                 }' "$@"
When run on the same data as before, it produces the same output as before, but the loops would allow it to read 1000 conditions per input line and generate 2000 conditions per output line. The only possible issue is whether your version of Awk handles such long input lines in the first place. If need be, upgrade to GNU Awk.
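If you want to verify that before trusting it with real data, a throwaway generator along these lines produces a file of the right shape (the sizes and the wide_test.txt name are made up for illustration):
awk 'BEGIN {
    printf "samples pops"
    for (i = 1; i <= 1000; i++) printf " condition_%d", i
    print ""
    print ""                         # blank line after the header, as in the input above
    for (s = 1; s <= 100; s++)
        for (r = 1; r <= 2; r++) {   # two rows per sample
            printf "A%05d 15", s
            for (i = 1; i <= 1000; i++) printf " %d", (s + r + i) % 5
            print ""
        }
}' > wide_test.txt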

A simple solution (no output headers) with GNU datamash (which is a nice tool for "command-line statistical operations" on textual files):
$ grep -v ^$ file | datamash -W -g1 --header-in first 2 collapse 3-5 | tr ',' ' ' | column -t
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
First, skip all blank lines with grep; then, with datamash, group lines by the first field (-g1), using whitespace as the field separator (-W) and collapsing the multiple rows of a group for fields 3, 4 and 5. Collapsed values are comma-separated, which is why we have to break them apart with tr.
For a different number of columns, just adapt the range of the collapse operation (e.g. collapse 3-1000). And thanks to the grouping operation, any number of samples per group is already supported.
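For instance, if the conditions occupied fields 3 through 1002 (1000 conditions; the numbers here are made up for illustration), the adapted pipeline would be:
$ grep -v ^$ file | datamash -W -g1 --header-in first 2 collapse 3-1002 | tr ',' ' ' | column -t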

awk to the rescue!
awk '{k=$1 FS $2}
NR==1 {p0=$0; pk=k}
pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
pk!=k {p0=$0; pk=$1 FS $2}' file
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
will work for an unspecified number of columns and records, as long as they are all well-formed (the same number of columns) and grouped (rows with the same key are adjacent).
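If some samples might have more than two rows, here is a hedged sketch of the same idea that instead buffers fields per column until the key changes (it skips the header rather than duplicating it, and ignores blank lines):
awk 'function flush(   i, out) {
         if (key == "") return
         out = key
         for (i = 3; i <= nf; i++) out = out OFS col[i]
         print out
         split("", col)              # portable way to empty an array
     }
     NR == 1 || NF == 0 { next }     # skip the header and any blank lines
     {
         k = $1 OFS $2
         if (k != key) { flush(); key = k }
         nf = NF
         for (i = 3; i <= NF; i++)
             col[i] = (i in col) ? col[i] OFS $i : $i
     }
     END { flush() }' file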

Related

How to calculate anomaly using awk

I have a file:
file.txt
1 32
2 34
3 32
4 43
5 25
6 34
7 65
8 34
9 23
10 44
I would like to find the anomaly in the second column:
My script below prints anomalies for rows 2 to 10 only; it does not print one for row 1.
awk 'FNR==NR{
f=1;
if($1 >= 1 && $1 <= 10){
count++;
SUM+=$2;
};
next
}
FNR==1 && f==1{
AVG=SUM/count;
next
}
($1 >= 1 && $1 <= 10){
print $1, $2-AVG
}
' file.txt file.txt
My desired output:
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
I found a solution for it:
awk '{f=$1>=1 && $1<=10}f && NR==FNR{sum+=$2; c++; next}f{ print $1, $2-(sum/c) }' file.txt file.txt
I am still wondering why the first script is not giving correct answer.
Since this is just a two-column file, this can also be done in a single-pass awk:
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i in map) print i, map[i] - mean}' file
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
The first script in the OP does not give the correct result because you skip the first line during the second pass over the file. This happens in the statement FNR==1 && f==1 { AVG=SUM/count; next }: due to the next statement, the deviation from the mean is never computed or printed for the first record.
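For reference, the smallest repair to the OP's script is simply to drop that next (the f flag is redundant too, since the first block's next already keeps the second block from running during the first pass):
awk 'FNR==NR{
if($1 >= 1 && $1 <= 10){
count++;
SUM+=$2;
};
next
}
FNR==1{
AVG=SUM/count    # compute the mean, but do NOT skip this record
}
($1 >= 1 && $1 <= 10){
print $1, $2-AVG
}
' file.txt file.txt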
This is an efficient computation of the deviation from the mean in a double pass:
awk '(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
If the file contains values bigger than 10 or smaller than 1 in the first column, but you only want to see results for values in the range [1,10], then you can do:
awk '($1<1 || $1>10) {next}
(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
There are still other optimizations that can be done, but these only become beneficial when working with extremely large files (many millions of lines).

awk: Print $1 with varying number of additional fields on the same line

I have an input file with ~100 lines and ~100 fields per line. Each field represents either a positive or negative value. I wish to print $1 followed by only the positive or negative fields in each line. The number of positive or negative fields per line is random.
sample input
0 x 9 8 7 -1 -2 -3
2 x 7 6 -2 -3 -4 -5
4 x 4 3 2 1 -6 -7
desired output
positive
0 9 8 7
2 7 6
4 4 3 2 1
negative
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
context and attempt
The above outputs print $1, followed by either the positive or the negative values of the remaining fields, on the same line as $1.
The current code I tried (for positive values, starting on line 6 in my input):
awk 'NR>5{for(i=3; i<=NF; i++) if ( $i > 0 ) print $1, $i}' input > output
This works fine, except that the output it prints looks like:
0 9
0 8
0 7
2 7
2 6
4 4
4 3
4 2
4 1
I have also tried:
awk 'BEGIN {ORS="\t"} NR>5 {print $1} {for(i=3;i<=NF;i++) if ( $i > 0 && i <= NF ) {print $i}}' input > output
but then I never move to a new line in the output. If I change ORS back to \n via some 'else if (i == NF) {ORS=...}' condition, then it prints every field on a new line, as if the BEGIN statement had no effect.
question
How can I tell awk to print $1, then print all other output from the same input line onto the same output line, then advance 1 new line in the output and repeat the process for the next input line?
Thank you.
response to Tiw's answer
I tried to execute this in a loop for my two files:
for j in 1 2; do
positive=ofile.p0
negative=ofile.m0
awk 'NR>5{
printf $1>"positive";
printf $1>"negative";
for(i=3;i<=NF;i++)
if($i~/[-+]?[0-9]+/)
if ($i>0) printf OFS $i>"positive";
else if($i<0) printf OFS $i>"negative";
print "">"positive";
print "">"negative";
}' ofile.0$j
mv positive $positive$j
mv negative $negative$j
done
but it hangs. Edit: Tiw's answer was updated to use %s in printf. It works with this change.
Try this:
awk 'NF>5{printf "%s",$1>"positive";printf "%s",$1>"negative"; for(i=2;i<=NF;i++) if($i~/^[-+]?[0-9]+$/) if ($i>0) printf "%s",OFS $i>"positive"; else if($i<0) printf "%s",OFS $i>"negative"; print "">"positive";print "">"negative";}' input
With a file named input:
0 x 9 8 7 -1 -2 -3
2 x 7 6 -2 -3 -4 -5
4 x 4 3 2 1 -6 -7
It will create two files,
one positive:
0 9 8 7
2 7 6
4 4 3 2 1
one negative:
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
Put on multiple lines for better readability:
awk 'NF>5{
printf "%s",$1>"positive";
printf "%s",$1>"negative";
for(i=2;i<=NF;i++)
if($i~/^[-+]?[0-9]+$/) ## Another and better way is $i == $i + 0
if ($i>0) printf "%s",OFS $i>"positive";
else if($i<0) printf "%s",OFS $i>"negative";
print "">"positive";
print "">"negative";
}' input
It's quite straightforward so I guess it's easy for you to understand.
Note I didn't use {} around the bodies of the for and the ifs, since each has only one statement after it, so the braces can be omitted.
print prints a newline character \n at the end; printf doesn't.
Also, NR means Number of Records, i.e. the line number. I changed it to NF, which means Number of Fields; I think this is what you wanted.
if($i~/^[-+]?[0-9]+$/) tests that the field is an (integer) number.
If the field is never empty, then $i == $i+0 is a better test.
To also rule out fields that are 0 or empty, combine the two: $i && ($i == $i+0).
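To see the difference between the two tests, here is a small throwaway demo (the sample fields are made up, and exact numeric-string handling can vary slightly between awk implementations):
echo '12 -3.4 1e3 abc 7x' | awk '{
    for (i = 1; i <= NF; i++)
        print $i, ($i ~ /^[-+]?[0-9]+$/ ? "integer regex: yes" : "integer regex: no"),
                  ($i == $i + 0 ? "numeric test: yes" : "numeric test: no")
}'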
The first thing you need to do is check that the field is a number; only then can you compare it against zero. In awk, you can check whether a variable is a number by adding zero to it and testing whether the result still compares equal to the original value.
For positive numbers you do this:
awk '{for(i=1;i<=NF;++i) if ($i+0 == $i && $i >= 0) printf "%s", $i OFS; printf "%s", ORS}' file
If Perl is an option,
Input:
$ cat blaisem.txt
0 x 9 8 7 -1 -2 -3
2 x 7 6 -2 -3 -4 -5
4 x 4 3 2 1 -6 -7
$
+ve and -ve separate runs
$ perl -ne ' @p=/(\S+)(?<=\d)/g;print "$p[0] "; for(@p[1..$#p]) { print "$_ " if $_ >=0 } print "\n" ' blaisem.txt
0 9 8 7
2 7 6
4 4 3 2 1
$ perl -ne ' @p=/(\S+)(?<=\d)/g;print "$p[0] "; for(@p[1..$#p]) { print "$_ " if $_ < 0 } print "\n" ' blaisem.txt
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
$
+ve and -ve in one script
$ perl -ne ' open(POS,">>pos.txt"); open(NEG,">>neg.txt"); @p=/(\S+)(?<=\d)/g;
print POS "$p[0] "; print NEG "$p[0] ";
for(@p[1..$#p]) { print NEG "$_ " if $_ < 0; print POS "$_ " if $_>=0 }
print POS "\n"; print NEG "\n" ' blaisem.txt
$ cat pos.txt
0 9 8 7
2 7 6
4 4 3 2 1
$ cat neg.txt
0 -1 -2 -3
2 -2 -3 -4 -5
4 -6 -7
$

If two columns from different files are equal, replace the third column with awk

I am looking for a way to replace a column in a file, if two ID columns match.
I have file A.txt
c a b ID
1 0.01 5 1
2 0.1 6 2
3 2 3
and file B.txt
ID a b
1 10 15
2 20 16
3 30 12
4 40 14
The output I'm looking for is
file A.txt
ID a b
1 0.01 5
2 0.1 6
3 30 2
With awk I can find which IDs occur in both files:
awk 'NR==FNR{a[$1];next}$1 in a' B.txt A.txt
But how do I add the replacement? Thank you for any suggestions.
awk solution:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; next }
FNR>1 && $1 in a && NF<3{ f=$2; $2=a[$1]; $3=f }1' B.txt A.txt | column -t
if(NR>1) a[$1]=$2; - capture the column values from file B.txt, except for the header line (NR>1)
FNR>1 && $1 in a && NF<3 - if the IDs match and the line from A.txt has fewer than 3 fields
The output:
ID a b
1 0.01 5
2 0.1 6
3 30 2
Adapted to your new data format
awk '
# Load file b reference
FNR==NR && NR > 1 {ColB[$1]=$2; next}
# treat file A
{
# set missing field if known in file B (and not 1st line)
if ( NF < 4 && ( $NF in ColB) && FNR > 1) $0 = $NF FS ColB[$NF] FS $2
# print result (in any case)
print
}
# order of files is mandatory' B.txt A.txt
Self-documented.
This assumes that only the second field can be missing, as in your sample.
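As an aside, for the more common case where no field is missing and you simply want to overwrite a column whenever the IDs match, the usual NR==FNR lookup idiom is enough; this is a sketch assuming the ID is the first field of both files:
awk 'NR==FNR { repl[$1] = $2; next }            # B.txt: remember the value per ID
     FNR > 1 && ($1 in repl) { $2 = repl[$1] }  # A.txt: overwrite field 2 on a match
     { print }' B.txt A.txt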

Processing Multiple Files with Awk - Unwanted Lines Are Printing

I'm trying to write a script to process two files. I'm getting stuck on a small detail that I've been unsuccessful in troubleshooting - hoping someone here can help!
I have two text files, the first with a single column and seven rows (all fruits). The second text file has two columns and seventeen rows (first column numbers, second column colors). My script is below - I've eliminated the rest of it, because after some troubleshooting I've found that the problem is here.
This script...:
BEGIN { FS = " " }
NR==FNR
{
print NR "\t" FNR
}
END{}
When invoked with "awk -f script.awk file1.txt file2.txt", it produces this output:
apples
1 1
oranges
2 2
pears
3 3
grapes
4 4
mango
5 5
kiwi
6 6
banana
7 7
8 1
9 2
10 3
11 4
(truncated)
I don't understand what's happening here. The fields of file1 (the fruits) are being printed, but the only print statement in this script is printing the values of NR and FNR, which, from what I understand, are always numbers.
When I comment out the NR==FNR statement,
BEGIN { FS = " " }
#NR==FNR
{
print NR "\t" FNR
}
END{}
The output is as expected:
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 1
9 2
10 3
11 4
(truncated)
I need to use the NR==FNR statement in order to process multiple files.
Does anyone know what's happening here? Seems like such a basic issue (it's only 3 statements!), but I can't get rid of the damn fruits.
NR==FNR by itself is a pattern without an action, and the default action is to print the line (i.e. {print}).
So awk sees NR==FNR as a test that is true throughout the first file (as you indicated), and whenever it succeeds it runs that default action.
So your script is effectively:
BEGIN { FS = " " }
NR==FNR {
print
}
{
print NR "\t" FNR
}
END{}
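The cure is to attach an explicit action to the pattern so the default print never fires. For example, if the eventual goal is to collect the first file's fruits for later lookup and print the counters only while reading the second file, a sketch (the fruits array name is made up) would be:
BEGIN { FS = " " }
NR==FNR { fruits[$1]; next }   # first file: just remember each fruit, print nothing
{ print NR "\t" FNR }          # runs for the second file only
END{}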

How to Add Column with Percentage

I would like to calculate percentage of value in each line out of all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but when I calculated total for all lines I didn't know how to preserve rest of line unchanged. Thanks a lot for help!
Here you go, a one-step awk solution (note that it reads the file twice) -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output, then just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakout of the pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that keeps track of the number of records in the current file (by default, records are separated by newlines). So FNR in our case would reach 4. NR is similar to FNR, but it does not get reset between files; it keeps growing, so NR in our case would reach 8.
This pattern is therefore true only for the first 4 records, and that's exactly what we want. While reading those 4 records, we accumulate the total in the variable a. Notice that we did not initialize it; in awk we don't have to. However, this would break if the entire column 2 were 0, so you can guard against that with an if statement in the second action, i.e. do the division only if a > 0 and otherwise report a division by zero (see the sketch after this breakout).
next is needed because we don't want the second pattern {action} statement to execute during the first pass. next tells awk to stop further actions and move on to the next record.
Once the four records have been read, the second pattern {action} takes over on the second pass, and it is pretty straightforward: compute the percentage and print columns 1 and 2 with the percentage next to them.
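As a concrete version of that guard, a sketch might read:
awk 'NR==FNR { a += $2; next }
     { if (a > 0) print $1, $2, ($2/a)*100
       else print $1, $2, "undefined (column 2 sums to 0)" }' file file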
Note: As @lhf mentioned in the comment, this one-liner only works as long as you have the data set in a file; it won't work if you pass the data through a pipe, because the file has to be read twice.
In the comments there is a discussion of ways to make this awk one-liner take input from a pipe instead of a file. The only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Now, arrays in awk are associative and not kept in order, i.e. pulling the values out of an array will not necessarily follow the order in which they went in. If that is OK, then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
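Alternatively, a sketch that preserves the input order and also works from a pipe: buffer each line by its record number and replay the lines in END.
cat file | awk '{ line[NR] = $0; sum += $2 }
                END { for (i = 1; i <= NR; i++) {
                          split(line[i], f)
                          printf "%s\t%s\t%.2f\n", f[1], f[2], (f[2]/sum)*100
                      } }'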
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
If you want a literal % sign in the printf output, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is a better way, but I would pass the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
## Tab as field separator.
FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
total += $2;
next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run the script on my Linux box (note that ARGIND is specific to GNU Awk, hence gawk):
gawk -f script.awk infile infile
And the result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00