Processing Multiple Files with Awk - Unwanted Lines Are Printing

I'm trying to write a script to process two files. I'm getting stuck on a small detail that I've been unsuccessful in troubleshooting - hoping someone here can help!
I have two text files, the first with a single column and seven rows (all fruits). The second text file has two columns and seventeen rows (first column numbers, second column colors). My script is below - I've eliminated the rest of it, because after some troubleshooting I've found that the problem is here.
This script...:
BEGIN { FS = " " }
NR==FNR
{
print NR "\t" FNR
}
END{}
When invoked with "awk -f script.awk file1.txt file2.txt", it produces this output:
apples
1 1
oranges
2 2
pears
3 3
grapes
4 4
mango
5 5
kiwi
6 6
banana
7 7
8 1
9 2
10 3
11 4
(truncated)
I don't understand what's happening here. The fields of file1 (the fruits) are being printed, but the only print statement in this script is printing the values of NR and FNR, which, from what I understand, are always numbers.
When I comment out the NR==FNR statement,
BEGIN { FS = " " }
#NR==FNR
{
print NR "\t" FNR
}
END{}
The output is as expected:
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 1
9 2
10 3
11 4
(truncated)
I need to use the NR==FNR statement in order to process multiple files.
Does anyone know what's happening here? Seems like such a basic issue (it's only 3 statements!), but I can't get rid of the damn fruits.

NR==FNR by itself is a pattern without an action, and the default action is to print the line (i.e. { print }).
So awk sees NR==FNR as a test that is true throughout the first file (as you indicated) and, whenever it succeeds, applies that default action.
So your script is effectively:
BEGIN { FS = " " }
NR==FNR {
print
}
{
print NR "\t" FNR
}
END{}
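The usual fix is to give NR==FNR an explicit action that ends in next, so lines from the first file never fall through to the later rules. A minimal sketch (the array name f1 is just for illustration):
BEGIN { FS = " " }
NR==FNR { f1[FNR] = $1; next }   # first file: stash the fruits, skip the other rules
{ print NR "\t" FNR }            # now reached only for the second file
END{}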

Related

Making every other row into a new column

So, I have an output that looks like this:
samples pops condition 1 condition 2 condition 3
A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
However, for the next analysis I need the input to look like this:
samples pops condition 1 condition 1 condition 2 condition 2 condition 3 condition 3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
So it is not just that every other row becomes a new column: every other row in a given column should move into a new column assigned to that same condition, so that each sample ends up with two columns for the same condition rather than two rows for the same sample. For the example I put 2 samples and 3 conditions; however, IRL I have over 100 samples and over 1000 conditions...
Any thoughts? I am confident it can be done with awk, but I just cannot figure it out.
3 condition columns
Taking the assertion 'the data is perfect' at face value and disregarding years of experience which indicates that data is seldom if ever perfect, then:
awk 'NR == 1 { printf "%s  %s  %s  %s  %s  %s  %s  %s\n",
               $1, $2, $3, $3, $4, $4, $5, $5; next }
     NR == 2 { next }
     NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
     NR % 2 == 0 { printf "%s  %d  %d  %d  %d  %d  %d  %d\n",
                   $1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"
Given the input file (note the blank line after the headings, which the NR == 2 rule skips):
samples pops condition_1 condition_2 condition_3

A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
the script produces the output:
samples  pops  condition_1  condition_1  condition_2  condition_2  condition_3  condition_3
A10051  15  1  2  3  4  4  4
A10052  15  2  2  1  1  4  4
This code is more mechanical than interesting. If you have 10 columns in each line, you'd approach it differently; you'd probably use loops to save and print the data. If you want a blank line between the headings and the data, you can easily add one (NR == 2 { print; next } or use \n\n in place of \n in the first printf). You can arrange for the output fields to be separated by tabs if you wish (they're separated by double spaces in this code); a sketch of the tab variant follows the next paragraph.
The code does not depend on tabs separating the data fields; it only depends on there being no white space within a field.
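A sketch of that tab-separated variant, assuming the same five-column input (only the format strings change):
awk 'NR == 1 { printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n",
               $1, $2, $3, $3, $4, $4, $5, $5; next }
     NR == 2 { next }
     NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
     NR % 2 == 0 { printf "%s\t%d\t%d\t%d\t%d\t%d\t%d\t%d\n",
                   $1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"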
Many condition columns
When there are many condition columns, you need to use arrays and loops to capture and print the data, like this:
awk 'NR == 1 { printf "%s  %s", $1, $2
               for (i = 3; i <= NF; i++) printf "  %s  %s", $i, $i
               print ""
               next
             }
     NR == 2 { next }
     NR % 2 == 1 { for (i = 3; i <= NF; i++) c[i] = $i }
     NR % 2 == 0 { printf "%s  %d", $1, $2
                   for (i = 3; i <= NF; i++) printf "  %d  %d", c[i], $i
                   print ""
                 }' "$@"
When run on the same data as before, it produces the same output as before, but the loops would allow it to read 1000 conditions per input line and generate 2000 conditions per output line. The only possible issue is whether your version of Awk handles such long input lines in the first place. If need be, upgrade to GNU Awk.
A simple solution (no output headers) with GNU datamash (which is a nice tool for "command-line statistical operations" on textual files):
$ grep -v '^$' file | datamash -W -g1 --header-in first 2 collapse 3-5 | tr ',' ' ' | column -t
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
First, skip all blank lines with grep, then with datamash group lines according to the first field (-g1), using whitespace(s) as field separators (-W), collapsing multiple rows in a group for fields 3, 4 and 5. Collapsed values are comma separated, that's why we have to break them with tr.
For a different number of columns, just adapt the range for collapse operation (e.g. collapse 3-1000). And due to grouping operation, any number of samples per group is already supported.
awk to the rescue!
awk '{k=$1 FS $2}
NR==1 {p0=$0; pk=k}
pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
pk!=k {p0=$0; pk=$1 FS $2}' file
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
will work for an unspecified number of columns and records, as long as they are all well-formed (same number of columns) and grouped (same keys are in sequence).
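For reference, the same one-liner expanded with comments (a sketch; behavior unchanged):
awk '{ k = $1 FS $2 }                  # grouping key for the current line
     NR==1 { p0 = $0; pk = k }         # save the header; the pk==k rule below then
                                       # pairs it with itself, doubling the names
     pk==k { split(p0, a)              # same key as the saved line: interleave fields
             for (i = 3; i <= NF; i++) $i = a[i] FS $i
             print
     }
     pk!=k { p0 = $0; pk = $1 FS $2 }  # new key: save this line for its partner
    ' file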

Awk - Conditionally print an element from a certain row, based on the condition of a different element in a different row

Say I have a lot of files with a consistent number of columns and rows, and a sample one looks like this:
1 2 3
4 5 6
7 8 9
I want to print column 3 of row 2, but only if column 3 of row 3 == 4 (in this case it is 9). I'm using this logic as a means to determine if the file is valid for my use-case, and to extract the relevant field if it is.
My attempt, based on other answers to people asking how to isolate certain rows, was this: awk 'BEGIN{FNR=3} $3=="4"{FNR=2;print $2}'
so you are looking for something like this?
awk 'FNR==2{ x = $3 }FNR==3 && $3=="4"{ print x }' file.txt
cat file.txt
1 2 3
4 5 6
7 8 4
Output:
6
cat file.txt
1 2 3
4 5 6
7 8 9
Output:
Nothing, since column 3 of row 3 is 9.
awk 'FNR==3 && $3==4{print p} {p=$3}' *
Here's another one, which doesn't care about the order in which the records appear. In the OP the problem was to print a value (v) from the 2nd record based on the tested value (t) in the 3rd record. This solution allows the test value to appear in an earlier record than the value to be printed:
$ awk '
FNR==2 {        # the record holding the value to print
    v=$3
    f=1         # flag indicating the value v has been read
}
FNR==3 {        # the record holding the value to test
    t=$3
    g=1         # test value read indicator
}
f && g {        # once both the value and the test value are acquired
    if(t==4)    # test the test value
        print v # output
    exit        # and exit
}' file
6
Record order reversed (FNR values changed in the code):
$ cat file2
1 2 3
7 8 4 # records
4 5 6 # reversed
$ awk 'FNR==3{v=$3;f=1}FNR==2{t=$3;g=1}f&&g{if(t==4)print v;exit}' file2
6
Separate flags f and g are used, rather than testing v and t themselves, in case either value happens to be empty ("").

Finding NR of row with specific conditions (using next line)

Guys I have a file like this
NR column
1 1
2 1
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 0
11 0
12 0
13 1
14 1
What I need is to find the NR values that tell me where the 1s are,
so my ideal output should tell me that there are 1s from NR=1 - 2, NR=6 - 9, and NR=13 - 14,
or
1
2
6
9
13
14
Since I think it is easier not to include the first row and the last one in the output, I expect the output to be
2
6
9
13
I've been trying to find a way to use getline, but unsuccessfully.
I am sure there is an easy way to do this, help?
Thanks
Assuming your output above was incorrect (and it should really be the line number where the 0/1 or 1/0 transition happens - so the lines would be: "1, 3, 6, 10, 13"), then an awk one-liner is:
awk 'prev!=$0{print NR};{prev=$0}' file
which says:
for every line that doesn't match the prev line, print the line number, and
for every line, save the prev line
$ awk 'NR>1 && $0!=prev{print NR} {prev=$0}' file
3
6
10
13
or for your updated requirements (prev here is always 0 or 1, so NR-prev prints NR at a 0-to-1 transition, the start of a run, and NR-1 at a 1-to-0 transition, the end of the previous run):
$ awk '$1!=prev{print NR-prev} {prev=$1} END{if (prev) print NR}' file
1
2
6
9
13
14
awk to the rescue!
$ awk '!p&&$2==1{p=$1}
p&&!$2{print p"-"($1-1);p=0}
END{if(p) print p"-"$1}' file
1-2
6-9
13-14
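The same one-liner with comments (a sketch; behavior unchanged, assuming the two-column input shown in the question):
awk '!p && $2==1 { p = $1 }                   # a run of 1s starts: remember the row
     p && !$2    { print p "-" ($1-1); p=0 }  # the run ended on the previous row
     END { if (p) print p "-" $1 }            # a run still open at end of input
    ' file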
{
    if (NR > 1 && last != $0) {
        print NR;
    }
    last = $0;
}
Another way
awk '$2!=x{x=$2;print NR-!($2)}END{if(x)print NR}' file
1
2
6
9
13
14

Lining up columns using awk

I'm trying to use awk to pull every 9th column out of a dataset with 210 columns. How can I get the columns to line up evenly if the data in each column do not contain the same number of characters?
Use a for loop to skip over the column you don't need:
awk '{
    for(i=1;i<=8;i++) {
        printf "%s%s",$i,FS
    }
    for(i=10;i<=NF;i++) {
        printf "%s%s",$i,(i==NF?RS:FS)
    }
}' file
Note: Please set your field separator accordingly. You haven't stated what it is, so I am going with the default (that is, space).
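For tab-separated data, for instance, the separator can be set on the command line; since the printf calls reuse FS, the output separator follows along automatically:
awk -F'\t' '{
    for(i=1;i<=8;i++) {
        printf "%s%s",$i,FS
    }
    for(i=10;i<=NF;i++) {
        printf "%s%s",$i,(i==NF?RS:FS)
    }
}' file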
Sample Test (skipping over the 3rd column):
$ cat file
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
$ awk '{
    for(i=1;i<=2;i++) {
        printf "%s%s",$i,FS
    }
    for(i=4;i<=NF;i++) {
        printf "%s%s",$i,(i==NF?RS:FS)
    }
}' file
1 2 4 5 6
1 2 4 5 6
1 2 4 5 6
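As for getting the columns to line up evenly when the fields differ in width: pipe the awk output through column -t, which pads each column to the width of its widest entry, e.g.:
$ awk '{
    for(i=1;i<=2;i++) {
        printf "%s%s",$i,FS
    }
    for(i=4;i<=NF;i++) {
        printf "%s%s",$i,(i==NF?RS:FS)
    }
}' file | column -t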

How to Add Column with Percentage

I would like to calculate, for each line, the percentage its value represents of the total across all lines, and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but after I calculated the total of all lines I didn't know how to preserve the rest of each line unchanged. Thanks a lot for help!
Here you go, a two-pass awk solution (the same file is read twice) -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output, then just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakout of pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that keeps track of the number of records (by default, lines) read from the current file; it resets for each new file. So FNR in our case would run up to 4. NR is similar to FNR, but it never gets reset; it keeps growing across files, so NR in our case would reach 8.
This pattern will be true only for the first 4 records, and that's exactly what we want. While reading through those 4 records, we accumulate the total in the variable a. Notice that we did not initialize it; in awk we don't have to. However, this would break if the entire column 2 is 0. You can handle that by putting an if statement in the second action, i.e. do the division only if a > 0, else print a division-by-zero message or something (as sketched after this breakout).
next is needed because we don't really want the second pattern {action} statement to execute during the first pass. next tells awk to skip the remaining rules and move on to the next record.
Once the four records are parsed, the second pattern {action} takes over on the second pass. It is pretty straightforward: compute the percentage and print columns 1 and 2 with the percentage next to them.
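A minimal sketch of that guard (the fallback text is illustrative):
awk 'NR==FNR { a += $2; next }
     { if (a > 0)
           print $1, $2, ($2/a)*100
       else
           print $1, $2, "division by 0" }' file file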
Note: As @lhf mentioned in the comments, this one-liner will only work as long as you have the data set in a file. It won't work if you pass the data through a pipe.
In the comments there is a discussion about ways to make this awk one-liner take input from a pipe instead of a file. Well, the only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Now, arrays in awk are associative and their iteration order is undefined, i.e. pulling the values out of the array will not necessarily happen in the same order as they went in. If that is OK, then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
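If the input arrives on a pipe and the original order matters, a minimal sketch that buffers each line in a numerically indexed array and replays it in order from the END block:
cat file | awk '{ line[NR] = $0; val[NR] = $2; sum += $2 }
                END { for (i = 1; i <= NR; i++)
                          printf "%s\t%.2f\n", line[i], val[i]*100/sum }'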
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total="$total" '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100) }' file
If you want to print a literal percent sign with printf, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is a better way, but I would pass the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
    ## Tab as field separator.
    FS = "\t";
}

## First pass of input file. Get total from second field.
ARGIND == 1 {
    total += $2;
    next;
}

## Second pass of input file. Print each original line and percentage as third field.
{
    printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Running the script on my Linux box:
gawk -f script.awk infile infile
And result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
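Note that ARGIND is specific to GNU Awk. A portable sketch of the same two-pass script replaces it with the NR==FNR idiom from the first question above:
BEGIN {
    ## Tab as field separator.
    FS = "\t";
}

## First pass: accumulate the total (NR==FNR holds only while reading the first file operand).
NR == FNR {
    total += $2;
    next;
}

## Second pass: print each original line with its percentage.
{
    printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run it the same way: awk -f script.awk infile infile.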