Collapsing a column value into lines, copying values of a second column - awk

I have a file with two tab-separated columns.
The first column holds the number of lines I want to produce, and the second column holds the value that should be written on each of those lines (in a new file).
File1:
col1 col2
365 1
6 1
142 1
99 0
223 0
11 1
So basically in the new file I want 365 lines with the number 1, followed by 6 lines of 1, 142 lines of 1, 99 lines of 0, 223 lines of 0 and 11 lines of 1...and so forth...
In total the new file should have 846 lines (the sum of the first column of File1).
Ideally an awk command should do the trick I guess. Any inputs on this would be really appreciated...
Thanks

I would use GNU AWK in the following way. To avoid very long output, here is a contrived example. Let file.txt be
col1 col2
5 1
3 0
5 1
then
awk 'NR>1{for(i=0;i<$1;i+=1)print $2}' file.txt
output
1
1
1
1
1
0
0
0
1
1
1
1
1
Explanation: I used a for statement to print the content of the 2nd column ($2) as many times as specified in the 1st column ($1), for every line after the 1st (NR>1).
(tested in gawk 4.2.1)
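A quick way to check the result is to compare the number of lines produced with the sum of the first column. This is only a sketch: the temporary file and the variable names expected and actual are made up for illustration, and the sample data is assumed to be tab-separated with a header row, as described in the question.

```shell
# Hypothetical sample file: tab-separated, with a header row.
tmp=$(mktemp)
printf 'col1\tcol2\n5\t1\n3\t0\n5\t1\n' > "$tmp"

# Expected number of output lines: the sum of column 1, header skipped.
expected=$(awk -F'\t' 'NR>1 {n += $1} END {print n+0}' "$tmp")

# Expand each row, then count the lines that were actually produced.
# tr strips the padding that some wc implementations add.
actual=$(awk -F'\t' 'NR>1 {for (i = 0; i < $1; i++) print $2}' "$tmp" | wc -l | tr -d ' ')

echo "expected=$expected actual=$actual"
rm -f "$tmp"
```

For the real File1 both numbers should come out as 846.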

Related

How to loop awk command over row values

I would like to use awk to search for a particular word in the first column of a table and print the value in the 6th column. I understand how to do this searching one word at time using something along the lines of:
awk '$1 == "<insert-word>" { print $6 }' file.txt
But I was wondering if it is possible to loop this over a list of words in a row?
For example If I had a table like file1.txt below:
cat file1.txt
dna1 dna4 dna5
dna3 dna6 dna2
dna7 dna8 dna9
Could I loop over each value in row 1 and search for this word in column 1 of file2.txt below, each time printing the value of column 6? Then do this for row 2, 3 and so on...
cat file2
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
So looping the awk command over row 1 of file1 would return the numbers 4, 0 and 3.
Looping the command over row 2 would return the numbers 6, 8 and 1.
And finally, looping over row 3 would return the numbers 9, 2 and 3.
An example output might be
4 0 3
6 8 1
9 2 3
What I would really like to do is sum the values returned for each row. I just wasn't sure if this would be possible...
An example output of this would be
7
15
14
But I am not worried if this step isn't possible using awk as I could just do it separately
Hope this makes sense
Cheers
Ollie
Yes, you can give awk multiple input files. For your example, with three columns in file1:
awk 'NR==FNR{a[$1]=a[$2]=a[$3]=1;next}a[$1]{print $6}' file1 file2
I didn't test the above one-liner, but it should work. At least you get the idea.
If you don't know how many columns file1 has, you can loop over the fields, as you suggested:
awk 'NR==FNR{for(x=1;x<=NF;x++)a[$x]=1;next}a[$1]{print $6}' file1 file2
Update for the new requirement (one sum per row of file1):
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)s+=a[$i];print s;s=0}' f2 f1
The output of the above one-liner (taking f1 and f2 as your example file1 and file2):
7
15
14
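If you also want to see the individual values next to each sum, the same lookup can build the row as it goes. This is a sketch using the sample data from the question; the temporary files and the names val, line and sum are made up for illustration, and a word that is missing from file2 would simply contribute an empty value.

```shell
f1=$(mktemp); f2=$(mktemp)
printf 'dna1 dna4 dna5\ndna3 dna6 dna2\ndna7 dna8 dna9\n' > "$f1"
cat > "$f2" <<'EOF'
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
EOF

result=$(awk '
  NR == FNR { val[$1] = $6; next }   # first file (f2): remember column 6 per word
  {
    sum = 0; line = ""
    for (i = 1; i <= NF; i++) {      # second file (f1): look up every word in the row
      line = line val[$i] " "
      sum += val[$i]
    }
    print line "= " sum
  }' "$f2" "$f1")
echo "$result"
rm -f "$f1" "$f2"
```

With the sample files this prints 4 0 3 = 7, 6 8 1 = 15 and 9 2 3 = 14, one per line.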

Match two files with duplicate ids in awk or sed

I have two files. File 1 has 3000 rows (1500 IDs) and File 2 has 1400 rows (700 IDs). File 1 contains all the IDs present in File 2. I have to match the ID column of File 1 and File 2 while maintaining the order of the IDs. If an ID from File 2 is present in File 1, then compare column 2 and print Match or mismatch. The catch is that there are duplicate IDs and I need to keep them all. I am looking for an awk or sed solution. Thanks!
File1
ID A
1 13
1 14
2 13
2 13
3 13
3 12
4 13
4 14
5 14
5 14
File 2
ID A
2 13
2 13
3 13
3 3
5 14
5 15
Desired output
ID A
2 13 Match
2 13 Match
3 13 Match
3 3 mismatch
5 14 Match
5 15 mismatch
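You may use awk to achieve that:
awk '
  # File1: remember the first value seen for each id
  NR==FNR { if (a[$1]=="") a[$1]=$2; next }
  # File2: skip the header (it has no digits), then compare with the stored value
  /[0-9]/ {
    if (a[$1]==$2) {
      print $0, "match"
    } else {
      print $0, "mismatch"
    }
  }
' File1 File2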
Output:
2 13 match
2 13 match
3 13 match
3 3 mismatch
5 14 match
5 15 mismatch
Brief explanation:
NR==FNR{...}: while reading File1, save the id/value pair in array a if the id has not been seen before.
if(a[$1]==$2): while reading File2, report the record as a match if its value equals the value stored for its id, and as a mismatch otherwise.
The easiest method would be to traverse the rows in File 2 and, for each row, find the matching ID in File 1. As you do not specify a programming language, here is the solution in pseudocode:
for all rows in file2
    for all rows in file1
        if current_row_file1.id = current_row_file2.id
        then
            if current_row_file1.value_column2 = current_row_file2.value_column2
            then
                print current_row_file2.id + current_row_file2.value_column2 + "Match"
            else
                print current_row_file2.id + current_row_file2.value_column2 + "Mismatch"
            stop scanning file1 for this row, so that duplicate ids in file1 do not print it twice
The code above takes some time, as you loop through the records in file 1 for every row in file 2. If the IDs in file 1 are sorted, you can use an algorithm like binary search to speed up the processing. See https://en.wikipedia.org/wiki/Binary_search_algorithm for an explanation.
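In awk, the inner loop can be replaced by an associative-array lookup, so each file is read only once. The sketch below makes an assumption that the question does not spell out: a row of File2 counts as a Match if the same (id, value) pair occurs anywhere in File1. The temporary files and the array name seen are made up for illustration.

```shell
file1=$(mktemp); file2=$(mktemp)
printf 'ID A\n1 13\n1 14\n2 13\n2 13\n3 13\n3 12\n4 13\n4 14\n5 14\n5 14\n' > "$file1"
printf 'ID A\n2 13\n2 13\n3 13\n3 3\n5 14\n5 15\n' > "$file2"

result=$(awk '
  NR == FNR { seen[$1, $2] = 1; next }   # File1: record every (id, value) pair
  FNR == 1  { print; next }              # File2: pass the header through unchanged
  { print $0, (($1, $2) in seen ? "Match" : "mismatch") }
' "$file1" "$file2")
echo "$result"
rm -f "$file1" "$file2"
```

On the sample data this reproduces the desired output, header included. If rows should instead be compared position by position within each id, the key would need an occurrence counter as well.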

Find the ratio among columns

I have some input files of the following format:
File1.txt    File2.txt    File3.txt
1 2          1 6          1 20
2 3          2 9          2 21
3 7          3 14         3 28
Now I need to output a single new file with three columns using AWK. The first column stays the same; it is identical across the three files (just an ordinal number).
For the 2nd and 3rd columns of the new file, I need the values of the 2nd column of the second file divided by the values of the 2nd column of the first file, and the values of the 2nd column of the third file divided by the values of the 2nd column of the first file. In other words, the 2nd columns of the 2nd and 3rd files, each divided by the 2nd column of the first file.
e.g.:
Result.txt
1 3 10
2 3 7
3 2 4
Use a multidimensional array to store the values (ARGIND, the index of the file currently being read, is specific to GNU awk):
awk 'FNR==NR {a[$1]=$2; next}
     {b[$1,ARGIND]=$2/a[$1]}
     END {for (i in a)
            print i,b[i,2],b[i,3]
     }' f1 f2 f3
Test
$ awk 'FNR==NR {a[$1]=$2; next} {b[$1,ARGIND]=$2/a[$1]} END {for (i in a) print i,b[i,2],b[i,3]}' f1 f2 f3
1 3 10
2 3 7
3 2 4
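If GNU awk is not available, the file index can be tracked by hand, and the row order can be recorded so the output does not depend on the order of for (i in a). This is a sketch under the assumption that no value in the first file is 0 (awk aborts on division by zero); the temporary files and the names nfile, base, order and ratio are made up for illustration.

```shell
f1=$(mktemp); f2=$(mktemp); f3=$(mktemp)
printf '1 2\n2 3\n3 7\n'   > "$f1"
printf '1 6\n2 9\n3 14\n'  > "$f2"
printf '1 20\n2 21\n3 28\n' > "$f3"

result=$(awk '
  FNR == 1 { nfile++ }                                 # count files by hand instead of using ARGIND
  nfile == 1 { base[$1] = $2; order[++n] = $1; next }  # first file: divisor and row order
  { ratio[$1, nfile] = $2 / base[$1] }                 # later files: store the ratio
  END {
    for (k = 1; k <= n; k++)
      print order[k], ratio[order[k], 2], ratio[order[k], 3]
  }' "$f1" "$f2" "$f3")
echo "$result"
rm -f "$f1" "$f2" "$f3"
```

Counting files with FNR == 1 assumes that none of the input files is empty.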

how to add a column with specific string depending on 4th column with awk

I have a file in which the 4th column has numbers.
If the 4th column is greater than 2, I want to add a 5th column containing the string gain; otherwise, the 5th column should contain the string loss.
Input
1 762097 6706109 6
1 7202143 7792617 3
1 8922949 9815420 1
1 10502346 11074110 3
1 11188922 12267136 1
1 12566829 13910626 3
Desired output:
1 762097 6706109 6 gain
1 7202143 7792617 3 gain
1 8922949 9815420 1 loss
1 10502346 11074110 3 gain
1 11188922 12267136 1 loss
1 12566829 13910626 3 gain
How should I do this with awk?
Use awk like this:
$ awk '{print $0, ($4>2?"gain":"loss")}' file
1 762097 6706109 6 gain
1 7202143 7792617 3 gain
1 8922949 9815420 1 loss
1 10502346 11074110 3 gain
1 11188922 12267136 1 loss
1 12566829 13910626 3 gain
As you can see, it prints the full line ($0) followed by a string. This string is determined by the value of $4 using the ternary operator.
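If the threshold may change, it can be passed in from the shell with -v rather than hard-coded. A small sketch follows; the variable names cutoff and label are made up for illustration, and only two of the input rows are used.

```shell
result=$(printf '1 762097 6706109 6\n1 8922949 9815420 1\n' |
  awk -v cutoff=2 '{
    # Both sides are numeric here, so this is a numeric comparison.
    label = ($4 > cutoff) ? "gain" : "loss"
    print $0, label
  }')
echo "$result"
```

This prints the first row with gain appended and the second with loss.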

How to Add Column with Percentage

I would like to calculate each line's value as a percentage of the total over all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but once I had calculated the total for all lines, I didn't know how to keep the rest of each line unchanged. Thanks a lot for your help!
Here you go, a two-pass awk solution (the same file is given twice on the command line) -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output, just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakdown of the pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that tracks the number of records (by default separated by a newline) read from the current file, so at the end of each pass over our file it is 4. NR is similar, but it is not reset when a new file starts; it keeps growing, so at the end of the second pass it is 8.
This pattern is true only for the first 4 records, which is exactly what we want. While going through those 4 records, we accumulate the total in the variable a. Notice that we did not initialize it; in awk we don't have to. However, this would break if the whole of column 2 were 0. You can handle that by putting an if statement in the second action, i.e. do the division only if a > 0, and otherwise report a division by zero or similar.
next is needed because we don't want the second pattern {action} statement to execute during the first pass. next tells awk to skip the remaining rules and move on to the next record.
Once the four records have been read, the second pattern {action} statement takes over, which is pretty straightforward: it computes the percentage and prints columns 1 and 2 with the percentage next to them.
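The division-by-zero guard mentioned above could look like the following. This is only a sketch: the temporary file, the variable name total and the n/a placeholder are choices made for illustration, and printf with %.2f is used so that the output matches the two decimals asked for in the question.

```shell
tmp=$(mktemp)
printf '1\t10\n2\t10\n3\t20\n4\t40\n' > "$tmp"

result=$(awk -F'\t' -v OFS='\t' '
  NR == FNR { total += $2; next }   # first pass: accumulate the total
  total > 0 {                       # second pass: divide only when it is safe
    printf "%s\t%s\t%.2f\n", $1, $2, $2 / total * 100
    next
  }
  { print $1, $2, "n/a" }           # the total is 0: print a placeholder instead
' "$tmp" "$tmp")
echo "$result"
rm -f "$tmp"
```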
Note: As @lhf mentioned in the comments, this one-liner only works when the data is in a file. It won't work if you pass the data through a pipe.
In the comments, there is a discussion about ways to make this awk one-liner take input from a pipe instead of a file. The only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Now, arrays in awk are associative, and for (i in b) visits the keys in an unspecified order, i.e. the values may not come out in the order they went in. If that is OK, the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
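Another option for piped input is to index the stored rows by NR and walk them with a counting loop in the END block, which keeps the input order without sort and also copes with repeated values in column 1. A sketch, with the array names id and val made up for illustration and tab-separated input assumed:

```shell
result=$(printf '1\t10\n2\t10\n3\t20\n4\t40\n' |
  awk -F'\t' '
    { id[NR] = $1; val[NR] = $2; sum += $2 }   # remember each row under its record number
    END {
      for (k = 1; k <= NR; k++)                # replay the rows in input order
        printf "%s\t%s\t%.2f\n", id[k], val[k], (sum ? val[k] / sum * 100 : 0)
    }')
echo "$result"
```

The whole input is held in memory until the end, which is the price for reading from a pipe only once.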
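You can do it in a couple of passes:
#!/bin/bash
# First pass: compute the total of column 2.
total=$(awk '{total=total+$2}END{print total}' file)
# Second pass: print each line with its percentage of the total.
awk -v total="$total" '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file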
To print a literal percent sign with printf, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is a better way, but I would pass over the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
    ## Tab as field separator.
    FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
    total += $2;
    next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
    printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run the script on my Linux box:
gawk -f script.awk infile infile
And result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00