Format a number with thousands separators in shell/awk

awk -F, '
{
    printf(" Code %s %s\n", $1, $2)
    r[NR] = $1
    c[NR] = $3
}
END {
    for (i = 1; i <= NR; i++)
        printf(" %s Record Count %s\n", r[i], c[i])
}' totalsum.txt
This is my input file:
17,123456,1
16,1234,2
0211,34567,2
21,2345,2
I am getting output like below:
Code 17 123456
Code 16 1234
Code 0211 34567
Code 21 2345
17 Record Count 1
16 Record Count 2
0211 Record Count 2
21 Record Count 2
I need to format the output like below, representing the values with thousands separators:
Code 17 123,456
Code 16 1,234
Code 0211 34,567
Code 21 2,345
17 Record Count 1
16 Record Count 2
0211 Record Count 2
21 Record Count 2
Could someone please help me?

You need to use %'d instead of %s as the format specifier if you want thousands separators. Since you're passing the awk script on the command line inside single quotes, getting the quote character in can be tricky; \047 is the octal escape for a single quote. With a hat tip to Ed Morton, here's one way to do it:
#!/bin/sh
awk -F, '
{
    printf(" Code %s %\047d\n", $1, $2)
    r[NR] = $1
    c[NR] = $3
}
END {
    for (i = 1; i <= NR; i++)
        printf(" %s Record Count %s\n", r[i], c[i])
}' totalsum.txt
Output:
$ ./test.sh
Code 17 123,456
Code 16 1,234
Code 0211 34,567
Code 21 2,345
17 Record Count 1
16 Record Count 2
0211 Record Count 2
21 Record Count 2
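Note that %'d is a non-POSIX printf extension: GNU awk honors it, and the grouping character comes from the locale (LC_NUMERIC), so in some environments it prints no separator at all. If that bites you, here is a minimal portable sketch (the commafy name is made up) that inserts the commas itself:
awk -F, '
function commafy(n,    s, out) {       # assumes a non-negative integer
    s = n ""
    while (length(s) > 3) {            # peel off 3 digits at a time, right to left
        out = "," substr(s, length(s) - 2) out
        s = substr(s, 1, length(s) - 3)
    }
    return s out
}
{ printf(" Code %s %s\n", $1, commafy($2)) }' totalsum.txt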

Related

Column manipulating using Bash & Awk

Let's assume we have an example1.txt file consisting of a few rows.
item item item
A B C
100 20 2
100 22 3
100 23 4
101 26 2
102 28 2
103 29 3
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
104 36 3
There are a few commands I would like to perform to filter the txt file and add a few more columns.
First, I want to apply a condition: keep rows where item C is equal to 2. Using awk I can do that in the following way:
awk '$3 == 2 { print $1 "\t" $2 "\t" $3} ' example1.txt > example2.txt
The returned text file would be:
item item item
A B C
100 20 2
101 26 2
102 28 2
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
Now I want to count two things:
I want to count the total number of unique values in column 1.
For example, in the above case example2.txt, it would be:
(100,101,102,103,104) = 5
And I would like to count how many times each column A value repeats and add that count as a new column.
I would like to have it like this:
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
In the item D column (4th) above, the first row is 1 because 100 does not repeat. In the 4th row it is 2 because 103 appears twice, so I put 2 in the 4th and 5th rows. Similarly, the last three rows of item D are 3, because the item A value 104 appears three times in those rows.
You may try this awk:
awk -v OFS='\t' 'NR <= 2 {
print $0, (NR == 1 ? "item" : "D")
}
FNR == NR && $3 == 2 {
++freq[$1]
next
}
$3 == 2 {
print $0, freq[$1]
}' file{,}
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Could you please try the following. In case you want to save the output into the same Input_file, append > temp && mv temp Input_file to the following code (a complete command is shown after the output below).
awk '
FNR==NR{
if($3==2){
a[$1,$3]++
}
next
}
FNR==1{
$(NF+1)="item"
print
next
}
FNR==2{
$(NF+1)="D"
print
next
}
$3!=2{
next
}
FNR>2{
$(NF+1)=a[$1,$3]
}
1
' Input_file Input_file | column -t
Output will be as follows.
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
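As mentioned above, to save the result back into the same Input_file (temp is just an arbitrary scratch name; note this saves the column -t formatted output):
awk '...same program as above...' Input_file Input_file | column -t > temp && mv temp Input_file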
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st time Input_file is being read.
if($3==2){ ##Checking condition if 3rd field is 2 then do following.
a[$1,$3]++ ##Creating an array a whose index is $1,$3 and incrementing its value by 1 here.
}
next ##next will skip further statements from here.
}
FNR==1{ ##Checking condition if this is first line.
$(NF+1)="item" ##Adding a new field with string item in it.
print ##Printing 1st line here.
next ##next will skip further statements from here.
}
FNR==2{ ##Checking condition if this is second line.
$(NF+1)="D" ##Adding a new field with string item in it.
print ##Printing 1st line here.
next ##next will skip further statements from here.
}
$3!=2{ ##Checking condition if 3rd field is NOT equal to 2 then do following.
next ##next will skip further statements from here.
}
FNR>2{ ##Checking condition if line is greater than 2 then do following.
$(NF+1)=a[$1,$3] ##Creating new field with value of array a with index of $1,$3 here.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names 2 times here.
Similar to the others, but using awk with a single pass, storing the records seen and the counts for D in arrays, with ord and Dcnt used to preserve order and map each record to its count, e.g.
awk '
FNR == 1 { h1=$0"\titem" } # header 1 with extra "\titem"
FNR == 2 { h2=$0"\tD" } # header 2 with extra "\tD"
FNR > 2 && $3 == 2 { # remaining rows with $3 == 2
D[$1]++ # for D column, count times each A value seen
seen[$1,$2] = $0 # save records seen
ord[++n] = $1 SUBSEP $2 # save order all records appear
Dcnt[n] = $1 # save order mapped to $1 for D
}
END {
printf "%s\n%s\n", h1, h2 # output headers
for (i=1; i<=n; i++) # loop outputting info with D column added
print seen[ord[i]]"\t"D[Dcnt[i]]
}
' example.txt
(Note: SUBSEP is a built-in variable holding the subscript separator that awk uses when a comma joins values into an array index, i.e. seen[$1,$2] is really seen[$1 SUBSEP $2]; storing $1 SUBSEP $2 lets us build the same index outside an array subscript. By default it is "\034".)
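A quick throwaway sketch of that equivalence:
awk 'BEGIN {
    a["A10051", 15] = "x"            # the comma joins the parts with SUBSEP
    for (k in a) {
        split(k, p, SUBSEP)          # recover the two components
        print p[1], p[2]             # prints: A10051 15
    }
    if (("A10051" SUBSEP 15) in a)   # the same index, built by hand
        print "found"
}'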
Example Output
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Always more than one way to skin-the-cat with awk.
Assuming the file is not a big file:
awk 'NR==FNR && $3 == 2{a[$1]++;next}$3==2{$4=a[$1];print;}' file.txt file.txt
The file is parsed twice. In the first pass, the counts for the 4th column are calculated and kept in an array. In the second pass, the count is set as the 4th column and the whole line is printed.
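If reading the file twice is a problem (say the data arrives through a pipe), a single-pass sketch can buffer the matching rows and print them in END; like the two-pass version above, this skips the header lines:
awk '$3 == 2 { cnt[$1]++; row[++n] = $0; key[n] = $1 }
END { for (i = 1; i <= n; i++) print row[i], cnt[key[i]] }' file.txt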

Sum column and count lines

I am trying to sum certain numbers grouped by the value in column 2; that part works with my code. But I also want to count how many times each column 2 value is repeated and print that in the last column.
file1
36 2605 1 2
36 2605 1 2
36 2603 1 2
36 2605 1 2
36 2605 1 2
36 2605 1 2
36 2606 1 2
Output Desired
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
I tried
awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' file1
Thanks in advance
Renamed the vars and added pretty print:
awk '
{
sum1[$2]+=$1
sum3[$2]+=$3
sum4[$2]+=$4
count[$2]++
len2=((l=length($2))>len2?l:len2)
len1=((l=length(sum1[$2]))>len1?l:len1)
len3=((l=length(sum3[$2]))>len3?l:len3)
len4=((l=length(sum4[$2]))>len4?l:len4)
len5=((l=length(count[$2]))>len5?l:len5)
}
END {
for(i in count) {
printf "%*d %*d %*d %*d %*d\n",
len2,i,len1,sum1[i],len3,sum3[i],len4,sum4[i],len5,count[i]
}
}' file
Output:
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Space chars are relatively inexpensive these days, you should really consider getting some for your code, especially if you want other people to read it to help you debug it! Here's the code you posted:
awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' file1
and here it is after having been run through a code beautifier (I used gawk -o):
{
a[$2] += $1
}
{
b[$2] += $3
}
{
c[$2] += $4
count[$2] += $2
}
END {
for (i in a) {
print i, a[i], b[i], c[i], count[i]
}
}
See how just by adding some white space it's now vastly easier to understand and so the bug in how count[$2] is being populated is glaringly obvious? Some meaningful variable names are always extremely useful too and I hear alphanumeric chars are on special right now!
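For reference, that beautifier is GNU awk's pretty-print option. An invocation like the following writes the formatted program to a file (recent gawk versions only pretty-print without executing; older ones also run the program, hence the /dev/null input):
gawk --pretty-print=pretty.awk '{a[$2]+=$1}{b[$2]+=$3}{c[$2]+=$4;count[$2]+=$2}END{for(i in a)print i,a[i],b[i],c[i],count[i]}' /dev/null
cat pretty.awk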
FWIW here's how I'd do this:
$ cat tst.awk
BEGIN { keyFldNr = 2 }
{
numOutFlds = 0
for (i=1; i<=NF; i++) {
if (i != keyFldNr) {
sum[$keyFldNr,++numOutFlds] += $i
}
}
cnt[$keyFldNr]++
}
END {
for (key in cnt) {
printf "%s%s", key, OFS
for (i=1; i<=numOutFlds; i++) {
printf "%s%s", sum[key,i], OFS
}
print cnt[key]
}
}
$ awk -f tst.awk file
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
$ awk -f tst.awk file | column -t
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Notice that it'll work as-is no matter how many fields you have on each line. If you need to use a different field as the key you count and sum on, just change the value of keyFldNr in the BEGIN section from 2 to whatever field you want.
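For example, changing the BEGIN section to keyFldNr = 1 groups on the first field instead; with this sample data everything collapses into a single group, since every row has 36 in field 1:
$ awk -f tst.awk file
36 18234 7 14 7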
A non-awk approach, using the very useful GNU datamash, which is designed for tasks like this one:
$ datamash -Ws groupby 2 sum 1,3,4 count 2 < input.txt
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
Read as: For each group of rows with the same value in column 2, display that value, the sums of columns 1, 3 and 4, and the number of rows in the group.
You've almost nailed it; you're just not incrementing count[$2] properly: count[$2]+=$2 adds the value of $2 each time instead of 1, so use count[$2]++.
$ awk '{a[$2]+=$1;b[$2]+=$3;c[$2]+=$4;count[$2]++}
END{for(i in a) print i,a[i],b[i],c[i],count[i]}' file
2603 36 1 2 1
2605 180 5 10 5
2606 36 1 2 1
No external program needed; about 21 ms, tried on pure GNU awk (asort makes this gawk-specific):
awk '{
    if ($0 ~ /^[A-Za-z0-9]/) a[NR] = $2 " " $1 " " $3 " " $4
}
END {
    asort(a)
    $0 = ""
    for (; i++ < NR;) {
        split(a[i], b)
        if ($1 == "" || b[1] == $1) { $2 += b[2]; $3 += b[3]; $4 += b[4]; $5++ }
        else { print; $2 = b[2]; $3 = b[3]; $4 = b[4]; $5 = 1 }
        $1 = b[1]
    }
    print
}' file1

Making every other row into a new column

So, I have an output that looks like this:
samples pops condition 1 condition 2 condition 3
A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
However, for the next analysis I need the input to look like this
samples pops condition 1 condition 1 condition 2 condition 2 condition 3 condition 3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
So it is not just making every other row into a new column: every other row in a given column should become a new column assigned to that same condition, so that each sample has two columns per condition rather than two rows per sample. For the example I used 2 samples and 3 conditions, but in reality I have over 100 samples and over 1000 conditions...
any thoughts? I am confident it can be done with awk, but I just can not figure it out.
3 condition columns
Taking the assertion 'the data is perfect' at face value, and disregarding years of experience indicating that data is seldom if ever perfect:
awk 'NR == 1 { printf "%s %s %s %s %s %s %s %s\n",
$1, $2, $3, $3, $4, $4, $5, $5; next }
NR == 2 { next }
NR % 2 == 1 { c[1] = $3; c[2] = $4; c[3] = $5 }
NR % 2 == 0 { printf "%s %d %d %d %d %d %d %d\n",
$1, $2, c[1], $3, c[2], $4, c[3], $5 }' "$@"
Given the input file (note the blank line after the headings, which the NR == 2 rule skips):
samples pops condition_1 condition_2 condition_3

A10051 15 1 3 4
A10051 15 2 4 4
A10052 15 2 1 4
A10052 15 2 1 4
the script produces the output:
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
This code is more mechanical than interesting. If you have 10 columns in each line, you'd approach it differently. You'd probably use loops to save and print the data. If you want a blank line between the headings and the data, you can easily add one (NR == 2 { print; next } or use \n\n in place of \n in the first printf function). You can arrange for the output fields to be separated by tabs if you wish (they're separated by double spaces in this code).
The code does not depend on tabs separating the data fields; it only depends on there being no white space within a field.
Many condition columns
When there are many condition columns, you need to use arrays and loops to capture and print the data, like this:
awk 'NR == 1 { printf "%s %s", $1, $2
for (i = 3; i <= NF; i++) printf " %s %s", $i, $i
print ""
next
}
NR == 2 { next }
NR % 2 == 1 { for (i = 3; i <= NF; i++) c[i] = $i }
NR % 2 == 0 { printf "%s %d", $1, $2;
for (i = 3; i <= NF; i++) printf " %d %d", c[i], $i
print ""
}' "$#"
When run on the same data as before, it produces the same output as before, but the loops would allow it to read 1000 conditions per input line and generate 2000 conditions per output line. The only possible issue is whether your version of Awk handles such long input lines in the first place. If need be, upgrade to GNU Awk.
A simple solution (no output headers) with GNU datamash (which is a nice tool for "command-line statistical operations" on textual files):
$ grep -v ^$ file | datamash -W -g1 --header-in first 2 collapse 3-5 | tr ',' ' ' | column -t
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
First, skip all blank lines with grep, then with datamash group lines according to the first field (-g1), using whitespace as the field separator (-W), collapsing multiple rows in a group for fields 3, 4 and 5. Collapsed values are comma-separated; that's why we have to break them apart with tr.
For a different number of columns, just adapt the range of the collapse operation (e.g. collapse 3-1000). And thanks to the grouping operation, any number of samples per group is already supported.
awk to the rescue!
awk '{k=$1 FS $2}
NR==1 {p0=$0; pk=k}
pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
pk!=k {p0=$0; pk=$1 FS $2}' file
samples pops condition_1 condition_1 condition_2 condition_2 condition_3 condition_3
A10051 15 1 2 3 4 4 4
A10052 15 2 2 1 1 4 4
will work for unspecified number of columns and records, as long as they are all well-formed (same number of columns) and grouped (same keys are in sequence).
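If the two rows for a sample might ever not be adjacent, a hedged pre-step is to sort the body while keeping the header line first (this sketch assumes the first line is the header, and strips blank lines):
{ head -n 1 file; tail -n +2 file | grep -v '^$' | sort -k1,1; } |
awk '{k=$1 FS $2}
NR==1 {p0=$0; pk=k}
pk==k {split(p0,a); for(i=3;i<=NF;i++) $i=a[i] FS $i; print}
pk!=k {p0=$0; pk=$1 FS $2}'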

To sum adjacent lines from the same column in AWK

I have a file:
A 1 20
B 2 21
C 3 22
D 4 23
I have to find the sum of the values from lines 1 to 4 (all lines), then the sum of lines 2 to 4, and finally the sum of lines 3 to 4; the last value has to be simply 0. In other words, I want to get an output file with two columns where each value is the sum of that line and all following lines, something like this:
10 86
9 66
7 45
0 0
The last row has to have two zeros as values. How can I do this in AWK?
This might be what you want (the first tac reverses the file so the suffix sums become simple running sums, and the second tac restores the original order):
$ tac file | awk 'NR==1{ print 0, 0; a=$2; b=$3; next} { print a+=$2, b+=$3 }' | tac
10 86
9 66
7 45
0 0
Avoid two tacs by accumulating the sums in two arrays:
$ awk '{
for (i = 1; i <= NR; ++i) { sum2[i] += $2; sum3[i] += $3 }
}
END {
sum2[NR] = sum3[NR] = 0
for (i = 1; i <= NR; ++i) print sum2[i], sum3[i]
}' file
10 86
9 66
7 45
0 0
The value of each row is added into all the previous rows. Once all rows have been processed, the last values are zeroed out and everything is printed.
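Note that the inner loop makes the work quadratic in the number of lines. For large files, a sketch that stores both columns and builds the suffix sums right to left in END stays linear:
awk '{ v2[NR] = $2; v3[NR] = $3 }
END {
    s2 = v2[NR]; s3 = v3[NR]         # running suffix sums, seeded with the last row
    o2[NR] = o3[NR] = 0              # the last row is forced to 0 0
    for (i = NR - 1; i >= 1; --i) {
        s2 += v2[i]; s3 += v3[i]     # sum of rows i..NR
        o2[i] = s2; o3[i] = s3
    }
    for (i = 1; i <= NR; ++i) print o2[i], o3[i]
}' file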

How to Add Column with Percentage

I would like to calculate percentage of value in each line out of all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I tried to do it myself, but once I had calculated the total of all lines I didn't know how to preserve the rest of each line unchanged. Thanks a lot for the help!
Here you go, a two-pass awk solution (the same file is read twice):
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output then just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakdown of the pattern { action } statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that keeps track of the number of records (by default, lines) read from the current file, so it resets for each new file; in our case it peaks at 4. NR is similar to FNR but never gets reset; it keeps growing, so in our case it ends at 8. The pattern NR==FNR is therefore true only while the first copy of the file is being read.
This pattern will be true only for the first 4 records, and that's exactly what we want. While perusing those 4 records, we accumulate the total in a variable a. Notice that we did not initialize it; in awk we don't have to. However, the division would break if the entire column 2 sums to 0, so you can handle that by putting an if statement in the second action, i.e. do the division only if a > 0 and otherwise report division by zero, as sketched just below.
next is needed because we don't want the second pattern { action } statement to execute during the first pass. next tells awk to stop further actions and move to the next record.
Once the four records are parsed, the second pattern { action } takes over, which is pretty straightforward: compute the percentage and print columns 1 and 2 with the percentage next to them.
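A minimal sketch of that guard (printing NA on a zero total is an arbitrary choice):
awk 'NR==FNR { a += $2; next }
{ printf "%s\t%s\t%s\n", $1, $2, (a > 0 ? sprintf("%.2f", ($2/a)*100) : "NA") }' file file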
Note: As #lhf mentioned in the comment, this one-liner will only work as long as you have the data set in a file. It won't work if you pass data through a pipe.
In the comments, there is a discussion going on about ways to make this awk one-liner take input from a pipe instead of a file. The only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Now, arrays in awk are associative and their iteration order is undefined, i.e. pulling the values out of an array with for (i in b) will not necessarily match the order they went in. If that is OK, then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
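Alternatively, you can preserve the input order without sort by indexing the arrays on NR instead of $1; this sketch also works when the data comes through a pipe:
[jaypal:~/Temp] cat file | awk '{line[NR]=$0; val[NR]=$2; sum+=$2} END{for (i=1; i<=NR; i++) print line[i], (val[i]/sum)*100}'
1 10 12.5
2 10 12.5
3 20 25
4 40 50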
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
If you want a literal percent sign in the printf output, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is a better way, but I would pass over the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
## Tab as field separator.
FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
total += $2;
next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run the script (note that ARGIND is GNU-awk-specific, hence gawk):
gawk -f script.awk infile infile
And the result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00