Column manipulation using Bash & Awk

Let's assume I have an example1.txt file consisting of a few rows.
item item item
A B C
100 20 2
100 22 3
100 23 4
101 26 2
102 28 2
103 29 3
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
104 36 3
There are a few commands I would like to run to filter the text file and add a few more columns.
First, I want to keep only the rows where item C is equal to 2, plus the two header rows. Using awk, I can do that in the following way:
awk 'NR <= 2 || $3 == 2 { print $1 "\t" $2 "\t" $3 }' example1.txt > example2.txt
The resulting text file would be:
item item item
A B C
100 20 2
101 26 2
102 28 2
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
Now I want to do two things:
First, I want to count the number of unique values in column 1.
For example, for example2.txt above, it would be:
(100,101,102,103,104) = 5
Second, I would like to count how many times each column A value repeats and add that count as a new column.
I would like the result to look like this:
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
In the Item D column (4th) above, the 1st row is 1 because 100 is not repeated. In the 4th row it is 2 because 103 appears twice, so I have added 2 in the 4th and 5th rows. Similarly, the last three rows of Item D are 3, because the item A value 104 appears three times in those rows.

You may try this awk (file{,} at the end is bash brace expansion for file file, so the same file is read twice):
awk -v OFS='\t' 'NR <= 2 {
print $0, (NR == 1 ? "item" : "D")
}
FNR == NR && $3 == 2 {
++freq[$1]
next
}
$3 == 2 {
print $0, freq[$1]
}' file{,}
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3

Could you please try the following. If you want to save the output back into the same Input_file, append > temp && mv temp Input_file to the following code.
awk '
FNR==NR{
if($3==2){
a[$1,$3]++
}
next
}
FNR==1{
$(NF+1)="item"
print
next
}
FNR==2{
$(NF+1)="D"
print
next
}
$3!=2{
next
}
FNR>2{
$(NF+1)=a[$1,$3]
}
1
' Input_file Input_file | column -t
Output will be as follows.
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st time Input_file is being read.
if($3==2){ ##Checking condition if 3rd field is 2 then do following.
a[$1,$3]++ ##Creating an array a whose index is $1,$3 and incrementing its value by 1 here.
}
next ##next will skip further statements from here.
}
FNR==1{ ##Checking condition if this is first line.
$(NF+1)="item" ##Adding a new field with string item in it.
print ##Printing 1st line here.
next ##next will skip further statements from here.
}
FNR==2{ ##Checking condition if this is second line.
$(NF+1)="D" ##Adding a new field with string D in it.
print ##Printing 2nd line here.
next ##next will skip further statements from here.
}
$3!=2{ ##Checking condition if 3rd field is NOT equal to 2 then do following.
next ##next will skip further statements from here.
}
FNR>2{ ##Checking condition if line is greater than 2 then do following.
$(NF+1)=a[$1,$3] ##Creating new field with value of array a with index of $1,$3 here.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names 2 times here.

Similar to the others, but using awk in a single pass and storing the records seen and the count for D in arrays, with the arrays ord and Dcnt used to map each output row to its record and its count, e.g.
awk '
FNR == 1 { h1=$0"\titem" } # header 1 with extra "\titem"
FNR == 2 { h2=$0"\tD" } # header 2 with extra "\tD"
FNR > 2 && $3 == 2 { # remaining rows with $3 == 2
D[$1]++ # for D column, times A seen
seen[$1,$2] = $0 # save records seen
ord[++n] = $1 SUBSEP $2 # save order all records appear
Dcnt[n] = $1 # save order mapped to $1 for D
}
END {
printf "%s\n%s\n", h1, h2 # output headers
for (i=1; i<=n; i++) # loop outputting info with D column added
print seen[ord[i]]"\t"D[Dcnt[i]]
}
' example.txt
(note: SUBSEP is a built-in variable holding the subscript separator that awk inserts when a comma is used to join values in an array index, e.g. seen[$1,$2]. Building the same key by hand as $1 SUBSEP $2 lets it be stored and looked up outside of the bracket syntax. It is by default "\034")
Example Output
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Always more than one way to skin-the-cat with awk.
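To illustrate the SUBSEP note: a minimal sketch, assuming whitespace-separated input, that stores values under a composite key and then splits each key back into its parts. The function name split_keys is made up for the example.

```shell
# Hypothetical helper: store $3 under the key ($1, $2), then recover the parts.
split_keys() {
    awk '
    { seen[$1, $2] = $3 }                 # key is $1 SUBSEP $2
    END {
        for (k in seen) {
            split(k, part, SUBSEP)        # undo the comma-style join
            print part[1], part[2], seen[k]
        }
    }' "$@" | sort                        # for-in order is unspecified, so sort
}

printf '103 32 2\n103 30 2\n' | split_keys
```

The sort at the end is only there because awk does not promise any particular order for a for-in loop.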

Assuming the file is not too big:
awk 'NR==FNR && $3 == 2{a[$1]++;next}$3==2{$4=a[$1];print;}' file.txt file.txt
This parses the file twice. The first pass counts, per column 1 value, the rows where column 3 is 2 and keeps the counts in an array. The second pass sets that count as the 4th column and prints the whole line. The two header rows are not printed.
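None of the commands above prints the first thing the question asks for, the number of unique column 1 values among the rows where column 3 is 2. A minimal sketch, assuming two header rows as in example1.txt; count_unique is a made-up name.

```shell
# Hypothetical helper: count distinct $1 values among data rows with $3 == 2.
count_unique() {
    awk '
    NR > 2 && $3 == 2 && !seen[$1]++ { n++ }   # true only the first time a $1 shows up
    END { print n + 0 }                        # n + 0 prints 0 when nothing matched
    ' "$@"
}

printf '%s\n' 'item item item' 'A B C' '100 20 2' '100 22 3' \
    '101 26 2' '103 30 2' '103 32 2' | count_unique
```

Called as count_unique example1.txt, this should give 5 for the sample data in the question.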

Related

AWK command of add column to count of grouped column

I have a data set tab separated like this: (file.txt)
A B
1 111
1 111
1 112
1 113
1 113
1 113
1 113
2 113
2 113
2 113
I want to add a new C column to show count of grouped A and B
Desired output:
A B C
1 111 2
1 111 2
1 112 1
1 113 4
1 113 4
1 113 4
1 113 4
2 113 3
2 113 3
2 113 3
I have tried this:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{
if (FNR>1) a[$2]+=$3
next
}
{ $(NF+1)=(FNR==1 ? "C" : a[$2]) }
1
' file.txt file.txt > file2.txt
Could you please try the following, written for the shown samples.
awk '
FNR==NR{
count[$1,$2]++
next
}
FNR==1{
print $0,"C"
next
}
{
print $0,count[$1,$2]
}
' Input_file Input_file
Add BEGIN{FS=OFS="\t"} in above code in case your data is tab delimited.
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when first time Input_file being read.
count[$1,$2]++ ##Creating count with index of 1st and 2nd field and increasing its count.
next ##next will skip further statements from here.
}
FNR==1{ ##Checking condition if this is 1st line then do following.
print $0,"C" ##Printing current line with C heading here.
next ##next will skip further statements from here.
}
{
print $0,count[$1,$2] ##Printing current line along with count with index of 1st and 2nd field.
}
' Input_file Input_file ##Mentioning Input_file(s) here.
Problem in OP's attempt: OP was summing $3 into the array (the overall logic looked OK), but there is no 3rd field in Input_file, so every sum stayed at 0. Also, OP was indexing by the 2nd field only, but as per OP's comments the index should be the 1st and 2nd fields together.
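Applying those two changes to OP's attempt gives roughly the following. This is a sketch, assuming tab-separated input with one header row; add_group_count is a made-up wrapper so the file name only has to be given once.

```shell
# Hypothetical wrapper around OP's two-pass attempt, counting rows per (A, B) pair.
add_group_count() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR {                          # first pass: count each (A, B) pair
        if (FNR > 1) a[$1, $2]++
        next
    }
    { $(NF + 1) = (FNR == 1 ? "C" : a[$1, $2]) }   # second pass: append header or count
    1' "$1" "$1"
}

tmp=$(mktemp)
printf 'A\tB\n1\t111\n1\t111\n1\t112\n' > "$tmp"
add_group_count "$tmp"
rm -f "$tmp"
```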
You might consider using GNU Datamash, e.g.:
datamash -HW groupby 1,2 count 1 < file.txt | column -t
Output:
GroupBy(A) GroupBy(B) count(A)
1 111 2
1 112 1
1 113 4
2 113 3

Calculating cumulative sum and percent of total for columns grouped by row

I have a very large table of values that is formatted like this:
apple 1 1
apple 2 1
apple 3 1
apple 4 1
banana 25 4
banana 35 10
banana 36 10
banana 37 10
Column 1 has many different fruit, with varying numbers of rows for each fruit.
I would like to calculate the cumulative sum of column 3 for each type of fruit in column 1, and the cumulative percentage of the total at each row, and add these as new columns. So the desired output would be this:
apple 1 1 1 25.00
apple 2 1 2 50.00
apple 3 1 3 75.00
apple 4 1 4 100.00
banana 25 4 4 11.76
banana 35 10 14 41.18
banana 36 10 24 70.59
banana 37 10 34 100.00
I can get part way there with awk, but I am struggling with how to get the cumulative sum to reset at each new fruit. Here is my horrendous awk attempt for your viewing pleasure:
#!/bin/bash
awk '{cumsum += $3; $3 = cumsum} 1' fruitfile > cumsum.tmp
total=$(awk '{total=total+$3}END{print total}' fruitfile)
awk -v total=$total '{ printf ("%s\t%s\t%s\t%.5f\n", $1, $2, $3, ($3/total)*100)}' cumsum.tmp > cumsum.txt
rm cumsum.tmp
Could you please try the following, written for the shown samples.
awk '
FNR==NR{
a[$1]+=$NF
next
}
{
b[$1]+=$NF
printf "%s %s %.2f\n",$0,b[$1],(b[$1]/a[$1])*100
}
' Input_file Input_file |
column -t
Output for shown samples will be as follows.
apple 1 1 1 25.00
apple 2 1 2 50.00
apple 3 1 3 75.00
apple 4 1 4 100.00
banana 25 4 4 11.76
banana 35 10 14 41.18
banana 36 10 24 70.59
banana 37 10 34 100.00
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
a[$1]+=$NF ##Creating array a with index $1 and keep adding its last field value to it, giving the total per fruit.
next ##next will skip all further statements from here.
}
{
b[$1]+=$NF ##Creating array b with index $1 and keep adding the last field value to it, giving the running sum per fruit.
printf "%s %s %.2f\n",$0,b[$1],(b[$1]/a[$1])*100 ##Printing current line, the running sum, and the running sum as a percentage of the total for that fruit.
}
' Input_file Input_file | ##Mentioning Input_file name 2 times here.
column -t ##Sending awk output to column command for better look.
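For a very large table, reading the file twice may be unwelcome. A single-pass sketch, assuming all rows of a fruit are adjacent (as in the sample), buffers one group at a time and prints it once the fruit changes. The names cumsum_by_group and emit are made up for the example.

```shell
# Hypothetical helper: running sum of $3 per group and its percent of the group total.
cumsum_by_group() {
    awk '
    function emit(   i, run) {            # print the buffered group, then reset it
        for (i = 1; i <= n; i++) {
            run += val[i]
            printf "%s %s %.2f\n", line[i], run, (total ? run / total * 100 : 0)
        }
        n = 0; total = 0
    }
    $1 != key { emit(); key = $1 }        # a new fruit starts: flush the previous one
    { line[++n] = $0; val[n] = $3; total += $3 }
    END { emit() }                        # flush the last group
    ' "$@"
}

printf 'apple 1 1\napple 2 1\nbanana 25 4\nbanana 35 10\n' | cumsum_by_group
```

Memory use is bounded by the largest group rather than the whole file. If the rows of a fruit are not adjacent, the input would need sorting on column 1 first.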

Select current and previous line if certain value is found

To solve my problem, I subtract the previous row's column 3 from the current row's column 3 to create a new column 5, then I want to print the previous and current line whenever the value in column 5 equals 25.
Input file
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
I tried
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' file
output
1 1 35 1 35
2 5 50 1 15
2 6 75 1 25
4 7 85 1 10
5 8 100 1 15
6 9 125 1 25
4 1 200 1 75
Desired Output
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Thanks in advance
You're almost there: in addition to the previous $3, keep the previous $0 and only print when the condition is satisfied.
$ awk '{$5=$3-p3} $5==25{print p0; print} {p0=$0;p3=$3}' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
this can be further golfed to
$ awk '25==($5=$3-p3){print p0; print} {p0=$0;p3=$3}' file
This checks whether the newly computed field $5 equals 25. If so, it prints the previous line and the current line. It then saves the current line and its $3 for the computation on the next line.
You are close to the answer; just pipe it to another awk and print:
awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
with Inputs:
$ cat oxxo.txt
1 1 35 1
2 5 50 1
2 6 75 1
4 7 85 1
5 8 100 1
6 9 125 1
4 1 200 1
$ awk '{$5 = $3 - prev3; prev3 = $3; print $0}' oxxo.txt | awk ' { curr=$0; if($5==25) { print prev;print curr } prev=curr } '
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
$
Could you please try the following.
awk '$3-prev==25{print line ORS $0,$3-prev} {$(NF+1)=$3-prev;prev=$3;line=$0}' Input_file | column -t
Here's one:
$ awk '{$5=$3-q;t=p;p=$0;q=$3;$0=t ORS $0}$10==25' file
2 5 50 1 15
2 6 75 1 25
5 8 100 1 15
6 9 125 1 25
Explained:
$ awk '{
$5=$3-q # subtract
t=p # previous to temp
p=$0 # store previous for next round
q=$3 # store subtract value for next round
$0=t ORS $0 # prepare record for output
}
$10==25 # output if equals
' file
There is no checking for duplicates, so the same record may be printed twice when two consecutive rows both match. The easiest way to fix that is to pipe the output to uniq.
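A small sketch of the duplicate case, using three made-up rows whose column 3 rises by 25 twice in a row, so the middle row is both a current and a previous line. diff25 is a hypothetical name for the one-liner plus uniq.

```shell
# Hypothetical wrapper: the one-liner above with adjacent duplicates removed.
diff25() {
    awk '{$5=$3-q;t=p;p=$0;q=$3;$0=t ORS $0}$10==25' "$@" | uniq
}

printf '1 1 50 1\n2 2 75 1\n3 3 100 1\n' | diff25
```

Without uniq, the row 2 2 75 1 25 would come out twice. Note that uniq also collapses adjacent output lines that are identical for any other reason.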

To sum adjacent lines from the same column in AWK

I have a file:
A 1 20
B 2 21
C 3 22
D 4 23
I have to find the sum of the values from line 0 to line 3, then the sum of lines 1 to 3, and finally the sum of lines 2 to 3 (counting lines from 0). The last value simply has to be 0. In other words, I want to get an output file with two columns where the values are the sums of adjacent lines, something like this:
10 86
9 66
7 45
0 0
The last row has to have two zeros as values. How to do it in AWK?
This might be what you want:
$ tac file | awk 'NR==1{ print 0, 0; a=$2; b=$3; next} { print a+=$2, b+=$3 }' | tac
10 86
9 66
7 45
0 0
Avoid two tacs by accumulating the sums in two arrays:
$ awk '{
for (i = 1; i <= NR; ++i) { sum2[i] += $2; sum3[i] += $3 }
}
END {
sum2[NR] = sum3[NR] = 0
for (i = 1; i <= NR; ++i) print sum2[i], sum3[i]
}' file
10 86
9 66
7 45
0 0
The value of each row is added into its own row and all the previous rows. Once all rows have been processed, the sums for the last row are zeroed out and everything is printed.
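The nested loop above touches every earlier row for each input row, so the work grows quadratically with the file size. A possible linear sketch, assuming the whole file fits in memory: store columns 2 and 3, then build the sums backwards in END. suffix_sums is a made-up name.

```shell
# Hypothetical helper: per-row sums of $2 and $3 from that row to the end, last row zeroed.
suffix_sums() {
    awk '
    { v2[NR] = $2; v3[NR] = $3 }          # remember both columns per row
    END {
        for (i = NR; i >= 1; --i) {       # walk backwards, accumulating
            s2 += v2[i]; s3 += v3[i]
            r2[i] = s2;  r3[i] = s3
        }
        r2[NR] = r3[NR] = 0               # the last row is defined to be 0 0
        for (i = 1; i <= NR; ++i) print r2[i], r3[i]
    }' "$@"
}

printf 'A 1 20\nB 2 21\nC 3 22\nD 4 23\n' | suffix_sums
```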

replace a block of lines in file1 with a block of lines in file2

file1:
a xyz 1 2 4
a xyz 1 2 3
a abc 3 9 7
a abc 3 9 2
a klm 9 3 1
a klm 9 8 3
a tlc 3 9 3
file2:
a xyz 9 2 9
a xyz 8 9 2
a abc 3 8 9
a abc 6 2 7
a tlk 7 8 9
I want to replace the lines that have 'abc' in file1 with the lines that have 'abc' in file2. I'm new to sed, awk, etc. Any help is appreciated.
I tried cat file1 <(sed '/$r = abc;/d' file2) > newfile among others but this one simply copies file1 to newfile. I also don't want to generate a new file but only edit file1.
desired output:
(processed) file1:
a xyz 1 2 4
a xyz 1 2 3
a abc 3 8 9
a abc 6 2 7
a klm 9 3 1
a klm 9 8 3
a tlc 3 9 3
With GNU awk, you can use this trick:
gawk -v RS='([^\n]* abc [^\n]*\n)+' 'NR == FNR { save = RT; nextfile } FNR == 1 { printf "%s", $0 save; next } { printf "%s", $0 RT }' file2 file1
With the record separator ([^\n]* abc [^\n]*\n)+, this splits the input files into records delimited by blocks of lines with " abc " in them. Then,
NR == FNR { # while processing the first given file (file2)
save = RT # remember the first record terminator -- the
# first block of lines with abc in them
nextfile # and go to the next file.
}
FNR == 1 { # for the first record in file1
printf "%s", $0 save # print it with the saved record terminator
next # from file2, and get the next record
}
{ # from then on, just echo.
printf "%s", $0 RT
}
Note that this uses several GNU extensions, so it will not work with mawk.
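If gawk is not available, a sketch along these lines may work with any POSIX awk. It assumes the key is always the second field and that the abc lines form a single block in file1. replace_block is a made-up name.

```shell
# Hypothetical helper: replace_block SOURCE TARGET
# Prints TARGET with its abc lines replaced by the abc lines of SOURCE.
replace_block() {
    awk '
    NR == FNR {                           # first file: collect its abc lines
        if ($2 == "abc") block = block $0 ORS
        next
    }
    $2 == "abc" {                         # second file: swap in the saved block once
        if (!done++) printf "%s", block
        next
    }
    { print }                             # every other line passes through
    ' "$1" "$2"
}

dir=$(mktemp -d)
printf 'a xyz 9 2 9\na abc 3 8 9\na abc 6 2 7\n' > "$dir/file2"
printf 'a xyz 1 2 4\na abc 3 9 7\na abc 3 9 2\na klm 9 3 1\n' > "$dir/file1"
replace_block "$dir/file2" "$dir/file1"
rm -rf "$dir"
```

Like the gawk version, this writes to standard output. To change file1 itself, the output can be redirected to a temporary file and moved over file1 afterwards. The NR == FNR test misfires if the first file is empty.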