Using awk to remove filtered groups

I have an input:
1 a 0,9
1 b 0,8
1 c 0,1
2 d 0,5
3 e 0,1
3 f 0,7
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
Using awk, I want to remove filtered groups:
if the third column of any row is greater than 0.6, I want to remove all rows whose first column equals that row's first column.
Desired Output:
2 d 0,5
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
I have used this, but it doesn't delete the other rows of the group:
awk '($3 < 0.6)' file

With your shown samples, could you please try the following.
awk '
FNR==NR{
temp=$3
sub(/,/,".",temp)
if(temp>0.6){
noCount[$1]
}
next
}
!($1 in noCount)
' Input_file Input_file
Output will be as follows.
2 d 0,5
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##This condition will be TRUE when Input_file is being read the first time.
temp=$3 ##Creating temp with 3rd field value here.
sub(/,/,".",temp) ##Substituting comma with dot in temp here.
if(temp>0.6){ ##Checking condition if temp is greater than 0.6 then do following.
noCount[$1] ##Creating noCount with index of 1st field.
}
next ##next will skip all further statements from here.
}
!($1 in noCount) ##If 1st field is NOT present in noCount then print line.
' Input_file Input_file ##Mentioning Input_file names here.
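For completeness, the same filtering can be done in a single pass by buffering the input in memory — a sketch (the bad/line/key array names are just illustrative), with the sample fed in via a here-document:

```shell
# One-pass sketch: remember each line and mark the groups that must be
# dropped, then print the survivors in END. temp+0 forces a numeric
# comparison, since sub() leaves temp as a plain string.
awk '
{
  temp = $3
  sub(/,/, ".", temp)              # turn the comma decimal into a dot
  if (temp + 0 > 0.6) bad[$1]      # mark the whole group for removal
  line[NR] = $0
  key[NR] = $1
}
END {
  for (i = 1; i <= NR; i++)
    if (!(key[i] in bad)) print line[i]
}' <<'EOF'
1 a 0,9
1 b 0,8
1 c 0,1
2 d 0,5
3 e 0,1
3 f 0,7
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
EOF
```

The trade-off versus the two-pass version above: only one read of the input, at the cost of holding the whole file in memory.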


AWK command to add a column with the count of a grouped column

I have a tab-separated data set like this (file.txt):
A B
1 111
1 111
1 112
1 113
1 113
1 113
1 113
2 113
2 113
2 113
I want to add a new column C showing the count of rows grouped by A and B.
Desired output:
A B C
1 111 2
1 111 2
1 112 1
1 113 4
1 113 4
1 113 4
1 113 4
2 113 3
2 113 3
2 113 3
I have tried this:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{
if (FNR>1) a[$2]+=$3
next
}
{ $(NF+1)=(FNR==1 ? "C" : a[$2]) }
1
' file.txt file.txt > file2.txt
With your shown samples, could you please try the following.
awk '
FNR==NR{
count[$1,$2]++
next
}
FNR==1{
print $0,"C"
next
}
{
print $0,count[$1,$2]
}
' Input_file Input_file
Add BEGIN{FS=OFS="\t"} to the above code in case your data is tab-delimited.
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when Input_file is being read the first time.
count[$1,$2]++ ##Creating count with index of 1st and 2nd field and increasing its count.
next ##next will skip further statements from here.
}
FNR==1{ ##Checking condition if this is 1st line then do following.
print $0,"C" ##Printing current line with C heading here.
next ##next will skip further statements from here.
}
{
print $0,count[$1,$2] ##Printing current line along with count with index of 1st and 2nd field.
}
' Input_file Input_file ##Mentioning Input_file(s) here.
Problem in OP's attempt: OP was summing $3 (the logic looked OK), but there is NO 3rd field in the Input_file, which is why it was not working. Also, OP was using only the 2nd field as the index, but per OP's comments it should be the 1st and 2nd fields together.
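If reading the Input_file twice is undesirable, a single-pass sketch that buffers the whole file in memory (the header/line/k array names are illustrative) could look like this:

```shell
# One-pass sketch: keep the header aside, buffer the data lines, count the
# (A,B) pairs, then print every line with its pair count in END.
awk '
NR==1 { header = $0; next }
{ line[NR] = $0; k[NR] = $1 SUBSEP $2; count[k[NR]]++ }
END {
  print header, "C"
  for (i = 2; i <= NR; i++) print line[i], count[k[i]]
}' <<'EOF'
A B
1 111
1 111
1 112
1 113
1 113
1 113
1 113
2 113
2 113
2 113
EOF
```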
You might consider using GNU Datamash, e.g.:
datamash -HW groupby 1,2 count 1 < file.txt | column -t
Output:
GroupBy(A) GroupBy(B) count(A)
1 111 2
1 112 1
1 113 4
2 113 3

Replacing a column conditionally across two files with awk

Using these examples:
File1:
rs12124819 1 0.020242 776546 A G
rs28765502 1 0.022137 832918 T C
rs7419119 1 0.022518 842013 T G
rs950122 1 0.022720 846864 G C
File2:
1_752566 1 0 752566 G A
1_776546 1 0 776546 A G
1_832918 1 0 832918 T C
1_842013 1 0 842013 T G
I am trying to change the 1st column of file2 to the corresponding 1st column of file1 when their 4th columns are equal.
Expected output:
rs12124819 1 0 752566 G A
rs28765502 1 0 776546 A G
rs7419119 1 0 832918 T C
rs950122 1 0 842013 T G
I tried to create 2 arrays but couldn't find the correct way to use them:
awk 'FNR==NR{a[$4],b[$1];next} ($4) in a{$1=b[FNR]}1' file1 file2 > out.txt
Thanks a lot!
With your shown samples, could you please try the following. Written and tested in GNU awk.
awk 'FNR==NR{a[$4]=$1;next} ($4 in a){$1=a[$4]} 1' file1 file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1 is being read.
a[$4]=$1 ##Creating array a whose index is $4 and value is $1.
next ##next will skip all further statements from here.
}
($4 in a){ ##Checking condition if 4th field is present in a then do following.
$1=a[$4] ##Setting value of 1st field of file2 as array a value with index of 4th column
}
1 ##1 will print edited/non-edited line.
' file1 file2 ##mentioning Input_file names here.
You may try this awk:
awk 'FNR==NR {map[FNR] = $1; next} {$1 = map[FNR]} 1' file1 file2 | column -t
rs12124819 1 0 752566 G A
rs28765502 1 0 776546 A G
rs7419119 1 0 832918 T C
rs950122 1 0 842013 T G
Another alternative (if the files are sorted on the join key, as in the sample data):
$ join -j4 -o1.1,2.2,2.3,2.4,2.5,2.6 file1 file2 | column -t
rs12124819 1 0 776546 A G
rs28765502 1 0 832918 T C
rs7419119 1 0 842013 T G
Note that your input files have only 3 matching records.
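If the files were not already sorted on the join key, they could be sorted on the fly — a sketch using bash/zsh process substitution, assuming the same file1/file2 names:

```shell
# Sort both files on field 4 before joining; join compares keys as strings,
# so the default lexical sort is the collation join expects.
join -j4 -o1.1,2.2,2.3,2.4,2.5,2.6 \
  <(sort -k4,4 file1) <(sort -k4,4 file2) | column -t
```

Note that process substitution (`<(...)`) requires bash or zsh; under plain sh, sort to temporary files first.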

How can I use awk to calculate sum and replace column in file

I'm new to the site and to the programming world and I hope you have time to help me.
My problem is as follows: I have a file with several columns. In the 2nd column there are values. I'm trying to add a given number to each of those values and to replace the second column with a new column containing the results of the sum.
Here an example of my input:
A B C
x 1 t
y 2 u
z 3 v
I want to add 5 to the values in column B and obtain an output like the one below:
A B C
x 6 t
y 7 u
z 8 v
The code I tried unsuccessfully is:
zcat my_file.vcf.gz| tail -n +49 | awk 'BEGIN{FS=OFS="\t"} {print $0, $2+5}'>my.output.vcf
Thanks in advance
We could avoid using tail, since printing lines from the 49th onward can be handled within awk itself. Also, you need to add the value to the 2nd field and then print the whole line with the print command.
Important point: as per OP's sample, if the 2nd field contains alphabetic characters then we need NOT add 5 to it, so that condition is taken care of here too.
zcat my_file.vcf.gz |
awk '
BEGIN{ FS=OFS="\t" }
FNR>=49{
$2=($2~/[a-zA-Z]/?$2:$2+5)
print
}
' > my.output.vcf
You can use
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1'
Here, $2+=5 will add 5 to the Field 2 value, and 1 will trigger printing of the record (row, line, same as print $0).
See an awk demo (the sample below is space-separated, so the tab FS/OFS setting is omitted):
#!/bin/bash
s='A B C
x 1 t
y 2 u
z 3 v'
awk '{$2+=5}1' <<< "$s"
Output:
A 5 C
x 6 t
y 7 u
z 8 v
Another form for clarity:
awk 'BEGIN{FS=OFS="\t"} {print $1, $2+5, $3}'
You can use:
awk 'BEGIN {FS=OFS="\t"} NR == 1 {print $0} NR > 1 {print $1,($2+5),$3;}'
output:
A B C
x 6 t
y 7 u
z 8 v
Maybe this can help you:
awk '{ if (NR > 1) $2 += 5; print $0 }' file
Applied to your code:
zcat my_file.vcf.gz | tail -n +49 | awk '{ if (NR > 1) $2 += 5; print $0 }' > my.output.vcf
cat boo
A B C
x 1 t
y 2 u
z 3 v
cat boo | awk 'BEGIN{FS=OFS="\t"} $2 ~ /^[0-9]+$/ {print $1, $2+5, $3} $2 !~ /^[0-9]+$/ {print} '
A B C
x 6 t
y 7 u
z 8 v

Calculating cumulative sum and percent of total for columns grouped by row

I have a very large table of values that is formatted like this:
apple 1 1
apple 2 1
apple 3 1
apple 4 1
banana 25 4
banana 35 10
banana 36 10
banana 37 10
Column 1 has many different fruit, with varying numbers of rows for each fruit.
I would like to calculate the cumulative sum of column 3 for each type of fruit in column 1, and the cumulative percentage of the total at each row, and add these as new columns. So the desired output would be this:
apple 1 1 1 25.00
apple 2 1 2 50.00
apple 3 1 3 75.00
apple 4 1 4 100.00
banana 25 4 4 11.76
banana 35 10 14 41.18
banana 36 10 24 70.59
banana 37 10 34 100.00
I can get part way there with awk, but I am struggling with how to get the cumulative sum to reset at each new fruit. Here is my horrendous awk attempt for your viewing pleasure:
#!/bin/bash
awk '{cumsum += $3; $3 = cumsum} 1' fruitfile > cumsum.tmp
total=$(awk '{total=total+$3}END{print total}' fruitfile)
awk -v total=$total '{ printf ("%s\t%s\t%s\t%.5f\n", $1, $2, $3, ($3/total)*100)}' cumsum.tmp > cumsum.txt
rm cumsum.tmp
Could you please try the following, written and tested with the shown samples.
awk '
FNR==NR{
a[$1]+=$NF
next
}
{
cum[$1]+=$NF
printf "%s %s %.2f\n",$0,cum[$1],(cum[$1]/a[$1])*100
}
' Input_file Input_file |
column -t
Output for shown samples will be as follows.
apple 1 1 1 25.00
apple 2 1 2 50.00
apple 3 1 3 75.00
apple 4 1 4 100.00
banana 25 4 4 11.76
banana 35 10 14 41.18
banana 36 10 24 70.59
banana 37 10 34 100.00
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file is being read the first time.
a[$1]+=$NF ##Creating array a with index of 1st field and summing the last field of each group into it.
next ##next will skip all further statements from here.
}
{
cum[$1]+=$NF ##Creating cum with index of 1st field to keep the running (cumulative) sum of the last field per group.
printf "%s %s %.2f\n",$0,cum[$1],(cum[$1]/a[$1])*100 ##Printing current line, its cumulative sum, and the cumulative percentage of the group total, rounded to 2 decimal places.
}
' Input_file Input_file | ##Mentioning Input_file twice, once for each pass.
column -t ##Sending awk output to column command for better look.
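A single-pass sketch of the same idea, buffering the whole file in memory instead of reading it twice (the line/g/v/total array names are illustrative), producing the cumulative sum and the two-decimal cumulative percentage the OP asked for:

```shell
# One-pass sketch: buffer lines, total the last field per group, then emit
# each line followed by its running sum and cumulative percent of the total.
awk '
{ line[NR] = $0; g[NR] = $1; v[NR] = $NF; total[$1] += $NF }
END {
  for (i = 1; i <= NR; i++) {
    cum[g[i]] += v[i]
    printf "%s %s %.2f\n", line[i], cum[g[i]], (cum[g[i]] / total[g[i]]) * 100
  }
}' <<'EOF'
apple 1 1
apple 2 1
apple 3 1
apple 4 1
banana 25 4
banana 35 10
banana 36 10
banana 37 10
EOF
```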

Sum values of the 3rd column and divide rows accordingly

I have a file as below with n rows. I want to total the sum of the 3rd column and distribute the rows accordingly into 3 different files (based on the sum of each).
For example, if we sum all the 3rd column values, the total comes to 516, and dividing it by 3 gives 172.
So I want to add rows to the first file so that it doesn't exceed the 172 mark, the same with the 2nd file, and all remaining rows should go to the third file.
Input file
a aa 10
b ab 15
c ac 17
a dy 30
y ae 12
a dl 34
a fk 45
l ah 56
o aj 76
l ai 12
q al 09
d pl 34
e ik 30
f ll 10
g dl 15
h fr 17
i dd 23
j we 27
k rt 12
l yt 13
m tt 19
Expected output:
file1(total -163)
a aa 10
b ab 15
c ac 17
a dy 30
y ae 12
a dl 34
a fk 45
file2 (total-153)
l ah 56
o aj 76
l ai 12
q al 9
file3 (total - 200)
d pl 34
e ik 30
f ll 10
g dl 15
h fr 17
i dd 23
j we 27
k rt 12
l yt 13
m tt 19
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
FNR==NR{
sum+=$NF
next
}
FNR==1{
count=sum/3
}
{
curr_sum+=$NF
}
(curr_sum>=count || FNR==1) && fileCnt<=2{
close(out_file)
out_file="file" ++fileCnt
curr_sum=$NF
}
{
print > (out_file)
}' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
sum+=$NF ##Taking sum of last field of all lines here and keep adding them to get cumulative sum of whole Input_file.
next ##next will skip all further statements from here.
}
FNR==1{ ##Checking condition if it's the first line of the second reading of Input_file.
count=sum/3 ##Creating count with value of sum/3 here.
}
{
curr_sum+=$NF ##Keep adding the last field into curr_sum here.
}
(curr_sum>=count || FNR==1) && fileCnt<=2{ ##Checking if current sum is >= count OR it's the first line (of the 2nd reading), AND the output file count is <=2.
close(out_file) ##Closing output file here, may NOT be needed here since we are having only 3 files here in output.
out_file="file" ++fileCnt ##Creating output file name here.
curr_sum=$NF ##Resetting curr_sum to the current line's last field, since this line starts the new output file.
}
{
print > (out_file) ##Printing current line into output file here.
}' Input_file Input_file ##Mentioning Input_file names here.
awk '{ L[nr++]=$0; sum+=$3 }
END{ sumpf=sum/3; sum=0; file=1;
  for(i=0; i<nr; i++) { split(L[i],a);
    if ((sum+a[3])>sumpf && file<3) { file+=1; sum=0 };
    print L[i] > ("file" file);
    sum+=a[3];
  }
}' input
This script reads all input into array L and calculates sum.
In the END block the sum per file is calculated as sumpf, and the output is written.
In contrast to the other solution, this needs to read the input file only once.
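As a sanity check, the whole pipeline can be run against the sample input and the per-file totals verified — a self-contained sketch that writes input, file1, file2 and file3 into the current directory:

```shell
# Write the sample input, run the single-pass splitter, then confirm the
# 3rd-column totals of file1/file2/file3 come out as 163, 153 and 200.
cat > input <<'EOF'
a aa 10
b ab 15
c ac 17
a dy 30
y ae 12
a dl 34
a fk 45
l ah 56
o aj 76
l ai 12
q al 09
d pl 34
e ik 30
f ll 10
g dl 15
h fr 17
i dd 23
j we 27
k rt 12
l yt 13
m tt 19
EOF

awk '{ L[nr++]=$0; sum+=$3 }
END{ sumpf=sum/3; sum=0; file=1;
  for(i=0; i<nr; i++) { split(L[i],a);
    if ((sum+a[3])>sumpf && file<3) { file+=1; sum=0 };
    print L[i] > ("file" file);
    sum+=a[3];
  }
}' input

for f in file1 file2 file3; do
  printf '%s total: %s\n' "$f" "$(awk '{ s += $3 } END { print s+0 }' "$f")"
done
```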