Create histogram-like bins for a range including negative numbers - awk

I have numbers in a range from -4 to 4, including 0, as in
-0.526350041828112
-0.125648350883331
0.991377353361933
1.079241128983
1.06322905224238
1.17477528478982
-0.0651086035371559
0.818471811380787
0.0355593553368815
I need to create histogram-like buckets, and have been trying to use this:
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
bucketNr = int(($0+delta) / delta)
cnt[bucketNr]++
numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
end = beg + delta
printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
beg = end
}
}
which I took from Create bins with awk histogram-like.
The output would look like
-2.4 -2.1 8
-2.1 -1.8 25
-1.8 -1.5 108
-1.5 -1.2 298
-1.2 -0.9 773
-0.9 -0.6 1067
-0.6 -0.3 1914
-0.3 0.0 4174
0.0 0.3 3969
0.3 0.6 2826
0.6 0.9 1460
0.9 1.2 752
1.2 1.5 396
1.5 1.8 121
1.8 2.1 48
2.1 2.4 13
2.4 2.7 1
2.7 3.0 1
I'm thinking I would have to run this twice, once with delta set to 0.3 and again with delta set to -0.3, and cat the two outputs together.
But I'm not sure this intuition is correct.

This might work for you:
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
bucketNr = int(($0<0?$0-delta:$0)/delta)
cnt[bucketNr]++
maxBucket = (maxBucket > bucketNr ? maxBucket : bucketNr)
minBucket = (minBucket < bucketNr ? minBucket : bucketNr)
}
END {
beg = minBucket*delta
for (bucketNr=minBucket; bucketNr<=maxBucket; bucketNr++) {
end = beg + delta
printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
beg = end
}
}
It's basically the code you posted + handling negative numbers.
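If it helps, here is a minimal way to try the script out from the shell (the file name, sample numbers, and delta value are just examples; the awk program is a verbatim copy of the answer above):

```shell
# Save the answer's program as histogram.awk.
cat > histogram.awk <<'EOF'
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
  bucketNr = int(($0<0?$0-delta:$0)/delta)
  cnt[bucketNr]++
  maxBucket = (maxBucket > bucketNr ? maxBucket : bucketNr)
  minBucket = (minBucket < bucketNr ? minBucket : bucketNr)
}
END {
  beg = minBucket*delta
  for (bucketNr=minBucket; bucketNr<=maxBucket; bucketNr++) {
    end = beg + delta
    printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
    beg = end
  }
}
EOF

# Feed it a few numbers, one per line, overriding delta on the command line.
printf '%s\n' -0.5 -0.1 0.2 0.4 | awk -v delta=0.3 -f histogram.awk
```

A single pass handles both signs, so there is no need to run it twice and cat the results together.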

Related

awk script to sum numbers in a column over a loop not working for some iterations in the loop

Sample input
12.0000 0.6000000 0.05
13.0000 1.6000000 0.05
14.0000 2.6000000 0.05
15.0000 3.0000000 0.05
15.0000 3.2000000 0.05
15.0000 3.4000000 0.05
15.0000 3.6000000 0.10
15.0000 3.8000000 0.10
15.0000 4.0000000 0.10
15.0000 4.2000000 0.11
15.0000 4.4000000 0.12
15.0000 4.6000000 0.13
15.0000 4.8000000 0.14
15.0000 5.0000000 0.15
15.0000 5.2000000 0.14
15.0000 5.4000000 0.13
15.0000 5.6000000 0.12
15.0000 5.8000000 0.11
15.0000 6.0000000 0.10
15.0000 6.2000000 0.10
15.0000 6.4000000 0.10
15.0000 6.6000000 0.05
15.0000 6.8000000 0.05
15.0000 7.0000000 0.05
Goal
Print line 1 in output as 0 0
For $2 = 5.000000, $3 = 0.15.
Print line 2 in output as 1 0.15
For $2 = 4.800000 through $2 = 5.200000, sum+=$3 for each line (i.e. 0.14 + 0.15 + 0.14 = 0.43).
Print line 3 in output as 2 0.43.
For $2 = 4.600000 through $2 = 5.400000, sum+=$3 for each line (i.e. 0.13 + 0.14 + 0.15 + 0.14 + 0.13 = 0.69).
Print line 4 in output as 3 0.69
Continue this pattern until $2 = 5.000000 +- 1.6 (9 lines total, plus line 1 as 0 0 = 10 total lines in output)
Desired Output
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
Attempt
Script 1
#!/bin/bash
for (( i=0; i<=8; i++ )); do
awk '$2 >= 5.0000000-'$i'*0.2 {sum+=$3}
$2 == 5.0000000+'$i'*0.2 {print '$i', sum; exit
}' test.dat
done > test.out
produces
0 0.15
1 0.43
2 0.69
3 0.93
4 1.15
5 1.35
6 1.55
7 1.75
8 1.85
This is very close. However, the output is missing 0 0 for line 1, and because of this, lines 2 through 10 have $1 and $2 mismatched by 1 line.
Script 2
#!/bin/bash
for (( i=0; i<=8; i++ )); do
awk ''$i'==0 {sum=0}
'$i'>0 && $2 > 5.0000000-'$i'*0.2 {sum+=$3}
$2 == 5.0000000+'$i'*0.2 - ('$i' ? 0.2 : 0) {print '$i', sum; exit
}' test.dat
done > test.out
which produces
0 0
1 0.15
2 0.43
4 0.93
5 1.15
6 1.35
7 1.55
$1 and $2 are now correctly matched. However, I am missing the lines with $1=3, $1=8, and $1=9 completely. Adding the ternary operator causes my code to skip these iterations in the loop somehow.
Question
Can anyone explain what's wrong with script 2, or how to achieve the desired output in one line of code? Thank you.
Solution
I used Ed Morton's solution to solve this; both of his versions work, for different goals. Instead of using the modulus to save array space, I constrained the array to lines with $1 = 15.0000. I did this instead of the modulus so that I could also sum two other "key" variables, at different parts of the input, into separate output files.
Furthermore, as far as I understood it, the optimized script summed only the lines with $2 >= 5.0000000 and then multiplied that sum by 2 to account for the lines with $2 <= 5.0000000. That works for the sample input here because I made $3 symmetric around 0.15. I modified it to sum the two sides separately, though.
awk 'BEGIN { key=5; range=9}
$1 == 15.0000 {
a[NR] = $3
}
$2 == key { keyIdx = NR}
END {
print (0, 0) > "test.out"
sum = a[keyIdx]
for (delta=1; delta<=range; delta++) {
print (delta, sum) > "test.out"
plusIdx = (keyIdx + delta)
minusIdx = (keyIdx - delta)
sum += a[plusIdx] + a[minusIdx]
}
exit
}' test.dat
Is this what you're trying to do?
$ cat tst.awk
$2 == 5 { keyNr = NR }
{ nr2val[NR] = $3 }
END {
print 0, 0
sum = nr2val[keyNr]
for (delta=1; delta<=9; delta++) {
print delta, sum
sum += nr2val[keyNr+delta] + nr2val[keyNr-delta]
}
}
$ awk -f tst.awk file
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
We could optimize it to store only 2*range (here 18) values in nr2val[] (using the modulus operator NR%(2*range) for the index) and to do the calculation when we hit an NR that's range lines past the line where $2 == key, rather than after we've read the whole of the input. That's useful if the above is either too slow or your input file is too big to store all in memory, e.g.:
$ cat tst.awk
BEGIN { key=5; range=9 }
{
idx = NR % (2*range)
nr2val[idx] = $3
}
$2 == key { keyIdx = idx; endNr = NR+range }
NR == endNr { exit }
END {
print 0, 0
sum = nr2val[keyIdx]
for (delta=1; delta<=range; delta++) {
print delta, sum
idx = (keyIdx + delta) % (2*range)
sum += nr2val[idx] + nr2val[idx] # high side counted twice; relies on $3 being symmetric around the key line
}
exit
}
$ awk -f tst.awk file
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
I like your problem; it is an adequate challenge.
My approach is to put as much as possible into the awk script and to scan the input file only once, because I/O manipulation is slower than computation (these days).
All the computations (9 of them, in fact) are done on each relevant input line as it is read.
The required inputs are variable F1 and text file input.txt
The execution command is:
awk -v F1=95 -f script.awk input.txt
So the logic is:
1. Initialize: compute the 9 range markers and store their values in an array.
2. Store the 3rd input field in an ordered array `field3`; we use this array to compute the sums.
3. On each line whose 1st field equals 15.0000:
3.1 If a begin marker is found, record its position.
3.2 If an end marker is found, compute the sum and record its position.
4. Finalize: output all the computed results.
script.awk, including a few commented-out debug printouts to assist in debugging:
BEGIN {
itrtns = 8; # iterations count consistent all over the program.
for (i = 0; i <= itrtns; i++) { # compute range markers per iteration
F1start[i] = (F1 - 2 - i)/5 - 14; # print "F1start["i"]="F1start[i];
F1stop[i] = (F1 - 2 + i)/5 - 14; # print "F1stop["i"]="F1stop[i];
b[i] = F1start[i] + (i ? 0.2 : 0); # print "b["i"]="b[i];
}
}
{ field3[NR] = $3;} # store 3rd input field in ordered array.
$1==15.0000 { # for each input line that has 1st input field 15.0000
currVal = $2 + 0; # convert 2nd input field to numeric value
for (i = 0; i <= itrtns; i++) { # on each line scan for range markers
# print "i="i, "currVal="currVal, "b["i"]="b[i], "F1stop["i"]="F1stop[i], isZero(currVal-b[i]), isZero(currVal-F1stop[i]);
if (isZero(currVal - b[i])) { # if there is a begin marker
F1idx[i] = NR; # store the marker index position
# print "F1idx["i"] =", F1idx[i];
}
if (isZero(currVal - F1stop[i])) { # if there is an end marker
for (s = F1idx[i]; s <= NR; s++) {sum[i] += field3[s];} # calculate its sum
F2idx[i] = NR; # store its end marker position (for the debug report)
# print "field3["NR"]=", field3[NR];
}
}
}
END { # output the computed results
for (i = 0; i <= itrtns; i++) {print i, sum[i], "rows("F1idx[i]"-"F2idx[i]")"}
}
function isZero(floatArg) { # floating-point precision comparison
tolerance = 0.00000000001;
if (floatArg < tolerance && floatArg > -1 * tolerance )
return 1;
return 0;
}
Provided input.txt from the question.
12.0000 0.6000000 0.05
13.0000 1.6000000 0.05
14.0000 2.6000000 0.05
15.0000 3.0000000 0.05
15.0000 3.2000000 0.05
15.0000 3.4000000 0.05
15.0000 3.6000000 0.10
15.0000 3.8000000 0.10
15.0000 4.0000000 0.10
15.0000 4.2000000 0.11
15.0000 4.4000000 0.12
15.0000 4.6000000 0.13
15.0000 4.8000000 0.14
15.0000 5.0000000 0.15
15.0000 5.2000000 0.14
15.0000 5.4000000 0.13
15.0000 5.6000000 0.12
15.0000 5.8000000 0.11
15.0000 6.0000000 0.10
15.0000 6.2000000 0.10
15.0000 6.4000000 0.10
15.0000 6.6000000 0.05
15.0000 6.8000000 0.05
15.0000 7.0000000 0.05
The output for: awk -v F1=95 -f script.awk input.txt
0 0.13 rows(12-12)
1 0.27 rows(12-13)
2 0.54 rows(11-14)
3 0.79 rows(10-15)
4 1.02 rows(9-16)
5 1.24 rows(8-17)
6 1.45 rows(7-18)
7 1.6 rows(6-19)
8 1.75 rows(5-20)
The output for: awk -v F1=97 -f script.awk input.txt
0 0.15 rows(14-14)
1 0.29 rows(14-15)
2 0.56 rows(13-16)
3 0.81 rows(12-17)
4 1.04 rows(11-18)
5 1.25 rows(10-19)
6 1.45 rows(9-20)
7 1.65 rows(8-21)
8 1.8 rows(7-22)

Mixture prior not working in JAGS, only when likelihood term included

The code at the bottom will replicate the problem, just copy and paste it into R.
What I want is for the mean and precision to be (-100, 100) 30% of the time and (200, 1000) 70% of the time; think of the parameters as lined up in a, b, and p.
So 'pick' should be 1 30% of the time, and 2 70% of the time.
What actually happens is that on every iteration pick is 2 (or 1, if the first element of p is the larger one). You can see this in the summary, where the quantiles for 'pick', 'testa', and 'testb' remain unchanged throughout. The strangest thing is that if you remove the likelihood loop, pick works exactly as intended.
I hope this explains the problem; if not, let me know. It's my first time posting, so I'm bound to have messed something up.
library(rjags)
n = 10
y <- rnorm(n, 5, 10)
a = c(-100, 200)
b = c(100, 1000)
p = c(0.3, 0.7)
## Model
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu, 10)
}
# ISSUE HERE: MIXTURE PRIOR
mu ~ dnorm(a[pick], b[pick])
pick ~ dcat(p[1:2])
testa = a[pick]
testb = b[pick]
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'testa', 'testb', 'mu'), n.iter = 10000)
summary(res)
I think you are having problems for a couple of reasons. First, the data that you have supplied to the model (i.e., y) is not a mixture of normal distributions. As a result, the model itself has no need to mix. I would instead generate data something like this:
set.seed(320)
# number of samples
n <- 10
# Because it is a mixture of 2 we can just use an indicator variable.
# here, pick (in the long run), would be '1' 30% of the time.
pick <- rbinom(n, 1, p[1])
# generate the data. b is in terms of precision so we are converting this
# to standard deviations (which is what R wants).
y_det <- pick * rnorm(n, a[1], sqrt(1/b[1])) + (1 - pick) * rnorm(n, a[2], sqrt(1/b[2]))
# add a small amount of noise, can change to be more as necessary.
y <- rnorm(n, y_det, 1)
These data look more like what you would want to supply to a mixture model.
Following this, I would code the model up in a similar way as I did the data generation process. I want some indicator variable to jump between the two normal distributions. Thus, mu may change for each scalar in y.
mod_str = "model{
# Likelihood
for (i in 1:n){
y[i] ~ dnorm(mu[i], 10)
mu[i] <- mu_ind[i] * a_mu + (1 - mu_ind[i]) * b_mu
mu_ind[i] ~ dbern(p[1])
}
a_mu ~ dnorm(a[1], b[1])
b_mu ~ dnorm(a[2], b[2])
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('mu_ind', 'a_mu', 'b_mu'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
a_mu -100.4 -100.3 -100.2 -100.1 -100
b_mu 199.9 200.0 200.0 200.0 200
mu_ind[1] 0.0 0.0 0.0 0.0 0
mu_ind[2] 1.0 1.0 1.0 1.0 1
mu_ind[3] 0.0 0.0 0.0 0.0 0
mu_ind[4] 1.0 1.0 1.0 1.0 1
mu_ind[5] 0.0 0.0 0.0 0.0 0
mu_ind[6] 0.0 0.0 0.0 0.0 0
mu_ind[7] 1.0 1.0 1.0 1.0 1
mu_ind[8] 0.0 0.0 0.0 0.0 0
mu_ind[9] 0.0 0.0 0.0 0.0 0
mu_ind[10] 1.0 1.0 1.0 1.0 1
If you supplied more data, you would (in the long run) have the indicator variable mu_ind take the value of 1 30% of the time. If you had more than 2 distributions you could instead use dcat. Thus, an alternative and more generalized way of doing this would be (and I am borrowing heavily from this post by John Kruschke):
mod_str = "model {
# Likelihood:
for( i in 1 : n ) {
y[i] ~ dnorm( mu[i] , 10 )
mu[i] <- muOfpick[ pick[i] ]
pick[i] ~ dcat( p[1:2] )
}
# Prior:
for ( i in 1:2 ) {
muOfpick[i] ~ dnorm( a[i] , b[i] )
}
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'muOfpick'), n.iter = 10000)
summary(res)
2.5% 25% 50% 75% 97.5%
muOfpick[1] -100.4 -100.3 -100.2 -100.1 -100
muOfpick[2] 199.9 200.0 200.0 200.0 200
pick[1] 2.0 2.0 2.0 2.0 2
pick[2] 1.0 1.0 1.0 1.0 1
pick[3] 2.0 2.0 2.0 2.0 2
pick[4] 1.0 1.0 1.0 1.0 1
pick[5] 2.0 2.0 2.0 2.0 2
pick[6] 2.0 2.0 2.0 2.0 2
pick[7] 1.0 1.0 1.0 1.0 1
pick[8] 2.0 2.0 2.0 2.0 2
pick[9] 2.0 2.0 2.0 2.0 2
pick[10] 1.0 1.0 1.0 1.0 1
The link above includes even more priors (e.g., a Dirichlet prior on the probabilities incorporated into the Categorical distribution).
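As a sketch of that extension (hedged: this fragment is untested, and `alpha` is a hypothetical hyperparameter you would pass in as data, e.g. alpha = c(1, 1) for a flat prior), the fixed weight vector p would be replaced by a Dirichlet-distributed node:

```
model {
  # Likelihood:
  for( i in 1 : n ) {
    y[i] ~ dnorm( mu[i] , 10 )
    mu[i] <- muOfpick[ pick[i] ]
    pick[i] ~ dcat( p[1:2] )
  }
  # Priors:
  for ( i in 1:2 ) {
    muOfpick[i] ~ dnorm( a[i] , b[i] )
  }
  p[1:2] ~ ddirch( alpha[1:2] )   # mixture weights now estimated from the data
}
```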

How do I aggregate sub-dataframes in pandas?

Suppose I have a two-level multi-indexed dataframe:
In [1]: index = pd.MultiIndex.from_tuples([(i,j) for i in range(3)
: for j in range(1+i)], names=list('ij') )
: df = pd.DataFrame(0.1*np.arange(2*len(index)).reshape(-1,2),
: columns=list('xy'), index=index )
: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
: return subdf['x'].mean() / subdf['y'].mean()
:
: level0 = df.index.levels[0].values
: pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
: index=pd.Index(level0, name=index.names[0]) )
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
You need GroupBy.apply, which works with the whole sub-DataFrame:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function: you can calculate the within-group means with agg, then use eval to get the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64
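A third equivalent route (a small variation, not from either answer): take the group means once and divide the resulting columns directly, which avoids both the custom function and the eval string:

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame.
index = pd.MultiIndex.from_tuples(
    [(i, j) for i in range(3) for j in range(1 + i)], names=list("ij"))
df = pd.DataFrame(0.1 * np.arange(2 * len(index)).reshape(-1, 2),
                  columns=list("xy"), index=index)

# Per-group means of x and y, then a plain column division.
means = df.groupby("i").mean()
ratio = means["x"] / means["y"]   # Series indexed by i: 0.0, 0.75, 0.888889 (up to float rounding)
```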

How to use the abs function in an objective function of JuMP + Julia

I would like to solve a simple linear optimization problem with JuMP and Julia.
This is my code:
using JuMP
using Mosek
model = Model(solver=MosekSolver())
@variable(model, 2.5 <= z1 <= 5.0)
@variable(model, -1.0 <= z2 <= 1.0)
@objective(model, Min, abs(z1+5.0) + abs(z2-3.0))
status = solve(model)
println("Objective value: ", getobjectivevalue(model))
println("z1:",getvalue(z1))
println("z2:",getvalue(z2))
However, I got this error message.
> ERROR: LoadError: MethodError: no method matching
> abs(::JuMP.GenericAffExpr{Float64,JuMP.Variable}) Closest candidates
> are: abs(!Matched::Bool) at bool.jl:77 abs(!Matched::Float16) at
> float.jl:512 abs(!Matched::Float32) at float.jl:513
How can I use abs function in the JuMP code?
My problem was solved by @rickhg12hs's comment:
if I use @NLobjective instead of @objective, it works.
This is the final code.
using JuMP
using Mosek
model = Model(solver=MosekSolver())
@variable(model, 2.5 <= z1 <= 5.0)
@variable(model, -1.0 <= z2 <= 1.0)
@NLobjective(model, Min, abs(z1+5.0) + abs(z2-3.0))
status = solve(model)
println("Objective value: ", getobjectivevalue(model))
println("z1:",getvalue(z1))
println("z2:",getvalue(z2))
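As an aside (a standard LP reformulation, not something from the original question): because this objective is minimized, each absolute value can be replaced exactly by an auxiliary variable with two linear epigraph constraints, so a purely linear model also works:

```latex
\begin{aligned}
\min_{z_1, z_2, t_1, t_2} \quad & t_1 + t_2 \\
\text{s.t.} \quad & t_1 \ge z_1 + 5, \quad t_1 \ge -(z_1 + 5), \\
                  & t_2 \ge z_2 - 3, \quad t_2 \ge -(z_2 - 3), \\
                  & 2.5 \le z_1 \le 5, \quad -1 \le z_2 \le 1.
\end{aligned}
```

At an optimum, minimization pushes each t down onto the larger of its two linear pieces, so t1 = |z1 + 5| and t2 = |z2 - 3|.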
I did it in a different way:
AvgOperationtime = [1 2]#[2.0 2.0 2.0 3.3333333333333335 2.5 2.0 2.0 2.5 2.5 2.0 2.0]
Operationsnumberremovecounter = [1 0;1 1]#[1.0 1.0 1.0 1.0 -0.0 1.0 1.0 1.0 -0.0 1.0 1.0; 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 1.0 1.0 1.0 -0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0; 1.0 1.0 1.0 -0.0 1.0 1.0 1.0 -0.0 1.0 1.0 1.0]
Modelnumber = 2
Operationsnumber = 2
Basecaseworkload = 2
y = 0.1
Highestnumber = 999
Solver = GLPK.Optimizer
#Operationtime[1,1 X;0,9 2]
m = Model(with_optimizer(Solver));
@variable(m, Operationtime[1:Modelnumber,1:Operationsnumber]>=0);
@variable(m, Absoluttime[1:Modelnumber,1:Operationsnumber]>=0);
@variable(m, Absolutchoice[1:Modelnumber,1:Operationsnumber,1:2], Bin);
@objective(m, Max, sum(Absoluttime[M,O]*Operationsnumberremovecounter[M,O] for M=1:Modelnumber,O=1:Operationsnumber))
#How much Time can differ
@constraint(m, BorderOperationtime1[M=1:Modelnumber,O=1:Operationsnumber], AvgOperationtime[O]*(1-y) <= Operationtime[M,O]);
@constraint(m, BorderOperationtime2[M=1:Modelnumber,O=1:Operationsnumber], AvgOperationtime[O]*(1+y) >= Operationtime[M,O]);
#Workload
@constraint(m, Worklimit[O=1:Operationsnumber], sum(Operationtime[M,O]*Operationsnumberremovecounter[M,O] for M=1:Modelnumber) == Basecaseworkload);
#Absolut
@constraint(m, Absolutchoice1[M=1:Modelnumber,O=1:Operationsnumber], sum(Absolutchoice[M,O,X] for X=1:2) == 1);
@constraint(m, Absoluttime1[M=1:Modelnumber,O=1:Operationsnumber], Absoluttime[M,O] <= Operationtime[M,O]-AvgOperationtime[O]+Absolutchoice[M,O,1]*Highestnumber);
@constraint(m, Absoluttime2[M=1:Modelnumber,O=1:Operationsnumber], Absoluttime[M,O] <= AvgOperationtime[O]-Operationtime[M,O]+Absolutchoice[M,O,2]*Highestnumber);
optimize!(m);
println("Termination status: ", JuMP.termination_status(m));
println("Primal status: ", JuMP.primal_status(m));

Faster way to split a string and count characters using R?

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.
I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
chars = strsplit(as.character(line),"")[[1]]
numGC = 0
for(j in st:sp){
##nested ifs faster than an OR (|) construction
if(chars[[j]] == "g"){
numGC <- numGC + 1
}else if(chars[[j]] == "G"){
numGC <- numGC + 1
}else if(chars[[j]] == "c"){
numGC <- numGC + 1
}else if(chars[[j]] == "C"){
numGC <- numGC + 1
}
}
return(numGC)
}
Running Rprof gives me the following output:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
$by.self
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
Any advice for making this code faster?
Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp){
sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.
A one liner:
table(strsplit(toupper(a), '')[[1]])
I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).
GC content would be:
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
That's from a string, you can also read in a fasta file using "read.fasta()".
There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp){
chars = strsplit(as.character(line),"")[[1]][st:sp]
length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
or you can use the regex version to count g and G:
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
or you can use the tolower function first and then stri_count:
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
Time performance:
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492 100
gcCount2(x, 1, 40) 15.010 16.51 18.312 19.213 40.826 100
stri_count_regex(x, c("[GgCc]")) 15.610 16.51 18.912 20.112 61.239 100
Another example, for a longer string; stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"
As you can see, for longer sequences stri_count is faster :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828 100
gcCount2(y, 1, 40 * 100) 360.225 369.5315 383.6400 399.100 438.274 100
stri_count_regex(y, c("[GgCc]")) 131.483 137.9370 151.8955 176.511 221.839 100
Thanks to all for this post.
To optimize a script in which I want to calculate the GC content of 100M sequences of 200bp each, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours), and using stringr's str_count reduced that to 1.5 hours.
In the end I coded it in C++ and called it from R using Rcpp, which cut the computation time down to 10 minutes!
Here is the C++ code:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
float pGC_cpp(std::string s) {
int count = 0;
for (int i = 0; i < s.size(); i++)
if (s[i] == 'G') count++;
else if (s[i] == 'C') count++;
float pGC = (float)count / s.size();
pGC = pGC * 100;
return pGC;
}
Which I call from R typing:
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")