Why is MIP best bound infinite for this problem?

I have the following MIP problem. Upper bound for pre_6_0 should not be infinite because it is calculated from inp1, inp2, inp3, and inp4, all of which are bounded on both sides.
Maximize
obj: pre_6_0
Subject To
c1: inp0 >= -84
c2: inp0 <= 174
c3: inp1 >= -128
c4: inp1 <= 128
c5: inp2 >= -128
c6: inp2 <= 128
c7: inp3 >= -128
c8: inp3 <= 128
c9: inp4 >= -128
c10: inp4 <= 128
c11: pre_6_0 + 0.03125 inp1 - 0.0078125 inp2 - 0.00390625 inp3
+ 0.00390625 inp4 = -2.5
c12: - 0.0078125 inp0 + pre_6_1 = -2.5
c13: - 0.00390625 inp0 - 0.01171875 inp3 + pre_6_2 = 6.5
c14: - 0.0078125 inp0 + pre_6_3 = -1.5
c15: - 0.00390625 inp0 - 0.0078125 inp3 + pre_6_4 = 6.5
Bounds
pre_6_0 Free
inp0 Free
inp1 Free
inp2 Free
inp3 Free
inp4 Free
pre_6_1 Free
pre_6_2 Free
pre_6_3 Free
pre_6_4 Free
Generals
pre_6_0 inp0 inp1 inp2 inp3 inp4 pre_6_1 pre_6_2 pre_6_3 pre_6_4

The MIP best bound is infinite because no feasible integer solution exists.
Indeed, all the variables in your ILP are restricted to general integer values (the Generals section), and no integer assignment satisfies all of the equality constraints.
Here is an example of using GLPK to solve the ILP:
15 rows, 10 columns, 25 non-zeros
10 integer variables, none of which are binary
...
Solving LP relaxation...
GLPK Simplex Optimizer, v4.65
5 rows, 10 columns, 15 non-zeros
0: obj = -8.000000000e+00 inf = 1.631e+01 (5)
5: obj = -3.750000000e-01 inf = 0.000e+00 (0)
* 8: obj = 3.000000000e+00 inf = 0.000e+00 (0)
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
Long-step dual simplex will be used
+ 8: mip = not found yet <= +inf (1; 0)
+ 8: mip = not found yet <= tree is empty (0; 3)
PROBLEM HAS NO INTEGER FEASIBLE SOLUTION
Time used: 0.0 secs
Memory used: 0.1 Mb (63069 bytes)
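The infeasibility can also be confirmed without a solver. Constraints c12–c15 involve only inp0 and inp3, so a brute-force scan over their bounded integer ranges shows the four pre_6_* variables can never all be integral (a quick sketch using exact rational arithmetic, with the 1/128 and 1/256 coefficients written as Fractions):

```python
from fractions import Fraction as F

# c12-c15 solved for the pre_6_* variables:
#   pre_6_1 = -2.5 + inp0/128
#   pre_6_2 =  6.5 + inp0/256 + 3*inp3/256
#   pre_6_3 = -1.5 + inp0/128
#   pre_6_4 =  6.5 + inp0/256 + inp3/128
sols = []
for inp0 in range(-84, 175):          # bounds from c1, c2
    for inp3 in range(-128, 129):     # bounds from c7, c8
        pre = (F(-5, 2) + F(inp0, 128),
               F(13, 2) + F(inp0, 256) + F(3 * inp3, 256),
               F(-3, 2) + F(inp0, 128),
               F(13, 2) + F(inp0, 256) + F(inp3, 128))
        if all(v.denominator == 1 for v in pre):
            sols.append((inp0, inp3))

print(len(sols))  # 0: no integer assignment satisfies c12-c15
```

The scan finds nothing, and the reason is visible by hand: subtracting c15 from c13 forces inp3/256 to be an integer, hence inp3 = 0; c13 then needs inp0 ≡ 128 (mod 256), while c12 needs inp0 to be an odd multiple of 64, which is impossible.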


Two Loop Network on Gams

Regarding the Two Loop Network formulation in GAMS, I'm struggling with one of the equations.
I can solve the problem without the energy conservation loop constraint, but once I add it the problem becomes infeasible. I'm also not sure whether the two loops are well defined.
I would appreciate it if someone could spot my error.
Thank you.
Set
n 'nodes' / 1, 2, 3, 4, 5, 6, 7 /
a(n,n) 'arcs/pipes arbitrarily directed'
/1.2, 4.(2,5,6), 3.(2,5), 7.(5,6)/
rn(n) 'reservoir' / 1 /
dn(n) 'demand nodes' /2, 3, 4, 5, 6, 7/
m 'number of loops' /c1, c2/
c1(n,n) 'loop 1'
/2.(3,4), 5.(3,4)/
c2(n,n) 'loop 2'
/4.(5,6), 7.(5,6)/
k 'Options available for the diameters'
/ k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13, k14 /;
dn(n) = yes;
dn(rn) = no;
display a;
display dn;
display rn;
display m;
display c1;
display c2;
Alias(n,np);
Table node(n,*) 'node data'
         demand    elevation  minhead
*        m^3/sec   m          m
    1              210        30
    2    0.0278    150        30
    3    0.0278    160        30
    4    0.0333    155        30
    5    0.0750    150        30
    6    0.0917    165        30
    7    0.0444    200        30 ;
display node;
Table Diam(k,*) 'Diameter and cost information'
         Diameter  Cost
*        m         $/m
   k1    0.0254      2
   k2    0.0508      5
   k3    0.0762      8
   k4    0.1016     11
   k5    0.1524     16
   k6    0.2032     23
   k7    0.2540     32
   k8    0.3048     50
   k9    0.3556     60
   k10   0.4064     90
   k11   0.4572    130
   k12   0.5080    170
   k13   0.5588    300
   k14   0.6096    550 ;
Scalar
    length    'pipe length (m)' /1000/
    roughcoef 'roughness coefficient for every pipe' /130/
    Vmin      'Minimum velocity (m/s)' /0.3/
    Vmax      'Maximum velocity (m/s)' /3.0/
    dmin      'minimum diameter of pipe' /0.0254/
    dmax      'maximum diameter of pipe' /0.6096/
    davg      'Diameter average for starting point';
davg = sqrt(dmin*dmax);
Variable
x(n,n) 'absolute flow through each arc'
y(n,n,k) 'takes value 1 when a pipe in arc(n,n) has diameter e(k) and 0 otherwise'
t(n,n) 'auxiliary variable for modeling the flow going in the forward direction'
r(n,n) 'auxiliary variable for modeling the flow going in the reverse direction'
u(n) 'variable representing the head of node n'
d(n,n) 'representing the diameter of pipe in link (n,n), takes the same value as some e(n)'
v(n,n) 'Water velocity'
q(n,n) 'real variable representing the flow direction by its sign: 1 if flow goes forward, -1 if in reverse, for a link (n,n)'
Custo 'total cost';
Binary Variable y, t, r;
NonNegative Variable x, d;
Equation
UniPipeDiam(n,n) 'Unique Pipe Diameter'
PipeDiam(n,n) 'Pipe Diameter Real Value'
FlowDirection(n,n) 'Flow Direction'
FlowSign(n,n) 'Flow Sign'
FlowConservation(n) 'Flow Conservation at Each Node'
HeadLoss(n,n) 'Head Loss'
EnerConserLoop1(n,np) 'Energy Conservation in Loop 1'
EnerConserLoop2(n,np) 'Energy Conservation in Loop 2'
Objective 'Objective Function: Capital Cost'
Velocity(n,n) 'Velocity calculation'
VelocUp(n,np) 'Upper bound velocity'
VelocDown(n,np) 'Lower bound velocity';
UniPipeDiam(a).. sum(k, y(a,k)) =e= 1;
PipeDiam(a(n,np)).. d(n,np) =e= sum(k, Diam(k,'Diameter')*y(n,np,k));
FlowDirection(a(n,np)).. t(a) + r(a) =e= 1;
FlowSign(a(n,np)).. q(a) =e= t(a) - r(a);
FlowConservation(dn(n)).. sum(a(np,n), x(a)*q(a)) - sum(a(n,np), x(a)*q(a)) =e= node(n,"demand");
HeadLoss(a(n,np)).. u(n) - u(np) =e= [10.667]*[roughcoef**(-1.852)]*length*[d(a)**(-4.8704)]*[x(a)**(2)]*q(a);
Velocity(a(n,np)).. v(a) =e= (4.0*x(a))/(Pi*d(a)**2.0);
VelocUp(a).. v(a) =l= Vmax;
VelocDown(a).. v(a) =g= Vmin;
EnerConserLoop1(n,np).. sum(a(n,np)$c1(n,np), q(a) * (u(n) - u(np))) =e= 0;
EnerConserLoop2(n,np).. sum(a(n,np)$c2(n,np),q(a) * (u(n) - u(np))) =e= 0;
Objective.. Custo =e= sum(a(n,np), sum(k, length*Diam(k,'Cost')*y(n,np,k)));
*bounds
d.lo(n,np)$a(n,np) = dmin;
d.up(n,np)$a(n,np) = dmax;
u.lo(rn) = node(rn,"elevation");
u.lo(dn) = node(dn,"elevation") + 5.0 + 5.0*node(dn,"demand");
u.up(dn) = 300.0;
* initial values
d.l(n,np)$a(n,np) = davg;
u.l(n) = u.lo(n) + 5.0;
Model network / all /;
network.domLim = 1000;
Option Iterlim = 50000;
option MINLP = baron;
solve network using minlp minimizing Custo;
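For intuition about the loop constraints, the head-loss expression and the energy-conservation condition can be checked outside GAMS (a hypothetical Python sketch with made-up flow data; head_loss mirrors the constants in the HeadLoss equation):

```python
# Head loss along one pipe, mirroring the GAMS HeadLoss equation:
#   u(n) - u(np) = 10.667 * roughcoef**(-1.852) * length * d**(-4.8704) * x**2 * q
def head_loss(x, d, q, length=1000.0, roughcoef=130.0):
    return 10.667 * roughcoef ** -1.852 * length * d ** -4.8704 * x ** 2 * q

# Energy conservation around a loop: the signed head losses must sum to zero.
# Two identical pipes traversed in opposite directions (q = +1 / -1) cancel:
loop = [(0.05, 0.3048, 1), (0.05, 0.3048, -1)]
total = sum(head_loss(x, d, q) for x, d, q in loop)
print(abs(total) < 1e-9)  # True for this symmetric toy loop
```

As an aside, the textbook Hazen-Williams head loss uses a flow exponent of 1.852 rather than the 2 in HeadLoss; whether that is intended here is worth double-checking.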

RuntimeWarning: invalid value encountered

I'm trying to make my Philips Hue lights change colors based on the frequency (Hz) of a song as it plays, but I hit a RuntimeWarning and can't figure out what's going on. I'd highly appreciate it if anyone could help me out here :)
import time
import wave
from itertools import cycle

import numpy as np
import pyaudio

# change_light_color, room, color_list and color_list_2 are defined elsewhere in the script
chunk = 1024
wf = wave.open('visualize.wav', 'rb')
swidth = wf.getsampwidth()
RATE = wf.getframerate()
window = np.blackman(chunk)
p = pyaudio.PyAudio()
channels = wf.getnchannels()
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                channels=channels,
                rate=RATE,
                output=True)
data = wf.readframes(chunk)
print('switdth {} chunk {} data {} ch {}'.format(swidth, chunk, len(data), channels))
while len(data) == chunk * swidth * channels:
    stream.write(data)
    indata = np.fromstring(data, dtype='int16')
    channel0 = indata[0::channels]
    fftData = abs(np.fft.rfft(indata)) ** 2
    which = fftData[1:].argmax() + 1
    if which != len(fftData) - 1:
        y0, y1, y2 = np.log(fftData[which - 1:which + 2:])
        x1 = (y2 - y0) * .5 / (2 * y1 - y2 - y0)
        thefreq = (which + x1) * RATE / chunk
        print("The freq is %f Hz." % (thefreq))
    elif thefreq > 4000:
        for i in cycle(color_list):
            change_light_color(room, *color_list[i])
            time.sleep(0.5)
    else:
        if thefreq < 4000:
            for i in cycle(color_list_2):
                change_light_color(room, *color_list_2[i])
                time.sleep(0.5)
    data = wf.readframes(chunk)
if data:
    stream.write(data)
stream.close()
p.terminate()
This is what I end up with:
Sem#Sems-MacBook-Pro hue_visualizer % /usr/local/bin/python3 /Users/Sem/Desktop/hue_visualizer/visualize.py
switdth 2 chunk 1024 data 4096 ch 2
/Users/Sem/Desktop/hue_visualizer/visualize.py:69: DeprecationWarning: The binary mode of fromstring is deprecated, as it behaves surprisingly on unicode inputs. Use frombuffer instead
indata = np.fromstring(data, dtype='int16')
/Users/Sem/Desktop/hue_visualizer/visualize.py:74: RuntimeWarning: divide by zero encountered in log
y0,y1,y2 = np.log(fftData[which-1:which+2:])
/Users/Sem/Desktop/hue_visualizer/visualize.py:75: RuntimeWarning: invalid value encountered in double_scalars
x1 = (y2 - y0) * .5 / (2 * y1 - y2 - y0)
The freq is nan Hz.
The freq is nan Hz.
The freq is nan Hz.
The freq is nan Hz.
The freq is nan Hz.
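What the warnings are saying can be reproduced in a few lines, independent of the audio code (a minimal sketch): when an FFT bin adjacent to the peak is exactly zero, np.log yields -inf, and the parabolic-interpolation step then divides inf by inf, producing nan.

```python
import numpy as np

# Three consecutive power-spectrum bins around a peak; the first one is zero,
# as happens for silent stretches of audio.
fftData = np.array([0.0, 9.0, 4.0])

with np.errstate(divide='ignore', invalid='ignore'):  # same math, warnings suppressed
    y0, y1, y2 = np.log(fftData)                      # log(0) -> -inf
    x1 = (y2 - y0) * .5 / (2 * y1 - y2 - y0)          # inf / inf -> nan

print(np.isnan(x1))  # True: hence "The freq is nan Hz."
```

Guarding the bins (for example, adding a tiny epsilon before taking the log, or skipping near-silent frames) avoids the nan; switching np.fromstring to np.frombuffer silences the separate deprecation warning.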

go benchmark and gc: B/op alloc/op

benchmark code:
func BenchmarkSth(b *testing.B) {
var x []int
b.ResetTimer()
for i := 0; i < b.N; i++ {
x = append(x, i)
}
}
result:
BenchmarkSth-4 50000000 20.7 ns/op 40 B/op 0 allocs/op
Questions:
Where did the 40 B/op come from? (Any way of tracing this, with instructions, would be greatly appreciated.)
How is it possible to have 40 B/op while having 0 allocs/op?
Which one affects GC, and how? (B/op or allocs/op)
Is it really possible to have 0 B/op when using append?
The Go Programming Language Specification
Appending to and copying slices
The variadic function append appends zero or more values x to s of
type S, which must be a slice type, and returns the resulting slice,
also of type S.
append(s S, x ...T) S // T is the element type of S
If the capacity of s is not large enough to fit the additional values,
append allocates a new, sufficiently large underlying array that fits
both the existing slice elements and the additional values. Otherwise,
append re-uses the underlying array.
For your example, on average, [40, 41) bytes per operation are allocated to grow the slice when necessary. The capacity is increased using an amortized-constant-time scheme: while the length is below 1024 the capacity doubles, and after that it grows by a factor of roughly 1.25. On average there are [0, 1) allocations per operation.
For example,
func BenchmarkMem(b *testing.B) {
b.ReportAllocs()
var x []int64
var a, ac int64
b.ResetTimer()
for i := 0; i < b.N; i++ {
c := cap(x)
x = append(x, int64(i))
if cap(x) != c {
a++
ac += int64(cap(x))
}
}
b.StopTimer()
sizeInt64 := int64(8)
B := ac * sizeInt64 // bytes
b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkMem-4 50000000 26.6 ns/op 40 B/op 0 allocs/op
--- BENCH: BenchmarkMem-4
bench_test.go:32: op 1 B 8 alloc 1 lx 1 cx 1
bench_test.go:32: op 100 B 2040 alloc 8 lx 100 cx 128
bench_test.go:32: op 10000 B 386296 alloc 20 lx 10000 cx 12288
bench_test.go:32: op 1000000 B 45188344 alloc 40 lx 1000000 cx 1136640
bench_test.go:32: op 50000000 B 2021098744 alloc 57 lx 50000000 cx 50539520
For op = 50000000,
B/op = floor(2021098744 / 50000000) = floor(40.421974888) = 40
allocs/op = floor(57 / 50000000) = floor(0.00000114) = 0
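The growth schedule behind these numbers can be sketched with a simplified model (an illustrative approximation only: double below 1024 elements, then grow by about 1.25x; the real runtime also rounds allocations up to size classes, so the exact figures differ slightly):

```python
# Simplified model of Go's slice growth: double the capacity below 1024
# elements, then grow by roughly 1.25x. Counts bytes allocated for the
# 8-byte elements used in the benchmark.
def simulate(n, elem_size=8):
    length = cap = total_bytes = allocs = 0
    for _ in range(n):
        if length == cap:                     # no spare capacity: reallocate
            cap = max(1, 2 * cap) if cap < 1024 else cap + cap // 4
            total_bytes += cap * elem_size
            allocs += 1
        length += 1
    return total_bytes / n, allocs

bpo, allocs = simulate(1_000_000)
print(bpo, allocs)  # tens of bytes per op from only a few dozen reallocations
```

For n = 1000000 this lands in the same ballpark as the logged line "op 1000000 B 45188344 alloc 40": a few dozen reallocations whose cost, averaged over a million appends, comes to a few tens of bytes per operation.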
Read:
Go Slices: usage and internals
Arrays, slices (and strings): The mechanics of 'append'
'append' complexity
To have zero B/op (and zero allocs/op) for append, allocate a slice with sufficient capacity before appending.
For example, with var x = make([]int64, 0, b.N),
func BenchmarkZero(b *testing.B) {
b.ReportAllocs()
var x = make([]int64, 0, b.N)
var a, ac int64
b.ResetTimer()
for i := 0; i < b.N; i++ {
c := cap(x)
x = append(x, int64(i))
if cap(x) != c {
a++
ac += int64(cap(x))
}
}
b.StopTimer()
sizeInt64 := int64(8)
B := ac * sizeInt64 // bytes
b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkZero-4 100000000 11.7 ns/op 0 B/op 0 allocs/op
--- BENCH: BenchmarkZero-4
bench_test.go:51: op 1 B 0 alloc 0 lx 1 cx 1
bench_test.go:51: op 100 B 0 alloc 0 lx 100 cx 100
bench_test.go:51: op 10000 B 0 alloc 0 lx 10000 cx 10000
bench_test.go:51: op 1000000 B 0 alloc 0 lx 1000000 cx 1000000
bench_test.go:51: op 100000000 B 0 alloc 0 lx 100000000 cx 100000000
Note the reduction in benchmark CPU time from around 26.6 ns/op to around 11.7 ns/op.

Fastest way to create a sparse matrix of the form A.T * diag(b) * A + C?

I'm trying to optimize a piece of code that solves a large sparse nonlinear system using an interior point method. During the update step, this involves computing the Hessian matrix H, the gradient g, then solving for d in H * d = -g to get the new search direction.
The Hessian matrix has a symmetric tridiagonal structure of the form:
A.T * diag(b) * A + C
I've run line_profiler on the particular function in question:
Line # Hits Time Per Hit % Time Line Contents
==================================================
386 def _direction(n, res, M, Hsig, scale_var, grad_lnprior, z, fac):
387
388 # gradient
389 44 1241715 28220.8 3.7 g = 2 * scale_var * res - grad_lnprior + z * np.dot(M.T, 1. / n)
390
391 # hessian
392 44 3103117 70525.4 9.3 N = sparse.diags(1. / n ** 2, 0, format=FMT, dtype=DTYPE)
393 44 18814307 427597.9 56.2 H = - Hsig - z * np.dot(M.T, np.dot(N, M)) # slow!
394
395 # update direction
396 44 10329556 234762.6 30.8 d, fac = my_solver(H, -g, fac)
397
398 44 111 2.5 0.0 return d, fac
Looking at the output it's clear that constructing H is by far the most costly step - it takes considerably longer than actually solving for the new direction.
Hsig and M are both CSC sparse matrices, n is a dense vector and z is a scalar. The solver I'm using requires H to be either a CSC or CSR sparse matrix.
Here's a function that produces some toy data with the same formats, dimensions and sparseness as my real matrices:
import numpy as np
from scipy import sparse
def make_toy_data(nt=200000, nc=10):
    d0 = np.random.randn(nc * (nt - 1))
    d1 = np.random.randn(nc * (nt - 1))
    M = sparse.diags((d0, d1), (0, nc), shape=(nc * (nt - 1), nc * nt),
                     format='csc', dtype=np.float64)
    d0 = np.random.randn(nc * nt)
    Hsig = sparse.diags(d0, 0, shape=(nc * nt, nc * nt), format='csc',
                        dtype=np.float64)
    n = np.random.randn(nc * (nt - 1))
    z = np.random.randn()
    return Hsig, M, n, z
And here's my original approach for constructing H:
def original(Hsig, M, n, z):
    N = sparse.diags(1. / n ** 2, 0, format='csc')
    H = - Hsig - z * np.dot(M.T, np.dot(N, M))  # slow!
    return H
Timing:
%timeit original(Hsig, M, n, z)
# 1 loops, best of 3: 483 ms per loop
Is there a faster way to construct this matrix?
I get close to a 4x speed-up in computing the product M.T * D * M out of the three diagonal arrays. If d0 and d1 are the main and upper diagonal of M, and d is the main diagonal of D, then the following code creates M.T * D * M directly:
def make_tridi_bis(d0, d1, d, nc=10):
    d00 = d0 * d0 * d
    d11 = d1 * d1 * d
    d01 = d0 * d1 * d
    len_ = d0.size
    data = np.empty((3 * len_ + nc,))
    indices = np.empty((3 * len_ + nc,), dtype=np.intp)
    # Fill main diagonal
    data[:2*nc:2] = d00[:nc]
    indices[:2*nc:2] = np.arange(nc)
    data[2*nc+1:-2*nc:3] = d00[nc:] + d11[:-nc]
    indices[2*nc+1:-2*nc:3] = np.arange(nc, len_)
    data[-2*nc+1::2] = d11[-nc:]
    indices[-2*nc+1::2] = np.arange(len_, len_ + nc)
    # Fill top diagonal
    data[1:2*nc:2] = d01[:nc]
    indices[1:2*nc:2] = np.arange(nc, 2*nc)
    data[2*nc+2:-2*nc:3] = d01[nc:]
    indices[2*nc+2:-2*nc:3] = np.arange(2*nc, len_ + nc)
    # Fill bottom diagonal
    data[2*nc:-2*nc:3] = d01[:-nc]
    indices[2*nc:-2*nc:3] = np.arange(len_ - nc)
    data[-2*nc::2] = d01[-nc:]
    indices[-2*nc::2] = np.arange(len_ - nc, len_)
    indptr = np.empty((len_ + nc + 1,), dtype=np.intp)
    indptr[0] = 0
    indptr[1:nc+1] = 2
    indptr[nc+1:len_+1] = 3
    indptr[-nc:] = 2
    np.cumsum(indptr, out=indptr)
    return sparse.csr_matrix((data, indices, indptr), shape=(len_ + nc, len_ + nc))
If your matrix M is in CSR format, you can extract d0 and d1 as d0 = M.data[::2] and d1 = M.data[1::2]. I modified your toy-data routine to return those arrays as well, and here's what I get:
In [90]: np.allclose((M.T * sparse.diags(d, 0) * M).A, make_tridi_bis(d0, d1, d).A)
Out[90]: True
In [92]: %timeit make_tridi_bis(d0, d1, d)
10 loops, best of 3: 124 ms per loop
In [93]: %timeit M.T * sparse.diags(d, 0) * M
1 loops, best of 3: 501 ms per loop
The whole purpose of the above code is to take advantage of the structure of the non-zero entries. If you draw a diagram of the matrices you are multiplying together, it is relatively easy to convince yourself that the main (d_0) and top and bottom (d_1) diagonals of the resulting tridiagonal matrix are simply:
d_0 = np.zeros((len_ + nc,))
d_0[:len_] = d00
d_0[-len_:] += d11
d_1 = d01
The rest of the code in that function is simply building the tridiagonal matrix directly, as calling sparse.diags with the above data is several times slower.
I tried running your test case and had problems with np.dot(N, M). I didn't dig into it, but I think my numpy/scipy combination (both fairly new versions) has trouble using np.dot on sparse matrices.
But H = -Hsig - z*M.T.dot(N.dot(M)) runs just fine; this uses the sparse matrices' own dot method.
I haven't run a profile, but here are IPython timings for several parts. Generating the data takes longer than the double dot product itself.
In [37]: timeit Hsig,M,n,z=make_toy_data()
1 loops, best of 3: 2 s per loop
In [38]: timeit N = sparse.diags(1. / n ** 2, 0, format='csc')
1 loops, best of 3: 377 ms per loop
In [39]: timeit H = -Hsig - z*M.T.dot(N.dot(M))
1 loops, best of 3: 1.55 s per loop
H is a
<2000000x2000000 sparse matrix of type '<type 'numpy.float64'>'
with 5999980 stored elements in Compressed Sparse Column format>
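As an aside (a small sketch, not taken from the answers above): the diagonal factor never has to be materialized with sparse.diags at all, since multiplying by diag(d) is just a column scaling, which scipy sparse matrices support via broadcasting in multiply:

```python
import numpy as np
from scipy import sparse

# Sanity check on a small random case: M.T @ diag(d) @ M equals
# (M.T with its columns scaled by d) @ M -- no diagonal matrix needed.
M = sparse.random(6, 4, density=0.5, format='csc', random_state=0)
d = np.random.default_rng(0).random(6)

ref = (M.T @ sparse.diags(d) @ M).toarray()
fast = (M.T.multiply(d) @ M).toarray()  # multiply broadcasts d across columns
print(np.allclose(ref, fast))  # True
```

Whether this beats the hand-built CSR construction above depends on the matrix sizes; it mainly avoids the cost of assembling N.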

Faster way to split a string and count characters using R?

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.
I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
    chars = strsplit(as.character(line), "")[[1]]
    numGC = 0
    for(j in st:sp){
        ## nested ifs faster than an OR (|) construction
        if(chars[[j]] == "g"){
            numGC <- numGC + 1
        }else if(chars[[j]] == "G"){
            numGC <- numGC + 1
        }else if(chars[[j]] == "c"){
            numGC <- numGC + 1
        }else if(chars[[j]] == "C"){
            numGC <- numGC + 1
        }
    }
    return(numGC)
}
Running Rprof gives me the following output:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
Any advice for making this code faster?
Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp){
    sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.
A one liner:
table(strsplit(toupper(a), '')[[1]])
I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).
GC content would be:
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
That's from a string, you can also read in a fasta file using "read.fasta()".
There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp){
    chars = strsplit(as.character(line), "")[[1]][st:sp]
    length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
or you can use the regex version to count G and g:
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
or you can apply a lowercase transform first and then stri_count:
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
Timing comparison:
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492 100
gcCount2(x, 1, 40) 15.010 16.51 18.312 19.213 40.826 100
stri_count_regex(x, c("[GgCc]")) 15.610 16.51 18.912 20.112 61.239 100
Another example, for a longer string; stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"
As you can see, for longer sequences stri_count is faster :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828 100
gcCount2(y, 1, 40 * 100) 360.225 369.5315 383.6400 399.100 438.274 100
stri_count_regex(y, c("[GgCc]")) 131.483 137.9370 151.8955 176.511 221.839 100
Thanks to all for this post.
To optimize a script in which I calculate the GC content of 100M sequences of 200 bp, I ended up testing the different methods proposed here. Ken Williams' method performed best of those (2.5 hours), better than seqinr (3.6 hours), and using stringr's str_count reduced the time to 1.5 hours.
In the end I coded it in C++ and called it from R using Rcpp, which cut the computation time down to 10 minutes!
Here is the C++ code:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
float pGC_cpp(std::string s) {
    int count = 0;
    for (int i = 0; i < s.size(); i++) {
        if (s[i] == 'G') count++;
        else if (s[i] == 'C') count++;
    }
    float pGC = (float)count / s.size();
    pGC = pGC * 100;
    return pGC;
}
I call it from R by typing:
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")