I created a typed memoryview in cython and would like to multiply it by a scalar:
import numpy as np
import math
cimport numpy as np

def foo():
    N = 10
    cdef np.double_t[:, :] A = np.ones(shape=(N, N), dtype=np.double)
    cdef int i, j
    cdef double pi = math.pi
    for i in range(N):
        for j in range(N):
            A[i, j] *= pi
    return A

def bar():
    N = 10
    cdef np.double_t[:, :] A = np.ones(shape=(N, N), dtype=np.double)
    cdef double pi = math.pi
    A *= pi
    return A
Function foo() does the job, but it is not very convenient/readable.
The line A *= pi in function bar(), however, does not compile: Invalid operand types for '*' (double_t[:, :]; double).
Is there a way to perform such a broadcasting operation on a cython memoryview?
No, memoryviews don't do this. A memoryview is literally just a way to access individual elements of an array quickly. It has no real concept of the mathematical operations that can be performed on the array.
In the case of your bar function, any attempt to type it is probably going to make things worse (i.e. it'll spend extra time checking the type, but ultimately the work is done in ordinary calls to Numpy functions).
There's a number of (not 100% satisfactory) ways of getting a Numpy array from a memoryview:
np.asarray(memview) - this should be done without copying (provided you aren't using the esoteric indirect memory layout). It might be worth adding an assertion to check that no copy was made, though (see the sketch after this list).
memview.base - be slightly careful with this. If the memoryview is a result of slicing then .base will be the original unsliced object.
Keep a parallel numpy array and memoryview variable:
Anp = np.array(...)
cdef double[:] Amview = Anp
because the memoryview is a view of some memory, modifications to the array will be reflected in the memoryview and vice-versa. (Reassigning the array variable, e.g. Anp = something_else, won't be reflected though).
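For illustration, a minimal sketch of the first and third options applied to the bar problem (the function names and the no-copy assertion are my own additions, not a canonical recipe):

import numpy as np
cimport numpy as np

def bar_asarray():
    cdef np.double_t[:, :] A = np.ones(shape=(10, 10), dtype=np.double)
    # Wrap the memoryview in an ndarray; for a contiguous view this
    # should not copy, which the assertion below tries to confirm.
    A_arr = np.asarray(A)
    assert A_arr.base is not None
    A_arr *= np.pi  # ordinary Numpy broadcasting, shares memory with A
    return A_arr

def bar_parallel():
    # Keep a Numpy array and a memoryview of the same memory.
    Anp = np.ones(shape=(10, 10), dtype=np.double)
    cdef double[:, :] Amview = Anp
    Anp *= np.pi  # broadcast on the array; Amview sees the change
    return Anp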
In summary, memoryviews are designed for one main job: being able to access individual elements quickly. If that's not what you're doing then you probably don't want to use a memoryview.
I am currently solving an integrated system of 559 nonlinear differential equations. I have to fit the solutions obtained to some experimental data by varying the constants c1, c2, b and g.
I am using scipy.odeint and I would like to know if there is a way to make my program faster, as it takes ages to run.
The code is this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
import random as rd
from numba import jit

L = np.loadtxt('C:/Users/Pablo/Desktop/TFG/Probas/matriz_L_Pablo.txt')
I = np.loadtxt('C:/Users/Pablo/Desktop/TFG/Probas/vector_I_Pablo.txt')

k = np.diag(L)
n = len(k)  # count the number of nodes
u = np.zeros(n)
for i in range(n):
    u[i] = rd.random()

M = np.zeros((n, n))
derivs = np.zeros(n)
c1 = 100; c2 = 10000; b = 0.01; g = 1

#jit
def f(y, t, params):
    suma = 0
    c1, c2, b, g = params
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, i] = (1 - y[i]/b) + g*(1 - y[i]) + c2*I[i]*(1/n - 1)
            if i != j:
                M[i, j] = (1/n)*(c1*L[i, j] + c2*I[i])
            out = (M[i, j]*y[j])
            suma = suma + out
        derivs[i] = suma
        suma = 0
    return derivs

# initial conditions
y0 = u
# list with the parameters
params = [c1, c2, b, g]
# integration times
tf = 1
deltat = 0.001
t = np.arange(0, tf, deltat)
# solution
sol = odeint(f, y0, t, args=(params,))
(Sorry if it is not very clear it's my first time here)
You can try vectorizing your code. The function f does two things: first it creates the matrix M, and then it computes the matrix-vector product My. The product My is easy to vectorize, because all we have to do is use numpy's matmul function.
def f(y, t, params):
    c1, c2, b, g = params
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, i] = (1 - y[i]/b) + g*(1 - y[i]) + c2*I[i]*(1/n - 1)
            if i != j:
                M[i, j] = (1/n)*(c1*L[i, j] + c2*I[i])
    return np.matmul(M, y)
That should help with runtime a bit. But the most time-consuming part is the fact that the entire matrix M is formed every time f is called, and that it is formed one element at a time.
The only parts of M that need to be modified when calling f are the parts which depend on y. So all of the off-diagonal entries in M can be filled before the ode solver is called. With M being 559x559, instead of having to calculate all ~312,000 elements of M every time f is called, you would only have to calculate the 559 elements of M on the diagonal. The remaining entries of M don't depend on y, and are specified before calling the ode solver. Making this modification should result in a huge speedup, as this seems to be the main bottleneck in your code.
Lastly, you could also vectorize how the diagonal of M is filled by using something like numpy.diag_indices.
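A rough sketch of both ideas combined (M0 and di are names I have introduced; L, I, n, c1, c2, b, g are as in the question):

# Off-diagonal part of M: independent of y, so build it once, vectorized.
M0 = (1/n)*(c1*L + c2*I[:, np.newaxis])  # I[i] is constant along row i
di = np.diag_indices(n)

def f(y, t, params):
    c1, c2, b, g = params
    # Only the n diagonal entries depend on y; overwrite them in place.
    M0[di] = (1 - y/b) + g*(1 - y) + c2*I*(1/n - 1)
    return np.matmul(M0, y)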
After having read a lot of documentation on numpy / cython I am still unable to create a numpy array from a pointer in cython. The situation is as follows. I have a cython (*.pyx) file containing a callback function:
cimport numpy

cdef void func_eval(double* values,
                    int values_len,
                    void* func_data):
    func = (<object> func_data)
    # values: contiguous array of length=values_len
    array = ???
    # array should be a (modifiable) numpy array containing the
    # values as its data. No copying, no freeing the data by numpy.
    func.eval(array)
Most tutorials and guides consider the problem of turning an array into a pointer, but I am interested in the opposite.
I have seen one solution here based on pure python using the ctypes library (not what I am interested in). Cython itself talks about typed memoryviews a great deal. This is also not what I am looking for exactly, since I want all the numpy goodness to work on the array.
Edit: A (slightly) modified standalone MWE (save as test.pyx, compile via cython test.pyx):
cimport numpy

cdef extern from *:
    """
    /* This is C code which will be put
     * in the .c file output by Cython */
    typedef void (*callback)(double* values, int values_length);

    void execute(callback cb)
    {
        double values[] = {0., 1.};
        cb(values, 2);
    }
    """
    ctypedef void (*callback)(double* values, int values_length)
    void execute(callback cb)

def my_python_callback(array):
    print(array.shape)
    print(array)

cdef void my_callback(double* values, int values_length):
    # turn values / values_length into a numpy array
    # and call my_python_callback
    pass

cpdef my_execute():
    execute(my_callback)
2nd Edit:
Regarding the possible duplicate: while the questions are related, the first answer given there is, as was pointed out, rather fragile, relying on memory data flags, which are arguably an implementation detail. What is more, the question and answers are rather outdated, and the Cython API has been expanded since 2014. Fortunately, however, I was able to solve the problem myself.
Firstly, you can cast a raw pointer to a typed MemoryView operating on the same underlying memory without taking ownership of it via
cdef double[:] values_view = <double[:values_length]> values
This is not quite enough, however; as I stated, I want a numpy array. But it is possible to convert a MemoryView to a numpy array, provided that it has a numpy data type. Thus, the goal can be achieved in one line via
array = np.asarray(<np.float64_t[:values_length]> values)
It can be easily checked that the array operates on the correct memory segment without owning it.
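Putting it together, the missing callback in the MWE above can be completed along these lines (a sketch; the numpy imports are the only additions to the original skeleton):

import numpy as np
cimport numpy as np

cdef void my_callback(double* values, int values_length):
    # Wrap the raw pointer in a typed memoryview of the right length,
    # then expose it as a numpy array: no copy is made and numpy does
    # not take ownership of the data.
    array = np.asarray(<np.float64_t[:values_length]> values)
    my_python_callback(array)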
I have a list of varying size that contains numpy arrays with the same data type and shape. I would like to process this data using a function written in Cython without copying the data. Both memoryviews and the Python buffer protocol seem to support this kind of data using indirect for the first dimension. So I was hoping that something like this could work:
%%cython
from cython.view cimport indirect

def test(list a):
    cdef double[::indirect, :] x
    x = a
    x[0, 0] = 42
Unfortunately it doesn't.
Is there a way to convert this list of numpy arrays into such a memoryview?
I have a vector ws and wish to make another vector integrals of the same length whose k-th component is

integrals[k] = \int_{-1}^{1} f(x, w_k) \log\left( \sum_j f(x, w_j) \right) dx
The question is: how can we vectorize this for speed? NumPy vectorize() is actually a for loop, so it doesn't count.
Veedrac pointed out that "There is no way to apply a pure Python function to every element of a NumPy array without calling it that many times". Since I'm using NumPy functions rather than "pure Python" ones, I suppose it's possible to vectorize, but I don't know how.
import numpy as np
from scipy.integrate import quad

ws = 2 * np.random.random(10) - 1
n = len(ws)
integrals = np.empty(n)

def f(x, w):
    if w < 0: return np.abs(x * w)
    else: return np.exp(x) * w

def temp(x): return np.array([f(x, w) for w in ws]).sum()

def integrand(x, w): return f(x, w) * np.log(temp(x))

## Python for loop
for k in range(n):
    integrals[k] = quad(integrand, -1, 1, args = ws[k])[0]

## NumPy vectorize
integrals = np.vectorize(quad)(integrand, -1, 1, args = ws)[0]
On a side note, is a Cython for loop always faster than NumPy vectorization?
The function quad executes an adaptive algorithm, which means the computations it performs depend on the specific thing being integrated. This cannot be vectorized in principle.
In your case, a for loop of length 10 is a non-issue. If the program takes long, it's because integration takes long, not because you have a for loop.
When you absolutely need to vectorize integration (not in the example above), use a non-adaptive method, with the understanding that precision may suffer. These can be directly applied to a 2D NumPy array obtained by evaluating all of your functions on some regularly spaced 1D array (a linspace). You'll have to choose the linspace yourself since the methods aren't adaptive.
numpy.trapz is the simplest and least precise
scipy.integrate.simps is equally easy to use and more precise (Simpson's rule requires an odd number of samples, but the method works around having an even number, too).
scipy.integrate.romb is in principle of higher accuracy than Simpson (for smooth data) but it requires the number of samples to be 2**n+1 for some integer n.
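As a minimal sketch of this non-adaptive approach applied to your integrand (the grid size 1001 is an arbitrary choice of mine):

import numpy as np

ws = 2 * np.random.random(10) - 1
x = np.linspace(-1, 1, 1001)

# Evaluate f for every (w, x) pair at once; result has shape (10, 1001).
fx = np.where(ws[:, None] < 0,
              np.abs(x[None, :] * ws[:, None]),
              np.exp(x)[None, :] * ws[:, None])

# integrand_k(x) = f(x, w_k) * log(sum_j f(x, w_j)), on the whole grid
vals = fx * np.log(fx.sum(axis=0))

# One trapezoidal-rule integral per row, all at once.
integrals = np.trapz(vals, x, axis=1)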
@zaq's answer focusing on quad is spot on. So I'll look at some other aspects of the problem.
In a recent answer (https://stackoverflow.com/a/41205930/901925) I argue that vectorize is of most value when you need to apply the full broadcasting mechanism to a function that only takes scalar values. Your quad qualifies as taking scalar inputs. But you are only iterating over one array, ws. The x that is passed on to your functions is generated by quad itself. quad and integrand are still Python functions, even if they use numpy operations.
cython improves low level iteration, stuff that it can convert to C code. Your primary iteration is at a high level, calling an imported function, quad. Cython can't touch or rewrite that.
You might be able to speed up integrand (and on down) with cython, but first focus on getting the most speed from that with regular numpy code.
def f(x, w):
    if w < 0: return np.abs(x * w)
    else: return np.exp(x) * w
With the test w < 0, w must be a scalar. Can f be written so it works with an array w? If so, then
np.array([f(x, w) for w in ws]).sum()
could be rewritten as
fn(x, ws).sum()
Alternatively, since both x and w are scalar, you might get a bit of speed improvement by using math.exp etc instead of np.exp. Same for log and abs.
I'd try to write f(x,w) so it takes arrays for both x and w, returning a 2d result. If so, then temp and integrand would also work with arrays. Since quad feeds a scalar x, that may not help here, but with other integrators it could make a big difference.
If f(x,w) can be evaluated on a regular nx10 grid of x=np.linspace(-1,1,n) and ws, then an integral (of sorts) just requires a couple of summations over that space.
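A sketch of such an array-capable version (fv is a name I've introduced; np.where evaluates both branches and then selects elementwise):

import numpy as np

def fv(x, w):
    # Accepts scalars or arrays for both x and w; with a column vector
    # and a row vector it broadcasts to a 2d result.
    x, w = np.asarray(x), np.asarray(w)
    return np.where(w < 0, np.abs(x * w), np.exp(x) * w)

# np.array([f(x, w) for w in ws]).sum() becomes:
# fv(x, ws).sum()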
You can use quadpy for fully vectorized computation. You'll have to adapt your function to allow for vector inputs first, but that is done rather easily:
import numpy as np
import quadpy

np.random.seed(0)
ws = 2 * np.random.random(10) - 1

def f(x):
    out = np.empty((len(ws), *x.shape))
    out0 = np.abs(np.multiply.outer(ws, x))
    out1 = np.multiply.outer(ws, np.exp(x))
    out[ws < 0] = out0[ws < 0]
    out[ws >= 0] = out1[ws >= 0]
    return out

def integrand(x):
    return f(x) * np.log(np.sum(f(x), axis=0))

val, err = quadpy.quad(integrand, -1, +1, epsabs=1.0e-10)
print(val)
[0.3266534 1.44001826 0.68767868 0.30035222 0.18011948 0.97630376
0.14724906 2.62169217 3.10276876 0.27499376]
I am very new to Cython, yet am already experiencing extraordinary speedups just by copying my .py to .pyx (and cimport cython, numpy etc) and importing into ipython3 with pyximport.
Many tutorials start with this approach, the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc.
But unlike most Pandas Cython tutorials or examples, I am not applying functions so to speak, but rather manipulating data using slices, sums, and division (etc.).
So the question is: Can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double), with columns that are int and rows that are int?
How do I define the type of a nested list? i.e. [[int, int], [int]]
Here is an example that generates the AIC score for a partitioning of a DF, sorry it is so verbose:
cimport cython
import numpy as np
cimport numpy as np
import pandas as pd

offcat = [
    "breakingPeace",
    "damage",
    "deception",
    "kill",
    "miscellaneous",
    "royalOffences",
    "sexual",
    "theft",
    "violentTheft"
]

def partitionAIC(EmpFrame, part, OffenceEstimateFrame, ReturnDeathEstimate=False):
    """EmpFrame is DataFrame of ints, part is nested list of ints, OffenceEstimate frame is DF of float"""
    """partOf/block is a list of ints"""
    """ll, AIC, is series/frame of floats"""
    ## Cython cdefs
    cdef int DFlen
    cdef int puns
    cdef int DeathPun
    cdef int k
    cdef int pId
    cdef int punish

    DFlen = EmpFrame.shape[1]
    puns = 2
    DeathPun = 0
    PartitionModel = pd.DataFrame(index=EmpFrame.index, columns=EmpFrame.columns)

    for partOf in part:
        Grouping = [puns*x + y for x in partOf for y in list(range(0, puns))]
        PartGroupSum = EmpFrame.iloc[:, Grouping].sum(axis=1)
        for punish in range(0, puns):
            PunishGroup = [x*puns + punish for x in partOf]
            punishPunishment = ((EmpFrame.iloc[:, PunishGroup].sum(axis=1) + 1/puns).div(PartGroupSum + 1)).values[np.newaxis].T
            PartitionModel.iloc[:, PunishGroup] = punishPunishment
    PartitionModel = PartitionModel*OffenceEstimateFrame

    if ReturnDeathEstimate:
        DeathProbFrame = pd.DataFrame([[part]], index=EmpFrame.index, columns=['Partition'])
        for pId, block in enumerate(part):
            DeathProbFrame[pId] = PartitionModel.iloc[:, block[::puns]].sum(axis=1)
        DeathProbFrame = DeathProbFrame.apply(
            lambda row: sorted([[format("%6.5f" % row[idx])] + [offcat[X] for X in x]
                                for idx, x in enumerate(row['Partition'])],
                               key=lambda x: x[0], reverse=True),
            axis=1)

    ll = (EmpFrame*np.log(PartitionModel.convert_objects(convert_numeric=True))).sum(axis=1)
    k = (len(part))*(puns - 1)
    AIC = 2*k - 2*ll

    if ReturnDeathEstimate:
        return AIC, DeathProbFrame
    else:
        return AIC
My advice is to do as much as possible in pandas. This is kinda standard advice "get it working first, then care about performance if it really matters". So let's suppose you've done that (hopefully you've written some tests too), and it's too slow:
Profile your code. (See this SO answer, or use %prun in ipython).
The output of prun should drive what bit to improve next.
pandas (make your code more pandorable, this can help a lot).
numpy (not creating intermediary Series/DataFrames, being careful about dtypes)
cython (the last resort).
Now, if it is a line to do with slicing (it probably isn't), put that tiny part in cython; I like to move single python function calls to a cython function. On that point, code in cython should use numpy, not pandas; pandas is not going to lower to C (cython can't infer its types).
Putting your entire code into cython won't actually help that much, you want to only put the specific lines, or function calls, which are performance sensitive. Keeping cython focussed is the only way to have a good time.
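To illustrate the kind of narrow helper meant here, a sketch (a hypothetical kernel, not taken from your code) of a typed cython function working on a numpy array pulled out of a DataFrame:

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def row_share(np.double_t[:, :] data):
    # Sum each row and divide by the grand total: a tiny, typed,
    # numpy-only kernel of the sort worth extracting from pandas code.
    cdef int i, j
    cdef int nrow = data.shape[0]
    cdef int ncol = data.shape[1]
    cdef double total = 0.0
    out = np.zeros(nrow)
    cdef double[:] out_view = out
    for i in range(nrow):
        for j in range(ncol):
            out_view[i] += data[i, j]
            total += data[i, j]
    for i in range(nrow):
        out_view[i] /= total
    return out

Called from pandas code as row_share(df.values), assuming a float DataFrame.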
Read the enhancing performance section of the pandas docs*! Here this process (prun -> cythonize -> type) is gone over step-by-step with a real-life example.
*Full disclosure: I wrote that section of the docs! :)