Cythonising Pandas: ctypes for content, index and columns

I am very new to Cython, yet am already seeing extraordinary speedups just by copying my .py to .pyx (and cimporting cython, numpy etc.) and importing into ipython3 with pyximport.
Many tutorials take this approach, with the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc.
But unlike most Pandas Cython tutorials or examples, I am not applying functions so to speak; I am mostly manipulating data using slices, sums, division and so on.
So the question is: can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double), with columns that are int and rows that are int?
And how do I define the type of an embedded list, e.g. [[int,int],[int]]?
Here is an example that generates the AIC score for a partitioning of a DataFrame; sorry it is so verbose:
cimport cython
import numpy as np
cimport numpy as np
import pandas as pd

offcat = [
    "breakingPeace",
    "damage",
    "deception",
    "kill",
    "miscellaneous",
    "royalOffences",
    "sexual",
    "theft",
    "violentTheft"
]

def partitionAIC(EmpFrame, part, OffenceEstimateFrame, ReturnDeathEstimate=False):
    """EmpFrame is a DataFrame of ints, part is a nested list of ints,
    OffenceEstimateFrame is a DataFrame of floats.
    partOf/block is a list of ints; ll and AIC are Series/DataFrames of floats."""
    ## Cython cdefs
    cdef int DFlen
    cdef int puns
    cdef int DeathPun
    cdef int k
    cdef int pId
    cdef int punish

    DFlen = EmpFrame.shape[1]
    puns = 2
    DeathPun = 0
    PartitionModel = pd.DataFrame(index=EmpFrame.index, columns=EmpFrame.columns)

    for partOf in part:
        Grouping = [puns*x + y for x in partOf for y in list(range(0, puns))]
        PartGroupSum = EmpFrame.iloc[:, Grouping].sum(axis=1)
        for punish in range(0, puns):
            PunishGroup = [x*puns + punish for x in partOf]
            punishPunishment = ((EmpFrame.iloc[:, PunishGroup].sum(axis=1) + 1/puns).div(PartGroupSum + 1)).values[np.newaxis].T
            PartitionModel.iloc[:, PunishGroup] = punishPunishment

    PartitionModel = PartitionModel*OffenceEstimateFrame

    if ReturnDeathEstimate:
        DeathProbFrame = pd.DataFrame([[part]], index=EmpFrame.index, columns=['Partition'])
        for pId, block in enumerate(part):
            DeathProbFrame[pId] = PartitionModel.iloc[:, block[::puns]].sum(axis=1)
        DeathProbFrame = DeathProbFrame.apply(
            lambda row: sorted([[format("%6.5f" % row[idx])] + [offcat[X] for X in x]
                                for idx, x in enumerate(row['Partition'])],
                               key=lambda x: x[0], reverse=True),
            axis=1)

    ll = (EmpFrame*np.log(PartitionModel.convert_objects(convert_numeric=True))).sum(axis=1)
    k = (len(part))*(puns - 1)
    AIC = 2*k - 2*ll

    if ReturnDeathEstimate:
        return AIC, DeathProbFrame
    else:
        return AIC

My advice is to do as much as possible in pandas. This is the kinda-standard advice: "get it working first, then care about performance if it really matters". So let's suppose you've done that (and hopefully you've written some tests too), and it's too slow:
Profile your code. (See this SO answer, or use %prun in ipython; a minimal sketch follows the list below.)
The output of prun should drive which bit to improve next, in this order:
pandas (make your code more pandorable, this can help a lot).
numpy (not creating intermediary Series/DataFrames, being careful about dtypes)
cython (the last resort).
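For example, a minimal profiling sketch (reusing the function and arguments from the question, purely for illustration):

# in ipython, show the 10 most expensive calls:
%prun -l 10 partitionAIC(EmpFrame, part, OffenceEstimateFrame)

# or from a plain script, using the standard-library profiler:
import cProfile
cProfile.run('partitionAIC(EmpFrame, part, OffenceEstimateFrame)', sort='cumtime')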
Now, if it is a line to do with slicing (it probably isn't), put that tiny part in cython; I like to move a single python function call into a cython function. On that point, the cython code should use numpy, not pandas: pandas is not going to lower to C (cython can't infer its types).
Putting your entire code into cython won't actually help that much; you want to put only the specific lines, or function calls, that are performance sensitive into cython (see the sketch below). Keeping cython focussed is the only way to have a good time.
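A sketch of that pattern (the function name and signature here are invented, and the ratio computation is lifted from the question's inner loop): keep pandas at the call site and hand the typed cython function plain numpy arrays.

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def smoothed_ratio(np.ndarray[np.float64_t, ndim=1] group_sum,
                   np.ndarray[np.float64_t, ndim=1] part_sum,
                   double puns):
    # elementwise (group_sum + 1/puns) / (part_sum + 1), all at C speed
    cdef Py_ssize_t i, n = group_sum.shape[0]
    cdef np.ndarray[np.float64_t, ndim=1] out = np.empty(n, dtype=np.float64)
    for i in range(n):
        out[i] = (group_sum[i] + 1.0 / puns) / (part_sum[i] + 1.0)
    return out

At the call site you would pass in EmpFrame.iloc[:, PunishGroup].sum(axis=1).values and PartGroupSum.values, then assign the returned array back into the DataFrame.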
Read the enhancing performance section of the pandas docs*! There this process (prun -> cythonize -> type) is gone over step by step with a real-life example.
*Full disclosure: I wrote that section of the docs! :)

Related

A more efficient way of creating an NxM array in Python

In Python, I need to create an NxM matrix in which the (i, j) entry has value i^2 + j^2.
I'm currently constructing it using two for loops, but the array is quite big and the computation time is long, and I need to perform it several times. Is there a more efficient way of constructing such a matrix, maybe using Numpy?
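For reference, the loop-based construction being described looks something like this (a sketch, assuming N and M are given):

import numpy as np

result = np.zeros((N, M))
for i in range(N):
    for j in range(M):
        result[i, j] = i**2 + j**2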
You can use broadcasting in numpy. You may refer to the official documentation. For example,
import numpy as np
N = 3; M = 4 #whatever values you'd like
a = (np.arange(N)**2).reshape((-1,1)) # make it a column vector
b = np.arange(M)**2
print(a+b) #broadcasting applied
Instead of using np.arange(), you can use np.array([...some array...]) for customizing it.
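Equivalently, the same result can be had in one line with the outer method of the add ufunc:

print(np.add.outer(np.arange(N)**2, np.arange(M)**2)) # same NxM result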

Fastest way to find nearest nonzero value in array from columns in pandas dataframe

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.
It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!
a = np.random.randint(0, 2, size=(100, 100, 100))
# df with i,j,k of interest
df = pd.DataFrame(np.random.randint(100, size=(100, 3)).tolist(),
                  columns=['i', 'j', 'k'])

def find_nearest(a, df):
    import numpy as np
    import pandas as pd
    import time
    t0 = time.time()
    nzi = np.nonzero(a)
    for i, r in df.iterrows():
        dist = ((r['k'] - nzi[0])**2 +
                (r['i'] - nzi[1])**2 +
                (r['j'] - nzi[2])**2)
        nidx = dist.argmin()
        df.loc[i, ['nk', 'ni', 'nj']] = (nzi[0][nidx],
                                         nzi[1][nidx],
                                         nzi[2][nidx])
    print(time.time() - t0)
    return df
The problem you are trying to solve is a nearest-neighbour search.
The worst-case complexity of the current code is O(n m), with n the number of points to search and m the number of candidate neighbours. With n = 100 and m = 100**3 = 1,000,000, this means about a hundred million iterations. To solve this efficiently, one can use a better algorithm.
The common way to solve this kind of problem is to put all the elements in a spatial tree data structure (such as a BSP tree, quadtree, or octree). Such a data structure lets you locate the nearest elements to a given location in O(log m) time, so the overall complexity of this method is O(n log m): with the sizes above, that is on the order of 100 * 20 = 2,000 steps instead of a hundred million. SciPy already implements KD-trees.
Vectorization generally also helps to speed up the computation.
def find_nearest_fast(a, df):
    from scipy.spatial import KDTree
    import numpy as np
    import pandas as pd
    import time
    t0 = time.time()
    candidates = np.array(np.nonzero(a)).transpose().copy()
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices, :]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time() - t0)
    return df
This implementation is 16 times faster on my machine. Note that the results can differ slightly from the original code's, since a given input point may have several nearest candidates at exactly the same distance.
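A quick way to exercise both versions side by side (reusing the random inputs from the question; each function prints its own timing):

a = np.random.randint(0, 2, size=(100, 100, 100))
df = pd.DataFrame(np.random.randint(100, size=(100, 3)).tolist(),
                  columns=['i', 'j', 'k'])
slow = find_nearest(a, df.copy())
fast = find_nearest_fast(a, df.copy())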

Array-Broadcasting in Cython Memoryview

I created a typed memoryview in cython and would like to multiply it by a scalar:
import numpy as np
import math
cimport numpy as np

def foo():
    N = 10
    cdef np.double_t[:, :] A = np.ones(shape=(N, N), dtype=np.double)
    cdef int i, j
    cdef double pi = math.pi
    for i in range(N):
        for j in range(N):
            A[i, j] *= pi
    return A

def bar():
    N = 10
    cdef np.double_t[:, :] A = np.ones(shape=(N, N), dtype=np.double)
    cdef double pi = math.pi
    A *= pi
    return A
Function foo() does this task but is not very convenient/readable.
The line A *= pi in function bar(), however, does not compile: Invalid operand types for '*' (double_t[:, :]; double).
Is there a way to perform such a broadcasting operation on a cython memoryview?
No, memoryviews don't do this. A memoryview is literally just a way to access individual elements of an array quickly. It has no real concept of the mathematical operations that can be performed on the array.
In the case of your bar function, any attempt to type it will probably actually make things worse (i.e. it'll spend extra time checking the types, but ultimately the work is done in ordinary calls to Numpy functions).
There are a number of (not 100% satisfactory) ways of getting a Numpy array from a memoryview:
np.asarray(memview) - this should be done without copying (provided you aren't using the esoteric indirect memory layout). It might be worth adding an assertion to check that no copy was made, though. A sketch using this approach follows the list.
memview.base - be slightly careful with this. If the memoryview is a result of slicing then .base will be the original unsliced object.
Keep a parallel numpy array and memoryview variable:
    Anp = np.array(...)
    cdef double[:] Amview = Anp
because the memoryview is a view of some memory, modifications to the array will be reflected in the memoryview and vice-versa. (Reassigning the array variable, e.g. Anp = something_else, won't be reflected though).
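As a sketch of the first option applied to bar() (assuming, as above, that the conversion is non-copying):

import math
import numpy as np
cimport numpy as np

def bar():
    N = 10
    cdef np.double_t[:, :] A = np.ones(shape=(N, N), dtype=np.double)
    # np.asarray wraps the memoryview's buffer without copying,
    # so numpy's broadcasting works again
    Anp = np.asarray(A)
    Anp *= math.pi
    return Anp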
In summary, memoryviews are designed for one main job: being able to access individual elements quickly. If that's not what you're doing then you probably don't want to use a memoryview.

Numpy iterating over rows

I have the (possibly mistaken) idea that for loops should be avoided in Numpy for speed reasons, for example:
import numpy
a = numpy.array([[2,0,1,3],[0,2,3,1]])
targets = numpy.array([[1,1,1,1,1,1,1]])
output = numpy.zeros((2,1))
for i in range(2):
    output[i] = numpy.mean(targets[a[i]])
Is this a good way to get the mean at selected positions of each row? It feels like there might be a way to slice the array first and then apply mean directly.
I think you are looking for this:
targets[a].mean(1)
Note that in your example, targets needs to be 1-D, not 2-D. Otherwise, your loop throws an out-of-bounds index error, since the indices are interpreted as row indices rather than column indices.
numpy actually interprets this for you: targets[a] works "row-wise", and subsequently using np.mean(targets[a], axis=1), as suggested by @hpaulj in the comments, does exactly what you want:
import numpy
a = numpy.array([[2,0,1,3],[0,2,3,1]])
targets = numpy.arange(1,6) # To make the results differ
output = numpy.mean(targets[a], axis=1) # the i-th row of targets[a] is targets[a[i]]
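With these inputs both rows happen to average to the same value:

print(output) # [2.5 2.5]; row 0 picks targets[[2,0,1,3]] = [3,1,2,4], row 1 picks [1,3,4,2]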

How can I combine multiple numpy arrays into a single memoryview for cython?

I have a list of varying size that contains numpy arrays with the same data type and shape. I would like to process this data using a function written in Cython without copying the data. Both memoryviews and the Python buffer protocol seem to support this kind of data using indirect for the first dimension. So I was hoping that something like this could work:
%%cython
from cython.view cimport indirect

def test(list a):
    cdef double[::indirect, :] x
    x = a
    x[0, 0] = 42
Unfortunately it doesn't.
Is there a way to convert this list of numpy arrays into such a memoryview?