Cython references to slots in a numpy array

I have an object with a numpy array instance variable.
Within a function, I want to declare local references to slots within that numpy array.
E.g.,
cdef double& x1 = self.array[0]
Reason being, I don't want to spend time instantiating new variables and copying values.
Obviously the above code doesn't work; the compiler complains that C++-style references are not supported. How do I do what I want to do?

C++ references aren't supported as local variables (even in Cython's C++ mode) because they need to be initialized upon creation and Cython prefers to generate code like:
# start of function
double& x_ref
# ...
x_ref = something # assign
# ...
This ensures that variable scope behaves in a "Python way" rather than a "C++ way". It does mean everything needs to be default-constructible, though.
However, C++ references are usually implemented in terms of pointers, so the solution is just to use pointers yourself:
cdef double* x1 = &self.array[1]
x1[0] = 2 # use [0] to dereference pointer
Obviously C++ references make the syntax nicer (you don't have to worry about dereferences and getting addresses) but performance-wise it should be the same.
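For example, a minimal sketch (the class, attribute, and method names here are illustrative, assuming the instance holds a C-contiguous 1-D array of doubles):
import numpy as np

cdef class Holder:
    cdef double[::1] array          # typed memoryview over the instance's numpy array

    def __init__(self):
        self.array = np.zeros(4)

    cdef void bump_first(self):
        cdef double* x1 = &self.array[0]   # "reference" to the first slot, no copy
        x1[0] += 1.0                       # dereference with [0] to read or write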

Related

Why does Cython keep making python objects instead of c? [duplicate]

This question already has an answer here:
What parts of a Numpy-heavy function can I accelerate with Cython
I am trying to learn Cython, compiling with annotate=True.
The basic tutorial says:
If a line is white, it means that the code generated doesn’t interact with Python, so will run as fast as normal C code. The darker the yellow, the more Python interaction there is in that line
Then I wrote this code following (as much as I understood) the basic "Numpy in Cython" tutorial instructions:
+14: cdef entropy(counts):
 15:     '''
 16:     INPUT: pandas table with counts as obsN
 17:     OUTPUT: general entropy
 18:     '''
+19:     cdef int l = counts.shape[0]
+20:     cdef np.ndarray probs = np.zeros(l, dtype=np.float)
+21:     cdef int totals = np.sum(counts)
+22:     probs = counts/totals
+23:     cdef np.ndarray plogp = np.zeros(l, dtype=np.float)
+24:     plogp = ( probs.T * (np.log(probs)) ).T
+25:     cdef float d = np.exp(-1 * np.sum(plogp))
+26:     cdef float relative_d = d / probs.shape[0]
 27:
+28:     return {'d':d,
+29:             'relative_d':relative_d
 30:            }
Where all the "+" at the beginning of the line are yellow in the cython.debug.output.html file.
What am I doing very wrong? How can I make at least part of this function run at C speed?
The function returns a Python dictionary, so I think I can't return any C data type. I might be wrong here too.
Thank you for the help!
First of all, Cython does not rewrite Numpy functions, it just calls them the way CPython does. This is the case for np.zeros, np.sum or np.log, for example; such calls will not be faster with Cython. If you want faster code you can reimplement them with plain loops in your code. However, this may not be faster: on one hand, Numpy calls introduce an overhead (due to type checking, which AFAIK is still enabled with Cython, internal function calls, wrappers, etc.) that is certainly significant if you use small arrays, and each function generates huge temporary arrays that are often slow to read/write; on the other hand, some Numpy functions make use of highly optimized code (like BLAS or low-level SIMD intrinsics).
Moreover, division in Python does not behave the same way as in C. This is why Cython provides the flag cython.cdivision, which can be set to True (it is False by default). If the Python division is used, Cython generates slower wrapping code. Finally, np.ndarray is a CPython type and behaves as such; you can use memoryviews so as not to deal with Numpy objects.
If you want to get fast code, you certainly need to use memoryviews and loops, avoid creating temporary arrays, and use multiple threads. Additionally, you can use np.empty instead of np.zeros in your case. Besides this, the Numpy transposition is not very efficient and Numpy does not solve this problem. You can implement a tiled transposition to speed it up, but this is not trivial to implement efficiently (a Numba implementation of such a transposition can certainly be transformed to Cython code fairly easily). Putting some cdef on Python Numpy code generally does not make it faster.
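As a rough illustration of that advice, here is a minimal sketch (untested; the function name is illustrative, and it assumes counts can be viewed as a contiguous 1-D double array) of the entropy function rewritten with a typed memoryview and plain loops instead of temporary arrays:
# cython: boundscheck=False, wraparound=False, cdivision=True
from libc.math cimport log, exp

cpdef dict entropy_mv(double[::1] counts):
    cdef Py_ssize_t i, n = counts.shape[0]
    cdef double total = 0.0, plogp_sum = 0.0, p, d
    for i in range(n):            # np.sum(counts) as a plain loop
        total += counts[i]
    for i in range(n):            # accumulate p*log(p) without temporary arrays
        p = counts[i] / total
        if p > 0.0:
            plogp_sum += p * log(p)
    d = exp(-plogp_sum)
    return {'d': d, 'relative_d': d / n}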

How do you use/view memoryview objects in Cython?

I've got a project where a handful of nested for-loops are slowing down the runtime of the code, so I've started adding some Cython typing. That sped up the loops significantly, but I've run into a new problem: with the typing I'm using, no computations can be done on the typed objects. Here's a mock sketch of my code:
cdef double[:,:] my_matrix = np.zeros([width, height])
for i in range(0, width):
    for j in range(0, height):
        a = v1[i] - v2[j]
        my_matrix[i, j] = np.sqrt(a**2)
After that I want to compute the product of my_matrix using
A complex number
Two constants
The exponential function
The matrix itself, like so:
product = constant1 * np.exp(-1j * constant2 * my_matrix) / my_matrix
By attempting this I get the error:
TypeError: unsupported operand type(s) for *: 'complex' and 'my_cython_function_cy._memoryviewslice'
I understand the implication of this error, but I don't get how to use the contents of the memoryview object as an array. I tried doing this:
new_matrix = my_matrix
but that won't compile. I'm new to both C and Cython, and the documentation isn't very helpful with these rookie questions, so I would be very grateful for any help here.
The best thing to do is:
new_matrix = np.asarray(my_matrix)
That lets you access the full set of Numpy operations on the array. It should be a pretty lightweight transformation (they'll share the same underlying data).
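For the product in the question, that would look roughly like this (a sketch, assuming constant1 and constant2 are already defined Python scalars):
import numpy as np

new_matrix = np.asarray(my_matrix)  # shares memory with the memoryview
product = constant1 * np.exp(-1j * constant2 * new_matrix) / new_matrix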
You could also get the wrapped object with my_matrix.base (this would probably be the original Numpy array that you initialized it with). However, depending on what you've done with slicing this might not be quite the same as the memoryview, so be a bit wary of this approach.

Integrating tfe.EagerVariableStore with tfe.Checkpoint?

tfe.Checkpoint seems to require the objects being checkpointed to implement CheckpointableBase, which EagerVariableStore doesn't.
What is the right way, then, to use EagerVariableStore to "eagerify" the functional parts of TensorFlow with the ability to checkpoint?
Providing some working code would be appreciated.
For eagerifying functional code, I'd suggest tf.make_template rather than EagerVariableStore directly. When executing eagerly, this will create a variable store automatically (allowing variable reuse with tf.get_variable), and the object tf.make_template returns is checkpointable.
import tensorflow as tf
tf.enable_eager_execution()
def uses_functional_layers(x):
    return tf.layers.dense(inputs=x, units=1)
save_template = tf.make_template("save_template", uses_functional_layers)
save_checkpoint = tf.train.Checkpoint(model=save_template)
save_template(tf.ones([1, 1]))
save_template.variables[0].assign([42.])
save_output = save_template(tf.ones([1, 1]))
save_path = save_checkpoint.save('/tmp/tf_template_ckpt')
So we make a function which wraps our functional layers / tf.get_variable usage, then make a template object out of that with tf.make_template, and finally can checkpoint that template object after it has been called once to create its variables.
An advantage of doing it this way is that we get restore-on-create for variables in the template, meaning the template is evaluated with the restored values the first time it is called:
import numpy
# Create a second template to load the checkpoint into
restore_template = tf.make_template("save_template", uses_functional_layers)
tf.train.Checkpoint(model=restore_template).restore(save_path)
numpy.testing.assert_allclose(
    save_output,
    restore_template(tf.ones([1, 1])))  # Variables are restored on creation
numpy.testing.assert_equal([42.], restore_template.variables[0].numpy())
Nested templates work too. Note that the template object strips its own variable_scope from the names of variables created within it, but otherwise uses the full variable names (which may be more fragile than usual object-based checkpointing).
Looking up variables repeatedly with tf.get_variable (done each time the template is evaluated) is also quite slow, which is one reason TensorFlow is moving toward object-oriented Keras-style layers instead of functional layers.
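For comparison, a rough object-based sketch of the same single dense layer using a Keras-style layer (the checkpoint path is illustrative):
model = tf.keras.layers.Dense(1)
checkpoint = tf.train.Checkpoint(model=model)
model(tf.ones([1, 1]))                      # calling the layer creates its variables
save_path = checkpoint.save('/tmp/tf_keras_layer_ckpt')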
I have found a "hackish" way, but it works!
The main problem is:
tfe.EagerVariableStore doesn't inherit from CheckpointableBase, hence it can't be saved with tfe.Checkpoint.
The big idea is:
We are going to create a CheckpointableBase object that "points" to every variable stored in the tfe.EagerVariableStore
How to know what are stored in EagerVariableStore?
Reference: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/ops/variable_scope.py
It says that EagerVariableStore uses _VariableStore to store all the variables, via _store.
Now, the _VariableStore stores the variables in self._vars as a dictionary.
If we have a container = tfe.EagerVariableStore(), we can get all the variables via container._store._vars as a dictionary.
How to create a CheckpointableBase that points to every variable then?
We will use tfe.Checkpointable, since its __setattr__ tracks the objects assigned to it for checkpointing.
checkpointable = tfe.Checkpointable()
for k, v in container._store._vars.items():
    setattr(checkpointable, k, v)
How to combine the two?
As we have a tfe.Checkpoint for saving, all we need to do is this:
saver = tfe.Checkpoint(checkpointable=checkpointable)
saver.save(...)
And saver.restore(...) to restore.
Your tfe.EagerVariableStore need not be changed; after the checkpointable is restored via tfe.Checkpoint, it will "replace" the values in the tfe.EagerVariableStore automagically!
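Putting the pieces together, a rough end-to-end sketch (the variable name "v" and the checkpoint path are illustrative, and this relies on the private _store._vars attribute as described above):
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()

container = tfe.EagerVariableStore()
with container.as_default():
    v = tf.get_variable("v", shape=[2], initializer=tf.zeros_initializer())

# Mirror every variable in the store onto a checkpointable object.
checkpointable = tfe.Checkpointable()
for k, var in container._store._vars.items():
    setattr(checkpointable, k, var)

saver = tfe.Checkpoint(checkpointable=checkpointable)
save_path = saver.save('/tmp/eager_store_ckpt')
# ...later, after rebuilding the same store and checkpointable...
saver.restore(save_path)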

Find all variables that a tensorflow op depends upon

Is there a way to find all variables that a given operation (usually a loss) depends upon?
I would like to use this to then pass this collection into optimizer.minimize() or tf.gradients() using various set().intersection() combinations.
So far I have found op.op.inputs and tried a simple BFS on that, but I never come across the Variable objects returned by tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES) or slim.get_variables().
There does seem to be a correspondence between the Tensor.op._id and Variable.op._id fields, but I'm not sure that's something I should rely upon.
Or maybe I shouldn't want to do this in the first place?
I could of course construct my disjoint sets of variables meticulously while building my graph, but then it would be easy to miss something if I change the model.
The documentation for tf.Variable.op is not particularly clear, but it does refer to the crucial tf.Operation used in the implementation of a tf.Variable: any op that depends on a tf.Variable will be on a path from that operation. Since the tf.Operation object is hashable, you can use it as the key of a dict that maps tf.Operation objects to the corresponding tf.Variable object, and then perform the BFS as before:
import collections

op_to_var = {var.op: var for var in tf.trainable_variables()}

starting_op = ...
dependent_vars = []
queue = collections.deque()
queue.append(starting_op)
visited = set([starting_op])
while queue:
    op = queue.popleft()
    try:
        dependent_vars.append(op_to_var[op])
    except KeyError:
        # `op` is not a variable, so search its inputs (if any).
        for op_input in op.inputs:
            if op_input.op not in visited:
                queue.append(op_input.op)
                visited.add(op_input.op)
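As a usage sketch (assuming the BFS above was run with starting_op = loss.op for some scalar loss tensor; the learning rate is illustrative), the resulting list can be passed straight to the optimizer, as the question suggests:
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, var_list=dependent_vars)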

Understanding Numpy internals for profiling purposes

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg] and it's getting more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could avoid using matrices so much and just use 2-D numpy arrays. I typically only use matrices for a short time to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
But, to your questions:
1) There really is no shortcut around __getitem__ unless defmatrix overrides __getslice__, which it could do but doesn't yet. There are the .item and .itemset methods, which are optimized for integer getting and setting (and return Python objects rather than NumPy's array scalars).
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C-function that every array-creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of sub-classes defined purely in Python, it is calling back into the Python interpreter from C which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
Question 1:
Since array slices can sometimes require a copy of the underlying data structure (holding the pointers to the data in memory), they can be quite expensive. If you're really bottlenecked by this in your example, you can compute the mean by iterating over the i to i+10 elements yourself and accumulating it manually. For some operations this won't give any performance improvement, but avoiding the creation of new data structures will generally speed things up.
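A different technique that avoids creating a slice (and an intermediate array) on every iteration of the question's loop is a cumulative sum; here is a sketch, assuming my_array is a 1-D float array and idx + 10 <= len(my_array):
import numpy as np

csum = np.concatenate(([0.0], np.cumsum(my_array)))
# mean of my_array[i:i+10] is (csum[i+10] - csum[i]) / 10
res = (csum[10:idx + 10] - csum[:idx]) / 10.0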
Another note: if you're not using native types inside numpy, you will take a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 while your native machine float size is float32 -- this will cost a lot of extra computation power for numpy and overall performance will drop. Sometimes this is fine and you can just take the hit to keep a particular data type. Other times it's arbitrary what type the float or int is stored as internally; in those cases try dtype=float instead of dtype=float64, since Numpy should then default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to SciPy. Thus this is a result described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.