I am trying to learn Cython, compiling with annotate=True.
The basic manual says:
If a line is white, it means that the code generated doesn't interact with Python, so will run as fast as normal C code. The darker the yellow, the more Python interaction there is in that line.
Then I wrote this code, following (as much as I understood) the basic manual's instructions for using NumPy in Cython:
+14: cdef entropy(counts):
 15:     '''
 16:     INPUT: pandas table with counts as obsN
 17:     OUTPUT: general entropy
 18:     '''
+19:     cdef int l = counts.shape[0]
+20:     cdef np.ndarray probs = np.zeros(l, dtype=np.float)
+21:     cdef int totals = np.sum(counts)
+22:     probs = counts/totals
+23:     cdef np.ndarray plogp = np.zeros(l, dtype=np.float)
+24:     plogp = ( probs.T * (np.log(probs)) ).T
+25:     cdef float d = np.exp(-1 * np.sum(plogp))
+26:     cdef float relative_d = d / probs.shape[0]
 27:
+28:     return {'d':d,
+29:             'relative_d':relative_d
 30:             }
All the lines marked with "+" at the beginning are yellow in the cython.debug.output.html file.
What am I doing very wrong? How can I make at least part of this function run at c speed?
The function returns a Python dictionary, hence I think that I can't return any C data type. I might be wrong here, too.
Thank you for the help!
First of all, Cython does not rewrite Numpy functions, it just calls them the way CPython does. This is the case for np.zeros, np.sum or np.log for example. Such calls will not be faster with Cython. If you want faster code you can reimplement them with plain loops, but this may not actually be faster: on one hand, Numpy calls introduce an overhead (due to type checking, which AFAIK is still enabled with Cython, internal function calls, wrappers, etc.) that is certainly significant if you use small arrays, and each function generates huge temporary arrays that are often slow to read/write; on the other hand, some Numpy functions make use of highly optimized code (like BLAS or low-level SIMD intrinsics).
Moreover, division in Python does not behave the same way as in C. This is why Cython provides the flag cython.cdivision, which can be set to True (it is False by default). If the Python division is used, Cython generates slower wrapping code. Finally, np.ndarray is a CPython type and behaves as such; you can use memoryviews so as not to deal with Numpy objects.
If you want fast code, you certainly need to use memoryviews and loops, avoid creating temporary arrays, and use multiple threads. Additionally, you can use np.empty instead of np.zeros in your case. Besides this, the Numpy transposition is not very efficient and Numpy does not solve this problem. You can implement a tiled transposition to speed it up, but this is not trivial to implement efficiently. Here is a Numba implementation that can certainly be easily transformed into Cython code. Putting some cdef on Python Numpy code generally does not make it faster.
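For illustration, here is a minimal sketch in plain C of the loop-based version described above (the function name entropy_d is made up); in Cython the same structure would be written over a typed memoryview such as double[:], so that the loop lines can compile to pure C:

#include <math.h>

/* Minimal sketch: one pass for the total, one fused pass for p*log(p),
   no temporary arrays. In Cython the same loops work on a typed
   memoryview (e.g. double[:] counts) and compile to plain C. */
double entropy_d(const double *counts, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += counts[i];

    double plogp_sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double p = counts[i] / total;
        if (p > 0.0)                    /* guard against log(0) */
            plogp_sum += p * log(p);
    }
    return exp(-plogp_sum);             /* the 'd' value from the question */
}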
I have an object with a numpy array instance variable.
Within a function, I want to declare local references to slots within that numpy array.
E.g.,
cdef double& x1 = self.array[0]
Reason being, I don't want to spend time instantiating new variables and copying values.
Obviously the above code doesn't work; the compiler complains that C++-style references are not supported. How do I do what I want to do?
C++ references aren't supported as local variables (even in Cython's C++ mode) because they need to be initialized upon creation and Cython prefers to generate code like:
# start of function
double& x_ref
# ...
x_ref = something # assign
# ...
This ensures that variable scope behaves in a "Python way" rather than a "C++ way". It does mean everything needs to be default constructible, though.
However, C++ references are usually implemented in terms of pointers, so the solution is just to use pointers yourself:
cdef double* x1 = &self.array[1]
x1[0] = 2 # use [0] to dereference pointer
Obviously C++ references make the syntax nicer (you don't have to worry about dereferences and getting addresses) but performance-wise it should be the same.
I need to draw 3-D graphs from C code. For this purpose I have to use Python's matplotlib. Can anyone help me do this? I have to plot the graph from the values currently stored in a C array.
Although it is not exactly the same question, you might want to take a look at this.
That being said, some of the solutions proposed are:
A) Embed Python in your C program (by @Raj):
#include "Python.h"
int main()
{
    Py_Initialize();
    PyRun_SimpleString("import pylab");
    PyRun_SimpleString("pylab.plot(range(5))");
    PyRun_SimpleString("pylab.show()");
    Py_Exit(0);
    return 0;
}
B) Use a library that mimics matplotlib (by @kazemakase):
matplotlib-cpp
As for the array issue, depending on the solution that you choose, it might be worth your while to look into this question. In it, @en_Knight provides a few recipes for transforming data (C to Python and vice versa). Example:
PyArrayObject *numpy_tmp_array;   /* incoming NumPy array (from the caller's args) */
int *my_data_to_modify;
if (PyArg_ParseTuple(args, "O", &numpy_tmp_array)) {
    /* Point our data to the data in the numpy pixel array */
    my_data_to_modify = (int *) numpy_tmp_array->data;
}
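If the goal is simply to get the values of an existing C array onto a matplotlib plot, one possible approach, building on solution A, is to construct a Python list from the array and call pylab.plot through the C API. This is only a sketch: plot_c_array, values and n are made-up names, and Py_Initialize() is assumed to have been called already, as in the example above.

#include <Python.h>

/* Sketch: plot the contents of a C array with pylab.
   Assumes the interpreter was started with Py_Initialize(). */
static int plot_c_array(const double *values, Py_ssize_t n)
{
    PyObject *pylab = PyImport_ImportModule("pylab");
    if (!pylab)
        return -1;

    PyObject *list = PyList_New(n);
    for (Py_ssize_t i = 0; i < n; ++i)
        PyList_SetItem(list, i, PyFloat_FromDouble(values[i]));  /* steals the reference */

    PyObject *res = PyObject_CallMethod(pylab, "plot", "(O)", list);
    Py_XDECREF(res);
    res = PyObject_CallMethod(pylab, "show", NULL);
    Py_XDECREF(res);

    Py_DECREF(list);
    Py_DECREF(pylab);
    return 0;
}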
TLDR: How do I call standard floating point code in a way that compiles both 32 and 64 bit CGFloats without warnings?
CGFloat is defined as either double or float, depending on the compiler settings and platform. I'm trying to write code that works well in both situations, without generating a lot of warnings.
When I use functions like floor, abs, ceil, and other simple floating point operations, I get warnings about values being truncated. For example:
warning: implicit conversion shortens 64-bit value into a 32-bit value
I'm not concerned about correctness or loss of precision in these calculations, as I realize that I could just use the double-precision versions of all functions all of the time (floor instead of floorf, etc.); however, I would have to tolerate these warnings.
Is there a way to write code cleanly that supports both 32-bit and 64-bit floats without having to either use a lot of #ifdef __LP64__ blocks or write wrapper functions for all of the standard floating-point functions?
You may use those functions through the type-generic macros in tgmath.h:
#include <tgmath.h>
...
double d = 1.5;
double e = floor(d); // will choose the 64-bit version of 'floor'
float f = 1.5f;
float g = floor(f); // will choose the 32-bit version of 'floorf'.
If you only need a few functions you can use this instead:
#if CGFLOAT_IS_DOUBLE
#define roundCGFloat(x) round(x)
#define floorCGFloat(x) floor(x)
#define ceilCGFloat(x) ceil(x)
#else
#define roundCGFloat(x) roundf(x)
#define floorCGFloat(x) floorf(x)
#define ceilCGFloat(x) ceilf(x)
#endif
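For example, a hypothetical helper built on these macros (assuming the CoreGraphics headers that define CGFloat and CGFLOAT_IS_DOUBLE are included) compiles without a truncation warning whether CGFloat is float or double:

#include <CoreGraphics/CGBase.h>   /* CGFloat, CGFLOAT_IS_DOUBLE */

/* Hypothetical helper: round a content height down to a whole number of rows.
   floorCGFloat expands to floor or floorf to match CGFloat's width. */
CGFloat wholeRows(CGFloat contentHeight, CGFloat rowHeight)
{
    return floorCGFloat(contentHeight / rowHeight);
}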
Does GCC have a memory alignment pragma, akin to #pragma vector aligned in the Intel compiler?
I would like to tell the compiler to optimize a particular loop using aligned load/store instructions. To avoid possible confusion, this is not about struct packing.
e.g.:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
    q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
    q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
    q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
    q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
    q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
    q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks
You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps GCC but not clang 7.0 or ICC 19; see the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps, instead of using a separate movups.) You have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
From http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
typedef double aligned_double __attribute__((aligned (16)));
// Note: sizeof(aligned_double) is 8, not 16
void some_function(aligned_double *x, aligned_double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        // math!
    }
}
This won't make aligned_double 16 bytes wide. It will just make it aligned to a 16-byte boundary, or rather the first one in an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive, I start to see a lot of vector ops. I am using a Power architecture computer at the moment, so it's AltiVec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because AltiVec doesn't support double-precision floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler?
Does GCC have a memory alignment pragma, akin to #pragma vector aligned?
It looks like newer versions of GCC have __builtin_assume_aligned:
Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)
This function returns its first argument, and allows the compiler to assume that the returned pointer is at least align bytes aligned.
This built-in can have either two or three arguments; if it has three, the third argument should have integer type, and if it is nonzero it means a misalignment offset. For example:
void *x = __builtin_assume_aligned (arg, 16);
means that the compiler can assume x, set to arg, is at least 16-byte aligned, while:
void *x = __builtin_assume_aligned (arg, 32, 8);
means that the compiler can assume for x, set to arg, that (char *) x - 8 is 32-byte aligned.
Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4. But I do not know where the cut-off point is.
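As a rough sketch of how the built-in could be applied to a loop like the one in the question (the names dot3, xr, yr, zr are made up, and the data is flattened to plain double arrays instead of the question's Ix(a,i,j) accessors):

/* Sketch: promise GCC that the input pointers are 16-byte aligned,
   so it may use aligned loads when it vectorizes the loop. */
double dot3(const double *xr, const double *yr, const double *zr, int n)
{
    const double *x = (const double *) __builtin_assume_aligned(xr, 16);
    const double *y = (const double *) __builtin_assume_aligned(yr, 16);
    const double *z = (const double *) __builtin_assume_aligned(zr, 16);

    double q = 0.0;
    for (int i = 0; i < n; ++i)
        q += x[i] * y[i] * z[i];
    return q;
}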