Say I have a memory buffer containing a vector of std::decimal::decimal128 (IEEE 754R) elements. Can I wrap and expose that as a NumPy array, and do fast operations on those decimal vectors, like computing the variance or auto-correlation over the vector? How would I best do that?
Numpy does not support such a data type yet (at least on mainstream architectures). Only float16, float32, float64 and the non-standard native extended double (generally 80 bits) are supported; put shortly, only floating-point types natively supported by the target architecture. If the target machine natively supported 128-bit floating-point numbers, then you could try the numpy.longdouble type, but I do not expect this to be the case. In practice, neither x86 nor ARM processors support that yet. IBM processors like POWER9 support it natively, but I am not sure they (fully) support the IEEE-754R standard. For more information please read this.

Note that you could theoretically wrap binary data in Numpy types, but you would not be able to do anything (really) useful with it. The Numpy code can theoretically be extended with new types, but note that Numpy is written in C, not C++, so adding std::decimal::decimal128 to the source code will not be easy.
Note that if you really want to wrap such a type in a Numpy array without having to change/rebuild the Numpy code, you could wrap your type in a pure-Python class. However, be aware that the performance will be very bad, since using pure-Python objects prevents all the optimizations done in Numpy (e.g. SIMD vectorization, use of fast native code, algorithms specialized for a given type, etc.).
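For illustration, here is a minimal sketch of that "wrap without computing" point, assuming the buffer arrives as raw bytes from the C++ side: the data can be viewed zero-copy through an opaque 16-byte void dtype, but no arithmetic (and hence no fast variance or auto-correlation) is possible on it.

    import numpy as np

    # Pretend these bytes came from the C++ side: 4 decimal128 values, 16 bytes each.
    buf = bytes(16 * 4)

    # Zero-copy view of the raw bytes as opaque 16-byte elements ('V16').
    # Indexing and slicing work, but NumPy cannot compute with this type.
    a = np.frombuffer(buf, dtype="V16")
    print(a.shape)  # (4,)
    print(a[0])     # raw bytes of the first decimal128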
Good afternoon!
Currently, I'm digging into the reason why numpy is fast.
More specifically, I'm wondering why np.sum() is that fast.
My guess is that np.sum() uses some kind of SIMD optimization, but I'm not sure whether it does.
Is there any way that I can check which numpy's method uses SIMD operations?
Thx in advance
Numpy does not use SIMD instructions for trivial np.sum calls yet. However, I made this PR, which should be merged soon and fixes this issue for integers (it will use the 256-bit AVX2 instruction set if available and the 128-bit SSE/Neon instruction set otherwise). Using SIMD instructions for np.sum with floating-point numbers is a bit harder, due to the current algorithm used (pair-wise summation) and because one has to be careful about precision.
Is there any way that I can check which numpy's method uses SIMD operations?
Low-level profilers and hardware-counter-based tools (e.g. Linux perf, Intel VTune) can do that, but they are not very user-friendly (i.e. you need to have some notions of assembly, know roughly how processors work, and read some documentation about hardware counters). Another solution is to look at the disassembled code of Numpy using tools like objdump (this requires pretty good knowledge of assembly and the name of the C function called), or simply look at the Numpy C code (note that compilers can auto-vectorize loops, so this solution is not so simple).
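That said, a quick sanity check that stops short of per-function profiling is to ask NumPy itself what it was built with; recent versions print a SIMD-extensions section (the exact output depends on your NumPy version):

    import numpy as np

    # Prints build/runtime information; on recent NumPy versions the output
    # includes the SIMD instruction sets enabled at build time and detected
    # on the current CPU (SSE, AVX2, AVX-512, Neon, ...).
    np.show_config()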
Update: If you are using np.sum on contiguous double-precision Numpy arrays, then the benefit of using SIMD instructions is not so big. Indeed, for large contiguous double-precision arrays not fitting in the cache, a scalar implementation should be able to saturate the memory bandwidth on most PCs (but certainly not on the Apple M1 or on computing servers), especially on high-frequency processors. On small arrays (e.g. <4000 items), Numpy overheads dominate the execution time of such a function. For contiguous medium-sized arrays (e.g. >10K and <1M items), using SIMD instructions should result in a significant speed-up, especially for single-precision arrays (e.g. 3-4 times faster for double precision and 6-8 times faster for single precision on mainstream machines).
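To see where these regimes fall on your own machine, here is a rough benchmark sketch (array sizes and the bandwidth estimate are only illustrative):

    import numpy as np
    from timeit import timeit

    for n in (1_000, 100_000, 10_000_000):
        a = np.random.rand(n)                    # contiguous float64 array
        t = timeit(a.sum, number=100) / 100      # mean seconds per call
        print(f"n={n:>9}  {t * 1e6:9.2f} us  ~{a.nbytes / t / 1e9:5.1f} GB/s")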
Are the feature vectors generated by featuretools/DFS dense or sparse or does it depend on something?
The sparseness of feature vectors generated by Featuretools will in general depend on:
- the EntitySet in question, and
- the primitives chosen.
Primitives are meant to give back dense information. While it's possible (but not helpful) to construct example EntitySets that will make the output of a primitive sparse, it's more common for a primitive to give back no information than sparse information.
However, certain primitives and workflows are more likely than others to give back sparse output. A big one to watch out for is feature encoding, which uses one-hot encoding. Because that generates a vector with 1s only where a certain value occurs, an infrequently occurring categorical value would immediately be converted into a sparse vector. Using Where aggregation primitives can sometimes have similar results.
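As a small illustration of the one-hot effect (plain pandas rather than Featuretools, with made-up data): a category that occurs in 2% of rows produces a one-hot column that is 98% zeros.

    import pandas as pd

    # 98 occurrences of "a", one each of "b" and "c".
    s = pd.Series(["a"] * 98 + ["b", "c"])

    # One-hot encode; rare categories yield mostly-zero (sparse) columns.
    dummies = pd.get_dummies(s)
    print((dummies == 0).mean())  # fraction of zeros in each column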
UPD: the question in its original form is poorly formulated, because I badly confused terminology (SIMD vs. vectorized computations) and gave too broad an example that does not specify exactly what the problem is; I voted to close it as "unclear what you're asking", and I'll link a better-formulated question above whenever it appears.
In mathematics, one would usually describe n-dimensional tensor computation using index notation, that would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<=i<N, 0<=j<M, 0<=k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we would have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ..., which is often quite tricky.
Expressions in Einstein notation, like A_ijk*B_kj, are generally solved via np.einsum (using tensordot, sum and transpose, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than a single time in the expression).
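For instance, a contraction like A_ijk*B_kj is a one-liner (the shapes below are made up):

    import numpy as np

    A = np.random.rand(2, 3, 4)  # A_ijk
    B = np.random.rand(4, 3)     # B_kj

    # Sum over the repeated indices j and k: result_i = sum_jk A_ijk * B_kj
    result = np.einsum("ijk,kj->i", A, B)
    print(result.shape)  # (2,)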
I wonder if there are any language-agnostic libraries that take an expression in a certain form (let's say, the form above) and translate it into some Intermediate Representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in Python), but I'm asking about a slightly different thing: translating an arbitrary expression in the form above into a finite sequence of SIMD commands, like element-wise ops, matvecs, tensordots, etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)
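For concreteness, here is how the example expression above can be evaluated in NumPy with broadcasting and fancy indexing (a sketch with made-up shapes, assuming everything is integer, since B and d are used as indices):

    import numpy as np

    N, M, K = 4, 5, 6
    rng = np.random.default_rng(0)
    B = rng.integers(0, N, size=(K, M))      # integer: used as value AND index
    C = rng.integers(0, 10, size=(K, N, N))  # shape chosen so all indices are valid
    d = rng.integers(0, K, size=K)
    f = rng.integers(0, 10, size=M)

    # Open index grids shaped for broadcasting: i->(N,1,1), j->(1,M,1), k->(1,1,K).
    i, j, k = np.ix_(np.arange(N), np.arange(M), np.arange(K))

    # One vectorized statement; the C[...] term is a gather via fancy indexing.
    A = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]
    print(A.shape)  # (N, M, K)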
Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear on the right-hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C with Intel's x86 intrinsics (or use auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
I haven't done much with GPGPU, but they do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
SIMD commands, like element-wise ops, matvecs, tensordots, etc.
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.
Is there a way to allocate the data section (i.e. the data) of a numpy array on a page boundary?
As for why I care: if I were using PyOpenCL on an Intel device and wanted to create a buffer using CL_MEM_USE_HOST_PTR, they recommend that the data be 1) page-aligned and 2) sized as a multiple of a cache line.
There are various ways in C of allocating page aligned memory, see for example: aligned malloc() in GCC?
I'm not aware that Numpy has any explicit calls to align memory at this time. The only way I can think of doing this, short of Cython as suggested by @Saulio Castro, would be through judicious allocation of memory, with "padding", using the numpy allocation or PyOpenCL APIs.
You would need to create a buffer "padded" so its data starts on a 4096-byte (page) boundary. You may also need to "pad" the individual data structure elements you are allocating in the array so they, in turn, fill out a multiple of the 64-byte cache line. This would of course depend on what your elements look like: whether they are built-in numpy data types, or structures created using the numpy dtype. The API for dtype has an "align" keyword, but I would be wary of that, based on the discussion at this link.
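For reference, the align keyword only inserts C-struct-style padding between fields; it does not control where the buffer itself starts:

    import numpy as np

    packed  = np.dtype([("a", "u1"), ("b", "f8")])
    aligned = np.dtype([("a", "u1"), ("b", "f8")], align=True)
    print(packed.itemsize, aligned.itemsize)  # 9 16 (7 pad bytes before "b")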
An old-school trick to align a structure is to start with the largest elements, work your way down, then "pad" with enough uint8's so that one or N structs fill out the alignment boundary.
Hope that's not too vague...
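For what it's worth, the over-allocate-and-slice trick looks like this in pure NumPy (aligned_empty is a made-up helper name, not an official API):

    import numpy as np

    def aligned_empty(shape, dtype=np.float64, alignment=4096):
        """Uninitialized array whose data pointer is aligned to `alignment` bytes."""
        dtype = np.dtype(dtype)
        nbytes = int(np.prod(shape)) * dtype.itemsize
        # Over-allocate, then slice in at the first aligned byte offset.
        buf = np.empty(nbytes + alignment, dtype=np.uint8)
        offset = (-buf.ctypes.data) % alignment
        return buf[offset:offset + nbytes].view(dtype).reshape(shape)

    a = aligned_empty((1024, 1024))
    assert a.ctypes.data % 4096 == 0  # data section starts on a page boundary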
Now that pandas provides a data frame structure, is there any need for structured/record arrays in numpy? There are some modifications I need to make to an existing code which requires this structured array type framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?
pandas's DataFrame is a high-level tool, while structured arrays are a very low-level one, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (conversely, structured arrays can't do most of the things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
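For concreteness, this is the kind of low-level workflow meant here (the file name and record layout are invented):

    import numpy as np

    # Nested structured dtype mapped straight onto a binary file on disk.
    dt = np.dtype([("pos", [("x", "f8"), ("y", "f8")]), ("id", "i4")])
    a = np.memmap("points.bin", dtype=dt, mode="w+", shape=(1000,))

    a["pos"]["x"] = np.arange(1000)   # columnar access, no full load into RAM
    print(a["pos"]["x"].mean())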
I'm currently in the middle of a transition to Pandas DataFrames from various Numpy arrays. This has been relatively painless since Pandas, AFAIK, is built largely on top of Numpy. What I mean by that is that .mean(), .sum(), etc. all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitrary pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round-trips to my database).
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
Also, if you're like me and your main use of Scipy is the statistics module, pystatsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth
I never took the time to dig into pandas, but I use structured array quite often in numpy. Here are a few considerations:
structured arrays are as convenient as recarrays with less overhead, if you don't mind losing the ability to access fields by attribute. But then, have you ever tried to use min or max as a field name in a recarray?
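To see the gotcha hinted at above (a quick sketch):

    import numpy as np

    # Structured array: fields are accessed by key, with little overhead.
    a = np.zeros(3, dtype=[("min", "f8"), ("max", "f8")])
    print(a["min"])      # works fine

    # recarray: attribute access, but ndarray methods shadow field names.
    r = a.view(np.recarray)
    print(r.min)         # resolves to the ndarray.min method, NOT the "min" field!
    print(r["min"])      # key access still works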
NumPy has been developed over a far longer period than pandas, with a larger crew, and it has become ubiquitous enough that a lot of third-party packages rely on it. You can expect structured arrays to be more portable than pandas DataFrames.
Are pandas DataFrames easily picklable? Can they be sent back and forth with PyTables, for example?
Unless you're 100% sure that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.