Return numpy matrix from SWIG method using numpy.i - numpy

I've found the following SWIG rule in numpy.i documentation.
( DATA_TYPE ARGOUT_ARRAY2[ANY][ANY] )
Frankly, this snippet gives no idea how to wrap a C++ function returning a 3*3 matrix. What does "ANY" mean? Suppose, I've got a function with a signature void make_matrix(double** matrix). It allocates 3*3*sizeof(double) bytes with malloc() and writes them to out parameter. Now to apply this rule? How to bind 3 to "ANY"? How to bind matrix to "ARGOUT_ARRAY2"? How to bind DATA_TYPE to double? Is there any kind of documentation covering these details?

Based on the docs you cite (in the Input Arrays section):
The first signature listed, ( DATA_TYPE IN_ARRAY[ANY] ) is for one-dimensional arrays with hard-coded dimensions. Likewise, ( DATA_TYPE IN_ARRAY2[ANY][ANY] ) is for two-dimensional arrays with hard-coded dimensions, and similarly for three-dimensional.
The all caps portions are place holders for the values you are actually using. Proper use is dependent on a thorough understanding of SWIG syntax.

If you happen to be using the venerable Eigen library to represent your matrices in C++, then there are some very useful SWIG wrappers available within the Biomechanical Toolkit which allow Eigen matrices to be transparently converted into numpy arrays.
Your SWIG interface definition should contain something along the following lines:
%module(docstring="Simple SWIG+Eigen demo") myeigenswig
%include eigen.i
%include numpy.i
%{
#define SWIG_FILE_WITH_INIT
#include <Eigen/Core>
#include "mylibrary.hpp"
%}
%init %{
import_array();
%}
%eigen_typemaps(Eigen::VectorXd)
%eigen_typemaps(Eigen::MatrixXd)
Eigen::MatrixXd myFunction();
Another option, if you're not using Eigen, is to convert your 3x3 C++ array into a 9-element std::vector<double>, which SWIG will easily convert into a Python list, and which can then be turned into a numpy array by invoking numpy.asarray(vectorFromCxx).reshape((3,3)). This won't be especially efficient, involving quite a bit of copying between different data structures, but may be sufficient for your application unless you're passing many matrices between C++ & Python.

Related

Stan variable declarations: difference in use between var_type var_name[length] and vector[length] var_name

I am new at Stan and I'm struggling to understand the difference in how different variable declaration styles are used. In particular, I am confused about when I should put square brackets after the variable type and when I should put them after the variable name. For example, given int<lower = 0> L; // length of my data, let's consider:
real N[L]; // my variable
versus
vector[L] N; // my variable
From what I understand, both declare a variable N as a vector of length L.
Is the only difference between the two that the first way specifies the variable type? Can they be used interchangeably? Should they belong do different parts of the Stan code (e.g., data vs parameters or model)?
Thanks for explaining!
real name[size] and vector[size] name can be used pretty interchangeably. They are stored differently internally, so you can get better performance with one or the other. Some operations might also be restricted to one and the other (e.g. vector multiplication) and the optimal order to loop over them changes. E.g. with a matrix vs. a 2-D array, it is more efficient to loop over rows first vs. columns first, but those will come up if you have a more specific example. The way to read this is:
real name[size];
means name is an array of type real, so a bunch of reals that are stored together.
vector[size] name;
means that name is a vector of size size, which is also a bunch of reals stored together. But the vector data type in STAN is based on the eigen c++ library (c++) and that allows for other operations.
You can also create arrays of vectors like this:
vector[N] name[K];
which is going to produce an array of K vectors of size N.
Bottom line: You can get any model running with using vector or real, but they're not necessarily equivalent in the computational efficiency.

Is it possible to get the native CPU size of an integer in Rust?

For fun, I'm writing a bignum library in Rust. My goal (as with most bignum libraries) is to make it as efficient as I can. I'd like it to be efficient even on unusual architectures.
It seems intuitive to me that a CPU will perform arithmetic faster on integers with the native number of bits for the architecture (i.e., u64 for 64-bit machines, u16 for 16-bit machines, etc.) As such, since I want to create a library that is efficient on all architectures, I need to take the target architecture's native integer size into account. The obvious way to do this would be to use the cfg attribute target_pointer_width. For instance, to define the smallest type which will always be able to hold more than the maximum native int size:
#[cfg(target_pointer_width = "16")]
type LargeInt = u32;
#[cfg(target_pointer_width = "32")]
type LargeInt = u64;
#[cfg(target_pointer_width = "64")]
type LargeInt = u128;
However, while looking into this, I came across this comment. It gives an example of an architecture where the native int size is different from the pointer width. Thus, my solution will not work for all architectures. Another potential solution would be to write a build script which codegens a small module which defines LargeInt based on the size of a usize (which we can acquire like so: std::mem::size_of::<usize>().) However, this has the same problem as above, since usize is based on the pointer width as well. A final obvious solution is to simply keep a map of native int sizes for each architecture. However, this solution is inelegant and doesn't scale well, so I'd like to avoid it.
So, my questions: is there a way to find the target's native int size, preferably before compilation, in order to reduce runtime overhead? Is this effort even worth it? That is, is there likely to be a significant difference between using the native int size as opposed to the pointer width?
It's generally hard (or impossible) to get compilers to emit optimal code for BigNum stuff, that's why https://gmplib.org/ has its low level primitive functions (mpn_... docs) hand-written in assembly for various target architectures with tuning for different micro-architecture, e.g. https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/core2/mul_basecase.asm for the general case of multi-limb * multi-limb numbers. And https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/coreisbr/aors_n.asm for mpn_add_n and mpn_sub_n (Add OR Sub = aors), tuned for SandyBridge-family which doesn't have partial-flag stalls so it can loop with dec/jnz.
Understanding what kind of asm is optimal may be helpful when writing code in a higher level language. Although in practice you can't even get close to that so it sometimes makes sense to use a different technique, like only using values up to 2^30 in 32-bit integers (like CPython does internally, getting the carry-out via a right shift, see the section about Python in this). In Rust you do have access to add_overflow to get the carry-out, but using it is still hard.
For practical use, writing Rust bindings for GMP is probably your best bet, unless that already exists.
Using the largest chunks possible is very good; on all current CPUs, add reg64, reg64 has the same throughput and latency as add reg32, reg32 or reg8. So you get twice as much work done per unit. And carry propagation through 64 bits of result in 1 cycle of latency.
(There are alternate ways to store BigInteger data that can make SIMD useful; #Mysticial explains in Can long integer routines benefit from SSE?. e.g. 30 value bits per 32-bit int, allowing you to defer normalization until after a few addition steps. But every use of such numbers has to be aware of these issues so it's not an easy drop-in replacement.)
In Rust, you probably want to just use u64 regardless of the target, unless you really care about small-number (single-limb) performance on 32-bit targets. Let the compiler build u64 operations for you out of add / adc (add with carry).
The only thing that might need to be ISA-specific is if u128 is not available on some targets. You want to use 64 * 64 => 128-bit full multiply as your building block for multiplication; if the compiler can do that for you with u128 then that's great, especially if it inlines efficiently.
See also discussion in comments under the question.
One stumbling block for getting compilers to emit efficient BigInt addition loops (even inside the body of one unrolled loop) is writing an add that takes a carry input and produces a carry output. Note that x += 0xff..ff + carry=1 needs to produce a carry out even though 0xff..ff + 1 wraps to zero. So in C or Rust, x += y + carry has to check for carry out in both the y+carry and the x+= parts.
It's really hard (probably impossible) to convince compiler back-ends like LLVM to emit a chain of adc instructions. An add/adc is doable when you don't need the carry-out from adc. Or probably if the compiler is doing it for you for u128.overflowing_add
Often compilers will turn the carry flag into a 0 / 1 in a register instead of using adc. You can hopefully avoid that for at least pairs of u64 in addition by combining the input u64 values to u128 for u128.overflowing_add. That will hopefully not cost any asm instructions because a u128 already has to be stored across two separate 64-bit registers, just like two separate u64 values.
So combining up to u128 could just be a local optimization for a function that adds arrays of u64 elements, to get the compiler to suck less.
In my library ibig what I do is:
Select architecture-specific size based on target_arch.
If I don't have a value for an architecture, select 16, 32 or 64 based on target_pointer_width.
If target_pointer_width is not one of these values, use 64.

How to serialize slice without length using bincode?

I'm using the bincode crate to write a structure into a file. The structure contains a slice with a fixed size. How can I force bincode to write only the slice's content without the slice's length?
#![allow(unstable)]
#![feature(custom_derive, plugin)]
#![plugin(serde_macros)]
extern crate serde;
extern crate bincode;
use std::fs::File;
use bincode::serde::serialize_into;
use bincode::SizeLimit;
#[derive(Serialize)]
struct Foo([u8; 16]);
fn main() {
let data = Foo([0; 16]);
let mut writer = File::create("test.bin").unwrap();
serialize_into::<File, Foo>(&mut writer, &data, SizeLimit::Infinite).unwrap();
}
File 'test.bin' has 24 bytes size instead of 16.
I saw related remark in documentation of bincode, but I did not understand how to use it.
a slice with a fixed size
[u8; 16] is not a slice. It is an array which may be coerced to a slice.
Anyway... I do not believe that you can. The important function appears to be Serializer::serialize_fixed_size_array which is not implemented by the current serializer. That means it defaults to behaving the same as a slice.
Since slices do not have a length known at compile time, they must have their size written when serialized.
If no one else can provide a better suggestion, it's possible that the maintainer could find a way to make this happen. You may want to politely ask the maintainer for this feature or offer to help with the work.
Beyond that, it sounds like you are trying to make the bincode output fit a pre-existing format. That doesn't really make sense; bincode is its own format and had already made various choices and tradeoffs.
If you need to, you could implement your own encoder / decoder (either using serde or not). If you are concerned about file size, you can combine bincode with a compression step as well.

How do I limit the scope of the preprocessing definitions used?

How do I preprocess a code base using the clang (or gcc) preprocessor while limiting its text processing to use only #define entries from a single header file?
This is useful generally: imagine you want to preview the immediate result of some macros that you are currently working on… without having all the clutter that results from the mountain of includes inherent to C.
Imagine a case, where there are macros that yield a backward compatible call or an up-to-date one based on feature availability.
#if __has_feature(XYZ)
# define JX_FOO(_o) new_foo(_o)
# define JX_BAR(_o) // nop
...
#else
# define JX_FOO(_o) old_foo(_o)
# define JX_BAR(_o) old_bar(_o)
...
#endif
A concrete example is a collection of Objective-C code that was ported to be ARC-compatible (Automatic Reference Counting) from manual memory management (non-ARC) using a collection of macros (https://github.com/JanX2/google-diff-match-patch-Objective-C/blob/master/JXArcCompatibilityMacros.h) so that it compiles both ways afterwards.
At some point, you want to drop non-ARC support to improve readability and maintainability.
Edit: The basis for getting the preprocessor output is described here: C, Objective-C preprocessor output
Edit 2: If someone has details of how the source-to-source transformation options in Xcode are implemented (Edit > Refactor > Convert To…), that might help.
If you are writing the file from scratch or all the includes are in one place, why not wrap them inside of:
#ifndef MACRO_DEBUG
#include "someLib.h"
/* ... */
#endif
But as I mentioned, this only works when the includes are in consecutive lines and in the best case, you are starting to write the file yourself from scratch so you don't have to go and look for the includes.
This is a perfect case for sed/awk. However there exists an even better tool available for the exact use-case that you mention. Checkout coan.
To pre-process a source file as if the symbol <SYMBOL>is defined,
$ coan source -D<SYMBOL> sourcefile.c
Similarly to pre-process a source file as if the symbol <SYMBOL>is NOT defined,
$ coan source -U<SYMBOL> source.c
This is a bit of a stupid solution, but it works: apparently you can use AppCode’s refactoring to delete uses of a macro.
This limits the solution to OS X, though. It also is slightly tedious, because you have to do this manually for every JX_FOO() and JX_BAR().

How do I access an integer array within a struct/class from in-line assembly (blackfin dialect) using gcc?

Not very familiar with in-line assembly to begin with, and much less with that of the blackfin processor. I am in the process of migrating a legacy C application over to C++, and ran into a problem this morning regarding the following routine:
//
void clear_buffer ( short * buffer, int len ) {
__asm__ (
"/* clear_buffer */\n\t"
"LSETUP (1f, 1f) LC0=%1;\n"
"1:\n\t"
"W [%0++] = %2;"
:: "a" ( buffer ), "a" ( len ), "d" ( 0 )
: "memory", "LC0", "LT0", "LB0"
);
}
I have a class that contains an array of shorts that is used for audio processing:
class AudProc
{
enum { buffer_size = 512 };
short M_samples[ buffer_size * 2 ];
// remaining part of class omitted for brevity
};
Within the AudProc class I have a method that calls clear_buffer, passing it the samples array:
clear_buffer ( M_samples, sizeof ( M_samples ) / 2 );
This generates a "Bus Error" and aborts the application.
I have tried making the array public, and that produces the same result. I have also tried making it static; that allows the call to go through without error, but no longer allows for multiple instances of my class as each needs its own buffer to work with. Now, my first thought is, it has something to do with where the buffer is in memory, or from where it is being accessed. Does something need to be changed in the in-line assembly to make this work, or in the way it is being called?
Thought that this was similar to what I was trying to accomplish, but it is using a different dialect of asm, and I can't figure out if it is the same problem I am experiencing or not:
GCC extended asm, struct element offset encoding
Anyone know why this is occurring and how to correct it?
Does anyone know where there is helpful documentation regarding the blackfin asm instruction set? I've tried looking on the ADSP site, but to no avail.
I would suspect that you could define your clear_buffer as
inline void clear_buffer (short * buffer, int len) {
memset (buffer, 0, sizeof(short)*len);
}
and probably GCC is able to optimize (when invoked with -O2 or -O3) that cleverly (because GCC knows about memset).
To understand assembly code, I suggest running gcc -S -O -fverbose-asm on some small C file, then to look inside the produced .s file.
I would have take a guess, because I don't know Blackfin assembler:
That LC0 sounds like "loop counter", LSETUP looks like a macro/insn, which, well, setups a loop between two labels and with a certain loop counter.
The "%0" operands is apparently the address to write to and we can safely guess it's incremented in the loop, in other words it's both an input and output operand and should be described as such.
Thus, I suggest describing it as in input-output operand, using "+" constraint modifier, as follows:
void clear_buffer ( short * buffer, int len ) {
__asm__ (
"/* clear_buffer */\n\t"
"LSETUP (1f, 1f) LC0=%1;\n"
"1:\n\t"
"W [%0++] = %2;"
: "+a" ( buffer )
: "a" ( len ), "d" ( 0 )
: "memory", "LC0", "LT0", "LB0"
);
}
This is, of course, just a hypothesis, but you could disassemble the code and check if by any chance GCC allocated the same register for "%0" and "%2".
PS. Actually, only "+a" should be enough, early-clobber is irrelevant.
For anyone else who runs into a similar circumstance, the problem here was not with the in-line assembly, nor with the way it was being called: it was with the classes / structs in the program. The class that I believed to be the offender was not the problem - there was another class that held an instance of it, and due to other members of that outer class, the inner one was not aligned on a word boundary. This was causing the "Bus Error" that I was experiencing. I had not come across this before because the classes were not declared with __attribute__((packed)) in other code, but they are in my implementation.
Giving Type Attributes - Using the GNU Compiler Collection (GCC) a read was what actually sparked the answer for me. Two particular attributes that affect memory alignment (and, thus, in-line assembly such as I am using) are packed and aligned.
As taken from the aforementioned link:
aligned (alignment)
This attribute specifies a minimum alignment (in bytes) for variables of the specified type. For example, the declarations:
struct S { short f[3]; } __attribute__ ((aligned (8)));
typedef int more_aligned_int __attribute__ ((aligned (8)));
force the compiler to ensure (as far as it can) that each variable whose type is struct S or more_aligned_int is allocated and aligned at least on a 8-byte boundary. On a SPARC, having all variables of type struct S aligned to 8-byte boundaries allows the compiler to use the ldd and std (doubleword load and store) instructions when copying one variable of type struct S to another, thus improving run-time efficiency.
Note that the alignment of any given struct or union type is required by the ISO C standard to be at least a perfect multiple of the lowest common multiple of the alignments of all of the members of the struct or union in question. This means that you can effectively adjust the alignment of a struct or union type by attaching an aligned attribute to any one of the members of such a type, but the notation illustrated in the example above is a more obvious, intuitive, and readable way to request the compiler to adjust the alignment of an entire struct or union type.
As in the preceding example, you can explicitly specify the alignment (in bytes) that you wish the compiler to use for a given struct or union type. Alternatively, you can leave out the alignment factor and just ask the compiler to align a type to the maximum useful alignment for the target machine you are compiling for. For example, you could write:
struct S { short f[3]; } __attribute__ ((aligned));
Whenever you leave out the alignment factor in an aligned attribute specification, the compiler automatically sets the alignment for the type to the largest alignment that is ever used for any data type on the target machine you are compiling for. Doing this can often make copy operations more efficient, because the compiler can use whatever instructions copy the biggest chunks of memory when performing copies to or from the variables that have types that you have aligned this way.
In the example above, if the size of each short is 2 bytes, then the size of the entire struct S type is 6 bytes. The smallest power of two that is greater than or equal to that is 8, so the compiler sets the alignment for the entire struct S type to 8 bytes.
Note that although you can ask the compiler to select a time-efficient alignment for a given type and then declare only individual stand-alone objects of that type, the compiler's ability to select a time-efficient alignment is primarily useful only when you plan to create arrays of variables having the relevant (efficiently aligned) type. If you declare or use arrays of variables of an efficiently-aligned type, then it is likely that your program also does pointer arithmetic (or subscripting, which amounts to the same thing) on pointers to the relevant type, and the code that the compiler generates for these pointer arithmetic operations is often more efficient for efficiently-aligned types than for other types.
The aligned attribute can only increase the alignment; but you can decrease it by specifying packed as well. See below.
Note that the effectiveness of aligned attributes may be limited by inherent limitations in your linker. On many systems, the linker is only able to arrange for variables to be aligned up to a certain maximum alignment. (For some linkers, the maximum supported alignment may be very very small.) If your linker is only able to align variables up to a maximum of 8-byte alignment, then specifying aligned(16) in an __attribute__ still only provides you with 8-byte alignment. See your linker documentation for further information.
.
packed
This attribute, attached to struct or union type definition, specifies that each member (other than zero-width bit-fields) of the structure or union is placed to minimize the memory required. When attached to an enum definition, it indicates that the smallest integral type should be used.
Specifying this attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members. Specifying the -fshort-enums flag on the line is equivalent to specifying the packed attribute on all enum definitions.
In the following example struct my_packed_struct's members are packed closely together, but the internal layout of its s member is not packed—to do that, struct my_unpacked_struct needs to be packed too.
struct my_unpacked_struct
{
char c;
int i;
};
struct __attribute__ ((__packed__)) my_packed_struct
{
char c;
int i;
struct my_unpacked_struct s;
};
You may only specify this attribute on the definition of an enum, struct or union, not on a typedef that does not also define the enumerated type, structure or union.
The problem which I was experiencing was specifically due to the use of packed. I attempted to simply add the aligned attribute to the structs and classes, but the error persisted. Only removing the packed attribute resolved the problem. For now, I am leaving the aligned attribute on them and testing to see if I find any improvements in the efficiency of the code as mentioned above, simply due to their being aligned on word boundaries. The application makes use of arrays of these structures, so perhaps there will be better performance, but only profiling the code will say for certain.