Numpy vs Eigen vs Xtensor Linear Algebra Benchmark Oddity

I was recently trying to compare different Python and C++ matrix libraries against each other for their linear algebra performance, in order to see which one(s) to use in an upcoming project. While there are multiple types of linear algebra operations, I have chosen to focus mainly on matrix inversion, as it seems to be the one giving strange results. I have written the code below for the comparison, but I think I must be doing something wrong.
C++ Code
#include <iostream>
#include "eigen/Eigen/Dense"
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor-blas/xlinalg.hpp> //-lblas -llapack for cblas, -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread for openblas
//including accurate timer
#include <chrono>
//including vector array
#include <vector>
void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats = 1000);
void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats = 1000);
int main()
{
    std::vector<int> sizings{1, 10, 100, 1000, 10000, 100000};
    basicMatrixComparisonEigen(sizings, 2);
    basicMatrixComparisonXtensor(sizings, 2);
    return 0;
}
void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats)
{
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    using time = std::chrono::high_resolution_clock;
    std::cout << "Timing Eigen: " << std::endl;
    for (auto &dim : dims)
    {
        std::cout << "Scale Factor: " << dim << std::endl;
        try
        {
            //Linear Operations
            auto l = Eigen::MatrixXd::Random(dim, dim);
            //Eigen Matrix inversion
            t1 = time::now();
            for (int i = 0; i < numrepeats; i++)
            {
                Eigen::MatrixXd pinv = l.completeOrthogonalDecomposition().pseudoInverse();
                //note this does not come out to be identity. The inverse is wrong.
                //std::cout<<l*pinv<<std::endl;
            }
            t2 = time::now();
            std::cout << "Eigen Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
            std::cout << "\n\n\n";
        }
        catch (const std::exception &e)
        {
            std::cout << "Error: '" << e.what() << "'\n";
        }
    }
}
void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats)
{
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;
    using time = std::chrono::high_resolution_clock;
    std::cout << "Timing Xtensor: " << std::endl;
    for (auto &dim : dims)
    {
        std::cout << "Scale Factor: " << dim << std::endl;
        try
        {
            //Linear Operations
            auto l = xt::random::randn<double>({dim, dim});
            //Xtensor Matrix inversion
            t1 = time::now();
            for (int i = 0; i < numrepeats; i++)
            {
                auto inverse = xt::linalg::pinv(l);
                //something is wrong here. The inverse is not actually the inverse when you multiply it out.
                //std::cout << xt::linalg::dot(inverse,l) << std::endl;
            }
            t2 = time::now();
            std::cout << "Xtensor Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
            std::cout << "\n\n\n";
        }
        catch (const std::exception &e)
        {
            std::cout << "Error: '" << e.what() << "'\n";
        }
    }
}
This is compiled with:
g++ cpp_library.cpp -O2 -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread -march=native -o benchmark.exe
for OpenBLAS, and
g++ cpp_library.cpp -O2 -lblas -llapack -march=native -o benchmark.exe
for cBLAS.
g++ version 9.3.0.
And for Python 3:
import numpy as np
from datetime import datetime as dt
#import timeit
start=dt.now()
l=np.random.rand(1000,1000)
for i in range(2):
    result=np.linalg.inv(l)
end=dt.now()
print("Completed in: "+str((end-start)/2))
#print(np.matmul(l,result))
#print(np.dot(l,result))
#Timeit also gives similar results
I will focus on the largest size that runs in a reasonable amount of time on my computer: 1000x1000. I know that using only 2 repetitions introduces a bit of variance, but I've run it with more and the results are roughly the same as below:
Eigen 3.3.9: 196.804 milliseconds
Xtensor/Xtensor-blas w/ OpenBlas: 378.156 milliseconds
Numpy 1.17.4: 172.582 milliseconds
Is this a reasonable result to expect? Why are the C++ libraries slower than Numpy? All 3 packages are using some sort of LAPACK/BLAS backend, yet there is a significant difference between the 3. In particular, Xtensor pins my CPU at 100% usage with OpenBLAS's threads, yet still has the worst performance of the three.
I'm wondering whether the C++ libraries are actually performing the inverse/pseudoinverse of the matrix, and if this is what is causing these results. In the commented sections of the C++ test code, I have noted that when I sanity-checked the results from both Eigen and Xtensor, the product of the matrix and its supposed inverse was not even close to the identity matrix. I tried with smaller matrices (10x10) thinking it might be a precision error, but the problem remained. In another test, I checked the rank, and these matrices are full rank. To be sure I wasn't going crazy, I tried with inv() instead of pinv() in both cases, and the results are the same. Am I using the wrong functions for this linear algebra benchmark, or is Numpy just twisting the knife on two dysfunctional low-level libraries?
EDIT:
Thank you everyone for your interest in this problem. I think I have figured out the issue: Eigen and xtensor use lazy evaluation, and keeping the unevaluated expressions in auto variables was causing errors downstream, producing essentially random matrices instead of the inverted ones. I was able to correct the strange numerical inversion failure with the following replacements in the code:
auto temp = Eigen::MatrixXd::Random(dim, dim);
Eigen::MatrixXd l(dim,dim);
l=temp;
and
auto temp = xt::random::randn<double>({dim, dim});
xt::xarray<double> l =temp;
However, the timings didn't change much:
Eigen 3.3.9: 201.386 milliseconds
Xtensor/Xtensor-blas w/ OpenBlas: 337.299 milliseconds.
Numpy 1.17.4: (from before) 172.582 milliseconds
A little strangely, adding -O3 and -ffast-math actually slowed the code down slightly. -march=native gave the biggest performance increase for me when I tried it. Also, OpenBLAS is 2-3x faster than CBLAS for these problems.

Firstly, you are not computing the same things.
To compute the inverse of the matrix l, use l.inverse() for Eigen and xt::linalg::inv() for xtensor.
When you link a BLAS to Eigen or xtensor, these operations are automatically dispatched to the BLAS you have chosen.
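Note that, at least as I understand Eigen's documentation, the dispatch to an external BLAS/LAPACK only happens if you define the EIGEN_USE_BLAS / EIGEN_USE_LAPACKE macros before including the Eigen headers and link the corresponding libraries. A minimal sketch of what that looks like:
#define EIGEN_USE_BLAS     // route supported dense kernels (e.g. matrix products) to the linked BLAS
#define EIGEN_USE_LAPACKE  // route decompositions (LU, QR, ...) to LAPACKE
#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(100, 100);
    std::cout << (m * m.inverse()).norm() << std::endl; // ~10, the norm of the 100x100 identity
    return 0;
}
Link with something like -lopenblas -llapacke; the exact library names depend on your installation.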
I replaced the pseudoinverse calls with plain inverses, replaced auto with MatrixXd and xt::xtensor to avoid lazy evaluation, linked OpenBLAS to Eigen, xtensor and numpy, and compiled with only the -O3 flag. The following are the results on my MacBook Pro M1:
Eigen-3.3.9 (with openblas) - ~ 38 ms
Eigen-3.3.9 (without openblas) - ~ 85 ms
xtensor-master (with openblas) - ~41 ms
Numpy- 1.21.2 (with openblas) - ~35 ms.
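For reference, here is a rough sketch of what the corrected timing loop looks like with those changes applied (header paths, dim and numrepeats are placeholders; adjust them to your setup):
#include <chrono>
#include <iostream>
#include <Eigen/Dense>
#include <xtensor/xarray.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor-blas/xlinalg.hpp>

int main()
{
    using clk = std::chrono::high_resolution_clock;
    int dim = 1000;
    int numrepeats = 2;

    // Concrete matrix types force the random expressions to be evaluated up front.
    Eigen::MatrixXd le = Eigen::MatrixXd::Random(dim, dim);
    auto t1 = clk::now();
    for (int i = 0; i < numrepeats; i++)
    {
        Eigen::MatrixXd inv = le.inverse(); // plain inverse instead of pseudoInverse()
    }
    auto t2 = clk::now();
    std::cout << "Eigen inverse: "
              << std::chrono::duration<double, std::milli>(t2 - t1).count() / numrepeats
              << " ms" << std::endl;

    xt::xarray<double> lx = xt::random::randn<double>({dim, dim});
    t1 = clk::now();
    for (int i = 0; i < numrepeats; i++)
    {
        xt::xarray<double> inv = xt::linalg::inv(lx); // assign to a container, not to auto
    }
    t2 = clk::now();
    std::cout << "xtensor inverse: "
              << std::chrono::duration<double, std::milli>(t2 - t1).count() / numrepeats
              << " ms" << std::endl;
    return 0;
}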

Related

OpenMP with Segmentation fault (core dumped)

I encountered a problem when using OpenMP to parallelize my code. I have attached the simplest code that can reproduce my problem.
#include <iostream>
#include <vector>
using namespace std;
int main()
{
    int n = 10;
    int size = 1;
    vector<double> vec(1, double(1.0));
    double sum = 0.0;
    #pragma omp parallel for private(vec) reduction(+: sum)
    for (int i = 0; i != n; ++i)
    {
        /* in real case, complex operations applied on vec here */
        sum += vec[0];
    }
    cout << "sum: " << sum << endl;
    return 0;
}
I compile with g++ with the -fopenmp flag, and when I run the program it aborts with "Segmentation fault (core dumped)". I am wondering what's wrong with the code.
Note that vec should be set to private, since in the real code a complex operation is applied to vec in the for-loop.
The problem indeed comes from the private(vec) clause. There are two issues with this code.
First, from a semantics perspective, the private(vec) should be shared(vec), as the intent seems to be to work on the same std::vector instance in parallel. So, the code should look like this:
#pragma omp parallel for shared(vec) reduction(+: sum)
for (int i = 0; i != n; ++i)
{
    sum += vec[0];
}
In the previous code, the private(vec) made a private instance of std::vector for each thread and was supposed to initialize these instances by calling the default constructor of std::vector.
Second, the segmentation fault then arises from the fact that there is no vec[0] element in any of the private instances. This can be confirmed by calling vec.size() from within the threads.
PS: shared(vec) would have been the default sharing for vec as per the OpenMP specification anyway.
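If each thread really does need its own working copy of vec (as in the real code), firstprivate(vec) is another option: it copy-constructs a per-thread instance from the original, so vec[0] exists in every copy. A minimal sketch (compile with -fopenmp):
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    int n = 10;
    vector<double> vec(1, 1.0);
    double sum = 0.0;
    // firstprivate copy-constructs each thread's vec from the original one-element vector,
    // so writing to vec[0] inside the loop is safe and race-free.
    #pragma omp parallel for firstprivate(vec) reduction(+: sum)
    for (int i = 0; i != n; ++i)
    {
        vec[0] = 2.0 * i; /* stand-in for the complex operations on vec */
        sum += vec[0];
    }
    cout << "sum: " << sum << endl; // prints 90
    return 0;
}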

Eigen::placeholders::all for Eigen array/matrix with pybind11

I am building a C++ program linked to Python code, using pybind11.
I use Eigen for the matrix operations.
I am having issues with slicing an Eigen array.
According to Eigen documentation, it is possible to slice an array using Eigen::placeholders::all -
std::vector<int> ind{4,2,5,5,3};
MatrixXi A = MatrixXi::Random(4,6);
cout << "Initial matrix A:\n" << A << "\n\n";
cout << "A(all,ind):\n" << A(Eigen::placeholders::all,ind) << "\n\n";
However, when I try to use this syntax in my code, I get the following error:
error: ‘Eigen::indexing’ has not been declared
I found an explanation for this: the Eigen header I used came from pybind11, not from the original Eigen installation.
This explains the issue, but does not help with a solution.
I tried including the original Eigen headers, but that did not bring in the indexing or placeholders namespaces.
Thanks for your assistance!
edit:
Here is the code I tried to compile:
#include <pybind11/pybind11.h>
#include <pybind11/eigen.h>
#include <pybind11/stl.h>
#include <pybind11/numpy.h>
#include <pybind11/iostream.h>
#include <iostream>
#include <valarray>
#include <Eigen/Core>
namespace py = pybind11;
void example()
{
    std::vector<int> ind{4,2,5,5,3};
    Eigen::MatrixXi A = Eigen::MatrixXi::Random(4,6);
    std::cout << "Initial matrix A:\n" << A << "\n\n";
    std::cout << "A(all,ind):\n" << A(Eigen::placeholders::all,ind) << "\n\n";
}
For which I got the following error:
error: ‘Eigen::placeholders’ has not been declared
Eventually I was able to solve the issue: it was a problem with the CMake files; specifically, FindEigen3.cmake was missing from the cmake folder.
Somehow (probably because of the pybind11 Eigen header) the program was able to compile, but it could not find all the relevant headers.
After adding FindEigen3.cmake under the cmake folder, all include directories were correct, and I could use Eigen::placeholders::all.
Thanks @Homer512 for the assistance!
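As a side note, a quick way to check which Eigen the compiler is actually picking up (useful for diagnosing this kind of include-path mix-up) is to print Eigen's version macros; slicing with Eigen::placeholders::all needs Eigen 3.4 or newer. A small sketch, independent of pybind11:
#include <iostream>
#include <vector>
#include <Eigen/Dense>

int main()
{
    // If the include path points at an Eigen older than 3.4, the slicing line below
    // is what fails to compile; otherwise this prints the version actually in use.
    std::cout << "Eigen " << EIGEN_WORLD_VERSION << "."
              << EIGEN_MAJOR_VERSION << "." << EIGEN_MINOR_VERSION << std::endl;

    std::vector<int> ind{4,2,5,5,3};
    Eigen::MatrixXi A = Eigen::MatrixXi::Random(4,6);
    std::cout << A(Eigen::placeholders::all, ind) << std::endl;
    return 0;
}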

How should GMP/MPFR limbs be interpreted?

The arbitrary precision libraries GMP and MPFR use heap-allocated arrays of machine word-sized integers to store the limbs that make up the high precision number/mantissa.
How should this array of limbs be interpreted to recover the arbitrary precision integer number? In other words: for N limbs holding B bits each, how should I interpret them to recover the N*B bit number?
Does the limb size really affect the in-memory representation (see below)? If so, what is the rationale behind this?
Background:
I wrote a small program to look inside the representation, but I was confused by what I saw. The limbs seem to be ordered in most significant digit order, whereas the limbs themselves are in native least significant digit format. When representing the 64-bit word 0xAAAABBBBCCCCDDDD using 32-bit words and precision fixed to 128 bits, I see:
% c++ limbs.cpp -lgmp -lmpfr -o limbs && ./limbs
ccccdddd|aaaabbbb|00000000|00000000
00000000|00000000|ccccdddd|aaaabbbb
This seems to imply that the in-memory representation cannot be read back as a string of bits to recover the arbitrary precision number (e.g., if I loaded this into a register on a machine that supported N*B sized words). Furthermore, it also seems to suggest that the limb size changes the representation, so that I would not be able to deserialize a number without knowing which limb size was used to serialize it.
Here's my test program (uses 32-bit limbs with the __GMP_SHORT_LIMB macro):
#define __GMP_SHORT_LIMB
#include <gmp.h>
#include <mpfr.h>
#include <iomanip>
#include <iostream>
constexpr int PRECISION = 128;
void PrintLimbs(mp_limb_t const *const limbs) {
    std::cout << std::hex;
    constexpr int NUM_LIMBS = PRECISION / (8 * sizeof(mp_limb_t));
    for (int i = 0; i < NUM_LIMBS; ++i) {
        std::cout << std::setfill('0') << std::setw(2 * sizeof(mp_limb_t)) << limbs[i];
        if (i < NUM_LIMBS - 1) {
            std::cout << "|";
        }
    }
    std::cout << "\n";
}

int main() {
    { // GMP
        mpz_t num;
        mpz_init2(num, PRECISION);
        mpz_set_ui(num, 0xAAAABBBBCCCCDDDD);
        PrintLimbs(num->_mp_d);
        mpz_clear(num);
    }
    { // MPFR
        mpfr_t num;
        mpfr_init2(num, PRECISION);
        mpfr_set_ui(num, 0xAAAABBBBCCCCDDDD, MPFR_RNDN);
        PrintLimbs(num->_mpfr_d);
        mpfr_clear(num);
    }
    return 0;
}
3 things that matter for the byte representation:
The limb size depends on your machine and the chosen ABI. The real size is also affected by the optional presence of nails (an experimental feature, thus it is unlikely that limbs have nails). MPFR does not support the presence of nails.
The limb representation in memory follows the endianness of the machine.
Limbs are stored least significant limb first (a.k.a. little endian).
Note that, as a consequence of the last two points, on the same big-endian machine the byte representation of the array will depend on the limb size.
Concerning the size of the array of limbs, it depends on the type. For instance, with the mpn layer of GMP, it is entirely handled by the user.
For MPFR, the size is deduced from the precision of the mpfr_t object; if the precision is not a multiple of the limb bit size, the trailing bits are always set to 0. Note also that more memory may be allocated than is actually used; this must not be confused with the size of the array. You can ignore this, as the unused data always come after the actual array of limbs.
EDIT concerning the rationale: manipulating limbs instead of bytes is done for speed. I suppose that little-endian limb order was then chosen for two reasons. First, it makes the basic operations (addition, subtraction, multiplication) easier to implement and potentially faster. Second, it is much better for implementing arithmetic modulo 2^K, in particular when K may change.
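If the underlying goal is a limb-size-independent serialization, GMP's mpz_export / mpz_import are the intended tools: you specify the word size, word order and endianness explicitly, so the resulting byte stream does not depend on how the limbs happen to be laid out internally. A rough sketch (assuming a 64-bit unsigned long, as in the question's code):
#include <gmp.h>
#include <iomanip>
#include <iostream>
#include <vector>

int main() {
    mpz_t num;
    mpz_init_set_ui(num, 0xAAAABBBBCCCCDDDD);

    // Export as big-endian bytes: order = 1 (most significant word first),
    // size = 1 (one-byte words), endian = 0, nails = 0.
    std::vector<unsigned char> bytes((mpz_sizeinbase(num, 2) + 7) / 8);
    size_t count = 0;
    mpz_export(bytes.data(), &count, 1, 1, 0, 0, num);

    std::cout << std::hex << std::setfill('0');
    for (size_t i = 0; i < count; ++i) {
        std::cout << std::setw(2) << static_cast<unsigned>(bytes[i]);
    }
    std::cout << "\n"; // prints aaaabbbbccccdddd regardless of the limb size

    // Importing with the same parameters recovers the original number.
    mpz_t back;
    mpz_init(back);
    mpz_import(back, count, 1, 1, 0, 0, bytes.data());
    std::cout << (mpz_cmp(num, back) == 0 ? "round-trip ok" : "mismatch") << "\n";

    mpz_clear(num);
    mpz_clear(back);
    return 0;
}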
It finally clicked for me. The limb size does not affect the in-memory representation.
The data in GMP/MPFR is stored consistently least-significant first, even when read as a string of bytes across limbs. The confusion comes from the fact that a value printed from a register is displayed most-significant digit first.
So the seemingly inconsistent output comes from how words are interpreted when read back from memory: when a limb is loaded into a register, its bytes are reassembled from the little-endian storage order into a value that is then printed most-significant digit first.
I've modified the example below to show that it is really the word size used to reinterpret memory that determines how the content is printed; the output is the same whether 32-bit or 64-bit limbs are used:
#define __GMP_SHORT_LIMB
#include <gmp.h>
#include <mpfr.h>
#include <iomanip>
#include <iostream>
#include <cstdint>
constexpr int PRECISION = 128;
template <typename InterpretAs>
void PrintLimbs(mp_limb_t const *const limbs) {
    constexpr int LIMB_BITS = 8 * sizeof(InterpretAs);
    constexpr int NUM_LIMBS = PRECISION / LIMB_BITS;
    std::cout << LIMB_BITS << "-bit: ";
    for (int i = 0; i < NUM_LIMBS; ++i) {
        const auto limb = reinterpret_cast<InterpretAs const *>(limbs)[i];
        for (int b = 0; b < LIMB_BITS; ++b) {
            if (b > 0 && b % 16 == 0) {
                std::cout << " ";
            }
            uint64_t bit = (limb >> (LIMB_BITS - 1 - b)) & 0x1;
            std::cout << bit;
        }
        if (i < NUM_LIMBS - 1) {
            std::cout << "|";
        }
    }
    std::cout << "\n";
}

int main() {
    uint64_t literal = 0b1111000000000000000000000000000000000000000000000000000000001001;
    { // GMP
        mpz_t num;
        mpz_init2(num, PRECISION);
        mpz_set_ui(num, literal);
        std::cout << "GMP where limbs are interpreted as:\n";
        PrintLimbs<uint64_t>(num->_mp_d);
        PrintLimbs<uint32_t>(num->_mp_d);
        PrintLimbs<uint16_t>(num->_mp_d);
        mpz_clear(num);
    }
    { // MPFR
        mpfr_t num;
        mpfr_init2(num, PRECISION);
        mpfr_set_ui(num, literal, MPFR_RNDN);
        std::cout << "MPFR where limbs are interpreted as:\n";
        PrintLimbs<uint64_t>(num->_mpfr_d);
        PrintLimbs<uint32_t>(num->_mpfr_d);
        PrintLimbs<uint16_t>(num->_mpfr_d);
        mpfr_clear(num);
    }
    return 0;
}
This prints (regardless of limb size):
GMP where limbs are interpreted as:
64-bit: 1111000000000000 0000000000000000 0000000000000000 0000000000001001|0000000000000000 0000000000000000 0000000000000000 0000000000000000
32-bit: 0000000000000000 0000000000001001|1111000000000000 0000000000000000|0000000000000000 0000000000000000|0000000000000000 0000000000000000
16-bit: 0000000000001001|0000000000000000|0000000000000000|1111000000000000|0000000000000000|0000000000000000|0000000000000000|0000000000000000
MPFR where limbs are interpreted as:
64-bit: 0000000000000000 0000000000000000 0000000000000000 0000000000000000|1111000000000000 0000000000000000 0000000000000000 0000000000001001
32-bit: 0000000000000000 0000000000000000|0000000000000000 0000000000000000|0000000000000000 0000000000001001|1111000000000000 0000000000000000
16-bit: 0000000000000000|0000000000000000|0000000000000000|0000000000000000|0000000000001001|0000000000000000|0000000000000000|1111000000000000
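Equivalently, dumping the limb array byte by byte (which is always legal through unsigned char) shows that the raw storage is identical on a little-endian machine whichever limb width GMP was configured with; a small sketch:
#include <gmp.h>
#include <iomanip>
#include <iostream>

int main() {
    mpz_t num;
    mpz_init_set_ui(num, 0xAAAABBBBCCCCDDDD);

    // On a little-endian machine this prints dd dd cc cc bb bb aa aa
    // (least significant byte first), independent of the limb size.
    const unsigned char *raw = reinterpret_cast<const unsigned char *>(num->_mp_d);
    const size_t nbytes = mpz_size(num) * sizeof(mp_limb_t);
    std::cout << std::hex << std::setfill('0');
    for (size_t i = 0; i < nbytes; ++i) {
        std::cout << std::setw(2) << static_cast<unsigned>(raw[i]) << " ";
    }
    std::cout << "\n";

    mpz_clear(num);
    return 0;
}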

QtCreator with MinGW : How to make compiler optimization

I am using QtCreator on Windows, and I would like to know how to make my compiler optimize the output.
My understanding is that MinGW is a port of GCC, so I should be able to use arguments such as -O2. However, in the "Projects" bar, the only things I see are:
Build step for qmake (probably not here, qmake is about the .pro files / MOC / Qt stuff ...)
Build step for mingw32-make (probably)
Clean steps (probably not)
So, I tried to add -O2 in the "Make arguments" box, but unfortunately, I get an error "invalid option --O"
For anyone interested, I was trying to make an implementation of the Ackermann function because I read that:
The Ackermann function, due to its definition in terms of extremely deep recursion, can be used as a benchmark of a compiler's ability to optimize recursion.
The code (which doesn't really use Qt):
#include <QtCore/QCoreApplication>
#include <QDebug>
#include <ctime>
using namespace std;
int nbRecursion;
int nbRecursions9;
int Ackermann(int m, int n){
    nbRecursion++;
    if(nbRecursion % 1000000 == 0){
        qDebug() << nbRecursions9 << nbRecursion;
    }
    if(nbRecursion == 1000000000){
        nbRecursion = 0;
        nbRecursions9++;
    }
    if(m==0){
        return n+1;
    }
    if(m>0 && n>0){
        return Ackermann(m-1,Ackermann(m, n-1));
    }
    if(m>0 && n==0){
        return Ackermann(m-1,1);
    }
    qDebug() << "Bug at " << m << ", " << n;
    return -1; // unreachable for valid non-negative inputs, but every path must return a value
}
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    nbRecursion = 0;
    nbRecursions9 = 0;
    int m = 3;
    int n = 13;
    clock_t begin = clock();
    Ackermann(m,n);
    clock_t end = clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    qDebug() << "There are " << CLOCKS_PER_SEC << " CLOCKS_PER_SEC";
    qDebug() << "There were " << nbRecursions9 << nbRecursion << " recursions in " << elapsed_secs << " seconds";
    double timeX = 1000000000.0*((elapsed_secs)/(double)nbRecursion);
    if(nbRecursions9>0){
        timeX += elapsed_secs/(double)nbRecursions9;
    }
    qDebug() << "Time for a function call : " << timeX << " nanoseconds";
    return a.exec();
}
-O2 is used by default when you do a release build; only debug builds don't use optimization. Regardless, if you want to use specific compiler options, you do so in the project file (*.pro) itself, by appending your options to the QMAKE_CFLAGS_RELEASE (for C files) and QMAKE_CXXFLAGS_RELEASE (for C++ files) variables. For example:
QMAKE_CFLAGS_RELEASE += -O3 -march=i686 -mtune=generic -fomit-frame-pointer
QMAKE_CXXFLAGS_RELEASE += -O3 -march=i686 -mtune=generic -fomit-frame-pointer
If you really want to use some specific options always, regardless of whether it's a debug or release build, then append to QMAKE_CFLAGS and QMAKE_CXXFLAGS instead. But usually, you'll only want optimization options in your release builds, not the debug ones.

AbsoluteToNanoseconds vs AbsoluteToDuration

Apple has extremely comprehensive documentation, but I can't find any documentation for the function AbsoluteToNanoseconds. I want to find out how it differs from AbsoluteToDuration.
Note
I am beginning to think that the Apple Docs only cover Objective-C functions? Is this the case?
I found the following by using Apple-double-click:
Duration: 32-bit millisecond timer for drivers
AbsoluteTime: 64-bit clock
I'm not sure why it isn't documented anywhere, but here is an example of how it is used, if that helps:
static float HowLong(
    AbsoluteTime endTime,
    AbsoluteTime bgnTime
)
{
    AbsoluteTime absTime;
    Nanoseconds nanosec;
    absTime = SubAbsoluteFromAbsolute(endTime, bgnTime);
    nanosec = AbsoluteToNanoseconds(absTime);
    return (float) UnsignedWideToUInt64( nanosec ) / 1000.0;
}
UPDATE:
"The main reason I am interested in the docs is to find out how it differs from AbsoluteToDuration"
That's easier. AbsoluteToNanoseconds() returns a value of type Nanoseconds, which is really an UnsignedWide struct.
struct UnsignedWide {
    UInt32 hi;
    UInt32 lo;
};
In contrast, AbsoluteToDuration() returns a value of type Duration, which is actually an SInt32 or signed long:
typedef SInt32 Duration;
Durations use a smaller, signed type because they are intended to hold relative times. Nanoseconds, on the other hand, only make sense as positive values, and they can be very large, since computers can stay running for years at a time.
According to https://developer.apple.com/library/prerelease/mac/releasenotes/General/APIDiffsMacOSX10_9/Kernel.html,
SubAbsoluteFromAbsolute(), along with apparently all the other *Absolute* functions, has been removed in Mavericks. I have confirmed this.
These functions are no longer necessary: since at least Mountain Lion and Mavericks (the two I tested), mach_absolute_time() already returns time in nanoseconds rather than in absolute units (which used to be the number of bus cycles), so no conversion is needed. Thus, the conversion shown in "clock_gettime alternative in Mac OS X" and in similar code presented in several places on the web is no longer necessary. You can confirm this on your system by checking that both the numerator and denominator returned by mach_timebase_info() are 1.
Here is my test code with lots of output to check if you need to do the conversion on your system (I have to perform a check since my code might run on older Macs, although I do the check at program initiation and set a function pointer to call a different routine):
#include <CoreServices/CoreServices.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <time.h>
#include <iostream>

using namespace std;

int main()
{
    uint64_t now, then;
    uint64_t abs, nano;
    mach_timebase_info_data_t timebase_info = {0,0};

    then = mach_absolute_time();
    sleep(1);
    now = mach_absolute_time();
    abs = now - then;

    mach_timebase_info(&timebase_info);
    cout << "numerator " << timebase_info.numer << " denominator "
         << timebase_info.denom << endl;

    if ((timebase_info.numer != 1) || (timebase_info.denom != 1))
    {
        nano = (abs * timebase_info.numer) / timebase_info.denom;
        cout << "Have a real conversion value" << endl;
    }
    else
    {
        nano = abs;
        cout << "Both numerator and denominator are 1" << endl;
    }

    cout << "milliseconds = " << nano/1000000LL << endl;
}
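For what it's worth, on current systems the whole timebase dance can usually be sidestepped by using std::chrono (or clock_gettime) instead of the Carbon-era API; a minimal sketch of the same measurement:
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    auto then = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    auto now = std::chrono::steady_clock::now();

    // steady_clock is monotonic, so the difference is a valid elapsed-time measurement.
    auto nano = std::chrono::duration_cast<std::chrono::nanoseconds>(now - then).count();
    std::cout << "milliseconds = " << nano / 1000000LL << std::endl;
    return 0;
}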