How should GMP/MPFR limbs be interpreted?

The arbitrary precision libraries GMP and MPFR use heap-allocated arrays of machine word-sized integers to store the limbs that make up the high precision number/mantissa.
How should this array of limbs be interpreted to recover the arbitrary precision integer number? In other words: for N limbs holding B bits each, how should I interpret them to recover the N*B bit number?
Does the limb size really affect the in-memory representation (see below)? If so, what is the rationale behind this?
Background:
I wrote a small program to look inside the representation, but I was confused by what I saw. The limbs seem to print in most-significant-digit order, whereas in memory the limbs themselves are stored in native least-significant-byte-first format. When representing the 64-bit word 0xAAAABBBBCCCCDDDD using 32-bit limbs and precision fixed to 128 bits, I see:
% c++ limbs.cpp -lgmp -lmpfr -o limbs && ./limbs
ccccdddd|aaaabbbb|00000000|00000000
00000000|00000000|ccccdddd|aaaabbbb
This seems to imply that the in-memory representation cannot be read back as a string of bits to recover the arbitrary precision number (e.g., if I loaded it into a register on a machine that supported N*B-sized words). Furthermore, it also seems to suggest that the limb size changes the representation, so I would not be able to deserialize a number without knowing which limb size was used to serialize it.
Here's my test program (uses 32-bit limbs with the __GMP_SHORT_LIMB macro):
#define __GMP_SHORT_LIMB
#include <gmp.h>
#include <mpfr.h>

#include <iomanip>
#include <iostream>

constexpr int PRECISION = 128;

void PrintLimbs(mp_limb_t const *const limbs) {
  std::cout << std::hex;
  constexpr int NUM_LIMBS = PRECISION / (8 * sizeof(mp_limb_t));
  for (int i = 0; i < NUM_LIMBS; ++i) {
    std::cout << std::setfill('0') << std::setw(2 * sizeof(mp_limb_t)) << limbs[i];
    if (i < NUM_LIMBS - 1) {
      std::cout << "|";
    }
  }
  std::cout << "\n";
}

int main() {
  { // GMP
    mpz_t num;
    mpz_init2(num, PRECISION);
    mpz_set_ui(num, 0xAAAABBBBCCCCDDDD);
    PrintLimbs(num->_mp_d);
    mpz_clear(num);
  }
  { // MPFR
    mpfr_t num;
    mpfr_init2(num, PRECISION);
    mpfr_set_ui(num, 0xAAAABBBBCCCCDDDD, MPFR_RNDN);
    PrintLimbs(num->_mpfr_d);
    mpfr_clear(num);
  }
  return 0;
}

Three things matter for the byte representation:
1. The limb size depends on your machine and the chosen ABI. The real size is also affected by the optional presence of nails (an experimental feature, so it is unlikely that your limbs have nails). MPFR does not support nails.
2. The limb representation in memory follows the endianness of the machine.
3. Limbs are stored least significant limb first (a.k.a. little endian).
Note that, because of the last two points, on a big-endian machine the byte representation of the array will depend on the limb size.
Concerning the size of the array of limbs, it depends on the type. For instance, with the mpn layer of GMP, it is entirely handled by the user.
For MPFR, the size is deduced from the precision of the mpfr_t object; if the precision is not a multiple of the limb bit size, the trailing bits are always set to 0. Note also that more memory may be allocated than is actually used; this must not be confused with the size of the array. You can ignore this fact, as the unused data always come after the actual array of limbs.
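To make the interpretation concrete: the value of an N-limb magnitude is the sum of limbs[i] * 2^(i*B) for i from 0 to N-1, where B is the number of bits per limb (GMP_NUMB_BITS). Here is a minimal sketch that rebuilds an mpz_t value from its limbs using the public accessors mpz_size() and mpz_getlimbn(); it assumes mp_limb_t fits in an unsigned long, as on typical LP64 systems:
#include <gmp.h>
#include <iostream>

int main() {
  mpz_t num, rebuilt, limb;
  mpz_init_set_str(num, "AAAABBBBCCCCDDDD", 16);
  mpz_init_set_ui(rebuilt, 0);
  mpz_init(limb);

  // value = sum over i of limbs[i] * 2^(i * GMP_NUMB_BITS),
  // i.e. limb 0 is the least significant limb.
  for (size_t i = 0; i < mpz_size(num); ++i) {
    mpz_set_ui(limb, mpz_getlimbn(num, i)); // assumes mp_limb_t fits unsigned long
    mpz_mul_2exp(limb, limb, i * GMP_NUMB_BITS); // shift limb into position
    mpz_add(rebuilt, rebuilt, limb);
  }

  // Prints 1 if the reconstruction matches the original value.
  std::cout << (mpz_cmp(num, rebuilt) == 0) << "\n";

  mpz_clears(num, rebuilt, limb, NULL);
  return 0;
}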
EDIT concerning the rationale: manipulating limbs instead of bytes is done for speed. As for the order, I suppose little endian was chosen to represent the array of limbs for two reasons. First, it makes the basic operations (addition, subtraction, multiplication) easier to implement and potentially faster. Second, it is much better suited to implementing arithmetic modulo 2^K, in particular when K may change.
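To illustrate the first point, here is a minimal sketch (not GMP's actual code) of schoolbook addition over little-endian limb arrays: the carry propagates upward through increasing indices, and the loop can start at index 0 without knowing where the most significant limb is.
#include <cstddef>
#include <cstdint>

// Adds the n-limb numbers a and b into r (all least significant limb first)
// and returns the carry out of the top limb. Illustrative only.
uint32_t add_limbs(uint32_t *r, const uint32_t *a, const uint32_t *b, std::size_t n) {
  uint64_t carry = 0;
  for (std::size_t i = 0; i < n; ++i) { // walk from least to most significant
    uint64_t sum = (uint64_t)a[i] + b[i] + carry;
    r[i] = (uint32_t)sum; // low 32 bits become the result limb
    carry = sum >> 32;    // the overflow becomes the next carry
  }
  return (uint32_t)carry;
}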

It finally clicked for me. The limb size does not affect the in-memory representation.
The data in GMP/MPFR is stored consistently in little-endian order, even when interpreted as a string of bytes across limbs. But when a word is printed, its digits appear most significant first, as if registers were "big-endian".
The seemingly inconsistent output when printing the limbs therefore comes from how words are interpreted when read back from memory: each load reinterprets a little-endian byte string (how the data is stored in memory) as a single number, whose printed form puts the most significant byte first.
I've modified the example below to show that it is in fact the word size with which we reinterpret memory that affects how the content is printed: the output is the same no matter whether 32-bit or 64-bit limbs are used:
#define __GMP_SHORT_LIMB
#include <gmp.h>
#include <mpfr.h>

#include <cstdint>
#include <iomanip>
#include <iostream>

constexpr int PRECISION = 128;

template <typename InterpretAs>
void PrintLimbs(mp_limb_t const *const limbs) {
  constexpr int LIMB_BITS = 8 * sizeof(InterpretAs);
  constexpr int NUM_LIMBS = PRECISION / LIMB_BITS;
  std::cout << LIMB_BITS << "-bit: ";
  for (int i = 0; i < NUM_LIMBS; ++i) {
    const auto limb = reinterpret_cast<InterpretAs const *>(limbs)[i];
    for (int b = 0; b < LIMB_BITS; ++b) {
      if (b > 0 && b % 16 == 0) {
        std::cout << " ";
      }
      uint64_t bit = (limb >> (LIMB_BITS - 1 - b)) & 0x1;
      std::cout << bit;
    }
    if (i < NUM_LIMBS - 1) {
      std::cout << "|";
    }
  }
  std::cout << "\n";
}

int main() {
  uint64_t literal = 0b1111000000000000000000000000000000000000000000000000000000001001;
  { // GMP
    mpz_t num;
    mpz_init2(num, PRECISION);
    mpz_set_ui(num, literal);
    std::cout << "GMP where limbs are interpreted as:\n";
    PrintLimbs<uint64_t>(num->_mp_d);
    PrintLimbs<uint32_t>(num->_mp_d);
    PrintLimbs<uint16_t>(num->_mp_d);
    mpz_clear(num);
  }
  { // MPFR
    mpfr_t num;
    mpfr_init2(num, PRECISION);
    mpfr_set_ui(num, literal, MPFR_RNDN);
    std::cout << "MPFR where limbs are interpreted as:\n";
    PrintLimbs<uint64_t>(num->_mpfr_d);
    PrintLimbs<uint32_t>(num->_mpfr_d);
    PrintLimbs<uint16_t>(num->_mpfr_d);
    mpfr_clear(num);
  }
  return 0;
}
This prints (regardless of limb size):
GMP where limbs are interpreted as:
64-bit: 1111000000000000 0000000000000000 0000000000000000 0000000000001001|0000000000000000 0000000000000000 0000000000000000 0000000000000000
32-bit: 0000000000000000 0000000000001001|1111000000000000 0000000000000000|0000000000000000 0000000000000000|0000000000000000 0000000000000000
16-bit: 0000000000001001|0000000000000000|0000000000000000|1111000000000000|0000000000000000|0000000000000000|0000000000000000|0000000000000000
MPFR where limbs are interpreted as:
64-bit: 0000000000000000 0000000000000000 0000000000000000 0000000000000000|1111000000000000 0000000000000000 0000000000000000 0000000000001001
32-bit: 0000000000000000 0000000000000000|0000000000000000 0000000000000000|0000000000000000 0000000000001001|1111000000000000 0000000000000000
16-bit: 0000000000000000|0000000000000000|0000000000000000|0000000000000000|0000000000001001|0000000000000000|0000000000000000|1111000000000000
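Equivalently, dumping the array one byte at a time shows an identical byte sequence regardless of limb size, since single bytes have no internal endianness to reinterpret. A minimal sketch (the function name is mine):
#include <cstdio>

// Prints the raw bytes of a limb array in memory order. Reading one byte at
// a time sidesteps any word-size reinterpretation, so the output is the same
// whatever the limb size.
void PrintBytes(const void *limbs, int num_bytes) {
  const unsigned char *p = static_cast<const unsigned char *>(limbs);
  for (int i = 0; i < num_bytes; ++i) {
    std::printf("%02x ", p[i]);
  }
  std::printf("\n");
}
Called as PrintBytes(num->_mp_d, PRECISION / 8) for the first GMP example (0xAAAABBBBCCCCDDDD), this should print dd dd cc cc bb bb aa aa followed by eight 00 bytes on a little-endian machine.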

Related

Why is the bitfield's least significant bit promoted to the MSb during typecasting in the below program?

Why do we get this value as output: ffffffff?
#include <stdio.h>

struct bitfield {
    signed char bitflag:1;
};

int main()
{
    unsigned char i = 1;
    struct bitfield *var = (struct bitfield*)&i;
    printf("\n %x \n", var->bitflag);
    return 0;
}
I know that when a memory block of a size equal to the data type is interpreted as a signed type, the most significant bit indicates whether the value is positive (0) or negative (1). But I still can't figure out why -1 (ffffffff) is printed. With only one bit set in the struct, I was expecting it to be promoted to a 1-byte char with that bit as the LSb, since my machine is little-endian.
Can someone please explain? I'm really confused.
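For what it's worth, the key behaviour can be reproduced without any pointer casting: a signed 1-bit bitfield can only hold 0 and -1 (its single bit is both value and sign bit), and passing it to printf promotes it to int with sign extension. A minimal sketch:
#include <cstdio>

struct Bitfield {
    signed char bitflag : 1; // one bit: it is both the value bit and the sign bit
};

int main() {
    Bitfield var = {};
    var.bitflag = -1;              // the only nonzero value a signed 1-bit field can hold
    int promoted = var.bitflag;    // integer promotion sign-extends -1 to full int width
    std::printf("%x\n", promoted); // prints ffffffff with a 32-bit int
    return 0;
}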

Numpy vs Eigen vs Xtensor Linear Algebra Benchmark Oddity

I was recently trying to compare different Python and C++ matrix libraries against each other for their linear algebra performance in order to see which one(s) to use in an upcoming project. While there are multiple types of linear algebra operations, I have chosen to focus mainly on matrix inversion, as it seems to be the one giving strange results. I have written the code below for the comparison, but I think I must be doing something wrong.
C++ Code
#include <iostream>
#include "eigen/Eigen/Dense"
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor-blas/xlinalg.hpp> //-lblas -llapack for cblas, -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread for openblas
//including accurate timer
#include <chrono>
//including vector array
#include <vector>

void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats = 1000);
void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats = 1000);

int main()
{
  std::vector<int> sizings{1, 10, 100, 1000, 10000, 100000};
  basicMatrixComparisonEigen(sizings, 2);
  basicMatrixComparisonXtensor(sizings, 2);
  return 0;
}

void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats)
{
  std::chrono::high_resolution_clock::time_point t1;
  std::chrono::high_resolution_clock::time_point t2;
  using time = std::chrono::high_resolution_clock;
  std::cout << "Timing Eigen: " << std::endl;
  for (auto &dim : dims)
  {
    std::cout << "Scale Factor: " << dim << std::endl;
    try
    {
      //Linear Operations
      auto l = Eigen::MatrixXd::Random(dim, dim);
      //Eigen Matrix inversion
      t1 = time::now();
      for (int i = 0; i < numrepeats; i++)
      {
        Eigen::MatrixXd pinv = l.completeOrthogonalDecomposition().pseudoInverse();
        //note this does not come out to be identity. The inverse is wrong.
        //std::cout<<l*pinv<<std::endl;
      }
      t2 = time::now();
      std::cout << "Eigen Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
      std::cout << "\n\n\n";
    }
    catch (const std::exception &e)
    {
      std::cout << "Error: '" << e.what() << "'\n";
    }
  }
}

void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats)
{
  std::chrono::high_resolution_clock::time_point t1;
  std::chrono::high_resolution_clock::time_point t2;
  using time = std::chrono::high_resolution_clock;
  std::cout << "Timing Xtensor: " << std::endl;
  for (auto &dim : dims)
  {
    std::cout << "Scale Factor: " << dim << std::endl;
    try
    {
      //Linear Operations
      auto l = xt::random::randn<double>({dim, dim});
      //Xtensor Matrix inversion
      t1 = time::now();
      for (int i = 0; i < numrepeats; i++)
      {
        auto inverse = xt::linalg::pinv(l);
        //something is wrong here. The inverse is not actually the inverse when you multiply it out.
        //std::cout << xt::linalg::dot(inverse,l) << std::endl;
      }
      t2 = time::now();
      std::cout << "Xtensor Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
      std::cout << "\n\n\n";
    }
    catch (const std::exception &e)
    {
      std::cout << "Error: '" << e.what() << "'\n";
    }
  }
}
This is compiled with:
g++ cpp_library.cpp -O2 -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread -march=native -o benchmark.exe
for OpenBLAS, and
g++ cpp_library.cpp -O2 -lblas -llapack -march=native -o benchmark.exe
for cBLAS.
g++ version 9.3.0.
And for Python 3:
import numpy as np
from datetime import datetime as dt
#import timeit

start = dt.now()
l = np.random.rand(1000, 1000)
for i in range(2):
    result = np.linalg.inv(l)
end = dt.now()
print("Completed in: " + str((end - start) / 2))
#print(np.matmul(l, result))
#print(np.dot(l, result))
#Timeit also gives similar results
I will focus on the largest decade that runs in a reasonable amount of time on my computer: 1000x1000. I know that only 2 runs introduce a bit of variance, but I've run it with more and the results are roughly the same as below:
Eigen 3.3.9: 196.804 milliseconds
Xtensor/Xtensor-blas w/ OpenBlas: 378.156 milliseconds
Numpy 1.17.4: 172.582 milliseconds
Is this a reasonable result to expect? Why are the C++ libraries slower than Numpy? All three packages are using some sort of LAPACK/BLAS backend, yet there is a significant difference between them. In particular, Xtensor will pin my CPU at 100% usage with OpenBLAS's threads, yet still manages to have worse performance.
I'm wondering if the C++ libraries are actually performing the inverse/pseudoinverse of the matrix, and if this is what is causing these results. In the commented sections of the C++ test code, I have noted that when I sanity-checked the results from both Eigen and Xtensor, the product of the matrix and its inverse was not even close to the identity matrix. I tried with smaller matrices (10x10), thinking it might be a precision error, but the problem remained. In another test, I tested for rank, and these matrices are full rank. To be sure I wasn't going crazy, I tried with inv() instead of pinv() in both cases, and the results are the same. Am I using the wrong functions for this linear algebra benchmark, or is this Numpy twisting the knife on two dysfunctional low-level libraries?
EDIT:
Thank you everyone for your interest in this problem. I think I have figured out the issue. I suspect Eigen and Xtensor use lazy evaluation, and this was causing errors downstream, outputting random matrices instead of the inverted matrices. I was able to correct the strange numerical inversion failure with the following replacements in the code:
auto temp = Eigen::MatrixXd::Random(dim, dim);
Eigen::MatrixXd l(dim,dim);
l=temp;
and
auto temp = xt::random::randn<double>({dim, dim});
xt::xarray<double> l =temp;
However, the timings didn't change much:
Eigen 3.3.9: 201.386 milliseconds
Xtensor/Xtensor-blas w/ OpenBlas: 337.299 milliseconds.
Numpy 1.17.4: (from before) 172.582 milliseconds
A little strangely, adding -O3 and -ffast-math actually slowed down the code a little. -march=native had the biggest performance increase for me when I tried it. Also, OpenBLAS is 2-3X faster than CBLAS for these problems.
Firstly, you are not computing the same things.
To compute the inverse of the matrix l, use l.inverse() for Eigen and xt::linalg::inv() for xtensor.
When you link BLAS to Eigen or xtensor, these operations are automatically dispatched to your chosen BLAS.
I tried replacing the inverse functions, replaced auto with MatrixXd and xt::xtensor to avoid lazy evaluation, linked OpenBLAS to Eigen, xtensor and numpy, and compiled with only the -O3 flag. The following are the results on my MacBook Pro M1:
Eigen-3.3.9 (with openblas) - ~ 38 ms
Eigen-3.3.9 (without openblas) - ~ 85 ms
xtensor-master (with openblas) - ~41 ms
Numpy- 1.21.2 (with openblas) - ~35 ms.
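Putting the two points above together, the core of a corrected Eigen benchmark might look like the following sketch: the random matrix is materialized into a MatrixXd before timing (so no lazy expression is inverted), and inverse() is used instead of the pseudoinverse. The xtensor change is analogous with xt::xarray<double> and xt::linalg::inv(). The include path may differ depending on how Eigen is installed.
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main() {
  const int dim = 1000;
  // Materialize into a concrete matrix; with `auto`, Random() would remain an
  // unevaluated expression that generates fresh random numbers on every use.
  Eigen::MatrixXd l = Eigen::MatrixXd::Random(dim, dim);

  auto t1 = std::chrono::high_resolution_clock::now();
  Eigen::MatrixXd inv = l.inverse(); // plain inverse, not the pseudoinverse
  auto t2 = std::chrono::high_resolution_clock::now();

  // The norm of (l*inv - I) should be tiny if the inverse is correct.
  std::cout << "error: " << (l * inv - Eigen::MatrixXd::Identity(dim, dim)).norm()
            << ", took "
            << std::chrono::duration<double, std::milli>(t2 - t1).count()
            << " ms\n";
  return 0;
}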

sem_open - valgrind complains about uninitialised bytes

I have a trivial program:
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char sname[] = "xxx";
    sem_t *pSemaphor;
    if ((pSemaphor = sem_open(sname, O_CREAT, 0644, 0)) == SEM_FAILED) {
        perror("semaphore initialization");
        exit(1);
    }
    sem_unlink(sname);
    sem_close(pSemaphor);
}
When I run it under valgrind, I get the following error:
==12702== Syscall param write(buf) points to uninitialised byte(s)
==12702== at 0x4E457A0: __write_nocancel (syscall-template.S:81)
==12702== by 0x4E446FC: sem_open (sem_open.c:245)
==12702== by 0x4007D0: main (test.cpp:15)
==12702== Address 0xfff00023c is on thread 1's stack
==12702== in frame #1, created by sem_open (sem_open.c:139)
The code was extracted from a bigger project where it ran successfully for years, but now it is causing a segmentation fault.
The valgrind error from my example is the same as seen in the bigger project, but there it causes a crash, which my small example doesn't.
I see this with glibc 2.27-5 on Debian. In my case I only open the semaphores right at the start of a long-running program and it seems harmless so far - just annoying.
Looking at the code for sem_open.c which is available at:
https://code.woboq.org/userspace/glibc/nptl/sem_open.c.html
It seems that valgrind is complaining about the line (270 as I look now):
if (TEMP_FAILURE_RETRY (__libc_write (fd, &sem.initsem, sizeof (sem_t)))
    == sizeof (sem_t)
However, sem.initsem is properly initialised earlier in a fairly baroque manner: firstly by explicitly setting fields in sem.newsem (part of the union), and then, once that is done, by a call to memset (lines 226-228):
/* Initialize the remaining bytes as well.  */
memset ((char *) &sem.initsem + sizeof (struct new_sem), '\0',
        sizeof (sem_t) - sizeof (struct new_sem));
I think that this particular bit of shenanigans is quite optimal, but we need to make sure that all of the fields of new_sem have actually been initialised... we find the definition in https://code.woboq.org/userspace/glibc/sysdeps/nptl/internaltypes.h.html and it is this wonderful creation:
struct new_sem
{
#if __HAVE_64B_ATOMICS
  /* The data field holds both value (in the least-significant 32 bits) and
     nwaiters.  */
# if __BYTE_ORDER == __LITTLE_ENDIAN
#  define SEM_VALUE_OFFSET 0
# elif __BYTE_ORDER == __BIG_ENDIAN
#  define SEM_VALUE_OFFSET 1
# else
#  error Unsupported byte order.
# endif
# define SEM_NWAITERS_SHIFT 32
# define SEM_VALUE_MASK (~(unsigned int)0)
  uint64_t data;
  int private;
  int pad;
#else
# define SEM_VALUE_SHIFT 1
# define SEM_NWAITERS_MASK ((unsigned int)1)
  unsigned int value;
  int private;
  int pad;
  unsigned int nwaiters;
#endif
};
So if we __HAVE_64B_ATOMICS then the structure has a data field which contains both the value and the nwaiters, otherwise these are separate fields.
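In other words, on a 64-bit-atomics build the two logical fields are packed into the single data word. A sketch (illustrative only, not glibc code) for the little-endian layout quoted above, where nwaiters sits above SEM_NWAITERS_SHIFT:
/* Illustrative only: how value and nwaiters share new_sem.data.  */
uint64_t data = (uint64_t) value             /* value in the low 32 bits */
              | ((uint64_t) nwaiters << 32); /* nwaiters in the high 32 bits */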
In the initialisation of sem.newsem we can see that these are initialised correctly, as follows:
#if __HAVE_64B_ATOMICS
  sem.newsem.data = value;
#else
  sem.newsem.value = value << SEM_VALUE_SHIFT;
  sem.newsem.nwaiters = 0;
#endif
  /* pad is used as a mutex on pre-v9 sparc and ignored otherwise.  */
  sem.newsem.pad = 0;
  /* This always is a shared semaphore.  */
  sem.newsem.private = FUTEX_SHARED;
I'm doing all of this on a 64-bit system, so I think that valgrind is complaining about the initialisation of the 64-bit sem.newsem.data with a 32-bit value, since from:
value = va_arg (ap, unsigned int);
we can see that value is defined simply as an unsigned int, which will usually still be 32 bits even on a 64-bit system (see What should be the sizeof(int) on a 64-bit machine?), but that should just be an implicit cast to 64 bits when it is assigned.
So I think this is not a bug - just valgrind getting confused.
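If the warning is noise you want to silence rather than investigate further, one option is a Memcheck suppression. A sketch of what the entry might look like, based on the frames in the report above (the suppression name is arbitrary, and the exact frames may differ on your system; valgrind's --gen-suppressions=all can print the precise entry for you):
{
   sem_open-initsem-false-positive
   Memcheck:Param
   write(buf)
   fun:__write_nocancel
   fun:sem_open
}
Save this to a file and pass it to valgrind with --suppressions=that-file.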

NSSwapInt from byte array

I'm trying to implement a function that will read a 32-bit int stored with different endianness from a byte array (a char* in my code). I was advised to use NSSwapInt, but I'm clueless about how to go about it. Could anyone show me a snippet?
Thanks in advance!
Here's a short example:
unsigned char bytes[] = { 0x00, 0x00, 0x01, 0x02 };
int intData = *((int *)bytes);
int reverseData = NSSwapInt(intData);
NSLog(@"integer:%d", intData);
NSLog(@"bytes:%08x", intData);
NSLog(@"reverse integer: %d", reverseData);
NSLog(@"reverse bytes: %08x", reverseData);
The output will be:
integer:33619968
bytes:02010000
reverse integer: 258
reverse bytes: 00000102
As mentioned in the docs:
"Swaps the bytes of inv and returns the resulting value. Bytes are swapped from each low-order position to the corresponding high-order position and vice versa. For example, if the bytes of inv are numbered from 1 to 4, this function swaps bytes 1 and 4, and bytes 2 and 3."
There are also NSSwapShort and NSSwapLongLong.
There is a potential for a data misalignment exception if you solve this problem using integer pointers - e.g. some architectures require 32-bit values to be at addresses that are multiples of 2 or 4 bytes. The ARM architecture used by the iPhone et al. may throw an exception in this case, but I've no iOS device handy to test whether it does.
A safe way to do this which will never throw any misalignment exceptions is to assemble the integer directly:
int32_t bytes2int(unsigned char *b)
{
    int32_t i;
    i = b[0] | b[1] << 8 | b[2] << 16 | b[3] << 24; // little-endian, or
    i = b[3] | b[2] << 8 | b[1] << 16 | b[0] << 24; // big-endian (pick one)
    return i;
}
You can pass this any byte pointer and it will assemble 4 bytes to make a 32-bit int. You can extend the idea to 64-bit integers if required.
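For example, a 64-bit version along the same lines might look like this (a sketch of the little-endian variant; the function name bytes2int64 is mine):
#include <stdint.h>

int64_t bytes2int64(const unsigned char *b)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i) {
        v = (v << 8) | b[i]; // b[0] ends up as the least significant byte
    }
    return (int64_t)v;
}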

AbsoluteToNanoseconds vs AbsoluteToDuration

Apple has extremely comprehensive documentation, but I can't find any documentation for the function AbsoluteToNanoseconds. I want to find the difference between AbsoluteToNanoseconds and AbsoluteToDuration.
Note
I am beginning to think that the Apple docs only cover Objective-C functions. Is this the case?
I found the following by using Apple-double-click:
Duration 32-bit millisecond timer for drivers
AbsoluteTime 64-bit clock
I'm not sure why it isn't documented anywhere, but here is an example of how it is used, if that helps:
static float HowLong(
    AbsoluteTime endTime,
    AbsoluteTime bgnTime
)
{
    AbsoluteTime absTime;
    Nanoseconds nanosec;

    absTime = SubAbsoluteFromAbsolute(endTime, bgnTime);
    nanosec = AbsoluteToNanoseconds(absTime);

    return (float) UnsignedWideToUInt64(nanosec) / 1000.0;
}
UPDATE:
"The main reason I am interested in the docs is to find out how it differs from AbsoluteToDuration"
That's easier. AbsoluteToNanoseconds() returns a value of type Nanoseconds, which is really an UnsignedWide struct.
struct UnsignedWide {
    UInt32 hi;
    UInt32 lo;
};
In contrast, AbsoluteToDuration() returns a value of type Duration, which is actually an SInt32 or signed long:
typedef SInt32 Duration;
Durations use a smaller, signed type because they are intended to hold relative times. Nanoseconds, on the other hand, only make sense as positive values, and they can be very large, since computers can stay running for years at a time.
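In other words, converting a Nanoseconds value to an ordinary integer just means combining the two 32-bit halves, which is what UnsignedWideToUInt64() does in the HowLong() example above. A sketch of the equivalent done by hand, assuming hi holds the upper half:
uint64_t nanoseconds_to_uint64(struct UnsignedWide w)
{
    return ((uint64_t) w.hi << 32) | w.lo; /* hi = upper 32 bits, lo = lower 32 bits */
}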
According to https://developer.apple.com/library/prerelease/mac/releasenotes/General/APIDiffsMacOSX10_9/Kernel.html,
SubAbsoluteFromAbsolute(), along with apparently all the other *Absolute* functions, has been removed in Mavericks. I have confirmed this.
These functions are no longer necessary since, at least in Mavericks and Mountain Lion (the two I tested), mach_absolute_time() already returns time in nanoseconds rather than in absolute form (which used to be the number of bus cycles), making a conversion unnecessary. Thus, the conversion shown in clock_gettime alternative in Mac OS X, and similar code presented in several places on the web, is no longer needed. This can be confirmed on your system by checking that both the numerator and denominator returned by mach_timebase_info() are 1.
Here is my test code, with lots of output so you can check whether you need to do the conversion on your system (I have to perform the check since my code might run on older Macs; I do the check at program initiation and set a function pointer to call a different routine):
#include <CoreServices/CoreServices.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <time.h>
#include <unistd.h> // for sleep()
#include <iostream>

using namespace std;

int main()
{
    uint64_t now, then;
    uint64_t abs, nano;
    mach_timebase_info_data_t timebase_info = {0, 0};

    then = mach_absolute_time();
    sleep(1);
    now = mach_absolute_time();
    abs = now - then;

    mach_timebase_info(&timebase_info);
    cout << "numerator " << timebase_info.numer << " denominator "
         << timebase_info.denom << endl;
    if ((timebase_info.numer != 1) || (timebase_info.denom != 1))
    {
        nano = (abs * timebase_info.numer) / timebase_info.denom;
        cout << "Have a real conversion value" << endl;
    }
    else
    {
        nano = abs;
        cout << "Both numerator and denominator are 1" << endl;
    }
    cout << "milliseconds = " << nano / 1000000LL << endl;
}