An optimized implementation of the Heaviside function

I would like to (super)optimize an implementation of the Heaviside function.
I'm working on a numerical algorithm (in Fortran) where speed is particularly important. It employs the Heaviside function many times, currently implemented via the sign intrinsic function as follows:
heaviside = 0.5*sign(1,x)+1
I'm mainly interested in the case where x is a double precision real number on Intel processors.
Is it possible to develop a more efficient implementation of the Heaviside function?
Perhaps using assembly language, a superoptimizer, or a call to an existing external library?

Did you intend heaviside = 0.5*(sign(1,x)+1)? In any case, testing with gfortran 4.8.1 shows that High Performance Mark's idea should be beneficial. Here are 3 possibilities:
heaviside1 - original
heaviside2 - High Performance Mark's idea
heaviside3 - another variation
function heaviside1 (x)
    double precision heaviside1, x
    heaviside1 = 0.5 * (sign(1d0,x) + 1)
end

function heaviside2 (x)
    double precision heaviside2, x
    heaviside2 = sign(0.5d0,x) + 0.5
end

function heaviside3 (x)
    double precision heaviside3, x
    heaviside3 = 0
    if (x .ge. 0) heaviside3 = 1
end

program demo
    double precision heaviside1, heaviside2, heaviside3, x, a, b, c
    do
        x = 0.5 - RAND(0)
        a = heaviside1(x)
        b = heaviside2(x)
        c = heaviside3(x)
        print *, "x=", x, "heaviside(x)=", a, b, c
    enddo
end
When compiled, gcc generates these 3 stand-alone functions:
<heaviside1_>:
vmovsd xmm0,QWORD PTR [rcx]
vandpd xmm0,xmm0,XMMWORD PTR [rip+0x2d824]
vorpd xmm0,xmm0,XMMWORD PTR [rip+0x2d80c]
vaddsd xmm0,xmm0,QWORD PTR [rip+0x2d7f4]
vmulsd xmm0,xmm0,QWORD PTR [rip+0x2d81c]
ret
<heaviside2_>:
vmovsd xmm0,QWORD PTR [rcx]
vandpd xmm0,xmm0,XMMWORD PTR [rip+0x2d844]
vorpd xmm0,xmm0,XMMWORD PTR [rip+0x2d85c]
vaddsd xmm0,xmm0,QWORD PTR [rip+0x2d844]
ret
<heaviside3_>:
vxorpd xmm0,xmm0,xmm0
vmovsd xmm1,QWORD PTR [rip+0x2d844]
vcmplesd xmm0,xmm0,QWORD PTR [rcx]
vandpd xmm0,xmm1,xmm0
ret
When compiled with gcc, heaviside1 generates a multiply that might slow execution.
heaviside2 eliminates the multiply.
heaviside3 has the same number of instructions as heaviside2, but uses 2 fewer memory accesses.
For the stand-alone functions:
             instruction count    memory reference count
heaviside1           6                      5
heaviside2           5                      4
heaviside3           5                      2
Inlining these functions eliminates the return instruction and, ideally, passes the argument in a register and preloads other registers with the needed constants. The exact result depends on the compiler used and the calling code. An estimate for inlined code:
             instruction count    memory reference count
heaviside1           4                      0
heaviside2           3                      0
heaviside3           2                      0
It looks like the function could be handled by as few as two compiler generated instructions: vcmplesd+vandpd. The first instruction creates a mask of all zeros if the argument is negative, or a mask of all ones otherwise. The second instruction applies the mask to a register constant value of one in order to produce the result value of zero or one.
Though I have not benchmarked these functions, it looks like the heaviside function should not take much execution time.
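For illustration, the same two-instruction idea can be expressed in C with SSE2 intrinsics (this is my sketch of the approach, not compiler output; the function name heaviside_sse2 is hypothetical):
#include <emmintrin.h> /* SSE2 intrinsics */

/* Sketch: branchless Heaviside via compare + mask, as described above. */
static inline double heaviside_sse2(double x)
{
    __m128d zero = _mm_setzero_pd();                  /* constant 0.0 */
    __m128d one  = _mm_set_sd(1.0);                   /* constant 1.0 */
    __m128d mask = _mm_cmple_sd(zero, _mm_set_sd(x)); /* all 1s if 0 <= x, else all 0s */
    return _mm_cvtsd_f64(_mm_and_pd(one, mask));      /* 1.0 or 0.0 */
}
When inlined, the compiler should keep the two constants in registers, leaving essentially the vcmplesd+vandpd pair.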
---09/23/2013: adding x86_64 assembly language versions and C language benchmark---
file functions.s
//----------------------------------------------------------------------------
.intel_syntax noprefix
.text
//-----------------------------------------------------------------------------
// this heaviside function generates its own register constants
// double heaviside_a1 (double arg);
.globl heaviside_a1
heaviside_a1:
mov rax,0x3ff0000000000000
xorpd xmm1,xmm1 # xmm1: constant 0.0
cmplesd xmm1,xmm0 # xmm1: mask (all Fs or all 0s)
movq xmm0,rax # xmm0: constant 1.0
andpd xmm0,xmm1
retq
//-----------------------------------------------------------------------------
// this heaviside function uses register constants passed from caller
// double heaviside_a2 (double arg, double const0, double const1);
.globl heaviside_a2
heaviside_a2:
cmplesd xmm1,xmm0 # xmm1: mask (all Fs or all 0s)
movsd xmm0,xmm2 # xmm0: constant 1.0
andpd xmm0,xmm1
retq
//-----------------------------------------------------------------------------
file ctest.c
#define __USE_MINGW_ANSI_STDIO 1
#include <windows.h>
#include <stdio.h>
#include <stdint.h>
// functions.s
double heaviside_a1 (double x);
double heaviside_a2 (double arg, double const0, double const1);
//-----------------------------------------------------------------------------
static double heaviside_c1 (double x)
{
double result = 0;
if (x >= 0) result = 1;
return result;
}
//-----------------------------------------------------------------------------
//
// queryPerformanceCounter - similar to QueryPerformanceCounter, but returns
// count directly.
uint64_t queryPerformanceCounter (void)
{
LARGE_INTEGER int64;
QueryPerformanceCounter (&int64);
return int64.QuadPart;
}
//-----------------------------------------------------------------------------
//
// queryPerformanceFrequency - same as QueryPerformanceFrequency, but returns count directly.
uint64_t queryPerformanceFrequency (void)
{
LARGE_INTEGER int64;
QueryPerformanceFrequency (&int64);
return int64.QuadPart;
}
//----------------------------------------------------------------------------
//
// lfsr64gpr - left shift galois type lfsr for 64-bit data, general purpose register implementation
//
static uint64_t lfsr64gpr (uint64_t data, uint64_t mask)
{
uint64_t carryOut = data >> 63;
uint64_t maskOrZ = -carryOut;
return (data << 1) ^ (maskOrZ & mask);
}
//---------------------------------------------------------------------------
int runtests (uint64_t pattern, uint64_t mask)
{
uint64_t startCount, elapsed, index, loops = 800000000;
double ns;
double total = 0;
startCount = queryPerformanceCounter ();
for (index = 0; index < loops; index++)
{
double x, result;
pattern = lfsr64gpr (pattern, mask);
x = (double) (int64_t) pattern;
result = heaviside_c1 (x);
total += result;
}
elapsed = queryPerformanceCounter () - startCount;
ns = (double) elapsed / queryPerformanceFrequency () * 1000000000 / loops;
printf ("heaviside_c1: %7.2f ns\n", ns);
startCount = queryPerformanceCounter ();
for (index = 0; index < loops; index++)
{
double x, result;
pattern = lfsr64gpr (pattern, mask);
x = (double) (int64_t) pattern;
result = heaviside_a1 (x);
//printf ("heaviside_a1 (%lf): %lf\n", x, result);
total += result;
}
elapsed = queryPerformanceCounter () - startCount;
ns = (double) elapsed / queryPerformanceFrequency () * 1000000000 / loops;
printf ("heaviside_a1: %7.2f ns\n", ns);
startCount = queryPerformanceCounter ();
for (index = 0; index < loops; index++)
{
double x, result;
const double const0 = 0.0;
const double const1 = 1.0;
pattern = lfsr64gpr (pattern, mask);
x = (double) (int64_t) pattern;
result = heaviside_a2 (x, const0, const1);
//printf ("heaviside_a2 (%lf): %lf\n", x, result);
total += result;
}
elapsed = queryPerformanceCounter () - startCount;
ns = (double) elapsed / queryPerformanceFrequency () * 1000000000 / loops;
printf ("heaviside_a2: %7.2f ns\n", ns);
return total;
}
//---------------------------------------------------------------------------
int main (void)
{
uint64_t mask;
mask = 0xBEFFFFFFFFFFFFFF;
// raise our priority to increase measurement accuracy
SetPriorityClass (GetCurrentProcess (), REALTIME_PRIORITY_CLASS);
printf ("using pseudo-random data\n");
runtests (1, mask);
return 0;
}
//---------------------------------------------------------------------------
mingw64 build command: gcc -Wall -Wextra -O3 -o ctest.exe ctest.c functions.s
Program output from Intel Core i7-2600K at 4.0 GHz:
using pseudo-random data
heaviside_c1: 2.24 ns
heaviside_a1: 2.00 ns
heaviside_a2: 2.00 ns
These timing results include execution of the pseudo-random argument generation and the result accumulation code needed to keep the optimizer from eliminating the otherwise unused heaviside_c1 local function.
heaviside_c1 is the original Fortran suggestion, ported to C.
heaviside_a1 is an assembly language implementation.
heaviside_a2 is a modification of the assembly language version that uses register constants passed by the caller to avoid the overhead of generating them. For my processor, benchmarking shows no advantage to passing constants.
The assembly language functions assume xmm0 returns the result and xmm1 and xmm2 are available as scratch registers. This is valid for the x64 calling convention used by Windows. This assumption should be confirmed for other calling conventions.
In order to avoid memory accesses, the assembly language version expects the argument to be passed by value in a register (XMM0). Because this is not the Fortran default, a special declaration is required. This one seems to work properly for 64-bit gfortran:
interface
    real(c_double) function heaviside_a1(x)
        use iso_c_binding, only: c_double
        real(c_double), VALUE :: x
    end function heaviside_a1
end interface

Related

Computation of 64 bit CRC polynomial performance

I found the following page on the web:
https://users.ece.cmu.edu/~koopman/crc/crc64.html
It lists the performance of a handful of 64-bit CRC polynomials. The optimal payload for a Hamming distance of 3 is listed as 18446744073709551551 bits. A polynomial providing that HD 3 payload is 0xd6c9e91aca649ad4 (Koopman notation).
On the same website there is also some basic "HDLen" C code that can compute the performance of any polynomial (https://users.ece.cmu.edu/~koopman/crc/hdlen.html). I checked that code; the HD 3 optimized loop is very simple, similar to this:
Poly_t accum = cPoly;
Length_t len = 0;
while (accum != cTopBitSet)
{
    accum = (accum & 1) ? ((accum >> 1) ^ cPoly) : (accum >> 1);
    len++;
}
18446744073709551551 is a huge number; it is almost the full range of a 64-bit integer. Even that simple loop would run for centuries on the most powerful CPU core available.
It also appears to me that this loop cannot be parallelized, since each iteration depends on the previous iteration.
It is claimed that this payload is optimal amongst all possible 64-bit polynomials, which means that all possible 64-bit polynomials would have been checked for their individual HD 3 performance. That task can be parallelized, but the huge number of candidate polynomials still seems undoable.
I can't see a way to even compute a single (good) polynomial's HD 3 performance, let alone that of all possible 64-bit-wide polynomials.
So I wonder: how was the number found? What kind of code or method (in contrast to the simple HDLen software) was used to find the mentioned optimal HD 3 payload?
It is a primitive polynomial, and it can be shown that the HD=3 length of any primitive polynomial over GF(2) is 2^n - (n+1), where n is the degree of the polynomial. For n = 64 this gives 2^64 - 65 = 18446744073709551551, the payload listed on the site.
It can be shown pretty quickly whether a polynomial over a finite field is primitive or not (a sketch of such a check follows the code below).
Also, it is possible to compute the CRC of a very sparse codeword of n bits in O(log n) time instead of O(n) time. Here is an example in C, demonstrating the case mentioned for the provided CRC:
#include <stdio.h>
#include <stdint.h>
// Jones' 64-bit primitive polynomial (the constant excludes the x^64 term):
// 1 + x^3 + x^5 + x^7 + x^8 + x^10 + x^12 + x^13 + x^16 + x^19 + x^22 + x^23 +
// x^26 + x^28 + x^31 + x^32 + x^34 + x^36 + x^37 + x^41 + x^44 + x^46 + x^47 +
// x^48 + x^49 + x^52 + x^55 + x^56 + x^58 + x^59 + x^61 + x^63 + x^64
#define POLY 0xad93d23594c935a9
#define HIGH 0x8000000000000000 // high bit set
// Return polynomial a times polynomial b modulo p (POLY). a must be non-zero.
static uint64_t multmodp(uint64_t a, uint64_t b) {
uint64_t prod = 0;
for (;;) {
if (a & 1) {
prod ^= b;
if (a == 1)
break;
}
a >>= 1;
b = b & HIGH ? (b << 1) ^ POLY : b << 1;
}
return prod;
}
// x2n_table[n] is x^2^n mod p.
static uint64_t x2n_table[64];
// Initialize x2n_table[].
static void x2n_table_init(void) {
uint64_t p = 2; // first entry is x^2^0 == x^1
x2n_table[0] = p;
for (size_t n = 1; n < 64; n++)
x2n_table[n] = p = multmodp(p, p);
}
// Compute x^n modulo p. This takes O(log n) time.
static uint64_t xtonmodp(uintmax_t n) {
uint64_t x = 1;
int k = 0;
for (;;) {
if (n & 1)
x = multmodp(x2n_table[k], x);
n >>= 1;
if (n == 0)
break;
k++;
}
return x;
}
// Feed n zero bits into the CRC, taking O(log n) time.
static uint64_t crc64zeros(uint64_t crc, uint64_t n) {
return multmodp(xtonmodp(n), crc);
}
// Feed one one bit into the CRC.
static uint64_t crc64one(uint64_t crc) {
return crc & HIGH ? crc << 1 : (crc << 1) ^ POLY;
}
// Return the CRC-64 of one one bit, followed by n zero bits, followed by one
// more one bit.
static uint64_t crc64_one_zeros_one(uint64_t n) {
return crc64one(crc64zeros(crc64one(0), n));
}
int main(void) {
x2n_table_init();
uint64_t n = -2; // code word with 2^64 bits: a 1, 2^64-2 0's, and a 1
printf("%llx\n", crc64_one_zeros_one(n)); // prints 0
return 0;
}
That calculation completes in about 7.4 µs on my machine, as opposed to the bit-at-a-time calculation, which would take about 560 years.
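On the primitivity point above: a quick check can reuse multmodp() and xtonmodp(). p is primitive if and only if x has order 2^64-1 in GF(2)[x]/(p), i.e. x^(2^64-1) = 1 while x^((2^64-1)/q) != 1 for each prime factor q of 2^64-1 = 3 * 5 * 17 * 257 * 641 * 65537 * 6700417. A minimal sketch of such a test (my addition, assuming the definitions above and a prior call to x2n_table_init()):
// Sketch: returns 1 if POLY (defined above) is primitive over GF(2).
static int is_primitive(void) {
    // prime factorization of 2^64 - 1
    static const uint64_t factors[] = {3, 5, 17, 257, 641, 65537, 6700417};
    uint64_t order = ~(uint64_t)0;             // 2^64 - 1
    if (xtonmodp(order) != 1)                  // x^(2^64-1) must be the identity
        return 0;
    for (size_t i = 0; i < sizeof factors / sizeof factors[0]; i++)
        if (xtonmodp(order / factors[i]) == 1) // order must not be a proper divisor
            return 0;
    return 1;
}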

Basic group arithmetic in libsodium

I am trying to implement a simple cryptographic primitive.
In the code below, given sa, sk, and hn, I want to compute sb such that sa*G = (sb + sk*hn)*G.
However, after finding sb, the following equality does not hold: sb*G + (sk*hn)*G = sa*G.
My understanding is that arithmetic in the exponent is done modulo the order of the group rather than modulo L.
However, I have a few questions relating to the implementation:
Why does the scalar have to be chosen from [0,L], where L is the order of the subgroup?
Is there a "helper" function that multiplies two large scalars without reducing modulo L?
#include <sodium.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (sodium_init() < 0) {
        /* panic! the library couldn't be initialized; it is not safe to use */
        return -1;
    }
    uint8_t sb[crypto_core_ed25519_SCALARBYTES];
    uint8_t sa[crypto_core_ed25519_SCALARBYTES];
    uint8_t hn[crypto_core_ed25519_SCALARBYTES];
    uint8_t sk[crypto_core_ed25519_SCALARBYTES];
    crypto_core_ed25519_scalar_random(sa); // sa <- [0, L)
    crypto_core_ed25519_scalar_random(sk); // sk <- [0, L)
    crypto_core_ed25519_scalar_random(hn); // hn <- [0, L)
    uint8_t product[crypto_core_ed25519_SCALARBYTES];
    crypto_core_ed25519_scalar_mul(product, sk, hn); // sk*hn
    crypto_core_ed25519_scalar_sub(sb, sa, product); // sb = sa - sk*hn
    uint8_t point1[crypto_core_ed25519_BYTES];
    crypto_scalarmult_ed25519_base(point1, sa);      // sa*G
    uint8_t point2[crypto_core_ed25519_BYTES];
    uint8_t sum[crypto_core_ed25519_BYTES];
    // equal:
    // crypto_core_ed25519_scalar_add(sum, sb, product);
    // crypto_scalarmult_ed25519_base(point2, sum);
    // is not equal:
    uint8_t temp1[crypto_core_ed25519_BYTES];
    uint8_t temp2[crypto_core_ed25519_BYTES];
    crypto_scalarmult_ed25519_base(temp1, sb);       // sb*G
    crypto_scalarmult_ed25519_base(temp2, product);  // (sk*hn)*G
    crypto_core_ed25519_add(point2, temp1, temp2);   // sb*G + (sk*hn)*G
    if (memcmp(point1, point2, 32) != 0)
    {
        printf("[-] Not equal ");
        return -1;
    }
    printf("[+] equal");
    return 0;
}
I got the answer from jedisct1, the author of libsodium, and I will post it here:
crypto_scalarmult_ed25519_base() clamps the scalar (clears the 3 lower bits and sets the high bit) before performing the multiplication.
Use crypto_scalarmult_ed25519_base_noclamp() to prevent this.
Or, even better, use the Ristretto group instead.
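For reference, a minimal sketch of that fix applied to the code above (my addition; note that point1 must also be computed without clamping, and the noclamp variants require a reasonably recent libsodium):
crypto_scalarmult_ed25519_base_noclamp(point1, sa);     // sa*G, scalar used as-is
crypto_scalarmult_ed25519_base_noclamp(temp1, sb);      // sb*G
crypto_scalarmult_ed25519_base_noclamp(temp2, product); // (sk*hn)*G
crypto_core_ed25519_add(point2, temp1, temp2);          // point2 now equals point1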

Are 2^n exponent calculations really less efficient than bit-shifts?

If I do:
int x = 4;
pow(2, x);
Is that really that much less efficient than just doing:
1 << 4
?
Yes. An easy way to show this is to compile the following two functions that do the same thing and then look at the disassembly.
#include <stdint.h>
#include <math.h>

uint32_t foo1(uint32_t shftAmt) {
    return pow(2, shftAmt);
}

uint32_t foo2(uint32_t shftAmt) {
    return (1 << shftAmt);
}
cc -arch armv7 -O3 -S -o - shift.c (I happen to find ARM asm easier to read but if you want x86 just remove the arch flag)
_foo1:
# BB#0:
push {r7, lr}
vmov s0, r0
mov r7, sp
vcvt.f64.u32 d16, s0
vmov r0, r1, d16
blx _exp2
vmov d16, r0, r1
vcvt.u32.f64 s0, d16
vmov r0, s0
pop {r7, pc}
_foo2:
# BB#0:
movs r1, #1
lsl.w r0, r1, r0
bx lr
You can see foo2 only takes 2 instructions vs. foo1, which takes several: it has to move the data to the FP hardware registers (vmov), convert the integer to a double (vcvt.f64.u32), call the exp2 function, convert the answer back to an unsigned int (vcvt.u32.f64), and move it from the FP hardware back to the GP registers.
Yes. Though by how much I can't say. The easiest way to determine that is to benchmark it.
The pow function uses doubles... at least if it conforms to the C standard. Even if that function used a bit shift when it sees a base of 2, there would still be testing and branching to reach that conclusion, by which time your simple bit shift would be complete. And we haven't even considered the overhead of a function call yet.
For equivalency, I assume you meant to use 1 << x instead of 1 << 4.
Perhaps a compiler could optimize both of these, but it's far less likely to optimize a call to pow. If you need the fastest way to compute a power of 2, do it with shifting.
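As a middle ground, if what you actually need is a floating-point power of two, the standard ldexp function scales by 2^exp directly instead of going through the general pow machinery. A small sketch contrasting the three options:
#include <math.h>
#include <stdio.h>

int main(void)
{
    int x = 4;
    double a = pow(2, x);     /* general power function: full pow machinery */
    double b = ldexp(1.0, x); /* 1.0 * 2^x: typically just exponent manipulation */
    int    c = 1 << x;        /* integer shift: a single instruction */
    printf("%f %f %d\n", a, b, c);
    return 0;
}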
Update... Since I mentioned it's easy to benchmark, I decided to do just that. I happen to have Windows and Visual C++ handy so I used that. Results will vary. My program:
#include <Windows.h>
#include <cstdio>
#include <cmath>
#include <ctime>
LARGE_INTEGER liFreq, liStart, liStop;
inline void StartTimer()
{
QueryPerformanceCounter(&liStart);
}
inline double ReportTimer()
{
QueryPerformanceCounter(&liStop);
double milli = 1000.0 * double(liStop.QuadPart - liStart.QuadPart) / double(liFreq.QuadPart);
printf( "%.3f ms\n", milli );
return milli;
}
int main()
{
QueryPerformanceFrequency(&liFreq);
const size_t nTests = 10000000;
int x = 4;
int sumPow = 0;
int sumShift = 0;
double powTime, shiftTime;
// Make an array of random exponents to use in tests.
const size_t nExp = 10000;
int e[nExp];
srand( (unsigned int)time(NULL) );
for( int i = 0; i < nExp; i++ ) e[i] = rand() % 31;
// Test power.
StartTimer();
for( size_t i = 0; i < nTests; i++ )
{
int y = (int)pow(2, (double)e[i%nExp]);
sumPow += y;
}
powTime = ReportTimer();
// Test shifting.
StartTimer();
for( size_t i = 0; i < nTests; i++ )
{
int y = 1 << e[i%nExp];
sumShift += y;
}
shiftTime = ReportTimer();
// The compiler shouldn't optimize out our loops if we need to display a result.
printf( "Sum power: %d\n", sumPow );
printf( "Sum shift: %d\n", sumShift );
printf( "Time ratio of pow versus shift: %.2f\n", powTime / shiftTime );
system("pause");
return 0;
}
My output:
379.466 ms
15.862 ms
Sum power: 157650768
Sum shift: 157650768
Time ratio of pow versus shift: 23.92
That depends on the compiler, but in general (when the compiler is not totally braindead) yes: the shift is one CPU instruction, while the other is a function call, which involves saving the current state and setting up a stack frame, and that requires many instructions.
Generally yes, as a bit shift is a very basic operation for the processor.
On the other hand, some compilers can optimise the code so that raising 2 to a power is in fact just a bit shift, for example when the exponent is a compile-time constant.

Is there any GMP logarithm function?

Is there any logarithm function implemented in the GMP library?
I know you didn't ask how to implement it, but...
You can implement a rough one using the properties of logarithms: http://gnumbers.blogspot.com.au/2011/10/logarithm-of-large-number-it-is-not.html
And the internals of the GMP library: https://gmplib.org/manual/Integer-Internals.html
(Edit: Basically you just use the most significant "digit" of the GMP representation; since the base B of the representation is huge, B^N is much larger than B^{N-1}.)
Here is my implementation for Rationals.
#include <cmath>   // log, INFINITY
#include <climits> // ULONG_MAX
#include <cstdlib> // abs
#include <gmp.h>

double LogE(mpq_t m_op)
{
    // log(a/b) = log(a) - log(b)
    // And if a is represented in base B as:
    //   a = a_N B^N + a_{N-1} B^{N-1} + ... + a_0
    // => log(a) \approx log(a_N B^N)
    //             = log(a_N) + N log(B)
    // where B is the base; ie: ULONG_MAX
    static double logB = log(ULONG_MAX);

    // Undefined logs (should probably return NAN in the second case?)
    if (mpz_get_ui(mpq_numref(m_op)) == 0 || mpz_sgn(mpq_numref(m_op)) < 0)
        return -INFINITY;

    // Log of the numerator
    double lognum = log(mpq_numref(m_op)->_mp_d[abs(mpq_numref(m_op)->_mp_size) - 1]);
    lognum += (abs(mpq_numref(m_op)->_mp_size) - 1) * logB;

    // Subtract log of the denominator, if it exists
    if (abs(mpq_denref(m_op)->_mp_size) > 0)
    {
        lognum -= log(mpq_denref(m_op)->_mp_d[abs(mpq_denref(m_op)->_mp_size) - 1]);
        lognum -= (abs(mpq_denref(m_op)->_mp_size) - 1) * logB;
    }
    return lognum;
}
(Much later edit)
Coming back to this 5 years later, I just think it's cool that the core concept of log(a) = N log(B) + log(a_N) shows up even in native floating-point implementations; here is the glibc one for ia64.
And I used it again after encountering this question
No, there is no such function in GMP.
Only in MPFR.
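For completeness, here is a minimal sketch of what that looks like with MPFR (link with -lmpfr -lgmp):
#include <stdio.h>
#include <mpfr.h>

int main(void)
{
    mpfr_t x, r;
    mpfr_init2(x, 256);                  /* 256 bits of precision */
    mpfr_init2(r, 256);
    mpfr_set_ui(x, 42, MPFR_RNDN);
    mpfr_log(r, x, MPFR_RNDN);           /* r = ln(42) at the chosen precision */
    mpfr_printf("ln(42) = %.40Rf\n", r);
    mpfr_clears(x, r, (mpfr_ptr) 0);
    return 0;
}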
The method below makes use of mpz_get_d_2exp and was obtained from the gmp R package. It can be found under the function biginteger_log in the file bigintegerR.cc (you first have to download the source, i.e. the tar file). You can also see it here: biginteger_log.
#include <math.h> // log
#include <gmp.h>

// Adapted for general use from the original biginteger_log
// xi = di * 2^ex  ==>  log(xi) = log(di) + ex * log(2)
double biginteger_log_modified(mpz_t x) {
    signed long int ex;
    const double di = mpz_get_d_2exp(&ex, x);
    return log(di) + log(2) * (double) ex;
}
Of course, the above method could be modified to return the log with any base using the properties of logarithm (e.g. the change of base formula).
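For example, a hypothetical wrapper using the change of base formula log_b(x) = log(x) / log(b):
// Log of x in an arbitrary base b, via the change of base formula.
double biginteger_log_base(mpz_t x, double b) {
    return biginteger_log_modified(x) / log(b);
}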
Here it is:
https://github.com/linas/anant
It provides GNU MP real and complex logarithm, exp, sine, cosine, gamma, arctan, sqrt, polylogarithm, Riemann and Hurwitz zeta, confluent hypergeometric, topologist's sine, and more.
As other answers said, there is no logarithm function in GMP, and some of the answers provided implementations with double precision only, not arbitrary precision.
Below I implemented a full (arbitrary) precision logarithm function, up to thousands of bits of precision if you wish, using mpf, GMP's generic floating-point type.
My code uses a Taylor series for ln(1 + x), plus mpf_sqrt() to speed up the computation.
The code is in C++ and is quite large, for two reasons. First, it does precise time measurements to figure out the best combination of internal computational parameters for your machine. Second, it uses extra speed improvements, such as additional use of mpf_sqrt() for preparing the initial value.
The algorithm is as follows:
1. Factor out the exponent of 2 from the input x, i.e. rewrite x = d * 2^exp, using mpf_get_d_2exp().
2. Adjust d (from the step above) so that 2/3 <= d <= 4/3; this is achieved by possibly multiplying d by 2 and doing --exp. This ensures that d differs from 1 by at most 1/3; in other words, d extends from 1 in both directions (negative and positive) by an equal distance.
3. Divide x by 2^exp, using mpf_div_2exp() and mpf_mul_2exp().
4. Take the square root of x several times (num_sqrt times) so that x becomes closer to 1. This ensures that the Taylor series converges more rapidly, because taking a few square roots is faster than spending much more time on extra iterations of the Taylor series.
5. Compute the Taylor series for ln(1 + x) up to the desired precision (even thousands of bits if needed).
6. Because in step 4 we took square roots several times, we now multiply y (the result of the Taylor series) by 2^num_sqrt.
7. Finally, because in step 1 we factored out 2^exp, we now add ln(2) * exp to y. Here ln(2) is computed by just one recursive call to the same function that implements the whole algorithm.
The steps above come from the sequence of formulas ln(x) = ln(d * 2^exp) = ln(d) + exp * ln(2) = 2^num_sqrt * ln(sqrt(...sqrt(d))) + exp * ln(2).
My implementation automatically does timings (just once per program run) to figure out how many square roots are needed to balance out the Taylor series computation. If you need to avoid the timings, pass 0.001 instead of zero as the 3rd parameter sqrt_range of mpf_ln().
The main() function contains examples of usage, testing of correctness (by comparing to the lower-precision std::log()), timings, and output of different verbose information. The function is tested on the first 1024 bits of Pi.
Before calling my function mpf_ln(), don't forget to set up the needed precision of computation by calling mpf_set_default_prec(bits) with the desired precision in bits.
The computation time of my mpf_ln() is about 40-90 microseconds for 1024-bit precision. Larger precision takes more time, approximately linearly proportional to the number of precision bits.
The very first run of the function takes considerably longer because it pre-computes the timings table and the value of ln(2). So it is suggested to do a first single computation at program start, to avoid a longer computation inside a time-critical region later in the code.
To compile, for example on Linux, you have to install the GMP library and issue the command:
clang++-14 -std=c++20 -O3 -lgmp -lgmpxx -o main main.cpp && ./main
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <chrono>
#include <mutex>
#include <vector>
#include <unordered_map>
#include <algorithm> // std::lower_bound
#include <gmpxx.h>
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
mpf_class mpf_ln(mpf_class x, bool verbose = false, double sqrt_range = 0) {
auto total_time = verbose ? Time() : 0.0;
int const prec = mpf_get_prec(x.get_mpf_t());
if (sqrt_range == 0) {
static std::mutex mux;
std::lock_guard<std::mutex> lock(mux);
static std::vector<std::pair<size_t, double>> ranges;
if (ranges.empty())
mpf_ln(3.14, false, 0.01);
while (ranges.empty() || ranges.back().first < prec) {
size_t const bits = ranges.empty() ? 64 : ranges.back().first * 3 / 2;
mpf_class x = 3.14;
mpf_set_prec(x.get_mpf_t(), bits);
double sr = 0.35, sr_best = 1, time_best = 1000;
size_t constexpr ntests = 5;
while (true) {
auto tim = Time();
for (size_t i = 0; i < ntests; ++i)
mpf_ln(x, false, sr);
tim = (Time() - tim) / ntests;
bool updated = false;
if (tim < time_best) {
sr_best = sr;
time_best = tim;
updated = true;
}
sr /= 1.5;
if (sr <= 1e-8) {
ranges.push_back(std::make_pair(bits, sr_best));
break;
}
}
}
sqrt_range = std::lower_bound(ranges.begin(), ranges.end(), size_t(prec),
[](auto const & a, auto const & b){
return a.first < b;
})->second;
}
signed long int exp = 0;
// https://gmplib.org/manual/Converting-Floats
double d = mpf_get_d_2exp(&exp, x.get_mpf_t());
if (d < 2.0 / 3) {
d *= 2;
--exp;
}
mpf_class t;
// https://gmplib.org/manual/Float-Arithmetic
if (exp >= 0)
mpf_div_2exp(x.get_mpf_t(), x.get_mpf_t(), exp);
else
mpf_mul_2exp(x.get_mpf_t(), x.get_mpf_t(), -exp);
auto sqrt_time = verbose ? Time() : 0.0;
// Multiple Sqrt of x
int num_sqrt = 0;
if (x >= 1)
while (x >= 1.0 + sqrt_range) {
// https://gmplib.org/manual/Float-Arithmetic
mpf_sqrt(x.get_mpf_t(), x.get_mpf_t());
++num_sqrt;
}
else
while (x <= 1.0 - sqrt_range) {
mpf_sqrt(x.get_mpf_t(), x.get_mpf_t());
++num_sqrt;
}
if (verbose)
sqrt_time = Time() - sqrt_time;
static mpf_class const eps = [&]{
mpf_class eps = 1;
mpf_div_2exp(eps.get_mpf_t(), eps.get_mpf_t(), prec + 8);
return eps;
}(), meps = -eps;
// Taylor Serie for ln(1 + x)
// https://math.stackexchange.com/a/878376/826258
x -= 1;
mpf_class k = x, y = x, mx = -x;
size_t num_iters = 0;
for (int32_t i = 2;; ++i) {
k *= mx;
y += k / i;
// Check if error is small enough
if (meps <= k && k <= eps) {
num_iters = i;
break;
}
}
auto VerboseInfo = [&]{
if (!verbose)
return;
total_time = Time() - total_time;
std::cout << std::fixed << "Sqrt range " << sqrt_range << ", num sqrts "
<< num_sqrt << ", sqrt time " << sqrt_time << " sec" << std::endl;
std::cout << "Ln number of iterations " << num_iters << ", ln time "
<< total_time << " sec" << std::endl;
};
// Correction due to multiple sqrt of x
y *= 1 << num_sqrt;
if (exp == 0) {
VerboseInfo();
return y;
}
mpf_class ln2;
{
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
static std::unordered_map<size_t, mpf_class> ln2s;
auto it = ln2s.find(size_t(prec));
if (it == ln2s.end()) {
mpf_class sqrt_sqrt_2 = 2;
mpf_sqrt(sqrt_sqrt_2.get_mpf_t(), sqrt_sqrt_2.get_mpf_t());
mpf_sqrt(sqrt_sqrt_2.get_mpf_t(), sqrt_sqrt_2.get_mpf_t());
it = ln2s.insert(std::make_pair(size_t(prec), mpf_class(mpf_ln(sqrt_sqrt_2, false, sqrt_range) * 4))).first;
}
ln2 = it->second;
}
y += ln2 * exp;
VerboseInfo();
return y;
}
std::string mpf_str(mpf_class const & x) {
mp_exp_t exp;
auto s = x.get_str(exp);
return s.substr(0, exp) + "." + s.substr(exp);
}
int main() {
// https://gmplib.org/manual/Initializing-Floats
mpf_set_default_prec(1024); // bit-precision
// http://www.math.com/tables/constants/pi.htm
mpf_class x(
"3."
"1415926535 8979323846 2643383279 5028841971 6939937510 "
"5820974944 5923078164 0628620899 8628034825 3421170679 "
"8214808651 3282306647 0938446095 5058223172 5359408128 "
"4811174502 8410270193 8521105559 6446229489 5493038196 "
"4428810975 6659334461 2847564823 3786783165 2712019091 "
"4564856692 3460348610 4543266482 1339360726 0249141273 "
"7245870066 0631558817 4881520920 9628292540 9171536436 "
);
std::cout << std::boolalpha << std::fixed << std::setprecision(14);
std::cout << "x:" << std::endl << mpf_str(x) << std::endl;
auto cmath_val = std::log(mpf_get_d(x.get_mpf_t()));
std::cout << "cmath ln(x): " << std::endl << cmath_val << std::endl;
auto volatile tmp = mpf_ln(x); // Pre-Compute to heat-up timings table.
auto time_start = Time();
size_t constexpr ntests = 20;
for (size_t i = 0; i < ntests; ++i) {
auto volatile tmp = mpf_ln(x);
}
std::cout << "mpf ln(x) time " << (Time() - time_start) / ntests << " sec" << std::endl;
auto mpf_val = mpf_ln(x, true);
std::cout << "mpf ln(x):" << std::endl << mpf_str(mpf_val) << std::endl;
std::cout << "equal to cmath: " << (std::abs(mpf_get_d(mpf_val.get_mpf_t()) - cmath_val) <= 1e-14) << std::endl;
return 0;
}
Output:
x:
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724587007
cmath ln(x):
1.14472988584940
mpf ln(x) time 0.00004426845000 sec
Sqrt range 0.00000004747981, num sqrts 23, sqrt time 0.00001440000000 sec
Ln number of iterations 42, ln time 0.00003873100000 sec
mpf ln(x):
1.144729885849400174143427351353058711647294812915311571513623071472137769884826079783623270275489707702009812228697989159048205527923456587279081078810286825276393914266345902902484773358869937789203119630824756794011916028217227379888126563178049823697313310695003600064405487263880223270096433504959511813198
equal to cmath: true

Create a Fraction array

I have to create a dynamic array capable of holding 2*n Fractions. If the dynamic array cannot be allocated, the function prints a message and calls exit(1). It then fills the array with reduced random Fractions whose numerators are between 1 and 20, inclusive, and whose initial denominators are between 2 and 20, inclusive.
I already wrote the function that creates a fraction and reduces it; this is what I have. When I compile and run this program it crashes, and I can't find out why. If I put 1 instead of 10 in test.c it doesn't crash, but it gives me a crazy fraction. If I put 7, 8, or 11 in test.c it crashes. I would appreciate it if someone could help me.
FractionSumTester.c
Fraction randomFraction(int minNum, int minDenom, int max)
{
    Fraction l;
    Fraction m;
    Fraction f;
    l.numerator = randomInt(minNum, max);
    l.denominator = randomInt(minDenom, max);
    m = reduceFraction(l);
    while (m.denominator <= 1)
    {
        l.numerator = randomInt(minNum, max);
        l.denominator = randomInt(minDenom, max);
        m = reduceFraction(l);
    }
    return m;
}

Fraction *createFractionArray(int n)
{
    Fraction *p;
    int i;
    p = malloc(n * sizeof(Fraction));
    if (p == NULL)
    {
        printf("error");
        exit(1);
    }
    for (i = 0; i < 2*n; i++)
    {
        p[i] = randomFraction(1, 2, 20);
        printf("%d/%d\n", p[i].numerator, p[i].denominator);
    }
    return p;
}
This is what I am using to test these two functions.
test.c
#include "Fraction.h"
#include "FractionSumTester.h"
#include <stdio.h>
int main()
{
    createFractionArray(10);
    return 0;
}
In your createFractionArray() function, you malloc() space for n items. Then, in the for loop, you write 2*n items into that space... which overruns your buffer and causes the crash.
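A minimal fix, matching the stated requirement of holding 2*n Fractions, is to allocate room for everything the loop writes:
p = malloc(2 * n * sizeof(Fraction)); /* space for the 2*n items written below */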