Is there any GMP logarithm function? - gmp

Is there any logarithm function implemented in the GMP library?

I know you didn't ask how to implement it, but...
You can implement a rough one using the properties of logarithms: http://gnumbers.blogspot.com.au/2011/10/logarithm-of-large-number-it-is-not.html
And the internals of the GMP library: https://gmplib.org/manual/Integer-Internals.html
(Edit: Basically, you just use the most significant "digit" (limb) of the GMP representation. Since the base B of the representation is huge, B^N is much larger than B^{N-1}, so the lower limbs contribute very little.)
Here is my implementation for Rationals.
double LogE(mpq_t m_op)
{
// log(a/b) = log(a) - log(b)
// And if a is represented in base B as:
// a = a_N B^N + a_{N-1} B^{N-1} + ... + a_0
// => log(a) \approx log(a_N B^N)
// = log(a_N) + N log(B)
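// where B is the limb base 2^GMP_NUMB_BITS, approximated here by ULONG_MAX
static double logB = log(ULONG_MAX);
// Undefined logs (should probably return NAN in second case?)
// Test the sign with mpz_sgn: mpz_get_ui only returns the lowest limb,
// which can be 0 for a nonzero value such as 2^64.
if (mpz_sgn(mpq_numref(m_op)) <= 0)
return -INFINITY;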
// Log of numerator
double lognum = log(mpq_numref(m_op)->_mp_d[abs(mpq_numref(m_op)->_mp_size) - 1]);
lognum += (abs(mpq_numref(m_op)->_mp_size)-1) * logB;
// Subtract log of denominator, if it exists
if (abs(mpq_denref(m_op)->_mp_size) > 0)
{
lognum -= log(mpq_denref(m_op)->_mp_d[abs(mpq_denref(m_op)->_mp_size)-1]);
lognum -= (abs(mpq_denref(m_op)->_mp_size)-1) * logB;
}
return lognum;
}
(Much later edit)
Coming back to this 5 years later, I just think it's cool that the core concept of log(a) = N log(B) + log(a_N) shows up even in native floating-point implementations; the glibc one for ia64 is an example.
And I used it again after encountering this question

No, there is no such function in GMP.
Only in MPFR.

The method below makes use of mpz_get_d_2exp and was obtained from the gmp R package. It can be found under the function biginteger_log in the file bigintegerR.cc (You first have to download the source (i.e. the tar file)). You can also see it here: biginteger_log.
// Adapted for general use from the original biginteger_log
// xi = di * 2 ^ ex ==> log(xi) = log(di) + ex * log(2)
double biginteger_log_modified(mpz_t x) {
signed long int ex;
const double di = mpz_get_d_2exp(&ex, x);
return log(di) + log(2) * (double) ex;
}
Of course, the above method could be modified to return the log with any base using the properties of logarithm (e.g. the change of base formula).
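To illustrate the same mantissa/exponent split without GMP, here is a minimal Python sketch. It uses Python's built-in arbitrary-precision integers in place of mpz_t; the name big_log and the choice to keep only the top 64 bits are my own assumptions, not part of the gmp R package. The optional base argument applies the change-of-base formula.

```python
import math

def big_log(x, base=math.e):
    """Approximate log of a positive Python int using x = di * 2**ex."""
    if x <= 0:
        raise ValueError("logarithm is only defined for positive values")
    ex = x.bit_length()              # x < 2**ex, so di lands in [0.5, 1)
    shift = max(ex - 64, 0)          # keep only the top 64 bits of x
    di = (x >> shift) / float(1 << (ex - shift))
    natural = math.log(di) + ex * math.log(2)
    return natural / math.log(base)  # change of base

print(big_log(10**1000))             # natural log of a 1001-digit number
print(big_log(2**5000, 2))           # log base 2
```

Like the C version, this only gives double precision; the integer itself can be arbitrarily large.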

Here it is:
https://github.com/linas/anant
Provides gnu mp real and complex logarithm, exp, sine, cosine, gamma, arctan, sqrt, polylogarithm Riemann and Hurwitz zeta, confluent hypergeometric, topologists sine, and more.

As other answers said, there is no logarithm function in GMP. Some of the answers provided implementations of the logarithm, but only with double precision, not arbitrary precision.
I implemented a full (arbitrary) precision logarithm function below, good for even thousands of bits of precision if you wish. It uses mpf, the generic floating-point type of GMP.
My code uses the Taylor series for ln(1 + x) plus mpf_sqrt() (to speed up the computation).
The code is in C++ and is quite large for two reasons. First, it does precise time measurements to figure out the best combination of internal computational parameters for your machine. Second, it uses extra speed improvements, such as additional calls to mpf_sqrt() to prepare the initial value.
The algorithm of my code is the following:
1. Factor out the exponent of 2 from the input x, i.e. rewrite x = d * 2^exp, using mpf_get_d_2exp().
2. Make d (from the step above) satisfy 2/3 <= d <= 4/3, by multiplying d by 2 and doing --exp if necessary. This ensures that d differs from 1 by at most 1/3; in other words, d extends from 1 by an equal distance in both directions (negative and positive).
3. Divide x by 2^exp, using mpf_div_2exp() and mpf_mul_2exp().
4. Take the square root of x several times (num_sqrt times) so that x gets closer to 1. This makes the Taylor series converge more rapidly, because computing a few square roots is faster than spending much more time on extra iterations of the Taylor series.
5. Compute the Taylor series for ln(1 + x) up to the desired precision (even thousands of bits of precision if needed).
6. Because in Step 4 we took the square root several times, we now need to multiply y (the result of the Taylor series) by 2^num_sqrt.
7. Finally, because in Step 1 we factored out 2^exp, we now need to add ln(2) * exp to y. Here ln(2) is computed by just one recursive call to the same function that implements the whole algorithm.
The steps above come from the sequence of formulas ln(x) = ln(d * 2^exp) = ln(d) + exp * ln(2) = ln(sqrt(...sqrt(d))) * 2^num_sqrt + exp * ln(2).
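The steps above can be sketched compactly in Python with the standard decimal module standing in for mpf. This is only an illustration under my own assumptions: the names ln and _ln_near_one, the fixed num_sqrt=8 (no timing table), the guard-digit counts, and the simple halving/doubling loops for factoring out powers of 2 are not taken from the C++ code below.

```python
from decimal import Decimal, localcontext

def _ln_near_one(d, num_sqrt, eps):
    """ln(d) for d close to 1: repeated square roots, then the Taylor series."""
    for _ in range(num_sqrt):
        d = d.sqrt()
    t = d - 1
    term = total = t
    i = 1
    while abs(term) > eps:          # ln(1 + t) = t - t^2/2 + t^3/3 - ...
        i += 1
        term *= -t
        total += term / i
    return total * (2 ** num_sqrt)  # undo the square roots

def ln(x, digits=50, num_sqrt=8):
    with localcontext() as ctx:
        ctx.prec = digits + 10      # a few guard digits
        eps = Decimal(10) ** -(digits + 8)
        x = Decimal(x)
        if x <= 0:
            raise ValueError("ln is undefined for x <= 0")
        exp = 0                     # x = d * 2**exp with 2/3 <= d < 4/3
        while x >= Decimal(4) / 3:
            x /= 2
            exp += 1
        while x < Decimal(2) / 3:
            x *= 2
            exp -= 1
        y = _ln_near_one(x, num_sqrt, eps)
        if exp != 0:
            # ln(2) = 4 * ln(2 ** (1/4)); the fourth root of 2 is close to 1
            ln2 = 4 * _ln_near_one(Decimal(2).sqrt().sqrt(), num_sqrt, eps)
            y += exp * ln2
        return +y

print(ln(10))
```

With num_sqrt=8 the series argument is below about 0.002, so each term gains roughly three decimal digits.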
My implementation automatically does timings (just once per program run) to figure out how many square roots are needed to balance out the Taylor series computation. If you need to avoid the timings, pass the 3rd parameter sqrt_range of mpf_ln() as 0.001 instead of zero.
The main() function contains usage examples, a correctness test (comparing against the lower-precision std::log()), timings, and various verbose output. The function is tested on the first 1024 bits of Pi.
Before calling my function mpf_ln(), don't forget to set the required precision of computation by calling mpf_set_default_prec(bits) with the desired precision in bits.
The computational time of my mpf_ln() is about 40-90 microseconds for 1024-bit precision. Higher precision takes more time, approximately linearly proportional to the number of precision bits.
The very first run of the function takes considerably longer because it pre-computes the timings table and the value of ln(2). So it is advisable to do a single computation at program start, to avoid a longer computation inside a time-critical region later in the code.
To compile for example on Linux, you have to install GMP library and issue command:
clang++-14 -std=c++20 -O3 -o main main.cpp -lgmpxx -lgmp && ./main
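#include <cstdint>
#include <iomanip>
#include <iostream>
#include <cmath>
#include <chrono>
#include <mutex>
#include <vector>
#include <unordered_map>
#include <algorithm> // std::lower_bound
#include <string> // std::string
#include <utility> // std::pair, std::make_pair
#include <gmpxx.h>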
double Time() {
static auto const gtb = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
mpf_class mpf_ln(mpf_class x, bool verbose = false, double sqrt_range = 0) {
auto total_time = verbose ? Time() : 0.0;
int const prec = mpf_get_prec(x.get_mpf_t());
if (sqrt_range == 0) {
static std::mutex mux;
std::lock_guard<std::mutex> lock(mux);
static std::vector<std::pair<size_t, double>> ranges;
if (ranges.empty())
mpf_ln(3.14, false, 0.01);
while (ranges.empty() || ranges.back().first < prec) {
size_t const bits = ranges.empty() ? 64 : ranges.back().first * 3 / 2;
mpf_class x = 3.14;
mpf_set_prec(x.get_mpf_t(), bits);
double sr = 0.35, sr_best = 1, time_best = 1000;
size_t constexpr ntests = 5;
while (true) {
auto tim = Time();
for (size_t i = 0; i < ntests; ++i)
mpf_ln(x, false, sr);
tim = (Time() - tim) / ntests;
bool updated = false;
if (tim < time_best) {
sr_best = sr;
time_best = tim;
updated = true;
}
sr /= 1.5;
if (sr <= 1e-8) {
ranges.push_back(std::make_pair(bits, sr_best));
break;
}
}
}
sqrt_range = std::lower_bound(ranges.begin(), ranges.end(), size_t(prec),
[](auto const & a, auto const & b){
return a.first < b;
})->second;
}
signed long int exp = 0;
// https://gmplib.org/manual/Converting-Floats
double d = mpf_get_d_2exp(&exp, x.get_mpf_t());
if (d < 2.0 / 3) {
d *= 2;
--exp;
}
mpf_class t;
// https://gmplib.org/manual/Float-Arithmetic
if (exp >= 0)
mpf_div_2exp(x.get_mpf_t(), x.get_mpf_t(), exp);
else
mpf_mul_2exp(x.get_mpf_t(), x.get_mpf_t(), -exp);
auto sqrt_time = verbose ? Time() : 0.0;
// Multiple Sqrt of x
int num_sqrt = 0;
if (x >= 1)
while (x >= 1.0 + sqrt_range) {
// https://gmplib.org/manual/Float-Arithmetic
mpf_sqrt(x.get_mpf_t(), x.get_mpf_t());
++num_sqrt;
}
else
while (x <= 1.0 - sqrt_range) {
mpf_sqrt(x.get_mpf_t(), x.get_mpf_t());
++num_sqrt;
}
if (verbose)
sqrt_time = Time() - sqrt_time;
// Not static: eps depends on prec, which can differ between calls.
mpf_class const eps = [&]{
mpf_class eps = 1;
mpf_div_2exp(eps.get_mpf_t(), eps.get_mpf_t(), prec + 8);
return eps;
}(), meps = -eps;
// Taylor series for ln(1 + x)
// https://math.stackexchange.com/a/878376/826258
x -= 1;
mpf_class k = x, y = x, mx = -x;
size_t num_iters = 0;
for (int32_t i = 2;; ++i) {
k *= mx;
y += k / i;
// Check if error is small enough
if (meps <= k && k <= eps) {
num_iters = i;
break;
}
}
auto VerboseInfo = [&]{
if (!verbose)
return;
total_time = Time() - total_time;
std::cout << std::fixed << "Sqrt range " << sqrt_range << ", num sqrts "
<< num_sqrt << ", sqrt time " << sqrt_time << " sec" << std::endl;
std::cout << "Ln number of iterations " << num_iters << ", ln time "
<< total_time << " sec" << std::endl;
};
// Correction due to multiple sqrt of x
y *= 1 << num_sqrt;
if (exp == 0) {
VerboseInfo();
return y;
}
mpf_class ln2;
{
static std::mutex mutex;
std::lock_guard<std::mutex> lock(mutex);
static std::unordered_map<size_t, mpf_class> ln2s;
auto it = ln2s.find(size_t(prec));
if (it == ln2s.end()) {
mpf_class sqrt_sqrt_2 = 2;
mpf_sqrt(sqrt_sqrt_2.get_mpf_t(), sqrt_sqrt_2.get_mpf_t());
mpf_sqrt(sqrt_sqrt_2.get_mpf_t(), sqrt_sqrt_2.get_mpf_t());
it = ln2s.insert(std::make_pair(size_t(prec), mpf_class(mpf_ln(sqrt_sqrt_2, false, sqrt_range) * 4))).first;
}
ln2 = it->second;
}
y += ln2 * exp;
VerboseInfo();
return y;
}
std::string mpf_str(mpf_class const & x) {
mp_exp_t exp;
auto s = x.get_str(exp);
return s.substr(0, exp) + "." + s.substr(exp);
}
int main() {
// https://gmplib.org/manual/Initializing-Floats
mpf_set_default_prec(1024); // bit-precision
// http://www.math.com/tables/constants/pi.htm
mpf_class x(
"3."
"1415926535 8979323846 2643383279 5028841971 6939937510 "
"5820974944 5923078164 0628620899 8628034825 3421170679 "
"8214808651 3282306647 0938446095 5058223172 5359408128 "
"4811174502 8410270193 8521105559 6446229489 5493038196 "
"4428810975 6659334461 2847564823 3786783165 2712019091 "
"4564856692 3460348610 4543266482 1339360726 0249141273 "
"7245870066 0631558817 4881520920 9628292540 9171536436 "
);
std::cout << std::boolalpha << std::fixed << std::setprecision(14);
std::cout << "x:" << std::endl << mpf_str(x) << std::endl;
auto cmath_val = std::log(mpf_get_d(x.get_mpf_t()));
std::cout << "cmath ln(x): " << std::endl << cmath_val << std::endl;
auto volatile tmp = mpf_ln(x); // Pre-Compute to heat-up timings table.
auto time_start = Time();
size_t constexpr ntests = 20;
for (size_t i = 0; i < ntests; ++i) {
auto volatile tmp = mpf_ln(x);
}
std::cout << "mpf ln(x) time " << (Time() - time_start) / ntests << " sec" << std::endl;
auto mpf_val = mpf_ln(x, true);
std::cout << "mpf ln(x):" << std::endl << mpf_str(mpf_val) << std::endl;
std::cout << "equal to cmath: " << (std::abs(mpf_get_d(mpf_val.get_mpf_t()) - cmath_val) <= 1e-14) << std::endl;
return 0;
}
Output:
x:
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724587007
cmath ln(x):
1.14472988584940
mpf ln(x) time 0.00004426845000 sec
Sqrt range 0.00000004747981, num sqrts 23, sqrt time 0.00001440000000 sec
Ln number of iterations 42, ln time 0.00003873100000 sec
mpf ln(x):
1.144729885849400174143427351353058711647294812915311571513623071472137769884826079783623270275489707702009812228697989159048205527923456587279081078810286825276393914266345902902484773358869937789203119630824756794011916028217227379888126563178049823697313310695003600064405487263880223270096433504959511813198
equal to cmath: true

Related

Computation of 64 bit CRC polynomial performance

I found the following page in the web:
https://users.ece.cmu.edu/~koopman/crc/crc64.html
It lists the performance of a handful of 64-bit CRC polynomials. The optimal payload for a Hamming distance of 3 is listed as 18446744073709551551 bits. A polynomial providing that HD 3 payload is 0xd6c9e91aca649ad4 (Koopman notation).
On the same website there is also some basic "HDLen" C code that can compute the performance of any polynomial (https://users.ece.cmu.edu/~koopman/crc/hdlen.html). I checked that code and the HD 3 optimized loop is very simple, similar to this:
Poly_t accum = cPoly;
Length_t len = 0;
while(accum != cTopBitSet)
{
accum = (accum & 1) ? ((accum >> 1) ^ cPoly) : (accum >> 1);
len++;
}
18446744073709551551 is a huge number. It is almost the full range of a 64-bit integer. Even that simple loop would run for centuries on the most powerful CPU core available.
It also appears to me that this loop cannot be parallelized, since each iteration depends on the previous one.
It is claimed that this payload is optimal among all possible 64-bit polynomials, which means that all possible 64-bit polynomials would have had to be checked for their individual HD 3 performance. That task can be parallelized, but the huge number of candidate polynomials still seems to make it infeasible.
I can't see a way to compute even a single (good) polynomial's HD 3 performance, not to mention that of all possible 64-bit polynomials.
So I wonder: How was the number found? What kind of code or method (in contrast to the simple HDLen software) was used to find the mentioned optimal HD 3 payload?
It is a primitive polynomial, and it can be shown that the HD=3 length of any primitive polynomial over GF(2) is 2^n-(n+1), where n is the degree of the polynomial. For n = 64 that is 2^64 - 65 = 18446744073709551551, the number quoted in the question.
It can be shown pretty quickly whether a polynomial over a finite field is primitive or not.
Also, it is possible to compute the CRC of a very sparse codeword of n bits in O(log n) time instead of O(n) time. Here is an example in C, demonstrating the case mentioned for the provided CRC:
#include <stdio.h>
#include <stdint.h>
// Jones' 64-bit primitive polynomial (the constant excludes the x^64 term):
// 1 + x^3 + x^5 + x^7 + x^8 + x^10 + x^12 + x^13 + x^16 + x^19 + x^22 + x^23 +
// x^26 + x^28 + x^31 + x^32 + x^34 + x^36 + x^37 + x^41 + x^44 + x^46 + x^47 +
// x^48 + x^49 + x^52 + x^55 + x^56 + x^58 + x^59 + x^61 + x^63 + x^64
#define POLY 0xad93d23594c935a9
#define HIGH 0x8000000000000000 // high bit set
// Return polynomial a times polynomial b modulo p (POLY). a must be non-zero.
static uint64_t multmodp(uint64_t a, uint64_t b) {
uint64_t prod = 0;
for (;;) {
if (a & 1) {
prod ^= b;
if (a == 1)
break;
}
a >>= 1;
b = b & HIGH ? (b << 1) ^ POLY : b << 1;
}
return prod;
}
// x2n_table[n] is x^2^n mod p.
static uint64_t x2n_table[64];
// Initialize x2n_table[].
static void x2n_table_init(void) {
uint64_t p = 2; // first entry is x^2^0 == x^1
x2n_table[0] = p;
for (size_t n = 1; n < 64; n++)
x2n_table[n] = p = multmodp(p, p);
}
// Compute x^n modulo p. This takes O(log n) time.
static uint64_t xtonmodp(uintmax_t n) {
uint64_t x = 1;
int k = 0;
for (;;) {
if (n & 1)
x = multmodp(x2n_table[k], x);
n >>= 1;
if (n == 0)
break;
k++;
}
return x;
}
// Feed n zero bits into the CRC, taking O(log n) time.
static uint64_t crc64zeros(uint64_t crc, uint64_t n) {
return multmodp(xtonmodp(n), crc);
}
// Feed one one bit into the CRC.
static uint64_t crc64one(uint64_t crc) {
return crc & HIGH ? crc << 1 : (crc << 1) ^ POLY;
}
// Return the CRC-64 of one one bit, followed by n zero bits, followed by one
// more one bit.
static uint64_t crc64_one_zeros_one(uint64_t n) {
return crc64one(crc64zeros(crc64one(0), n));
}
int main(void) {
x2n_table_init();
uint64_t n = -2; // code word with 2^64 bits: a 1, 2^64-2 0's, and a 1
printf("%llx\n", (unsigned long long)crc64_one_zeros_one(n)); // prints 0
return 0;
}
That calculation completes in about 7.4 µs on my machine, as opposed to the bit-at-a-time calculation, which would take about 560 years on my machine.
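As a rough illustration of where 2^n-(n+1) comes from: a two-bit error looks like x^k + 1 (shifted), and it goes undetected exactly when x^k is 1 modulo the generator. For a primitive polynomial of degree n the smallest such k is 2^n - 1, so codewords shorter than that (the payload plus the n CRC bits) catch every two-bit error. The Python sketch below checks this on the small primitive polynomial x^4 + x + 1; the helper names and the (poly, n) calling convention are my own, with poly holding the low n coefficients as in the C code above.

```python
def mulmod(a, b, poly, n):
    """Product of GF(2) polynomials a and b modulo x^n + poly."""
    top = 1 << (n - 1)
    mask = (1 << n) - 1
    prod = 0
    while a:
        if a & 1:
            prod ^= b
        a >>= 1
        # multiply b by x, reducing when the x^(n-1) coefficient was set
        b = ((b << 1) & mask) ^ (poly if b & top else 0)
    return prod

def x_pow(e, poly, n):
    """x^e modulo the polynomial by square-and-multiply, O(log e) products."""
    result, base = 1, 2              # 2 encodes the polynomial "x"
    while e:
        if e & 1:
            result = mulmod(result, base, poly, n)
        base = mulmod(base, base, poly, n)
        e >>= 1
    return result

def order_of_x(poly, n):
    """Brute-force multiplicative order of x; only feasible for small n."""
    acc, k = 2, 1
    while acc != 1:
        acc = mulmod(acc, 2, poly, n)
        k += 1
    return k

n, poly = 4, 0b0011                  # x^4 + x + 1
print(order_of_x(poly, n), 2**n - 1) # both are the order of x
print(2**n - (n + 1))                # HD=3 payload length in bits
```

For the 64-bit polynomial only x_pow is practical; x_pow(2**64 - 1, 0xad93d23594c935a9, 64) corresponds to the check the C program performs in crc64_one_zeros_one.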

When can an algorithm have square root(n) time complexity?

Can someone give me an example of an algorithm that has square root(n) time complexity? What does square root time complexity even mean?
Square root time complexity means that the algorithm requires O(N^(1/2)) evaluations where the size of input is N.
Grover's algorithm is one example of an algorithm that takes O(sqrt(n)) time. It is a quantum algorithm for searching an unsorted database of n entries in O(sqrt(n)) time.
Let us take an example to understand how we can arrive at O(sqrt(N)) runtime complexity for a given problem. This is going to be elaborate, but it is interesting to understand. (The following example is taken from Coding Contest Byte: The Square Root Trick, a very interesting problem and an interesting trick for arriving at O(sqrt(n)) complexity.)
Given an array A of n elements, implement a data structure for point updates and range sum queries.
update(i, x)-> A[i] := x (Point Updates Query)
query(lo, hi)-> returns A[lo] + A[lo+1] + .. + A[hi]. (Range Sum Query)
The naive solution uses an array. It takes O(1) time for an update (array-index access) and O(hi - lo) = O(n) for the range sum (iterating from start index to end index and adding up).
A more efficient solution splits the array into length k slices and stores the slice sums in an array S.
The update takes constant time, because we only have to update the value in A and the value of the corresponding entry in S. In update(6, 5) we have to change A[6] to 5, which results in changing the value of S[1] to keep S up to date.
The range-sum query is interesting. The elements of the first and last slice (partially contained in the queried range) have to be traversed one by one, but for slices completely contained in our range we can use the values in S directly and get a performance boost.
In query(2, 14) we get,
query(2, 14) = A[2] + A[3]+ (A[4] + A[5] + A[6] + A[7]) + (A[8] + A[9] + A[10] + A[11]) + A[12] + A[13] + A[14] ;
query(2, 14) = A[2] + A[3] + S[1] + S[2] + A[12] + A[13] + A[14] ;
query(2, 14) = 0 + 7 + 11 + 9 + 5 + 2 + 0;
query(2, 14) = 34;
The code for update and query is:
def update(S, A, i, k, x):
    S[i // k] = S[i // k] - A[i] + x
    A[i] = x

def query(S, A, lo, hi, k):
    s = 0
    i = lo
    # Section 1 (sum from array A itself, up to the next slice boundary)
    while i % k != 0 and i <= hi:
        s += A[i]
        i += 1
    # Section 2 (sum whole slices directly from S, intermediate part)
    while i + k - 1 <= hi:
        s += S[i // k]
        i += k
    # Section 3 (sum from array A itself, ending part)
    while i <= hi:
        s += A[i]
        i += 1
    return s
Let us now determine the complexity.
Each query takes, on average:
Section 1 takes k/2 time on average (at most k - 1 iterations).
Section 2 takes at most n/k time, basically the number of slices.
Section 3 takes k/2 time on average (at most k - 1 iterations).
So, in total, we get k/2 + n/k + k/2 = k + n/k time.
This is minimized for k = sqrt(n): sqrt(n) + n/sqrt(n) = 2*sqrt(n).
So we get an O(sqrt(n)) time complexity per query.
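The choice k = sqrt(n) can be checked with a short derivation, treating k as a continuous variable (a simplifying assumption; in practice k is rounded to an integer):

```latex
f(k) = k + \frac{n}{k}, \qquad
f'(k) = 1 - \frac{n}{k^{2}} = 0 \;\Longrightarrow\; k = \sqrt{n}, \qquad
f''(k) = \frac{2n}{k^{3}} > 0, \qquad
f(\sqrt{n}) = \sqrt{n} + \frac{n}{\sqrt{n}} = 2\sqrt{n}.
```

Since the second derivative is positive for k > 0, this stationary point is the minimum.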
Prime numbers
As mentioned in some other answers, some basic things related to prime numbers take O(sqrt(n)) time:
Find number of divisors
Find sum of divisors
Find Euler's totient
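As a small illustration of the first two items, here is a Python sketch (function names are my own) that pairs each divisor i <= sqrt(n) with its cofactor n // i, so the loop runs only about sqrt(n) times:

```python
def count_divisors(n):
    """Number of positive divisors of n, using O(sqrt(n)) trial divisions."""
    count = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            # i and n // i are both divisors; count a square root only once
            count += 1 if i * i == n else 2
        i += 1
    return count

def sum_divisors(n):
    """Sum of positive divisors of n, using the same pairing."""
    total = 0
    i = 1
    while i * i <= n:
        if n % i == 0:
            total += i if i * i == n else i + n // i
        i += 1
    return total

print(count_divisors(36), sum_divisors(12))
```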
Below I mention two advanced algorithms which also bear sqrt(n) term in their complexity.
MO's Algorithm
try this problem: Powerful array
My solution:
#include <bits/stdc++.h>
using namespace std;
const int N = 1E6 + 10, k = 500;
struct node {
int l, r, id;
bool operator<(const node &a) {
if(l / k == a.l / k) return r < a.r;
else return l < a.l;
}
} q[N];
long long a[N], cnt[N], ans[N], cur_count;
void add(int pos) {
cur_count += a[pos] * cnt[a[pos]];
++cnt[a[pos]];
cur_count += a[pos] * cnt[a[pos]];
}
void rm(int pos) {
cur_count -= a[pos] * cnt[a[pos]];
--cnt[a[pos]];
cur_count -= a[pos] * cnt[a[pos]];
}
int main() {
int n, t;
cin >> n >> t;
for(int i = 1; i <= n; i++) {
cin >> a[i];
}
for(int i = 0; i < t; i++) {
cin >> q[i].l >> q[i].r;
q[i].id = i;
}
sort(q, q + t);
memset(cnt, 0, sizeof(cnt));
memset(ans, 0, sizeof(ans));
int curl(0), curr(0), l, r;
for(int i = 0; i < t; i++) {
l = q[i].l;
r = q[i].r;
/* This part takes O(n * sqrt(n)) time */
while(curl < l)
rm(curl++);
while(curl > l)
add(--curl);
while(curr > r)
rm(curr--);
while(curr < r)
add(++curr);
ans[q[i].id] = cur_count;
}
for(int i = 0; i < t; i++) {
cout << ans[i] << '\n';
}
return 0;
}
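For readers who prefer a compact version, the same idea can be sketched in Python. This is not a translation of the C++ above, just a minimal illustration under my own conventions: 0-based inclusive queries, a block size of about sqrt(n), and a window that is always expanded before it is shrunk.

```python
import math
from collections import defaultdict

def powerful_array(a, queries):
    """For each query (l, r), return the sum over values s in a[l..r]
    of count(s)**2 * s, answering the queries offline in Mo's order."""
    block = max(1, math.isqrt(len(a)))
    # Mo's ordering: by block of the left end, then by right end
    order = sorted(range(len(queries)),
                   key=lambda q: (queries[q][0] // block, queries[q][1]))
    cnt = defaultdict(int)
    state = {"sum": 0}

    def add(pos):
        v = a[pos]
        state["sum"] += v * (2 * cnt[v] + 1)   # (c+1)^2 - c^2 = 2c + 1
        cnt[v] += 1

    def remove(pos):
        v = a[pos]
        state["sum"] -= v * (2 * cnt[v] - 1)   # c^2 - (c-1)^2 = 2c - 1
        cnt[v] -= 1

    lo, hi = 0, -1                 # current window a[lo..hi] starts empty
    answers = [0] * len(queries)
    for q in order:
        l, r = queries[q]
        while hi < r:
            hi += 1
            add(hi)
        while lo > l:
            lo -= 1
            add(lo)
        while hi > r:
            remove(hi)
            hi -= 1
        while lo < l:
            remove(lo)
            lo += 1
        answers[q] = state["sum"]
    return answers

print(powerful_array([1, 2, 1], [(0, 1), (0, 2)]))
```

Sorting by block keeps the left pointer within one block per group of queries while the right pointer only moves forward inside a group, which is where the sqrt(n) factor comes from.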
Query Buffering
try this problem: Queries on a Tree
My solution:
#include <bits/stdc++.h>
using namespace std;
const int N = 2e5 + 10, k = 333;
vector<int> t[N], ht;
int tm_, h[N], st[N], nd[N];
inline void hei(int v, int p) {
for(int ch: t[v]) {
if(ch != p) {
h[ch] = h[v] + 1;
hei(ch, v);
}
}
}
inline void tour(int v, int p) {
st[v] = tm_++;
ht.push_back(h[v]);
for(int ch: t[v]) {
if(ch != p) {
tour(ch, v);
}
}
ht.push_back(h[v]);
nd[v] = tm_++;
}
int n, tc[N];
vector<int> loc[N];
long long balance[N];
vector<pair<long long,long long>> buf;
inline long long cbal(int v, int p) {
long long ans = balance[h[v]];
for(int ch: t[v]) {
if(ch != p) {
ans += cbal(ch, v);
}
}
tc[v] += ans;
return ans;
}
inline void bal() {
memset(balance, 0, sizeof(balance));
for(auto arg: buf) {
balance[arg.first] += arg.second;
}
buf.clear();
cbal(1,1);
}
int main() {
int q;
cin >> n >> q;
for(int i = 1; i < n; i++) {
int x, y; cin >> x >> y;
t[x].push_back(y); t[y].push_back(x);
}
hei(1,1);
tour(1,1);
for(int i = 0; i < ht.size(); i++) {
loc[ht[i]].push_back(i);
}
vector<int>::iterator lo, hi;
int x, y, type;
for(int i = 0; i < q; i++) {
cin >> type;
if(type == 1) {
cin >> x >> y;
buf.push_back(make_pair(x,y));
}
else if(type == 2) {
cin >> x;
long long ans(0);
for(auto arg: buf) {
hi = upper_bound(loc[arg.first].begin(), loc[arg.first].end(), nd[x]);
lo = lower_bound(loc[arg.first].begin(), loc[arg.first].end(), st[x]);
ans += arg.second * (hi - lo);
}
cout << tc[x] + ans/2 << '\n';
}
else assert(0);
if(i % k == 0) bal();
}
}
There are many cases.
These are a few problems which can be solved in sqrt(n) complexity [better may be possible as well].
Find if a number is prime or not.
Grover's algorithm: allows search (in a quantum context) on unsorted input in time proportional to the square root of the size of the input.
Factorization of a number.
There are many problems you will face that demand a sqrt(n) complexity algorithm.
As an answer to the second part:
sqrt(n) complexity means that if the input size to your algorithm is n, then it performs approximately sqrt(n) basic operations (like comparisons in the case of sorting). Then we can say that the algorithm has sqrt(n) time complexity.
Let's analyze the 3rd problem and it will be clear.
Let n be a positive integer, and suppose there exist two positive integers x and y such that
x*y=n;
Now we know that whatever the values of x and y, one of them will be at most sqrt(n). If both were greater than sqrt(n):
x>sqrt(n) y>sqrt(n) then x*y>sqrt(n)*sqrt(n) => n>n ---> contradiction.
So if we check 2 to sqrt(n), we will have all the factors considered (1 and n are trivial factors); each factor i found this way has the matching cofactor n/i.
Code snippet:
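#include <iostream>
using namespace std;
int main() {
    int n;
    cin >> n;
    cout << 1 << " " << n << " "; // trivial factors
    for (int i = 2; i * i <= n; i++) // same as i <= sqrt(n), without floating point
        if ((n % i) == 0)
            cout << i << " "; // n / i is the matching factor >= sqrt(n)
    return 0;
}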
Note: You might think that we can achieve the same behaviour by looping from 1 to n. Yes, that's possible, but who wants to run a program in O(n) when it can run in O(sqrt(n))? We always look for the best one.
Go through Cormen's book Introduction to Algorithms.
I also recommend reading the following Stack Overflow questions and answers; they should clear up any remaining doubts :)
Are there any O(1/n) algorithms?
Plain english explanation Big-O
Which one is better?
How do you calculte big-O complexity?
This link provides a very basic, beginner-level understanding of O(sqrt n) time complexity. It is the last example in the video, but I would suggest that you watch the whole video.
https://www.youtube.com/watch?v=9TlHvipP5yA&list=PLDN4rrl48XKpZkf03iYFl-O29szjTrs_O&index=6
The simplest example of an O(sqrt n) time complexity algorithm in the video is:
p = 0;
for(i = 1; p <= n; i++) {
p = p + i;
}
Mr. Abdul Bari is renowned for his simple explanations of data structures and algorithms.
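To see why that loop is O(sqrt n): after m iterations p equals 1 + 2 + ... + m = m(m+1)/2, and the loop stops once that exceeds n, so m is roughly sqrt(2n). A small Python sketch (the function name is mine) counts the iterations directly:

```python
import math

def count_iterations(n):
    """Count how many times the loop body runs for a given n."""
    p, i, iterations = 0, 1, 0
    while p <= n:
        p += i
        i += 1
        iterations += 1
    return iterations

for n in (100, 10_000, 1_000_000):
    print(n, count_iterations(n), round(math.sqrt(2 * n), 1))
```

The iteration count stays within one step of sqrt(2n) in each case.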
Primality test
Solution in JavaScript
const isPrime = n => {
    if (n < 2) return false; // 0 and 1 are not prime
    for (let i = 2; i <= Math.sqrt(n); i++) {
        if (n % i === 0) return false;
    }
    return true;
};
Complexity
O(N^(1/2)), because for a given value of n you only need to check whether it is divisible by the numbers from 2 to its square root.
JS Primality Test
O(sqrt(n))
A slightly more performant version, thanks to Samme Bae, for enlightening me with this. 😉
function isPrime(n) {
if (n <= 1)
return false;
if (n <= 3)
return true;
// Skip 4, 6, 8, 9, and 10
if (n % 2 === 0 || n % 3 === 0)
return false;
for (let i = 5; i * i <= n; i += 6) {
if (n % i === 0 || n % (i + 2) === 0)
return false;
}
return true;
}
isPrime(677);

Simulating a card game. degenerate suits

The title might be a bit cryptic, but I have a very specific problem. First, my current setup:
In my card simulator I deal 32 cards to 4 players in sets of 8, so 8 cards per player,
with the 4 standard suits (spades, hearts, etc.).
My current implementation cycles through all combinations of 8 out of 32,
which gives me a large number of possibilities.
Namely, the first player can be dealt 10518300 different hands.
The second can then be dealt 735471 different hands.
The third player can then be dealt 12870 different hands,
and finally the fourth can be dealt only 1,
giving me a grand total of 9.9561092e+16 different unique ways to deal a deck of 32 cards to 4 players, if the order of the cards doesn't matter.
On a 4 GHz processor, even with 1 tick per possibility, it would take me about 288 days (roughly nine months).
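These counts are easy to double-check with a few lines of Python (a quick sanity check, not part of the simulator; the 4e9 possibilities per second figure is the question's own assumption):

```python
import math

hands = [math.comb(32, 8), math.comb(24, 8), math.comb(16, 8), math.comb(8, 8)]
total = math.prod(hands)            # equals 32! / (8!)^4
seconds = total / 4e9               # one possibility per tick at 4 GHz
days = seconds / 86400

print(hands)
print(total)
print(round(days))
```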
However, I would like to simplify this dealing of cards by treating diamonds, hearts and spades as interchangeable, meaning that dealing 8 hearts to player 1 is equivalent to dealing 8 spades (note that this doesn't apply to clubs).
I am looking for a way to generate this, because it will cut down the possibilities of the first hand by up to a factor of 6 (the 3! permutations of the three interchangeable suits). My current implementation is in C++,
but feel free to answer in a different language.
/** http://stackoverflow.com/a/9331125 */
unsigned cjasMain::nChoosek( unsigned n, unsigned k )
{
//assert(k < n);
if (k > n) return 0;
if (k * 2 > n) k = n-k;
if (k == 0) return 1;
int result = n;
for( int i = 2; i <= k; ++i ) {
result *= (n-i+1);
result /= i;
}
return result;
}
/** [combination c n p x]
* get the [x]th lexicographically ordered set of [r] elements in [n]
* output is in [c], and should be sizeof(int)*[r]
* http://stackoverflow.com/a/794 */
void cjasMain::Combination(int8_t* c,unsigned n,unsigned r, unsigned x){
++x;
assert(x>0);
int i,p,k = 0;
for(i=0;i<r-1;i++){
c[i] = (i != 0) ? c[i-1] : 0;
do {
c[i]++;
p = nChoosek(n-c[i],r-(i+1));
k = k + p;
} while(k < x);
k = k - p;
}
c[r-1] = c[r-2] + x - k;
}
/**http://stackoverflow.com/a/9430993 */
template <unsigned n,std::size_t r>
void cjasMain::Combinations()
{
static_assert(n>=r,"error n needs to be larger then r");
std::vector<bool> v(n);
std::fill(v.begin() + r, v.end(), true);
do
{
for (int i = 0; i < n; ++i)
{
if (!v[i])
{
COUT << (i+1) << " ";
}
}
static int j=0;
COUT <<'\t'<< j++<< "\n";
}
while (std::next_permutation(v.begin(), v.end()));
return;
}
A requirement is that from the lexicographical number I can get back the original array.
I hope even the slightest optimization can help my Monte Carlo simulation.

Determine Position of Most Significant Set Bit in a Byte

I have a byte I am using to store bit flags. I need to compute the position of the most significant set bit in the byte.
Example Byte: 00101101 => 6 is the position of the most significant set bit
Compact Hex Mapping:
[0x00] => 0x00
[0x01] => 0x01
[0x02,0x03] => 0x02
[0x04,0x07] => 0x03
[0x08,0x0F] => 0x04
[0x10,0x1F] => 0x05
[0x20,0x3F] => 0x06
[0x40,0x7F] => 0x07
[0x80,0xFF] => 0x08
TestCase in C:
#include <stdio.h>
unsigned char check(unsigned char b) {
unsigned char c = 0x08;
unsigned char m = 0x80;
do {
if(m&b) { return c; }
else { c -= 0x01; }
} while(m>>=1);
return 0; // reached only when b == 0 (no bit set)
}
int main() {
unsigned char input[256] = {
0x00,0x01,0x02,0x03,0x04,0x05,0x06,0x07,0x08,0x09,0x0a,0x0b,0x0c,0x0d,0x0e,0x0f,
0x10,0x11,0x12,0x13,0x14,0x15,0x16,0x17,0x18,0x19,0x1a,0x1b,0x1c,0x1d,0x1e,0x1f,
0x20,0x21,0x22,0x23,0x24,0x25,0x26,0x27,0x28,0x29,0x2a,0x2b,0x2c,0x2d,0x2e,0x2f,
0x30,0x31,0x32,0x33,0x34,0x35,0x36,0x37,0x38,0x39,0x3a,0x3b,0x3c,0x3d,0x3e,0x3f,
0x40,0x41,0x42,0x43,0x44,0x45,0x46,0x47,0x48,0x49,0x4a,0x4b,0x4c,0x4d,0x4e,0x4f,
0x50,0x51,0x52,0x53,0x54,0x55,0x56,0x57,0x58,0x59,0x5a,0x5b,0x5c,0x5d,0x5e,0x5f,
0x60,0x61,0x62,0x63,0x64,0x65,0x66,0x67,0x68,0x69,0x6a,0x6b,0x6c,0x6d,0x6e,0x6f,
0x70,0x71,0x72,0x73,0x74,0x75,0x76,0x77,0x78,0x79,0x7a,0x7b,0x7c,0x7d,0x7e,0x7f,
0x80,0x81,0x82,0x83,0x84,0x85,0x86,0x87,0x88,0x89,0x8a,0x8b,0x8c,0x8d,0x8e,0x8f,
0x90,0x91,0x92,0x93,0x94,0x95,0x96,0x97,0x98,0x99,0x9a,0x9b,0x9c,0x9d,0x9e,0x9f,
0xa0,0xa1,0xa2,0xa3,0xa4,0xa5,0xa6,0xa7,0xa8,0xa9,0xaa,0xab,0xac,0xad,0xae,0xaf,
0xb0,0xb1,0xb2,0xb3,0xb4,0xb5,0xb6,0xb7,0xb8,0xb9,0xba,0xbb,0xbc,0xbd,0xbe,0xbf,
0xc0,0xc1,0xc2,0xc3,0xc4,0xc5,0xc6,0xc7,0xc8,0xc9,0xca,0xcb,0xcc,0xcd,0xce,0xcf,
0xd0,0xd1,0xd2,0xd3,0xd4,0xd5,0xd6,0xd7,0xd8,0xd9,0xda,0xdb,0xdc,0xdd,0xde,0xdf,
0xe0,0xe1,0xe2,0xe3,0xe4,0xe5,0xe6,0xe7,0xe8,0xe9,0xea,0xeb,0xec,0xed,0xee,0xef,
0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7,0xf8,0xf9,0xfa,0xfb,0xfc,0xfd,0xfe,0xff };
unsigned char truth[256] = {
0x00,0x01,0x02,0x02,0x03,0x03,0x03,0x03,0x04,0x04,0x04,0x04,0x04,0x04,0x04,0x04,
0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,0x05,
0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,
0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,0x06,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,0x07,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,
0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08,0x08};
int i,r;
int f = 0;
for(i=0; i<256; ++i) {
r=check(input[i]);
if(r !=(truth[i])) {
printf("failed %d : 0x%x : %d\n",i,0x000000FF & ((int)input[i]),r);
f += 1;
}
}
if(!f) { printf("passed all\n"); }
else { printf("failed %d\n",f); }
return 0;
}
I would like to simplify my check() function to not involve looping (or branching preferably). Is there a bit twiddling hack or hashed lookup table solution to compute the position of the most significant set bit in a byte?
Your question is about an efficient way to compute log2 of a value. And because you seem to want a solution that is not limited to the C language I have been slightly lazy and tweaked some C# code I have.
You want to compute log2(x) + 1, and for x = 0 (where log2 is undefined) you define the result as 0 (i.e. you create a special case where log2(0) = -1).
static readonly Byte[] multiplyDeBruijnBitPosition = new Byte[] {
7, 2, 3, 4,
6, 1, 5, 0
};
public static Byte Log2Plus1(Byte value) {
if (value == 0)
return 0;
var roundedValue = value;
roundedValue |= (Byte) (roundedValue >> 1);
roundedValue |= (Byte) (roundedValue >> 2);
roundedValue |= (Byte) (roundedValue >> 4);
var log2 = multiplyDeBruijnBitPosition[((Byte) (roundedValue*0xE3)) >> 5];
return (Byte) (log2 + 1);
}
This bit twiddling hack is taken from Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup where you can see the equivalent C source code for 32 bit values. This code has been adapted to work on 8 bit values.
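For anyone who wants to try the 8-bit variant quickly, here is a Python port of the C# function (my own port; the table and the 0xE3 multiplier are copied from the code above, and the & 0xFF emulates the cast back to a byte). It can be checked against int.bit_length() for all 256 inputs:

```python
DE_BRUIJN_POSITION = [7, 2, 3, 4, 6, 1, 5, 0]

def log2_plus1(value):
    """Position (1..8) of the most significant set bit of a byte, 0 for 0."""
    if value == 0:
        return 0
    v = value
    v |= v >> 1          # smear the top set bit downwards,
    v |= v >> 2          # giving a value of the form 2**k - 1
    v |= v >> 4
    index = ((v * 0xE3) & 0xFF) >> 5   # top 3 bits of the 8-bit product
    return DE_BRUIJN_POSITION[index] + 1

print(log2_plus1(0b00101101))
```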
However, you may be able to use a built-in function that gives you the result very efficiently (on many CPUs it maps to a single instruction such as Bit Scan Reverse). An answer to the question Bit twiddling: which bit is set? has some information about this. A quote from the answer provides one possible reason why there is low level support for solving this problem:
Things like this are the core of many O(1) algorithms such as kernel schedulers which need to find the first non-empty queue signified by an array of bits.
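As a rough sketch of the built-in route in C, assuming GCC or Clang: `__builtin_clz` is a compiler extension rather than standard C, and its result is undefined for an argument of 0, so zero is handled separately. The function name `msb_position` is made up for this example.

```c
#include <limits.h>

/* Returns the 1-based position of the most significant set bit of b,
   or 0 when b is 0. Assumes GCC or Clang for __builtin_clz. */
static unsigned msb_position(unsigned char b)
{
    if (b == 0)
        return 0; /* __builtin_clz(0) is undefined, so special-case it */

    /* Count leading zeros in the full unsigned int, then subtract from
       its width to get the position of the highest set bit. */
    unsigned width = (unsigned)(sizeof(unsigned) * CHAR_BIT);
    return width - (unsigned)__builtin_clz((unsigned)b);
}
```

This still contains one test for zero, so it is not strictly branch-free as written; whether the compiler emits an actual branch for it depends on the target and optimisation level.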
That was a fun little challenge. I don't know if this one is completely portable since I only have VC++ to test with, and I certainly can't say for sure if it's more efficient than other approaches. This version was coded with a loop but it can be unrolled without too much effort.
static unsigned char check(unsigned char b)
{
unsigned char r = 8;
unsigned char sub = 1;
unsigned char s = 7;
for (char i = 0; i < 8; i++)
{
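sub = sub & (((b >> s) & 1) - 1); // drops to 0 at the first set bit and stays 0
s--; // decremented in its own statement: reading and modifying s in one expression is undefined behaviour in C and in C++ before C++17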
r -= sub;
}
return r;
}
I'm sure everyone else has long since moved on to other topics but there was something in the back of my mind suggesting that there had to be a more efficient branch-less solution to this than just unrolling the loop in my other posted solution. A quick trip to my copy of Warren put me on the right track: Binary search.
Here's my solution based on that idea:
Pseudo-code:
// see if there's a bit set in the upper half
if ((b >> 4) != 0)
{
offset = 4;
b >>= 4;
}
else
offset = 0;
// see if there's a bit set in the upper half of what's left
if ((b & 0x0C) != 0)
{
offset += 2;
b >>= 2;
}
// see if there's a bit set in the upper half of what's left
if ((b & 0x02) != 0)
{
offset++;
b >>= 1;
}
return b + offset;
Branch-less C++ implementation:
static unsigned char check(unsigned char b)
{
unsigned char adj = 4 & ((((unsigned char) - (b >> 4) >> 7) ^ 1) - 1);
unsigned char offset = adj;
b >>= adj;
adj = 2 & (((((unsigned char) - (b & 0x0C)) >> 7) ^ 1) - 1);
offset += adj;
b >>= adj;
adj = 1 & (((((unsigned char) - (b & 0x02)) >> 7) ^ 1) - 1);
return (b >> adj) + offset + adj;
}
Yes, I know that this is all academic :)
It is not possible in plain C. The best I would suggest is the following implementation of check. Although it is quite "ugly", I think it runs faster than the check version in the question.
int check(unsigned char b)
{
if(b&128) return 8;
if(b&64) return 7;
if(b&32) return 6;
if(b&16) return 5;
if(b&8) return 4;
if(b&4) return 3;
if(b&2) return 2;
if(b&1) return 1;
return 0;
}
Edit: I found a link to the actual code: http://www.hackersdelight.org/hdcodetxt/nlz.c.txt
The algorithm below is named nlz8 in that file. You can choose your favorite hack.
/*
From last comment of: http://stackoverflow.com/a/671826/315052
> Hacker's Delight explains how to correct for the error in 32-bit floats
> in 5-3 Counting Leading 0's. Here's their code, which uses an anonymous
> union to overlap asFloat and asInt: k = k & ~(k >> 1); asFloat =
> (float)k + 0.5f; n = 158 - (asInt >> 23); (and yes, this relies on
> implementation-defined behavior) - Derrick Coetzee Jan 3 '12 at 8:35
*/
unsigned char check (unsigned char b) {
union {
float asFloat;
int asInt;
} u;
unsigned k = b & ~(b >> 1);
u.asFloat = (float)k + 0.5f;
return 32 - (158 - (u.asInt >> 23));
}
Edit: I'm not exactly sure what the asker means by language independent, but below is the equivalent code in Python.
import ctypes
class Anon(ctypes.Union):
_fields_ = [
("asFloat", ctypes.c_float),
("asInt", ctypes.c_int)
]
def check(b):
k = int(b) & ~(int(b) >> 1)
a = Anon(asFloat=(float(k) + float(0.5)))
return 32 - (158 - (a.asInt >> 23))

g++ SSE intrinsics dilemma - value from intrinsic "saturates"

I wrote a simple program that uses SSE intrinsics to compute the inner product of two large (100000 or more elements) vectors. The program compares the execution time of the inner product computed the conventional way with that of the version using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop before the statement that computes the inner product. Before I go further, here is the code:
//this is a sample Intrinsics program to compute inner product of two vectors and compare Intrinsics with traditional method of doing things.
#include <iostream>
#include <iomanip>
#include <xmmintrin.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
using namespace std;
typedef float v4sf __attribute__ ((vector_size(16)));
double innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume len1 = len2.
float result = 0.0;
for(int i = 0; i < len1; i++) {
for(int j = 0; j < len1; j++) {
result += (arr1[i] * arr2[i]);
}
}
//float y = 1.23e+09;
//cout << "y = " << y << endl;
return result;
}
double sse_v4sf_innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume that len1 = len2.
if(len1 != len2) {
cout << "Lengths not equal." << endl;
exit(1);
}
/*steps:
* 1. load a long-type (4 float) into a v4sf type data from both arrays.
* 2. multiply the two.
* 3. multiply the same and store result.
* 4. add this to previous results.
*/
v4sf arr1Data, arr2Data, prevSums, multVal, xyz;
//__builtin_ia32_xorps(prevSums, prevSums); //making it equal zero.
//can explicitly load 0 into prevSums using loadps or storeps (Check).
float temp[4] = {0.0, 0.0, 0.0, 0.0};
prevSums = __builtin_ia32_loadups(temp);
float result = 0.0;
for(int i = 0; i < (len1 - 3); i += 4) {
for(int j = 0; j < len1; j++) {
arr1Data = __builtin_ia32_loadups(&arr1[i]);
arr2Data = __builtin_ia32_loadups(&arr2[i]); //store the contents of two arrays.
multVal = __builtin_ia32_mulps(arr1Data, arr2Data); //multiply.
xyz = __builtin_ia32_addps(multVal, prevSums);
prevSums = xyz;
}
}
//prevSums will hold the sums of 4 32-bit floating point values taken at a time. Individual entries in prevSums also need to be added.
__builtin_ia32_storeups(temp, prevSums); //store prevSums into temp.
cout << "Values of temp:" << endl;
for(int i = 0; i < 4; i++)
cout << temp[i] << endl;
result += temp[0] + temp[1] + temp[2] + temp[3];
return result;
}
int main() {
clock_t begin, end;
int length = 100000;
float *arr1, *arr2;
double result_Conventional, result_Intrinsic;
// printStats("Allocating memory.");
arr1 = new float[length];
arr2 = new float[length];
// printStats("End allocation.");
srand(time(NULL)); //init random seed.
// printStats("Initializing array1 and array2");
begin = clock();
for(int i = 0; i < length; i++) {
// for(int j = 0; j < length; j++) {
// arr1[i] = rand() % 10 + 1;
arr1[i] = 2.5;
// arr2[i] = rand() % 10 - 1;
arr2[i] = 2.5;
// }
}
end = clock();
cout << "Time to initialize array1 and array2 = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
// printStats("Finished initialization.");
// printStats("Begin inner product conventionally.");
begin = clock();
result_Conventional = innerProduct(arr1, length, arr2, length);
end = clock();
cout << "Time to compute inner product conventionally = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
// printStats("End inner product conventionally.");
// printStats("Begin inner product using Intrinsics.");
begin = clock();
result_Intrinsic = sse_v4sf_innerProduct(arr1, length, arr2, length);
end = clock();
cout << "Time to compute inner product with intrinsics = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
//printStats("End inner product using Intrinsics.");
cout << "Results: " << endl;
cout << " result_Conventional = " << result_Conventional << endl;
cout << " result_Intrinsics = " << result_Intrinsic << endl;
return 0;
}
I use the following g++ invocation to build this:
g++ -W -Wall -O2 -pedantic -march=i386 -msse intrinsics_SSE_innerProduct.C -o innerProduct
The loops above, in both functions, perform a total of N^2 multiply-adds. Given that arr1 and arr2 (the two floating point vectors) are both filled with the value 2.5 and the length of each array is 100,000, the result in both cases should be 6.25e+10. The results I get are:
Results:
result_Conventional = 6.25e+10
result_Intrinsics = 5.36871e+08
This is not all. It seems that the value returned from the function that uses intrinsics "saturates" at the value above. I tried putting other values for the elements of the array and different sizes too. But it seems that any value above 1.0 for the array contents and any size above 1000 meets with the same value we see above.
Initially, I thought it might be because all operations within SSE are in floating point, but floating point should be able to store a number that is of the order of e+08.
I am trying to see where I could be going wrong but cannot seem to figure it out. I am using g++ version: g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2).
Any help on this is most welcome.
Thanks,
Sriram.
The problem that you are having is that while a float can store 6.25e+10, it only has about 7 significant decimal digits of precision (a 24-bit significand).
This means that when you build a large number by adding lots of small numbers together a bit at a time, you reach a point where the small number is less than half the spacing between adjacent floats at the magnitude of the running total, so adding it has no effect. Here each of the four SSE lanes stops growing at 2^27 = 134,217,728, where adjacent floats are 16 apart and adding 6.25 rounds back to the same value; 4 x 2^27 = 536,870,912 is the 5.36871e+08 you are seeing.
As to why you are not getting this behaviour in the non-intrinsic version, it is likely that the result variable is being held in a register which uses a higher precision than the actual storage of a float, so it is not being truncated to the precision of a float on every iteration of the loop. You would have to look at the generated assembler code to be sure.
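To see the effect in isolation, here is a small sketch in C, separate from the SSE code in the question. The function names are made up for this example. The `volatile` qualifier is there only to force the running sum to be rounded to float after every addition, which is what happens in each SSE lane.

```c
#include <stddef.h>

/* Adds x to a float accumulator n times. volatile forces the sum to be
   stored as a 32-bit float after each addition, so no extra register
   precision can hide the rounding. */
static float sum_in_float(size_t n, float x)
{
    volatile float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum = sum + x;
    return sum;
}

/* Same loop with a double accumulator, which has enough precision to
   keep absorbing small addends at this magnitude. */
static double sum_in_double(size_t n, float x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x;
    return sum;
}
```

With x = 6.25f (that is, 2.5 * 2.5) and n = 25,000,000, the float version should stop at 134217728 (2^27), while the double version reaches the exact total of 156250000. Accumulating in a double, or keeping several partial sums and combining them at the end, are the usual ways around this.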