Lua Challenge: Can you improve the fannkuch implementation's performance? - optimization

Lua is currently the fastest scripting language out there, and it's not that much slower than C/C++ for some kinds of programs (on par when doing pidigits 1:1); however, Lua scores really badly in a few benchmarks against C/C++.
One of those is the fannkuch test (Indexed-access to tiny integer-sequence), where it scores a horrible 1:148:
-- The Computer Language Benchmarks Game
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall
local function fannkuch(n)
  local p, q, s, odd, check, maxflips = {}, {}, {}, true, 0, 0
  for i=1,n do p[i] = i; q[i] = i; s[i] = i end
  repeat
    -- Print max. 30 permutations.
    if check < 30 then
      if not p[n] then return maxflips end -- Catch n = 0, 1, 2.
      io.write(unpack(p)); io.write("\n")
      check = check + 1
    end
    -- Copy and flip.
    local q1 = p[1] -- Cache 1st element.
    if p[n] ~= n and q1 ~= 1 then -- Avoid useless work.
      for i=2,n do q[i] = p[i] end -- Work on a copy.
      for flips=1,1000000 do -- Flip ...
        local qq = q[q1]
        if qq == 1 then -- ... until 1st element is 1.
          if flips > maxflips then maxflips = flips end -- New maximum?
          break
        end
        q[q1] = q1
        if q1 >= 4 then
          local i, j = 2, q1 - 1
          repeat q[i], q[j] = q[j], q[i]; i = i + 1; j = j - 1; until i >= j
        end
        q1 = qq
      end
    end
    -- Permute.
    if odd then
      p[2], p[1] = p[1], p[2]; odd = false -- Rotate 1<-2.
    else
      p[2], p[3] = p[3], p[2]; odd = true -- Rotate 1<-2 and 1<-2<-3.
      for i=3,n do
        local sx = s[i]
        if sx ~= 1 then s[i] = sx-1; break end
        if i == n then return maxflips end -- Out of permutations.
        s[i] = i
        -- Rotate 1<-...<-i+1.
        local t = p[1]; for j=1,i do p[j] = p[j+1] end; p[i+1] = t
      end
    end
  until false
end
local n = tonumber(arg and arg[1]) or 1
io.write("Pfannkuchen(", n, ") = ", fannkuch(n), "\n")
So how could this be optimized? (Of course, as with any optimization, you have to measure your implementation to be sure it's actually faster.) You aren't allowed to alter the C core of Lua for this, or to use LuaJIT; it's about finding ways to optimize one of Lua's weak points.

Robert Gould > One of those is the fannkuch test (Indexed-access to tiny integer-sequence), where it scores a horrible 1:148
When you quote numbers from the benchmarks game, please show where those numbers come from so readers have some context.
In this case you seem to have taken numbers measured on the quad-core machine, where the fastest programs have been re-written to exploit multiple cores. Instead of looking at elapsed time, sort by CPU time and you'll see the ratio drop to 1:43.
Or look at the median and quartiles to get a better impression of how the set of C++ measurements compares to the set of Lua measurements.
Or there's a whole set of measurements where programs are forced to use just one core - Lua compared with C++ - and if you take a look at those Lua pi-digits programs you'll see that they use the C language GNU GMP library.

Related

What is Pseudo-polynomial complexity?

Yes, I've seen this answer - What is pseudopolynomial time? How does it differ from polynomial time? - but I still don't understand.
Why does the representation in bits make a difference only sometimes?
For this program for example
function isPrime(n):
    for i from 2 to n - 1:
        if (n mod i) = 0, return false
    return true
it says the complexity is not polynomial, because n requires only log n bits to write out, so the O(n) running time is really O(2^(number of bits)), i.e. exponential in the size of the input. But if I apply that reasoning to every other problem, then it could also be pseudopolynomial, right? (Unless I'm getting it all wrong here.) What makes this program so special that it is measured in the number of bits required to write out n?
You have linked to other questions where this is explained fairly well for someone who understands the concept, so here comes a very brief version.
for i from 2 to n - 1:
can be rewritten as
i = 2
while(i <= n - 1):
    if (n mod i) == 0:
        return false
    i = i + 1
Very often, we assume that the operations i <= n - 1, i = i + 1 and n mod i are O(1). But this is not necessarily true. It is usually true for small values, and on a 32-bit machine a "small value" is on the order of a billion.
A number that requires more than 32 bits to represent will take more time to perform operations on than a number that fits in 32 bits, and it will take even longer if it requires more than 64 bits.
In practice, this rarely matters.
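As a quick illustration of this (a small made-up measurement in Python, whose integers are arbitrary precision, so the effect is easy to see), multiplying larger and larger integers takes measurably longer:

import timeit

for bits in (1_000, 10_000, 100_000):
    x = (1 << bits) - 1                                   # an integer with `bits` one-bits
    t = timeit.timeit(lambda: x * x, number=1000)
    print("%6d bits: %.4f s for 1000 multiplications" % (bits, t))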
A very simple way to visualize this is to imagine that you get the task of implementing the common mathematical operations where the operands are represented as strings. Here is a simple Python function that takes two strings representing binary numbers and returns the sum as a string. It was quickly hacked together and assumes both strings have the same length; it can most likely be refined, but it demonstrates the point: the function adds two numbers, and it takes longer for longer numbers.
def binadd(a, b):
    # Add two binary numbers given as strings of '0'/'1' of equal length,
    # working digit by digit from the right and carrying as needed.
    carry = '0'
    result = list('0' * (len(a) + 1))          # the sum may need one extra digit
    for i in range(len(a) - 1, -1, -1):
        xor = '1' if (a[i] == '1') != (b[i] == '1') else '0'
        val = '1' if (xor == '1') != (carry == '1') else '0'
        carry = '1' if (carry == '1' and xor == '1') or (a[i] == '1' and b[i] == '1') else '0'
        result[i + 1] = val                    # +1: result is one digit longer than a and b
    result[0] = carry
    return ''.join(result)
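For example, adding 5 and 3 as three-bit strings:

print(binadd('101', '011'))   # prints '1000', i.e. 5 + 3 = 8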
What makes this program so special that it is measured in the number of bits required to write out n?
There's nothing special about this particular program, at least not theoretically. In practice it is special in the sense that determining whether a VERY big number is prime is a common problem. Or, to be more accurate, it would have been a much more common problem if there existed a very fast algorithm to do it. If it did, it would basically break encryption as we know it today.

Function call faster than on the fly calculation?

I am now seriously confused. I have a function creating a table with a random number of entries, and I tried two different methods to choose that number (which is somewhat weighted):
Method 1, separated function
local function n()
    local n = math.random()
    if n < .7 then return 0
    elseif n < .8 then return 1
    end
    return 2
end

local function final()
    for i = 1, n() do
        ...
    end
end
Method 2, direct calculation
local function final()
    local n = math.random()
    if n < .7 then n = 0
    elseif n < .8 then n = 1
    else n = 2
    end
    for i = 1, n do
        ...
    end
end
The problem is: for some reason, the first method performs 30% faster than the second. Why is this?
No, a call will never be faster than plainly inlining the code. All the difference in the first method is the extra work of setting up the call stack and dismantling it. The rest of the code, both in source and compiled form, is exactly the same, so it is only natural that "just the calculation" will be faster than "just the calculation + some extra work".
Your benchmark seems to be imprecise. For such a lightweight function, the for loop and the os.clock call will themselves take almost as much time as the function itself, so combined with os.clock's inherently low resolution and the small number of loops, your data is not really statistically significant and you're mostly seeing the results of random hiccups in your hardware. Use a better timer and increase the number of loops to at least 1000000.
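The same benchmarking idea, sketched in Python rather than Lua since the principle is language-independent (this is a hypothetical harness, not the asker's code): use a high-resolution timer and enough iterations that the measured work dominates the loop and timer overhead.

import random, time

def pick():                          # "separate function" version
    n = random.random()
    if n < .7: return 0
    elif n < .8: return 1
    return 2

LOOPS = 1_000_000

t0 = time.perf_counter()
for _ in range(LOOPS):
    pick()
t1 = time.perf_counter()

for _ in range(LOOPS):               # "inline" version of the same work
    n = random.random()
    if n < .7: n = 0
    elif n < .8: n = 1
    else: n = 2
t2 = time.perf_counter()

print("call:   %.3f s" % (t1 - t0))
print("inline: %.3f s" % (t2 - t1))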

Optimising Sparse Array Math

I have a sparse array: term_doc.
Its size is 622256x715 of Float64. It is very sparse:
Of its ~444,913,040 cells, only about 22,215 are normally nonempty.
Of the 622256 rows only 4,699 are occupied,
though all of the 715 columns are occupied.
The operation I would like to perform can be described as returning the row-normalized and column-normalized versions of this matrix.
The naive non-sparse version I wrote is:
function doUnsparseWay()
    gc() # Force garbage collection before I start (and periodically during). This uses a lot of memory.
    term_doc
    N = term_doc./sum(term_doc,1)
    println("N done")
    gc()
    P = term_doc./sum(term_doc,2)
    println("P done")
    gc()
    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0
    N,P,term_doc
end
Running this:
> @time N,P,term_doc = doUnsparseWay()
outputs:
N done
P done
elapsed time: 30.97332475 seconds (14466 MB allocated, 5.15% gc time in 13 pauses with 3 full sweep)
It is fairly simple.
It chews memory, and will crash if the garbage collection does not occur at the right times (thus I call it manually).
But it is fairly fast.
I wanted to get it to work on the sparse matrix,
so as not to chew up my memory,
and because logically it is a faster operation -- fewer cells need operating on.
I followed suggestions from this post and from the performance page of the docs.
function doSparseWay()
    term_doc::SparseMatrixCSC{Float64,Int64}
    N = spzeros(size(term_doc)...)
    N::SparseMatrixCSC{Float64,Int64}
    for (doc, total_terms::Float64) in enumerate(sum(term_doc,1))
        if total_terms == 0
            continue
        end
        @fastmath @inbounds N[:,doc] = term_doc[:,doc]./total_terms
    end
    println("N done")
    P = spzeros(size(term_doc)...)'
    P::SparseMatrixCSC{Float64,Int64}
    gfs = sum(term_doc,2)[:]
    gfs::Array{Float64,1}
    nterms = size(term_doc,1)
    nterms::Int64
    term_doc = term_doc'
    @inbounds @simd for term in 1:nterms
        @fastmath @inbounds P[:,term] = term_doc[:,term]/gfs[term]
    end
    println("P done")
    P = P'
    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0
    N, P, term_doc
end
It never completes.
It gets as far as printing "N done",
but never prints "P done".
I have left it running for several hours.
How can I optimize it so it can complete in reasonable time?
Or if this is not possible, explain why.
First, you're making term_doc a global variable, which is a big problem for performance. Pass it as an argument, doSparseWay(term_doc::SparseMatrixCSC). (The type annotation at the beginning of your function does not do anything useful.)
You want to use an approach similar to the answer by walnuss:
function doSparseWay(term_doc::SparseMatrixCSC)
    I, J, V = findnz(term_doc)
    normI = sum(term_doc, 1)
    normJ = sum(term_doc, 2)
    NV = similar(V)
    PV = similar(V)
    for idx = 1:length(V)
        NV[idx] = V[idx]/normI[J[idx]]
        PV[idx] = V[idx]/normJ[I[idx]]
    end
    m, n = size(term_doc)
    sparse(I, J, NV, m, n), sparse(I, J, PV, m, n), term_doc
end
This is a general pattern: when you want to optimize something for sparse matrices, extract the I, J, V and perform all your computations on V.
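For comparison, the same pattern in Python/SciPy (a sketch under the assumption that term_doc is a scipy.sparse matrix; the function name is mine, not from the question): pull out the coordinate triplets and divide only the stored values.

import numpy as np
from scipy import sparse

def normalize_rows_cols(term_doc):
    # Work only on the stored nonzeros, mirroring the findnz approach above.
    coo = term_doc.tocoo()
    I, J, V = coo.row, coo.col, coo.data
    col_sums = np.asarray(term_doc.sum(axis=0)).ravel()   # one sum per column
    row_sums = np.asarray(term_doc.sum(axis=1)).ravel()   # one sum per row
    NV = V / col_sums[J]             # each value divided by its column total
    PV = V / row_sums[I]             # each value divided by its row total
    m, n = term_doc.shape
    N = sparse.coo_matrix((NV, (I, J)), shape=(m, n))
    P = sparse.coo_matrix((PV, (I, J)), shape=(m, n))
    return N, P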

Fortran: efficient matrix-vector multiplication

I have a piece of code which is a significant bottleneck:
do s = 1,ns
  msum = 0.d0
  do k = 1,ns
    msum = msum + tm(k,s)*f(:,:,k)
  end do
  m(:,:,s) = msum
end do
This is a simple matrix-vector product m = tm*f, taken over the k index, for every x,y.
I thought about using a BLAS routine but I am not sure whether any of them allows multiplying along a specific dimension (k). Do any of you have any good advice?
Unfortunately you do not mention the actual shape of f, i.e. the number of x and y. Since you mention this piece of code is a bottleneck, you can and should drop msum, accumulate directly into m(:,:,s), and use the first term to initialize it instead of zeroing, e.g.
do s = 1, ns
  m(:,:,s) = tm(1,s)*f(:,:,1)
  do k = 2, ns
    m(:,:,s) = m(:,:,s) + tm(k,s)*f(:,:,k)
  end do
end do
Secondly, a more general approach:
There are ns summations of nK 2D matrices f(:,:,1:nK) by means of scalar factors that are stored in tm(:,1:ns). The goal is to store these sums in m(:,:,1:ns). Why not sum up element-wise w.r.t. x and y to exploit contiguous memory sections in the result? You already mentioned that you can redesign such that k is the first dimension in f, i.e. f(k,:,:).
Considering only the desired outcome, you ought to have ns 2D matrices m(:,:,1:ns) that are independent of each other (the outer loop remains as it is). Let's drop this dimension for a moment. The problem then becomes:
m(:,:) = \sum_{k=1}^{ns} tm_k * f_k(:,:)
We should thus sum over k, e.g. have f(k,:,:), to determine m(:,:) as follows (note that I am adding the outer loop over s again):
nK = size(f, 1) ! the "k"s
nX = size(f, 2) ! the "x"s
nY = size(f, 3) ! the "y"s
m = 0.d0
do s = 1, ns
  do ii = 1, nY
    ! m(:,ii,s) = transpose(f(:,:,ii)) * tm(:,s), i.e. the sum over k
    call DGEMV('T', nK, nX, &
               1.d0, f(:,:,ii), nK, tm(:,s), 1, &
               1.d0, m(:,ii,s), 1)
  end do !ii
end do !s
See the documentation of DGEMV for more details on its usage.
Of course, the above advice of excluding the first step of the loop to spare the initialization with zeros may be applied here as well.
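For comparison, the whole operation m(x,y,s) = sum_k tm(k,s)*f(k,x,y) collapses to a single matrix-matrix product once f is reshaped. A sketch in Python/NumPy (the sizes are made up; this only illustrates the reshaping idea, it is not Fortran advice):

import numpy as np

nK, nX, nY, nS = 8, 50, 60, 8            # made-up sizes
f = np.random.rand(nK, nX, nY)           # k moved to the first dimension
tm = np.random.rand(nK, nS)

# m[x, y, s] = sum_k tm[k, s] * f[k, x, y], done as one BLAS-backed matmul:
m = (tm.T @ f.reshape(nK, nX * nY)).reshape(nS, nX, nY).transpose(1, 2, 0)

# Loop-based reference to check the reshape bookkeeping.
m_ref = np.empty((nX, nY, nS))
for s in range(nS):
    m_ref[:, :, s] = np.tensordot(tm[:, s], f, axes=(0, 0))
assert np.allclose(m, m_ref)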

Optimization of "static" loops

I'm writing a compiled language for fun, and I've recently gotten on a kick for making my optimizing compiler very robust. I've figured out several ways to optimize some things, for instance: 2 + 2 is always 4, so we can do that math at compile time; if(false){ ... } can be removed entirely; etc. But now I've gotten to loops. After some research, I think that what I'm trying to do isn't exactly loop unrolling, but it is still an optimization technique. Let me explain.
Take the following code.
String s = "";
for(int i = 0; i < 5; i++){
s += "x";
}
output(s);
As a human, I can sit here and tell you that this is 100% of the time going to be equivalent to
output("xxxxx");
So, in other words, this loop can be "compiled out" entirely. It's not loop unrolling, but what I'm calling "fully static", that is, there are no inputs that would change the behavior of the segment. My idea is that anything that is fully static can be resolved to a single value; anything that relies on input or produces conditional output of course can't be optimized further. So, from the machine's point of view, what do I need to consider? What makes a loop "fully static"?
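A minimal sketch of that idea in Python (the helper and argument names are hypothetical, just to make "fully static" concrete): if every value the loop touches is a compile-time constant and the body is pure, the compiler can simply run the loop with an iteration budget and substitute the final state.

def try_fold_loop(init_env, var, start, cond, update, body, max_steps=100_000):
    # Evaluate `for var = start; cond; var = update(...)` whose body is a pure
    # function env -> env. Returns the final environment, or None if the
    # budget is exceeded (treated as "cannot decide at compile time").
    env = dict(init_env, **{var: start})
    steps = 0
    while cond(env):
        env = body(env)
        env[var] = update(env)
        steps += 1
        if steps > max_steps:
            return None
    return env

# The example above:  s = "";  for (i = 0; i < 5; i++) s += "x";
folded = try_fold_loop(
    {"s": ""}, "i", 0,
    cond=lambda e: e["i"] < 5,
    update=lambda e: e["i"] + 1,
    body=lambda e: {**e, "s": e["s"] + "x"},
)
print(folded["s"])    # "xxxxx"  ->  the loop can be replaced by output("xxxxx")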
I've come up with three types of loops that I need to figure out how to categorize: loops that will always end up with the same machine state after every run, regardless of inputs; loops that WILL NEVER complete; and loops that I can't figure out one way or the other. In the case that I can't figure it out (the loop conditionally changes how many times it will run based on dynamic inputs), I'm not worried about optimizing. Loops that are infinite will be a compile error/warning unless specifically suppressed by the programmer, and loops that are the same every time should just skip directly to putting the machine in the proper state, without looping.
The main case to optimize is of course the loop with static iterations, when all the function calls inside are also static. Determining if a loop has dynamic components is easy enough, and if it's not dynamic, I guess it has to be static. The thing I can't figure out is how to detect whether it's going to be infinite or not. Does anyone have any thoughts on this? I know this is a subset of the halting problem, but I feel it's solvable; the halting problem is only a problem because for some programs you simply can't tell whether they will run forever or not. I don't want to consider those cases; I just want to handle the cases where the loop WILL halt or WILL NOT halt, but first I have to distinguish between the three states.
This looks like a kind of symbolic solver that can be defined for several classes of loops, but not in general.
Let's restrict the requirements a bit: no number overflow, just for loops (a while can sometimes be transformed into a full for loop, except when using continue etc.), no breaks, no modification of the control variable inside the for loop.
for (var i = S; E(i); i = U(i)) ...
where E(i) and U(i) are expressions that can be symbolically manipulated. There are several classes that are relatively easy:
U(i) = i + CONSTANT : n-th cycle the value of i is S + n * CONSTANT
U(i) = i * CONSTANT : n-th cycle the value of i is S * CONSTANT^n
U(i) = i / CONSTANT : n-th cycle the value of i is S * CONSTANT^-n
U(i) = (i + CONSTANT) % M : n-th cycle the value of i is (S + n * CONSTANT) % M
and some other quite easy combinations (and some very difficult ones)
Determining whether the loop terminates is searching for n where E(i(n)) is false.
This can be done by some symbolic manipulation for a lot of cases, but there is a lot of work involved in making the solver.
E.g.
for(int i = 0; i < 5; i++),
i(n) = 0 + n * 1 = n, E(i(n)) => not(n < 5) =>
n >= 5 => stops for n = 5
for(int i = 0; i < 5; i--),
i(n) = 0 + n * -1 = -n, E(i(n)) => not(-n < 5) => -n >= 5 =>
n <= -5 - since n is a non-negative whole number this is never true - never stops
for(int i = 0; i < 5; i = (i + 1) % 3),
E(i(n)) => not(n % 3 < 5) => n % 3 >= 5 => this is never true => never stops
for(int i = 10; i + 10 < 500; i = i + 2 * i) =>
for(int i = 10; i < 480; i = 3 * i),
i(n) = 10 * 3^n,
E(i(n)) => not(10 * 3^n < 480) => 10 * 3^n >= 480 => 3^n >= 48 => n >= log3(48) => n >= 3.5... =>
since n is whole => it will stop for n = 4
For other cases it would be good if they can be transformed into the ones you can already solve...
Many tricks for symbolic manipulation come from the Lisp era and are not too difficult. Although the ones described (or variants of them) are the most common types in practice, there are many more difficult and/or impossible-to-solve scenarios.
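A sketch in Python of solvers for the two easiest classes above (additive and multiplicative update, integer arithmetic, no overflow); the function names are mine, and exact boundary cases may need an integer check rather than floating-point log/ceil:

import math

def trip_count_additive(start, step, bound):
    # for (i = start; i < bound; i += step): number of iterations,
    # or None if the loop never terminates.
    if start >= bound:
        return 0
    if step <= 0:
        return None                          # i never reaches the bound
    return math.ceil((bound - start) / step)

def trip_count_multiplicative(start, factor, bound):
    # for (i = start; i < bound; i *= factor), with start > 0.
    if start >= bound:
        return 0
    if factor <= 1:
        return None
    return math.ceil(math.log(bound / start, factor))

print(trip_count_additive(0, 1, 5))            # 5    (first example above)
print(trip_count_additive(0, -1, 5))           # None (second example: never stops)
print(trip_count_multiplicative(10, 3, 480))   # 4    (last example: ceil(log3 48))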