Optimising away residual heap allocation in Julia

I ran julia --track-allocation=user prof.jl, which produced the following .mem file:
- using FixedSizeArrays
-
- immutable KernelVals{T}
- wavenumber::T
- vect::Vec{3,T}
- dist::T
- green::Complex{T}
- gradgreen::Vec{3,Complex{T}}
- end
-
- function kernelvals(k, x, y)
- r = x - y
0 R2 = r[1]*r[1]
0 R2 += r[2]*r[2]
0 R2 += r[3]*r[3]
0 R = sqrt(R2)
-
0 γ = im*k
0 expn = exp(-γ * R)
0 fctr = 1.0 / (4.0*pi*R)
0 green = fctr * expn
64 gradgreen = -(γ + 1/R) * green / R * r
-
0 KernelVals(k, r, R, green, gradgreen)
- end
-
- function payload()
- x = Vec{3,Float64}(0.47046262275611883,0.8745228524771103,-0.049820876498487966)
0 y = Vec{3,Float64}(-0.08977259509004082,0.543199687600189,0.8291184043296924)
0 k = 1.0
0 kv = kernelvals(k,x,y)
- return kv
- end
-
- function driver()
- println("Flush result: ", payload())
0 Profile.clear_malloc_data()
0 payload()
- end
-
- driver()
I cannot get rid of the final memory allocation, on the line starting gradgreen.... I ran @code_warntype kernelvals(...), which revealed no type instability or uncertainty.
The allocation pattern is identical on julia-0.4.6 and julia-0.5.0-pre.
This function will be the inner kernel in a boundary element method I am implementing. It will be called literally millions of times, so the gross memory allocation can grow to a multiple of the physical memory available to me.
The reason I am using FixedSizeArrays is to avoid allocations related to the creation of small Arrays.
The precise location where the allocation is reported depends very sensitively on the code. At some point the memory profiler was blaming 1/(4*pi*R) as the line triggering the allocation.
Any help or general tips on how to write code resulting in predictable allocation patterns is highly appreciated.
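A direct way to confirm the per-call cost is Base's @allocated macro (a minimal sketch; it assumes the definitions from the question are already loaded, and makes a first call beforehand so compilation is not counted):

x = Vec{3,Float64}(0.47046262275611883, 0.8745228524771103, -0.049820876498487966)
y = Vec{3,Float64}(-0.08977259509004082, 0.543199687600189, 0.8291184043296924)
kernelvals(1.0, x, y)                      # warm up: compile before measuring
println(@allocated kernelvals(1.0, x, y))  # bytes allocated by a single call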

After some experiments I finally managed to get rid of all allocations. The culprit turned out to be the promotion architecture as extended in FixedSizeArrays. Apparently multiplying a complex scalar and a real vector creates a temporary along the way.
Replacing the definition of gradgreen with
c = -(γ + 1/R) * green / R
gradgreen = Vec(c*r[1], c*r[2], c*r[3])
results in allocation-free runs. In my benchmark example, execution time came down from 6.5 seconds to 4.15 seconds, and total allocation from 4.5 GB to 1.4 GB.
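If the kernel ends up needing several such scalar-times-vector products, the same workaround can be pulled into a small helper (hypothetical name; a sketch assuming FixedSizeArrays' 3-component Vec as in the question):

# Componentwise scalar-vector product, sidestepping the promotion path that allocated
scalevec(c, r) = Vec(c*r[1], c*r[2], c*r[3])

# inside kernelvals:
# gradgreen = scalevec(-(γ + 1/R) * green / R, r)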
EDIT: Reported this issue to the FixedSizeArrays developers, who fixed it immediately (thank you!). The allocations disappeared completely.

Related

How can I reverse an easing function?

So, I have an easing function (stored as a lambda) that takes a factor between 0 and 1 and eases it accordingly. This is the cubic-out easing:
{ input -> 1 - (1 - input).pow(3.0) }
In my animation class, when getting the value, if the animation is currently expanding, it simply eases the factor, but when contracting, it subtracts the factor from 1 before easing it:
easing.ease(1f - factor)
What I want to do is to reverse the easing - like these two images (x represents the factor, t represents the time):
Expanding: [curve image]
Contracting: [curve image]
The full animation class can be found here, with the easings enum class in the same package. The easing happens in the getAnimationFactor function. (I know there's a useless if statement; I accidentally pushed it.)
Thanks in advance.
While there are programmatic ways to find the inverse of an arbitrary (monotonic) function, they're not simple or efficient. (I think you'd have to use a progressive approximation loop, which would be much slower and/or much less accurate.)
But you know the function! So you can use mathematical techniques to determine the inverse function, and then implement that in Kotlin.
For your first case:
            t = 1 - (1 - x)³
You can manipulate* that to give:
            x = 1 - ∛(1 - t)
…which could be implemented in Kotlin as:
            { 1 - (1 - it).pow(1.0/3) }
(The second case is left as an exercise for the reader :-)
(* The algebra is very straightforward, but I'll spell it all out here for those who aren't familiar with it:
            t = 1 - (1 - x)³
Subtract 1 from each side:
            t - 1 = - (1 - x)³
Multiply each side by -1:
            1 - t = (1 - x)³
Take the cube root of each side:
            ∛(1 - t) = 1 - x
Subtract 1 from each side:
            ∛(1 - t) - 1 = - x
Multiply each side by -1:
            1 - ∛(1 - t) = x
And switch sides:
            x = 1 - ∛(1 - t)
QED∎
The only step needing further explanation is taking the cube root of each side, which is OK here because every real number has exactly one real cube root — we're ignoring the complex ones! If it were a square root, or any other even-numbered root, it would only work for non-negative reals.)
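A quick numeric sanity check that the two lambdas really are inverses (sketched in Julia rather than Kotlin; the round trip should reproduce the input):

f(x)    = 1 - (1 - x)^3      # the cubic-out easing
finv(t) = 1 - cbrt(1 - t)    # the inverse derived above
for t in 0.0:0.25:1.0
    println((t, f(finv(t)))) # each pair should match up to rounding
end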

VBA - showing wrong results

I have come across an issue while working in VBA. I'm supposed to write a program that performs numerical integration using the trapezoidal rule (I'm not sure if that is what it's called in English) for the function 100*x^99, with lower limit 0 and upper limit 1. Cells(j,5) contains the numbers of subdivisions (10, 30, 100, 300, 1000, 3000, 10000). The code seems to work but gives wrong results; for those numbers of splits the results should be approximately:
10 - 5.000295129200607
30 - 1.786588019299606
100 - 1.0812206997600746
300 - 1.0091505687770146
1000 - 1.0008248693208752
3000 - 1.0000916650530287
10000 - 1.000008249986933
Function F(x)
    F = 100 * (x ^ 99)
End Function

Sub calka()
    Dim n As Single
    Dim xp As Single
    Dim dx As Single
    Dim xk As Single
    Dim ip As Single
    Dim pole As Single
    xp = 0
    xk = 1
    For j = 5 To 11
        n = Cells(j, 5)
        dx = (xk - xp) / n
        pole = 0
        For i = 1 To n - 1
            pole = pole + F(xp + i * dx)
        Next i
        pole = pole + ((F(xp) + F(xk)) / 2)
        pole = pole * dx
        Worksheets("Arkusz1").Cells(j, 7) = pole
    Next j
End Sub
I tried to implement the same code in Java and C++ and it worked flawlessly, but VBA always gives me wrong results. I'm not sure whether it rounds somewhere (and whether I can disable that in the settings) or whether my code is just not written right.
Apologies for the low clarity; it's hard for me to translate the mathematics into English.
Use Doubles rather than Singles
http://www.techrepublic.com/article/comparing-double-vs-single-data-types-in-vb6/
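To see how much the 32-bit Single type costs here, the same trapezoidal rule can be sketched in Julia (the language used in other questions on this page) and run once with Float32 and once with Float64 endpoints; the element type of the endpoints drives the precision of the whole computation. This is an illustration, not output taken from the thread:

# Trapezoidal rule for f on [a, b] with n subdivisions
function trapz(f, a, b, n)
    dx = (b - a) / n
    s = (f(a) + f(b)) / 2       # endpoint terms get half weight
    for i in 1:n-1
        s += f(a + i*dx)        # interior points get full weight
    end
    return s * dx
end

f(x) = 100 * x^99
println(trapz(f, 0.0f0, 1.0f0, 10000))  # Float32 arithmetic, comparable to VBA's Single
println(trapz(f, 0.0, 1.0, 10000))      # Float64 arithmetic, comparable to VBA's Double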

Coordinate Descent Algorithm in Julia for Least Squares not converging

As a warm-up to writing my own elastic net solver, I'm trying to get a fast enough version of ordinary least squares implemented using coordinate descent.
I believe I've implemented the coordinate descent algorithm correctly, but when I use the "fast" version (see below), the algorithm is insanely unstable, outputting regression coefficients that routinely overflow a 64-bit float when the number of features is of moderate size compared to the number of samples.
Linear Regression and OLS
If b = A*x, where A is a matrix, x is the vector of unknown regression coefficients, and b is the output, I want to find the x that minimizes
||b - Ax||^2
If A[j] is the jth column of A and A[-j] is A without column j, and the columns of A are normalized so that ||A[j]||^2 = 1 for all j, the coordinate-wise update is then
Coordinate Descent:
x[j] <-- A[j]^T * (b - A[-j] * x[-j])
I'm following along with these notes (page 9-10) but the derivation is simple calculus.
It's pointed out that instead of recomputing A[j]^T(b - A[-j] * x[-j]) all the time, a faster way to do it is with
Fast Coordinate Descent:
x[j] <-- A[j]^T*r + x[j]
where the total residual r = b - Ax is computed outside the loop over coordinates. The equivalence of these update rules follows from noting that Ax = A[j]*x[j] + A[-j]*x[-j], so A[j]^T*r + x[j] = A[j]^T*(b - A[-j]*x[-j]) - x[j]*||A[j]||^2 + x[j], and the last two terms cancel because the columns are normalized to ||A[j]||^2 = 1.
My problem is that while the second method is indeed faster, it's wildly numerically unstable for me whenever the number of features isn't small compared to the number of samples. I was wondering if anyone might have some insight as to why that's the case. I should note that the first method, which is more stable, still starts disagreeing with more standard methods as the number of features approaches the number of samples.
Julia code
Below is some Julia code for the two update rules:
function OLS_builtin(A,b)
    x = A\b
    return(x)
end

function OLS_coord_descent(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        for j = 1:P
            x[j] = dot(A[:,j], b - A[:,1:P .!= j]*x[1:P .!= j])
        end
    end
    return(x)
end

function OLS_coord_descent_fast(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j = 1:P
            x[j] += dot(A[:,j],r)
        end
    end
    return(x)
end
Example of the problem
I generate data with the following:
n = 100
p = 50
σ = 0.1
β_nz = float([i*(-1)^i for i in 1:10])
β = append!(β_nz,zeros(Float64,p-length(β_nz)))
X = randn(n,p); X .-= mean(X,1); X ./= sqrt(sum(abs2(X),1))
y = X*β + σ*randn(n); y .-= mean(y);
Here I use p=50, and I get good agreement between OLS_coord_descent(X,y) and OLS_builtin(X,y), whereas OLS_coord_descent_fast(X,y) returns exponentially large values for the regression coefficients.
When p is less than about 20, OLS_coord_descent_fast(X,y) agrees with the other two.
Conjecture
Since the methods agree in the regime p << n, I think the algorithm is formally correct but numerically unstable. Does anyone have any thoughts on whether this guess is correct, and if so, how to correct for the instability while retaining most of the performance gains of the fast version of the algorithm?
The quick answer: you forgot to update r after each x[j] update. The following is the fixed function, which behaves like OLS_coord_descent:
function OLS_coord_descent_fast(A,b)
    N,P = size(A)
    x = zeros(P)
    for cycle in 1:1000
        r = b - A*x
        for j = 1:P
            x[j] += dot(A[:,j],r)
            r -= A[:,j]*dot(A[:,j],r) # Add this line
        end
    end
    return(x)
end
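A quick check of the fix against the built-in solver, reusing the question's data generation (X, y, and the functions as defined above, in the same Julia-0.4-era vectorized style):

x_builtin = OLS_builtin(X, y)
x_fast    = OLS_coord_descent_fast(X, y)
println(maximum(abs(x_builtin - x_fast)))  # should now be small rather than overflowing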

Optimising Sparse Array Math

I have a sparse array: term_doc
Its size is 622256×715, with Float64 entries. It is very sparse:
Of its ~444,913,040 cells, only about 22,215 are normally nonempty.
Of the 622,256 rows, only 4,699 are occupied,
though all of the 715 columns are occupied.
The operation I would like to perform can be described as returning the row-normalized and column-normalized versions of this matrix.
The naive non-sparse version I wrote is:
function doUnsparseWay()
    gc() # Force garbage collection before I start (and periodically during). This uses a lot of memory
    term_doc
    N = term_doc./sum(term_doc,1)
    println("N done")
    gc()
    P = term_doc./sum(term_doc,2)
    println("P done")
    gc()
    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0
    N,P,term_doc
end
Running this:
julia> @time N,P,term_doc = doUnsparseWay()
outputs:
N done
P done
elapsed time: 30.97332475 seconds (14466 MB allocated, 5.15% gc time in 13 pauses with 3 full sweep)
It is fairly simple. It chews memory, and will crash if the garbage collection does not occur at the right times (thus I call it manually). But it is fairly fast.
I wanted to get it to work on the sparse matrix, so as not to chew up my memory, and because logically it is a faster operation -- fewer cells need operating on.
I followed suggestions from this post and from the performance page of the docs.
function doSparseWay()
    term_doc::SparseMatrixCSC{Float64,Int64}
    N = spzeros(size(term_doc)...)
    N::SparseMatrixCSC{Float64,Int64}
    for (doc,total_terms::Float64) in enumerate(sum(term_doc,1))
        if total_terms == 0
            continue
        end
        @fastmath @inbounds N[:,doc] = term_doc[:,doc]./total_terms
    end
    println("N done")
    P = spzeros(size(term_doc)...)'
    P::SparseMatrixCSC{Float64,Int64}
    gfs = sum(term_doc,2)[:]
    gfs::Array{Float64,1}
    nterms = size(term_doc,1)
    nterms::Int64
    term_doc = term_doc'
    @inbounds @simd for term in 1:nterms
        @fastmath @inbounds P[:,term] = term_doc[:,term]/gfs[term]
    end
    println("P done")
    P = P'
    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0
    N,P,term_doc
end
It never completes.
It gets as far as printing "N done",
but never prints "P done".
I have left it running for several hours.
How can I optimize it so it can complete in reasonable time?
Or if this is not possible, explain why.
First, you're making term_doc a global variable, which is a big problem for performance. Pass it as an argument, doSparseWay(term_doc::SparseMatrixCSC). (The type annotation at the beginning of your function does not do anything useful.)
You want to use an approach similar to the answer by walnuss:
function doSparseWay(term_doc::SparseMatrixCSC)
    I, J, V = findnz(term_doc)
    normI = sum(term_doc, 1)
    normJ = sum(term_doc, 2)
    NV = similar(V)
    PV = similar(V)
    for idx = 1:length(V)
        NV[idx] = V[idx]/normI[J[idx]]
        PV[idx] = V[idx]/normJ[I[idx]]
    end
    m, n = size(term_doc)
    sparse(I, J, NV, m, n), sparse(I, J, PV, m, n), term_doc
end
This is a general pattern: when you want to optimize something for sparse matrices, extract the I, J, V and perform all your computations on V.
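A tiny self-contained check of the pattern (Julia-0.4-era syntax to match the thread; full() converts back to dense just for printing):

td = sparse([1, 2, 2], [1, 1, 2], [2.0, 6.0, 4.0], 3, 3)
N, P, _ = doSparseWay(td)
println(full(N))  # each nonzero column of N sums to 1
println(full(P))  # each nonzero row of P sums to 1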

VB.NET doesn't round numbers correctly?

I'm testing the speed of some functions, so I made a test that runs the functions over and over again and stores the results in an array. I needed them sorted by the size of the array I randomly generate (100 elements). Merge sort to the rescue! I used this link to get me started.
The section of code I'm focusing on:
private void mergesort(int low, int high) {
    // check if low is smaller than high, if not then the array is sorted
    if (low < high) {
        // Get the index of the element which is in the middle
        int middle = low + (high - low) / 2;
        // Sort the left side of the array
        mergesort(low, middle);
        // Sort the right side of the array
        mergesort(middle + 1, high);
        // Combine them both
        merge(low, middle, high);
    }
}
which translated to VB.NET is
Private Sub mergesort(low As Integer, high As Integer)
    ' check if low is smaller than high, if not then the array is sorted
    If (low < high) Then
        ' Get the index of the element which is in the middle
        Dim middle As Integer = low + (high - low) / 2
        ' Sort the left side of the array
        mergesort(low, middle)
        ' Sort the right side of the array
        mergesort(middle + 1, high)
        ' Combine them both
        merge(low, middle, high)
    End If
End Sub
Of more importance the LOC that only matters to this question is
Dim middle As Integer = low + (high - low) / 2
In case you want to see how merge sort is going to run this baby, here is the trace of the (high, low) arguments:

high  low        high  low
100   0          10    0
50    0          6     4
25    0          5     4
12    0          12    7
6     0          10    7
3     0          8     7
2     0          :stackoverflow error:
The error comes from the fact that 7 + (8 - 7) / 2 = 8. You'll see 7 and 8 get passed into mergesort(low, middle), and then we infinite-loop. Earlier in the sort you see a comparison like this: at 5 and 4, 4 + (5 - 4) / 2 = 4, because 4 + (1) / 2 = 4.5 = 4. For 8 and 7, though, it's 7 + (1) / 2 = 7.5 = 8. Remember the numbers are cast to an int.
Maybe I'm just using a bad implementation of it or my casting is wrong, but my question is: shouldn't this be a red flag signaling something isn't right with the rounding that's occurring?
Without understanding the whole algorithm, note that VB.NET's / is different from C#'s /. The latter performs integer division by default; if you want to truncate decimal places in VB.NET as well, you have to use \.
Read: \ Operator
So I think this is what you want:
Dim middle As Int32 = low + (high - low) \ 2
You are correct in your diagnosis: there's something inconsistent with the rounding that's occurring, but this is entirely expected if you know where to look.
From the VB.NET documentation on the / operator:
Divides two numbers and returns a floating-point result.
This documentation explicitly states that, if x and y are integral types, x / y returns a Double. So 5 / 2 in VB.NET would be expected to be 2.5.
From the C# documentation on the / operator:
All numeric types have predefined division operators.
And further down the page:
When you divide two integers, the result is always an integer.
In the case of C#, if x and y are integers, x / y returns an integer (truncated toward zero). 5 / 2 in C# is expected to return 2. And when VB.NET's Double result is then assigned to an Integer variable, the conversion rounds to the nearest integer with ties going to the nearest even number (banker's rounding), which is exactly why 4.5 became 4 while 7.5 became 8.
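For comparison, Julia (used in several other questions on this page) makes the same distinction explicit: / on two integers always returns a Float64, and ÷ performs truncating integer division, playing the role of VB.NET's \ and C#'s integer /:

low, high = 7, 8
println(low + (high - low) / 2)  # 7.5: / promotes to Float64
println(low + (high - low) ÷ 2)  # 7:   ÷ truncates, giving the intended midpoint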