Doing efficient Numerics in Haskell - optimization

I was inspired by this post called "Only fast languages are interesting" to look at the problem he suggests (summing a couple of million numbers from a vector) in Haskell and compare with his results.
I'm a Haskell newbie, so I don't really know how to time correctly or how to do this efficiently; my first attempt at this problem was the following. Note that I'm not using random numbers in the vector, as I'm not sure how to do that in a good way. I'm also printing stuff in order to ensure full evaluation.
import System.TimeIt
import Data.Vector as V

vector :: IO (Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    print $ V.length vec
    return vec

sumit :: IO ()
sumit = do
    vec <- vector
    print $ V.sum vec

time = timeIt sumit
Loading this up in GHCi and running time tells me that it took about 0.22s to run for 3 million numbers and 2.69s for 30 million numbers.
Compared to the blog author's results of 0.02s and 0.18s in Lush, it's quite a lot worse, which leads me to believe this can be done in a better way.
Note: The above code needs the timeit package to run; cabal install timeit will get it for you.

First of all, realize that GHCi is an interpreter, and it's not designed to be very fast. To get more useful results you should compile the code with optimizations enabled. This can make a huge difference.
Also, for any serious benchmarking of Haskell code, I recommend using criterion. It uses various statistical techniques to ensure that you're getting reliable measurements.
I modified your code to use criterion and removed the print statements so that we're not timing the I/O.
import Criterion.Main
import Data.Vector as V

vector :: IO (Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    return vec

sumit :: IO Int
sumit = do
    vec <- vector
    return $ V.sum vec

main = defaultMain [bench "sumit" $ whnfIO sumit]
Compiling this with -O2, I get this result on a pretty slow netbook:
$ ghc --make -O2 Sum.hs
$ ./Sum
warming up
estimating clock resolution...
mean is 56.55146 us (10001 iterations)
found 1136 outliers among 9999 samples (11.4%)
235 (2.4%) high mild
901 (9.0%) high severe
estimating cost of a clock call...
mean is 2.493841 us (38 iterations)
found 4 outliers among 38 samples (10.5%)
2 (5.3%) high mild
2 (5.3%) high severe
benchmarking sumit
collecting 100 samples, 8 iterations each, in estimated 6.180620 s
mean: 9.329556 ms, lb 9.222860 ms, ub 9.473564 ms, ci 0.950
std dev: 628.0294 us, lb 439.1394 us, ub 1.045119 ms, ci 0.950
So I'm getting an average of just over 9 ms with a standard deviation of less than a millisecond. For the larger test case, I'm getting about 100ms.
Enabling optimizations is especially important when using the vector package, as it makes heavy use of stream fusion, which in this case is able to eliminate the data structure entirely, turning your program into an efficient, tight loop.
It may also be worthwhile to experiment with the new LLVM-based code generator by using the -fllvm option. It is apparently well suited for numeric code.
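To illustrate the fusion point concretely, here is a minimal sketch (my own, not code from the question or this answer; the name sumTo and the vector size are placeholders) that benchmarks a pure sum over an unboxed vector with criterion:

import Criterion.Main
import qualified Data.Vector.Unboxed as U

-- Sketch only: with -O2, replicate and sum fuse into a tight counting loop,
-- so no vector is ever allocated.
sumTo :: Int -> Int
sumTo n = U.sum (U.replicate n 10)

main :: IO ()
main = defaultMain [bench "unboxed sum" $ whnf sumTo 3000000]

Compile it the same way as above (ghc --make -O2, optionally with -fllvm); criterion then measures only the summation, with no I/O in the timed code.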

Your original file, uncompiled, then compiled without optimization, then compiled with a simple optimization flag:
$ runhaskell boxed.hs
3000000
30000000
CPU time: 0.35s
$ ghc --make boxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.34s
$ ghc --make -O2 boxed.hs
$ ./boxed
3000000
30000000
CPU time: 0.09s
Your file with import qualified Data.Vector.Unboxed as V instead of import qualified Data.Vector as V (Int is an unboxable type) --
first without optimization then with:
$ ghc --make unboxed.hs -o unoptimized
$ ./unoptimized
3000000
30000000
CPU time: 0.27s
$ ghc --make -O2 unboxed.hs
$ ./unboxed
3000000
30000000
CPU time: 0.04s
So, compile, optimize ... and where possible use Data.Vector.Unboxed
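For reference, unboxed.hs presumably looks something like the sketch below: the question's code with only the import swapped for the qualified unboxed module, and with main bound to timeIt sumit so it can be compiled and run (an assumption on my part, not necessarily the exact file used above):

-- unboxed.hs (a sketch; only the import differs from the boxed version)
import System.TimeIt
import qualified Data.Vector.Unboxed as V

vector :: IO (V.Vector Int)
vector = do
    let vec = V.replicate 3000000 10
    print $ V.length vec
    return vec

sumit :: IO ()
sumit = do
    vec <- vector
    print $ V.sum vec

main :: IO ()
main = timeIt sumit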

Try to use an unboxed vector, although I'm not sure whether it makes a noticeable difference in this case. Note also that the comparison is slightly unfair, because the vector package should optimize the vector away entirely (this optimization is called stream fusion).

If you use big enough vectors, unboxed vectors might become impractical. For me, pure (lazy) lists are quicker if the vector size is > 50000000:
import System.TimeIt
sumit :: IO ()
sumit = print . sum $ replicate 50000000 10
main :: IO ()
main = timeIt sumit
I get these times:
Unboxed Vectors:
CPU time: 1.00s
List:
CPU time: 0.70s
Edit: I've repeated the benchmark using criterion and making sumit pure. Code and results follow:
Code:
import Criterion.Main
sumit :: Int -> Int
sumit m = sum $ replicate m 10
main :: IO ()
main = defaultMain [bench "sumit" $ nf sumit 50000000]
Results:
warming up
estimating clock resolution...
mean is 7.248078 us (80001 iterations)
found 24509 outliers among 79999 samples (30.6%)
6044 (7.6%) low severe
18465 (23.1%) high severe
estimating cost of a clock call...
mean is 68.15917 ns (65 iterations)
found 7 outliers among 65 samples (10.8%)
3 (4.6%) high mild
4 (6.2%) high severe
benchmarking sumit
collecting 100 samples, 1 iterations each, in estimated 46.07401 s
mean: 451.0233 ms, lb 450.6641 ms, ub 451.5295 ms, ci 0.950
std dev: 2.172022 ms, lb 1.674497 ms, ub 2.841110 ms, ci 0.950
It looks like print makes a big difference, as is to be expected!

Related

denoise image segmentation (mode filtering?) - or how to vectorize this operation in numpy?

UPDATE: In my initial post I stupidly applied stats.mode patch-wise rather than along the axis of the patches. Fixing this increased my speed by a factor of 4. However, it's still slow and my original questions still stand: (1) Can I increase the speed? (2) Are there different/better/standard approaches to cleaning up noisy categorical data? Back to the post:
I have some image segmentation results that are noisy and I want to clean them up. My idea was to take the mode value over (3,3) patches. This code works, but it's too slow:
import numpy as np
from sklearn.feature_extraction import image
from scipy import stats

def _mode(a,axis=None):
    m,_=stats.mode(a,axis=axis)
    return m

def mode_smoothing(data,kernel=(3,3)):
    patches=image.extract_patches_2d(data,kernel)
    nb_patches=patches.shape[0]
    patches=patches.reshape(nb_patches,-1)
    return _mode(patches,1).reshape(int(np.sqrt(nb_patches)),-1)

""" original method (new version is ~ 5 times faster, but still slow)
def _mode(arr):
    m,_=stats.mode(arr,axis=None)
    return m

def mode_smoothing(data,kernel=(3,3)):
    patches=image.extract_patches_2d(data,kernel)
    nb_patches=patches.shape[0]
    w=int(np.sqrt(nb_patches))
    o=np.array([_mode(patches[p]) for p in range(nb_patches)])
    return o.reshape(w,-1)
"""
Questions:
Is there a way to do this that is much, much faster? Eliminate the for loop / vectorize in numpy? Port to C directly or use numba, etc.? I struggled to get something working along these paths.
Are there better / more standard methods for accomplishing denoising like this on categorical image data?
Here is a before/after example from the mode_smoothing method above
Below I present two answers to my question:
by expanding out my initial attempt into a function that numba can handle
using the above suggestion by Alex Alex, which I'll call "categorical smoothing" (is there a standard name for this method?)
I haven't written out a mathematical proof yet, but it appears this patch-wise mode smoothing is equivalent to the categorical smoothing for the correct choice of parameters. Both lead to a big speed boost, but the categorical-smoothing solution is cleaner, faster, and doesn't involve numba, so it wins.
NUMBA
import numpy as np
from numba import njit

@njit
def mode_smoothing(data,kernel=(3,3),step=(1,1),edges=False,high_value=False,center_boost=False):
    """ mode smoothing over patches

    Args:
        data<np.array>: numpy array (ndim=2)
        kernel<tuple[int]>: (height,width) of patch window
        step<tuple[int]>: (y-step,x-step)
        edges<bool>:
            - if true
              * include edge patches by taking mode over smaller patch window
              * the returned image will be the same shape as the input data
            - if false
              * only run over patches with the full kernel size
              * the returned image will be reduced in size by the radius of the kernel
        high_value<bool>:
            when there are multiple possible mode values choose the highest if true,
            otherwise choose the lowest value
        center_boost<int|bool>:
            if true, instead of using the pure mode value, increase the count on the center pixel

    Return:
        <np.array> of patch-wise mode values. shape may be different than input. see `edges` above
    """
    h,w=data.shape
    ry=int(kernel[0]//2)
    rx=int(kernel[1]//2)
    sy,sx=step
    _mode_vals=[]
    if edges:
        j0,j1=0,h
        i0,i1=0,w
    else:
        j0,j1=ry,h-ry
        i0,i1=rx,w-rx
    for j in range(j0,j1,sy):
        for i in range(i0,i1,sx):
            ap=data[
                max(j-ry,0):j+ry+1,
                max(i-rx,0):i+rx+1]
            cv=data[j,i]
            values=np.unique(ap)
            count=0
            for v in values:
                newcount=(ap==v).sum()
                if center_boost and (v==cv):
                    newcount+=center_boost
                if high_value:
                    test=newcount>=count
                else:
                    test=newcount>count
                if test:
                    count=newcount
                    mode_value=v
            _mode_vals.append(mode_value)
    return np.array(_mode_vals).reshape(j1-j0,i1-i0)
CATEGORICAL SMOOTHING
import numpy as np
from scipy.signal import convolve2d

KERNEL=np.ones((3,3))

def categorical_smoothing(data,nb_categories,kernel=KERNEL):
    # one-hot encode the categories along a new leading axis
    data=np.eye(nb_categories)[:,data]
    # smooth each category plane, then take the per-pixel argmax
    for i in range(nb_categories):
        data[i]=convolve2d(data[i],kernel,mode='same')
    return data.argmax(axis=0)
EQUIVALENCE/SPEED-CHECK
This is probably easy to prove but...
S=512
N=19
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=False)
kernel=np.ones((3,3))
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
print()
data=np.random.randint(0,N,(S,S))
%time o1=mode_smoothing(data,edges=True,center_boost=1)
kernel=np.ones((3,3))
kernel[1,1]=2
%time o2=categorical_smoothing(data,N,kernel=kernel)
print((o1==o2).all())
""" OUTPUT
CPU times: user 826 ms, sys: 0 ns, total: 826 ms
Wall time: 825 ms
CPU times: user 416 ms, sys: 7.83 ms, total: 423 ms
Wall time: 423 ms
True
CPU times: user 825 ms, sys: 3.78 ms, total: 829 ms
Wall time: 828 ms
CPU times: user 422 ms, sys: 3.91 ms, total: 426 ms
Wall time: 425 ms
True
"""

Fastest Cython implementation depends on computer?

I am converting a Python script to Cython and optimizing it for more speed. Right now I have two versions; on my desktop V2 is twice as fast as V1, but unfortunately on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference.
Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4, 64-bit, 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2, 64-bit, 8GB RAM
V1 - you can find the full code here. The only changes made are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx and using import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2D numpy array board to store data in go.pyx and a list comprehension in the get_board function in preprocessing.pyx to process the data. During the test no function is called from go.py; only the numpy array board is used.
V2 - you can find the full code here. Quite a lot has changed; below you can find a list with everything affecting this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
The 2D numpy array is replaced by:
cdef char board[ 362 ]
and the function get_board_feature in go.pyx replaces the numpy list comprehension:
cdef char get_board_feature( self, short location ):
    # return correct board feature value
    # 0 active player stone
    # 1 opponent stone
    # 2 empty location
    cdef char value = self.board[ location ]
    if value == EMPTY:
        return 2
    if value == self.player_current:
        return 0
    return 1
The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board(self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
       always refers to the current player and plane 1 to the opponent
    """
    cdef short location
    for location in range( 0, state.size * state.size ):
        tensor[ offSet + state.get_board_feature( location ), location ] = 1
    return offSet + 3
Please let me know if I should include any other information or run certain tests.
cmp/diff test:
The V2 go.c and preprocessing.c files are identical.
V1 does not generate a .c file to compare.
Update: compared the .so files.
The V2 go.so files are different:
goD.so goL.so differ: byte 473, line 1
The preprocessing.so files are identical; not sure what to think of that.
They are two different machines and behave differently. There's a reason why processor reviews use large benchmark suites. It could be said that the desktop CPU performs better on average, but execution times between two small but non-trivial pieces of code do not 'have' to favor the desktop CPU, and differences in execution times definitely do not have to follow any linear relationship. Performance always depends on a huge number of factors. Possible explanations include, but are not limited to, the smaller L1 and L2 caches on the desktop and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.
Generally it's a good idea to concentrate on using good algorithms and to identify and remove bottlenecks when optimizing performance. Staring at benchmarks between different machines will probably only cause a headache.

Logtalk method calls performance optimization

While playing with Logtalk, it seems my program takes longer to execute with a Logtalk object versus plain Prolog. I did a benchmark comparing the execution of a simple predicate in plain Prolog with the equivalent Logtalk object encapsulation below:
%%
% plain prolog predicate

plain_prolog_simple :-
    fail.

%%
% object encapsulation

:- object(logtalk_obj).

    :- public([simple/0]).

    simple :-
        fail.

:- end_object.

Here's what I get:
?- benchmark(plain_prolog_simple).
Number of repetitions: 500000
Total time calls: 0.33799099922180176 seconds
Average time per call: 6.759819984436035e-7 seconds
Number of calls per second: 1479329.3346604244
true.
?- benchmark(logtalk_obj::simple).
Number of repetitions: 500000
Total time calls: 2.950408935546875 seconds
Average time per call: 5.90081787109375e-6 seconds
Number of calls per second: 169468.0333888435
true.
We can see that the logtalk_obj::simple call is slower than the plain_prolog_simple call.
I use SWI-Prolog as the backend; I tried to set some Logtalk flags, without success.
Edit: the benchmark code samples can be found at https://github.com/koryonik/logtalk-experiments/tree/master/benchmarks
What's wrong? Why this performance difference? How can I optimize Logtalk method calls?
In a nutshell, you're benchmarking the Logtalk compilation of the ::/2 goal at the top-level INTERPRETER. That's a classic benchmarking error. Goals at the top level, be they plain Prolog goals, module explicitly-qualified predicate goals, or message-sending goals, are always going to be interpreted, i.e. compiled on the fly.
You get performance close to plain Prolog for message sending goals in compiled source files, which is the most common scenario. See the benchmarks example in the Logtalk distribution for a benchmarking solution that avoids the above trap.
The performance gap (between plain Prolog and Logtalk goals) depends on the chosen backend Prolog compiler. The gap is negligible with mature Prolog VMs (e.g. SICStus Prolog or ECLiPSe) when static binding is possible. Some Prolog VMs (e.g. SWI-Prolog) lack some optimizations, however, which can make the gap bigger, especially in tight loops.
P.S. Logtalk comes out of the box with a settings configuration for development, not for performance. See in particular the documentation on the optimize flag, which should be turned on for static binding optimizations.
UPDATE
Starting from the code in your repository, and assuming SWI-Prolog as backend compiler, try:
----- code.lgt -----

% plain prolog predicate

plain_prolog_simple :-
    fail.

% object encapsulation

:- object(logtalk_obj).

    :- public(simple/0).

    simple :-
        fail.

:- end_object.
--------------------
----- bench.lgt -----

% load the SWI-Prolog "statistics" library
:- use_module(library(statistics)).

:- object(bench).

    :- public(bench/0).

    bench :-
        write('Plain Prolog goal:'), nl,
        prolog_statistics:time({plain_prolog_simple}).
    bench :-
        write('Logtalk goal:'), nl,
        prolog_statistics:time(logtalk_obj::simple).
    bench.

:- end_object.
---------------------
Save both files and then start up Logtalk:
$ swilgt
...
?- set_logtalk_flag(optimize, on).
true.
?- {code, bench}.
% [ /Users/pmoura/Desktop/bench/code.lgt loaded ]
% (0 warnings)
% [ /Users/pmoura/Desktop/bench/bench.lgt loaded ]
% (0 warnings)
true.
?- bench::bench.
Plain Prolog goal:
% 2 inferences, 0.000 CPU in 0.000 seconds (69% CPU, 125000 Lips)
Logtalk goal:
% 2 inferences, 0.000 CPU in 0.000 seconds (70% CPU, 285714 Lips)
true.
The time/1 predicate is a meta-predicate. The Logtalk compiler uses the meta-predicate property to compile the time/1 argument. The {}/1 control construct is a Logtalk compiler bypass. It ensures that its argument is called as-is in the plain Prolog database.
A benchmarking trick that works with SWI-Prolog and YAP (possibly others) that provide a time/1 meta-predicate is to use this predicate with Logtalk's <</2 debugging control construct and the logtalk built-in object. Using SWI-Prolog as the backend compiler:
?- set_logtalk_flag(optimize, on).
...
?- time(true). % ensure the library providing time/1 is loaded
...
?- {code}.
...
?- time(plain_prolog_simple).
% 2 inferences, 0.000 CPU in 0.000 seconds (59% CPU, 153846 Lips)
false.
?- logtalk<<(prolog_statistics:time(logtalk_obj::simple)).
% 2 inferences, 0.000 CPU in 0.000 seconds (47% CPU, 250000 Lips)
false.
A quick explanation: the <</2 control construct compiles its goal argument before calling it. As the optimize flag is turned on and time/1 is a meta-predicate, its argument is fully compiled and static binding is used for the message sending, hence the same number of inferences as we get above. Thus, this trick allows you to do quick benchmarking at the top level for Logtalk message-sending goals.
Using YAP is similar but simpler as time/1 is a built-in meta-predicate instead of a library meta-predicate as in SWI-Prolog.
You can also make interpreters for object orientation that are quite fast. Jekejeke Prolog has a purely interpreted (::)/2 operator, and there is not much overhead as of now. This is the test code:
Jekejeke Prolog 3, Runtime Library 1.3.0
(c) 1985-2018, XLOG Technologies GmbH, Switzerland
?- [user].
plain :- fail.
:- begin_module(obj).
simple(_) :- fail.
:- end_module.
And these are some actual results. There is no such drastic difference between a plain call and an (::)/2 operator-based call. Under the hood, both predicate lookups are inline-cached:
?- time((between(1,500000,_), plain, fail; true)).
% Up 76 ms, GC 0 ms, Thread Cpu 78 ms (Current 06/23/18 23:02:41)
Yes
?- time((between(1,500000,_), obj::simple, fail; true)).
% Up 142 ms, GC 11 ms, Thread Cpu 125 ms (Current 06/23/18 23:02:44)
Yes
We still have some overhead, which might be removed in the future. It has to do with the fact that we still do a miniature rewrite for each (::)/2 call. But maybe this will go away; we are working on it.
Edit 23.06.2018: We have now a built-in between/3 and implemented already a few optimizations. The above figures show a preview of this new prototype which is not yet out.

Optimizing Python KD Tree Searches

Scipy (http://www.scipy.org/) offers two KD Tree classes; the KDTree and the cKDTree.
The cKDTree is much faster, but is less customizable and query-able than the KDTree (as far as I can tell from the docs).
Here is my problem:
I have a list of 3 million 2-dimensional (X,Y) points. I need to return all of the points within a distance of X units from every point.
With the KDTree, there is an option to do just this: KDTree.query_ball_tree(). It generates a list of lists of all the points within X units from every other point. HOWEVER, this list is enormous and quickly fills up my virtual memory (about 744 million items long).
Potential solution #1: Is there a way to parse this list into a text file as it is writing?
Potential solution #2: I have tried using a for loop (for every point in the list) and then finding that single point's neighbors within X units by employing: KDtree.query_ball_point(). HOWEVER: this takes forever as it needs to run the query millions of times. Is there a cKDTree equivalent of this KDTree tool?
Potential solution #3: Beats me, anyone else have any ideas?
From scipy 0.12 on, both KD Tree classes have feature parity. Quoting its announcement:
cKDTree feature-complete
Cython version of KDTree, cKDTree, is now feature-complete. Most
operations (construction, query, query_ball_point, query_pairs,
count_neighbors and sparse_distance_matrix) are between 200 and 1000
times faster in cKDTree than in KDTree. With very minor caveats,
cKDTree has exactly the same interface as KDTree, and can be used as a
drop-in replacement.
Try using KDTree.query_ball_point instead. It takes a single point, or array of points, and produces the points within a given distance of the input point(s).
You can use this function to perform batch queries. Give it, say, 100000 points at a time, and then write the results out to a file. Something like this:
BATCH_SIZE = 100000
for i in xrange(0, len(pts), BATCH_SIZE):
    neighbours = tree.query_ball_point(pts[i:i+BATCH_SIZE], X)
    # write neighbours to a file...

Moving beyond R's optim function

I am trying to use R to estimate a multinomial logit model with a manual specification. I have found a few packages that allow you to estimate MNL models here or here.
I've found some other writings on "rolling" your own MLE function here. However, from my digging around, all of these functions and packages rely on the internal optim function.
In my benchmark tests, optim is the bottleneck. Using a simulated dataset with ~16000 observations and 7 parameters, R takes around 90 seconds on my machine. The equivalent model in Biogeme takes ~10 seconds. A colleague who writes his own code in Ox reports around 4 seconds for this same model.
Does anyone have experience with writing their own MLE function or can point me in the direction of something that is optimized beyond the default optim function (no pun intended)?
If anyone wants the R code to recreate the model, let me know - I'll gladly provide it. I haven't provided it since it isn't directly relevant to the problem of optimizing the optim function, and to preserve space...
EDIT: Thanks to everyone for your thoughts. Based on a myriad of comments below, we were able to get R in the same ballpark as Biogeme for more complicated models, and R was actually faster for several smaller / simpler models that we ran. I think the long term solution to this problem is going to involve writing a separate maximization function that relies on a fortran or C library, but am certainly open to other approaches.
Have you tried the nlm() function already? I don't know if it's much faster, but it does improve speed. Also check the options: optim uses a slow algorithm as the default. You can gain a > 5-fold speedup by using the Quasi-Newton algorithm (method="BFGS") instead of the default. If you're not too concerned about the last digits, you can also set the tolerance levels of nlm() higher to gain extra speed.
f <- function(x) sum((x-1:length(x))^2)
a <- 1:5
system.time(replicate(500,
optim(a,f)
))
user system elapsed
0.78 0.00 0.79
system.time(replicate(500,
optim(a,f,method="BFGS")
))
user system elapsed
0.11 0.00 0.11
system.time(replicate(500,
nlm(f,a)
))
user system elapsed
0.10 0.00 0.09
system.time(replicate(500,
nlm(f,a,steptol=1e-4,gradtol=1e-4)
))
user system elapsed
0.03 0.00 0.03
Did you consider the material on the CRAN Task View for Optimization?
I am the author of the R package optimParallel, which could be helpful in your case. The package provides parallel versions of the gradient-based optimization methods of optim(). The main function of the package is optimParallel(), which has the same usage and output as optim(). Using optimParallel() can significantly reduce optimization times, as illustrated in the following figure (p is the number of parameters).
See https://cran.r-project.org/package=optimParallel and http://arxiv.org/abs/1804.11058 for more information.
FWIW, I've done this in C-ish, using OPTIF9. You'd be hard-pressed to go faster than that. There are plenty of ways for something to go slower, such as by running an interpreter like R.
Added: From the comments, it's clear that OPTIF9 is used as the optimizing engine. That means that most likely the bulk of the time is spent in evaluating the objective function in R. While it is possible that C functions are being used underneath for some of the operations, there still is interpreter overhead. There is a quick way to determine which lines of code and function calls in R are responsible for most of the time, and that is to pause it with the Escape key and examine the stack. If a statement costs X% of time, it is on the stack X% of the time. You may find that there are operations that are not going to C and should be. Any speedup factor you get this way will be preserved when you find a way to parallelize the R execution.