Haskell: list/vector/array performance tuning - optimization

I am trying out Haskell to compute partition functions of models in statistical physics. This involves traversing quite large lists of configurations and summing various observables - which I would like to do as efficiently as possible.
The current version of my code is here: https://gist.github.com/2420539
Some strange things happen when trying to choose between lists and vectors to enumerate the configurations; in particular, to truncate the list, using V.toList . V.take (3^n) . V.fromList (where V is Data.Vector) is faster than just using take, which feels a bit counter-intuitive. In both cases the list is evaluated lazily.
The list itself is built using iterate; if instead I use Vectors as much as possible and build the list by using V.iterateN, again it becomes slower ...
My question is, is there a way (other than splicing V.toList and V.fromList at random places in the code) to predict which one will be the quickest? (BTW, I compile everything using ghc -O2 with the current stable version.)

Vectors are strict and have O(1) slicing (e.g. take). They also have optimized insert and delete. So you will sometimes see performance improvements by switching data structures on the fly. However, that is usually the wrong approach; keeping all the data in one form or the other is better. (And you're using UArrays as well, which further confuses the issue.)
General rules:
If the data is large and is transformed only in bulk fashion, a dense, efficient structure like a vector makes sense.
If the data is small and is traversed linearly and rarely, then lists make sense.
Remember that operations on lists and vectors have different complexity: while iterate . replicate on lists is O(n) but lazy, the same on vectors will not necessarily be as efficient (you should prefer the built-in methods in vector to generate arrays).
Vectors should generally be better for numerical operations. You may have to use different functions than you would with lists.
I would stick to vectors only. Avoid UArrays, and avoid lists except as generators.


Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are one-hot encoded.
I want to use PCA (principal component analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to apply PCA to just some columns of the dataframe, keep the other columns as they are, and then use a machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones and then use the other, non-merged features alongside the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
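To make the "merge some columns, keep the rest" idea concrete, here is a minimal sketch in plain NumPy. The function name pca_on_columns, the column layout and the random data are all made up for illustration; PCA is done by hand via an SVD of the centered columns. In scikit-learn the same thing can be arranged with a ColumnTransformer that applies PCA to the chosen columns and uses remainder="passthrough" for the others.

```python
import numpy as np

def pca_on_columns(X, pca_cols, n_components):
    """Replace the columns listed in pca_cols by their first n_components
    principal components; all other columns are passed through unchanged."""
    X = np.asarray(X, dtype=float)
    pca_set = set(pca_cols)
    keep_cols = [c for c in range(X.shape[1]) if c not in pca_set]

    block = X[:, pca_cols]
    centered = block - block.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T

    # PCA scores first, untouched columns after them.
    return np.hstack([scores, X[:, keep_cols]])

rng = np.random.default_rng(0)
# Hypothetical data: 6 one-hot columns followed by 2 numeric columns.
one_hot = np.eye(6)[rng.integers(0, 6, size=100)]
numeric = rng.normal(size=(100, 2))
X = np.hstack([one_hot, numeric])

reduced = pca_on_columns(X, pca_cols=list(range(6)), n_components=3)
print(reduced.shape)  # (100, 5)
```

The result can be handed to any clustering routine. Note that the passed-through columns are not rescaled here, so in practice you would probably want to standardize them so that they are comparable to the PCA scores.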
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.

In CUDA programming, is atomic function faster than reducing after calculating the intermediate results?

Atomic functions (such as atomicAdd) are widely used for counting or performing summation/aggregation in CUDA programming. However, I cannot find information about the speed of atomic functions compared with ordinary global memory reads and writes.
Consider the following task, where we want to calculate a floating-point array with 256K elements. Each element is the sum of 1000 intermediate variables, which are calculated first. One approach is to use atomicAdd, while another is to use a 256K*1000 temporary array for the intermediate results and then reduce this array (by summation).
Is the first approach, using the atomic function, faster than the second?
In your specific case, even without you providing a concrete program, one does not need to know anything about the difference in latency or in bandwidth between atomic and non-atomic operations to rule out both your approaches: They are both quite inefficient.
You should have single blocks handling single output variables (or a small number of output variables), so that the sum of each 1,000 intermediate variables is not performed via global memory. You may want to read the "classic" presentation by Mark Harris:
Optimizing Parallel Reduction in CUDA
to get the basics. There have been improvements over this in recent years, due to newer hardware capabilities. For a more recent actual implementation, see the CUB library's block reduction primitive.
Also relevant: CUDA: how to sum all elements of an array into one number within the GPU?
If you implement it this way, each output element will only be written to once. And even if the computation of the 1,000 intermediates somehow needs to be distributed among multiple blocks (for a reason you have not shared in the question), you should still distribute it over far fewer than 1,000 blocks, so that the global-memory writes of the result take up a small enough fraction of the total computation time that it is not worth bothering with anything other than an atomic addition.
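To illustrate the access pattern (this is not CUDA code), here is a small sequential Python model of what one block would do: hold the 1,000 intermediates in a local buffer standing in for shared memory, combine them pairwise in a tree, and write a single result. The names block_reduce and reduce_all are made up for this sketch; a real kernel would run the inner loop in parallel across threads with a __syncthreads() barrier between rounds, or simply use CUB's block reduction.

```python
def block_reduce(values):
    """Sequential model of a tree reduction performed by one thread block."""
    buf = list(values)  # stands in for the block's shared memory
    # Pad to a power of two so every round halves the active range cleanly.
    n = 1
    while n < len(buf):
        n *= 2
    buf += [0.0] * (n - len(buf))

    stride = n // 2
    while stride > 0:
        # On a GPU, each tid is a separate thread and these adds run in parallel.
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        # A barrier between rounds would go here in a real kernel.
        stride //= 2
    return buf[0]

def reduce_all(intermediates):
    """One 'block' per output element, so each output is written exactly once."""
    return [block_reduce(row) for row in intermediates]

# Hypothetical input: 4 outputs, each the sum of 1000 intermediates.
rows = [[float(i + j) for j in range(1000)] for i in range(4)]
out = reduce_all(rows)
print(out)
```

The point is that the 1,000 partial values never touch global memory: only the final sum per output does, which is why neither the atomics nor the large temporary array is needed.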

how are histograms constructed in sklearn's HistGradientBoostingClassifier to decide on best split point

Both lightgbm and sklearn's HistGradientBoostingClassifier estimators use histograms to decide on best splits for continuous features.
Is it possible to explain intuitively (or with an example) how the histograms are created and how they help in finding the split point at a node faster?
I have looked for answers extensively over the Internet but could not find any simple or intuitive way as to how histograms are constructed.
I am not sure, but it could be related to how (unique) regression trees are constructed in XGBoost. For a continuous feature, you construct a histogram, decide on the split (e.g. weight < 70kg), construct a regression tree and compute the Similarity score as well as the Gain. However, when the continuous feature takes a large number of distinct values, it is computationally expensive to try all the possible split values. In that case, XGBoost basically picks its candidate splits from the quantiles, which divide the observations into equally sized sets.
I guess sklearn's HistGradientBoostingClassifier might use the same optimization to come up with the best split.
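As a rough illustration of the idea, here is a minimal NumPy sketch of histogram-based split finding for a single feature. It is an assumption-laden toy, not the actual scikit-learn or LightGBM code: the function name, the choice of 16 bins, the regularization value lam and the XGBoost-style gain formula are all picked for illustration. The steps are: bin the feature once using quantiles, sum the gradients and hessians per bin, then evaluate only the bin boundaries as candidate splits.

```python
import numpy as np

def best_split_from_histogram(x, grad, hess, n_bins=16, lam=1.0):
    """Return (threshold, gain) of the best split 'x < threshold' among bin edges."""
    # 1. Quantile bin edges: n_bins - 1 interior edges, roughly equal counts per bin.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x, side="right")  # bin index in 0 .. n_bins - 1

    # 2. The histogram: per-bin sums of gradients and hessians.
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    # 3. Scan the n_bins - 1 boundaries; prefix sums give the left-child totals.
    g_left = np.cumsum(g_hist)[:-1]
    h_left = np.cumsum(h_hist)[:-1]
    g_tot, h_tot = g_hist.sum(), h_hist.sum()
    g_right, h_right = g_tot - g_left, h_tot - h_left

    gain = (g_left**2 / (h_left + lam)
            + g_right**2 / (h_right + lam)
            - g_tot**2 / (h_tot + lam))
    best = int(np.argmax(gain))
    return edges[best], gain[best]

# Hypothetical data: the target jumps from 0 to 1 at x = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = (x >= 0.5).astype(float)
pred = np.full_like(y, y.mean())  # initial prediction: the mean
grad = pred - y                   # gradient of squared loss
hess = np.ones_like(y)            # hessian of squared loss

threshold, gain = best_split_from_histogram(x, grad, hess)
print(threshold, gain)
```

The speed-up comes from step 3: instead of trying every distinct feature value (up to 1000 here), only 15 boundaries are evaluated, and building the histogram is a single pass over the samples.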

How to translate computation in index notation into sequence of SIMD ops in general case?

UPD: the question in its original form is poorly formulated because I badly confuse terminology (SIMD vs vectorized computations) and give too broad an example that does not specify exactly what the problem is; I voted to close it as "unclear what you're asking", and I'll link a better-formulated question above whenever it appears.
In mathematics, one would usually describe n-dimensional tensor computation using index notation, that would look something like:
A[i,j,k] = B[k,j] + C[d[k],i,B[k,j]] + d[k]*f[j] // for 0<i<N, 0<j<M, 0<k<K
but if we want to use any SIMD library to efficiently parallelize that computation (and take advantage of linear-algebraic magic), we would have to express it using primitives from BLAS, numpy, tensorflow, OpenCL, ..., which is often quite tricky.
Expressions in Einstein notation like A_ijk*B_kj are generally handled via np.einsum (using tensordot, sum and transpose, I guess?). Summation and other element-wise ops are also okay; "smart" indexing is quite tricky, though (especially if an index appears more than once in the expression).
I wonder if there are any language-agnostic libraries that take an expression in a certain form (let's say, the form above) and translate it into some intermediate representation that can be efficiently executed using existing linear-algebra libraries?
There are libraries that attempt to parallelize loop computations (the user API usually looks like #pragma in C++ or @numba.jit in Python), but I'm asking about a slightly different thing: translating an arbitrary expression of the form above into a finite sequence of SIMD commands, like elementwise-ops, matvecs, tensordots, etc.
If there are no language-agnostic solutions yet, I am personally interested in numpy computations :)
Further questions about the code:
I see B[k,j] is used as an index and as a value. Is everything integer? If not, which parts are FP, and where does the conversion happen?
Why does i not appear in the right hand side? Is the same data repeated N times?
Oh yikes, so you have a gather operation, with indices coming from d[k] and B[k,j]. Only a few SIMD instruction sets support this (e.g. AVX2).
I mostly manually vectorize stuff in C, with Intel's x86 intrinsics, (or auto-vectorization and check the compiler's asm output to make sure it didn't suck), so IDK if there's any kind of platform-independent way to express that operation.
I wouldn't expect that many cross-platform SIMD languages would provide a gather or anything built on top of a gather. I haven't used numpy though.
I don't expect you'd find a BLAS, LAPACK, or other library function that includes a gather, unless you go looking for implementations of this exact problem.
With an efficient gather (e.g. Intel Skylake or Xeon Phi), it might vectorize ok if you use SIMD in the loop over j, so you load a whole vector at once from B[], and from f[], and use it with a vector holding d[k] broadcast to every position. You probably want to store a transposed result matrix, like A[i][k][j], so the final store doesn't have to be a scatter. You definitely need to avoid looping over k in the inner-most loop, since that makes loads from B[] non-contiguous, and you have d[k] instead of f[j] varying inside the inner loop.
I haven't done much with GPGPU, but GPUs do SIMD differently. Instead of short vectors like CPUs use, they have effectively many scalar processors ganged together. OpenCL or CUDA or whatever other hot new GPGPU tech might handle your gathers much more efficiently.
"SIMD commands, like elementwise-ops, matvecs, tensordots, etc."
When I think of "SIMD commands", I think of x86 assembly instructions (or ARM NEON, or whatever), or at least C / C++ intrinsics that compile to single instructions. :P
A matrix-vector product is not a single "instruction". If you used that terminology, every function that processes a buffer would be "a SIMD instruction".
The last part of your question seems to be asking for a programming-language independent version of numpy, for gluing together high-performance library functions. Or were you thinking that there might be something that would inter-optimize such operations, so you could write something that would compile to a vectorized loop that did stuff like use each input more than once without having to reload it in separate library calls?
IDK if there's anything like that, other than normal C compiler auto-vectorization of loops over arrays.
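For the numpy case specifically, the expression from the question can be written with broadcasting plus integer-array ("fancy") indexing, which is numpy's form of a gather. The sketch below assumes that B and d are integer arrays (they are used both as values and as indices) and uses small made-up sizes N, M, K, D, V; it says nothing about how well the gather performs on the underlying hardware.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 3, 4, 5  # ranges of i, j, k
D, V = 6, 7        # hypothetical bounds on the values of d[k] and B[k, j]

B = rng.integers(0, V, size=(K, M))  # integer: used as a value and as an index
d = rng.integers(0, D, size=K)       # integer: used as a value and as an index
C = rng.normal(size=(D, N, V))
f = rng.normal(size=M)

# Index grids that broadcast against each other to shape (N, M, K).
i = np.arange(N)[:, None, None]
j = np.arange(M)[None, :, None]
k = np.arange(K)[None, None, :]

# Same expression as in the question; C[...] is the gather.
A = B[k, j] + C[d[k], i, B[k, j]] + d[k] * f[j]

# Reference implementation with explicit loops, to check the result.
A_ref = np.empty((N, M, K))
for ii in range(N):
    for jj in range(M):
        for kk in range(K):
            A_ref[ii, jj, kk] = (B[kk, jj]
                                 + C[d[kk], ii, B[kk, jj]]
                                 + d[kk] * f[jj])

print(A.shape, np.allclose(A, A_ref))
```

The index grids play the role of the loop variables, so the one-line expression reads almost exactly like the index notation. Each indexing step still materializes a temporary array, so this is a set of separate library calls rather than one fused loop.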

Where can I find several significant sorting algorithms tests cases?

I want to develop a very efficient sorting algorithm based on some ideas that I have. The problem is that I want to test my algorithm's efficiency against most of the highly regarded sorting algorithms that already exist.
Ideally I would like to find:
a large set of sorting tests that are SIGNIFICANT for measuring the efficiency of my algorithm
a large set of already existing and strongly-optimized sorting algorithms (with their code - no matter the language)
even better, software that provides adequate environment for sorting algorithms developers
Here's a post that I found earlier which contains 2 tables with comparisons between timsort, quicksort, dual-pivot quicksort and java 6 sort: http://blog.quibb.org/2009/10/sorting-algorithm-shootout/
I can see in those tables that those TXT files (from 1245.repeat.1000.txt through sequential.10000000.txt) contain the test cases for those algorithms, but I can't find the original TXT files anywhere!
Can anyone point me to any link with many sorting test-cases AND/OR many HIGHLY EFFICIENT sorting algorithms? (it's the test cases I am interested in the most, sorting algorithms are all over the internet)
Thank you very much in advance!
A few things:
Naive quicksort (first or last element as pivot) degrades to quadratic time on forward- and reverse-sorted lists, so the test set needs those as well as other list types.
Testing on random data is fine, but if you want to compare the performance of different algorithms, you cannot generate new random data every time or your results won't be reliable. I think you should try to come up with a pseudo-random generator that writes data in an order determined by the number of entries. That way the data generated for lists of size n, 10n and 100n will be similar.
Testing of sorting is not primarily about speed (until an algorithm has been finalized) but about the ratio of comparisons to entries. If one sort requires 15 comparisons per entry in a list and another requires 12 for the same list, the second will be more efficient even if it executed in twice the time. For the more trivial sorting concepts, the number of exchanges necessary will also come into play.
For testing, use a vector of integers in RAM. If the algorithm works well, the vector of integers can be translated to a vector of indices into a buffer containing the data to be compared. Such an algorithm would sort the vector of indices based on the data they point to.
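The points above can be sketched as a small test harness in Python. Everything here is illustrative: the names make_cases and count_comparisons, the seed and the four input patterns are assumptions, and the built-in sorted stands in for the algorithm under test. The generator is seeded, so the same inputs are produced on every run, and the harness counts comparisons rather than measuring time.

```python
import random
from functools import cmp_to_key

def make_cases(n, seed=12345):
    """Reproducible test inputs of size n; the same seed gives the same data."""
    rng = random.Random(seed)
    shuffled = [rng.randrange(n) for _ in range(n)]
    return {
        "random": shuffled,
        "sorted": sorted(shuffled),
        "reversed": sorted(shuffled, reverse=True),
        "few_unique": [v % 4 for v in shuffled],
    }

def count_comparisons(sort_fn, data):
    """Run sort_fn(list, key=...) and return (sorted list, comparison count)."""
    counter = [0]

    def cmp(a, b):
        counter[0] += 1
        return (a > b) - (a < b)

    # cmp_to_key routes every comparison the sort makes through cmp.
    result = sort_fn(list(data), key=cmp_to_key(cmp))
    return result, counter[0]

for name, data in make_cases(1000).items():
    result, comparisons = count_comparisons(sorted, data)
    assert result == sorted(data)  # check correctness before looking at cost
    print(f"{name:10s} {comparisons / len(data):6.2f} comparisons per entry")
```

To test your own algorithm, pass any function with the same (list, key=...) signature in place of sorted. Running make_cases for n, 10n and 100n with the same seed keeps the inputs comparable across sizes.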