Fast way to count how often pairs appear in Clojure 1.4 - optimization

I need to count the number of times that particular pairs of integers occur in some process (a heuristic detecting similar images using locality sensitive hashing - integers represent images and the "neighbours" below are images that have the same hash value, so the count indicates how many different hashes connect the given pair of images).
The counts are stored as a map from (ordered) pair to count (matches below).
The input (nbrs-list below) is a list of lists of integers that are considered neighbours, where every distinct (ordered) pair in an inner list ("the neighbours") should be counted. So, for example, if nbrs-list is [[1,2,3],[10,8,9]] then the pairs are [1,2],[1,3],[2,3],[8,9],[8,10],[9,10].
The routine collect is called multiple times; the matches parameter accumulates results and nbrs-list is new on each call. The smallest number of neighbours (the size of an inner list) is 1 and the largest is ~1000. Each integer in nbrs-list occurs just once in any call to collect (which implies that, on each call, no pair occurs more than once) and the integers cover the range 0 - 1,000,000. So the size of nbrs-list is less than 1,000,000 (each value occurs just once and sometimes values occur in groups) but typically larger than 100,000, as most images have no neighbours.
I have pulled the routines out of a larger chunk of code, so they may contain small edit errors, sorry.
(defn- flatten-1
  [list]
  (apply concat list))

(defn- pairs
  ([nbrs]
   (let [nbrs (seq nbrs)]
     (if nbrs (pairs (rest nbrs) (first nbrs) (rest nbrs)))))
  ([nbrs a bs]
   (lazy-seq
     (let [bs (seq bs)]
       (if bs
         (let [b (first bs)]
           (cons (if (> a b) [[b a] 1] [[a b] 1]) (pairs nbrs a (rest bs))))
         (pairs nbrs))))))

(defn- pairs-map
  [nbrs]
  (println (count nbrs))
  (let [pairs-list (flatten-1 (pairs nbrs))]
    (apply hash-map pairs-list)))

(defn- collect
  [matches nbrs-list]
  (let [to-add (map pairs-map nbrs-list)]
    (merge-with + matches (apply (partial merge-with +) to-add))))
So the above code expands each set of neighbours to ordered pairs; creates a map from pairs to 1; then combines maps using addition of values.
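For reference, on the example input above this should produce something like the following (modulo the debug println output and map ordering):
(collect {} [[1 2 3] [10 8 9]])
; => {[1 2] 1, [1 3] 1, [2 3] 1, [8 9] 1, [8 10] 1, [9 10] 1}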
I'd like this to run faster. I don't see how to avoid the O(n^2) expansion of pairs, but I imagine I can at least reduce the constant overhead by avoiding so many intermediate structures. At the same time, I'd like the code to be fairly compact and readable...
Oh, and now I am exceeding the "GC overhead limit". So reducing memory use/churn is also a priority :o)
[Maybe this is too specific? I am hoping the lessons are general, and I haven't seen many posts about optimising "real" Clojure code. Also, I can profile etc., but my code seems so ugly that I am hoping there's an obvious, cleaner approach - particularly for the pairs expansion.]

I guess you want the frequency with which each pair occurs?
Try the function frequencies. It uses transients under the hood, which should avoid GC overhead.
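For example, a minimal sketch of that idea: expand each inner list into ordered pairs with a for comprehension (rather than your pairs function) and let frequencies do the counting.
(defn pair-counts [nbrs-list]
  (frequencies
    (for [nbrs nbrs-list
          :let [v (vec (sort nbrs))]
          i (range (count v))
          j (range (inc i) (count v))]
      [(v i) (v j)])))

(pair-counts [[1 2 3] [10 8 9]])
; => each of [1 2] [1 3] [2 3] [8 9] [8 10] [9 10] mapped to 1
To accumulate across calls, you could merge-with + the result into matches, as your collect does.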

(I hope I haven't misunderstood your question)
If you just want to count how many pairs there are in lists such as [[1,2,3],[8,9,10]], then:
(defn count-nbr-pairs [n]
  (/ (* n (dec n))
     2))

(defn count-pairs [nbrs-list]
  (apply + (map #(count-nbr-pairs (count %)) nbrs-list)))

(count-pairs [[1 2 3 4] [8 9 10]])
; => 9
This of course assumes that you don't need to remove duplicate pairs.

=> (require '[clojure.math.combinatorics :as c])
=> (map #(c/combinations % 2) [[1 2 3] [8 9 10]])
(((1 2) (1 3) (2 3)) ((8 9) (8 10) (9 10)))
It's a pretty small library; take a look at the source.
Performance-wise, you're looking at roughly the following numbers for your use case of 1K unique values under 1M:
=> (time (doseq [i (c/combinations (take 1000 (shuffle (range 1000000))) 2)]))
"Elapsed time: 270.99675 msecs"
That's including generating the target set, which takes about 100 ms on my machine.

The above suggestions seemed to help, but were insufficient. I finally got decent performance with:
Packing each pair into a single long value. This works because MAX_LONG > 1e12 and Long values compare by value (so they work well as hash keys, unlike long[2] arrays). This had a significant effect on lowering memory use compared to [n1 n2] vectors.
Using a TLongByteHashMap (a Trove primitive hash map) and mutating it.
Handling the pair expansion with nested doseq loops (or nested for loops when using immutable data structures).
Improving my locality-sensitive hash. A big part of the problem was that it was too weak, so it found too many neighbours - if the neighbours of a million images are ill-constrained then you get a million million pairs, which consumes a little too much memory...
The inner loop now looks like:
(doseq [i (range 1 (count nbrs))]
  (doseq [j (range i)]
    (let [pair (pack size (nth nbrs i) (nth nbrs j))]
      (if-let [value (.get matches pair)]
        (.put matches pair (byte (inc value)))
        (.put matches pair (byte 0))))))
where
(defn- pack
  [size n1 n2]
  (+ (* size n1) n2))

(defn- unpack
  [size pair]
  [(int (/ pair size)) (mod pair size)])
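For example, with size set to the number of possible image ids (1,000,000 here), packing and unpacking round-trip:
(pack 1000000 123 456789)
; => 123456789
(unpack 1000000 123456789)
; => [123 456789]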

Related

Kotlin - A function is marked as tail-recursive but no tail calls are found

I have a function that calculates the Fibonacci series with recursion.
fun fibonacciRecursive(n: Long): Long {
    if (n <= 1) {
        return n
    }
    return fibonacciRecursive(n - 1) + fibonacciRecursive(n - 2)
}
It works, but if the number is big then it takes a long time to execute. I've heard that Kotlin has a keyword, tailrec, which improves recursion by rewriting the function as a loop.
But when I add this keyword to the function, the compiler says "A function is marked as tail-recursive but no tail calls are found". Why can't I add tailrec?
Tail recursion is a type of recursion where at each level no calculation needs to be done after the recursive call - it just calls the deeper level, and the result at the deepest level is the result of the entire stack of calls.
In essence, you can forget about each intermediate level, and when you get to the "bottom", just take the result and give it immediately to the original caller.
In your case, you have to call the next level, wait for it, then call another, wait for it, and then sum the results - which means you do not have a tail-recursive implementation, since you still have to do something in the intermediate levels after the recursive calls.
The Fibonacci sequence (like all recursive solutions) can be computed iteratively with a regular loop, which does not require a large call stack that would eventually also slow you down. It would be good practice to try that implementation, though by definition this sequence is "hard" to compute and quickly requires many iterations. It can also give you a hint on how to re-write it as a tail recursion:
At each iteration n, given x_n and x_n-1, you set x_n-1 = x_n and x_n = x_n + x_n-1 (or something similar).
You do this for the requested number of iterations, at which point you immediately have the final result.
Here we don't need to "go back" to finish the calculation. So in your tail-recursive implementation, instead of passing only the iteration index, you need to pass x_n and x_n + x_n-1, so that at the final level you already have the sum - no need to "go back". Tail recursion is essentially recursion that can easily be transformed into a loop.
If you want to use tail recursion, you usually need to write your algorithm accordingly. In the case of Fibonacci, probably the easiest way to turn it into a tail-recursive function is to calculate upwards instead of downwards:
(Unfortunately I have no idea about Kotlin, so code below is in Common Lisp).
(defun fibo-slow (n)
  "The classic, non-tail-recursive way..."
  (if (<= n 1)
      n
      (+ (fibo-slow (- n 1))
         (fibo-slow (- n 2)))))

(defun up (n i-1 i-2 i)
  "Tail recursive 'kernel', passing the history as arguments."
  (if (= i n)
      (+ i-1 i-2)
      (up n (+ i-1 i-2) i-1 (+ i 1))))

(defun fibo-up (n)
  "The adapter for function up, so the interface remains the same as 'fibo-slow'."
  (case n
    (0 0)
    (1 1)
    (otherwise (up n 0 1 1))))
As you can see, in fibo-slow the expression in tail position is the addition (+), not the recursive call(s).

Lisp, iterating backwards

Is there a way (with loop or iterate, doesn't matter) to iterate over a sequence backwards?
Apart from (loop for i downfrom 10 to 1 by 1 do (print i)), which works with indexes and requires the length, or (loop for elt in (reverse seq)), which requires reversing the sequence (even worse than the first option).
For lists the easiest is (dolist (x (reverse list)) ..) or using the more efficient nreverse if the list can be modified.
For vectors an alternative is dotimes with index calculation, something like:
(let* ((vec #(1 2 3))
       (len (length vec)))
  (dotimes (i len)
    (print (aref vec (- len i 1)))))
Typically lists are iterated over from the start as each cons points to the next. Doing it from the back is inherently inefficient.
If you nevertheless have a list and want fast reverse or random access, an option is to coerce it to a vector using e.g. (coerce my-list 'array) and then access the elements using aref (or coerce to a simple-vector and use svref).
If you are the one building the list, consider creating an adjustable vector with fill-pointer (see make-array documentation) and then use vector-push-extend to add items. That gives fast random access from the beginning.
Iterate can do it:
(iterate (for x :in-sequence #(1 2 3) :downto 0)
  (princ x))
; => 321
As others have noted, this will be very inefficient if used on lists.

Comparison of lists in Lisp vs comparison of numbers (values and objects)

I am having trouble understanding how to compare numbers by value vs by address.
I have tried the following:
(setf number1 5)
(setf number2 number1)
(setf number3 5)
(setf list1 '(a b c d))
(setf list2 list1)
(setf list3 '(a b c d))
I then used the following predicate functions:
> (eq list1 list2)  T
> (eq list1 list3)  NIL
> (eq number1 number2)  T
> (eq number1 number3)  T
Why is it that with lists eq acts like it should (the pointers for list1 and list3 are different), yet for numbers it does not act like I think it should, since number1 and number3 should also have different addresses? So my question is: why doesn't this behave as I expect, and is there a way to compare the addresses of variables containing numbers rather than their values?
Equality Predicates in Common Lisp
how to compare numbers by value vs by address.
While there's a sense in which that question can be applied, that's not really the model that Common Lisp provides. Reading about the built-in equality predicates can help clarify the way in which objects are (implicitly) stored in memory.
EQ is generally what checks the "same address", but that's not how it's specified, and that's not exactly what it does, either. It "returns true if its arguments are the same, identical object; otherwise, returns false."
What does it mean to be the same identical object? For things like cons cells (from which lists are built), there's an object in memory somewhere, and eq checks whether two values are the same object. Note that for primitives like numbers, eq may return either true or false on two values with the same numeric value, since the implementation is free to make copies of them.
EQL is like eq, but it adds a few extra conditions for numbers and characters. Numbers of the same type and value are guaranteed to be eql, as are characters that represent the same character.
EQUAL and EQUALP are where things start to get more complex and you actually get something like element-wise comparison for lists, etc.
This specific case
Why is it that with lists eq acts like it should (the pointers for list1 and list3 are different), yet for numbers it does not act like I think it should, since number1 and number3 should also have different addresses? So my question is: why doesn't this behave as I expect, and is there a way to compare the addresses of variables containing numbers rather than their values?
The examples in the documentation for eq show that (eq 3 3), and thus (let ((x 3) (y 3)) (eq x y)), can return either true or false. The behavior you're observing now isn't the only possible one.
Also, note that in compiled code, constant values can be coalesced into one. That means that the compiler has the option of making the following return true:
(let ((x '(1 2 3))
      (y '(1 2 3)))
  (eq x y))
One of the problems is that testing it in one implementation in a specific setting does not tell you much. Implementations may behave differently when the ANSI Common Lisp specification allows it.
Do not assume that two numbers of the same value are EQ or not EQ. This is unspecified in Common Lisp. Use EQL or = to compare numbers.
Do not assume that two literal lists that look similar in their printed representation are EQ or not EQ. This is unspecified in Common Lisp for the general case.
For example:
A file with the following content:
(defvar *a* '(1 2 3))
(defvar *b* '(1 2 3))
If one now compiles and loads the file, it is unspecified whether (eq *a* *b*) is T or NIL. Common Lisp allows an optimizing compiler to detect that the lists have similar content, allocate only one list, and bind both variables to that same list.
An implementation might even save space when only parts of the lists have similar content. For example, for (a 1 2 3 4) and (b 1 2 3 4) the sublist (1 2 3 4) could be shared.
For code with a lot of list data, this could help save space both in code and in memory. Other implementations might not be that sophisticated. In interactive use, it is unlikely that an implementation will try to save space like that.
In the Common Lisp standard quite a bit of behavior is unspecified. It was expected that implementations with different goals might benefit from different approaches.

How to generate random elements depending on previous elements using Quickcheck?

I'm using QuickCheck to do generative testing in Clojure.
However, I don't know it well and often end up doing convoluted things. One thing that I need to do quite often is something like this:
generate a first prime number from a list of prime numbers (so far so good)
generate a second prime number which is smaller than the first one
generate a third prime number which is smaller than the first one
However I have no idea as to how to do that cleanly using QuickCheck.
Here's an even simpler, silly, example which doesn't work:
(prop/for-all [a (gen/choose 1 10)
               b (gen/such-that #(= a %) (gen/choose 1 10))]
  (= a b))
It doesn't work because a cannot be resolved (prop/for-all isn't like a let statement).
So how can I generate the three primes, with the condition that the latter two are smaller than the first one?
In test.check, gen/bind is the bind operator of the generator monad, so we can use it to build generators that depend on values produced by other generators.
For example, to generate pairs [a b] where we must have (>= a b) we can use this generator:
(def pair-gen
  (gen/bind (gen/choose 1 10)
            (fn [a]
              (gen/bind (gen/choose 1 a)
                        (fn [b]
                          (gen/return [a b]))))))
To satisfy ourselves:
(c/quick-check 10000
  (prop/for-all [[a b] pair-gen]
    (>= a b)))
gen/bind takes a generator g and a function f. It generates a value from g, let's call it x. gen/bind then returns the value of (f x), which must be a new generator. gen/return is a generator which only generates its argument (so above I used it to return the pairs).
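The same shape answers your original three-prime question. A rough sketch, assuming a small hypothetical primes vector to draw from: bind on the first prime, then generate a tuple of two primes restricted to the smaller ones.
(def primes [2 3 5 7 11 13 17 19 23 29])   ; hypothetical list of candidate primes

(def triple-gen
  (gen/bind (gen/elements (rest primes))   ; first prime; skip 2 so smaller primes exist
            (fn [a]
              (gen/bind (gen/tuple (gen/elements (filter #(< % a) primes))
                                   (gen/elements (filter #(< % a) primes)))
                        (fn [[b c]]
                          (gen/return [a b c]))))))

(c/quick-check 1000
  (prop/for-all [[a b c] triple-gen]
    (and (< b a) (< c a))))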

Clojure: sum of all the primes under 2000000

It's a Project Euler problem.
I learned from Fastest way to list all primes below N and implemented a Clojure version:
(defn get-primes [n]
  (loop [numbers (set (range 2 n))
         primes []]
    (let [item (first numbers)]
      (cond
        (empty? numbers) primes
        :else (recur (clojure.set/difference numbers (set (range item n item)))
                     (conj primes item))))))
used as follows:
(reduce + (get-primes 2000000))
but it is so slow...
I am wondering why. Can someone enlighten me?
This algorithm is not even correct: at each iteration except the final one it adds the value of (first numbers) at that point to primes, but there is no guarantee that it will in fact be a prime, since the set data structure in use is unordered. (This is also true of the Python original, as mentioned by its author in an edit to the question you link to.) So, you'd first need to fix it by changing (set (range ...)) to (into (sorted-set) (range ...)).
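A minimal sketch of that one-line fix (your function with only the sorted-set change):
(defn get-primes [n]
  (loop [numbers (into (sorted-set) (range 2 n))
         primes []]
    (let [item (first numbers)]
      (cond
        (empty? numbers) primes
        :else (recur (clojure.set/difference numbers (set (range item n item)))
                     (conj primes item))))))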
Even then, this is simply not a great algorithm for finding primes. To do better, you may want to write an imperative implementation of the Sieve of Eratosthenes using a Java array and loop / recur, or maybe a functional SoE-like algorithm such as those described in Melissa E. O'Neill's beautiful paper The Genuine Sieve of Eratosthenes.
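For reference, a rough sketch of the array-based sieve idea (a boolean array of composite flags driven by loop/recur; an illustration, not a tuned implementation):
(defn sieve-primes [n]
  (let [composite? (boolean-array n)]              ; composite?[i] is true once i has been crossed off
    (loop [i 2, primes []]
      (cond
        (>= i n) primes
        (aget composite? i) (recur (inc i) primes)
        :else (do
                ;; i is prime: cross off its multiples, starting at i*i
                (loop [j (* i i)]
                  (when (< j n)
                    (aset composite? j true)
                    (recur (+ j i))))
                (recur (inc i) (conj primes i)))))))

(reduce + (sieve-primes 2000000))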