How to remove elements from a vector quickly in Clojure? - optimization

I'm trying to remove elements from a Clojure vector:
Note that I'm using Clojure's operations from Kotlin:
val set = PersistentHashSet.create("foo")
val vec = PersistentVector.create("foo", "bar")
val seq = clojure.`core$remove`.invokeStatic(set, vec) as ISeq
val resultVec = clojure.`core$vec`.invokeStatic(seq) as PersistentVector
This is the equivalent of the following Clojure code:
(remove #{"foo"} ["foo" "bar"])
The code works fine, but I've noticed that creating a vector from the seq is extremely slow. I wrote a benchmark and these were the results:
| Item count | Remove (ms) | Remove + convert back to vector (ms) |
|------------|-------------|--------------------------------------|
| 1000       | 51          | 1355                                 |
| 10000      | 71          | 5123                                 |
Do you know how I can convert the seq resulting from the remove operation back to a vector without the harsh performance penalty?
If that is not possible, is there an alternative way to perform the remove operation?

You could use the vector-returning complement of remove:
(filterv (complement #{"foo"})
         ["foo" "bar"])
Note the use of filterv. The v indicates that it builds a vector from the start and returns a vector, so no conversion is required. It uses a transient vector behind the scenes, so it should be pretty fast.
I'm negating the predicate using complement so I can use filterv, since there is no removev. remove is just defined as the complement of filter anyway though, so it's basically what you were already doing, just strict.
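Another option along the same lines (a minimal sketch) is the transducer arity of remove combined with into, which also builds the result vector via a transient, so there is no lazy seq or conversion step:

(into [] (remove #{"foo"}) ["foo" "bar"])
;; => ["bar"]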

What you are trying to do fundamentally performs badly. Vectors are for fast indexed read/write, and O(1) access to the right end. To do anything else you must tear the vector apart and rebuild it again, an O(N) operation. If you need an operation like this to be efficient, you must use a different data structure.
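To illustrate, even removing a single element at a known index has to rebuild everything to its right (remove-at here is a hypothetical helper, not a core function):

;; subvec itself is cheap, but pouring the right-hand slice back in is O(N)
(defn remove-at [v i]
  (into (subvec v 0 i) (subvec v (inc i))))

(remove-at ["foo" "bar" "baz"] 1)
;; => ["foo" "baz"]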

Why not a PersistentHashSet? Fast removal, though not ordered. I do vaguely recall Clojure also having a sorted set in case that’s needed.
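A quick sketch of what that buys you: disj removes an element from a persistent set in effectively constant time, and sorted-set is there if ordering matters:

(disj #{"foo" "bar"} "foo")
;; => #{"bar"}

(disj (sorted-set "bar" "foo") "foo")
;; => #{"bar"}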

You have made the error of treating the lazy result of remove as equivalent to the concrete result of converting back to a vector. Compare the lazy result of (remove ...) with the concrete result forced by (count (remove ...)): you will see that forcing it is actually slightly slower than just doing (vec (remove ...)). Also, for real speed-critical applications, there is nothing like using a native Java ArrayList:
(ns tst.demo.core
  (:require
    [criterium.core :as crit])
  (:import [java.util ArrayList]))

(def N 1000)
(def tgt-item (/ N 2))
(def pred-set #{(long tgt-item)})
(def data-vec (vec (range N)))
(def data-al (ArrayList. data-vec))
(def tgt-items (ArrayList. [tgt-item]))

(println :lazy)
(crit/quick-bench
  (remove pred-set data-vec))

(println :lazy-count)
(crit/quick-bench
  (count (remove pred-set data-vec)))

(println :vec)
(crit/quick-bench
  (vec (remove pred-set data-vec)))

(println :ArrayList)
(crit/quick-bench
  (let [changed? (.removeAll data-al tgt-items)]
    data-al))
with results:
:lazy        Evaluation count : 35819946   time mean :    10.856 ns
:lazy-count  Evaluation count : 8496       time mean : 69941.171 ns
:vec         Evaluation count : 9492       time mean : 62965.632 ns
:ArrayList   Evaluation count : 167490     time mean :  3594.586 ns

Related

Idiomatic way of listing elements of a sum type in Idris

I have a sum type representing arithmetic operators:
data Operator = Add | Substract | Multiply | Divide
and I'm trying to write a parser for it. For that, I would need an exhaustive list of all the operators.
In Haskell I would use deriving (Enum, Bounded), as suggested in the following StackOverflow question: Getting a list of all possible data type values in Haskell
Unfortunately, there doesn't seem to be such a mechanism in Idris, as suggested by Issue #19. There is some ongoing work by David Christiansen on the question, so hopefully the situation will improve in the future: david-christiansen/derive-all-the-instances
Coming from Scala, I am used to listing the elements manually, so I pretty naturally came up with the following:
Operators : Vect 4 Operator
Operators = [Add, Substract, Multiply, Divide]
To make sure that Operators contains all the elements, I added the following proof:
total
opInOps : Elem op Operators
opInOps {op = Add} = Here
opInOps {op = Substract} = There Here
opInOps {op = Multiply} = There (There Here)
opInOps {op = Divide} = There (There (There Here))
so that if I add an element to Operator without adding it to Operators, the totality checker complains:
Parsers.opInOps is not total as there are missing cases
It does the job but it is a lot of boilerplate.
Did I miss something? Is there a better way of doing it?
One option is to use the language's elaborator reflection feature to get the list of all constructors.
Here is a pretty dumb approach to solving this particular problem (I'm posting this because the documentation at the moment is very scarce):
%language ElabReflection

data Operator = Add | Subtract | Multiply | Divide

constrsOfOperator : Elab ()
constrsOfOperator =
  do (MkDatatype _ _ _ constrs) <- lookupDatatypeExact `{Operator}
     loop $ map fst constrs
  where loop : List TTName -> Elab ()
        loop [] =
          do fill `([] : List Operator); solve
        loop (c :: cs) =
          do [x, xs] <- apply `(List.(::) : Operator -> List Operator -> List Operator) [False, False]
             solve
             focus x; fill (Var c); solve
             focus xs
             loop cs

allOperators : List Operator
allOperators = %runElab constrsOfOperator
A couple comments:
It seems that to solve this problem for any inductive datatype of a similar structure one would need to work through the Elaborator Reflection: Extending Idris in Idris paper.
Maybe the pruviloj library has something that might make solving this problem for a more general case easier.

Clojure: Store and Compile Large Derived Data Structure

I have a large data structure, a tree, that takes up about 2 GB of RAM. It includes Clojure sets in the leaves and refs as the branches. The tree is built by reading and parsing a large flat file and inserting the rows into the tree. However, this takes about 30 seconds. Is there a way I can build the tree once, emit it to a clj file, and then compile the tree into my standalone jar, so I can look up values in the tree without re-reading the large text file? I think this would trim out the 30-second tree build, and it would also let me deploy my standalone jar without needing the text file to come along for the ride.
My first swing at this failed:
(def x (ref {:zebra (ref #{1 2 3 4})}))
#<Ref#6781a7dc: {:zebra #<Ref#709c4f85: #{1 2 3 4}>}>
(def y #<Ref#6781a7dc: {:zebra #<Ref#709c4f85: #{1 2 3 4}>}>)
RuntimeException Unreadable form clojure.lang.Util.runtimeException (Util.java:219)
Embedding data this big in compiled code may not be possible because of size limits imposed upon the JVM. In particular, no single method may exceed 64 KiB in length. Embedding data in the way I describe further below also necessitates including tons of stuff in the class file it's going to live in; doesn't seem like a great idea.
Given that you're using the data structure read-only, you can construct it once, then emit it to a .clj / .edn file (that's edn, the serialization format based on Clojure literal notation). Include that file on your classpath as a "resource" (in resources/ with default Leiningen settings; it will then be included in the überjar unless excluded by :uberjar-exclusions in project.clj) and read it from the resource at runtime at the full speed of Clojure's reader:
(ns foo.core
  (:require [clojure.java.io :as io]))

(defn get-the-huge-data-structure []
  (let [r   (io/resource "huge.edn")
        rdr (java.io.PushbackReader. (io/reader r))]
    (read rdr)))

;; if you then do something like this:
(def ds (get-the-huge-data-structure))
;; your app will load the data as soon as this namespace is required;
;; for your :main namespace, this means as soon as the app starts;
;; note that if you use AOT compilation, it'll also be loaded at
;; compile time
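For the emitting side, a minimal sketch, assuming the tree has already been converted to plain data with the refs replaced by their values (plain-tree is a placeholder name); *print-length* and *print-level* are rebound so nothing gets elided:

;; write the structure in readable edn form to the resources directory
(with-open [w (io/writer "resources/huge.edn")]
  (binding [*out* w
            *print-length* nil
            *print-level* nil]
    (pr plain-tree)))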
You could also not add it to the überjar, but rather add it to the classpath when running your app. This way your überjar itself would not have to be huge.
Handling stuff other than persistent Clojure data could be accomplished using print-method (when serializing) and reader tags (when deserializing). Arthur already demonstrated using reader tags; to use print-method, you'd do something like
(defmethod print-method clojure.lang.Ref [x writer]
  (.write writer "#ref ")
  (print-method @x writer))
;; from the REPL, after doing the above:
user=> (pr-str {:foo (ref 1)})
"{:foo #ref 1}"
Of course you only need to have the print-method methods defined when serializing; your deserializing code can leave them out, but it will need appropriate data readers.
Disregarding the code size issue for a moment, as I find the data embedding issue interesting:
Assuming your data structure only contains immutable data natively handled by Clojure (Clojure persistent collections, arbitrarily nested, plus atomic items such as numbers, strings (atomic for this purpose), keywords, symbols; no Refs etc.), you can indeed include it in your code:
(defmacro embed [x]
  x)
The generated bytecode will then recreate x without reading anything, by using constants included in the class file and static methods of the clojure.lang.RT class (e.g. RT.vector and RT.map).
This is, of course, how literals are compiled, since the macro above is a noop. We can make things more interesting though:
(ns embed-test.core
  (:require [clojure.java.io :as io])
  (:gen-class))

(defmacro embed-resource [r]
  (let [r   (io/resource r)
        rdr (java.io.PushbackReader. (io/reader r))]
    (read rdr)))

(defn -main [& args]
  (println (embed-resource "foo.edn")))
This will read foo.edn at compile time and embed the result in the compiled code (in the sense of including appropriate constants and code to reconstruct the data in the class file). At run time, no further reading will be performed.
Is this structure something that changes? If not, consider using Java serialization to persist it. Deserializing will be much faster than rebuilding it every time.
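A minimal sketch of that approach from Clojure, assuming the refs have been replaced by plain values first (Clojure's persistent collections implement java.io.Serializable); freeze! and thaw are hypothetical helper names:

(import '[java.io FileOutputStream FileInputStream
                  ObjectOutputStream ObjectInputStream])

;; write the structure once with Java serialization
(defn freeze! [path data]
  (with-open [out (ObjectOutputStream. (FileOutputStream. path))]
    (.writeObject out data)))

;; read it back at startup
(defn thaw [path]
  (with-open [in (ObjectInputStream. (FileInputStream. path))]
    (.readObject in)))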
If you can structure the tree to be a single value instead of a tree of references to many values, then you will be able to print the tree and read it back. Because refs are not readable, you won't be able to treat the entire tree as something readable without doing your own parsing.
It may be worth looking into using the extensible reader to add print and read functions for your tree by making it a type.
here is a minimal example of using data-readers to produce references to sets and maps from a string:
first define handlers for the contents of each EDN tag/type
user> (defn parse-map-ref [m] (ref (apply hash-map m)))
#'user/parse-map-ref
user> (defn parse-set-ref [s] (ref (set s)))
#'user/parse-set-ref
Then bind the *data-readers* map to associate the handlers with textual tags:
(def y-as-string
  "#user/map-ref [:zebra #user/set-ref [1 2 3 4]]")

user> (def y (binding [*data-readers* {'user/set-ref user/parse-set-ref
                                       'user/map-ref user/parse-map-ref}]
               (read-string y-as-string)))
user> y
#<Ref#6d130699: {:zebra #<Ref#7c165ec0: #{1 2 3 4}>}>
this also works with more deeply nested trees:
(def z-as-string
  "#user/map-ref [:zebra #user/set-ref [1 2 3 4]
                  :ox #user/map-ref [:animal #user/set-ref [42]]]")

user> (def z (binding [*data-readers* {'user/set-ref user/parse-set-ref
                                       'user/map-ref user/parse-map-ref}]
               (read-string z-as-string)))
#'user/z
user> z
#<Ref#2430c1a0: {:ox #<Ref#7cf801ef: {:animal #<Ref#7e473201: #{42}>}>,
                 :zebra #<Ref#7424206b: #{1 2 3 4}>}>
Producing strings from trees can be accomplished by extending the print-method multimethod, though it would be a lot easier if you defined types for the ref-map and ref-set using deftype, so the printer could know which refs should produce which strings.
If reading them as strings is too slow in general, there are faster binary serialization libraries such as Protocol Buffers.

QuickCheck: Arbitrary instances of nested data structures that generate balanced specimens

tl;dr: how do you write instances of Arbitrary that don't explode if your data type allows for way too much nesting? And how would you guarantee these instances produce truly random specimens of your data structure?
I want to generate random tree structures, then test certain properties of these structures after I've mangled them with my library code. (NB: I'm writing an implementation of a subtyping algorithm, i.e. given a hierarchy of types, is type A a subtype of type B. This can be made arbitrarily complex, by including multiple-inheritance and post-initialization updates to the hierarchy. The classical method that supports neither of these is Schubert Numbering, and the latest result known to me is Alavi et al. 2008.)
Let's take the example of rose-trees, following Data.Tree:
data Tree a = Node a (Forest a)
type Forest a = [Tree a]
A very simple (and don't-try-this-at-home) instance of Arbitrary would be:
instance (Arbitrary a) => Arbitrary (Tree a) where
  arbitrary = Node <$> arbitrary <*> arbitrary
Since a already has an Arbitrary instance as per the type constraint, and the Forest will have one, because [] is an instance, too, this seems straight-forward. It won't (typically) terminate for very obvious reasons: since the lists it generates are arbitrarily long, the structures become too large, and there's a good chance they won't fit into memory. Even a more conservative approach:
arbitrary = Node <$> arbitrary <*> oneof [arbitrary,return []]
won't work, again, for the same reason. One could tweak the size parameter, to keep the length of the lists down, but even that won't guarantee termination, since it's still multiple consecutive dice-rolls, and it can turn out quite badly (and I want the odd node with 100 children.)
Which means I need to limit the size of the entire tree. That is not so straight-forward. unordered-containers has it easy: just use fromList. This is not so easy here: How do you turn a list into a tree, randomly, and without incurring bias one way or the other (i.e. not favoring left-branches, or trees that are very left-leaning.)
Some sort of breadth-first construction (the functions provided by Data.Tree are all pre-order) from lists would be awesome, and I think I could write one, but it would turn out to be non-trivial. Since I'm using trees now, but will use even more complex stuff later on, I thought I might try to find a more general and less complex solution. Is there one, or will I have to resort to writing my own non-trivial Arbitrary generator? In the latter case, I might actually just resort to unit-tests, since this seems too much work.
Use sized:
instance Arbitrary a => Arbitrary (Tree a) where
  arbitrary = sized arbTree

arbTree :: Arbitrary a => Int -> Gen (Tree a)
arbTree 0 = do
  a <- arbitrary
  return $ Node a []
arbTree n = do
  (Positive m) <- arbitrary
  let n' = n `div` (m + 1)
  f <- replicateM m (arbTree n')
  a <- arbitrary
  return $ Node a f
(Adapted from the QuickCheck presentation).
P.S. Perhaps this will generate overly balanced trees...
You might want to use the library presented in the paper "Feat: Functional Enumeration of Algebraic Types" at the Haskell Symposium 2012. It is on Hackage as testing-feat, and a video of the talk introducing it is available here: http://www.youtube.com/watch?v=HbX7pxYXsHg
As Janis mentioned, you can use the package testing-feat, which creates enumerations of arbitrary algebraic data types. This is the easiest way to create unbiased uniformly distributed generators
for all trees of up to a given size.
Here is how you would use it for rose trees:
import Test.Feat (Enumerable(..), uniform, consts, unary, funcurry)
import Test.Feat.Class (Constructor)
import Data.Tree (Tree(..))
import qualified Test.QuickCheck as QC

-- We make an enumerable instance by listing all constructors
-- for the type. In this case, we have one binary constructor:
--   Node :: a -> [Tree a] -> Tree a
instance Enumerable a => Enumerable (Tree a) where
  enumerate = consts [binary Node]
    where
      binary :: (a -> b -> c) -> Constructor c
      binary = unary . funcurry

-- Now we use the Enumerable instance to create an Arbitrary
-- instance with the help of the function:
--   uniform :: Enumerable a => Int -> QC.Gen a
instance Enumerable a => QC.Arbitrary (Tree a) where
  QC.arbitrary = QC.sized uniform
  -- QC.shrink = <some implementation>
The Enumerable instance can also be generated automatically with TemplateHaskell:
deriveEnumerable ''Tree

F# automatic generalization and performance

I ran into an unexpected code optimization recently, and wanted to check whether my interpretation of what I was observing was correct. The following is a much simplified example of the situation:
let demo =
    let swap fst snd i =
        if i = fst then snd else
        if i = snd then fst else
        i
    [ for i in 1 .. 10000 -> swap 1 i i ]

let demo2 =
    let swap (fst: int) snd i =
        if i = fst then snd else
        if i = snd then fst else
        i
    [ for i in 1 .. 10000 -> swap 1 i i ]
The only difference between the 2 blocks of code is that in the second case, I explicitly declare the arguments of swap as integers. Yet, when I run the 2 snippets in fsi with #time, I get:
Case 1 Real: 00:00:00.011, CPU: 00:00:00.000, GC gen0: 0, gen1: 0, gen2: 0
Case 2 Real: 00:00:00.004, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
i.e. the 2nd snippet runs 3 times faster than the first. The absolute performance difference here is obviously not an issue, but if I were to use the swap function a lot, it would pile up.
My assumption is that the reason for the performance hit is that in the first case, swap is generic and "requires equality", and checks whether int supports it, whereas the second case doesn't have to check anything. Is this the reason this is happening, or am I missing something else? And more generally, should I consider automatic generalization a double-edged sword, that is, an awesome feature which may have unexpected effects on performance?
I think this is generally the same case as in the question Why is this F# code so slow. In that question, the performance issue is caused by a constraint requiring comparison and in your case, it is caused by the equality constraint.
In both cases, the compiled generic code has to use interfaces (and boxing), while the specialized compiled code can directly use IL instructions for comparison or equality of integers or floating-point numbers.
The two ways to avoid the performance issue are:
Specialize the code to use int or float as you did
Mark the function as inline so that the compiler specializes it automatically
For smaller functions, the second approach is better, because it does not generate too much code and you can still write functions in a generic way. If you use the function for just a single type (by design), then the first approach is probably appropriate.
The reason for the difference is that the compiler is generating calls to
IL_0002: call bool class [FSharp.Core]Microsoft.FSharp.Core.LanguagePrimitives/HashCompare::GenericEqualityIntrinsic<!!0> (!!0, !!0)
in the generic version, whilst the int version can just compare directly.
If you used inline, I suspect this problem would disappear, as the compiler would then have the extra type information.

Transition from infix to prefix notation

I started learning Clojure recently. Generally it looks interesting, but I can't get used to some syntactic inconveniences (comparing to previous Ruby/C# experience).
Prefix notation for nested expressions. In Ruby I got used to writing complex expressions by chaining/piping them left-to-right: some_object.map { some_expression }.select { another_expression }. It's really convenient: you move from input value to result step by step, you can focus on a single transformation, and you don't need to move the cursor as you type. By contrast, when I write nested expressions in Clojure, I write the code from the inner expression outward and have to move the cursor constantly. It slows me down and distracts me. I know about the -> and ->> macros, but I've noticed that they're not idiomatic. Did you have the same problem when you started coding in Clojure/Haskell etc.? How did you solve it?
I felt the same about Lisps initially so I feel your pain :-)
However the good news is that you'll find that with a bit of time and regular usage you will probably start to like prefix notation. In fact with the exception of mathematical expressions I now prefer it to infix style.
Reasons to like prefix notation:
Consistency with functions - most languages use a mix of infix (mathematical operators) and prefix (function call) notation. In Lisps it is all consistent, which has a certain elegance if you consider mathematical operators to be functions
Macros - become much more sane if the function call is always in the first position.
Varargs - it's nice to be able to have a variable number of parameters for pretty much all of your operators. (+ 1 2 3 4 5) is nicer IMHO than 1 + 2 + 3 + 4 + 5
A trick, then, is to use -> and ->> liberally when it makes logical sense to structure your code this way. This is typically useful when dealing with subsequent operations on objects or collections, e.g.
(->> "Hello World"
     distinct
     sort
     (take 3))
==> (\space \H \W)
The final trick I found very useful when working in prefix style is to make good use of indentation when building more complex expressions. If you indent properly, then you'll find that prefix notation is actually quite clear to read:
(defn add-foobars [x y]
  (+
    (bar x y)
    (foo y)
    (foo x)))
To my knowledge -> and ->> are idiomatic in Clojure. I use them all the time, and in my opinion they usually lead to much more readable code.
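For completeness, a small sketch of thread-first (->), which threads the value into the first argument position of each form (the map here is just made-up example data):

(-> {:name "Ada"}
    (assoc :lang :clojure)
    (dissoc :name))
;; => {:lang :clojure}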
Here are some examples of these macros being used in popular projects from around the Clojure "ecosystem":
Ring cookie parsing
Leiningen internals
ClojureScript compiler
Proof by example :)
If you have a long expression chain, use let. Long runaway expressions or deeply nested expressions are not especially readable in any language. This is bad:
(do-something (map :id (filter #(> (:age %) 19) (fetch-data :people))))
This is marginally better:
(do-something (map :id
                   (filter #(> (:age %) 19)
                           (fetch-data :people))))
But this is also bad:
fetch_data(:people).select{|x| x.age > 19}.map{|x| x.id}.do_something
If we're reading this, what do we need to know? We're calling do_something on some attributes of some subset of people. This code is hard to read because there's so much distance between the first and last steps that we forget what we're looking at by the time we travel between them.
In the case of Ruby, do_something (or whatever is producing our final result) is lost way at the end of the line, so it's hard to tell what we're doing to our people. In the case of Clojure, it's immediately obvious that do-something is what we're doing, but it's hard to tell what we're doing it to without reading through the whole thing to the inside.
Any code more complex than this simple example is going to become pretty painful. If all of your code looks like this, your neck is going to get tired scanning back and forth across all of these lines of spaghetti.
I'd prefer something like this:
(let [people (fetch-data :people)
      adults (filter #(> (:age %) 19) people)
      ids    (map :id adults)]
  (do-something ids))
Now it's obvious: I start with people, I goof around, and then I do-something to them.
And you might get away with this:
fetch_data(:people).select{|x|
  x.age > 19
}.map{|x|
  x.id
}.do_something
But I'd probably rather do this, at the very least:
adults = fetch_data(:people).select{|x| x.age > 19}
do_something( adults.map{|x| x.id} )
It's also not unheard of to use let even when your intermediary expressions don't have good names. (This style is occasionally used in Clojure's own source code, e.g. the source code for defmacro)
(let [x (complex-expr-1 x)
      x (complex-expr-2 x)
      x (complex-expr-3 x)
      ...
      x (complex-expr-n x)]
  (do-something x))
This can be a big help in debugging, because you can inspect things at any point by doing:
(let [x (complex-expr-1 x)
      x (complex-expr-2 x)
      _ (prn x)
      x (complex-expr-3 x)
      ...
      x (complex-expr-n x)]
  (do-something x))
I did indeed face the same hurdle when I first started with a Lisp, and it was really annoying until I saw the ways it makes code simpler and clearer. Once I understood the upside, the annoyance faded.
initial + scale + offset
became
(+ initial scale offset)
Then try (+): prefix notation allows functions to specify their own identity values
user> (*)
1
user> (+)
0
There are lots more examples and my point is NOT to defend prefix notation. I just hope to convey that the learning curve flattens (emotionally) as the positive sides become apparent.
Of course, when you start writing macros, prefix notation becomes a must-have instead of a convenience.
To address the second part of your question: the thread-first and thread-last macros are idiomatic anytime they make the code clearer :) They are used more often with function calls than pure arithmetic, though nobody will fault you for using them when they make an equation more palatable.
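A tiny sketch of thread-first on arithmetic:

(-> 3 (+ 4) (* 2))
;; => 14, the same as (* (+ 3 4) 2)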
P.S.: (.. object object2 object3) corresponds to object.object2().object3(); in Java, and doto threads an object through several calls:
(doto my-object
  (.setX 4)
  (.setY 5))
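Roughly what that doto form expands to (setX/setY are the same hypothetical setters as above); note that doto returns the object itself:

(let [obj my-object]
  (.setX obj 4)
  (.setY obj 5)
  obj)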