When should I use `drain` vs `into_iter`?

On the surface, it looks like both drain and into_iter provide similar iterators, namely over the values of the collection. However, they are different:
fn main() {
    let mut items1 = vec![0u8, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    let items2 = items1.clone();
    println!("{:?}", items1.drain(..).count()); // drain now requires a range
    println!("{:?}", items2.into_iter().count());
    println!("{:?}", items1); // prints []: still usable, just empty
    // println!("{:?}", items2); // error: items2 was moved by into_iter
}
drain takes a &mut to the collection and the collection is available afterwards. into_iter consumes the collection. What are the appropriate uses for each iterator?

They are somewhat redundant with each other. However, as you say, Drain just borrows the vector; in particular, it has a lifetime connected to the vector. If one wishes to return an iterator, or otherwise munge iterators in the most flexible way possible, into_iter is better, since it's not chained to the owner of the originating Vec. If one wishes to reuse the data structure (e.g. reuse the allocation), then drain is the most direct way of doing this.
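For instance, a minimal sketch of both use cases (the function names here are invented for illustration):
// Returning an iterator: into_iter works because the iterator owns its data,
// so it isn't tied to the lifetime of any local Vec.
fn evens(v: Vec<u32>) -> impl Iterator<Item = u32> {
    v.into_iter().filter(|n| n % 2 == 0)
}
// Reusing the allocation: drain empties the Vec in place but keeps its capacity.
fn recycle(buf: &mut Vec<u32>) {
    for n in buf.drain(..) {
        println!("{}", n);
    }
}
fn main() {
    println!("{:?}", evens(vec![1, 2, 3, 4]).collect::<Vec<_>>()); // [2, 4]
    let mut buf = vec![5, 6, 7];
    recycle(&mut buf);
    assert!(buf.is_empty()); // empty, but the allocation is kept for reuse
}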
Also, a (somewhat) theoretical concern is that Drain needs to result in the originating structure being a valid instance of whatever type it is, that is, either preserve invariants, or fix them up at the end, while IntoIter can mangle the structure as much as it likes, since it has complete control of the value.
I say only "somewhat" theoretical because there is a small, real-world example of this in std already: HashMap exposes .drain and .into_iter via its internal RawTable type, which also has those methods. into_iter can just read the hash of the value being moved directly and that's that, but drain has to be careful to update the hash to indicate that the cell is then empty, not just read it. Obviously this is absolutely tiny in this instance (probably only one or two additional instructions), but for more complicated data structures like trees there may be some non-trivial gains to be had from being able to break the invariants of the data structure.

After using drain, the Vec is empty but the storage previously allocated for its elements remains allocated. This means that you can insert new elements in the Vec without having to allocate storage for them until you reach the Vec's capacity.
Note that before you can use the Vec again, you must drop all references to the Drain iterator. Drain's implementation of Drop will remove all elements that hadn't been removed from the Vec yet, so the Vec will be empty even if you didn't finish the iteration.
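A minimal sketch of both points, i.e. the capacity surviving and Drain's Drop finishing the removal:
fn main() {
    let mut v: Vec<u32> = (0..10).collect();
    let cap = v.capacity();
    // Take only the first element, then let the Drain drop mid-iteration.
    let first = v.drain(..).next();
    println!("{:?}", first); // Some(0)
    // Drain's Drop removed the rest, and the allocation is still there.
    assert!(v.is_empty());
    assert!(v.capacity() >= cap);
}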

For the 2018 edition of Rust:
into_iter consumes the collection itself, drain only consumes the values in the collection.
Therefore drain allows draining of only a part of the collection, and now in fact requires specifying a range. (This appears to have changed since when the question was asked.)
So use into_iter if you want to consume the entire collection, and use drain if you only want to consume part of the collection or if you want to reuse the emptied collection later.
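For example, a partial drain, which into_iter cannot do:
fn main() {
    let mut v = vec![1, 2, 3, 4, 5];
    // Remove just the first two elements; the rest shift down.
    let front: Vec<i32> = v.drain(..2).collect();
    assert_eq!(front, [1, 2]);
    assert_eq!(v, [3, 4, 5]); // still usable, allocation reused
}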

Synchronize access to mutable fields with Kotlin's map delegation

Is this implementation safe to synchronize the access to the public fields/properties?
class Attributes(
    private val attrsMap: MutableMap<String, Any?> = Collections.synchronizedMap(HashMap())
) {
    var attr1: Long? by attrsMap
    var attr2: String? by attrsMap
    var attr3: Date? by attrsMap
    var attr4: Any? = null
    ...
}
Mostly.
Because the underlying map is only accessible via the synchronised wrapper, you can't have any issues caused by individual calls, such as simultaneous gets and/or puts (which are the main cause of race conditions): only one thread can be making such a call at a time, and the Java memory model ensures that the results are then visible to all threads.
You could have race conditions involving a sequence of calls, such as iterating through the map, or a check followed by a modify, if the map could be modified in between.  (That sort of problem can occur even on a single thread.)  But as long as the rest of your class avoided such sequences, and didn't leak a reference to the map, you'd be safe.
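For example (a hypothetical sketch using the question's synchronised-map setup), this check-then-act sequence is racy even though each individual call is safe:
import java.util.Collections
val attrs: MutableMap<String, Any?> = Collections.synchronizedMap(HashMap())
fun putIfMissing(key: String, value: Any?) {
    if (!attrs.containsKey(key)) { // another thread may insert here...
        attrs[key] = value         // ...and this would overwrite its value
    }
}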
And because Long and String are immutable, you can't have any issues with their contents being modified.
That is a concern with the Any property, though (and, strictly speaking, with java.util.Date, which is mutable). If it stored e.g. a StringBuilder, one thread could be modifying its contents while another was accessing it, with hilarious consequences. There's not much you can do about that in a wrapper class, though.
By the way, instead of using a synchronised wrapper, you could use a ConcurrentHashMap, which would avoid the synchronisation in most cases (at the cost of a bit more memory).  It also provides many methods which can replace call sequences, such as getOrPut(); it's a really powerful tool for writing high-performance multithreaded code.
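As a sketch of that (the key and lambda are illustrative), computeIfAbsent performs the whole check-then-insert atomically, so the race described above disappears. One caveat: ConcurrentHashMap doesn't allow null keys or values, so it doesn't directly suit the nullable Any? properties in the question.
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap
val attrs = ConcurrentHashMap<String, Any>()
fun main() {
    // One atomic operation instead of a racy containsKey/put pair.
    val id = attrs.computeIfAbsent("id") { UUID.randomUUID().toString() }
    println(id)
}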

Kotlin: most efficient way to convert a sequence to a set

Hi, I've got a list of 1330 objects and would like to apply a method to each and obtain a set as the result.
val result = listOf1330
    .asSequence()
    .map { someMethod(it) }
val resultSet = result.toSet()
It works fine without toSet, but with it the execution time is about 10 times longer.
I've used a sequence to make it faster, and it is, but as a result I need a list without duplicates (a set).
Simply: what is the most efficient way to convert a sequence to a set?
val result = listOf1330.mapTo(HashSet()) { someMethod(it) }
It makes little sense to use streams or sequences to implement this transformation: you need all the elements from the collection, not just a few. The mapTo (and map) functions are inline in Kotlin, which means the code is substituted into the call site, without a lambda object being created and invoked many times. We use mapTo to avoid the second copy of the collection that the toSet() function would make.
Using .parallelStream() may add more performance if you want to run the computation on several threads. It is still a good idea to measure how well the load is balanced between threads; the performance may also depend on which collection implementation class you call it on.
If your someObject has a slow implementation of equals() or hashCode(), or gives the same hash code for many objects, then that could account for the delay, and you may be able to improve it.
Otherwise, if the objects are big, the delay may be mostly due to the amount of memory that must be accessed to store them all; if so, that's the price you'll have to pay if you want a set with all those objects in memory.
Sequence.toSet() uses a LinkedHashSet.  You could try providing another Set instance, using e.g. toCollection(HashSet()), to see if that's any faster.  (You wouldn't get the same iteration order, though.)
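A minimal sketch of that suggestion (someMethod here is a stand-in for the question's):
fun someMethod(s: String) = s.uppercase() // stand-in for the real mapping
val resultSet: HashSet<String> = listOf("a", "b", "a")
    .asSequence()
    .map { someMethod(it) }
    .toCollection(HashSet()) // HashSet instead of toSet()'s LinkedHashSet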
I agree with gidds' answer on HashSet and LinkedHashSet performance:
LinkedHashSet is more expensive for insertions than HashSet.
However, in the above use case, I think we can leverage parallelStream to improve performance. Under the hood, Kotlin delegates to Java's parallelStream.
val result: Set<String> = listOf("sdgds", "fdgdfsg", "dsfgsdfg")
    .parallelStream()
    .map { someMethod(it) }
    .collect(Collectors.toSet()) // java.util.stream.Collectors
Collectors.toSet() uses a HashSet, so we should be fine from an insertion-performance perspective.
Use distinct or distinctBy.
val result = sequenceOf("a", "b", "a", "c").distinct()
// -> "a", "b", "c"
// for more complex cases use custom comparator function
val result = getMyObjectsSequence().distinctBy { it.name }
This approach lets you keep using sequences without involving explicit Iterables (List, Set, etc.).
Nevertheless, there is no magic: distinct still uses a HashSet under the hood, and for a really huge sequence it may cause significant memory usage, which must be kept in mind when applying this function.

Efficiently make a view of (or copy) a subset of a large HashMap in Kotlin

I am trying to create a sub-hashmap from a huge hashmap without copying the original one.
Currently I use this:
val map = hashMapOf<Job, Int>()
val copy = HashMap(map)
listToRemoveFromCopy.forEach { copy.remove(it) }
This costs me around 50% of my algorithm's runtime, because Java is calculating the hash of each Job really often.
I only want the map minus the elements of listToRemoveFromCopy in a new variable, without removing those elements from the original map.
Does anyone know how to do this?
Thanks for the help.
First, you need to cache the hashcode for Job because any approach you use will be inefficient if you cannot have a set or a map of Job objects that operate at top speed.
Hopefully the parts that make up its hashcode are immutable; otherwise it should not be used as a key at all, since it is very dangerous to mutate a key's hashcode/equals while it is in use in a map or set. You should cache the hash on the first call to hashCode(), so that you do not incur the cost until then, unless you are sure you will always need it.
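A sketch of that caching, in the style of java.lang.String (the Job fields here are invented; the real class presumably differs):
class Job(val name: String, val priority: Int) {
    // 0 doubles as "not computed yet", the same trick java.lang.String uses.
    private var cachedHash = 0
    override fun hashCode(): Int {
        if (cachedHash == 0) {
            cachedHash = 31 * name.hashCode() + priority
        }
        return cachedHash
    }
    override fun equals(other: Any?): Boolean =
        other is Job && other.name == name && other.priority == priority
}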
Then change listToRemoveFromCopy to be a Set so it can be efficiently used in many ways. You need to do the prior step before this.
Now you have multiple options. The most efficient is:
Guava has a utility function Maps.filterKeys which returns a view into a map, and you can create a predicate that works against a Set of the items to remove.
val removeKeys = listToRemoveFromCopy.toSet()
val mapView = Maps.filterKeys(map, Predicates.not(Predicates.in(removeKeys)))
But note that some methods on the view are not very efficient. If you avoid those methods, this will be the top-performing option. As the Guava documentation warns:
Many of the filtered map's methods, such as size(), iterate across every key/value mapping in the underlying map and determine which satisfy the filter. When a live view is not needed, it may be faster to copy the filtered map and use the copy.
If you need to make a copy instead, you have a few approaches:
Use filterKeys on the map to create a new map in one shot. This is good if the remove list might be a larger percentage of the total keys.
val removeKeys = listToRemoveFromCopy.toSet()
val newMap = map.filterKeys { it !in removeKeys }
Another tempting option to be careful about is the minus (-) operator, which copies the full map and then removes the items. It can use listToRemoveFromCopy as-is, without it being a set, but the full map copy might undo the benefit. So do not use it unless the remove list is a small percentage of the keys.
val newMapButSlower = map - listToRemoveFromCopy
You could pick one model over the other depending on the ratio between the map size and the remove-list size, and find the break-even point that works for your "huge".
Implementing your own view into the map to avoid a copy is possible, but not trivial (and by that I mean very complex). Every method you override has to do the correct thing at all times (including the map's own hashCode and equals), and other views would have to be created around the key set and values. The entrySet would be nasty to get right. I'd look for a pre-written solution before attempting your own (the Guava one above or other). This zero-copy model would be the most efficient solution but the most code and is what I would do in the same case if "huge" meant significant processing time. There is a lot that you can get wrong with this approach if you misunderstand any part of the implementation contract.
You could wrap the Guava solution with one that maintains the size attribute as items are manipulated and therefore be efficient for that case. You can also write a more efficient solution if you know the original map is read-only. For ideas, check out the Guava implementation of FilteredKeyMap and its ancestor AbstractFilteredMap.
In summary, caching your hashcode is likely to give you the biggest result for the effort. Start there; you'll need it even for the Guava approach.
In addition to Axel's direct answer:
Could calculating the hashcode of a Job be optimised?  If the calculation can't be sped up, could it cache the result?  (There's ample precedent for this, including java.lang.String.)  Or if the class isn't under your control, could you create a delegate/wrapper that overrides the hashcode calculation?
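A sketch of such a wrapper, assuming Job is third-party and can't be changed; you would then key the map by HashedJob instead of Job:
class HashedJob(val job: Job) {
    private val hash = job.hashCode() // the expensive hash, computed once
    override fun hashCode(): Int = hash
    override fun equals(other: Any?): Boolean =
        other is HashedJob && other.job == job
}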
You can use the filterKeys function. It will iterate over the map only once:
val copy = map.filterKeys { it !in listToRemoveFromCopy }

Why is Kotlin's mapOf(a to b, c to d) considered not suitable for performance-critical code?

Kotlin's official "collections" page contains the following remark:
Map creation in NOT performance-critical code can be accomplished with simple idiom: mapOf(a to b, c to d).
Questions:
a/ What is the reason behind this remark? (The best explanation I could come up with is that the "a to b" expressions create an extra, transient Pair object, but I am not sure.)
b/ What is the suggested best practice for initializing a map in a way that is suitable for performance-critical code?
There are two things that happen under the hood that might affect performance:
The mapOf(...) function expects a vararg of pairs, and, during a call, an array is created for the arguments and then passed to the function. This operation involves allocating an array and then filling it with the items.
As you correctly noted, the a to b infix function creates a pair (equivalent to Pair(a, b)), which is another object allocation.
Allocating many objects affects performance when done many times in tight loops (including additional load on garbage collector, when dealing with short-living objects).
Additionally, using an array for the vararg may affect locality of reference (instead of passing the arguments through the stack, they are placed into a separate memory region located somewhere else in the heap).
While JVMs are usually good at local optimizations and can sometimes even eliminate allocations, these optimizations are not guaranteed to happen.
A more performant way to create a map and fill it with items is:
val map: Map<Foo, Bar> = HashMap<Foo, Bar>().apply {
    put(a, b)
    put(c, d)
}
Using apply { ... } introduces no overhead since it is an inline function. An explicit type annotation Map<Foo, Bar> shows the intention not to mutate the map after it is created.
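If you are on Kotlin 1.6 or later, the stdlib's buildMap is an equivalent option; it is also an inline function and likewise avoids the vararg array and the Pair allocations:
// Same idea as the apply { put(...) } version, with a read-only result type.
val map: Map<String, Int> = buildMap {
    put("a", 1)
    put("c", 2)
}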

Stateful objects, properties and parameter-less methods in favour of stateless objects, parameters and return values

I find this class definition a bit odd:
http://www.extremeoptimization.com/Documentation/Reference/Extreme.Mathematics.LinearAlgebra.SingleLeastSquaresSolver_Members.aspx
The Solve method does have a return value, but it would not need one, because the result is also available in the Solution property.
This is what I see as traditional code:
var sqrt2 = Math.Sqrt(2);
This would be an alternative in the same spirit as the solver in the link:
var sqrtCalculator = new SqrtCalculator();
sqrtCalculator.Parameter = 2;
sqrtCalculator.Run();
var sqrt2 = sqrtCalculator.Result;
What are the pros and cons besides the second version being a bit "untraditional"?
Yes, the compiler won't help the user who forgot to assign some property (parameter) BUT this is the case with all components that contain writeable properties and don't have mandatory values in the constructor.
Yes, threading will not work, BUT each thread can create its own solver.
Yes, the garbage collector won't be able to reclaim the solver's result on its own, BUT it will once the entire solver is released.
Yes, compilers and processors have special treatment of parameters and return values which makes them fast, BUT the time for parameter handling is mostly negligible.
And so on. Other ideas?
Well, after a year I found a clear flaw in this "introvert" approach. I am using an existing filter object which should operate on a measurement object but instead operates on itself, in the "it's all me and nothing else" fashion described above. Now the customer wants a recalculation of a measurement object a few minutes after the first calculation, and meanwhile the filter has processed other measurement objects. If it had been stateless and stored its data in the measurement object, it would have been an easy matter to implement a Recalculate method. The only way to solve the problem with an introvert filter is to make a filter instance part of the measurement object. Then filters need to be instantiated for every new measurement object, and since filters are part of a chain, the entire chain needs to be recreated. Well, there is some merit to being stateless.