How to optimize finding the difference between two arrays / lists of objects

Premise:
I am using ActionScript with two ArrayCollections containing objects whose values need to be matched.
I need a solution for this (ideally a library in the framework that already does it well); otherwise any suggestions are appreciated.
Let's assume I have two lists of elements, A and B (no duplicate values), and I need to compare them and remove all the elements present in both, so that at the end I have:
in A, all the elements that are in A but not in B
in B, all the elements that are in B but not in A
Right now I do something like this:
for (var i:int = 0; i < a.length;)
{
    var isFound:Boolean = false;
    for (var j:int = 0; j < b.length;)
    {
        if (a.getItemAt(i).nome == b.getItemAt(j).nome)
        {
            isFound = true;
            a.removeItemAt(i);
            b.removeItemAt(j);
            break;
        }
        j++;
    }
    if (!isFound)
        i++;
}
I cycle through both arrays, and if I find a match I remove the items from both arrays (without increasing the loop counter, so the for loop progresses correctly).
I was wondering whether there is (and I'm sure there is) a better and less CPU-intensive way to do it.

If you must use a list and you don't need the features of ArrayCollection, I suggest simply converting it to AS3 Vectors. According to this comparison (http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/), the performance increase over Arrays is about 60%. I believe Arrays are already about 3x faster than ArrayCollections, according to an article I once read. Unfortunately, this solution is still O(n^2) in time.
As an aside, Vectors are faster than ArrayCollections because you provide type hinting to the VM. The VM knows exactly how large each object in the collection is and performs optimizations based on that.
Another optimization on the Vectors is to sort the data by nome before doing the comparisons. You then add another check to break out of the inner loop as soon as the ordering guarantees that the nome you are looking for cannot appear further down the other list.
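A rough sketch of where the sorting idea leads, written in Python rather than ActionScript purely for illustration. It assumes the items are dictionaries with a unique "nome" key; the function name difference_sorted is made up for this example. It goes one step further than the early break described above: once both lists are sorted, a single merge-style pass is enough, so the cost is dominated by the two sorts.

```python
def difference_sorted(a, b, key=lambda item: item["nome"]):
    """Return (only_in_a, only_in_b) using two sorts and one merge-style pass.

    Assumes keys are unique within each list, as stated in the question.
    """
    a = sorted(a, key=key)
    b = sorted(b, key=key)
    only_a, only_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        key_a, key_b = key(a[i]), key(b[j])
        if key_a == key_b:
            # Present in both lists: drop it from both results.
            i += 1
            j += 1
        elif key_a < key_b:
            # b has already moved past key_a, so it cannot appear later in b.
            only_a.append(a[i])
            i += 1
        else:
            only_b.append(b[j])
            j += 1
    # Whatever is left in either list has no partner in the other.
    only_a.extend(a[i:])
    only_b.extend(b[j:])
    return only_a, only_b
```

Unlike the loop in the question, this sketch returns new lists instead of removing items in place.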
If you want to go MUCH faster than that, use an associative array (Object in AS3). Of course, this may require more refactoring effort. I am assuming object.nome is a unique string/id for the objects. Simply use the value of nome as the key in objectA and objectB. That way, you no longer need to loop through every element of one list for each element of the other to do the comparison.
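The associative-array idea can be sketched as follows, again in Python for illustration only (a Python dict plays the role of the AS3 Object). The item layout (dictionaries with a unique "nome" key) and the name difference_by_key are assumptions made for this example.

```python
def difference_by_key(a, b, key="nome"):
    """Return (only_in_a, only_in_b) for two lists of dicts matched on `key`.

    Each list is indexed once, so the work grows roughly with len(a) + len(b)
    instead of len(a) * len(b).
    """
    index_a = {item[key]: item for item in a}
    index_b = {item[key]: item for item in b}
    only_a = [item for k, item in index_a.items() if k not in index_b]
    only_b = [item for k, item in index_b.items() if k not in index_a]
    return only_a, only_b
```

For example, with A holding the keys x and y and B holding y and z, the result is x for A and z for B.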

Related

Determine if List of Strings contains substring of all Strings in other list

I have this situation:
val a = listOf("wwfooww", "qqbarooo", "ttbazi")
val b = listOf("foo", "bar")
I want to determine if all items of b are contained in substrings of a, so the desired function should return true in the situation above. The best I can come up with is this:
return a.any { it.contains("foo") } && a.any { it.contains("bar") }
But it iterates over a twice. a.containsAll(b) doesn't work either because it compares on string equality and not substrings.

I'm not sure there is any way of doing that without iterating over a as many times as b.size. If you want only one iteration over a, you have to check all the elements of b for each element of a, so you are now iterating over b a.size times. In that scenario you also need to keep track of which items in b have already matched, so you don't check them again. That might be worse than just iterating over a repeatedly, since you can only do it by removing matched items from the list (or from a copy that you use instead of b), or by using another list to record the matches and then comparing it to the original b.
So I think you are on the right track with your code, but there are some issues. For example, you don't reference b at all, only hardcoded strings. Doing that for every element of b results in quite a big function if b has more than 2 elements, and it doesn't work at all if you don't know the values in advance.
This code does the same thing as yours, but it actually uses the elements of b rather than hardcoded strings that match b. (It iterates over b once and, for each element of b, at least partially over a.)
return b.all { bItem ->
    a.any { it.contains(bItem) }
}
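For comparison, the single-pass alternative discussed at the start of this answer (one iteration over a, while tracking which items of b are still unmatched) might look like the sketch below. It is written in Python rather than Kotlin, and the name all_contained is hypothetical. As the answer suggests, it is not obviously cheaper: every element of a is still checked against every pending item of b.

```python
def all_contained(a, b):
    """True if every string in b is a substring of at least one string in a."""
    remaining = set(b)  # items of b that have not matched anything yet
    for text in a:
        # Keep only the pending items that do NOT occur in this element of a.
        remaining = {needle for needle in remaining if needle not in text}
        if not remaining:
            return True  # everything matched, no need to scan the rest of a
    return not remaining
```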

Alex's answer is by far the simplest approach, and is almost certainly the best one in most circumstances.
However, it has complexity A*B (where A and B are the sizes of the two lists) — which means that it doesn't scale: if both lists get big, it'll get very slow.
So for completeness, here's a way that's more involved, and slower for the small cases, but has complexity proportional to A+B and so can cope efficiently with much larger lists.
The idea is to preprocess the a list, to generate a set of all the possible substrings, and then scan through the b list just checking for inclusion in that set.  (The preprocessing step takes time proportional* to A.  Converting the substrings into a set means that it can check whether a string is present in constant time, using its hash code; so the rest then takes time proportional to B.)
I think this is clearest using a helper function:
/**
 * Generates a list of all possible substrings, including
 * the string itself (but excluding the empty string).
 */
fun String.substrings()
    = indices.flatMap { start ->
        ((start + 1)..length).map { end ->
            substring(start, end)
        }
    }
For example, "1234".substrings() gives [1, 12, 123, 1234, 2, 23, 234, 3, 34, 4].
Then we can generate the set of all substrings of items from a, and check that every item of b is in it:
return a.flatMap { it.substrings() }
    .toSet()
    .containsAll(b)
(* Actually, the complexity is also affected by the lengths of the strings in the a list.  Alex's version is directly proportional to the average length, while the preprocessing part of the algorithm above is proportional to its square (as indicated by the map nested in the flatMap).  That's not good, of course; but in practice while the lists are likely to get longer, the strings within them probably won't, so that's unlikely to be significant.  Worth knowing about, though.
And there are probably other, still more complex algorithms, that scale even better…)

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.
I read a blog post about this topic and there you can find this code snippets:
List implementation:
val list = generateSequence(1) { it + 1 }
    .take(50_000_000)
    .toList()
measure {
    list
        .filter { it % 3 == 0 }
        .average()
}
// 8644 ms
Sequence implementation:
val sequence = generateSequence(1) { it + 1 }
    .take(50_000_000)
measure {
    sequence
        .filter { it % 3 == 0 }
        .average()
}
// 822 ms
The point here is that the Sequence implementation is about 10x faster.
However, I do not really understand WHY that is. I know that with a Sequence, you do "lazy evaluation", but I cannot find any reason why that helps reducing the processing in this example.
However, here I know why a Sequence is generally faster:
val result = sequenceOf("a", "b", "c")
    .map {
        println("map: $it")
        it.toUpperCase()
    }
    .any {
        println("any: $it")
        it.startsWith("B")
    }
Because with a Sequence you process the data "vertically", when the first element starts with "B", you don't have to map for the rest of the elements. It makes sense here.
So, why is it also faster in the first example?

Let's look at what those two implementations are actually doing:
The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.
(In fact, it's probably far more than that. As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them, that is, storing references to Int objects. On a typical 64-bit JVM, each reference takes 4 or 8 bytes, and each Int object could take 16, giving roughly 1 to 1.2GB. Also, depending on how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)
Then it has to read all the values back from the list, filter them, and create another list in memory.
Finally, it has to read all those values back in again, to calculate the average.
The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.
(That's pretty much how you'd do it if you were implementing it ‘by hand’.)
You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!
Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).
In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!
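The ‘by hand’ version mentioned above can be sketched like this. It is in Python rather than Kotlin, purely for illustration, and the function name average_of_multiples is made up. The point is that only a running total and a counter are kept; no list of 50 million values is ever built.

```python
def average_of_multiples(limit, divisor=3):
    """Average of the values in 1..limit that are divisible by `divisor`."""
    total = 0
    count = 0
    for value in range(1, limit + 1):  # values are produced one at a time
        if value % divisor == 0:
            total += value
            count += 1
    # Guard against an empty selection (for example limit < divisor).
    return total / count if count else float("nan")
```

This is essentially what the Sequence pipeline does, while the List version stores every intermediate result first.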

Leveraging lazy-evaluation allows avoiding the creation of intermediate objects that are irrelevant from the point of the end goal.
Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.
Initial code produces a list containing 50_000_000 objects:
val list = generateSequence(1) { it + 1 }
    .take(50_000_000)
    .toList()
then iterates through it and creates another list containing a subset of its elements:
.filter { it % 3 == 0 }
... and then proceeds with calculating the average:
.average()
Using sequences allows you to avoid all those intermediate steps. The code below doesn't produce 50_000_000 elements; it's just a representation of the 1...50_000_000 sequence:
val sequence = generateSequence(1) { it + 1 }
    .take(50_000_000)
Adding a filter to it doesn't trigger the calculation either; it just derives a new sequence from the existing one (3, 6, 9...):
.filter { it % 3 == 0 }
and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:
.average()
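The same "nothing happens until the terminal operation" behaviour can be observed with Python generators, used here only as an analogy for Kotlin sequences. The helper names traced_numbers and lazy_average are hypothetical, and the sketch assumes limit is at least 3 so that the average is defined.

```python
from itertools import count, islice


def traced_numbers(limit, log):
    """Yield 1..limit, recording each value in `log` at the moment it is produced."""
    for n in islice(count(1), limit):
        log.append(n)
        yield n


def lazy_average(limit, log):
    """Return (values produced before the terminal step, average of multiples of 3)."""
    source = traced_numbers(limit, log)           # nothing generated yet
    filtered = (n for n in source if n % 3 == 0)  # still nothing generated
    produced_before = len(log)                    # the pipeline is only a description so far
    total = 0
    seen = 0
    for n in filtered:                            # the terminal step drives the whole pipeline
        total += n
        seen += 1
    return produced_before, total / seen          # assumes limit >= 3
```

Calling lazy_average(9, log) with an empty log shows that no value was produced while the pipeline was being assembled, and that all nine values were produced only while the average was being computed.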
Some relevant reading:
Kotlin: Beware of Java Stream API Habits
Kotlin Collections API Performance Antipatterns

Does initialising an auxiliary array to 0 count as n time complexity already?

I'm very new to big O complexity. If an algorithm is given an array and initialises an auxiliary array with the same number of indexes, does that already count as O(n) time, or do you just assume it is O(1), or ignore it altogether?

TL;DR: Ignore it
Long answer: This will depend on the rest of your algorithm as well as what you want to achieve. Typically you will do something useful with the array afterwards which does have at least the same time complexity as filling the array, so that array-filling does not contribute to the time complexity. Furthermore filling an array with 0 feels like something you do to initialize the array, so your "real" algorithm can work properly. But nevertheless there are some cases you could consider.
Please note that I use pseudocode in the following examples, I hope it's clear what the algorithm should do. Also note that all the examples don't do anything useful with the array. It's just to show my point.
Let's say you have the following code:
A = Array[n]
for(i=0, i<n, i++)
    A[i] = 0
print "Hello World"
Then obviously the runtime of your algorithm is highly dependent on the value of n and thus should be counted as linear complexity, O(n).
On the other hand, if you have a much more complicated function, say this one:
A = Array[n]
for(i=0, i<n, i++)
    A[i] = 0
for(i=0, i<n, i++)
    for(j=n-1, j>=0, j--)
        print "Hello World"
Then even if you take the complexity of filling the array into account, you end up with a complexity of O(n^2 + n), which is in the class O(n^2), so it does not matter in this case.
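To see why the filling step disappears next to the nested loop, one can simply count the steps. The sketch below is a Python rendering of the pseudocode above with counters added; the name count_steps is made up for this illustration, and the counter increment stands in for the print.

```python
def count_steps(n):
    """Return (fill_steps, loop_steps) for the second example with input size n."""
    fill_steps = 0
    aux = [None] * n
    for i in range(n):                   # initialising the auxiliary array
        aux[i] = 0
        fill_steps += 1
    loop_steps = 0
    for i in range(n):                   # the "real" work: a doubly nested loop
        for j in range(n - 1, -1, -1):
            loop_steps += 1              # stands in for print "Hello World"
    return fill_steps, loop_steps
```

For n = 100 the fill takes 100 steps and the nested loop 10,000, so the fill is about 1% of the total, and its share keeps shrinking as n grows.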
The most interesting case is surely when you have different options to use as basic operation. Say we have the following code (someFunction being an arbitrary function):
A = Array[n*n]
for(i=0, i<n*n, i++)
    A[i] = 0
for(i=0, i*i<n, i++)
    someFunction(i)
Now it depends on what you choose as the basic operation, and that choice depends heavily on what you want to achieve. Let's say someFunction is a very cheap function (regarding time complexity) and accessing the array A is more expensive. Then you would probably go with O(n^2), since the array is accessed n^2 times. If, on the other hand, someFunction is expensive compared to filling the array, you would probably choose it as the basic operation and go with O(sqrt(n)).
Please be aware that one could also come to the conclusion that, since the first part (array filling) is executed more often than the other part (someFunction), it does not matter which of the two operations takes longer to finish, since at some point the array filling will need more time. Thus you could argue that the complexity has to be quadratic, O(n^2). This may be right from a theoretical point of view. But in real life you will usually have one operation you want to count and won't care about the others.
Actually, you could consider either ignoring the array filling or taking it into account in all the examples I provided above, depending on whether print or accessing the array is more expensive. But I hope that in the first two examples it is obvious which one adds more runtime and thus should be considered the basic operation.

What is the quickest way to iterate through a Iterator in reverse

Let's say I'd like to iterate through a generic iterator in reverse, without knowing about the internals of the iterator and without cheating via untyped magic. Assuming this could be any type of iterable that provides an iterator, can we optimise reversing an iterator at runtime, or even via macros?
Forwards
var a = [1, 2, 3, 4].iterator();
// Actual iteration below
for(i in a) {
    trace(i);
}
Backwards
var a = [1, 2, 3, 4].iterator();
// Actual reverse iteration below
var s = [];
for(i in a) {
    s.push(i);
}
s.reverse();
for(i in s) {
    trace(i);
}
I would assume that there has to be a simpler way, or at least a faster way, of doing this. We can't know the size because the Iterator class doesn't carry one, so we can't invert the push onto the temp array. But we can remove the reverse, because we do know the size of the temp array.
var a = [1, 2, 3, 4].iterator();
// Actual reverse iteration below
var s = [];
for(i in a) {
    s.push(i);
}
var total = s.length;
var totalMinusOne = total - 1;
for(i in 0...total) {
    trace(s[totalMinusOne - i]);
}
Are there any more optimisations that could be used to remove the need for the array?

It bugs me that you have to duplicate the list, though... that's nasty. I mean, the data structure would ALREADY be an array, if that was the right data format for it. A better thing (less memory fragmentation and reallocation) than an Array (the "[]") to copy it into might be a linked List or a Hash.
But if we're using arrays, then Array Comprehensions (http://haxe.org/manual/comprehension) are what we should be using, at least in Haxe 3 or better:
var s = [for (i in a) i];
Ideally, at least for large iterators that are accessed multiple times, s should be cached.
To read the data back out, you could instead do something a little less wordy, but quite nasty, like:
for (i in 1-s.length ... 1) {
    trace(s[-i]);
}
But that's not very readable and if you're after speed, then creating a whole new iterator just to loop over an array is clunky anyhow. Instead I'd prefer the slightly longer, but cleaner, probably-faster, and probably-less-memory:
var i = s.length;
while (--i >= 0) {
    trace(s[i]);
}
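The buffer-then-walk-backwards approach from this answer can be sketched in a language-neutral way as follows. The sketch is in Python rather than Haxe, and the name reversed_from_iterator is hypothetical. It still copies everything once, which is unavoidable for a one-way iterator of unknown length, but it avoids both the separate reverse() pass and the extra range iterator.

```python
def reversed_from_iterator(iterator):
    """Yield the items of a one-way iterator in reverse order.

    The iterator is drained into a buffer first, then the buffer is
    walked from the last index down to 0.
    """
    buffer = list(iterator)
    i = len(buffer)
    while i > 0:
        i -= 1
        yield buffer[i]
```

As suggested above, the buffer is worth caching if the same data has to be traversed in reverse more than once.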

First of all, I agree with Dewi Morgan: duplicating the output generated by an iterator in order to reverse it somewhat defeats its purpose (or at least some of its benefits). Sometimes it's okay, though.
Now, about a technical answer:
By definition a basic iterator in Haxe can only compute the next iteration.
As for why iterators are one-way by default, here's what we can notice:
if all iterators could run both backwards and forwards, the Iterator classes would take more time to write.
not all iterators run on collections or series of numbers.
E.g. 1: an iterator running on the standard input.
E.g. 2: an iterator running on a parabolic or more complicated trajectory for a ball.
E.g. 3: slightly different, but think about the performance problems of running an iterator on a very large singly linked list (e.g. the class List). Some iterators can also be interrupted in the middle of the iteration (Lambda.has() and Lambda.indexOf(), for instance, return as soon as there is a match), so you normally don't want to think of what's iterated as a collection but rather as an interruptible series or process iterated step by step.
While this doesn't mean you shouldn't define two-way iterators if you need them (I've never done it in Haxe, but it doesn't seem impossible), two-way iterators aren't that natural in general, and forcing every Iterator to be like that would complicate writing one.
An intermediate and more flexible solution is to simply have a ReverseXxIter where you need one, for instance ReverseIntIter, or Array.reverseIter() (via using with a custom ArrayExt class). So it's left to every programmer to write their own answers, which I think is a good balance: while it takes more time and frustration in the beginning (everybody probably had the same kind of questions), you end up knowing the language better, and in the end there are only benefits for you.
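As an illustration of the ReverseXxIter idea, here is what a dedicated reverse counterpart to an integer range iterator could look like. It is sketched in Python, not Haxe, and the class is a hypothetical example rather than part of any standard library. It works without any copying only because the bounds are known up front, which is exactly what a generic one-way iterator does not offer.

```python
class ReverseIntIter:
    """Counts down from end - 1 to start, mirroring a forward start...end range."""

    def __init__(self, start, end):
        self._start = start
        self._next = end - 1

    def __iter__(self):
        return self

    def __next__(self):
        if self._next < self._start:
            raise StopIteration
        value = self._next
        self._next -= 1
        return value
```

Iterating over ReverseIntIter(0, 4) visits 3, 2, 1, 0.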

Complementing Dewi Morgan's post, you can use for(let i = a.length; --i >= 0;) i; if you wish to simplify the while() method. If you really need the index values, I think for(let i=a.length, k=keys(a); --i in k;) a[k[i]]; is the best you can do while keeping the performance. There is also for(let i of keys(a).reverse()) a[i]; which is cleaner to write, but the .reverse() call adds one extra pass (1n) over the keys.

What naming conventions should I use on the second integer on a nested for loop?

I'm pretty new to programming, and I was just wondering in the following case what would be an appropriate name for the second integer I use in this piece of code
for (int i = 0; i < 10; i++)
{
    for (int x = 0; x < 10; x++)
    {
        //stuff
    }
}
I usually just name it x but I have a feeling that this could get confusing quickly. Is there a standard name for this kind of thing?

Depending upon what you're iterating over, a name might be easy or obvious by context:
for (struct mail *mail = inbox->start; mail; mail++) {
    for (struct attachment *att = mail->attachment[0]; att; att++) {
        /* work on all attachments on all mails */
    }
}
For the cases where i makes the most sense for an outer loop variable, convention uses j, k, l, and so on.
But when you start nesting, look harder for meaningful names. You'll thank yourself in six months.
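A small side-by-side illustration of the two conventions, sketched in Python; both function names and the multiplication-table task are made up for this example. The two functions do exactly the same work and only the loop variable names differ.

```python
def table_with_short_names(size):
    """Conventional i / j indices: fine for short, purely numeric loops."""
    table = []
    for i in range(size):
        cells = []
        for j in range(size):
            cells.append(i * j)
        table.append(cells)
    return table


def table_with_meaningful_names(size):
    """Same logic, but the names say what each index refers to."""
    table = []
    for row in range(size):
        cells = []
        for column in range(size):
            cells.append(row * column)
        table.append(cells)
    return table
```

With only two levels the difference is small. It grows quickly once the loop body gets longer or a third level is added.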

You could opt to reduce the nesting by making a method call. Inside of this method, you would be using a local variable also named i.
for (int i = 0; i < 10; i++)
{
    methodCall(array[i], array);
}
I have assumed you need to pass the element at position i in the outer loop as well as the array to be iterated over in the inner loop - this is an assumption as you may actually require different arguments.
As always, you should measure the performance of this - there shouldn't be a massive overhead in making a method call within a loop, but this depends on the language.

Personally, I feel that you should give variables meaningful names. Here i and x mean nothing and will not help you understand your code in 3 months' time, at which point it will appear to you as code written by a dyslexic monkey.
Name variables so that other people can understand what your code is trying to accomplish. You will save yourself time in the long run.

Since you said you are beginning, I'd say it's beneficial to experiment with multiple styles.
For the purposes of your example, my suggestion is simply replace x with j.
There's tons of real code that will use the convention of i, j, and k for single letter nested loop variables.
There's also tons that uses longer more meaningful names.
But there's much less that looks like your example.
So you can consider it a step forward, because your code looks more like real-world code.
