Determine if List of Strings contains substring of all Strings in other list - kotlin

I have this situation:
val a = listOf("wwfooww", "qqbarooo", "ttbazi")
val b = listOf("foo", "bar")
I want to determine if all items of b are substrings of items in a, so the desired function should return true in the situation above. The best I can come up with is this:
return a.any { it.contains("foo") } && a.any { it.contains("bar") }
But it iterates over a twice. a.containsAll(b) doesn't work either because it compares on string equality and not substrings.

I'm not sure there is any way of doing that without iterating over a up to b.size times. If you only want a single pass over a, you have to check every element of b against each element of a, so now you're iterating over b a.size times instead. In that scenario you also need to keep track of which items in b already had a match, so you don't check them again; you can only do that by removing them from the list (or from a copy used instead of b), or by using another list to record the matches and comparing it to the original b afterwards. Either of those might be worse than just iterating over a.
So I think you are on the right track with your code there, but there are some issues. For example, you don't have any reference to b, just hardcoded strings; doing it like that for all elements of b would result in quite a big function if there are more than two, and it won't work at all if you don't already know the values.
This code does the same thing as the one you put above, but it actually uses the elements of b rather than hardcoded strings that happen to match b. (It iterates over b once, and partially over a once per element of b.)
return b.all { bItem ->
    a.any { it.contains(bItem) }
}
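For illustration, here's the same check wrapped in a function and run against the lists from the question (a minimal sketch; allContained is a made-up name):
fun allContained(a: List<String>, b: List<String>): Boolean =
    b.all { bItem -> a.any { it.contains(bItem) } }

fun main() {
    val a = listOf("wwfooww", "qqbarooo", "ttbazi")
    val b = listOf("foo", "bar")
    println(allContained(a, b))  // prints "true"
}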

Alex's answer is by far the simplest approach, and is almost certainly the best one in most circumstances.
However, it has complexity A*B (where A and B are the sizes of the two lists) — which means that it doesn't scale: if both lists get big, it'll get very slow.
So for completeness, here's a way that's more involved, and slower for the small cases, but has complexity proportional to A+B and so can cope efficiently with much larger lists.
The idea is to preprocess the a list, to generate a set of all the possible substrings, and then scan through the b list just checking for inclusion in that set.  (The preprocessing step takes time proportional* to A.  Converting the substrings into a set means that it can check whether a string is present in constant time, using its hash code; so the rest then takes time proportional to B.)
I think this is clearest using a helper function:
/**
 * Generates a list of all possible substrings, including
 * the string itself (but excluding the empty string).
 */
fun String.substrings()
    = indices.flatMap { start ->
        ((start + 1)..length).map { end ->
            substring(start, end)
        }
    }
For example, "1234".substrings() gives [1, 12, 123, 1234, 2, 23, 234, 3, 34, 4].
Then we can generate the set of all substrings of items from a, and check that every item of b is in it:
return a.flatMap{ it.substrings() }
        .toSet()
        .containsAll(b)
(* Actually, the complexity is also affected by the lengths of the strings in the a list.  Alex's version is directly proportional to the average length, while the preprocessing part of the algorithm above is proportional to its square (as indicated by the map nested in the flatMap).  That's not good, of course; but in practice while the lists are likely to get longer, the strings within them probably won't, so that's unlikely to be significant.  Worth knowing about, though.
And there are probably other, still more complex algorithms, that scale even better…)

Related

What is the most efficient way to join one list to another in kotlin?

I start with a list of integers from 1 to 1000 in listOfRandoms.
I would like to left join on random from the createDatabase list.
I am currently using a find{} statement within a loop to do this but feel like this is too heavy. Is there not a better (quicker) way to achieve the same result?
Pseudo code
data class DatabaseRow(
    val refKey: Int,
    val random: Int
)

fun main() {
    val createDatabase = (1..1000).map { i -> DatabaseRow(i, Random()) }
    val listOfRandoms = (1..1000).map { j ->
        val lookup = createDatabase.find { it.refKey == j }
        lookup.random
    }
}
As mentioned in comments, the question seems to be mixing up database and programming ideas, which isn't helping.
And it's not entirely clear which parts of the code are needed, and which can be replaced. I'm assuming that you already have the createDatabase list, but that listOfRandoms is open to improvement.
The ‘pseudo’ code compiles fine except that:
You don't give an import for Random(), but none of the likely ones return an Int. I'm going to assume that should be kotlin.random.Random.nextInt().
And because lookup is nullable, you can't simply call lookup.random; a quick fix is lookup!!.random, but it would be safer to handle the null case properly with e.g. lookup?.random ?: -1. (That's irrelevant, though, given the assumption above.)
I think the general solution is to create a Map. This can be done very easily from createDatabase, by calling associate():
val map = createDatabase.associate{ it.refKey to it.random }
That should take time roughly proportional to the size of the list. Looking up values in the map is then very efficient (approx. constant time):
map[someKey]
In this case, that takes rather more memory than needed, because both keys and values are integers and will be boxed (stored as separate objects on the heap). Also, most maps use a hash table, which takes some memory.
Since the key is (according to comments) “an ascending list starting from a random number, like 18123..19123”, in this particular case it can instead be stored in an IntArray without any boxing. As you say, array indexes start from 0, so using the key directly would need a huge array and use only the last few cells; but if you know the start key, you can simply subtract it from the key to get the array index each time.
Creating such an array would be a bit more complex, for example:
val minKey = createDatabase.minOf{ it.refKey }
val maxKey = createDatabase.maxOf{ it.refKey }
val array = IntArray(maxKey - minKey + 1)
for (row in createDatabase)
    array[row.refKey - minKey] = row.random
You'd then access values with:
array[someKey - minKey]
…which is also constant-time.
Some caveats with this approach:
If createDatabase is empty, then minOf() will throw a NoSuchElementException.
If it has ‘holes’, omitting some keys inside that range, then the array will hold its default value of 0 — you can change that by using the alternative IntArray constructor which also takes a lambda giving the initial value (see the sketch after this list).
Trying to look up a value outside that range will give an ArrayIndexOutOfBoundsException.
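For the ‘holes’ case, that lambda-based constructor might be used like this (a sketch; -1 is an arbitrary ‘missing’ sentinel):
// Initialise every cell to a sentinel value, then overwrite the keys we do have.
val array = IntArray(maxKey - minKey + 1) { -1 }
for (row in createDatabase)
    array[row.refKey - minKey] = row.random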
Whether it's worth the extra complexity to save a bit of memory will depend on things like the size of the ‘database’, and how long it's in memory for; I wouldn't add that complexity unless you have good reason to think memory usage will be an issue.

Is this an example of logarithmic time complexity?

It is commonly said that an algorithm with a logarithmic time complexity O(log n) is one where doubling the input does not necessarily double the amount of work that is required. And oftentimes, search algorithms are given as examples of algorithms with logarithmic complexity.
With this in mind, let's say I have a function that takes an array of strings as the first argument, as well as an individual string as the second argument, and returns the index of the string within the array:
function getArrayItemIndex(array, str) {
    let i = 0
    for(let item of array) {
        if(item === str) {
            return i
        }
        i++
    }
}
And let's say that this function is called as follows:
getArrayItemIndex(['John', 'Jack', 'James', 'Jason'], 'Jack')
In this instance, the function will not end up stepping through the entire array before it returns the index of 1. And similarly, if we were to double the items in the array so that it ends up being called as follows:
getArrayItemIndex(
    [
        'John',
        'Jack',
        'James',
        'Jason',
        'Jerome',
        'Jameson',
        'Jamar',
        'Jabar'
    ],
    'John'
)
...then doubling the items in the array would not have necessarily caused the running time of the function to double, seeing that it would have broken out of the loop and returned after the very first iteration. Because of this, is it then accurate to say that the getArrayItemIndex function has a logarithmic time complexity?
Not quite. What you have here is linear search. Its worst-case performance is Theta(n), since it has to check all the elements if the search target isn't in the list. What you have discovered is that its best-case performance is Theta(1), since the algorithm only has to run a few checks if you get lucky.
Binary search on pre-sorted arrays is an example of an O(log n) worst-case algorithm (the best case is still O(1)). It works like this:
Check the middle element. If it matches, return. Otherwise, if the element is too big, perform binary search on the first half of the array. If it's too small, perform binary search on the second half. Continue until you find the target or you run out of new elements to check.
In Binary Search, we never look at all the elements. That is the difference.
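For comparison, here's roughly what binary search looks like in Kotlin (a sketch; the standard library also provides a built-in binarySearch for sorted lists and arrays):
// Returns the index of target in a sorted array, or -1 if absent.
// Each iteration halves the remaining range, giving O(log n) worst case.
fun binarySearch(sorted: IntArray, target: Int): Int {
    var low = 0
    var high = sorted.size - 1
    while (low <= high) {
        val mid = (low + high) / 2
        when {
            sorted[mid] == target -> return mid
            sorted[mid] > target -> high = mid - 1  // too big: search the first half
            else -> low = mid + 1                   // too small: search the second half
        }
    }
    return -1
}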

How to effectively get the N lowest values from the collection (Top N) in Kotlin?

Is there any other way besides collectionOrSequence.sortedBy{it.value}.take(n)?
Assume I have a collection with 100500+ elements and I need to find the 10 lowest. I'm afraid that sortedBy will create a new temporary collection, from which take will then keep only 10 items.
You could keep a list of the n smallest elements and just update it on demand, e.g.
fun <T : Comparable<T>> top(n: Int, collection: Iterable<T>): List<T> {
    return collection.fold(ArrayList<T>()) { topList, candidate ->
        if (topList.size < n || candidate < topList.last()) {
            // ideally insert at the right place
            topList.add(candidate)
            topList.sort()
            // trim to size
            if (topList.size > n)
                topList.removeAt(n)
        }
        topList
    }
}
That way you only compare each element of your list once against the largest of the top n elements found so far, which is usually faster than sorting the entire list: https://pl.kotl.in/SyQPtDTcQ
If you're running on the JVM, you could use Guava's Comparators.least(int, Comparator), which uses a more efficient algorithm than any of these suggestions, taking O(n + k log k) time and O(k) memory to find the lowest k elements in a collection of size n, as opposed to zapl's algorithm (O(nk log k)) or Lior's (O(nk)).
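A hedged sketch of what that might look like from Kotlin, assuming Guava is on the classpath:
import com.google.common.collect.Comparators

fun main() {
    val values = listOf(42, 7, 19, 3, 88, 1)
    // Collect the 3 smallest values via a Java stream; O(n + k log k) overall.
    val smallest = values.stream().collect(Comparators.least(3, naturalOrder<Int>()))
    println(smallest)  // [1, 3, 7]
}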
You have more to worry about.
collectionOrSequence.sortedBy{it.value} runs java.util.Arrays.sort, which uses TimSort (or mergeSort if requested).
TimSort is great, but it usually takes n*log(n) operations, which is much more than the O(n) cost of copying the array.
Each of those O(n log n) operations runs a function (the lambda you provided, {it.value}), which is an additional meaningful overhead.
Lastly, java.util.Arrays.sort will convert the collection to an Array and back to a List, which is 2 additional conversions (something you wanted to avoid, but this is secondary).
The efficient way to do it is probably:
Map the values used for comparison into a list: O(n) conversions (once per element) rather than O(n log n) or more.
Iterate over that list (or array) to collect the N smallest elements in one pass:
Keep a list of the N smallest elements found so far and their indexes in the original list. If N is small (e.g. 10 items), a MutableList is a good fit.
Keep a variable holding the max value of that small list.
While iterating over the original collection, compare the current element against the max value of the small list. If it's smaller, replace that max in the "small list" and find the list's updated max value.
Use the indexes from the "small list" to extract the 10 smallest elements of the original list.
That would allow you to go from O(n log n) to O(n). (A sketch of these steps follows below.)
Of course, if time is critical - it is always best to benchmark the specific case.
If you managed, on the first step, to extract primitives for the basis of comparison (e.g. int or long) - that would be even more efficient.
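A rough Kotlin sketch of the steps above, tracking just the values rather than indexes (nSmallest is a made-up name):
// One pass, keeping the k smallest elements seen so far in a small mutable list.
// Each element is compared once against the current max of that list; the O(k)
// replace-and-rescan only happens when a new small element is actually found.
fun nSmallest(values: List<Int>, k: Int): List<Int> {
    val smallest = mutableListOf<Int>()
    var currentMax = Int.MIN_VALUE
    for (v in values) {
        if (smallest.size < k) {
            smallest.add(v)
            currentMax = maxOf(currentMax, v)
        } else if (v < currentMax) {
            smallest[smallest.indexOf(currentMax)] = v  // replace the old max
            currentMax = smallest.maxOrNull()!!         // rescan for the new max
        }
    }
    return smallest
}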
I suggest implementing your own selection method based on a typical quickSort partition (ordering ascending and taking the first N elements), if the collection has 1k+ values spread randomly.
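For example, a sketch of that idea using a quickselect-style partition (it mutates its input; quickSelectSmallest is a made-up name):
// Partially orders the array so the n smallest elements occupy positions 0 until n,
// in O(size) average time, then copies them out.
fun quickSelectSmallest(values: IntArray, n: Int): IntArray {
    fun partition(lo: Int, hi: Int): Int {
        val pivot = values[hi]
        var i = lo
        for (j in lo until hi) {
            if (values[j] < pivot) {
                val t = values[i]; values[i] = values[j]; values[j] = t
                i++
            }
        }
        val t = values[i]; values[i] = values[hi]; values[hi] = t
        return i
    }
    var lo = 0
    var hi = values.size - 1
    while (lo < hi) {
        val p = partition(lo, hi)
        when {
            p == n -> break
            p < n -> lo = p + 1
            else -> hi = p - 1
        }
    }
    return values.copyOfRange(0, n)
}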

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.
I read a blog post about this topic and there you can find this code snippets:
List implementation:
val list = generateSequence(1) { it + 1 }
    .take(50_000_000)
    .toList()

measure {
    list
        .filter { it % 3 == 0 }
        .average()
}
// 8644 ms
Sequence implementation:
val sequence = generateSequence(1) { it + 1 }
    .take(50_000_000)

measure {
    sequence
        .filter { it % 3 == 0 }
        .average()
}
// 822 ms
The point here is that the Sequence implementation is about 10x faster.
However, I do not really understand WHY that is. I know that with a Sequence, you do "lazy evaluation", but I cannot find any reason why that helps reducing the processing in this example.
However, here I know why a Sequence is generally faster:
val result = sequenceOf("a", "b", "c")
    .map {
        println("map: $it")
        it.toUpperCase()
    }
    .any {
        println("any: $it")
        it.startsWith("B")
    }
Because with a Sequence you process the data "vertically", when the first element starts with "B", you don't have to map for the rest of the elements. It makes sense here.
So, why is it also faster in the first example?
Let's look at what those two implementations are actually doing:
The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.
(In fact, it's probably far more than that.  As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them — storing references to Int objects.  On the JVM, each reference could be 8 or 16 bytes, and each Int could take 16, giving 1–2GB.  Also, depending how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)
Then it has to read all the values back from the list, filter them, and create another list in memory.
Finally, it has to read all those values back in again, to calculate the average.
The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.
(That's pretty much how you'd do it if you were implementing it ‘by hand’.)
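For illustration, a sketch of that hand-written version:
// Computes the same average with no intermediate storage at all.
fun main() {
    var sum = 0L
    var count = 0L
    for (value in 1..50_000_000) {
        if (value % 3 == 0) {
            sum += value
            count++
        }
    }
    println(sum.toDouble() / count)
}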
You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!
Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).
In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!
Leveraging lazy evaluation allows avoiding the creation of intermediate objects that are irrelevant from the point of view of the end goal.
Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.
Initial code produces a list containing 50_000_000 objects:
val list = generateSequence(1) { it + 1 }
    .take(50_000_000)
    .toList()
then iterates through it and creates another list containing a subset of its elements:
.filter { it % 3 == 0 }
... and then proceeds with calculating the average:
.average()
Using sequences allows you to avoid doing all those intermediate steps. The below code doesn't produce 50_000_000 elements, it's just a representation of that 1...50_000_000 sequence:
val sequence = generateSequence(1) { it + 1 }
    .take(50_000_000)
adding a filter to it doesn't trigger the calculation either, but derives a new sequence from the existing one (3, 6, 9...):
.filter { it % 3 == 0 }
and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:
.average()
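As suggested above, a JMH benchmark would give more trustworthy numbers. A minimal sketch, assuming the JMH annotations are on the classpath (the class and benchmark methods are open so JMH can subclass them; the fixture is smaller than the article's 50M for brevity):
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

@State(Scope.Benchmark)
open class AverageBenchmark {
    private val list = (1..1_000_000).toList()

    @Benchmark
    open fun listVersion(): Double = list.filter { it % 3 == 0 }.average()

    @Benchmark
    open fun sequenceVersion(): Double =
        list.asSequence().filter { it % 3 == 0 }.average()
}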
Some relevant reading:
Kotlin: Beware of Java Stream API Habits
Kotlin Collections API Performance Antipatterns

how to optimize search difference between array / list of object

Premise:
I am using ActionScript with two ArrayCollections containing objects with values to be matched...
I need a solution for this (ideally a library in the framework that does it better); otherwise any suggestions are appreciated...
Let's assume I have two lists of elements A and B (no duplicate values) and I need to compare them and remove all the elements present in both, so at the end I should have
in A all the elements that are in A but not in B
in B all the elements that are in B but not in A
Right now I do something like this:
for (var i:int = 0 ; i < a.length ;)
{
    var isFound:Boolean = false;
    for (var j:int = 0 ; j < b.length ;)
    {
        if (a.getItemAt(i).nome == b.getItemAt(j).nome)
        {
            isFound = true;
            a.removeItemAt(i);
            b.removeItemAt(j);
            break;
        }
        j++;
    }
    if (!isFound)
        i++;
}
I loop over both arrays, and if I find a match I remove the items from both of them (and don't increment the loop counter, so the loop progresses correctly).
I was wondering if there is (and I'm sure there is) a better (and less CPU-consuming) way to do it...
If you must use a list, and you don't need the abilities of ArrayCollection, I suggest simply converting it to using AS3 Vectors. The performance increase according to this (http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/) is 60% compared to Arrays. I believe Arrays are already 3x faster than ArrayCollections, from some article I once read. Unfortunately, this solution is still O(n^2) in time.
As an aside, the reason why Vectors are faster than ArrayCollections is because you provide type-hinting to the VM. The VM knows exactly how large each object is in the collection and performs optimizations based on that.
Another optimization on the Vectors is to sort the data by nome first before doing the comparisons. You can add a check to break out of the inner loop early when, due to the ordering, the nome from list B couldn't possibly be found further down in list A.
If you want to go MUCH faster than that, use an associative array (Object in AS3). Of course, this may require more refactoring effort. I am assuming object.nome is a unique string/id for the objects. Simply use the value of nome as the key in objectA and objectB. By doing it this way, you might not need to loop through each element in each list to do the comparison.
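To illustrate the key-lookup idea (sketched in Kotlin rather than ActionScript, since that's what the rest of this page uses; nome is assumed to be a unique id):
data class Item(val nome: String)

// Index each list by its unique key, then one pass over each list finds the
// non-shared items: O(A + B) instead of the nested loops' O(A * B).
fun difference(a: List<Item>, b: List<Item>): Pair<List<Item>, List<Item>> {
    val aKeys = a.map { it.nome }.toSet()
    val bKeys = b.map { it.nome }.toSet()
    val onlyA = a.filter { it.nome !in bKeys }  // in A but not in B
    val onlyB = b.filter { it.nome !in aKeys }  // in B but not in A
    return onlyA to onlyB
}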