Large list literals in Kotlin stalling/crashing compiler - kotlin

I'm using val globalList = listOf("a1" to "b1", "a2" to "b2") to create a large list of Pairs of strings.
All is fine until I try to put more than 1000 Pairs into a List. Then the compiler either takes more than 5 minutes or just crashes (in both IntelliJ and Android Studio).
Same happens if you use simple lists of Strings instead of Pairs.
Is there a better way / best practice to include large lists in your source code without resorting to a database?

You can replace a listOf(...) expression with a list created using a constructor or a factory function and adding the items to it:
val globalList: List<Pair<String, String>> = mutableListOf<Pair<String, String>>().apply {
    add("a1" to "b1")
    add("a2" to "b2")
    // ...
}
This is definitely a simpler construct for the compiler to analyze.

If you need something quick and dirty instead of data files, one workaround is to use a large string, then split and map it into a list. Here's an example mapping into a list of Ints.
val onCommaWhitespace = "[\\s,]+".toRegex() // in this example split on commas w/ whitespace
val bigListOfNumbers: List<Int> = """
0, 1, 2, 3, 4,
:
:
:
8187, 8188, 8189, 8190, 8191
""".trimIndent()
    .split(onCommaWhitespace)
    .map { it.toInt() }
Of course for splitting into a list of Strings, you'd have to choose an appropriate delimiter and regex that don't interfere with the actual data set.

There's no good way to do what you want; for something that size, reading the values from a data file (or calculating them, if that were possible) is a far better solution all round — more maintainable, much faster to compile and run, easier to read and edit, less likely to cause trouble with build tools and frameworks…
If you let the compiler finish, its output will tell you the problem.  (‘Always read the error messages’ should be one of the cardinal rules of development!)
I tried hotkey's version using apply(), and it eventually gave this error:
…
Caused by: org.jetbrains.org.objectweb.asm.MethodTooLargeException: Method too large: TestKt.main ()V
…
There's the problem: MethodTooLargeException.  The JVM allows only 65535 bytes of bytecode within a single method; see this answer.  That's the limit you're coming up against here: once you have too many entries, its code would exceed that limit, and so it can't be compiled.
If you were a real masochist, you could probably work around this to an extent by splitting the initialisation across many methods, keeping each one's code just under the limit.  But please don't!  For the sake of your colleagues, for the sake of your compiler, and for the sake of your own mental health…
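As a rough illustration of the data-file approach recommended above, here is a minimal sketch. The file name pairs.csv, the one-pair-per-line "key,value" format, and the helper name parsePairs are all assumptions made for the example, not something from the question.

```kotlin
import java.io.File

// Hypothetical format: one "key,value" pair per line; blank lines are skipped.
fun parsePairs(lines: Sequence<String>): List<Pair<String, String>> =
    lines
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .map { line ->
            val parts = line.split(",", limit = 2)
            require(parts.size == 2) { "Malformed line: $line" }
            parts[0].trim() to parts[1].trim()
        }
        .toList()

// Hypothetical file name; the file is read once, on first access.
val globalList: List<Pair<String, String>> by lazy {
    File("pairs.csv").useLines { parsePairs(it) }
}

fun main() {
    // Parsing in-memory lines keeps the demo independent of any file.
    println(parsePairs(sequenceOf("a1,b1", "a2, b2", "")))
    // prints [(a1, b1), (a2, b2)]
}
```

Because the entries live in data rather than in bytecode, the size of the list no longer affects compile time or the per-method bytecode limit.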

Related

How to find the reasoning behind IntelliJ Kotlin lint check

I received a lint message while writing Kotlin using IntelliJ as my IDE: The argument can be converted to 'Set' to improve performance
My code went something like this:
val variable3 = variable1 - variable2 // both variables are of type List<Int>
The linter recommended that I change it to val variable3 = variable1 - variable2.toSet()
I would like to find out why it recommends this change and where the documentation is so next time I can look up the messages and learn the reasoning behind the lint check.
This is about performance and efficiency — in particular, about how the performance scales as your data gets bigger.
The - operator calls the standard library's Iterable<T>.minus(elements: Iterable<T>) extension function. If you look at its code (which you can do in IntelliJ), you'll see that that works by taking the first iterable (variable1 in this case) and filtering it to keep only the values not in the second one (variable2).
How does it check whether an element is in the second one? By calling its contains() method. But how that works will depend on the type of iterable. Most Sets, for example, can look up by hash code, which takes a short time, regardless of how big the set is.
However, most Lists and other iterables can't do that: they need to search through the whole list, element by element. How long that takes will obviously depend on the size of the list — it'll be very quick for short lists, but it might take some time to search through a list with thousands or millions of elements.
What makes it particularly important in this case is that it has to do that search repeatedly: once for each element of the first one. So the time can really mount up.
Say that the first iterable has 𝐌 elements, and the second has 𝐍. The subtraction has to make 𝐌 checks; if the second one is a set, then each check takes about the same time, so the overall time is proportional to 𝐌. But if not, then each check will take time proportional to 𝐍, so the overall time is proportional to 𝐌×𝐍, which can get very big very fast! (For example, if you make each list 10× bigger, it'll take 100× as long.)
So if you don't want your program to grind to a halt when it starts handling more data, it can be well worth converting the second list to a set first. For small amounts of data, it adds a little extra work, but that probably won't be noticeable; and for large amounts of data, it can be a really big win.
That's why the IntelliJ inspection suggests it.
(For those of you who know about algorithmic complexity, please forgive the simplifications I've made here :)
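To make the difference concrete, here is a simplified sketch of the two strategies described above. The function names minusScanning and minusHashed are made up for illustration, and the bodies are a simplification rather than the standard library's actual source.

```kotlin
// Each membership check scans `second` element by element.
fun minusScanning(first: List<Int>, second: List<Int>): List<Int> =
    first.filter { it !in second }          // roughly M × N comparisons

// Builds a hash set once; each membership check is then a hash lookup.
fun minusHashed(first: List<Int>, second: List<Int>): List<Int> {
    val lookup = second.toHashSet()
    return first.filter { it !in lookup }   // roughly M lookups
}

fun main() {
    val variable1 = listOf(1, 2, 3, 4, 5)
    val variable2 = listOf(2, 4)
    println(minusScanning(variable1, variable2)) // prints [1, 3, 5]
    println(minusHashed(variable1, variable2))   // prints [1, 3, 5]
    println(variable1 - variable2.toSet())       // prints [1, 3, 5]
}
```

All three produce the same result; only the cost of each membership check differs as the second collection grows.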
Interestingly, when I tried it myself (in IntelliJ 2021.2.3 with Kotlin v1.5.73), it didn't make that suggestion. And looking into that implementation of the standard library, I see that in some cases the minus() method will do the conversion for you! However, I think it doesn't cover some other common cases, so it's still worth doing the conversion yourself if you think the lists could get big.
You can hover over the highlighted code and open the inspection's documentation from the popup.
The general quick-fix options can be found, and turned on or off, under
Settings -> Editor -> Inspections -> Kotlin

How to yield all substrings from string using sequence?

I'm trying to learn the Sequence in Kotlin.
Assume I want to get a sequence of all substrings of a string with the yield statement. I understand how to do this with two nested loops with the right and left borders.
It seems to me that there is an efficient way to use a Sequence or a pair of nested Sequences instead of loops. But I can't figure out how to do it.
How to yield all substrings from string using sequence?
Thanks
Frankly, I don't know what is the most efficient method. And I would just use for loops. But here's my solution to this problem, maybe it will help you understand sequences and this style of writing code:
Here it is on the Playground
fun String.substrings() =
    indices.asSequence().flatMap { left ->
        (left + 1..length).asSequence().map { right -> substring(left, right) }
    }
Sequences aren't especially efficient; there's a bunch of overhead involved for each one. Their main strength is being able to pass each element through the whole chain of operations one at a time.
This means you don't have to create an entire new collection of elements for each intermediate step (lower memory usage), you can terminate earlier once you find a result you're looking for, and sequences can be infinite. Even then, they might still be slower than the normal list version, depending on exactly what you're working with.
The most efficient sequence is probably what you're doing, using a couple of for loops and yielding items. But if you mean "efficient" like "using the standard library instead of writing out for loops" then @Furetur's answer is a way to do it, or you could use sliding windows like this:
val stuff = "12345"
val substrings = with(stuff) {
    indices.asSequence().flatMap { i ->
        windowedSequence(length - i)
    }
}
print(substrings.toList())
>>>>[12345, 1234, 2345, 123, 234, 345, 12, 23, 34, 45, 1, 2, 3, 4, 5]
This basically just uses windowed (with the default of partialWindows=false) for every possible substring length, from length down to 1, using the sequence versions of everything.
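For comparison, the loop-and-yield approach mentioned above can be sketched with the sequence builder. The extension name substringsByYield is made up for this example.

```kotlin
// Lazily yields every non-empty substring, ordered by left border, then right border.
fun String.substringsByYield(): Sequence<String> = sequence {
    for (left in indices) {
        for (right in left + 1..length) {
            yield(substring(left, right))
        }
    }
}

fun main() {
    println("abc".substringsByYield().toList())
    // prints [a, ab, abc, b, bc, c]

    // Nothing past the first match is computed, because the sequence is lazy.
    println("12345".substringsByYield().first { it.length == 3 })
    // prints 123
}
```

The ordering differs from the windowed version (which groups by length), but the set of substrings is the same.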

Creating 4 digit number with no repeating elements in Kotlin

Thanks to @RedBassett for this resource (Kotlin problem solving): https://kotlinlang.org/docs/tutorials/koans.html
I'm aware this question exists here:
Creating a 4 digit Random Number using java with no repetition in digits
but I'm new to Kotlin and would like to explore the direct Kotlin features.
So, as the title suggests, I'm trying to find a Kotlin-specific way to nicely generate a 4 digit number without repeating digits (after that it's easy to adapt it for length x).
This is my current working solution, and I would like to make it more idiomatic Kotlin. I would be very grateful for some input.
fun createFourDigitNumber(): Int {
    var fourDigitNumber = ""
    val rangeList = { (0..9).random() }
    while (fourDigitNumber.length < 4) {
        val num = rangeList().toString()
        if (!fourDigitNumber.contains(num)) fourDigitNumber += num
    }
    return fourDigitNumber.toInt()
}
So the range you define (0..9) is actually already a sequence of numbers. Instead of iterating and repeatedly generating a new random, you can just use a subset of that sequence. In fact, this is the accepted answer's solution to the question you linked. Here are some pointers if you want to implement it yourself to get the practice:
The first for loop in that solution is unnecessary in Kotlin because of the range. 0..9 does the same thing, you're on the right track there.
In Kotlin you can call .shuffled() directly on the range without needing to call Collections.shuffle() with an argument like they do.
You can avoid another loop if you create a string from the whole range and then return a substring.
If you want to look at my solution (with input from others in the comments), it is in a spoiler here:
fun getUniqueNumber(length: Int) = (0..9).shuffled().take(length).joinToString("")
(Note that this doesn't gracefully handle a length above 10, but that's up to you to figure out how to implement. It is up to you to use subList() and then toString(), or toString() and then substring(), the output should be the same.)
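One possible way to handle the length limit is sketched below. The function name uniqueDigits and the optional random parameter are assumptions added for the example (the parameter only makes the function easier to test). It returns a String rather than an Int so that a leading zero is not lost.

```kotlin
import kotlin.random.Random

// Returns `length` distinct decimal digits as a string; rejects impossible lengths.
fun uniqueDigits(length: Int, random: Random = Random.Default): String {
    require(length in 1..10) { "length must be between 1 and 10, was $length" }
    return (0..9).shuffled(random).take(length).joinToString("")
}

fun main() {
    val digits = uniqueDigits(4)
    println(digits)                    // four distinct digits; the value varies per run
    println(digits.toSet().size == 4)  // prints true
}
```

Calling uniqueDigits(11) throws an IllegalArgumentException instead of silently returning only ten digits.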

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
	value = reflect.ValueOf(m).Elem()
	for i := 0; i < value.NumField(); i++ {
		if t, ok := quick.Value(value.Field(i).Type(), r); ok {
			value.Field(i).Set(t)
		}
	}
	return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with randomly generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
	out = in // Make sure you return something
	if in.Kind() == reflect.String {
		instr := in.String()
		t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
		outstr, _, _ := transform.String(t, instr)
		out = reflect.ValueOf(outstr)
	}
	return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generator interface needs a method declared on the type itself, not on a pointer to the type. As far as I can tell, quick.Value checks a value of the field's type for the interface, and a pointer-receiver method is not in that value's method set. You want your method signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the receiver of the Generate method from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
	var buffer bytes.Buffer
	for i := 0; i < size; i++ {
		buffer.WriteString(string(latin[rand.Intn(len(latin))]))
	}
	s := LatinString(buffer.String())
	return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin)"
Now for the longer version...
The idea behind my using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There's some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support it. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible code points used by testing/quick is just so large (up to 0x10FFFF) and the set of reasonable characters so small that you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.) Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) get a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

If Statement Optimization - comparing character strings vs constant boolean flags

Consider the following Java code:
public void DoStuff(String[] strings, boolean preEval)
{
    final String compareTo = "A Somewhat Long String of Characters";
    for ( int i = 0; i < strings.length; ++i )
    {
        if ( preEval )
        {
            if( strings[i].equals(compareTo) )
            {
                //do something process intensive
            }
        }
        //do something process intensive
    }
}
Now pay attention to if (preEval) and the inner statement within that. If the algorithm in use requires a condition such as preEval, does it make sense to include the preEval condition for the purposes of code optimization?
From my understanding, evaluating to see if a conditional flag resolves to true or false is much faster than iterating through a collection of characters and comparing each character within that collection with another corresponding character from a different collection.
My knowledge of assembly is about 30% I'd say in terms of the internals and opcodes/mnemonics involved, hence why I'm asking this question.
Update
Note: the code posted here is meant to be language independent; I simply chose Java just for the sake of something tangible and easy to read, as well as something which is widely known among the programmer community.
I would say that this would probably be an optimization in most cases.
That said, you should not spend time on optimizing code that has not been measured.
This might for example not be a worthwhile optimization if:
most of your cases involve few strings or very short strings.
it takes a long time to calculate the preEval parameter before calling the function.
Measure your code under realistic circumstances, identify your bottlenecks, and then optimize.
A less costly approach might be to use the HashSet::contains(string) method to check for the existence of a string in a collection. You can probably design away the need for string comparisons while iterating by using a HashSet of strings or a HashMap keyed by String.
I always try to use a HashMap where I can to avoid conditional logic entirely.
_ryan
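Since the question describes its code as language independent, here is a small Kotlin sketch of the hash-set idea. The function name countMatches and the targets parameter are invented for the example. It combines the cheap flag check with a set lookup in place of pairwise equals() calls.

```kotlin
// Counts the strings that match any target; `targets` is built once, up front.
fun countMatches(strings: Array<String>, targets: Set<String>, preEval: Boolean): Int {
    if (!preEval) return 0                  // the cheap flag check is done once, outside the loop
    return strings.count { it in targets }  // hash lookup instead of comparing against each target
}

fun main() {
    val targets = hashSetOf("A Somewhat Long String of Characters")
    val strings = arrayOf("x", "A Somewhat Long String of Characters", "y")
    println(countMatches(strings, targets, preEval = true))   // prints 1
    println(countMatches(strings, targets, preEval = false))  // prints 0
}
```

Whether this is faster than a direct comparison depends on how many targets there are and on the data, so it is still worth measuring first.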