using toRegex() with split and replace increases performance - kotlin

I am reading a file line by line consisting of 4 million records where I do split and replace operations to be used further down the line. While doing the split and replace, I found that using toRegex() boosts the performance.
private fun parseRecord(record: String) { // this is the method that will be called for each line
val tokens = record.split(" ") //record will always contain 2 tokens
}
// pseudo-code for calling parseRecord() for each line in the file
for line in file:
parseRecord(line)
This takes approximately 30seconds to complete the processing. Whereas if I use below code, it completes in 20seconds.
private fun parseRecord(record: String) {
val tokens = record.split(" ".toRegex())
}
The above scenario applies for replace as well.
What is the reason behind the increase in the performance while using toRegex()? Also, is there any other efficient way to further more optimise this?
Thanks in advance for the help!

Although both calls look similar, their implementation is very different.
record.split(" ") is implemented by traveling along your record and doing the necessary book keeping.
record.split(" ".toRegex()) is actually implemented as regex.split(record) and uses regular expressions to do the work. This is often more efficient than the naive approach.
You may even squeeze a bit more performance by reusing your regular expression instead of creating a new one every time you split:
val SPACE_REGEX = " ".toRegex()
...
val tokens = record.split(SPACE_REGEX)

If your file is like below, you can avoid toRegex() and pass " " to split because the columns are separated by one literal space, which can be handled normal split. It'd be great performance improvement since Regex may not recognize the situation.
format:
foo1 bar1
foo2 bar2
foo3 bar3
foo4 bar4
foo5 bar5
foo6 bar6
foo7 bar7
You may want to readLines then parallelStream to make your task done in multi-core, if you're writing code on Kotlin/JVM. So:
val processed = File("/path/to/file") // replace this with actual path
.readLines()
.parallelStream()
.map { line ->
replace(split(line)) // assuming `replace` and `split` are given from outer scope
}
.toList()

Related

Finding best delimiter by size of resulting array after split Kotlin

I am trying to obtain the best delimiter for my CSV file, I've seen answers that find the biggest size of the header row. Now instead of doing the standard method that would look something like this:
val supportedDelimiters: Array<Char> = arrayOf(',', ';', '|', '\t')
fun determineDelimiter(headerRow): Char {
var headerLength = 0
var chosenDelimiter =' '
supportedDelimiters.forEach {
if (headerRow.split(it).size > headerLength) {
headerLength = headerRow.split(it).size
chosenDelimiter = it
}
}
return chosenDelimiter
}
I've been trying to do it with some in-built Kotlin collections methods like filter or maxOf, but to no avail (the code below does not work).
fun determineDelimiter(headerRow: String): Char {
return supportedDelimiters.filter({a,b -> headerRow.split(a).size < headerRow.split(b)})
}
Is there any way I could do it without forEach?
Edit: The header row could look something like this:
val headerRow = "I;am;delimited;with;'semi,colon'"
I put the '' over an entry that could contain other potential delimiter
You're mostly there, but this seems simpler than you think!
Here's one answer:
fun determineDelimiter(headerRow: String)
= supportedDelimiters.maxByOrNull{ headerRow.split(it).size } ?: ' '
maxByOrNull() does all the hard work: you just tell it the number of headers that a delimiter would give, and it searches through each delimiter to find which one gives the largest number.
It returns null if the list is empty, so the method above returns a space character, like your standard method. (In this case we know that the list isn't empty, so you could replace the ?: ' ' with !! if you wanted that impossible case to give an error, or you could drop it entirely if you wanted it to give a null which would be handled elsewhere.)
As mentioned in a comment, there's no foolproof way to guess the CSV delimiter in general, and so you should be prepared for it to pick the wrong delimiter occasionally. For example, if the intended delimiter was a semicolon but several headers included commas, it could wrongly pick the comma. Without knowing any more about the data, there's no way around that.
With the code as it stands, there could be multiple delimiters which give the same number of headers; it would simply pick the first. You might want to give an error in that case, and require that there's a unique best delimiter. That would give you a little more confidence that you've picked the right one — though there's still no guarantee. (That's not so easy to code, though…)
Just like gidds said in the comment above, I would advise against choosing the delimiter based on how many times each delimiter appears. You would get the wrong answer for a header row like this:
Type of shoe, regardless of colour, even if black;Size of shoe, regardless of shape
In the above header row, the delimiter is obviously ; but your method would erroneously pick ,.
Another problem is that a header column may itself contain a delimiter, if it is enclosed in quotes. Your method doesn't take any notice of possible quoted columns. For this reason, I would recommend that you give up trying to parse CSV files yourself, and instead use one of the many available Open Source CSV parsers.
Nevertheless, if you still want to know how to pick the delimiter based on its frequency, there are a few optimizations to readability that you can make.
First, note that Kotlin strings are iterable; therefore you don't have to use a List of Char. Use a String instead.
Secondly, all you're doing is counting the number of times a character appears in the string, so there's no need to break the string up into pieces just to do that. Instead, count the number of characters directly.
Third, instead of finding the maximum value by hand, take advantage of what the standard library already offers you.
const val supportedDelimiters = ",;|\t"
fun determineDelimiter(headerRow: String): Char =
supportedDelimiters.maxBy { delimiter -> headerRow.count { it == delimiter } }
fun main() {
val headerRow = "one,two,three;four,five|six|seven"
val chosenDelimiter = determineDelimiter(headerRow)
println(chosenDelimiter) // prints ',' as expected
}

Creating 4 digit number with no repeating elements in Kotlin

Thanks to #RedBassett for this Ressource (Kotlin problem solving): https://kotlinlang.org/docs/tutorials/koans.html
I'm aware this question exists here:
Creating a 4 digit Random Number using java with no repetition in digits
but I'm new to Kotlin and would like to explore the direct Kotlin features.
So as the title suggests, I'm trying to find a Kotlin specific way to nicely solve generate a 4 digit number (after that it's easy to make it adaptable for length x) without repeating digits.
This is my current working solution and would like to make it more Kotlin. Would be very grateful for some input.
fun createFourDigitNumber(): Int {
var fourDigitNumber = ""
val rangeList = {(0..9).random()}
while(fourDigitNumber.length < 4)
{
val num = rangeList().toString()
if (!fourDigitNumber.contains(num)) fourDigitNumber +=num
}
return fourDigitNumber.toInt()
}
So the range you define (0..9) is actually already a sequence of numbers. Instead of iterating and repeatedly generating a new random, you can just use a subset of that sequence. In fact, this is the accepted answer's solution to the question you linked. Here are some pointers if you want to implement it yourself to get the practice:
The first for loop in that solution is unnecessary in Kotlin because of the range. 0..9 does the same thing, you're on the right track there.
In Kotlin you can call .shuffled() directly on the range without needing to call Collections.shuffle() with an argument like they do.
You can avoid another loop if you create a string from the whole range and then return a substring.
If you want to look at my solution (with input from others in the comments), it is in a spoiler here:
fun getUniqueNumber(length: Int) = (0..9).shuffled().take(length).joinToString('')
(Note that this doesn't gracefully handle a length above 10, but that's up to you to figure out how to implement. It is up to you to use subList() and then toString(), or toString() and then substring(), the output should be the same.)

Large list literals in Kotlin stalling/crashing compiler

I'm using val globalList = listOf("a1" to "b1", "a2" to "b2") to create a large list of Pairs of strings.
All is fine until you try to put more than 1000 Pairs into a List. The compiler either takes > 5 minutes or just crashes (Both in IntelliJ and Android Studio).
Same happens if you use simple lists of Strings instead of Pairs.
Is there a better way / best practice to include large lists in your source code without resorting to a database?
You can replace a listOf(...) expression with a list created using a constructor or a factory function and adding the items to it:
val globalList: List<Pair<String, String>> = mutableListOf().apply {
add("a1" to "b1")
add("a2" to "b2")
// ...
}
This is definitely a simpler construct for the compiler to analyze.
If you need something quick and dirty instead of data files, one workaround is to use a large string, then split and map it into a list. Here's an example mapping into a list of Ints.
val onCommaWhitespace = "[\\s,]+".toRegex() // in this example split on commas w/ whitespace
val bigListOfNumbers: List<Int> = """
0, 1, 2, 3, 4,
:
:
:
8187, 8188, 8189, 8190, 8191
""".trimIndent()
.split(onCommaWhitespace)
.map { it.toInt() }
Of course for splitting into a list of Strings, you'd have to choose an appropriate delimiter and regex that don't interfere with the actual data set.
There's no good way to do what you want; for something that size, reading the values from a data file (or calculating them, if that were possible) is a far better solution all round — more maintainable, much faster to compile and run, easier to read and edit, less likely to cause trouble with build tools and frameworks…
If you let the compiler finish, its output will tell you the problem.  (‘Always read the error messages’ should be one of the cardinal rules of development!)
I tried hotkey's version using apply(), and it eventually gave this error:
…
Caused by: org.jetbrains.org.objectweb.asm.MethodTooLargeException: Method too large: TestKt.main ()V
…
There's the problem: MethodTooLargeException.  The JVM allows only 65535 bytes of bytecode within a single method; see this answer.  That's the limit you're coming up against here: once you have too many entries, its code would exceed that limit, and so it can't be compiled.
If you were a real masochist, you could probably work around this to an extent by splitting the initialisation across many methods, keeping each one's code just under the limit.  But please don't!  For the sake of your colleagues, for the sake of your compiler, and for the sake of your own mental health…

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
value = reflect.ValueOf(m).Elem()
for i := 0; i < value.NumField(); i++ {
if t, ok := quick.Value(value.Field(i).Type(), r); ok {
value.Field(i).Set(t)
}
}
return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with random generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
out = in // Make sure you return something
if in.Kind() == reflect.String {
instr := in.String()
t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
outstr, _, _ := transform.String(t, instr)
out = reflect.ValueOf(outstr)
}
return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly the Generate interface needs a function using the type not a the pointer to the type. You want your type signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01233456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
var buffer bytes.Buffer
for i := 0; i < size; i++ {
buffer.WriteString(string(latin[rand.Intn(len(latin))]))
}
s := LatinString(buffer.String())
return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by #nj_ and #jimb and the answer provided by #benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There's some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support it. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters in the set used by the testing/quick is just so large (0x10FFFF) and the set of reasonable characters so small, you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) get a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

If Statement Optimization - comparing character strings vs constant boolean flags

Consider the following Java code:
public void DoStuff(String[] strings, boolean preEval)
{
final String compareTo = "A Somewhat Long String of Characters";
for ( int i = 0; i < strings.length; ++i )
{
if ( preEval )
{
if( strings[i].equals(compareTo) )
{
//do something process intensive
}
}
//do something process intensive
}
}
Now pay attention to if (preEval) and the inner statement within that. If the algorithm in use requires a condition such as preEval, does it make sense to include the preEval condition for the purposes of code optimization?
From my understanding, evaluating to see if a conditional flag resolves to true or false is much faster than iterating through a collection of characters and comparing each character within that collection with another corresponding character from a different collection.
My knowledge of assembly is about 30% I'd say in terms of the internals and opcodes/mnemonics involved, hence why I'm asking this question.
Update
Note: the code posted here is meant to be language independent; I simply chose Java just for the sake of something tangible and easy to read, as well as something which is widely known among the programmer community.
I would say that this would probably be an optimization in most cases.
That said, you should not spend time on optimizing code that has not been measured.
This might for example not be a worthwhile optimization if:
most of your cases involves few strings or very short strings.
it takes a long time to calculate the preEval parameter before calling the function.
Measure your code under realistic circumstances, identify your bottle necks, then you optimize.
A less costly approach might be to use a HashSet::contains(string) method to check for existence of a string in a collection. You can probably design away the need for string compares while iterating using a HashSet of strings or a HashMap keyed by String.
I always try to use a HashMap where i can to avoid conditional logic entirely.
_ryan