Need to filter lines from a sequence but keep it as a single sequence - kotlin

I need to read a file from disk, filter out some rows based on conditions, then return the result as a single stream/sequence, not a sequence of strings. The file is too large to hold in memory all at once, so it must be treated as a Stream/Sequence throughout processing. This is what I tried.
File(filename).bufferedReader()
    // break into lines
    .lineSequence()
    // filter each line based on condition
    .filter { meetsSomeCondition(it) }
    // add newline back in
    .map { (it + "\n").byteInputStream() }
    // reduce back into a single stream with Java's SequenceInputStream
    .reduce<InputStream, ByteArrayInputStream> { acc, i -> SequenceInputStream(acc, i) }
This works when testing on a small file, but with a large file it fails with a StackOverflowError. It seems that Java's SequenceInputStream can't handle being nested repeatedly the way my reduce call does.
I see that SequenceInputStream also has a constructor that accepts an Enumeration, which could be built from a List of elements. But that's the problem: as far as I can tell, it doesn't accept a Stream.

Your code does not really do what you think it does. reduce() is a terminal operation, meaning that it consumes all elements in the sequence, so after the reduce() line the whole file has already been read. Also, the problem is not that SequenceInputStream lacks support for such a reduce operation. You created a very long chain of nested objects, one level per line, and a read on the outermost stream has to be delegated through every level. The SequenceInputStream objects do not know they were chained like this, and there is not much they can do about it.
Instead, you need to keep the sequence "alive" and create an InputStream that reads from the source sequence whenever required. I don't think there is a utility like this in the stdlib; it is a very specialized requirement.
The easiest way is to create a sequence of bytes and then provide an InputStream that reads from it:
File(filename).bufferedReader()
    .lineSequence()
    .filter { meetsSomeCondition(it) }
    .flatMap { "$it\n".toByteArray().asIterable() }
    .asInputStream()
fun Sequence<Byte>.asInputStream() = object : InputStream() {
    val iter = iterator()
    override fun read() = if (iter.hasNext()) iter.next().toUByte().toInt() else -1
}
However, a sequence of bytes isn't the best choice for performance. We can optimize it by working line by line, that is, with a sequence of strings or byte arrays:
File(filename).bufferedReader()
    .lineSequence()
    .filter { meetsSomeCondition(it) }
    .map { "$it\n".toByteArray() }
    .asInputStream()
fun Sequence<ByteArray>.asInputStream() = object : InputStream() {
    val iter = iterator()
    // Start with an empty chunk so that an empty sequence simply yields end-of-stream.
    var curr = ByteArray(0)
    var pos = 0
    override fun read(): Int {
        // Move on to the next chunk (skipping empty ones) once the current one is exhausted.
        while (pos >= curr.size) {
            if (!iter.hasNext()) return -1
            curr = iter.next()
            pos = 0
        }
        return curr[pos++].toUByte().toInt()
    }
}
(Note: starting from an empty chunk instead of calling iter.next() up front means this implementation also works for an empty sequence.)
Still, there is much room for improvement regarding performance. We read from the sequence line by line, but we still read from the InputStream byte by byte. To improve it further, we would need to override more methods of InputStream so that it reads in bigger chunks. If you really care about performance, I suggest looking into the BufferedInputStream implementation and trying to reuse some of its code.
Also, remember to close the file reader that was created in the first step. It won't be closed automatically when the InputStream is closed.
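As a rough sketch of the two suggestions above (not a tuned implementation), the stream can override the bulk read(b, off, len) method so that callers copy a whole chunk at a time, and it can take the reader as a resource to close. The names asChunkedInputStream, onClose and filteredStream are made up for this example and are not part of any library.

```kotlin
import java.io.Closeable
import java.io.File
import java.io.InputStream

// Hypothetical helper: exposes a lazy sequence of byte chunks as an InputStream.
fun Sequence<ByteArray>.asChunkedInputStream(onClose: Closeable? = null): InputStream =
    object : InputStream() {
        private val iter = iterator()
        private var curr = ByteArray(0)
        private var pos = 0

        // Advance to the next non-empty chunk; returns false at the end of the sequence.
        private fun fill(): Boolean {
            while (pos >= curr.size) {
                if (!iter.hasNext()) return false
                curr = iter.next()
                pos = 0
            }
            return true
        }

        override fun read(): Int =
            if (fill()) curr[pos++].toInt() and 0xFF else -1

        // Bulk read: copies as much of the current chunk as fits into the caller's buffer.
        override fun read(b: ByteArray, off: Int, len: Int): Int {
            if (len == 0) return 0
            if (!fill()) return -1
            val n = minOf(len, curr.size - pos)
            curr.copyInto(b, off, pos, pos + n)
            pos += n
            return n
        }

        // Closing the stream also closes the underlying resource, if one was given.
        override fun close() {
            onClose?.close()
        }
    }

// Assumed usage: the caller supplies the filter predicate.
fun filteredStream(file: File, keep: (String) -> Boolean): InputStream {
    val reader = file.bufferedReader()
    return reader.lineSequence()
        .filter(keep)
        .map { "$it\n".toByteArray() }
        .asChunkedInputStream(onClose = reader)
}
```

Lines are still pulled from the file only when the consumer asks for more bytes, so memory use stays bounded by the longest line.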

Related

How do I make this Sequence lazy?

I was trying to generate all permutations of a list in Kotlin. There are a zillion examples out there which return a List<List<T>>, but my input list breaks those as they try to fit all the results in the output list. So I thought I would try to make a version returning Sequence<List<T>>...
fun <T> List<T>.allPermutations(): Sequence<List<T>> {
    println("Permutations of $this")
    if (isEmpty()) return emptySequence()
    val list = this
    return indices
        .asSequence()
        .flatMap { i ->
            val elem = list[i]
            (list - elem).allPermutations().map { perm -> perm + elem }
        }
}
// Then try to print the first permutation
println((0..15).toList().allPermutations().first())
Problem is, Kotlin just seems to give up and asks for the complete contents of one of the nested sequences - so it never (or at least not for a very long time) ends up getting to the first element. (It will probably run out of memory before it gets there.)
I tried the same using Flow<T>, with the same outcome.
As far as I can tell, at no point does my code ask it to convert the sequence into a list, but it seems like something internal is doing it to me anyway, so how do I stop that?
As mentioned in the comments, you have handled the empty base case incorrectly. You should return a sequence of one empty list.
// an empty list has a single permutation - "itself"
if (isEmpty()) return sequenceOf(emptyList())
If you return an empty sequence, first will never find anything - your sequence is always empty - so it will keep evaluating the sequence until it ends, and throw an exception. (Try this with a smaller input like 0..2!)
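Putting the fix together, a minimal sketch of the corrected function (the question's code with only the base case changed and the debug println removed) shows that the sequence is in fact evaluated lazily:

```kotlin
fun <T> List<T>.allPermutations(): Sequence<List<T>> {
    // An empty list has exactly one permutation: the empty list itself.
    if (isEmpty()) return sequenceOf(emptyList())
    val list = this
    return indices
        .asSequence()
        .flatMap { i ->
            val elem = list[i]
            (list - elem).allPermutations().map { perm -> perm + elem }
        }
}

fun main() {
    // Only the first permutation is computed; the rest are never generated.
    println((0..15).toList().allPermutations().first())
    // prints [15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
}
```

Note that list - elem removes the first occurrence of the value, so this version, like the original, assumes the elements are distinct.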

Why can't I use map on Kotlin's Regex result sequence

I worked with Kotlin's Regex API to get occurrences of some regular expression. I wanted to convert the findings directly into another object, so I intuitively used map() on the result sequence.
I was very surprised that the map function is never called but forEach is working. This example should make it clear:
val regex = "a.".toRegex()
val txt = "abacad"
var counter = 0
regex.findAll(txt).forEach { counter++ }
println(counter) // 3
regex.findAll(txt).map { counter++ }
println(counter) // still 3 since map is not called
regex.findAll(txt).forEach { counter++ }
println(counter) // 6
My question is why? Did I overlook it in the documentation?
(tested on Kotlin 1.5.30)
findAll() returns a Sequence<MatchResult>. Operations on Sequence are classified either as intermediate or terminal. The documentation for the functions declares which type they are. map and onEach are intermediate. Their action is deferred until a terminal operation is made. forEach is terminal.
Manipulating a Sequence with map returns a new Sequence that will perform the mapping function only when it is actually iterated, such as by a call to forEach or using it in a for loop.
This is the purpose of Sequence: to defer the transforming function calls. It can reduce allocations of intermediate Lists, and in some cases it avoids applying the transformations to every single item, for example when the terminal call in the chain is a find() call.
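To see the deferral in action, here is a small sketch that reuses the question's regex. The names calls and matchValues are made up for the example. The mapping lambda only runs once a terminal operation such as toList() or first() pulls elements through the chain, and first() stops after a single match:

```kotlin
var calls = 0 // counts how often the mapping lambda actually runs

// Hypothetical helper: builds the lazy chain without executing it.
fun matchValues(txt: String): Sequence<String> =
    "a.".toRegex().findAll(txt).map { calls++; it.value }

fun main() {
    // Intermediate operation only: nothing has been executed yet.
    val mapped = matchValues("abacad")
    println(calls) // 0

    // Terminal operation: the lambda now runs once per match.
    println(mapped.toList()) // [ab, ac, ad]
    println(calls) // 3

    // first() is terminal too, but pulls only one element through the chain.
    println(matchValues("abacad").first()) // ab
    println(calls) // 4
}
```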

With Arrow: How do I apply a transformation of type (X)->IO<Y> to data of type Sequence<X> to get IO<Sequence<Y>>?

I am learning functional programming using Arrow.kt, intending to walk a path hierarchy and hash every file (and do some other stuff). Forcing myself to use functional concepts as much as possible.
Assume I have a data class CustomHash(...) already defined in code. It will be referenced below.
First I need to build a sequence of files by walking the path. This is an impure/effectful function, so it should be marked as such with the IO monad:
fun getFiles(rootPath: File): IO<Sequence<File>> = IO {
rootPath.walk() // This function is of type (File)->Sequence<File>
}
I need to read the file. Again, impure, so this is marked with IO
fun getRelevantFileContent(file: File): IO<Array<Byte>> {
// Assume some code here to extract only certain data relevant for my hash
}
Then I have a function to compute a hash. If it takes a byte array, then it's totally pure. Making it suspend because it will be slow to execute:
suspend fun computeHash(data: Array<Byte>): CustomHash {
// code to compute the hash
}
My issue is how to chain this all together in a functional manner.
fun main(rootPath: File) {
    val x = getFiles(rootPath) // IO<Sequence<File>>
        .map { seq -> // seq is of type Sequence<File>
            seq.map { getRelevantFileContent(it) } // This produces Sequence<IO<Hash>>
        }
}
Right now, if I try this, x is of type IO<Sequence<IO<Hash>>>. It is clear to me why this is the case.
Is there some way of turning Sequence<IO<Any>> into IO<Sequence<Any>>? Which I suppose is essentially, probably getting the terms imprecise, taking blocks of code that execute in their own coroutines and running the blocks of code all on the same coroutine instead?
If Sequence weren't there, I know IO<IO<Hash>> could have been IO<Hash> by using a flatMap in there, but Sequence of course doesn't have that flattening of IO capabilities.
Arrow's documentation has a lot of "TODO" sections and jumps very fast into documentation that presumes a lot of intermediate/advanced functional programming knowledge. It hasn't really been helpful for this problem.
First you need to convert the Sequence to a SequenceK; then you can use the sequence function to turn it inside out:
import arrow.fx.*
import arrow.core.*
import arrow.fx.extensions.io.applicative.applicative
val sequenceOfIOs: Sequence<IO<Any>> = TODO()
val ioOfSequence: IO<Sequence<Any>> = sequenceOfIOs.k()
.sequence(IO.applicative())
.fix()

Clean way of reading all input lines in Kotlin

A common pattern when doing coding challenges is to read many lines of input. Assuming you don't know in advance how many lines, you want to read until EOF (readLine returns null).
Also, as a preface, I don't want to rely on java.util.* since I'm coding in Kotlin/Native, so no Scanner.
I would like to maybe do something like
val lines = arrayListOf<String>()
for (var line = readLine(); line != null; line = readLine()) {
lines.add(line)
}
But that clearly isn't valid Kotlin. The cleanest I can come up with is:
while (true) {
val line = readLine()
if (line == null) break
lines.add(line)
}
This works, but it just doesn't seem very idiomatic. Is there a better way to read all lines into an array, without using a while/break loop?
generateSequence has the nice property that it completes as soon as the generator function returns null, and the resulting sequence can be iterated only once, so the following code is valid:
val input = generateSequence(::readLine)
val lines = input.toList()
Then like s1m0nw1's answer you can use any of the available Sequence<String> methods to refine this as desired for your solution.
I guess you're talking about reading from System.in (stdin) here. You could make that work with sequences:
val lines = generateSequence(readLine()) {
readLine()
}
lines.take(5).forEach { println("read: $it") }
We begin our sequence with a first readLine (the sequence's seed) and then read the next line until null is encountered. The sequence is possibly unbounded, so the example just takes the first five inputs.
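Both variants rely on the same behaviour: generateSequence stops as soon as the generator returns null. A minimal sketch that makes this visible without stdin, assuming a hypothetical readAllLines helper that takes the line source as a parameter (pass ::readLine for real input):

```kotlin
// Hypothetical helper: collects lines until nextLine returns null (EOF).
fun readAllLines(nextLine: () -> String?): List<String> =
    generateSequence(nextLine).toList()

fun main() {
    // Simulate stdin with an iterator over a fixed list of lines.
    val fake = listOf("first", "second", "third").iterator()
    val lines = readAllLines { if (fake.hasNext()) fake.next() else null }
    println(lines) // [first, second, third]
}
```

Because the source is just a function, the same helper works with readLine, a file reader, or test data.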

How to use the same iterator twice, once for counting and once for iteration?

It seems that an iterator is consumed when counting. How can I use the same iterator for counting and then iterate on it?
I'm trying to count the lines in a file and then print them. I'm able to read the file content and count the lines, but after that I can no longer iterate over the lines, as if the internal cursor were at the end of the iterator.
use std::fs::File;
use std::io::prelude::*;
fn main() {
    let log_file_name = "/home/myuser/test.log";
    let mut log_file = File::open(log_file_name).unwrap();
    let mut log_content: String = String::from("");
    //Reads the log file.
    log_file.read_to_string(&mut log_content).unwrap();
    //Gets all the lines in a Lines struct.
    let mut lines = log_content.lines();
    //Uses by_ref() in order to not take ownership
    let count = lines.by_ref().count();
    println!("{} lines", count); //Prints the count
    //Doesn't enter in the loop
    for value in lines {
        println!("{}", value);
    }
}
Iterator doesn't have a reset method, but it seems the internal cursor is at the end of the iterator after the count. Is it mandatory to create a new Lines by calling log_content.lines() again or can I reset the internal cursor?
For now, the workaround that I found is to create a new iterator:
use std::fs::File;
use std::io::prelude::*;
fn main() {
    let log_file_name = "/home/myuser/test.log";
    let mut log_file = File::open(log_file_name).unwrap();
    let mut log_content: String = String::from("");
    //Reads the log file.
    log_file.read_to_string(&mut log_content).unwrap();
    //Counts all the lines, consuming the iterator
    let count = log_content.lines().count();
    println!("{} lines", count);
    //Creates a brand new iterator
    let lines = log_content.lines();
    for value in lines {
        println!("{}", value);
    }
}
Calling count consumes the iterator, because it actually iterates until it is done (i.e. next() returns None).
You can prevent consuming the iterator by using by_ref, but the iterator is still driven to its completion (by_ref actually just returns the mutable reference to the iterator, and Iterator is also implemented for the mutable reference: impl<'a, I> Iterator for &'a mut I).
This still can be useful if the iterator contains other state you want to reuse after it is done, but not in this case.
You could simply try forking the iterator (they often implement Clone if they don't have side effects), although in this case recreating it is just as good (most of the time creating an iterator is cheap; the real work is usually only done when you drive it by calling next directly or indirectly).
So no, (in this case) you can't reset it, and yes, you need to create a new one (or clone it before using it).
The other answers have already well-explained that you can either recreate your iterator or clone it.
If the act of iteration is overly expensive or it's impossible to do multiple times (such as reading from a network socket), an alternative solution is to create a collection of the iterator's values that will allow you to get the length and the values.
This does require storing every value from the iterator; there's no such thing as a free lunch!
use std::fs;
fn main() {
    let log_content = fs::read_to_string("/home/myuser/test.log").unwrap();
    let lines: Vec<_> = log_content.lines().collect();
    println!("{} lines", lines.len());
    for value in lines {
        println!("{}", value);
    }
}
Iterators can generally not be iterated twice because there might be a cost to their iteration. In the case of str::lines, each iteration needs to find the next end of line, which means scanning through the string, which has some cost. You could argue that the iterator could save those positions for later reuse, but the cost of storing them would be even bigger.
Some Iterators are even more expensive to iterate, so you really don't want to do it twice.
Many iterators can be recreated easily (here calling str::lines a second time) or be cloned. Whichever way you recreate an iterator, the two iterators are generally completely independent, so iterating will mean you'll pay the price twice.
In your specific case, it is probably fine to just iterate the string twice, as strings that fit in memory shouldn't be so long that merely counting lines becomes a very expensive operation. If you believe it is, first benchmark it; then, if needed, write your own algorithm, as Lines::count is probably not optimized as much as it could be, since the primary goal of Lines is to iterate over lines.