File I/O in Kotlin with (potentially) unknown encoding

This is my first attempt to learn and use Kotlin. I have a simple task: read a file line by line, preprocess each line and put a specific portion into a map. Each line is tab-separated.
When trying to preprocess, things start going horribly wrong. I tried to debug, and instead of the normal characters, this is what I can see: in between every two adjacent readable characters, there is a strange-looking block with horizontal lines.
Here is the code I wrote:
fun mapUserToId(path: String): MutableMap<String, Int> {
    val user2id = mutableMapOf<String, Int>()
    val bufferedReader = File(path).bufferedReader()
    bufferedReader.useLines { lines ->
        lines.drop(1).forEach { // dropping the first line with column headers
            val components: List<String> = it.trim().split("\t") // split by tab delimiter
            val user: String = components[2]
            println(user.length) // length is nearly double due to the strange block-like characters
            val id: String = components[3]
            user2id[user] = id.toInt() // fails due to number format exception, because of those block-like characters
        }
    }
    return user2id
}
This looks like a charset issue, but I can't figure out what the charset could be, and how to specify that charset in the above code. Opening the file in vim looks perfectly normal (as in, one would suspect that this file has UTF-8 encoding).

This is, indeed, an encoding issue. The problem is resolved by specifying the encoding while creating the buffered reader as follows:
val bufferedReader: BufferedReader = File(path).bufferedReader(Charsets.UTF_16)
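For completeness, the whole function with the charset applied might look like this (a sketch; `Charsets.UTF_16` is only right if the file really is UTF-16, which the doubled string lengths and block-like NUL characters strongly suggest):

```kotlin
import java.io.File

fun mapUserToId(path: String): MutableMap<String, Int> {
    val user2id = mutableMapOf<String, Int>()
    // bufferedReader() defaults to UTF-8; name the charset explicitly
    File(path).bufferedReader(Charsets.UTF_16).useLines { lines ->
        lines.drop(1).forEach { line ->        // skip the header row
            val components = line.trim().split("\t")
            user2id[components[2]] = components[3].toInt()
        }
    }
    return user2id
}
```

Java's "UTF-16" charset also detects and consumes a leading byte-order mark, so the first data character comes through clean.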

Related

Need to filter lines from a sequence but keep it as a single sequence

I need to read a file from disk, filter out some rows based on conditions, then return the result as a single stream/sequence, not a sequence of strings. The file is too large to hold in memory all at once, so it must be treated as a Stream/Sequence throughout processing. This is what I tried.
File(filename).bufferedReader()
    // break into lines
    .lineSequence()
    // filter each line based on condition
    .filter { meetsSomeCondition(it) }
    // add newline back in
    .map { (it + "\n").byteInputStream() }
    // reduce back into a single stream with Java's SequenceInputStream
    .reduce<InputStream, ByteArrayInputStream> { acc, i -> SequenceInputStream(acc, i) }
This works when testing on a small file, but on a large file it fails with a StackOverflowError. It seems that Java's SequenceInputStream can't handle being nested repeatedly the way my reduce call nests it.
I see that SequenceInputStream also has a constructor accepting an Enumeration argument built from a List of elements. But that's the problem: as far as I can tell, it doesn't accept a Stream.
Your code does not really do what you think it does. reduce() is a terminal operation, meaning it consumes all elements of the sequence; by the time the reduce() line finishes, the whole file has already been read. Also, it is not that SequenceInputStream doesn't support such a reduce operation: you created a very long chain of objects, and the SequenceInputStream objects don't know they were chained like this, so there is not much they can do about it.
Instead, you need to keep the sequence "alive" and create an InputStream that reads from the source sequence on demand. I don't think there is a utility like this in the stdlib; it is a very specialized requirement.
The easiest is to create a sequence of bytes and then provide InputStream which reads from it:
File(filename).bufferedReader()
    .lineSequence()
    .filter { meetsSomeCondition(it) }
    .flatMap { "$it\n".toByteArray().asIterable() }
    .asInputStream()

fun Sequence<Byte>.asInputStream() = object : InputStream() {
    val iter = iterator()
    override fun read() = if (iter.hasNext()) iter.next().toUByte().toInt() else -1
}
However, a sequence of bytes isn't really the best for performance. We can optimize by reading line by line, creating a sequence of strings or byte arrays:
File(filename).bufferedReader()
    .lineSequence()
    .filter { meetsSomeCondition(it) }
    .map { "$it\n".toByteArray() }
    .asInputStream()

fun Sequence<ByteArray>.asInputStream() = object : InputStream() {
    val iter = iterator()
    var curr = iter.next()
    var pos = 0
    override fun read(): Int {
        return when {
            pos < curr.size -> curr[pos++].toUByte().toInt()
            !iter.hasNext() -> -1
            else -> {
                curr = iter.next()
                pos = 0
                read()
            }
        }
    }
}
(Note this implementation of asInputStream() will fail for an empty sequence.)
Still, there is much room for improvement performance-wise. We read from the sequence line by line, but we still read from the InputStream byte by byte. To improve further, we would need to override more of InputStream's methods so it reads in bigger chunks. If you really care about performance, I suggest looking into the BufferedInputStream implementation and trying to reuse some of its code.
Also, remember to close the file reader that was created in the first step. It won't be closed automatically when the InputStream is closed.
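To tie those last two points together, here is a sketch (the helper name filteredStream is made up, not from the answer above) that keeps a reference to the reader and overrides close(); starting curr as an empty array also fixes the empty-sequence failure noted earlier:

```kotlin
import java.io.File
import java.io.InputStream

// Hypothetical helper: streams the filtered lines of a file as an InputStream
// and closes the underlying reader when the stream is closed.
fun filteredStream(filename: String, keep: (String) -> Boolean): InputStream {
    val reader = File(filename).bufferedReader()
    val iter = reader.lineSequence()
        .filter(keep)
        .map { "$it\n".toByteArray() }
        .iterator()
    return object : InputStream() {
        var curr = ByteArray(0)   // empty start: safe for an empty sequence
        var pos = 0
        override fun read(): Int = when {
            pos < curr.size -> curr[pos++].toInt() and 0xFF
            iter.hasNext() -> { curr = iter.next(); pos = 0; read() }
            else -> -1
        }
        override fun close() = reader.close()   // release the file handle
    }
}
```

The recursion in read() is bounded because every mapped array holds at least the newline byte, so at most one extra call happens per line boundary.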

Write text and Print in Kotlin

I have the following code:
package com.zetcode
import java.io.File
fun main() {
    val fileName = "P3.txt"
    val content = File(P3.txt).readText()
    println(content)
}
My goal is to write code in Kotlin that reads the text file (P3.txt) and prints its content. I know there is something wrong because I keep getting an "unresolved reference" error.
File takes a string as input; you should change the line:
val content = File(P3.txt).readText()
to
val content = File("P3.txt").readText()
The difference is that without the quotes, Kotlin thinks P3 is a reference that is not declared anywhere, which gives you the error you've mentioned.

Why won't Kotlin print the string I select from a .txt file unless it's the last line?

I'm trying to open a text file, randomly select a line, and format a string that includes the randomly selected line. The string is then printed to the console, but for some reason it doesn't work unless the last line of the file happens to be selected.
Text file:
Neversummer
Abelhaven
Phandoril
Tampa
Sanortih
Trell
Zan'tro
Hermi Hermi
Curlthistle Forest
Code:
import java.io.File
fun main() {
    var string = File("data/towns.txt")
        .readText()
        .split("\n")
        .shuffled()
        .first()
    println("$string has printed")
}
Output when last line is selected:
Curlthistle Forest has printed
Output when any other line is selected:
has printed
As suggested by dyukha in the comment section, it is indeed a platform-specific issue: on Windows, lines end in "\r\n", so splitting on "\n" alone leaves a stray "\r" at the end of every line but the last, and the carriage return makes the rest of the printed string overwrite the town name. I prefer the solution they provided using readLines(), since it condenses two function calls into one.
However, should you ever need the line delimiter itself in a platform-independent manner, you can use the built-in System.lineSeparator() method (since Java 7).
import java.io.File
fun main() {
    var string = File("data/towns.txt")
        .readText()
        .split(System.lineSeparator())
        .shuffled()
        .first()
    println("$string has printed")
}
...
Still, I do recommend that you use readLines() since it packages the functionality of both .readText() and .split(System.lineSeparator()).
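Extracted into a function, the readLines() version might look like this (randomTown is a name made up for illustration); readLines() splits on "\n", "\r\n", and "\r", so no separator handling is needed:

```kotlin
import java.io.File

// Reads all lines (any line terminator), shuffles them, and picks one
fun randomTown(path: String): String =
    File(path).readLines().shuffled().first()
```

Calling println("${randomTown("data/towns.txt")} has printed") then behaves the same with Windows and Unix line endings.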

Clean way of reading all input lines in Kotlin

A common pattern when doing coding challenges is to read many lines of input. Assuming you don't know in advance how many lines, you want to read until EOF (readLine returns null).
Also as a preface, I don't want to rely on java.utils.* since I'm coding in KotlinNative, so no Scanner.
I would like to maybe do something like
val lines = arrayListOf<String>()
for (var line = readLine(); line != null; line = readLine()) {
    lines.add(line)
}
But that clearly isn't valid Kotlin. The cleanest I can come up with is:
while (true) {
    val line = readLine()
    if (line == null) break
    lines.add(line)
}
This works, but it just doesn't seem very idiomatic. Is there a better way to read all lines into an array, without using a while/break loop?
generateSequence has the nice property that it completes once the generator returns null; note that the resulting sequence can only be iterated a single time. So the following code works:
val input = generateSequence(::readLine)
val lines = input.toList()
Then like s1m0nw1's answer you can use any of the available Sequence<String> methods to refine this as desired for your solution.
I guess you're talking about reading from System.in (stdin) here. You could make that work with sequences:
val lines = generateSequence(readLine()) {
    readLine()
}
lines.take(5).forEach { println("read: $it") }
We begin our sequence with a first readLine() call (the sequence's seed) and then read the next line until null is encountered. The sequence is possibly infinite, therefore we just take the first five inputs in the example. See the Kotlin documentation on Sequence for details.
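The same pattern works with any () -> String? supplier, which also makes it easy to try out without typing into stdin (collectAll and the list-backed supplier below are illustrative stand-ins for ::readLine):

```kotlin
// Collects values from a nullable-returning supplier until it yields null
fun collectAll(supplier: () -> String?): List<String> =
    generateSequence(supplier).toList()

fun demo(): List<String> {
    val data = mutableListOf("one", "two", "three")
    // stand-in for ::readLine: returns null once the input is exhausted
    return collectAll { data.removeFirstOrNull() }
}
```

With real console input this is simply collectAll(::readLine).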

Iterating over files, splitting by pattern

I'm trying to wrap my head around file stream processing. Got input looking like this:
bla
blubb

blubber
testcode
There's several files all looking like the above. Right now, I'm using a single file approach that reads the whole file into memory and splits it:
Files.newBufferedReader(Paths.get("myfile")).use { f ->
    f.readText().splitToSequence("\n\n").forEach {
        // do my stuff
    }
}
Now, I'm trying to generalize this to larger inputs (making it impractical to hold the file in memory) and several files. Ideally, I'd treat a whole directory of input files as a single stream of lines I split on \n\n and work on the parts. How would I do this?
You can read the file as a sequence of text lines and then regroup those lines, taking an empty line as the delimiter:
File("myfile").useLines { lines ->
    val lineBlocks: Sequence<List<String>> = buildSequence {
        val block = mutableListOf<String>()
        for (line in lines) {
            when {
                line.isNotEmpty() -> block.add(line)
                block.isNotEmpty() -> {
                    yield(block.toList())
                    block.clear()
                }
            }
        }
        if (block.isNotEmpty()) yield(block.toList())
    }
    lineBlocks.forEach {
        println(it.joinToString())
    }
}
Here you get the result in lineBlocks, which is a sequence where each element is a list of lines in a single block.
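The multi-file part of the question can be layered on top: stream each file with useLines and concatenate the per-file results (a sketch; the sequence {} builder below is the stable successor of the experimental buildSequence, and processDir is a made-up helper):

```kotlin
import java.io.File

// Groups a sequence of lines into blocks, taking blank lines as delimiters
fun Sequence<String>.lineBlocks(): Sequence<List<String>> = sequence {
    val block = mutableListOf<String>()
    for (line in this@lineBlocks) {
        when {
            line.isNotEmpty() -> block.add(line)
            block.isNotEmpty() -> {
                yield(block.toList())
                block.clear()
            }
        }
    }
    if (block.isNotEmpty()) yield(block.toList())
}

// Hypothetical helper: applies the grouping to every file in a directory,
// streaming each file so only one block is in memory at a time
fun processDir(dir: File): List<List<String>> =
    dir.listFiles()!!.sorted().flatMap { file ->
        file.useLines { lines -> lines.lineBlocks().toList() }
    }
```

With this per-file approach each reader is closed by useLines, but note that a block cannot span a file boundary; if that matters, the files' line sequences would have to be concatenated before grouping.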