Using CsvMapper in Kotlin to parse huge CSV file

I have a CSV file generated as a report by the server. The CSV file has more than 100k lines of text. I'm reading the file in Kotlin using CsvMapper, but I end up with an IOException.
Here's the code I've implemented:
//Declare the mapper object
private var csvMapper = CsvMapper().registerModule(KotlinModule())
//Generate Iterator
inline fun <reified T> getIterator(fileName: String): MappingIterator<T>? {
csvMapper.disable(JsonParser.Feature.AUTO_CLOSE_SOURCE)
FileReader(fileName).use { reader ->
return csvMapper
.readerFor(T::class.java)
.without(StreamReadFeature.AUTO_CLOSE_SOURCE)
.with(CsvSchema.emptySchema().withHeader())
.readValues<T>(reader)
}
}
//Read the file using iterator
fun read(csvFile: String) {
val iterator = getIterator<BbMembershipData>(csvFile)
if (iterator != null) {
while (iterator.hasNext()) {
try {
val lineElement = iterator.next()
println(lineElement)
} catch (e: RuntimeJsonMappingException) {
println("Iterator Exception: " + e.localizedMessage)
}
}
}
}
After printing 10 lines, it throws the following exception:
Exception in thread "main" java.lang.RuntimeException: Stream closed
at com.fasterxml.jackson.databind.MappingIterator._handleIOException(MappingIterator.java:420)
at com.fasterxml.jackson.databind.MappingIterator.next(MappingIterator.java:203)
at Main$FileProcessor.read(Main.kt:39)
at Main.main(Main.kt:54)
Caused by: java.io.IOException: Stream closed
How can I prevent the "Stream Closed" exception?

CsvMapper reads lazily: readValues returns an iterator instead of reading the whole file up front.
When the end of the use block is reached, almost nothing has been read yet. The file is only read as you consume the iterator, but by then it is already closed. The few rows that did print were presumably already sitting in the parser's buffer before the reader was closed.
Therefore, read needs to open and close the file, not getIterator:
//Generate Iterator
inline fun <reified T> getIterator(reader: Reader): MappingIterator<T>? {
csvMapper.disable(JsonParser.Feature.AUTO_CLOSE_SOURCE)
return csvMapper
.readerFor(T::class.java)
.without(StreamReadFeature.AUTO_CLOSE_SOURCE)
.with(CsvSchema.emptySchema().withHeader())
.readValues<T>(reader)
}
//Read the file using iterator
fun read(csvFile: String) {
FileReader(csvFile).use { reader ->
val iterator = getIterator<BbMembershipData>(reader)
if (iterator != null) {
while (iterator.hasNext()) {
try {
val lineElement = iterator.next()
println(lineElement)
} catch (e: RuntimeJsonMappingException) {
println("Iterator Exception: " + e.localizedMessage)
}
}
}
}
}
Notice that the use block now ends only after the whole file has been read.
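The same pitfall can be reproduced with the standard library alone, without Jackson. The sketch below is only an illustration: brokenLines and countLines are hypothetical names, and a lazy line sequence over a BufferedReader stands in for the MappingIterator.

```kotlin
import java.io.BufferedReader
import java.io.IOException
import java.io.StringReader

// Returns a lazy sequence, but the reader is closed before the caller iterates it.
fun brokenLines(text: String): Sequence<String> =
    BufferedReader(StringReader(text)).use { reader -> reader.lineSequence() }

// Consumes the lazy sequence while the reader is still open.
fun countLines(text: String): Int =
    BufferedReader(StringReader(text)).use { reader -> reader.lineSequence().count() }

fun main() {
    val text = "a\nb\nc"
    val lateReadFailed = try {
        brokenLines(text).toList() // iteration starts after use {} has closed the reader
        false
    } catch (e: IOException) {
        true
    }
    println("late read failed: $lateReadFailed")
    println("lines read inside use: ${countLines(text)}")
}
```

In both versions the rule is the same: whoever opens the resource should keep it open until the lazy result has been fully consumed.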

Related

Heap issue when using kotlin coroutine in a Batch process

I want to call an API for each element in a list.
So I created below code which is an extension function:
suspend fun <T, V> Iterable<T>.customAsyncAll(method: suspend (T) -> V): Iterable<V> {
val deferredList = mutableListOf<Deferred<V>>()
val scope = CoroutineScope(dispatchers.io)
forEach {
val deferred = scope.async {
try {
method(it)
} catch (e: Exception) {
log.error { "customAsyncAll Exception in $method method " + e.stackTraceToString()
}
throw e
}
}
deferredList.add(deferred)
}
return deferredList.awaitAll()
}
Call the code as:
val result = runBlocking{ list.customAsyncAll { apiCall(it) }.toList() }
I see error posting Resource Exhausted event: Java heap space. What is wrong with this code?
When an exception is thrown in one of the API calls, will the rest of the async coroutines be released, or do they still occupy heap space?
I'm guessing you are passing a somewhat large list (50+ items). I believe launching that many calls at once is the problem, and realistically I don't think you gain any performance by opening more than 10 connections to the API at a time. My suggestion would be to limit the number of concurrent calls to something below 20.
There are many ways to implement this, using Semaphore is my recommendation.
suspend fun <T, V> Iterable<T>.customAsyncAll(method: suspend (T) -> V): Iterable<V> {
val deferredList = mutableListOf<Deferred<V>>()
val scope = CoroutineScope(Dispatchers.IO)
val sema = Semaphore(10)
forEach {
val deferred = scope.async {
sema.withPermit {
try {
method(it)
} catch (e: Exception) {
log.error {
"customAsyncAll Exception in $method method " +
e.stackTraceToString()
}
throw e
}
}
}
deferredList.add(deferred)
}
return deferredList.awaitAll()
}
 
Side note: be sure to cancel any custom CoroutineScope you create once you are done with it, see Custom usage.
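One way to avoid managing a custom scope at all is to run the calls inside coroutineScope, which waits for its children and cancels the remaining ones if any of them fails. The following is a minimal sketch under that assumption, not the original poster's code; mapAsyncLimited and its limit parameter are hypothetical names, and it needs the kotlinx-coroutines-core dependency.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit

// Hypothetical helper: applies `transform` to every element with at most
// `limit` calls in flight, returning results in the original order.
suspend fun <T, V> Iterable<T>.mapAsyncLimited(
    limit: Int = 10,
    transform: suspend (T) -> V
): List<V> = coroutineScope {
    val permits = Semaphore(limit)
    this@mapAsyncLimited.map { item ->
        // Each child belongs to coroutineScope, so a failure cancels its siblings.
        async(Dispatchers.IO) { permits.withPermit { transform(item) } }
    }.awaitAll()
}

fun main() = runBlocking {
    val doubled = listOf(1, 2, 3).mapAsyncLimited(limit = 2) { it * 2 }
    println(doubled)
}
```

Because the children are tied to the caller's scope, there is no separate scope left behind to cancel when the function returns or throws.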

How to read a file in one coroutine and print lines in another coroutine?

I'm trying to get comfortable with Kotlin/coroutines. My current goal is to read a text file in one coroutine, and emit each line through a Channel to be printed in another coroutine. Here's what I have so far:
fun main() = runBlocking {
val ch = Channel<String>()
launch {
for (msg in ch) {
println(msg.length)
}
}
launch {
File("file.txt").forEachLine {
ch.send(it)
}
}
}
Hopefully this shows my intent, but it doesn't compile because you can't call a suspending function (send) from the lambda passed to forEachLine. In Golang everything is modeled synchronously, so I would just run it in a goroutine and send would block, but Kotlin seems to have a lower level concurrency model. What would be the canonical way to accomplish this?
If it's helpful, my final goal is to read JSON events emitted from a subprocess via stdout. I'll have a separate JSON object on each line, and will need to parse and handle each separately.
This is the best I've been able to come up with so far. It seems to work but I feel like there must be a more idiomatic way to accomplish this.
fun main() = runBlocking {
val ch = Channel<String>()
launch {
for (msg in ch) {
println(msg.length)
}
}
launch {
val istream = File("file.txt").inputStream()
val buf = ByteArray(4096)
while (true) {
val n = istream.read(buf)
if (n == -1) {
break
}
val msg = buf.sliceArray(0..n-1).toString(Charsets.UTF_8)
ch.send(msg)
}
ch.close()
}
}
I've been trying the same and based on some ideas I got from https://kotlinlang.org/docs/channels.html#fan-out the following seems to work nicely:
fun main() {
val fileToRead = File("somefile.csv")
runBlocking {
// Producer reading the file
val fileChannel = readFileIntoChannel(fileToRead)
// Consumer writing file lines to stdout
launch { fileChannel.consumeEach { line -> println(line) } }
}
}
fun CoroutineScope.readFileIntoChannel(f: File) = produce<String> {
for (line in f.bufferedReader().lines() ) { send(line) }
}
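A variation on the producer above, sketched under a few assumptions: it uses File.useLines so the file is closed when the producer finishes or is cancelled, and it reads on Dispatchers.IO. Because useLines is an inline function, the suspending send can be called inside its lambda, which is not possible with forEachLine. The name lineChannel and the temporary file are illustrative only, and the code needs kotlinx-coroutines-core.

```kotlin
import java.io.File
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
import kotlinx.coroutines.runBlocking

// Hypothetical producer: emits the file line by line and closes the file afterwards.
@OptIn(ExperimentalCoroutinesApi::class)
fun CoroutineScope.lineChannel(file: File): ReceiveChannel<String> =
    produce(Dispatchers.IO) {
        file.useLines { lines ->
            for (line in lines) send(line) // suspends until the consumer is ready
        }
    }

fun main() {
    val file = File.createTempFile("lines", ".txt").apply { writeText("one\ntwo\nthree\n") }
    runBlocking {
        // The channel is closed automatically when the producer block returns.
        for (line in lineChannel(file)) println(line.length)
    }
    file.delete()
}
```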

Crash in coroutine

My function is quite straightforward,
Main Thread: Initializes a variable ->
Background Thread: Fire network request, assign the result back to the previous variable ->
Main Thread: Display that variable
Code below:
suspend fun createCity(context: Context, newCity: MutableLiveData<NewIdea>, mapBody: Map<String, String>, token: String) {
lateinit var response: NewIdea
try {
withContext(Dispatchers.IO) {
val map = generateRequestBody(mapBody)
response = webservice.createIdea(tripId, map, "Bearer $token")
getTrip(context, token)
}
} catch (e: Exception) {
Log.e(TAG, e.message)
}
newCity.value = response
}
But sometimes (it only happened 2 times actually) crashlytics reports crash for this line newCity.value = response
Fatal Exception: kotlin.UninitializedPropertyAccessException: lateinit property response has not been initialized
I don't really understand how that can happen.
Is this the correct way to return value from coroutine function?
thanks
If the try block throws before response is assigned, the lateinit variable is never set, but newCity.value = response after the catch still runs and fails. Put the UI update inside the try block as well, and handle the exception separately:
Sidenote: withContext is well-optimized to return values, so you can make use of it.
suspend fun createCity(context: Context, newCity: MutableLiveData<NewIdea>, mapBody: Map<String, String>, token: String) {
try {
val response: NewIdea = withContext(Dispatchers.IO) {
val map = generateRequestBody(mapBody)
// does createIdea() first store it in var, then does getTrip(), then returns the result of createIdea() stored previously
webservice.createIdea(tripId, map, "Bearer $token").also { getTrip(context, token) } // ^withContext
}
newCity.value = response
} catch (e: Exception) {
Log.e(TAG, e.message)
}
}
A quick tip (optional): you can wrap the UI-updating code in withContext(Dispatchers.Main.immediate). It dispatches the work to the main thread when called from another thread, and runs it in place without an extra dispatch when already on the main thread:
withContext(Dispatchers.Main.immediate) {
val response: NewIdea = withContext(Dispatchers.IO) {
val map = generateRequestBody(mapBody)
// does createIdea() first store it in var, then does getTrip(), then returns the result of createIdea() stored previously
webservice.createIdea(tripId, map, "Bearer $token").also { getTrip(context, token) } // ^withContext
}
newCity.value = response
}
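The idea of returning the value from withContext instead of assigning a lateinit variable can be isolated into a small, Android-free sketch. Everything here is an assumption for illustration: fetchOrNull is a hypothetical helper, and returning null on failure is just one possible policy. CancellationException is rethrown because a blanket catch of Exception would otherwise swallow coroutine cancellation.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical helper: runs `fetch` on Dispatchers.IO and returns its result,
// or null on failure, so the caller never reads a variable that was never set.
suspend fun <T : Any> fetchOrNull(fetch: suspend () -> T): T? =
    try {
        withContext(Dispatchers.IO) { fetch() }
    } catch (e: CancellationException) {
        throw e // let coroutine cancellation propagate
    } catch (e: Exception) {
        null
    }

fun main() = runBlocking {
    val ok = fetchOrNull { "Paris" }
    val failed = fetchOrNull<String> { error("network down") }
    println(ok)
    println(failed)
}
```

With a helper like this, the caller would only update the LiveData when the result is non-null, for example with fetchOrNull { ... }?.let { newCity.value = it }.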

getting error Missing calls inside every { ... } block in writing unit test cases in kotlin + Mockk + Junit5

the function I am testing,
class FileUtility {
companion object {
@JvmStatic
fun deleteFile(filePath: String) {
try {
val file = getFileObject(filePath)
file.delete()
} catch (ex :Exception) {
log.error("Exception while deleting the file", ex)
}
}
}
}
Unit test,
@Test
fun deleteFileTest() {
val filePath = "filePath"
val file = mockk<File>()
every { getFileObject(filePath) } returns file
deleteFile(filePath)
verify { file.delete() }
}
Running this test case gives the following error:
io.mockk.MockKException: Missing calls inside every { ... } block.
Is this a bug, or am I writing the test case wrong?
Assuming getFileObject is a top-level function in the FileUtility.kt file, it has to be mocked with mockkStatic(...), passing the name of the class generated for that file as the argument.
For example, “pkg.FileKt” for the file File.kt in the pkg package.
@Test
fun deleteFileTest() {
val file = mockk<File>()
mockkStatic("pkg.FileUtilityKt")
val filePath = "filePath"
every { getFileObject(filePath) } returns file
every {file.delete()} answers {true}
deleteFile(filePath)
verify { file.delete() }
}

Why not read all lines from text file?

In my Kotlin project, in the folder src/resources/, I have a file pairs_ids.txt.
It is a property file:
key=value
The file has 1389 lines in total.
Here is the code that reads the content of this file line by line:
open class AppStarter : Application<AppConfig>() {
override fun getName() = "stats"
override fun run(configuration: AppConfig?, environment: Environment?) {
val logger = LoggerFactory.getLogger(this::class.java)
val inputStream = javaClass.getResourceAsStream("/pairs_ids.txt")
val isr = InputStreamReader(inputStream)
val br = BufferedReader(isr)
for (line in br.lines()) {
logger.info("current_line = " + line)
}
br.close()
isr.close()
inputStream.close()
}
}
fun main(args: Array<String>) {
AppStarter().run(*args)
}
The problem is that the number of current_line entries is different every time.
Start the project: the count of current_line is 803.
Start the project again: the count of current_line is 1140.
Why is the count different every time, and never equal to 1389?
Kotlin has some brilliant extension methods to easily deal with streams, reading lines and text and such.
Try this:
open class AppStarter : Application<AppConfig>() {
override fun getName() = "stats"
override fun run(configuration: AppConfig?, environment: Environment?) {
val logger = LoggerFactory.getLogger(this::class.java)
javaClass.getResourceAsStream("/pairs_ids.txt").bufferedReader().use { reader -> reader.readLines() }.forEach { line -> logger.info(line) }
}
}
fun main(args: Array<String>) {
AppStarter().run(*args)
}
When working with input/output streams, use Kotlin's use extension function and do all of your processing inside the use block.
It closes the stream for you, even when an exception is thrown, so nothing leaks and you cannot forget to close it.
Given an InputStream, you can use forEachLine to read each line separately:
inputStream.bufferedReader().use {
it.forEachLine {
println(it)
}
}
Note: You should use use which is a convenient way to make sure the stream is closed once the reading is done (as user8159708 already suggested).
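Since the file in the question holds key=value lines, the same pattern can be taken one step further. This is a minimal sketch, assuming blank or malformed lines should simply be skipped; readPairs is a hypothetical name, and an in-memory stream stands in for getResourceAsStream("/pairs_ids.txt").

```kotlin
import java.io.InputStream

// Hypothetical helper: parses key=value lines into a map and closes the stream.
// Blank lines and lines without '=' are skipped (an assumption of this sketch).
fun readPairs(input: InputStream): Map<String, String> =
    input.bufferedReader().useLines { lines ->
        lines.map { it.trim() }
            .filter { it.isNotEmpty() && '=' in it }
            .associate { line ->
                val (key, value) = line.split("=", limit = 2)
                key.trim() to value.trim()
            }
    }

fun main() {
    // In-memory sample instead of a real resource file.
    val sample = "a=1\nb=2\n\nc=3\n".byteInputStream()
    val pairs = readPairs(sample)
    println(pairs.size)
}
```

All processing happens inside useLines, so the lazy sequence is fully consumed before the reader is closed.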