How to use DataStax driver 4 for performant concurrent Cassandra database queries in Kotlin?

I have a Kotlin application which serves data via a RESTful API. That data is stored in a Cassandra database. To fulfill a request, the application needs to perform n queries against Cassandra. I want the API to respond quickly, so I would like those n queries to execute in parallel. I also want to be able to handle multiple concurrent users without performance degrading.
Libraries:
implementation("com.datastax.oss:java-driver-core:4.13.0")
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-jdk8:1.4.3")
In DataStax driver 3, I have code which uses the synchronous method execute. I wrap it in a coroutine dispatcher and await all the requests.
Here is sample code which queries the same row n times in a loop:
val numbers: List<Int> = (1..NUMBER_OF_QUERIES).toList()
val query = "SELECT JSON * FROM keyspace.table WHERE partition_key=X AND clustering_key=Y"
val (result, elapsed1) = measureTimedValue {
    numbers.map { num: Int ->
        CoroutineScope(Dispatchers.IO).async {
            session.execute(query).all().map { row ->
                toJson(row.getString(0).toString())
            }
        }
    }.awaitAll()
}
DataStax 3 offers executeAsync, which returns Guava's ListenableFuture, but I couldn't get that to work within a coroutine, even with https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-guava/index.html
For DataStax 4, I am trying to use the asynchronous API to achieve a similar result. My hope is that the asynchronous API can perform better with fewer threads, as it is non-blocking. However, when I run a similar test case, I observe that the code below runs slower than the sync API from v3. In addition, it does not perform well as more concurrent users are added.
val numbers: List<Int> = (1..NUMBER_OF_QUERIES).toList()
val query = "SELECT JSON * FROM keyspace.table WHERE partition_key=X AND clustering_key=Y"
val (result, elapsed1) = measureTimedValue {
    numbers.map { num: Int ->
        CoroutineScope(Dispatchers.IO).async {
            session.executeAsync(query).asDeferred()
        }
    }.awaitAll()    // await the outer async blocks
        .awaitAll() // await the driver's async result sets
        .map { rs -> toJson(rs) }
}
Is there a better way to handle parallel execution of tasks returning CompletionStage<T> in Kotlin?
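For reference, a minimal sketch of one possible direction, assuming session is a driver 4 CqlSession and toJson is the same helper used in the snippets above. kotlinx-coroutines-jdk8 provides an await() extension on CompletionStage, so each query can suspend on the driver's own future instead of occupying a Dispatchers.IO thread. This is a sketch under those assumptions, not a benchmarked solution:
import com.datastax.oss.driver.api.core.CqlSession
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.future.await

// Sketch: fire n queries concurrently; await() suspends until the driver's
// CompletionStage<AsyncResultSet> completes, so no thread is blocked while waiting.
suspend fun queryAll(session: CqlSession, query: String, n: Int): List<String> =
    coroutineScope {
        (1..n).map {
            async {
                val rs = session.executeAsync(query).await()
                // currentPage() returns only the first page; follow hasMorePages()
                // and fetchNextPage() for larger result sets.
                rs.currentPage().map { row -> toJson(row.getString(0).toString()) }
            }
        }.awaitAll().flatten()
    }
Because await() suspends on the driver's own CompletionStage, the concurrency stays on the driver's I/O threads rather than on a blocking thread pool.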

Related

Neo4j 3.5's embedded database does not seem to persist data

I am trying to build a small command line tool that will store data in a Neo4j graph. To do this I have started experimenting with Neo4j 3.5's embedded databases. After putting together the following example, I have found that either the nodes I am creating are not being saved to the database, or the method of database creation is overwriting my previous run.
The Example:
fun main() {
    // Spin up database
    val graphDBFactory = GraphDatabaseFactory()
    val graphDB = graphDBFactory.newEmbeddedDatabase(File("src/main/resources/neo4j"))
    registerShutdownHook(graphDB)

    val tx = graphDB.beginTx()
    graphDB.createNode(Label.label("firstNode"))
    graphDB.createNode(Label.label("secondNode"))
    val result = graphDB.execute("MATCH (a) RETURN COUNT(a)")
    println(result.resultAsString())
    tx.success()
}

private fun registerShutdownHook(graphDb: GraphDatabaseService) {
    // Registers a shutdown hook for the Neo4j instance so that it
    // shuts down nicely when the VM exits (even if you "Ctrl-C" the
    // running application).
    Runtime.getRuntime().addShutdownHook(object : Thread() {
        override fun run() {
            graphDb.shutdown()
        }
    })
}
I would expect that every time I run main, the resulting query count will increase by 2.
That is currently not the case, and I can find nothing in the docs that references a different method of opening an already created embedded database. Am I using the embedded database incorrectly, or am I missing something? Any help or info would be appreciated.
Build info:
Kotlin JVM 1.4.21
Neo4j Community 3.5.35
Transactions in Neo4j 3.x have a three-stage model:
create
success / failure
close
You missed the third stage, which is what actually commits or rolls back the transaction.
You can use Kotlin's use, since Transaction is an AutoCloseable.
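A minimal sketch of the corrected transaction block, assuming the same Neo4j 3.5 embedded API and the registerShutdownHook helper from the question:
import org.neo4j.graphdb.Label
import org.neo4j.graphdb.factory.GraphDatabaseFactory
import java.io.File

fun main() {
    val graphDB = GraphDatabaseFactory().newEmbeddedDatabase(File("src/main/resources/neo4j"))
    registerShutdownHook(graphDB)

    // use {} closes the transaction even on exceptions; close() commits because
    // success() was called, otherwise it rolls back.
    graphDB.beginTx().use { tx ->
        graphDB.createNode(Label.label("firstNode"))
        graphDB.createNode(Label.label("secondNode"))
        println(graphDB.execute("MATCH (a) RETURN COUNT(a)").resultAsString())
        tx.success()
    }
}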

Spring kafka - kafka streams topology set up

We have a Kafka Streams Spring Boot app (using spring-kafka). This app currently reads messages from an upstream topic, applies some transformation, and writes them to a downstream topic; it does not do any aggregates, joins, or other advanced Kafka Streams features.
The code currently looks similar to this:
@Bean
fun topology(streamsBuilder: StreamsBuilder): KStream<String, SomeObject> {
    val stream = streamsBuilder.stream<String, SomeObject>(inputTopicName)
    val branches: Array<KStream<String, SomeObject>> = stream.branch(
        { _, value -> isValidRawData(value) },
        { _, failedValue -> true }
    )
    branches[0].map { _, value -> transform(value) }.to(outputTopicName)
    branches[1].foreach { _, value -> s3Service.uploadEvent(value) }
    return stream
}
This is working fine for us, but now we need to extend this code to consume messages with a different schema from a second upstream topic, apply a slightly different transformation, and then write them to the same downstream topic (with a similar schema) as the topology above.
In order to achieve this we have two options:
1. Create a second @Bean factory method, almost identical to the one above, except that its topology consumes from the separate topic and applies a different transformation.
2. Modify the topology above to consume both topics and create a third branch for messages from the second topic, as follows:
@Bean
fun topology(streamsBuilder: StreamsBuilder): KStream<String, SpecificRecord> {
    val topics = listOf("topic1", "topic2")
    val stream = streamsBuilder.stream<String, SpecificRecord>(topics)
    val branches: Array<KStream<String, SpecificRecord>> = stream.branch(
        { _, value -> isRecordFromTopic1(value) },
        { _, value -> isRecordFromTopic2(value) },
        { _, failedValue -> true }
    )
    branches[0].map { _, value -> transformTopic1Record(value) }.to(outputTopicName)
    branches[1].map { _, value -> transformTopic2Record(value) }.to(outputTopicName)
    branches[2].foreach { _, value -> s3Service.uploadEvent(value) }
    return stream
}
Which of these approaches would be the recommended one? Are there things we need to consider from a Kafka Streams resource-management or performance perspective?
Thanks for your suggestions.
Since that collection-of-topics API exists, as you show in the second snippet, I would say both variants are valid and make sense. Everything else is just personal preference. I would go with the first one, since technically everything ends up running on the same Streams engine anyway. The first solution is much easier to support in the future, when you introduce a third record type and so on, or when you need extra logic for a specific stream. You could also have a common stream that reads from all the topics and distributes records via that condition and branches, and do the rest of the logic in each record type's own stream via its own intermediate topic. But still: just my opinion...
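For illustration, a rough sketch of the first option, assuming spring-kafka wires both @Bean methods onto the same StreamsBuilder and reusing the helpers from the question (the validation branch and the S3 error sink are omitted for brevity, and topic names are illustrative):
// Two independent sub-topologies on the same StreamsBuilder, one per upstream topic,
// both writing to the shared downstream topic.
@Bean
fun topic1Topology(streamsBuilder: StreamsBuilder): KStream<String, SpecificRecord> {
    val stream = streamsBuilder.stream<String, SpecificRecord>("topic1")
    stream.map { _, value -> transformTopic1Record(value) }.to(outputTopicName)
    return stream
}

@Bean
fun topic2Topology(streamsBuilder: StreamsBuilder): KStream<String, SpecificRecord> {
    val stream = streamsBuilder.stream<String, SpecificRecord>("topic2")
    stream.map { _, value -> transformTopic2Record(value) }.to(outputTopicName)
    return stream
}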

Quarkus: execute parallel unis

In a Quarkus / Kotlin application, I want to start multiple database requests concurrently. I am new to Quarkus and I am not sure if I am doing things right:
val uni1 = Uni.createFrom().item(repo1).onItem().apply { it.request() }
val uni2 = Uni.createFrom().item(repo2).onItem().apply { it.request() }

return Uni.combine().all()
    .unis(uni1, uni2)
    .asTuple()
    .onItem()
    .apply { tuple -> Result(tuple.item1, tuple.item2) }
    .await()
    .indefinitely()
Will the request() calls really be made in parallel? Is this the right way to do it in Quarkus?
Yes, your code is right.
Uni.combine().all() runs all the passed Unis concurrently. You will get the tuple (containing the individual results) when all the Unis have completed (emitted a result).
From your code, you may remove the tuple step and use combinedWith instead.
Finally, note that await().indefinitely() blocks the caller thread, forever if one of the Unis never completes (for whatever reason). I strongly recommend using await().atMost(...) instead.
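A small sketch of both suggestions applied to the code in the question (the five second timeout is only an illustration):
import java.time.Duration

// combinedWith replaces asTuple() plus the mapping step, and atMost bounds the blocking wait.
return Uni.combine().all()
    .unis(uni1, uni2)
    .combinedWith { r1, r2 -> Result(r1, r2) }
    .await()
    .atMost(Duration.ofSeconds(5))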

Parallel IO requests with Kotlin Flow, Coroutines and NOT suspend function

I run my Netty-based Kotlin application with Spring Boot and WebFlux. The details are as follows:
Java 11;
Kotlin 1.3.61;
Spring Boot 2.2.5.RELEASE;
Spring Vault Core 2.2.2.RELEASE.
I get a file on the web layer. WebFlux creates a Part (org.springframework.http.codec.multipart) out of it. The data is stored in a Project Reactor Flux inside the Part as a stream of DataBuffer chunks of 4 KB each:
Flux<DataBuffer> content();
To stay consistent with the frameworks in use, I transform the Flux into a Kotlin Flow.
Then I use a synchronous Vault client's encrypt(...), submitting the chunks asynchronously (as far as I understand) within the flatMapMerge method (note that encrypt(...) is not suspend; it is a wrapper on top of an HTTP client to a remote encryption provider):
public String encrypt(String keyName, String plaintext);
I have checked this answer https://stackoverflow.com/a/58659423/6612401 and found out that the Flow-based approach should be used with flow { emit(...) }.
My question is: can I use this Flow-based approach with non-suspend functions? Or is there a better approach, considering I am using runBlocking(Dispatchers.IO) and a suspend fold(...) function?
The code is as follows:
@FlowPreview
@ExperimentalCoroutinesApi
private fun getOpenByteArrayAndEncryptText(part: Part): Pair<ByteArray, String> = runBlocking(Dispatchers.IO) {
    val pair = part.content().asFlow()
        .flatMapMerge { dataBuffer ->
            val openByteArray = dataBuffer.asInputStream().readBytes()
            val opentextBase64 = Base64Utils.encodeToString(openByteArray)
            flow { emit(Pair(openByteArray, vaultTransitTemplate.encrypt(KEY_NAME, opentextBase64))) }
        }.fold(Pair(ByteArrayOutputStream(), StringBuilder())) { result, curPair ->
            result.first.writeBytes(curPair.first)
            result.second.append(curPair.second)
            result
        }
    Pair(pair.first.toByteArray(), pair.second.toString())
}
P.S. The fold(...) function collects the open chunks into a ByteArrayOutputStream, to calculate a hash later, and the encrypted chunks into a StringBuilder as the result of encrypting the file.
P.P.S. I have tried this approach. The method submits 5-7 parallel requests on average on my Core i5 8th-gen machine with 4 physical cores. It does its job, but not that fast. With Vault deployed non-locally, I get roughly 1 second per 1 MB of encryption. I understand that this depends on network latency. I don't even consider the speed of encryption on the Vault side; it is lightning fast due to the chunk size of only 4 KB. Are there any ways to increase the speed, concurrency-wise?
P.P.P.S. I have tried playing with concurrency = MAX_CONCURRENT_REQUESTS in flatMapMerge { ... }. Nothing significant in the results so far; it is even better to leave it at the default.
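For reference, a minimal sketch of the general pattern being asked about, calling a plain blocking function from inside flatMapMerge, where encryptBlocking is a hypothetical stand-in for the synchronous Vault call:
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.FlowPreview
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flatMapMerge
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

// encryptBlocking is a hypothetical placeholder for any synchronous, blocking call.
fun encryptBlocking(plaintext: String): String = TODO("blocking call to the encryption provider")

@FlowPreview
fun encryptAll(chunks: Flow<String>, concurrency: Int = 8): Flow<String> =
    chunks.flatMapMerge(concurrency) { plaintext ->
        flow { emit(encryptBlocking(plaintext)) }   // plain (non-suspend) call inside the flow builder
            .flowOn(Dispatchers.IO)                 // run the blocking call on the IO dispatcher
    }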

How to avoid duplicates in BigQuery by streaming with Apache Beam IO?

We are using a pretty simple flow where messages are retrieved from PubSub, their JSON content is flattened into two types (for BigQuery and Postgres) and then inserted into both sinks.
But we are seeing duplicates in both sinks (Postgres was more or less fixed with a unique constraint and "ON CONFLICT... DO NOTHING").
At first we trusted the "insertId" UUID that Apache Beam/BigQuery supposedly creates.
Then we added a "unique_label" attribute to each message before queueing them into PubSub, using data from the JSON itself to give them uniqueness (a device_id plus a reading's timestamp), and subscribed to the topic using that attribute with the "withIdAttribute" method.
Finally we paid for GCP Support, and their "solutions" do not work. They even told us to use the Reshuffle transform, which is deprecated by the way, and some windowing (which we do not want, since we want near-real-time data).
This is the main flow, pretty basic:
[UPDATED WITH LATEST CODE]
Pipeline
val options = PipelineOptionsFactory.fromArgs(*args).withValidation().`as`(OptionArgs::class.java)
val pipeline = Pipeline.create(options)

var mappings = ""

// Value only available at runtime
if (options.schemaFile.isAccessible) {
    mappings = readCloudFile(options.schemaFile.get())
}

val tableRowMapper = ReadingToTableRowMapper(mappings)
val postgresMapper = ReadingToPostgresMapper(mappings)

val pubsubMessages =
    pipeline
        .apply("ReadPubSubMessages",
            PubsubIO
                .readMessagesWithAttributes()
                .withIdAttribute("id_label")
                .fromTopic(options.pubSubInput))

pubsubMessages
    .apply("AckPubSubMessages", ParDo.of(object : DoFn<PubsubMessage, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info("Processing readings: " + context.element().attributeMap["id_label"])
            context.output("")
        }
    }))

val disarmedMessages =
    pubsubMessages
        .apply("DisarmedPubSubMessages",
            DisarmPubsubMessage(tableRowMapper, postgresMapper)
        )

disarmedMessages
    .get(TupleTags.readingErrorTag)
    .apply("LogDisarmedErrors", ParDo.of(object : DoFn<String, String>() {
        @ProcessElement
        fun processElement(context: ProcessContext) {
            LOG.info(context.element())
            context.output("")
        }
    }))

disarmedMessages
    .get(TupleTags.tableRowTag)
    .apply("WriteToBigQuery",
        BigQueryIO
            .writeTableRows()
            .withoutValidation()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.neverRetry())
            .to(options.bigQueryOutput)
    )

pipeline.run()
DisarmPubsubMessage is a PTransform that uses the FlatMapElements transform to produce TableRow and ReadingsInputFlatten (our own class for Postgres) outputs.
We expect zero duplicates, or at least a "best effort" (in which case we would add some cleaning cron job); we paid for these products to run statistics and big data analysis...
[UPDATE 1]
I even appended a new simple transform that logs our unique attribute through a ParDo, which supposedly should ack the PubsubMessage, but this is not the case:
(screenshot: new flow with the AckPubSubMessages step)
Thanks!!
It looks like you are using the global window. One technique would be to window this into N-minute windows, then process the keys in each window and drop any items with duplicate keys.
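For what it is worth, a hedged sketch of that windowing plus deduplication idea applied to the pipeline above, keyed on the existing id_label attribute (the five minute window size is arbitrary):
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage
import org.apache.beam.sdk.transforms.Distinct
import org.apache.beam.sdk.transforms.SerializableFunction
import org.apache.beam.sdk.transforms.windowing.FixedWindows
import org.apache.beam.sdk.transforms.windowing.Window
import org.apache.beam.sdk.values.TypeDescriptors
import org.joda.time.Duration

// Window the unbounded stream, then keep a single message per id_label value within each window.
val dedupedMessages = pubsubMessages
    .apply("Window5m",
        Window.into<PubsubMessage>(FixedWindows.of(Duration.standardMinutes(5))))
    .apply("DedupByIdLabel",
        Distinct.withRepresentativeValueFn(
            SerializableFunction<PubsubMessage, String> { msg -> msg.getAttribute("id_label") })
            .withRepresentativeType(TypeDescriptors.strings()))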
The supported programming languages are Python and Java; your code seems to be Scala, and as far as I know that is not supported. I strongly recommend using Java to avoid relying on features that are unsupported for the language you use.
In addition, I would recommend the following approaches to deal with duplicates; option 2 could meet your need for near-real-time data:
1. message_id. You have probably already read the FAQ on duplicates, which points to a deprecated doc. However, if you check the PubsubMessage object you will notice that messageId is still available, and it will be populated if not set by the publisher: "ID of this message, assigned by the server when the message is published ... It must not be populated by the publisher in a topics.publish call."
2. BigQuery streaming. To catch duplicates while loading the data, you can create a UUID right before inserting into BQ. Please refer to the section "Example sink: Google BigQuery" (a rough sketch follows below).
3. Try the Dataflow template PubSubToBigQuery and validate that there are no duplicates in BQ.
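As a rough illustration of option 2, a client-generated UUID could be attached to each row just before the WriteToBigQuery step. The insert_uuid column name is hypothetical, and note that without a checkpoint (this is where Reshuffle usually comes in) a retried bundle may regenerate a different UUID for the same element:
import com.google.api.services.bigquery.model.TableRow
import org.apache.beam.sdk.transforms.MapElements
import org.apache.beam.sdk.transforms.SimpleFunction
import java.util.UUID

// Attach a client-generated UUID to every row so a later scheduled query can deduplicate,
// e.g. with ROW_NUMBER() OVER (PARTITION BY insert_uuid).
val rowsWithUuid = disarmedMessages
    .get(TupleTags.tableRowTag)
    .apply("AttachInsertUuid", MapElements.via(object : SimpleFunction<TableRow, TableRow>() {
        override fun apply(input: TableRow): TableRow {
            val copy = input.clone()   // never mutate input elements in Beam
            copy.set("insert_uuid", UUID.randomUUID().toString())
            return copy
        }
    }))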