Spring WebClient: Parse + Stream very large JSON - spring-webflux

This question is similar to Spring reactive streaming data from regular WebClient request, with the difference that I'm not getting a JSON array immediately from my WebClient, but something like this:
{
  "result-set": {
    "docs": [
      {
        "id": "auhcsasb1005_100000"
      },
      {
        "id": "auhcsasb1005_1000000"
      },
      {
        "id": "auhcsasb1005_1000001"
      },
      {
        "id": "auhcsasb1005_1000002"
      },
      ...
      ...
      {
        "EOF": true
      }
    ]
  }
}
This JSON object can be very large (~100MB), and thus needs to be worked on and streamed to the client, instead of parsed. This is the only way I seem to be able to get the semantics correct:
WebClient.create()
    .get()
    .retrieve()
    .bodyToMono(DontKnowWhatClass.class)
    .flatMapMany(resultSet -> Flux.fromIterable(resultSet.getDocs()))
BUT that means that I'm deserializing 100MB or more in memory, to then create a Flux from it. What I'm wondering is: am I missing something crucial? Can I somehow just create a Flux from an object like that? I have no way to influence how the result-set object is rendered, sadly.

I cannot imagine a way to reliably parse such a huge chunk of JSON in smaller parts. You could try to somehow convert the big chunk into smaller tokens and process them step by step.
But I would assume that there is absolutely no guarantee that this ends with the expected result, i.e. a more memory-efficient parsing.
But there are other ways to approach this problem, especially when you work reactively.
If you work in a reactive WebFlux context, you could try to use backpressure or rate limiting to make your application only parse a limited number of JSON objects at the same time. This would preserve the limited resources (RAM, CPU, JVM threads etc.) of your application.
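As a minimal Kotlin sketch of what such rate limiting could look like once the documents are exposed as a Flux (obtaining that Flux is exactly the hard part discussed above); Doc and process() are placeholder names, not from the question:
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

// Hypothetical element type standing in for one entry of the "docs" array.
data class Doc(val id: String)

// Placeholder for whatever per-document work the application actually does.
fun process(doc: Doc): Mono<Doc> = Mono.just(doc)

fun consumeWithBackpressure(docs: Flux<Doc>): Flux<Doc> =
    docs
        .limitRate(64)                         // request at most 64 elements from upstream at a time
        .concatMap({ doc -> process(doc) }, 1) // keep only one element in flight while processing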
And if the size of the objects really exceeds the 100MB mark, you should seriously consider questioning the current data model. Is this data structure and its size really suitable?
Maybe this problem cannot be solved by technical / implementation means. Maybe the current application design needs to be changed.

You can accept a ServerWebExchange in your controller method; its response has a method that takes a Publisher: exchange.response.writeWith().
If you have a way to parse the payload in chunks, you just create a Flux that emits the parts.
For example, if you don't care about the payload at all and just want to ship it as-is:
@GetMapping("/api/foo/{myId}")
fun foo(exchange: ServerWebExchange, @PathVariable myId: Long): Mono<Void> {
    val content: Flux<DataBuffer> = webClient
        .get()
        .uri("/api/up-stream/bar/$myId")
        .exchange()
        .flatMapMany { it.bodyToFlux<DataBuffer>() }
    return exchange.response.writeWith(content)
}
Make sure you check the content negotiation settings to avoid any buffering you didn't expect.

Related

Monitor buffer size of Kotlin Flow

I have a complex chain of operators on a Kotlin Flow, and many of them are run in groups in different contexts using flowOn, like this:
flowOf(1, 2, 3)
    .map { /*do some stuff*/ }
    .flowOn(context1)
    .map { /*do some different stuff*/ }
    .flowOn(context2)
According to documentation, each flowOn introduces a channel buffer with default size 64 (configurable).
In addition to this, I have a MutableSharedFlow with a fixed buffer size configured by the extraBufferCapacity parameter to which I'm emitting items.
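For reference, a minimal sketch of such a shared flow; the element type and capacity value are placeholders, not taken from the actual code:
import kotlinx.coroutines.channels.BufferOverflow
import kotlinx.coroutines.flow.MutableSharedFlow

val events = MutableSharedFlow<Int>(
    replay = 0,
    extraBufferCapacity = 128,                // the fixed buffer whose fill level I'd like to monitor
    onBufferOverflow = BufferOverflow.SUSPEND // emit() suspends once the buffer is full
)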
I would like to monitor the current buffer sizes; however, the buffers are private properties and there seems to be no method to retrieve a buffer reference or its current size. Is there any way to retrieve it, or is it intended solely for internal Flow purposes?

Kafka, Avro and Schema Registry

I have a Kafka consumer configured with schema polling from the topic. What I would like to do is create another Avro schema on top of the current one and hydrate data using it; basically, I don't need 50% of the information and need to write some logic to change a couple of fields. That's just an example:
val consumer: KafkaConsumer<String, GenericRecord> = createConsumer()
while (true) {
    consumer.poll(Duration.ofSeconds(10)).forEach {
        println(it.value())
    }
}
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as a .avsc file and compiled it to Java. When trying to run the code with the CustomObj, I get Error deserializing key/value for partition. All I want to do is consume an event and then deserialize it into a much smaller object with just selected fields:
return KafkaConsumer<String, CustomObj>(props)
This didn't work, and I'm not sure how I can deserialize it into CustomObj from the GenericRecord. Let me just add that I don't have any access to the stream or its config; I can just consume from it.
In Avro, your reader schema needs to be compatible with the writer schema. By giving the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to only a subset of the input data (that isn't what deserialization does), so you must parse the larger object and map it to the smaller one.
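A minimal Kotlin sketch of that parse-and-map approach; CustomView and the field names id and amount are placeholders, not taken from the real schema:
import java.time.Duration
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.consumer.KafkaConsumer

// Hypothetical narrowed type standing in for the hand-written mapping target.
data class CustomView(val id: String, val amount: Long)

fun consume(consumer: KafkaConsumer<String, GenericRecord>) {
    while (true) {
        consumer.poll(Duration.ofSeconds(10)).forEach { record ->
            // Keep the deserializer configured for GenericRecord and narrow manually.
            val value = record.value()
            val view = CustomView(
                id = value.get("id").toString(),
                amount = value.get("amount") as Long
            )
            println(view)
        }
    }
}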

Can I chunk an InputStream in kotlin by using the sequence/collections APIs?

I'm currently looking at putting together something like the following in my project to make iterating over unknown-length InputStreams easier:
fun InputStream.chunk(size: Int, block: (ByteArray) -> Unit) {
    val buffer = ByteArray(size)
    var read = this.read(buffer)
    while (read > 0) {
        // pass only the bytes actually read, not the whole (possibly stale) buffer
        block(buffer.copyOf(read))
        read = this.read(buffer)
    }
}
I noticed, however, that Kotlin has chunking support in its collections, which got me thinking about whether that's something I could leverage instead of coming up with my own solution.
So this is more of a gut check to ensure I'm not reinventing the wheel: Is there any idiomatic way that I can apply a scope function to an InputStream in kotlin, preserving the semantics of streaming and not reading it entirely into memory?
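For what it's worth, one possible sequence-based sketch (my own assumption, not an established idiom) that keeps the streaming semantics and never reads the whole stream into memory:
import java.io.InputStream

fun InputStream.chunkedSequence(size: Int): Sequence<ByteArray> =
    generateSequence {
        val buffer = ByteArray(size)
        val read = this.read(buffer)
        if (read > 0) buffer.copyOf(read) else null // null ends the sequence at end of stream
    }

// Usage: someInputStream.chunkedSequence(8192).forEach { chunk -> /* handle chunk */ }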

QueryCursorImpl.getAll possibly loses data. Should I prefer IgniteCacheProxy.getAll?

I am trying to figure out which kind of getAll I should use. Currently, the project I work on implements it this way, through QueryCursorImpl:
try (QueryCursor<Cache.Entry<IgnitePrimaryKey, V>> query = cache.query(new ScanQuery<>(filter))) {
    return query.getAll()
        .stream()
        ...
} catch (Exception e) {
    ...
}
However, most of the time my application cannot obtain data (the data is empty). On the other hand, the cache variable is an IgniteCacheProxy and it has a getAll method itself. There is some problem with the set of keys, because the actual signature of IgniteCacheProxy.getAll is the following:
public Map<K, V> getAll(Set<? extends K> keys)
However, if I solve the issue with the keys, maybe I should prefer IgniteCacheProxy.getAll, because I can see asynchronous code inside it, rather than QueryCursorImpl.getAll?
Sounds like you simply want to iterate over all the data you have in the cache. If that's the case, then the easiest way would be to simply get an Iterator (IgniteCache implements Iterable) and then loop through it. Using getAll means fetching all the data from a distributed cache to a client, which is typically a very bad idea, especially for large data sets. The Ignite iterator, however, will fetch data in pages and therefore will never blow the client's memory.
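As a rough Kotlin sketch of that iteration (the types and the handler are placeholders): IgniteCache extends javax.cache.Cache, which is Iterable over its entries, so a plain loop streams them in pages:
import javax.cache.Cache
import org.apache.ignite.IgniteCache

fun <K, V> processAll(cache: IgniteCache<K, V>, handle: (Cache.Entry<K, V>) -> Unit) {
    // Entries are fetched from the cluster lazily, page by page, as the loop advances.
    for (entry in cache) {
        handle(entry)
    }
}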

Are serializers the right spot to remove shared state from Akka messages?

I am working on a distributed algorithm and decided to use Akka to scale it across machines. The machines need to exchange messages very frequently and these messages reference some immutable objects that exist on every machine. Hence, it seems sensible to "compress" the messages in the sense that the shared, replicated objects should not be serialized in the messages. Not only would this save network bandwidth, it would also avoid creating duplicate objects on the receiver side whenever a message is deserialized.
Now, my question is how to do this properly. So far, I could think of two options:
Handle this on the "business layer", i.e., convert my original message objects to some reference objects that replace references to the shared, replicated objects with symbolic references. Then I would send those reference objects rather than the original messages. Think of it as replacing an actual web resource with a URL. Doing this seems rather straightforward in terms of coding, but it also drags serialization concerns into the actual business logic.
Write custom serializers that are aware of the shared, replicated objects. In my case, it would be okay for this solution to introduce the replicated, shared objects as global state to the actor systems via the serializers. However, the Akka documentation does not describe how to programmatically add custom serializers, which would be necessary to weave the shared objects into the serializer. Also, I could imagine that there are a couple of reasons why such a solution would be discouraged. So I am asking here.
Thanks a lot!
It's possible to write your own custom serializers and let them do all sorts of weird things; you can then bind them at the config level as usual:
class MyOwnSerializer extends Serializer {

  // If you need logging here, introduce a constructor that takes an ExtendedActorSystem.
  // class MyOwnSerializer(actorSystem: ExtendedActorSystem) extends Serializer
  // Get a logger using:
  // private val logger = Logging(actorSystem, this)

  // This is whether "fromBinary" requires a "clazz" or not
  def includeManifest: Boolean = true

  // Pick a unique identifier for your Serializer,
  // you've got a couple of billions to choose from,
  // 0 - 40 is reserved by Akka itself
  def identifier = 1234567

  // "toBinary" serializes the given object to an Array of Bytes
  def toBinary(obj: AnyRef): Array[Byte] = {
    // Put the code that serializes the object here
    Array[Byte]()
  }

  // "fromBinary" deserializes the given array,
  // using the type hint (if any, see "includeManifest" above)
  def fromBinary(
      bytes: Array[Byte],
      clazz: Option[Class[_]]): AnyRef = {
    // Put your code that deserializes here
    null
  }
}
But this raises an important question: if your messages all reference data that is already shared on the machines, why would you want to put a pointer to the object in the message (very bad! messages should be immutable, and a pointer isn't!), rather than some sort of immutable string objectId (kinda your option 1)? This is a much better option when it comes to preserving the immutability of the messages, and there is little change in your business logic (just put a wrapper over the shared state storage).
For more info, see the documentation.
I finally went with the solution proposed by Diego and want to share some more details on my reasoning and solution.
First of all, I am also in favor of option 1 (handling the "compaction" of messages in the business layer) for those reasons:
Serializers are global to the actor system. Making them stateful is actually a severe violation of Akka's very philosophy, as it goes against the encapsulation of behavior and state in actors.
Serializers have to be created upfront anyway (even when adding them "programmatically").
Design-wise, one can argue that "message compaction" is not a responsibility of the serializer either. In a strict sense, serialization is merely the transformation of runtime-specific data into a compact, exchangeable representation. Changing what to serialize is not a task of a serializer, though.
Having settled on this, I still strove for a clear separation of "message compaction" and the actual business logic in the actors. I came up with a neat way to do this in Scala, which I want to share here. The basic idea is to make the message itself look like a normal case class but still allow these messages to "compactify" themselves. Here is an abstract example:
class Sender extends Actor {
  // The shared data present on every node (called sharedContext here to avoid
  // clashing with Akka's built-in Actor.context).
  def sharedContext: SharedContext = ...

  // ...

  def someBusinessLogic(receiver: ActorRef) {
    val someData = computeData
    receiver ! MyMessage(someData)
  }
}

class Receiver extends Actor {
  // The shared data present on every node.
  implicit def sharedContext: SharedContext = ...

  def receive = {
    case MyMessage(someData) =>
      // ...
  }
}

object Receiver {
  object MyMessage {
    def apply(someData: SomeData) = MyCompactMessage(someData)
    def unapply(myCompactMessage: MyCompactMessage)(implicit context: SharedContext): Option[SomeData] =
      Some(myCompactMessage.someData(context))
  }
}
As you can see, the sender and receiver code feels just like using a case class, and in fact MyMessage could be a case class.
However, by implementing apply and unapply manually, one can insert one's own "compactification" logic and also implicitly inject the shared data necessary for the "uncompactification", without touching the sender and receiver. For defining MyCompactMessage, I found Protocol Buffers to be especially well suited, as it is already a dependency of Akka and efficient in terms of space and computation, but any other solution would do.