I have a complex chain of operators on a Kotlin Flow, and many of them are run in groups in different contexts using flowOn, like this:
flowOf(1, 2, 3)
    .map { /*do some stuff*/ }
    .flowOn(context1)
    .map { /*do some different stuff*/ }
    .flowOn(context2)
According to documentation, each flowOn introduces a channel buffer with default size 64 (configurable).
In addition to this, I have a MutableSharedFlow with a fixed buffer size configured by the extraBufferCapacity parameter to which I'm emitting items.
I would like to monitor the current buffer sizes. However, the buffers are private properties, and there seems to be no method to retrieve a buffer reference or its current size. Is there any way to retrieve it, or is it intended solely for internal Flow purposes?
I have a Kafka consumer configured with a schema and polling from a topic. What I would like to do is create another Avro schema on top of the current one and hydrate data using it. Basically, I don't need 50% of the information, and I need to write some logic to change a couple of fields. This is just an example:
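import java.time.Duration
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.consumer.KafkaConsumer

val consumer: KafkaConsumer<String, GenericRecord> = createConsumer()
while (true) {
    // poll() returns the batch of records; print the value of each one
    consumer.poll(Duration.ofSeconds(10)).forEach { record ->
        println(record.value())
    }
}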
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as an .avsc file and compiled it to Java. When I try to run the code with CustomObj, I get "Error deserializing key/value for partition". All I want to do is consume an event and then deserialize it into a much smaller object with just the selected fields.
return KafkaConsumer<String, CustomObj>(props)
This didn't work, and I'm not sure how I can deserialize the GenericRecord into CustomObj. Let me just add that I don't have any access to the stream or its config; I can only consume from it.
In Avro, your reader schema needs to be compatible with the writer schema. By giving the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to a subset of the input data, so you must parse the larger object and map it to the smaller one (which isn't what deserialization does).
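The "parse the larger object, then map it" step could look roughly like the sketch below. It is plain Java with no Kafka or Avro dependency: a Map<String, Object> stands in for the decoded GenericRecord, and CustomObj's fields (id, status), the RecordMapper class, and the field names are all made up for illustration.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical smaller object holding only the fields we care about.
final class CustomObj {
    final String id;
    final String status;

    CustomObj(String id, String status) {
        this.id = id;
        this.status = status;
    }
}

final class RecordMapper {
    // The Map is a stand-in for the fully decoded record (assumption of this sketch).
    static CustomObj toCustomObj(Map<String, Object> fullRecord) {
        String id = String.valueOf(fullRecord.get("id"));
        // Example of "changing a couple of fields": normalise the status text.
        Object rawStatus = fullRecord.get("status");
        String status = rawStatus == null
                ? "UNKNOWN"
                : rawStatus.toString().toUpperCase(Locale.ROOT);
        // Every other field of the large record is simply ignored.
        return new CustomObj(id, status);
    }
}
```

With a real consumer you would keep KafkaConsumer<String, GenericRecord> as it is and read the fields with GenericRecord.get(String) inside the poll loop instead of going through a Map.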
After aggregating exchanges using a GroupedExchangeAggregationStrategy I need to split them back apart (to emit individual processing time metrics) into the original exchanges.
I tried splitting with the following but the resulting split exchange wraps the original exchange and puts it in the Message body.
Is it possible to split a GroupedExchangeAggregationStrategy aggregate exchange into the original exchanges without the wrapper exchange? I need to use the original exchange properties and would like to do so with a SpEL expression.
.aggregate(constant(true), myGroupedExchangeAggregationStrategy)
.completionInterval(1000)
.completeAllOnStop()
.process { /* do stuff */ }
.split(exchangeProperty(Exchange.GROUPED_EXCHANGE))
.to(/* micrometer timer metric using SpEL expression */)
// ^- the resulting split exchange is wrapped in another exchange
In the event that this isn't currently supported, I'm trying to figure out the best way to implement this behavior on my own without creating a custom Splitter processor for this single feature. I was hoping to somehow override the SplitterIterable that does the wrapping but it doesn't appear to be possible.
Yeah, the GroupedExchangeAggregationStrategy does nothing more than create a java.util.List of all Exchanges. The Splitter EIP, on the other hand, by default splits a List into its elements and puts each element into the message body. Therefore you end up with an Exchange that contains an Exchange in its body.
What you need is an AggregationStrategy that collects all body Objects in a List instead of all Exchanges.
You could try Camel's FlexibleAggregationStrategy, which is configurable through a fluent API.
new FlexibleAggregationStrategy()
    .storeInBody()
    .accumulateInCollection(ArrayList.class)
    .pick(new SimpleExpression("${body}"));
This should create an AggregationStrategy that extracts the body of every message (you can perhaps omit the pick method since body extraction is the pick default), collects them in a List and stores the aggregation in the message body.
To split this aggregate again, a simple split(body()) should be enough.
EDIT due to comment
Yes, you are right, a side effect of my solution is that you lose properties and headers of the original messages because it only aggregates the message bodies.
What you want to do is split the List of Exchanges back into the originals, i.e. the Splitter must not create new Exchanges but reuse the ones already present and throw away the aggregated wrapper Exchange.
As far as I can see in the source code of the Splitter, this is currently not possible:
Exchange newExchange = ExchangeHelper.createCorrelatedCopy(copy, false);
...
if (part instanceof Message) {
    newExchange.setIn((Message) part);
} else {
    Message in = newExchange.getIn();
    in.setBody(part);
}
Per the accepted answer, it doesn't appear to be natively supported.
This custom processor will unwrap a split exchange (i.e. copying the nested exchange Message and properties to the root exchange). The unwrapped exchange will be nearly identical to the original -- it will retain all non-conflicting properties from the root exchange (e.g. Splitter-related properties like split index, etc.)
class ExchangeUnwrapper : Processor {
    override fun process(exchange: Exchange) {
        val wrappedExchange = exchange.`in`.body as Exchange
        ExchangeHelper.copyResultsPreservePattern(exchange, wrappedExchange)
    }
}
// Route.kt
from(...)
    .aggregate(...)
    .process { /* do things with aggregate */ }
    .split(exchangeProperty(Exchange.GROUPED_EXCHANGE))
    .process(ExchangeUnwrapper())
    .process { /* do something with the original exchange */ }
    .end()
This question is similar to Spring reactive streaming data from regular WebClient request, with the difference that I'm not getting a JSON array immediately from my WebClient, but something like this:
{
  "result-set":{
    "docs":[
      {
        "id":"auhcsasb1005_100000"
      },
      {
        "id":"auhcsasb1005_1000000"
      },
      {
        "id":"auhcsasb1005_1000001"
      },
      {
        "id":"auhcsasb1005_1000002"
      },
      ...
      ...
      {
        "EOF":true
      }
    ]
  }
}
This JSON object can be very large (~100MB), and thus needs to be worked on and streamed to the client instead of being parsed as a whole. This is the only way I seem to be able to get the semantics correct:
WebClient.create()
    .get()
    .retrieve()
    .bodyToMono(DontKnowWhatClass.class)
    .flatMapMany(resultSet -> Flux.fromIterable(resultSet.getDocs()))
BUT that means that I'm deserializing 100MB or more in memory, to then create a Flux from it. What I'm wondering is: Am I missing something crucial? Can I somehow just create a Flux from an object like that? Sadly, I have no way to influence how the result-set object is rendered.
I cannot imagine a way to reliably parse such a huge chunk of JSON in smaller parts. You could try to somehow convert the big chunk into smaller tokens and process them step by step.
But there is no guarantee that this would give the expected result, i.e. more memory-efficient parsing.
However, there are other ways to approach this problem, especially when you work reactively.
If you work in a reactive WebFlux context, you could try to use backpressure or rate limiting to make your application only parse a limited number of JSON objects at the same time. This would preserve the limited resources (RAM, CPU, JVM threads etc.) of your application.
And if the size of the objects really passes the 100MB limit, you should really consider questioning the current data model. Is this data structure and its size really suitable?
Maybe this problem cannot be solved by technical / implementation means. Maybe the current application design needs to be changed.
Your controller method can accept a ServerWebExchange, whose response has a method that takes a Publisher: exchange.response.writeWith().
If you have a way to parse the payload in chunks you just create a Flux that emits the parts.
For example, if you don't care about the payload at all and just want to ship it as-is:
@GetMapping("/api/foo/{myId}")
fun foo(exchange: ServerWebExchange, @PathVariable myId: Long): Mono<Void> {
    val content: Flux<DataBuffer> = webClient
        .get()
        .uri("/api/up-stream/bar/$myId")
        .exchange()
        .flatMapMany { it.bodyToFlux<DataBuffer>() }
    return exchange.response.writeWith(content)
}
Make sure you check the content negotiation settings to avoid buffering you didn't expect.
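For the other case mentioned above, where you do want to parse the payload in chunks, here is a rough sketch of what a chunker could look like. It is plain Java with no JSON library: it scans characters, tracks nesting depth, and hands each object that opens at a chosen depth to a callback. DocSplitter and its parameters are made-up names, and choosing the elements purely by depth (4 for the result-set/docs shape in the question) is an assumption of this sketch; a real streaming JSON parser would be more robust.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

final class DocSplitter {
    // Reads JSON from `in` and passes every object that opens at nesting
    // level `elementDepth` to `sink` as raw text, one object at a time.
    static void split(Reader in, int elementDepth, Consumer<String> sink) {
        StringBuilder current = null; // non-null only while inside an element
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (current != null) current.append(ch);
                if (inString) {
                    // Braces inside string literals must not change the depth.
                    if (escaped) escaped = false;
                    else if (ch == '\\') escaped = true;
                    else if (ch == '"') inString = false;
                    continue;
                }
                if (ch == '"') {
                    inString = true;
                } else if (ch == '{' || ch == '[') {
                    depth++;
                    if (ch == '{' && depth == elementDepth && current == null) {
                        current = new StringBuilder("{");
                    }
                } else if (ch == '}' || ch == ']') {
                    if (ch == '}' && depth == elementDepth && current != null) {
                        sink.accept(current.toString()); // one complete element
                        current = null;
                    }
                    depth--;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only one element is held in memory at a time, provided the Reader is fed from a stream rather than from a fully loaded String. The trailing {"EOF":true} marker is emitted like any other element, so the caller would have to filter it out. In a reactive setup the sink could be the emitter of a Flux.create call.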
I am working on a distributed algorithm and decided to use Akka to scale it across machines. The machines need to exchange messages very frequently and these messages reference some immutable objects that exist on every machine. Hence, it seems sensible to "compress" the messages in the sense that the shared, replicated objects should not be serialized in the messages. Not only would this save on network bandwidth but it would also avoid creating duplicate objects on the receiver side whenever a message is deserialized.
Now, my question is how to do this properly. So far, I could think of two options:
1. Handle this on the "business layer", i.e., converting my original message objects to some reference objects that replace references to the shared, replicated objects by some symbolic references. Then, I would send those reference objects rather than the original messages. Think of it as replacing some actual web resource with a URL. Doing this seems rather straight-forward in terms of coding but it also drags serialization concerns into the actual business logic.
2. Write custom serializers that are aware of the shared, replicated objects. In my case, it would be okay that this solution would introduce the replicated, shared objects as global state to the actor systems via the serializers. However, the Akka documentation does not describe how to programmatically add custom serializers, which would be necessary to weave in the shared objects with the serializer. Also, I could imagine that there are a couple of reasons, why such a solution would be discouraged. So, I am asking here.
Thanks a lot!
It's possible to write your own, custom serializers and let them do all sorts of weird things, then you can bind them at the config level as usual:
import akka.serialization.Serializer

class MyOwnSerializer extends Serializer {
  // If you need logging here, introduce a constructor that takes an ExtendedActorSystem.
  // class MyOwnSerializer(actorSystem: ExtendedActorSystem) extends Serializer
  // Get a logger using:
  // private val logger = Logging(actorSystem, this)

  // This is whether "fromBinary" requires a "clazz" or not
  def includeManifest: Boolean = true

  // Pick a unique identifier for your Serializer,
  // you've got a couple of billions to choose from,
  // 0 - 40 is reserved by Akka itself
  def identifier = 1234567

  // "toBinary" serializes the given object to an Array of Bytes
  def toBinary(obj: AnyRef): Array[Byte] = {
    // Put the code that serializes the object here
    Array[Byte]()
  }

  // "fromBinary" deserializes the given array,
  // using the type hint (if any, see "includeManifest" above)
  def fromBinary(
      bytes: Array[Byte],
      clazz: Option[Class[_]]): AnyRef = {
    // Put your code that deserializes here
    null
  }
}
But this raises an important question: if your messages all reference data that is already shared on the machines, why would you want to put a pointer to the object in the message (very bad! messages should be immutable, and a pointer isn't!) rather than some sort of immutable string objectId (kind of your option 1)? This is a much better option when it comes to preserving the immutability of the messages, and there is little change in your business logic (just put a wrapper over the shared state storage).
For more info, see the documentation.
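To make the objectId idea concrete, here is a minimal sketch in plain Java, independent of Akka. SharedObject, SharedRegistry and CompactMessage are hypothetical names: the message carries only the id, and each node resolves that id against its own local copy of the shared data.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a large immutable object replicated on every node.
final class SharedObject {
    final String id;
    final String payload;

    SharedObject(String id, String payload) {
        this.id = id;
        this.payload = payload;
    }
}

// Per-node lookup table: the "wrapper over the shared state storage".
final class SharedRegistry {
    private final Map<String, SharedObject> byId = new HashMap<>();

    void register(SharedObject obj) {
        byId.put(obj.id, obj);
    }

    SharedObject resolve(String id) {
        SharedObject obj = byId.get(id);
        if (obj == null) {
            throw new IllegalArgumentException("Unknown shared object: " + id);
        }
        return obj;
    }
}

// What actually goes over the wire: the id, never the object itself.
final class CompactMessage implements Serializable {
    private static final long serialVersionUID = 1L;
    final String sharedObjectId;
    final int value;

    CompactMessage(String sharedObjectId, int value) {
        this.sharedObjectId = sharedObjectId;
        this.value = value;
    }

    // Sender side: swap the object reference for its id.
    static CompactMessage of(SharedObject shared, int value) {
        return new CompactMessage(shared.id, value);
    }
}
```

The receiver calls resolve(message.sharedObjectId) on its own registry, so it ends up with the instance that already lives on that node and nothing is duplicated by deserialization.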
I finally went with the solution proposed by Diego and want to share some more details on my reasoning and solution.
First of all, I am also in favor of option 1 (handling the "compaction" of messages in the business layer) for these reasons:
Serializers are global to the actor system. Making them stateful is actually a severe violation of Akka's very philosophy, as it goes against the encapsulation of behavior and state in actors.
Serializers have to be created upfront anyway (even when adding them "programmatically").
Design-wise, one can argue that "message compaction" is not a responsibility of the serializer either. In a strict sense, serialization is merely the transformation of runtime-specific data into a compact, exchangeable representation. Changing what to serialize is not a task of a serializer.
Having settled upon this, I still strived for a clear separation of "message compaction" and the actual business logic in the actors. I came up with a neat way to do this in Scala, which I want to share here. The basic idea is to make the message itself look like a normal case class but still allow these messages to "compactify" themselves. Here is an abstract example:
import akka.actor.{Actor, ActorRef}
import Receiver.MyMessage

class Sender extends Actor {
  // Named sharedContext because Actor already defines a member called context.
  def sharedContext: SharedContext = ... // This is the shared data present on every node.
  // ...
  def someBusinessLogic(receiver: ActorRef): Unit = {
    val someData = computeData
    receiver ! MyMessage(someData)
  }
}
class Receiver extends Actor {
  implicit def sharedContext: SharedContext = ... // This is the shared data present on every node.
  def receive = {
    case MyMessage(someData) =>
    // ...
  }
}
object Receiver {
  object MyMessage {
    def apply(someData: SomeData) = MyCompactMessage(someData)
    def unapply(myCompactMessage: MyCompactMessage)(implicit context: SharedContext)
        : Option[SomeData] =
      Some(myCompactMessage.someData(context))
  }
}
As you can see, the sender and receiver code feels just like using a case class and in fact, MyMessage could be a case class.
However, by implementing apply and unapply manually, one can insert one's own "compactification" logic and also implicitly inject the shared data necessary to do the "uncompactification", without touching the sender and receiver. For defining MyCompactMessage, I found Protocol Buffers to be especially suited, as it is already a dependency of Akka and efficient in terms of space and computation, but any other solution would do.
I have a slightly peculiar program which deals with cases very similar to this
(in C#-like pseudo code):
class CDataSet
{
    int m_nID;
    string m_sTag;
    float m_fValue;
    void PrintData()
    {
        //Blah Blah
    }
};
class CDataItem
{
    int m_nID;
    string m_sTag;
    CDataSet m_refData;
    CDataSet m_refParent;
    void Print()
    {
        if(null == m_refData)
        {
            m_refParent.PrintData();
        }
        else
        {
            m_refData.PrintData();
        }
    }
};
Members m_refData and m_refParent are initialized to null and used as follows:
m_refData -> Used when a new data set is added
m_refParent -> Used to point to an existing data set.
A new data set is added only if the field m_nID doesn't match an existing one.
Currently this code is managing around 500 objects with around 21 fields per object and the format of choice as of now is XML, which at 100k+ lines and 5MB+ is very unwieldy.
I am planning to modify the whole shebang to use ProtoBuf, but currently I'm not sure how I can handle the reference semantics. Any thoughts would be much appreciated.
Out of the box, protocol buffers does not have any reference semantics. You would need to cross-reference them manually, typically using an artificial key. Essentially, on the DTO layer you would add a key to CDataSet (that you simply invent, perhaps just an increasing integer), store the key instead of the item in m_refData/m_refParent, and run the fixup manually during serialization/deserialization. You can also just store the index into the set of CDataSet, but that may make insertion etc. more difficult. Up to you; since this is serialization you could argue that you won't insert (etc.) outside of initial population and hence the raw index is fine and reliable.
This is, however, a very common scenario - so as an implementation-specific feature I've added optional (opt-in) reference tracking to my implementation (protobuf-net), which essentially automates the above under the covers (so you don't need to change your objects or expose the key outside of the binary stream).
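For the manual route, the fixup could be sketched as below, using the "index into the set of CDataSet" variant. This is plain Java rather than C#, with no protobuf dependency; DataSet, DataItem, DataItemDto and ReferenceFixup are simplified, hypothetical stand-ins for the classes in the question. It assumes every referenced data set is present in the list passed in.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class DataSet {
    final String tag;
    DataSet(String tag) { this.tag = tag; }
}

final class DataItem {
    final DataSet data;   // set when the item introduced a new data set, else null
    final DataSet parent; // set when the item points at an existing data set, else null
    DataItem(DataSet data, DataSet parent) { this.data = data; this.parent = parent; }
}

// Flat form a serializer can write: object references become integer keys.
final class DataItemDto {
    static final int NONE = -1; // marks a null reference
    final int dataKey;
    final int parentKey;
    DataItemDto(int dataKey, int parentKey) { this.dataKey = dataKey; this.parentKey = parentKey; }
}

final class ReferenceFixup {
    // Before serializing: the key is simply the data set's position in `sets`.
    static List<DataItemDto> toDtos(List<DataItem> items, List<DataSet> sets) {
        Map<DataSet, Integer> keys = new IdentityHashMap<>();
        for (int i = 0; i < sets.size(); i++) keys.put(sets.get(i), i);
        List<DataItemDto> dtos = new ArrayList<>();
        for (DataItem item : items) {
            dtos.add(new DataItemDto(keyOf(keys, item.data), keyOf(keys, item.parent)));
        }
        return dtos;
    }

    // After deserializing: swap each key back for the matching instance.
    static List<DataItem> fromDtos(List<DataItemDto> dtos, List<DataSet> sets) {
        List<DataItem> items = new ArrayList<>();
        for (DataItemDto dto : dtos) {
            items.add(new DataItem(lookup(sets, dto.dataKey), lookup(sets, dto.parentKey)));
        }
        return items;
    }

    // Throws NullPointerException if `set` is not in `keys` (see assumption above).
    private static int keyOf(Map<DataSet, Integer> keys, DataSet set) {
        return set == null ? DataItemDto.NONE : keys.get(set);
    }

    private static DataSet lookup(List<DataSet> sets, int key) {
        return key == DataItemDto.NONE ? null : sets.get(key);
    }
}
```

The list of data sets and the list of DTOs are what you would hand to the serializer. After loading, two items that carried the same key point at the same DataSet instance again.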