Kafka, Avro and Schema Registry - kotlin

I have a Kafka consumer configured with schema polling from the topic. What I would like to do is create another Avro schema on top of the current one and hydrate data using it: basically I don't need 50% of the information, and I need to write some logic to change a couple of fields. That's just an example:
val consumer: KafkaConsumer<String, GenericRecord> = createConsumer(props)
while (true) {
    consumer.poll(Duration.ofSeconds(10)).forEach {
        println(it.value())
    }
}
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as a .avsc file and compiled it to Java. When I try to run the code with CustomObj I get Error deserializing key/value for partition. All I want to do is consume an event and then deserialize it into a much smaller object with just the selected fields.
return KafkaConsumer<String, CustomObj>(props)
This didn't work. How can I deserialize the GenericRecord into CustomObj? Let me just add that I don't have any access to the stream or its config; I can only consume from it.

In Avro, your reader schema needs to be compatible with the writer schema. By supplying the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to a subset of the input data (that isn't what deserialization does), so you must deserialize the larger object and then map it to the smaller one.
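A minimal Kotlin sketch of that mapping approach: keep consuming GenericRecord and copy only the fields you need into the generated CustomObj. The field names (id, amount) and builder setters here are hypothetical; substitute whatever your .avsc actually declares.
import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.consumer.KafkaConsumer
import java.time.Duration

// Hypothetical mapping: pick just the fields you care about from the full record.
fun toCustomObj(record: GenericRecord): CustomObj =
    CustomObj.newBuilder()
        .setId(record.get("id").toString())
        .setAmount(record.get("amount") as Long)
        .build()

fun consume(consumer: KafkaConsumer<String, GenericRecord>) {
    while (true) {
        consumer.poll(Duration.ofSeconds(10)).forEach { record ->
            val smaller = toCustomObj(record.value())
            println(smaller)  // work with the reduced object from here on
        }
    }
}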

Related

Spring WebClient: Parse + Stream very large JSON

This question is similar to Spring reactive streaming data from regular WebClient request, with the difference that I'm not getting a JSON array immediately from my WebClient, but something like this:
This JSON object can be very large (~100MB), and thus needs to be worked on and streamed to the client instead of being parsed whole. The WebClient call below is the only way I seem to be able to get the semantics correct:
{
  "result-set": {
    "docs": [
      { "id": "auhcsasb1005_100000" },
      { "id": "auhcsasb1005_1000000" },
      { "id": "auhcsasb1005_1000001" },
      { "id": "auhcsasb1005_1000002" },
      ...
      ...
      { "EOF": true }
    ]
  }
}
WebClient.create()
    .get()
    .retrieve()
    .bodyToMono(DontKnowWhatClass.class)
    .flatMapMany(resultSet -> Flux.fromIterable(resultSet.getDocs()))
BUT that means that I'm deserializing 100MB or more in memory, only to then create a Flux from it. What I'm wondering is: am I missing something crucial? Can I somehow just create a Flux from an object like that? I have no way to influence how the result-set object is rendered, sadly.
I cannot imagine a way to reliably parse such a huge chunk of JSON in smaller parts. You could try to somehow split the big chunk into smaller tokens and process them step by step.
But I would assume there is no guarantee that this ends with the expected result, i.e. more memory-efficient parsing.
But there are other ways to approach this problem, especially when you work reactively.
If you work in a reactive WebFlux context, you could try to use backpressure or rate limiting so that your application only parses a limited number of JSON objects at the same time. This would preserve the limited resources (RAM, CPU, JVM threads etc.) of your application.
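For instance, a minimal Reactor sketch of that idea, assuming the documents can already be obtained as a Flux (e.g. from a streaming JSON parser); Doc and handle are hypothetical placeholders:
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

data class Doc(val id: String)                   // hypothetical element type

fun handle(doc: Doc): Mono<Doc> = Mono.just(doc) // hypothetical per-element processing

// limitRate bounds how many elements are requested from upstream at once, and
// concatMap with prefetch 1 processes them sequentially, so only a small window
// of parsed objects is held in memory at any time.
fun process(docs: Flux<Doc>): Flux<Doc> =
    docs.limitRate(64)
        .concatMap({ doc -> handle(doc) }, 1)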
And if the size of the objects really exceeds 100MB, you should seriously consider questioning the current data model. Is this data structure and its size really suitable?
Maybe this problem cannot be solved by technical / implementation means; maybe the current application design needs to be changed.
You can accept a ServerWebExchange in your controller; its response has a method, exchange.response.writeWith(), that takes a Publisher.
If you have a way to parse the payload in chunks, you just create a Flux that emits the parts.
For example, if you don't care about the payload at all and just want to ship it as-is:
@GetMapping("/api/foo/{myId}")
fun foo(exchange: ServerWebExchange, @PathVariable myId: Long): Mono<Void> {
    val content: Flux<DataBuffer> = webClient
        .get()
        .uri("/api/up-stream/bar/$myId")
        .exchange()
        .flatMapMany { it.bodyToFlux<DataBuffer>() }
    return exchange.response.writeWith(content)
}
Make sure you check the content negotiation settings to avoid any buffering you didn't expect.

Need of serialization

I'm new to the concept of serialization; please help me understand it.
What exactly does serialization mean? I have read the definition, but could not understand it in detail.
How are basic types (int, string) serialized?
If we don't use serialization in our code, how will data be transmitted?
Is there any implicit serialization process involved while accessing a database from front-end Java/C# code, for example an insert/delete against the database?
Serialization just takes an object and translates it into something simpler. Imagine that you had an object in C# like so:
class Employee
{
    public int age;
    public string fullname;
}

public static void Main()
{
    var john = new Employee();
    john.age = 21;
    john.fullname = "John Smith";

    var matt = new Employee();
    matt.age = 44;
    matt.fullname = "Matt Rogers";
    ...
This is C# friendly. But if you wanted to save that information in a text file in CSV format, you would end up with something like this:
age,fullname
21,John Smith
44,Matt Rogers
When you write a CSV, you are basically serializing information into a different format - in this case a CSV file. You can serialize your object to XML, JSON, database table(s), memory or something else. Here's an example from Udemy regarding serialization.
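To make the idea concrete in code, here is a small Kotlin sketch (using Jackson purely as an illustration; the Employee class and its values are placeholders) that serializes an object to JSON text and deserializes it back:
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import com.fasterxml.jackson.module.kotlin.readValue

data class Employee(val age: Int, val fullname: String)

fun main() {
    val mapper = jacksonObjectMapper()

    // Serialize: object -> JSON text that can be stored or transmitted.
    val json = mapper.writeValueAsString(Employee(21, "John Smith"))
    println(json)  // {"age":21,"fullname":"John Smith"}

    // Deserialize: JSON text -> object again on the receiving side.
    val restored: Employee = mapper.readValue(json)
    println(restored)
}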
If you don't serialize, confusion will be transmitted. Perhaps your object's ToString() will be implicitly called before transmission and whatever it returns gets transmitted. Therefore it is vital to convert your data to something that is receiver-friendly.
There's always some serialization happening. When you execute a query that populates a DataTable, for example, serialization occurred.
Concept:
Serialization is the process of converting an object into a series of bytes.
Usually the objects we use in an application are complex, and all of them can easily be represented as a series of bytes which can be stored in a file or database, or transferred over the network.
You can make a class serializable just by making it implement the Serializable interface.
For a class to be serialized successfully, two conditions must be met:
The class must implement the java.io.Serializable interface.
All of the fields in the class must be serializable. If a field is not serializable, it must be marked transient.
When the program has finished serializing, if the result is stored in a file with the .ser extension, it can then be used for deserializing.
Serialization assigns a serialVersionUID to the serialized object, which has to match for deserialization.
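As a rough illustration of that flow, here is a minimal Kotlin (JVM) sketch; the Employee class and the employee.ser file name are placeholders:
import java.io.FileInputStream
import java.io.FileOutputStream
import java.io.ObjectInputStream
import java.io.ObjectOutputStream
import java.io.Serializable

// All fields of a serializable class must themselves be serializable (or transient).
data class Employee(val age: Int, val fullname: String) : Serializable {
    companion object {
        private const val serialVersionUID = 1L
    }
}

fun main() {
    val original = Employee(21, "John Smith")

    // Serialize: object -> bytes written to employee.ser
    ObjectOutputStream(FileOutputStream("employee.ser")).use { it.writeObject(original) }

    // Deserialize: bytes -> object (the serialVersionUID has to match)
    val restored = ObjectInputStream(FileInputStream("employee.ser")).use { it.readObject() } as Employee
    println(restored)
}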

Write pojo's to parquet file using reflection

Hi, I'm looking for APIs to write Parquet files from POJOs that I have.
I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter.
However, I am not able to find a way to convert POJOs to GenericRecords (Avro); otherwise I could have used AvroParquetWriter to write the POJOs out to Parquet files.
Any suggestions?
If you want to go through Avro you have two options:
1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs extend SpecificRecord and can then be used with AvroParquetWriter.
2) Write the conversion from your POJO to GenericRecord yourself. You can do this manually, or a more generic solution would be to use reflection. However, I encountered difficulties with this approach when I tried to read the data back: based on the supplied schema, Avro found the POJO on the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For this reason I went with option 1.
Parquet now also supports writing POJOs directly. Here is the pull request on the Parquet GitHub page. However, I think this is not part of an official release yet; in other words, I did not find this code in Maven.
DISCLAIMER: The following code was written when I was in a hurry. It is not efficient, and future versions of Parquet will surely address this more directly. That being said, this is a lightweight, inefficient approach to what you need. The strategy is POJO -> AVRO -> PARQUET.
POJO -> AVRO: declare a schema via reflection and declare writers and readers based on that schema. At conversion time, write the object to a byte stream and read it back as Avro.
AVRO -> PARQUET: use the AvroParquetWriter included in the parquet-mr project.
private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);

public GenericRecord toAvroGenericRecord() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
    return (GenericRecord) genericRecordReader.read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
}
One more thing: it seems the Parquet writers are currently very strict about null fields. Make sure none of your fields are null before attempting to write to Parquet.
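For the AVRO -> PARQUET step, a hedged Kotlin sketch of using AvroParquetWriter with the reflection-derived schema from above (the output path is a placeholder, and the exact builder methods may differ between parquet-avro versions):
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Writes already-converted GenericRecords to a Parquet file.
fun writeToParquet(records: List<GenericRecord>, avroSchema: Schema) {
    AvroParquetWriter.builder<GenericRecord>(Path("/tmp/output.parquet"))
        .withSchema(avroSchema)
        .build()
        .use { writer ->
            records.forEach { writer.write(it) }  // append each record; use { } closes and flushes the writer
        }
}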
I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66
In short, it does the following:
uses reflection to get the Avro schema from the POJO
using the schema and reflection, it converts POJOs to GenericRecord objects
reflection is applied recursively if the POJO contains other POJOs or lists of POJOs

Protobuf concatenation of serialized messages into one file

I have some data serialized with Google protobuf in a series of files, and I wonder if there is a shortcut way of concatenating these smaller files into one larger protobuf without having to read each and every protobuf, group the objects, and write them back out.
Is there a cheap way to join the files together? I.e. do I have to deserialize each individual file?
You can combine Protocol Buffers messages by simple concatenation. It appears that you want the result to form an array, so you'll need to serialize each individual file as an array itself:
message MyItem {
    ...
}

message MyCollection {
    repeated MyItem items = 1;
}
Now if you serialize each file as a MyCollection and then concatenate them (just put the raw binary data together), the resulting file can be read as a one large collection itself.
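A short Kotlin sketch of that concatenation, assuming MyItem and MyCollection are the classes generated from the .proto above (file paths are placeholders):
import java.io.FileInputStream
import java.io.FileOutputStream

// Append one MyCollection chunk to the combined file by writing its raw bytes back to back.
fun appendChunk(items: List<MyItem>, outputPath: String) {
    val chunk = MyCollection.newBuilder().addAllItems(items).build()
    FileOutputStream(outputPath, true).use { chunk.writeTo(it) }  // 'true' = append mode
}

// Parsing the concatenated file yields a single MyCollection whose repeated
// 'items' field contains the items of every chunk, in order.
fun readAll(path: String): MyCollection =
    FileInputStream(path).use { MyCollection.parseFrom(it) }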
In addition to jpa's answer, it might be relevant to say that the data does not need to be in exactly the same container when being serialized for it to be compatible on deserialization.
Consider the following messages:
message FileData {
    required uint32 versionNumber = 1;
    repeated Data initialData = 2;
}

message MoreData {
    repeated Data data = 2;
}
It is possible to serialize those different messages into one single data container and deserialize it as one single FileData message, as long as the FileData is serialized before zero or more MoreData messages, and both FileData and MoreData use the same index for the repeated field.

vb.net object persisted in database

How can I go about storing a VB.NET user-defined object in a SQL database? I am not trying to replicate the properties with columns. I mean something along the lines of converting or encoding my object to a byte array and then storing that in a field in the db. Like when you store an instance of an object in session, but I need the info to persist past the current session.
@Orion Edwards
"It's not a matter of stances. It's because one day, you will change your code. Then you will try to de-serialize the old object, and YOUR PROGRAM WILL CRASH."
My program will not "CRASH", it will throw an exception. Luckily for me, .NET has a whole set of classes dedicated to such an occasion. At which time I will refresh my stale data and put it back in the db. That is the point of this one field (or stance, as the case may be).
You can use serialization - it allows you to store your object in at least 3 forms: binary (suitable for BLOBs), XML (take advantage of MSSQL's XML data type), or plain text (store in a varchar or text column).
Before you head down this road towards your own eventual insanity, you should take a look at this (or one day repeat it):
http://thedailywtf.com/Articles/The-Mythical-Business-Layer.aspx
Persisting objects in a database is not a good idea. It kills all the good things that a database is designed to do.
You could use the BinaryFormatter class to serialize your object to a binary format, then save the resulting data in your database.
The XmlSerializer or the DataContractSerializer in .net 3.x will do the job for you.
@aku, lomaxx and bdukes - your solutions are what I was looking for.
@1800 INFORMATION - while I appreciate your stance on the matter, this is a special case of data that I get from a webservice and that gets refreshed only about once a month. I don't need the data persisted in db form because that's what the webservice is for. Below is the code I finally got to work.
Serialize
' res is my object to serialize
Dim xml_serializer As System.Xml.Serialization.XmlSerializer
Dim string_writer As New System.IO.StringWriter()
xml_serializer = New System.Xml.Serialization.XmlSerializer(res.GetType)
xml_serializer.Serialize(string_writer, res)
Deserialize
' string_writer and xml_serializer from above
Dim serialization As String = string_writer.ToString
Dim string_reader As System.IO.StringReader
string_reader = New System.IO.StringReader(serialization)
Dim res2 As testsedie.EligibilityResponse
res2 = xml_serializer.Deserialize(string_reader)
What you want to do is called "serializing" your object, and .NET has a few different ways to go about it. One is the XmlSerializer class in the System.Xml.Serialization namespace.
Another is in the System.Runtime.Serialization namespace. This has support for a SOAP formatter, a binary formatter, and a base class you can inherit from that all implement a common interface.
For what you are talking about, the BinaryFormatter suggested earlier will probably have the best performance.
I'm backing @1800 INFORMATION on this one.
Serializing objects for long-term storage is never a good idea.
"While I appreciate your stance on the matter, this is a special case of data that I get from a webservice that gets refreshed only about once a month."
It's not a matter of stances. It's because one day, you will change your code. Then you will try to de-serialize the old object, and YOUR PROGRAM WILL CRASH.
If it crashes (or throws an exception), all you are left with is a bunch of binary data to sift through while trying to recreate your objects.
If you are only persisting binary data, why not just save it straight to disk? You also might want to look at using something like XML since, as has been mentioned, if you alter your object definition you may not be able to deserialize it without some hard work.