Kotlin: How to parse Avro message to Json or data class

I am trying to set up a service that consumes Kafka topics. I am stuck at the current part.
When the topic is consumed I get back something like:
�{"id": "df7c4bb5-dc19-40f2-9f0a-77956c15be15", "dob": "1924-12-07", "last_name": "Doe", "first_name": "John",} created
which I presume is in Avro format.
Now, any idea how this can be parsed? I would love to have some type of data class or class in general defined as:
data class Person(
val id: String,
val first_name: String,
val last_name: String,
...
)
And map that value to it. Tried some libs, but couldn't quite make it to work. Thoughts, examples?

Based on the leading character, it looks like Avro, yes; however, it might simply be an Avro string type rather than an object with individual fields...
In any case, if you use the Avro Maven plugin (there's a different plugin for Gradle) and define a schema matching the record (it seems you might be using a Schema Registry, so you can get the schema from there), then Maven will build a matching Java class for you (mvn compile) to use from Kotlin.
Upon consuming the data with an appropriate AvroDeserializer class, your record will be deserialized to that type.
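For illustration, here is a minimal Kotlin sketch of such a consumer, assuming the plugin generated a User class from the schema; the broker address, topic name, group id, and Schema Registry URL below are placeholders:

import io.confluent.kafka.serializers.KafkaAvroDeserializer
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import java.time.Duration
import java.util.Properties

fun main() {
    val props = Properties().apply {
        put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        put(ConsumerConfig.GROUP_ID_CONFIG, "user-service")
        put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer::class.java)
        put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer::class.java)
        put("schema.registry.url", "http://localhost:8081")
        // deserialize into the generated specific class rather than a GenericRecord
        put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true)
    }

    // User is the class the Avro plugin generated from the .avsc schema (id, first_name, last_name, dob)
    KafkaConsumer<String, User>(props).use { consumer ->
        consumer.subscribe(listOf("users"))
        while (true) {
            consumer.poll(Duration.ofSeconds(10)).forEach { record ->
                println(record.value())
            }
        }
    }
}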

Related

Kafka, Avro and Schema Registry

I have a Kafka consumer configured with schema polling from the topic. What I would like to do is create another Avro schema on top of the current one and hydrate the data using it; basically, I don't need 50% of the information and need to write some logic to change a couple of fields. That's just an example:
val consumer: KafkaConsumer<String, GenericRecord> = createConsumer()
while (true) {
    consumer.poll(Duration.ofSeconds(10)).forEach { record ->
        println(record.value())
    }
}
The event returned from the stream is pretty complex, so I've modelled a smaller CustomObj as a .avsc file and compiled it to Java. When trying to run the code with the CustomObj, I get Error deserializing key/value for partition. All I want to do is consume an event and then deserialize it into a much smaller object with just selected fields.
return KafkaConsumer<String, CustomObj>(props)
This didn't work, and I'm not sure how I can deserialize it to CustomObj from the GenericRecord. Let me just add that I don't have any access to the stream or its config; I can only consume from it.
In Avro, your reader schema needs to be compatible with the writer schema. By giving the smaller object, you're providing a different reader schema.
It's not possible to directly deserialize to a subset of the input data, so you must parse the larger object and map it to the smaller one (which isn't what deserialization does).
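As an illustration of that mapping step, here is a small Kotlin sketch (the field names and the shape of CustomObj are assumptions, since the real schema isn't shown) that keeps consuming the full GenericRecord and copies only the selected fields:

import org.apache.avro.generic.GenericRecord

data class CustomObj(val id: String, val firstName: String)

fun GenericRecord.toCustomObj() = CustomObj(
    id = get("id").toString(),
    firstName = get("first_name").toString()
)

// inside the existing poll loop:
// consumer.poll(Duration.ofSeconds(10)).forEach { println(it.value().toCustomObj()) }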

Best way to define model using AndroidAnnotations @Rest and Kotlin?

I am totally new to Android and Kotlin and I was looking into Android Annotations.
I managed to decode a JSON response using the following code:
class ExampleModel {
    @JvmField
    final var id: Int = 0
    lateinit var title: String
    var description: String? = null
    var author: Author? = null
}

@Rest(
    rootUrl = "...",
    converters = [MappingJackson2HttpMessageConverter::class]
)
interface ExampleClient {
    @Get("/promotions")
    fun getModels(): List<ExampleModel>
}
Now it does work but there are a couple of questions I'd like to ask.
Is it possible to use data classes? I tried but I kept getting an error from MappingJackson2HttpMessageConverter saying that there was no constructor available.
Is it somehow possible to just ignore extra keys that might appear in the JSON? Let's say that I am not interested in the author data for now; is there a way to just remove its declaration without having the decoding fail with "unexpected key"?
Consider that I usually work with Swift so if you could point me to the "Codable" equivalent in Kotlin I would really appreciate it.
Cheers
Kotlin data classes don't have a default constructor, which is usually required by JSON deserialization libraries. A data class requires at least one constructor argument, but you can work around it: define default values (you can use null). For example:
data class Pojo(val name: String? = null, val age: Int? = null)
Such code will allow you to use the Pojo() constructor. It should work, but it's better to use a JSON deserializer that is more Kotlin-native or to generate data classes with AutoValue.
Jackson, which you're using here, allows you to ignore fields with @JsonIgnoreProperties.
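Putting both suggestions together, a minimal sketch (reusing the class and field names from the question) could look like this, so unknown keys such as author are simply skipped:

import com.fasterxml.jackson.annotation.JsonIgnoreProperties

@JsonIgnoreProperties(ignoreUnknown = true)
data class ExampleModel(
    val id: Int = 0,
    val title: String? = null,
    val description: String? = null
)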
If you're learning Android, don't start with AndroidAnnotations if you don't have to. It's not a very popular or modern solution; I used it in a few projects back in the day, and those were very difficult to maintain or to introduce new developers to. Look into Android Architecture Components and Jetpack; Google made a few nice codelabs. Also, for JSON, pick Moshi or Gson.

Write POJOs to parquet file using reflection

Hi, I'm looking for APIs to write Parquet with the POJOs that I have.
I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter.
However, I am not able to find a way to convert POJOs to GenericRecords (Avro); otherwise I could have used AvroParquetWriter to write out the POJOs into Parquet files.
Any suggestions?
If you want to go through Avro you have two options:
1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs extend SpecificRecord, which can then be used with AvroParquetWriter.
2) Write the conversion from your POJO to GenericRecord yourself. You can do this manually, or a more generic solution would be to use reflection. However, I encountered difficulties with this approach when I tried to read the data: based on the supplied schema, Avro found the POJO in the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For this reason I went with option 1.
Parquet now also supports writing POJOs directly. Here is the pull request on the Parquet GitHub page. However, I think this is not part of an official release yet; in other words, I did not find this code in Maven.
DISCLAIMER: The following code was written when I was in a hurry. It is not efficient, and future versions of Parquet will surely fix this more directly. That being said, this is a lightweight, inefficient approach to what you need. The strategy is POJO -> AVRO -> PARQUET.
POJO -> AVRO: declare a schema via reflection and declare writers and readers based on that schema. At conversion time, write the object to a byte stream and read it back as Avro.
AVRO -> PARQUET: use the AvroParquetWriter included in the parquet-mr project.
private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);

public GenericRecord toAvroGenericRecord() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
    return (GenericRecord) genericRecordReader.read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
}
One more thing: it seems the Parquet writers are currently very strict about null fields. Make sure none of your fields are null before attempting to write to Parquet.
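For the AVRO -> PARQUET step, a minimal Kotlin sketch (the output path parameter and the compression codec are arbitrary choices, not part of the original answer) could look like:

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

fun writeToParquet(records: List<GenericRecord>, schema: Schema, file: String) {
    // builds a ParquetWriter backed by the Avro schema, writes each record, then closes the file
    AvroParquetWriter.builder<GenericRecord>(Path(file))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()
        .use { writer -> records.forEach { writer.write(it) } }
}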
I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66
In short, it does the following:
uses reflection to get the Avro schema from the POJO
using the schema and reflection, it converts POJOs to GenericRecord objects
reflection is applied recursively if the POJO contains other POJOs or lists of POJOs

Why does collection+json use anonymous objects instead of key value pairs

I'm trying to find a data schema which can be used for different scenarios, and a promising format I have found so far is the collection+json format (http://amundsen.com/media-types/collection/).
So far it has a lot of the functionality I need and is very flexible; however, I don't get why it uses anonymous objects (example: {"name" : "full-name", "value" : "J. Doe", "prompt" : "Full Name"}, ) instead of simple key-value pairs (example: "full-name": "J. Doe", ).
I see how you can transfer more information, like the prompt, etc., but the parsing is much slower, and it is harder to create a client for it, since it has to access the fields by searching in an array. When binding the data to a specific view, it has to be known which fields exist, so the anonymous objects have to be converted into a key-value map again.
So is there a real advantage to using these anonymous objects instead of a key-value map?
I think the main reason is that a consumer client does not need to know the format of the data in advance.
As it is proposed now in collection+json, you know that in the data object you will find information about the data simply by parsing through it: 'name' is always the identifying name of the field, 'value' is the value, and so on. Your client can be agnostic about how many fields there are or what their names are:
{
"name" : "full-name",
"value" : "J. Doe",
"prompt" : "Full Name"
},
{
"name" : "age",
"value" : "42",
"prompt" : "Age"
}
if you had instead
{
"full-name" : "J. Doe",
"age" : "42"
}
the client needs to have previous knowledge of your representation, so it should expect and understand 'full-name', 'age', and all the application-specific fields.
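To see why the first form keeps the client generic, here is a small Kotlin sketch (the dataToMap helper is mine, not part of collection+json) that folds the anonymous name/value objects back into a key-value map without knowing any field names up front:

// each data item is a map like {"name": "full-name", "value": "J. Doe", "prompt": "Full Name"}
fun dataToMap(data: List<Map<String, Any?>>): Map<String, Any?> =
    data.associate { it["name"].toString() to it["value"] }

val fields = dataToMap(listOf(
    mapOf("name" to "full-name", "value" to "J. Doe", "prompt" to "Full Name"),
    mapOf("name" to "age", "value" to "42", "prompt" to "Age")
))
// fields == {full-name=J. Doe, age=42}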
I wrote this question and then forgot about it, but found the answer I was looking for here:
https://groups.google.com/forum/#!searchin/collectionjson/key/collectionjson/_xaXs2Q7N_0/GQkg2mvPjqMJ
From Mike Amundsen, the creator of collection+JSON:
I understand your desire to add object serialization patterns to CJ.
However, one of the primary goals of my CJ design is to not support
object serialization. I know that serialization is an often-requested
feature for Web messages. It's a great way to optimize both the code
and the programmer experience. But it's just not what I am aiming for
in this particular design.
I think the extension Kevin ref'd is a good option. Not sure if anyone
is really using it, tho. If you'd like to design one (based on your
"body" idea), that's cool. If you'd like to start w/ a gist instead of
doing the whole "pull" thing, that's cool. Post it and let's talk
about it.
On another note, I've written a very basic client-side parser for CJ
(it's in the basic folder of the repo) to show how to hide the
"costly" parts of navigating CJ's state model cleanly. I actually
have a work item to create a client-side lib that can convert the
state representation into an arbitrary local object model, but haven't
had time to do the work yet. Maybe this is something you'd like help
me with?
At a deeper (and possibly more boring) level, this state-model
approach is part of a bigger pattern I have in mind to treat messages
as "type-less" containers and to allow clients and servers to utilize
whatever local object models they prefer - without the need for
cross-web agreement on that object model. This is an "opinionated
design model" that I am working toward.
Finally, as you rightly point out at the top of the thread, HAL and
Siren are much better able to support object serialization style
messages. I think that's very cool and I encourage folks (including my
clients) to use these other formats when object serialization is the
preferred pattern and to use CJ when state transfer is the preferred
pattern.

Hadoop Serializer Not Found Exception

I have a job whose output format is SequenceFileOutputFormat.
I set the output key and value class like this:
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(SplitInfo.class);
The SplitInfo class implements Serializable, Writable.
I set the io.serializations property as follows:
conf.set("io.serializations","org.apache.hadoop.io.serializer.JavaSerialization,"
+ "org.apache.hadoop.io.serializer.WritableSerialization");
However, on the reducer side I get this error, telling me that Hadoop couldn't find a serializer:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:961)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:892)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:393)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:354)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:476)
at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:61)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:569)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
Can anyone help, please?
The problem was that I was making a stupid mistake: I was not updating a jar. So basically, SplitInfo was not implementing the Writable interface in the old (in-use) jar.
As a general observation: the underlying cause of the error in the OP is that Hadoop can't find a Serializer for a specific type which you're trying to serialize (directly or indirectly, e.g. by using that type as an output key/value). Hadoop cannot find a Serializer for one of two reasons:
your type is not serializable, i.e. it doesn't implement Writable or Serializable (see the sketch after this list)
there is no Serializer available to Hadoop for the kind of serialization your type implements (e.g. your type implements Writable but Hadoop, for one reason or another, cannot use the org.apache.hadoop.io.serializer.WritableSerialization class)
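For reference, here is a minimal Kotlin sketch of a Writable value type (the real SplitInfo isn't shown in the question, so the fields here are assumptions); the default parameter values make the compiler generate the no-arg constructor Hadoop needs to instantiate the class:

import org.apache.hadoop.io.Writable
import java.io.DataInput
import java.io.DataOutput

class SplitInfo(var start: Long = 0L, var length: Long = 0L) : Writable {
    // serialize the fields in a fixed order
    override fun write(out: DataOutput) {
        out.writeLong(start)
        out.writeLong(length)
    }

    // read them back in the same order
    override fun readFields(input: DataInput) {
        start = input.readLong()
        length = input.readLong()
    }
}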
I think you're trying to do something you don't need to. Your output value only needs to implement the Writable interface, and you should just set the output format:
conf.setOutputFormatClass(SequenceFileOutputFormat.class);
You only use the "io.serializations" configuration if you want to use a different serialization framework, which it doesn't look like you need.
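Putting that advice together with the new (mapreduce) API, a minimal job setup sketch in Kotlin might look like this; the job name is arbitrary, and no io.serializations override is needed because WritableSerialization is registered by default:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

fun configureJob(): Job {
    val job = Job.getInstance(Configuration(), "splitinfo-job")
    job.setOutputKeyClass(IntWritable::class.java)
    job.setOutputValueClass(SplitInfo::class.java)
    job.setOutputFormatClass(SequenceFileOutputFormat::class.java)
    return job
}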