Schema in Avro message - apache

I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't embedding the schema an overhead?
So does that mean it is always important for the producer to batch up messages before writing, so that multiple messages written into one Avro file carry just one schema?
On a different note, is there an option to skip embedding the schema when serializing with the GenericDatum/SpecificDatum writers?

I am reading the following points from the Avro specification:
Apache Avro is a data serialization system.
Avro relies on schemas.
When Avro data is read, the schema used when writing it is always present.
The goal of serialization is to avoid per-value overheads, to make serialization both fast and small.
When Avro data is stored in a file, its schema is stored with it.
You are not supposed to use a data serialization system if you want to write one new file for each new message; that works against the goal of serialization. In that case, you want to separate the metadata from the data.
There is no option to omit the schema when writing an Avro data file; that would go against the Avro specification.
IMO, there should be a balance when batching multiple messages into a single Avro file. Avro files should ideally be sized to improve I/O efficiency; in the case of HDFS, the block size would be the ideal Avro file size.
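To make the batching point concrete, here is a minimal sketch (using a hypothetical two-field schema and file name) of writing many records into one Avro container file with DataFileWriter: the schema goes into the file header once, and each append adds only the binary-encoded record body.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class BatchedAvroFileExample {

    // Hypothetical schema used only for illustration.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"body\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // The schema is written once, into the file header; each append()
        // adds only the binary-encoded record.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, new File("messages.avro"));
            for (long i = 0; i < 1000; i++) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("id", i);
                record.put("body", "message-" + i);
                writer.append(record);
            }
        }
    }
}
```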

You are correct: there is an overhead if you write a single record together with the schema. This may seem wasteful, but in some scenarios the ability to reconstruct the record from the data using this schema is more important than the size of the payload.
Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.
And finally, frameworks like Kafka can plug into a Schema Registry, where rather than storing the schema with each record, they store a pointer to the schema.
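For illustration, this is roughly what a registry-style setup does under the hood: the record is encoded with a raw BinaryEncoder, so the payload carries no schema at all, and both sides must agree on the schema (or a schema ID) out of band. The single-field schema below is made up for the sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemalessPayloadExample {

    public static void main(String[] args) throws IOException {
        // Hypothetical schema used only for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", 42L);

        // Serialize: the bytes contain only the encoded field values,
        // no schema. Both sides must agree on the schema out of band.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        byte[] payload = out.toByteArray();

        // Deserialize with the same (or a resolution-compatible) schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded);
    }
}
```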

Related

Sending avro schema along with payload

I want to implement an Avro serializer/deserializer for a Kafka producer/consumer. There can be multiple scenarios:
The writer schema and reader schema are the same and will never change. In that scenario, there is no need to send the Avro schema along with the payload; the consumer can use its reader schema to deserialize the payload. A sample implementation is provided in this post.
Using the schema resolution feature when the schema evolves over time, Avro can still deserialize with different reader and writer schemas by applying its schema resolution rules. In that case we need to send the Avro (writer) schema along with the payload.
My question: how do I send the schema as well while producing, so that the deserializer can read the whole byte array and separate out the actual payload from the schema? I am using Avro-generated classes. Note: I don't want to use a schema registry.
You need a reader and a writer schema in any Avro use case, even if they are the same. SpecificDatumWriter (for the serializer) and SpecificDatumReader (for the deserializer) both take a schema.
You could use Kafka record headers to carry the AVSC string and send it along with the payload, but keep in mind that Kafka records/batches have an upper bound on allowed size. Using some Schema Registry (it doesn't have to be Confluent's) reduces the overhead from a whole schema string to a simple integer ID.
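A rough sketch of the header approach described above, assuming a hypothetical header key (avro.writer.schema) and topic name; the consumer side would read the header, parse the schema, and pass it as the writer schema to SpecificDatumReader, with its own generated class's schema as the reader schema.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.avro.specific.SpecificRecordBase;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SchemaInHeaderProducer {

    // Hypothetical header key and topic name.
    private static final String SCHEMA_HEADER = "avro.writer.schema";
    private static final String TOPIC = "events";

    public static <T extends SpecificRecordBase> void send(
            KafkaProducer<String, byte[]> producer, String key, T record)
            throws IOException {
        Schema writerSchema = record.getSchema();

        // Encode the record as raw Avro binary (no embedded schema).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<T>(writerSchema).write(record, encoder);
        encoder.flush();

        // Ship the writer schema as a header; the consumer parses it and
        // uses it for schema resolution against its own reader schema.
        ProducerRecord<String, byte[]> pr =
                new ProducerRecord<>(TOPIC, key, out.toByteArray());
        pr.headers().add(SCHEMA_HEADER,
                writerSchema.toString().getBytes(StandardCharsets.UTF_8));
        producer.send(pr);
    }
}
```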

Keeping track of Datalake schemas

I have a general question about keeping track of schemas in a data lake. In various logs, I have some fields which exist in every log, and other fields which differ by log type. My team has a consensus to only add fields and never delete existing ones.
We first bring all the logs into AWS S3 in JSON format and then transform them into Parquet, and this is where the schema becomes important. For the fields which exist in every log, we enforce the original data types, for example id or date. The other fields, which differ by log type, are converted into a JSON string and saved as a single column.
In this case, are there any tools that can be used to find out the exact schema of the data? AWS Glue doesn't seem to offer a way to catalog this kind of data.
Otherwise, please feel free to suggest an appropriate way of keeping track of schema evolution. Thanks much in advance!

Sending data using Avro objects, is there an advantage to using schema registry?

I have an application where I generate Avro objects from an AVSC file and then produce those objects. I can consume them with the same schema in another application, if I wish, by generating the POJOs there. This is done with an Avro plugin. The thing I noticed is that the schema doesn't exist in the schema registry.
I think if I change my producer type/settings it might create it there (I am using Spring Kafka). Is there any advantage to having it there? Is what I am doing at the minute just serialization of data, the same as, say, just creating GSON objects from data and producing them?
Is it bad practice not to have the schema in the registry?
To answer the question, "is there an advantage": yes. At the least, it allows other applications to discover what is contained in the topic, whether that's another Java application using Spring or not.
You don't need the schemas to be contained within the consumer codebase.
And if you are using the Confluent serializers, there's no way to "skip" schema registration, so the schemas should be in the Registry by default under "your_topic-value".
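For reference, a minimal sketch of a producer configured with the Confluent Avro serializer; the broker address, registry URL, topic name, and record builder are placeholders. With this configuration the serializer registers the value schema under "<topic>-value" on first use and embeds only the schema ID in each record.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RegistryAwareProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker and registry addresses.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        // The Confluent serializer registers the record's schema in the
        // Registry and writes only a schema ID plus the Avro binary payload.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // "my-topic" is a placeholder; the value would be an
            // Avro-generated SpecificRecord instance.
            producer.send(new ProducerRecord<>("my-topic", "key", someAvroRecord()));
        }
    }

    private static Object someAvroRecord() {
        // Placeholder: build and return an Avro-generated object here.
        throw new UnsupportedOperationException("build an Avro-generated object");
    }
}
```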

data exchange format to use with Apache Kakfa that provides schema validation

What is the best message format to use with Apache Kafka so that producers and consumers can define a contract, validate data, and serialize/deserialize it? For example, in XML we have XSD, but in JSON there is no universal schema. I read about using Apache Avro, but I'm not sure how fast it will be, as I can't afford more than 5 to 6 ms for schema validation and deserialization. Any inputs, please?
We will be processing thousands of transactions per second, and the SLA for each transaction is 150 ms, so I am looking for something that's very fast.
Avro is often quoted as being slow(er) and adding overhead compared to other binary formats, but I believe that applies to the case where no Schema Registry is used and the full schema travels with every payload; with a Registry, the schema is excluded from the actual message.
Alternatively, you can use Protobuf or Thrift if you absolutely want a schema; however, from what I've seen, serializers for these formats are not as readily available. Plus, the schemas need to be passed between your clients if they are not committed to a central location.
I can confidently say that Avro should be fine for starting out, and the Registry is definitely useful, and not just for Kafka use cases.
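If you want a rough feel for the numbers before committing, something like the crude sketch below (not a proper benchmark: no JMH, no warm-up handling, made-up transaction schema) times the serialize/deserialize round trip for a small record; raw Avro binary encoding of a record this size is typically well under a millisecond.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroLatencySketch {

    public static void main(String[] args) throws IOException {
        // Hypothetical transaction schema used only for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Txn\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"},"
            + "{\"name\":\"currency\",\"type\":\"string\"}]}");

        GenericRecord txn = new GenericData.Record(schema);
        txn.put("id", 1L);
        txn.put("amount", 99.95);
        txn.put("currency", "USD");

        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        int iterations = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // Serialize to raw Avro binary...
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(txn, encoder);
            encoder.flush();

            // ...and deserialize it again.
            BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            reader.read(null, decoder);
        }
        double avgMicros = (System.nanoTime() - start) / 1_000.0 / iterations;
        System.out.printf("avg round trip: %.1f microseconds%n", avgMicros);
    }
}
```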

Event Hub, Stream Analytics and Data Lake pipe questions

After reading this article, I decided to take a shot at building a data ingestion pipeline. Everything works well: I was able to send data to Event Hub, which is ingested by Stream Analytics and sent to Data Lake. But I have a few questions regarding some things that seem odd to me. I would appreciate it if someone more experienced than me could answer them.
Here is the SQL inside my Stream Analytics job:
SELECT
*
INTO
[my-data-lake]
FROM
[my-event-hub]
Now, for the questions:
1. Should I store 100% of my data in a single file, try to split it into multiple files, or try to achieve one file per object? Stream Analytics is storing all the data inside a single file, as a huge JSON array. I tried setting {date} and {time} as variables, but it is still one huge file every day.
2. Is there a way to force Stream Analytics to write every entry from Event Hub to its own file? Or maybe to limit the size of each file?
3. Is there a way to set the name of the file from Stream Analytics? If so, is there a way to overwrite a file if that name already exists?
4. I also noticed the file is available as soon as it is created and is written in real time, so I can see data truncation in it when I download/display the file. Also, before it finishes, it is not valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is being written? Is it smart enough to ignore the last entry, or does it treat the file as an incomplete array of objects?
5. Is it better to store the JSON data as an array or with each object on a new line?
Maybe I am taking a bad approach to my problem, but I have a huge dataset in Google Datastore (Google's NoSQL solution). I only have access to the Datastore, with an account that has limited permissions. I need to store this data in a Data Lake, so I made an application that streams the data from Datastore to Event Hub, which is ingested by Stream Analytics, which writes the files into the Data Lake. It is my first time using these three technologies, but this seems to be the best solution. It is my go-to alternative to ETL chaos.
I am sorry for asking so many questions. I hope someone can help me out.
Thanks in advance.
I am only going to answer the file aspect, point by point:
1. It is normally better to produce larger files for later processing than many very small files. Given that you are using JSON, I would suggest limiting the files to a size that your JSON extractor will be able to manage without running out of memory (if you decide to use a DOM-based parser).
2. I will leave that to an ASA expert.
3. Ditto.
4. The answer here depends on how ASA writes the JSON. Clients can append to files, and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should see only valid JSON documents; if it does not, your query may fail.
5. That depends on how you plan to process the data. Note that if you write the data as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization and to be more "flexible", I would probably go with one JSON document per line; see the sketch below.
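To illustrate the difference in point 5, here is a small sketch (plain Java, hypothetical file names and events) writing the same two events as line-delimited JSON versus a single JSON array; only the former stays parseable line by line while the file is still growing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class JsonLinesSketch {

    public static void main(String[] args) throws IOException {
        List<String> events = List.of(
            "{\"id\":1,\"type\":\"click\"}",
            "{\"id\":2,\"type\":\"view\"}");

        // Line-delimited: each line is a complete, independently parseable
        // document, so a partially written file is still usable and readers
        // can split it by line for parallel processing.
        Path lines = Path.of("events.jsonl");
        for (String event : events) {
            Files.writeString(lines, event + System.lineSeparator(),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // Array form: the document is only valid JSON once the closing
        // bracket is written, which is the truncation problem described above.
        Files.writeString(Path.of("events.json"),
            "[" + String.join(",", events) + "]", StandardCharsets.UTF_8);
    }
}
```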