Stream Analytics and compressed Avro - azure-stream-analytics

Hi, currently we are sending Avro messages to Event Hub, and with Stream Analytics we read from this Event Hub. We saw that it is possible to compress our Avro with deflate compression. Can Stream Analytics read deflate-compressed Avro?

ASA natively supports Avro, CSV, or raw JSON for input sources and output sinks.
I personally use this ability to get Avro -> JSON deserialization for "free" without any custom code. You just have to tell ASA the serialization format when configuring the inputs and outputs.

As of now, ASA does not support compressed messages for any serialization format except Avro, so deflate should work.
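For the producer side, here is a minimal sketch using Apache Avro's Java DataFileWriter (the schema, field names, and class name are illustrative, and the actual Event Hub send is left out) showing how a deflate-compressed Avro container payload might be built; the codec and schema live in the container header, which is what ASA reads:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class DeflateAvroProducer {
    // Illustrative schema; replace with your own event schema.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"value\",\"type\":\"double\"}]}");

    public static byte[] toDeflateAvro(Iterable<GenericRecord> records) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<>(SCHEMA));
        writer.setCodec(CodecFactory.deflateCodec(6)); // deflate, compression level 6
        writer.create(SCHEMA, out);                    // container header: schema + codec
        for (GenericRecord r : records) {
            writer.append(r);
        }
        writer.close();
        return out.toByteArray(); // send this payload to Event Hub as the message body
    }

    public static void main(String[] args) throws Exception {
        GenericRecord r = new GenericData.Record(SCHEMA);
        r.put("id", "sensor-1");
        r.put("value", 42.0);
        byte[] payload = toDeflateAvro(java.util.List.of(r));
        System.out.println("Deflate-compressed Avro payload: " + payload.length + " bytes");
    }
}
```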

Related

How to implement Kafka Connect with my own message format

I have a Kafka topic that contains binary messages (byte arrays).
I would like to write the messages to S3 in Parquet format.
I tried to use Kafka Connect and struggled with the configuration.
My Kafka messages also carry some Kafka headers that need to be written to Parquet as well.
What is the right configuration in this case?
It's not Avro and not JSON.
Can I write the byte array as-is to the Parquet file without serializing it?
Thanks

Is it possible to deserialize ORC files in chunks?

I have a huge ORC object (> 50 GB) in S3. I would like to deserialize it in chunks (in a streaming manner), which allows me to retry from the last offset in case of S3 download failures.
I understand ORC stores its metadata in a footer, so I'm looking for a solution that reads the footer first, followed by chunked deserialization.
S3 supports requesting specific byte ranges of an object over its HTTP API. Assuming you know your stripe size ahead of time, you can use the API to get the file size and calculate the postscript offset, then download only that tail as a chunk. With that metadata, you can then start pulling in the remainder of the file. It'd probably be best to make several range requests, one for each stripe, and decode them concurrently.
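As a rough illustration of that approach, here is a sketch using the AWS SDK for Java v2; the bucket, key, and the 256 KB tail guess are illustrative, and parsing the ORC postscript/footer itself is left to whatever ORC library you use:

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

public class OrcRangedReads {
    public static void main(String[] args) {
        S3Client s3 = S3Client.create();
        String bucket = "my-bucket";   // illustrative
        String key = "big-file.orc";   // illustrative

        // 1. Get the total object size without downloading the object.
        long size = s3.headObject(HeadObjectRequest.builder()
                .bucket(bucket).key(key).build()).contentLength();

        // 2. Fetch only the tail of the file, which contains the ORC
        //    postscript + footer (stripe offsets, lengths, schema).
        long tailLen = Math.min(size, 256 * 1024); // generous guess for the footer size
        byte[] tail = rangedGet(s3, bucket, key, size - tailLen, size - 1);
        // ... parse the postscript/footer from `tail` with your ORC library of choice ...

        // 3. For each stripe described in the footer, issue one ranged GET, e.g.
        //    rangedGet(s3, bucket, key, stripeOffset, stripeOffset + stripeLength - 1);
        //    these requests are independent, so they can be retried and decoded concurrently.
    }

    static byte[] rangedGet(S3Client s3, String bucket, String key, long first, long last) {
        ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(GetObjectRequest.builder()
                .bucket(bucket).key(key)
                .range("bytes=" + first + "-" + last) // HTTP Range header
                .build());
        return bytes.asByteArray();
    }
}
```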

Schema in Avro message

I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't embedding the schema an overhead?
So does that mean it is always important for the producer to batch up messages before writing, so that multiple messages written into one Avro file carry just one schema?
On a different note, is there an option to eliminate schema embedding while serializing with the Generic/SpecificDatum writers?
I am reading the following points from the Avro specification:
Apache Avro is a data serialization system.
Avro relies on schemas.
When Avro data is read, the schema used when writing it is always present.
The goal of serialization is to avoid per-value overheads, to make serialization both fast and small.
When Avro data is stored in a file, its schema is stored with it.
You are not supposed to use a data serialization system if you want to write one new file for each new message; that defeats the goal of serialization. In that case, you want to separate metadata from data.
There is no option to eliminate the schema while writing an Avro file; that would be against the Avro specification.
IMO, there should be a balance when batching multiple messages into a single Avro file. Avro files should ideally be sized to improve I/O efficiency; in the case of HDFS, the block size would be the ideal Avro file size.
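To illustrate that batching idea, here is a minimal sketch (class and file names are illustrative, and the target size is a rough stand-in for an HDFS block) that appends many records to one Avro container and rolls to a new file once it grows past the target, so each file carries its schema only once:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

/** Batches many records into Avro files of roughly one HDFS block each,
 *  so the schema is written once per file instead of once per message. */
public class RollingAvroWriter implements AutoCloseable {
    private static final long TARGET_BYTES = 128L * 1024 * 1024; // ~HDFS block size (illustrative)

    private final Schema schema;
    private final File dir;
    private DataFileWriter<GenericRecord> writer;
    private int fileIndex = 0;
    private long appended = 0;

    public RollingAvroWriter(Schema schema, File dir) throws IOException {
        this.schema = schema;
        this.dir = dir;
        roll();
    }

    public void append(GenericRecord record) throws IOException {
        writer.append(record);
        // Every 1000 appends, sync() flushes the current block and returns the
        // current file position, which approximates the bytes written so far.
        if (++appended % 1000 == 0 && writer.sync() >= TARGET_BYTES) {
            roll();
        }
    }

    private void roll() throws IOException {
        if (writer != null) writer.close();
        File next = new File(dir, "batch-" + (fileIndex++) + ".avro");
        writer = new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, next); // header (including the schema) written once per file
    }

    @Override
    public void close() throws IOException {
        if (writer != null) writer.close();
    }
}
```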
You are correct, there is an overhead if you write a single record together with the schema. This may seem wasteful, but in some scenarios the ability to reconstruct a record from the data using that schema is more important than the size of the payload.
Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.
And finally, frameworks like Kafka can plug into a Schema Registry, where rather than storing the schema with each record, they store a pointer to the schema.
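To make the contrast concrete, here is a hand-rolled sketch of the "pointer" idea: a single record is encoded as raw Avro binary with no embedded schema, prefixed by a 4-byte schema id that the reader resolves elsewhere. This only illustrates the wire layout, it is not the actual registry serializer API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaPointerCodec {

    /** Encodes a single record as raw Avro binary (no schema), prefixed by a
     *  4-byte schema id that a reader can resolve against a schema registry. */
    public static byte[] encode(GenericRecord record, int schemaId) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(4).putInt(schemaId).array());       // the "pointer"
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();
        return out.toByteArray();                                         // no schema embedded
    }

    /** Decodes a payload produced by encode(), given the schema looked up by id. */
    public static GenericRecord decode(byte[] payload, Schema writerSchema) throws IOException {
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(payload, 4, payload.length - 4, null);     // skip the 4-byte id
        return new GenericDatumReader<GenericRecord>(writerSchema).read(null, decoder);
    }
}
```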

Streaming data to an object in S3

We have an input stream which needs to be written to S3. The stream carries a lot of data and I cannot keep it in memory. We don't want to write to local disk and then transfer to S3, for security reasons.
Is there a way to stream data to an S3 object?
I think our problem can be solved using S3 multipart upload, but that is meant for a different purpose - uploading large files. Is there instead an out-of-the-box way to stream data to S3?
This stream has large data and I cannot keep it in memory.
So multipart upload is the correct way to solve this.
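For example, here is a minimal sketch with the AWS SDK for Java v2 low-level multipart API (bucket and key are placeholders, and error handling such as aborting the upload on failure is omitted) that reads the stream one 5 MB part at a time, so only a single part is ever held in memory:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

public class S3StreamUploader {
    // S3 requires every part except the last to be at least 5 MB.
    private static final int PART_SIZE = 5 * 1024 * 1024;

    public static void upload(S3Client s3, String bucket, String key, InputStream in) throws IOException {
        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();

        List<CompletedPart> parts = new ArrayList<>();
        byte[] buffer = new byte[PART_SIZE];
        int partNumber = 1;
        int filled;
        // Read the stream one part at a time; only PART_SIZE bytes are buffered in memory.
        while ((filled = readFully(in, buffer)) > 0) {
            UploadPartResponse resp = s3.uploadPart(
                    UploadPartRequest.builder().bucket(bucket).key(key)
                            .uploadId(uploadId).partNumber(partNumber).build(),
                    RequestBody.fromBytes(java.util.Arrays.copyOf(buffer, filled)));
            parts.add(CompletedPart.builder().partNumber(partNumber).eTag(resp.eTag()).build());
            partNumber++;
        }

        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                .build());
    }

    /** Fills the buffer as far as the stream allows; returns the number of bytes read. */
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}
```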

NiFi: The incoming flow file can not be read as an Avro file

I just started with NiFi 1.4.
I am trying to send pipe-delimited messages via Kafka into Hive, so I am using the ConsumeKafkaRecord_0_10 and PutHiveStreaming processors. ConsumeKafkaRecord routes data on success to PutHiveStreaming.
The ConsumeKafkaRecord writer produces data in Avro format, but PutHiveStreaming fails with:
"The incoming flow file can not be read as an Avro file: java.io.IOException: Not a data file."
PutHiveStreaming can only read Avro datafiles, so you have to make sure the record writer used by ConsumeKafkaRecord is an AvroRecordSetWriter with its Schema Write Strategy set to Embedded Schema.
If the schema isn't embedded, then by the time the flow file reaches the Hive processor it won't be a valid Avro datafile.