Data exchange format to use with Apache Kafka that provides schema validation and serialization

What is the best message format to use with Apache Kafka so that producers and consumers can define a contract, validate data, and serialize/deserialize data? For example, in XML we have XSD, but in JSON there is no universal schema standard. I read about using Apache Avro, but I'm not sure how fast it will be, as I can't afford more than 5 to 6 ms for schema validation and deserialization. Any input, please?
We will be processing thousands of transactions per second, and the SLA for each transaction is 150 ms, so I am looking for something that's very fast.

Avro is often quoted as being slow(er) and adding overhead compared to other binary formats, but I believe that applies to the use case of not using a Schema Registry, where the schema has to be embedded in the actual payload.
Alternatively, you can use Protobuf or Thrift if you absolutely want a schema; however, from what I've seen, Kafka serializers for those formats are not as readily available. Plus, the schemas need to be passed between your clients if they are not committed to a central location.
I can confidently say that Avro should be fine for starting out, and the Registry is definitely useful, and not just for Kafka use cases.
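As a concrete illustration of the contract idea, an Avro schema is just a JSON (.avsc) file that both producer and consumer can validate against; the Schema Registry stores and versions it centrally. The record and field names below are hypothetical:

```json
{
  "type": "record",
  "name": "Transaction",
  "namespace": "com.example.payments",
  "fields": [
    {"name": "transaction_id", "type": "string"},
    {"name": "amount_cents", "type": "long"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```

A message that is missing a required field or has a wrong type fails serialization on the producer side, which is what gives you the XSD-like validation the question asks about.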

Related

Best practice for ingesting Pub/Sub payloads into BigQuery

I am looking for best practices on how to ingest various payloads coming from a single subscription (one topic) into BigQuery. The data can be ingested in raw format, but it needs to be structured in a more tabular form and mapped to a published layer for business consumption. I was googling some options, but I need guidance on designing the Pub/Sub message ingestion-to-publication solution in real time.
Should I simply dump all the data, keeping a structure similar to the message, in raw form as a string or JSON, or should I structure the attributes into columns/structs/arrays? What are the pros and cons?
Does it make sense to split the subscription into multiple filtered subscriptions and map them to multiple tables?
Case 1: You know the possible columns (JSON keys) you can receive
I would structure the Pub/Sub message into columns (and if your JSON is pretty nested, take advantage of BigQuery's nested structures)
Case 2: You don't know the possible columns you can receive or suspect they will frequently change
Separate it into columns in your target BigQuery tables as much as you can, and dump the rest as a raw string in a column that you can always parse later with BigQuery SQL.
Also, in case you haven't seen it already: you can now push messages straight from Pub/Sub to BigQuery (without going through Dataflow, which is great at parsing Pub/Sub messages but can tend to be costly): https://cloud.google.com/blog/products/data-analytics/pub-sub-launches-direct-path-to-bigquery-for-streaming-analytics
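For case 2, "parse later with BigQuery SQL" might look like the following sketch, using BigQuery's JSON functions over a raw string column; the table and key names are hypothetical:

```sql
-- raw_events has one STRING column, raw_payload, holding the full JSON message
SELECT
  JSON_VALUE(raw_payload, '$.order_id')   AS order_id,
  JSON_VALUE(raw_payload, '$.user.email') AS user_email,
  JSON_QUERY(raw_payload, '$.items')      AS items_json  -- still JSON, parse further if needed
FROM mydataset.raw_events;
```

Known, stable keys get promoted to real columns this way, while anything new or rare stays queryable in the raw string.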

Sending data using Avro objects, is there an advantage to using a schema registry?

I have an application where I generate Avro objects using an AVSC file and then produce those objects. I can consume them with the same schema in another application if I wish, by generating the POJOs there. This is done with an Avro plugin. The thing I noticed is that the schema doesn't exist in the Schema Registry.
I think if I change my producer type/settings it might create it there (I am using Spring Kafka). Is there any advantage to having it there? Is what I am doing at the minute just serialization of data, the same as, say, just creating GSON objects from data and producing them?
Is it bad practice to not have the schema in the Registry?
To answer the question, "is there an advantage" - yes. At the least, it allows for other applications to discover what is contained in the topic, whether that's another Java application using Spring, or not.
You don't require the schemas to be contained within the consumer codebase
And you say you're using the Confluent serializers; there's no way to "skip" schema registration with those, so the schemas should be in the Registry by default under "your_topic-value".
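For reference, switching the producer over to the Confluent Avro serializer (which registers the schema automatically on first produce) is a configuration change in Spring Kafka; the Registry URL below is a placeholder:

```properties
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
spring.kafka.producer.properties.schema.registry.url=http://localhost:8081
```

With this in place, the generated Avro objects are produced as before, but the schema lands in the Registry under the topic's subject name.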

What to use to serve as an intermediary data source in ETL job?

I am creating an ETL pipeline that uses a variety of sources and sends the data to BigQuery. Talend cannot handle both relational and non-relational database components in one job for my use case, so here's how I am doing it currently:
JOB 1 -- Get data from a source (SQL Server, API, etc.), transform it, and store the transformed data in a delimited file (text or CSV).
JOB 2 -- Use the stored transformed data from the delimited file in JOB 1 as the source, transform it according to BigQuery, and send it.
I am using a delimited text/CSV file as intermediary data storage to achieve this. Since confidentiality of the data is important, and the solution also needs to be scalable to handle millions of rows, what should I use as this intermediary source? Will a relational database help? Or are delimited files good enough? Or is there anything else I can use?
PS: I am deleting these files as soon as the job finishes, but I'm worried about security while the job runs, although it will run on a secure cloud architecture.
Please share your views on this.
In Data Warehousing architecture, it's usually good practice to make the staging layer persistent. Among other things, this gives you the ability to trace the data lineage back to the source, lets you reload your final model from the staging point when business rules change, and gives a full picture of the transformation steps the data went through, all the way from landing to reporting.
I'd also consider changing your design and keeping the staging layer persistent under its own dataset in BigQuery rather than just deleting the files after processing.
Since this is just an operational layer for ETL/ELT and not end-user reports, you will be paying only for storage for the most part.
Now, going back to your question and considering your current design, you could create a bucket in Google Cloud Storage and keep your transformation files there. It offers all the security and encryption you need, and you have full control over permissions. BigQuery works seamlessly with Cloud Storage, and you can even load a table from a Storage file straight from the Cloud Console.
All things considered, whichever direction you choose, I recommend storing the files you're using to load the table rather than deleting them. Sooner or later there will be questions or failures in your final report, and you'll likely need to trace back to the source for investigation.
In a nutshell, the process would be:
|---Extract and Transform---|----Load----|
Source ---> Cloud Storage --> BigQuery
I would do ELT instead of ETL: load the source data as-is and transform it in BigQuery using SQL functions.
This potentially lets you reshape the data (convert it to arrays), filter out columns/rows, and perform the transformation in one single SQL statement.
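A sketch of that ELT step, assuming the raw file was loaded as-is into a staging table; the dataset, table, and column names are hypothetical:

```sql
-- Reshape staged rows into the published layer in one statement:
-- cast types, parse dates, and collapse line items into an array per order.
CREATE OR REPLACE TABLE mydataset.published_orders AS
SELECT
  CAST(order_id AS INT64)                 AS order_id,
  PARSE_DATE('%Y-%m-%d', order_date)      AS order_date,
  ARRAY_AGG(STRUCT(item_id, quantity))    AS items
FROM mydataset.staging_orders
GROUP BY order_id, order_date;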

Schema in Avro message

I see that Avro messages have the schema embedded, followed by the data in binary format. If multiple messages are sent and a new Avro file is created for every message, isn't the schema embedding an overhead?
So does that mean it is always important for the producer to batch up the messages before writing, so that multiple messages written into one Avro file carry just one schema?
On a different note, is there an option to eliminate the schema embedding while serializing using the Generic/SpecificDatum writers?
I am reading the following points from the Avro specs:
Apache Avro is a data serialization system.
Avro relies on schemas.
When Avro data is read, the schema used when writing it is always present.
The goal of serialization is to avoid per-value overheads, to make serialization both fast and small.
When Avro data is stored in a file, its schema is stored with it.
You are not supposed to use a data serialization system if you want to write one new file for each new message; that is opposed to the goal of serialization. In this case, you want to separate the metadata and the data.
There is no option to eliminate the schema while writing an Avro file; it would be against the Avro specification.
IMO, there should be a balance when batching multiple messages into a single Avro file. Avro files should ideally be broken down to improve I/O efficiency. In the case of HDFS, the block size would be the ideal Avro file size.
You are correct: there is an overhead if you write a single record with the schema. This may seem wasteful, but in some scenarios the ability to construct a record from the data using this schema is more important than the size of the payload.
Also take into account that even with the schema included, the data is encoded in a binary format, so it is usually smaller than JSON anyway.
And finally, frameworks like Kafka can plug into a Schema Registry, where rather than store the schema with each record, they store a pointer to the schema.
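To illustrate what such a pointer looks like: the Confluent serializers prepend a 5-byte header (a magic byte plus a big-endian 4-byte schema ID) to each record instead of the full schema. A minimal sketch of that framing in Python; the payload bytes below are placeholders, not real Avro:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format: the first byte is always 0

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the wire-format header: magic byte + big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message back into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    return schema_id, message[5:]

# A consumer reads the schema ID, fetches that schema from the Registry
# (typically caching it), and decodes the payload with it.
msg = frame(42, b"\x02\x06foo")  # 42 = ID assigned by the Registry
schema_id, payload = unframe(msg)
```

So the per-record overhead is a constant 5 bytes, regardless of how large the schema itself is.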

Incremental updates documentation is not clear enough

I have a database where I need to keep up with changes on Wikidata, and while I was looking for ways to do it, I found these three:
RSS
API Call
Socket.IO
I would like to know if there are other ways, and which one is best or recommended by Wikidata.
The answer depends on how up to date you need to keep your database.
As up to date as possible
If you need to keep your database as up to date with Wikidata as possible, then you will probably want to use a combination of the solutions that you have found.
Socket.IO will provide you with a stream of what has changed, but will not necessarily give you all of the information that you need.
(Note: there is an IRC stream that would allow you to do the same thing)
Based on the data provided by the stream, you can then make calls to the Wikidata API to retrieve the new data.
Of course this could result in lots of API calls, so make sure you batch them, and also don't retrieve updates immediately in case lots of changes occur in a row.
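A sketch of that batching, assuming you use the `wbgetentities` API action (which accepts up to 50 entity IDs per request); the ID list here is illustrative:

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"
BATCH_SIZE = 50  # wbgetentities accepts up to 50 entity IDs per call

def batched_urls(entity_ids):
    """Yield one wbgetentities URL per batch of changed entity IDs."""
    for i in range(0, len(entity_ids), BATCH_SIZE):
        batch = entity_ids[i:i + BATCH_SIZE]
        query = urlencode({
            "action": "wbgetentities",
            "ids": "|".join(batch),  # pipe-separated Q-IDs
            "format": "json",
        })
        yield f"{API}?{query}"

# e.g. 120 changed items collected from the stream -> 3 API calls instead of 120
ids = [f"Q{n}" for n in range(1, 121)]
urls = list(batched_urls(ids))
```

Collecting IDs from the stream for a short window and deduplicating them before fetching also covers the "don't retrieve updates immediately" advice, since an item edited several times in a row is fetched only once.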
Daily or Weekly
As well as the three options you have listed above, you also have the database dumps!
https://www.wikidata.org/wiki/Wikidata:Database_download
The JSON & RDF dumps are generally recommended. The JSON dump contains the data exactly as it is stored. These dumps are made weekly.
The XML dumps are not guaranteed to have the same JSON format as the JSON dumps, as they use the internal serialization format. However, daily XML dumps are provided.