Confluent.Kafka SchemaRegistry: reusing Avro schemas?

Avro supports schema reuse: you can define a custom type in one file, a.avsc, and reuse it in another file, b.avsc. That is vitally important for any reasonably complex project, because it avoids having identical entities duplicated under different namespaces and a flood of pointless mappings between them. But I see no such possibility mentioned in the official documentation of the Confluent Kafka platform. Does that mean there is no way to reuse schemas?
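For example (the record names are made up), this is the kind of reuse I mean. a.avsc defines a type once:

    {
      "type": "record",
      "name": "Address",
      "namespace": "com.example",
      "fields": [
        {"name": "street", "type": "string"},
        {"name": "city", "type": "string"}
      ]
    }

and b.avsc then refers to that type by its full name instead of redefining it:

    {
      "type": "record",
      "name": "Customer",
      "namespace": "com.example",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "address", "type": "com.example.Address"}
      ]
    }

(With plain Avro this works as long as whatever parses b.avsc has already seen a.avsc in the same parsing context.)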

Related

How to graphically represent and manipulate an Apache Avro schema

I have been working with Apache Avro lately. Until now I was writing the Avro schema files myself; now I am dealing with other developers' schemas and am looking for a tool to visualize them to save me some time. It would be even better if the tool were also capable of manipulating the schema. My question: are there any tools that serve my need?
I don't have the rep to comment, so leaving this as an answer: This isn't a visualization tool, but if you're dealing with multiple schemata, I highly recommend writing protocol files (Avro IDL) instead of writing schemata directly. They're much easier to read, and each protocol file can compile down to multiple schema files.
As it can tersely define multiple schema files, it might make it easier to grok dependencies without needing a viewer.
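To give a rough idea (the protocol and record names below are invented), a single IDL file can define several related records in one place and reuse them by name:

    @namespace("com.example")
    protocol Example {
      record Address {
        string street;
        string city;
      }

      record Customer {
        string name;
        Address address; // reused by name, no duplication
      }
    }

The avro-tools jar can then split such a file back into individual .avsc files (its idl2schemata command does that), so downstream tooling that expects plain schemata keeps working.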

"Best practice" for HBase data "serialization"

Hi, I am new to HBase and I wonder what the best approach is to serialize and store data in HBase. Is there any convenient way to transform "business objects" at the application level into HBase objects (Put), i.e. the transformation to byte[]? I doubt that it has to be converted manually via helper methods like .toBytes() etc.
What are the best practices and experiences?
I read about Avro, Thrift, n-orm, ...
Can someone share his knowledge?
I would go with the default Java API and enable compression on HDFS, rather than use a framework for efficient serialization/deserialization during RPC calls.
Apparently, updates like adding a column to records in Avro/Thrift would be difficult, as you are forced to delete and recreate.
Secondly, I don't see support for Filters in Thrift/Avro, in case you need to filter data at the source.
My two cents.
For an ORM solution, have a look at https://github.com/impetus-opensource/Kundera.
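To sketch the plain-Java-API route from the first suggestion (the Customer class, column family and qualifiers below are invented), the hand mapping to a Put is usually just a thin helper around the Bytes utility class:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical business object.
    class Customer {
        String id;
        String name;
        long balance;
    }

    public class CustomerMapper {
        private static final byte[] CF = Bytes.toBytes("d"); // column family

        static Put toPut(Customer c) {
            Put put = new Put(Bytes.toBytes(c.id)); // row key
            // addColumn is the HBase 1.x+ name; older client versions use add(...)
            put.addColumn(CF, Bytes.toBytes("name"), Bytes.toBytes(c.name));
            put.addColumn(CF, Bytes.toBytes("balance"), Bytes.toBytes(c.balance));
            return put;
        }
    }

The reverse direction is the same idea with Result.getValue(...) plus Bytes.toString(...)/Bytes.toLong(...).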

Apache ACE XML repository

Currently, Apache ACE uses XML file based repositories. Can we change them to be DBMS based? If yes, are any guidelines available?
ACE uses two layers of abstraction when it comes to storage:
Repository
I'll start at the bottom. Here, ACE introduces the notion of a Repository, which is nothing more than a versioned BLOB of data. Each repository starts versioning at 1, and every time you commit a new BLOB, that version gets bumped. There are multiple such repositories, which can be addressed by name.
Writing a different implementation of this Repository interface is fairly straightforward, and you can use any back-end that supports some form of BLOB, including a DBMS. Do note that at this level, there is no notion of what's inside these BLOBs, so depending on your reasons for using a DBMS here, that might or might not be what you want.
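To make that concrete, a DBMS-backed implementation only has to honour a contract along these lines (the interface below is a simplified illustration of the idea, not ACE's actual Repository API):

    import java.io.InputStream;

    // Simplified, hypothetical version of the versioned-BLOB contract described above.
    // A JDBC-backed implementation could keep a table (repository_name, version, blob)
    // and enforce the "commit against the latest version" rule in SQL.
    interface VersionedBlobStore {

        /** Returns the BLOB stored under the given version, or null if that version does not exist. */
        InputStream checkout(String repositoryName, long version) throws Exception;

        /** Stores data as version fromVersion + 1; fails if fromVersion is no longer the latest. */
        boolean commit(String repositoryName, InputStream data, long fromVersion) throws Exception;
    }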
Object Graph
On top of this Repository, ACE uses an in-memory object graph of POJOs to represent its state. The POJOs hold metadata such as, for an artifact, its URL, bundle symbolic name, version, etc. The POJOs are currently persisted and restored using XStream (that's where the XML comes from). At this level you could opt to store the graph in a completely different way as well (maybe even bypassing the underlying Repository entirely in favor of something else). Note, though, that ACE in general assumes this whole graph of objects is versioned every time it is persisted (so no old data is ever overwritten).
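For the XStream part, the round trip itself is small; here is a generic sketch with an invented metadata POJO (not ACE's real classes):

    import com.thoughtworks.xstream.XStream;

    public class XStreamSketch {
        // Hypothetical metadata POJO, standing in for ACE's real ones.
        public static class Artifact {
            String url;
            String bundleSymbolicName;
            String version;
        }

        public static void main(String[] args) {
            Artifact a = new Artifact();
            a.url = "http://example.org/bundle.jar";
            a.bundleSymbolicName = "org.example.bundle";
            a.version = "1.0.0";

            XStream xstream = new XStream();
            xstream.allowTypes(new Class[] { Artifact.class }); // needed with recent XStream security defaults

            String xml = xstream.toXML(a);                   // persist the graph as XML
            Artifact copy = (Artifact) xstream.fromXML(xml); // restore it again
            System.out.println(xml);
            System.out.println(copy.bundleSymbolicName);
        }
    }

Persisting a new version of the state then just means writing that XML as the next BLOB version in the underlying Repository.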
Hopefully this explains a bit more about what's involved. If you want to discuss this some more, don't hesitate to subscribe to the ACE dev mailing list (see http://ace.apache.org/get-involved/mailing-lists.html for information on how to subscribe).

Avro: a replacement for Writables?

I am very new to Hadoop and have to delve into its serialization. I know that Hadoop comes with its own serialization mechanism called Writables. I am curious to know whether Avro (or Protobuf, Thrift) replaces the Writables interface, or whether Avro is only meant for serializing MR client data and not the internal communication between, say, the namenode and datanode.
Avro is a serialization library (with APIs for a number of languages). Avro is an alternative to using/implementing your key/value objects as Writables, but Hadoop still uses its own RPC data structures when communicating between the various services (datanodes, namenodes, job and task trackers).
I've read somewhere that Avro may well end up being the standard internal data exchange mechanism/serialization framework within Hadoop, which makes sense as it is based on inheritance, much like the "new" Hadoop API (the one that uses the mapreduce namespace for its libraries), whereas the "old" API (mapred libraries) is based on interfaces. That means, in practice, that you can certainly use Avro with both APIs, although one or two things may require custom code if you're using the mapred libs (e.g. multiple output formats, chain mappers).
But Avro offers far more than "just" doing away with the need for your own Writables (although that is, in my view, a considerable plus): it offers fairly efficient serialization, the choice between serializing against generated entity classes (as Thrift requires) or using a so-called GenericRecord structure instead, and no need for tagged data. This is possible because Avro always has its data schema available at read and write time (it's actually saved in JSON format as a header in the data file), which means you have the option of "projecting" from one set of fields to a subset of those fields by simply providing that information implicitly in the schema used to read the data. You can then adapt to changes in the input data structure by tweaking your schemas, rather than changing your code in multiple places. You can also change the way your data is sorted by defining your schema appropriately (there is an optional "order" attribute available).
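To make the projection idea concrete, here is a minimal sketch (the file name and field names are invented) that reads an existing Avro data file with a narrower reader schema via the GenericRecord API:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ProjectionSketch {
        public static void main(String[] args) throws Exception {
            // Reader schema keeping only a subset of the fields the file was written with;
            // it assumes users.avro was written with a record schema also named "User".
            Schema readerSchema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"}]}");

            // The writer schema is taken from the file header; Avro resolves it against
            // the reader schema and simply drops the fields we did not ask for.
            GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readerSchema);
            try (DataFileReader<GenericRecord> fileReader =
                     new DataFileReader<>(new File("users.avro"), datumReader)) {
                for (GenericRecord user : fileReader) {
                    System.out.println(user.get("name"));
                }
            }
        }
    }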

How to generate ORM.XML mapping files from annotations?

At work, we design solutions for rather big entities in the financial services area, and we prefer to have our deployment mappings in XML, since it's easy to change without having to recompile.
We would like to do our development using annotations and generate from them the orm.xml mapping files. I found this proof of concept annotation processor, and something like that is what I'm looking for, but something that has support for most JPA annotations.
We're using WebSphere for development, so we would prefer something that works with the OpenJPA implementation.
Here is a possible approach:
use the annotated classes to generate the database schema
use OpenJPA's SchemaTool to reverse engineer the database schema into their XML schema file
use OpenJPA's ReverseMappingTool to generate XML mapping files from the XML schema file
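Roughly, the tool invocations could look like this (the class names are OpenJPA's own tools, but the flags, file names and configuration below are from memory and should be checked against the OpenJPA documentation for your version; treat it as a sketch):

    # 1. Build the database schema from the annotated entity classes
    java org.apache.openjpa.jdbc.meta.MappingTool -action buildSchema

    # 2. Reflect the live database schema into an XML schema file
    java org.apache.openjpa.jdbc.schema.SchemaTool -action reflect -file schema.xml

    # 3. Generate classes plus XML mapping metadata from that schema file
    java org.apache.openjpa.jdbc.meta.ReverseMappingTool -pkg com.example.model -directory src schema.xml

Each step needs the OpenJPA jars and your persistence unit configuration on the classpath.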