"Best practice" for HBase data "serialization" - serialization

Hi, I am new to HBase and I wonder what the best approach is for serializing and storing data in HBase. Is there a convenient way to transform "business objects" at the application level into HBase objects (Put), i.e. the transformation to byte[]? I doubt that it has to be done manually via helper methods like Bytes.toBytes etc.
What are the best practices and experiences?
I have read about Avro, Thrift, n-orm, ...
Can someone share their knowledge?

I would go with the default Java API and enable compression on HDFS, rather than using a framework for serializing/deserializing efficiently during RPC calls.
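For illustration, a minimal sketch of that manual mapping with the plain Java client API. The Employee class, table layout, and column-family name are invented for the example, and this uses the older Put.add signature (newer HBase versions use Put.addColumn):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical business object; all names here are illustrative.
class Employee {
    String id;
    String name;
    int age;
}

public class EmployeeMapper {
    private static final byte[] CF = Bytes.toBytes("d"); // assumed column family

    /** Converts the business object to a Put by hand via the Bytes helpers. */
    static Put toPut(Employee e) {
        Put put = new Put(Bytes.toBytes(e.id));
        put.add(CF, Bytes.toBytes("name"), Bytes.toBytes(e.name)); // addColumn in newer APIs
        put.add(CF, Bytes.toBytes("age"), Bytes.toBytes(e.age));
        return put;
    }
}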
Apparently, updates such as adding a column to existing records would be difficult in Avro/Thrift, as you are forced to delete and recreate them.
Secondly, I don't see support for Filters in Thrift/Avro, in case you need to filter data at the source.
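With the native client, by contrast, filtering at the source is straightforward. A minimal sketch (column family, qualifier, and value are invented for the example):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterExample {
    /** Builds a scan that filters rows on the region servers, not in the client. */
    static Scan activeEmployees() {
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("d"), Bytes.toBytes("status"), // assumed names
                CompareOp.EQUAL, Bytes.toBytes("active")));
        return scan;
    }
}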
My two cents.
For an ORM solution, kindly have a look at https://github.com/impetus-opensource/Kundera .


Liquibase load data in a format other than CSV

With the loadData option that Liquibase provides, one can specify seed data in CSV format. Is there a way I can provide, say, a JSON or XML file with data that Liquibase would understand?
The use case is that we are trying to put in some sample data which is hierarchical, e.g. a Category - Subcategory relation, which would require putting in the parent id for all related categories. Ideally there would be a way to avoid including the ids in the seed data via, say, JSON:
{
"MainCat1": ["SubCat11", "SubCat12"],
"MainCat2": ["SubCat21", "SubCat22"]
}
This very likely isn't supported (I couldn't make Google help me), but is there a way to write a plugin or something that does this? A pointer to a guide (if any) would help.
NOTE: This is not about specifying the change log in that format.
This is not currently supported, and supporting it robustly would be pretty difficult. The main difficulty lies in the fact that Liquibase is designed to be database-platform agnostic, combined with the design goal of being able to generate the SQL required for an operation without actually performing the operation live.
Inserting data the way you want, without knowing the keys, and just generating SQL that could be run later, is going to be very difficult, perhaps even impossible. I would suggest approaching Nathan, who is the main developer of Liquibase, more directly. The best way to do that might be through the Liquibase JIRA bug database.
If you want to have a crack at implementing it, you could start by looking at the code for the LoadDataChange class (source on GitHub), which is where the CSV support currently lives.
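Purely as a sketch of where a custom change could hook in, assuming roughly the Liquibase 3.x extension API (the loadJsonData change name, the JSON handling, and the hard-coded ids are all hypothetical):

import liquibase.change.AbstractChange;
import liquibase.change.ChangeMetaData;
import liquibase.change.DatabaseChange;
import liquibase.database.Database;
import liquibase.statement.SqlStatement;
import liquibase.statement.core.InsertStatement;

@DatabaseChange(name = "loadJsonData",
        description = "Loads hierarchical seed data from JSON (hypothetical)",
        priority = ChangeMetaData.PRIORITY_DEFAULT)
public class LoadJsonDataChange extends AbstractChange {

    private String file;  // path to the JSON file, set from the changelog

    public String getFile() { return file; }
    public void setFile(String file) { this.file = file; }

    @Override
    public SqlStatement[] generateStatements(Database database) {
        // A real implementation would parse the JSON here (e.g. with Jackson)
        // and assign parent/child ids deterministically, since the SQL may be
        // generated now but executed later. Hard-coded for the sketch:
        InsertStatement parent = new InsertStatement(null, null, "category");
        parent.addColumnValue("id", 1);
        parent.addColumnValue("name", "MainCat1");

        InsertStatement child = new InsertStatement(null, null, "category");
        child.addColumnValue("id", 2);
        child.addColumnValue("name", "SubCat11");
        child.addColumnValue("parent_id", 1);

        return new SqlStatement[] { parent, child };
    }

    @Override
    public String getConfirmationMessage() {
        return "Loaded seed data from " + file;
    }
}

The hard part, as noted above, is assigning those parent/child ids deterministically so the generated SQL is still valid when it is run later.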

Avro a replacement for Writables

I am very new to Hadoop and have to delve into its serialization. I know that Hadoop comes with its own serialization mechanism called Writables. I am curious to know whether Avro (or protobuf, Thrift) replaces the Writables interface, or whether Avro is just meant for serializing the MR client data but not the internal communication between, say, the namenode and a datanode.
Avro is a serialization library (with APIs for a number of languages). Avro is an alternative to using/implementing your key/value objects as Writables, but Hadoop still uses its own RPC data structures when communicating between the various services (datanodes, namenodes, job and task trackers).
I've read somewhere that Avro may well end up being the standard internal data exchange mechanism/serialization framework within Hadoop, which makes sense as it is based on inheritance, much like the "new" Hadoop API (the one that uses the mapreduce namespace for its libraries), whereas the "old" API (mapred libraries) is based on interfaces. That means, in practice, that you can certainly use Avro with both APIs, although one or two things may require custom code if you're using the mapred libs (e.g. multiple output formats, chain mappers).
But Avro offers far more than "just" doing away with the need for your own Writables (although that is, in my view, a considerable plus): it offers fairly efficient serialization, the choice between serializing against generated entity classes (as Thrift requires) or using a so-called GenericRecord structure instead, and not having to have tagged data. This is possible because Avro always has its data schema available at read and write time (it is actually saved in JSON format as a header in the data file), which means you have the option of "projecting" from one set of fields onto a subset of those fields by simply providing this information in the schema used to read the data. You can then adapt to changes in the input data structure by tweaking your schemas, rather than changing your code in multiple places. You can also change the way your data is sorted by defining your schema appropriately (there is an optional ORDER attribute available).
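To make the projection idea concrete, here is a minimal sketch with Avro's generic API (the User schema and field names are invented for the example): write with a full schema, then read back with a narrower reader schema.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroProjectionDemo {
    // Full schema used to write the data.
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // Reader schema that "projects" onto a subset of the fields.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        File f = new File("users.avro");

        GenericRecord user = new GenericData.Record(WRITER);
        user.put("name", "alice");
        user.put("age", 30);

        try (DataFileWriter<GenericRecord> w =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(WRITER))) {
            w.create(WRITER, f);  // the schema is stored in the file header
            w.append(user);
        }

        // Read back with the narrower schema: only "name" is materialized.
        try (DataFileReader<GenericRecord> r = new DataFileReader<>(
                 f, new GenericDatumReader<GenericRecord>(WRITER, READER))) {
            for (GenericRecord rec : r) {
                System.out.println(rec.get("name"));
            }
        }
    }
}

Because the writer's schema travels in the file header, the reader only has to supply the schema it wants to see.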

Which ORM can do this?

Apologies for the shopping list, but I've played with a few ORM-type libraries; most are good, but none have done everything :) On my next project, I'm hoping to find one that can do a few more things out of the box. Have you got any good suggestions?
This is what I am looking for:
Easily select deeply nested data.
For example, PHP Yii's CActiveRecord can do something like this: Contact::model()->with('phone_numbers', 'addresses', 'createdBy.user.company')->findAll();
Easily create/return deeply nested JSON from the database or ORM
Easily load deeply nested JSON data, validate it, and save it to the database correctly
Supports optimistic concurrency control
Handles multi-tenant systems gracefully
ORM stands for Object-Relational Mapper. It lets you convert from the world of rows to the world of objects and the associations between those objects. Nothing in either world has anything to do with JSON or XML serialization. To achieve what you want, you will need to employ a separate serialization framework. It also looks like you don't need an ORM, because you don't plan on having an actual object model: you seem to be thinking in terms of 'data', not 'objects', and you just need glue between a database and a network app.
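For illustration, a minimal sketch of that division of labour, using Jackson as the separate serialization framework (the Contact/PhoneNumber model is invented for the example):

import java.util.Arrays;
import java.util.List;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical nested model; names are illustrative only.
class PhoneNumber {
    public String kind = "mobile";
    public String number = "555-0100";
}

class Contact {
    public String name = "Ada";
    public List<PhoneNumber> phoneNumbers = Arrays.asList(new PhoneNumber());
}

public class JsonDemo {
    public static void main(String[] args) throws Exception {
        // The serialization framework, not the ORM, produces the nested JSON
        // from whatever object graph the ORM (or anything else) returns.
        String json = new ObjectMapper().writeValueAsString(new Contact());
        System.out.println(json);
        // -> {"name":"Ada","phoneNumbers":[{"kind":"mobile","number":"555-0100"}]}
    }
}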
Easily select deeply nested data / Easily create/return deeply nested JSON from the database or ORM
Yet to find one... you need a generic way to convert to/from objects, arrays, and JSON, in and out, recursively.
Easily load deeply nested JSON data, validate it, and save it to the database correctly
Yet to find one.
Supports optimistic concurrency control
Doctrine, or brew your own with a "version" counter on each record (see the sketch after this list).
Handles multi-tenant systems gracefully
Ruby ActiveRecord + Postgres
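The "version" counter idea looks much the same in any stack; as a minimal sketch, here is how JPA expresses it with @Version (entity and field names are invented):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Record {
    @Id
    private Long id;

    private String payload;

    // Managed by the provider, never set by hand. On flush, the provider
    // issues "UPDATE ... SET version = version + 1 WHERE id = ? AND
    // version = ?" and throws an OptimisticLockException if another
    // transaction committed first.
    @Version
    private long version;
}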

Flexible Persistence Layer

I am designing an ASP.NET MVC 2 application. Currently I am leveraging Entity Framework 4 with switchable SQL Server and MySQL datastores.
A requirement recently surfaced for the application to allow user-defined models/entities to be manipulated. Now I'm unsure if a SQL/relational database is appropriate at all; instead of adding/removing 'Employee' objects, for example, the user should be able to define an 'Employee' and what properties it has - effectively adding/removing tables and columns on the fly, at runtime.
Is SQL unsuitable for this? Are there options which allow me to stay within a relational database structure and still satisfy this requirement? Within the Entity Framework, can I regenerate .edmx files 'on the fly' or are there alternatives which achieve similar goals?
I've looked briefly at other options like 'document-based' dbs and 'schema-free/no-sql' dbs, such as MongoDb. I've also looked at some serialization formats such as Google's Protocol Buffers, JSON, and XML. From your experience, are any of these particularly suitable for this purpose? Serialization performance is not a big concern.
The application is in its infancy and I have no time constraints. Essentially I am free to rewrite it as I please, so if scrapping and starting over is a better alternative, I am very open to this. What are your suggestions? Thanks in advance!
Before looking at options, I'd suggest (if you have not already done it :-) getting a clear definition of exactly what users will be able to define. Once you have that, you can deduce the level of flexibility needed, and therefore the type of data store needed to do the job.
One other word of advice: if the clients demand to be able to create anything any way they want, walk away. I've dealt with clients and users at all levels, and one thing that is guaranteed is that users have no interest in the effective and efficient design of data, and will therefore always reduce the data to a pile of poo through sheer neglect.
You need to set some boundaries so that the data store behind the system maintains some integrity.

What is the best way to create mapping from SQL database to GUI?

I think it's quite a common task, but the solutions I have seen don't look very nice.
For example, Qt uses an approach based on the MVC pattern: you must assign all connections manually.
I also remember one PHP engine where pages were created from the DB schema.
What other approaches are there? Which one do you prefer? What are the best practices?
Usually, there is not a one-to-one mapping from the database to the GUI. There are many subtle differences between how you store the data and how that stored data is visualized and edited by the user.
However, you can automate a data-model layer in your code with tools like Hibernate. You will still need to use the model as required to present your user interface.
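As a minimal sketch of what that automated data-model layer looks like with Hibernate/JPA annotations (entity and column names are invented for the example): the row-to-object mapping is generated, while the GUI binding remains hand-written.

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;

// One annotated class replaces hand-written row<->object glue; the GUI
// still decides how (and whether) each field is shown and edited.
@Entity
@Table(name = "customer")
public class Customer {
    @Id
    private Long id;

    @Column(name = "full_name")
    private String fullName;

    // getters/setters omitted for brevity
}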