Reason for ACID property support only in ORC tables - Hive

Hive supports ACID properties only for ORC-formatted tables.
Can anyone let me know the reason, or point me to any guide available?

It's a current limitation. Here's the text from the official documentation:
Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
More details about Hive transactions can be found at https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions

There is no specific reason per se.
More formats will be supported in later versions. ORC was the first one to be supported.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
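For concreteness, here is a minimal sketch (Java over JDBC) of what an ACID-enabled table setup looks like; the HiveServer2 URL and table name are hypothetical, while the session settings and table properties follow the transactions wiki. Bucketing is required in this first release because the delta files that hold updates are organized per bucket.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AcidTableExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // ACID needs the DbTxnManager and concurrency support enabled.
            stmt.execute("SET hive.support.concurrency=true");
            stmt.execute("SET hive.txn.manager="
                    + "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");

            // A transactional table must be bucketed, stored as ORC,
            // and flagged with transactional=true.
            stmt.execute("CREATE TABLE demo_acid (id INT, name STRING) "
                    + "CLUSTERED BY (id) INTO 4 BUCKETS "
                    + "STORED AS ORC "
                    + "TBLPROPERTIES ('transactional'='true')");
        }
    }
}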

Related

Atomic update of multiple files in S3

I need to upload multiple files to S3 from a Java application. The catch is that all the files must be updated atomically, i.e. all or nothing.
I am unable to find any solution for that.
Any suggestions are welcome.
Thanks!
S3 is an eventually consistent store, so you'll need some mechanism like a _commit marker. The Parquet format and others do this for you. The format options depend on your readers; for example, there is no Redshift bulk loader for Parquet, so Avro is a better format for that use case.
What common formats are supported by all systems that need to work with these files?
To date, the only elegant solution I could find was reading the data into a DataFrame (using the Spark libraries) and writing it back out.
I also implemented a check on commit marker files (say, _commit) for locking/sync purposes, which is essentially what the Spark APIs do as well.
Hope that helps. If anyone has another solution, you are most welcome to share it. :)
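To illustrate the _commit idea outside of Spark, here is a rough sketch using the AWS SDK for Java (v1); the bucket, prefix, and file names are hypothetical. The pattern: stage every file under a fresh prefix, write the marker last, and have readers ignore any prefix without the marker. This gives a consistent view by convention only, since S3 itself has no multi-object transaction.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.util.UUID;

public class CommitMarkerUpload {
    private static final String BUCKET = "my-bucket"; // hypothetical bucket

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Stage everything under a fresh prefix so partially uploaded
        // batches are never mistaken for complete ones.
        String prefix = "batches/" + UUID.randomUUID() + "/";
        for (File f : new File[] {new File("a.dat"), new File("b.dat")}) {
            s3.putObject(BUCKET, prefix + f.getName(), f);
        }

        // The marker is written last; its presence means "all files landed".
        s3.putObject(BUCKET, prefix + "_commit", "");

        // Readers should only consume a prefix if the marker exists.
        boolean committed = s3.doesObjectExist(BUCKET, prefix + "_commit");
        System.out.println(prefix + " committed=" + committed);
    }
}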

Questions on Hive external table insert and update

I've seen that from Hive 0.14 onward, inserts and updates are available. My first question is: do inserts and updates work for external tables?
If they do, how does it work? I guess the underlying HDFS files have to be modified by appending new lines and by rewriting the affected lines, respectively.
Thanks!
Yes, Hive 0.14 supports inserts, updates, and deletes. Having said that, the feature comes with a number of limitations; currently there is no support for external tables. Please see the full list of limitations here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
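To make the limitation concrete, here is a sketch of the DML that does work against a managed, bucketed, transactional ORC table (connection details and names are hypothetical); the same UPDATE against an external table is rejected with a semantic error, because ACID has to own the file layout (deltas compacted into base files).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AcidDmlExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Transaction manager settings required for ACID DML.
            stmt.execute("SET hive.support.concurrency=true");
            stmt.execute("SET hive.txn.manager="
                    + "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");

            // demo_acid is assumed to be a managed ORC table created with
            // CLUSTERED BY ... INTO n BUCKETS and 'transactional'='true'.
            stmt.execute("UPDATE demo_acid SET name = 'renamed' WHERE id = 1");
            stmt.execute("DELETE FROM demo_acid WHERE id = 2");
        }
    }
}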

Is it possible to use Avro schema resolution without a reader?

I know there is a way to read an Avro record into a schema that is compatible with the one used to write that record (given both schemas). I wonder whether there is a way to transform a record into a similar record with another schema that is compatible with the old one. Compatibility is meant in the sense of Avro's schema resolution rules.
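No answer was recorded here, but one standard approach is to round-trip the record through Avro's binary encoding and let the reader side perform schema resolution; the helper below is hypothetical, while the classes are from Avro's Java generic API. The round trip costs a serialization, but it reuses Avro's own resolution logic (matching fields by name, applying defaults, promoting types) rather than reimplementing it.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class SchemaMigrator {
    // Convert a record written with oldSchema into an equivalent record
    // under newSchema, relying on Avro's schema resolution rules.
    public static GenericRecord convert(GenericRecord record,
                                        Schema oldSchema,
                                        Schema newSchema) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(oldSchema).write(record, encoder);
        encoder.flush();

        BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        // The reader resolves from oldSchema (writer) to newSchema (reader).
        return new GenericDatumReader<GenericRecord>(oldSchema, newSchema)
                .read(null, decoder);
    }
}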

How can I see different versions of HBase data using Hive?

How can I see different versions of HBase data in Hive?
As per my understanding, when using the HBaseStorageHandler only the latest version of the HBase data is available in Hive. Is my understanding correct and up to date?
Is there any way to access different versions of HBase data using Hive?
Thanks in advance :)
(New to HBase-Hive integration)
That would depend on the version of Hive that you are using.
Prior to Hive 1.1, HBase timestamps were not accessible through the Hive-HBase integration [1] (related: [2]).
So the answer is: you require Hive 1.1 or higher.
Hope it helps.
[1] https://issues.apache.org/jira/browse/HIVE-2828
[2] https://issues.apache.org/jira/browse/HIVE-8267
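For reference, HIVE-2828 exposes the cell timestamp through the special :timestamp token in hbase.columns.mapping. A sketch of such a table definition over JDBC follows (connection details and names are hypothetical); note that this surfaces the timestamp of the latest cell version, not the full version history.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HBaseTimestampTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // :key maps the HBase row key; :timestamp maps the cell
            // timestamp (available from Hive 1.1, per HIVE-2828).
            stmt.execute("CREATE EXTERNAL TABLE hbase_events "
                    + "(rowkey STRING, payload STRING, ts TIMESTAMP) "
                    + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                    + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = "
                    + "':key,cf:payload,:timestamp') "
                    + "TBLPROPERTIES ('hbase.table.name' = 'events')");
        }
    }
}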
Not a 100% answer, but some directions. In practice, HBase is always about special cases.
Here is a slightly outdated but really simple article for understanding the approach:
http://hortonworks.com/blog/hbase-via-hive-part-1/
Practically, you can implement any InputFormat or OutputFormat you need, but that is tied to the MapReduce machinery. In principle Spark can always rely on an InputFormat too, so the question is only about your specific case.
Another good idea is described here: http://www.slideshare.net/HBaseCon/ecosystem-session-3a
Snapshots can capture the state of the tables you need, and then you are free to use any tool that follows the standard interfaces to connect Hive with HBase.
In general, the basic idea is to tune the machinery that connects Hive to your HBase data so that it applies the version filters you need. This does not depend much on the versions involved, as the interface is pretty stable.
Hope this will help you.

Liquibase load data in a format other than CSV

With the loadData option that Liquibase provides, one can specify seed data in CSV format. Is there a way I can provide, say, a JSON or XML file with data that Liquibase would understand?
The use case is that we are trying to put in some sample data which is hierarchical, e.g. a category/subcategory relation, which would require putting in the parent id for all related categories. We would like a way to avoid including the ids in the seed data by using, say, JSON:
{
  "MainCat1": ["SubCat11", "SubCat12"],
  "MainCat2": ["SubCat21", "SubCat22"]
}
Very likely this is not supported (I couldn't make Google help me), but is there a way to write a plugin or extension that does this? A pointer to a guide (if any) would help.
NOTE: This is not about specifying the changelog itself in that format.
This is not currently supported, and supporting it robustly would be pretty difficult. The main difficulty lies in the fact that Liquibase is designed to be database-platform agnostic, combined with the design goal of being able to generate the SQL required for an operation without actually performing the operation live.
Inserting data the way you want, without knowing the keys, while just generating SQL that could be run later, is going to be very difficult, perhaps even impossible. I would suggest approaching Nathan, who is the main developer of Liquibase, more directly. The best way to do that might be through the Liquibase JIRA issue tracker.
If you want to have a crack at implementing it, you could start by looking at the code for the LoadDataChange class (source on GitHub), which is where the CSV support currently lives.
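As a starting point, here is a skeleton of such an extension against the Liquibase 3.x API; the change name, table, and column values are made up, and the JSON parsing is left as a stub. If the class lives under the liquibase.ext package, Liquibase should discover it automatically.

import liquibase.change.AbstractChange;
import liquibase.change.ChangeMetaData;
import liquibase.change.DatabaseChange;
import liquibase.database.Database;
import liquibase.statement.SqlStatement;
import liquibase.statement.core.InsertStatement;

import java.util.ArrayList;
import java.util.List;

// Hypothetical refactoring usable as <loadJsonData file="seed.json"/>.
@DatabaseChange(name = "loadJsonData",
        description = "Loads hierarchical seed data from a JSON file",
        priority = ChangeMetaData.PRIORITY_DEFAULT)
public class LoadJsonDataChange extends AbstractChange {

    private String file; // path to the JSON seed file

    public String getFile() { return file; }
    public void setFile(String file) { this.file = file; }

    @Override
    public SqlStatement[] generateStatements(Database database) {
        List<SqlStatement> statements = new ArrayList<SqlStatement>();
        // Parse the JSON here (e.g. with Jackson), walk the hierarchy,
        // and generate the parent ids yourself so that plain SQL can be
        // produced offline; one InsertStatement per row, for example:
        InsertStatement insert = new InsertStatement(null, null, "CATEGORY");
        insert.addColumnValue("ID", 1);
        insert.addColumnValue("NAME", "MainCat1");
        statements.add(insert);
        return statements.toArray(new SqlStatement[0]);
    }

    @Override
    public String getConfirmationMessage() {
        return "Loaded JSON seed data from " + file;
    }
}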