How can I see different versions of HBase data in Hive?
As per my understanding, when using HBaseStorageHandler only the latest version of HBase data is available in Hive. Is my understanding correct/up to date?
Is there any way to access different versions of HBase data using Hive?
Thanks in advance :)
(New to HBase-Hive integration)
That would depend on the version of Hive that you are using.
Prior to Hive 1.1, HBase timestamps were not accessible through the Hive-HBase integration [1] (related: [2]).
So the answer is: you need Hive 1.1 or higher.
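For illustration, here is a minimal sketch of what exposing the HBase cell timestamp looks like on Hive 1.1+ via the :timestamp mapping added in [1]. The table and column names are made up; note this surfaces the timestamp of the latest cell per row, it does not expose older versions:

    -- Hedged sketch: assumes Hive 1.1+ with the HBase storage handler available.
    -- Table and column names are illustrative only.
    CREATE EXTERNAL TABLE hbase_versions (
      rowkey STRING,
      val    STRING,
      ts     TIMESTAMP   -- HBase cell timestamp of the latest version
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val,:timestamp')
    TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');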
Hope it helps.
[1] https://issues.apache.org/jira/browse/HIVE-2828
[2] https://issues.apache.org/jira/browse/HIVE-8267
Not a 100% answer, but some directions. In normal life HBase is always about special cases.
Here is a slightly outdated but really simple article for understanding the approach:
http://hortonworks.com/blog/hbase-via-hive-part-1/
So in practice you can implement any InputFormat or OutputFormat you need.
But this is tied to the MapReduce machinery.
In principle Spark can always rely on an InputFormat too, so the question is only about your specific case.
Another good idea is described here: http://www.slideshare.net/HBaseCon/ecosystem-session-3a
So snapshots can help you capture the state of the tables you really need, and then you are free to use any gear to connect Hive with HBase, as long as it follows the standard interfaces.
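For example, if your Hive build supports reading HBase snapshots, a hedged sketch of pointing a query at a snapshot instead of the live table looks roughly like this (the snapshot, table, and directory names are made up, and the property names should be checked against your Hive version):

    -- Hedged sketch: assumes a Hive build with HBase snapshot support.
    SET hive.hbase.snapshot.name=my_table_snapshot;         -- snapshot created in the HBase shell
    SET hive.hbase.snapshot.restoredir=/tmp/hbase_restore;  -- scratch dir used to restore the snapshot
    SELECT rowkey, val FROM hbase_backed_table;             -- scans the snapshot, not the live table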
In general, the basic idea is to tune the gears that connect Hive to your HBase data so that they apply the version filters you need. This does not depend much on versions, as this interface is pretty stable.
Hope this will help you.
Related
Hive supports ACID properties only for ORC-formatted tables.
Can anyone please let me know the reason, or point me to any guide that is available?
It's a current limitation. Here is the text from the official documentation:
Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
More details about Hive transactions can be found here
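For illustration, a minimal sketch of a table that the current transaction support can work with (names are made up; the table has to be ORC, bucketed, and flagged as transactional, and the cluster must have the transaction manager and compactor settings enabled):

    -- Hedged sketch: assumes ACID is enabled on the cluster
    -- (hive.support.concurrency, hive.txn.manager=DbTxnManager, compactor threads, ...).
    CREATE TABLE events_acid (
      id   INT,
      name STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS          -- early ACID releases require bucketed tables
    STORED AS ORC                             -- the only format wired up for transactions so far
    TBLPROPERTIES ('transactional' = 'true');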
There is no specific reason per se.
More formats will be supported in later versions. ORC was the first one to be supported.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
The DHIS2 documentation mentions that it supports MySQL (https://docs.dhis2.org/2.28/en/implementer/html/installation.html); however, that's the last place MySQL is ever mentioned.
Does the current version really support MySQL? If it does, will GIS still work?
From a direct DHIS2 support email:
Up until and including version 2.28, mysql should work.
However, from version 2.29 we require PostgreSQL as the database platform, together with the PostGIS spatial extension. This means that MySQL is no longer supported.
The minimum version required is PostgreSQL 9.1. However we recommend upgrading to a later version as we plan to take advantage of some of the useful features part of PostgreSQL 10 such as logical replication and native partitioning in future versions of DHIS 2.
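For reference, a hedged sketch of the usual database setup for 2.29+ (the role, password, and database names are only illustrative, and the PostGIS package for your PostgreSQL version must already be installed):

    -- Hedged sketch: run as a PostgreSQL superuser; names are illustrative.
    CREATE ROLE dhis WITH LOGIN PASSWORD 'change-me';
    CREATE DATABASE dhis2 OWNER dhis;
    -- then, connected to the dhis2 database:
    CREATE EXTENSION postgis;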
First of all, it is recommended to use PostgreSQL.
Secondly, most of the testing and QA is done on instances running PostgreSQL.
Thirdly, the PostGIS extension is available only for PostgreSQL, which can become a hurdle for you at a later stage.
Fourthly, the GIS data points and boundaries are stored in a format that is better handled by PostgreSQL's database structures.
Therefore, please go with PostgreSQL and chill.
Drill looks like an interesting tool for ad-hoc drill-down queries, as opposed to high-latency Hive.
It seems that there should be decent integration between the two, but I couldn't find it.
Let's assume that today all of my work is done on Hive/Shark; how can I integrate it with Drill?
Do I have to switch back and forth to the Drill engine?
I'm looking for an integration similar to what Shark and Hive have.
Although there are provisions for implementing Drill-Hive integration, your question seems to be a bit ahead of its time. Drill still has a long way to go, and folks have been trying really hard to get all this done as soon as possible.
As per their roadmap, Drill will first support Hadoop FileSystem implementations and HBase. Second, Hadoop-related data formats will be supported (e.g., Apache Avro, RCFile). Third, MapReduce-based tools will be provided to produce column-based formats. Fourth, Drill tables can be registered in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.
See this for more details.
I am looking into replicating data in real time from an Oracle database to a Vertica database.
So far I cannot find anything that is able to do this!
But I have found Tungsten Replicator, which seems to work well for MySQL to Vertica (haven't tested it yet).
My question is:
Are there any tools or ways of doing this (Oracle => Vertica)?
And if so, how would updates be handled?
WisdomForce (acquired by Informatica) does data replication between Oracle and Vertica: http://www.wisdomforce.com/.
Look at Tungsten Replicator. It replicates from MySQL/Oracle to Vertica and is fully open source (GPL V2). You can find the Tungsten Replicator project at code.google.com. (Disclaimer: I work for Continuent and wrote a good chunk of the replicator code.)
Is it possible to use Hive to query a Lucene index that is distributed over Hadoop?
Hadapt is a startup whose software bridges Hadoop with a SQL front-end (like Hive) and hybrid storage engines. They offer an archival text search capability that may meet your needs.
Disclaimer: I work for Hadapt.
As far as I know, you can essentially write custom "row-extraction" code in Hive, so I would guess that you could. I've never used Lucene and have barely used Hive, so I can't be sure. If you find a more conclusive answer to your question, please post it!
I know this is a fairly old post, but thought I could offer a better alternative.
In your case, instead of going through the hassle of mapping your HDFS Lucene index to a Hive schema, it's better to push it into Pig, because Pig can read flat files. Unless you want a relational way of storing your data, you could probably process it through Pig and use HBase as your DB.
You could write a custom InputFormat for Hive to access a Lucene index in Hadoop.
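A hedged sketch of how such a custom InputFormat would be wired into a Hive table definition follows; the LuceneHiveInputFormat class is hypothetical and would have to be implemented and put on Hive's classpath by you, the output format shown is Hive's standard text output format, and the location is made up. Depending on how the index is laid out, you would likely also need a matching SerDe.

    -- Hedged sketch: com.example.lucene.LuceneHiveInputFormat is a hypothetical class
    -- that you would implement yourself; only the layout of the DDL is standard Hive.
    CREATE EXTERNAL TABLE lucene_docs (
      doc_id STRING,
      body   STRING
    )
    STORED AS
      INPUTFORMAT  'com.example.lucene.LuceneHiveInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/indexes/lucene';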