I've seen that, as of Hive 0.14, insertions and updates are available. My first question is: do insertions and updates work for external tables?
If they do, how do they work? I guess the related HDFS files have to be modified by appending new lines and by updating the involved lines, respectively.
Thanks!
Yes, Hive 0.14 supports inserts, updates, and deletes. Having said that, the feature comes with a number of limitations; in particular, there is currently no support for external tables. Please see here for the full list of limitations - https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
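To make the distinction concrete, here is a minimal sketch (table and column names are hypothetical). A plain INSERT works against an external table, but UPDATE and DELETE are rejected because external tables cannot be transactional:

    -- INSERT works on external tables
    CREATE EXTERNAL TABLE raw_events (id INT, payload STRING)
    LOCATION '/data/raw_events';
    INSERT INTO TABLE raw_events VALUES (1, 'ok');

    -- this fails with a SemanticException: external tables cannot be
    -- declared transactional, and UPDATE requires a transactional table
    UPDATE raw_events SET payload = 'fixed' WHERE id = 1;

Note also that, on transactional tables, updates do not modify the existing HDFS files in place: Hive writes delta files next to the base files, merges them at read time, and compacts them in the background.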
Why shouldn't Hive use TRUNCATE in its operations?
I heard that TRUNCATE is avoided in Hive and that it can cause problems. Why is that?
I don't know of any issues with TRUNCATE in Hive. It removes all data from the table's files, updates the statistics, and leaves an empty table ready for the next load. If your files are clean and your statistics are updated, what else would you expect from this command?
When someone says it has problems, there must be a strong reason behind it, or they are speaking from past experience. But going by the documentation and my own experience, I do not see it as an issue.
You can refer to this official link for more info - https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/impala_truncate_table.html
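For reference, the syntax is straightforward (the table and partition names below are made up). One genuine caveat worth knowing: Hive's TRUNCATE only works on managed tables and raises an error on external ones.

    TRUNCATE TABLE sales;                                -- remove all rows
    TRUNCATE TABLE sales PARTITION (dt = '2016-01-01');  -- empty one partition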
Hive supports ACID properties only for ORC-formatted tables.
Can anyone please explain the reason, or point me to any guide that covers it?
It's a current limitation. Here's the text from the official documentation:
Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
More details about Hive transactions can be found here: https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
There is no specific reason per se.
More formats will be supported in later versions; ORC was simply the first one for which the integration work was done.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
I am a bit surprised to learn that Hive now has an UPDATE statement (although it appears to go all the way back to v0.14), even though I have been aware for some time that full or near-full RDBMS-SQL functionality is on Hive's roadmap.
Can you summarize how Hive's INSERT, UPDATE, and DELETE differ from those of relational databases, and what their limitations are (Hive is at v2.1.0 as of this writing)?
Should Hive continue to improve its RDBMS-like SQL capabilities over, say, the next 2-3 years, will it then be useful for relational DB workloads?
(I'm not aware of the full roadmap, though. Pardon me if this is a stupid question, or a question due to laziness in browsing through the documentation.)
Hive has supported INSERT for a long time. However, UPDATE and DELETE operations have the following requirements (see the sketch after this list):
only the ORC format is supported
the table must be bucketed
the table must declare TBLPROPERTIES ("transactional"="true")
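A minimal sketch of a table that meets all three requirements, together with the client-side settings listed on the Hive Transactions wiki page (table and column names are hypothetical):

    SET hive.support.concurrency=true;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.enforce.bucketing=true;  -- needed on Hive 1.x; implied on 2.x

    CREATE TABLE user_profiles (uid INT, email STRING)
    CLUSTERED BY (uid) INTO 8 BUCKETS         -- bucketed
    STORED AS ORC                             -- ORC format
    TBLPROPERTIES ("transactional"="true");   -- transactional

    UPDATE user_profiles SET email = 'new@example.com' WHERE uid = 7;
    DELETE FROM user_profiles WHERE uid = 9;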
Latency is still an issue with these operations; the link below describes the use cases for which ACID compatibility was introduced. Note, however, that Hive's roadmap does not aim to replace transactional relational databases.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Limitations
We use Hive (v1.2.1) to run SQL-like reads against Accumulo (v1.7.1) tables.
Are there any special settings we can configure in Hive, or elsewhere, to gain performance or stability?
If we use Hive this way, is there any point in trying out Hive indexing or settings like "hive.auto.convert.join", or do those work differently and have no real effect in this case?
Thank you!
Obligatory: I wrote (most of) the AccumuloStorageHandler, but I am by no means a Hive expert.
The biggest gain you will probably find is when you can structure your query so that you prune the row-space (via a predicate in the WHERE clause over the :rowid-mapped column). To my knowledge, there isn't much (any?) query optimization that is pushed down into Accumulo itself.
Depending on your workload, you could use Hive to generate your own "index tables" in Accumulo. If you can make a custom table that has the column you want to actively query stored in the Accumulo row, your queries should run much faster.
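For concreteness, a sketch of both ideas (the table names, column family, and mappings are all made up; Accumulo connection properties are assumed to be supplied via hiveconf). The first entry in accumulo.columns.mapping binds a Hive column to the Accumulo row id via :rowid, so predicates on that column restrict the scan range:

    CREATE TABLE acc_users (uid STRING, name STRING, age INT)
    STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
    WITH SERDEPROPERTIES (
      'accumulo.columns.mapping' = ':rowid,info:name,info:age'
    );

    -- prunes the row-space: only the 'u123' row range is scanned
    SELECT name, age FROM acc_users WHERE uid = 'u123';

    -- a hand-rolled "index table" keyed by the column you actively query
    -- (beware: non-unique names will collide on the row id)
    CREATE TABLE acc_users_by_name (name STRING, uid STRING)
    STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
    WITH SERDEPROPERTIES ('accumulo.columns.mapping' = ':rowid,u:uid');
    INSERT OVERWRITE TABLE acc_users_by_name SELECT name, uid FROM acc_users;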
We currently have a running project that uses an RDBMS (with lots of tables and stored procedures for manipulating data). The current flow is as follows: the data access layer calls stored procedures, which insert, delete, update, or fetch data from the RDBMS (please note that these stored procedures do not do any bulk processing). The current data structure contains lots of primary key/foreign key relationships, and there are lots of updates to the existing tables. I just want to know whether we can use HBase for our purpose, and if so, how we can use Hadoop with HBase to replace the RDBMS.
You need to ask yourself, what is the RDBMS not doing for you, and what is it that you hope to achieve by moving to Hadoop/HBase?
This article may help. There are a lot more.
http://it.toolbox.com/blogs/madgreek/nosql-vs-rdbms-apples-and-oranges-37713
If the purpose is to try out new technology, I suggest working through their tutorial/getting-started material.
If it's a clear problem you're trying to solve, then you may want to articulate the problem.
Good Luck!
I hesitate to suggest replacing your current RDBMS, simply because of the large developer effort you've already spent. Consider that your organization probably has no employees with the needed HBase experience. Moving to HBase, with the attendant data conversion and application rewriting, would be very expensive and risky.