I am a bit surprised to learn that Hive now has an UPDATE statement (although it has apparently been around since v0.14), even though I have been aware for some time that full or near-RDBMS SQL functionality is on Hive's roadmap.
Can you summarize how Hive's INSERT, UPDATE and DELETE differ from those of relational databases, and what their limitations are (Hive is at v2.1.0 as of this writing)?
Should Hive continue to improve its RDBMS-like SQL capabilities, say in 2-3 years' time, will it then be useful for relational DB workloads?
(I'm not aware of the full roadmap, though. Pardon me if this is a stupid question, or a question due to laziness in browsing the documentation.)
Hive has long supported INSERT. For UPDATE and DELETE, however, the following requirements apply:
only the ORC file format is supported
only bucketed tables are supported
the table must be declared with TBLPROPERTIES ("transactional"="true")
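Concretely, a table that supports row-level UPDATE and DELETE has to be created along these lines (the table and column names here are invented for illustration; the cluster must also be configured with the DbTxnManager transaction manager for the DML to work):

```sql
-- Transactional Hive tables must be bucketed and stored as ORC.
CREATE TABLE customers (
  id    INT,
  name  STRING,
  email STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional" = "true");

-- With the table declared transactional, row-level DML works:
UPDATE customers SET email = 'new@example.com' WHERE id = 42;
DELETE FROM customers WHERE id = 43;
```

Without the ORC/bucketing/transactional combination above, the UPDATE and DELETE statements are simply rejected.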
Latency is still an issue with these operations. The link below lists the use cases for which ACID support was introduced; note, however, that Hive's roadmap does not plan to replace transactional relational databases.
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-Limitations
Related
I've seen that from Hive 0.14 insertions and updates are available. My first question is: do insertions and updates work for external tables?
If they do, how does it work? I guess the related HDFS files have to be modified by appending new lines and by updating the involved lines, respectively.
Thanks!
Yes, Hive 0.14 supports inserts, updates and deletes. Having said that, it comes with a number of limitations. Currently there is no support for external tables. Please see here for the full list of limitations - https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
We use Hive (v1.2.1) to read, with SQL-like queries, from Accumulo (v1.7.1) tables.
Are there any special settings we can configure in Hive, or elsewhere, to improve performance or stability?
When using Hive this way, is there any point in trying out Hive indexing, or settings like "hive.auto.convert.join", or do they work differently and have no real effect in this case?
Thank you!
Obligatory: I wrote (most of) the AccumuloStorageHandler, but I am by no means a Hive expert.
The biggest gain you will probably find is structuring your query so that you can prune the row-space, via a predicate in the WHERE clause over the :rowid-mapped column. To my knowledge, there isn't much (any?) query optimization that is pushed down into Accumulo itself.
Depending on your workload, you could use Hive to generate your own "index tables" in Accumulo. If you can make a custom table that has the column you want to actively query stored in the Accumulo row, your queries should run much faster.
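As a sketch of what row-space pruning looks like (the table and column names are invented), if row_id is the Hive column mapped to :rowid in the AccumuloStorageHandler column mapping, a range predicate on it lets the scan be restricted to a slice of the row-space instead of a full table scan:

```sql
-- row_id is assumed to be the column mapped to :rowid in the
-- table's "accumulo.columns.mapping" property.
SELECT row_id, reading
FROM accumulo_sensor_data
WHERE row_id >= 'sensor_1000'
  AND row_id <  'sensor_2000';
```

Predicates on other (non-:rowid) columns are evaluated after the rows have already been fetched, which is why the "index table" approach in the next paragraph can pay off.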
I just realized I've had a headache for years. Well, metaphorically speaking. In reality I was looking at my database structure and somehow just realized I never use transactions. Doh.
There's a lot of data on the internet about transactions (begin transaction, rollback, commit, etc.), but surprisingly not much detail about exactly why they are vital, and just exactly how vital?
I understand the concept of handling the case where something goes wrong. That makes sense when one is doing multiple updates in one go, for example across multiple tables, but as far as I know that is bad practice and I don't do it. All of my queries just update one table. If a query errors, it cancels, transaction or no transaction. What else could go wrong or potentially corrupt a one-table update, besides pulling the plug on my server?
In other words, my question is,
exactly how vital is it that I implement transactions on all of my tables? Am I fully blasphemous for not having them, or does it really matter that much?
UPDATE
+1 to invisal, who pointed out that queries are automatically wrapped in transactions, which I did not know, and who gave multiple good references on the subject of my question.
This made a lot of sense when one is doing multiple updates, for example, in multiple tables in one go. But basically all of my queries just update one table at a time. If a query errors, it cancels, transaction or no transaction.
In your case, it does nothing. A single statement is a transaction by itself. For more information you can read these existing questions and answers:
What does a transaction around a single statement do?
Transaction necessary for single update query?
Do i need transaction for joined query?
The most important property of a database is that it keeps your data, reliably.
Database reliability is assured by conforming to ACID principles (Atomicity, Consistency, Isolation, Durability). In the context of databases, a single logical operation on the data is called a transaction. Without transactions, such reliability would not be possible.
In addition to reliability, using transactions properly lets you improve the performance of some data operations considerably. For example, you can start a transaction, insert a lot of data (say 100k rows), and only then commit. The server does not have to actually write to disk until COMMIT is called, effectively batching the data in memory, which can improve performance a lot.
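As a sketch (generic SQL; the table name is invented), the batching pattern looks like this:

```sql
BEGIN TRANSACTION;

-- Many inserts accumulate inside the open transaction;
-- nothing has to be made durable yet.
INSERT INTO measurements (sensor_id, reading) VALUES (1, 20.5);
INSERT INTO measurements (sensor_id, reading) VALUES (2, 19.8);
-- ... many more rows ...

-- One commit makes the whole batch durable at once.
COMMIT;
```

Compare this to autocommit mode, where each INSERT is its own transaction and each one may force a separate flush to disk.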
You should be aware that every updating action against your database is performed inside a transaction, even if only 1 table (SQL server automatically creates a transaction for it).
The reason for always using transactions is to ensure ACID, as others have mentioned. Here I'd like to elaborate on the isolation point: without transaction isolation, you may run into dirty reads (reading uncommitted data), non-repeatable reads, and phantom reads.
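Most databases let you pick the isolation level per transaction; for example, in standard SQL syntax (supported with minor variations by SQL Server, PostgreSQL and others, with an invented table name):

```sql
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;

-- Rows read here will return the same values if re-read later
-- in this same transaction (no dirty or non-repeatable reads);
-- phantom rows can still appear at this level.
SELECT balance FROM accounts WHERE id = 1;

COMMIT;
```

Stricter levels (SERIALIZABLE) rule out phantoms too, at the cost of more locking or aborted transactions.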
It depends. If you are updating one table and one row, then the only advantage is going to be in the logging; but if you update multiple rows in a table at one time, then without transactions you could still run into some corruption.
Well, it depends. SQL is most of the time used to support data for host languages like C, C++, Java, PHP, C# and others. I have not worked with many technologies, but if you are using the following combinations, then here is my point of view:
SQL with C / C++ : Commit Required
SQL with Java : Not Required
SQL with C# : Not Required
SQL with PHP : Not Required
It also depends on which SQL dialect you are using: Oracle SQL, SQL Server, SQLite, MySQL, etc.
When you are using Oracle SQL in its console, as in Oracle 11g, Oracle 10g, etc., COMMIT is required.
And as far as corruption of tables and data is concerned: yes, it happens. I had a very bad experience with it. If you pull out a wire or something while you are updating a table, you might end up with a massive disaster.
Concluding, I suggest you do commit.
We currently have one running project which uses an RDBMS (with lots of tables and stored procedures for manipulating data). The current flow is: the data access layer calls stored procedures, which insert/delete/update or fetch data from the RDBMS (note that these stored procedures do no bulk processing). The current data structure contains lots of primary key/foreign key relationships and has lots of updates to existing tables. I just want to know whether we can use HBase for our purpose, and if so, how can we use Hadoop with HBase to replace the RDBMS?
You need to ask yourself, what is the RDBMS not doing for you, and what is it that you hope to achieve by moving to Hadoop/HBase?
This article may help. There are a lot more.
http://it.toolbox.com/blogs/madgreek/nosql-vs-rdbms-apples-and-oranges-37713
If the purpose is trying new technology, I suggest trying their tutorial/getting started.
If it's a clear problem you're trying to solve, then you may want to articulate the problem.
Good Luck!
I hesitate to suggest replacing your current RDBMS simply because of the large developer effort you've already spent. Consider that your organization probably has no employees with the needed HBase experience. Moving to HBase, with the attendant data conversion and application rewriting, will be very expensive and risky.
I was just wondering if a storage engine type exists that allows you to do version control on row-level contents. For instance, if I have a simple table with ID, name, value, and ID is the PK, I could see that row 354 started as (354, "zak", "test") in v1, was then updated to (354, "zak", "this is version 2 of the value") in v2, and could see a change history on the row with something like select history (value) where ID = 354.
It's kind of an esoteric thing, but it would beat having to keep writing these separate history tables and functions every time a change is made...
It seems you are looking more for auditing features. Oracle and several other DBMSs have full auditing features, but many DBAs still end up implementing trigger-based row auditing. It all depends on your needs.
Oracle supports several granularities of auditing that are easy to configure from the command line.
I see you tagged this MySQL, but asked about any storage engine. Anyway, the other answers are saying the same thing, so I'm going to delete this post, as it was originally about the flashback features.
Obviously you are really after a MySQL solution, so this probably won't help you much, but Oracle has a feature called Total Recall (more formally Flashback Archive) which automates the process you are currently hand-rolling. The Archive is a set of compressed tables which are populated with changes automatically, and queryable with a simple AS OF syntax.
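A minimal sketch of what querying the Archive looks like (the table name is invented, and Flashback Archive has to be enabled for the table first):

```sql
-- Query the row as it existed an hour ago.
SELECT value
FROM items AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR)
WHERE id = 354;

-- Or walk the recorded change history of the row.
SELECT value, versions_starttime
FROM items VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
WHERE id = 354;
```

This is close to the hypothetical select history (value) where ID = 354 from the question, with no hand-rolled history tables involved.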
Naturally being Oracle they charge for it: it needs an additional license on top of the Enterprise Edition, alas. Find out more (PDF).
Oracle and SQL Server both call this feature Change Data Capture. There is no equivalent for MySQL at this time.
You can achieve similar behavior with triggers (search for "triggers to catch all database changes") - particularly if they implement SQL92 INFORMATION_SCHEMA.
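A minimal sketch of the trigger approach in MySQL (all names invented): keep a history table alongside the main one and snapshot the old row on every update.

```sql
CREATE TABLE items_history (
  id         INT,
  name       VARCHAR(100),
  value      VARCHAR(255),
  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER items_audit
BEFORE UPDATE ON items
FOR EACH ROW
BEGIN
  -- Record the pre-update values of the row being changed.
  INSERT INTO items_history (id, name, value)
  VALUES (OLD.id, OLD.name, OLD.value);
END//
DELIMITER ;
```

The change history the question asks for then becomes an ordinary query, e.g. SELECT value, changed_at FROM items_history WHERE id = 354 ORDER BY changed_at.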
Otherwise I'd agree with mrjoltcola
Edit: The only gotcha I'd mention with MySQL and triggers is that (as of the latest community version I downloaded) creating them requires the user account to have the SUPER privilege, which can make things a little ugly.
CouchDB has full versioning for every change made, but it is part of the NOSQL world, so would probably be a pretty crazy shift from what you are currently doing.
The Wikipedia article on Google's Bigtable mentions that it allows versioning by adding a time dimension to the tables: each table has multiple dimensions, one of which is a field for time, allowing versioning.
There are also links there to several non-Google implementations of a Bigtable-type DBMS.
I think Bigtable, the Google DB engine, does something like that: it associates a timestamp with every update of a row.
Maybe you can try Google App Engine.
There is a Google paper explaining how Big Table works.
The book Refactoring Databases has some insights on the matter.
But it also points out that there is no real solution currently, other than carefully making changes and managing them manually.
One approximation to this is a temporal database - which allows you to see the status of the whole database at different times in the past. I'm not sure that wholly answers your question though; it would not allow you to see the contents of Row 1 at time t1 while simultaneously letting you look at the contents of Row 2 at a separate time t2.
"It's kind of an esoteric thing, but it would beat having to keep writing these separate history tables and functions every time a change is made..."
I wouldn't call audit trails (which is obviously what you're talking of) an "esoteric thing" ...
And : there is still a difference between the history of database updates, and the history of reality. Historical database tables should really be used to reflect the history of reality, NOT the history of database updates.
The history of database updates is already kept by the DBMS in its logs and journals. If someone needs to inquire into the history of database updates, then he/she should really resort to the logs and journals, not to any kind of application-level construct that can NEVER provide a sufficient guarantee that it reflects ALL updates.