Support for Key Constraints and Delete Cascade in Hive


I have learned that primary and foreign keys have been introduced for Hive tables. Is it possible to achieve delete cascade functionality for rows through HiveQL?
Can anyone help me understand the reasons for, and uses of, key constraints in Hive (given that the constraints are not enforced on the table data directly)? Also, please help me understand whether delete cascade functionality can be achieved through the Hive schema and HiveQL alone, without coding.

In Hive, these constraints can only be declared in the DISABLE state, meaning that incoming data is not checked against them:
PRIMARY KEY
FOREIGN KEY
UNIQUE KEY
Unlike in an RDBMS, these constraints are not backed by indexes in Hive, and no CASCADE operations are supported.
The purpose of having disabled constraints is to
provide information to the cost-based optimizer (CBO) through RELY | NORELY so that it can make smarter optimizations
provide information to modelling tools like ERWin
document what is supposed to be PK, UK, FK
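Since no CASCADE operation is available, a cascading delete has to be spelled out as explicit statements: remove the child rows first, then the parent row. The following is only a minimal sketch of that ordering. It uses Python's built-in sqlite3 module as a stand-in engine so that it can run anywhere, and the tables `orders` and `order_details` as well as the helper `delete_order_cascade` are hypothetical names chosen for illustration. In Hive itself, the same two DELETE statements would additionally require transactional (ACID) tables.

```python
import sqlite3

def delete_order_cascade(conn, order_id):
    """Emulate ON DELETE CASCADE by hand: children first, then the parent."""
    with conn:  # commit both deletes together, or roll both back on error
        conn.execute("DELETE FROM order_details WHERE order_id = ?", (order_id,))
        conn.execute("DELETE FROM orders WHERE order_id = ?", (order_id,))

conn = sqlite3.connect(":memory:")
# No constraints are declared here, mirroring the case where they are not enforced.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT)")
conn.execute("CREATE TABLE order_details (order_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO order_details VALUES (?, ?)",
    [(1, "pen"), (1, "ink"), (2, "cup")],
)
conn.commit()

delete_order_cascade(conn, 1)
remaining_orders = conn.execute("SELECT order_id FROM orders").fetchall()
remaining_details = conn.execute("SELECT order_id, item FROM order_details").fetchall()
print(remaining_orders, remaining_details)
```

Deleting the children before the parent keeps the data consistent at every step, which matters precisely because nothing in the engine would catch an orphaned row.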

Related

Why would someone need to enable/disable constraints?

Just starting to learn basics of SQL. In some versions of SQL (Oracle, SQL server etc.) there are enable/disable constraints keywords. What is the difference between these and add/drop constraints keywords? Why do we need it?

Constraint validation has a performance penalty when performing a DML operation. It's common to disable a constraint before a bulk insert/import of data (especially if you know that data is "OK"), and then enable it after the bulk operation is done.
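The disable, load, re-enable cycle can be sketched as follows. This is an illustration under stated assumptions rather than Oracle or SQL Server syntax: it uses Python's built-in sqlite3 module, where enforcement is toggled with `PRAGMA foreign_keys` and existing data is validated with `PRAGMA foreign_key_check`. The tables `parent` and `child` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    "id INTEGER PRIMARY KEY, "
    "parent_id INTEGER NOT NULL REFERENCES parent(id))"
)

# Turn enforcement off for the load. SQLite ignores this pragma inside an
# open transaction, so it is issued before any INSERT starts one.
conn.execute("PRAGMA foreign_keys = OFF")

# With enforcement off, load order does not matter: children can arrive first.
conn.executemany(
    "INSERT INTO child VALUES (?, ?)", [(i, i % 10) for i in range(1000)]
)
conn.executemany("INSERT INTO parent VALUES (?)", [(i,) for i in range(10)])
conn.commit()

# Validate what was loaded, then switch enforcement back on.
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
conn.execute("PRAGMA foreign_keys = ON")
enforcing = conn.execute("PRAGMA foreign_keys").fetchone()[0]
print(len(violations), enforcing)
```

The explicit check after the load is the important part: re-enabling enforcement in SQLite does not by itself re-validate rows that are already in the table.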

I use disabled constraints in a special situation. I have an application with many tables (around 1000). The records in these tables have "natural keys", i.e. identifiers and relations that are given by an external source. Some tables even use different natural keys as foreign key references to different tables.
But I like to use common surrogate keys as primary key and for foreign references.
Here is one example (not 100% sure about correct syntax):
CREATE TABLE T_BTS (
  OBJ_ID NUMBER CONSTRAINT BTS_PK PRIMARY KEY,
  BTS_ID VARCHAR2(20) CONSTRAINT BTS_UK UNIQUE
  -- some more columns
);
CREATE TABLE T_CELL (
  OBJ_ID NUMBER CONSTRAINT CELL_PK PRIMARY KEY,
  OBJ_ID_PARENT NUMBER,
  BTS_ID VARCHAR2(20),
  CELL_ID VARCHAR2(20),
  -- a multi-column unique key has to be declared at table level
  CONSTRAINT CELL_UK UNIQUE (BTS_ID, CELL_ID)
  -- some more columns
);
-- enforced reference through the surrogate key
ALTER TABLE T_CELL ADD CONSTRAINT CELL_PARENT_FK
  FOREIGN KEY (OBJ_ID_PARENT)
  REFERENCES T_BTS (OBJ_ID);
-- disabled reference that only documents the natural-key relation
ALTER TABLE T_CELL ADD CONSTRAINT CELL_PARENT
  FOREIGN KEY (BTS_ID)
  REFERENCES T_BTS (BTS_ID) DISABLE;
In all my tables the primary key column is always OBJ_ID and the key to the parent table is always OBJ_ID_PARENT, no matter how the natural key is defined. This makes it easier for me to have common PL/SQL procedures and compose dynamic SQL statements.
One example: in order to set OBJ_ID_PARENT after an insert, the following update would be needed:
UPDATE T_CELL cell SET OBJ_ID_PARENT =
(SELECT OBJ_ID
FROM T_BTS bts
WHERE cell.BTS_ID = bts.BTS_ID);
I am too lazy to write 1000+ such individual statements. By using the views USER_CONSTRAINTS and USER_CONS_COLUMNS I am able to link the natural keys and the surrogate keys, and I can execute these updates via dynamic SQL.
All my keys and references are purely defined by constraints. I don't need to maintain any extra table where I track relations or column names. The only limitation in my application design is that I have to follow a certain naming convention for the constraints. In return, almost no maintenance is required to keep the data consistent and get good performance.
In order to use all of the above, some constraints need to be disabled, even permanently.
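To make the idea of generating those updates from constraint metadata more concrete, here is a rough sketch in Python rather than PL/SQL. Everything in it is an assumption for illustration: the helper `build_parent_update` is hypothetical, the `disabled_fks` rows merely stand in for what a query against USER_CONSTRAINTS and USER_CONS_COLUMNS might return, and it assumes the natural-key columns carry the same names in child and parent, as BTS_ID does above.

```python
def build_parent_update(child_table, parent_table, key_columns):
    """Build the UPDATE that fills OBJ_ID_PARENT by matching natural-key columns."""
    match = " AND ".join(f"c.{col} = p.{col}" for col in key_columns)
    return (
        f"UPDATE {child_table} c SET OBJ_ID_PARENT = "
        f"(SELECT p.OBJ_ID FROM {parent_table} p WHERE {match})"
    )

# Hypothetical metadata rows: (child table, parent table, natural-key columns),
# one per disabled foreign key found in the data dictionary.
disabled_fks = [
    ("T_CELL", "T_BTS", ["BTS_ID"]),
]

statements = [build_parent_update(*fk) for fk in disabled_fks]
for sql in statements:
    print(sql)
```

The identifiers are interpolated directly into the statement text, which is acceptable only because they come from the data dictionary and not from user input.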

I [almost] never disable constraints during the normal operation of the application. The point of the constraints is to preserve data quality.
Now, during maintenance, I can disable them temporarily while adding or removing massive amounts of data. Once the data is loaded I make sure they are enabled again before restarting the application.

Should I remove the foreign keys if we manually guarantee database integrity?

I use foreign keys at work. But we pretty much manually manage our tables and we always make sure that we always have a parent entry in another table for a child entry that references it by its Id. We insert, update and delete the parent and child entities in the table in the same transaction.
So why should we still keep those foreign keys? They slow the database down when inserting new entities and may be one of the reasons we get deadlocks from time to time.
Are they actually used by SQL Server for other things, like gathering better statistics, or is their only purpose to keep data integrity?

You shouldn't. Drop constraints with their foreign keys.

Checks at the database level are the last integrity barrier protecting your data.
For performance reasons you might want to remove foreign keys, but you might end up having to maintain a partially corrupted DB, which ends up being a nightmare.
Can a foreign key improve performance?
Foreign key constraints improve performance at the time of reading data
but at the same time they slow down performance at the time of
inserting / modifying / deleting data.
In case of reading the query, the optimizer can use foreign key
constraints to create more efficient query plans as foreign key
constraints are pre declared rules. This usually involves skipping
some part of the query plan because for example the optimizer can see
that because of a foreign key constraint, it is unnecessary to execute
that particular part of the plan.
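One way to see why a plan step can become unnecessary: when a NOT NULL foreign key is enforced, every child row matches exactly one parent row, so a join that only serves to confirm the parent exists returns the same rows as reading the child table alone. The sketch below demonstrates that equivalence with Python's built-in sqlite3 module and hypothetical tables `parent` and `child`. It does not show any optimizer performing the rewrite; whether a given engine actually skips the join depends on that engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the declared foreign key
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    "id INTEGER PRIMARY KEY, "
    "parent_id INTEGER NOT NULL REFERENCES parent(id))"
)
conn.executemany("INSERT INTO parent VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO child VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])
conn.commit()

# The join only checks that the parent exists ...
with_join = conn.execute(
    "SELECT c.id FROM child c JOIN parent p ON p.id = c.parent_id ORDER BY c.id"
).fetchall()
# ... which the constraint already guarantees, so the join adds nothing.
without_join = conn.execute("SELECT id FROM child ORDER BY id").fetchall()
print(with_join == without_join)
```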

SQL Server - add foreign key relationship

I have an already existing database with tables. I added foreign key relationships (because they were referring data from another table, just that relationship was not explicit in the way tables were created) for one of the tables.
How does this change impact the existing database? Does the database engine have to do some extra work on existing data in the database? Can this change be a "breaking change" if you already have an application that uses the current database schema?

If you added a referential constraint, then the database stores that constraint and ensures it is maintained. For example, if table A has a foreign key referring to table B, then you cannot insert a row into table A that refers to a key that does not exist in table B.
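A small sketch of that behaviour, using Python's built-in sqlite3 module in place of SQL Server (an assumption made so the example is runnable as-is). The tables `a` and `b` follow the naming in the paragraph above; the column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES b(id))")

conn.execute("INSERT INTO b VALUES (1)")
conn.execute("INSERT INTO a VALUES (1, 1)")  # fine: b.id = 1 exists

rejected = None
try:
    conn.execute("INSERT INTO a VALUES (2, 99)")  # no row in b with id = 99
except sqlite3.IntegrityError as exc:
    rejected = str(exc)

rows_in_a = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
print(rejected, rows_in_a)
```

This is also where a "breaking change" would show up in client code: an insert that used to succeed now raises an error that the application has to handle.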

There is indeed some extra work (though very minimal, depending on your database server) to enforce referential integrity. In practice, the performance impact is almost never something you'd notice.
It can be a "breaking change" - your client code may insert data that doesn't meet the referential constraints. If the DB allowed you to create the constraints in the first place, it's not likely, but it is possible.

You can specify WITH NOCHECK when creating a foreign key constraint:
The WITH NOCHECK option is useful when the existing data already meets
the new FOREIGN KEY constraint, or when a business rule requires the
constraint to be enforced only from this point forward.
However, you should be careful when you add a constraint without
checking existing data because this bypasses the controls in the
Database Engine that enforce the data integrity of the table.
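The "enforced only from this point forward" situation can be illustrated with a loose analogue. The sketch below is not SQL Server's WITH NOCHECK; it uses Python's built-in sqlite3 module, where switching enforcement on likewise leaves existing rows unexamined. The tables `a` and `b` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = OFF")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER REFERENCES b(id))")

# Legacy data written while nothing was enforced: it points at a missing b.id.
conn.execute("INSERT INTO a VALUES (1, 42)")
conn.commit()

# From here on the constraint is enforced, but the old row is not re-checked.
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO a VALUES (2, 43)")
    new_orphan_accepted = True
except sqlite3.IntegrityError:
    new_orphan_accepted = False

# An explicit check is what reveals the violation that was let through earlier.
legacy_orphans = conn.execute("PRAGMA foreign_key_check").fetchall()
print(new_orphan_accepted, legacy_orphans)
```

The legacy row stays in the table untouched, which is exactly the risk the quoted warning describes.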

Why use Foreign Key constraints in MySQL?

I was wondering,
What would be my motivation to use a foreign key constraint in MySQL, given that I am sure I can control the data that is added?
Does it improve performance?

Foreign keys enforce referential integrity. These constraints guarantee that a row in a table order_details with a field order_id referencing an orders table will never have an order_id value that doesn't exist in the orders table.
Foreign keys aren't required to have a working relational database (in fact MyISAM, MySQL's default storage engine before version 5.5, doesn't support FKs; InnoDB, the default since 5.5, does), but they are definitely essential to avoid broken relationships and orphan rows (i.e. to keep referential integrity). The ability to enforce referential integrity at the database level is required for the C in ACID to stand.
As for your concerns regarding performance, in general there is a performance cost, but it will probably be negligible. I suggest putting in all your foreign key constraints, and only experimenting without them if you have real performance issues that you cannot solve otherwise.

One reason is that a set of tables with foreign key constraints cannot be sharded into multiple databases.

Why use primary keys?

What are primary keys used for, aside from identifying a unique column in a table? Couldn't this be done by simply using an autoincrement constraint on a column? I understand that PK and FK are used to relate different tables, but can't this be done by just using join?
Basically what is the database doing to improve performance when joining using primary keys?

Mostly for referential integrity with foreign keys. When you have a PK, the database will also create an index behind the scenes, so you don't need table scans when looking up values.
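The effect of that implicit index can be observed in a query plan. The sketch below uses Python's built-in sqlite3 module and a hypothetical table `users`; other engines expose plans differently, and the exact plan wording varies between SQLite versions, so only the leading keyword is inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Declaring the primary key makes SQLite create a unique index automatically.
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")

def plan_for(sql):
    """Return the human-readable detail text of the first plan step."""
    row = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0]
    return row[-1]  # the detail text is the last column of each plan row

key_plan = plan_for("SELECT * FROM users WHERE id = 'u1'")
name_plan = plan_for("SELECT * FROM users WHERE name = 'x'")

print(key_plan)   # an index SEARCH on the primary key
print(name_plan)  # a full table SCAN, since name has no index
```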

RDBMS products are usually optimized to work with tables that have primary keys. Most store statistics which help optimize query plans. These statistics are very important to performance, especially on larger tables; they are not going to work the same without primary keys, and you end up getting unpredictable query response times.
Most database best practices books suggest creating all tables with a primary key with no exceptions, it would be wise to follow this practice. Not many things say junior software dev more than one who builds a database without referential integrity!

Some PKs are simply an auto-incremented column. Also, you typically join USING the PK and FK. There has to be some relationship to do a join. Additionally, most DBMS automatically index PKs by default, which improves join performance as well as querying for a particular record based on ID.

You can join without a primary key within a query; however, you must have a primary key defined to enforce data integrity constraints (foreign keys, etc.), at least with SQL Server.

In Microsoft Access, if you have a linked table to, say, SQL Server, the source table must have a primary key in order for the linked table to be writeable. At least, that was the case with Access 2000 and SQL Server 6.5. It may be different with later versions.

Keys are about data integrity as well as identification. The uniqueness of a key is guaranteed by having a constraint in the database to keep out "bad" data that would otherwise violate the key. The fact that data integrity rules are guaranteed in that way is precisely what makes a key usable as an identifier. That goes for any key. One key per table by convention is called a "primary" key but that doesn't make other alternate keys any less important.
In practice we need to be able to enforce uniqueness rules against all types of data (not just numbers) to satisfy the demands of data quality and usability.
