Constraints for unique row insertion in Hive - hive

I am creating a Hive table with a large data set. Is there a way of creating constraints on the table so that no two rows are the same when we insert the data?

Hive does not provide validated UNIQUE or PRIMARY KEY constraints.
As of 2.1.0 Hive includes support for non-validated primary and foreign key constraints. Since these constraints are not validated, an upstream system needs to ensure data integrity before the data is loaded into Hive. As of 3.0.0 Hive also includes support for UNIQUE, NOT NULL, DEFAULT and CHECK constraints. Apart from UNIQUE, these three constraint types are enforced.
You can apply DISTINCT or ROW_NUMBER to the whole dataset or to a single partition. You can also UNION the old data with the new data to remove duplicates. If your table is partitioned, you can rewrite a partition like this:
insert overwrite table MYTABLE partition(load_date='2020-07-25')
select col1, col2, ... colN
from MYTABLE where load_date='2020-07-25'
UNION
select col1, col2, ... colN
from DAILY_INCREMENT_DATA
UNION will return distinct rows.
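The deduplicating behaviour of UNION can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive, so it only illustrates the set semantics of UNION versus UNION ALL, not Hive's INSERT OVERWRITE or partitioning. The table contents and the two-column layout are made up for the illustration.

```python
import sqlite3

# SQLite stands in for Hive here; the rows and the two-column schema are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (col1 INTEGER, col2 TEXT);
    CREATE TABLE daily_increment_data (col1 INTEGER, col2 TEXT);
    INSERT INTO mytable VALUES (1, 'a'), (2, 'b');
    INSERT INTO daily_increment_data VALUES (2, 'b'), (3, 'c'), (3, 'c');
""")

# UNION (without ALL) drops duplicate rows, both across and within the two inputs.
merged = conn.execute("""
    SELECT col1, col2 FROM mytable
    UNION
    SELECT col1, col2 FROM daily_increment_data
    ORDER BY col1
""").fetchall()

# UNION ALL keeps every row, duplicates included.
merged_all = conn.execute("""
    SELECT col1, col2 FROM mytable
    UNION ALL
    SELECT col1, col2 FROM daily_increment_data
""").fetchall()

print(merged)           # three distinct rows
print(len(merged_all))  # all five input rows
```

Note that UNION compares whole rows, so two rows that differ in any selected column are both kept.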
See also this answer for more details about using row_number and other loading scenarios.
Also Hive 2.2 supports MERGE in ACID mode.

Related

Generate and Insert surrogate keys into already existing BigQuery table

I have an existing table without any unique ID. I'm planning to generate surrogate keys using the GENERATE_UUID() function, however I'm not sure how to insert this new column... What is the best option here?
One way is to use CREATE OR REPLACE TABLE ... AS SELECT
CREATE OR REPLACE TABLE table_a
AS SELECT GENERATE_UUID() uuid, * FROM table_a
The drawbacks are:
The metadata of the table is lost (table options, column descriptions etc.)
Nullability of the columns is lost; all columns become NULLABLE
If both are acceptable, then the approach above is the simplest way.
If not, then you need to add a column through the UI or API, then do
UPDATE table_a
SET uuid = GENERATE_UUID()
WHERE uuid IS NULL

Create a table with a foreign key referencing to a temporary table generated by a query

I need to create a table having a field, which is a foreign key referencing to another query rather than existing table. E.g. the following statement is correct:
CREATE TABLE T1 (ID1 varchar(255) references Types)
but this one throws a syntax error:
CREATE TABLE T2 (ID2 varchar(255) references SELECT ID FROM BaseTypes UNION SELECT ID FROM Types)
I cannot figure out how to achieve my goal. If I need to introduce a temporary table, how can I force that table to be updated each time the tables BaseTypes and Types change?
I am using Firebird DB and IBExpert management tool.
A foreign key constraint (references) can only reference a table (or more specifically columns in the primary or unique key of a table). You can't use it to reference a select.
If you want to do that, you need to use a CHECK constraint, but that constraint would only be checked on insert and updates: it wouldn't prevent other changes (eg to the tables in your select) from making the constraint invalid while the data is at rest. This means that at insert time the value could meet the constraint, but the constraint could - unnoticed! - become invalid. You would only notice this when updating the row.
An example of the CHECK-constraint could be:
CREATE TABLE T2 (
ID2 varchar(255) check (exists(
SELECT ID FROM BaseTypes WHERE BaseTypes.ID = ID2
UNION
SELECT ID FROM Types WHERE Types.ID = ID2))
)
For a working example, see this fiddle.
Alternatively, if your goal is to 'unite' two tables, define a 'super'-table that contains the primary keys of both tables, and reference that table from the foreign key constraint. You could populate and update (e.g. insert and delete) this table using triggers. Or you could use a single table, and replace the existing tables with an updatable view (whether this is possible depends on the exact data, e.g. IDs shouldn't overlap).
This is more complex, but would give you the benefit that the foreign key is also enforced 'at rest'.
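A minimal sketch of the 'super'-table idea follows. It uses Python's built-in sqlite3 module rather than Firebird, so the trigger syntax differs from what Firebird expects and only the general shape carries over. The AllTypes table and the trigger names are hypothetical, and the sketch assumes IDs never overlap between BaseTypes and Types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE BaseTypes (ID TEXT PRIMARY KEY);
    CREATE TABLE Types (ID TEXT PRIMARY KEY);
    -- Hypothetical 'super' table holding the keys of both tables
    -- (assumes an ID never appears in both BaseTypes and Types).
    CREATE TABLE AllTypes (ID TEXT PRIMARY KEY);

    -- Triggers keep AllTypes in step with inserts and deletes on both tables.
    CREATE TRIGGER basetypes_ai AFTER INSERT ON BaseTypes
    BEGIN INSERT INTO AllTypes (ID) VALUES (NEW.ID); END;
    CREATE TRIGGER basetypes_ad AFTER DELETE ON BaseTypes
    BEGIN DELETE FROM AllTypes WHERE ID = OLD.ID; END;
    CREATE TRIGGER types_ai AFTER INSERT ON Types
    BEGIN INSERT INTO AllTypes (ID) VALUES (NEW.ID); END;
    CREATE TRIGGER types_ad AFTER DELETE ON Types
    BEGIN DELETE FROM AllTypes WHERE ID = OLD.ID; END;

    -- The foreign key references the super table instead of a SELECT.
    CREATE TABLE T2 (ID2 TEXT REFERENCES AllTypes (ID));
""")

def is_rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

conn.execute("INSERT INTO BaseTypes VALUES ('b1')")
conn.execute("INSERT INTO T2 VALUES ('b1')")  # accepted: 'b1' reached AllTypes via the trigger

insert_rejected = is_rejected("INSERT INTO T2 VALUES ('missing')")
# Deleting a referenced key is blocked too, which is the 'at rest' guarantee.
delete_rejected = is_rejected("DELETE FROM BaseTypes WHERE ID = 'b1'")
remaining = conn.execute("SELECT COUNT(*) FROM BaseTypes").fetchone()[0]
```

Unlike the CHECK constraint, the delete on BaseTypes fails here while T2 still references the key, because the trigger's delete on AllTypes would break the foreign key.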

Why does this Redshift create table query with DISTKEY and DISTSTYLE not work?

I run this query in Redshift:
CREATE TABLE my_table(
auto_increment BIGINT IDENTITY(0, 1),
id INTEGER NOT NULL,
col_1 INTEGER NOT NULL DISTKEY SORTKEY,
foreign key(col_1) references foreign_table(id),
col_2 INTEGER,
col_3 VARCHAR(255),
col_4 TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
col_5 TIMESTAMP,
PRIMARY KEY (id)
) DISTSTYLE ALL;
But I get an error saying:
Cannot specify DISTKEY for column "col_1" of table "my_table" when DISTSTYLE is NONE or EVEN;
Why am I getting this error? How do I fix it?
Thanks!
You cannot specify a column as DISTKEY when your DISTSTYLE is ALL.
What DISTSTYLE ALL means is that your table will be copied as a whole and stored across all the nodes, so you're not distributing the data by any KEY.
So if you want to distribute the data based on a DISTKEY, you'll have to set DISTSTYLE KEY.
The Distribution Style can be one of several options. From Distribution Styles - Amazon Redshift:
Auto: Amazon Redshift assigns an optimal distribution style based on the size of the table data.
Even: The leader node distributes the rows across the slices in a round-robin fashion.
Key: The rows are distributed according to the values in one column.
All: A copy of the entire table is distributed to every node.
This specification:
col_1 INTEGER NOT NULL DISTKEY SORTKEY,
is telling Redshift to use the Key distribution style, since it is nominating the column to use as the DISTKEY.
However, the DISTSTYLE ALL at the bottom is telling Redshift to use the All distribution style.
Thus, Redshift is giving an error because two different distribution styles have been requested. You will need to pick one, not both.
Given that you have selected a column as DISTKEY, you should probably remove DISTSTYLE ALL.
A quick guide for DISTKEY and SORTKEY is:
For DISTKEY, use the column that is most frequently used in JOINs
For SORTKEY, use the column that is most frequently used in WHEREs
I notice that you have selected one column for both DISTKEY and SORTKEY. You might want to confirm that this is suitable for your data.

How do I partition efficiently in SQL Server based on foreign keys?

I am working on SQL Server and want to create a partition on a table. I want to base it off of a foreign key which is in another table.
table1 (
fk uniqueidentifier,
data
)
fk points to table2
table 2 (
partition element here
)
I want to partition table1 based on table2's data, i.e. if table2 contains categories
The foreign key relationship doesn't really matter: horizontal partitioning is based on the values in the table itself. The foreign key just makes sure those values already exist in another table.
Links:
SQL SERVER – 2005 – Database Table Partitioning Tutorial – How to Horizontal Partition Database Table
Partitioning a SQL Server Database Table
Steps for Creating Partitioned Tables

How to insert duplicate rows in SQLite with a unique ID?

This seems simple enough: I want to duplicate a row in a SQLite table:
INSERT INTO table SELECT * FROM table WHERE rowId=5;
If there were no explicit unique column declarations, the statement would work, but the table's first column is declared rowID INTEGER NOT NULL PRIMARY KEY. Is there any way to create a simple statement like the one above that works without knowing the schema of the table (aside from the first column)?
This can be done using * syntax without having to know the schema of the table (other than the name of the primary key). The trick is to create a temporary table using the "CREATE TABLE AS" syntax.
In this example I assume that there is an existing, populated, table called "src" with an INTEGER PRIMARY KEY called "id", as well as several other columns. To duplicate the rows of "src", use the following SQL in SQLite3:
CREATE TEMPORARY TABLE tmp AS SELECT * FROM src;
UPDATE tmp SET id = NULL;
INSERT INTO src SELECT * FROM tmp;
DROP TABLE tmp;
The above example duplicates all rows of the table "src". To only duplicate a desired row, simply add a WHERE clause to the first line. This example works because the table "tmp" has no primary key constraint, but "src" does. Inserting NULL primary keys into src causes them to be given auto-generated values.
From the sqlite documentation: http://www.sqlite.org/lang_createtable.html
A "CREATE TABLE ... AS SELECT" statement creates and populates a database table based on the results of a SELECT statement. A table created using CREATE TABLE AS has no PRIMARY KEY and no constraints of any kind.
If you want to get really fancy, you can add a trigger that updates a third table which maps old primary keys to newly generated primary keys.
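The temporary-table steps above can be tried end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: the "name" and "qty" columns and the sample rows are invented for the illustration, and only the row with id 1 is duplicated by adding a WHERE clause to the first statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema; only the INTEGER PRIMARY KEY named id matters for the trick.
    CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER);
    INSERT INTO src (name, qty) VALUES ('apple', 3), ('pear', 7);
""")

# Copy the row to duplicate; a table made with CREATE TABLE AS has no primary key.
conn.execute("CREATE TEMPORARY TABLE tmp AS SELECT * FROM src WHERE id = 1")
# Blank out the key so src assigns a fresh one on insert.
conn.execute("UPDATE tmp SET id = NULL")
conn.execute("INSERT INTO src SELECT * FROM tmp")
conn.execute("DROP TABLE tmp")

rows = conn.execute("SELECT id, name, qty FROM src ORDER BY id").fetchall()
print(rows)
```

The copy keeps every non-key column of the original row without the statement ever naming those columns.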
No. You need to know the schema of the table to write the insert statement properly.
You need to be able to write the statement in the form of:
insert into Table (column1, column2, column3)
select column1, column2, column3
from OtherTable
where rowId = 5
Well, since I was unable to do this the way I wanted, I resorted to using the implicit row id, which handily enough has the same name as the rowId column I defined explicitly. Now I can use the query I had in the question, and it will insert all the data with a new rowId. To keep the rest of the program working, I just changed SELECT * FROM table to SELECT rowId,* FROM table and everything's fine.
There is absolutely no way to do this. A PRIMARY KEY declaration implies that the field is unique. You can't have a non-unique PK, so there is no way to create a row with an existing PK in the same table.