How are surrogate keys handled in Hive?

I know that Hive cannot create surrogate keys, or at least that doing so is rather difficult. I want to understand how companies have implemented dimensional modeling in their warehouses.
One way I can think of is to leave the dimension details as they are in the fact table, then move the distinct dimension values to a separate table. But then how are SCD1 and SCD2 handled? I have checked talks by Kimball on Cloudera and I still don't understand how this works.

There are two ways of handling this problem in Hive.
The first does not directly answer your question, and that is to use natural keys instead of surrogates. While surrogates are more convenient and performant, since you're using Hive I'm guessing that performance isn't one of your major criteria, so the cost of using natural keys will just be in the extra lines of code you have to write to cater for compound keys.
The second way is to use Hive's windowing functions to calculate the surrogate. I don't have a Hive environment handy to test this query, but the surrogate would look something like:
(select max(surrogate_key_column) from dimension_table)
+ row_number() over (order by 1)
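A minimal sketch of that second approach, run against SQLite from Python rather than Hive, so the dialect is an assumption (the window function needs SQLite 3.25 or later). The names `dim_customer`, `staging_customer`, `customer_sk` and `customer_code` are hypothetical and only there for illustration.

```python
import sqlite3


def load_new_dimension_rows(conn: sqlite3.Connection) -> None:
    """Append staged rows that are not yet in the dimension, numbering them
    upward from the current maximum surrogate key."""
    conn.execute(
        """
        INSERT INTO dim_customer (customer_sk, customer_code)
        SELECT (SELECT COALESCE(MAX(customer_sk), 0) FROM dim_customer)
               + ROW_NUMBER() OVER (ORDER BY s.customer_code),
               s.customer_code
        FROM staging_customer AS s
        WHERE s.customer_code NOT IN (SELECT customer_code FROM dim_customer)
        """
    )


def demo_surrogate_keys() -> list:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customer (customer_sk INTEGER, customer_code TEXT)")
    conn.execute("CREATE TABLE staging_customer (customer_code TEXT)")
    # Two rows already in the dimension; the staging table repeats one of them.
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "A"), (2, "B")])
    conn.executemany("INSERT INTO staging_customer VALUES (?)", [("B",), ("C",), ("D",)])
    load_new_dimension_rows(conn)
    return conn.execute(
        "SELECT customer_sk, customer_code FROM dim_customer ORDER BY customer_sk"
    ).fetchall()
```

Ordering the window by a real column instead of a constant makes the numbering repeatable. Two loads running at the same time would read the same maximum and hand out the same keys, so this only works if loads are serialized.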

As far as I know, Hive supports surrogate keys on ACID tables as of version 3.0:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_surrogate_keys.html
Summarised from the link:
The SURROGATE_KEY UDF generates a unique Id for every row that you insert into a table.
Example usage:
-- Create a table.
CREATE TABLE students_v2
  (`ID` BIGINT DEFAULT SURROGATE_KEY(),
   row_id INT,
   name VARCHAR(64),
   dorm INT,
   PRIMARY KEY (ID) DISABLE NOVALIDATE);
-- Insert data, which automatically generates surrogate keys for the primary keys.
INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;
-- Take a look at the surrogate keys.
SELECT * FROM students_v2;

Related

How do you define a primary key on a table in Trino?

The query below does not work.
CREATE TABLE test_table (date varchar, id varchar, PRIMARY KEY (date,id))
I can't seem to find any docs on primary keys in Trino.
You don't. As the description says:
Trino is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
Trino does not maintain primary keys, indexes and so on. See also use cases and Trino concepts.

SQL: lookup for UUID

I have my user table (pseudo sql, because I use an ORM and I must support several different DB types):
id: INTEGER, PK, AUTOINCREMENT
UUID: BINARY(16) (inserted by an update; it's a hash(id))
I am currently using id for FK in all other tables.
However, in my REST API, I have to serve information by UUID, which causes a problem later when querying.
Should I:
FK on the UUID instead?
just lookup id(UUID) each time (fast thanks to cache mechanism after a while)?
In general, it is better to use the auto-incremented id for the foreign key reference rather than some other combination of unique columns.
One important reason is that indexes on a single integer are more efficient than indexes on other column types -- if for no other reason than the index being smaller, so it occupies less disk and less memory. Also, there is additional overhead to storing the longer UUID in secondary tables.
This is not the only consideration. Another consideration is that you could change the UUID, if necessary, without changing the foreign key references. For instance, you may wake up one day and say "that id has to start with AAA". You can alter the table and update the table and be done with it -- or you could worry about foreign key references as well. Or, you might add an organization column and decide that the unique key is a combination of the UUID and organization. These operations are much harder/slower if the UUID is being used as a foreign key reference.
When you have composite primary keys (more than one column), using the auto-incremented id is an even better idea. In this case, using the id for joins prevents mistakes where one of the join conditions might be left out.
As you point out, looking up the UUID for a given id should be a fast operation with the correct indexes. There may be some borderline cases where you would not want to have an id, but in general, it is a good idea.
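As a rough sketch of that layout, using SQLite from Python: the integer `id` carries every foreign key, and the UUID is only the external handle, resolved through its unique index. The table names `app_user` and `post` are hypothetical, and a random `uuid4` stands in for the question's hash(id).

```python
import sqlite3
import uuid


def make_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE app_user (
            id   INTEGER PRIMARY KEY AUTOINCREMENT,
            uuid BLOB NOT NULL UNIQUE   -- 16 bytes; UNIQUE also creates the lookup index
        );
        CREATE TABLE post (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id INTEGER NOT NULL REFERENCES app_user(id),  -- FK on the integer id
            title   TEXT NOT NULL
        );
        """
    )


def posts_for_uuid(conn: sqlite3.Connection, user_uuid: uuid.UUID) -> list:
    """Resolve the external UUID to the internal id inside the join."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM post AS p
        JOIN app_user AS u ON u.id = p.user_id
        WHERE u.uuid = ?
        ORDER BY p.id
        """,
        (user_uuid.bytes,),
    ).fetchall()
    return [title for (title,) in rows]


def demo_uuid_lookup() -> list:
    conn = sqlite3.connect(":memory:")
    make_schema(conn)
    external_id = uuid.uuid4()
    user_id = conn.execute(
        "INSERT INTO app_user (uuid) VALUES (?)", (external_id.bytes,)
    ).lastrowid
    conn.executemany(
        "INSERT INTO post (user_id, title) VALUES (?, ?)",
        [(user_id, "first"), (user_id, "second")],
    )
    return posts_for_uuid(conn, external_id)
```

The API layer only ever sees the UUID, while the other tables keep the narrow integer key.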

SQL - many-to-many table primary key

This question comes up after reading a comment in this question:
Database Design
When you create a many-to-many table, should you create a composite primary key on the two foreign key columns, or create an auto-increment surrogate "ID" primary key, and just put indexes on your two FK columns (and maybe a unique constraint)? What are the implications on performance for inserting new records/re-indexing in each case?
Basically, this:
PartDevice
----------
PartID (PK/FK)
DeviceID (PK/FK)
vs. this:
PartDevice
----------
ID (PK/auto-increment)
PartID (FK)
DeviceID (FK)
The commenter says:
"making the two IDs the PK means the table is physically sorted on the disk in that order. So if we insert (Part1/Device1), (Part1/Device2), (Part2/Device3), then (Part1/Device3) the database will have to break the table apart and insert the last one between entries 2 and 3. For many records, this becomes very problematic as it involves shuffling hundreds, thousands, or millions of records every time one is added. By contrast, an autoincrementing PK allows the new records to be tacked on to the end."
The reason I'm asking is because I've always been inclined to do the composite primary key with no surrogate auto-increment column, but I'm not sure if the surrogate key is actually more performant.
With a simple two-column many-to-many mapping, I see no real advantage to having a surrogate key. Having a primary key on (col1,col2) is guaranteed unique (assuming your col1 and col2 values in the referenced tables are unique) and a separate index on (col2,col1) will catch those cases where the opposite order would execute faster. The surrogate is a waste of space.
You won't need indexes on the individual columns since the table should only ever be used to join the two referenced tables together.
That comment you refer to in the question is not worth the electrons it uses, in my opinion. It sounds like the author thinks the table is stored in an array rather than an extremely high performance balanced multi-way tree structure.
For a start, it's never necessary to store or get at the table sorted, just the index. And the index won't be stored sequentially, it'll be stored in an efficient manner to be able to be retrieved quickly.
In addition, the vast majority of database tables are read far more often than written. That makes anything you do on the select side far more relevant than anything on the insert side.
No surrogate key is needed for link tables.
One PK on (col1, col2) and another unique index on (col2, col1) is all you need.
Unless you use an ORM that can't cope and dictates your DB design for you...
Edit: I answered the same here: SQL: Do you need an auto-incremental primary key for Many-Many tables?
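A small sketch of the composite-key layout in SQLite via Python, reusing the Part/Device names from the question. The index name `part_device_reverse` is made up for the example, and the rows follow the insertion order from the quoted comment.

```python
import sqlite3


def make_link_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE part   (part_id   INTEGER PRIMARY KEY);
        CREATE TABLE device (device_id INTEGER PRIMARY KEY);
        CREATE TABLE part_device (
            part_id   INTEGER NOT NULL REFERENCES part(part_id),
            device_id INTEGER NOT NULL REFERENCES device(device_id),
            PRIMARY KEY (part_id, device_id)      -- no surrogate column
        );
        -- Reverse-order index for lookups that start from the device side.
        CREATE UNIQUE INDEX part_device_reverse ON part_device (device_id, part_id);
        """
    )


def demo_link_table() -> tuple:
    conn = sqlite3.connect(":memory:")
    make_link_schema(conn)
    conn.executemany("INSERT INTO part VALUES (?)", [(1,), (2,)])
    conn.executemany("INSERT INTO device VALUES (?)", [(1,), (2,), (3,)])
    # Same order as in the quoted comment: the last pair is "out of order".
    conn.executemany(
        "INSERT INTO part_device VALUES (?, ?)", [(1, 1), (1, 2), (2, 3), (1, 3)]
    )
    try:
        conn.execute("INSERT INTO part_device VALUES (1, 3)")
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True  # the composite PK refuses the repeated pair
    devices_of_part_1 = [d for (d,) in conn.execute(
        "SELECT device_id FROM part_device WHERE part_id = 1 ORDER BY device_id")]
    parts_of_device_3 = [p for (p,) in conn.execute(
        "SELECT part_id FROM part_device WHERE device_id = 3 ORDER BY part_id")]
    return duplicate_rejected, devices_of_part_1, parts_of_device_3
```

The composite key does the deduplication on its own, and each direction of the join has an index that leads with the column being filtered.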
An incremental primary key could be needed if the table is referenced. There might be details in the many-to-many table which needed to be pulled up from another table using the incremental primary key.
for example
PartDevice
----------
ID (PK/auto-increment)
PartID (FK)
DeviceID (FK)
Other Details
It's easy to pull the 'Other Details' using PartDevice.ID as the FK. Thus the use of incremental primary key is needed.
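For comparison, a sketch of the referenced-link-table case in SQLite via Python. The child table `part_device_note` and its contents are hypothetical, and the UNIQUE constraint on the pair is an addition that follows the advice in the other answers.

```python
import sqlite3


def demo_referenced_link_table() -> list:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE part_device (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            part_id   INTEGER NOT NULL,
            device_id INTEGER NOT NULL,
            UNIQUE (part_id, device_id)   -- still guard the natural pair
        );
        CREATE TABLE part_device_note (
            part_device_id INTEGER NOT NULL REFERENCES part_device(id),
            note           TEXT NOT NULL
        );
        """
    )
    link_id = conn.execute(
        "INSERT INTO part_device (part_id, device_id) VALUES (1, 3)"
    ).lastrowid
    conn.execute(
        "INSERT INTO part_device_note VALUES (?, ?)", (link_id, "fits with adapter")
    )
    # The child table reaches the link row through the single-column key.
    return conn.execute(
        """
        SELECT pd.part_id, pd.device_id, n.note
        FROM part_device_note AS n
        JOIN part_device AS pd ON pd.id = n.part_device_id
        """
    ).fetchall()
```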
The shortest and most direct way I can answer your question is to say that there will be a performance impact if the two tables you are linking don't have sequential primary keys. As you stated/quoted, the index for the link table will either become fragmented, or the DBMS will work harder to insert records if the link table does not have its own sequential primary key. This is the reason most people put a sequentially incrementing primary key on link tables.
So it seems like if the ONLY job is to link the two tables, the best PK would be the dual-column PK.
But if it serves other purposes, then add another index as the PK, along with the foreign keys and a second unique index.
An index or PK is the best way to make sure there are no duplicates. A PK also lets tools like Microsoft SQL Server Management Studio do some of the work (creating views) for you.

In general, should every table in a database have an identity field to use as a PK?

I'm running into an issue with a join: getting back too many records. I added a table to the set of joins and the number of rows expanded. Usually when this happens I add a select of all the ID fields that are involved in the join. That way it's pretty obvious where the expansion is happening and I can change the ON of the join to fix it. Except in this case, the table that I added doesn't have an ID field. This is a problem. But perhaps I'm wrong.
Should every table in a database have an IDENTITY field that's used as the PK? Are there any drawbacks to having an ID field in every table? What if you're reasonably sure this table will never be used in a PK/FK relationship?
When having an identity column is not a good idea?
Surrogate vs. natural/business keys
Wikipedia Surrogate Key article
There are two concepts that are close but should not be confused: IDENTITY and PRIMARY KEY
Every table (except for the rare conditions) should have a PRIMARY KEY, that is a value or a set of values that uniquely identify a row.
See here for discussion why.
IDENTITY is a property of a column in SQL Server which means that the column will be filled automatically with incrementing values.
Due to the nature of this property, the values of this column are inherently UNIQUE.
However, no UNIQUE constraint or UNIQUE index is automatically created on an IDENTITY column, and after issuing SET IDENTITY_INSERT ON it's possible to insert duplicate values into an IDENTITY column, unless it has been explicitly UNIQUE constrained.
The IDENTITY column does not necessarily have to be a PRIMARY KEY, but most often it's used to fill surrogate PRIMARY KEYs.
It may or may not be useful in any particular case.
Therefore, the answer to your question:
The question: should every table in a database have an IDENTITY field that's used as the PK?
is this:
No. There are cases when a database table should NOT have an IDENTITY field as a PRIMARY KEY.
Three cases come into my mind when it's not the best idea to have an IDENTITY as a PRIMARY KEY:
If your PRIMARY KEY is composite (like in many-to-many link tables)
If your PRIMARY KEY is natural (like, a state code)
If your PRIMARY KEY should be unique across databases (in this case you use GUID / UUID / NEWID)
All these cases imply the following condition:
You shouldn't have IDENTITY when you care for the values of your PRIMARY KEY and explicitly insert them into your table.
Update:
Many-to-many link tables should have the pair of id's to the table they link as the composite key.
It's a natural composite key which you already have to use (and make UNIQUE), so there is no point to generate a surrogate key for this.
I don't see why would you want to reference a many-to-many link table from any other table except the tables they link, but let's assume you have such a need.
In this case, you just reference the link table by the composite key.
This query:
CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (a_id, b_id, PRIMARY KEY (a_id, b_id))
CREATE TABLE business_rule (id, a_id, b_id, FOREIGN KEY (a_id, b_id) REFERENCES ab)
SELECT *
FROM business_rule br
JOIN a
ON a.id = br.a_id
is much more efficient than this one:
CREATE TABLE a (id, data)
CREATE TABLE b (id, data)
CREATE TABLE ab (id, a_id, b_id, PRIMARY KEY (id), UNIQUE KEY (a_id, b_id))
CREATE TABLE business_rule (id, ab_id, FOREIGN KEY (ab_id) REFERENCES ab)
SELECT *
FROM business_rule br
JOIN ab
ON br.ab_id = ab.id
JOIN a
ON a.id = ab.a_id
, for obvious reasons.
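A runnable sketch of the composite-key variant, using SQLite from Python. The column types and sample values are assumptions, since the DDL above is schematic. Foreign key enforcement has to be switched on explicitly in SQLite.

```python
import sqlite3

COMPOSITE_SCHEMA = """
    CREATE TABLE a (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE ab (
        a_id INTEGER NOT NULL REFERENCES a(id),
        b_id INTEGER NOT NULL REFERENCES b(id),
        PRIMARY KEY (a_id, b_id)
    );
    CREATE TABLE business_rule (
        id   INTEGER PRIMARY KEY,
        a_id INTEGER NOT NULL,
        b_id INTEGER NOT NULL,
        FOREIGN KEY (a_id, b_id) REFERENCES ab (a_id, b_id)
    );
"""


def demo_composite_fk() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(COMPOSITE_SCHEMA)
    conn.execute("INSERT INTO a VALUES (1, 'alpha')")
    conn.execute("INSERT INTO b VALUES (7, 'beta')")
    conn.execute("INSERT INTO ab VALUES (1, 7)")
    conn.execute("INSERT INTO business_rule VALUES (100, 1, 7)")
    try:
        # (1, 8) is not a linked pair, so the composite FK rejects it.
        conn.execute("INSERT INTO business_rule VALUES (101, 1, 8)")
        orphan_rejected = False
    except sqlite3.IntegrityError:
        orphan_rejected = True
    # a_id is already on business_rule, so a is reached without touching ab.
    rows = conn.execute(
        "SELECT br.id, a.data FROM business_rule AS br JOIN a ON a.id = br.a_id"
    ).fetchall()
    return orphan_rejected, rows
```

With the surrogate `ab.id` variant the same result needs a second join through `ab`, because `business_rule` would only carry `ab_id`.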
Almost always yes. I generally default to including an identity field unless there's a compelling reason not to. I rarely encounter such reasons, and the cost of the identity field is minimal, so generally I include.
Only thing I can think of off the top of my head where I didn't was a highly specialized database that was being used more as a datastore than a relational database where the DBMS was being used for nearly every feature except significant relational modelling. (It was a high volume, high turnover data buffer thing.)
I'm a firm believer that natural keys are often far worse than artificial keys because you often have no control over whether they will change which can cause horrendous data integrity or performance problems.
However, there are some (very few) natural keys that make sense without being an identity field (two-letter state abbreviation comes to mind, it is extremely rare for these official type abbreviations to change.)
Any table which is a join table to model a many to many relationship probably also does not need an additional identity field. Making the two key fields together the primary key will work just fine.
Other than that I would, in general, add an identity field to most other tables unless given a compelling reason in that particular case not to. It is a bad practice to fail to create a primary key on a table or if you are using surrogate keys to fail to place a unique index on the other fields needed to guarantee uniqueness where possible (unless you really enjoy resolving duplicates).
Every table should have some set of field(s) that uniquely identify it. Whether or not there is a numeric identifier field separate from the data fields will depend on the domain you are attempting to model. Not all data easily falls into the 'single numeric id' paradigm, and as such it would be inappropriate to force it. Given that, a lot of data does easily fit in this paradigm and as such would call for such an identifier. There is no one answer to always do X in any programming environment, and this is another example.
If you have modelled, designed, normalised etc, then you will have no identity columns.
You will have identified natural and candidate keys for your tables.
You may decide on a surrogate key because of the physical architecture (e.g. narrow, numeric, strictly monotonically increasing), say, because using an nvarchar(100) column is not a good idea (you still need the unique constraint).
Or because of ideology: they appeal to OO developers I've found.
OK, assume ID columns. As your DB gets more complex, say several layers deep, how can you join parent and grandchild tables directly? You can't: you always need intermediate tables and well-indexed PK-FK columns. With a composite key, it's all there for you...
Don't get me wrong: I use them. But I know why I use them...
Edit:
I'd be interested to collate "always ID"+"no stored procs" matches on one hand, with "use stored procs"+"IDs when they benefit" on the other...
No. Whenever you have a table with an artificial identity column, you also need to identify the natural primary key for the table and ensure that there is a unique constraint on that set of columns too so that you don't get two rows that are identical apart from the meaningless identity column by accident.
Adding an identity column is not cost free. There is an overhead in adding an unnecessary identity column to a table: typically 4 bytes per row of storage for the identity value, plus a whole extra index (which will probably weigh in at 8-12 bytes per row plus overhead). It also takes slightly longer to work out the most cost-effective query plan because there is an extra index per table. Granted, if the table is small and the machine is big, this overhead is not critical, but for the biggest systems, it matters.
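To illustrate the first point, here is a sketch in SQLite via Python with two hypothetical tables that differ only in whether the natural key (`iso_code`, chosen for the example) carries a unique constraint next to the identity-style `id`.

```python
import sqlite3


def demo_natural_key_constraint() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE country_unguarded (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            iso_code TEXT NOT NULL            -- natural key, not constrained
        );
        CREATE TABLE country_guarded (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            iso_code TEXT NOT NULL UNIQUE     -- natural key, constrained
        );
        """
    )
    for table in ("country_unguarded", "country_guarded"):
        conn.execute(f"INSERT INTO {table} (iso_code) VALUES ('NZ')")

    # Without the constraint, a second 'NZ' row slips in under a fresh id.
    conn.execute("INSERT INTO country_unguarded (iso_code) VALUES ('NZ')")
    unguarded_rows = conn.execute(
        "SELECT COUNT(*) FROM country_unguarded"
    ).fetchone()[0]

    try:
        conn.execute("INSERT INTO country_guarded (iso_code) VALUES ('NZ')")
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True
    return unguarded_rows, duplicate_rejected
```

The unguarded table ends up with two rows that are identical apart from the meaningless id, which is exactly the accident described above.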
Yes, for the vast majority of cases.
Edge cases or exceptions might be things like:
two-way join tables to model m:n relationships
temporary tables used for bulk-inserting huge amounts of data
But other than that, I think there is no good reason against having a primary key to uniquely identify each row in a table, and in my opinion, using an IDENTITY field is one of the best choices (I prefer surrogate keys over natural keys - they're more reliable, stable, never changing etc.).
Marc
I can't think of any drawback to having an ID field in each table, provided the type of your ID field leaves enough room for your table to grow.
However, you don't necessarily need a single field to ensure the identity of your rows.
So no, a single ID field is not mandatory.
Primary and Foreign Keys can consist not only of one field, but of multiple fields. This is typical for tables implementing a N-N relationship.
You can perfectly well have PRIMARY KEY (fa, fb) on your table:
CREATE TABLE t(fa INT , fb INT);
ALTER TABLE t ADD PRIMARY KEY(fa , fb);
Recognize the distinction between an Identity field and a key... Every table should have a key, to eliminate the data corruption of inadvertently entering multiple rows that represent the same 'entity'. If the only key a table has is a meaningless surrogate key, then this function is effectively missing.
On the other hand, no table 'needs' an identity, and certainly not every table benefits from one. Examples are: a table with a short and functional key, a table which does not have any other table referencing it through a foreign key, or a table which is in a one to zero-or-one relationship with another table. None of these need an identity.
I'd say, if you can find a simple, natural key in your table (i.e. one column), use that as a key instead of an identity column.
I generally give every table some kind of unique identifier, whether it is natural or generated, because then I am guaranteed that every row is uniquely identified somehow.
Personally, I avoid IDENTITY (incrementing identity columns, like 1, 2, 3, 4) columns like the plague. They cause a lot of hassle, especially if you delete rows from that table. I use generated uniqueidentifiers instead if there is no natural key in the table.
Anyway, no idea if this is the accepted practice, just seems right to me. YMMV.

Should I use an ENUM for primary and foreign keys?

An associate has created a schema that uses an ENUM() column for the primary key on a lookup table. The table turns a product code "FB" into its name "Foo Bar".
This primary key is then used as a foreign key elsewhere. And at the moment, the FK is also an ENUM().
I think this is not a good idea. This means that to join these two tables, we end up with four lookups. The two tables, plus the two ENUM(). Am I correct?
I'd prefer to have the FKs be CHAR(2) to reduce the lookups. I'd also prefer that the PKs were also CHAR(2) to reduce it completely.
The benefit of the ENUM()s is to get constraints on the values. I wish there was something like: CHAR(2) ALLOW('FB', 'AB', 'CD') that we could use for both the PK and FK columns.
What is:
Best practice
Your preference
This concept is used elsewhere too. What if the ENUM()'s values are longer? ENUM('Ding, dong, dell', 'Baa baa black sheep'). Now the ENUM() is useful from a space point-of-view. Should I only care about this if there are several million rows using the values? In which case, the ENUM() saves storage space.
ENUM should be used to define a possible range of values for a given field. This also implies that you may have multiple rows which have the same value for this particular field.
I would not recommend using an ENUM for a primary key type or foreign key type.
Using an ENUM for a primary key means that adding a new key would involve modifying the table since the ENUM has to be modified before you can insert a new key.
I am guessing that your associate is trying to limit who can insert a new row and that number of rows is limited. I think that this should be achieved through proper permission settings either at the database level or at the application and not through using an ENUM for the primary key.
IMHO, using an ENUM for the primary key type violates the KISS principle.
But when you are only dealing with 10 or fewer distinct rows, that won't be a problem.
For example:
CREATE TABLE `grade`(
`grade` ENUM('A','B','C','D','E','F') PRIMARY KEY,
`description` VARCHAR(50) NOT NULL
)
With this table, it is more than difficult to run DML.
We've had more discussion about it and here's what we've come up with:
Use CHAR(2) everywhere. For both the PK and FK. Then use mysql's foreign key constraints to disallow creating an FK to a row that doesn't exist in the lookup table.
That way, given the lookup table is L, and two referring tables X and Y, we can join X to Y without any looking up of ENUM()s or table L and can know with certainty that there's a row in L if (when) we need it.
I'm still interested in comments and other thoughts.
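A sketch of that design, run on SQLite from Python rather than MySQL, so the dialect is an assumption. SQLite accepts the CHAR(2) declaration but does not enforce the length, and needs foreign keys switched on explicitly. The table names `product`, `x` and `y` stand in for L, X and Y.

```python
import sqlite3


def demo_char2_keys() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(
        """
        CREATE TABLE product (code CHAR(2) PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE x (
            id           INTEGER PRIMARY KEY,
            product_code CHAR(2) NOT NULL REFERENCES product(code)
        );
        CREATE TABLE y (
            id           INTEGER PRIMARY KEY,
            product_code CHAR(2) NOT NULL REFERENCES product(code)
        );
        """
    )
    conn.execute("INSERT INTO product VALUES ('FB', 'Foo Bar')")
    conn.execute("INSERT INTO x VALUES (1, 'FB')")
    conn.execute("INSERT INTO y VALUES (10, 'FB')")
    try:
        # 'ZZ' is not in the lookup table, so the FK constraint refuses it.
        conn.execute("INSERT INTO x VALUES (2, 'ZZ')")
        unknown_code_rejected = False
    except sqlite3.IntegrityError:
        unknown_code_rejected = True
    # x joins to y on the code itself; the lookup table is not involved.
    joined = conn.execute(
        """
        SELECT x.id, y.id, x.product_code
        FROM x JOIN y ON y.product_code = x.product_code
        """
    ).fetchall()
    return unknown_code_rejected, joined
```

Adding a new product code is then a plain INSERT into the lookup table, with no ALTER TABLE on the referring tables.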
Having a lookup table and an enum means you are changing values in two places all the time. Funny... We spent too many years using enums, which caused issues where we needed to recompile to add values. In recent years, we have moved away from enums in many situations and use the values in our lookup tables instead. The biggest value I see in lookup tables is that you can add or change values without needing to recompile. Even with millions of rows I would stick to the lookup tables and just be intelligent in your database design.