How to create a dynamic unique constraint - sql

I have a huge table that is partitioned by a partition id. Each partition can have a different number of fields in its unique constraint. Consider this table:
+----+---------+-------+-----+
| id | part_id | name  | age |
+----+---------+-------+-----+
|  1 |       1 | James |  12 |
|  2 |       1 | Mary  |  33 |
|  3 |       2 | James |   1 |
|  4 |       2 | Mike  |  19 |
|  5 |       3 | James |  12 |
+----+---------+-------+-----+
For part_id 1 I need a unique constraint on the fields name and age; part_id 2 needs a unique constraint on name; part_id 3 also needs a unique constraint on name. I am open to any database that can accomplish this.

A classic RDBMS is designed to work with a stable schema. That means the structure of your tables, columns, indexes, and relations doesn't change often; each table has a fixed number of columns with fixed types, and it is hard/inefficient to make them dynamic.
SQL Server has filtered indexes.
So, you can create a separate unique index for each partition.
CREATE UNIQUE NONCLUSTERED INDEX IX_Part1 ON YourTable
(
    name ASC,
    age ASC
)
WHERE (part_id = 1);

CREATE UNIQUE NONCLUSTERED INDEX IX_Part2 ON YourTable
(
    name ASC
)
WHERE (part_id = 2);

CREATE UNIQUE NONCLUSTERED INDEX IX_Part3 ON YourTable
(
    name ASC
)
WHERE (part_id = 3);
These DDL statements are static, and the value of part_id is hard-coded in them. The optimiser can use such indexes in queries that have the same WHERE filter, so they are useful not just for enforcing the constraint.
You can always write a procedure that generates the text of the CREATE INDEX statement dynamically and runs it via EXEC/sp_executesql. There may be some clever use of triggers on YourTable to create indexes on the fly as the data in your table changes, but in the end it will be some static CREATE INDEX statement.
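A minimal sketch of that idea (the variable names and the single-column key are illustrative assumptions, not a definitive implementation):
-- Build and run a CREATE INDEX statement for one new partition.
-- Adjust the column list to whatever that partition's unique key should be.
DECLARE @part_id int = 4;
DECLARE @sql nvarchar(max) =
    N'CREATE UNIQUE NONCLUSTERED INDEX IX_Part' + CAST(@part_id AS nvarchar(10)) +
    N' ON YourTable (name ASC)' +
    N' WHERE (part_id = ' + CAST(@part_id AS nvarchar(10)) + N');';
EXEC sp_executesql @sql;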
You can create these indexes in advance for all possible values of part_id, even if there are no such actual values in the table yet.
If you have thousands of part_id values and want to create thousands of such unique constraints, then your current schema may not be quite appropriate.
SQL Server allows max 999 nonclustered indexes per table. See Maximum Capacity Specifications for SQL Server.
Are you trying to build some variation of EAV (entity-attribute-value) model?
Maybe there are non-relational DBMSs that allow greater flexibility and would suit your task better, but I don't have experience with them.

In Oracle, a single function-based unique index can cover several partitions at once. A CASE expression can only return one value, so each column needs its own CASE:
CREATE UNIQUE INDEX idx_part_id_dynamic ON partition_table (
    part_id,
    CASE WHEN part_id = 1 THEN name END,
    CASE WHEN part_id IN (1, 3) THEN age END,
    CASE WHEN part_id NOT IN (1, 3) THEN height END
);
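To illustrate the effect (column names taken from the question's example table; the values are made up): with the index in place, a second row with the same name and age in partition 1 is rejected:
INSERT INTO partition_table (id, part_id, name, age) VALUES (6, 1, 'James', 12);
INSERT INTO partition_table (id, part_id, name, age) VALUES (7, 1, 'James', 12); -- fails with ORA-00001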

How can I speed up queries with `GROUP BY` in them?

Details:
MariaDB: Server version: 10.2.10-MariaDB MariaDB Server
The DB table, trans_tbl, is using the Aria DB engine
Table is somewhat large: 126,006,123 rows
Server is not at all large: AWS t3 micro w/attached 30GB EBS
I applied indexes to this DB table as follows:
A primary key: evt_id
Another index on the column I want to group by: transaction_type
3 Related Questions:
Why is the transaction_type index ignored when I perform the following?
SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type
If I look at the output from EXPLAIN, I see:
MariaDB [my_db]> EXPLAIN SELECT COUNT(evt_id), transaction_type FROM trans_tbl GROUP BY transaction_type;
+----+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
| id | select_type | table     | type | possible_keys | key  | key_len | ref  | rows      | Extra                           |
+----+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
|  1 | SIMPLE      | trans_tbl | ALL  | NULL          | NULL | NULL    | NULL | 126006123 | Using temporary; Using filesort |
+----+-------------+-----------+------+---------------+------+---------+------+-----------+---------------------------------+
What's confusing me here is that both of the items in the query are indexed. So, shouldn't the index(es) be utilized?
Why is the transaction_type index being used in the following case, where all I've done is switch from COUNT(evt_id) -- the primary key -- to COUNT(1)? (The column is transaction_type; the index generated from it is called TransType.)
MariaDB [my_db]> EXPLAIN SELECT COUNT(1), transaction_type FROM trans_tbl GROUP BY transaction_type;
+----+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
| id | select_type | table     | type  | possible_keys | key       | key_len | ref  | rows      | Extra       |
+----+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
|  1 | SIMPLE      | trans_tbl | index | NULL          | TransType | 35      | NULL | 126006123 | Using index |
+----+-------------+-----------+-------+---------------+-----------+---------+------+-----------+-------------+
The first query (with COUNT(evt_id)) takes 2 minutes & 40 seconds. Since it is not using the indices, that makes sense. But the second query (with COUNT(1)) takes 50 seconds. This makes no sense to me. Shouldn't it take essentially 0 seconds? Can't it just look at the first and last index value of each group, subtract them, and have the count? It seems to me that it is indeed actually counting. What's the point of an index?
I guess my more important question is: How do I set up my indexes to allow for grouping on that index to return results almost instantaneously, as I would expect?
PS I know the machine is ridiculously underpowered for this size of DB table. But, the table data is not worth throwing a lot of money at it to improve performance. I'd rather just learn to implement Aria indexes properly to gain speed.
COUNT(x) checks x for being NOT NULL before counting the row.
COUNT(*) is the usual pattern for counting rows.
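A quick illustration of the difference, using a throwaway table (hypothetical, not from the question):
-- COUNT(x) skips NULLs; COUNT(*) counts every row.
CREATE TABLE t (x INT);
INSERT INTO t VALUES (1), (NULL), (3);
SELECT COUNT(x), COUNT(*) FROM t;  -- returns 2, 3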
So...
SELECT COUNT(evt_id), transaction_type
FROM trans_tbl GROUP BY transaction_type;
decided to do a table scan, then sort and group.
SELECT COUNT(*), transaction_type
FROM trans_tbl GROUP BY transaction_type;
saw INDEX(transaction_type) and said "goodie; I can just scan that index without having to sort." Note: It still has to scan in order to count. But the INDEX is smaller than the table, so it could be done faster. This is also called a "covering" index since all the columns needed in the SELECT are found in that one INDEX.
COUNT(1) might be treated the same as COUNT(*), I don't know.
INDEX(transaction_type) is essentially identical to INDEX(transaction_type, evt_id). This is because the PRIMARY KEY is silently tacked onto any secondary key in InnoDB.
I don't know why INDEX(transaction_type, evt_id) was not used. Bottom line: Use COUNT(*).
Why not 0 seconds? The counts are not saved anywhere. Anyway, other queries could be modifying the counts as you run your SELECT. The improvement came from scanning 126M 2-column rows instead of 126M multi-column rows.

Using 'character' as primary key and reference it from another table

Consider the following postgres (version 9.4) database:
testbase=# select * from employee;
 id |             name
----+----------------------------------
  1 | johnson, jack
  2 | jackson, john
(2 rows)
testbase=# select * from worklog;
 id |             activity             | employee |            time
----+----------------------------------+----------+----------------------------
  1 | department alpha                 |        1 | 2018-01-27 20:32:16.512677
  2 | department beta                  |        1 | 2018-01-27 20:32:18.112356
  5 | break                            |        1 | 2018-01-27 20:32:22.255563
  3 | department gamma                 |        2 | 2018-01-27 20:32:20.073173
  4 | department gamma                 |        2 | 2018-01-27 20:32:21.05962
(5 rows)
The column 'name' in table 'employee' is of type character(32) and unique; the column 'employee' in 'worklog' references 'id' from the table 'employee'. The column 'id' is the primary key in each table.
I can see all activities from a certain employee by issuing:
testbase=# select * from worklog where employee=(select id from employee where name='johnson, jack');
 id |             activity             | employee |            time
----+----------------------------------+----------+----------------------------
  1 | department alpha                 |        1 | 2018-01-27 20:32:16.512677
  2 | department beta                  |        1 | 2018-01-27 20:32:18.112356
  5 | break                            |        1 | 2018-01-27 20:32:22.255563
(3 rows)
I would rather like to simplify the query to
testbase=# select * from worklog where employee='johnson, jack';
For this I would change 'employee' to type character(32) in 'worklog' and declare 'name' as primary key in table 'employee'. Column 'employee' in 'worklog' would, of course, reference 'name' from table 'employee'.
My question:
Will every new row in 'worklog' require an additional 32 bytes for the 'employee' name, or will Postgres internally just keep a pointer to the foreign field without duplicating the name for every new row?
I suppose that the answer for my question is somewhere in the documentation but I could not find it. It would be very helpful if someone could provide an according link.
PS: I did find this thread, however, there was no link to some official documentation. The behaviour might also have changed, since the thread is now over seven years old.
Postgres will store the data that you tell it to store. There are some new databases that will do compression under the hood -- and Postgres might have features to enable that (I do not know all Postgres features).
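You can check the per-value cost yourself with pg_column_size (a built-in Postgres function); the exact figure can vary with the varlena header size:
SELECT pg_column_size('johnson, jack'::character(32));
-- typically 33: 32 padded characters plus a 1-byte short header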
But, you shouldn't do this. Integer primary keys are more efficient than strings for three reasons:
They are fixed length in bytes.
They are shorter.
Collations are not an issue.
Stick with your original query, but write it using a join:
select wl.*
from worklog wl
join employee e on wl.employee = e.id
where e.name = 'johnson, jack';
I suggest this because it is more consistent with how SQL works and makes it easier to select multiple employees.
If you want to see the name and not the id, create a view (say v_worklog) and add in the employee name.
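A sketch of such a view, assuming the column names shown above (the view name v_worklog is the one suggested; the rest is illustrative):
CREATE VIEW v_worklog AS
SELECT wl.id, wl.activity, e.name AS employee_name, wl."time"
FROM worklog wl
JOIN employee e ON wl.employee = e.id;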

ORDER BY [PRIMARY_KEY] has to apply sort-order when it should simply use the index?

According to my research, when ordering by the primary key (or any other indexed column), the query should run without an explicit sort.
I also found a blog where this behavior was shown on different databases, one of them being Oracle.
However, in my tests this was not true. What could be the reason? Bad install options? A broken index? (Although I ruled that out by creating a completely new table.)
the query:
select * from auftrag_test order by auftragkey
the execution plan:
Plan Hash Value : 505195503
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost | Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 167910 | 44496150 | 11494 | 00:00:01 |
| 1 | SORT ORDER BY | | 167910 | 44496150 | 11494 | 00:00:01 |
| 2 | TABLE ACCESS FULL | AUFTRAG_TEST | 167910 | 44496150 | 1908 | 00:00:01 |
-----------------------------------------------------------------------------------
create table AUFTRAG_TEST
(
auftragkey VARCHAR2(40) not null,
...
);
alter table AUFTRAG_TEST
add constraint PK_AUFTRAG_TEST primary key (AUFTRAGKEY);
You might ask yourself why the primary key would be a varchar field. Well, this is something our bosses decided. (Actually we put in stringified GUIDs.)
The blog I found:
http://use-the-index-luke.com/sql/sorting-grouping/indexed-order-by
P.S.: I think I found the problem. This select does NOT produce a sort step in the plan:
select *
from auftrag_test
where auftragkey = 'aabbccddeeffaabbccddeeffaabbccdd'
order by auftragkey
So, apparently, it only works if you also filter against the indexed column with an equality predicate, which wouldn't be very helpful at all.
P.P.S.: MS SQL seems to do just what I expected. If I order by the primary key (with a non-clustered unique index), the sort is "free", both in the execution plan and in query time.
You should be aware that scanning a big table through an index might take hours, versus a full table scan on the same table that takes only a few minutes.
In this case, traversing the index just to save an O(n*log(n)) sort operation doesn't sound like a good idea.
A heap table will yield a sort operation.
An IOT (index-organized table, also known as a "clustered index") is already sorted.
create table t_heap (i int primary key, j int);
create table t_iot  (i int primary key, j int) organization index;

select * from t_heap order by i;  -- plan shows a SORT ORDER BY step
select * from t_iot  order by i;  -- rows come back in key order; no sort step

Uniqueness constraint on cross between two rows

I'm creating a (postgres) table that has:
CREATE TABLE workers (id INT PRIMARY KEY, deleted_at DATE, account_id INT)
I'd like to have a uniqueness constraint only across workers that have not been deleted. Is there a good way to achieve this in sql? As an example:
id | deleted_at | account_id
 1 | NULL       | 1
# valid, was deleted:
 2 | yesterday  | 1
# invalid, duplicate account among non-deleted rows:
# 3 | NULL      | 1
You want what Postgres calls a "partial index" (and other databases call a filtered index):
create unique index idx_workers_account_id on workers(account_id)
where deleted_at is null;
Here is the documentation on this feature: https://www.postgresql.org/docs/current/indexes-partial.html
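To illustrate (the values are placeholders; assumes the workers table and index above):
INSERT INTO workers VALUES (2, CURRENT_DATE - 1, 1); -- ok: deleted row, not covered by the partial index
INSERT INTO workers VALUES (1, NULL, 1);             -- ok: first active row for account 1
INSERT INTO workers VALUES (3, NULL, 1);             -- fails: duplicate account_id among active rows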

How to optimize this query?

Query:
select id,
       title
from posts
where id in (23,24,60,19,21,32,43,49,9,11,17,34,37,39,46,52,55)
Explain plan:
mysql> explain select id,title from posts where id in (23,24,60,19,21,32,43,49,9,11,17,34,37,39,46,52,55);
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | posts | ALL  | PRIMARY       | NULL | NULL    | NULL |   30 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.05 sec)
id is the primary key of posts table.
Other than adding other indexes, such as
a clustered index on id
a covering index which includes id [first] and the other columns from the SELECT clause (see the sketch below)
there seems to be little to be done...
In fact, even if such indexes were available, MySQL may decide to do a table scan, as is the case here ("ALL" type). The reason may be that the table has relatively few rows (compared with the estimated number of rows the query would return), and it is therefore more efficient to read the table sequentially, discarding non-matching rows as it goes, rather than "hopping all over the place" with an index indirection.
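For reference, a sketch of the covering option from the list above (the index name is an illustrative choice, not from the question):
-- Both id and title can then be read from the index alone.
ALTER TABLE posts ADD INDEX idx_posts_id_title (id, title);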
I don't see any problem with it. If you need to select against a list, then "IN" is the right way to do it. You're not selecting unnecessary information, and the thing you're selecting against is a key, which is presumably indexed.