Best practice to enforce uniqueness on a column but allow some duplicates? - sql

Here is what I am trying to figure out: there should be a table to store authorizations for our new client management system, and every authorization has its own unique identifier. This constraint would be pretty easy to translate to SQL, but unfortunately, because of the slowness of bureaucracy, we sometimes need to create an entry with a placeholder ID (e.g., "temp") so that the client can start taking services.
What would be the best practice to enforce this conditional uniqueness constraint?
This is what I could come up with, given my limited experience:
Use partial indexing as described in the PostgreSQL manual (5.3.3 -> Example 11-3). The manual also notes that "this is a particularly efficient approach when there are few successful tests and many unsuccessful ones." In our legacy DB that will be migrated, there are 130,000 rows and about 5 temp authorizations a month, but the whole table only grows by about 200 rows per year. Would this be the right approach? (I am also not sure what "efficient" means in this context.)
Create a separate table for the temp authorizations but then it would duplicate the table structure.
Define a unique constraint for a group of columns. An authorization is for a specific service for a certain time period issued to an individual.
EDIT:
I'm sorry, I think my description of the authorization ID was a bit obscure: it is provided by a state department in the format NMED012345678 and it is entered by hand. It is unique, but sometimes it is only provided at a later time, for unknown reasons.

There is a simple, fast and secure way:
Add a boolean column to mark temporary entries which is NULL by default, say:
temp bool DEFAULT NULL CHECK (temp)
The added check constraint disallows FALSE, only NULL or TRUE are possible. Storage cost for the default NULL value is typically ... nothing - unless there are no other NULL values in the row.
How much disk-space is needed to store a NULL value using postgresql DB?
The column default means you don't normally have to take care of the column. It's NULL by default (which is the default default anyway, I'm just being explicit here). You only need to mark the few exceptions explicitly.
Then create a partial unique index like:
CREATE UNIQUE INDEX tbl_unique_id_uni ON tbl (unique_id) WHERE temp IS NULL;
The index only includes the rows that are supposed to be unique, so the temp rows do not increase the index size at all.
Be sure to add the predicate WHERE temp IS NULL to queries that are supposed to use the unique index.
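Putting the pieces together, a minimal sketch of the whole approach (any column other than unique_id and temp is illustrative, not taken from the question):

CREATE TABLE tbl (
  tbl_id    serial PRIMARY KEY
, unique_id text NOT NULL                  -- state-issued authorization ID, or the 'temp' placeholder
, temp      bool DEFAULT NULL CHECK (temp) -- NULL = regular entry, TRUE = placeholder
);

-- uniqueness is enforced only for the regular (non-temp) entries
CREATE UNIQUE INDEX tbl_unique_id_uni ON tbl (unique_id) WHERE temp IS NULL;

INSERT INTO tbl (unique_id)       VALUES ('NMED012345678');  -- regular entry
INSERT INTO tbl (unique_id, temp) VALUES ('temp', TRUE);     -- placeholder
INSERT INTO tbl (unique_id, temp) VALUES ('temp', TRUE);     -- allowed: duplicates are fine for temp rows

-- repeat the predicate so the partial unique index can be used
SELECT * FROM tbl WHERE unique_id = 'NMED012345678' AND temp IS NULL;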
Related:
Create unique constraint with null columns

There are several possibilities:
Make the temp identifiers unique; for instance, if they are automatically created (not entered manually), make them unique with a sequence (see the sketch after these options):
CREATE SEQUENCE temp_ids_seq ; -- This done only once for the database
Whenever you need a new temporary id, issue
'temp' || nextval('temp_ids_seq') AS id
Use a partial index, assuming that the value which is allowed to repeat is temp
CREATE UNIQUE INDEX tbl_unique_idx ON tbl (id) WHERE (id IS DISTINCT FROM 'temp')
For the sake of efficiency, in that case you would probably also like to have the complementary index:
CREATE INDEX tbl_temp_idx ON tbl (id) WHERE (id IS NOT DISTINCT FROM 'temp')
This last index will help queries seeking id = 'temp'.
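A sketch of how the two options might fit together (the tbl table and its columns are illustrative):

CREATE SEQUENCE temp_ids_seq;  -- done only once for the database

CREATE TABLE tbl (
  id     text NOT NULL
, client text NOT NULL
);

-- option 1: every placeholder gets its own generated id ('temp1', 'temp2', ...)
INSERT INTO tbl (id, client)
VALUES ('temp' || nextval('temp_ids_seq'), 'John Doe');

-- option 2: keep the literal 'temp' placeholder; uniqueness is enforced only for real ids
CREATE UNIQUE INDEX tbl_unique_idx ON tbl (id) WHERE (id IS DISTINCT FROM 'temp');
CREATE INDEX        tbl_temp_idx   ON tbl (id) WHERE (id IS NOT DISTINCT FROM 'temp');

-- per the note above, the complementary index helps queries seeking the placeholders
SELECT * FROM tbl WHERE id = 'temp';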

This is a bit long for a comment.
I think I would have a single authorization table with a unique authorization id. An authorization could then have two types: "approved" and "temporary". You could handle this with two columns.
However, I would probably have the authorization id as a serial column with the "approved" id being a field in the table. That table could have a unique constraint on it. You can use either a full unique constraint or a unique constraint with filtered values (Postgres allows multiple NULL values in a unique constraint, but the second is more explicit).
You can have the same process for the temporary authorizations -- using a different column. Presumably you have some mechanism for authorizing them and storing the approval date, time, and person.
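A rough sketch of that layout (the column names here are illustrative, not prescribed by the answer):

CREATE TABLE authorizations (
  authorization_id serial PRIMARY KEY  -- internal id, referenced everywhere else
, approved_id      text                -- state-issued id (e.g. 'NMED012345678'), NULL until it arrives
, temp_id          text                -- placeholder used while waiting for the real id
, approved_at      timestamp
, approved_by      text
);

-- only the approved ids must be unique; rows still waiting (NULL) never collide
CREATE UNIQUE INDEX authorizations_approved_id_uni
    ON authorizations (approved_id)
    WHERE approved_id IS NOT NULL;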
I would not use two tables. Having authorizations spread among multiple tables just seems likely to sow confusion. Anywhere in the code where you want to see who has an authorization is a potential for mis-reading the data.

IMO it is not advisable to use remote keys as (part of) primary keys.
they are not under your control; they can change
you cannot guarantee correctness and/or uniqueness (email addresses, telephone numbers, licence numbers, serial numbers)
using them as a PK would cause them to be used as an FK from other tables into this table, with fat indexes and lots of cascading on change.
\i tmp.sql
CREATE TABLE the_persons
( seq SERIAL NOT NULL PRIMARY KEY -- surrogate key
, registrationnumber varchar -- "remote" KEY, not necessarily UNIQUE
, is_validated BOOLEAN NOT NULL DEFAULT FALSE
, last_name varchar
, dob DATE
);
CREATE INDEX name_dob_idx ON the_persons(last_name, dob)
;
CREATE UNIQUE INDEX registrationnumber_idx ON the_persons(registrationnumber,seq)
-- WHERE is_validated = False
;
CREATE UNIQUE INDEX registrationnumber_key ON the_persons(registrationnumber)
WHERE is_validated = True
;
INSERT INTO the_persons(is_validated,registrationnumber,last_name, dob)VALUES
( True, 'OKAY001', 'Smith', '1988-02-02')
,( True, 'OKAY002', 'Jones', '1988-02-02')
,( False, 'OKAY001', 'Smith', '1988-02-02')
,( False, 'OMG001', 'Smith', '1988-08-02')
;
-- validated records:
SELECT *
FROM the_persons
WHERE is_validated = True
;
-- some records with nasty cousins
SELECT *
FROM the_persons p
WHERE EXISTS (
SELECT *
FROM the_persons x
WHERE x.registrationnumber = p.registrationnumber
AND x.is_validated = False
)
AND last_name LIKE 'Smith%'
;

Related

update of full table fails on a specific column with "duplicate row error"

This works:
update table_name
set column1=trunc(column1,3);
This doesn't:
update table_name
set column2=trunc(column2,3);
Neither column is a unique or primary key.
The table structure is:
CREATE SET TABLE TABLE_NAME ,
NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
KEYCOL NUMBER,
COLUMN1 FLOAT,
COLUMN2 FLOAT
)
PRIMARY INDEX ( KEYCOL );
By default in Teradata session mode, tables that do not include the SET or MULTISET option will be created in SET mode. This means that Teradata will enforce duplicate row checks on the table in the absence of a UNIQUE constraint, such as a UNIQUE PRIMARY INDEX or UNIQUE SECONDARY INDEX.
In ANSI session mode, the default behavior is the opposite. Teradata will create a MULTISET table which does permit duplicate rows and eliminates the overhead of duplicate row checking when you have a non-unique primary index.
Can you provide the SHOW TABLE output to verify how the table is structured? Your DDL statement that you submitted leaves out some of the default options that are included in the table creation, including SET vs. MULTISET, FALLBACK vs. NO FALLBACK, etc.
edit
After reviewing your updated table definition, the SET option is the reason you are getting the error. This could be a good thing or bad thing depending on your intent and your tolerance for duplicate rows in your table.
To eliminate the error you either have to rebuild the table as a MULTISET table or reconsider the update you are applying and the consequences it has on the data in your table. As this is an unconstrained update, you may encounter additional records where this error would occur.
You could run a SELECT statement against the table to identify how many times your update would produce a duplicate row. You would probably have to group by every column in the table, replacing the column being updated with the function being applied, and keep only the groups HAVING COUNT(*) > 1.
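A sketch of that check against the table definition shown above (adjust the column list to match the real table):

SELECT KEYCOL,
       COLUMN1,
       TRUNC(COLUMN2, 3) AS NEW_COLUMN2,
       COUNT(*) AS DUP_ROWS
FROM   TABLE_NAME
GROUP  BY KEYCOL, COLUMN1, TRUNC(COLUMN2, 3)
HAVING COUNT(*) > 1;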

How to define unique indexes when softDeletes is enabled

Will Laravel 4.1 manage the creation of a unique index (where deleted_at = null) by itself when using softDeletes?
Is the approach below correct? Or is it going to mix in already deleted records?
Schema::create('example', function(Blueprint $table) {
$table->increments('id');
$table->integer('example')->unsigned()->unique(); //?????
$table->softDeletes();
});
The database is mysql, but if there are solutions specific to other DBs, you can provide them as well. However, it should be done within the Laravel framework! A uniform solution that works with all DBs that Laravel officially supports would be appreciated.
Update
It seems like this approach does not work, since it just ignores the softDeletes() option.
So, the proposed solution:
Schema::create('example', function(Blueprint $table) {
$table->increments('id');
$table->integer('example')->unsigned();
$table->softDeletes();
$table->unique(['example', 'deleted_at']);
});
The problem is that there can potentially be two identical timestamps in the deleted_at column.
What I actually need is a where-condition.
$table->unique('example', array('where', 'deleted_at', '=', null));
or
$table->integer('example')->unsigned()->unique()->where('deleted_at', '=', null)
I would recommend making a two-column UNIQUE constraint over the column you want to be unique (example) and a dummy column for instance called is_live. This column is always '1' when the row is not soft-deleted. When you soft-delete a row, set is_live=NULL.
The reason is related to the way "uniqueness" is defined. Unique constraints allow any number of rows that have a NULL value. This is because NULL is not equal to NULL in SQL, so two NULLs count as "not the same".
For multi-column unique keys, if any column is NULL, the whole set of columns in the unique key behaves as if it's not the same as any other row. Therefore you can have any number of rows that have one column of the unique key the same, as long as the other column in the unique key is NULL.
create table example (
id serial primary key,
example int unsigned not null,
is_live tinyint default 1,
unique key (example, is_live)
);
Demo: http://sqlfiddle.com/#!9/8d1e4d/1
PS: The direct answer to your question, about implementing a condition for indexes in MySQL, is that MySQL doesn't support this. Databases that do support partial indexes include:
PostgreSQL (https://www.postgresql.org/docs/current/static/indexes-partial.html)
Microsoft SQL Server (https://msdn.microsoft.com/en-us/library/cc280372.aspx)
Using $table->softDeletes() doesn't change how the Schema sets unique indexes. In your example only the example column will be unique.
If you want to have multiple columns in your unique index just run $table->unique(['column1', 'column2']).
To set a unique index you can either chain it, like $table->integer('example')->unique(), or declare it on its own line, as I wrote above.
I have the same problem. Using 2 columns as unique, after a soft delete I can't create another row with the same unique data.
What I want is to achieve this SQL using the Blueprint table object (not a RAW query):
CREATE UNIQUE INDEX test ON test_table USING btree (test_id, user_id) WHERE deleted_at IS NULL;
But the Blueprint and Fluent objects don't have any where method.

How to duplicate the amount of data in a PostgreSQL database?

In order to evaluate the load on our platform (django + postgresql) I would like to literally duplicate the amount of data in the system. It's a bit complicated to create mocks that could emulate the different kinds of objects (since we have a very complex data model).
Is there a way to create a duplicate of the database, override primary keys and unique fields with unused ones, and merge it with the original?
(I) Explaining the principle
In order to illustrate the principle in a clear way, this explanation assumes the following:
every table has a bigserial primary key column called "id"
No unique constraints on tables (except primary keys)
Foreign key constraints reference only primary keys of other tables
Apply following to your database schema:
Make sure there are no circular dependencies between tables in your schema. If there are, choose foreign key constraints that would break such a dependency and drop them (you will recreate them later, after you manually handle the affected fields).
Sort the tables in topological order and, in that order, execute the script from (3) for every table.
For every table <table_schema>.<table_name> from (2) execute:
/*
Creating a lookup table which contains ordered pairs (id_old, id_new).
For every existing row in table <table_schema>.<table_name>,
a new row with id = id_new will be created, with all the other fields copied. nextval of sequence <table_schema>.<table_name>_id_seq is fetched to reserve an id for the new row.
*/
CREATE TABLE _l_<table_schema>_<table_name> AS
SELECT id as id_old, nextval('<table_schema>.<table_name>_id_seq') as id_new
FROM <table_schema>.<table_name>;
/*
This part is for actual copying of table data with preserving of referential integrity.
Table <table_schema>.<table_name> has the following fields:
id - primary key
column1, ..., columnN - fields in a table excluding the foreign keys; N>=0;
fk1, ..., fkM - foreign keys; M>=0;
_l_<table_schema_fki>_<table_name_fki> (1 <= i <= M) - lookup tables of parent tables. We use LEFT JOIN because foreign key field could be nullable in general case.
*/
INSERT INTO <table_schema>.<table_name> (id, column1, ... , columnN, fk1, ..., fkM)
SELECT tlookup.id_new, t.column1, ... , t.columnN, tablefk1.id_new, ..., tablefkM.id_new
FROM <table_schema>.<table_name> t
INNER JOIN _l_<table_schema>_<table_name> tlookup ON t.id = tlookup.id_old
LEFT JOIN _l_<table_schema_fk1>_<table_name_fk1> tablefk1 ON t.fk1 = tablefk1.id_old
...
LEFT JOIN _l_<table_schema_fkM>_<table_name_fkM> tablefkM ON t.fkM = tablefkM.id_old;
Drop all lookup tables.
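To make the template concrete, here is the pattern instantiated for two hypothetical tables, public.customer and public.invoice, where invoice.customer_id references customer.id (all names are illustrative):

-- lookup tables: reserve a new id for every existing row
CREATE TABLE _l_public_customer AS
SELECT id AS id_old, nextval('public.customer_id_seq') AS id_new
FROM public.customer;

CREATE TABLE _l_public_invoice AS
SELECT id AS id_old, nextval('public.invoice_id_seq') AS id_new
FROM public.invoice;

-- copy parents first (topological order), then children, remapping the FK via the lookup
INSERT INTO public.customer (id, name, email)
SELECT l.id_new, c.name, c.email
FROM public.customer c
INNER JOIN _l_public_customer l ON c.id = l.id_old;

INSERT INTO public.invoice (id, amount, customer_id)
SELECT l.id_new, i.amount, lc.id_new
FROM public.invoice i
INNER JOIN _l_public_invoice l ON i.id = l.id_old
LEFT JOIN _l_public_customer lc ON i.customer_id = lc.id_old;

-- finally, drop the lookup tables
DROP TABLE _l_public_customer;
DROP TABLE _l_public_invoice;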
(II) Describing my implementation
To check for circular dependencies, I queried the transitive closures (https://beagle.whoi.edu/redmine/projects/ibt/wiki/Transitive_closure_in_PostgreSQL)
I implemented a topological sort function (ported from T-SQL from some blog). It comes in handy for automation.
I made a code generator (implemented in plpgsql). It's a function which takes <table_schema> and <table_name> as input params and returns text (SQL) shown in (I.2) for that table. By concatenating results of the function for every table in topological order, I produced the copy script.
I made manual changes to the script to satisfy unique constraints and other nuances, which boilerplate script doesn't cover.
Done. Script ready for execution in one transaction.
When I get the chance, I will "anonymize" my code a little bit, put it on GitHub and add a link here.

How can I make certain Oracle table rows marked as 'historical' invisible/unavailable?

I have a huge existing Order Management Application.
Now, in the main ORDER table, I am adding a new column: IS_HISTORICAL. If its value is TRUE, the order is now historical and should not show up in the application.
Now, I have to modify many SQL queries in my existing application so that they select only those orders whose IS_HISTORICAL is 'FALSE' - i.e. add the following to the WHERE clause:
AND IS_HISTORICAL='FALSE'
Question: *Is there an easier way - so that I do not have to modify so many application queries (to hide away historical orders)?
Essentially all ORDERS marked as IS_HISTORICAL='TRUE' should become invisible/unavailable for reads/updates!!*
Note: Right now the table sizes are not very huge, but ultimately I intend to partition the table by IS_HISTORICAL true/false.
If you're only going to use the historical data for analysis then I prefer Florin's solution, as the amount of data you need to look at for each query remains smaller. It makes the analysis queries more difficult as you need to UNION ALL, but everything else will run "quicker" (it may not be noticeable).
If some applications/users require access to the historical data the better solution would be to rename your table and create a view on top of it with the query that you need.
The problem with re-writing all your queries is that you're going to forget one or get one wrong, either now or in the future. A view removes that problem for you: the query is static, and every time you query the view the additional conditions you require are automatically added.
Something like:
rename orders to order_history;
create or replace view orders as
select *
from order_history
where is_historical = 'FALSE';
Two further points.
I wouldn't bother with TRUE / FALSE; if the table gets large it's a lot of additional data to scan. Create your column as a VARCHAR2(1) and use T / F or Y / N - they are just as immediately obvious but smaller. Alternatively use a NUMBER(1,0) and 1 / 0.
Don't forget to put a constraint on your table so that the IS_HISTORICAL column can only have the values you've chosen.
If you're only ever going to have the two values then you may want to consider a CHECK CONSTRAINT:
alter table order_history
add constraint chk_order_history_historical
check ( is_historical in ('T','F') );
Otherwise - and maybe you should do this anyway - use a FOREIGN KEY CONSTRAINT. Define an extra table, ORDER_HISTORY_TYPES:
create table order_history_types (
id varchar2(1)
, description varchar2(4000)
, constraint pk_order_history_types primary key (id)
);
Fill it with your values and then add the foreign key:
alter table order_history
add constraint fk_order_history_historical
foreign key (is_historical)
references order_history_types (id)
You could look into using Virtual Private Database/row-level security. This can be used to automatically add the is_historical = 'FALSE' predicate when certain conditions are met (e.g. you're connected as the application user).
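A rough sketch of what that could look like with DBMS_RLS (the schema, user and policy names here are made up for illustration):

-- policy function: returns the predicate to append for the application user
CREATE OR REPLACE FUNCTION hide_historical_orders (
  p_schema IN VARCHAR2,
  p_object IN VARCHAR2
) RETURN VARCHAR2
AS
BEGIN
  IF SYS_CONTEXT('USERENV', 'SESSION_USER') = 'APP_USER' THEN
    RETURN 'is_historical = ''FALSE''';
  END IF;
  RETURN NULL;  -- everyone else sees all rows
END;
/

BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema   => 'ORDER_OWNER',
    object_name     => 'ORDERS',
    policy_name     => 'ORDERS_HIDE_HISTORICAL',
    function_schema => 'ORDER_OWNER',
    policy_function => 'HIDE_HISTORICAL_ORDERS',
    statement_types => 'SELECT,UPDATE'
  );
END;
/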
If the users only need the non-historical records, an option is to create an ORDER_HIST table and move the historical records there (delete and insert).
If some users/applications need both types of records, then the partition approach is the best.
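A minimal sketch of that move, assuming the main table is called ORDERS and ORDER_HIST has the same structure:

INSERT INTO order_hist SELECT * FROM orders WHERE is_historical = 'TRUE';
DELETE FROM orders WHERE is_historical = 'TRUE';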

Linked List in SQL

What's the best way to store a linked list in a MySQL database so that inserts are simple (i.e. you don't have to re-index a bunch of stuff every time) and such that the list can easily be pulled out in order?
Using Adrian's solution, but instead of incrementing by 1, increment by 10 or even 100. Then an inserted item's position can be calculated as halfway between the positions of its two neighbours, without having to update everything below the insertion point. Pick a number large enough to handle your average number of insertions - if it's too small then you'll have to fall back to updating all rows with a higher position during an insertion.
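A quick illustration of the gap arithmetic, assuming a linked_list table with my_value and position columns (like the position-column answer further down) and a spacing of 100; the values are made up:

-- neighbours sit at positions 100 and 200; the new item goes at (100 + 200) / 2 = 150
INSERT INTO linked_list (my_value, position) VALUES ('between item', 150);

-- only when a gap is exhausted (e.g. neighbours at 150 and 151) do you fall back to
-- shifting the rows after the insertion point
UPDATE linked_list SET position = position + 100 WHERE position > 150;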
create a table with two self referencing columns PreviousID and NextID. If the item is the first thing in the list PreviousID will be null, if it is the last, NextID will be null. The SQL will look something like this:
create table tblDummy
(
PKColumn int not null,
PreviousID int null,
DataColumn1 varchar(50) not null,
DataColumn2 varchar(50) not null,
DataColumn3 varchar(50) not null,
DataColumn4 varchar(50) not null,
DataColumn5 varchar(50) not null,
DataColumn6 varchar(50) not null,
DataColumn7 varchar(50) not null,
NextID int null
)
Store an integer column in your table called 'position'. Record a 0 for the first item in your list, a 1 for the second item, etc. Index that column in your database, and when you want to pull your values out, sort by that column.
alter table linked_list add column position integer not null default 0;
alter table linked_list add index position_index (position);
select * from linked_list order by position;
To insert a value at index 3, modify the positions of rows 3 and above, and then insert:
update linked_list set position = position + 1 where position >= 3;
insert into linked_list (my_value, position) values ("new value", 3);
A linked list can be stored using recursive pointers in the table. This is very much the same way hierarchies are stored in SQL, using the recursive association pattern.
You can learn more about it here (Wayback Machine link).
I hope this helps.
The simplest option would be creating a table with a row per list item, a column for the item position, and columns for other data in the item. Then you can use ORDER BY on the position column to retrieve in the desired order.
create table linked_list
( list_id integer not null
, position integer not null
, data varchar(100) not null
);
alter table linked_list add primary key ( list_id, position );
To manipulate the list just update the position and then insert/delete records as needed. So to insert an item into list 1 at index 3:
begin transaction;
update linked_list set position = position + 1 where position >= 3 and list_id = 1;
insert into linked_list (list_id, position, data)
values (1, 3, "some data");
commit;
Since operations on the list can require multiple commands (eg an insert will require an INSERT and an UPDATE), ensure you always perform the commands within a transaction.
A variation of this simple option is to have position incrementing by some factor for each item, say 100, so that when you perform an INSERT you don't always need to renumber the position of the following elements. However, this requires a little more effort to work out when to increment the following elements, so you lose simplicity but gain performance if you will have many inserts.
Depending on your requirements other options might appeal, such as:
If you want to perform lots of manipulations on the list and not many retrievals, you may prefer to have an ID column pointing to the next item in the list instead of a position column. Then you need iterative logic in the retrieval of the list in order to get the items in order. This can be relatively easily implemented in a stored proc (or a recursive query; see the sketch after this list).
If you have many lists, a quick way to serialise and deserialise your list to text/binary, and you only ever want to store and retrieve the entire list, then store the entire list as a single value in a single column. Probably not what you're asking for here though.
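A sketch of that pointer-based retrieval as a recursive query instead of a stored proc (hypothetical table linked_items with columns id, data and next_id; needs MySQL 8.0+ or any database with recursive CTEs):

WITH RECURSIVE ordered_items AS (
  -- head of the list: the row no other row points to
  SELECT id, data, next_id, 1 AS position
  FROM linked_items
  WHERE id NOT IN (SELECT next_id FROM linked_items WHERE next_id IS NOT NULL)
  UNION ALL
  -- follow the next_id pointers
  SELECT li.id, li.data, li.next_id, oi.position + 1
  FROM linked_items li
  JOIN ordered_items oi ON li.id = oi.next_id
)
SELECT id, data
FROM ordered_items
ORDER BY position;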
This is something I've been trying to figure out for a while myself. The best way I've found so far is to create a single table for the linked list using the following format (this is pseudo code):
LinkedList(
key1,
information,
key2
)
key1 is the starting point. key2 is a foreign key linking to the key1 of the next row. So your rows will link something like this:
row 1
key1 = 0,
information = 'hello'
key2 = 1
key1 is the primary key of row 1. key2 is a foreign key leading to the key1 of row 2.
row 2
key1 = 1,
information = 'wassup'
key2 = null
key2 of row 2 is set to null because it doesn't point to anything.
When you first enter a row into the table, you'll need to make sure key2 is set to null or you'll get an error. After you enter the second row, you can go back and set key2 of the first row to the primary key of the second row.
This makes it best to enter many entries at a time, then go back and set the foreign keys accordingly (or build a GUI that just does that for you).
Here's some actual code I've prepared (all actual code worked on MSSQL. You may want to do some research for the version of SQL you are using!):
createtable.sql
create table linkedlist00 (
key1 int primary key not null identity(1,1),
info varchar(10),
key2 int
)
register_foreign_key.sql
alter table dbo.linkedlist00
add foreign key (key2) references dbo.linkedlist00(key1)
*I put them into two separate files, because it has to be done in two steps. MSSQL won't let you do it in one step, because the table doesn't exist yet for the foreign key to reference.
A linked list is especially powerful in one-to-many relationships. So if you've ever wanted to make an array of foreign keys - well, this is one way to do it! You can make a primary table that points to the first row in the linked-list table, and then instead of the "information" field you can use a foreign key to the desired information table.
Example:
Let's say you have a Bureaucracy that keeps forms.
Let's say they have a table called file cabinet
FileCabinet(
Cabinet ID (pk)
Files ID (fk)
)
each row contains a primary key for the cabinet and a foreign key for the files. These files could be tax forms, health insurance papers, field trip permission slips, etc.
Files(
Files ID (pk)
File ID (fk)
Next File ID (fk)
)
this serves as a container for the Files
File(
File ID (pk)
Information on the file
)
this is the specific file
There may be better ways to do this - and there are, depending on your specific needs. The example just illustrates possible usage.
There are a few approaches I can think of right off, each with differing levels of complexity and flexibility. I'm assuming your goal is to preserve an order in retrieval, rather than requiring storage as an actual linked list.
The simplest method would be to assign an ordinal value to each record in the table (e.g. 1, 2, 3, ...). Then, when you retrieve the records, specify an order-by on the ordinal column to get them back in order.
This approach also allows you to retrieve the records without regard to membership in a list, but allows for membership in only one list, and may require an additional "list id" column to indicate to which list the record belongs.
A slightly more elaborate, but also more flexible, approach would be to store information about membership in a list or lists in a separate table. The table would need 3 columns: the list id, the ordinal value, and a foreign key pointer to the data record. Under this approach, the underlying data knows nothing about its membership in lists, and can easily be included in multiple lists.
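A minimal sketch of that membership table (the items table and its columns are placeholders):

CREATE TABLE list_membership (
  list_id int not null,
  ordinal int not null,
  item_id int not null,
  PRIMARY KEY (list_id, ordinal),
  FOREIGN KEY (item_id) REFERENCES items (id)
);

-- retrieve list 1 in order; the same item_id may appear in any number of lists
SELECT i.*
FROM list_membership m
JOIN items i ON i.id = m.item_id
WHERE m.list_id = 1
ORDER BY m.ordinal;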
This post is old but I'm still going to give my $.02. Updating every record in a table or record set sounds like a crazy way to solve ordering; the amount of indexing also seems crazy, but it sounds like most have accepted it.
The solution I came up with to reduce updates and indexing is to create two tables (and in most use cases you don't sort all records in just one table anyway): table A to hold the records of the list being sorted, and table B to group them and hold a record of the order as a string. The order string represents an array that can be used to order the selected records either on the web server or in the browser layer of a web application.
Create Table A (
Id int primary key identity(1,1),
Data varchar(10) not null,
B_Id int
)
Create Table B (
Id int primary key Identity(1,1),
GroupName varchar(10) not null,
[Order] varchar(max) null
)
The format of the order string should be id, position, and some separator to split() your string by. In the case of jQuery UI, the .sortable('serialize') function outputs a POST-friendly order string for you that includes the id and position of each record in the list.
The real magic is the way you choose to reorder the selected list using the saved ordering string; this will depend on the application you are building. Here is an example, again from jQuery, to reorder the list of items: http://ovisdevelopment.com/oramincite/?p=155
https://dba.stackexchange.com/questions/46238/linked-list-in-sql-and-trees suggests the trick of using a floating-point position column for fast inserts and ordering.
It also mentions the specialized SQL Server hierarchyid feature.
I think it's much simpler to add a created column of datetime type and a position column of int; now you can have duplicate positions. In the SELECT statement use ORDER BY position, created DESC and your list will be fetched in order.
Increment the SERIAL 'index' by 100, but manually add intermediate values with an 'index' equal to (Prev + Next) / 2. If you ever saturate a gap of 100, renumber the indexes back to multiples of 100.
This should maintain the sequence alongside the primary index.
A list can be stored by having a column contain the offset (list index position) -- an insert in the middle then means incrementing the offset of everything above the new item and then doing the insert.
You could implement it like a double-ended queue (deque) to support fast push/pop/delete (if the ordinal is known). For retrieval you would have two data structures: one with the actual data and another with the number of elements added over the history of the key. Tradeoff: this method would be slower for any insert into the middle of the linked list, O(n).
create table queue (
primary_key,
queue_key,
ordinal,
data
)
You would have an index on queue_key+ordinal
You would also have another table which stores the number of rows EVER added to the queue...
create table queue_addcount (
primary_key,
add_count
)
When pushing a new item to either end of the queue (left or right) you would always increment the add_count.
If you push to the back you could set the ordinal...
ordinal = add_count + 1
If you push to the front you could set the ordinal...
ordinal = -(add_count + 1)
update
add_count = add_count + 1
This way you can delete anywhere in the queue/list and it would still return in order and you could also continue to push new items maintaining the order.
You could optionally rewrite the ordinal to avoid overflow if a lot of deletes have occurred.
You could also have an index on the ordinal to support fast ordered retrieval of the list.
If you want to support inserts into the middle you would need to find the ordinal at which it needs to be inserted, then insert with that ordinal. Then increment every ordinal following that insertion point by one. Also, increment the add_count as usual. If the ordinal is negative you could instead decrement all of the earlier ordinals to do fewer updates. This would be O(n).
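A sketch of the push-to-back described above (assuming queue_addcount is keyed by the same queue_key value and queue.primary_key is auto-generated):

BEGIN;
-- bump the lifetime counter for queue 42
UPDATE queue_addcount
SET add_count = add_count + 1
WHERE primary_key = 42;

-- the new tail item takes the bumped counter as its ordinal
INSERT INTO queue (queue_key, ordinal, data)
SELECT 42, add_count, 'new tail item'
FROM queue_addcount
WHERE primary_key = 42;
COMMIT;

-- pushing to the front is the same, but with ordinal = -add_count
-- ordered retrieval:
SELECT data FROM queue WHERE queue_key = 42 ORDER BY ordinal;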