How to duplicate the amount of data in a PostgreSQL database? - sql

In order to evaluate the load on our platform (Django + PostgreSQL) I would like to literally duplicate the amount of data in the system. It's a bit complicated to create mocks that could emulate the different kinds of objects (since we have a very complex data model).
Is there a way to create a duplicate of the database, override primary keys and unique fields with unused values, and merge it with the original?

(I) Explaining the principle
In order to illustrate the principle in a clear way, this explanation assumes the following:
Every table has a bigserial primary key column called "id"
No unique constraints on tables (except primary keys)
Foreign key constraints reference only primary keys of other tables
Apply the following to your database schema:
1. Make sure there are no circular dependencies between tables in your schema. If there are, choose the foreign key constraints that would break such a dependency and drop them (you will recreate them later, after you manually handle the affected fields).
2. Sort the tables in topological order and, in that order, for every table execute the script from step 3.
3. For every table <table_schema>.<table_name>, in the order from step 2, execute:
/*
Creating a lookup table which contains ordered pairs (id_old, id_new).
For every existing row in table <table_schema>.<table_name>,
a new row with id = id_new will be created, with all the other fields copied. nextval of sequence <table_schema>.<table_name>_id_seq is fetched to reserve an id for the new row.
*/
CREATE TABLE _l_<table_schema>_<table_name> AS
SELECT id as id_old, nextval('<table_schema>.<table_name>_id_seq') as id_new
FROM <table_schema>.<table_name>;
/*
This part does the actual copying of the table data while preserving referential integrity.
Table <table_schema>.<table_name> has the following fields:
id - primary key
column1, ..., columnN - fields in a table excluding the foreign keys; N>=0;
fk1, ..., fkM - foreign keys; M>=0;
_l_<table_schema_fki>_<table_name_fki> (1 <= i <= M) - lookup tables of the parent tables. We use LEFT JOIN because a foreign key field can be nullable in the general case.
*/
INSERT INTO <table_schema>.<table_name> (id, column1, ... , columnN, fk1, ..., fkM)
SELECT tlookup.id_new, t.column1, ... , t.columnN, tablefk1.id_new, ..., tablefkM.id_new
FROM <table_schema>.<table_name> t
INNER JOIN _l_<table_schema>_<table_name> tlookup ON t.id = tlookup.id_old
LEFT JOIN _l_<table_schema_fk1>_<table_name_fk1> tablefk1 ON t.fk1 = tablefk1.id_old
...
LEFT JOIN _l_<table_schema_fkM>_<table_name_fkM> tablefkM ON t.fkM = tablefkM.id_old;
4. Drop all the lookup tables. (A worked example for a small hypothetical schema follows below.)
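To make the template concrete, here is roughly what the generated script from step 3 could look like for a hypothetical two-table schema: public.author (no foreign keys) and public.book with a nullable author_id referencing public.author. Names are invented for illustration only.
-- parent table first (topological order)
CREATE TABLE _l_public_author AS
SELECT id AS id_old, nextval('public.author_id_seq') AS id_new
FROM public.author;
INSERT INTO public.author (id, name)
SELECT tlookup.id_new, t.name
FROM public.author t
INNER JOIN _l_public_author tlookup ON t.id = tlookup.id_old;
-- child table second
CREATE TABLE _l_public_book AS
SELECT id AS id_old, nextval('public.book_id_seq') AS id_new
FROM public.book;
INSERT INTO public.book (id, title, author_id)
SELECT tlookup.id_new, t.title, tablefk1.id_new
FROM public.book t
INNER JOIN _l_public_book tlookup ON t.id = tlookup.id_old
LEFT JOIN _l_public_author tablefk1 ON t.author_id = tablefk1.id_old;
-- step 4: clean up
DROP TABLE _l_public_author;
DROP TABLE _l_public_book;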
(II) Describing my implementation
To check for circular dependencies, I queried the transitive closures (https://beagle.whoi.edu/redmine/projects/ibt/wiki/Transitive_closure_in_PostgreSQL)
I implemented a topological sort function (ported from T-SQL found on some blog). It comes in handy for automation. A plain-SQL sketch of one way to compute such an ordering is shown below.
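This is not my actual function, just a minimal sketch of how a dependency ordering can be derived from the catalogs in plain SQL, assuming a single schema named public and no remaining circular dependencies:
-- order tables so that referenced (parent) tables come before referencing ones
WITH RECURSIVE fk AS (
    SELECT c.conrelid AS child, c.confrelid AS parent
    FROM pg_constraint c
    WHERE c.contype = 'f'
      AND c.conrelid <> c.confrelid            -- ignore self-references
),
depth AS (
    -- tables that reference nothing start at level 0
    SELECT t.oid AS tbl, 0 AS lvl
    FROM pg_class t
    JOIN pg_namespace n ON n.oid = t.relnamespace
    WHERE t.relkind = 'r'
      AND n.nspname = 'public'                 -- assumption: everything lives in "public"
      AND NOT EXISTS (SELECT 1 FROM fk WHERE fk.child = t.oid)
    UNION ALL
    -- a referencing table sits one level below its deepest parent
    SELECT fk.child, d.lvl + 1
    FROM depth d
    JOIN fk ON fk.parent = d.tbl
)
SELECT tbl::regclass AS table_name, max(lvl) AS topo_level
FROM depth
GROUP BY tbl
ORDER BY topo_level;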
I made a code generator (implemented in PL/pgSQL). It's a function which takes <table_schema> and <table_name> as input parameters and returns the text (SQL) shown in (I.3) for that table. By concatenating the results of the function for every table in topological order, I produced the copy script.
I made manual changes to the script to satisfy unique constraints and other nuances which the boilerplate script doesn't cover.
Done. Script ready for execution in one transaction.
When I get the chance, I will "anonymize" my code a little bit, put it on GitHub and add a link here.

Related

Best practice to enforce uniqueness on column but allow some duplicates?

Here is what I am trying to figure out: there should be a table to store authorizations for our new client management system, and every authorization has its own unique identifier. This constraint would be pretty easy to translate to SQL, but unfortunately, because of the slowness of bureaucracy, sometimes we need to create an entry with a placeholder ID (e.g., "temp") so that the client can start taking services.
What would be the best practice to enforce this conditional uniqueness constraint?
These are the options I could come up with given my limited experience:
Use partial indexing mentioned in the PostgreSQL manual (5.3.3. -> Example 11-3.). It also mentions that "This is a particularly efficient approach when there are few successful tests and many unsuccessful ones." In our legacy DB that will be migrated, there are 130,000 rows and about 5 temp authorizations a month, but the whole table only grows by about 200 rows per year. Would this be the right approach? (I am also not sure what "efficient" means in this context.)
Create a separate table for the temp authorizations but then it would duplicate the table structure.
Define a unique constraint for a group of columns. An authorization is for a specific service for a certain time period issued to an individual.
EDIT:
I'm sorry I think my description of the authorization ID was a bit obscure: it is provided by a state department with the format of NMED012345678 and it is entered by hand. It is unique, but sometimes only provided at a later time for unknown reasons.
There is a simple, fast and secure way:
Add a boolean column to mark temporary entries which is NULL by default, say:
temp bool DEFAULT NULL CHECK (temp)
The added check constraint disallows FALSE, only NULL or TRUE are possible. Storage cost for the default NULL value is typically ... nothing - unless there are no other NULL values in the row.
How much disk-space is needed to store a NULL value using postgresql DB?
The column default means you don't normally have to take care of the column. It's NULL by default (which is the default default anyway, I'm just being explicit here). You only need to mark the few exceptions explicitly.
Then create a partial unique index like:
CREATE UNIQUE INDEX tbl_unique_id_uni ON tbl (unique_id) WHERE temp IS NULL;
That only includes rows supposed to be unique. Index size is not increased at all.
Be sure to add the predicate WHERE temp IS NULL to queries that are supposed to use the unique index.
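For illustration, a minimal end-to-end sketch of this approach; the table and column names are hypothetical:
CREATE TABLE authorization_tbl (
    auth_id   bigserial PRIMARY KEY
  , unique_id text NOT NULL                     -- e.g. 'NMED012345678'
  , temp      bool DEFAULT NULL CHECK (temp)    -- NULL = real, TRUE = placeholder
);
CREATE UNIQUE INDEX auth_unique_id_uni ON authorization_tbl (unique_id)
WHERE temp IS NULL;
-- real IDs must be unique ...
INSERT INTO authorization_tbl (unique_id) VALUES ('NMED012345678');
-- ... but any number of placeholders is allowed:
INSERT INTO authorization_tbl (unique_id, temp) VALUES ('temp', true), ('temp', true);
-- queries that should use the partial unique index repeat its predicate:
SELECT * FROM authorization_tbl WHERE unique_id = 'NMED012345678' AND temp IS NULL;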
Related:
Create unique constraint with null columns
There are several possibilities:
Make the temp identifiers unique; for instance, if they are automatically created (not entered manually) make them:
CREATE SEQUENCE temp_ids_seq; -- done only once for the database
Whenever you need a new temporary id, issue
'temp' || nextval('temp_ids_seq') AS id
Use a partial index, assuming that the only value allowed to repeat is temp:
CREATE UNIQUE INDEX tbl_unique_idx ON tbl (id) WHERE (id IS DISTINCT FROM 'temp')
For the sake of efficiency, you probably would like to have, in those cases, also the complementary index:
CREATE INDEX tbl_temp_idx ON tbl (id) WHERE (id IS NOT DISTINCT FROM 'temp')
This last index will help queries seeking id = 'temp'.
This is a bit long for a comment.
I think I would have an authorization table with a unique authorization ID. The authorization could then have two types: "approved" and "temporary". You could handle this with two columns.
However, I would probably have the authorization id as a serial column with the "approved" id being a field in the table. That table could have a unique constraint on it. You can use either a full unique constraint or a unique constraint with filtered values (Postgres allows multiple NULL values in a unique constraint, but the second is more explicit).
You can have the same process for the temporary authorizations -- using a different column. Presumably you have some mechanism for authorizing them and storing the approval date, time, and person.
I would not use two tables. Having authorizations spread among multiple tables just seems likely to sow confusion. Anywhere in the code where you want to see who has an authorization is a potential for mis-reading the data.
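A hedged sketch of the two-column idea described above (all names are illustrative, not a prescribed schema):
CREATE TABLE authorizations (
    authorization_id serial PRIMARY KEY    -- internal surrogate key
  , approved_id      text                  -- state-issued ID, NULL until provided
  , temp_id          text                  -- internal placeholder, NULL once approved
  , approved_at      timestamptz
  , approved_by      text
);
-- Postgres allows multiple NULLs in a unique index, so this only constrains
-- rows that actually have an approved ID:
CREATE UNIQUE INDEX authorizations_approved_id_key
    ON authorizations (approved_id) WHERE approved_id IS NOT NULL;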
IMO it is not advisable to use remote keys as (part of) primary keys:
they are not under your control; they can change
you cannot guarantee correctness and/or uniqueness (email addresses, telephone numbers, licence numbers, serial numbers)
using them as a PK would cause them to be used as FKs from other tables into this table, with fat indexes and lots of cascading on change.
\i tmp.sql
CREATE TABLE the_persons
( seq SERIAL NOT NULL PRIMARY KEY -- surrogate key
, registrationnumber varchar -- "remote" KEY, not necessarily UNIQUE
, is_validated BOOLEAN NOT NULL DEFAULT FALSE
, last_name varchar
, dob DATE
);
CREATE INDEX name_dob_idx ON the_persons(last_name, dob)
;
CREATE UNIQUE INDEX registrationnumber_idx ON the_persons(registrationnumber,seq)
-- WHERE is_validated = False
;
CREATE UNIQUE INDEX registrationnumber_key ON the_persons(registrationnumber)
WHERE is_validated = True
;
INSERT INTO the_persons(is_validated,registrationnumber,last_name, dob)VALUES
( True, 'OKAY001', 'Smith', '1988-02-02')
,( True, 'OKAY002', 'Jones', '1988-02-02')
,( False, 'OKAY001', 'Smith', '1988-02-02')
,( False, 'OMG001', 'Smith', '1988-08-02')
;
-- validated records:
SELECT *
FROM the_persons
WHERE is_validated = True
;
-- some records with nasty cousins
SELECT *
FROM the_persons p
WHERE EXISTS (
SELECT *
FROM the_persons x
WHERE x.registrationnumber = p.registrationnumber
AND x.is_validated = False
)
AND last_name LIKE 'Smith%'
;

Strategies to store extra information about models without too many column names (alternatives to DB normalization and model subclassing)

Say you had a Model called Forest. Each object represents a forest on your continent. There is a set of data that is common to all these forests, like forest type, area etc., and these can be easily represented by columns on the SQL table, forest.
However, imagine that these forests had additional data about them that might not always be repeatable. For example, the 20 coniferous forests have a pine-fir split ratio number, whereas the deciduous forests have an autumn-duration number. One way would be to store all these columns on the main table itself, but there would be too many columns on each row, with many columns remaining unfilled by definition.
The most obvious way around this is to make sub-classes of the Forest model and have separate table for each subclass. I feel that's a heavy handed approach that I would rather not follow. If I need some data about the generic forest I'll have to consult another table.
Is there a pattern to solve this problem? What solution do you usually prefer?
NOTE: I have seen the other questions about this. The solutions proposed were:
Subtyping, same as I proposed above.
Have all the columns on the same table.
Have separate tables for each kind of forest, with duplicated data like area and rainfall... duplicated.
Is there an inventive solution that I don't know of?
UPDATE: I have run into the EAV model, and also a modified version where the unpredictable fields are stored out in a NoSQL/JSON store, and the id for that is held in the RDB. I like both, but welcome suggestions in this direction.
On the database side, the best approach is often to store attributes common to all forests in one table, and to store unique attributes in other tables. Build updatable views for clients to use.
create table forests (
forest_id integer primary key,
-- Assumes forest names are not unique on a continent.
forest_name varchar(45) not null,
forest_type char(1) not null
check (forest_type in ('c', 'd')),
area_sq_km integer not null
check (area_sq_km > 0),
-- Other columns common to all forests go here.
--
-- This constraint lets foreign keys target the pair
-- of columns, guaranteeing that a row in each subtype
-- table references a row here having the same subtype.
unique (forest_id, forest_type)
);
create table coniferous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'c'
check (forest_type = 'c'),
pine_fir_ratio float not null
check (pine_fir_ratio >= 0),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
create table deciduous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'd'
check (forest_type = 'd'),
autumn_duration_days integer not null
check (autumn_duration_days between 20 and 100),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
Clients usually use updatable views, one for each subtype, instead of using the base tables. (You can revoke privileges on the base subtype tables to guarantee this.) You might want to omit the "forest_type" column.
create view coniferous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.pine_fir_ratio
from forests t1
inner join coniferous_forests_subtype t2
on t1.forest_id = t2.forest_id;
create view deciduous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.autumn_duration_days
from forests t1
inner join deciduous_forests_subtype t2
on t1.forest_id = t2.forest_id;
What you have to do to make these views updatable varies a little with the dbms, but expect to write some triggers (not shown). You'll need triggers to handle all the DML actions--insert, update, and delete.
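A minimal PostgreSQL-flavoured sketch of the insert case only (update and delete would be handled the same way), assuming the coniferous_forests view is extended to also expose forest_name, since the forests table requires it:
CREATE FUNCTION coniferous_forests_ins() RETURNS trigger AS $$
BEGIN
    -- write the common attributes to the supertype table ...
    INSERT INTO forests (forest_id, forest_name, forest_type, area_sq_km)
    VALUES (NEW.forest_id, NEW.forest_name, 'c', NEW.area_sq_km);
    -- ... and the subtype-specific attribute to the subtype table
    INSERT INTO coniferous_forests_subtype (forest_id, forest_type, pine_fir_ratio)
    VALUES (NEW.forest_id, 'c', NEW.pine_fir_ratio);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER coniferous_forests_ins
INSTEAD OF INSERT ON coniferous_forests
FOR EACH ROW EXECUTE FUNCTION coniferous_forests_ins();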
If you need to report only on columns that appear in "forests", then just query the table "forests".
Well, the easiest way is putting all the columns into one table and then having a "type" field to decide which columns to use. This works for smaller tables, but for more complicated cases it can lead to a big messy table and issues with database constraints (such as NULLs).
My preferred method would be something like this:
A generic "Forests" table with: id, type, [generic_columns, ...]
"Coniferous_Forests" table with: id, forest_id (FK to Forests), ...
So, in order to get all the data for a Coniferous Forest with id of 1, you'd have a query like so:
SELECT * FROM Coniferous_Forests
INNER JOIN Forests ON Coniferous_Forests.forest_id = Forests.id
WHERE Coniferous_Forests.id = 1
As for inventive solutions, there is such a thing as an OODBMS (Object Oriented Database Management System).
The most popular alternative to Relational SQL databases are Document-Oriented NoSQL databases like MongoDB. This is comparable to using JSON objects to store your data, and allows you to be more flexible with your database fields.
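For the "unpredictable fields in a JSON store" variant mentioned in the question's update, a hedged PostgreSQL-specific sketch (table and key names are made up; a jsonb column plays the role of the document store):
CREATE TABLE forests_flex (
    forest_id   integer PRIMARY KEY
  , forest_name varchar(45) NOT NULL
  , forest_type char(1) NOT NULL CHECK (forest_type IN ('c', 'd'))
  , area_sq_km  integer NOT NULL CHECK (area_sq_km > 0)
  , extra       jsonb NOT NULL DEFAULT '{}'    -- type-specific attributes
);
INSERT INTO forests_flex VALUES
  (1, 'Black Forest', 'c', 6009, '{"pine_fir_ratio": 0.8}'),
  (2, 'Sherwood',     'd',  423, '{"autumn_duration_days": 45}');
-- type-specific attributes can still be queried (and indexed with GIN):
SELECT forest_name, (extra->>'pine_fir_ratio')::float AS pine_fir_ratio
FROM forests_flex
WHERE forest_type = 'c';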

SQL Server foreign key to multiple tables

I have the following database schema:
members_company1(id, name, ...);
members_company2(id, name, ...);
profiles(memberid, membertypeid, ...);
membertypes(id, name, ...)
[
{ id : 1, name : 'company1', ... },
{ id : 2, name : 'company2', ... }
];
So each profile belongs to a certain member from either company1 or company2, depending on the membertypeid value:
members_company1                        members_company2
----------------                        ----------------
id -------------------+    +----------- id
name                  |    |            name
                      v    v
                   profiles
                   --------
                   memberid       (points at one of the two id columns)
                   membertypeid   (1 = company1, 2 = company2)
I am wondering if it's possible to create a foreign key in the profiles table for referential integrity, based on the memberid and membertypeid pair, to reference either members_company1 or members_company2 table records?
A foreign key can only reference one table, as stated in the documentation (emphasis mine):
A foreign key (FK) is a column or combination of columns that is used
to establish and enforce a link between the data in two tables.
But if you want to start cleaning things up you could create a members table as #KevinCrowell suggested, populate it from the two members_company tables and replace them with views. You can use INSTEAD OF triggers on the views to 'redirect' updates to the new table. This is still some work, but it would be one way to fix your data model without breaking existing applications (if it's feasible in your situation, of course)
Operating under the fact that you can't change the table structure:
Option 1
How important is referential integrity to you? Are you only doing inner joins between these tables? If you don't have to worry too much about it, then don't worry about it.
Option 2
Ok, you probably have to do something about this. Maybe you do have inner joins only, but you have to deal with data in profiles that doesn't relate to anything in the members tables. Could you create a job that runs once per day or week to clean it out?
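A hedged sketch of such a cleanup job, assuming the membertypeid values 1 and 2 from the question (T-SQL, names as in the question's schema):
-- delete profiles whose member no longer exists in the corresponding table
DELETE p
FROM profiles p
WHERE (p.membertypeid = 1
       AND NOT EXISTS (SELECT 1 FROM members_company1 m WHERE m.id = p.memberid))
   OR (p.membertypeid = 2
       AND NOT EXISTS (SELECT 1 FROM members_company2 m WHERE m.id = p.memberid));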
Option 3
Yeah, that one may not work either. You could create a trigger on the profiles table that checks the reference to the members tables. This is far from ideal, but it does guarantee instantaneous checks.
My Opinion
I would go with option 2. You're obviously dealing with a less-than-ideal schema. Why make this worse than it has to be? Let the bad data sit for a week; clean the table every weekend.
No. A foreign key can reference one and only one primary key and there is no way to spread primary keys across tables. The kind of logic you hope to achieve will require use of a trigger or restructuring your database so that all members are based off a core record in a single table.
Come on, you can create a table but you cannot modify members_company1 or members_company2?
Your idea of creating a members table will require more actions when new records are inserted into the members_company tables.
So you can create triggers on members_company1 and members_company2 - is that not a modification?
What are the constraints to what you can do?
If you just need compatibility on selects to members_company1 and members_company2 then create a real members table and create views for members_company1 and members_company2.
A basic select does not know it is a view or a table on the other end.
CREATE VIEW dbo.members_company1
AS
SELECT id, name
FROM members
where companyID = 1
You could possibly even handle inserts, updates, and deletes with instead-of triggers:
INSTEAD OF INSERT Triggers
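A hedged T-SQL sketch of one such trigger on the compatibility view, assuming members.id is an IDENTITY column so callers only supply name through the view:
CREATE TRIGGER dbo.members_company1_insert
ON dbo.members_company1
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- redirect rows "inserted into the view" to the real members table
    INSERT INTO dbo.members (name, companyID)
    SELECT name, 1
    FROM inserted;
END;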
A foreign key cannot reference two tables. Assuming you don't want to correct your design by merging members_company1 and members_company2 tables, the best approach would be to:
Add two columns, member_company1_id and member_company2_id, to your profiles table, create two foreign keys to the two tables, and allow NULLs. Then add a constraint to ensure that exactly one of the two columns is non-NULL at all times.
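A hedged sketch of that approach (T-SQL; constraint names are illustrative):
ALTER TABLE profiles ADD
    member_company1_id int NULL
        CONSTRAINT fk_profiles_members_company1 REFERENCES members_company1 (id),
    member_company2_id int NULL
        CONSTRAINT fk_profiles_members_company2 REFERENCES members_company2 (id);
-- exactly one of the two columns must be set
ALTER TABLE profiles ADD CONSTRAINT chk_profiles_one_member CHECK (
       (member_company1_id IS NOT NULL AND member_company2_id IS NULL)
    OR (member_company1_id IS NULL AND member_company2_id IS NOT NULL)
);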

How can i make certain Oracle Table Rows marked as 'historical' invisible/un-available?

I have a huge existing Order Management application.
Now, in the main ORDER table, I am adding a new column: IS_HISTORICAL. If its value is TRUE, the order is now historical and should not show up in the application.
Now I have to modify many SQL queries in my existing application so that they select only those orders whose IS_HISTORICAL is 'FALSE', i.e. add the following to the WHERE clause:
AND IS_HISTORICAL='FALSE'
Question: Is there an easier way, so that I do not have to modify so many application queries (to hide the historical orders)?
Essentially, all orders marked as IS_HISTORICAL='TRUE' should become invisible/unavailable for reads and updates.
Note: Right now the table sizes are not very huge, but ultimately I intend to partition the table by IS_HISTORICAL true/false.
If you're only going to use the historical data for analysis then I prefer Florin's solution, as the amount of data you need to look at for each query remains smaller. It makes the analysis queries more difficult as you need to UNION ALL, but everything else will run "quicker" (it may not be noticeable).
If some applications/users require access to the historical data the better solution would be to rename your table and create a view on top of it with the query that you need.
The problem with re-writing all your queries is that you're going to forget one or get one incorrect, either now or in the future. A view removes that problem for you as the query is static, every time you query the view the additional conditions you require are automatically added.
Something like:
rename orders to order_history;
create or replace view orders as
select *
from order_history
where is_historical = 'FALSE';
Two further points.
I wouldn't bother with TRUE / FALSE; if the table gets large it's a lot of additional data to scan. Create your column as a VARCHAR2(1) and use T / F or Y / N, which are just as immediately obvious but smaller. Alternatively use a NUMBER(1,0) and 1 / 0.
Don't forget to put a constraint on your table so that the IS_HISTORICAL column can only have the values you've chosen.
If you're only ever going to have the two values then you may want to consider a CHECK CONSTRAINT:
alter table order_history
add constraint chk_order_history_historical
check ( is_historical in ('T','F') );
Otherwise (and maybe you should do this anyway), use a FOREIGN KEY constraint. Define an extra table, ORDER_HISTORY_TYPES:
create table order_history_types (
id varchar2(1)
, description varchar2(4000)
, constraint pk_order_history_types primary key (id)
);
Fill it with your values and then add the foreign key:
alter table order_history
add constraint fk_order_history_historical
foreign key (is_historical)
references order_history_types (id)
You could look into using Virtual Private Database/row-level security. This can be used to automatically add the is_historical = 'FALSE' predicate when certain conditions are met (e.g. you're connected as the application user).
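A heavily simplified sketch of what a VPD policy could look like (Oracle; schema, object and policy names are placeholders, and a real policy function would normally apply the predicate conditionally, e.g. based on the connected user or application context):
CREATE OR REPLACE FUNCTION hide_historical_orders (
    p_schema IN VARCHAR2,
    p_object IN VARCHAR2
) RETURN VARCHAR2
AS
BEGIN
    -- predicate appended to every query against the protected table
    RETURN 'is_historical = ''FALSE''';
END;
/
BEGIN
    DBMS_RLS.ADD_POLICY(
        object_schema   => 'APP_OWNER',
        object_name     => 'ORDERS',
        policy_name     => 'ORDERS_HIDE_HISTORICAL',
        function_schema => 'APP_OWNER',
        policy_function => 'HIDE_HISTORICAL_ORDERS',
        statement_types => 'SELECT,UPDATE,DELETE'
    );
END;
/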
If users only need non-historical records, an option is to create an ORDER_HIST table and move the historical records there (delete and insert).
If some users/applications need both types of records, then the partition approach is the best.

Merging databases how to handle duplicate PK's

We have three databases that are physically separated by region, one in LA, SF and NY. All the databases share the same schema but contain data specific to their region. We're looking to merge these databases into one and mirror it. We need to preserve the data for each region but merge them into one db. This presents quite a few issues for us, for example we will certainly have duplicate Primary Keys, and Foreign Keys will be potentially invalid.
I'm hoping to find someone who has had experience with a task like this who could provide some tips, strategies and words of experience on how we can accomplish the merge.
For example, one idea was to create composite keys and then change our code and sprocs to find the data via the composite key (region/original pk). But this requires us to change all of our code and sprocs.
Another idea was to just import the data and let it generate new PK's and then update all the FK references to the new PK. This way we potentially don't have to change any code.
Any experience is welcome!
I have no first-hand experience with this, but it seems to me like you ought to be able to uniquely map PK -> New PK for each server. For instance, generate new PKs such that data from LA server has PK % 3 == 2, SF has PK % 3 == 1, and NY has PK % 3 == 0. And since, as I understood your question anyway, each server only stores FK relationships to its own data, you can update the FKs in identical fashion.
NewLA = OldLA*3 - 1
NewSF = OldSF*3 - 2
NewNY = OldNY*3
You can then merge those and have no duplicate PKs. This is essentially, as you already said, just generating new PKs, but structuring it this way allows you to trivially update your FKs (assuming, as I did, that the data on each server is isolated). Good luck.
BEST: add a RegionCode column and include it in your PKs - but you probably don't want to do all the leg work.
HACK: if your IDs are INTs, a quick fix would be to add a fixed value based on region to each key on import. INTs can be as large as 2,147,483,647.
local server data:
LA IDs: 1,2,3,4,5,6
SF IDs: 1,2,3,4,5
NY IDs: 1,2,3,4,5,6,7,9
add 100000000 to LA's IDs
add 200000000 to SF's IDs
add 300000000 to NY's IDs
combined server data:
LA IDs: 100000001,100000002,100000003,100000004,100000005,100000006
SF IDs: 200000001,200000002,200000003,200000004,200000005
NY IDs: 300000001,300000002,300000003,300000004,300000005,300000006,300000007,300000009
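A hedged sketch of the offset hack for one regional copy (table and column names are hypothetical; it assumes foreign key constraints are disabled or updated in the same pass, and the same statements are repeated with 200000000 / 300000000 for SF and NY):
-- run against the LA copy before the merge
UPDATE orders      SET id = id + 100000000;
UPDATE order_items SET id = id + 100000000,
                       order_id = order_id + 100000000;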
I have done this and I say change your keys (pick a method) rather than changing your code. Invariably you will either miss a stored procedure or introduce a bug. With data changes, it is pretty easy to write tests to look for orphaned records or to verify that things were matched up correctly. With code changes, especially code that is working correctly, it is too easy to miss something.
One thing you could do is set up the tables with regional data to use GUIDs. That way, the primary keys in each region are unique, and you can mix and match data (import data from one region to another). For the tables which have shared data (like type tables), you can keep the primary keys the way they are (since they should be the same everywhere).
Here is some information about GUIDs:
http://www.sqlteam.com/article/uniqueidentifier-vs-identity
Maybe SQL Server Management Studio lets you convert columns to use GUIDs easily. I hope so!
Best of luck.
What I have done in a situation like this is:
1. Create a new DB with the same schema, but tables only: no PKs, FKs, checks etc.
2. Transfer the data from DB1 into this source DB.
3. For each table in the target database, find the highest number used for the PK.
4. For each table in the source database, update its PKs, FKs etc., starting with (highest number + 1) from the target DB.
5. For each table in the target database, set identity insert to ON.
6. Import the data from the source DB into the target DB.
7. For each table in the target database, set identity insert to OFF.
8. Clear the source DB.
9. Repeat for DB2.
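A hedged T-SQL example of steps 5-7 for a single, hypothetical table:
SET IDENTITY_INSERT target_db.dbo.orders ON;
INSERT INTO target_db.dbo.orders (id, customer_name, created_at)
SELECT id, customer_name, created_at
FROM source_db.dbo.orders;
SET IDENTITY_INSERT target_db.dbo.orders OFF;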
As Jon mentioned, I would use GUIDs to solve the merge task. I see two different solutions that require GUIDs:
1) Permanently change your database schema to use GUIDs instead of INTEGER (IDENTITY) as primary key.
This is a good solution in general, but if you have a lot of non-SQL code that is somehow bound to the way your identifiers work, it could require quite a few code changes. Since you are merging databases, you may need to update your application anyway so that it works with one region's data only, based on the user logged in, etc.
2) Temporarily add GUIDs for migration purposes only, and after the data is migrated, drop them:
This one is a bit more tricky, but once you write this migration script, you can (re-)run it multiple times to merge the databases again in case you screw it up the first time. Here is an example:
Table: PERSON (ID INT PRIMARY KEY, Name VARCHAR(100) NOT NULL)
Table: ADDRESS (ID INT PRIMARY KEY, City VARCHAR(100) NOT NULL, PERSON_ID INT)
Your alter scripts are (note that for all PK we automatically generate the GUID):
ALTER TABLE PERSON ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD PERSON_UID UNIQUEIDENTIFIER NULL
Then you update the FKs to be consistent with INTEGER ones:
--// set ADDRESS.PERSON_UID
UPDATE ADDRESS
SET ADDRESS.PERSON_UID = PERSON.UID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_ID = PERSON.ID
You do this for all PKs (automatically generate GUID) and FKs (update as shown above).
Now you create your target database. In this target database you also add the UID columns for all the PKs and FKs. Also disable all FK constraints.
Now you insert from each of your source databases to the target one (note: we do not insert PKs and integer FKs):
INSERT INTO TARGET_DB.dbo.PERSON (UID, NAME)
SELECT UID, NAME FROM SOURCE_DB1.dbo.PERSON
INSERT INTO TARGET_DB.dbo.ADDRESS (UID, CITY, PERSON_UID)
SELECT UID, CITY, PERSON_UID FROM SOURCE_DB1.dbo.ADDRESS
Once you inserted data from all the databases, you run the code opposite to the original to make integer FKs consistent with GUIDs on the target database:
--// set ADDRESS.PERSON_ID
UPDATE ADDRESS
SET ADDRESS.PERSON_ID = PERSON.ID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_UID = PERSON.UID
Now you may drop all the UID columns:
ALTER TABLE PERSON DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN PERSON_UID
So in the end you get a rather long migration script that should do the job for you. The point is: IT IS DOABLE.
NOTE: all written here is not tested.