Postgres/SQL: copy a row and all its dependencies to new tables

I want to copy data from one set of tables to another set in Postgres. The reason is that I have a global collection of data that is 'live', but when I use that data in a project it needs to become 'static' to that project, so that no future changes to the global data will change the project's data. The project effectively needs to be frozen in time.
I have been able to accomplish this relatively easily for simple objects with no foreign key dependencies that need copying. Take my Units to UnitsLocal example below, which copies the designated fields from Units to UnitsLocal, assigns a new Id and returns it for me to use:
INSERT INTO dbo."UnitsLocal"
("Type", "TenantId", "Description", "WriteId", "ProjectId")
SELECT
"Type", "TenantId", "Description", "WriteId", NULLIF(@ProjectId, 0)
FROM dbo."Units"
WHERE
@UnitId = "Id" AND
NOT EXISTS (
SELECT 1 FROM dbo."UnitsLocal"
WHERE "Type" = @UnitType AND "TenantId" = @TenantId AND "ProjectId" = @ProjectId)
RETURNING "Id"
Say I want to copy a more complex object: one that may contain multiple units and other data types, which may in turn contain other units and data types, and so on. What is the best way to go about this? It may be wishful thinking, but is there something like a "copy cascade" I don't know about that will copy a row based on an Id I can easily provide, and then copy all dependencies mapped by foreign keys to the relevant tables, assigning new ids as required?
The initial insert will be done via Dapper. Would the best approach be to do the whole lot in one query, or will I need to use functions, rules, and triggers?
I've just spent 4 hours trying to do this code-first and it just turns into Inception. It needs to be SQL, but I don't even know where to start.
A simplified data structure to explain better may be:
TABLE
dbo."Resources"
COLUMNS
"Id" - pkey
"TenantId" - fkey AspNetUsers (separate schema)
"UnitId" - fkey dbo."Units" "Id"
TABLE
dbo."ComplexResources"
COLUMNS
"Id" - pkey
"TenantId" - fkey AspNetUsers (separate schema)
"LineId" - fkey dbo."Lines" "Id"
** may contain many lines
TABLE
dbo."Units"
COLUMNS
"Id" - pkey
"TenantId" - fkey AspNetUsers (separate schema)
TABLE
dbo."Lines"
COLUMNS
"Id" - pkey
"TenantId" - fkey AspNetUsers (separate schema)
"UnitId" - fkey dbo."Units" "Id"
"ResourceId" - fkey dbo."Resources" "Id"
"ComplexResourceId" - fkey dbo."ComplexResources" "Id"
I may need to copy a ComplexResource to the ComplexResourceLocal table, the same way I did with the Units table, but I will then need to copy each Line's Unit, Resource, and ComplexResource (which may contain another ComplexResource, and so on) to their own respective local tables. The ProjectId is only required in the local tables and will be passed as a parameter.
I do protect against circular references.
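To make the shape of the problem concrete, here is a minimal in-memory sketch in Python of a recursive copy that keeps an old-id to new-id map per table, so that a row shared by several parents is copied only once. It is an illustration of the idea rather than a Postgres solution: the dictionaries stand in for the global tables, the sample rows are invented, and the "OwnerId" field on a line (the complex resource that owns it) is an assumption that is not part of the schema above.

```python
from itertools import count

# Hypothetical in-memory stand-ins for the global tables, keyed by Id.
units = {1: {"Description": "kg"}, 2: {"Description": "m"}}
resources = {10: {"UnitId": 1}}
complex_resources = {100: {}, 101: {}}
# "OwnerId" (the complex resource a line belongs to) is an assumed column.
lines = [
    {"OwnerId": 100, "UnitId": 2, "ResourceId": 10, "ComplexResourceId": None},
    {"OwnerId": 100, "UnitId": 1, "ResourceId": None, "ComplexResourceId": 101},
    {"OwnerId": 101, "UnitId": 1, "ResourceId": 10, "ComplexResourceId": None},
]

# Stand-ins for the *Local tables, plus one old-id -> new-id map per table.
local = {"units": {}, "resources": {}, "complex": {}, "lines": []}
id_map = {"units": {}, "resources": {}, "complex": {}}
new_ids = count(1000)  # imitates a sequence handing out fresh ids


def copy_unit(unit_id, project_id):
    if unit_id is None:
        return None
    if unit_id not in id_map["units"]:
        new_id = next(new_ids)
        id_map["units"][unit_id] = new_id
        local["units"][new_id] = {**units[unit_id], "ProjectId": project_id}
    return id_map["units"][unit_id]


def copy_resource(resource_id, project_id):
    if resource_id is None:
        return None
    if resource_id not in id_map["resources"]:
        new_id = next(new_ids)
        id_map["resources"][resource_id] = new_id
        row = resources[resource_id]
        local["resources"][new_id] = {
            "UnitId": copy_unit(row["UnitId"], project_id),
            "ProjectId": project_id,
        }
    return id_map["resources"][resource_id]


def copy_complex(complex_id, project_id):
    if complex_id is None:
        return None
    if complex_id in id_map["complex"]:
        return id_map["complex"][complex_id]
    new_id = next(new_ids)
    # Register the mapping before recursing so a repeated reference reuses it.
    id_map["complex"][complex_id] = new_id
    local["complex"][new_id] = {**complex_resources[complex_id], "ProjectId": project_id}
    for line in lines:
        if line["OwnerId"] == complex_id:
            local["lines"].append({
                "OwnerId": new_id,
                "UnitId": copy_unit(line["UnitId"], project_id),
                "ResourceId": copy_resource(line["ResourceId"], project_id),
                "ComplexResourceId": copy_complex(line["ComplexResourceId"], project_id),
                "ProjectId": project_id,
            })
    return new_id


new_root = copy_complex(100, project_id=7)
```

In SQL the same bookkeeping would presumably live in per-table mapping tables (old Id, new Id) filled while inserting, which is the part this sketch is meant to illustrate.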

Related

Extending table with another table ... sort of

I have a DB about renting cars.
I created a CarModels table (ModelID as PK).
I want to create a second table with the same primary key as CarModels have.
This table only contains the number of times this Model was searched on my website.
So lets say you visit my website, you can check a list that contains common cars rented.
"Most popular Cars" table.
It's not about One-to-One relationship, that's for sure.
Is there any SQL code to connect two Primary keys together ?
select m.ModelID, m.Field1, m.Field2,
t.TimesSearched
from CarModels m
left outer join Table2 t on m.ModelID = t.ModelID
But why not simply add the field TimesSearched to the CarModels table?
Then you don't need another table.
Easiest is to just use a new primary key on the new table with a foreign key to the CarModels table, like [CarModelID] INT NOT NULL. You can put an index and a unique constraint on the FK.
If you reeeealy want them to be the same, you can jump through a bunch of hoops that will make your life Hell, like creating the table from the CarModels table, then setting that field as the primary key, then whenever you add a new CarModel you'll have to create a trigger that will SET IDENTITY_INSERT ON so you can add the new one, and remember to SET IDENTITY_INSERT OFF when you're done.
Personally, I'd create a CarsSearched table that holds ThisUser selected ThisCarModel on ThisDate: then you can start doing some fun data analysis like [are some cars more popular in certain zip codes or certain times of year?], or [this company rents three cars every year in March, so I'll send them a coupon in January].
You are not extending anything (that would mean modifying the actual model of the table). You simply need an INNER JOIN of the two tables on their primary keys.
It could be an outer join, as has been suggested, but if it's 1:1 like you said (the second table will have exactly the same keys, and I assume all of them), an inner join will be enough, since both tables would have the same set of primary keys.
As a bonus, an inner join returns fewer rows when some keys are missing from the second table, which is a useful reminder that you failed to match all PKs.
That being said, do you have a strong reason not to keep this number in the same table? You are basically modeling a 1:1 relationship for one extra column (and a small one too, by data type).
You could extend the table (now this is extending the table's model) with an additional integer attribute that keeps that number for you.
The latter is preferred for simplicity and lower query times.

Mapping surrogate keys in SSIS

How do I map a surrogate key (which is a foreign key to another dimension table) in SSIS?
I have Dim.Camp table like this :
Dim.Camp(campkey int identity(1,1),Advkey int,campbk int,campname varchar(10))
Dim.Adv(Advkey int identity(1,1),Advbk int)
The above are my dimension tables,
These are my staging tables:
Camp(Advid int,campid int,campname varchar(10))
Adv(Advid int)
I load my Dim.Camp through a Lookup task in SSIS using my staging tables.
Then I get:
Dim.Camp(campkey int identity(1,1),Advkey int,campbk int,campname varchar(10)) populated, except for
Advkey, which gets all NULLs in its column because there is no corresponding mapping in the staging tables.
Can somebody tell me what I'm doing wrong, or how to get this done?
I would like to know what the relationship is between the entities or tables "Camp" and "Adv" in your source system or in the staging tables. You always need the business key to perform a lookup in the dimension tables; that is, you need campbk to look up records in Dim.Camp and Advbk to look up records in Dim.Adv.
If I understood correctly you should load first the Dim.Adv from staging using for instance a merge command in an execute SQL task or in a data flow task. Then you can load the Dim.Camp from Stage, something like this:
Source query:
select Advid as Advbk,campid campbk,campname from Staging.Camp
Then make a lookup (lookup table: Dim.Adv) and get Dim.Adv.Advkey where Dim.Adv.Advbk = Advbk
Finally make another lookup (table: Dim.Camp) and then determine if you have to update or insert the records.
Let me know if you have further questions.
Kind Regards,
Paul
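The load order described above (parent dimension first, then resolve the business key to the surrogate key while loading the child dimension) can be sketched outside SSIS in plain Python. This is only an illustration of what the Lookup transformation does; the sample rows are invented and `enumerate` merely imitates identity(1,1).

```python
# Hypothetical staging rows.
staging_adv = [{"Advid": 501}, {"Advid": 502}]
staging_camp = [
    {"Advid": 501, "campid": 9001, "campname": "spring"},
    {"Advid": 502, "campid": 9002, "campname": "summer"},
]

# Step 1: load Dim.Adv first; the surrogate key imitates identity(1,1).
dim_adv = [
    {"Advkey": key, "Advbk": row["Advid"]}
    for key, row in enumerate(staging_adv, start=1)
]

# Step 2: the lookup maps the business key (Advbk) to the surrogate key (Advkey).
adv_lookup = {row["Advbk"]: row["Advkey"] for row in dim_adv}

# Step 3: load Dim.Camp, resolving Advkey through the lookup.
# A business key missing from Dim.Adv yields None, the all-NULLs symptom.
dim_camp = [
    {
        "campkey": key,
        "Advkey": adv_lookup.get(row["Advid"]),
        "campbk": row["campid"],
        "campname": row["campname"],
    }
    for key, row in enumerate(staging_camp, start=1)
]
```

If Dim.Adv is empty or not yet loaded when step 3 runs, every `Advkey` comes back as `None`, which matches the behaviour described in the question.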

Strategies to store extra information about models without too many column names (alternatives to DB normalization and model subclassing)

Say you had a Model called Forest. Each object represents a forest on your continent. There is a set of data that is common to all these forests, like forest type, area etc., and these can be easily represented by columns on the SQL table, forest.
However, imagine that these forests have additional data that does not apply to all of them. For example, the 20 coniferous forests have a pine-fir split ratio number, whereas the deciduous forests have an autumn-duration number. One way would be to store all these columns on the main table itself, but there would be too many columns on each row, with many columns remaining unfilled by definition.
The most obvious way around this is to make sub-classes of the Forest model and have separate table for each subclass. I feel that's a heavy handed approach that I would rather not follow. If I need some data about the generic forest I'll have to consult another table.
Is there a pattern to solve this problem? What solution do you usually prefer?
NOTE: I have seen the other questions about this. The solutions proposed were:
Subtyping, same as I proposed above.
Have all the columns on the same table.
Have separate tables for each kind of forest, with duplicated data like area and rainfall... duplicated.
Is there an inventive solution that I don't know of?
UPDATE: I have run into the EAV model, and also a modified version where the unpredictable fields are stored out in a NoSQL/JSON store, and the id for that is held in the RDB. I like both, but welcome suggestions in this direction.
On the database side, the best approach is often to store attributes common to all forests in one table, and to store unique attributes in other tables. Build updatable views for clients to use.
create table forests (
forest_id integer primary key,
-- Assumes forest names are not unique on a continent.
forest_name varchar(45) not null,
forest_type char(1) not null
check (forest_type in ('c', 'd')),
area_sq_km integer not null
check (area_sq_km > 0),
-- Other columns common to all forests go here.
--
-- This constraint lets foreign keys target the pair
-- of columns, guaranteeing that a row in each subtype
-- table references a row here having the same subtype.
unique (forest_id, forest_type)
);
create table coniferous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'c'
check (forest_type = 'c'),
pine_fir_ratio float not null
check (pine_fir_ratio >= 0),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
create table deciduous_forests_subtype (
forest_id integer primary key,
forest_type char(1) not null
default 'd'
check (forest_type = 'd'),
autumn_duration_days integer not null
check (autumn_duration_days between 20 and 100),
foreign key (forest_id, forest_type)
references forests (forest_id, forest_type)
);
Clients usually use updatable views, one for each subtype, instead of using the base tables. (You can revoke privileges on the base subtype tables to guarantee this.) You might want to omit the "forest_type" column.
create view coniferous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.pine_fir_ratio
from forests t1
inner join coniferous_forests_subtype t2
on t1.forest_id = t2.forest_id;
create view deciduous_forests as
select t1.forest_id, t1.forest_type, t1.area_sq_km,
t2.autumn_duration_days
from forests t1
inner join deciduous_forests_subtype t2
on t1.forest_id = t2.forest_id;
What you have to do to make these views updatable varies a little with the dbms, but expect to write some triggers (not shown). You'll need triggers to handle all the DML actions--insert, update, and delete.
If you need to report only on columns that appear in "forests", then just query the table "forests".
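As a rough illustration of the triggers mentioned above, here is a sketch that uses SQLite through Python's sqlite3 module, chosen only because it runs anywhere; trigger syntax differs in other systems. It covers just the insert case for the coniferous view, the view exposes forest_name instead of forest_type, and the sample forests are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE forests (
    forest_id   INTEGER PRIMARY KEY,
    forest_name TEXT NOT NULL,
    forest_type TEXT NOT NULL CHECK (forest_type IN ('c', 'd')),
    area_sq_km  INTEGER NOT NULL CHECK (area_sq_km > 0),
    UNIQUE (forest_id, forest_type)
);
CREATE TABLE coniferous_forests_subtype (
    forest_id      INTEGER PRIMARY KEY,
    forest_type    TEXT NOT NULL DEFAULT 'c' CHECK (forest_type = 'c'),
    pine_fir_ratio REAL NOT NULL CHECK (pine_fir_ratio >= 0),
    FOREIGN KEY (forest_id, forest_type)
        REFERENCES forests (forest_id, forest_type)
);
CREATE VIEW coniferous_forests AS
    SELECT t1.forest_id, t1.forest_name, t1.area_sq_km, t2.pine_fir_ratio
    FROM forests t1
    INNER JOIN coniferous_forests_subtype t2 ON t1.forest_id = t2.forest_id;
-- An insert on the view is split into one insert per base table.
CREATE TRIGGER coniferous_forests_insert
INSTEAD OF INSERT ON coniferous_forests
BEGIN
    INSERT INTO forests (forest_id, forest_name, forest_type, area_sq_km)
    VALUES (NEW.forest_id, NEW.forest_name, 'c', NEW.area_sq_km);
    INSERT INTO coniferous_forests_subtype (forest_id, pine_fir_ratio)
    VALUES (NEW.forest_id, NEW.pine_fir_ratio);
END;
""")

# Clients write to the view only.
conn.execute(
    "INSERT INTO coniferous_forests (forest_id, forest_name, area_sq_km, pine_fir_ratio) "
    "VALUES (1, 'Black Forest', 6000, 1.5)"
)

# The composite foreign key rejects a coniferous subtype row for a deciduous forest.
conn.execute("INSERT INTO forests VALUES (2, 'Oak Wood', 'd', 40)")
try:
    conn.execute(
        "INSERT INTO coniferous_forests_subtype (forest_id, pine_fir_ratio) VALUES (2, 0.5)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Update and delete triggers would follow the same pattern, each one translating the action on the view into statements on the two base tables.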
Well, the easiest way is putting all the columns into one table and then having a "type" field to decide which columns to use. This works for smaller tables, but for more complicated cases it can lead to a big messy table and issues with database constraints (such as NULLs).
My preferred method would be something like this:
A generic "Forests" table with: id, type, [generic_columns, ...]
"Coniferous_Forests" table with: id, forest_id (FK to Forests), ...
So, in order to get all the data for a Coniferous Forest with id of 1, you'd have a query like so:
SELECT * FROM Coniferous_Forests INNER JOIN Forests
ON Coniferous_Forests.forest_id = Forests.id
AND Coniferous_Forests.id = 1
As for inventive solutions, there is such a thing as an OODBMS (Object-Oriented Database Management System).
The most popular alternatives to relational SQL databases are document-oriented NoSQL databases like MongoDB. This is comparable to using JSON objects to store your data, and allows you to be more flexible with your database fields.

How to duplicate the amount of data in a PostgreSQL database?

In order to evaluate the load of our platform (Django + PostgreSQL) I would like to literally duplicate the amount of data in the system. It's a bit complicated to create mocks that could emulate the different kinds of objects (since we have a very complex data model).
Is there a way to create a duplicate of the database, replace primary keys and unique fields with unused values, and merge it with the original?
(I) Explaining the principle
In order to illustrate the principle in a clear way, this explanation assumes the following:
Every table has a bigserial primary key column called "id"
There are no unique constraints on tables (except primary keys)
Foreign key constraints reference only primary keys of other tables
Apply the following to your database schema:
1. Make sure there are no circular dependencies between tables in your schema. If there are, choose foreign key constraints that would break such a dependency and drop them (you will recreate them later, after you manually handle the affected fields).
2. Sort the tables in topological order and, in that order, execute the script from (3) for every table.
3. For every table <table_schema>.<table_name> from (2) execute:
/*
Creating a lookup table which contains ordered pairs (id_old, id_new).
For every existing row in table <table_schema>.<table_name>,
a new row with id = id_new will be created, with all the other fields copied. Nextval of sequence <table_schema>.<table_name>_id_seq is fetched to reserve an id for the new row.
*/
CREATE TABLE _l_<table_schema>_<table_name> AS
SELECT id as id_old, nextval('<table_schema>.<table_name>_id_seq') as id_new
FROM <table_schema>.<table_name>;
/*
This part does the actual copying of table data while preserving referential integrity.
Table <table_schema>.<table_name> has the following fields:
id - primary key
column1, ..., columnN - fields in a table excluding the foreign keys; N>=0;
fk1, ..., fkM - foreign keys; M>=0;
_l_<table_schema_fki>_<table_name_fki> (1 <= i <= M) - lookup tables of parent tables. We use LEFT JOIN because a foreign key field could be nullable in the general case.
*/
INSERT INTO <table_schema>.<table_name> (id, column1, ... , columnN, fk1, ..., fkM)
SELECT tlookup.id_new, t.column1, ... , t.columnN, tablefk1.id_new, ..., tablefkM.id_new
FROM <table_schema>.<table_name> t
INNER JOIN _l_<table_schema>_<table_name> tlookup ON t.id = tlookup.id_old
LEFT JOIN _l_<table_schema_fk1>_<table_name_fk1> tablefk1 ON t.fk1 = tablefk1.id_old
...
LEFT JOIN _l_<table_schema_fkM>_<table_name_fkM> tablefkM ON t.fkM = tablefkM.id_old;
4. Drop all lookup tables.
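The lookup-table principle can be tried end to end on a toy schema. The sketch below uses SQLite through Python's sqlite3 module purely so that it is runnable anywhere; the two tables, their rows and the `duplicate` helper are made up for illustration. SQLite has no sequences, so instead of nextval the new id is computed as the old id plus the current maximum id, which is an assumption specific to this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, label TEXT,
                     parent_id INTEGER REFERENCES parent (id));
INSERT INTO parent VALUES (1, 'a'), (2, 'b');
INSERT INTO child  VALUES (1, 'x', 1), (2, 'y', 2), (3, 'z', NULL);
""")


def duplicate(table, columns, fks):
    """Copy every row of `table`; `fks` maps each FK column to its parent table."""
    # Lookup table of (id_old, id_new) pairs; max(id) offset replaces nextval here.
    conn.execute(
        f"CREATE TABLE _l_{table} AS "
        f"SELECT id AS id_old, id + (SELECT max(id) FROM {table}) AS id_new "
        f"FROM {table}"
    )
    targets = ["id", *columns, *fks]
    selected = ["tl.id_new", *(f"t.{c}" for c in columns)]
    joins = [f"INNER JOIN _l_{table} tl ON t.id = tl.id_old"]
    for i, (fk, parent) in enumerate(fks.items()):
        # LEFT JOIN so that rows with a NULL foreign key are still copied.
        selected.append(f"p{i}.id_new")
        joins.append(f"LEFT JOIN _l_{parent} p{i} ON t.{fk} = p{i}.id_old")
    conn.execute(
        f"INSERT INTO {table} ({', '.join(targets)}) "
        f"SELECT {', '.join(selected)} FROM {table} t {' '.join(joins)}"
    )


# Tables are processed in topological order: parents before children.
plan = [("parent", ["name"], {}), ("child", ["label"], {"parent_id": "parent"})]
for table, columns, fks in plan:
    duplicate(table, columns, fks)
for table, _, _ in plan:
    conn.execute(f"DROP TABLE _l_{table}")

rows = conn.execute("SELECT id, label, parent_id FROM child ORDER BY id").fetchall()
```

After the run, each copied child row points at the copy of its parent rather than at the original, and the row with a NULL parent is copied with a NULL parent.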
(II) Describing my implementation
To check for circular dependencies, I queried the transitive closures (https://beagle.whoi.edu/redmine/projects/ibt/wiki/Transitive_closure_in_PostgreSQL)
I implemented a topological sort function (ported from T-SQL from some blog). It comes in handy for automation.
I made a code generator (implemented in plpgsql). It's a function which takes <table_schema> and <table_name> as input params and returns the text (SQL) shown in (I.3) for that table. By concatenating the results of the function for every table in topological order, I produced the copy script.
I made manual changes to the script to satisfy unique constraints and other nuances which the boilerplate script doesn't cover.
Done. Script ready for execution in one transaction.
When I get the chance, I will anonymize my code a little bit, put it on GitHub and put a link here.
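Until that code is published, the cycle check and the topological ordering can be sketched with Python's standard library (graphlib, Python 3.9+). This is not the plpgsql implementation described above, just an illustration; the table names and the foreign-key graph are hypothetical, and in practice the graph would be read from the database catalog.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical schema: each table maps to the set of tables its foreign keys reference.
fk_graph = {
    "units": set(),
    "resources": {"units"},
    "lines": {"units", "resources"},
}


def has_cycle(graph):
    """Return True if the foreign-key graph contains a circular dependency."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False


# Referenced tables come out before the tables that reference them,
# which is the order in which the per-table copy script has to run.
order = list(TopologicalSorter(fk_graph).static_order())
```

For this graph `order` is `['units', 'resources', 'lines']`; a graph such as `{"a": {"b"}, "b": {"a"}}` makes `has_cycle` return `True`, signalling that a constraint has to be dropped first.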

Merging databases how to handle duplicate PK's

We have three databases that are physically separated by region, one in LA, SF and NY. All the databases share the same schema but contain data specific to their region. We're looking to merge these databases into one and mirror it. We need to preserve the data for each region but merge them into one db. This presents quite a few issues for us, for example we will certainly have duplicate Primary Keys, and Foreign Keys will be potentially invalid.
I'm hoping to find someone who has had experience with a task like this who could provide some tips, strategies and words of experience on how we can accomplish the merge.
For example, one idea was to create composite keys and then change our code and sprocs to find the data via the composite key (region/original pk). But this requires us to change all of our code and sprocs.
Another idea was to just import the data and let it generate new PK's and then update all the FK references to the new PK. This way we potentially don't have to change any code.
Any experience is welcome!
I have no first-hand experience with this, but it seems to me like you ought to be able to uniquely map PK -> New PK for each server. For instance, generate new PKs such that data from LA server has PK % 3 == 2, SF has PK % 3 == 1, and NY has PK % 3 == 0. And since, as I understood your question anyway, each server only stores FK relationships to its own data, you can update the FKs in identical fashion.
NewLA = OldLA*3-1
NewSF = OldSF*3-2
NewNY = OldNY*3
You can then merge those and have no duplicate PKs. This is essentially, as you already said, just generating new PKs, but structuring it this way allows you to trivially update your FKs (assuming, as I did, that the data on each server is isolated). Good luck.
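A quick way to convince yourself that this interleaving cannot collide is to run it over a range of keys. The sketch below is plain Python with an invented key range; `remap` is a hypothetical helper that multiplies the old key by 3 and subtracts a per-region offset, as described above.

```python
# Per-region offset subtracted after multiplying by 3.
REGION_OFFSET = {"LA": 1, "SF": 2, "NY": 0}


def remap(region, old_pk):
    """Map a region-local PK to a merged PK that is unique across regions."""
    return old_pk * 3 - REGION_OFFSET[region]


# Every region uses the same old keys 1..1000, the worst case for collisions.
merged = [remap(region, pk) for region in REGION_OFFSET for pk in range(1, 1001)]
collisions = len(merged) - len(set(merged))
```

Foreign keys are rewritten with the same function as the primary keys they reference, so relationships inside each region are preserved. Note that the largest key grows by a factor of three, which matters if the column is a 32-bit integer.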
BEST: add a RegionCode column and include it in your PKs, though you probably don't want to do all that legwork.
HACK: if your IDs are INTs, a quick fix would be to add a fixed value based on region to each key on import. INTs can be as large as: 2,147,483,647
local server data:
LA IDs: 1,2,3,4,5,6
SF IDs: 1,2,3,4,5
NY IDs: 1,2,3,4,5,6,7,9
add 100000000 to LA's IDs
add 200000000 to SF's IDs
add 300000000 to NY's IDs
combined server data:
LA IDs: 100000001,100000002,100000003,100000004,100000005,100000006
SF IDs: 200000001,200000002,200000003,200000004,200000005
NY IDs: 300000001,300000002,300000003,300000004,300000005,300000006,300000007,300000009
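The same hack written as a small Python helper, with a guard that the original ids actually fit below the chosen offset step so that regions cannot bleed into each other. The function name is hypothetical and the guard is an addition of this sketch.

```python
INT_MAX = 2_147_483_647  # largest 32-bit signed INT
STEP = 100_000_000
OFFSETS = {"LA": 1 * STEP, "SF": 2 * STEP, "NY": 3 * STEP}


def offset_id(region, old_id):
    """Shift a region-local id into that region's reserved range."""
    if not 0 < old_id < STEP:
        raise ValueError(f"id {old_id} does not fit in a {STEP}-wide range")
    new_id = OFFSETS[region] + old_id
    assert new_id <= INT_MAX  # holds for these three offsets
    return new_id


ny_ids = [offset_id("NY", i) for i in (1, 2, 3, 4, 5, 6, 7, 9)]
```

Primary keys and the foreign keys that reference them are shifted by the same amount for a given region, so relationships stay intact.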
I have done this and I say change your keys (pick a method) rather than changing your code. Invariably you will either miss a stored procedure or introduce a bug. With data changes, it is pretty easy to write tests to look for orphaned records or to verify that things were matched up correctly. With code changes, especially code that is working correctly, it is too easy to miss something.
One thing you could do is set up the tables with regional data to use GUID's. That way, the primary keys in each region are unique, and you can mix and match data (import data from one region to another). For the tables which have shared data (like type tables), you can keep the primary keys the way they are (since they should be the same everywhere).
Here is some information about GUID's:
http://www.sqlteam.com/article/uniqueidentifier-vs-identity
Maybe SQL Server Management Studio lets you convert columns to use GUID's easily. I hope so!
Best of luck.
What I have done in a situation like this is:
create a new db with the same schema, but only tables: no PK, FK, checks, etc.
transfer data from DB1 to this source db
for each table in the target database, find the top number for the PK
for each table in the source database, update their PK, FK, etc., starting with the (top number + 1) from the target db
for each table in the target database, set identity insert to on
import data from the source db to the target db
for each table in the target database, set identity insert to off
clear the source db
repeat for DB2
As Jon mentioned, I would use GUIDs to solve the merge task. I see two different solutions that require GUIDs:
1) Permanently change your database schema to use GUIDs instead of INTEGER (IDENTITY) as primary key.
This is a good solution in general, but if you have a lot of non-SQL code that is somehow bound to the way your identifiers work, it could require quite a few code changes. Since you are merging databases, you may need to update your application anyway, so that it works with only one region's data based on the logged-in user, etc.
2) Temporarily add GUIDs for migration purposes only, and after the data is migrated, drop them:
This one is kind-of more tricky, but once you write this migration script, you can (re-)run it multiple times to merge databases again in case you screw it the first time. Here is an example:
Table: PERSON (ID INT PRIMARY KEY, Name VARCHAR(100) NOT NULL)
Table: ADDRESS (ID INT PRIMARY KEY, City VARCHAR(100) NOT NULL, PERSON_ID INT)
Your alter scripts are (note that for all PK we automatically generate the GUID):
ALTER TABLE PERSON ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD UID UNIQUEIDENTIFIER NOT NULL DEFAULT (NEWID())
ALTER TABLE ADDRESS ADD PERSON_UID UNIQUEIDENTIFIER NULL
Then you update the FKs to be consistent with INTEGER ones:
--// set ADDRESS.PERSON_UID
UPDATE ADDRESS
SET ADDRESS.PERSON_UID = PERSON.UID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_ID = PERSON.ID
You do this for all PKs (automatically generate GUID) and FKs (update as shown above).
Now you create your target database. In this target database you also add the UID columns for all the PKs and FKs. Also disable all FK constraints.
Now you insert from each of your source databases to the target one (note: we do not insert PKs and integer FKs):
INSERT INTO TARGET_DB.dbo.PERSON (UID, NAME)
SELECT UID, NAME FROM SOURCE_DB1.dbo.PERSON
INSERT INTO TARGET_DB.dbo.ADDRESS (UID, CITY, PERSON_UID)
SELECT UID, CITY, PERSON_UID FROM SOURCE_DB1.dbo.ADDRESS
Once you inserted data from all the databases, you run the code opposite to the original to make integer FKs consistent with GUIDs on the target database:
--// set ADDRESS.PERSON_ID
UPDATE ADDRESS
SET ADDRESS.PERSON_ID = PERSON.ID
FROM ADDRESS
INNER JOIN PERSON
ON ADDRESS.PERSON_UID = PERSON.UID
Now you may drop all the UID columns:
ALTER TABLE PERSON DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN UID
ALTER TABLE ADDRESS DROP COLUMN PERSON_UID
So at the end you should get a rather long migration script, that should do the job for you. The point is - IT IS DOABLE
NOTE: none of the code written here has been tested.
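The flow of the temporary-GUID approach can also be walked through in memory. The sketch below is plain Python rather than T-SQL, with lists of dicts standing in for the PERSON and ADDRESS tables; the helper names and sample rows are invented, and target ids are simply numbered in insertion order to imitate an identity column.

```python
import uuid


def tag_with_uids(persons, addresses):
    """Source side: give every row a UID and point each address at its person's UID."""
    uid_by_person_id = {}
    for person in persons:
        person["UID"] = uuid.uuid4()
        uid_by_person_id[person["ID"]] = person["UID"]
    for address in addresses:
        address["UID"] = uuid.uuid4()
        address["PERSON_UID"] = uid_by_person_id.get(address["PERSON_ID"])


def merge(sources):
    """Merge (persons, addresses) pairs; integer PKs and FKs are regenerated."""
    target_persons, target_addresses = [], []
    for persons, addresses in sources:
        tag_with_uids(persons, addresses)
        for person in persons:  # old integer ID is deliberately not copied
            target_persons.append(
                {"ID": len(target_persons) + 1, "UID": person["UID"], "Name": person["Name"]}
            )
        for address in addresses:
            target_addresses.append({
                "ID": len(target_addresses) + 1,
                "City": address["City"],
                "PERSON_UID": address["PERSON_UID"],
                "PERSON_ID": None,
            })
    # Target side: rebuild integer FKs from the UIDs, then drop the UID columns.
    id_by_uid = {person["UID"]: person["ID"] for person in target_persons}
    for address in target_addresses:
        address["PERSON_ID"] = id_by_uid.get(address.pop("PERSON_UID"))
    for person in target_persons:
        del person["UID"]
    return target_persons, target_addresses


la = ([{"ID": 1, "Name": "Ann"}], [{"ID": 1, "City": "LA", "PERSON_ID": 1}])
ny = ([{"ID": 1, "Name": "Bob"}], [{"ID": 1, "City": "NY", "PERSON_ID": 1}])
persons, addresses = merge([la, ny])
```

Both sources use ID 1 for their person and address, yet after the merge each address still points at its own person under a new integer key.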