How to remove duplicates and update all relations in a PostgreSQL database?

I'm working on updating a local dataset that has a lot of cases with the following structure:
The elements in table A that share the same refId for the main entity are virtually the same, so I want to remove all these duplicates and update tables B, C and so on.
The idea is to group all elements in table A that share the same refId, choose one, remove all the others and change the references in the other tables to point to the one I chose. However, I'm having problems with the last part, updating the other tables.
Is there a quick way of doing this? Doing it manually has been a pain.
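One possible way to do this in two statements is sketched below: first repoint every referencing row to the survivor of its group, then delete the non-survivors. The sketch runs against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained; the table and column names (a, b, ref_id, a_id) and the rule "keep the lowest id" are assumptions for illustration, not taken from the question. The two statements are plain SQL and should carry over to PostgreSQL, but they have not been checked against the real schema.

```python
import sqlite3

# Hypothetical schema: table a(id, ref_id) holds the duplicates,
# table b(id, a_id) references it. All names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, ref_id INTEGER NOT NULL);
CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER NOT NULL REFERENCES a(id));
INSERT INTO a (id, ref_id) VALUES (1, 10), (2, 10), (3, 10), (4, 20), (5, 20), (6, 30);
INSERT INTO b (id, a_id) VALUES (1, 1), (2, 2), (3, 3), (4, 5), (5, 6);
""")

with conn:  # one transaction: both steps succeed or neither does
    # Step 1: repoint each child row to the survivor (lowest id) of the
    # group that its current parent belongs to.
    conn.execute("""
        UPDATE b
        SET a_id = (
            SELECT MIN(keep.id)
            FROM a AS dup
            JOIN a AS keep ON keep.ref_id = dup.ref_id
            WHERE dup.id = b.a_id
        )
    """)
    # Step 2: delete every row that is not the survivor of its group.
    conn.execute("""
        DELETE FROM a
        WHERE id NOT IN (SELECT MIN(id) FROM a GROUP BY ref_id)
    """)

remaining_a = conn.execute("SELECT id, ref_id FROM a ORDER BY id").fetchall()
remaining_b = conn.execute("SELECT id, a_id FROM b ORDER BY id").fetchall()
print(remaining_a)
print(remaining_b)
```

The update has to run before the delete, otherwise the rows in b would briefly point at parents that no longer exist. Tables C and so on would each need their own update of the same shape.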

Related

Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub row" key to the dataset like this:
Row_number() over(partition by “all data columns” order by SystemTime) as data_row_nr
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?

No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You will solve the technical problem, but I doubt there will be any business value.
First of all, you should distinguish between the two kinds of table: the tables you receive with a primary key are dimensions, for which you can recognise changes and build history.
The tables without a PK, however, are most probably fact tables (i.e. transaction records), which are typically not fully loaded but loaded based on some delta criterion.
In any case, you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my todo list:
Check whether the duplicates are intended or illegal
Try to find a delta criterion to load the fact tables
If everything fails, make the primary key out of all columns, add a single attribute holding the number of duplicates, and build the history on that.
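The last fallback on that list could look roughly like the sketch below. It is written in Python with an in-memory SQLite database rather than BigQuery so that it runs on its own; the table layout (account, amount, dup_count, valid_from, valid_to), the helper table load_counts and the sample dates are all assumptions for illustration. Every data column is part of the key, the duplicate count is the only tracked attribute, and a change in the count closes the open version and opens a new one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical ledger layout: every data column is part of the key.
CREATE TABLE staging (account TEXT, amount INTEGER);
CREATE TABLE dwh (
    account TEXT, amount INTEGER,
    dup_count INTEGER,               -- number of identical rows in the load
    valid_from TEXT, valid_to TEXT   -- valid_to IS NULL marks the open version
);
INSERT INTO dwh VALUES ('A', 100, 5, '2024-01-01', NULL),
                       ('B', 50, 1, '2024-01-01', NULL);
INSERT INTO staging VALUES ('A', 100), ('A', 100), ('A', 100),
                           ('A', 100), ('A', 100), ('A', 100),
                           ('B', 50), ('C', 7);
-- Collapse the full load to one row per distinct record plus its count.
CREATE TEMP TABLE load_counts AS
    SELECT account, amount, COUNT(*) AS dup_count
    FROM staging GROUP BY account, amount;
""")

load_date = "2024-01-02"
with conn:
    # Close open versions whose record vanished or whose count changed.
    conn.execute("""
        UPDATE dwh SET valid_to = ?
        WHERE valid_to IS NULL
          AND NOT EXISTS (
              SELECT 1 FROM load_counts AS l
              WHERE l.account = dwh.account
                AND l.amount = dwh.amount
                AND l.dup_count = dwh.dup_count)
    """, (load_date,))
    # Open a new version for every record that has no open version left.
    conn.execute("""
        INSERT INTO dwh (account, amount, dup_count, valid_from, valid_to)
        SELECT l.account, l.amount, l.dup_count, ?, NULL
        FROM load_counts AS l
        WHERE NOT EXISTS (
              SELECT 1 FROM dwh
              WHERE dwh.account = l.account
                AND dwh.amount = l.amount
                AND dwh.valid_to IS NULL)
    """, (load_date,))

history = conn.execute("""
    SELECT account, amount, dup_count, valid_from, valid_to
    FROM dwh ORDER BY account, valid_from
""").fetchall()
for row in history:
    print(row)
```

Note that the equality comparisons treat NULL data values as non-matching, so nullable columns would need extra handling in a real load.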

How can I replicate many to many relationship

I have a product database where I am trying to replicate a particular product's data and relationships to a new product, a clone. I am puzzled, however, about how to replicate several many-to-many relationships. For example, consider a product with two parts, where several colors are available for each part. I have a Product table, a product Areas table, and a Colors table. The product id is a foreign key in the area table, one to many. The Area table has an area id (pk) along with other descriptive fields, and the Colors have color ids (pk) along with palette information. A fourth table serves as the many-to-many lookup table, its primary key being the part id and the color id combined. This is a pretty straightforward configuration as far as it goes.
I can't think of a way to clone this structure, however, despite many approaches which would be way too much to elaborate upon here. I can easily enough replicate the left hand, product-area relationship, generating new AreaIDs (A,B,C). But in a next step, I then want to replicate the many-to-many relationship using the new area ids. However, now I don't know which original ID (H,L,W) to associate with which new ID.
For example, does the new id A get mapped to the set of colors from the old ID H, L, or W? I have only id's to work with. I can select both parts and part-color pairs from the source in one select statement, but I can't insert into two tables with one statement.
In other words, how do I replicate many-to-many relationships if I want to supply a new ID for half of it? Do I have to resort to cursors? I can if I need to, but I'm imagining there is an elegant way to accomplish this that I just can't figure out. Maybe using a temp table or some sort of table-valued function? I've tried to search for answers, but all I can find is advice on setting up many-to-many relationships.
Thanks to you experts who have the patience to read through this question.

SymmetricDS replicates tables that have many-to-many relationships using change data capture. The key is to perform an initial load to get the databases in sync, so that if a child record is later updated the change data capture will also work. In the latest versions of SymmetricDS (3.10 and higher) it will also automatically resolve foreign key errors if the databases are not in sync. If a child row is being loaded to a target without the parent, it will call back to the source to load the missing parent as well, so that you do not need to intervene.
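For the in-database cloning described in the question, the temp-table idea it mentions can be sketched as follows: record an old-id to new-id pair for every area as it is copied, then copy the lookup rows through that mapping. This is a minimal illustration in Python with SQLite, not a definitive solution for the asker's database; the schema, the sample rows, the mapping table area_map and the function clone_product are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema modelled on the description in the question.
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE area (area_id INTEGER PRIMARY KEY, product_id INTEGER, name TEXT);
CREATE TABLE color (color_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE area_color (area_id INTEGER, color_id INTEGER,
                         PRIMARY KEY (area_id, color_id));
INSERT INTO product VALUES (1, 'original');
INSERT INTO area VALUES (10, 1, 'lid'), (11, 1, 'body');
INSERT INTO color VALUES (1, 'red'), (2, 'blue'), (3, 'green');
INSERT INTO area_color VALUES (10, 1), (10, 2), (11, 3);
-- Mapping from each source area id to the id of its copy.
CREATE TEMP TABLE area_map (old_id INTEGER PRIMARY KEY, new_id INTEGER);
""")

def clone_product(conn, source_id, new_name):
    """Copy a product, its areas and their color links; return the new product id."""
    with conn:
        new_product_id = conn.execute(
            "INSERT INTO product (name) VALUES (?)", (new_name,)).lastrowid
        conn.execute("DELETE FROM area_map")
        old_areas = conn.execute(
            "SELECT area_id, name FROM area WHERE product_id = ?",
            (source_id,)).fetchall()
        for old_id, name in old_areas:
            # Insert the copy and remember which source row it came from.
            new_id = conn.execute(
                "INSERT INTO area (product_id, name) VALUES (?, ?)",
                (new_product_id, name)).lastrowid
            conn.execute("INSERT INTO area_map VALUES (?, ?)", (old_id, new_id))
        # Copy the many-to-many rows in one statement via the mapping.
        conn.execute("""
            INSERT INTO area_color (area_id, color_id)
            SELECT m.new_id, ac.color_id
            FROM area_color AS ac
            JOIN area_map AS m ON m.old_id = ac.area_id
        """)
    return new_product_id

new_id = clone_product(conn, 1, "clone")
cloned_links = conn.execute("""
    SELECT a.name, ac.color_id
    FROM area AS a JOIN area_color AS ac ON ac.area_id = a.area_id
    WHERE a.product_id = ?
    ORDER BY a.name, ac.color_id
""", (new_id,)).fetchall()
print(cloned_links)
```

The loop over areas is the row-by-row part; the lookup rows themselves are copied set-based. Some databases can capture the old and new ids in a single statement instead, but that depends on the engine and is not shown here.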

Using a Delete query on a single table when referencing other tables

I want to run a delete query to remove certain data from a table in a SharePoint list using an MS Access query. However, I want to be sure to delete only from a single list, based on the values of another table.
The table is TMainData: This consists solely of number fields that are references to the keyfields in other tables, such as TProgram which has a program name, or TContact which has the point of contact, or TPositionTitle which has a title like Site Director.
So a TMainData entry looks something like
ProgramID, which links to TPrograms: 4
ContactID, which links to TContacts: 42
PositionTitle, which links to TPositionTitle: 3
This tells me that the Site Director (TPositionTitle 3) of Anesthesiology (ProgramID 4) is John Smith (ContactID 42).
Here's where it gets tricky: I have a reference under TPrograms to TProgramType. I want to delete all records under TMainData where they link to a certain Program Type, because that program type is going away. HOWEVER... I don't want to delete the program itself (yet), just the lines referencing that program in TMainData.
The "manual" way I see to do this is to run queries that identify what the ProgramIDs are of the programs I want to delete the contacts for, and then use those IDs in a delete query that only references the TMainData query. I'm wondering if there's a way to use referential data, because I may have to be running some ridiculous update queries at a later time that would need this same info.
I dug through https://support.office.com/en-us/article/Use-queries-to-delete-one-or-more-records-from-a-database-A323BF1A-C9B4-4C86-8719-BE58BDF1B10C but it doesn't seem to cover deleting from one table based on values referenced in another table.

You already seem to understand what you need to do to achieve the desired result when you state:
...run queries that identify what the ProgramIDs are of the programs I want to delete the contacts for, and then use those IDs in a delete query that only references the TMainData query.
If I've understood your description correctly, I would suggest something along the lines of:
delete from tmaindata
where
    tmaindata.programid in
    (
        select tprograms.programid
        from tprograms
        where tprograms.tprogramtype = 'YourProgramType'
    )
Always take a backup of your data before running delete queries - there is no undo.
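To check what such a query would remove before running it, the same subquery can first be used in a count. The sketch below tries this on throwaway data in an in-memory SQLite database from Python, not in Access against a SharePoint list; the sample rows and the program type names 'Residency' and 'Fellowship' are made up for illustration, and the column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Throwaway sample data; the program types are invented for this sketch.
CREATE TABLE tprograms (programid INTEGER PRIMARY KEY,
                        programname TEXT, tprogramtype TEXT);
CREATE TABLE tmaindata (programid INTEGER, contactid INTEGER,
                        positiontitle INTEGER);
INSERT INTO tprograms VALUES (4, 'Anesthesiology', 'Residency'),
                             (5, 'Cardiology', 'Fellowship');
INSERT INTO tmaindata VALUES (4, 42, 3), (5, 43, 3), (5, 44, 1);
""")

# The same filter drives the preview and the delete, so they cannot drift apart.
FILTER = "programid IN (SELECT programid FROM tprograms WHERE tprogramtype = ?)"
program_type = "Fellowship"

# Preview: how many rows in tmaindata would be removed?
to_delete = conn.execute(
    f"SELECT COUNT(*) FROM tmaindata WHERE {FILTER}", (program_type,)
).fetchone()[0]

with conn:
    # Delete only from tmaindata; tprograms itself is left untouched.
    deleted = conn.execute(
        f"DELETE FROM tmaindata WHERE {FILTER}", (program_type,)
    ).rowcount

remaining = conn.execute("SELECT * FROM tmaindata").fetchall()
programs_left = conn.execute("SELECT COUNT(*) FROM tprograms").fetchone()[0]
print(to_delete, deleted, remaining, programs_left)
```

Because the filter is kept in one place, it could also be reused for the later update queries mentioned in the question.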

Adding record with new foreign key

I have a few tables to store company information in my database, but I want to focus on two of them. One table, Company, contains CompanyID, which is autoincremented, and some other columns that are irrelevant for now. The problem is that companies use different versions of their names (e.g. IBM vs. International Business Machines) and I need to store them all for further use, so I can't keep names in the Company table. Therefore I have another table, CompanyName, that uses CompanyID as a foreign key (it's a one-to-many relation).
Now, I need to import some new companies, and I have names only. Therefore I want to add them to CompanyName table, but create new records in Company table immediately, so I can put right CompanyID in CompanyName table.
Is it possible with one query? How should I approach this problem properly? Do I need to go as far as writing a VBA procedure to add records one by one?
I searched Stack and other websites, but I didn't find any solution for my problem, and I can't figure it out myself. I guess it could be done with a form and subform, but ultimately I want to put all my queries in a macro, so the data import would be done automatically.
I'm not a database expert, so maybe I just designed it badly, but I couldn't figure out another way to cleanly store multiple names of the same entity.

The table structure you set up appears to be a good way to do this. But there's no way to insert records into both tables at the same time. One option is to write two queries that insert records into Company and then into CompanyName. After inserting records into Company, you will need to create a query that joins the source table to the Company table on a field that uniquely identifies the record besides the autoincrement key. That will allow you to get the key field from Company for use when you insert into CompanyName.
The other option is to write some VBA code that loops through the source data, inserting records into both tables. That would be preferable since it should be more reliable.
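The looping option has the following shape, shown here as a sketch in Python with SQLite instead of VBA with Access so that it can be run on its own. The table and column names follow the question; the Name column, the function import_companies and the sample names are assumptions. For each imported name, a Company row is created first and the key generated for it is reused for the CompanyName row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company (CompanyID INTEGER PRIMARY KEY AUTOINCREMENT);
CREATE TABLE CompanyName (
    CompanyID INTEGER NOT NULL REFERENCES Company(CompanyID),
    Name TEXT NOT NULL   -- hypothetical column name
);
""")

def import_companies(conn, names):
    """Create one Company row per name and attach the name to it."""
    ids = {}
    with conn:  # all inserts are committed together or not at all
        for name in names:
            # Insert the parent row and read back its generated key.
            company_id = conn.execute(
                "INSERT INTO Company DEFAULT VALUES").lastrowid
            conn.execute(
                "INSERT INTO CompanyName (CompanyID, Name) VALUES (?, ?)",
                (company_id, name))
            ids[name] = company_id
    return ids

ids = import_companies(conn, ["IBM", "Acme"])
rows = conn.execute(
    "SELECT CompanyID, Name FROM CompanyName ORDER BY CompanyID").fetchall()
print(ids)
print(rows)
```

Reading the generated key back immediately after each insert avoids the need for a second field that uniquely identifies the new Company row, which the two-query option depends on.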

Entity Framework Inheritance vs Tables

Ok I am very new to creating databases with Entity in mind.
I have a Master table which is going to have:
departmentID
functionID
processID
procedureID
Each of those IDs needs to point to a specific list of information: name, description and owner. Of course, each list links back to its ID in the master table.
My question is, should I make 4 separate tables, or create one table, since the information is the same in all the tables except one?
The procedureID will actually need to have an extra field for documentID to point to a specific document.
Is it possible and a good idea to make one table and add some inheritance, or is it better to make 4 separate tables?

Splitting data into a number of related tables brings many advantages over one single table. Also by having data held in separate tables, it is simple to add records that are not yet needed but may be in the future. You can also create your corresponding objects for each table in your code. Also it would be more difficult to split the data into separate tables in the future if somehow you need to do that.
