merging data from 2 databases - sql

Currently we have a contracts system that pulls in job data from our finance system. Each job has an ID and the contracts hang off of that. We now have to bring in job data from another finance system. The jobs from the new system will also contain a job ID and contracts will have to hang off this. I expect there will be some ID conflicts when the data is merged. What's the best way to deal with this? Should I create another table that pulls in the job data from both and assigns a new ID for the contracts to hang from? Obviously I will need to update the current contracts to match the newly generated IDs. Does this sound like a good idea, or is there a better way?

Given your additional comments, I would suggest that you use a mapping table to map any of the conflicting IDs in the old system to new IDs. Normally when importing data into an existing system you would want to keep the IDs of the current system intact, but since that system is going to be gone in a year (or however long it takes) and is about to become read-only, I would think that you would want to try to preserve the IDs in the new system.
Once you create the mapping table, you would then use it to update any foreign key references, etc., and then import the new data, which should now have no conflicts.
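For illustration, a rough T-SQL-style sketch with made-up names (job, contract, new_job_staging, job_id_map); adapt it to your real schema and DBMS:

    -- 1) Map every existing job ID that collides with an incoming one to a fresh ID.
    CREATE TABLE job_id_map (
        old_job_id INT PRIMARY KEY,   -- ID currently used by the contracts system
        new_job_id INT NOT NULL       -- replacement ID that will not collide
    );

    INSERT INTO job_id_map (old_job_id, new_job_id)
    SELECT j.job_id,
           j.job_id + 1000000         -- or any scheme guaranteed not to overlap either system
    FROM   job j
    WHERE  EXISTS (SELECT 1 FROM new_job_staging s WHERE s.job_id = j.job_id);

    -- 2) Update the foreign key references, then the jobs themselves.
    -- (With an enforced FK from contract to job you may need ON UPDATE CASCADE,
    --  or to disable the constraint while the IDs are remapped.)
    UPDATE c
    SET    c.job_id = m.new_job_id
    FROM   contract c
           JOIN job_id_map m ON m.old_job_id = c.job_id;

    UPDATE j
    SET    j.job_id = m.new_job_id
    FROM   job j
           JOIN job_id_map m ON m.old_job_id = j.job_id;

    -- 3) Import the new system's jobs with their original IDs intact; there are
    --    no conflicts left at this point.
    INSERT INTO job (job_id, job_name)
    SELECT s.job_id, s.job_name
    FROM   new_job_staging s;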

Related

Easiest Way of Returning a Table for manual completion

I have a situation where periodically I am doing the following:
New data comes in about a business
The new data gets joined on several look-up tables (such as mapping employees to teams, firm names to industries)
As the business changes, new employees or firms can appear each time new data is loaded, at which point my current code adds the new occurrences to the existing look-up tables. I then export the table, manually type in the mappings (manually is the easiest way), and re-import the table.
I was wondering if there is an easier way to write 'directly' into the table on the fly, rather than having to save and re-import it as I am currently doing. (Also, please assume that it's just easier for me to manually think about and type in the mappings rather than having a CASE statement or similar do the task.)
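For illustration, the kind of statements involved might look like this, with made-up names (staging_data, employee_team_map); the new occurrences go in with a NULL mapping and the mappings are then typed straight into the table with plain UPDATEs (or a GUI grid editor) instead of exporting and re-importing it:

    INSERT INTO employee_team_map (employee_name, team_name)
    SELECT DISTINCT s.employee_name, NULL
    FROM   staging_data s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   employee_team_map m
                       WHERE  m.employee_name = s.employee_name);

    -- Fill in each new mapping by hand, directly against the table:
    UPDATE employee_team_map SET team_name = 'Fixed Income' WHERE employee_name = 'J Smith';
    UPDATE employee_team_map SET team_name = 'Equities'     WHERE employee_name = 'A Jones';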

Accepted methodology when using multiple Sqlite databases

Question
What is the accepted way of using multiple databases that record information about the same object that will ultimately end up living in one central database?
Example
There is one main SQL database about trees.
This database holds information about unique trees from all over the UK.
To collect the information, a blank SQLite database (with the same schema) is created and taken out to the tree on a phone.
The collected information is then stored in the SQLite database until it is brought back and transferred into the main database.
Now this works fine as long as there is only one Sqlite database out for any one tree at a time.
However, if two people wanted to collect different information for the same tree at the same time, when they both came back and attempted to transfer their data in to the main database, there would be collisions on their primary key constraints.
ID Schemes (with example data)
There is a tree table which has a unique identifier called TreeID
TreeID - TreeName - Location
1001 - Teddington Field - Plymouth
Branch table
BranchID - BranchName - TreeID
1001-10001 - 1st Branch - 1001
1001-10002 - 2nd Branch - 1001
Leaf table
LeafID - LeafName - BranchId
1001-10001-1 - Bedroom - 1001-10001
1001-10002-2 - Bathroom - 1001-10001
Possible ideas
Assign each database 1000 unique IDs; then when they come back in, because the IDs have already been pre-assigned, the IDs from each database won't collide.
Downfall
This isn't very dynamic and could fail if one database overruns its pre-assigned IDs.
Is there another way to achieve the same flexibility but without the downfall mentioned above?
So, as an answer:
On the master DB, store an extra ID field identifying the source/collection database that the dataset was collected on, as well as the tree ID.
(src01, 1001), (src02, 1001)
This also allows you to link back easily to the collection source of the information, which is likely going to be a future requirement. Now, you may or may not want to autogenerate another sequence ID key value on the master DB's table (I wouldn't, but that's because I am not that fond of surrogate keys), but I would definitely keep track of the source/tree ID the data was originally collected with in the field, separately from any master DB unique key considerations.
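A minimal SQLite-style sketch of that composite-key idea; the SourceDB column name is an assumption:

    CREATE TABLE Tree (
        SourceDB  TEXT    NOT NULL,   -- which collection database the row came from, e.g. 'src01'
        TreeID    INTEGER NOT NULL,   -- the ID assigned on that collection database
        TreeName  TEXT,
        Location  TEXT,
        PRIMARY KEY (SourceDB, TreeID)
    );

    -- Both collectors' records for tree 1001 can now coexist on the master DB:
    INSERT INTO Tree (SourceDB, TreeID, TreeName, Location)
    VALUES ('src01', 1001, 'Teddington Field', 'Plymouth'),
           ('src02', 1001, 'Teddington Field', 'Plymouth');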
Apparently you are talking about auto-generated IDs for related objects, not the IDs for the trees themselves. Two different people collecting information about the same tree, starting from the same starting set, end up generating the same IDs independently. The two sets of generated IDs cannot coexist in the same DB.
Since you want to keep all the new data, one possible solution is to avoid using the field-generated IDs in the central database at all. When each set of data comes in, take the data that was added in the field and programmatically add it to the central DB in a way equivalent to how it was added in the field, letting the central DB autogenerate its own IDs.
This requires a mechanism to distinguish newly-collected data from old, but that might be as simple as a timestamp.
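A rough SQLite sketch of that re-insert approach, assuming a CollectedAt timestamp column marks rows added in the field and the central Branch table autogenerates its own BranchID:

    ATTACH DATABASE 'collector_src01.sqlite' AS field_db;

    -- Copy only the rows collected in the field since the last sync; the central DB
    -- assigns its own BranchID, so the field-generated IDs are simply discarded.
    INSERT INTO Branch (BranchName, TreeID)
    SELECT f.BranchName, f.TreeID
    FROM   field_db.Branch f
    WHERE  f.CollectedAt > :last_sync;   -- timestamp of the previous import for this collector

    -- Child rows (leaves) would be copied the same way, looking up the newly
    -- generated parent BranchID rather than reusing the field-generated one.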

What is the best method of logging data changes and user activity in an SQL database?

I'm starting a new application and was wondering what the best method of logging is. Some tables in the database will need to have every change recorded, and the user that made the change. Other tables may just need to have the last modified time recorded.
In previous applications I've used different methods to do this but want to hear what others have done.
I've tried the following:
Add a "modified" date-time field to the table to record the last time it was edited.
Add a secondary table just for recording changes in a primary table. Each row in the secondary table represents a changed field in the primary table. So one record update in the primary could create several records in the secondary table.
Add a table similar to no. 2, but have it record edits across three or four tables, referencing the table the edit relates to in an additional field.
What methods do you use and would you recommend?
Also, what is the best way to record deleted data? I never like the idea that a user can permanently delete a record from the DB, so usually I have a boolean 'deleted' field which is set to true when a record is deleted; it is then filtered out of all queries at the model level. Any other suggestions on this?
Last one: what is the best method for recording user activity? At the moment I have a table which records logins/logouts/password changes etc., and depending on what the action is, gives it a code such as 1, 2, 3 etc.
Hope I haven't crammed too much into this question. Thanks.
I know it's a very old question, but I wanted to add a more detailed answer, as this is the first link I got when googling about DB logging.
There are basically two ways to log data changes:
on application server layer
on database layer.
If you can, just use logging on the server side. It is much clearer and more flexible.
If you need to log on the database layer you can use triggers, as @StanislavL said. But triggers can slow down your database performance and limit you to storing the change log in the same database.
Also, you can look at the transaction log monitoring.
For example, in PostgreSQL you can use the logical replication mechanism to stream changes in JSON format from your database to anywhere.
In a separate service you can receive, handle and log the changes in any form and in any database (for example, just put the JSON you get into Mongo).
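As a rough sketch of that logical-decoding route (the slot name is an example, wal2json is a third-party output plugin, and wal_level must be set to logical); in practice you would consume the slot from a separate service rather than polling it with SQL:

    SELECT pg_create_logical_replication_slot('audit_slot', 'wal2json');

    -- Peek at the JSON change stream while testing:
    SELECT data
    FROM   pg_logical_slot_get_changes('audit_slot', NULL, NULL);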
You can add triggers to any tracked table to listen for insert/update/delete. In the triggers, just check the NEW and OLD values and write them to a special table with columns:
table_name
entity_id
modification_time
previous_value
new_value
user
It's hard to figure out which user made the change, but it is possible if you add a changed_by column to the table you listen to.
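A PostgreSQL-flavoured sketch of that trigger approach; the audited table (customer) and its id and changed_by columns are assumptions, and for brevity the whole OLD/NEW rows are stored as JSON rather than one row per changed field:

    CREATE TABLE change_log (
        table_name        TEXT,
        entity_id         BIGINT,
        modification_time TIMESTAMPTZ NOT NULL DEFAULT now(),
        previous_value    JSONB,
        new_value         JSONB,
        "user"            TEXT
    );

    CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO change_log (table_name, entity_id, previous_value, new_value, "user")
            VALUES (TG_TABLE_NAME, NEW.id, NULL, to_jsonb(NEW), COALESCE(NEW.changed_by, current_user));
            RETURN NEW;
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO change_log (table_name, entity_id, previous_value, new_value, "user")
            VALUES (TG_TABLE_NAME, NEW.id, to_jsonb(OLD), to_jsonb(NEW), COALESCE(NEW.changed_by, current_user));
            RETURN NEW;
        ELSE  -- DELETE
            INSERT INTO change_log (table_name, entity_id, previous_value, new_value, "user")
            VALUES (TG_TABLE_NAME, OLD.id, to_jsonb(OLD), NULL, current_user);
            RETURN OLD;
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- PostgreSQL 11+; use EXECUTE PROCEDURE on older versions.
    CREATE TRIGGER customer_audit
    AFTER INSERT OR UPDATE OR DELETE ON customer
    FOR EACH ROW EXECUTE FUNCTION log_change();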

Having two tables for capturing data at a specific moment

I'm creating an application which will hold curricula vitae (CVs).
The user should be able to:
create different work information for use with different CVs
Name of work, Start date, End Date, ...
A CV will have many WorkInformations.
A WorkInformation belongs to many CVs.
However, when a user changes work information outside the scope of the CV, I don't want it to change within the current CVs.
Is it correct to have an extra table with the same information?
It's supposed to create a new "workinformation" from a copy of a "workinformation_that_shouldent.."
Or is there any other approach I should look into? I'm open to all suggestions; I'm new to designing relational databases.
No, I don't think you should have a different workinformation table.
Instead, you should have the CV point to a work information record. When the work information record changes outside the CV world, then create a new version of the record. That way, all work information records are in the same table. The ones that CVs refer to remain the same.
You can keep track of different versions of the same record in more than one way. A simple way is to have versions refer back to the base work information record, with another field having the version number.
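A sketch of that versioning idea, with assumed table and column names:

    CREATE TABLE work_information (
        work_information_id INT PRIMARY KEY,
        base_id             INT NOT NULL,      -- groups all versions of the same record
        version_number      INT NOT NULL,
        work_name           VARCHAR(200),
        start_date          DATE,
        end_date            DATE,
        UNIQUE (base_id, version_number)
    );

    -- The join table pins each CV to one specific version:
    CREATE TABLE cv_work_information (
        cv_id               INT NOT NULL,
        work_information_id INT NOT NULL REFERENCES work_information (work_information_id),
        PRIMARY KEY (cv_id, work_information_id)
    );

    -- Editing the record "outside the CV" inserts a new row with the same base_id and a
    -- higher version_number; existing CVs keep pointing at the old work_information_id.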
By the way, I find it unusual that a work information record would be referred to by multiple CVs.

Db design for data update approval

I'm working on a project where we need to have data entered or updated by some users go through a pending status before being added into 'live data'.
Whilst preparing the data, the user can save incomplete records. Whilst the data is in the pending status, we don't want it to affect rules imposed on users editing the live data, e.g. a user working on the live data should not run up against a unique constraint when entering the same data that is already in the pending status.
I envisage that sets of data updates will be grouped into a 'data submission', and the data will be re-validated and corrected/rejected/approved when someone quality-controls the submission.
I've thought about two scenarios with regards to storing the data:
1) Keeping the pending-status data in the same table as the live data, but adding a flag to indicate its status. I can see issues here with having to remove constraints or make required fields nullable to support the 'incomplete' data. Then there is the issue of how to handle updating existing data: you would have to add a new row for an update and link it back to the existing 'live' row. This seems a bit messy to me.
2) Adding new tables that mirror the live tables and storing the data there until it has been approved. This would allow me to keep full control over the existing live tables, while the 'pending' tables can be abused with whatever the user feels like putting in there. The downside is that I will end up with a lot of extra tables/SPs in the DB. Another issue I was thinking about is how a user might link between two records, where the record linked to might be a row in the live table or one in the pending table - but I suppose in this situation you could always take a copy of the linked record and treat it as an update?
Neither solution seems perfect, but the second one seems like the better option to me - is there a third solution?
Your option 2 very much sounds like the best idea. If you want to use referential integrity and all the nice things you get with a DBMS, you can't have the pending data in the same table. But there is no need for the data to be unstructured: pending data is still structured, and presumably you want the DB to play its part in enforcing rules even on this data. Even if you didn't, pending data fits well into a standard table structure.
A separate set of tables sounds like the right answer. You can bring the primary key of the row being changed into the pending table so you know which item is being edited, or which item is being linked to.
I don't know your situation exactly, so this might not be appropriate, but an idea would be to have a separate table for storing the batch of edits being made, because then you can quality-control a batch, or submit a batch to live. Each pending table could have a batch key so you know which batch a row is part of. You'll have to find a way to control multiple pending edits to the same rows (if you want to), but that doesn't seem too tricky a problem to solve.
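A rough sketch of those two points together, with made-up names (customer for the live table, customer_pending for its mirror):

    CREATE TABLE data_submission (           -- the batch of edits being quality-controlled
        submission_id INT PRIMARY KEY,
        submitted_by  VARCHAR(100),
        status        VARCHAR(20)            -- e.g. 'draft', 'submitted', 'approved', 'rejected'
    );

    CREATE TABLE customer_pending (
        pending_id       INT PRIMARY KEY,
        submission_id    INT NOT NULL REFERENCES data_submission (submission_id),
        live_customer_id INT NULL,            -- PK of the live row being edited; NULL for brand-new rows
        customer_name    VARCHAR(200) NULL,   -- columns mirrored from the live table, but relaxed
        email            VARCHAR(200) NULL    -- (nullable, no unique constraints) while incomplete
    );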
I'm not sure if this fits but it might be worth looking into 'Master Data Management' tools such as SQL Server's Master Data Services.
'Unit of work' is a good name for 'data submission'.
You could serialize it to a different place, like a (non-relational) document-oriented database, and only save it to the relational DB on approval.
It depends on how many of the live data constraints still need to apply to the unapproved data.
I think the second option is better. To manage this, you can use a view that contains both tables and work with this structure through the view.
Another good approach is to use an XML column in a separate table to store the necessary data (because the quantity/names of the columns are unknown). You can create just one table with an XML column and a "Type" column to determine which table the document is related to.
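As an illustration of the view suggestion, a sketch that unions live and pending rows and flags where each row came from, reusing the hypothetical customer/customer_pending tables from the sketch further up:

    CREATE VIEW customer_combined AS
    SELECT customer_id, customer_name, email, 'live' AS row_status
    FROM   customer
    UNION ALL
    SELECT live_customer_id, customer_name, email, 'pending' AS row_status
    FROM   customer_pending;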
The first scenario seems good.
Add a Status column to the table. There is no need to remove the required-field constraints outright; just add one function (or check) to validate the required fields based on the flag: if the flag is 1 (incomplete), NULL is allowed; otherwise it is not.
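One way to read that, sketched with a CHECK constraint and made-up names (customer, status_flag); it assumes the required columns themselves are made nullable and the conditional check takes over the enforcement:

    ALTER TABLE customer
        ADD status_flag INT NOT NULL DEFAULT 0;   -- 1 = incomplete/pending, 0 = complete/live

    ALTER TABLE customer
        ADD CONSTRAINT chk_required_when_complete
        CHECK (status_flag = 1 OR (customer_name IS NOT NULL AND email IS NOT NULL));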
Regarding the second doubt: do you want to append the data or update the whole data?