What is the best way to copy data from related tables to other related tables? - sql

What is the best way to copy data from related tables to other related tables with the same schema? The tables are connected with a one-to-many relationship.
Consider the following schema:
firm
id | name | city.id (FK)
employee
id | lastname | firm.id (FK)
firm2
id | name | city_id (FK)
employee2
id | lastname | firm2.id (FK)
What I want to do is copy rows from firm with a specific city.id to firm2, and copy their associated employees from employee to table employee2.
I use PostgreSQL 9.0, so I have to call SELECT nextval('seq_name') to get a new id for a table.
Right now I do this by simply iterating over all the rows in my Java backend server, but on a large amount of data (50,000 employees and 2,000 firms) it takes too much time (1-3 minutes).
I'm wondering whether there is another, smarter way to do it, for example selecting the data into a temp table? Or perhaps using a stored procedure and iterating over the rows with a cursor to avoid buffering on my backend server?

This is one problem caused by simply using a sequence or identity value as your sole primary key in a table.
If there is a real-life unique index/primary key, then you can join on that. The other option would be to create a mapping table as you fill in the tables with sequences; then you can insert into the child tables' FKs by joining to the mapping table. It doesn't completely remove the need for looping, but at least some of the inserts get moved into a set-based approach.
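A minimal sketch of the mapping-table idea in PostgreSQL, assuming the FK columns are named city_id, firm_id and firm2_id and the sequences are called firm2_id_seq and employee2_id_seq (all of these names are assumptions):

-- Build a mapping from old firm ids to freshly generated firm2 ids
CREATE TEMP TABLE firm_map AS
SELECT id AS old_id, nextval('firm2_id_seq') AS new_id
FROM firm
WHERE city_id = 42;   -- the specific city

-- Copy the firms using the mapped ids
INSERT INTO firm2 (id, name, city_id)
SELECT m.new_id, f.name, f.city_id
FROM firm f
JOIN firm_map m ON m.old_id = f.id;

-- Copy the employees, translating the firm FK through the mapping
INSERT INTO employee2 (id, lastname, firm2_id)
SELECT nextval('employee2_id_seq'), e.lastname, m.new_id
FROM employee e
JOIN firm_map m ON m.old_id = e.firm_id;

This does the whole copy in three set-based statements instead of one round trip per row from the Java side.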

Related

MS SQL Server Optimize repeated column values

I was requested to create a table that will contain many repeated values and I'm not sure if this is the best way to do it.
I must use SQL Server. I would love to use Azure Table Storage and partition keys, but I'm not allowed to.
Imagine that the table Shoes has the columns
id int, customer_name varchar(50), shoe_type varchar(50)
The problem is that the column shoe_type will have millions of repeated values, and I want to have them in their own partition, but as far as I know SQL Server only allows range partitioning.
I don't want the repeated values to take more space than needed, meaning that if a column value is repeated 50 times, I don't want it to take 50 times the space, just one time's worth.
I thought about using a relationship between the column shoe_type (as an int) and another table which would hold its string value, but is that the most I can optimize?
EDIT
Shoes table data
id  customer_name  shoe_type
----------------------------
1   a              nike
2   b              adidas
3   c              adidas
4   d              nike
5   e              adidas
6   f              nike
7   g              puma
8   h              nike
As you can see, the rows contain repeated shoe_type values (nike, adidas, puma).
What I thought about is using the shoe_type column as an int foreign key to another table, but I'm not sure whether this is the most efficient way to do it, because Azure Table Storage has partitions and partition keys, whereas MS SQL Server only has range-based partitions.
The sample data you provide suggests that there is a "shoe type" entity in the business domain, and that all shoes have a mandatory relationship to a single shoe type. It would be different if the values were descriptive text - e.g. "Attractive running shoe, suitable for track and leisure wear". Repeated values are often (but of course not always) an indicator that there is another entity you can extract.
You suggest that the table will have millions of records. In very general terms, I recommend designing your schema to reflect the business domain, and only go for exotic optimization options once you know, and can measure, that you have a performance problem.
In your case, I'd suggest factoring out a separate table called "shoe_types", and including a foreign key relationship from "shoes" to "shoe_types". The primary key of "shoe_types" should be a clustered index, and the "shoe_type_id" column in "shoes" should have a regular (nonclustered) index. All things being equal, even with (tens of) millions of rows, queries that hit the foreign key index should be very fast.
In addition, supporting queries like "find all shoes where shoe type name starts with 'nik%'" should be much faster, because the shoe_types table should have far fewer rows than "shoes".
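A minimal T-SQL sketch of that layout; the table and column names are assumptions for illustration:

CREATE TABLE shoe_types (
    id   int IDENTITY(1,1) NOT NULL,
    name varchar(50) NOT NULL,
    CONSTRAINT PK_shoe_types PRIMARY KEY CLUSTERED (id)  -- clustered PK
);

CREATE TABLE shoes (
    id            int IDENTITY(1,1) NOT NULL,
    customer_name varchar(50) NOT NULL,
    shoe_type_id  int NOT NULL,
    CONSTRAINT PK_shoes PRIMARY KEY CLUSTERED (id),
    CONSTRAINT FK_shoes_shoe_types FOREIGN KEY (shoe_type_id)
        REFERENCES shoe_types (id)
);

-- Regular (nonclustered) index on the FK column to support joins and lookups
CREATE NONCLUSTERED INDEX IX_shoes_shoe_type_id ON shoes (shoe_type_id);

With this split, each shoe type string is stored once in shoe_types rather than once per shoe row, which addresses the space concern directly.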

How to control version of records with referenced records (Foreign keys)

I am developing an application in which I have multiple tables:
Table Name : User_master
|Id (PK,AI) | Name | Email | Phone | Address
Table Name: User_auth
|Id (PK,AI) | user_id(FK_userMaster) | UserName | password | Active_Status
Table Name: Userbanking_details
|Id (PK,AI) | user_id(FK_userMaster) | Bank Name | account Name | IFSC
Now, what I want is that records should not be updated directly; instead the changes should be versioned, meaning I want to keep a log of all previous updates the user has made.
That means that if a user updates the address, the previous address should also be kept as history in the table.
I have tried it by adding the fields version_name, version_latest and updated_version_of, and inserting a new record on every update, like this:
|Id (PK,AI) | Name | Email | Phone | Address | version_name | version_latest | updated_version_of
1 | ABC | ABC@gm.com | 741852 | LA | 1 | 0 | 1
2 | ABC | ABC@gm.com | 852741 | NY | 2 | 1 | 1
Now the problem is that the user table is referenced by FK from the other two listed tables, so when I update a record this way, the relationship is lost because of the new ID.
I want to preserve the old data, shown as old, while the newly updated record takes effect only for new transactions.
How can I achieve this?
Depending upon your use case, you can either add a JSON field to your tables for storing the previous states, or create an identical history table for each table.
Dump the entire record into the JSON history column every time the user updates anything.
Or insert a new row in the history table for each update in the original.
Storing historical records and current records in the same table is not good practice in a transactional system.
The reasons are:
There will be more I/O due to scanning a larger number of pages to identify a record
Additional maintenance effort on the table
Transactions get bigger and longer, and can cause timeout issues
Additional effort of cascading referential integrity changes to child tables
I would suggest keeping historical records in a separate table. You can use the OUTPUT clause to capture the historical records and insert them into that separate table. That way, your referential integrity remains the same. In the historical table, you don't need to define a PK.
Below is a sample using the OUTPUT clause with UPDATE. You can read more about the OUTPUT clause here.
DECLARE @Updated table( [ID] int,
                        [Name_old] varchar(50),
                        [Email_old] varchar(50),
                        [Phone_old] varchar(50),
                        [Address_old] varchar(50),
                        [ModifiedDate_old] datetime);

UPDATE User_Master
SET Email = 'NewEmail@Email.com', Name = 'newName', Phone = 'NewPhone',
    Address = 'NewAddress', ModifiedDate = GETDATE()
OUTPUT deleted.Id AS ID, deleted.Name AS Name_old, deleted.Email AS Email_old,
       deleted.Phone AS Phone_old, deleted.Address AS Address_old,
       deleted.ModifiedDate AS ModifiedDate_old
INTO @Updated
WHERE [Id] = 1;

INSERT INTO User_Master_History
SELECT * FROM @Updated;
When I have faced this situation in the past I have solved it in the following ways:
First Method
Recommended method.
Have a second table which acts as a change history. Because you are not adding rows to the main table, your foreign keys maintain integrity.
There are now mechanisms in SQL Server to do this automatically:
SQL Server 2016 Temporal Tables
Change Data Capture (available since SQL Server 2008)
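As a rough illustration of the first option, a system-versioned temporal table in SQL Server 2016+ might look like this (column names follow the question; the history table name is an assumption):

CREATE TABLE User_Master
(
    Id      int IDENTITY(1,1) PRIMARY KEY,
    Name    varchar(50),
    Email   varchar(50),
    Phone   varchar(50),
    Address varchar(100),
    -- Period columns required for system versioning
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.User_Master_History));

-- Older versions can then be queried with FOR SYSTEM_TIME, for example:
-- SELECT * FROM User_Master FOR SYSTEM_TIME ALL WHERE Id = 1;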
Second Method
I don't recommend this as a good design, but it does work.
Treat one record as the primary record, and this record maintains a foreign key relationship with records in other related tables which are subject to change tracking.
Always update this primary record with any changes thereby maintaining integrity of the foreign keys.
Add a self-referencing key to this table, e.g. Parent, and a date-of-change column.
Each time the primary record is updated, store the old values into a new record in the same table, and set the Parent value to the id of the primary record. Again the primary record never changes, and therefore your foreign key relationships maintain integrity.
Using the date-of-change column in conjunction with the change history allows you to reconstruct the exact values at any point in time.
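A rough sketch of this second method, assuming Parent and ModifiedDate columns and an identity Id (all names here are illustrative):

ALTER TABLE User_Master ADD Parent int NULL;          -- NULL for the primary (current) record
ALTER TABLE User_Master ADD ModifiedDate datetime NULL;

-- Before overwriting the primary record (Id = 1), copy its old values
-- into a new row that points back to it via Parent.
INSERT INTO User_Master (Name, Email, Phone, Address, Parent, ModifiedDate)
SELECT Name, Email, Phone, Address, Id, GETDATE()
FROM User_Master
WHERE Id = 1;

-- Then update the primary record in place, so all foreign keys stay valid.
UPDATE User_Master
SET Address = 'NewAddress', ModifiedDate = GETDATE()
WHERE Id = 1;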

Is it good practice to have two SQL tables with bijective row correspondence?

I have a table of tasks,
id | name
----+-------------
1 | brush teeth
2 | do laundry
and a table of states.
taskid | state
--------+-------------
1 | completed
2 | uncompleted
There is a bijective correspondence between the tables, i.e.
each row in the task table corresponds to exactly one row in the state table.
Another way of implementing this would be to place a state row in the task table.
id | name | state
----+-------------+-------------
1 | brush teeth | completed
2 | do laundry | uncompleted
The main reason why I have chosen to use two tables instead of this one is that updating the state would then cause a change in the task id.
I have other tables referencing the task(id) column, and do not want to have to update all those other tables too when altering a task's state.
I have two questions about this.
Is it good practice to have two tables in bijective row-row correspondence?
Is there a way I can ensure a constraint that there is exactly one row in the state table corresponding to each row in the task table?
The system I am using is postgresql.
You can ensure the 1-1 correspondence by making the id in each table a primary key and a foreign key that references the id in the other table. This is allowed and it guarantees 1-1'ness.
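A minimal sketch of this in Postgres (constraint names are illustrative; the deferrable reverse FK is what makes it possible to insert a task/state pair inside one transaction):

CREATE TABLE task (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE state (
    taskid integer PRIMARY KEY REFERENCES task (id),
    state  text NOT NULL
);

-- Close the loop: every task must also have a state row.
ALTER TABLE task
    ADD CONSTRAINT task_state_fk FOREIGN KEY (id)
    REFERENCES state (taskid) DEFERRABLE INITIALLY DEFERRED;

BEGIN;
INSERT INTO task  VALUES (1, 'brush teeth');
INSERT INTO state VALUES (1, 'completed');
COMMIT;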
Sometimes, you want such tables, but one table has fewer rows than the other. This occurs when there is a subsetting relationship, and you don't want the additional columns on all rows.
Another purpose is to store separate columns in different places. When I learned about databases, this approach was called vertical partitioning. Nowadays, columnar databases are relatively common; these take the notion to the extreme -- a separate "store" for each column (although the "store" is not exactly a "table").
Why would you do this? Here are some reasons:
You have infrequently used columns that you do not want to load for every query on the more frequent columns.
You have frequently updated columns and you do not want to lock the rest of the columns.
You have too many columns to store in one row.
You have different security requirements on different columns.
Postgres does offer other mechanisms that you might find relevant. In particular, table inheritance might be useful in your situation.
All that said, you would not normally design a database like this. There are good reasons for doing so, but it is more typical to put all columns related to an entity in the same table.

PHP Junction Table Relations (Many to Many), grasping concept

So I've tried searching and have yet to find out how to grasp this entirely.
I'm reorganising my database because I was storing user ids as comma-separated values in a column within that row to control permissions. To me, this seems like a better and faster (hardware-wise) way, but I'm moving towards the "proper" way now.
I understand that you need 3 tables. This is what I have.
Table 1. members -> ID | user_name
Table 2. teams -> ID | team_name
Table 3. team_members -> ID | team_fk | member_fk
I understand how to store data in another column and use SQL to display it. What I'm confused about is why I have to link (relate) the columns to the IDs of the other tables. I could get the data without using the relation. I'm confused by what it even does.
Furthermore, I would like to have multiple values that determine permissions for each team. Would I do:
Table 3. team_members -> ID | team_fk | member_fk | leader_fk | captain_fk
^setting 0 or 1(true or false) for the leader and captain.
Or would I create a table(like team_leaders, team_captains) for each permission?
Thanks for the help!
Ryan
It seems that "leader", "captain and "regular member" are roles in your team. So you can create table team_roles, or just assign roles as strings to your relation table, i.e.
team_members -> ID | team_fk | member_fk | role
The key thing about this is to keep your database normalised (https://en.wikipedia.org/wiki/Database_normalization). It is really easier to work with a normalised database in most cases.
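As a rough sketch (MySQL/InnoDB flavour; the column types and constraint names beyond those in the question are assumptions), the relation table could look like:

CREATE TABLE team_members (
    ID        INT AUTO_INCREMENT PRIMARY KEY,
    team_fk   INT NOT NULL,
    member_fk INT NOT NULL,
    role      VARCHAR(20) NOT NULL DEFAULT 'member',  -- e.g. 'member', 'captain', 'leader'
    FOREIGN KEY (team_fk)   REFERENCES teams (ID),
    FOREIGN KEY (member_fk) REFERENCES members (ID),
    UNIQUE KEY uq_team_member (team_fk, member_fk)    -- one row per member per team
) ENGINE=InnoDB;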
What I'm confused about is why I have to link(relation) the columns to the ID's of the other tables. I could get the data without using the relation.
You don't have to declare columns as foreign keys. It's just a good idea. It serves the following purposes:
It tells readers of the schema how the tables are related to each other. If you name the columns well, this is redundant -- team_fk is pretty obviously a reference to the teams table.
It enables automatic integrity checks by the database. If you try to create a team_members row that contains a team_fk or member_fk that isn't in the corresponding table, it will report an error. Note that in MySQL, this checking is only done by the InnoDB engine, not MyISAM.
Indexes are automatically created for the foreign key columns, which helps to optimize queries between the tables.
Table 3. team_members -> ID | team_fk | member_fk | leader_fk | captain_fk
If leader and captain are just true/false values, they aren't foreign keys. A foreign key column contains a reference to a key in another table. So I would call these is_leader and is_captain.
But you should only put these values in the team_members table if a team can have multiple captains and leaders. If there's just one of each, they should be in the teams table:
teams -> ID | team_name | leader_fk | captain_fk
where leader_fk and captain_fk are IDs from the members table. This will ensure that you can't inadvertently assign is_captain = 1 to multiple members from the same team.
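For illustration, that variant might look roughly like this (again MySQL-style; column types are assumptions):

CREATE TABLE teams (
    ID         INT AUTO_INCREMENT PRIMARY KEY,
    team_name  VARCHAR(50) NOT NULL,
    leader_fk  INT NULL,   -- member who leads the team
    captain_fk INT NULL,   -- member who captains the team
    FOREIGN KEY (leader_fk)  REFERENCES members (ID),
    FOREIGN KEY (captain_fk) REFERENCES members (ID)
) ENGINE=InnoDB;

Because each team row has exactly one leader_fk and one captain_fk, a team cannot end up with two captains by accident.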

How to merge data from two identical databases into one?

Two customers are going to merge. They are both using my application, each with their own database. In a few weeks they will merge (they become one organisation). So they want to have all the data in one database.
The two database structures are identical. The problem is with the data. For example, I have the tables Locations and Persons (these are just two tables out of 50):
Database 1:
Locations:
Id  Name        Address  etc....
1   Location 1
2   Location 2
Persons:
Id  LocationId  Name     etc...
1   1           Alex
2   1           Peter
3   2           Lisa
Database 2:
Locations:
Id  Name        Address  etc....
1   Location A
2   Location B
Persons:
Id  LocationId  Name     etc...
1   1           Mark
2   2           Ashley
3   1           Ben
We see that a person is related to a location (column LocationId). Note that I have more tables that refer to the locations table and the persons table.
The databases contain their own locations and persons, but the Ids can be the same. For example, when I want to import everything into DB2, the locations of DB1 should be inserted into DB2 with the ids 3 and 4. The persons from DB1 should then get the new Ids 4, 5 and 6, and the LocationId values in the persons table also have to be changed to the new location ids 3 and 4.
My solution for this problem is to write a query which handles everything, but I don't know where to begin.
What is the best way (in a query) to renumber the Id fields and also cascade the change to the children? The databases do not contain referential integrity and foreign keys (foreign keys are NOT defined in the database). Creating foreign keys and cascading is not an option.
I'm using sql server 2005.
You say that both customers are using your application, so I assume that it's some kind of "shrink-wrap" software that is used by more customers than just these two, correct?
If yes, adding special columns to the tables or anything like that will probably cause pain in the future, because either you would have to maintain a special version for these two customers that can deal with the additional columns, or you would have to introduce these columns into your main codebase, which means that all your other customers would get them as well.
I can think of an easier way to do this without changing any of your tables or adding any columns.
In order for this to work, you need to find out the largest ID that exists in both databases together (no matter in which table or in which database it is).
This may require some copy & paste to get a lot of queries that look like this:
select max(id) as maxlocationid from locations
select max(id) as maxpersonid from persons
-- and so on... (one query for each table)
When you find the largest ID after running the query in both databases, take a number that's larger than that ID, and add it to all IDs in all tables in the second database.
It's very important that this number be larger than the largest ID that already exists in either database!
It's a bit difficult to explain, so here's an example:
Let's say that the largest ID in any table in both databases is 8000.
Then you run some SQL that adds 10000 to every ID in every table in the second database:
update Locations set Id = Id + 10000
update Persons set Id = Id + 10000, LocationId = LocationId + 10000
-- and so on, for each table
The queries are relatively simple, but this is the most work because you have to build a query like this manually for each table in the database, with the correct names of all the ID columns.
After running the query on the second database, the example data from your question will look like this:
Database 1: (exactly like before)
Locations:
Id  Name        Address  etc....
1   Location 1
2   Location 2
Persons:
Id  LocationId  Name     etc...
1   1           Alex
2   1           Peter
3   2           Lisa
Database 2:
Locations:
Id     Name        Address  etc....
10001  Location A
10002  Location B
Persons:
Id     LocationId  Name     etc...
10001  10001       Mark
10002  10002       Ashley
10003  10001       Ben
And that's it! Now you can import the data from one database into the other, without getting any primary key violations at all.
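If both databases live on the same SQL Server instance, the import itself can then be a set of INSERT ... SELECT statements, roughly like the sketch below (the DB1/DB2 names and column lists are assumptions, and the IDENTITY_INSERT lines are only needed if the Id columns are identity columns):

SET IDENTITY_INSERT DB2.dbo.Locations ON;
INSERT INTO DB2.dbo.Locations (Id, Name, Address)
SELECT Id, Name, Address FROM DB1.dbo.Locations;
SET IDENTITY_INSERT DB2.dbo.Locations OFF;

SET IDENTITY_INSERT DB2.dbo.Persons ON;
INSERT INTO DB2.dbo.Persons (Id, LocationId, Name)
SELECT Id, LocationId, Name FROM DB1.dbo.Persons;
SET IDENTITY_INSERT DB2.dbo.Persons OFF;

-- ...and so on for each remaining table.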
If this were my problem, I would probably add some columns to the tables in the database I was going to keep. These would be used to store the pk values from the other db. Then I would insert records from the other tables. For the ones with foreign keys, I would use a known value. Then I would update as required and drop the columns I added.
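A rough sketch of that approach for the two example tables, assuming the kept database's Id columns are identity columns and the other database is reachable as DB1 on the same server (all names are illustrative):

-- Temporary mapping columns in the database being kept
ALTER TABLE Locations ADD old_id int NULL;
ALTER TABLE Persons   ADD old_id int NULL;

-- Copy the other database's locations; new Ids are generated, old ones remembered
INSERT INTO Locations (Name, Address, old_id)
SELECT Name, Address, Id FROM DB1.dbo.Locations;

-- Copy the persons, temporarily keeping the old LocationId as the "known value"
INSERT INTO Persons (LocationId, Name, old_id)
SELECT LocationId, Name, Id FROM DB1.dbo.Persons;

-- Re-point the imported persons at the newly generated location Ids
UPDATE p
SET p.LocationId = l.Id
FROM Persons p
JOIN Locations l ON l.old_id = p.LocationId
WHERE p.old_id IS NOT NULL;

-- Clean up the helper columns
ALTER TABLE Locations DROP COLUMN old_id;
ALTER TABLE Persons   DROP COLUMN old_id;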