Insert into third table where 2 tables have same Id - sql

I have a huge database and want to process it in smaller chunks, so I'm writing scripts to copy rows into a temporary table, process them, and then copy them back.
I've copied around 1,000 rows into PersonMeta from the old database and now want to insert the corresponding rows into the People table.
So basically I want to insert data from olddb.People into newdb.People where newdb.PersonMeta and newdb.People have the same code.
I've created this script, but for some reason it doesn't copy all the rows: it copies 960 rows when it should copy 1,000.
INSERT INTO [newdb].[dbo].[People] ([Id]
,[Name]
,[PersonId])
SELECT fp.[Id]
,fp.[Name]
,fp.[PersonId]
FROM [olddb].[dbo].[People] fp
INNER JOIN [newdb].[dbo].[PersonMeta] pm on
pm.PersonId = fp.PersonId
edit: I originally wrote 100 rows where it should have been 1,000. So the query is selecting 960 rows (40 fewer).
edit 2: The People table had some duplicate values in the PersonId column. I removed them, and now the query copies 956 rows (4 fewer than before).
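(For reference, such duplicates can be found with a grouping query along these lines -- a sketch assuming the duplicates were in the source People table:)
SELECT [PersonId], COUNT(*) AS cnt
FROM [olddb].[dbo].[People]
GROUP BY [PersonId]
HAVING COUNT(*) > 1;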
edit 3: I created this fiddle and it seems to be working just fine.
However, I ran some queries directly against the database. It turns out that when I query with a RIGHT JOIN, the columns for the records that are not copied are all NULL. So when I run the following query:
Select fp.*, fp.personid, pm.personid
From [olddb].[dbo].[People] fp
right join [newdb].[dbo].[PersonMeta] pm on
fp.personid = pm.personid
it returns rows where the columns from People, including fp.personid, are all NULL for the records that were not copied.
Is there another approach I could try to copy the data?

There may be a NULL value in the PersonID field in either table. If so, remove or update the NULL records and try again.
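For example, a quick check along these lines (using the table names from the question) should surface any NULL PersonId values on either side:
SELECT COUNT(*) AS null_in_people
FROM [olddb].[dbo].[People]
WHERE [PersonId] IS NULL;

SELECT COUNT(*) AS null_in_personmeta
FROM [newdb].[dbo].[PersonMeta]
WHERE [PersonId] IS NULL;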

First, check separately what your query is actually generating.
Check the output of this query. It should return 1,000 rows, as you are expecting:
SELECT fp.[Id]
,fp.[Name]
,fp.[PersonId]
FROM [olddb].[dbo].[People] fp
INNER JOIN [newdb].[dbo].[PersonMeta] pm on
pm.PersonId = fp.PersonId
If it still returns more rows than you expect, you can add filters such as isnull(pm.PersonId, 0) <> 0 and isnull(fp.PersonId, 0) <> 0 to test the result.
This filters out records whose PersonId is NULL, which may be duplicating your records.
So the final query to test would be:
SELECT fp.[Id]
,fp.[Name]
,fp.[PersonId]
FROM [olddb].[dbo].[People] fp
INNER JOIN [newdb].[dbo].[PersonMeta] pm on
pm.PersonId = fp.PersonId and isnull(pm.PersonId,0)<>0 and isnull(fp.PersonId,0)<>0
If you still can't figure out the issue, please share the structure of your tables, which might help in understanding the problem.

OK, now I feel silly, but the problem was simply that not all of the rows in the PersonMeta table had a corresponding value in the old People table. I thought they did because I had used Id in the query rather than PersonId.
In short, the posted query was in fact correct.
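(In hindsight, a quick anti-join would have surfaced this immediately -- a sketch using the table names from the question:)
SELECT pm.[PersonId]
FROM [newdb].[dbo].[PersonMeta] pm
LEFT JOIN [olddb].[dbo].[People] fp
    ON fp.[PersonId] = pm.[PersonId]
WHERE fp.[PersonId] IS NULL;
Any rows returned are PersonMeta entries with no matching row in the old People table.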

Assuming you want to keep distinct, unique records in the new table:
The query below will create the same schema as the old table and copy all the data present in the old table to the new table.
select * into [newdb].[dbo].[People] from [olddb].[dbo].[People]
Now, if you want to keep the data in the new table in sync with the unique records present in [newdb].[dbo].[PersonMeta], you can simply do:
delete from [newdb].[dbo].[People] where personid not in (select personid from [newdb].[dbo].[PersonMeta] )

Related

How to get the differences between two - kind of - duplicated tables (sql)

Prolog:
I have two tables in two different databases, one an updated version of the other. For example, imagine that one year ago I duplicated table 1 in the new db (say, table 2), and from then on I worked only on table 2, never updating table 1.
I would like to compare the two tables to find the differences that have accumulated over this period (the tables have preserved the same structure, so the comparison is meaningful).
My approach was to create a third table, into which I would copy both table 1 and table 2, and then count the number of repetitions of every entry.
In my opinion this, together with a new attribute specifying for every entry the table it came from, would do the job.
Problem:
Copying the two tables into the third table, I get the (obvious) error of having duplicate key values in a unique or primary key constraint.
How could I bypass the error, or do the same job in a better way? Any idea is appreciated.
Something like this should do what you want if A and B have the same structure; otherwise just select and rename the columns you want to compare.
SELECT *
FROM B
WHERE NOT EXISTS (SELECT 1
                  FROM A
                  WHERE A.col = B.col AND ....)
If NOT EXISTS doesn't work in your DBMS, you could also use a left outer join, comparing the rows' column values.
SELECT A.*
FROM A
LEFT OUTER JOIN B
    ON A.col = B.col AND ....
WHERE B.col IS NULL
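The asker's original idea (a third table tagging each row with its source) can also work if you group rows instead of relying on a primary key. A sketch, where the shared key column id and the columns col1, col2 are hypothetical names:
SELECT id, col1, col2, MIN(src) AS src
FROM (
    SELECT id, col1, col2, 'table1' AS src FROM table1
    UNION ALL
    SELECT id, col1, col2, 'table2' AS src FROM table2
) AS both_tables
GROUP BY id, col1, col2
HAVING COUNT(*) = 1
Rows that are identical in both tables occur twice and drop out; each returned row exists in only one of the tables, and src tells you which one.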

Copy records missing from one table to a new table

I managed to delete 4,000 rows from a table in my 129,000-row production database (Postgres 9.4 on Heroku), but only identified the problem a few days later.
I have a backup from before the loss, but only want to selectively restore the missing rows back to the table, preserving their id's. (A complete restore is not an option as new data has since been added to the table.)
Into a local testing database I have imported the backed-up table as articles_backups, alongside the actual articles table. I want to find all the rows in articles_backups that are missing from articles and then copy these to a new table, articles_restores, which I will then restore to the production database, back into the articles table (preserving record id's).
This query successfully returns all the id's of the deleted records:
select articles_backups.id
from articles_backups
left outer join articles on (articles_backups.id = articles.id)
where articles.id is null
But I have not been able to copy the result to a new table. I have unsuccessfully tried:
select *
into articles_restores
from articles_backups
left outer join articles on (articles_backups.id = articles.id)
where articles.id is null;
Which gives:
ERROR: column "id" specified more than once
Basically your query with LEFT JOIN / IS NULL does what you are after:
Select rows which are not present in other table
You get the error because you select all columns from both tables, and there is an id column in both. It's not possible to create a new table with duplicate column names, and it's not what you want to begin with. Only select columns from articles_backups:
CREATE TABLE articles_restores AS
SELECT ab.*
FROM articles_backups ab
LEFT JOIN articles a USING (id)
WHERE a.id IS NULL;
While I was at it, I simplified your query syntax with table aliases. The USING clause is just for the convenience of shorter code. It folds the two id columns into one, but all other columns are still in there twice if you SELECT *.
Use CREATE TABLE AS. SELECT INTO is also defined by the SQL standard and implemented in Postgres, but its use is discouraged. It's used in PL/pgSQL functions for a different purpose. Details:
Creating temporary tables in SQL
You could use an EXCEPT to retrieve all the rows from articles_backups that are different from articles (assuming both tables have the same columns in the same order).
You could also create a temp table with this info, to make it easy on your repairing statements:
create table temp_articles as
select * from articles_backups
except
select * from articles
Step 1 - update the rows from articles_backups that are present in articles.
This step needs attention: you will have to establish a rule to choose between the data present in articles and the data present in temp_articles.
UPDATE articles a
SET col1 = b.col1,
    col2 = b.col2,
    (... other columns ...)
FROM (SELECT * FROM temp_articles) AS b
WHERE a.id = b.id AND /* your rule for data to be (or not) updated goes here */
Step 2 - insert the rows from articles_backups that are not present in articles (your deleted records):
insert into articles
select * from temp_articles where id not in (select id from articles)
Let us know if you need more help.

How to check if a set of rows already exists in the database and skip migrating them?

I need to create a package to migrate a large amount of data from one database table into a different database table. The source table will continuously receive new data for the next 4-5 days, so I will run my package again and again.
I need to migrate all the data from this table to another table, but I don't want to re-migrate the data I have already migrated. What kind of transformation do I need to use, or what SQL command do I need to write, to do this?
The usual way this is done is by having "audit" timestamps on the source table and migrating only records inserted or updated after the last migration.
for example:
Table Sales
sale_id
sale_date
sale_amount
...............
dw_create_date
dw_update_date
Your source extraction could be something along the lines of:
select sales.sale_id,
       sales.sale_date,
       ....
from sales
where sales.dw_update_date > {last_migration_date}
last_migration_date is usually read from a config file or table.
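For instance, with a one-row config table (migration_config and its last_migration_date column are hypothetical names), the extraction becomes self-contained:
select sales.sale_id,
       sales.sale_date,
       sales.sale_amount
from sales
where sales.dw_update_date > (select last_migration_date from migration_config);
After a successful run you would update migration_config with the new high-water mark.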
Other approaches
There are a few other approaches you could use, but all of them have growing performance problems as your data size increases.
1) Do a (source minus target) diff to get the changed rows in the source:
select *
from source
minus
select * from target
You could do the same using a join between source and target.
select src.*
from src
left join tgt on (src.id = tgt.id)
where (src.column1 <> tgt.column1 or
       src.column2 <> tgt.column2
       ............
      )
Note that neither of these approaches handles deletes in the source. If you want the tables to be fully in sync, the only way is to do a (source minus target) diff to get the insert/update changes and a (target minus source) diff to get the deleted rows, and apply both to the target.
2) Insert and ignore the primary key constraint error:
This has serious issues if the data can change in the source and you want the updates propagated to the target. You'd also be querying the entire source each time. It is usually better to use a merge/upsert together with filtered source data instead.
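A merge along these lines (sketched in T-SQL, with hypothetical table and column names) combines the timestamp filter with the upsert:
merge into target as tgt
using (select * from source
       where dw_update_date > @last_migration_date) as src
    on tgt.id = src.id
when matched then
    update set tgt.column1 = src.column1,
               tgt.column2 = src.column2
when not matched then
    insert (id, column1, column2)
    values (src.id, src.column1, src.column2);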
I would assume both tables have some unique identifier, no?
Table A has:
1
2
3
4
You're moving that to Table B, but keeping the data in Table A at the same time, yes?
So you've run your job once. Now Table B has:
1
2
3
4
Table A gets updated. It now has:
1
2
3
4
5
6
7
You run your job again, but you only want to send over 5,6,7.
SELECT *
FROM TableA
LEFT OUTER JOIN TableB ON TableA.ID = TableB.ID
WHERE TableB.ID IS NULL
If you have some sample data it would help. Does this give you a good idea?
See joins: http://i.stack.imgur.com/1UKp7.png

delete old values of a table and update the table with results of same query

My question is simple, but I can't find a way to delete the old values of a table and update the same table with the results of the same query.
UPDATE
The query is a SELECT on Table A, and the results become Table B. Table B should contain nothing but the result of the last query on Table A.
I have a very big table, and I need to process the records and create a new table regularly. The old values of this table are not important, only the new ones.
I will appreciate any help.
What about a view, if you only need table B to query on? You said you have a SELECT on table A. Let's say your select is SELECT * FROM TableA WHERE X = Y. Then your statement would be:
CREATE VIEW vwTableB AS
SELECT * FROM TableA WHERE X = Y
Then, instead of querying tableB, you would query vwTableB. Any changes to the data in table A are reflected in the view, so you don't have to keep running a script yourself.
This way the data in vwTableB is kept up to date, and you don't have to keep deleting from and inserting into the second table.
You can use a temporary table to store the results you are working with, if you only need them for one session; it will automatically be dropped when you sign out.
You didn't say which db you are using, but try this:
create temp table tableB as select * from tableA

How can I block bad rows from a delete query

I have a query that moves year-old rows from one table to an identical "archive" table.
Sometimes, invalid dates get entered into the dateprocessed column (used to evaluate whether a row is more than a year old), and the query errors out. I want to screen out the bad rows -- i.e. those where isdate(dateprocessed) does not equal 1 -- so that the query does not try to archive them.
I have a few ideas about how to do this, but I want to do it in the simplest way possible. If I select the good data into a temp table in my stored procedure, then inner join it with the live table and run the delete with its output going to archive -- will it delete from the underlying live table or from the joined result?
Is there a better way to do this? Thanks for the help. I am a .NET programmer playing DBA, but really want to do this properly.
Here is the query that errors when some of the dateprocessed column values are invalid:
delete from live
output deleted.* into archive
where isdate(dateprocessed) = 1
and cast (dateprocessed as datetime) < dateadd(year, -1, getdate())
and not exists (select * from archive where live.id = archive.id)
The simplest thing to do is:
1. Select the correct records into a temp table. One of the fields you copy into the temp table should be a unique identifier, like an "ID" column.
2. Do any additional processing in the temp table.
3. Archive from the temp table to the archive table.
4. Delete from the live table with a join to the temp table on the "ID" column. This will ensure no mistakes are made. (A sketch of these steps follows.)
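A sketch of those steps in T-SQL, assuming the table and column names from the question; the CASE ensures the CAST is only attempted on values that pass ISDATE, even if SQL Server reorders the predicates:
select *
into #to_archive
from live
where case when isdate(dateprocessed) = 1
           then cast(dateprocessed as datetime)
      end < dateadd(year, -1, getdate())
  and not exists (select 1 from archive where archive.id = live.id);

insert into archive
select * from #to_archive;

delete l
from live l
inner join #to_archive t on l.id = t.id;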
If you are a .NET guy, you could pull the data down and do a DateTime.TryParse. Better yet, just do it once to populate a real DateTime column. The dates that don't parse you could assign a fixed date or NULL. And there are some date strings that .NET will parse that SQL will not (e.g. "November 2010").
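The "real DateTime column" idea can also be done server-side -- a T-SQL sketch, with the new column name dateprocessed_dt as a hypothetical:
alter table live add dateprocessed_dt datetime null;

update live
set dateprocessed_dt = case when isdate(dateprocessed) = 1
                            then cast(dateprocessed as datetime)
                       end;  -- unparseable dates remain NULL
Once populated, the archive query can filter and compare on dateprocessed_dt directly, with no risk of conversion errors.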