How to combine identical tables into one table? - sql

I have 100s of millions of unique rows spread across 12 tables in the same database. They all have the same schema/columns. Is there a relatively easy way to combine all of the separate tables into 1 table?
I've tried importing the tables into a single table, but given the huge size of the files/rows, SQL Server makes me wait a long time, as if I were importing from a flat file. There has to be an easier/faster way, no?

You haven't given much info about your table structure, but you can probably just do a plain old INSERT from a SELECT, like below. The example takes all records from Table2 and Table3 that don't already exist in Table1 and inserts them into Table1. You could extend the same pattern to merge everything from all 12 of your tables into a single table.
INSERT INTO Table1
SELECT * FROM Table2
WHERE SomeUniqueKey NOT IN (SELECT SomeUniqueKey FROM Table1)
UNION
SELECT * FROM Table3
WHERE SomeUniqueKey NOT IN (SELECT SomeUniqueKey FROM Table1)
--...

Do what Jim says, but first:
1) Drop (or disable) all indices in the destination table.
2) Insert rows from each table, one table at a time.
3) Commit the transaction after each table is appended; otherwise a lot of log/disk space will be held back for a possible rollback.
4) Re-enable or recreate the indices after you are done.
If there is a possibility of duplicate keys, you may need to retain an index on the key field and use a NOT EXISTS clause to hold back the duplicate records from being added. A sketch of this pattern follows.
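A rough sketch of that pattern in SQL Server syntax; all object names here (Table1, IX_Table1_SomeOtherColumn, SomeUniqueKey) are placeholders, not from the original question:
-- Disable nonclustered indexes you don't need during the load
-- (keep the index on SomeUniqueKey so the NOT EXISTS check stays fast).
ALTER INDEX IX_Table1_SomeOtherColumn ON Table1 DISABLE;

-- Load one source table per transaction so each batch commits separately.
BEGIN TRANSACTION;
INSERT INTO Table1
SELECT s.*
FROM Table2 AS s
WHERE NOT EXISTS (SELECT 1 FROM Table1 AS t
                  WHERE t.SomeUniqueKey = s.SomeUniqueKey);
COMMIT TRANSACTION;

-- ...repeat the insert block for Table3 through Table12...

-- Rebuild (and thereby re-enable) the indexes when the loads are done.
ALTER INDEX ALL ON Table1 REBUILD;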


Transfer only rows that have different values from existing table data

I am using postgres and have 2 tables Transaction and Backup.
I would like to transfer rows of data from Transaction to Backup.
There will be new rows of data in Transaction table.
How do I transfer only the rows that have values different from the existing data in the Transaction table, as I do not want to end up with duplicate rows of data?
As long as the data in one of the columns is different, I will transfer the row from Transaction to Backup.
e.g
Day 1: Transaction (20 rows), Backup (20 rows) [all Transaction rows are backed up to Backup at night]
Day 2: Transaction (40 rows), Backup (20 rows) [the additional 20 rows in Transaction may duplicate the previous 20 rows in Transaction; I only want to transfer the non-duplicate rows to Backup]
Reading between the lines I think this is a harder question than you know.
The real issue here is that you don't really know what has changed. If information is append-only and we can assume all rows were visible when the last backup was made, then we can just select rows inserted after a point in time. If these are not good assumptions, then you are going to have a long-term issue. a_horse_with_no_name has an OK solution below, assuming your data is append-only (all inserts, no updates), but it is not going to perform very well as these tables get bigger.
A few options you might consider instead:
A table audit trigger would let you capture specific columns and values, as well as when they change, and it could do this in real time. That would be my first solution (a minimal sketch follows after these options).
Even if the table is insert-only, you may want to store information in the backup table about the maximum ID (or the like) that has been backed up, and only check back as far as the previous backup. Then you could use a_horse_with_no_name's solution as a template.
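If you go the trigger route, a minimal sketch in PostgreSQL might look like this; the column names (id, amount, created_at) are made up for illustration:
-- Copy every newly inserted Transaction row straight into Backup.
CREATE OR REPLACE FUNCTION copy_new_transaction_to_backup()
RETURNS trigger AS $$
BEGIN
    INSERT INTO backup (id, amount, created_at)
    VALUES (NEW.id, NEW.amount, NEW.created_at);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_transaction_backup
AFTER INSERT ON transaction
FOR EACH ROW
EXECUTE PROCEDURE copy_new_transaction_to_backup();
With this in place the backup table stays current row by row, so the nightly bulk copy (and its duplicate checking) is no longer needed.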
Suppose you want to transfer rows from table Source to table Result. From your question, I understand that they have the same columns.
As you mentioned, you need the values from Source that differ from the ones already in Result.
SELECT * FROM [Source]
WHERE column NOT IN (SELECT column FROM Result)
It will return the "new" records. Now you need to insert them:
INSERT INTO Result
SELECT * FROM [Source]
WHERE column NOT IN (SELECT column FROM Result)
Try this
insert into Backup (fields1, fields2, ......)
select fields1, fields2
from Transaction t
where <your condition by date here>
  and not exists (select *
                  from Backup b
                  where t.fields1 = b.fields1
                    and t.fields2 = b.fields2
                    .....................
                 )
This will insert a row whenever something changes in the Transaction table: if you change an existing row in the Transaction table (NULLs included), the changed row will be inserted into the Backup table. But you should not have a primary key on your Backup table, because otherwise you won't be able to insert that row:
duplicate key value violates unique constraint "tb_backup_pkey"
will come out.
This should work for the entire row:
insert into backup (col1, col2, col3, col4)
select t.* from transactions t
EXCEPT
select * from backup b
If I understand you correctly you want to copy data from the transactions table to the backup table.
Something like this should do it. As you haven't shown us the actual table definitions I had to use dummy names for the columns. pk_col is the primary key column of the table.
insert into backup (pk_col, col1, col2, col3, col4)
select t.*
from transactions t
full outer join backup b on t.pk_col = b.pk_col
where t is distinct from b;
This assumes that the target table does not have a unique key. If it does, you need to use an ON CONFLICT clause.
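For reference, a hedged sketch of that ON CONFLICT variant, using the same dummy column names as above and assuming PostgreSQL 9.5+ with a primary key on pk_col:
-- DO NOTHING skips rows whose key already exists; if updated rows should be
-- re-copied, you would need DO UPDATE instead.
INSERT INTO backup (pk_col, col1, col2, col3, col4)
SELECT t.pk_col, t.col1, t.col2, t.col3, t.col4
FROM transactions t
ON CONFLICT (pk_col) DO NOTHING;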

How to delete 3 billion rows from 2 related tables

I have a table with 5 billion rows (table1) and another table with 3 billion rows (table2). These two tables are related; table1 is a child of table2. I have to delete 3 billion rows from table1 and their related rows from table2. I tried using the FORALL method from PL/SQL, but it didn't help much. Then I thought of using an Oracle partitioning strategy. Since I am not a DBA, I would like to know: is partitioning an existing table on the primary key column possible for a selected set of IDs? My primary key is a 64-bit auto-generated number.
It is hard to partition the objects online (it can be done using dbms_redefinition), and it is not necessary with the details you gave.
The best idea would be to recreate the objects without the undesired rows.
For example, some simple code would look like this:
create table undesired_data as (select undesired rows from table1);
create table table1_new as (select * from table1 where key not in (select key from undesired_data));
create table table2_new as (select * from table2 where key not in (select key from undesired_data));
rename table1 to table1_old;
rename table2 to table2_old;
rename table1_new to table1;
rename table2_new to table2;
-- recreate constraints
-- check that everything is ok
drop table table1_old;
drop table table2_old;
This can be done by taking consumers offline, but the downtime for them would be very small if the scripts are OK (you should test them in a test environment first).
Sounds very dubious.
If it is a real use case, then you don't delete; you create another, well-defined table (partitioned as appropriate) and fill it using insert /*+ append */ into MyNewTable select .... (a sketch follows below).
The most common practice is to define partitions on dates (record create date, event date, etc.).
Again, if this is a real use case, I strongly recommend that you reach out for real help rather than seeking advice on the internet and doing it yourself.
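A hedged sketch of that rebuild idea in Oracle; the table name, columns (id, event_date, payload), and partition boundary below are made up for illustration and must be adapted to the real schema:
-- Create the replacement table, partitioned by a date column.
CREATE TABLE my_new_table (
    id          NUMBER(19)     NOT NULL,
    event_date  DATE           NOT NULL,
    payload     VARCHAR2(4000)
)
PARTITION BY RANGE (event_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
    PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01')
);

-- Direct-path insert of only the rows you want to keep.
INSERT /*+ APPEND */ INTO my_new_table
SELECT id, event_date, payload
FROM table1
WHERE id NOT IN (SELECT id FROM undesired_data);

COMMIT;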

Remove duplicate SQL rows by looking at all columns

I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There are 12 million rows in this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres, and will keep one of the duplicates
delete from the_table
where ctid not in (select min(ctid)
                   from the_table
                   group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time, especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
                    from the_table
                    group by the_table
                    having count(*) > 1);
You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and get the count, you should be able to see how many rows were inserted by your transaction and remove them with a DELETE command.
SELECT CREATEXID,COUNT(1)
FROM yourtable
GROUP BY 1;
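If that works as the answer describes, a hedged follow-up sketch: once the GROUP BY above has told you which transaction wrote the duplicates, delete just those rows. The value 1234567 below is a made-up placeholder for the CREATEXID you identified.
-- Remove the rows created by the offending transaction.
DELETE FROM yourtable
WHERE createxid = 1234567;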
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
It is a bit of a trick, but it will probably help.
Each row in the table carries the ID of the transaction in which the row was inserted/updated (see System Columns); it is the xmin column. Using it, you can find the transaction ID in which you inserted the wrong data. Then just delete the rows using
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on some test table first.
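A small sketch of how you might find that transaction ID in the first place (plain PostgreSQL; xmin is cast to text so it can be grouped, and my_table is the placeholder name from the answer above):
-- List candidate transaction IDs and how many rows each one wrote.
SELECT xmin::text AS tx_id, count(*) AS rows_written
FROM my_table
GROUP BY xmin::text
ORDER BY count(*) DESC;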

How to delete all data then insert new data

I have a process that runs every 60 minutes. On one table I need to remove all the data and then insert records from a different table. The problem is that it takes a long time to delete and reinsert the data, and I am afraid the users will see the table while it has no data. Is there a way to refresh the data without users seeing this?
If you want to remove all data from the table, then use TRUNCATE TABLE instead of DELETE; it'll do it faster.
As for the insert, it is a bit hard to say because you did not give any details, but here is what you can try:
Option 1 - Using temp table
create table table_temp as select * from original_table where rownum < 1;
-- insert into table_temp here
drop table original_table;
exec sp_rename 'table_temp', 'original_table';
Option 2 - Use 2 tables ("active-passive")
Have 2 tables for the data and a view to select over them. The view will join with a third table that specifies which of the two tables to select from; kind of an "active-passive" concept.
To demonstrate the concept (a fuller sketch of the view itself follows below):
with active_table as ( select 'table1_active' active_table )
select 1 data
where 'table1_active' in (select * from active_table)
union all
select 2
where 'table2_active' in (select * from active_table)
-- this returns only one record, with the "1"
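A hedged sketch of what the actual active/passive view could look like; all object names (report_data, report_data_a/b, active_config, active_name) are made up:
-- Two copies of the data plus a one-row switch table.
-- active_config.active_name says which copy the view should expose.
CREATE VIEW report_data AS
SELECT d.*
FROM report_data_a d
JOIN active_config c ON c.active_name = 'report_data_a'
UNION ALL
SELECT d.*
FROM report_data_b d
JOIN active_config c ON c.active_name = 'report_data_b';

-- Reload the passive copy, then flip the switch with one tiny update:
UPDATE active_config SET active_name = 'report_data_b';
Readers of the view never see an empty table; they only ever see whichever copy is currently marked active.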
Are you truncating instead of deleting? A truncate (while logged) is much, much faster than a delete.
If you cannot truncate, try deleting 1,000-10,000 rows at a time (smaller log buildup, and a great increase in speed when deleting large numbers of rows); see the sketch below.
If you really want fast performance you can create a second table, fill it with data, and then drop the first table and rename the second table to the first table's name. You will lose all the permissions on the table when you do this, so be sure to reapply the permissions to the renamed table.
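A minimal sketch of the batched delete, in SQL Server syntax; the table name is a placeholder:
-- Delete in small batches so the log stays small and locks are short-lived.
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM dbo.YourTable;
    IF @@ROWCOUNT = 0 BREAK;
END;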
If you are deleting all rows in a table, you can consider using a TRUNCATE statement against the table instead of a DELETE. It will speed up part of your process. Keep in mind that this will reset any identity seeds you may have on the table.
As suggested, you can wrap this process in a transaction and depending on how you set your transaction isolation level, you can control what your users will see if they query the data during the transaction.
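A minimal sketch of that transactional refresh, assuming SQL Server (where TRUNCATE TABLE is fully transactional) and placeholder table names:
-- Depending on isolation level (e.g. SNAPSHOT), readers see either the old
-- data or the new data, rather than an empty table.
BEGIN TRANSACTION;
    TRUNCATE TABLE dbo.TargetTable;
    INSERT INTO dbo.TargetTable
    SELECT *
    FROM dbo.SourceTable;
COMMIT TRANSACTION;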
Make it sequence-based: the copied-in records all carry a series number (the same for all records copied in together), and another table holds which sequence is active; you always select via a join to this table. When you copy in new records, they get a new sequence number that is not yet active; once they have all been copied in, the sequence table is updated to the new sequence. The redundant old-sequence records can then be deleted at your leisure.
Example
Let's suppose your table has field SeriesNo added and table ActiveSeries has field SeriesNo.
All queries of your table:
SELECT *
FROM YourTable Y
JOIN ActiveSeries A
ON A.SeriesNo = Y.SeriesNo
Then updating SeriesNo in ActiveSeries makes the new series of records available instantly.
I would follow the approach below while I troubleshoot why the delete and reinsert are taking so long.
Create a new table (t1) which has the same data as the old table (maintable).
Now do your stuff on t1.
When your stuff is done, rename t1 to maintable.

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify which columns I want to update. Unfortunately I am lazy, and the structure of people may change over time; while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work.
Is there a way to merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT will direct SQL to insert values by order; can the same methodology be applied here, and is it safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM people p
WHERE EXISTS ( SELECT 1
               FROM people_backup b
               WHERE p.id = b.id );

INSERT INTO people
SELECT *
FROM people_backup;
That is slow and not particularly elegant (particularly if most of the data from the backup hasn't changed) but assuming the columns in the two tables match, it does allow you to not list out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.
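If you are worried about whether SELECT * is safe here, a quick sanity check against the Oracle data dictionary can confirm that both tables expose the same columns, in the same order, before you rely on it (table names taken from the question):
-- Compare column order and data types of the two tables side by side.
SELECT table_name, column_id, column_name, data_type
FROM user_tab_columns
WHERE table_name IN ('PEOPLE', 'PEOPLE_BACKUP')
ORDER BY column_id, table_name;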