Insert all data from an already sorted table into another - sql

I have two tables in an Amazon Redshift cluster that both use a timestamp as the sort key. The first table is sorted and contains only data from timepoint 1 to timepoint 2. The second table is only temporary but also sorted, and contains data from timepoint 3 to timepoint 4. Is there any way to insert all the data from the first table into the second without having to run VACUUM on the table afterwards? A normal INSERT from one table to another always needs a VACUUM afterwards, as far as I know.
I know it would be possible if I used COPY on a pre-sorted flat file. But is there also a solution for two pre-sorted tables that does not need a VACUUM?

Option 1:
Create a new table, say final_table, with the same schema as table2, since you want to copy the content of table1 into table2.
Please check:
select "column", type, encoding
from pg_table_def where tablename='table2'
This will give the encoding used for each column of table2. Create the new final_table with the same encoding for each column (a sketch of the create statement follows the insert queries below).
Then load the data into final_table in sorted order:
insert into final_table (select * from table1 order by timepoint asc);
and then:
insert into final_table (select * from table2 order by timepoint asc);
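A minimal sketch of that create statement, assuming the sort key column is called timepoint and using a single placeholder value column; take the real column list and encodings from the pg_table_def query above:
create table final_table (
    timepoint timestamp encode raw,     -- sort key column; raw encoding assumed
    some_value varchar(100) encode lzo  -- placeholder for the real columns and their encodings
)
sortkey (timepoint);
The two ordered inserts above then fill it.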
Option 2:
Create the final table and load the data for timepoint 1, then for timepoint 2, and so on until all time points are loaded in sorted order.
Option 3:
You can check the Redshift deep copy option as well;
here is the link: http://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html
While doing the deep copy, load the data for table1 first and then table2.
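A sketch of one deep-copy pattern from that page, adapted to this case; final_table is a placeholder name and timepoint is assumed to be the sort key column:
-- the new table inherits the column definitions, encodings and sort key of table1
create table final_table (like table1);

-- load table1 first, then table2, each in sort-key order
insert into final_table (select * from table1 order by timepoint asc);
insert into final_table (select * from table2 order by timepoint asc);

-- optionally swap the copy into place
drop table table1;
alter table final_table rename to table1;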

I have tried this query in SQL Server:
SELECT * INTO table_name FROM old_table_name

Related

How to stop partition column appearing last in a SELECT * output, in Hive?

In Apache Hive, I'm trying to copy specific rows from one table to a second table that's identical apart from an additional string column (which I'm calling "report-type") at the end of the second table. Both tables are partitioned by a string field called 'dt' which has a date e.g. "2022-08-04". When I try and copy a row from table 1 to table 2, the data is inserted into table 2 with report-type and dt swapped, because the partition column seems to be forcibly listed last.
E.g. INSERT INTO table2 SELECT *, 'some_report_type' FROM table1 WHERE <some criteria>;
gives all the data in table2 in the correct columns, except that report-type ends up as e.g. "2022-08-04" and dt as e.g. "some_report_type"
Is there any way around this?
Two solutions I can see are to recreate the table without the partitioning (which I'd ideally avoid) and just have dt as a regular non-partition column, or alternatively to specify each of the columns in a column list in the query, but I'm not sure that would stop "dt" being forced into the last position, and the main issue with that is I have 830 columns to specify individually.
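For reference, this is roughly what I mean by listing the columns explicitly; col1 and col2 stand in for the real columns, and as far as I know a dynamic-partition insert also needs the settings below:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE table2 PARTITION (dt)
SELECT col1, col2,                        -- stand-ins for the ~830 real columns
       'some_report_type' AS report_type, -- the new column, placed just before dt
       dt                                 -- the dynamic partition column has to come last
FROM table1
WHERE col1 = 'x';                         -- placeholder for the real criteria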
Thanks

Deleting completely identical duplicates from db

We have a table in our db with copied data that has completely duplicated many rows. Because the id is also duplicated, there is nothing we can use to select just the duplicates. I tried using a LIMIT to delete only one, but Redshift gave a syntax error when trying to use LIMIT in a DELETE.
Any ideas how we can delete just one of two rows that have completely identical information?
Use select distinct to create a new table. Then either truncate & copy the data, or drop the original table and rename the new table to the original name:
create table t2 as select distinct * from t;
truncate t;
insert into t select * from t2;
drop table t2;
Add a column with unique values; identity(seed, step) looks interesting.
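A rough sketch of that identity idea, assuming the table is t with two placeholder columns col1 and col2; as far as I know Redshift can't add an IDENTITY column to an existing table, so the data is rebuilt into a numbered copy first:
create table t_numbered (
    rid bigint identity(0, 1),  -- surrogate id: seed 0, step 1
    col1 varchar(100),
    col2 int
);

insert into t_numbered (col1, col2)
select col1, col2 from t;

-- keep the lowest rid of each identical row and delete the rest
delete from t_numbered
where rid not in (
    select min(rid) from t_numbered group by col1, col2
);
From there you could rename t_numbered into place or copy the de-duplicated rows back, much like the distinct-based answer above.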

Append data from one table to another table

I have a Destination table with 3 columns, ID, Name, Source.
I have 10+ Source tables, each with multiple columns, but I only require the ID, Name, and the table name itself to be appended into Destination table.
Do note the naming of the column names are different in each table, but the required ID and Name are of the same data type and are of sufficient length for the ID and Name fields.
I already have the query to add the data (see below), and I have no issues doing the first run to add the required data into the destination table, as I just need one query for each table. Here is my code for one of them:
INSERT INTO dest_table
SELECT ID, Name, 'source_table' as Source
FROM source_table
The issue now is that I need to schedule this to run on a daily basis.
I would like the source tables to append only their new data into the destination table, rather than adding all records from each source table again.
Another condition to consider is that I still need the existing data in the destination table to remain intact. This means that whatever records were removed from the source tables will not be removed from the destination table.
Thanks people!
You can exclude the old data by using a WHERE clause as shown below. I am assuming that the ID is unique across all the source tables; otherwise you need to add another column in the destination table to identify which table each ID is coming from.
INSERT INTO dest_table
SELECT ID, Name, 'source_table' as Source FROM source_table
WHERE NOT EXISTS (SELECT 1 FROM dest_table dt WHERE dt.id = source_table.id)
Another non-optimal approach would be to create a trigger on insert in the source tables and push the data to the destination table.
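A rough sketch of such a trigger, assuming SQL Server; the inserted pseudo-table holds only the newly added source rows:
CREATE TRIGGER trg_source_table_append
ON source_table
AFTER INSERT
AS
BEGIN
    INSERT INTO dest_table (ID, Name, Source)
    SELECT i.ID, i.Name, 'source_table'
    FROM inserted AS i;
END;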

How to delete all data then insert new data

I have a process that runs every 60 minutes. On one table I need to remove all data and then insert records from a different table. The problem is that it takes a long time to delete and reinsert the data. While the table has no data, I am afraid the users will see it empty. Is there a way to refresh the data without users seeing this?
If you want to remove all data from the table then use TRUNCATE TABLE instead of DELETE - it'll do it faster.
As for the insert, it is a bit hard to say because you did not give any details, but here is what you can try:
Option 1 - Using a temp table
create table table_temp as select * from original_table where 1 = 0;
-- load the new data into table_temp here
drop table original_table;
exec sp_rename 'table_temp', 'original_table';
Option 2 - Use 2 tables "Active-Passive"
Have 2 tables for the data and a view that selects over them. The view joins with a third table that specifies which of the two tables to select from - an "active-passive" concept.
To demonstrate the concept:
with active_table as ( select 'table1_active' active_table )
select 1 data
where 'table1_active' in (select * from active_table)
union all
select 2
where 'table2_active' in (select * from active_table)
-- this returns only one record, the one with the "1"
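A more concrete sketch of the same idea, with placeholder names data_a, data_b, active_table and current_data:
create table data_a (id int, val varchar(50));
create table data_b (id int, val varchar(50));
create table active_table (active_name varchar(20));  -- holds either 'data_a' or 'data_b'
insert into active_table values ('data_a');

create view current_data as
select id, val from data_a
where 'data_a' in (select active_name from active_table)
union all
select id, val from data_b
where 'data_b' in (select active_name from active_table);

-- reload the passive table (data_b here), then flip the pointer with one tiny update
update active_table set active_name = 'data_b';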
Are you truncating instead of deleting? A truncate (while logged) is much, much faster than a delete.
If you cannot truncate, try deleting 1000-10000 rows at a time (smaller log buildup and a great increase in speed when deleting large numbers of rows).
If you really want fast performance you can create a second table, fill it with data, and then drop the first table and rename the second table as the first table. You will lose all the permissions on the table when you do this so be sure to reapply the permissions to the renamed table.
If you are deleting all rows in a table, you can consider using a TRUNCATE statement against the table instead of a DELETE. It will speed up part of your process. Keep in mind that this will reset any identity seeds you may have on the table.
As suggested, you can wrap this process in a transaction and depending on how you set your transaction isolation level, you can control what your users will see if they query the data during the transaction.
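A minimal sketch of that, assuming SQL Server and placeholder names target_table and staging_table:
begin transaction;
    truncate table target_table;   -- resets any identity seed, as noted above
    insert into target_table
    select * from staging_table;
commit transaction;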
Make it sequence based: your copied-in records all get a series number (the same for all records copied in together), another table holds which series is active, and you always select with a join to this table. When you copy in new records, they get a new series that is not yet active; once they are all copied in, the series table is updated to the new series, and the redundant series records are deleted at your leisure.
Example
Let's suppose your table has a field SeriesNo added and a table ActiveSeries has a field SeriesNo.
All queries of your table:
SELECT *
FROM YourTable Y
JOIN ActiveSeries A
ON A.SeriesNo = Y.SeriesNo
Then updating SeriesNo in ActiveSeries makes the new series of records available instantly.
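A sketch of the flip, with SomeColumn standing in for the real data columns and SourceTable for wherever the new batch comes from:
-- load the new batch under series 2 while series 1 stays active
INSERT INTO YourTable (SeriesNo, SomeColumn)
SELECT 2, SomeColumn FROM SourceTable;

-- publishing the new series is a single small update
UPDATE ActiveSeries SET SeriesNo = 2;

-- the redundant series can be removed at your leisure
DELETE FROM YourTable WHERE SeriesNo = 1;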
I would follow the approach below while I troubleshoot why the delete and reinsert is taking so long.
Create a new table (t1) which has the same data as the old table (maintable).
Now do your stuff on t1.
When your stuff is done, rename t1 to maintable.

Copy data between tables in different databases without PK's ( like synchronizing )

I have a table (A) in a database that doesn't have PKs; it has about 300k records.
I have a subset copy (B) of that table in another database; it has only 50k records and contains a backup for a given time range (July data).
I want to copy the missing records from table B into table A without duplicating existing records, of course. (I can create a database link to make things easier.)
What strategy can I follow to successfully insert the missing rows from B into A?
These are the table columns:
IDLETIME NUMBER
ACTIVITY NUMBER
ROLE NUMBER
DURATION NUMBER
FINISHDATE DATE
USERID NUMBER
.. 40 extra varchar columns here ...
My biggest concern is the lack of a PK. Can I create something like a hash or a PK using all the columns?
What could be a possible way to proceed in this case?
I'm using Oracle 9i for table A and Oracle XE (10g) for B.
The approximate number of elements to copy is 20,000
Thanks in advance.
If the data volumes are small enough, I'd go with the following
CREATE DATABASE LINK A CONNECT TO ... IDENTIFIED BY ... USING ....;
INSERT INTO COPY
SELECT * FROM table@A
MINUS
SELECT * FROM COPY;
You say there are about 20,000 to copy, but not how many in the entire dataset.
The other option is to delete the current contents of the copy and insert the entire contents of the original table.
If the full datasets are large, you could go with a hash, but I suspect that it would still try to drag the entire dataset across the DB link to apply the hash in the local database.
As long as no duplicate rows should exist in the table, you could apply a unique or primary key to all columns. If the overhead of a key/index would be too much to maintain, you could also query the database from your application to see whether the row already exists, and only perform the insert if it is absent.
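A rough Oracle-flavoured sketch of both suggestions; only a handful of the real columns are shown, your_db_link is a placeholder, and NULLs in the compared columns would need extra care:
-- a key over the columns (Oracle caps a composite index at 32 columns, so a subset may be needed)
ALTER TABLE A ADD CONSTRAINT a_rows_uk
    UNIQUE (IDLETIME, ACTIVITY, DURATION, FINISHDATE, USERID);

-- or skip the key and insert only the rows that are not already present
INSERT INTO A
SELECT *
FROM B@your_db_link b
WHERE NOT EXISTS (
    SELECT 1
    FROM A a
    WHERE a.IDLETIME   = b.IDLETIME
      AND a.ACTIVITY   = b.ACTIVITY
      AND a.DURATION   = b.DURATION
      AND a.FINISHDATE = b.FINISHDATE
      AND a.USERID     = b.USERID
);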