I need to archive data from one database to another database on a completely different server (DB2). I can do that with the following steps, but performance is the issue: I have a very large amount of data to archive. Is there any way to do this with better archiving performance?
/* TEST WITH 1 TABLE */
--1. RETRIEVE IDs AND SAVE IN LIST - [USE LOOP TO PUSH RECORDS BASED ON IDs IN AN ARRAY]
SELECT ID FROM TABLE_1
WHERE CREATED_TIME >= '2013-08-07 10:06:22' AND CREATED_TIME <= '2013-08-07 11:09:43'
ORDER BY ID ASC
--2. DROP INDEXES [TOO SLOW!!!]
ALTER TABLE TABLE_1_ARC DROP PRIMARY KEY
--3. INSERT RECORDS INTO ARC TABLE [STORED PROCEDURE TO INSERT IN ALL TABLES???]
INSERT INTO TABLE_1_ARC
SELECT * FROM TABLE_1
WHERE CREATED_TIME >= '2013-08-07 10:06:22' AND CREATED_TIME <= '2013-08-07 11:09:43'
ORDER BY ID ASC
--LOOPING THROUGH ARRAY FROM STEP 1 WILL BE USED HERE INSTEAD OF WHERE
--4. DELETE ARCHIVED RECORDS FROM OPERATIONAL TABLE [STORED PROCEDURE TO DELETE EVERY FEW RECORDS???]
DELETE FROM TABLE_1
WHERE CREATED_TIME >= '2013-08-07 10:06:22' AND CREATED_TIME <= '2013-08-07 11:09:43'
--LOOPING THROUGH ARRAY FROM STEP 1 WILL BE USED HERE INSTEAD OF WHERE
--5. PUT INDEXES BACK [TOO SLOW!!!]
ALTER TABLE TABLE_1_ARC ADD PRIMARY KEY (ID)
Partition both the source and archive tables by CREATED_TIME. You then will be able to simply detach a partition from the source table and attach it to the archive table, which is almost instantaneous.
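Where partitioning is not available, the copy-then-delete steps from the question can at least be run in small batches so that each transaction stays short. Below is a minimal sketch of that idea. It assumes, purely to stay runnable, that both tables sit in one in-memory SQLite database; `archive_in_batches` and `batch_size` are hypothetical names, and a real cross-server DB2 setup would need its own connection handling.

```python
import sqlite3

def archive_in_batches(conn, start, end, batch_size=1000):
    """Copy rows with CREATED_TIME in [start, end] to TABLE_1_ARC and delete
    them from TABLE_1, one batch per transaction. Returns the rows moved."""
    moved = 0
    while True:
        # Next batch of IDs in the time window, lowest IDs first.
        ids = [r[0] for r in conn.execute(
            "SELECT ID FROM TABLE_1 "
            "WHERE CREATED_TIME >= ? AND CREATED_TIME <= ? "
            "ORDER BY ID LIMIT ?", (start, end, batch_size))]
        if not ids:
            return moved
        lo, hi = ids[0], ids[-1]
        with conn:  # commits on success, rolls back if either statement fails
            conn.execute(
                "INSERT INTO TABLE_1_ARC SELECT * FROM TABLE_1 "
                "WHERE ID BETWEEN ? AND ? "
                "AND CREATED_TIME >= ? AND CREATED_TIME <= ?",
                (lo, hi, start, end))
            conn.execute(
                "DELETE FROM TABLE_1 "
                "WHERE ID BETWEEN ? AND ? "
                "AND CREATED_TIME >= ? AND CREATED_TIME <= ?",
                (lo, hi, start, end))
        moved += len(ids)

# Hypothetical demo data: odd IDs fall inside the archive window.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE_1 (ID INTEGER PRIMARY KEY, CREATED_TIME TEXT)")
conn.execute("CREATE TABLE TABLE_1_ARC (ID INTEGER PRIMARY KEY, CREATED_TIME TEXT)")
conn.executemany(
    "INSERT INTO TABLE_1 VALUES (?, ?)",
    [(i, "2013-08-07 10:30:00" if i % 2 else "2013-08-08 09:00:00")
     for i in range(1, 11)])
conn.commit()

moved = archive_in_batches(
    conn, "2013-08-07 10:06:22", "2013-08-07 11:09:43", batch_size=2)
```

Because the insert and the delete for a batch share one transaction, a failure midway leaves that batch in the source table only, so the run can simply be restarted.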
I need help optimizing my stored procedure. This is for our fact table; currently the stored procedure truncates the table and then loads the data back in. I want to get rid of the truncate and instead append new rows or delete rows based on a last_update column, which does not exist yet. There is also a last_update table with one column, which changes at every stored procedure run, but I'd rather have last_update as a column in the fact table itself than in a separate table.
I've created a trigger that should update the last_updated column with the current date when the stored procedure runs, but I would also like to get rid of truncating and instead append/delete rows as well. The way the stored procedure is currently structured is making it difficult for me to figure out how best to do it.
The stored procedure begins by adding data into 2 temp tables, then adds the data from the two temp tables into a 3rd temp table, then truncates the current FACT TABLE and then the 3rd temp table finally inserts into the FACT table.
--CLEAR LAST UPDATE TABLE
TRUNCATE TABLE ADM.LastUpdate;
--SET NEW LAST UPDATE TIME
INSERT INTO ADM.LastUpdate(TABLE_NAME, UPDATE_TIME)
VALUES('FactBP', CONVERT(VARCHAR, GETDATE(), 100)+' (CST)');
--CHECK TO SEE IF TEMP TABLES EXISTS THEN DROP
IF OBJECT_ID('tempdb.dbo.#TEMP_CARTON', 'U') IS NOT NULL
DROP TABLE #TEMP_CARTON;
IF OBJECT_ID('tempdb.dbo.#TEMP_ORDER', 'U') IS NOT NULL
DROP TABLE #TEMP_ORDER;
--CREATE TEMP TABLES
SELECT *
INTO #TEMP_CARTON
FROM [dbo].[FACT_CARTON_V];
SELECT *
INTO #TEMP_ORDER
FROM [dbo].[FACT_ORDER_V];
--CHECK TO SEE IF DATA EXISTS IN #TEMP_CARTON AND #TEMP_ORDER
IF EXISTS(SELECT * FROM #TEMP_CARTON)
AND EXISTS(SELECT * FROM #TEMP_ORDER)
--CODE HERE joins the data from #TEMP_CARTON and #TEMP_ORDER and puts it into a 3rd temp table #TEMP_FACT.
--CLEAR ALL DATA FROM FACTBP
TRUNCATE TABLE dbo.FactBP;
--INSERT DATA FROM TEMP TABLE TO FACTBP
INSERT INTO dbo.FactBP
SELECT
[SOURCE]
,[DC_ORDER_NUMBER]
,[CUSTOMER_PURCHASE_ORDER_ID]
,[BILL_TO]
,[CUSTOMER_MASTER_RECORD_TYPE]
,[SHIP_TO]
,[CUSTOMER_NAME]
,[SALES_ORDER]
,[ORDER_CARRIER]
,[CARRIER_SERVICE_ID]
,[CREATE_DATE]
,[CREATE_TIME]
,[ALLOCATION_DATE]
,[REQUESTED_SHIP_DATE]
,[ADJ_REQ_SHIP]
,[CANCEL_DATE]
,[DISPATCH_DATE]
,[RELEASED_DATE]
,[RELEASED_TIME]
,[PRIORITY_ORDER]
,[SHIPPING_LOAD_NUMBER]
,[ORDER_HDR_STATUS]
,[ORDER_STATUS]
,[DELIVERY_NUMBER]
,[DCMS_ORDER_TYPE]
,[ORDER_TYPE]
,[MATERIAL]
,[QUALITY]
,[MERCHANDISE_SIZE_1]
,[SPECIAL_PROCESS_CODE_1]
,[SPECIAL_PROCESS_CODE_2]
,[SPECIAL_PROCESS_CODE_3]
,[DIVISION]
,[DIVISION_DESC]
,[ORDER_QTY]
,[ORDER_SELECTED_QTY]
,[CARTON_PARCEL_ID]
,[CARTON_ID]
,[SHIP_DATE]
,[SHIP_TIME]
,[PACKED_DATE]
,[PACKED_TIME]
,[ADJ_PACKED_DATE]
,[FULL_CASE_PULL_STATUS]
,[CARRIER_ID]
,[TRAILER_ID]
,[WAVE_NUMBER]
,[DISPATCH_RELEASE_PRIORITY]
,[CARTON_TOTE_COUNT]
,[PICK_PACK_METHOD]
,[RELEASED_QTY]
,[SHIP_QTY]
,[MERCHANDISE_STYLE]
,[PICK_WAREHOUSE]
,[PICK_AREA]
,[PICK_ZONE]
,[PICK_AISLE]
,EST_DEL_DATE
FROM #TEMP_FACT;
Currently, since I've added the last_updated column into my FACT TABLE and created a trigger, I don't actually pass any value via the stored procedure for it, so I get an error
An object or column name is missing or empty.
I am not sure as to where I'm supposed to pass any value for the LAST_UPDATED column.
Here is the trigger I've created for updating the last_updated column:
CREATE TRIGGER last_updated
ON dbo.factbp
AFTER UPDATE
AS
UPDATE dbo.factbp
SET last_updated = GETDATE()
FROM Inserted i
WHERE dbo.factbp.id = i.id
The first thing I would try is to create primary keys on the two temp tables #TEMP_CARTON and #TEMP_ORDER and use the intersect command to get the rows that are common to both tables:
select * from #TEMP_CARTON
intersect
SELECT * FROM #TEMP_ORDER
Figured out the answer. I just had to put "null" for the last_updated value during Insert, and then the Trigger took care of adding the timestamp on its own.
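For the other half of the question (dropping the truncate in favour of append/delete), one possible shape is to compare a staging table against the fact table by key and apply only the differences. The sketch below is an illustration only, not the procedure from the post: it uses SQLite so it can run as-is, and the table names `fact` and `staging`, the single measure column `qty`, and the function `refresh_fact` are all hypothetical. It also assumes `id` uniquely identifies a fact row and `qty` is never NULL.

```python
import sqlite3

def refresh_fact(conn, now):
    """Apply staging to fact without truncating.
    Returns (inserted, updated, deleted) row counts."""
    with conn:  # one transaction for the whole refresh
        # Rows that disappeared from the source are removed.
        deleted = conn.execute(
            "DELETE FROM fact WHERE id NOT IN (SELECT id FROM staging)"
        ).rowcount
        # Rows whose values changed are updated and re-stamped.
        updated = conn.execute(
            "UPDATE fact "
            "SET qty = (SELECT s.qty FROM staging s WHERE s.id = fact.id), "
            "    last_updated = ? "
            "WHERE EXISTS (SELECT 1 FROM staging s "
            "              WHERE s.id = fact.id AND s.qty <> fact.qty)",
            (now,)).rowcount
        # Rows not yet present are appended with a fresh timestamp.
        inserted = conn.execute(
            "INSERT INTO fact (id, qty, last_updated) "
            "SELECT s.id, s.qty, ? FROM staging s "
            "WHERE NOT EXISTS (SELECT 1 FROM fact f WHERE f.id = s.id)",
            (now,)).rowcount
    return inserted, updated, deleted

# Hypothetical demo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, qty INTEGER NOT NULL, last_updated TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, qty INTEGER NOT NULL)")
conn.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                 [(1, 10, "old"), (2, 20, "old"), (3, 30, "old")])
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(1, 10), (2, 25), (4, 40)])
conn.commit()

counts = refresh_fact(conn, "new")
fact_rows = conn.execute(
    "SELECT id, qty, last_updated FROM fact ORDER BY id").fetchall()
```

Unchanged rows keep their old timestamp, so `last_updated` reflects when a row last actually changed rather than when the procedure last ran.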
I have a Hive table foo. There are several fields in this table. One of them is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (10385 in this example) I need to perform a CTAS query like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?
You can store all these tables in a single partitioned table. This approach lets you load all the data in a single query, and query performance will not be compromised.
Create table T (
... --columns here
)
partitioned by (id int); --new calculated partition key
Load data using one query, it will read source table only once:
insert overwrite table T partition(id)
select ..., --columns
case when some_id=10385 AND other_id=10385 then 10385
when some_id=10386 AND other_id=10386 then 10386
...
--and so on
else 0 --default partition for records not attributed
end as id --partition column
from foo
where some_id in (10385,10386) AND other_id in (10385,10386) --filter
Then you can use this table in queries specifying partition:
select * from T where id = 10385; --you can create a view named bar_10385, it will act the same as your table. Partition pruning works fast
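The point of the single INSERT is that the source is scanned once and every row is routed to its partition as it goes by, rather than scanned once per target table. A rough in-memory analogue of that routing follows. It is a sketch only: `partition_rows` and the sample rows are hypothetical, and Hive itself does this work through dynamic partitioning, not through code like this.

```python
from collections import defaultdict

def partition_rows(rows, wanted_ids):
    """Route each row to a partition key in a single pass over the source."""
    parts = defaultdict(list)
    for row in rows:
        # Equivalent of the WHERE filter: both ids must be in the wanted set.
        if row["some_id"] not in wanted_ids or row["other_id"] not in wanted_ids:
            continue
        # Equivalent of the CASE expression: matching ids give the partition
        # key, anything else falls into the default partition 0.
        key = row["some_id"] if row["some_id"] == row["other_id"] else 0
        parts[key].append(row)
    return dict(parts)

rows = [
    {"some_id": 10385, "other_id": 10385, "v": "a"},
    {"some_id": 10386, "other_id": 10386, "v": "b"},
    {"some_id": 10385, "other_id": 10386, "v": "c"},  # default partition
    {"some_id": 99, "other_id": 99, "v": "d"},        # filtered out
]
parts = partition_rows(rows, {10385, 10386})
```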
I have a common pattern in the current database that I would like to rip out. I have 3 objects where a single one would suffice: current_table, history_table, combined_view.
current_table and history_table have exactly the same columns and contain data split on a timestamp, that is history_table contains data up to 2010-01-01 and current_table includes data since, including 2010-01-01 etc.
The combined view is (poor man's partitioning)
select * from history_table
UNION ALL
select * from current_table
I would like to have a single table with the same name as the view and do away with the history_table and the view. My algorithm is:
1) Drop constraints on cutoff time.
2) Move data from history_table into current_table.
3) Rename history_table to history_table_DEPR, rename view to combined_view_DEPR, rename current_table to combined_view.
I currently achieve (2) above via the following SQL:
INSERT INTO current_table
SELECT * FROM history_table
I imagine (2) is where the bulk of the time is spent. I am worried that the insert above will attempt to write a log for each row inserted and will be slower than it could be. What is the best way to move the data in this case? I do not care about logging these moves.
This will batch
select 1 -- seeds @@rowcount so the loop body runs at least once
while (@@rowcount > 0)
begin
INSERT INTO current_table
SELECT top (100000) * FROM history_table ht
where not exists ( select 1 from current_table ct
where ct.PK = ht.PK
)
end
I wouldn't move the data at all, especially if you're going to have to repeat this exercise. Use some partitioning tricks to shuffle metadata around.
1) Create an intermediate staging table with two partitions based on your separation date.
2) Create your eventual target table, named after your view, without partitions.
3) Switch the data from the existing tables into the partitioned table.
4) Collapse the two partitions into one partition.
5) Switch the remaining partition into your new target table.
6) Drop all the working objects.
7) Repeat as needed.
-- Step 0.
-- Standard issue pre-cleaning.
IF OBJECT_ID('dbo.OldData','U') IS NOT NULL
DROP TABLE dbo.OldData;
IF OBJECT_ID('dbo.NewData','U') IS NOT NULL
DROP TABLE dbo.NewData;
IF OBJECT_ID('dbo.CleanUp','U') IS NOT NULL
DROP TABLE dbo.CleanUp;
IF OBJECT_ID('dbo.AllData','U') IS NOT NULL
DROP TABLE dbo.AllData;
IF EXISTS (SELECT * FROM sys.partition_schemes
WHERE name = 'psCleanUp')
DROP PARTITION SCHEME psCleanUp;
IF EXISTS (SELECT * FROM sys.partition_functions
WHERE name = 'pfCleanUp')
DROP PARTITION FUNCTION pfCleanUp;
-- Mock up your existing situation. Two data tables.
CREATE TABLE dbo.OldData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
CREATE TABLE dbo.NewData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
INSERT INTO dbo.OldData
(
Dates
,OtherStuff
)
VALUES
(
'20090101' -- Dates - date
,'' -- OtherStuff - varchar(1)
);
INSERT INTO dbo.NewData
(
Dates
,OtherStuff
)
VALUES
(
'20110101' -- Dates - date
,'' -- OtherStuff - varchar(1)
);
-- Step .5
-- Here's where the solution starts.
-- Add check constraints to your existing tables.
-- The partition switch will require this to be sure
-- the incoming data works with the partition scheme.
ALTER TABLE dbo.OldData
ADD CONSTRAINT ckOld CHECK (Dates < '2010-01-01');
ALTER TABLE dbo.NewData
ADD CONSTRAINT ckNew CHECK (Dates >= '2010-01-01');
-- Step 1.
-- Create your partitioning artifacts and
-- intermediate table.
CREATE PARTITION FUNCTION pfCleanUp (DATE)
AS RANGE RIGHT FOR VALUES ('2010-01-01');
CREATE PARTITION SCHEME psCleanUp
AS PARTITION pfCleanUp
ALL TO ([PRIMARY]);
CREATE TABLE dbo.CleanUp
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
) ON psCleanUp(Dates);
-- Step 2.
-- Create your new target table.
CREATE TABLE dbo.AllData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
-- Step 3.
-- Start flopping metadata around.
ALTER TABLE dbo.OldData
SWITCH TO dbo.CleanUp PARTITION 1;
ALTER TABLE dbo.NewData
SWITCH TO dbo.CleanUp PARTITION 2;
-- Step 4.
-- Your old tables should be empty now.
-- Put all of the data into one partition.
ALTER PARTITION FUNCTION pfCleanUp()
MERGE RANGE ('2010-01-01');
-- Step 5.
-- Switch that partition out to your
-- spanky new table.
ALTER TABLE dbo.CleanUp
SWITCH PARTITION 1 TO dbo.AllData;
-- Verify the data's where it belongs.
SELECT *
FROM dbo.AllData;
-- Verify the data's not where it shouldn't be.
SELECT * FROM dbo.OldData;
SELECT * FROM dbo.NewData;
SELECT * FROM dbo.CleanUp ;
-- Step 6.
-- Clean up after yourself.
DROP TABLE dbo.OldData;
DROP TABLE dbo.NewData;
DROP TABLE dbo.CleanUp;
DROP PARTITION SCHEME psCleanUp;
DROP PARTITION FUNCTION pfCleanUp;
-- This one's just here for me.
DROP TABLE dbo.AllData;
What is the best way to archive a table with a huge amount of data (say, everything older than 1 year) to another table and delete these records from the existing table?
Currently, I do this:
/*insert into archive table */
insert into table_a_archive (select *
from table_a
where last_updated < sysdate - interval '1' year);
/* delete archived data from existing table */
delete from table_a where last_updated < sysdate - interval '1' year;
Is there a better approach?
1) Create 2 new tables with the same structure as table_a: table_a_archive and table_a_new. (Don't forget to grant the same privileges, create the same indexes, etc. as on the original table_a.)
2) Rename table_a to table_a_old
3) Rename table_a_new to table_a
4) Lock table table_a_old (to prevent any changes during migration)
5) Using a conditional multi-table insert with the append hint and parallel execution, move all data from table_a_old into table_a_archive (when last_updated < sysdate - interval '1' year) and into table_a (else clause)
6) commit
7) drop and purge table_a_old
This method requires additional free space and stopping your application (renaming the table can invalidate dependent packages, and the new table will be empty during the migration). If you need an online solution, you can use dbms_redefinition to capture changes that occur during the migration.
I would also recommend considering recreating the table as partitioned by RANGE on last_updated instead of having 2 tables.
I have a table that doesn't have a primary key.
I'm going to insert some records into a new table to analyze them, and I'm thinking of creating a new primary key from the values of all the available columns.
If this were a programming language like Java I would:
int hash = column1 * 31 + column2 * 31 + column3*31
Or something like that. But this is SQL.
How can I create a primary key from the values of the available columns? It won't work for me to simply mark all the columns as the PK, because what I need to do is compare them with data from another DB table.
My table has 3 numbers and a date.
EDIT What my problem is
I think a bit more background is needed. I'm sorry for not providing it before.
I have a database (dm) that is updated every day from another db (original source). It has records from the past two years.
Last month (July) the update process broke, and for a month no data was loaded into dm.
I manually created a table with the same structure in my Oracle XE and copied the records from the original source into my db (myxe). I copied only the July records, to create a report needed by the end of the month.
Finally, on Aug 8 the update process got fixed, and the records that had been waiting to be migrated by this automatic process were copied into the database (from originalsource to dm).
This process cleans the data out of the original source once it is copied (into dm).
Everything looked fine, but we have just realized that some of the records got lost (about 25% of July).
So what I want to do is use my backup (myxe) and insert all the missing records into the database (dm).
The problems here are:
They don't have a well defined PK.
They are in separate databases.
So I thought that if I could create a unique PK that gave the same number in both tables, I could tell which records were missing and insert them.
EDIT 2
So I did the following in my local environment:
select a.* from the_table@PRODUCTION a , the_table b where
a.idle = b.idle and
a.activity = b.activity and
a.finishdate = b.finishdate
Which returns all the rows that are present in both databases (the .. union?). I got 2,000 records.
What I'm going to do next is delete them all from the target db and then insert all the rows from my db into the target table.
I hope I don't get into something worse :-S :-S
The danger of creating a hash value by combining the 3 numbers and the date is that it might not be unique and hence cannot be used safely as a primary key.
Instead I'd recommend using an autoincrementing ID for your primary key.
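To make the uniqueness concern concrete: the formula from the question multiplies every column by the same constant, so any two rows whose columns add up to the same total collide. A small illustration (the function name `naive_hash` and the sample values are made up for this sketch):

```python
def naive_hash(column1, column2, column3):
    """The hash proposed in the question: every column gets the same weight."""
    return column1 * 31 + column2 * 31 + column3 * 31

row_a = (1, 2, 3)
row_b = (3, 2, 1)   # same values in a different order
row_c = (6, 0, 0)   # different values with the same sum

hash_a = naive_hash(*row_a)
hash_b = naive_hash(*row_b)
hash_c = naive_hash(*row_c)

# All three distinct rows produce the same key.
collides = hash_a == hash_b == hash_c

# Comparing the full column tuples is the only collision-free "key" here.
distinct_rows = len({row_a, row_b, row_c})
```

A better-mixed hash would make collisions rarer but still cannot rule them out, which is why a surrogate key or a direct column-by-column comparison is the safer choice.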
Just create a surrogate key:
ALTER TABLE mytable ADD pk_col INT;
UPDATE mytable
SET pk_col = rownum;
ALTER TABLE mytable MODIFY pk_col INT NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col);
or this:
ALTER TABLE mytable ADD pk_col RAW(16);
UPDATE mytable
SET pk_col = SYS_GUID();
ALTER TABLE mytable MODIFY pk_col RAW(16) NOT NULL;
ALTER TABLE mytable ADD CONSTRAINT pk_mytable_pk_col PRIMARY KEY (pk_col);
The latter uses GUIDs, which are unique across databases but consume more space and are much slower to generate (your INSERTs will be slow).
Update:
If you need to create same PRIMARY KEYs on two tables with identical data, use this:
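MERGE
INTO mytable v
USING (
-- ROW_NUMBER() numbers rows in sorted order; plain ROWNUM is assigned before ORDER BY is applied
SELECT rowid AS rid,
ROW_NUMBER() OVER (ORDER BY col1, col2, col3) AS rn
FROM mytable
) s
ON (v.rowid = s.rid)
WHEN MATCHED THEN
UPDATE
SET v.pk_col = s.rn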
Note that tables should be identical up to a single row (i. e. have same number of rows with same data in them).
Update 2:
For your very problem, you don't need a PK at all.
If you just want to select the records missing in dm, use this one (on dm side)
SELECT *
FROM mytable@myxe
MINUS
SELECT *
FROM mytable
This will return all records that exist in mytable@myxe but not in mytable@dm
Note that it will collapse duplicates, if there are any, into a single row.
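Since the table has no key, genuine duplicate rows are possible, and a set-style difference would report each missing row only once. If the copy counts matter, a multiset difference keeps them. A minimal sketch with made-up rows (`missing_rows` is a hypothetical helper, and it assumes both row sets fit in memory):

```python
from collections import Counter

def missing_rows(backup_rows, target_rows):
    """Rows present in the backup more often than in the target,
    repeated once per missing copy."""
    diff = Counter(backup_rows) - Counter(target_rows)
    return sorted(diff.elements())

# Three numbers and a date, as in the question. The first row occurs twice.
backup = [
    (1, 2, 3, "2009-07-01"),
    (1, 2, 3, "2009-07-01"),
    (4, 5, 6, "2009-07-02"),
]
target = [
    (1, 2, 3, "2009-07-01"),
]

with_duplicates = missing_rows(backup, target)
set_style = sorted(set(backup) - set(target))  # what a MINUS-like diff reports
```

Here the multiset difference reports the second copy of the first row as missing, while the set-style difference only reports the row that is absent altogether.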
Assuming that you have ensured uniqueness...you can do almost the same thing in SQL. The only problem will be the conversion of the date to a numeric value so that you can hash it.
Select Table2.SomeFields
FROM Table1 LEFT OUTER JOIN Table2 ON
(Table1.col1 * 31) + (Table1.col2 * 31) + (Table1.col3 * 31) +
((DatePart(year,Table1.date) + DatePart(month,Table1.date) + DatePart(day,Table1.date) )* 31) = Table2.hashedPk
The above query would work for SQL Server, the only difference for Oracle would be in terms of how you handle the date conversion. Moreover, there are other functions for converting dates in SQL Server as well, so this is by no means the only solution.
And, you can combine this with Quassnoi's SET statement to populate the new field as well. Just use the left side of the Join condition logic for the value.
If you're loading your new table with values from the old table, and you then need to join the two tables, you can only "properly" do this if you can uniquely identify each row in the original table. Quassnoi's solution will allow you to do this, IF you can first alter the old table by adding a new column.
If you cannot alter the original table, generating some form of hash code based on the columns of the old table would work -- but, again, only if the hash codes uniquely identify each row. (Oracle has checksum functions, right? If so, use them.)
If hash code uniqueness cannot be guaranteed, you may have to settle for a primary key composed of as many columns as are required to ensure uniqueness (e.g. the natural key). If there is no natural key, well, I heard once that Oracle provides a rownum for each row of data; could you use that?