Merging 2 tables that were partitioned together via "UNION ALL"

I have a common pattern in the current database that I would like to rip out. I have 3 objects where a single one will suffice: current_table, history_table, combined_view.
current_table and history_table have exactly the same columns and contain data split on a timestamp: history_table contains data before 2010-01-01, and current_table contains data from 2010-01-01 onward.
The combined view is (poor man's partitioning):
select * from history_table
UNION ALL
select * from current_table
I would like to have a single table with the same name as the view and do away with the history_table and the view. My algorithm is:
1) Drop constraints on cutoff time.
2) Move data from history_table into current_table.
3) Rename history_table to history_table_DEPR, rename the view to combined_view_DEPR, and rename current_table to combined_view.
I currently achieve (2) above via the following SQL:
INSERT INTO current_table
SELECT * FROM history_table
I imagine (2) is where the bulk of the time is spent. I am worried that the insert above will attempt to write a log for each row inserted and will be slower than it could be. What is the best way to move the data in this case? I do not care about logging these moves.

This will insert in batches:
-- prime @@ROWCOUNT so the loop runs at least once
select 1
while (@@rowcount > 0)
begin
INSERT INTO current_table
SELECT top (100000) * FROM history_table ht
where not exists ( select 1 from current_table ct
where ct.PK = ht.PK
)
end
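For anyone who wants to try the batching idea without a SQL Server instance, here is a minimal sketch of the same loop in Python against SQLite. It is only an illustration: the `pk`/`payload` columns, the tiny batch size and the helper name `copy_in_batches` are assumptions made up for the demo, and SQLite's `LIMIT` stands in for `TOP`.

```python
import sqlite3


def copy_in_batches(conn, batch_size=100_000):
    """Copy rows from history_table into current_table, one batch per transaction."""
    total = 0
    while True:
        with conn:  # commits after each batch, so no single huge transaction
            cur = conn.execute(
                """
                INSERT INTO current_table
                SELECT * FROM history_table AS ht
                WHERE NOT EXISTS (
                    SELECT 1 FROM current_table AS ct WHERE ct.pk = ht.pk
                )
                LIMIT ?
                """,
                (batch_size,),
            )
        if cur.rowcount <= 0:  # nothing left to copy
            break
        total += cur.rowcount
    return total


# Demo data (hypothetical): 10 history rows, 5 current rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history_table (pk INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE current_table (pk INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO history_table VALUES (?, ?)", [(i, "old") for i in range(1, 11)]
)
conn.executemany(
    "INSERT INTO current_table VALUES (?, ?)", [(i, "new") for i in range(11, 16)]
)
conn.commit()

moved = copy_in_batches(conn, batch_size=4)
```

The NOT EXISTS check is what makes the loop restartable: if it is interrupted, running it again only picks up rows that have not been copied yet.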

I wouldn't move the data at all, especially if you're going to have to repeat this exercise. Use some partitioning tricks to shuffle metadata around instead.
1) Create an intermediate staging table with two partitions based on your separation date.
2) Create your eventual target table, named after your view, without partitions.
3) Switch the data from the existing tables into the partitioned table.
4) Collapse the two partitions into one partition.
5) Switch the remaining partition into your new target table.
6) Drop all the working objects.
7) Repeat as needed.
-- Step 0.
-- Standard issue pre-cleaning.
IF OBJECT_ID('dbo.OldData','U') IS NOT NULL
DROP TABLE dbo.OldData;
IF OBJECT_ID('dbo.NewData','U') IS NOT NULL
DROP TABLE dbo.NewData;
IF OBJECT_ID('dbo.CleanUp','U') IS NOT NULL
DROP TABLE dbo.CleanUp;
IF OBJECT_ID('dbo.AllData','U') IS NOT NULL
DROP TABLE dbo.AllData;
IF EXISTS (SELECT * FROM sys.partition_schemes
WHERE name = 'psCleanUp')
DROP PARTITION SCHEME psCleanUp;
IF EXISTS (SELECT * FROM sys.partition_functions
WHERE name = 'pfCleanUp')
DROP PARTITION FUNCTION pfCleanUp;
-- Mock up your existing situation. Two data tables.
CREATE TABLE dbo.OldData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
CREATE TABLE dbo.NewData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
INSERT INTO dbo.OldData
(
Dates
,OtherStuff
)
VALUES
(
'20090101' -- Dates - date
,'' -- OtherStuff - varchar(1)
);
INSERT INTO dbo.NewData
(
Dates
,OtherStuff
)
VALUES
(
'20110101' -- Dates - date
,'' -- OtherStuff - varchar(1)
)
-- Step .5
-- Here's where the solution starts.
-- Add check constraints to your existing tables.
-- The partition switch will require this to be sure
-- the incoming data works with the partition scheme.
ALTER TABLE dbo.OldData
ADD CONSTRAINT ckOld CHECK (Dates < '2010-01-01');
ALTER TABLE dbo.NewData
ADD CONSTRAINT ckNew CHECK (Dates >= '2010-01-01');
-- Step 1.
-- Create your partitioning artifacts and
-- intermediate table.
CREATE PARTITION FUNCTION pfCleanUp (DATE)
AS RANGE RIGHT FOR VALUES ('2010-01-01');
CREATE PARTITION SCHEME psCleanUp
AS PARTITION pfCleanUp
ALL TO ([PRIMARY]);
CREATE TABLE dbo.CleanUp
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
) ON psCleanUp(Dates);
-- Step 2.
-- Create your new target table.
CREATE TABLE dbo.AllData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
-- Step 3.
-- Start flopping metadata around.
ALTER TABLE dbo.OldData
SWITCH TO dbo.CleanUp PARTITION 1;
ALTER TABLE dbo.NewData
SWITCH TO dbo.CleanUp PARTITION 2;
-- Step 4.
-- Your old tables should be empty now.
-- Put all of the data into one partition.
ALTER PARTITION FUNCTION pfCleanUp()
MERGE RANGE ('2010-01-01');
-- Step 5.
-- Switch that partition out to your
-- spanky new table.
ALTER TABLE dbo.CleanUp
SWITCH PARTITION 1 TO dbo.AllData;
-- Verify the data's where it belongs.
SELECT *
FROM dbo.AllData;
-- Verify the data's not where it shouldn't be.
SELECT * FROM dbo.OldData;
SELECT * FROM dbo.NewData;
SELECT * FROM dbo.CleanUp ;
-- Step 6.
-- Clean up after yourself.
DROP TABLE dbo.OldData;
DROP TABLE dbo.NewData;
DROP TABLE dbo.CleanUp;
DROP PARTITION SCHEME psCleanUp;
DROP PARTITION FUNCTION pfCleanUp;
-- This one's just here for me.
DROP TABLE dbo.AllData;
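If it is not obvious why OldData goes to partition 1 and NewData to partition 2, the following Python sketch models how a RANGE RIGHT partition function assigns a value to a partition number. It is a toy model, not SQL Server code; the helper name `partition_for` is made up for illustration.

```python
from bisect import bisect_right
from datetime import date


def partition_for(value, boundaries):
    """Return the 1-based partition number under RANGE RIGHT semantics.

    With RANGE RIGHT, a boundary value belongs to the partition on its
    right, so the partition number is the count of boundaries <= value,
    plus one.
    """
    return bisect_right(sorted(boundaries), value) + 1


# One boundary, as in pfCleanUp: everything before 2010-01-01 is
# partition 1, everything from 2010-01-01 onward is partition 2.
boundaries = [date(2010, 1, 1)]
old_partition = partition_for(date(2009, 1, 1), boundaries)
new_partition = partition_for(date(2011, 1, 1), boundaries)

# After MERGE RANGE removes the boundary, only one partition is left.
merged_partition = partition_for(date(2011, 1, 1), [])
```

This lines up with the two check constraints (Dates < '2010-01-01' and Dates >= '2010-01-01'), and with the fact that after the merge everything is switched out of partition 1.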

Related

SQLite table with AS SELECT

I have a table called flights which has all information related to flights and I have a table called users.
I want to create a new table called orders, in this table I wanted to add user name from the user table and certain flight information from the flight table.
The thing is I also want to have a column in my order table called orderID.
My question is: how do I add a column in an AS SELECT query?

SELECT *, an_expression AS another_column FROM the_table_or_subquery .... ;
an_expression is just that: an expression that results in a single value.
the_table_or_subquery and another_column are descriptive rather than actual names; change them accordingly.
The new column could come first, e.g. SELECT an_expression AS another_column,* FROM the_table_or_subquery;
Could you please give an example so I can better understand?
Since you have provided scant details, here are examples of creating a new table and inserting some data from wherever and other data from another table (flights):
DROP TABLE IF EXISTS flights;
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS `order`;
DROP TABLE IF EXISTS other_order_table;
CREATE TABLE IF NOT EXISTS flights (
id INTEGER PRIMARY KEY,
flight_info TEXT
)
;
CREATE TABLE IF NOT EXISTS users (
userid INTEGER PRIMARY KEY,
user_name TEXT UNIQUE,
user_email TEXT UNIQUE
)
;
INSERT OR IGNORE INTO flights (flight_info)
VALUES
('Flight1'),
('Flight2'),
('Flight3')
;
INSERT OR IGNORE INTO users (user_name,user_email)
VALUES
('Fred','fred@email'),
('Mary','mary@email'),
('Jane','jane@email')
;
DROP TABLE IF EXISTS `order`;
/* >>>>>>>>>> NOT A GOOD IDEA <<<<<<<<<< due to
A table created using CREATE TABLE AS has no PRIMARY KEY and no constraints of any kind.
The default value of each column is NULL.
The default collation sequence for each column of the new table is BINARY.
*/
CREATE TABLE IF NOT EXISTS `order` /* NOTE ORDER is a keyword so has to be enclosed - better to not call it order */
AS SELECT *,null AS orderid /* The new column BUT see above, value will be null*/
FROM flights;
SELECT * FROM `order`;
/* BETTER as can specify column attributes
however must insert elsewhere
*/
CREATE TABLE IF NOT EXISTS other_order_table (
orderid INTEGER PRIMARY KEY,
order_added TEXT DEFAULT CURRENT_TIMESTAMP,
flight_id,
flight_info
)
;
/*
EXAMPLE 1
uses defaults for columns
in the case of orderid as it's an alias of the rowid then autogenerated id
in the case of order_added the current date and time in YYYY-MM-DD hh:mm:ss format
*/
INSERT INTO other_order_table (flight_id,flight_info) SELECT * FROM flights;
SELECT * FROM other_order_table;
DELETE FROM other_order_table;
/* EXAMPLE 2 */
INSERT INTO other_order_table
SELECT
/* a random id will be inserted into the first column (orderid) */
abs(random()),
/* a random date up to 999 days in the past */
datetime('now','-'||CAST(abs(random()) % 1000 AS INTEGER)||' days'),
/* all the columns from the flights table */
*
FROM flights
;
SELECT * FROM other_order_table;
/* Cleanup Environment */
DROP TABLE IF EXISTS flights;
DROP TABLE IF EXISTS users;
DROP TABLE IF EXISTS `order`;
DROP TABLE IF EXISTS other_order_table;
The 3 results are:
1) The 2 columns from the flights table plus the new orderid column set to null. WARNING: see the commentary above about column attributes being stripped.
2) The orderid is generated, and order_added is generated because its default is CURRENT_TIMESTAMP.
3) Both new columns, orderid and order_added, use expressions that return a suitable random value.
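The original question also wanted the user name in the orders table. Here is a minimal sketch, run through Python's sqlite3 module, of building such a table from both users and flights with an auto-generated orderid. Note the assumptions: the question never says how users relate to flights, so the `bookings` link table and its contents are purely hypothetical, and the table is called `orders` to avoid the ORDER keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE flights (id INTEGER PRIMARY KEY, flight_info TEXT);
    CREATE TABLE users (userid INTEGER PRIMARY KEY, user_name TEXT UNIQUE);
    INSERT INTO flights (flight_info) VALUES ('Flight1'), ('Flight2'), ('Flight3');
    INSERT INTO users (user_name) VALUES ('Fred'), ('Mary');

    -- Hypothetical link table: which user booked which flight.
    CREATE TABLE bookings (userid INTEGER, flight_id INTEGER);
    INSERT INTO bookings VALUES (1, 2), (2, 3), (1, 1);

    -- Define the table explicitly so orderid keeps its PRIMARY KEY.
    CREATE TABLE orders (
        orderid INTEGER PRIMARY KEY,  -- alias of the rowid, so auto-generated
        user_name TEXT,
        flight_info TEXT
    );

    -- orderid is left out of the column list and gets generated.
    INSERT INTO orders (user_name, flight_info)
    SELECT u.user_name, f.flight_info
    FROM bookings AS b
    JOIN users AS u ON u.userid = b.userid
    JOIN flights AS f ON f.id = b.flight_id;
    """
)

rows = conn.execute(
    "SELECT orderid, user_name, flight_info FROM orders ORDER BY orderid"
).fetchall()
```

Each row in `rows` carries a generated orderid plus the user name and flight info pulled in by the joins.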

Need help to optimize my stored procedure

I need help optimizing my stored procedure. This is for our fact table; currently the stored procedure truncates the table and then loads the data back in. I want to get rid of the truncate and instead append new rows or delete rows based on a last_update column, which does not currently exist. There is also a last_update table with one column, which changes at every stored procedure run, but I'd rather have last_update be a column in the fact table itself than keep it in a separate table.
I've created a trigger that should update the last_updated column with the current date when the stored procedure runs, but I would also like to get rid of truncating and instead append/delete rows as well. The way the stored procedure is currently structured makes it difficult for me to figure out how best to do it.
The stored procedure begins by adding data into 2 temp tables, then combines the data from the two temp tables into a 3rd temp table, then truncates the current FACT TABLE, and finally inserts the contents of the 3rd temp table into the FACT table.
--CLEAR LAST UPDATE TABLE
TRUNCATE TABLE ADM.LastUpdate;
--SET NEW LAST UPDATE TIME
INSERT INTO ADM.LastUpdate(TABLE_NAME, UPDATE_TIME)
VALUES('FactBP', CONVERT(VARCHAR, GETDATE(), 100)+' (CST)');
--CHECK TO SEE IF TEMP TABLES EXISTS THEN DROP
IF OBJECT_ID('tempdb.dbo.#TEMP_CARTON', 'U') IS NOT NULL
DROP TABLE #TEMP_CARTON;
IF OBJECT_ID('tempdb.dbo.#TEMP_ORDER', 'U') IS NOT NULL
DROP TABLE #TEMP_ORDER;
--CREATE TEMP TABLES
SELECT *
INTO #TEMP_CARTON
FROM [dbo].[FACT_CARTON_V];
SELECT *
INTO #TEMP_ORDER
FROM [dbo].[FACT_ORDER_V];
--CHECK TO SEE IF DATA EXISTS IN #TEMP_CARTON AND #TEMP_ORDER
IF EXISTS(SELECT * FROM #TEMP_CARTON)
AND EXISTS(SELECT * FROM #TEMP_ORDER)
--CODE HERE joins the data from #TEMP_CARTON and #TEMP_ORDER and puts it into a 3rd temp table #TEMP_FACT.
--CLEAR ALL DATA FROM FACTBP
TRUNCATE TABLE dbo.FactBP;
--INSERT DATA FROM TEMP TABLE TO FACTBP
INSERT INTO dbo.FactBP
SELECT
[SOURCE]
,[DC_ORDER_NUMBER]
,[CUSTOMER_PURCHASE_ORDER_ID]
,[BILL_TO]
,[CUSTOMER_MASTER_RECORD_TYPE]
,[SHIP_TO]
,[CUSTOMER_NAME]
,[SALES_ORDER]
,[ORDER_CARRIER]
,[CARRIER_SERVICE_ID]
,[CREATE_DATE]
,[CREATE_TIME]
,[ALLOCATION_DATE]
,[REQUESTED_SHIP_DATE]
,[ADJ_REQ_SHIP]
,[CANCEL_DATE]
,[DISPATCH_DATE]
,[RELEASED_DATE]
,[RELEASED_TIME]
,[PRIORITY_ORDER]
,[SHIPPING_LOAD_NUMBER]
,[ORDER_HDR_STATUS]
,[ORDER_STATUS]
,[DELIVERY_NUMBER]
,[DCMS_ORDER_TYPE]
,[ORDER_TYPE]
,[MATERIAL]
,[QUALITY]
,[MERCHANDISE_SIZE_1]
,[SPECIAL_PROCESS_CODE_1]
,[SPECIAL_PROCESS_CODE_2]
,[SPECIAL_PROCESS_CODE_3]
,[DIVISION]
,[DIVISION_DESC]
,[ORDER_QTY]
,[ORDER_SELECTED_QTY]
,[CARTON_PARCEL_ID]
,[CARTON_ID]
,[SHIP_DATE]
,[SHIP_TIME]
,[PACKED_DATE]
,[PACKED_TIME]
,[ADJ_PACKED_DATE]
,[FULL_CASE_PULL_STATUS]
,[CARRIER_ID]
,[TRAILER_ID]
,[WAVE_NUMBER]
,[DISPATCH_RELEASE_PRIORITY]
,[CARTON_TOTE_COUNT]
,[PICK_PACK_METHOD]
,[RELEASED_QTY]
,[SHIP_QTY]
,[MERCHANDISE_STYLE]
,[PICK_WAREHOUSE]
,[PICK_AREA]
,[PICK_ZONE]
,[PICK_AISLE]
,EST_DEL_DATE
FROM #TEMP_FACT;
Currently, since I've added the last_updated column to my FACT TABLE and created a trigger, I don't actually pass any value for it via the stored procedure, so I get this error:
An object or column name is missing or empty.
I am not sure as to where I'm supposed to pass any value for the LAST_UPDATED column.
Here is the trigger I've created for updating the last_updated column:
CREATE TRIGGER last_updated
ON dbo.factbp
AFTER UPDATE
AS
UPDATE dbo.factbp
SET last_updated = GETDATE()
FROM Inserted i
WHERE dbo.factbp.id = i.id

The first thing I would try is to create primary keys on the two temp tables #TEMP_CARTON and #TEMP_ORDER and use the intersect command to get the rows that are common to both tables:
select * from #TEMP_CARTON
intersect
SELECT * FROM #TEMP_ORDER

Figured out the answer. I just had to put "null" for the last_updated value during Insert, and then the Trigger took care of adding the timestamp on its own.
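On the other half of the question (getting rid of the TRUNCATE), one possible approach is to delete rows that have disappeared from the source and append only the rows that are new, stamping last_updated during the insert. Below is a minimal sketch of that idea in Python against SQLite, not T-SQL. It assumes the fact table has a unique `id` key (as the trigger implies) and uses a plain `staging` table as a stand-in for #TEMP_FACT; the two-column layout is trimmed down for the demo.

```python
import sqlite3


def sync_fact(conn):
    """Bring fact in line with staging without truncating it."""
    with conn:  # both statements commit together
        # Remove rows that no longer exist in the source.
        deleted = conn.execute(
            "DELETE FROM fact WHERE id NOT IN (SELECT id FROM staging)"
        ).rowcount
        # Append rows that are new, stamping them as they go in.
        inserted = conn.execute(
            """
            INSERT INTO fact (id, order_qty, last_updated)
            SELECT s.id, s.order_qty, CURRENT_TIMESTAMP
            FROM staging AS s
            WHERE NOT EXISTS (SELECT 1 FROM fact AS f WHERE f.id = s.id)
            """
        ).rowcount
    return deleted, inserted


# Demo data (hypothetical): id 1 has left the source, id 4 is new.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact (id INTEGER PRIMARY KEY, order_qty INTEGER, last_updated TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, order_qty INTEGER);
    INSERT INTO fact VALUES
        (1, 10, '2020-01-01 00:00:00'),
        (2, 20, '2020-01-01 00:00:00'),
        (3, 30, '2020-01-01 00:00:00');
    INSERT INTO staging VALUES (2, 20), (3, 30), (4, 40);
    """
)

deleted, inserted = sync_fact(conn)
```

Rows that exist on both sides are left alone, so their last_updated value keeps reflecting when they were first loaded. Changed rows would need an extra UPDATE step, which is left out here.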

Split Hive table on subtables by field value

I have a Hive table foo. There are several fields in this table. One of them is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (10385 in the example) I need to perform a CTAS query like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?

You can store all these tables in a single partitioned table. This approach will allow you to load all the data in a single query. Query performance will not be compromised.
Create table T (
... --columns here
)
partitioned by (id int); --new calculated partition key
Load data using one query, it will read source table only once:
insert overwrite table T partition(id)
select ..., --columns
case when some_id=10385 AND other_id=10385 then 10385
when some_id=10386 AND other_id=10386 then 10386
...
--and so on
else 0 --default partition for records not attributed
end as id --partition column
from foo
where some_id in (10385,10386) AND other_id in (10385,10386) --filter
Then you can use this table in queries specifying partition:
select * from T where id = 10385; --you can create a view named bar_10385, it will act the same as your table. Partition pruning works fast
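To make the single-pass idea concrete, here is a small Python sketch of what the filter plus CASE expression does to each row. It is only a model of the logic, not Hive code; the function name `split_by_partition` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def split_by_partition(rows, wanted):
    """Group rows by partition id in one pass over the source.

    Mirrors the query: the WHERE clause keeps rows whose some_id and
    other_id are both in `wanted`; the CASE sends rows with matching ids
    to that id's partition and everything else to partition 0.
    """
    partitions = defaultdict(list)
    for row in rows:
        some_id, other_id = row["some_id"], row["other_id"]
        if some_id not in wanted or other_id not in wanted:
            continue  # filtered out by the WHERE clause
        key = some_id if some_id == other_id else 0
        partitions[key].append(row)
    return dict(partitions)


rows = [
    {"some_id": 10385, "other_id": 10385, "v": "a"},
    {"some_id": 10386, "other_id": 10386, "v": "b"},
    {"some_id": 10385, "other_id": 10386, "v": "c"},  # mismatch: partition 0
    {"some_id": 99999, "other_id": 99999, "v": "d"},  # not in the filter
]
result = split_by_partition(rows, wanted={10385, 10386})
```

Seen this way, when every id in the filter also has a branch in the CASE, the long CASE list should be equivalent to `case when some_id = other_id then some_id else 0 end`, which may be easier to maintain with thousands of ids.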

Using Polybase to load data into an existing table in parallel

Using CTAS we can leverage the parallelism that Polybase provides to load data into a new table in a highly scalable and performant way.
Is there a way to use a similar approach to load data into an existing table? The table might even be empty.
Creating an external table and using INSERT INTO ... SELECT * FROM ... - I would assume that this goes through the head node and is therefore not in parallel?
I know that I could also drop the table and use CTAS to recreate it but then I have to deal with all the metadata again (column names, data types, distributions, ...).

You could use partition switching to do this, although remember not to use too many partitions with Azure SQL Data Warehouse. See the 'Partition Sizing Guidance' section of the documentation.
Bear in mind check constraints are not supported so the source table has to use the same partition scheme as the target table.
Full example with partitioning and switch syntax:
-- Assume we have a file with the values 1 to 100 in it.
-- Create an external table over it; will have all records in
IF NOT EXISTS ( SELECT * FROM sys.schemas WHERE name = 'ext' )
EXEC ( 'CREATE SCHEMA ext' )
GO
-- DROP EXTERNAL TABLE ext.numbers
IF NOT EXISTS ( SELECT * FROM sys.external_tables WHERE object_id = OBJECT_ID('ext.numbers') )
CREATE EXTERNAL TABLE ext.numbers (
number INT NOT NULL
)
WITH (
LOCATION = 'numbers.csv',
DATA_SOURCE = eds_yourDataSource,
FILE_FORMAT = ff_csv
);
GO
-- Create a partitioned, internal table with the records 1 to 50
IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers
CREATE TABLE dbo.numbers
WITH (
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED INDEX ( number ),
PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
)
AS
SELECT *
FROM ext.numbers
WHERE number Between 1 And 50;
GO
-- DBCC PDW_SHOWPARTITIONSTATS ('dbo.numbers')
-- CTAS the second half of the external table, records 51-100 into an internal one.
-- As check constraints are not available in SQL Data Warehouse, ensure the switch table
-- uses the same scheme as the original table.
IF OBJECT_ID('dbo.numbers_part2') IS NOT NULL DROP TABLE dbo.numbers_part2
CREATE TABLE dbo.numbers_part2
WITH (
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED INDEX ( number ),
PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
)
AS
SELECT *
FROM ext.numbers
WHERE number > 50
GO
-- Partition switch it into the original table
ALTER TABLE dbo.numbers_part2 SWITCH PARTITION 2 TO dbo.numbers PARTITION 2;
SELECT *
FROM dbo.numbers
ORDER BY 1;
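To see why only partition 2 needs switching in the example above, here is a small Python model of RANGE LEFT boundaries. It is an illustration only, not warehouse code; the helper name `partition_for_left` is made up.

```python
from bisect import bisect_left
from collections import Counter

# Same boundary values as the PARTITION clause in the example.
BOUNDARIES = [50, 100, 150, 200]


def partition_for_left(value, boundaries=BOUNDARIES):
    """Return the 1-based partition number under RANGE LEFT semantics.

    With RANGE LEFT, a boundary value belongs to the partition on its
    left, so the partition number is the count of boundaries < value,
    plus one.
    """
    return bisect_left(boundaries, value) + 1


# Rows 1-50 were loaded into dbo.numbers, rows 51-100 into dbo.numbers_part2.
loaded = Counter(partition_for_left(n) for n in range(1, 51))
staged = Counter(partition_for_left(n) for n in range(51, 101))
```

All of the staged rows fall into partition 2, and the target table has nothing there yet, which is the situation the SWITCH statement relies on.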

DB2 ARCHIVING OLD DATA TO DB2 ON ANOTHER SERVER

I need to archive data from one database to another database on a completely different server (DB2). I can do that with the following steps, but performance is the issue: I have a very large amount of data to archive. Is there any way to do this with better archiving performance?
/* TEST WITH 1 TABLE */
--1. RETRIEVE IDs AND SAVE IN LIST - [USE LOOP TO PUSH RECORDS BASED ON IDs IN AN ARRAY]
SELECT ID FROM TABLE_1
WHERE CREATED_TIME >= '2013-08-07 10:06:22' AND CREATED_TIME <= '2013-08-07 11:09:43'
ORDER BY ID ASC
--2. DROP INDEXES [TOO SLOW!!!]
ALTER TABLE TABLE_1_ARC DROP PRIMARY KEY
--3. INSERT RECORDS INTO ARC TABLE [STORED PROCEDURE TO INSERT IN ALL TABLES???]
INSERT INTO TABLE_1_ARC
SELECT * FROM TABLE_1
WHERE CREATED_TIME >= '2013-08-07 10:06:22' AND CREATED_TIME <= '2013-08-07 11:09:43'
ORDER BY ID ASC
--LOOPING THROUGH ARRAY FROM STEP 1 WILL BE USED HERE INSTEAD OF WHERE
--4. DELETE ARCHIVED RECORDS FROM OPERATIONAL TABLE [STORED PROCEDURE TO DELETE EVERY FEW RECORDS???]
DELETE FROM TABLE_1
WHERE CREATED_TIME >= '2013-08-07 10:06:22' AND CREATED_TIME <= '2013-08-07 11:09:43'
--LOOPING THROUGH ARRAY FROM STEP 1 WILL BE USED HERE INSTEAD OF WHERE
--5. PUT INDEXES BACK [TOO SLOW!!!]
ALTER TABLE TABLE_1_ARC ADD PRIMARY KEY (ID)

Partition both the source and archive tables by CREATED_TIME. You then will be able to simply detach a partition from the source table and attach it to the archive table, which is almost instantaneous.
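If partitioning is not an option, a different and slower technique is still likely to beat looping over an array of IDs: move the whole time window with one set-based INSERT and one DELETE inside a single transaction. The sketch below shows that pattern in Python against SQLite, not DB2. The attached in-memory database named `arc` is only a stand-in for the archive server, how the two real servers are connected is outside this sketch, and the two-column table layout is an assumption.

```python
import sqlite3


def archive_window(conn, start, end):
    """Move rows created in [start, end] from table_1 to arc.table_1_arc."""
    with conn:  # copy and delete commit together, or not at all
        copied = conn.execute(
            "INSERT INTO arc.table_1_arc "
            "SELECT * FROM table_1 WHERE created_time BETWEEN ? AND ?",
            (start, end),
        ).rowcount
        conn.execute(
            "DELETE FROM table_1 WHERE created_time BETWEEN ? AND ?",
            (start, end),
        )
    return copied


conn = sqlite3.connect(":memory:")
# Stand-in for the archive database on the other server.
conn.execute("ATTACH DATABASE ':memory:' AS arc")
conn.execute("CREATE TABLE table_1 (id INTEGER PRIMARY KEY, created_time TEXT)")
conn.execute("CREATE TABLE arc.table_1_arc (id INTEGER PRIMARY KEY, created_time TEXT)")
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?)",
    [
        (1, "2013-08-07 09:00:00"),
        (2, "2013-08-07 10:06:22"),
        (3, "2013-08-07 10:30:00"),
        (4, "2013-08-07 11:09:43"),
        (5, "2013-08-07 12:00:00"),
    ],
)
conn.commit()

moved = archive_window(conn, "2013-08-07 10:06:22", "2013-08-07 11:09:43")
```

Because both statements use the same time window, there is no need to collect the IDs first.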
