Using CTAS, we can leverage the parallelism that PolyBase provides to load data into a new table in a highly scalable and performant way.
Is there a way to use a similar approach to load data into an existing table? The table might even be empty.
What about creating an external table and using INSERT INTO ... SELECT * FROM ...? I would assume that this goes through the head node and is therefore not parallel.
I know that I could also drop the table and use CTAS to recreate it, but then I have to deal with all the metadata again (column names, data types, distributions, ...).
You could use partition switching to do this, although remember not to use too many partitions with Azure SQL Data Warehouse. See 'Partition Sizing Guidance' here.
Bear in mind that check constraints are not supported, so the source table has to use the same partition scheme as the target table.
Full example with partitioning and switch syntax:
-- Assume we have a file with the values 1 to 100 in it.
-- Create an external table over it; it will contain all the records.
IF NOT EXISTS ( SELECT * FROM sys.schemas WHERE name = 'ext' )
EXEC ( 'CREATE SCHEMA ext' )
GO
-- DROP EXTERNAL TABLE ext.numbers
IF NOT EXISTS ( SELECT * FROM sys.external_tables WHERE object_id = OBJECT_ID('ext.numbers') )
CREATE EXTERNAL TABLE ext.numbers (
number INT NOT NULL
)
WITH (
LOCATION = 'numbers.csv',
DATA_SOURCE = eds_yourDataSource,
FILE_FORMAT = ff_csv
);
GO
-- Create a partitioned, internal table with the records 1 to 50
IF OBJECT_ID('dbo.numbers') IS NOT NULL DROP TABLE dbo.numbers
CREATE TABLE dbo.numbers
WITH (
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED INDEX ( number ),
PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
)
AS
SELECT *
FROM ext.numbers
WHERE number BETWEEN 1 AND 50;
GO
-- DBCC PDW_SHOWPARTITIONSTATS ('dbo.numbers')
-- CTAS the second half of the external table, records 51-100, into an internal one.
-- As check constraints are not available in SQL Data Warehouse, ensure the switch table
-- uses the same partition scheme as the original table.
IF OBJECT_ID('dbo.numbers_part2') IS NOT NULL DROP TABLE dbo.numbers_part2
CREATE TABLE dbo.numbers_part2
WITH (
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED INDEX ( number ),
PARTITION ( number RANGE LEFT FOR VALUES ( 50, 100, 150, 200 ) )
)
AS
SELECT *
FROM ext.numbers
WHERE number > 50;
GO
-- Partition switch it into the original table
ALTER TABLE dbo.numbers_part2 SWITCH PARTITION 2 TO dbo.numbers PARTITION 2;
SELECT *
FROM dbo.numbers
ORDER BY 1;
Related
So I have an ETL that stores 3 years: '17 (corrupt), '18 (corrupt), '19:
STG_tables: imports data from 3 different DBs and exports it to
DWH_tables: This is the relational phase, where all the historical information is stored. Here only the normalization and parameterization of the tables and the fields are carried out to adapt them to the developed logical model, but no business rules are applied.
DIM_tables: Finally, in the dimensional phase, the business rules are applied and the tables and indexes are optimized for queries, since this is where the analytical tools will hit.
I have 2 types of reloads:
Daily Reload: This job is responsible for executing the SSIS packages necessary to perform the incremental daily load of the data warehouse. It only loads the last partition of the large tables (corresponding to the current year) in the dimensional phase.
Full Reload: Loads the full 3 years (this one is not working).
This wasn't done by me and I have zero technical documentation, so I'm just trying to figure out how this works. My thinking is that once I manage to run this full reload, the data will be restored.
I'm getting an error in the STG phase:
DROP TABLE DWH_PROD.DWH_XX;
DROP TABLE ...: 'The partition function 'pfPeticiones' is being used in one or more partition schemes.' Possible reasons for the error: problems with the query, the 'ResultSet' property not set correctly, parameters not set correctly, or connection poorly established.
I don't know how to drop this partition function so I can create it again, and I can't find the 'ResultSet' property. Please help.
USE DB;
GO
DROP TABLE DWH_PROD.DWH_ALBARANES_TARIFA;
DROP TABLE DWH_PROD.DWH_PETICIONES;
DROP TABLE DWH_PROD.DWH_SOLICITUDES;
DROP TABLE DWH_PROD.DWH_RESULTADOS;
DROP TABLE DWH_PROD.DWH_INCIDENCIAS;
------- I deleted code here so the text is not so big -------
Below are the CREATE statements for the tables dropped above:
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_ALBARANES_TARIFA')
CREATE TABLE DWH_PROD.DWH_ALBARANES_TARIFA (
    -- column definitions omitted
);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_INCIDENCIAS')
CREATE TABLE DWH_PROD.DWH_INCIDENCIAS (
    -- column definitions omitted
);
IF EXISTS (SELECT * FROM sys.partition_functions WHERE name = N'pfPeticiones')
DROP PARTITION FUNCTION pfPeticiones;
CREATE PARTITION FUNCTION pfPeticiones (DATE)
AS RANGE RIGHT FOR VALUES
('2017-01-01', '2018-01-01', '2019-01-01');
IF EXISTS (SELECT * FROM sys.partition_schemes WHERE name = N'psPeticiones')
DROP PARTITION SCHEME psPeticiones;
CREATE PARTITION SCHEME psPeticiones
AS PARTITION pfPeticiones
ALL TO ([Primary]);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_PETICIONES')
CREATE TABLE DWH_PROD.DWH_PETICIONES (
    -- column definitions omitted
) ON psPeticiones(FEC_PETICION);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_SOLICITUDES')
CREATE TABLE DWH_PROD.DWH_SOLICITUDES (
    -- column definitions omitted
) ON psPeticiones(FEC_PETICION);
IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = N'DWH_RESULTADOS')
CREATE TABLE DWH_PROD.DWH_RESULTADOS (
    -- column definitions omitted
) ON psPeticiones(FEC_PETICION);
You need to perform a few actions, in this order, to delete a partition function; a minimal sketch follows this list:
Delete or move (i.e. if you have a heap, create a clustered index on PRIMARY) all tables that use the partition scheme.
Delete the partition scheme.
Delete the partition function.
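For example, a minimal sketch of that order using the objects from the question, assuming the three tables created ON psPeticiones are the only ones using the scheme:
-- 1. Drop every table that sits on the partition scheme.
DROP TABLE DWH_PROD.DWH_PETICIONES;
DROP TABLE DWH_PROD.DWH_SOLICITUDES;
DROP TABLE DWH_PROD.DWH_RESULTADOS;
-- 2. The scheme no longer has dependent tables, so it can be dropped.
DROP PARTITION SCHEME psPeticiones;
-- 3. The function no longer has dependent schemes, so it can be dropped.
DROP PARTITION FUNCTION pfPeticiones;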
I have a Hive table foo. There are several fields in this table. One of them is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (in the example it is 10385) I need to perform CTAS queries like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?
You can store all these tables in a single partitioned one. This approach will allow you to load all the data in a single query, and query performance will not be compromised.
CREATE TABLE T (
    ... -- columns here
)
PARTITIONED BY (id INT); -- new calculated partition key
Load data using one query; it will read the source table only once:
INSERT OVERWRITE TABLE T PARTITION(id)
SELECT ..., -- columns
       CASE WHEN some_id = 10385 AND other_id = 10385 THEN 10385
            WHEN some_id = 10386 AND other_id = 10386 THEN 10386
            ...
            -- and so on
            ELSE 0 -- default partition for records not attributed
       END AS id -- partition column
FROM foo
WHERE some_id IN (10385, 10386) AND other_id IN (10385, 10386); -- filter
Then you can use this table in queries, specifying the partition; partition pruning makes this fast:
SELECT * FROM T WHERE id = 10385;
You can also create a view named bar_10385; it will act the same as your table.
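A minimal sketch of such a view, assuming the table T defined above:
CREATE VIEW bar_10385 AS
SELECT * FROM T
WHERE id = 10385;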
I have a common pattern in the current database that I would like to rip out. I have 3 objects where a single one will suffice: current_table, history_table, combined_view.
current_table and history_table have exactly the same columns and contain data split on a timestamp; that is, history_table contains data up to 2010-01-01 and current_table contains data since, and including, 2010-01-01.
The combined view is (poor man's partitioning)
select * from history_table
UNION ALL
select * from current_table
I would like to have a single table with the same name as the view and do away with history_table and the view. My algorithm is:
1. Drop constraints on the cutoff time.
2. Move data from history_table into current_table.
3. Rename history_table to history_table_DEPR, rename the view to combined_view_DEPR, and rename current_table to combined_view (see the sketch below).
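A minimal sketch of step (3), assuming the objects live in the dbo schema (sp_rename works on views as well as tables):
EXEC sp_rename 'dbo.history_table', 'history_table_DEPR';
EXEC sp_rename 'dbo.combined_view', 'combined_view_DEPR';
EXEC sp_rename 'dbo.current_table', 'combined_view';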
I currently achieve (2) above via the following SQL:
INSERT INTO current_table
SELECT * FROM history_table
I imagine (2) is where the bulk of the time is spent. I am worried that the insert above will write a log record for each row inserted and will be slower than it could be. What is the best way to move the data in this case? I do not care about logging these moves.
This will batch:
-- Prime @@ROWCOUNT so the loop body runs at least once.
SELECT 1;
WHILE (@@ROWCOUNT > 0)
BEGIN
    -- Copy up to 100,000 rows per pass that are not yet in the target.
    INSERT INTO current_table
    SELECT TOP (100000) *
    FROM history_table ht
    WHERE NOT EXISTS ( SELECT 1
                       FROM current_table ctt
                       WHERE ctt.PK = ht.PK );
END
I wouldn't move the data at all, especially if you're going to have to repeat this exercise. Use some partitioning tricks to shuffle metadata around.
1) Create an intermediate staging table with two partitions based on your separation date.
2) Create your eventual target table, named after your view, without partitions.
3) Switch the data from the existing tables into the partitioned table.
4) Collapse the two partitions into one partition.
5) Switch the remaining partition into your new target table.
6) Drop all the working objects.
7) Repeat as needed.
-- Step 0.
-- Standard issue pre-cleaning.
IF OBJECT_ID('dbo.OldData','U') IS NOT NULL
DROP TABLE dbo.OldData;
IF OBJECT_ID('dbo.NewData','U') IS NOT NULL
DROP TABLE dbo.NewData;
IF OBJECT_ID('dbo.CleanUp','U') IS NOT NULL
DROP TABLE dbo.CleanUp;
IF OBJECT_ID('dbo.AllData','U') IS NOT NULL
DROP TABLE dbo.AllData;
IF EXISTS (SELECT * FROM sys.partition_schemes
WHERE name = 'psCleanUp')
DROP PARTITION SCHEME psCleanUp;
IF EXISTS (SELECT * FROM sys.partition_functions
WHERE name = 'pfCleanUp')
DROP PARTITION FUNCTION pfCleanUp;
-- Mock up your existing situation. Two data tables.
CREATE TABLE dbo.OldData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
CREATE TABLE dbo.NewData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
INSERT INTO dbo.OldData
(
Dates
,OtherStuff
)
VALUES
(
'20090101' -- Dates - date
,'' -- OtherStuff - varchar(1)
);
INSERT INTO dbo.NewData
(
Dates
,OtherStuff
)
VALUES
(
'20110101' -- Dates - date
,'' -- OtherStuff - varchar(1)
);
-- Step .5
-- Here's where the solution starts.
-- Add check constraints to your existing tables.
-- The partition switch will require this to be sure
-- the incoming data works with the partition scheme.
ALTER TABLE dbo.OldData
ADD CONSTRAINT ckOld CHECK (Dates < '2010-01-01');
ALTER TABLE dbo.NewData
ADD CONSTRAINT ckNew CHECK (Dates >= '2010-01-01');
-- Step 1.
-- Create your partitioning artifacts and
-- intermediate table.
CREATE PARTITION FUNCTION pfCleanUp (DATE)
AS RANGE RIGHT FOR VALUES ('2010-01-01');
CREATE PARTITION SCHEME psCleanUp
AS PARTITION pfCleanUp
ALL TO ([PRIMARY]);
CREATE TABLE dbo.CleanUp
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
) ON psCleanUp(Dates);
-- Step 2.
-- Create your new target table.
CREATE TABLE dbo.AllData
(
[Dates] DATE NOT NULL
,[OtherStuff] VARCHAR(1) NULL
);
-- Step 3.
-- Start flopping metadata around.
ALTER TABLE dbo.OldData
SWITCH TO dbo.CleanUp PARTITION 1;
ALTER TABLE dbo.NewData
SWITCH TO dbo.CleanUp PARTITION 2;
-- Step 4.
-- Your old tables should be empty now.
-- Put all of the data into one partition.
ALTER PARTITION FUNCTION pfCleanUp()
MERGE RANGE ('2010-01-01');
-- Step 5.
-- Switch that partition out to your
-- spanky new table.
ALTER TABLE dbo.CleanUp
SWITCH PARTITION 1 TO dbo.AllData;
-- Verify the data's where it belongs.
SELECT *
FROM dbo.AllData;
-- Verify the data's not where it shouldn't be.
SELECT * FROM dbo.OldData;
SELECT * FROM dbo.NewData;
SELECT * FROM dbo.CleanUp;
-- Step 6.
-- Clean up after yourself.
DROP TABLE dbo.OldData;
DROP TABLE dbo.NewData;
DROP TABLE dbo.CleanUp;
DROP PARTITION SCHEME psCleanUp;
DROP PARTITION FUNCTION pfCleanUp;
-- This one's just here for me.
DROP TABLE dbo.AllData;
Say I have a predefined Hive table with partitions loaded to it.
CREATE EXTERNAL TABLE t1
(
c1 STRING
)
PARTITIONED BY ( dt STRING )
LOCATION...
ALTER TABLE t1 ADD PARTITION ( dt = '2017-01-01' )
Now I have new text representing the schema:
CREATE EXTERNAL TABLE t1
(
user_id STRING
)
PARTITIONED BY ( dt STRING )
LOCATION...
If I drop and then recreate the table, I'll lose the partition info.
I am looking for a way to redefine the columns part of the schema without manually adding/removing/renaming columns (this is not a one-time thing; I am trying to automate a schema update process).
I found a way to do 'almost' what I needed: Hive supports REPLACE COLUMNS, which means I can replace all the old columns with new ones.
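A minimal sketch against the table above; the CASCADE keyword (Hive 1.1+) is an assumption here, used so existing partition metadata picks up the new column set:
-- Replaces c1 with user_id; the partition column dt is untouched.
ALTER TABLE t1 REPLACE COLUMNS (user_id STRING) CASCADE;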
I have an Excel sheet that contains more than 8k IDs. I have a table in SQL Server that contains those IDs and related entries. What would be the best way to get those rows? The way I am doing it right now is to use the export data function for the specific table with the query:
select * from table_name where uID in (ALL 8K IDs)
Since this has to be done multiple times, I suggest using BULK INSERT from the CSV file into a temporary SQL table and then using an inner join with this table.
Assuming your CSV file contains the IDs in a single row (i.e. 1,34,345,...), something like this should do the trick:
-- create the temporary table
CREATE TABLE #CSVData
(
IdValue int
)
-- create a clustered index for this table (Note: this doesn't need to be unique)
CREATE CLUSTERED INDEX IX_CSVData on #CSVData (IdValue )
-- insert the csv data to the table
BULK INSERT #CSVData
FROM 'c:\csvData.txt'
WITH
(
ROWTERMINATOR = ','
)
-- select the data
SELECT T.*
FROM table_name T
INNER JOIN #CSVData ON (T.uId = IdValue)
-- cleanup (the index will be dropped with the table)
DROP TABLE #CSVData
One more link to look at is this article by Pinal Dave on SQLAuthority.