COPY INTO versus INSERT INTO on Snowflake transformations, with existing tables - sql

As far as I can tell, it's generally considered more efficient to use COPY INTO rather than INSERT INTO in Snowflake. Is this true for existing tables being transformed? And is it even possible for tables that already exist in Snowflake? For example,
INSERT INTO TEST_TABLE
SELECT *
FROM SOURCE_TABLE_1
UNION ALL
SELECT *
FROM SOURCE_TABLE_2
Doing something like the above query in an insert is very straightforward, but would it technically be more efficient to use a COPY INTO here? And how would the syntax for that work?
COPY INTO TEST_TABLE
FROM (SELECT *
FROM SOURCE_TABLE_1
UNION ALL
SELECT *
FROM SOURCE_TABLE_2)
This doesn't appear to work; is there a way to get it to do so?
Thanks, just trying to learn :)

COPY INTO has two flavours:
data ingestion: COPY INTO <table>
data unloading: COPY INTO <location>
Both use a named internal/external stage or a storage location as one side of the operation.
Thus COPY INTO is not intended to perform data movement between tables already present in Snowflake.
COPY INTO TEST_TABLE
FROM (SELECT *
FROM SOURCE_TABLE_1
UNION ALL
SELECT *
FROM SOURCE_TABLE_2)
Even assuming that SOURCE_TABLE_1 and SOURCE_TABLE_2 were stages and not permanent tables, it would not work either, because Snowflake supports only a subset of operations during a data load: Transforming Data During a Load
The COPY command supports:
Column reordering, column omission, and casts using a SELECT statement. There is no requirement for your data files to have the same number and ordering of columns as your target table.
The ENFORCE_LENGTH | TRUNCATECOLUMNS option, which can truncate text strings that exceed the target column length.
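For reference, a transformation load from a stage looks roughly like the sketch below; the stage name @my_stage, the path, and the column list are illustrative assumptions, and only positional references such as $1, $2 into the staged files are available. Note that UNION ALL is not in the supported subset, which is why the query from the question cannot be expressed as a COPY.
COPY INTO TEST_TABLE (COL_A, COL_B)
FROM (
    SELECT t.$1, t.$2::NUMBER   -- reorder/cast columns from the staged file
    FROM @my_stage/some_path/ t
)
FILE_FORMAT = (TYPE = CSV);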

There is a fundamental difference between COPY and INSERT.
COPY is used to load data from a staged file to a Snowflake table. This means you are loading the file from either an internal or an external stage into your Snowflake target table: https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
INSERT is used to load a table in Snowflake with - like in your example - data from another table. But note: The source data is in a Snowflake table already and not a file in one of your stages: https://docs.snowflake.com/en/sql-reference/sql/insert.html
This means: your first SQL query will work, while your second one, in case SOURCE_TABLE_1 and SOURCE_TABLE_2 are permanent tables, will fail.

Related

How to delete customer information from hdfs

Suppose I have several customers today, so I am storing their information such as customer_id, customer_name, customer_emailid, etc. If a customer is leaving, he wants his personal information to be removed from my HDFS.
So I have the two approaches below to achieve this.
Approach 1:
1. Create an internal table on top of HDFS
2. Create an external table from the first table using filter logic
3. While creating the 2nd table, apply UDFs on specific columns for more column filtering
Approach 2:
Spark=> Read, filter, write
Is there any other solution?
Approach 2 is also possible in Hive: select, filter, write.
Create a table on top of the directory in HDFS (external or managed does not matter in this context; better external if you are going to drop the table later and keep the data as is; a DDL sketch follows the query below). Then insert overwrite the table or partition from a select with a filter.
insert overwrite table mytable
select *
from mytable --the same table
where customer_id not in (...); --filter rows
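The first step, creating the table over the existing HDFS directory, might look roughly like this; the column names, delimiter, and location are illustrative assumptions:
CREATE EXTERNAL TABLE mytable (
    customer_id       BIGINT,
    customer_name     STRING,
    customer_emailid  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/customers';  -- the existing directory in HDFS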

BigQuery loop to select values from dynamic table_names registered in another table

I'm looking for a solution to extract data from multiple tables and insert it into another table automatically by running a single script. I need to query many tables, so I want to make a loop to select from those tables' names dynamically.
I wonder if I could have a table with table names, and execute a loop like:
foreach(i in table_names)
insert into aggregated_table select * from table_names[i]
end
Below is for BigQuery Standard SQL
#standardSQL
SELECT * FROM `project.dataset1.*`
WHERE _TABLE_SUFFIX IN (SELECT table_name FROM `project.dataset2.list`)
This approach will work if the conditions below are met:
all the to-be-processed tables from the list have exactly the same schema
one of the listed tables is the most recently created table matched by the wildcard - that table will define the schema used for all the rest of the tables
to meet the above bullet, the list table should ideally be hosted in another dataset (so it is not matched by the wildcard itself)
Obviously, you can add INSERT INTO ... to insert the result into whatever the destination is (see the sketch further below).
Please note: filters on _TABLE_SUFFIX that include subqueries cannot be used to limit the number of tables scanned for a wildcard table, so make sure you are using the longest possible prefix - for example
#standardSQL
SELECT * FROM `project.dataset1.source_table_*`
WHERE _TABLE_SUFFIX IN (SELECT table_name FROM `project.dataset2.list`)
So, again: even though you will only select data from the specific tables listed in project.dataset2.list, the cost will be for scanning all the tables that match the project.dataset1.source_table_* wildcard.
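As mentioned above, wrapping the wildcard query in an INSERT might look roughly like this; the destination table name aggregated_table is taken from the question and is assumed to already exist with a compatible schema:
#standardSQL
INSERT INTO `project.dataset1.aggregated_table`
SELECT * FROM `project.dataset1.source_table_*`
WHERE _TABLE_SUFFIX IN (SELECT table_name FROM `project.dataset2.list`)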
While the above is purely BigQuery SQL, you can use any client of your choice to script exactly the logic you need: read the table names from the list table, then select and insert in a loop. This option is the simplest and most optimal, I think.
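Such a loop can also be sketched without leaving BigQuery, using BigQuery scripting (FOR ... IN with EXECUTE IMMEDIATE); again, the destination table name is only an assumption carried over from the question:
FOR rec IN (SELECT table_name FROM `project.dataset2.list`)
DO
  EXECUTE IMMEDIATE FORMAT(
    "INSERT INTO `project.dataset1.aggregated_table` SELECT * FROM `project.dataset1.%s`",
    rec.table_name);  -- one per-table insert per iteration
END FOR;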

Hive - Huge 10TB table repartitioning (Adding new partition columns)

Techies,
Background -
We have an existing 10 TB Hive table which has been range partitioned on column A. The business case has changed and now requires adding partition column B in addition to column A.
Problem statement -
Since the data on HDFS is huge and needs to be restructured to inherit the new partition column B, we are finding it difficult to copy the table over to a backup and re-ingest it using a simple Impala INSERT OVERWRITE into the main table.
We want to explore whether there is an efficient way to handle adding partition columns to such a huge table.
Alright!
If I understand your situation correctly, you have a table backed by 10 TB of data in HDFS with partition on Column A and you want to add the partition also on Column B.
So, if column B is going to be the sub-partition, the HDFS directory would look like /user/hive/warehouse/database/table/colA=<value>/colB=<value>, or /colB=<value>/colA=<value> the other way around (considering it as a managed table).
Restructuring the HDFS directories manually won't be a great idea, because it would become a nightmare to scan the data in all the files and reorganize it into the corresponding folders.
Below is one way of doing it,
1. Create a new table with new structure - i.e., with partitions on Col A and Col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITIONED BY ( COL_A INT, COL_B INT )
2.a. Insert data from the old table to the new table (created in Step #1) like below,
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
But yes, if not handled properly, this step is going to consume a lot of resources during execution, space in HDFS for storing the results as data for NEWTABLE, and of course time.
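Depending on the Hive version and settings, the insert into a partitioned table may need an explicit PARTITION clause with dynamic partitioning enabled; a rough sketch, with the non-partition columns abbreviated in the same placeholder style as above:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE NEWTABLE PARTITION (COL_A, COL_B)
SELECT <other columns>, COL_A, COL_B  -- partition columns must come last in the select
FROM OLDTABLE;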
OR
2.b. If you think that HDFS will not have enough space to hold all the data, or that there will be a resource crunch, I would suggest you do this INSERT in batches, removing the old data after each INSERT operation.
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='abc'
DELETE FROM OLDTABLE
WHERE COL_A='abc'
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='def'
DELETE FROM OLDTABLE
WHERE COL_A='def'
.
.
.
so on.
This way, you can unload HDFS of the already-handled data and balance the space.
If you follow step 2.b., then you can write a script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run. But try the first two attempts manually before going with automation, to make sure things go as expected.
Let me know if it helps!

Merge Statement VS Lookup Transformation

I am stuck with a problem with different views.
Present Scenario:
I am using SSIS packages to get data from Server A to Server B every 15 minutes. I created 10 packages for 10 different tables and also created 10 staging tables for them. In the Data Flow Task, the package selects data from Server A with an ID greater than the last imported ID and dumps it into a staging table (each table has its own staging table). After the Data Flow Task, I use a MERGE statement to merge records from the staging table into the destination table where the ID is not matched.
Problem:
This takes care of all newly inserted records, but once a record has been picked up by the SSIS job and is later updated at the source, I am not able to pick it up again and grab the updated data.
Questions:
How will I be able to achieve the update without impacting the source database server too much?
Do I use a MERGE statement and select 10,000 records every single run (every 15 minutes)?
Do I use a Lookup transformation to do the updates?
Some tables have more than 2 million records and growing, so what is the best approach for them?
NOTE:
I can truncate tables in destination and reinsert complete data for the first run.
Edit:
The source has a column 'LAST_UPDATE_DATE' which I can use in my query.
If I'm understanding your statements correctly, it sounds like you're pretty close to your solution. If you currently have a merge statement that includes the insert (when the source does not match the destination), you should be able to easily include the update for the case where the source matches the destination.
example:
MERGE target_table AS destination_table_alias
USING (
    SELECT <column_name(s)>
    FROM source_table
) AS source_alias
ON source_alias.[table_identifier] = destination_table_alias.[table_identifier]
WHEN MATCHED THEN UPDATE
    SET destination_table_alias.[column_name1] = source_alias.[column_name1],
        destination_table_alias.[column_name2] = source_alias.[column_name2]
WHEN NOT MATCHED THEN
    INSERT ([column_name1], [column_name2])
    VALUES (source_alias.[column_name1], source_alias.[column_name2]);
So, to your points:
Update can be achieved via the 'WHEN MATCHED' logic within the merge statement
If you have the last ID of the table that you're loading, you can include this as a filter on your select statement so that the dataset is incremental.
No Lookup transformation is needed when the 'WHEN MATCHED' clause is utilized.
For the larger tables, utilize a filter (for example on LAST_UPDATE_DATE) in the select portion of the merge statement so that only recently changed rows are processed; see the sketch below.
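A rough sketch of that filter, using the LAST_UPDATE_DATE column from your edit; the @last_run_time variable, its value, and the single column are illustrative assumptions (in practice you would read the watermark from a control table):
DECLARE @last_run_time DATETIME = '2020-01-01T00:00:00';  -- e.g. loaded from a control table

MERGE target_table AS destination_table_alias
USING (
    SELECT <column_name(s)>
    FROM source_table
    WHERE LAST_UPDATE_DATE > @last_run_time  -- only rows changed since the previous run
) AS source_alias
ON source_alias.[table_identifier] = destination_table_alias.[table_identifier]
WHEN MATCHED THEN UPDATE
    SET destination_table_alias.[column_name1] = source_alias.[column_name1]
WHEN NOT MATCHED THEN
    INSERT ([column_name1])
    VALUES (source_alias.[column_name1]);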
Hope this helps

Copying contents of table A to table B (one more column than table A)

In our application, we have two sets of tables: one set of working tables (with the data that is currently analyzed) and another set of archive tables (with all data that has ever been analyzed; same table name but with an a_ prefix). The structure of the tables is the same, except that the archive tables have an extra column run_id to distinguish between different sets of data.
Currently, we have a SQL script that copies the contents over with statements similar to this:
insert into a_deals (run_id, deal_id, <more columns>)
select maxrun, deal_id, <more columns>
from deals,
(select max(run_id) maxrun from batch_runs);
This works fine, but whenever we add a new column to the table, we also have to modify the script. Is there a better way to do this that is stable when we add new columns? (Of course the structures have to match, but we'd like not to have to change the script as well.)
FWIW, we're using Oracle as our RDBMS.
Following up on the first answer, you could build a PL/SQL procedure which reads all_tab_columns to build the insert statement, then runs it with execute immediate. Not too hard, but be careful about what input parameters you allow (table_name and the like) and who can run it, since it could provide a great opportunity for SQL injection.
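A minimal sketch of that idea for the deals/a_deals pair from the question, with the table names hardcoded to sidestep the injection concern; it assumes run_id is the only extra column and that LISTAGG is available (Oracle 11gR2+):
DECLARE
  l_cols VARCHAR2(4000);
BEGIN
  -- column list of DEALS, in declaration order
  SELECT LISTAGG(column_name, ', ') WITHIN GROUP (ORDER BY column_id)
    INTO l_cols
    FROM all_tab_columns
   WHERE owner = USER
     AND table_name = 'DEALS';

  -- copy everything into the archive table, stamped with the current max run_id
  EXECUTE IMMEDIATE
    'INSERT INTO a_deals (run_id, ' || l_cols || ') ' ||
    'SELECT (SELECT MAX(run_id) FROM batch_runs), ' || l_cols || ' FROM deals';
END;
/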
If the 2 tables have the SAME columns in the same order (column_id from all_tab_columns) except for this run_id in front, then you can do something like:
insert into a_deals
select (select max(run_id) from batch_runs), d.*
from deals d
where ...;
This is a lazy approach IMO, and you'll want to ensure that the columns are in the same positions for both tables as part of this script (inspect all_tab_columns, e.g. with the query below). Two varchar2 fields that are switched will lead to data being inserted into the wrong fields.
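For example, a quick way to eyeball both column orders (the WHERE clause assumes both tables live in your own schema):
SELECT table_name, column_id, column_name, data_type
  FROM all_tab_columns
 WHERE owner = USER
   AND table_name IN ('DEALS', 'A_DEALS')
 ORDER BY table_name, column_id;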