I have an existing bq table with date, stock id and stock price columns. Using bq load, I can either overwrite or append data from a csv file. From that csv file,
I want to overwrite rows in the bq table if the date and stock id already exist (updating the price),
else I want to append them as new rows in the bq table if the date and stock id do not exist in the bq table.
In that scenario you would want to do a two-step process.
Load the data to a staging table.
Issue a MERGE statement, and define the WHEN MATCHED and WHEN NOT MATCHED criteria as needed (a sketch is shown below).
Documentation on the MERGE statement can be found here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_statement
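For example, a minimal sketch of step 2, assuming the CSV has first been loaded (via bq load) into a staging table; the project, dataset, table, and column names are placeholders to adjust for your schema:
MERGE `myproject.mydataset.stock_prices` AS t
USING `myproject.mydataset.stock_prices_staging` AS s
ON t.date = s.date AND t.stock_id = s.stock_id
WHEN MATCHED THEN
  UPDATE SET t.price = s.price -- overwrite the price for an existing date/stock id
WHEN NOT MATCHED THEN
  INSERT (date, stock_id, price) VALUES (s.date, s.stock_id, s.price); -- append as a new row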
I have multiple parquet files in S3 that are partitioned by date, like so:
s3://mybucket/myfolder/date=2022-01-01/file.parquet
s3://mybucket/myfolder/date=2022-01-02/file.parquet
and so on.
All of the files follow the same schema, except for a few, which is why I am using FILLRECORD (to fill missing columns with NULL values in case a column is not present in a file). Now I want to load the content of all these files into a SQL temp table in Redshift, like so:
DROP TABLE IF EXISTS table;
CREATE TEMP TABLE table
(
var1 bigint,
var2 bigint,
date timestamp
);
COPY table
FROM 's3://mybucket/myfolder/'
access_key_id 'id' secret_access_key 'key'
PARQUET FILLRECORD;
The problem is that the date is not a column in the parquet files (it only appears in the S3 path), which is why the date column in the resulting table is NULL. I am trying to find a way to get that partition date inserted into the temp table.
Is there any way to do this?
I believe there are only 2 approaches to this:
Perform N COPY commands, one per S3 partition value, and populate the date column with the same information as the partition key value as a literal. A simple script can issue the SQL to Redshift. The issue with this is that you are issuing many COPY commands and if each partition in S3 has only 1 parquet file (or a few files) this will not take advantage of Redshift's parallelism.
Define the region of S3 with the partitioned parquet files as a Redshift partitioned external table and then INSERT INTO <local table> (SELECT * FROM <external table>);. The external table knows about the partition key and can insert this information into the local table (see the sketch below). The downside is that you need to define the external schema and table, and if this is a one-time process, you will want to tear these down afterwards.
There are some other ways to attack this, but none of them are worth the effort, or they will be very slow.
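A rough sketch of option 2 using Redshift Spectrum; the external schema and table names, the IAM role (used here instead of access keys), and the local temp table name my_temp_table are assumptions to adjust for your setup:
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum_schema.myfolder_ext
(
var1 bigint,
var2 bigint
)
PARTITIONED BY (date date)
STORED AS PARQUET
LOCATION 's3://mybucket/myfolder/';

-- Register each partition (or let a Glue crawler do it); one ALTER per date value.
ALTER TABLE spectrum_schema.myfolder_ext
ADD PARTITION (date='2022-01-01') LOCATION 's3://mybucket/myfolder/date=2022-01-01/';

-- The partition key is exposed as a regular column, so it lands in the local temp table.
INSERT INTO my_temp_table
SELECT var1, var2, date::timestamp FROM spectrum_schema.myfolder_ext;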
Having a Hive table that's partitioned
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first LOAD is done from ORACLE to HIVE via PYSPARK using
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates the partitions dynamically during the run. However, loading data incrementally every day creates individual files for a single record under the partition.
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a possibility to have the value appended to the existing parquet file under the partition until it reaches the block size, without having smaller files created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for Hive:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
These still don't help with the daily inserts. Any alternate approach that can be followed would be really helpful.
As per my knowledge, we can't store a single file for the daily partition data, since the data will be stored in different part files for each day's load into the partition.
Since you mention that you are importing the data from an Oracle DB, you can import the entire data each time from the Oracle DB and overwrite it into HDFS. This way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.
I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading the incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with an entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table selecting from the temp_CUSTOMER_PART table (see the sketch below).
In this case you are going to end up with a final table without small files in it.
NOTE: you need to make sure no new data is being inserted into the CUSTOMER_PART table after the temp table has been created.
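A rough HiveQL sketch of Approach 1; the temp table name and the dynamic-partition settings are assumptions:
-- Snapshot the current data; rewriting it consolidates the small files.
CREATE TABLE temp_CUSTOMER_PART STORED AS PARQUET AS
SELECT * FROM CUSTOMER_PART;

-- Overwrite the final table from the snapshot (dynamic partitioning must be enabled).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID)
SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM temp_CUSTOMER_PART;

-- Drop the snapshot once the overwrite has been verified.
DROP TABLE temp_CUSTOMER_PART;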
Approach 2:
Using the input_file_name() function:
Check how many distinct filenames there are in each partition, then select only the partitions that have more than 10 (or whatever threshold) files in them (see the sketch below).
Create a temporary table with these partitions and overwrite only the selected partitions in the final table.
NOTE: you need to make sure no new data is being inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
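A possible Spark SQL sketch for the file-count check in Approach 2; the threshold of 10 files is just an example:
-- Count distinct source files per partition and keep only the fragmented partitions.
SELECT CUSTOMER_ID, COUNT(DISTINCT fname) AS file_count
FROM (
  SELECT CUSTOMER_ID, input_file_name() AS fname
  FROM CUSTOMER_PART
) t
GROUP BY CUSTOMER_ID
HAVING COUNT(DISTINCT fname) > 10;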
Approach 3:
Hive (not Spark) offers overwriting and selecting from the same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this way, then a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any job that is writing to the table will have to wait.
In addition: the ORC format offers concatenate, which will merge small ORC files to create a new larger file.
alter table <db_name>.<orc_table_name> [partition (partition_column="val")] concatenate;
I have two existing tables and many records already added.
formula(formulaId,formulaName,formulaType)
and formula_detail(detailId,formulaId,fieldType,value)
Now there is a change in the formula table and a new column, branchId, is being added:
formula(formulaId,formulaName,formulaType,branchId),
and branch table is branch(branchId,branchName)
I want to copy and paste every existing record in the formula table for every branch.
E.g. if there are 3 existing records in the formula table (with IDs 1, 2, 3) and 2 branches, then the copy-paste operation should produce (3*2)=6 new records in the formula table in total, and also replicate the records in the formula_detail table for every newly created formula, as follows:
For formulaId 1, if there were 5 records in the formula_detail table, then the copy-paste will add (2*5) new records to the formula_detail table.
I tried some solutions, but the number of records is huge and the script is taking a long time. Please help. If you need any test code, I can add it.
First of all, replicating the same column in the detail table is against the normalized form used in your data model.
Still, if you want to add the column anyway:
Add the column using an ALTER statement on the formula_detail table.
Then try using this MERGE statement:
MERGE INTO formula_detail fd
USING (SELECT formulaId, branchId from Formula) temp
ON (fd.formulaId=temp.formulaId)
WHEN MATCHED THEN
UPDATE SET fd.branchId=temp.branchId;
I have a table Customers. I'm trying to design a way to extract data from the Customers table daily and create a CSV of this data. I want to pick only those records which haven't been extracted yet. How can I keep track of whether a record has been extracted or not? I cannot alter the Customers table to add a flag.
So far I'm planning to use a stage table which will have this flag. I'm writing a stored procedure to get the data from the Customers table and set the flag to 0 for each of these records, then use SSIS to create the CSV after pulling this data from the stage table, and once the records have been extracted into the CSV, update the staging table with flag=1 for those records.
What is a good design for this problem?
Customer table:
CustomerID | Name | RecordCreated | RecordUpdated
Create another table tblExportedEmpID with a column CustomerID. Add the customer id of each customer extracted from the Customer table into that new table. Then, to extract the customers from the Customer table which have not been extracted yet, you can use this query:
select * from customer where customerid not in(select customerid from tblExportedEmpID)
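A small sketch of that; the tblExportedEmpID definition is an assumption, and the bookkeeping insert should only run after the CSV extract has succeeded:
create table tblExportedEmpID (customerid int primary key);

-- after a successful extract, record which customers went into the CSV
insert into tblExportedEmpID (customerid)
select c.customerid
from customer c
where c.customerid not in (select e.customerid from tblExportedEmpID e);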
You have RecordCreated and RecordUpdated. Why even bother with a separate per-record tracking table if you have that information?
You'll need to create a table or equivalent "saved until next run" data area. The first thing your script does is grab the current time and whatever was stored in that data area. Then, have your statement query everything:
SELECT <list of columns and transformation>
FROM Customers
WHERE recordCreated >= :lastRunTime AND recordCreated < :currentRunTime
(or recordUpdated, if you need to re-extract if the customer's name changes)
Note that you want the exclusive upper-bound (<) to cover the case where your stored timestamp has less resolution than the mechanism getting the timestamp.
For the last step, store off your run start - whatever the script grabbed for "current time" - into the "saved until next run" data area.
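Putting this together, a hedged T-SQL sketch; the etl_watermark table and the choice of RecordUpdated as the driving column are assumptions:
-- One-time setup: the "saved until next run" data area, seeded far in the past.
CREATE TABLE dbo.etl_watermark (LastRunTime datetime2 NOT NULL);
INSERT INTO dbo.etl_watermark (LastRunTime) VALUES ('1900-01-01');

-- Each run: grab the window boundaries first...
DECLARE @lastRunTime datetime2 = (SELECT LastRunTime FROM dbo.etl_watermark);
DECLARE @currentRunTime datetime2 = SYSUTCDATETIME();

-- ...extract everything created or updated in the window...
SELECT CustomerID, Name, RecordCreated, RecordUpdated
FROM dbo.Customers
WHERE RecordUpdated >= @lastRunTime AND RecordUpdated < @currentRunTime;

-- ...and, once the CSV has been written successfully, advance the watermark.
UPDATE dbo.etl_watermark SET LastRunTime = @currentRunTime;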
I am stuck with a problem with different views.
Present Scenario:
I am using SSIS packages to get data from Server A to Server B every 15 minutes. I created 10 packages for 10 different tables and also created 10 staging tables for them. In the Data Flow Task it selects data from Server A with an ID greater than the last imported ID and dumps it into a staging table (each table has its own staging table). After the Data Flow Task I am using a MERGE statement to merge records from the staging table into the destination table where the ID is not matched.
Problem:
This takes care of all newly inserted records, but once a record has been picked up by the SSIS job and is later updated at the source, I am not able to pick it up again and grab the updated data.
Questions:
How will I be able to achieve the update without impacting the source database server too much?
Do I use a MERGE statement and select 10,000 records every single run (every 15 minutes)?
Do I use a Lookup transformation to do the updates?
Some tables have more than 2 million records and growing, so what is the best approach for them?
NOTE:
I can truncate tables in destination and reinsert complete data for the first run.
Edit:
The source has a column 'LAST_UPDATE_DATE' which I can use in my query.
If I'm understanding your statements correctly, it sounds like you're pretty close to your solution. If you currently have a MERGE statement that includes the insert (where the source does not match the destination), you should be able to easily include the update for the case where the source matches the destination.
example:
MERGE target_table AS destination_table_alias
USING (
SELECT <column_name(s)>
FROM source_table
) AS source_alias
ON
[source_alias].[table_identifier] = [destination_table_alias].[table_identifier]
WHEN MATCHED THEN UPDATE
SET [destination_table_alias].[column_name1] = [source_alias].[column_name1],
[destination_table_alias].[column_name2] = [source_alias].[column_name2]
WHEN NOT MATCHED THEN
INSERT
([column_name1],[column_name2])
VALUES([source_alias].[column_name1],[source_alias].[column_name2]);
So, to your points:
The update can be achieved via the 'WHEN MATCHED' logic within the MERGE statement.
If you have the last ID of the table that you're loading, you can include this as a filter on your select statement so that the dataset is incremental.
No lookup is needed when the 'WHEN MATCHED' logic is utilized.
For the larger tables, keep the dataset small by utilizing a filter (for example on LAST_UPDATE_DATE) in the SELECT portion of the MERGE statement (see the sketch below).
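For instance, a rough sketch combining the incremental filter with the MERGE; the table names, column names, and the @LastLoadDate variable are placeholders, with LAST_UPDATE_DATE taken from your edit:
MERGE DestinationTable AS dest
USING (
    SELECT ID, Col1, Col2, LAST_UPDATE_DATE
    FROM StagingTable
    WHERE LAST_UPDATE_DATE > @LastLoadDate -- only rows changed since the last run
) AS src
ON dest.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET dest.Col1 = src.Col1,
               dest.Col2 = src.Col2,
               dest.LAST_UPDATE_DATE = src.LAST_UPDATE_DATE
WHEN NOT MATCHED THEN
    INSERT (ID, Col1, Col2, LAST_UPDATE_DATE)
    VALUES (src.ID, src.Col1, src.Col2, src.LAST_UPDATE_DATE);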
Hope this helps