ADF Azure Data Factory - dataflow delete row - azure-sql-database

A table in Azure SQL Database holds information about order lines. For each order there can be multiple lines, and the table has no primary key.
Upsert and insert work fine by using a SHA1 column generated from order_id + product_sku + product_quantity. Hence, each time the quantity changes for a given line, the quantity is changed on the sink.
For the Exists step I use a match between source and sink (the source being the order_lines table in Azure SQL), with the following conditions:
flatten2#products_sku == source2#products_sku && checksum#sha1 == source2#sha1
The Alter row looks like this
Upsert and insert work fine
However, I cannot make the delete work for scenarios where a product is removed from an order line.
Any suggestion on how this can be achieved?
thanks

I reproduced this and was not able to delete the records in the target using Alter row.
As an alternative, you can try the below approach to delete the records in database target.
This is my sample source table:
Sample target table:
To delete the records, you can use script in sink.
For example, here I want to delete the record with name 'Rakesh' in the sink and insert the source as it is.
I have used the below script in the sink's Post SQL script:
delete from source1 where name='Rakesh'
You can use pre script or post script as per your requirement and change the query according to your condition.
Target table with source data and without 'Rakesh' row after Execution:
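The hard-coded delete above is only an illustration. For the original scenario (order lines removed from the source), the Pre/Post SQL script can instead delete any sink row whose checksum no longer appears in the freshly loaded source. A sketch only, assuming the latest source rows are staged in a table named order_lines_stage and both tables carry the sha1 column described in the question:

```sql
-- Sketch: order_lines_stage and the sha1 column names are assumptions.
-- Remove sink rows whose checksum is absent from the current source load.
DELETE FROM order_lines
WHERE sha1 NOT IN (SELECT sha1 FROM order_lines_stage);
```

Run this as the Post SQL script after the upsert step so the sink ends up matching the source exactly.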

Related

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from a data lake into on-premises SQL, so I created a data flow to do the necessary data transformation and cleaning.
After that I sink all the final data to a staging data lake, stored in CSV format.
I am facing two kinds of issues here.
Whenever I trigger/debug the pipeline (the full data flow activity), the first run loads the data into the CSV correctly. If I run the same pipeline a second time, the CSV file in the target data lake is loaded empty: the column header is there, but I cannot see any values inside the file.
Coming to the copy activity, which is connected to the on-premises SQL Server: I am trying to load the data, but if I trigger this pipeline again and again, duplicate data gets loaded. I want to load only new or updated data coming from the data lake CSV file. How do we handle this?
Kindly suggest.
When we want to incrementally load data into a database table, we need to use the Upsert option in the Copy data tool.
Upsert helps you incrementally load the source data based on a key column (or columns). If the key value is already present in the target table, it updates the rest of the column values; otherwise it inserts the new row.
Look at the following demonstration to understand how upsert works. I used Azure SQL Database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source csv data (data I want to incrementally load):
I have taken an id which already exists in target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create/select dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which upsert should happen.
Table after upsert using Copy data:
Now, after the upsert using Copy data, the id=1 row should be updated and the id=4 row inserted. The following is the final output achieved, which is in line with the expected output.
You can use the primary key in your target table (which is also present in your source CSV) as the key column in the Copy data sink configuration. Any other configuration (like a source filter by last-modified date) should not affect the process.
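Under the hood, the Upsert write behavior is equivalent to a key-based merge. A hedged sketch of the same logic in T-SQL, using the sample player table above (the staged source table name player_stage is an assumption):

```sql
-- Sketch of what Copy data Upsert does on key column id (player_stage is assumed).
MERGE player AS t
USING player_stage AS s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.gname = s.gname, t.team = s.team  -- key exists: update remaining columns
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, gname, team) VALUES (s.id, s.gname, s.team);  -- new key: insert row
```

This is why a stable key column is required: without one, every run inserts rather than updates, producing the duplicates described in the question.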

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove and use the query result to overwrite the table or to create a new destination table
You can also remove a column by exporting your table data to Cloud Storage, deleting the data corresponding to the column (or columns) you want to remove, and then loading the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table
There is a guide published for Manually Changing Table Schemas.
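As a sketch of the SELECT * EXCEPT approach (the table and column names below are placeholders), run the query and set its destination to overwrite the existing table, or to a new destination table:

```sql
-- Sketch: keeps every column except column_to_remove.
-- `project.dataset.table` and column_to_remove are placeholder names.
SELECT * EXCEPT (column_to_remove)
FROM `project.dataset.table`;
```

Note that overwriting the table this way bills a full table scan, which is the cost trade-off mentioned above.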
Edit:
In order to change a partitioned table to a non-partitioned table, you can use the Console to query your data and overwrite your current table or copy it to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above query, you select the data from all the table's partitions and add an extra column showing which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table:
Creating a new table go to: More(under the query editor) > Query Settings > Check the box "Set a destination table for query results" > Choose your project, dataset and write your new table's name > Under Destination table write preference check Write if empty.
Overwriting the current table: More(under the query editor) > Query Settings > Check the box "Set a destination table for query results" > Choose the same project and dataset for your current table > Write the same table's name as the one you want to overwrite > Under Destination table write preference check Overwrite table.
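If you prefer DDL over the Console's destination-table settings, the same result can be sketched with CREATE OR REPLACE TABLE writing to a new, non-partitioned destination (table names are placeholders):

```sql
-- Sketch: materialize all partitions into a new non-partitioned table,
-- preserving the original partition timestamp as a regular column.
CREATE OR REPLACE TABLE `project.dataset.table_flat` AS
SELECT *, _PARTITIONTIME AS pt
FROM `project.dataset.table`;
```

The restriction mentioned in the question applies to replacing the partitioned table itself; writing to a new non-partitioned table is fine.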
credit

How can I create a Primary Key in Oracle when composite key and a "generated always as identity" option won't work?

I'm working on an SSIS project that pulls data from Excel and loads it to an Oracle database every month. I plan to pull data from the Excel file and load it to an Oracle stage table. I will be using a merge statement because the data that gets loaded each month is a rolling 12-month list and the data can change, so I need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out the table from the last package run) ---> Data Flow from Excel to Stage Table ---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns from which to choose a primary key or a composite key, as it is possible (although very unlikely) that a new record could have exactly the same information. I am unable to use "generated always as identity" because my SSIS package needs to truncate the stage table at the beginning of each job. This would generate the same ID numbers in the new load and create problems in the target table.
Any suggestions as to how I can get around this problem?
Welcome to SO and ETL. Instead of using a staging table, in SSIS use two sources: the Excel file and the existing production table. Sort both inputs and then perform a merge join on the unique identifier. From there, use a derived column transformation to add a new column called 'Action' which marks a row as INSERT, UPDATE, or DELETE based on which side of the join key is NULL. So:
NULL from file means DELETE (not in file, in database)
NULL from database means INSERT (in file, not in database)
Not NULL for both means UPDATE (in file, in database)
From there, use a conditional split to route rows to either an OLE DB Destination (INSERT) or an OLE DB Command (UPDATE or DELETE). You can now remove the stage environment and MERGE command from your process. This has the added benefit of removing the ETL load from the SQL Server, assuming SSIS is running on a separate server.
Note: The sort transformation has the option to remove duplicates.
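The merge-join-plus-derived-column logic above is equivalent to a full outer join in SQL. A hedged sketch, where file_rows, db_rows, and business_key are assumed names standing in for the Excel input, the production table, and the chosen identifier:

```sql
-- Sketch of the Action classification (all names are placeholders).
SELECT
    COALESCE(f.business_key, d.business_key) AS business_key,
    CASE
        WHEN f.business_key IS NULL THEN 'DELETE'  -- in database, not in file
        WHEN d.business_key IS NULL THEN 'INSERT'  -- in file, not in database
        ELSE 'UPDATE'                              -- in both
    END AS Action
FROM file_rows f
FULL OUTER JOIN db_rows d
    ON f.business_key = d.business_key;
```

The conditional split then routes each Action value to its own output, exactly as the CASE branches do here.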

Merge data using Integration Service

Please Consider this scenario:
I have a table in my database. I want to move this data into my OLAP database using SSIS. I can move all records from my table to the OLAP database. The problem is I don't know how to apply changes in the OLAP environment. For example, if just 100 records of my table were changed, how can I apply only those changes and NOT copy all records from scratch?
How can I merge these two tables?
thanks
There are two main approaches to this:
Lookup Transformation --> OLE DB Command / OLE DB Destination
Load all data to a staging table and perform the MERGE using SQL.
My preference is for the latter because the update is set-based, but I do use the former where I know it will be predominantly inserts.
With the former you will end up with a data flow task something like:
This is an OLE DB Source from the OLTP database, which then looks up against your OLAP database to retrieve the surrogate key. Where there is no match it simply inserts a new record via the OLE DB Destination; when there is a match it does a conditional split, and if any fields have changed it uses the OLE DB Command to update the OLAP table.
It can obviously get much more complicated than this, but this covers the simplest example.
You can also use the Slowly Changing Dimension Transformation to open up a wizard to create your data flow for you, which again gets a bit more complex:
As mentioned, though, my preference is for a staging table and a set-based update, because the OLE DB Command executes on a row-by-row basis, so if you are updating millions of records this will take a long time. You can simply create a staging table in your OLAP database and move the data in with a simple OLE DB Source and Destination, then use MERGE to update the OLAP table:
MERGE OLAP o
USING Staging s
ON o.BusinessKey = s.BusinessKey
AND o.Type2SCD = s.Type2SCD
AND o.Active = 1
WHEN MATCHED AND o.Type1SCD != s.Type1SCD THEN
UPDATE
SET Type1SCD = s.Type1SCD
WHEN NOT MATCHED BY TARGET THEN
INSERT (BusinessKey, Type1SCD, Type2SCD, Active, EffectiveDate)
VALUES (s.BusinessKey, s.Type1SCD, s.Type2SCD, 1, GETDATE())
WHEN NOT MATCHED BY SOURCE AND o.Active = 1 THEN
UPDATE
SET Active = 0;
The above assumes you have one active record per BusinessKey, and both type 1 and type 2 slowly changing dimensions. It will insert a new record where there is no match on BusinessKey and Type2SCD; in addition it will set any records unmatched in the source table to inactive. When there is a match but the type 1 SCD differs, it will be updated.
It is worth noting that MERGE has its downsides, and you may want to write your set-based upserts as separate INSERT and UPDATE statements. One major issue I have come across: on all my dimension tables I have a unique filtered index on the BusinessKey field WHERE Active = 1 to ensure there is only one active record, which the MERGE I have written should work fine with, but doesn't, as detailed in this Connect item. Although it was not the end of the world having to add OPTION (QUERYTRACEON 8790); to the end of all the MERGE statements in my ETL, it was not ideal.
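A hedged sketch of the same logic rewritten as separate statements (reusing the assumed OLAP and Staging tables and columns from the MERGE above), which sidesteps the filtered-index issue:

```sql
-- Sketch: type 1 update for matched active rows.
UPDATE o
SET o.Type1SCD = s.Type1SCD
FROM OLAP o
JOIN Staging s
    ON o.BusinessKey = s.BusinessKey
   AND o.Type2SCD = s.Type2SCD
WHERE o.Active = 1
  AND o.Type1SCD != s.Type1SCD;

-- Sketch: deactivate active rows no longer present in staging.
UPDATE o
SET o.Active = 0
FROM OLAP o
WHERE o.Active = 1
  AND NOT EXISTS (SELECT 1 FROM Staging s
                  WHERE s.BusinessKey = o.BusinessKey
                    AND s.Type2SCD = o.Type2SCD);

-- Sketch: insert rows not yet in the dimension.
INSERT INTO OLAP (BusinessKey, Type1SCD, Type2SCD, Active, EffectiveDate)
SELECT s.BusinessKey, s.Type1SCD, s.Type2SCD, 1, GETDATE()
FROM Staging s
WHERE NOT EXISTS (SELECT 1 FROM OLAP o
                  WHERE o.BusinessKey = s.BusinessKey
                    AND o.Type2SCD = s.Type2SCD);
```

Each statement touches the filtered index in a single, predictable way, which is why split upserts tend to avoid the MERGE issue described above.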
Sounds like you're wanting to use incremental loads.
The first five tutorials on this page should point you in the right direction - I found them really useful in the past.

Copy database schema to an existing database

I'm using Microsoft Sql Server Management Studio.
I currently have an existing database with data in it, which I will call DatabaseProd
And I have a second database with data used for testing, so the data isn't exactly correct or up to date. I will call this database DatabaseDev.
However, DatabaseDev now contains newly added tables, newly added columns, etc.
I would like to copy this new schema from DatabaseDev to DatabaseProd while keeping the DatabaseProd's Data.
Ex.
DatabaseProd contains 2 tables
TableA with column ID and Name
TableB with column ID and jobName
and these tables contain data that I would like to keep
DatabaseDev contains 3 tables
TableA with column ID ,Name and phoneNum
TableB with column ID and jobName
TableC with column ID and document
and these tables contain data that I don't need
Copy DatabaseDev Schema to DatabaseProd but keep the data from DatabaseProd
So DatabaseProd after the copy would look like this
TableA with column ID ,Name and phoneNum
TableB with column ID and jobName
TableC with column ID and document
But the tables would contain their original data.
Is that possible?
Thank you
You can use Red Gate SQL Compare; this will allow you to compare both DBs and generate a script to apply the changes. You have to pay for a license, but you get a 14-day trial period.
This tool, along with Data Compare, are two tools I always insist on in new roles, as they speed up development time and minimise human error.
Also, a good tip when using SQL Compare: if you need to generate a rollback script, you can edit the project (after creating your rollout script), switch the source and destination around, and this will create a script which returns the schema to its original state if the rollout script fails. However, be very careful when doing this, and don't select "Synchronize using SQL Compare"; rather, generate a script, see image. I can't upload an image, but I have linked to one here - you can see the two options to select Generate Script / Sync using SQL Compare.
Yes, you can generate a database script that is schema-only; no data will be added to that script.
Also, you can select just the third table while generating the database script, then run that script against your production server database; it will create the new table (TableC in your case) without any data.
For more information about how to create a database script please follow the below link:
http://blog.sqlauthority.com/2011/05/07/sql-server-2008-2008-r2-create-script-to-copy-database-schema-and-all-the-objects-data-schema-stored-procedure-functions-triggers-tables-views-constraints-and-all-other-database-objects/
You need an ALTER TABLE statement:
ALTER TABLE TableA ADD PhoneNum Varchar(10) -- choose the data type/length that fits your data
It looked like there were no changes to TableB.
Add TableC:
CREATE TABLE TableC (ColumnID int, Document Varchar(50))
Do you need to copy constraints, indexes or triggers over?
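To make the change script safe to re-run against DatabaseProd, the statements above can be guarded with existence checks. A sketch in T-SQL, assuming the tables live in the dbo schema:

```sql
-- Sketch: guard each schema change so the script is idempotent (dbo schema assumed).
IF COL_LENGTH('dbo.TableA', 'PhoneNum') IS NULL
    ALTER TABLE dbo.TableA ADD PhoneNum varchar(10);  -- skip if column already exists

IF OBJECT_ID('dbo.TableC', 'U') IS NULL
    CREATE TABLE dbo.TableC (ColumnID int, Document varchar(50));  -- skip if table exists
```

Existing data in TableA and TableB is untouched; ALTER TABLE ... ADD only appends the new column with NULLs.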