Why isn't there an option to upsert data in Azure Data Factory inline sink - azure-sql-database

The problem I'm trying to tackle is inserting and/or updating dynamic tables in a sink within an Azure Data Factory data flow. I've managed to get the source data, transform it the way I want, and send it to a sink. The pipeline ran successfully and reported that it copied 37 rows (as expected), but investigation showed that no data was actually written to the target table. This was because the Table Action on the sink was set to 'None'. In trying to fix this last part, I found that I don't have a 'Create' option but do have a 'Recreate' option (see the screenshot of the sink below). That is not what I want, because the data source will eventually contain only changed data, and Recreate drops the table and then creates it. I need the process to create the table if it doesn't exist and then upsert data.
If I change the sink type from Inline to Dataset, I can select Insert, Upsert, and similar options, but the sink is then no longer dynamic because I need to select a specific dataset.
Has anyone come across the same issue, and have you managed to build dynamic sinks in a data flow where the table is created if it doesn't exist and the data is then upserted?
I guess I can add a Pre SQL script which takes care of the 'create the table if it doesn't exist' but I still can't select the Upsert option with inline tables.

For the CREATE TABLE IF NOT EXISTS issue, I would recommend a Stored Procedure that is executed in the pipeline prior to the Data Flow.
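As a rough illustration of what such a procedure (or a Pre SQL script) would need to run, here is a minimal Python sketch that builds an idempotent T-SQL statement. The function name, the `stg.customer` table, and its columns are hypothetical, and the column types are assumed to come from trusted configuration, since they are not validated.

```python
from typing import Mapping


def create_if_missing_sql(schema: str, table: str, columns: Mapping[str, str]) -> str:
    """Build a T-SQL statement that creates [schema].[table] only when it is absent."""

    def quote(name: str) -> str:
        # Bracket-quote an identifier, doubling any closing bracket inside it.
        return "[" + name.replace("]", "]]") + "]"

    full_name = f"{quote(schema)}.{quote(table)}"
    column_list = ", ".join(f"{quote(col)} {sql_type}" for col, sql_type in columns.items())
    # Escape single quotes because the name is embedded in a string literal.
    literal = full_name.replace("'", "''")
    # OBJECT_ID(..., N'U') returns NULL when no user table with that name exists.
    return (
        f"IF OBJECT_ID(N'{literal}', N'U') IS NULL\n"
        f"    CREATE TABLE {full_name} ({column_list});"
    )


# Hypothetical table and columns, for illustration only.
statement = create_if_missing_sql("stg", "customer", {"id": "int", "name": "varchar(50)"})
print(statement)
```

The generated text can be pasted into a stored procedure body or a Pre SQL script; running it twice is harmless because the second run finds the table and skips the CREATE.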
For Inline vs Dataset, you can make the Dataset very flexible by parameterizing the table name and not importing a schema. The sink is then still driven by your runtime table name, so there is no need to target a specific table.
For the UPSERT issue, make sure you have an Alter Row transformation before the Sink.

Related

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from a data lake into an on-premises SQL Server, so I created a data flow to do the necessary data transformation and cleaning.
After that, I wrote the final data to a staging data lake sink, stored in CSV format.
I am facing two issues.
First, when I trigger or debug the data flow, the first run loads the data into the CSV file. If I run the same pipeline a second time, the CSV file in the target data lake is empty: the column headers are written, but there are no values in the file.
Second, in the copy activity, which is connected to the on-premises SQL Server, triggering the pipeline again and again loads duplicate data. I want to load only new or updated data from the data lake CSV file. How do we handle this?
Kindly suggest.
When we want to incrementally load data into a database table, we can use the Upsert option in the Copy data activity.
Upsert incrementally loads the source data based on a key column (or columns). If the key value is already present in the target table, the rest of the column values in that row are updated; otherwise, a new row is inserted with that key and the other values.
Look at the following demonstration to understand how upsert works. I used Azure SQL Database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source csv data (data I want to incrementally load):
I have taken an id which already exists in target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create/select dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which upsert should happen.
Table after upsert using Copy data:
After the upsert using Copy data, the id=1 row should be updated and the id=4 row should be inserted. The following is the final output, which is in line with the expected output.
You can use the primary key of your target table (which is also present in your source CSV) as the key column in the Copy data sink configuration. Any other configuration (such as a source filter by last modified date) should not affect the process.
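To make the key-based behaviour concrete, here is a small sketch that emulates it with Python's built-in `sqlite3` module as a stand-in for Azure SQL. It is not what Copy data runs internally; the row values are hypothetical, and only the `player` table definition and the id=1 / id=4 scenario come from the demonstration above.

```python
import sqlite3


def upsert_player(conn, row):
    """Update the row whose id matches, otherwise insert it (key column: id)."""
    cur = conn.execute(
        "UPDATE player SET gname = ?, team = ? WHERE id = ?",
        (row["gname"], row["team"], row["id"]),
    )
    if cur.rowcount == 0:  # no existing row carried this key
        conn.execute(
            "INSERT INTO player (id, gname, team) VALUES (?, ?, ?)",
            (row["id"], row["gname"], row["team"]),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INT, gname VARCHAR(20), team VARCHAR(10))")
# Hypothetical initial target rows.
conn.executemany(
    "INSERT INTO player VALUES (?, ?, ?)",
    [(1, "Ann", "A"), (2, "Bob", "B"), (3, "Cy", "C")],
)
# Hypothetical source rows: id=1 already exists, id=4 is new.
source = [
    {"id": 1, "gname": "Ann", "team": "Z"},
    {"id": 4, "gname": "Dee", "team": "D"},
]
for row in source:
    upsert_player(conn, row)

rows = conn.execute("SELECT id, gname, team FROM player ORDER BY id").fetchall()
print(rows)
```

Running the loop a second time with the same source rows leaves the table unchanged, which is why upsert avoids the duplicates that a plain insert produces on repeated pipeline runs.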

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data from a Table input step (sql_db2, sql_source, table_name) and write it to Copy rows to result. Next, I read a single record at a time and pass it into a loop.
But here I ran into a problem: I don't know how to compare the data dynamically, because each table has different columns.
Is this even possible?
You can inject metadata (in this case, the column and table names) into many steps in Pentaho. You create one transformation that collects the metadata and injects it into another transformation, a template that contains only the steps and some basic information. The bulk of the column information used by those steps comes from the transformation that injects the metadata.
Check Pentaho official documentation about Metadata Injection (MDI) and the sample with a basic example of metadata injection available in your PDI installation.
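The underlying idea, one generic comparison driven by a list of table names instead of 100 hand-built ones, can be sketched outside Pentaho as well. The following is plain Python with `sqlite3`, not Metadata Injection; the two in-memory databases and the tables `a` and `b` are hypothetical stand-ins for the source and replicated systems. It loads whole tables into memory, so for large tables comparing row counts or checksums would be the more realistic choice.

```python
import sqlite3


def tables_match(source_conn, target_conn, table_name):
    """Compare the full contents of one table in two databases, ignoring row order."""
    # Table names cannot be bound as parameters, so they must come from a trusted list.
    query = f'SELECT * FROM "{table_name}"'
    src_cur = source_conn.execute(query)
    tgt_cur = target_conn.execute(query)
    # Column names are discovered at runtime, so no per-table code is needed.
    src_cols = [d[0] for d in src_cur.description]
    tgt_cols = [d[0] for d in tgt_cur.description]
    if src_cols != tgt_cols:
        return False
    # Sorting by repr() works even when rows contain NULLs (None).
    return sorted(src_cur.fetchall(), key=repr) == sorted(tgt_cur.fetchall(), key=repr)


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn, b_rows in ((source, [(1,), (2,)]), (target, [(1,)])):
    conn.execute("CREATE TABLE a (x INT, y TEXT)")
    conn.execute("CREATE TABLE b (z INT)")
    conn.executemany("INSERT INTO b VALUES (?)", b_rows)
source.executemany("INSERT INTO a VALUES (?, ?)", [(1, "p"), (2, None)])
target.executemany("INSERT INTO a VALUES (?, ?)", [(2, None), (1, "p")])

# The list of names plays the role of the metadata table from the question.
results = {name: tables_match(source, target, name) for name in ["a", "b"]}
print(results)
```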

How to pass dynamic table names for sink database in Azure Data Factory

I am trying to copy tables from one schema to another within the same Azure SQL database. So far, I have created a pipeline with a Lookup activity and passed the parameters to the For Each loop and Copy activity. But my sink dataset is not taking the parameter value I have given in the "table option" field; instead, it uses the dummy table I chose when creating the sink dataset. Can someone tell me how I can pass a dynamic table name to a sink dataset?
I have given concat('dest_schema.STG_',@{item().table_name})} in the table option field.
To make the schema and table names dynamic, add Parameters to the Dataset:
Most important - do NOT import a schema. If you already have one defined in the Dataset, clear it. For this Dataset to be dynamic, you don't want improper schemas interfering with the process.
In the Copy activity, provide the values at runtime. These can be hardcoded, variables, parameters, or expressions, so very flexible.
If it's the same database, you can even use the same Dataset for both, just provide different values for the Source and Sink.
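As a sketch of what this looks like in the JSON code view, the snippet below builds a parameterized dataset definition and the values a Copy activity sink could pass to it. The dataset name, linked service name, and parameter names (`schemaName`, `tableName`) are hypothetical, and the layout reflects my understanding of the ADF JSON format, so check it against what your own factory generates.

```python
import json

# Hypothetical dataset definition with two parameters and no imported column schema.
dataset = {
    "name": "GenericAzureSqlTable",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {
            "referenceName": "AzureSqlDb",  # hypothetical linked service name
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "schemaName": {"type": "string"},
            "tableName": {"type": "string"},
        },
        "typeProperties": {
            # The dataset reads its own parameters through dataset().<name>.
            "schema": {"value": "@dataset().schemaName", "type": "Expression"},
            "table": {"value": "@dataset().tableName", "type": "Expression"},
        },
    },
}

# Values supplied by the Copy activity sink inside the For Each loop.
# Inside an expression that already starts with @, item() needs no extra @{...}.
sink_parameters = {
    "schemaName": "dest_schema",
    "tableName": {"value": "@concat('STG_', item().table_name)", "type": "Expression"},
}

print(json.dumps(dataset, indent=2))
print(json.dumps(sink_parameters, indent=2))
```

Splitting the schema and table into separate parameters avoids concatenating 'dest_schema.STG_' into a single table name, which is one way the expression in the question can go wrong.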
WARNING: If you use the "Auto-create table" option, the schema for the new table will define any character field as varchar(8000), which can cause serious performance problems.
MY OPINION:
While you can do this, one of my personal rules is to not cross the database boundary. If the Source and Sink are on the same SQL database, I would try to solve this problem with a Stored Procedure rather than a data factory.

Azure Data Factory: trivial SQL query in Data Flow returns nothing

I am experimenting with Data Flows in Azure Data Factory.
I have:
Set up a LinkedService to a SQL Server db. This db only has 2 tables.
The two tables are called "dummy_data_table1" and "dummy_data_table2" and are registered as Datasets
The ADF is copying data from these 2 tables, and in the Data Flow they are called "source1" and "source2"
However, when I select a source, go to Source options, and change Input from Table to Query and enter a simple query, it returns 0 columns (there are 11 columns in dummy_data_table1). I suspect my syntax is wrong, but how should I change it?
Hopefully this screenshot will help.
The problem was not the syntax. The problem was that the data flow could not recognize "dummy_data_table1" because it didn't refer to anything known. To make it work, I had to:
Enable Data Flow Debug (at the top of the page, not visible in my screenshot)
Once that's enabled, I had to click on "import projection" to import the schema of my table
Once this is done, the table name and fields are all automatically recognized and can be referenced in the query just as one would in SQL Server.
Source:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-source#import-schema

U-SQL job to query multiple tables with dynamic names

Our challenge is the following one :
in an Azure SQL database, we have multiple tables with the following table names : table_num where num is just an integer. These tables are created dynamically so the number of tables can vary. (from table_1, table_2 to table_N) All tables have the same columns.
As part of a U-SQL script file, we would like to execute the same query on all of these tables and generate an output csv file with the combined results of all these queries.
We tried several things :
U-SQL does not allow looping, so we were thinking of creating a View in our Azure SQL database that would combine all the tables using a cursor of some sort. Then, the U-SQL file would query this View (using an external source). However, a View in Azure SQL database can only be created via a function, and a function cannot execute dynamic SQL or even call a stored procedure...
We did not find a way to call a stored procedure of the external data source directly from U-SQL.
We don't want to update our U-SQL job each time a new table is added...
Is there a way to do that in U-SQL through a custom extractor for instance? Any other ideas?
One solution I can think of is to use Azure Data Factory (v2) to assist in this.
You could create a pipeline with the following activities:
1. A Lookup activity configured to execute the stored procedure
2. A For Each activity that uses the output of the Lookup activity as its source
2.1. As a child item, a U-SQL activity that executes your U-SQL script, which writes the output of a single table (the current item of the For Each activity) to blob storage or Data Lake
3. A Copy activity that merges the blobs from step 2.1 into one final blob
If you have little or no experience working with ADF v2 do mind that it takes some time to get to know it but once you do, you won't regret it. Having a GUI to create the pipeline is a nice bonus.
Edit: as @wBob mentions, another (far easier) solution is to somehow create a single table with all rows, since all the dynamically generated tables have the same schema. You can create a stored procedure for populating this table, for example.
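A stored procedure that populates such a table would typically build one statement over table_1 to table_N with dynamic SQL. The Python sketch below only shows the shape of that statement; the column names and the extra `source_table_num` column are hypothetical additions for illustration.

```python
def union_all_query(table_count, columns):
    """Build one SELECT that stacks table_1 .. table_N with UNION ALL."""
    if table_count < 1:
        raise ValueError("need at least one table")
    column_list = ", ".join(f"[{c}]" for c in columns)
    selects = [
        # The literal table number records which table each row came from.
        f"SELECT {column_list}, {n} AS source_table_num FROM [table_{n}]"
        for n in range(1, table_count + 1)
    ]
    return "\nUNION ALL\n".join(selects)


query = union_all_query(2, ["id", "value"])
print(query)
```

Because the statement is regenerated from the current table count, adding table_N+1 requires no change to the U-SQL job, which only ever reads the combined table.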