Azure Data Factory Copy Activity - Does not fail, despite being unsuccessful

Overview
I have a Data Factory copy activity in a pipeline that moves tabular data (.csv) from a file in Azure Data Lake into a landing schema in a SQL database.
In principle, the table definitions will be refreshed on each load, so there should not be errors. However, I want to log errors in the case where the table exists and the file cannot be mapped to the destination.
I anticipate the activity failing when it cannot match the source columns to the destination table definition. Strangely, my process succeeds despite not loading the file into the database.
Testing
I added a column to my source file.
I did not update the destination database table definition.
I truncated the destination table.
I unset fault tolerance.
I deleted all copy activity logs (nascent development environment, so this is OK).
I ran the pipeline.
I checked the Data Factory pipeline output: a 100% success rating!?
I queried the destination table: empty (see the check below).
I opened each log file and looked for relevant information: nothing but headers, one per copy activity call.
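For reference, the table reset and the emptiness check above amounted to something like the following (the schema and table names are placeholders):
-- Placeholder names: substitute the actual landing schema and table.
TRUNCATE TABLE landing.MyLandingTable;
-- After the pipeline reports success:
SELECT COUNT(*) AS row_count FROM landing.MyLandingTable;  -- returns 0 despite the "successful" copy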
Copy activity settings

Related

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from a data lake into on-premises SQL Server, so I created a data flow to do the necessary data transformation and cleaning.
After that, I sink all the final data to a staging data lake, stored in CSV format.
I am facing two kinds of issues here.
Whenever I trigger or debug the pipeline to load my dataset (the full data flow activity), the data is loaded into the CSV the first time; if I run the same pipeline a second time, the CSV file in the target data lake is loaded empty, meaning the column headers are written but I cannot see any values inside the file.
Coming to the copy activity, which is connected to the on-premises SQL Server: I am trying to load the data, but if I trigger this pipeline again and again, duplicate data is loaded. I want to load only incremental (or updated) data coming from the data lake CSV file. How do we handle this?
Kindly suggest.
When we want to incrementally load our data into a database table, we need to use the Upsert option in the Copy data tool.
Upsert helps you to incrementally load the source data based on a key column (or columns). If the key column value is already present in the target table, it updates the rest of the column values; otherwise, it inserts a new row with that key and the other values.
Look at the following demonstration to understand how upsert works. I used an Azure SQL database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source csv data (data I want to incrementally load):
I have taken an id which already exists in target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create/select a dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which the upsert should happen.
Table after upsert using Copy data:
Now, after the upsert using Copy data, the id=1 row should be updated and the id=4 row should be inserted. The following is the final output achieved, which is in line with the expected output.
You can use the primary key of your target table (which is also present in your source CSV) as the key column in the Copy data sink configuration. Any other configuration (like a source filter by last-modified date) should not affect the process.
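For context, the upsert that the Copy data sink performs on the key column is logically equivalent to a T-SQL MERGE. A minimal sketch against the player table above, with illustrative values standing in for the staged CSV rows:
-- Illustrative only: the VALUES rows stand in for the staged source data.
MERGE INTO player AS target
USING (VALUES
         (1, 'UpdatedName', 'TeamA'),  -- id already in the table: existing row is updated
         (4, 'NewPlayer', 'TeamB')     -- new id: a new row is inserted
      ) AS source (id, gname, team)
   ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET gname = source.gname, team = source.team
WHEN NOT MATCHED THEN
    INSERT (id, gname, team) VALUES (source.id, source.gname, source.team);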

Can I copy data table folders in QuestDB to another instance?

I am running QuestDB on a production server that constantly writes data to a table, 24x7. The table is partitioned by day.
I want to copy the data to another instance and update it there incrementally, since data for old days never changes. Sometimes the copy works, but sometimes the data gets corrupted, reading from the second instance fails, and I have to retry copying all of the table data, which is huge and takes a lot of time.
Is there a way to back up / restore QuestDB without interrupting continuous data ingestion?
QuestDB appends data in the following sequence:
Append to column files inside partition directory
Append to symbol files inside root table directory
Mark transaction as committed in _txn file
There is no fixed order between 1 and 2, but 3 always happens last. To incrementally copy data to another box, you should copy in the opposite order:
Copy _txn file first
Copy root symbol files
Copy partition directory
Do it while your slave QuestDB server is down; then, on start, the table should have data up to the point when you started copying the _txn file.

Data Flow output to Azure SQL Database contains only NULL data on Azure Data Factory

I'm testing the data flow on my Azure Data Factory. I created Data Flow with the following details:
Source dataset linked service - a CSV files dataset from Blob storage
Sink linked service - an Azure SQL database with a pre-created table
My CSV files are quite simple, as they contain only 2 columns (PARENT, CHILD). So my table in the SQL DB also has only 2 columns.
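For reference, the target table was created along these lines (the names and types here are only illustrative):
-- Illustrative definition of the 2-column sink table; actual names and types may differ.
CREATE TABLE dbo.Hierarchy
(
    parent NVARCHAR(100) NULL,
    child  NVARCHAR(100) NULL
);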
For the sink settings of my data flow, I have allowed insert and left the other options as default.
I have also mapped the 2 input and output columns as per the screenshot.
The pipeline with the data flow ran successfully; when I checked the result, I could see that 5732 rows were processed. Is this the correct way to check? This is the first time I have tried this functionality in Azure Data Factory.
But when I click on the Data preview tab, the values are all NULL.
And when I checked the table in my Azure SQL DB where I tried to insert the data from the CSV files in Blob storage (selecting the top 1000 rows), I don't see any data.
Could you please let me know what I configured incorrectly on my Data Flow? Thank you very much in advance.
Here is the screenshot of the ADF data flow source data: it does see the data on the right side, as those values are not NULL, but on the left side they are all NULLs. I imagine the right side is the data from the source CSV on the blob, right? And the left side is the sink destination, since the table is empty for now?
And here is the screenshot of the sink's Inspect input. I think this is correct, as it reads the 2 columns correctly (Parent, Child), is it?
After adding a Map Drifted transformation to map "Parent" => "parent" and "Child" => "child",
I get this error message after running the pipeline.
When checking the sink data preview, I get this error message. It seems like there is an incorrect mapping?
I renamed the MapDrifted1 expression to "toString(byName('Parent1'))" and likewise for Child1, as suggested.
The data flow executed successfully; however, I still get NULL data in the sink SQL table.
Can you copy/paste the script behind your data flow design graph? Go to the ADF UI, open the data flow, then click the Script button on top right.
In your Source transformation, click on Data Preview to see the data. Make sure you are seeing your data, not NULLs. Also, look at the Inspect on the INPUT for your Sink, to see if ADF is reading additional columns.

SSIS Multicast - executing in a specific order

I have an SSIS MULTICAST object that splits my flow into 2 paths.
1st path: I need to update a row;
2nd path: I need to insert a row.
Basically, I am implementing SCD Type 2 in SSIS without using the SCD Wizard. So after I have identified a record that has changed in the source data, I need the 1st path to expire that record while the 2nd path inserts the changed record into the destination table.
I need a way to make the 2nd path wait until the 1st path has finished (otherwise, the 1st path will also update the row newly inserted by the 2nd path).
Any help is appreciated.
Multicast is a parallel operation, so you cannot make one path wait for another.
So what you need to do is temporarily store the data and process it later to insert into the destination (for SCD Type 2); a sketch of that final step follows the list below.
For temporarily storing the data, you have some options:
Temporary tables (use RetainSameConnection = true so that the session context is maintained). From the temporary tables, you can load to the final table.
Recordset destination (an in-memory object). From the Recordset destination, you can load to the final table in a Data Flow task or in a Script task.
Raw File destination (it stores the data in a native format and can easily be reused).
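Whichever staging option you choose, the final load is the usual SCD Type 2 pattern: expire the current version first, then insert the new version, so the insert can no longer be caught by the update. A minimal T-SQL sketch, assuming a temporary staging table and with illustrative table and column names:
-- Step 1: expire the current version of every changed business key.
UPDATE d
SET    d.EndDate = GETDATE(),
       d.IsCurrent = 0
FROM   dbo.DimCustomer AS d
JOIN   #StagedChanges  AS s ON s.CustomerKey = d.CustomerKey
WHERE  d.IsCurrent = 1;

-- Step 2: insert the new version of each changed row.
INSERT INTO dbo.DimCustomer (CustomerKey, Name, City, StartDate, EndDate, IsCurrent)
SELECT s.CustomerKey, s.Name, s.City, GETDATE(), NULL, 1
FROM   #StagedChanges AS s;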

Failed Activity not running

I have an Azure Data Factory v2 pipeline with a Copy data activity. If the activity fails, a Lookup activity should run. Unfortunately, the Lookup never runs. Why doesn't it run on failure of the Copy data activity? How do I get this to work?
I'm expecting the "Set load of file to failed" activity to run because the "Load Zipped File to Import Destination" activity failed. In fact, in the output you can see the status is "Failed", but no other activity is run.
Later I updated the Copy data activity to skip incompatible rows, which caused it to succeed. The expected number of rows loaded now doesn't match the total number of rows loaded, so the If Condition activity goes down the failure route. Why would the Lookup run only when the If Condition triggers the failure path, and not when the Copy data activity fails?
Activity dependencies are a logical AND. The Lookup activity "Set load of file to failed" will only execute if both the Copy data activity and the If Condition fail. It's not one or the other - it's both. I blogged about this here.
It's common to redesign this as:
A. Use multiple failure activities. Instead of having the one "Set load of file to failed" activity at the end, copy that activity and have the Copy data activity link to the new one on failure.
B. Create a parent pipeline and use an Execute Pipeline activity. Then add a single failure dependency from the Execute Pipeline activity to the "Set load of file to failed" activity.