Using ADF Data Flow Derived Column transform against nested Delta structures - azure-data-factory-2

I'm trying to use a derived column transform within an ADF (Gen 2) Data Flow where I've ingested a Delta table with nested structures. I'm struggling with the syntax needed to flatten out these structures, and no column info is displayed even though I can preview the data.
Such a structure would be:
{
    "ContactId": "1002657",
    "Name": {
        "FirstName": "Donna",
        "FullName": "Donna Brittain",
        "LastName": "Brittain"
    }
}
Data preview working OK (screenshot: Data Preview).
The structure of my Delta table (screenshot: Delta Table Struct).
The error I'm getting when trying to reference a nested column (screenshot: Derived Column task).
How can I reference a nested column such as Name.FirstName to flatten it out to FirstName and why is it not showing up in any of the mappings?

There is an easy way to flatten the nested structures: we can use a Copy activity in ADF first, which will flatten the nested columns automatically.
Copy the data into Azure Storage such as a data lake (here I used Azure Data Lake Storage Gen2), then we can use it as the source in the Data Flow.
We can create a txt or csv file with headers in the data lake.
Then we can define a Copy activity in ADF and set the mapping.
After a debug run, we can see the result and use it as the source in the data flow.
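For reference, the mapping set on the Copy activity's Mapping tab corresponds roughly to a translator block like the one below. This is only a sketch based on the sample record above and assumes the nested data is read as a hierarchical source; the exact paths and dataset details depend on your setup.
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "$['ContactId']" }, "sink": { "name": "ContactId" } },
        { "source": { "path": "$['Name']['FirstName']" }, "sink": { "name": "FirstName" } },
        { "source": { "path": "$['Name']['LastName']" }, "sink": { "name": "LastName" } },
        { "source": { "path": "$['Name']['FullName']" }, "sink": { "name": "FullName" } }
    ]
}
Each nested path is mapped to a flat sink column, which is what produces the flattened csv/txt file.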
Update:
In the sink, we can set the value of the Max rows per file option as follows:
ADF will then divide the output into several files.
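In the underlying copy activity JSON this corresponds roughly to the delimited text sink's format settings (a sketch; the row limit and file name prefix are just example values):
"sink": {
    "type": "DelimitedTextSink",
    "formatSettings": {
        "type": "DelimitedTextWriteSettings",
        "maxRowsPerFile": 1000,
        "fileNamePrefix": "contacts"
    }
}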

Related

Azure Data Factory Copy Activity for JSON to Table in Azure SQL DB

I have a copy activity that takes a bunch of JSON files and merges them into a single JSON.
I would now like to copy the merged single JSON to Azure SQL DB. Is that possible?
OK, it appears to be working; however, the output in SQL is just countryCode and CompanyId.
However, I need to retrieve all the financial information in the JSON as well.
I repro'd the same and below are the steps.
Two JSON files are taken as the source.
Those files are merged into a single file using a copy activity.
The merged JSON data is then taken as the source dataset in another copy activity.
In the sink, a dataset for the Azure SQL DB is created and the Auto create table option is selected.
In the sink dataset, the Edit checkbox is selected and the sink table name is given.
Once the pipeline is run, the data is copied to the table.
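The sink of that second copy activity looks roughly like this in JSON (a sketch; the auto-create option is the relevant part):
"sink": {
    "type": "AzureSqlSink",
    "tableOption": "autoCreate"
}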

Azure Data Factory Incremental Load data by using Copy Activity

I would like to load incremental data from the data lake into on-premises SQL, so I created a data flow to do the necessary data transformation and cleaning.
After that, I sink all the final data to a staging data lake, stored in CSV format.
I am facing two kinds of issues here.
First, whenever I trigger/debug the pipeline (the full data flow activity), the data loads into the CSV the first time; if I run the same pipeline a second time, the CSV file in the target data lake comes out empty, meaning the column headers are written but I cannot see any values inside the file.
Second, for the copy activity, which is connected to the on-premises SQL Server: I am trying to load the data, but if we trigger this pipeline again and again, duplicate data gets loaded. I want to load only incremental or updated data coming from the data lake CSV file. How do we handle this?
Kindly suggest.
When we want to incrementally load data into a database table, we need to use the Upsert option in the Copy activity.
Upsert helps you incrementally load the source data based on a key column (or columns). If the key value is already present in the target table, it updates the rest of the column values; otherwise it inserts a new row with that key and the other values.
Look at the following demonstration to understand how upsert works. I used an Azure SQL database as an example.
My initial table data:
create table player(id int, gname varchar(20), team varchar(10))
My source csv data (data I want to incrementally load):
I have taken one id which already exists in the target table (id=1) and another which is new (id=4).
My copy data sink configuration:
Create/select a dataset for the target table. Check the Upsert option as your write behavior and select a key column based on which the upsert should happen.
Table after upsert using Copy data:
Now, after the upsert using Copy data, the id=1 row should be updated and the id=4 row should be inserted. The following is the final output achieved, which is in line with the expected output.
You can use the primary key of your target table (which is also present in your source CSV) as the key column in the Copy data sink configuration. Any other configuration (such as a source filter by last-modified date) should not affect the process.
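In the copy activity JSON, that sink configuration looks roughly like the following (a sketch; the key column is the id column from the table above):
"sink": {
    "type": "AzureSqlSink",
    "writeBehavior": "upsert",
    "upsertSettings": {
        "useTempDB": true,
        "keys": [ "id" ]
    }
}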

Table name is getting appended to column names in the resultant file in Azure Data Factory

I was trying to get data from an on-prem Hive source to Azure Data Lake Gen2 using Azure Data Factory.
As I need to get data for multiple tables, I created a file (e.g. tnames.txt) with all my table names and stored it in Data Lake Gen2.
In Azure Data Factory I created a Lookup activity and passed the tnames.txt file to it.
Then I added a ForEach activity after that Lookup activity, and inside the ForEach added a copy activity.
In the copy activity source, I give a query to extract the data.
The sink is Data Lake Gen2.
Example code:
select * from tableName
Here the table name is passed in dynamically from tnames.txt, as sketched below.
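Presumably the ForEach items and the dynamic source query look something like the following (the activity name Lookup1 and the property name tablename are assumptions; they depend on how the lookup output is structured):
ForEach items:
@activity('Lookup1').output.value
Copy activity source query:
@{concat('select * from ', item().tablename)}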
But after the data is copied into the data lake, the headers in the copied data look like:
"tablename.columnname".
For example: the table name is Employee and a few of the columns are ID, Name, Gender, ...
The columns in my resultant file come out as Employee.ID, Employee.Name, Employee.Gender, but my requirement is just the column name.
Basically the table name is being appended to the column name.
How do I solve this issue? Or is there any other way to get data for multiple tables in a single pipeline/copy activity?
Check the Mapping tab of your copy activity. If the mapping is enabled, clear it and use auto-create table. It will auto-generate the schema according to the source schema, so there is no need to explicitly create the table with a defined schema. Let it auto-create the table; it will generate the required mapping automatically.

How to get an array from JSON in the Azure Data Factory?

My current (not properly working) setup has two pipelines:
Get API data to lake: for each row in a metadata table in SQL, call the REST API and copy the reply (JSON files) to the Blob data lake.
Copy data from the lake to SQL: for each file, auto-create a table in SQL.
The result is the correct number of tables in SQL. Only the content of the tables is not what I hoped for: they all contain one column named odata.metadata and one entry, the link to the metadata.
If I manually remove the metadata from the JSON in the data lake and then run the second pipeline, the SQL table is what I want to have.
Have:
{ "odata.metadata":"https://test.com",
"value":[
{
"Key":"12345",
"Title":"Name",
"Status":"Test"
}]}
Want:
[
    {
        "Key": "12345",
        "Title": "Name",
        "Status": "Test"
    }
]
I tried to add $.['value'] in the API call. The result then was no odata.metadata line, but the array started with {value:, which resulted in an error when copying to SQL.
I also tried to use mapping (in the sink) to SQL. That gives the wanted result for the dataset I manually specified the mapping for, but it only goes well for datasets with the same number of columns in the array. I don't want to do the mapping manually for 170 calls...
Does anyone know how to handle this in ADF? For now I feel like the only solution is to add a Python step to the pipeline, but I hope for a somewhat standard ADF way to do this!
You can add another pipeline with a data flow that removes that content from the JSON file before copying the data to SQL, using the flatten formatter.
Before flattening the JSON file:
This is what I see when JSON data copied to SQL database without flattening:
After flattening the JSON file:
Added a pipeline with a data flow to flatten the JSON file and remove the 'odata.metadata' content from the array.
Source preview:
Flatten formatter:
Select the required object from the Input array
After selecting value object from input array, you can see only the values under value in Flatten formatter preview.
Sink preview:
File generated after flattening.
Copy the generated file as Input to SQL.
Note: If your Input file schema is not constant, you can enable Allow schema drift to allow schema changes
Reference: Schema drift in mapping data flow
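For reference, the flatten step expressed in data flow script looks roughly like this (a sketch only, based on the sample schema above; stream names such as source1 and Flatten1 are placeholders):
source(output(
        {odata.metadata} as string,
        value as (Key as string, Title as string, Status as string)[]
    ),
    allowSchemaDrift: true) ~> source1
source1 foldDown(unroll(value),
    mapColumn(
        Key = value.Key,
        Title = value.Title,
        Status = value.Status
    )) ~> Flatten1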

Write Azure Data Factory Output Parameter to dataset

Is it possible to write an output parameter to a dataset?
I have a Get Metadata activity that retrieves the file name of an Azure Blob dataset, and I would like to write that value into another Azure Blob dataset as an additional column via a copy activity.
Thanks
If you are looking to use the output of the previous activity as the input to the next one, you could go about it in the following manner.
I am hoping that the attribute you are getting is childItems; its values can be obtained in the next step using the following expression:
@activity('Name_of_activity').output.childItems
This would return an array of the child items (subfolders and files).
The ADF documentation on expressions and functions should help you with the expression syntax.
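For the original ask (writing the file name into the sink as an extra column), the copy activity source supports additional columns. A rough sketch, assuming a delimited text source (the column name SourceFileName is a placeholder):
"source": {
    "type": "DelimitedTextSource",
    "additionalColumns": [
        { "name": "SourceFileName", "value": "$$FILEPATH" }
    ]
}
$$FILEPATH stamps each row with the path of the file it was read from; alternatively, a value from the Get Metadata output (for example its itemName property) can be passed in as dynamic content.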