How can I leverage a dynamic Data Flow in Azure Data Factory to map lookup tables based on a config file?

I am attempting to create a pipeline that accepts values from a JSON config file and uses them to build a source query, lookup logic, and a destination sink.
An example of an object from the config file would look something like this:
{
    /* Destination table fields */
    "destTableName": "DimTable1",
    "destTableNaturalKey": "ClientKey, ClientNaturalKey",
    "destTableSchema": "dbo",
    /* Source table fields */
    "sourcePullFields": "ClientKey, ClientNaturalKey",
    "sourcePullFilters": "WHERE ISNULL(ClientNaturalKey,'') <> ''",
    "sourceTableName": "ClientDataStaged",
    "sourceTableSchema": "stg"
}
The pipeline would identify the items within the config (via a ForEach) that need to be checked for new data, in a basic pipeline like this:
Pipeline pic
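(Roughly: a Lookup activity reads the config file and its output drives the ForEach. The ForEach Items expression is along these lines, with 'LookupConfig' standing in for whatever the Lookup activity is actually named:)
@activity('LookupConfig').output.value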
I would then pass these values into the data flow, from the ADF Pipeline:
ADF Parameters
And build the source pull and lookup values within the Data flow expressions with something like this:
concat('SELECT DISTINCT ', $sourcePullFields, ' FROM ', $sourceTableSchema, '.', $sourceTableName, ' ', $sourcePullFilters)
When I am inside the data flow and pass the same config values through the debug settings, I can view projections and step through the data flow correctly. It is only when I execute the data flow from the pipeline that I get errors.
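For reference, the parameter assignments on the Execute Data Flow activity look roughly like this in the pipeline JSON (simplified; the single quotes that turn each pipeline string into a data flow string literal are one of the things I have been adding and removing):
"parameters": {
    "sourcePullFields":  { "value": "'@{item().sourcePullFields}'",  "type": "Expression" },
    "sourceTableSchema": { "value": "'@{item().sourceTableSchema}'", "type": "Expression" },
    "sourceTableName":   { "value": "'@{item().sourceTableName}'",   "type": "Expression" },
    "sourcePullFilters": { "value": "'@{item().sourcePullFilters}'", "type": "Expression" }
}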
As a second attempt, I simply passed through the source query within the config:
{
    "destQuery": "SELECT Hashbytes('MD5', (cast(ClientKey as varchar(5)) + ClientNaturalKey)) AS DestHashVal FROM dbo.DimTable1",
    "sourceQuery": "SELECT DISTINCT ClientKey, ClientNaturalKey, Hashbytes('MD5', ( Cast(SchoolKey AS VARCHAR(5)) + ClientNaturalKey )) AS SourceHashVal FROM stg.ClientNaturalKey WHERE Isnull(ClientNaturalKey, '') <> ''"
}
I had intended to use the md5 function within the data flow expressions, but at this point I simply want to:
Define a source query, whether it be via a SQL statement or built from variables
Define a lookup query, whether it be via a SQL statement or built from variables
Have the ability to compare hashed value(s) from the source to the lookup (destination table); a sketch of the hash expression follows this list
If the lookup returns no match on the hash, load the values
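For the hash itself, the derived-column expression I had in mind inside the data flow would be something along these lines (column names simply mirror the config above, so treat this as a sketch rather than the final expression):
md5(concat(toString(ClientKey), ClientNaturalKey))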
ADF Data Flow Pic
Ideally I am not defining the SQL statement directly; it just feels less intelligent. Regardless, the goal is to avoid migrating ~50 DFTs from SSIS one-for-one and instead handle them with a few pipelines and a single data flow that can cope with the dynamic behavior. Since the process works within the confines of the data flow, I have been experimenting with passing the parameters in different ways, removing quotes, and so on, unsure of what the string interpolation is doing.

Related

read specific file names in adf pipeline

I have a requirement where blob storage has multiple files with the names file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only the file names from 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Get Metadata output:
Pass the Get Metadata output child items to the ForEach activity.
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and add the true-case expression below to copy only the required files to the sink.
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition is true, add a Copy data activity to copy the current item (file) to the sink.
Source:
Sink:
Output:
I took a slightly different approach using a Filter activity and the endsWith function:
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'), endsWith(item().name, '_6.csv')), endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results; it depends on what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest:
Declare two parameters for the lower and upper limits of the range
Create a ForEach loop whose Items expression builds the range [lowerlimit, upperlimit] from those parameters (see the sketch after this list)
Create a parameterized dataset for the source
Use the file number from the ForEach loop to create a dynamic file name expression like
@concat('file_', item(), '.csv')
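A minimal sketch of the ForEach Items expression, assuming two integer pipeline parameters named lowerlimit and upperlimit (range takes a start index and a count):
@range(pipeline().parameters.lowerlimit, add(sub(pipeline().parameters.upperlimit, pipeline().parameters.lowerlimit), 1))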

How to build dynamic url for http connector in Clover ETL

How do I build a dynamic HTTP URL in the HTTPConnector to map values from a SQL source in CloverDX? For now I have a dbread component with one selected column (a list of TAXIDs) in a table, which should be my dynamic attribute/parameter for the URL.
I need to build the URL with a path parameter, TAXID, and a query parameter, getdate(today).
Something like:
GET baseurl/api/search/taxid/{taxid}?date={getdate(today)}
Use the 'Input Mapping' property of the HTTPConnector, where you can build the URL manually, e.g. $out.0.URL = 'baseurl/api/search/taxid/' + $in.0.taxid + '?date=' + date2str(today(), 'yyyy_Mm....')
OR
use 'Add input fields as parameters': provide records with the fields 'taxid' and 'date' (with the date value prefilled), and the query will be built automatically, on the fly, from the values in those fields.

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract them and put them in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I have no idea how to do so.
I've played around with the EvaluateJSON, SplitJSON, and FlattenJSON processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc.) is a player ID for the stats that follow it. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID Stat ID Stat Value
5381 wind_speed 4.0
5381 tm_st_snp 26.0
5381 tm_off_snp 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe it's possible to use a JOLT transform to convert your JSON into a format like:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
then use PutDatabaseRecord with a JSON record reader.
Another approach is to use the ExecuteGroovyScript processor.
Add a new property to it named SQL.mydb and link it to your DBCP controller service.
Then use the following script as the Script Body property:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

def ff = session.get()
if (!ff) return

// read flow file content and parse it
def body = ff.read().withReader("UTF-8") { reader ->
    new JsonSlurper().parse(reader)
}

def results = []
// use the defined sql connection to create a batch
SQL.mydb.withTransaction {
    def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
    results = SQL.mydb.withBatch(100, cmd) { statement ->
        // run through all keys/subkeys in the flow file body
        body.each { pid, keys ->
            keys.each { k, v ->
                statement.addBatch(pid, k, v)
            }
        }
    }
}

// write results as the new flow file content
ff.write("UTF-8") { writer ->
    new JsonBuilder(results).writeTo(writer)
}
// transfer to success
REL_SUCCESS << ff

How to Map Input and Output Columns dynamically in SSIS?

I have to upload data into SQL Server from .dbf files through SSIS.
My output columns are fixed, but the input columns are not, because the files come from the client and the client may have laid out the data in their own style. There may be some unused columns, or an input column name may differ from the output column name.
One idea I had was to map each file's input columns to the output columns in a SQL database table and use only the columns that have a row for that file ID.
But I am not sure how to do that. Any ideas?
Table Example
FileID  InputColumn    OutputColumn  Active
1       CustCd         CustCode      1
1       CName          CustName      1
1       Address        CustAdd       1
2       Cust_Code      CustCode      1
2       Customer Name  CustName      1
2       Location       CustAdd       1
If you create a similar table, you can use it in two approaches to map columns dynamically inside the SSIS package; otherwise you must build the whole package programmatically. In this answer I will try to give you some insight into how to do that.
(1) Building Source SQL command with aliases
Note: This approach will only work if all .dbf files have the same column count but different column names.
In this approach you generate the SQL command that will be used as the source, based on the FileID and the mapping table you created. The FileID and the .dbf file path must be stored inside variables. As an example:
Assuming that the Table name is inputoutputMapping
Add an Execute SQL Task with the following command:
DECLARE @strQuery as VARCHAR(4000)
SET @strQuery = 'SELECT '
SELECT @strQuery = @strQuery + '[' + InputColumn + '] as [' + OutputColumn + '],'
FROM inputoutputMapping
WHERE FileID = ?
SET @strQuery = SUBSTRING(@strQuery,1,LEN(@strQuery) - 1) + ' FROM ' + CAST(? as Varchar(500))
SELECT @strQuery
In the Parameter Mapping tab, map the variable that contains the FileID to parameter 0 and the variable that contains the .dbf file name (used in place of a table name) to parameter 1.
Set the ResultSet type to Single Row and store ResultSet 0 inside a variable of type string, for example @[User::SourceQuery].
The ResultSet value will be as follows:
SELECT [CustCd] as [CustCode],[CNAME] as [CustName],[Address] as [CustAdd] FROM database1
In the OLE DB Source, set the data access mode to SQL Command from Variable and use the @[User::SourceQuery] variable as the source.
(2) Using a Script Component as Source
In this approach you have to use a Script Component as Source inside the Data Flow Task:
First of all, you need to pass the .dbf file path and SQL Server connection to the script component via variables if you don't want to hard code them.
Inside the script editor, you must add an output column for each column found in the destination table and map them to the destination.
Inside the Script, you must read the .dbf file into a datatable:
C# Read from .DBF files into a datatable
Load a DBF into a DataTable
After loading the data into a datatable, also fill another datatable with the data found in the MappingTable you created in SQL Server.
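For example, a minimal sketch of filling the mapping DataTable; the connection string variable, schema, and column list here are illustrative assumptions, not part of the original answer:
// assumes: using System.Data; using System.Data.SqlClient;
DataTable MappingTable = new DataTable();
using (SqlConnection conn = new SqlConnection(Variables.SqlConnString))   // hypothetical read-only SSIS variable
using (SqlDataAdapter adapter = new SqlDataAdapter(
    "SELECT FileID, InputColumn, OutputColumn FROM dbo.inputoutputMapping WHERE Active = 1", conn))
{
    // Fill opens and closes the connection itself
    adapter.Fill(MappingTable);
}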
After that, loop over the datatable columns and change each .ColumnName to the relevant output column, for example:
foreach (DataColumn col in myTable.Columns)
{
    // look up the output name for this input column in the mapping table (FileID 1 here)
    col.ColumnName = MappingTable.AsEnumerable()
        .Where(x => x.Field<int>("FileID") == 1 && x.Field<string>("InputColumn") == col.ColumnName)
        .Select(y => y.Field<string>("OutputColumn"))
        .First();
}
After that, loop over each row in the datatable and create a script output row.
In addition, note that while assigning output rows you must check whether the column exists. You can first add all column names to a list of strings, then use it to check, for example:
var columnNames = myTable.Columns.Cast<DataColumn>()
                         .Select(x => x.ColumnName)
                         .ToList();

foreach (DataRow row in myTable.Rows)
{
    // create the output row, then set each column that exists in this file
    OutputBuffer0.AddRow();

    if (columnNames.Contains("CustCode"))
    {
        OutputBuffer0.CustCode = row["CustCode"].ToString();
    }
    else
    {
        OutputBuffer0.CustCode_IsNull = true;
    }
    // continue checking all other columns
}
If you need more details about using a Script Component as a source, then check one of the following links:
SSIS Script Component as Source
Creating a Source with the Script Component
Script Component as Source – SSIS
SSIS – USING A SCRIPT COMPONENT AS A SOURCE
(3) Building the package dynamically
I don't think there are other methods you can use to achieve this goal; if you choose to build the package dynamically, then you should go with:
BIML
Integration Services managed object model
EzApi library
(4) SchemaMapper: C# schema mapping class library
Recently I started a new project on GitHub, which is a class library developed using C#. You can use it to import tabular data from Excel, Word, PowerPoint, text, CSV, HTML, JSON, and XML into a SQL Server table with a different schema definition, using a schema-mapping approach. Check it out at:
SchemaMapper: C# Schema mapping class library
You can follow this Wiki page for a step-by-step guide:
Import data from multiple files into one SQL table step by step guide

Export SQL query results to XML format using powershell

I need to make an XML file based on the SQL query that I run using powershell. I already know the schema for the XML that I need to create. The query results need to be looped through and I want to add each data value to specific XML node as per the schema.
I am able to run the query and get the results I need but I am having issues placing the data as per prescribed format.
Here's an example of how I am trying to accomplish this:
# Parse the XML template ($xml is the schema file I have from the client)
$XmlTemplate = [xml](Get-Content $xml)

# Jump to the PlanID node I need to enter data into
$PlanIDXML = $XmlTemplate.NpiLink.PlanProvider.PlanID

# Jump to the PlanName node I need to enter data into
$PlanNameXML = $XmlTemplate.NpiLink.PlanProvider.PlanName

Sample query:
select PlanID, PlanName from plan

# Assuming I ran my query and saved the results as $qryresults
foreach ($result in $qryresults)
{
    $PlanID = $result.PlanID
    $PlanName = $result.PlanName

    # Make a clone
    $NewPlanIDXML = $PlanIDXML.Clone()

    # Change the data
    $NewPlanIDXML = $PlanID

    # Append
    $PlanIDXML.AppendChild($NewPlanIDXML)

    # Do the same thing for PlanName
    $PlanNameXML = $result.PlanName
}
$XmlTemplate.Save('filepath')
My concern is that I need to do this for each plan or plan ID that I get in my query results, so I need to keep generating new PlanID and PlanName tags, append them to the original nodes, and save the file.
So, if my query results have 10 plan IDs, it should continue to generate new PlanID tags and PlanName tags.
It's not letting me append (because System.String cannot be converted to System.Xml.XmlNode). I am really stuck, and if you have a better approach on how to handle this, I am all ears.
Thanks much in advance!
You might be overengineering this a bit. If you have a template for the XML node, just treat it as a string, popping your values in at the appropriate place. Generate some array of these nodes as strings, then join them together and save to disk.
Let's say your template looks like this (type in some tokens yourself where generated values should go):
--Template.xml---
<Node attr="##VALUE1##">
<Param>##VALUE2##</Param>
</Node>
And you want to run some query to generate a bunch of these nodes, filling in VALUE1 and VALUE2. Then something like this could work:
$template = (gc .\Template.xml) -join "`r`n"
$val1Token = '##VALUE1##'
$val2Token = '##VALUE2##'
$nodes = foreach ($item in Run-Query)
{
    # simple string replace
    $result = $template
    $result = $result.Replace($val1Token, $item.Value1)
    $result = $result.Replace($val2Token, $item.Value2)
    $result
}

# you have all nodes in a string array; just join them together along with the parent node
$newXml = (@("<Nodes>") + $nodes + "</Nodes>") -join "`r`n"
$newXml | Out-File .\Results.xml