How to Parameterize USQL input files without using ADF - azure-data-lake

I have a input folder in ADLS in the format year/month/date eg: 2017/07/11. I want to pass this input folder as a parameter to my usql script. I am not using ADF. I dont want to generate current date from within Usql script as i am not sure if the input folder is of the current date. How to do it effectively?
One way I thought of was uploading a "done" file after all my input folder is uploaded to ADLS account and that "done" file will contain the date. But i am not able to use that date to form my input data path. Please help.

Let's assume you have several csv files in your folder structure (structured as yyyy/MM/dd) and you want to extract all the files in a folder of a specific date. You can do it in two ways (depending in whether you need to have exact datetime semantics or if you are fine with path concat).
First the path concat example:
DECLARE EXTERNAL #folder = "2017/07/11"; // Script parameter with default value.
// You can specify the value also with constant-foldable expression on Datetime.Now.
DECLARE #path = "/constantpath/"+#folder+"/{*.csv}";
#data = EXTRACT I int, s string // or whatever your schema is...
FROM #path
USING Extractors.Csv();
...
And here is the example with a file set virtual column:
DECLARE EXTERNAL #date = "2017/07/11"; // Script parameter with default value.
// You can specify the value also with constant-foldable expression on Datetime.Now and string serialization (I am not sure if the ADF parameter model supports DateTime values).
DECLARE #path = "/constantpath/{date:yyyy}/{date:MM}/{date:dd}/{*.csv}";
#data = EXTRACT I int, s string // or whatever your schema is...
, date DateTime // virtual column for the date pattern
FROM #path
USING Extractors.Csv();
// Now apply the requested filter to reduce the files to the requested set
#data = SELECT * FROM #data WHERE date == DateTime.Parse(#date);
...
In both cases, you pass the parameter via the ADF parameterization model and you can decide to wrap the code into a U-SQL stored procedure or TVF as suggested by Bob.

Related

read specific files names in adf pipeline

I have got requirement saying, blob storage has multiple files with names file_1.csv,file_2.csv,file_3.csv,file_4.csv,file_5.csv,file_6.csv,file_7.csv. From these i have to read only filenames from 5 to 7.
how we can achieve this in ADF/Synapse pipeline.
I have repro’d in my lab, please see the below repro steps.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Get Metadata output:
Pass the Get Metadata output child items to ForEach activity.
#activity('Get Metadata1').output.childItems
Add If Condition activity inside ForEach and add the true case expression to copy only required files to sink.
#and(greater(int(substring(item().name,4,1)),4),lessOrEquals(int(substring(item().name,4,1)),7))
When the If Condition is True, add copy data activity to copy the current item (file) to sink.
Source:
Sink:
Output:
I took a slightly different approaching using a Filter activity and the endsWith function:
The filter expression is:
#or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results, it depends what you need.
You can always do what #NiharikaMoola-MT suggested . But since you already know the range of the files ( 5-7) , I suggest
Declare two paramter as an upper and lower range
Create a Foreach loop and pass the parameter and to create a range[lowerlimit,upperlimit]
Create a paramterized dataset for source .
Use the fileNumber from the FE loop to create a dynamic expression like
#concat('file',item(),'.csv')

How can I fetch a dynamic file from FTP server-generated every day?

There are transactional input CSV files coming on a daily basis on an FTP location. I need to read these input files and process them on daily batch execution. The name of the files remains the same every day, but the date gets appended at the end of the filenames every day,
Ex:
Day1
General_Ledger1_2020-07-01,
General_Ledger2_2020-07-01,
General_Ledger3_2020-07-01,
General_Ledger4_2020-07-01,
General_Ledger5_2020-07-01
Day2
General_Ledger1_2020-07-02,
General_Ledger2_2020-07-02,
General_Ledger3_2020-07-02,
General_Ledger4_2020-07-02,
General_Ledger5_2020-07-02
How can I append this Date information to the input file name every time the job runs?
I have faced similar problem earlier and this can be solved using calculated parameter in the file path. Here, you can create expressions that will retrieve the file dynamically.
Example,
CONCAT( UPPER(lit('$(Prefix)')), ADD_DAYS( TODATE(lit('$(currentTime)'), 'yyyy-mm-dd'), 'yyyy-mm-dd' ,-1),'.csv')
Breaking of the expression :
$(currentTime) : this system parameter will get the current date (this will also include timestamp).
(TODATE(lit('$(currentTime)'), 'yyyy-mm-dd') : TODATE will get only date from the whole timestamp with format as ‘yyyy-mm-dd’.
ADD_DAYS(TODATE(lit('$(currentTime)'), 'yyyy-mm-dd'), 'yyyy-mm-dd' ,-1) : ADD_DAYS here will add -1 to the date retrieved from. TODATE(). Hence (2020-04-24) + (-1) would give us 2020-04-23
$(Prefix) : $(Prefix) will be an user defined input parameter of type String which user will be providing at runtime – Since the
prefix will be always dynamic.
CONCAT() : Finally to combine all the results and form the exact file path CONCAT() can be used. Also in between some static
string is added as it will always be fixed for every file to be read.

How to Map Input and Output Columns dynamically in SSIS?

I Have to Upload Data in SQL Server from .dbf Files through SSIS.
My Output Column is fixed but the input column is not fixed because the files come from the client and the client may have updated data by his own style. there may be some unused columns too or the input column name can be different from the output column.
One idea I had in my mind was to map files input column with output column in SQL Database table and use only those column which is present in the row for file id.
But I am not getting how to do that. Any idea?
Table Example
FileID
InputColumn
OutputColumn
Active
1
CustCd
CustCode
1
1
CName
CustName
1
1
Address
CustAdd
1
2
Cust_Code
CustCode
1
2
Customer Name
CustName
1
2
Location
CustAdd
1
If you create a similar table, you can use it in 2 approaches to map columns dynamically inside SSIS package, or you must build the whole package programmatically. In this answer i will try to give you some insights on how to do that.
(1) Building Source SQL command with aliases
Note: This approach will only work if all .dbf files has the same columns count but the names are differents
In this approach you will generate the SQL command that will be used as source based on the FileID and the Mapping table you created. You must know is the FileID and the .dbf File Path stored inside a Variable. as example:
Assuming that the Table name is inputoutputMapping
Add an Execute SQL Task with the following command:
DECLARE #strQuery as VARCHAR(4000)
SET #strQuery = 'SELECT '
SELECT #strQuery = #strQuery + '[' + InputColumn + '] as [' + OutputColumn + '],'
FROM inputoutputMapping
WHERE FileID = ?
SET #strQuery = SUBSTRING(#strQuery,1,LEN(#strQuery) - 1) + ' FROM ' + CAST(? as Varchar(500))
SELECT #strQuery
And in the Parameter Mapping Tab select the variable that contains the FileID to be Mapped to the parameter 0 and the variable that contains the .dbf file name (alternative to table name) to the parameter 1
Set the ResultSet type to Single Row and store the ResultSet 0 inside a variable of type string as example #[User::SourceQuery]
The ResultSet value will be as following:
SELECT [CustCd] as [CustCode],[CNAME] as [CustName],[Address] as [CustAdd] FROM database1
In the OLEDB Source select the Table Access Mode to SQL Command from Variable and use #[User::SourceQuery] variable as source.
(2) Using a Script Component as Source
In this approach you have to use a Script Component as Source inside the Data Flow Task:
First of all, you need to pass the .dbf file path and SQL Server connection to the script component via variables if you don't want to hard code them.
Inside the script editor, you must add an output column for each column found in the destination table and map them to the destination.
Inside the Script, you must read the .dbf file into a datatable:
C# Read from .DBF files into a datatable
Load a DBF into a DataTable
After loading the data into a datatable, also fill another datatable with the data found in the MappingTable you created in SQL Server.
After that loop over the datatable columns and change the .ColumnName to the relevant output column, as example:
foreach (DataColumn col in myTable.Columns)
{
col.ColumnName = MappingTable.AsEnumerable().Where(x => x.FileID = 1 && x.InputColumn = col.ColumnName).Select(y => y.OutputColumn).First();
}
After loop over each row in the datatable and create a script output row.
In addition, note that in while assigning output rows, you must check if the column exists, you can first add all columns names to list of string, then use it to check, as example:
var columnNames = myTable.Columns.Cast<DataColumn>()
.Select(x => x.ColumnName)
.ToList();
foreach (DataColumn row in myTable.Rows){
if(columnNames.contains("CustCode"){
OutputBuffer0.CustCode = row("CustCode");
}else{
OutputBuffer0.CustCode_IsNull = True
}
//continue checking all other columns
}
If you need more details about using a Script Component as a source, then check one of the following links:
SSIS Script Component as Source
Creating a Source with the Script Component
Script Component as Source – SSIS
SSIS – USING A SCRIPT COMPONENT AS A SOURCE
(3) Building the package dynamically
I don't think there are other methods that you can use to achieve this goal except you has the choice to build the package dynamically, then you should go with:
BIML
Integration Services managed object model
EzApi library
(4) SchemaMapper: C# schema mapping class library
Recently i started a new project on Git-Hub, which is a class library developed using C#. You can use it to import tabular data from excel, word , powerpoint, text, csv, html, json and xml into SQL server table with a different schema definition using schema mapping approach. check it out at:
SchemaMapper: C# Schema mapping class library
You can follow this Wiki page for a step-by-step guide:
Import data from multiple files into one SQL table step by step guide

Azure Data Lake Loop

Does Azure Data Lake Analytics and U-SQL support use While or For to Loop and create multiple outputs?
I want output to multiple files using one USQL execution.
This is what i want:
Foreach #day in #days
#dataToSave =
SELECT day AS day,
company AS Company,
FROM #data
WHERE #day = #day
#out = #day + ".txt"
OUTPUT #dataToSave
TO #out
USING Outputters.Text();
Next
I know i can use a powershell, but i think that will cost performance prepairing the execution.
U-SQL does not support While or For loops. You can use WHERE statements to filter extracted data, and virtual columns to filter based on file paths/names (example).
To output to multiple files, you can write a unique rowset and WHERE clause for each output if its a reasonable number of files.
As you said, you could also script this with Powershell or U-SQL (example).
Dynamic output to multiple files is currently in a limited private preview. Please send an email to usql at microsoft dot com with your scenario if you're interested in this feature, as it could work for your scenario based on your description.
Hope this helps, and let me know if you have more questions about implementing any of these solutions.
You can try create a custom outputter and ignore the output file and write on your own file!
public override void Output (IRow row, IUnstructuredWriter output)
Try this, using outputter too:
public override void Output(IRow input, IUnstructuredWriter output)
{
using (System.IO.StreamWriter streamWriter = new StreamWriter(address + _file, true))
//Save on file!
}

SSIS filename - file count

I'm currently creating a flat file export for one of our clients, i've managed to get the file in the format they want, i'm trying to get the easiest way of creating a dynamic file name. I've got the date in as a variable and the path ect but they want a count in the file name. For example
File name 1 : TDY_11-02-2013_{1}_T1.txt. The {} being the count. So next weeks file would be TDY_17-02-2013_{2}_T1.txt
I cant see an easy way of doing this!! any idea's??
EDIT:
on my first answer, I thought you meant count of values returned on a query. My bad!
two ways to achieve this, you could loop into the destination folder, select the last file by date, get its value and increase 1, which sound like a lot of trouble. Why not a simple log table on the DB with last execution date and ID and then you compose your file name base on the last row of this table?
where exactly is your problem?
you can make a dynamic file name using expressions:
the count, you can use a "row count" component inside your data flow to assign the result to a variable and use the variable on your expression:
Use Script task and get the number inside the curly braces of the file name and store it in a variable.
Create a variable(FileNo of type int) which stores the number for the file
Pseudo code
string name = string.Empty;
string loction = #"D:\";
/* Get the path from the connection manager like the code below
instead of hard coding like D: above
string flatFileConn =
(string(Dts.Connections["Yourfile"].AcquireConnection(null) as String);
*/
string pattern = string.Empty;
int number = 0;
string pattern = #"{([0-9])}"; // Not sure about the correct regular expression to retrieve the number inside braces
foreach (string s in Directory.GetFiles(loction,"*.txt"))
{
name = Path.GetFileNameWithoutExtension(s);
Match match = Regex.Match(name, pattern );
if (match.Success)
{
dts.Variables["User::FileNo"].Value = int.Parse(match.Value)+1;
}
}
Now once you get the value use it in your file expression in the connection manager
#[User::FilePath] +#[User::FileName]
+"_{"+ (DT_STR,10,1252) #[User::FileNo] + "}T1.txt"