I have to upload data into SQL Server from .dbf files through SSIS.
My output columns are fixed, but the input columns are not, because the files come from the client and the client may have arranged the data in their own style. There may be some unused columns too, or the input column names may differ from the output column names.
One idea I had was to map each file's input columns to the output columns in a SQL database table, and then use only the columns listed in the rows for that file ID.
But I can't figure out how to do that. Any ideas?
Table Example
FileID | InputColumn   | OutputColumn | Active
1      | CustCd        | CustCode     | 1
1      | CName         | CustName     | 1
1      | Address       | CustAdd      | 1
2      | Cust_Code     | CustCode     | 1
2      | Customer Name | CustName     | 1
2      | Location      | CustAdd      | 1
If you create a similar table, you can use it in two approaches to map columns dynamically inside an SSIS package; otherwise you have to build the whole package programmatically. In this answer I will try to give you some insights on how to do that.
(1) Building Source SQL command with aliases
Note: This approach will only work if all the .dbf files have the same column count and only the column names differ.
In this approach you generate the SQL command that will be used as the source, based on the FileID and the mapping table you created. The FileID and the .dbf file path must be stored in variables. As an example:
Assuming that the Table name is inputoutputMapping
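If you have not created the mapping table yet, a minimal sketch of what it could look like is shown below (the column types and the schema name are assumptions; the sample rows mirror the example table above):

CREATE TABLE dbo.inputoutputMapping (
    FileID       INT          NOT NULL,
    InputColumn  VARCHAR(128) NOT NULL,
    OutputColumn VARCHAR(128) NOT NULL,
    Active       BIT          NOT NULL DEFAULT (1)
);

INSERT INTO dbo.inputoutputMapping (FileID, InputColumn, OutputColumn, Active)
VALUES (1, 'CustCd',        'CustCode', 1),
       (1, 'CName',         'CustName', 1),
       (1, 'Address',       'CustAdd',  1),
       (2, 'Cust_Code',     'CustCode', 1),
       (2, 'Customer Name', 'CustName', 1),
       (2, 'Location',      'CustAdd',  1);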
Add an Execute SQL Task with the following command:
DECLARE @strQuery as VARCHAR(4000)
SET @strQuery = 'SELECT '
SELECT @strQuery = @strQuery + '[' + InputColumn + '] as [' + OutputColumn + '],'
FROM inputoutputMapping
WHERE FileID = ?
SET @strQuery = SUBSTRING(@strQuery,1,LEN(@strQuery) - 1) + ' FROM ' + CAST(? as Varchar(500))
SELECT @strQuery
In the Parameter Mapping tab, map the variable that contains the FileID to parameter 0 and the variable that contains the .dbf file name (which stands in for the table name) to parameter 1.
Set the ResultSet type to Single row and store ResultSet 0 in a variable of type String, for example @[User::SourceQuery].
The ResultSet value will be as following:
SELECT [CustCd] as [CustCode],[CNAME] as [CustName],[Address] as [CustAdd] FROM database1
In the OLE DB Source, set the Data Access Mode to SQL command from variable and use the @[User::SourceQuery] variable as the source.
(2) Using a Script Component as Source
In this approach you have to use a Script Component as Source inside the Data Flow Task:
First of all, you need to pass the .dbf file path and SQL Server connection to the script component via variables if you don't want to hard code them.
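As a rough sketch, assuming you added two package variables (hypothetical names User::DbfFilePath and User::SqlConnString) to the component's ReadOnlyVariables list, they can be read inside the script like this:

string dbfFilePath = Variables.DbfFilePath;     // path to the current .dbf file
string sqlConnString = Variables.SqlConnString; // SQL Server connection string for the mapping table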
Inside the script editor, you must add an output column for each column found in the destination table and map them to the destination.
Inside the Script, you must read the .dbf file into a datatable:
C# Read from .DBF files into a datatable
Load a DBF into a DataTable
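For example, here is a minimal sketch using OLE DB; it assumes the Visual FoxPro OLE DB provider (VFPOLEDB) is installed and uses the dbfFilePath variable from above. With .dbf files the folder is the data source and the file name without extension acts as the table name:

// Requires: using System.Data; using System.Data.OleDb; using System.IO;
DataTable myTable = new DataTable();
string folder = Path.GetDirectoryName(dbfFilePath);
string table = Path.GetFileNameWithoutExtension(dbfFilePath);

using (OleDbConnection conn = new OleDbConnection("Provider=VFPOLEDB.1;Data Source=" + folder + ";"))
using (OleDbDataAdapter da = new OleDbDataAdapter("SELECT * FROM [" + table + "]", conn))
{
    da.Fill(myTable); // Fill opens and closes the connection itself
}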
After loading the data into a datatable, also fill another datatable with the data found in the MappingTable you created in SQL Server.
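A hedged sketch of that step, assuming sqlConnString comes from the package variable above and fileId holds the current file's ID (for example from another package variable):

// Requires: using System.Data.SqlClient;
DataTable MappingTable = new DataTable();
using (SqlConnection conn = new SqlConnection(sqlConnString))
using (SqlCommand cmd = new SqlCommand(
    "SELECT FileID, InputColumn, OutputColumn FROM dbo.inputoutputMapping WHERE FileID = @FileID AND Active = 1", conn))
{
    cmd.Parameters.AddWithValue("@FileID", fileId);
    using (SqlDataAdapter da = new SqlDataAdapter(cmd))
    {
        da.Fill(MappingTable);
    }
}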
After that, loop over the DataTable columns and change each .ColumnName to the relevant output column, for example:
// Requires System.Linq and System.Data.DataSetExtensions for AsEnumerable()/Field<T>()
foreach (DataColumn col in myTable.Columns)
{
    col.ColumnName = MappingTable.AsEnumerable()
        .Where(x => x.Field<int>("FileID") == 1 && x.Field<string>("InputColumn") == col.ColumnName)
        .Select(y => y.Field<string>("OutputColumn"))
        .FirstOrDefault() ?? col.ColumnName; // keep the original name if no mapping exists
}
After that, loop over each row in the DataTable and create a script output row.
Note that while assigning output rows you must check whether each column exists. You can first add all the column names to a list of strings and then use it for the check, for example:
var columnNames = myTable.Columns.Cast<DataColumn>()
                         .Select(x => x.ColumnName)
                         .ToList();

foreach (DataRow row in myTable.Rows)
{
    OutputBuffer0.AddRow(); // create a new output row

    if (columnNames.Contains("CustCode"))
    {
        OutputBuffer0.CustCode = row["CustCode"].ToString();
    }
    else
    {
        OutputBuffer0.CustCode_IsNull = true;
    }
    // continue checking all other columns
}
If you need more details about using a Script Component as a source, then check one of the following links:
SSIS Script Component as Source
Creating a Source with the Script Component
Script Component as Source – SSIS
SSIS – USING A SCRIPT COMPONENT AS A SOURCE
(3) Building the package dynamically
I don't think there are other methods you can use to achieve this goal, except building the whole package dynamically; in that case you should go with one of the following:
BIML
Integration Services managed object model
EzApi library
(4) SchemaMapper: C# schema mapping class library
Recently I started a new project on GitHub: a class library developed in C#. You can use it to import tabular data from Excel, Word, PowerPoint, text, CSV, HTML, JSON, and XML into a SQL Server table with a different schema definition, using a schema-mapping approach. Check it out at:
SchemaMapper: C# Schema mapping class library
You can follow this Wiki page for a step-by-step guide:
Import data from multiple files into one SQL table step by step guide
Related
In Databricks I have several CSV files that I need to load. I would like to add a column to my table with the file path, but I can't seem to find that option.
My data is structured like this:
FileStore/subfolders/DATE01/filenameA.csv
FileStore/subfolders/DATE01/filenameB.csv
FileStore/subfolders/DATE02/filenameA.csv
FileStore/subfolders/DATE02/filenameB.csv
I'm using this SQL statement in Databricks, as it can loop through all the dates and load every filenameA into clevertablenameA, every filenameB into clevertablenameB, etc.
DROP view IF EXISTS clevertablenameA;
create temporary view clevertablenameA
USING csv
OPTIONS (path "dbfs:/FileStore/subfolders/*/filenameA.csv", header = true)
My desired outcome is something like this
col1 | col2|....| path
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data|....| dbfs:/FileStore/subfolders/DATE02/filenameA.csv
Is there a clever option, or should I load my data another way?
The function input_file_name() could be used to retrieve the file name while reading.
SELECT *, input_file_name() as path FROM clevertablenameA
Note that this does not add a column to the view and merely returns the name of the file being read.
Refer to the link below for more information.
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/functions/input_file_name
Alternatively, you could read the files in a PySpark/Scala cell, add the file name with .withColumn("path", input_file_name()), and then create the view on top of the resulting DataFrame.
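If you prefer to stay in SQL, a hedged sketch is to layer a second view on top of the original one so that the path is returned as a column at query time (the view name is illustrative):

CREATE OR REPLACE TEMPORARY VIEW clevertablenameA_with_path AS
SELECT *, input_file_name() AS path
FROM clevertablenameA;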
I am attempting to create a pipeline that accepts values from a JSON config file and builds a source query, lookup logic, and destination sink based on the values from the file.
An example of an object from the config file would look something like this:
{
/*Destination Table fields */
"destTableName": "DimTable1",
"destTableNaturalKey": "ClientKey, ClientNaturalKey",
"destTableSchema": "dbo",
/*Source Table fields */
"sourcePullFields": "ClientKey, ClientNaturalKey",
"sourcePullFilters": "WHERE ISNULL(ClientNaturalKey,'') <> ''",
"sourceTableName": "ClientDataStaged",
"sourceTableSchema": "stg"
}
The pipeline identifies the items within the config that need to be checked for new data (via a ForEach), in a basic pipeline like this:
Pipeline pic
I would then pass these values into the data flow, from the ADF Pipeline:
ADF Parameters
And build the source pull and lookup values within the Data flow expressions with something like this:
concat('SELECT DISTINCT ', $sourcePullFields, ' FROM ', $sourceTableSchema, '.', $sourceTableName, ' ', $sourcePullFilters)
When I am inside the data flow and pass the same config values through the debug settings, I can view projections and step through the data flow correctly. It is when I execute the data flow from the pipeline that I get errors.
As a second attempt, I simply passed through the source query within the config:
{
"destQuery": "SELECT Hashbytes('MD5', (cast(ClientKey as varchar(5)) + ClientNaturalKey)) AS DestHashVal FROM dbo.DimTable1",
"sourceQuery": "SELECT DISTINCT ClientKey, ClientNaturalKey, Hashbytes('MD5', ( Cast(SchoolKey AS VARCHAR(5)) + ClientNaturalKey )) AS SourceHashVal FROM stg.ClientNaturalKey WHERE Isnull(ClientNaturalKey, '') <> ''"
}
I had intended to use the md5 function within the data flow expressions, but at this point I simply want to:
Define a source query, whether it be via a SQL statement or built from variables
Define a lookup query, whether it be via a SQL statement or built from variables
Have the ability to compare a Hashed value(s) from source to the lookup (destination table)
If the lookup returns no match on the hash, load the values
ADF Data Flow Pic
Ideally I am not defining the SQL statement directly; it just feels less intelligent. Regardless, the goal is to avoid migrating ~50 DFTs from SSIS one-for-one, and instead use a few pipelines and a single data flow that can handle the dynamic behavior. Since the process works within the confines of the data flow, I have been experimenting with passing in the parameters in different ways, removing quotes, unsure of what the string interpolation is doing, etc.
I'm a newbie so don't laugh :#
I'm working with a 2002-2003 Microsoft Access database.
Now I want to add an array of DataRow objects to an existing table in my database. Is there a way to do that? Because right now I'm just adding the rows with a foreach loop.
thank you
I think that the foreach-loop actually is the best way to do it.
foreach (DataRow row in yourRowArray)
{
    // ImportRow copies the row; Rows.Add would fail if the row already belongs to another table
    dataTable.ImportRow(row);
}
If you are using .NET Framework 3.5+ you can also use the DataRow extension method CopyToDataTable().
But watch out: in this case the data in the target DataTable is replaced, because CopyToDataTable() builds a brand-new table from the array.
DataTable table = yourDataTable;
DataRow[] yourRowArray = ...;
if(yourRowArray.Length > 0)
{
table = yourRowArray.CopyToDataTable();
}
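If you need to append the rows rather than replace the table, one hedged alternative is to copy the array into its own DataTable and then merge it into the existing one:

DataTable newRows = yourRowArray.CopyToDataTable(); // requires at least one row in the array
table.Merge(newRows); // appends the rows, matching on schema (and primary key, if one is defined)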
I would recommend using the foreach-loop.
What you describe as an array must be a saved file type, i.e. Excel or CSV. Be sure it is a clean grid of data without extraneous, non-aligned rows.
Then you can link to that file from Access as a table. This is a manual step using the Access interface; in the ribbon it is the External Data area. The link remains valid, allowing you to replace the Excel/CSV file with a new one as long as the path and structure of the file do not change.
Then you create an Append query to write all the records from this table into the table in your Access database.
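The append query itself can be as simple as this hedged example (table names are illustrative; in the Access query designer this is an Append query):

INSERT INTO ExistingAccessTable
SELECT *
FROM LinkedFileTable;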
I have an input folder in ADLS in the format year/month/date, e.g. 2017/07/11. I want to pass this input folder as a parameter to my U-SQL script. I am not using ADF. I don't want to generate the current date from within the U-SQL script, as I am not sure the input folder is for the current date. How do I do this effectively?
One way I thought of was uploading a "done" file after my whole input folder is uploaded to the ADLS account, where that "done" file contains the date. But I am not able to use that date to form my input data path. Please help.
Let's assume you have several CSV files in your folder structure (structured as yyyy/MM/dd) and you want to extract all the files in a folder of a specific date. You can do it in two ways, depending on whether you need exact datetime semantics or are fine with path concatenation.
First the path concat example:
DECLARE EXTERNAL @folder string = "2017/07/11"; // Script parameter with default value.
// You can also specify the value with a constant-foldable expression on DateTime.Now.
DECLARE @path string = "/constantpath/" + @folder + "/{*.csv}";
@data = EXTRACT I int, s string // or whatever your schema is...
FROM @path
USING Extractors.Csv();
...
And here is the example with a file set virtual column:
DECLARE EXTERNAL @date string = "2017/07/11"; // Script parameter with default value.
// You can also specify the value with a constant-foldable expression on DateTime.Now plus string serialization (I am not sure if the ADF parameter model supports DateTime values).
DECLARE @path string = "/constantpath/{date:yyyy}/{date:MM}/{date:dd}/{*.csv}";
@data = EXTRACT I int, s string // or whatever your schema is...
      , date DateTime // virtual column for the date pattern
FROM @path
USING Extractors.Csv();
// Now apply the requested filter to reduce the files to the requested set
@data = SELECT * FROM @data WHERE date == DateTime.Parse(@date);
...
In both cases, you pass the parameter via the ADF parameterization model and you can decide to wrap the code into a U-SQL stored procedure or TVF as suggested by Bob.
Ok, I have a simple process...
Read a table and get the rows that have a "StatusID" of 1. Simple.
Select ProductID from PreorderStatus where StatusID = 1
For each row returned from that query, perform an action. For simplicity's sake, let's just modify the original table to set the "StatusID" to 2.
Update PreorderStatus set StatusID = 2 where ProductID = @ProductID
In order to do this in SSIS, I have created a simple "Execute SQL Task" with the first statement. In the editor I have set the Result Set to return a Full result set and the Result Name of 0 is set to fill an object variable named ReadySet.
The output is then routed to a For Each Loop container. The Enumerator is set to Foreach ADO Enumerator and the object source variable set to the ReadySet variable from above. I have also mapped the variable v_ProductID to index 0.
Setting a breakpoint at the beginning of the Foreach loop shows the variable being set correctly. GREAT!! Now on to step two....
Now I have placed a new SQL task in the Foreach container, and I have a head-scratcher: how do I actually use the variable in the SQL statement? Simply using "v_ProductID" or "User::v_ProductID" doesn't seem to work. Mapping a parameter seemed like a good idea (got a @ProductID and everything!) but that didn't seem to work either.
I get the feeling that I am missing something pretty simple but can't tell what. Thanks for any help!!
I think there is a better approach. Here are the approximate steps:
Drag a DataFlow task onto the design surface.
Open it up and add an OLE DB Source and an OLE DB Command component to the design surface.
Modify the source to use the query you have described.
Connect the source to the Command component.
Modify the Command component to use the "Update PreorderStatus set StatusID = 2 where ProductID = ?" query, and on the column mappings page map the ? parameter to the ProductID column coming from the data source.
HTH
When I want to use an Execute SQL Task and vary something based on a variable, I use a stored proc and make the variable the input parameter for the proc.
Then you set the parameter in the Execute SQL Task and set the SQL statement to something like:
exec myproc ?
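For completeness, a hedged sketch of what that might look like in this case (the proc name is illustrative); in the Execute SQL Task you would then map User::v_ProductID to parameter 0:

CREATE PROCEDURE dbo.SetPreorderProcessed
    @ProductID INT
AS
BEGIN
    UPDATE PreorderStatus SET StatusID = 2 WHERE ProductID = @ProductID;
END
GO

-- Execute SQL Task SQLStatement:
-- exec dbo.SetPreorderProcessed ?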