How do I retain NULL values when using SSIS to import from flat file in SQL Server 2005 - sql

I've exported records to a flat file delimited by "|" and it seems that when I import those records into a new database , SQL Server treats the NULL values as empty fields. IMy queries worked properly when the records/fields were NULL and so I want to either find a way to retain the NULL values in the data or convert the blank fields to NULL values. I'm assuming the former would be easier, but I don't know how to do that. Any help would be appreciated.

I just had the same problem. I resolved it by Changing the RetainNulls property in the properties of the Flat File Source in the Data Flow Task.

In your destination connection in the dataflow, there is a property that you can chceck that says Keep nulls, JUst check that. Why that isn't the default I'll never know.
Hmmm something stange going on there. I can suggest that you then clean the data and change it to null, you can either do this as part of the dataflow or do two dataflows, one which inserts the data into a staging table, then run an exectue SQl task to do the clean up and then create a dataflow to run fromthe staging table to the real table.

in case anyone is looking how to do this when building the package programatically you need to set the variable in your CManagedComponentWrapper object
CManagedComponentWrapper instanceSource = ComponentSource
...
instanceSource.SetComponentProperty("RetainNulls", true);

Related

SSIS: Excel data source - if column not exists use other column

I am using select statement in excel source to select just specific columns data from excel for import.
But I am wondering, is it possible to select data such way when I select for example column with name: Column_1, but if this column is not exists in excel then it will try to select column with name Column_2? Currently if Column_1 is missing, then data flow task fails.
Use a Script task and write .net code to read the excel file and then perform the check for the Column_1 availability in the file. If the column does not present then use Column_2 as input. Script Task in SSIS can act as a source.
SSIS is metadata based and will not support dynamic metadata, however you can use Script Component as #nitin-raj suggested to handle all known source columns. There is a good post below on how it can be done.
Dynamic File Connections
If you have many such files that can have varying columns then it is better to create a custom component.However, you cannot have dynamic metadata even with custom component, the set of columns should be known upfront to SSIS.
If the list of columns keep changing and you cannot know in advance what are expected columns then you are better off handling the entire thing in C#/VB.Net using Script Task of control flow
As a best practice, because SSIS meta data is static, any data quality and formatting issues in source files should be corrected before ssis data flow task runs.
I have seen this situation before and there is a very simple fix. In the beginning of your ssis package, using a file task to create copy of the source excel file and then run a c# script or execute a powershell to rename the columns so that if column 1 does not exist, it is either added at the appropriate spot in excel file or in case the column name is wrong is it corrected.
As a result of this, you will not need to refresh your ssis meta data every time it fails. This is a standard data standardization practice.
The easiest way is to add two data flow tasks, one data flow for each Excel source select statement and use precedence constraints to execute the second data flow when the first one fails.
The disadvantage of this approach is that if the first data flow task fails for another reason, it will also try to execute the second one. You will need some advanced error handling to check if the error is thrown due to missing columns or not.
But if have a similar situation, I will use a Script Task to check if the column exists and build the SQL command dynamically. Note that this SQL command must always return the same metadata (you must use aliases).
Helpful links
Overview of SSIS Precedence Constraints
Working with Precedence Constraints in SQL Server Integration Services
Precedence Constraints

Trying to create a table and load data into same table using Databricks and SQL

I Googled for a solution to create a table, using Databticks and Azure SQL Server, and load data into this same table. I found some sample code online, which seems pretty straightforward, but apparently there is an issue somewhere. Here is my code.
CREATE TABLE MyTable
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://server_name_here.database.windows.net:1433;database = db_name_here",
user "u_name",
password "p_wd",
dbtable "MyTable"
);
Now, here is my error.
Error in SQL statement: SQLServerException: Invalid object name 'MyTable'.
My password, unfortunately, has spaces in it. That could be the problem, perhaps, but I don't think so.
Basically, I would like to get this to recursively loop through files in a folder and sub-folders, and load data from files with a string pattern, like 'ABC*', and load recursively all these files into a table. The blocker, here, is that I need the file name loaded into a field as well. So, I want to load data from MANY files, into 4 fields of actual data, and 1 field that captures the file name. The only way I can distinguish the different data sets is with the file name. Is this possible? Or, is this an exercise in futility?
my suggestion is to use the Azure SQL Spark library, as also mentioned in documentation:
https://docs.databricks.com/spark/latest/data-sources/sql-databases-azure.html#connect-to-spark-using-this-library
The 'Bulk Copy' is what you want to use to have good performances. Just load your file into a DataFrame and bulk copy it to Azure SQL
https://docs.databricks.com/data/data-sources/sql-databases-azure.html#bulk-copy-to-azure-sql-database-or-sql-server
To read files from subfolders, answer is here:
How to import multiple csv files in a single load?
I finally, finally, finally got this working.
val myDFCsv = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.count()
Thanks for a point in the right direction mauridb!!

Using SSIS Package, How to validate the source records for duplicate before inserting?

SQL Server 2012: using a SSIS package, how to validate the source records for duplicate before inserting?
Our source file is a .csv. We are facing duplicate records loaded in the staging table.
At present , we are following manual process of loading data.
How to validate the source file data against the destination table before loading and load only the valid records? Possibility of loading duplicate records not only because of the source file having duplicate records in it but also reloading the same file to the staging table.
We are not Truncate the staging table. We are keeping records as is.
Second question : How to pick the name of the source file and pass it in the loading ? Possibly having a derived column as "FileName" which will get loaded along with raw data to the staging table.
The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS run a SQL Task with TRUNCATE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
(PrimaryKey,Column1,Column2,Column3)
SELECT
PrimaryKey,Column1,Column2,Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey=SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic, and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component but I strongly discourage it's use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to 'deduplicate'
Note in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final consistent result
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file.
Given your exact scenario (of loading the same file again), I would first check if the data is even loaded to the staging table. If you do that, you don't have to worry about checking the duplicates at record level.
How are you setting the connection to the file? Most of the data loads I have dealt with, I designed for-each-loop-container where the file name/path would be populated in a user variable. As you said, you could just use a derived column transform to add a new column which gets the value from a variable. If you don't have the file name in a user variable, you could use expression task in the control flow to populate it.
To cover your exact requirement, I would use the above step to populate the file name in the table. You could even normalize to a different table instead of storing long file name for every data record. Once you have all the file names in the database, you could just have an "Execute SQL" at the beginning to see if that file name is already in the database.
Two years back I have faced the same problem with importing TSV files.
I tried many other solutions but best I could design is C# code script for such validation at its best.
What I did as a solution
Create one C# DataTable object in memory with Primary Key constraints,
like:-
DataColumn[] keyColumn = new DataColumn[30];
keyColumn[intJ] = dtFilterdPK.Columns["Column name"];
Then try to add one by one row from your CSV to this DataTables.
Whenever your data will get Duplication based on Primary Key will have an error
Handle this error code in (TRY)..CATCH block and make this duplication error as per your logging requirement.
Avoid those error records importing in DataTable object.
Atlast import your CSV file into your table as BulkImport
Like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
bulkCopy.DestinationTableName = "Your DB Table Name"; //Assign table name
bulkCopy.WriteToServer(dtToBeImport); //Write into Actual table.
}
Hope this will help you.

Do while loop with GPDB using talend

I have a very large data set in GPDB from which I need to extract close to 3.5 million records. I use this for a flatfile which is then used to load to different tables. I use Talend, and do a select * from table using the tgreenpluminput component and feed that to a tfileoutputdelimited. However due to the very large volume of the file, I run out of memory while executing it on the Talend server.
I lack the permissions of a super user and unable to do a \copy to output it to a csv file. I think something like a do while or a tloop with more limited number of rows might work for me. But my table doesnt have any row_id or uid to distinguish the rows.
Please help me with suggestions how to solve this. Appreciate any ideas. Thanks!
If your requirement is to load data into different tables from one table, then you do not need to go for load into file and then from file to table.
There is a component named tGreenplumRow which allows you to write direct sql queries (DDL and DML queries) in it.
Below is a sample job,
If you notice, there are three insert statements inside this component. It will be executed one by one separated by semicolon.

Import SQL to SQL DB: How can I populate columns that exist in destination, not source?

I'm using SSIS to import data from one DB to another existing DB. Some columns in the destination tables do not exist in the source tables. Seems the Import & Export Wizard only allows me to select unmapped columns from the source and match them with these new columns in the destination. I'd like to be able to just provide one piece of data to import into all rows of these new columns.
Would like to use the GUI if possible because I'm not skilled at writing scripts. Thanks!
In SSIS, you can add a "derived column" component that will add columns to the buffer rows with the value you want (either a string or an expression).
I don't believe this is possible in the GUI. However, it would be a simple script after the data is loaded with SSIS:
UPDATE table SET newcolumn = new value
If you need to filter the rows, just add
WHERE column = value ...
You could change your source to a select query and list out the columns along with the static value you want to map.
SELECT SOURCECOLUMN_1,SOURCECOLUMN_2,....,SOURCECOLUMN_N,'VALUE' AS DESTINATIONCOLUMN FROM Source_Table
My original thought was that you could use the query right in the Import & Export wizard. you can obviously do alot more if you go in and edit the package, but it sounded like you didn't have much expereince with that. Here is how you would do this in the wizard.
After you have selected your source and destination databases you can Specify Table Copy or Query. Select the Write a query to specify the data to transfer option
On the next screen enter the query listing out all of the columns and add in your static columns.
On the Next screen You will need to select the Destination table or it will default to creating a new table named Query. You should be able to choose from the drop down. As long as you aliased your extra columns with the same names it should map correctly. You can go in and edit mappings here if needed.
You can then save off the SSIS package and it will source form the query.
Alternatively if you already have the SSIS pacakge created without the extra columns you can go in to the Data Flow and change the Data access mode in the OLE DB Source to be a SQL Command instead of a table or view. Add your query here.
You can then go into the properties of the OLE DB Desitination in the Dataflow and map the new column. You could also add in a derived column as #DominicGoulet by adding in a Dervied Column task and putting your static information here and then mapping. If you want to see that solution too let me know.