I have developed a mapping in Informatica. The source is a file. I need to write a post-SQL statement that deletes the already existing data if a file with the same name arrives again. The file comes once a month and is named like jass_naming_yyyymm.csv. I wrote delete from tab where load_date = sysdate, but it is not working. load_date is a column in the target table that stores the yyyymm taken from the file name. So the logic should be: if a file with an already loaded yyyymm arrives again, the existing data for that month should be deleted and the new file loaded.
Please give a solution.
Post SQL will not help here. You need two pipelines.
Pipeline 1 - Src->exp->tgt.
Use the indirect file read method and capture the currently processed file name so you can pull the yyyymm part out of it (a sketch of the extraction expression follows the override below).
You need to use the 'update override' option on the target to delete the data. Use this logic:
DELETE FROM target_table WHERE target_yyyy_mm = :TU.source_yyyy_mm
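For the extraction step, a minimal sketch of an Expression transformation output port, assuming the source exposes the currently processed file name in a port called CurrentlyProcessedFileName and the files always end in _yyyymm.csv (both are assumptions; adjust to your setup):

-- output port FILE_YYYYMM (string): the 6 characters immediately before '.csv',
-- e.g. '202401' out of 'jass_naming_202401.csv'
SUBSTR(CurrentlyProcessedFileName, LENGTH(CurrentlyProcessedFileName) - 9, 6)

Link FILE_YYYYMM through to the Pipeline 1 target so the :TU reference in the update override resolves to it.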
Pipeline 2 - your mapping.
HTH
I'm working on an SSIS project that pulls data from Excel and loads it to an Oracle database every month. I plan to pull data from the Excel file and load it to an Oracle stage table. I will be using a merge statement because the data that gets loaded each month is a rolling 12-month list and the data can change, so I need to be able to INSERT when records don't match or UPDATE when they do. My control flow looks like this: Truncate Stage Table (to clear out the table from the last package run) ---> Data Flow from Excel to Stage Table ---> Merge to Target Table in Oracle.
My problem is that the data in the source Excel file doesn't have any unique columns to select a primary key or a composite key, as it is a possibility (although very unlikely) that a new record could have the exact same information. I am unable to utilize the "generated always as identity" because my SSIS package needs to truncate at the beginning of each job to clear out the Stage Table. This would generate the same ID numbers in the new load and create problems in the Target Table.
Any suggestions as to how I can get around this problem?
Welcome to SO and ETL. Instead of using a staging table, in SSIS use two sources: the Excel file and the existing production table. Sort both inputs and then perform a merge join (full outer) on the unique identifier. From there, use a derived column transformation to add a new column called 'Action' which marks a row as INSERT, UPDATE, or DELETE based on which side of the join is NULL. So:
NULL from file means DELETE (not in file, in database)
NULL from database means INSERT (in file, not in database)
Not NULL for both means UPDATE (in file, in database)
From there, use a conditional split to route rows to either an OLE DB Destination (INSERT) or an OLE DB Command (UPDATE or DELETE); the statements for the latter are sketched below. You can now remove the stage environment and the MERGE command from your process. This has the added benefit of removing the ETL load from the SQL Server, assuming SSIS is running on a separate server.
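For the UPDATE and DELETE paths, the OLE DB Command transform runs a parameterized statement per incoming row. A minimal sketch, assuming a hypothetical TargetTable with a key column BusinessKey and one payload column Amount (all names are placeholders, not taken from the question):

-- OLE DB Command for the UPDATE output of the conditional split;
-- each ? is mapped to an input column in the transform's column mappings
UPDATE TargetTable
SET    Amount      = ?
WHERE  BusinessKey = ?;

-- OLE DB Command for the DELETE output
DELETE FROM TargetTable
WHERE  BusinessKey = ?;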
Note: The sort transformation has the option to remove duplicates.
SQL Server 2012: using an SSIS package, how to validate the source records for duplicates before inserting?
Our source file is a .csv. We are facing duplicate records being loaded into the staging table.
At present, we are following a manual process for loading the data.
How do we validate the source file data against the destination table before loading, and load only the valid records? Duplicate records can be loaded not only because the source file contains duplicates, but also because the same file can be reloaded into the staging table.
We are not truncating the staging table; we keep the existing records as they are.
Second question: how do we pick up the name of the source file and pass it into the load? Possibly via a derived column called "FileName" that gets loaded along with the raw data into the staging table.
The typical load pattern I use in this case is:
Prepare a staging table that matches the source file
In SSIS run a SQL Task with TRUNCATE StagingTable; (which clears it out)
Then, run a data flow task that loads the entire data file into the staging table
Lastly, merge the staging table into the final table.
I prefer to do this last step in a SQL Task also:
INSERT INTO FinalTable
    (PrimaryKey, Column1, Column2, Column3)
SELECT
    PrimaryKey, Column1, Column2, Column3
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM FinalTable TGT WHERE TGT.PrimaryKey = SRC.PrimaryKey
);
If you prefer a graphical UI, and you don't mind the extra network traffic and slower processing time, you can do the same type of merge operation using lookups. You can even use the SCD component, but I strongly discourage its use.
Whether you do it in T-SQL or the UI, you need a key that can be used to uniquely identify the records (referred to as PrimaryKey in my example). If you don't have this key, there is no way to deduplicate.
Note that in this example you have a 'real' staging table whose only purpose is to get the data file into the database. Then you have a final table that contains the final, consistent result.
Also note that this pattern only adds new rows - it will not update existing rows if they change in the data file (a sketch for handling that case follows).
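If changed rows also need to be carried over, one option (a minimal sketch under the same PrimaryKey assumption; not part of the original pattern) is an UPDATE from the staging table alongside the insert:

-- refresh rows that already exist in the final table with the staged values
UPDATE TGT
SET    TGT.Column1 = SRC.Column1,
       TGT.Column2 = SRC.Column2,
       TGT.Column3 = SRC.Column3
FROM   FinalTable   TGT
JOIN   StagingTable SRC
    ON SRC.PrimaryKey = TGT.PrimaryKey;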
Given your exact scenario (reloading the same file), I would first check whether that file's data has already been loaded to the staging table. If you do that, you don't have to worry about checking for duplicates at the record level.
How are you setting the connection to the file? In most of the data loads I have dealt with, I designed a Foreach Loop Container where the file name/path is populated into a user variable. As you said, you can then use a Derived Column transform to add a new column that takes its value from that variable. If you don't have the file name in a user variable, you could use an Expression Task in the control flow to populate it.
To cover your exact requirement, I would use the step above to populate the file name in the table. You could even normalize the file names into a separate table instead of storing a long file name on every data record. Once the file names are in the database, you can run an Execute SQL Task at the beginning of the package to see whether that file name has already been loaded (see the sketch below).
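A minimal sketch of that check, assuming a hypothetical LoadedFiles tracking table and an SSIS variable holding the current file name mapped to the ? parameter (both the table and the variable are placeholders, not from the original question):

-- Execute SQL Task with a single-row result set mapped back to a package variable
SELECT COUNT(*) AS AlreadyLoaded
FROM   LoadedFiles
WHERE  FileName = ?;

A precedence constraint expression on that variable can then skip the data flow when AlreadyLoaded > 0.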
Two years back I faced the same problem while importing TSV files.
I tried many other solutions, but the best I could come up with was a C# script that performs this validation.
What I did as a solution:
Create a C# DataTable object in memory with a primary key constraint,
like:
DataColumn[] keyColumn = new DataColumn[1];
keyColumn[0] = dtFilterdPK.Columns["Your key column name"];
dtFilterdPK.PrimaryKey = keyColumn; // duplicate key values will now be rejected when rows are added
Then add the rows from your CSV to this DataTable one by one.
Whenever a row duplicates the primary key, adding it will throw a constraint exception.
Handle that error in a try...catch block and log the duplicate according to your logging requirements.
Skip those error records so they never make it into the DataTable object.
Finally, bulk-import the de-duplicated DataTable into your table.
Like:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(myConnection))
{
    bulkCopy.DestinationTableName = "Your DB Table Name"; // assign the destination table name
    bulkCopy.WriteToServer(dtToBeImport);                 // write into the actual table
}
Hope this will help you.
I am trying to import data into a table in Oracle from a CSV file using SQL*Loader. However, I want to add two additional attributes, namely the date of upload and the file path from which the data is being imported. I can add the date using SYSDATE. Is there a similar method for obtaining the file path?
The trouble with using SYSDATE is that it will not be the same for all rows. This can make it difficult if you do more than one load in a day and need to back out a particular load. Consider also adding a batch_id using the method in this post: Insert rows with batch id using sqlldr
I suspect it could be adapted to use SYSDATE as well so that it would be the same for all rows. Give it a try and let us know. At any rate, using a batch_id from a sequence makes working through problems much easier should you need to delete based on a batch_id. A control-file sketch covering the file name is below.
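For the file path itself, SQL*Loader does not expose the data file name inside the control file, so a common approach is to have the calling script generate the control file per run and hard-code the value with CONSTANT. A minimal sketch, assuming a hypothetical staging table emp_stage with columns name, salary, load_date, and src_file (the table, columns, and paths are placeholders):

-- generated per run by the load script; the values below are placeholders
LOAD DATA
INFILE '/data/in/jass_naming_202401.csv'
APPEND
INTO TABLE emp_stage
FIELDS TERMINATED BY ','
(
  name,
  salary,
  load_date SYSDATE,
  src_file  CONSTANT '/data/in/jass_naming_202401.csv'
)

Here src_file is hard-coded by the script that writes the control file for each run, so it is identical for every row of that load; the same trick works for a batch_id taken from a sequence before the load starts. As noted above, SYSDATE may differ between rows of a long load, so a per-load constant is often the safer choice if you need to back a load out.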
I have 4 different text files, each with a different name and different columns, placed in one folder. I want these 4 files to be inserted or updated into 4 different existing tables. How can I read these 4 files dynamically and insert them into their respective tables dynamically in SSIS?
Well, you need to use a Data Flow Task to move data from a Flat File Source to a table destination (an OLE DB Destination, perhaps). Are the columns in your files delimited in any way, for example with (;), (|) or something like that? If they are, you can create a Flat File Connection Manager and set it to split the columns. If not, you might need to use the fixed-width option to separate your columns. To use the OLE DB Destination, you will need to create an OLE DB Connection Manager that points to the table in your database. I could help you more if I had more information about the files you want to read the data from.
EDIT
Well, you said at the start you are working with 4 files and 4 tables, so you can create 4 Flat File Sources with 4 OLE DB Destinations as well (one of each per flat file). If I understood you correctly, these 4 files may or may not exist yet. So if you know the names the files will get, change the package property DelayValidation to true, and then create the connections with a sample text file. You do this so the file path gets saved. The tables, in my opinion, DO need to exist. Now, when you said:
i want to load all the text files into each different existing table whenever there is files inside the folder.
The only way I know to do something similar is to schedule the execution of your package at a certain time with a SQL Server Agent job. Please let me know if this is what you were looking for.
When we create a table using
CREATE EXTERNAL TABLE employee (name STRING, salary FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/emp'
and the /emp directory contains 2 emp files,
then when we run select * from employee, it gets the data from both files and displays it.
What will happen when there are other files in that directory with a different kind of record, whose columns do not match the employee table? Will it try to load all the files when we run "select * from employee"?
1. Can we specify the specific file name that we want to load?
2. Can we create another table with the same location?
Thanks
Prashant
It will load all the files in the /emp directory even if they don't match the table.
For your first question: you can use the Regex SerDe. If your data matches the regex, it gets loaded into the table (a sketch follows the links below).
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
Other options: I am pointing to some links below; they describe a few other approaches.
when creating an external table in hive can I point the location to specific files in a direcotry?
https://issues.apache.org/jira/browse/HIVE-951
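A minimal sketch of such a table definition, assuming the two-column emp layout from the question and the contrib RegexSerDe (with that SerDe the columns are declared as STRING; the jar path and the regex are assumptions to adapt):

-- assumed path to the jar that ships the contrib RegexSerDe
ADD JAR /path/to/hive-contrib.jar;

-- accepts exactly two comma-separated fields per line
CREATE EXTERNAL TABLE employee_regex (
  name   STRING,
  salary STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^,]*),([^,]*)"
)
LOCATION '/emp';

Lines that do not match the regex typically come back as NULLs rather than being skipped, so this controls how records are parsed, not which files in the directory are scanned.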
For your second question: yes, we can create other tables with the same location.
Here are your answers:
1. If the data in a file doesn't match the table format, Hive doesn't throw an error. It tries to read the data as best it can; if data for some columns is missing, it puts NULL in them.
2. No, we cannot specify a file name for a table to read from. Hive will consider all the files under the table's directory.
3. Yes, we can create other tables with the same location.