Bulk insert from CSV file - skip duplicates - sql

UPDATE: Ended up using this method created by Johnny Bubriski and then modified it a bit to skip duplicates. Works like a charm and is apparently quite fast.
Link: http://johnnycode.com/2013/08/19/using-c-sharp-sqlbulkcopy-to-import-csv-data-sql-server/
I have been searching for an answer to this but cannot seem to find it. I am doing a T-SQL bulk insert to load data into a table in a local database from a csv file. My statement looks like this:
BULK INSERT Orders
FROM 'csvfile.csv'
WITH(FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt', ERRORFILE = 'C:\\ProgramData\\Tools_TextileMagazine\\AdditionalFiles\\BulkInsertErrors.txt')
GO
SELECT *
FROM Orders
GO
I get an exception when I try to insert duplicate rows (for example taking the same csv file twice) which causes the entire insert to stop and rollback. Understandable enough since I am violating the primary key constraint. Right now I just show a messagebox to let users know that duplicates are present in the csv file, but this is of course not a proper solution, actually not a solution at all. My question is, is there any way to ignore these duplicate rows and just skip them and only add the rows that are not duplicate? Perhaps in a try catch somehow?
If it is not possible, what would be the "correct" (for lack of a better word) way to import data from the csv file? The exception is causing me a bit of trouble. I did read somewhere that you can set up a temporary table, load the data into it and select distinct between the two tables before inserting. But is there really no easier way to do it with bulk insert?

You could set the MAXERRORS property to quite a high value, which will allow the valid records to be inserted and the duplicates to be ignored. Unfortunately, this means that any other errors in the dataset won't cause the load to fail either.
Alternatively, you could set the BATCHSIZE property, which loads the data in multiple transactions; if there are duplicates, only the batch that contains them is rolled back.
A safer, but less efficient, way would be to load the CSV file into a separate, empty table and then merge it into your Orders table, as you mentioned. Personally, this is the way I'd do it.
None of these solutions are ideal but I can't think of a way of ignoring duplicates in the bulk insert syntax.
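A minimal sketch of the staging-table approach, using Python's built-in sqlite3 module as a runnable stand-in for SQL Server. The table names orders and orders_staging, the two columns, and the helper load_skipping_duplicates are assumptions made for illustration; in SQL Server the staging table would be filled with BULK INSERT rather than executemany, while the INSERT ... SELECT ... WHERE NOT EXISTS statement carries over largely unchanged.

```python
import csv
import io
import sqlite3


def load_skipping_duplicates(conn, csv_text):
    """Load semicolon-delimited rows into orders, skipping keys that already exist."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=";"))
    cur = conn.cursor()
    before = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

    # 1. Refill the (hypothetical) staging table, which has no primary key.
    cur.execute("DELETE FROM orders_staging")
    cur.executemany(
        "INSERT INTO orders_staging (order_id, customer) VALUES (?, ?)", rows
    )

    # 2. Copy only keys that are not yet in the target table. GROUP BY also
    #    collapses keys that are repeated inside the file itself.
    cur.execute(
        """
        INSERT INTO orders (order_id, customer)
        SELECT s.order_id, MIN(s.customer)
        FROM orders_staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.order_id = s.order_id)
        GROUP BY s.order_id
        """
    )
    conn.commit()
    after = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE orders_staging (order_id INTEGER, customer TEXT)")

csv_text = "1;Alice\n2;Bob\n2;Bob\n"
first = load_skipping_duplicates(conn, csv_text)   # first import of the file
second = load_skipping_duplicates(conn, csv_text)  # same file imported again
```

Importing the same file twice no longer raises a primary key error; the second run simply adds nothing.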

First of all, there is no direct solution like BULK INSERT WHERE NOT EXISTS. You can use the following solutions.
When using BULK INSERT there are two scenarios:
You are BULK INSERTing into an empty table
You are BULK INSERTing into an already filled table
Solution for Case 1:
set MAXERRORS = 0
set BATCHSIZE = total number of rows in the csv file
With these settings, a single error causes the whole BULK INSERT operation to roll back, so no rows are imported if even a few rows contain errors. You need to fix all import errors to complete the import. This prevents situations where you import 50 rows, 30 get imported and the rest do not, and you then have to search the CSV file for the failed rows and reimport them, or delete all imported rows from the SQL table and run BULK INSERT again.
Solution for Case 2:
1> Run a select query on the existing table, right click and export to CSV. In any spreadsheet program, paste that data below the data to be imported, use conditional formatting on the primary key column to highlight duplicate rows, and delete them. Then run the BULK INSERT operation.
2> Set MAXERRORS = number of rows, and import the csv file using BULK INSERT. This is neither a safe nor a recommended approach, as there might be other kinds of errors apart from duplicate key errors.
3> Set BATCHSIZE = 1 and MAXERRORS = a high number, and import the csv file using BULK INSERT. This imports all rows without errors and skips any row with errors. It is useful if the data set is small and you can visually identify the rows that were not imported, for example by looking for missing numbers in an id column.
4> Right click on the existing table and choose Script Table as > CREATE To > New Query Editor Window. Change the table name to a staging name such as table_staging and run the script. BULK INSERT into the staging table, then run a second query that copies data from the staging table to the main table, using a WHERE clause to check whether the row/pk already exists. This is a much safer way but forces you to create a staging table.
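The manual spreadsheet clean-up in option 1 can also be done in code before the file ever reaches BULK INSERT. The following is a rough sketch under assumptions: the function name filter_new_rows, the semicolon delimiter, and the idea that the existing primary keys have already been fetched into a set are all hypothetical choices for illustration.

```python
import csv
import io


def filter_new_rows(csv_text, existing_keys, key_index=0, delimiter=";"):
    """Return rows whose key is neither in existing_keys nor repeated earlier in the file."""
    seen = set(existing_keys)
    kept = []
    for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        key = row[key_index]
        if key in seen:
            continue  # duplicate of an existing or earlier row
        seen.add(key)
        kept.append(row)
    return kept


# Key "1" is assumed to be in the table already; key "2" is repeated in the file.
rows = filter_new_rows("1;Alice\n2;Bob\n2;Bob\n3;Carol\n", existing_keys={"1"})
```

The filtered rows can then be written to a new CSV file and bulk loaded without triggering the primary key violation. Keys are compared as strings here, so the existing keys must be fetched in the same textual form.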

Related

INSERT INTO statements: how to insert records quickly

If I have 1 million INSERT INTO statements in a text file, how can I insert them into a table as quickly as possible?
INSERT INTO TABLE(a,b,c) VALUES (1,'shubham','engg');
INSERT INTO TABLE(a,b,c) VALUES (2,'swapnil','chemical');
INSERT INTO TABLE(a,b,c) VALUES (n,'n','n');
Like the above, we have 1 million rows. How can I insert these records into the table quickly? Is there any other option, or do I simply run all of the above statements in sequence?
Avoid row-by-row inserts for loading such huge quantities of data. They are pretty slow, and there's no reason to rely on plain inserts, even if you're using the SQL*Plus command line to run them as a file. Put the values to be inserted as comma-separated (or any other delimiter) entries in a flat file and then use options such as
SQL*Loader
External table
It is a common practice to extract data into flat files from tools like SQL Developer. Choose the "CSV" option instead of "Insert" that will generate the values in a flat file.
If it means that your text file contains literally those INSERT INTO statements, then run the file.
If you use GUI, load the file and run it as a script (so that it executes them all).
If you use SQL*Plus, call the file using @, e.g.
SQL> @insert_them_all.sql
You can batch the rows into a single INSERT statement:
INSERT INTO TABLE(a,b,c) VALUES
(1,'shubham','engg'),
(2,'swapnil','chemical'),
(n,'n','n');
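A minimal sketch of the multi-row batching idea, again using Python's sqlite3 module so that it runs as-is. The table name people, the helper insert_in_chunks, and the chunk size of 100 are assumptions for illustration. Note that multi-row VALUES lists are not accepted by every engine (older Oracle releases reject them), so this is a sketch of the technique rather than something to paste into SQL*Plus.

```python
import sqlite3


def insert_in_chunks(conn, rows, chunk_size=100):
    """Insert rows using one multi-row INSERT per chunk, committed as one transaction."""
    statements = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # One "(?, ?, ?)" group per row in the chunk.
        placeholders = ", ".join(["(?, ?, ?)"] * len(chunk))
        flat = [value for row in chunk for value in row]
        conn.execute(f"INSERT INTO people (a, b, c) VALUES {placeholders}", flat)
        statements += 1
    conn.commit()
    return statements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (a INTEGER, b TEXT, c TEXT)")

rows = [(i, f"name{i}", "engg") for i in range(1, 251)]
statements = insert_in_chunks(conn, rows)
total = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
```

Chunking keeps each statement below the engine's limits on statement size and bound parameters while still cutting the number of round trips from one per row to one per chunk.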

How to clean garbage data from a database table after it is populated with data from an Excel file?

I have a package that dumps the data from .xls file into some kind of staging table.
Then I need to insert this data into a main table.
I'm looking for a way to write SQL code that would get rid of garbage data in the staging table.
This is an example of xls file
When executing my package, my staging table looks like this:
After that, I run the following code to delete the garbage data from the staging table:
delete from StagingTable where Data IS NULL or Data = 'Date'
That takes care of garbage removal for that particular case.
But what if the data comes in with different xls column names? Then my delete statement simply will not work.
Is there a workaround for this problem?
I found an answer. It works if the first column of the staging table holds a date value:
select * from StagingTable where ISDATE(Date) = 1
This returns only the rows whose first column is a valid date.
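The same "keep only rows whose first column is a date" idea can be sketched outside the database. The sample rows and the two accepted date formats below are assumptions made for illustration; T-SQL's ISDATE accepts a wider, locale-dependent set of formats than this hypothetical looks_like_date helper.

```python
from datetime import datetime


def looks_like_date(value, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Return True if value parses as a date in one of the assumed formats."""
    if value is None:
        return False
    for fmt in formats:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


# Hypothetical staging rows: a header row, an empty row, and a stray text row
# mixed in with the real data.
staging = [
    ("Date", "Amount"),
    (None, None),
    ("2014-03-01", "10.50"),
    ("Report total", "99"),
    ("03/02/2014", "7.25"),
]
clean = [row for row in staging if looks_like_date(row[0])]
```

Because the filter tests the value rather than matching known header text, it keeps working when the column names in the spreadsheet change.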

Import Oracle data dump and overwrite existing data

I have an oracle dmp file and I need to import data into a table.
The data in the dump contains new rows and few updated rows.
I am using import command and IGNORE=Y, so it imports all the new rows well. But it doesn't import/overwrite the existing rows (it shows a warning of unique key constraint violated).
Is there some option to make the import UPDATE the existing rows with new data?
No. If you were using data pump then you could use the TABLE_EXISTS_ACTION=TRUNCATE option to remove all existing rows and import everything from the dump file, but as you want to update existing rows and leave any rows not in the new file alone - i.e. not delete them (I think, since you only mention updating, though that isn't clear) - that might not be appropriate. And as your dump file is from the old exp tool rather than expdp that's moot anyway, unless you can re-export the data.
If you do want to delete existing rows that are not in the dump then you could truncate all the affected tables before importing. But that would be a separate step that you'd have to perform yourself; it's not something imp will do for you. The tables would also be empty for a while, so you'd need downtime to do it.
Alternatively you could import into new staging tables (in a different schema, since imp doesn't support renaming either) and then use those to merge the new data into the real tables. That may be the least disruptive approach. You'd still have to design and write all the merge statements though. There's no built-in way to do this automatically.
You can import into a temp table and then reconcile the records by joining against it.
Use the impdp option REMAP_TABLE to load the existing file into a temp table.
impdp .... REMAP_TABLE=TMP_TABLE_NAME
When the load is done, run a MERGE statement on the existing table from the temp table.
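A rough sketch of the merge step, using Python's sqlite3 module so it can be run as-is. SQLite has no MERGE statement, so the sketch swaps in an UPDATE followed by an INSERT ... WHERE NOT EXISTS, which has the same effect as a MERGE with a matched-update and a not-matched-insert branch. The table names items and items_staging and the price column are assumptions for illustration.

```python
import sqlite3


def merge_from_staging(conn):
    """Update rows that exist in both tables, then insert rows found only in staging."""
    cur = conn.cursor()
    # Matched rows: overwrite the existing value with the staged one.
    cur.execute(
        """
        UPDATE items
        SET price = (SELECT s.price FROM items_staging AS s WHERE s.id = items.id)
        WHERE EXISTS (SELECT 1 FROM items_staging AS s WHERE s.id = items.id)
        """
    )
    # Unmatched rows: add them as new rows.
    cur.execute(
        """
        INSERT INTO items (id, price)
        SELECT s.id, s.price
        FROM items_staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM items AS i WHERE i.id = s.id)
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL)")
conn.execute("CREATE TABLE items_staging (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.executemany("INSERT INTO items_staging VALUES (?, ?)", [(2, 25.0), (3, 30.0)])

merge_from_staging(conn)
result = conn.execute("SELECT id, price FROM items ORDER BY id").fetchall()
```

Row 1 is left alone because it is not in the staged data, row 2 is updated, and row 3 is inserted, which matches the "update existing rows, keep the rest" requirement in the question.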

Bulk Insert with Limited Disk Space

I have a bit of a strange situation, and I'm wondering if anyone would have any ideas how to proceed.
I'm trying to bulk load a 48 gig pipe-delimited file into a table in SQL Server 2008, using a pretty simple bulk insert statement.
BULK INSERT ItemMovement
FROM 'E:\SQLexp\itemmove.csv'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
Originally, I was trying to load directly into the ItemMovement table. But unfortunately, there's a primary key violation somewhere in this giant file. I created a temporary table to load this file to instead, and I'm planning on selecting distinct rows from the temporary table and merging them into the permanent table.
However, I keep running into space issues. The drive I'm working with is a total of 200 gigs, and 89 gigs are already devoted to both my CSV file and other database information. Every time I try to do my insertion, even with my recovery model set to "Simple", I get the following error (after 9.5 hours of course):
Msg 9002, Level 17, State 4, Line 1
The transaction log for database 'MyData' is full due to 'ACTIVE_TRANSACTION'.
Basically, my question boils down to two things.
Is there any way to load this file into a table that won't fill up the drive with logging? Simple Recovery doesn't seem to be enough by itself.
If we do manage to load up the table, is there a way to do a distinct merge that removes the items from the source table while it's doing the query (for space reasons)?
Appreciate your help.
Even with simple recovery, the insert is still a single operation.
You are getting the error on the PK column.
I assume the PK is only a fraction of the total size.
I would break it up to insert only the PK.
Pretty sure you can limit the columns with FORMATFILE.
If you have to edit a bunch of duplicate PKs, you may need to use a program to parse the file and then load it row by row.
Sounds like a lot of work that is solved with a $100 drive.
Seriously, I would install a drive and use it for the transaction log.
@tommy_o was right about using TABLOCK in order to get my information loaded. Not only did it run in about an hour and a half instead of nine hours, but it barely increased my log size.
For the second part, I realized I could free up quite a bit of space by deleting my CSV after the load, which gave me enough space to get the tables merged.
Thanks everyone!

Is it possible to overwrite with a SSIS Insert or similar?

I have a .csv file that gets pivoted into 6 million rows during an SSIS package. I have a table in SQL Server 2005 of 25 million+ rows. The .csv file has data that duplicates data in the table. Is it possible for a row to be updated if it already exists, or what would be the best method to achieve this efficiently?
Comparing 6m rows against 25m rows is not going to be too efficient with a lookup or a SQL command data flow component being called for each row to do an upsert. In these cases, sometimes it is most efficient to load them quickly into a staging table and use a single set-based SQL command to do the upsert.
Even if you do decide to do the lookup - split the flow into two streams, one which inserts and the other which inserts into a staging table for an update operation.
If you don't mind losing the old data (i.e. the latest file is all that matters, not what's in the table), you could erase all the records in the table and insert them again.
You could also load into a temporary table and determine what needs to be updated and what needs to be inserted from there.
You can use the Lookup task to identify any matching rows in the CSV and the table, then pass the output of this to another table or data flow and use a SQL task to perform the required Update.
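The lookup-and-split idea from the answers above can be illustrated in a few lines. This is only a sketch of the logic, not of SSIS itself: the function split_upsert and the use of in-memory dictionaries keyed by primary key are assumptions for illustration, and at 25 million rows the comparison would normally be done set-based in the database against a staging table.

```python
def split_upsert(incoming, existing):
    """Partition incoming {key: value} rows into inserts, updates, and unchanged keys."""
    inserts, updates, unchanged = {}, {}, []
    for key, value in incoming.items():
        if key not in existing:
            inserts[key] = value      # no match: goes to the insert stream
        elif existing[key] != value:
            updates[key] = value      # match with new data: goes to the update stream
        else:
            unchanged.append(key)     # match with identical data: nothing to do
    return inserts, updates, unchanged


existing = {1: "red", 2: "green", 3: "blue"}
incoming = {2: "green", 3: "navy", 4: "white"}
inserts, updates, unchanged = split_upsert(incoming, existing)
```

Separating unchanged rows from real updates matters at this scale, since rows that already match do not need to be written at all.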