Bulk Insert with Limited Disk Space - sql

I have a bit of a strange situation, and I'm wondering if anyone would have any ideas how to proceed.
I'm trying to bulk load a 48 gig pipe-delimited file into a table in SQL Server 2008, using a pretty simple bulk insert statement.
BULK INSERT ItemMovement
FROM 'E:\SQLexp\itemmove.csv'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
Originally, I was trying to load directly into the ItemMovement table. But unfortunately, there's a primary key violation somewhere in this giant file. I created a temporary table to load this file to instead, and I'm planning on selecting distinct rows from the temporary table and merging them into the permanent table.
However, I keep running into space issues. The drive I'm working with is a total of 200 gigs, and 89 gigs are already devoted to both my CSV file and other database information. Every time I try to do my insertion, even with my recovery model set to "Simple", I get the following error (after 9.5 hours of course):
Msg 9002, Level 17, State 4, Line 1
The transaction log for database 'MyData' is full due to 'ACTIVE_TRANSACTION'.
Basically, my question boils down to two things.
Is there any way to load this file into a table that won't fill up the drive with logging? Simple Recovery doesn't seem to be enough by itself.
If we do manage to load up the table, is there a way to do a distinct merge that removes the items from the source table while it's doing the query (for space reasons)?
Appreciate your help.

Even with simple recovery, the insert is still a single transaction, so the whole load is logged as one operation.
You are getting the error on the PK column, and the PK is presumably only a fraction of the total row size, so I would break the problem up and load only the PK column first. I'm pretty sure you can limit the columns with a FORMATFILE.
If you then have to fix a bunch of duplicate PKs, you may need a program to parse the file and load it row by row.
That sounds like a lot of work that is solved with a $100 drive. Seriously, I would install another drive and use it for the transaction log.
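If you go the key-only route, a non-XML format file can keep the key field and skip the rest by mapping the unwanted fields to server column 0. The sketch below is only illustrative: the staging table, the ItemKey column, the three-field layout, and the field widths are all assumptions, since the real ItemMovement layout isn't shown in the question.
-- hypothetical narrow staging table holding just the key
CREATE TABLE ItemMovementKeys (ItemKey varchar(50) NOT NULL);
-- itemmove_pk.fmt (non-XML format file): a server column order of 0 skips that field
10.0
3
1   SQLCHAR   0   50   "|"    1   ItemKey   SQL_Latin1_General_CP1_CI_AS
2   SQLCHAR   0   50   "|"    0   Unused1   ""
3   SQLCHAR   0   50   "\n"   0   Unused2   ""
BULK INSERT ItemMovementKeys
FROM 'E:\SQLexp\itemmove.csv'
WITH (FORMATFILE = 'E:\SQLexp\itemmove_pk.fmt', TABLOCK)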

@tommy_o was right about using TABLOCK in order to get my information loaded. Not only did it run in about an hour and a half instead of nine hours, but it barely increased my log size.
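For reference, the load statement was essentially the original with TABLOCK added (assuming the target is a heap with no indexes, which is what lets simple recovery keep the logging minimal):
BULK INSERT ItemMovement
FROM 'E:\SQLexp\itemmove.csv'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK)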
For the second part, I realized I could free up quite a bit of space by deleting my CSV after the load, which gave me enough space to get the tables merged.
Thanks everyone!

Related

Oracle Bulk Insert Using SQL Developer

I have recently taken data dumps from an Oracle database.
Many of them are large (~5 GB each). I am trying to insert the dumped data into another Oracle database by executing the following SQL in SQL Developer:
@C:\path\to\table_dump1.sql;
@C:\path\to\table_dump2.sql;
@C:\path\to\table_dump3.sql;
:
but it is taking a very long time, more than a day, to complete even a single SQL file.
Is there any better way to get this done faster?
SQL*Loader is my favorite way to bulk load large data volumes into Oracle. Use the direct-path load option for maximum speed, but understand the impacts of direct-path loads (for example, all data is inserted past the high-water mark, which is fine if you truncate your table). It even has a tolerance for bad rows, so if your data has "some" mistakes it can still work.
SQL*Loader can suspend indexes and build them all at the end, which makes bulk inserting very fast.
Example of a SQL*Loader call:
$SQLDIR/sqlldr /@MyDatabase direct=false silent=feedback \
control=mydata.ctl log=/apps/logs/mydata.log bad=/apps/logs/mydata.bad \
rows=200000
And the mydata.ctl would look something like this:
LOAD DATA
INFILE '/apps/load_files/mytable.dat'
INTO TABLE my_schema.my_table
FIELDS TERMINATED BY "|"
(ORDER_ID,
ORDER_DATE,
PART_NUMBER,
QUANTITY)
Alternatively... if you are just copying the entire contents of one table to another across databases, you can do this if your DBA sets up a DB link (a 30-second process), presupposing your database is set up with the redo space to accomplish this.
truncate table my_schema.my_table;
insert into my_schema.my_table
select * from my_schema.my_table@my_remote_db;
Using the /*+ append */ hint on that insert still gets you a direct-path insert.
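A minimal sketch of the copy with the hint, using the same placeholder names as above; note that after a direct-path insert the transaction must be committed before the table can be queried again in that session.
insert /*+ append */ into my_schema.my_table
select * from my_schema.my_table@my_remote_db;
commit;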

Bulk insert from CSV file - skip duplicates

UPDATE: Ended up using this method created by Johnny Bubriski and then modified it a bit to skip duplicates. Works like a charm and is apparently quite fast.
Link: http://johnnycode.com/2013/08/19/using-c-sharp-sqlbulkcopy-to-import-csv-data-sql-server/
I have been searching for an answer to this but cannot seem to find it. I am doing a T-SQL bulk insert to load data into a table in a local database from a csv file. My statement looks like this:
BULK INSERT Orders
FROM 'csvfile.csv'
WITH(FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt', ERRORFILE = 'C:\\ProgramData\\Tools_TextileMagazine\\AdditionalFiles\\BulkInsertErrors.txt')
GO
SELECT *
FROM Orders
GO
I get an exception when I try to insert duplicate rows (for example taking the same csv file twice) which causes the entire insert to stop and rollback. Understandable enough since I am violating the primary key constraint. Right now I just show a messagebox to let users know that duplicates are present in the csv file, but this is of course not a proper solution, actually not a solution at all. My question is, is there any way to ignore these duplicate rows and just skip them and only add the rows that are not duplicate? Perhaps in a try catch somehow?
If it is not possible, what would be the "correct" (for lack of a better word) way to import data from the csv file? The exception is causing me a bit of trouble. I did read somewhere that you can set up a temporary table, load the data into it and select distinct between the two tables before inserting. But is there really no easier way to do it with bulk insert?
You could set the MAXERRORS property to quite a high value, which will allow the valid records to be inserted and the duplicates to be ignored. Unfortunately, this will mean that any other errors in the dataset won't cause the load to fail.
Alternatively, you could set the BATCHSIZE property, which will load the data in multiple transactions; that way, if there are duplicates, only the affected batch is rolled back.
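As a hedged illustration of those two options applied to the statement from the question (the values 1000 and 100000 are arbitrary, and whether duplicate-key failures are really absorbed by MAXERRORS is worth verifying against your own data before relying on it):
BULK INSERT Orders
FROM 'csvfile.csv'
WITH(FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a', FORMATFILE = 'formatfile.fmt',
     ERRORFILE = 'C:\ProgramData\Tools_TextileMagazine\AdditionalFiles\BulkInsertErrors.txt',
     BATCHSIZE = 1000,    -- each batch commits on its own, so a failure only rolls back that batch
     MAXERRORS = 100000)  -- tolerate many bad rows instead of aborting on the first one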
A safer, but less efficient, way would be to load the CSV file into a separate, empty table and then merge the rows into your orders table as you mentioned. Personally, this is the way I'd do it.
None of these solutions are ideal but I can't think of a way of ignoring duplicates in the bulk insert syntax.
First of all, there is no direct solution like BULK INSERT WHERE NOT EXISTS. You can use one of the following approaches.
When using BULK INSERT, there are two scenarios:
You are BULK INSERTing into an empty table
You are BULK INSERTing into an already filled table
Solution for Case 1:
Set MAXERRORS = 0
Set BATCHSIZE = total number of rows in the CSV file
Using those settings with BULK INSERT causes the whole operation to roll back if there is even a single error, so nothing is imported while any row has a problem. You have to fix every import error before the load can complete. This prevents the situation where you try to import 50 rows, 30 get imported and the rest do not, and you then have to search the CSV file for the failed ones and reimport them, or delete all the imported rows from the SQL table and run the BULK INSERT again.
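A minimal sketch of that all-or-nothing configuration, assuming a semicolon-delimited file of roughly 100000 rows (the file name, terminators, and row count are placeholders):
BULK INSERT Orders
FROM 'csvfile.csv'
WITH(FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a',
     MAXERRORS = 0,       -- any bad row cancels the whole operation
     BATCHSIZE = 100000)  -- one batch = one transaction, so nothing is left half-imported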
Solution for Case 2:
1> You can run a SELECT query on the existing table, right-click the results, and export them to CSV. If you have a spreadsheet program, paste that data below the data you want to import, use conditional formatting on the primary key column to highlight the duplicate rows, and delete them. Then run the BULK INSERT operation.
2> Set MAXERRORS to the number of rows and import the CSV file using BULK INSERT. This is not a safe or recommended way, as there might be other kinds of errors apart from duplicate key errors.
3> Set BATCHSIZE = 1 and MAXERRORS to a high number, and import the CSV file using BULK INSERT. This imports all the rows that have no errors, and any row with an error is skipped. It is useful if the data set is small and you can visually identify the rows that were not imported, for example by looking at an ID number column and spotting the missing numbers.
4> Right-click the existing table and choose Script Table as > CREATE To > New Query Window. Rename the table in the script to a staging name such as table_staging. BULK INSERT into the staging table, then run a second query to copy the data from the staging table to the main table, using a WHERE clause to check whether the row/PK already exists (a sketch follows). This is a much safer way, but it forces you to create a staging table.
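A hedged sketch of option 4; the staging table name and the key/column names are placeholders, since the real Orders schema isn't shown in the question.
BULK INSERT Orders_staging
FROM 'csvfile.csv'
WITH(FIELDTERMINATOR = ';', ROWTERMINATOR = '0x0a');
-- copy only the rows whose key is not already in the main table
INSERT INTO Orders (OrderId, CustomerId, Amount)
SELECT s.OrderId, s.CustomerId, s.Amount
FROM Orders_staging AS s
WHERE NOT EXISTS (SELECT 1 FROM Orders AS o WHERE o.OrderId = s.OrderId);
TRUNCATE TABLE Orders_staging;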

LDF file continues to grow very large during transaction phase - SQL Server 2005

We have a 6-step process where we copy tables from one database to another. Each step executes a stored procedure.
Remove tables from destination database
Create tables in destination database
Shrink database log before copy
Copy tables from source to destination
Shrink the database log
Back up destination database
During step 4, our transaction log (LDF file) grows very large, to the point where we now have to keep increasing its max size on the SQL Server, and we believe that eventually it may eat up all the resources on our server. It was suggested that in our script we commit each transaction instead of waiting until the end to commit them all.
Any suggestions?
I'll make the assumption that you are moving large amounts of data. The typical solution to this problem is to break the copy up into smaller batches of rows. This keeps the hit on the transaction log smaller. I think this will be the preferred answer.
The other approach I have seen is bulk copy, which writes the data out to a text file and imports it into your target DB using BCP. I've seen a lot of posts that recommend this. I haven't tried it.
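A rough sketch of the batching approach, assuming the tables have an increasing integer key named Id and the destination table starts out empty (the database, table, and column names and the 50000 batch size are placeholders):
DECLARE @batch int, @lastId int, @rows int
SET @batch = 50000
SET @lastId = 0
SET @rows = 1
WHILE @rows > 0
BEGIN
    -- each INSERT is its own transaction, so the log can be truncated between batches
    -- (under SIMPLE recovery or with regular log backups)
    INSERT INTO DestDb.dbo.MyTable (Id, Col1, Col2)
    SELECT TOP (@batch) Id, Col1, Col2
    FROM SourceDb.dbo.MyTable
    WHERE Id > @lastId
    ORDER BY Id
    SET @rows = @@ROWCOUNT
    IF @rows > 0
        SELECT @lastId = MAX(Id) FROM DestDb.dbo.MyTable  -- advance the watermark
END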
If the schema of the target tables isn't changing could you not just truncate the data in the target tables instead of dropping and recreating?
Can you change the database recovery model to Bulk Logged for this process?
Then, instead of creating empty tables at the destination, do a SELECT INTO to create them. Once they are built, alter the tables to add indices and constraints. Doing bulk copies like this will greatly reduce your logging requirements.
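A minimal sketch of that approach; the database, table, and index names are placeholders, and the recovery model is switched back once the copy is done:
ALTER DATABASE DestDb SET RECOVERY BULK_LOGGED
-- SELECT INTO creates the table and is minimally logged under the bulk-logged model
SELECT *
INTO DestDb.dbo.MyTable
FROM SourceDb.dbo.MyTable
-- add indexes and constraints after the data is in place
CREATE CLUSTERED INDEX IX_MyTable_Id ON DestDb.dbo.MyTable (Id)
ALTER DATABASE DestDb SET RECOVERY FULL  -- take a log backup after switching back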

ORA delete / truncate

I'm using SQL loader to load my data into database.
Before I insert the data I need to remove existing data in the table:
options(skip=1,load=250000,errors=0,ROWS=30000,BINDSIZE=10485760)
load data
infile 'G:1.csv' "str '^_^'"
replace
into table IMPORT_ABC
fields terminated by "," OPTIONALLY ENCLOSED BY '"'
trailing nullcols(
.
.
.
.)
But I got an error like:
SQL*LOADER-926: OCI error while executing delete/truncate for table IMPORT_ABC
ORA-30036: unable to extend segment by 8 in undo tablespace 'undo1'
How can I delete the data in chunks, for example 10000 rows at a time?
I know that I have some limits on my DB.
Deleting records in batches can be done in a PL/SQL loop, but is generally considered bad practice as the entire delete should normally be considered as a single transaction; and that can't be done from within the SQL*Loader control file. Your DBA should size the UNDO space to accommodate the work you need to do.
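For completeness, a minimal PL/SQL sketch of the batched delete the question asks about (generally discouraged, for the reasons above); the 10000-row batch size matches the question:
BEGIN
  LOOP
    DELETE FROM import_abc WHERE ROWNUM <= 10000;  -- at most 10000 rows per pass
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;  -- frees the undo generated by the batch just deleted
  END LOOP;
  COMMIT;
END;
/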
If you're deleting the entire table you'll almost certainly be better off truncating anyway, either in the control file:
options(skip=1,load=250000,errors=0,ROWS=30000,BINDSIZE=10485760)
load data
infile 'G:1.csv' "str '^_^'"
truncate
into table IMPORT_ABC
...
Or as a separate truncate statement in SQL*Plus/SQL Developer/some other client before you start the load:
truncate table import_abc;
The disadvantage is that your table will appear empty to other users while the new rows are being loaded, but if it's a dedicated import area (guessing from the name) that may not matter anyway.
If your UNDO is really that small then you may have to run multiple loads, in which case - probably obviously - you need to make sure you only have the truncate in the control file for the first one (or use the separate truncate statement), and have append instead in subsequent control files as you noted in comments.
You might also want to consider external tables if you're using this data as a base to populate something else, as there is no UNDO overhead on replacing the external data source. You'll probably need to talk to your DBA about setting that up and giving you the necessary directory permissions.
Your undo tablespace is too small to hold all the undo information, and it seems it cannot be extended.
You can split the import into smaller batches and issue a commit after each batch, or get your DBA to increase the undo1 tablespace.
And use truncate instead of replace before you start the imports.

Issue with huge table archive

I am assigned to move data from huge tables (around two million records each) to identical history tables. But when my query runs, the log file grows too large and messes everything up. I tried the following:
For each table being archived, handle as separate transaction
Anyway, for the history tables I didn't specify a primary key (could this be a problem?)
All the transactions were written in a single stored procedure
Can anyone tell me if there is an issue with my approach, or whether this is simply not the right way to do it?
You can minimise logging if you use table locks with a bulk import.
Lots of great info is found here:
http://msdn.microsoft.com/en-us/library/ms190422.aspx
Some pointers from the article (a sketch putting them together follows the list):
change db mode to bulk logged
apply indexes after import
import in batches
do a log backup after each batch.
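Putting those pointers together, a rough sketch for one table; the database, table, and key names, the 100000 batch size, and the backup path are placeholders, and it assumes the history table is a heap with no indexes during the load (SQL Server 2008 or later for minimally logged INSERT...SELECT with TABLOCK):
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED
DECLARE @rows int
SET @rows = 1
WHILE @rows > 0
BEGIN
    INSERT INTO dbo.BigTable_History WITH (TABLOCK)  -- table lock helps qualify for minimal logging
    SELECT TOP (100000) b.*
    FROM dbo.BigTable AS b
    WHERE NOT EXISTS (SELECT 1 FROM dbo.BigTable_History AS h WHERE h.Id = b.Id)
    SET @rows = @@ROWCOUNT
    BACKUP LOG MyDb TO DISK = 'E:\Backups\MyDb_log.trn'  -- keep the log from growing between batches
END
ALTER DATABASE MyDb SET RECOVERY FULL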