I'm attempting to import a large amount of data contained in a CSV file into a SQL database. The CSV is 4 GB in size, with 329 columns and 300,000+ rows of data. So far I've successfully created the database and the table that will hold the data once imported. The data contains strings (VARCHAR(x)), numbers (INT), and dates (DATE).
The fields in the CSV file are separated by a "," delimiter, but every field is enclosed in double quotes, and some fields contain no value. Below is a mock example of the data.
"123244234","09/12/2012","First Name","Last Name","Address 1","","","555-555-5555","","CountryCode"
From my research, the easiest way to import the data appears to be to use BCP to create a format file and then use that with BULK INSERT. The only problem is setting up the format file so that it removes the double quotes. When I attempt the import without a format file, it fails on row one because the first column of the first row is numeric and has "" around it.
I've reviewed the following link, which describes removing the double quotes by adding a dummy entry to the format file: "http://support.microsoft.com/default.aspx?scid=kb;EN-US;132463". In this case that means a lot of manual editing. Does anyone know of a better way to edit the format file? Here is a sample of the format file:
10.0
329
1 SQLCHAR 0 12 "," 1 NPI ""
2 SQLCHAR 0 12 "," 2 Entity Type Code ""
3 SQLCHAR 0 12 "," 3 Replacement NPI ""
4 SQLCHAR 0 9 "," 4 Employer Identification Number (EIN) SQL_Latin1_General_CP1_CI_AS
5 SQLCHAR 0 70 "," 5 Provider Organization Name (Legal Business Name) SQL_Latin1_General_CP1_CI_AS
6 SQLCHAR 0 35 "," 6 Provider Last Name (Legal Name) SQL_Latin1_General_CP1_CI_AS
7 SQLCHAR 0 20 "," 7 Provider First Name SQL_Latin1_General_CP1_CI_AS
8 SQLCHAR 0 20 "," 8 Provider Middle Name SQL_Latin1_General_CP1_CI_AS
9 SQLCHAR 0 5 "," 9 Provider Name Prefix Text SQL_Latin1_General_CP1_CI_AS
10 SQLCHAR 0 5 "," 10 Provider Name Suffix Text
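One way to avoid hand-editing 329 entries is to generate the format file with a script. The following is a minimal Python sketch of the dummy-entry approach mentioned above, not a tested production tool. It assumes every field in the data file is double-quoted and that rows end in CR LF. The function name build_quoted_csv_format and the short column name EIN are hypothetical. The dummy first field consumes the opening quote, every later field ends at "," and the last field ends at the closing quote plus the line break.

```python
def build_quoted_csv_format(columns, version="10.0"):
    """Return non-XML format file text for a CSV in which every field is double-quoted.

    columns: list of (name, max_length, collation) tuples in table order;
             pass "" as the collation for non-character columns.
    """
    total = len(columns) + 1  # +1 for the dummy field that eats the opening quote
    lines = [version, str(total)]
    # Dummy host field: ends at the first '"' and maps to server column 0 (not loaded).
    lines.append('1 SQLCHAR 0 0 "\\"" 0 dummy ""')
    for index, (name, length, collation) in enumerate(columns, start=1):
        is_last = index == len(columns)
        # Between fields the data reads  ","  and after the last field it reads  "<CR><LF>
        terminator = '\\"\\r\\n' if is_last else '\\",\\"'
        collation_text = collation if collation else '""'
        lines.append(
            f'{index + 1} SQLCHAR 0 {length} "{terminator}" {index} {name} {collation_text}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two columns only, for illustration; a real run would list all 329.
    print(build_quoted_csv_format([
        ("NPI", 12, ""),
        ("EIN", 9, "SQL_Latin1_General_CP1_CI_AS"),
    ]))
```

The sketch uses column names without spaces. How names such as "Entity Type Code" need to be written in a non-XML format file is not covered here and should be checked against the format file documentation.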
I'm attempting to do a bulk insert. The sample data and the format file are given below. It was brought to my attention that we need to use a Universal Naming Convention (UNC) path, hence the '\\FR-6RSGJH2.xyz.st\C$' part in the code. However, the same error occurs if you simplify it to '\C\Users\myname\Desktop\testimport.csv'. Any ideas as to what is missing in the syntax, or any settings that could be changed?
BULK INSERT testimport
FROM '\\FR-6RSGJH2.xyz.st\C$\Users\myname\Desktop\testimport.csv'
WITH (FORMATFILE = '\\FR-6RSGJH2.xyz.st\C$\Users\myname\Desktop\format.txt')
GO
Msg 4861, Level 16, State 1, Line 1
Cannot bulk load because the file "\C\Users\myname\Desktop\testimport.csv" could not be opened. Operating system error code 3(The system cannot find the path specified.).
Sample data
32003012017010316
32001022017040218
32003032017030213
32002042017020111
32002052017020110
format file
13.0
5
1 SQLCHAR 0 02 "" 1 st ""
2 SQLCHAR 0 03 "" 2 cnty ""
3 SQLCHAR 0 02 "" 3 v1 ""
4 SQLCHAR 0 08 "" 4 date ""
5 SQLCHAR 0 02 "\r\n" 5 v2 ""
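As a sanity check on the layout, the widths in the format file above (2, 3, 2, 8, 2) add up to 17, which matches the length of each sample row. The following Python sketch slices a row the same way a fixed-width load would. The function name split_fixed_width is hypothetical, and the check only looks at the data layout; it says nothing about the path error.

```python
# Field names and widths copied from the format file above.
WIDTHS = [("st", 2), ("cnty", 3), ("v1", 2), ("date", 8), ("v2", 2)]


def split_fixed_width(line, widths=WIDTHS):
    """Slice one fixed-width record into a dict of field name -> text."""
    expected = sum(width for _, width in widths)
    if len(line) != expected:
        raise ValueError(f"expected {expected} characters, got {len(line)}")
    record, position = {}, 0
    for name, width in widths:
        record[name] = line[position:position + width]
        position += width
    return record


if __name__ == "__main__":
    print(split_fixed_width("32003012017010316"))
```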
I'm not sure why, but when I saved testimport as a .txt file instead of a .csv, it worked. Anyway, that is the answer.
I'm trying to use a non-XML format file to bulk import a null-delimited file into SQL Server. I've added a column to the staging table in question, and updated the format file to reflect this. Everything seems to be inserting fine, except this last column. The column I added is
Comments (nvarchar(256), null)
The format file looks like this:
11.0
8
1 SQLNCHAR 0 4 "\0\0" 1 ClaimCheckSetId ""
2 SQLNCHAR 0 4 "\0\0" 2 BatchValidationId ""
3 SQLNCHAR 0 4 "\0\0" 3 SourceCommunicationId ""
4 SQLNCHAR 0 4 "\0\0" 4 TargetCommunicationId ""
5 SQLNCHAR 0 1800 "\0\0" 5 TargetExternalCommunicationId ""
6 SQLNCHAR 0 8 "\0\0" 6 TargetSentDateTime ""
7 SQLNCHAR 0 2000 "\0\0" 7 TargetSubject ""
8 SQLNCHAR 0 256 "\r\0\n\0" 8 Comments ""
The SQL looks like this:
DECLARE @filepath NVARCHAR(MAX) = 'C:\{file to import}_512fc21d-dbc9-4975-8169-2ca383ac2bdf.txt';
DECLARE @formatpath NVARCHAR(MAX) = 'C:\{format file}.txt';
DECLARE @bulkinsert NVARCHAR(MAX);
-- Build the statement as a string, because BULK INSERT does not accept variables for its file paths.
SET
@bulkinsert =
N'BULK INSERT
[The Table]
FROM ''' +
@filepath + N'''
WITH
(
FORMATFILE = ''' +
@formatpath + N''',
DATAFILETYPE = ''WIDECHAR'',
FIRSTROW = 1
)';
SET ANSI_WARNINGS OFF;
EXEC sp_executesql @bulkinsert;
SET ANSI_WARNINGS ON;
I'm getting no errors, and it is returning a number of rows affected. Unfortunately, I don't know enough about SQL to diagnose this problem. A few hours of googling have not helped either. I hope one of you kind guys or gals can set me back on the straight and narrow.
Update: I edited the \r\0\n\0 to \r\n and am now getting an error!
OLE DB provider 'BULK' for linked server '(null)' returned invalid data for column '[BULK].InsertedDateTime'.
You should check the input file in an editor that shows special symbols. Personally I use Notepad++ (free) for that (View > Show Symbol > Show All Characters), but any decent editor will do.
That way the row terminator (i.e. the last field terminator) should be clearly visible. In Notepad++ the \0 will be visible as NUL, \r as CR and \n as LF.
So with your settings as you currently have, you should be seeing CR NUL LF NUL. If you don't then change the last field terminator to what you see in the editor you are using.
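If opening the file in an editor is impractical, the same check can be done on the raw bytes. The following Python sketch is a rough heuristic rather than a full encoding detector. It looks for the common line-break byte patterns and reports them using the same names as above. The names describe_row_terminator and describe_file are hypothetical, and the sketch assumes that the first megabyte of the file is representative.

```python
# Checked in order, so the UTF-16LE form of CR LF is recognised before plain LF.
PATTERNS = [
    (b"\r\x00\n\x00", "CR NUL LF NUL"),  # "\r\n" encoded as UTF-16LE
    (b"\r\n", "CR LF"),
    (b"\n\x00", "LF NUL"),               # "\n" encoded as UTF-16LE
    (b"\n", "LF"),
]


def describe_row_terminator(data: bytes) -> str:
    """Name the first line-break pattern from PATTERNS that occurs in data."""
    for pattern, label in PATTERNS:
        if pattern in data:
            return label
    return "none found"


def describe_file(path, sample_size=1 << 20):
    """Apply the same check to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return describe_row_terminator(handle.read(sample_size))
```

For example, describe_row_terminator(b"a\x00b\x00\r\x00\n\x00") reports CR NUL LF NUL, which corresponds to the "\r\0\n\0" terminator in the format file.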
Given the limited information I have, could you try changing the following line
8 SQLNCHAR 0 256 "\r\0\n\0" 8 Comments ""
to
8 SQLNCHAR 0 256 "\r\n" 8 Comments ""
or
8 SQLNCHAR 0 256 "\0\0" 8 Comments ""
It seems the last one should wrap to new line.
I am attempting to use BULK INSERT to upload a very large data file (5M rows). All columns are plain varchars, with no conversion, so the format file is simple:
11.0
29
1 SQLCHAR 0 8 "" 1 AccountId ""
2 SQLCHAR 0 10 "" 2 TranDate ""
3 SQLCHAR 0 4 "" 3 TransCode ""
4 SQLCHAR 0 2 "" 4 AdditionalCode ""
5 SQLCHAR 0 11 "" 5 CurrentPrincipal ""
6 SQLCHAR 0 11 "" 6 CurrentInterest ""
7 SQLCHAR 0 11 "" 7 LateInterest ""
...
27 SQLCHAR 0 8 "" 27 Operator ""
28 SQLCHAR 0 10 "" 28 UpdateDate ""
29 SQLCHAR 0 12 "" 29 TimeUpdated ""
but each time, at some point, I get the same error:
Msg 4832, Level 16, State 1, Line 1
Bulk load: An unexpected end of file was encountered in the data file.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
I have tried the following:
Bulk Insert
[TableName] From 'dataFilePPathSpecification'
With (FORMATFILE = 'formatFilePPathSpecification')
but I get the error after about 5-6 minutes, and no data has been inserted.
When I added the BATCHSIZE parameter, I got the error after a much longer time, near the end of the file, after all but a very few of the rows had been inserted successfully.
Bulk Insert
[TableName] From 'dataFilePPathSpecification'
With (BATCHSIZE = 200,
FORMATFILE = 'formatFilePPathSpecification')
When I set the BatchSize to 2000 it runs much faster (fewer, larger transactions, I assume), but it still fails.
Does this have something to do with how BULK INSERT recognizes the end of the file? If so, what do I need to do to the format file to fix it?
Explicitly state your row terminator:
BULK INSERT TableName FROM 'Path'
WITH (
DATAFILETYPE = 'char',
ROWTERMINATOR = '\r\n',
FORMATFILE = 'formatFilePPathSpecification'
);
If this still fails, check your file to see if you have unexpected terminators embedded in text fields.
Try using the ERRORFILE option in the WITH clause to find the offending data:
ERRORFILE = 'C:\offendingdata.log'
If you still have a problem even after enabling the error file output, you can binary-search for it by setting the FIRSTROW and LASTROW options and running BULK INSERT repeatedly to isolate the offending row.
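The binary search can be sketched in Python as follows. This is only an outline of the search logic. The callback loads_ok(first, last) is hypothetical and would have to run BULK INSERT with FIRSTROW = first and LASTROW = last (for example into a scratch table) and return True on success. The sketch assumes that whether a range loads depends only on the rows inside it.

```python
def first_bad_row(total_rows, loads_ok):
    """Return the number of the first row that makes a load fail, or None.

    loads_ok(first, last) must return True when rows first..last load cleanly.
    """
    low, high = 1, total_rows
    if loads_ok(low, high):
        return None  # the whole file loads, so there is nothing to isolate
    # Invariant: rows before `low` are fine and the first bad row lies in low..high.
    while low < high:
        middle = (low + high) // 2
        if loads_ok(low, middle):
            low = middle + 1   # lower half is clean, so look in the upper half
        else:
            high = middle      # the first bad row is in the lower half
    return low


if __name__ == "__main__":
    # Simulated loader standing in for the real BULK INSERT call.
    bad_rows = {137, 400}
    fake_loader = lambda first, last: not any(first <= row <= last for row in bad_rows)
    print(first_bad_row(1000, fake_loader))
```

Each probe halves the range, so even a 5M-row file needs only about 23 load attempts after the initial one.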
To be honest, your input format looks so simple that it might be a good idea to write a small app in C#, Python, or whatever floats your boat to quality-check your data before attempting the import. You could discard invalid rows (or possibly fix them), write them to an exceptions file for hand processing, or simply stop the job, i.e., the file must be perfect or it is considered corrupted. Validating 5M rows this way will be quite fast: essentially as fast as you can read (and possibly write) the file.
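A minimal Python sketch of such a checker for a fixed-width file is shown below. It is illustrative only. The function name check_records is hypothetical, and expected_width has to be supplied by the caller as the sum of the field lengths in the format file. The sketch assumes CR LF row endings and reads the data in one piece for brevity, so a real tool for a very large file would stream it instead.

```python
def check_records(data: bytes, expected_width: int, terminator: bytes = b"\r\n"):
    """Return (row_number, problem) pairs for records that would upset a fixed-width load."""
    records = data.split(terminator)
    if records and records[-1] == b"":
        records.pop()  # the data ended with a terminator, so drop the empty tail
    problems = []
    for number, record in enumerate(records, start=1):
        nul_count = record.count(b"\x00")
        if nul_count:
            problems.append((number, f"contains {nul_count} NUL byte(s)"))
        elif len(record) != expected_width:
            problems.append((number, f"length {len(record)}, expected {expected_width}"))
    return problems


if __name__ == "__main__":
    sample = b"AAAA1111\r\nBBBB2222\x00\r\nCCCC33\r\n"
    for row_number, problem in check_records(sample, expected_width=8):
        print(row_number, problem)
```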
Thanks to all for the suggestions; I applied both ideas. I wrote a small .NET (C#) file processor utility, and it told me there were additional nulls (binary zeroes, \0) at the end of every line, which I was able to strip off using a simple C# program.
The error file indicated the issue was at the very end (that's what the error message said!).
The actual issue was that BULK INSERT could not recognize the EOF. I had to modify the format file like this to fix it (note the "\r\n" terminator on the last field), and then it worked:
11.0
29
1 SQLCHAR 0 8 "" 1 AccountId ""
2 SQLCHAR 0 10 "" 2 TranDate ""
3 SQLCHAR 0 4 "" 3 TransCode ""
4 SQLCHAR 0 2 "" 4 AdditionalCode ""
5 SQLCHAR 0 11 "" 5 CurrentPrincipal ""
6 SQLCHAR 0 11 "" 6 CurrentInterest ""
7 SQLCHAR 0 11 "" 7 LateInterest ""
...
27 SQLCHAR 0 8 "" 27 Operator ""
28 SQLCHAR 0 10 "" 28 UpdateDate ""
29 SQLCHAR 0 12 "\r\n" 29 TimeUpdated ""
I have a CSV file that I want to import into SQL Server 2008 using BULK INSERT. The file has 80 columns, and some values contain commas; for example, the state column holds values like NY,NJ,AZ,TX,AR,VA,MA, across a few million rows.
So I enclosed the state column in double quotes using a custom format in Excel, so that it would be treated as a single column and not split at the commas inside the value. But the import is still unsuccessful; it still splits at the commas. Can anyone suggest how to successfully import columns containing commas using BULK INSERT?
I am using this code
bulk insert test from 'C:\test.csv'
with (
fieldterminator=',', rowterminator='\n'
)
go
I saw a similar question asked here previously, but I don't know Visual Basic well enough to apply the code. Is there any other option to modify the file in Excel?
Is there any other option to modify the file in Excel?
It turns out there is, at least in Windows.
Go to Start Menu > Control Panel > Regional and Language Options.
In the Regional Options tab, click the Customize Button.
In the List Separator field, replace the , with a |. Click OK.
Saving a file as a .CSV through Excel will now create a pipe-separated value file. Be sure to undo this change to the Regional Options setting, as Excel uses the list separator for other things like functions.
Then you can do as datagod suggests and bulk upload the file using | as the column delimiter.
You should create a format file: http://msdn.microsoft.com/en-us/library/ms191516.aspx
If your data contains commas, I would choose a different delimiter. You can specify "|" as the delimiter in the format file.
Example:
10.0
4
1 SQLCHAR 0 100 "|" 1 Col1 SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 100 "|" 2 Col2 SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 100 "|" 3 Col3 SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 0 7000 "\r\n" 4 Col11 SQL_Latin1_General_CP1_CI_AS
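If changing the Excel settings is not an option, an existing comma-separated file can be rewritten with pipes by a short script. The following Python sketch uses the standard csv module, which understands quoted fields that contain commas. The function name csv_to_pipe is hypothetical. The sketch refuses to continue if a value already contains the new delimiter or a line break, since either would corrupt the output.

```python
import csv
import io


def csv_to_pipe(text: str, delimiter: str = "|") -> str:
    """Rewrite quoted, comma-separated text as unquoted, delimiter-separated text."""
    output_lines = []
    # newline="" leaves line endings alone so the csv module can handle them itself.
    for row in csv.reader(io.StringIO(text, newline="")):
        for field in row:
            if delimiter in field or "\n" in field or "\r" in field:
                raise ValueError(
                    f"field {field!r} would break a {delimiter!r}-delimited file"
                )
        output_lines.append(delimiter.join(row))
    return "".join(line + "\r\n" for line in output_lines)


if __name__ == "__main__":
    print(csv_to_pipe('id,state\r\n1,"NY,NJ,AZ"\r\n'), end="")
```

The output rows end in CR LF, which matches the "\r\n" terminator on the last field of the example format file.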
error list
Msg 4866, Level 16, State 7, Line 2
The bulk load failed. The column is too long in the data file for row 1, column 1.
Verify that the field terminator and row terminator are specified correctly.
Msg 7399, Level 16, State 1, Line 2
The OLE DB provider "BULK" for linked server "(null)" reported an error.
The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 2
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
fmt file
9.0
10
1 SQLCHAR 2 50 "," 2 EmployeeSSN SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 2 50 "," 3 DOB SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 2 50 "," 4 Gender SQL_Latin1_General_CP1_CI_AS
4 SQLCHAR 2 50 "," 5 Relcode SQL_Latin1_General_CP1_CI_AS
5 SQLCHAR 2 50 "," 6 EmployeeID SQL_Latin1_General_CP1_CI_AS
6 SQLCHAR 2 50 "," 7 AssessmentType SQL_Latin1_General_CP1_CI_AS
7 SQLCHAR 2 50 "," 8 MeasurementDate SQL_Latin1_General_CP1_CI_AS
8 SQLCHAR 2 50 "," 9 RecordCreationDate SQL_Latin1_General_CP1_CI_AS
9 SQLCHAR 2 50 "," 10 AttributeID SQL_Latin1_General_CP1_CI_AS
10 SQLCHAR 2 50 "/r/n" 11 AttributeValue SQL_Latin1_General_CP1_CI_AS
Bulk insert code
BULK insert *******_raw_data
from 'E:\*****_csv\BWC_To_*****_2.csv'
with (formatfile = 'c:\*******_raw_data-n.fmt');
first line from csv
NULL,07/14/1983,F,S,105***,HRA,09/28/2011,09/28/2011,19,1
I am trying to figure out where I am going wrong here. I have gotten other files to work but have been unsuccessful with this one. The file names are correct in my code; they are starred out because they are company names.
First error:
Msg 4866, Level 16, State 7, Line 2
The bulk load failed. The column is too long in the data file for row 1, column 1. Verify that the field terminator and row terminator are specified correctly.
This is either a problem with NULL or the row terminator.
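The last terminator for the row may not be "\r\n"; it could be "\n". It is best to confirm that with a hex editor. Also note that the format file above spells it "/r/n" with forward slashes, which is matched as four literal characters rather than as CR LF; the escape sequences need backslashes.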
Second and Third Error:
These all look like a NULL problem.
The correct way to handle nulls in BULK INSERT is to specify the KEEPNULLS option.
with (formatfile = 'c:\*******_raw_data-n.fmt',KEEPNULLS);
Create the csv files with an empty field for NULL values.
,07/14/1983,F,S,105***,HRA,09/28/2011,09/28/2011,19,1
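If the file cannot be regenerated at the source, the literal NULL tokens can be blanked with a small script before loading. The following Python sketch is a simple illustration with a hypothetical function name, blank_null_tokens. It assumes a plain comma-separated line with no quoted fields, as in the sample row, and replaces only fields that are exactly NULL.

```python
def blank_null_tokens(line, token="NULL", delimiter=","):
    """Replace fields that are exactly `token` with empty fields."""
    fields = line.rstrip("\r\n").split(delimiter)
    return delimiter.join("" if field == token else field for field in fields)


if __name__ == "__main__":
    row = "NULL,07/14/1983,F,S,105***,HRA,09/28/2011,09/28/2011,19,1"
    print(blank_null_tokens(row))
```

The function returns the line without its line ending, so the caller needs to write the row terminator back when saving the cleaned file.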