Bulk insert, SQL Server 2000, unix linebreaks

I am trying to insert a .csv file into a database with unix linebreaks. The command I am running is:
BULK INSERT table_name
FROM 'C:\file.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
If I convert the file into Windows format the load works, but I don't want to do this extra step if it can be avoided. Any ideas?

I felt compelled to contribute as I was having the same issue, and I need to read 2 UNIX files from SAP at least a couple of times a day. Therefore, instead of using unix2dos, I needed something with less manual intervention that could be automated.
As noted, CHAR(10) works within a dynamic SQL string. I didn't want to use a dynamic SQL string, so I tried ''''+CHAR(10)+'''' instead, but for some reason this didn't compile.
What did work very slick was: with (ROWTERMINATOR = '0x0a')
Problem solved with Hex!
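For reference, a minimal sketch of the original load rewritten with the hex terminator (table and file names are just the placeholders from the question; hex terminators are a newer feature, so they may not be available on SQL Server 2000 itself):
BULK INSERT table_name
FROM 'C:\file.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a'  -- 0x0a is the hex code for LF
)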

Thanks to all who have answered but I found my preferred solution.
When you tell SQL Server ROWTERMINATOR='\n' it interprets this as meaning the default row terminator under Windows which is actually "\r\n" (using C/C++ notation). If your row terminator is really just "\n" you will have to use the dynamic SQL shown below.
DECLARE @bulk_cmd varchar(1000)
SET @bulk_cmd = 'BULK INSERT table_name
FROM ''C:\file.csv''
WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = '''+CHAR(10)+''')'
EXEC (@bulk_cmd)
Why you can't say BULK INSERT ...(ROWTERMINATOR = CHAR(10)) is beyond me. It doesn't look like you can evaluate any expressions in the WITH section of the command.
What the above does is create a string of the command and execute that. Neatly sidestepping the need to create an additional file or go through extra steps.

I confirm that the syntax
ROWTERMINATOR = '''+CHAR(10)+'''
works when used with an EXEC command.
If you have multiple ROWTERMINATOR characters (e.g. a pipe and a unix linefeed) then the syntax for this is:
ROWTERMINATOR = '''+CHAR(124)+''+CHAR(10)+'''
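Put together as a full command (a hedged sketch along the lines of the dynamic SQL above, with placeholder table and file names), that looks something like:
DECLARE @bulk_cmd varchar(1000)
SET @bulk_cmd = 'BULK INSERT table_name
FROM ''C:\file.csv''
WITH (FIELDTERMINATOR = '','', ROWTERMINATOR = '''+CHAR(124)+CHAR(10)+''')'
EXEC (@bulk_cmd)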

It's a bit more complicated than that! When you tell SQL Server ROWTERMINATOR='\n' it interprets this as meaning the default row terminator under Windows which is actually "\r\n" (using C/C++ notation). If your row terminator is really just "\n" you will have to use the dynamic SQL shown above. I have just spent the best part of an hour figuring out why \n doesn't really mean \n when used with BULK INSERT!

One option would be to use bcp, and set up a control file with '\n' as the line break character.
Although you've indicated that you would prefer not to, another option would be to use unix2dos to pre-process the file into one with '\r\n' line breaks.
Finally, you can use the FORMATFILE option on BULK INSERT. This will use a bcp control file to specify the import format.
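If you go the format-file route, the statement itself stays short; a sketch with placeholder paths (the field and row terminators then live in the format file instead of the WITH clause):
BULK INSERT table_name
FROM 'C:\file.csv'
WITH (FORMATFILE = 'C:\file.fmt')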

Looks to me like there are two general avenues that can be taken: find an alternate way to read the CSV in the SQL script, or convert the CSV beforehand using any of the numerous ways to do that (bcp, unix2dos; if it is a one-time kind of thing, you can probably even use your code editor to fix the file for you).
But you will have to have an extra step!
If this SQL is launched from a program, you might want to convert the line endings in that program. In that case, if you decide to code the conversion yourself, here is what you need to watch out for:
1. The line ending might be \n
2. or \r\n
3. or even \r (Mac!)
4. good grief, it could be that some lines have \r\n and others \n, any combination is possible unless you control where the CSV came from
OK, OK. Possibility 4 is farfetched. It happens in email, but that is another story.
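If you would rather not touch the file at all, a different approach (not one the answers above take, just a hedged sketch) is to load with the bare LF terminator and strip any stray carriage returns afterwards in T-SQL; the table and column names here are placeholders:
-- after BULK INSERT ... WITH (ROWTERMINATOR = '0x0a')
UPDATE table_name
SET last_column = REPLACE(last_column, CHAR(13), '')
WHERE last_column LIKE '%' + CHAR(13)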

I would think "ROWTERMINATOR = '\n'" would work. I would suggest opening the file in a tool that shows "hidden characters" to make sure the line is being terminated like you think. I use notepad++ for things like this.

It comes down to this. Unix uses LF (ctrl-J), MS-DOS/Windows uses CR/LF (ctrl-M/Ctrl-J).
When you use '\n' on Unix, it gets translated to a LF character. On MS-DOS/Windows it gets translated to CR/LF. When your import runs on the Unix-formatted file, it sees only a LF. Hence, it's often easier to run the file through unix2dos first. But as you said in your original question, you don't want to do this (I'll assume there is a good reason why you can't).
Why can't you do:
(ROWTERMINATOR = CHAR(10))
Probably because when the SQL code is being parsed, it is not replacing the CHAR(10) with the LF character (because it's already encased in single quotes). Or perhaps it's being interpreted as:
(ROWTERMINATOR =
)
What happens when you echo out the contents of @bulk_cmd?
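For anyone wondering how to echo it: the simplest way is to PRINT the variable before (or instead of) executing it, e.g.:
PRINT @bulk_cmd  -- shows the exact BULK INSERT statement that would run
-- EXEC (@bulk_cmd)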

Related

Bulk insert not retaining special chars of txt file SQL Server 2008 [duplicate]

I am doing a BULK INSERT into SQL Server and it is not inserting UTF-8 characters into the database properly. The data file contains these characters, but the database rows contain garbage characters after the bulk insert executes.
My first suspect was the last line of the format file:
10.0
3
1 SQLCHAR 0 0 "{|}" 1 INSTANCEID ""
2 SQLCHAR 0 0 "{|}" 2 PROPERTYID ""
3 SQLCHAR 0 0 "[|]" 3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"
But, after reading this official page it seems to me that this is actually a bug in reading the data file by the insert operation in SQL Server version 2008. We are using version 2008 R2.
What is the solution to this problem or at least a workaround?
I came here before looking for a solution for bulk inserting special characters.
Didn't like the workaround with UTF-16 (that would double the size of csv file).
I found out that you definitely CAN and it's very easy, you don't need a format file.
This answer is for other people who are looking for the same thing, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-English-speaking people. The solution is:
just add CODEPAGE='65001' inside the WITH clause of the BULK INSERT (65001 = code page number for UTF-8).
It might not work for all Unicode characters, as suggested by Michael O, but it at least works perfectly for Latin Extended, Greek and Cyrillic, and probably many others too.
Note: the MSDN documentation says UTF-8 is not supported; don't believe it. For me this works perfectly in SQL Server 2008 (I didn't try other versions, however).
e.g.:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
CODEPAGE = '65001',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
If all your special characters are in 160-255 (iso-8859-1 or windows-1252), you could also use:
BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'
WITH
(
CODEPAGE = 'ACP',
FIELDTERMINATOR = '|',
ROWTERMINATOR ='\n'
);
You can't. You should first use an N type data field, convert your file to UTF-16 and then import it. The database does not support UTF-8.
In excel save file as CSV(Comma delimited)
Open saved CSV file in notepad++
Encoding -> Convert to UCS-2 Big Endian
Save
BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
FIRSTROW = 2,
FIELDTERMINATOR = ';', --CSV field delimiter
ROWTERMINATOR = '\n', --Row terminator; moves to the next row
TABLOCK
)
Done.
Microsoft just added UTF-8 support to SQL Server 2014 SP2:
https://support.microsoft.com/en-us/kb/3136780
You can re-encode the data file with UTF-16. That's what I did anyway.
Use these options -
DATAFILETYPE='char' and CODEPAGE = '1252'
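In context that is just two more entries in the WITH clause; a hedged sketch with placeholder names:
BULK INSERT table_name
FROM 'C:\file.csv'
WITH
(
DATAFILETYPE = 'char',
CODEPAGE = '1252',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)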
Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT (which is what the original question used), and OPENROWSET.
Shouldn't you be using SQLNCHAR instead of SQLCHAR for the unicode data?
Thought I would add my thoughts to this. We were trying to load data into SQL Server using bcp and had lots of trouble.
bcp does not, in most versions, support any type of UTF-8 files. We discovered that UTF-16 would work, but it is more complex than is shown in these posts.
Using Java we wrote the file using this code:
PrintStream fileStream = new PrintStream(NEW_TABLE_DATA_FOLDER + fileName, "x-UTF-16LE-BOM");
This gave us the correct data to insert.
We tried using just UTF16 and kept getting EOF errors. This is because we were missing the BOM part of the file. From Wikipedia:
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
If these bytes are not present, the file won't work. So we have the file, but there is one more secret that needs to be addressed. When constructing your command line you must include -w to tell bcp what type of data it is. When using just English data, you can use -c (character). So that will look something like this:
bcp dbo.blah in C:\Users\blah\Desktop\events\blah.txt -S tcp:databaseurl,someport -d thedatabase -U username -P password -w
When this is all done you get some sweet looking data!
Just to share: I had a similar problem. I had Portuguese accents in a file and bcp imported garbage characters (e.g. À became ┴).
I tried -C with almost all code pages without success. After hours I found a hint on the bcp MS help page:
Format file code pages take priority over the -C attribute
This means that in the format file I had to use "" (as in LastName below); once I changed the code page there, the -C 65001 attribute imported the UTF-8 file without any problem.
13.0
4
1 SQLCHAR 0 7 "," 1 PersonID ""
2 SQLCHAR 0 25 "," 2 FirstName SQL_Latin1_General_CP1_CI_AS
3 SQLCHAR 0 30 "," 3 LastName ""
4 SQLCHAR 0 11 "\r\n" 4 BirthDate ""
I managed to do this using SSIS and an ADO.NET destination instead of OLE DB.
My exported data are in TSV format from a DB which has Latin-1 encoding.
This is easy to check:
SELECT DATABASEPROPERTYEX('DB', 'Collation') SQLCollation;
Extract file is in UTF-8 format.
BULK INSERT doesn't work with UTF-8, so I converted UTF-8 to ISO-8859-1 (aka Latin-1) with a simple Clojure script:
(spit ".\\dump\\file1.txt"
(slurp ".\\dump\\file1_utf8.txt" :encoding "UTF-8")
:encoding "ISO-8859-1")
To execute it, correct the paths and run:
java.exe -cp clojure-1.6.0.jar clojure.main utf8_to_Latin1.clj
I have tested the bulk insertion with UTF-8 format. It works fine in SQL Server 2012.
string bulkInsertQuery = @"DECLARE @BulkInsertQuery NVARCHAR(max) = 'bulk insert [dbo].[temp_Lz_Post_Obj_Lvl_0]
FROM ''C:\\Users\\suryan\\Desktop\\SIFT JOB\\New folder\\POSTdata_OBJ5.dat''
WITH ( FIELDTERMINATOR = '''+ CHAR(28) + ''', ROWTERMINATOR = ''' +CHAR(10) + ''')'
EXEC SP_EXECUTESQL @BulkInsertQuery";
I was using a *.DAT file with FS (the file separator character, CHAR(28)) as the column separator.

Informix 11.5 SQL Select Carriage Return and Line Feed

Informix 11.5
I am trying to search for carriage returns and line feeds that may exist in a VARCHAR field. First, I need a SELECT statement to show that they exist. Second, I need to REPLACE them with a space or other character.
I've tried all kinds of variations:
CHR(10) + CHR(13)
CHR(10) || CHR(13)
CHAR(13) + CHAR(10)
CHAR(13) || CHAR(10)
SELECT CHR(10) from systables;
Everything gives an error: Routine (chr) can not be resolved.
I've been searching all over and just can't find anything that works, and I'm sure this is crazy stupid easy.
Get the ASCII package from the IIUG
The CHR() function was added to IDS 11.70; it isn't in IDS 11.50.
The good news is you can add the function because IDS is an extensible server. The better news for you is that you can obtain the relevant code from the IIUG web site in the Software Archive under the Miscellaneous section as ascii.
That should allow you to do what you need. (Note: I wrote the code way back when — before there was support built into any of the servers.)
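Once CHR() is available (from the IIUG package, or natively on 11.70+), the two steps the question asks for can be sketched roughly like this, assuming a table t with a VARCHAR column c; treat it as a sketch to adapt rather than tested 11.50 syntax:
-- 1. Find rows that contain a carriage return or line feed
SELECT * FROM t
WHERE c LIKE '%' || CHR(13) || '%'
OR c LIKE '%' || CHR(10) || '%';
-- 2. Show the value with both replaced by a space
SELECT REPLACE(REPLACE(c, CHR(13), ' '), CHR(10), ' ') FROM t;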
Windows makes things more complicated
I was uploading the ascii.unl file and I get an error that the number of columns do not match on line 13. Have you seen this before? I'm on Windows 2008. The errors are:
846: Number of values in load file is not equal to number of columns.
847: Error in load file line 13.
I hadn't seen it before, but I've not tried the file on Windows and … well, let's say life gets trickier on Windows than it is on Unix (and this bit isn't all that simple on Unix).
First of all, the data file needs to have CRLF line endings instead of the NL-only line endings that are standard on Unix. (Note that NL, newline, is another name for LF, line feed — aka '\n'.) For most lines in the unload file, that isn't a problem.
The two entries for which it might be (is) a problem are for CR and LF — entries 13 and 10 respectively. In theory, if the entry for line 10 contains (in C string notation) "10|\\\n\r\n" (that is, 10, pipe, backslash, newline, CRLF), all should be OK; the absence of an error message for line 10 suggests that it is OK.
Similarly, the entry for line 13 is "13|\r\r\n", which apparently causes grief. The simplest trial fix is to add a backslash here too: "13|\\\r\r\n". The backslash says "the next character doesn't have a special meaning". If that doesn't work, we'll probably have to try hex-escape notation: "13|\\0d\r\n" (and use dbaccess -X to enable the hex escape notation).
With luck, one of those two (or both) will work. If neither works, come back and we'll try to think of something else.
As per my above comment:
I was uploading the ascii.unl file and I get an error that the number of columns do not match on line 13. Have you seen this before? I'm on Windows 2008. 846: Number of values in load file is not equal to number of columns. 847: Error in load file line 13.
Here is what I see in the ascii.unl file.
If I put this into MS Word and turn on Show Formatting/Paragraph marks, it shows this:

Access VBA, importing csv file via TransferText with commata as decimal separator and semicolon as delimiter

I'm having some problems importing double numbers from csv files. The files have a semicolon delimiter and comma as decimal separator.
I can't set up import specs since the order of the fields in the csv often changes and it would be a disaster if the data went into the wrong field.
Also the csv files will have to be written to a temporary table first. Don't hate me for it, but since I have to process data and set some information fields for later data processing, this is by far the easiest, fastest and safest way to achieve it.
Here is the problem itself:
When using TransferText it will import, but of course interpret the comma as delimiter. Not good ...
When replacing comma by full stop and semicolon by comma it works. But it will ignore full stops, so 1.2 becomes 12, 1.333 becomes 1333. The field will be of type double.
I've tested numerous things. Besides TransferText I've tried:
DoCmd.RunSQL ("INSERT INTO Tabelle1 SELECT cdbl(a1) as aa FROM [TEXT;FMT=Delimited;HDR=YES;CharacterSet=437;DATABASE=C:\SPOT].[test.csv]")
But nothing seems to work, even when I create a new table with field type DOUBLE before using TransferText ... decimals are still ignored.
So, I would be happy if you could tell me either how to use TransferText with or without replacing semicolon and comma in a first step or how to use the INSERT INTO stuff.
Thank you very much!
Ok, I think I got it!
The problem was the regional settings and the fact that my Access uses a comma as the decimal separator. I was also not able to create an import spec via a manual import, since that requires defining up front which fields are to be imported.
What I did now was this:
Open the table MSysIMEXSpecs that contains the import specs via a query:
select * from MSysIMEXSpecs
Then add a new row and set SpecName = "Whatever", DecimalPoint = "," and FieldSeparator = ";", plus whatever other settings have to be made.
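A hedged sketch of that row as a query instead of editing the table by hand (only the columns named above are shown; MSysIMEXSpecs has more columns, which may also need values):
INSERT INTO MSysIMEXSpecs (SpecName, DecimalPoint, FieldSeparator)
VALUES ('Whatever', ',', ';');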
Since there is this workaround, isn't there a way to do this easier?

SQL Server to CSV character encoding

I have a SQL Server database extract I'm doing.
At the beginning of my program, I have:
ini_set('mssql.charset','cp1250');
My database calls do not do anything special.
I'm only calling the following methods:
mssql_connect, mssql_select_db, mssql_query, mssql_fetch_object,
mssql_next_result and mssql_close.
When I print the output of my export on screen, all the characters look fine. When I export with fputcsv() into a CSV file, I get a ton of <92> and <93> characters (this is the way they look when I use a terminal to read them). When I open the file using Excel, they look like ì, í and î.
This is causing major problems. Do you have any ideas?
Try to convert the encoding to UTF-8:
iconv('cp1250', 'utf-8', $text);
Also print this:
var_dump(iconv_get_encoding('all'));
Thanks but it turns out that the problem isn't with the encoding so much as it is with the fact that my fputcsv() call actually was not specifying a delimiter. I chose "\t" for the delim and everything worked perfectly.

Replace character in SQL results

This is from an Oracle SQL query. It has these weird skinny rectangle shapes in the database in places where apostrophes should be. (I wish we could paste screenshots in here.)
It looks like this when I copy and paste the results.
spouse�s
Is there a way to write a SQL SELECT statement that searches for this character in the field and replaces it with an apostrophe in the results?
Edit: I need to change only the results in a SELECT statement for reporting purposes, I can't change the Database.
I ran this
select dump('�') from dual;
which returned
Typ=96 Len=3: 239,191,189
This seems to work so far
select translate('What is your spouse�s first name?', '�', '''') from dual;
but this doesn't work
select translate(Fieldname, '�', '''') from TableName
Select FN from TN
What is your spouse�s first name?
SELECT DUMP(FN, 1016) from TN
Typ=1 Len=33 CharacterSet=US7ASCII: 57,68,61,74,20,69,73,20,79,6f,75,72,20,73,70,6f,75,73,65,92,73,20,66,69,72,73,74,20,6e,61,6d,65,3f
EDIT:
So I have established that it is the backquote character. I can't get the DB updated, so I'm trying this code
SELECT REGEX_REPLACE(FN,"\0092","\0027") FROM TN
and I"m getting ORA-00904:"Regex_Replace":invalid identifier
This seems to be a problem with your charset configuration. Check your NLS_LANG and other NLS_xxx environment/registry values. You have to check the Oracle server, your client, and the client that inserted that data.
Try to DUMP the value. You can do it with a select as simple as:
SELECT DUMP(the_column)
FROM xxx
WHERE xxx
UPDATE: I think that before trying to replace, you should look for the root of the problem. If this happens because of a charset problem, you can end up with big problems from bad data.
UPDATE 2: Answering the comments. The problem may not be on the database server side; it may be on the client side. The problem (if this is the problem) can be a translation in the server-to/from-client communication, that is, a bad server-client configuration/coordination. For instance, if the server has the UTF8 charset defined and your client uses US7ASCII, then all accented characters will appear as ?.
Another possibility: if the server has the UTF8 charset defined and your client is also UTF8, but the application is not able to show UTF8 chars, then the problem is on the application side.
UPDATE 3: On your examples:
select translate('What ...: it works because the � is exactly the same char, since you pasted it on both sides.
select translate(Fieldname ...: it does not work because the � is not what is stored in the database; it's the char that the client receives, perhaps because some translation occurs between the data in the table and what is shown to you.
Next step: look at the DUMP syntax and try to extract the codes for the mysterious char (from the table, not by pasting �!).
I would say there's a good chance the character is a single-tick "smart quote" (I hate the name). The smart quotes are characters 91-94 (using a Windows encoding), or Unicode U+2018, U+2019, U+201C, and U+201D.
I'm going to propose a front-end application-based, client-side approach to the problem:
I suspect that this problem has more to do with a mismatch between the font you are trying to display the word spouse�s with, and the character �. That icon appears when you are trying to display a character in a Unicode font that doesn't have the glyph for the character's code.
The Oracle database will dutifully return whatever characters were INSERTed into its columns. It's more up to you, and your application, to interpret what it will look like given the font you are trying to display your data with, so I suggest investigating what this mysterious � character is that is replacing your apostrophes. Start by using FerranB's recommended DUMP().
Try running the following query to get the character code:
SELECT DUMP(<column with weird character>, 1016)
FROM <your table>
WHERE <column with weird character> like '%spouse%';
If that doesn't grab your actual text from the database, you'll need to modify the WHERE clause to actually grab the offending column.
Once you've found the code for the character, you could replace it using the regex_replace() built-in function: determine the raw hex code of the character and then supply the ASCII / C0 Controls and Basic Latin character 0x0027 ('), using code similar to this:
UPDATE <table>
set <column with offending character>
= REGEX_REPLACE(<column with offending character>,
"<character code of �>",
"'")
WHERE regex_like(<column with offending character>,"<character code of �>");
If you aren't familiar with Unicode and different ways of character encoding, I recommend reading Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I wasn't until I read that article.
EDIT: If you're seeing 0x92, there's likely a charset mismatch here:
0x92 in CP-1252 (the default Windows code page) is a backquote character, which looks kind of like an apostrophe. This code isn't a valid ASCII character, and it isn't valid in ISO-8859-1 either. So probably either the database is in CP-1252 encoding (I don't find that likely), or a database connection which spoke CP-1252 inserted it, or somehow the apostrophe got converted to 0x92. The database is returning values that are valid in CP-1252 (or some other charset where 0x92 is valid), but your db client connection isn't expecting CP-1252. Hence the weird question mark.
And FerranB is likely right. I would talk with your DBA or some other admin about this to get the issue straightened out. If you can't, I would try either doing the update above (seems like you can't), or doing this:
INSERT (<normal table columns>,...,<column with offending character>) INTO <table>
SELECT <all normal columns>, REGEX_REPLACE(<column with offending character>,
"\0092",
"\0027") -- for ASCII/ISO-8859-1 apostrophe
FROM <table>
WHERE regex_like(<column with offending character>,"\0092");
DELETE FROM <table> WHERE regex_like(<column with offending character>,"\0092");
Before you do this you need to understand what actually happened. It looks to me like someone inserted non-ASCII strings into the database, for example Unicode or UTF-8. Before you fix this, be very sure that this is actually a bug. The apostrophe comes in many forms, not just "'".
TRANSLATE() is a useful function for replacing or eliminating known single character codes.
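For example, a one-line sketch along the same lines as above (again assuming the offending character is CHR(146)):
SELECT TRANSLATE(FN, CHR(146), CHR(39)) FROM TN;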