TSQL Bulk Insert with Special Character as Delimiter

I need to bulk insert into an instance of SQL Server 2016 (13.0.4224.16, where the FORMAT and FIELDQUOTE options are not available) using a special character as the field delimiter, while also preserving any Unicode characters that may be in the data sets. I am trying to use ¿ (hex 0xBF), or any other character that I know isn't in my data, as the field delimiter.
I have a UTF-8 encoded test.txt file containing some test data (headers excluded in dataset):
foo¿bar¿foobar¿
\n < last line
and a TSQL statement to insert:
BULK INSERT [dbo].[testTable]
FROM 'C:\Datasource\test.txt'
WITH (KEEPNULLS,
MAXERRORS=0,
FIELDTERMINATOR='0xBF');
into this table:
create table [dbo].[testTable](
col1 nvarchar(50),
col2 nvarchar(50),
col3 nvarchar(50),
col4 nvarchar(50)
)
When I run a select on my testTable, it returns:
col1 col2 col3 col4
foo┬ bar┬ foobar┬ NULL
Why are these ┬ characters showing up? My guess is that the delimiter is being encoded incorrectly and is ending up in the data. If I change my delimiter to | I can get the data in without issue, but | exists in my data sets and would break the inserts further down the line. I tried adding CODEPAGE=65001, which imports Unicode characters without issue when using a pipe delimiter, but using the special character delimiter results in this error:
Bulk load: An unexpected end of file was encountered in the data file.
Edit:
I've changed the import txt file to UTF-16 encoding but still encounter the same issues.
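A likely explanation, for what it's worth: in a UTF-8 file the ¿ character is stored as the two bytes 0xC2 0xBF, so FIELDTERMINATOR='0xBF' matches only the second byte and the leading 0xC2 byte stays in the data (0xC2 renders as ┬ under the OEM code page). Below is a minimal sketch of one workaround, assuming the file is saved as UTF-16 LE and read as widechar data so the literal ¿ can act as the terminator; a UTF-16 file alone is not enough, since without DATAFILETYPE = 'widechar' it is still read byte by byte.
-- Sketch only: assumes test.txt has been re-saved as UTF-16 LE.
BULK INSERT [dbo].[testTable]
FROM 'C:\Datasource\test.txt'
WITH (KEEPNULLS,
MAXERRORS=0,
DATAFILETYPE='widechar',
FIELDTERMINATOR='¿',
ROWTERMINATOR='\n');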

Related

SQL Bulk insert ignores first data row

I am trying to import a pipe-delimited file into a temporary table using bulk insert (UTF-8 with a Unix-style row terminator), but it keeps ignoring the first data row (the one after the header) and I don't know why.
Adding | to the header row does not help either...
File contents:
SummaryFile_20191017140001.dat|XXXXXXXXXX|FIL-COUNTRY|128
File1_20191011164611.dat|2|4432|2|Imported||
File2_20191011164611.dat|3|4433|1|Imported||
File3_20191011164611.dat|4|4433|2|Imported||
File4_20191011164611.dat|5|4434|1|Imported|INV_ERROR|
File5_20191011164611.dat|6|4434|2|Imported||
File6_20191011164611.dat|7|4434|3|Imported||
The bulk insert throws no error, but it keeps ignoring the first data line (File1_...)
SQL below:
IF OBJECT_ID('tempdb..#mycsv') IS NOT NULL
DROP TABLE #mycsv
create table #mycsv
(
tlr_file_name varchar(150) null,
tlr_record_id int null,
tlr_pre_invoice_number varchar(50) null,
tlr_pre_invoice_line_number varchar(50) null,
tlr_status varchar (30) null,
tlr_error_code varchar(30) null,
tlr_error_message varchar (500) null)
bulk insert #mycsv
from 'D:\TestData\Test.dat'
with (
rowterminator = '0x0A',
fieldTerminator = '|',
firstrow = 2,
ERRORFILE = 'D:\TestData\Import.log')
select * from #mycsv
It's really bugging me, since I don't know what I am missing.
If I specify FIRSTROW = 1, the script throws:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 2 (tlr_record_id).
Thanks in advance!
"UTF-8 with unix style row terminator" I assume you're using a version of SQL Server that doesn't support UTF-8. From BULK INSERT (Transact-SQL)
Important: Versions prior to SQL Server 2016 (13.x) do not support code page 65001 (UTF-8 encoding).
If you are using 2016+, then specify the code page for UTF-8:
BULK INSERT #mycsv
FROM 'D:\TestData\Test.dat'
WITH (ROWTERMINATOR = '0x0A',
FIELDTERMINATOR = '|',
FIRSTROW = 1,
CODEPAGE = '65001',
ERRORFILE = 'D:\TestData\Import.log');
If you aren't using SQL Server 2016+, then you cannot use BULK INSERT to import a UTF-8 file; you will have to use a different code page or use a different tool.
Note, also, that the above document states the below:
The FIRSTROW attribute is not intended to skip column headers. Skipping headers is not supported by the BULK INSERT statement. When skipping rows, the SQL Server Database Engine looks only at the field terminators, and does not validate the data in the fields of skipped rows.
If you are skipping rows, you still need to ensure those rows are valid, but FIRSTROW is not meant for skipping headers. This means you should be using FIRSTROW = 1 and fixing your header row, as #sarlacii points out.
Of course, that does not fix the code page problem if you are using an older version of SQL Server; and my point stands that you'll have to use a different technology on 2014 and prior.
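One possible fallback on those older versions, sketched here on the assumption that the file can be re-saved as UTF-16 LE: widechar input predates 2016 and does not depend on code page 65001.
-- Sketch only: assumes Test.dat has been converted to UTF-16 LE.
BULK INSERT #mycsv
FROM 'D:\TestData\Test.dat'
WITH (ROWTERMINATOR = '\n',   -- '\n' rather than '0x0A', since rows now end in two-byte characters
FIELDTERMINATOR = '|',
FIRSTROW = 1,
DATAFILETYPE = 'widechar',
ERRORFILE = 'D:\TestData\Import.log');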
To import rows effectively into a SQL database, it is important to make the header formatting match the data rows. Add the missing delimiters, like so, to the header and try the import again:
SummaryFile_20191017140001.dat|XXXXXXXXXX|FIL-COUNTRY|128|||
The number of fields in the header must match the number of fields in the data rows; otherwise the row is ignored and the first satisfactory "data" row will be treated as the header.
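With the padded header in place, a combined sketch (keeping the original FIRSTROW = 2 so the now well-formed header row is skipped, and borrowing the CODEPAGE setting from the answer above, so it assumes SQL Server 2016+):
BULK INSERT #mycsv
FROM 'D:\TestData\Test.dat'
WITH (ROWTERMINATOR = '0x0A',
FIELDTERMINATOR = '|',
FIRSTROW = 2,
CODEPAGE = '65001',
ERRORFILE = 'D:\TestData\Import.log');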

External table: how to delete the newline char from the end of each row

I have a problem loading rows from a file. When I use an external table like this:
create table table_name
(
id VARCHAR2(60)
)
organization external
(
type ORACLE_LOADER
default directory DIRECTORY
access parameters
(
RECORDS DELIMITED BY NEWLINE CHARACTERSET EE8MSWIN1250 nobadfile nodiscardfile
FIELDS TERMINATED BY ";" OPTIONALLY ENCLOSED BY '"' LDRTRIM
REJECT ROWS WITH ALL NULL FIELDS
(
ID VARCHAR2(60)
)
)
location ('tmp.txt')
)
reject limit 0;
All of my rows have a newline byte at the end. The only thing that works is to update all rows after loading the data from the file:
update table_name
set id = translate (id, 'x'||CHR(10)||CHR(13), 'x');
How can I make this happen automatically?
Check exactly what newline characters are in your file and then define the record delimiter explicitly.
Example
records delimited by '\r\n'
The probable cause of your problem is that the file's newline characters do not match your operating system's convention, which is something you can address as well.
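Applied to the table from the question, that would look roughly like this (a sketch only, assuming the file really ends each line with CR+LF; the field list is omitted so the fields simply default to the column names):
create table table_name
(
id VARCHAR2(60)
)
organization external
(
type ORACLE_LOADER
default directory DIRECTORY
access parameters
(
RECORDS DELIMITED BY '\r\n' CHARACTERSET EE8MSWIN1250 nobadfile nodiscardfile
FIELDS TERMINATED BY ";" OPTIONALLY ENCLOSED BY '"' LDRTRIM
REJECT ROWS WITH ALL NULL FIELDS
)
location ('tmp.txt')
)
reject limit 0;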
A file may have a line delimiter of either \n or \r\n.
You can check which by opening the file in Notepad++ (or any other editor that supports it) and clicking "Show All Characters".
Based on how the data looks in the file, you can create the external table with
RECORDS DELIMITED BY '\r\n' or
RECORDS DELIMITED BY '\n', etc.

Hive - Loading delimited data with a special character causes columns to shift position

Let's say I want to create a simple table with 4 columns in Hive and load some pipe-delimited data.
CREATE table TEST_1 (
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
;
Raw Data:
123|456|Dasani Bottled \| Water|789
What I expect for the COL3 value is "Dasani Bottled \| Water". It contains the special character sequence "\|" in the middle, which causes the Hive table columns to shift position starting at COL3, because I created the table using "|" as the delimiter. The special character \| contains a pipe | character within it.
Is there any way to resolve the issue so Hive can load data correctly?
Thanks for any help.
You can add the ESCAPED BY clause to your table creation, like this, to allow character escaping:
CREATE table TEST_1 (
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|' ESCAPED BY '\'
;
From the Hive documentation
Enable escaping for the delimiter characters by using the 'ESCAPED BY' clause (such as ESCAPED BY '\'). Escaping is needed if you want to work with data that can contain these delimiter characters.
A custom NULL format can also be specified using the 'NULL DEFINED AS' clause (default is '\N').
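As a quick check, loading the sample row from a local file (the path below is hypothetical) into the escaped table should keep the pipe inside COL3 instead of shifting the columns; note that the SerDe consumes the backslash itself, so the stored value is expected to be "Dasani Bottled | Water":
-- /tmp/test_1.txt is a hypothetical file containing the sample row:
-- 123|456|Dasani Bottled \| Water|789
LOAD DATA LOCAL INPATH '/tmp/test_1.txt' INTO TABLE TEST_1;
SELECT COL1, COL2, COL3, COL4 FROM TEST_1;
-- Expected: 123, 456, Dasani Bottled | Water, 789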

Update Oracle table and replace one character with another

I have a table where one of the fields contains values like ¤1¤. It is in a Unicode database and the fields are nvarchar2.
I want to replace the ¤ with a ?, so I write this:
update table1 set col1 = REPLACE(col1,'¤','?');
commit;
The col1 is not updated.
What am I doing wrong?
select ascii('¤') from dual;
Even though this returns 164 on a non-Unicode database, it does not on a Unicode database. Hence the replace, as asked for, will not work.
select chr(164 USING NCHAR_CS) from dual;
This returns '¤'.
Hence the following replace should work:
select replace(col1, chr(164 USING NCHAR_CS), chr(63)) from table1;
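If that select returns the rows as expected, the corresponding update should be along these lines (chr(63) is simply '?'):
update table1 set col1 = replace(col1, chr(164 USING NCHAR_CS), chr(63));
commit;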
Sometimes a simple cut and paste does not get the character into your query properly. Instead of typing the character in your query, get its ASCII value (you can use the ascii or dump function in Oracle to get the ASCII value of a character) and then use that value in your replace, as below.
Your character appears to be ASCII value 164.
-- try as below
update table1 set col1 = replace(col1,chr(164),'?');
commit;

Postgres Copy - Importing an integer with a comma

I am importing 50 CSV data files into Postgres. I have an integer field where sometimes the value is a plain number and sometimes it is quoted and uses a comma as the thousands separator.
For instance, I need to import both 4 and "4,000".
I'm trying:
COPY race_blocks FROM '/census/race-data/al.csv' DELIMITER ',' CSV HEADER;
And get the error:
ERROR: invalid input syntax for integer: "1,133"
How can I do this?
Let's assume you have only one column in your data.
First, create a temporary table with a varchar column:
CREATE TEMP TABLE race_blocks_tmp (your_integer_field VARCHAR);
Copy your data from the file:
COPY race_blocks_tmp FROM '/census/race-data/al.csv' DELIMITER ',' CSV HEADER;
Remove the ',' from the varchar field, convert the data to numeric, and insert it into your table:
INSERT INTO race_blocks SELECT regexp_replace(your_integer_field, ',', '')::numeric FROM race_blocks_tmp;
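If the real table has more than one column, the same staging pattern applies; below is a sketch with hypothetical column names (block_id and population are assumptions, not taken from the question):
-- Hypothetical multi-column version of the same approach.
CREATE TEMP TABLE race_blocks_staging (block_id varchar, population varchar);
COPY race_blocks_staging FROM '/census/race-data/al.csv' DELIMITER ',' CSV HEADER;
-- Strip the thousands separator before casting.
INSERT INTO race_blocks (block_id, population)
SELECT block_id, replace(population, ',', '')::integer
FROM race_blocks_staging;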