invalid byte sequence for encoding “UTF8” - sql

I am trying to load a 3 GB (24 million rows) CSV file into a Greenplum database using gpload, but I keep getting the error below.
Error -
invalid byte sequence for encoding "UTF8": 0x8d
I have tried the solution provided by Mike, but my client_encoding and file encoding are already the same: both are UNICODE.
Database -
show client_encoding;
"UNICODE"
File -
file my_file_name.csv
my_file_name.csv: UTF-8 Unicode (with BOM) text
I have browsed through Greenplum's documentation as well, which says the encoding of external file and database should match. It is matching in my case yet somehow it is failing.
I have uploaded similar smaller files as well (same UTF-8 Unicode (with BOM) text)
Any help is appreciated !

Posted in another thread: use the iconv command to strip these characters out of your file. Greenplum is initialized with a character set, UTF-8 by default, and requires that all characters belong to that character set. You can also choose to log these errors with the LOG ERRORS clause of the EXTERNAL TABLE. This traps the bad rows and lets the load continue up to the SEGMENT REJECT LIMIT that you specify when you create the table.
iconv -f utf-8 -t utf-8 -c file.txt
writes a cleaned copy of your UTF-8 file to standard output, skipping all the invalid characters.
-f is the source format
-t the target format
-c skips any invalid sequence
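Before dropping bytes with iconv -c, it can help to see where the invalid sequences are. The following is a minimal Python sketch, not part of the original answer: the function name find_invalid_utf8_lines and the limit parameter are made up for illustration. It reads lines as raw bytes and reports the first invalid sequence on each bad line.

```python
def find_invalid_utf8_lines(lines, limit=10):
    """Return up to `limit` tuples (line_number, offset_in_line, bad_bytes).

    `lines` is any iterable of bytes objects, for example a file opened
    in binary mode. Only the first invalid sequence per line is reported.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start/exc.end delimit the offending bytes within this line
            problems.append((lineno, exc.start, raw[exc.start:exc.end]))
            if len(problems) >= limit:
                break
    return problems
```

For a file on disk, something like `with open("my_file_name.csv", "rb") as f: print(find_invalid_utf8_lines(f))` would list the first ten offending lines, so you can decide whether to delete or repair them.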

Related

Fix Unicode Decode Error Without Specifying Encoding='UTF-8'

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that specify header rows and what not, idk.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust the encoding to that, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.
Fixed. Had nothing to do with unicode shenanigans, my script was writing a parquet file when my Cloud Formation Template was expecting a csv file. Thanks for the help.

removing unconventional field separators (^@^@^@) in a text file [duplicate]

I have a text file containing unwanted null characters (ASCII NUL, \0). When I try to view it in vi I see ^@ symbols interleaved in normal text. How can I:
Identify which lines in the file contain null characters? I have tried grepping for \0 and \x0, but this did not work.
Remove the null characters? Running strings on the file cleaned it up, but I'm just wondering if this is the best way?
I’d use tr:
tr < file-with-nulls -d '\000' > file-without-nulls
If you are wondering if input redirection in the middle of the command arguments works, it does. Most shells will recognize and deal with I/O redirection (<, >, …) anywhere in the command line, actually.
Use the following sed command for removing the null characters in a file.
sed -i 's/\x0//g' null.txt
This solution edits the file in place, which is important if the file is still being used. Passing -i'ext' creates a backup of the original file with the 'ext' suffix added.
A large number of unwanted NUL characters, say one every other byte, indicates that the file is encoded in UTF-16 and that you should use iconv to convert it to UTF-8.
I discovered the following, which prints out which lines, if any, have null characters:
perl -ne '/\000/ and print;' file-with-nulls
Also, an octal dump can tell you if there are nulls:
od file-with-nulls | grep ' 000'
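The same checks can be sketched in Python. This is only an illustration: the function names and the 0.3 threshold are assumptions, not something from the answers above. The first function lists the line numbers that contain NUL bytes; the second is a rough heuristic for the UTF-16 case, where a large share of the bytes are NUL.

```python
def lines_with_nul(lines):
    """Return the 1-based numbers of the lines that contain a NUL byte.

    `lines` is any iterable of bytes, e.g. a file opened in binary mode.
    """
    return [lineno for lineno, raw in enumerate(lines, start=1) if b"\x00" in raw]


def looks_like_utf16(data, threshold=0.3):
    """Heuristic: True if at least `threshold` of the bytes are NUL.

    ASCII-range text stored as UTF-16 has a NUL in every other byte,
    so the ratio is close to 0.5. The threshold is an arbitrary choice.
    """
    if not data:
        return False
    return data.count(b"\x00") / len(data) >= threshold
```

If looks_like_utf16 returns True for a sample of the file, converting with iconv or recode is probably a better fix than deleting the NUL bytes.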
If the lines in the file end with \r\n\000 then what works is to delete the \n\000 then replace the \r with \n.
tr -d '\n\000' <infile | tr '\r' '\n' >outfile
Here is an example of how to remove NUL characters using ex (in place):
ex -s +"%s/\%x00//g" -cwq nulls.txt
and for multiple files:
ex -s +'bufdo!%s/\%x00//g' -cxa *.txt
To recurse into subdirectories, you may use the glob **/*.txt (if your shell supports it).
This is useful for scripting, since sed's -i option is a non-standard extension that behaves differently in GNU and BSD sed.
See also: How to check if the file is a binary file and read all the files which are not?
I used:
recode UTF-16..UTF-8 <filename>
to get rid of zeroes in file.
I faced the same error with:
import codecs as cd
f=cd.open(filePath,'r','ISO-8859-1')
I solved the problem by changing the encoding to utf-16
f=cd.open(filePath,'r','utf-16')
Remove a trailing null character at the end of a PDF file using PHP. This is independent of the OS.
This script uses PHP to remove a trailing NULL value at the end of a binary file, solving a crashing issue that was triggered by the NULL value. You can edit this script to remove all NULL characters, but seeing it done once will help you understand how this works.
Backstory
We were receiving PDFs from a 3rd party that we needed to upload to our system using a PDF library. In the files being sent to us, a null value was sometimes appended to the PDF file. When our system processed these files, the ones with the trailing NULL value caused the system to crash.
Originally we were using sed, but sed behaves differently on Macs and Linux machines. We needed a platform-independent method to remove the trailing null value. PHP was the best option. Also, it was a PHP application so it made sense :)
This script performs the following operation:
Take the binary file, convert it to hex (binary data is awkward to split on line endings directly), explode the string using CRLF (0d0a) as the delimiter, pop the last member of the array if its value is null (00), implode the array using CRLF, then process the file.
// In this case we are getting the file as a string from another application.
// We use this line to get a sample bad file.
$fd = file_get_contents($filename);
// Convert the string into hex, trimming leading and trailing whitespace
$bin2hex = trim(bin2hex($fd));
// Create an array using CRLF (hex 0d0a) as the delimiter
$bin2hex_ex = explode('0d0a', $bin2hex);
// Look at the last element. If it is equal to 00 (a lone NUL byte), pop it off.
$end = end($bin2hex_ex);
if ($end === '00') {
    // Note: the CRLF that preceded the NUL is dropped along with it
    array_pop($bin2hex_ex);
}
// Implode the array using CRLF as the glue
$bin2hex = implode('0d0a', $bin2hex_ex);
// The new string no longer has the null character at the EOF
$fd = hex2bin($bin2hex);
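For comparison, here is a short Python sketch of the same idea working on the bytes directly, without the hex round trip. It is an illustration rather than a port of the author's script, and the function name strip_trailing_nul is made up. One deliberate difference: it keeps the final CRLF and removes only the NUL byte, whereas the PHP version drops the CRLF together with the popped element.

```python
def strip_trailing_nul(data: bytes) -> bytes:
    """Remove a single NUL byte that directly follows the final CRLF."""
    if data.endswith(b"\r\n\x00"):
        return data[:-1]  # drop only the NUL, keep the CRLF
    return data
```

Files that do not end in CRLF followed by NUL are returned unchanged, so it should be safe to run on every incoming file.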

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command :
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives an incorrect syntax error.
After two hours of investigation, it turned out that the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online tool, the character was identified as U+FEFF, as shown in the following image.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info :
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the encoding output of command copy.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Doing so also changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is an improvement.
The encoding can be changed using Notepad++, as shown in the following picture.
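If the input files have to stay UTF-8, an alternative to the copy command is to merge them with a small script that drops each file's BOM. The sketch below is an assumption-laden illustration, not the fix described above: merge_sql_files is a made-up name, and it relies on Python's "utf-8-sig" codec, which decodes UTF-8 and removes a leading U+FEFF if one is present.

```python
from pathlib import Path


def merge_sql_files(sources, destination):
    """Concatenate UTF-8 text files, dropping any leading BOM from each one."""
    parts = []
    for src in sources:
        # "utf-8-sig" strips a leading BOM (U+FEFF) while decoding
        text = Path(src).read_text(encoding="utf-8-sig")
        # make sure the next script starts on its own line
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    # plain "utf-8" writes no BOM at all
    Path(destination).write_text("".join(parts), encoding="utf-8")
```

The merged file then contains no U+FEFF at the start or between scripts, regardless of whether the individual inputs were saved with a BOM.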

SQL Server adding extra special characters in query result

I am trying to extract some records to a file using the BCP command in SQL Server. However, when the file is generated, there are extra spaces between the results for each column.
To try I just wrote basic SQL Query as simple as this
select 'ABC', 40, 'TEST','NOTWORKING'
When we copy the output of above query and paste it in Notepad, the output comes as
ABC 40 TEST NOTWORKING
Notice the space between each value? The file that system is generating using BCP command also has same space coming in the output file which is incorrect. What I want to see in the output file is
ABC40TESTNOTWORKING
What must be causing this issue? I am simply amazed to see such weird issue and hoping that it can be fixed by some changes or setting. Please help.
Sample BCP command
EXEC xp_cmdshell 'bcp "select ''ABC'', 40, ''TEST'',''NOTWORKING''" queryout "E:\Testfile.txt" -c -T -S""'
Output in the File - Testfile.txt
ABC 40 TEST NOTWORKING
There are probably tabs between the values. If you want a single value, use concat():
select CONCAT('ABC', 40, 'TEST', 'NOTWORKING')
There's no issue. The command line has no field terminator argument, so the default is used, a tab. That's described in the docs :
-t field_term
Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
If you specify the field terminator in hexadecimal notation in a bcp.exe command, the value will be truncated at 0x00. For example, if you specify 0x410041, 0x41 will be used.
If field_term begins with a hyphen (-) or a forward slash (/), do not include a space between -t and the field_term value.
The link points to an entire article that explains how to use terminators, for each of the bulk operations.
As for the Copy/Paste operation, it has nothing to do with SQL Server. SQL Server has no UI, it's a service. I suspect what was pasted in Notepad was copied from an SSMS grid.
SSMS is a client tool just like any other. When you copy data from it into the clipboard, it decides what to put there and what format to use. That format can be plain text, using spaces and tabs for layout, RTF, HTML etc.
Plain text with tabs as field separators is probably the best choice for any tool, as it preserves the visual layout up to a point and uses only a single character as a separator. A fixed-length layout using spaces could also be used but that would add characters that may well be part of a field.
Encodings and codepages
-c exports the data using the user's default codepage. This means that text stored in varchar fields using a different codepage (collation) may get mangled. Non-visible Unicode characters will also get mangled and appear as something else, or as ?.
-c
Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
It's better to export the file as UTF-16 using -w.
-w
Performs the bulk copy operation using Unicode characters. This option does not prompt for each field; it uses nchar as the storage type, no prefixes, \t (tab character) as the field separator, and \n (newline character) as the row terminator. -w is not compatible with -c.
The codepage can be specified using the -C parameter. -C 1252, for example, will export the data using Windows' Latin1 codepage. 1253 will export it using the Greek codepage.
-C { ACP | OEM | RAW | code_page }
Specifies the code page of the data in the data file. code_page is relevant only if the data contains char, varchar, or text columns with character values greater than 127 or less than 32.
SQL Server 2016 and later can also export text as UTF8 with -C 65001. Earlier versions don't support UTF8.
Versions prior to version 13 (SQL Server 2016 (13.x)) do not support code page 65001 (UTF-8 encoding). Versions beginning with 13 can import UTF-8 encoding to earlier versions of SQL Server.
All this is described in bcp's online documentation.
This subject is so important for any database that it has an entire section in the docs, which describes data formats and considerations, using format files to specify different settings per column, and guidelines to ensure compatibility with other applications.

How to import from a mixed-encoding file to a PostgreSQL table

I have a 30 GB text file. The encoding of the file is UTF8, but it also contains some Windows-1252 characters. So, when I try to import, it gives the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x9b
How can I fix this?
The file is already in UTF8 format: when I run the 'file' command on it, it says the encoding is UTF8. But it also contains some byte sequences that are not UTF8. For example, when I run the \copy command, after a while it gives the above-mentioned error for this row:
0B012234 Basic study of <img src="/fulltext-image.asp?format=htmlnonpaginated&src=323K744431152658_html\233_2 basic study of img src fulltext image asp format htmlnonpaginated src 323k744431152658_html 233_2 1975 Semigroup Forum semigroup forum 04861B53 19555
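The issue is caused by the backslash (\). In COPY's default text format, a backslash followed by up to three octal digits is read as the byte with that value, so \233 in this row becomes byte 0x9b, which is not valid UTF-8.
Use CSV format, which does not treat backslash as a special character, e.g.: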
\copy t from myfile.txt with csv quote E'\x1' delimiter E'\x2'
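To confirm that a row fails because of backslash escapes rather than truly bad bytes, you can scan it for backslash-octal sequences and see which byte each one would produce. This is a rough Python sketch under stated assumptions: the name octal_escapes is made up, only octal escapes are handled (not \x.. or the letter escapes), and a doubled backslash before digits would be reported as a false positive.

```python
import re

# a backslash followed by one to three octal digits
OCTAL_ESCAPE = re.compile(rb"\\([0-7]{1,3})")


def octal_escapes(line: bytes):
    """Return (escape_text, numeric_value) for each backslash-octal escape in `line`."""
    return [(m.group(0), int(m.group(1), 8)) for m in OCTAL_ESCAPE.finditer(line)]
```

On the sample row, the fragment `_html\233_2` yields the value 0x9b (octal 233 is decimal 155), which matches the byte named in the error message.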