\x96 is not a valid UTF-8 string in BigQuery - google-bigquery

We're seeing BigQuery produce invalid UTF-8 errors when the "–" (en dash) character is used in pipe-delimited CSV files. The weird thing is, these characters are in files that are over a year old, have not changed, and BigQuery had been reading the files just fine for many months until a few days ago. Here's an example of one of the errors:
Christus Trinity Clinic \x96 Rheumatology is not a valid UTF-8 string
The way the string looks in the original file is like this:
Christus Trinity Clinic – Rheumatology
Does anyone know the fix for this, or whether BigQuery has changed its functionality in a way that might cause this issue? I know that I can just upload a corrected file, but in this scenario the files are not supposed to change, for auditing purposes.
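For context, 0x96 is the Windows-1252 byte for the en dash, and it is not a valid UTF-8 sequence on its own; a minimal Python sketch (the string is just an illustration) shows the difference:
# 0x96 is the en dash in Windows-1252, but an invalid byte in UTF-8.
raw = b"Christus Trinity Clinic \x96 Rheumatology"
print(raw.decode("cp1252"))  # Christus Trinity Clinic – Rheumatology
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0x96 ... invalid start byte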

I had the same issue starting Aug 14.
I am using the bq command-line tool to load the CSV into BigQuery.
Adding the encoding option while loading the CSV fixed it for me.
Encoding:
--encoding ISO-8859-1
Command line:
bq --location=US load --skip_leading_rows=1 --encoding ISO-8859-1 --replace --source_format=CSV gcs.dim_employee
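If you are loading through the Python client instead of the bq tool, the equivalent should be the encoding property on the load job config; a rough sketch (the bucket, dataset, and table names are placeholders, not the real ones):
# Rough sketch using the google-cloud-bigquery Python client.
# The GCS URI and table name below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    encoding="ISO-8859-1",               # tell BigQuery the bytes are not UTF-8
    write_disposition="WRITE_TRUNCATE",  # same effect as --replace
)
client.load_table_from_uri(
    "gs://my-bucket/dim_employee.csv",
    "my_dataset.dim_employee",
    job_config=job_config,
).result()  # wait for the load job to finish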

We saw the same thing suddenly start happening yesterday.
For me, the solution was to add an encoding type to the load config.
(I'm using the PHP client, but your client probably also has this option.)
$loadConfig->encoding('ISO-8859-1');

Related

invalid byte sequence for encoding “UTF8”

I am trying to load a 3 GB (24 million row) CSV file into a Greenplum database using the gpload utility, but I keep getting the error below.
Error:
invalid byte sequence for encoding "UTF8": 0x8d
I have tried the solution provided by Mike, but in my case the client_encoding and the file encoding are already the same; both are UNICODE.
Database -
show client_encoding;
"UNICODE"
File -
file my_file_name.csv
my_file_name.csv: UTF-8 Unicode (with BOM) text
I have browsed through Greenplum's documentation as well, which says the encoding of the external file and the database should match. They match in my case, yet it still fails.
I have uploaded similar, smaller files as well (same UTF-8 Unicode (with BOM) text).
Any help is appreciated!
Posted in another thread: use the iconv command to strip these characters out of your file. Greenplum is initialized with a character set, UTF-8 by default, and requires that all characters be in the designated character set. You can also choose to log these errors with the LOG ERRORS clause of the EXTERNAL TABLE. This will trap the bad data and allow you to continue up to the LIMIT that you specify when creating the table.
iconv -f utf-8 -t utf-8 -c file.txt
will clean up your UTF-8 file, skipping all the invalid characters.
-f is the source encoding
-t is the target encoding
-c skips any invalid sequences
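If you prefer to do the cleanup in Python rather than with iconv, a rough equivalent (file names are placeholders) is to decode with errors="ignore" and re-encode:
# Rough Python equivalent of `iconv -f utf-8 -t utf-8 -c`:
# drop any byte sequences that are not valid UTF-8.
with open("file.txt", "rb") as src:
    data = src.read()

cleaned = data.decode("utf-8", errors="ignore").encode("utf-8")

with open("file_clean.txt", "wb") as dst:
    dst.write(cleaned)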

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but the result somehow produces a syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online tool, the character was identified as U+FEFF.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command; it looks identical, but a U+FEFF character sits invisibly in front of the second PRINT:
PRINT 'Script1'
PRINT 'Script2'
Additional info:
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command; I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is an improvement.
The encoding can be changed using Notepad++.
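If re-encoding the inputs is not an option, another approach that should work is stripping the BOMs while concatenating; here is a rough Python sketch of that idea (paths are placeholders for the project layout):
# Sketch: concatenate the migration scripts while stripping the UTF-8 BOM
# (the U+FEFF bytes EF BB BF) from the start of each input file.
import glob

BOM = b"\xef\xbb\xbf"

with open("AllFilesMergedTogether.sql", "wb") as out:
    for path in sorted(glob.glob("Migrations/*.sql")):
        with open(path, "rb") as f:
            data = f.read()
        if data.startswith(BOM):
            data = data[len(BOM):]
        out.write(data)
        out.write(b"\r\n")  # keep a line break between scripts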

BigQuery does not accept emoji

I have emojis in this format: \U0001f924. Why does BigQuery (Google Data Studio) not display them, even though I have seen examples of this format working for other people?
SAMPLE: a second emoji in this format: \u2614
Ref: Emoji crashed when uploading to Big Query
Based on this article it should work: Google \Uhhhhhhhh Format
UPDATE 1.0:
If I use double quotes ("") then an emoji in the format \U2714 displays as an emoji, but \U0001f680 still shows as the text U0001f680.
If I use single quotes ('') then both \U2714 and \U0001f680 display only the values U2714 and U0001f680.
The emoji in the question works for me with SELECT "\U0001f680".
I stored the results in a table so you can find it:
https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.one_emoji?tab=preview
If you ask BigQuery to export this table to a GCS file and bring that file onto your computer, it will continue to work.
You can download this JSON file and load it back into BigQuery:
https://drive.google.com/file/d/1hXASPN0J4bP0PVk20x7x6HfkAFDAD4vq/view?usp=sharing
Let's load it into BigQuery.
Everything works fine.
So the problem is in the files you are loading into BigQuery, which are not encoding the emojis appropriately.
What I don't know is how you are generating these files, or how to fix that process. But I have shown here that for files that correctly encode emojis, you can load them into BigQuery and the emojis will be preserved.
🙃
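If it helps, here is a rough sketch of generating a newline-delimited JSON file that encodes the emoji correctly and loading it with the Python client (the table name is a placeholder):
# Sketch: write real UTF-8 emoji (not the literal text "\U0001f680") to a
# newline-delimited JSON file, then load it. Table name is a placeholder.
import json
from google.cloud import bigquery

rows = [{"emoji": "\U0001f680"}, {"emoji": "\u2614"}]  # Python resolves these escapes

with open("one_emoji.json", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
with open("one_emoji.json", "rb") as f:
    client.load_table_from_file(f, "my_dataset.one_emoji", job_config=job_config).result()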

Apache Pig filtering out carriage returns

I'm fairly new to Apache Pig and am trying to work with some fixed-width text. In Pig, I'm reading every line in as a chararray (I know I can use FixedWidthLoader, but I am not using it in this instance). One of the fields I'm working with is an email field, and one entry has a carriage return that generates extra lines of output in the finished data dump (I get 12 rows instead of the 9 I'm expecting). I know which entry has the error, but I'm unable to filter it out using Pig.
So far I've tried using Pig's REPLACE to replace \r or \uFFFD, and I even tried a Python UDF, which works on the command line but not when I run it as a UDF through Pig. Does anyone have any suggestions? Please let me know if more details are required.
My original edit with a solution turned out to work only part of the time. In the end I had to clean the data before running it through Pig. On the raw data file I ran perl -i -pe 's/\r//g' filename to remove the rogue carriage return.
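If perl isn't handy, a rough Python equivalent of that one-liner (the file name is a placeholder) is:
# Rough equivalent of `perl -i -pe 's/\r//g' filename`:
# strip every carriage return from the raw file in place.
path = "filename"

with open(path, "rb") as f:
    data = f.read()

with open(path, "wb") as f:
    f.write(data.replace(b"\r", b""))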

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug: I cannot upload a CSV file that starts (first line, first column) with an integer.
Here's my schema: COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my CSV file content:
100,4,XXX
100,4,XXX
If I put the STRING column as the first column, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is OK too.
But with the CSV and schema above, BigQuery always complains: Line:1 / Field:1, Value cannot be converted to expected type.
Does anyone know what the problem is?
Thank you in advance,
David
I could not reproduce this problem: I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file format is corrupted somehow? If there are extra bytes at the beginning of the file, those would be ignored in a header row but might result in this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
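For example, a quick (hypothetical) Python check for a BOM or other stray bytes at the start of the file:
# Sketch: dump the first few bytes of the CSV to spot a BOM or other junk
# before the first integer. The file name is a placeholder.
with open("data.csv", "rb") as f:
    head = f.read(8)

print(head)  # e.g. b'\xef\xbb\xbf100,4' if a UTF-8 BOM is present
if head.startswith(b"\xef\xbb\xbf"):
    print("File starts with a UTF-8 BOM; the first field will not parse as an integer.")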
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?