BigQuery does not accept emoji - google-bigquery

I have emojis in this format - \U0001f924. Why does BigQuery (Google Data Studio) not display them, even though I have seen examples of this format working for other people?
Sample: a second emoji in this format - \u2614
Ref: Emoji crashed when uploading to Big Query
Based on this article it should work: Google \Uhhhhhhhh Format
UPDATE 1.0:
If I use "" then an emoji in this format, \U2714, displays as an emoji, but this one, \U0001f680, still shows as the text U0001f680.
If I use '' then both \U2714 and \U0001f680 display only the values U2714 and U0001f680.

The emoji in the question works for me with SELECT "\U0001f680":
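For reference, the same query can be run through the google-cloud-bigquery Python client (a minimal sketch; it assumes default application credentials and a default project are configured):

from google.cloud import bigquery

client = bigquery.Client()
# In Standard SQL string literals, \u takes 4 hex digits and \U takes 8
rows = client.query(r'SELECT "\U0001f680" AS emoji').result()
for row in rows:
    print(row.emoji)  # prints the rocket emoji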
I stored the results in a table so you can find it:
https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.one_emoji?tab=preview
If you ask BigQuery to export this table to a GCS file and bring that file onto your computer, the emoji will continue to work:
You can download this json file and load it back into BigQuery:
https://drive.google.com/file/d/1hXASPN0J4bP0PVk20x7x6HfkAFDAD4vq/view?usp=sharing
Let's load it into BigQuery:
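One way to do that load (a sketch with the Python client; the bucket path and table name below are placeholders, not the real ones):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
)
# Load the newline-delimited JSON file from GCS into a table
load_job = client.load_table_from_uri(
    'gs://your-bucket/one_emoji.json',  # placeholder URI
    'your_dataset.one_emoji',           # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete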
Everything works fine:
So the problem is in the files you are loading into BigQuery - they are not encoding emojis appropriately.
What I don't know is how you are generating these files, or how to fix that process. But here I have shown that files which correctly encode emojis can be loaded into BigQuery with the emojis preserved.
🙃

Related

Camelot in Python does not behave as expected

I have two PDF documents, both in the same layout but with different information. The problem is:
I can read one perfectly, but for the other one the data is unrecognizable.
This is an example that I can read perfectly, download here:
import camelot

# Read the tables with the "stream" flavor, plot the detected text, and print the parsing report
from_pdf = camelot.read_pdf('2019_05_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe as expected:
This is an example where, after I read it, the information is unrecognizable, download here:
from_pdf = camelot.read_pdf('2020_04_2.pdf', flavor='stream', strict=False)
df_pdf = from_pdf[0].df
camelot.plot(from_pdf[0], kind='text').show()
print(from_pdf[0].parsing_report)
This is the dataframe with unrecognizable information:
I don't understand what I have done wrong and why the same code doesn't work for both files. I need some help, thanks.
The problem: malformed PDF
Simply, the problem is that your second PDF is malformed / corrupted. It doesn't contain correct font information, so it is impossible to extract text from your PDF as is. It is a known and difficult problem (see this question).
You can check this by trying to open the PDF with Google Docs.
Google Docs tries to extract the text, and this is the result:
Possible solutions
If you want to extract the text, you can print the document to an image-based PDF and perform an OCR text extraction.
However, Camelot does not currently support image-based PDFs, so it is not possible to extract the table.
If you have no way to recover a well-formed PDF, you could try this strategy (sketched in code below):
print PDF to an image-based PDF
add a good text layer to your image-based PDF (using OCRmyPDF)
try using Camelot to extract tables
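A rough sketch of that pipeline in Python, assuming OCRmyPDF and its Python API are available (the input filename is the one from the question; the OCR'd output name is made up):

import ocrmypdf
import camelot

# force_ocr=True rasterizes each page and OCRs it, which effectively covers
# steps 1 and 2: the broken text layer is replaced with a fresh OCR layer
ocrmypdf.ocr('2020_04_2.pdf', '2020_04_2_ocr.pdf', force_ocr=True)

# Step 3: try Camelot again on the OCR'd copy
from_pdf = camelot.read_pdf('2020_04_2_ocr.pdf', flavor='stream')
print(from_pdf[0].parsing_report)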

Custom Speech: "normalized text is empty"

I uploaded a WAV file and a text file to the Custom Speech portal. I got the following error: “Error: normalized text is empty.”
The text file is UTF-8 with BOM, and is similar in format to a file that did work.
How can I troubleshoot this?
There can be several reasons for a normalized text to be empty, e.g. if a sentence mixes Latin and non-Latin words (depending on the locale). Also, words that are repeated multiple times in a row may cause this. Can you share which locale you're using to import the data? If you could share the text, we can find the reason. Otherwise you could try to reduce the input text (no need to cut the audio for this) to find out what causes the normalization to discard the sentence.
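For example, a simple way to bisect the text (a hypothetical snippet; it assumes the transcript is a plain text file with one utterance per line, which may not match your exact format):

# Split the transcript into two halves and try importing each one separately;
# repeat on the failing half until the offending sentence is isolated.
# 'utf-8-sig' strips the BOM mentioned in the question when reading.
with open('transcript.txt', encoding='utf-8-sig') as f:
    lines = f.readlines()
half = len(lines) // 2
with open('transcript_part1.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines[:half])
with open('transcript_part2.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines[half:])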

\x96 is not a valid UTF-8 string in BigQuery

We're seeing BigQuery produce invalid UTF-8 errors when the " - " (dash) character is used in pipe-delimited CSV files. The weird thing is, these characters are in files that are over a year old, have not changed, and BigQuery had been reading the files just fine for many months until a few days ago. Here's an example of one of the errors.
Christus Trinity Clinic \x96 Rheumatology is not a valid UTF-8 string
The way the string looks in the original file is like this:
Christus Trinity Clinic – Rheumatology
Does anyone know the fix for this, or whether BigQuery has changed its functionality in a way that might cause this issue? I know that I can just upload a corrected file, but in this scenario the files are not supposed to change, for auditing purposes.
I had the same issue starting from Aug 14.
I am using gsutil to load the CSV into BigQuery.
I used the encoding option while loading the CSV and it works for me.
Encoding:
--encoding ISO-8859-1
Command line:
bq --location=US load --skip_leading_rows=1 --encoding ISO-8859-1 --replace --source_format=CSV gcs.dim_employee
We saw the same thing suddenly start happening yesterday.
For me, the solution was to add an encoding type to the load config.
(I'm using the PHP client, but your client probably also has this option)
$loadConfig->encoding('ISO-8859-1');
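If you're loading with the Python client instead, LoadJobConfig exposes the same option (a sketch; the table and file names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    encoding='ISO-8859-1',  # tell BigQuery the file is Latin-1, not UTF-8
    skip_leading_rows=1,
    field_delimiter='|',    # the question mentions pipe-delimited files
)
with open('dim_employee.csv', 'rb') as f:
    load_job = client.load_table_from_file(f, 'my_dataset.dim_employee', job_config=job_config)
load_job.result()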

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug: I cannot upload a CSV file that starts (first line, first column) with an integer.
Here's my schema: COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my CSV file content:
100,4,XXX
100,4,XXX
If I put the STRING column as the first column, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is OK too.
But with the CSV and schema above, BigQuery always complains: Line:1 / Field:1, Value cannot be converted to expected type.
Does anyone know what the problem is?
Thank you in advance,
David
I could not reproduce this problem--I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file format is corrupted somehow? If there are extra bytes at the beginning of the file, those would be ignored in a header row but might result in this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
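For example, a quick way to look at the raw start of the file (a small snippet; the filename is a placeholder) and spot a UTF-8 BOM or other stray bytes:

# A UTF-8 BOM shows up as b'\xef\xbb\xbf' at the very start of the file
with open('data.csv', 'rb') as f:
    print(f.read(16))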
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?

How to force scheme.ini to be used for MS Text Driver?

I am creating a huge CSV import that uses the MS Text Driver to read the CSV file.
I am using ColdFusion to create the scheme.ini in each folder where the file has been uploaded.
Here is a sample one I am using:
[some_filename.csv]
Format=CSVDelimited
ColNameHeader=True
MaxScanRows=0
Col1=user_id Text width 80
Col2=first_name Text width 20
Col3=last_name Text width 30
Col4=rights Text width 10
Col5=assign_training Text width 1
CharacterSet=ANSI
Then in my ColdFusion code, I am doing two cfdumps:
<cfdump var="#GetMetaData( csvfile )#" />
<cfdump var="#csvfile#">
The metadata shows that the query has not picked up the correct data types for reading the CSV file.
And the dump of the query that reads the file shows that it is missing values: because of Excel we cannot force users to use double quotes, and when fields have mixed data types our process breaks.
How can I either change the data type inside the query (i.e. make it use scheme.ini) or update the metadata to the correct data type?
I am using a view on information_schema in SQL Server 2005 to get the correct data types, column names, and max lengths...
Unless I have some kind of syntax error, I can't see why it's not grabbing the data as the correct data type.
Any suggestions?
Funnily enough, I had the filename spelled wrong: instead of schema.ini I had named it scheme.ini.
I hate it when you make little mistakes like this...
Thank You