Cannot upload CSV that starts with an integer - google-bigquery

I'm stuck on what seems like a weird BigQuery bug: I cannot upload a CSV file that starts (first line, first column) with an integer.
Here's my schema: COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my CSV file content:
100,4,XXX
100,4,XXX
If I put the STRING column first, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is OK too.
But with the CSV and schema above, BigQuery always complains: Line:1 / Field:1, Value cannot be converted to expected type.
Does anyone know what the problem is?
Thank you in advance,
David

I could not reproduce this problem; I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file format is corrupted somehow? If there are extra bytes at the beginning of the file, they would be ignored in a header row but might result in this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?
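If it helps with the binary check, here is a minimal Python sketch (the file name is just a placeholder, not from the original post) that dumps the leading bytes of the CSV; a UTF-8 BOM, for example, shows up as b'\xef\xbb\xbf':
import sys

# Show the raw leading bytes of the CSV; any stray bytes before the
# first integer (such as a BOM) will be visible here.
with open("data.csv", "rb") as f:
    print(f.read(16))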

Related

Fix Unicode Decode Error Without Specifying Encoding='UTF-8'

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that specify header rows and whatnot; I don't know.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust the encoding to that, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.
Fixed. It had nothing to do with Unicode shenanigans: my script was writing a Parquet file when my CloudFormation template was expecting a CSV file. Thanks for the help.
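For anyone who does hit a genuine encoding mismatch like the one in the traceback, a rough sketch along these lines (file name assumed) can locate the bytes the ASCII codec cannot decode:
# Print the offset and value of every byte outside the ASCII range
with open("input.csv", "rb") as f:
    data = f.read()
for pos, byte in enumerate(data):
    if byte > 127:
        print(pos, hex(byte))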

\x96 is not a valid UTF-8 string in BigQuery

We're seeing BigQuery produce invalid UTF-8 errors when the "–" (dash) character is used in pipe-delimited CSV files. The weird thing is, these characters are in files that are over a year old, have not changed, and BigQuery had been reading the files just fine for many months until a few days ago. Here's an example of one of the errors.
Christus Trinity Clinic \x96 Rheumatology is not a valid UTF-8 string
The way the string looks in the original file is like this:
Christus Trinity Clinic – Rheumatology
Does anyone know the fix for this, or whether BigQuery has changed its functionality in a way that might cause this issue? I know that I can just upload a corrected file, but in this scenario the files are not supposed to change, for auditing purposes.
I have had the same issue since Aug 14.
I am using the bq command-line tool to load the CSV into BigQuery.
I used the encoding option while loading the CSV and it is working for me.
Encoding:
--encoding ISO-8859-1
Command line:
bq --location=US load --skip_leading_rows=1 --encoding ISO-8859-1 --replace --source_format=CSV gcs.dim_employee
We saw the same thing suddenly start happening yesterday.
For me, the solution was to add an encoding type to the load config.
(I'm using the PHP client, but your client probably also has this option)
$loadConfig->encoding('ISO-8859-1');
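For reference, a rough equivalent with the official Python client; the dataset, table, and GCS URI below are placeholders, not values from the original post:
from google.cloud import bigquery

client = bigquery.Client()
# The key part is the encoding setting, mirroring the PHP example above
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    encoding="ISO-8859-1",
    write_disposition="WRITE_TRUNCATE",
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/dim_employee.csv",  # placeholder source URI
    "my_dataset.dim_employee",          # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish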

How can I resolve a Unicode error from read_csv?

This is my first time working on a python project outside of school, so bear with me.
When I run the code below, I get the error
"(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated\uXXXXXXXX escape"
and the IDLE editor highlights the '(' before the argument of pd.read_csv.
I googled the error but got a lot of stuff that went way over my head.
The CSV file in question is an Excel file I saved as CSV. Should I save it some other way?
import pandas as pd
field = pd.read_csv("C:\Users\Glen\Documents\Feild.csv")
I just want to convert my Excel data into a data frame, and I don't understand why it was so easy in class and is now so difficult on my home PC.
The problem is with the path. There are two common ways to write the path when reading a CSV file:
1- Use double backslashes:
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv")
2- Use forward slashes:
pd.read_csv("C:/Users/Glen/Documents/Feild.csv")
If these do not work, try one of these:
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv", encoding='utf-8')
OR
pd.read_csv("C:/Users/Glen/Documents/Feild.csv", encoding='utf-8')

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives a syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that remains invisible even in Notepad++.
Using an online tool, the character was identified as U+FEFF.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info:
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is a good thing.
Encoding can be changed using Notepad++.
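If re-encoding the inputs to ANSI is not an option (for example because the scripts contain accented characters), a small sketch along these lines can strip the BOMs while combining the files; the paths mirror the layout in the question and are illustrative only:
import glob

# Read each script with utf-8-sig, which silently drops a leading BOM,
# then write one combined file as plain UTF-8 without a BOM.
with open("ContinuousDeployment/AllFilesMergedTogether.sql", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("Migrations/*.sql")):
        with open(path, "r", encoding="utf-8-sig") as src:
            out.write(src.read())
            out.write("\n")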

How to import a flat file source to a database using SQL

I currently want to import my data from a flat file into the database.
The flat file is a txt file. In that txt file, I save a list of URLs. Example:
http://www.mimi.com/Hotels-g303188-Rurrenabaque-Hotels.html
I'm using the SQL Server Import and Export Wizard to do it, but at execution time I get an error saying:
Error 0xc02020a1:
Data Flow Task 1: Data conversion failed. The data conversion for column
"Column 0" returned status value 4 and status text "Text was truncated or one
or more characters had no match in the target code page.".
Can anyone help?
You get this error because the text is too long for the column you've chosen to put it in.
Text was truncated or
You might want to check the size of the database column vis-a-vis your input data. Is the longest URL shorter than the column width?
one or more characters had no match in the target code page.".
Check if your input file has any special characters. An easy way to check this would be to save your file in ANSI (Notepad > Save As > Encoding = ANSI). Note - you'd still have to select the right code page so that the import interprets your input text correctly.
Here's a very nice link that has some background on what code pages are - http://www.joelonsoftware.com/articles/Unicode.html
Note that you can also change the target column data type (to text stream, for example) in the Data source -> Advanced section.
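If it helps, here is a rough Python sketch (the file name is assumed) that checks both things the error can point at: the longest line versus the target column width, and any characters outside plain ASCII:
# Report the longest URL and any non-ASCII characters in the flat file
longest = 0
with open("urls.txt", "r", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        url = line.rstrip("\n")
        longest = max(longest, len(url))
        bad = [ch for ch in url if ord(ch) > 127]
        if bad:
            print(f"line {lineno}: non-ASCII characters {bad}")
print("longest URL:", longest, "characters")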