Bigquery - loading CSV with "#N/A" in one column - google-bigquery

How can I load a CSV successfully using bq load (gsutil) when a FLOAT column has a few values given as #N/A?
I get the error below when I run the following bq load command:
bq --location=australia-southeast1 load --max_bad_records=20 --allow_jagged_rows --skip_leading_rows=1 --source_format=CSV DATASET1.T1.FILE1 gs://load_files/Test/FILE!.csv
ERROR - Could not parse #N/A as double for field blah blah
Modifying the CSV file is not an option.

You can try the --null_marker flag (cf. here), specifying "#N/A" as the string that represents a null value.
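As an illustration (not BigQuery's actual implementation), here is a minimal Python sketch of what treating "#N/A" as a null marker means for a FLOAT column; the sample data and column names are invented:

```python
import csv
import io

# Invented sample matching the question's shape: a FLOAT column containing "#N/A".
raw = "id,price\n1,10.5\n2,#N/A\n3,7.25\n"

NULL_MARKER = "#N/A"  # what --null_marker="#N/A" would tell bq load to treat as NULL

def parse_row(row, null_marker=NULL_MARKER):
    # Mimic the null-marker behavior: any field equal to the marker becomes
    # NULL (None) instead of being parsed as a double and failing.
    return [None if field == null_marker else field for field in row]

reader = csv.reader(io.StringIO(raw))
header = next(reader)
rows = [parse_row(r) for r in reader]

# Only non-null values in the FLOAT column are actually parsed as doubles.
prices = [float(r[1]) if r[1] is not None else None for r in rows]
print(prices)  # [10.5, None, 7.25]
```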


BigQuery load with Quoted new line

I know we can use the quoted newlines option in BQ, but my data does not load even with this option, and I have no idea why the load is breaking.
CSV line:
"49602"|"AFFF7240654213"|"FROM MASTER: TESTing part. Goodluck: example.com/wp.png (tarifado)
"|"MO"|||0.000|0.000|50.7000|"NET"
It's giving me this error:
Error while reading data, error message: Error detected while parsing row starting at position: 436388389. Error: Missing close double quote (") character.
But with allow-quoted-newline it should work, right?
Also, here is the exact column from Postgres (the data source):
FROM MASTER: TESTing part. Goodluck: example.com/wp.png (tarifado) \r
\r
I stored your sample CSV line in a test.csv file and when running
bq load --autodetect --source_format=CSV dataset.test_table "gs://my-bucket/test.csv"
indeed I see the same error
Error detected while parsing row starting at position: 0. Error: Missing close
double quote (") character.
However when adding the flag for allowing quoted newlines it worked fine
bq load --autodetect --allow_quoted_newlines --source_format=CSV dataset.test_table "gs://my-bucket/test.csv"
And the BQ table, test_table, ends up with a single row (the empty CSV fields became null):
Row 1:
  int64_field_0:  49602
  string_field_1: AFFF7240654213
  string_field_2: FROM MASTER: TESTing part. Goodluck: example.com/wp.png (tarifado)
  string_field_3: MO
  string_field_4: null
  string_field_5: null
  double_field_6: 0.0
  double_field_7: 0.0
  double_field_8: 50.7
  string_field_9: NET
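The behavior of --allow_quoted_newlines can be illustrated locally with Python's csv module, which honours newlines inside quoted fields by default. This is only an analogy, not BigQuery's parser; the pipe delimiter below assumes the file was loaded with a matching field delimiter:

```python
import csv
import io

# The question's CSV line: pipe-delimited, with a quoted field spanning two lines.
raw = ('"49602"|"AFFF7240654213"|"FROM MASTER: TESTing part. Goodluck: example.com/wp.png (tarifado)\n'
       '"|"MO"|||0.000|0.000|50.7000|"NET"\n')

# A parser that allows quoted newlines keeps the two physical lines
# together as one logical record, which is what the flag enables in bq load.
rows = list(csv.reader(io.StringIO(raw), delimiter='|'))

print(len(rows))     # 1 -- the embedded newline did not split the record
print(len(rows[0]))  # 10 fields
print(rows[0][3])    # MO
```

Without quoted-newline support, the parser stops at the first physical line and reports an unterminated quote, which matches the "Missing close double quote" error above.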

Bigquery load from Avro gives can not convert from long to int

I am trying to load Avro files from Google Storage into BigQuery tables but ran into this issue.
The steps I followed are:
Created a dataframe in Spark.
Stored the data by writing it to Avro:
dataframe.write.avro("path")
Uploaded the Avro files to Google Storage.
Tried to load the data into BigQuery using the following command:
bq --nosync load --autodetect --source_format AVRO datasettest.testtable gs://test/avrodebug/*.avro
This command gives the following error:
Error while reading data, error message: The Apache Avro library failed to read data with the following error: Cannot resolve: "long" with "int"
So I also tried the command with an explicit schema:
bq --nosync load --source_format AVRO datasettest.testtable gs://test/avrodebug/*.avro C1:STRING, C2:STRING, C3:STRING, C4:STRING, C5:STRING, C6:INTEGER, C7:INTEGER, C8:INTEGER, C9:STRING, C10:STRING, C11:STRING
Here only C6, C7 and C8 hold integer values.
This also gives the same error as before.
Is there a reason I am getting an error about long to int rather than long to INTEGER?
Please let me know if there is any way to load this data by casting it.
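The error comes from Avro's schema-resolution rules, which only allow widening promotions: a writer's "int" can be read as "long", but a writer's "long" can never be read back as "int". Spark writes numeric columns as 64-bit "long", so something on the reading side expects "int" (likely a previously created table or cached schema; that part is an assumption). A sketch of the numeric promotion rule from the Avro specification:

```python
# Avro numeric type promotions (writer type -> acceptable reader types).
# This covers only the numeric rules relevant to the error; the full
# spec also covers string/bytes promotions.
ALLOWED_PROMOTIONS = {
    "int": {"int", "long", "float", "double"},
    "long": {"long", "float", "double"},
    "float": {"float", "double"},
    "double": {"double"},
}

def can_resolve(writer_type, reader_type):
    # A reader schema can resolve a writer schema only by widening.
    return reader_type in ALLOWED_PROMOTIONS.get(writer_type, {writer_type})

print(can_resolve("int", "long"))  # True  -- widening is fine
print(can_resolve("long", "int"))  # False -- 'Cannot resolve: "long" with "int"'
```

This also explains why the message says "int" rather than "INTEGER": it is the Avro reader, not BigQuery's type system, reporting the mismatch (BigQuery's INTEGER is 64-bit and maps to Avro "long").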

BigQuery Could not parse 'null' as int for field

I tried to load CSV files into a BigQuery table. Some columns have type INTEGER, but their missing values appear as the literal string 'null'. So when I use the bq load command, I get the following error:
Could not parse 'null' as int for field
So I am wondering what the best way to deal with this is; do I have to reprocess the data first for bq to load it?
You'll need to transform the data to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to keep long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
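The reason this works is that SAFE_CAST returns NULL instead of raising an error when a value such as the string 'null' cannot be converted. A rough Python analogue of that behavior (sample data invented):

```python
def safe_cast_int(value):
    # Rough analogue of BigQuery's SAFE_CAST(x AS INT64):
    # return None instead of raising when the cast fails.
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

# Invented column loaded as STRING; "null" is the literal string from the CSV.
column = ["1", "42", "null", "7"]
print([safe_cast_int(v) for v in column])  # [1, 42, None, 7]
```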
Alternatively, please try the null marker job config setting:
job_config.null_marker = 'NULL'
configuration.load.nullMarker
string
[Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
The BigQuery Console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, it can easily be done using the BigQuery command-line tool's bq load command. We can use the --null_marker flag to specify the marker, which is simply null in this case.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker as null does the trick here. You can omit the schema.json part if the table is already present with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command however lets you create and load a table in a single go. The schema needs to be specified in a JSON file in the below format:
[
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  },
  {
    "description": "[DESCRIPTION]",
    "name": "[NAME]",
    "type": "[TYPE]",
    "mode": "[MODE]"
  }
]

Load CSV file in PIG

In Pig, when we load a CSV file using a LOAD statement without mentioning a schema and with the default PigStorage('\t'), what happens? Will the load work fine, and can we dump the data? Or will it throw an error, since the file uses ',' while PigStorage expects '\t'? Please advise.
When you load a CSV file without defining a schema using PigStorage('\t'), since there are no tabs in any line of the input file, each whole line is treated as one tuple. You will not be able to access the individual fields in the line.
Example:
Input file:
john,smith,nyu,NY
jim,young,osu,OH
robert,cernera,mu,NJ
a = LOAD 'input' USING PigStorage('\t');
dump a;
OUTPUT:
(john,smith,nyu,NY)
(jim,young,osu,OH)
(robert,cernera,mu,NJ)
b = foreach a generate $0, $1, $2;
dump b;
(john,smith,nyu,NY,,)
(jim,young,osu,OH,,)
(robert,cernera,mu,NJ,,)
Ideally, b should have been:
(john,smith,nyu)
(jim,young,osu)
(robert,cernera,mu)
if the delimiter were a comma. But since the delimiter was a tab, and no tab exists in the input records, the whole line was treated as one field. Pig does not complain if a field is null; it just outputs nothing for it. Hence you see only the commas when you dump b.
Hope that was useful.
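The delimiter behavior described above can be sketched in Python, splitting the same sample records by tab (as PigStorage('\t') would) versus by comma:

```python
lines = [
    "john,smith,nyu,NY",
    "jim,young,osu,OH",
    "robert,cernera,mu,NJ",
]

# Splitting on tab, as PigStorage('\t') does: no tab exists in any line,
# so each whole line survives as a single field (a one-element tuple).
by_tab = [line.split("\t") for line in lines]
print(by_tab[0])    # ['john,smith,nyu,NY']

# Splitting on the comma the data actually uses gives the individual fields.
by_comma = [line.split(",") for line in lines]
print(by_comma[0])  # ['john', 'smith', 'nyu', 'NY']
```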

Google BigQuery - bq load failure displays a file number; how to get the file name?

I'm running the following bq command
bq load --source_format=CSV --skip_leading_rows=1 --max_bad_records=1000 --replace raw_data.order_20150131 gs://raw-data/order/order/2050131/* order.json
and getting the following message when loading data into BigQuery:
*************************************
Waiting on bqjob_r4ca10491_0000014ce70963aa_1 ... (412s) Current status: DONE
BigQuery error in load operation: Error processing job
'orders:bqjob_r4ca10491_0000014ce70963aa_1': Too few columns: expected
11 column(s) but got 1 column(s). For additional help: http://goo.gl/RWuPQ
Failure details:
- File: 844 / Line:1: Too few columns: expected 11 column(s) but got
1 column(s). For additional help: http://goo.gl/RWuPQ
**********************************
The message displays only the file number.
I checked the files' contents; most of them are good.
gsutil ls and the Cloud Console, on the other hand, display file names.
How can I tell which file it is from the file number?
There seems to be some weird spacing introduced in the question, but if the desired path to ingest is "*/order.json" - that won't work: you can only use "*" at the end of the path when ingesting data into BigQuery.
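One workaround people use for mapping the reported file number back to a name, under the assumption (not documented behavior) that the index refers to the file's position in the sorted wildcard expansion: list the matching objects in the same order and index into them. The listing below is faked with local names instead of a real `gsutil ls gs://raw-data/order/order/2050131/*` call:

```python
def file_for_index(listing, index):
    # Sort to get a stable, lexicographic order, like the order gsutil ls prints.
    # Assumption: BigQuery's "File: N" counts files in this same order.
    return sorted(listing)[index]

# Fake listing standing in for the gsutil ls output of the wildcard match.
fake_listing = [f"gs://raw-data/fake/part-{i:05d}" for i in range(1000)]
print(file_for_index(fake_listing, 844))  # gs://raw-data/fake/part-00844
```

Treat the result as a lead to investigate, not a certainty, since the numbering scheme is an assumption.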