Convert empty string ("") to Double data type while importing data from a JSON file using the command-line bq tool - google-bigquery

What steps will reproduce the problem?
1. I am running the command:
./bq load --source_format=NEWLINE_DELIMITED_JSON --schema=lifeSchema.json dataset_test1.table_test_3 lifeData.json
2. I have attached the data source and schema files.
3. It throws the following error:
JSON parsing error in row starting at position 0 at file: file-00000000. Could not convert value to double. Field: computed_results_A; Value:
What is the expected output? What do you see instead?
I want the empty string to be converted to NULL or 0.
What version of the product are you using? On what operating system?
I am using Mac OS X Yosemite.
Source JSON lifeData.json
{"schema":{"vendor":"com.bd.snowplow","name":"in_life","format":"jsonschema","version":"1-0-2"},"data":{"step":0,"info_userId":"53493764","info_campaignCity":"","info_self_currentAge":45,"info_self_gender":"male","info_self_retirementAge":60,"info_self_married":false,"info_self_lifeExpectancy":0,"info_dependantChildren":0,"info_dependantAdults":0,"info_spouse_working":true,"info_spouse_currentAge":33,"info_spouse_retirementAge":60,"info_spouse_monthlyIncome":0,"info_spouse_incomeInflation":5,"info_spouse_lifeExpectancy":0,"info_finances_sumInsured":0,"info_finances_expectedReturns":6,"info_finances_loanAmount":0,"info_finances_liquidateSavings":true,"info_finances_savingsAmount":0,"info_finances_monthlyExpense":0,"info_finances_expenseInflation":6,"info_finances_expenseReduction":10,"info_finances_monthlyIncome":0,"info_finances_incomeInflation":5,"computed_results_A":"","computed_results_B":null,"computed_results_C":null,"computed_results_D":null,"uid_epoch":"53493764_1466504541604","state":"init","campaign_id":"","campaign_link":"","tool_version":"20150701-lfi-v1"},"hierarchy":{"rootId":"94583157-af34-4ecb-8024-b9af7c9e54fa","rootTstamp":"2016-06-21 10:22:24.000","refRoot":"events","refTree":["events","in_life"],"refParent":"events"}}
Schema JSON lifeSchema.json
{
"name": "computed_results_A",
"type": "float",
"mode": "nullable"
}

Try loading the JSON file as a one-column CSV file:
bq load --field_delimiter='|' proj:set.table file.json json:string
Once the file is loaded into BigQuery, you can use JSON_EXTRACT_SCALAR or a JavaScript UDF to parse the JSON with total freedom.
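As a follow-up parsing step, here is a minimal sketch with the google-cloud-bigquery Python client, assuming the raw JSON landed in a single STRING column named json in proj.set.table (the names from the command above); NULLIF maps the empty string to NULL before the cast:
from google.cloud import bigquery

# Pull computed_results_A out of the nested "data" object and convert it
# to FLOAT64, treating "" as NULL (SAFE_CAST alone would also return NULL
# for a non-numeric string, but NULLIF makes the intent explicit).
client = bigquery.Client()
query = """
SELECT
  SAFE_CAST(
    NULLIF(JSON_EXTRACT_SCALAR(json, '$.data.computed_results_A'), '')
    AS FLOAT64) AS computed_results_A
FROM `proj.set.table`
"""
for row in client.query(query).result():
    print(row.computed_results_A)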

Related

Problem changing a column's data type in BigQuery

I am trying to change a column's data type from STRING to DATETIME (for example '04/12/2016 02:47:30') with the format 'YY/MM/DD HH24:MI:SS', but it shows an error like:
Failed to parse input timestamp string at 8 with format element ' '
The initial file was a CSV which I uploaded from my Drive. I tried to convert the column's data type in Google Sheets and then re-upload it, but the column type still remains STRING.
I think that when you load your CSV file into the BigQuery table, you use autodetect mode.
Unfortunately, with this mode BigQuery will consider your date a STRING even if you changed it in Google Sheets.
Instead of using autodetect, I propose using a JSON schema for your BigQuery table.
In your schema you will indicate that the column type for your date field is TIMESTAMP.
The format you indicated, 04/12/2016 02:47:30, is compatible with a timestamp and BigQuery will convert it for you.
To load the file into BigQuery, you can use the console directly or the bq command-line tool:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
For the BigQuery JSON schema, the timestamp field looks like this:
[
{
"name": "yourDate",
"type": "TIMESTAMP",
"mode": "NULLABLE",
"description": "Your date"
}
]
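If the load is done from code instead, here is a minimal sketch with the google-cloud-bigquery Python client, reusing the placeholder bucket and table names from the command above and declaring the date column as TIMESTAMP rather than relying on autodetect:
from google.cloud import bigquery

client = bigquery.Client()

# Explicit schema instead of autodetect: the date column is declared TIMESTAMP.
# Add the rest of your columns to the schema list as needed.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[bigquery.SchemaField("yourDate", "TIMESTAMP", mode="NULLABLE")],
    skip_leading_rows=1,  # assumes the CSV has a header row
)

load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.csv", "mydataset.mytable", job_config=job_config
)
load_job.result()  # wait for the job to complete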

Bigquery load from Avro gives can not convert from long to int

I am trying to load an Avro file from Google Storage into a BigQuery table but faced this issue.
The steps I have followed are below.
Created a dataframe in Spark.
Stored the data by writing it to Avro:
dataframe.write.avro("path")
Loaded the data into Google Storage.
Tried to load the data into Google BigQuery using the following command:
bq --nosync load --autodetect --source_format AVRO datasettest.testtable gs://test/avrodebug/*.avro
This command gives the following error:
Error while reading data, error message: The Apache Avro library failed to read data with the follwing error: Cannot resolve: "long" with "int"
So I even tried to use this command, specifying the schema:
bq --nosync load --source_format AVRO datasettest.testtable gs://test/avrodebug/*.avro C1:STRING, C2:STRING, C3:STRING, C4:STRING, C5:STRING, C6:INTEGER, C7:INTEGER, C8:INTEGER, C9:STRING, C10:STRING, C11:STRING
Here only C6, C7, and C8 have integer values.
Even this gives the same error as before.
Is there any reason why I am getting the error for long to int instead of long to INTEGER?
Please let me know if there is any way to load this data by casting it.
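One way to see where the long-to-int mismatch comes from is to inspect the writer schema embedded in the Avro files themselves; a minimal sketch, assuming the fastavro package is installed and one of the files has been copied locally (part-00000.avro is a placeholder name):
from fastavro import reader

# Print the writer schema stored in the Avro file header so the field
# types (int vs long) can be compared against the BigQuery schema.
with open("part-00000.avro", "rb") as fo:
    avro_reader = reader(fo)
    print(avro_reader.writer_schema)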

BigQuery Could not parse 'null' as int for field

I tried to load CSV files into a BigQuery table. There are columns whose type is INTEGER, but some missing values are NULL. So when I used the bq load command to load the data, I got the following error:
Could not parse 'null' as int for field
So I am wondering what the best solutions are to deal with this. Do I have to reprocess the data first for bq load to work?
You'll need to transform the data in order to end up with the expected schema and data. Instead of INTEGER, specify the column as having type STRING. Load the CSV file into a table that you don't plan to use long-term, e.g. YourTempTable. In the BigQuery UI, click "Show Options", then select a destination table with the table name that you want. Now run the query:
#standardSQL
SELECT * REPLACE(SAFE_CAST(x AS INT64) AS x)
FROM YourTempTable;
This will convert the string values to integers where 'null' is treated as null.
Please try with the following job config setting:
job_config.null_marker = 'NULL'
configuration.load.nullMarker (string)
[Optional] Specifies a string that represents a null value in a CSV file. For example, if you specify "\N", BigQuery interprets "\N" as a null value when loading a CSV file. The default value is the empty string. If you set this property to a custom value, BigQuery throws an error if an empty string is present for all data types except for STRING and BYTE. For STRING and BYTE columns, BigQuery interprets the empty string as an empty value.
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
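A fuller sketch of that job-config approach with the google-cloud-bigquery Python client (the table and file names are placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Treat the literal string "null" in the CSV as a NULL value.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    null_marker="null",
    skip_leading_rows=1,  # assumes a header row
    autodetect=True,      # or pass an explicit schema instead
)

with open("data.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "dataset.table_name", job_config=job_config
    )
load_job.result()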
The BigQuery Console has its limitations and doesn't allow you to specify a null marker while loading data from a CSV. However, it can easily be done by using the BigQuery command-line tool's bq load command. We can use the --null_marker flag to specify the marker, which is simply null in this case.
bq load --source_format=CSV \
--null_marker=null \
--skip_leading_rows=1 \
dataset.table_name \
./data.csv \
./schema.json
Setting the null_marker to null does the trick here. You can omit the schema.json part if the table already exists with a valid schema. --skip_leading_rows=1 is used because my first row was a header.
You can learn more about the bq load command in the BigQuery documentation.
The load command, however, lets you create and load a table in a single go. The schema needs to be specified in a JSON file in the below format:
[
{
"description": "[DESCRIPTION]",
"name": "[NAME]",
"type": "[TYPE]",
"mode": "[MODE]"
},
{
"description": "[DESCRIPTION]",
"name": "[NAME]",
"type": "[TYPE]",
"mode": "[MODE]"
}
]

JSON file not loading into redshift

I am having issues using the COPY command in Redshift to load JSON objects. I receive a file in a JSON format that fails when attempting to use the COPY command; however, when I adjust the JSON file to the other format shown below, it works. This is not an ideal solution, as I am not permitted to modify the JSON file.
This works fine:
{
"id": 1,
"name": "Major League Baseball"
}
{
"id": 2,
"name": "National Hockey League"
}
This does not work (notice the extra square brackets)
[
{"id":1,"name":"Major League Baseball"},
{"id":2,"name":"National Hockey League"}
]
This is my jsonpaths file:
{
"jsonpaths": [
"$['id']",
"$['name']"
]
}
The problem is that the COPY command does not really accept a valid JSON file. Instead, it expects one JSON object per line, which is shown in the documentation but not obviously mentioned.
Hence, every line is supposed to be valid JSON, but the full file is not. That's why it works when you modify your file.
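If a preprocessing step were allowed, converting the array form into the line-per-object form that COPY expects is straightforward; a minimal sketch (input.json and output.json are placeholder names):
import json

# Read a JSON array ([{...}, {...}]) and write one JSON object per line.
with open("input.json") as src, open("output.json", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")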

Unexpected error while loading data

I am getting an "Unexpected" error. I tried a few times, and I still could not load the data. Is there any other way to load data?
gs://log_data/r_mini_raw_20120510.txt.gz to 567402616005:myv.may10c
Errors:
Unexpected. Please try again.
Job ID: job_4bde60f1c13743ddabd3be2de9d6b511
Start Time: 1:48pm, 12 May 2012
End Time: 1:51pm, 12 May 2012
Destination Table: 567402616005:myvserv.may10c
Source URI: gs://log_data/r_mini_raw_20120510.txt.gz
Delimiter: ^
Max Bad Records: 30000
Schema:
zoneid: STRING
creativeid: STRING
ip: STRING
Update:
I am using the file that can be found here:
http://saraswaticlasses.net/bad.csv.zip
bq load -F '^' --max_bad_record=30000 mycompany.abc bad.csv id:STRING,ceid:STRING,ip:STRING,cb:STRING,country:STRING,telco_name:STRING,date_time:STRING,secondary:STRING,mn:STRING,sf:STRING,uuid:STRING,ua:STRING,brand:STRING,model:STRING,os:STRING,osversion:STRING,sh:STRING,sw:STRING,proxy:STRING,ah:STRING,callback:STRING
I am getting an error "BigQuery error in load operation: Unexpected. Please try again."
The same file works from Ubuntu, but it does not work from CentOS 5.4 (Final).
Does the OS encoding need to be checked?
The file you uploaded has an unterminated quote. Can you delete that line and try again? I've filed an internal BigQuery bug to handle this case more gracefully.
$grep '"' bad.csv
3000^0^1.202.218.8^2f1f1491^CN^others^2012-05-02 20:35:00^^^^^"Mozilla/5.0^generic web browser^^^^^^^^
When I run a load from my workstation (Ubuntu), I get a warning about the line in question. Note that if you were using a larger file, you would not see this warning, instead you'd just get a failure.
$bq show --format=prettyjson -j job_e1d8636e225a4d5f81becf84019e7484
...
"status": {
"errors": [
{
"location": "Line:29057 / Field:12",
"message": "Missing close double quote (\") character: field starts with: <Mozilla/>",
"reason": "invalid"
}
]
My suspicion is that you have rows or fields in your input data that exceed the 64 KB limit. Perhaps re-check the formatting of your data, check that it is gzipped properly, and if all else fails, try importing uncompressed data. (One possibility is that the entire compressed file is being interpreted as a single row/field that exceeds the aforementioned limit.)
To answer your original question, there are a few other ways to import data: you could upload directly from your local machine using the command-line tool or the web UI, or you could use the raw API. However, all of these mechanisms (including the Google Storage import that you used) funnel through the same CSV parser, so it's possible that they'll all fail in the same way.
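As a quick local pre-check along the lines of both answers, the uncompressed file can be scanned for lines with an unbalanced double quote or lines over the 64 KB limit before loading; a rough sketch (bad.csv is the file name from the question, and counting quotes is only a heuristic):
# Flag lines that may trip the CSV parser: an odd number of double quotes
# (a possibly unterminated quote) or a line longer than 64 KB.
MAX_LINE_BYTES = 64 * 1024

with open("bad.csv", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        if raw.count(b'"') % 2 != 0:
            print(f"line {lineno}: odd number of double quotes")
        if len(raw) > MAX_LINE_BYTES:
            print(f"line {lineno}: {len(raw)} bytes exceeds 64 KB")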