BigQuery load job of a JSON file - forcing a field to be STRING in schema auto-detect

If in the beginning the JSON contains
"label": "foo"
and later it is
"label": "123"
BigQuery returns
Invalid schema update. Field label has changed type from STRING to INTEGER
although the value is "123" and not 123.
The file is being loaded with
autodetect: true
Is there a way to force BigQuery to treat every field as a STRING when it applies its auto-detect, or is the only way to use CSV instead?

Auto-detection is a best-effort attempt to infer the data type by scanning up to 100 rows of data as a representative sample. There is no way to hint to BigQuery which type a field should be during auto-detection. You may consider specifying the schema manually for your use case.
UPDATE:
I have tested loading a file containing only {"label" : "123"} and the field is recognized as INTEGER. Therefore, auto-detection treats "123" as INTEGER regardless of the quotes. For your case, you may consider exporting the schema from the existing table as explained in the documentation:
Note: You can view the schema of an existing table in JSON format by entering the following command:
bq show --format=prettyjson [DATASET].[TABLE]
and use that schema for further dynamic loads.
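As a rough sketch of that workflow (the dataset, table, and file names below are hypothetical), you could export the existing schema once and reuse it for later loads instead of relying on auto-detect:

# Hypothetical names. Export the existing table's definition, including its schema:
bq show --format=prettyjson mydataset.mytable > table_def.json
# Copy the "fields" array from under "schema" in table_def.json into schema.json,
# then load new files with that explicit schema instead of autodetect:
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable data.json ./schema.json

With the schema pinned this way, a field declared as STRING keeps "123" as a string even if every sampled value looks numeric.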

Related

How can I query a struct<$oid:string> in AWS Athena

I want to query data that is stored in MongoDB and exported into a number of JSON files stored in S3.
I am using AWS Glue to read the files into Athena, however the data type for the id on each table is imported as struct<$oid:string>.
I have tried every variation of adding quotation marks around the fields with no luck. Everything I try results in the error name expected at the position 7 of 'struct<$oid:string>' but '$' is found.
Is there any way I can read these tables in their current form, or do I need to declare their type in Glue?
Glue Crawlers create schemas that match what they find, without considering whether they will work with, for example, Athena. In Athena you can't have a struct property whose name starts with $, but Glue doesn't take that into account, partly because you might be using the table with something else where that is not a problem, and partly because there is not much else it can do: that is the name of the property.
There are two ways around it, but neither will work if you continue to use a crawler. You will need to modify the table schema, and if you keep running the crawler it will just revert your changes.
The first, and probably simplest, option is to change the type of the column to STRING and then use a JSON function at query time to extract the value using JSONPath ($ is a special character in JSONPath, but you should be able to escape it; see the sketch below).
The second option is to use the "mappings" feature of the Hive JSON serde. I'm not 100% sure if it will work for this case, but it could be worth a try. The docs are not very extensive on how to configure it, unfortunately.
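A minimal sketch of the first option, assuming hypothetical table and column names (my_table, _id) and that the _id column has already been changed to STRING: the $oid value should be extractable with json_extract_scalar, using JSONPath bracket notation to sidestep the leading $ character.

-- Hypothetical names; assumes _id was changed from struct<$oid:string> to string.
-- Bracket notation in the JSONPath avoids the leading $ being parsed specially.
SELECT json_extract_scalar(_id, '$["$oid"]') AS object_id
FROM my_table;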

How do I select a certain key/value pair from a JSON field inside a SQL table in Snowflake

I am currently working on building a data warehouse in Snowflake for the business that I work for, and I have encountered some problems. I used to apply the function JSON_VALUE in T-SQL to extract certain key/value pairs from a JSON-formatted field in my original MSSQL DB.
All the other fields are in the regular SQL format, but there is this one field that I really need that is formatted in JSON, and I can't seem to extract the key/value pair that I need.
I'm new to SnowSQL and I can't seem to find a way to extract this within a regular query. Does anyone know a way around my problem?
ID /// TYPE /// Name (JSON_FORMAT) /// Amount
1 /// 5 /// {En: "lunch", fr: "diner"} /// 10.00
I would like to extract this line (for example) and be able to retrieve only the En: "lunch" part from my JSON-formatted field.
Thank you!
Almost any time you use JSON in Snowflake, it's advisable to use the VARIANT data type. You can use the parse_json function to convert a string into a variant with JSON.
select
parse_json('{En: "lunch", fr: "diner"}') as VARIANT_COLUMN,
VARIANT_COLUMN:En::string as ENGLISH_WORD;
In this sample, the first column converts your JSON into a variant named VARIANT_COLUMN. The second column uses the variant, extracting the "En" property and casting it to a string data type.
You can define columns as variant and store JSON natively. That's going to improve performance and allow parsing using dot notation in SQL.
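For example, a minimal sketch with made-up table and column names, storing the JSON field in a VARIANT column and reading it back with the colon/dot notation:

-- Hypothetical table; name_json holds the JSON as a VARIANT rather than as text.
create or replace table meals (id int, meal_type int, name_json variant, amount number(10,2));
insert into meals
  select 1, 5, parse_json('{"En": "lunch", "fr": "diner"}'), 10.00;
-- Pull out the English label and cast it to a string.
select name_json:En::string as meal from meals;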
For anyone else who also stumbles upon this question:
You can also use JSON_EXTRACT_PATH_TEXT. Here is an example, if you wanted to create a new column called meal.
select json_extract_path_text(Name,'En') as meal from ...

BigQuery load job failing with "Could not parse 'Text' as bool"

Creating a table from a CSV file in BigQuery with auto-detect schema.
The load job fails with the error:
Error while reading data, error message: Could not parse 'good' as bool for field order_Flag (position 26) starting at location 1689438
Even though the column has some rows with text/string, why is BigQuery parsing it as a bool?
Even though the column has some rows with text/string, why is BigQuery parsing it as a bool?
When auto-detection is enabled, BigQuery starts the inference process by scanning up to 100 rows of data in your file to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.
So it looks like those "some rows with text/string" fall beyond the 100 rows used for auto-detection, and the first 100 rows "define" that field as a BOOLEAN.
You can read more about this in the Schema auto-detection documentation.
To avoid this, you can define your own schema for the load; see the details in Loading CSV data into a table.
For this particular issue, uncheck auto-detect schema and enter the schema as text, for example:
name:STRING,
gender:STRING,
count:NUMERIC
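If you are loading from the bq CLI instead of the web UI, a sketch of the equivalent command (mydataset.mytable and names.csv are hypothetical; the column names are the ones from the answer above):

# Load the CSV with an explicit inline schema instead of --autodetect.
# Drop --skip_leading_rows=1 if your file has no header row.
bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable names.csv name:STRING,gender:STRING,count:NUMERIC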

Purpose of a JSON schema file while loading data into BigQuery from a CSV file

Can someone please help me by stating the purpose of providing the JSON schema file while loading a file into a BQ table using the bq command? What are the advantages?
Does this file help to maintain data integrity by avoiding any column swap?
Regards,
Sreekanth
Specifying a JSON schema, instead of relying on auto-detect, means that you are guaranteed to get the expected types for each column being loaded. If you have data that looks like this, for example:
1,'foo',true
2,'bar',false
3,'baz',true
Schema auto-detection would infer that the type of the first column is an INTEGER (a.k.a. INT64). Maybe you plan to load more data in the future, though, that looks like this:
3.14,'foo',true
1.59,'bar',false
-2.001,'baz',true
In that case, you probably want the first column to have type FLOAT (a.k.a. FLOAT64) instead. If you provide a schema when you load the first file, you can specify a type of FLOAT for that column explicitly.
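As a sketch of what that looks like in practice (the field names num_col, label, and flag, and the dataset/table/file names, are made up for illustration), the schema file is just a JSON array of field definitions:

[
  {"name": "num_col", "type": "FLOAT", "mode": "NULLABLE"},
  {"name": "label", "type": "STRING", "mode": "NULLABLE"},
  {"name": "flag", "type": "BOOLEAN", "mode": "NULLABLE"}
]

You would then reference it in the load, e.g. bq load --source_format=CSV mydataset.mytable data.csv ./schema.json. Because the schema pins num_col to FLOAT, both 1 and 3.14 load into the same column without a type conflict.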

What happens if I send integers to a BigQuery field "string"?

One of the columns I send (in my code) to BigQuery contains integers. I added the columns to BigQuery, but I was too fast and added this one as type STRING.
Will the values be automatically converted? Or will the data be totally corrupted (i.e. I cannot trust the resulting string at all)?
Data shouldn't be automatically converted, as that would defeat the purpose of having a table schema.
What I've seen people do is save a whole JSON line as a string and then process that string inside BigQuery. Other than that, if you try to save values that don't correspond to the field's schema definition, you should see an error being thrown.
If you need to change a table's schema definition, you can check the tutorial on updating a table schema.
Actually, BigQuery automatically converted the integers I sent into strings, so my table populates OK.
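If you want to double-check that nothing was corrupted along the way, a quick query like the following sketch can help (the project, dataset, table, and column names are placeholders): it counts rows in the STRING column whose value no longer casts cleanly back to an integer.

-- Hypothetical names; counts values in the STRING column that are not valid integers.
SELECT COUNT(*) AS non_integer_rows
FROM `myproject.mydataset.mytable`
WHERE my_string_column IS NOT NULL
  AND SAFE_CAST(my_string_column AS INT64) IS NULL;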