BigQuery automatically converts a string into an int - google-bigquery

I have a table with a field called field_string of type string. I insert data in JSON format, and some rows have only numbers in that field, for example "field_string": "123456". The problem is that BigQuery transforms the value of that row into an int and cannot insert it because the types do not match. I cannot change the field's type because some rows do have letters or symbols in that field.
I have a workaround that adds a symbol to the string so that BigQuery does not convert it, but I would like to know whether there is a way to avoid that.
The job configuration is the following:
job_config = bigquery.LoadJobConfig(
    autodetect=False,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    max_bad_records=10,
    ignore_unknown_values=False,
)
Thanks!

As mentioned by @rtenha in the comments, you should pass the schema inside your LoadJobConfig. Below is an example with a sample schema.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("field_string", "STRING"),
    ],
    autodetect=False,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
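For completeness, here is a minimal sketch of how that config might be used with a load call (the client, table_id, and file name below are illustrative placeholders, not taken from the question):
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # illustrative placeholder

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("field_string", "STRING"),
    ],
    autodetect=False,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

# With an explicit schema, a value such as "123456" is loaded as STRING
# instead of being autodetected as an integer.
with open("rows.json", "rb") as source_file:  # placeholder file name
    load_job = client.load_table_from_file(source_file, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete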

Related

Extracting JSON returns null (Presto Athena)

I'm working with Presto SQL in Athena, and in a table I have a column named "data.input.additional_risk_data.basket" that contains JSON like this:
[
    {
        "data.input.additional_risk_data.basket.val.brand": null,
        "data.input.additional_risk_data.basket.val.category": null,
        "data.input.additional_risk_data.basket.val.item_reference": "26484651",
        "data.input.additional_risk_data.basket.val.name": "Nike Force 1",
        "data.input.additional_risk_data.basket.val.product_name": null,
        "data.input.additional_risk_data.basket.val.published_date": null,
        "data.input.additional_risk_data.basket.val.quantity": "1",
        "data.input.additional_risk_data.basket.val.size": null,
        "data.input.additional_risk_data.basket.val.subCategory": null,
        "data.input.additional_risk_data.basket.val.unit_price": 769.0,
        "data.input.additional_risk_data.basket.val.upc": null,
        "data.input.additional_risk_data.basket.val.url": null
    }
]
I need to extract some of the data there, for example data.input.additional_risk_data.basket.val.item_reference. I'm not used to working with JSON, but I tried a few things:
json_extract("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
json_extract_scalar("data.input.additional_risk_data.basket", '$.data.input.additional_risk_data.basket.val.item_reference')
They all returned null. I'm wondering what the correct way is to get the values from that JSON.
Thank you!
There are multiple "problems" with your data and JSON path selector. The keys are not conventional (and I have not found a way to tell Athena to escape them), and your JSON is actually an array of JSON objects. What you can do is cast the data to an array and process it. For example:
-- sample data
WITH dataset (json_val) AS (
    VALUES (json '[
        {
            "data.input.additional_risk_data.basket.val.brand": null,
            "data.input.additional_risk_data.basket.val.category": null,
            "data.input.additional_risk_data.basket.val.item_reference": "26484651",
            "data.input.additional_risk_data.basket.val.name": "Nike Force 1",
            "data.input.additional_risk_data.basket.val.product_name": null,
            "data.input.additional_risk_data.basket.val.published_date": null,
            "data.input.additional_risk_data.basket.val.quantity": "1",
            "data.input.additional_risk_data.basket.val.size": null,
            "data.input.additional_risk_data.basket.val.subCategory": null,
            "data.input.additional_risk_data.basket.val.unit_price": 769.0,
            "data.input.additional_risk_data.basket.val.upc": null,
            "data.input.additional_risk_data.basket.val.url": null
        }
    ]')
)
-- query
select arr[1]['data.input.additional_risk_data.basket.val.item_reference'] item_reference -- or use unnest if more than one element is expected in the array
from (
    select cast(json_val as array(map(varchar, json))) arr
    from dataset
)
Output:
item_reference
"26484651"

PySpark column values are getting shifted automatically while creating a DataFrame

I am trying to create a PySpark DataFrame manually using the nested schema below:
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('source', StringType()),
        StructField('sourceids', ArrayType(IntegerType()))]))),
    StructField('first_name', StringType()),
    StructField('last_name', StringType()),
    StructField('kare_id', StringType()),
    StructField('match_key', ArrayType(StringType()))
])
I am using the code below to create a DataFrame with this schema:
row = [
    Row(
        fields=[
            Row(source='BCONNECTED', sourceids=[10, 202, 30]),
            Row(source='KP', sourceids=[20, 30, 40]),
        ],
        first_name='Christopher', last_name='Nolan', kare_id='kare1',
        match_key=['abc', 'abcd'],
    ),
    Row(
        fields=[
            Row(source='BCONNECTED', sourceids=[20, 304, 5, 6]),
            Row(source='KP', sourceids=[40, 50, 60]),
        ],
        first_name='Michael', last_name='Caine', kare_id='kare2',
        match_key=['ncnc', 'cncnc'],
    ),
]
content = spark.createDataFrame(sc.parallelize(row), schema=schema)
content.printSchema()
content.printSchema()
The schema prints correctly, but when I run content.show() I can see that the values of the kare_id and last_name columns have swapped.
+--------------------+-----------+---------+-------+-------------+
| fields| first_name|last_name|kare_id| match_key|
+--------------------+-----------+---------+-------+-------------+
|[[BCONNECTED, [10...|Christopher| kare1| Nolan| [abc, abcd]|
|[[BCONNECTED, [20...| Michael| kare2| Caine|[ncnc, cncnc]|
+--------------------+-----------+---------+-------+-------------+
PySpark sorts the Row object on column names using lexicographic ordering. Thus, the ordering of the columns in your data will be fields, first_name, kare_id, last_name, match_key.
Spark then associates each of those column names with the data, resulting in the mismatch. The fix is to swap the schema entries for last_name and kare_id, as shown below:
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('source', StringType()),
        StructField('sourceids', ArrayType(IntegerType()))]))),
    StructField('first_name', StringType()),
    StructField('kare_id', StringType()),
    StructField('last_name', StringType()),
    StructField('match_key', ArrayType(StringType()))
])
From PySpark Docs on Row: "Row can be used to create a row object by using named arguments, the fields will be sorted by names."
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Row
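A quick way to see the behaviour the documentation describes (a sketch; the sorting applies to older PySpark versions, before 3.0, where it was later disabled):
from pyspark.sql import Row

# Named arguments are sorted by field name in older PySpark versions,
# so the resulting Row is ordered (kare_id, last_name).
r = Row(last_name="Nolan", kare_id="kare1")
print(r)  # Row(kare_id='kare1', last_name='Nolan') on Spark < 3.0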
First, you are actually defining the schema twice: when you create the data you are already using Row objects in the RDD, so you do not need to use the createDataFrame function; instead you can do the following:
sc.parallelize(row).toDF().show()
But if you still want to specify the schema explicitly, you need to keep the schema and the data in the same order, and the schema you specified is incorrect for the data you are passing. The correct schema would be:
schema = StructType([
    StructField('fields', ArrayType(StructType([
        StructField('source', StringType()),
        StructField('sourceids', ArrayType(IntegerType()))]))),
    StructField('first_name', StringType()),
    StructField('kare_id', StringType()),
    StructField('last_name', StringType()),
    StructField('match_key', ArrayType(StringType()))
])
kare_id should come before last_name because that is the order in which you are passing the data.
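If you would rather keep the question's original schema order (fields, first_name, last_name, kare_id, match_key), one alternative worth noting (a sketch, not taken from either answer) is to avoid Row objects entirely and pass plain tuples in the schema's declared order, so nothing gets re-sorted by name:
# Plain tuples are matched positionally against the schema, so the
# original field order from the question can be kept.
data = [
    (
        [("BCONNECTED", [10, 202, 30]), ("KP", [20, 30, 40])],
        "Christopher", "Nolan", "kare1", ["abc", "abcd"],
    ),
    (
        [("BCONNECTED", [20, 304, 5, 6]), ("KP", [40, 50, 60])],
        "Michael", "Caine", "kare2", ["ncnc", "cncnc"],
    ),
]
content = spark.createDataFrame(data, schema=schema)  # schema as defined in the question
content.show()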

Writing a dataframe via SQL query (pyodbc): pyodbc.Error: ('HY004', '[HY004])

I'd like to write a dataframe into two pre-defined columns in an SQL table. The schema in SQL is:
abc(varchar(255))
def(varchar(255))
With a dataframe like so:
df = pd.DataFrame(
    [
        [False, False],
        [True, True],
    ],
    columns=["ABC", "DEF"],
)
And the SQL query is like so:
with conn.cursor() as cursor:
    string = "INSERT INTO {0}.{1}(abc, def) VALUES (?,?)".format(db, table)
    cursor.execute(string, (df["ABC"]), (df["DEF"]))
    cursor.commit()
So that the query (string) looks like this:
'INSERT INTO my_table(abc, def) VALUES (?,?)'
This creates the following error message:
pyodbc.Error: ('HY004', '[HY004] [Cloudera][ODBC] (11320) SQL type not supported. (11320) (SQLBindParameter)')
So I tried a direct query (not via Python) in the Impala editor:
INSERT INTO my_table(abc, def) VALUES ('Hey','Hi');
And it produces this error message:
AnalysisException: Possible loss of precision for target table 'my_table'. Expression ''hey'' (type: STRING) would need to be cast to VARCHAR(255) for column 'abc'
How come I cannot insert even simple strings like "Hi" into my table? Is my schema set up incorrectly, or is it something else?
STRING type in Impala has a size limit of 2GB.
VARCHAR's length is whatever you define it to be, but not more than 64KB.
Thus there is a potential for data loss if you implicitly convert one into the other.
By default, literals are treated as type STRING. So, in order to insert a literal into a VARCHAR field, you need to CAST it appropriately.
INSERT INTO my_table(abc, def) VALUES (CAST('Hey' AS VARCHAR(255)),CAST('Hi' AS VARCHAR(255)));
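Back in Python, here is a hedged sketch of how the same cast might be applied through pyodbc (assuming the Cloudera/Impala ODBC driver accepts parameter markers inside CAST, which is worth verifying; the boolean values are converted to plain strings before binding):
# Convert the pandas values to plain Python strings so pyodbc binds a
# supported SQL type instead of a pandas Series.
rows = [(str(a), str(b)) for a, b in df[["ABC", "DEF"]].itertuples(index=False)]

sql = (
    "INSERT INTO {0}.{1}(abc, def) "
    "VALUES (CAST(? AS VARCHAR(255)), CAST(? AS VARCHAR(255)))"
).format(db, table)

with conn.cursor() as cursor:
    cursor.executemany(sql, rows)  # one parameter tuple per DataFrame row
    cursor.commit()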

Duplicate keys with Amazon Athena and the OpenX JSON SerDe

I have the following data due to an error:
{
    "eventType": "something",
    "details": {
        "userName": "NotSet",
        "username": "test@email.com"
    },
    "createdAt": 3
}
Creating the table works:
CREATE EXTERNAL TABLE tbl (
    eventType string,
    `createdAt` string,
    details string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://xx/yy'
However, when I query it I get the duplicate key error (I tried details as string, struct, and map; always the same):
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "username"
They are duplicates if you use them as row columns, but not as map keys or even as a string, so why does it fail? The org.apache.hive.hcatalog.data.JsonSerDe can skip records, but I do not like that since 99.5% of the data is like this. The org.apache.hive.hcatalog.data.JsonSerDe always fails.
It's now possible to set a parameter in order to cater for keys that differ only by case. The parameter is case.insensitive and it should be set to FALSE.
Example:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("case.insensitive" = "FALSE")
https://docs.aws.amazon.com/athena/latest/ug/json-serde.html#openx-json-serde
Presto does not support case-sensitive column names (they are always converted to lowercase), so it's not possible to have multiple columns that differ only by case.

Schema conflict when storing dataframes with datetime objects using load_table_from_dataframe()

I'm trying to load data from a Pandas DataFrame into a BigQuery table. The DataFrame has a column of dtype datetime64[ns], and when I try to store the df using load_table_from_dataframe(), I get
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table [table name]. Field computation_triggered_time has changed type from DATETIME to TIMESTAMP.
The table has a schema which reads
CREATE TABLE `[table name]` (
    ...
    computation_triggered_time DATETIME NOT NULL,
    ...
)
In the DataFrame, computation_triggered_time is a datetime64[ns] column. When I read the original DataFrame from CSV, I convert it from text to datetime like so:
df['computation_triggered_time'] = \
    pd.to_datetime(df['computation_triggered_time']).values.astype('datetime64[ms]')
Note:
The .values.astype('datetime64[ms]') part is necessary because load_table_from_dataframe() uses PyArrow to serialize the df and that fails if the data has nanosecond-precision. The error is something like
[...] Casting from timestamp[ns] to timestamp[ms] would lose data
This looks like a problem with Google's google-cloud-python package; could you report the bug there? https://github.com/googleapis/google-cloud-python
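In the meantime, one workaround that may help (a sketch, and dependent on the google-cloud-bigquery and pyarrow versions in use) is to pass an explicit schema in the LoadJobConfig so the client serializes the column as DATETIME instead of TIMESTAMP:
from google.cloud import bigquery

client = bigquery.Client()
table_id = "[table name]"  # placeholder, as in the question

job_config = bigquery.LoadJobConfig(
    # Pin the field type so the client does not infer TIMESTAMP for the
    # datetime64[ns] column; the other fields keep the table's schema.
    schema=[bigquery.SchemaField("computation_triggered_time", "DATETIME")],
)
client.load_table_from_dataframe(df, table_id, job_config=job_config).result()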