NULL typing in BigQuery CTEs - google-bigquery

I want to execute a MERGE statement in BigQuery that fills in fields missing because of late data from an upstream data pipeline. For this, I dynamically create the SQL MERGE statement using a CTE. Since it is late data, I know that the target table already has a partially filled row. When creating the dynamic MERGE statement I use this information to update only the missing fields: the already-present fields are set to NULL in the CTE, and each column is updated with T.already_filled_col = IFNULL(S.already_filled_col, T.already_filled_col), where S.already_filled_col is NULL.
However, since the CTE has no explicit schema, BigQuery infers a default type (INT64) for the bare NULL and complains that it is the wrong type. In particular, I get the error
No matching signature for function IFNULL for argument types: INT64, STRING
How do I specify the type of the NULL value in a CTE?
Below is a minimal working example.
Given a target table with the following schema:
[
  {
    "name": "join_col",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "already_filled_col",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "update_col",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]
Fill it with dummy data:
INSERT `my_project.my_dataset.test_table` (join_col, already_filled_col, update_col)
VALUES
('join_key', 'on_time_data', NULL)
My merge statement:
MERGE `my_project.my_dataset.test_table` AS T
USING (
  SELECT
    "join_key" AS join_col,
    NULL AS already_filled_col,
    "late_data" AS update_col
) AS S
ON T.join_col = S.join_col
WHEN MATCHED THEN
  UPDATE SET
    T.already_filled_col = IFNULL(S.already_filled_col, T.already_filled_col),
    T.update_col = IFNULL(S.update_col, T.update_col)
I am aware that I could replace NULL AS already_filled_col in the CTE with CAST(NULL AS STRING) AS already_filled_col. However, I am hoping that there is an easier way, since for the actual data the types are not always STRING, and deriving them dynamically is something I was hoping to avoid.
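For reference, applying that CAST workaround to the example above looks like this (only the NULL line in the subquery changes):

MERGE `my_project.my_dataset.test_table` AS T
USING (
  SELECT
    "join_key" AS join_col,
    CAST(NULL AS STRING) AS already_filled_col,  -- typed NULL matches the STRING column
    "late_data" AS update_col
) AS S
ON T.join_col = S.join_col
WHEN MATCHED THEN
  UPDATE SET
    T.already_filled_col = IFNULL(S.already_filled_col, T.already_filled_col),
    T.update_col = IFNULL(S.update_col, T.update_col)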

Related

How to display a row without showing its null columns in PostgreSQL?

I want to show all the rows in my table with all the columns except those columns that are null.
-- SELECT all users
SELECT * FROM users
ORDER BY user_id ASC;
-- SELECT a user
SELECT * FROM users
WHERE user_id = $1;
Currently my API's GET request returns something like this with the above queries:
{
  "user_id": 10,
  "name": "Bruce Wayne",
  "username": "Batman",
  "email": "bat@cave.com",
  "phone": null,
  "website": null
}
Is there any way I can display it like this so that the null columns aren't shown?
{
  "user_id": 10,
  "name": "Bruce Wayne",
  "username": "Batman",
  "email": "bat@cave.com"
}
I understand that you are serializing (or deserializing) JSON objects in your code. Most serialization modules have special parameters for this, such as an "ignore nulls" option.
If you generate this JSON data inside the database, in your SQL code, then you can use the Postgres jsonb_strip_nulls(JSONB) function. It recursively removes all keys with null values from the JSONB and returns a JSONB value.
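A minimal sketch, assuming the users table from the question (to_jsonb converts the row to JSONB before the nulls are stripped):

-- returns the row as JSON with null-valued keys removed
SELECT jsonb_strip_nulls(to_jsonb(u)) AS user_json
FROM users AS u
WHERE user_id = $1;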

SQL OpenJSON obtain label in select output

I'm using the OPENJSON function in SQL to import various JSON format files, and I can usually handle the variations in formats from various sources. However, I've got an example where I can't reach a certain value.
Example JSON format file:
{
  "bob": {
    "user_type": "standard",
    "user_enabled": "true",
    "last_login": "2021-07-25"
  },
  "claire": {
    "user_type": "administrator",
    "user_enabled": "true",
    "last_login": "2021-09-17"
  }
}
One of the values I want to return as one of my columns is the user's name.
I believe it's called the key, but I'm not entirely sure, because when I execute the following, having loaded the JSON string into the @json variable:
select *
from openjson(@json)
I get two columns, one labelled key containing the username, the other containing my nested json string within {} braces.
Usually, to run my select statement, I would do something like
select username, user_type, user_enabled, last_login
from openjson(@thisjson)
with (
  username nvarchar(100),
  user_type nvarchar(100),
  user_enabled nvarchar(100),
  last_login nvarchar(100)
)
I get that sometimes I have to put a base path in the brackets after openjson, and sometimes I have to follow the input column definitions with something like '$.last_login' to help traverse the structure, but I can't work out how to identify or select the placeholder for the username.
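For what it's worth, one pattern that reaches the name is to read the key column that OPENJSON exposes when called without a WITH clause, and then shred each nested object with CROSS APPLY. A sketch, assuming the JSON above has been loaded into @json:

SELECT j.[key] AS username,  -- the object key, i.e. the user's name
       u.user_type, u.user_enabled, u.last_login
FROM OPENJSON(@json) AS j
CROSS APPLY OPENJSON(j.[value])
WITH (
  user_type nvarchar(100),
  user_enabled nvarchar(100),
  last_login nvarchar(100)
) AS u;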

create partitioned table from JSON file in BigQuery

I would like to create a partitioned table in BigQuery. The schema for the table is in JSON format, stored on my local path.
I would like to create this table, with partitioning, from the JSON file using the "bq mk -t" command. Kindly help.
Thanks in advance.
bq mk --table --schema=file.json PROJECTID:DATASET.TABLE
Hope the above example helps.
You can refer to the bq command-line reference for more options.
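As far as I know, the partitioning itself is specified with bq flags rather than inside the schema file. A sketch with illustrative values (the field name partition_date is an assumption):

bq mk --table \
  --schema=/path/to/schema.json \
  --time_partitioning_type=DAY \
  --time_partitioning_field=partition_date \
  PROJECTID:DATASET.TABLE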
One recommendation is to use the JSON format for creating BigQuery tables.
(1) If you decide to partition the table for better performance, use the pseudo columns (_PARTITIONTIME or _PARTITIONDATE).
(2) For example, partition_date is the column with the data type TIMESTAMP (a column of type DATE can also be used).
{
  "name": "partition_date",
  "type": "TIMESTAMP",
  "mode": "NULLABLE",
  "timePartitioning": {
    "type": "DAY"
  },
  "field": [
    {
      "name": "partition_date",
      "type": "TIMESTAMP"
    }
  ]
},

Big Query Create View with Repeated Record

We have a series of tables with schema that contains a repeated record, like follows:
[{
  "name": "field1",
  "type": "RECORD",
  "mode": "REPEATED",
  "fields": [
    {"type": "STRING", "name": "subfield1"},
    {"type": "INTEGER", "name": "subfield2"}
  ]
}]
When we create a view that includes that repeated record field, we always get the error:
Error in query string: Field field1 from table xxxxx is not a leaf field.
I understand that it might be better to use FLATTEN, but this field mostly contains different filters we want to test on, and we have a lot of other non-repeated fields that would be difficult to manage if flattened.
It turned out that the problem is selecting the repeated record field from multiple tables (not creating the view as such). Is there an easy way to get around that?
Thanks
If you do SELECT field.* from t1, t2 you'll get an error that * cannot be used to refer to fields in a union (as you've noticed above).
You can work around this by wrapping the union in an inner SELECT statement, as in SELECT field.* from (SELECT * from t1, t2).
To give a concrete example, this works:
SELECT payload.pages.*
FROM (
SELECT *
FROM [publicdata:samples.github_nested],
[publicdata:samples.github_nested])

Database schema design for records of nested JSON

I have a list of profile records, each of which looks like the one below:
{
  "name": "Peter Pan",
  "contacts": [
    {
      "key": "mobile",
      "value": "1234-5678"
    }
  ],
  "addresses": [
    {
      "key": "postal",
      "value": "2356 W. Manchester Ave.\nSomewhere District\nA Country"
    },
    {
      "key": "po",
      "value": "PO Box 1234"
    }
  ],
  "emails": [
    {
      "key": "work",
      "value": "abc@work.com"
    },
    {
      "key": "personal",
      "value": "abc@personal.com"
    }
  ],
  "url": "http://www.example.com/"
}
I would think about having the following schema structure:
A profile table with id and name field.
A profile_contact table with id, profile_id, key, value field.
A profile_address table with id, profile_id, key, value field.
A profile_email table with id, profile_id, key, value field.
However, I think I am creating too many tables for such a simple JSON!
Would there be performance problems when I search across the tables, since many JOINs are performed to retrieve just one record?
What would be a better way to model the above JSON record into the database? In SQL, or better in NoSQL?
It kind of depends.
If you are planning to have an "infinite" number of contacts/addresses/emails per user, then your idea is a pretty good way to go.
You could also consider (something like) the following:
PROFILE table, containing:
PROFILE_ID
NAME
EMAIL_ADDRESS_WORK
EMAIL_ADDRESS_PERSONAL
PHONE_NUMBER
MOBILE_NUMBER
ADDRESS table, containing:
ADDRESS_ID
PROFILE_ID
STREET
CITY
..etc
This means you can set 2 kinds of emails and 2 kinds of phone numbers per user and they are stored with the profile itself.
Alternatively you can choose to have a separate CONTACT table which contains both phone numbers and email addresses (and maybe other types):
CONTACT_TYPE (phone, mobile, email_work, email_personal)
CONTACT_VALUE
PROFILE_ID
All three (mine and yours) could work perfectly. To decide what would work best for you, write down all the possibilities there are (and could be). Maybe you want to be able to add 10 email addresses per profile (then storing them with the profile would be silly); maybe you will have a very large variety of contact types, such as IM, Facebook, ICQ, or Twitter (then a CONTACT table, as sketched below, would fit nicely).
So try to find out/list what types of data you will have and see how that will fit into a specific model, then pick the most suitable one :)
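As a minimal sketch of that CONTACT variant (all names here are illustrative, not from the question):

-- illustrative DDL; adjust names and types to your needs
CREATE TABLE profile (
  profile_id SERIAL PRIMARY KEY,
  name       TEXT NOT NULL
);

CREATE TABLE contact (
  contact_id    SERIAL PRIMARY KEY,
  profile_id    INTEGER NOT NULL REFERENCES profile (profile_id),
  contact_type  TEXT NOT NULL,   -- phone, mobile, email_work, email_personal, ...
  contact_value TEXT NOT NULL
);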
This is the most common case in database design... You should stop treating it as something new just because you included JSON :-)
Just create
User: id, name
Contacts: user_id, id, key, value
Emails, addresses, and whatever else work like contacts.
Now you just have to select from User and inner join the other tables on Contacts.user_id = User.id, as in the sketch below.
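A minimal sketch of that query, assuming the tables above ("user" is quoted because it is a reserved word in most dialects):

SELECT u.name, c.key, c.value
FROM "user" AS u
INNER JOIN contacts AS c ON c.user_id = u.id  -- one row per contact entry
WHERE u.id = 1;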