create partitioned table from JSON file in BigQuery - google-bigquery

I would like to create a partitioned table in BigQuery. The schema for the table is in JSON format, stored at a local path.
I would like to create this table, with partitioning, from the JSON file using the "bq mk -t" command. Kindly help.
Thanks in advance.

bq mk --table --schema=file.json PROJECTID:DATASET.TABLE
Hope the above example helps.
You can refer to the bq command-line documentation for more options.
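Note that the command above creates the table without any partitioning. To create it partitioned in the same bq mk call, the time-partitioning flags can be added as well; a minimal sketch (the field name partition_date is only an illustration):
bq mk --table \
  --schema=file.json \
  --time_partitioning_type=DAY \
  --time_partitioning_field=partition_date \
  PROJECTID:DATASET.TABLE
Omitting --time_partitioning_field falls back to ingestion-time partitioning on the _PARTITIONTIME pseudo column.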

One recommendation when using a JSON schema file to create BigQuery tables:
(1) If you decide to partition the table for better performance, you can use the pseudo partition columns (_PARTITIONTIME or _PARTITIONDATE).
(2) Example: partition_date is a column with the data type TIMESTAMP (a column of type DATE can also be used).
{
  "name": "partition_date",
  "type": "TIMESTAMP",
  "mode": "NULLABLE",
  "timePartitioning": {
    "type": "DAY"
  },
  "field": [
    {
      "name": "partition_date",
      "type": "TIMESTAMP"
    }
  ]
},
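For reference, the same kind of table can also be created with BigQuery DDL instead of a schema file; a minimal sketch, assuming only the partition_date TIMESTAMP column shown above (project, dataset and table names are placeholders):
-- Daily column-based partitioning on the TIMESTAMP column.
CREATE TABLE `PROJECTID.DATASET.TABLE` (
  partition_date TIMESTAMP
)
PARTITION BY DATE(partition_date);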

Related

Azure Data Factory - How to map SQL query results to a JSON string?

I have an SQL table called FileGroups. I will query for certain columns such as Id, Name and Version. In my Azure Data Factory pipeline, I want to map the column names and row values from the SQL query results to key-value pairs for a JSON string. I also need to include a couple pipeline parameters in the JSON string. I will then pass this JSON string as input for a stored procedure at the end of my pipeline.
The resulting JSON string will look like this:
{
"id": "guid1",
"name": "fileGroup1",
"version": 1.0,
"pipeline_param_1": "value1",
"pipeline_param_2": "value2"
},
{
"id": "guid2",
"name": "fileGroup2",
"version": 2.0,
"pipeline_param_1": "value1",
"pipeline_param_2": "value2"
}
How do I query the SQL table and construct this JSON string all within my ADF pipeline? What activities or data flow transformations do I need to achieve this?
The easiest way to implement it is by using a Copy activity.
Here is a quick demo that I created. I wanted to transform SQL data into JSON, so I copied the SalesLT.Customer data from the SQL sample data:
Created a SQL database with sample data in the Azure portal.
In Azure Data Factory, I added the database as a dataset.
Created a pipeline and named it "mapSQLDataToJSON".
In the pipeline, I added a Copy activity.
In the Copy activity, I added the SQL DB as the source dataset and used the query option. Query: "@concat('select CustomerID, Title, pipeId= ''', pipeline().RunId, ''' from SalesLT.Customer')"
Here you can select the columns that you need and add new columns to the data, like I did: I added a new column "pipeId" and used pipeline parameters.
In the Copy activity, I added Blob Storage as the sink and set the data type to JSON.
Tested the connections and triggered the pipeline.
I opened the Blob Storage container, clicked on the copied JSON data, and it worked.
(Screenshots: the Copy activity in ADF, and the copied data in Blob Storage.)
You can read more about the Copy activity and pipeline parameters at these links:
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-system-variables
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
If your source database is a Microsoft SQL database, like Azure SQL DB, SQL Server, Managed Instance, Azure Synapse Analytics, etc., then it is quite capable of manipulating JSON. The FOR JSON clause constructs valid JSON, and you can use options like WITHOUT_ARRAY_WRAPPER to produce clean output.
A simple example:
DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp (
id VARCHAR(10) NOT NULL,
[name] VARCHAR(20) NOT NULL,
[version] VARCHAR(5) NOT NULL,
pipeline_param_1 VARCHAR(20) NOT NULL,
pipeline_param_2 VARCHAR(20) NOT NULL
);
INSERT INTO #tmp VALUES
( 'guid1', 'fileGroup1', '1.0', 'value1.1', 'value1.2' ),
( 'guid2', 'fileGroup2', '2.0', 'value2.1', 'value2.2' )
SELECT *
FROM #tmp
FOR JSON PATH, WITHOUT_ARRAY_WRAPPER;
Sample output:
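With the two rows inserted above, the result is a single JSON text value along these lines (split across two lines here for readability; version comes out as a string because the column is VARCHAR):
{"id":"guid1","name":"fileGroup1","version":"1.0","pipeline_param_1":"value1.1","pipeline_param_2":"value1.2"},
{"id":"guid2","name":"fileGroup2","version":"2.0","pipeline_param_1":"value2.1","pipeline_param_2":"value2.2"}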

NULL typing in BigQuery CTEs

I want to execute a merge statement in BigQuery which fills in missing fields coming from late data of an upstream data pipeline. For this, I dynamically create the SQL merge statement using a CTE. Since it is late data, I know that the target table already has a partially filled row. So when creating the dynamic SQL merge statement, I use this information to update only the missing fields: the already present fields are set to NULL in the CTE, and I use T.already_filled_col = IFNULL(S.already_filled_col, T.already_filled_col), where S.already_filled_col is NULL.
However, since it is a CTE without a schema, BigQuery complains that the NULL value in the CTE is of the wrong type. In particular, I get the error
No matching signature for function IFNULL for argument types: INT64, STRING
How do I specify the type of the NULL value in a CTE?
Below is a minimal working example.
Given a target table with the following schema:
[
{
"name": "join_col",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "already_filled_col",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "update_col",
"type": "STRING",
"mode": "NULLABLE"
}
]
Fill it with dummy data:
INSERT `my_project.my_dataset.test_table` (join_col, already_filled_col, update_col)
VALUES
('join_key', 'on_time_data', NULL)
My merge statement:
MERGE `my_project.my_dataset.test_table` AS T
USING (
SELECT
"join_key" AS join_col,
NULL AS already_filled_col,
"late_data" AS update_col
) AS S
ON T.join_col= S.join_col
WHEN MATCHED THEN
UPDATE SET
T.already_filled_col= IFNULL(S.already_filled_col, T.already_filled_col),
T.update_col= IFNULL(S.update_col, T.update_col)
I am aware that I could replace NULL AS already_filled_col in the CTE with CAST(NULL AS STRING) AS already_filled_col. However, I am hoping that there is an easier way, since for the actual data the types are not always STRING and deriving them dynamically is something I was hoping to avoid.
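For completeness, a minimal sketch of that CAST workaround applied to the merge statement above; the only change is the typed NULL in the source subquery:
MERGE `my_project.my_dataset.test_table` AS T
USING (
  SELECT
    "join_key" AS join_col,
    -- A typed NULL instead of a bare NULL, so IFNULL sees two STRING arguments.
    CAST(NULL AS STRING) AS already_filled_col,
    "late_data" AS update_col
) AS S
ON T.join_col = S.join_col
WHEN MATCHED THEN
  UPDATE SET
    T.already_filled_col = IFNULL(S.already_filled_col, T.already_filled_col),
    T.update_col = IFNULL(S.update_col, T.update_col)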

How to display a row without showing its null columns in PostgreSQL?

I want to show all the rows in my table with all the columns except those columns that are null.
-- SELECT all users
SELECT * FROM users
ORDER BY user_id ASC;
-- SELECT a user
SELECT * FROM users
WHERE user_id = $1;
Currently my API's GET request returns something like this with the above queries:
{
"user_id": 10,
"name": "Bruce Wayne",
"username": "Batman",
"email": "bat#cave.com",
"phone": null,
"website": null
}
Is there any way I can display it like this so that the null columns aren't shown?
{
"user_id": 10,
"name": "Bruce Wayne",
"username": "Batman",
"email": "bat#cave.com"
}
I understand that you are serializing (or deserializing) JSON objects in your code. Most serialization modules have special options for this, such as ignoring nulls.
If you generate this JSON data inside the database, in your SQL code, then you can use the Postgres jsonb_strip_nulls(jsonb) function. It recursively removes all keys with null values from the JSONB value and returns JSONB.
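A minimal sketch of that approach, assuming the users table from the question and converting the selected row to JSONB with to_jsonb:
-- Build a JSONB object from the row and drop every key whose value is null.
SELECT jsonb_strip_nulls(to_jsonb(u)) AS user_json
FROM users AS u
WHERE user_id = $1;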

Can I use postgres json operators on a column of type text having json value stored?

I have a table with a column of type text which stores a JSON object as text. Can I use the JSON operator functions to extract values?
For example:
create table t1(app_text text);
insert into t1 values('{"userId": "555", "transaction": {"id": "222", "sku": "ABC"}}');
Why does the SQL below not work?
select app_text -> 'transaction'->>'sku' from t1;
To use a JSON operator, cast the text to json:
select app_text::json -> 'transaction'->>'sku' from t1;

Database schema design for records of nested JSON

I have a list of profile records, each of which looks like the one below:
{
  "name": "Peter Pan",
  "contacts": [
    {
      "key": "mobile",
      "value": "1234-5678"
    }
  ],
  "addresses": [
    {
      "key": "postal",
      "value": "2356 W. Manchester Ave.\nSomewhere District\nA Country"
    },
    {
      "key": "po",
      "value": "PO Box 1234"
    }
  ],
  "emails": [
    {
      "key": "work",
      "value": "abc@work.com"
    },
    {
      "key": "personal",
      "value": "abc@personal.com"
    }
  ],
  "url": "http://www.example.com/"
}
I would think about having the following schema structure:
A profile table with id and name fields.
A profile_contact table with id, profile_id, key, value fields.
A profile_address table with id, profile_id, key, value fields.
A profile_email table with id, profile_id, key, value fields.
However, I think I am creating too many tables for such a simple JSON!
Would there be performance problems when I search across the tables, since many JOINS are performed to retrieve just one record?
What would be a better way to model the above JSON record into the database? In SQL, or better in NoSQL?
It kind of depends.
If you are planning to have an "infinite" number of contacts/addresses/emails per user, then your idea is a pretty good way to go.
You could also consider (something like) the following:
PROFILE table, containing:
PROFILE_ID
NAME
EMAIL_ADDRESS_WORK
EMAIL_ADDRESS_PERSONAL
PHONE_NUMBER
MOBILE_NUMBER
ADDRESS table, containing:
ADDRESS_ID
PROFILE_ID
STREET
CITY
..etc
This means you can set 2 kinds of emails and 2 kinds of phone numbers per user and they are stored with the profile itself.
Alternatively you can choose to have a separate CONTACT table which contains both phone numbers and email addresses (and maybe other types):
CONTACT_TYPE (phone, mobile, email_work, email_personal)
CONTACT_VALUE
PROFILE_ID
All three (mine and yours) could work perfectly. To decide which would work best for you, you should write down all the possibilities there are (and could be). Maybe you want to be able to add 10 email addresses per profile (then storing them with the profile would be silly), or maybe you will have a very large variety of contact types, such as IM, Facebook, ICQ, Twitter (then a CONTACTS table would fit nicely; see the sketch at the end of this answer).
So try to find out/list what types of data you will have and see how that will fit into a specific model, then pick the most suitable one :)
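As a rough sketch of that separate CONTACT table in SQL (table names, column names and types here are just placeholders, not a definitive design):
-- One profile table plus a generic contact table holding typed contact entries.
CREATE TABLE profile (
  profile_id INT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL
);
CREATE TABLE contact (
  profile_id    INT NOT NULL REFERENCES profile (profile_id),
  contact_type  VARCHAR(30)  NOT NULL, -- e.g. 'phone', 'mobile', 'email_work', 'twitter'
  contact_value VARCHAR(255) NOT NULL
);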
This is the most common case for database design... You should stop treating it as something new just because you included JSON :-)
Just create:
User: id, name
Contacts: user_id, id, key, value
Emails, addresses and whatever else are just like Contacts.
Now you just have to select from User and inner-join the other tables on User.id = Contacts.user_id.
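A minimal sketch of that select, assuming tables named users (id, name) and contacts (user_id, id, key, value); the tables are pluralized here because USER is a reserved word in some databases:
-- One user together with all of their contact rows.
SELECT u.id, u.name, c.key, c.value
FROM users AS u
INNER JOIN contacts AS c ON c.user_id = u.id
WHERE u.id = 1;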