BigQuery: Store semi-structured JSON data - sql

I have data which can have varying JSON keys. I want to store all of this data in BigQuery and then explore the available fields later.
My structure will be like so:
[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]
I wanted to use a STRUCT type but it seems all the fields need to be declared?
I then want to be able to query and see how often each key appears, and basically run queries over all records that have a given key, as though that key were its own column.
Side note: this data is coming from URL query strings. Maybe it would be best to push the full URL and then use functions to run the analysis?

There are two primary methods for storing semi-structured data as you have in your example:
Option #1: Store JSON String
You can store the data field as a JSON string, and then use the JSON_EXTRACT function to pull out the values it can find, and it will return NULL for any value it cannot find.
Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:
# Creating an example table using the WITH statement, this would not be needed
# for a real table.
WITH records AS (
SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
UNION ALL
SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" as data
UNION ALL
SELECT 3333 AS id, "{\"a\":27}" as data
UNION ALL
SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
)
# Example Query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
SELECT id,
CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue # Extract & cast as an INT
FROM records
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
There are some pros and cons to this approach:
Pros
The syntax is fairly straightforward
Less error prone
Cons
Storage costs will be slightly higher, since you have to store all the characters needed to serialize the data as JSON.
Queries will run slower than using pure native SQL.
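To make the NULL-on-missing-key behaviour concrete, here is a minimal Python sketch of the same idea outside BigQuery. It is only an illustration: the `extract` helper is a hypothetical stand-in for JSON_EXTRACT, and the records are the four sample rows from the question. It also counts how often each key appears, which the question asks about.

```python
import json
from collections import Counter

# The four sample rows, with the data field stored as a JSON string.
records = [
    (1111, '{"a":27, "b":62, "c": "string"}'),
    (2222, '{"a":27, "c": "string"}'),
    (3333, '{"a":27}'),
    (4444, '{"a":27, "b":62, "c": "string"}'),
]

def extract(data, key):
    """Hypothetical stand-in for JSON_EXTRACT: None when the key is absent."""
    return json.loads(data).get(key)

# Missing keys come back as None and are skipped, like NULLs in SUM().
a_values = [extract(data, "a") for _, data in records]
b_values = [extract(data, "b") for _, data in records]
a_sum = sum(v for v in a_values if v is not None)
b_sum = sum(v for v in b_values if v is not None)

# How often each key appears across all records.
key_counts = Counter(key for _, data in records for key in json.loads(data))

print(a_sum, b_sum)
print(dict(key_counts))
```

Keeping the raw JSON means any new key is available for this kind of exploration without a schema change.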
Option #2: Repeated Fields
BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.
Using the same example, here is how we would do that:
## Using a with to create a sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
(1111, [("a","27"),("b","62"),("c","string")]),
(2222, [("a","27"),("c","string")]),
(3333, [("a","27")]),
(4444, [("a","27"),("b","62"),("c","string")])
])),
## Using another WITH table to take records and unnest them to be joined later
recordsUnnested AS (
SELECT id, key, value
FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
FROM records R
LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# results
# Row | aSum | bSum
# 1 | 108 | 124
As you can see, performing a similar calculation is still rather complex. You also still have to store every value as a string and CAST it to another type when necessary, since you cannot mix types in a repeated field.
Pros
Storage size will be less than JSON
Queries will typically execute faster.
Cons
The syntax is more complex and not as straightforward
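The repeated key/value layout can be pictured with a short Python sketch. This is an illustration only, not BigQuery code: `value_for` and `sum_key` are hypothetical helper names, and the data mirrors the sample rows above. Every value is stored as a string and converted on read, which is the same trade-off as the CAST calls in the query.

```python
# Each record holds a list of (key, value) pairs; all values are strings.
records = [
    (1111, [("a", "27"), ("b", "62"), ("c", "string")]),
    (2222, [("a", "27"), ("c", "string")]),
    (3333, [("a", "27")]),
    (4444, [("a", "27"), ("b", "62"), ("c", "string")]),
]

def value_for(pairs, key):
    """Return the string stored under key, or None if the key is absent."""
    for k, v in pairs:
        if k == key:
            return v
    return None

def sum_key(rows, key):
    """Sum the integer values of one key, skipping records without it."""
    total = 0
    for _, pairs in rows:
        value = value_for(pairs, key)
        if value is not None:
            total += int(value)  # convert on read, like CAST(... AS INT64)
    return total

a_sum = sum_key(records, "a")
b_sum = sum_key(records, "b")
print(a_sum, b_sum)
```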
Hope that helps, good luck.

Create a hardcoded "mapping table" in Trino SQL

I have a query (several CTEs) that gets data from different sources. The output has a column called name, but I would like to map this name to a more user-friendly name.
Id | name
1 | buy
2 | send
3 | paid
I would like to hard code somewhere in the query (in another CTE?) a mapping table. Don't want to create a separate table for it, just plain text.
name_map=[('buy', 'Item purchased'),('send', 'Parcel in transit'), ('paid', 'Charge processed')]
So output table would be:
Id | name
1 | Item purchased
2 | Parcel in transit
3 | Charge processed
In Trino I see the function map_from_entries and element_at, but don't know if they could work in this case.
I know "case when" might work, but if possible, a mapping table would be more convenient.
Thanks
As a simpler alternative to the other answer, you don't actually need to create an intermediate map using map_from_entries and look up values using element_at. You can just create an inline mapping table with VALUES and use a regular JOIN to do the lookups:
WITH mapping(name, description) AS (
VALUES
('buy', 'Item purchased'),
('send', 'Parcel in transit'),
('paid', 'Charge processed')
)
SELECT description
FROM t JOIN mapping ON t.name = mapping.name
(The query assumes your data is in a table named t that contains a column named name to use for the lookup)
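For anyone who wants to try the inline mapping table without a Trino cluster, here is a small sketch that runs the same query shape through Python's built-in sqlite3 module. SQLite is only a stand-in here and its dialect differs from Trino in other respects. The table t and its three rows are assumed sample data taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed sample data: the table t from the question.
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "buy"), (2, "send"), (3, "paid")],
)

# The mapping lives entirely inside the query as a VALUES-based CTE.
query = """
WITH mapping(name, description) AS (
    VALUES
        ('buy', 'Item purchased'),
        ('send', 'Parcel in transit'),
        ('paid', 'Charge processed')
)
SELECT t.id, mapping.description
FROM t JOIN mapping ON t.name = mapping.name
ORDER BY t.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that an inner join drops rows whose name has no entry in the mapping; a LEFT JOIN would keep them with a NULL description.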
Super interesting idea, and I think I got it working:
with tmp as (
SELECT *
FROM (VALUES ('1', 'buy'),
('2', 'send'),
('3', 'paid')) as t(id, name)
)
SELECT element_at(name_map, name) as name
FROM tmp
JOIN (VALUES map_from_entries(
ARRAY[('buy', 'Item purchased'),
('send', 'Parcel in transit'),
('paid', 'Charge processed')])) as t(name_map) ON TRUE
Output:
name
Item purchased
Parcel in transit
Charge processed
To see a bit more of what's happening, we can look at:
SELECT *, element_at(name_map, name) as name
id | name | name_map | name
1 | buy | {buy=Item purchased, paid=Charge processed, send=Parcel in transit} | Item purchased
2 | send | {buy=Item purchased, paid=Charge processed, send=Parcel in transit} | Parcel in transit
3 | paid | {buy=Item purchased, paid=Charge processed, send=Parcel in transit} | Charge processed
I'm not sure how efficient this is, but it's certainly an interesting idea.
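Conceptually, the map approach is a dictionary lookup applied to every row. A rough Python analogy, purely illustrative (the extra 'refund' row is a hypothetical addition to show the missing-key case):

```python
# Equivalent of map_from_entries: build a lookup from (key, value) pairs.
name_map = dict([
    ("buy", "Item purchased"),
    ("send", "Parcel in transit"),
    ("paid", "Charge processed"),
])

# Sample rows from the query above, plus one hypothetical unmapped name.
rows = [("1", "buy"), ("2", "send"), ("3", "paid"), ("4", "refund")]

# dict.get returns None for a missing key, much like element_at returns NULL.
mapped = [(row_id, name_map.get(name)) for row_id, name in rows]
for row in mapped:
    print(row)
```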

BigQuery: Count consecutive string matches between two fields

I have two tables:
Master_Equipment_Index (alias mei) containing the columns serial_num & model_num
Customer Equipment Index (alias cei) containing the columns account_num, serial_num, & model_num
Originally, guard rails were not implemented to require model attribute input in the mei data whenever new serial_num records were inserted. Whenever that serial_num is later associated with a customer account in the cei data, the model data carries over as null.
What I want to do is backfill the missing model attributes in the cei data from the mei data based on the strongest sequential character match from other similar serial_nums in the mei data.
To further clarify, I don't have access to mass update the mei or cei datasets. I can formalize change requests, but I need to build the function out to prove its worth. So this has to be done outside of any mass action query updates.
cei.account_num | cei.serial_num | cei.model | mei.serial_num | mei.model | serial_num_str_match | row_number
123123123 | B4I4SXT1708 | null | B4I4SXT178A | Model_Series1 | 8 | 1
123123123 | B4I4SXT1708 | null | B4I4SXTAS34 | Model_Series2 | 7 | 2
In the table example above row_number 1 has a higher consecutive string match count than row_number 2. I want to only return row_number 1 and populate cei.model with mei.model's value.
cei.account_num | cei.serial_num | cei.model | mei.serial_num | mei.model | serial_num_str_match | row_number
123123123 | B4I4SXT1708 | Model_Series1 | B4I4SXT178A | Model_Series1 | 8 | 1
To give an idea as to scale:
The mei data contains 1 million records and the cei data contains 50,000 records. I would have to take and perform this string match for every single cei.account_num, cei.serial_num where the cei.model data is null.
With MAC addresses, the first 6 characters identify the vendor. I could treat the serial numbers similarly, as in the sample SQL below, and join on the first 6 characters to reduce the volume of one-to-many lookups taking place:
/* need to define function */
create temp function string_match_function(x any type, y any type) as (
syntax to generate consecutive string count matches between x and y
);
select * from (
select
a.account_num,
a.serial_num,
a.model,
row_number() over(partition by c.account_num, c.serial_num order by serial_num_str_match desc) seq
from (
select
c.account_num,
c.serial_num,
m.model,
needed: string_match_function(c.serial_num, m.serial_num) as serial_num_str_match
from (
select * from cei where model is null
) c
join (
select * from mei where model is not null
) m on substr(c.serial_num,1,6) = substr(m.serial_num,1,6)
) as a
) as b
where seq = 1
I've looked at different options, some coming from https://hoffa.medium.com/new-in-bigquery-persistent-udfs-c9ea4100fd83, but I'm not finding what I need.
Any insight or direction would be greatly appreciated.
This UDF counts how many characters the two strings have in common, starting from the beginning:
CREATE TEMP FUNCTION string_match_function(x string, y string)
RETURNS int64
LANGUAGE js
AS r"""
var i=0;
var max_len= Math.min(x.length,y.length);
for(i=0;i<max_len;i++){
if(x[i]!=y[i]) {return i;}
}
return i;
""";
select string_match_function("12a345","1234")
gives 2, because both start with 12
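For testing the logic outside BigQuery, the same prefix count can be sketched in Python, together with the "keep only the strongest match" step from the question. This is an assumption-laden illustration: `common_prefix_len` and `best_match` are hypothetical names, and the candidate list uses the two mei rows from the question's example.

```python
def common_prefix_len(x, y):
    """Count how many leading characters x and y share."""
    count = 0
    for cx, cy in zip(x, y):  # zip stops at the shorter string
        if cx != cy:
            break
        count += 1
    return count

def best_match(serial, candidates):
    """Pick the (serial_num, model) candidate with the longest shared prefix."""
    return max(candidates, key=lambda c: common_prefix_len(serial, c[0]))

# The two mei rows from the question's example.
mei_rows = [
    ("B4I4SXT178A", "Model_Series1"),
    ("B4I4SXTAS34", "Model_Series2"),
]

print(common_prefix_len("12a345", "1234"))
print(best_match("B4I4SXT1708", mei_rows))
```

When two candidates tie, `max` keeps the first one it sees, so a real backfill would probably need an explicit tie-break rule.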

How to include static field without data from a dataset to carry it?

I'm improving a report that currently uses a static table using the lookup function to fill its data from a few different datasets. We're pretty sure this is causing the report to take a lot longer to run, so I'm trying to use a table that uses column groups to achieve the same effect from a single dataset.
Here's what my query currently looks like. This functions exactly as I want it to as long as there's data.
Select CatName, CatCount, Category = 'Category 1', Sorting = 1
FROM
(Select CatName, Count(CatName) as CatCount FROM DataSet WHERE Parameters)
UNION
Select CatName, CatCount, Category = 'Category 2', Sorting = 2
FROM
(Select CatName, Count(CatName) as CatCount FROM DataSet WHERE Parameters)
When there are CatNames and CatCounts to pull from the select statement, the Category works and is pulled by the table as a column group. I need all of the groups to exist at all times.
However, sometimes we don't have data that fits the parameters for a category. The result when that happens is that there isn't a row for the Category field to use and that group doesn't exist in the table. Is there any way I can force the Category field to exist regardless of the data?
If I understand the question correctly, then you may be able to use ISNULL. ISNULL returns check_expression if it is not NULL, and replacement_value otherwise.
ISNULL ( check_expression , replacement_value )
Select CatName, CatCount, Category = 'Category 2', Sorting = 2
FROM
(Select isnull(CatName,'') as CatName, Count(CatName) as CatCount FROM DataSet WHERE Parameters)
EDIT
How about a left outer join?
Select b.CatName, b.CatCount, Category = 'Category 2', Sorting = 2
FROM
(select '' as CatName, 0 as Catcount) a left outer join (Select CatName, Count(CatName) as CatCount FROM DataSet WHERE Parameters) b on a.CatName = b.CatName
Found a solution. Took a few tries, not the prettiest, and we'll have to see if it actually improves performance, but it works the way we wanted. Generalized code:
Select C.CatName, C.CatCount, Category = 'Category 1', Sorting = 1
FROM
(Select Top 5 B.CatName, Count(B.CatName) as CatCount
FROM
(Select CatName = case when CatOnlyParam in (Category1Filter) then A.CatName else NULL end
FROM
(Select CatName FROM DataSet WHERE GeneralParameters) as A
) as B
order by CatCount
) as C
UNION
etc
Separating the parameters into different steps guarantees that there will be values for each category, even if those values are NULL. I'm sure there's a cleaner way to get the same effect, but this functions.
Working from the inside out:
Stage 1 (Select statement A): Selects the value from the dataset with very general parameters (between a start and end date, resolved or not, etc).
Stage 2 (Select statement B): Uses the case statement to only pull the data that is relevant for this department while leaving behind NULLs for the data that isn't.
Stage 3 (Select statement C): Takes the data from the list of names and NULLs and gets a count from it. Sorts by that count and takes the top 5. If a category has no data, then the nulls will get "counted" to 0 and passed on to the final step.
Stage 4 (Final select statement): Adds the static fields to the information from the previous step. A category without data will get passed to this as:
CatName: NULL
CatCount: 0
Category: "Category 1"
Sorting: 1
Then this is repeated for the other categories and UNION'd together. Any suggestions to improve this are more than welcome.
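One possibly cleaner alternative, offered only as a sketch and not as a tested replacement for the report query: keep the categories in a small static table and LEFT JOIN the counts onto it, so every category yields at least one row. The example runs on Python's sqlite3 as a stand-in for the report's database, omits the Top 5 step, and assumes a hypothetical Dept column that decides which category a row belongs to. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical data: only Category 1 has matching rows.
conn.execute("CREATE TABLE DataSet (CatName TEXT, Dept TEXT)")
conn.executemany(
    "INSERT INTO DataSet VALUES (?, ?)",
    [
        ("Printer jam", "Category 1"),
        ("Printer jam", "Category 1"),
        ("Login issue", "Category 1"),
    ],
)

query = """
WITH categories(Category, Sorting) AS (
    VALUES ('Category 1', 1), ('Category 2', 2)
),
counts AS (
    SELECT Dept, CatName, COUNT(*) AS CatCount
    FROM DataSet
    GROUP BY Dept, CatName
)
SELECT counts.CatName,
       COALESCE(counts.CatCount, 0) AS CatCount,
       categories.Category,
       categories.Sorting
FROM categories
LEFT JOIN counts ON counts.Dept = categories.Category
ORDER BY categories.Sorting, CatCount DESC, counts.CatName
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The empty category comes back with a NULL CatName and a CatCount of 0, the same shape as the Stage 4 output described above.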

How to query all entries with a value in a nested bigquery table

I generated a BigQuery table using an existing BigTable table, and the result is a multi-nested dataset that I'm struggling to query from. Here's the format of an entry from that BigQuery table just doing a simple select * from my_table limit 1:
[
{
"rowkey": "XA_1234_0",
"info": {
"column": [],
"somename": {
"cell": [
{
"timestamp": "1514357827.321",
"value": "1234"
}
]
},
...
}
},
...
]
What I need is to be able to get all entries from my_table where the value of somename is X, for instance. There will be multiple rowkeys where the value of somename will be X and I need all the data from each of those rowkey entries.
OR
If I could have a query where rowkey contains X, so as to get "XA_1234_0", "XA_1234_1"... The "XA" and the "0" can change, but the middle numbers must stay the same. I've tried doing a where rowkey like "$_1234_$", but the query runs for over a minute, which is far too long.
I am using standard SQL.
EDIT: Here's an example of a query I tried that didn't work (with error: Cannot access field value on a value with type ARRAY<STRUCT<timestamp TIMESTAMP, value STRING>>), but best describes what I'm trying to achieve:
SELECT * FROM `my_dataset.mytable` where info.field_name.cell.value=12345
I want to get all records whose value in field_name equals some value.
From the sample Firebase Analytics dataset:
#standardSQL
SELECT *
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
WHERE EXISTS(
SELECT * FROM UNNEST(user_dim.user_properties)
WHERE key='powers' AND value.value.string_value='20'
)
LIMIT 1000
Below is for BigQuery Standard SQL
#standardSQL
SELECT t.*
FROM `my_dataset.mytable` t,
UNNEST(info.somename.cell) c
WHERE c.value = '1234'
The above assumes that a specific value can appear in each record just once. I hope this is true for your case.
If this is not the case, the query below should handle it:
#standardSQL
SELECT *
FROM `yourproject.yourdataset.yourtable`
WHERE EXISTS(
SELECT *
FROM UNNEST(info.somename.cell)
WHERE value = '1234'
)
which, I just realised, is pretty much the same as Felipe's version, only using your table / schema
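The EXISTS-over-UNNEST pattern amounts to "keep the row if any element of the nested array matches". A small Python sketch of that logic, using the row from the question plus one hypothetical non-matching row (`has_value` is an illustrative helper name):

```python
# Rows shaped like the BigQuery export; the second row is hypothetical.
rows = [
    {
        "rowkey": "XA_1234_0",
        "info": {"somename": {"cell": [
            {"timestamp": "1514357827.321", "value": "1234"},
        ]}},
    },
    {
        "rowkey": "XB_5678_0",
        "info": {"somename": {"cell": [
            {"timestamp": "1514357830.000", "value": "5678"},
        ]}},
    },
]

def has_value(row, wanted):
    """True if any cell of info.somename holds the wanted value."""
    cells = row["info"]["somename"]["cell"]
    return any(cell["value"] == wanted for cell in cells)

# Each matching row is returned once, however many of its cells match.
matches = [row["rowkey"] for row in rows if has_value(row, "1234")]
print(matches)
```

This also shows why the original attempt failed: cell is a list, so the value field has to be checked per element and cannot be read from the list directly.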

SQL returns the unique identifier instead of the value in my Access UNION ALL SQL

Here is my project, using MS Access 2010.
I have developed 2 queries to select 2 different reading periods. These queries are called CycleStart and CycleEnd. When I run these 2 queries individually, I get the expected results. Both queries pull data from tables that contain a couple of lookup fields, and each lookup field uses another table with only 2 columns. As the next step, I use SQL to create a UNION ALL query that brings these 2 cycle queries together for reporting purposes. The problem I run into is that the resulting union query does not output the same information as the 2 individual cycle queries.
Now the specific issue. My cycle queries have a couple of lookup fields referencing another table. For example, the Read_Cycle field comes from a table (Read_Cycles) that has only 2 columns: the unique identifier assigned by Access and the Read_Cycle column with the data I enter. When I run the cycle queries, the Read_Cycle field returns the Read_Cycle data as expected, but the union query does not. Here is some of the structure of my project:
Read_Cycles Table
ID (Col1) | Cycle_ID (Col2)
1 | Spring
2 | Fall
3 | Winter
The data tables behind the CycleStart and the CycleEnd have fields that are lookup values referencing the above described Read_Cycles table.
The CycleStart and CycleEnd queries correctly return Spring, Fall, or Winter, whichever value is associated with the record.
However, the problem I have is that the UNION ALL query returns the ID instead of the value, so instead of getting Fall, I get 2.
Here is my UNION ALL SQL:
SELECT "CycleEnd" AS source,
[CycleEnd].[Recloser_SerialNo],
[CycleEnd].[Read_Date],
[CycleEnd].[3_Phase_Reading],
[CycleEnd].[A_Phase_Reading],
[CycleEnd].[B_Phase_Reading],
[CycleEnd].[C_Phase_Reading],
[CycleEnd].[Read_Cycle],
[CycleEnd].[PoleNo],
[CycleEnd].[Substation],
[CycleEnd].[Feeder],
[CycleEnd].[Feeder_Description],
[CycleEnd].[Recloser_Location]
FROM [CycleEnd]
UNION ALL
SELECT "CycleStart" AS source,
[CycleStart].[Recloser_SerialNo],
[CycleStart].[Read_Date],
[CycleStart].[3_Phase_Reading] * - 1,
[CycleStart].[A_Phase_Reading] * - 1,
[CycleStart].[B_Phase_Reading] * - 1,
[CycleStart].[C_Phase_Reading] * - 1,
[CycleStart].[Read_Cycle],
[CycleStart].[PoleNo],
[CycleStart].[Substation],
[CycleStart].[Feeder],
[CycleStart].[Feeder_Description],
[CycleStart].[Recloser_Location]
FROM [CycleStart];
All other fields are coming across just fine and as expected, I have narrowed it down to only fields that are a lookup in the original tables.
Any help would be greatly appreciated. Also my SQL experience is really limited so example code would help greatly.
UPDATE:
here is the sql from the CycleEnd that works. I got this by building the query then changing to the SQL view...
SELECT Recloser_Readings.Recloser_SerialNo,
Recloser_Readings.Read_Date,
Recloser_Readings.[3_Phase_Reading],
Recloser_Readings.A_Phase_Reading,
Recloser_Readings.B_Phase_Reading,
Recloser_Readings.C_Phase_Reading,
Recloser_Locations.PoleNo,
Recloser_Locations.Substation,
Recloser_Locations.Feeder,
Recloser_Locations.Feeder_Description,
Recloser_Locations.Recloser_Location,
Recloser_Readings.Read_Cycle
FROM (
Recloser_Inventory LEFT JOIN Recloser_Locations
ON Recloser_Inventory.PoleNo = Recloser_Locations.PoleNo
)
RIGHT JOIN Recloser_Readings
ON Recloser_Inventory.Serial_No = Recloser_Readings.Recloser_SerialNo
WHERE (((Recloser_Readings.Read_Cycle) = "8"));
UPDATE#2
I noticed I grabbed the wrong code that references the Read_Cycles table. Here it is...
SELECT Read_Cycles.Cycle_ID, Read_Cycles.ID
FROM Read_Cycles
ORDER BY Read_Cycles.Cycle_ID DESC;
UPDATE : SYNTAX ERROR FROM THE FOLLOWING CODE!!
SELECT "CycleEnd" as source,
[CycleEnd].[Recloser_SerialNo],
[CycleEnd].[Read_Date],
[CycleEnd].[3_Phase_Reading],
[CycleEnd].[A_Phase_Reading],
[CycleEnd].[B_Phase_Reading],
[CycleEnd].[C_Phase_Reading],
[CycleEnd].[Read_Cycle],
[CycleEnd].[PoleNo],
[CycleEnd].[Substation],
[CycleEnd].[Feeder],
[CycleEnd].[Feeder_Description],
[CycleEnd].[Recloser_Location]
FROM [CycleEnd] JOIN [Read_Cycles] ON [CycleEnd].[Read_Cycle] = [Read_Cycles].[ID]
UNION ALL SELECT "CycleStart" as source,
[CycleStart].[Recloser_SerialNo],
[CycleStart].[Read_Date],
[CycleStart].[3_Phase_Reading]*-1,
[CycleStart].[A_Phase_Reading]*-1,
[CycleStart].[B_Phase_Reading]*-1,
[CycleStart].[C_Phase_Reading]*-1,
[CycleStart].[Read_Cycle],
[CycleStart].[PoleNo],
[CycleStart].[Substation],
[CycleStart].[Feeder],
[CycleStart].[Feeder_Description],
[CycleStart].[Recloser_Location]
FROM [CycleStart] JOIN [Read_Cycles] ON [CycleStart].[Read_Cycle] = [Read_Cycles].[ID];
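As a hedged illustration of the join idea attempted above, the sketch below uses Python's sqlite3 rather than Access, with heavily simplified, hypothetical versions of CycleEnd and CycleStart (a single Reading column instead of the four phase readings). Two things differ from the failing query: each join is written as INNER JOIN, since Access SQL generally expects an explicit join type, and the SELECT list returns Read_Cycles.Cycle_ID instead of the stored ID. Whether this matches the real Access schema is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Read_Cycles (ID INTEGER PRIMARY KEY, Cycle_ID TEXT);
INSERT INTO Read_Cycles VALUES (1, 'Spring'), (2, 'Fall'), (3, 'Winter');

-- Simplified, hypothetical stand-ins for the two cycle queries.
CREATE TABLE CycleEnd (Recloser_SerialNo TEXT, Reading REAL, Read_Cycle INTEGER);
CREATE TABLE CycleStart (Recloser_SerialNo TEXT, Reading REAL, Read_Cycle INTEGER);
INSERT INTO CycleEnd VALUES ('R-1', 150.0, 2);
INSERT INTO CycleStart VALUES ('R-1', 100.0, 1);
""")

# Join each half to the lookup table and select the text, not the ID.
query = """
SELECT 'CycleEnd' AS source, e.Recloser_SerialNo, e.Reading,
       rc.Cycle_ID AS Read_Cycle
FROM CycleEnd AS e
INNER JOIN Read_Cycles AS rc ON e.Read_Cycle = rc.ID
UNION ALL
SELECT 'CycleStart', s.Recloser_SerialNo, s.Reading * -1, rc.Cycle_ID
FROM CycleStart AS s
INNER JOIN Read_Cycles AS rc ON s.Read_Cycle = rc.ID
"""
rows = sorted(conn.execute(query).fetchall())
for row in rows:
    print(row)
```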