Optimize SQL query and process result - sql

I am looking for advice on optimizing the following sample query and processing the result. The SQL variant in use is the internal FileMaker ExecuteSQL engine which is limited to the SELECT statement with the following syntax:
SELECT [DISTINCT] {* | column_expression [[AS] column_alias],...}
FROM table_name [table_alias], ...
[ WHERE expr1 rel_operator expr2 ]
[ GROUP BY {column_expression, ...} ]
[ HAVING expr1 rel_operator expr2 ]
[ UNION [ALL] (SELECT...) ]
[ ORDER BY {sort_expression [DESC | ASC]}, ... ]
[ OFFSET n {ROWS | ROW} ]
[ FETCH FIRST [ n [ PERCENT ] ] { ROWS | ROW } {ONLY | WITH TIES } ]
[ FOR UPDATE [OF {column_expression, ...}] ]
The query:
SELECT item1 AS val, interval, interval_next FROM meddata
WHERE fk = 12 AND active1 = 1 UNION
SELECT item2 AS val, interval, interval_next FROM meddata
WHERE fk = 12 AND active2 = 1 UNION
SELECT item3 AS val, interval, interval_next FROM meddata
WHERE fk = 12 AND active3 = 1 UNION
SELECT item4 AS val, interval, interval_next FROM meddata
WHERE fk = 12 AND active4 = 1 ORDER BY val
This may give the following result as a sample:
val,interval,interval_next
Artelac,0,1
Artelac,3,6
Celluvisc,1,3
Celluvisc,12,24
What I am looking to achieve (in addition to suggestions for optimization) is a result formatted like this:
val,interval,interval_next,interval,interval_next,interval,interval_next,interval,interval_next ->etc
Artelac,0,1,3,6
Celluvisc,1,3,12,24
Preferably I would like this processed result to be produced by the SQL engine.
Possible?
Thank you.
EDIT: I included the column names in the result for clarity, though they are not part of the result. I wish to illustrate that there may be an arbitrary number of 'interval' and 'interval_next' columns in the result.

I do not think you need to optimise your query; it looks fine to me.
You are looking for something like PIVOT in T-SQL, which is not supported in FQL. Your biggest issue is going to be the variable number of columns returned.
I think the best approach is to get your intermediate result and use a FileMaker script or Custom Function to pivot it.
An alternative is to get the list of distinct val values and loop through them (with a CF or script), running an FQL statement for each one, as sketched below. You will not be able to combine them with UNION, as it requires the same number of columns in every SELECT.
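A minimal sketch of that per-value approach, assuming the looping is done in a FileMaker script or custom function as suggested above, and that the ? placeholders are supplied through ExecuteSQL's optional arguments. First, the list of distinct values (UNION already removes duplicates):

SELECT item1 AS val FROM meddata WHERE fk = 12 AND active1 = 1 UNION
SELECT item2 AS val FROM meddata WHERE fk = 12 AND active2 = 1 UNION
SELECT item3 AS val FROM meddata WHERE fk = 12 AND active3 = 1 UNION
SELECT item4 AS val FROM meddata WHERE fk = 12 AND active4 = 1
ORDER BY val

Then, for each value returned, run one statement and let the script append the returned interval pairs onto that value's output row:

SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active1 = 1 AND item1 = ? UNION
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active2 = 1 AND item2 = ? UNION
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active3 = 1 AND item3 = ? UNION
SELECT interval, interval_next FROM meddata
WHERE fk = 12 AND active4 = 1 AND item4 = ?
ORDER BY interval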

Related

BigQuery: Store semi-structured JSON data

I have data which can have varying JSON keys. I want to store all of this data in BigQuery and then explore the available fields later.
My structure will be like so:
[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]
I wanted to use a STRUCT type but it seems all the fields need to be declared?
I then want to be able to query and see how often each key appears, and basically run queries over all records with for example a key as though it was in its own column.
Side note: this data is coming from URL query strings; maybe someone thinks it is best to push the full URL and then use functions to run the analysis?
There are two primary methods for storing semi-structured data as you have in your example:
Option #1: Store JSON String
You can store the data field as a JSON string, and then use the JSON_EXTRACT function to pull out the values it can find, and it will return NULL for any value it cannot find.
Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:
# Creating an example table using the WITH statement; this would not be needed
# for a real table.
WITH records AS (
  SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" AS data
  UNION ALL
  SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" AS data
  UNION ALL
  SELECT 3333 AS id, "{\"a\":27}" AS data
  UNION ALL
  SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" AS data
)
# Example query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
  SELECT id,
    CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # extract & cast as an INT
    CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue  # extract & cast as an INT
  FROM records
)
# results
# Row | aSum | bSum
# 1   | 108  | 124
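Since you also mentioned wanting to see how often each key appears, here is a minimal sketch of how that could look with the JSON-string approach, reusing the records table from the WITH block above (JSON_EXTRACT returns NULL where a key is missing, and COUNTIF counts the rows where the expression is true):

SELECT
  COUNTIF(JSON_EXTRACT(data, "$.a") IS NOT NULL) AS aCount,
  COUNTIF(JSON_EXTRACT(data, "$.b") IS NOT NULL) AS bCount,
  COUNTIF(JSON_EXTRACT(data, "$.c") IS NOT NULL) AS cCount
FROM records
# results
# Row | aCount | bCount | cCount
# 1   | 4      | 2      | 3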
There are some pros and cons to this approach:
Pros
The syntax is fairly straightforward
Less error prone
Cons
Storage costs will be slightly higher, since you have to store all the characters needed to serialize to JSON.
Queries will run slower than using pure native SQL.
Option #2: Repeated Fields
BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.
Using the same example, here is how we would do that:
## Using a WITH to create a sample table
WITH records AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
    (1111, [("a","27"),("b","62"),("c","string")]),
    (2222, [("a","27"),("c","string")]),
    (3333, [("a","27")]),
    (4444, [("a","27"),("b","62"),("c","string")])
  ])
),
## Using another WITH table to take records and unnest them, to be joined later
recordsUnnested AS (
  SELECT id, keyVals.key, keyVals.value
  FROM records, UNNEST(records.data) AS keyVals
)
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
  SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
  FROM records R
  LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
  LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)
# results
# Row | aSum | bSum
# 1   | 108  | 124
As you can see, performing a similar query is still rather complex. You also still have to store items like strings and CAST them to other values when necessary, since you cannot mix types in a repeated field.
Pros
Storage size will be less than with JSON
Queries will typically execute faster.
Cons
The syntax is more complex and not as straightforward
Hope that helps, good luck.

How to query all entries with a value in a nested bigquery table

I generated a BigQuery table using an existing BigTable table, and the result is a multi-nested dataset that I'm struggling to query from. Here's the format of an entry from that BigQuery table just doing a simple select * from my_table limit 1:
[
{
"rowkey": "XA_1234_0",
"info": {
"column": [],
"somename": {
"cell": [
{
"timestamp": "1514357827.321",
"value": "1234"
}
]
},
...
}
},
...
]
What I need is to be able to get all entries from my_table where the value of somename is X, for instance. There will be multiple rowkeys where the value of somename will be X and I need all the data from each of those rowkey entries.
OR
If I could have a query where rowkey contains X, so to get "XA_1234_0", "XA_1234_1"... The "XA" and the "0" can change but the middle numbers to be the same. I've tried doing a where rowkey like "$_1234_$" but the query goes on for over a minute and is way too long for some reason.
I am using standard SQL.
EDIT: Here's an example of a query I tried that didn't work (with error: Cannot access field value on a value with type ARRAY<STRUCT<timestamp TIMESTAMP, value STRING>>), but it best describes what I'm trying to achieve:
SELECT * FROM `my_dataset.mytable` where info.field_name.cell.value=12345
I want to get all records whose value in field_name equals some value.
From the sample Firebase Analytics dataset:
#standardSQL
SELECT *
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
WHERE EXISTS(
SELECT * FROM UNNEST(user_dim.user_properties)
WHERE key='powers' AND value.value.string_value='20'
)
LIMIT 1000
Below is for BigQuery Standard SQL
#standardSQL
SELECT t.*
FROM `my_dataset.mytable` t,
UNNEST(info.somename.cell) c
WHERE c.value = '1234'
The above assumes the specific value can appear in each record just once - hopefully that is true for your case.
If it is not, the below should do it:
#standardSQL
SELECT *
FROM `yourproject.yourdadtaset.yourtable`
WHERE EXISTS(
SELECT *
FROM UNNEST(info.somename.cell)
WHERE value = '1234'
)
which, I just realised, is pretty much the same as Felipe's version - just using your table / schema
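As a side note on the rowkey part of the question: in SQL, the LIKE wildcards are % (any sequence of characters) and _ (any single character), not $. Purely as a sketch, a regular expression on the rowkey could look like this (the exact pattern is an assumption about how your keys are built):

#standardSQL
SELECT *
FROM `my_dataset.mytable`
WHERE REGEXP_CONTAINS(rowkey, r'^[A-Z]+_1234_[0-9]+$')  # prefix letters and trailing number may vary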

postgreSQL query empty array fields within jsonb column

device_id | device
----------+-------------------------------------------------------------------------------------------
     9809 | { "name" : "printer", "tags" : [] }
     9810 | { "name" : "phone", "tags" : [{"count": 2, "price" : 77}, {"count": 3, "price" : 37} ] }
For the following postgres SQL query on a jsonb column "device" that contains array 'tags':
SELECT t.device_id, elem->>'count', elem->>'price'
FROM tbl t, json_array_elements(t.device->'tags') elem
where t.device_id = 9809
device_id is the primary key.
I have two issues that I don't know how to solve:
tags is an array field that may be empty, in which case I get 0 rows. I want output whether tags is empty or not; dummy values are OK.
If tags contains multiple elements, I get multiple rows for the same device_id. How can I aggregate those multiple elements into one row?
Your first problem can be solved by using a left outer join, that will substitute NULL values for missing matches on the right side.
The second problem can be solved with an aggregate function like json_agg, array_agg or string_agg, depending on the desired result type:
SELECT t.device_id,
jsonb_agg(elem->>'count'),
jsonb_agg(elem->>'price')
FROM tbl t
LEFT JOIN LATERAL jsonb_array_elements(t.device->'tags') elem
ON TRUE
GROUP BY t.device_id;
You will get a JSON array containing just null for those rows where the array is empty, I hope that is ok for you.
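With the sample data above, the result should look roughly like this (the values come out as JSON strings because ->> returns text):

 device_id | jsonb_agg  | jsonb_agg
-----------+------------+--------------
      9809 | [null]     | [null]
      9810 | ["2", "3"] | ["77", "37"]

If you want nicer column names, just alias the aggregates, e.g. jsonb_agg(elem->>'count') AS counts.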

snowflake json lateral subquery

I have the following in snowflake:
create or replace table json_tmp as select column1 as id, parse_json(column2) as c
from VALUES (1,
'{"id": "0x1",
"custom_vars": [
{ "key": "a", "value": "foo" },
{ "key": "b", "value": "bar" }
] }') v;
Based on the FLATTEN docs, I hoped to turn these into a table looking like this:
+-------+---------+-----+-----+
| db_id | json_id | a | b |
+-------+---------+-----+-----+
| 1 | 0x1 | foo | bar |
+-------+---------+-----+-----+
Here is the query I tried; it resulted in a SQL compilation error: "Object 'CUSTOM_VARS' does not exist."
select json_tmp.id as dbid,
f.value:id as json_id,
a.v,
b.v
from json_tmp,
lateral flatten(input => json_tmp.c) as f,
lateral flatten(input => f.value:custom_vars) as custom_vars,
lateral (select value:value as v from custom_vars where value:key = 'a') as a,
lateral (select value:value as v from custom_vars where value:key = 'b') as b;
What exactly is the error here? Is there a better way to do this transformation?
Note - your solution doesn't actually perform any joins - flatten is a "streaming" operation, it "explodes" the input, and then selects the rows it wants. If you only have 2 attributes in the data, it should be reasonably fast. However, if not, it can lead to an unnecessary data explosion (e.g. if you have 1000s of attributes).
The fastest solution depends on how your data is structured exactly, and what you can assume about the input. For example, if you know that 'a' and 'b' are always in that order, you can obviously use
select
id as db_id,
c:id,
c:custom_vars[0].value,
c:custom_vars[1].value
from json_tmp;
If you know that custom_vars is always 2 elements, but the order is not known, you could do e.g.
select
id as db_id,
c:id,
iff(c:custom_vars[0].key = 'a', c:custom_vars[0].value, c:custom_vars[1].value),
iff(c:custom_vars[0].key = 'b', c:custom_vars[0].value, c:custom_vars[1].value)
from json_tmp;
If the size of custom_vars is unknown, you could create a JavaScript function like extract_key(custom_vars, key) that would iterate over custom_vars and return value for the found key (or e.g. null or <empty_string> if not found).
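For illustration, a minimal sketch of what such a function could look like (extract_key and its signature are assumptions, not an existing built-in; inside a Snowflake JavaScript UDF the arguments are referenced in upper case):

create or replace function extract_key(custom_vars variant, key_name varchar)
returns variant
language javascript
as
$$
  // CUSTOM_VARS arrives as a JavaScript array of {key, value} objects
  for (var i = 0; i < CUSTOM_VARS.length; i++) {
    if (CUSTOM_VARS[i].key === KEY_NAME) {
      return CUSTOM_VARS[i].value;
    }
  }
  return null;  // key not found
$$;

select
  id as db_id,
  c:id as json_id,
  extract_key(c:custom_vars, 'a') as a,
  extract_key(c:custom_vars, 'b') as b
from json_tmp;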
Hope this helps. If not, please provide more details about your problem (data, etc).
Update Nov 2019
There seems to be a function that does this sort of thing:
select json_tmp.id as dbid,
json_tmp.c:id as json_id,
object_agg(custom_vars.value:key, custom_vars.value:value):a as a,
object_agg(custom_vars.value:key, custom_vars.value:value):b as b
from
json_tmp,
lateral flatten(input => json_tmp.c, path => 'custom_vars') custom_vars
group by json_tmp.id
Original answer Sept 2017
The following query seems to work:
select json_tmp.id as dbid,
json_tmp.c:id as json_id,
a.value:value a,
b.value:value b
from
json_tmp,
lateral flatten(input => json_tmp.c, path => 'custom_vars') a,
lateral flatten(input => json_tmp.c, path => 'custom_vars') b
where a.value:key = 'a' and b.value:key = 'b'
;
I'd prefer to filter in a subquery rather than in the join condition, so I'm still interested in seeing other answers.

SQL: When and why are two on conditions allowed?

Question:
I recently had an interesting SQL problem.
I had to get the leasing contract for a leasing object.
The problem was, there could be multiple leasing contracts per room, and multiple leasing object per room.
However, because of bad db tinkering, leasing contracts are assigned to the room, not the leasing object. So I had to take the contract number, and compare it to the leasing object number, in order to get the right results.
I thought this would do:
SELECT *
FROM T_Room
LEFT JOIN T_MAP_Room_LeasingObject
ON MAP_RMLOBJ_RM_UID = T_Room.RM_UID
LEFT JOIN T_LeasingObject
ON LOBJ_UID = MAP_RMLOBJ_LOBJ_UID
LEFT JOIN T_MAP_Room_LeasingContract
ON T_MAP_Room_LeasingContract.MAP_RMCTR_RM_UID = T_Room.RM_UID
LEFT JOIN T_Contracts
ON T_Contracts.CTR_UID = T_MAP_Room_LeasingContract.MAP_RMCTR_CTR_UID
AND T_Contracts.CTR_No LIKE ( ISNULL(T_LeasingObject.LOBJ_No, '') + '.%' )
WHERE ...
However, because the mapping table gets joined before I have the contract number, and I cannot get the contract number without the mapping table, I get doubled entries.
The problem is a little more complicated, as rooms having no leasing contract also needed to show up, so I couldn't just use an inner join.
With a little bit experimenting, I found that this works as expected:
SELECT *
FROM T_Room
LEFT JOIN T_MAP_Room_LeasingObject
ON MAP_RMLOBJ_RM_UID = T_Room.RM_UID
LEFT JOIN T_LeasingObject
ON LOBJ_UID = MAP_RMLOBJ_LOBJ_UID
LEFT JOIN T_MAP_Room_LeasingContract
LEFT JOIN T_Contracts
ON T_Contracts.CTR_UID = T_MAP_Room_LeasingContract.MAP_RMCTR_CTR_UID
ON T_MAP_Room_LeasingContract.MAP_RMCTR_RM_UID = T_Room.RM_UID
AND T_Contracts.CTR_No LIKE ( ISNULL(T_LeasingObject.LOBJ_No, '') + '.%' )
WHERE ...
I now see why two ON conditions in one join, which are usually courtesy of the Query Designer, can be useful, and what difference they make.
I was wondering whether this is a MS-SQL/T-SQL specific thing, or whether this is standard sql.
So I tried it in PostgreSQL with 3 other tables, writing this query:
SELECT *
FROM t_dms_navigation
LEFT JOIN t_dms_document
ON NAV_DOC_UID = DOC_UID
LEFT JOIN t_dms_project
ON PJ_UID = NAV_PJ_UID
and tried to turn it into one with two on conditions
SELECT *
FROM t_dms_navigation
LEFT JOIN t_dms_document
LEFT JOIN t_dms_project
ON PJ_UID = NAV_PJ_UID
ON NAV_DOC_UID = DOC_UID
So I thought it was T-SQL specific, but I quickly tried it in MS SQL too, just to find, to my surprise, that it doesn't work there either.
I thought it might be because of missing foreign keys, so I removed them on all tables in my room query, but it still did not work.
So here is my question:
Why are two ON conditions even legal, does this have a name, and why does it not work in my second example?
It's standard SQL. Each JOIN has to have a corresponding ON clause. All you're doing is shifting around the order that the joins happen in [1] - it's a bit like changing the bracketing of an expression to get around precedence rules.
A JOIN B ON <cond1> JOIN C ON <cond2>
First joins A and B based on cond1. It then takes that combined rowset and joins it to C based on cond2.
A JOIN B JOIN C ON <cond1> ON <cond2>
First joins B and C based on cond1. It then takes A and joins it to that combined rowset, based on cond2. Note that cond1 can only reference columns of B and C here; A has not been joined in yet.
It should work in PostgreSQL - here's the relevant part of the documentation of the SELECT statement:
where from_item can be one of:
[ ONLY ] table_name [ * ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
( select ) [ AS ] alias [ ( column_alias [, ...] ) ]
with_query_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ]
function_name ( [ argument [, ...] ] ) [ AS ] alias [ ( column_alias [, ...] | column_definition [, ...] ) ]
function_name ( [ argument [, ...] ] ) AS ( column_definition [, ...] )
from_item [ NATURAL ] join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
It's that last line that's relevant. Notice that it's a recursive definition - what can be to the left and right of a join can be anything - including more joins.
[1] As always with SQL, this is the logical processing order - the system is free to perform physical processing in whatever sequence it feels will work best, provided the result is consistent.
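One extra note on the failing PostgreSQL/MS SQL attempt: its inner ON condition (PJ_UID = NAV_PJ_UID) references NAV_PJ_UID from t_dms_navigation, which is not in scope for the inner join - only t_dms_document and t_dms_project are. With explicit parentheses (a sketch of the equivalent nesting, not a fix) this is easier to see:

SELECT *
FROM t_dms_navigation
LEFT JOIN (t_dms_document
           LEFT JOIN t_dms_project
           ON PJ_UID = NAV_PJ_UID) -- NAV_PJ_UID belongs to t_dms_navigation,
                                   -- which is not visible inside the parentheses
ON NAV_DOC_UID = DOC_UID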