How to inject (append) a windowed average into the output using MS Stream Analytics

Given a JSON stream that resembles:
{
  "timestamp": "2017-01-26T20:27:26.099Z",
  "Novato": {
    "humidity": "40.996",
    "barometric": "1011.2"
  },
  "Redmond": {
    "humidity": "60.832",
    "barometric": "1011.8"
  }
}
For each city in this object, I want to add a new value called humidity_5_second_avg, which is a 5-second tumbling-window average.
The average must, of course, be computed per city, and I want to append it to each city's existing values.
For example:
{
  "timestamp": "2017-01-26T20:27:26.099Z",
  "Novato": {
    "humidity": "40.996",
    "barometric": "1011.2",
    "humidity_5_second_avg": "38.1234"
  },
  "Redmond": {
    "humidity": "60.832",
    "barometric": "1011.8",
    "humidity_5_second_avg": "32.1234"
  }
}
Is this possible with a Stream Analytics query? Or would I need to create two streams (one with the original data and one with only the average data) and merge them together?

This is tricky to get exactly in the shape described. It's easier to first break the city information into one row per city and then use JOIN.
-- Use CROSS APPLY to split original events into one row per city
WITH CityData AS
(
    SELECT
        r.PropertyName AS City,
        r.PropertyValue.*
    FROM localinput i TIMESTAMP BY timestamp
    CROSS APPLY GetRecordProperties(i) r
    WHERE r.PropertyValue.humidity IS NOT NULL
),
Averages AS
(
    -- humidity arrives as a string in the sample JSON, so cast before averaging
    SELECT
        City,
        AVG(CAST(humidity AS float)) AS avg_humidity
    FROM CityData
    GROUP BY City, TumblingWindow(second, 5)
)

-- Inspect the windowed averages on a separate output
SELECT *, System.Timestamp AS ts INTO debug FROM Averages

-- Join each city row to the average of the window it falls into
SELECT
    c.*, a.avg_humidity
FROM CityData c
JOIN Averages a
    ON c.City = a.City AND DATEDIFF(second, c, a) BETWEEN 0 AND 5
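Note that this output is flat: one row per city per window, not the nested per-city shape in the question. Illustratively, the final SELECT emits rows shaped like this (values taken from the sample above; the averages are made up, as in the question):

City     humidity   barometric   avg_humidity
Novato   40.996     1011.2       38.1234
Redmond  60.832     1011.8       32.1234

Re-nesting those rows under dynamic city keys is the part that is hard to express in Stream Analytics itself; if the nested shape is required, reassembling it downstream in whatever consumes the output is the more practical route.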

Related

How to group results by values that are inside a JSON array in PostgreSQL

I have a column of type jsonb that has data like this:
Column name: used_filters
Row 1 example:
{ "categories": ["economic", "Social"], "tags": ["world", "eco-friendly"] }
Row 2 example:
{ "categories": ["economic"], "tags": ["eco-friendly"], "keywords": ["2050"] }
I want to group the result to get the most frequent value for each one of the keys, something like this:

key        most_freq
--------   ------------
category   economic
tags       eco-friendly
keyword    2050
The keys are not constant and could be something other than in this example, but I know that they will occur frequently.
You can extract keys and values as arrays first by using jsonb_each, and then unnest the generated arrays with jsonb_array_elements_text. The rest is classical aggregation, with the counts ranked by a window function such as RANK():
SELECT key, value
FROM (
    SELECT j.key, jj.value,
           RANK() OVER (PARTITION BY j.key ORDER BY COUNT(*) DESC)
    FROM t,
         LATERAL jsonb_each(js) AS j,
         LATERAL jsonb_array_elements_text(j.value) AS jj
    GROUP BY j.key, jj.value
) AS q
WHERE rank = 1
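The query assumes a table t with a jsonb column js (the answer's naming rather than the asker's used_filters column). A minimal setup to try it, using the rows from the question:

-- assumed table/column names t and js, matching the query above
CREATE TABLE t (js jsonb);
INSERT INTO t (js) VALUES
  ('{"categories": ["economic", "Social"], "tags": ["world", "eco-friendly"]}'),
  ('{"categories": ["economic"], "tags": ["eco-friendly"], "keywords": ["2050"]}');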

Percentage Matching two JSONB columns

I am trying to compare two JSONB columns in a table. At the moment this is done in the app, but that doesn't allow proper searching, filtering, and ordering without loading the whole data set. It would be better if we could do this comparison in the DB.
The following is an example of the data and calculation.
employer = {
"autism": "1",
"social": "1",
"dementia": "0",
"domestic": "1",
}
employer_keys = ["autism","social","domestic"]
candidate = {
"autism": "0",
"social": "1",
"dementia": "0",
"domestic": "1",
}
candidate_keys = ["social","domestic"]
remainder_keys = employer_keys - candidate_keys = ["autism"]
1-(remainder_keys.length/employer_keys.length) = 1-(1/3) = 2/3 = 66%
This process is all rather trivial in Ruby (jsonb → array → select → calculation).
However, I would like to perform this in SQL or a function at the DB level, something like
function compare_json(employer, candidate) returning a decimal.
More specifically
SELECT candidates.id,
       st_distance_sphere(st_makepoint(employer.long, employer.lat),
                          st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 AS distance
FROM (SELECT * FROM users WHERE id = 8117) employer,
     (SELECT * FROM users WHERE role_id = 5) candidates
WHERE st_distance_sphere(st_makepoint(employer.long, employer.lat),
                         st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 < 25
ORDER BY distance
The above SQL calculates the distance between a single employer and multiple candidates; the inline subqueries provide employer.skills (1 row) and candidates.skills (n rows).
So the output should be:
Candidate id, Distance, SkillsMatch(employer.skills, candidates.skills)
As before the edit, any guidance will be welcome.
Here is a pure SQL approach: it works by turning the employer object into a record set, and then performing conditional aggregation:
SELECT 1 - avg( ((d.candidate ->> e.k)::int IS DISTINCT FROM 1)::int ) AS res
FROM (VALUES (
    '{ "autism": "1", "social": "1", "dementia": "0", "domestic": "1" }'::jsonb,
    '{ "autism": "0", "social": "1", "dementia": "0", "domestic": "1" }'::jsonb
)) d(employer, candidate)
CROSS JOIN LATERAL jsonb_each_text(d.employer) e(k, v)
WHERE e.v::int = 1
You can easily turn this into a function by replacing the literal objects in the values() row constructor with parameters.
Demo on DB Fiddle:
res
----------------------
0.66666666666666666667
Okay, this is what I got to:
CREATE OR REPLACE FUNCTION JSON_COMPARE(employer_json jsonb, candidate_json jsonb, OUT _result numeric)
AS
$$
BEGIN
    SELECT 1 - avg(((d.candidate ->> e.k)::int IS DISTINCT FROM 1)::int)
    INTO _result
    FROM (VALUES (employer_json, candidate_json)) d(employer, candidate)
    CROSS JOIN LATERAL jsonb_each_text(d.employer) e(k, v)
    WHERE e.v::int = 1;
    RETURN;
END;
$$
LANGUAGE PLPGSQL;
This is a small variation on GMB's super-fast answer. With a few indexes and by correctly limiting the size of the candidate list, we get reasonable performance.
I'm new to Stack so my upvote for GMB doesn't show, but thanks again.
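For reference, a sketch of wiring the function into the earlier distance query; it assumes users has a jsonb skills column, which the question implies but never shows:

SELECT candidates.id,
       st_distance_sphere(st_makepoint(employer.long, employer.lat),
                          st_makepoint(candidates.long, candidates.lat)) / 1000 / 8 * 5 AS distance,
       -- hypothetical skills column; adjust to the real schema
       JSON_COMPARE(employer.skills, candidates.skills) AS skills_match
FROM (SELECT * FROM users WHERE id = 8117) employer,
     (SELECT * FROM users WHERE role_id = 5) candidates
ORDER BY distance;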

Extracting data from an array of JSON objects for specific object values

In my table, there is a column of JSON type which contains an array of objects describing time offsets:
[
  {
    "type": "start",
    "time": 1.234
  },
  {
    "type": "end",
    "time": 50.403
  }
]
I know that I can extract these with JSON_EACH() and JSON_EXTRACT():
CREATE TEMPORARY TABLE Items(
    id INTEGER PRIMARY KEY,
    timings JSON
);
INSERT INTO Items(timings) VALUES
    ('[{"type": "start", "time": 12.345}, {"type": "end", "time": 67.891}]'),
    ('[{"type": "start", "time": 24.56}, {"type": "end", "time": 78.901}]');

SELECT
    JSON_EXTRACT(Timings.value, '$.type'),
    JSON_EXTRACT(Timings.value, '$.time')
FROM
    Items,
    JSON_EACH(timings) AS Timings;
This returns a table like:
start   12.345
end     67.891
start   24.56
end     78.901
What I really need though is to:
Find the timings of specific types. (Find the first object in the array that matches a condition.)
Take this data and select it as a column with the rest of the table.
In other words, I'm looking for a table that looks like this:
id start end
-----------------------------
0 12.345 67.891
1 24.56 78.901
I'm hoping for some sort of query like this:
SELECT
id,
JSON_EXTRACT(timings, '$.[type="start"].time'),
JSON_EXTRACT(timings, '$.[type="end"].time')
FROM Items;
Is there some way to use path in the JSON functions to select what I need? Or, some other way to pivot what I have in the first example to apply to the table?
One possibility:
WITH cte(id, json) AS (
    SELECT Items.id,
           json_group_object(json_extract(j.value, '$.type'),
                             json_extract(j.value, '$.time'))
    FROM Items
    JOIN json_each(timings) AS j
      ON json_extract(j.value, '$.type') IN ('start', 'end')
    GROUP BY Items.id
)
SELECT id,
       json_extract(json, '$.start') AS start,
       json_extract(json, '$.end') AS "end"
FROM cte
ORDER BY id;
which gives
id start end
---------- ---------- ----------
1 12.345 67.891
2 24.56 78.901
Another one, which uses the window functions added in SQLite 3.25 and avoids creating intermediate JSON objects:
SELECT DISTINCT Items.id
     , max(json_extract(j.value, '$.time'))
       FILTER (WHERE json_extract(j.value, '$.type') = 'start') OVER ids AS start
     , max(json_extract(j.value, '$.time'))
       FILTER (WHERE json_extract(j.value, '$.type') = 'end') OVER ids AS "end"
FROM Items
JOIN json_each(timings) AS j ON json_extract(j.value, '$.type') IN ('start', 'end')
WINDOW ids AS (PARTITION BY Items.id)
ORDER BY Items.id;
The key is using the ON clause of the JOIN to limit results to just the two objects in each array that you care about, and then merging those up to two rows for each Items.id into one with a couple of different approaches.
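To see why the first query works, it can help to run just the CTE body against the sample data; json_group_object collapses each id's rows into a single object keyed by type:

SELECT Items.id,
       json_group_object(json_extract(j.value, '$.type'),
                         json_extract(j.value, '$.time')) AS json
FROM Items
JOIN json_each(timings) AS j
  ON json_extract(j.value, '$.type') IN ('start', 'end')
GROUP BY Items.id;
-- id  json
-- 1   {"start":12.345,"end":67.891}
-- 2   {"start":24.56,"end":78.901}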

How to query all entries with a value in a nested BigQuery table

I generated a BigQuery table from an existing BigTable table, and the result is a multi-nested dataset that I'm struggling to query. Here's the format of an entry from that BigQuery table, from a simple select * from my_table limit 1:
[
  {
    "rowkey": "XA_1234_0",
    "info": {
      "column": [],
      "somename": {
        "cell": [
          {
            "timestamp": "1514357827.321",
            "value": "1234"
          }
        ]
      },
      ...
    }
  },
  ...
]
What I need is to be able to get all entries from my_table where the value of somename is X, for instance. There will be multiple rowkeys where the value of somename will be X and I need all the data from each of those rowkey entries.
OR
Alternatively, I could use a query where rowkey contains X, to get "XA_1234_0", "XA_1234_1", and so on. The "XA" and the "0" can change, but the middle number stays the same. I've tried a where rowkey like "$_1234_$", but the query runs for over a minute, which is way too long for some reason.
I am using standard SQL.
EDIT: Here's an example of a query I tried that didn't work (with error: Cannot access field value on a value with type ARRAY<STRUCT<timestamp TIMESTAMP, value STRING>>), but best describes what I'm trying to achieve:
SELECT * FROM `my_dataset.mytable` where info.field_name.cell.value=12345
I want to get all records whose value in field_name equals some value.
From the sample Firebase Analytics dataset:
#standardSQL
SELECT *
FROM `firebase-analytics-sample-data.android_dataset.app_events_20160607`
WHERE EXISTS(
  SELECT * FROM UNNEST(user_dim.user_properties)
  WHERE key='powers' AND value.value.string_value='20'
)
LIMIT 1000
Below is for BigQuery Standard SQL
#standardSQL
SELECT t.*
FROM `my_dataset.mytable` t,
UNNEST(info.somename.cell) c
WHERE c.value = '1234'
The above assumes the specific value can appear in each record just once; hopefully that is true for your case.
If it is not, the query below should handle it:
#standardSQL
SELECT *
FROM `yourproject.yourdataset.yourtable`
WHERE EXISTS(
  SELECT *
  FROM UNNEST(info.somename.cell)
  WHERE value = '1234'
)
which, I just realised, is pretty much the same as Felipe's version, just using your table/schema.
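On the asker's side question about matching rowkey: in SQL LIKE the wildcards are % (any run of characters) and _ (exactly one character), not $, so "$_1234_$" matches almost nothing. A sketch of the middle-number match with REGEXP_CONTAINS, assuming rowkey is a STRING column:

#standardSQL
SELECT *
FROM `my_dataset.mytable`
-- anchored pattern: letters, then _1234_, then digits, e.g. "XA_1234_0"
WHERE REGEXP_CONTAINS(rowkey, r'^[A-Z]+_1234_[0-9]+$')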

Querying nested JSON in BigQuery

I have JSON data which is saved in BigQuery as a string.
{
  "event": {
    "action": "prohibitedSoftwareCheckResult",
    "clientTime": "2017-07-16T12:55:40.828Z",
    "clientTimeZone": "3",
    "serverTime": "2017-07-16T12:55:39.000Z",
    "processList": {
      "1": "outlook.exe",
      "2": "notepad.exe"
    }
  },
  "user": {
    "id": 123456
  }
}
I want to have a result set where each process will be in a different row.
Something like:
UserID ProcessName
-------------------------
123456 outlook.exe
123456 notepad.exe
I saw there is an option to query repeated data, but to my understanding the field needs to be of RECORD type.
Is it possible to convert to RECORD type "on the fly" in a subquery? (I can't change the source field to RECORD.)
Or, is there a different way to return the desired result set?
This could be a possible workaround for you:
SELECT
  user_id,
  processListValues
FROM (
  SELECT
    JSON_EXTRACT_SCALAR(json_data, '$.user.id') user_id,
    REGEXP_EXTRACT_ALL(JSON_EXTRACT(json_data, '$.event.processList'),
                       r':"([a-zA-Z0-9\.]+)"') processListValues
  FROM data
),
UNNEST(processListValues) processListValues
Using your JSON as an example:
WITH data AS (
  SELECT """{
    "event": {
      "action": "prohibitedSoftwareCheckResult",
      "clientTime": "2017-07-16T12:55:40.828Z",
      "clientTimeZone": "3",
      "serverTime": "2017-07-16T12:55:39.000Z",
      "processList": {
        "1": "outlook.exe",
        "2": "notepad.exe",
        "3": "outlo3245345okexe"
      }
    },
    "user": {
      "id": 123456
    }
  }""" AS json_data
)
SELECT
  user_id,
  processListValues
FROM (
  SELECT
    JSON_EXTRACT_SCALAR(json_data, '$.user.id') user_id,
    REGEXP_EXTRACT_ALL(JSON_EXTRACT(json_data, '$.event.processList'),
                       r':"([a-zA-Z0-9\.]+)"') processListValues
  FROM data
),
UNNEST(processListValues) processListValues
Results:

Row   user_id   processListValues
1     123456    outlook.exe
2     123456    notepad.exe
3     123456    outlo3245345okexe