Azure Stream Analytics query not returning result set when using TIMESTAMP BY - azure-stream-analytics

I am trying to extract the parts per minute produced, where v is an aggregated counter of the parts produced up to that time.
My Azure Stream Analytics query is below:
SELECT
    x.fqn,
    (MAX(CAST(y.arrayvalue.v AS BIGINT)) - MIN(CAST(y.arrayvalue.v AS BIGINT))) AS ppm
FROM
    (SELECT
        TS.ArrayIndex,
        TS.ArrayValue.FQN,
        TS.ArrayValue.vqts
    FROM [EventHubInput] AS hub TIMESTAMP BY y.arrayvalue.t
    CROSS APPLY GetArrayElements(hub.timeseries) AS TS) AS x
CROSS APPLY GetArrayElements(x.vqts) AS y
WHERE x.fqn LIKE '%Production%' AND y.arrayvalue.q = 192
GROUP BY TumblingWindow(minute, 1), x.fqn
My input data looks like this:
{
  "timeSeries": [
    {
      "fqn": "MyEnterprise.Gateways.GatewayE.CLX.Tags.StateBasic",
      "vqts": [
        { "v": "", "q": 192, "t": "2016-06-24T16:39:45.683+0000" }
      ]
    },
    {
      "fqn": "MyEnterprise.Gateways.GatewayE.CLX.Tags.ProductionCount",
      "vqts": [
        { "v": 264, "q": 192, "t": "2016-06-24T16:39:45.683+0000" }
      ]
    },
    {
      "fqn": ".Gateways.GatewayE.CLX.Tags.StateDetailed",
      "vqts": [
        { "v": "", "q": 192, "t": "2016-06-24T16:39:45.683+0000" }
      ]
    }
  ]
}
My query returns no results. When I remove "timestamp by y.arrayvalue.t" and add y.arrayvalue.t to the GROUP BY clause instead, I get some results. I suspect this is because each event has more than one timestamp field, so I wanted to know whether it is possible to assign the time value of the first array element to TIMESTAMP BY, something like timestamp by y[0].t.

As of today, Azure Stream Analytics does not support TIMESTAMP BY over a value inside an array, so the answer to your question ("is it possible to assign the time data of the first array element to TIMESTAMP BY") is no.
Here is a workaround you can use:
First, flatten the input messages in one job and output them to a staging Event Hub:
WITH flattenTS AS
(
    SELECT
        TS.ArrayIndex,
        TS.ArrayValue.FQN,
        TS.ArrayValue.vqts
    FROM [EventHubInput] AS hub
    CROSS APPLY GetArrayElements(hub.timeseries) AS TS
),
flattenVQTS AS
(
    SELECT
        ArrayIndex,
        FQN,
        vqts.ArrayValue.v AS v,
        vqts.ArrayValue.q AS q,
        vqts.ArrayValue.t AS t
    FROM flattenTS TS
    CROSS APPLY GetArrayElements(TS.vqts) AS vqts
)
SELECT *
INTO [staging_eventhub]
FROM flattenVQTS
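For the sample event above, each row landing in the staging Event Hub would look roughly like this (a sketch; the exact property casing depends on your output serialization):
{ "ArrayIndex": 1, "FQN": "MyEnterprise.Gateways.GatewayE.CLX.Tags.ProductionCount", "v": 264, "q": 192, "t": "2016-06-24T16:39:45.683+0000" }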
Then, use another job to read the flattened messages and do the windowed aggregation:
SELECT
    FQN,
    MAX(CAST(v AS BIGINT)) - MIN(CAST(v AS BIGINT)) AS ppm
FROM [staging_eventhub] TIMESTAMP BY t
WHERE FQN LIKE '%Production%' AND q = 192
GROUP BY TumblingWindow(minute, 1), FQN
You may wonder whether we could just combine the two jobs above as multiple steps in a single job and avoid the staging Event Hub. Unfortunately, you cannot use TIMESTAMP BY when you select from a CTE or subquery today.
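In other words, a combined single-job sketch like the following (CTE bodies abbreviated) is rejected today, because TIMESTAMP BY can only be applied directly to an input, not to a query step:
WITH flattenTS AS (...),   -- same flattening as in the first job
flattenVQTS AS (...)
SELECT
    FQN,
    MAX(CAST(v AS BIGINT)) - MIN(CAST(v AS BIGINT)) AS ppm
FROM flattenVQTS TIMESTAMP BY t   -- not supported over a CTE or subquery
GROUP BY TumblingWindow(minute, 1), FQN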

Related

In Snowflake, how do I parse an unnamed JSON array and access each key by key name rather than with array slicing methods?

I have a JSON array in Snowflake that I need to parse into a table, i.e., convert all the information into rows. Assume my table includes two columns: id and json_tag.
An example row is as below:
json_tag:
[
  { "key": "app.name", "value": "myapp1" },
  { "key": "device.name", "value": "myiPhone11" },
  { "key": "iOS.dist", "value": "latestDist5" }
]
The structure is not fixed: another row can have 5 or even 6 pairs with new key names. An example is below:
json_tag:
[
  { "key": "app.name", "value": "myapp2" },
  { "key": "app.cost", "value": "$2.99" },
  { "key": "device.name", "value": "myiPhoneX" },
  { "key": "device.color", "value": "gold" },
  { "key": "iOS.dist", "value": "latestDist4.9" }
]
What I want is a query on the existing table (without creating a new one) where the JSON is split into columns as below:
id  app.name  app.cost  device.name  device.color  iOS.dist
1   myapp1    null      myiPhone11   null          latestDist5
2   myapp2    $2.99     myiPhoneX    gold          latestDist4.9
I tried the following snippet:
with parsed_tb as (
    select id,
           to_variant(parse_json(json_tag)) as parsed_json_tag
    from mytable
)
select parsed_json_tag[0]:value::varchar as app_name,
       parsed_json_tag[1]:value::varchar as app_cost,
       parsed_json_tag[2]:value::varchar as device_name
from parsed_tb;
As you can imagine, the snippet above does not work when app.cost is absent from the keys, or whenever rows differ in their number of keys and values.
I tried Snowflake's lateral flatten, but it produces many rows and I cannot figure out how to pivot them into columns on the same row. I also tried a recursive approach, without success.
So my questions are:
1. How can I access a key by its name rather than by slicing the array as I do above? This would solve my problem, I guess.
2. If #1 is not possible, how can I obtain the table above?
I have found a solution to this problem with the help of the following thread:
Mysql, reshape data from long / tall to wide
The solution is to first parse the JSON column together with a unique identifier, then lateral flatten it, and finally group the results by that unique identifier (reshaping the table from long to wide, as in the link above).
So for the examples I provided above, the solution would be as follows:
with tb as (
    select id,
           to_variant(parse_json(json_tag)) as parsed_json_tags
    from mytable
),
flattened as (
    select id,
           VALUE:value::string as value,
           VALUE:key::string as key
    from tb, lateral flatten(input => tb.parsed_json_tags)
)
select id,
       MAX(IFF(key = 'app.name', value, NULL)) AS app_name,
       MAX(IFF(key = 'app.cost', value, NULL)) AS app_cost,
       MAX(IFF(key = 'device.name', value, NULL)) AS device_name,
       MAX(IFF(key = 'device.color', value, NULL)) AS device_color,
       MAX(IFF(key = 'iOS.dist', value, NULL)) AS ios_dist
from flattened
group by 1;
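To see why the final GROUP BY works, consider what the flattened CTE produces for the first example row: one row per key-value pair (illustrative):
id  key          value
1   app.name     myapp1
1   device.name  myiPhone11
1   iOS.dist     latestDist5
The MAX(IFF(...)) aggregates then pick out each key's value within the id group, turning the long form into the wide table above.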

Extract complex json with random key field

I am trying to extract the following JSON into its own rows, like the table below, using a Presto query. The issue is that the key (the AV engine name) is different for each entry, and I am stuck on how to extract and iterate over the keys without knowing them in advance.
The JSON is a value in a table row:
{
  "Bkav": {
    "detected": false,
    "result": null
  },
  "Lionic": {
    "detected": true,
    "result": "Trojan.Generic.3611249"
  },
  ...
AV Engine Name   Detected Virus   Result
Bkav             false            null
Lionic           true             Trojan.Generic.3611249
I have tried json_extract, following the documentation here: https://teradata.github.io/presto/docs/141t/functions/json.html, but there is no mention of extraction when the key is unknown. I am trying to find a solution that works in both Presto and Hive; is there a common query applicable to both?
You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
    VALUES (
        '{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
    )
)
-- query
SELECT
    k AS "AV Engine Name",
    json_extract_scalar(v, '$.detected') AS "Detected Virus",
    json_extract_scalar(v, '$.result') AS "Result"
FROM (
    SELECT CAST(json_parse(json_str) AS map(varchar, json)) AS m
    FROM dataset
)
CROSS JOIN unnest(map_keys(m), map_values(m)) AS t(k, v)
Output:
AV Engine Name   Detected Virus   Result
Bkav             false            NULL
Lionic           true             Trojan.Generic.3611249
The Presto query suggested by #Guru works, but for Hive there is no easy way. I had to:
1. Extract the JSON.
2. Parse it with regexp_replace to remove some characters and brackets.
3. Convert the result to a map, then repeat once more to get the nested values out.
SELECT
    av_engine,
    str_to_map(regexp_replace(engine_result, '\\}', ''), ',', ':') AS output_map
FROM (
    SELECT
        str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{', ''), '\\},', ':') AS key_val_map
    FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result
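For the sample scans JSON above, this yields roughly the following rows (illustrative; whitespace from the original JSON may survive the string surgery and need trimming):
av_engine   output_map
Bkav        {"detected":"false","result":"null"}
Lionic      {"detected":"true","result":"Trojan.Generic.3611249"}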

Using Athena to get terminatingrule from rulegrouplist in AWS WAF logs

I followed these instructions to get my AWS WAF data into an Athena table.
I would like to query the data to find the latest requests with an action of BLOCK. This query works:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;
My issue is cleanly identifying the "terminatingrule" - the reason the request was blocked. As an example, a result has
terminatingrule = AWS-AWSManagedRulesCommonRuleSet
And
rulegrouplist = [
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesAmazonIpReputationList",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesKnownBadInputsRuleSet",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesLinuxRuleSet",
"terminatingrule": "null",
"excludedrules": "null"
},
{
"nonterminatingmatchingrules": [],
"rulegroupid": "AWS#AWSManagedRulesCommonRuleSet",
"terminatingrule": {
"rulematchdetails": "null",
"action": "BLOCK",
"ruleid": "NoUserAgent_HEADER"
},
"excludedrules":"null"
}
]
The piece of data I would like separated into a column is rulegrouplist[terminatingrule].ruleid, which has a value of NoUserAgent_HEADER.
AWS provides useful information on querying nested Athena arrays, but I have been unable to get the result I want.
I have framed this as an AWS question, but since Athena uses SQL queries, anyone with good SQL skills could likely work this out.
It's not entirely clear to me exactly what you want, but I'm going to assume you are after the array element where terminatingrule is not "null" (I will also assume that if there are multiple you want the first).
The documentation you link to says that the type of the rulegrouplist column is array<string>. The reason it is string and not a complex type is that there seem to be multiple different schemas for this column; for example, the terminatingrule property is either the string "null" or a struct/object, something that can't be described using Athena's type system.
This is not a problem, however. When dealing with JSON there's a whole set of JSON functions that can be used. Here's one way to use json_extract combined with filter and element_at to remove array elements where the terminatingrule property is the string "null" and then pick the first of the remaining elements:
SELECT
    element_at(
        filter(
            rulegrouplist,
            rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
        ),
        1
    ) AS first_non_null_terminatingrule
FROM waf_logs
WHERE action = 'BLOCK'
ORDER BY date DESC
You say you want the "latest", which to me is ambiguous: it could mean either the first or the last non-null element. The query above returns the first non-null element; if you want the last, change the second argument of element_at to -1 (Athena's array indexing starts from 1, and -1 counts from the end).
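For the last matching element, the call becomes (same filter, different index):
element_at(
    filter(
        rulegrouplist,
        rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
    ),
    -1
)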
To return the individual ruleid element of the json:
SELECT
    from_unixtime(timestamp / 1000e0) AS date,
    action,
    httprequest.clientip AS ip,
    httprequest.uri AS request,
    httprequest.country AS country,
    terminatingruleid,
    json_extract(
        element_at(
            filter(
                rulegrouplist,
                rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
            ),
            1
        ),
        '$.terminatingrule.ruleid'
    ) AS ruleid
FROM waf_logs
WHERE action = 'BLOCK'
ORDER BY date DESC
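For the example event above, this extracts NoUserAgent_HEADER into the ruleid column, since the AWSManagedRulesCommonRuleSet entry is the only one in rulegrouplist whose terminatingrule is not the string "null".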
I had the same issue but the solution posted by Theo didn't work for me, even though the table was created according to the instructions linked to in the original post.
Here is what worked for me, which is basically the same as Theo's solution, but without the json conversion:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist,
element_at(
    filter(ruleGroupList, ruleGroup -> ruleGroup.terminatingRule IS NOT NULL),
    1
).terminatingRule.ruleId AS ruleId
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;

Asking for help on the correct way to use SQL with a CTE to create a JSON_OBJECT

The requested JSON needs to be in this form:
{
"header": {
"InstanceName": "US"
},
"erpReferenceData": {
"erpReferences": [
{
"ServiceID": "fb16e421-792b-4e9c-935b-3cea04a84507",
"ERPReferenceID": "J0000755"
},
{
"ServiceID": "7d13d907-0932-44c0-ad81-600c9b97b6e5",
"ERPReferenceID": "J0000756"
}
]
}
}
The program that I created looks like this:
dcl-s OutFile sqltype(dbclob_file);
exec sql
    With x as (
        select json_object(
                   'InstanceName' : trim(Cntry)) objHeader
        from xmlhdr
        where cntry = 'US'),
    y as (
        select json_object(
                   'ServiceID' VALUE S.ServiceID,
                   'ERPReferenceID' VALUE I.RefCod) objRef
        FROM IMH I
        INNER JOIN GUIDS G ON G.REFCOD = I.REFCOD
        INNER JOIN SERV S ON S.GUID = G.GUID
        WHERE G.XMLTYPE = 'Service')
    VALUES (
        select json_object(
                   'header' : objHeader Format json,
                   'erpReferenceData' : json_object(
                       'erpReferences' VALUE
                           JSON_ARRAYAGG(objRef Format json)))
        from x
        LEFT OUTER JOIN y ON 1=1
        Group by objHeader)
    INTO :OutFile;
This is the compile error I get:
SQL0122: Position 41 Column OBJHEADER or expression in SELECT list not valid.
Is this the correct way to create this SQL statement, or is there a better, easier way? Any idea how to rewrite the SQL statement to make it work correctly?
The key to generating JSON (or XML, for that matter) is to start from the inside and work your way out.
(I've simplified the raw data into just a test table...)
with elm as (
    select json_object(
               'ServiceID' VALUE ServiceID,
               'ERPReferenceID' VALUE RefCod) as erpRef
    from jsontst)
select * from elm;
Now add the next layer as a CTE that builds on the first one:
with elm as (
    select json_object(
               'ServiceID' VALUE ServiceID,
               'ERPReferenceID' VALUE RefCod) as erpRef
    from jsontst)
, arr (arrDta) as (
    values json_array(select erpRef from elm))
select * from arr;
And the next layer...
with elm as (
    select json_object(
               'ServiceID' VALUE ServiceID,
               'ERPReferenceID' VALUE RefCod) as erpRef
    from jsontst)
, arr (arrDta) as (
    values json_array(select erpRef from elm))
, erpReferences (refs) as (
    select json_object('erpReferences' value arrDta)
    from arr)
select *
from erpReferences;
The nice thing about building with CTEs is that at each step you can see the results so far. You can always go back and stick a select * from <cte>; in the middle to see what you have at any point.
Note that I'm building this in Run SQL Scripts. Once you have the statement complete, you can embed it in your RPG program.
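Following the same pattern, one more layer can wrap the header and the references into the requested envelope. Here is a sketch along those lines; it assumes the header value still comes from xmlhdr as in the original attempt, and it adds FORMAT JSON so the nested strings are treated as JSON rather than re-escaped:
with elm as (
    select json_object(
               'ServiceID' VALUE ServiceID,
               'ERPReferenceID' VALUE RefCod) as erpRef
    from jsontst)
, arr (arrDta) as (
    values json_array(select erpRef from elm))
, erpReferences (refs) as (
    select json_object('erpReferences' value arrDta format json)
    from arr)
, hdr (hdrDta) as (
    select json_object('InstanceName' value trim(Cntry))
    from xmlhdr
    where cntry = 'US')
select json_object(
           'header' value hdrDta format json,
           'erpReferenceData' value refs format json)
from hdr
cross join erpReferences;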

PostgreSQL 11: multiple key-value pairs with json_object_agg()

Using PostgreSQL 11, the aggregate function json_object_agg() creates a JSON object from key and value, like the current uptime:
# SELECT json_object_agg('uptime', date_trunc('second',
current_timestamp - pg_postmaster_start_time()));
This outputs the JSON object:
{ "uptime" : "00:45:55" }
How can I merge multiple key-value pairs into one single object? For instance, how to add the PostgreSQL version string to the object?
# SELECT json_object_agg('version', version());
The desired result may look like:
{
"uptime": "00:60:01",
"version": "PostgreSQL 11.7"
}
You can use json_build_object() to generate a JSON object for key/value pairs.
SELECT json_build_object(
    'uptime', date_trunc('second', current_timestamp - pg_postmaster_start_time()),
    'version', version()
);
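Run as-is, this returns a single row holding an object shaped like the one requested (the values, of course, depend on the server):
{"uptime" : "00:45:55", "version" : "PostgreSQL 11.7 ..."}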
In the case of an aggregated query, for example:
select point_id, json_agg(
json_build_object(
'lat', lat,
'lng', lon
))
from raw
group by point_id;
Result:
1,"[{"lat": 66.316917131, "lng": 65.308411872}, {"lat": 66.361430767, "lng": 65.218795224}]"
2,"[{"lat": 53.557419623, "lng": 102.39525849}, {"lat": 53.626151788, "lng": 102.529763433}]"
3,"[{"lat": 56.452206128, "lng": 100.731635684}, {"lat": 56.520627931, "lng": 100.771568911}]"
json_object_agg is for aggregating, that is, summarizing an arbitrary number of rows.
You can use it for your purpose by making the data look like it is coming in as multiple rows with two columns each. A VALUES list does exactly that.
select jsonb_object_agg(key, val) from (
    values
        ('uptime', date_trunc('second', current_timestamp - pg_postmaster_start_time())::text),
        ('version', version())
) foo(key, val);
Using json(b)_build_object with 4 arguments and one implicit/dummy row is probably more natural for the exact case you give. That might be all you want to do at the moment, but if you are working with JSON you should know how json(b)_object_agg works even if it is not what you need right now.
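Because the pairs arrive as ordinary rows, extending the object later just means appending rows to the VALUES list, for example a hypothetical third pair:
('database', current_database())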