Call Azure Stream Analytics UDF with multi-dimensional array of last 5 records, grouped by record - azure-iot-hub

I am trying to call an AzureML UDF from Stream Analytics query and that UDF expects an array of 5 rows and 2 columns. The input data is streamed from an IoT hub and we have two fields in the incoming messages: temperature & humidity.
This would be the 'passthrough query':
SELECT GetMetadataPropertyValue([room-telemetry], 'IoTHub.ConnectionDeviceId') AS RoomId,
       Temperature, Humidity
INTO
    [maintenance-alerts]
FROM
    [room-telemetry]
I have an AzureML UDF (successfully created) that should be called with the last 5 records per RoomId and that returns one value from the ML model. Obviously there are multiple rooms in my stream, so I need to find some kind of window of 5 records grouped per RoomId. I can't seem to find a way to call the UDF with the right arrays selected from the input stream. I know I can create a JavaScript UDF that returns an array from the specific fields, but that works record by record, whereas here I need multiple records grouped by RoomId.
Does anyone have any insights?
Best regards

After the good suggestion from jean-sébastien and an answer to an isolated question about the array parsing, I was finally able to stitch everything together into a solution that builds (I still have to get it running at runtime, though).
So, the solution consists of using CollectTop to aggregate the latest rows of the entity you want to group by, including the specification of a time window.
And the next step was to create the JavaScript UDF to take that data structure and parse it into a multi-dimensional array.
This is the query I have right now:
-- Taking relevant fields from the input stream
WITH RelevantTelemetry AS
(
    SELECT engineid, tmp, hum, eventtime
    FROM [engine-telemetry]
    WHERE engineid IS NOT NULL
),
-- Grouping by engineid in time windows
TimeWindows AS
(
    SELECT engineid,
           CollectTop(2) OVER (ORDER BY eventtime DESC) AS TimeWindow
    FROM [RelevantTelemetry]
    WHERE engineid IS NOT NULL
    GROUP BY TumblingWindow(hour, 24), engineid
)
-- Output time windows for verification purposes
SELECT engineid, Udf.Predict(Udf.getTimeWindows(TimeWindow)) AS Prediction
INTO debug
FROM TimeWindows
And this is the Javascript UDF:
// Converts the CollectTop output (an array of records, each holding
// "rank" and "value" fields) into a multi-dimensional array of
// [tmp, hum] pairs.
function getTimeWindows(input) {
    var output = [];
    for (var x in input) {
        var array = [];
        array.push(input[x].value.tmp);
        array.push(input[x].value.hum);
        output.push(array);
    }
    return output;
}
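Mapped back to the original question (the last 5 records per RoomId from [room-telemetry]), the same pattern should look roughly like the sketch below. This is untested: eventtime is taken from the EventEnqueuedUtcTime system property, the window size is arbitrary, and the JavaScript UDF would need to read value.Temperature and value.Humidity instead of tmp and hum.
WITH RoomTelemetry AS
(
    SELECT GetMetadataPropertyValue([room-telemetry], 'IoTHub.ConnectionDeviceId') AS RoomId,
           Temperature, Humidity, EventEnqueuedUtcTime AS eventtime
    FROM [room-telemetry]
),
TimeWindows AS
(
    -- Collect the 5 most recent records per room within each window
    SELECT RoomId,
           CollectTop(5) OVER (ORDER BY eventtime DESC) AS TimeWindow
    FROM RoomTelemetry
    WHERE RoomId IS NOT NULL
    GROUP BY TumblingWindow(minute, 5), RoomId
)
SELECT RoomId, Udf.Predict(Udf.getTimeWindows(TimeWindow)) AS Prediction
INTO [maintenance-alerts]
FROM TimeWindows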

Related

Best practices for dealing with duplicate rows caused by unnested records in BigQuery?

Working with data coming from Facebook more often than not involves working with records, which, in my case, is where all the “spicy” data lives. There is a downside, though: the huge number of duplicate rows, which, when not handled properly, can cause over-reporting and/or data discrepancies.
Below is a use case which, when joined with my primary data (coming from tables which do not involve any unnesting), causes a slight discrepancy in the final numbers.
Technologies used: Facebook Data -> Stitch -> BigQuery -> dbt -> Google Data Studio
I would usually create separate models where I'd unnest a record, transform the data, and then join it into the rest of my models. An example of this is getting all website purchase conversions from the ads_insights actions record.
Here is the difference though:

Query:
SELECT count(*) AS row_count
FROM ads_insights
Result:
 row_count - 316

Query:
SELECT count(*) AS row_count
FROM ads_insights,
UNNEST(actions) AS actions
Result:
 row_count - 5612

After unnesting, I’d use the row data to create columns for each conversion like so:
CASE WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase' THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0) ELSE 0 END AS website_purchase

And finally I would join this model to the rest of my models. The only problem is that those 5600 rows cause a slight discrepancy when joined with the rest, and since I've already used the row data to create the columns, I don't care about the unnested record data anymore and can go back to my original 316 rows. The only question is: how? What techniques are out there that will help me clean up my model?
Solution:
Even though at some point I'd aggregate and group all the fields in my query, as dylanbaker suggested in his answer, the discrepancy would still persist. After doing a deep dive into my data I found that the unnested query returns 279 rows, whereas the nested one returns 314. That focused my attention on the unnesting query: it removes 35 rows, and those 35 rows happened to have null records. After some googling I found this StackOverflow article, which suggests using LEFT JOIN UNNEST to preserve all rows that have null record values, instead of CROSS JOIN UNNEST, which removes them.
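To make the difference concrete, here is a minimal sketch (column names are illustrative, assuming the ads_insights table from above; the comma syntax is an implicit CROSS JOIN):
-- Implicit CROSS JOIN UNNEST: rows whose actions array is NULL or empty are dropped
SELECT i.date_start, a.action_type
FROM ads_insights AS i,
UNNEST(i.actions) AS a

-- LEFT JOIN UNNEST: those rows are preserved, with NULL in the action fields
SELECT i.date_start, a.action_type
FROM ads_insights AS i
LEFT JOIN UNNEST(i.actions) AS a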
You would typically want to do a 'pivot' here. You're most of the way there; you just need to sum and group by the relevant columns in order to get this back to the grain that you originally had and want.
I believe you'll want something like this:
select
    ads_insights.some_column,
    ads_insights.some_other_column,
    sum(case
        when value.action_type = 'offsite_conversion.fb_pixel_purchase'
            then coalesce(value._28d_click, 0) + coalesce(value._1d_view, 0)
        else 0
    end) as website_purchase
from ads_insights,
    unnest(actions) as actions
group by 1, 2
The initial columns would be whatever you want from the original table. The 'sum case whens' would be to pivot and aggregate the unnested data.
You can actually do some magic with unnests inside the select statement
Does this work for you?
SELECT
    some_column,
    (SELECT coalesce(_28d_click, 0) + coalesce(_1d_view, 0)
     FROM UNNEST(actions)
     WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights
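One caveat worth noting: a scalar subquery raises "Scalar subquery produced more than one element" if more than one action matches per row. If that can happen in your data, a variant that aggregates inside the subquery (a sketch under that assumption) avoids it:
SELECT
    some_column,
    (SELECT SUM(coalesce(_28d_click, 0) + coalesce(_1d_view, 0))
     FROM UNNEST(actions)
     WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights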

Stream Analytics Left outer join Not Producing Rows

What I am trying to do:
I want to "throttle" an input stream to its output. Specifically, as I receive multiple similar inputs, I only want to produce an output if one hasn't already been produced in the last N hours.
For example, the input could be thought of as "send an email", but I will get dozens/hundreds of those events. I only want to send an email if I haven't already sent one in the last N hours (or have never sent one).
See the final example here: https://learn.microsoft.com/en-us/stream-analytics-query/join-azure-stream-analytics#examples for something similar to what I am trying to do
What my setup looks like:
There are two inputs to my query:
Ingress: this is the "raw" input stream
Throttled-Sent: this is just a consumer group off of my output stream
My query is as follows:
WITH
AllEvents as (
    /* This CTE is here only because we can't seem to use GetMetadataPropertyValue
       in a join clause, so "materialize" it here for use later */
    SELECT
        *,
        GetMetadataPropertyValue([Ingress], '[User].[Type]') AS Type,
        GetMetadataPropertyValue([Ingress], '[User].[NotifyType]') AS NotifyType,
        GetMetadataPropertyValue([Ingress], '[User].[NotifyEntityId]') AS NotifyEntityId
    FROM
        [Ingress]
),
UseableEvents as (
    SELECT *
    FROM AllEvents
    WHERE NotifyEntityId IS NOT NULL
),
AlreadySentEvents as (
    /* These are the events that would have been previously output (referenced here via
       a consumer group). We want to capture these to make sure we are not sending new
       events when an older "already-sent" event can be found */
    SELECT
        *,
        GetMetadataPropertyValue([Throttled-Sent], '[User].[Type]') AS Type,
        GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyType]') AS NotifyType,
        GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyEntityId]') AS NotifyEntityId
    FROM
        [Throttled-Sent]
)
SELECT i.*
INTO Throttled
FROM UseableEvents i
/* Left join our sent events, looking for those within a particular time frame */
LEFT OUTER JOIN AlreadySentEvents s
    ON i.Type = s.Type
    AND i.NotifyType = s.NotifyType
    AND i.NotifyEntityId = s.NotifyEntityId
    AND DATEDIFF(hour, i, s) BETWEEN 0 AND 4
WHERE s.Type IS NULL /* The IS NULL here is for only returning those Ingress rows that have no corresponding AlreadySentEvents row */
The results I'm seeing:
This query is producing no rows to the output. However, I believe it should be producing something because the Throttled-Sent input has zero rows to begin with. I have validated that my Ingress events are showing up (by simply adjusting the query to remove the left join and checking the results).
I feel like my problem is probably linked to one of the following areas:
I can't have an input that is a consumer group off of the output (but I don't know why that wouldn't be allowed)
My datediff usage/understanding is incorrect
Appreciate any help/guidance/direction!
For throttling, I would recommend looking at the IsFirst function; it might be an easier solution that does not require reading from the output.
For the current query, I think the order of the DATEDIFF parameters needs to be changed, since s comes before i: DATEDIFF(hour, s, i) BETWEEN 0 AND 4
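For reference, a minimal sketch of what the IsFirst approach could look like. Two assumptions to flag: IsFirst partitions on a single key here (the composite key from the join would need to be combined into one field first), and IsFirst works on fixed tumbling intervals rather than a sliding 4-hour lookback, so the semantics differ slightly from the join-based query.
WITH FlaggedEvents AS
(
    -- 1 for the first event per NotifyEntityId in each 4-hour interval, 0 otherwise
    SELECT *,
           ISFIRST(hour, 4) OVER (PARTITION BY NotifyEntityId) AS isFirstInWindow
    FROM UseableEvents
)
SELECT *
INTO Throttled
FROM FlaggedEvents
WHERE isFirstInWindow = 1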

Transform a column of type string to an array/record i.e. nesting a column

I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings that could have values or be null, and I can't do that for now. The goal is to transform that column into an array/record; that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart tables are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes, referred to as univ, which are composed of sub-universes, referred to as suniv. Sub-universes contain families, referred to as fam, which may or may not have sub-families, referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor: the amount of time he/she spent on particular pages, actions taken, and so on. The resulting table gives me the sum of time spent on those pages, the sum of the total number of visits for a single day, and a breakdown of which category each belongs to, hence the univ, suniv, fam and sfam columns, which are of type string (sfam can be null since some sub-universes only have families and don't go down to a sub-family level).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online but none worked, so I am posting my code and problem in the hope of finding guidance and a solution!
A simpler query would be:
select
    prd.key_web
    , dat_log
    , prd.nrb_fp
    , prd.tps_fp
    , prd.univ
    , prd.suniv
    , prd.fam
    , prd.sfam
from product as prd
left join cart as cart
    on prd.key_web = cart.key_web
    and prd.dat_log = cart.dat_log
    and prd.univ = cart.univ
    and prd.suniv = cart.suniv
    and prd.fam = cart.fam
    and prd.sfam = cart.sfam
And this is a sample result of the query for the last 6 columns:
Again, I want to get a column of arrays as sfam where I have all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row number, key_web and dat_log. Each fam is composed of several sfam or none (null), and I want to be able to do nesting on either fam or sfam.
I want to get a column of arrays as sfam where I have all the string values of sfam, even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an array which holds NULLs, but if for some reason you need to preserve them somehow, the workaround is to create an array of structs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the query below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get error: Array cannot have a null element; error in writing field arr1
While if you try the query below
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get result
Row arr.val1 arr.val2
1 a x
b null
null null
As you can see, approaching it this way you can even have both elements as NULL.
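Applied to the tables in the question, a minimal sketch (assuming the product table and the grouping keys from the simpler query above):
SELECT key_web, dat_log, univ, suniv, fam,
       ARRAY_AGG(STRUCT(sfam)) AS sfam_arr  -- NULL sfam values survive inside the STRUCTs
FROM product
GROUP BY key_web, dat_log, univ, suniv, fam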

How to compare values in same table column and return if equal

I have the following data
I need to extract and return rows of data whenever cs3Horizontal for a row equals the same column in the next row. For example, in the picture you can see that cs3Horizontal = 65 for rows 85/86, so those rows should be returned.
I have looked at numerous options using OVER, LEAD and LAG, but to be honest the documentation just does not provide enough detail for somebody who has never used these window functions before. I think I am looking at the right solution, but how do I implement it?
I'm using PostgreSQL 9.2.9, compiled by Visual C++ build 1600, 64-bit. The table in question holds time series data and, as such, the first column is the primary key.
LEAD() and LAG() are the solutions. If you want both rows:
select t.*
from (select t.*,
             lag(cs3Horizontal) over (order by cs3time) as prev_cs3Horizontal,
             lead(cs3Horizontal) over (order by cs3time) as next_cs3Horizontal
      from t
     ) t
where prev_cs3Horizontal = cs3Horizontal or
      next_cs3Horizontal = cs3Horizontal;
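A self-contained toy example (table and column names assumed to match the query above):
create table t (cs3time timestamp primary key, cs3Horizontal integer);
insert into t values
    ('2020-01-01 00:00', 60),
    ('2020-01-01 00:01', 65),
    ('2020-01-01 00:02', 65),
    ('2020-01-01 00:03', 70);
-- The query above returns the two middle rows: the first of the pair
-- qualifies via lead (its next value is equal), the second via lag
-- (its previous value is equal).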

What's the effective way to count rows in Pig?

In Pig, what is the effective way to get a count? We can do a GROUP ALL, but this gives us only 1 reducer. When the data size is very large, say n terabytes, can we try multiple reducers somehow?
dataCount = FOREACH (GROUP data ALL) GENERATE
    'count' AS metric,
    COUNT(data) AS value;
Instead of using a GROUP ALL directly, you could divide the count into two steps: first, group by some field and count the number of rows per group; then, perform a GROUP ALL to sum all of those counts. This way, you are able to count the number of rows in parallel.
Note, however, that if the field you use in the first GROUP BY has no duplicates, the resulting counts will all be 1, so there won't be any difference. Try using a field that has many duplicates to improve the performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final COUNT will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;
C = foreach B generate COUNT(A);
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
However, if we group by the second one, which is unique, it will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
I just dug a bit more into this topic, and it seems you don't have to be afraid that a single reducer will have to process an enormous amount of data if you're using an up-to-date Pig version.
Algebraic UDFs handle COUNT smartly: the counting is done on the mappers, so the reducer just has to deal with the aggregated data (one count per mapper).
I think this was introduced in 0.9.1, but 0.14.0 definitely has it:
Algebraic Interface
An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions algebraic. COUNT is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer.
But my previous answer was definitely wrong:
In the grouping, you can use the PARALLEL n keyword; this sets the number of reducers.
Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task).