How to: Self join Stream Analytics? - azure-stream-analytics

I'm trying to replace null values with the average of the last 10 seconds in a Stream Analytics job.
This requires a self join between the stream and the averages that I calculate in the WITH clause.
It is giving me duplicates (I get the same record two or three times). Any suggestions on what's wrong or how to do it properly?
My query is:
WITH MV AS ( Select AVG([Sensor_1]) AS [Sensor_1] From [input] GROUP BY SlidingWindow(second, 10))
SELECT [input].[ID]
,[input].[Timestamp]
,[input].[Result]
,CASE
WHEN [input].[Sensor_1] = 0
THEN [MV].[Sensor_1] ELSE [input].[Sensor_1]
END [Sensor_1]
,[input].[Sensor_2]
,[input].[Sensor_3]
FROM [input]
LEFT OUTER JOIN [MV]
ON DateDiff(second, [input], [MV]) BETWEEN 0 AND 10

Sorry for the delay in responding.
The simplest solution is to change ON DateDiff(second, [input], [MV]) BETWEEN 0 AND 10 to ON DateDiff(millisecond, [input], [MV]) = 0.
This works because the timestamps produced by the MV step are those of the last event that entered the SlidingWindow, and they match the timestamp of the corresponding event in [input]. (Note: the smaller the time unit, the better for the match, but if you are using the in-browser testing experience, millisecond is the smallest supported unit.)
Do note that while here we can remove duplicates by removing needless matches in the JOIN, in general Stream Analytics has no mechanism to remove duplicates via DISTINCT or anything similar.
Ziv.
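To see why the wider join condition produces duplicates, here is a minimal Python simulation (not Stream Analytics code). It assumes, for illustration only, that MV emits exactly one average row per input event, stamped with that event's time; the sample readings and the helper names `window_avg` and `join` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical sample: three readings one second apart; 0 marks a missing value.
t0 = datetime(2018, 1, 1, 0, 0, 0)
events = [(t0 + timedelta(seconds=i), v) for i, v in enumerate([4.0, 0.0, 8.0])]

def window_avg(ts, width=10):
    """Average of all readings in the `width` seconds ending at `ts` (inclusive)."""
    vals = [v for t, v in events
            if timedelta(0) <= ts - t < timedelta(seconds=width)]
    return sum(vals) / len(vals)

# Assumption: one MV row per input event, carrying that event's timestamp.
mv = [(t, window_avg(t)) for t, _ in events]

def join(predicate):
    """Pair each input event with every MV row whose offset (in seconds) passes `predicate`."""
    return [(t, v, avg) for t, v in events for mt, avg in mv
            if predicate((mt - t).total_seconds())]

loose = join(lambda d: 0 <= d <= 10)  # analogous to BETWEEN 0 AND 10
exact = join(lambda d: d == 0)        # analogous to DateDiff(...) = 0

print(len(events), len(loose), len(exact))
```

With the loose condition each input event pairs with every later average inside the 10-second range, so the output has more rows than the input; the exact match yields one row per event.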

Related

AWS Timestream query to get average measure for the first month of samples

In AWS Timestream I am trying to get the average heart rate for the first month since we have received heart rate samples for a specific user and the average for the last week. I'm having trouble with the query to get the first month part. When I try to use MIN(time) in the where clause I get the error: WHERE clause cannot contain aggregations, window functions or grouping operations.
SELECT * FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time < min(time) + 30
If I add it as a column and try to query on the column, I get the error: Column 'first_sample_time' does not exist
SELECT MIN(time) AS first_sample_time FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time > first_sample_time
Also if I try to add to MIN(time) I get the error: line 1:18: '+' cannot be applied to timestamp, integer
SELECT MIN(time) + 30 AS first_sample_time FROM "DATABASE"."TABLE"
Here is what I finally came up with but I'm wondering if there is a better way to do it?
WITH first_month AS (
SELECT
Min(time) AS creation_date,
From_milliseconds(
To_milliseconds(
Min(time)
) + 2628000000
) AS end_of_first_month,
USER
FROM
"DATABASE"."TABLE"
WHERE
USER = 'xxx'
AND measure_name = 'heart_rate'
GROUP BY
USER
),
first_month_avg AS (
SELECT
Avg(hm.measure_value :: DOUBLE) AS first_month_average,
fm.USER
FROM
"DATABASE"."TABLE" hm
JOIN first_month fm ON hm.USER = fm.USER
WHERE
measure_name = 'heart_rate'
AND hm.time BETWEEN fm.creation_date
AND fm.end_of_first_month
GROUP BY
fm.USER
),
last_week_avg AS (
SELECT
Avg(measure_value :: DOUBLE) AS last_week_average,
USER
FROM
"DATABASE"."TABLE"
WHERE
measure_name = 'heart_rate'
AND time > ago(14d)
AND USER = 'xxx'
GROUP BY
USER
)
SELECT
lwa.last_week_average,
fma.first_month_average,
lwa.USER
FROM
first_month_avg fma
JOIN last_week_avg lwa ON fma.USER = lwa.USER
Is there a better or more efficient way to do this?
I can see you've run into a few challenges along the way to your solution, and hopefully I can clear these up for you and also propose a cleaner way of reaching your solution.
Filtering on aggregates
As you've experienced first hand, SQL doesn't allow aggregates in the WHERE clause, and you also can't filter there on columns created in the SELECT list, such as aggregates or CASE expressions, because those columns don't exist in the table you're querying.
Fortunately there are ways around this, such as:
Making your main query a subquery, and then filtering on the result of that query, like below
Select * from (select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3) where total_good_stuff > 69
This works because the aggregate column (count) is no longer an aggregate by the time it's referenced in the outer WHERE clause; it's an ordinary column in the result of the subquery.
Having clause
If a subquery isn't your cup of tea, you can use the HAVING clause straight after your GROUP BY clause. It acts like a WHERE clause, except that it filters on aggregates.
This is better than resorting to a subquery in most cases, as it's more readable and, I believe, more efficient.
select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3 having count(that_good_stuff) > 69
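As a runnable illustration of the two workarounds above, here is a minimal sketch using Python's built-in sqlite3 module rather than Timestream. The `readings` table and its data are hypothetical, and the HAVING version repeats the aggregate expression instead of using the alias, since not every engine accepts aliases there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (user_id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 60), ("a", 70), ("a", 80), ("b", 90)],
)

# Workaround 1: aggregate in a subquery, then filter in the outer WHERE.
via_subquery = conn.execute("""
    SELECT user_id, n FROM (
        SELECT user_id, COUNT(*) AS n FROM readings GROUP BY user_id
    ) WHERE n > 1
""").fetchall()

# Workaround 2: HAVING filters the groups after aggregation.
via_having = conn.execute("""
    SELECT user_id, COUNT(*) AS n FROM readings
    GROUP BY user_id
    HAVING COUNT(*) > 1
""").fetchall()

print(via_subquery, via_having)
```

Both queries keep only the groups whose count passes the filter, so they return the same rows.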
Finally, window functions are fantastic. They've really helped condense many queries I've written in the past by removing the need for subqueries/CTEs. If you could share some example raw data (remove any PII of course), I'd be happy to share an example for your use case.
Nevertheless, hope this helps!
Tom

Stream Analytics Left outer join Not Producing Rows

What I am trying to do:
I want to "throttle" an input stream to its output. Specifically, as I receive multiple similar inputs, I only want to produce an output if one hasn't already been produced in the last N hours.
For example, the input could be thought of as "send an email", but I will get dozens/hundreds of those events. I only want to send an email if I haven't already sent one in the last N hours (or have never sent one).
See the final example here: https://learn.microsoft.com/en-us/stream-analytics-query/join-azure-stream-analytics#examples for something similar to what I am trying to do
What my setup looks like:
There are two inputs to my query:
Ingress: this is the "raw" input stream
Throttled-Sent: this is just a consumer group off of my output stream
My query is as follows:
WITH
AllEvents as (
/* This CTE is here only because we can't seem to use GetMetadataPropertyValue in a join clause, so "materialize" it here for use later */
SELECT
*,
GetMetadataPropertyValue([Ingress], '[User].[Type]') AS Type,
GetMetadataPropertyValue([Ingress], '[User].[NotifyType]') AS NotifyType,
GetMetadataPropertyValue([Ingress], '[User].[NotifyEntityId]') AS NotifyEntityId
FROM
[Ingress]
),
UseableEvents as (
SELECT *
FROM AllEvents
WHERE NotifyEntityId IS NOT NULL
),
AlreadySentEvents as (
/* These are the events that would have been previously output (referenced here via a consumer group). We want to capture these to make sure we are not sending new events when an older "already-sent" event can be found */
SELECT
*,
GetMetadataPropertyValue([Throttled-Sent], '[User].[Type]') AS Type,
GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyType]') AS NotifyType,
GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyEntityId]') AS NotifyEntityId
FROM
[Throttled-Sent]
)
SELECT i.*
INTO Throttled
FROM UseableEvents i
/* Left join our sent events, looking for those within a particular time frame */
LEFT OUTER JOIN AlreadySentEvents s
ON i.Type = s.Type
AND i.NotifyType = s.NotifyType
AND i.NotifyEntityId = s.NotifyEntityId
AND DATEDIFF(hour, i, s) BETWEEN 0 AND 4
WHERE s.Type IS NULL /* The is null here is for only returning those Ingress rows that have no corresponding AlreadySentEvents row */
The results I'm seeing:
This query is producing no rows to the output. However, I believe it should be producing something because the Throttled-Sent input has zero rows to begin with. I have validated that my Ingress events are showing up (by simply adjusting the query to remove the left join and checking the results).
I feel like my problem is probably linked to one of the following areas:
I can't have an input that is a consumer group off of the output (but I don't know why that wouldn't be allowed)
My datediff usage/understanding is incorrect
Appreciate any help/guidance/direction!
For throttling, I would recommend looking at the IsFirst function; it might be an easier solution that does not require reading from the output.
For the current query, I think the order of the DATEDIFF parameters needs to be changed, since s comes before i: DATEDIFF(hour, s, i) BETWEEN 0 AND 4
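The effect of the argument order can be sketched in Python. The helper `hours_between` is a hypothetical, simplified stand-in for DATEDIFF(hour, start, end): it returns elapsed hours from start to end as a float, whereas the real function returns an integer. The two timestamps are made up.

```python
from datetime import datetime

def hours_between(start, end):
    """Simplified stand-in for DATEDIFF(hour, start, end): elapsed hours from start to end."""
    return (end - start).total_seconds() / 3600

sent = datetime(2023, 1, 1, 10, 0)      # s: an event that was already sent
incoming = datetime(2023, 1, 1, 12, 0)  # i: a later event from Ingress

# Original order, DATEDIFF(hour, i, s): negative when s is earlier than i.
original_matches = 0 <= hours_between(incoming, sent) <= 4

# Suggested order, DATEDIFF(hour, s, i): positive when s is earlier than i.
swapped_matches = 0 <= hours_between(sent, incoming) <= 4

print(original_matches, swapped_matches)
```

Only the swapped order lets an earlier sent event fall inside the 0 to 4 hour range.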

In StreamAnalytic Query how to join data from same input

I have some JSON data (InputFuelCon) being processed by a Stream Analytics query. I want to join the input with itself, i.e. when I look at one value in the input, I need to look at another value in that same input. How can I do that?
The format of the json is something like
[{"timeseries":[{"fqn":"STATUS.EngineFuelConsumption","vqts":[{"v":10,"q":192,"t":"2018-05-10T12:34:34.000Z"}]},
{"fqn":"STATUS.ShaftsRunning","vqts":[{"v":"1","q":192,"t":"2018-05-10T12:35:34.000Z"}]}]}]
Running the following gives rows but 0 as the value
WITH DataInput1 AS
(
SELECT
DATA.Fqn AS fqn,
DATA.Value AS value,
DATA.time AS time
FROM
(
SELECT
Tag.ArrayValue.Fqn AS fqn,
VQT.ArrayValue.V AS value,
VQT.ArrayValue.T AS time
FROM MetsoQuakeFuelCon AS TimeSeries
CROSS APPLY GetArrayElements(TimeSeries.[timeSeries]) AS Tag
CROSS APPLY GetArrayElements(Tag.ArrayValue.vqts) AS VQT
) AS DATA
WHERE DATA.fqn like '%EngineFuelConsumption'
),
DataInput2 AS
(
SELECT
DATA.Fqn AS fqn,
DATA.Value AS value,
DATA.time AS time
FROM
(
SELECT
Tag.ArrayValue.Fqn AS fqn,
VQT.ArrayValue.V AS value,
VQT.ArrayValue.T AS time
FROM MetsoQuakeFuelCon AS TimeSeries
CROSS APPLY GetArrayElements(TimeSeries.[timeSeries]) AS Tag
CROSS APPLY GetArrayElements(Tag.ArrayValue.vqts) AS VQT
) AS DATA
WHERE DATA.fqn like '%ShaftsRunning' and DATA.Value like '1'
),
DataInput as (
select I1.Fqn AS fqn,
cast(I1.Value as bigint)/30 AS value,
DATETIMEFROMPARTS(DATEPART(year,I1.Time ),DATEPART(month,I1.Time ),DATEPART(day,I1.Time )
,DATEPART(hour,I1.Time ),00,00,00 ) AS time
from DataInput1 I1 JOIN DataInput2 I2
ON
I1.Time=I2.Time and
DATEDIFF(MINUTE,I1,I2) BETWEEN 0 AND 1
)
select * from DataInput
DataInput1 and DataInput2, if run by themselves, give one record each, and from my SQL experience the join on the timestamp should give a result, but it doesn't. I don't understand how DATEDIFF(MINUTE,I1,I2) BETWEEN 0 AND 1 works, but if I remove it, there is an error. Any help will be greatly appreciated.
Please find some answers below. Let me know if you have any further questions.
Why your query doesn't return data with this sample:
I looked at the data and query and the following statement implies you have strict equality on the value "Time": I1.Time=I2.Time. However in your sample, the time is different for the 2 entries, so that's why there is no result.
The DATEDIFF statement doesn't relax any of the equality requirements in the JOIN statement.
By removing the line "I1.Time=I2.Time and" you will see a result for your sample. In that case it will join records arriving within a minute of each other. Note that if you have more than one record arriving within the same minute, you will see more than one joined result with this logic. Also, you may want to compare on the application timestamp contained in the data rather than on the arrival time of the message.
More info about DATEDIFF:
JOINs in Azure Stream Analytics are temporal in nature, meaning that each JOIN must provide some limits on how far the matching rows can be separated in time.
In your case, since there is no explicit TIMESTAMP BY clause, the arrival time will be used.
Then, the JOIN condition will be applied, and in your example there is no data matching the condition.
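The interaction of the two conditions can be checked by hand with the timestamps from the sample JSON. This Python sketch only mirrors the logic of the ON clause; it compares the "t" values from the payload, whereas the job as written would use arrival times.

```python
from datetime import datetime, timedelta

# "t" values taken from the sample JSON in the question.
fuel = datetime(2018, 5, 10, 12, 34, 34)    # STATUS.EngineFuelConsumption
shafts = datetime(2018, 5, 10, 12, 35, 34)  # STATUS.ShaftsRunning

# I1.Time = I2.Time: strict equality.
equal = fuel == shafts

# DATEDIFF(MINUTE, I1, I2) BETWEEN 0 AND 1: I2 is at most one minute after I1.
within_minute = timedelta(0) <= shafts - fuel <= timedelta(minutes=1)

# The ON clause requires both, so the equality test decides the outcome.
joined = equal and within_minute

print(equal, within_minute, joined)
```

The time-range condition alone would match this pair; adding the equality test is what removes it.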

why does this query not give me only the specified accounts?

Oracle SQL Developer
I expect to see:
In the subquery, I restrict rownum to at most 2. When I run the subquery separately, it gives me 2 accounts. However, when I run the entire query, the list of account numbers just goes on! What's happening here?
SELECT m.acctno, i.intervalstartdate, d.name, i.intervalvalue
FROM endpoints E
JOIN meters m on m.acctid = e.acctid
LEFT JOIN intervaldata I ON I.acctid = M.acctid
LEFT JOIN endpointmodels EM ON EM.endpointmodelid=E.hwmodelid
LEFT JOIN datadefinitions D ON D.datadefinitionid = I.datadefinitionid
WHERE 1=1
AND E.statuscodeid = 8
AND m.FORM = 2
and exists
(
SELECT m2.acctno
from acct m2
where m2.acctno is not null
--and m2.acctno=m2.acctno
and rownum <= 2
)
AND D.datadefinitionid =7077
AND I.intervalstartdate BETWEEN '24-SEP-2017 00:00' and '25-SEP-2017 00:00'
--TRUNC(sysdate - 1) + interval '1' hour AND TRUNC(sysdate - 1) + interval '24' hour
ORDER BY M.acctno, I.intervalstartdate, I.datadefinitionid
This query is supposed to give me 97 rows for each account. The data I'm reading, the interval values, are the data we report for each customer in 96 intervals. So for 2 accounts, for example, I'm expecting to get 194 rows. I want to test with 2 accounts now, but then I want to run it for 50,000. With 2 it's not even working; it's just giving me millions of rows for two accounts. Basically, I think my rownum line is being ignored. I can't use an IN clause because I can't pass 50,000 accounts into it, so I used the EXISTS operator.
Let me know!
I think the error is in using the and exists (...) clause. The EXISTS predicate returns true if the subquery returns any rows at all. So, in your case, the result of EXISTS will always be true, unless the table is empty. This means it has no effect whatsoever on the outer query. You need to use something like
inner join (SELECT m2.acctno
from acct m2
where m2.acctno is not null
--and m2.acctno=m2.acctno
and rownum <= 2) sub1
on sub1.acctno = m.acctno
to get what you want instead of and exists (...).
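The difference can be reproduced with a small sketch using Python's built-in sqlite3 module instead of Oracle. The tables and rows are hypothetical, and since SQLite has no rownum, LIMIT 2 stands in for rownum <= 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acct (acctno INTEGER);
    CREATE TABLE meters (acctno INTEGER, reading INTEGER);
    INSERT INTO acct VALUES (1), (2), (3), (4);
    INSERT INTO meters VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")

# Uncorrelated EXISTS: true for every outer row as long as acct is not empty.
with_exists = conn.execute("""
    SELECT m.acctno FROM meters m
    WHERE EXISTS (SELECT a.acctno FROM acct a
                  WHERE a.acctno IS NOT NULL LIMIT 2)
    ORDER BY m.acctno
""").fetchall()

# Inner join against the limited subquery: only the two selected accounts survive.
with_join = conn.execute("""
    SELECT m.acctno FROM meters m
    JOIN (SELECT a.acctno FROM acct a
          WHERE a.acctno IS NOT NULL
          ORDER BY a.acctno LIMIT 2) sub
      ON sub.acctno = m.acctno
    ORDER BY m.acctno
""").fetchall()

print(with_exists, with_join)
```

The EXISTS version returns every account, while the join version returns only the two accounts picked by the subquery.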
One obvious mistake is the date condition, where you require a date to be between two STRINGS. If you keep dates in string format, you will run into thousands of problems and bugs and you won't be able to fix them.
Do you understand that '25-APR-2008 00:00:00' is between '24-SEP-2017 00:00:00' and '25-SEP-2017 00:00:00', if you compare them alphabetically (as strings)?
The solution is to make sure the date column is in DATE (or TIMESTAMP) data type, and then to compare to dates, not to strings.
As an aside - this will not cause any errors, but it is still bad code - in the EXISTS condition you have a condition for ROWNUM <= 2. What do you think that will do? The subquery either returns at least one row (the first one will automatically have ROWNUM = 1) or it doesn't. The condition on ROWNUM in that subquery in the EXISTS condition is just garbage.
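The string-comparison pitfall is easy to verify outside the database. In this Python sketch, `parse` is a hypothetical helper that handles only the two month abbreviations used in the example.

```python
from datetime import datetime

MONTHS = {"APR": 4, "SEP": 9}  # only the months needed for this example

def parse(text):
    """Parse 'DD-MON-YYYY HH:MM:SS' into a datetime (hypothetical helper)."""
    day, mon, rest = text.split("-")
    year, clock = rest.split(" ")
    h, m, s = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[mon], int(day), h, m, s)

low = "24-SEP-2017 00:00:00"
probe = "25-APR-2008 00:00:00"
high = "25-SEP-2017 00:00:00"

# Compared as strings, the 2008 value sorts between the two 2017 bounds.
as_strings = low <= probe <= high

# Compared as real dates, it does not.
as_dates = parse(low) <= parse(probe) <= parse(high)

print(as_strings, as_dates)
```

The string comparison goes character by character, so "25-A" falls between "24-S" and "25-S" regardless of the year.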

How can I modify this SQL query to exclude all results except from the previous two hours?

We're currently using SQL Express on SQL Server 2005 and want to set up an automated ftp file transfer every two hours to our client. We want to be able to send them bi-hourly uploads without duplicates throughout the day. Is this possible to do by modifying this existing query?
Use Sweet
select distinct d.AccountCode, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
from CSR_CallDetail d, CSR_FileListing f
where d.CallId = f.CallId
and f.ProcessedFileName like '%mp3'
and f.CallStartDateTime between convert(varchar(10),getdate()-1,101) and convert(varchar(10),getdate(),101)
and d.AccountCode > '740000'
and f.AccountCode > '740000'
and not exists (select 1 from( select processedfilename from csr_filelisting) p
where f.compressedfilename = p.processedfilename)
Here's the updated query
Use Sweet
select distinct d.AccountCode, f.ProcessedFileName, f.CallStartDateTime, f.PathToFile
from CSR_CallDetail d, CSR_FileListing f
where d.CallId = f.CallId
and f.ProcessedFileName like '%mp3'
and DATEDiff(hh, f.callstartdatetime, GETDATE ()) <=2
and d.AccountCode > '740000'
and f.AccountCode > '740000'
and not exists (select 1 from( select processedfilename from csr_filelisting) p where f.compressedfilename = p.processedfilename)
Let's say the query you posted returns the desired result. If so, we need the date (and time) the records were saved. All you need is to add the condition:
AND DATEDIFF(hh, date_of_record, GETDATE()) <=2
I assume in your case it will be:
AND DATEDIFF(hh, f.CallStartDateTime , GETDATE()) <=2
You won't be able to rely on timestamps to get guaranteed exact sequential nonoverlapping sets of anything. You'll always be up against a race condition. What you should do is add a bit column somewhere that will mean you've already processed that row, and set it appropriately at the time of processing. Use transactions and isolation levels to ensure that no one is updating it while you're working on it (a brief moment, one hopes).
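The processed-flag approach could look roughly like the following sketch, written with Python's built-in sqlite3 module rather than SQL Server. The `files` table, its `processed` column, and the `take_batch` helper are all hypothetical, and a real deployment would still need to choose an isolation level suited to its workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, processed INTEGER DEFAULT 0)"
)
conn.executemany("INSERT INTO files (name) VALUES (?)", [("a.mp3",), ("b.mp3",)])
conn.commit()

def take_batch(conn):
    """Return the names of unprocessed rows and flag them, all in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        rows = conn.execute(
            "SELECT id, name FROM files WHERE processed = 0 ORDER BY id"
        ).fetchall()
        conn.executemany(
            "UPDATE files SET processed = 1 WHERE id = ?", [(r[0],) for r in rows]
        )
    return [r[1] for r in rows]

first = take_batch(conn)   # the two initial files
conn.execute("INSERT INTO files (name) VALUES ('c.mp3')")
conn.commit()
second = take_batch(conn)  # only the file added since the first batch

print(first, second)
```

Because each run selects on the flag rather than on a time range, a row is picked up exactly once no matter when the job happens to run.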