Processing of late events in SA - azure-stream-analytics

I ran a test in which I generated data that was 30 days old.
When I sent it to the SA job, all of that input was dropped, although based on the settings in the event ordering blade I expected all of it to be passed through.
Part of the job query contains:
---------------all incoming events storage query
SELECT stream.*
INTO [iot-predict-SA2-ColdStorage]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
so I expect everything that was pushed to the SA job to end up in blob storage.
When I sent events that were only 5 hours old, the input was marked as late (as expected) and processed.
In the screenshot, the first marked area shows the outdated events arriving as input but producing no output (red); the second area shows late events being processed.
Full query:
WITH AlertsBasedOnMin
AS (
SELECT stream.SensorGuid
,ref.MinThreshold AS threshold
WHEN (ref.MinThreshold > stream.Value)
END AS isAlert
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = stream.SensorGuid
WHERE ref.AggregationTypeFlag = 8
AS (
SELECT stream.SensorGuid
,ref.MaxThreshold AS threshold
WHEN (ref.MaxThreshold < stream.Value)
END AS isAlert
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = stream.SensorGuid
WHERE ref.AggregationTypeFlag = 16
AS (
FROM AlertsBasedOnMin
FROM AlertsBasedOnMax
AS (
SELECT SUM(alertMinMaxUnion.isAlert) AS EventCount
,alertMinMaxUnion.SensorGuid AS SensorGuid
FROM alertMinMaxUnion
GROUP BY HoppingWindow(Duration(minute, 1), Hop(second, 30))
HAVING SUM(alertMinMaxUnion.isAlert) > alertMinMaxUnion.Count
AS (
SELECT System.TIMESTAMP [TimeStampUtc]
,0 AS SumValue
,0 AS AvgValue
,0 AS StdDevValue
FROM alertMimMaxComputed computed
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = computed.SensorGuid
AS (
SELECT Count(1) AS eventCount
,stream.SensorGuid AS SensorGuid
,ref.[Count] AS TriggerThreshold
,SUM(stream.Value) AS SumValue
,AVG(stream.Value) AS AvgValue
,STDEV(stream.Value) AS StdDevValue
,ref.AggregationTypeFlag AS flag
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = stream.SensorGuid
GROUP BY HoppingWindow(Duration(minute, 1), Hop(second, 30))
--as this is alert then this factor will be relevant to all of the aggregated queries
Count(1) >= ref.[Count]
ref.AggregationTypeFlag = 1
AVG(stream.Value) >= ref.MaxThreshold
OR AVG(stream.Value) <= ref.MinThreshold
OR (
ref.AggregationTypeFlag = 2
SUM(stream.Value) >= ref.MaxThreshold
OR Sum(stream.Value) <= ref.MinThreshold
OR (
ref.AggregationTypeFlag = 4
STDEV(stream.Value) >= ref.MaxThreshold
OR STDEV(stream.Value) <= ref.MinThreshold
AS (
SELECT System.TIMESTAMP [TimeStampUtc]
,0 AS EventCount
FROM alertsAggregatedByFunction computed
JOIN [iot-predict-SA2-referenceBlob] ref ON ref.SensorGuid = computed.SensorGuid
AS (
FROM alertsAggregatedByFunctionMergedWithReference
FROM alertsMimMaxComputedMergedWithReference
---------------alerts storage query
INTO [iot-predict-SA2-Alerts-ColdStorage]
FROM allAlertsUnioned
---------------alerts to alert events query
INTO [iot-predict-SA2-Alerts-EventStream]
FROM allAlertsUnioned
---------------alerts to stream query
INTO [iot-predict-SA2-TSI-EventStream]
FROM allAlertsUnioned
---------------all incoming events storage query
SELECT stream.*
INTO [iot-predict-SA2-ColdStorage]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime
---------------all incoming events to time insights query
SELECT stream.*
INTO [iot-predict-SA2-TSI-AlertStream]
FROM [iot-predict-SA2-input] stream TIMESTAMP BY stream.UtcTime

Since you are using TIMESTAMP BY, the Stream Analytics job's event ordering settings take effect. Please check your job's "event ordering" settings, specifically these two:
Events that arrive late -- the late arrival limit, between 0 seconds and 21 days.
Handling other events -- the error handling policy: drop the event, or adjust the application time to the system clock time.
My guess is that your late arrival limit was more than 5 hours, which is why the 5-hour-old events could be processed.
As you may have figured out from the above, a Stream Analytics job can only process "old" events that are up to 21 days late. To work around this limitation, consider one of the options below:
Remove TIMESTAMP BY, so that all your windowed aggregates use the enqueue time. Depending on your query logic, this might generate incorrect results.
Select "adjust" as the error handling policy. Again, depending on your query logic, this might generate incorrect results.
Shift the application time (stream.UtcTime) to a more recent time with the DATEADD() function, for example TIMESTAMP BY DATEADD(day, 10, UtcTime). This works well for a one-time task where you know the time range of your events.
Use a batch job (outside Stream Analytics) to process the data that is 30 days old.
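To make the arithmetic behind the DATEADD() option concrete, here is a minimal sketch in Python (not Stream Analytics code). It models the late arrival check as a plain comparison between an event's age and the limit, which is a simplification; the function names and the arrival time are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the job is configured with the maximum late arrival limit (21 days).
LATE_ARRIVAL_LIMIT = timedelta(days=21)


def is_within_late_limit(event_time, arrival_time, limit=LATE_ARRIVAL_LIMIT):
    """Return True if the event is no older than the late arrival limit."""
    return arrival_time - event_time <= limit


def shift_event_time(event_time, days):
    """Mimic DATEADD(day, days, UtcTime): move the application time forward."""
    return event_time + timedelta(days=days)


arrival = datetime(2018, 9, 2, 12, 0, 0)   # hypothetical arrival time
original = arrival - timedelta(days=30)    # event generated 30 days earlier
shifted = shift_event_time(original, 10)   # DATEADD(day, 10, UtcTime)

print(is_within_late_limit(original, arrival))  # 30 days late: outside the limit
print(is_within_late_limit(shifted, arrival))   # 20 days late: inside the limit
```

With 30-day-old events and a 21-day limit, any shift of at least 9 days brings the events inside the limit, which is why a shift of 10 days is enough in the example above.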

After a chat with people from Microsoft, it turned out that my test was missing an extra step.
To have late events processed regardless of the late event settings, the job has to be started in such a way that a late event is treated as having been sent while the job was running. In this particular case, that means starting the SA job with a custom start date set to 30 days ago.


JOIN other table only if condition is true for ALL joined rows

I have two tables I'm trying to conditionally JOIN.
dbo.Users looks like this:
dbo.TelemarketingCallAudits looks like this (date format dd/mm/yyyy):
UserID Date CampaignID
------ ---------- ----------
24525 21/01/2018 1
24525 26/08/2018 1
24525 17/02/2018 1
24525 12/01/2017 2
5425 22/01/2018 1
7676 16/11/2017 2
I'd like to return a table that contains ONLY users that I called at least 30 days ago (if CampaignID=1) and at least 70 days ago (if CampaignID=2).
The end result should look like this (today is 02/09/18):
UserID Date CampaignID
------ ---------- ----------
5425 22/01/2018 1
7676 16/11/2017 2
Note that because I called user 24525 with Campaign 1 only 7 days ago, that user should not appear at all.
I tried a simple AND/OR condition, but found that it still returns users I shouldn't see: they have rows for other calls that satisfy the condition, and the rows that fail it are simply ignored, which obviously misses the goal.
I have no idea how to exclude a user entirely if ANY of their associated rows in the second table fails the condition.
internal_TelemarketingCallAudits.CallAuditID IS NULL --No telemarketing calls is fine
internal_TelemarketingCallAudits.CampaignID = 1 --Campaign 1
DATEADD(dd, 75, MAX(internal_TelemarketingCallAudits.Date)) < GETDATE() --Last call occured at least 10 days ago
internal_TelemarketingCallAudits.CampaignID != 1 --Other campaigns
DATEADD(dd, 10, MAX(internal_TelemarketingCallAudits.Date)) < GETDATE() --Last call occured at least 10 days ago
I really appreciate your help.
Try this: SQL Fiddle
select *
from dbo.Users u
inner join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
inner join (
values (1, 60)
, (2, 70)
) c (CampaignId, DaysSinceLastCall)
on tca.CampaignId = c.CampaignId
) mrc
on mrc.UserId = u.UserId
and mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
I'm not comparing all rows here; rather, I noticed that what matters is each user's most recent call and whether it falls in the X-day window. There's some additional complexity because X varies by campaign, so the call you care about is not so much the most recent one as the one most likely to fall within its window. To handle that, I sort each user's calls so that those inside their window come first, followed by those outside it, and then sort by most recent first within those two groups. This gives me the field r.
By filtering on r = 1 for each user, we only get the most recent call (adjusted for campaign windows). By filtering on LastCalledInWindow = 0 we exclude those who have been called within the campaign's window.
NB: I've used an inner query (aliased c) to hold the campaign ids and their corresponding windows. In reality you'd probably want a campaigns table holding that same information instead of coding inside the query itself.
Hopefully everything else is self-explanatory; but give me a nudge in the comments if you need any further information.
Just realised you'd also said "no calls is fine"... Here's a tweaked version to allow for scenarios where the person has not been called.
SQL Fiddle Example.
select *
from dbo.Users u
left outer join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,c.DaysSinceLastCall, tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
inner join (
values (1, 60)
, (2, 70)
) c (CampaignId, DaysSinceLastCall)
on tca.CampaignId = c.CampaignId
) mrc
on mrc.UserId = u.UserId
where (
mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
)
or mrc.r is null --no calls at all
Update: Including a default campaign offset
To include a default, you could do something like the code below (SQL Fiddle Example). Here, I've put each campaign's offset value in the Campaigns table, but created a default campaign with ID = -1 to handle anything for which there is no offset defined. I use a left join between the audit table and the campaigns table so that we get all records from the audit table, regardless of whether there's a campaign defined, then a cross join to get the default campaign. Finally, I use a coalesce to say "if the campaign isn't defined, use the default campaign".
select *
from dbo.Users u
left outer join ( --get the most recent call per user (taking into account different campaign timescales)
select tca.UserId
, tca.CampaignId
, tca.[Date]
, case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end LastCalledInWindow
, row_number() over (partition by tca.UserId order by case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r
from dbo.TelemarketingCallAudits tca
left outer join Campaigns c
on tca.CampaignId = c.CampaignId
cross join Campaigns dflt
where dflt.CampaignId = -1
) mrc
on mrc.UserId = u.UserId
where (
mrc.r = 1 --only accept the most recent call
and mrc.LastCalledInWindow = 0 --only include if they haven't been contacted in the last x days
)
or mrc.r is null --no calls at all
That said, I'd recommend not using a default, but rather ensuring that every campaign has an offset defined. i.e. Presumably you already have a campaigns table; and since this offset value is defined per campaign, you can include a field in that table for holding this offset. Rather than leaving this as null for some records, you could set it to your default value; thus simplifying the logic / avoiding potential issues elsewhere where that value may subsequently be used.
You'd also asked about the order by clause. There is no order by 1/0; so I assume that's a typo. Rather the full statement is row_number() over (partition by tca.UserId order by case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end desc, tca.[Date] desc) r.
The purpose of this piece is to find the "most important" call for each user. By "most important" I basically mean the most recent, since that's generally what we're after; though there's one caveat. If a user is part of 2 campaigns, one with an offset of 30 days and one with an offset of 60 days, they may have had 2 calls, one 32 days ago and one 38 days ago. Though the call from 32 days ago is more recent, if that's on the campaign with the 30 day offset it's outside the window, whilst the older call from 38 days ago may be on the campaign with an offset of 60 days, meaning that it's within the window, so is more of interest (i.e. this user has been called within a campaign window).
Given the above requirement, here's how this code meets it:
row_number() produces a number from 1, counting up, for each row in the (sub)query's results. The counter is reset to 1 for each partition
partition by tca.UserId says that we're partitioning by the user id; so for each user there will be 1 row for which row_number() returns 1, then for each additional row for that user there will be a consecutive number returned.
The order by part of this statement defines which of each users' rows gets #1, then how the numbers progress thereafter; i.e. the first row according to the order by gets number 1, the next number 2, etc.
case when DateAdd(Day,coalesce(c.DaysSinceLastCall,dflt.DaysSinceLastCall), tca.[Date]) > getutcdate() then 1 else 0 end returns 1 for calls within their campaign's window, and 0 for those outside of the window. Since we're ordering by this result in descending order, any records within their campaign's window are returned before any outside of it.
we then order by tca.[Date] desc; i.e. the more recent calls are returned before the earlier calls.
finally, we name the output of this row number as r and in the outer query filter on r = 1; meaning that for each user we only take one row, and that's the first row according to the order criteria above; i.e. if there's a row in its campaign's window we take that, after which it's whichever call was most recent (within those in the window if there were any; then outside that window if there weren't).
Take a look at the output of the subquery to get a better idea of exactly how this works: SQL Fiddle
I hope that explanation makes some sense / helps you to understand the code? Sadly I can't find a way to explain it more concisely than the code itself does; so if it doesn't make sense try playing with the code and seeing how that affects the output to see if that helps your understanding.
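If it helps, the same ranking logic can be sketched outside SQL. The Python below is only an illustration, not a replacement for the query: it uses the sample rows from the question, the window sizes from the values (...) list (60 and 70 days), and a fixed "today" of 02/09/2018. The function and constant names are made up for this example.

```python
from datetime import date, timedelta

# Window sizes per campaign, taken from the values (...) list in the query.
DAYS_SINCE_LAST_CALL = {1: 60, 2: 70}

# (UserId, call date, CampaignId) rows from the question's sample table.
CALLS = [
    (24525, date(2018, 1, 21), 1),
    (24525, date(2018, 8, 26), 1),
    (24525, date(2018, 2, 17), 1),
    (24525, date(2017, 1, 12), 2),
    (5425, date(2018, 1, 22), 1),
    (7676, date(2017, 11, 16), 2),
]


def callable_users(calls, today):
    """Return {user_id: (call_date, campaign_id)} for users with no call inside its window."""
    by_user = {}
    for user_id, call_date, campaign_id in calls:
        window = timedelta(days=DAYS_SINCE_LAST_CALL[campaign_id])
        in_window = call_date + window > today
        by_user.setdefault(user_id, []).append((in_window, call_date, campaign_id))

    result = {}
    for user_id, rows in by_user.items():
        # Same ordering as the row_number(): in-window first, then most recent first.
        rows.sort(key=lambda row: (row[0], row[1]), reverse=True)
        in_window, call_date, campaign_id = rows[0]  # the row with r = 1
        if not in_window:                            # LastCalledInWindow = 0
            result[user_id] = (call_date, campaign_id)
    return result


print(callable_users(CALLS, today=date(2018, 9, 2)))
```

User 24525 is dropped because the call on 26/08/2018 sorts first (it is inside its window), while users 5425 and 7676 are kept with their single, out-of-window calls.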

Fetch rows based on condition

I am using PostgreSQL on Amazon Redshift.
My table is :
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2018-01-26 00:39:51','2018-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I have to compare the end time of each record with the start time of the next one; if the time difference is less than 10 seconds, I take the next record's end time, and so on until the last record.
I.e. datediff(seconds, 2018-01-26 00:39:55, 2018-01-26 00:39:56) is < 10 seconds.
I tried this :
SELECT a.app_nm
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
It works, but it doesn't return row y, where the condition fails.
There are two reasons why row y is not returned:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors in the query that it will not handle successfully. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of the DISTINCT ON clause, which returns one row per distinct value of the specified column(s). However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
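As a minimal sketch of the "simple script" approach suggested above, the Python below merges rows into sessions whenever the gap between one row's end and the next row's start is under 10 seconds. It assumes the rows have already been extracted from the table, and it assumes that the 2018 date in the question's first inserted row is a typo for 2016 (as the expected output suggests). The function names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(seconds=10)  # gap threshold from the question


def ts(text):
    """Parse a timestamp in the format used by the question's INSERT statement."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def merge_sessions(rows, max_gap=MAX_GAP):
    """Merge (app, start, end) rows whose gap to the previous row is under max_gap."""
    sessions = []
    for app, start, end in sorted(rows):  # order by app name, then start time
        last = sessions[-1] if sessions else None
        if last and last[0] == app and start - last[2] < max_gap:
            last[2] = max(last[2], end)          # extend the current session
        else:
            sessions.append([app, start, end])   # begin a new session
    return [tuple(session) for session in sessions]


# Sample rows (first row assumed to be 2016, see the note above).
rows = [
    ("AFH", ts("2016-01-26 00:39:51"), ts("2016-01-26 00:39:55")),
    ("AFH", ts("2016-01-26 00:39:56"), ts("2016-01-26 00:40:01")),
    ("AFH", ts("2016-01-26 00:40:05"), ts("2016-01-26 00:40:11")),
    ("AFH", ts("2016-01-26 00:40:12"), ts("2016-01-26 00:40:15")),  # row x
    ("AFH", ts("2016-01-26 00:40:35"), ts("2016-01-26 00:41:34")),  # row y
]

for session in merge_sessions(rows):
    print(session)
```

The 20-second gap between row x and row y is what starts the second session; every other gap is under 10 seconds, so those rows collapse into the first one.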
Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
start AS session_start,
lead(end1, 1)
ORDER BY end1) AS session_end,
CASE WHEN session_switch = 0 AND reverse_session_switch = 1
THEN 'start'
ELSE 'end' END AS session_boundary
CASE WHEN datediff(seconds, end1, lead(start, 1)
ORDER BY end1 ASC)) > 10
ELSE 0 END AS session_switch,
CASE WHEN datediff(seconds, lead(end1, 1)
ORDER BY end1 DESC), start) > 10
ELSE 0 END AS reverse_session_switch
FROM app_tax
AS sessioned
WHERE session_switch != 0 OR reverse_session_switch != 0
ORDER BY end1 ASC) AS row_num
) AS with_row_number
WHERE row_num = 1
) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in): the rows where the gap between an end and the next start exceeds the limit.
with_row_number - just a patch to extract the first row, because there is no switch into it (there is an implicit switch that we record as 'start').
with_boundary - then we identify the rows where specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we combine each 'start' row with the end of the following 'end' row into a single row (thus defining the session duration), and remove the 'end' rows.
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.

struggling with SQL aggregate function

I have a table containing weights over time which I want to evaluate as flow:
Scan TimeStamp Position Weight
01 14/11/01 12:00 0 0
01 14/11/01 12:10 10 1.6
02 14/11/01 13:00 0 2.6
02 14/11/01 13:10 10 4.2
Now I want to calculate the flow during a scan (begin to end).
My query looks like this:
Select MeanTime, TheFlow From
(Select AVG(TheTimeStamp) as MeanTime From flow Where ScanNumber=73),
(Select Weightdiff / TimeSpan as TheFlow From
(Select (MaxWeight - MinWeight) as WeightDiff From
(Select Weight as MAXWEIGHT from Flow Where ScanNumber=73 HAVING "POSITION"=MAX("POSITION")),
(Select Weight as MINWEIGHT from FLOW Where ScanNumber=73 HAVING "POSITION"=MIN("POSITION")),
(Select (MaxTime - MinTime) * 24 as TimeSpan From
(Select MAX("THETIMESTAMP") as MaxTime From FLOW Where ScanNumber=73),
(Select MIN("THETIMESTAMP") as MinTime From Flow Where ScanNumber=73))));
I get an error:
SQL error code = -104.
Invalid expression in the select list (not contained in either an aggregate function or the GROUP BY clause).
What's wrong?
To clarify my question, I need to extract the following information from the data:
the mean time between the start (e.g. 12:00) and the end (e.g. 12:10) of a scan (MeanTime), e.g. 12:05 for scan number 01
the weight difference between end and start
the "Flow", calculated from the weight difference and the time between start and end
All in all, I need two values, MeanTime and Flow, which I want to plot (flow over time).
This should do the job for an individual Scan, which appears to be the requirement.
SELECT
-- halve the interval so that MeanTime is the midpoint between the first and last row of the scan
MeanTime = DATEADD(SECOND, DATEDIFF(SECOND, FirstScan.TimeStamp, LastScan.TimeStamp) / 2, FirstScan.TimeStamp)
, WeightDifference = LastScan.Weight - FirstScan.Weight
FROM
-- @Scan is a variable holding the scan number to evaluate
(SELECT Position = MIN(Position) FROM Flow WHERE Scan = @Scan) MinScan
CROSS JOIN (SELECT Position = MAX(Position) FROM Flow WHERE Scan = @Scan) MaxScan
INNER JOIN Flow FirstScan ON MinScan.Position = FirstScan.Position
AND FirstScan.Scan = @Scan
INNER JOIN Flow LastScan ON MaxScan.Position = LastScan.Position
AND LastScan.Scan = @Scan
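For reference, the two values the question asks for (MeanTime and the flow) can be worked through on the sample rows with a short Python sketch. It assumes the timestamps read as 1 November 2014 and expresses the flow per hour, matching the "* 24" conversion in the question's query; the function name is hypothetical.

```python
from datetime import datetime

# (scan, timestamp, position, weight) rows from the question's sample table.
# Assumption: "14/11/01" is read as 2014-11-01.
ROWS = [
    ("01", datetime(2014, 11, 1, 12, 0), 0, 0.0),
    ("01", datetime(2014, 11, 1, 12, 10), 10, 1.6),
    ("02", datetime(2014, 11, 1, 13, 0), 0, 2.6),
    ("02", datetime(2014, 11, 1, 13, 10), 10, 4.2),
]


def scan_flow(rows, scan):
    """Return (mean_time, flow_per_hour) for one scan."""
    scan_rows = [row for row in rows if row[0] == scan]
    first = min(scan_rows, key=lambda row: row[2])  # row at the lowest position
    last = max(scan_rows, key=lambda row: row[2])   # row at the highest position
    span = last[1] - first[1]                       # time between start and end
    mean_time = first[1] + span / 2                 # midpoint of the scan
    hours = span.total_seconds() / 3600
    flow = (last[3] - first[3]) / hours             # weight difference per hour
    return mean_time, flow


for scan in ("01", "02"):
    mean_time, flow = scan_flow(ROWS, scan)
    print(scan, mean_time, round(flow, 3))
```

For scan 01 this gives a mean time of 12:05 and a weight difference of 1.6 over 10 minutes, i.e. a flow of about 9.6 per hour.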

BigQuery query fails the first time and completes successfully the second time

I'm executing the following query.
SELECT properties.os, boundary, user, td,
SUM(boundary) OVER(ORDER BY rows) AS session
SELECT properties.os, ROW_NUMBER() OVER() AS rows, user, td,
CASE WHEN td > 1800 THEN 1 ELSE 0 END AS boundary
SELECT properties.os, AS user,
( - AS td
SELECT properties.os, properties.distinct_id, properties.time, srlno,
srlno-1 AS prev_srlno
SELECT properties.os, properties.distinct_id, properties.time,
OVER (PARTITION BY properties.distinct_id
ORDER BY properties.time) AS srlno
FROM [ziptrips.ziptrips_events]
WHERE properties.time > 1367916800
AND properties.time < 1380003200)) AS t1
SELECT properties.distinct_id, properties.time, srlno,
srlno-1 AS prev_srlno
SELECT properties.distinct_id, properties.time,
(PARTITION BY properties.distinct_id ORDER BY properties.time) AS srlno
FROM [ziptrips.ziptrips_events]
properties.time > 1367916800
AND properties.time < 1380003200 )) AS t2
ON t1.srlno = t2.prev_srlno
WHERE ( - > 0))
It fails the first time with the following error. However, on the second run it completes without any issue. I'd appreciate any pointers on what might be causing this.
The error message is:
Query Failed
Error: Field 'properties.os' not found in table '__R2'.
Job ID: job_VWunPesUJVLxWGZsMgpoti14BM4
We (the BigQuery team) are in the process of rolling out a new version of the query engine that fixes a number of issues like this one. You likely hit an old version of the query engine and then when you retried, hit the new one. It may take us a day or so with a portion of traffic pointing at the updated version in order to verify there aren't any regressions. Please let us know if you hit this again after 24 hours or so.

SQL query determine stopped time in a range

I have to determine the stopped time of a vehicle that sends its status data back to a server every 30 seconds; this data is stored in a database table.
The fields of a status record are (vehicleID, ReceiveDate, ReceiveTime, Speed, Location).
What I want to do is determine the duration of each stop, from the point where the vehicle's speed drops to zero until the vehicle moves again, and so on for each subsequent stop.
For example, on a given day a given vehicle may have 10 stops, and I must determine the duration of each with a query.
The result can be like this:
id Recvdate Rtime Duration
1 2010-05-01 8:30 45min
1 2010-05-01 12:21 3hour
This is an application of window functions (called analytic functions in Oracle).
Your goal is to assign a "block number" to each sequence of stops. That is, all stops in a sequence (for a vehicle) will have the same block number, and this will be different from all other sequences of stops.
Here is a way to assign the block number:
Create a speed flag that says 1 when speed > 0 and 0 when speed = 0.
Enumerate all the records where the speed flag = 1. These are "blocks".
Do a self join to put each flag = 0 in a block (this requires grouping and taking the max blocknum).
Summarize by duration or however you want.
The following code is a sketch of what I mean. It won't solve your problem, because you are not clear about how to handle day breaks, what information you want to summarize, and it has an off-by-1 error (in each sequence of stops it includes the previous non-stop, if any).
with vd as (
    select vd.*,
           -- number the moving records in time order; stopped records get NULL
           (case when SpeedFlag = 1
                 then ROW_NUMBER() over (partition by id, SpeedFlag order by rtime) end) as blocknum
    from (select vd.*, (case when speed = 0 then 0 else 1 end) as SpeedFlag
          from vehicaldata vd
         ) vd
)
select id, blocknum, COUNT(*) as numrecs, SUM(duration) as duration
from (select vd.id, vd.rtime, vd.duration, MAX(vdprev.blocknum) as blocknum
      from vd
      left outer join vd vdprev
        on vd.id = vdprev.id
       and vd.rtime > vdprev.rtime
      group by vd.id, vd.rtime, vd.duration
     ) vd
group by id, blocknum
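To make the idea of grouping consecutive stopped records easier to follow, here is a minimal sketch in Python rather than SQL. It works on (timestamp, speed) records for a single vehicle and makes two assumptions that the question leaves open: a stop ends at the first later record with speed > 0, and a stop still open at the end of the data is closed at the last record. The function name and the sample records are made up for illustration.

```python
from datetime import datetime, timedelta


def stop_durations(records):
    """Return a list of (stop_start, duration) for each run of zero-speed records.

    records: (timestamp, speed) tuples for one vehicle, in any order.
    """
    stops = []
    stop_start = None
    last_time = None
    for timestamp, speed in sorted(records):
        if speed == 0 and stop_start is None:
            stop_start = timestamp                    # vehicle has just stopped
        elif speed > 0 and stop_start is not None:
            stops.append((stop_start, timestamp - stop_start))
            stop_start = None                         # vehicle is moving again
        last_time = timestamp
    if stop_start is not None:                        # stop still open at the end
        stops.append((stop_start, last_time - stop_start))
    return stops


# Hypothetical status records, one every 30 seconds.
t0 = datetime(2010, 5, 1, 8, 29, 0)
step = timedelta(seconds=30)
speeds = [20, 0, 0, 0, 15, 0, 0]
records = [(t0 + i * step, speed) for i, speed in enumerate(speeds)]

for start, duration in stop_durations(records):
    print(start, duration)
```

Unlike the SQL sketch, this version does not count the preceding moving record as part of the stop, so it avoids the off-by-1 issue mentioned above.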