Stream Analytics Left outer join Not Producing Rows - azure-stream-analytics

What I am trying to do:
I want to "throttle" an input stream to its output. Specifically, as I receive multiple similar inputs, I only want to produce an output if one hasn't already been produced in the last N hours.
For example, the input could be thought of as "send an email", but I will get dozens/hundreds of those events. I only want to send an email if I haven't already sent one in the last N hours (or have never sent one).
See the final example here: https://learn.microsoft.com/en-us/stream-analytics-query/join-azure-stream-analytics#examples for something similar to what I am trying to do
What my setup looks like:
There are two inputs to my query:
Ingress: this is the "raw" input stream
Throttled-Sent: this is just a consumer group off of my output stream
My query is as follows:
WITH
AllEvents AS (
    /* This CTE is here only because we can't seem to use GetMetadataPropertyValue in a join clause, so we "materialize" the values here for use later */
    SELECT
        *,
        GetMetadataPropertyValue([Ingress], '[User].[Type]') AS Type,
        GetMetadataPropertyValue([Ingress], '[User].[NotifyType]') AS NotifyType,
        GetMetadataPropertyValue([Ingress], '[User].[NotifyEntityId]') AS NotifyEntityId
    FROM
        [Ingress]
),
UseableEvents AS (
    SELECT *
    FROM AllEvents
    WHERE NotifyEntityId IS NOT NULL
),
AlreadySentEvents AS (
    /* These are the events that would have been previously output (referenced here via a consumer group). We want to capture these to make sure we are not sending new events when an older "already-sent" event can be found */
    SELECT
        *,
        GetMetadataPropertyValue([Throttled-Sent], '[User].[Type]') AS Type,
        GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyType]') AS NotifyType,
        GetMetadataPropertyValue([Throttled-Sent], '[User].[NotifyEntityId]') AS NotifyEntityId
    FROM
        [Throttled-Sent]
)
SELECT i.*
INTO Throttled
FROM UseableEvents i
/* Left join our sent events, looking for those within a particular time frame */
LEFT OUTER JOIN AlreadySentEvents s
    ON i.Type = s.Type
    AND i.NotifyType = s.NotifyType
    AND i.NotifyEntityId = s.NotifyEntityId
    AND DATEDIFF(hour, i, s) BETWEEN 0 AND 4
WHERE s.Type IS NULL /* The IS NULL here is for returning only those Ingress rows that have no corresponding AlreadySentEvents row */
The results I'm seeing:
This query is producing no rows to the output. However, I believe it should be producing something because the Throttled-Sent input has zero rows to begin with. I have validated that my Ingress events are showing up (by simply adjusting the query to remove the left join and checking the results).
I feel like my problem is probably linked to one of the following areas:
I can't have an input that is a consumer group off of the output (but I don't know why that wouldn't be allowed)
My datediff usage/understanding is incorrect
Appreciate any help/guidance/direction!

For throttling, I would recommend looking at the IsFirst function; it might be an easier solution that does not require reading from the output.
For the current query, I think the order of the DATEDIFF parameters needs to be changed, since s comes before i: DATEDIFF(hour, s, i) BETWEEN 0 AND 4
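For illustration, here is a minimal sketch of the IsFirst approach, reusing the UseableEvents step from your query. The 4-hour window and the single partition key are assumptions, and note that ISFIRST windows behave like tumbling windows, so this approximates rather than exactly reproduces a sliding N-hour lookback:

SELECT i.*
INTO Throttled
FROM UseableEvents i
/* Keep only the first event per entity in each 4-hour window.
   ISFIRST takes a single partition key, so this assumes NotifyEntityId
   alone identifies the throttling group. */
WHERE ISFIRST(hour, 4) OVER (PARTITION BY i.NotifyEntityId) = 1

This avoids looping through the output entirely, since the throttling decision is made from the input stream alone.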

Related

Nested SQL evaluation question with unnest

This may be a basic question, but I just couldn't figure it out. Sample data and the query can be found here (under the "First-touch" tab).
I'll skip the marketing terminology here, but basically what the query does is attribute credits/points to placements (ads) based on a certain rule. Here, the rule is "first-touch", which means the credit goes to the first ad the user interacted with - that could be a view or a click. "FLOODLIGHT" here means the user takes action to actually buy the product (a conversion).
As you can see in the sample data, user 1 has one conversion and the first ad is placement 22 (first-touch), so 22 gets 1 point. User 2 has two conversions and the first ad of each is 11, so 11 gets 2 points.
The logic is quite simple here, but I had a difficult time understanding the query itself. What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time? Aren't they essentially the same? I mean, both of them came from UNNEST(t.*_paths.events). And attributed_event.event_time also came from the same place.
What do prev_conversion_event.event_time, conversion_event.event_time, and attributed_event.event_time evaluate to in this scenario anyway? I'm just confused as hell here. I'd much appreciate the help!
For convenience I'm pasting the sample data, the query and output below:
Sample data
Output
/* Substitute *_paths for the specific paths table that you want to query. */
SELECT
  (
    SELECT attributed_event_metadata.placement_id
    FROM (
      SELECT AS STRUCT
        attributed_event.placement_id,
        ROW_NUMBER() OVER(ORDER BY attributed_event.event_time ASC) AS rank
      FROM UNNEST(t.*_paths.events) AS attributed_event
      WHERE
        attributed_event.event_type != "FLOODLIGHT"
        AND attributed_event.event_time < conversion_event.event_time
        AND attributed_event.event_time > (
          SELECT IFNULL(
            (
              SELECT MAX(prev_conversion_event.event_time) AS event_time
              FROM UNNEST(t.*_paths.events) AS prev_conversion_event
              WHERE
                prev_conversion_event.event_type = "FLOODLIGHT"
                AND prev_conversion_event.event_time < conversion_event.event_time
            ),
            0)
        )
    ) AS attributed_event_metadata
    WHERE attributed_event_metadata.rank = 1
  ) AS placement_id,
  COUNT(*) AS credit
FROM
  adh.*_paths AS t,
  UNNEST(*_paths.events) AS conversion_event
WHERE
  conversion_event.event_type = "FLOODLIGHT"
GROUP BY
  placement_id
HAVING
  placement_id IS NOT NULL
ORDER BY
  credit DESC
It is quite a convoluted query, to be fair. I think I know what you are asking; please correct me if that's not the case.
What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time?
You are doing something like: "I want all the events from this array (via UNNEST), and for every event, I want to know which events precede it".
Say you have [A, B, C, D], ordered in succession (A happened before B, A and B happened before C, and so on). The result of unnesting twice and joining on that condition will get you something like [A:(NULL), B:(A), C:(A, B), D:(A, B, C)] (excuse the notation, I hope it is not confusing), where each key:value pair is Event:(Predecessors). Note that A has no events before it, but B has A, and so on.
Now you have a nice table with all the conversion events joined with the events that happened before that one.
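If it helps, here is a tiny self-contained sketch of that pattern (hypothetical data, runnable in BigQuery), mirroring the original query's trick of unnesting the same array twice and comparing event_time:

WITH t AS (
  SELECT [STRUCT('A' AS id, 1 AS event_time),
          STRUCT('B' AS id, 2 AS event_time),
          STRUCT('C' AS id, 3 AS event_time),
          STRUCT('D' AS id, 4 AS event_time)] AS events
)
SELECT
  e.id,
  -- all events from the same array that happened strictly before this one
  ARRAY(SELECT p.id
        FROM UNNEST(t.events) AS p
        WHERE p.event_time < e.event_time
        ORDER BY p.event_time) AS predecessors
FROM t, UNNEST(t.events) AS e
ORDER BY e.id

This returns A:[], B:[A], C:[A, B], D:[A, B, C], which is exactly the Event:(Predecessors) shape described above.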

In Stream Analytics query, how to join data from the same input

I have some JSON data (InputFuelCon) being processed by a Stream Analytics query. I want to join the input with itself, i.e. when I look at one value in the input, I need to look at another value in that same input. How can I do that?
The format of the json is something like
[{"timeseries":[{"fqn":"STATUS.EngineFuelConsumption","vqts":[{"v":10,"q":192,"t":"2018-05-10T12:34:34.000Z"}]},
{"fqn":"STATUS.ShaftsRunning","vqts":[{"v":"1","q":192,"t":"2018-05-10T12:35:34.000Z"}]}]}]
Running the following gives rows, but with 0 as the value:
WITH DataInput1 AS
(
    SELECT
        DATA.Fqn AS fqn,
        DATA.Value AS value,
        DATA.time AS time
    FROM
    (
        SELECT
            Tag.ArrayValue.Fqn AS fqn,
            VQT.ArrayValue.V AS value,
            VQT.ArrayValue.T AS time
        FROM MetsoQuakeFuelCon AS TimeSeries
        CROSS APPLY GetArrayElements(TimeSeries.[timeSeries]) AS Tag
        CROSS APPLY GetArrayElements(Tag.ArrayValue.vqts) AS VQT
    ) AS DATA
    WHERE DATA.fqn like '%EngineFuelConsumption'
),
DataInput2 AS
(
    SELECT
        DATA.Fqn AS fqn,
        DATA.Value AS value,
        DATA.time AS time
    FROM
    (
        SELECT
            Tag.ArrayValue.Fqn AS fqn,
            VQT.ArrayValue.V AS value,
            VQT.ArrayValue.T AS time
        FROM MetsoQuakeFuelCon AS TimeSeries
        CROSS APPLY GetArrayElements(TimeSeries.[timeSeries]) AS Tag
        CROSS APPLY GetArrayElements(Tag.ArrayValue.vqts) AS VQT
    ) AS DATA
    WHERE DATA.fqn like '%ShaftsRunning' and DATA.Value like '1'
),
DataInput AS
(
    SELECT
        I1.Fqn AS fqn,
        CAST(I1.Value AS bigint) / 30 AS value,
        DATETIMEFROMPARTS(DATEPART(year, I1.Time), DATEPART(month, I1.Time), DATEPART(day, I1.Time),
                          DATEPART(hour, I1.Time), 00, 00, 00) AS time
    FROM DataInput1 I1
    JOIN DataInput2 I2
        ON I1.Time = I2.Time
        AND DATEDIFF(MINUTE, I1, I2) BETWEEN 0 AND 1
)
SELECT * FROM DataInput
DataInput1 and DataInput2, if run by themselves, give one record each, and based on my SQL experience the DataInput join on the timestamp should give a result, but it doesn't. I don't understand how DATEDIFF(MINUTE, I1, I2) BETWEEN 0 AND 1 works, but if I remove it, there is an error. Any help will be greatly appreciated.
Please find some answers below. Let me know if you have any further questions.
Why your query doesn't return data with this sample:
I looked at the data and the query; the following statement implies you have strict equality on the value "Time": I1.Time = I2.Time. However, in your sample the time is different for the 2 entries, which is why there is no result.
The DATEDIFF statement doesn't relax any of the equality requirements in the JOIN statement.
By removing the line "I1.Time=I2.Time and" you will see a result for your sample. In that case it will join records arriving within a minute of each other. Note that if you have more than 1 record arriving within the same minute, you will see more than 1 joined result with this logic. Also, you may want to use the application timestamp to compare the timestamps in the data, rather than the arrival time of the message.
More info about DATEDIFF:
JOINs in Azure Stream Analytics are temporal in nature, meaning that each JOIN must provide some limits on how far the matching rows can be separated in time.
In your case, since there is no explicit TIMESTAMP BY, the arrival time will be used.
Then, the JOIN condition will be applied, and in your example there is no data matching the condition.
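To make that concrete, a sketch of the corrected DataInput step (the two input CTEs stay exactly as in your query) would be:

DataInput AS
(
    SELECT
        I1.Fqn AS fqn,
        CAST(I1.Value AS bigint) / 30 AS value,
        DATETIMEFROMPARTS(DATEPART(year, I1.Time), DATEPART(month, I1.Time), DATEPART(day, I1.Time),
                          DATEPART(hour, I1.Time), 00, 00, 00) AS time
    FROM DataInput1 I1
    JOIN DataInput2 I2
        /* the strict I1.Time = I2.Time equality is dropped; the temporal
           bound below joins records arriving within one minute of each other */
        ON DATEDIFF(MINUTE, I1, I2) BETWEEN 0 AND 1
)

Keep in mind the caveats above: multiple records in the same minute will produce multiple joined rows, and without TIMESTAMP BY it is the arrival time, not the t field in the payload, that drives the match.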

sum 'distinct' rows with same values

I have a database which has a feeder that may have several distributors, each which may have several transformers, each which may have several clients and a certain kVA (power that gets to the clients).
And I have the following code:
SELECT f.feeder,
    d.distributor,
    count(DISTINCT t.transformer) AS total_transformers,
    sum(t.Kvan) AS Total_KVA,
    count(c.client) AS Clients
FROM feeders f
LEFT JOIN distributors d
    ON (d.feeder = f.feeder)
LEFT JOIN transformers t
    ON (t.transformer = d.transformer)
LEFT JOIN clients c
    ON (c.transformer = t.transformer)
WHERE d.transformer IS NOT NULL
GROUP BY f.feeder,
    d.distributor
ORDER BY f.feeder,
    d.distributor
The sum is supposed to give the total of the different kVA values the transformers have. Each transformer has a certain kVA. The problem is, a transformer has a single kVA value covering all the clients connected to it, but the query sums that value once per client.
I need to group it on the feeder and distributor (I want to see how much kVA the distributor has and how many clients total).
So what should be "feeder1|dist1|2|600|374" comes back as "feeder1|dist1|2|130000|374" (one transformer has 200 kVA and the other one 400, but the query sums these values across all 374 client rows instead of just computing 200+400).
Your data model seems a little messy, in that you've specified a distributor can have many transformers (and logic suggests that a transformer is only on a single distributor) yet your query implies that the transformer ID is on the distributor record, which normally implies the opposite relationship ...
So if that's right, it must mean that you have multiple records in the distributors table for the same distributor - i.e. distributor can't then be a unique key in distributors table, which makes the query quite hard to reason accurately about. (e.g. What happens if the records for a distributor don't all have the same feeder ID on them? I'm guessing you wouldn't like the answer so much... Presumably you mean for that to be impossible, but if the model is as described it's not impossible. And worse I'm now second-guessing whether the apparent keys on the other tables are in fact unique... But I digress...)
Or maybe something else is broken. Point is the info you've given may be inconsistent or incomplete. Since I'm inferring an abnormal data model I can't guarantee the following is bug-free (though if you provide more detail so I can make fewer guesses, I may be able to refine the answer)...
So you know the trouble is that by the time you're ready to do the aggregation, the transformer data is embedded in a larger row that isn't based just on the identity of the transformer. There are a couple ways you could fix it, basically all centered on changing how you look at the aggregation of values. Here's one option:
select f.feeder
, dtc.distributor
-- next values work because transformer is already unique per group
, count(dtc.transformer) total_transformers
, sum(dtc.kvan) total_kvan
, sum(dtc.clients) clients
from feeders f
join (select d.distributor
, d.feeder
, t.transformer
, max(t.kvan) as kvan -- or min, doesn't matter
, count(distinct c.client) clients
from distributors d
left join transformers t
on d.transformer = t.transformer
left join clients c
on c.transformer = t.transformer
where d.transformer is not null
group by d.distributor, d.feeder, t.transformer
) dtc
on dtc.feeder = f.feeder
group by f.feeder, dtc.distributor
A few notes:
I changed the outer query join to an inner join, because any null rows from the original left join from feeder would be eliminated by the original where clause.
I kept the where clause anyway; having it alongside the distributor-to-transformer left join is a little weird, but it is different from either an inner join or an outer join without the where clause (since the where clause acts on the left table's value). I'm avoiding changing the semantics from your original query, but it's weird enough that this is something you might want to take another look at.
What the subquery buys us here is: the inner query returns one row per feeder/distributor/transformer - i.e. for each feeder/distributor it returns one row per transformer. That row is itself an aggregate so that we can count clients, but since all rows in that aggregation come from the same transformer record, we can use max() to get that single record's kvan value onto the aggregated row.
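To see the fan-out problem and the fix side by side, here is a tiny self-contained demonstration (PostgreSQL-style VALUES/generate_series syntax for illustration; the numbers are hypothetical):

-- two transformers: 200 kVA with 200 clients, 400 kVA with 174 clients
WITH transformers(transformer, kvan) AS (
    VALUES (1, 200), (2, 400)
),
clients(client, transformer) AS (
    SELECT g, 1 FROM generate_series(1, 200) AS g
    UNION ALL
    SELECT g + 200, 2 FROM generate_series(1, 174) AS g
)
-- naive join: kvan is repeated once per client row,
-- so the sum is 200*200 + 400*174 = 109600, not 600
SELECT sum(t.kvan) AS inflated_kva
FROM transformers t
JOIN clients c ON c.transformer = t.transformer;

-- aggregating per transformer first collapses the repeats: 200 + 400 = 600
SELECT sum(kvan) AS total_kva
FROM (
    SELECT t.transformer, max(t.kvan) AS kvan
    FROM transformers t
    JOIN clients c ON c.transformer = t.transformer
    GROUP BY t.transformer
) per_transformer;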

How to: Self join Stream Analytics?

I'm trying to replace null values with the average over the last 10 seconds in a Stream Analytics job.
This requires a self join between the stream and the averages that I calculate in the WITH clause.
It is giving me duplicates (I get the same record twice or thrice). Any suggestions on what's wrong or how to do it properly?
My query is:
WITH MV AS ( Select AVG([Sensor_1]) AS [Sensor_1] From [input] GROUP BY SlidingWindow(second, 10))
SELECT [input].[ID]
,[input].[Timestamp]
,[input].[Result]
,CASE
WHEN [input].[Sensor_1] = 0
THEN [MV].[Sensor_1] ELSE [input].[Sensor_1]
END [Sensor_1]
,[input].[Sensor_2]
,[input].[Sensor_3]
FROM [input]
LEFT OUTER JOIN [MV]
ON DateDiff(second, [input], [MV]) BETWEEN 0 AND 10
Sorry for the delay in responding on this.
The simplest solution is to change ON DateDiff(second, [input], [MV]) BETWEEN 0 AND 10 to ON DateDiff(millisecond, [input], [MV]) = 0.
This is because the timestamps given in the MV step are those of the last event that went into the SlidingWindow, and those would match the timestamp on the event in Input (note: the smaller the time unit the better for the match, but if you are using the in-browser testing experience, then millisecond is the smallest supported unit).
Do note that while here we can remove duplicates by removing needless matches in the JOIN, in general Stream Analytics has no mechanism to remove duplicates via DISTINCT or anything like that.
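In other words, the join clause becomes (the rest of the query is unchanged):

FROM [input]
LEFT OUTER JOIN [MV]
    /* match only the sliding-window row whose timestamp equals this event's timestamp */
    ON DATEDIFF(millisecond, [input], [MV]) = 0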
Ziv.

SQL query seems to work for 'AND T1.email_address_ IN (subquery)', but returns 0 rows for 'AND T1.email_address_ NOT IN (subquery)'

Good morning. I'm working in Responsys Interact, which is an Oracle-based email campaign management SaaS product. I'm creating a query to filter a target list for an email campaign designed to target a specific sub-set of our master email contact list. Here's the query I created a few weeks ago that appears to work:
/*
Table Symbolic Name
CONTACTS_LIST $A$
Engaged $B$
TRANSACTIONS_RAW $C$
TRANSACTION_LINES_RAW $D$
-- A Responsys Filter (Engaged) will return only an RIID_, nothing else, according to John @ Responsys... so let's join on that to the contact list...
*/
SELECT
DISTINCT $A$.EMAIL_ADDRESS_,
$A$.RIID_,
$A$.FIRST_NAME,
$A$.LAST_NAME,
$A$.EMAIL_PERMISSION_STATUS_
FROM
$A$
JOIN $B$ ON $B$.RIID_ = $A$.RIID_
LEFT JOIN $C$ ON $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
LEFT JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
$A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com') AND
/* don't include hp customers */
$A$.HP_PLAN_START_DATE IS NULL AND
$A$.EMAIL_ADDRESS_ NOT IN
(
SELECT
$C$.EMAIL_ADDRESS_
FROM
$C$
JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
/* Get only purchase transactions for certain item_id's/SKU's */
($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013) ) AND
/* .... within last 60 days (i.e. 2 months) */
$A$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)
)
;
This seems to work, in that if I run the query without the sub-query, we get 720K rows; and if I add back the 'AND NOT IN...' subquery, we get about 700K rows, which appears correct based on what my user knows about her data. What I'm (supposedly) doing with the NOT IN subquery is filtering out any email addresses where the customer has purchased certain items from us in the last 60 days.
So, now I need to add another constraint. We still don't want customers who made certain purchases in the last 60 days as above, but we also want to exclude customers who have purchased another particular item within the last 12 months. So, I thought I would add another subquery, as shown below. Now, this has introduced several problems:
Performance - the query, which took a couple of minutes to run before, now takes quite a few more minutes to run - in fact, it seems to time out....
So, I wondered if there's an issue with having two subqueries. Before thinking about alternatives, I decided to test my new subquery by temporarily deleting the first subquery, so that I had just one subquery similar to above, but with the new item = 11 and within-the-last-12-months logic. With this, the query finally returned after a few minutes, but with zero rows.
Trying to figure out why, I tried simply changing the AND NOT IN (subquery) to AND IN (subquery), and that worked, in that it returned a few thousand rows, as expected.
So why would the same SQL when using AND IN (subquery) "work", but the exact same SQL simply changed to AND NOT IN (subquery) return zero rows, instead of what I would expect, which would be my 700-something-thousand-plus rows, less the couple thousand captured by the subquery result?
Also, what is the best i.e. most performant way to accomplish what I'm trying to do, which is filter by some purchases made within one date range, AND by some other purchases made within a different date range?
Here's the modified version:
SELECT
DISTINCT $A$.EMAIL_ADDRESS_,
$A$.RIID_,
$A$.FIRST_NAME,
$A$.LAST_NAME,
$A$.EMAIL_PERMISSION_STATUS_
FROM
$A$
JOIN $B$ ON $B$.RIID_ = $A$.RIID_
LEFT JOIN $C$ ON $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
LEFT JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
$A$.EMAIL_DOMAIN_ NOT IN ('none.com', 'noemail.com', 'mailinator.com', 'nomail.com') AND
/* don't include hp customers */
$A$.HP_PLAN_START_DATE IS NULL AND
$A$.EMAIL_ADDRESS_ NOT IN
(
SELECT
$C$.EMAIL_ADDRESS_
FROM
$C$
JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
/* Get only purchase transactions for certain item_id's/SKU's */
($D$.ITEM_FAMILY_ID IN (3,4,5,8,14,15) OR $D$.ITEM_ID IN (704,769,1893,2808,3013) ) AND
/* .... within last 60 days (i.e. 2 months) */
$C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -2)
)
AND
$A$.EMAIL_ADDRESS_ NOT IN
(
/* get purchase transactions for another type of item within last year */
SELECT
$C$.EMAIL_ADDRESS_
FROM
$C$
JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
WHERE
$D$.ITEM_FAMILY_ID = 11 AND $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
)
;
Thanks for any ideas/insights. I may be missing or mis-remembering some basic SQL concept here - if so please help me out! Also, Responsys Interact runs on top of Oracle - it's an Oracle product - but I don't know off hand what version/flavor. Thanks!
Looks like my problem with the new subquery was due to poor performance due to lack of indexes. Thanks to Alex Poole's comments, I looked in Responsys and there is a facility to get an 'explain' type analysis, and it was throwing warnings, and suggesting I build some indexes. Found the way to do that on the data sources, went back to the explain, and it said, "The query should run without placing an unnecessary burden on the system". And while it still ran for quite a few minutes, it did finally come back with close to the expected number of rows.
Now, I'm on to tackle the other half of the issue, which is to now incorporate this second sub-query in addition to the first, original subquery....
Ok, upon further testing/analysis and refining my Stack Overflow search criteria, the answer to the main part of my question dealing with IN vs. NOT IN can be found here: SQL "select where not in subquery" returns no results
My performance was helped by using Responsys's explain-like feature and adding some indexes, but when I did that, I also happened to add a little extra SQL to my sub-query's WHERE clause... when I removed that, even after the indexes were built, I was back to zero rows returned. That's because, as it turned out, at least one of the transaction rows for the item family ID I was interested in for this additional sub-query had a null value for email address. And as further explained in the link above, when using NOT IN, as soon as a null value is involved, SQL can't definitively say a value is NOT IN the result set, since you can't really compare anything to null; so as soon as there's a null, the NOT IN predicate evaluates to unknown (effectively false), thus zero rows. When using IN, even though nulls are present, one positive match is enough for the predicate to be true, which is why you'll get rows with IN but not with NOT IN. I hadn't realized that some of our transaction data may have null email addresses - now I know, so I just added a not-null condition on the email address to the sub-query's WHERE clause, and now all's good.
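For anyone hitting the same thing, the NULL behavior can be reproduced in two lines of Oracle SQL, and NOT EXISTS is the usual NULL-safe (and often faster) alternative:

-- 1 is not in (2, 3): one row returned
SELECT 'kept' AS result FROM dual WHERE 1 NOT IN (2, 3);
-- a NULL in the list makes the predicate UNKNOWN: zero rows returned
SELECT 'kept' AS result FROM dual WHERE 1 NOT IN (2, NULL);

So a NULL-safe sketch of my second exclusion, using the same symbolic table names as above, would be:

AND NOT EXISTS
(
    SELECT 1
    FROM $C$
    JOIN $D$ ON $D$.TRANSACTION_ID = $C$.TRANSACTION_ID
    WHERE $C$.EMAIL_ADDRESS_ = $A$.EMAIL_ADDRESS_
    AND $D$.ITEM_FAMILY_ID = 11 AND $C$.TRANDATE > ADD_MONTHS(CURRENT_TIMESTAMP, -12)
)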