In StreamAnalytic Query how to join data from same input - azure-stream-analytics

I have some json data(InputFuelCon) being processed by a stream analytics query. I want to join the input with itself..i.e when I look at one value in the input, I need to look at another value in that same input..how can I do that..
The format of the json is something like
[{"timeseries":[{"fqn":"STATUS.EngineFuelConsumption","vqts":[{"v":10,"q":192,"t":"2018-05-10T12:34:34.000Z"}]},
{"fqn":"STATUS.ShaftsRunning","vqts":[{"v":"1","q":192,"t":"2018-05-10T12:35:34.000Z"}]}]}]
Running the following gives rows but 0 as the value
WITH DataInput1 AS
(
SELECT
DATA.Fqn AS fqn,
DATA.Value AS value,
DATA.time AS time
FROM
(
SELECT
Tag.ArrayValue.Fqn AS fqn,
VQT.ArrayValue.V AS value,
VQT.ArrayValue.T AS time
FROM MetsoQuakeFuelCon AS TimeSeries
CROSS APPLY GetArrayElements(TimeSeries.[timeSeries]) AS Tag
CROSS APPLY GetArrayElements(Tag.ArrayValue.vqts) AS VQT
) AS DATA
WHERE DATA.fqn like '%EngineFuelConsumption'
),
DataInput2 AS
(
SELECT
DATA.Fqn AS fqn,
DATA.Value AS value,
DATA.time AS time
FROM
(
SELECT
Tag.ArrayValue.Fqn AS fqn,
VQT.ArrayValue.V AS value,
VQT.ArrayValue.T AS time
FROM MetsoQuakeFuelCon AS TimeSeries
CROSS APPLY GetArrayElements(TimeSeries.[timeSeries]) AS Tag
CROSS APPLY GetArrayElements(Tag.ArrayValue.vqts) AS VQT
) AS DATA
WHERE DATA.fqn like '%ShaftsRunning' and DATA.Value like '1'
),
DataInput as (
select I1.Fqn AS fqn,
cast(I1.Value as bigint)/30 AS value,
DATETIMEFROMPARTS(DATEPART(year,I1.Time ),DATEPART(month,I1.Time ),DATEPART(day,I1.Time )
,DATEPART(hour,I1.Time ),00,00,00 ) AS time
from DataInput1 I1 JOIN DataInput2 I2
ON
I1.Time=I2.Time and
DATEDIFF(MINUTE,I1,I2) BETWEEN 0 AND 1
)
select * from DataInput
DataInput1 and DataInput2 if run by themselves, give one record each and with sql experience, the datainput join on the timestamp should give the result, but it doesn't. I don't understand how DATEDIFF(MINUTE,I1,I2) BETWEEN 0 AND 1 works but if I remove it, then there is an error. Any help will be greatly appreciated.

please find some answers below. Let me know if you have any further question.
Why your query doesn't return data with this sample:
I looked at the data and query and the following statement implies you have strict equality on the value "Time": I1.Time=I2.Time. However in your sample, the time is different for the 2 entries, so that's why there is no result.
The DATEDIFF statement doesn't relax any of the equality requirements in the JOIN statement.
By removing the line "I1.Time=I2.Time and" you will see a result for your sample. In that case it will join records arriving within a minute. Note that if you have more than 1 record arriving within the same minute, you will see more than 1 joined result with this logic. Also you may want to use application timestamp to compare the timestamp in the data, and not the arrival time of the message.
More info about DATEDIFF:
JOIN in Azure Stream Analytics are temporal in nature, meaning that each JOIN must provide some limits on how far the matching rows can be separated in time.
In your case since there is no explicit TIMESTAMP, the arrival time will be used.
Then, the JOIN condition will be applied, and in your example there is no data matching the condition.

Related

AWS Timestream query to get average measure for the first month of samples

In AWS Timestream I am trying to get the average heart rate for the first month since we have received heart rate samples for a specific user and the average for the last week. I'm having trouble with the query to get the first month part. When I try to use MIN(time) in the where clause I get the error: WHERE clause cannot contain aggregations, window functions or grouping operations.
SELECT * FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time < min(time) + 30
If I add it as a column and try to query on the column, I get the error: Column 'first_sample_time' does not exist
SELECT MIN(time) AS first_sample_time FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time > first_sample_time
Also if I try to add to MIN(time) I get the error: line 1:18: '+' cannot be applied to timestamp, integer
SELECT MIN(time) + 30 AS first_sample_time FROM "DATABASE"."TABLE"
Here is what I finally came up with but I'm wondering if there is a better way to do it?
WITH first_month AS (
SELECT
Min(time) AS creation_date,
From_milliseconds(
To_milliseconds(
Min(time)
) + 2628000000
) AS end_of_first_month,
USER
FROM
"DATABASE"."TABLE"
WHERE
USER = 'xxx'
AND measure_name = 'heart_rate'
GROUP BY
USER
),
first_month_avg AS (
SELECT
Avg(hm.measure_value :: DOUBLE) AS first_month_average,
fm.USER
FROM
"DATABASE"."TABLE" hm
JOIN first_month fm ON hm.USER = fm.USER
WHERE
measure_name = 'heart_rate'
AND hm.time BETWEEN fm.creation_date
AND fm.end_of_first_month
GROUP BY
fm.USER
),
last_week_avg AS (
SELECT
Avg(measure_value :: DOUBLE) AS last_week_average,
USER
FROM
"DATABASE"."TABLE"
WHERE
measure_name = 'heart_rate'
AND time > ago(14d)
AND USER = 'xxx'
GROUP BY
USER
)
SELECT
lwa.last_week_average,
fma.first_month_average,
lwa.USER
FROM
first_month_avg fma
JOIN last_week_avg lwa ON fma.USER = lwa.USER
Is there a better or more efficient way to do this?
I can see you've run into a few challenges along the way to your solution, and hopefully I can clear these up for you and also propose a cleaner way of reaching your solution.
Filtering on aggregates
As you've experienced first hand, SQL doesn't allow aggregates in the where statement, and you also cannot filter on new columns you've created in the select statement, such as aggregates or case statements, as those columns/results are not present in the table you're querying.
Fortunately there are ways around this, such as:
Making your main query a subquery, and then filtering on the result of that query, like below
Select * from (select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3) where total_good_stuff > 69
This works because the aggregate column (count) is no longer an aggregate at the time it's called in the where statement, it's in the result of the subquery.
Having clause
If a subquery isn't your cup of tea, you can use the having clause straight after your group by statement, which acts like a where statement except exclusively for handling aggregates.
This is better than resorting to a subquery in most cases, as it's more readable and I believe more efficient.
select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3 having total_good_stuff > 69
Finally, window statements are fantastic...they've really helped condense many queries I've made in the past by removing the need for subqueries/ctes. If you could share some example raw data (remove any pii of course) I'd be happy to share an example for your use case.
Nevertheless, hope this helps!
Tom

Modify Postgres query to use generate_series for overall summation over each of several consecutive range intervals

I'm still quite new with SQL, coming from an ORM-centric environment, so please be patient with me.
Provided with a table in the form of:
CREATE TABLE event (id int, order_dates tsrange, flow int);
INSERT INTO event VALUES
(1,'[2021-09-01 10:55:01,2021-09-04 15:16:01)',50),
(2,'[2021-08-15 20:14:27,2021-08-18 22:19:27)',36),
(3,'[2021-08-03 12:51:47,2021-08-05 11:28:47)',41),
(4,'[2021-08-17 09:14:30,2021-08-20 13:57:30)',29),
(5,'[2021-08-02 20:29:07,2021-08-04 19:19:07)',27),
(6,'[2021-08-26 02:01:13,2021-08-26 08:01:13)',39),
(7,'[2021-08-25 23:03:25,2021-08-27 03:22:25)',10),
(8,'[2021-08-12 23:40:24,2021-08-15 08:32:24)',26),
(9,'[2021-08-24 17:19:59,2021-08-29 00:48:59)',5),
(10,'[2021-09-01 02:01:17,2021-09-02 12:31:17)',48); -- etc
the query below does the following:
(here, 'the range' is 2021-08-03T00:00:00 from to 2021-08-04T00:00:00)
For each event that overlaps with the range
Trim the Lower and Upper timestamp values of order_dates to the bounds of the range
Multiply the remaining duration of each applicable event by the event.flow value
Sum all of the multiplied values for a final single value output
Basically, I get all of the events that overlap the range, but only calculate the total value based on the portion of each event that is within the range.
SELECT SUM("total_value")
FROM
(SELECT (EXTRACT(epoch
FROM (LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp) - GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp)))::INTEGER * "event"."flow") AS "total_value"
FROM "event"
WHERE "event"."order_dates" && tsrange('2021-08-03T00:00:00'::timestamp, '2021-08-04T00:00:00'::timestamp, '[)')
GROUP BY "event"."id",
GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp),
LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp),
EXTRACT(epoch
FROM (LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp) - GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp)))::INTEGER, (EXTRACT(epoch
FROM (LEAST(UPPER("event"."order_dates"), '2021-08-04T00:00:00'::timestamp) - GREATEST(LOWER("event"."order_dates"), '2021-08-03T00:00:00'::timestamp)))::INTEGER * "event"."flow")) subquery
The DB<>Fiddle demonstrating this: https://www.db-fiddle.com/f/jMBtKKRS33Qf2FEoY5EdPA/1
This query started out as a complex set of django annotations and aggregation, and I have simplified it to remove the parts not necessary for this question.
So with the above I get a single total value over the input range (in this case a 1-day range).
But I want to be able to use generate_series to perform this same overall summation to each of several consecutive range intervals
e.g.: query for the total during each of the following ranges:
['2021-08-01T00:00:00', '2021-08-02T00:00:00')
['2021-08-02T00:00:00', '2021-08-03T00:00:00')
['2021-08-03T00:00:00', '2021-08-04T00:00:00')
['2021-08-04T00:00:00', '2021-08-05T00:00:00')
This is somewhat related to my previous question here, but since the timestamps for the queried range are used in so many places within the query, I'm pretty lost for how to do this.
Any help/direction will be appreciated.
This should get you started: https://www.db-fiddle.com/f/qm4F7qqWZMrtXtMejimVJr/1.
Basically what I did was to prepare the ranges with a CTE up-front, then select from that table expression with a CROSS JOIN LATERAL of your original query. Next, I replaced all occurrences of 20210803 with lower(target_range) and 20210804 with upper(target_range), then added the GROUP BY of target_range. Note that only those ranges that overlap at least one row in the input will appear in the output; change the cross join to a LEFT JOIN to always see your input ranges in the output, even if value is null. (If so, ON TRUE is fine for the join condition, since you already do the filtering the WHERE of the inner subquery.)

Query to return the amount of time each field equals a true value

I'm collecting data and storing in SQL Server. The table consist of the data shown in the Example Table below. I need to show the amount of time that each field [Fault0-Fault10] is in a fault state. The fault state is represented as a 1 for the fields value.
I have used the following query and got the desired results. However this is for only one field. I need to have the total time for each field in one query. I'm having issues pulling all of the fields into one query.
SELECT
Distinct [Fault0]
,datediff(mi, min(DateAndTime), max(DateAndTime)) TotalTimeInMins
FROM [dbo].[Fault]
Where Fault0 =1
group by [Fault0]
Results
Assuming you want the max() - min(), then you simply need to unpivot the data. Your query can look like:
SELECT v.faultname,
datediff(minute, min(t.DateAndTime), max(t.DateAndTime)) as TotalTimeInMins
FROM [dbo].[Fault] f CROSS APPLY
(VALUES ('fault0', fault0), ('fault1', fault1), . . ., ('fault10', fault10)
) v(faultname, value)
WHERE v.value = 1
GROUP BY v.faultname;

why does this query not give me only the specified accounts?

Oracle SQL Developer
I expect to see:
In the subquery, I have that the rownumber be less than 2. When I run this query separately, it gives me 2 accounts. However, when I'm running the entire query, the list of account numbers just goes on! what's happening here?
SELECT m.acctno, i.intervalstartdate, d.name, i.intervalvalue
FROM endpoints E
JOIN meters m on m.acctid = e.acctid
LEFT JOIN intervaldata I ON I.acctid = M.acctid
LEFT JOIN endpointmodels EM ON EM.endpointmodelid=E.hwmodelid
LEFT JOIN datadefinitions D ON D.datadefinitionid = I.datadefinitionid
WHERE 1=1
AND E.statuscodeid = 8
AND m.FORM = 2
and exists
(
SELECT m2.acctno
from acct m2
where m2.acctno is not null
--and m2.acctno=m2.acctno
and rownum <= 2
)
AND D.datadefinitionid =7077
AND I.intervalstartdate BETWEEN '24-SEP-2017 00:00' and '25-SEP-2017 00:00'
--TRUNC(sysdate - 1) + interval '1' hour AND TRUNC(sysdate - 1) + interval
'24' hour
ORDER BY M.acctno, I.intervalstartdate, I.datadefinitionid
This query is supposed to give me 97 rows for each account. The data i'm reading, the interval values, are the data we report for each customer in 96 intervals. so Im expecting for 2 accounts for example, to get 194 rows. i want to test for 2 accounts now, but then i want to run for 50,000. so with 2, it's not even working. Just giving me millions of rows for two accounts. Basicaly, i think my row num line of code is being ignored. I can't use an in clause because i cant pass 50,000 accounts into there. so I used the exist operator.
Let me know!
I think the error is in trying to use and exists (...) clause. The exists predicate returns true if the subquery returns any rows at all. So, in your case, the result of exists will always be true, unless the table is empty. This means it has no effect whatsoever on the outer query. You need to use something like
inner join (SELECT m2.acctno
from acct m2
where m2.acctno is not null
--and m2.acctno=m2.acctno
and rownum <= 2) sub1
on sub1.acctno = m.acctno
to get what you want instead of and exists (...).
One obvious mistake is the date condition, where you require a date to be between two STRINGS. If you keep dates in string format, you will run into thousands of problems and bugs and you won't be able to fix them.
Do you understand that '25-APR-2008 00:00:00' is between '24-SEP-2017 00:00:00' and '25-SEP-2017 00:00:00', if you compare them alphabetically (as strings)?
The solution is to make sure the date column is in DATE (or TIMESTAMP) data type, and then to compare to dates, not to strings.
As an aside - this will not cause any errors, but it is still bad code - in the EXISTS condition you have a condition for ROWNUM <= 2. What do you think that will do? The subquery either returns at least one row (the first one will automatically have ROWNUM = 1) or it doesn't. The condition on ROWNUM in that subquery in the EXISTS condition is just garbage.

Eliminating Entries Based On Revision

I need to figure out how to eliminate older revisions from my query's results, my database stores orders as 'Q000000' and revisions have an appended '-number'. My query currently is as follows:
SELECT DISTINCT Estimate.EstimateNo
FROM Estimate
INNER JOIN EstimateDetails ON EstimateDetails.EstimateID = Estimate.EstimateID
INNER JOIN EstimateDoorList ON EstimateDoorList.ItemSpecID = EstimateDetails.ItemSpecID
WHERE (Estimate.SalesRepID = '67' OR Estimate.SalesRepID = '61') AND Estimate.EntryDate >= '2017-01-01 00:00:00.000' AND EstimateDoorList.SlabSpecies LIKE '%MDF%'
ORDER BY Estimate.EstimateNo
So for instance, the results would include:
Q120455-10
Q120445-11
Q121675-2
Q122361-1
Q123456
Q123456-1
From this, I need to eliminate 'Q120455-10' because of the presence of '-11' for that order, and 'Q123456' because of the presence of the '-1' revision. I'm struggling greatly with figuring out how to do this, my immediate thought was to use case statements but I'm not sure what is the best way to implement them and how to filter. Thank you in advance, let me know if any more information is needed.
First you have to parse your EstimateNo column into sequence number and revision number using CHARINDEX and SUBSTRING (or STRING_SPLIT in newer versions) and CAST/CONVERT the revision to a numeric type
SELECT
SUBSTRING(Estimate.EstimateNo,0,CHARINDEX('-',Estimate.EstimateNo)) as [EstimateNo],
CAST(SUBSTRING(Estimate.EstimateNo,CHARINDEX('-',Estimate.EstimateNo)+1, LEN(Estimate.EstimateNo)-CHARINDEX('-',Estimate.EstimateNo)+1) as INT) as [EstimateRevision]
FROM
...
You can then use
APPLY - to select TOP 1 row that matches the EstimateNo or
Window function such as ROW_NUMBER to select only records with row number of 1
For example, using a ROW_NUMBER would look something like below:
SELECT
ROW_NUMBER() OVER(PARTITION BY EstimateNo ORDER BY EstimateRevision DESC) AS "LastRevisionForEstimate",
-- rest of the needed columns
FROM
(
-- query above goes here
)
You can then wrap the query above in a simple select with a where predicate filtering out a specific value of LastRevisionForEstimate, for instance
SELECT --needed columns
FROM -- result set above
WHERE LastRevisionForEstimate = 1
Please note that this is to a certain extent, pseudocode, as I do not have your schema and cannot test the query
If you dislike the nested selects, check out the Common Table Expressions