Calculate Time Differences in Azure Data Lake Analytics U-SQL Job - azure-data-lake

In our project we have to periodically calculate aggregates and further calculations based on the input data received.
One frequent requirement is to calculate time differences between certain rows in our input data stream.
For example, this is my input datastream:
Timestamp          Event  Value
2017-05-21 11:33   e1     17
2017-05-21 11:37   e2     18
2017-05-21 11:38   e3     18
2017-05-21 11:39   e1     19
2017-05-21 11:42   e2     19
I now want to calculate all the timespans between e2 events and the last received e1 event (ordered by timestamp).
I would expect the result to be:
3 (minutes)
4 (minutes)
A similar requirement would be to calculate the timespans between events of the same type (i.e. all differences between e1 events), where I would expect this result:
6 (minutes)
My attempts so far:
This sort of analysis could be achieved fairly easily using the LAG function in conjunction with a WHEN clause, but unfortunately the WHEN clause is missing in U-SQL.
If this were T-SQL, it would also be possible to solve it using sub-selects in the SELECT clause of the statement, but unfortunately this is not possible in U-SQL either.
Do you have any suggestions or sample scripts on how to solve this issue?
Thank you very much for your help!

In U-SQL, you can use C# methods for simple date arithmetic. If your data is as simple as you describe, you could just rank the e1 and e2 events and then join them, something like this:
#data =
    EXTRACT Timestamp DateTime,
            Event string,
            Value int
    FROM "/input/input58.csv"
    USING Extractors.Csv();

//#data = SELECT *
//        FROM (
//            VALUES
//            ( "2017-05-21 11:33", "e1", 17 ),
//            ( "2017-05-21 11:37", "e2", 18 ),
//            ( "2017-05-21 11:38", "e3", 18 ),
//            ( "2017-05-21 11:39", "e1", 19 ),
//            ( "2017-05-21 11:42", "e2", 19 )
//        ) AS T(Timestamp, Event, Value);

#e1 =
    SELECT ROW_NUMBER() OVER(ORDER BY Timestamp) AS rn,
           *
    FROM #data
    WHERE Event == "e1";

#e2 =
    SELECT ROW_NUMBER() OVER(ORDER BY Timestamp) AS rn,
           *
    FROM #data
    WHERE Event == "e2";

#working =
    SELECT (e2.Timestamp - e1.Timestamp).TotalSeconds AS diff_sec,
           (e2.Timestamp - e1.Timestamp).ToString() AS diff_hhmmss,
           e1.Timestamp AS ts1,
           e2.Timestamp AS ts2
    FROM #e1 AS e1
         INNER JOIN #e2 AS e2 ON e1.rn == e2.rn;

OUTPUT #working TO "/output/output.csv"
USING Outputters.Csv(quoting:false);
My results show 4 and 3 minutes for the sample data.
Will that work for you? If not, please provide a more realistic data sample.
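For the second requirement (the gaps between consecutive e1 events), the same ranking idea can be reused by joining each e1 row to the one that follows it. This is only a minimal, untested sketch building on the #e1 rowset above; the output path is purely illustrative:
// Shift each e1 row's rank by one so it can be matched to the next e1 row.
#e1next =
    SELECT rn + 1 AS rnNext,
           Timestamp
    FROM #e1;

// Each pair (previous e1, next e1) produces one gap.
#e1gaps =
    SELECT (b.Timestamp - a.Timestamp).TotalMinutes AS diff_min,
           a.Timestamp AS ts1,
           b.Timestamp AS ts2
    FROM #e1next AS a
         INNER JOIN #e1 AS b ON a.rnNext == b.rn;

OUTPUT #e1gaps TO "/output/e1_gaps.csv"   // illustrative output path
USING Outputters.Csv(quoting:false);
With the sample data this should yield a single row of 6 minutes.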

#data =
    SELECT LAST_VALUE(Event == "e1" ? Timestamp : (DateTime?)null) OVER (ORDER BY Timestamp) AS E1Time
           // MAX(Event == "e1" ? Timestamp : DateTime.MinValue) OVER (ORDER BY Timestamp) AS E1Time
         , Timestamp AS E2Time
    FROM #events
    HAVING Event == "e2";
This works because aggregates/window functions ignore null (at least they should; the U-SQL documentation for LAST_VALUE doesn't say, so it needs verification), which allows emulation of conditional behavior such as WHEN. Similar behavior can be obtained with MAX/MIN and an appropriate default.
That said, you should spec out the input data and the expected result in detail, which may alter the solution. In particular, can aberrant data sequences occur, and what behavior is expected (or at least tolerated for the sake of simplicity) if they do:
e1, e1, e2 - the above code ignores the earlier e1
e1, e2, e2 - the above code computes two values w.r.t. the same e1
e1, e1, e2, e2 - the above code doesn't recognize nesting; same as case 2
e2 - the above code may crash (null) or throw the results off by using DateTime.MinValue
etc. At some point of complexity you'd probably have to defer to a custom reducer via REDUCE ALL (a last resort!), but that would restrict the size of data that can be processed.

Related

SQL Server query order by sequence serie

I am writing a query and I want it to order by a series. The first seven records should be ordered by 1, 2, 3, 4, 5, 6 and 7, and then it should start all over.
I have tried OVER with PARTITION BY and LAST_VALUE, but I can't figure it out.
This is the SQL code:
set language swedish;

select
    tblRidgruppEvent.id,
    datepart(dw, date) as daynumber,
    tblRidgrupper.name
from
    tblRidgruppEvent
    join tblRidgrupper on tblRidgrupper.id = tblRidgruppEvent.ridgruppid
where
    ridgruppid in (select id from tblRidgrupper
                   where corporationID = 309 and Removeddate is null)
    and tblRidgruppEvent.terminID = (select id from tblTermin
                                     where corporationID = 309 and removedDate is null and isActive = 1)
    and tblRidgrupper.removeddate is null
order by
    datepart(dw, date)
and this is an example of the result:
5887 1 J2
5916 1 J5
6555 2 Junior nybörjare
6004 2 Morgonridning
5911 3 J2
6467 3 J5
and this is what I would expect:
5887 1 J2
6555 2 Junior nybörjare
5911 3 J2
5916 1 J5
6004 2 Morgonridning
6467 3 J5
You might get some value by zooming out a little further and considering what you're trying to do and how else you might do it. SQL tends to perform very poorly with row-by-row processing, as well as with operations where a row borrows details from the row before it. You could also run into problems if you need to change the range you repeat at (switching from 7 to 10 or 4, etc.).
If you still need a somewhat arbitrary number there, you could add ROW_NUMBER combined with a modulo to get a repeating increment, then add it to your select/where criteria. It would look something like this:
((ROW_NUMBER() OVER(ORDER BY column ASC) -1) % 7) + 1 AS Number
The outer +1 is to display the results as 1-7 instead of 0-6, and the inner -1 deals with the off-by-one issue (the column starting at 2 instead of 1). I feel like there's a better way to deal with that, but it's not coming to me at the moment.
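To see the repeating pattern the expression produces, here is a small self-contained illustration; the derived table of numbers is just made-up sample data (the VALUES row constructor needs SQL Server 2008 or later):
-- Number cycles 1,2,3,4,5,6,7 and then starts over at 1
SELECT n,
       ((ROW_NUMBER() OVER (ORDER BY n) - 1) % 7) + 1 AS Number
FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) AS T(n);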
edit: Looking over your post again, it looks like you're dealing with days of the week. You can order by Date even if it's not shown in the select statement, that might be all you need to get this working.
The first seven records should be ordered by 1, 2, 3, 4, 5, 6 and 7. And then it should start all over.
You can use row_number():
order by row_number() over (partition by DATEPART(dw, date) order by tblridgruppevent.id),
datepart(dw, date)
The second key keeps the order within a group.
You don't specify how the rows should be chosen for each group. It is not clear from the question.

where clause with = sign matches multiple records while expected just one record

I have a simple inline view that contains 2 columns.
-----------------
rn | val
-----------------
0 | A
... | ...
25 | Z
I am trying to select a val by matching the rn randomly using the dbms_random.value() function, as in:
with d (rn, val) as
(
select level-1, chr(64+level) from dual connect by level <= 26
)
select * from d
where rn = floor(dbms_random.value()*25)
;
My expectation is it should return one row only without failing.
But now and then I get multiple rows returned or no rows at all.
On the other hand,
select floor(dbms_random.value()*25) from dual connect by level < 1000
returns a whole number for each row, and I failed to see any abnormality.
What am I missing here?
The problem is that the random value is recalculated for each row. So you might get two rows whose rn happens to match a freshly drawn value, or go through all the rows and never get a hit.
One way to get around this is:
select d.*
from (select d.*
from d
order by dbms_random.value()
) d
where rownum = 1;
There are more efficient ways to calculate a random number, but this is intended to be a simple modification to your existing query.
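Spelled out with the CTE from the question, the whole statement would look something like this (the same approach, just combined; untested):
with d (rn, val) as
(
  -- 26 rows: rn 0..25, val A..Z
  select level-1, chr(64+level) from dual connect by level <= 26
)
select d.*
from (select d.*
      from d
      order by dbms_random.value()   -- shuffle once
     ) d
where rownum = 1;                    -- keep a single random row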
You also might want to ask another question. This question starts with a description of a table that is not used, and then the question is about a query that doesn't use the table. Ask another question, describing the table and the real problem you are having -- along with sample data and desired results.

SQL Server 2005 - SUM'ing one field, but only for the first occurrence of a second field

Platform: SQL Server 2005 Express
Disclaimer: I’m quite a novice to SQL and so if you are happy to help with what may be a very simple question, then I won’t be offended if you talk slowly and use small words :-)
I have a table where I want to SUM the contents of multiple rows. However, I want to SUM one column only for the first occurrence of text in a different column.
Table schema for table 'tblMain'
fldOne   {varchar(100)}  Example contents: “Dandelion”
fldTwo   {varchar(8)}    Example contents: “01:00:00” (represents hh:mm:ss)
fldThree {numeric(10,0)} Example contents: “65”
Contents of table:
Row number  fldOne     fldTwo    fldThree
------------------------------------------
1           Dandelion  01:00:00  99
2           Daisy      02:15:00  88
3           Dandelion  00:45:00  77
4           Dandelion  00:30:00  10
5           Dandelion  00:15:00  200
6           Rose       01:30:00  55
7           Daisy      01:00:00  22
etc. ad nauseam
If I use:
Select * from tblMain where fldTwo < '05:00:00' order by fldOne, fldTwo desc
Then all rows are correctly returned, ordered by fldOne and then fldTwo in descending order (although in the example data I've shown, all the data is already in the correct order!)
What I’d like to do is get the SUM of each fldThree, but only from the first occurrence of each fldOne.
So, SUM the first Dandelion, Daisy and Rose that I come across. E.g.
99+88+55
At the moment, I’m doing this programmatically; return a RecordSet from the Select statement above, and MoveNext through each returned row, only adding fldThree to my ‘total’ if I’ve never seen the text from fldOne before. It works, but most of the Select queries return over 100k rows and so it’s quite slow (slow being a relative term – it takes about 50 seconds on my setup).
The actual select statement (selecting about 100k rows from 1.5m total rows) completes in under a second, which is fine. The current programmatic loop is quite small and tight; it's just the number of loops through the RecordSet that takes time. I'm using adOpenForwardOnly and adLockReadOnly when I open the RecordSet.
This is a routine that basically runs continuously as more data is added, and also the fldTwo 'times' vary, so I can't be more specific with the Select statement.
Everything that I’ve so far managed to do natively with SQL seems to run quickly and I’m hoping I can take the logic (and work) away from my program and get SQL to take the strain.
Thanks in advance
The best way to approach this is with window functions. These let you enumerate the rows within a group. However, you need some way to identify the first row. SQL tables are inherently unordered, so you need a column to specify the ordering. Here are some ideas.
If you have an id column, which is defined as an identity so it is autoincremented:
select sum(fldThree)
from (select m.*,
             row_number() over (partition by fldOne order by id) as seqnum
      from tblMain m
     ) m
where seqnum = 1
To get an arbitrary row, you could use:
select sum(fldThree)
from (select m.*,
             row_number() over (partition by fldOne order by (select NULL as noorder)) as seqnum
      from tblMain m
     ) m
where seqnum = 1
Or, if FldTwo has the values in reverse order:
select sum(fldThree)
from (select m.*,
             row_number() over (partition by fldOne order by FldTwo desc) as seqnum
      from tblMain m
     ) m
where seqnum = 1
Maybe this?
SELECT SUM(fldThree) AS ExpectedSum
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY fldOne ORDER BY fldTwo DESC) AS Rn
      FROM tblMain) AS A
WHERE Rn = 1
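For reference, here is a self-contained sketch that checks the windowed approach against the sample rows from the question; the inline data below is built with UNION ALL so it also runs on SQL Server 2005 (names and values are taken from the question):
-- Expected result: 99 + 88 + 55 = 242
SELECT SUM(fldThree) AS ExpectedSum
FROM (SELECT m.*,
             ROW_NUMBER() OVER (PARTITION BY fldOne ORDER BY fldTwo DESC) AS seqnum
      FROM (SELECT 'Dandelion' AS fldOne, '01:00:00' AS fldTwo, 99 AS fldThree
            UNION ALL SELECT 'Daisy',     '02:15:00', 88
            UNION ALL SELECT 'Dandelion', '00:45:00', 77
            UNION ALL SELECT 'Dandelion', '00:30:00', 10
            UNION ALL SELECT 'Dandelion', '00:15:00', 200
            UNION ALL SELECT 'Rose',      '01:30:00', 55
            UNION ALL SELECT 'Daisy',     '01:00:00', 22) AS m
     ) AS m
WHERE seqnum = 1;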

overwrite context dimension on calculated member (SSAS)

Let's suppose a simple scenario: a fact table F with two dimensions D1 and D2.
F D1 D2
10 A B
15 B C
In this scenario I define a new calculated member C1 using an expression close to this one:
with member measures.C1 as
    sum(
        descendants( [D1].[Ds].currentMember, , leaves ),
        [myMeasure]
    )
select
    measures.C1 on 0,
    [D2].[Ds].AllMembers on 1
from [MyCube]
How can I modify C1 so that it always incorporates all D2 members in the expression?
I get these results:
C1 D2
10 B
15 C
and I'm looking for this:
C1 D2
35 B
35 C
(Of course this is a simplification of the real problem; please don't try to fix the C1 expression, only add code to get the expected results. I have tried:
sum(
    { descendants( [D1].[Ds].currentMember, , leaves ),
      [D2].[Ds].AllMembers },
    [myMeasure]
)
unsuccessfully.)
Regards.
For this specific example, change your member statement to the following.
WITH MEMBER [Measures].[C1] AS
SUM([D1].[Ds].[All], [myMeasure])
This gives you everything in that dimension for your measure. That value then should be repeated for each attribute in your D2 dimension.
Based on the title of the question and some of your text, this is only a small example. It may be that you need to investigate SCOPE. It is pretty powerful and you can do some neat things with it.
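Put back into the shape of the original query, that would be something like this (same placeholder names as above; a sketch, not run against a real cube):
WITH MEMBER [Measures].[C1] AS
    SUM([D1].[Ds].[All], [myMeasure])
SELECT
    [Measures].[C1] ON 0,
    [D2].[Ds].AllMembers ON 1
FROM [MyCube]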

How to get a value from previous result row of a SELECT statement?

If we have a table called FollowUp with columns [ ID (int), Value (money) ],
and we have some rows in it, for example:
ID --Value
1------70
2------100
3------150
8------200
20-----250
45-----280
and we want to write one SQL query that returns each row's ID and Value together with the previous row's Value, so that the data appears as follows:
ID --- Value ---Prev_Value
1 ----- 70 ---------- 0
2 ----- 100 -------- 70
3 ----- 150 -------- 100
8 ----- 200 -------- 150
20 ---- 250 -------- 200
45 ---- 280 -------- 250
I wrote the following query, but I think its performance will be poor on a huge amount of data:
SELECT FollowUp.ID, FollowUp.Value,
       (
         SELECT F1.Value
         FROM FollowUp AS F1
         WHERE F1.ID =
             (
               SELECT MAX(F2.ID)
               FROM FollowUp AS F2
               WHERE F2.ID < FollowUp.ID
             )
       ) AS Prev_Value
FROM FollowUp
So can anyone help me find the best solution for such a problem?
This SQL should perform better than the one you have above, although these types of queries tend to be a little performance-intensive... so anything you can put in them to limit the size of the dataset you are looking at will help tremendously. For example, if you are looking at a specific date range, put that in.
SELECT followup.value,
( SELECT TOP 1 f1.VALUE
FROM followup as f1
WHERE f1.id<followup.id
ORDER BY f1.id DESC
) AS Prev_Value
FROM followup
HTH
You can use ROW_NUMBER with the OVER clause to generate nicely increasing row numbers.
select
rownr = row_number() over (order by id)
, value
from your_table
With the numbers, you can easily look up the previous row:
with numbered_rows
as (
select
rownr = row_number() over (order by id)
, value
from your_table
)
select
cur.value
, IsNull(prev.value,0)
from numbered_rows cur
left join numbered_rows prev on cur.rownr = prev.rownr + 1
Hope this is useful.
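To see this against the sample data from the question, the CTE can be fed from an inline derived table (a sketch; the VALUES row constructor needs SQL Server 2008 or later):
with numbered_rows as (
    -- number the sample rows by ID
    select rownr = row_number() over (order by ID),
           ID,
           Value
    from (values (1, 70), (2, 100), (3, 150), (8, 200), (20, 250), (45, 280)) as FollowUp(ID, Value)
)
select cur.ID,
       cur.Value,
       IsNull(prev.Value, 0) as Prev_Value   -- 0 for the first row, as in the expected output
from numbered_rows cur
left join numbered_rows prev on cur.rownr = prev.rownr + 1
order by cur.ID;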
This is not an answer to your actual question.
Instead, I feel that you are approaching the problem from the wrong direction:
In properly normalized relational databases, the tuples ("rows") of each table should contain references to other db items instead of the actual values. Maintaining these relations between tuples belongs to the data-insertion part of the codebase.
That is, if holding the value of the tuple with the closest smaller id number really belongs in your data model at all.
If the requirement to know the previous value comes from the view part of the application - that is, a single view into the data that needs to format it in a certain way - you should pull the contents out, sorted by id, and handle the requirement in view-specific code.
In your case, I would assume that knowing the previous tuple's value really belongs in the view code instead of the database.
EDIT: You did mention that you store them separately and just want to make a query for it. Even still, application code would probably be the more logical place to do this combining.
What about pulling the lines into your application and computing the previous value there?
Create a stored procedure and use a cursor to iterate and produce rows.
You could use the LAG function (available in SQL Server 2012 and later):
SELECT ID,
       Value,
       LAG(Value, 1, 0) OVER (ORDER BY ID) AS Prev_Value
FROM FollowUp;
The third argument of LAG supplies the 0 shown for the first row, which has no predecessor.
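For a quick self-contained check with the sample values from the question (LAG needs SQL Server 2012 or later, and the inline VALUES constructor needs 2008 or later):
SELECT ID,
       Value,
       LAG(Value, 1, 0) OVER (ORDER BY ID) AS Prev_Value   -- 0 for the first row
FROM (VALUES (1, 70), (2, 100), (3, 150), (8, 200), (20, 250), (45, 280)) AS FollowUp(ID, Value);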