I have a database with a table called matchstats, which includes a column called time that is updated each time an action takes place. I also have a column called groundstatsid, which, when it is not null, means the action took place on the ground as opposed to standing. Finally, I have a column called round.
Example:
Time | groundstatsid | Round
1 | NULL | 1
8 | NULL | 1
15 | NULL | 1
18 | 1 | 1
20 | 1 | 1
22 | NULL | 1
30 | NULL | 1
1 | NULL | 2
To get the full time standing, I would basically want the query to take the first time (1) and store it, then look at groundstatsid until it sees a NON NULL value, take the time at that position, and subtract the stored number from it to get the time in standup (17). Then it would continue looking for where groundstatsid IS NULL again. Once it finds that value, it should repeat the process of looking until it finds a NON NULL value in groundstatsid or a new round, in which case it starts the whole process over.
Once it has gone through an entire match I would want it to Sum the results.
I would expect the query of the example to return 25.
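In case it helps, here is a minimal setup that reproduces the example above (the column types are my assumption; the real table presumably has more columns):
CREATE TABLE matchstats (
    time          integer,   -- assumed: elapsed seconds within the round
    groundstatsid integer,   -- NULL means the action happened standing
    round         integer
);

INSERT INTO matchstats (time, groundstatsid, round) VALUES
    (1, NULL, 1), (8, NULL, 1), (15, NULL, 1), (18, 1, 1),
    (20, 1, 1), (22, NULL, 1), (30, NULL, 1), (1, NULL, 2);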
I would boil this problem down to one where you consider pairs of rows, sorted by time within each round. PostgreSQL can do this in one pass, with no JOINs and no PL/pgSQL, using window functions:
SELECT
    round,
    first_value(time) OVER pair AS first_time,
    last_value(time) OVER pair AS last_time,
    first_value(groundstatsid IS NULL) OVER pair AS first_is_standing,
    last_value(groundstatsid IS NULL) OVER pair AS last_is_standing
FROM matchstats
WINDOW pair AS (PARTITION BY round ORDER BY time ROWS 1 PRECEDING);
This tells PostgreSQL to read the rows from the table (presumably constrained by WHERE fightid=? or something), but to consider each round separately for windowing operations. Window functions like first_value and last_value can access the "window", which I specified to be ORDER BY time ROWS 1 PRECEDING, meaning the window contains both the current row and the one immediately preceding it in time (if any). Thus, window functions let us directly output values for both the current row and its predecessor.
For the data you provided, this query yields:
round | first_time | last_time | first_is_standing | last_is_standing
-------+------------+-----------+-------------------+------------------
1 | 1 | 1 | t | t
1 | 1 | 8 | t | t
1 | 8 | 15 | t | t
1 | 15 | 18 | t | f
1 | 18 | 20 | f | f
1 | 20 | 22 | f | t
1 | 22 | 30 | t | t
2 | 1 | 1 | t | t
Looking at these results helped me decide what to do next. Based on my understanding of your logic, I conclude that the person should be regarded as standing from time 1..1, 1..8, 8..15, 15..18, not standing from 18..20, not standing from 20..22, and is standing again from 22..30. In other words, we want to sum the difference between first_time and last_time where first_is_standing is true. Turning that back into SQL:
SELECT round, SUM(last_time - first_time) AS total_time_standing
FROM (
    SELECT
        round,
        first_value(time) OVER pair AS first_time,
        last_value(time) OVER pair AS last_time,
        first_value(groundstatsid IS NULL) OVER pair AS first_is_standing,
        last_value(groundstatsid IS NULL) OVER pair AS last_is_standing
    FROM matchstats
    WINDOW pair AS (PARTITION BY round ORDER BY time ROWS 1 PRECEDING)
) pairs
WHERE first_is_standing
GROUP BY round;
round | total_time_standing
-------+---------------------
1 | 25
2 | 0
You could also get other values from this same inner query, like the total time or the number of falls by using SUM(CASE WHEN ...) to count independent conditions:
SELECT
    round,
    SUM(CASE WHEN first_is_standing THEN last_time - first_time ELSE 0 END) AS total_time_standing,
    SUM(CASE WHEN first_is_standing AND NOT last_is_standing THEN 1 ELSE 0 END) AS falls,
    SUM(last_time - first_time) AS total_time
FROM (
    SELECT
        round,
        first_value(time) OVER pair AS first_time,
        last_value(time) OVER pair AS last_time,
        first_value(groundstatsid IS NULL) OVER pair AS first_is_standing,
        last_value(groundstatsid IS NULL) OVER pair AS last_is_standing
    FROM matchstats
    WINDOW pair AS (PARTITION BY round ORDER BY time ROWS 1 PRECEDING)
) pairs
GROUP BY round;
round | total_time_standing | falls | total_time
-------+---------------------+-------+------------
1 | 25 | 1 | 29
2 | 0 | 0 | 0
This will calculate standing time for any number of rounds:
SELECT round, sum(down_time - up_time) AS standing_time
FROM (
SELECT round, grp, standing, min(time) AS up_time
,CASE WHEN standing THEN
lead(min(time), 1, max(time)) OVER (PARTITION BY round
ORDER BY min(time))
ELSE NULL END AS down_time
FROM (
SELECT round, time, groundstatsid IS NULL AS standing
,count(groundstatsid) OVER (PARTITION BY round
ORDER BY time) AS grp
FROM tbl
) x
GROUP BY 1, 2, standing
) y
WHERE standing
GROUP BY round
ORDER BY round;
Explanation
Subquery x:
This exploits the fact that count() doesn't count NULL values (neither as an aggregate nor as a window function), so successive rows with a "standing" action (groundstatsid IS NULL) end up with the same value for grp (a minimal demo follows this list).
groundstatsid is also simplified to a boolean column standing, for ease of use and elegance.
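Here is a toy illustration of that counting trick, with made-up values, so you can watch grp form (a NULL in v stands in for a standing action):
SELECT t, v,
       count(v) OVER (ORDER BY t) AS grp
FROM (VALUES (1, NULL), (2, NULL), (3, 7), (4, NULL), (5, 9)) AS s(t, v);

-- grp comes out as 0, 0, 1, 1, 2: each run of NULL rows shares a
-- group number with the non-NULL row that precedes it.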
Subquery y:
Aggregate per group - only standing time matters; from ground time we only need the first row after each standing phase.
Take the minimum time per group as up_time (standing up)
Take the time from the following row (lead(min(time) ...) as down_time (going on the ground). Note that you can use aggregated values in a window function:
lead(min(time), 1, max(time)) OVER ... takes the next min(time) per round and defaults to max(time) of the current row if the round is over (no next row). A toy demo follows.
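To make that less surprising, here is a made-up mini-example of a window function consuming aggregates (windows are evaluated after GROUP BY):
SELECT grp,
       min(t) AS start_t,
       lead(min(t), 1, max(t)) OVER (ORDER BY min(t)) AS next_start
FROM (VALUES (1, 10), (1, 12), (2, 20), (2, 25)) AS v(grp, t)
GROUP BY grp;

-- grp 1: start_t 10, next_start 20
-- grp 2: start_t 20, next_start 25 (no next row, so the default max(t) is used)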
Final SELECT:
Only take standing time into account: WHERE standing
sum(down_time - up_time) aggregates the total standing time per round.
Result ordered per round. Voilà.
This makes heavy use of window functions. Needs PostgreSQL 8.4 or later.
You could do the same procedurally in a plpgsql function if performance is your paramount requirement.
Related
You may be aware of rolling the results of an aggregate over a specific number of preceding rows, e.g.: how many hot dogs did I eat over the last 7 days?
SELECT HotDogCount,
DateKey,
SUM(HotDogCount) OVER (ORDER BY DateKey ROWS 6 PRECEDING) AS HotDogsLast7Days
FROM dbo.HotDogConsumption
Results:
+-------------+------------+------------------+
| HotDogCount | DateKey | HotDogsLast7Days |
+-------------+------------+------------------+
| 3 | 09/21/2020 | 3 |
| 2 | 09/22/2020 | 5 |
| 1 | 09/23/2020 | 6 |
| 1 | 09/24/2020 | 7 |
| 1 | 09/25/2020 | 8 |
| 4 | 09/26/2020 | 12 |
| 1 | 09/27/2020 | 13 |
| 3 | 09/28/2020 | 13 |
| 2 | 09/29/2020 | 13 |
| 1 | 09/30/2020 | 13 |
+-------------+------------+------------------+
Now, the problem I am having is when there are gaps in the dates. So, basically, one day my intestines and circulatory system are screaming at me: "What the heck are you doing, you're going to kill us all!!!" So, I decide to give my body a break for a day and now there is no record for that day. When I use the "ROWS 6 PRECEDING" method, I will now be reaching back 8 days, rather than 7, because one day was missed.
So, the question is, do any of you know how I could use the OVER clause to truly use a date value (something like "DATEADD(day,-7,DateKey)") to determine how many previous rows should be summed up for a true 7 day rolling sum, regardless of whether I only ate hot dogs on one day or on all 7 days?
Side note, to have a record of 0 for the days I didn't eat any hotdogs is not an option. I understand that I could use an array of dates and left join to it and do a
CASE WHEN Datekey IS NULL THEN 0 END
type of deal, but I would like to find out if there is a different way where the rows preceding value can somehow be determined dynamically based on the date.
Window functions are the right approach in theory. But to look back at the 7 preceding days (not rows), we would need a RANGE frame with a date-interval offset - which, unfortunately, SQL Server does not support.
So I am going to recommend a correlated subquery or a lateral join (CROSS APPLY in SQL Server):
select hdc.*, hdc1.*
from dbo.HotDogConsumption hdc
cross apply (
    select coalesce(sum(HotDogCount), 0) as HotDogsLast7Days
    from dbo.HotDogConsumption hdc1
    where hdc1.datekey >= dateadd(day, -7, hdc.datekey)
      and hdc1.datekey < hdc.datekey
) hdc1
You might want to adjust the conditions in the where clause of the subquery to the precise frame that you want. The code above computes over the previous 7 days, not including today. Something equivalent to your current attempt (7 days including today) would be:
where hdc1.datekey >= dateadd(day, -6, hdc.datekey)
and hdc1.datekey <= hdc.datekey
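If the table is large, it may be worth backing the APPLY with an index so each outer row becomes a small range seek. A hedged suggestion (the index name is mine; table and columns are from the question):
CREATE INDEX IX_HotDogConsumption_DateKey
    ON dbo.HotDogConsumption (DateKey)
    INCLUDE (HotDogCount);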
I'm kind of old school, but this is how I'd go about it (note the DATEADD( DD, -6, ... ): BETWEEN is inclusive on both ends, so -6 through today covers exactly 7 days):
SELECT
    HDC1.HotDogCount
   ,HDC1.DateKey
   ,( SELECT SUM( HDC2.HotDogCount )
      FROM HotDogConsumption HDC2
      WHERE HDC2.DateKey BETWEEN DATEADD( DD, -6, HDC1.DateKey )
                             AND HDC1.DateKey ) AS 'HotDogsLast7Days'
FROM
    HotDogConsumption HDC1
;
Someone younger might use an OUTER APPLY or something.
I have a table of ids and quantities that looks like this:
dbo.Quantity
id | qty
-------
1 | 3
2 | 6
I would like to split the quantity column into multiple rows and number them, with a set (arbitrary) limit on the maximum quantity allowed for each row.
So with a limit of 2, the expected output should be:
dbo.DesiredResult
id | qty | bucket
---------------
1 | 2 | 1
1 | 1 | 2
2 | 1 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
In other words,
Running SELECT id, SUM(qty) AS qty FROM dbo.DesiredResult GROUP BY id should return the original table (dbo.Quantity).
Running
SELECT MIN(id) AS id, SUM(qty) AS qty, bucket FROM dbo.DesiredResult GROUP BY bucket
should give you this table.
id | qty | bucket
------------------
1 | 2 | 1
1 | 2 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
I feel I could do this imperatively with cursors, looping over each row and keeping a counter that increments and resets as the "max" for each bucket is filled. But this is very "anti-SQL", and I feel there is a better way around this.
One approach is a recursive CTE, which emulates a cursor going through the rows sequentially.
Another approach that comes to mind is to represent your data as intervals and intersections of intervals.
Represent this:
id | qty
-------
1 | 3
2 | 6
as intervals [0;3), [3;9) with ids being their labels
0123456789
|--|-----|
1 2 - id
It is easy to generate this set of intervals using a running total SUM() OVER(); for instance:
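A sketch of that step in T-SQL, against the sample table (End is bracketed because it is a reserved word):
SELECT id,
       SUM(qty) OVER (ORDER BY id) - qty AS Start,
       SUM(qty) OVER (ORDER BY id)       AS [End]
FROM dbo.Quantity;
-- id 1: [0; 3), id 2: [3; 9)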
Represent your buckets also as intervals [0;2), [2;4), [4;6), etc. with their own labels
0123456789
|-|-|-|-|-|
1 2 3 4 5 - bucket
It is easy to generate this set of intervals using a table of numbers.
Intersect these two sets of intervals preserving information about their labels.
Working with sets should be possible in a set-based SQL query, rather than a sequential cursor or recursion.
It is a bit too much for me to write down the actual query right now. But it is quite possible that ideas similar to those discussed in Packing Intervals by Itzik Ben-Gan may be useful here.
Actually, once you have your quantities represented as intervals, you can generate the required number of rows/buckets on the fly from the table of numbers using CROSS APPLY.
Imagine we transformed your Quantity table into Intervals:
Start | End | ID
0 | 3 | 1
3 | 9 | 2
And we also have a table of numbers - a table Numbers with column Number with values from 0 to, say, 100K.
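If such a Numbers table doesn't exist yet, one common way to materialize it (a sketch; any row source large enough works in place of the sys.all_objects cross join):
SELECT TOP (100001)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS Number
INTO Numbers
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;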
For each Start and End of the interval we can calculate the corresponding bucket number by dividing the value by the bucket size and rounding down or up.
Something along these lines:
SELECT
    Intervals.ID
    ,A.qty
    ,A.Bucket
FROM
    Intervals
    CROSS APPLY
    (
        SELECT
            Numbers.Number + 1 AS Bucket
            ,@BucketSize AS qty
            -- this equals @BucketSize when the bucket lies completely within
            -- the Start and [End] boundaries; it needs adjusting for the
            -- first and last buckets of the interval
        FROM Numbers
        WHERE
            Numbers.Number >= Start / @BucketSize
            AND Numbers.Number < [End] / @BucketSize + 1
    ) AS A
;
You'll need to check the formulas and adjust for off-by-one errors.
And write some CASE WHEN logic for calculating the correct qty for the buckets that happen to be on the lower and upper boundary of the interval.
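Here is a hedged attempt at that boundary logic, under the same assumed names (with @BucketSize = 2 it reproduces dbo.DesiredResult from the question): each CASE pair clips the bucket's range to the interval's [Start; End) and takes the overlap length as qty.
DECLARE @BucketSize int = 2;

SELECT
    Intervals.ID,
    A.qty,
    A.Bucket
FROM Intervals
CROSS APPLY
(
    SELECT
        Numbers.Number + 1 AS Bucket,
        -- overlap of bucket [Number*@BucketSize; (Number+1)*@BucketSize)
        -- with the interval [Start; [End])
        (CASE WHEN (Numbers.Number + 1) * @BucketSize < [End]
              THEN (Numbers.Number + 1) * @BucketSize
              ELSE [End] END)
      - (CASE WHEN Numbers.Number * @BucketSize > Start
              THEN Numbers.Number * @BucketSize
              ELSE Start END) AS qty
    FROM Numbers
    WHERE Numbers.Number >= Start / @BucketSize
      AND Numbers.Number * @BucketSize < [End]
) AS A;
-- ID 1 -> (bucket 1, qty 2), (bucket 2, qty 1)
-- ID 2 -> (bucket 2, qty 1), (bucket 3, qty 2), (bucket 4, qty 2), (bucket 5, qty 1)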
Use a recursive CTE:
with cte as (
      select id, 1 as n, qty
      from t
      union all
      select id, n + 1, qty
      from cte
      where n < qty
     )
select id, n
from cte;
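As written, this explodes each id into one row per unit. To reach the dbo.DesiredResult shape, one hedged extension (assuming a bucket size of 2): number every unit across all ids, derive the bucket by integer division, then re-aggregate.
with cte as (
      select id, 1 as n, qty
      from t
      union all
      select id, n + 1, qty
      from cte
      where n < qty
     ),
     units as (
      select id,
             (row_number() over (order by id, n) - 1) / 2 as bucket0
      from cte
     )
select id, count(*) as qty, bucket0 + 1 as bucket
from units
group by id, bucket0
order by id, bucket;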
I have a large dataset consisting of four sensors in a single stream, but for simplicity's sake let's reduce that to two sensors that transmit at approximately (but not exactly) the same times, like this:
+---------+-------------+-------+
| Sensor | Time | Value |
+---------+-------------+-------+
| SensorA | 10:00:01.14 | 10 |
| SensorB | 10:00:01.06 | 8 |
| SensorA | 10:00:02.15 | 11 |
| SensorB | 10:00:02.07 | 9 |
| SensorA | 10:00:03.14 | 13 |
| SensorA | 10:00:04.09 | 12 |
| SensorB | 10:00:04.13 | 6 |
+---------+-------------+-------+
I am trying to find the difference between SensorA and SensorB when their readings are within a half-second of each other. Like this:
+-------------+-------+
| Trunc_Time | Diff |
+-------------+-------+
| 10:00:01 | 2 |
| 10:00:02 | 2 |
| 10:00:04 | 6 |
+-------------+-------+
I know I could write queries to put each sensor in its own table (say SensorA_table and SensorB_table), and then join those tables like this:
SELECT
TIMESTAMP_TRUNC(a.Time, SECOND) as truncated_sec,
a.Value - b.Value as sensor_diff
FROM SensorA_table AS a JOIN SensorB_Table AS b
ON b.Time BETWEEN TIMESTAMP_SUB(a.Time, INTERVAL 500 MILLISECOND) AND TIMESTAMP_ADD(a.Time, INTERVAL 500 MILLISECOND)
But it seems very expensive to compare every row of SensorA_table against every row of SensorB_table, given that the sensor tables are each about 10 TB. Or does partitioning automatically take care of this and only look at one block of SensorB's table per row of SensorA's table?
Either way, I am wondering if there is a better way to do this than a full JOIN. Since the matching values are all coming from within a few rows of each other in the original table, it seems like an analytic function might be able to look at a smaller amount of data at a time, but because we can't guarantee alternating rows of A & B, there's no clear LAG or LEAD offset that would always return the correct row.
Is it a matter of writing an analytic function to return a few LAG and LEAD rows for each row, then evaluating each of those rows with a CASE statement to see if it is the correct row, and then calculating the value? Or is there a way of doing a join against an analytic function's window?
Thanks for any guidance here.
One method uses lag() over the combined stream, so each reading is compared only with the immediately preceding one. Something like this:
select timestamp_trunc(st.time, second) as trunc_time,
       case when st.sensor = 'SensorA'
            then st.value - st.prev_value
            else st.prev_value - st.value
       end as diff
from (select st.*,
             lag(sensor) over (order by time) as prev_sensor,
             lag(time) over (order by time) as prev_time,
             lag(value) over (order by time) as prev_value
      from sensor_table st
     ) st
where st.sensor <> st.prev_sensor and
      timestamp_diff(st.time, st.prev_time, millisecond) <= 500;
The CASE normalizes the sign so the result is always SensorA minus SensorB. Note this only pairs a reading with the row right before it; if the same sensor reports twice in a row, no pair is produced at that boundary.
I have a problem that needs to be solved using SQL in Oracle.
I have a dataset like given below:
value | date
-------------
1 | 01/01/2017
2 | 02/01/2017
3 | 03/01/2017
3 | 04/01/2017
2 | 05/01/2017
2 | 06/01/2017
4 | 07/01/2017
5 | 08/01/2017
I need to show the result in the below format:
value | date | Group
1 | 01/01/2017 | 1
2 | 02/01/2017 | 2
3 | 03/01/2017 | 3
3 | 04/01/2017 | 3
2 | 05/01/2017 | 4
2 | 06/01/2017 | 4
4 | 07/01/2017 | 5
5 | 08/01/2017 | 6
The logic is: whenever the value changes from one date to the next, it gets assigned a new group/id, but if it's the same as the previous one, then it's part of the same group.
Here is one method using lag() and cumulative sum:
select t.*,
sum(case when value = prev_value then 0 else 1 end) over (order by date) as grp
from (select t.*,
lag(value) over (order by date) as prev_value
from t
) t;
The logic here is to simply count the number of times that the value changes from one month to the next.
This assumes that date is actually stored as a date and not a string. If it is a string, then the ordering will not be correct. Either convert to a date or use a column that specifies the correct ordering.
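If you are stuck with a string column, a hedged variant is to order by an explicit conversion; the format mask below assumes MM/DD/YYYY, and dt stands in for the reserved word date:
select t.*,
       sum(case when value = prev_value then 0 else 1 end)
           over (order by to_date(dt, 'MM/DD/YYYY')) as grp
from (select t.*,
             lag(value) over (order by to_date(dt, 'MM/DD/YYYY')) as prev_value
      from t
     ) t;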
Here is a solution using the MATCH_RECOGNIZE clause, introduced in Oracle 12.*
select value, dt, mn as grp
from inputs
match_recognize (
order by dt
measures match_number() as mn
all rows per match
pattern ( a b* )
define b as value = prev(value)
)
order by dt -- if needed
;
Here is how this works: Other than SELECT, FROM and ORDER BY, the query has only one clause, MATCH_RECOGNIZE. What this clause does is: it takes the rows from inputs and it orders them by dt. Then it searches for patterns: one row, marked as a, with no constraints, followed by zero or more rows b, where b is defined by the condition that the value is the same as for the prev[ious] row. What the clause calculates or measures is the match_number() - first "match" of the pattern, second match etc. We use this match number as the group number (grp) in the outer query - that's all we needed!
*Notes: The existence of solutions like this shows why it is important for posters to state their Oracle version. (Run the statement select * from v$version to find out.) Also: date and group are reserved words in Oracle and shouldn't be used as column names. Not even for posting made-up sample data. (There are workarounds but they aren't needed in this case.) Also, whenever using dates like 03/01/2017 in a post, please indicate whether that is March 1 or January 3, there's no way for "us" to tell. (It wasn't important in this case, but it is in the vast majority of cases.)
I'm trying to use windowing functions to group records close to each other (within the same partition) into sequential groups. There's probably a better way to solve the problem, but right now what I would like to try is running too slow to be useful. It involves an order by on the select:
order by person_id, rollup_class, rollup_concept_id, exp_num
and another order by in the window function:
lead(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num DESC)
Because I have that last column (exp_num) ordered in opposite directions, the query takes forever. I even have two indexes on the table to handle the two directions:
create index deeIdx on results.drug_exposure_extra (person_id,rollup_class, rollup_concept_id,
exp_num);
create index deeIdx2 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num desc);
But that doesn't help. So I'm trying one that orders exp_num in both directions:
create index deeIdx3 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num, exp_num desc);
Does that even make sense? When the index finally finishes building, if it solves the problem, I'll answer my own question...
Nope.
Even with all three indexes, if the two order bys (in the select and in the over clause) go the same direction, the query runs super fast; if they go opposite directions, it runs super slow. So, at this point I guess I should explain my use case better and ask for ideas for a better approach.
I've got drug exposure records (this is for a cool open-source project http://www.ohdsi.org/, btw), and when a person has drug exposures that begin less than N days from the end of any previous exposure, it should be combined with the earlier ones into a single 'era'. Whenever there is a gap of more than N days, a new era begins.
Over the course of composing this question, it turns out I solved it. It raises some interesting issues, though, so I'll post it and answer it below.
Like asking a doctor, "It hurts when I move my arm like this, what should I do?" the answer is obviously, "Don't move your arm like that." So -- don't try to make windowing functions proceed in a different order from the main query (or probably from each other) -- there's probably a better solution.
Early in working on this I had somehow convinced myself that it would be easier to aggregate eras relative to their ending records rather than their starting records, but that was where I went wrong.
So the expression that gives me the era number I want looks like this:
sum(case when exp_num = 1 or days_from_latest > 30 then 1 else 0 end)
over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num)
as era_num
Explanation: if it's the patient's first exposure to the drug (well, to the combination of rollup_class and rollup_concept_id in this case), then that's the beginning of a drug era. It's also the beginning of a drug era if the exposure starts more than N days after the end of every earlier exposure. (This point is what makes it a little complicated: say exposure 1 starts at day 1 and is 60 days, exposure 2 starts at day 20 and is 10 days, and exposure 3 starts at day 70. Exposure 3 is 40 days after the end of the most recent exposure, 2, which would put it in a new era, but it's only about 10 days after the end of exposure 1, which puts it in the same era as 1 and 2.)
So, for each record that starts an era, the case statement gives us a 1; the rest get 0s. Then we sum that, partitioning over the same partition we used in an earlier query to establish exp_num, and order by exp_num. I could have specified the rows to sum explicitly by adding rows between unbounded preceding and current row; the default frame (range between unbounded preceding and current row) behaves the same here because exp_num is unique within each partition. So the era number increments only at the beginning of new eras.
Here is a much simplified example in response to gordon-linoff's comment below.
create table junk_numbers (x int);
insert into junk_numbers values (1),(2),(3),(5),(7),(9),(10),(15),(20),(25),(26),(28),(30);
-- break into series wherever the gap exceeds 1
select x, gap, 1+sum(case when gap > 1 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
x | gap | series_num
----+-----+------------
1 | | 1
2 | 1 | 1
3 | 1 | 1
5 | 2 | 2
7 | 2 | 3
9 | 2 | 4
10 | 1 | 4
15 | 5 | 5
20 | 5 | 6
25 | 5 | 7
26 | 1 | 7
28 | 2 | 8
30 | 2 | 9
-- same query, but breaking only on gaps greater than 4:
select x, gap, 1+sum(case when gap > 4 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
x | gap | series_num
----+-----+------------
1 | | 1
2 | 1 | 1
3 | 1 | 1
5 | 2 | 1
7 | 2 | 1
9 | 2 | 1
10 | 1 | 1
15 | 5 | 2
20 | 5 | 3
25 | 5 | 4
26 | 1 | 4
28 | 2 | 4
30 | 2 | 4