SQL: index the same column in two directions for traversing window functions

I'm trying to use window functions to group records that are close to each other (within the same partition) into sequential groups. There's probably a better way to solve the problem, but right now what I would like to try is running too slow to be useful. It involves an order by on the select:
order by person_id, rollup_class, rollup_concept_id, exp_num
and another order by in the window function:
lead(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num DESC)
Because that last column (exp_num) is ordered in opposite directions in the two clauses, the query takes forever. I even have two indexes on the table to handle the two directions:
create index deeIdx on results.drug_exposure_extra (person_id,rollup_class, rollup_concept_id,
exp_num);
create index deeIdx2 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num desc);
But that doesn't help. So I'm trying one that orders exp_num in both directions:
create index deeIdx3 on results.drug_exposure_extra (person_id,rollup_class,rollup_concept_id,
exp_num, exp_num desc);
Does that even make sense? When the index finally finishes building, if it solves the problem, I'll answer my own question...
Nope.
Even with all three indexes, if the two order bys (in the select and in the over clause) go in the same direction, the query runs super fast; if they go in opposite directions, it runs super slow. So, at this point I guess I should explain my use case better and ask for ideas for a better approach.
I've got drug exposure records (this is for a cool open-source project, http://www.ohdsi.org/, btw), and when a person has a drug exposure that begins less than N days after the end of any previous exposure, it should be combined with the earlier ones into a single 'era'. Whenever there is a gap of more than N days, a new era begins.
Over the course of composing this question, it turns out I solved it. It raises some interesting issues, though, so I'll post it and answer it below.

Like asking a doctor, "It hurts when I move my arm like this, what should I do?" the answer is obviously, "Don't move your arm like that." So -- don't try to make windowing functions proceed in a different order from the main query (or probably from each other) -- there's probably a better solution.
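For the record, there is a general trick here, assuming exp_num is unique within each partition: a window function over a descending order can be rewritten over the ascending order by swapping lead for lag (and first_value for last_value), so that every window agrees with the direction of the main order by. These two are equivalent:
lead(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
                             order by exp_num desc)
lag(days_from_latest) over (partition by person_id, rollup_class, rollup_concept_id
                            order by exp_num)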
Early in working on this I had somehow convinced myself that it would be easier to aggregate eras relative to their ending records rather than their starting records, but that was where I went wrong.
So the expression that gives me the era number I want looks like this:
sum(case when exp_num = 1 or days_from_latest > 30 then 1 else 0 end)
over (partition by person_id, rollup_class, rollup_concept_id
order by exp_num)
as era_num
Explanation: if it's the patient's first exposure to the drug (well, to the combination of rollup_class and rollup_concept_id in this case), then that's the beginning of a drug era. It's also the beginning of a drug era if the exposure starts more than N days after every earlier exposure ends. (This point is what makes it a little complicated: say exposure 1 starts at day 1 and lasts 60 days, exposure 2 starts at day 20 and lasts 10 days, and exposure 3 starts at day 70. Exposure 3 is 40 days after the end of the most recent exposure, 2, which would put it in a new era, but it's only 10 days after the end of exposure 1, which puts it in the same era as 1 and 2.) So, for each record that starts an era the case expression gives us a 1, and the rest get 0s. Then we sum that, partitioning over the same partition we used in an earlier query to establish exp_num, and order by exp_num. I could have specified the rows to sum explicitly by adding rows between unbounded preceding and current row, but that's effectively the default behavior anyway (the default frame is range between unbounded preceding and current row, which is the same thing here since exp_num is unique within the partition). So the era number increments only at the beginning of each new era.
Here is a much simplified example in response to gordon-linoff's comment below.
create table junk_numbers (x int);
insert into junk_numbers values (1),(2),(3),(5),(7),(9),(10),(15),(20),(25),(26),(28),(30);
-- break into series wherever the gap is more than 1
select x, gap, 1+sum(case when gap > 1 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
  x | gap | series_num
----+-----+------------
  1 |     |          1
  2 |   1 |          1
  3 |   1 |          1
  5 |   2 |          2
  7 |   2 |          3
  9 |   2 |          4
 10 |   1 |          4
 15 |   5 |          5
 20 |   5 |          6
 25 |   5 |          7
 26 |   1 |          7
 28 |   2 |          8
 30 |   2 |          9
-- same query, but with a bigger gap threshold:
select x, gap, 1+sum(case when gap > 4 then 1 else 0 end) over (order by x) as series_num
from (
select x, x - lag(x) over (order by x) as gap
from junk_numbers
) as x_and_gaps
order by x;
  x | gap | series_num
----+-----+------------
  1 |     |          1
  2 |   1 |          1
  3 |   1 |          1
  5 |   2 |          1
  7 |   2 |          1
  9 |   2 |          1
 10 |   1 |          1
 15 |   5 |          2
 20 |   5 |          3
 25 |   5 |          4
 26 |   1 |          4
 28 |   2 |          4
 30 |   2 |          4

Related

Rolling sum based on date (when dates are missing)

You may be aware of rolling the results of an aggregate over a specific number of preceding rows, e.g.: how many hot dogs did I eat over the last 7 days?
SELECT HotDogCount,
DateKey,
SUM(HotDogCount) OVER (ORDER BY DateKey ROWS 6 PRECEDING) AS HotDogsLast7Days
FROM dbo.HotDogConsumption
Results:
+-------------+------------+------------------+
| HotDogCount | DateKey    | HotDogsLast7Days |
+-------------+------------+------------------+
| 3           | 09/21/2020 | 3                |
| 2           | 09/22/2020 | 5                |
| 1           | 09/23/2020 | 6                |
| 1           | 09/24/2020 | 7                |
| 1           | 09/25/2020 | 8                |
| 4           | 09/26/2020 | 12               |
| 1           | 09/27/2020 | 13               |
| 3           | 09/28/2020 | 13               |
| 2           | 09/29/2020 | 13               |
| 1           | 09/30/2020 | 13               |
+-------------+------------+------------------+
Now, the problem I am having is when there are gaps in the dates. So, basically, one day my intestines and circulatory system are screaming at me: "What the heck are you doing, you're going to kill us all!!!" So, I decide to give my body a break for a day and now there is no record for that day. When I use the "ROWS 6 PRECEDING" method, I will now be reaching back 8 days, rather than 7, because one day was missed.
So, the question is, do any of you know how I could use the OVER clause to truly use a date value (something like "DATEADD(day,-7,DateKey)") to determine how many previous rows should be summed up for a true 7 day rolling sum, regardless of whether I only ate hot dogs on one day or on all 7 days?
Side note: having a record of 0 for the days I didn't eat any hot dogs is not an option. I understand that I could use an array of dates and left join to it and do a
CASE WHEN Datekey IS NULL THEN 0 END
type of deal, but I would like to find out if there is a different way where the ROWS PRECEDING value can somehow be determined dynamically based on the date.
Window functions are the right approach in theory. But to look back at the 7 preceding days (not rows), we need a range frame specification - which, unfortunately, SQL Server does not support.
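For reference, in a database that does support range frames with offsets (PostgreSQL 11+, for example), the true 7-day lookback can be written directly. A sketch, assuming DateKey is a date column:
SELECT HotDogCount,
       DateKey,
       SUM(HotDogCount) OVER (ORDER BY DateKey
                              RANGE BETWEEN INTERVAL '6 days' PRECEDING
                                        AND CURRENT ROW) AS HotDogsLast7Days
FROM HotDogConsumption;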
I am going to recommend a subquery, or a lateral join:
select hdc.*, hdc1.*
from dbo.HotDogConsumption hdc
cross apply (
select coalesce(sum(HotDogCount), 0) HotDogsLast7Days
from dbo.HotDogConsumption hdc1
where hdc1.datekey >= dateadd(day, -7, hdc.datekey)
and hdc1.datekey < hdc.datekey
) hdc1
You might want to adjust the conditions in the where clause of the subquery to the precise frame that you want. The above code computes the sum over the preceding 7 days, not including the current day. Something equivalent to your current attempt would look like this:
where hdc1.datekey >= dateadd(day, -6, hdc.datekey)
and hdc1.datekey <= hdc.datekey
I'm kind of old school, but this is how I'd go about it:
SELECT
HDC1.HotDogCount
,HDC1.DateKey
,( SELECT SUM( HDC2.HotDogCount )
   FROM HotDogConsumption HDC2
   -- DATEADD of -6, not -7: BETWEEN is inclusive on both ends, so this spans 7 days
   WHERE HDC2.DateKey BETWEEN DATEADD( DD, -6, HDC1.DateKey )
                          AND HDC1.DateKey ) AS 'HotDogsLast7Days'
FROM
HotDogConsumption HDC1
;
Someone younger might use an OUTER APPLY or something.

Split a quantity into multiple rows with limit on quantity per row

I have a table of ids and quantities that looks like this:
dbo.Quantity
id | qty
-------
1 | 3
2 | 6
I would like to split the quantity column into multiple lines and number them, but with a set limit (which can be arbitrary) on the maximum quantity allowed for each row.
So for a maximum quantity of 2 per row, the expected output should be:
dbo.DesiredResult
id | qty | bucket
---------------
1 | 2 | 1
1 | 1 | 2
2 | 1 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
In other words,
Running SELECT id, SUM(qty) AS qty FROM dbo.DesiredResult GROUP BY id should return the original table (dbo.Quantity).
Running
SELECT MIN(id) AS id, SUM(qty) AS qty, bucket FROM dbo.DesiredResult GROUP BY bucket
should give you this table.
id | qty | bucket
------------------
1 | 2 | 1
1 | 2 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
I feel I could do this imperatively with cursors, looping over each row and keeping a counter that increments and resets as the "max" for each row is filled. But this is very "anti-SQL", and I feel there is a better way around this.
One approach is a recursive CTE, which emulates a cursor going through the rows sequentially.
Another approach that comes to mind is to represent your data as intervals and intersections of intervals.
Represent this:
id | qty
-------
1 | 3
2 | 6
as the intervals [0;3) and [3;9), with the ids as their labels:
0123456789
|--|-----|
1 2 - id
It is easy to generate this set of intervals using a running total, SUM() OVER().
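A minimal sketch of that step (T-SQL, against the dbo.Quantity table above; End is bracketed because it is a reserved word):
SELECT
    id
    ,COALESCE(SUM(qty) OVER (ORDER BY id
                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS Start
    ,SUM(qty) OVER (ORDER BY id
                    ROWS UNBOUNDED PRECEDING) AS [End]
FROM dbo.Quantity;
-- id | Start | End
--  1 |     0 |   3
--  2 |     3 |   9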
Represent your buckets also as intervals [0;2), [2;4), [4;6), etc. with their own labels
0123456789
|-|-|-|-|-|
1 2 3 4 5 - bucket
It is easy to generate this set of intervals using a table of numbers.
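For instance, a sketch assuming a dbo.Numbers table whose Number column starts at 0, and a bucket size of 2:
DECLARE @BucketSize int = 2;

SELECT
    Number * @BucketSize AS Start
    ,(Number + 1) * @BucketSize AS [End]
    ,Number + 1 AS bucket
FROM dbo.Numbers;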
Intersect these two sets of intervals preserving information about their labels.
Working with sets should be possible in a set-based SQL query, rather than a sequential cursor or recursion.
It is a bit too much for me to write down the actual query right now, but it is quite possible that ideas similar to those discussed in Packing Intervals by Itzik Ben-Gan would be useful here.
Actually, once you have your quantities represented as intervals, you can generate the required number of rows/buckets on the fly from the table of numbers using CROSS APPLY.
Imagine we transformed your Quantity table into Intervals:
Start | End | ID
------+-----+----
    0 |   3 |  1
    3 |   9 |  2
And we also have a table of numbers: a table Numbers with a column Number holding values from 0 to, say, 100K.
For each Start and End of the interval we can calculate the corresponding bucket number by dividing the value by the bucket size and rounding down or up.
Something along these lines:
SELECT
    Intervals.ID
    ,A.qty
    ,A.Bucket
FROM
    Intervals
CROSS APPLY
(
    SELECT
        Numbers.Number + 1 AS Bucket
        ,@BucketSize AS qty
        -- qty equals @BucketSize when the bucket lies completely within the Start and [End] boundaries;
        -- it should be adjusted for the first and last buckets of the interval
    FROM Numbers
    WHERE
        Numbers.Number >= Start / @BucketSize
        AND Numbers.Number < [End] / @BucketSize + 1
) AS A
;
You'll need to check the formulas and adjust for off-by-one errors, and write some CASE WHEN logic to calculate the correct qty for the buckets that fall on the lower and upper boundaries of the interval.
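A hedged sketch of that logic: the correct qty for a bucket is the length of the intersection of the interval [Start; End) with the bucket [Number * @BucketSize; (Number + 1) * @BucketSize). The CASE expressions below stand in for scalar least/greatest, which SQL Server only gained in 2022:
SELECT
    Intervals.ID
    ,A.Bucket
    ,A.qty
FROM
    Intervals
CROSS APPLY
(
    SELECT
        Numbers.Number + 1 AS Bucket
        -- intersection length of [Start; End) with the bucket interval
        ,(CASE WHEN [End] < (Numbers.Number + 1) * @BucketSize
               THEN [End] ELSE (Numbers.Number + 1) * @BucketSize END)
         - (CASE WHEN Start > Numbers.Number * @BucketSize
               THEN Start ELSE Numbers.Number * @BucketSize END) AS qty
    FROM Numbers
    WHERE
        Numbers.Number >= Start / @BucketSize
        AND Numbers.Number < ([End] + @BucketSize - 1) / @BucketSize
) AS A
;
With the sample intervals and @BucketSize = 2, this produces exactly dbo.DesiredResult.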
Use a recursive CTE:
with cte as (
    select id, 1 as n, qty
    from t
    union all
    select id, n + 1, qty
    from cte
    where n + 1 <= qty -- <= so each id yields exactly qty rows, one per unit
)
select id, n
from cte;
Here is a db<>fiddle.
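The unit split above also gets you to the bucketed output: number the unit rows globally, derive each unit's bucket, then re-aggregate per (id, bucket). A sketch, assuming SQL Server and a bucket size of 2:
DECLARE @BucketSize int = 2;

WITH units AS (
    -- one row per unit of qty
    SELECT id, 1 AS n, qty FROM t
    UNION ALL
    SELECT id, n + 1, qty FROM units WHERE n + 1 <= qty
),
numbered AS (
    -- consecutive unit numbers across all ids determine the bucket
    SELECT id,
           (ROW_NUMBER() OVER (ORDER BY id, n) - 1) / @BucketSize + 1 AS bucket
    FROM units
)
SELECT id, COUNT(*) AS qty, bucket
FROM numbered
GROUP BY id, bucket
ORDER BY bucket, id;
For the sample data this returns exactly dbo.DesiredResult.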

Google BigQuery: Window Function Row-Wise Cumulative Sum Across Columns

I am looking to calculate cumulative sum across columns in Google Big Query.
Assume there are five columns (NAME,A,B,C,D) with two rows of integers, for example:
NAME | A | B | C | D
----------------------
Bob | 1 | 2 | 3 | 4
Carl | 5 | 6 | 7 | 8
I am looking for a windowing function or UDF to calculate the cumulative sum across rows to generate this output:
NAME | A | B | C | D
-------------------------
Bob | 1 | 3 | 6 | 10
Carl | 5 | 11 | 18 | 26
Any thoughts or suggestions greatly appreciated!
I think there are a number of reasonable workarounds for your requirements, mostly in the area of designing your table better. It all really depends on how you input your data and, most importantly, how you then consume it.
Still, staying with the requirements as presented: the query below is not exactly the output you expect in your question, but it might be useful as an example:
SELECT name, GROUP_CONCAT(STRING(cum)) AS all FROM (
SELECT name,
SUM(INTEGER(num))
OVER(PARTITION BY name
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum
FROM (
SELECT name, SPLIT(all) AS num FROM (
SELECT name,
CONCAT(STRING(a),',',STRING(b),',',STRING(c),',',STRING(d)) AS all
FROM yourtable
)
)
)
GROUP BY name
Output is:
name all
Bob 1,3,6,10
Carl 5,11,18,26
Depending on how you then consume this data, it still might work for you.
Note: you now avoid writing something like col1 + col2 + .. + col89 + col90, but you still need to mention each column explicitly, just once.
In case you have the "luxury" of implementing your requirements outside of the BigQuery UI, in some client, you can use the BigQuery API to programmatically acquire the table schema, build your logic/query on the fly, and then execute it.
Take a look at these APIs to start with:
To get table schema - https://cloud.google.com/bigquery/docs/reference/v2/tables/get
To issue query job - https://cloud.google.com/bigquery/docs/reference/v2/jobs/insert
There's no need for a UDF:
SELECT name, a, a+b, a+b+c, a+b+c+d
FROM tab
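For completeness, BigQuery standard SQL can do the same without writing out the running sums by hand. A sketch (each column still has to be named once, inside the array literal):
SELECT name,
       ARRAY_AGG(cum ORDER BY pos) AS cums
FROM (
  SELECT name, pos,
         SUM(val) OVER (PARTITION BY name ORDER BY pos) AS cum
  FROM tab, UNNEST([a, b, c, d]) AS val WITH OFFSET pos
)
GROUP BY name;
-- Bob:  [1, 3, 6, 10]
-- Carl: [5, 11, 18, 26]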

SQL Query X Days back excluding date ranges (Confusing!)

Ok, I have a tough SQL query, and I'm not sure how to go about writing it.
I am summing the number of "bananas collected" by an employee within the last X days, but what I could really use help on is determining X.
The "last X days" value is defined to be the last 100 days that the employee was NOT out due to Purple Fever, starting from some ChosenDate (we'll say today, 6/24/14). That is to say, if the person was sick with Purple Fever for 3 days, then I want to look back over the last 103 days from ChosenDate rather than the last 100 days. Any other reason the employee may have been out does not affect our calculation.
Table PersonOutIncident
+----------------------+----------+-------------+
| PersonOutIncidentID | PersonID | ReasonOut |
+----------------------+----------+-------------+
| 1 | Sarah | PurpleFever |
| 2 | Sarah | PaperCut |
| 3 | Jon | PurpleFever |
| 4 | Sarah | PurpleFever |
+----------------------+----------+-------------+
Table PersonOutDetail
+-------------------+----------------------+-----------+-----------+
| PersonOutDetailID | PersonOutIncidentID | BeginDate | EndDate |
+-------------------+----------------------+-----------+-----------+
| 1 | 1 | 1/1/2014 | 1/3/2014 |
| 2 | 1 | 1/7/2014 | 1/13/2014 |
| 3 | 2 | 2/1/2014 | 2/3/2014 |
| 4 | 3 | 1/15/2014 | 1/20/2014 |
| 5 | 4 | 5/1/2014 | 5/15/2014 |
+-------------------+----------------------+-----------+-----------+
The tables are established. Many PersonOutDetail records can be associated with one PersonOutIncident record, and there may be multiple PersonOutIncident records for a single employee. (That is to say, there could be two or three PersonOutIncident records with an identical ReasonOut column, because each represents a particular incident or event and the not-necessarily-continuous days lost to that incident.)
The nature of this requirement complicates things, even conceptually.
The best I can think of is to check for BeginDate/EndDate pairs within the 100-day base period, then determine the number of days from BeginDate to EndDate and add that to the base 100 days. But then I would have to check again that this new range doesn't overlap or contain additional BeginDate/EndDate pairs and, if so, add those days as well. I can tell already that this isn't the method I want to use, but I can't quite wrap my mind around exactly how to structure this query. Does anyone have an idea that might steer me in the right direction? I realize this might not be clear, and I apologize if I'm just confusing things.
One way to do this is to work with a table or WITH clause that contains a list of days. Let's say days is a table with one column that contains the last 200 days. (This means the query will break if the employee had more than 100 sick days in the last 200 days.)
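Such a days list is easy to generate on the fly. A minimal sketch (SQL Server; anchored at today via GETDATE(), so substitute your ChosenDate as needed):
WITH days AS (
    SELECT CAST(GETDATE() AS date) AS day
    UNION ALL
    SELECT DATEADD(day, -1, day)
    FROM days
    WHERE day > DATEADD(day, -199, CAST(GETDATE() AS date))
)
SELECT day
FROM days
OPTION (MAXRECURSION 200); -- the default limit of 100 is too low for 200 rows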
Now you can get a list of all working days of an employee like this (replace ? with the employee id):
WITH t1 AS
(
SELECT day,
ROW_NUMBER() OVER (ORDER BY day DESC) AS 'RowNumber'
FROM days d
WHERE NOT EXISTS (SELECT * FROM PersonOutDetail pd
INNER JOIN PersonOutIncident po ON po.PersonOutIncidentID = pd.PersonOutIncidentID
WHERE d.day BETWEEN pd.BeginDate AND pd.EndDate
AND po.ReasonOut = 'PurpleFever'
AND po.PersonID = ?)
)
SELECT * FROM t1
WHERE RowNumber <= 100;
Alternatively, you can obtain the '100th day' by replacing RowNumber <= 100 with RowNumber = 100.
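From there the final step is just a sum bounded by that cutoff date. A sketch that keeps the WITH t1 AS (...) definition above and assumes a hypothetical BananasCollected(PersonID, CollectedDate, Bananas) table:
SELECT SUM(b.Bananas) AS BananasLast100WorkedDays
FROM BananasCollected b  -- hypothetical table holding the collection counts
WHERE b.PersonID = ?
  AND b.CollectedDate >= (SELECT day FROM t1 WHERE RowNumber = 100);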

Find out irregular entries with SQL

I have some human-error entries in my table: some are missing a zero, some have more material than they should, and so on. So I am trying to scan through a table to find errors within groups of entries.
The table goes like this:
| Work Order | Product | Material Qty |
----------------------------------------
| 1          | Item A  | 10           |
| 2          | Item A  | 25           |
| 3          | Item A  | 12           |
| 4          | Item A  | 9            |
| 5          | Item X  | 52           |
| 6          | Item X  | 20           |
| 7          | Item X  | 23           |
| 8          | Item X  | 24           |
| 9          | Item X  | 2            |
| 10         | Item Z  | 20           |
| 11         | Item Z  | 5            |
----------------------------------------
Now, the WO numbers are not actually sequential; I write them as sequential here only for the example.
As you can see, the Item A quantities should be around 10, give or take; Item X should be around 22, give or take; meanwhile, the query should tag both Item Z rows as suspicious, since there is not enough data to correlate. So I need to isolate WO numbers 2, 5, 9, 10 and 11 for people to audit. Any idea how?
I have been trying to compute their average and use a percentage threshold to eliminate them. But sometimes the numbers vary too much, and in the case of Item Z there is not enough data to decide which numbers are normal and which are irregular, so I need to tag both of them for verification; reducing to a percentage won't help there.
Also, if I reduce them to a percentage variance against the average, the spread is still too wide to tag just one of them.
Any ideas? Because I am really stuck this time.
From a statistical basis, you probably want to start with the STDEV standard deviation function.
select *
from
(
    select *,
        AVG(qty)   OVER (partition by product) av, -- per-product average
        STDEV(qty) OVER (partition by product) sd, -- per-product (sample) standard deviation
        COUNT(*)   OVER (partition by product) c   -- number of rows for the product
    from yourtable
) v
where ABS(qty - av) > sd -- more than one standard deviation from the average
   or c < 3              -- too few rows to judge; flag them all
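On the sample data above this flags exactly the rows in question: WOs 2, 5 and 9 deviate from their product's average by more than one standard deviation, and WOs 10 and 11 are caught by the c < 3 test. You can widen or tighten the net by scaling sd, e.g. ABS(qty - av) > 2 * sd.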