Aggregating tsrange values into day buckets with a tie-breaker - sql

So I've got a schema that lets people donate $ to a set of organizations, with each donation tied to an arbitrary period of time. I'm working on a report that looks at each day and, for each organization, shows the total number of donations and their total cumulative value for that day.
For example, here's a mockup of 3 donors, Alpha (orange), Bravo (green), and Charlie (blue), donating to 2 different organizations (Foo and Bar) over various time periods.
I've created a SQLFiddle that implements the above example in a schema that somewhat reflects what I'm working with in reality: http://sqlfiddle.com/#!17/88969/1
(The schema is broken out into more tables than what you'd come up with given the problem statement to better reflect the real-life version I'm working with)
So far, the query that I've managed to put together looks like this:
WITH report_dates AS (
  SELECT '2018-01-01'::date + g AS date
  FROM generate_series(0, 14) g
), organizations AS (
  SELECT id AS organization_id
  FROM users
  WHERE type = 'Organization'
)
SELECT *
FROM report_dates rd
CROSS JOIN organizations o
LEFT JOIN LATERAL (
  SELECT
    COALESCE(sum(doa.amount_cents), 0) AS total_donations_cents,
    COALESCE(count(doa.*), 0) AS total_donors
  FROM users
  LEFT JOIN donor_organization_amounts doa ON doa.organization_id = users.id
  LEFT JOIN donor_amounts da ON da.id = doa.donor_amounts_id
  LEFT JOIN donor_schedules ds ON ds.donor_amounts_id = da.id
  WHERE users.id = o.organization_id
    AND ds.period && tsrange(rd.date::timestamp, rd.date::timestamp + INTERVAL '1 day', '[)')
) o2 ON true;
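As an aside for anyone puzzled by the bounds argument: '[)' makes each day's bucket half-open, so a period that ends exactly at midnight overlaps the previous day's bucket but not the new day's. A quick check:
-- '[)' excludes the upper bound: a period ending at 2018-01-04 00:00
-- overlaps the Jan 3 bucket but not the Jan 4 bucket.
SELECT tsrange('2018-01-02', '2018-01-04', '[)')
         && tsrange('2018-01-03', '2018-01-04', '[)') AS overlaps_jan_3,  -- true
       tsrange('2018-01-02', '2018-01-04', '[)')
         && tsrange('2018-01-04', '2018-01-05', '[)') AS overlaps_jan_4;  -- false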
Running that query, the results look like this:
| date | organization_id | total_donations_cents | total_donors |
|------------|-----------------|-----------------------|--------------|
| 2018-01-01 | 1 | 0 | 0 |
| 2018-01-02 | 1 | 250 | 1 |
| 2018-01-03 | 1 | 250 | 1 |
| 2018-01-04 | 1 | 1750 | 3 |
| 2018-01-05 | 1 | 1750 | 3 |
| 2018-01-06 | 1 | 1750 | 3 |
| 2018-01-07 | 1 | 750 | 2 |
| 2018-01-08 | 1 | 850 | 2 |
| 2018-01-09 | 1 | 850 | 2 |
| 2018-01-10 | 1 | 500 | 1 |
| 2018-01-11 | 1 | 500 | 1 |
| 2018-01-12 | 1 | 500 | 1 |
| 2018-01-13 | 1 | 1500 | 2 |
| 2018-01-14 | 1 | 1000 | 1 |
| 2018-01-15 | 1 | 0 | 0 |
| 2018-01-01 | 2 | 0 | 0 |
| 2018-01-02 | 2 | 250 | 1 |
| 2018-01-03 | 2 | 250 | 1 |
| 2018-01-04 | 2 | 1750 | 2 |
| 2018-01-05 | 2 | 1750 | 2 |
| 2018-01-06 | 2 | 1750 | 2 |
| 2018-01-07 | 2 | 1750 | 2 |
| 2018-01-08 | 2 | 2000 | 2 |
| 2018-01-09 | 2 | 2000 | 2 |
| 2018-01-10 | 2 | 1500 | 1 |
| 2018-01-11 | 2 | 1500 | 1 |
| 2018-01-12 | 2 | 0 | 0 |
| 2018-01-13 | 2 | 1000 | 2 |
| 2018-01-14 | 2 | 500 | 1 |
| 2018-01-15 | 2 | 0 | 0 |
That's pretty close. However, the problem with this query is that on days where one of a donor's donations ends and a new one from the same donor begins, it should count that donor only once, using the higher-amount donation as the tie-breaker for the cumulative total. An example of that is 2018-01-13 for organization Foo: total_donors should be 1 and total_donations_cents should be 1000.
I tried to implement a tie-breaker using DISTINCT ON but got off into the weeds... any help would be appreciated!
Also, should I be worried about the performance implications of my implementation so far, given the CTEs and the CROSS JOIN?

Figured it out using DISTINCT ON: http://sqlfiddle.com/#!17/88969/4
WITH report_dates AS (
  SELECT '2018-01-01'::date + g AS date
  FROM generate_series(0, 14) g
), organizations AS (
  SELECT id AS organization_id
  FROM users
  WHERE type = 'Organization'
), donors_by_date AS (
  SELECT *
  FROM report_dates rd
  CROSS JOIN organizations o
  LEFT JOIN LATERAL (
    SELECT DISTINCT ON (date, da.donor_id)
      da.donor_id,
      doa.id,
      doa.donor_amounts_id,
      doa.amount_cents
    FROM users
    LEFT JOIN donor_organization_amounts doa ON doa.organization_id = users.id
    LEFT JOIN donor_amounts da ON da.id = doa.donor_amounts_id
    LEFT JOIN donor_schedules ds ON ds.donor_amounts_id = da.id
    WHERE users.id = o.organization_id
      AND ds.period && tsrange(rd.date::timestamp, rd.date::timestamp + INTERVAL '1 day', '[)')
    ORDER BY date, da.donor_id, doa.amount_cents DESC
  ) foo ON true
)
SELECT
  date,
  organization_id,
  COALESCE(SUM(amount_cents), 0) AS total_donations_cents,
  COUNT(*) FILTER (WHERE donor_id IS NOT NULL) AS total_donors
FROM donors_by_date
GROUP BY date, organization_id
ORDER BY organization_id, date;
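The ORDER BY date, da.donor_id, doa.amount_cents DESC inside the lateral subquery is what drives the tie-breaker: DISTINCT ON keeps only the first row of each (date, donor_id) group, and the descending amount sort guarantees that first row is the higher-amount donation when a donor matches twice on the same day.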
Result:
| date | organization_id | total_donations_cents | total_donors |
|------------|-----------------|-----------------------|--------------|
| 2018-01-01 | 1 | 0 | 0 |
| 2018-01-02 | 1 | 250 | 1 |
| 2018-01-03 | 1 | 250 | 1 |
| 2018-01-04 | 1 | 1750 | 3 |
| 2018-01-05 | 1 | 1750 | 3 |
| 2018-01-06 | 1 | 1750 | 3 |
| 2018-01-07 | 1 | 750 | 2 |
| 2018-01-08 | 1 | 850 | 2 |
| 2018-01-09 | 1 | 850 | 2 |
| 2018-01-10 | 1 | 500 | 1 |
| 2018-01-11 | 1 | 500 | 1 |
| 2018-01-12 | 1 | 500 | 1 |
| 2018-01-13 | 1 | 1000 | 1 |
| 2018-01-14 | 1 | 1000 | 1 |
| 2018-01-15 | 1 | 0 | 0 |
| 2018-01-01 | 2 | 0 | 0 |
| 2018-01-02 | 2 | 250 | 1 |
| 2018-01-03 | 2 | 250 | 1 |
| 2018-01-04 | 2 | 1750 | 2 |
| 2018-01-05 | 2 | 1750 | 2 |
| 2018-01-06 | 2 | 1750 | 2 |
| 2018-01-07 | 2 | 1750 | 2 |
| 2018-01-08 | 2 | 2000 | 2 |
| 2018-01-09 | 2 | 2000 | 2 |
| 2018-01-10 | 2 | 1500 | 1 |
| 2018-01-11 | 2 | 1500 | 1 |
| 2018-01-12 | 2 | 0 | 0 |
| 2018-01-13 | 2 | 1000 | 2 |
| 2018-01-14 | 2 | 500 | 1 |
| 2018-01-15 | 2 | 0 | 0 |
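As for my performance worry: the lateral subquery runs once per (day, organization) pair, and each run has to find the schedules overlapping that day. If that ever gets slow, a GiST index on the range column should let Postgres serve the && overlap test from an index. A minimal sketch, assuming the donor_schedules.period column from the fiddle schema:
-- GiST indexes support the && (overlap) operator on range columns.
CREATE INDEX donor_schedules_period_idx
  ON donor_schedules USING gist (period);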

Related

Redshift SQL - Count Sequences of Repeating Values Within Groups

I have a table that looks like this:
| id | date_start | gap_7_days |
| -- | ------------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 |
| 1 | 2021-06-13 00:00:00 | 0 |
| 1 | 2021-06-19 00:00:00 | 0 |
| 1 | 2021-06-27 00:00:00 | 0 |
| 2 | 2021-07-04 00:00:00 | 1 |
| 2 | 2021-07-11 00:00:00 | 1 |
| 2 | 2021-07-18 00:00:00 | 1 |
| 2 | 2021-07-25 00:00:00 | 1 |
| 2 | 2021-08-01 00:00:00 | 1 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-09 00:00:00 | 0 |
| 2 | 2021-08-16 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
| 2 | 2021-08-30 00:00:00 | 1 |
| 2 | 2021-08-31 00:00:00 | 0 |
| 2 | 2021-09-01 00:00:00 | 0 |
| 2 | 2021-08-08 00:00:00 | 1 |
| 2 | 2021-08-15 00:00:00 | 1 |
| 2 | 2021-08-22 00:00:00 | 1 |
| 2 | 2021-08-23 00:00:00 | 1 |
For each ID, I check whether consecutive date_start values are 7 days apart, and put a 1 or 0 in gap_7_days accordingly.
I want to do the following (using Redshift SQL only):
Get the length of each sequence of consecutive 1s in gap_7_days for each ID
Expected output:
| id | date_start | gap_7_days | sequence_length |
| -- | ------------------- | --------------- | --------------- |
| 1 | 2021-06-10 00:00:00 | 0 | |
| 1 | 2021-06-13 00:00:00 | 0 | |
| 1 | 2021-06-19 00:00:00 | 0 | |
| 1 | 2021-06-27 00:00:00 | 0 | |
| 2 | 2021-07-04 00:00:00 | 1 | 6 |
| 2 | 2021-07-11 00:00:00 | 1 | 6 |
| 2 | 2021-07-18 00:00:00 | 1 | 6 |
| 2 | 2021-07-25 00:00:00 | 1 | 6 |
| 2 | 2021-08-01 00:00:00 | 1 | 6 |
| 2 | 2021-08-08 00:00:00 | 1 | 6 |
| 2 | 2021-08-09 00:00:00 | 0 | |
| 2 | 2021-08-16 00:00:00 | 1 | 3 |
| 2 | 2021-08-23 00:00:00 | 1 | 3 |
| 2 | 2021-08-30 00:00:00 | 1 | 3 |
| 2 | 2021-08-31 00:00:00 | 0 | |
| 2 | 2021-09-01 00:00:00 | 0 | |
| 2 | 2021-08-08 00:00:00 | 1 | 4 |
| 2 | 2021-08-15 00:00:00 | 1 | 4 |
| 2 | 2021-08-22 00:00:00 | 1 | 4 |
| 2 | 2021-08-23 00:00:00 | 1 | 4 |
Get the number of sequences for each ID
Expected output:
| id | num_sequences |
| -- | ------------------- |
| 1 | 0 |
| 2 | 3 |
How can I achieve this?
If you want the number of sequences, just look at the previous value. When the current value is "1" and the previous is NULL or 0, then you have a new sequence.
So:
select id,
       -- a new sequence starts when the current value is 1 and the previous is 0/NULL;
       -- CASE is used instead of a boolean-to-int cast for Redshift compatibility
       sum(case when gap_7_days = 1 and coalesce(prev_gap_7_days, 0) = 0
                then 1 else 0
           end) as num_sequences
from (select t.*,
             lag(gap_7_days) over (partition by id order by date_start) as prev_gap_7_days
      from t
     ) t
group by id;
If you actually want the lengths of the sequences, as in the intermediate results, then ask a new question. That information is not needed for this question.
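For completeness, the run lengths fall out of the same lag() idea: mark each run's first row, number the runs with a running sum, then count the 1s per run. A sketch only, untested on Redshift:
with marked as (
    select t.*,
           case when gap_7_days = 1
                 and coalesce(lag(gap_7_days) over (partition by id order by date_start), 0) = 0
                then 1 else 0
           end as seq_start
    from t
), numbered as (
    select marked.*,
           -- running count of sequence starts doubles as the sequence number
           sum(seq_start) over (partition by id order by date_start
                                rows unbounded preceding) as seq_num
    from marked
)
select id, date_start, gap_7_days,
       case when gap_7_days = 1
            then sum(gap_7_days) over (partition by id, seq_num)
       end as sequence_length
from numbered;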

SQL window excluding current group?

I'm trying to produce rolled-up summaries of the following data, both for the group in question and for everything excluding that group. I think this can be done with a window function, but I'm having trouble getting the syntax down (in my case, Hive SQL).
I want the following data to be aggregated:
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but I'm having trouble creating the appropriate join key.
You can do:
select year(date), month(date), product,
       count(*) as ct, avg(rating) as avg_rating,
       sum(count(*)) over (partition by year(date), month(date)) - count(*) as ct_other,
       ((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
        (sum(count(*)) over (partition by year(date), month(date)) - count(*))
       ) as avg_other
from t
group by year(date), month(date), product;
The rating for the "other" rows is a bit tricky. You need to add everything up across the month, subtract out the current group, and calculate the average as that remaining sum divided by the remaining count.
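To sanity-check against the sample data: 2018-01 contains 9 ratings summing to 27 (product A: 5 ratings summing to 17; product D: 4 summing to 10). For product A the subtraction gives (27 - 17) / (9 - 5) = 10 / 4 = 2.5 with other_ct = 9 - 5 = 4, which matches the expected output row.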

SQL - Window Functions with dense_rank()

I have a dataset structured such as the one below stored in Hive, call it df:
+-----+-----+----------+--------+
| id1 | id2 | date | amount |
+-----+-----+----------+--------+
| 1 | 2 | 11-07-17 | 0.93 |
| 2 | 2 | 11-11-17 | 1.94 |
| 2 | 2 | 11-09-17 | 1.90 |
| 1 | 1 | 11-10-17 | 0.33 |
| 2 | 2 | 11-10-17 | 1.93 |
| 1 | 1 | 11-07-17 | 0.25 |
| 1 | 1 | 11-09-17 | 0.33 |
| 1 | 1 | 11-12-17 | 0.33 |
| 2 | 2 | 11-08-17 | 1.90 |
| 1 | 1 | 11-08-17 | 0.30 |
| 2 | 2 | 11-12-17 | 2.01 |
| 1 | 2 | 11-12-17 | 1.00 |
| 1 | 2 | 11-09-17 | 0.94 |
| 2 | 2 | 11-07-17 | 1.94 |
| 1 | 2 | 11-11-17 | 1.92 |
| 1 | 1 | 11-11-17 | 0.33 |
| 1 | 2 | 11-10-17 | 1.92 |
| 1 | 2 | 11-08-17 | 0.94 |
+-----+-----+----------+--------+
I wish to partition by id1 and id2, order by date descending within each (id1, id2) grouping, and then rank "amount" within that, where the same "amount" on consecutive days receives the same rank. The ordered and ranked output I'd hope to see is shown here:
+-----+-----+------------+--------+------+
| id1 | id2 | date | amount | rank |
+-----+-----+------------+--------+------+
| 1 | 1 | 2017-11-12 | 0.33 | 1 |
| 1 | 1 | 2017-11-11 | 0.33 | 1 |
| 1 | 1 | 2017-11-10 | 0.33 | 1 |
| 1 | 1 | 2017-11-09 | 0.33 | 1 |
| 1 | 1 | 2017-11-08 | 0.30 | 2 |
| 1 | 1 | 2017-11-07 | 0.25 | 3 |
| 1 | 2 | 2017-11-12 | 1.00 | 1 |
| 1 | 2 | 2017-11-11 | 1.92 | 2 |
| 1 | 2 | 2017-11-10 | 1.92 | 2 |
| 1 | 2 | 2017-11-09 | 0.94 | 3 |
| 1 | 2 | 2017-11-08 | 0.94 | 3 |
| 1 | 2 | 2017-11-07 | 0.93 | 4 |
| 2 | 2 | 2017-11-12 | 2.01 | 1 |
| 2 | 2 | 2017-11-11 | 1.94 | 2 |
| 2 | 2 | 2017-11-10 | 1.93 | 3 |
| 2 | 2 | 2017-11-09 | 1.90 | 4 |
| 2 | 2 | 2017-11-08 | 1.90 | 4 |
| 2 | 2 | 2017-11-07 | 1.94 | 5 |
+-----+-----+------------+--------+------+
I attempted this with the following SQL query:
SELECT id1,
       id2,
       date,
       amount,
       dense_rank() OVER (PARTITION BY id1, id2 ORDER BY date DESC) AS rank
FROM df
GROUP BY id1, id2, date, amount
But that query isn't doing what I'd like: I'm not getting the output above.
It seems like a window function with dense_rank(), PARTITION BY, and ORDER BY is what I need, but I can't quite get it to produce that sample output. Any help would be much appreciated! Thanks!
This is quite tricky. I think you need to use lag() to see where the value changes and then do a cumulative sum:
select df.*,
       sum(case when prev_amount = amount then 0 else 1 end)
           over (partition by id1, id2 order by date desc) as rank
from (select df.*,
             lag(amount) over (partition by id1, id2 order by date desc) as prev_amount
      from df
     ) df;
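Tracing id1 = 1, id2 = 2 in descending date order: 1.00 has no previous amount, so the running sum starts at 1; 1.92 differs and steps it to 2; the second 1.92 matches and stays at 2; 0.94 steps to 3 and the repeated 0.94 stays at 3; 0.93 steps to 4. That reproduces the expected ranks, including equal amounts on non-consecutive days getting different ranks, which a plain dense_rank() over amount could not do.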

MDX last non empty over multiple dimensions

I would greatly appreciate it if someone could help me with this problem. I have the following fact table:
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| EntryNo | ItemNo | CompanyId | BranchId | LocationId | ValuationDate | ValuatedQty | ValuatedAmount |
+=========+========+===========+==========+============+===============+=============+================+
| 1 | Item1 | 1 | 1 | 1 | 2016-03-01 | 0 | 0 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 2 | Item1 | 1 | 2 | 1 | 2016-03-01 | 4 | 400 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 3 | Item1 | 1 | 1 | 1 | 2016-03-02 | 10 | 1000 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 4 | Item2 | 1 | 1 | 2 | 2016-03-02 | 4 | 200 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 5 | Item2 | 2 | 2 | 2 | 2016-03-02 | 6 | 300 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 6 | Item1 | 2 | 2 | 1 | 2016-03-03 | 0 | 0 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 7 | Item3 | 1 | 2 | 3 | 2016-03-03 | 0 | 0 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
| 8 | Item1 | 2 | 2 | 3 | 2016-03-03 | 9 | 450 |
+---------+--------+-----------+----------+------------+---------------+-------------+----------------+
There are two measures (ValuatedQty and ValuatedAmount) that represent "overstocked" items on a particular day.
Is it possible to create a calculated member that allows slicing the data on all the linked dimensions (Items, Companies, etc.)? I guess the LastNonEmpty aggregation would be useful here, except it is not available in the standard edition.
Given the example the results should be as follows:
By Company:
+---------+-------------+----------------+
| Company | ValuatedQty | ValuatedAmount |
+=========+=============+================+
| 1 | 14 | 1200 |
+---------+-------------+----------------+
| 2 | 15 | 750 |
+---------+-------------+----------------+
By Date:
+------------+-------------+----------------+
| Date | ValuatedQty | ValuatedAmount |
+============+=============+================+
| 2016-03-01 | 4 | 400 |
+------------+-------------+----------------+
| 2016-03-02 | 16 | 1300 |
+------------+-------------+----------------+
| 2016-03-03 | 9 | 450 |
+------------+-------------+----------------+
By Item:
+-------+-------------+----------------+
| Item | ValuatedQty | ValuatedAmount |
+=======+=============+================+
| Item1 | 9 | 450 |
+-------+-------------+----------------+
| Item2 | 6 | 300 |
+-------+-------------+----------------+
| Item3 | 0 | 0 |
+-------+-------------+----------------+
Two functions that come to mind for your requirements are:
Tail: https://msdn.microsoft.com/en-us/library/ms146056.aspx
Bottomcount: https://msdn.microsoft.com/en-us/library/ms144864.aspx
So with Tail something like the following is possible:
WITH SET [LastYearPerSubCat] AS
  GENERATE(
    [Product].[Product Categories].[SubCategory].MEMBERS AS S,
    S.CURRENTMEMBER
    *
    TAIL(
      NONEMPTY(
        [Date].[Calendar Year].[Calendar Year].MEMBERS,
        S.CURRENTMEMBER
      )
    )
  )
SELECT
  [Measures].[Reseller Gross Profit] ON 0,
  [LastYearPerSubCat] ON 1
FROM [Adventure Works];
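Reading the pattern from the inside out: NONEMPTY keeps only the years in which the current subcategory has data, TAIL takes the last of those, and GENERATE unions the resulting (subcategory, year) tuples across all subcategories. The same shape should transfer to the question's cube by swapping in the Item and Date hierarchies and the ValuatedQty/ValuatedAmount measures, though the exact member names depend on the cube definition, which isn't shown here.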

Get last value with delta from previous row

I have this data:
| account | type | position | created_date |
|---------|------|----------|------|
| 1 | 1 | 1 | 2016-08-01 00:00:00 |
| 2 | 1 | 2 | 2016-08-01 00:00:00 |
| 1 | 2 | 2 | 2016-08-01 00:00:00 |
| 2 | 2 | 1 | 2016-08-01 00:00:00 |
| 1 | 1 | 2 | 2016-08-02 00:00:00 |
| 2 | 1 | 1 | 2016-08-02 00:00:00 |
| 1 | 2 | 1 | 2016-08-03 00:00:00 |
| 2 | 2 | 2 | 2016-08-03 00:00:00 |
| 1 | 1 | 2 | 2016-08-04 00:00:00 |
| 2 | 1 | 1 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 2016-08-07 00:00:00 |
| 2 | 2 | 1 | 2016-08-07 00:00:00 |
I need to get the last positions (account, type, position) and the delta from the previous position. I'm trying to use window functions, but I only get all the rows and can't group them to get the last one.
SELECT account,
       type,
       FIRST_VALUE(position) OVER w AS position,
       FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
       created_date
FROM table
WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
I get this result:
| account | type | position | delta | created_date |
|---------|------|----------|-------|--------------|
| 1 | 1 | 1 | 1 | 2016-08-01 00:00:00 |
| 1 | 1 | 2 | 1 | 2016-08-02 00:00:00 |
| 1 | 1 | 2 | 0 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 2 | 2016-08-01 00:00:00 |
| 1 | 2 | 1 | -1 | 2016-08-03 00:00:00 |
| 1 | 2 | 2 | 1 | 2016-08-07 00:00:00 |
| 2 | 1 | 2 | 2 | 2016-08-01 00:00:00 |
| 2 | 2 | 1 | 1 | 2016-08-01 00:00:00 |
| and so on |
but I need only the last record for each account/type pair:
| account | type | position | delta | created_date |
|---------|------|----------|-------|--------------|
| 1 | 1 | 2 | 0 | 2016-08-04 00:00:00 |
| 1 | 2 | 2 | 1 | 2016-08-07 00:00:00 |
| 2 | 1 | 1 | 0 | 2016-08-04 00:00:00 |
| and so on |
Sorry for my bad language, and thanks for any help.
My "best" try so far:
WITH cte_delta AS (
  SELECT account,
         type,
         FIRST_VALUE(position) OVER w AS position,
         FIRST_VALUE(position) OVER w - LEAD(position, 1, 0) OVER w AS delta,
         created_date
  FROM table
  WINDOW w AS (PARTITION BY account ORDER BY created_date DESC)
),
cte_date AS (
  SELECT account,
         type,
         MAX(created_date) AS created_date
  FROM cte_delta
  GROUP BY account, type
)
SELECT cd.*
FROM cte_delta cd,
     cte_date ct
WHERE cd.account = ct.account
  AND cd.type = ct.type
  AND cd.created_date = ct.created_date
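For what it's worth, the two CTEs can probably be collapsed with ROW_NUMBER(), assuming a dialect that supports it (the WINDOW clause suggests Postgres). A sketch, untested against the original data; note the window partitions by both account and type, which the expected output seems to imply:
WITH ranked AS (
  SELECT account,
         type,
         position,
         -- delta vs. the previous row of the same account/type
         position - LEAD(position, 1, 0) OVER w AS delta,
         created_date,
         ROW_NUMBER() OVER w AS rn
  FROM positions  -- hypothetical table name; adjust to the real one
  WINDOW w AS (PARTITION BY account, type ORDER BY created_date DESC)
)
SELECT account, type, position, delta, created_date
FROM ranked
WHERE rn = 1;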