Get apps with the highest review count since a dynamic series of days - sql

I have two tables, apps and reviews (simplified for the sake of discussion):
apps table
  id int
reviews table
  id int
  review_date date
  app_id int (foreign key that points to apps)
2 questions:
1. How can I write a query / function to answer the following question?:
Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?
I think I know how to write a query if given an explicit date:
SELECT
apps.id,
count(reviews.*)
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
group by
1
having
min(reviews.review_date) >= '2020-01-01'
order by 2 desc
limit 10;
But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.
2. What's the best way to model this data?
It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:
... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...
But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.

I think this is what you are looking for:
Postgres 13 or newer
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;
WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:
Get top row(s) with highest value, with ties
Postgres 12 or older
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;
Same as above, but without WITH TIES.
We don't need to involve the table apps at all. The table reviews has all information we need.
The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:
How to force evaluation of subquery before joining / pushing down to foreign server
The optimized generate_series() call produces the series of days from earliest to latest review. See:
Generating time series between two dates in PostgreSQL
Join a count query on generate_series() and retrieve Null values as '0'
Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
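To see the shape of the result on a toy dataset, here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in for Postgres. SQLite has no LATERAL and no generate_series() without an extension, so the sketch swaps in a recursive CTE for the day series and a plain join plus rank() for the per-day winners. The sample rows and the names days, per_app and ranked are made up for illustration. Unlike the LEFT JOIN LATERAL queries above, the inner join drops days without any qualifying app, and tied winners come back one per row instead of as an array.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reviews (id INTEGER PRIMARY KEY, review_date TEXT, app_id INTEGER);
-- Hypothetical sample data: app 1 has 3 reviews, apps 2 and 3 have 2 each.
INSERT INTO reviews (review_date, app_id) VALUES
    ('2020-01-01', 1), ('2020-01-02', 1), ('2020-01-03', 1),
    ('2020-01-02', 2), ('2020-01-03', 2),
    ('2020-01-03', 3), ('2020-01-03', 3);
""")

# Window functions need SQLite 3.25 or newer.
query = """
WITH RECURSIVE days(d) AS (            -- one row per day, earliest to latest review
    SELECT MIN(review_date) FROM reviews
    UNION ALL
    SELECT date(d, '+1 day') FROM days
    WHERE d < (SELECT MAX(review_date) FROM reviews)
),
per_app AS (                           -- earliest review and total count per app
    SELECT app_id, MIN(review_date) AS earliest_review, COUNT(*) AS total_ct
    FROM reviews
    GROUP BY app_id
),
ranked AS (                            -- rank the qualifying apps within each day
    SELECT days.d, per_app.app_id, per_app.total_ct,
           RANK() OVER (PARTITION BY days.d ORDER BY per_app.total_ct DESC) AS rnk
    FROM days
    JOIN per_app ON per_app.earliest_review >= days.d
)
SELECT d, app_id, total_ct
FROM ranked
WHERE rnk = 1                          -- keep all ties for first place
ORDER BY d, app_id
"""
winners = conn.execute(query).fetchall()
for row in winners:
    print(row)
```

A long result like this (one row per day and winning app) may also be easier to work with than one pair of columns per date, which is what the second part of the question was considering.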

If you are looking for hints, then here are a few:
Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
To answer this question for any given date, you need to have only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.
Please comment if you need further clarification on this.

The missing piece for me was lateral join.
I can accomplish just about what I want using the following:
select
review_windows.review_window_start,
id,
review_total,
earliest_review
from
(
select
date_trunc('day', review_windows.review_windows) :: date as review_window_start
from
generate_series(
(
SELECT
min(reviews.review_date)
FROM
reviews
),
(
SELECT
max(reviews.review_date)
FROM
reviews
),
'1 year'
) review_windows
order by
1 desc
) review_windows
left join lateral (
SELECT
apps.id,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= review_windows.review_window_start
group by
1
having
min(reviews.review_date) >= review_windows.review_window_start
order by
2 desc,
3 desc
limit
2
) apps_most_reviews on true;

Related

SQL: How to create supplemental time-series records "out of thin air" from existing records

Suppose I have a table CUSTEVENTS listing customers active in certain months. I now want to consider a customer as being active even if it was in the prior two months.
Simple example, the data might start as:
MONTH_ENDING | CUSTNUM
2022-10-31 | 72378
2022-11-30 | 72378
It should be transformed into the following, given the expanded definition of active:
MONTH_ENDING | CUSTNUM
2022-10-31 | 72378
2022-11-30 | 72378
**2022-12-31 | 72378**
**2023-01-31 | 72378**
I'm trying to arrive at the simplest / most elegant way to get there. I could certainly explode out the data using a time series reference table which would list all the pairs of MONTH_ENDING and "additional" MONTH_ENDING values that "count". Or perhaps I could UNION three subqueries that take MONTH_ENDING, add_months(MONTH_ENDING,1) and add_months(MONTH_ENDING,2). But maybe there's something even more concise not involving multiple unioned queries or an instrumental time-mapping table.
I happen to be using Teradata but I'm not sure I care about platform-specificity; if there's a Teradata-only approach that works, I'll gladly take it.
The general approach is to first calculate the "Last" event time for a given customer, which is handled by something like
LAG(EVENT_DT) OVER (PARTITION BY CUSTNUM ORDER BY EVENT_DT)
The next concept is islands. You want to calculate that an island begins if the event happened after {your window} has elapsed from the prior one. Vice versa to calculate the island's end.
You can actually find some great online articles about this classic problem: Gaps and Islands problem.
If you understand CTEs, you can probably follow it through this example code I wrote. The first CTE is there simply to let you add a condition (instead of 1=1) for the events you care about.
WITH CTE_CONDITION AS (
SELECT
EVENT_DT AS dtm,
CUSTNUM
FROM
My_First_Table
WHERE
1 = 1
AND EVENT_DT is not null
),
CTE_LAGGED AS (
SELECT
dtm,
CUSTNUM,
LAG(dtm) OVER (
PARTITION BY CUSTNUM
ORDER BY
dtm
) AS previous_datetime,
LEAD(dtm) OVER (
PARTITION BY CUSTNUM
ORDER BY
dtm
) AS next_datetime,
ROW_NUMBER() OVER (
PARTITION BY CUSTNUM
ORDER BY
CTE_CONDITION.dtm
) AS island_location
FROM
CTE_CONDITION
),
CTE_ISLAND_START AS (
SELECT
ROW_NUMBER() OVER (
PARTITION BY CUSTNUM
ORDER BY
dtm
) AS island_number,
CUSTNUM,
dtm AS island_start_datetime,
island_location AS island_start_location
FROM
CTE_LAGGED
WHERE
(
DATEDIFF(MONTH, previous_datetime, dtm) > 2
OR CTE_LAGGED.previous_datetime IS NULL
)
),
CTE_ISLAND_END AS (
SELECT
ROW_NUMBER() OVER (
PARTITION BY CUSTNUM
ORDER BY
dtm
) AS island_number,
CUSTNUM,
dtm AS island_end_datetime,
island_location AS island_end_location
FROM
CTE_LAGGED
WHERE
DATEDIFF(MONTH, dtm, next_datetime) > 2
OR CTE_LAGGED.next_datetime IS NULL
)
SELECT
CTE_ISLAND_START.CUSTNUM,
CTE_ISLAND_START.island_start_datetime,
CTE_ISLAND_END.island_end_datetime,
DATEDIFF(
MONTH, CTE_ISLAND_START.island_start_datetime,
CTE_ISLAND_END.island_end_datetime
) AS ISLAND_DURATION_MONTH,
(
SELECT
COUNT(*)
FROM
CTE_LAGGED
WHERE
CTE_LAGGED.dtm BETWEEN CTE_ISLAND_START.island_start_datetime
AND CTE_ISLAND_END.island_end_datetime
AND CTE_LAGGED.CUSTNUM = CTE_ISLAND_START.CUSTNUM
) AS island_row_count
FROM
CTE_ISLAND_START
INNER JOIN CTE_ISLAND_END ON CTE_ISLAND_END.island_number = CTE_ISLAND_START.island_number
AND CTE_ISLAND_START.CUSTNUM = CTE_ISLAND_END.CUSTNUM
I wrote this into a Rasgo template using Snowflake syntax, but only minor adjustments should be needed to get this to work in Teradata.
Once you have this result, it tells you the periods of activity that include the 2 month window. You can then use a calendar table with one row per month-begin and check whether the customer was "active" on each date based on whether it falls into one of these active ranges.
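The calendar step can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for Teradata or Snowflake. The islands table here is hypothetical: it holds one already-computed island per row, stored as month-begin dates, and the calendar is generated with a recursive CTE instead of a real calendar table. Extending each island's end by two months is my reading of the "active in the prior two months" rule from the question, so treat it as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical island output: customer 72378 had events in Oct and Nov 2022.
CREATE TABLE islands (custnum INTEGER, island_start TEXT, island_end TEXT);
INSERT INTO islands VALUES (72378, '2022-10-01', '2022-11-01');
""")

query = """
WITH RECURSIVE calendar(month_begin) AS (   -- stand-in for a calendar table
    SELECT '2022-09-01'
    UNION ALL
    SELECT date(month_begin, '+1 month') FROM calendar
    WHERE month_begin < '2023-03-01'
)
SELECT i.custnum,
       date(c.month_begin, '+1 month', '-1 day') AS month_ending
FROM calendar c
JOIN islands i
  ON c.month_begin BETWEEN i.island_start
                       AND date(i.island_end, '+2 months')  -- assumed 2 month tail
ORDER BY i.custnum, month_ending
"""
active_months = conn.execute(query).fetchall()
for row in active_months:
    print(row)
```

With the two sample events from the question, this yields the four month-ending rows the question lists as the desired output.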

Add missing months with values from previous month

I need to use this SQL query for a piece of software that expects the time in a particular format, hence the Time column. However, I also need the query to insert the months that are missing, using the value from the previous month. This is the query I currently have.
SELECT [accountnumber],SUM([postingamount]) AS Amount, [accountingdate],
convert(varchar(4),year(accountingdate))+'M'+ Format(DATEPART( MONTH, accountingdate) , '00')
AS [Time]
FROM [7 GL Detail MACL]
where [accountingdate]>='2019-01-01'
GROUP BY [accountingdate],[postingamount],[accountnumber]
Current Results
Expected Results
Since you didn't specify the RDBMS you're using, I can't guarantee that this logic will work, because every system uses slightly different SQL syntax.
However, I used the Rasgo datespine function to generate this SQL, since the logic is quite complex to wrap your head around, and tested it on Snowflake.
The main differences between Snowflake and other systems are: DATEADD and TABLE (GENERATOR())
In case you can't modify this to work in your system, here are the basic steps which you'll want to follow:
1. Select unique accountnumbers.
2. Select unique dates (month beginnings?). This is where Snowflake uses GENERATOR, but other systems might actually have a Calendar table you can select from.
3. Cross join (Cartesian join) these to create every possible combination of accountnumber and date.
4. Outer join the result of step 3 to your data (you might have to truncate your date to month-begin).
5. Filter out rows that don't apply. For instance, you might have just inserted a row for 1/1/2019 for an account that didn't even begin until 12/12/2020.
WITH GLOBAL_SPINE AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID,
DATEADD('MONTH', (INTERVAL_ID - 1), '2019-01-01'::timestamp_ntz) as SPINE_START,
DATEADD('MONTH', INTERVAL_ID, '2019-01-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 42))
),
GROUPS AS (
SELECT
accountnumber,
MIN(DESIRED_INTERVAL) AS LOCAL_START,
MAX(DESIRED_INTERVAL) AS LOCAL_END
FROM [7 GL Detail MACL]
GROUP BY
accountnumber
),
GROUP_SPINE AS (
SELECT
accountnumber,
SPINE_START AS GROUP_START,
SPINE_END AS GROUP_END
FROM GROUPS G
CROSS JOIN LATERAL (
SELECT
SPINE_START, SPINE_END
FROM GLOBAL_SPINE S
WHERE S.SPINE_START >= G.LOCAL_START
)
)
SELECT
G.accountnumber AS GROUP_BY_accountnumber,
GROUP_START,
GROUP_END,
T.*
FROM GROUP_SPINE G
LEFT JOIN {{ your_table }} T
ON DESIRED_INTERVAL >= G.GROUP_START
AND DESIRED_INTERVAL < G.GROUP_END
AND G.accountnumber = T.accountnumber;
You were also doing an aggregation step, but I figure once you get this complicated part down, you can figure out how to finally aggregate it the way you want it.
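For readers without Snowflake, the five steps listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table gl and its columns are made up, the data is already at one row per account and month, and a recursive CTE stands in for GENERATOR or a calendar table. Instead of an outer join, the sketch looks up the latest row at or before each month, which is one way to carry the previous month's value forward as the question asks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical monthly data; account A has no row for February or April.
CREATE TABLE gl (accountnumber TEXT, month_start TEXT, amount REAL);
INSERT INTO gl VALUES
    ('A', '2019-01-01', 100.0),
    ('A', '2019-03-01', 150.0),
    ('B', '2019-02-01', 20.0);
""")

query = """
WITH RECURSIVE months(month_start) AS (      -- step 2: one row per month
    SELECT '2019-01-01'
    UNION ALL
    SELECT date(month_start, '+1 month') FROM months
    WHERE month_start < '2019-04-01'
),
accounts AS (                                -- step 1: unique accounts
    SELECT accountnumber, MIN(month_start) AS first_month
    FROM gl
    GROUP BY accountnumber
)
SELECT a.accountnumber,
       m.month_start,
       (SELECT g.amount                      -- step 4: latest known value
        FROM gl g
        WHERE g.accountnumber = a.accountnumber
          AND g.month_start <= m.month_start
        ORDER BY g.month_start DESC
        LIMIT 1) AS amount
FROM accounts a
CROSS JOIN months m                          -- step 3: every account x month
WHERE m.month_start >= a.first_month         -- step 5: drop months before the account began
ORDER BY a.accountnumber, m.month_start
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

The correlated lookup is easy to read but may be slow on large tables; a window function over the joined spine is a common alternative where the database supports it.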

Postgres: Count multiple events for distinct dates

People of Stack Overflow!
Thanks for taking the time to read this question. What I am trying to accomplish is to pivot some data all from just one table.
The original table has multiple datetime entries of specific events (e.g. when the customer was added add_time and when the customer was lost lost_time).
This is one part of two rows of the deals table:
id | add_time | last_mail_time | lost_time
5 | 2020-03-24 09:29:24 | 2020-04-03 13:20:29 | NULL
310 | 2020-03-24 09:29:24 | NULL | 2020-04-03 13:20:29
I want to create a view of this table. A view that has one row for each distinct date and counts the number of events at this specific time.
This is the goal (times do not match with the example!):
I have working code, like this:
SELECT DISTINCT
change_datetime,
(SELECT COUNT(add_time) as add_time_count FROM deals WHERE add_time::date = change_datetime),
(SELECT COUNT(lost_time) as lost_time_count FROM deals WHERE lost_time::date = change_datetime)
FROM (
SELECT
add_time::date AS change_datetime
FROM
deals
UNION ALL
SELECT
lost_time::date AS change_datetime
FROM
deals
) AS foo
WHERE change_datetime IS NOT NULL
ORDER BY
change_datetime;
but this has some ugly O(n^2) queries and takes a lot of time.
Is there a better, more performant way to achieve this?
Thanks!!
You can use a lateral join to unpivot and then aggregate:
select t::date,
count(*) filter (where which = 'add'),
count(*) filter (where which = 'mail'),
count(*) filter (where which = 'lost')
from deals d cross join lateral
(values (add_time, 'add'),
(last_mail_time, 'mail'),
(lost_time, 'lost')
) v(t, which)
where t is not null -- skip events that never happened
group by t::date;
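The same unpivot-then-aggregate idea can be tried on the two sample rows from the question with Python's built-in sqlite3 module. SQLite has no LATERAL, so this sketch swaps in UNION ALL for the unpivot and SUM over a comparison for the conditional counts; it is an illustration under those substitutions, not the Postgres query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE deals (id INTEGER, add_time TEXT, last_mail_time TEXT, lost_time TEXT)"
)
conn.executemany(
    "INSERT INTO deals VALUES (?, ?, ?, ?)",
    [
        (5, "2020-03-24 09:29:24", "2020-04-03 13:20:29", None),
        (310, "2020-03-24 09:29:24", None, "2020-04-03 13:20:29"),
    ],
)

query = """
SELECT date(t) AS day,
       SUM(which = 'add')  AS add_time_count,   -- comparisons evaluate to 0 or 1
       SUM(which = 'mail') AS mail_time_count,
       SUM(which = 'lost') AS lost_time_count
FROM (                                          -- unpivot: one row per event
    SELECT add_time AS t, 'add' AS which FROM deals
    UNION ALL
    SELECT last_mail_time, 'mail' FROM deals
    UNION ALL
    SELECT lost_time, 'lost' FROM deals
) AS events
WHERE t IS NOT NULL                             -- ignore events that never happened
GROUP BY date(t)
ORDER BY day
"""
daily_counts = conn.execute(query).fetchall()
for row in daily_counts:
    print(row)
```

Each source table is scanned a fixed number of times, regardless of how many distinct dates there are, which is what removes the per-date subqueries from the original attempt.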

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Two rows are currently giving me the problem: they're both remits for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim.
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, since that table has been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
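To illustrate the row_number() idea on its own, here is a small sketch using Python's built-in sqlite3 module in place of SQL Server. The sample rows and UUID values are invented, and the sketch adds remittance_uuid as a second sort key, which is my own assumption for making the choice between identical timestamps repeatable; the query above leaves that choice to the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims_group2 (
    remittance_uuid TEXT, claim_id INTEGER,
    billed_amount REAL, create_datetime TEXT
);
-- Claim 1 has two remittances with the same latest timestamp.
INSERT INTO claims_group2 VALUES
    ('u0', 1, 10.0, '2018-01-01 00:00:00.000'),
    ('u1', 1, 10.0, '2018-08-15 13:07:50.933'),
    ('u2', 1, 10.0, '2018-08-15 13:07:50.933'),
    ('u3', 2,  5.0, '2017-12-31 10:00:00.000');
""")

# Window functions need SQLite 3.25 or newer.
query = """
SELECT claim_id, remittance_uuid
FROM (
    SELECT claim_id, remittance_uuid,
           ROW_NUMBER() OVER (
               PARTITION BY claim_id
               ORDER BY create_datetime DESC,  -- latest remittance first
                        remittance_uuid        -- assumed tiebreaker
           ) AS rn
    FROM claims_group2
    WHERE billed_amount > 0
) AS t
WHERE rn = 1                                   -- exactly one row per claim
ORDER BY claim_id
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```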
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

SQL Server get customer with 7 consecutive transactions

I am trying to write a query that would get the customers with 7 consecutive transactions given a list of CustomerKeys.
I am currently doing a self join on Customer fact table that has 700 Million records in SQL Server 2008.
This is what I came up with, but it's taking a long time to run. I have a clustered index on (CustomerKey, TranDateKey).
SELECT
ct1.CustomerKey,ct1.TranDateKey
FROM
CustomerTransactionFact ct1
INNER JOIN
#CRTCustomerList dl ON ct1.CustomerKey = dl.CustomerKey --temp table with customer list
INNER JOIN
dbo.CustomerTransactionFact ct2 ON ct1.CustomerKey = ct2.CustomerKey -- Same Customer
AND ct2.TranDateKey >= ct1.TranDateKey
AND ct2.TranDateKey <= CONVERT(VARCHAR(8), dateadd(d, 6, ct1.TranDateTime), 112) -- Consecutive Transactions in the last 7 days
WHERE
ct1.LogID >= 82800000
AND ct2.LogID >= 82800000
AND ct1.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
AND ct2.TranDateKey between dl.BeginTranDateKey and dl.EndTranDateKey
GROUP BY
ct1.CustomerKey,ct1.TranDateKey
HAVING
COUNT(*) = 7
Please help make it more efficient. Is there a better way to write this query in 2008?
You can do this using window functions, which should be much faster. Assuming that TranDateKey is a number and you can subtract a sequential number from it, the difference is constant for consecutive days.
You can put this in a query like this:
SELECT CustomerKey, MIN(TranDateKey), MAX(TranDateKey)
FROM (SELECT ct.CustomerKey, ct.TranDateKey,
(ct.TranDateKey -
DENSE_RANK() OVER (PARTITION BY ct.CustomerKey ORDER BY ct.TranDateKey)
) as grp
FROM CustomerTransactionFact ct INNER JOIN
#CRTCustomerList dl
ON ct.CustomerKey = dl.CustomerKey
) t
GROUP BY CustomerKey, grp
HAVING COUNT(*) = 7;
If your date key is something else, there is probably a way to modify the query to handle that, but you might have to join to the dimension table.
This would be a perfect task for a COUNT(*) OVER (RANGE ...), but SQL Server 2008 supports only a limited syntax for Windowed Aggregate Functions.
SELECT CustomerKey, MIN(TranDateKey), COUNT(*)
FROM
(
SELECT CustomerKey, TranDateKey,
dateadd(d,-ROW_NUMBER()
OVER (PARTITION BY CustomerKey
ORDER BY TranDateKey),TranDateTime) AS dummyDate
FROM CustomerTransactionFact
) AS dt
GROUP BY CustomerKey, dummyDate
HAVING COUNT(*) >= 7
The dateadd subtracts a Row_Number over all dates per customer from the current TranDateTime. The resulting dummyDate has no actual meaning, but it is the same meaningless date for consecutive dates.
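The "date minus row number" trick can be checked on toy data with Python's built-in sqlite3 module standing in for SQL Server. The table tx, its columns and the sample customers are made up, and the sketch assumes at most one row per customer and day after the DISTINCT step. An integer day number from julianday() plays the role of the date here.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx (customer_key INTEGER, tran_date TEXT)")

def day_run(start, count):
    """Return `count` consecutive ISO dates beginning at `start`."""
    return [(start + timedelta(days=i)).isoformat() for i in range(count)]

# Customer 1: seven days in a row. Customer 2: runs of 3 and 4 days with a gap.
rows = [(1, d) for d in day_run(date(2020, 1, 1), 7)]
rows += [(2, d) for d in day_run(date(2020, 1, 1), 3)]
rows += [(2, d) for d in day_run(date(2020, 1, 5), 4)]
conn.executemany("INSERT INTO tx VALUES (?, ?)", rows)

# Window functions need SQLite 3.25 or newer.
query = """
SELECT customer_key, MIN(tran_date) AS streak_start, COUNT(*) AS streak_days
FROM (
    SELECT customer_key, tran_date,
           -- day number minus row number is constant within a consecutive run
           CAST(julianday(tran_date) AS INTEGER)
           - ROW_NUMBER() OVER (PARTITION BY customer_key
                                ORDER BY tran_date) AS grp
    FROM (SELECT DISTINCT customer_key, tran_date FROM tx) AS d
) AS t
GROUP BY customer_key, grp
HAVING COUNT(*) >= 7
ORDER BY customer_key
"""
streaks = conn.execute(query).fetchall()
for row in streaks:
    print(row)
```

Only customer 1 qualifies, because the gap on 2020-01-04 splits customer 2's transactions into two shorter groups.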