Given ~23 million users, what is the most efficient way to compute the cumulative number of logins within the last X months for any given day (even when no login was performed) ? Start date of a customer is its first ever login, end date is today.
Desired output
c_id day nb_logins_past_6_months
----------------------------------------------
1 2019-01-01 10
1 2019-01-02 10
1 2019-01-03 9
...
1 today 5
➔ One line per user per day with the number of logins between current day and 179 days in the past
Approach 1
1. Cross join each customer ID with calendar table
2. Left join on login table on day
3. Compute window function (i.e. `sum(nb_logins) over (partition by c_id order by day rows between 179 preceding and current row)`)
+ Easy to understand and mantain
- Really heavy, quite impossible to run on daily basis
- Incremental does not bring much benefit : still have to go 179 days in the past
Approach 2
1. Cross join each customer ID with calendar table
2. Left join on login table on day between today and 179 days in the past
3. Group by customer ID and day to get nb logins within 179 days
+ Easier to do incremental
- Table at step 2 is exceeding 300 billion rows
What is the common way to deal with this knowing this is not the only use case, we have to compute other columns like this (nb logins in the past 12 months etc.)
In standard SQL, you would use:
select l.*,
count(*) over (partition by customerid
order by login_date
range between interval '6 month' preceding and current row
) as num_logins_180day
from logins l;
This assumes that the logins table has a date of the login with no time component.
I see no reason to multiply 23 million users by 180 days to generate a result set in excess of 4 million rows to answer this question.
For performance, don't do the entire task all at once. Instead, gather subtotals at the end of each month (or day or whatever makes sense for your data). Then SUM up the subtotals to provide the 'report'.
More discussion (with a focus on MySQL): http://mysql.rjweb.org/doc.php/summarytables
(You should tag questions with the specific product; different products have different syntax/capability/performance/etc.)
Related
Trying to add a column that counts distinct customers in snowflake based on criteria that changes for every row i.e. needs to count customers between 52 weeks before current week_ending date to current week_ending date.
The query that goes like
select week_ending, sales, last_year_cust_count
from table where year = 2022
now i want the last_year_cust_count to have distinct customers between 52 weeks before week_ending till current week_ending and this needs to show following results as example
Week_ending
Sales
last_year_cust_count
02/01/22
$300
3479
09/01/22
$350
3400
16/01/22
$450
3500
... and so on
The optimal way to solve this over complex structure, is to use a bitmap, and then roll that up to the projections you over.
You should read Using Bitmaps to Compute Distinct Values for Hierarchical Aggregations
The simple, non-performant way is to self join and throw processing power at it.
select a.week_ending, a.sales, count(distinct b.customer) as last_year_cust_count
from table_a as a
join table_a as b
on <filter that I cannot bothered writing to select last 52 weeks base on years and weeks>
where year = 2022
I'm doing some work around what we're spending on support vs. how much those users bring in and came into this unique problem.
Tables I have:
Revenue table: A row for each time a user generates revenue on the platform
Support Contacts Table: A row for each time a user contacts support and the cost associated with that contact
I'm trying to get a table at a daily grain that details...
How many users contacted support on the given day
How much revenue did all users bring in in the last 30 days?
How much did we spend on support contacts in the last 30 days?
The tough part: How much did the users who contacted support on the given day bring in in the last 30 days?
Here's what I have so far:
SELECT DISTINCT
-- Revenue generation date
r.revenue_date
-- Easy summing of contacts/revenue/costs on the given day
,COUNT(DISTINCT sc.user_pk) AS num_user_contacting_support_on_day
,SUM(r.revenue) AS all_users_revenue_for_day
,SUM(sc.support_contact_cost) AS support_costs_on_day
-- Double check that this would sum the p30d revenue/costs for the given day?
,SUM(IF(r.revenue_date BETWEEN r.revenue_date AND DATE_SUB(r.revenue_date, INTERVAL 30 DAY), c.revenue, NULL)) AS p30d_revenue_all_users
,SUM(IF(sc.support_contact_date BETWEEN r.revenue_date AND DATE_SUB(r.revenue_date, INTERVAL 30 DAY), sc.support_contact_cost, NULL)) AS p30d_support_contact_cost
-- The tough part:
-- How do I get the revenue for ONLY users who contacted support on the given day?
-- How do I get the p30d revenue for ONLY users who contacted support on the given day?
FROM revenue_table r
LEFT JOIN support_contact_table sc
ON r.revenue_date = sc.support_contact_date
GROUP BY r.revenue_date
Let me start by saying that I am somewhat new to SQL/Snowflake and have been putting together queries for roughly 2 months. Some of my query language may not be ideal and I fully understand if there's a better, more efficient way to execute this query. Any and all input is appreciated. Also, this particular query is being developed in Snowflake.
My current query is pulling customer volumes by department and date based on a 45 day window with a 24 day lookback from current date and a 21 day look forward based on scheduled appointments. Each date is grouped based on where it falls within that 45 day window: current week (today through next 7 days), Week 1 (forward-looking days 8-14), and Week 2 (forward-looking days 15-21). I have been working to try and build out a comparison column that, for any date that lands within either the Week 1 or Week 2 group, will pull in prior period volumes from either 14 days prior (Week 1) or 21 days prior (Week 2) but am getting nowhere. Is there a best-practice for this type of column? Generic example of the current output is attached. Please note that the 'Prior Wk' column in the sample output was manually populated in an effort to illustrate the way this column should ideally work.
I have tried several different iterations of count(case...) similar to that listed below; however, the 'Prior Wk' column returns the count of encounters/scheduled encounters for the same day rather than those that occurred 14 or 21 days ago.
Count(Case When datediff(dd,SCHED_DTTM,getdate())
between -21 and -7 then 1 else null end
) as "Prior Wk"
I've tried to use an IFF statement as shown below, but no values return.
(IFF(ENCOUNTER_DATE > dateadd(dd,8,getdate()),
count(case when ENC_STATUS in (“Phone”,”InPerson”) AND
datediff(dd,ENCOUNTER_Date,getdate()) between 7 and 14 then 1
else null end), '0')
) as "Prior Wk"
Also have attempted creating and using a temporary table (example included) but have not managed to successfully pull information from the temp table that didn't completely disrupt my encounter/scheduled counts. Please note for this approach I've only focused on the 14 day group and have not begun to look at the 21 day/Week 2 group. My attempt to use the temp table to resolve the problem centered around the following clause (temp table alias: "Date1"):
CASE when AHS.GL_Number = "DATEVISIT1"."GL_NUMBER" AND
datevisit1.lookback14 = dateadd(dd,14,PE.CONTACT_Date)
then "DATEVISIT1"."ENC_Count"
else null end
as "Prior Wk"*
I am extremely appreciative of any insight on the current best practices around pulling prior period data into a column alongside current period data. Any misuse of terminology on my part is not deliberate.
I'm struggling to understand your requirement but it sounds like you need to use window functions https://docs.snowflake.com/en/sql-reference/functions-analytic.html, in this case likely a SUM window function. The LAG window function, https://docs.snowflake.com/en/sql-reference/functions/lag.html, might also be of some help
is it possible to SUM a number over a special time period in Amazon Redshift with a WINDOW-Function?
As an example I'm counting login numbers for different companies per day.
What I now want per row is, that it sums up the logins over the last 4 weeks (referenced by the date of the row): The field which I'm serarching for is marked yellow in the screenshot.
Thanks in advance for your help.
If you have data for each day, then you can use rows:
select t.*,
sum(logs) over (partition by company
order by date
rows between 27 preceding and current row
) as logins_4_weeks
from t;
Redshift does not yet support range for the window frame, so this is your best bet.
My dataset provides a monthly snapshot of customer accounts. Below is a very simplified version:
Date_ID | Acc_ID
------- | -------
20160430| 1
20160430| 2
20160430| 3
20160531| 1
20160531| 2
20160531| 3
20160531| 4
20160531| 5
20160531| 6
20160531| 7
20160630| 4
20160630| 5
20160630| 6
20160630| 7
20160630| 8
Customers can open or close their accounts, and I want to calculate the number of 'new' customers every month. The number of 'exited' customers will also be helpful if this is possible.
So in the above example, I should get the following result:
Month | New Customers
------- | -------
20160430| 3
20160531| 4
20160630| 1
Basically I want to compare distinct account numbers in the selected and previous month, any that exist in the selected month and not previous are new members, any that were there last month and not in the selected are exited.
I've searched but I can't seem to find any similar problems, and I hardly know where to start myself - I've tried using CALCULATE and FILTER along with DATEADD to filter the data to get two months, and then count the unique values. My PowerPivot skills aren't up to scratch to solve this on my own however!
Getting the new users is relatively straightforward - I'd add a calculated column which counts rows for that user in earlier months and if they don't exist then they are a new user:
=IF(CALCULATE(COUNTROWS(data),
FILTER(data, [Acc_ID] = EARLIER([Acc_ID])
&& [Date_ID] < EARLIER([Date_ID]))) = BLANK(),
"new",
"existing")
Once this is in place you can simply write a measure for new_users:
=CALCULATE(COUNTROWS(data), data[customer_type] = "new")
Getting the cancelled users is a little harder because it means you have to be able to look backwards to the prior month - none of the time intelligence stuff in PowerPivot will work out of the box here as you don't have a true date column.
It's nearly always good practice to have a separate date table in your PowerPivot models and it is a good way to solve this problem - essentially the table should be 1 record per date with a unique key that can be used to create a relationship. Perhaps post back with a few more details.
This is an alternative method to Jacobs which also works. It avoids creating a calculated column, but I actually find the calculated column useful to use as a flag against other measures.
=CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATESBETWEEN(
'Dates'[Date], 0, LASTDATE('Dates'[Date])
)
) - CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATESBETWEEN(
'Dates'[Date], 0, FIRSTDATE('Dates'[Date]) - 1
)
)
It basically uses the dates table to make a distinct count of all Acc_ID from the beginning of time until the first day of the period of time selected, and subtracts that from the distinct count of all Acc_ID from the beginning of time until the last day of the period of time selected. This is essentially the number of new distinct Acc_ID, although you can't work out which Acc_ID's these are using this method.
I could then calculate 'exited accounts' by taking the previous months total as 'existing accounts':
=CALCULATE(
DISTINCTCOUNT('Accounts'[Acc_ID]),
DATEADD('Dates'[Date], -1, MONTH)
)
Then adding the 'new accounts', and subtracting the 'total accounts':
=DISTINCTCOUNT('Accounts'[Acc_ID])