Adding a column based on dynamic criteria that change for every row in Snowflake - SQL

I'm trying to add a column that counts distinct customers in Snowflake based on criteria that change for every row, i.e. it needs to count customers from 52 weeks before the current week_ending date up to the current week_ending date.
The query goes like this:
select week_ending, sales, last_year_cust_count
from table where year = 2022
Now I want last_year_cust_count to hold the distinct customer count from 52 weeks before week_ending up to the current week_ending, which should produce results like the following:
Week_ending   Sales   last_year_cust_count
02/01/22      $300    3479
09/01/22      $350    3400
16/01/22      $450    3500
... and so on

The optimal way to solve this over a complex structure is to use a bitmap, and then roll that up to the projection you're after.
You should read Using Bitmaps to Compute Distinct Values for Hierarchical Aggregations
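A minimal sketch of the bitmap approach, assuming a customer_sales table with a numeric customer_id column (both names are mine, not from the question):
-- 1) build one bitmap of customer ids per week and per bucket
with week_bitmaps as (
    select week_ending,
           bitmap_bucket_number(customer_id) as bucket,
           bitmap_construct_agg(bitmap_bit_position(customer_id)) as cust_bitmap
    from customer_sales
    group by 1, 2
),
-- 2) for every week, OR together the bitmaps of the trailing 52 weeks
rolled as (
    select w.week_ending, b.bucket,
           bitmap_or_agg(b.cust_bitmap) as cust_bitmap
    from (select distinct week_ending from week_bitmaps) w
    join week_bitmaps b
      on b.week_ending > dateadd(week, -52, w.week_ending)
     and b.week_ending <= w.week_ending
    group by 1, 2
)
-- 3) count bits per bucket, then sum buckets for the exact distinct count
select week_ending, sum(bitmap_count(cust_bitmap)) as last_year_cust_count
from rolled
group by 1;
Each bucket covers a slice of the customer_id space, so summing BITMAP_COUNT across buckets yields an exact distinct count without rescanning the raw rows for every week.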
The simple, non-performant way is to self-join and throw processing power at it.
select a.week_ending, a.sales, count(distinct b.customer) as last_year_cust_count
from table_a as a
join table_a as b
on <filter that I can't be bothered writing, to select the last 52 weeks based on years and weeks>
where year = 2022
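If week_ending is a true DATE column, that omitted filter could be sketched as follows (an assumption about the schema, using Snowflake's DATEADD):
on b.week_ending > dateadd(week, -52, a.week_ending)
and b.week_ending <= a.week_ending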

Cohort retention with SQL BigQuery

I am trying to create a retention table like the following using SQL in BigQuery, but with MONTHLY cohorts.
I have the following columns to use in my dataset. I am only using one table, named 'curious-furnace-341507.TEST.Test_Dataset_-_Orders':
order_date   order_id   customer_id
2020-01-02   12345      6789
I do not need the new user column, and the data goes through June 2020. Ideally I'd have a cohort month column that lists the January-June cohorts, and then 5 periods across.
I have tried so many different things and keep getting errors in BigQuery; I think I am approaching it all wrong. The online queries I am trying to adapt seem to use dates rather than months, which is causing some confusion, as I think I need to truncate my date column to months in the query?
Does anyone have a go-to query that will work in BigQuery for a retention table or can help me approach this? Thanks!
This may help you:
WITH cohorts AS (
SELECT
    customer_id,
    MIN(DATE(order_date)) AS cohort_date
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders`
GROUP BY 1)
SELECT
    FORMAT_DATE("%Y-%m", c.cohort_date) AS cohort_mth,
    t.customer_id AS cust_id,
    DATE_DIFF(DATE(t.order_date), c.cohort_date, MONTH) AS order_period
FROM `curious-furnace-341507.TEST.Test_Dataset_-_Orders` t
JOIN cohorts c ON t.customer_id = c.customer_id
WHERE c.cohort_date >= '2020-01-01'
AND DATE_DIFF(DATE(t.order_date), c.cohort_date, MONTH) <= 5
GROUP BY 1, 2, 3
I typically do pivots and % calcs in Excel/Sheets, so this will just give you the input data you need for that.
NOTE:
This will give you a count of unique customers who ordered in period X (ignores repeat orders in period).
This also has period 0 (ordered again in cohort_mth), which you may wish to keep or exclude.
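If you'd rather do the counting in BigQuery as well, a minimal sketch that wraps the query above in a CTE (periods is my name for it, not the asker's):
WITH periods AS (
-- the query above goes here
)
SELECT cohort_mth, order_period, COUNT(DISTINCT cust_id) AS n_customers
FROM periods
GROUP BY 1, 2
ORDER BY 1, 2;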

SQL performance issues with window functions on daily basis

Given ~23 million users, what is the most efficient way to compute the cumulative number of logins within the last X months for any given day (even when no login was performed)? The start date of a customer is their first ever login; the end date is today.
Desired output
c_id   day          nb_logins_past_6_months
--------------------------------------------
1      2019-01-01   10
1      2019-01-02   10
1      2019-01-03   9
...
1      today        5
➔ One line per user per day with the number of logins between current day and 179 days in the past
Approach 1
1. Cross join each customer ID with calendar table
2. Left join on login table on day
3. Compute window function (i.e. `sum(nb_logins) over (partition by c_id order by day rows between 179 preceding and current row)`)
+ Easy to understand and maintain
- Really heavy, quite impossible to run on a daily basis
- Incremental runs do not bring much benefit: we still have to go 179 days into the past
Approach 2
1. Cross join each customer ID with calendar table
2. Left join on login table on day between today and 179 days in the past
3. Group by customer ID and day to get nb logins within 179 days
+ Easier to do incremental
- The table at step 2 exceeds 300 billion rows
What is the common way to deal with this, knowing this is not the only use case? We have to compute other columns like this (e.g. the number of logins in the past 12 months).
In standard SQL, you would use:
select l.*,
       count(*) over (partition by customerid
                      order by login_date
                      range between interval '6 month' preceding and current row
                     ) as num_logins_180day
from logins l;
This assumes that the logins table has a date of the login with no time component.
I see no reason to multiply 23 million users by 180 days to generate a result set in excess of 4 billion rows to answer this question.
For performance, don't do the entire task all at once. Instead, gather subtotals at the end of each month (or day or whatever makes sense for your data). Then SUM up the subtotals to provide the 'report'.
More discussion (with a focus on MySQL): http://mysql.rjweb.org/doc.php/summarytables
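A rough sketch of that idea in MySQL 8+ (table and column names are my assumptions, not from the question):
-- 1) keep a small per-day subtotal table, refreshed incrementally
CREATE TABLE daily_logins AS
SELECT c_id, DATE(login_ts) AS day, COUNT(*) AS nb_logins
FROM logins
GROUP BY c_id, DATE(login_ts);

-- 2) answer the 180-day question from the much smaller subtotal table
SELECT c_id, day,
       SUM(nb_logins) OVER (PARTITION BY c_id ORDER BY day
                            RANGE BETWEEN INTERVAL 179 DAY PRECEDING AND CURRENT ROW
       ) AS nb_logins_past_6_months
FROM daily_logins;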
(You should tag questions with the specific product; different products have different syntax/capability/performance/etc.)

Crosstab Query Month YTD

I am looking for a solution to convert the individual monthly averages into monthly averages from the beginning of the year; in other words, from January to the given month.
I used the crosstab wizard to group the ratings of an employee into months in the column header. Employees are in the row header. The values of the ratings are then averaged.
My issue is this just shows the average rating of an employee for each month. I need a solution that would show me the average for each month as if it included all results from the beginning of the year (i.e. February would include January's and February's ratings).
TRANSFORM Avg(CSS_Table.[Emp_Rating]) AS AvgOfEmp_Rating
SELECT CSS_Table.[Emp]
FROM CSS_Table
GROUP BY CSS_Table.[Emp]
PIVOT Format([Survey_Date],"mmm") In ("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec");
One possible strategy to attack this (sketched below):
1. Create one query for each month that calculates the monthly YTD average.
2. Create a union query that joins them all together (make sure to use the same column names in each query).
3. Feed the union query into the pivot.
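A minimal sketch of steps 1 and 2, assuming the ratings live in a single table named CSS_Table (two months shown; the pattern repeats through December):
SELECT [Emp], "Feb" AS Mth, Avg([Emp_Rating]) AS YTD_Avg
FROM CSS_Table
WHERE Month([Survey_Date]) <= 2
GROUP BY [Emp]
UNION ALL
SELECT [Emp], "Mar" AS Mth, Avg([Emp_Rating]) AS YTD_Avg
FROM CSS_Table
WHERE Month([Survey_Date]) <= 3
GROUP BY [Emp];
The union query can then feed a crosstab (PIVOT on Mth) in place of the raw table.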

SQL Statement for MS Access Query to Calculate Quarterly Growth Rate

I have a table named "Historical_Stock_Prices" in a MS Access database. This table has the columns: Ticker, Date1, Open1, High, Low, Close1, Volume, Adj_Close. The rows consist of the data for each ticker for every business day.
I need to run a query from inside my VB.net program that will return a table in my program that displays the growth rates for each quarter of every year for each ticker symbol listed. So for this example I would need to find the growth rate for GOOG in the 4th quarter of 2012.
To calculate this manually I would need to take the Close Price on the last BUSINESS day of the 4th quarter (12/31/2012) divided by the Open Price of the first BUSINESS day of the 4th quarter (10/1/2012). Then I need to subtract by 1 and multiply by 100 in order to get a percentage.
The actual calculation would look like this: ((707.38/759.05)-1)*100 = -6.807%
The first and last days of each quarter may vary due to weekend days.
I cannot come up with the correct syntax for the SQL statement to create a table of growth rates from a table of raw historical prices. Can anyone help me with the SQL statement?
Here's how I would approach the problem:
I'd start by creating a saved query in Access named [Stock_Price_with_qtr] that calculates the year and quarter for each row:
SELECT
Historical_Stock_Prices.*,
Year([Date1]) AS Yr,
Switch(Month([Date1])<4,1,Month([Date1])<7,2,Month([Date1])<10,3,True,4) AS Qtr
FROM Historical_Stock_Prices
Then I'd create another saved query in Access named [Qtr_Dates] that finds the first and last business days for each ticker and quarter:
SELECT
Stock_Price_with_qtr.Ticker,
Stock_Price_with_qtr.Yr,
Stock_Price_with_qtr.Qtr,
Min(Stock_Price_with_qtr.Date1) AS Qtr_Start,
Max(Stock_Price_with_qtr.Date1) AS Qtr_End
FROM Stock_Price_with_qtr
GROUP BY
Stock_Price_with_qtr.Ticker,
Stock_Price_with_qtr.Yr,
Stock_Price_with_qtr.Qtr
That would allow me to use the following query in VB.NET (or C#, or Access itself) to calculate the quarterly growth rates:
SELECT
Qtr_Dates.Ticker,
Qtr_Dates.Yr,
Qtr_Dates.Qtr,
(([Close_Prices].[Close1]/[Open_Prices].[Open1])-1)*100 AS Qtr_Growth
FROM
(
Historical_Stock_Prices AS Open_Prices
INNER JOIN Qtr_Dates
ON (Open_Prices.Ticker = Qtr_Dates.Ticker)
AND (Open_Prices.Date1 = Qtr_Dates.Qtr_Start)
)
INNER JOIN
Historical_Stock_Prices AS Close_Prices
ON (Qtr_Dates.Ticker = Close_Prices.Ticker)
AND (Qtr_Dates.Qtr_End = Close_Prices.Date1)
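Saved as a query in Access (I'll call it [Qtr_Growth_Qry]; the name is mine, not part of the original), the asker's GOOG example can then be checked with:
SELECT Ticker, Yr, Qtr, Qtr_Growth
FROM Qtr_Growth_Qry
WHERE Ticker = 'GOOG' AND Yr = 2012 AND Qtr = 4;
which should come out near the -6.807% computed by hand above.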

GROUP BY with date range

I have a table with 4 columns: id, Stream (text), Duration (int), and Timestamp (datetime). A row is inserted every time someone plays a specific audio stream on my website. Stream is the name, and Duration is the time in seconds that they listened. I am currently using the following query to total up listen hours for each week in a year:
SELECT YEARWEEK(`Timestamp`), (SUM(`Duration`)/60/60) FROM logs_main
WHERE `Stream`="asdf" GROUP BY YEARWEEK(`Timestamp`);
This does what I expect... presenting a total of listen time for each week in the year that there is data.
However, I would like to build a query where I have a result row for weeks that there may not be any data. For example, if the 26th week of 2006 has no rows that fall within that week, then I would like the SUM result to be 0.
Is it possible to do this? Maybe via a JOIN over a date range somehow?
The tried and true old-school solution is to set up another table with a bunch of date ranges that you can outer join with for the grouping (the other table would have all of the weeks in it with a begin/end date).
In this case, you could just get by with a table full of the values from YEARWEEK:
201100
201101
201102
201103
201104
And here is a sketch of an SQL statement:
SELECT year_weeks.yearweek, COALESCE(SUM(`Duration`)/60/60, 0)
FROM year_weeks LEFT OUTER JOIN logs_main
ON year_weeks.yearweek = YEARWEEK(logs_main.`Timestamp`)
AND logs_main.`Stream` = "asdf"
GROUP BY year_weeks.yearweek;
(The Stream filter has to live in the ON clause, and COALESCE turns the empty-week NULL into 0; otherwise the outer join degenerates into an inner one and the empty weeks disappear.)
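To populate year_weeks in the first place, a minimal sketch on MySQL 8+ (the 2006 date range is just an example):
CREATE TABLE year_weeks (yearweek INT PRIMARY KEY);
INSERT INTO year_weeks (yearweek)
WITH RECURSIVE days AS (
    SELECT DATE('2006-01-01') AS dt
    UNION ALL
    SELECT dt + INTERVAL 1 DAY FROM days WHERE dt < '2006-12-31'
)
SELECT DISTINCT YEARWEEK(dt) FROM days;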
Here is a suggestion; it might not be exactly what you are looking for.
But say you had a simple table with one column [year_week] that contained the values 1, 2, 3, 4... 52 (or better, YEARWEEK-style values like 201101, so they compare directly with the output of YEARWEEK(`Timestamp`)).
You could then theoretically:
SELECT
A.year_week,
(SELECT SUM(`Duration`)/60/60 FROM logs_main
 WHERE `Stream` = 'asdf' AND YEARWEEK(`Timestamp`) = A.year_week) AS listen_hours
FROM
tblYearWeeks A
This obviously needs some tweaking... I've done several similar queries in other projects and this works well enough depending on the situation.
If you're looking for a one-table/SQL-based solution then that is definitely something I would be interested in as well!