Insert zero values for nonexistent groups in Redshift - sql

I'm writing a simple query on Amazon Redshift as follows:
SELECT EXTRACT(year FROM created_at) AS year,
EXTRACT(month FROM created_at) AS month,
member_id,
COUNT(*) as pageviews
FROM TABLE
GROUP BY year,
month,
member_id
ORDER BY year,
month,
member_id
This gives me the following result as an example:
year month member_id pageviews
2015 1 100 29
2015 2 100 22
2015 3 100 178
2015 4 100 34
2015 1 200 56
2015 3 200 16
Here's the result I would like to have:
year month member_id pageviews
2015 1 100 29
2015 2 100 22
2015 3 100 178
2015 4 100 34
2015 1 200 56
2015 2 200 0
2015 3 200 16
2015 4 200 0
In the result above, notice the additional rows with zero pageviews.
How do I get this result? Any help would be much appreciated.

Use a cross join to generate every year/month/member combination, then a left join to bring in the data:
SELECT ym.year,
       ym.month,
       m.member_id,
       COUNT(t.member_id) AS pageviews
FROM (SELECT DISTINCT EXTRACT(year FROM created_at) AS year,
             EXTRACT(month FROM created_at) AS month
      FROM TABLE) ym
CROSS JOIN (SELECT DISTINCT member_id FROM TABLE) m
LEFT JOIN TABLE t
  ON EXTRACT(year FROM t.created_at) = ym.year AND
     EXTRACT(month FROM t.created_at) = ym.month AND
     t.member_id = m.member_id
GROUP BY ym.year, ym.month, m.member_id
ORDER BY ym.year, ym.month, m.member_id;
This assumes that every year/month combination you want in the output appears at least once in the table, for any member.
If you have other tables that are better sources for the members and the dates, use them instead; that may be faster than SELECT DISTINCT.
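The cross join plus left join pattern can be tried locally. The sketch below is a minimal illustration, not Redshift code: it assumes Python's built-in sqlite3 module as a stand-in engine, a hypothetical table name page_log, and strftime() in place of EXTRACT, which SQLite does not have.

```python
import sqlite3


def zero_filled_counts(rows):
    """Return (year, month, member_id, pageviews) for every year/month x member pair."""
    con = sqlite3.connect(":memory:")
    # page_log is a hypothetical table name used only for this sketch.
    con.execute("CREATE TABLE page_log (member_id INTEGER, created_at TEXT)")
    con.executemany("INSERT INTO page_log VALUES (?, ?)", rows)
    query = """
        SELECT ym.year, ym.month, m.member_id,
               COUNT(t.member_id) AS pageviews  -- counts matched rows only, so gaps give 0
        FROM (SELECT DISTINCT CAST(strftime('%Y', created_at) AS INTEGER) AS year,
                              CAST(strftime('%m', created_at) AS INTEGER) AS month
              FROM page_log) ym
        CROSS JOIN (SELECT DISTINCT member_id FROM page_log) m
        LEFT JOIN page_log t
               ON CAST(strftime('%Y', t.created_at) AS INTEGER) = ym.year
              AND CAST(strftime('%m', t.created_at) AS INTEGER) = ym.month
              AND t.member_id = m.member_id
        GROUP BY ym.year, ym.month, m.member_id
        ORDER BY ym.year, ym.month, m.member_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# Member 200 has no rows in February or April.
sample_views = [
    (100, "2015-01-05"), (100, "2015-01-20"), (100, "2015-02-03"),
    (100, "2015-03-10"), (100, "2015-04-01"),
    (200, "2015-01-07"), (200, "2015-03-15"),
]

for row in zero_filled_counts(sample_views):
    print(row)
```

COUNT(t.member_id) rather than COUNT(*) is what produces the zeros: unmatched rows from the left join carry a NULL member_id, which COUNT ignores.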

Related

Sum of last 12 months

I have a table with 3 columns (Year, Month, Value) like this in Sql Server :
Year Month Value ValueOfLastTwelveMonths
2021 1 30 30
2021 2 24 54 (30 + 24)
2021 5 26 80 (54+26)
2021 11 12 92 (80+12)
2022 1 25 87 (SUM of values from 1 2022 TO 2 2021)
2022 2 40 103 (SUM of values from 2 2022 TO 3 2021)
2022 4 20 123 (SUM of values from 4 2022 TO 5 2021)
I need a SQL query to calculate ValueOfLastTwelveMonths.
SELECT Year,
       Month,
       Value,
       SUM (Value) OVER (PARTITION BY Year, Month)
FROM MyTable
This is much easier if you have a row for every year and month; if needed, you can filter the padded NULL rows out afterwards. With no gaps in the months, you know exactly how many rows to look back: 11.
If you build a dataset of the years and months, you can LEFT JOIN your data to it, compute the windowed sum, and finally filter out the rows that have no value:
SELECT *
INTO dbo.YourTable
FROM (VALUES(2021,1,30),
(2021,2,24),
(2021,5,26),
(2021,11,12),
(2022,1,25),
(2022,2,40),
(2022,4,20))V(Year,Month,Value);
GO
WITH YearMonth AS(
SELECT YT.Year,
V.Month
FROM (SELECT DISTINCT Year
FROM dbo.YourTable) YT
CROSS APPLY (VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12))V(Month)),
RunningTotal AS(
SELECT YM.Year,
YM.Month,
YT.Value,
SUM(YT.Value) OVER (ORDER BY YM.Year, YM.Month
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS Last12Months
FROM YearMonth YM
LEFT JOIN dbo.YourTable YT ON YM.Year = YT.Year
AND YM.Month = YT.Month)
SELECT Year,
Month,
Value,
Last12Months
FROM RunningTotal
WHERE Value IS NOT NULL;
GO
DROP TABLE dbo.YourTable;
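The same calendar-padding idea can be checked outside SQL Server. The following is a rough sketch that assumes Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) instead of T-SQL, a hypothetical table name my_table, at most one row per year/month, and no year missing entirely from the data.

```python
import sqlite3


def last_twelve_months(rows):
    """Return (year, month, value, last12) using a padded calendar and an 11-row look-back."""
    con = sqlite3.connect(":memory:")
    # my_table is a hypothetical name; one row per (year, month) is assumed.
    con.execute("CREATE TABLE my_table (year INTEGER, month INTEGER, value INTEGER)")
    con.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    query = """
        WITH months(month) AS (
            VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)
        ),
        year_month AS (          -- every month of every year present in the data
            SELECT y.year, months.month
            FROM (SELECT DISTINCT year FROM my_table) y
            CROSS JOIN months
        ),
        running AS (
            SELECT ym.year, ym.month, t.value,
                   SUM(t.value) OVER (ORDER BY ym.year, ym.month
                                      ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS last12
            FROM year_month ym
            LEFT JOIN my_table t ON t.year = ym.year AND t.month = ym.month
        )
        SELECT year, month, value, last12
        FROM running
        WHERE value IS NOT NULL  -- drop the padded months again
        ORDER BY year, month
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# The sample data from the question.
sample_values = [
    (2021, 1, 30), (2021, 2, 24), (2021, 5, 26), (2021, 11, 12),
    (2022, 1, 25), (2022, 2, 40), (2022, 4, 20),
]

for row in last_twelve_months(sample_values):
    print(row)
```

Because the padded calendar has exactly one row per month, a fixed frame of 11 preceding rows plus the current row always spans twelve calendar months.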

SQL Bigquery Counting repeated customers from transaction table

I have a transaction table that looks something like this.
userid orderDate amount
111 2021-11-01 20
112 2021-09-07 17
111 2021-11-21 17
I want to count how many distinct customers (userid) who bought from our store in a given month also bought from our store in the previous month. For example, in February 2020 we had 20 customers, and 7 of those 20 also bought from our store in the previous month, January 2020. I want to do this for all previous months, ending up with something like:
year month repeated customers
2020 01 11
2020 02 7
2020 03 9
I have written the query below, but it only works for the current month. How would I iterate or rewrite it to get the table shown above?
WITH CURRENT_PERIOD AS (
SELECT DISTINCT userid
FROM table1
WHERE DATE(orderDate) BETWEEN DATE_TRUNC(CURRENT_DATE(),MONTH) AND DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
),
PREVIOUS_PERIOD AS (
SELECT DISTINCT userid
FROM table1
WHERE DATE(orderDate) BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH),MONTH) AND LAST_DAY(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH))
)
SELECT count(1)
FROM CURRENT_PERIOD RC
WHERE RC.userid IN (SELECT DISTINCT userid FROM PREVIOUS_PERIOD)
You can summarize to get one record per user per month, use lag() to find each user's previous purchase month, and then aggregate:
select yyyymm,
       countif(prev_yyyymm = date_add(yyyymm, interval -1 month)) as repeated_customers
from (select userid, date_trunc(date(orderDate), month) as yyyymm,
             lag(date_trunc(date(orderDate), month)) over (partition by userid order by date_trunc(date(orderDate), month)) as prev_yyyymm
      from table1
      group by 1, 2
     ) t
group by yyyymm
order by yyyymm;
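To see the lag() approach on a small dataset, here is a hedged sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) rather than BigQuery, so date(x, 'start of month') stands in for date_trunc, a SUM over a CASE stands in for countif, and the table name orders is hypothetical.

```python
import sqlite3


def repeat_customers_by_month(rows):
    """Return (month_start, repeated) where repeated counts users who also bought the month before."""
    con = sqlite3.connect(":memory:")
    # orders is a hypothetical table name used only for this sketch.
    con.execute("CREATE TABLE orders (userid INTEGER, orderDate TEXT, amount INTEGER)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    query = """
        WITH monthly AS (        -- one row per user per purchase month
            SELECT DISTINCT userid, date(orderDate, 'start of month') AS month_start
            FROM orders
        ),
        with_prev AS (           -- the previous month in which that user bought
            SELECT userid, month_start,
                   LAG(month_start) OVER (PARTITION BY userid ORDER BY month_start) AS prev_month
            FROM monthly
        )
        SELECT month_start,
               SUM(CASE WHEN prev_month = date(month_start, '-1 month') THEN 1 ELSE 0 END) AS repeated
        FROM with_prev
        GROUP BY month_start
        ORDER BY month_start
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


sample_orders = [
    (111, "2021-09-03", 20), (111, "2021-10-11", 17), (111, "2021-10-25", 12),
    (111, "2021-11-01", 20), (112, "2021-09-07", 17), (112, "2021-11-21", 17),
    (113, "2021-10-02", 30), (113, "2021-11-15", 25),
]

for row in repeat_customers_by_month(sample_orders):
    print(row)
```

User 112 bought in September and November but not October, so that user does not count as a repeat customer for November.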

Selecting records that have low numbers consecutively

I have a table as following (using bigquery):
id year month day rating
111 2020 11 30 4
111 2020 12 01 4
112 2020 11 30 5
113 2020 11 30 5
Is there a way in which I can select ids that have ratings that are consecutively (two or more consecutive records) low (low as in both records' ratings less than 4.5)?
For example, my desired output is:
id year month day rating
111 2020 11 30 4
111 2020 12 01 4
If you want all rows, then you need to look at both the previous rating and the next rating:
SELECT t.*
FROM (SELECT t.*,
LAG(rating) OVER (PARTITION BY id ORDER BY year, month, day ASC) AS prev_rating,
LEAD(rating) OVER (PARTITION BY id ORDER BY year, month, day ASC) AS next_rating,
FROM dataset.table t
) t
WHERE (rating < 4.5 and prev_rating < 4.5) OR
(rating < 4.5 and next_rating < 4.5)
Below is for BigQuery Standard SQL
select * except(grp, seq_len)
from (
select *, sum(1) over(partition by id, grp) seq_len
from (
select *,
countif(rating >= 4.5) over(partition by id order by year, month, day) grp
from `project.dataset.table`
)
where rating < 4.5
)
where seq_len > 1
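The grouping trick in the query above is that a running count of high ratings stays constant across a run of consecutive low ratings, so it can serve as a run identifier. Below is a small sketch of that idea, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) in place of BigQuery and a hypothetical table name ratings. It partitions the run length by both id and grp so that runs from different ids are never merged.

```python
import sqlite3


def consecutive_low_ratings(rows, threshold=4.5):
    """Return rows that belong to a run of two or more consecutive ratings below threshold."""
    con = sqlite3.connect(":memory:")
    # ratings is a hypothetical table name used only for this sketch.
    con.execute(
        "CREATE TABLE ratings (id INTEGER, year INTEGER, month INTEGER, day INTEGER, rating REAL)"
    )
    con.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT id, year, month, day, rating
        FROM (
            SELECT *, COUNT(*) OVER (PARTITION BY id, grp) AS seq_len
            FROM (
                -- grp only increases when a high rating appears, so each run
                -- of consecutive low ratings shares one grp value per id
                SELECT *,
                       SUM(CASE WHEN rating >= :t THEN 1 ELSE 0 END)
                           OVER (PARTITION BY id ORDER BY year, month, day) AS grp
                FROM ratings
            )
            WHERE rating < :t
        )
        WHERE seq_len > 1
        ORDER BY id, year, month, day
    """
    result = con.execute(query, {"t": threshold}).fetchall()
    con.close()
    return result


sample_ratings = [
    (111, 2020, 11, 30, 4), (111, 2020, 12, 1, 4),   # two lows in a row
    (112, 2020, 11, 30, 5),                          # high only
    (113, 2020, 11, 30, 4),                          # a single low
    (114, 2020, 11, 28, 4), (114, 2020, 11, 29, 5), (114, 2020, 11, 30, 4),  # lows split by a high
]

for row in consecutive_low_ratings(sample_ratings):
    print(row)
```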

SQL - use only clients that are present in all months

I have a dataset with different clients and their sales counts. Over time, some clients get added to and deleted from the data. How do I make sure that, when I look at the sales counts, I only use the clients that were in the dataset the whole time? I.e., if a client doesn't have a record for 2018-03, I don't want that client to be part of the query at all. Likewise, if a client does not have a record for 2020-03, I do not want that client to be part of the query.
For example, the following query:
select DATE_PART (y, sold_date)as year, DATE_PART (mm, sold_date) as month, count(distinct(client))
from sales_data
where sold_date > '2018-01-01'
group by year, month
order by year,month
Yields
year month count
2018 1 78
2018 2 83
2018 3 80
2018 4 83
2018 5 84
2018 6 81
2018 7 83
2018 8 90
2018 9 89
2018 10 95
2018 11 94
2018 12 97
2019 1 102
2019 2 103
2019 3 102
2019 4 105
2019 5 103
2019 6 104
2019 7 104
2019 8 106
2019 9 106
2019 10 108
2019 11 109
2019 12 104
2020 1 104
2020 2 102
2020 3 103
2020 4 98
2020 5 97
2020 6 79
So I want to use only the clients that are present in all months. There should be no more than 78 of them, because there cannot be more such clients than there are in the smallest month (2018-1).
FYI, I am using Amazon Redshift here but I am OK with a query that's rdbms agnostic or works for SQL-Server/Oracle/MySQL/PostgreSQL, I am just interested in a pattern on how to solve this issue effectively.
If I'm understanding what you want correctly, and if this is just a one-off query, you could use a correlated subquery in the WHERE clause. Note that this only requires each client to have a record in January 2018, not in every month:
SELECT
    DATE_PART(y, s.sold_date) AS year,
    DATE_PART(mm, s.sold_date) AS month,
    COUNT(DISTINCT s.client)
FROM
    sales_data AS s
WHERE
    EXISTS (
        SELECT sd.client
        FROM sales_data AS sd
        WHERE DATE_PART(y, sd.sold_date) = 2018 AND
              DATE_PART(mm, sd.sold_date) = 1 AND
              sd.client = s.client
    ) AND
    s.sold_date > '2018-01-01'
GROUP BY
    year,
    month
ORDER BY
    DATE_PART(y, s.sold_date),
    DATE_PART(mm, s.sold_date)
Presence in all months can be checked with a two-step aggregation:
1. group the sales data by client, keeping only the clients that have all months
2. group the sales data joined to (1) by year and month
Like this (= 12 can be a dynamic expression, depending on the amount of history you have):
with
stable_customers as (
    select client
    from sales_data
    group by 1
    having count(distinct date_trunc('month', sold_date)) = 12
)
select
    DATE_PART(y, sold_date) as year
    ,DATE_PART(mm, sold_date) as month
    ,count(1)
from sales_data
join stable_customers
using (client)
where sold_date > '2018-01-01'
group by year, month
order by year, month
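One way to make the month count dynamic is to compare each client's number of distinct months against the number of distinct months in the whole dataset. The sketch below illustrates that variant of the two-step aggregation. It assumes Python's built-in sqlite3 module instead of Redshift, strftime('%Y-%m', ...) in place of date_trunc, and, unlike the query above, counts distinct clients per month rather than rows.

```python
import sqlite3


def stable_client_counts(rows):
    """Return (year_month, clients), counting only clients present in every month of the data."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales_data (client TEXT, sold_date TEXT)")
    con.executemany("INSERT INTO sales_data VALUES (?, ?)", rows)
    query = """
        WITH monthly AS (        -- one row per client per month with a sale
            SELECT DISTINCT client, strftime('%Y-%m', sold_date) AS ym
            FROM sales_data
        ),
        stable AS (              -- step 1: clients seen in every month
            SELECT client
            FROM monthly
            GROUP BY client
            HAVING COUNT(*) = (SELECT COUNT(DISTINCT ym) FROM monthly)
        )
        SELECT m.ym, COUNT(*) AS clients   -- step 2: aggregate only those clients
        FROM monthly m
        JOIN stable s ON s.client = m.client
        GROUP BY m.ym
        ORDER BY m.ym
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


sample_sales = [
    ("A", "2018-01-03"), ("A", "2018-02-10"), ("A", "2018-02-20"), ("A", "2018-03-05"),
    ("B", "2018-01-09"), ("B", "2018-03-12"),                       # missing February
    ("C", "2018-02-14"), ("C", "2018-03-01"),                       # missing January
    ("D", "2018-01-15"), ("D", "2018-02-02"), ("D", "2018-03-30"),
]

for row in stable_client_counts(sample_sales):
    print(row)
```

Only clients A and D appear in all three months, so every month reports the same count of 2.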
Use window functions. Unfortunately, SQL Server does not support count(distinct) as a window function. Fortunately, there is a simple work-around using dense_rank():
select year, month, count(distinct client)
from (select sd.*, year, month,
(dense_rank() over (order by year, month) +
dense_rank() over (order by year desc, month desc)
) as num_months,
(dense_rank() over (partition by client order by year, month) +
dense_rank() over (partition by client order by year desc, month desc)
) as num_months_client
from sales_data sd cross apply
(values (year(sold_date), month(sold_date))) v(year, month)
where sd.sold_date > '2018-01-01'
) sd
where num_months_client = num_months
group by year, month
order by year, month;
Note: This looks only at months that are present in the data. If all clients are missing 2019-03, then that month is not considered at all.

Grouping data on SQL Server

I have this table in SQL Server:
Year Month Quantity
----------------------------
2015 January 10
2015 February 20
2015 March 30
2014 November 40
2014 August 50
How can I add two more columns: one that numbers each distinct year as a group, and one that numbers the months within each year sequentially, as in this example?
Year Month Quantity Group Subgroup
------------------------------------------------
2015 January 10 1 1
2015 February 20 1 2
2015 March 30 1 3
2014 November 40 2 1
2014 August 50 2 2
You can use DENSE_RANK to calculate the groups for you:
SELECT t1.*, DENSE_RANK() OVER (ORDER BY Year DESC) AS [Group],
DENSE_RANK() OVER (PARTITION BY Year ORDER BY DATEPART(month, Month + ' 01 2010')) AS [SubGroup]
FROM t1
ORDER BY 4, 5
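For a quick check of the DENSE_RANK idea, here is a minimal sketch that assumes Python's built-in sqlite3 module (SQLite 3.25 or later) instead of SQL Server. SQLite has no DATEPART, so a hypothetical month_no lookup maps month names to numbers. Note that it orders months chronologically within a year, so August comes before November for 2014.

```python
import sqlite3


def group_and_subgroup(rows):
    """Return (year, month, quantity, grp, subgrp) using DENSE_RANK for both numbers."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (year INTEGER, month TEXT, quantity INTEGER)")
    con.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)
    query = """
        WITH month_no(name, num) AS (   -- hypothetical lookup replacing DATEPART
            VALUES ('January', 1), ('February', 2), ('March', 3), ('April', 4),
                   ('May', 5), ('June', 6), ('July', 7), ('August', 8),
                   ('September', 9), ('October', 10), ('November', 11), ('December', 12)
        )
        SELECT t.year, t.month, t.quantity,
               DENSE_RANK() OVER (ORDER BY t.year DESC) AS grp,
               DENSE_RANK() OVER (PARTITION BY t.year ORDER BY mn.num) AS subgrp
        FROM t1 t
        JOIN month_no mn ON mn.name = t.month
        ORDER BY grp, subgrp
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# The sample data from the question.
sample_quantities = [
    (2015, "January", 10), (2015, "February", 20), (2015, "March", 30),
    (2014, "November", 40), (2014, "August", 50),
]

for row in group_and_subgroup(sample_quantities):
    print(row)
```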
To associate group and subgroup with a number you can do this:
WITH RankedTable AS (
SELECT year, month, quantity,
ROW_NUMBER() OVER (partition by year order by Month) AS rn
FROM yourtable)
SELECT year, month, quantity,
SUM (CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (ORDER BY YEAR) as year_group,
rn AS subgroup
FROM RankedTable
Here the ROW_NUMBER() ... OVER clause computes the position of each month within its year.
The SUM() ... OVER clause computes a running count of the rows with rn = 1, which increases by one for each new year.