Top 2 per month in SQL - sql

I have this dataset, which has dates and products for cities:
CREATE TABLE my_table (
the_id varchar(5) NOT NULL,
the_date timestamp NOT NULL,
the_city varchar(5) NOT NULL,
the_product varchar(1) NOT NULL
);
INSERT INTO my_table
VALUES ('VIS01', '2019-05-02 09:00:00','LISBO','A'),
('VIS02', '2019-05-04 12:00:00','EVORA','A'),
('VIS03', '2019-05-05 18:00:00','LISBO','B'),
('VIS04', '2019-05-06 18:30:00','PORTO','B'),
('VIS05', '2019-05-15 12:05:00','PORTO','C'),
('VIS06', '2019-06-02 18:06:00','EVORA','C'),
('VIS07', '2019-06-02 18:07:00','PORTO','A'),
('VIS08', '2019-06-04 18:08:00','EVORA','B'),
('VIS09', '2019-06-07 18:09:00','LISBO','B'),
('VIS10', '2019-06-09 18:10:00','LISBO','D'),
('VIS11', '2019-06-12 18:11:00','EVORA','D'),
('VIS12', '2019-06-15 18:12:00','LISBO','E'),
('VIS13', '2019-06-15 18:13:00','EVORA','F'),
('VIS14', '2019-06-18 18:14:00','PORTO','G'),
('VIS15', '2019-06-23 18:15:00','LISBO','A'),
('VIS16', '2019-06-25 18:16:00','LISBO','A'),
('VIS17', '2019-06-27 18:17:00','LISBO','F'),
('VIS18', '2019-06-27 18:18:00','LISBO','A'),
('VIS19', '2019-06-28 18:19:00','LISBO','A'),
('VIS20', '2019-06-30 18:20:00','EVORA','D'),
('VIS21', '2019-07-01 18:21:00','EVORA','D'),
('VIS22', '2019-07-04 18:30:00','EVORA','D'),
('VIS23', '2019-07-04 18:31:00','EVORA','B'),
('VIS24', '2019-07-06 18:40:00','EVORA','K'),
('VIS25', '2019-07-12 18:50:00','EVORA','G'),
('VIS26', '2019-07-15 18:00:00','PORTO','C'),
('VIS27', '2019-07-18 18:00:00','PORTO','C'),
('VIS28', '2019-07-25 18:00:00','PORTO','B'),
('VIS29', '2019-07-30 18:00:00','PORTO','M');
I want the top two products per month, ranked by count. The expected result is:
month product count
2019-05 A 2
2019-05 B 2
2019-06 A 5
2019-06 D 3
2019-07 C 2
2019-07 D 2
But I'm not quite sure how to group by month. Please, any help will be greatly appreciated.

First, you can use to_char(the_date,'YYYY-MM') to get the year and month in the right format.
Next, you can group by the month and product, use count(*) to count the rows in each group, and use row_number() to number the products within each month from the highest count down.
SELECT to_char(the_date,'YYYY-MM') as month,
the_product as product,
count(*) as p_count,
row_number() over (partition by to_char(the_date,'YYYY-MM') order by count(*) desc) as seq
FROM my_table
group by month, product
Last, you can wrap that in an outer query to select just the columns and rows that you want.
SELECT month, product, p_count as count
FROM (
SELECT to_char(the_date,'YYYY-MM') as month,
the_product as product,
count(*) as p_count,
row_number() over (partition by to_char(the_date,'YYYY-MM') order by count(*) desc) as seq
FROM my_table
group by month, product
) as foo
where foo.seq <= 2;
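If you want to try the idea without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the query above verbatim: SQLite has no to_char, so strftime('%Y-%m', ...) stands in for it, the count and the numbering are split into two CTEs so each step is visible, and a tie-breaker on the product name (my assumption, not part of the question) makes the result deterministic. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# A subset of the question's sample rows: (id, date, city, product).
ROWS = [
    ("VIS01", "2019-05-02 09:00:00", "LISBO", "A"),
    ("VIS02", "2019-05-04 12:00:00", "EVORA", "A"),
    ("VIS03", "2019-05-05 18:00:00", "LISBO", "B"),
    ("VIS04", "2019-05-06 18:30:00", "PORTO", "B"),
    ("VIS05", "2019-05-15 12:05:00", "PORTO", "C"),
    ("VIS06", "2019-06-02 18:06:00", "EVORA", "C"),
    ("VIS07", "2019-06-02 18:07:00", "PORTO", "A"),
    ("VIS10", "2019-06-09 18:10:00", "LISBO", "D"),
    ("VIS11", "2019-06-12 18:11:00", "EVORA", "D"),
    ("VIS15", "2019-06-23 18:15:00", "LISBO", "A"),
    ("VIS16", "2019-06-25 18:16:00", "LISBO", "A"),
]


def top_two_per_month(rows):
    """Return (month, product, count) for the two most frequent products per month."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (the_id TEXT, the_date TEXT, the_city TEXT, the_product TEXT)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", rows)
    query = """
        WITH counts AS (
            -- Step 1: one row per (month, product) with its count.
            SELECT strftime('%Y-%m', the_date) AS month,
                   the_product AS product,
                   count(*) AS p_count
            FROM my_table
            GROUP BY 1, 2
        ),
        ranked AS (
            -- Step 2: number the products inside each month, highest count first.
            -- The product name breaks ties (an assumption added for determinism).
            SELECT month, product, p_count,
                   row_number() OVER (PARTITION BY month
                                      ORDER BY p_count DESC, product) AS seq
            FROM counts
        )
        SELECT month, product, p_count
        FROM ranked
        WHERE seq <= 2
        ORDER BY month, seq
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in top_two_per_month(ROWS):
    print(row)
```

Without a tie-breaker, row_number() picks arbitrarily among products with the same count, which matters for months where more than two products tie.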

You can use aggregation and window functions:
select mp.*
from (select date_trunc('month', the_date) as yyyymm,
the_product, count(*) as cnt,
row_number() over (partition by date_trunc('month', the_date) order by count(*) desc) as seqnum
from my_table
group by yyyymm, the_product
) mp
where seqnum <= 2;

In PostgreSQL, I believe you can extract any part of a timestamp using the EXTRACT function.
e.g.:
SELECT the_date, EXTRACT(MONTH FROM the_date) AS month FROM my_table
the_date     month
2019-08-05   8
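That said, you can then group by month and product and keep the top 2 in each month. A plain ORDER BY ... LIMIT 2 would return only two rows for the whole result, so number the rows within each month and filter on that number:
SELECT month, the_product, cnt
FROM (SELECT EXTRACT(MONTH FROM the_date) AS month, the_product, count(*) AS cnt,
row_number() OVER (PARTITION BY EXTRACT(MONTH FROM the_date) ORDER BY count(*) DESC) AS seq
FROM my_table
GROUP BY EXTRACT(MONTH FROM the_date), the_product) t
WHERE seq <= 2
ORDER BY month, cnt DESC
I don't have a database to test the query, so it may need some adjustment, but it should give you a good start. If the data spans more than one year, also group and partition by EXTRACT(YEAR FROM the_date).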
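One caveat when grouping on the month number alone: rows from the same month of different years end up in the same group. Here is a small sketch of the difference, using Python's sqlite3 module with strftime as a stand-in for EXTRACT; the table name visits and the 2020 date are made up for illustration.

```python
import sqlite3

# Hypothetical dates; the 2020 row is invented to show the collision.
DATES = ["2019-05-02", "2019-05-04", "2020-05-10", "2019-06-01"]


def monthly_counts(dates, fmt):
    """Count rows per period, where the period is strftime(fmt, the_date)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (the_date TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?)", [(d,) for d in dates])
    rows = conn.execute(
        "SELECT strftime(?, the_date) AS period, count(*) "
        "FROM visits GROUP BY 1 ORDER BY 1",
        (fmt,),
    ).fetchall()
    conn.close()
    return rows


# Month number only: May 2019 and May 2020 are merged.
print(monthly_counts(DATES, "%m"))     # [('05', 3), ('06', 1)]
# Year and month: each calendar month stays separate.
print(monthly_counts(DATES, "%Y-%m"))  # [('2019-05', 2), ('2019-06', 1), ('2020-05', 1)]
```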

Related

RANK() over (PARTITION BY) To show only TOP 3 rows for each month

I have a question about ranking. (I am using pgAdmin to run my SQL.)
I managed to get my sum of sales in descending order, ranked 1 to 3, for the month of April.
But how can I show only ranks 1 to 3 for each of April, May and June?
The result should contain only 9 rows.
SELECT restaurant_id,
EXTRACT(year FROM submitted_on) AS year,
EXTRACT(month FROM submitted_on) AS month,
SUM(total_amount),
RANK() OVER (PARTITION BY(extract(month from submitted_on))
ORDER BY SUM(total_amount) DESC) rank
FROM orders
WHERE submitted_on::date BETWEEN '2021-04-01' AND '2021-06-30'
GROUP BY restaurant_id, year, month
If you want exactly 3 records per month, use ROW_NUMBER() instead of RANK(), because RANK() can return more than 3 rows when totals tie. Either way, wrap your query in a subquery and filter on the rank:
select t.* from (
SELECT restaurant_id,
EXTRACT(year FROM submitted_on) AS year,
EXTRACT(month FROM submitted_on) AS month,
SUM(total_amount),
RANK() OVER (PARTITION BY(extract(month from submitted_on))
ORDER BY SUM(total_amount) DESC) rank
FROM orders
WHERE submitted_on::date BETWEEN '2021-04-01' AND '2021-06-30'
GROUP BY restaurant_id, year, month
) t
where rank <=3;
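To see why the choice between RANK() and ROW_NUMBER() matters, here is a small sketch with Python's sqlite3 module (SQLite 3.25 or later). The totals table and its values are invented for illustration; restaurants 3 and 4 tie for third place.

```python
import sqlite3

# Hypothetical pre-aggregated totals: (month, restaurant_id, total).
TOTALS = [(4, 1, 500), (4, 2, 400), (4, 3, 300), (4, 4, 300), (4, 5, 100)]


def top_three(totals, func):
    """Return restaurant ids whose per-month rank is at most 3, using the given window function."""
    if func not in ("rank", "row_number"):
        raise ValueError("func must be 'rank' or 'row_number'")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE totals (month INTEGER, restaurant_id INTEGER, total INTEGER)"
    )
    conn.executemany("INSERT INTO totals VALUES (?, ?, ?)", totals)
    # func is checked against a whitelist above, so formatting it in is safe here.
    query = f"""
        SELECT restaurant_id
        FROM (
            SELECT restaurant_id,
                   {func}() OVER (PARTITION BY month ORDER BY total DESC) AS rnk
            FROM totals
        ) AS t
        WHERE rnk <= 3
        ORDER BY rnk, restaurant_id
    """
    ids = [row[0] for row in conn.execute(query)]
    conn.close()
    return ids


print(top_three(TOTALS, "rank"))             # tied restaurants share rank 3, so 4 ids come back
print(len(top_three(TOTALS, "row_number")))  # always exactly 3 ids
```

With ROW_NUMBER() the tie is broken arbitrarily unless you add a second ORDER BY column, so pick whichever behaviour fits the report.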

Top N items in every month - BIGQUERY

I have the BigQuery query below:
WITH cte AS(
SELECT *
FROM (
SELECT project_name,
SUM(reward_value) AS total_reward_value,
DATE_TRUNC(date_signing, MONTH) as month,
date_signing,
Row_number() over (partition by DATE_TRUNC(date_signing, MONTH)
order by SUM(reward_value) desc) AS rank
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month, date_signing
)
)
SELECT * FROM cte WHERE rank <= 5
that returns the following result:
Instead, I expect each unique project to be summed within each month, and then to keep only the top 5 per month.
Something like this:
I get the following error if date_signing is removed from the GROUP BY:
PARTITION BY expression references column date_signing which is neither grouped nor aggregated at [16:48]
Any hints what should be corrected will be appreciated!
One more subquery maybe then?
WITH cte AS(
SELECT project_name,
SUM(reward_value) as reward_sum,
DATE_TRUNC(date_signing, MONTH) as month
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month
),
ranks AS (
SELECT
project_name,
reward_sum,
month,
ROW_NUMBER() over (PARTITION BY month ORDER BY reward_sum DESC) AS rank
FROM cte
)
SELECT *
FROM ranks
WHERE rank <= 5
Right, you can't select date_signing without grouping by it. You can show the last signing date per project and month instead:
WITH cte AS(
SELECT project_name,
SUM(reward_value),
DATE_TRUNC(date_signing, MONTH) as month,
MAX(date_signing) as last_signing_date,
Row_number() over (partition by DATE_TRUNC(date_signing, MONTH)
order by SUM(reward_value) desc) AS rank
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month
)
SELECT * FROM cte WHERE rank <= 5

First users by categories in BigQuery

How can I count the new and existing users by category and year?
For instance, during 2015-2020, if someone first bought a product in category_A in 2016, they are counted as a new user in 2016 in category_A, even though this user bought a product in category_B in 2015.
Table_1 (Columns: product_name, date, category, sales, user_id)
I want to get the result below:
One approach uses two levels of aggregation:
select extract(year from mindate) yr, category, count(*) num_new
from (
select user_id, category, min(date) mindate
from table_1
group by user_id, category
) t
group by extract(year from mindate), category
The subquery retrieves the first purchase date of each user by category. Then, the outer query aggregates by the year of that date.
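Here is a minimal sketch of the two-level aggregation that you can run locally with Python's sqlite3 module. The purchase rows are invented, and strftime('%Y', ...) stands in for extract(year from ...), which SQLite does not have.

```python
import sqlite3

# Hypothetical purchases: (user_id, category, date).
PURCHASES = [
    ("u1", "category_B", "2015-03-01"),
    ("u1", "category_A", "2016-05-10"),
    ("u1", "category_A", "2017-01-15"),
    ("u2", "category_A", "2016-07-20"),
    ("u3", "category_A", "2017-02-02"),
]


def new_users_per_year(purchases):
    """Return (year, category, number of users whose first purchase in that category fell in that year)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_1 (user_id TEXT, category TEXT, date TEXT)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?, ?)", purchases)
    query = """
        SELECT CAST(strftime('%Y', mindate) AS INTEGER) AS yr,
               category,
               count(*) AS num_new
        FROM (
            -- Inner level: first purchase date per user and category.
            SELECT user_id, category, min(date) AS mindate
            FROM table_1
            GROUP BY user_id, category
        ) AS t
        -- Outer level: count those first purchases per year and category.
        GROUP BY yr, category
        ORDER BY yr, category
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(new_users_per_year(PURCHASES))
```

Here u1 counts as new in category_A in 2016 even though the same user already bought in category_B in 2015, which is the case described in the question.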
If you want the count of current users as well, then it is a bit different. You can use a window function in the subquery rather than aggregation, then count distinct values in the outer query:
select extract(year from date) yr, category,
count(distinct if(date = mindate, user_id, null)) num_new,
count(distinct user_id) num_total
from (
select date, user_id, category, min(date) over(partition by user_id, category) mindate
from table_1
) t
group by yr, category
Below is for BigQuery Standard SQL
#standardSQL
WITH temp AS (
SELECT *,
0 = COUNT(1) OVER(
PARTITION BY user_id, category
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) new_user
FROM `project.dataset.table_1`
ORDER BY date, user_id
)
SELECT EXTRACT(YEAR FROM date) AS year,
category,
COUNT(DISTINCT IF(new_user, user_id, NULL)) AS num_new,
COUNT(DISTINCT IF(new_user, NULL, user_id)) AS num_existing
FROM temp
GROUP BY year, category
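The same window-frame trick can be tried locally. This is a rough sketch with Python's sqlite3 module (SQLite 3.25 or later), not BigQuery itself: CASE replaces IF, strftime replaces EXTRACT, the CTE is called flagged, and the purchase rows are invented.

```python
import sqlite3

# Hypothetical purchases: (user_id, category, date).
PURCHASES = [
    ("u1", "category_B", "2015-03-01"),
    ("u1", "category_A", "2016-05-10"),
    ("u1", "category_A", "2017-01-15"),
    ("u2", "category_A", "2016-07-20"),
    ("u3", "category_A", "2017-02-02"),
]


def new_and_existing(purchases):
    """Return (year, category, num_new, num_existing) per year and category."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_1 (user_id TEXT, category TEXT, date TEXT)")
    conn.executemany("INSERT INTO table_1 VALUES (?, ?, ?)", purchases)
    query = """
        WITH flagged AS (
            SELECT *,
                   -- The frame covers only earlier rows of the same user and
                   -- category, so a count of 0 marks the first purchase.
                   (count(*) OVER (
                       PARTITION BY user_id, category
                       ORDER BY date
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                   ) = 0) AS new_user
            FROM table_1
        )
        SELECT CAST(strftime('%Y', date) AS INTEGER) AS year,
               category,
               count(DISTINCT CASE WHEN new_user THEN user_id END) AS num_new,
               count(DISTINCT CASE WHEN NOT new_user THEN user_id END) AS num_existing
        FROM flagged
        GROUP BY year, category
        ORDER BY year, category
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(new_and_existing(PURCHASES))
```

Note that a user who makes a first and a repeat purchase in the same year and category is counted in both columns for that year.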

How can I count users in a month that were not present in the month before?

I am trying to count unique users on a monthly basis that were not present in the previous month. So if a user has a record for January and then another one for February, then I would only count January for that user.
user_id time
a1 1/2/17
a1 2/10/17
a2 2/18/17
a4 2/5/17
a5 3/25/17
My results should look like this
Month User Count
January 1
February 2
March 1
I'm not really familiar with BigQuery, but here's how I would solve the problem using TSQL. I imagine that you'd be able to use similar logic in BigQuery.
1). Order the data by user_id first, and then time. In TSQL, you can accomplish this with the following and store it in a common table expression, which you will query in the step after this.
;WITH cte AS
(
select ROW_NUMBER() OVER (PARTITION BY [user_id] ORDER BY [time]) AS rn,*
from dbo.employees
)
2). Next query for only the rows with rn = 1 (the first occurrence for a particular user) and group by the month.
select DATENAME(month, [time]) AS [Month], count(*) AS user_count
from cte
where rn = 1
group by DATENAME(month, [time])
This is assuming that 2017 is the only year you're dealing with. If you're dealing with more than one year, you probably want step #2 to look something like this:
select year([time]) as [year], DATENAME(month, [time]) AS [month],
count(*) AS user_count
from cte
where rn = 1
group by year([time]), DATENAME(month, [time])
First aggregate by the user id and the month. Then use lag() to see if the user was present in the previous month:
with du as (
select date_trunc(time, month) as yyyymm, user_id
from t
group by date_trunc(time, month), user_id
)
select yyyymm, count(*)
from (select du.*,
lag(yyyymm) over (partition by user_id order by yyyymm) as prev_yyyymm
from du
) du
where prev_yyyymm is null or
prev_yyyymm < date_sub(yyyymm, interval 1 month)
group by yyyymm;
Note: This uses the date functions, but similar functions exist for timestamp.
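For anyone who wants to check the lag() logic without BigQuery, here is a sketch using Python's sqlite3 module (SQLite 3.25 or later). It is an adaptation under a few assumptions: dates are stored as ISO strings, the column is named seen_on instead of time, date(..., 'start of month') stands in for date_trunc, and the last row is an extra visit that is not in the question, added to show a user returning after a gap.

```python
import sqlite3

# The question's data as ISO dates, plus one invented row at the end.
VISITS = [
    ("a1", "2017-01-02"),
    ("a1", "2017-02-10"),
    ("a2", "2017-02-18"),
    ("a4", "2017-02-05"),
    ("a5", "2017-03-25"),
    ("a1", "2017-04-03"),  # extra row: a1 comes back after skipping March
]


def users_absent_previous_month(visits):
    """Return (month start, users active that month who were not active the month before)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (user_id TEXT, seen_on TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", visits)
    query = """
        WITH du AS (
            -- One row per user and month.
            SELECT DISTINCT date(seen_on, 'start of month') AS yyyymm, user_id
            FROM t
        ),
        with_prev AS (
            -- Previous active month of the same user, NULL for the first one.
            SELECT du.*,
                   lag(yyyymm) OVER (PARTITION BY user_id ORDER BY yyyymm) AS prev_yyyymm
            FROM du
        )
        SELECT yyyymm, count(*)
        FROM with_prev
        WHERE prev_yyyymm IS NULL
           OR prev_yyyymm < date(yyyymm, '-1 month')
        GROUP BY yyyymm
        ORDER BY yyyymm
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(users_absent_previous_month(VISITS))
```

a1 is skipped in February because of the January visit, but counted again in April because there was no March visit.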
The way I understand the question, a user should be excluded from a given month only if the same user was present in the previous month. If the user was present a few months earlier, but not in the previous month, the user should be counted.
If this is correct, try the query below for BigQuery Standard SQL:
#standardSQL
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
You can test the query above with the dummy data from your question:
#standardSQL
WITH yourTable AS (
SELECT 'a1' AS user_id, '1/2/17' AS time UNION ALL
SELECT 'a1', '2/10/17' UNION ALL
SELECT 'a2', '2/18/17' UNION ALL
SELECT 'a4', '2/5/17' UNION ALL
SELECT 'a5', '3/25/17'
)
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
The output is
Year Month User_Count
2017 January 1
2017 February 2
2017 March 1
Try this query:
SELECT
t1.d,
count(DISTINCT t1.user_id)
FROM
(
SELECT
EXTRACT(MONTH FROM time) AS d,
--EXTRACT(MONTH FROM time)-1 AS d2,
user_id
FROM nbitra.tmp
) t1
LEFT JOIN
(
SELECT
EXTRACT(MONTH FROM time) AS d,
user_id
FROM nbitra.tmp
) t2
ON t1.d = t2.d+1
AND t1.user_id = t2.user_id --Same user in the previous month
WHERE
t2.user_id IS NULL --No row for this user in the previous month (this also covers January)
GROUP BY t1.d;

Retrieve recent 5 days forecast for each cities with latest issue date

I need to retrieve the most recent 5 days of forecast info for each city.
My table looks like the one below:
The real problem is with the issue date.
A city may have several forecasts for the same date, each with a different issue date.
I need to retrieve the 5 most recent forecast dates for each city, keeping only the row with the latest issue date for each forecast date.
I have tried something like the query below, but it does not give the expected result:
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID ORDER BY FORECAST_DATE DESC, ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
GROUP BY FORECAST_DATE
) WHERE rn <= 5
Any suggestion or advice will be helpful
This will get the latest issued forecast per day over the most recent 5 days for each city:
SELECT *
FROM (
SELECT f.*,
DENSE_RANK() OVER ( PARTITION BY city_id ORDER BY forecast_date DESC )
AS forecast_rank,
ROW_NUMBER() OVER ( PARTITION BY city_id, forecast_date ORDER BY issue_date DESC )
AS issue_rn
FROM Forecast f
)
WHERE forecast_rank <= 5
AND issue_rn = 1;
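Here is a small sketch of the same DENSE_RANK() plus ROW_NUMBER() combination that can be run with Python's sqlite3 module (SQLite 3.25 or later). The forecast rows are invented: city 1 has six forecast dates, two of them issued twice, and city 2 has only two.

```python
import sqlite3

# Hypothetical rows: (city_id, forecast_date, issue_date).
FORECASTS = [
    (1, "2024-01-01", "2023-12-30"),
    (1, "2024-01-02", "2023-12-30"),
    (1, "2024-01-03", "2023-12-30"),
    (1, "2024-01-04", "2023-12-30"),
    (1, "2024-01-05", "2023-12-30"),
    (1, "2024-01-05", "2023-12-31"),
    (1, "2024-01-06", "2023-12-30"),
    (1, "2024-01-06", "2023-12-31"),
    (2, "2024-01-05", "2023-12-31"),
    (2, "2024-01-06", "2023-12-30"),
]


def latest_five_days(forecasts):
    """Return the latest-issued row for each of the 5 most recent forecast dates per city."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE forecast (city_id INTEGER, forecast_date TEXT, issue_date TEXT)"
    )
    conn.executemany("INSERT INTO forecast VALUES (?, ?, ?)", forecasts)
    query = """
        SELECT city_id, forecast_date, issue_date
        FROM (
            SELECT f.*,
                   -- Same forecast date gets the same rank, so duplicates
                   -- do not use up one of the 5 slots.
                   DENSE_RANK() OVER (PARTITION BY city_id
                                      ORDER BY forecast_date DESC) AS forecast_rank,
                   -- 1 marks the most recently issued row for that date.
                   ROW_NUMBER() OVER (PARTITION BY city_id, forecast_date
                                      ORDER BY issue_date DESC) AS issue_rn
            FROM forecast f
        ) AS ranked
        WHERE forecast_rank <= 5
          AND issue_rn = 1
        ORDER BY city_id, forecast_date DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in latest_five_days(FORECASTS):
    print(row)
```

For city 1 the oldest date (2024-01-01) drops out, and the two dates with two issues each keep only the later issue.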
Partition by works like group by but for the function only.
Try
with CTE as
(
select t1.*,
row_number() over (partition by city_id, forecast_date order by issue_date desc) as r_ord
from Forecast t1
)
select CTE.*
from CTE
where r_ord <= 5
Try this
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID, FORECAST_DATE order by ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
) WHERE rn <= 5