For each row in query select top 20 from other query - sql

I'm trying to do something and I'm not sure how to do it.
I have some data like this:
WITH a AS (SELECT theid, thename, thetimestamp FROM mytable)
SELECT thename, TRUNC(thetimestamp, 'HH24'), COUNT(theid) FROM a
GROUP BY thename, TRUNC(thetimestamp, 'HH24') ORDER BY COUNT(theid) DESC
which returns the count grouped by hour and name.
I would like to get, for each hour, just the top X counts.
Is that possible?
I ended up with:
SELECT thename, hour, cnt
FROM
( SELECT thename, hour, cnt,
rank() over (partition by hour order by cnt desc) rnk
FROM
( SELECT thename, TRUNC (thetimestamp, 'HH24') hour, COUNT (theid) cnt
FROM mytable
group by thename,trunc(thetimestamp,'HH24')
)
)
WHERE rnk <= :X

Try:
SELECT thename, hour, cnt
FROM
( SELECT thename, hour, cnt,
rank() over (partition by thename order by cnt desc) rnk
FROM
( SELECT thename, TRUNC (thetimestamp, 'HH24') hour, COUNT (theid) cnt
FROM mytable
group by thename,trunc(thetimestamp,'HH24')
)
)
WHERE rnk <= :X
(I didn't see the purpose of the WITH clause so I removed it from mine).

You could do that with row_number(), but it requires another subquery or another CTE. Here's the double CTE, since Tony Andrews already posted the subquery approach:
WITH a AS (
SELECT thename, TRUNC(thetimestamp, 'HH24') as hour, COUNT(*) cnt
FROM mytable
GROUP BY thename, TRUNC(thetimestamp, 'HH24')
), b AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY hour ORDER BY cnt DESC) rn,
thename, hour, cnt
FROM a
)
SELECT *
FROM b
WHERE rn <= 20
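A small note not covered in either answer: rank() can return more than X rows for a group when counts tie, while row_number() returns exactly X rows (ties broken arbitrarily). If exactly X names per hour are wanted regardless of ties, a hypothetical single-level variant (untested, Oracle syntax as above) could be:
SELECT thename, hour, cnt
FROM ( SELECT thename,
              TRUNC(thetimestamp, 'HH24') hour,
              COUNT(theid) cnt,
              -- analytic functions are evaluated after GROUP BY, so COUNT(theid) can be referenced here
              ROW_NUMBER() OVER (PARTITION BY TRUNC(thetimestamp, 'HH24')
                                 ORDER BY COUNT(theid) DESC) rn
       FROM mytable
       GROUP BY thename, TRUNC(thetimestamp, 'HH24')
     )
WHERE rn <= :X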


Optimizing SQL query - finding a group within a group

I have a working query and am looking for ideas to optimize it.
Query explanation: within each ID group (visitor_id), look for the first row where c_id != 0; from that row on, show all consecutive rows within that ID group.
select t2.*
from (select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as row_number
from "DB"."schema"."table"
where visitor_id in
(select distinct visitor_id
from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
where c_id in ('101')
)
) as t2
inner join
(select visitor_id, min(rn) as row_number
from
(select *, row_number() OVER (PARTITION BY visitor_id ORDER BY date) as rn
from "DB"."schema"."table"
where visitor_id in
(select distinct visitor_id
from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
where c_id in ('101')
)
) as filtered_table
where c_id != 0
group by visitor_id) as t1
on t2.visitor_id = t1.visitor_id
and t2.row_number >= t1.row_number
So you have a common subexpression:
select distinct visitor_id
from (select * from "DB"."schema"."table" where date >= '2021-08-01' and date <= '2021-08-30')
where c_id in ('101')
so that can be moved to a CTE and run just once, like:
WITH distinct_visitors AS (
SELECT DISTINCT visitor_id
FROM (SELECT * FROM "DB"."schema"."table" WHERE date >= '2021-08-01' and date <= '2021-08-30')
where c_id in ('101')
)
But the subquery filter is equally valid as a top-level filter, and since it's an inclusive value-range filter, BETWEEN will give better performance:
WITH distinct_visitors AS (
SELECT DISTINCT visitor_id
FROM "DB"."schema"."table"
WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
AND c_id IN ('101')
)
Then both uses of that CTE do the same ROW_NUMBER operation, so that can also be pulled into a CTE, and the query simplifies to:
WITH rw_rows AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
FROM "DB"."schema"."table"
WHERE visitor_id IN (
SELECT DISTINCT visitor_id
FROM "DB"."schema"."table"
WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
AND c_id in ('101')
)
)
SELECT t2.*
FROM rw_rows AS t2
JOIN (
SELECT visitor_id,
min(rn) AS row_number
FROM rw_rows AS filtered_table
WHERE c_id != 0
GROUP BY visitor_id
) AS t1
ON t2.visitor_id = t1.visitor_id
AND t2.row_number >= t1.row_number
So we want to keep all rows from the first non-zero c_id onward, which a QUALIFY should be able to solve, like:
WITH rw_rows AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
FROM "DB"."schema"."table"
WHERE visitor_id IN (
SELECT DISTINCT visitor_id
FROM "DB"."schema"."table"
WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
AND c_id in ('101')
)
)
SELECT t2.*,
MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id) as min_rn
FROM rw_rows AS t2
QUALIFY t2.row_number >= min_rn
which, without having run it, feels like the MIN should also be able to move into the QUALIFY, like:
WITH rw_rows AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
FROM "DB"."schema"."table"
WHERE visitor_id IN (
SELECT DISTINCT visitor_id
FROM "DB"."schema"."table"
WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
AND c_id in ('101')
)
)
SELECT t2.*
FROM rw_rows AS t2
QUALIFY t2.row_number >= MIN(IFF(c_id != 0, row_number, NULL )) OVER (PARTITION BY visitor_id)
At which point the CTE is not needed, as it's only used once, so it could be inlined again (see the sketch below), or left as-is, since the two forms are equivalent.
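A sketch of that fully inlined form, untested and assuming Snowflake as above:
SELECT t2.*
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY visitor_id ORDER BY date) AS row_number
    FROM "DB"."schema"."table"
    WHERE visitor_id IN (
        SELECT DISTINCT visitor_id
        FROM "DB"."schema"."table"
        WHERE date BETWEEN '2021-08-01' AND '2021-08-30'
          AND c_id IN ('101')
    )
) AS t2
QUALIFY t2.row_number >= MIN(IFF(c_id != 0, row_number, NULL)) OVER (PARTITION BY visitor_id)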

How to get max date among other ids for current id using BigQuery?

I need to get the max date for each row over the other ids. Of course I can do this with CROSS JOIN and JOIN.
Like this:
WITH t AS (
SELECT 1 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-09-01','2021-09-09', INTERVAL 1 DAY)) rep_date
UNION ALL
SELECT 2 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-20','2021-09-03', INTERVAL 1 DAY)) rep_date
UNION ALL
SELECT 3 AS id, rep_date FROM UNNEST(GENERATE_DATE_ARRAY('2021-08-25','2021-09-05', INTERVAL 1 DAY)) rep_date
)
SELECT id, rep_date, MAX(rep_date) OVER (PARTITION BY id) max_date, max_date_over_others FROM t
JOIN (
SELECT t.id, MAX(max_date) max_date_over_others FROM t
CROSS JOIN (
SELECT id, MAX(rep_date) max_date FROM t
GROUP BY 1
) t1
WHERE t1.id <> t.id
GROUP BY 1
) USING (id)
But it's too unwieldy for huge tables. So I'm looking for some simpler way to do this. Any ideas?
Your version is good enough, I think. But if you want to try other options, consider the approach below (it reuses your t CTE, so temp continues the same WITH clause). It might look more verbose at first glance, but it should be more optimal and cheaper compared with your cross join version:
temp as (
select id,
greatest(
ifnull(max(max_date_for_id) over preceding_ids, '1970-01-01'),
ifnull(max(max_date_for_id) over following_ids, '1970-01-01')
) as max_date_for_rest_ids
from (
select id, max(rep_date) max_date_for_id
from t
group by id
)
window
preceding_ids as (order by id rows between unbounded preceding and 1 preceding),
following_ids as (order by id rows between 1 following and unbounded following)
)
select *
from t
join temp
using (id)
Assuming your original table data just has columns id and dt, wouldn't this solve it? I'm using the fact that if an id holds the overall max dt, then its max over the other ids is the second-highest value.
WITH max_dates AS
(
SELECT
id,
MAX(dt) AS max_dt
FROM
data
GROUP BY
id
),
with_top1_value AS
(
SELECT
*,
MAX(max_dt) OVER () AS max_overall_dt_1,
MIN(max_dt) OVER () AS min_overall_dt
FROM
max_dates
),
with_top2_values AS
(
SELECT
*,
MAX(CASE WHEN max_dt = max_overall_dt_1 THEN min_overall_dt ELSE max_dt END) OVER () AS max_overall_dt_2
FROM
with_top1_value
)
SELECT
*,
CASE WHEN max_dt = max_overall_dt_1 THEN max_overall_dt_2 ELSE max_overall_dt_1 END AS max_dt_of_others
FROM
with_top2_values

Top 2 per month in SQL

I have this dataset, which has dates and products for cities:
CREATE TABLE my_table (
the_id varchar(5) NOT NULL,
the_date timestamp NOT NULL,
the_city varchar(5) NOT NULL,
the_product varchar(1) NOT NULL
);
INSERT INTO my_table
VALUES ('VIS01', '2019-05-02 09:00:00','LISBO','A'),
('VIS02', '2019-05-04 12:00:00','EVORA','A'),
('VIS03', '2019-05-05 18:00:00','LISBO','B'),
('VIS04', '2019-05-06 18:30:00','PORTO','B'),
('VIS05', '2019-05-15 12:05:00','PORTO','C'),
('VIS06', '2019-06-02 18:06:00','EVORA','C'),
('VIS07', '2019-06-02 18:07:00','PORTO','A'),
('VIS08', '2019-06-04 18:08:00','EVORA','B'),
('VIS09', '2019-06-07 18:09:00','LISBO','B'),
('VIS10', '2019-06-09 18:10:00','LISBO','D'),
('VIS11', '2019-06-12 18:11:00','EVORA','D'),
('VIS12', '2019-06-15 18:12:00','LISBO','E'),
('VIS13', '2019-06-15 18:13:00','EVORA','F'),
('VIS14', '2019-06-18 18:14:00','PORTO','G'),
('VIS15', '2019-06-23 18:15:00','LISBO','A'),
('VIS16', '2019-06-25 18:16:00','LISBO','A'),
('VIS17', '2019-06-27 18:17:00','LISBO','F'),
('VIS18', '2019-06-27 18:18:00','LISBO','A'),
('VIS19', '2019-06-28 18:19:00','LISBO','A'),
('VIS20', '2019-06-30 18:20:00','EVORA','D'),
('VIS21', '2019-07-01 18:21:00','EVORA','D'),
('VIS22', '2019-07-04 18:30:00','EVORA','D'),
('VIS23', '2019-07-04 18:31:00','EVORA','B'),
('VIS24', '2019-07-06 18:40:00','EVORA','K'),
('VIS25', '2019-07-12 18:50:00','EVORA','G'),
('VIS26', '2019-07-15 18:00:00','PORTO','C'),
('VIS27', '2019-07-18 18:00:00','PORTO','C'),
('VIS28', '2019-07-25 18:00:00','PORTO','B'),
('VIS29', '2019-07-30 18:00:00','PORTO','M');
And I want the top two per month. The expected result should be:
month product count
2019-05 A 2
2019-05 B 2
2019-06 A 5
2019-06 D 3
2019-07 C 2
2019-07 D 2
But I'm not quite sure how to group by month. Please, any help will be greatly appreciated.
First, you can use to_char(the_date,'YYYY-MM') to get the year and month in the right format.
Next, you can use count(*) to group by the month and product, and row_number() to give a sequence number to each row in the groups.
SELECT to_char(the_date,'YYYY-MM') as month,
the_product as product,
count(*) as p_count,
row_number() over (partition by to_char(the_date,'YYYY-MM') order by count(*) desc) as seq
FROM my_table
group by month, product
Last, you can wrap that in an outer query to select just the columns and rows that you want.
SELECT month, product, p_count as count
FROM (
SELECT to_char(the_date,'YYYY-MM') as month,
the_product as product,
count(*) as p_count,
row_number() over (partition by to_char(the_date,'YYYY-MM') order by count(*) desc) as seq
FROM my_table
group by month, product
) as foo
where foo.seq <= 2;
You can use aggregation and window functions:
select mp.*
from (select date_trunc('month', the_date) as yyyymm,
the_product, count(*) as cnt,
row_number() over (partition by date_trunc('month', the_date) order by count(*) desc) as seqnum
from my_table
group by yyyymm, the_product
) mp
where seqnum <= 2;
In PostgreSQL, I believe you can extract every part of the timestamp using the EXTRACT function.
e.g.:
SELECT the_date, EXTRACT(MONTH from the_date) as MONTH
the_date     | MONTH
'2019-08-05' | 08
That said, you can then group by product and month, and select the top 2:
SELECT EXTRACT(MONTH from the_date) as month, the_product, count (*) FROM my_table
GROUP BY EXTRACT(MONTH from the_date), the_product
ORDER BY count(*) DESC
LIMIT 2
There might be some optimization to do since I don't have a database to test the query on, but it might give you a good start.
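For completeness, another Postgres option not shown in these answers is a LATERAL join that takes the top 2 products per month directly; a rough, untested sketch:
SELECT m.month, t.the_product AS product, t.cnt AS count
FROM (SELECT DISTINCT to_char(the_date, 'YYYY-MM') AS month FROM my_table) m
CROSS JOIN LATERAL (
    SELECT the_product, count(*) AS cnt
    FROM my_table
    WHERE to_char(the_date, 'YYYY-MM') = m.month
    GROUP BY the_product
    ORDER BY count(*) DESC
    LIMIT 2
) t;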

Top N items in every month - BIGQUERY

I have the BigQuery query below:
WITH cte AS(
SELECT *
FROM (
SELECT project_name,
SUM(reward_value) AS total_reward_value,
DATE_TRUNC(date_signing, MONTH) as month,
date_signing,
Row_number() over (partition by DATE_TRUNC(date_signing, MONTH)
order by SUM(reward_value) desc) AS rank
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month, date_signing
)
)
SELECT * FROM cte WHERE rank <= 5
which returns a separate row for each project and signing date, rather than one summed row per project per month. What I expect is each unique project's reward_value summed within each month, and then only the top 5 per month kept.
I got the following error if the date_signing grouping is removed:
PARTITION BY expression references column date_signing which is neither grouped nor aggregated at [16:48]
Any hints what should be corrected will be appreciated!
One more subquery maybe then?
WITH cte AS(
SELECT project_name,
SUM(reward_value) as reward_sum,
DATE_TRUNC(date_signing, MONTH) as month
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month
),
ranks AS (
SELECT
project_name,
reward_sum,
month,
ROW_NUMBER() over (PARTITION BY month ORDER BY reward_sum DESC) AS rank
FROM cte
)
SELECT *
FROM ranks
WHERE rank <= 5
Yeah, you can't do that. You can show the last signing date instead:
WITH cte AS(
SELECT project_name,
SUM(reward_value) AS total_reward_value,
DATE_TRUNC(date_signing, MONTH) as month,
MAX(date_signing) as last_signing_date,
Row_number() over (partition by DATE_TRUNC(date_signing, MONTH)
order by SUM(reward_value) desc) AS rank
FROM `deals`
WHERE CAST(date_signing as DATE) > '2019-12-31'
AND CAST(date_signing as DATE) < '2020-02-01'
AND target_category = 'achieved'
AND project_name IS NOT NULL
GROUP BY project_name, month
)
SELECT * FROM cte WHERE rank <= 5
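As a side note, BigQuery also supports a QUALIFY clause, so the grouping and the top-5 filter can be written in one level; a hypothetical, untested variant of the same query:
SELECT project_name,
       SUM(reward_value) AS total_reward_value,
       DATE_TRUNC(date_signing, MONTH) AS month
FROM `deals`
WHERE CAST(date_signing AS DATE) > '2019-12-31'
  AND CAST(date_signing AS DATE) < '2020-02-01'
  AND target_category = 'achieved'
  AND project_name IS NOT NULL
GROUP BY project_name, month
QUALIFY ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(date_signing, MONTH)
                           ORDER BY SUM(reward_value) DESC) <= 5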

count consecutive records with timestamp interval requirement

Referring to this post: link, I used the answer provided by Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
row_number() over (partition by taxi order by time) as seqnum,
row_number() over (partition by taxi, client order by time) as seqnum_c
from t
) t
group by t.taxi, t.client, (seqnum - seqnum_c)
having count(*) >= 2
)
group by taxi;
and got my answer perfectly, like this:
Tom 3 (AA counts as 1, AAA counts as 1 and BB counts as 1, so a total of 3)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how to do it.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
sum(case when prev_client = client and
prev_time > time - interval '2 hour'
then 0
else 1
end) over (partition by taxi order by time) as grp
from (select t.*,
lag(client) over (partition by taxi order by time) as prev_client,
lag(time) over (partition by taxi order by time) as prev_time
from t
) t
) t
group by t.taxi, t.client, grp
having count(*) >= 2
)
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
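For example, since datediff was mentioned (which suggests something like SQL Server or Snowflake), a hypothetical SQL Server flavor of just the group-numbering subquery might look like this (untested, same column names as above):
select t.*,
       -- start a new group whenever the client changes or the gap exceeds 2 hours
       sum(case when prev_client = client
                 and prev_time > dateadd(hour, -2, time)
                then 0 else 1
           end) over (partition by taxi order by time) as grp
from (select t.*,
             lag(client) over (partition by taxi order by time) as prev_client,
             lag(time) over (partition by taxi order by time) as prev_time
      from t
     ) t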