SQL: selecting the max DENSE_RANK() value and then averaging

I would like to modify the query below so that it keeps only the highest VISIT_flag value per CUSTOMER_ID and TRANS_TO_DATE, and then averages those values by CUSTOMER_ID.
I'm having trouble figuring out how to take the maximum DENSE_RANK() value and then aggregate it with an average.
SELECT
    CUSTOMER_ID,
    TRANS_TO_DATE,
    DENSE_RANK() OVER (PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR - RN) VISIT_flag
FROM (
    SELECT
        CUSTOMER_ID,
        TRANS_TO_DATE,
        TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
        ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$'))) AS RN
    FROM mstr_clickstream
    GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
)
ORDER BY CUSTOMER_ID, TRANS_TO_DATE

Following your logic, to get the last VISIT_flag (that is, the last "visit" that occurred within a day), you must order descending inside the DENSE_RANK. Descending order solves the problem of finding the last visit, but you then cannot calculate the customer's average visits, because the VISIT_flag of the rows you keep is always 1. To work around this, declare a second DENSE_RANK with the same PARTITION BY and an ascending ORDER BY, which numbers the visits of the day so that you can calculate the average. The derived query is:
SELECT customer_id, AVG(quantify)
FROM (
    SELECT
        customer_id,
        trans_to_date,
        DENSE_RANK() OVER (PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR DESC, RN DESC, rownum) VISIT_flag,
        DENSE_RANK() OVER (PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR ASC, RN ASC, rownum) quantify
    FROM (
        SELECT
            CUSTOMER_ID,
            TRANS_TO_DATE,
            TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
            ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$'))) AS RN
        FROM mstr_clickstream
        GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
    )
)
WHERE VISIT_flag = 1
GROUP BY customer_id
To be honest, the above can be implemented more easily without DENSE_RANK. The DENSE_RANK query only makes sense if you drop the AVG calculation and the GROUP BY customer_id from the outer query because you want information about the last visit itself.
In any case, here is the easier way:
SELECT CUSTOMER_ID, AVG(cnt) avg_visits
FROM (
    SELECT CUSTOMER_ID, TRANS_TO_DATE, COUNT(*) cnt
    FROM (
        SELECT
            CUSTOMER_ID,
            TRANS_TO_DATE,
            TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
            ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$'))) AS RN
        FROM mstr_clickstream
        GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
    )
    GROUP BY CUSTOMER_ID, TRANS_TO_DATE
)
GROUP BY CUSTOMER_ID
P.S. I always include rownum in the DENSE_RANK ORDER BY clause to guard against the exceptional case (there is always one in the database :D) of two rows having the same transaction time. Such rows would get the same DENSE_RANK, which might be an issue for the application that consumes the query results.
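For comparison, here is a minimal sketch of what the question literally asks for: take the maximum VISIT_flag per customer and day, then average those maxima per customer. It is not the answerer's query. It runs on SQLite through Python's sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which is needed for window functions), the sample rows are made up, and hour is stored as a plain integer instead of being extracted with REGEXP_SUBSTR.

```python
import sqlite3

# Hypothetical sample data: (customer_id, trans_to_date, hour).
# Customer 1 has two separate visits on 2020-01-01 (hours 9-10 and hour 14)
# and one visit on 2020-01-02; customer 2 has a single visit spanning hours 1-3.
sample = [
    (1, "2020-01-01", 9), (1, "2020-01-01", 10), (1, "2020-01-01", 14),
    (1, "2020-01-02", 8),
    (2, "2020-01-01", 1), (2, "2020-01-01", 2), (2, "2020-01-01", 2), (2, "2020-01-01", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mstr_clickstream (customer_id INTEGER, trans_to_date TEXT, hour INTEGER)"
)
conn.executemany("INSERT INTO mstr_clickstream VALUES (?, ?, ?)", sample)

query = """
WITH hours AS (            -- one row per customer, day and hour
    SELECT DISTINCT customer_id, trans_to_date, hour
    FROM mstr_clickstream
),
numbered AS (
    SELECT customer_id, trans_to_date, hour,
           ROW_NUMBER() OVER (PARTITION BY customer_id, trans_to_date ORDER BY hour) AS rn
    FROM hours
),
flagged AS (               -- consecutive hours share the same hour - rn value
    SELECT customer_id, trans_to_date,
           DENSE_RANK() OVER (PARTITION BY customer_id, trans_to_date ORDER BY hour - rn) AS visit_flag
    FROM numbered
),
per_day AS (               -- the highest flag of a day is its number of visits
    SELECT customer_id, trans_to_date, MAX(visit_flag) AS visits
    FROM flagged
    GROUP BY customer_id, trans_to_date
)
SELECT customer_id, AVG(visits) AS avg_visits
FROM per_day
GROUP BY customer_id
ORDER BY customer_id
"""

avg_visits = conn.execute(query).fetchall()
print(avg_visits)  # [(1, 1.5), (2, 1.0)]
```

The same shape should carry over to Oracle by nesting the aggregation (MAX per day inside, AVG per customer outside) around the original query.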

Related

Redshift - Group Table based on consecutive rows

I am working right now with this table:
What I want to do is clean up this table a little by grouping consecutive rows together.
Is there any way to achieve this kind of result?
The first table is already working fine, I just want to get rid of some rows to free some disk space.
One method is to peek at the previous row to see when the value changes. Assuming that valid_to and valid_from are really dates:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
             sum(case when prev_valid_to >= valid_from + interval '-1 day' then 0 else 1 end)
                 over (partition by id order by valid_to rows between unbounded preceding and current row) as grp
      from (select t.*,
                   lag(valid_to) over (partition by id, class order by valid_to) as prev_valid_to
            from t
           ) t
     ) t
group by id, class, grp;
If they are not dates, then this gets trickier. You could convert to dates. Or, you could use the difference of row_numbers:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
row_number() over (partition by id order by valid_from) as seqnum,
row_number() over (partition by id, class order by valid_from) as seqnum_2
from t
) t
group by id, class, (seqnum - seqnum_2)
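To see the first approach (LAG plus a running sum) in action, here is a small sketch run on SQLite from Python. The table contents are invented, dates are stored as ISO-8601 text, and SQLite's DATE(valid_from, '-1 day') stands in for the interval arithmetic, so treat it as an illustration of the pattern rather than Redshift-ready code.

```python
import sqlite3

# Hypothetical rows: (id, class, valid_from, valid_to), dates as ISO-8601 text.
# The first two 'A' rows are adjacent; the third 'A' row starts after a gap.
sample = [
    (1, "A", "2020-01-01", "2020-01-05"),
    (1, "A", "2020-01-06", "2020-01-10"),
    (1, "A", "2020-02-01", "2020-02-10"),
    (1, "B", "2020-02-11", "2020-02-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, class TEXT, valid_from TEXT, valid_to TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", sample)

query = """
SELECT id, class, MIN(valid_from), MAX(valid_to)
FROM (
    SELECT p.*,
           -- a row starts a new group unless the previous row of the same
           -- class ended no earlier than the day before this row begins
           SUM(CASE WHEN prev_valid_to >= DATE(valid_from, '-1 day') THEN 0 ELSE 1 END)
               OVER (PARTITION BY id ORDER BY valid_to
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
    FROM (
        SELECT t.*,
               LAG(valid_to) OVER (PARTITION BY id, class ORDER BY valid_to) AS prev_valid_to
        FROM t
    ) AS p
) AS g
GROUP BY id, class, grp
ORDER BY id, MIN(valid_from)
"""

merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
# (1, 'A', '2020-01-01', '2020-01-10')
# (1, 'A', '2020-02-01', '2020-02-10')
# (1, 'B', '2020-02-11', '2020-02-20')
```

The two adjacent 'A' rows collapse into one, while the 'A' row after the gap stays separate.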

Why do all the rank numbers become 1 when using a window function in a subquery

I have a sessions table with traffic_id, date, start_time, session_id, page, platform, page_views, revenue, segment_id, and customer_id columns. Each customer_id can have multiple session_id values with different revenue/date/start_time/page/platform/page_views/segment_id values. Sample data is shown below.
traffic_id|date|start_time|session_id|page|platform|page_views|revenue|segment_id|customer_id
303|1/1/2017|05:23:33|123457080|homepage|mobile|581|37.40|1|310559
I would like to know the max session revenue per customer and the session sequence number as the table shown below.
Customer_id|Date|Maximum_session_revenue|Session_id|Session_Sequence
138858|1/13/17|100.44|123458749|5
I thought I could just use a subquery to do the job. But all the ranking values are 1 and session_id and date are wrong. Please help!
SELECT max(revenue),customer_id, date, session_id, session_sequence
FROM (
SELECT
revenue,
date,
customer_id,
session_id,
RANK() OVER(partition by customer_id ORDER BY date,start_time ASC) AS session_sequence
FROM sessions
) AS a
group by customer_id
;
Your query should generate an error because the GROUP BY columns and SELECT columns are inconsistent.
Presumably you want the maximum revenue and the sequence number where that occurs.
SELECT s.*
FROM (SELECT s.*,
             RANK() OVER (PARTITION BY customer_id ORDER BY date, start_time ASC) AS session_sequence,
             MAX(revenue) OVER (PARTITION BY customer_id) AS max_revenue
      FROM sessions s
     ) s
WHERE revenue = max_revenue;
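As a quick check of this pattern, the sketch below runs it on SQLite from Python with a few invented sessions. One assumption to note: the dates here are ISO-8601 text so that they sort chronologically, whereas values like 1/13/17 stored as text would not.

```python
import sqlite3

# Hypothetical sessions: (customer_id, session_id, date, start_time, revenue).
sample = [
    (1, 101, "2017-01-01", "05:00:00", 10.0),
    (1, 102, "2017-01-02", "06:00:00", 30.0),
    (1, 103, "2017-01-03", "07:00:00", 20.0),
    (2, 201, "2017-01-01", "08:00:00", 5.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (customer_id INTEGER, session_id INTEGER, "
    "date TEXT, start_time TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?, ?)", sample)

query = """
SELECT customer_id, date, revenue, session_id, session_sequence
FROM (
    SELECT s.*,
           -- sequence number of each session within its customer
           RANK() OVER (PARTITION BY customer_id ORDER BY date, start_time) AS session_sequence,
           -- the customer's best revenue, repeated on every row
           MAX(revenue) OVER (PARTITION BY customer_id) AS max_revenue
    FROM sessions s
) AS r
WHERE revenue = max_revenue
ORDER BY customer_id
"""

best_sessions = conn.execute(query).fetchall()
print(best_sessions)
# [(1, '2017-01-02', 30.0, 102, 2), (2, '2017-01-01', 5.0, 201, 1)]
```

Because both window functions are computed before any filtering, the sequence number still reflects the session's position among all of the customer's sessions. If two sessions tie for the maximum revenue, both rows are returned.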

How to calculate the median in Postgres?

I have created a basic database (picture attached), and I am trying to find the following:
"Median total amount spent per user in each calendar month"
I tried the following, but getting errors:
SELECT
user_id,
AVG(total_per_user)
FROM (SELECT user_id,
ROW_NUMBER() over (ORDER BY total_per_user DESC) AS desc_total,
ROW_NUMBER() over (ORDER BY total_per_user ASC) AS asc_total
FROM (SELECT EXTRACT(MONTH FROM created_at) AS calendar_month,
user_id,
SUM(amount) AS total_per_user
FROM transactions
GROUP BY calendar_month, user_id) AS total_amount
ORDER BY user_id) AS a
WHERE asc_total IN (desc_total, desc_total+1, desc_total-1)
GROUP BY user_id
;
In Postgres, you could just use aggregate function percentile_cont():
select
user_id,
percentile_cont(0.5) within group(order by total_per_user) median_total_per_user
from (
select user_id, sum(amount) total_per_user
from transactions
group by date_trunc('month', created_at), user_id
) t
group by user_id
Note that date_trunc() is probably closer to what you want than extract(month from ...) - unless you do want to sum amounts of the same month for different years together, which is not how I understood your requirement.
Just use percentile_cont(). I don't fully understand the question. If you want the median of the monthly spending, then:
SELECT user_id,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY total_per_user) AS median_total_per_user
FROM (SELECT DATE_TRUNC('month', created_at) AS calendar_month,
             user_id, SUM(amount) AS total_per_user
      FROM transactions t
      GROUP BY calendar_month, user_id
     ) um
GROUP BY user_id;
There is a built-in function for median. No need for fancier processing.
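To make the two-step logic concrete (sum per user and month, then take the median of those monthly totals per user), here is a plain-Python sketch on invented transactions. It only mirrors what the queries above compute: statistics.median averages the two middle values for an even count, which is also what percentile_cont(0.5) yields through linear interpolation.

```python
from collections import defaultdict
from statistics import median

# Hypothetical transactions: (user_id, created_at, amount).
transactions = [
    (1, "2020-01-05", 10.0), (1, "2020-01-20", 20.0),
    (1, "2020-02-03", 50.0),
    (1, "2020-03-09", 40.0),
    (2, "2020-01-11", 5.0),
    (2, "2020-02-14", 15.0),
]

# Step 1: total per (user, calendar month). Keeping 'YYYY-MM' retains the
# year, in the spirit of date_trunc('month', created_at).
monthly_totals = defaultdict(float)
for user_id, created_at, amount in transactions:
    month = created_at[:7]
    monthly_totals[(user_id, month)] += amount

# Step 2: collect each user's monthly totals and take their median.
totals_by_user = defaultdict(list)
for (user_id, _month), total in monthly_totals.items():
    totals_by_user[user_id].append(total)

medians = {user_id: median(totals) for user_id, totals in totals_by_user.items()}
print(medians)  # {1: 40.0, 2: 10.0}
```

User 1 has monthly totals of 30, 50 and 40, so the median is 40; user 2 has 5 and 15, so the median is their average, 10.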

Simplify sql query(date to period)

I have this table:
initial table
I need to get a table like this:
Target table
What I used:
Select ID_CLIENT, BALANCE, HIST_DT as Start_dt, isnull(lead(hist_dt) over (partition by id_client order by hist_dt asc), '2999.12.31') as End_dt
from (
select ID_CLIENT, ID_STATUS, balance, hist_dt, lag(id_status) over (partition by id_client order by id_status) as Prev_ID_status
from Client_History) a
where a.ID_STATUS = a.Prev_ID_status or a.ID_STATUS = 1
order by ID_CLIENT, HIST_DT
I think it's very complicated. I will be glad to hear any suggestions for simplifying this query.
This is a gaps-and-islands problem, most easily solved with the difference of row numbers and aggregation:
select id_client, balance, min(hist_dt), max(hist_dt)
from (select ch.*,
             row_number() over (partition by id_client order by hist_dt) as seqnum,
             row_number() over (partition by id_client, balance order by hist_dt) as seqnum_2
      from client_history ch
     ) ch
group by id_client, balance, (seqnum - seqnum_2)
order by id_client, min(hist_dt);
Why this works is a little tricky to describe. But if you look at the results of the subquery, you will see how the difference captures adjacent rows with the same balance.
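Here is a small sketch that prints the subquery's intermediate values, run on SQLite from Python. The rows are invented, and the date column is assumed to be hist_dt as in the question.

```python
import sqlite3

# Hypothetical history for one client: (id_client, balance, hist_dt).
sample = [
    (1, 100, "2020-01-01"),
    (1, 100, "2020-01-02"),
    (1, 200, "2020-01-03"),
    (1, 100, "2020-01-04"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client_history (id_client INTEGER, balance INTEGER, hist_dt TEXT)")
conn.executemany("INSERT INTO client_history VALUES (?, ?, ?)", sample)

# The inner query from the answer: two row numberings over different partitions.
numbered = """
SELECT ch.*,
       ROW_NUMBER() OVER (PARTITION BY id_client ORDER BY hist_dt) AS seqnum,
       ROW_NUMBER() OVER (PARTITION BY id_client, balance ORDER BY hist_dt) AS seqnum_2
FROM client_history ch
"""

# Intermediate view: the difference is constant within a run of equal balances.
steps = conn.execute(
    f"SELECT hist_dt, balance, seqnum, seqnum_2, seqnum - seqnum_2 AS diff "
    f"FROM ({numbered}) AS n ORDER BY hist_dt"
).fetchall()
for row in steps:
    print(row)
# ('2020-01-01', 100, 1, 1, 0)
# ('2020-01-02', 100, 2, 2, 0)
# ('2020-01-03', 200, 3, 1, 2)
# ('2020-01-04', 100, 4, 3, 1)

# Grouping by (balance, diff) then yields one row per run.
islands = conn.execute(
    f"SELECT balance, MIN(hist_dt), MAX(hist_dt) FROM ({numbered}) AS n "
    f"GROUP BY id_client, balance, seqnum - seqnum_2 ORDER BY MIN(hist_dt)"
).fetchall()
print(islands)
# [(100, '2020-01-01', '2020-01-02'), (200, '2020-01-03', '2020-01-03'), (100, '2020-01-04', '2020-01-04')]
```

The balance of 100 appears in two separate runs, and the differing diff values (0 and 1) keep them apart even though the balance is the same.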

Retrieve the most recent 5 days of forecasts for each city with the latest issue date

I need to retrieve the most recent 5 days of forecast info for each city.
My table looks like below
The real problem is with the issue date.
A city may have several forecasts for the same forecast date, each with a different issue date.
I need to retrieve the 5 most recent forecast dates for each city, keeping only the row with the latest issue date for each forecast date.
I have tried something like the query below, but it does not give the expected result:
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID ORDER BY FORECAST_DATE DESC, ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
GROUP BY FORECAST_DATE
) WHERE rn <= 5
Any suggestion or advice will be helpful
This will get the latest issued forecast per day over the most recent 5 days for each city:
SELECT *
FROM (
SELECT f.*,
DENSE_RANK() OVER ( PARTITION BY city_id ORDER BY forecast_date DESC )
AS forecast_rank,
ROW_NUMBER() OVER ( PARTITION BY city_id, forecast_date ORDER BY issue_date DESC )
AS issue_rn
FROM Forecast f
)
WHERE forecast_rank <= 5
AND issue_rn = 1;
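The query above can be tried out on SQLite from Python as sketched here. The data is invented: one city with six forecast dates, two of which were issued twice.

```python
import sqlite3

# Hypothetical forecasts: (city_id, forecast_date, issue_date).
# Every date 2020-01-01 .. 2020-01-06 was issued on 2019-12-30;
# the two latest dates were re-issued on 2019-12-31.
sample = [(1, f"2020-01-0{d}", "2019-12-30") for d in range(1, 7)]
sample += [(1, "2020-01-06", "2019-12-31"), (1, "2020-01-05", "2019-12-31")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forecast (city_id INTEGER, forecast_date TEXT, issue_date TEXT)")
conn.executemany("INSERT INTO forecast VALUES (?, ?, ?)", sample)

query = """
SELECT city_id, forecast_date, issue_date
FROM (
    SELECT f.*,
           -- rank of the forecast date within the city (ties share a rank)
           DENSE_RANK() OVER (PARTITION BY city_id ORDER BY forecast_date DESC) AS forecast_rank,
           -- newest issue first within each city and forecast date
           ROW_NUMBER() OVER (PARTITION BY city_id, forecast_date ORDER BY issue_date DESC) AS issue_rn
    FROM forecast f
) AS ranked
WHERE forecast_rank <= 5
  AND issue_rn = 1
ORDER BY city_id, forecast_date DESC
"""

latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
# (1, '2020-01-06', '2019-12-31')
# (1, '2020-01-05', '2019-12-31')
# (1, '2020-01-04', '2019-12-30')
# (1, '2020-01-03', '2019-12-30')
# (1, '2020-01-02', '2019-12-30')
```

DENSE_RANK is what keeps the re-issued rows from using up two of the five slots: both issues of a forecast date share one rank, and ROW_NUMBER then picks the newest issue.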
Partition by works like group by but for the function only.
Try
with CTE as
(
select t1.*,
row_number() over (partition by city_id, forecast_date order by issue_date desc) as r_ord
from Forecast t1
)
select CTE.*
from CTE
where r_ord <= 5
Try this
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID, FORECAST_DATE order by ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
) WHERE rn <= 5