redshift: count distinct customers over window partition - sql

Redshift doesn't support DISTINCT aggregates in its window functions. The AWS documentation for COUNT states this, and DISTINCT isn't supported for any of the window functions.
My use case: count customers over varying time intervals and traffic channels
I want monthly and YTD unique customer counts for the current year, both split by traffic channel and in total across all channels. Since a customer can visit more than once, I need to count only distinct customers, so the Redshift window aggregates won't help.
I can count distinct customers using count(distinct customer_id)...group by, but this will give me only a single result of the four needed.
I don't want to get into the habit of running a full query for each desired count and stitching them together with a stack of UNION ALLs. I hope that's not the only solution.
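For illustration, here's a sketch of that single grouped count against the same table; it yields only the month-by-channel grain, so I'd need three more queries like it for the other grains:
/* one grain only: distinct customers per month and channel */
select order_month
, traffic_channel
, count(distinct customer_id) as customers_by_channel_and_month
from orders_traffic_channels
where to_char(order_month, 'YYYY') = '2017'
group by order_month, traffic_channel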
This is what I would write in postgres (or Oracle for that matter):
select order_month
, traffic_channel
, count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
, count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
, count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
, count(distinct customer_id) over() as ytd_total_customers
from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */
where to_char(order_month, 'YYYY') = '2017'
How can I solve this in Redshift?
The result needs to work on a Redshift cluster. Furthermore, this is a simplified problem: the actual desired result also includes product category and customer type, which multiplies the number of partitions needed, so a stack of UNION ALL rollups is not a nice solution.

A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you Mark D. Adams. Strangely, there is very little else I could find on the web, so I'm sharing my (tested) solution.
The key insight is that dense_rank(), ordered by the item in question, provides the same rank to identical items, and therefore the highest rank is also the count of unique items. This is a horrible mess if you try to swap in the following for each partition I want:
dense_rank() over(partition by order_month, traffic_channel order by customer_id)
Since you need the highest rank, you have to subquery everything and take the max of each ranking. It's important to match the partitions in the outer query to the corresponding partitions in the subquery.
/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
order_month
, traffic_channel
, max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
, max(tc_rnk) over(partition by traffic_channel) ytd_customers_by_channel
, max(mth_rnk) over(partition by order_month) monthly_customers_all_channels
, max(cust_rnk) over() ytd_total_customers
from (
select order_month
, traffic_channel
, dense_rank() over(partition by order_month, traffic_channel order by customer_id) tc_mth_rnk
, dense_rank() over(partition by traffic_channel order by customer_id) tc_rnk
, dense_rank() over(partition by order_month order by customer_id) mth_rnk
, dense_rank() over(order by customer_id) cust_rnk
from orders_traffic_channels
where to_char(order_month, 'YYYY') = '2017'
) t
order by order_month, traffic_channel
;
notes
partitions of max() and dense_rank() must match
dense_rank() will rank null values (all at the same rank, the max). If you don't want to count null values, you need a case when customer_id is not null then dense_rank() ...etc... construct, or you can subtract one from the max() if you know there are nulls.
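For example, the case-expression variant of one of the inner rankings might look like the sketch below. Nulls sort last in ascending order, so any null customer_ids take the top rank, and the case expression turns that rank into null so the outer max() ignores it and returns the count of distinct non-null customers:
/* null customer_ids produce a null rank, which max() skips */
, case when customer_id is not null
       then dense_rank() over(partition by order_month, traffic_channel order by customer_id)
  end as tc_mth_rnk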
Update 2022
COUNT(DISTINCT ...) over partitions is still not implemented in Redshift.
I've concluded that this workaround is reasonable, as long as you keep the following in mind when incorporating it into production pipelines:
It creates a lot of code which can hurt readability and maintenance.
Isolate this process of counting by groups into one transform stage rather than mixing this with other logical concepts in the same query.
Using subqueries and non-partitioned groups with count(distinct ..) to get each of your distinct counts is even messier and less readable.
However, the better way is to use dataframe languages that support grouped rollups, like Spark or Pandas. Spark rollups by group are compact and readable; the tradeoff is bringing another execution environment and language into your flows.
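For what it's worth, in Spark SQL the four grains collapse into one grouped query with GROUPING SETS. Here's a sketch, assuming orders_traffic_channels is registered as a view:
select order_month
, traffic_channel
, count(distinct customer_id) as customers
from orders_traffic_channels
where year(order_month) = 2017
group by grouping sets ((order_month, traffic_channel), (traffic_channel), (order_month), ())
The rolled-up grains come back as rows with nulls in the rolled-up columns (e.g. traffic_channel is null on the monthly-all-channels rows).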

While Redshift doesn't support DISTINCT aggregates in its window functions, its LISTAGG window function does accept DISTINCT. So you can do this:
regexp_count(
    listagg(distinct customer_id, ',') over (partition by field2),
    ','
) + 1
Of course, if you have commas naturally occurring in your customer_id strings, you'll have to find a safe delimiter.
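Applied to the original question, one of the grains might look like this sketch (same delimiter caveat, and assuming customer_id casts cleanly to a string):
select distinct
  order_month
, traffic_channel
, regexp_count(listagg(distinct customer_id, ',')
      over (partition by order_month, traffic_channel), ',') + 1 as customers_by_channel_and_month
from orders_traffic_channels
where to_char(order_month, 'YYYY') = '2017'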

Another approach is to use row_number().
In the first (inner) select:
row_number() over (partition by customer_id, order_month, traffic_channel) as row_n_month_channel
and in the next (outer) select:
sum(case when row_n_month_channel = 1 then 1 else 0 end)
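Put together for one grain, that could look like this sketch against the question's table (the partition lists would change for the other grains):
select order_month
, traffic_channel
, sum(case when row_n_month_channel = 1 then 1 else 0 end) as customers_by_channel_and_month
from (
    select order_month
    , traffic_channel
    /* exactly one row per customer per month/channel gets flagged with 1 */
    , row_number() over (partition by customer_id, order_month, traffic_channel
                         order by order_month) as row_n_month_channel
    from orders_traffic_channels
    where to_char(order_month, 'YYYY') = '2017'
) t
group by order_month, traffic_channel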

Related

Within a Postgres query, can I break a column into quartiles for partitioning results?

I have a query that partitions data based on a specific column, but now I'm trying to partition it based on quartiles within my dataset. So, for example, say I have industries of "tech" and "retail"; if I break each down into quartiles, then for each industry there would be 4 additional partitions.
How can I incorporate this? Do I need to get the quartiles first and then pass them into the code below, or can I directly partition the revenue column into quartiles within my partition by line?
with data as (
select
g.ticker,
g.industry,
g.countryname,
g.exchange,
c.year,
c.revenue,
ROW_NUMBER() OVER (PARTITION BY g.industry ORDER BY c.revenue ASC) AS groupingNumRank,
AVG(c.revenue) over (PARTITION BY g.industry) as industavg,
... and so on
I may want to try other ways of splitting the data (maybe into deciles, percentiles, etc.); if that's possible as well, I'd be interested in learning how to do it.
You would apparently want:
select ntile(4) over (partition by g.industry order by c.revenue) as quartile
Note that ntile() makes sure the tiles are as equal in size as possible. This may result in two rows with the same revenue being in different tiles.
If you don't want this behavior, you can use rank() and arithmetic:
select ceiling( rank() over (partition by g.industry order by c.revenue) * 4.0 /
                count(*) over (partition by g.industry)
              ) as quartile
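The same pattern extends to other splits; for deciles or percentiles, just change the tile count:
select ntile(10)  over (partition by g.industry order by c.revenue) as decile,
       ntile(100) over (partition by g.industry order by c.revenue) as percentile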

How to make RANK() or ROW_NUMBER() work in Ingres/Vectorwise SQL?

I am writing a sql query to get sales for different stores on a given day.
The query is run against ingres/vectorwise.
I want to add a rank column with the ranking of each store by sales, compared to all the other stores.
My select statement is like follows:
SELECT store_number, sum(sales) as sales
FROM stores_sales_indicators
WHERE day = '2019-07-24'
GROUP BY store_number
I tried different things that I am familiar with from SQL Server, but none of them worked.
I think this is similar to what you're describing (no day included here but you'll get the idea):
declare global temporary table session.stores_sales_indicators
(
store_number integer not null,
sales integer not null
)
on commit preserve rows with norecovery, structure=x100;
insert into session.stores_sales_indicators
values(1,100),(1,200),(2,500),(2,50),(3,50),(3,300);
select
store_number,
sum(sales) as sales,
rank() over (order by sum(sales) desc) as rank
from session.stores_sales_indicators
group by store_number;
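Applied to the question's table with the day filter, a sketch (assuming the same column names) would be:
select
  store_number,
  sum(sales) as sales,
  rank() over (order by sum(sales) desc) as sales_rank
from stores_sales_indicators
where day = '2019-07-24'
group by store_number;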
See also the fine manual, here's a link to the section on analytic functions:
https://docs.actian.com/vector/5.1/index.html#page/SQLLang%2FAnalytical_Functions.htm

Sequence within a partition in SQL server

I have been looking around for 2 days and have not been able to figure this one out. Using the dataset below and SQL Server 2016, I would like to get the row number of each row by 'id' and 'cat', ordered by 'date' ascending, but I would like the sequence to reset when a different value in the 'cat' column appears for the same 'id' (see rows in green). Any help would be appreciated.
This is a gaps and islands problem. The simplest solution in this case is probably a difference of row numbers:
select t.*,
row_number() over (partition by id, cat, seqnum - seqnum_c order by date) as row_num
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
row_number() over (partition by id, cat order by date) as seqnum_c
from t
) t;
Why this works is a bit tricky to explain. But, if you look at the sequence numbers in the subquery, you'll see that the difference defines the groups you want to define.
Note: This assumes that the date column provides a stable sort. You seem to have duplicates in the column. If there really are duplicates and you have no secondary column for sorting, then try rank() or dense_rank() instead of row_number().
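A hypothetical illustration may help (the question's dataset was posted as an image, so these rows are made up). For a single id, ordered by date:
date        cat   seqnum   seqnum_c   seqnum - seqnum_c
2021-01-01  A     1        1          0
2021-01-02  A     2        2          0
2021-01-03  B     3        1          2
2021-01-04  B     4        2          2
2021-01-05  A     5        3          2
Partitioning by (id, cat, seqnum - seqnum_c) separates the first A run (difference 0), the B run (difference 2), and the second A run (also difference 2, but a different cat), and row_number() restarts at 1 within each island.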

Can we get total count and last record from PostgreSQL?

I have a table with 23 records. I am trying to get the total count of records and also the last record, in a single query. Something like this:
select count(*) ,(m order by createdDate) from music m ;
Is there any way to pull out only the last record as well as the total count in PostgreSQL?
This can be done using window functions:
select *
from (
select m.*,
row_number() over (order by createddate desc) as rn,
count(*) over () as total_count
from music m
) t
where rn = 1;
Another option would be to use a scalar sub-query and combine it with a limit clause:
select *,
(select count(*) from music) as total_count
from music
order by createddate desc
limit 1;
Depending on the indexes, your memory configuration, and the table definition, this might be faster than the two window functions.
No, it's not possible to do what is being asked; SQL does not function that way. The second you ask for a count(), SQL changes the level of your data to an aggregation. The only way to do what you are asking is to do a count() and an order by in a separate query.
Another solution using windowing functions and no subquery:
SELECT DISTINCT count(*) OVER w, last_value(m) OVER w
FROM music m
WINDOW w AS (ORDER BY createddate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
The point here is that last_value applies on partitions defined by windows and not on groups defined by GROUP BY.
I did not perform any tests, but I suspect my solution is the least efficient of the three already posted. However, it is also the closest to your example query so far.

I need the Top 10 results from a table

I need to get the Top 10 results for each Region, Market, and Name, along with those with the highest counts (Gaps). There are 4 Regions with 1 to N Markets. I can get the Top 10 but cannot figure out how to do this without using a Union for every Market. Any ideas on how to do this?
SELECT DISTINCT TOP 10
Region, Market, Name, Gaps
FROM
TableName
ORDER BY
Region, Market, Gaps DESC
One approach would be to use a CTE (Common Table Expression), if you're on SQL Server 2005 or newer (you aren't specific enough in that regard).
With this CTE, you can partition your data by some criteria - i.e. your Region, Market, Name - and have SQL Server number all your rows starting at 1 for each of those "partitions", ordered by some criteria.
So try something like this:
;WITH RegionsMarkets AS
(
SELECT
Region, Market, Name, Gaps,
RN = ROW_NUMBER() OVER(PARTITION BY Region, Market, Name ORDER BY Gaps DESC)
FROM
dbo.TableName
)
SELECT
Region, Market, Name, Gaps
FROM
RegionsMarkets
WHERE
RN <= 10
Here, I am selecting the first 10 entries for each "partition" (i.e. for each Region, Market, Name tuple), ordered by Gaps in a descending fashion.
With this, you get the top 10 rows for each (Region, Market, Name) tuple - does that come close to what you're looking for?
I think you want row_number():
select t.*
from (select t.*,
row_number() over (partition by region, market order by gaps desc) as seqnum
from tablename t
) t
where seqnum <= 10;
I am not sure if you want name in the partition by clause. If you have more than one name within a market, that may be what you are looking for. (Hint: Sample data and desired results can really help clarify a question.)
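If you do want a separate top 10 per name as well, the only change needed is adding name to the partition:
row_number() over (partition by region, market, name order by gaps desc) as seqnum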