I have a dataset with three columns, and I need to group by type while keeping each run of consecutive rows as its own small group, ordered by date:
Expected output:
This is a gaps-and-islands problem, most easily solved with the difference of row numbers:
select type, count(*), min(date_status), max(date_status)
from (select t.*,
row_number() over (order by date_status) as seqnum,
row_number() over (partition by type order by date_status) as seqnum_t
from t
) t
group by type, (seqnum - seqnum_t)
order by min(date_status);
Why this works is a little tricky to explain. I find that if someone looks at the results of the subquery, that person will usually see how the difference of the two row number columns identifies groups of adjacent types.
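For anyone who would rather trace it than read the subquery output, here is a minimal Python sketch of the same logic. The sample rows and the helper names (island_keys, summarize) are made up for illustration; only the seqnum / seqnum_t idea comes from the query above.

```python
from collections import defaultdict

# Hypothetical sample rows (type, date_status), already sorted by date_status.
rows = [
    ("A", "2020-01-01"),
    ("A", "2020-01-02"),
    ("B", "2020-01-03"),
    ("B", "2020-01-04"),
    ("A", "2020-01-05"),
]

def island_keys(rows):
    """Return (type, seqnum - seqnum_t) for each row, mirroring the subquery."""
    per_type = defaultdict(int)
    keys = []
    for seqnum, (type_, _date) in enumerate(rows, start=1):
        per_type[type_] += 1            # row_number() over (partition by type ...)
        seqnum_t = per_type[type_]
        keys.append((type_, seqnum - seqnum_t))
    return keys

def summarize(rows):
    """Group by the island key: (type, count, min date, max date) per island."""
    islands = {}
    for key, (_type, date) in zip(island_keys(rows), rows):
        islands.setdefault(key, []).append(date)
    result = [(key[0], len(d), min(d), max(d)) for key, d in islands.items()]
    return sorted(result, key=lambda r: r[2])   # order by min(date_status)

keys = island_keys(rows)
summary = summarize(rows)
print(keys)     # [('A', 0), ('A', 0), ('B', 2), ('B', 2), ('A', 2)]
print(summary)
```

Note that the B rows and the last A row share the difference 2, which is why the query groups by type as well as by the difference.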
Related
I have a table with a datetime field ("time") and an int field ("index")
Please see the query and the picture below. I want ROW_NUMBER to count from 1 when the index changes, also if the index value exists in previous rows. The red text indicates the output that I want to get from the query. How can I modify the query to give me the expected results?
The query:
select rv.[time], rv.[index], ROW_NUMBER() OVER(PARTITION BY rv.[index] ORDER BY rv.[time], rv.[index] ASC) AS Row#
from
tbl rv
This is a gaps-and-islands problem. You need to identify groups of adjacent rows. In this case, I think the simplest method is the difference of row numbers:
select rv.*,
row_number() over (partition by index, (seqnum - seqnum_2) order by time) as row_num
from (select t.*,
row_number() over (order by time) as seqnum,
row_number() over (partition by index order by time) as seqnum_2
from tbl t
) rv;
Why this works is a little tricky to explain. If you look at the results of the subquery, you will see how the difference between the two row number values identifies adjacent values that are the same.
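As a rough illustration of the restart behaviour, the sketch below replays the query in Python on made-up data. The function name restart_row_numbers and the sample values are assumptions, not part of the original question.

```python
from collections import defaultdict

def restart_row_numbers(rows):
    """rows: list of (time, index) tuples sorted by time.

    Returns one row number per row, restarting whenever the index changes.
    """
    per_index = defaultdict(int)    # seqnum_2: row number within each index
    per_island = defaultdict(int)   # outer row_number() per (index, difference)
    out = []
    for seqnum, (_time, index) in enumerate(rows, start=1):
        per_index[index] += 1
        island = (index, seqnum - per_index[index])
        per_island[island] += 1
        out.append(per_island[island])
    return out

# Hypothetical data: index goes 1, 1, 2, 2 and then returns to 1.
sample = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 1), (6, 1), (7, 1)]
row_nums = restart_row_numbers(sample)
print(row_nums)  # [1, 2, 1, 2, 1, 2, 3]
```

The second run of index 1 starts again from 1 because its difference (2) differs from that of the first run (0).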
Also, you should not use names like time and index for columns, because these are keywords in SQL. I have not escaped the names in the above query. I encourage you to give your columns and tables names that do not need to be escaped.
I have a table as shown below
What I would like to do is get the minimum of each subject. Though I am able to do this with row_number function, I would like to do this with groupby and min() approach. But it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together. They help differentiate the groups. But why am I not able to use the other columns in the select clause? If I use the other columns, I may not get the expected output because time_1 has different values.
I expect my output to be as shown below
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After you aggregate the rows into an array, you want the first element. The .* expands the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
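To make the "keep the whole record with the smallest value" idea concrete, here is a small Python analogue. It is only a sketch: the rows and the function name min_row_per_subject are invented, and ties keep the first row seen, whereas BigQuery may pick either.

```python
def min_row_per_subject(rows):
    """Keep, for each subject_id, the full row with the smallest value."""
    best = {}
    for row in rows:
        key = row["subject_id"]
        if key not in best or row["value"] < best[key]["value"]:
            best[key] = row
    return list(best.values())

# Hypothetical rows; time_1 differs within a subject, as in the question.
rows = [
    {"subject_id": 1, "id": 10, "value": 5, "time_1": "09:00"},
    {"subject_id": 1, "id": 10, "value": 3, "time_1": "10:00"},
    {"subject_id": 2, "id": 20, "value": 7, "time_1": "09:30"},
    {"subject_id": 2, "id": 20, "value": 9, "time_1": "11:00"},
]
result = min_row_per_subject(rows)
for r in result:
    print(r)
```

Because the whole row is carried along, time_1 comes from the same row as the minimum value, which a plain GROUP BY with min(value) cannot guarantee.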
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like this?
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the Time_1 column, the following query will work (as far as I can see, the values in min_time and max_time are the same within each group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach, if it suits your data, is to apply something like CAST(Time_1 AS DATE) to your time column. This considers only the date part and ignores the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check whether the CAST ... AS DATE syntax in BigQuery
-- matches what I have written here or differs slightly.
Below is for BigQuery Standard SQL and is the most efficient way to handle cases like the one in your question
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self-join is also a very inefficient way of achieving your objective
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id
I have been looking around for 2 days and have not been able to figure this one out. Using the dataset below and SQL Server 2016, I would like to get the row number of each row by 'id' and 'cat', ordered by 'date' in ascending order, but I would like the sequence to reset when a different value in the 'cat' column is found for the same 'id' (see rows in green). Any help would be appreciated.
This is a gaps and islands problem. The simplest solution in this case is probably a difference of row numbers:
select t.*,
row_number() over (partition by id, cat, seqnum - seqnum_c order by date) as row_num
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
row_number() over (partition by id, cat order by date) as seqnum_c
from t
) t;
Why this works is a bit tricky to explain. But, if you look at the sequence numbers in the subquery, you'll see that the difference defines the groups you want to define.
Note: This assumes that the date column provides a stable sort. You seem to have duplicates in the column. If there really are duplicates and you have no secondary column for sorting, then try rank() or dense_rank() instead of row_number().
I have a requirement to group rows based on the row_number of each group. Please see the image.
SQL queries represent unordered sets. So, the distinction between the two groups for 47641 is undefined.
You can define a query that will assign a group that has exactly one fiberid for each scname. When there are multiples, the assignment is arbitrary.
To do so, you can use dense_rank():
select t.*,
(dense_rank() over (order by scname) - 1 +
row_number() over (partition by scname, fiberid order by fiberid)
) as grp
from t;
If you do have an ordering for the rows then a more stable assignment can be calculated.
Redshift doesn't support DISTINCT aggregates in its window functions. AWS documentation for COUNT states this, and distinct isn't supported for any of the window functions.
My use case: count customers over varying time intervals and traffic channels
I desire monthly and YTD unique customer counts for the current year, and also split by traffic channel as well as total for all channels. Since a customer can visit more than once I need to count only distinct customers, and therefore the Redshift window aggregates won't help.
I can count distinct customers using count(distinct customer_id)...group by, but this will give me only a single result of the four needed.
I don't want to get into the habit of running a full query for each desired count piled up between a bunch of union all. I hope this is not the only solution.
This is what I would write in postgres (or Oracle for that matter):
select order_month
, traffic_channel
, count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
, count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
, count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
, count(distinct customer_id) over() as ytd_total_customers
from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */
where to_char(order_month, 'YYYY') = '2017'
How can I solve this in Redshift?
The result needs to work on a Redshift cluster. Furthermore, this is a simplified problem: the actual desired result also has product category and customer type, which multiplies the number of partitions needed. Therefore a stack of union all rollups is not a nice solution.
A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you Mark D. Adams. Strangely, I could find very little about this on the web, so I'm sharing my (tested) solution.
The key insight is that dense_rank(), ordered by the item in question, provides the same rank to identical items, and therefore the highest rank is also the count of unique items. This is a horrible mess if you try to swap in the following for each partition I want:
dense_rank() over(partition by order_month, traffic_channel order by customer_id)
Since you need the highest rank, you have to subquery everything and select the max value from each ranking taken. It's important to match the partitions in the outer query to the corresponding partition in the subquery.
/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
order_month
, traffic_channel
, max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
, max(tc_rnk) over(partition by traffic_channel) ytd_customers_by_channel
, max(mth_rnk) over(partition by order_month) monthly_customers_all_channels
, max(cust_rnk) over() ytd_total_customers
from (
select order_month
, traffic_channel
, dense_rank() over(partition by order_month, traffic_channel order by customer_id) tc_mth_rnk
, dense_rank() over(partition by traffic_channel order by customer_id) tc_rnk
, dense_rank() over(partition by order_month order by customer_id) mth_rnk
, dense_rank() over(order by customer_id) cust_rnk
from orders_traffic_channels
where to_char(order_month, 'YYYY') = '2017'
)
order by order_month, traffic_channel
;
Notes:
partitions of max() and dense_rank() must match
dense_rank() will rank null values (all at the same rank, the max). If you want to not count null values you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls.
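The claim that the highest dense rank equals the distinct count can be checked with a short Python sketch. Everything here is illustrative: the orders list is made-up data, dense_ranks and distinct_counts are hypothetical helpers, and the sketch assumes there are no null customer ids.

```python
from collections import defaultdict

def dense_ranks(values):
    """dense_rank() over (order by value): equal values share a rank, no gaps."""
    ranks = {}
    rank = 0
    previous = object()             # sentinel that equals no real value
    for v in sorted(values):
        if v != previous:
            rank += 1
            previous = v
        ranks[v] = rank
    return [ranks[v] for v in values]

def distinct_counts(orders, key):
    """max(dense_rank) per partition, where key(row) names the partition."""
    partitions = defaultdict(list)
    for row in orders:
        partitions[key(row)].append(row[2])     # customer_id
    return {k: max(dense_ranks(ids)) for k, ids in partitions.items()}

# Hypothetical rows: (order_month, traffic_channel, customer_id)
orders = [
    ("2017-01", "web", 101),
    ("2017-01", "web", 101),
    ("2017-01", "email", 102),
    ("2017-02", "web", 103),
    ("2017-02", "web", 101),
    ("2017-02", "email", 102),
]

by_channel = distinct_counts(orders, key=lambda r: r[1])
by_month = distinct_counts(orders, key=lambda r: r[0])
total = distinct_counts(orders, key=lambda r: ())
print(by_channel)  # {'web': 2, 'email': 1}
print(by_month)    # {'2017-01': 2, '2017-02': 3}
print(total)       # {(): 3}
```

Each grain needs its own ranking, which is why the SQL above carries one dense_rank() and one matching max() per partition.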
Update 2022
Count distinct over partitions in redshift is still not implemented.
I've concluded that this workaround is reasonable if you take care when incorporating it into production pipelines with these in mind:
It creates a lot of code which can hurt readability and maintenance.
Isolate this process of counting by groups into one transform stage rather than mixing this with other logical concepts in the same query.
Using subqueries and non-partitioned groups with count(distinct ..) to get each of your distinct counts is even messier and less readable.
However, the better way is to use dataframe languages that support grouped rollups like Spark or Pandas. Spark rollups by group are compact and readable, the tradeoff is bringing another execution environment and language into your flows.
While Redshift doesn't support DISTINCT aggregates in its window functions, it does have a listaggdistinct function. So you can do this:
regexp_count(
listaggdistinct(customer_id, ',') over (partition by field2),
','
) + 1
Of course, if you have , naturally occurring in your customer_id strings, you'll have to find a safe delimiter.
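The delimiter-counting trick, and the way it breaks, can be seen in a few lines of Python. This is a sketch of the arithmetic only; count_distinct_via_join is a hypothetical name and does not call anything in Redshift.

```python
import re

def count_distinct_via_join(values, delimiter=","):
    """Join the distinct values, then count delimiters and add one."""
    seen = []
    for v in values:
        if v not in seen:           # keep each value once, like a distinct list
            seen.append(v)
    joined = delimiter.join(seen)
    return len(re.findall(re.escape(delimiter), joined)) + 1

ok = count_distinct_via_join(["a", "b", "a", "c"])
# A value that contains the delimiter inflates the count: 2 distinct values, result 3.
bad = count_distinct_via_join(["x,y", "z", "x,y"])
safe = count_distinct_via_join(["x,y", "z", "x,y"], delimiter="|")
print(ok, bad, safe)  # 3 3 2
```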
Another approach is to use row_number(). In the first select:
row_number() over (partition by customer_id,order_month,traffic_channel) as row_n_month_channel
and in the next select:
sum(case when row_n_month_channel=1 then 1 else 0 end)
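Here is how I read that two-step idea, sketched in Python on made-up data. The function names and sample rows are assumptions; the point is only that flagging the first row per (customer, month, channel) and summing the flags gives the distinct count at that grain.

```python
from collections import defaultdict

def first_occurrence_flags(orders):
    """1 where row_number() over (partition by customer, month, channel) = 1."""
    seen = defaultdict(int)
    flags = []
    for month, channel, customer in orders:
        seen[(customer, month, channel)] += 1
        flags.append(1 if seen[(customer, month, channel)] == 1 else 0)
    return flags

def customers_by_month_channel(orders):
    """sum(case when row_n_month_channel = 1 then 1 else 0 end) per group."""
    totals = defaultdict(int)
    for (month, channel, _customer), flag in zip(orders, first_occurrence_flags(orders)):
        totals[(month, channel)] += flag
    return dict(totals)

# Hypothetical rows: (order_month, traffic_channel, customer_id)
orders = [
    ("2017-01", "web", 101),
    ("2017-01", "web", 101),
    ("2017-01", "email", 102),
    ("2017-02", "web", 103),
    ("2017-02", "web", 101),
    ("2017-02", "email", 102),
]
counts = customers_by_month_channel(orders)
print(counts)
```

Other grains (per channel only, per month only, overall) would each need their own row number partitioned by customer_id plus that grain's columns.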