Partition Function COUNT() OVER possible using DISTINCT - sql

I'm trying to write the following in order to get a running total of distinct NumUsers, like so:
NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])
Management Studio doesn't seem too happy about this. The error disappears when I remove the DISTINCT keyword, but then it won't be a distinct count.
DISTINCT does not appear to be possible within the partition functions.
How do I go about finding the distinct count? Do I use a more traditional method such as a correlated subquery?
Looking into this a bit further, maybe these OVER functions work differently from Oracle's, in that they cannot be used in SQL Server to calculate running totals.
I've added a live example here on SQLfiddle where I attempt to use a partition function to calculate a running total.

There is a very simple solution using dense_rank():
dense_rank() over (partition by [Mth] order by [UserAccountKey])
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc)
- 1
This will give you exactly what you were asking for: The number of distinct UserAccountKeys within each month.
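To make that concrete, here is a minimal end-to-end sketch (the table name [Activity] is a placeholder of mine; [Mth] and [UserAccountKey] are from the question):
SELECT [Mth],
       [UserAccountKey],
       dense_rank() over (partition by [Mth] order by [UserAccountKey])
     + dense_rank() over (partition by [Mth] order by [UserAccountKey] desc)
     - 1 AS NumUsers -- distinct [UserAccountKey] count, repeated on every row of the month
FROM [Activity];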

Necromancing:
It's relatively simple to emulate a COUNT(DISTINCT ...) OVER (PARTITION BY ...) with MAX via DENSE_RANK:
;WITH baseTable AS
(
SELECT 'RM1' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr
FROM baseTable
)
SELECT
RM
,ADR
,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1
,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2
-- Not supported
--,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
,MAX(CTE.dr) OVER (PARTITION BY CTE.RM ORDER BY CTE.RM) AS cntDistEmu
FROM CTE
Note:
This assumes the fields in question are NON-nullable fields.
If there are one or more NULL entries in the field, you need to subtract 1.
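A hedged sketch of that adjustment, reusing the CTE above (assuming ADR were nullable; the extra MAX(CASE ...) term is my addition):
,MAX(CTE.dr) OVER (PARTITION BY CTE.RM)
 - MAX(CASE WHEN CTE.ADR IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY CTE.RM)
 AS cntDistEmuNullSafe -- NULLs rank as one extra "distinct" value, so subtract 1 when any are present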

I use a solution that is similar to that of David above, but with an additional twist if some rows should be excluded from the count. This assumes that [UserAccountKey] is never null.
-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
partition by [Mth]
order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
)
+ dense_rank() over (
partition by [Mth]
order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1
An SQL Fiddle with an extended example can be found here.
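Spelled out as a full query, the idea might look like this sketch (the table name [Activity] is my placeholder; [Mth], [UserAccountKey] and [Include] are from the expression above):
select distinct [Mth],
       dense_rank() over (partition by [Mth]
                          order by case when [Include] = 'Y' then [UserAccountKey] end asc)
     + dense_rank() over (partition by [Mth]
                          order by case when [Include] = 'Y' then [UserAccountKey] end desc)
     - max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
     - 1 as NumIncludedUsers
from [Activity];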

I think the only way of doing this in SQL Server 2008 R2 is to use a correlated subquery or an OUTER APPLY:
SELECT datekey,
       COALESCE(RunningTotal, 0) AS RunningTotal,
       COALESCE(RunningCount, 0) AS RunningCount,
       COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM document
OUTER APPLY
(   SELECT SUM(Amount) AS RunningTotal,
        COUNT(1) AS RunningCount,
        COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
    FROM Document d2
    WHERE d2.DateKey <= document.DateKey
) rt;
This can be done in SQL Server 2012 using the syntax you have suggested:
SELECT datekey,
SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM document
However, use of DISTINCT is still not allowed, so if DISTINCT is required and/or if upgrading isn't an option, then I think OUTER APPLY is your best option.
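For what it's worth, on SQL Server 2012+ a running distinct count can also be emulated without DISTINCT, by flagging each DateKey's first occurrence and summing the flags (a sketch against the same Document table; the derived-table alias d is mine):
SELECT DateKey,
       -- the default RANGE frame includes peer rows, matching d2.DateKey <= document.DateKey
       SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (ORDER BY DateKey) AS RunningDistinctCount
FROM (
    SELECT DateKey,
           ROW_NUMBER() OVER (PARTITION BY DateKey ORDER BY DateKey) AS rn
    FROM Document
) d;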

There is a solution in simple SQL. The desired query
SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users
can be rewritten, counting each user at the time of their first appearance, as:
SELECT time, COUNT(*) OVER(ORDER BY time) AS users
FROM (
SELECT user, MIN(time) AS time
FROM users
GROUP BY user
) t
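Applied to the original question's schema, the same trick would look like this sketch ([Posts] is a hypothetical table name; it counts each [UserAccountKey] in the month it first appears):
SELECT [Mth], COUNT(*) OVER (ORDER BY [Mth]) AS RunningNumUsers
FROM (
    SELECT [UserAccountKey], MIN([Mth]) AS [Mth]
    FROM [Posts]
    GROUP BY [UserAccountKey]
) t;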

I wandered in here with essentially the same question as whytheq and found David's solution, but then had to review my old self-tutorial notes on DENSE_RANK because I use it so rarely: why DENSE_RANK instead of RANK or ROW_NUMBER, and how does it actually work? In the process, I updated that tutorial to include my version of David's solution to this particular problem, and then thought it might be helpful for SQL newbies (or others like me who forget stuff).
The whole tutorial text can be copy/pasted into a query editor, and then each example query can be (separately) uncommented and run to see its results. (By default, the solution to this problem is uncommented at the bottom.) Alternatively, each example can be copied into its own query-edit instance, but the TBLx CTE must be included with each.
--WITH /* DB2 version */
--TBLx (Col_A, Col_B) AS (VALUES
-- ( 7, 7 ),
-- ( 7, 7 ),
-- ( 7, 7 ),
-- ( 7, 8 ))
WITH /* SQL-Server version */
TBLx (Col_A, Col_B) AS
(SELECT 7, 7 UNION ALL
SELECT 7, 7 UNION ALL
SELECT 7, 7 UNION ALL
SELECT 7, 8)
/*** Example-A: demonstrates the difference between ROW_NUMBER, RANK and DENSE_RANK ***/
--SELECT Col_A, Col_B,
-- ROW_NUMBER() OVER(PARTITION BY Col_A ORDER BY Col_B) AS ROW_NUMBER_,
-- RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS RANK_,
-- DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DENSE_RANK_
--FROM TBLx
/* RESULTS:
Col_A Col_B ROW_NUMBER_ RANK_ DENSE_RANK_
7 7 1 1 1
7 7 2 1 1
7 7 3 1 1
7 8 4 4 2
ROW_NUMBER: Just increments for the three identical rows and increments again for the final unique row.
That is, it’s an order-value (based on "sort" order) but makes no other distinction.
RANK: Assigns the same rank value to the three identical rows, then jumps to 4 for the fourth row,
which is *unique* with regard to the others.
That is, each identical row is ranked by the rank-order of the first row-instance of that
(identical) value-set.
DENSE_RANK: Also assigns the same rank value to the three identical rows but the fourth *unique* row is
assigned a value of 2.
That is, DENSE_RANK identifies that there are (only) two *unique* row-types in the row set.
*/
/*** Example-B: to get only the distinct resulting "count-of-each-row-type" rows ***/
-- SELECT DISTINCT -- For unique returned "count-of-each-row-type" rows, the DISTINCT operator is necessary because
-- -- the calculated DENSE_RANK value is appended to *all* rows in the data set. Without DISTINCT,
-- -- its value for each original-data row-type would just be replicated for each of those rows.
--
-- Col_A, Col_B,
-- DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DISTINCT_ROWTYPE_COUNT_
-- FROM TBLx
/* RESULTS:
Col_A Col_B DISTINCT_ROWTYPE_COUNT_
7 7 1
7 8 2
*/
/*** Example-C.1: demonstrates the derivation of the "count-of-all-row-types" (finalized in Example-C.2, below) ***/
-- SELECT
-- Col_A, Col_B,
--
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC) AS ROW_TYPES_COUNT_DESC_,
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC) AS ROW_TYPES_COUNT_ASC_,
--
-- -- Adding the above cases together and subtracting one gives the same total count for each resulting row:
--
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
-- +
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
-- - 1 /* (Because DENSE_RANK values are one-based) */
-- AS ROW_TYPES_COUNT_
-- FROM TBLx
/* RESULTS:
COL_A COL_B ROW_TYPES_COUNT_DESC_ ROW_TYPES_COUNT_ASC_ ROW_TYPES_COUNT_
7 7 2 1 2
7 7 2 1 2
7 7 2 1 2
7 8 1 2 2
*/
/*** Example-C.2: uses the above technique to get a *single* resulting "count-of-all-row-types" row ***/
SELECT DISTINCT -- For a single returned "count-of-all-row-types" row, the DISTINCT operator is necessary because the
-- calculated DENSE_RANK value is appended to *all* rows in the data set. Without DISTINCT, that
-- value would just be replicated for each original-data row.
-- Col_A, Col_B, -- In order to get a *single* returned "count-of-all-row-types" row (and field), all other fields
-- must be excluded because their respective differing row-values will defeat the purpose of the
-- DISTINCT operator, above.
DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
+
DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
- 1 /* (Because DENSE_RANK values are one-based) */
AS ROW_TYPES_COUNT_
FROM TBLx
/* RESULTS:
ROW_TYPES_COUNT_
2
*/

Related

Is there a way to collapse ordered rows by terminal values with postgres window clause

I have a table foo:
some_fk   some_field   some_date_field
1         A            1990-01-01
1         B            1990-01-02
1         C            1990-03-01
1         X            1990-04-01
2         B            1990-01-01
2         B            1990-01-05
2         Z            1991-04-11
2         C            1992-01-01
2         B            1992-02-01
2         Y            1992-03-01
3         C            1990-01-01
some_field has 6 possible values: [A,B,C,X,Y,Z]
Where [A,B,C] signify opening or continuation events and [X,Y,Z] signify closing events. How do I get each span of time between the first opening event and closing event of each span, partitioned by some_fk, as shown in the table below:
some_fk   some_date_field_start   some_date_field_end
1         1990-01-01              1990-04-01
2         1990-01-01              1991-04-11
2         1992-01-01              1992-03-01
3         1990-01-01              NULL
Note that a non-terminated time span ends with NULL.
I do have a solution that involves 3 common table expressions, but I'm wondering if there is a (better/more elegant/canonical) way to do this in PostgreSQL without nested queries.
My approach was something like:
WITH ranked AS (
SELECT
RANK() OVER (PARTITION BY some_fk ORDER BY some_date_field) AS "rank",
some_fk,
some_field,
some_date_field
FROM foo
), openers AS (
SELECT * FROM ranked WHERE some_field IN ('A','B','C')
), closers AS (
SELECT
*,
LAG("rank") OVER (PARTITION BY some_fk ORDER BY "rank") AS rank_lag
FROM ranked WHERE some_field IN ('X','Y','Z')
)
SELECT DISTINCT
openers.some_fk,
FIRST_VALUE(openers.some_date_field) OVER (PARTITION BY some_fk ORDER BY "rank")
AS some_date_field_start,
closers.some_date_field AS some_date_field_end
FROM openers
JOIN closers
ON openers.some_fk = closers.some_fk
WHERE openers."rank" BETWEEN COALESCE(closers.rank_lag, 0) AND closers."rank"
... but I feel there must be a better way.
Thanks in advance for the help.
Another approach is to create a grouping ID by keeping a running sum of the closing events. Then, in an outer query, you can GROUP BY and pick the min() and max() dates.
Select some_fk, min(some_date_field) as some_date_field_start,
       max(some_date_field) as some_date_field_end
From (
    Select some_fk, some_date_field,
           Sum(Case When some_field in ('X','Y','Z') Then 1 Else 0 End)
               Over (Partition By some_fk Order By some_date_field
                     Rows Between Unbounded Preceding And 1 Preceding) as some_grouping
    From foo
) t
Group By some_fk, some_grouping
Order By some_fk, some_grouping
This seems a little simpler at least to me.
The basis of the query is to use LAG to determine if the previous record was a closure.
SELECT *,
LAG(some_field) OVER (PARTITION BY some_fk ORDER BY some_date_field) Previous_some_field
FROM foo
This allows you to filter down to the correct 4 records from your expected results, with the first 2 columns included; your mistake was to put the WHERE clause onto that query directly, when what you want to do is use it as-is in a sub-query and write the WHERE in the main query. From that point, you have several possibilities to finish the query.
Here is a version using a scalar subquery:
SELECT some_fk, some_date_field AS some_date_field_start,
(
SELECT MIN(some_date_field)
FROM foo
WHERE some_fk = F.some_fk AND some_date_field > F.some_date_field AND some_field IN ('X','Y','Z')
) AS some_date_field_end
FROM (
SELECT *,
LAG(some_field) OVER (PARTITION BY some_fk ORDER BY some_date_field) Previous_some_field
FROM foo
) F
WHERE some_field IN ('A','B','C')
AND COALESCE(previous_some_field,'Z') IN ('X','Y','Z')
Here is another version using a CROSS JOIN LATERAL:
SELECT some_fk, some_date_field AS some_date_field_start, some_date_field_end
FROM (
SELECT *,
LAG(some_field) OVER (PARTITION BY some_fk ORDER BY some_date_field) Previous_some_field
FROM foo
) F1
CROSS JOIN LATERAL (
SELECT MIN(some_date_field) AS some_date_field_end
FROM foo
WHERE some_fk = F1.some_fk AND some_date_field > F1.some_date_field AND some_field IN ('X','Y','Z')
) F2
WHERE some_field IN ('A','B','C')
AND COALESCE(previous_some_field,'Z') IN ('X','Y','Z')

BigQuery Standard SQL - Cumulative Count of (almost) Duplicated Rows

With the following data:
id   field   eventTime
1    A       1
1    A       2
1    B       3
1    A       4
1    B       5
1    B       6
1    B       7
For visualisation purposes, I would like to turn it into the below. Consecutive occurrences of the same field value essentially get aggregated into one.
id   field   eventTime
1    Ax2     1
1    B       3
1    A       4
1    Bx3     5
I will then use STRING_AGG() to turn it into "Ax2 > B > A > Bx3".
I've tried using ROW_NUMBER() to count the repeated instances, with the plan being to utilise the highest row number to modify the string in field, but if I partition on eventTime, there are no consecutive "duplicates", and if I don't partition on it then all rows with the same field value are counted - not just consecutive ones.
I thought about bringing in the previous field with LAG() for a comparison to reset the row count, but that only works for transitions from one field value to another and is a problem if the same field value is repeated consecutively.
I've been struggling with this to the point where I'm considering writing a script that just CASE WHENs up to a reasonable number of consecutive hits, but I've seen it get as high as 17 on a given day and really don't want to be doing that!
My other alternative will just be to enforce a maximum number of field values to help control this, but now I've started this problem I'd quite like to solve it without that, if at all possible.
Thanks!
Consider below
select id,
any_value(field) || if(count(1) = 1, '', 'x' || cast(count(1) as string)) field,
min(eventTime) eventTime
from (
select id, field, eventTime,
countif(ifnull(flag, true)) over(partition by id order by eventTime) grp
from (
select id, field, eventTime,
field != lag(field) over(partition by id order by eventTime) flag
from `project.dataset.table`
)
)
group by id, grp
# order by eventTime
If applied to the sample data in your question, the output matches your expected result (Ax2, B, A, Bx3).
Just use lag() to detect when the value of field changes. You can now do that with qualify:
select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field;
For your final step, you can use a subquery:
select id, string_agg(field, '->' order by eventtime)
from (select t.*
from t
where 1=1
qualify lag(field, 1, '') over (partition by id order by eventtime) <> field
) t
group by id;

SQL Ranking N records by one criteria and N records by another and repeat

In my table I have 4 columns: Id, Type, InitialRanking & FinalRanking. Based on certain criteria I've managed to apply InitialRanking to the records (1-20). I now need to apply FinalRanking by identifying the top 7 of Type 1 followed by the top 3 of Type 2. Then I need to repeat the above until all records have a FinalRanking. My goal would be to achieve the output in the final column of the attached image.
The 7 & 3 will vary over time but for the purposes of this example let’s say they are fixed.
You can try something like this:
SELECT * FROM (
    SELECT TOP 7 ID, TYPE, INITIALRANK, FINALRANK
    FROM [TABLE] WHERE TYPE = 1
    ORDER BY INITIALRANK
) T1
UNION ALL
SELECT * FROM (
    SELECT TOP 3 ID, TYPE, INITIALRANK, FINALRANK
    FROM [TABLE] WHERE TYPE = 2
    ORDER BY INITIALRANK
) T2
UNION ALL
SELECT ID, TYPE, INITIALRANK, FINALRANK
FROM [TABLE] WHERE TYPE NOT IN (1, 2)
A simple (or simplistic) approach to your Final Rank would be the following:
row_number() over (partition by type order by initrank) +
case type
when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(10-7)
when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(10-3)+7
end FinalRank
This can be generalized for more than 2 groups. For example, with three groups of size 7, 3 and 2, the pattern size is 7+3+2=12. The general form is PartitionedRowNum + (Ceil(PartitionedRowNum/GroupSize) - 1) * (PatternSize - GroupSize) + Offset, where the offset is the sum of the preceding group sizes. As a quick sanity check with the original two groups (pattern size 10): the 8th Type-1 row gets 8 + (ceil(8/7) - 1) * (10 - 7) = 11, landing right after the first block of 7 + 3:
row_number() over (partition by type order by initrank) +
case type
when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(12-7)
when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(12-3)+7
when 3 then (ceil((row_number() over (partition by type order by initrank))/2)-1)*(12-2)+7+3
end FinalRank

SQL group rows into pairs

I'm trying to add some sort of unique identifier (uid) to partitions made of pairs of rows, i.e. generate some uid/tag for each two rows of (identifier1,identifier2) in a window partition with size = 2 rows.
So, for example, the first 2 rows for ID X would get uid A, the next two rows for the same ID would get uid B and, if there is only one single row left in the partition for ID X, it would get id C.
Here's what I'm trying to accomplish; the picture illustrates the table's structure, and I manually added the expectedIdentifier to illustrate the goal:
This is my current SQL; ntile doesn't solve it because the partition size varies:
select
rowId
, ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
, *
from log;
I already tried ntile( (count(*) over partition...) / 2), but that doesn't work.
Generating the UID can be done with md5() or similar, but I'm having trouble tagging the rows as illustrated above (so that I can md5 the generated tag/uid).
While count(*) is not supported within a Snowflake window function, count(1) is supported and can be used to create the unique identifier. Below is an example of an integer unique ID matching pairs of rows and handling "odd" row groups:
select
ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
,ceil(count(1) over( partition by firstIdentifier, secondIdentifier order by timestamp asc) / 2) as id
, *
from log;
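If an opaque uid is preferred over the integer (as the md5() remark in the question suggests), one option is to hash the pair index; a sketch, assuming Snowflake's md5() function and the same log table:
select *,
       md5(firstIdentifier || '|' || secondIdentifier || '|' ||
           ceil(count(1) over (partition by firstIdentifier, secondIdentifier
                               order by timestamp) / 2)) as pair_uid
from log;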
select *,
       char(65 + (row_number() over (partition by firstidentifier, secondidentifier
                                     order by timestamp) - 1) / 2) as expectedidentifier
from log
order by firstidentifier, timestamp
Here is the SQL Server version:
with log (firstidentifier,secondidentifier, timestamp)
as (
select 15396, 14460, 1 union all
select 15396, 14460, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1
)
select *,
       char(65 + (row_number() over (partition by firstidentifier, secondidentifier
                                     order by timestamp) - 1) / 2) as expectedidentifier
from log
order by firstidentifier, secondidentifier, timestamp

Select independent distinct with one query

I need to select distinct values from multiple columns in an H2 database so I can have a list of suggestions for the user based on what is in the database. In other words, I need something like
SELECT DISTINCT a FROM table
SELECT DISTINCT b FROM table
SELECT DISTINCT c FROM table
in one query. In case I am not clear enough, I want a query that, given this table (columns ID, thing, other, stuff):
ID  thing  other  stuff
0   a      5      p
1   b      5      p
2   a      6      p
3   c      5      p
would result in something like this:
thing  other  stuff
a      5      p
b      6      -
c      -      -
where '-' is an empty entry.
This is a bit complicated, but you can do it as follows:
select max(thing) as thing, max(other) as other, max(stuff) as stuff
from ((select row_number() over (order by id) as seqnum,
              thing, NULL as other, NULL as stuff
       from (select thing, min(id) as id from t group by thing) t
      ) union all
      (select row_number() over (order by id) as seqnum,
              NULL, other, NULL
       from (select other, min(id) as id from t group by other) t
      ) union all
      (select row_number() over (order by id) as seqnum,
              NULL, NULL, stuff
       from (select stuff, min(id) as id from t group by stuff) t
      )
     ) t
group by seqnum
What this does is assign a sequence number to each distinct value in each column. It then combines these together into a single row for each sequence number. The combination uses the union all/group by approach. An alternative formulation uses full outer join.
This version uses the id column to keep the values in the same order as they appear in the original data.
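For completeness, the full outer join formulation mentioned above might look like this sketch (same table t; min(id) preserves the original ordering, as in the union all version):
select a.thing, b.other, c.stuff
from (select thing, row_number() over (order by min(id)) as seqnum
      from t group by thing) a
full outer join
     (select other, row_number() over (order by min(id)) as seqnum
      from t group by other) b
  on a.seqnum = b.seqnum
full outer join
     (select stuff, row_number() over (order by min(id)) as seqnum
      from t group by stuff) c
  on c.seqnum = coalesce(a.seqnum, b.seqnum)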
In H2 (which was not originally in the question), you can use the rownum() function instead (documented here). You may not be able to specify the ordering, however.