SQL group rows into pairs - sql

I'm trying to add some sort of unique identifier (uid) to partitions made of pairs of rows, i.e. generate some uid/tag for each two rows of (identifier1,identifier2) in a window partition with size = 2 rows.
So, for example, the first 2 rows for ID X would get uid A, the next two rows for the same ID would get uid B and, if there is only one single row left in the partition for ID X, it would get id C.
Here's what I'm trying to accomplish, the picture illustrates the table's structure, I manually added the expectedIdentifier to illustrate the goal:
This is my current SQL, ntile doesn't solve it because the partition size varies:
select
rowId
, ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
, *
from log;
Already tried ntile( (count(*) over partition...) / 2), but that doesn't work.
Generating the UID can be done with md5() or similar, but I'm having trouble tagging the rows as illustrated above (so I can md5 the generated tag/uid)

While count(*) is not supported within a Snowflake window function, count(1) is supported and can be used to create the unique identifier. Below is an example of an integer unique ID matching pairs of rows and handling "odd" row groups:
select
ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
,ceil(count(1) over( partition by firstIdentifier, secondIdentifier order by timestamp asc) / 2) as id
, *
from log;

select *, char(65 + (row_number() over(partition by
firstidentifier,secondidentifier order by timestamp)-1)/2)
expectedidentifier from log
order by firstidentifier, timestamp
Here is the Sql Server Version
with log (firstidentifier,secondidentifier, timestamp)
as (
select 15396, 14460, 1 union all
select 15396, 14460, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1
)
select *, char(65 + (row_number() over(partition by
firstidentifier,secondidentifier order by timestamp)-1)/2)
expectedidentifier from log
order by firstidentifier,secondidentifier,timestamp

Related

How do I select 1 [oldest] row per group of rows, given multiple groups?

Let's say we have the database table below, called USER_JOBS.
I'd like to write an SQL query that reflects this algorithm:
Divide the whole table in groups of rows defined by a common USER_ID (in the example table, the 2 resulting groups are colored yellow & green)
From each group, select the oldest row (according to SCHEDULE_TIME)
From this example table, the desired SQL query would return these 2 rows:
You can use ranking function (supported in most RDBS):
SELECT *
FROM
(
SELECT *
,ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY SCHEDULE_TIME DESC) AS RowID
FROM [table]
)
WHERE RowID = 1
WITH Ranked AS (
SELECT
RANK() OVER (PARTITION BY User_ID ORDER BY ScheduleTime DESC) as Ranking,
*
FROM [table_name]
)
SELECT Status, Sob_Type, User_ID, TimeStamp FROM ranking WHERE Ranks = 1;

SQL Ranking N records by one criteria and N records by another and repeat

In my table I have 4 columns Id, Type InitialRanking & FinalRanking. Based on certain criteria I’ve managed to apply InitialRanking to the records (1-20). I now need to apply FinalRanking by identifying the top 7 of Type 1 followed by the
top 3 of Type 2. Then I need to repeat the above until all records have a FinalRanking. My goal would be to achieve the output in the final column of the attached image.
The 7 & 3 will vary over time but for the purposes of this example let’s say they are fixed.
you can try like this
SELECT * FROM(
( SELECT ID,DISTINCT TYPE,
CASE WHEN TYPE=1 THEN
( SELECT TOP 7 INITIALRANK, FINALRANK
from table where type=1)
ELSE
( SELECT TOP 3 INITIALRANK, FINALRANK
from table where type=2)
END CASE
FROM TABLE WHERE TYPE IN (1,2)
)
UNION
( SELECT ID,TYPE,
INITIALRANK, FINALRANK
from table where type not in (1,2))
)
)
A simple (or simplistic) approach to your Final Rank would be the following:
row_number() over (partition by type order by initrank) +
case type
when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(10-7)
when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(10-3)+7
end FinalRank
This can be generalized for more than 2 groups for example with three groups of size 7, 3 and 2, the pattern size is 7+3+2=12 the general form is PartitionedRowNum+(Ceil(PartitionedRowNum/GroupSize)-1)*(PaternSize-GroupSize)+Offset where the offset is the sum of the preceding group sizes:
row_number() over (partition by type order by initrank) +
case type
when 1 then (ceil((row_number() over (partition by type order by initrank))/7)-1)*(12-7)
when 2 then (ceil((row_number() over (partition by type order by initrank))/3)-1)*(12-3)+7
when 3 then (ceil((row_number() over (partition by type order by initrank))/2)-1)*(12-2)+7+3
end FinalRank

Removing duplicate rows with condition

User sessions are tracked on the system and stored in the following format. Sometimes I get multiple records for the same session id.
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
4 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117
I used to run a sql query with DISTINCT(session_id to keep only one of the multiple records for each session id. BUT I just realized that my query picks the row on the top even when if the bottom row recorded more actions for the same session. So you look at the following table, my query keeps Row 1 & 3, like this;
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Whereas, I would like to keep row 2 and 3, like this;
Row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Is there anyway to do it with a sql query? Thank you!
Below is one of the option for BigQuery Standard SQL
#standardSQL
SELECT row, session_id, user_actions
FROM (
SELECT
row, session_id, user_actions,
ROW_NUMBER() OVER(PARTITION BY session_id
ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
) = 1 win
FROM `project.dataset.table`
)
WHERE win
You can test / play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT row, session_id, user_actions
FROM (
SELECT
row, session_id, user_actions,
ROW_NUMBER() OVER(PARTITION BY session_id
ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
) = 1 win
FROM `project.dataset.table`
)
WHERE win
ORDER BY row
result is
row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Another option would be as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT session_id,
ARRAY_AGG(user_actions ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC LIMIT 1)[SAFE_OFFSET(0)] user_actions
FROM `project.dataset.table`
GROUP BY session_id
This one looks a little cleaner :o)
You can extend above with for example combining distinct codes from deduplicated entries in case if (for example) some actions are missing in one row but not another and etc.)
Update:
Try below to separate expenses of calculating array_length from ordering in partition:
#standardSQL
SELECT row, session_id, user_actions
FROM (
SELECT
row, session_id, user_actions,
ROW_NUMBER() OVER(PARTITION BY session_id ORDER BY len DESC) = 1 win
FROM (
SELECT *, ARRAY_LENGTH(SPLIT(user_actions)) len
FROM `project.dataset.table`
)
)
WHERE win

"Group" some rows together before sorting (Oracle)

I'm using Oracle Database 11g.
I have a query that selects, among other things, an ID and a date from a table. Basically, what I want to do is keep the rows that have the same ID together, and then sort those "groups" of rows by the most recent date in the "group".
So if my original result was this:
ID Date
3 11/26/11
1 1/5/12
2 6/3/13
2 10/15/13
1 7/5/13
The output I'm hoping for is:
ID Date
3 11/26/11 <-- (Using this date for "group" ID = 3)
1 1/5/12
1 7/5/13 <-- (Using this date for "group" ID = 1)
2 6/3/13
2 10/15/13 <-- (Using this date for "group" ID = 2)
Is there any way to do this?
One way to get this is by using analytic functions; I don't have an example of that handy.
This is another way to get the specified result, without using an analytic function (this is ordering first by the most_recent_date for each ID, then by ID, then by Date):
SELECT t.ID
, t.Date
FROM mytable t
JOIN ( SELECT s.ID
, MAX(s.Date) AS most_recent_date
FROM mytable s
WHERE s.Date IS NOT NULL
GROUP BY s.ID
) r
ON r.ID = t.ID
ORDER
BY r.most_recent_date
, t.ID
, t.Date
The "trick" here is to return "most_recent_date" for each ID, and then join that to each row. The result can be ordered by that first, then by whatever else.
(I also think there's a way to get this same ordering using Analytic functions, but I don't have an example of that handy.)
You can use the MAX ... KEEP function with your aggregate to create your sort key:
with
sample_data as
(select 3 id, to_date('11/26/11','MM/DD/RR') date_col from dual union all
select 1, to_date('1/5/12','MM/DD/RR') date_col from dual union all
select 2, to_date('6/3/13','MM/DD/RR') date_col from dual union all
select 2, to_date('10/15/13','MM/DD/RR') date_col from dual union all
select 1, to_date('7/5/13','MM/DD/RR') date_col from dual)
select
id,
date_col,
-- For illustration purposes, does not need to be selected:
max(date_col) keep (dense_rank last order by date_col) over (partition by id) sort_key
from sample_data
order by max(date_col) keep (dense_rank last order by date_col) over (partition by id);
Here is the query using analytic functions:
select
id
, date_
, max(date_) over (partition by id) as max_date
from table_name
order by max_date, id
;

Partition Function COUNT() OVER possible using DISTINCT

I'm trying to write the following in order to get a running total of distinct NumUsers, like so:
NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])
Management studio doesn't seem too happy about this. The error disappears when I remove the DISTINCT keyword, but then it won't be a distinct count.
DISTINCT does not appear to be possible within the partition functions.
How do I go about finding the distinct count? Do I use a more traditional method such as a correlated subquery?
Looking into this a bit further, maybe these OVER functions work differently to Oracle in the way that they cannot be used in SQL-Server to calculate running totals.
I've added a live example here on SQLfiddle where I attempt to use a partition function to calculate a running total.
There is a very simple solution using dense_rank()
dense_rank() over (partition by [Mth] order by [UserAccountKey])
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc)
- 1
This will give you exactly what you were asking for: The number of distinct UserAccountKeys within each month.
Necromancing:
It's relativiely simple to emulate a COUNT DISTINCT over PARTITION BY with MAX via DENSE_RANK:
;WITH baseTable AS
(
SELECT 'RM1' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr
FROM baseTable
)
SELECT
RM
,ADR
,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1
,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2
-- Not supported
--,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
,MAX(CTE.dr) OVER (PARTITION BY CTE.RM ORDER BY CTE.RM) AS cntDistEmu
FROM CTE
Note:
This assumes the fields in question are NON-nullable fields.
If there is one or more NULL-entries in the fields, you need to subtract 1.
I use a solution that is similar to that of David above, but with an additional twist if some rows should be excluded from the count. This assumes that [UserAccountKey] is never null.
-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
partition by [Mth]
order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
)
+ dense_rank() over (
partition by [Mth]
order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1
An SQL Fiddle with an extended example can be found here.
I think the only way of doing this in SQL-Server 2008R2 is to use a correlated subquery, or an outer apply:
SELECT datekey,
COALESCE(RunningTotal, 0) AS RunningTotal,
COALESCE(RunningCount, 0) AS RunningCount,
COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM document
OUTER APPLY
( SELECT SUM(Amount) AS RunningTotal,
COUNT(1) AS RunningCount,
COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
FROM Document d2
WHERE d2.DateKey <= document.DateKey
) rt;
This can be done in SQL-Server 2012 using the syntax you have suggested:
SELECT datekey,
SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM document
However, use of DISTINCT is still not allowed, so if DISTINCT is required and/or if upgrading isn't an option then I think OUTER APPLY is your best option
There is a solution in simple SQL:
SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users
=>
SELECT time, COUNT(*) OVER(ORDER BY time) AS users
FROM (
SELECT user, MIN(time) AS time
FROM users
GROUP BY user
) t
I wandered in here with essentially the same question as whytheq and found David’s solution, but then had to review my old self-tutorial notes regarding DENSE_RANK because I use it so rarely: why DENSE_RANK instead of RANK or ROW_NUMBER, and how does it actually work? In the process, I updated that tutorial to include my version of David’s solution for this particular problem, and then thought it might be helpful for SQL newbies (or others like me who forget stuff).
The whole tutorial text can be copy/pasted into a query editor and then each example query can be (separately) uncommented and run, to see their respective results. (By default, the solution to this problem is uncommented at the bottom.) Or, each example can be copied separately into their own query-edit instance but the TBLx CTE must be included with each.
--WITH /* DB2 version */
--TBLx (Col_A, Col_B) AS (VALUES
-- ( 7, 7 ),
-- ( 7, 7 ),
-- ( 7, 7 ),
-- ( 7, 8 ))
WITH /* SQL-Server version */
TBLx (Col_A, Col_B) AS
(SELECT 7, 7 UNION ALL
SELECT 7, 7 UNION ALL
SELECT 7, 7 UNION ALL
SELECT 7, 8)
/*** Example-A: demonstrates the difference between ROW_NUMBER, RANK and DENSE_RANK ***/
--SELECT Col_A, Col_B,
-- ROW_NUMBER() OVER(PARTITION BY Col_A ORDER BY Col_B) AS ROW_NUMBER_,
-- RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS RANK_,
-- DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DENSE_RANK_
--FROM TBLx
/* RESULTS:
Col_A Col_B ROW_NUMBER_ RANK_ DENSE_RANK_
7 7 1 1 1
7 7 2 1 1
7 7 3 1 1
7 8 4 4 2
ROW_NUMBER: Just increments for the three identical rows and increments again for the final unique row.
That is, it’s an order-value (based on "sort" order) but makes no other distinction.
RANK: Assigns the same rank value to the three identical rows, then jumps to 4 for the fourth row,
which is *unique* with regard to the others.
That is, each identical row is ranked by the rank-order of the first row-instance of that
(identical) value-set.
DENSE_RANK: Also assigns the same rank value to the three identical rows but the fourth *unique* row is
assigned a value of 2.
That is, DENSE_RANK identifies that there are (only) two *unique* row-types in the row set.
*/
/*** Example-B: to get only the distinct resulting "count-of-each-row-type" rows ***/
-- SELECT DISTINCT -- For unique returned "count-of-each-row-type" rows, the DISTINCT operator is necessary because
-- -- the calculated DENSE_RANK value is appended to *all* rows in the data set. Without DISTINCT,
-- -- its value for each original-data row-type would just be replicated for each of those rows.
--
-- Col_A, Col_B,
-- DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DISTINCT_ROWTYPE_COUNT_
-- FROM TBLx
/* RESULTS:
Col_A Col_B DISTINCT_ROWTYPE_COUNT_
7 7 1
7 8 2
*/
/*** Example-C.1: demonstrates the derivation of the "count-of-all-row-types" (finalized in Example-C.2, below) ***/
-- SELECT
-- Col_A, Col_B,
--
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC) AS ROW_TYPES_COUNT_DESC_,
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC) AS ROW_TYPES_COUNT_ASC_,
--
-- -- Adding the above cases together and subtracting one gives the same total count for on each resulting row:
--
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
-- +
-- DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
-- - 1 /* (Because DENSE_RANK values are one-based) */
-- AS ROW_TYPES_COUNT_
-- FROM TBLx
/* RESULTS:
COL_A COL_B ROW_TYPES_COUNT_DESC_ ROW_TYPES_COUNT_ASC_ ROW_TYPES_COUNT_
7 7 2 1 2
7 7 2 1 2
7 7 2 1 2
7 8 1 2 2
*/
/*** Example-C.2: uses the above technique to get a *single* resulting "count-of-all-row-types" row ***/
SELECT DISTINCT -- For a single returned "count-of-all-row-types" row, the DISTINCT operator is necessary because the
-- calculated DENSE_RANK value is appended to *all* rows in the data set. Without DISTINCT, that
-- value would just be replicated for each original-data row.
-- Col_A, Col_B, -- In order to get a *single* returned "count-of-all-row-types" row (and field), all other fields
-- must be excluded because their respective differing row-values will defeat the purpose of the
-- DISTINCT operator, above.
DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
+
DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
- 1 /* (Because DENSE_RANK values are one-based) */
AS ROW_TYPES_COUNT_
FROM TBLx
/* RESULTS:
ROW_TYPES_COUNT_
2
*/