Ignoring null values in a PostgreSQL rank() window function - sql

I am writing a SQL query using PostgreSQL that needs to rank people who "arrive" at some location. Not everyone arrives, however. I am using a rank() window function to generate arrival ranks, but where the arrival time is null, rather than returning a null rank, rank() treats the no-shows as if they arrived after everyone else. What I want is for these no-shows to get a rank of NULL instead of this imputed rank.
Here is an example. Suppose I have a table dinner_show_up that looks like this:
| Person | arrival_time | Restaurant |
+--------+--------------+------------+
| Dave   | 7            | in_and_out |
| Mike   | 2            | in_and_out |
| Bob    | NULL         | in_and_out |
Bob never shows up. The query I'm writing would be:
select Person,
       rank() over (partition by Restaurant order by arrival_time asc) as arrival_rank
from dinner_show_up;
And the result will be
| Person | arrival_rank |
+--------+--------------+
| Dave   | 2            |
| Mike   | 1            |
| Bob    | 3            |
What I want to happen instead is this:
| Person | arrival_rank |
+--------+--------------+
| Dave   | 2            |
| Mike   | 1            |
| Bob    | NULL         |

Just use a CASE expression around the rank():
select Person,
       (case when arrival_time is not null
             then rank() over (partition by Restaurant order by arrival_time asc)
        end) as arrival_rank
from dinner_show_up;
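This works in PostgreSQL because order by arrival_time asc sorts nulls last, so the null rows never displace the ranks of the people who actually arrived; the CASE merely masks the imputed trailing rank.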

A more general solution, which works for all window functions and not only rank(), is to partition by 'arrival_time is not null' in the over() clause. That places all null-arrival_time rows into the same partition and gives them the same rank, leaving the non-null rows to be ranked relative only to each other.
For the sake of a meaningful example, I mocked up a CTE with more rows than the initial problem set. Please forgive the wide rows, but I think they better contrast the differing techniques.
with dinner_show_up("person", "arrival_time", "restaurant") as (values
('Dave' , 7, 'in_and_out')
,('Mike' , 2, 'in_and_out')
,('Bob' , null, 'in_and_out')
,('Peter', 3, 'in_and_out')
,('Jane' , null, 'in_and_out')
,('Merry', 5, 'in_and_out')
,('Sam' , 5, 'in_and_out')
,('Pip' , 9, 'in_and_out')
)
select
person
,case when arrival_time is not null then rank() over ( order by arrival_time) end as arrival_rank_without_partition
,case when arrival_time is not null then rank() over (partition by arrival_time is not null order by arrival_time) end as arrival_rank_with_partition
,case when arrival_time is not null then percent_rank() over ( order by arrival_time) end as arrival_pctrank_without_partition
,case when arrival_time is not null then percent_rank() over (partition by arrival_time is not null order by arrival_time) end as arrival_pctrank_with_partition
from dinner_show_up;
This query gives the same results for arrival_rank with and without the partition. The results for percent_rank() do differ, however: without the partition the values are wrong, ranging from 0% to 71.4%, whereas with the partition percent_rank() correctly ranges from 0% to 100%.
The same pattern applies to the ntile() window function as well (see the sketch after the table below).
It works by separating all null values from non-null values for purposes of the ranking. This ensures that Jane and Bob are excluded from the percentile ranking of 0% to 100%.
|person|arrival_rank_without_partition|arrival_rank_with_partition|arrival_pctrank_without_partition|arrival_pctrank_with_partition|
+------+------------------------------+---------------------------+---------------------------------+------------------------------+
|Jane  |null                          |null                       |null                             |null                          |
|Bob   |null                          |null                       |null                             |null                          |
|Mike  |1                             |1                          |0                                |0                             |
|Peter |2                             |2                          |0.14                             |0.2                           |
|Sam   |3                             |3                          |0.28                             |0.4                           |
|Merry |3                             |3                          |0.28                             |0.4                           |
|Dave  |5                             |5                          |0.57                             |0.8                           |
|Pip   |6                             |6                          |0.71                             |1.0                           |
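The same trick carries over to ntile(); here is a minimal sketch reusing the CTE above (the bucket count of 2 and the alias arrival_ntile are arbitrary):
select
    person
   ,case when arrival_time is not null
         then ntile(2) over (partition by arrival_time is not null order by arrival_time)
    end as arrival_ntile
from dinner_show_up;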

Another option is to rank only the rows that have an arrival_time and union in the no-shows with a NULL rank:
select Person,
       rank() over (partition by Restaurant order by arrival_time asc) as arrival_rank
from dinner_show_up
where arrival_time is not null
union all
select Person, NULL as arrival_rank
from dinner_show_up
where arrival_time is null;
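union all is appropriate here because the two branches select disjoint sets of rows, so the duplicate-eliminating sort that a plain union performs would be wasted work.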

Related

First and last not null fields over partitions

I have a table like this:
| EventID | EventTime                  | AttrA | AttrB |
+---------+----------------------------+-------+-------+
| 1       | 2022-10-01 00:00:01.000000 | null  | null  |
| 1       | 2022-10-01 00:00:02.000000 | a     | null  |
| 1       | 2022-10-01 00:00:03.000000 | b     | 1     |
| 1       | 2022-10-01 00:00:04.000000 | null  | null  |
| 2       | 2022-10-01 00:01:01.000000 | aa    | 11    |
| 2       | 2022-10-01 00:01:02.000000 | bb    | null  |
| 2       | 2022-10-01 00:01:03.000000 | null  | null  |
| 2       | 2022-10-01 00:01:04.000000 | aa    | 22    |
and I want to jump across the records to return the first and last non-null AttrA and AttrB values for each EventID based on the EventTime. Each EventID can have multiple records, so we can't know where the non-nulls may be. So the desired results would be:
| EventID | FirstAttrA | LastAttrA | FirstAttrB | LastAttrB |
+---------+------------+-----------+------------+-----------+
| 1       | a          | b         | 1          | 1         |
| 2       | aa         | aa        | 11         | 22        |
What I did is to add row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC) and then again with DESC, and then have multiple CTEs like this:
WITH enhanced_table AS
(
    SELECT
        eventID,
        attrA,
        attrB,
        row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC) as rn,
        row_number() OVER (PARTITION BY event_id ORDER BY event_time DESC) as reversed_rn
    FROM events  -- the source table (its name is not given in the question)
),
first_events_with_attrA AS
(
    SELECT
        eventID,
        FIRST(attrA) OVER (PARTITION BY eventID ORDER BY rn ASC) AS url
    FROM enhanced_table
    WHERE attrA IS NOT NULL
)...
But I need one CTE which scans the table again for each case I want (for this example, 4 CTEs in total). It works, but it is slow.
Is there a way to grab the values I am interested in more efficiently?
No need to build row numbers; you can directly use the native Spark SQL functions FIRST and LAST with isIgnoreNull set to True to achieve the intended results.
Data Preparation
# Imports assumed by this snippet; "sql" below is the SparkSession handle.
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, StringType, DoubleType)

sql = SparkSession.builder.getOrCreate()

s = StringIO("""
EventID,EventTime,AttrA,AttrB
1,2022-10-01 00:00:01.000000,,
1,2022-10-01 00:00:02.000000,a,
1,2022-10-01 00:00:03.000000,b,1
1,2022-10-01 00:00:04.000000,,
2,2022-10-01 00:01:01.000000,aa,11
2,2022-10-01 00:01:02.000000,bb,
2,2022-10-01 00:01:03.000000,,
2,2022-10-01 00:01:04.000000,aa,22
""")

inp_schema = StructType([
     StructField('EventID', IntegerType(), True)
    ,StructField('EventTime', StringType(), True)
    ,StructField('AttrA', StringType(), True)
    ,StructField('AttrB', DoubleType(), True)
])

# Read via pandas, then convert the NaNs pandas creates for blank fields into real nulls
df = pd.read_csv(s, delimiter=',')
sparkDF = sql.createDataFrame(df, schema=inp_schema)\
    .withColumn('AttrA', F.when(F.isnan(F.col('AttrA')), None).otherwise(F.col('AttrA')))\
    .withColumn('AttrB', F.when(F.isnan(F.col('AttrB')), None).otherwise(F.col('AttrB')))
sparkDF.show(truncate=False)
+-------+--------------------------+-----+-----+
|EventID|EventTime |AttrA|AttrB|
+-------+--------------------------+-----+-----+
|1 |2022-10-01 00:00:01.000000|null |null |
|1 |2022-10-01 00:00:02.000000|a |null |
|1 |2022-10-01 00:00:03.000000|b |1.0 |
|1 |2022-10-01 00:00:04.000000|null |null |
|2 |2022-10-01 00:01:01.000000|aa |11.0 |
|2 |2022-10-01 00:01:02.000000|bb |null |
|2 |2022-10-01 00:01:03.000000|null |null |
|2 |2022-10-01 00:01:04.000000|aa |22.0 |
+-------+--------------------------+-----+-----+
First & Last
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
EventID,
FIRST(AttrA,True) as First_AttrA,
LAST(AttrA,True) as Last_AttrA,
FIRST(AttrB,True) as First_AttrB,
LAST(AttrB,True) as Last_AttrB
FROM INPUT
GROUP BY 1
""").show()
+-------+-----------+----------+-----------+----------+
|EventID|First_AttrA|Last_AttrA|First_AttrB|Last_AttrB|
+-------+-----------+----------+-----------+----------+
| 1| a| b| 1.0| 1.0|
| 2| aa| aa| 11.0| 22.0|
+-------+-----------+----------+-----------+----------+
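If you need these values alongside the other event columns rather than one row per EventID, the same ignore-nulls trick should also work with FIRST and LAST as window functions. A sketch (untested) against the same INPUT view, with an explicit frame so LAST can see the whole partition:
SELECT DISTINCT
    EventID,
    FIRST(AttrA, True) OVER (PARTITION BY EventID ORDER BY EventTime
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS First_AttrA,
    LAST(AttrA, True)  OVER (PARTITION BY EventID ORDER BY EventTime
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Last_AttrA,
    FIRST(AttrB, True) OVER (PARTITION BY EventID ORDER BY EventTime
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS First_AttrB,
    LAST(AttrB, True)  OVER (PARTITION BY EventID ORDER BY EventTime
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Last_AttrB
FROM INPUT
Unlike the GROUP BY version, which depends on the incoming row order, the window ordering here pins "first" and "last" to EventTime explicitly.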

How to subset the readmitted cases from an inpatients’ table to calculate the total length of stay of the readmitted cases in SQL Server 17?

I am working with an inpatients' data table that looks like the following:
| ID   | AdmissionDate | DischDate  | LOS | Readmitted30days |
+------+---------------+------------+-----+------------------+
| 001  | 2014-01-01    | 2014-01-12 | 11  | 1                |
| 101  | 2014-02-05    | 2014-02-12 | 7   | 1                |
| 001  | 2014-02-18    | 2018-02-27 | 9   | 1                |
| 001  | 2018-02-01    | 2018-02-13 | 12  | 0                |
| 212  | 2014-01-28    | 2014-02-12 | 15  | 1                |
| 212  | 2014-03-02    | 2014-03-15 | 13  | 0                |
| 212  | 2016-12-23    | 2016-12-29 | 4   | 0                |
| 1011 | 2017-06-10    | 2017-06-21 | 11  | 0                |
| 401  | 2018-01-01    | 2018-01-11 | 10  | 0                |
| 401  | 2018-10-01    | 2018-10-10 | 9   | 0                |
I want to create another table from the above in which the total length of stay (LOS) is summed up for those who have been readmitted within 30 days. The table I want to create looks like the following:
| ID   | Total LOS |
+------+-----------+
| 001  | 39        |
| 212  | 28        |
| 212  | 4         |
| 1011 | 11        |
| 401  | 10        |
| 401  | 9         |
I am using SQL Server Version 17.
Could anyone help me do this?
Thanks in advance
The Readmitted30days column seems irrelevant to the question and a complete red herring. What you seem to want is to aggregate rows which are within 30 days of each other.
This is a type of gaps-and-islands problem. There are a number of solutions; here is one:
- We use LAG to check whether the previous DischDate is within 30 days of this AdmissionDate.
- Based on that we assign a grouping ID by doing a running count.
- Then we simply group by ID and our grouping ID, and sum.
The dates and LOS don't seem to match up, so I've given you both.
WITH StartPoints AS (
    SELECT *,
        IsStart = CASE WHEN
              DATEADD(day, -30, AdmissionDate) <
              LAG(DischDate) OVER (PARTITION BY ID ORDER BY DischDate)
          THEN NULL ELSE 1 END  -- 1 marks the start of a new group of stays
    FROM YourTable
),
Groupings AS (
    SELECT *,
        GroupId = COUNT(IsStart) OVER (PARTITION BY ID ORDER BY DischDate ROWS UNBOUNDED PRECEDING)
    FROM StartPoints
)
SELECT
    ID,
    TotalBasedOnDates = SUM(DATEDIFF(day, AdmissionDate, DischDate)),  -- do you need to add 1 within the sum?
    TotalBasedOnLOS = SUM(LOS)
FROM Groupings
GROUP BY ID, GroupID;
db<>fiddle
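As a worked example, ID 212's 2014-03-02 admission is within 30 days of its 2014-02-12 discharge, so those two stays land in one group (LOS 15 + 13 = 28), while the 2016-12-23 stay starts a new group (LOS 4), matching the desired output.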
If I understand correctly:
select Id, sum(LOS)
from tablename
where Readmitted30days = 1
group by Id
You want to use aggregation:
select id, sum(los)
from t
group by id
having max(Readmitted30days) = 1;
This filters after the aggregation so all los values are included in the sum.
EDIT:
I think I understand. For every occasion where Readmitted30days = 0, you want a row in the result set that combines that row with the preceding rows, back to the previous such row.
If that interpretation is correct, you can construct groups using a cumulative sum and then aggregate:
select id, sum(los)
from (select t.*,
             sum(case when Readmitted30days = 0 then 1 else 0 end)
                 over (partition by id order by admissiondate desc) as grp
      from t
     ) t
group by id, grp;
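On the sample data, scanning ID 212 in descending admission order puts the 2016-12-23 stay in group 1 (sum 4) and both 2014 stays in group 2 (15 + 13 = 28), matching the desired output.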

SQL DB2 Split result of group by based on count

I would like to split the result of a group by into several rows based on a count, but I don't know if it's possible. For instance, if I have a query like this:
SELECT doc.client, doc.template, COUNT(*) FROM document doc GROUP BY doc.client, doc.template
and a table document with the following data:
ID | name  | client | template
1  | doc_a | a      | temp_a
2  | doc_b | a      | temp_a
3  | doc_c | a      | temp_a
4  | doc_d | a      | temp_b
The result for the query would be :
client | template | count
a      | temp_a   | 3
a      | temp_b   | 1
But I would like to split a row of the result into two or more rows if the count is higher than 2:
client | template | count
a      | temp_a   | 2
a      | temp_a   | 1
a      | temp_b   | 1
Is there a way to do this in SQL?
You can use a recursive CTE (RCTE) like below. Run this statement as-is first, playing with different values in the last column. The max batch size here is 1000.
WITH
GRP_RESULT (client, template, count) AS
(
-- Place your SELECT ... GROUP BY here
-- instead of VALUES
VALUES
('a', 'temp_a', 4500)
, ('a', 'temp_b', 3001)
)
, T (client, template, count, max_batch_size) AS
(
SELECT client, template, count, 1000
FROM GRP_RESULT
UNION ALL
SELECT client, template, count - max_batch_size, max_batch_size
FROM T
WHERE count > max_batch_size
)
SELECT client, template, CASE WHEN count > max_batch_size THEN max_batch_size ELSE count END count
FROM T
ORDER BY client, template, count DESC
The result is:
|CLIENT|TEMPLATE|COUNT |
|------|--------|-----------|
|a |temp_a |1000 |
|a |temp_a |1000 |
|a |temp_a |1000 |
|a |temp_a |1000 |
|a |temp_a |500 |
|a |temp_b |1000 |
|a |temp_b |1000 |
|a |temp_b |1000 |
|a |temp_b |1 |
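To see why: for ('a', 'temp_a', 4500) the recursion produces running counts 4500, 3500, 2500, 1500, 500, and the final CASE caps each of them at the 1000 batch size, giving the four 1000 rows plus the 500 remainder shown above.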
You can then substitute your SELECT ... GROUP BY statement as indicated above to achieve your goal.
You can use window functions and then aggregate:
SELECT client, template, COUNT(*)
FROM (SELECT doc.client, doc.template,
             ROW_NUMBER() OVER (PARTITION BY doc.client, doc.template ORDER BY doc.client) - 1 as seqnum
      FROM document doc
     ) d
GROUP BY client, template, FLOOR(seqnum / 2)
The subquery enumerates the rows. The outer query then splits them into groups of at most two using FLOOR(seqnum / 2).
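With the sample data, the three temp_a rows get seqnum 0, 1, 2, so FLOOR(seqnum / 2) buckets them as {0, 1} and {2}, producing the counts 2 and 1 requested in the question.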

Better way of writing my SQL query with conditional group by

Here's my data
|vendorname |total|
+-----------+-----+
|Najla      |10   |
|Disney     |20   |
|Disney     |10   |
|ToysRus    |5    |
|ToysRus    |1    |
|Gap        |1    |
|Gap        |2    |
|Gap        |3    |
|Najla      |2    |
Here's the resultset I want
|vendorname |grandtotal|
+-----------+----------+
|Disney     |30        |
|Gap        |6         |
|ToysRus    |6         |
|Najla      |2         |
|Najla      |10        |
If the vendorname = 'Najla', I want individual rows with their respective totals; otherwise I would like to group them and return a sum of their totals.
This is my query:
select *
from
(
    select vendorname, sum(total) grandtotal
    from vendor
    where vendorname <> 'Najla'
    group by vendorname
    union all
    select vendorname, total grandtotal
    from vendor
    where vendorname = 'Najla'
) A
I was wondering if there's a better way to write this query instead of repeating it twice and performing a union. Is there a condensed way to group some rows "conditionally"?
Honestly, I think the union all version is going to be the best performing and easiest to read option if it has appropriate indexes.
You could, however, do something like this (assuming you have a unique id on your table):
select vendorname, sum(total) grandtotal
from t
group by
vendorname
, case when vendorname = 'Najla' then id else null end
rextester demo: http://rextester.com/OGZQ33364
returns
+------------+------------+
| vendorname | grandtotal |
+------------+------------+
| Disney | 30 |
| Gap | 6 |
| ToysRus | 6 |
| Najla | 10 |
| Najla | 2 |
+------------+------------+
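The CASE in the GROUP BY is what does the conditional grouping: every 'Najla' row keeps its own grouping key (its unique id), while all other rows for a vendor collapse to the single key NULL, so 'Najla' rows survive individually and everything else is summed per vendor.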

SQL Insert Query For Multiple Max IDs

Table w:
|ID|Comment|SeqID|
+--+-------+-----+
|1 |bajg   | 1   |
|1 |2423   | 2   |
|2 |ref    | 1   |
|2 |comment| 2   |
|2 |juk    | 3   |
|3 |efef   | 1   |
|4 |hy     | 1   |
|4 |6u     | 2   |
How do I insert a standard new comment for each ID with a new SeqID (each ID's max SeqID increased by 1)?
The below query returns only the single row with the highest SeqID overall:
Select *
From w
Where SEQID = (select max(seqid) from w)
Result:
|2 |juk    | 3   |
Expected result (the new rows added to table w):
|ID|Comment|SeqID|
+--+-------+-----+
|1 |sqc    | 3   |
|2 |sqc    | 4   |
|3 |sqc    | 2   |
|4 |sqc    | 3   |
Will I have to go through and insert the values (new comment 'sqc') for each ID one at a time using the below, or is there a faster way?
INSERT INTO table_name
VALUES (value1,value2,value3,...);
Try this:
INSERT INTO mytable (ID, Comment, SeqID)
SELECT ID, 'sqc', MAX(SeqID) + 1
FROM mytable
GROUP BY ID
Demo here
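On the sample table w, this inserts (1, 'sqc', 3), (2, 'sqc', 4), (3, 'sqc', 2), and (4, 'sqc', 3), which matches the expected result.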
You are probably better off just calculating the value when you query. Define an identity column on the table, say CommentId, and run a query like:
select id, comment,
       row_number() over (partition by id order by CommentId) as SeqId
from t;
What is nice about this approach is that the ids are always sequential, there is no opportunity for duplicates, the table does not have to be locked when inserting, and the sequential ids work even for updates and deletes.
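A minimal sketch of that setup, assuming SQL Server syntax and reusing the table name w from the question (CommentId is the new identity column; SeqID is never stored):
CREATE TABLE w (
    CommentId INT IDENTITY(1,1) PRIMARY KEY,  -- assigned automatically on insert
    ID        INT NOT NULL,
    Comment   VARCHAR(100) NOT NULL
);

-- New comments need no sequence bookkeeping:
INSERT INTO w (ID, Comment) VALUES (1, 'sqc'), (2, 'sqc');

-- SeqID is derived on demand:
SELECT ID, Comment,
       ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CommentId) AS SeqId
FROM w;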