First and last not null fields over partitions - sql

I have a table like this:
| EventID | EventTime                  | AttrA | AttrB |
|---------|----------------------------|-------|-------|
| 1       | 2022-10-01 00:00:01.000000 | null  | null  |
| 1       | 2022-10-01 00:00:02.000000 | a     | null  |
| 1       | 2022-10-01 00:00:03.000000 | b     | 1     |
| 1       | 2022-10-01 00:00:04.000000 | null  | null  |
| 2       | 2022-10-01 00:01:01.000000 | aa    | 11    |
| 2       | 2022-10-01 00:01:02.000000 | bb    | null  |
| 2       | 2022-10-01 00:01:03.000000 | null  | null  |
| 2       | 2022-10-01 00:01:04.000000 | aa    | 22    |
and I want to scan across the records and return the first and last non-null AttrA and AttrB values for each EventID, based on EventTime. Each EventID can have multiple records, so we can't know in advance where the non-null values are. The desired result would be:
| EventID | FirstAttrA | LastAttrA | FirstAttrB | LastAttrB |
|---------|------------|-----------|------------|-----------|
| 1       | a          | b         | 1          | 1         |
| 2       | aa         | aa        | 11         | 22        |
What I did is to add row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC), then again with DESC, and then chain multiple CTEs like this:
WITH enhanced_table AS
(
    SELECT
        eventID,
        attrA,
        attrB,
        row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC) AS rn,
        row_number() OVER (PARTITION BY event_id ORDER BY event_time DESC) AS reversed_rn
    FROM source_table   -- the source table is not named in the question
),
first_events_with_attrA AS
(
    SELECT
        eventID,
        FIRST(attrA) OVER (PARTITION BY eventID ORDER BY rn ASC) AS first_attrA
    FROM enhanced_table
    WHERE attrA IS NOT NULL
)...
But this way I need one CTE that scans the table again for each value I want (four CTEs in total for this example). It works, but it is slow.
Is there a more efficient way to grab the values I am interested in?

No need to build row numbers; you can directly use the native Spark SQL functions FIRST and LAST with isIgnoreNull set to True to achieve the intended results.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

s = StringIO("""
EventID,EventTime,AttrA,AttrB
1,2022-10-01 00:00:01.000000,,
1,2022-10-01 00:00:02.000000,a,
1,2022-10-01 00:00:03.000000,b,1
1,2022-10-01 00:00:04.000000,,
2,2022-10-01 00:01:01.000000,aa,11
2,2022-10-01 00:01:02.000000,bb,
2,2022-10-01 00:01:03.000000,,
2,2022-10-01 00:01:04.000000,aa,22
"""
)

inp_schema = StructType([
     StructField('EventID', IntegerType(), True)
    ,StructField('EventTime', StringType(), True)
    ,StructField('AttrA', StringType(), True)
    ,StructField('AttrB', DoubleType(), True)
    ]
)

# `sql` is the SparkSession / SQLContext; pandas turns the empty CSV fields into NaN,
# which is converted back to proper nulls below.
df = pd.read_csv(s, delimiter=',')
sparkDF = sql.createDataFrame(df, schema=inp_schema)\
             .withColumn('AttrA', F.when(F.isnan(F.col('AttrA')), None).otherwise(F.col('AttrA')))\
             .withColumn('AttrB', F.when(F.isnan(F.col('AttrB')), None).otherwise(F.col('AttrB')))

sparkDF.show(truncate=False)
+-------+--------------------------+-----+-----+
|EventID|EventTime |AttrA|AttrB|
+-------+--------------------------+-----+-----+
|1 |2022-10-01 00:00:01.000000|null |null |
|1 |2022-10-01 00:00:02.000000|a |null |
|1 |2022-10-01 00:00:03.000000|b |1.0 |
|1 |2022-10-01 00:00:04.000000|null |null |
|2 |2022-10-01 00:01:01.000000|aa |11.0 |
|2 |2022-10-01 00:01:02.000000|bb |null |
|2 |2022-10-01 00:01:03.000000|null |null |
|2 |2022-10-01 00:01:04.000000|aa |22.0 |
+-------+--------------------------+-----+-----+
First & Last
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
EventID,
FIRST(AttrA,True) as First_AttrA,
LAST(AttrA,True) as Last_AttrA,
FIRST(AttrB,True) as First_AttrB,
LAST(AttrB,True) as Last_AttrB
FROM INPUT
GROUP BY 1
""").show()
+-------+-----------+----------+-----------+----------+
|EventID|First_AttrA|Last_AttrA|First_AttrB|Last_AttrB|
+-------+-----------+----------+-----------+----------+
| 1| a| b| 1.0| 1.0|
| 2| aa| aa| 11.0| 22.0|
+-------+-----------+----------+-----------+----------+
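Note that with a plain GROUP BY, FIRST and LAST pick values in whatever order Spark happens to read the rows, which is not guaranteed after a shuffle. If the result must be tied explicitly to EventTime, the same functions can also be used as window functions. A minimal sketch (my addition, not part of the original answer) against the same INPUT temp table:
sql.sql("""
SELECT DISTINCT
    EventID,
    FIRST(AttrA, True) OVER (PARTITION BY EventID ORDER BY EventTime
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS First_AttrA,
    LAST(AttrA, True)  OVER (PARTITION BY EventID ORDER BY EventTime
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Last_AttrA,
    FIRST(AttrB, True) OVER (PARTITION BY EventID ORDER BY EventTime
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS First_AttrB,
    LAST(AttrB, True)  OVER (PARTITION BY EventID ORDER BY EventTime
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Last_AttrB
FROM INPUT
""").show()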

Related

Sql: Join separately ordered tables

Let's assume I have two sets of events:
Foo
Bar
where I would always expect Bar to follow Foo: Foo -> Bar. I have a table of Foo values:
|----|---------------|------|
| id | ordering_foo  | other|
|----|---------------|------|
|1 |1 |X |
|1 |2 |Y |
|----|---------------|------|
|2 |1 |X |
|----|---------------|------|
|3 |2 |X |
|----|---------------|------|
|4 |1 |X |
|4 |2 |Y |
|----|---------------|------|
The ordering field indicates the order in which the Foo events happened per id.
I also have a set of Bar events:
|----|---------------|-------|
| id | ordering_bar | other |
|----|---------------|-------|
|1 |A |XX |
|1 |B |YY |
|----|---------------|-------|
|3 |B |XX |
|----|---------------|-------|
|4 |A |XX |
|----|---------------|-------|
Note that:
While Foo and Bar are both ordered, they don't share the same ordering and we can't simply join them on those ordering values. Here I have simplified them to numbers vs strings. In the problem that inspired this question these are the timestamps of each Foo/Bar event, which have the property foo.ordering < bar.ordering for a Foo -> Bar sequence of events, but that's probably not massively helpful to this problem.
The ordering isn't contiguous, i.e. just because we have an order entry of 2 (B) doesn't mean we necessarily have a 1 (A) entry; see the entries for id 3.
It's possible for us to have a record for Foo but not the subsequent Bar; see the entries for ids 2 and 4.
I want to end up with:
|----|----------|-----------|-----------|
| id | ordering | other-foo | other-bar |
| 1 | 1 | X | XX |
| 1 | 2 | Y | YY |
|----|----------|-----------|-----------|
| 2 | 1 | X | null |
|----|----------|-----------|-----------|
| 3 | 2 | X | XX |
|----|----------|-----------|-----------|
| 4 | 1 | X | XX |
| 4 | 2 | Y | null |
|----|----------|-----------|-----------|
How can I get there? In my special case of this problem I only ever have two possible events per event type, per id, i.e. the ordering values can only ever be 1,2 / A,B. I played around with things like:
case
when count(*) over (partition by foo.id) = 1 and count(*) over (partition by bar.id) = 1 then foo.ordering_foo
when count(*) over (partition by foo.id) = 2 and count(*) over (partition by bar.id) = 1 then 1
when count(*) over (partition by foo.id) = 2 and count(*) over (partition by bar.id) = 2 and max(bar.ordering_bar) over (partition by bar.id) = bar.ordering_bar then 2
when count(*) over (partition by foo.id) = 2 and count(*) over (partition by bar.id) = 2 and min(bar.ordering_bar) over (partition by bar.id) = bar.ordering_bar then 1
else -1
end as ordering,
ie, I treat each case of:
1 foo, 1 bar
2 foo, 1 bar
2 foo, 2 bar
separately to come up with a composite order. Though it is likely error-prone, and most importantly I realise this is:
horrible to read/maintain
not flexible enough.
hard to use to get other fields.
So I'm curious if you could solve this more elegantly in the generic case.
You may join the tables using ROW_NUMBER() as follows:
SELECT T.id ,T.ordering_foo, T.other other_foo, D.other other_bar
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ordering_foo) foo_rn
FROM foo
) T
LEFT JOIN
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ordering_bar) bar_rn
FROM bar
) D
ON T.ID=D.ID AND T.foo_rn=D.bar_rn
ORDER BY T.id ,T.ordering_foo
See a demo on SQL Server.
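For reference, a minimal setup to reproduce the demo (my addition; table and column names follow the question):
CREATE TABLE foo (id int, ordering_foo int, other varchar(10));
CREATE TABLE bar (id int, ordering_bar varchar(10), other varchar(10));

INSERT INTO foo (id, ordering_foo, other) VALUES
(1, 1, 'X'), (1, 2, 'Y'),
(2, 1, 'X'),
(3, 2, 'X'),
(4, 1, 'X'), (4, 2, 'Y');

INSERT INTO bar (id, ordering_bar, other) VALUES
(1, 'A', 'XX'), (1, 'B', 'YY'),
(3, 'B', 'XX'),
(4, 'A', 'XX');
Running the ROW_NUMBER() query above against these tables returns the desired result, with NULL in other_bar for the Foo events that have no subsequent Bar (ids 2 and 4).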

How to subset the readmitted cases from an inpatients’ table to calculate the total length of stay of the readmitted cases in SQL Server 17?

I am working with an inpatients' data table that looks like the following:
| ID   | AdmissionDate | DischDate  | LOS | Readmitted30days |
|------|---------------|------------|-----|------------------|
| 001  | 2014-01-01    | 2014-01-12 | 11  | 1                |
| 101  | 2014-02-05    | 2014-02-12 | 7   | 1                |
| 001  | 2014-02-18    | 2018-02-27 | 9   | 1                |
| 001  | 2018-02-01    | 2018-02-13 | 12  | 0                |
| 212  | 2014-01-28    | 2014-02-12 | 15  | 1                |
| 212  | 2014-03-02    | 2014-03-15 | 13  | 0                |
| 212  | 2016-12-23    | 2016-12-29 | 4   | 0                |
| 1011 | 2017-06-10    | 2017-06-21 | 11  | 0                |
| 401  | 2018-01-01    | 2018-01-11 | 10  | 0                |
| 401  | 2018-10-01    | 2018-10-10 | 9   | 0                |
I want to create another table from the above in which the total length of stay (LOS) is summed up for those who have been readmitted within 30 days. The table I want to create looks like the following:
| ID   | Total LOS |
|------|-----------|
| 001  | 39        |
| 212  | 28        |
| 212  | 4         |
| 1011 | 11        |
| 401  | 10        |
| 401  | 9         |
I am using SQL Server Version 17.
Could anyone help me do this?
Thanks in advance
The Readmitted30days column seems irrelevant to the question and a complete red herring. What you seem to want is to aggregate rows which are within 30 days of each other.
This is a type of gaps-and-islands problem. There are a number of solutions, here is one:
We use LAG to check whether the previous DischDate is within 30 days of this AdmissionDate
Based on that we assign a grouping ID by doing a running count
Then simply group by ID and our grouping ID, and sum
The dates and LOS don't seem to match up, so I've given you both.
WITH StartPoints AS (
SELECT *,
IsStart = CASE WHEN
DATEADD(day, -30, AdmissionDate) <
LAG(DischDate) OVER (PARTITION BY ID ORDER BY DischDate)
THEN 1 END
FROM YourTable
),
Groupings AS (
SELECT *,
GroupId = COUNT(IsStart) OVER (PARTITION BY ID ORDER BY DischDate ROWS UNBOUNDED PRECEDING)
FROM StartPoints
)
SELECT
ID,
TotalBasedOnDates = SUM(DATEDIFF(day, AdmissionDate, DischDate)), -- do you need to add 1 within the sum?
TotalBasedOnLOS = SUM(LOS)
FROM Groupings
GROUP BY ID, GroupID;
db<>fiddle
If I understand correctly:
select Id, sum(LOS)
from tablename
where Readmitted30days = 1
group by Id
You want to use aggregation:
select id, sum(los)
from t
group by id
having max(Readmitted30days) = 1;
This filters after the aggregation so all los values are included in the sum.
EDIT:
I think I understand. For every row where Readmitted30days = 0, you want a row in the result set that combines that row with the following rows, up to the next such row.
If that interpretation is correct, you can construct groups using a cumulative sum and then aggregate:
select id, sum(los)
from (select t.*,
sum(case when Readmitted30days = 0 then 1 else 0 end) over (partition by id order by admissiondate) as grp
from t
) t
group by id, grp;

insert extra rows in query result sql

Given a table with entries at irregular timestamps, "breaks" must be inserted at regular 5 min intervals (the associated data can/will be NULL).
I was thinking of getting the start time and making a subquery that has a window function and adds 5 min intervals to the start time, but I could only think of using row_number to increment the values.
WITH data as(
select id, data,
cast(date_and_time as double) * 1000 as time_milliseconds
from t1), -- original data
start_times as(
select id, MIN(CAST(date_and_time as double) * 1000) as start_time
from t1
GROUP BY id
), -- first timestamp for each id
boundries as (
SELECT T1.id,(row_number() OVER (PARTITION BY T1.id ORDER BY T1.date_and_time)-1) *300000 + start_times.start_time
as boundry
from T1
INNER JOIN start_times ON start_times.id= T1.id
) -- increment the number of 5 min added on each row and later full join boundries table with original data
However, this limits me to the number of rows present for an id in the original data table, and if the timestamps are spread out, the number of rows cannot cover the number of 5 min intervals that need to be added.
sample data:
initial data:
|-----------|------------------|------------------|
| id | value | timestamp |
|-----------|------------------|------------------|
| 1 | 3 | 12:00:01.011 |
|-----------|------------------|------------------|
| 1 | 4 | 12:03:30.041 |
|-----------|------------------|------------------|
| 1 | 5 | 12:12:20.231 |
|-----------|------------------|------------------|
| 1 | 3 | 15:00:00.312 |
data after my query:
|-----------|------------------|------------------|
| id | value | timestamp (UNIX) |
|-----------|------------------|------------------|
| 1 | 3 | 12:00:01 |
|-----------|------------------|------------------|
| 1 | 4 | 12:03:30 |
|-----------|------------------|------------------|
| 1 | NULL | 12:05:01 | <-- Data from "boundries"
|-----------|------------------|------------------|
| 1 | NULL | 12:10:01 | <-- Data from "boundries"
|-----------|------------------|------------------|
| 1 | 5 | 12:12:20 |
|-----------|------------------|------------------|
| 1 | NULL | 12:15:01 | <-- Data from "boundries"
|-----------|------------------|------------------|
| 1 | NULL | 12:20:01 | <-- Data from "boundries"
|-----------|------------------|------------------| <-- Jumping directly to 15:00:00 (WRONG! :( need to insert more 5 min breaks here )
| 1 | 3 | 15:00:00 |
I was thinking of creating a temporary table inside HIVE and filling it with x rows representing 5 min intervals from the starttime to the endtime of the data table, but I couldn't find any way of accomplishing that.
Any way of using "for loops" ? Any suggestions would be appreciated.
Thanks
You can try calculating the difference between the current timestamp and the next one, dividing by 300 to get the number of ranges, producing a string of spaces with length = num_ranges, and exploding it to generate rows.
Demo:
with your_table as (--initial data example
select stack (3,
1,3 ,'2020-01-01 12:00:01.011',
1,4 ,'2020-01-01 12:03:30.041',
1,5 ,'2020-01-01 12:20:20.231'
) as (id ,value ,ts )
)
select id ,value, ts, next_ts,
diff_sec,num_intervals,
from_unixtime(unix_timestamp(ts)+h.i*300) new_ts, coalesce(from_unixtime(unix_timestamp(ts)+h.i*300),ts) as calculated_timestamp
from
(
select id ,value ,ts, next_ts, (unix_timestamp(next_ts)-unix_timestamp(ts)) diff_sec,
floor((unix_timestamp(next_ts)-unix_timestamp(ts))/300 --diff in seconds/5 min
) num_intervals
from
(
select id ,value ,ts, lead(ts) over(order by ts) next_ts
from your_table
) s
)s
lateral view outer posexplode(split(space(cast(s.num_intervals as int)),' ')) h as i,x --this will generate rows
Result:
id value ts next_ts diff_sec num_intervals new_ts calculated_timestamp
1 3 2020-01-01 12:00:01.011 2020-01-01 12:03:30.041 209 0 2020-01-01 12:00:01 2020-01-01 12:00:01
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:03:30 2020-01-01 12:03:30
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:08:30 2020-01-01 12:08:30
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:13:30 2020-01-01 12:13:30
1 4 2020-01-01 12:03:30.041 2020-01-01 12:20:20.231 1010 3 2020-01-01 12:18:30 2020-01-01 12:18:30
1 5 2020-01-01 12:20:20.231 \N \N \N \N 2020-01-01 12:20:20.231
Additional rows were added. I left all intermediate columns for debugging purposes.
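If you want the generated rows to carry a NULL value (as in the desired output) instead of repeating the previous row's value, the posexplode position can be used: i = 0 is the original row, anything else is a filler row. A sketch building on the demo above (my addition, using the same mocked your_table):
with your_table as (--initial data example
select stack (3,
1,3 ,'2020-01-01 12:00:01.011',
1,4 ,'2020-01-01 12:03:30.041',
1,5 ,'2020-01-01 12:20:20.231'
) as (id ,value ,ts )
)
select id,
       case when h.i = 0 then value end as value,  -- NULL for generated rows
       coalesce(from_unixtime(unix_timestamp(ts)+h.i*300), ts) as calculated_timestamp
from
(
 select id, value, ts,
        coalesce(floor((unix_timestamp(lead(ts) over(partition by id order by ts))-unix_timestamp(ts))/300), 0) as num_intervals
 from your_table
) s
lateral view outer posexplode(split(space(cast(s.num_intervals as int)),' ')) h as i,x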
A recursive query could be helpful here, but Hive does not support these (more info).
You may consider creating the table outside of Hive or writing a UDF.
Either way this query can be expensive, so the use of materialized views/tables is recommended depending on how frequently you run it.
The example below uses a UDF, inbetween, created with pyspark, to run the query. It:
generates the values between the min and max timestamp of the dataset, using CTEs and the UDF, in a temporary result intervals
generates all possible id/interval combinations with an expensive cross join in possible_records
uses a left join to retrieve the records with actual values (for demonstration purposes I've represented the timestamp value as just the time string)
The code below shows how it was evaluated using Hive.
Example Code
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType,ArrayType
inbetween = lambda min_value,max_value : [*range(min_value,max_value,5*60)]
udf_inbetween = udf(inbetween,ArrayType(IntegerType()))
sqlContext.udf.register("inbetween",udf_inbetween)
sqlContext.sql("""
WITH max_timestamp(t) as (
select max(timestamp) as t from initial_data2
),
min_timestamp(t) as (
select min(timestamp) as t from initial_data2
),
intervals as (
select explode(inbetween(unix_timestamp(mint.t),unix_timestamp(maxt.t))) as interval_time FROM
min_timestamp mint, max_timestamp maxt
),
unique_ids as (
select distinct id from initial_data2
),
interval_times as (
select interval_time from (
select
cast(from_unixtime(interval_time) as timestamp) as interval_time
from
intervals
UNION
select distinct d.timestamp as interval_time from initial_data2 d
)
order by interval_time asc
),
possible_records as (
select
distinct
d.id,
i.interval_time
FROM
interval_times i, unique_ids d
)
select
p.id,
d.value,
split(cast(p.interval_time as string)," ")[1] as timestamp
FROM
possible_records p
LEFT JOIN
initial_data2 d ON d.id = p.id and d.timestamp = p.interval_time
ORDER BY p.id, p.interval_time
""").show(20)
Output
+---+-----+---------+
| id|value|timestamp|
+---+-----+---------+
| 1| 3| 12:00:01|
| 1| 4| 12:03:30|
| 1| null| 12:05:01|
| 1| null| 12:10:01|
| 1| 5| 12:12:20|
| 1| null| 12:15:01|
| 1| null| 12:20:01|
| 1| null| 12:25:01|
| 1| null| 12:30:01|
| 1| null| 12:35:01|
| 1| null| 12:40:01|
| 1| null| 12:45:01|
| 1| null| 12:50:01|
| 1| null| 12:55:01|
| 1| null| 13:00:01|
| 1| null| 13:05:01|
| 1| null| 13:10:01|
| 1| null| 13:15:01|
| 1| null| 13:20:01|
| 1| null| 13:25:01|
+---+-----+---------+
only showing top 20 rows
Data Prep to replicate
from pyspark.sql import Row

raw_data1 = [
{"id":1,"value":3,"timestam":"12:00:01"},
{"id":1,"value":4,"timestam":"12:03:30"},
{"id":1,"value":5,"timestam":"12:12:20"},
{"id":1,"value":3,"timestam":"15:00:00"},
]
raw_data = [*map(lambda entry : Row(**entry),raw_data1)]
initial_data = sqlContext.createDataFrame(raw_data,schema="id int, value int, timestam string ")
initial_data.createOrReplaceTempView('initial_data')
sqlContext.sql("create or replace temp view initial_data2 as select id,value,cast(timestam as timestamp) as timestamp from initial_data")

Sql Server Aggregation or Pivot Table Query

I'm trying to write a query that will tell me, for each week, the number of customers who had a certain number of transactions. I don't know where to start, but I'd assume it involves an aggregate or pivot function. I'm working in SQL Server Management Studio.
Currently the data is looks like where the first column is the customer id and each subsequent column is a week :
|Customer| 1 | 2| 3 |4 |
----------------------
|001 |1 | 0| 2 |2 |
|002 |0 | 2| 1 |0 |
|003 |0 | 4| 1 |1 |
|004 |1 | 0| 0 |1 |
I'd like to see a return like the following:
|Visits |1 | 2| 3 |4 |
----------------------
|0 |2 | 2| 1 |0 |
|1 |2 | 0| 2 |2 |
|2 |0 | 1| 1 |1 |
|4 |0 | 1| 0 |0 |
What I want is the count of customers per number of transactions, per week. E.g. during the 1st week, 2 customers (002 and 003) had 0 transactions, 2 customers (001 and 004) had 1 transaction, and zero customers had more than 1 transaction.
The query below will get you the result you want, but note that it has the column names hard coded. It's easy to add more week columns, but if the number of columns is unknown then you might want to look into a solution using dynamic SQL (which would require accessing the information schema to get the column names). It's not that hard to turn it into a fully dynamic version though.
select
Visits
, coalesce([1],0) as Week1
, coalesce([2],0) as Week2
, coalesce([3],0) as Week3
, coalesce([4],0) as Week4
from (
select *, count(*) c from (
select '1' W, week1 Visits from t union all
select '2' W, week2 Visits from t union all
select '3' W, week3 Visits from t union all
select '4' W, week4 Visits from t ) a
group by W, Visits
) x pivot ( max (c) for W in ([1], [2], [3], [4]) ) as pvt;
In the query your table is called t and the output is:
Visits Week1 Week2 Week3 Week4
0 2 2 1 1
1 2 0 2 2
2 0 1 1 1
4 0 1 0 0
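For completeness, here is a rough sketch of the dynamic variant mentioned above (my addition, not part of the original answer). It assumes SQL Server 2017+ for STRING_AGG and that every column of t other than Customer is a week column; the COALESCE/Week aliases of the static query are omitted for brevity:
DECLARE @cols nvarchar(max), @unpivot nvarchar(max), @sql nvarchar(max);

-- Build the pivot column list and the UNION ALL that unpivots whatever week columns exist.
SELECT
    @cols    = STRING_AGG(QUOTENAME(c.name), ', '),
    @unpivot = STRING_AGG('SELECT ''' + c.name + ''' W, ' + QUOTENAME(c.name) + ' Visits FROM t',
                          ' UNION ALL ')
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('t') AND c.name <> 'Customer';

SET @sql = N'
SELECT Visits, ' + @cols + N'
FROM (
    SELECT W, Visits, COUNT(*) c
    FROM (' + @unpivot + N') a
    GROUP BY W, Visits
) x PIVOT (MAX(c) FOR W IN (' + @cols + N')) AS pvt;';

EXEC sp_executesql @sql;   -- NULL (rather than 0) appears where a count is missing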

Ignoring null values in a postgresql rank() window function

I am writing a SQL query using PostgreSQL that needs to rank people that "arrive" at some location. Not everyone arrives however. I am using a rank() window function to generate arrival ranks, but in the places where the arrival time is null, rather than returning a null rank, the rank() aggregate function just treats them as if they arrived after everyone else. What I want to happen is that these no-shows get a rank of NULL instead of this imputed rank.
Here is an example. Suppose I have a table dinner_show_up that looks like this:
| Person | arrival_time | Restaurant |
+--------+--------------+------------+
| Dave | 7 | in_and_out |
| Mike | 2 | in_and_out |
| Bob | NULL | in_and_out |
Bob never shows up. The query I'm writing would be:
select Person,
rank() over (partition by Restaurant order by arrival_time asc)
as arrival_rank
from dinner_show_up;
And the result will be
| Person | arrival_rank |
+--------+--------------+
| Dave | 2 |
| Mike | 1 |
| Bob | 3 |
What I want to happen instead is this:
| Person | arrival_rank |
+--------+--------------+
| Dave | 2 |
| Mike | 1 |
| Bob | NULL |
Just use a case statement around the rank():
select Person,
(case when arrival_time is not null
then rank() over (partition by Restaurant order by arrival_time asc)
end) as arrival_rank
from dinner_show_up;
A more general solution for all aggregate functions, not only rank(), is to partition by 'arrival_time is not null' in the over() clause. That will cause all null arrival_time rows to be placed into the same group and given the same rank, leaving the non-null rows to be ranked relative only to each other.
For the sake of a meaningful example, I mocked up a CTE having more rows than the initial problem set. Please forgive the wide rows, but I think they better contrast the differing techniques.
with dinner_show_up("person", "arrival_time", "restaurant") as (values
('Dave' , 7, 'in_and_out')
,('Mike' , 2, 'in_and_out')
,('Bob' , null, 'in_and_out')
,('Peter', 3, 'in_and_out')
,('Jane' , null, 'in_and_out')
,('Merry', 5, 'in_and_out')
,('Sam' , 5, 'in_and_out')
,('Pip' , 9, 'in_and_out')
)
select
person
,case when arrival_time is not null then rank() over ( order by arrival_time) end as arrival_rank_without_partition
,case when arrival_time is not null then rank() over (partition by arrival_time is not null order by arrival_time) end as arrival_rank_with_partition
,case when arrival_time is not null then percent_rank() over ( order by arrival_time) end as arrival_pctrank_without_partition
,case when arrival_time is not null then percent_rank() over (partition by arrival_time is not null order by arrival_time) end as arrival_pctrank_with_partition
from dinner_show_up
This query gives the same results for arrival_rank_with/without_partition. However, the results for percent_rank() do differ: without_partition is wrong, ranging from 0% to 71.4%, whereas with_partition correctly gives pctrank() ranging from 0% to 100%.
This same pattern applies to the ntile() aggregate function, as well.
It works by separating all null values from non-null values for purposes of the ranking. This ensures that Jane and Bob are excluded from the percentile ranking of 0% to 100%.
|person|arrival_rank_without_partition|arrival_rank_with_partition|arrival_pctrank_without_partition|arrival_pctrank_with_partition|
+------+------------------------------+---------------------------+---------------------------------+------------------------------+
|Jane |null |null |null |null |
|Bob |null |null |null |null |
|Mike |1 |1 |0 |0 |
|Peter |2 |2 |0.14 |0.2 |
|Sam |3 |3 |0.28 |0.4 |
|Merry |4 |4 |0.28 |0.4 |
|Dave |5 |5 |0.57 |0.8 |
|Pip |6 |6 |0.71 |1.0 |
select Person,
rank() over (partition by Restaurant order by arrival_time asc)
as arrival_rank
from dinner_show_up
where arrival_time is not null
union
select Person,NULL as arrival_rank
from dinner_show_up
where arrival_time is null;
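A small note on the UNION approach above (my addition): the order of rows coming out of a UNION is unspecified, so if you want the ranked diners listed first, wrap it and add an ORDER BY:
select Person, arrival_rank
from (
    select Person,
           rank() over (partition by Restaurant order by arrival_time asc) as arrival_rank
    from dinner_show_up
    where arrival_time is not null
    union all
    select Person, NULL
    from dinner_show_up
    where arrival_time is null
) ranked
order by arrival_rank nulls last;
UNION ALL is used here because the two branches can never produce the same row, so the de-duplication step of a plain UNION is unnecessary.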