How to Add Extra Rules vs the Previous Row when Ranking in SQL?

Let's say I have a table that shows the status changes of a customer support ticket.
timestamp            date        status    rank  dense_rank  row_number
2021-03-22 05:03:22  2021-03-22  OPEN      1     1           1
2021-03-24 07:10:05  2021-03-24  DECLINED  2     2           2
2021-04-04 09:01:10  2021-04-24  DECLINED  3     3           3 (at random)
2021-04-04 09:01:10  2021-04-24  OPEN      3     3           4 (at random)
If we take a look at the 3rd and 4th records, they have the exact same timestamp.
I want to sort this consistently based on the timestamp, ascending (not row_number, because ties are numbered at random; not rank or dense_rank, because they will not give an ascending sequence).
Now we have an additional rule: a ticket can't have the same status twice in a row. Incorporating this rule into the case above, the sequence of the records should be:
open (2021-03-22) - declined (2021-03-24) - open (2021-04-24) - declined (2021-04-24)
Are there any ways to incorporate this additional rule into rank() over (partition by ... order by ...)?

Assuming that you want this sequence over the entire table, we can try:
SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp,
CASE status WHEN 'OPEN' THEN 1 ELSE 2 END) AS rn
FROM yourTable
ORDER BY timestamp, CASE status WHEN 'OPEN' THEN 1 ELSE 2 END;
The second level of the ORDER BY clause sorts open records before records of any other status. If you wanted this sequence repeated for a given set of records within the table, then you would want to add a PARTITION BY clause to the call to ROW_NUMBER.
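As a sanity check, the tie-breaking ORDER BY can be run against the sample rows using SQLite's window functions from Python. The table and column names (tickets, ts, status) are stand-ins, since the original table isn't shown:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tickets(ts TEXT, status TEXT);
INSERT INTO tickets VALUES
  ('2021-03-22 05:03:22', 'OPEN'),
  ('2021-03-24 07:10:05', 'DECLINED'),
  ('2021-04-04 09:01:10', 'DECLINED'),
  ('2021-04-04 09:01:10', 'OPEN');
""")
# For equal timestamps, OPEN sorts before any other status.
rows = con.execute("""
    SELECT ts, status,
           ROW_NUMBER() OVER (
               ORDER BY ts, CASE status WHEN 'OPEN' THEN 1 ELSE 2 END) AS rn
    FROM tickets
    ORDER BY ts, CASE status WHEN 'OPEN' THEN 1 ELSE 2 END
""").fetchall()
```

The two rows sharing the timestamp 2021-04-04 09:01:10 now come out OPEN first, DECLINED second, giving the open-declined-open-declined sequence the question asked for.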

Related

How do I use lag to get the previous row before a specific time window of data?

Every day I create a table that looks like this:
user_id  received_at  age_pref  ethnicity_pref
1        10:01        18-28     open_to_all
2        10:05        18-23     open_to_all
1        10:08        18-30     open_to_all
2        10:07        18-25     Hispanic/Latino
3        10:09        56-33     White
It's a table that lists the actions a user takes from 10am-11am. As you can see, there are 3 distinct user IDs.
Using this, I'm trying to create another table using LAG to see whether the previous value changed. The problem is that the first row per user is inaccurate, because there is no way to tell whether an attribute changed without the row that precedes this set of data (it may have occurred at 9:30am). How do I get the previous received_at row for each user ID in this table, but only one per user_id? I want the new table to look like this, with the new records prepended at the beginning:
user_id  received_at  age_pref  ethnicity_pref
1        9:48         20-30     asian
2        9:52         30-32     white
3        9:58         28-30     open_to_all
1        10:01        18-28     open_to_all
2        10:05        18-23     open_to_all
1        10:08        18-30     open_to_all
2        10:07        18-25     Hispanic/Latino
3        10:09        56-33     White
Note there are several rows that exist before this time interval. I want the most recent one prepended to the table for each user_id that exists in the table.
Basically, I want to include one more row for EACH user_id before the time window, so that my change-tracking table is accurate, since LAG will always return NULL for the first row.
I guess you can union all the following query:
select distinct on (user_id) user_id, received_at, age_pref, ethnicity_pref
from t
where received_at < '10:00'
order by user_id, received_at desc
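DISTINCT ON is PostgreSQL-specific. A portable sketch of the full idea, UNION ALL plus a latest-pre-window-row-per-user subquery via ROW_NUMBER, run against SQLite with invented pre-10:00 rows (times zero-padded so string comparison sorts correctly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t(user_id INT, received_at TEXT, age_pref TEXT, ethnicity_pref TEXT);
INSERT INTO t VALUES
  (1, '09:30', '19-29', 'white'),        -- older pre-window row, should be skipped
  (1, '09:48', '20-30', 'asian'),
  (2, '09:52', '30-32', 'white'),
  (3, '09:58', '28-30', 'open_to_all'),
  (1, '10:01', '18-28', 'open_to_all'),
  (2, '10:05', '18-23', 'open_to_all'),
  (2, '10:07', '18-25', 'Hispanic/Latino'),
  (1, '10:08', '18-30', 'open_to_all'),
  (3, '10:09', '56-33', 'White');
""")
rows = con.execute("""
    SELECT user_id, received_at, age_pref, ethnicity_pref
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id
                                       ORDER BY received_at DESC) AS rn
          FROM t
          WHERE received_at < '10:00')
    WHERE rn = 1                      -- latest pre-window row per user
    UNION ALL
    SELECT user_id, received_at, age_pref, ethnicity_pref
    FROM t
    WHERE received_at >= '10:00'      -- the 10am-11am window itself
    ORDER BY received_at
""").fetchall()
```

The outer ORDER BY applies to the whole compound query, so the prepended rows sort ahead of the window rows.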

How can I create row numbers defined only by the previous row's value?

This is a task previously accomplished by a cursor in a really old T-SQL script that I now have to get rid of. For a person in a table ordered by dates, I have a value indicating that a sequence is starting, then continuing, and then a new one starting (indicating the old one has ended). I cannot figure out how to give each of these sequences row numbers. I had something similar in an R codebase a few years ago that I used RLE for, but this has me stumped. I need to get from this:
ID STATUS DATE A B
1 START 2000-01-01 1 1
1 CONTINUATION_A&B 2000-01-02 NULL NULL
1 CONTINUATION_A&B 2000-01-03 NULL NULL
1 START 2000-01-04 1 1
1 START 2000-01-05 1 1
1 CONTINUATION_A 2000-01-06 NULL NULL
1 CONTINUATION_A 2000-01-07 NULL NULL
To this:
ID STATUS DATE A B
1 START 2000-01-01 1 1
1 CONTINUATION_A&B 2000-01-02 2 2
1 CONTINUATION_A&B 2000-01-03 3 3
1 START 2000-01-04 1 1
1 START 2000-01-05 1 1
1 CONTINUATION_A 2000-01-06 2 1
1 CONTINUATION_A 2000-01-07 3 1
Thanks in advance.
with A as (
select *,
count(case when status = 'START' then 1 end) over (order by "date") as grp
from T
)
select *,
count(case when status in ('START', 'CONTINUATION_A', 'CONTINUATION_A&B') then 1 end)
over (partition by grp order by "date") as A,
count(case when status in ('START', 'CONTINUATION_A&B') then 1 end)
over (partition by grp order by "date") as B
from A;
https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=225a1e37236c18fbb7bdbb76d7ad93dc
This assumes that counters always begin at one. That could be adjusted if necessary with expressions like these:
min(A) over (partition by grp) - 1 /* offset for A */
min(B) over (partition by grp) - 1 /* offset for B */
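The running-count trick above is standard SQL, so it can be verified outside SQL Server too. A minimal sketch against SQLite, reproducing the expected A and B columns for the sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE T(id INT, status TEXT, date TEXT);
INSERT INTO T VALUES
  (1, 'START',            '2000-01-01'),
  (1, 'CONTINUATION_A&B', '2000-01-02'),
  (1, 'CONTINUATION_A&B', '2000-01-03'),
  (1, 'START',            '2000-01-04'),
  (1, 'START',            '2000-01-05'),
  (1, 'CONTINUATION_A',   '2000-01-06'),
  (1, 'CONTINUATION_A',   '2000-01-07');
""")
rows = con.execute("""
    WITH g AS (
      SELECT *,
             COUNT(CASE WHEN status = 'START' THEN 1 END)
               OVER (ORDER BY date) AS grp      -- running count of STARTs
      FROM T
    )
    SELECT date,
           COUNT(CASE WHEN status IN ('START','CONTINUATION_A','CONTINUATION_A&B')
                      THEN 1 END) OVER (PARTITION BY grp ORDER BY date) AS A,
           COUNT(CASE WHEN status IN ('START','CONTINUATION_A&B')
                      THEN 1 END) OVER (PARTITION BY grp ORDER BY date) AS B
    FROM g
    ORDER BY date
""").fetchall()
```

Each START bumps the running count, so grp identifies one sequence; the conditional counts within each grp then restart from 1 exactly as in the desired output.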
Not an answer, but important for eventually answering the question and too long for a comment (hence community wiki).
I see this:
in a table ordered by dates
... but I don't see any dates in the question. Where is the date field in your sample data? We need to at least know its name to give you good code.
One thing to get drilled into your head is tables never have any inherent or natural order. While the primary key/clustered index or insert order may seem like a natural table order, there are plenty of things that can mess with this, and unless you are explicit in your code about the order of your records the database is free to give you results in any order it finds convenient. That is, if there's not a fully-deterministic ORDER BY clause, the ordering of the results for the same query can and does change from moment to moment, depending on things like what other queries are currently running to access the same data or what pages or indexes are already in memory.
This means we need to be able to reference a field in the table to enforce the desired ordering... we need to know about that date field to write the correct SQL statement.

Query to select appropriate row and calculate elapsed time

I need some help in coming up with a query that will return the answer to the question “How long has a Help Desk Ticket been owned by the currently assigned group?” Following is a subset of the data model with some sample data:
Help Desk Cases
Case ID (PK)  Assigned Person  Assigned Group
123456        Robert           Hardware

Help Desk Case Assignment History
Case ID (PK)  Seq # (PK)  Assigned Group  Assigned Person  Elapsed Time  Row Added Date/Time
123456        1           Hardware                         10
123456        2           Software                         2
123456        3           Hardware        Sam              1
123456        4           Software        Sophie           6
123456        5           Hardware                         8
123456        6           Hardware        Sam              3
123456        7           Hardware        Robert
The Elapsed Time column for the most recent row (Seq #7) is not updated until a subsequent row (Seq #8) is written, so I don’t think I can use an aggregate function. For the sample data above, I need to get the Row Added column from Seq # 5 and subtract it from the current date to get the total amount of time the case has been most recently assigned to the Hardware group (we ignore previous assignments such as Seq # 1 and Seq # 3).
The Query output for the example above should be:
Case ID Assigned Group Assigned Person Time Owned
123456 Hardware Robert Current Date - Seq #5 Row Added Date/Time
With Oracle 12c and higher...
select case_id,
last_assigned_group as assigned_group,
last_assigned_person as assigned_person,
nvl(last_row_added, systimestamp) - first_row_added as time_owned
from help_desk_case_assignment_history
match_recognize (
partition by case_id
order by seq#
measures
first(row_added) as first_row_added,
last(row_added) as last_row_added,
last(assigned_group) as last_assigned_group,
last(assigned_person) as last_assigned_person
one row per match
after match skip past last row
pattern (
assignment_run* case_end
)
define
assignment_run as (assigned_group = next(assigned_group)),
case_end as (elapsed_time is null or next(assigned_group) is null)
)
;
In human words: for each help desk case ID, find the last uninterrupted "run" of assignments within the same group. For that last "run", identify its starting time, ending time, and ending person, and display the found values.
With Oracle 11g and lower...
with xyz as (
select X.*,
case when lnnvl(assigned_group = lag(assigned_group) over (partition by case_id order by seq#)) then seq# end as assignment_run_start
from help_desk_case_assignment_history X
),
xyz2 as (
select X.*,
last_value(assignment_run_start) ignore nulls over (partition by case_id order by seq#) as assignment_run_id
from xyz X
),
xyz3 as (
select case_id, assigned_group, assignment_run_id,
max(assigned_person) keep (dense_rank last order by seq#) as last_assigned_person,
nvl(max(row_added) keep (dense_rank last order by seq#), systimestamp)
- min(row_added) keep (dense_rank first order by seq#)
as time_owned,
row_number() over (partition by case_id order by assignment_run_id desc) as last_group_ind
from xyz2 X
group by case_id, assigned_group, assignment_run_id
)
select case_id, assigned_group, last_assigned_person as assigned_person, time_owned
from xyz3
where last_group_ind = 1
;
Perhaps ugly, but pretty straightforward and working.
In human words:
Identify the boundaries (starts) of assignment runs as increasing numeric IDs.
Extend the found assignment run starts to the whole assignment runs.
Calculate the assignments' run times and last assigned persons.
Restrict the previous calculation to the last (by their ID) assignment run only.
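MATCH_RECOGNIZE and IGNORE NULLS are Oracle-specific, but the same islands logic can be written with portable window functions: LAG to flag run starts, then a running SUM to number the runs. A sketch against SQLite, with invented column names and Row Added dates since the question omits them:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hist(case_id INT, seq INT, assigned_group TEXT,
                  assigned_person TEXT, row_added TEXT);
INSERT INTO hist VALUES
  (123456, 1, 'Hardware', NULL,     '2021-01-01'),
  (123456, 2, 'Software', NULL,     '2021-01-11'),
  (123456, 3, 'Hardware', 'Sam',    '2021-01-13'),
  (123456, 4, 'Software', 'Sophie', '2021-01-14'),
  (123456, 5, 'Hardware', NULL,     '2021-01-20'),
  (123456, 6, 'Hardware', 'Sam',    '2021-01-28'),
  (123456, 7, 'Hardware', 'Robert', '2021-01-31');
""")
row = con.execute("""
    WITH marked AS (
      SELECT *,
             CASE WHEN assigned_group =
                       LAG(assigned_group) OVER (PARTITION BY case_id ORDER BY seq)
                  THEN 0 ELSE 1 END AS run_start   -- 1 where the group changes
      FROM hist
    ),
    runs AS (
      SELECT *,
             SUM(run_start) OVER (PARTITION BY case_id ORDER BY seq) AS run_id
      FROM marked
    )
    SELECT case_id, assigned_group, MIN(row_added) AS owned_since
    FROM runs
    GROUP BY case_id, run_id
    ORDER BY run_id DESC
    LIMIT 1
""").fetchone()
```

The last run is Seq #5-#7 (all Hardware), so owned_since comes from Seq #5; "Time Owned" is then the current date minus that value, computed however your dialect subtracts timestamps.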

SQL Find latest record only if COMPLETE field is 0

I have a table with multiple records submitted by a user. In each record is a field called COMPLETE to indicate if a record is fully completed or not.
I need a way to get the latest records of the user where COMPLETE is 0, LOCATION, DATE are the same and no additional record exist where COMPLETE is 1. In each record there are additional fields such as Type, AMOUNT, Total, etc. These can be different, even though the USER, LOCATION, and DATE are the same.
There is a SUB_DATE field and ID field that denote the day the submission was made and auto incremented ID number. Here is the table:
ID NAME LOCATION DATE COMPLETE SUB_DATE TYPE1 AMOUNT1 TYPE2 AMOUNT2 TOTAL
1 user1 loc1 2017-09-15 1 2017-09-10 Food 12.25 Hotel 65.54 77.79
2 user1 loc1 2017-09-15 0 2017-09-11 Food 12.25 NULL 0 12.25
3 user1 loc2 2017-08-13 0 2017-09-05 Flight 140 Food 5 145.00
4 user1 loc2 2017-08-13 0 2017-09-10 Flight 140 NULL 0 140
5 user1 loc3 2017-07-14 0 2017-07-15 Taxi 25 NULL 0 25
6 user1 loc3 2017-08-25 1 2017-08-26 Food 45 NULL 0 45
The result I would like to retrieve is ID 4, because its SUB_DATE is later than ID 3's, it has the same Name, Location, and Date information, and there is no COMPLETE record with a value of 1.
I would also like to retrieve ID 5, since it is the latest record for that User, Location, and Date, and COMPLETE is 0.
I would also appreciate it if you could explain your answer to help me understand what is happening in the solution.
Not sure if I fully understood, but try this:
SELECT *
FROM (
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
) a
WHERE CompleteForNameLocationAndDate = 0 AND
SUB_DATE = LastSubDate
So what we have done here:
First, if you run just the inner query in Management Studio, you will see what that does:
The first max function will partition the data in the table by each unique Name,Location,Date set.
In the case of your data, ID 1 & 2 are the first partition, 3&4 are the second partition, 5 is the 3rd partition and 6 is the 4th partition.
So for each of these partitions it will get the max value in the COMPLETE column. Therefore any partition with a 1 as its max value has been completed.
Note also the CONVERT function. This is because COMPLETE is of datatype BIT (1 or 0) and the MAX function does not work with that datatype, so we convert to INT. If your COMPLETE column is type INT, you can take the CONVERT out.
The second MAX function partitions by unique Name, Location and Date again, but this time we take the max SUB_DATE, which gives us the date of the latest record for each Name, Location, Date.
So we take that query and add it to a derived table which for simplicity we call a. We need to do this because SQL Server doesn't allow windowed functions in the WHERE clause of queries. A windowed function is one that makes use of the OVER keyword as we have done. In an ideal world, SQL would let us do
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
WHERE MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) = 0 AND
SUB_DATE = MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE)
But it doesn't allow it so we have to use the derived table.
So then we basically SELECT everything from our derived table Where
CompleteForNameLocationAndDate = 0
which gives the Name, Location, Date partitions that do not have any record marked as complete.
Then we filter further asking for only the latest record for each partition
SUB_DATE = LastSubDate
Hope that makes sense, not sure what level of detail you need?
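To check the logic end to end, here is the derived-table query run against the question's six rows in SQLite (COMPLETE is stored as INT here, so the CONVERT is unnecessary):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE expenses(id INT, name TEXT, location TEXT, date TEXT,
                      complete INT, sub_date TEXT);
INSERT INTO expenses VALUES
  (1, 'user1', 'loc1', '2017-09-15', 1, '2017-09-10'),
  (2, 'user1', 'loc1', '2017-09-15', 0, '2017-09-11'),
  (3, 'user1', 'loc2', '2017-08-13', 0, '2017-09-05'),
  (4, 'user1', 'loc2', '2017-08-13', 0, '2017-09-10'),
  (5, 'user1', 'loc3', '2017-07-14', 0, '2017-07-15'),
  (6, 'user1', 'loc3', '2017-08-25', 1, '2017-08-26');
""")
rows = con.execute("""
    SELECT id FROM (
      SELECT *,
             MAX(complete) OVER (PARTITION BY name, location, date) AS any_complete,
             MAX(sub_date) OVER (PARTITION BY name, location, date) AS last_sub
      FROM expenses
    )
    WHERE any_complete = 0        -- no completed record in the partition
      AND sub_date = last_sub     -- latest submission in the partition
    ORDER BY id
""").fetchall()
```

IDs 1/2 and 6 are filtered out because their partitions contain a COMPLETE = 1 row, and ID 3 loses to ID 4 on SUB_DATE, leaving exactly IDs 4 and 5.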
As a side, I would look at restructuring your tables (unless of course you have simplified to better explain this problem) as follows:
(Assuming the table in your examples is called Booking)
tblBooking
BookingID
PersonID
LocationID
Date
Complete
SubDate
tblPerson
PersonID
PersonName
tblLocation
LocationID
LocationName
tblType
TypeID
TypeName
tblBookingType
BookingTypeID
BookingID
TypeID
Amount
This way, if you ever want to add Type3 or Type4 to your booking information, you don't need to alter your table layout.

Exponential decay in SQL for different dates page views

I have data with the number of product views on a webpage by date, over a 30-day time frame. I am trying to create an exponential decay model in SQL. I am using exponential decay because I want to weight the latest events over older ones. I'm not sure how to write this in SQL without getting an error. I have never built this type of model before, so I want to make sure I am doing it correctly too.
=================================
Data looks like this
product views date
a 1 2014-05-15
a 2 2014-05-01
b 2 2014-05-10
c 4 2014-05-02
c 1 2014-05-12
d 3 2014-05-11
================================
Code:
create table decay model as
select product,views,date
case when......
from table abc
group by product;
not sure what to write to do the model
I want to penalize products that were viewed longer ago vs products that were viewed more recently
Thank you for your help
You can do it like this:
Choose the partition in which you want to apply exponential decay, then order descending by date within each group.
Use the function ROW_NUMBER() with ascending ordering to get the row numbering within each subgroup.
Calculate pow(your_variable_in_[0,1], rownum - 1) and apply it to your result.
Code might look like this (might work in Oracle SQL or db2):
SELECT <your_partitioning>, date, <whatever>*power(<your_variable>,rownum-1)
FROM (SELECT a.*
, ROW_NUMBER() OVER (PARTITION BY <your_partitioning> ORDER BY a.date DESC) AS rownum
FROM YOUR_TABLE a)
ORDER BY <your_partitioning>, date DESC
EDIT: I read again over your problem and think I understood now what you asked for, so here is a solution which might work (decay factor is 0.9 here):
SELECT product, sum(adjusted_views) // (i)
FROM (SELECT product, views*power(0.9, rownum-1) AS adjusted_views, date, rownum // (ii)
FROM (SELECT product, views, date // (iii)
, ROW_NUMBER() OVER (PARTITION BY product ORDER BY a.date DESC) AS rownum
FROM YOUR_TABLE a)
ORDER BY product, date DESC)
GROUP BY product
The inner select statement (iii) creates a temporary table that might look like this
product views date rownum
--------------------------------------------------
a 1 2014-05-15 1
a 2 2014-05-14 2
a 2 2014-05-13 3
b 2 2014-05-10 1
b 3 2014-05-09 2
b 2 2014-05-08 3
b 1 2014-05-07 4
The next query (ii) then uses the rownumber to construct an exponentially decaying factor 0.9^(rownum-1) and applies it to views. The result is
product adjusted_views date rownum
--------------------------------------------------
a 1 * 0.9^0 2014-05-15 1
a 2 * 0.9^1 2014-05-14 2
a 2 * 0.9^2 2014-05-13 3
b 2 * 0.9^0 2014-05-10 1
b 3 * 0.9^1 2014-05-09 2
b 2 * 0.9^2 2014-05-08 3
b 1 * 0.9^3 2014-05-07 4
In a last step (the outer query) the adjusted views are summed up, as this seems to be the quantity you are interested in.
Note, however, that in order to be consistent there should be regular distances between the dates, e.g., always one day apart (not one day here and a month there), because otherwise such rows would be weighted similarly although they shouldn't be.
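One way around the irregular-spacing caveat (an assumption of mine, not part of the answer above) is to decay by the actual day gap instead of the row number. A sketch in SQLite using the question's sample data, decaying toward the latest date in the set; pow() is registered from Python because SQLite's built-in math functions are optional:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite may lack pow() unless compiled with math functions; supply our own.
con.create_function("pow", 2, lambda b, e: b ** e)
con.executescript("""
CREATE TABLE abc(product TEXT, views INT, date TEXT);
INSERT INTO abc VALUES
  ('a', 1, '2014-05-15'), ('a', 2, '2014-05-01'),
  ('b', 2, '2014-05-10'),
  ('c', 4, '2014-05-02'), ('c', 1, '2014-05-12'),
  ('d', 3, '2014-05-11');
""")
# Weight each view count by 0.9^(days before the reference date 2014-05-15),
# so a 14-day-old row decays 14 steps even if no rows exist in between.
rows = con.execute("""
    SELECT product,
           SUM(views * pow(0.9, julianday('2014-05-15') - julianday(date)))
             AS decayed_views
    FROM abc
    GROUP BY product
    ORDER BY product
""").fetchall()
```

With row numbers, product a's 2014-05-01 row would decay only one step; with day gaps it decays fourteen, which matches the stated goal of penalizing older views.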