Appropriate ranking - sql

I have this kind of data:
contract nr startdate
1 12 01-01-2000
1 12 03-01-2000
1 22 07-01-2000
2 77 12-04-2001
2 78 17-04-2001
My simple goal here is to rank each number within a specific contract, taking into account the start date. The output should look like this:
contract nr startdate my_rank
1 12 01-01-2000 1
1 12 03-01-2000 1
1 22 07-01-2000 2
2 77 12-04-2001 1
2 78 17-04-2001 2
I have tried almost all possible combinations, but couldn't figure it out.
select dense_rank() over
(partition by contract order by nr) as my_rank,* from my_data
The above is close, but the problem is that in some cases rank 1 is assigned to the most recent nr, and in other cases to the least recent (?).
Any hint?

Your ranking is by nr.
If you want to rank by the start date, you need to incorporate that. But the start dates differ between rows of the same nr within a contract, so you need an additional calculation to get one date per contract and nr:
select dense_rank() over (partition by contract order by min_startdate) as my_rank,
d.*
from (select d.*,
min(startdate) over (partition by contract, nr) as min_startdate
from my_data d
) d;
I don't know if you want the min() or max() of the start date for your ordering.
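To check the query against the sample data, here is a minimal sketch using Python's built-in sqlite3 module (an assumption on my part; the question does not name a database, and window functions need SQLite 3.25 or later). The dates are rewritten in ISO format, read as day-month-year, so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (contract INTEGER, nr INTEGER, startdate TEXT)")
conn.executemany(
    "INSERT INTO my_data VALUES (?, ?, ?)",
    [
        (1, 12, "2000-01-01"),
        (1, 12, "2000-01-03"),
        (1, 22, "2000-01-07"),
        (2, 77, "2001-04-12"),
        (2, 78, "2001-04-17"),
    ],
)

# Inner query: earliest start date per (contract, nr).
# Outer query: dense rank of that date within each contract.
rank_query = """
SELECT d.contract, d.nr, d.startdate,
       DENSE_RANK() OVER (PARTITION BY d.contract ORDER BY d.min_startdate) AS my_rank
FROM (SELECT m.*,
             MIN(m.startdate) OVER (PARTITION BY m.contract, m.nr) AS min_startdate
      FROM my_data m) AS d
ORDER BY d.contract, d.startdate
"""
ranked = conn.execute(rank_query).fetchall()
for row in ranked:
    print(row)
```

Both rows of nr 12 share the same minimum start date, so they get the same rank, which is what the desired output asks for.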

Related

Count similar names block per independent partitions

I have a dataframe that looks like this:
id name datetime
44 once 2022-11-22T15:41:00
44 once 2022-11-22T15:42:00
44 once 2022-11-22T15:43:00
44 twice 2022-11-22T15:44:00
44 once 2022-11-22T16:41:00
55 thrice 2022-11-22T17:44:00
55 thrice 2022-11-22T17:46:00
55 once 2022-11-22T17:47:00
55 once 2022-11-22T17:51:00
55 twice 2022-11-22T18:41:00
55 thrice 2022-11-22T18:51:00
My desired output is
id name datetime cnt
44 once 2022-11-22T15:41:00 3
44 once 2022-11-22T15:42:00 3
44 once 2022-11-22T15:43:00 3
44 twice 2022-11-22T15:44:00 1
44 once 2022-11-22T16:41:00 1
55 thrice 2022-11-22T17:44:00 2
55 thrice 2022-11-22T17:46:00 2
55 once 2022-11-22T17:47:00 2
55 once 2022-11-22T17:51:00 2
55 twice 2022-11-22T18:41:00 1
55 thrice 2022-11-22T18:51:00 1
where the new column, cnt, is the number of rows in each block of consecutive rows that share the same name within an id.
I attempted the problem by doing:
select
id,
name,
datetime,
row_number() over (partition by id order by datetime) rn1,
row_number() over (partition by id, name order by name, datetime) rn2
from table
but it is obviously not giving the desired output.
I also tried looking at the solutions in SQL count consecutive days, but could not figure it out from the answers given there.
As noted in the question you linked to, this is a typical gaps & islands problem.
The solution is provided in the answers to that question, but I've applied it to your sample data specifically for you here:
with gp as (
select *,
Row_Number() over(partition by id order by [datetime])
- Row_Number() over(partition by id, name order by [datetime]) g
from t
)
select id, name, [datetime],
Count(*) over(partition by id, name, g) cnt
from gp;
See Demo DBFiddle
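In case the fiddle is not available, the same logic can be tried locally. This is a sketch assuming Python's sqlite3 module (SQLite 3.25 or later) rather than SQL Server, so the bracket quoting is replaced with double quotes; the table name t is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t (id INTEGER, name TEXT, "datetime" TEXT)')
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (44, "once", "2022-11-22T15:41:00"),
        (44, "once", "2022-11-22T15:42:00"),
        (44, "once", "2022-11-22T15:43:00"),
        (44, "twice", "2022-11-22T15:44:00"),
        (44, "once", "2022-11-22T16:41:00"),
        (55, "thrice", "2022-11-22T17:44:00"),
        (55, "thrice", "2022-11-22T17:46:00"),
        (55, "once", "2022-11-22T17:47:00"),
        (55, "once", "2022-11-22T17:51:00"),
        (55, "twice", "2022-11-22T18:41:00"),
        (55, "thrice", "2022-11-22T18:51:00"),
    ],
)

# The difference of the two row numbers (g) stays constant while the name
# repeats consecutively, so (id, name, g) identifies one block.
islands_query = """
WITH gp AS (
    SELECT id, name, "datetime",
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY "datetime")
         - ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY "datetime") AS g
    FROM t
)
SELECT id, name, "datetime",
       COUNT(*) OVER (PARTITION BY id, name, g) AS cnt
FROM gp
ORDER BY id, "datetime"
"""
counted = conn.execute(islands_query).fetchall()
for row in counted:
    print(row)
```

Selecting g alongside cnt is a quick way to see why the second run of once for id 44 ends up in its own block.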

Calculating time with datetime by groups

I have two tables, Tickets and Tasks. When a ticket is registered, it appears in the Tickets table, and every action performed on the ticket is saved in the Tasks table. The Tickets table includes information such as who created the ticket and its start and end dates (if it is closed). The Tasks table looks like this:
ID  Ticket_ID  Task_type_ID  Task_type  Group_ID  Submit_Date
1   120        1             Opened     3         2016-12-09 11:10:22.000
2   120        2             Assign     4         2016-12-09 12:10:22.000
3   120        3             Paused     4         2016-12-09 12:30:22.000
4   120        4             Unpause    4         2016-12-10 10:30:22.000
5   120        2             Assign     6         2016-12-12 10:30:22.000
6   120        2             Assign     7         2016-12-12 15:30:22.000
7   120        5             Modify     NULL      2016-12-13 15:30:22.000
8   120        6             Closed     NULL      2016-12-13 16:30:22.000
I would like to calculate how long each group took to complete its task. The start time is when the ticket was assigned to a group, and the end time is when that group completes its task (by assigning the ticket elsewhere or closing it). Paused time (task_type_ID 3 to 4) should not be included. Also, when a ticket is assigned to another group, the new group ID appears in the previous task/row. If the ticket goes through multiple groups, the query should calculate how long the ticket was in the hands of each group.
I know it is complicated but maybe someone has an idea that I can start to build from.
This is quite a sophisticated gaps-and-islands problem.
Here is one approach at it:
select distinct
ticket_id,
group_id,
sum(sum(datediff(minute, submit_date, lead_submit_date)))
over(partition by ticket_id, group_id) elapsed_minutes
from (
select
t.*,
row_number() over(partition by ticket_id order by submit_date) rn1,
row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id, rn1 - rn2
In the subquery, we assign row numbers to records within two different partitions (by tickets vs by ticket and group), and recover the date of the next record with lead().
We can then use the difference between the row numbers to build groups of "adjacent" records (where the ticket stays in the same group), while not taking into account periods when the ticket was paused. Aggregation comes into play here.
The final step is to compute the overall time spent in each group: this handles the case where a ticket is assigned to the same group more than once during its lifecycle (that does not occur in your sample data, but the description of the question makes it sound like it may happen). We could do this with another level of aggregation, but I went for a window sum and distinct, which avoids adding one more level of nesting to the query.
Executing the subquery independently might help you understand the logic better (see the db fiddle below).
For your sample data, the query yields:
ticket_id | group_id | elapsed_minutes
--------: | -------: | --------------:
120 | 3 | 60
120 | 4 | 2900
120 | 6 | 300
120 | 7 | 1440
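To see how the row-number difference forms the groups, the subquery can be run on its own. This is only a sketch, assuming Python's sqlite3 module (SQLite 3.25 or later) in place of SQL Server; the island column name is mine, and the fractional seconds are dropped from the sample timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, ticket_id INTEGER, task_type TEXT, "
    "group_id INTEGER, submit_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 120, "Opened", 3, "2016-12-09 11:10:22"),
        (2, 120, "Assign", 4, "2016-12-09 12:10:22"),
        (3, 120, "Paused", 4, "2016-12-09 12:30:22"),
        (4, 120, "Unpause", 4, "2016-12-10 10:30:22"),
        (5, 120, "Assign", 6, "2016-12-12 10:30:22"),
        (6, 120, "Assign", 7, "2016-12-12 15:30:22"),
        (7, 120, "Modify", None, "2016-12-13 15:30:22"),
        (8, 120, "Closed", None, "2016-12-13 16:30:22"),
    ],
)

# island = rn1 - rn2: constant for as long as the ticket stays in one group.
subquery = """
SELECT id, group_id, task_type,
       ROW_NUMBER() OVER (PARTITION BY ticket_id ORDER BY submit_date)
     - ROW_NUMBER() OVER (PARTITION BY ticket_id, group_id ORDER BY submit_date) AS island,
       LEAD(submit_date) OVER (PARTITION BY ticket_id ORDER BY submit_date) AS lead_submit_date
FROM mytable
ORDER BY submit_date
"""
island_rows = conn.execute(subquery).fetchall()
for row in island_rows:
    print(row)
```

The three rows for group 4 share one island value, and the last row has no following record, so its lead_submit_date is NULL.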
I actually think this is pretty simple. Just use lead() to get the next submit time value and aggregate by the ticket and group ignoring pauses:
select ticket_id, group_id, sum(dur_sec)
from (select t.*,
datediff(second, submit_date, lead(submit_date) over (partition by ticket_id order by submit_date)) as dur_sec
from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id;
Here is a db<>fiddle (with thanks to GMB for creating the original fiddle).
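For a quick local check of this simpler version, here is a sketch assuming Python's sqlite3 module (SQLite 3.25 or later). SQLite has no datediff(), so the duration is computed from strftime('%s', ...) Unix seconds and converted to minutes; that substitution is mine, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, ticket_id INTEGER, task_type TEXT, "
    "group_id INTEGER, submit_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 120, "Opened", 3, "2016-12-09 11:10:22"),
        (2, 120, "Assign", 4, "2016-12-09 12:10:22"),
        (3, 120, "Paused", 4, "2016-12-09 12:30:22"),
        (4, 120, "Unpause", 4, "2016-12-10 10:30:22"),
        (5, 120, "Assign", 6, "2016-12-12 10:30:22"),
        (6, 120, "Assign", 7, "2016-12-12 15:30:22"),
        (7, 120, "Modify", None, "2016-12-13 15:30:22"),
        (8, 120, "Closed", None, "2016-12-13 16:30:22"),
    ],
)

# Each row's duration runs until the next row of the same ticket;
# paused rows and rows without a group are filtered out before summing.
duration_query = """
SELECT ticket_id, group_id, SUM(dur_min) AS elapsed_minutes
FROM (SELECT t.*,
             (CAST(strftime('%s', LEAD(submit_date) OVER
                       (PARTITION BY ticket_id ORDER BY submit_date)) AS INTEGER)
              - CAST(strftime('%s', submit_date) AS INTEGER)) / 60 AS dur_min
      FROM mytable t) AS t
WHERE task_type <> 'Paused' AND group_id IS NOT NULL
GROUP BY ticket_id, group_id
ORDER BY group_id
"""
totals = conn.execute(duration_query).fetchall()
for row in totals:
    print(row)
```

On the sample data this gives the same per-group minutes as the table in the first answer.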

BigQuery filter WHERE by date for last 5 rows for each value of categorical column

Apologies if the title is a bit wordy - I will create an example below to highlight what I'm referring to. I have the following table of information:
t1
date team num_val
2017-10-04 ab 7
2017-10-03 ab 6
2017-10-02 ab 8
2017-10-05 ab 3
2017-10-07 ab 12
2017-10-06 ab 3
2017-10-01 ab 5
2017-09-08 cd 4
2017-09-09 cd 8
2017-09-10 cd 2
2017-09-14 cd 1
2017-09-13 cd 5
2017-09-11 cd 6
2017-09-12 cd 13
With this table, I would simply like to:
Filter, for each team, the most recent 5 dates
Group by team and sum the num_val column
Simple enough. However, there is no rhyme or reason to the dates for each team (I cannot simply filter on the most recent 5 dates overall, since they may be different for each team). I currently have the following framework for the query:
SELECT
team,
sum(num_val)
FROM t1
GROUP BY team
... any help getting this to the finish line would be greatly appreciated, thanks!!
Here are a few more options for BigQuery Standard SQL, so you can see different approaches.
Option #1
#standardSQL
SELECT team, SUM(num_val) sum_num FROM (
SELECT team, num_val, ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
FROM `project.dataset.table`
)
WHERE pos <= 5
GROUP BY team
Option #2
#standardSQL
SELECT team, sum_num FROM (
SELECT team,
SUM(num_val) OVER(PARTITION BY team ORDER BY DATE DESC ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS sum_num,
ROW_NUMBER() OVER(PARTITION BY team ORDER BY DATE DESC) pos
FROM `project.dataset.table`
)
WHERE pos = 1
Applied to the sample data from your question, both produce the result below:
Row team sum_num
1 ab 31
2 cd 27
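Option #1 is plain window-function SQL, so it can be checked outside BigQuery as well. A minimal sketch, assuming Python's sqlite3 module (SQLite 3.25 or later) and the table name t1 from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t1 ("date" TEXT, team TEXT, num_val INTEGER)')
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?)",
    [
        ("2017-10-04", "ab", 7), ("2017-10-03", "ab", 6), ("2017-10-02", "ab", 8),
        ("2017-10-05", "ab", 3), ("2017-10-07", "ab", 12), ("2017-10-06", "ab", 3),
        ("2017-10-01", "ab", 5),
        ("2017-09-08", "cd", 4), ("2017-09-09", "cd", 8), ("2017-09-10", "cd", 2),
        ("2017-09-14", "cd", 1), ("2017-09-13", "cd", 5), ("2017-09-11", "cd", 6),
        ("2017-09-12", "cd", 13),
    ],
)

# Number the rows of each team from newest to oldest, keep the first five,
# then sum num_val per team.
top5_query = """
SELECT team, SUM(num_val) AS sum_num
FROM (SELECT team, num_val,
             ROW_NUMBER() OVER (PARTITION BY team ORDER BY "date" DESC) AS pos
      FROM t1)
WHERE pos <= 5
GROUP BY team
ORDER BY team
"""
sums = conn.execute(top5_query).fetchall()
print(sums)
```

If two rows of a team could share a date, ROW_NUMBER breaks the tie arbitrarily, so a second ORDER BY column would be needed for a repeatable result.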
While the options above can be useful in more complicated cases, in your particular case I would go with an option similar to the one presented in Felipe's answer:
#standardSQL
SELECT team, (SELECT SUM(num_val) FROM UNNEST(num_values)) sum_num
FROM (
SELECT team, ARRAY_AGG(STRUCT(num_val) ORDER BY DATE DESC LIMIT 5) num_values
FROM `project.dataset.table`
GROUP BY team
)
To get the latest 5 for each:
SELECT team, ARRAY_AGG(num_val ORDER BY date DESC LIMIT 5) arr
FROM x
GROUP BY team
Then UNNEST(arr) and add those num_vals.
SELECT team, (SELECT SUM(num_val) FROM UNNEST(arr) num_val) the_sum
FROM (previous)

finding the number of days in between first 2 date point

This question seems quite difficult to me, so I wonder if I could get some advice here. I am trying to solve this with SQLite 3. My data looks like this:
customer | purchase date
1 | date 1
1 | date 2
1 | date 3
2 | date 4
2 | date 5
2 | date 6
2 | date 7
The number of rows per customer varies.
I just want to find out whether each customer's 1st and 2nd purchase dates fall within a specific time period. Only the 1st and 2nd dates of each customer need to be considered.
Any help would be appreciated!
We can try using ROW_NUMBER here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer ORDER BY "purchase date") rn
FROM yourTable
)
SELECT
customer,
CAST(MAX(CASE WHEN rn = 2 THEN julianday("purchase date") END) -
MAX(CASE WHEN rn = 1 THEN julianday("purchase date") END) AS INTEGER) AS diff_in_days
FROM cte
GROUP BY
customer;
The idea here is to aggregate by customer and then take the date difference between the second and first purchase. ROW_NUMBER is used to find these first and second purchases, for each customer.
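Here is a small runnable sketch of that query using Python's sqlite3 module (SQLite 3.25 or later for ROW_NUMBER). The purchase dates are made up, since the question only shows placeholders. The both_in_period column and the two period parameters are my own addition for the "within a specific time period" part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE yourTable (customer INTEGER, "purchase date" TEXT)')
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?)",
    [
        (1, "2023-01-05"), (1, "2023-01-12"), (1, "2023-03-01"),
        (2, "2023-02-01"), (2, "2023-02-04"), (2, "2023-02-20"), (2, "2023-05-01"),
    ],
)

# rn numbers each customer's purchases by date; only rn 1 and 2 are used.
first_two_query = """
WITH cte AS (
    SELECT customer, "purchase date" AS d,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY "purchase date") AS rn
    FROM yourTable
)
SELECT customer,
       CAST(MAX(CASE WHEN rn = 2 THEN julianday(d) END)
          - MAX(CASE WHEN rn = 1 THEN julianday(d) END) AS INTEGER) AS diff_in_days,
       (MIN(CASE WHEN rn <= 2 THEN d END) >= :period_start
        AND MAX(CASE WHEN rn <= 2 THEN d END) <= :period_end) AS both_in_period
FROM cte
GROUP BY customer
ORDER BY customer
"""
# Hypothetical period: January 2023.
params = {"period_start": "2023-01-01", "period_end": "2023-01-31"}
first_two = conn.execute(first_two_query, params).fetchall()
print(first_two)
```

A customer with a single purchase gets NULL for diff_in_days, because there is no row with rn = 2.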

Get last entry from each user in database

I have a Postgresql database, and I'm having trouble getting my query right, even though this seems like a common problem.
My table looks like this:
CREATE TABLE orders (
account_id INTEGER,
order_id INTEGER,
ts TIMESTAMP DEFAULT NOW()
)
Every time there is a new order, I use it to link the account_id and order_id.
Now my problem is that I want to get a list that has the last order (by looking at ts) for each account.
For example, if my data is:
account_id order_id ts
5 178 July 1
5 129 July 6
4 190 July 1
4 181 July 9
3 348 July 1
3 578 July 4
3 198 July 1
3 270 July 12
Then I'd like the query to return only the last row for each account:
account_id order_id ts
5 129 July 6
4 181 July 9
3 270 July 12
I've tried GROUP BY account_id, and I can use that to get the MAX(ts) for each account, but then I have no way to get the associated order_id. I've also tried sub-queries, but I just can't seem to get it right.
Thanks!
select distinct on (account_id) *
from orders
order by account_id, ts desc
https://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
The row_number() window function can help:
select account_id, order_id, ts
from (select account_id, order_id, ts,
row_number() over(partition by account_id order by ts desc) as rn
from orders) t
where rn = 1
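The row_number() version is portable, so it can be tried without a Postgres server. A sketch assuming Python's sqlite3 module (SQLite 3.25 or later), with the sample dates written as ISO strings in an arbitrary year (2023 is my assumption); DISTINCT ON is Postgres-specific and is not used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (account_id INTEGER, order_id INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (5, 178, "2023-07-01"), (5, 129, "2023-07-06"),
        (4, 190, "2023-07-01"), (4, 181, "2023-07-09"),
        (3, 348, "2023-07-01"), (3, 578, "2023-07-04"),
        (3, 198, "2023-07-01"), (3, 270, "2023-07-12"),
    ],
)

# rn = 1 marks the newest order of each account.
latest_query = """
SELECT account_id, order_id, ts
FROM (SELECT account_id, order_id, ts,
             ROW_NUMBER() OVER (PARTITION BY account_id ORDER BY ts DESC) AS rn
      FROM orders) AS t
WHERE rn = 1
ORDER BY account_id
"""
latest = conn.execute(latest_query).fetchall()
print(latest)
```

If an account can have two orders with the same ts, adding order_id to the window's ORDER BY makes the chosen row deterministic.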