Running Count of Unique Identifier Occurrences in SQL

So I'm trying to get a running count of uses over time by a unique identifier, e.g.:

Date      UniqueID  Running Count
1/1/2019  234567    1
1/1/2019  123456    1
1/2/2019  234567    2
1/3/2019  234567    3
1/3/2019  123456    2
Basically I want to be able to see that on 1/3/2019 that was the 3rd time that UniqueID 234567 showed up in the data.
I tried:
SELECT Date, UniqueID,
       count(UniqueID) OVER (ORDER BY Date, UniqueID ROWS UNBOUNDED PRECEDING) AS RunningTotal
but this just does an overall running total, so it doesn't reset with a new UniqueID.
Is there anything I could do to make it reset for each UniqueID?

Assuming that the 2 in the last row is a typo, you want either ROW_NUMBER() or DENSE_RANK():
SELECT Date, UniqueID,
       ROW_NUMBER() OVER (PARTITION BY UniqueID ORDER BY Date) AS RunningTotal
You would use DENSE_RANK() if you could have duplicates on one day that you wanted to count only once.
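For example, a minimal sketch of that variant (same columns as above; the duplicate rows are hypothetical):
SELECT Date, UniqueID,
       DENSE_RANK() OVER (PARTITION BY UniqueID ORDER BY Date) AS RunningTotal
With DENSE_RANK(), two rows for the same UniqueID on the same Date share one RunningTotal value; with ROW_NUMBER() they would get consecutive values.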
By the way, you could also express this using COUNT(*):
SELECT Date, UniqueID,
       COUNT(*) OVER (PARTITION BY UniqueID ORDER BY Date) AS RunningTotal
There are some subtle differences in the handling of duplicate values. Normally, COUNT() is not used for this purpose because the ranking functions are so pervasive (and useful).
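To make one of those subtle differences concrete: with an ORDER BY in the OVER clause, COUNT(*) defaults to a RANGE frame, so rows that tie on Date are peers and all receive the same (peak) count, while ROW_NUMBER() always increments. A minimal sketch, assuming a hypothetical table t with duplicate (Date, UniqueID) rows:
SELECT Date, UniqueID,
       COUNT(*)     OVER (PARTITION BY UniqueID ORDER BY Date) AS cnt, -- ties share the peak value
       ROW_NUMBER() OVER (PARTITION BY UniqueID ORDER BY Date) AS rn   -- ties still get 1, 2, 3, ...
FROM t;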

Related

Dense Rank grouping by IDs

I am having trouble getting my DENSE_RANK() function in Oracle to work how I would like. First, my dataset:
ID    DATE
1234  01-OCT-2020
1234  01-OCT-2021
1234  01-OCT-2022
2345  01-APR-2020
2345  01-APR-2021
2345  01-APR-2022
I am trying to use the dense rank function to return results with a sequence number based on the DATE field, and grouping by ID. How I want the data to return:
ID    DATE         SEQ
1234  01-OCT-2020  1
1234  01-OCT-2021  2
1234  01-OCT-2022  3
2345  01-APR-2020  1
2345  01-APR-2021  2
2345  01-APR-2022  3
The query I have so far:
SELECT ID, DATE, DENSE_RANK() Over (order by ID, DATE asc) as SEQ
However, this returns incorrectly, as the sequence number will go to 6 (like it's disregarding my intention to sequence based on the DATE field within a certain ID). If anyone has any insights into how to make this work, it would be very much appreciated!
You want row_number():
select id, date, row_number() over (partition by id order by date) as seq
You could actually use dense_rank() as well, if you want duplicates to have the same seq. The key idea is partition by.
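A minimal sketch of the dense_rank() alternative (the table name is a placeholder): two rows with the same ID and DATE would share one SEQ, where row_number() would break the tie arbitrarily:
select id, date, dense_rank() over (partition by id order by date) as seq
from mytable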

Calculating time with datetime by groups

I have two tables, Tickets and Tasks. When a ticket is registered, it appears in the Tickets table, and every action made on the ticket is saved in the Tasks table. The Tickets table includes information like who created the ticket, start and end dates (if it is closed), etc. The Tasks table looks like this:
ID  Ticket_ID  Task_type_ID  Task_type  Group_ID  Submit_Date
1   120        1             Opened     3         2016-12-09 11:10:22.000
2   120        2             Assign     4         2016-12-09 12:10:22.000
3   120        3             Paused     4         2016-12-09 12:30:22.000
4   120        4             Unpause    4         2016-12-10 10:30:22.000
5   120        2             Assign     6         2016-12-12 10:30:22.000
6   120        2             Assign     7         2016-12-12 15:30:22.000
7   120        5             Modify     NULL      2016-12-13 15:30:22.000
8   120        6             Closed     NULL      2016-12-13 16:30:22.000
I would like to calculate how long each group spent completing its task. The start time is when the ticket was assigned to a certain group, and the end time is when that group completes its task (by assigning it elsewhere or closing it). It should not include the paused time (task_type_ID 3 to 4). Also, when a ticket is assigned to another group, the new group ID appears in the previous task/row. If the ticket goes through multiple groups, it should calculate how long the ticket was in the hands of each group.
I know it is complicated but maybe someone has an idea that I can start to build from.
This is a quite sophisticated gaps-and-islands problem.
Here is one approach at it:
select distinct
    ticket_id,
    group_id,
    sum(sum(datediff(minute, submit_date, lead_submit_date)))
        over(partition by ticket_id, group_id) elapsed_minutes
from (
    select
        t.*,
        row_number() over(partition by ticket_id order by submit_date) rn1,
        row_number() over(partition by ticket_id, group_id order by submit_date) rn2,
        lead(submit_date) over(partition by ticket_id order by submit_date) lead_submit_date
    from mytable t
) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id, rn1 - rn2
In the subquery, we assign row numbers to records within two different partitions (by tickets vs by ticket and group), and recover the date of the next record with lead().
We can then use the difference between the row numbers to build groups of "adjacent" records (where the tickets stays in the same group), while not taking into account periods when the ticket was paused. Aggregation comes into play here.
The final step is to compute the overall time spent in each group: this handles the case when a ticket is assigned to the same group more than once during its lifecycle (although that's not shown in your sample data, the description of the question makes it sound like that may happen). We could do this with another level of aggregation, but I went for a window sum and distinct, which avoids adding one more level of nesting to the query.
Executing the subquery independently might help you understand the logic better (see the db fiddle below).
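To illustrate, this is roughly what the subquery returns for the sample data (values worked out by hand, so treat them as indicative):
ID  Group_ID  Task_type  rn1  rn2  rn1 - rn2  lead_submit_date
1   3         Opened     1    1    0          2016-12-09 12:10:22
2   4         Assign     2    1    1          2016-12-09 12:30:22
3   4         Paused     3    2    1          2016-12-10 10:30:22
4   4         Unpause    4    3    1          2016-12-12 10:30:22
5   6         Assign     5    1    4          2016-12-12 15:30:22
6   7         Assign     6    1    5          2016-12-13 15:30:22
7   NULL      Modify     7    1    6          2016-12-13 16:30:22
8   NULL      Closed     8    2    6          NULL
Rows 2-4 share rn1 - rn2 = 1, so they form one island for group 4; the Paused row is then removed by the WHERE clause, leaving only the Assign and Unpause durations to be summed.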
For your sample data, the query yields:
ticket_id | group_id | elapsed_minutes
--------: | -------: | --------------:
120 | 3 | 60
120 | 4 | 2900
120 | 6 | 300
120 | 7 | 1440
I actually think this is pretty simple. Just use lead() to get the next submit time value and aggregate by the ticket and group ignoring pauses:
select ticket_id, group_id, sum(dur_sec)
from (select t.*,
             datediff(second, submit_date,
                      lead(submit_date) over (partition by ticket_id order by submit_date)) as dur_sec
      from mytable t
     ) t
where task_type <> 'Paused' and group_id is not null
group by ticket_id, group_id;
Here is a db<>fiddle (with thanks to GMB for creating the original fiddle).
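For the sample data, this yields the same totals as the first answer, just in seconds (worked out by hand):
ticket_id  group_id  sum(dur_sec)
120        3         3600
120        4         174000
120        6         18000
120        7         86400
The islands trick is not needed here because summing the per-row durations per (ticket_id, group_id) gives the same total whether or not a group held the ticket in several separate stretches.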

Combining COUNT and RANK - PostgreSQL

What I need to select is the total number of trips made by every 'id_customer' from the user table, plus their id, dispatch_seconds, and distance for the first order. id_customer, customer_id, and order_id are strings.
It should look like this:
+------+--------+------------+--------------------------+------------------+
| id | count | #1order id | #1order dispatch seconds | #1order distance |
+------+--------+------------+--------------------------+------------------+
| 1ar5 | 3 | 4r56 | 1 | 500 |
| 2et7 | 2 | dc1f | 5 | 100 |
+------+--------+------------+--------------------------+------------------+
Cheers!
The original post was edited during the discussion, as S-man helped me find the exact solution to the problem. Solution by S-man: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=e16aa6008990107e55a26d05b10b02b5
db<>fiddle
SELECT
    customer_id,
    order_id,
    order_timestamp,
    dispatch_seconds,
    distance
FROM (
    SELECT
        *,
        count(*) over (partition by customer_id),                                      -- A
        first_value(order_id) over (partition by customer_id order by order_timestamp) -- B
    FROM orders
) s
WHERE order_id = first_value -- C
https://www.postgresql.org/docs/current/static/tutorial-window.html
A window function which gets the total record count per user
B window function which orders all records per user by timestamp and gives the first order_id of the corresponding user. Using first_value instead of min has one benefit: it could be that your order IDs are not really increasing by timestamp (maybe two orders come in simultaneously, or your order IDs are not sequentially increasing but some sort of hash).
--> both are new columns
C now get all rows where the "first_value" (aka the first order_id by timestamp) equals the order_id of the current row. This gives all rows with the first order per user.
Result:
customer_id  count  order_id  order_timestamp      dispatch_seconds  distance
-----------  -----  --------  -------------------  ----------------  --------
1ar5         3      4r56      2018-08-16 17:24:00  1                 500
2et7         2      dc1f      2018-08-15 01:24:00  5                 100
Note that in these test data the order "dc1f" of user "2et7" has a smaller timestamp but comes later in the rows. It is not the first occurrence of the user in the table but nevertheless the one with the earliest order. This should demonstrate the case first_value vs. min as described above.
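A quick sketch to see the difference on these test data (both expressions are constant per partition, so DISTINCT collapses to one row per customer; min_id and first_id are hypothetical aliases):
SELECT DISTINCT
    customer_id,
    min(order_id) OVER (PARTITION BY customer_id) AS min_id,                                    -- smallest ID by string order
    first_value(order_id) OVER (PARTITION BY customer_id ORDER BY order_timestamp) AS first_id  -- ID of the earliest order
FROM orders;
With hash-like IDs the two columns can diverge; only first_id is guaranteed to be the earliest order.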
You are on the right track. Just use conditional aggregation:
SELECT o.customer_id, COUNT(*),
       MAX(CASE WHEN seqnum = 1 THEN o.order_id END) as first_order_id
FROM (SELECT o.*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_timestamp ASC) as seqnum
FROM orders o
) o
GROUP BY o.customer_id;
Your JOIN is not necessary for this query.
You can use window functions:
select distinct customer_id,
       count(*) over (partition by customer_id) as no_of_order,
       first_value(order_id) over (partition by customer_id order by order_timestamp) as first_order_id
from orders o;
I think there are many mistakes in your original query: your rank isn't partitioned, the order by clause seems incorrect, and you filter out all but one "random" order before applying the count; the list goes on.
Something like this seems closer to what you want:
SELECT
    customer_id,
    order_count,
    order_id
FROM (
    SELECT
        a.customer_id,
        a.order_count,
        a.order_id,
        RANK() OVER (PARTITION BY a.order_id, a.customer_id ORDER BY a.order_count DESC) AS rank_id
    FROM (
        SELECT
            customer_id,
            order_id,
            COUNT(*) AS order_count
        FROM orders
        GROUP BY customer_id, order_id
    ) a
) b
WHERE b.rank_id = 1;

RANK records partitioned by a column in series (Vertica SQL)

I'm trying to use the Vertica rank analytic function to create a rank column partitioned by a column, but only counting records that are in a series. For example, the query below produces the following output:
select when_created, status
from tablea
when_created  Status
1/1/2015      ACTIVE
3/1/2015      ACTIVE
4/1/2015      INACTIVE
4/6/2015      INACTIVE
6/7/2015      ACTIVE
10/9/2015     INACTIVE
I could modify my query to include a rank column which would produce the following output
select when_created, status,
       rank() OVER (PARTITION BY status order by when_created) as rnk
from tablea
when_created  Status    rnk
1/1/2015      ACTIVE    1
3/1/2015      ACTIVE    2
4/1/2015      INACTIVE  1
4/6/2015      INACTIVE  2
6/7/2015      ACTIVE    3
10/9/2015     INACTIVE  3
However, my goal is to start the rank over when a series is broken, so the desired output is:
when_created  Status    rnk
1/1/2015      ACTIVE    1
3/1/2015      ACTIVE    2
4/1/2015      INACTIVE  1
4/6/2015      INACTIVE  2
6/7/2015      ACTIVE    1
10/9/2015     INACTIVE  1
Is there a way to accomplish this using the RANK function, or is there another way to do it in Vertica SQL?
Thanks,
Ben
This is a gaps-and-islands problem, where the tricky part is to identify the groups to use for a row_number() calculation. One solution uses the difference of row numbers to identify the different groups:
select a.*,
       row_number() over (partition by status, seqnum - seqnum_s order by when_created) as rnk
from (select a.*,
             row_number() over (order by when_created) as seqnum,
             row_number() over (partition by status order by when_created) as seqnum_s
      from tablea a
     ) a;
The logic behind this is tricky when you first see it. I advise you to run the subquery and understand the two row_number() calculations -- and to observe that the difference is constant for the groups you are interested in.
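For the sample data, the subquery yields the following (values worked out by hand, so treat them as indicative):
when_created  Status    seqnum  seqnum_s  seqnum - seqnum_s
1/1/2015      ACTIVE    1       1         0
3/1/2015      ACTIVE    2       2         0
4/1/2015      INACTIVE  3       1         2
4/6/2015      INACTIVE  4       2         2
6/7/2015      ACTIVE    5       3         2
10/9/2015     INACTIVE  6       3         3
The difference is constant within each streak, and partitioning by status as well keeps the (ACTIVE, 2) and (INACTIVE, 2) groups apart, which is why the outer row_number() partitions by both.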

How to add a running count to rows in a 'streak' of consecutive days

Thanks to Mike for the suggestion to add the create/insert statements.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1')
, (1,'2014-10-2')
, (1,'2014-10-3')
, (1,'2014-10-5')
, (1,'2014-10-7')
, (2,'2014-10-1')
, (2,'2014-10-2')
, (2,'2014-10-3')
, (2,'2014-10-5')
, (2,'2014-10-7');
I want to add a new column that is 'days in current streak'
so the result would look like:
pid | date      | in_streak
----|-----------|----------
1   | 2014-10-1 | 1
1   | 2014-10-2 | 2
1   | 2014-10-3 | 3
1   | 2014-10-5 | 1
1   | 2014-10-7 | 1
2   | 2014-10-2 | 1
2   | 2014-10-3 | 2
2   | 2014-10-4 | 3
2   | 2014-10-6 | 1
I've been trying to use the answers from
PostgreSQL: find number of consecutive days up until now
Return rows of the latest 'streak' of data
but I can't work out how to use the dense_rank() trick with other window functions to get the right result.
Building on this table (not using the SQL keyword "date" as a column name):
CREATE TABLE tbl(
pid int
, the_date date
, PRIMARY KEY (pid, the_date)
);
Query:
SELECT pid, the_date
, row_number() OVER (PARTITION BY pid, grp ORDER BY the_date) AS in_streak
FROM (
SELECT *
, the_date - '2000-01-01'::date
- row_number() OVER (PARTITION BY pid ORDER BY the_date) AS grp
FROM tbl
) sub
ORDER BY pid, the_date;
Subtracting a date from another date yields an integer. Since you are looking for consecutive days, every next row would be greater by one. If we subtract row_number() from that, the whole streak ends up in the same group (grp) per pid. Then it's simple to deal out numbers per group.
grp is calculated with two subtractions, which should be fastest. An equally fast alternative could be:
the_date - row_number() OVER (PARTITION BY pid ORDER BY the_date) * interval '1d' AS grp
One multiplication, one subtraction. String concatenation and casting are more expensive. Test with EXPLAIN ANALYZE.
Don't forget to partition by pid additionally in both steps, or you'll inadvertently mix groups that should be separated.
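As a worked sketch for pid 1 (day numbers counted from 2000-01-01 by hand, so treat them as illustrative):
the_date    days  row_number  grp
2014-10-01  5387  1           5386
2014-10-02  5388  2           5386
2014-10-03  5389  3           5386
2014-10-05  5391  4           5387
2014-10-07  5393  5           5388
The three consecutive dates share grp = 5386 and form one streak; each gap starts a new grp.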
Using a subquery, since that is typically faster than a CTE. There is nothing here that a plain subquery couldn't do.
And since you mentioned it: dense_rank() is obviously not necessary here. Basic row_number() does the job.
You'll get more attention if you include CREATE TABLE statements and INSERT statements in your question.
create table test (
pid integer not null,
date date not null,
primary key (pid, date)
);
insert into test values
(1,'2014-10-1'), (1,'2014-10-2'), (1,'2014-10-3'), (1,'2014-10-5'),
(1,'2014-10-7'), (2,'2014-10-1'), (2,'2014-10-2'), (2,'2014-10-3'),
(2,'2014-10-5'), (2,'2014-10-7');
The principle is simple. A streak of distinct, consecutive dates minus row_number() is a constant. You can group by the constant, and take the dense_rank() over that result.
with grouped_dates as (
    select pid, date,
           (date - (row_number() over (partition by pid order by date) || ' days')::interval)::date as grouping_date
    from test
)
select *, dense_rank() over (partition by grouping_date order by date) as in_streak
from grouped_dates
order by pid, date
pid  date        grouping_date  in_streak
---  ----------  -------------  ---------
1    2014-10-01  2014-09-30     1
1    2014-10-02  2014-09-30     2
1    2014-10-03  2014-09-30     3
1    2014-10-05  2014-10-01     1
1    2014-10-07  2014-10-02     1
2    2014-10-01  2014-09-30     1
2    2014-10-02  2014-09-30     2
2    2014-10-03  2014-09-30     3
2    2014-10-05  2014-10-01     1
2    2014-10-07  2014-10-02     1