SQL query to turn a change log into intervals

So let me describe the problem:
- I have a task table with an assignee column, a created column and a resolved column (both created and resolved are timestamps):
+---------+----------+------------+------------+
| task_id | assignee | created | resolved |
+---------+----------+------------+------------+
| tsk1 | him | 2000-01-01 | 2018-01-03 |
+---------+----------+------------+------------+
- I have a change log table with a task_id, a from column, a to column and a date column that records each time the assignee is changed:
+---------+----------+------------+------------+
| task_id | from | to | date |
+---------+----------+------------+------------+
| tsk1 | me | you | 2017-04-06 |
+---------+----------+------------+------------+
| tsk1 | you | him | 2017-04-08 |
+---------+----------+------------+------------+
I want to select a table that shows, for each assignee that worked on a task, the interval during which they were assigned:
+---------+----------+------------+------------+
| task_id | assignee | from | to |
+---------+----------+------------+------------+
| tsk1 | me | 2000-01-01 | 2017-04-06 |
+---------+----------+------------+------------+
| tsk1 | you | 2017-04-06 | 2017-04-08 |
+---------+----------+------------+------------+
| tsk1 | him | 2017-04-08 | 2018-01-03 |
+---------+----------+------------+------------+
I'm having trouble with the first (and last) row, where from (and to) should be set to created (and resolved); I don't know how to build a column from data in two different tables...
I've tried building these rows in their own selects and then merging everything with UNION, but I don't think this is a very good solution...

Hmmm . . . This is trickier than it seems. The idea is to use lead() to get the next date, but you need to "augment" the data with information from the task table:
select task_id, "to" as assignee, date as fromdate,
coalesce(lead(date) over (partition by task_id order by date),
max(resolved) over (partition by task_id)
) as todate
from ((select task_id, "to", date, null::timestamp as resolved
from log l
) union all
(select distinct on (t.task_id) t.task_id, l."from", t.created, t.resolved
from task t join
log l
on t.task_id = l.task_id
order by t.task_id, l.date
)
) t;

demo:db<>fiddle
SELECT
l.task_id,
assignee_from as assignee,
COALESCE(
lag(assign_date) OVER (PARTITION BY l.task_id ORDER BY assign_date),
created
) as date_from,
assign_date as date_to
FROM
log l
JOIN
task t
ON l.task_id = t.task_id
UNION ALL
SELECT * FROM (
SELECT DISTINCT ON (l.task_id)
l.task_id, assignee_to, assign_date, resolved
FROM
log l
JOIN
task t
ON l.task_id = t.task_id
ORDER BY l.task_id, assign_date DESC
) s
ORDER BY task_id, date_from
The UNION ALL consists of two parts: the rows derived from the log, and finally the last row, taken from the task table.
The first part uses the LAG() window function to get the date of the previous row. Because "me" has no previous row, that would yield NULL; this case is caught by falling back to the created date from the task table.
The second part gets the last row: I take the most recent log entry per task via DISTINCT ON and ORDER BY assign_date DESC, which gives the last assignee_to. The rest mirrors the first part: the resolved value comes from the task table.

Thanks to the answer from S-Man and Gordon Linoff, I was able to come up with this solution:
SELECT t.task_id,
t.item_from AS assignee,
COALESCE(lag(t.changelog_created) OVER (
PARTITION BY t.task_id ORDER BY t.changelog_created),
max(t.creationdate) OVER (PARTITION BY t.task_id)) AS fromdate,
t.changelog_created as todate
FROM ( SELECT ch.task_id,
ch.item_from,
ch.changelog_created,
NULL::timestamp without time zone AS creationdate
FROM changelog_generic_expanded_view ch
WHERE ch.field::text = 'assignee'::text
UNION ALL
( SELECT DISTINCT ON (t_1.id_task) t_1.id_task,
t_1.assigneekey,
t_1.resolutiondate,
t_1.creationdate
FROM task_jira t_1
ORDER BY t_1.id_task)) t;
Note: this is the final version, so the names differ a bit, but the idea stays the same.
This is basically the same code as Gordon Linoff's, but I walk through the changelog in the opposite direction:
I use the 2nd part of the UNION ALL to generate the last assignee instead of the first. (This handles the case where there is no changelog at all; the last assignee is then produced without involving the changelog.)
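For anyone who wants to poke at this pattern, here is a minimal runnable sketch in Python/SQLite of the same lag() + UNION ALL idea (window functions need SQLite 3.25+; table and column names follow the question; SQLite has no DISTINCT ON, but since the task table already stores the final assignee, the second branch can read it directly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (task_id TEXT, assignee TEXT, created TEXT, resolved TEXT);
CREATE TABLE log (task_id TEXT, "from" TEXT, "to" TEXT, date TEXT);
INSERT INTO task VALUES ('tsk1', 'him', '2000-01-01', '2018-01-03');
INSERT INTO log VALUES ('tsk1', 'me', 'you', '2017-04-06');
INSERT INTO log VALUES ('tsk1', 'you', 'him', '2017-04-08');
""")

rows = conn.execute("""
WITH augmented AS (
    -- each change row contributes the assignee who held the task *before* it
    SELECT l.task_id, l."from" AS assignee, l.date AS date_to
    FROM log l
    UNION ALL
    -- the final assignee holds the task until it is resolved
    SELECT t.task_id, t.assignee, t.resolved
    FROM task t
)
SELECT a.task_id, a.assignee,
       COALESCE(LAG(a.date_to) OVER (PARTITION BY a.task_id ORDER BY a.date_to),
                (SELECT created FROM task t WHERE t.task_id = a.task_id)) AS date_from,
       a.date_to
FROM augmented a
ORDER BY a.task_id, a.date_to
""").fetchall()
print(rows)
```

This reproduces the three intervals from the question: me from created to the first change, you between the changes, him from the last change to resolved.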

Related

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So, for example, for all users with one or more check-ins, I want the percentage of users who did a check-in on their 2nd day, on their 3rd day, and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.date = ci.min_date)::numeric ) / u.num_users as day_0,
sum( (ci.date = ci.min_date + interval '1 day')::numeric ) / u.num_users as day_1,
sum( (ci.date = ci.min_date + interval '2 day')::numeric ) / u.num_users as day_2
from (select u.*, count(*) over () as num_users
from users u
) u left join
(select ci.user_id, ci.date::date as date,
min(min(ci.date::date)) over (partition by ci.user_id) as min_date
from check_in ci
group by ci.user_id, ci.date::date
) ci
on ci.user_id = u.id
group by u.num_users;
Note that this aggregates the check_in table by user id and calendar date, which ensures there is only one row per user per date. The join condition and the final GROUP BY make sure every user appears in the denominator exactly once.
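Here is a hedged SQLite sketch of the same idea (schema assumed from the question: users(id), check_in(user_id, date); SQLite has no interval arithmetic, so JULIANDAY differences stand in for "n days after the first check-in"):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE check_in (user_id INTEGER, date TEXT);
INSERT INTO users (id) VALUES (1), (2), (3);             -- user 3 never checks in
INSERT INTO check_in VALUES (1, '2020-09-02 13:00:00');
INSERT INTO check_in VALUES (1, '2020-09-04 13:00:00');  -- 2 days after first
INSERT INTO check_in VALUES (2, '2020-09-04 12:00:00');
INSERT INTO check_in VALUES (2, '2020-09-05 11:00:00');  -- 1 day after first
""")

row = conn.execute("""
WITH days AS (          -- one row per user per calendar day
    SELECT user_id, DATE(date) AS d FROM check_in GROUP BY user_id, DATE(date)
), firsts AS (          -- each user's first check-in day
    SELECT user_id, MIN(d) AS first_d FROM days GROUP BY user_id
)
SELECT
  1.0 * COUNT(CASE WHEN JULIANDAY(d.d) - JULIANDAY(f.first_d) = 0 THEN 1 END)
      / (SELECT COUNT(*) FROM users) AS day_0,
  1.0 * COUNT(CASE WHEN JULIANDAY(d.d) - JULIANDAY(f.first_d) = 1 THEN 1 END)
      / (SELECT COUNT(*) FROM users) AS day_1,
  1.0 * COUNT(CASE WHEN JULIANDAY(d.d) - JULIANDAY(f.first_d) = 2 THEN 1 END)
      / (SELECT COUNT(*) FROM users) AS day_2
FROM days d JOIN firsts f ON f.user_id = d.user_id
""").fetchone()
print(row)   # fractions of all users checking in 0, 1, 2 days after their first visit
```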

How to query to capture recency and scale in one query?

I've built a query that calculates the number of ids from a table, per url_count.
with cte as (
select id, count(distinct url) as url_count
from table
group by id
)
select sum(if(url_count >= 1,1,0)) as scale
from cte
union all
select sum(if(url_count >= 2,1,0)) as scale
from cte
union all
select sum(if(url_count >= 3,1,0)) as scale
from cte
union all
select sum(if(url_count >= 4,1,0)) as scale
from cte
union all
select sum(if(url_count >= 5,1,0)) as scale
from cte
The query above says; "Give me the list of ids and the number of urls they each go to, then accumulate the number of ids who have gone to [1-5] or more urls"
It's ofc a tedious method, but it works and outputs something like;
---------
| scale |
---------
|1213432|
|867554 |
|523523 |
|342232 |
|145889 |
---------
From this table: I also have a date field covering the last 5 days, which I'm working on adding into this query. Herein lies the challenge: trying to add a second layer of information to the query, i.e. recency. I've been working on multiple approaches to building a query that outputs all the combinations of the different scales per date.
The sort of output I've imagined is a pivot table which presents something like;
-------------------------------------------------------------
| date | url_co1 | url_co2 | url_co3 | url_co4 | url_co5|
-------------------------------------------------------------
|2020-01-05| 1213432 | 1112321 | 984332 | 632131 | 234124 |
|2020-01-04| 1012131 | 934242 | 867554 | 533242 | 134234 |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
-------------------------------------------------------------
Where url_co[1-5] represents the number of ids that visited [1-5] or more urls, and date gives the date that volume was captured. I have no idea how to write that, because once I query:
with cte as (
select id, date, count(distinct url) as url_count
from table
group by id, date
)
I've aggregated per id, per date, and that's where something goes wrong. =/
Hope that all made sense!
Please, please help! I would appreciate some guidance.
There must be a methodology for getting the combination of volumes per recency that I've missed!
I don't really follow the full question, but the first query can be simplified to:
select url_count, count(*) as this_count,
sum(count(*)) over (order by url_count desc) as descending_count
from (select id, count(distinct url) as url_count
from table
group by id
) t
group by url_count
order by url_count;
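To see why the descending cumulative sum replaces the five UNION'd scans, here is a hedged SQLite sketch (table and column names invented for the demo): group to one row per id, then let SUM(COUNT(*)) over descending url_count produce the ">= k urls" totals in one pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id INTEGER, url TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [(1, 'a'), (1, 'b'), (1, 'c'),   # id 1 -> 3 distinct urls
                  (2, 'a'), (2, 'b'),             # id 2 -> 2 distinct urls
                  (3, 'a'), (3, 'a')])            # id 3 -> 1 distinct url
rows = conn.execute("""
SELECT url_count, COUNT(*) AS this_count,
       SUM(COUNT(*)) OVER (ORDER BY url_count DESC) AS descending_count
FROM (SELECT id, COUNT(DISTINCT url) AS url_count FROM visits GROUP BY id) t
GROUP BY url_count
ORDER BY url_count
""").fetchall()
print(rows)   # descending_count = number of ids with >= url_count urls
```

Here 3 ids have >= 1 url, 2 have >= 2, and 1 has >= 3, which is exactly the "scale" column of the original query.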

SQL Select statements that returns differences in minutes from next date-time from the same table

I have a user activity table. I want to see the time between consecutive processes in minutes.
To be more specific, here is some partial data from the table:
Date |TranType| Prod | Loc
-----------------------------------------------------------
02/27/12 3:17:21 PM | PICK | LIrishGreenXL | E01C015
02/27/12 3:18:18 PM | PICK | LAntHeliconiaS | E01A126
02/27/12 3:19:00 PM | PICK | LAntHeliconiaL | E01A128
02/27/12 3:19:07 PM | PICK | LAntHeliconiaXL | E01A129
I want to retrieve the time difference in minutes between the first and second process, then between the second and third, and so on.
Thank you
Something like this will work in MS SQL, just change the field names to match yours:
select a.ActionDate, datediff(minute, a.ActionDate, b.ActionDate) as DiffMin
from
(select ROW_NUMBER() OVER (ORDER BY ActionDate) AS Row, ActionDate
from [dbo].[Attendance]) a
inner join
(select ROW_NUMBER() OVER (ORDER BY ActionDate) AS Row, ActionDate
from [dbo].[Attendance]) b
on a.Row + 1 = b.Row
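On databases with window functions, the ROW_NUMBER() self-join can also be written with LAG(). A hedged SQLite sketch (table name from the answer; timestamps reformatted to ISO so they sort and subtract correctly, and JULIANDAY differences converted to minutes since SQLite has no datediff):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Attendance (ActionDate TEXT);
INSERT INTO Attendance VALUES ('2012-02-27 15:17:21');
INSERT INTO Attendance VALUES ('2012-02-27 15:18:18');
INSERT INTO Attendance VALUES ('2012-02-27 15:19:00');
INSERT INTO Attendance VALUES ('2012-02-27 15:19:07');
""")
rows = conn.execute("""
SELECT ActionDate,
       (JULIANDAY(ActionDate) -
        JULIANDAY(LAG(ActionDate) OVER (ORDER BY ActionDate))) * 24 * 60
       AS diff_min
FROM Attendance
ORDER BY ActionDate
""").fetchall()
print(rows)   # first row has no predecessor, so its diff_min is NULL
```

The gaps come out as 0.95, 0.7 and roughly 0.12 minutes (57, 42 and 7 seconds).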

Delete all rows except latest per user, having a certain column value

I've got a table events that contains events for users, for example:
PK | user | event_type | timestamp
--------------------------------
1 | ab | DTV | 1
2 | ab | DTV | 2
3 | ab | CPVR | 3
4 | cd | DTV | 1
5 | cd | DTV | 2
6 | cd | DTV | 3
What I want to do is keep only one event per user, namely the one with the latest timestamp and event_type = 'DTV'.
After applying the delete to the example above, the table should look like this:
PK | user | event_type | timestamp
--------------------------------
2 | ab | DTV | 2
6 | cd | DTV | 3
Can any one of you come up with something that accomplishes this task?
Update: I'm using Sqlite. This is what I have so far:
delete from events
where id not in (
select id from (
select id, user, max(timestamp)
from events
where event_type = 'DTV'
group by user)
);
I'm pretty sure this can be improved upon. Any ideas?
I think you should be able to do something like this:
delete from events
where (user, timestamp) not in (
select user, max(timestamp)
from events
where event_type = 'DTV'
group by user
)
You could potentially do some more sophisticated tricks like table or partition replacement, depending on the database you're working with
If using SQL Server 2005/2008, then use the following SQL:
;WITH ce
AS (SELECT *,
Row_number()
OVER (
partition BY [user], event_type
ORDER BY timestamp DESC) AS rownumber
FROM events)
DELETE FROM ce
WHERE rownumber <> 1
OR event_type <> 'DTV'
Your solution doesn't seem entirely reliable to me, because the subquery pulls a column (id) that is neither aggregated nor listed in GROUP BY. That said, I am not an experienced SQLite user, and your solution did work when I tested it. If it's confirmed that the id column reliably comes from the row holding MAX(timestamp) in this situation, fine, your approach seems quite decent.
But if you are as unsure about your solution as I am, you could try the following:
DELETE FROM events
WHERE NOT EXISTS (
SELECT *
FROM (
SELECT MAX(timestamp) AS ts
FROM events e
WHERE event_type = 'DTV'
AND user = events.user
) s
WHERE ts = events.timestamp
);
The inner instance of events is assigned a different alias so that the events alias could be used to unambiguously reference the outer instance of the table (the one the DELETE command is actually being applied to). This solution does assume that timestamp is unique per user, though.
A working example can be run and played with on SQL Fiddle.
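The first answer's (user, timestamp) NOT IN approach can also be checked directly against the question's sample data; a hedged runnable sketch (row values in SQLite need version 3.15+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, user TEXT, event_type TEXT, timestamp INTEGER);
INSERT INTO events VALUES (1, 'ab', 'DTV', 1);
INSERT INTO events VALUES (2, 'ab', 'DTV', 2);
INSERT INTO events VALUES (3, 'ab', 'CPVR', 3);
INSERT INTO events VALUES (4, 'cd', 'DTV', 1);
INSERT INTO events VALUES (5, 'cd', 'DTV', 2);
INSERT INTO events VALUES (6, 'cd', 'DTV', 3);
""")
conn.execute("""
DELETE FROM events
WHERE (user, timestamp) NOT IN (
    SELECT user, MAX(timestamp)
    FROM events
    WHERE event_type = 'DTV'
    GROUP BY user)
""")
rows = conn.execute("SELECT id, user, timestamp FROM events ORDER BY id").fetchall()
print(rows)   # only the latest DTV row per user survives
```

As in the expected output, only rows 2 and 6 remain.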

SQL - Select unique rows from a group of results

I have racked my brain over this problem for quite some time. I've also reviewed other questions but was unsuccessful.
The problem is this: I have a list of results/table with multiple rows and the following columns:
| REGISTRATION | ID | DATE | UNITTYPE
| 005DTHGP | 172 | 2007-09-11 | MBio
| 005DTHGP | 1966 | 2006-09-12 | Tracker
| 013DTHGP | 2281 | 2006-11-01 | Tracker
| 013DTHGP | 2712 | 2008-05-30 | MBio
| 017DTNGP | 2404 | 2006-10-20 | Tracker
| 017DTNGP | 508 | 2007-11-10 | MBio
I am trying to select rows with unique REGISTRATIONs where the DATE is the max (latest). The IDs are not proportional to the DATE, meaning the ID can be a low value while its DATE is later than the matching row's, and vice versa. Therefore I can't use MAX() on both the DATE and the ID, and grouping just doesn't seem to work.
The results I want are as follows;
| REGISTRATION | ID | DATE | UNITTYPE
| 005DTHGP | 172 | 2007-09-11 | MBio
| 013DTHGP | 2712 | 2008-05-30 | MBio
| 017DTNGP | 508 | 2007-11-10 | MBio
PLEASE HELP!!!?!?!?!?!?!?
You want embedded queries (subqueries in the FROM clause), which not all SQL dialects support. In T-SQL you'd have something like:
select r.registration, r.recent, t.id, t.unittype
from (
select registration, max([date]) recent
from #tmp
group by
registration
) r
left outer join
#tmp t
on r.recent = t.[date]
and r.registration = t.registration
TSQL:
declare @R table
(
Registration varchar(16),
ID int,
Date datetime,
UnitType varchar(16)
)
insert into @R values ('A','1','20090824','A')
insert into @R values ('A','2','20090825','B')
select R.Registration,R.ID,R.UnitType,R.Date from @R R
inner join
(select Registration,Max(Date) as Date from @R group by Registration) M
on R.Registration = M.Registration and R.Date = M.Date
This can be inefficient if you have thousands of rows in your table depending upon how the query is executed (i.e. if it is a rowscan and then a select per row).
In PostgreSQL, and assuming your data is indexed so that a sort isn't needed (or there are so few rows you don't mind a sort):
select distinct on (registration) * from whatever order by registration, "date" desc;
Taking the rows in registration and descending date order, the latest date for each registration comes first; DISTINCT ON throws away the duplicate registrations that follow.
select registration,ID,date,unittype
from your_table
where (registration, date) IN (select registration,max(date)
from your_table
group by registration)
This should work in MySQL:
SELECT registration, id, date, unittype FROM
(SELECT registration AS temp_reg, MAX(date) as temp_date
FROM table_name GROUP BY registration) AS temp_table
WHERE registration=temp_reg and date=temp_date
The idea is to use a subquery in the FROM clause that yields one row per registration, containing the registration and its max date (the grouped fields); then use those values in the WHERE clause to fetch the other fields of the same row.
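A hedged end-to-end check of the "(registration, date) IN (...)" pattern on the question's sample rows (row values need SQLite 3.15+; the table name here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE units (registration TEXT, id INTEGER, date TEXT, unittype TEXT)")
conn.executemany("INSERT INTO units VALUES (?, ?, ?, ?)", [
    ('005DTHGP', 172,  '2007-09-11', 'MBio'),
    ('005DTHGP', 1966, '2006-09-12', 'Tracker'),
    ('013DTHGP', 2281, '2006-11-01', 'Tracker'),
    ('013DTHGP', 2712, '2008-05-30', 'MBio'),
    ('017DTNGP', 2404, '2006-10-20', 'Tracker'),
    ('017DTNGP', 508,  '2007-11-10', 'MBio'),
])
rows = conn.execute("""
SELECT registration, id, date, unittype
FROM units
WHERE (registration, date) IN (
    SELECT registration, MAX(date) FROM units GROUP BY registration)
ORDER BY registration
""").fetchall()
print(rows)   # one row per registration, carrying its latest date
```

This yields exactly the three rows the question asks for, with each ID taken from the row holding the max date rather than being max'd itself.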