Difference in dates when actions are taken multiple times - sql

I have the following table:
Table (History h)
| Source ID | Action | Created Date |
| 1 | Filing Rejected | 1/3/2023 |
| 2 | Filing Rejected | 1/4/2023 |
| 1 | Filing Resubmitted | 1/5/2023 |
| 3 | Filing Rejected | 1/5/2023 |
| 2 | Filing Resubmitted | 1/6/2023 |
| 1 | Filing Rejected | 1/7/2023 |
| 3 | Filing Resubmitted | 1/8/2023 |
| 1 | Filing Resubmitted | 1/9/2023 |
The results that I want are:
|Source ID | Rejected Date | Resubmitted Date | Difference |
| 1 | 1/3/2023 | 1/5/2023 | 2 |
| 1 | 1/7/2023 | 1/9/2023 | 2 |
| 2 | 1/4/2023 | 1/6/2023 | 2 |
| 3 | 1/5/2023 | 1/8/2023 | 3 |
My current query is:
SELECT h1.Source_ID, min(CONVERT(varchar,h1.CREATED_DATE,101)) AS 'Rejected Date',
min(CONVERT(varchar,h2.Created_Date,101)) AS 'Resubmitted Date',
DATEDIFF(HOUR, h1.Created_Date, min(h2.Created_Date)) / 24 Difference
FROM History h1 INNER JOIN History h2
ON h2.Source_ID = h1.Source_ID AND h2.Created_Date > h1.Created_Date
WHERE (h1.Created_Date >= '2023-01-01 00:00:00.000' AND h1.Created_Date <= '2023-01-31 23:59:59.000')
AND ((h1.CHANGE_VALUE_TO = 'Filing Rejected' AND h2.CHANGE_VALUE_TO = 'Filing Resubmitted'))
GROUP BY h1.Source_ID, h1.Created_Date,h2.Created_Date
ORDER BY 'Rejected Date' ASC;
The results I get are:
|Source ID | Rejected Date | Resubmitted Date | Difference |
| 1 | 1/3/2023 | 1/5/2023 | 2 |
| 1 * | 1/3/2023 | 1/9/2023 | 6 |
| 1 | 1/7/2023 | 1/9/2023 | 2 |
| 2 | 1/4/2023 | 1/6/2023 | 2 |
| 3 | 1/5/2023 | 1/8/2023 | 3 |
So there is one row that is showing up that should not be. I have marked it with an asterisk.
I just want the difference from the first rejection to the first resubmission, and from the second rejection to the second resubmission.
Any help, another idea on how to do it, anything really, is greatly appreciated.

If events always properly interleave, one approach uses row_number() and conditional aggregation:
select source_id,
max(case when action = 'Filing Rejected' then created_date end) rejected_date,
max(case when action = 'Filing Resubmitted' then created_date end) resubmitted_date
from (
select h.*,
row_number() over(partition by source_id, action order by created_date) rn
from history h
where created_date >= '2023-01-01' and created_date < '2023-02-01'
) h
group by source_id, rn
order by source_id, rn
This will not work if your data has consecutive rejections or resubmissions.
As for the date difference, we can add another layer so we don’t need to type the conditional expressions twice:
select h.*, datediff(day, rejected_date, resubmitted_date) diff_in_days
from (
select source_id,
max(case when action = 'Filing Rejected' then created_date end) rejected_date,
max(case when action = 'Filing Resubmitted' then created_date end) resubmitted_date
from (
select h.*,
row_number() over(partition by source_id, action order by created_date) rn
from history h
where created_date >= '2023-01-01' and created_date < '2023-02-01'
) h
group by source_id, rn
) h
order by source_id, rejected_date
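To make the pairing logic easier to follow, here is a minimal Python sketch of the same idea, using the sample rows from the question. The function name pair_events is made up for illustration. It assumes, like the SQL above, that rejections and resubmissions properly interleave per source.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question: (source_id, action, created_date)
history = [
    (1, "Filing Rejected", date(2023, 1, 3)),
    (2, "Filing Rejected", date(2023, 1, 4)),
    (1, "Filing Resubmitted", date(2023, 1, 5)),
    (3, "Filing Rejected", date(2023, 1, 5)),
    (2, "Filing Resubmitted", date(2023, 1, 6)),
    (1, "Filing Rejected", date(2023, 1, 7)),
    (3, "Filing Resubmitted", date(2023, 1, 8)),
    (1, "Filing Resubmitted", date(2023, 1, 9)),
]

def pair_events(rows):
    """Pair the n-th rejection with the n-th resubmission of each source_id."""
    # The position of a date in its sorted per-(source_id, action) list plays
    # the role of row_number() over (partition by source_id, action ...).
    by_key = defaultdict(list)
    for source_id, action, created in rows:
        by_key[(source_id, action)].append(created)
    result = []
    for source_id in sorted({r[0] for r in rows}):
        rejected = sorted(by_key[(source_id, "Filing Rejected")])
        resubmitted = sorted(by_key[(source_id, "Filing Resubmitted")])
        # zip stops at the shorter list, so a rejection without a resubmission
        # is dropped here (the SQL would keep it with a null resubmitted_date)
        for rej, res in zip(rejected, resubmitted):
            result.append((source_id, rej, res, (res - rej).days))
    return result

pairs = pair_events(history)
for row in pairs:
    print(row)
```

On the sample data this gives one row per rejection/resubmission pair, without the extra (1/3/2023, 1/9/2023) row.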

Related

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, I need a record for each user for each day. If a day is missing for a user, the last recorded score should be carried forward, so I would end up with something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
One option uses generate_date_array() to create the series of dates of each user, then brings in the table with a left join.
select d.date, d.user_id,
last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
select r.user_id, day as date
from (
-- per-user date range; assumes mytable.date is a DATE column
select user_id, min(date) as min_date, max(date) as max_date
from mytable
group by user_id
) r
cross join unnest(generate_date_array(r.min_date, r.max_date, interval 1 day)) as day
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
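As a rough illustration of the last approach (each row is expanded up to the day before the user's next row), here is a small Python sketch on the sample data from the question. The function name fill_gaps is made up, and the dates are assumed to be parsed already.

```python
from datetime import date, timedelta

# Sample rows from the question: (date, user_id, score)
rows = [
    (date(2020, 11, 20), 1, 26),
    (date(2020, 11, 21), 1, 14),
    (date(2020, 11, 25), 1, 0),
    (date(2020, 11, 14), 2, 32),
    (date(2020, 11, 16), 2, 0),
    (date(2020, 11, 20), 2, 23),
]

def fill_gaps(rows):
    """Repeat each score daily until the day before the user's next record."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # by user_id, then date
    for i, (day, user_id, score) in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        # lead(date) - 1 day: the day before the same user's next record
        if nxt is not None and nxt[1] == user_id:
            last_day = nxt[0] - timedelta(days=1)
        else:
            last_day = day  # a user's last record only covers its own day
        current = day
        while current <= last_day:
            out.append((current, user_id, score))
            current += timedelta(days=1)
    return out

filled = fill_gaps(rows)
for row in filled:
    print(row)
```

Because each source row is expanded on its own, no join back to the table is needed to carry the score forward.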

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids over the last 30 days:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range:
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then tried several ways to do a rolling 30 day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.
I am writing this query from Metabase using the SQL query editor. The underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME            |
+-----+-------------------------------+
| 10  | 2020-01-13T01:46:07.000-05:00 |
| 5   | 2020-01-13T01:46:07.000-05:00 |
| 3   | 2020-01-18T02:49:23.000-05:00 |
| 9   | 2020-03-06T18:18:28.000-05:00 |
| 2   | 2020-03-06T18:18:28.000-05:00 |
| 8   | 2020-03-31T23:13:33.000-04:00 |
| 3   | 2020-08-28T18:23:15.000-04:00 |
| 2   | 2020-08-28T18:23:15.000-04:00 |
| 9   | 2020-08-28T18:23:15.000-04:00 |
| 3   | 2020-08-28T18:23:15.000-04:00 |
| 8   | 2020-09-15T16:40:29.000-04:00 |
| 3   | 2020-09-21T20:49:09.000-04:00 |
| 1   | 2020-11-05T21:31:48.000-05:00 |
| 6   | 2020-11-05T21:31:48.000-05:00 |
| 8   | 2020-12-12T04:42:00.000-05:00 |
| 8   | 2020-12-12T04:42:00.000-05:00 |
| 5   | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
Below is what the result I would like looks like:
+------------+---------------------+
| DATE       | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| 2020-01-13 | 3                   |
| 2020-01-18 | 1                   |
| 2020-03-06 | 3                   |
| 2020-03-31 | 1                   |
| 2020-08-28 | 4                   |
| 2020-09-15 | 2                   |
| 2020-09-21 | 1                   |
| 2020-11-05 | 2                   |
| 2020-12-12 | 2                   |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
1. Unpivoting the data to have an "enters count" and "leaves" count for each session.
2. Accumulate the count so on each day for each user you know whether they are counted or not.
3. This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
4. Now you can simply do a cumulative sum on each date to determine the 30 day session.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
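The enter/leave counter idea can also be sketched outside SQL. Below is a small Python illustration under the same assumption as the query (a session stops counting 31 days after it happened). The function name rolling_active and the three sample sessions are invented for the example and are not taken from the question's data.

```python
from collections import defaultdict
from datetime import date, timedelta

WINDOW = timedelta(days=31)  # a session stops counting 31 days later, as in the SQL

def rolling_active(sessions):
    """sessions: iterable of (uid, day).
    Returns {day: number of counted users} for each day the count changes."""
    days_by_uid = defaultdict(set)
    for uid, day in sessions:
        days_by_uid[uid].add(day)

    delta = defaultdict(int)  # net change in the counted users per day
    for days in days_by_uid.values():
        start = end = None
        for day in sorted(days):
            if start is not None and day > end:
                # the previous island is over: record its two boundaries
                delta[start] += 1
                delta[end] -= 1
                start = None
            if start is None:
                start = day
            end = day + WINDOW  # each session pushes the island's end forward
        delta[start] += 1
        delta[end] -= 1

    counted, running = {}, 0
    for day in sorted(delta):
        running += delta[day]  # cumulative sum over the boundary events
        counted[day] = running
    return counted

# Hypothetical sessions: user "a" comes back before leaving the window
sessions = [("a", date(2020, 1, 1)), ("b", date(2020, 1, 10)), ("a", date(2020, 1, 20))]
active = rolling_active(sessions)
for day, n in active.items():
    print(day, n)
```

User "a" forms a single island from 2020-01-01 to 2020-02-20 because the second session arrives before the first one expires, so "a" is never counted twice.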
I'm pretty sure the easier way to do this is to use a join. The query creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. It then joins the user list to the date list one-to-many and counts the distinct users. The key here is the expanded join condition, which matches a range of dates to a single date through a pair of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join
dates
on users.dt >= dates.dt - interval '29 days'
and users.dt <= dates.dt
group by dates.dt
order by dt desc
;
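For intuition, the range join boils down to this: for every date, look at all (user, day) pairs whose day falls in the 30 days ending on that date, and count the distinct users. A brute-force Python sketch follows. The function name rolling_distinct and the sample sessions are made up for illustration.

```python
from datetime import date, timedelta

def rolling_distinct(sessions, window_days=30):
    """sessions: iterable of (uid, day). For each day that has a session, count
    the distinct uids seen on that day or in the window_days - 1 days before it."""
    pairs = set(sessions)                     # distinct (uid, day), like the users CTE
    days = sorted({day for _, day in pairs})  # distinct days, like the dates CTE
    lookback = timedelta(days=window_days - 1)
    return {
        d: len({uid for uid, day in pairs if d - lookback <= day <= d})
        for d in days
    }

# Hypothetical sessions; uid 10 appears twice inside one window
sessions = [
    (10, date(2020, 1, 13)),
    (5, date(2020, 1, 13)),
    (3, date(2020, 1, 18)),
    (10, date(2020, 2, 5)),
    (9, date(2020, 3, 6)),
]
counts = rolling_distinct(sessions)
for day, n in counts.items():
    print(day, n)
```

This compares every date with every pair, so it is only meant to show the logic. In the database the same comparison is done by the join.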

How to flatten a table from row to columns

I use MariaDB 10.2.21
I have not seen this exact case elsewhere, hence my request for assistance.
I have a History table containing one record per change on any of the fields in a JIRA issue:
+----------+---------------+----------+-----------------+---------------------+
| IssueKey | OriginalValue | NewValue | Field | ChangeDate |
+----------+---------------+----------+-----------------+---------------------+
| HRSK-184 | (NULL) | 2 | Risk Detection | 2019-10-24 10:57:27 |
| HRSK-184 | (NULL) | 2 | Risk Occurrence | 2019-10-24 10:57:27 |
| HRSK-184 | (NULL) | 2 | Risk Severity | 2019-10-24 10:57:27 |
| HRSK-184 | 2 | 4 | Risk Detection | 2019-10-25 11:54:07 |
| HRSK-184 | 2 | 6 | Risk Detection | 2019-10-25 11:54:07 |
| HRSK-184 | 2 | 3 | Risk Severity | 2019-10-24 11:54:07 |
| HRSK-184 | 6 | 5 | Risk Detection | 2019-10-26 09:11:01 |
+----------+---------------+----------+-----------------+---------------------+
Every record contains the old and new value and the fieldtype that has changed ('Field') and, of course, the corresponding timestamp of that change.
I want to query the point-in-time status, giving me the combination of the most recent values of each of the fields 'Risk Severity', 'Risk Occurrence' and 'Risk Detection'.
The result should be like this:
+----------+----------------+-------------------+------------------+----------------------+
| IssueKey | Risk Severity | Risk Occurrence | Risk Detection | ChangeDate |
+----------+----------------+-------------------+------------------+----------------------+
| HRSK-184 | 3 | 2 | 5 | 2019-10-26 09:11:01 |
+----------+----------------+-------------------+------------------+----------------------+
Any ideas? I'm stuck...
Thanks in advance for your effort!
You could use a couple of inline queries:
select
IssueKey,
(
select t1.NewValue
from mytable t1
where t1.IssueKey = t.IssueKey and t1.Field = 'Risk Severity'
order by ChangeDate desc limit 1
) `Risk Severity`,
(
select t1.NewValue
from mytable t1
where t1.IssueKey = t.IssueKey and t1.Field = 'Risk Occurrence'
order by ChangeDate desc limit 1
) `Risk Occurrence`,
(
select t1.NewValue
from mytable t1
where t1.IssueKey = t.IssueKey and t1.Field = 'Risk Detection'
order by ChangeDate desc limit 1
) `Risk Detection`,
max(ChangeDate) ChangeDate
from mytable t
group by IssueKey
With an index on (IssueKey, Field, ChangeDate, NewValue), this should be an efficient option.
Demo on DB Fiddle:
IssueKey | Risk Severity | Risk Occurrence | Risk Detection | ChangeDate
:------- | ------------: | --------------: | ------------: | :------------------
HRSK-184 | 3 | 2 | 5 | 2019-10-26 09:11:01
MariaDB 10.2 has introduced some Window Functions for analytical queries.
One of them is RANK() OVER (PARTITION BY ...ORDER BY...) function.
First apply it, then pivot through conditional aggregation:
SELECT IssueKey,
MAX(CASE WHEN Field = 'Risk Severity' THEN NewValue END ) AS RiskSeverity,
MAX(CASE WHEN Field = 'Risk Occurrence' THEN NewValue END ) AS RiskOccurrence,
MAX(CASE WHEN Field = 'Risk Detection' THEN NewValue END ) AS RiskDetection,
MAX(ChangeDate) AS ChangeDate
FROM
(
SELECT RANK() OVER (PARTITION BY IssueKey, Field ORDER BY ChangeDate Desc) rnk,
t.*
FROM mytable t
) t
WHERE rnk = 1
GROUP BY IssueKey;
IssueKey | RiskSeverity | RiskOccurrence | RiskDetection | ChangeDate
-------- + --------------+-----------------+----------------+--------------------
HRSK-184 | 3 | 2 | 5 | 2019-10-26 09:11:01
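Both answers do the same two things: keep the most recent row per (IssueKey, Field), then turn the fields into columns. A small Python sketch of those two steps on the question's rows is below. The function name latest_per_field is made up, and on equal timestamps the first row seen wins, which is an arbitrary choice for this sketch.

```python
from datetime import datetime

# Rows from the question: (IssueKey, NewValue, Field, ChangeDate)
history = [
    ("HRSK-184", 2, "Risk Detection", datetime(2019, 10, 24, 10, 57, 27)),
    ("HRSK-184", 2, "Risk Occurrence", datetime(2019, 10, 24, 10, 57, 27)),
    ("HRSK-184", 2, "Risk Severity", datetime(2019, 10, 24, 10, 57, 27)),
    ("HRSK-184", 4, "Risk Detection", datetime(2019, 10, 25, 11, 54, 7)),
    ("HRSK-184", 6, "Risk Detection", datetime(2019, 10, 25, 11, 54, 7)),
    ("HRSK-184", 3, "Risk Severity", datetime(2019, 10, 24, 11, 54, 7)),
    ("HRSK-184", 5, "Risk Detection", datetime(2019, 10, 26, 9, 11, 1)),
]

def latest_per_field(rows):
    """Return {IssueKey: {field: latest value, ..., 'ChangeDate': latest change}}."""
    latest = {}  # (issue, field) -> (change_date, value)
    for issue, value, field, changed in rows:
        key = (issue, field)
        # strict ">" keeps the first row seen when timestamps are equal
        if key not in latest or changed > latest[key][0]:
            latest[key] = (changed, value)
    flat = {}
    for (issue, field), (changed, value) in latest.items():
        record = flat.setdefault(issue, {"ChangeDate": changed})
        record[field] = value  # pivot: one key per field
        record["ChangeDate"] = max(record["ChangeDate"], changed)
    return flat

status = latest_per_field(history)
print(status)
```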

SQL-Server query to select last and previous information for multiple columns

After searching Stack Overflow, I can't find a solution to this problem.
I'm using this query:
SELECT *
FROM(
SELECT DISTINCT *
FROM Table_01
ORDER BY ID, StartDate
UNION ALL(
SELECT DISTINCT * FROM Table_02
ORDER BY ID, StartDate
)
UNION ALL (...
) a ORDER BY a.ID, a.StartDate
I get something like this. For each ID, I would like to keep the latest and the previous dates and the other columns, to record a history:
+------+------------+-----------+-------+-------+
| ID | StartDate | EndDate | Value | rate |
+------+------------+-----------+-------+-------+
| 1 | 2018-06-29 |2018-10-22 | 15 | 77.2 |
| 1 | 2018-04-28 |2018-06-21 | 23 | 55.3 |
| 1 | 2018-02-24 |2018-04-15 | 41 | 44.3 |
| 1 | 2017-06-29 |2017-11-29 | 55 | 44.1 |
| 2 | 2018-07-29 |2018-11-22 | 15 | 106.1 |
| 2 | 2018-03-28 |2018-07-21 | 23 | 10.8 |
| 2 | 2017-12-28 |2018-03-28 | 22 | 11.0 |
| 3 | 2017-09-28 |2018-01-28 | 11 | 87.09 |
| 3 | 2017-06-27 |2018-09-28 | 58 | 100 |
| ... | ... | ... | ... | ... |
+------+------------+-----------+-------+-------+
And I would like to have the next table, to keep the previous information
+------+------------+------------+------------+------------+-------+--------+--------+-------+
| ID   | StartDate  | EndDate    | StartDateP | EndDateP   | Value | rate   | ValueP | rateP |
+------+------------+------------+------------+------------+-------+--------+--------+-------+
| 1    | 2018-06-29 | 2018-10-22 | 2018-04-28 | 2018-06-21 | 15    | 77.2   | 23     | 55.3  |
| 2    | 2018-07-29 | 2018-11-22 | 2018-03-28 | 2018-07-21 | 15    | 106.1  | 23     | 10.8  |
| 3    | 2017-09-28 | 2018-01-28 | 2017-06-27 | 2018-09-28 | 11    | 87.09  | 58     | 100   |
| ...  | ...        | ...        | ...        | ...        | ...   | ...    | ...    | ...   |
+------+------------+------------+------------+------------+-------+--------+--------+-------+
If I understand you correctly you want the row with the latest start date combined with the row with the startdate just before that? This might do the trick
WITH results AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID ORDER BY StartDate DESC) r
FROM (
-- start of your original query, without the ORDER BY clauses
-- (SQL Server rejects ORDER BY in a derived table unless TOP, OFFSET or FOR XML is also specified)
SELECT DISTINCT *
FROM Table_01
UNION ALL
SELECT DISTINCT *
FROM Table_02
UNION ALL
(...)
-- end of your original query
) a
)
SELECT
r1.id, r1.startDate, r1.enddate,
r2.startDate startDateP, r2.enddate enddateP,
r1.value, r1.rate,
r2.value valueP, r2.rate rateP
FROM results r1
LEFT JOIN results r2 ON r2.id = r1.id AND r2.r = 2
WHERE r1.r = 1
Another option is using Row_Number() in concert with a conditional aggregation
Example
Select ID
,StartDate = max(case when RN=1 then StartDate end)
,EndDate = max(case when RN=1 then EndDate end)
,StartDateP = max(case when RN=2 then StartDate end)
,EndDateP = max(case when RN=2 then EndDate end)
,Value = max(case when RN=1 then Value end)
,Rate = max(case when RN=1 then Rate end)
,ValueP = max(case when RN=2 then Value end)
,RateP = max(case when RN=2 then Rate end)
From (
Select *
,RN = Row_Number() over (Partition By ID Order by StartDate Desc)
From YourTable
) A
Group By ID
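The idea behind both answers is to rank the rows of each ID from newest to oldest and put rank 1 and rank 2 side by side. Here is a minimal Python sketch of that, using a few rows from the question. The function name latest_and_previous is made up, and ordering by StartDate is an assumption based on the desired output.

```python
from collections import defaultdict
from datetime import date

# A few rows from the question: (ID, StartDate, EndDate, Value, rate)
rows = [
    (1, date(2018, 6, 29), date(2018, 10, 22), 15, 77.2),
    (1, date(2018, 4, 28), date(2018, 6, 21), 23, 55.3),
    (1, date(2018, 2, 24), date(2018, 4, 15), 41, 44.3),
    (2, date(2018, 7, 29), date(2018, 11, 22), 15, 106.1),
    (2, date(2018, 3, 28), date(2018, 7, 21), 23, 10.8),
    (3, date(2017, 9, 28), date(2018, 1, 28), 11, 87.09),
]

def latest_and_previous(rows):
    """Return {ID: (latest row, previous row or None)} ordered by StartDate."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row[0]].append(row)
    out = {}
    for id_, group in by_id.items():
        # newest first, like ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StartDate DESC)
        group.sort(key=lambda r: r[1], reverse=True)
        current = group[0]                               # rn = 1
        previous = group[1] if len(group) > 1 else None  # rn = 2, None mirrors SQL NULLs
        out[id_] = (current, previous)
    return out

result = latest_and_previous(rows)
for id_, (current, previous) in result.items():
    print(id_, current, previous)
```

ID 3 has only one row in this reduced sample, so its previous row is None, just as the left join or the conditional aggregation would return NULLs.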

SQL Server Active Record counts by month

I've created a database storing Incident tickets.
I have created a fact and a number of dimension tables.
Here is some sample data
+---------------------+--------------+--------------+-------------+------------+
| LastModifiedDateKey | TicketNumber | Status | factCurrent | Date |
+---------------------+--------------+--------------+-------------+------------+
| 2774 | T:9992260 | Open | 1 | 4/12/2017 |
| 2777 | T:9992805 | Open | 1 | 7/12/2017 |
| 2777 | T:9993068 | Open | 1 | 7/12/2017 |
| 2777 | T:9993098 | Open | 0 | 7/12/2017 |
| 2793 | T:9993098 | Acknowledged | 0 | 23/12/2017 |
| 2928 | T:9993098 | Closed | 1 | 5/01/2018 |
| 2777 | T:9993799 | Open | 0 | 7/12/2017 |
| 2928 | T:9993799 | Closed | 1 | 5/01/2018 |
| 2778 | T:9994729 | Open | 1 | 8/12/2017 |
| 2774 | T:9994791 | Open | 0 | 4/12/2017 |
| 2928 | T:9994791 | Closed | 1 | 5/01/2018 |
| 2777 | T:9994912 | Open | 1 | 7/12/2017 |
| 2778 | T:9995201 | Open | 0 | 8/12/2017 |
| 2793 | T:9995201 | Closed | 1 | 23/12/2017 |
| 2931 | T:9718629 | Open | 1 | 8/01/2018 |
| 2933 | T:9718629 | Closed | 1 | 10/01/2018 |
| 2932 | T:9855664 | Open | 1 | 9/01/2018 |
| 2931 | T:9891975 | Open | 1 | 8/01/2018 |
+---------------------+--------------+--------------+-------------+------------+
I want a query that will give me the total of tickets open at the end of each month.
In the data January should have 8 and Feb 2.
Note that a ticket can have multiple rows with the same status because a dimension key has changed, or multiple rows with different statuses all in the same month, e.g. T:9993098.
This approach first uses ROW_NUMBER to identify the most recent record for each ticket, for each month/year. It is assumed that the most recent record in a month will contain the status in which a ticket ended for that month. Then, it aggregates over this modified table, counting only tickets which ended the month in an open status.
SELECT
CAST(YEAR(Date) AS varchar(4)) + '-' + CAST(MONTH(Date) AS varchar(2)) AS date,
COUNT(*) AS num_open_tickets
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY YEAR(Date), MONTH(Date), TicketNumber
ORDER BY Date DESC) rn
FROM yourTable
) t
WHERE t.rn = 1 AND t.Status = 'Open'
GROUP BY
CAST(YEAR(Date) AS varchar(4)) + '-' + CAST(MONTH(Date) AS varchar(2));
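To show what this query computes, here is a small Python sketch of the same two steps (latest record per ticket per month, then count the ones that are Open) on a handful of rows from the question. The function name open_at_month_end is made up for illustration.

```python
from collections import Counter
from datetime import date

# A few rows from the question: (TicketNumber, Status, Date)
rows = [
    ("T:9992260", "Open", date(2017, 12, 4)),
    ("T:9993098", "Open", date(2017, 12, 7)),
    ("T:9993098", "Acknowledged", date(2017, 12, 23)),
    ("T:9993098", "Closed", date(2018, 1, 5)),
    ("T:9995201", "Open", date(2017, 12, 8)),
    ("T:9995201", "Closed", date(2017, 12, 23)),
    ("T:9718629", "Open", date(2018, 1, 8)),
    ("T:9718629", "Closed", date(2018, 1, 10)),
    ("T:9855664", "Open", date(2018, 1, 9)),
]

def open_at_month_end(rows):
    """Count tickets whose latest record within a month has status 'Open'."""
    latest = {}  # (year, month, ticket) -> (date, status)
    for ticket, status, day in rows:
        key = (day.year, day.month, ticket)
        # ">=" lets a later input row win when two rows share a date
        if key not in latest or day >= latest[key][0]:
            latest[key] = (day, status)
    counts = Counter()
    for (year, month, _ticket), (_day, status) in latest.items():
        if status == "Open":
            counts[(year, month)] += 1
    return dict(counts)

open_counts = open_at_month_end(rows)
print(open_counts)
```

Note that a ticket only contributes to months in which it has at least one record. T:9992260 is still open in this sample but has no January row, so it is counted for December only.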
First, I would generate the months. Then do a cumulative count of the opens minus the closes. Alas, that is a bit tricky because of the repeated rows for a ticket and because you are using an old version of SQL Server.
But . . . you can do this:
with months as (
select dateadd(day, 1 - day(min(date)), min(date)) as mon_start,
max(date) as max_date
from sample
union all
select dateadd(month, 1, mon_start), max_date
from months
where dateadd(month, 1, mon_start) < max_date
)
select m.mon_end,
(select count(distinct case when status = 'Open' then ticket end) -
count(distinct case when status = 'Closed' then ticket end)
from sample s
where s.date <= m.mon_end
) as open_tickets
from (select dateadd(day, -1, mon_start) as mon_end
from months
) m;
This uses a recursive CTE to generate the months. It is easier to generate the first day of the months and then subtract one day afterwards (what is the date when you add 1 month to the last day of February?)
The rest uses a correlated subquery to count the number of open tickets on that date.
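A rough Python sketch of the same two ingredients (generate month ends, then count distinct opened tickets minus distinct closed tickets up to each month end) is shown below. The function names month_ends and open_tickets_by_month_end are made up. Unlike the recursive CTE, this sketch covers every month from the first to the last date in the data, which is my own choice and not necessarily what the query above returns.

```python
from datetime import date, timedelta

# A few rows from the question: (TicketNumber, Status, Date)
rows = [
    ("T:9992260", "Open", date(2017, 12, 4)),
    ("T:9993098", "Open", date(2017, 12, 7)),
    ("T:9993098", "Acknowledged", date(2017, 12, 23)),
    ("T:9993098", "Closed", date(2018, 1, 5)),
    ("T:9995201", "Open", date(2017, 12, 8)),
    ("T:9995201", "Closed", date(2017, 12, 23)),
    ("T:9718629", "Open", date(2018, 1, 8)),
    ("T:9718629", "Closed", date(2018, 1, 10)),
    ("T:9855664", "Open", date(2018, 1, 9)),
]

def month_ends(first_day, last_day):
    """Yield the last day of each month from first_day's month to last_day's month."""
    year, month = first_day.year, first_day.month
    while (year, month) <= (last_day.year, last_day.month):
        # first day of the following month, minus one day
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        yield date(next_year, next_month, 1) - timedelta(days=1)
        year, month = next_year, next_month

def open_tickets_by_month_end(rows):
    """Distinct tickets opened minus distinct tickets closed up to each month end."""
    days = [day for _, _, day in rows]
    result = {}
    for end in month_ends(min(days), max(days)):
        opened = {t for t, status, day in rows if status == "Open" and day <= end}
        closed = {t for t, status, day in rows if status == "Closed" and day <= end}
        result[end] = len(opened) - len(closed)
    return result

open_by_month = open_tickets_by_month_end(rows)
for end, n in open_by_month.items():
    print(end, n)
```

Computing the first day of the next month and stepping back one day avoids having to know how many days each month has.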