Hi all, sorry about the poorly worded title; I'm unsure how to phrase exactly what I need, but I'll try to explain it better below:
I have a dataset that looks like this:
DECLARE @TestDATA TABLE (PERSON_ID int, START_DATE date, END_DATE date, SERVICE_RANK int)
INSERT INTO @TestDATA
VALUES
(123, '2018-01-31', '2018-02-14', 7),
(123, '2018-03-28', '2018-04-11', 4),
(123, '2018-04-12', '2018-04-30', 4),
(123, '2018-05-25', '2018-06-08', 7),
(123, '2018-06-08', '2018-06-15', 7),
(123, '2018-06-19', '2018-06-26', 7),
(123, '2018-06-26', '2018-09-28', 4),
(123, '2018-10-10', '2018-11-07', 7),
(123, '2018-11-27', '2018-12-11', 7),
(123, '2018-12-11', '2018-12-24', 7)
This shows a date range and "service rank" for each person (there is only one person in this example, but there are tens of thousands in the database).
For each person_id I would like to group the date ranges by service_rank to identify how many distinct periods they have had. In the above example, this is what I would be looking for:
PERSON ID, START_DATE, END_DATE, SERVICE_RANK, SERVICE_PERIOD
123 2018-01-31 2018-02-14 7 1
123 2018-03-28 2018-04-11 4 2
123 2018-04-12 2018-04-30 4 2
123 2018-05-25 2018-06-08 7 3
123 2018-06-08 2018-06-15 7 3
123 2018-06-19 2018-06-26 7 3
123 2018-06-26 2018-09-28 4 4
123 2018-10-10 2018-11-07 7 5
123 2018-11-27 2018-12-11 7 5
123 2018-12-11 2018-12-24 7 5
I have tried row_number, rank, dense_rank, and even had a go at the dreaded CURSOR FOR, but I cannot get anything to work, as the windowed functions see the service ranks as the same: in the above example they would see two service ranks when there are actually five distinct periods that happen to share the same rank values.
Also, in the dataset not every person jumps from one service_rank to another and back. They may go from one to another (e.g. 4 -> 7) and stay there, or they may have only one service_rank over multiple rows.
Any ideas??
This is a gaps-and-islands problem. For this purpose, one method is lag() and a cumulative sum:
select t.*,
       sum(case when prev_service_rank = service_rank then 0 else 1 end)
           over (partition by person_id order by start_date) as service_period
from (select t.*,
             lag(service_rank) over (partition by person_id order by start_date) as prev_service_rank
      from @TestDATA t
     ) t;
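Against the @TestDATA sample above, the flag is 1 on the first row (prev_service_rank is NULL) and at each rank change (7 -> 4, 4 -> 7, 7 -> 4, 4 -> 7), so the cumulative sum yields service_period values 1 through 5, matching the expected output.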
This is a follow-up question to my initial post
Example Situation: An order system tracks manually entered due dates by recording a system log date that is always unique (this would be a datetime, but I've used dates for simplicity, making each unique).
I would like to assign a section number to each due date grouping where the due date remains the same chronologically.
Stu's response solved the table in my initial post, but I've noticed that if I change the 4/15/2022 due date associated with the SysLogDate of 1/16/2022 to 4/13/2022, the desired ordering is no longer maintained:
Note: the 4/13/2022 date is an arbitrary change; the same issue occurs with any other unique date not already in the DueDate column. Ultimately, I also need to handle changes to/from NULL, where someone 'forgets' to enter the date, but replacing the date with NULL yields the same issue.
Updated Table:
CREATE TABLE #DueDates (OrderNo INT, DueDate Date, SysLogDate Date)
INSERT INTO #DueDates Values (1, '4/10/2022', '1/10/2022')
,(1, '4/10/2022', '1/11/2022')
,(1, '4/15/2022', '1/15/2022')
,(1, '4/13/2022', '1/16/2022') -- Due Date Altered since prior post
,(1, '4/15/2022', '1/17/2022')
,(1, '4/10/2022', '1/18/2022')
,(1, '4/10/2022', '1/19/2022')
,(1, '4/10/2022', '1/20/2022')
,(2, '4/10/2022', '2/16/2022')
,(2, '4/10/2022', '2/17/2022')
,(2, '4/15/2022', '2/18/2022')
,(2, '4/15/2022', '2/20/2022')
,(2, '4/15/2022', '2/21/2022')
,(2, '4/10/2022', '2/22/2022')
,(2, '4/10/2022', '2/24/2022')
,(2, '4/10/2022', '2/26/2022')
Desired Results Are:
OrderNo DueDate SysLogDate SectionNumber_WithinDueDate
1 2022-04-10 2022-01-10 1
1 2022-04-10 2022-01-11 1
1 2022-04-15 2022-01-15 2
1 2022-04-13 2022-01-16 3
1 2022-04-15 2022-01-17 4
1 2022-04-10 2022-01-18 5
1 2022-04-10 2022-01-19 5
1 2022-04-10 2022-01-20 5
2 2022-04-10 2022-02-16 1
2 2022-04-10 2022-02-17 1
2 2022-04-15 2022-02-18 2
2 2022-04-15 2022-02-20 2
2 2022-04-15 2022-02-21 2
2 2022-04-10 2022-02-22 3
2 2022-04-10 2022-02-24 3
2 2022-04-10 2022-02-26 3
...but applying the solution from my prior post to this updated table yields:
OrderNo DueDate SysLogDate SectionNumber_WithinDueDate
1 2022-04-10 2022-01-10 1
1 2022-04-10 2022-01-11 1
1 2022-04-15 2022-01-15 2
1 2022-04-13 2022-01-16 3 **
1 2022-04-15 2022-01-17 3 **
1 2022-04-10 2022-01-18 3 **
1 2022-04-10 2022-01-19 3 **
1 2022-04-10 2022-01-20 3 **
2 2022-04-10 2022-02-16 1
2 2022-04-10 2022-02-17 1
2 2022-04-15 2022-02-18 2
2 2022-04-15 2022-02-20 2
2 2022-04-15 2022-02-21 2
2 2022-04-10 2022-02-22 3
2 2022-04-10 2022-02-24 3
2 2022-04-10 2022-02-26 3
Here's a working demo that uses the updated table above and the solution from my prior post, showing the undesired results above: Fiddle
Demo showing same effect when the date is replaced with NULL: Fiddle with NULL
Copy of the selected solution from my prior post (used in the above Fiddles):
select OrderNo, DueDate, SysLogDate,
dense_rank() over(partition by orderno order by gp) SectionNumber_WithinDueDate
from (
select *,
Row_Number() over(partition by OrderNo order by SysLogDate)
- Row_Number() over(partition by OrderNo, DueDate order by SysLogDate) gp
from #DueDates
)t
order by OrderNo, SysLogDate;
It's a small change in the data, but I haven't been able to work out how to alter the 'Row_Number difference line' in the subquery to get the desired results.
Thank you for any advice you can offer here :)
For gaps-and-islands problems, I prefer the lag() window function, as it is easier to understand.
Use lag() to compare each row with the previous row's value and, when it changes, set a flag (value 1). A cumulative sum over the flag gives you the grp, and dense_rank() over grp gives you your SectionNumber_WithinDueDate.
Since you have NULL values, use ISNULL() to substitute a sentinel date ('99991231') for the comparison.
select OrderNo, DueDate, SysLogDate,
       SectionNumber_WithinDueDate = dense_rank() over (partition by OrderNo order by grp)
from
(
    select *, grp = sum(g) over (partition by OrderNo order by SysLogDate)
    from
    (
        select *,
               g = case when isnull(DueDate, '99991231')
                        <> isnull(lag(DueDate) over (partition by OrderNo order by SysLogDate),
                                  '99991231')
                        then 1
                        else 0
                   end
        from #DueDates
    ) d
) d
order by OrderNo, SysLogDate;
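Against the sample data above, this yields sections 1,1,2,3,4,5,5,5 for OrderNo 1 and 1,1,2,2,2,3,3,3 for OrderNo 2, matching the desired results. One caveat on the sentinel: ISNULL with '99991231' assumes that date never occurs as a real DueDate; if it could, you would need a different sentinel or a null-aware comparison.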
Fiddle on your sample data:
fiddle 1
fiddle 2
I have a table in Oracle SQL that contains client IDs and the date and time of their logins to an application:
ID | LOGGED
----------------
11 | 2021-07-10 12:55:13.278
11 | 2021-08-10 13:58:13.211
11 | 2021-02-11 12:22:13.364
22 | 2021-01-10 08:34:13.211
33 | 2021-04-02 14:21:13.272
I need to select only those clients (IDs) who have logged in at least once in the last month (August) and at least once in one of the months preceding August (June or July).
It is currently September, so...
I need clients who have logged in at least once in August
and at least once in July or June,
if logged in June -> not logged in July
if logged in July -> not logged in June
As a result I need something like below:
ID
----
11
How can I do that in Oracle SQL? Be aware that the column "LOGGED" holds a timestamp like 2021-01-10 08:34:13.211.
Maybe consider this:
select id
from yourtable
group by id
having count(case months_between(trunc(sysdate, 'MM'),
                                 trunc(logged, 'MM'))
             when 1 then 1 end) >= 1
   and count(case when months_between(trunc(sysdate, 'MM'),
                                      trunc(logged, 'MM')) in (2, 3)
                  then 1 end) = 1
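To unpack the conditions: months_between(trunc(sysdate,'MM'), trunc(logged,'MM')) compares the first days of the two months, so a result of 1 marks a login in the immediately preceding month (August, given that it is currently September), while results of 2 or 3 mark logins in July or June.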
I don't understand one thing. You wrote:
at least once in one of the months preceding August (June or July)
and then:
if logged in June -> not logged in July
if logged in July -> not logged in June
If you need a login in EXACTLY one month (June or July), just use my query above.
If you need at least one logon in June or July, then:
select id
from yourtable
group by id
having count(case months_between(trunc(sysdate, 'MM'),
                                 trunc(logged, 'MM'))
             when 1 then 1 end) >= 1
   and count(case when months_between(trunc(sysdate, 'MM'),
                                      trunc(logged, 'MM')) in (2, 3)
                  then 1 end) >= 1
Your question needs some clarification, but based on what you described I see a couple of options.
The simplest one is probably a combination of data densification (to generate a row for every month for each id) and an analytic function (to enable inter-row calculations). Here's a simple example:
rem create a dummy table with some more data (you do not seem to worry about the exact timestamp)
drop table logs purge;
create table logs (ID number, LOGGED timestamp);
insert into logs values (11, to_timestamp('2021-07-10 12:55:13.278','yyyy-mm-dd HH24:MI:SS.FF'));
insert into logs values (11, to_timestamp('2021-07-11 12:55:13.278','yyyy-mm-dd HH24:MI:SS.FF'));
insert into logs values (11, to_timestamp('2021-08-10 13:58:13.211','yyyy-mm-dd HH24:MI:SS.FF'));
insert into logs values (11, to_timestamp('2021-02-11 12:22:13.364','yyyy-mm-dd HH24:MI:SS.FF'));
insert into logs values (11, to_timestamp('2021-04-11 12:22:13.364','yyyy-mm-dd HH24:MI:SS.FF'));
insert into logs values (22, to_timestamp('2021-01-10 08:34:13.211','yyyy-mm-dd HH24:MI:SS.FF'));
insert into logs values (33, to_timestamp('2021-04-02 14:21:13.272','yyyy-mm-dd HH24:MI:SS.FF'));
commit;
The following SQL densifies your data and lists the total login count for each month and the previous month on the same row, so that you can do a comparative calculation. I have not done that last step here, but I hope you get the idea.
with t as
(-- dummy artificial table just to create a time dimension for densification
select distinct to_char(sysdate - rownum,'yyyy-mm') mon
from dual connect by level < 300),
l_sparse as
(-- aggregating your login info per month
select id, to_char(logged,'yyyy-mm') mon, count(*) cnt
from logs group by id, to_char(logged,'yyyy-mm') ),
l_dense as
(-- densification with partition outer join
select t.mon, l.id, cnt from l_sparse l partition by (id)
right outer join t on (l.mon = t.mon)
)
-- final analytical function to list current and previous row info in same record
select mon, id
, cnt
, lag(cnt) over (partition by id order by mon asc) prev_cnt
from l_dense
order by id, mon;
Part of the result:
MON ID CNT PREV_CNT
------- ---------- ---------- ----------
2020-12 11
2021-01 11
2021-02 11 2
2021-03 11 2
2021-04 11 1
2021-05 11 1
2021-06 11
2021-07 11 3
2021-08 11 2 3
2021-09 11 2
2020-12 22
2021-01 22 2
2021-02 22 2
2021-03 22
2021-04 22
...
You can see for ID 11 that for 2021-08 there are logins in both the current and the previous month, so you can do the math on it (this requires one more subselect/WITH branch; a sketch follows after the list of alternatives below).
Alternatives to this would be:
inter-row calculation plus time math between two logged timestamps
pattern matching
I did not drill into those; there is not enough info about your real requirement.
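For completeness, a minimal sketch of that extra step, under the simplifying assumption that only the immediately preceding month matters (the lag() only looks one month back); l_compare is a name I made up, and l_dense refers to the CTE in the query above:
-- sketch only: appended to the WITH chain of the query above (not runnable on its own)
, l_compare as
(
  select mon, id, cnt,
         lag(cnt) over (partition by id order by mon asc) prev_cnt
  from l_dense
)
select distinct id
from l_compare
where mon = to_char(add_months(sysdate, -1), 'yyyy-mm')  -- last month (August here)
  and cnt >= 1                                           -- logged in last month
  and prev_cnt >= 1;                                     -- and in the month before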
I want to count consecutive days (rows) and that is fairly easy (given all the answers to similar questions). But in my data set I have groups of consecutive rows with dates such as:
1. 30/12/2010
2. 31/12/2010
3. 01/01/2011
4. 02/01/2011
This looks like one group (4 consecutive days), but I would like to split it into two groups at the year boundary. So when having:
1. 30/12/2010
2. 31/12/2010
3. 01/01/2011
4. 02/01/2011
5. 05/01/2011
6. 06/02/2011
7. 07/02/2011
I would like to see this grouped into four groups (not three):
1. 30/12/2010 (group 1)
2. 31/12/2010 (group 1)
3. 01/01/2011 (group 2)
4. 02/01/2011 (group 2)
5. 05/01/2011 (group 3)
6. 06/02/2011 (group 4)
7. 07/02/2011 (group 4)
I'm using SQL Server 2014
You can number your rows like this:
DECLARE #T TABLE(id INT, dt DATE);
INSERT INTO #T VALUES
(1, '2010-12-30'),
(2, '2010-12-31'),
(3, '2011-01-01'),
(4, '2011-01-02'),
(5, '2011-01-05'),
(6, '2011-02-06'),
(7, '2011-02-07');
WITH CTE1 AS (
SELECT *, YEAR(dt) AS temp_year, ROW_NUMBER() OVER (ORDER BY dt) AS temp_rownum
FROM #T
), CTE2 AS (
SELECT CTE1.*, DATEDIFF(DAY, temp_rownum, dt) AS temp_dategroup
FROM CTE1
)
SELECT *, RANK() OVER (ORDER BY temp_year, temp_dategroup) AS final_rank
FROM CTE2
ORDER BY final_rank, dt
Result:
id dt temp_year temp_rownum temp_dategroup final_rank
1 2010-12-30 2010 1 40539 1
2 2010-12-31 2010 2 40539 1
3 2011-01-01 2011 3 40539 3
4 2011-01-02 2011 4 40539 3
5 2011-01-05 2011 5 40541 5
6 2011-02-06 2011 6 40572 6
7 2011-02-07 2011 7 40572 6
It is possible to simplify the query, but I chose to display all columns so that it is easier to understand; a simplified sketch follows below. The DATEDIFF trick was copied from this answer.
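For reference, a sketch of the simplification mentioned above (my own folding of the two CTEs into one; it should behave the same as the query shown):
WITH CTE AS (
    SELECT *,
           -- same DATEDIFF trick: dates in one consecutive run share a group value
           DATEDIFF(DAY, ROW_NUMBER() OVER (ORDER BY dt), dt) AS temp_dategroup
    FROM #T
)
SELECT id, dt, RANK() OVER (ORDER BY YEAR(dt), temp_dategroup) AS final_rank
FROM CTE
ORDER BY final_rank, dt;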
I have a table which stores scans into a building and contains well over a million rows of data. I am attempting to add a temporary status column within a query, counting the scans on a daily basis. For the purposes of this question, let's use this as the main data table:
CREATE TABLE DataTable (DataTableID INT IDENTITY(1,1) NOT NULL,
                        [User] VARCHAR(50),
                        EventTime DATETIME)
From this I have narrowed it down to show only today's scans:
SELECT * FROM DataTable
WHERE CONVERT(DATE,EventTime) = CONVERT(DATE, SYSDATETIME())
It is at this point that I want to add a status column to the query above. The Status column:
WHEN ODD - will mean that the person is in the building
WHEN EVEN - will mean that the person is not in the building
(This is simply an integer field which starts at 1 and increments by 1 per scan on that day, per user.) How would I go about doing this?
I want to turn this into a view afterwards, so it's worth mentioning in case that affects the query syntax.
It's also worth mentioning that I can't add a status column to the main table, as this would break the door access program; otherwise I would add something there to control this.
EXAMPLE DATA:
DataTableID User EventTime Status
1 Joe 30/08/2016 09:00:00 1
2 Alan 30/08/2016 08:45:00 1
3 John 30/08/2016 09:02:00 1
4 Steven 30/08/2016 07:30:00 1
5 Joe 30/08/2016 11:00:00 2
6 Mike 30/08/2016 17:30:00 1
7 Joe 30/08/2016 12:00:00 3
You want a simple windowing function for this. Take a look at the query below and let me know if you have any questions. The windowing is ordered by EventTime rather than DataTableID; the final query is then ordered by DataTableID. This makes sure you don't have any issues if your data isn't stored in the correct order in the table.
Temp table for testing:
CREATE TABLE #DataTable
(DataTableID INT IDENTITY(1,1) NOT NULL,
[User] VARCHAR(50),
EventTime DATETIME)
Fill it with sample data:
INSERT INTO #DataTable
VALUES
('Joe', '2016-08-30 09:00:00')
,('Alan', '2016-08-30 08:45:00')
,('John', '2016-08-30 09:02:00')
,('Steven', '2016-08-30 07:30:00')
,('Joe', '2016-08-30 11:00:00')
,('Mike', '2016-08-30 17:30:00')
,('Joe', '2016-08-30 12:00:00')
Query
SELECT
DataTableID
,[User]
,EventTime
,ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY EventTime) Status
FROM #DataTable
WHERE CONVERT(DATE,EventTime) = CONVERT(DATE, SYSDATETIME())
ORDER BY DataTableID
Output
DataTableID User EventTime Status
1 Joe 2016-08-30 09:00:00.000 1
2 Alan 2016-08-30 08:45:00.000 1
3 John 2016-08-30 09:02:00.000 1
4 Steven 2016-08-30 07:30:00.000 1
5 Joe 2016-08-30 11:00:00.000 2
6 Mike 2016-08-30 17:30:00.000 1
7 Joe 2016-08-30 12:00:00.000 3
Something like:
select *, row_number() over(partition by [user], cast(eventtime as date) order by eventtime) as status
from datatable
should do the trick.
However, for performance reasons I'd suggest creating a computed column defined as cast(eventtime as date), plus a compound index on that column, the user column, and the original eventtime column; a sketch follows below.
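A minimal sketch of that suggestion (assuming SQL Server; the column and index names are illustrative):
-- computed column on the base table, then a compound index covering the query;
-- CAST(datetime AS date) is deterministic, so the computed column is indexable
ALTER TABLE DataTable ADD EventDate AS CAST(EventTime AS DATE);
CREATE INDEX IX_DataTable_User_EventDate_EventTime
    ON DataTable ([User], EventDate, EventTime);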
I have a dataset that includes repeated ids and a number of dates. I am trying to write a SQL query that finds the difference between the dates for each id and sums the values in another column. The dataset looks like this:
id | date | value |
1 2015-06-01 10
1 2015-09-22 25
2 2015-12-10 15
2 2015-07-11 20
2 2015-10-18 25
3 2015-04-05 30
3 2015-05-02 45
4 2015-06-01 20
And what I am trying to get to is this:
id | date_diff | value
1 42 35
2 149 60
3 27 75
4 0 20
The idea is that date_diff finds the difference between the dates for each id, while the values are summed. However, the date_diff function isn't working for me, and I think this may be the first issue; it returns this error:
function datediff(unknown, date, timestamp without time zone) does not exist
Hint: No function matches the given name and argument types. You might need to add explicit type casts.
I'm using a public data set, so that might be part of the issue. Any help or ideas would be great!
Let's create a test table:
create table test
(
id int ,
date DATE,
value int
)
insert into test values (1, CAST('2015-06-01' AS DATE), 10);
insert into test values (1, CAST('2015-09-22' AS DATE), 25);
insert into test values (2, CAST('2015-12-10' AS DATE), 100);
insert into test values (2, CAST('2015-07-11' AS DATE), 200);
Let's look at the data:
select * from test ORDER BY id;
1 2015-06-01 10
1 2015-09-22 25
2 2015-12-10 100
2 2015-07-11 200
Let's do the magic:
select id, datediff(day, MIN(date), MAX(date)) as diff_in_days, sum(value) as sum_of_values FROM test group by id;
1   113   35
2   152   300
Please note that you didn't specify what should happen when there are three or more rows per id. The code will still work (it spans from the earliest to the latest date), but whether that makes sense for your application, I don't know.
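One caveat: the error message in the question comes from PostgreSQL, which has no datediff() function (that syntax is SQL Server's). A minimal Postgres sketch of the same query, relying on the fact that subtracting two date values yields an integer number of days:
-- PostgreSQL: date - date returns an integer day count
select id,
       max(date) - min(date) as date_diff,
       sum(value) as value
from test
group by id
order by id;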