SQL - state machine - reporting on historical data based on changeset - sql

I want to record user states and then be able to report historically based on the record of changes we've kept. I'm trying to do this in SQL (using PostgreSQL) and I have a proposed structure for recording user changes like the following.
CREATE TABLE users (
userid SERIAL NOT NULL PRIMARY KEY,
name VARCHAR(40),
status CHAR NOT NULL
);
CREATE TABLE status_log (
logid SERIAL,
userid INTEGER NOT NULL REFERENCES users(userid),
status CHAR NOT NULL,
logcreated TIMESTAMP
);
That's my proposed table structure, based on the data.
For the status field 'a' represents an active user and 's' represents a suspended user,
INSERT INTO status_log (userid, status, logcreated) VALUES (1, 's', '2008-01-01');
INSERT INTO status_log (userid, status, logcreated) VALUES (1, 'a', '2008-02-01');
So this user was suspended on 1st Jan and active again on 1st of February.
If I wanted to get a suspended list of customers on 15th January 2008, then userid 1 should show up. If I get a suspended list of customers on 15th February 2008, then userid 1 should not show up.
1) Is this the best way to structure this data for this kind of query?
2) How do I query the data in either this structure or in your proposed modified structure so that I can simply have a date (say 15th January) and find a list of customers that had an active status on that date in SQL only? Is this a job for SQL?

This can be done, but would be a lot more efficient if you stored the end date of each log. With your model you have to do something like:
select l1.userid
from status_log l1
where l1.status='s'
and l1.logcreated = (select max(l2.logcreated)
from status_log l2
where l2.userid = l1.userid
and l2.logcreated <= date '2008-02-15'
);
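To try this query against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module instead of PostgreSQL. The helper names (`demo_status_log`, `users_with_status_on`) are made up for illustration, and dates are stored as ISO-8601 text so that string comparison matches date order.

```python
import sqlite3

def demo_status_log():
    """Build an in-memory copy of status_log with the two rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE status_log ("
        " logid INTEGER PRIMARY KEY,"
        " userid INTEGER NOT NULL,"
        " status TEXT NOT NULL,"
        " logcreated TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO status_log (userid, status, logcreated) VALUES (?, ?, ?)",
        [(1, "s", "2008-01-01"), (1, "a", "2008-02-01")],
    )
    return conn

def users_with_status_on(conn, status, as_of):
    """Users whose most recent log row on or before as_of carries the given status."""
    rows = conn.execute(
        """
        SELECT l1.userid
        FROM status_log AS l1
        WHERE l1.status = ?
          AND l1.logcreated = (SELECT MAX(l2.logcreated)
                               FROM status_log AS l2
                               WHERE l2.userid = l1.userid
                                 AND l2.logcreated <= ?)
        ORDER BY l1.userid
        """,
        (status, as_of),
    ).fetchall()
    return [row[0] for row in rows]
```

With the sample data, user 1 is listed as suspended for 15 January and not for 15 February, which is the behaviour the question asks for.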
With the additional column it would be more like:
select userid
from status_log
where status='s'
and logcreated <= date '2008-02-15'
and logsuperseded >= date '2008-02-15';
(Apologies for any syntax errors, I don't know Postgresql.)
To address some further issues raised by Phil:
A user might get moved from active, to suspended, to cancelled, to active again. This is a simplified version, in reality, there are even more states and people can be moved directly from one state to another.
This would appear in the table like this:
userid  from        to          status
FRED    2008-01-01  2008-01-31  s
FRED    2008-02-01  2008-02-07  c
FRED    2008-02-08  (null)      a
I used a null for the "to" date of the current record. I could have used a future date like 2999-12-31 but null is preferable in some ways.
Additionally, there would be no "end date" for the current status either, so I think this slightly breaks your query?
Yes, my query would have to be re-written as
select userid
from status_log
where status='s'
and logcreated <= date '2008-02-15'
and (logsuperseded is null or logsuperseded >= date '2008-02-15');
A downside of this design is that whenever the user's status changes you have to end date their current status_log as well as create a new one. However, that isn't difficult, and I think the query advantage probably outweighs this.
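The "end date the old row, then insert the new one" step can be kept in one transaction. The sketch below uses Python's sqlite3 rather than PostgreSQL, and the function names are hypothetical. One assumption differs from the table above: `logsuperseded` holds the date the next status starts (a half-open interval), so the lookup compares with `>` instead of `>=`.

```python
import sqlite3

def change_status(conn, userid, new_status, changed_on):
    """Close the user's open log row and add the new one in a single transaction."""
    with conn:  # commits both statements together, or rolls both back on error
        conn.execute(
            "UPDATE status_log SET logsuperseded = ?"
            " WHERE userid = ? AND logsuperseded IS NULL",
            (changed_on, userid),
        )
        conn.execute(
            "INSERT INTO status_log (userid, status, logcreated, logsuperseded)"
            " VALUES (?, ?, ?, NULL)",
            (userid, new_status, changed_on),
        )

def suspended_on(conn, as_of):
    """Users whose suspended period covers as_of; an open row has a NULL end."""
    rows = conn.execute(
        "SELECT userid FROM status_log"
        " WHERE status = 's' AND logcreated <= ?"
        "   AND (logsuperseded IS NULL OR logsuperseded > ?)"
        " ORDER BY userid",
        (as_of, as_of),
    ).fetchall()
    return [row[0] for row in rows]

def demo_log_with_end_dates():
    """Replay the question's two status changes through change_status."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE status_log ("
        " logid INTEGER PRIMARY KEY, userid INTEGER NOT NULL,"
        " status TEXT NOT NULL, logcreated TEXT NOT NULL, logsuperseded TEXT)"
    )
    change_status(conn, 1, "s", "2008-01-01")
    change_status(conn, 1, "a", "2008-02-01")
    return conn
```

Because both writes go through one function, callers never have to remember the extra UPDATE.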

Does Postgres support analytic queries? This would give the active users on 2008-02-15
select userid
from
(
select logid,
userid,
status,
logcreated,
max(logcreated) over (partition by userid) max_logcreated_by_user
from status_log
where logcreated <= date '2008-02-15'
) t
where logcreated = max_logcreated_by_user
and status = 'a';

@Tony, the "end" date isn't necessarily applicable.
A user might get moved from active, to suspended, to cancelled, to active again. This is a simplified version, in reality, there are even more states and people can be moved directly from one state to another.
Additionally, there would be no "end date" for the current status either, so I think this slightly breaks your query?

@Phil
I like Tony's solution. It seems to model the situation described most appropriately. Any particular user has a status for a given period of time (a minute, an hour, a day, etc.), but it is for a duration, not an instant in time. Since you want to know who was active during a certain period of time, modeling the information as a duration seems like the best approach.
I am not sure that additional statuses are a problem. If someone is active, then suspended, then cancelled, then active again, each of those statuses would be applicable for a given duration, would they not? It may be a very short duration, such as a few seconds or a minute, but it would still be a length of time.
Are you concerned that a person's status can change multiple times in a given day, but you want to know who was active for a given day? If so, then you just need to define more specifically what it means to be active on a given day. If it is enough that they were active for any part of that day, then Tony's answer works well as is. If they would have to be active for a certain amount of time in a given day, then Tony's solution could be modified to determine the length of time (in hours, or minutes, or days), with further restrictions in the WHERE clause on the date, the status, and the length of time in that status.
As for there being no "end date" for the current status, that is no problem either as long as the end date is nullable. Simply use something like this: "WHERE enddate >= '2008-02-15' or enddate is null".
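The "length of time in a status on a given day" idea comes down to intersecting each status period with the day. A rough Python sketch follows, assuming periods are (status, start, end) tuples with half-open intervals and `None` as the end of the current status; the function name and data layout are illustrative only.

```python
from datetime import datetime, timedelta

def seconds_in_status(periods, status, day_start):
    """Total seconds spent in `status` within the 24 hours starting at day_start."""
    day_end = day_start + timedelta(days=1)
    total = timedelta(0)
    for period_status, start, end in periods:
        if period_status != status:
            continue
        if end is None:
            end = day_end  # the current status has no end yet; clip it to the day
        overlap_start = max(start, day_start)
        overlap_end = min(end, day_end)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total.total_seconds()
```

A report could then keep only the users whose total exceeds whatever threshold counts as "active that day".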

Related

TSQL query to find latest (current) record from period column when there are past present and future records

edited as requested:
My apologies. I've been dealing with this a bit and it's well and truly in my head, but not for the reader.
We have multiple records in table A which have multiple entries in the Period column. Say it's like a football schedule. Teams will have multiple dates/times in the Period column.
When we run query:
We want records selected for the most recent games only.
We don't want the earlier games.
We don't want the games "scheduled" and not yet played.
"Last game played" i.e. Period for teams are often on different days.
Table like:
Team Period
Reds 2021020508:00
Reds 2021011107:00
City 2021030507:00
Reds 2021032607:00
City 2021041607:00
Reds 2021050707:00
When I run query, I want to see the records for last game played regardless of date. So if I run the query on 27 Mar 2021, I want:
City 2021030507:00
Reds 2021032607:00
Keep in mind I used the above as an easily understandable example. In my case I have 1000s of "Teams" each of which may have 100+ different date entries in the Period column and I would like the solution to be applicable regardless of number of records, dates, or when the query is run.
What can I do?
Thanks!
So this gives you your desired output using the sample data, does it fulfil your requirement?
create table x (Team varchar(10), period varchar(20));
insert into x values
('Reds','2021020508:00'),
('Reds','2021011107:00'),
('City','2021030507:00'),
('Reds','2021032607:00'),
('City','2021041607:00'),
('Reds','2021050707:00');
select Team, Max(period) LastPeriod
from x
where period <= Format(GetDate(), 'yyyyMMddHH:mm')
group by Team;
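To check the grouping logic for a specific run date, the cutoff can be passed in as a parameter instead of reading the clock. This is a sketch in Python with sqlite3 rather than SQL Server; `period_key` and `last_played` are hypothetical names. Note that `%H` is the 24-hour clock, which is what the stored values appear to use.

```python
import sqlite3
from datetime import datetime

def period_key(moment):
    """Format a datetime the way the Period column is stored (24-hour clock)."""
    return moment.strftime("%Y%m%d%H:%M")

def last_played(conn, now):
    """Latest period per team that is not later than `now`."""
    return conn.execute(
        "SELECT Team, MAX(period) FROM x"
        " WHERE period <= ? GROUP BY Team ORDER BY Team",
        (period_key(now),),
    ).fetchall()

def demo_schedule():
    """In-memory copy of the sample schedule from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE x (Team TEXT, period TEXT)")
    conn.executemany(
        "INSERT INTO x VALUES (?, ?)",
        [
            ("Reds", "2021020508:00"),
            ("Reds", "2021011107:00"),
            ("City", "2021030507:00"),
            ("Reds", "2021032607:00"),
            ("City", "2021041607:00"),
            ("Reds", "2021050707:00"),
        ],
    )
    return conn
```

Running it for 27 Mar 2021 gives one row per team, matching the output the question asks for.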
The string-formatted dates you have sort correctly as text, so I think this would work for the two-team sample:
SELECT TOP 2 *
FROM tableA
WHERE period <= FORMAT( GETDATE(), 'yyyyMMddHH:mm' )
ORDER BY period DESC
Perhaps you want:
where period = (select max(t2.period) from t t2)
This returns all rows with the last period in the table.

Should I use Effective Date or Start Date and End Date for historical recording?

I am a Business Analyst and have prepared tables/erd for a system we are implementing.
The context is essentially an employee management system, an employee can join the company, change positions, get promoted, demoted, terminated etc. All of this is required to be tracked for filtering and reporting purposes. Therefore we require historical tracking of records.
My recommendation and original design of the tables included a field called "Effective Date", so essentially effectively from a date onwards a particular "Action" is valid.
Say for example, John joined an organisation as a consultant on the 1st Jan 2017 thus the action was he was hired, therefore the effective date is 1st Jan 2017 and he was a consultant for a certain period of time until he became a senior consultant on the 6th September 2017, thus the effective date is 6th September 2017 with an action of promoted for that record.
By the way we will also be performing calculations on the salary of the employee based on their position and other parameters so there will be derived fields and fields being referenced from other tables etc.
Now my boss and the Solutions Architect have advised not to use the "Effective Date," my boss says there will be "problems" with the calculation but doesn't elaborate, and the Solutions Architect says it would be easier to use a Start Date and an End Date instead of effective date. His rationale is if there's no end date that action/event is active, but is inactive once an end date is provided.
My problem with this is that we'll have to maintain an additional column that I feel is totally unnecessary.
What do the brains trust of StackOverflow advise??
Thanks :)
Your instincts serve you well. Don't use the end date. This adds a complication and source of possible anomalous data. Take the following sequential entries:
ID <attr> StartDate EndDate
1 ... Jan 1 Jan 20
1 ... Jan 20 Jan 22
1 ... Feb 1 Jul 30
There was a state change recorded on Jan 1 which was in effect until the next state change on Jan 20. Now we have a problem. According to the EndDate of that version, there was another state change on Jan 22, but the next version started on Feb 1.
This forms a gap in the time stream and we have no indication of where the problem lies. Is the EndDate of Jan 22 wrong? Is the StartDate of Feb 1 wrong? Or is there a missing version that connects the two ends of the gap? There is no way to tell.
ID <attr> StartDate EndDate
1 ... Jan 1 Jan 20
1 ... Jan 20 Feb 20
1 ... Feb 1 Jul 30
Now there is an overlap of states. The second state supposedly lasted until Feb 20 but the third state says it started on Feb 1. But the start of one state logically means the end of the previous state. Again, we have no idea (just by looking at the data) which date is wrong.
Knowing that the start of one state also indicates the end of the previous state, look what happens when we simply remove the EndDate column.
ID <attr> EffDate
1 ... Jan 1
1 ... Jan 20
1 ... Feb 1
Now gaps and overlaps are impossible. Each state begins at the effective date and ends when the next state begins. As the EffDate field is part of the PK, no entry can have the same EffDate value for a given ID value.
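As a small illustration of both points (the key rejects a duplicate effective date, and the end of each version is simply the next effective date), here is a sketch in Python with sqlite3. The table name `versions`, the year 2017, and the attribute values are placeholders invented for the example.

```python
import sqlite3

def make_versions(conn):
    """Create a version table keyed on (id, effdate) and load three versions."""
    conn.execute(
        "CREATE TABLE versions ("
        " id INTEGER NOT NULL, effdate TEXT NOT NULL, attr TEXT,"
        " PRIMARY KEY (id, effdate))"
    )
    conn.executemany(
        "INSERT INTO versions VALUES (?, ?, ?)",
        [(1, "2017-01-01", "x"), (1, "2017-01-20", "y"), (1, "2017-02-01", "z")],
    )

def versions_with_end(conn, entity_id):
    """Derive each version's end as the next effective date (None for the current one)."""
    return conn.execute(
        """
        SELECT v.effdate,
               (SELECT MIN(n.effdate)
                FROM versions AS n
                WHERE n.id = v.id AND n.effdate > v.effdate) AS enddate
        FROM versions AS v
        WHERE v.id = ?
        ORDER BY v.effdate
        """,
        (entity_id,),
    ).fetchall()
```

Since the end date is computed rather than stored, there is nothing that can drift out of step with the next row.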
This design is not used with the main entity table. It is implemented as a special form of second normal form, what I call version normal form (vnf).
Your Employee table will have fields that don't change over the course of time and some that do. You might also have fields that change but you don't wish to track those changes.
create table Employees(
ID int auto_generated primary key,
Hired date not null,
FName varchar not null,
LName varchar not null,
Sex enum, -- M or F
BDay date,
Position enum not null,
PayRate currency,
DeptID int references Depts( ID )
);
If we wish to track changes to the data, we could add an effective date field. Consider, however, that data such as the hire date and birth date will not change from one version to another. Thus they are dependent only on the ID field. The data that does change (Position, PayRate, DeptID) are dependent on the ID and the effective date field. The table is no longer in 2nf.
So we normalize:
create table Employees(
ID int auto_generated primary key,
Hired date not null,
FName varchar not null,
Sex enum, -- M or F
BDay date
);
create table Employees_V(
ID int not null references Employees( ID ),
EffDate date not null,
LName varchar not null,
Position enum not null,
PayRate currency,
DeptID int references Depts( ID ),
constraint PK_Employees_V primary key( ID, EffDate )
);
The last name can be expected to change now and then, especially among the female employees.
One of the main advantages of this method is that foreign keys cannot reference versions. Now all FKs can reference the main entity table as normal.
The query to obtain the "current" data is relatively simple:
select e.ID, e.Hired, e.FName, v.Lname, e.Sex, e.BDay, v.Position, v.PayRate, v.DeptID
from Employees e
join Employees_V v
on v.ID = e.ID
and v.EffDate =(
select Max( EffDate )
from Employees_V
where ID = v.ID
and EffDate <= GetDate())
where e.ID = 123;
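The same query shape works for any "as of" date, not just today. Below is a cut-down sketch in Python with sqlite3, using John's two versions from the question; the column set is reduced and the helper names are hypothetical.

```python
import sqlite3

def demo_employees():
    """Reduced Employees / Employees_V tables holding John's two versions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Employees (
            ID INTEGER PRIMARY KEY, Hired TEXT NOT NULL, FName TEXT NOT NULL);
        CREATE TABLE Employees_V (
            ID INTEGER NOT NULL REFERENCES Employees (ID),
            EffDate TEXT NOT NULL,
            Position TEXT NOT NULL,
            PRIMARY KEY (ID, EffDate));
        INSERT INTO Employees VALUES (123, '2017-01-01', 'John');
        INSERT INTO Employees_V VALUES (123, '2017-01-01', 'Consultant');
        INSERT INTO Employees_V VALUES (123, '2017-09-06', 'Senior Consultant');
        """
    )
    return conn

def employee_as_of(conn, emp_id, as_of):
    """Join the entity row to the version in effect on as_of (None if none yet)."""
    return conn.execute(
        """
        SELECT e.ID, e.FName, v.Position, v.EffDate
        FROM Employees AS e
        JOIN Employees_V AS v
          ON v.ID = e.ID
         AND v.EffDate = (SELECT MAX(v2.EffDate)
                          FROM Employees_V AS v2
                          WHERE v2.ID = e.ID
                            AND v2.EffDate <= ?)
        WHERE e.ID = ?
        """,
        (as_of, emp_id),
    ).fetchone()
```

Passing today's date reproduces the "current" query; passing an earlier date gives the historical view.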
Compare to querying a table with start/end dates.
select ID, Hired, FName, Lname, Sex, BDay, Position, PayRate, DeptID
from Employees
where ID = 123
and StartDate <= GetDate()
and EndDate > GetDate();
This assumes the EndDate value for the current version is a magic value such as 12/31/9999.
This second query looks a lot simpler than the first. Even if the data is normalized as shown above, there is a join but no subquery. It also looks like it will execute much faster.
I have used this technique for about 8 years now and I've never had to alter it because of performance issues. The vnf query runs at worst less than 10% slower than the start/end version. So a one minute query will take about one minute 5 seconds. However, under some conditions, the vnf query will execute faster.
Take entities that have many, many changes (many thousands of versions). The start/end query performs an index scan. It starts at the earliest version and must examine each version in sequence until it finds the one whose EndDate is later than the target date. Normally, this is the last version. In the vnf query, the subquery makes it possible to perform an index seek.
So don't reject this design because you think it is slow. It is not slow. Especially when you consider that inserting a new version requires only the one INSERT statement. When working with start/end dates, the insert of a new version requires an UPDATE and then an INSERT. It's two UPDATEs and an INSERT when inserting a new version between two existing versions. To remove a start/end version requires one or two UPDATE and one DELETE statements. To delete a vnf version, just delete the version.
And if the start and end dates between versions ever get out of synch, you have a gap or overlap and good luck finding the right values.
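With stored start/end dates, the best you can do is detect the anomaly after the fact. A minimal Python sketch of such a check follows, assuming versions arrive sorted by start date and that each end date should equal the next start date; the function name and the year 2017 are invented for the example.

```python
from datetime import date

def find_anomalies(versions):
    """Compare each (start, end) pair with the next and report gaps and overlaps."""
    problems = []
    for (_, end), (next_start, _) in zip(versions, versions[1:]):
        if end < next_start:
            problems.append(("gap", end, next_start))
        elif end > next_start:
            problems.append(("overlap", next_start, end))
    return problems
```

The check can say that something is wrong between two rows, but not which of the two dates is the wrong one.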
So I'll take the small performance hit to ensure that the data can never get out of synch and turn anomalous on me. This (vnf), as it turns out, is really the simpler design.
Definitely implement the end date. It is a tiny bit more work when writing but you only write it once, but you will report on it many many times and you'll find that it makes everything so much easier (and faster) when the end date is already there on the record.
All over Stack Overflow you will find questions about writing queries to find the end date of a given record when it is defined on the 'next' record rather than the 'current' record. These queries are ugly and slow.
If you look at the back end of enterprise systems like SAP you'll find that records have start and end dates defined.
With regard to your colleagues' comments about not using effective date: You don't provide much info so I'll guess. I'm guessing that there is a true 'effective date' when the thing happened but there is also another set of start and end dates which are the payroll effective dates that the change applies to. So if someone starts on the 1st, the payroll effective date might actually be the 15th. This might also be used for FTE calculations. Payroll and pay periods are really a big deal and quite complex so you shouldn't underestimate the complexity there. If you're including pay calculations in this system then at the very least you need to understand what effective payroll dates are.
You should not be afraid of storing four date columns instead of one. Databases are there to make things easy for you not harder.
Using startDate and endDate makes updates messy, but it makes fetching effective-dated records much easier and faster.
Updating the same record asynchronously may cause the dates to overlap, because we need to fetch all the records within the update range and update each of them individually.
On the other hand, using only effectiveDate speeds up updates and eliminates the date-overlap problem. But fetching is much more complicated this way.
For example:
ID Data EffDate
1 ... Jan 1 2020
1 ... Jan 30 2020
1 ... Feb 1 2020
In the above example, if we want to fetch the record effective on Feb 1, we have to compare all 3 records to find the highest date (which is awkward when fetching a list). On top of that, joining with other effective-dated tables gets messy.

SQL how to implement if and else by checking column value

The table below contains customer reservations. Customers come and make one record in this table, and the last day this table will be updated its checkout_date field by putting that current time.
The Table
Now I need to extract all customers spending nights.
The Query
SELECT reservations.customerid, reservations.roomno, rooms.rate,
reservations.checkin_date, reservations.billed_nights, reservations.status,
DateDiff("d", reservations.checkin_date, Date())
+ Abs(DateDiff("s", #12/30/1899 14:30:0#, Time()) > 0) AS Due_nights
FROM reservations, rooms
WHERE reservations.roomno = rooms.roomno;
What I need is: if the customer has checkout status, due nights should be calculated from checkin_date up to the checkout date instead of the current date. Also, if the customer has a checkout date, there is no need to add the extra night for being past 14:30.
My current query view is below, also my computer time is 14:39 so it adds 1 to every query.
Since you want to calculate the due nights up to the checkout date, or up to the current date if they are still checked in, I would suggest using an Immediate If (IIF).
The condition to check is the status. If it is checkout, use checkout_date; otherwise use Now(). Something like:
SELECT
reservations.customerid,
reservations.roomno,
rooms.rate,
reservations.checkin_date,
reservations.billed_nights,
reservations.status,
DateDiff("d", checkin_date, IIF(status = 'checkout', checkout_date, Now())) As DueNights
FROM
reservations
INNER JOIN
rooms
ON reservations.roomno = rooms.roomno;
As you might have noticed, I used a JOIN. This states the join condition explicitly instead of listing both tables in FROM and matching them in the WHERE clause. Hope this helps!
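For reference, the rule described in the question (count date boundaries from check-in, add one night if a guest still in-house is past 14:30, and add nothing once a checkout date exists) can be written out in plain Python. This is only a sketch of the stated rule, not a translation of the Access query above, and the function name is made up.

```python
from datetime import datetime, time

CHECKOUT_CUTOFF = time(14, 30)  # cutoff hour taken from the question's query

def due_nights(checkin, now, checkout=None):
    """Nights owed: up to checkout if set, otherwise up to now plus the late-stay night."""
    if checkout is not None:
        # Checked out: plain date difference, no extra night for the cutoff.
        return (checkout.date() - checkin.date()).days
    nights = (now.date() - checkin.date()).days
    if now.time() > CHECKOUT_CUTOFF:
        nights += 1  # still in the room after 14:30 counts as another night
    return nights
```

In Access the same branch could be expressed by nesting the 14:30 term inside the IIF for the not-yet-checked-out case.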

storing data ranges - effective representation

I need to store values for every day in a timeline, i.e. every user in the database should have a status assigned for every day, like this:
from 1.1.2000 to 28.05.2011 - status 1
from 29.05.2011 to 30.01.2012 - status 3
from 1.2.2012 to infinity - status 4
Each day should have only one status assigned, and last status is not ending (until another one is given). My question is what is effective representation in sql database? Obvious solution is to create row for each change (with the last day the status is assigned in each range), like this:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
01.01.9999 status 4
this has many problems - if i would want to add another range, say from 15.02.2012, i would need to alter last row too:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
14.02.2012 status 4
01.01.9999 status 8
and it requires lots of checking to make sure there is no overlapping and errors, especially if someone wants to modify ranges in the middle of the list - inserting a new status from 29.01.2012 to 10.02.2012 is hard to implement (it would require data ranges of status 3 and status 4 to shrink accordingly to make space for new status). Is there any better solution?
I thought about a completely different solution, like storing each day's status in a separate row, so there would be a row for every day in the timeline. This would make it easy to update: simply enter the new status for rows with a date between start and end. Of course this would generate a large amount of needless data, so it's a bad solution, but it is coherent and easy to manage. I was wondering if there is something in between, but I guess not.
More context: I want the moderator to be able to assign a status freely to any dates, and edit it if he needs to. But most often the moderator will be adding new status date ranges at the end. I don't really need the last status. After the moderator finishes editing a whole month, I need to generate a report based on the status on each day in that month. But at any time the moderator may want to edit data from months ago (which would be reflected in updated reports), and he can set one status for, e.g., one year in advance.
You seem to want to use this table for two things - recording the current status and the history of status changes. You should separate the current status out and move it up to the parent (just like the registered date)
User
===============
Registered Date
Current Status
Status History
===============
Uptodate
Status
Your table structure should include the effective and end dates of the status period. This effectively "tiles" the statuses into groups that don't overlap. The last row should have a dummy end date (as you have above) or NULL. Using a value instead of NULL is useful if you have indexes on the end date.
With this structure, to get the status on any given date, you use the query:
select *
from t
where <date> between effdate and enddate
To add a new status at the end of the period requires two changes:
Modify the row in the table with the enddate = 01/01/9999 to have an enddate of yesterday.
Insert a new row with the effdate of today and an enddate of 01/01/9999
I would wrap this in a stored procedure.
To change a status on one date in the past requires splitting one of the historical records in two. Multiple dates may require changing multiple records.
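The splitting step can be sketched in a few lines of Python. The version below assumes tiles are (effdate, enddate, status) tuples with inclusive dates, and that status 4 starts the day after status 3 ends; `assign_status` is a hypothetical helper, not part of any library.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def assign_status(tiles, start, end, status):
    """Return new tiles with [start, end] set to status, trimming or splitting neighbours."""
    result = []
    for tile_start, tile_end, tile_status in tiles:
        if tile_end < start or tile_start > end:
            result.append((tile_start, tile_end, tile_status))  # no overlap, keep as is
            continue
        if tile_start < start:
            result.append((tile_start, start - ONE_DAY, tile_status))  # part before the new range
        if tile_end > end:
            result.append((end + ONE_DAY, tile_end, tile_status))  # part after the new range
    result.append((start, end, status))
    result.sort()
    return result
```

A tile that lies entirely inside the new range is dropped, and one that straddles it is split in two, so the result stays free of overlaps.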
If you have a date range, you can get all tiles that overlap a given time period with the query:
select *
from t
where <periodstart> <= enddate and <periodend> >= effdate
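Putting the lookup, the append step, and the overlap query together, here is a sketch in Python with sqlite3. The table name `t` follows the queries above; the helper names, the 9999-01-01 sentinel as text, and status 4 starting on 2012-01-31 are assumptions for the example.

```python
import sqlite3

OPEN_END = "9999-01-01"  # dummy end date marking the current status

def demo_tiles():
    """Three tiles based on the question's example ranges."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (effdate TEXT NOT NULL, enddate TEXT NOT NULL, status INTEGER NOT NULL)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [("2000-01-01", "2011-05-28", 1), ("2011-05-29", "2012-01-30", 3), ("2012-01-31", OPEN_END, 4)],
    )
    return conn

def status_on(conn, day):
    """Status of the tile containing day, or None if no tile covers it."""
    row = conn.execute("SELECT status FROM t WHERE ? BETWEEN effdate AND enddate", (day,)).fetchone()
    return row[0] if row else None

def append_status(conn, status, from_day):
    """End the open tile the day before from_day and add a new open tile."""
    with conn:  # both changes commit together
        conn.execute("UPDATE t SET enddate = date(?, '-1 day') WHERE enddate = ?", (from_day, OPEN_END))
        conn.execute("INSERT INTO t VALUES (?, ?, ?)", (from_day, OPEN_END, status))

def tiles_overlapping(conn, period_start, period_end):
    """Statuses of every tile that touches the given period, oldest first."""
    rows = conn.execute(
        "SELECT status FROM t WHERE ? <= enddate AND ? >= effdate ORDER BY effdate",
        (period_start, period_end),
    ).fetchall()
    return [row[0] for row in rows]
```

In a real database the body of `append_status` is what would go into the stored procedure.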

Displaying same record twice- SQL Reporting Services

Ok, here's the situation: I need to display the same record in two different sections. Stupid, I know, but here's why.
The report I am building is grouped by one field, called Day. Each record has two date/times: an expected arrival date/time and an expected departure date/time.
so, at this point we have something like this:
Day..............Arrival Time..................Departure Time
18/5.............18/5 9.00am.........19/5 11.00am
The boss only wants to show times that relate to the current day in the arrive/depart columns (easy enough with expressions), which ends up like this:
Day..............Arrival Time..................Departure Time
18/5..............9.00am.........................-
The next thing he wants is to display the departure time in the correct day 'group', but as you can imagine, as soon as you move to the next row, well, you move to the next row of the table.
So the question is: is there any way to display the same record in multiple groups? Have I missed something or have I got an unsolvable problem?
NOTE: this is not the only data in my table either. There is (for example) a name column which also needs to be displayed on both days.
Cartesian Joins are great for duplicating data...
DECLARE @ArrDep TABLE
(
Code varchar(1)
)
INSERT INTO @ArrDep (Code) SELECT 'A'
INSERT INTO @ArrDep (Code) SELECT 'D'
SELECT DateAdd(dd, DateDiff(dd, 0,
CASE
WHEN ad.Code = 'A'
THEN mt.ArrivalTime
ELSE mt.DepartureTime
END), 0) as TheDay
, *
FROM MyTable mt CROSS JOIN @ArrDep ad
ORDER BY 1
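To see what the cross join produces, here is a small sketch in Python with sqlite3 rather than SQL Server. The `Name` column, the guest name, and the year 2010 are invented for the example, and SQLite's `date()` stands in for the DateAdd/DateDiff truncation.

```python
import sqlite3

def demo_bookings():
    """One booking row plus the two-row helper table of arrival/departure codes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE MyTable (Name TEXT, ArrivalTime TEXT, DepartureTime TEXT);
        CREATE TABLE ArrDep (Code TEXT);
        INSERT INTO MyTable VALUES ('Guest', '2010-05-18 09:00', '2010-05-19 11:00');
        INSERT INTO ArrDep VALUES ('A');
        INSERT INTO ArrDep VALUES ('D');
        """
    )
    return conn

def rows_per_day(conn):
    """Emit each booking twice: once under its arrival day, once under its departure day."""
    return conn.execute(
        """
        SELECT date(CASE WHEN ad.Code = 'A'
                         THEN mt.ArrivalTime
                         ELSE mt.DepartureTime
                    END) AS TheDay,
               mt.Name,
               ad.Code
        FROM MyTable AS mt
        CROSS JOIN ArrDep AS ad
        ORDER BY TheDay, mt.Name
        """
    ).fetchall()
```

The report can then group on TheDay and use Code to decide whether to show the arrival or the departure time in that group.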