I need to add a unique constraint to an Oracle database table so that a foreign key value can appear more than once only if two other columns, which are dates, do not overlap.
e.g.
car_id start_date end_date
3 01/10/2012 30/09/2013
3 01/10/2013 30/09/2014 -- okay no overlap
3 01/10/2014 30/09/2015 -- okay no overlap
4 01/10/2012 30/09/2013 -- okay different foreign key
3 01/11/2013 01/01/2014 -- * not allowed overlapping dates for this car.
Any suggestions? Thanks in advance.
The last time I saw this requirement, the solution was this:
Create an after-statement trigger. In this trigger do a self join on your table like this:
select count(*)
from your_table a
join your_table b
on a.car_id = b.car_id and
a.rowid <> b.rowid and -- exclude each row matching itself
(a.start_date between b.start_date and b.end_date
or
b.start_date between a.start_date and a.end_date)
If count is zero then everything is ok. If count > 0 then raise an exception and the statement will be rolled back.
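A minimal sketch of such a trigger in Oracle PL/SQL, assuming the hypothetical names your_table, car_id, start_date and end_date from the example (a statement-level trigger avoids the mutating-table problem):
create or replace trigger trg_car_dates_no_overlap
after insert or update on your_table
declare
  v_count number;
begin
  -- same self-join as above, excluding each row matching itself
  select count(*)
    into v_count
    from your_table a
    join your_table b
      on a.car_id = b.car_id
     and a.rowid <> b.rowid
     and (a.start_date between b.start_date and b.end_date
          or b.start_date between a.start_date and a.end_date);
  if v_count > 0 then
    raise_application_error(-20001,
      'Overlapping date ranges exist for the same car_id');
  end if;
end;
/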
Note: this will not work for tables with millions of rows and many inserts.
It works on small lookup tables, or on a big table with infrequent (batch) inserts.
I take it that cars are tracked through some sort of process and every date records a state change. For example, you show that car #3 underwent a state change on 1 Oct 2012, again on 1 Oct 2013 and again on 1 Oct 2014. The final entry implies that the state changed again on 1 Oct 2015. Where is the entry showing that? Or is the state something that always lasts exactly one year -- making it possible to specify the end of the state as soon as the state begins? If so, then the entry showing the state change on 1 Nov 2013 is simply wrong. But the one-year specification could just be a coincidence. You could have just picked simplistic data points for your example data.
Your concern at this point is to clearly distinguish valid data from accurate data. We design databases (or should) with an emphasis on data integrity or validity. That means we constrain each piece of data as sharply as possible so it is consistent with the specification of that piece of data.
For example, the car id field is a foreign key -- generally to a table that defines each instance of the car entity. So we know that at least two cars exist with an id of 3 and 4. Else those values could not exist in the example you show.
But what about accuracy or correctness? Suppose in the last entry in your example, the car id 3 should really have been 4? There is no way to tell from within the database. This illustrates the difference. Both the 3 and 4 are valid values and we are able to constrain these to only valid values. But only one is correct -- assuming for a moment they are the only two cars so far defined. The point is, there is no test, no way to constrain the values to the one that is correct. We can check for validity -- not accuracy.
What you are trying to do is check for accuracy with a validity test. You may claim the "no overlaps" restriction becomes a validity check, but this is just a sort of accuracy check. We can sometimes perform tests to signal data anomalies that indicate an inaccuracy exists somewhere. For example, the overlap could mean the end date of 30 Sep 2014 (second row) is wrong or the start date of 1 Nov 2013 (last row) is wrong or both could be wrong. We have no idea which situation this represents. So we can't just prevent the last row from being entered into the database -- it might be correct with the second row being incorrect.
Invalid data is invalid on its own. Suppose an attempt is made to insert a row for car id 15 and there is no entry for car 15 in the CARS table. Then the value 15 is invalid and the row can be (and should be) prevented from ever entering the table. But date period overlaps are caused by wrong data somewhere -- we have no way of knowing exactly where. We can signal the inconsistency to the user or make a log entry somewhere to have someone look into the problem, but we shouldn't reject the row that "caused" the overlap when it could very well be the existing row that contains the wrong data.
Accuracy, like the data itself, originates from outside the database. If we are lucky enough to be able to detect instances of inaccuracy, the solution also lies outside the database. The best we can do is flag it and have someone investigate to determine what data is correct and what is incorrect and (hopefully) correct the inaccuracy.
UPDATE: Having discussed a bit the concepts of data integrity and accuracy and the differences between them, here is a design idea that may be an improvement.
Note: this is based on the assumption that the date ranges form an unbroken range for each car from the first entry to the last. That is, there are no gaps.
Simple: do away with the end_date field altogether. The first entry for a car sets up the current state of that car with no end date specified. The clear implication is that the state continues indefinitely into the future until the next state change is inserted. The start date of the second state change then becomes the end date of the first state change. Continue as needed.
create table Car_States(
Car_ID int not null,
Start_Date date not null,
..., -- other info
constraint FK_Car_States_Car foreign key( Car_ID )
references Cars( ID ),
constraint PK_Car_States primary key( Car_ID, Start_Date )
);
Now let's look at the data
car_id start_date
3 01/10/2012
3 01/10/2013 -- okay no overlap
3 01/10/2014 -- okay no overlap
4 01/10/2012 -- okay different foreign key
3 01/11/2013 -- What does this mean???
Before that final row was entered, here is how the data is read for the car with id = 3: Car 3 started life in a particular state on 1 Oct 2012, changed to another state on 1 Oct 2013 and then again on 1 Oct 2014 where it remains.
Now the final row is entered: Car 3 started life in a particular state on 1 Oct 2012, changed to another state on 1 Oct 2013, changed to another state on 1 Nov 2013 and then again on 1 Oct 2014 where it remains.
As we can see, we are able to absorb the new data easily into the model. The design makes it impossible to have gaps or overlaps.
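And if reporting ever needs the familiar start/end pairs, they are easy to derive on the fly. A sketch using the LEAD analytic function against the Car_States table above (the open-ended current state comes back with a null End_Date):
select Car_ID,
       Start_Date,
       -- the next row's start date doubles as this row's end date
       lead( Start_Date ) over ( partition by Car_ID
                                 order by Start_Date ) as End_Date
from Car_States
order by Car_ID, Start_Date;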
But is this really an improvement? What if the last entry was a mistake -- possibly meant to be for a different car instead of car 3? Or the wrong dates were entered. The new model just accepted the incorrect data with no complaints and we proceed not knowing we have incorrect data in the table.
This is true. But how is it any different from the original scenario? The last row represents "wrong" data. The question was, "How do I prevent this?" The answer is, in both cases, "You can't! Sorry." The best either design can do is detect the discrepancy and bring it to someone's attention.
One might think that with the original design, with the start and end dates in the same row, it is easy to determine whether a new period overlaps any previously defined period. But this is just as easily determined with the start-date-only design. What is important is that the test for such possible inaccuracies lies primarily with the application, before the data is written to the table, not just within the database.
It is up to the users and/or some automated process to verify new and existing data and determine if any inaccuracies exist. The advantage of using only one date is that, after displaying a warning message with an "Are you sure?" response, the new record can be inserted and the operation is finished. With two dates, other records must be found and their dates resynched to match the new period.
Related
I have a table named villas and this table has a column named reserved_dates of type daterange[].
I want to keep the booked dates in the reserved_dates column.
Villas are booked between certain dates.
For Example:
Check In Date: 2023-02-05 Check Out Date: 2023-02-15.
and in this case I can manually add {"[2023-02-05,2023-02-15)"} value to the reserved_dates column.
What I want is, for example, when a client chooses dates
Check In Date: 2023-02-10
Check Out Date: 2023-02-20
I want to check, does the selected date range conflict with the one in the database?
And if there is no conflicting reservation date, I want to add it. How can I do that?
Or what can I do for this problem?
I couldn't find the result I wanted, or coverage of the new date range types, on many blog platforms or in the PostgreSQL 14 documentation.
I am able to manually add the date range to reserved_dates, but I can't reject the update when the reservations overlap.
An exclusion constraint will do what you want. You have to include more than just the date range (1 room cannot be booked more than once on any given day but 2 separate rooms can be booked on the same day).
CREATE EXTENSION IF NOT EXISTS BTREE_GIST;
CREATE TABLE demo_table(
RoomNumber INTEGER NOT NULL,
CheckIn DATE NOT NULL,
CheckOut DATE NOT NULL,
CHECK (CheckIn < CheckOut ),
EXCLUDE USING gist (RoomNumber WITH =, daterange(CheckIn, CheckOut, '[)'::text) WITH &&)
);
The && operator is the range overlap (range1 && range2), which you can test in a regular SELECT query too.
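For example, a quick sketch against the demo_table above (room numbers and dates are made up): the second insert is rejected by the exclusion constraint, and the final query shows how to pre-check a candidate range with the same && test the constraint uses.
INSERT INTO demo_table VALUES (101, '2023-02-05', '2023-02-15'); -- accepted
INSERT INTO demo_table VALUES (101, '2023-02-10', '2023-02-20'); -- rejected: overlaps for room 101
INSERT INTO demo_table VALUES (102, '2023-02-10', '2023-02-20'); -- accepted: different room

SELECT EXISTS (
  SELECT 1
  FROM demo_table
  WHERE RoomNumber = 101
    AND daterange(CheckIn, CheckOut, '[)') && daterange('2023-02-10', '2023-02-20', '[)')
) AS conflict;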
EDIT: I have seen your comments.
Point 1: the devil is in the details. You named your table villas (plural), suggesting there is more than 1 villa to manage. In which case there should be an additional column to identify which villa is linked to a reservation (if not RoomNumber INTEGER, call it VillaName TEXT).
Honestly, even if you have only 1 villa, it would not hurt to plan for the future and make it so adding another one in the future does not require you to change your entire schema.
Point 2: I do not know why you would store the reservations in an array; it is probably a bad design choice (it will not let you use an index, for instance, or delete past records as easily).
UNNEST is a quick fix for you. It turns your array elements into records.
Example:
SELECT *
FROM (
SELECT UNNEST(reserved_dates) AS Reserved, [add other columns here]
FROM villas
) R
But the correct way to do things, as was said in the comments, would rather be along the lines of:
CREATE TABLE villaReservation ([...]);
INSERT INTO villaReservation SELECT UNNEST(reserved_dates), ... FROM villas;
-- after which the conflict check becomes:
SELECT * FROM villaReservation
WHERE Reserved && daterange('2023-02-10', '2023-02-20', '[)'::text)
Last thing: I personally prefer keeping the 2 bounds of ranges separate in tables (above, keep separate check-in and check-out dates).
It makes migrating from PostgreSQL to another DBMS easier (the table's CREATE script will not need to be adjusted).
It might not apply in your case, but it makes it possible to have ranges in the form of [date1, date1), that is, empty ranges that still have a placement in time. I actually encountered one use case where things needed to be saved this way, albeit in a different context than yours.
I am a Business Analyst and have prepared tables/erd for a system we are implementing.
The context is essentially an employee management system, an employee can join the company, change positions, get promoted, demoted, terminated etc. All of this is required to be tracked for filtering and reporting purposes. Therefore we require historical tracking of records.
My recommendation and original design of the tables included a field called "Effective Date", so essentially effectively from a date onwards a particular "Action" is valid.
Say, for example, John joined an organisation as a consultant on 1st Jan 2017. The action is that he was hired, so the effective date is 1st Jan 2017. He was a consultant for a certain period of time until he became a senior consultant on 6th September 2017, so that record has an effective date of 6th September 2017 with an action of promoted.
By the way we will also be performing calculations on the salary of the employee based on their position and other parameters so there will be derived fields and fields being referenced from other tables etc.
Now my boss and the Solutions Architect have advised not to use the "Effective Date". My boss says there will be "problems" with the calculation but doesn't elaborate; the Solutions Architect says it would be easier to use a Start Date and an End Date instead. His rationale is that if there's no end date, that action/event is active, but it becomes inactive once an end date is provided.
My problem with this is that we'll have to maintain an additional column that I feel is totally unnecessary.
What do the brains trust of StackOverflow advise??
Thanks :)
Your instincts serve you well. Don't use the end date. This adds a complication and source of possible anomalous data. Take the following sequential entries:
ID <attr> StartDate EndDate
1 ... Jan 1 Jan 20
1 ... Jan 20 Jan 22
1 ... Feb 1 Jul 30
There was a state change recorded on Jan 1 which was in effect until the next state change on Jan 20. Now we have a problem. According to the EndDate of that version, there was another state change on Jan 22, but the next version started on Feb 1.
This forms a gap in the time stream and we have no indication of where the problem lies. Is the EndDate of Jan 22 wrong? Is the StartDate of Feb 1 wrong? Or is there a missing version that connects the two ends of the gap? There is no way to tell.
ID <attr> StartDate EndDate
1 ... Jan 1 Jan 20
1 ... Jan 20 Feb 20
1 ... Feb 1 Jul 30
Now there is an overlap of states. The second state supposedly lasted until Feb 20 but the third state says it started on Feb 1. But the start of one state logically means the end of the previous state. Again, we have no idea (just by looking at the data) which date is wrong.
Knowing that the start of one state also indicates the end of the previous state, look what happens when we simply remove the EndDate column.
ID <attr> EffDate
1 ... Jan 1
1 ... Jan 20
1 ... Feb 1
Now gaps and overlaps are impossible. Each state begins at the effective date and ends when the next state begins. As the EffDate field is part of the PK, no entry can have the same EffDate value for a given ID value.
This design is not used on the main entity table itself. It is implemented as a special form of second normal form, which I call version normal form (vnf).
Your Employee table will have fields that don't change over the course of time and some that do. You might also have fields that change but you don't wish to track those changes.
create table Employees(
ID int auto_generated primary key,
Hired date not null,
FName varchar not null,
LName varchar not null,
Sex enum, -- M or F
BDay date,
Position enum not null,
PayRate currency,
DeptID int references Depts( ID )
);
If we wish to track changes to the data, we could add an effective date field. Consider, however, that data such as the hire date and birth date will not change from one version to another. Thus they are dependent only on the ID field. The data that does change (Position, PayRate, DeptID) are dependent on the ID and the effective date field. The table is no longer in 2nf.
So we normalize:
create table Employees(
ID int auto_generated primary key,
Hired date not null,
FName varchar not null,
Sex enum, -- M or F
BDay date
);
create table Employees_V(
ID int not null references Employees( ID ),
EffDate date not null,
LName varchar not null,
Position enum not null,
PayRate currency,
DeptID int references Depts( ID ),
constraint PK_Employees_V primary key( ID, EffDate )
);
The last name can be expected to change now and then, especially among the female employees.
One of the main advantages of this method is that foreign keys never reference versions. All FKs can reference the main entity table as normal.
The query to obtain the "current" data is relatively simple:
select e.ID, e.Hired, e.FName, v.Lname, e.Sex, e.BDay, v.Position, v.PayRate, v.DeptID
from Employees e
join Employees_V v
on v.ID = e.ID
and v.EffDate = (
select Max( EffDate )
from Employees_V
where ID = v.ID
and EffDate <= GetDate())
where e.ID = 123;
Compare to querying a table with start/end dates.
select ID, Hired, FName, Lname, Sex, BDay, Position, PayRate, DeptID
from Employees
where ID = 123
and StartDate <= GetDate()
and EndDate > GetDate();
This assumes the EndDate value for the current version is a magic value such as 12/31/9999.
This second query looks a lot simpler than the first. Even if the data is normalized as shown above, there is a join but no subquery. It also looks like it will execute much faster.
I have used this technique for about 8 years now and I've never had to alter it because of performance issues. The vnf query runs at worst less than 10% slower than the start/end version. So a one minute query will take about one minute 5 seconds. However, under some conditions, the vnf query will execute faster.
Take entities that have many, many changes (many thousands of versions). The start/end query performs an index scan. It starts at the earliest version and must examine each version in sequence until it finds the one whose period covers the target date. Normally, this is the last version. In the vnf query, the subquery makes it possible to perform an index seek.
So don't reject this design because you think it is slow. It is not slow. Especially when you consider that inserting a new version requires only the one INSERT statement. When working with start/end dates, the insert of a new version requires an UPDATE and then an INSERT. It's two UPDATEs and an INSERT when inserting a new version between two existing versions. To remove a start/end version requires one or two UPDATE and one DELETE statements. To delete a vnf version, just delete the version.
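A sketch of that difference, using the tables above plus a hypothetical Employees_SE table that carries StartDate/EndDate and the 12/31/9999 magic value (all values invented for illustration):
-- vnf: a new version is a single INSERT
insert into Employees_V ( ID, EffDate, LName, Position, PayRate, DeptID )
values ( 123, '2018-06-01', 'Smith', 'Manager', 5500, 42 );

-- start/end: close the current version, then open the new one
update Employees_SE
   set EndDate = '2018-05-31'
 where ID = 123
   and EndDate = '9999-12-31';
insert into Employees_SE ( ID, StartDate, EndDate, LName, Position, PayRate, DeptID )
values ( 123, '2018-06-01', '9999-12-31', 'Smith', 'Manager', 5500, 42 );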
And if the start and end dates between versions ever get out of synch, you have a gap or overlap and good luck finding the right values.
So I'll take the small performance hit to ensure that the data can never get out of synch and turn anomalous on me. This (vnf), as it turns out, is really the simpler design.
Definitely implement the end date. It is a tiny bit more work when writing but you only write it once, but you will report on it many many times and you'll find that it makes everything so much easier (and faster) when the end date is already there on the record.
All over Stack Overflow you will find questions about writing queries to find the end date of a given record when it is defined on the 'next' record rather than the 'current' record. These queries are ugly and slow.
If you look at the back end of enterprise systems like SAP you'll find that records have start and end dates defined.
With regards to your colleagues comments about not using effective date: You don't provide much info so I'll guess. I'm guessing that there is a true 'effective date' when the thing happened but there is also another set of start and end dates which are the payroll effective dates that the change applies to. So if someone starts on the 1st, the payroll effective date might actually be the 15th. This might also be used for FTE calculations. Payroll and pay periods are really a big deal and quite complex so you shouldn't underestimate the complexity there. If you're including pay calculations in this system then at the very least you need to understand what effective payroll dates are.
You should not be afraid of storing four date columns instead of one. Databases are there to make things easy for you not harder.
Using startDate and endDate makes updates messy, but it makes fetching by effective date much easier and faster.
Updating the same record asynchronously may cause the dates to overlap, since we need to fetch all the records within the update range and update each of those records individually.
On the other hand, using effectiveDate alone speeds up the update process and eliminates the date-overlap issue. But fetching becomes much more complicated this way.
For example:
ID Data EffDate
1 ... Jan 1 2020
1 ... Jan 30 2020
1 ... Feb 1 2020
In the above example, if we want to fetch the record effective on Feb 1, we would have to compare the first three records to find the highest matching date (which is not practical when fetching a list). On top of that, it is a mess to join with other effective-dated tables.
I am working on a hotel DB, and the booking table changes a lot since people book and cancel reservations all the time. I am trying to find the best way to convert the booking table to a fact table in SSAS, so that I can get the right statistics from it.
For example: if a client X booked a room on Sep 20th for Dec 20th and canceled the order on Oct 20th. If I run the cube on the month of September (run it in Nov) and I want to see how many rooms got booked in the month of Sep, the order X made should be counted in the sum.
However, if I run the cube for YTD calculation (run it in Nov), the order shouldn't be counted in the sum.
I was thinking about inserting the updates into the same fact table every night and, in addition to the booking number (unique key), adding a revision column to the table. So going back to the example, let's say client X's booking number is 1234: the first time I enter it into the table it will get revision 0, and in Oct when I add the cancellation record, it will get revision 1 (of course with a timestamp on the row).
Now, if I want to look at any period of time, I can filter by the timestamp and look at MAX(revision).
Does it make sense? Any ideas?
NOTE: I gave the example of cancelling the order, but we want to track other statistics as well.
Another option I read about is partitioning the cubes, but do I partition the entire table? I want to be able to add changes every night. Will I need to partition the entire table every night? It's a huge table.
One way to handle this is to insert records in your fact table for bookings and cancellations. You don't need to look at the max(revision) - cubes are all about aggregation.
If your table looks like this:
booking number, date, rooms booked
You can enter data like this:
00001, 9/10, 1
00002, 9/12, 1
00001, 10/5, -1
Then your YTDs will always have information accurate as of whatever month you're looking at. Simply sum up the booked rooms.
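A sketch of the two aggregations, assuming the fact table is named bookings with columns booking_number, booking_date and rooms_booked (hypothetical names; a year is added to the example dates for clarity):
-- Rooms booked in September: the October cancellation row does not subtract here
select sum(rooms_booked)
from bookings
where booking_date >= '2013-09-01' and booking_date < '2013-10-01';

-- YTD as of November: the +1 and -1 rows for booking 00001 cancel each other out
select sum(rooms_booked)
from bookings
where booking_date >= '2013-01-01' and booking_date < '2013-12-01';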
I need to store values for every day in a timeline, i.e. every user of the database should have a status assigned for every day, like this:
from 1.1.2000 to 28.05.2011 - status 1
from 29.05.2011 to 30.01.2012 - status 3
from 1.2.2012 to infinity - status 4
Each day should have only one status assigned, and the last status does not end (until another one is given). My question is: what is an effective representation of this in an SQL database? The obvious solution is to create a row for each change (with the last day the status is assigned in each range), like this:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
01.01.9999 status 4
This has many problems: if I wanted to add another range, say from 15.02.2012, I would need to alter the last row too:
uptodate status
28.05.2011 status 1
30.01.2012 status 3
14.02.2012 status 4
01.01.9999 status 8
and it requires lots of checking to make sure there is no overlapping and no errors, especially if someone wants to modify ranges in the middle of the list. Inserting a new status from 29.01.2012 to 10.02.2012 is hard to implement (it would require the date ranges of status 3 and status 4 to shrink accordingly to make space for the new status). Is there any better solution?
I thought about a completely different solution: storing each day's status in a separate row, so there would be a row for every day in the timeline. This would make it easy to update (simply enter the new status for rows with dates between start and end). Of course this would generate a large amount of needless data, so it's a bad solution, but it is coherent and easy to manage. I was wondering if there is something in between, but I guess not.
More context: I want a moderator to be able to assign statuses freely to any dates, and edit them if needed. But most often the moderator will be adding new status date ranges at the end. I don't really need the last status to be open-ended. After the moderator finishes editing a whole month, I need to generate a report based on the status of each day in that month. But at any time the moderator may want to edit data from months ago (which would be reflected in updated reports), and he can put in one status for, e.g., one year in advance.
You seem to want to use this table for two things - recording the current status and the history of status changes. You should separate the current status out and move it up to the parent (just like the registered date)
User
===============
Registered Date
Current Status
Status History
===============
Uptodate
Status
Your table structure should include the effective and end dates of the status period. This effectively "tiles" the statuses into groups that don't overlap. The last row should have a dummy end date (as you have above) or NULL. Using a value instead of NULL is useful if you have indexes on the end date.
With this structure, to get the status on any given date, you use the query:
select *
from t
where <date> between effdate and enddate
To add a new status at the end of the period requires two changes:
Modify the row in the table with the enddate = 01/01/9999 to have an enddate of yesterday.
Insert a new row with the effdate of today and an enddate of 01/01/9999
I would wrap this in a stored procedure.
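A sketch of that procedure's body, with hypothetical names (status_history, user_id, effdate, enddate, status); the exact date arithmetic and procedure syntax vary by DBMS, and the two statements must run in one transaction:
begin;

-- close the open-ended current row as of yesterday
update status_history
   set enddate = current_date - 1
 where user_id = 42
   and enddate = date '9999-01-01';

-- open the new status starting today
insert into status_history (user_id, effdate, enddate, status)
values (42, current_date, date '9999-01-01', 8);

commit;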
To change a status on one date in the past requires splitting one of the historical records in two. Multiple dates may require changing multiple records.
If you have a date range, you can get all tiles that overlap a given time period with the query:
select *
from t
where <periodstart> <= enddate and <periodend> >= effdate
I've been given a stack of data where a particular value has been collected sometimes as a date (YYYY-MM-DD) and sometimes as just a year.
Depending on how you look at it, this is either a variance in type or margin of error.
This is a subprime situation, but I can't afford to recover or discard any data.
What's the optimal (eg. least worst :) ) SQL table design that will accept either form while avoiding monstrous queries and allowing maximum use of database features like constraints and keys*?
*i.e. Entity-Attribute-Value is out.
You could store the year, month and day components in separate columns. That way, you only need to populate the columns for which you have data.
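A sketch of that shape with hypothetical names; only the year is required, and simple CHECKs keep the optional components sane:
create table partial_dates (
    id          integer primary key,
    event_year  integer not null,
    event_month integer,  -- null when only the year is known
    event_day   integer,  -- null when only the year (or year-month) is known
    check ( event_month between 1 and 12 ),
    check ( event_day between 1 and 31 ),
    check ( event_day is null or event_month is not null )  -- no day without a month
);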
If it comes in as just a year, make it default to 01 for month and day: YYYY-01-01.
This way you can still use a date/datetime datatype and don't have to worry about invalid dates
Either bring it in as a string unmolested, and modify it so it's consistent in another step, or modify the year-only values during the import like SQLMenace recommends.
I'd store the value in a DATETIME type and another value (just an integer will do, or some kind of enumerated type) that signifies its precision.
It would be easier to give more information if you mentioned what kind of queries you will be doing on the data.
Either fix it, then store it (OK, not an option)
Or store it broken, with a fixed computed column.
Something like this
CREATE TABLE ...
...
Broken varchar(20),
Fixed AS CAST(CASE WHEN Broken LIKE '[12][0-9][0-9][0-9]' THEN Broken + '0101' ELSE Broken END AS datetime)
This also allows you to detect good from bad source data
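For instance, on SQL Server 2012 or later, the rows whose value parses under neither form can be listed with TRY_CAST (a sketch; the table name imported_dates is assumed):
select Broken
from imported_dates
where try_cast(case when Broken like '[12][0-9][0-9][0-9]'
                    then Broken + '0101'
                    else Broken end as datetime) is null;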
If you don't always have a full date, what sort of keys and constraints would you need? Perhaps store two columns of data; a full date, and a year. For data that has only year, the year is stored and date is null. For items with full info, both are populated.
I'd put three columns in the table:
The provided value (YYYY-MM-DD or YYYY)
A date column, Date or DateTime data type, which is nullable
A year column, as an integer or char(4) depending upon your needs.
I'd always populate the year column, populate the date column only when the provided value is a date.
And, because you've kept the provided value, you can always re-process down the road if needs change.
An alternative solution would be that of a date mask (like an IP netmask). Store the date in a regular datetime field, and add an additional field of type smallint or so, where you indicate which parts are present (you could even go binary here):
If you have YYYY-MM-DD, you would have 3 bits of data, each set to 1 if the part is present and 0 if not.
Example:
Date Mask
2009-12-05 7 (111)
2009-12-01 6 (110: only year and month are known, and the day is set to the default 1)
2009-01-20 5 (101: for some strange reason, only the year and the day are known. January has 31 days, so it will never generate an error)
Which solution is better depends on what you will do with it.
This is better when you want to select the rows with full dates that fall within a certain period (less to write). It also makes it easier to compare dates that have masks like 7, 6 or 4. It may also take up less memory (date + smallint may be smaller than int + int + int; only if datetime uses 64 bits and smallint takes as much space as int will it be the same).
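For example, with a Mask column defined as above (table name masked_dates hypothetical), selecting only the fully dated rows in a period stays short:
-- Mask = 7 means year, month and day are all known
select EventDate
from masked_dates
where Mask = 7
  and EventDate between '2009-01-01' and '2009-12-31';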
I was going to suggest the same solution as #ninesided did above. Additionally, you could have a date field and a field that quantitatively represents your uncertainty. This offers the advantage of being able to represent things like "on or about Sept 23, 2010". The problem is that to represent the case where you only know the year, you'd have to set your date to be the middle of the year, with 182.5 days' uncertainty (assuming non-leap year), which seems ugly.
You could use a similar but distinct approach with a mask that represents what date parts you're confident about - that's what SQLMenace offered in his answer above.
+1 each to recommendations from ninesided, Nikki9696 and Jeff Siver - I support all those answers though none was exactly what I decided upon.
My solution:
a date column used only for complete dates
an int column used for years
a constraint to ensure integrity between the two
a trigger to populate the year if only date is supplied
Advantages:
can run simple (one-column) queries on the date column with missing data ignored (by using NULL for what it was designed for)
can run simple (one-column) queries on the year column for any row with a date (because year is automatically populated)
insert either year or date or both (provided they agree)
no fear of disagreement between columns
self explanatory, intuitive
I would argue that methods using YYYY-01-01 to signify missing data (when flagged as such with a second explanatory column) fail seriously on points 1 and 5.
Example code for SQLite 3:
create table events
(
    rowid integer primary key,
    event_year integer,
    event_date date,
    check (event_year = cast(strftime('%Y', event_date) as integer))
);
create trigger year_trigger after insert on events
begin
    update events set event_year = cast(strftime('%Y', event_date) as integer)
    where rowid = new.rowid and event_date is not null;
end;
-- various methods to insert
insert into events (event_year, event_date) values (2008, '2008-02-23');
insert into events (event_year) values (2009);
insert into events (event_date) values ('2010-01-19');
-- select events in January without expressions on supplementary columns
select rowid, event_date from events where strftime('%m', event_date) = '01';