I am trying to create a DB model in SQL Server for storing flight schedules (not real time). I have finally come up with 2 DB models but am confused about which one to choose for storing the schedules.
Approach 1:
For each flight, store the schedule in a single column as a string like 123X56X, along with flight name, depart time, arrival time, source and destination. The string 123X56X means the flight operates on Sunday (1), Monday (2), Tuesday (3), Thursday (5) and Friday (6); an X marks a day it does not operate (here Wednesday and Saturday).
Approach 2:
Keep the flight name, depart time, arrival time, source and destination in one table, and create a separate mapping table for the schedules.
Table1 - wk_days
wk_day_id wk_day_short wk_day_long
1 Sun Sunday
2 Mon Monday
Table2 - flight_schedule
flight_sch_id flight_id src_city_id dest_city_id Depart_tm Arrival_tm Duration
1 1 1 2 6:00 8:00 2:00
Table3 - flight_schedule_wk_days
flight_sch_id wk_day_id
1 2
1 3
1 4
2 2
2 3
2 4
Please suggest which one is better.
A flight schedule database is actually quite a bit more complicated in the real world than either of your examples. (More on this in a moment.)
To answer your question: In general the normalized database approach is a better idea, especially for a transactional database. The second design is normalized. Your first option is reminiscent of old COBOL flat file systems like the original SABRE system.
Using a normalized approach makes your queries much easier and more efficient. Finding out which flights fly on a Tuesday means scanning and doing an in-string analysis on every record under option 1. In option 2 your database can use an index to answer this question without having to read and analyze each record.
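To make that concrete, here is a minimal sketch of the "which flights fly on Tuesday" lookup under each design. The Approach 1 table and its schedule_days column are hypothetical names, since that schema wasn't spelled out; the Approach 2 names come from your own tables.

-- Approach 1: in-string analysis; the LIKE forces a scan of every row
SELECT flight_name
FROM flight
WHERE schedule_days LIKE '%3%'   -- 3 = Tuesday

-- Approach 2: a join the optimizer can satisfy with an index seek
CREATE INDEX ix_fsw_day ON flight_schedule_wk_days (wk_day_id, flight_sch_id)

SELECT fs.flight_sch_id, fs.Depart_tm, fs.Arrival_tm
FROM flight_schedule AS fs
INNER JOIN flight_schedule_wk_days AS fsw
    ON fsw.flight_sch_id = fs.flight_sch_id
WHERE fsw.wk_day_id = 3          -- 3 = Tuesday in wk_days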
On a broader note, a flight is not just an origin and a destination at a particular time on some set of days of the week. Here are some things that a real-world flight schedule database needs to be able to handle:
Flights have an airline identifier
Flights have an operator airline, which can be different from the seller (i.e. "code-share")
Flights can have multiple legs (i.e. multiple sets of origins and destinations under one number)
Flights as defined by scheduling systems do have days of the week, like your model, but they also need to have a start date and end date for the date range in which the flight will operate.
Depending on what your application is intended to do, you might need to take some or all of these into account.
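For illustration, here is a hedged sketch of how those points might extend your second design. Every name below is a placeholder, and modeling legs as a child table is only one of several workable choices:

CREATE TABLE flight (
    flight_id int NOT NULL PRIMARY KEY,
    airline_code char(2) NOT NULL,            -- selling (marketing) airline
    operating_airline_code char(2) NOT NULL,  -- differs from airline_code on code-shares
    flight_number varchar(4) NOT NULL,
    effective_from date NOT NULL,             -- first date the schedule operates
    effective_to date NOT NULL                -- last date the schedule operates
)

CREATE TABLE flight_leg (
    flight_id int NOT NULL REFERENCES flight (flight_id),
    leg_no int NOT NULL,                      -- 1, 2, ... for multi-leg flights
    src_city_id int NOT NULL,
    dest_city_id int NOT NULL,
    depart_tm time NOT NULL,
    arrival_tm time NOT NULL,
    PRIMARY KEY (flight_id, leg_no)
)

Your flight_schedule_wk_days mapping table then hangs off flight (or flight_leg, if the days differ per leg) exactly as in Approach 2.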
I have a Calendar table pulled from our mainframe DBs and saved as a local Access table. The table has history back to the 1930s (and I know we use back to the 50s in at least one place), resulting in 31k records. This Calendar table has 3 fields of interest:
Bus_Dt - every day, not just business days. Primary Key
Bus_Day_Ind - indicates if the day was a valid business day for the stock market.
Prir_Bus_Dt - the prior business day. Contains some errors (about 50), all old.
I have written a query to retrieve the first business day on or after the current calendar day, but it runs supremely slowly (5+ minutes). I have examined the showplan output and see it is being run via a cross join (x-join), which between two 30k+ record tables gives a solution space (and date comparisons) on the order of a billion. However, the actual task is not hard, and could be performed comfortably by Excel in minimal time using a simple sort.
My question is thus: is there any way to fix the poor performance of the query, or is this an inherent failing of SQL? (DB2 on the mainframe is also slow, though not crushingly so; throwing cycles at the problem and all that.) Secondarily, if I were willing to trust Prir_Bus_Dt, could I do better with it? Or restrict the date range (i.e., "cheat"), or any other tricks I haven't thought of yet?
SQL:
SELECT TE2Clndr.BUS_DT AS Cal_Dt
, Min(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
, TE2Clndr AS TE2Clndr_1
WHERE TE2Clndr_1.BUS_DAY_IND="Y" AND
TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]
GROUP BY TE2Clndr.BUS_DT;
Showplan:
Inputs to Query
Table 'TE2Clndr'
Table 'TE2Clndr'
End inputs to Query
01) Restrict rows of table TE2Clndr
by scanning
testing expression "TE2Clndr_1.BUS_DAY_IND="Y""
store result in temporary table
02) Inner Join table 'TE2Clndr' to result of '01)'
using X-Prod join
then test expression "TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]"
03) Group result of '02)'
Again, the question is, can this be made better (faster), or is this already as good as it gets?
I have a new query that is much faster for the same job, but it depends on the Prir_Bus_Dt field (which has some errors). It also isn't great in theory, since a prior-business-day column is not necessarily available on everyone's calendar. So I don't consider this "the" answer, merely an answer. (It works because every calendar day between two business days shares the same Prir_Bus_Dt, so the maximum BUS_DT within each group is the next business day on or after that date.)
New query:
SELECT TE2Clndr.BUS_DT as Cal_Dt
, Max(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
INNER JOIN TE2Clndr AS TE2Clndr_1
ON TE2Clndr.PRIR_BUS_DT = TE2Clndr_1.PRIR_BUS_DT
GROUP BY TE2Clndr.BUS_DT;
What about this approach?
select min(bus_dt)
from te2Clndr
where bus_dt >= date()
and bus_day_ind = 'Y'
(In Access SQL, Date() returns the current date.)
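If the goal is the next business day for every calendar date (not just today), a correlated subquery avoids the cross product entirely. A sketch in Access SQL, assuming BUS_DT is indexed:

SELECT c.BUS_DT AS Cal_Dt,
       (SELECT MIN(n.BUS_DT)
        FROM TE2Clndr AS n
        WHERE n.BUS_DAY_IND = "Y"
          AND n.BUS_DT >= c.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr AS c;

Each outer row then costs roughly one index seek instead of a comparison against every row of the second copy of the table.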
I need to add a constraint to an Oracle database table so that a foreign key value can appear more than once only if the date ranges in two other columns do not overlap.
e.g.
car_id start_date end_date
3 01/10/2012 30/09/2013
3 01/10/2013 30/09/2014 -- okay no overlap
3 01/10/2014 30/09/2015 -- okay no overlap
4 01/10/2012 30/09/2013 -- okay, different foreign key
3 01/11/2013 01/01/2014 -- * not allowed: overlapping dates for this car
Any suggestions? Thanks in advance.
The last time I saw this requirement, the solution looked like this:
Create an AFTER statement trigger, and in that trigger do a self-join on your table like this:
select count(*)
from your_table a
join your_table b
  on a.car_id = b.car_id
 and a.rowid <> b.rowid   -- don't count a row as overlapping itself
 and (a.start_date between b.start_date and b.end_date
      or
      b.start_date between a.start_date and a.end_date)
If the count is zero then everything is OK. If the count is greater than zero, raise an exception and the statement will be rolled back.
Note: this will not work well for tables with millions of rows and frequent inserts.
It works on small lookup tables, or on a big table with infrequent inserts (e.g., batch inserts).
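A minimal sketch of such a statement-level trigger, assuming the table really is named your_table (adjust the names and error number to taste):

create or replace trigger trg_your_table_no_overlap
after insert or update on your_table
declare
  v_overlaps pls_integer;
begin
  select count(*)
    into v_overlaps
    from your_table a
    join your_table b
      on a.car_id = b.car_id
     and a.rowid <> b.rowid
     and (a.start_date between b.start_date and b.end_date
          or b.start_date between a.start_date and a.end_date);
  if v_overlaps > 0 then
    raise_application_error(-20001, 'Overlapping date ranges for the same car_id');
  end if;
end;
/

Because this is a statement-level trigger (no FOR EACH ROW), it may query the table it is defined on without raising the mutating-table error that a row-level trigger would hit.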
I take it that cars are tracked through some sort of process and every date records a state change. For example, you show that car #3 underwent a state change on 1 Oct 2012, again on 1 Oct 2013 and again on 1 Oct 2014. The final entry implies that the state changed again on 1 Oct 2015. Where is the entry showing that? Or is the state something that always lasts exactly one year -- making it possible to specify the end of the state as soon as the state begins? If so, then the entry showing the state change on 1 Nov 2013 is simply wrong. But the one-year specification could just be a coincidence. You could have just picked simplistic data points for your example data.
Your concern at this point is to clearly distinguish valid data from accurate data. We design databases (or should) with an emphasis on data integrity, or validity. That means we constrain each piece of data as sharply as possible, so it is consistent with the specifications for that piece of data.
For example, the car id field is a foreign key -- generally to a table that defines each instance of the car entity. So we know that at least two cars exist, with ids of 3 and 4; otherwise those values could not appear in the example you show.
But what about accuracy or correctness? Suppose that, in the last entry of your example, the car id 3 should really have been 4. There is no way to tell from within the database. This illustrates the difference: both 3 and 4 are valid values, and we are able to constrain the column to only valid values, but only one of them is correct -- assuming for a moment they are the only two cars so far defined. The point is, there is no test, no way to constrain the values to the one that is correct. We can check for validity -- not accuracy.
What you are trying to do is check for accuracy with a validity test. You may claim the "no overlaps" restriction becomes a validity check, but this is just a sort of accuracy check. We can sometimes perform tests to signal data anomalies that indicate an inaccuracy exists somewhere. For example, the overlap could mean the end date of 30 Sep 2014 (second row) is wrong or the start date of 1 Nov 2013 (last row) is wrong or both could be wrong. We have no idea which situation this represents. So we can't just prevent the last row from being entered into the database -- it might be correct with the second row being incorrect.
Invalid data is invalid on its own. Suppose an attempt is made to insert a row for car id 15 and there is no entry for car 15 in the CARS table. Then the value 15 is invalid and the row can be (and should be) prevented from ever entering the table. But date period overlaps are caused by wrong data somewhere -- we have no way of knowing exactly where. We can signal the inconsistency to the user or make a log entry somewhere to have someone look into the problem, but we shouldn't reject the row that "caused" the overlap when it could very well be the existing row that contains the wrong data.
Accuracy, like the data itself, originates from outside the database. If we are lucky enough to be able to detect instances of inaccuracy, the solution also lies outside the database. The best we can do is flag it and have someone investigate to determine what data is correct and what is incorrect and (hopefully) correct the inaccuracy.
UPDATE: Having discussed a bit the concepts of data integrity and accuracy and the differences between them, here is a design idea that may be an improvement.
Note: this is based on the assumption that the date ranges form an unbroken range for each car from the first entry to the last. That is, there are no gaps.
Simple: do away with the end_date field altogether. The first entry for a car sets up the current state of that car with no end date specified. The clear implication is that the state continues indefinitely into the future until the next state change is inserted. The start date of the second state change then becomes the end date of the first state change. Continue as needed.
create table Car_States(
Car_ID int not null,
Start_Date date not null,
..., -- other info
constraint FK_Car_States_Car foreign key( Car_ID )
references Cars( ID ),
constraint PK_Car_States primary key( Car_ID, Start_Date )
);
Now let's look at the data
car_id start_date
3 01/10/2012
3 01/10/2013 -- okay no overlap
3 01/10/2014 -- okay no overlap
4 01/10/2012 -- okay different foreign key
3 01/11/2013 -- What does this mean???
Before that final row was entered, here is how the data is read for the car with id = 3: Car 3 started life in a particular state on 1 Oct 2012, changed to another state on 1 Oct 2013 and then again on 1 Oct 2014 where it remains.
Now the final row is entered: Car 3 started life in a particular state on 1 Oct 2012, changed to another state on 1 Oct 2013, changed to another state on 1 Nov 2013 and then again on 1 Oct 2014 where it remains.
As we can see, we are able to absorb the new data easily into the model. The design makes it impossible to have gaps or overlaps.
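When a report does need explicit ranges, the end dates can be derived on the fly. A sketch against the Car_States table above (the subquery could also be written with the LEAD analytic function):

select cs.Car_ID,
       cs.Start_Date,
       (select min(n.Start_Date)
          from Car_States n
         where n.Car_ID = cs.Car_ID
           and n.Start_Date > cs.Start_Date) as End_Date   -- null means "current state"
  from Car_States cs;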
But is this really an improvement? What if the last entry was a mistake -- possibly meant to be for a different car instead of car 3? Or the wrong dates were entered. The new model just accepted the incorrect data with no complaints and we proceed not knowing we have incorrect data in the table.
This is true. But how is it any different from the original scenario? The last row represents "wrong" data. The question was, "How do I prevent this?" The answer is, in both cases, "You can't! Sorry." The best either design can do is detect the discrepancy and bring it to someone's attention.
One might think that with the original design, with the start and end dates in the same row, it is easy to determine whether a new period overlaps any previously defined period. But this is just as easily determined with the start-date-only design. What is important is that the test for such possible inaccuracies lies primarily with the application, before the data is written to the table, not just within the database.
It is up to the users and/or some automated process to verify new and existing data and determine if any inaccuracies exist. The advantage of using only one date is that, after displaying a warning message with an "Are you sure?" response, the new record can be inserted and the operation is finished. With two dates, other records must be found and their dates resynched to match the new period.
I am working on a hotel DB, and the booking table changes a lot, since people book and cancel reservations all the time. I am trying to find the best way to convert the booking table into a fact table in SSAS, so that I can get the right statistics from it.
For example: client X books a room on Sep 20th for Dec 20th, then cancels the order on Oct 20th. If I run the cube for the month of September (running it in Nov) and want to see how many rooms were booked in September, the order X made should be counted in the sum.
However, if I run a YTD calculation (again in Nov), the order shouldn't be counted in the sum.
I was thinking about inserting the updates into the same fact table every night, keyed by the booking number (unique key) plus an added revision column. Going back to the example: say client X's booking number is 1234. The first time I enter it into the table it gets revision 0; in Oct, when I add the cancellation record, it gets revision 1 (of course with a timestamp on the row).
Now, if I want to look at any period of time, I can filter by the timestamp and take the MAX(revision).
Does it make sense? Any ideas?
NOTE: I gave the example of cancelling an order, but we want to track other statistics as well.
Another option I read about is partitioning the cube, but how do I partition the table? I want to be able to add changes every night. Will I need to repartition the entire table every night? It's a huge table.
One way to handle this is to insert records in your fact table for bookings and cancellations. You don't need to look at the max(revision) - cubes are all about aggregation.
If your table looks like this:
booking number, date, rooms booked
You can enter data like this:
00001, 9/10, 1
00002, 9/12, 1
00001, 10/5, -1
Then your YTDs will always have information accurate as of whatever month you're looking at. Simply sum up the booked rooms.
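To illustrate with the rows above (a sketch, assuming a fact table named booking_fact with columns booking_number, booking_date and rooms_booked, and taking the year to be 2013 for concreteness):

-- Rooms booked during September: the October cancellation row falls outside
-- the date filter, so booking 00001 still counts here.
SELECT SUM(rooms_booked)
FROM booking_fact
WHERE booking_date >= '2013-09-01' AND booking_date < '2013-10-01';

-- YTD through November: the +1 and -1 rows for booking 00001 both fall in
-- the range and cancel out, so the cancelled booking is correctly excluded.
SELECT SUM(rooms_booked)
FROM booking_fact
WHERE booking_date >= '2013-01-01' AND booking_date < '2013-12-01';

The cube measure is just the sum of rooms_booked; slicing by date does the rest.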
What query would I need to calculate the cost of calls in a MySQL database?
I've got two tables: one is a call log with call durations, and the other is the tariff table with peak and off-peak rates. Peak time is 08:00:00 - 19:00:00 and off-peak time is 19:00:00 - 08:00:00. Peak rates are, say, 10p a minute or 0.9992 a second, something along those lines; off-peak is 2p a minute.
I want to know how to query the two tables to calculate the cost of each call according to its duration and the applicable rate per second/minute.
The output would be another table with CallerId, Source, Destination, call duration and cost of call.
This seems relatively straight forward (which usually means I am missing something).
Starting with assumptions. Say the CALL_LOG table looks like this:
CallerId
Source
Destination
Duration
CallStartTime
CallStopTime
. . . and the TARIFF table looks like this:
Id
RateType (Peak or OffPeak)
RateStartTime
RateStopTime
Rate
And let's assume you are using Oracle, since I don't see that specifically mentioned. But you say CDRs, so probably lots of records, so maybe Oracle. (NOTE: I removed the Oracle specific code and decided to do this as an inner join. Might be too slow though, depending on volume.)
And let's assume that the definition of an "off peak call" is a call that starts during an off-peak time, regardless of when it ends. (Note that this definition is critical to doing it correctly.)
Lastly, let's assume that there are only two rates, peak and off-peak, based on your comments. That seems strange, but ok. I would have thought that the times would differ by day, to allow for weekend rates, but you should be able to extrapolate.
So the cost for a call would then be
SELECT l.CallerId,
       l.Source,
       l.Destination,
       l.Duration,
       t.RateType,
       l.Duration * t.Rate AS Cost  -- assumes Duration and Rate share the same time unit
FROM CALL_LOG l
INNER JOIN TARIFF t
    ON l.CallStartTime BETWEEN t.RateStartTime AND t.RateStopTime
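One caveat, hedged: if the off-peak band is stored as a single row running 19:00:00 to 08:00:00, the BETWEEN above will never match it, because the stop time sorts before the start time. One workaround is to store bands that wrap midnight as two rows, e.g. (illustrative values):

-- Id  RateType  RateStartTime  RateStopTime  Rate
-- 1   Peak      08:00:00       18:59:59      10p/minute
-- 2   OffPeak   19:00:00       23:59:59      2p/minute
-- 3   OffPeak   00:00:00       07:59:59      2p/minute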
Let's say we have 100+ hotels, and each hotel has at least 3 room types.
I want to hold each hotel's capacity for one year in the past and one year in the future. How should I design the database for easiest use?
Example:
A hotel has 30 rooms: 10 x "Standard room", 10 x "Duplex room", 10 x "Delux room". I will keep this example to standard rooms. Today is 13.01.2011. I want to keep records from 13.01.2010 to 13.01.2012. What I will store in the database is the number of available rooms, something like this (for standard rooms):
13.01.2011: 10
14.01.2011: 9 (means 1 standard room sold for this day)
15.01.2011: 8 (means 2 standard rooms sold for this day)
16.01.2011: 10 (all available for this day)
17.01.2011: 7 (means 3 standard rooms sold for this day)
18.01.2011: 10
etc...
Thanks in advance.
Let me try to summarize your question to see if I understand it properly:
You have a set of Hotels. Each Hotel has a set of Rooms. Each Room belongs to one of a number of possible Room Types. The lowest level of detail we're interested in here is a Room.
This suggests a table of Hotels, a lookup table of Room Types, and a table of Rooms: each Room will have a reference to its associated Hotel and Room Type.
For any given day, a room is either booked (sold) or not booked (let's leave off partial days for simplicity at this point). For each day in the year before and the year after the current day, you wish to know how many rooms of each type were available (non-booked) at each hotel.
Now, since hotels need to be able to look at bookings individually, it's likely you would maintain a table of bookings. But these would typically be defined by a Room, a Start Date, and a number of Nights, which isn't ideal for your stated reporting purposes: it isn't broken down by day.
So you may wish to maintain a "Room Booking Log" table, which simply contains a record for each room booked on each day: this could be as simple as a datestamp column plus a Room ID.
This sort of schema would let you generate the output you're describing relatively easily via aggregate queries (displaying the sum of rooms booked per day, grouped by hotel and room type, for example). The model also seems like it would lend itself to an OLAP cube.
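A minimal sketch of that schema, with illustrative names, plus the booked-per-day aggregate (available rooms are then the room count for that hotel and type minus this figure):

create table Hotels (
    hotel_id int primary key,
    hotel_name varchar(100) not null
);
create table RoomTypes (
    room_type_id int primary key,
    type_name varchar(50) not null      -- e.g. Standard, Duplex, Delux
);
create table Rooms (
    room_id int primary key,
    hotel_id int not null references Hotels (hotel_id),
    room_type_id int not null references RoomTypes (room_type_id)
);
create table RoomBookingLog (
    room_id int not null references Rooms (room_id),
    booked_date date not null,
    primary key (room_id, booked_date)  -- a room is booked at most once per day
);

-- Rooms booked per hotel, room type and day; subtract from the type's
-- room count to get the availability figures shown in the question.
select r.hotel_id, r.room_type_id, l.booked_date, count(*) as rooms_booked
from RoomBookingLog l
inner join Rooms r on r.room_id = l.room_id
group by r.hotel_id, r.room_type_id, l.booked_date;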
I did a homework question like this once. Basically you need at least 3 tables: one that holds the rooms, one that holds the reservations, and another table that links the two, because it's not a specific room that is reserved at a given time, it's a specific type of room.