Table structure for Scheduling App in SQL DB - sql

I'm working on a database to hold information for an on-call schedule. Currently I have a structure that looks about like this:
Table - Person: (key)ID, LName, FName, Phone, Email
Table - PersonTeam: (from Person)ID, (from Team)ID
Table - Team: (key)ID, TeamName
Table - Calendar: (key dateTime)dt, year, month, day, etc...
Table - Schedule: (from Calendar)dt, (id of Person)OnCall_NY, (id of Person)OnCall_MA, (id of Person)OnCall_CA
My question is: With the Schedule table, should I leave it structured as is, where the dt is a unique key, or should I rearrange it so that dt is non-unique and the table looks like this:
Table - Schedule: (from Calendar)dt, (from Team)ID, (from Person)ID
and have multiple entries for each day, OR would it make sense to just use:
Table - Schedule: (from Calendar)dt, (from PersonTeam)KeyID - [make a key ID on each of the person/team pairings]
A team will always have someone on call, but a person can be on call for more than one team at a time (if they are on multiple teams).
If a completely different setup would work better let me know too!
Thanks for any help! I apologize if my question is unclear. I'm learning fast but nevertheless still fairly new to using SQL daily, so I want to make sure I'm using best practices when I learn so I don't develop bad habits.

The current version, one column per team, is probably not a good idea. Since you're representing teams as a table (and not as an enum or equivalent), it means you expect to add/remove teams over time. That would force you to add/remove columns to the table, which is always a much larger task than adding/removing a few rows.
The 2nd option is the usual solution to a problem like this. A safe choice. You can always define an additional foreign key constraint from Schedule(teamID, personID) to PersonTeam to ensure you don't mistakenly assign schedule duty to a person not belonging to the team.
The 3rd option is pretty much equivalent to the 2nd, only you're swapping a composite natural key for PersonTeam for a surrogate simple key. Since the two components of said composite key are already surrogate, there is no advantage (in terms of immutability, etc.) to adding this additional one. Plus it would turn a very simple N-M relationship (PersonTeam) which most DB managers / ORMs will handle nicely into a more complex object which will need management on its own.
By Occam's razor, I'd do away with the additional surrogate key and use your 2nd option.
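To make the 2nd option concrete, here is a minimal sketch using Python's built-in sqlite3 module so it can be run as-is. The column names PersonID and TeamID and the sample rows are assumptions for illustration (the question only gives rough column lists), and other databases will differ slightly in syntax. It shows the composite foreign key from Schedule to PersonTeam rejecting a person who is not on the team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, LName TEXT, FName TEXT);
CREATE TABLE Team   (ID INTEGER PRIMARY KEY, TeamName TEXT);
CREATE TABLE PersonTeam (
    PersonID INTEGER NOT NULL REFERENCES Person(ID),
    TeamID   INTEGER NOT NULL REFERENCES Team(ID),
    PRIMARY KEY (PersonID, TeamID)
);
CREATE TABLE Schedule (
    dt       TEXT    NOT NULL,
    TeamID   INTEGER NOT NULL,
    PersonID INTEGER NOT NULL,
    PRIMARY KEY (dt, TeamID),
    -- composite FK: the scheduled person must be a member of the team
    FOREIGN KEY (PersonID, TeamID) REFERENCES PersonTeam (PersonID, TeamID)
);
""")

# Made-up sample data: person 1 belongs to team 1 only.
conn.execute("INSERT INTO Person VALUES (1, 'Doe', 'Jane')")
conn.execute("INSERT INTO Team VALUES (1, 'NY')")
conn.execute("INSERT INTO Team VALUES (2, 'MA')")
conn.execute("INSERT INTO PersonTeam VALUES (1, 1)")


def try_assign(dt, team_id, person_id):
    """Return True if the schedule row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO Schedule VALUES (?, ?, ?)", (dt, team_id, person_id))
        return True
    except sqlite3.IntegrityError:
        return False


ok_member = try_assign("2024-01-01", 1, 1)    # person 1 is on team 1
ok_outsider = try_assign("2024-01-01", 2, 1)  # person 1 is not on team 2
```

With this in place the database itself refuses the second assignment, so the application does not have to re-check team membership.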

In my view, the answer may depend on whether the number of teams is fixed and fairly small. Whether the names of the teams are fixed may also matter, but that probably has more to do with column naming.
More specifically, my view is this:
If the business requirement is to always have a small and fixed number of people (say, three) on call, then it may well be more convenient to allocate three columns in Schedule, one for every team to hold the ID of the appointed person, i.e. like your current structure:
dt OnCall_NY OnCall_MA OnCall_CA
--- --------- --------- ---------
with dt as the primary key.
If the number of teams (in the Team table) is fixed too, you could include teams' names/designators in the column names like you are doing now, but if the number of teams is more than three and it's just the number of teams in Schedule that is limited to three, then you could just use names like OnCallID1, OnCallID2, OnCallID3.
But even if that requirement is fixed, it may only turn out fixed today, and tomorrow your boss says, "We no longer work with a fixed number of teams (on call)", or "We need to extend the number of teams supported to four, and we may need to extend it further in the future". So, a more universal approach would be the one you are considering switching to in your question, that is
dt Team Person
--- ---- ------
where the primary key would now be dt, Team.
That way you could easily extend/reduce the number of people on call on the database level without having to change anything in the schema.
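As a rough illustration of the row-per-team layout, the sketch below uses Python's sqlite3 module. Team and person names are stored as plain text to keep it short (the real tables would hold IDs), and the fourth team 'TX' is a made-up example. It shows that adding a team is only another row, while the (dt, Team) primary key still allows one entry per team per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Schedule (
    dt     TEXT NOT NULL,
    Team   TEXT NOT NULL,
    Person TEXT NOT NULL,
    PRIMARY KEY (dt, Team)   -- one on-call entry per team per day
)""")

conn.executemany(
    "INSERT INTO Schedule VALUES (?, ?, ?)",
    [
        ("2024-01-01", "NY", "Alice"),
        ("2024-01-01", "MA", "Bob"),
        ("2024-01-01", "CA", "Alice"),  # same person on call for two teams is fine
    ],
)

# Supporting a fourth team needs no schema change, only another row.
conn.execute("INSERT INTO Schedule VALUES ('2024-01-01', 'TX', 'Carol')")

# A second person for the same team and day violates the primary key.
try:
    conn.execute("INSERT INTO Schedule VALUES ('2024-01-01', 'NY', 'Bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

teams_on_call = [
    row[0]
    for row in conn.execute(
        "SELECT Team FROM Schedule WHERE dt = '2024-01-01' ORDER BY Team"
    )
]
```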
UPDATE
I forgot to address your third option in my original answer (sorry). Here goes.
Your first option (the one actually implemented at the moment) seems to imply that every team can be represented by no more than one person at a time. If you assign surrogate IDs to the Person/Team pairs and use those keys in Schedule instead of separate IDs for Person and Team, you will probably be unable to enforce that "one person per team in Schedule" requirement at the database level (or, at least, it might prove somewhat troublesome). With separate keys, it is enough to make Team part of a composite key (dt, Team) and you are done: no more than one person per team per day.
Also, you may have difficulties letting a person change teams over time if their membership of a team was fixed in this way, i.e. with a Schedule reference to the Person/Team pair. You would probably have to change the Team reference in the PersonTeam table, which would misrepresent historical info: when looking at the people on call on a certain day in the past, the person's Team shown would be the one they belong to now, not the one they belonged to then.
Using separate IDs for people and teams in Schedule, on the other hand, would allow you to let people change teams freely, provided you do not make (Schedule.Team, Schedule.Person) a reference to (PersonTeam.Team, PersonTeam.Person), of course.


Modelling many-to-many relation between more than two tables

I'm modelling a tier-list database using PostgreSQL. This is how it works:
A user can create a new Tier List;
A user can add as many tiers he wants to the list;
A user can add as many items as he can. Initially, the items are added to an "unranked" section (not assigned to any tier), then the user can rank them as he wants.
Modeling details:
A tier necessarily belongs to a tier_list;
An item can be in multiple tier_lists and in multiple tiers as well;
An item added to a tier_list has not necessarily been added to one of the tiers.
For modelling the relations between item-tier and item-tier_list, I thought about two scenarios:
Creating a junction with a composite PFK key of item and tier_list with a nullable tier FK. The records with no tier value would be the unranked ones, while the ones with an assigned tier would be the ranked;
Creating two M-N relations: one between item and tier, storing ranked items, and another between item and tier_list, storing unranked items.
I feel like the first option would be easier to deal with when having to persist things like moving a product between tiers (or even unranking it), while the second looks more compliant to SQL standards. Am I missing something?
First proposed solution model:
Second proposed solution model:
You can create a joint key using 3 different fields.
First of all, why use smallint and not int? I'm not fluent in Postgres, but it's usually better to have the biggest integer possible as primary key (things can grow faster than you expect).
Second, I strongly suggest putting ID_ before, not after, the name of the field used for lookup. It makes it easier to read.
As how to build your tables:
Item
ID PK
Title
Descriptions
I see no problems here. I'd just change the name to tblProducts, for easier reading.
Tier_List
ID PK
Description
Works fine too. Again, I'd look for a better name. I'd call this one tblTiers or tblLeagues instead. Using similar names can cause trouble in 2-3 years when you have to add things and you're not sure what's what. Better to use distinctive names for the tables.
Tier (suggesting tblTiers or tblRankings)
ID PK
Tier_List_ID PK FK
Title
Description
Here I see a HUGE problem. From experience, I don't really understand why you create a composite key here with ID and Tier_List_ID. Do you need to reuse the same ID for different tiers? If that ID has a meaning, take it out of the PK, absolutely! PKs must be simple counters that will NEVER be changed. I have seen people use an ID with a meaning for the end user. It was a total disaster! I can't even begin to describe the quantity of garbage data that DB contained.
I suppose, because you were talking about ranking, that the ID there is a Rank, a level or something like that.
The table should become
ID PK uuid
Tier_List_ID FK
Rank smallint
Title
Description
There's another reason why I'd have you do this: when you have a composite PK, certain DBMSs require you to use the same composite key in the lookup tables, and that can become messy fast!
Now, the lookup table:
tier_list_item (tblRankingLookup?)
ID_Product FK PK
ID_Tier_List FK PK
ID_Tier FK PK
You don't need anything else to make it work smoothly! At least, that's how I'd envision it.
If anything, I'd add an ID_User (because I'm not sure whether all users can see all tiers and all rankings, or only their own).
Addendum: if you need unique combinations of different elements, I'm pretty sure you can create a composite index and mark it as "unique" (I don't remember the exact syntax, and I'm not sure it is the same in Postgres).
For example, if you want each rank to appear only once per tier_list_ID in the Tier table, you can create an index on tier_list_ID and Rank and mark it unique. This way two tiers in the same tier_list cannot have the same value in the Rank field (Rank can still be null).
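The unique-index idea can be tried out with a small sketch. It runs against SQLite through Python's sqlite3 module rather than Postgres, and the index name tier_list_rank_uq and the sample rows are invented for the example. CREATE UNIQUE INDEX has the same basic form in Postgres, and both engines by default treat NULLs as distinct, so several unranked tiers can coexist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tier (
    id           INTEGER PRIMARY KEY,
    tier_list_id INTEGER NOT NULL,
    rank         INTEGER,          -- nullable on purpose
    title        TEXT
);
-- A rank may appear at most once within a given tier list.
CREATE UNIQUE INDEX tier_list_rank_uq ON tier (tier_list_id, rank);
""")


def insert_tier(tier_id, tier_list_id, rank, title):
    """Return True if the row was accepted, False if the unique index rejected it."""
    try:
        conn.execute(
            "INSERT INTO tier VALUES (?, ?, ?, ?)",
            (tier_id, tier_list_id, rank, title),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    insert_tier(1, 10, 1, "S"),
    insert_tier(2, 10, 2, "A"),
    insert_tier(3, 10, 1, "duplicate rank"),  # same list, same rank: rejected
    insert_tier(4, 11, 1, "S"),               # other list, same rank: accepted
    insert_tier(5, 10, None, "unranked 1"),   # NULL ranks do not collide
    insert_tier(6, 10, None, "unranked 2"),
]
```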

Create simple database for chess tournaments

I am trying to make a simple app for chess tournaments, but I have a problem with the database. I have users that participate in a tournament (that's fine), but how do I assign users to rounds and matches? Should I make further relations user_tournament-round-tournament and user_tournament-match-round?
Please see this answer as food for thought rather than a solution. Your question does not contain enough information to fully cover all use cases, so the answer below contains a lot of speculation.
In my oversimplified view, and picking up on your initial model, the tournament_competitors table (renamed from user_tournament, as we have competitors and not users) would create a unique id for each enrolled competitor. This id would be used as a reference in a tournament_matches table. That table would link twice to tournament_competitors, since it connects two opponents (constraint warning). It would also record the match type.
For the match type, I see two possibilities.
The matches table would list all possible match types (final, semi-final, quarter-final, elimination rounds, etc.) and these would be referred to in the tournament_matches table via id (composite key in the form tournament_id-competitor_id-group_id). This approach, especially for the elimination-round matches, requires a way to link the number of competitors in each elimination group with the number of matches each competitor has to go through before they are considered eliminated or not, creating a round number. I see this as part of the business logic, so not something for the DB. The group_id also needs to be calculated, and that would be best done in the application.
The alternative is to have the various match types in the tournament_matches table as a free field - populated by the application. The tournament structure (Number of Groups, number of opponents in each group, etc.) would be defined as attributes in the tournaments table. In this view there is no need for the rounds table.
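A minimal sketch of the second variant (match type as a free field) might look like the following, again using Python's sqlite3 module. The column names white_id, black_id and match_type are assumptions, not taken from the question. The CHECK constraint covers the "constraint warning" mentioned earlier: the two links to tournament_competitors must point at different competitors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE tournament_competitors (
    id            INTEGER PRIMARY KEY,
    tournament_id INTEGER NOT NULL,
    user_id       INTEGER NOT NULL,
    UNIQUE (tournament_id, user_id)   -- a user enrols once per tournament
);
CREATE TABLE tournament_matches (
    id         INTEGER PRIMARY KEY,
    white_id   INTEGER NOT NULL REFERENCES tournament_competitors(id),
    black_id   INTEGER NOT NULL REFERENCES tournament_competitors(id),
    match_type TEXT NOT NULL,         -- free field filled in by the application
    CHECK (white_id <> black_id)      -- nobody plays against themselves
);
""")

# Two made-up competitors enrolled in tournament 1.
conn.execute("INSERT INTO tournament_competitors VALUES (1, 1, 100)")
conn.execute("INSERT INTO tournament_competitors VALUES (2, 1, 200)")


def try_match(white_id, black_id, match_type):
    """Return True if the match row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO tournament_matches (white_id, black_id, match_type) "
            "VALUES (?, ?, ?)",
            (white_id, black_id, match_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_match = try_match(1, 2, "final")
self_match = try_match(1, 1, "final")      # rejected by the CHECK constraint
unknown_player = try_match(1, 99, "final")  # rejected by the foreign key
```

Note that this sketch does not verify that both competitors are enrolled in the same tournament; that would need a composite key or a check in the application.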

Access 365 - No duplicates based on a foreign key

I have an issue/problem that I'm looking for a solution for. If I find a solution for this, it could fix two issues I'm facing in a database tool I'm creating for my school's admin office.
The gist of it is, I'm looking for a way in Access 365 (or Access 2019) either through a form/query/vba... to limit data to not have duplicates. Now, of course I know that you can set the field in your table to not have duplicates. But, here is the thing, I wouldn't mind a duplicate in that column in certain cases.
So, let me explain the issue/problem by giving an example.
Imagine that you have two tables. One table is filled with the name of a school bus and the yearly price that the parents have to pay for it. Another table has all the stops that the school buses take.
Now, here comes the problem. I would love to have a system where I can let the user decide the order of those stops.
What I'm currently doing is, I let the user fill in a number. In another column, I have a calculated field that adds the primary key of the schoolbus table "." the number of the order.
So, that would look like this for the first school bus and the first five stops.
1.1, 1.2, 1.3, 1.4, 1.5
And I have a duplicate control on that column. That way, I can make sure that it doesn't block the number 1 for the next bus.
Something that might work as well, if that's possible in access is that for each school bus you create, you add a column in the stops table to check for duplicates. But how do you make sure then to filter which stops belongs to which schoolbus?
I hope my issue is a bit clear. If it's not, feel free to ask.
I have a similar problem in another part of the tool, with a book fair system.
There, an order table stores the order number, the student number, and the creation and pickup dates. It also has a field to lock the form to avoid any edits for the bookkeeping folks. I also have an orderline table where all the books actually bought are stored. I would love a way to stop users from adding the same book twice to an order in the orderline table, since I already have an amount column for that. So, in a way, it's almost the same as the problem described above.

How to identify duplicate records using client name and address in SQL when both are free text

I have a database with millions of client contacts. However, a lot of them are duplicated and may I ask some hero from here to advise how to identify those duplicates using Oracle SQL, PL/SQL or Excel.
Following is the data structure:
Client_Header
id integer (Primary Key)
Client_First_Name (varchar2)
Client_Last_Name (varchar2)
Client_Date_Of_Birth (timestamp)
Client_Address
Client_Id (Foreign Key ref Client_header)
Address_Line1 (varchar2)
Address_Line2 (varchar2)
Address_Line3 (varchar2)
Suburb (Varchar2)
State (varchar2)
Country (varchar2)
My challenge is other than Client_Date_Of_Birth and those key fields, all fields are free text only.
For example, we have a client like following
Surname : Jones
First name : David
Client_Date_Of_Birth: 10/05/1975
Address: Unit 10 Floor 1, 20 Railway Parade, St Peter, NSW 2044
However, as those fields are free text, I have a lot of data issues and following link (jpeg file only) illustrated some of those issues
Sample of data issues
Note:
Other than those issues, sometimes the first name or last name of the client may be missing (but not both).
Sometimes multiple problems can be found within the same record.
Also, sometimes the address may simply be the name of a school, shopping center, etc.
The system does not store any other id that can uniquely identify the client.
I understand it is close to impossible to gather all duplicate records where the client address is a school or shopping center. However, for other cases, is there any way to identify most of the duplicates?
Thank you for your help!
Not a pretty sight, and I'm afraid I don't have good news for you.
This is a common problem in databases, especially if the data entry personnel are insufficiently trained. One of the main objectives in data entry training is to make the problem well understood and show ways to avoid it. Something to keep in mind in the future.
Unfortunately, there isn't any "magic wand" that will clean your data for you. I'm sorry, but you have before you one of the most tedious tasks in database maintenance. You're going to have to basically remove the duplicates by hand, and the job requires more of an editor than a database administrator.
If you have millions of records, of which perhaps a million are actually duplicates, I would estimate that it will take an expert working full time for at least two years -- and probably longer -- to clean up your problem: to do it in two years would require fixing 2000 records a day, with time off on weekends and two weeks of vacation.
In the end, the only sure way to remove all the duplicates is to compare all of them and remove them one at a time. But there are plenty of tricks you can use to get rid of blocks of them at once. Here are a few that I can think of with your data sample:
Change "Dave" to "David" in both first and last name fields. (Make sure that nobody actually has the last name "Dave.")
Change all instances of "Jones David" to "David Jones." (Make sure that there are no people named "Jones David".)
Change "1/F" to "Floor 1."
The idea is to focus on some of the fields, and in those fields get all of the duplicates to be exact duplicates. Once you have that done, you delete all the records with the target values in the fields, except the one with the primary key of the record that you want to keep (if your table isn't keyed, you'll have to find another way to do it, such as selecting the top record into a new table).
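As a small, hedged sketch of that two-step idea (first make duplicates exact, then delete all but one), the example below uses an in-memory SQLite table through Python's sqlite3 module instead of Oracle. The table name client, its columns and the four sample rows are made up for the illustration. The UPDATE and DELETE statements themselves are plain SQL, but test anything like this on a copy of the data first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, dob TEXT)"
)
conn.executemany(
    "INSERT INTO client VALUES (?, ?, ?, ?)",
    [
        (1, "David", "Jones", "1975-05-10"),
        (2, "Dave", "Jones", "1975-05-10"),   # nickname variant
        (3, "Jones", "David", "1975-05-10"),  # first and last name swapped
        (4, "Mary", "Smith", "1980-01-01"),
    ],
)

# Step 1: turn near-duplicates into exact duplicates in the chosen fields.
conn.execute("UPDATE client SET first_name = 'David' WHERE first_name = 'Dave'")
# Both right-hand sides are read before either column is written, so this swaps them.
conn.execute(
    "UPDATE client SET first_name = last_name, last_name = first_name "
    "WHERE first_name = 'Jones' AND last_name = 'David'"
)

# Step 2: within each group of exact duplicates keep the lowest primary key.
conn.execute(
    "DELETE FROM client WHERE id NOT IN ("
    "  SELECT MIN(id) FROM client GROUP BY first_name, last_name, dob)"
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM client ORDER BY id")]
```

Keeping the lowest id is an arbitrary choice here; in practice you would pick the record you actually want to keep.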
This technique speeds things up for records with a large number of duplicates. Where you have only a few duplicates, it's quicker to just identify them one by one. One way to do this quickly is to go into edit mode on a table, work with a particular field (for example, the postal code field in this case), and put a unique value in that field when you want to mark it for deletion (in this case, perhaps a single zero). Then you can periodically delete all the records with that value in the field.
You'll also need to sort the data in multiple ways to find the duplicates, which it appears you already know.
As for your notes, don't try to identify all the ways that the data is messed up. Once you identify one record as a duplicate of another, you don't care what's wrong with it, you just have to get rid of it. If you have two records and each contains data that you want to keep that the other one is missing, then you'll have to consolidate them and delete one of them. And then go on to the next, and the next, and the next...
Some years ago I had a similar task, and it took about one year to clean the data.
What I did in short:
send the address to api.addressdoctor.com for validation and split into single fields (with maps.googleapis.com it is also possible)
use a first name and last name match list to check the names (we used namepedia.org). A lot depends on the quality of this list. The list should be based on the country of birth or of the first address. From the results we derived a probability for what kind of name it is (first/last/company).
with this improved data you should create some normalized and fuzzy attributes, for example normalized fields from names and addresses, such as upper case and alphanumeric characters only
at the end I would change the data model a little to improve data quality by design. I recommend adding pre-title, post-title, middle-name and post-name fields. You should also add the split address fields like street, streetno, zip, location, longitude, latitude, etc.
I would also change the relation between Client_Header and Client_Address with an extra address_Id as primary key, but this depends on the requirements. And at the end I would add some constraints to prevent duplicate entries.
after all that, the deduplication is not hard. Just group all normalized or fuzzy data together and create a dense_rank (I group by person, household, ...). Rank the records within each group over their attributes (I used data quality, data fill rate and transaction history for a score value). Finally, it is your choice whether to delete the duplicates and copy the corresponding data to the surviving client, or to connect the data virtually via Client_Id in an extra field.
for insert and update processes you should create PL/SQL functions that check whether fuzzy last name (e.g. first name) + fuzzy address already exist. Split the name and address fields, check them with the address APIs, and match them against the names reference. If it is a single-tuple data entry, show the best results to the user and let them decide.
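The normalize-then-group step can be sketched in a few lines. This is plain Python rather than PL/SQL, the sample clients are invented, and the normalize helper is a hypothetical stand-in for whatever normalized or fuzzy attributes you build (it only upper-cases and keeps A-Z and 0-9, so it would also strip accented letters). The group numbering imitates what DENSE_RANK over the normalized key would give you.

```python
import re
from collections import defaultdict


def normalize(text):
    """Upper-case and keep only A-Z and 0-9 (hypothetical normalization rule)."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


# Made-up sample rows: (id, first name, last name, address)
clients = [
    (1, "David", "Jones", "20 Railway Parade"),
    (2, "david", "JONES", "20 Railway-Parade"),
    (3, "Mary", "Smith", "5 High St."),
    (4, "D avid", "Jones", "20 railway parade"),
]

# Group client ids by their normalized (first, last, address) key.
groups = defaultdict(list)
for cid, first, last, addr in clients:
    key = (normalize(first), normalize(last), normalize(addr))
    groups[key].append(cid)

# Number the groups 1..n in key order, like DENSE_RANK() OVER (ORDER BY key).
group_no = {}
for rank, key in enumerate(sorted(groups), start=1):
    for cid in groups[key]:
        group_no[cid] = rank

# Groups with more than one member are duplicate candidates.
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Exact matching on normalized keys only catches the easy cases; real fuzzy matching (phonetic codes, edit distance) would replace the normalize function.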

SSIS Population of Slowly Changing Dimension with outrigger

Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So we have an open number of areas of work and an open number of teams, which means we can't have team 1, team 2, etc. in our dimensions. The other attributes may also change over time, like base location (where they work out of) and the main team and area they work in.
So, following Kimball, I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthProfessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work in which they work changes, we need to create a new entry in the SCD and its outrigger tables to encapsulate this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place TeamKey and the WorkAreaKey in the dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact table is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
I'm not sure I understand your question fully. If you are unsure about change detection, then use Checksums in the package. Build up a temp table with the data as it is in the source, then compare each row to its counterpart (joined via the business keys) by computing the checksum for both rows and comparing those. If they differ, the data has changed.
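The checksum comparison can be illustrated outside SSIS with a short Python sketch. The business keys (HP001 and so on), the column list and the sample rows are all invented, and SHA-256 from hashlib is just one possible hash. In a package you would do the equivalent with a staging table and a checksum expression or component.

```python
import hashlib

# Columns that take part in change detection (assumed names).
COLUMNS = ["name", "main_team", "base_location"]


def row_checksum(row, columns):
    """Hash the tracked columns of one row; None is hashed like an empty string."""
    joined = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Rows as they are in the source, keyed by business key (made-up data).
source = {
    "HP001": {"name": "A. Nurse", "main_team": "Relief", "base_location": "North"},
    "HP002": {"name": "B. Doctor", "main_team": "General Medicine", "base_location": "South"},
    "HP003": {"name": "C. Consultant", "main_team": "Cardiology", "base_location": "North"},
}

# Current rows in the dimension, keyed by the same business key.
dimension = {
    "HP001": {"name": "A. Nurse", "main_team": "Relief", "base_location": "North"},
    "HP002": {"name": "B. Doctor", "main_team": "General Medicine", "base_location": "East"},
}

# Joined on the business key: a differing checksum means the row has changed.
changed = sorted(
    key
    for key in source
    if key in dimension
    and row_checksum(source[key], COLUMNS) != row_checksum(dimension[key], COLUMNS)
)
# Keys with no counterpart in the dimension are new rows.
new = sorted(key for key in source if key not in dimension)
```

For a type 2 dimension, each key in changed would get its current row closed off and a new row inserted, rather than an in-place update.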
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context), then the foreign key lookups will automatically look up the newer entry in DimHealthProfessionals if you have a historization (i.e. have validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.