Access 365 - No duplicates based on a foreign key - SQL

I have a problem that I'm looking for a solution to. If I find one, it could fix two issues I'm facing in a database tool I'm creating for my school's admin office.
The gist of it is, I'm looking for a way in Access 365 (or Access 2019), whether through a form, query, VBA, etc., to prevent duplicate data. Now, of course I know that you can set a field in your table to not allow duplicates. But here is the thing: in certain cases I wouldn't mind a duplicate in that column.
So, let me explain the issue/problem by giving an example.
Imagine that you have two tables. One table is filled with the name of a school bus and the yearly price that the parents have to pay for it. Another table has all the stops that the school buses make.
Now, here comes the problem. I would love to have a system where I can let the user decide the order of those stops.
What I'm currently doing is letting the user fill in a number. In another column, I have a calculated field that concatenates the primary key of the school bus table, a period ("."), and the order number.
So, that would look like this for the first school bus and the first five stops.
1.1, 1.2, 1.3, 1.4, 1.5
And I have a no-duplicates rule on that column, since that way I can make sure that order number 1 isn't blocked for the next bus.
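For what it's worth, what that calculated column approximates is a multi-field unique index over the bus and the order number, which Access can also declare directly, either in the Indexes window of table design or with a DDL query (the table and column names here are placeholders):

CREATE UNIQUE INDEX idxBusStopOrder
ON tblStops (BusID, StopOrder);

With that in place, the pair (BusID, StopOrder) must be unique, so stop 1 stays available for every other bus without any concatenation trick.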
Something that might also work, if that's possible in Access, is that for each school bus you create, you add a column in the stops table to check for duplicates. But how do you then make sure to filter which stops belong to which school bus?
I hope my issue is a bit clear. If it's not, feel free to ask.
I have a similar style problem in another part of the tool with a book fair system.
There you have an order table where the order number, the student number, and the creation and pickup dates are stored. It also has a field to lock the form to prevent any edits, for the sake of the bookkeeping folks. I also have an orderline table where all the actual purchased books are stored. Now, I would love it if a system existed to stop users from adding the same book twice to an order in the orderline table, since I already have an amount column for that. So, in a way, it's almost similar to the problem described above.
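The same composite-index idea would seem to cover this case too; a hypothetical sketch with placeholder names:

CREATE UNIQUE INDEX idxOrderLineBook
ON tblOrderLines (OrderID, BookID);

That way a given book can appear only once per order, and the amount column carries the quantity.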

Related

How to identify duplicate records using client name and address in SQL when both are free text

I have a database with millions of client contacts. However, a lot of them are duplicates, and may I ask someone here to advise how to identify those duplicates using Oracle SQL, PL/SQL, or Excel?
Following is the data structure:
Client_Header
  id integer (Primary Key)
  Client_First_Name (varchar2)
  Client_Last_Name (varchar2)
  Client_Date_Of_Birth (timestamp)
Client_Address
  Client_Id (Foreign Key ref Client_Header)
  Address_Line1 (varchar2)
  Address_Line2 (varchar2)
  Address_Line3 (varchar2)
  Suburb (varchar2)
  State (varchar2)
  Country (varchar2)
My challenge is that, other than Client_Date_Of_Birth and those key fields, all fields are free text only.
For example, we have a client like the following:
Surname : Jones
First name : David
Client_Date_Of_Birth: 10/05/1975
Address: Unit 10 Floor 1, 20 Railway Parade, St Peter, NSW 2044
However, as those fields are free text, I have a lot of data issues, and the following link (a JPEG file) illustrates some of those issues:
Sample of data issues
Note:
Other than those issues, sometimes the first name or the last name of the client may be missing too (but not both).
Sometimes multiple problems can be found within the same record.
Also, sometimes the address may simply be the name of a school, shopping center, etc.
The system does not store any other id that can uniquely identify the client.
I understand it is close to impossible to gather all duplicate records where the client address is a school or shopping center. However, for the other cases, is there any way to identify most of the duplicates?
Thank you for your help!
Not a pretty sight, and I'm afraid I don't have good news for you.
This is a common problem in databases, especially if the data entry personnel are insufficiently trained. One of the main objectives in data entry training is to make the problem well understood and show ways to avoid it. Something to keep in mind in the future.
Unfortunately, there isn't any "magic wand" that will clean your data for you. I'm sorry, but you have before you one of the most tedious tasks in database maintenance. You're going to have to basically remove the duplicates by hand, and the job requires more of an editor than a database administrator.
If you have millions of records, of which perhaps a million are actually duplicates, I would estimate that it will take an expert working full time for at least two years -- and probably longer -- to clean up your problem: to do it in two years would require fixing 2000 records a day, with time off on weekends and two weeks of vacation.
In the end, the only sure way to remove all the duplicates is to compare all of them and remove them one at a time. But there are plenty of tricks you can use to get rid of blocks of them at once. Here are a few that I can think of with your data sample:
Change "Dave" to "David" in both first and last name fields. (Make sure that nobody actually has the last name "Dave.")
Change all instances of "Jones David" to "David Jones." (Make sure that there are no people named "Jones David".)
Change "1/F" to "Floor 1."
The idea is to focus on some of the fields, and in those fields get all of the duplicates to be exact duplicates. Once you have that done, you delete all the records with the target values in the fields, except the one with the primary key of the record that you want to keep (if your table isn't keyed, you'll have to find another way to do it, such as selecting the top record into a new table).
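As a rough Oracle sketch of that normalize-then-delete step, using the column names from the question (this ignores the Client_Address child table, which would need to be repointed to the surviving record first):

-- Normalize a known alias so duplicates become exact matches
UPDATE Client_Header
SET Client_First_Name = 'David'
WHERE UPPER(Client_First_Name) = 'DAVE';

-- Delete every record in a group of exact duplicates except one
-- survivor per group (here: the row with the lowest ROWID)
DELETE FROM Client_Header
WHERE ROWID NOT IN (
    SELECT MIN(ROWID)
    FROM Client_Header
    GROUP BY UPPER(Client_First_Name),
             UPPER(Client_Last_Name),
             Client_Date_Of_Birth
);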
This technique speeds things up for records with a large number of duplicates. Where you have only a few duplicates, it's quicker to just identify them one by one. One way to do this quickly is to go into edit mode on a table, work with a particular field (for example, the postal code field in this case), and put a unique value in that field when you want to mark it for deletion (in this case, perhaps a single zero). Then you can periodically delete all the records with that value in the field.
You'll also need to sort the data in multiple ways to find the duplicates, which it appears you already know.
As for your notes, don't try to identify all the ways that the data is messed up. Once you identify one record as a duplicate of another, you don't care what's wrong with it, you just have to get rid of it. If you have two records and each contains data that you want to keep that the other one is missing, then you'll have to consolidate them and delete one of them. And then go on to the next, and the next, and the next...
Some years ago I had a similar task, and it took me about a year to clean the data.
What I did in short:
Send the addresses to api.addressdoctor.com for validation and splitting into individual fields (it is also possible with maps.googleapis.com).
Use a first-name and last-name match list to check the names (we used namepedia.org). A lot depends on the quality of this list; it should be based on the country of birth or of the first address. From the results we derived a probability for what kind of name it is (first/last/company).
With this improved data, you should create some normalized and fuzzy attributes: normalized fields from the names and addresses, e.g. uppercased and reduced to alphanumeric characters only.
At the end I would change the data model a little to improve the data quality by design. I recommend adding pre-title, post-title, middle-name, and post-name fields. You should also add the split address fields, like street, streetno, zip, location, longitude, latitude, etc.
I would also change the relation between Client_Header and Client_Address with an extra address_Id as primary key, but this depends on the requirements. And at the end I would add some constraints to prevent duplicate entries.
After all that, the deduplication is not hard. Just group all the normalized or fuzzy data together and create a dense_rank (I group by person, household, ...). Make a ranking over the attributes (I used data quality, data fill rate, and transaction history for a score value). Finally, it is your choice whether you just want to delete the duplicates and copy the corresponding data to the surviving client, or virtually connect the data via Client_Id in an extra field.
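A minimal Oracle sketch of that group-and-rank step, assuming the normalized attributes and a precomputed score already exist as columns (all names here are hypothetical):

SELECT client_id,
       -- same dup_group number = same (normalized) person
       DENSE_RANK() OVER (ORDER BY norm_last_name, norm_first_name, norm_address) AS dup_group,
       -- rank the candidates within each group by their score
       ROW_NUMBER() OVER (PARTITION BY norm_last_name, norm_first_name, norm_address
                          ORDER BY quality_score DESC) AS survivor_rank
FROM   client_normalized;

Rows with survivor_rank = 1 are the ones to keep; the rest are the duplicates within their dup_group.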
For insert and update processes, you should create PL/SQL functions that check whether the fuzzy last name (and first name) plus fuzzy address already exist. Split the name and address fields, check them with the address APIs, and match them against the names reference. If it is a single-tuple data entry, show the best results to the user and let them decide.

Table structure for Scheduling App in SQL DB

I'm working on a database to hold information for an on-call schedule. Currently I have a structure that looks about like this:
Table - Person: (key)ID, LName, FName, Phone, Email
Table - PersonTeam: (from Person)ID, (from Team)ID
Table - Team: (key)ID, TeamName
Table - Calendar: (key dateTime)dt, year, month, day, etc...
Table - Schedule: (from Calendar)dt, (id of Person)OnCall_NY, (id of Person)OnCall_MA, (id of Person)OnCall_CA
My question is: With the Schedule table, should I leave it structured as is, where the dt is a unique key, or should I rearrange it so that dt is non-unique and the table looks like this:
Table - Schedule: (from Calendar)dt, (from Team)ID, (from Person)ID
and have multiple entries for each day, OR would it make sense to just use:
Table - Schedule: (from Calendar)dt, (from PersonTeam)KeyID - [make a key ID on each of the person/team pairings]
A team will always have someone on call, but a person can be on call for more than one team at a time (if they are on multiple teams).
If a completely different setup would work better let me know too!
Thanks for any help! I apologize if my question is unclear. I'm learning fast but nevertheless still fairly new to using SQL daily, so I want to make sure I'm using best practices when I learn so I don't develop bad habits.
The current version, one column per team, is probably not a good idea. Since you're representing teams as a table (and not as an enum or equivalent), it means you expect to add/remove teams over time. That would force you to add/remove columns to the table, which is always a much larger task than adding/removing a few rows.
The 2nd option is the usual solution to a problem like this. A safe choice. You can always define an additional foreign key constraint from Schedule(teamID, personID) to PersonTeam to ensure you don't mistakenly assign schedule duty to a person not belonging to the team.
The 3rd option is pretty much equivalent to the 2nd, only you're swapping a composite natural key for PersonTeam for a surrogate simple key. Since the two components of said composite key are already surrogate, there is no advantage (in terms of immutability, etc.) to adding this additional one. Plus it would turn a very simple N-M relationship (PersonTeam) which most DB managers / ORMs will handle nicely into a more complex object which will need management on its own.
By Occam's razor, I'd do away with the additional surrogate key and use your 2nd option.
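For illustration, the 2nd option with that extra safety constraint might look like this in T-SQL; it assumes PersonTeam has a primary key (or unique constraint) on (teamID, personID):

CREATE TABLE Schedule
(
    dt       datetime NOT NULL,
    teamID   int      NOT NULL,
    personID int      NOT NULL,
    PRIMARY KEY (dt, teamID),          -- at most one person on call per team per day
    FOREIGN KEY (dt) REFERENCES Calendar (dt),
    FOREIGN KEY (teamID, personID)
        REFERENCES PersonTeam (teamID, personID)  -- only members of that team
)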
In my view, the answer may depend on whether the number of teams is fixed and fairly small. Of course, whether the names of the teams are fixed or not, may also matter, but that would probably have more to do with column naming.
More specifically, my view is this:
If the business requirement is to always have a small and fixed number of people (say, three) on call, then it may well be more convenient to allocate three columns in Schedule, one for every team to hold the ID of the appointed person, i.e. like your current structure:
dt OnCall_NY OnCall_MA OnCall_CA
--- --------- --------- ---------
with dt as the primary key.
If the number of teams (in the Team table) is fixed too, you could include teams' names/designators in the column names like you are doing now, but if the number of teams is more than three and it's just the number of teams in Schedule that is limited to three, then you could just use names like OnCallID1, OnCallID2, OnCallID3.
But even if that requirement is fixed, it may only turn out fixed today, and tomorrow your boss says, "We no longer work with a fixed number of teams (on call)", or "We need to extend the number of teams supported to four, and we may need to extend it further in the future". So, a more universal approach would be the one you are considering switching to in your question, that is
dt Team Person
--- ---- ------
where the primary key would now be dt, Team.
That way you could easily extend/reduce the number of people on call on the database level without having to change anything in the schema.
UPDATE
I forgot to address your third option in my original answer (sorry). Here goes.
Your first option (the one actually implemented at the moment) seems to imply that every team can be represented by (no more than) one person only. If you assign surrogate IDs to the Person/Team pairs and use those keys in Schedule instead of separate IDs for Person and Team, you will probably be unable to enforce the mentioned "one person per team in Schedule" requirement at the database level (or, at least, that might prove somewhat troublesome), while, using separate keys, it is enough to make Team part of a composite key (dt, Team) and you are done: no more than one person per team per day.
Also, you may have difficulties letting a person change teams over time if their membership was recorded in this way, i.e. with a Schedule reference to the Person/Team pair. You would probably have to change the Team reference in the PersonTeam table, which would result in misrepresentation of historical info: when looking at the people on call on a certain day in the past, the person's Team shown would be the one they belong to now, not the one they belonged to then.
Using separate IDs for people and teams in Schedule, on the other hand, would allow you to let people change teams freely, provided you do not make (Schedule.Team, Schedule.Person) a reference to (PersonTeam.Team, PersonTeam.Person), of course.

Re-assigning IDs in a non-IDENTITY type field in SQL Server database

WARNING: This tale of woe contains examples of code smells, poor design decisions, and technical debt.
If you are conversant with SOLID principles, practice TDD and unit test your work, DO NOT READ ON. Unless you want a good giggle at someone's misfortune and gloat in your own awesomeness knowing that you would never leave behind such a monumental pile of crap for your successors.
So, if you're sitting comfortably then I'll begin.
In this app that I have inherited, and have been supporting and bug fixing for the last 7 months, I have been left with a DOOZY of a balls-up by a developer who left 6 and a half months ago. Yes, 2 weeks after I started.
Anyway. In this app we have clients, employees and visits tables.
There is also a table called AppNewRef (or something similar) that ... wait for it ... contains the next record ID to use for each of the other tables. So, it may contain data such as:
TypeID  Description  NextRef
------  -----------  -------
1       Employees    804
2       Clients      1708
3       Visits       56783
When the application creates new rows for Employees, it looks in the AppNewRef table, gets the value, uses that value for the ID, and then updates the NextRef column. Same thing for Clients, and Visits and all the other tables whose NextID to use is stored in here.
Yes, I know, no auto-numbering IDENTITY columns on this database. All under the excuse of "when it was an Access app". These IDs are held in the (VB6) code as longs, so up to 2,147,483,647 records are possible. OK, that seems to work fairly well (apart from the fact that the app, not the database, is taking care of the locking, updating, etc.).
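For reference, that read-then-increment allocation can at least be made atomic on the server side; a hedged T-SQL sketch against the table above, not the app's actual code:

UPDATE AppNewRef
SET    NextRef = NextRef + 1
OUTPUT deleted.NextRef AS AllocatedID   -- the pre-increment value, i.e. the ID to hand out
WHERE  TypeID = 3;                      -- 3 = Visits

Because the increment and the read happen in one statement, two sessions can never be handed the same ID.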
So, our users are quite happily creating Employees, Clients, Visits etc. The Visits ID is steady increasing a few dozen at a time. Then the problems happen. Our clients are causing database corruptions while creating batches of visits because the server is beavering away nicely, and the app becomes unresponsive. So they kill the app using task manager instead of being patient and waiting. Granted the app does seem to lock up.
Roll on to earlier this year and developer Tim (real name. No protecting the guilty here) starts to modify the code to do the batch updates in stages, so that the UI remains 'responsive'. Then April comes along, and he's working his notice (you can picture the scene now, can't you ?) and he's beavering away to finish the updates.
End of April, and beginning of May we update some of our clients. Over the next few months we update more and more of them.
Unseen by Tim (real name, remember) and me (who started two weeks before Tim left) and the other new developer that started a week after, the ID's in the visit table start to take huge leaps upwards. By huge, I mean 10000, 20000, 30000 at a time. Sometimes a few hundred thousand.
Here's a graph that illustrates the rapid increase in IDs used.
Roll on November. Customer phones Tech Support and reports that he's getting an error. I look at the error message and ask for the database so I can debug the code. I find that the value is too large for a long. I do some queries, pull the information, drop it into Excel and graph it.
I don't think making the code handle anything longer than a long for the ID's is the right approach, as this app passes that ID into other DLL's and OCX's and breaking the interface on those just seems like a whole world of hurt that I don't want to encounter right now.
One potential idea that I'm investigating is to try to modify the IDs so that I can get them back down to a lower range, essentially filling in the gaps, using the ROW_NUMBER function.
What I'm thinking of doing is adding a new column to each of the tables that have a Foreign Key reference to these Visit ID's (not a proper foreign key mind, those constraints don't exist in this database). This new column could store the old (current) value of the Visit ID (oh, just to confuse things; on some tables it's called EventID, and on some it's called VisitID).
Then, for each of the other tables that refer to that VisitID, update to the new value.
Ideas ? Suggestions ? Snippets of T-SQL to help all gratefully received.
Option one:
Explicitly constrain all of your foreign key relationships, and set them to be ON UPDATE CASCADE.
This will mean that whenever you change the ID, the foreign keys will automatically be updated.
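Declaring one of those constraints would look something like this (VisitDetails is a hypothetical child table name):

ALTER TABLE VisitDetails
ADD CONSTRAINT FK_VisitDetails_Visits
    FOREIGN KEY (VisitID) REFERENCES Visits (ID)
    ON UPDATE CASCADE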
Then you just run something like this...
WITH resequenced AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY id) AS newID,
        *
    FROM yourTable
)
UPDATE resequenced
SET id = newID
I haven't done this in ages, so I forget if it causes problems mid-update by having two records with the same id value. If it does, you could do something like this first...
UPDATE yourTable SET id = -id
Option two:
Ensure that none of your foreign key relationships are explicitly defined. If they are, note them down and remove them.
Then do something like...
CREATE TABLE temp
(
    newID INT IDENTITY (1,1),
    oldID INT
)
INSERT INTO temp (oldID) SELECT id FROM yourTable
/* Do this once for the table you are re-identifying */
/* Repeat this for all fact tables holding that ID as a foreign key */
UPDATE factTable
SET    foreignID = temp.newID
FROM   temp
WHERE  factTable.foreignID = temp.oldID
Then re-apply any existing foreign key relationships.
This is a pretty scary option. If you forget to update a table, you just borked your data. But, you can give that temp table a much nicer name and KEEP it.
Good luck. And may the lord have mercy on your soul. And Tim's if you ever meet him in a dark alley.
I would create a numbers table that just has a sequence from 1 up to whatever the max for a long is, in increments of 1, and then change the logic for getting the max ID for VisitID (and maybe the others) to do a right join between the numbers table and the visits table. Then you can just look for the min of the unused numbers:
select min(numbers.number) from visits right join numbers on visits.id = numbers.number where visits.id is null
That way you get all the gaps filled in without having to change any of the other tables.
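Populating such a numbers table is a one-off job; one hedged T-SQL way is a recursive CTE (a million rows here, raise the limit as needed):

-- Build and key a sequential numbers table
WITH n AS
(
    SELECT 1 AS number
    UNION ALL
    SELECT number + 1 FROM n WHERE number < 1000000
)
SELECT ISNULL(number, 0) AS number   -- ISNULL forces the column NOT NULL for the PK
INTO   numbers
FROM   n
OPTION (MAXRECURSION 0);

ALTER TABLE numbers ADD PRIMARY KEY (number);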
but I would just redo the whole database.

Creating Table Relationships

I am working on a VB.net (VS 2010, Win XP Pro 2 SP3) Employee Management project. I need to keep track of Employee Leave Attendance and also each piece of Equipment assigned to an Employee. How can I achieve this using SQLite?
It would be very useful if you could provide me with examples, as I am completely new to the field of SQL and VB.net.
I think this can be done with two tables, where one has the primary key and the other has a foreign key, but I am not sure. Also, how many tables will I need for storing the data from the Leave and Equipment forms?
I went through other questions but I was unable to figure out a solution for my problem.
(Sorry, I cannot provide with images as this site prevents me from posting images without 10 reps)
Most problems are only as complex, and as simple, as you make them. Out of habit, nearly all tables end up with a unique ID field. There are exceptions, which I will call "link" tables, e.g. ones that provide connection details between two data tables.
Now, in your scenario:
You would need a "holiday" table, where each row contains the employee's unique ID and either a start/finish date (e.g. if they take half a day, it needs to be visible), or just a year and a value, e.g. in 2011 I booked 2 lots of 35 hours and 1 lot of 4 hours, i.e. I've taken two weeks and half a day.
For the equipment, you would need a data table, since an item can only go to one employee. It depends whether you're going to use this for booking or not, but if it's just like a library (e.g. I currently have a loaner laptop), then you can just have an employee field in the equipment table, as sketched below. If you need a booking system, then you would require link tables and something more complex.
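A minimal SQLite sketch of those tables, with placeholder names rather than a finished design:

CREATE TABLE Employee (
    EmployeeID INTEGER PRIMARY KEY,
    Name       TEXT NOT NULL
);

CREATE TABLE Holiday (
    HolidayID  INTEGER PRIMARY KEY,
    EmployeeID INTEGER NOT NULL REFERENCES Employee (EmployeeID),
    StartDate  TEXT NOT NULL,   -- SQLite stores dates as text/numbers
    EndDate    TEXT NOT NULL
);

CREATE TABLE Equipment (
    EquipmentID INTEGER PRIMARY KEY,
    Description TEXT NOT NULL,
    EmployeeID  INTEGER REFERENCES Employee (EmployeeID)  -- NULL = currently unassigned
);

Note that SQLite only enforces those REFERENCES clauses when PRAGMA foreign_keys = ON is set on the connection.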
The best way to work out your tables is to try to group your data, then write the items on pieces of paper and see how you as a human would handle it. After a while you'll be able to do it in your head.

Is this a good way to model address information in a relational database?

I'm wondering if this is a good design. I have a number of tables that require address information (e.g. street, post code/zip, country, fax, email). Sometimes the same address will be repeated multiple times. For example, an address may be stored against a supplier, and then on each purchase order sent to them. The supplier may then change their address and any subsequent purchase orders should have the new address. It's more complicated than this, but that's an example requirement.
Option 1
Put all the address columns as attributes on the various tables. Copy the details down from the supplier to the PO as it is created. Potentially this stores multiple copies of the same address.
Option 2
Create a separate address table. Have a foreign key from the supplier and purchase order tables to the address table. Only allow inserts and deletes on the address table, as updates could change more than you intend. Then I would have some scheduled task that deletes any rows from the address table that are no longer referenced by anything, so unused rows are not left about. Perhaps also have a unique constraint on all the non-PK columns in the address table to stop duplicates as well.
I'm leaning towards option 2. Is there a better way?
EDIT: I must keep the address on the purchase order as it was when sent. Also, it's a bit more complicated than I suggested, as there may be a delivery address and a billing address (there are also a bunch of other tables that have address information).
After a while, I will delete old purchase orders en-masse based on their date. It is after this that I was intending on garbage collecting any address records that are not referenced anymore by anything (otherwise it feels like I'm creating a leak).
I actually use this as one of my interview questions. The following is a good place to start:
Addresses
---------
AddressId (PK)
Street1
... (etc)
and
AddressTypes
------------
AddressTypeId
AddressTypeName
and
UserAddresses (substitute "Company", "Account", whatever for Users)
-------------
UserId
AddressTypeId
AddressId
This way, your addresses are totally unaware of how they are being used, and your entities (Users, Accounts) don't directly know anything about addresses either. It's all up to the linking tables you create (UserAddresses in this case, but you can do whatever fits your model).
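For illustration, the linking table might be declared like this in SQL (the types and the exact primary key are assumptions):

CREATE TABLE UserAddresses
(
    UserId        int NOT NULL REFERENCES Users (UserId),
    AddressTypeId int NOT NULL REFERENCES AddressTypes (AddressTypeId),
    AddressId     int NOT NULL REFERENCES Addresses (AddressId),
    PRIMARY KEY (UserId, AddressTypeId)  -- one address per type per user;
                                         -- add AddressId here if several are allowed
)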
One piece of somewhat contradictory advice for a potentially large database: go ahead and put a "primary" address directly on your entities (in the Users table in this case) along with a "HasMoreAddresses" field. It seems icky compared to just using the clean design above, but can simplify coding for typical use cases, and the denormalization can make a big difference for performance.
Option 2, without a doubt.
Some important things to keep in mind: it's an important aspect of design to indicate to the users when addresses are linked to one another. I.e. corporate address being the same as shipping address; if they want to change the shipping address, do they want to change the corporate address too, or do they want to specify a new loading dock? This sort of stuff, and the ability to present users with this information and to change things with this sort of granularity is VERY important. This is important, too, about the updates; give the user the granularity to "split" entries. Not that this sort of UI is easy to design; in point of fact, it's a bitch. But it's really important to do; anything less will almost certainly cause your users to get very frustrated and annoyed.
Also, I'd strongly recommend keeping around the old address data; don't run a process to clean it up. Unless you have a VERY busy database, your database software will be able to handle the excess data. Really. One common mistake I see with databases is attempting to overoptimize; you DO want to optimize the hell out of your queries, but you DON'T want to optimize your unused data out. (Again, if your database activity is VERY HIGH, you may need something that does this, but it's almost a certainty that your database will work well while still having excess data in the tables.) In most situations, it's actually more advantageous to simply let your database grow than it is to attempt to optimize it. (Deletion of sporadic data from your tables won't cause a significant reduction in the size of your database, and when it does... well, the reindexing that it causes can be a gigantic drain on the database.)
Do you want to keep a historical record of what address was originally on the purchase order?
If yes go with option 1, otherwise store it in the supplier table and link each purchase order to the supplier.
BTW: A sure sign of a poor DB design is the need for an automated job to keep the data "cleaned up" or in synch. Option 2 is likely a bad idea by that measure
I think I agree with JohnFx..
Another thing about (snail-)mail addresses: since you want to include country, I assume you want to ship/mail internationally, so please keep the address field mostly freeform text. It's really annoying having to make up a 5-digit ZIP code when Norway doesn't have ZIP codes; we have 4-digit post numbers.
The best fields would be:
Name/Company
Address (multiline textarea)
Country
This should be pretty global. If the US postal system requires ZIP codes in a specific format, then include that too, but make it optional unless USA is selected as the country. Everyone knows how to format the address in their own country, so as long as you keep the line breaks it should be okay...
Why would any of the rows on the address table become unused? Surely they would still be pointed at by the purchase order that used them?
It seems to me that stopping the duplicates should be the priority, thus negating the need for any cleanup.
In the case of orders, you would never want to update the address when the person's (or company's) address changes if the order has already been sent. You need the record of where the order was actually sent in case there is an issue with the order.
The address table is a good idea. Make a unique constraint on it so that the same entity cannot have duplicate addresses. You may still get duplicates, as users may add another address instead of looking one up, and if they spell things slightly differently ("St." instead of "Street") the unique constraint won't prevent that. Copy the data to the order at the time the order is created. This is one case where you want the multiple records, because you need a historical record of what you sent where. Only allowing inserts and deletes on the table makes no sense to me, as they aren't any safer than updates and involve more work for the database. An update is done in one call to the database. If an address changes under your idea, then you must first delete the old address and then insert the new one: not only more calls to the database but twice the chance of making a coding error.
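The unique constraint mentioned above might be declared like this (the column list is an assumption, and as noted it will not catch spelling variants such as "St." versus "Street"):

ALTER TABLE Addresses
ADD CONSTRAINT UQ_Addresses
    UNIQUE (Street1, City, State, PostCode, Country)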
I've seen every system using option 1 get into data quality trouble. After 5 years 30% of all addresses will no longer be current.