Audit columns - join or copy?

I have seen many business systems that do not join to the Users table but instead copy the user's name (or first name + last name) into a CreatedBy audit field.
I already see one issue with this approach: a user might marry and change her last name, and then the CreatedBy field will keep the old value.
Is there any good reason to back away from normalization and store redundant text data in the CreatedBy field?

Storing someone's name without using a foreign key reference lets you delete users without affecting information about which rows they created. Whether this is wise is application-dependent.
In most organizations, if it's important to know what someone's name was 3 years ago, there's some way to do it. It might not be in the database. It might be, "Go ask Jerry in HR. He remembers everyone." Whether this is wise is also application-dependent.
In any case, this doesn't necessarily violate any guidelines of normalization. In the one case, you're storing the current name or the preferred name of a user. In the other case, you're storing the name of a user at the time the row was created, and that still constitutes a true fact.
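To make the trade-off concrete, here is a minimal sketch of both designs in generic SQL; the table and column names are illustrative, not taken from any particular system, and a users table keyed on user_id is assumed:

-- Normalized: CreatedBy references Users, so renames propagate via the join,
-- but a user row cannot be deleted while audit rows still reference it.
CREATE TABLE orders_normalized (
    order_id   INTEGER PRIMARY KEY,
    created_by INTEGER NOT NULL REFERENCES users (user_id)
);

-- Denormalized: CreatedBy stores the name as it was at insert time;
-- it survives user deletion and records a true historical fact.
CREATE TABLE orders_denormalized (
    order_id   INTEGER PRIMARY KEY,
    created_by VARCHAR(100) NOT NULL  -- e.g. 'Jane Smith', frozen at creation
);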

Related

How to identify duplicate records using client name and address in SQL when both are free text

I have a database with millions of client contacts. However, a lot of them are duplicates, and I would be grateful if someone here could advise how to identify them using Oracle SQL, PL/SQL, or Excel.
Following is the data structure:
Client_Header
id integer (Primary Key)
Client_First_Name (varchar2)
Client_Last_Name (varchar2)
Client_Date_Of_Birth (timestamp)
Client_Address
Client_Id (Foreign Key ref Client_header)
Address_Line1 (varchar2)
Address_Line2 (varchar2)
Address_Line3 (varchar2)
Suburb (varchar2)
State (varchar2)
Country (varchar2)
My challenge is that, other than Client_Date_Of_Birth and the key fields, all fields are free text only.
For example, we have a client like the following:
Surname : Jones
First name : David
Client_Date_Of_Birth: 10/05/1975
Address: Unit 10 Floor 1, 20 Railway Parade, St Peter, NSW 2044
However, as those fields are free text, I have a lot of data issues; the following link (JPEG file only) illustrates some of them:
Sample of data issues
Note:
Other than those issues, sometimes the first name or the last name of the client (but not both) may be missing too.
Sometimes multiple problems can be found within the same record.
Also, sometimes the address may simply be the name of a school, shopping center, etc.
The system does not store any other id that can uniquely identify the client.
I understand it is close to impossible to gather all duplicate records where the client address is a school or shopping center. However, for the other cases, is there any way to identify most of the duplicates?
Thank you for your help!
Not a pretty sight, and I'm afraid I don't have good news for you.
This is a common problem in databases, especially if the data entry personnel are insufficiently trained. One of the main objectives in data entry training is to make the problem well understood and show ways to avoid it. Something to keep in mind in the future.
Unfortunately, there isn't any "magic wand" that will clean your data for you. I'm sorry, but you have before you one of the most tedious tasks in database maintenance. You're going to have to basically remove the duplicates by hand, and the job requires more of an editor than a database administrator.
If you have millions of records, of which perhaps a million are actually duplicates, I would estimate that it will take an expert working full time for at least two years -- and probably longer -- to clean up your problem: to do it in two years would require fixing 2000 records a day, with time off on weekends and two weeks of vacation.
In the end, the only sure way to remove all the duplicates is to compare all of them and remove them one at a time. But there are plenty of tricks you can use to get rid of blocks of them at once. Here are a few that I can think of with your data sample:
Change "Dave" to "David" in both first and last name fields. (Make sure that nobody actually has the last name "Dave.")
Change all instances of "Jones David" to "David Jones." (Make sure that there are no people named "Jones David".)
Change "1/F" to "Floor 1."
The idea is to focus on some of the fields, and in those fields get all of the duplicates to be exact duplicates. Once you have that done, you delete all the records with the target values in those fields, except the one with the primary key of the record you want to keep (if your table isn't keyed, you'll have to find another way, such as selecting the top record into a new table); see the sketch at the end of this answer.
This technique speeds things up for records with a large number of duplicates. Where you have only a few duplicates, it's quicker to just identify them one by one. One way to do this quickly is to go into edit mode on a table, work with a particular field (for example, the postal code field in this case), and put a unique value in that field when you want to mark it for deletion (in this case, perhaps a single zero). Then you can periodically delete all the records with that value in the field.
You'll also need to sort the data in multiple ways to find the duplicates, which it appears you already know.
As for your notes, don't try to identify all the ways that the data is messed up. Once you identify one record as a duplicate of another, you don't care what's wrong with it, you just have to get rid of it. If you have two records and each contains data that you want to keep that the other one is missing, then you'll have to consolidate them and delete one of them. And then go on to the next, and the next, and the next...
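To illustrate the block-at-a-time idea in Oracle-style SQL, using the column names from the question (a sketch, not a turnkey script):

-- Find candidate duplicate groups: same normalized name and date of birth.
SELECT UPPER(TRIM(client_first_name)),
       UPPER(TRIM(client_last_name)),
       client_date_of_birth,
       COUNT(*) AS copies
FROM   client_header
GROUP  BY UPPER(TRIM(client_first_name)),
          UPPER(TRIM(client_last_name)),
          client_date_of_birth
HAVING COUNT(*) > 1;

-- After reviewing a batch by hand, keep only the lowest id in each group.
-- Caution: GROUP BY treats NULLs as equal, so records with a missing name
-- cluster together; also move or delete child Client_Address rows first,
-- or the foreign key will block the delete.
DELETE FROM client_header
WHERE  id NOT IN (
         SELECT MIN(id)
         FROM   client_header
         GROUP  BY UPPER(TRIM(client_first_name)),
                   UPPER(TRIM(client_last_name)),
                   client_date_of_birth
       );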
Some years ago I had a similar task, and it took me about one year to clean the data.
What I did, in short:
Send the addresses to api.addressdoctor.com for validation and splitting into separate fields (this is also possible with maps.googleapis.com).
Use a first-name and last-name match list to check the names (we used namepedia.org). A lot depends on the quality of this list, which should be based on the country of birth or of the first address. From the results we derived a probability for what kind of name it is (first/last/company).
With this improved data you should create some normalized and fuzzy attributes: normalized fields from the names and address, e.g. uppercased and reduced to alphanumeric characters.
At the end I would change the data model a little to improve data quality by design. I recommend adding pre-title, post-title, middle-name, and post-name fields. You should also add the split address fields such as street, street number, zip, location, longitude, latitude, etc.
I would also change the relation between Client_Header and Client_Address to use an extra Address_Id as primary key, but this depends on the requirements. And at the end I would add some constraints to prevent duplicate entries.
After all that, the deduplication is not hard. Group all the normalized or fuzzy data together and create a dense_rank (I group by person, household, ...); a sketch follows below. Make a ranking over the attributes (I used data quality, data fill rate, and transaction history for a score value). Finally, it is your choice whether you just delete the duplicates and copy the corresponding data to the surviving client, or virtually connect the data via Client_Id in an extra field.
For insert and update processes you should create PL/SQL functions that check whether the fuzzy last name (or first name) plus the fuzzy address already exists. Split the name and address fields, check them with the address APIs, and match them against the name reference. If it is a single-tuple data entry, show the best results to the user and let them decide.
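As a rough sketch of the "normalize, then group" step in Oracle SQL (the normalization here is a crude placeholder for the API-cleaned fields described above):

-- Group candidate duplicates on crude match keys (uppercase, alphanumeric
-- only); rows sharing a person_group value are candidates to consolidate.
SELECT id,
       client_first_name,
       client_last_name,
       DENSE_RANK() OVER (
           ORDER BY REGEXP_REPLACE(UPPER(client_first_name), '[^A-Z0-9]', ''),
                    REGEXP_REPLACE(UPPER(client_last_name),  '[^A-Z0-9]', ''),
                    client_date_of_birth
       ) AS person_group
FROM   client_header;
-- In practice you would then rank the rows within each group by a quality
-- score (fill rate, transaction history) and keep the best one.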

How to identify a user with an unchangeable number in IBM Notes

We have a few IBM Notes databases here, at least one hundred I think, and at the moment, if we have to identify a user, we use their name. We also connect this with a database of all the employees here, using it for time management and administrative tasks.
Therefore we need to determine which user is which; as I said, we do that by name at the moment. But names change, so we would like to switch to an ID that does not change. I thought we could use the key identifier, or one of them at least. So my question is: is there a way to get it through LotusScript? If not, is there another way to identify the user of a certain key file?
Lotus Notes and Domino do not have any built-in unique key identifier for users. It was never part of the design. You can't use the noteid of the Person document because that varies from one replica of the Domino Directory to another, and you should not use the unid because, although it is stable across replicas, it can still change if you have to recreate the Person document, which you might have to do if the employee leaves your company and then comes back, or if the Person document is damaged.
The way most large organizations deal with this is to set the EmployeeID field in the Person document and use that as the unique identifier. Some organizations might also create unique identifiers and use them for the ShortName.
Whilst I don't disagree with Richard's answer in general, it is possible to 'capture' a document ID (UNID) in a separate 'computed when composed' field. This will then hold a static, 'unique' 32-character reference. Once set, it should not change unless you have a process to change it. A UNID is derived from a timestamp, so UNIDs are extremely unlikely to be reused or to clash with future system-generated UNIDs, even after many years or many millions of documents. This captured value may or may not agree with the actual system-assigned UNID, but that generally doesn't matter.
If you are unhappy with a copy of a system UNID, it could be used on a temporary basis until it can be overridden with an external ID reference. Alternatively, use a proven external 'guid' reference and assign that to the records you want to track.
If the raw UNID creates side issues for you, use @Password to hash it to another unique value. This generates a quick, old MD5-style hash, but one good enough for reference IDs.
One other point worth mentioning: keeping previous names in a list is also feasible, so that matching an 'old name' remains viable. Either create an additional field in the Person document (admins may hate you for doing this) or add older names to the FullName field, which usually contains a list of name variations. If you append the older name to the bottom of the list, then you and the system can match against that name (for logins and routing), but Notes only ever uses the first name in the list as the user's 'official' name (for reader/author fields etc.), which should be the current name. To find the name, look it up in the ($Users) view if using the FullName field, or create a new view using your own field (again, admins may hate you even more).

Is it sensible to have a table that does not reference any other in a database design?

I'd like to get some advice on database design. Specifically, consider the following (hypothetical) scenario:
Employees - table holding all employee details
Users - table holding employees that have username and password to access software
UserLog - table to track when users log in and out, used to calculate time on the software
In this scenario, if an employee leaves the company I also want to make sure I delete them from the Users table so that they can no longer access the software. I can achieve this using ON DELETE CASCADE as part of the FK relationship between EmployeeID in Employees and Users.
However, I don't want to delete their details from the UserLog as I am interested in collating data on how long people spend on the software and the fact that they no longer work at the company does not mean their user behaviour is no longer relevant.
What I am left with is a table UserLog that has no relationships with any other tables in my database. Is this a sensible idea?
Having looked through books and searched online, I haven't come across any DB schemas with tables that have no relationships to others, so my gut instinct is that my approach is not robust...
I'd appreciate some guidance please.
My personal preference in this case would be to "soft delete" an employee by adding a "DeletedDate" column to the Employees table. This will allow you to maintain referential integrity with your UserLog table and all details for all employees, past and present, remain available in the database.
The downside to this approach is that you need to add application logic to check for active employees.
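A minimal sketch of that approach (the view is my own illustration; ALTER syntax varies slightly by RDBMS):

-- Soft delete: keep the employee row, just record when they left.
ALTER TABLE Employees ADD DeletedDate DATE NULL;

-- Most application queries can then target a view of active employees.
CREATE VIEW ActiveEmployees AS
SELECT *
FROM   Employees
WHERE  DeletedDate IS NULL;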
Yes, this is perfectly sensible. The log is just a raw audit of data that should never change. It doesn't need to be normalized (and shouldn't be) and/or linked to other tables.
Ideally, I would put write-heavy audit logging in a different database entirely than the read-heavy transactional day-to-day stuff. They may grow differently over time. But starting small it's fine to keep them in the same database as long as you understand the fundamental differences between them.
On a side note, I would recommend not deleting the users from the tables. Maybe have some kind of IsActive or IsDeleted bit on them that effectively hides them from the application; deleting should be avoided if possible.
The problem you have here is that it's perfectly possible to insert UserLog data for users that have never existed as there's no link to the table that defines valid users.
I would say that perhaps the better course of action would be to mark the users as invalid and remove all their personal details when they leave rather than delete the record entirely.
That's not to say there aren't situations where it is valid to have a table (or tables) on the database that don't reference others.
Is this a sensible idea?
The problem is this: since the data isn't linked, you can delete something from the Employees table and still have references to it in UserLog. After the employee information is deleted, you have no way of knowing what the log data ties back to. Is this OK? Technically, yes. There is nothing preventing you from doing it, but then why are you keeping the data in the first place? You also have no guarantee that the data in the table is actually about an employee. Someone could accidentally enter a wrong EmployeeID that doesn't belong to anyone. Keys help prevent data corruption. It's always better to have extra data than it is to have bad data.
What I've found is that you never want to delete data when you can avoid it. Space is cheap, and you can add flags etc. to show that a record isn't active. Yes, this causes more work (quickly remedied by creating a view that shows only active employees), and saying you should never delete data is far-fetched, but once you start linking data together, deleting becomes very difficult. If you are leaving out a FK just so you can delete records, it's a telltale sign you need to rethink your strategy.
Relying on cascade delete can be very dangerous too. The model you are describing means that any time you don't want data deleted, you have to remember not to add a FK linking that table back to Users. It doesn't take long for someone to forget this.
What you can do is use logical deletion or disabling a user by adding a bool value Deleted or Disabled to the Users table.
Or replace the EmployeeId with the name of the employee in the UserLog.
An alternative to the soft-delete approach is to store all the historical details you would want about the user at the time the log record is created, rather than storing the employee id. So you might have username, logintime, logouttime, and sessionlength in your table.
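A sketch of that log table, using the column names suggested above (the types are my assumption):

-- No foreign key: username is a snapshot taken when the session starts,
-- so the row stays meaningful after the employee record is gone.
CREATE TABLE UserLog (
    username      VARCHAR(100) NOT NULL,
    logintime     TIMESTAMP    NOT NULL,
    logouttime    TIMESTAMP,
    sessionlength INTEGER            -- e.g. seconds, computed at logout
);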
Sensible? Sure, as in it makes sense as you've described your need to keep those users indefinitely. The problem you'll run into is maintaining the tables. Instead of doing a cascading update once, you'll have to use at least two updates in order to insert a new user.
I think a table like you are suggesting is perfectly fine. I frequently encounter log tables that do not have explicit relationships with other tables. Just because a database is "relational" doesn't mean everything has to relate, haha.
One thing that I do notice though is that you are using EmployeeID in the log, but not using it as a foreign key to your Employee table. I understand why you don't want that, since you will be dropping employees. But, if you are dropping them completely, then the EmployeeID column is meaningless.
A solution to this would be to keep a flag for employees, such as active, that tracks if they are active or not. That way, the log data is meaningful.
IANADBA, but it's generally considered very bad practice indeed to delete almost anything from a DB, ever. It would be far better here to have some kind of locked flag / "deleted" datestamp on your Users table and preserve your FK.

SQL - Table Design - DateCreated and DateUpdated columns

For my application there are several entity classes: User, Customer, Post, and so on.
I'm about to design the database, and I want to store the dates when the entities were created and updated. This is where it gets tricky. Sure, one option is to add created_timestamp and updated_timestamp columns to each of the entity tables, but isn't that redundant?
Another possibility would be to create a log table that stores this information and could keep track of updates for any entity.
Any thoughts? I'm leaning toward implementing the latter.
The single-log-table-for-all-tables approach has two main problems that I can think of:
The design of the log table will (probably) constrain the design of all the other tables. Most likely the log table would have one column named TableName and then another column named PKValue (which would store the primary key value for the record you're logging). If some of your tables have compound primary keys (i.e. more than one column), then the design of your log table would have to account for this (probably by having columns like PKValue1, PKValue2 etc.).
If this is a web application of some sort, then the user identity that would be available from a trigger would be the application's account, instead of the ID of the web app user (which is most likely what you really want to store in your CreatedBy field). This would only help you distinguish between records created by your web app code and records created otherwise.
CreatedDate and ModifiedDate columns aren't redundant just because they're defined in each table. I would stick with that approach and put insert and update triggers on each table to populate those columns. If I also needed to record the end-user who made the change, I would skip the triggers and populate the timestamp and user fields from my application code.
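For illustration, one way the trigger approach could look in SQL Server syntax (the table and key names are hypothetical, and other RDBMSs spell triggers differently):

-- CreatedDate/ModifiedDate get defaults at insert time.
ALTER TABLE Customer ADD CreatedDate  DATETIME NOT NULL DEFAULT GETDATE();
ALTER TABLE Customer ADD ModifiedDate DATETIME NOT NULL DEFAULT GETDATE();
GO

-- Keep ModifiedDate current on every update.
CREATE TRIGGER trg_Customer_Modified ON Customer
AFTER UPDATE
AS
BEGIN
    UPDATE c
    SET    ModifiedDate = GETDATE()
    FROM   Customer c
    JOIN   inserted i ON i.CustomerId = c.CustomerId;  -- assumes CustomerId PK
END;
GO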
I do the latter, with a "log" or "events" table. In my experience, the "updated" timestamp becomes frustrating pretty quickly, because a lot of the time you find yourself in a fix where you want more than just the very latest update time.
How often will you need to include the created/updated timestamps in your presentation layer? If the answer is anything more than "once in a great great while", I think you would be better served by having those columns in each table.
On a project I worked on a couple of years ago, we implemented triggers which updated what we called an audit table (it stored basic information about the changes being made, one audit table per table). This included modified date (and last modified).
They were only applied to key tables (not joins or reference data tables).
This removed a lot of the normal frustration of having to account for LastCreated & LastModified fields, but introduced the annoyance of keeping the triggers up to date.
In the end the trigger/audit table design worked well and all we had to remember was to remove and reapply the triggers before ETL(!).
It's for a web based CMS I work on. The creation and last updated dates will be displayed on most pages and there will be lists for the last created (and updated) pages. The admin interface will also use this information.

Is this a good way to model address information in a relational database?

I'm wondering if this is a good design. I have a number of tables that require address information (e.g. street, post code/zip, country, fax, email). Sometimes the same address will be repeated multiple times. For example, an address may be stored against a supplier, and then on each purchase order sent to them. The supplier may then change their address and any subsequent purchase orders should have the new address. It's more complicated than this, but that's an example requirement.
Option 1
Put all the address columns as attributes on the various tables. Copy the details down from the supplier to the PO as it is created. Potentially store multiple copies of the same address.
Option 2
Create a separate address table. Have a foreign key from the supplier and purchase order tables to the address table. Only allow inserts and deletes on the address table, as updates could change more than you intend. Then I would have a scheduled task that deletes any rows from the address table that are no longer referenced by anything, so unused rows are not left about. Perhaps also have a unique constraint on all the non-PK columns in the address table to stop duplicates.
I'm leaning towards option 2. Is there a better way?
EDIT: I must keep the address on the purchase order as it was when sent. Also, it's a bit more complicated than I suggested, as there may be a delivery address and a billing address (there are also a bunch of other tables that hold address information).
After a while, I will delete old purchase orders en masse based on their date. It is after this that I was intending to garbage-collect any address records that are no longer referenced by anything (otherwise it feels like I'm creating a leak).
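For concreteness, the garbage-collection step would be roughly this (all names here are hypothetical):

-- Scheduled cleanup: drop address rows no longer referenced by any
-- supplier or purchase order (one NOT EXISTS per referencing table/column).
DELETE FROM address
WHERE NOT EXISTS (SELECT 1 FROM supplier s
                  WHERE  s.address_id = address.address_id)
  AND NOT EXISTS (SELECT 1 FROM purchase_order po
                  WHERE  po.address_id = address.address_id);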
I actually use this as one of my interview questions. The following is a good place to start:
Addresses
---------
AddressId (PK)
Street1
... (etc)
and
AddressTypes
------------
AddressTypeId
AddressTypeName
and
UserAddresses (substitute "Company", "Account", whatever for Users)
-------------
UserId
AddressTypeId
AddressId
This way, your addresses are totally unaware of how they are being used, and your entities (Users, Accounts) don't directly know anything about addresses either. It's all up to the linking tables you create (UserAddresses in this case, but you can do whatever fits your model).
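In SQL, the linking design above might be declared like this (assuming a Users table keyed on UserId; types abbreviated):

CREATE TABLE Addresses (
    AddressId INTEGER PRIMARY KEY,
    Street1   VARCHAR(100)
    -- ... (etc)
);

CREATE TABLE AddressTypes (
    AddressTypeId   INTEGER PRIMARY KEY,
    AddressTypeName VARCHAR(50)
);

-- The linking table is the only place that ties users to addresses.
CREATE TABLE UserAddresses (
    UserId        INTEGER NOT NULL REFERENCES Users (UserId),
    AddressTypeId INTEGER NOT NULL REFERENCES AddressTypes (AddressTypeId),
    AddressId     INTEGER NOT NULL REFERENCES Addresses (AddressId),
    PRIMARY KEY (UserId, AddressTypeId, AddressId)
);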
One piece of somewhat contradictory advice for a potentially large database: go ahead and put a "primary" address directly on your entities (in the Users table in this case) along with a "HasMoreAddresses" field. It seems icky compared to just using the clean design above, but can simplify coding for typical use cases, and the denormalization can make a big difference for performance.
Option 2, without a doubt.
Some important things to keep in mind: it's an important aspect of the design to indicate to users when addresses are linked to one another, e.g. the corporate address being the same as the shipping address. If they want to change the shipping address, do they want to change the corporate address too, or do they want to specify a new loading dock? The ability to present users with this information, and to change things with this sort of granularity, is VERY important. The same goes for updates; give the user the granularity to "split" entries. Not that this sort of UI is easy to design; in point of fact, it's a bitch. But it's really important to do; anything less will almost certainly cause your users to get very frustrated and annoyed.
Also, I'd strongly recommend keeping the old address data around; don't run a process to clean it up. Unless you have a VERY busy database, your database software will be able to handle the excess data. Really. One common mistake I see is attempting to overoptimize; you DO want to optimize the hell out of your queries, but you DON'T want to optimize your unused data away. (Again, if your database activity is VERY HIGH, you may need something that does this, but it's almost a certainty that your database will work well with excess data still in the tables.) In most situations it's actually more advantageous to simply let your database grow than to attempt to optimize it. (Deleting sporadic data from your tables won't cause a significant reduction in the size of your database, and when it does... well, the reindexing it causes can be a gigantic drain on the database.)
Do you want to keep a historical record of what address was originally on the purchase order?
If yes go with option 1, otherwise store it in the supplier table and link each purchase order to the supplier.
BTW: a sure sign of a poor DB design is the need for an automated job to keep the data "cleaned up" or in sync. Option 2 is likely a bad idea by that measure.
I think I agree with JohnFx.
Another thing about (snail-)mail addresses: since you want to include country, I assume you want to ship/mail internationally, so please keep the address field mostly free-form text. It's really annoying to have to make up a 5-digit zip code when Norway doesn't have zip codes; we have 4-digit post numbers.
The best fields would be:
Name/Company
Address (multiline textarea)
Country
This should be pretty global. If the US postal system requires zip codes in a specific format, then include that too, but make it optional unless USA is selected as the country. Everyone knows how to format the address in their own country, so as long as you keep the line breaks it should be okay...
Why would any of the rows on the address table become unused? Surely they would still be pointed at by the purchase order that used them?
It seems to me that stopping the duplicates should be the priority, thus negating the need for any cleanup.
In the case of orders, you would never want the address updated when the person's (or company's) address changes if the order has already been sent. You need the record of where the order was actually sent in case there is an issue with the order.
The address table is a good idea. Put a unique constraint on it so that the same entity cannot have duplicate addresses. You may still get duplicates, as users may add another address instead of looking one up, and if they spell things slightly differently ("St." instead of "Street") the unique constraint won't prevent it. Copy the data to the order at the time the order is created (see the sketch below); this is one case where you want the multiple records, because you need a historical record of what you sent where. Only allowing inserts and deletes on the table makes no sense to me, as they aren't any safer than updates and involve more work for the database. An update is done in one call to the database; if an address changes under your scheme, you must first delete the old address and then insert the new one. Not only is that more calls to the database, but twice the chance of a coding error.
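A sketch of that copy-at-creation step (hypothetical names, Oracle-style bind variables):

-- Freeze the supplier's current address onto the new purchase order row.
INSERT INTO purchase_order (po_id, supplier_id, ship_street, ship_city, ship_country)
SELECT :new_po_id, s.supplier_id, s.street, s.city, s.country
FROM   supplier s
WHERE  s.supplier_id = :supplier_id;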
I've seen every system using option 1 get into data quality trouble. After 5 years, 30% of all addresses will no longer be current.