SQL database model for contact details

I need a DB to store user records. Regular stuff: name, e-mail, address, phone, fax, and so on. The problem is that there can be more than one phone number per user. And more than one e-mail. Even more than one address. And a lot more of this more-than-one stuff.
One approach is to store everything in one table, e.g. a serialized phones array in one phones column, or comma-separated phones in one phones column. But I really don't like this way; I'd rather build an overcomplicated database to make the programming logic simpler than the other way round.
Another one is an individual table for phones, an individual table for addresses, and so on. Columns: id, customer_id, phone, where customer_id references customers.id. Now this seems like real overkill, having around 10 tables just for storing contact details.
And yet another idea I came up with is one additional table for contacts, with columns like id, customer_id (<- foreign key), key, value, where key can be "phone" and value "+123 3435454", or key "e-mail" and value... you get the idea. So far I like this one best.
What would you suggest? What would be the downsides of method #3?
The DB I'm going to use is PostgreSQL, but it doesn't really matter.

One table per entity type is correct. So one for e-mails, one for phone numbers, etc. It's not overkill: it's normalised database design.
Your option #3 can be done, but, say, what if you want to enforce a certain pattern for phone numbers and e-mail addresses?
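For example, with one table per contact type you can attach a CHECK constraint to each column; with a generic key/value table there is no single column a pattern can sensibly apply to. A minimal PostgreSQL sketch, assuming a customers(id) table exists (the table names and patterns are deliberately crude illustrations, not from the question):

CREATE TABLE customer_phones (
    id          serial PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers (id),
    phone       text NOT NULL,
    -- crude pattern: optional +, then digits, spaces or dashes
    CONSTRAINT phone_format CHECK (phone ~ '^[+]?[0-9][0-9 -]*$')
);

CREATE TABLE customer_emails (
    id          serial PRIMARY KEY,
    customer_id integer NOT NULL REFERENCES customers (id),
    email       text NOT NULL,
    CONSTRAINT email_format CHECK (email ~ '^[^@]+@[^@]+$')
);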

Some may say that method #3 is the best. Others will say that it is basically one of the most common anti-patterns in databases, namely EAV (entity-attribute-value), which gets a lot of hate.
As for me, I know too little about your application to suggest a solution. Generally, method #2 gives you the most functionality.
There is also method #4, and its variant, method #5:
4 - use arrays of values, i.e. the phone column, instead of being of type TEXT, is TEXT[], and then you can store many phones in it.
5 - since you're on PostgreSQL, use it. There is a pretty cool hstore datatype in contrib, which you can use instead of arrays to add some semantics (like the type of phone). A sketch of both methods follows.
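A hedged sketch of both methods (the table and column names are invented; on modern PostgreSQL hstore is installed with CREATE EXTENSION rather than from the contrib scripts mentioned above):

-- method #4: an array column
CREATE TABLE customers_v4 (
    id     serial PRIMARY KEY,
    name   text,
    phones text[]
);
INSERT INTO customers_v4 (name, phones)
VALUES ('Bob', ARRAY['+123 3435454', '+123 999']);
SELECT name FROM customers_v4 WHERE '+123 3435454' = ANY (phones);

-- method #5: hstore adds a label (semantics) per value
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE customers_v5 (
    id     serial PRIMARY KEY,
    name   text,
    phones hstore
);
INSERT INTO customers_v5 (name, phones)
VALUES ('Bob', 'home => "+123 3435454", fax => "+123 999"');
SELECT phones -> 'home' FROM customers_v5;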

Some purists would suggest option #3 to be the way to go. In fact, using that approach you could theoretically build a single-table database!
However, as gbn mentioned, this would cause problems with specific formats, and you would have to enforce data lengths etc. on the client only.
I would go with your 2nd suggestion and use the approach of having different tables, with an address type, phone number type, etc., similar to that shown below:
id number
address_type varchar (home/contact/mail etc.)
address_line1 varchar
address_line2 varchar
etc., etc.

Downsides, method by method:
#1 violates first normal form, as you'll have many values in one field.
#2, as you said, seems like overkill, with around 10 tables to maintain.
#3 seems the best option, but could be normalized further by using a domain table for that "key" value:
Customer (Id, Name)
AdditionalData (Id, Name) - phone, address, etc.
CustomerAdditionalData (Id, CustomerId, AdditionalDataId, Value)
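In SQL, that normalised variant might look like the following sketch (the column types are assumptions, not from the answer):

CREATE TABLE Customer (
    Id   integer PRIMARY KEY,
    Name varchar(100)
);

-- domain table: the allowed "keys" (phone, address, e-mail, ...)
CREATE TABLE AdditionalData (
    Id   integer PRIMARY KEY,
    Name varchar(50) NOT NULL UNIQUE
);

CREATE TABLE CustomerAdditionalData (
    Id               integer PRIMARY KEY,
    CustomerId       integer NOT NULL REFERENCES Customer (Id),
    AdditionalDataId integer NOT NULL REFERENCES AdditionalData (Id),
    Value            varchar(255)
);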

Related

How to identify duplicate records using client name and address in SQL while both of them are free text

I have a database with millions of client contacts. However, a lot of them are duplicated, and may I ask some hero from here to advise how to identify those duplicates using Oracle SQL, PL/SQL or Excel?
The following is the data structure:
Client_Header
id integer (Primary Key)
Client_First_Name (varchar2)
Client_Last_Name (varchar2)
Client_Date_Of_Birth (timestamp)
Client_Address
Client_Id (Foreign Key ref Client_Header)
Address_Line1 (varchar2)
Address_Line2 (varchar2)
Address_Line3 (varchar2)
Suburb (varchar2)
State (varchar2)
Country (varchar2)
My challenge is that, other than Client_Date_Of_Birth and those key fields, all fields are free text only.
For example, we have a client like following
Surname : Jones
First name : David
Client_Date_Of_Birth: 10/05/1975
Address: Unit 10 Floor 1, 20 Railway Parade, St Peter, NSW 2044
However, as those fields are free text, I have a lot of data issues; the following link (jpeg file only) illustrates some of those issues:
Sample of data issues
Note:
Other than those issues, sometimes we may be missing the first name or last name of the client (but not both).
Sometimes multiple problems can be found within the same record.
Also, sometimes the address may simply be the name of a school, shopping center, etc.
The system does not store any other id that can uniquely identify the client.
I understand it is close to impossible to gather all duplicate records where the client address is a school or shopping center. However, for the other cases, is there any way to identify most of the duplicates?
Thank you for your help!
Not a pretty sight, and I'm afraid I don't have good news for you.
This is a common problem in databases, especially if the data entry personnel are insufficiently trained. One of the main objectives in data entry training is to make the problem well understood and show ways to avoid it. Something to keep in mind in the future.
Unfortunately, there isn't any "magic wand" that will clean your data for you. I'm sorry, but you have before you one of the most tedious tasks in database maintenance. You're going to have to basically remove the duplicates by hand, and the job requires more of an editor than a database administrator.
If you have millions of records, of which perhaps a million are actually duplicates, I would estimate that it will take an expert working full time for at least two years -- and probably longer -- to clean up your problem: to do it in two years would require fixing 2000 records a day, with time off on weekends and two weeks of vacation.
In the end, the only sure way to remove all the duplicates is to compare all of them and remove them one at a time. But there are plenty of tricks you can use to get rid of blocks of them at once. Here are a few that I can think of with your data sample:
Change "Dave" to "David" in both first and last name fields. (Make sure that nobody actually has the last name "Dave.")
Change all instances of "Jones David" to "David Jones." (Make sure that there are no people named "Jones David".)
Change "1/F" to "Floor 1."
The idea is to focus on some of the fields, and in those fields get all of the duplicates to be exact duplicates. Once you have that done, you delete all the records with the target values in the fields, except the one with the primary key of the record that you want to keep (if your table isn't keyed, you'll have to find another way to do it, such as selecting the top record into a new table).
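As a hedged Oracle sketch of that two-step flow against the Client_Header table above (this assumes the variants have already been made exact, and that matching Client_Address child rows are re-pointed or removed before the delete):

-- step 1: collapse a known variant to its canonical form
UPDATE Client_Header
   SET Client_First_Name = 'David'
 WHERE Client_First_Name = 'Dave';

-- step 2: delete every record in a now-identical group except the lowest id
DELETE FROM Client_Header h
 WHERE h.id <> (SELECT MIN(h2.id)
                  FROM Client_Header h2
                 WHERE h2.Client_First_Name    = h.Client_First_Name
                   AND h2.Client_Last_Name     = h.Client_Last_Name
                   AND h2.Client_Date_Of_Birth = h.Client_Date_Of_Birth);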
This technique speeds things up for records with a large number of duplicates. Where you have only a few duplicates, it's quicker to just identify them one by one. One way to do this quickly is to go into edit mode on a table, work with a particular field (for example, the postal code field in this case), and put a unique value in that field when you want to mark it for deletion (in this case, perhaps a single zero). Then you can periodically delete all the records with that value in the field.
You'll also need to sort the data in multiple ways to find the duplicates, which it appears you already know.
As for your notes, don't try to identify all the ways that the data is messed up. Once you identify one record as a duplicate of another, you don't care what's wrong with it, you just have to get rid of it. If you have two records and each contains data that you want to keep that the other one is missing, then you'll have to consolidate them and delete one of them. And then go on to the next, and the next, and the next...
Some years ago I had a similar task, and it took me about a year to clean the data.
What I did, in short:
send the address to api.addressdoctor.com for validation and splitting into single fields (with maps.googleapis.com it is also possible)
use a first-name and last-name match list to check the names (we used namepedia.org). A lot depends on the quality of this list; it should be based on the country of birth or of the first address. From the results we derived a probability for what kind of name it is (first/last/company).
with this improved data you should create some normalized and fuzzy attributes. Normalized fields from names and address... like uppercased and restricted to alphanumeric characters.
at the end I would change the data model a little bit to improve the data quality by design. I recommend adding pre-title, post-title, middle-name and post-name fields. You should also add the split address fields like street, street number, zip, location, longitude, latitude, etc.
I would also change the relation between Client_Header and Client_Address to use an extra address_id as primary key... but this depends on the requirements. And at the end I would add some constraints to prevent duplicated entries.
after all that, the deduplication is not hard. Just group all normalized or fuzzy data together and create a dense_rank (I group by person, household, ...). Make a ranking over the attributes (I used data quality, data fill rate and transaction history for a score value). Finally it is your choice whether you just delete the duplicates and copy the corresponding data to the surviving client record, or virtually connect the data via Client_Id in an extra field. A sketch of the grouping step follows below.
for insert and update processes you should create PL/SQL functions that check whether a fuzzy last name (or first name) + fuzzy address already exist. Split the name and address fields, check them with the address APIs, and match them against the names reference. If it is a single-tuple data entry, show the best results to the user and let them decide.
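A minimal Oracle sketch of that grouping step, assuming you have staged the normalised fields somewhere first (the norm_* columns and the staging view are hypothetical):

-- every row in the same person group gets the same grp id;
-- rn = 1 marks the record to keep (here simply the lowest id)
SELECT id,
       DENSE_RANK() OVER (ORDER BY norm_last_name, norm_first_name,
                                   norm_address, Client_Date_Of_Birth) AS grp,
       ROW_NUMBER() OVER (PARTITION BY norm_last_name, norm_first_name,
                                       norm_address, Client_Date_Of_Birth
                          ORDER BY id) AS rn
  FROM client_dedup_staging;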

What is the most correct way to store a "list" in a SQL Database?

So, I've read a lot about how stashing multiple values into one column is a bad idea and violates the first rule of data normalisation (which, surprisingly, is not "Do Not Talk About Data Normalisation") so I need some help.
At the moment I'm designing an ASP .NET webpage for the place I work for. I want to display data on a web page depending on what Active Directory groups the person belongs to. The first way of doing this that comes to mind is to have a table with, essentially, a column containing the AD group and a second column containing the list of computers that belong to that group.
I've learnt that this shows great disregard for relational databases, so what is a better way to do it? I want to control this access through SQL tables, so I can add/remove rows in these tables and change end users' access accordingly.
Thanks for the help! :)
EDIT: To describe exactly what I want to do:
We have a certain group of computers that need to be checked up on, however these computers are in physically difficult to reach locations. The organisation I belong to has remote control enabled for these computers, however they're not in the business of giving out the remote control password (understandable).
The added layer of complexity is that, depending on who you are, our clients should only be able to see a certain group of computers (that is, the group of computers that their area owns). So, if Group A has Thomas in it, and Group B has Jones in it, if you belong to either group then you would just see one entry. However, if you belong to both groups you should see both Thomas and Jones computers in it.
The reason I think storing this data in a single SQL cell is the way to go is that storing it in tables would require (in my mind) a new table for each new "group" of computers. I don't want to crank out SQL tables for every new group; I'd much rather just have an added row in a SQL table somewhere.
Does this make any sense?
You basically have three options in SQL Server:
Storing the values in a single column.
Storing the values in a junction table.
Storing the values as XML (or as some other structured data format).
(Other databases have other options, such as arrays, nested tables, and JSON.)
In almost all cases, using a junction table is the correct approach. Why? Here are some reasons:
SQL Server has (relatively) lousy string manipulation, so doing something as simple as ensuring a unique list is really, really hard.
A junction table allows you to store lots of other information (When was a machine added? What is the full description of the machine? etc. etc.).
Most queries that you want are pretty easy with a junction table (with the one exception of getting a comma-delimited list, alas -- which is just counterintuitive rather than "hard").
All the types are stored natively.
A junction table allows you to enforce constraints (both check and foreign key) on the elements of the list.
Although a delimited list is almost never the right solution, it is possible to think of cases where it might be useful:
The list doesn't change and presentation of the list is very important.
Space usage is an issue (alas, denormalization often results in fewer pages).
Queries do not really access elements of the list, just the entire thing.
XML is also a reasonable choice under some circumstances. In the most recent versions of SQL Server, this can be made pretty efficient. However, it incurs the overhead of reading and parsing XML -- and things like duplicate elimination are still not obvious.
So, you do have options. In almost all cases, the junction table is the right approach.
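As a hedged illustration of why the junction table wins (every name here is invented, SQL Server syntax): the composite primary key gives you a unique "list" for free, and extra columns record things like when a machine was added.

CREATE TABLE GroupComputer (
    GroupId    int      NOT NULL REFERENCES AdGroup (GroupId),
    ComputerId int      NOT NULL REFERENCES Computer (ComputerId),
    AddedAt    datetime NOT NULL DEFAULT GETDATE(),  -- when was the machine added?
    PRIMARY KEY (GroupId, ComputerId)                -- no duplicate entries possible
);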
There is an "it depends" that you should consider. If the data is never going to be queried (or queried very rarely) storing it as XML or JSON would be perfectly acceptable. Many DBAs would freak out but it is much faster to get the blob of data that you are going to send to the client than to recompose and decompose a set of columns from a secondary table. (There is a reason document and object databases are becoming so popular.)
... though I would ask why you are replicating Active Directory to your database and how you are planning on keeping the two in sync.
It's not really a bad idea to store multiple values in one column, but it depends on the searches you want to run.
If you only want to know which persons are part of a group, then you can store the persons in one column with a group ID as the key. For an update, you just rewrite the entire list for that group.
But if you want to search for a specific person who belongs to a group, then it's not recommended to store multiple persons in one column. In that case it's better to use an intermediate table that stores person ID and group ID.
Sounds like you want a table that maps users to group IDs and a second table that maps group IDs to which computers are in that group. I'm not sure, your language describing the problem was a bit confusing to me.
A list has some columns, like name, family name, phone number, etc.,
and rows like: name=John, familyName=Lee, number=12321321
name=..., familyName=..., number=...
An SQL database works the same way: every row in an SQL table is a record, so you just add the records of your list into your database using an INSERT query.
There's a complete explanation here:
http://www.w3schools.com/sql/sql_insert.asp
This sounds like a typical many-to-many problem. You have many groups and many computers, and they are related to each other. In this situation, it is often recommended to use a mapping table, a.k.a. a "junction table" or "cross-reference" table. This table consists solely of the two foreign keys into your other tables.
If your tables look like this:
Computer
- computerId
- otherComputerColumns
Group
- groupId
- othergroupColumns
Then your mapping table would look like this:
GroupComputer
- groupId
- computerId
And you would insert a single record for every relationship between a group and a computer. This complies with the rules of third normal form with regard to database normalization.
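Given those tables, the requirement from the question (a user in both groups sees both sets of computers) becomes a single query; a sketch assuming an additional UserGroup (userId, groupId) table holds the AD-group memberships:

SELECT DISTINCT c.*
  FROM Computer      c
  JOIN GroupComputer gc ON gc.computerId = c.computerId
  JOIN UserGroup     ug ON ug.groupId    = gc.groupId
 WHERE ug.userId = @userId;
-- DISTINCT: a computer that is in two of the user's groups still shows only once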
You can have a table with the group and group id, another table with the computer and computer id and a third table with the relation of group id and computer id.

Is it good practice to store different data in the same column in SQL?

I currently have a Users table which has the following columns:
UserId
Address
Currently we are storing the IP address in the Address column. Moving forward, we need to store, say, a MAC address. Is it good practice to store both of them in the same column (both are varchar) and have another column indicating what type it is (typeOfAddress int, with a 1 or 2 indicating the type of address)? Or should I create a separate nullable column (and change the existing one to nullable as well)?
No it isn't good practice. One of the first things I was told several decades ago was 'a computer field should not be used for more than one purpose'. Otherwise you have in principle no way of knowing what it means, unless you have another indicator field, which you may as well use for the other value type, and you are introducing the entirely unnecessary risk of misinterpretation.
Absolutely create a separate nullable column. It will make writing all your future SQL on this table much easier, and keep everything much more maintainable overall.
Well, based on the information given, it depends.
Would a user have both an IP address and a MAC address? If they do, you would need to put these records in a separate table with a foreign key relationship to the user record.
What about retrieval? Do these values often need to be retrieved when retrieving data about a user? If they do, it may make sense to store them in the same table and avoid the table join each time.
I'd be inclined to have two separate columns, one for IpAddress and another for MacAddress. This way it is clear what data is in what column without the need to lookup against another column to figure it out.
Recommended is this structure:
UserId
AddressType
Address
because it is in 3rd NF.
The disadvantages of simply using multiple columns in a single table are these:
The need to remember to "null-handle" whenever filtering on Address
Over time, the paradigm may expand beyond sense as new address types are added, but a conversion becomes too costly to contemplate.
An AddressType table should be added with a FK relationship to prevent accidental entry of AddressTypes that look valid, but aren't.
Best of all, theoretically, would be a separate table for each AddressType, linked on UserId to the user table; but that really does look like overkill for the problem as presented.
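A hedged sketch of that recommended structure with the AddressType lookup table added (the types and lengths are assumptions):

CREATE TABLE AddressType (
    AddressTypeId int PRIMARY KEY,
    Name          varchar(20) NOT NULL UNIQUE      -- e.g. 'IP', 'MAC'
);

CREATE TABLE UserAddress (
    UserId        int NOT NULL REFERENCES Users (UserId),
    AddressTypeId int NOT NULL REFERENCES AddressType (AddressTypeId),
    Address       varchar(64) NOT NULL,
    PRIMARY KEY (UserId, AddressTypeId, Address)   -- a user may hold several of each type
);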
If you don't think you're going to be adding a bunch more different types of addresses, my preference would be for having two columns. I think it's simpler conceptually and simpler to query. But if you think you're going to be adding more types of "addresses" (whatever that may be), having 5 or 6 different columns may not be ideal. In that case, I'd go with one address column and one address_type column, and create a junction table that defines the address types.
Instead of making new columns, create new Address and AddressType tables and link to those:
Users
foreign key address_id
Address
id
value (varchar)
foreign key address_type_id
AddressType
id
name (varchar)
Might be overkill, but it's good normalisation practice

Possible to have a table with variable columns?

It might be a stupid question, but here goes:
Is it possible to make a dynamic table that's able to contain rows with a variable number of columns and custom column names?
I have glanced over EAV-modelling, but it seems heavy. A real life example could be this:
Let's say I have a register with customers. But each customer might have different information to be entered. And depending on what you want to enter, it should be reflected in the database. (i.e. every customer has different columns)
Is this impossible/probable?
Update:
The standard approach (i.e. having a table with all needed columns and saving information only into the columns that make sense for a particular customer while setting the remaining ones to NULL) doesn't work for me, because what I want can't use "fixed" column names. For example, one customer might want a CVR-number and another might want their phone number as a reference number. And a third might want some completely different information. So, to avoid having a table containing 500 columns, I have now thought of making an extra table containing rows of column data, like so: Id, Name, Value, CustomerId. So when I want information for a customer, all I have to do is iterate through this table with a specific customer Id.
My own edit:
Sorry for troubling you with this simple SQL issue! :-) Have a nice day...
You could model this as a one-to-many relationship between a Customer and a CustomerAttributes table. Something like:
Customer table
CustomerId
LastName
FirstName
...
CustomerAttributes table
CustomerId
AttributeName
AttributeValue
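Storing and retrieving a per-customer attribute is then straightforward; a minimal sketch using the CVR-number example from the question (the values are made up):

INSERT INTO CustomerAttributes (CustomerId, AttributeName, AttributeValue)
VALUES (42, 'CVR-number', '12345678');

-- one row comes back per attribute this customer actually has
SELECT AttributeName, AttributeValue
  FROM CustomerAttributes
 WHERE CustomerId = 42;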
This is not possible in SQL Server. As Marco says, you can store each customer's data as XML.
If all the columns are known ahead of time and some customers use one set and other customers use a different set, then sub-tables with each set of columns is the normal approach.
If the columns are not known ahead of time, then how would the data even be used? No code or reports could refer to it. Perhaps it should be stored unstructured in a general purpose 'Notes' field.
As far as I know it's not possible in standard relational databases, but you can take a look at schema-less "NoSQL" databases like MongoDB.

SQL Select help - removing white space

I'm running into a problem when trying to select records from my MS SQL Server 2005 database (I'm still very new to SQL, but I already learned and use the basic commands from w3schools). In theory, all my manufacturer records should be unique. At least that is how I intended it to be when I did my first massive data dump into it. Unfortunately, that is not the case and now I need to fix it! Here is my scenario:
Table name = ItemCatalog
Relevant columns = Partnumber,Manufacturer,Category
When I did a SELECT DISTINCT Manufacturer FROM ItemCatalog this little problem is what turned up:
Cables2Go
CablesToGo
Cables To Go
CableToGo Inc
CablesToGo Inc
All 5 of those showed up as distinct, which they are. Can't fault my SELECT statement for returning it, but from my human perspective they are all the same manufacturer! One method I see working is doing an UPDATE command and fixing all the permutations that show up, but I have a LOT of manufacturers and this would be very time consuming.
Is there a way when I punch in a SELECT statement, that I can find all the likely permutations of a manufacturer name (or any field really)? I attempted the LIKE operator, so my statement would read
SELECT Manufacturer FROM ItemCatalog WHERE Manufacturer LIKE '%CablesToGo%'
but that didn't turn out as well as I had hoped. Here's the nasty bit, my other program that I'm putting together absolutely requires that I only ask for a single manufacturer name, not all 5 variations. Maybe I'm talking in circles here, but is there is a simple way in one statement for me to find a similar string?
If you are doing some data mining, you could also try the SOUNDEX and DIFFERENCE functions in SQL Server.
While they are both outdated (they don't handle foreign characters very well), they could yield some interesting results for you:
SELECT * FROM ItemCatalog WHERE SOUNDEX(Manufacturer) = SOUNDEX('Cables To Go');
and
SELECT * FROM ItemCatalog WHERE DIFFERENCE(Manufacturer, 'Cables To Go') >= 3;
The number 3 means likely similar (0 means not similar and 4 is very similar).
There are a number of better SOUNDEX functions available on the internet. See Tek-Tips for an example.
Here is another example at SQL Team.
Standard SQL has a SIMILAR TO predicate, which is a bit more powerful than LIKE.
However, you could use LIKE to good effect with:
Manufacturer LIKE 'Cable%Go%'
This would work in this specific case, finding all the variants listed. However, it would also find 'Cable TV Gorgons' and you probably don't need them included. Your version would also find 'We Hate CablesToGo With Ferocity Inc', which you probably didn't want either.
However, data cleansing is a major problem, and there are companies that make a living out of providing data cleansing. You often end up making a dictionary or thesaurus of terms (company names here) mapping all the variants encountered to the canonical form. The problem is that sometimes you find the same variant spelling is used for two separate canonical forms. For example, a pair of bright sparks might both decide to use 'C2G' as an abbreviation, but one uses it for 'Cables To Go Inc' and the other uses it for 'Computers To Gamers Inc'. You have to use some other information to determine whether a particular instance of 'C2G' means 'Cables' or 'Computers'.
'Cable%Go%' might work for that one case, but if you have other variations for other strings, you'll probably have to do a lot of manual data cleanup.
I suggest you use an object-relational mapping tool to map your table into objects and add the filtering logic there.
One option you have is to loosen your wildcard search to something like 'Cables%Go%'. This might be good in the short term, but with this approach you run the risk of matching more manufacturers than you want (e.g. Cables on the Go, etc.).
You could also put together a mapping table, which would put all of the variants of Cables To Go into a single group, which your app can query and normalize for your ItemCatalog query.
Another option you have is to introduce a Manufacturers table. Your ItemCatalog table would then have a foreign key to this table and only allow manufacturers that are in the Manufacturers table. This would require some cleanup of your ItemCatalog table to get it working, assuming that you want all of the variants of Cables To Go to be the same.
I know others are suggesting query fixes - I thought I'd elaborate on my long-term fix for kicks.
You could create another table relating each of the variations to a single manufacturer entity. If I encountered this situation at work (and I have), I would be inclined to fix it.
Create a manufacturers table with a primary key, name, etc.
Create a table with aliases - these will only be needed when you are presented with data that doesn't have the manufacturer's ID (like an import file).
Modify ItemCatalog such that it references the primary key from the manufacturer table (i.e. a ManufacturerID foreign key).
When importing data to ItemCatalog, assign the ManufacturerID foreign key based on matches to the alias table. If you have a name that matches 2+ records, flag them for manual review or try to match on more than the manufacturer name.
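A hedged sketch of that long-term structure (all names invented, SQL Server syntax to match the question):

CREATE TABLE Manufacturer (
    ManufacturerId int IDENTITY(1,1) PRIMARY KEY,
    Name           varchar(100) NOT NULL
);

-- one row per spelling variant ever encountered
CREATE TABLE ManufacturerAlias (
    Alias          varchar(100) PRIMARY KEY,  -- e.g. 'Cables2Go', 'CablesToGo Inc'
    ManufacturerId int NOT NULL REFERENCES Manufacturer (ManufacturerId)
);

-- during an import, resolve the free-text name through the alias table;
-- no match means the row goes to manual review
SELECT a.ManufacturerId
  FROM ManufacturerAlias a
 WHERE a.Alias = @importedName;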