Combining similar values in a single column - sql

I have a column that is being used to list competitors' names in a table I'm putting together. Right now I don't have a lot of control over how these inputs are made, and it causes some serious headaches. There are random spaces and misspellings throughout our data, and yet we need to list the data by competitor.
As an example (not actual SQL I'm using), list of competitors:
Price Cutter
PriceCutter
PriceCuter
Price Cuter
If I ran the query:
SELECT Competitor_Name, SUM(Their_Sales)
FROM Cmdata.Competitors
WHERE Their_Sales BETWEEN 10000 AND 100000000
GROUP BY Competitor_Name
I would get a different entry for each version of Price Cutter, something I clearly want to avoid.
I would think this problem would come up a lot, but I did a Google search and came up dry. I will admit, the question is kind of hard to articulate in a few words; maybe that's why I didn't come up with anything. Either that or this is so basic I should already know...
(PS- Yes, we're moving to a drop-down menu, but it's gonna take some time. In the meantime, is there a solution?)

You need to add a Competitor table that has a standard name for each competitor.
Then, use foreign key references in other tables.
The problem that you are facing is a data cleansing and data modeling issue. It is not particularly hard to solve, but it does require a fair amount of work. You can get started by getting a list of all the current spellings and standardizing them -- probably in an Excel spreadsheet.
If you do that, you can then create a lookup table and change the values by looking them up.
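For example, a minimal sketch of that interim lookup approach (the lookup table and its contents are invented here for illustration; exact syntax depends on your database):

CREATE TABLE Competitor_Name_Lookup (
    Raw_Name      varchar(100) PRIMARY KEY,
    Standard_Name varchar(100) NOT NULL
);

INSERT INTO Competitor_Name_Lookup (Raw_Name, Standard_Name) VALUES
    ('Price Cutter', 'Price Cutter'),
    ('PriceCutter',  'Price Cutter'),
    ('PriceCuter',   'Price Cutter'),
    ('Price Cuter',  'Price Cutter');

-- Report on the standardized name instead of the raw text
SELECT l.Standard_Name, SUM(c.Their_Sales) AS Total_Sales
FROM Cmdata.Competitors c
JOIN Competitor_Name_Lookup l ON l.Raw_Name = c.Competitor_Name
WHERE c.Their_Sales BETWEEN 10000 AND 100000000
GROUP BY l.Standard_Name;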
However, in the medium term, you should be creating a Competitor table and modelling the data in the way that your application needs.

This is a very hard problem in general. If your database supports it, you could try grouping by SOUNDEX(Competitor_Name) instead of just Competitor_Name.
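A rough sketch of that idea against the table from the question (SOUNDEX availability and behaviour vary by database, so treat this as an experiment rather than a fix):

SELECT MIN(Competitor_Name) AS Competitor_Name, SUM(Their_Sales) AS Total_Sales
FROM Cmdata.Competitors
WHERE Their_Sales BETWEEN 10000 AND 100000000
GROUP BY SOUNDEX(Competitor_Name);   -- MIN() just picks one spelling to display per group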
Really, the Competitor_Name column should be a foreign key into a Competitors table anyway, instead of a bare text field.
Whatever you do to fix it, you should also UPDATE the table so that you don't have to do this sort of hoop-jumping in the future.

(I'm a bit hazy on the syntax, but this is close)
alter table Competitors add column cleanedName varchar(100);
update Competitors set cleanedName = Replace(Upper(Competitor_Name), ' ', '')
then Group by cleanedName instead of Competitor_Name

Related

How can I divide a single table into multiple tables in Access?

Here is the problem: I am currently working with a Microsoft Access database that the previous employee created by just adding all the data into one table (yes, all the data into one table). There are about 186 columns in that one table.
I am now responsible for dividing each category of data into its own table. Everything is going fine although progress is too slow. Is there perhaps an SQL command that will somehow divide each category of data into its proper table? As of now I am manually looking at the main table and carefully transferring groups of data into each respective table along with its proper IDs making sure data is not corrupted. Here is the layout I have so far:
Note: I am probably one of the very few at my campus with database experience.
I would approach this as a classic normalisation process. Your single hugely wide table should contain all of the entities within your domain so as long as you understand the domain you should be able to normalise the structure until you're happy with it.
To create your foreign key lookups, run distinct queries against the columns you're going to remove and then add the key values back in.
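A rough sketch of that step, with invented table and column names (this is SQL Server flavoured and the statements should be run one at a time; Access's own dialect, e.g. UPDATE ... INNER JOIN ... SET, differs slightly):

-- 1. Build a lookup table from the distinct values in the wide table
SELECT DISTINCT Department
INTO Departments
FROM WideTable;

ALTER TABLE Departments ADD DepartmentID int IDENTITY(1,1);

-- 2. Put the new key back onto the original rows
ALTER TABLE WideTable ADD DepartmentID int;

UPDATE w
SET w.DepartmentID = d.DepartmentID
FROM WideTable w
JOIN Departments d ON d.Department = w.Department;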
It sounds like you know what you're doing already? Are you just looking for reassurance that you're on the right track? (Which it looks like you are.)
Good luck though, and enjoy it - it sounds like a good little piece of work.

Strategy for avoiding a common SQL development error (misleading result on join bug)

Sometimes when I'm writing moderately complex SELECT statements with a few JOINs, the wrong key columns get used in a JOIN and the query still returns valid-looking results.
Because the auto-numbering values (especially early in development) all tend to fall in similar ranges (sub 100s or so), the SELECT still produces some results. These results often look valid at first glance, and the problem is not detected until much, much later, making debugging much more difficult because familiarity with the data structures and code has gone stale in the dev's mind.
I just spent several hours tracking down yet another instance of this issue, one I've run into too many times before. I name my tables and columns carefully and write my SQL statements methodically, but this is an issue I can't seem to completely avoid. It comes back and bites me for hours of productivity about twice a year on average.
My question is: Has anyone come up with a clever method for avoiding this; what i assume is probably a common SQL bug/mistake?
I have thought of trying to auto-number starting with different start values, but this feels kludgy and would get ugly trying to keep such a scheme straight for data models with dozens of tables... Any better ideas?
P.S.
I am very careful and methodical in naming my tables and columns. The Patient table gets a PatientId column, Facility gets a FacilityId, etc. This issue tends to arise when there are join tables involved, where the linkage takes on extra meaning, such as RelatedPatientId, ReferingPatientId, FavoriteItemId, etc.
When writing long complex SELECT statements try to limit the result to one record.
For instance, assume you have this gigantic, enormous, awesome CMS system and you have to write internal reports because the reports that come with it are horrendous. You notice that there are about 500 tables. Your select statement joins 30 of these tables. Your query should limit the row count with a WHERE clause.
My advice is, rather than getting all this code written and generalized for all cases, to break the problem up: use WHERE to limit the row count to, say, a single record. Check all the fields; if they look OK, relax the filter and let your query return more rows. Only after further checking should you generalize.
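For example, something like this while the query is still under construction (table, column, and key values are all made up):

SELECT o.OrderId, o.Amount, v.VendorName
FROM Orders o
JOIN Vendors v ON v.VendorId = o.VendorId
WHERE o.OrderId = 12345;   -- one record whose correct values you already know by heart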
It bites a lot of us who keep adding more and more joins until it seems to look ok, but only after Joe Blow the accountant runs the report does he realize that the PO for 4 million was really the telephone bill for the entire year. Somehow that join got messed up!
One option would be to use your natural keys.
More practically, Red Gate SQL Prompt picks the FK columns for me.
I also tend to build up one JOIN at a time to see how things look.
If you have a visualization or diagramming tool for your SQL statements, you can follow the joins visually, and any errors will become immediately apparent, provided you have followed a sensible naming scheme for your primary and foreign keys.
Your column names should take care of this unless you named them all "ID". Are you writing multiple select statements using the same tables? You may want to create views for the more common ones.
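For instance, a small sketch of such a view, borrowing the Patient/Facility naming from the question (the column lists are invented):

CREATE VIEW PatientWithFacility AS
SELECT p.PatientId, p.LastName, f.FacilityId, f.FacilityName
FROM Patient p
JOIN Facility f ON f.FacilityId = p.FacilityId;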
If you're using SQL Server, you can use GUID columns as primary keys (that's what we do). You won't have problems with collisions again.
You could use GUIDs as your primary keys, but it has its pros and cons.
This pro is actually not mentioned on that page.
I have never tried doing this myself - I use a tool on top of SQL that makes incorrect joins very unlikely, so I don't have this problem. I just thought I'd mention it as another option though!
For IDs use TableNameID, for example for table Person, use PersonID
Use db model and look at the drawing when writing queries.
This way join looks like:
... ON p.PersonID = d.PersonID
as opposed to:
... ON p.ID = d.ID
Auto-increment integer PKs are among your best friends.

Normalize or Denormalize: Store Contact Details (Phone Numbers) in separate Table? Search Performance?

I'm designing a database application which stores simple contact information (First/Last Name etc.) and I also have to store phone numbers. Besides phone numbers I have to store what they are for (mobile, business etc.) and possibly an additional comment for each.
My first approach was to normalize and keep the phone numbers in a separate table, so that I have my 'Contacts' and my 'PhoneNumbers' table. The PhoneNumbers table would be like this:
Id int PK
ContactId int FK<->Contacts.Id
PhoneNumber nvarchar(22)
Description nvarchar(100)
However, it would make things a lot easier AND save a SQL Join on retrieval if I just stored this information as part of each contact's record (assuming that I limit the total # of phone numbers that can be stored, to say 4 numbers total).
However, I end up with an "ugly" structure like this:
PhoneNumber1 nvarchar(22)
Description1 nvarchar(100)
PhoneNumber2 nvarchar(22)
Description2 nvarchar(100)
etc. etc.
It looks amateurish to me but here are the advantages I see:
1) In ASP.NET MVC I can simply attach the input textboxes to my LINQ object's properties and I'm done with wiring up record adds and updates.
2) No SQL Join necessary to retrieve the information.
Unfortunately I am not very knowledgeable about issues such as table width (I have read that performance problems can come up if a table grows too big / has too many columns?), and it would also mean that when I search for a phone number I'd have to look at 4 fields instead of the 1 field I'd search if I kept the numbers in a separate table.
My application has about 80% search/data retrieval activity so search performance is an important factor.
I appreciate your help in finding the right way to do this. Separate table or keep it all in one? Thank you!
It won't likely cause problems to have the data denormalized like that, but I wouldn't suggest it. Even though it may be more complex to query, it's better to have well formed data that you can manipulate in a multitude of ways. I would suggest a database schema like this:
Contacts:
ID (Primary Key)
Name
Job Title
Phone Number Categories:
ID (Primary key)
Name
Phone Numbers:
ID (Primary Key)
Category_ID (Foreign Key -> Phone Number Categories.ID)
Contact_ID (Foreign Key -> Contacts.ID)
Phone Number
This allows you a lot of flexibility in the number of phone numbers allowed, and gives you the ability to categorize them.
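In SQL that schema might look roughly like this (the data types are only guesses; the final SELECT shows that retrieval is still a single query, just with joins):

CREATE TABLE Contacts (
    ID int PRIMARY KEY,
    Name nvarchar(100),
    JobTitle nvarchar(100)
);

CREATE TABLE PhoneNumberCategories (
    ID int PRIMARY KEY,
    Name nvarchar(50)
);

CREATE TABLE PhoneNumbers (
    ID int PRIMARY KEY,
    Category_ID int REFERENCES PhoneNumberCategories (ID),
    Contact_ID int REFERENCES Contacts (ID),
    PhoneNumber nvarchar(22)
);

SELECT c.Name, cat.Name AS Category, pn.PhoneNumber
FROM Contacts c
JOIN PhoneNumbers pn ON pn.Contact_ID = c.ID
JOIN PhoneNumberCategories cat ON cat.ID = pn.Category_ID;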
This might be fine now, but what happens when someone wants a fifth phone number? Do you keep on adding more and more fields?
Another thing to consider is how you would run a query such as 'Give me all people and their mobile phone numbers' or 'Give me everyone with no phone number'. With a separate table this is easy, but with one table the mobile number could be in any one of four fields, so it becomes much more complicated.
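For instance, with the two-table layout proposed in the question, those queries stay simple (assuming mobile numbers carry a Description of 'Mobile', which is just an illustration):

-- Everyone with no phone number at all
SELECT c.*
FROM Contacts c
LEFT JOIN PhoneNumbers p ON p.ContactId = c.Id
WHERE p.Id IS NULL;

-- All people and their mobile numbers
SELECT c.*, p.PhoneNumber
FROM Contacts c
JOIN PhoneNumbers p ON p.ContactId = c.Id
WHERE p.Description = 'Mobile';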
If you take the normalised approach, and in future you wanted to add additional data about the phone number, you can simply add another column to the phone numbers table, not add 4 columns to the contacts table.
Going back to the first point about adding more phone numbers in future - if you do add more numbers, you will have to amend probably every query/bit of logic/form that works on data to do with phone numbers.
I'm in favor of the normalized approach. What if you decided that you wanted to add an "Extension" column for business phone numbers? You'd have to create the columns "Extension1", "Extension2", "Extension3", etc. This could become quite tedious to maintain at some point.
Then again, I don't think you can go too wrong either way. It's not like normalization/denormalization will take all that much time if you decided to switch to the other method.
An important principle of denormalization is that it does not sacrifice normalized data. You should always start with a schema that accurately describes your data. As such you should put different kinds of information in different kinds of tables. You should also put as many constraints on your data as you think is reasonable.
All of these goals tend to make queries a teeny bit longer, as you have to join different tables to get the desired information, but with the right names for tables and columns, this shouldn't be a burden from the point of view of readability.
More importantly, these goals can have an effect on performance. You should monitor your actual load to see if your database is performing adequately. If nearly all of your queries are returning quickly, and you have lots of CPU headroom for more queries, then you're done.
If you find that write queries are taking long, make sure you don't denormalize your data. You will make the database work harder to keep things consistent, since it will have to do many reads followed by many more writes. Instead, you want to look at your indexes. Do you have indexes on columns you rarely query? Do you have indexes that are needed to verify the integrity of an update?
If read queries are your bottleneck, then once again, you want to start by looking at your indexes. Do you need to add an index or two to avoid table scans? If you just can't avoid the table scans, are there any things you could do to make each row smaller, like by reducing the number of characters in a varchar column, or splitting rarely queried columns into another table to be joined upon when they are needed.
If there is a specific slow query that always uses the same join, then that query might benefit from denormalization. First verify that reads on those tables strongly outnumber writes. Determine which columns you need from one table to add to the other. You might want to use a slightly different name to those columns so that it's more obvious that they are from denormalization. Alter your write logic to update both the original table used in the join, and the denormalized fields.
It's important to note that you aren't removing the old table. The problem with denormalized data is that while it accelerates the specific query it was designed for, it tends to complicate other queries. In particular, write queries must do more work to ensure that the data remains consistent, either by copying data from table to table, by doing additional subselects to make sure that the data is valid, or by jumping over other sorts of hurdles. By keeping the original table, you can leave all your old constraints in place, so at least those original columns are always valid. If you find for some reason that the denormalized columns are out of sync, you can switch back to the original, slower query and everything is valid, and then you can work on ways to rebuild the denormalized data.
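As a concrete (entirely made-up) illustration of the procedure described above, denormalizing a customer's name onto an orders table might look like this, with a distinct column name so its origin stays obvious:

ALTER TABLE Orders ADD denorm_CustomerName nvarchar(100);

UPDATE o
SET o.denorm_CustomerName = c.CustomerName
FROM Orders o
JOIN Customers c ON c.CustomerId = o.CustomerId;

-- From here on, any write that changes Customers.CustomerName must also refresh
-- Orders.denorm_CustomerName, and the Customers table itself stays in place.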
I agree with @Tom that normalizing makes more sense and provides flexibility. If you get your indexes right you shouldn't suffer too much by doing a join between the tables.
As for your normalized table, I'd add a type or code field so you can identify, say, Home, Home 1, Home 2, Business, Bus 1, Bus 2, Mobile, Mob1, etc...
Id int PK
ContactId int FK<->Contacts.Id
Code char(5)
PhoneNumber nvarchar(22)
Description nvarchar(100)
And store this type in a separate table with say other code/code description information
We tend to have a Code Group table with info such as
CODE_GROUP, CODE DESC
ST State
PH Phone Number
AD Address Type
And a CODE table with
CODE_ID, CODE_GROUP, DESCRIPTION
MB1 PH Mobile One
MB2 PH Mobile Two
NSW ST New South Wales
etc...
You can expand this out to have long description, short desc, ordering, filtering etc
How about an XML field in the Contacts table? It takes away the complexity of another table.
(Please correct me if this is a bad idea, I've never worked with XML fields before)

How to change string data with a mass update?

I have a table (T1) in T-SQL with a column (C1) that contains almost 30,000 rows of data.
Each row contains a value like MSA123, MSA245, MSA299, etc. I need to run an update script so the MSA part of the string changes to CMA. How can I do this?
update t1
set c1 = replace(c1,'MSA','CMA')
where c1 like 'MSA%'
I don't have SQL Server in front of me, but I believe that this will work:
UPDATE T1 SET C1 = REPLACE(C1, 'MSA', 'CMA');
You can use REPLACE function to do it.
In addition to what fallen888 posted, if there are other values in that table/column as well you can use the LIKE operator in the where clause to make sure you only update the records you care about:
... WHERE [C1] LIKE 'MSA[0-9][0-9][0-9]'
While replace will appear to work, what happens when you need to replace M with C for MSA but not for MCA? Or if you have MSAD as well as MSA in the data right now and you didn't want that changed (or CMSA). Do you even know for sure if you have any data being replaced that you didn't want replaced?
The proper answer is never to store data that way. First rule of database design is to only store one piece of information per field. You should have a related table instead. It will be easier to maintain over time.
I have to disagree with HLGEM's post. While it is true that first normal form talks about atomicity in E. F. Codd's original vision (it is the most controversial aspect of 1NF, IMHO), the original request does not necessarily mean that there are no related tables or that the value is not atomic.
MSA123 may be the natural key of the object in question, and the company may have simply decided to rename their product line. It is correct to say that if an artificial ID were used, even updates to the natural key would not require as many rows to be updated; but if that is what you are implying, then I would argue that artificial keys are definitely not the first rule of database design. They have their advantages, but they also have many disadvantages which I won't go into here; a little googling will turn up quite a bit of controversy on whether or not to use artificial primary keys.
As for the original request, others have already nailed it, REPLACE is the way to go.

'active' flag or not?

OK, so practically every database-based application has to deal with "non-active" records: either soft deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternative thoughts on an 'active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non-active records should be moved off to a separate table and, where appropriate, a UNION done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (1),
PARTITION inactive_records VALUES (0)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems to be a repeat of this question; as a newbie I need to ask, what's the procedure for dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
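For example, with the table from the question (spell the boolean as active = 1 if your database has no True literal):

CREATE VIEW active_people AS
SELECT * FROM people WHERE active = True;

-- Callers query the view and never have to remember the flag
SELECT * FROM active_people;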
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now we're moving to a two-table approach: when a new row is inserted, the previous row is moved to a history table. This gives us better performance for the majority of cases, which are looking at the current data.
The cost is slightly more than the old method: previously you had to update and insert, now you have to insert and update (i.e. instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of just the changes. That's hardly going to have any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table is required? Again, it looks straightforward:
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality of this column is exactly two! Any database query optimiser will ignore this index because of its low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If they mostly don't come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller and queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use depends on the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they won't be used, but also allows us to maintain data integrity with relations.
We use the "move to a separate table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used: one for Deleted, one for Disabled, etc. Or, if space is an issue, a flag for Disabled would suffice, and the row could actually be deleted when the user is deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - a couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - it's not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted and then copying to archive tables.
2) Of course the option to hard delete a record still exists - though we developers tend to be data pack-rats - I suggest that you look at the business process and decide if there is any need to even keep the data - if there is, do so... if there isn't, you should probably feel free just to throw the stuff away... again, according to the specific business scenario.
From a 'purist perspective' the relational model doesn't differentiate between a view and a table - both are relations. So the use of a view with the discriminator is perfectly meaningful and valid, provided the entities are correctly named, e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people, as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question, but for those searching for ways around low-cardinality/low-selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use a "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active; once it is inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows, since each record will have a (nearly) unique value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
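A small sketch of the setup, SQL Server flavoured (the id value is of course made up; CURRENT_TIMESTAMP is the standard spelling of "now"):

ALTER TABLE people ADD dateInactivated datetime NULL;

CREATE INDEX ix_people_dateInactivated ON people (dateInactivated);

-- "Delete" by stamping the time instead of flipping a flag
UPDATE people SET dateInactivated = CURRENT_TIMESTAMP WHERE id = 42;

-- Active rows are simply the ones never stamped
SELECT * FROM people WHERE dateInactivated IS NULL;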
Filtering big tables on a bit flag is not really good in terms of performance. In the case where 'active' indicates virtual deletion, you can create a 'TableName_deleted' table with the same structure and move deleted data there using a delete trigger.
That solution will help with performance and simplify data queries.
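A hedged sketch of that trigger approach in SQL Server syntax, reusing the people table from the question (other databases spell triggers differently, and the trigger must be created in its own batch):

-- An empty archive table with the same columns as the live table
SELECT * INTO people_deleted FROM people WHERE 1 = 0;

CREATE TRIGGER trg_people_archive_deletes
ON people
AFTER DELETE
AS
BEGIN
    -- Copy whatever was just deleted into the archive
    INSERT INTO people_deleted
    SELECT * FROM deleted;
END;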