Optimize a SQL Server query with conditional and formatted string join

I need to execute a query that joins two tables on fields named a.PatientAddress and b.ADDRESS. The issue is that b.ADDRESS needs to be standardized and formatted to match the standardized address found in a.PatientAddress. I don't have control over the incoming data format, so having the data scrubbed before it comes into my b table is not an option. Example:
a.PatientAddress may equal something like 1234 Someplace Cool Dr. Apt 1234, while the matching address in b.ADDRESS may equal something like 1234 Someplace Cool Dr. #1234 (in reality that is just one of many possibilities). The apartment number (if present in the address) is the area of fluctuation that needs formatting in order to join properly.
Some possible Apt variations I've seen in the data set are:
1234 Someplace Cool Dr. #1234
1234 Someplace Cool Dr. Apt 1234
1234 Someplace Cool Dr. Apt #1234
1234 Someplace Cool Dr. Apt # 1234
Now, for what I've already tried:
SELECT vgi.VisitNo
      ,vgi.AdmitDate
      ,vgi.ChargesTotal
      ,MONTH(vgi.AdmitDate) AS AdmitMonth
      ,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
      ,YEAR(vgi.AdmitDate) AS AdmitYear
      ,vgi.PatientAddress
      ,mm.MAIL_DATE
      ,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN MARKETING_MAILING mm ON vgi.AdmitDate >= mm.MAIL_DATE
    AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
    AND (
        -- IF APT IS NOT FOUND, THEN ADDRESS SHOULD DIRECTLY EQUAL ANOTHER ADDRESS
        (
            mm.ADDRESS NOT LIKE '%[$0-9]'
            AND UPPER(vgi.PatientAddress) = UPPER(mm.ADDRESS)
        )
        OR
        (
            mm.ADDRESS LIKE '%[$0-9]'
            -- PATIENT ADDRESS SHOULD EQUAL THE FORMATTED ADDRESS OF THE MAIL RECIPIENT
            AND UPPER(vgi.PatientAddress) =
                -- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
                SUBSTRING(mm.ADDRESS, 1, CHARINDEX(REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS) - 1)), '#', ''), mm.ADDRESS))
                + ' ' +
                -- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT:
                -- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
                REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS) - 1)), '#', '')
        )
    )
The problem here is that the query takes 20+ minutes to execute, and sometimes doesn't even finish before the operation times out. I've also tried splitting the two conditions up into UNION statements. I've also tried splitting the street address and apartment number to create a LIKE condition that reads UPPER(vgi.PatientAddress) LIKE UPPER('%1234 Someplace Cool Dr.%1234%'), and that doesn't seem to work either. I'm starting to run out of ideas and wanted to see what others could suggest.
Thanks in advance for any pointers or help!

The logic needed to scrub the data is beyond the scope of what we can do for you. You'll likely find that, ultimately, you need some other key for this query to ever work. However, assuming your existing logic is adequate to create a good match (even if slow), we might be able to help improve performance a bit.
One way you can improve things is to join on a projection of the address table that cleans the data (that is, join to a subquery). That projection might look like this:
SELECT Mail_Date, Address,
    CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
        -- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
        SUBSTRING(ADDRESS, 1, CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS) - 1)), '#', ''), ADDRESS))
        + ' ' +
        -- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT:
        -- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
        REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS) - 1)), '#', '')
    ELSE UPPER(ADDRESS)
    END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
This improves things because it avoids the "OR" condition in your JOIN; you simply match to the projected column. However, this will force the projection over every row in the table (hint: that was probably happening anyway), and so it's still not as fast as it could be. You can get an idea for whether this will help from how long it takes to run the projection by itself.
You can further improve on the projection method by adding the ADDRESS_CLEAN column above as a computed column on your MARKETING_MAILING table. This forces the adjustment to happen at insert time, meaning the work is already done by the time your slow query runs. You can even index the column. Of course, that comes at the cost of slower inserts. You might also try a view (or indexed view) on the table, which helps SQL Server save some of the work of computing that extra column across multiple queries. For best results, also think about what WHERE filters you can apply at the time you create the projection, to avoid ever needing to compute the extra column for those rows in the first place.
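If you do go the computed-column route, a minimal sketch might look like this (the column and index names are mine, and the expression is just the cleaning CASE from the projection above; PERSISTED requires a deterministic expression, which these string functions are):

ALTER TABLE MARKETING_MAILING
ADD ADDRESS_CLEAN AS (
    CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
        SUBSTRING(ADDRESS, 1, CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS) - 1)), '#', ''), ADDRESS))
        + ' ' +
        REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS) - 1)), '#', '')
    ELSE ADDRESS
    END
) PERSISTED;

-- Indexing the computed column lets the join seek instead of scan.
CREATE INDEX IX_MARKETING_MAILING_ADDRESS_CLEAN
    ON MARKETING_MAILING (ADDRESS_CLEAN)
    INCLUDE (MAIL_DATE);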
An additional note is that, for the default collation, you can skip using the UPPER() function. That is likely hurting your index use.
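You can verify which collation is in effect before dropping UPPER(); collation names containing _CI_ are case-insensitive:

-- Server default collation
SELECT SERVERPROPERTY('Collation') AS ServerCollation;

-- Collation of the specific column being compared
SELECT collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('MARKETING_MAILING')
  AND name = 'ADDRESS';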
Put it all together like this:
SELECT vgi.VisitNo
      ,vgi.AdmitDate
      ,vgi.ChargesTotal
      ,MONTH(vgi.AdmitDate) AS AdmitMonth
      ,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
      ,YEAR(vgi.AdmitDate) AS AdmitYear
      ,vgi.PatientAddress
      ,mm.MAIL_DATE
      ,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN
(
    SELECT Mail_Date, Address,
        CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
            -- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
            SUBSTRING(ADDRESS, 1, CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS) - 1)), '#', ''), ADDRESS))
            + ' ' +
            -- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT:
            -- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
            REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS) - 1)), '#', '')
        ELSE ADDRESS
        END AS ADDRESS_CLEAN
    FROM MARKETING_MAILING
) mm ON vgi.AdmitDate >= mm.MAIL_DATE
    AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
    AND vgi.PatientAddress = mm.ADDRESS_CLEAN
Another huge factor not yet covered is indexes. What indexes are on your VISIT_GENERAL_INFORMATION table? I'd especially like to see a single index that covers both AdmitDate and PatientAddress. Which order is best depends on the cardinality of those fields, and on how clean the MARKETING_MAILING data is and how much of it there is.
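As a starting point, a hypothetical covering index might look like this (swap the key order if PatientAddress turns out to be more selective for your data):

CREATE INDEX IX_VGI_AdmitDate_PatientAddress
    ON VISIT_GENERAL_INFORMATION (AdmitDate, PatientAddress)
    INCLUDE (VisitNo, ChargesTotal);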
Finally, one request of my own: if this helps, I'd like to hear back on just how much it helped. If the query used to take 20 minutes, how long does it take now?

I agree with @TomTom that you would really benefit from "pre-standardizing" into either
a derived column that updates on the fly
or a view or just a temp table in your query process
that gives you a clean match.
And for that, I would ideally use a third-party service or library, because they have spent a lot of time making the parsing reliable.
Either option works after receiving the data you can't control, so that is not a problem.
What you're doing is creating your own, internal copy that is standardized.
Of course, you're going to need to run the other side, "a", through the same standardization.
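As a rough sketch of the temp-table option (dbo.CleanAddress is a hypothetical scalar function wrapping the cleaning CASE expression from the first answer; the table and index names are mine):

SELECT MAIL_DATE,
       ADDRESS,
       dbo.CleanAddress(ADDRESS) AS ADDRESS_CLEAN  -- hypothetical wrapper around the cleaning logic
INTO #MailingClean
FROM MARKETING_MAILING;

CREATE INDEX IX_MailingClean ON #MailingClean (ADDRESS_CLEAN, MAIL_DATE);

-- Run the "a" side through the same function, then join on the clean columns.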

Related

Get a duplicate record only once (SQL)

SELECT DISTINCT A.LeaseID,
C.SerialNumber,
B.LeasedProjectNumber As 'ProjectNumber',
A.LeaseComment As 'LeaseContractComments'
FROM aLease A
LEFT OUTER JOIN aLeasedAsset B
ON a.LeaseID = B.LeaseID
LEFT OUTER JOIN aAsset C
ON B.LeasedProjectNumber = C.ProjectNumber AND B.PartID = C.aPartid
WHERE A.LeaseComment IS NOT NULL
I got this result from a query statement, but I don't want the last column (Comments) repeated for the 3 records in the second column.
I want the repeated comment written only once for the values in the second column, like a GROUP BY.
Alright, I'll take a stab at this. It's pretty unclear what exactly you're hoping for, but reading your comments, it sounds like you're looking to build a hierarchy of sorts in your table.
Something like this:
"Lease Terminated Jan 29, 2013 due to the event of..."
216 24914 87
216 724992 87
216 724993 87
"Other potential column"
217 2132 86
...
...
Unfortunately, I don't believe that's possible. SQL Server is pretty strict about returning a table, which is two-dimensional by definition. There's no good way to describe a hierarchy such as this in SQL. There is the hierarchyid type, but that's not really relevant here.
Given that, you only really have two options:
My preference 99% of the time: just accept the duplicates. Handle them in your procedural code later on, which probably does have support for these trees. Unless you're dealing with a performance-critical situation, or you're pulling back a lot of data (or really long comments), that should be totally fine.
If you're hoping to print this result directly to the user, or if network performance is a big issue, aggregate your columns into a single record for each comment. It's well-known that you can't have multiple values in the same column, for the very same reason as the above-listed result isn't possible. But what you could do, data and your own preferences permitting, is write an aggregate function to concatenate the grouped results into a single, comma-delimited column.
You'd likely then have to parse those commas out, though, so unless network traffic is your biggest concern, I'd really just do it procedural-side.
SELECT STUFF((SELECT DISTINCT ', ' + SerialNumber
              FROM [vLeasedAsset]
              WHERE A.LeaseID = LeaseID AND A.ProjectNumber = ProjectNumber
              FOR XML PATH ('')),
       1, 2, '') AS SerialNumber,
       [ProjectNumber],
       MAX(ContractComment) AS 'LeaseContractComment'
FROM [vLeasedAsset] A
WHERE ContractComment != ''
GROUP BY [ProjectNumber], LeaseID
Output:
SerialNumber     ProjectNumber
24914, 724993    87
23401, 720356    91

SQL - How to "flatten" several very similar rows into 1

Apologies if the title isn't clear - I just didn't know how to describe the issue and I really don't know SQL that well/at all.
I am working with a database used by our case management system. At places it has clearly been extended over time by the developers. I am working with Contact details (names, addresses, etc...) and they have added extra fields to deal with email addresses and to allow for home/work/mobile phone numbers etc...
The problem is that they haven't added a separate column for each new piece of data. They have instead added a pair of fields in 2 different tables: the first field holds the name of the data item, the second holds the actual data.
The first is called AddElTypeText in a table called AdditionalAddElTypes - The AddElTypeText field includes values like "Work Telephone 1", "Fax", "Home Email" etc... (There are a total of 10 different values and I can't see the developers expanding this number any time soon)
The second field is called AddElementText in a table called AdditionalAddressElements - the AddElementText then includes the actual data e.g. the phone number, email address.
For those of you who (unlike me) find it easier to look at the SQL code, it's:
SELECT
Address.AddressLine1
,AdditionalAddElTypes.AddElTypeText
,AdditionalAddressElements.AddElementText
FROM
Address
INNER JOIN AdditionalAddressElements
ON Address.AddressID = AdditionalAddressElements.AddressID
INNER JOIN AdditionalAddElTypes
ON AdditionalAddressElements.AddElTypeID = AdditionalAddElTypes.AddElTypeID
I can work with this, but if any contact has 2 or more "additional" elements, I get multiple rows, with most of the data being the same, but just the 2 columns of AddElTypeText and AddElementText being different.
So can anyone suggest anything to "flatten" a contact into a single row. I had in mind something like concatenating AddElTypeText and AddElementText into a single string field, ideally with a space in between AddElTypeText and AddElementText, and then a : or , separating the pairs of AddElTypeText and AddElementText.
However, I have very little idea how to achieve that, or whether an entirely different approach would be better. Any help very gratefully received!
Gary
As @twn08 said, this type of question has been asked before. It's generally a pain to do this kind of grouped concatenation in SQL Server, and it involves the use of FOR XML.
That being said, here's a SQLFiddle that (I believe) does something like what you wanted. And here's the actual query:
WITH Results AS
(
SELECT a.*,
t.AddElTypeText,
aa.AddElementText
FROM
Address a
INNER JOIN
AdditionalAddressElements aa
ON a.AddressID = aa.AddressID
INNER JOIN
AdditionalAddElTypes t
ON aa.AddElTypeID = t.AddElTypeID
)
SELECT
res.AddressID,
STUFF((
SELECT ', ' + AddElTypeText + ': ' + AddElementText
FROM Results
WHERE (AddressID = res.AddressID)
FOR XML PATH (''))
,1,2,'') AS AdditionalElements
FROM Results res
GROUP BY res.AddressID
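On SQL Server 2017 and later, STRING_AGG expresses the same grouped concatenation more directly. A sketch against the same CTE (my addition, not from the original answer):

WITH Results AS
(
    SELECT a.AddressID,
           t.AddElTypeText,
           aa.AddElementText
    FROM Address a
    INNER JOIN AdditionalAddressElements aa ON a.AddressID = aa.AddressID
    INNER JOIN AdditionalAddElTypes t ON aa.AddElTypeID = t.AddElTypeID
)
SELECT AddressID,
       STRING_AGG(AddElTypeText + ': ' + AddElementText, ', ') AS AdditionalElements
FROM Results
GROUP BY AddressID;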

Optimizing a strange MySQL Query

Hoping someone can help with this. I have a query that pulls data from a PHP application and turns it into a view for use in a Ruby on Rails application. The PHP app's table is an E-A-V style table, with the following business rules:
Given fields: First Name, Last Name, Email Address, Phone Number and Mobile Phone Carrier:
Each property has two custom fields defined: one being required, one being not required. Clients can use either one, and different clients use different ones based on their own rules (e.g. Client A may not care about First and Last Name, but client B might)
The RoR app must treat each "pair" of properties as only a single property.
Now, here is the query. The problem is it runs beautifully with around 11,000 records. However, the real database has over 40,000 and the query is extremely slow, taking roughly 125 seconds to run which is totally unacceptable from a business perspective. It's absolutely required that we pull this data, and we need to interface with the existing system.
The UserID part is to fake out a Rails-esque foreign key which relates to a Rails table. I'm a SQL Server guy, not a MySQL guy, so maybe someone can point out how to improve this query? They (the business) demand that it be sped up but I'm not sure how to since the various group_concat and ifnull calls are required due to the fact that I need every field for every client and then have to combine the data.
select `ls`.`subscriberid` AS `id`,left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) AS `user_id`,
ifnull(min((case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end)),_utf8'') AS `first_name`,
ifnull(min((case when (`s`.`fieldid` in (3,36)) then `s`.`data` else NULL end)),_utf8'') AS `last_name`,
ifnull(`ls`.`emailaddress`,_utf8'') AS `email_address`,
ifnull(group_concat((case when (`s`.`fieldid` = 81) then `s`.`data` when (`s`.`fieldid` = 154) then `s`.`data` else NULL end) separator ''),_utf8'') AS `mobile_phone`,
ifnull(group_concat((case when (`s`.`fieldid` = 100) then `s`.`data` else NULL end) separator ','),_utf8'') AS `sms_only`,
ifnull(group_concat((case when (`s`.`fieldid` = 34) then `s`.`data` else NULL end) separator ','),_utf8'') AS `mobile_carrier`
from ((`list_subscribers` `ls`
join `lists` `l` on((`ls`.`listid` = `l`.`listid`)))
left join `subscribers_data` `s` on((`ls`.`subscriberid` = `s`.`subscriberid`)))
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
group by `ls`.`subscriberid`,`l`.`name`,`ls`.`emailaddress`
EDIT
I removed the regexp and that sped the query up to about 20 seconds, instead of nearly 120 seconds. If I could remove the group by then it would be faster, but I cannot as removing this causes it to duplicate rows with blank data for each field, instead of aggregating them. For instance:
With group by
id  user_id  first_name  last_name  email_address     mobile_phone  sms_only  mobile_carrier
1   1        John        Doe       jdoe@example.com   5551234567    0         Sprint
Without group by
id  user_id  first_name  last_name  email_address     mobile_phone  sms_only  mobile_carrier
1   1        John                  jdoe@example.com
1   1                    Doe       jdoe@example.com
1   1                              jdoe@example.com
1   1                              jdoe@example.com   5551234567
And so on. What we need is the first result.
EDIT #2
The query still seems to take a long time, but earlier today it was running in only about 20 seconds on the production database. Without changing a thing, the same query is now once again taking over 60 seconds. This is still unacceptable. Any other ideas on how to improve this?
That is, without a doubt, the second most hideous SQL query I have ever laid my eyes on :-)
My advice is to trade storage requirements for speed. This is a common trick used when you find your queries have a lot of per-row functions (ifnull, case and so forth). These per-row functions never scale very well as the table becomes larger.
Create new fields in the table which will hold the values you want to extract and then calculate those values on insert/update (with a trigger) rather than select. This doesn't technically break 3NF since the triggers guarantee data consistency between columns.
The vast majority of database tables are read far more often than they're written so this will amortise the cost of the calculation across many selects. In addition, just about every reported problem with databases is one of speed, not storage.
An example of what I mean. You can replace:
case when (`s`.`fieldid` in (2,35)) then `s`.`data` else NULL end
with:
`s`.`data_2_35`
in your query if your insert/update trigger simply sets the data_2_35 column to data or NULL depending on the value of fieldid. Then you index data_2_35 and, voila, instant speed improvement at the cost of a little storage.
This trick can be done to the five case clauses, the left/regexp bit and the "naked" ifnull function as well (the ifnull functions containing min and group_concat may be harder to do).
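For illustration, a minimal MySQL sketch of one such trigger (the column, trigger, and index names are mine; you would want a matching BEFORE UPDATE trigger as well):

ALTER TABLE subscribers_data ADD COLUMN data_2_35 VARCHAR(255) NULL;

CREATE TRIGGER subscribers_data_bi
BEFORE INSERT ON subscribers_data
FOR EACH ROW
SET NEW.data_2_35 = IF(NEW.fieldid IN (2,35), NEW.data, NULL);

-- Index the derived column so lookups on it can use an index instead of a scan.
CREATE INDEX ix_subscribers_data_2_35 ON subscribers_data (data_2_35);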
The problem is most likely the WHERE condition:
where (left(`l`.`name`,(locate(_utf8'_',`l`.`name`) - 1)) regexp _utf8'[[:digit:]]+')
This looks like a complex string comparison, so no index can be used, which results in a full table scan, possibly for every row in the result set. I am not a MySQL expert, but if you can simplify this into simpler column comparisons it will probably run much faster.
The first thing that jumps out at me as the source of all the trouble:
The PHP app's table is an E-A-V style table...
Trying to convert data in EAV format into conventional relational format on the fly using SQL is bound to be awkward and inefficient. So don't try to smash it into a conventional column-per-attribute format. The following query returns multiple rows per subscriber, one row per EAV attribute:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154))
WHERE SUBSTRING_INDEX(l.name, _utf8'_', 1) REGEXP _utf8'[[:digit:]]+'
This eliminates the GROUP BY which is not optimized well in MySQL -- it usually incurs a temporary table which kills performance.
id  user_id  email_address     fieldid  data
1   1        jdoe@example.com  2        John
1   1        jdoe@example.com  3        Doe
1   1        jdoe@example.com  81       5551234567
But you'll have to sort out the EAV attributes in application code. That is, you can't seamlessly use ActiveRecord in this case. Sorry about that, but that's one of the disadvantages of using a non-relational design like EAV.
The next thing that I notice is the killer string manipulation (even after I've simplified it with SUBSTRING_INDEX()). When you're picking substrings out of a column, that says to me that you've overloaded one column with two distinct pieces of information. One is the name and the other is some kind of list-type attribute that you would use to filter the query. Store one piece of information in each column.
You should add a column for this attribute, and index it. Then the WHERE clause can utilize the index:
SELECT ls.subscriberid AS id,
SUBSTRING_INDEX(l.name, _utf8'_', 1) AS user_id,
COALESCE(ls.emailaddress, _utf8'') AS email_address,
s.fieldid, s.data
FROM list_subscribers ls JOIN lists l ON (ls.listid = l.listid)
LEFT JOIN subscribers_data s ON (ls.subscriberid = s.subscriberid
AND s.fieldid IN (2,3,34,35,36,81,100,154))
WHERE l.list_name_contains_digits = 1;
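A sketch of how that flag column could be added and backfilled (the ALTER/UPDATE below are my assumption of the setup, reusing the digit test from the original WHERE clause):

ALTER TABLE lists ADD COLUMN list_name_contains_digits TINYINT NOT NULL DEFAULT 0;

-- Backfill from the existing name column; the boolean expression yields 1 or 0.
UPDATE lists
SET list_name_contains_digits = (SUBSTRING_INDEX(name, '_', 1) REGEXP '[[:digit:]]+');

CREATE INDEX ix_lists_contains_digits ON lists (list_name_contains_digits);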
Also, you should always analyze an SQL query with EXPLAIN if it's important for it to have good performance. There's an analogous feature in MS SQL Server, so you should be accustomed to the concept, but the MySQL terminology may be different.
You'll have to read the documentation to learn how to interpret the EXPLAIN report in MySQL, there's too much info to describe here.
Re your additional info: Yes, I understand you can't do away with the EAV table structure. Can you create an additional table? Then you can load the EAV data into it:
CREATE TABLE subscriber_mirror (
subscriberid INT PRIMARY KEY,
first_name VARCHAR(100),
last_name VARCHAR(100),
first_name2 VARCHAR(100),
last_name2 VARCHAR(100),
mobile_phone VARCHAR(100),
sms_only VARCHAR(100),
mobile_carrier VARCHAR(100)
);
INSERT INTO subscriber_mirror (subscriberid)
SELECT DISTINCT subscriberid FROM list_subscribers;
UPDATE subscriber_data s JOIN subscriber_mirror m USING (subscriberid)
SET m.first_name = IF(s.fieldid = 2, s.data, m.first_name),
m.last_name = IF(s.fieldid = 3, s.data, m.last_name),
m.first_name2 = IF(s.fieldid = 35, s.data, m.first_name2),
m.last_name2 = IF(s.fieldid = 36, s.data, m.last_name2),
m.mobile_phone = IF(s.fieldid = 81, s.data, m.mobile_phone),
m.sms_only = IF(s.fieldid = 100, s.data, m.sms_only),
m.mobile_carrier = IF(s.fieldid = 34, s.data, m.mobile_carrier);
This will take a while, but you only need to do it when you get a new data update from the vendor. Subsequently you can query subscriber_mirror in a much more conventional SQL query:
SELECT ls.subscriberid AS id, l.name+0 AS user_id,
COALESCE(s.first_name, s.first_name2) AS first_name,
COALESCE(s.last_name, s.last_name2) AS last_name,
COALESCE(ls.emailaddress, '') AS email_address,
COALESCE(s.mobile_phone, '') AS mobile_phone,
COALESCE(s.sms_only, '') AS sms_only,
COALESCE(s.mobile_carrier, '') AS mobile_carrier
FROM lists l JOIN list_subscribers ls USING (listid)
JOIN subscriber_mirror s USING (subscriberid)
WHERE l.name+0 > 0
As for the userid that's embedded in the l.name column, if the digits are the leading characters in the column value, MySQL allows you to convert to an integer value much more easily:
An expression like '123_bill'+0 yields an integer value of 123. An expression like 'bill_123'+0 has no digits at the beginning, so it yields an integer value of 0.
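A quick demonstration of that coercion:

SELECT '123_bill' + 0;  -- yields 123 (leading digits are converted)
SELECT 'bill_123' + 0;  -- yields 0 (no leading digits)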

SQL Group By

If I have a set of records
name   amount  Code
Dave   2       1234
Dave   3       1234
Daves  4       1234
I want this to group based on Code & Name, but the last row has a typo in the name, so this won't group.
What would be the best way to group these as:
Dave/Daves 9 1234
As a general rule if the data is wrong you should fix the data.
However, if you want to produce the report anyway, you could come up with another criterion to group on; for example, LEFT(Name, 4) would group on the first 4 characters of the name.
You may also want to consider the CASE statement as a method (CASE WHEN name = 'Daves' THEN 'Dave' ELSE name END), but I really don't like this method, especially if you are proposing to use it for anything other than a one-off report.
If a workaround is acceptable, try:
SELECT cname, SUM(amount)
FROM (
    SELECT CASE WHEN NAME = 'Daves' THEN 'Dave' ELSE name END AS cname, amount
    FROM mytable
) t
GROUP BY cname
This of course will handle only this exact case.
For MySQL:
select
group_concat(distinct name separator '/'),
sum(amount),
code
from
T
group by
code
For MSSQL 2005+, group_concat() can be implemented as a .NET (CLR) custom aggregate.
Fix the typo? Otherwise grouping on the name is going to create a new group.
Fixing your data should be your highest priority instead of trying to devise ways to "work around" it.
It should also be noted that if you have this single typo in your data, it is likely that you have (or will have at some point in the future) even more screwy data that will not cleanly fit into your code, which will force you to invent more and more "work arounds" to deal with it, when you should be focusing on the cleanliness of your data.
If the name field is supposed to be a key, then the assumption has to be that Dave and Daves are two different items altogether, and thus should be grouped differently. If, however, it is a typo, then as others have suggested, fix the data.
Grouping on a freeform text field, if that is what this is, will always have issues. Data entry is never 100%.
To me it makes more sense to group on the code alone, if that is the key field, and leave name out of the grouping altogether.

How to sort and display mixed lists of alphas and numbers as the users expect?

Our application has a CustomerNumber field. We have hundreds of different people using the system (each has their own login and their own list of CustomerNumbers). An individual user might have at most 100,000 customers. Many have less than 100.
Some people only put actual numbers into their customer number fields, while others use a mixture of things. The system allows 20 characters which can be A-Z, 0-9 or a dash, and stores these in a VARCHAR2(20). Anything lowercase is made uppercase before being stored.
Now, let's say we have a simple report that lists all the customers for a particular user, sorted by Customer Number. e.g.
SELECT CustomerNumber,CustomerName
FROM Customer
WHERE User = ?
ORDER BY CustomerNumber;
This is a naive solution as the people that only ever use numbers do not want to see a plain alphabetic sort (where "10" comes before "9").
I do not wish to ask the user any unnecessary questions about their data.
I'm using Oracle, but I think it would be interesting to see some solutions for other databases. Please include which database your answer works on.
What do you think the best way to implement this is?
Probably your best bet is to pre-calculate a separate column and use that for ordering and use the customer number for display. This would probably involve 0-padding any internal integers to a fixed length.
The other possibility is to do your sorting post-select on the returned results.
Jeff Atwood has put together a blog posting about how some people calculate human friendly sort orders.
In Oracle 10g:
SELECT cust_name
FROM t_customer c
ORDER BY
REGEXP_REPLACE(cust_name, '[0-9]', ''), TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+'))
This will sort by the first occurrence of a number, regardless of its position, i.e.:
customer1 < customer2 < customer10
cust1omer ? customer1
cust8omer1 ? cust8omer2
, where a ? means that the order is undefined.
That suffices for most cases.
To force sort order on case 2, you may add REGEXP_INSTR(cust_name, '[0-9]', 1, n) to the ORDER BY list n times, forcing order on the first appearance of the n-th (2nd, 3rd, etc.) group of digits.
To force sort order on case 3, you may add TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+', 1, n)) to the ORDER BY list n times, forcing the order of the n-th group of digits.
In practice, the query I wrote is enough.
You may create a function-based index on these expressions, but you'll need to force its use with a hint, and a one-pass SORT ORDER BY will be performed anyway, as the CBO doesn't trust function-based indexes enough to allow an ORDER BY on them.
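Such a function-based index might look like this (a sketch; the index name is mine, the expressions are the ones from the ORDER BY above):

CREATE INDEX ix_cust_name_sort ON t_customer (
    REGEXP_REPLACE(cust_name, '[0-9]', ''),
    TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+'))
);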
You could have a numeric column [CustomerNumberInt] that is only used when the CustomerNumber is purely numeric (NULL otherwise[1]), then
ORDER BY CustomerNumberInt, CustomerNumber
[1] depending on how your SQL version handles NULLs in ORDER BY you might want to default it to zero (or infinity!)
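In Oracle, populating such a helper column might look like this (a sketch; assumes a nullable numeric column CustomerNumberInt has been added):

UPDATE Customer
SET CustomerNumberInt =
    CASE WHEN REGEXP_LIKE(CustomerNumber, '^[0-9]+$')
         THEN TO_NUMBER(CustomerNumber)
    END;  -- stays NULL when the value isn't purely numeric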
I have a similar horrible situation and have developed a suitably horrible function to deal with it (SQLServer)
In my situation I have a table of "units" (this is a work-tracking system for students, so unit in this context represents a course they're doing). Units have a code, which for the most part is purely numeric, but for various reasons it was made a varchar and they decided to prefix some by up to 5 characters. So they expect 53,123,237,356 to sort normally, but also T53, T123, T237, T356
UnitCode is an nvarchar(30)
Here's the body of the function:
declare @sortkey nvarchar(30)
select @sortkey =
    case
        when @unitcode like '[^0-9][0-9]%' then left(@unitcode, 1) + left('000000000000000000000000000000', 30 - len(@unitcode)) + right(@unitcode, len(@unitcode) - 1)
        when @unitcode like '[^0-9][^0-9][0-9]%' then left(@unitcode, 2) + left('000000000000000000000000000000', 30 - len(@unitcode)) + right(@unitcode, len(@unitcode) - 2)
        when @unitcode like '[^0-9][^0-9][^0-9][0-9]%' then left(@unitcode, 3) + left('000000000000000000000000000000', 30 - len(@unitcode)) + right(@unitcode, len(@unitcode) - 3)
        when @unitcode like '[^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode, 4) + left('000000000000000000000000000000', 30 - len(@unitcode)) + right(@unitcode, len(@unitcode) - 4)
        when @unitcode like '[^0-9][^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode, 5) + left('000000000000000000000000000000', 30 - len(@unitcode)) + right(@unitcode, len(@unitcode) - 5)
        when @unitcode like '%[^0-9]%' then @unitcode
        else left('000000000000000000000000000000', 30 - len(@unitcode)) + @unitcode
    end
return @sortkey
I wanted to shoot myself in the face after writing that, however it works and seems not to kill the server when it runs.
I used this in SQL Server and it works great. The solution here is to pad the numeric values with a leading character so that all values are of the same string length.
Here is an example using that approach:
select MyCol
from MyTable
order by
    case IsNumeric(MyCol)
        when 1 then Replicate('0', 100 - Len(MyCol)) + MyCol
        else MyCol
    end
The 100 should be replaced with the actual length of that column.