How to sort and display mixed lists of alphas and numbers as the users expect? - sql

Our application has a CustomerNumber field. We have hundreds of different people using the system (each has their own login and their own list of CustomerNumbers). An individual user might have at most 100,000 customers. Many have fewer than 100.
Some people only put actual numbers into their customer number fields, while others use a mixture of things. The system allows 20 characters which can be A-Z, 0-9 or a dash, and stores these in a VARCHAR2(20). Anything lowercase is made uppercase before being stored.
Now, let's say we have a simple report that lists all the customers for a particular user, sorted by Customer Number. e.g.
SELECT CustomerNumber,CustomerName
FROM Customer
WHERE User = ?
ORDER BY CustomerNumber;
This is a naive solution as the people that only ever use numbers do not want to see a plain alphabetic sort (where "10" comes before "9").
I do not wish to ask the user any unnecessary questions about their data.
I'm using Oracle, but I think it would be interesting to see some solutions for other databases. Please include which database your answer works on.
What do you think the best way to implement this is?

Probably your best bet is to pre-calculate a separate column and use that for ordering and use the customer number for display. This would probably involve 0-padding any internal integers to a fixed length.
The other possibility is to do your sorting post-select on the returned results.
Jeff Atwood has put together a blog posting about how some people calculate human friendly sort orders.
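As a rough sketch of the pre-calculated sort column idea (not tied to any particular database), the Python below zero-pads every run of digits to a fixed width. The function name `natural_sort_key` and the width of 20 (taken from the column length in the question) are assumptions for illustration; the result would be stored in a separate, wider column and used only for ordering.

```python
import re

PAD_WIDTH = 20  # assumed: matches the VARCHAR2(20) column in the question


def natural_sort_key(customer_number: str) -> str:
    """Return a string whose plain alphabetic order matches the human-friendly order."""
    # Left-pad each run of digits with zeros so "9" and "10" compare numerically
    return re.sub(r"\d+", lambda m: m.group().zfill(PAD_WIDTH), customer_number)


numbers = ["10", "9", "A-10", "A-9", "100"]
print(sorted(numbers, key=natural_sort_key))
# ['9', '10', '100', 'A-9', 'A-10']
```

Because each digit run is padded, the key can be longer than the original value, so the sort column needs more room than the display column.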

In Oracle 10g:
SELECT cust_name
FROM t_customer c
ORDER BY
REGEXP_REPLACE(cust_name, '[0-9]', ''), TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+'))
This will sort by the first occurrence of a number, regardless of its position, i.e.:
customer1 < customer2 < customer10
cust1omer ? customer1
cust8omer1 ? cust8omer2
, where a ? means that the order is undefined.
That suffices for most cases.
To force the sort order in case 2, you may add REGEXP_INSTR(cust_name, '[0-9]+', 1, n) to the ORDER BY list once for each n, forcing order on the position of the n-th (1st, 2nd, 3rd, etc.) group of digits.
To force the sort order in case 3, you may add TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+', 1, n)) to the ORDER BY list once for each n, forcing order on the value of the n-th group of digits.
In practice, the query I wrote is enough.
You may create a function-based index on these expressions, but you'll need to force it with a hint, and a one-pass SORT ORDER BY will be performed anyway, as the CBO doesn't trust function-based indexes enough to allow an ORDER BY on them.
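To make the ordering rules above easier to see, here is a small Python emulation of the two ORDER BY expressions. It is only an illustration, not Oracle code; the function name `regex_order_key` is made up, and the extra flag mimics Oracle's default of placing NULL last in an ascending sort.

```python
import re


def regex_order_key(name: str):
    text_part = re.sub(r"[0-9]", "", name)  # like REGEXP_REPLACE(name, '[0-9]', '')
    match = re.search(r"[0-9]+", name)      # like REGEXP_SUBSTR(name, '[0-9]+')
    # Names without digits sort after names with digits for the same text part
    return (text_part, match is None, int(match.group()) if match else 0)


names = ["customer10", "customer2", "customer1"]
print(sorted(names, key=regex_order_key))
# ['customer1', 'customer2', 'customer10']

# The "undefined order" cases produce identical keys:
print(regex_order_key("cust1omer") == regex_order_key("customer1"))    # True
print(regex_order_key("cust8omer1") == regex_order_key("cust8omer2"))  # True
```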

You could have a numeric column [CustomerNumberInt] that is only used when the CustomerNumber is purely numeric (NULL otherwise[1]), then
ORDER BY CustomerNumberInt, CustomerNumber
[1] depending on how your SQL version handles NULLs in ORDER BY you might want to default it to zero (or infinity!)
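A minimal sketch of this idea, using Python's built-in sqlite3 module purely as a stand-in database. The helper `add_customer` is hypothetical, and the explicit `IS NULL` term in the ORDER BY is one way to avoid depending on where a given database places NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerNumber TEXT, CustomerNumberInt INTEGER)")


def add_customer(number: str) -> None:
    # Populate the numeric column only when the value is purely digits
    numeric = int(number) if number.isdigit() else None
    conn.execute("INSERT INTO Customer VALUES (?, ?)", (number, numeric))


for n in ["10", "9", "A-2", "100", "B7"]:
    add_customer(n)

rows = conn.execute(
    "SELECT CustomerNumber FROM Customer "
    "ORDER BY CustomerNumberInt IS NULL, CustomerNumberInt, CustomerNumber"
).fetchall()
ordered = [r[0] for r in rows]
print(ordered)
# ['9', '10', '100', 'A-2', 'B7']
```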

I have a similar horrible situation and have developed a suitably horrible function to deal with it (SQL Server).
In my situation I have a table of "units" (this is a work-tracking system for students, so unit in this context represents a course they're doing). Units have a code, which for the most part is purely numeric, but for various reasons it was made a varchar and they decided to prefix some by up to 5 characters. So they expect 53,123,237,356 to sort normally, but also T53, T123, T237, T356
UnitCode is a nvarchar(30)
Here's the body of the function:
declare @sortkey nvarchar(30)
select @sortkey =
case
when @unitcode like '[^0-9][0-9]%' then left(@unitcode,1) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-1)
when @unitcode like '[^0-9][^0-9][0-9]%' then left(@unitcode,2) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-2)
when @unitcode like '[^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,3) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-3)
when @unitcode like '[^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,4) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-4)
when @unitcode like '[^0-9][^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,5) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-5)
when @unitcode like '%[^0-9]%' then @unitcode
else left('000000000000000000000000000000',30-len(@unitcode)) + @unitcode
end
return @sortkey
I wanted to shoot myself in the face after writing that, however it works and seems not to kill the server when it runs.

I used this in SQL Server and it works great. The solution is to pad the numeric values with a character in front so that all are of the same string length.
Here is an example using that approach:
select MyCol
from MyTable
order by
case IsNumeric(MyCol)
when 1 then Replicate('0', 100 - Len(MyCol)) + MyCol
else MyCol
end
The 100 should be replaced with the actual length of that column.
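The same padding approach can be tried out with Python's sqlite3 module, as sketched below. This is an adaptation, not the SQL Server query itself: the sample values and the width of 20 are assumptions, and a digits-only GLOB test stands in for IsNumeric, which in SQL Server also returns 1 for some values that are not plain digits (for example '1E5' or a lone '-').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyCol TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?)",
    [("10",), ("9",), ("T53",), ("123",), ("1E5",)],
)

query = """
SELECT MyCol
FROM MyTable
ORDER BY
  CASE
    -- digits only: no character outside 0-9 anywhere in the value
    WHEN MyCol <> '' AND MyCol NOT GLOB '*[^0-9]*'
      THEN substr('00000000000000000000' || MyCol, -20)  -- left-pad to 20 chars
    ELSE MyCol
  END
"""
ordered = [row[0] for row in conn.execute(query)]
print(ordered)
# ['9', '10', '123', '1E5', 'T53']
```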

Related

How to query column with letters on SQL?

I'm new to this.
I have a column (chocolate_weight) in the table (Chocolate) which has g at the end of every number, so 30x , 2x5g,10g etc.
I want to remove the letter at the end and then query it to show any that weigh greater than 35.
So far I have done
Select *
From Chocolate
Where chocolate_weight IN
(SELECT
REPLACE(chocolote_weight,'x','') From Chocolate) > 35
It is coming back with 0 , even though there are many that weigh more than 35.
Any help is appreciated
Thanks
If 'g' is always the suffix then your current query is along the right lines, but you don't need the IN; you can do the replace in the where clause:
SELECT *
FROM Chocolate
WHERE CAST(REPLACE(chocolate_weight,'g','') AS DECIMAL(10, 2)) > 35;
N.B. This works in both the tagged DBMS SQL-Server and MySQL
This will fail (although only silently in MySQL) if you have anything that contains units other than grams, so I would strongly suggest that you fix your design if it is not too late: store the weight as a numeric type and lose the 'g' completely if you only ever store in grams. If you use multiple different units then you may wish to standardise this so all are in grams, or alternatively store the two things in separate columns, one as a decimal/int for the numeric value and a separate column for the unit, e.g.
Weight  Unit
10      g
150     g
1000    lb
The issue you will have here though is that you will have to start doing conversions in your queries to ensure you get all results. It is easier to do the conversion once when the data is saved and use a standard measure for all records.
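The "convert once when the data is saved" step might look something like the following Python sketch. The function name `to_grams` and the set of supported units are assumptions for illustration; only the conversion factors themselves are standard (1 lb is defined as 453.59237 g).

```python
import re
from decimal import Decimal

# Assumed unit table; extend it with whatever units the data really contains
GRAMS_PER_UNIT = {
    "g": Decimal("1"),
    "kg": Decimal("1000"),
    "lb": Decimal("453.59237"),
}


def to_grams(raw: str) -> Decimal:
    """Parse a value such as '10g' or '1000lb' and return the weight in grams."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", raw)
    if match is None:
        raise ValueError(f"unrecognised weight: {raw!r}")
    value, unit = Decimal(match.group(1)), match.group(2).lower()
    if unit not in GRAMS_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * GRAMS_PER_UNIT[unit]


print(to_grams("10g"))     # 10
print(to_grams("1000lb"))  # 453592.37000
```

With every row stored in grams as a numeric value, the original filter becomes a plain comparison against 35 with no string handling.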

Repeating operations vs multilevel queries

I have always been bothered by how I should approach these and which solution is better. I guess the sample code should explain it better.
Let's imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, with each calculation being based on a previous one. In other words, something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE 'A%'
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE 'A%'
The other one involves constructing a multilevel query like this:
SELECT
    t2.*,
    (t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
    SELECT
        t1.*,
        t1.Value / t1.NewValue1 AS SomeOtherValue
    FROM
    (
        SELECT
            *,
            Value + 10 AS NewValue1
        FROM
            MyTable
        WHERE
            Name LIKE 'A%'
    ) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tried a number of different combinations of calculations in both variants. They always produced the same execution plan, so it can be assumed that there is no difference in performance. From the code usability perspective the first approach is obviously better as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes back to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the variable is defined before it is evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.
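For comparison, the CTE form of the multi-level query can be sketched as follows. It runs against SQLite through Python's sqlite3 module only so that it is easy to try; the sample rows and the CTE names `step1` and `step2` are made up, and integer division applies because the columns are integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Id INTEGER, Name TEXT, Value INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(1, "Alpha", 90), (2, "Avery", 10), (3, "Bravo", 50)],
)

# Each CTE adds one derived column that the next level can refer to by name
cte_query = """
WITH step1 AS (
    SELECT Id, Name, Value, Value + 10 AS NewValue1
    FROM MyTable
    WHERE Name LIKE 'A%'
),
step2 AS (
    SELECT step1.*, Value / NewValue1 AS SomeOtherValue
    FROM step1
)
SELECT Id, NewValue1, SomeOtherValue,
       (Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM step2
ORDER BY Id
"""

# The single-level version repeats the expressions instead
repeated_query = """
SELECT Id,
       Value + 10 AS NewValue1,
       Value / (Value + 10) AS SomeOtherValue,
       (Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM MyTable
WHERE Name LIKE 'A%'
ORDER BY Id
"""

cte_rows = conn.execute(cte_query).fetchall()
repeated_rows = conn.execute(repeated_query).fetchall()
print(cte_rows)                   # [(1, 100, 0, 19), (2, 20, 0, 3)]
print(cte_rows == repeated_rows)  # True
```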

Optimize a SQL Server query with conditional and formatted string join

I need to execute a query that will join two tables on fields named a.PatientAddress and b.ADDRESS, the issue is that b.ADDRESS needs to be standardized and formatted to match the standardized address found in a.PatientAddress. I don't have control over the incoming data format, so having the data scrubbed before it comes into my b table is not an option. Example:
a.PatientAddress may equal something like 1234 Someplace Cool Dr. Apt 1234 while the matching address in b.ADDRESS may equal something like 1234 Someplace Cool Dr. #1234 (in reality that is just one of many possibilities). The Apartment number (if existent in the address) is the area of fluctuation that needs formatting in order to join properly.
Some possible Apt variations I've seen in the data set are:
1234 Someplace Cool Dr. #1234
1234 Someplace Cool Dr. Apt 1234
1234 Someplace Cool Dr. Apt #1234
1234 Someplace Cool Dr. Apt # 1234
Now, for what I've already tried:
SELECT vgi.VisitNo
,vgi.AdmitDate
,vgi.ChargesTotal
,MONTH(vgi.AdmitDate) AS AdmitMonth
,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
,YEAR(vgi.AdmitDate) AS AdmitYear
,vgi.PatientAddress
,mm.MAIL_DATE
,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN MARKETING_MAILING mm ON vgi.AdmitDate >= mm.MAIL_DATE
AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
AND (
-- IF APT IS NOT FOUND, THEN ADDRESS SHOULD DIRECTLY EQUAL ANOTHER ADDRESS
( mm.ADDRESS NOT LIKE '%[$0-9]'
AND UPPER(vgi.PatientAddress) = UPPER(mm.ADDRESS)
)
OR
(
mm.ADDRESS LIKE '%[$0-9]'
AND UPPER(vgi.PatientAddress) =
-- PATIENT ADDRESS SHOULD EQUAL THE FORMATTED ADDRESS OF THE MAIL RECIPIENT
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(mm.ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS)-1)),'#',''),mm.ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS)-1)),'#','')
)
)
The problem here is that the query takes 20+ minutes to execute, and sometimes doesn't even finish before the operation time expires. I've also tried splitting the two conditions up into UNION statements. I've also tried splitting the street address and apartment number to create a like statement that reads UPPER(vgi.PatientAddress) LIKE UPPER('%1234 Someplace Cool Dr.%1234%') and that doesn't seem to work either. I'm starting to run out of ideas and wanted to see what others could suggest.
Thanks in advance for any pointers or help!
The logic needed to scrub the data is beyond the scope of what we can do for you. You'll likely find that, ultimately, you need some other key for this query to ever work. However, assuming your existing logic is adequate to create a good match (even if slow), we might be able to help improve performance a bit.
One way you can improve things is to join on a projection of the address table that cleans the data. (That means join to a sub query). That projection might look like this:
SELECT Mail_Date, Address,
CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#',''),ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#','')
ELSE UPPER(ADDRESS)
END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
This improves things because it avoids the "OR" condition in your JOIN; you simply match to the projected column. However, this will force the projection over every row in the table (hint: that was probably happening anyway), and so it's still not as fast as it could be. You can get an idea for whether this will help from how long it takes to run the projection by itself.
You can further improve on the projection method by adding the ADDRESS_CLEAN column above as a computed column to your Marketing_Mailing table. This will force the adjustment to happen at insert time, meaning the work is already done for your slow query. You can even index on the column. Of course, that is at the cost of slower inserts. You might also try a view (or materialized view) on the table. This will help SQL Server save some of the work it does computing that extra column across multiple queries. For best results, also think about what WHERE filters you can use at the time you are creating the projection, to avoid needing to ever compute the extra column on those rows in the first place.
An additional note is that, for the default collation, you can skip using the UPPER() function. That is likely hurting your index use.
Put it all together like this:
SELECT vgi.VisitNo
,vgi.AdmitDate
,vgi.ChargesTotal
,MONTH(vgi.AdmitDate) AS AdmitMonth
,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
,YEAR(vgi.AdmitDate) AS AdmitYear
,vgi.PatientAddress
,mm.MAIL_DATE
,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN
(
SELECT Mail_Date, Address,
CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#',''),ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#','')
ELSE ADDRESS END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
) mm ON vgi.AdmitDate >= mm.MAIL_DATE
AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
AND vgi.PatientAddress = mm.ADDRESS_CLEAN
Another huge factor not yet covered is indexes. What indexes are on your VISIT_GENERAL_INFORMATION table? I'd especially like to see a single index that covers both AdmitDate and PatientAddress. The column order within that index depends on the cardinality of those fields, and on how clean and how large the data in the Marketing_Mailing table is.
Finally, one request of my own: if this helps, I'd like to hear back on just how much it helped. If the query used to take 20 minutes, how long does it take now?
I agree with @TomTom that you would really benefit from "pre-standardizing" into either
a derived column that updates on the fly
or a view or just a temp table in your query process
that gives you a clean match.
And with that, I would use a third-party service or library, ideally, because they have spent a lot of time making it a reliable parse.
Either option works after receiving the data you can't control, so that is not a problem.
What you're doing is creating your own, internal copy that is standardized.
Of course, you're going to need to run the other side, "a", through the same standardization.
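As a very rough illustration of what such a standardization step does, the Python sketch below normalizes only the four apartment variations listed in the question. The function name `normalize_address` and the target format ("APT 1234") are assumptions; real address data will need far more rules, which is why a dedicated service or library is the better option.

```python
import re

# Trailing apartment marker: "#1234", "Apt 1234", "Apt #1234" or "Apt # 1234".
# A bare trailing number (e.g. "100 Highway 50") is deliberately left alone.
_APT_SUFFIX = re.compile(r"\s+(?:APT\.?\s*#?\s*|#\s*)(\d+)$")


def normalize_address(address: str) -> str:
    """Uppercase, collapse whitespace and rewrite the apartment suffix."""
    cleaned = " ".join(address.upper().split())
    return _APT_SUFFIX.sub(r" APT \1", cleaned)


variants = [
    "1234 Someplace Cool Dr. #1234",
    "1234 Someplace Cool Dr. Apt 1234",
    "1234 Someplace Cool Dr. Apt #1234",
    "1234 Someplace Cool Dr. Apt # 1234",
]
print({normalize_address(v) for v in variants})
# {'1234 SOMEPLACE COOL DR. APT 1234'}
```

Running both tables through the same function and storing the result lets the join become a plain equality on the standardized columns.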

Custom SQL sort by

Use:
The user searches for a partial postcode such as 'RG20' which should then be displayed in a specific order. The query uses the MATCH AGAINST method in boolean mode where an example of the postcode in the database would be 'RG20 7TT' so it is able to find it.
At the same time it also matches against a list of other postcodes which are in its radius (which is a separate query).
I can't seem to find a way to order by a partial match, e.g.:
ORDER BY FIELD(postcode, 'RG20', 'RG14', 'RG18','RG17','RG28','OX12','OX11')
DESC, city DESC
Because it's not specifically looking for RG20 7TT, I don't think it can make a partial match.
I have tried SUBSTR (postcode, -4) and looked into left and right, but I haven't had any success using 'by field' and could not find another route...
Sorry this is a bit long winded, but I'm in a bit of a bind.
A UK postcode splits into 2 parts, the last section always being 3 characters and within my database there is a space between the two if that helps at all.
Although there is a DESC after the postcodes, I do need them to display in THAT particular order (RG20, RG14, then RG18, etc.). I'm unsure whether specifying descending will remove the ordering or not.
Order By Case
When postcode Like 'RG20%' Then 1
When postcode Like 'RG14%' Then 2
When postcode Like 'RG18%' Then 3
When postcode Like 'RG17%' Then 4
When postcode Like 'RG28%' Then 5
When postcode Like 'OX12%' Then 6
When postcode Like 'OX11%' Then 7
Else 99
End Asc
, City Desc
You're on the right track, trimming the field down to its first four characters:
ORDER BY FIELD(LEFT(postcode, 4), 'RG20', 'RG14', ...),
-- or SUBSTRING(postcode FROM 1 FOR 4)
-- or SUBSTR(postcode, 1, 4)
Here you don't want DESC.
(If your result set contains postcodes whose prefixes do not appear in your FIELD() ordering list, you'll have a bit more work to do, since those records will otherwise appear before any explicitly ordered records you specify. Before 'RG20' in the example above.)
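The extra work for unlisted prefixes amounts to giving them a rank after every listed one. A small Python sketch of that ranking logic (the name `prefix_rank` and the sample postcodes other than 'RG20 7TT' are made up for illustration):

```python
PREFIX_ORDER = ["RG20", "RG14", "RG18", "RG17", "RG28", "OX12", "OX11"]
RANK = {prefix: position for position, prefix in enumerate(PREFIX_ORDER)}


def prefix_rank(postcode: str) -> int:
    # Mirror LEFT(postcode, 4); prefixes not in the list rank after all listed ones
    return RANK.get(postcode[:4], len(PREFIX_ORDER))


postcodes = ["RG14 1AA", "SW1A 1AA", "OX11 0QX", "RG20 7TT"]
print(sorted(postcodes, key=prefix_rank))
# ['RG20 7TT', 'RG14 1AA', 'OX11 0QX', 'SW1A 1AA']
```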
If you want a completely custom sorting scheme, then I only see one way to do it...
Create a table to hold the values upon which to sort, and include a "sequence" or "sort_order" field. You can then join to this table and sort by the sequence field.
One note on the sequence field. It makes sense to create it as an int as... well, sequences are often ints :)
If there is any possibility of changing the sort order, you may want to consider making it alphanumeric... It is a lot easier to insert "5A" between "5" and "6" than it is to insert a number into a sequence of integers.
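A minimal sketch of the sort-table approach, run against SQLite through Python's sqlite3 module. The table names `listing` and `postcode_order`, the sample rows and the fallback value 9999 are all assumptions; the LEFT JOIN keeps postcodes that have no entry in the sort table and places them last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listing (postcode TEXT, city TEXT);
CREATE TABLE postcode_order (prefix TEXT PRIMARY KEY, sort_order INTEGER);
INSERT INTO listing VALUES ('RG14 1AA', 'Newbury'), ('OX12 9AB', 'Wantage'),
                           ('RG20 7TT', 'Newbury'), ('SW1A 1AA', 'London');
INSERT INTO postcode_order VALUES ('RG20', 1), ('RG14', 2), ('OX12', 6);
""")

rows = conn.execute("""
SELECT l.postcode
FROM listing AS l
LEFT JOIN postcode_order AS o ON o.prefix = substr(l.postcode, 1, 4)
ORDER BY COALESCE(o.sort_order, 9999), l.city DESC
""").fetchall()
ordered = [r[0] for r in rows]
print(ordered)
# ['RG20 7TT', 'RG14 1AA', 'OX12 9AB', 'SW1A 1AA']
```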
Another method I use is utilising the charindex function:
order by charindex(substr(postcode,1,4),"RG20RG14RG18...",1)
I think that's the syntax anyway, I'm just doing this in SAS at the moment so I've had to adapt from memory!
But essentially the sooner you hit your desired part of the string, the higher the rank.
If you're trying to rank on a large variety of postcodes then a case statement gets pretty hefty.

Sorting SQL by first two characters of fields

I'm trying to sort some data by sales person initials, and the sales rep field is 3 chars long, and is Firstname, Lastname and Account type. So, Bob Smith would be BS* and I just need to sort by the first two characters.
How can I pull all data for a certain rep, where the first two characters of the field equals BS?
In some databases you can actually do
select * from SalesRep order by substring(SalesRepID, 1, 2)
Others require you to
select *, Substring(SalesRepID, 1, 2) as foo from SalesRep order by foo
And in still others, you can't do it at all (but will have to sort your output in program code after you get it from the database).
Addition: If you actually want just the data for one sales rep, do as the others suggest. Otherwise, either you want to sort by the thing or maybe group by the thing.
What about this
SELECT * FROM SalesTable WHERE SalesRepField LIKE 'BS_'
I hope that you never end up with two sales reps who happen to have the same initials.
Also, sorting and filtering are two completely different things. You talk about sorting in the question title and first paragraph, but your question is about filtering. Since you can just ORDER BY on the field and it will use the first two characters anyway, I'll give you an answer for the filtering part.
You don't mention your RDBMS, but this will work in any product:
SELECT
my_columns
FROM
My_Table
WHERE
sales_rep LIKE 'BS%'
If you're using a variable/parameter then:
SELECT
my_columns
FROM
My_Table
WHERE
sales_rep LIKE @my_param + '%'
You can also use:
LEFT(sales_rep, 2) = 'BS'
I would stay away from:
SUBSTRING(sales_rep, 1, 2) = 'BS'
Depending on your SQL engine, it might not be smart enough to realize that it can use an index on the last one.
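Here is a small runnable sketch of the parameterised prefix filter, using Python's sqlite3 module as a stand-in. The sample rows and the helper name `rows_for_initials` are assumptions, and SQLite uses `||` rather than `+` for string concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE My_Table (sales_rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO My_Table VALUES (?, ?)",
    [("BSA", 100), ("BSB", 250), ("JDA", 75), ("BS", 10)],
)


def rows_for_initials(initials: str):
    # Bind the prefix as a parameter and append the wildcard inside the SQL
    return conn.execute(
        "SELECT sales_rep, amount FROM My_Table "
        "WHERE sales_rep LIKE ? || '%' ORDER BY sales_rep",
        (initials,),
    ).fetchall()


print(rows_for_initials("BS"))
# [('BS', 10), ('BSA', 100), ('BSB', 250)]
```

If the parameter could itself contain `%` or `_`, those characters would need escaping before being used in a LIKE pattern.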
You haven't said what DBMS you are using. The following would work in Oracle, and something like them in most other DBMSs
1) where sales_rep like 'BS%'
2) where substr(sales_rep,1,2) = 'BS'
SELECT * FROM SalesRep
WHERE SUBSTRING(SalesRepID, 1, 2) = 'BS'
You didn't say what database you were using, this works in MS SQL Server.