Why does SOUNDEX return irrelevant results? - sql

I wonder why:
WHERE 1=1
AND LTRIM(RTRIM(lastName)) ='Schmdli'
OR (
SOUNDEX(lastName) = SOUNDEX('Schmdli')
)
returns results like
lastName
Schöntal
Schindler-Külling
Schindler
Schmidlin
Schindler
Schmidli
Schmidli
Schindler
while I expect only:
Schmidli
Schmidli
Schmidlin
My first condition, AND LTRIM(RTRIM(lastName)) ='Schmdli', is there to match the exact value. With SOUNDEX I then expect only close matches to Schmdli, so results like
Schöntal
Schindler-Külling
Schindler
shouldn't appear.
Thanks

Trivial answer: because SOUNDEX is a simple algorithm with limited space (one letter and three digits), and all of your examples happen to translate to the same code, S534. For Schmdli only the letters S, M, D and L count: the C is dropped because it shares a code with the preceding S, and H and the vowels are ignored. Incidentally, Schöntal only takes into account S, N, T and L, producing the same output since M and N encode in the same way, as do D and T.
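To see the collision concretely, here is a minimal Python sketch of the classic American Soundex rules. It is not SQL Server's exact implementation, which may treat accented or non-letter characters differently; the names soundex and CODES are my own, and accented letters such as ö are simply treated as uncoded, like vowels.

```python
# Classic Soundex letter groups; letters not listed (vowels, H, W, Y) have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code (assumption: classic rules, letters only)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (first + "".join(digits) + "000")[:4]


names = ["Schmdli", "Schöntal", "Schindler-Külling", "Schindler", "Schmidlin", "Schmidli"]
for n in names:
    print(n, soundex(n))
```

Every name in the list comes out as S534 under these rules, which is why the SOUNDEX comparison cannot tell them apart.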

Related

Find duplicates in case-sensitive query in MS Access

I have a table containing Japanese text, in which I believe that there are some duplicate rows. I want to write a SELECT query that returns all duplicate rows. So I tried running the following query based on an answer from this site (I wasn't able to relocate the source):
SELECT [KeywordID], [Keyword]
FROM Keyword
WHERE [Keyword] IN (SELECT [Keyword]
FROM [Keyword] GROUP BY [Keyword] HAVING COUNT(*) > 1);
The problem is that Access' equality operator treats the two Japanese writing systems - hiragana and katakana - as the same thing, where they should be treated as distinct. Both writing systems have the same phonetic value, although the written characters used to represent the sound are different - e.g. あ (hiragana) and ア (katakana) both represent the sound 'a'.
When I run the above query, however, both of these characters will appear, as according to Access, they're the same character and therefore a duplicate. Essentially it's a case-insensitive search where I need a case-sensitive one.
I got around this issue when doing a simple SELECT to find a Keyword using StrComp to perform a binary comparison, because this method correctly treats hiragana and katakana as distinct. I don't know how I can adapt the query above to use StrComp, though, because it's not directly evaluating one string against another as in the linked question.
Basically what I'm asking is: how can I do a query that will return all duplicates in a table, case-sensitive?
You can use exists instead:
SELECT [KeywordID], [Keyword]
FROM Keyword as k
WHERE EXISTS (SELECT 1
FROM Keyword as k2
WHERE STRCOMP(k2.Keyword, k.KeyWord, 0) = 0 AND
k.KeywordID <> k2.KeywordID
);
Try with a self join:
SELECT k1.[KeywordID], k1.[Keyword], k2.[KeywordID], k2.[Keyword]
FROM Keyword AS k1 INNER JOIN Keyword AS k2
ON k1.[KeywordID] < k2.[KeywordID] AND STRCOMP(k1.[Keyword], k2.[Keyword], 0) = 0
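For comparison, the same pairing logic as the self join can be sketched outside the database. This is only an illustration in Python, where == on strings compares code points exactly, so hiragana and katakana stay distinct; the sample rows and the function name binary_duplicates are made up for the example.

```python
from itertools import combinations

# Hypothetical rows: (KeywordID, Keyword)
rows = [(1, "あ"), (2, "ア"), (3, "あ"), (4, "か")]


def binary_duplicates(rows):
    """Return (id1, id2) pairs, id1 < id2, whose keywords match exactly."""
    return [
        (id1, id2)
        for (id1, kw1), (id2, kw2) in combinations(sorted(rows), 2)
        if kw1 == kw2  # exact comparison: "あ" and "ア" are different strings
    ]


print(binary_duplicates(rows))
```

Only rows 1 and 3 are reported, since row 2 holds the katakana character.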

How can I SELECT DISTINCT on the last, non-numerical part of a mixed alphanumeric field?

I have a data set that looks something like this:
A6177PE
A85506
A51SAIO
A7918F
A810004
A11483ON
A5579B
A89903
A104F
A9982
A8574
A8700F
And I need to find all the endings where they are non-numeric. In this example, that means PE, SAIO, F, ON and B.
In pseudocode, I'm imagining I need something like
SELECT DISTINCT X FROM
(SELECT SUBSTR(COL,[SOME_CLEVER_LOGIC]) AS X FROM TABLE);
Any ideas? Can I solve this without learning regexp?
EDIT: To clarify, my data set is a lot larger than this example. Also, I'm only interested in the part of the string AFTER the numeric part. If the string is "A6177PE" I want "PE".
Disclaimer: I don't know Oracle SQL. But I think something like this should work:
SELECT DISTINCT X FROM
(SELECT SUBSTR(COL,REGEXP_INSTR(COL, '[[:alpha:]]+$')) AS X FROM TABLE);
REGEXP_INSTR(COL, '[[:alpha:]]+$') should return the position of the first of the letters at the end of the field. Note that string literals take single quotes and the character class name is lowercase.
For readability, I'd recommend using the REGEXP_SUBSTR function (if there are no performance issues, of course, as this is definitely slower than the accepted solution).
It is similar to REGEXP_INSTR, but instead of returning the position of the substring, it returns the substring itself:
SELECT DISTINCT REGEXP_SUBSTR(MY_COLUMN, '[a-zA-Z]+$') FROM MY_TABLE;
([[:alpha:]] is also supported, as @Audun wrote.)
Also useful: Oracle Regexp Support (beginning page)
For example
SELECT SUBSTR(col,INSTR(TRANSLATE(col,'A0123456789','A..........'),'.',-1)+1)
FROM table;
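If it helps to check what the pattern should return before running it against the real table, here is a small Python sketch of the same idea, using the sample values from the question. The function name trailing_letters is my own, and the pattern assumes plain ASCII letters.

```python
import re

codes = ["A6177PE", "A85506", "A51SAIO", "A7918F", "A810004", "A11483ON",
         "A5579B", "A89903", "A104F", "A9982", "A8574", "A8700F"]


def trailing_letters(value):
    """Return the run of letters at the end of value, or None if it ends in a digit."""
    match = re.search(r"[A-Za-z]+$", value)
    return match.group(0) if match else None


# Equivalent of SELECT DISTINCT, skipping rows without a letter suffix.
endings = sorted({e for e in map(trailing_letters, codes) if e is not None})
print(endings)
```

Rows that end in a digit yield None here, which roughly corresponds to the NULL that REGEXP_SUBSTR gives when nothing matches.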

What does LEFT in SQL do when it is not paired with JOIN and why does it cause my query to time out?

I was given the following statement:
LEFT(f.field4, CASE WHEN PATINDEX('%[^0-9]%',f.field4) = 0 THEN LEN(f.field4) ELSE PATINDEX('%[^0-9]%',f.field4) - 1 END)=@DealNumber
and am having trouble contacting the person that wrote it. Could someone explain what that statement does, and if it is valid SQL? The goal of the statement is to compare the numeric characters in f.field4 to the DealNumber. DNumber and DealNumber are the same except for a wildcard at the end of DealNumber.
I am trying to use it in the context of the following statement:
SELECT d.Description, d.FileID, d.DateFiled, u.Contact AS UserFiledName, d.Pages, d.Notes
FROM Documents AS d
LEFT JOIN Files AS f ON d.FileID=f.FileID
LEFT JOIN Users AS u ON d.UserFiled=u.UserID
WHERE SUBSTRING(f.Field8, 2, 1) = @LocationIDString
AND f.field4=@DNumber OR LEFT(f.field4, CASE WHEN PATINDEX('%[^0-9]%',f.field4) = 0 THEN LEN(f.field4) ELSE PATINDEX('%[^0-9]%',f.field4) - 1 END)=@DealNumber
but my code keeps timing out when I execute it.
It's the CASE clause which is slowing things down, not LEFT per se (although LEFT may prevent the use of indexes, which will have an effect).
The CASE determines what should be compared with @DealNumber, and I think it does the following...
If f.field4 contains no non-digit character (PATINDEX returns 0), use LEFT(f.field4, LEN(f.field4))=@DealNumber: that's equivalent to f.field4=@DealNumber.
Otherwise, use {the digits before the first non-digit character}=@DealNumber. If f.field4 starts with a non-digit, that prefix is empty.
This sort of computation isn't very efficient.
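As a sanity check on what the LEFT/PATINDEX expression computes per row, here is a small Python sketch of the same logic. It is only an illustration: the function name leading_digits is made up, and the regular expression stands in for PATINDEX('%[^0-9]%', ...).

```python
import re


def leading_digits(field4: str) -> str:
    """Mimic LEFT(field4, CASE ... END): keep everything before the first non-digit."""
    match = re.search(r"[^0-9]", field4)  # plays the role of PATINDEX('%[^0-9]%', field4)
    if match is None:
        return field4  # PATINDEX = 0: no non-digit, so the whole field is kept
    return field4[:match.start()]  # PATINDEX - 1 characters from the left


for value in ["12345", "123AB", "AB12"]:
    print(repr(value), "->", repr(leading_digits(value)))
```

The database has to evaluate this for every candidate row before it can compare the result with the parameter, which is why an index on field4 does not help with that half of the OR.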
I would attempt the following, which makes the large assumption that a mixed string can be cast as an integer — that is, that if you convert ABC to an integer you get zero, and if you convert 123ABC you get what can be converted, 123. I can't find any documentation which says whether that is possible or not.
AND f.field4=@DNumber
OR (f.field4=@DealNumber AND integer(f.field4)=0)
OR (integer(f.field4)=@DealNumber)
The first line is the same as your AND. The second line selects f.field4=@DealNumber only if f.field4 does not start with a number. The third line selects where the initial numeric portion of f.field4 is the same as @DealNumber.
As I say, there is an assumption here that integer() will work in this way. You may need to define a CAST function to do that conversion with strings. That's rather beyond me, although I would be confident that even such a function would be faster than a CASE as you currently have.
From the doc:
left(str text, n int)
Return first n characters in the string. When n is negative, return all but last |n| characters.

Is it possible to use LIKE and IN for a WHERE statement?

I have a list of place names and would like to match them to records in a SQL database. The problem is that the properties have reference numbers after their name, e.g. 'Ballymena P-4sdf5g'
Is it possible to use IN and LIKE to match records
WHERE dbo.[Places].[Name] IN LIKE('Ballymena%','Banger%')
No, but you can use OR instead:
WHERE (dbo.[Places].[Name] LIKE 'Ballymena%' OR
dbo.[Places].[Name] LIKE 'Banger%')
It's a common misconception that for the construct
b IN (x, y, z)
that (x, y, z) represents a set. It does not.
Rather, it is merely syntactic sugar for
(b = x OR b = y OR b = z)
SQL has but one data structure: the table. If you want to query search text values as a set then put them into a table. Then you can JOIN your search text table to your Places table using LIKE in the JOIN condition e.g.
WITH Places (Name)
AS
(
SELECT Name
FROM (
VALUES ('Ballymeade Country Club'),
('Ballymena Candles'),
('Bangers & Mash Cafe'),
('Bangebis')
) AS Places (Name)
),
SearchText (search_text)
AS
(
SELECT search_text
FROM (
VALUES ('Ballymena'),
('Banger')
) AS SearchText (search_text)
)
SELECT *
FROM Places AS P1
LEFT OUTER JOIN SearchText AS S1
ON P1.Name LIKE S1.search_text + '%';
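To make the expected result of that LEFT OUTER JOIN easier to picture, here is a rough Python equivalent using the same sample values. The function name left_join_on_prefix is my own, and startswith is case-sensitive, whereas LIKE follows the column's collation, so treat this as a sketch of the shape of the result rather than of SQL Server's matching rules.

```python
places = ["Ballymeade Country Club", "Ballymena Candles",
          "Bangers & Mash Cafe", "Bangebis"]
search_text = ["Ballymena", "Banger"]


def left_join_on_prefix(places, prefixes):
    """Pair every place with each prefix it starts with; unmatched places get None."""
    result = []
    for name in places:
        matches = [p for p in prefixes if name.startswith(p)]  # Name LIKE search_text + '%'
        if matches:
            result.extend((name, p) for p in matches)
        else:
            result.append((name, None))  # outer join keeps the row, with a NULL match
    return result


for row in left_join_on_prefix(places, search_text):
    print(row)
```

Swapping the outer join for an inner join corresponds to dropping the rows whose second element is None.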
Well, a simple solution would be a regular expression. I'm not sure how it's done in your SQL dialect, but probably something similar to this:
WHERE dbo.[Places].[Name] SIMILAR TO '(Banger|Ballymena)%';
or
WHERE REGEXP_LIKE(dbo.[Places].[Name], '^(Banger|Ballymena)');
One of them should at least work, depending on the database (SIMILAR TO is PostgreSQL syntax, REGEXP_LIKE is Oracle's).
you could use OR
WHERE
dbo.[Places].[Name] LIKE 'Ballymena%'
OR dbo.[Places].[Name] LIKE 'Banger%'
or split the string at the space, if the places.name is always in the same format.
WHERE SUBSTRING(dbo.[Places].[Name], 1, CHARINDEX(' ', dbo.[Places].[Name]))
IN ('Ballymena', 'Banger')
This might decrease performance, because the database may be able to use indexes with LIKE (with the wildcard at the end you have an even better chance) but most probably not when using SUBSTRING.

How do I order by on a varchar field that could contain numbers alphabetically?

I am sure that this must be quite a common problem so I would guess that Microsoft have already solved the problem. My Googling skills are just not up to scratch. I have a field that I want to order by, it is a varchar field, for example
Q
Num 10
Num 1
A
Num 9
Num 2
F
Now I would expect the result to be
A
F
Num 1
Num 2
Num 9
Num 10
Q
But it is not. It is as follows (Notice that Num 10 comes after Num 1 and not Num 9 as expected)
A
F
Num 1
Num 10
Num 2
Num 9
Q
Now I know the reason for this so you don't need to explain :) But I can't remember how to solve it or if there is a nice flag or command that I can use to get it right.
EDIT:
The values above are just an example. The column could contain any value. Any combination of letters and digits. Is there a way to sort this humanly alphabetically instead of ASCII value alphabetically?
EDIT 2:
Thanks for the answers so far. I am talking about ANY arbitrary data. If it were in a fixed position or preceded by something then it would be easy and I wouldn't be asking. I am asking for a general solution to this problem with ANY arbitrary data. Not patterns, no rules, no nothing.
This is an age-old problem of ASCII sort order vs. natural sort order.
See Sorting for Humans : Natural Sort Order for further details.
You added
The column could contain any value. Any combination of letters and digits
So, where do you want "foo1bar" and "foo10bar" for example? Or "foo10bar11" and "foo10bar1"? Or "Foo Two" and "Foo Three"?
There is no sensible solution without sensible data. You have random data. Define "human readable".
"I am asking for a general solution to
this problem with ANY arbitrary data.
Not patterns, no rules, no nothing."
The problem is, programming is all about finding patterns, deriving rules from the patterns and applying solutions based on those rules. So without those prerequisites your question is pretty tough.
Essentially what you have to do is tokenize your sort string into chunks of pure letters and chunks of pure digits, and apply a different sort order to each category. That is doable providing you have some kind of pattern e.g.
AAA999AA
A9AAAAA
A999A
but it would require a bespoke solution for each pattern. A general solution for any arbitrary arrangement of data is a big ask.
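For what it's worth, the tokenizing idea can be sketched in a few lines outside the database. The Python below is one possible convention, not a general answer to the objections above: it splits on runs of digits, compares digit runs as numbers and everything else case-insensitively. The function name natural_key is my own.

```python
import re


def natural_key(text):
    """Split text into alternating non-digit / digit chunks for natural ordering."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, odd positions are always the digit runs,
    # so chunks at the same position always have the same type.
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


values = ["Q", "Num 10", "Num 1", "A", "Num 9", "Num 2", "F"]
print(sorted(values, key=natural_key))
```

This handles "foo1bar" before "foo10bar", but it still has no opinion on cases like "Foo Two" versus "Foo Three", which is the point being made about needing rules.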
If the field always has the number at the end with possibly one word before it, and a space before it, you could use CHARINDEX/SUBSTRING to solve this.
Here is an example:
select *
from (
select 'Q' x
union
select 'Num 10'
union
select 'Num 1'
union
select 'A'
union
select 'Num 9'
union
select 'Num 2'
union
select 'F'
) a
order by
case
when CHARINDEX(' ', x) <> 0 then LEFT(x, CHARINDEX(' ', x) - 1)
else x
end,
cast(case
when CHARINDEX(' ', x) <> 0 then substring(x, CHARINDEX(' ', x) + 1, LEN(x) - CHARINDEX(' ', x) )
else ''
end as int)
The output from this is:
A
F
Num 1
Num 2
Num 9
Num 10
Q
Edit:
Since your data is not consistent enough to use a hard-coded approach, the solution calls for more drastic measures. I have experimented with T-SQL based functions that give a form of natural sort, but found them to be far too slow to be usable. Instead, I wrote a CLR based function and it performs very well. The function returns a scalar value that you can sort on. You'll find the code and installation instructions over here.