An esoteric pondering regarding the lack of compatibility between % and = and <> - sql

I am new to the world of programming but please humor me nonetheless.
I know that % works with LIKE and NOT LIKE. For example the following two queries work:
--QUERY 1
SELECT *
FROM TrumpFeccandid_Pacs
WHERE PACID NOT LIKE 'C%'
--QUERY 2
SELECT *
FROM TrumpFeccandid_Pacs
WHERE PACID LIKE 'C%'
However % does not work with = or <>. For example, the following two queries do not work:
--QUERY A
SELECT *
FROM TrumpFeccandid_Pacs
WHERE PACID <> 'C%'
--QUERY B
SELECT *
FROM TrumpFeccandid_Pacs
WHERE PACID = 'C%'
Why is this the case? Intuitively speaking I feel like not only should queries A and B work but Query A should be equivalent to Query 1 and Query B should be equivalent to Query 2.
These examples were using T-SQL from Sql Server 2016.

Imagine a relatively simple query like this one:
SELECT *
FROM A
JOIN B ON A.Name = B.Name
If = worked like LIKE, god help you if Name contains a percent or underscore!

"Intuitively speaking I feel like..."
That is where you go awry!
LIKE is defined a certain way, as are = and <>. The people who designed the language presumably tried to make it accessible, to make it easy to understand and remember and use. What they did not do, because they could not do, is define it such that it meets everyone's expectations and hunches.
Why is LIKE different from =?
a like 'C%' is true if a starts with 'C'
a = 'C%' is true if a is exactly the 2 letter string 'C%'
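To make the difference concrete, here is a minimal sketch that runs the three kinds of comparison against a tiny table. It assumes SQLite (through Python's built-in sqlite3 module) rather than SQL Server, and the table name pacs and its rows are made up for illustration. The behaviour shown for =, <> and LIKE with % is the same in both engines, apart from case sensitivity, which depends on the engine and collation.

```python
import sqlite3

# Hypothetical in-memory table standing in for the PAC data in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pacs (pacid TEXT)")
conn.executemany(
    "INSERT INTO pacs (pacid) VALUES (?)",
    [("C001",), ("C%",), ("D002",)],
)


def ids(where):
    """Return the sorted pacid values selected by the given WHERE clause."""
    rows = conn.execute(f"SELECT pacid FROM pacs WHERE {where}")
    return sorted(r[0] for r in rows)


like_rows = ids("pacid LIKE 'C%'")     # pattern match: anything starting with C
equal_rows = ids("pacid = 'C%'")       # plain comparison with the 2-character string C%
not_equal_rows = ids("pacid <> 'C%'")  # everything except the 2-character string C%

print(like_rows, equal_rows, not_equal_rows)
```

Only LIKE treats % as a wildcard; with = and <> the percent sign is an ordinary character, so only the row that literally contains C% is matched or excluded.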
But the real moral to this story IMO is that if you want to know how the language works, the best advice is RTFM. Especially when it doesn't work as expected.

SQL provides simple pattern matching, loosely comparable to the patterns used by Unix tools such as grep and sed. These patterns can be used only with the LIKE and NOT LIKE operators.
LIKE and NOT LIKE are Boolean operators, i.e. they return TRUE or FALSE depending on whether the match_expression matches the specified pattern.
The following wildcards are used to build the patterns:
% = Any string of zero or more characters
_ = Any Single character
[] = Any single character within the specified range
[^] = Any single character not within the specified range
Documentation on patterns and like operators:
SQL server LIKE operator

Related

Why am I getting results that don't match the first WHERE clause?

WHERE [SOTR_CUST_CODE] = 'O004'
AND [SOTD_STRC_CODE] LIKE 'PC%'
OR [SOTD_STRC_CODE] LIKE 'PD%'
This returns records of customers that are not 'O004', and I'm not sure why. Is there also a better way to search for a string that could start with 2 different sets of characters without using the LIKE function twice?
Using SQL Server 2012
You need to use parentheses in your clause. Without parentheses, it means:
(A and B) OR C
Therefore you will get all records matching condition C regardless of condition A or B.
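A small sketch of that precedence rule, assuming SQLite via Python's sqlite3 module instead of SQL Server 2012. The orders table, its shortened column names and its four rows are invented for the demonstration. AND binds tighter than OR in both engines.

```python
import sqlite3

# Hypothetical rows; column names are shortened stand-ins for the question's.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_code TEXT, strc_code TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("O004", "PC1"), ("O004", "PD1"), ("X999", "PD2"), ("O004", "ZZ9")],
)


def select(where):
    """Run the WHERE clause and return (cust_code, strc_code) tuples."""
    rows = conn.execute(
        f"SELECT cust_code, strc_code FROM orders WHERE {where} ORDER BY strc_code"
    )
    return rows.fetchall()


# Parsed as (A AND B) OR C: the X999 row slips through via the last LIKE.
unparenthesized = select(
    "cust_code = 'O004' AND strc_code LIKE 'PC%' OR strc_code LIKE 'PD%'"
)

# Parsed as A AND (B OR C): only customer O004 remains.
parenthesized = select(
    "cust_code = 'O004' AND (strc_code LIKE 'PC%' OR strc_code LIKE 'PD%')"
)

print(unparenthesized)
print(parenthesized)
```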
I know it already has a very good answer (thanks to Laposhasu Acsa), but I wanted to clarify for future readers:
The code below (without parentheses) is the same as (A and B) or C
WHERE [SOTR_CUST_CODE] = 'O004'
AND [SOTD_STRC_CODE] LIKE 'PC%'
OR [SOTD_STRC_CODE] LIKE 'PD%'
First solution is to use parentheses:
WHERE [SOTR_CUST_CODE] = 'O004'
AND ([SOTD_STRC_CODE] LIKE 'PC%'
OR [SOTD_STRC_CODE] LIKE 'PD%')
Second solution, which suits this particular case is:
WHERE [SOTR_CUST_CODE] = 'O004'
AND [SOTD_STRC_CODE] LIKE 'P[CD]%'

SQL FTS and comparison statements

Short story. I am working on a project where I need to communicate with SQLite database. And there I have several problems:
There is one FTS table with nodeId and nodeName columns. I need to select all nodeIds for which nodeNames contains some text pattern. For instance all node names with "Donald" inside. Something similar was discussed in this thread. The point is that I can't use CONTAINS keyword. Instead I use MATCH. And here is the question itself: how should this "Donald" string be "framed"? With '*' or with '%' character? Here is my query:
SELECT * FROM nodeFtsTable WHERE nodeName MATCH "Donald"
Is it OK to write multiple comparison in SELECT statement? I mean something like this:
SELECT * FROM distanceTable WHERE pointId = 1 OR pointId = 4 OR pointId = 203 AND distance<200
I hope that it does not sound very confusing. Thank you in advance!
Edit: Sorry, I missed the fact that you are using FTS4. It looks like you can just do this:
SELECT * FROM nodeFtsTable WHERE nodeName MATCH 'Donald'
Here is relevant documentation.
No wildcard characters are needed in order to match all entries in which Donald is a discrete word (e.g. the above will match Donald Duck). If you want to match Donald as part of a word (e.g. Donalds) then you need to use * in the appropriate place:
SELECT * FROM nodeFtsTable WHERE nodeName MATCH 'Donald*'
If your query wasn't working, it was probably because you used double quotes.
From the SQLite documentation:
The MATCH operator is a special syntax for the match()
application-defined function. The default match() function
implementation raises an exception and is not really useful for
anything. But extensions can override the match() function with more
helpful logic.
FTS4 is an extension that provides a match() function.
Yes, it is OK to use multiple conditions as in your second query. When you have a complex set of conditions, it is important to understand the order in which the conditions will be evaluated. AND is always evaluated before OR (they are analogous to mathematical multiplication and addition, respectively). In practice, I think it is always best to use parentheses for clarity when using a combination of AND and OR:
--This is the same as with no parentheses, but is clearer:
SELECT * FROM distanceTable WHERE
pointId = 1 OR
pointId = 4 OR
(pointId = 203 AND distance<200)
--This is something completely different:
SELECT * FROM distanceTable WHERE
(pointId = 1 OR pointId = 4 OR pointId = 203) AND
distance<200

Regular expressions inside SQL Server

I have stored values in my database that look like 5XXXXXX, where X can be any digit. In other words, I need to match incoming SQL query strings like 5349878.
Does anyone have an idea how to do it?
I have different cases like XXXX7XX for example, so it has to be generic. I don't care about representing the pattern in a different way inside the SQL Server.
I'm working with c# in .NET.
You can write queries like this in SQL Server:
--each [0-9] matches a single digit, this would match 5xx
SELECT * FROM YourTable WHERE SomeField LIKE '5[0-9][0-9]'
stored value in DB is: 5XXXXXX [where x can be any digit]
You don't mention data types - if numeric, you'll likely have to use CAST/CONVERT to change the data type to [n]varchar.
Use:
WHERE CHARINDEX('5', column) = 1
AND CHARINDEX('.', column) = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
CHARINDEX
ISNUMERIC
i have also different cases like XXXX7XX for example, so it has to be generic.
Use:
WHERE PATINDEX('%7%', column) = 5
AND CHARINDEX('.', column) = 0 --to stop decimals if needed
AND ISNUMERIC(column) = 1
References:
PATINDEX
Regex Support
SQL Server 2005+ supports regex, but the catch is that you have to create the UDF in CLR before you can use it. There are numerous articles providing example code if you google them. Once you have that in place, you can use:
5\d{6} for your first example
\d{4}7\d{2} for your second example
For more info on regular expressions, I highly recommend this website.
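Before wiring those two patterns into a CLR function, they can be sanity-checked outside the database. The sketch below uses Python's re module purely as a stand-in for the .NET regex engine. The constant names and the helper matches are illustrative, and fullmatch is used on the assumption that the whole stored value must fit the pattern.

```python
import re

# re.ASCII restricts \d to 0-9; without it Python also accepts other Unicode digits.
FIVE_THEN_SIX_DIGITS = re.compile(r"5\d{6}", re.ASCII)        # 5XXXXXX
SEVEN_IN_FIFTH_PLACE = re.compile(r"\d{4}7\d{2}", re.ASCII)   # XXXX7XX


def matches(pattern, value):
    """True only if the whole value fits the pattern, not just a substring."""
    return pattern.fullmatch(value) is not None


print(matches(FIVE_THEN_SIX_DIGITS, "5349878"))   # sample value from the question
print(matches(SEVEN_IN_FIFTH_PLACE, "1234756"))
```

If the CLR function does a substring search instead of a full match, the patterns would need ^ and $ anchors to reject longer values.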
Try this
select * from mytable
where p1 not like '%[^0-9]%' and substring(p1,1,1)='5'
Of course, you'll need to adjust the substring value, but the rest should work...
In order to match a digit, you can use [0-9].
So you could use 5[0-9][0-9][0-9][0-9][0-9][0-9] and [0-9][0-9][0-9][0-9]7[0-9][0-9]. I do this a lot for zip codes.
SQL Wildcards are enough for this purpose. Follow this link: http://www.w3schools.com/SQL/sql_wildcards.asp
you need to use a query like this:
select * from mytable where msisdn like '%7%'
or
select * from mytable where msisdn like '56655%'

SQL produced by Entity Framework for string matching

Given this linq query against an EF data context:
var customers = data.Customers.Where(c => c.EmailDomain.StartsWith(term))
You’d expect it to produce SQL like this, right?
SELECT {cols} FROM Customers WHERE EmailDomain LIKE @term + '%'
Well, actually, it does something like this:
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) = 1)
Do you know why?
Also, replacing the Where selector to:
c => c.EmailDomain.Substring(0, term.Length) == term
it runs 10 times faster but still produces some pretty yucky SQL.
NOTE: Linq to SQL correctly translates StartsWith into Like {term}%, and nHibernate has a dedicated LikeExpression.
I don't know about MS SQL Server, but on SQL Server Compact LIKE 'foo%' is thousands of times faster than CHARINDEX if you have an index on the search column. And now I'm sitting here pulling my hair out over how to force it to use LIKE.
http://social.msdn.microsoft.com/Forums/en-US/adodotnetentityframework/thread/1b835b94-7259-4284-a2a6-3d5ebda76e4b
The reason is that CharIndex is a lot faster and cleaner for SQL to perform than LIKE. The reason is, that you can have some crazy "LIKE" clauses. Example:
SELECT * FROM Customer WHERE EmailDomain LIKE 'abc%de%sss%'
But, the "CHARINDEX" function (which is basically "IndexOf") ONLY handles finding the first instance of a set of characters... no wildcards are allowed.
So, there's your answer :)
EDIT: I just wanted to add that I encourage people to use CHARINDEX in their SQL queries for things that they didn't need "LIKE" for. It is important to note though that in SQL Server 2000... a "Text" field can use the LIKE method, but not CHARINDEX.
Performance seems to be about equal between LIKE and CHARINDEX, so that should not be the reason. See here or here for some discussion. Also the CAST is very weird because CHARINDEX returns an int.
charindex returns the location of the first term within the second term.
sql starts with 1 as the first location (0 = not found)
http://msdn.microsoft.com/en-us/library/ms186323.aspx
i don't know why it uses that syntax but that's how it works
I agree that it is no faster; I was retrieving tens of thousands of rows from our database with the letter i in the name. I did find, however, that you need to use > 0 rather than = 1, so use
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) > 0)
rather than
SELECT {cols} FROM Customer WHERE ((CAST(CHARINDEX(@term, EmailDomain) AS int)) = 1)
Here are my two tests ....
select * from members where surname like '%i%' --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) > 0) --12 seconds
select * from sc4_persons where ((CAST(CHARINDEX('i', surname) AS int)) = 1) --too few results
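The "too few results" outcome follows from what the position value means: = 1 asks for "starts with", while > 0 asks for "contains". A rough sketch of that distinction, assuming SQLite through Python's sqlite3 module, where instr(string, substring) plays the role of CHARINDEX (note the reversed argument order). The persons table and its three surnames are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (surname TEXT)")
conn.executemany(
    "INSERT INTO persons VALUES (?)",
    [("irving",), ("smith",), ("jones",)],
)


def surnames(where):
    """Return the surnames selected by the WHERE clause, alphabetically."""
    rows = conn.execute(
        f"SELECT surname FROM persons WHERE {where} ORDER BY surname"
    )
    return [r[0] for r in rows]


# instr() is 1-based and returns 0 when the substring is absent, like CHARINDEX.
starts_with = surnames("instr(surname, 'i') = 1")   # 'i' is the first character
contains = surnames("instr(surname, 'i') > 0")      # 'i' appears anywhere
like_contains = surnames("surname LIKE '%i%'")      # same rows as the > 0 test

print(starts_with, contains, like_contains)
```

So the = 1 form is the right translation of StartsWith, and the > 0 form is the equivalent of LIKE '%i%', which is why the last test query returns fewer rows than the first two.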

What is the best way to select string fields based on character ranges?

I need to add the ability for users of my software to select records by character ranges.
How can I write a query that returns all widgets from a table whose name falls in the range Ba-Bi for example?
Currently I'm using greater than and less than operators, so the above example would become:
select * from widget
where name >= 'ba' and name < 'bj'
Notice how I have "incremented" the last character of the upper bound from i to j so that "bike" would not be left out.
Is there a generic way to find the next character after a given character based on the field's collation or would it be safer to create a second condition?
select * from widget
where name >= 'ba'
and (name < 'bi' or name like 'bi%')
My application needs to support localization. How sensitive is this kind of query to different character sets?
I also need to support both MSSQL and Oracle. What are my options for ensuring that character casing is ignored no matter what language appears in the data?
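For reference, the two query shapes above can be compared side by side. This is only a sketch under stated assumptions: it uses SQLite via Python's sqlite3 module with its default binary collation, lowercase ASCII data, and a made-up set of widget names, so it says nothing about how other collations or mixed-case data would sort.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE widget (name TEXT)")
conn.executemany(
    "INSERT INTO widget VALUES (?)",
    [("apple",), ("bat",), ("bi",), ("bike",), ("bjorn",), ("cat",)],
)


def names(where):
    """Return the widget names selected by the WHERE clause, in sorted order."""
    rows = conn.execute(f"SELECT name FROM widget WHERE {where} ORDER BY name")
    return [r[0] for r in rows]


# Upper bound "incremented" by hand from 'bi' to 'bj'.
incremented = names("name >= 'ba' AND name < 'bj'")

# No increment: a second condition picks up everything that starts with 'bi'.
with_like = names("name >= 'ba' AND (name < 'bi' OR name LIKE 'bi%')")

print(incremented)
print(with_like)
```

Under these assumptions both forms return the same rows. The LIKE variant avoids having to know which character follows i in the collation, which is the part that becomes unreliable once localization is involved.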
Let's skip directly to localization. Would you say "aa" >= "ba" ? Probably not, but that is where it sorts in Sweden. Also, you simply can't assume that you can ignore casing in any language. Casing is explicitly language-dependent, with the most common example being Turkish: uppercase i is İ. Lowercase I is ı.
Now, your SQL DB defines the result of <, =, etc. by a "collation order". This is definitely language-specific. So, you should explicitly control this, for every query. A Turkish collation order will put those i's where they belong (in Turkish). You can't rely on the default collation.
As for the "increment part", don't bother. Stick to >= and <=.
For MSSQL see this thread: http://bytes.com/forum/thread483570.html .
For Oracle, it depends on your Oracle version, as Oracle 10 now supports regex(p) like queries: http://www.psoug.org/reference/regexp.html (search for regexp_like ) and see this article: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html
HTH
Frustratingly, the Oracle substring function is SUBSTR(), whilst in SQL Server it's SUBSTRING().
You could write a simple wrapper around one or both of them so that they share the same function name + prototype.
Then you can just use
MY_SUBSTRING(name, 2) >= 'ba' AND MY_SUBSTRING(name, 2) <= 'bi'
or similar.
You could use this...
select * from widget
where name Like 'b[a-i]%'
This will match any row where the name starts with b, the second character is in the range a to i, and any other characters follow.
I think that I'd go with something simple like appending a high-sorting string to the end of the upper bound. Something like:
select * from widget
where name >= 'ba' and name <= 'bi'||'~'
I'm not sure that would survive EBCDIC conversion though
You could also do it like this:
select * from widget
where left(name, 2) between 'ba' and 'bi'
If your criteria length changes (as you seemed to indicate in a comment you left), the query would need to have the length as an input also:
declare @CriteriaLength int
set @CriteriaLength = 4
select * from widget
where left(name, @CriteriaLength) between 'baaa' and 'bike'