Strange behaviour with Fulltext search in SQL Server

I have MyTable with a Column Message NVARCHAR(MAX).
Record with ID 1 contains the Message '0123456789333444 Test'
When I run the following query
DECLARE @Keyword NVARCHAR(100)
SET @Keyword = '0123456789000001*'
SELECT *
FROM MyTable
WHERE CONTAINS(Message, @Keyword)
Record ID 1 shows up in the results, and in my opinion it should not, because 0123456789333444 does not contain 0123456789000001.
Can someone explain why the record is showing up anyway?
EDIT
select * from sys.dm_fts_parser('"0123456789333444 Test"',1033,0,0)
returns the following:
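group_id | phrase_id | occurrence | special_term | display_term       | expansion_type | source_term
---------------------------------------------------------------------------------------------------------------
1        | 0         | 1          | Exact Match  | 0123456789333444   | 0              | 0123456789333444 Test
1        | 0         | 1          | Exact Match  | nn0123456789333444 | 0              | 0123456789333444 Test
1        | 0         | 2          | Exact Match  | test               | 0              | 0123456789333444 Test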

This is because the @Keyword value is not wrapped in double quotes, which is what makes the asterisk match zero, one, or more characters. From the documentation:
Specifies a match of words or phrases beginning with
the specified text. Enclose a prefix term in double quotation marks
("") and add an asterisk (*) before the ending quotation mark, so that
all text starting with the simple term specified before the asterisk
is matched. The clause should be specified this way: CONTAINS (column,
'"text*"'). The asterisk matches zero, one, or more characters (of the
root word or words in the word or phrase). If the text and asterisk
are not delimited by double quotation marks, so the predicate reads
CONTAINS (column, 'text*'), full-text search considers the asterisk as
a character and searches for exact matches to text*. The full-text
engine will not find words with the asterisk (*) character because
word breakers typically ignore such characters.
When <prefix_term> is a phrase, each word contained in the phrase is
considered to be a separate prefix. Therefore, a query specifying a
prefix term of "local wine*" matches any rows with the text of "local
winery", "locally wined and dined", and so on.
Have a look at the MSDN documentation on the topic.
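The quoting rule from the documentation excerpt can be sketched as a small helper. This is written in Python purely for illustration; the function name `to_prefix_term` is hypothetical and not part of any SQL Server client library. It only builds the string that would then be passed as the CONTAINS search condition.

```python
def to_prefix_term(keyword: str) -> str:
    """Return keyword as a double-quoted full-text prefix term, e.g. abc -> "abc*"."""
    # Remove surrounding whitespace, embedded double quotes and any trailing
    # asterisk, so the result has exactly one asterisk before the closing quote.
    cleaned = keyword.strip().replace('"', "").rstrip("*").strip()
    if not cleaned:
        raise ValueError("keyword must contain at least one character")
    return f'"{cleaned}*"'


# Hypothetical usage: this value would be bound to the query parameter.
search_condition = to_prefix_term("0123456789000001*")
print(search_condition)
```

Passing the result as a bound parameter, rather than concatenating it into the SQL text, keeps the quoting concerns separate from the query itself.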

Have you tried to query the following view to see what's on the system stoplist?
select * from sys.fulltext_system_stopwords where language_id = 1033;

Found a solution that works. I've added language 1033 as an additional parameter.
SELECT * FROM MyTable WHERE CONTAINS(Message, @Keyword, LANGUAGE 1033)

Related

Impala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore

I am new to regex and need to search a string field in Impala for multiple matches to this exact sequence of characters: ~FC* followed by 11 more * that could have letters/digits between (but could not, they are basically delimiters in this string field). After the 12th * (if you count #1 in ~FC*) it should be immediately followed by Y~.
Since the asterisks are not letters or digits, I am unsure how to search for these delimiters properly.
This is my SQL so far:
select
regexp_extract(col_name, '(~FC\\*).*(\\*Y~)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
data returned:
pattern_found
--------------
~FC*
In Impala SQL, (~FC\\*) returns ~FC*, which is great (got it from my other question).
I have been trying (~FC\\*).*(\\*Y~), which obviously isn't counting the number of asterisks, but it is also not picking up the Y.
This is a test string, it has 2 occurrences:
N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
The results should be these 2, which share an overlapping ~ between them, but I will settle for at least the first being found if both cannot be.
~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~

I figured out a solution, but I am happy to learn of a better way to accomplish this.
This is what worked in Impala SQL. It needed parentheses and double-escaped backslashes for all of the asterisks:
(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)
Full SQL:
select
regexp_extract(col_name, '(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
and here is the RegexDemo without the additional syntax needed for Impala SQL
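For comparison, the same idea can be written more compactly with a counted repetition. The sketch below uses Python's `re` module, not Impala, so it needs only a single backslash per escape. The lookahead `(?=~)` is an addition that is not in the solution above, and whether Impala's regex engine accepts it is not assumed here.

```python
import re

# "~FC*" followed by 11 more asterisk-terminated fields, then "Y".
# The lookahead (?=~) checks for the closing "~" without consuming it,
# so the "~" shared by two adjacent records is still available to the next match.
PATTERN = re.compile(r"~FC\*(?:[^*]*\*){11}Y(?=~)")

sample = (
    "N4*CITY*STATE*2155446*2120"
    "~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y"
    "~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~"
)

# Re-attach the trailing "~" that the lookahead only peeked at.
matches = [m.group(0) + "~" for m in PATTERN.finditer(sample)]
for found in matches:
    print(found)
```

On the test string from the question this finds both occurrences, including the second one that starts at the shared ~.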

SQL Server - Regex pattern match only alphanumeric characters

I have an nvarchar(50) column myCol with values like these 16-digit, alphanumeric values, starting with '0':
0b00d60b8d6cfb19, 0b00d60b8d6cfb05, 0b00d60b8d57a2b9
I am trying to delete rows with myCol values that don't match those 3 criteria.
By following this article, I was able to select the records starting with '0'. However, despite the [a-z0-9] part of the regex, it also keeps selecting myCol values containing special characters like 00-d#!b8-d6/f&#b. Below is my select query:
SELECT * from Table
WHERE myCol LIKE '[0][a-z0-9]%' AND LEN(myCol) = 16
How should the expression be changed to select only rows with myCol values that don't contain special characters?

If the value must only contain a-z and digits, and must start with a 0 you could use the following:
SELECT *
FROM (VALUES(N'0b00d60b8d6cfb19'),
(N'0b00d60b8d6cfb05'),
(N'0b00d60b8d57a2b9'),
(N'00-d#!b8-d6/f&#b'))V(myCol)
WHERE V.myCol LIKE '0%' --Checks starts with a 0
AND V.myCol NOT LIKE '%[^0-9A-Za-z]%' --Checks only contains alphanumerical characters
AND LEN(V.myCol) = 16;
The second clause works because the LIKE pattern matches any value that contains at least one character that isn't alphanumeric. The NOT then reverses that, meaning the expression only resolves to TRUE when the value contains nothing but alphanumeric characters.

Pattern matching in SQL Server is not awesome, and there is currently no real regex support.
The % in your pattern is what is including the special characters you show in your example. The [a-z0-9] is only matching a single character. If your character lengths are 16 and you're only interested in letters and numbers then you can include a pattern for each one:
SELECT *
FROM Table
WHERE myCol LIKE '[0][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]';
Note: you don't need the AND LEN(myCol) = 16 with this.
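The rule that both answers implement (starts with 0, 16 characters in total, letters and digits only) can be checked outside the database with a real regular expression. The following is a minimal Python sketch for comparison, not SQL Server syntax. It assumes ASCII letters only and uses the sample values from the question.

```python
import re

# Assumed rule: first character is "0", followed by exactly 15 more characters
# that are ASCII letters or digits (16 characters in total).
VALID = re.compile(r"0[0-9A-Za-z]{15}")

values = [
    "0b00d60b8d6cfb19",
    "0b00d60b8d6cfb05",
    "0b00d60b8d57a2b9",
    "00-d#!b8-d6/f&#b",
]

# fullmatch requires the whole value to fit the pattern,
# similar to a LIKE pattern that contains no % wildcard.
kept = [v for v in values if VALID.fullmatch(v)]
print(kept)
```

The value with special characters is rejected because every position is checked, which is the same effect as repeating [a-z0-9] once per character in the LIKE pattern.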

T-SQL CONTAINS with numbers and dots (.)

Let's consider User.Note = 'Version:3.7.21.1'
SELECT * FROM [USER] WHERE CONTAINS(NOTE, '"3.7.2*"')
=> returns something
SELECT * FROM [USER] WHERE CONTAINS(NOTE, '"3.7*"')
=> returns nothing
If User.Note = 'Version:3.7.21'
SELECT * FROM [USER] WHERE CONTAINS(NOTE, '"3.7*"')
=> returns something
If User.Note = 'Version:3.72.21'
SELECT * FROM [USER] WHERE CONTAINS(NOTE, '"3.7*"')
=> returns nothing
I can't figure out how it works. It should always returns something when I search for "3.7*".
Do you know what's the logic behind this ?
PS: if I replace the numbers by letters, there's no problem.

I think your problem is being caused by the unpredictability of the word breaker interacting with the punctuation marks within the data. Full text search is based on the concept of strings of characters, not including spaces and punctuation. When the engine is building the index it sees the periods and breaks the word in weird ways.
As an example, I made a small table with the three values you provided...
VALUES (1,'3.7.21.1'),(2,'3.7.21'),(3,'3.72.21')
Now when I do your selects, I get results on all four... not the results I expect, though.
For me, this returns all three values
SELECT * FROM containstext WHERE CONTAINS(secondid, '"3.7.2*"')
and this returns only 3.7.21
SELECT * FROM containstext WHERE CONTAINS(secondid, '"3.7*"')
So let's run this and take a look at the contents of the full text index
SELECT * FROM sys.dm_fts_index_keywords(db_id('{databasename}'), object_id('{tablename}'))
For my results (yours are quite probably different) I've got the following display_term values
display_term document_count
21 3
3 3
3.7.21 1
7 2
72 1
So let's look at the first search criterion '"3.7.2*"'
If I shove that into sys.dm_fts_parser...
select * from sys.dm_fts_parser('"3.7.2*"', 1033, NULL, 0)
...it's showing me that it's breaking with matches on
3
7
2
But if I do...
select * from sys.dm_fts_parser('"3.7*"', 1033, NULL, 0)
I'm getting a single exact match on the term 3.7 and sys.dm_fts_index_keywords told me earlier that I only have one document/row that contains 3.7
You might also experience additional weirdness because numbers 0-9 are usually in the system stopwords and can be left out of an index because they're considered to be useless. This might be why it works when you change to letters.
Also, I know you've decided to replace LIKE, but Microsoft has suggested that you only use alphanumeric characters in your full text indexes and, if you need to use non-alphanumeric characters in search criteria, you should use LIKE. Perhaps changing the periods to some alphanumeric replacement that won't be used in normal values?

CONTAINS will only work if the column is in a full-text index. If it is not indexed you will need to use LIKE:
SELECT * FROM [USER] WHERE NOTE like '3.7%' --or '%3.7%'
Do you want to use CONTAINS because you think it will be faster? (It generally is.)
The Microsoft documentation on CONTAINS lists all the ways you can format and use it (11 examples).

SQL : Confused with WildCard operators

What is the difference between these two SQL statements?
1- select * from tblperson where name not like '[^AKG]%';
2- select * from tblperson where name like '[AKG]%';
Both show the same results: names starting with A, K or G.

like '[^AKG]%' -- This gets you rows where the first character of name is not A, K or G. ^ matches any single character not in the specified set or range of characters. NOT adds a second negation, so when you say name not like '[^AKG]%' you get rows where the first character of name is A, K or G.
name like '[AKG]%' -- you get rows where the first character of name is A, K or G.
The wildcard [] matches any single character in a specified range or set of characters. In your case it is a set of characters.
So both conditions are equivalent for any non-empty name. (An empty string is the one difference: it satisfies the NOT LIKE version but not the LIKE version.)
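The double negation can be checked with a small emulation. This is a Python sketch of the two predicates rather than SQL Server itself. It assumes a case-insensitive comparison and does not model NULL handling, and the function names are made up for illustration. It also shows one edge case: an empty string has no first character, so it passes the NOT LIKE form but not the LIKE form.

```python
LETTERS = {"A", "K", "G"}


def like_set(name: str) -> bool:
    """Emulates: name LIKE '[AKG]%' (case-insensitive comparison assumed)."""
    return len(name) >= 1 and name[0].upper() in LETTERS


def not_like_negated_set(name: str) -> bool:
    """Emulates: name NOT LIKE '[^AKG]%' (case-insensitive comparison assumed)."""
    # '[^AKG]%' needs a first character that is outside the set.
    matches_negated = len(name) >= 1 and name[0].upper() not in LETTERS
    return not matches_negated


for name in ["Adam", "kim", "Gary", "Bob", ""]:
    print(repr(name), like_set(name), not_like_negated_set(name))
```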

You are using a double 'NOT'. The caret '^' in your character set is shorthand for 'not', so not like '[^AKG]%' reads as not like '[not AKG]%'.

In the first query you are using both NOT and '^', so the negation is applied twice and cancels out.
Therefore 'NOT LIKE [^AKG]' ==> 'LIKE [AKG]'.

^ is also known as the caret or up arrow.
Inside the brackets [], this symbol matches any character not listed, meaning that normally it wouldn't return anything that starts with A, K or G. But since you added the word NOT to the query, you are cancelling the operator, just as in math:
(-1) * (-1)

Postgresql : Pattern matching of values starting with "IR"

If I have table contents that looks like this :
id | value
------------
1 |CT 6510
2 |IR 52
3 |IRAB
4 |IR AB
5 |IR52
I need to get only those rows with contents starting with "IR" and then a number, (the spaces ignored). It means I should get the values :
2 |IR 52
5 |IR52
because it starts with "IR" and the next non space character is an integer. unlike IRAB, that also starts with "IR" but "A" is the next character. I've only been able to query all starting with IR. But other IR's are also appearing.
select * from public.record where value ilike 'ir%'
How do I do this? Thanks.

You can use the operator ~, which performs a regular expression matching.
e.g:
SELECT * from public.record where value ~ '^IR ?\d';
Add an asterisk to perform case-insensitive matching.
SELECT * from public.record where value ~* '^ir ?\d';
The symbols mean:
^: beginning of the string
?: the character before (here a space) is optional
\d: any digit, equivalent to [0-9]
See for more info: Regular Expression Match Operators
See also this question, very informative: difference-between-like-and-in-postgres
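To see which sample rows the pattern keeps, here is a minimal check using Python's `re` module instead of PostgreSQL. The pattern text is the same as in the case-insensitive query, and the row data comes from the question.

```python
import re

# Same pattern as the ~* query: "IR", an optional single space, then a digit.
# re.IGNORECASE plays the role of the * in the ~* operator.
PATTERN = re.compile(r"^IR ?\d", re.IGNORECASE)

rows = {1: "CT 6510", 2: "IR 52", 3: "IRAB", 4: "IR AB", 5: "IR52"}

# match() only tests from the start of the string; the rest of the value is ignored.
selected = {row_id: value for row_id, value in rows.items() if PATTERN.match(value)}
print(selected)
```

If values may contain more than one space after IR, replacing ` ?` with ` *` would allow any number of spaces before the digit.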
