T-SQL RegEx Matching "One or More" Operator - sql

In MS SQL, is there an operator that allows the matching of one or more character? (I'm curious about whether its implemented explicitly in T-SQL - other solutions are certainly possible, one of which I use in my question example below . . .)
I know in SQL, this could be explicitly implemented to varying degrees of success with the wildcard/like approach:
SELECT *
FROM table
-- finds letters aix and then anything following it
WHERE column LIKE 'aix_x%'
In Python, the '+' operator allows for this:
import re
str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 1 or more "x" characters:
# finds 'ai' + one or more letters x
x = re.findall("aix+", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")

Check if the string contains "ai" followed by 1 or more "x" characters:
finds 'ai' + one or more letters x
If this is what you want, then:
where str like '%aix%'
does what you want.
If you want an underscore, then an underscore is a wildcard in LIKE expressions. Probably the simplest method in SQL Server is to use a character class:
where str like '%ai[_]x%'
another solution is:
where str like '%ai$_x%' escape '$'

Related

What does ^5 (caret + number) mean?

Using ^5, one can get the first five elements of an array:
my #foo = 10..20;
say #foo[^5].join(',');
10,11,12,13,14
What is ^5 actually? Indexing syntax, a shortcut for lists, ... ?
The prefix ^ operator is the upto operator. It generates a Range from 0 upto N (exclusive)
See also prefix:<^>.
In the example it is used as a specification of an array slice, so it is equivalent to #foo[0,1,2,3,4].

Squeak Smalltak, Does +, -, *, / have more precedence over power?

I understand in Smalltalk numerical calculation, if without round brackets, everything starts being calculated from left to right. Nothing follows the rule of multiplication and division having more precedence over addition and subtraction.
Like the following codes
3 + 3 * 2
The print output is 12 while in mathematics we get 9
But when I started to try power calculation, like
91 raisedTo: 3 + 1.
I thought the answer should be 753572
What I actual get is 68574964
Why's that?
Is it because that +, -, *, / have more precedence over power ?
Smalltalk does not have operators with precedence. Instead, there are three different kinds of messages. Each kind has its own precedence.
They are:
unary messages that consist of a single identifier and do not have parameters as squared or asString in 3 squared or order asString;
binary messages that have a selector composed of !%&*+,/<=>?#\~- symbols and have one parameter as + and -> in 3 + 4 or key -> value;
keyword messages that have one or more parameters and a selector with colons before each parameter as raisedTo: and to:by:do: in 4 risedTo: 3 and 1 to: 10 by: 3 do: [ … ].
Unary messages have precedence over binary and both of them have precedence over keyword messages. In other words:
unary > binary > keyword
So for example
5 raisedTo: 7 - 2 squared = 125
Because first unary 2 squared is evaluated resulting in 4, then binary 7 - 4 is evaluated resulting in 3 and finally keyword 5 risedTo: 3 evaluates to 125.
Of course, parentheses have the highest precedence of everything.
To simplify the understanding of this concept don't think about numbers and math all the numbers are objects and all the operators are messages. The reason for this is that a + b * c does not mean that a, b, and c are numbers. They can be humans, cars, online store articles. And they can define their own + and * methods, but this does not mean that * (which is not a "multiplication", it's just a "star message") should happen before +.
Yes, +, -, *, / have more precedence than raisedTo:, and the interesting aspect of this is the reason why this happens.
In Smalltalk there are three types of messages: unary, binary and keyword. In our case, +, -, * and / are examples of binary messages, while raisedTo: is a keyword one. You can tell this because binary messages are made from characters that are not letters or numbers, unlike unary or keywords, which start with a letter or underscore and follow with numbers or letters or underscores. Also, you can tell when a selector is unary because they do not end with a colon. Thus, raisedTo: is a keyword message because it ends with colon (and is not made of non-letter or numeric symbols).
So, the expression 91 raisedTo: 3 + 1 includes two selectors, one binary (+) and one keyword (raisedTo:) and the precedence rule says:
first evaluate unary messages, then binary ones and finally those with keywords
This is why 3 + 1 gets evaluated first. Of course, you can always change the precedence using parenthesis. For example:
(91 raisedTo: 3) + 1
will evaluate first raisedTo: and then +. Note that you could write
91 raisedTo: (3 + 1)
too. But this is usually not done because Smalltalk precedence rules are so easy to remember that you don't need to emphasize them.
Commonly used binary selectors
# the Point creation message for x # y
>= greater or equal, etc.
-> the Association message for key -> value
==> production tranformation used by PetitParser
= equal
== identical (very same object)
~= not equal
~~ not identical
\\ remainder
// quotient
and a lot more. Of course, you are always entitled to create your own.

search any word inside a string in million rows

I have a set of 50k values say X. each value i want to compare with a set of 10k values say Y. if X is present any where in the string Y it matches.
So each value in X i want to check across each value in Y and assign X if it matches.
what would be the best method to complete this task. It is required for a data mining project.
I loaded the data into MS Access database.
then using a vba program
take each X . Update Y if it matches (Like '%X%') but it is a never ending process. The columns are indexed but no effect.
Is there any algorithm or steps to reduce it into step-by-step process and complete the mapping faster?
Please let me know if there is any other options available other than the answers given below. I ll explain the scenario bit more
Table1.Data
sentense1
sentense2
sentense3
sentense4
sentense5
sentense6
-
-
-
Sentense100k
Table2.Phrase (Means multiple words)
Phrase1
Phrase2
Phrase3
Phrase4
Phrase5
-
-
-
Phrase 100k
Want to check Phrase1 has any Match in Sentense1 to Sentense100k Exact Match of Phrase, anywhere Match of Phrase, Maximum Words in Phrase1 Match in Sentense etc.. and create a map based on best Match(ideally exact phrase available anywhere in the sentense)
Table3 Output
Data Best Possible Phrase Second Best Phrase(Optional)
Sentense1 Phrase1000 Phrase50k
Sentense2 Phrase10 Phrase70k
Please let me know any tool,logic to perform this. The logic what i tried in SQL
1.
Select A.Data,B.Phrase from Table1 A left join Table2 B on A.Data Like '%' + B.Phrase + '%'
2.
Check for any word in phrase available in sentense. So replaced all spaces with % like word1%word2%word3. then did query as
A.Data Like '%' + B.Phrase + '%' which is
A.Data Like '%word1%word2%word3%'
But it takes days to complete the task for this much data.
Any readily usable tools, indexing methods,queries would really help. The answers given below seems too technical for me to adapt. Please guide
You can build a suffix tree in linear time (you can look up suffix trees online), out of the concatenation of all strings in X and Y, with special unique symbols that end each string.
Then for each string Xi in X, you look it up in the suffix tree (linear time in length of Xi) and assign Xi to each string in Y that is somewhere in the subtree rooted at the end of Xi.
This is linear time in the number of strings in Y that Xi is assigned to.
Thus you get an optimal O(N + k) time algorithm, where:
N is the total length of all the strings in X and Y,
and k is the total number of matches between query strings in X and target strings in Y.

Solve three letter string into regular expression way

I need help to solve in regular expression way. The language of all strings defined over Σ = {X, Y, Z} with Y as the third letter and Z being the second last letter.
If you are allowed to use intersection (which does preserve rationality), I would state it simply as ΣΣYΣ* & Σ*ZΣ. If you feed this to Vcsn to normalize it, you get:
In [1]: import vcsn
In [2]: vcsn.B.expression('([XYZ]{2}Y[XYZ]*)&([XYZ]*Z[XYZ])').derived_term().expression()
Out[2]: (X+Y+Z)ZY+(X+Y+Z)(X+Y+Z)Y(X+Y+Z)*Z(X+Y+Z)
The call to derived_term is to build an automaton from the expression, and the last call to expression is to extract a rational expression from this automaton.
As given Σ = {X, Y, Z} , you need to construct the language of all strings defined over it with Y as third letter and Z being the second last letter.
"ΣΣYΣ*ZΣ | ΣZY" will be the required regular expression.
Σ* has all strings that are 0 or more concatenations of strings from Σ.
As you can see, here Y being the third element and Z is placed in second last position. And, Σ can be replaced with any of the X,Y or Z element.
i think regular expression should be like this....
(x+y+z)zy(x+y+z)^*

Contains at least a count of different character in a set

Assume that, I have a character set like this:
['a','b','c','x','y','z']
I want to build a regular expression which matches a certain number of these characters (for example 3).
Here are some examples of it:
ab - no match
xy - no match
abt - no match
aaa - no match
abc - match
yaz - match
yazx - match
ytaz - match
Can this be accomplished with a regular expression?
A simple solution would be a pattern like this:
(.*[abcxyz]){3}
This will match zero or more of any character, followed by one of a, b, c, x, y, or z, all of which must appear at least 3 times in the subject string.
To match only strings that contain different letters, you could use a negative lookahead ((?!…)) and a backreference (\N):
(.*([abcxyz])(?!.*\2)){3}
This will match zero or more of any character, followed by one of a, b, c, x, y, or z, as long as another instance of that character does not appear later in the string (i.e. it will match the last instance of that character in the string), all of which must appear at least 3 times in the subject string.
Of course, you can change the {3} to anything you like, but note that will not work if you need to specify a maximum number of times these characters can appear in your string, only the minimum.