Contains at least a count of different character in a set - sql

Assume that, I have a character set like this:
['a','b','c','x','y','z']
I want to build a regular expression which matches a certain number of these characters (for example 3).
Here are some examples of it:
ab - no match
xy - no match
abt - no match
aaa - no match
abc - match
yaz - match
yazx - match
ytaz - match
Can this be accomplished with a regular expression?

A simple solution would be a pattern like this:
(.*[abcxyz]){3}
This will match zero or more of any character, followed by one of a, b, c, x, y, or z, all of which must appear at least 3 times in the subject string.
To match only strings that contain different letters, you could use a negative lookahead ((?!…)) and a backreference (\N):
(.*([abcxyz])(?!.*\2)){3}
This will match zero or more of any character, followed by one of a, b, c, x, y, or z, as long as another instance of that character does not appear later in the string (i.e. it will match the last instance of that character in the string), all of which must appear at least 3 times in the subject string.
Of course, you can change the {3} to anything you like, but note that will not work if you need to specify a maximum number of times these characters can appear in your string, only the minimum.

Related

T-SQL RegEx Matching "One or More" Operator

In MS SQL, is there an operator that allows the matching of one or more character? (I'm curious about whether its implemented explicitly in T-SQL - other solutions are certainly possible, one of which I use in my question example below . . .)
I know in SQL, this could be explicitly implemented to varying degrees of success with the wildcard/like approach:
SELECT *
FROM table
-- finds letters aix and then anything following it
WHERE column LIKE 'aix_x%'
In Python, the '+' operator allows for this:
import re
str = "The rain in Spain falls mainly in the plain!"
#Check if the string contains "ai" followed by 1 or more "x" characters:
# finds 'ai' + one or more letters x
x = re.findall("aix+", str)
print(x)
if (x):
print("Yes, there is at least one match!")
else:
print("No match")
Check if the string contains "ai" followed by 1 or more "x" characters:
finds 'ai' + one or more letters x
If this is what you want, then:
where str like '%aix%'
does what you want.
If you want an underscore, then an underscore is a wildcard in LIKE expressions. Probably the simplest method in SQL Server is to use a character class:
where str like '%ai[_]x%'
another solution is:
where str like '%ai$_x%' escape '$'

Squeak Smalltak, Does +, -, *, / have more precedence over power?

I understand in Smalltalk numerical calculation, if without round brackets, everything starts being calculated from left to right. Nothing follows the rule of multiplication and division having more precedence over addition and subtraction.
Like the following codes
3 + 3 * 2
The print output is 12 while in mathematics we get 9
But when I started to try power calculation, like
91 raisedTo: 3 + 1.
I thought the answer should be 753572
What I actual get is 68574964
Why's that?
Is it because that +, -, *, / have more precedence over power ?
Smalltalk does not have operators with precedence. Instead, there are three different kinds of messages. Each kind has its own precedence.
They are:
unary messages that consist of a single identifier and do not have parameters as squared or asString in 3 squared or order asString;
binary messages that have a selector composed of !%&*+,/<=>?#\~- symbols and have one parameter as + and -> in 3 + 4 or key -> value;
keyword messages that have one or more parameters and a selector with colons before each parameter as raisedTo: and to:by:do: in 4 risedTo: 3 and 1 to: 10 by: 3 do: [ … ].
Unary messages have precedence over binary and both of them have precedence over keyword messages. In other words:
unary > binary > keyword
So for example
5 raisedTo: 7 - 2 squared = 125
Because first unary 2 squared is evaluated resulting in 4, then binary 7 - 4 is evaluated resulting in 3 and finally keyword 5 risedTo: 3 evaluates to 125.
Of course, parentheses have the highest precedence of everything.
To simplify the understanding of this concept don't think about numbers and math all the numbers are objects and all the operators are messages. The reason for this is that a + b * c does not mean that a, b, and c are numbers. They can be humans, cars, online store articles. And they can define their own + and * methods, but this does not mean that * (which is not a "multiplication", it's just a "star message") should happen before +.
Yes, +, -, *, / have more precedence than raisedTo:, and the interesting aspect of this is the reason why this happens.
In Smalltalk there are three types of messages: unary, binary and keyword. In our case, +, -, * and / are examples of binary messages, while raisedTo: is a keyword one. You can tell this because binary messages are made from characters that are not letters or numbers, unlike unary or keywords, which start with a letter or underscore and follow with numbers or letters or underscores. Also, you can tell when a selector is unary because they do not end with a colon. Thus, raisedTo: is a keyword message because it ends with colon (and is not made of non-letter or numeric symbols).
So, the expression 91 raisedTo: 3 + 1 includes two selectors, one binary (+) and one keyword (raisedTo:) and the precedence rule says:
first evaluate unary messages, then binary ones and finally those with keywords
This is why 3 + 1 gets evaluated first. Of course, you can always change the precedence using parenthesis. For example:
(91 raisedTo: 3) + 1
will evaluate first raisedTo: and then +. Note that you could write
91 raisedTo: (3 + 1)
too. But this is usually not done because Smalltalk precedence rules are so easy to remember that you don't need to emphasize them.
Commonly used binary selectors
# the Point creation message for x # y
>= greater or equal, etc.
-> the Association message for key -> value
==> production tranformation used by PetitParser
= equal
== identical (very same object)
~= not equal
~~ not identical
\\ remainder
// quotient
and a lot more. Of course, you are always entitled to create your own.

How can I prove this language is regular?

I'm trying to prove if this language:
L = { w={0,1}* | #0(w) % 3 = 0 } (number of 0's is divisble by 3)
is regular using the pumping lemma, but I can't find a way to do it. All other examples I got, have a simple form or let's say a more defined form such as w = axbycz etc.
I don't think you can use pumping lemma to prove that a language is regular. To prove a language is regular, you just need to give a regular expression or a DFA. In this case the regular expression is quite easy:
1*(01*01*01*)*
(proof: the regular expression clearly does not accept any string which has the number of 0's not divisible by 3, so we just need to prove that all possible strings which has the number of 0's divisible by 3 is accepted by this regular expression, which can be done by confirming that for strings that contain 3n 0's, the regular expression matches it since 1n001n101n201n3...01n3n-201n3n-101n3n has the same number of 0's and the nk's can be substituted so that it matches the string, and that this format is clearly accepted by the regular expression)
Pumping lemma cannot be used to prove that a language is regular because we cannot set the y as in Daniel Martin's answer. Here is a counter-example, in a similar format as his answer (please correct me if I'm doing something fundamentally different from his answer):
We prove that the language L = {w=0n1p | n ∈ N, n>0, p is prime} is regular using pumping lemma as follows: note that there is at least one occurrence of 0, so we take y as 0, and we have xykz = 0n+k-11p, which still satisfy the language definition. Therefore L is regular.
But this is false, since we know that a sequence with prime-numbered length is not regular. The problem here is we cannot just set y to any character.
Any string in this language with at least three characters in it has this property: either the string has a "1" in it, or there are three "0"s in a row.
If the string contains a 1, then you can split it as in the pumping lemma and set y equal to some 1 in the string. Then obviously the strings xyz, xyyz, xyyyz, etc. are all in the language because all those strings have the same number of zeros.
If the string does not contain a 1, it contains three 0s in a row. Setting y to those three 0s, it should be obvious that xyz, xyyz, xyyyz, etc. are all in the language because you're adding three 0 characters each time, so you always have a number of 0s divisible by 3.
#justhalf in the comments is perfectly correct; the pumping lemma can be used to prove that a regular language can be pumped or that a language that cannot be pumped is not regular, but you cannot use the pumping lemma to prove that a language is regular in the first place. Mea Culpa.
Instead, here's a proof that the given language is regular based on the Myhill-Nerode Theorem:
Consider the set of all strings of 0s and 1s. Divide these strings into three sets:
E0, all strings such that the number of 0s is a multiple of three,
E1, all strings such that the number of 0s is one more than a multiple of three,
E2, all strings such that the number of 0s is two more than a multiple of three.
Obviously, every string of 0s and 1s is in one of these three sets.
Furthermore, if x and z are both strings of 0s and 1s, then consider what it means if the concatenation xz is in L:
If x is in E0, then xz is in L if and only if z is in E0
If x is in E1, then xz is in L if and only if z is in E2
If x is in E2, then xz is in L if and only if z is in E1
Therefore, in the language of the theorem, there is no distinguishing extension for any two strings in the same one of our three Ei sets, and therefore there are at most three equivalence classes. A finite number of equivalence classes means the language is regular.
(in fact, there are exactly three equivalence classes, but that isn't needed)
A language is regular if and only if some nondeterministic finite automaton recognizes it.
Automaton is a finite state machine.
We have to build an automaton that regonizes L.
For each state, thinking like:
"Where am I?"
"Where can I go to, with some given entry?"
So, for L = { w={0,1}* | #0(w) % 3 = 0 }
The possibilites (states) are:
The remainder (rest of division) is 0, 1 or 2. Which means we need three states.
Let q0,q1 and q2 be the states that represent the remainderes 0,1 and 2, respectively.
q0 is the start and final state.
Now, for "0" entries, do the math #0(w)%3 and go to the aproppriated state.
Transion functions:
f(q0, 0) = q1
f(q1, 0) = q2
f(q2, 0) = q0
For "1" entries, it just loops wherever it is, 'cause it doesn't change the machine state.
f(qx, 1) = qx
The pumping lemma proves if some language is not regular.
Here is a good book for theory of computation: Introduction to the Theory of Computation 3rd Edition
by Michael Sipser.

search any word inside a string in million rows

I have a set of 50k values say X. each value i want to compare with a set of 10k values say Y. if X is present any where in the string Y it matches.
So each value in X i want to check across each value in Y and assign X if it matches.
what would be the best method to complete this task. It is required for a data mining project.
I loaded the data into MS Access database.
then using a vba program
take each X . Update Y if it matches (Like '%X%') but it is a never ending process. The columns are indexed but no effect.
Is there any algorithm or steps to reduce it into step-by-step process and complete the mapping faster?
Please let me know if there is any other options available other than the answers given below. I ll explain the scenario bit more
Table1.Data
sentense1
sentense2
sentense3
sentense4
sentense5
sentense6
-
-
-
Sentense100k
Table2.Phrase (Means multiple words)
Phrase1
Phrase2
Phrase3
Phrase4
Phrase5
-
-
-
Phrase 100k
Want to check Phrase1 has any Match in Sentense1 to Sentense100k Exact Match of Phrase, anywhere Match of Phrase, Maximum Words in Phrase1 Match in Sentense etc.. and create a map based on best Match(ideally exact phrase available anywhere in the sentense)
Table3 Output
Data Best Possible Phrase Second Best Phrase(Optional)
Sentense1 Phrase1000 Phrase50k
Sentense2 Phrase10 Phrase70k
Please let me know any tool,logic to perform this. The logic what i tried in SQL
1.
Select A.Data,B.Phrase from Table1 A left join Table2 B on A.Data Like '%' + B.Phrase + '%'
2.
Check for any word in phrase available in sentense. So replaced all spaces with % like word1%word2%word3. then did query as
A.Data Like '%' + B.Phrase + '%' which is
A.Data Like '%word1%word2%word3%'
But it takes days to complete the task for this much data.
Any readily usable tools, indexing methods,queries would really help. The answers given below seems too technical for me to adapt. Please guide
You can build a suffix tree in linear time (you can look up suffix trees online), out of the concatenation of all strings in X and Y, with special unique symbols that end each string.
Then for each string Xi in X, you look it up in the suffix tree (linear time in length of Xi) and assign Xi to each string in Y that is somewhere in the subtree rooted at the end of Xi.
This is linear time in the number of strings in Y that Xi is assigned to.
Thus you get an optimal O(N + k) time algorithm, where:
N is the total length of all the strings in X and Y,
and k is the total number of matches between query strings in X and target strings in Y.

Longest common sub sequence recurrence

The recurrence for lcs is:
L[i,j] = max(L[i-1,j], L[i,j-1]) if a[i] != a[j]
Can you tell me why it is i-1 or j-1? Why isn't L[i,j] = L[i-1,j-1] correct?
You are considering the case where a[i] != a[j], which means that the letters you are currently comparing of the two sequences A and B are different. Therefore, the length of the longest common subsequence is one of two things :
the longest common subsequence of the current substring of A minus its first character and B, i.e. L[i-1,j] ;
the longest common subsequence of A and the current substring of B minus its first character, i.e. L[i,j-1].
If L[i-1,j-1] were correct, it would mean that the current characters in both A and B do not count, they don’t get a "chance" to be part of the subsequence.
See for example this explanation (note that it works forward in the sequences instead of backward).