Snowflake - Check if 1st 3 Characters of string are letters - sql

I am trying to work out how to identify, in Snowflake SQL, whether a product code begins with three letters.
Suggestions?
I did just try: LEFT(P0.PRODUCTCODE,3) NOT LIKE '[a-zA-Z]%' but it didn't work.
Thanks folks

You can use REGEXP_LIKE to return a boolean value indicating whether or not your string matched the pattern you're interested in.
In your case, something like REGEXP_LIKE(string_field_here, '[a-zA-Z]{3}.*')
Breaking down the regular expression pattern:
[a-zA-Z]: Only match letter characters, both upper and lowercase
{3}: Require three of those letters
.*: Allow any number of any characters after those three letters
Note: in many cases, you would need to specifically indicate the beginning/ending of the string in the pattern, but Snowflake's implementation handles that for you. From the docs:
The function implicitly anchors a pattern at both ends (i.e. ''
automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
To match any string starting with ABC, the pattern would be 'ABC.*'.
You can try running these examples:
SELECT REGEXP_LIKE('abc', '[a-zA-Z]{3}.*') AS _abc,
REGEXP_LIKE('123', '[a-zA-Z]{3}.*') AS _123,
REGEXP_LIKE('abc123', '[a-zA-Z]{3}.*') AS _abc123,
REGEXP_LIKE('123abc', '[a-zA-Z]{3}.*') AS _123abc
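If you want to sanity-check the pattern outside Snowflake, here is a minimal Python sketch. It assumes that `re.fullmatch` is a fair stand-in for the implicit anchoring described in the docs quote above; the helper name `starts_with_three_letters` is made up for illustration and is not part of any Snowflake API.

```python
import re

# Same pattern as in the answer. re.fullmatch requires the whole string to
# match, which imitates the implicit '^...$' anchoring of REGEXP_LIKE.
PATTERN = re.compile(r"[a-zA-Z]{3}.*")


def starts_with_three_letters(code):
    """Return True if the first three characters are ASCII letters."""
    return PATTERN.fullmatch(code) is not None


for sample in ["abc", "123", "abc123", "123abc"]:
    print(sample, starts_with_three_letters(sample))
```

With `re.search` instead of `re.fullmatch`, '123abc' would also match, which is why the anchoring behaviour matters here.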

Related

Regex matching sequence of characters

I have a test string such as: The Sun and the Moon together, forever
I want to be able to type a few characters or words and be able to match this string if the characters appear in the correct sequence together, even if there are missing words. For example, the following search word(s) should all match against this string:
The Moon
Sun tog
Tsmoon
The get ever
What regex pattern should I be using for this? I should add that the supplied test strings are going to be dynamic within an app, and so I'd like to be able to use a pattern based on the search string.
From your example Tsmoon you show partial words (T), ignoring case (s, m) and allow anything between each entered character. So as a first attempt you can:
Set the ignore case option
Between each character of the input, insert the regular expression to match zero or more of anything. You can choose whether to match the shortest or longest run.
Try that, reading the documentation for NSRegularExpression if you're stuck, and see how it goes. If you get stuck ask a new question showing your code and the RE constructed and explain what happens/doesn't work as expected.
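As a rough illustration of those two steps, here is a sketch in Python rather than NSRegularExpression (the regex syntax used is common to both, but treat the port as an assumption). The function name `subsequence_pattern` is hypothetical, and dropping whitespace from the search text is my own choice, not something the question requires.

```python
import re


def subsequence_pattern(query, shortest=True):
    """Build a case-insensitive regex that matches the characters of
    `query` in order, allowing anything in between them."""
    gap = ".*?" if shortest else ".*"
    # Whitespace in the query is dropped (an assumption), so "Sun tog" and
    # "Suntog" behave the same. Every other character is escaped so that
    # user input such as "." or "(" is matched literally.
    chars = [re.escape(ch) for ch in query if not ch.isspace()]
    return re.compile(gap.join(chars), re.IGNORECASE | re.DOTALL)


text = "The Sun and the Moon together, forever"
for query in ["The Moon", "Sun tog", "Tsmoon", "The get ever", "Moon Sun"]:
    print(query, bool(subsequence_pattern(query).search(text)))
```

The last query is there as a negative case: its characters do not appear in that order in the test string.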
HTH

get the pattern of unknown strings using sql?

I have a database with thousands of unknown strings. They may be emails or phone numbers,
but to me they are not emails or cell numbers; they are only strings. I want to find their common patterns. Here are the strings, for example purposes:
link to example click here
Now, what I want is this file's output when a pattern matches 3 times. Here is what I am doing:
DECLARE @strs2 nvarchar(255)
DECLARE @patternTable table(
id int ,
order by p.pat
but my example returns this:
485-2889
485-2889
) 485-2889
) 485-2889
.aol.com/aol/search?
.aol.com/aol/search?
gmail.com
gmail.com
but I want to add this for the pattern:
[a-zA-Z 0-9] [a-zA-Z 0-9] [a-zA-Z 0-9] - 485-2889
and for gmail:
[a-zA-Z 0-9] [a-zA-Z 0-9]@ gmail.com
First of all, this is much more work than it might seem.
As far as I can tell, it's going to be a method with heavy processing, and probably not something you want to do with a cursor in SQL (cursors are fairly inefficient).
You have to define a way for your code to identify a pattern. You will also have to work out priorities for cases where a set of strings matches multiple patterns. For instance, if you implement the following pattern criteria (in your example):
BK-M18B-48
BK-M18B-52
BK-M82B-44
BK-M82S-38
BK-M82S-44
BK-R50B-58
BK-R50B-62
.....
should generate BK-[A-Z][0-9][0-9][A-Z]-[0-9][0-9]
Then the next set can have multiple patterns as a result:
fedexcarepackage@outlook.com (example added for explanations)
fedexcarepackage@office.com
fedexcourierexpress@pisem.net
fedexcouriers@gmail.com (another example added for explanations)
.....
Can generate:
fedexc%@%.% (as you said)
fedexc%@% (depending on processing)
fedexc[A-Z][A-Z]....%@%[A-Z]....[A-Z].[A-Z][A-Z][A-Z] (alphanumerics with '%' to compensate for length difference)
In addition to that, if you take away fedexcarepackage@outlook.com from the string list you get 1 additional pattern that you probably don't want to have:
fedexc%@%i%.% (because they all have an 'i' somewhere between the '@' and the '.' (dot))
Anyway, that is something you will have to consider with your design.
I'll give you some basic logic you can work with:
Create functions to identify each distinct pattern (1 pattern per function). For instance, 1 function to check for static pieces of string (and attach wildcards); another to detect [A-Z], [0-9] patterns that match your conditions for this pattern to be valid; more if needed for different patterns.
Create a function to test a string against your pattern. So say you have 4 strings and you find a pattern when comparing the first 2 of them. Then you use this function to test whether the pattern applies to the 3rd and 4th strings.
Create a function to test if 2 patterns are mutually exclusive. For instance, the patterns 'PersonA@yahoo.%' and 'PersonA@%.net' are not mutually exclusive if they were both tested to be true. 'Person%@yahoo.com' and 'PersonB@yahoo.com' are mutually exclusive (both patterns cannot be true, so 1 is redundant).
Create a function to combine patterns that are NOT mutually exclusive (this probably involves the functions from the 2nd and 3rd points). So 'PersonA@yahoo.%' and 'PersonA@%.net' can be combined into 'PersonA@%.%'
Once you have that set up, loop through each text line and compare the current line to the next against each pattern criterion. Record any patterns you find in a variable dedicated to that criterion (don't mix them just yet).
Next comes the hardest part. The safest way is to compare each pattern you find against each of the strings, to rule out the ones that don't apply to all strings. However, you could probably work out a way to combine patterns (in the same category) without cross-checking.
Finally, after you have narrowed down your pattern list to 1 pattern per pattern type, combine them into 1 or eliminate the ones
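To make the "test a string against a pattern" and "rule out patterns that don't apply to all strings" steps more concrete, here is a small sketch in Python rather than T-SQL. All function names are hypothetical, and it only understands the '%' wildcard (not '_' or bracket classes), so take it as a starting point rather than a full LIKE implementation.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE-style pattern ('%' = any run of characters)
    into a compiled regular expression."""
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.DOTALL)


def matches(pattern, text):
    """Test one string against one pattern (whole-string match)."""
    return like_to_regex(pattern).fullmatch(text) is not None


def patterns_valid_for_all(patterns, strings):
    """Keep only the candidate patterns that apply to every string."""
    return [p for p in patterns if all(matches(p, s) for s in strings)]


strings = [
    "fedexcarepackage@office.com",
    "fedexcourierexpress@pisem.net",
    "fedexcouriers@gmail.com",
]
candidates = ["fedexc%@%.%", "fedexc%@%.com", "fedexc%@%i%.%"]
print(patterns_valid_for_all(candidates, strings))
```

Note that the 'i' pattern survives for these three strings, which is the unwanted extra pattern mentioned earlier. It drops out again once fedexcarepackage@outlook.com is added to the list.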
Keep in mind that in your pattern detection functions, you'll probably have to test each line multiple times and combine patterns. Some pseudo code to demonstrate:
Function CompareForStringMatches(String s1, String s2) { -- it should return a possible pattern found
    Array/List pattern;
    int patternsFound = 0;
    For (i = 0, to length of shorter string) {
        For (x = i + 1, to length of shorter string) {
            if (longerString.contains(shorterString.substring(from i, to x))) {
                -- record the pattern somewhere as:
                pattern[patternsFound] = Replace(longerString, shorterString.Substring(from i, to x), '%') -- pattern = longerString with substring replaced with '%' sign
                patternsFound = patternsFound + 1;
            }
        }
    }
    -- After the loops, make another loop to check (partial) patterns against each other to eliminate patterns that are part of a larger pattern
    -- for instance, comparing 'random@asd.com' and 'sundom@asd.com', the patterns below should be found:
    --- compare '%andom@asd.com' and '%ndom@asd.com' and eliminate the first pattern, because both are valid, but the second pattern includes the first one.
    -- You will have a lot of similar matches, but if you do this, you should end up with only a few patterns.
    -- After the first cycle of checks do another one to combine patterns where possible (for instance, if you compare 'random@asd.com' and 'sundom@asd.net' you will end up with these 2 patterns: '%ndom@asd.com' and 'random@asd.%').
    -- Since these patterns are true (because they were found during a comparison) you can combine them into '%ndom@asd.%'
    -- When you combine/eliminate all patterns, you should only have 1 left
    return pattern[only pattern left];
}
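For comparison, here is a runnable sketch of the same pairwise idea in Python. Be aware that it swaps in a different technique: instead of the nested substring loops, it uses `difflib.SequenceMatcher` from the standard library to find the shared pieces of two strings and puts a '%' wherever they differ. The name `derive_pattern` and the `min_len` cut-off (which ignores very short coincidental matches) are my own assumptions.

```python
from difflib import SequenceMatcher


def derive_pattern(s1, s2, wildcard="%", min_len=2):
    """Derive one wildcard pattern from two strings: shared runs of at
    least `min_len` characters are kept, everything else becomes a
    wildcard."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    parts = []
    pos1 = pos2 = 0  # end of the last kept block in s1 and s2
    for a, b, size in matcher.get_matching_blocks():
        if 0 < size < min_len:
            continue  # too short to be meaningful; treat as a difference
        if a > pos1 or b > pos2:
            parts.append(wildcard)  # the strings differ before this block
        if size:
            parts.append(s1[a:a + size])
        pos1, pos2 = a + size, b + size
    return "".join(parts)


print(derive_pattern("random@asd.com", "sundom@asd.com"))
print(derive_pattern("random@asd.com", "sundom@asd.net"))
```

This only compares two strings at a time. You would still need the cross-checking step against the remaining strings before trusting a pattern.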
PS: You can do things much more efficiently, but if you have no idea where to start, you probably need to do it the long way and work on improvements from the first working prototypes.
Edit/Update
I suggest you make a wildcard detection method and then apply other pattern checks you implement before it.
Wildcard detection for comparison of 2 strings (pseudo code), heavy processing version :
Compare 2 strings, check if every possible segment of shorter string is within longer:
for (int i = 0; i < shorterString.Length; i++) {
    for (int x = i + 1; x <= shorterString.Length; x++) {
        if (longerString.contains(shorterString.substring(i, x))) { -- from i to x
            possiblePattern.Add(longerString.replace(shorterString.substring(i, x), '*'))
            -- add to pattern list
        }
    }
}
-- Next compare partial matches and eliminate ones that are a part of a larger pattern
-- So '*a@gmail.com' and '*na@yahoo.com' comparison should eliminate '*na@gmail.com', because if the shorter pattern (with more symbols removed) is valid, then a similar one with an extra symbol is part of it
-- When that is done, combine remaining matches if there's more than 1 left.
-- Remember, all patterns are valid if your first loop was correct, so '*@gmail.com' and 'personA@*.com' can be combined into '*@*.com'
As for the alphanumeric detection, I would suggest you start by checking the length of all strings. If they are the same, run the wildcard pattern detection method (for all of them). When done, ONLY look for pattern matches in wildcards.
So, you'll get a pattern like BK-*-* from the wildcard detection run. On the second iteration loop, take 2 strings and only extract the sub-strings that are represented by wildcard characters (use an array or an equivalent to store sub-strings, and make sure not to combine both wildcards of a single string into 1 string).
So if you compare with pattern found above (BK-*-*) :
BK-M18B-48
BK-M18B-52
You should get following string sets to process after eliminating static characters:
Set 1:M18B and 48
Set 2:M18B and 52
Compare each character to the opposite string in the same position and check if the characters match your category (like if String1[0].isaLetter AND String2[0].isaLetter). If they do, add that 1 character to a pattern; if not, either:
Add a wildcard character (this will lead to a pattern like BK-[A-Z]*[0-9][0-9]-[0-9][0-9]). If you do this, combine adjacent wildcard characters into 1.
Pattern is false and you should abort the check, returning no patterns.
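A simplified sketch of this position-by-position check, in Python. It skips the separate wildcard pass and works directly on equal-length strings, which is an assumption on my part, and it takes the "add a wildcard" option rather than aborting. The function name `class_pattern` is hypothetical, and letters of either case are reported as [A-Z] just to mirror the notation used above.

```python
def class_pattern(strings):
    """Generalise equal-length strings position by position."""
    if len({len(s) for s in strings}) != 1:
        raise ValueError("all strings must have the same length")
    parts = []
    for column in zip(*strings):  # characters sharing one position
        if len(set(column)) == 1:
            parts.append(column[0])  # static character, keep as-is
        elif all(c.isascii() and c.isalpha() for c in column):
            parts.append("[A-Z]")
        elif all(c.isascii() and c.isdigit() for c in column):
            parts.append("[0-9]")
        elif not parts or parts[-1] != "*":
            parts.append("*")  # mixed categories; adjacent wildcards merge
    return "".join(parts)


codes = ["BK-M18B-48", "BK-M18B-52", "BK-M82B-44", "BK-M82S-38",
         "BK-M82S-44", "BK-R50B-58", "BK-R50B-62"]
print(class_pattern(codes))
print(class_pattern(codes[:2]))
```

With only the first two codes, far fewer positions vary, so the result stays much closer to the literal strings. The more samples you feed in, the more general the pattern gets.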
Use this basic logic to loop through the strings, and create (and store!) patterns for each set of 2 strings. Loop through the patterns, with wildcard detection (possibly a lighter version), to combine/eliminate patterns. So if you get patterns like '@yahoo.com' and '@gmail.com' from different sets of strings you should combine them into '@.com'
Keep in mind there's lots of room for optimization here.

IP Address/Hostname match regex

I need to match two ipaddress/hostname with a regular expression:
Like 20.20.20.20
should match with 20.20.20.20
should match with [http://20.20.20.20/abcd]
should not match with 20.20.20.200
should not match with [http://20.20.20.200/abcd]
should not match with [http://120.20.20.20/abcd]
should match with AB_20.20.20.20
should match with 20.20.20.20_AB
At present I am using something like this regular expression: "(.*[^(\w)]|^)20.20.20.20([^(\w)].*|$)"
But it is not working for the last two cases, as "\w" is equal to [a-zA-Z0-9_]. Here I also want to eliminate the "_" underscore. I tried different combinations but was not able to succeed. Please help me with this regular expression.
(.*[_]|[^(\w)]|^)10.10.10.10([_]|[^(\w)].*|$)
I spent some more time on this. This regular expression seems to work.
I don't know which language you're using, but with Perl-like regular expressions you could use the following, shorter expression:
(?:\b|\D)20\.20\.20\.20(?:\b|\D)
This effectively says:
Match word boundary (\b, here: the start of the word) or a non-digit (\D).
Match IP address.
Match word boundary (\b, here: the end of the word) or a non-digit (\D).
Note 1: ?: causes the grouping (\b|\D) not to create a backreference, i.e. to store what it has found. You probably don't need the word boundaries/non-digits to be stored. If you actually need them stored, just remove the two ?:s.
Note 2: This might be nit-picking, but you need to escape the dots in the IP address part of the regular expression, otherwise you'd also match any other character at those positions. Using 20.20.20.20 instead of 20\.20\.20\.20, you might for example match a line carrying a timestamp when you're searching through a log file...
2012-07-18 20:20:20,20 INFO Application startup successful, IP=20.20.20.200
...even though you're looking for IP addresses and that particular one (20.20.20.200) explicitly shouldn't match, according to your question. Admittedly though, this example is quite an edge case.
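Since the language wasn't specified, here is a quick check of the shorter expression in Python, whose `re` module supports `\b`, `\D` and `(?:...)` in the Perl-like way assumed above. The test strings are the ones from the question; the variable names are just for illustration.

```python
import re

IP_PATTERN = re.compile(r"(?:\b|\D)20\.20\.20\.20(?:\b|\D)")

# Expected outcome for each case listed in the question.
cases = {
    "20.20.20.20": True,
    "[http://20.20.20.20/abcd]": True,
    "20.20.20.200": False,
    "[http://20.20.20.200/abcd]": False,
    "[http://120.20.20.20/abcd]": False,
    "AB_20.20.20.20": True,
    "20.20.20.20_AB": True,
}
for text, expected in cases.items():
    found = IP_PATTERN.search(text) is not None
    print(f"{text!r}: matched={found}, expected={expected}")

# Note 2 in action: unescaped dots also match the timestamp.
log_line = ("2012-07-18 20:20:20,20 INFO Application startup successful, "
            "IP=20.20.20.200")
print(re.search(r"20.20.20.20", log_line).group())
print(IP_PATTERN.search(log_line))
```

The underscore cases work because `_` is a non-digit, so `\D` picks it up even though there is no word boundary between `_` and a digit.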

SQL to return results for the following regex

I have the following regular expression:
WHERE A.srvc_call_id = '40750564' AND REGEXP_LIKE (A.SRVC_CALL_DN, '[^TEST]')
The row that contains 40750564 has "TEST CALL" in the column SRVC_CALL_DN and REGEXP_LIKE doesn't seem to be filtering it out. Whenever I run the query it returns the row when it shouldn't.
Is my regex pattern wrong? Or does SQL not accept [^whatever]?
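Outside square brackets, the caret anchors the expression to the start of a string; as the first character inside square brackets it negates the set instead. By enclosing the letters T, E, S & T in square brackets you're searching, as barsju suggests, for a single character from a set (here, any character other than T, E or S), not for the string TEST.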
You say that SRVC_CALL_DN contains the string 'TEST CALL', but you don't say where in the string. You also say that you're looking for where this string doesn't match. This implies that you want to use not regexp_like(...
Putting all this together I think you need:
AND NOT REGEXP_LIKE (A.SRVC_CALL_DN, '^TEST[[:space:]]CALL')
This excludes every row from your query where the string starts with 'TEST CALL'. However, if this string may be in any position in the column you need to remove the caret - ^.
This also assumes that the string is always in upper case. If it's in mixed case or lower, then you need to change it again. Something like the following:
AND NOT REGEXP_LIKE (upper(A.SRVC_CALL_DN), '^TEST[[:space:]]CALL')
By upper-casing SRVC_CALL_DN you ensure that you're always going to match regardless of case, but it also means that your query may not use an index on this column. I wouldn't worry about this particular point, as regular expression queries can be fairly poor at using indexes anyway and it appears as though SRVC_CALL_ID is indexed.
Also if it may not include 'CALL' you will have to remove this. It is best when using regular expressions to make your match pattern as explicit as possible; so include 'CALL' if you can.
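If it helps to see the difference between the two patterns, here is a small sketch using Python's `re` module as a stand-in for the database engine (an assumption: the bracket and caret semantics shown are the same, but Python does not support the POSIX class [[:space:]], so `\s` is used instead). The helper `keep_row` is hypothetical.

```python
import re

value = "TEST CALL"

# '[^TEST]' is a negated character class: it matches one character that is
# not T, E or S. The space in "TEST CALL" satisfies it, so the row is kept.
print(repr(re.search(r"[^TEST]", value).group()))

# A value made up only of T, E and S is the only kind that fails to match.
print(re.search(r"[^TEST]", "TEST"))

# Anchored literal match, as in the corrected condition.
starts_with_test_call = re.compile(r"^TEST\sCALL")


def keep_row(description):
    """Mimic 'AND NOT REGEXP_LIKE(upper(col), pattern)'."""
    return starts_with_test_call.search(description.upper()) is None


print(keep_row("Test call from lab"), keep_row("REAL CALL"))
```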
Try with '^TEST' or '^TEST.*'
Your regexp matches any string that contains at least one character other than T, E or S.
But your case is so simple, starts with TEST. Why not use a simple like:
LIKE 'TEST%'

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this be done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe in a character class. For example, o['~-]?b will match "ob", "o'b", "o-b", and "o~b" (the hyphen goes last so that it is read literally rather than as a range).
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
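Here is a minimal sketch of that transformation, written in Python rather than for NSPredicate (so treat the port as an assumption, although the regex syntax involved is the same). The function name `ignoring_pattern` is hypothetical. It places the optional class between characters rather than after every one, which is equivalent when searching, and it escapes the user's input so that characters like "." are taken literally.

```python
import re


def ignoring_pattern(query, ignored="'"):
    """Build a case-insensitive regex for `query` that tolerates one
    optional character from `ignored` between consecutive characters."""
    optional = f"[{re.escape(ignored)}]?" if ignored else ""
    # Drop ignored characters from the query itself, then escape the rest.
    chars = [re.escape(ch) for ch in query if ch not in ignored]
    return re.compile(optional.join(chars), re.IGNORECASE)


names = ["Andy O'Brien", "Bob O'Brian", "Jim OBrien", "Larry Oberlin"]

for query in ["ob", "obrie", "o'brie"]:
    pattern = ignoring_pattern(query)
    print(query, [name for name in names if pattern.search(name)])
```

Swapping the `?` for `*` would also accept repeated ignored characters, at the cost of a looser match.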
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of characters (so O'''Brien will match) - for a single instance, change to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
Would match all of the above, add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.