difference between pandas.Series.str.match and pandas.Series.str.contains - pandas

What's the difference between pandas.Series.str.contains and pandas.Series.str.match? Why do they give different results in the case below?
s1 = pd.Series(['house and parrot'])
s1.str.contains(r"\bparrot\b", case=False)
I got True, but when I do
s1.str.match(r"\bparrot\b", case=False)
I got False. Why is that?

The documentation for str.contains() states:
Test if pattern or regex is contained within a string of a Series or
Index.
The documentation for str.match() states:
Determine if each string matches a regular expression.
The difference between these two methods is that str.contains() uses re.search, while str.match() uses re.match.
As per documentation of re.match()
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding match object.
Return None if the string does not match the pattern; note that this
is different from a zero-length match.
So re.match only tries the pattern at the beginning of the string. Your string starts with "house", not "parrot", so the expression returns False. A pattern for house would match at the start of the string, so it would return True.
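The difference is easiest to see by calling the underlying re functions directly. This is a minimal sketch using only the standard library, assuming (as stated above) that str.contains() behaves like re.search and str.match() like re.match; the variable names are just for illustration:

```python
import re

text = "house and parrot"
pattern = r"\bparrot\b"

# re.search scans the whole string, so it finds "parrot" at the end.
search_hit = re.search(pattern, text, flags=re.IGNORECASE) is not None

# re.match only tries the pattern at position 0, where the text is "house".
match_hit = re.match(pattern, text, flags=re.IGNORECASE) is not None

# A pattern that fits the start of the string does succeed with re.match.
match_house = re.match(r"\bhouse\b", text, flags=re.IGNORECASE) is not None

print(search_hit, match_hit, match_house)  # True False True
```

If you want the match-style call to find the word anywhere, one option is to prefix the pattern with .* (for example r".*\bparrot\b"); otherwise just keep using str.contains().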

Related

string operation in ruby

I have the string array below, which has 4 elements. I want to compare the elements of this list to a string and check whether the string is part of this list.
list = ["starter_v2", "professional_q", "custom_v", "basic_t9"]
str = "starter"
if list.include?str #should return true but returning false as it is checking the full string not the substring
Above if condition should return true, however it is returning false.
Can someone suggest how to fix this? I am new to Ruby and want to compare strings.
For my use case, the list object will always have entries with "_" followed by an alphabetic character, and I will compare them against a string without "_".
Enumerable#include? checks if a given value matches any value in the given enumerable. A substring is not equivalent to a string that contains it, so this check fails.
Instead, you want to check if any string in the array matches your substring. Ruby has handy facilities for this: Enumerable#any? lets you iterate an enumerable, yielding each element to a block, and then will return true if any invocation of the block returns true.
So, you can use:
list.any? {|element| element.include?(str) }
What this will do is check each entry in list to see if str is included in it; once a match is found, it'll stop iterating and return true. If it goes through the entire list without finding a match, it'll return false.
You could also use element.start_with? if you know that your search string should always match the first part of the string, or you could use a more complex condition which splits each element on underscore and compares the first part, or you could use a regex. The important part is that the block returns true when you want to indicate a match.

Snowflake - Check if 1st 3 Characters of string are letters

I'm trying to determine how to identify, in Snowflake SQL, whether a product code begins with three letters.
Suggestions?
I did just try: LEFT(P0.PRODUCTCODE,3) NOT LIKE '[a-zA-Z]%' but it didn't work.
Thanks folks
You can use REGEXP_LIKE to return a boolean value indicating whether or not your string matched the pattern you're interested in.
In your case, something like REGEXP_LIKE(string_field_here, '[a-zA-Z]{3}.*')
Breaking down the regular expression pattern:
[a-zA-Z]: Only match letter characters, both upper and lowercase
{3}: Require three of those letters
.*: Allow any number of any characters after those three letters
Note: in many cases, you would need to specifically indicate the beginning/ending of the string in the pattern, but Snowflake's implementation handles that for you. From the docs:
The function implicitly anchors a pattern at both ends (i.e. ''
automatically becomes '^$', and 'ABC' automatically becomes '^ABC$').
To match any string starting with ABC, the pattern would be 'ABC.*'.
You can try running these examples:
SELECT REGEXP_LIKE('abc', '[a-zA-Z]{3}.*') AS _abc,
REGEXP_LIKE('123', '[a-zA-Z]{3}.*') AS _123,
REGEXP_LIKE('abc123', '[a-zA-Z]{3}.*') AS _abc123,
REGEXP_LIKE('123abc', '[a-zA-Z]{3}.*') AS _123abc
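For readers more familiar with Python, the implicit anchoring described above corresponds roughly to re.fullmatch, which also requires the whole string to match. This is a sketch of the same four checks, not Snowflake code; the helper name starts_with_three_letters is made up for illustration:

```python
import re

def starts_with_three_letters(code: str) -> bool:
    # fullmatch anchors at both ends, like the implicit '^...$' quoted above,
    # so the trailing .* is what allows extra characters after the letters.
    return re.fullmatch(r"[a-zA-Z]{3}.*", code) is not None

for code in ["abc", "123", "abc123", "123abc"]:
    print(code, starts_with_three_letters(code))
# abc True
# 123 False
# abc123 True
# 123abc False
```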

What does this SQL query replacing JSON text mean?

I'm trying to understand part of a SQL query, but I don't know what it's used for; can anyone help me?
I know it wants to replace something, but what is " ":"(.+)" ", and why can a string like "store" be used in substring()?
replace((
CASE
WHEN(char_length(substring(xxx_json::text FROM 'Name":"(.+)" , "store')) > 0)
THEN substring(xxx_json::text FROM 'Name":"(.+)" , "store')
ELSE substring(xxx_json::text FROM 'Name":"(.+)" , "employees')
END),'\u0016','''')
This appears to be a variant of substring that does regular-expression matching. The first argument, xxx_json::text, is the string to be searched. The second argument is the regular expression to match.
Note that the second argument consists of the entire SQL string literal 'Name":"(.+)" , "store' (in the first two cases). Everything in that string, except for the (.+), should literally match a portion of the string to be searched. The (.+) is regex syntax. A dot matches any character; a + means one or more occurrences; the parentheses define this as a capture group. In this context, the text that matches the capture group is what will be returned by substring.
So, for instance, if the string to be searched were a simple JSON expression like this: { "Name":"John Smith" , "store":"London" }, the regular expression would match and the substring would return 'John Smith'.
In short, this is a slightly hacky way of parsing JSON in SQL to extract the value of the Name element (or some element whose key ends with Name).
See section 9.7.3 in https://www.postgresql.org/docs/9.4/static/functions-matching.html for detailed documentation on this form of substring.
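To make the capture-group behaviour concrete, here is a rough Python equivalent of the whole expression. It is only a sketch: the function name extract_name is hypothetical, and it assumes that '\u0016' in the query is the literal six-character escape sequence as it appears in the JSON text, not an actual control character.

```python
import re
from typing import Optional

def extract_name(json_text: str) -> Optional[str]:
    # Try the pattern ending in "store" first, then fall back to "employees",
    # mirroring the CASE ... ELSE in the query.
    for next_key in ("store", "employees"):
        found = re.search(r'Name":"(.+)" , "' + next_key, json_text)
        if found:
            # group(1) is the text captured by (.+).
            # Assumption: \u0016 is stored as literal escape text, so it is
            # replaced as a plain substring with an apostrophe.
            return found.group(1).replace("\\u0016", "'")
    return None

print(extract_name('{ "Name":"John Smith" , "store":"London" }'))  # John Smith
```

As with the SQL version, this depends on the exact spacing and key order of the JSON text, which is why a real JSON parser is usually the safer choice.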

Extract all characters before a period with HiveQL regex?

I have a table that looks like:
bl.ah
foo.bar
bar.fight
And I'd like to use HiveQL's regexp_extract to return
bl
foo
bar
The docs say this about regexp_extract:
regexp_extract(string subject, string pattern, int index)
Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
So, if you have a table with a single column (let's call it description for our example) you should be able to use regexp_extract as follows to get the data before a period, if one exists, or the entire string in the absence of a period:
regexp_extract(description,'^([^\.]+)\.?',1)
The components of the regex are as follows:
^ start of string
([^\.]+) any non-period character one or more times, in a capture group
\.? a period either once or no times
Because the part of the string we're interested in will be in the first (and only) capture group, we refer to it by passing the index parameter a value of 1.
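The same regex can be tried out in Python to check its behaviour on the sample rows. This is a sketch rather than HiveQL; the helper name before_period is invented for the example, and the period inside the character class is left unescaped because it is already literal there:

```python
import re
from typing import Optional

def before_period(value: str) -> Optional[str]:
    # ^        start of string
    # ([^.]+)  one or more non-period characters, captured as group 1
    # \.?      an optional period
    found = re.search(r"^([^.]+)\.?", value)
    return found.group(1) if found else None

for value in ["bl.ah", "foo.bar", "bar.fight", "nodot"]:
    print(before_period(value))
# bl
# foo
# bar
# nodot
```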

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this be done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe in a character class. For example, o['~-]?b will match "ob", "o'b", "o-b", and "o~b" (the hyphen goes last so that it is not read as a range).
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
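The transformation described in the update can be sketched in a few lines. The example below is Python rather than Objective-C, purely to show the string manipulation; build_pattern and its ignored parameter are hypothetical names, and ignored is assumed to be non-empty:

```python
import re

def build_pattern(query, ignored="'"):
    # Optional character class made of the characters to ignore, e.g. [']?
    optional = "[" + re.escape(ignored) + "]?"
    # Escape each typed character so user input is treated literally,
    # then allow one ignored character after it.
    body = "".join(re.escape(ch) + optional for ch in query)
    return re.compile(body, re.IGNORECASE)

names = ["Andy O'Brien", "Bob O'Brian", "Jim OBrien", "Larry Oberlin"]
pattern = build_pattern("ob")
matches = [name for name in names if pattern.search(name)]
print(matches)  # all four names match
```

The generated pattern string (o[']?b[']? for the input "ob") is what you would hand to whatever regex facility you are using. Escaping the user's input matters here, since a name search could otherwise break on characters like "." or "(".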
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of those characters (so O'''Brien will match). For a single instance, change it to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
This would match all of the above. Add .+ to either side to match all the other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.