How to use If statement and str.contains() to create a new dataframe column [duplicate] - pandas

This question's answers are a community effort. Edit existing answers to improve this post. It is not currently accepting new answers or interactions.
I'm looking for a string.contains or string.indexof method in Python.
I want to do:
if not somestring.contains("blah"):
continue

Use the in operator:
if "blah" not in somestring:
continue

If it's just a substring search you can use string.find("substring").
You do have to be a little careful with find, index, and in though, as they are substring searches. In other words, this:
s = "This be a string"
if s.find("is") == -1:
print("No 'is' here!")
else:
print("Found 'is' in the string.")
It would print Found 'is' in the string. Similarly, if "is" in s: would evaluate to True. This may or may not be what you want.

Does Python have a string contains substring method?
99% of use cases will be covered using the keyword, in, which returns True or False:
'substring' in any_string
For the use case of getting the index, use str.find (which returns -1 on failure, and has optional positional arguments):
start = 0
stop = len(any_string)
any_string.find('substring', start, stop)
or str.index (like find but raises ValueError on failure):
start = 100
end = 1000
any_string.index('substring', start, end)
Explanation
Use the in comparison operator because
the language intends its usage, and
other Python programmers will expect you to use it.
>>> 'foo' in '**foo**'
True
The opposite (complement), which the original question asked for, is not in:
>>> 'foo' not in '**foo**' # returns False
False
This is semantically the same as not 'foo' in '**foo**' but it's much more readable and explicitly provided for in the language as a readability improvement.
Avoid using __contains__
The "contains" method implements the behavior for in. This example,
str.__contains__('**foo**', 'foo')
returns True. You could also call this function from the instance of the superstring:
'**foo**'.__contains__('foo')
But don't. Methods that start with underscores are considered semantically non-public. The only reason to use this is when implementing or extending the in and not in functionality (e.g. if subclassing str):
class NoisyString(str):
def __contains__(self, other):
print(f'testing if "{other}" in "{self}"')
return super(NoisyString, self).__contains__(other)
ns = NoisyString('a string with a substring inside')
and now:
>>> 'substring' in ns
testing if "substring" in "a string with a substring inside"
True
Don't use find and index to test for "contains"
Don't use the following string methods to test for "contains":
>>> '**foo**'.index('foo')
2
>>> '**foo**'.find('foo')
2
>>> '**oo**'.find('foo')
-1
>>> '**oo**'.index('foo')
Traceback (most recent call last):
File "<pyshell#40>", line 1, in <module>
'**oo**'.index('foo')
ValueError: substring not found
Other languages may have no methods to directly test for substrings, and so you would have to use these types of methods, but with Python, it is much more efficient to use the in comparison operator.
Also, these are not drop-in replacements for in. You may have to handle the exception or -1 cases, and if they return 0 (because they found the substring at the beginning) the boolean interpretation is False instead of True.
If you really mean not any_string.startswith(substring) then say it.
Performance comparisons
We can compare various ways of accomplishing the same goal.
import timeit
def in_(s, other):
return other in s
def contains(s, other):
return s.__contains__(other)
def find(s, other):
return s.find(other) != -1
def index(s, other):
try:
s.index(other)
except ValueError:
return False
else:
return True
perf_dict = {
'in:True': min(timeit.repeat(lambda: in_('superstring', 'str'))),
'in:False': min(timeit.repeat(lambda: in_('superstring', 'not'))),
'__contains__:True': min(timeit.repeat(lambda: contains('superstring', 'str'))),
'__contains__:False': min(timeit.repeat(lambda: contains('superstring', 'not'))),
'find:True': min(timeit.repeat(lambda: find('superstring', 'str'))),
'find:False': min(timeit.repeat(lambda: find('superstring', 'not'))),
'index:True': min(timeit.repeat(lambda: index('superstring', 'str'))),
'index:False': min(timeit.repeat(lambda: index('superstring', 'not'))),
}
And now we see that using in is much faster than the others.
Less time to do an equivalent operation is better:
>>> perf_dict
{'in:True': 0.16450627865128808,
'in:False': 0.1609668098178645,
'__contains__:True': 0.24355481654697542,
'__contains__:False': 0.24382793854783813,
'find:True': 0.3067379407923454,
'find:False': 0.29860888058124146,
'index:True': 0.29647137792585454,
'index:False': 0.5502287584545229}
How can in be faster than __contains__ if in uses __contains__?
This is a fine follow-on question.
Let's disassemble functions with the methods of interest:
>>> from dis import dis
>>> dis(lambda: 'a' in 'b')
1 0 LOAD_CONST 1 ('a')
2 LOAD_CONST 2 ('b')
4 COMPARE_OP 6 (in)
6 RETURN_VALUE
>>> dis(lambda: 'b'.__contains__('a'))
1 0 LOAD_CONST 1 ('b')
2 LOAD_METHOD 0 (__contains__)
4 LOAD_CONST 2 ('a')
6 CALL_METHOD 1
8 RETURN_VALUE
so we see that the .__contains__ method has to be separately looked up and then called from the Python virtual machine - this should adequately explain the difference.

if needle in haystack: is the normal use, as #Michael says -- it relies on the in operator, more readable and faster than a method call.
If you truly need a method instead of an operator (e.g. to do some weird key= for a very peculiar sort...?), that would be 'haystack'.__contains__. But since your example is for use in an if, I guess you don't really mean what you say;-). It's not good form (nor readable, nor efficient) to use special methods directly -- they're meant to be used, instead, through the operators and builtins that delegate to them.

in Python strings and lists
Here are a few useful examples that speak for themselves concerning the in method:
>>> "foo" in "foobar"
True
>>> "foo" in "Foobar"
False
>>> "foo" in "Foobar".lower()
True
>>> "foo".capitalize() in "Foobar"
True
>>> "foo" in ["bar", "foo", "foobar"]
True
>>> "foo" in ["fo", "o", "foobar"]
False
>>> ["foo" in a for a in ["fo", "o", "foobar"]]
[False, False, True]
Caveat. Lists are iterables, and the in method acts on iterables, not just strings.
If you want to compare strings in a more fuzzy way to measure how "alike" they are, consider using the Levenshtein package
Here's an answer that shows how it works.

If you are happy with "blah" in somestring but want it to be a function/method call, you can probably do this
import operator
if not operator.contains(somestring, "blah"):
continue
All operators in Python can be more or less found in the operator module including in.

So apparently there is nothing similar for vector-wise comparison. An obvious Python way to do so would be:
names = ['bob', 'john', 'mike']
any(st in 'bob and john' for st in names)
>> True
any(st in 'mary and jane' for st in names)
>> False

You can use y.count().
It will return the integer value of the number of times a sub string appears in a string.
For example:
string.count("bah") >> 0
string.count("Hello") >> 1

Here is your answer:
if "insert_char_or_string_here" in "insert_string_to_search_here":
#DOSTUFF
For checking if it is false:
if not "insert_char_or_string_here" in "insert_string_to_search_here":
#DOSTUFF
OR:
if "insert_char_or_string_here" not in "insert_string_to_search_here":
#DOSTUFF

You can use regular expressions to get the occurrences:
>>> import re
>>> print(re.findall(r'( |t)', to_search_in)) # searches for t or space
['t', ' ', 't', ' ', ' ']

Related

Writing an `__array_ufunc__` for string dtypes

I'm implementing a class that mixes in NDArrayOperatorsMixin using the appraoch described here.
This works well for numbers, but doesn't work with string dtypes. For example,
x = MyNewArrayClass(np.array(["a", "b", "c"]))
x == "a"
raises the following UFuncTypeError:
numpy.core._exceptions.UFuncTypeError: ufunc 'equal' did not contain a loop with signature matching types (dtype('<U1'), dtype('<U1')) -> dtype('bool')
How can I modify the implementation suggested in the docs to support str dtypes?

Using Numpy.where() with a function on each element

I have a rather complicated function, say:
def func(elem):
// blah blah blah
return True
// blah blah blah
return False
I wish to use the numpy.where() function along the lines of
arr2 = np.where(func(arr1), arr1, 0)
But when I try this syntax and debug, I see that in func, that the entire array is passed rather than individual elements. The typical use cases I see in the documentation/examples only rely on simple comparators like arr < 5, but I need something quite a bit fancier that I don't want to try and write in a single line.
If this is possible, or if there is some vectorized substitute (emphasis on efficiency), any insights appreciated.
I figured out how to do it by using np.vectorize, followed by a list comprehension, not np.where. Maybe from this, one can find out a way to use numpy rather than of a list comprehension.
func_vec = np.vectorize(func)
[arr1 if cond else 0 for cond in func_vec(arr1)]
Anyways, by using func_vec(arr1) you get the True/False values per element.
Note: If you want a new array like arr1, replacing by 0 the elements that return False in your function, then this should work:
arr2 = np.where(func_vec(arr1), arr1, 0)
Edit:
Indeed, np.vectorize is not optimized to performance (by bad), is essentially a for loop under the hood. So I would recommend trying to write your function in a vectorized way rather than trying to vectorize it thereafter.
For example, try converting a function like this:
def func(elem):
if elem > 5:
return True
else:
return False
to something like this:
def func(elem):
return elem > 5
so that you can easily apply func(arr1) without error.
If you really have a function that returns just True or False, I'm pretty sure you can do it, regardless of its complexity. Anyways we're here to help you out!
It seems like you try to get the elements of arr1 you want using func function, but judging from the definition func works for a single element. You need a True/False array of the same shape as arr1 to do so.
If I get it right, a potential solution would be to modify func to operate on the whole array and not only on one element and return the True/False array of shape arr1.shape you need for np.where, since you want to do it in a single line that way.

Why does pandas sometimes use the argument 'dropna' and sometimes 'skipna' in its methods?

Please consider this either a question or a suggestion. Pandas sometimes uses the argument 'dropna' in its methods and sometimes 'skipna' as in the following two cases:
history_trans['category_2'].value_counts(dropna = False)
history_trans['category_2'].median(skipna = False)
It may be better to use 'skipna' in both the cases unless there are some special reasons for using two different names for the arguments.
Base on the doc
median :skipna : boolean, default True
Exclude NA/null values when computing the result.
value_counts: dropna : boolean, default True
Don’t include counts of NaN.
So the value_count will dropna at the end when print the output(default) , but the median will return the value by not consider nan during the calculation, so basically these two are aim two different target.

Accessing the last element in Perl6

Could someone explain why this accesses the last element in Perl 6
#array[*-1]
and why we need the asterisk *?
Isn't it more logical to do something like this:
#array[-1]
The user documentation explains that *-1 is just a code object, which could also be written as
-> $n { $n - 1 }
When passed to [ ], it will be invoked with the array size as argument to compute the index.
So instead of just being able to start counting backwards from the end of the array, you could use it to eg count forwards from its center via
#array[* div 2] #=> middlemost element
#array[* div 2 + 1] #=> next element after the middlemost one
According to the design documents, the reason for outlawing negative indices (which could have been accepted even with the above generalization in place) is this:
The Perl 6 semantics avoids indexing discontinuities (a source of subtle runtime errors), and provides ordinal access in both directions at both ends of the array.
If you don't like the whatever-star, you can also do:
my $last-elem = #array.tail;
or even
my ($second-last, $last) = #array.tail(2);
Edit: Of course, there's also a head method:
my ($first, $second) = #array.head(2);
The other two answers are excellent. My only reason for answering was to add a little more explanation about the Whatever Star * array indexing syntax.
The equivalent of Perl 6's #array[*-1] syntax in Perl 5 would be $array[ scalar #array - 1]. In Perl 5, in scalar context an array returns the number of items it contains, so scalar #array gives you the length of the array. Subtracting one from this gives you the last index of the array.
Since in Perl 6 indices can be restricted to never be negative, if they are negative then they are definitely out of range. But in Perl 5, a negative index may or may not be "out of range". If it is out of range, then it only gives you an undefined value which isn't easy to distinguish from simply having an undefined value in an element.
For example, the Perl 5 code:
use v5.10;
use strict;
use warnings;
my #array = ('a', undef, 'c');
say $array[-1]; # 'c'
say $array[-2]; # undefined
say $array[-3]; # 'a'
say $array[-4]; # out of range
say "======= FINISHED =======";
results in two nearly identical warnings, but still finishes running:
c
Use of uninitialized value $array[-2] in say at array.pl line 7.
a
Use of uninitialized value in say at array.pl line 9.
======= FINISHED =======
But the Perl 6 code
use v6;
my #array = 'a', Any, 'c';
put #array[*-1]; # evaluated as #array[2] or 'c'
put #array[*-2]; # evaluated as #array[1] or Any (i.e. undefined)
put #array[*-3]; # evaluated as #array[0] or 'a'
put #array[*-4]; # evaluated as #array[-1], which is a syntax error
put "======= FINISHED =======";
will likewise warn about the undefined value being used, but it fails upon the use of an index that comes out less than 0:
c
Use of uninitialized value #array of type Any in string context.
Methods .^name, .perl, .gist, or .say can be used to stringify it to something meaningful.
in block <unit> at array.p6 line 5
a
Effective index out of range. Is: -1, should be in 0..Inf
in block <unit> at array.p6 line 7
Actually thrown at:
in block <unit> at array.p6 line 7
Thus your Perl 6 code can more robust by not allowing negative indices, but you can still index from the end using the Whatever Star syntax.
last word of advice
If you just need the last few elements of an array, I'd recommend using the tail method mentioned in mscha's answer. #array.tail(3) is much more self-explanatory than #array[*-3 .. *-1].

Not keyword vs = False when checking for false boolean condition

When I'm using an If statement and I want to check if a boolean is false should I use the "Not" keyword or just = false, like so
If (Not myboolean) then
vs
If (myboolean = False) then
Which is better practice and more readable?
Definitely, use "Not". And for the alternately, use "If (myboolean)" instead of "If (myboolean = true)"
The works best if you give the boolean a readable name:
if (node.HasChildren)
Since there's no functional difference between either style, this is one of those things that just comes down to personal preference.
If you're working on a codebase where a standard has already been set, then stick to that.
Use True and False to set variables, not to test them. This improves readability as described in the other answers, but it also improves portability, particularly when best practices aren't followed.
Some languages allow you to interchange bool and integer types. Consider the contrived example:
int differentInts(int i, int j)
{
return i-j; // Returns non-zero (true) if ints are different.
}
. . .
if (differentInts(4, 8) == TRUE)
printf("Four and Eight are different!\n");
else
printf("Four and Eight are equal!\n");
Horrible style, but I've seen worse sneak into production. On other people's watches, of course. :-)
Additionally to the consensus, when there is both a true case and a false case please use
if (condition)
// true case
else
// false case
rather than
if (not condition)
// false case
else
// true case
(But then I am never sure if python's x is not None is the true-case or the false case.)
Definitely use "Not", consider reading it aloud.
If you read aloud:
If X is false Then Do Y
Do Y
Versus
If Not X Then Do Y
I think you'll find the "Not" route is more natural. Especially if you pick good variable names or functions.
Code Complete has some good rules on variable names. http://cc2e.com/Page.aspx?hid=225 (login is probably required)
! condition
In C and pre-STL C++, "!condition" means condition evaluates to a false truth value, whereas "condition == FALSE" meant that the value of condition had to equal what the system designed as FALSE. Since different implementations defined it in different ways, it was deemed better practice to use !condition.
UPDATE: As pointed out in the comment -- FALSE is always 0, it's TRUE that can be dangerous.
Something else: Omit the parentheses, they’re redundant in VB and as such, constitute syntactic garbage.
Also, I'm slightly bothered by how many people argue by giving technical examples in other languages that simply do not apply in VB. In VB, the only reasons to use If Not x instead of If x = False is readability and logic. Not that you’d need other reasons.
Completely different reasons apply in C(++), true. Even more true due to the existence of frameworks that really handle this differently. But misleading in the context of VB!
It does not make any difference as long you are dealing with VB only, however if you happen to use C functions such as the Win32 API, definitely do not use "NOT" just "==False" when testing for false, but when testing for true do not use "==True" instead use "if(function())".
The reason for that is the difference between C and VB in how boolean is defined.
In C true == 1 while in VB true == -1 (therefore you should not compare the output of a C function to true, as you are trying to compare -1 to 1)
Not in Vb is a bitwise NOT (equal to C's ~ operator not the ! operator), and thus it negates each bit, and as a result negating 1 (true in C) will result in a non zero value which is true, NOT only works on VB true which is -1 (which in bit format is all one's according to the two's complement rule [111111111]) and negating all bits [0000000000] equals zero.
For a better understanding see my answer on Is there a VB.net equivalent for C#'s ! operator?
Made a difference with these lines in vb 2010/12
With the top line, Option Strict had to be turned off.
If InStr(strLine, "=") = False Then _
If Not CBool(InStr(strLine, "=")) Then
Thanks for answering the question for me. (I'm learning)