Is there a nullable boolean type I can use in a Pandas dataframe?

In a program I am working on I have to explicitly set the type of a column that contains boolean data. Sometimes all of the values in this column are None. Unless I provide explicit type information, pandas will infer the wrong type for that column.
Is there a pandas-compatible type that represents a nullable-bool? I want to do something like this, but preserve the Nones:
import pandas
s = pandas.Series([True, False, None]).astype(bool)
print([v for v in s])
gives:
[True, False, False]
Python's built-in bool type cannot hold a null value: it can only be True or False. And in this case, because bool(None) == False, the final None is lost.
But what if I want to preserve my nulls? Is there a type I can give the column which allows for True, False and None?
I have solved a similar issue with numeric columns: for these I can use pandas' Int64, a nullable integer dtype:
import numpy
s = pandas.Series([1, 2, None, numpy.nan]).astype("Int64")
print([v for v in s])
gives:
[1, 2, <NA>, <NA>]
This is exactly the right behaviour for nullable integers; I just need an equivalent type I can use for my nullable bools.

The nullable "boolean" dtype should work:
>>> pd.Series([True, False, None])
0 True
1 False
2 None
dtype: object
>>> pd.Series([True, False, None]).astype("boolean")
0 True
1 False
2 <NA>
dtype: boolean
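For completeness, a minimal sketch (assuming pandas >= 1.0, which introduced the nullable "boolean" dtype) showing that the dtype can also be requested at construction time, and that missing values propagate through logical operations using Kleene logic:
import pandas as pd

# Build the nullable boolean Series directly, without astype():
s = pd.Series([True, False, None], dtype="boolean")
print(list(s))         # [True, False, <NA>]

# <NA> propagates through boolean operations (Kleene logic):
print(list(s & True))  # [True, False, <NA>]
print(list(s | True))  # [True, True, True] -- True | <NA> is True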

Related

Pandas: Fast way to get cols/rows containing na

In Pandas we can drop cols/rows with .dropna(how=..., axis=...), but is there a way to get an array-like of True/False indicators for each col/row, indicating whether that col/row contains NA according to the how and axis arguments?
I.e. is there a way to turn .dropna(how=..., axis=...) into a method which, instead of actually removing cols/rows, just tells us which ones would be removed if we called .dropna(...) with specific how and axis arguments?
Thank you for your time!
You can use isna() to replicate the behaviour of dropna without actually removing data. To mimic the how and axis parameters, chain any() or all() onto it and set the reduction axis accordingly. Note that the reduction axis is the opposite of dropna's axis: reducing along axis=0 describes columns, while dropna(axis=0) removes rows.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
      0     1      2
0  True  True  False
1  True  True   True
Eq. to dropna(how='any', axis=1), i.e. which columns contain any NA:
df.isna().any(axis=0)
Output:
0 True
1 True
2 True
dtype: bool
Eq. to dropna(how='any', axis=0), i.e. which rows contain any NA:
df.isna().any(axis=1)
Output:
0 True
1 True
dtype: bool
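For the how='all' case, swap any() for all(). A quick sketch using the same df as above:
# Eq. to dropna(how='all', axis=0): rows where *every* value is NA
df.isna().all(axis=1)
# 0    False
# 1     True
# dtype: bool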

Why do asarray and list react differently?

I'm trying to understand why:
w = [0.1, 0.2, 0.3, 0.5, 0]
print(w[w!=0])
outputs: 0.2,
while
import numpy as np
w = [0.1, 0.2, 0.3, 0.5, 0]
w = np.asarray(w)
print(w[w!=0])
outputs: [0.1 0.2 0.3 0.5], which seems more logical.
So: why do lists return the second element?
A list and an ndarray implement comparison differently. In particular:
a list returns a single bool of True or False when compared to something else. Clearly the list w is not equal to the value 0, so w != 0 returns True
an ndarray implements comparison element-wise, returning an ndarray of booleans. Thus w != 0 returns [ True  True  True  True False]
Thus
for a list, w[w!=0] is w[True], and since True == 1 this is treated as w[1]
for an ndarray it is w[np.array([True, True, True, True, False])], which leverages numpy's boolean array indexing to return only those elements where the mask is True
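A short sketch that makes the difference concrete:
import numpy as np

w = [0.1, 0.2, 0.3, 0.5, 0]
print(w != 0)      # True -- the whole list is compared with 0
print(w[True])     # 0.2 -- True == 1, so this is just w[1]

a = np.asarray(w)
print(a != 0)      # [ True  True  True  True False] -- element-wise
print(a[a != 0])   # [0.1 0.2 0.3 0.5] -- mask keeps only True positions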

A 'contains' masking operation

Suppose we have such a Series (call it arr):
In [8]: arr = pd.Series(['testing', 'the', 'masking']); arr
Out[8]:
0 testing
1 the
2 masking
dtype: object
Masking on equality is handy:
In [10]: arr == 'testing'
Out[10]:
0 True
1 False
2 False
dtype: bool
But to check whether 't' appears in the individual strings, an explicit iteration is needed:
In [11]: [ u for u in arr if 't' in u]
Out[11]: ['testing', 'the']
Is it possible to get it done with something like
arr contains 't'
It is possible, via the .str accessor:
arr[arr.str.contains('t')]
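Spelled out, the mask and the filtered result look like this (same data as above):
import pandas as pd

arr = pd.Series(['testing', 'the', 'masking'])
mask = arr.str.contains('t')
print(mask.tolist())       # [True, True, False] -- 'masking' has no 't'
print(arr[mask].tolist())  # ['testing', 'the']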

How to interpret the result of pandas dataframe boolean index

The operation seems a little trivial; however, I am a little lost as to its output. Below is a piece of code to illustrate my point.
# sample data for understanding the concept of boolean indexing:
import numpy as np
import pandas as pd
d_f = pd.DataFrame({'x':[0,1,2,3,4,5,6,7,8,9], 'y':[10,12,13,14,15,16,1,2,3,5]})
# changing index of dataframe:
d_f = d_f.set_index([list('abcdefghig')])
# a list of data:
myL = np.array(range(1,11))
# loop through data to boolean slicing and indexing:
for r in myL:
    DF2 = d_f['x'].values == r
The result of the above code is:
array([False, False, False, False, False, False, False, False, False,
       False], dtype=bool)
But all of the values in d_f['x'].values are in myL except 0. It therefore appears that the program was doing an 'index for index' matching of the elements in myL and d_f['x'].values. Is this typical behavior of the pandas library? If so, can someone please explain the rationale behind it for me? Thank you in advance.
As @coldspeed states, you are overwriting DF2 on every iteration, so after the loop it holds only the final comparison, d_f['x'].values == 10, which is all False because no value of x equals 10.
What I think you are trying to do is this instead:
d_f['x'].isin(myL)
Output:
a False
b True
c True
d True
e True
f True
g True
h True
i True
g True
Name: x, dtype: bool
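If the per-value comparisons are actually needed, here is a sketch of how to keep them all instead of overwriting DF2 each time (the masks name is my own, not from the question):
import numpy as np
import pandas as pd

d_f = pd.DataFrame({'x': [0,1,2,3,4,5,6,7,8,9],
                    'y': [10,12,13,14,15,16,1,2,3,5]})
d_f = d_f.set_index([list('abcdefghig')])
myL = np.array(range(1, 11))

# One boolean mask per value of r, instead of overwriting:
masks = {r: d_f['x'].values == r for r in myL}

# Or ask the vectorised question directly:
print(d_f['x'].isin(myL))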

How to compare the date from two numpy.datetime64 values

What is the proper method to compare the date portion of two numpy.datetime64's?
A: 2011-01-10 Type: <type 'numpy.datetime64'>
B: 2011-01-10T09:00:00.000000-0700 Type: <type 'numpy.datetime64'>
The above example returns False when comparing (A == B).
You'll want to strip your datetime64 of its time information before comparison by casting to the 'datetime64[D]' data type, like this:
>>> a = numpy.datetime64('2011-01-10')
>>> b = numpy.datetime64('2011-01-10T09:00:00.000000-0700')
>>> a == b
False
>>> a.astype('datetime64[D]') == b.astype('datetime64[D]')
True
I couldn't get numpy to create an array of datetime64[D] values from the string you gave for b above, by the way. I got this error:
>>> b = numpy.array(['2011-01-10T09:00:00.000000-0700'], dtype='datetime64[D]')
TypeError: Cannot parse "2011-01-10T09:00:00.000000-0700" as unit 'D' using casting rule 'same_kind'
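One possible workaround (a sketch; it assumes a numpy version that still parses the timezone offset, which newer releases flag with a DeprecationWarning) is to parse at a finer unit first and then cast down to days:
>>> b = numpy.array(['2011-01-10T09:00:00.000000-0700'], dtype='datetime64[us]')
>>> b.astype('datetime64[D]')
array(['2011-01-10'], dtype='datetime64[D]')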