Some column in dataframe df, df.column, is stored as datatype int64.
The values are all 1s or 0s.
Is there a way to replace these values with boolean values?
df['column_name'] = df['column_name'].astype('bool')
For example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 2, size=5), columns=['foo'])
print(df)
#    foo
# 0    0
# 1    1
# 2    0
# 3    1
# 4    1
df['foo'] = df['foo'].astype('bool')
print(df)
yields
     foo
0  False
1   True
2  False
3   True
4   True
Given a list of column_names, you could convert multiple columns to bool dtype using:
df[column_names] = df[column_names].astype(bool)
If you don't have a list of column names, but wish to convert, say, all numeric columns, then you could use
column_names = df.select_dtypes(include=[np.number]).columns
df[column_names] = df[column_names].astype(bool)
References: Stack Overflow answers by unutbu (Jan 9 at 13:25) and BrenBarn (Sep 18, 2017).
I had numeric columns like age and ID that I did not want converted to Boolean. So after identifying the numeric columns as unutbu showed above, I filtered out the columns whose maximum was greater than 1.
# As per unutbu: get the columns of numeric dtype.
column_names = df.select_dtypes(include=[np.number]).columns

# Take the max of each numeric column and store it in a temporary frame m.
m = df[column_names].max().reset_index(name='max')

# Filter, as BrenBarn showed in another post, for the rows where max == 1
# and store the result in a temporary variable n.
n = m.loc[m['max'] == 1, 'max']

# The indexes of n line up with positions in column_names, so p holds the
# names of the columns whose maximum is 1.
p = column_names[n.index]

# Convert only those columns; using column_names directly instead of p
# would turn all numeric columns into Booleans.
df[p] = df[p].astype(bool)
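A more compact variant of the same idea, as a sketch; it assumes the binary columns contain only 0s and 1s (a stricter check could also require the minimum to be >= 0):
import numpy as np
# Numeric columns whose maximum does not exceed 1 are treated as binary here.
num = df.select_dtypes(include=[np.number])
binary_cols = num.columns[num.max() <= 1]
df[binary_cols] = df[binary_cols].astype(bool)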
There are various ways to achieve this; the options below are covered in turn:
Using pandas.Series.map
Using pandas.Series.astype
Using pandas.Series.replace
Using pandas.Series.apply
Using numpy.where
As the OP didn't specify the dataframe, this answer uses the following dataframe:
import pandas as pd
df = pd.DataFrame({'col1': [1, 0, 0, 1, 0], 'col2': [0, 0, 1, 0, 1], 'col3': [1, 1, 1, 0, 1], 'col4': [0, 0, 0, 0, 1]})
[Out]:
   col1  col2  col3  col4
0     1     0     1     0
1     0     0     1     0
2     0     1     1     0
3     1     0     0     0
4     0     1     1     1
We will consider that one wants to convert to boolean only the values in col1; to transform the whole dataframe, see the notes below. The Time Comparison section then measures the execution time of each option.
Option 1
Using pandas.Series.map as follows
df['col1'] = df['col1'].map({1: True, 0: False})
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 2
Using pandas.Series.astype as follows
df['col1'] = df['col1'].astype(bool)
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 3
Using pandas.Series.replace, with one of the following options
# Option 3.1
df['col1'] = df['col1'].replace({1: True, 0: False})
# or
# Option 3.2
df['col1'] = df['col1'].replace([1, 0], [True, False])
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 4
Using pandas.Series.apply and a custom lambda function as follows
df['col1'] = df['col1'].apply(lambda x: True if x == 1 else False)
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Option 5
Using numpy.where as follows
import numpy as np
df['col1'] = np.where(df['col1'] == 1, True, False)
[Out]:
    col1  col2  col3  col4
0   True     0     1     0
1  False     0     1     0
2  False     1     1     0
3   True     0     0     0
4  False     1     1     1
Time Comparison
For this specific case, time.perf_counter() was used to measure the execution time of each option.
       method                    time
0    Option 1  0.00000120000913739204
1    Option 2  0.00000220000219997019
2  Option 3.1  0.00000179999915417284
3  Option 3.2  0.00000200000067707151
4    Option 4  0.00000400000135414302
5    Option 5  0.00000210000143852085
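The exact timing harness wasn't shown in the answer; a minimal sketch of how such a measurement might look:
import time

start = time.perf_counter()
df['col1'] = df['col1'].astype(bool)  # or any of the options above
print('Option 2', time.perf_counter() - start)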
Notes:
There are strong opinions on using .apply(), so one might want to read this.
There are additional ways to measure the time of execution. For additional ways, read this: How do I get time of a Python program's execution?
To convert the whole dataframe, one can do, for example, the following
df = df.astype(bool)
[Out]:
    col1   col2   col3   col4
0   True  False   True  False
1  False  False   True  False
2  False   True   True  False
3   True  False  False  False
4  False   True   True   True
Related
I have a list of 18 different dataframes. The only thing these dataframes have in common is that each contains a variable that ends with "_spec". The computations I would like to perform on each dataframe in the list are as follows:
1. return the number of columns in each dataframe that are numeric;
2. filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above); and
3. store the results of #2 in a separate list of 18 dataframes.
I can get the output that I would like for each individual dataframe with the following:
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) # count (negative) no. of numeric vars in df
lvmo_spec = df[df.sum(numeric_only=True,axis=1)==lvmo_numlength].filter(regex='_spec') # does ^ = sum of numeric vars?
lvmo_spec.to_list()
but I don't want to copy and paste this 18(+) times...
I am new to writing functions and loops, but I know these can be utilized to perform the procedure I desire; yet I don't know how to execute it. The below code shows the abomination I have created, which can't even make it off the ground. Any suggestions?
# make a list of dataframes
name_list = [lvmo, trx_nonrx, pd, odose_drg, fx, cpn_use, dem_hcc, dem_ori, drg_man, drg_cou, nlx_gvn, nlx_ob, opd_rsn, opd_od, psy_yn, sti_prep_tkn, tx_why, tx_curtx]

# create variable that satisfies condition 1
def numlen(name):
    return name + "_numlen"

# create variable that satisfies condition 2
def spec(name):
    return name + "_spec"

# loop it all together
for name in name_list:
    numlen(name) = -len(name.select_dtypes('number').columns.tolist())
    spec(name) = name[name.sum(numeric_only=True,axis=1)]==numlen(name).filter(regex='spec')
You can achieve what I believe your question is asking as follows, given input df_list which is a list of dataframes:
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
Explanation:
for each input dataframe, create a new dataframe as follows: for rows where the sum of the values in numeric columns is <=0 and is equal in magnitude to the number of numeric columns, select only those columns with a label ending in '_spec'
use a list comprehension to compile the above new dataframes into a list
Note that this can also be expressed using a standard for loop instead of a list comprehension as follows:
res_list = []
for df in df_list:
    res_list.append(df[df.sum(numeric_only=True, axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec'))
Sample code (using 7 input dataframe objects instead of 18):
import pandas as pd
# note: the dict-union operator `|` requires Python 3.9+
df_list = [pd.DataFrame({'b':['a','b','c','d']} | {f'col{i+1}{"_spec" if not i%3 else ""}':[-1,0,0]+([0 if i!=n-1 else -n]) for i in range(n)}) for n in range(7)]
for df in df_list: print(df)
res_list = [df[df.sum(numeric_only=True,axis=1) == -len(df.select_dtypes('number').columns.tolist())].filter(regex='_spec') for df in df_list]
for df in res_list: print(df)
Input:
b
0 a
1 b
2 c
3 d
b col1_spec
0 a -1
1 b 0
2 c 0
3 d -1
b col1_spec col2
0 a -1 -1
1 b 0 0
2 c 0 0
3 d 0 -2
b col1_spec col2 col3
0 a -1 -1 -1
1 b 0 0 0
2 c 0 0 0
3 d 0 0 -3
b col1_spec col2 col3 col4_spec
0 a -1 -1 -1 -1
1 b 0 0 0 0
2 c 0 0 0 0
3 d 0 0 0 -4
b col1_spec col2 col3 col4_spec col5
0 a -1 -1 -1 -1 -1
1 b 0 0 0 0 0
2 c 0 0 0 0 0
3 d 0 0 0 0 -5
b col1_spec col2 col3 col4_spec col5 col6
0 a -1 -1 -1 -1 -1 -1
1 b 0 0 0 0 0 0
2 c 0 0 0 0 0 0
3 d 0 0 0 0 0 -6
Output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
col1_spec
0 -1
3 -1
col1_spec
0 -1
3 0
col1_spec
0 -1
3 0
col1_spec col4_spec
0 -1 -1
3 0 -4
col1_spec col4_spec
0 -1 -1
3 0 0
col1_spec col4_spec
0 -1 -1
3 0 0
Also, a couple of comments about the original question:
lvmo_spec.to_list() doesn't work because to_list() is not defined on a DataFrame. There is a tolist() method, but it only works for a Series (not a DataFrame).
lvmo_numlength = -len(df.select_dtypes('number').columns.tolist()) gives a negative result. I have assumed this is your intention, and that you want the sum of each row's numeric values to have a negative value, but this is slightly at odds with your description which states:
return the number of columns in each dataframe that are numeric;
filter the dataframe to include only the "_spec" column if the sum of the numeric columns is equal to #1 (above);
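If one instead wanted the logic to match that description literally (a positive count, keeping rows whose numeric values sum to it), a sketch under that assumption:
num_cols = df.select_dtypes('number').columns
n_numeric = len(num_cols)  # positive count of numeric columns, per the description
res = df[df[num_cols].sum(axis=1) == n_numeric].filter(regex='_spec')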
I have a pandas dataframe in the below format
id name value_1 value_2
1 def 1 0
2 abc 0 1
I would need to sort the above dataframe based on id, name, value_1 & value_2. Following that, for every group of [id,name,value_1,value_2], get the first row and set df['result'] = 1. For the other rows in that group, set df['result'] = 0.
I do the sorting and get the first row using the below code:
df = df.sort_values(["id","name","value_1","value_2"], ascending=True)
first_row_per_group = df.groupby(["id","name","value_1","value_2"]).agg('first')
After getting the first rows, I set first_row_per_group['result'] = 1. But I am not sure how to set the other (non-first) rows to 0.
Any suggestions would be appreciated.
duplicated would be faster than groupby:
df = df.sort_values(['id', 'name', 'value_1', 'value_2'])
df['result'] = (~df.duplicated(['id', 'name', 'value_1', 'value_2'])).astype(int)
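Applied to the question's sample data, extended here with one duplicated row (an assumption, to make the effect visible), a quick sketch:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 1], 'name': ['def', 'abc', 'def'],
                   'value_1': [1, 0, 1], 'value_2': [0, 1, 0]})
df = df.sort_values(['id', 'name', 'value_1', 'value_2'])
# The first row of each (id, name, value_1, value_2) group gets 1, the rest 0.
df['result'] = (~df.duplicated(['id', 'name', 'value_1', 'value_2'])).astype(int)
print(df)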
use df.groupby(...).cumcount() to get a counter of rows within each group, which you can then manipulate.
In [50]: import numpy as np

In [51]: df
Out[51]:
a b c
0 def 1 0
1 abc 0 1
2 def 1 0
3 abc 0 1
In [52]: df2 = df.sort_values(['a','b','c'])
In [53]: df2['result'] = df2.groupby(['a', 'b', 'c']).cumcount()
In [54]: df2['result'] = np.where(df2['result'] == 0, 1, 0)
In [55]: df2
Out[55]:
a b c result
1 abc 0 1 1
3 abc 0 1 0
0 def 1 0 1
2 def 1 0 0
In [1]: import pandas as pd
...: a=pd.DataFrame([1,2,'a'])
In [2]: a.isin([1,'a'])
Out[2]:
0
0 True
1 False
2 True
In [3]: a.isin(pd.DataFrame([1,'a']))
Out[3]:
0
0 True
1 False
2 False
Why can't isin find 'a' in a DataFrame, but can in a list?
PS: Using pandas 1.0.5
In [4]: pd.__version__
Out[4]: '1.0.5'
The pd.DataFrame.isin docs spell out this behavior pretty clearly, emphasis my own.
If values is a DataFrame, then both the index and column labels must match.
So looking at your two DataFrames side by side:
   a          pd.DataFrame([1,'a'])      a.isin(...)
   0          0
0  1          1                          True     <- 1 == 1 for col label (0) and index label (0)
1  2          a                          False    <- 2 != 'a' for col label (0) and index label (1)
2  a                                     False    <- nothing to align at index 2
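If the goal is plain membership testing regardless of label alignment, flattening the other frame's values first is one way around this; a minimal sketch:
import pandas as pd

a = pd.DataFrame([1, 2, 'a'])
other = pd.DataFrame([1, 'a'])
# ravel() turns the values into a flat array, so isin treats them as a
# plain collection instead of aligning on index and column labels.
print(a.isin(other.values.ravel()))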
I currently have the following dataframe
A B C
0 x 1 1
1 x 0 1
2 x 0 1
3 y 0 1
4 y 0 0
5 z 1 0
6 z 0 0
And I want
A B C
0 x 1 1
1 y 0 1
2 z 1 0
Basically, summarize to show whether or not each variable exists (has a 1) within each grouped class.
How about sorting the data from higher to lower indicator value and then picking the first row of each group? In case a group has no 1s in any row, we can add a filter condition requiring the row sum to be greater than 0.
import pandas as pd
df = pd.DataFrame({'x': ['x', 'x', 'x', 'y', 'y', 'z', 'z'], 'A': [1,0,0,0,0,1,0], 'B': [1,1,1,1,0,0,0]})
newdf = df.sort_values(['x', 'A', 'B'],ascending=[True, False, False]).groupby(['x']).first().reset_index()
newdf.loc[newdf.sum(axis=1, numeric_only=True) > 0, :]
Output:
# x A B
# 0 x 1 1
# 1 y 0 1
# 2 z 1 0
If your definition of existence is any value more than 0, you can do this:
df.groupby('A', as_index=False).any()
which gives you a boolean dataframe indicating the presence of variable B or C:
A B C
0 x True True
1 y False True
2 z True False
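If 1s and 0s are preferred over True/False in this output, one option (a sketch) is to chain astype(int) before restoring A as a column:
df.groupby('A').any().astype(int).reset_index()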
What about taking the max?
df.groupby('A').max()
Here is a pattern that can be adapted more generally to any value -- i.e. not just checking for 1s:
df.groupby('A').agg(lambda x: any(x == 1))
(Replace 1 with a different value if needed.)
This will actually produce a result with True/False values. If you need the result to be 1s and 0s:
df.groupby('A').agg(lambda x: 1 if any(x == 1) else 0)
It seems DataFrame.le doesn't operate in a column-wise fashion.
from pandas import DataFrame, Series
from numpy.random import randn, rand

df = DataFrame(randn(8, 12))
series = Series(rand(8))
df.le(series)
I would expect each column in df to be compared against the series (12 column-to-series comparisons in total, involving 12 columns * 8 rows of element comparisons). But it appears each element in df is compared against every element in series, which would involve 12 (columns) * 8 (rows) * 8 (elements in series) comparisons. How can I achieve a column-by-column comparison?
Second question: once the column-wise comparison is done, I want to count, for each row, how many True values there are. I am currently doing astype(int32) to turn bool into int and then summing; does this sound reasonable?
Let me give an example of the first question to show what I mean (using a smaller example, since showing 8x12 is unwieldy):
>>> from pandas import *
>>> from numpy.random import *
>>> df = DataFrame(randn(2,5))
>>> t = DataFrame(randn(2,1))
>>> df
0 1 2 3 4
0 -0.090283 1.656517 -0.183132 0.904454 0.157861
1 1.667520 -1.242351 0.379831 0.672118 -0.290858
>>> t
0
0 1.291535
1 0.151702
>>> df.le(t)
0 1 2 3 4
0 True False False False False
1 False False False False False
What I expect for df's column 1:
1
False
True
Because 1.656517 <= 1.291535 is False and -1.242351 <= 0.151702 is True; that is a column-wise comparison. However, the printout is False, False.
I'm not sure I understand the first part of your question, but as to the second part, you can count the Trues in a boolean DataFrame using sum:
In [11]: df.le(s).sum(axis=0)
Out[11]:
0 4
1 3
2 7
3 3
4 6
5 6
6 7
7 6
8 0
9 0
10 0
11 0
dtype: int64
Essentially, le with axis=0 tests each column:
In [21]: df[0] < s
Out[21]:
0 False
1 True
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
Which for each index is testing:
In [22]: df[0].loc[0] < s.loc[0]
Out[22]: False