Converting true/false to 0/1 boolean in a mixed dataframe - pandas

I have a dataframe with mixed data types. I want to create a function that goes through all the columns and converts any columns containing True/False to int32 type 0/1. I tried a lambda function below, where d is my dataframe:
f = lambda x: 1 if x==True else 0
d.applymap(f)
This doesn't work; it converts all my non-boolean columns to 0/1 as well. Is there a good way to go through the dataframe and leave everything untouched except the boolean columns, converting those to 0's and 1's? Any help is appreciated!

Let's modify your lambda to use an isinstance check:
df.applymap(lambda x: int(x) if isinstance(x, bool) else x)
Only values of type bool will be converted to int, everything else remains the same.
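For example, on a small made-up frame with a genuinely mixed (object) column, the sketch below leaves the strings and floats alone and only converts the bools:
import pandas as pd

d = pd.DataFrame({'a': [True, 'x', False], 'b': [1.5, 2.5, 3.5]})
d = d.applymap(lambda x: int(x) if isinstance(x, bool) else x)
# column 'a' becomes 1, 'x', 0; column 'b' is untouched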
As a better solution, if the boolean columns have an actual bool dtype (rather than being object columns of mixed values, as I originally assumed from your question), you can instead use
u = df.select_dtypes(bool)
df[u.columns] = u.astype(int)
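Since the question asks for int32 specifically, here is a minimal sketch (assuming the boolean columns already have a true bool dtype) that casts to that width:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.randn(3), 'col2': [True, False, True]})
u = df.select_dtypes(bool)          # only the bool-dtype columns
df[u.columns] = u.astype('int32')   # col2 becomes int32 0/1, col1 is untouched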

You can select the columns using loc and change data type.
df = pd.DataFrame({'col1': np.random.randn(2), 'col2': [True, False], 'col3': [False, True]})
df.loc[:, df.dtypes == bool] = df.loc[:, df.dtypes == bool].astype(int)
col1 col2 col3
0 0.999358 1 0
1 0.795179 0 1

If you have a dataframe df, try:
df_1 = df.applymap(lambda x: int(x) if type(x) == bool else x)

Related

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.NaN
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column; NaN is a float (unless you use an object dtype, which is not a good idea since you'll lose most vectorized operations).
You can however use the new nullable integer type (NA).
Conversion can be done with convert_dtypes:
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:
col
0 1
1 2
2 <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading in data, you may have a column that contains mixed dtypes. Loading in a column with one dtype and attempting to turn it into mixed dtypes is not possible though (at least, not that I know of).
So I will echo what @mozway said and suggest you use nullable integer data types,
e.g.
df['col'] = df['col'].astype('Int64')
(note the capital I)
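A minimal sketch with a hypothetical column containing a missing value:
import pandas as pd

df = pd.DataFrame({'col': [1.0, 2.0, None]})
df['col'] = df['col'].astype('Int64')
# 0       1
# 1       2
# 2    <NA>
# Name: col, dtype: Int64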

Dataframe Column is not Read as List in Lambda Function

I have a dataframe which contains list values; let us call it df1:
Text
-------
["good", "job", "we", "are", "so", "proud"]
["it", "was", "his", "honor", "as", "well", "as", "guilty"]
And also another dataframe, df2:
Word Value
-------------
good 7.47
proud 8.03
honor 7.66
guilty 2.63
I want to use apply plus a lambda function to create df1['score'], where the values are derived by aggregating the words of each list in df1 that are found in df2's words. Currently, this is my code:
def score(list_word):
    sum = count = mean = sd = 0
    for word in list_word:
        if word in df2['Word']:
            sum = sum + df2.loc[df2['Word'] == word, 'Value'].iloc[0]
            count = count + 1
    if count != 0:
        return sum/count
    else:
        return 0

df['score'] = df.apply(lambda x: score(x['words']), axis=1)
This is what I envision:
Score
-------
7.75 #average of good (7.47) and proud (8.03)
5.145 #average of honor (7.66) and guilty (2.63)
However, it seems x['words'] is not passed as a list object, and I do not know how to modify the score function to match the object type. I tried converting it with the tolist() method, but to no avail. Any help is appreciated.
Given your df1 and df2, use explode and map. Note that explode is available from pandas 0.25 onward.
# If the lists are stored as strings, bring them back to real lists with ast.literal_eval first:
# import ast
# df1.Text = df1.Text.apply(ast.literal_eval)
s = df1.Text.explode().map(dict(zip(df2.Word, df2.Value))).groupby(level=0).mean()
0 7.750
1 5.145
Name: Text, dtype: float64
Update
df1.Text.explode().to_frame('Word').reset_index().merge(df2, how='left').groupby('index')[['Value']].mean()
Value
index
0 7.750
1 5.145
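If you would rather keep an apply-based approach, the likely culprit in the original score function is that word in df2['Word'] tests membership against the Series index rather than its values. A hedged sketch using a plain dict lookup (column names as in the question):
word_to_value = dict(zip(df2['Word'], df2['Value']))

def score(list_word):
    values = [word_to_value[w] for w in list_word if w in word_to_value]
    return sum(values) / len(values) if values else 0

df1['score'] = df1['Text'].apply(score)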

Pandas isin boolean operator giving error

I am running into an error while using the 'isin' Boolean operator:
def rowcheck(row):
    return row['CUST_NAME'].isin(['John', 'Alan'])
My dataframe has column CUST_NAME. So I use:
df['CUSTNAME_CHK'] = df.apply(lambda row: rowcheck(row), axis=1)
I get:
'str' object has no attribute 'isin'
What did I do wrong?
You are doing it inside a function passed to apply, so row['CUST_NAME'] holds the value of a specific cell, and that value is a string. Strings have no isin method; that method belongs to pd.Series, not to strings.
If you really want to use apply, use np.isin in this case:
import numpy as np

def rowcheck(row):
    return np.isin(row['CUST_NAME'], ['John', 'Alan'])
As @juanpa.arrivilaga noticed, isin won't be efficient in this case, so it's advised to use the in operator directly:
return row['CUST_NAME'] in ['John', 'Alan']
Notice that you probably don't need apply at all. You can just use pd.Series.isin directly. For example,
df = pd.DataFrame({'col1': ['abc', 'dfe']})
col1
0 abc
1 dfe
Such that you can do
df.col1.isin(['abc', 'xyz'])
0 True
1 False
Name: col1, dtype: bool
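So for the original goal you can likely skip apply entirely and write (assuming the column names from the question):
df['CUSTNAME_CHK'] = df['CUST_NAME'].isin(['John', 'Alan'])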

Is there an elegant way to select all rows in a pandas column that have data type float?

An elegant function like
df[~pandas.isnull(df.loc[:,0])]
can check a pandas DataFrame column and return the entire DataFrame but with all NaN value rows from the selected column removed.
I am wondering if there is a similar function which can check and return a df column conditional on its dtype without using any loops.
I've looked at
.select_dtypes(include=[np.float])
but this only returns columns whose dtype is entirely float64, not the individual rows within a column that hold a float.
First let's set up a DataFrame with two columns. Only column b contains a float. We'll try to find that row:
import pandas

df = pandas.DataFrame({
    'a': ['qw', 'er'],
    'b': ['ty', 1.98]
})
When printed this looks like:
a b
0 qw ty
1 er 1.98
Then create a map to select the rows using apply()
def check_if_float(row):
    return isinstance(row['b'], float)

map = df.apply(check_if_float, axis=1)
This will give a boolean map of all the rows that have a float in column b:
0 False
1 True
You can then use this map to select the rows you want
filtered_rows = df[map]
Which leaves you only the rows that contain a float in column b:
a b
1 er 1.98
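The same mask can also be built inline without a named function, e.g. (a small sketch on the df above):
df[df['b'].map(lambda v: isinstance(v, float))]
# returns the same single row with 1.98 in column b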

What's the Pandas way to write `if()` conditional between two `timeseries` columns?

My naive approach to Pandas Series needs some pointers. I have one Pandas DataFrame with two joined tables. The left table had a timestamp column titled Time1 and the right had Time2; my new DataFrame has both.
At this step I'm comparing the two datetime columns using helper functions g() and f():
df['date_error'] = g(df['Time1'], df['Time2'])
The working helper function g() compares two datetime values:
def g(newer, older):
    value = newer > older
    return value
This gives me a column of True/False values. When I use a conditional in the helper function f(), I get an error because newer and older are Pandas Series:
def f(newer, older):
    if newer > older:
        delta = (newer - older)
    else:
        # arbitrarily large value to maintain col dtype
        delta = datetime.timedelta(minutes=1000)
    return delta
Ok. Fine. I know I'm not unpacking the Pandas Series correctly, because I can get this to work with the following monstrosity:
def f(newer, older):
    delta = []
    for (k, v), (k2, v2) in zip(newer.iteritems(), older.iteritems()):
        if v > v2:
            delta.append(v - v2)
        else:
            # arbitrarily large value to maintain col dtype
            delta.append(datetime.timedelta(minutes=1000))
    return pd.Series(delta)
What's the Pandas way to write a conditional between two DataFrame columns?
Usually where is the pandas equivalent of if:
df = pd.DataFrame([['1/1/01 11:00', '1/1/01 12:00'],
                   ['1/1/01 14:00', '1/1/01 13:00']],
                  columns=['Time1', 'Time2']
                  ).apply(pd.to_datetime)
(df.Time1 - df.Time2).where(df.Time1 > df.Time2)
0 NaT
1 01:00:00
dtype: timedelta64[ns]
If you don't want nulls in this column you could call fillna(1000) afterwards, however note that this datatype supports a null value NaT (not a time).