Consolidating columns by the number before the decimal point in the column name - pandas

I have the following dataframe (three example columns below):
import pandas as pd
array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)
25.2 25.4 27.78
0 False False True
1 True False False
2 False True True
I want to create a new dataframe with consolidated column names, i.e. combine 25.2 and 25.4 into a new column 25. If any of the values in the separate columns is True, then the value in the new column is True.
Expected output:
25 27
0 False True
1 True False
2 True True
Any ideas?

use rename()+groupby()+sum():
df=(df.rename(columns=lambda x:x.split('.')[0])
.groupby(axis=1,level=0).sum().astype(bool))
OR
In 2 steps:
df.columns=[x.split('.')[0] for x in df]
#OR
#df.columns=df.columns.str.replace(r'\.\d+','',regex=True)
df=df.groupby(axis=1,level=0).sum().astype(bool)
output:
25 27
0 False True
1 True False
2 True True
Note: If the column names are floats rather than strings, truncate them with int() or np.floor() instead of split(); round() would turn 27.78 into 28.

Another way:
>>> import numpy as np
>>> df.T.groupby(np.floor(df.columns.astype(float))).sum().astype(bool).T
25.0 27.0
0 False True
1 True False
2 True True
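As far as I know, groupby(axis=1) is deprecated in recent pandas releases (2.1 onward), so the transposed form is the safer one there. A minimal sketch that keeps the string column names '25' and '27' and uses any() for the logical OR; the names `keys` and `out` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'25.2': [False, True, False],
                   '25.4': [False, False, True],
                   '27.78': [True, False, True]})

# Part of each column name before the decimal point, e.g. '25.2' -> '25'.
keys = [name.split('.')[0] for name in df.columns]

# Transpose so the columns become rows, group those rows by key,
# take the logical OR within each group, then transpose back.
out = df.T.groupby(keys).any().T
print(out)
```

any() states the "True if at least one is True" rule directly, so the sum().astype(bool) round trip is not needed.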

Related

From a set of columns with true/false values, say which column has a True value

I have a df with several columns which have only True/False values.
I want to create another column whose value will tell me which column has a True value.
Here's an example:
index  bol_1  bol_2  bol_3  criteria
1      True   False  False  bol_1
2      False  True   False  bol_2
3      True   True   False  [bol_1, bol_2]
My objective is to know which rows have at least one True value, and which columns are responsible for those True values. I want to be able to do some basic statistics on this new column, e.g. for how many rows bol_1 is the only column with a True value.
Use DataFrame.select_dtypes to select the boolean columns, convert the column names to an array, and filter the names by each row's True values in a list comprehension:
df1 = df.select_dtypes(bool)
cols = df1.columns.to_numpy()
df['criteria'] = [list(cols[x]) for x in df1.to_numpy()]
print (df)
bol_1 bol_2 bol_3 criteria
1 True False False [bol_1]
2 False True False [bol_2]
3 True True False [bol_1, bol_2]
If performance is not important use DataFrame.apply:
df['criteria'] = df1.apply(lambda x: cols[x], axis=1)
A possible solution:
df.assign(criteria=df.apply(lambda x: list(
df.columns[1:][x[1:] == True]), axis=1))
Output:
index bol_1 bol_2 bol_3 criteria
0 1 True False False [bol_1]
1 2 False True False [bol_2]
2 3 True True False [bol_1, bol_2]
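The question also asks for basic statistics on the new column, such as how many rows have bol_1 as the only True column. A rough sketch of one way to do that, assuming the criteria column holds lists as in the answers above; `only_bol_1` and `combo_counts` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'bol_1': [True, False, True],
                   'bol_2': [False, True, True],
                   'bol_3': [False, False, False]},
                  index=[1, 2, 3])

# Build the list-valued 'criteria' column from the boolean columns.
bool_df = df.select_dtypes(bool)
cols = bool_df.columns.to_numpy()
df['criteria'] = [list(cols[row]) for row in bool_df.to_numpy()]

# Number of rows where bol_1 is the only True column.
only_bol_1 = df['criteria'].apply(lambda names: names == ['bol_1']).sum()

# Lists are not hashable, so join them into strings before counting combinations.
combo_counts = df['criteria'].str.join(', ').value_counts()

print(only_bol_1)
print(combo_counts)
```

Joining the lists into strings is only one option; converting each list to a frozenset would also make the combinations countable.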

How to check if a row does not exist in another column?

import pandas as pd
import numpy as np
from numpy.random import randint
dict_1 = {'Col1':[1,1,1,1,2,4,5,6,7],'Col2':[3,3,3,3,2,4,5,6,7]}
df = pd.DataFrame(dict_1)
filt = df.apply(lambda x: x['Col2'] not in df['Col1'],axis = 1)
print(filt)
That is what I tried. The expected output is:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
The given result is
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
It is only giving false no matter what I do, and I am not sure how to fix that.
IIUC, here's one way:
filt = ~df.Col2.isin(df.Col1.unique())
OUTPUT:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
In general, df.COLUMN notation has the drawback you mention: it is not obvious how to reference the columns. Bracket notation is the more explicit equivalent:
~df["Col2"].isin(df["Col1"].unique())
Remember that when using square brackets instead of dot notation, single square brackets return a Series, while double square brackets return a DataFrame.
isinstance(df["Col2"], pd.Series)
OUTPUT:
True
Versus
isinstance(df[["Col2"]], pd.DataFrame)
OUTPUT:
True
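As for why the original attempt returned only False: the `in` operator on a Series tests membership in the index labels, not in the values. Every value of Col2 (2 through 7) happens to be a label in the default 0..8 index, so `not in` was always False. A small sketch using the question's data; `label_hit` and `value_hit` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 1, 2, 4, 5, 6, 7],
                   'Col2': [3, 3, 3, 3, 2, 4, 5, 6, 7]})

# `in` on a Series looks at the index labels (0..8 here), so 3 is "found".
label_hit = 3 in df['Col1']

# Checking against the underlying values gives the intended answer.
value_hit = 3 in df['Col1'].values

# Vectorised version: True where the Col2 value never occurs in Col1.
filt = ~df['Col2'].isin(df['Col1'])

print(label_hit, value_hit)
print(filt)
```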

Pandas True False Matching

For this table:
I would like to generate the 'desired_output' column. One way to achieve this may be:
All the True values from col_1 are transferred straight across to desired_output (red arrow)
In desired_output, place a True value above any existing True value (green arrow)
Code I have tried:
df['desired_output']=df.col_1.apply(lambda x: True if x.shift()==True else False)
Thank you
You can use | (bitwise OR) to combine the original values with the values shifted by Series.shift:
d = {"col1":[False,True,True,True,False,True,False,False,True,False,False,False]}
df = pd.DataFrame(d)
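# fill_value=False keeps the shifted Series boolean instead of introducing NaN in the last row
df['new'] = df.col1 | df.col1.shift(-1, fill_value=False)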
print (df)
col1 new
0 False True
1 True True
2 True True
3 True True
4 False True
5 True True
6 False False
7 False True
8 True True
9 False False
10 False False
11 False False
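Try this:
df['desired_output'] = df['col_1']
# Row i becomes True when row i or row i + 1 of col_1 is True, i.e. a True is placed above each existing True.
df.loc[df.index[:-1], 'desired_output'] = df['col_1'].values[:-1] | df['col_1'].values[1:]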
print(df)
In case the values are stored as strings (the code below compares against 'True'; adjust the literal for all-caps TRUE / FALSE):
Input:
col_1
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 False
Code:
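df['desired'] = df['col_1']
for i, e in enumerate(df['col_1']):
    if e == 'True':
        # Copy the 'True' one row up; for i == 0 this creates a stray row labelled -1 at the end.
        df.at[i-1, 'desired'] = df.at[i, 'col_1']
# Drop the last row (the stray row created when the first value is 'True').
df = df[:(len(df)-1)]
df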
Output:
col_1 desired
0 True True
1 True True
2 False True
3 True True
4 True True
5 False False
6 False True
7 True True
8 False False
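For string-valued data, the loop can be avoided by converting to booleans first and then reusing the shift approach from the first answer. A minimal sketch, assuming the strings are spelled 'True' / 'False' in any letter case; `as_bool` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['True', 'True', 'False', 'True', 'True',
                             'False', 'False', 'True', 'False']})

# Case-insensitive conversion of the strings to real booleans.
as_bool = df['col_1'].str.upper().eq('TRUE')

# A row is True if it is True itself or if the row below it is True.
df['desired'] = as_bool | as_bool.shift(-1, fill_value=False)
print(df)
```

Unlike the loop, this leaves 'desired' as a boolean column rather than strings.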

Find the min/max of rows with overlapping column values, create new column to represent the full range of both

I'm using pandas DataFrames. I want to identify all rows where columns A and B are both True, then mark in column C that overlap plus all adjacent rows on either side of it where only one of A or B is still True. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however, this does not extend C to the neighbouring rows where only one of the two columns is True.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff().ne(0).cumsum())
grp_b = df.loc[(df['B'] == True), 'B'].groupby(df['B'].astype('int').diff().ne(0).cumsum())
grp_c = df.loc[(df['C'] == True), 'C'].groupby(df['C'].astype('int').diff().ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?
Try this:
#Input df just columns 'A' and 'B'
df = df[['A','B']]
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' as the row-wise minimum; this assigns True to C where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
we can find all of the records where A and B are both False. Then we use cumsum to create a running count of those False-False records. This gives a grouping in which each False-False record starts a new group that runs until the next False-False record, where the count is incremented.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe with the newly assigned column C by this grouping created with cumsum. Then take the maximum value of column C from that group. So, if the group has a True True record, assign True to all the records in that group. Lastly, use mask to turn the first False False record back to False.
df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
And, assign that series to df['C'] overwriting the temporarily assigned C in the statement.
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
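The same idea can be spelled out with named intermediate Series, which may be easier to follow than the one-liner. This is a sketch of an equivalent formulation, not the answer's exact code; `both`, `either` and `block` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': [False, True, True, True, False, False, True, True],
                   'B': [False, False, True, True, True, False, False, False]})

both = df['A'] & df['B']      # rows where A and B overlap
either = df['A'] | df['B']    # rows where at least one is True

# Each False-False row starts a new block, so a block is one run of "active" rows.
block = (~either).cumsum()

# A row gets C = True if its block contains an overlap and the row itself is active.
df['C'] = both.groupby(block).transform('any') & either
print(df)
```

Here `& either` plays the role of the final mask() call: it resets the False-False row that opens each block.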

Drawing bar charts from boolean fields:

I have three boolean fields, whose counts are shown below:
I want to draw a bar chart that has:
Offline_RetentionByTime with 37528
Offline_RetentionByCount with 29640
Offline_RetentionByCapacity with 3362
How to achieve that?
I think you can apply value_counts to each column to create a new DataFrame df1, and then use DataFrame.plot.bar:
df = pd.DataFrame({'Offline_RetentionByTime':[True,False,True, False],
'Offline_RetentionByCount':[True,False,False,True],
'Offline_RetentionByCapacity':[True,True,True, False]})
print (df)
Offline_RetentionByCapacity Offline_RetentionByCount Offline_RetentionByTime
0 True True True
1 True False False
2 True False True
3 False True False
df1 = df.apply(pd.Series.value_counts)
print (df1)
       Offline_RetentionByCapacity  Offline_RetentionByCount  Offline_RetentionByTime
True                             3                         2                        2
False                            1                         2                        2
df1.plot.bar()
If you need to plot only the True values, select them with loc:
df1.loc[True].plot.bar()
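If only the True counts are needed, summing the boolean columns is a shorter route, because True counts as 1 and False as 0. A minimal sketch on the same sample data; `true_counts` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'Offline_RetentionByTime': [True, False, True, False],
                   'Offline_RetentionByCount': [True, False, False, True],
                   'Offline_RetentionByCapacity': [True, True, True, False]})

# Summing booleans counts the True values in each column.
true_counts = df.sum()
print(true_counts)
```

Calling true_counts.plot.bar() on the result would then draw one bar per field (this needs matplotlib installed).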