When the start column is true, start counting.
When the end column is true, stop counting.
Input:
import pandas as pd
df=pd.DataFrame()
df['start']=[False,True,False,False,False,True,False,False,False]
df['end']= [False,False,False,True,False,False,False,True,False]
Expected Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
You can use cumsum to compute the groups, groupby.cummax to identify the values after a start (and later mask with where) and groupby.cumcount to increment a counter:
# make groups between start/end
group = (df['start']|df['end']).cumsum()
# identify values after a start and before an end
mask = df['start'].groupby(group).cummax()
# compute a cumcount and mask with the above "mask"
df['expected'] = df.groupby(group).cumcount().add(1).where(mask, 0)
Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
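To see what each step contributes, here is a minimal self-contained sketch of the same approach that keeps the intermediate group and mask values. The integer cast before cummax is a defensive assumption on my part (it avoids relying on how a given pandas version handles a boolean cummax); it is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'start': [False, True, False, False, False, True, False, False, False],
    'end':   [False, False, False, True, False, False, False, True, False],
})

# a new group begins at every row where start or end is True
group = (df['start'] | df['end']).cumsum()

# within each group, True from the row where a start was seen onwards
mask = df['start'].astype(int).groupby(group).cummax().astype(bool)

# 1-based position inside each group, set to 0 wherever mask is False
df['expected'] = df.groupby(group).cumcount().add(1).where(mask, 0)

# show the helper columns next to the result
print(df.assign(group=group, mask=mask))
```

Groups that open with an end row (or with neither flag) never see a start, so their mask stays False and the counter is replaced by 0.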
Related
import pandas as pd
import numpy as np
from numpy.random import randint
dict_1 = {'Col1':[1,1,1,1,2,4,5,6,7],'Col2':[3,3,3,3,2,4,5,6,7]}
df = pd.DataFrame(dict_1)
filt = df.apply(lambda x: x['Col2'] not in df['Col1'],axis = 1)
print(filt)
That is what I tried. The expected output is:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
The actual result is:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
It is only giving false no matter what I do, and I am not sure how to fix that.
IIUC, here's one way:
filt = ~df.Col2.isin(df.Col1.unique())
OUTPUT:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
In general, the df.COLUMN attribute notation has the drawback you mention: it is not obvious how to reference the columns. The same filter written with bracket notation:
~df["Col2"].isin(df["Col1"].unique())
Remember that when using square brackets instead of dot notation, single square brackets return a Series, while double square brackets return a DataFrame.
isinstance(df["Col2"], pd.Series)
OUTPUT:
True
Versus
isinstance(df[["Col2"]], pd.DataFrame)
OUTPUT:
True
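As for why the apply version in the question returns only False: the in operator on a Series tests the index labels, not the values. Every value in Col2 happens to be a valid index label (0 to 8), so "not in" is always False. A small sketch of the difference, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 1, 2, 4, 5, 6, 7],
                   'Col2': [3, 3, 3, 3, 2, 4, 5, 6, 7]})

# `in` on a Series looks at the index (0..8 here), so 3 is "found"
in_index = 3 in df['Col1']

# testing against the underlying values gives the intended answer
in_values = 3 in df['Col1'].values

# vectorized membership test, negated to flag Col2 values missing from Col1
filt = ~df['Col2'].isin(df['Col1'])

print(in_index, in_values)
print(filt)
```

isin accepts the Series directly, so the unique() call is optional.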
For this table:
I would like to generate the 'desired_output' column. One way to achieve this may be:
All the True values from col_1 are transferred straight across to desired_output (red arrow)
In desired_output, place a True value above any existing True value (green arrow)
Code I have tried:
df['desired_output']=df.col_1.apply(lambda x: True if x.shift()==True else False)
Thank you
You can combine the original column with its shifted values (Series.shift) using |, the element-wise OR operator:
d = {"col1":[False,True,True,True,False,True,False,False,True,False,False,False]}
df = pd.DataFrame(d)
df['new'] = df.col1 | df.col1.shift(-1)
print (df)
col1 new
0 False True
1 True True
2 True True
3 True True
4 False True
5 True True
6 False False
7 False True
8 True True
9 False False
10 False False
11 False False
Try this:
df['desired_output'] = df['col_1']
# OR each row with the row below it, so a True is placed above every existing True
df.loc[df.index[:-1], 'desired_output'] = df.col_1[:-1].values | df.col_1[1:].values
print(df)
In case the values are saved as strings rather than booleans (the code below compares against 'True'; for all-caps data compare against 'TRUE' instead):
Input:
col_1
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 False
Code:
df['desired'] = df['col_1']
for i, e in enumerate(df['col_1']):
    # copy each 'True' into the row above it; the first row has no row above
    if e == 'True' and i > 0:
        df.at[i-1, 'desired'] = e
df
Output:
col_1 desired
0 True True
1 True True
2 False True
3 True True
4 True True
5 False False
6 False True
7 True True
8 False False
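The loop can also be replaced by a vectorized version. This is only a sketch: it assumes the strings are spelled True/False in any letter case and treats anything else as False, and unlike the loop it produces real booleans rather than strings in the desired column.

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['True', 'True', 'False', 'True', 'True',
                             'False', 'False', 'True', 'False']})

# convert the strings to real booleans (case-insensitive comparison)
as_bool = df['col_1'].str.lower().eq('true')

# a row is True if it is True itself or the row below it is True
df['desired'] = as_bool | as_bool.shift(-1, fill_value=False)
print(df)
```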
I'm using pandas DataFrames. I want to identify all rows where both columns A and B are True, then also mark in column C the contiguous rows on either side of that overlap where only one of A or B is still True. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however, this does not extend C to the neighbouring rows where only one of the columns is True.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff().ne(0).cumsum())
grp_b = df.loc[(df['B'] == True), 'B'].groupby(df['B'].astype('int').diff().ne(0).cumsum())
grp_c = df.loc[(df['C'] == True), 'C'].groupby(df['C'].astype('int').diff().ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?
Try this:
#Input df just columns 'A' and 'B'
df = df[['A','B']]
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' as the row-wise minimum; this assigns True to C where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
we can find all of the records where A and B are both False. Then we use cumsum to create a running count of those False False records. This gives a grouping in which each False False record starts a new group that runs until the next False False record, where the count is incremented.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe with the newly assigned column C by this grouping created with cumsum. Then take the maximum value of column C from that group. So, if the group has a True True record, assign True to all the records in that group. Lastly, use mask to turn the first False False record back to False.
df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
And, assign that series to df['C'] overwriting the temporarily assigned C in the statement.
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
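The same idea can be spelled out with named intermediate steps, which may be easier to follow than the one-liner. This is a sketch under my own naming (both, either, run are hypothetical helper names), and the integer cast before the group maximum is a defensive assumption rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [False, True, True, True, False, False, True, True],
    'B': [False, False, True, True, True, False, False, False],
})

both = df['A'] & df['B']      # rows where A and B overlap
either = df['A'] | df['B']    # rows where at least one is True

# every all-False row starts a new run; rows after it share its id
run = (~either).cumsum()

# a run qualifies if it contains at least one overlap row
run_has_overlap = both.astype(int).groupby(run).transform('max').astype(bool)

# keep only rows that are active themselves (drops the all-False separators)
df['C'] = run_has_overlap & either
print(df)
```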
I have the following dataset with boolean columns:
date hr energy
0 5-Feb-18 False False
1 29-Jan-18 False False
2 6-Dec-17 True False
3 16-Nov-17 False False
4 14-Nov-17 True True
5 25-Oct-17 False False
6 24-Oct-17 False False
7 5-Oct-17 False False
8 3-Oct-17 False False
9 26-Sep-17 False False
10 13-Sep-17 True False
11 7-Sep-17 False False
12 31-Aug-17 False False
I want to multiply each boolean column by 1 to turn it into a dummy variable.
I tried:
df = df.iloc[:, 1:]
for col in df:
    col = col*1
but the columns remain boolean, why?
Just use astype(int):
df.iloc[:,1:]=df.iloc[:,1:].astype(int)
df
Out[477]:
date hr energy
0 5-Feb-18 0 0
1 29-Jan-18 0 0
2 6-Dec-17 1 0
3 16-Nov-17 0 0
4 14-Nov-17 1 1
5 25-Oct-17 0 0
6 24-Oct-17 0 0
7 5-Oct-17 0 0
8 3-Oct-17 0 0
9 26-Sep-17 0 0
10 13-Sep-17 1 0
11 7-Sep-17 0 0
12 31-Aug-17 0 0
For future cases with values other than True or False: if you want to convert categorical values into numerical ones, you can always use the replace function.
df.iloc[:,1:]=df.iloc[:,1:].replace({True:1,False:0})
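On the "why" part of the question: iterating over a DataFrame yields the column names (strings), and col = col*1 only rebinds the loop variable, so the frame is never touched. Below is a minimal sketch of a version that does modify the frame, using a shortened copy of the data. Picking the boolean columns with select_dtypes is my choice here, not something from the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['5-Feb-18', '29-Jan-18', '6-Dec-17'],
    'hr': [False, False, True],
    'energy': [False, False, False],
})

# looping over a DataFrame gives column labels, not the columns themselves
names = [col for col in df]

# select the boolean columns by dtype and assign the converted values back
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)

print(names)
print(df.dtypes)
```

Assigning back through df[...] is what makes the change stick; the original loop never writes anything into df.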
df = [bigdataframe[['Action', 'Adventure','Animation',
'Childrens', 'Comedy', 'Crime','Documentary',
'Drama', 'Fantasy', 'FilmNoir', 'Horror',
'Musical',
'Mystery', 'Romance','SciFi', 'Thriller', 'War',
'Western']].sum(axis=1) > 1]
df
Out[8]:
[0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 False
9 True
10 False
11 True
12 True
13 True
14 True
15 False
16 True
17 False
18 True
19 False
20 False
21 True
22 True
23 True
24 False
25 True
26 True
27 True
28 True
29 True
...
99970 True
99971 True
99972 False
99973 True
99974 True
99975 True
99976 True
99977 True
99978 False
99979 False
99980 True
99981 False
99982 True
99983 False
99984 True
99985 True
99986 True
99987 True
99988 False
99989 True
99990 True
99991 True
99992 False
99993 True
99994 True
99995 True
99996 True
99997 True
99998 True
99999 False
Length: 100000, dtype: bool]
I have tried:
len(df[df==True])
Masking
They are in a list so shouldn't I just be able to count them? Or do I need to assign them numerical values, 1 for true and 0 for false and then use the count or sum function to find how many are true?
Demo:
In [386]: df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
In [387]: df
Out[387]:
A B C
0 0.228687 0.647431 0.526471
1 0.795122 0.915011 0.950481
2 0.386244 0.705412 0.420596
3 0.343213 0.928993 0.192527
4 0.201023 0.209281 0.304799
In [388]: df[['A','B','C']].sum(axis=1).gt(1.5)
Out[388]:
0 False
1 True
2 True
3 False
4 False
dtype: bool
In [389]: df[['A','B','C']].sum(axis=1).gt(1.5).sum()
Out[389]: 2
To count the number of True values in a list (note: unlist is an R function, so this one-liner is R, not Python):
sum(unlist(your.list.object))
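In pandas, the count can be taken directly from the boolean Series, since True sums as 1 and False as 0. The sketch below uses a tiny made-up frame (movies and its three genre columns are hypothetical stand-ins for bigdataframe) and also shows the case from the question where the Series ended up wrapped in a list.

```python
import pandas as pd

# hypothetical stand-in for the genre indicator columns
movies = pd.DataFrame({
    'Action': [1, 0, 1, 0],
    'Comedy': [1, 0, 0, 1],
    'Drama':  [0, 0, 1, 1],
})

# True where a row belongs to more than one genre
multi_genre = movies[['Action', 'Comedy', 'Drama']].sum(axis=1) > 1

# summing a boolean Series counts its True values
n_true = int(multi_genre.sum())

# the question wrapped the Series in a list; unwrap it with [0] first
wrapped = [multi_genre]
n_true_from_list = int(wrapped[0].sum())

print(n_true, n_true_from_list)
```

Dropping the outer square brackets in the question's assignment avoids the list entirely, so that .sum() can be called on the result directly.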