How to count trues/falses from a list? - pandas

df = [bigdataframe[['Action', 'Adventure','Animation',
'Childrens', 'Comedy', 'Crime','Documentary',
'Drama', 'Fantasy', 'FilmNoir', 'Horror',
'Musical',
'Mystery', 'Romance','SciFi', 'Thriller', 'War',
'Western']].sum(axis=1) > 1]
df
Out[8]:
[0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 False
9 True
10 False
11 True
12 True
13 True
14 True
15 False
16 True
17 False
18 True
19 False
20 False
21 True
22 True
23 True
24 False
25 True
26 True
27 True
28 True
29 True
99970 True
99971 True
99972 False
99973 True
99974 True
99975 True
99976 True
99977 True
99978 False
99979 False
99980 True
99981 False
99982 True
99983 False
99984 True
99985 True
99986 True
99987 True
99988 False
99989 True
99990 True
99991 True
99992 False
99993 True
99994 True
99995 True
99996 True
99997 True
99998 True
99999 False
Length: 100000, dtype: bool]
I have tried:
len(df[df==True])
and masking.
The values are in a list, so shouldn't I just be able to count them? Or do I need to assign them numerical values (1 for True and 0 for False) and then use the count or sum function to find how many are True?

Demo:
In [386]: df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
In [387]: df
Out[387]:
A B C
0 0.228687 0.647431 0.526471
1 0.795122 0.915011 0.950481
2 0.386244 0.705412 0.420596
3 0.343213 0.928993 0.192527
4 0.201023 0.209281 0.304799
In [388]: df[['A','B','C']].sum(axis=1).gt(1.5)
Out[388]:
0 False
1 True
2 True
3 False
4 False
dtype: bool
In [389]: df[['A','B','C']].sum(axis=1).gt(1.5).sum()
Out[389]: 2
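Applied to the question's situation, the square brackets around the comparison are what turn the result into a one-element Python list holding a Series. Dropping them leaves a boolean Series that can be summed directly. A minimal sketch on a small made-up frame (the column values and the names genre_cols and multi_genre are illustrative assumptions, not the asker's data):

```python
import pandas as pd

# Hypothetical miniature of the genre table: 1 means the film has that genre.
df = pd.DataFrame({
    "Action": [1, 0, 1, 0],
    "Comedy": [1, 1, 0, 0],
    "Drama":  [1, 1, 0, 0],
})
genre_cols = ["Action", "Comedy", "Drama"]

# No surrounding [...] here, so the result is a boolean Series, not a list.
multi_genre = df[genre_cols].sum(axis=1) > 1

n_true = int(multi_genre.sum())       # True counts as 1
n_false = int((~multi_genre).sum())   # invert the mask to count the False rows
counts = multi_genre.value_counts()   # both counts in one call

print(n_true, n_false)
print(counts)
```

Because True is treated as 1 and False as 0, sum() gives the number of True rows without any manual conversion; value_counts() is convenient when both totals are wanted.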

To count the number of True values in a list (note that this is R syntax, not pandas):
sum(unlist(your.list.object))

Related

How to count the rows between two conditions?

When the start column is true, start counting.
When the end column is true, stop counting.
Input:
import pandas as pd
df=pd.DataFrame()
df['start']=[False,True,False,False,False,True,False,False,False]
df['end']= [False,False,False,True,False,False,False,True,False]
Expected Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
You can use cumsum to compute the groups, groupby.cummax to identify the values after a start (and later mask with where) and groupby.cumcount to increment a counter:
# make groups between start/end
group = (df['start']|df['end']).cumsum()
# identify values after a start and before an end
mask = df['start'].groupby(group).cummax()
# compute a cumcount and mask with the above "mask"
df['expected'] = df.groupby(group).cumcount().add(1).where(mask, 0)
Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
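The three steps above can be wrapped in a small helper so the intermediate Series are easier to inspect. This is only a sketch of the same cumsum / cummax / cumcount idea; the function name count_between and the variable names are made up for illustration:

```python
import pandas as pd

def count_between(start: pd.Series, end: pd.Series) -> pd.Series:
    """Count rows from each True in `start` up to (not including) the next True in `end`."""
    # A new group begins at every start or end marker.
    group = (start | end).cumsum()
    # Within a group, cummax stays True once a start marker has been seen.
    active = start.groupby(group).cummax()
    # 1-based position of each row inside its group.
    position = start.groupby(group).cumcount() + 1
    # Keep the position only for groups opened by a start marker.
    return position.where(active, 0)

df = pd.DataFrame({
    "start": [False, True, False, False, False, True, False, False, False],
    "end":   [False, False, False, True, False, False, False, True, False],
})
df["expected"] = count_between(df["start"], df["end"])
print(df)
```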

Consolidating columns by the number before the decimal point in the column name

I have the following dataframe (three example columns below):
import pandas as pd
array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)
25.2 25.4 27.78
0 False False True
1 True False False
2 False True True
I want to create a new dataframe with consolidated column names, i.e. combine 25.2 and 25.4 into a new column 25. If any of the values in the separate columns is True, then the value in the new column is True.
Expected output:
25 27
0 False True
1 True False
2 True True
Any ideas?
Use rename() + groupby() + sum():
df = (df.rename(columns=lambda x: x.split('.')[0])
        .groupby(axis=1, level=0).sum().astype(bool))
Or, in two steps:
df.columns = [x.split('.')[0] for x in df]
# OR
# df.columns = df.columns.str.replace(r'\.\d+', '', regex=True)
df = df.groupby(axis=1, level=0).sum().astype(bool)
output:
25 27
0 False True
1 True False
2 True True
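Note: If the column names are numeric (floats) rather than strings, you can use math.floor() or int() instead of split(). round() is not suitable here, since it would map 27.78 to 28.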
Another way:
>>> df.T.groupby(np.floor(df.columns.astype(float))).sum().astype(bool).T
25.0 27.0
0 False True
1 True False
2 True True
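Recent pandas releases deprecate groupby(axis=1), so a variant that avoids it may be useful. The sketch below is one possible alternative rather than the answerers' method: it groups the transposed frame by the text before the decimal point and uses any() in place of sum().astype(bool). The name consolidated is an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "25.2":  [False, True, False],
    "25.4":  [False, False, True],
    "27.78": [True, False, True],
})

# After transposing, the old column names are the index, so a function
# passed to groupby is called on each name. any() marks a cell True if
# at least one of the grouped source columns was True.
consolidated = df.T.groupby(lambda name: name.split(".")[0]).any().T
print(consolidated)
```

Using any() keeps the result boolean throughout, so no cast back from integer sums is needed.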

Pandas True False Matching

For this table:
I would like to generate the 'desired_output' column. One way to achieve this may be:
All the True values from col_1 are transferred straight across to desired_output (red arrow)
In desired_output, place a True value above any existing True value (green arrow)
Code I have tried:
df['desired_output']=df.col_1.apply(lambda x: True if x.shift()==True else False)
Thank you.
You can combine the original column with its shifted values (from Series.shift) using |, the element-wise OR:
d = {"col1":[False,True,True,True,False,True,False,False,True,False,False,False]}
df = pd.DataFrame(d)
df['new'] = df.col1 | df.col1.shift(-1)
print (df)
col1 new
0 False True
1 True True
2 True True
3 True True
4 False True
5 True True
6 False False
7 False True
8 True True
9 False False
10 False False
11 False False
Try this:
df['desired_output'] = df['col_1']
# OR each row with the row below it, so a True is placed above every existing True
df.loc[df.index[:-1], 'desired_output'] = df.col_1[:-1].values | df.col_1[1:].values
print(df)
In case the values are saved as strings ('True' / 'False'):
Input:
col_1
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 False
Code:
df['desired']=df['col_1']
for i, e in enumerate(df['col_1']):
    if e=='True':
        # copy the 'True' string to the row above
        # (for i == 0 this writes to label -1, which appends an extra row)
        df.at[i-1,'desired']=df.at[i,'col_1']
# drop the extra row appended for label -1
df = df[:(len(df)-1)]
df
Output:
col_1 desired
0 True True
1 True True
2 False True
3 True True
4 True True
5 False False
6 False True
7 True True
8 False False

How to vectorize in Pandas when values depend on prior values

I'd like to use Pandas to implement a function that keeps a running balance, but I'm not sure it can be vectorized for speed.
In short, the problem I'm trying to solve is to keep track of consumption, generation, and the "bank" of over-generation.
"consumption" means how much is used in a given time period.
"generation" is how much is generated.
When generation is greater than consumption, the homeowner can "bank" the extra generation and apply it in a later period in which their consumption exceeds their generation.
This will be for many entities, hence the "id" field. The time sequence is defined by "order".
Very basic example:
Month 1 generates 13, consumes 8 -> therefore banks 5.
Month 2 generates 8, consumes 10 -> therefore uses 2 from the bank, and still has 3 left over.
Month 3 generates 7, consumes 20 -> exhausts the remaining 3 from the bank, and has no bank left over.
Code
import numpy as np
import pandas as pd
id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 100, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)),
columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,91,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,8,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)),
columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])
def bank(df):
    # deposit all excess when generation exceeds consumption
    deposit = (df['Generate'] > df['Consume']) * (df['Generate'] - df['Consume'])
    df['end_bal'] = 0
    # beginning balance = prior period ending balance
    df = df.sort_values(by=['id', 'Order'])
    df['begin_bal'] = df['end_bal'].shift(periods=1)
    df.loc[df['Order']==1, 'begin_bal'] = 0 # set first month beginning balance of each customer to 0
    # calculate withdrawal
    df['Withdraw'] = 0
    ok_to_withdraw = df['Consume'] > df['Generate']
    df.loc[ok_to_withdraw,'Withdraw'] = np.minimum(df.loc[ok_to_withdraw, 'begin_bal'],
                                                   df.loc[ok_to_withdraw, 'Consume'] -
                                                   df.loc[ok_to_withdraw, 'Generate'] -
                                                   deposit[ok_to_withdraw])
    # ending balance = beginning balance + deposit - withdraw
    df['end_bal'] = df['begin_bal'] + deposit - df['Withdraw']
    return df
df = bank(df)
df.head()
id Order Consume Generate end_bal begin_bal Withdraw
0 1 1 10 20 10.0 0.0 0.0
1 1 2 17 16 0.0 0.0 0.0
2 1 3 20 17 0.0 0.0 0.0
3 1 4 11 21 10.0 0.0 0.0
4 1 5 17 9 0.0 0.0 0.0
df_solution.head()
id Order Consume Generate begin_bal end_bal Withdraw
0 1 1 10 20 0 10 0
1 1 2 17 16 10 9 1
2 1 3 20 17 9 6 3
3 1 4 11 21 6 16 0
4 1 5 17 9 16 8 8
I tried to implement this with various combinations of cumsum and shift, but the fact remains that the value of each row seems to need recalculating based on the prior row, and I'm not sure this is possible to vectorize.
Code to generate some test datasets:
import random

def generate_testdata():
    # fixed seeds so the generated data is reproducible
    random.seed(42*42)
    np.random.seed(42*42)
    numids = 10
    numorders = 12
    id = []
    order = []
    for i in range(numids):
        id = id + [i]*numorders
        order = order + list(range(1,numorders+1))
    consume = np.random.uniform(low = 10, high = 40, size = numids*numorders)
    generate = np.random.uniform(low = 10, high = 40, size = numids*numorders)
    df = pd.DataFrame(list(zip(id, order, consume, generate)),
                      columns =['id','Order','Consume', 'Generate'])
    return df
Here is a numpy-ish approach, mostly because I'm not that familiar with pandas:
The idea is to first compute the free cumsum and then to subtract the cumulative minimum if it is negative.
import numpy as np
import pandas as pd
id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 8, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)),
columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,9,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)),
columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])
def f(df):
    # find block boundaries
    ids = df["id"].values
    bnds, = np.where(np.diff(ids, prepend=ids[0]-1, append=ids[-1]+1))
    # find raw balance change
    delta = (df["Generate"] - df["Consume"]).values
    # find offset, so cumulative min does not interfere across ids
    safe_total = (np.minimum(delta.min(), 0)-1) * np.diff(bnds[:-1])
    # must apply offset just before group switch, so it aligns the first
    # begin_bal, not end_bal, of the next group
    # also keep a copy of original values at switches
    delta_orig = delta[bnds[1:-1]-1]
    delta[bnds[1:-1]-1] += safe_total - np.add.reduceat(delta, bnds[:-2])
    # form free cumsum
    acc = delta.cumsum()
    # correct
    acc -= np.minimum(0, np.minimum.accumulate(acc))
    # write solution back to df
    shft = np.empty_like(acc)
    shft[1:] = acc[:-1]
    shft[0] = 0
    # reinstate last end_bal of each group
    acc[bnds[1:-1]-1] = np.maximum(0, shft[bnds[1:-1]-1] + delta_orig)
    df["begin_bal"] = shft
    df["end_bal"] = acc
    df["Withdraw"] = np.maximum(0, df["begin_bal"] - df["end_bal"])
Test:
f(df)
df == df_solution
Prints:
id Order Consume Generate begin_bal end_bal Withdraw
0 True True True True True True True
1 True True True True True True True
2 True True True True True True True
3 True True True True True True True
4 True True True True True True False
5 True True True True True True True
6 True True True True True True True
7 True True True True True True True
8 True True True True True True True
9 True True True True True True True
10 True True True True True True True
11 True True True True True True True
12 True True True True True True True
13 True True True True True True True
14 True True True True True True True
15 True True True True True True True
16 True True True True True True True
17 True True True True True True True
18 True True True True True True True
19 True True True True True True True
20 True True True True True True True
21 True True True True True True True
22 True True True True True True True
23 True True True True True True True
24 True True True True True True True
25 True True True True True True True
There is one False but that appears to be a typo in the expected output provided.
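To see why "free cumsum minus the negative part of the cumulative minimum" gives the banked balance, it may help to run the idea on the three-month example from the question for a single entity. This is a stripped-down sketch without the multi-id offset handling of f above; the variable names are illustrative:

```python
import numpy as np

generate = np.array([13, 8, 7])
consume = np.array([8, 10, 20])

delta = generate - consume                   # net change per month
free = delta.cumsum()                        # balance if it were allowed to go negative
lowest = np.minimum.accumulate(free)         # lowest point reached so far
end_bal = free - np.minimum(0, lowest)       # lift the curve so it never dips below zero

begin_bal = np.concatenate(([0], end_bal[:-1]))   # previous month's ending balance
withdraw = np.maximum(0, begin_bal - end_bal)     # amount taken out of the bank

print(end_bal, begin_bal, withdraw)
```

The ending balances come out as 5, 3 and 0, matching the description: month 1 banks 5, month 2 uses 2, and month 3 exhausts the remaining 3.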
Using @PaulPanzer's logic, here is a pandas version.
def CalcEB(x):
    delta = x['Generate'] - x['Consume']
    return delta.cumsum() - delta.cumsum().cummin().clip(-np.inf,0)
df['end_bal'] = df.groupby('id', as_index=False).apply(CalcEB).values
df['begin_bal'] = df.groupby('id')['end_bal'].shift().fillna(0)
df['Withdraw'] = (df['begin_bal'] - df['end_bal']).clip(0,np.inf)
df_pandas = df.copy()
#Note the typo mentioned by Paul Panzer
df_pandas.reindex(df_solution.columns, axis=1) == df_solution
Output (check dataframes)
id Order Consume Generate begin_bal end_bal Withdraw
0 True True True True True True True
1 True True True True True True True
2 True True True True True True True
3 True True True True True True True
4 True True True True True True False
5 True True True True True True True
6 True True True True True True True
7 True True True True True True True
8 True True True True True True True
9 True True True True True True True
10 True True True True True True True
11 True True True True True True True
12 True True True True True True True
13 True True True True True True True
14 True True True True True True True
15 True True True True True True True
16 True True True True True True True
17 True True True True True True True
18 True True True True True True True
19 True True True True True True True
20 True True True True True True True
21 True True True True True True True
22 True True True True True True True
23 True True True True True True True
24 True True True True True True True
25 True True True True True True True
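The same calculation can also be written with grouped cumulative methods instead of groupby.apply, which sidesteps questions about how apply aligns its result. The sketch below runs on a small made-up frame (the data and the intermediate names free and floor are assumptions for illustration) and assumes the rows are already sorted by id and period:

```python
import pandas as pd

# Hypothetical data: entity 1 is the three-month example, entity 2 starts in deficit.
df = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2],
    "Generate": [13, 8, 7, 5, 12],
    "Consume":  [8, 10, 20, 9, 10],
})

delta = df["Generate"] - df["Consume"]

# Running balance per id if it were allowed to go negative.
free = delta.groupby(df["id"]).cumsum()
# Lowest point reached so far per id, capped at zero.
floor = free.groupby(df["id"]).cummin().clip(upper=0)

df["end_bal"] = free - floor
# Beginning balance is the previous period's ending balance within the same id.
df["begin_bal"] = df.groupby("id")["end_bal"].shift(fill_value=0)
df["Withdraw"] = (df["begin_bal"] - df["end_bal"]).clip(lower=0)
print(df)
```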
I am not sure I understood your question fully, but I am going to give a go at answering.
I will re-phrase what I understood...
1. Source data
There is source data, which is a DataFrame with four columns:
id - ID number of an entity
Order - indicates the sequence of periods
Consume - how much was consumed during the period
Generate - how much was generated during the period
2. Calculations
For each id, we want to calculate:
diff, which is the difference between Generate and Consume for each period
opening balance, which is the closing balance from the previous period
closing balance, which is the cumulative sum of the diff
3. Code
I will try to solve this with groupby, cumsum and shift.
# Make sure the df is sorted
df = df.sort_values(['id','Order'])
df['diff'] = df['Generate'] - df['Consume']
df['closing_balance'] = df.groupby('id')['diff'].cumsum()
# Opening balance equals the closing balance from the previous period
df['opening_balance'] = df.groupby('id')['closing_balance'].shift(1)
I definitely misunderstood something, feel free to correct me and I will try to come up with a better answer.
In particular, I wasn't sure how to handle the closing_balance going into negative numbers. Should it show negative balance? Should it nullify the "debts"?

Converting boolean to zero-or-one, for all elements in an array

I have the following dataset with boolean columns:
date hr energy
0 5-Feb-18 False False
1 29-Jan-18 False False
2 6-Dec-17 True False
3 16-Nov-17 False False
4 14-Nov-17 True True
5 25-Oct-17 False False
6 24-Oct-17 False False
7 5-Oct-17 False False
8 3-Oct-17 False False
9 26-Sep-17 False False
10 13-Sep-17 True False
11 7-Sep-17 False False
12 31-Aug-17 False False
I want to multiply each boolean column by 1 to turn it into a dummy (0/1) variable.
I tried:
df = df.iloc[:, 1:]
for col in df:
    col = col*1
but the columns remain boolean. Why?
Just use astype(int):
df.iloc[:,1:]=df.iloc[:,1:].astype(int)
df
Out[477]:
date hr energy
0 5-Feb-18 0 0
1 29-Jan-18 0 0
2 6-Dec-17 1 0
3 16-Nov-17 0 0
4 14-Nov-17 1 1
5 25-Oct-17 0 0
6 24-Oct-17 0 0
7 5-Oct-17 0 0
8 3-Oct-17 0 0
9 26-Sep-17 0 0
10 13-Sep-17 1 0
11 7-Sep-17 0 0
12 31-Aug-17 0 0
For future cases with values other than True or False: if you want to convert categorical values into numerical ones, you can always use the replace function.
df.iloc[:,1:]=df.iloc[:,1:].replace({True:1,False:0})
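As for why the loop in the question changes nothing: iterating over a DataFrame yields the column labels, and `col = col*1` only rebinds the local name col without writing anything back. A minimal sketch on three rows from the question's table (the names labels and bool_cols are assumptions for illustration), selecting the boolean columns by dtype and assigning the converted values back by label:

```python
import pandas as pd

df = pd.DataFrame({
    "date":   ["5-Feb-18", "6-Dec-17", "14-Nov-17"],
    "hr":     [False, True, True],
    "energy": [False, False, True],
})

# Iterating over a DataFrame gives column labels (strings), not the columns themselves.
labels = [col for col in df]

# Pick the boolean columns by dtype and store the converted values back in the frame.
bool_cols = df.select_dtypes(include="bool").columns
df[bool_cols] = df[bool_cols].astype(int)
print(df)
```

Assigning through df[bool_cols] replaces the columns, so the integer dtype is kept rather than being coerced back to boolean.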