Working with a multiindex dataframe to get summation results over a boolean column, based on a condition from another column - pandas

We have a multiindex dataframe that looks like:
date condition_1 condition_2
item1 0 2021-06-10 06:30:00+00:00 True False
1 2021-06-10 07:00:00+00:00 False True
2 2021-06-10 07:30:00+00:00 True True
item2 3 2021-06-10 06:30:00+00:00 True False
4 2021-06-10 07:00:00+00:00 True True
5 2021-06-10 07:30:00+00:00 True True
item3 6 2021-06-10 06:30:00+00:00 True True
7 2021-06-10 07:00:00+00:00 False True
8 2021-06-10 07:30:00+00:00 True True
The value of date repeats between items (because the df is a result of a default concat on a dictionary of dataframes).
The logic we basically want to vectorize is "for every date where condition_1 is true for all items: sum the occurrences where condition_2 is true in a new results column for all of them".
Based on the example above, the result would look like this (comments on how each value is derived are next to the result column):
date condition_1 condition_2 result
item1 0 2021-06-10 06:30:00+00:00 True False 1 [because condition_1 is True for all items and condition_2 is True once]
1 2021-06-10 07:00:00+00:00 False True 0 [condition_1 is not True for all items so condition_2 is irrelevant]
2 2021-06-10 07:30:00+00:00 True True 3 [both conditions are True for all 3 items]
item2 3 2021-06-10 06:30:00+00:00 True False 1 [a repeat for the same reasons]
4 2021-06-10 07:00:00+00:00 True True 0 [a repeat for the same reasons]
5 2021-06-10 07:30:00+00:00 True True 3 [a repeat for the same reasons]
item3 6 2021-06-10 06:30:00+00:00 True True 1 [a repeat for the same reasons]
7 2021-06-10 07:00:00+00:00 False True 0 [a repeat for the same reasons]
8 2021-06-10 07:30:00+00:00 True True 3 [a repeat for the same reasons]
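
For reference, a minimal construction of the example frame (make is just a local helper; the explicit inner indices reproduce the running 0-8 shown above):
import pandas as pd

dates = pd.to_datetime(['2021-06-10 06:30:00+00:00',
                        '2021-06-10 07:00:00+00:00',
                        '2021-06-10 07:30:00+00:00'])

def make(c1, c2, idx):
    # one per-item frame, as in the original dictionary of dataframes
    return pd.DataFrame({'date': dates, 'condition_1': c1, 'condition_2': c2}, index=idx)

df = pd.concat({
    'item1': make([True, False, True], [False, True, True], [0, 1, 2]),
    'item2': make([True, True, True], [False, True, True], [3, 4, 5]),
    'item3': make([True, False, True], [True, True, True], [6, 7, 8]),
}, names=['item', None])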

Here is what I came up with.
def cond_sum(s):
    return s.condition_1.all() * s.condition_2.sum()

df.reset_index(level=0, inplace=True)          # 'item' becomes a regular column
per_date = df.groupby('date').apply(cond_sum)  # one value per date
df['result'] = df['date'].map(per_date)        # broadcast it back to every row
Then if you want the original index back, you can re-append it:
df = df.set_index('item', append=True).swaplevel()
Note, you mentioned vectorized, so you could swap the apply out for:
dfg = df.groupby('date').agg({'condition_1': 'all', 'condition_2': 'sum'})
df['result'] = df['date'].map(dfg.condition_1 * dfg.condition_2)
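Alternatively, groupby().transform broadcasts the per-date aggregates back onto the rows in a single pass; a sketch using the question's column names:
g = df.groupby('date')
df['result'] = g['condition_2'].transform('sum') * g['condition_1'].transform('all')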

How to get count for the non duplicates in column

Here is my code to count the duplicates; how do I negate it to count the non-duplicates?
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False to mark all duplicates, then invert the mask and sum to count the True values:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution: Series.drop_duplicates with keep=False, then take the length of the Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2
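For comparison, the same count can be read off value_counts by keeping the values that occur exactly once:
counts = df['col'].value_counts()
print ((counts == 1).sum())
2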

Replace a string value with NaN in pandas data frame - Python

I have to replace the value '?' with NaN so I can invoke the .isnull() method. I have found several solutions but some errors are always returned. Suppose:
data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
and if I try:
data.replace('?', np.nan)
I have:
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
but data.isnull() returns:
0 1 2
0 False False False
1 False False False
2 False False False
Why?
I think you forgot to assign it back:
import numpy as np
import pandas as pd

data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
data = data.replace('?', np.nan)
#alternative
#data.replace('?', np.nan, inplace=True)
print (data)
0 1 2
0 1.0 NaN 5
1 NaN NaN 4
2 NaN 32.1 1
print (data.isnull())
0 1 2
0 False True False
1 True True False
2 True False False
# a dataframe with string values
dat = pd.DataFrame({'a':[1,'FG', 2, 4], 'b':[2, 5, 'NA', 7]})
Removing non-numerical elements from the dataframe:
# Method 1 - with regex
dat2 = dat.replace(r'^([A-Za-z]|[0-9]|_)+$', np.nan, regex=True)
dat2
# Method 2 - with pd.to_numeric
dat3 = pd.DataFrame()
for col in dat.columns:
    dat3[col] = pd.to_numeric(dat[col], errors='coerce')
dat3
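The per-column loop can also be written as a single call, since DataFrame.apply forwards extra keyword arguments to the function:
dat3 = dat.apply(pd.to_numeric, errors='coerce')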
'?' is not null, so you would expect to get False from the isnull() test:
>>> data = pd.DataFrame([[1,'?',5],['?','?',4],['?',32.1,1]])
>>> data.isnull()
0 1 2
0 False False False
1 False False False
2 False False False
After you replace '?' with NaN, the test looks quite different:
>>> data = data.replace('?', np.nan)
>>> data.isnull()
0 1 2
0 False True False
1 True True False
2 True False False
I believe when you are doing data.replace('?', np.nan) the action is not done in place, so you must try:
data = data.replace('?', np.nan)
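As an aside, if the data comes from a file, the substitution can happen at parse time via na_values; a small sketch with a hypothetical inline CSV:
import io
import pandas as pd

raw = io.StringIO('a,b,c\n1,?,5\n?,?,4\n?,32.1,1')
data = pd.read_csv(raw, na_values='?')   # '?' becomes NaN while reading
print (data.isnull())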

Converting boolean to zero-or-one, for all elements in an array

I have the following dataset of boolean columns:
date hr energy
0 5-Feb-18 False False
1 29-Jan-18 False False
2 6-Dec-17 True False
3 16-Nov-17 False False
4 14-Nov-17 True True
5 25-Oct-17 False False
6 24-Oct-17 False False
7 5-Oct-17 False False
8 3-Oct-17 False False
9 26-Sep-17 False False
10 13-Sep-17 True False
11 7-Sep-17 False False
12 31-Aug-17 False False
I want to multiply each boolean column by 1 to turn it into a dummy
I tried:
df = df.iloc[:, 1:]
for col in df:
    col = col*1
but the columns remain boolean. Why?
Just using
df.iloc[:,1:]=df.iloc[:,1:].astype(int)
df
Out[477]:
date hr energy
0 5-Feb-18 0 0
1 29-Jan-18 0 0
2 6-Dec-17 1 0
3 16-Nov-17 0 0
4 14-Nov-17 1 1
5 25-Oct-17 0 0
6 24-Oct-17 0 0
7 5-Oct-17 0 0
8 3-Oct-17 0 0
9 26-Sep-17 0 0
10 13-Sep-17 1 0
11 7-Sep-17 0 0
12 31-Aug-17 0 0
For future cases beyond True/False: if you want to convert categorical values into numerical ones, you can always use the replace function.
df.iloc[:,1:]=df.iloc[:,1:].replace({True:1,False:0})
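As for the why: for col in df iterates over column names (plain strings), so col = col*1 only rebinds a local variable and never touches the dataframe. The multiply-by-1 idea itself is fine once it is applied to the frame, because booleans upcast to integers:
# assumes the date column comes first, as in the sample
df.iloc[:, 1:] = df.iloc[:, 1:] * 1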

check if at least one value exists in pandas dataframe index

How can I check whether at least one of the index values of df2 is in df1?
df1
Val
StartDate
2015-03-31 NaN
2015-04-03 NaN
2015-04-05 8.08
2015-04-06 23.48
df2
Val
StartDate
2015-03-31 True
2015-04-01 True
2015-04-02 True
2015-04-03 True
2015-04-04 True
2015-04-05 True
2015-04-06 True
df2.index in df1.index
returns False
Use Index.isin, then any on the resulting mask to check for at least one True:
a = df1.index.isin(df2.index).any()
print (a)
True
Detail:
print (df1.index.isin(df2.index))
[ True True True True]
An alternative, non-vectorized check with a list comprehension:
[i for i in df1.index if i in df2.index]
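Index.intersection answers the same question symmetrically (it does not matter which frame you start from) and also shows which labels overlap:
common = df1.index.intersection(df2.index)
print (not common.empty)
True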

returning same initial input when filtering dataframes' values

I have the following dataframe, obtained with pandas' read_html function.
A        1.48        2.64    1.02         2.46   2.73
B       658.4        14.33    7.41        15.35   8.59
C        3.76         2.07    4.61         2.26   2.05
D   513854.86         5.70    0.00         5.35  30.16
I would like to remove the rows with values over 150, so I tried df1 = df[df > 150], but it returns the same table.
Then I thought to set the decimal separator in the read call, route = pd.read_html('https//route', decimal='.'), and it still returns the initial dataframe with no filter applied.
This would be my desired output:
A        1.48        2.64    1.02         2.46   2.73
C        3.76         2.07    4.61         2.26   2.05
Need:
print (df)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
1 B 658.40 14.33 7.41 15.35 8.59
2 C 3.76 2.07 4.61 2.26 2.05
3 D 513854.86 5.70 0.00 5.35 30.16
df1 = df[~(df.iloc[:, 1:] > 150).any(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Or:
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]
print (df1)
0 1 2 3 4 5
0 A 1.48 2.64 1.02 2.46 2.73
2 C 3.76 2.07 4.61 2.26 2.05
Explanation:
First select all columns without first by iloc:
print (df.iloc[:, 1:])
1 2 3 4 5
0 1.48 2.64 1.02 2.46 2.73
1 658.40 14.33 7.41 15.35 8.59
2 3.76 2.07 4.61 2.26 2.05
3 513854.86 5.70 0.00 5.35 30.16
Then compare - get boolean DataFrame:
print (df.iloc[:, 1:] > 150)
1 2 3 4 5
0 False False False False False
1 True False False False False
2 False False False False False
3 True False False False False
print (df.iloc[:, 1:] <= 150)
1 2 3 4 5
0 True True True True True
1 False True True True True
2 True True True True True
3 False True True True True
Then use all to check if all values in a row are True,
or any to check if at least one value is True:
print ((df.iloc[:, 1:] > 150).any(axis=1))
0 False
1 True
2 False
3 True
dtype: bool
print ((df.iloc[:, 1:] <= 150).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
Finally, invert the first Series with ~ and filter by boolean indexing.
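If the filter still appears to do nothing, check the dtypes: read_html sometimes leaves numeric columns as strings (especially with unusual decimal or thousands separators), and then the comparison cannot filter as intended. A defensive sketch, assuming the first column holds the labels:
# force columns 1..n to numeric before comparing
df.iloc[:, 1:] = df.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')
df1 = df[(df.iloc[:, 1:] <= 150).all(axis=1)]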