groupby and count on multiple columns of dataframe - pandas

I have a df:
import pandas as pd

df = pd.DataFrame([
    [1, 1, 'A', 10],
    [4, 1, 'A', 0],
    [7, 2, 'A', 3],
    [2, 2, 'A', 4],
    [6, 2, 'B', 9],
    [5, 2, 'B', 7],
    [5, 1, 'B', 12],
    [5, 1, 'B', 4],
    [5, 2, 'C', 9],
    [5, 1, 'C', 3],
    [5, 1, 'C', 4],
    [5, 2, 'C', 7]],
    index=['A'] * 12,
    columns=['A', 'B', 'C', 'D'])
I can count the number of non-zero values in column D, grouped by column A, using:
df['countTrans'] = df['D'].ne(0).groupby(df['A']).transform('sum')
where the output is:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 0 0.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 7.0
A 5 1 B 12 7.0
A 5 1 B 4 7.0
A 5 2 C 9 7.0
A 5 1 C 3 7.0
A 5 1 C 4 7.0
A 5 2 C 7 7.0
However, I would like to group not only by column A but also by column B.
I have tried variants of:
df['countTrans'] = df['D'].ne(0).groupby(df['A'], df['B']).transform('sum')
df['countTrans'] = df['D'].ne(0).groupby(df['A','B']).transform('sum')
without success.
My desired output would look like:
df:
A B C D countTrans
A 1 1 A 10 1.0
A 4 1 A 0 0.0
A 7 2 A 3 1.0
A 2 2 A 4 1.0
A 6 2 B 9 1.0
A 5 2 B 7 3.0
A 5 1 B 12 4.0
A 5 1 B 4 4.0
A 5 2 C 9 3.0
A 5 1 C 3 4.0
A 5 1 C 4 4.0
A 5 2 C 7 3.0

A possible solution is to pass the Series to groupby inside a list:
df['countTrans'] = df['D'].ne(0).groupby([df['A'], df['B']]).transform('sum')
print(df)
A B C D countTrans
A 1 1 A 10 1
A 4 1 A 0 0
A 7 2 A 3 1
A 2 2 A 4 1
A 6 2 B 9 1
A 5 2 B 7 3
A 5 1 B 12 4
A 5 1 B 4 4
A 5 2 C 9 3
A 5 1 C 3 4
A 5 1 C 4 4
A 5 2 C 7 3
Or create a helper column with DataFrame.assign (cleaner, in my opinion):
df['countTrans'] = df.assign(E = df['D'].ne(0)).groupby(['A','B'])['E'].transform('sum')
# similar solution, overwriting D:
# df['countTrans'] = df.assign(D = df['D'].ne(0)).groupby(['A','B'])['D'].transform('sum')
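For reference, if you want one row per (A, B) group instead of a value broadcast back onto every row, the same boolean trick aggregates cleanly. A minimal sketch (the helper column name nonzero_D is mine):
counts = (df.assign(nonzero_D=df['D'].ne(0))
            .groupby(['A', 'B'])['nonzero_D']
            .sum()
            .reset_index(name='countTrans'))
print(counts)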

Related

Pandas return True if condition True in any of previous n rows

example df:
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                             [1, 2, 3], [4, 5, 6], [7, 8, 9],
                             [1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 1 2 3
4 4 5 6
5 7 8 9
6 1 2 3
7 4 5 6
8 7 8 9
The goal is to get a new column, 'd', that is True when a certain condition is true anywhere within a rolling window of size n.
For example, the desired column 'd' for the condition "column c == 2 within a rolling window of 2":
a b c d
0 1 2 3 nan
1 4 5 6 True
2 7 8 9 False
3 1 2 3 True
4 4 5 6 True
5 7 8 9 False
6 1 2 3 True
7 4 5 6 True
8 7 8 9 False
I hope my question is clear. Thank you for taking the time.
To be clear, I am trying to return True if ANY of the rows in the rolling window are True.
I imagine you meant column b:
s = df2['b'].eq(2).rolling(2).max()
df2['d'] = s.astype(bool).mask(s.isna())
NB: you need to use max, as rolling only works with numeric data.
Output:
a b c d
0 1 2 3 NaN
1 4 5 6 True
2 7 8 9 False
3 1 2 3 True
4 4 5 6 True
5 7 8 9 False
6 1 2 3 True
7 4 5 6 True
8 7 8 9 False
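If you need this for an arbitrary window size n, the same idea wraps up into a small helper. A sketch with the same trailing-window semantics as above (the name any_in_window is mine):
def any_in_window(cond, n):
    # rolling only handles numeric data, so reduce the 0/1 values with max
    s = cond.astype(int).rolling(n).max()
    # keep rows without a full window as NaN instead of coercing them to False
    return s.astype(bool).mask(s.isna())

df2['d'] = any_in_window(df2['b'].eq(2), 2)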

Concatenate all combinations of sub-level columns in a pandas DataFrame

Given the following DataFrame:
cols = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']])
example = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=cols)
example
A B
a b a b
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I would like to end up with the following one:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
I used this code:
concatenated = pd.DataFrame([])
for A_sub_col in ('a', 'b'):
    for B_sub_col in ('a', 'b'):
        new_frame = example[[('A', A_sub_col), ('B', B_sub_col)]]
        new_frame.columns = ['A', 'B']
        concatenated = pd.concat([concatenated, new_frame])
However, I strongly suspect that there is a more straightforward, idiomatic way to do that with Pandas. How would one go about it?
Here's an option using a list comprehension:
pd.concat([
    example[[('A', i), ('B', j)]].droplevel(level=1, axis=1)
    for i in example['A'].columns
    for j in example['B'].columns
]).reset_index(drop=True)
Output:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11
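If the sub-levels differ between the two top-level groups, itertools.product makes the pairing explicit. A sketch of the same approach:
from itertools import product

pairs = product(example['A'].columns, example['B'].columns)
pd.concat(
    [example[[('A', i), ('B', j)]].droplevel(level=1, axis=1) for i, j in pairs]
).reset_index(drop=True)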
Here is one way. Not sure it is more pythonic, and it is definitely less readable :-) but on the other hand it does not use explicit loops:
(example
 .apply(lambda c: [list(c)])
 .stack(level=1)
 .apply(lambda c: [list(c)])
 .explode('A')
 .explode('B')
 .apply(pd.Series.explode)
 .reset_index(drop=True)
)
To understand what's going on, it helps to run this one step at a time, but the end result is:
A B
0 0 2
1 4 6
2 8 10
3 0 3
4 4 7
5 8 11
6 1 2
7 5 6
8 9 10
9 1 3
10 5 7
11 9 11

Pairwise minima of elements of a pandas Series

Input:
import numpy
import pandas

numbers = pandas.Series([3, 5, 8, 1], index=["A", "B", "C", "D"])
A 3
B 5
C 8
D 1
Expected output (pandas DataFrame):
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Current (working) solution:
pairwise_mins = pandas.DataFrame(index=numbers.index)

def calculate_mins(series, index):
    return numpy.minimum(series, series[index])

for col in numbers.index:
    pairwise_mins[col] = calculate_mins(numbers, col)
I suspect there must be a better, shorter, vectorized solution. Could anyone help me with it?
Use the outer method that numpy provides for its ufuncs, combined here with numpy.minimum:
n = numbers.to_numpy()
np.minimum.outer(n, n)
array([[3, 3, 3, 1],
       [3, 5, 5, 1],
       [3, 5, 8, 1],
       [1, 1, 1, 1]], dtype=int64)
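To match the expected output exactly, wrap the array back into a labeled DataFrame. A short sketch reusing the Series' index for both axes:
n = numbers.to_numpy()
pd.DataFrame(np.minimum.outer(n, n),
             index=numbers.index,
             columns=numbers.index)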
This can be done by broadcasting:
pd.DataFrame(np.where(numbers.values[:, None] < numbers.values,
                      numbers.values[:, None],
                      numbers.values),
             index=numbers.index,
             columns=numbers.index)
Output:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
Use np.broadcast_to and np.clip; element [i, j] of the result ends up as min(a[i], a[j]):
a = numbers.values
pd.DataFrame(np.broadcast_to(a, (a.size, a.size)).T.clip(max=a),
             columns=numbers.index,
             index=numbers.index)
Output:
A B C D
A 3 3 3 1
B 3 5 5 1
C 3 5 8 1
D 1 1 1 1
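For comparison, a pandas-only variant via clip. A sketch; the numpy answers above will be faster on long Series:
pairwise_mins = pd.DataFrame({c: numbers.clip(upper=v)
                              for c, v in numbers.items()})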

pandas convert lists in multiple columns within DataFrame to separate columns

I am trying to convert a list within multiple columns of a pandas DataFrame into separate columns.
Say, I have a dataframe like this:
0 1
0 [1, 2, 3] [4, 5, 6]
1 [1, 2, 3] [4, 5, 6]
2 [1, 2, 3] [4, 5, 6]
And would like to convert it to something like this:
0 1 2 0 1 2
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
I have managed to do this in a loop. However, I would like to do this in fewer lines.
My code snippet so far is as follows:
import pandas as pd
df = pd.DataFrame([[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]],[[1,2,3],[4,5,6]]])
output1 = df[0].apply(pd.Series)
output2 = df[1].apply(pd.Series)
output = pd.concat([output1, output2], axis=1)
If you don't care about the column names, you could do (with numpy imported as np):
>>> df.apply(np.hstack, axis=1).apply(pd.Series)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
Using sum, which concatenates the lists across each row:
pd.DataFrame(df.sum(1).tolist())
0 1 2 3 4 5
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
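A variant that keeps the duplicate 0 1 2 headers from the desired output by expanding each column separately. A sketch using tolist, which is usually faster than apply(pd.Series):
pd.concat([pd.DataFrame(df[c].tolist(), index=df.index) for c in df.columns],
          axis=1)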

Merging dataframes in pandas

I am new to pandas and I am facing the following problem:
I have 2 data frames:
df1:
x  y
1  [3, 4]
2  NaN
3  [6]
4  NaN
5  [9, 2]
6  [1, 4, 9]
df2:
x  y
1  [2, 3, 6, 1, 5]
2  [4, 1, 8, 7, 5]
3  [6, 3, 1, 4, 5]
4  [2, 1, 3, 5, 4]
5  [9, 2, 3, 8, 7]
6  [1, 4, 5, 3, 7]
The two dataframes have the same size.
I want to merge the two dataframes such that the resulting dataframe is the following:
result:
x  y
1  [3, 4, 6, 1, 5]
2  [4, 1, 8, 7, 5]
3  [6, 3, 1, 4, 5]
4  [2, 1, 3, 5, 4]
5  [9, 2, 3, 8, 7]
6  [1, 4, 5, 6, 7]
So in the result, priority is given to df2: values present in df2 come first, and the remaining values from df1 follow (keeping the positions they had in df1). There should be no repeated values in the result (i.e. if a value is at position 1 in df1 and at position 3 in df2, it should appear only at position 1 in the result, not twice).
Any kind of help will be appreciated.
Thanks!
IIUC
Setup
df1 = pd.DataFrame(dict(x=range(1, 7),
                        y=[[3, 4], None, [6], None, [9, 2], [1, 4, 9]]))
df2 = pd.DataFrame(dict(x=range(1, 7), y=[[2, 3, 6, 1, 5], [4, 1, 8, 7, 5],
                                          [6, 3, 1, 4, 5], [2, 1, 3, 5, 4],
                                          [9, 2, 3, 8, 7], [1, 4, 5, 3, 7]]))
print(df1)
print()
print(df2)
x y
0 1 [3, 4]
1 2 None
2 3 [6]
3 4 None
4 5 [9, 2]
5 6 [1, 4, 9]
x y
0 1 [2, 3, 6, 1, 5]
1 2 [4, 1, 8, 7, 5]
2 3 [6, 3, 1, 4, 5]
3 4 [2, 1, 3, 5, 4]
4 5 [9, 2, 3, 8, 7]
5 6 [1, 4, 5, 3, 7]
Convert to something more usable:
df1_ = df1.set_index('x').y.apply(pd.Series)
df2_ = df2.set_index('x').y.apply(pd.Series)
print(df1_)
print()
print(df2_)
0 1 2
x
1 3.0 4.0 NaN
2 NaN NaN NaN
3 6.0 NaN NaN
4 NaN NaN NaN
5 9.0 2.0 NaN
6 1.0 4.0 9.0
0 1 2 3 4
x
1 2 3 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 5 3 7
Combine with priority given to df1 (I think you meant df1, as that is what is consistent with your expected output), then reduce to eliminate duplicates:
print(df1_.combine_first(df2_).apply(lambda x: x.unique(), axis=1))
0 1 2 3 4
x
1 3 4 6 1 5
2 4 1 8 7 5
3 6 3 1 4 5
4 2 1 3 5 4
5 9 2 3 8 7
6 1 4 9 3 7
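On recent pandas versions, apply with a callable returning arrays may not expand back into columns, so here is a more explicit sketch of the same combine-then-dedupe step (rows of unequal length would be padded with NaN):
combined = df1_.combine_first(df2_)
result = pd.DataFrame([row.dropna().unique() for _, row in combined.iterrows()],
                      index=combined.index)
print(result)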