Pandas: Grouping by values when a column is a list

I have a DataFrame like this one:
df = pd.DataFrame({'type':[[1,3],[1,2,3],[2,3]], 'value':[4,5,6]})
type | value
-------------
1,3 | 4
1,2,3| 5
2,3 | 6
I would like to group by the different values in the 'type' column so for example the sum of value would be:
type | sum
------------
1 | 9
2 | 11
3 | 15
Thanks for your help!

First reshape the DataFrame by the type column using the DataFrame constructor, stack and reset_index. Then cast the type column to int and finally groupby with sum aggregation:
df1 = pd.DataFrame(df['type'].values.tolist(), index=df['value']) \
        .stack() \
        .reset_index(name='type')
df1.type = df1.type.astype(int)
print (df1)
value level_1 type
0 4 0 1
1 4 1 3
2 5 0 1
3 5 1 2
4 5 2 3
5 6 0 2
6 6 1 3
print (df1.groupby('type', as_index=False)['value'].sum())
type value
0 1 9
1 2 11
2 3 15
Another solution with join:
df1 = pd.DataFrame(df['type'].values.tolist()) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .rename('type') \
        .astype(int)
print (df1)
0 1
0 3
1 1
1 2
1 3
2 2
2 3
Name: type, dtype: int32
df2 = df[['value']].join(df1)
print (df2)
value type
0 4 1
0 4 3
1 5 1
1 5 2
1 5 3
2 6 2
2 6 3
print (df2.groupby('type', as_index=False)['value'].sum())
type value
0 1 9
1 2 11
2 3 15
A Series version: select the first level of the index with get_level_values, convert it to a Series with to_series, and aggregate with sum. Finally reset_index and rename the index column to type:
df1 = pd.DataFrame(df['type'].values.tolist(), index = df['value']).stack().astype(int)
print (df1)
value
4 0 1
1 3
5 0 1
1 2
2 3
6 0 2
1 3
dtype: int32
print (df1.index.get_level_values(0)
          .to_series()
          .groupby(df1.values)
          .sum()
          .reset_index()
          .rename(columns={'index':'type'}))
type value
0 1 9
1 2 11
2 3 15
Edit per comment: a slightly modified second solution using DataFrame.pop:
df = pd.DataFrame({'type':[[1,3],[1,2,3],[2,3]],
'value1':[4,5,6],
'value2':[1,2,3],
'value3':[4,6,1]})
print (df)
type value1 value2 value3
0 [1, 3] 4 1 4
1 [1, 2, 3] 5 2 6
2 [2, 3] 6 3 1
df1 = pd.DataFrame(df.pop('type').values.tolist()) \
        .stack() \
        .reset_index(level=1, drop=True) \
        .rename('type') \
        .astype(int)
print (df1)
0 1
0 3
1 1
1 2
1 3
2 2
2 3
Name: type, dtype: int32
print (df.join(df1).groupby('type', as_index=False).sum())
type value1 value2 value3
0 1 9 3 10
1 2 11 5 7
2 3 15 6 11
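In newer pandas (0.25+), the same reshape can also be sketched with DataFrame.explode, which flattens the list column directly without going through the DataFrame constructor; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'type': [[1, 3], [1, 2, 3], [2, 3]], 'value': [4, 5, 6]})

# explode turns each list element into its own row, repeating 'value',
# then a plain groupby-sum finishes the job
out = df.explode('type').astype({'type': int}) \
        .groupby('type', as_index=False)['value'].sum()
print(out)
```

The exploded column comes back as object dtype, hence the astype cast before grouping.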

Related

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with index (see example below), is there a way to assign a count value for each iteration in column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df]*5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
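An alternative of my own (not from the answer above) is to pass keys to pd.concat, which labels each copy with an outer index level that can then be pulled out as the counter:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})
N = 5

# keys= adds an outer MultiIndex level 1..N identifying each replica
new_df = pd.concat([df] * N, keys=range(1, N + 1))

# copy that level into column c, then drop it to restore the flat index
new_df['c'] = new_df.index.get_level_values(0)
new_df = new_df.droplevel(0)
print(new_df)
```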

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it:
The following Dataframe :
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal, for each row, to max(0, A-B+C).
I tried np.maximum(df.A-df.B+df.C, 0) but it doesn't match and gives me the maximum value of the calculated column for each row (= 10 in the example).
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
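For reference, np.maximum (the elementwise maximum) does work here as-is; the reduction the asker describes would come from np.max or the builtin max, which collapse to a single value. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [7, 5, 4, 9], 'B': [12, 4, 8, 2], 'C': [2, 4, 2, 3]})

# np.maximum compares each element of the Series against the scalar 0
df['D'] = np.maximum(df['A'] - df['B'] + df['C'], 0)
print(df['D'].tolist())  # -> [0, 5, 0, 10]
```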

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values to produce a df like this:
total
1 2
3 2
4 1
5 2
8 2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
Another alternative is to use stack, followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique with return_counts=True and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
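Outside pandas, the same tally can be sketched with collections.Counter over the flattened values (a variant of my own, not from the answers above):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 5], 'B': [8, 4, 8], 'C': [3, 3, 1]})

# ravel() flattens all cells to 1-D; Counter tallies each value
counts = Counter(df.to_numpy().ravel())
total = pd.Series(counts).sort_index().rename('total')
print(total)
```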

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements in the list, find their locations in the dataframe, and store them in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
Num
0 1
1 2
2 3
3 4
4 5
dataframe : df2
0 1 2 3 4
0 9 12 8 6 7
1 11 1 4 10 13
2 5 14 2 0 3
I've tried something similar to this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
so looking for a result like this
dataframe : df1 updated
Num X Y
0 1 1 1
1 2 2 2
2 3 2 4
3 4 1 2
4 5 2 0
any hints or ideas would be appreciated
Instead of trying to find the value in df2, why not just make df2 a flat dataframe?
df2 = pd.melt(df2)
df2.reset_index(inplace=True)
df2.columns = ['X', 'Y', 'Num']
so now your df2 just looks like this:
Index X Y Num
0 0 0 9
1 1 0 11
2 2 0 5
3 3 1 12
4 4 1 1
5 5 1 14
You can of course sort by Num and if you just want the values from your list you can further filter df2:
df2 = df2[df2.Num.isin(my_list)]
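To land on exactly the updated df1 layout the asker wanted (Num plus its X row and Y column in df2), one sketch of my own stacks df2 and merges on the value; note the cast, since my_list holds strings while df2 holds ints:

```python
import pandas as pd

my_list = ['1', '2', '3', '4', '5']
df1 = pd.DataFrame(my_list, columns=['Num'])
df2 = pd.DataFrame([[9, 12, 8, 6, 7], [11, 1, 4, 10, 13], [5, 14, 2, 0, 3]])

# stack flattens df2 to a Series keyed by (row, col); name those levels X, Y
flat = df2.stack().rename_axis(['X', 'Y']).reset_index(name='Num')

# match each list element to its position by merging on the value
out = df1.astype({'Num': int}).merge(flat, on='Num')
print(out)
```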

Pandas count occurrence within column on condition being satisfied

I am trying to count by grouping. See the input and output below.
input:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['col3'] = [1,1,1,1,1,1,1]
output:
col4
0 2
1 2
2 2
3 2
4 1
5 1
6 1
Tried playing around with groupby and count, by doing:
s = df.groupby(['col1','col2'])['col3'].sum()
and the output I got was
a 4 2
5 2
b 6 1
7 1
8 1
How do I add it as a column on the main df?
Thanks vm!
Use transform with len or size:
df['count'] = df.groupby(['col1','col2'])['col3'].transform(len)
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
df['count'] = df.groupby(['col1','col2'])['col3'].transform('size')
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
But column col3 is not necessary, you can use col1 or col2:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['count'] = df.groupby(['col1','col2'])['col1'].transform(len)
df['count1'] = df.groupby(['col1','col2'])['col2'].transform(len)
print (df)
col1 col2 count count1
0 a 4 2 2
1 a 4 2 2
2 a 5 2 2
3 a 5 2 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
Try this:
df['count'] = df.groupby(['col1','col2'])['col3'].transform(sum)
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
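A further variant of my own (not from the answers above) computes the group sizes once and merges them back, which avoids needing a col3 entirely:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                   'col2': [4, 4, 5, 5, 6, 7, 8]})

# size() counts rows per (col1, col2) pair; merge broadcasts it back
counts = df.groupby(['col1', 'col2']).size().reset_index(name='count')
df = df.merge(counts, on=['col1', 'col2'])
print(df)
```

transform is the more idiomatic one-liner; the merge form is handy when the counts table is also needed on its own.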