Pandas: how to group rows by a dictionary of {row: group}

I have a dataframe with n rows:
1 2 3
3 4 1
5 3 2
9 8 2
7 2 6
0 0 0
4 4 4
8 4 1
...
and a dictionary, so that the row index is the key and the value is the group:
d = {0 : 0 , 1: 0, 2 : 0, 3 : 1, 4 : 1, 5: 2, 6: 2}
I want to group the rows according to this mapping and then apply mean to each group.
So I will get:
3 3 2 #This is the mean of rows 0,1,2 from the original df, as d[0]=d[1]=d[2]=0
8 5 4
2 2 2
8 4 1
What is the best way to do so?

Simply use the dictionary in the groupby; it will replace each index value with the dictionary value matching on that key, and group by the result:
df.groupby(d).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
If you also want to keep the rows whose index is missing from the dictionary, pass dropna=False to groupby. Those rows will be listed in the NaN group:
df.groupby(d, dropna=False).mean()
output:
a b c
0.0 3.0 3.0 2.0
1.0 8.0 5.0 4.0
2.0 2.0 2.0 2.0
NaN 8.0 4.0 1.0
And for a range index instead of the group labels:
df.groupby(d, dropna=False, as_index=False).mean()
output:
a b c
0 3.0 3.0 2.0
1 8.0 5.0 4.0
2 2.0 2.0 2.0
3 8.0 4.0 1.0
used input:
a b c
0 1 2 3
1 3 4 1
2 5 3 2
3 9 8 2
4 7 2 6
5 0 0 0
6 4 4 4
7 8 4 1
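
To reproduce the input above as code (a convenience sketch, not part of the original answer; note that row 7 has no key in d, which is why it only shows up with dropna=False):
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, 1], [5, 3, 2], [9, 8, 2],
                   [7, 2, 6], [0, 0, 0], [4, 4, 4], [8, 4, 1]],
                  columns=['a', 'b', 'c'])
d = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

print(df.groupby(d).mean())                # rows with no key in d are dropped
print(df.groupby(d, dropna=False).mean())  # rows with no key go to the NaN group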

mode returns Exception: Must produce aggregated value

For this dataframe:
values ii
0 3.0 4
1 0.0 1
2 3.0 8
3 2.0 5
4 2.0 1
5 3.0 5
6 2.0 4
7 1.0 8
8 0.0 5
9 1.0 1
This line raises "Must produce aggregated value":
bii2=df.groupby(['ii'])['values'].agg(pd.Series.mode)
While this line works:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x)[0])
Could you explain why that is?
The problem is that mode sometimes returns two or more values per group. You can see this with GroupBy.apply:
bii2=df.groupby(['ii'])['values'].apply(pd.Series.mode)
print (bii2)
ii
1 0 0.0
1 1.0
2 2.0
4 0 2.0
1 3.0
5 0 0.0
1 2.0
2 3.0
8 0 1.0
1 3.0
Name: values, dtype: float64
pandas agg needs a scalar output, so it raises an error. If you select only the first value, it works fine:
bii3=df.groupby('ii')['values'].agg(lambda x: pd.Series.mode(x).iat[0])
print (bii3)
ii
1 0.0
4 2.0
5 0.0
8 1.0
Name: values, dtype: float64
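If you instead want to keep all modes, one option (a sketch, not from the original answer; a tuple counts as a single value for agg) is:
bii4 = df.groupby('ii')['values'].agg(lambda x: tuple(pd.Series.mode(x)))
print (bii4)
ii
1    (0.0, 1.0, 2.0)
4         (2.0, 3.0)
5    (0.0, 2.0, 3.0)
8         (1.0, 3.0)
Name: values, dtype: object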

Insert a list into a row in pandas?

The data may look like this:
0 1 2
0 0 0 0
1 1 5 6
2 2 7 4
list=[1.2,1.3,1.4]
How can I insert it into the dataframe?
0 1 2
0 1.2 1.3 1.4
1 1 5 6
2 2 7 4
Currently I use a loop to do it.
Is there a function that does it?
Use .loc to select the first row, then assign your list (renamed here to mylist, since list shadows the Python builtin):
mylist = [1.2,1.3,1.4]
df.loc[0] = mylist
Output
0 1 2
0 1.2 1.3 1.4
1 1.0 5.0 6.0
2 2.0 7.0 4.0
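Note that df.loc[0] = mylist overwrites row 0. If instead you want to append the list as a new row (a side note, assuming a default RangeIndex), you can assign to the next free label:
df.loc[len(df)] = mylist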

Get data from a multi-indexed dataframe given a list of index values

I want to get information from a column called 'index' in a pandas dataframe with a MultiIndex, given a list of first-level index values. Please see the following example.
index ID
0 0 1.0 17226815
1 0 2.0 17226807
2 0 3.0 17226816
3 0 4.0 17226808
4 0 5.0 17231739
5 0 6.0 17231739
6 0 1.0 17226815
1 2.0 17226807
7 0 1.0 17226815
1 3.0 17226816
filtered_list = [3, 5, 7]
With the following line I can get the filtered data:
print(df.loc[df.index.isin(filtered_list, level=0)]['index'])
out:
3 0 4.0
5 0 6.0
7 0 1.0
1 3.0
What I want to get is a list consisting of the 'index' values, shown as additional information next to each filtered index, as follows:
0 3 4
1 5 6
2 7 (1, 3)
How can I get this list?
Thank you in advance.
If I understand correctly,
df.loc[filtered_list,'index'].groupby(level=0).apply(tuple).reset_index()
Output:
0 index
0 3 (4.0,)
1 5 (6.0,)
2 7 (1.0, 3.0)
Going further:
df.loc[filtered_list, 'index']\
  .groupby(level=0)\
  .apply(lambda x: tuple(x)[0] if len(x.index) == 1 else tuple(x))\
  .reset_index()
Output:
0 index
0 3 4
1 5 6
2 7 (1.0, 3.0)
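If you need a plain Python list rather than a DataFrame, a small sketch building on the same idea (grouped is a hypothetical name; singletons unwrapped, multiple values kept as tuples):
grouped = df.loc[filtered_list, 'index'].groupby(level=0).apply(tuple)
pairs = [(i, v[0] if len(v) == 1 else v) for i, v in grouped.items()]
# [(3, 4.0), (5, 6.0), (7, (1.0, 3.0))]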

Generate lines with all unique values of a given column for each group

df = pd.DataFrame({'timePoint': [1,1,1,1,2,2,2,2,3,3,3,3],
                   'item': [1,2,3,4,3,4,5,6,1,3,7,2],
                   'value': [2,4,7,6,5,9,3,2,4,3,1,5]})
>>> df
item timePoint value
0 1 1 2
1 2 1 4
2 3 1 7
3 4 1 6
4 3 2 5
5 4 2 9
6 5 2 3
7 6 2 2
8 1 3 4
9 3 3 3
10 7 3 1
11 2 3 5
In this df, not every item appears at every timePoint. I want to have all unique items at every timePoint, and these newly inserted items should either have:
(i) a NaN value if they have not appeared at a previous timePoint, or
(ii) if they have, they get their most recent value.
The desired output should look like the following (lines marked with # are the inserted ones).
>>> dfx
item timePoint value
0 1 1 2.0
3 1 2 2.0 #
8 1 3 4.0
1 2 1 4.0
4 2 2 4.0 #
11 2 3 5.0
2 3 1 7.0
4 3 2 5.0
9 3 3 3.0
3 4 1 6.0
5 4 2 9.0
6 4 3 9.0 #
0 5 1 NaN #
6 5 2 3.0
7 5 3 3.0 #
1 6 1 NaN #
7 6 2 2.0
8 6 3 2.0 #
2 7 1 NaN #
5 7 2 NaN #
10 7 3 1.0
For example, item 2 gets a 4.0 at timePoint 2 because that's what it had at timePoint 1, whereas item 6 gets a NaN at timePoint 1 because there is no preceding value.
Now, I know that if I manage to insert all lines of every unique item missing in each timePoint group, i.e. reach this point:
>>> dfx
item timePoint value
0 1 1 2.0
1 2 1 4.0
2 3 1 7.0
3 4 1 6.0
4 3 2 5.0
5 4 2 9.0
6 5 2 3.0
7 6 2 2.0
8 1 3 4.0
9 3 3 3.0
10 7 3 1.0
11 2 3 5.0
0 5 1 NaN
1 6 1 NaN
2 7 1 NaN
3 1 2 NaN
4 2 2 NaN
5 7 2 NaN
6 4 3 NaN
7 5 3 NaN
8 6 3 NaN
Then I can do:
dfx.sort_values(by=['item', 'timePoint'],
                ascending=[True, True],
                inplace=True)
dfx['value'] = dfx.groupby('item')['value'].ffill()
which will return the desired output.
But how do I add as lines all df.item.unique() items that are missing to each timePoint group?
Also, if you have a more efficient solution from scratch to suggest, then by all means please be my guest.
Using pd.MultiIndex.from_product with the index levels, then reindex:
d = df.set_index(['item', 'timePoint'])
d.reindex(
    pd.MultiIndex.from_product(d.index.levels, names=d.index.names)
).groupby(level='item').ffill().reset_index()
item timePoint value
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
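As a quick sanity check (not part of the original answer), the reindexed result should contain exactly one row per (item, timePoint) combination:
result = (d.reindex(pd.MultiIndex.from_product(d.index.levels, names=d.index.names))
            .groupby(level='item').ffill().reset_index())
assert len(result) == df['item'].nunique() * df['timePoint'].nunique()  # 7 * 3 == 21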
I think unstack followed by stack will achieve the format; then we use groupby with ffill to fill the NaN values forward:
s=df.set_index(['item','timePoint']).value.unstack().stack(dropna=False)
s.groupby(level=0).ffill().reset_index()
Out[508]:
item timePoint 0
0 1 1 2.0
1 1 2 2.0
2 1 3 4.0
3 2 1 4.0
4 2 2 4.0
5 2 3 5.0
6 3 1 7.0
7 3 2 5.0
8 3 3 3.0
9 4 1 6.0
10 4 2 9.0
11 4 3 9.0
12 5 1 NaN
13 5 2 3.0
14 5 3 3.0
15 6 1 NaN
16 6 2 2.0
17 6 3 2.0
18 7 1 NaN
19 7 2 NaN
20 7 3 1.0
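One small follow-up (an assumption about the desired column name, not in the original answer): after unstack/stack the Series loses its name, which is why the value column prints as 0. You can restore a name when resetting the index:
s.groupby(level=0).ffill().reset_index(name='value')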

Create a new column using a shift within grouped values

I want to create a new column which is the result of a shift applied to grouped values.
df = pd.DataFrame({'X': [0,1,0,1,0,1,0,1], 'Y':[2,4,3,1,2,3,4,5]})
df
X Y
0 0 2
1 1 4
2 0 3
3 1 1
4 0 2
5 1 3
6 0 4
7 1 5
def func(x):
    x['Z'] = test['Y'] - test['Y'].shift(1)
    return x
df_new = df.groupby('X').apply(func)
X Y Z
0 0 2 NaN
1 1 4 2.0
2 0 3 -1.0
3 1 1 -2.0
4 0 2 1.0
5 1 3 1.0
6 0 4 1.0
7 1 5 1.0
As you can see from the output, the values are shifted sequentially without accounting for the groups.
I have seen a similar question, but I could not figure out why it does not work as expected.
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation
The values are shifted without accounting for the groups because your func uses test (presumably some other object, likely another name for what you call df) directly instead of simply the group x.
def func(x):
    x['Z'] = x['Y'] - x['Y'].shift(1)
    return x
gives me
In [8]: df_new
Out[8]:
X Y Z
0 0 2 NaN
1 1 4 NaN
2 0 3 1.0
3 1 1 -3.0
4 0 2 -1.0
5 1 3 2.0
6 0 4 2.0
7 1 5 2.0
but note that in this particular case you don't need to write a custom function; you can just call diff on the groupby object directly. (Of course, other functions you might want to use may be more complicated.)
In [13]: df_new["Z2"] = df.groupby("X")["Y"].diff()
In [14]: df_new
Out[14]:
X Y Z Z2
0 0 2 NaN NaN
1 1 4 NaN NaN
2 0 3 1.0 1.0
3 1 1 -3.0 -3.0
4 0 2 -1.0 -1.0
5 1 3 2.0 2.0
6 0 4 2.0 2.0
7 1 5 2.0 2.0
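Equivalently (a sketch in the spirit of the question's title, not from the original answer), you can compute the group-wise shift explicitly and subtract; this matches diff, since diff(1) is exactly the difference from the shifted series:
df['Z3'] = df['Y'] - df.groupby('X')['Y'].shift(1)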