Pandas groupby, sum and populate original dataframe - pandas

Here is my original df
import pandas as pd
df_1 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'], 'count': [1,3,4,5]})
color count
blue 1
blue 3
yellow 4
yellow 5
I would like to group by the color column, sum the count column, and then populate the original dataframe with the results. So the final result should look like:
df_2 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'], 'count': [1, 3, 4, 5],
                     'total_per_color': [4, 4, 9, 9]})
color count total_per_color
blue 1 4
blue 3 4
yellow 4 9
yellow 5 9
I can do it with groupby and sum and then merge using pandas, but I wonder if there is a quicker way to do it? In SQL one can achieve this with a window function over a partition; in R I can use dplyr and mutate. Is there something similar in pandas?
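For reference, the merge route mentioned in the question can be sketched like this (totals is a helper name, not from the original):

```python
import pandas as pd

df_1 = pd.DataFrame({'color': ['blue', 'blue', 'yellow', 'yellow'],
                     'count': [1, 3, 4, 5]})

# aggregate per color, then merge the totals back onto every original row
totals = (df_1.groupby('color', as_index=False)['count'].sum()
              .rename(columns={'count': 'total_per_color'}))
df_2 = df_1.merge(totals, on='color')
```

This is the two-step version the question describes; the transform answer below it collapses both steps into one.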

Using transform with groupby
df_1['total_per_color']=df_1.groupby('color')['count'].transform('sum')
df_1
Out[886]:
color count total_per_color
0 blue 1 4
1 blue 3 4
2 yellow 4 9
3 yellow 5 9


How to count the number of occurrences for a histogram using dataframes
d = {'color': ["blue", "green", "yellow", "red, blue", "green, yellow", "yellow, red, blue"],}
df = pd.DataFrame(data=d)
How do you go from
color
blue
green
yellow
red, blue
green, yellow
yellow, red, blue
to
color   occurance
blue    3
green   2
yellow  3
Let's try splitting by the regex ,\s* (a comma followed by zero or more whitespace characters), then explode into rows and value_counts to get the count of each value:
s = (
    df['color'].str.split(r',\s*')
               .explode()
               .value_counts()
               .rename_axis('color')
               .reset_index(name='occurance')
)
Or split with expand=True, then stack:
s = (
    df['color'].str.split(r',\s*', expand=True)
               .stack()
               .value_counts()
               .rename_axis('color')
               .reset_index(name='occurance')
)
s:
color occurance
0 blue 3
1 yellow 3
2 green 2
3 red 2
Here is another way using .str.get_dummies(), which returns a Series of counts indexed by color:
df['color'].str.get_dummies(sep=', ').sum()
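Put together, the get_dummies route can be reshaped into the same two-column format as the other answers (a sketch; the occurance name follows the question):

```python
import pandas as pd

d = {'color': ["blue", "green", "yellow", "red, blue",
               "green, yellow", "yellow, red, blue"]}
df = pd.DataFrame(data=d)

# one indicator column per color, summed down the rows
s = (df['color'].str.get_dummies(sep=', ')
                .sum()
                .rename_axis('color')
                .reset_index(name='occurance'))
```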

Scatter plot derived from two pandas dataframes with multiple columns in plotly [express]

I want to create a scatter plot that derives its x values from one dataframe and its y values from another dataframe, each having multiple columns.
x_df :
red blue
0 1 2
1 2 3
2 3 4
y_df:
red blue
0 1 2
1 2 3
2 3 4
I would like to have two traces, red and blue, such that the x values come from x_df and the y values are derived from y_df.
At some layer you need to do the data integration; IMHO it is better done at the data layer, i.e. in pandas.
I have modified your sample data so the two traces do not overlap.
I used join(), assuming the index of the dataframes is the join key.
The dataframe could be restructured further; instead I generated multiple traces with plotly express, modifying them as required so the colors and legends are created.
Axis labels are not considered here.
import io
import pandas as pd
import plotly.express as px

x_df = pd.read_csv(io.StringIO(""" red blue
0 1 2
1 2 3
2 3 4"""), sep=r"\s+")
y_df = pd.read_csv(io.StringIO(""" red blue
0 1.1 2.2
1 2.1 3.2
2 3.1 4.2"""), sep=r"\s+")
df = x_df.join(y_df, lsuffix="_x", rsuffix="_y")
px.scatter(df, x="red_x", y="red_y").update_traces(
    marker={"color": "red"}, name="red", showlegend=True
).add_traces(
    px.scatter(df, x="blue_x", y="blue_y")
    .update_traces(marker={"color": "blue"}, name="blue", showlegend=True)
    .data
)
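An alternative sketch: reshape to long form first, so a single px.scatter call draws both traces and builds the legend itself (the column names x, y, and trace are assumptions):

```python
import pandas as pd

x_df = pd.DataFrame({'red': [1, 2, 3], 'blue': [2, 3, 4]})
y_df = pd.DataFrame({'red': [1.1, 2.1, 3.1], 'blue': [2.2, 3.2, 4.2]})

# pair up x and y per column, then stack the columns into one long frame
long_df = pd.concat(
    {c: pd.DataFrame({'x': x_df[c], 'y': y_df[c]}) for c in x_df.columns},
    names=['trace', None],
).reset_index(level='trace').reset_index(drop=True)

# then a single call handles colors and legend:
# import plotly.express as px
# px.scatter(long_df, x='x', y='y', color='trace',
#            color_discrete_map={'red': 'red', 'blue': 'blue'})
```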

Re-define dataframe index with map function

I have a dataframe like this. I want to know how I can apply a map function to its index and rename it into a simpler format.
df = pd.DataFrame({'d': [1, 2, 3, 4]}, index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])
df
d
apple_017 1
orange_054 2
orange_061 3
orange_053 4
There are only two labels in the index of the dataframe, so it's either apple or orange in this case, and this is how I tried:
data.index = data.index.map(i = "apple" if "apple" in i else "orange")
(Apparently it's not how it works)
Desired output:
d
apple 1
orange 2
orange 3
orange 4
Appreciate anyone's help and suggestion!
Try via split():
df.index=df.index.str.split('_').str[0]
OR
via map():
df.index=df.index.map(lambda x:'apple' if 'apple' in x else 'orange')
output of df:
d
apple 1
orange 2
orange 3
orange 4
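A third option, assuming the suffix is always an underscore followed by digits, strips it with a regex (a sketch, equivalent to the split answer above for this data):

```python
import pandas as pd

df = pd.DataFrame({'d': [1, 2, 3, 4]},
                  index=['apple_017', 'orange_054', 'orange_061', 'orange_053'])

# drop a trailing "_<digits>" suffix from every index label
df.index = df.index.str.replace(r'_\d+$', '', regex=True)
```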

Reset the categories of categorical index in Pandas

I have a dataframe with a column being categorical.
I remove all the rows having one of the categories.
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': np.random.choice(['Blue', 'Green', 'Brown', 'Red'], 50)})
df.color = df.color.astype('category')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
Remove Brown from dataframe and category.
df = df.query('color != "Brown"')
df.color = df.color.cat.remove_categories('Brown')
df.color.head()
Output:
0 Blue
1 Green
2 Blue
3 Green
7 Red
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
How can I make sure the resulting dataframe has only those categories that exist and does not keep the deleted categories in its index?
There's (now?) a pandas function doing exactly that: remove_unused_categories
This function only has one parameter, inplace, which has been deprecated since pandas 1.2.0. Hence, the following example (based on Scott's answer) does not use inplace:
>>> df = pd.DataFrame({'color':np.random.choice(['Blue','Green','Brown','Red'], 50)})
... df.color = df.color.astype('category')
... df.color.head()
0 Green
1 Brown
2 Blue
3 Red
4 Brown
Name: color, dtype: category
Categories (4, object): [Blue, Brown, Green, Red]
>>> df = df[df.color != "Brown"]
... df.color = df.color.cat.remove_unused_categories()
... df.color.head()
0 Green
2 Blue
3 Red
5 Red
6 Green
Name: color, dtype: category
Categories (3, object): [Blue, Green, Red]
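A minimal, self-contained sketch (fixed values instead of np.random.choice, so the result is deterministic) showing that the categories actually shrink:

```python
import pandas as pd

s = pd.Series(['Blue', 'Green', 'Brown', 'Red'], dtype='category')
s = s[s != 'Brown']                    # filter out the rows
s = s.cat.remove_unused_categories()   # drop 'Brown' from the categories too
```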

Pandas: expanding_apply with groupby for unique counts of string type

I have the dataframe:
import pandas as pd
id = [0,0,0,0,1,1,1,1]
color = ['red','blue','red','black','blue','red','black','black']
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])
and would like to create a column of the running count of the unique colors grouped by id so that the final dataframe looks like this:
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
I tried this simple way:
import numpy as np

def len_unique(x):
    return len(np.unique(x))

test['expanding_unique_count'] = test.groupby('id')['color'].apply(lambda x: pd.expanding_apply(x, len_unique))
And got ValueError: could not convert string to float: black
If I change the colors to integers:
color = [1,2,1,3,2,1,3,3]
test = pd.DataFrame(zip(id, color), columns = ['id', 'color'])
Then running the same code above produces the desired result. Is there a way for this to work while maintaining the string type for the column color?
It looks like expanding_apply and rolling_apply mainly work on numeric values. Maybe try creating a numeric column that codes the color string as numeric values (this can be done by making the color column categorical), and then use expanding_apply.
# processing
# ===================================
# create numeric label
test['numeric_label'] = pd.Categorical(test['color']).codes
# output: array([2, 1, 2, 0, 1, 2, 0, 0], dtype=int8)
# your expanding function
test['expanding_unique_count'] = test.groupby('id')['numeric_label'].apply(lambda x: pd.expanding_apply(x, len_unique))
# drop the auxiliary column
test.drop('numeric_label', axis=1)
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
Edit:
def func(group):
    return pd.Series(1, index=group.groupby('color').head(1).index).reindex(group.index).fillna(0).cumsum()
test['expanding_unique_count'] = test.groupby('id', group_keys=False).apply(func)
print(test)
id color expanding_unique_count
0 0 red 1
1 0 blue 2
2 0 red 2
3 0 black 3
4 1 blue 1
5 1 red 2
6 1 black 3
7 1 black 3
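On current pandas, where pd.expanding_apply has long been removed, the same running unique count can be sketched with duplicated and cumsum, and the color column can stay a string:

```python
import pandas as pd

test = pd.DataFrame({'id': [0, 0, 0, 0, 1, 1, 1, 1],
                     'color': ['red', 'blue', 'red', 'black',
                               'blue', 'red', 'black', 'black']})

# within each id, a row increments the count only the first time its color appears
test['expanding_unique_count'] = (
    test.groupby('id')['color']
        .transform(lambda s: (~s.duplicated()).cumsum())
)
```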