How to group by multiple columns on a pandas series

The pandas.Series groupby method makes it possible to group by another series, for example:
import pandas as pd
data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])
grade.groupby(df['age']).mean()
However, this approach does not work for a groupby using two columns:
grade.groupby(df[['age','gender']])
ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
In the example, it is easy to add the column to the dataframe and get the desired result as follows:
df['grade'] = grade
y = df.groupby(['gender','age']).mean()
y.to_dict()
{'grade': {('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}}
But that can get quite ugly in real-life situations. Is there any way to do this groupby on multiple columns directly on the series?

Since I don't know of any direct way to solve the problem, I've made a function that creates a temporary table and performs the groupby on it.
def pd_groupby(series, group_obj):
    df = pd.DataFrame(group_obj).copy()
    groupby_columns = list(df.columns)
    df[series.name] = series
    return df.groupby(groupby_columns)[series.name]
Here, group_obj can be a pandas Series or a pandas DataFrame. Starting from the sample code, the desired result can be achieved by:
y = pd_groupby(grade,df[['gender','age']]).mean()
y.to_dict()
{('Female', 20): 7.0, ('Male', 20): 4.5, ('Male', 21): 6.0}
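For what it's worth, Series.groupby also accepts a list of Series as grouping keys, which may avoid the temporary table altogether. A minimal sketch using the sample data from the question, assuming the key Series share the index of grade (the name result is only illustrative):

```python
import pandas as pd

data = {'gender': ['Male', 'Male', 'Female', 'Male'], 'age': [20, 21, 20, 20]}
df = pd.DataFrame(data)
grade = pd.Series([5, 6, 7, 4])

# Pass one key Series per grouping column; they are aligned with grade by index.
result = grade.groupby([df['gender'], df['age']]).mean()
print(result.to_dict())
```

The result is a Series with a (gender, age) MultiIndex, so to_dict() yields tuple keys just like the helper function above.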

Related

using regex in pivot_longer to unpivot multiple sets of columns with common grouping variable

Follow-up from my last question:
pyjanitor pivot_longer multiple sets of columns with common grouping variable and id column
In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).
Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.
The dataset
df = pd.DataFrame({
    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate': [3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})
Desired output
The header of my desired output is the following:
id  category  rating  estimate  type
1   first     1.0     1.2       'red'

I think the easiest way would be to align the columns you have with what was used in the previous question, something like:
def fix_col_header(s, d):
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            # the ordinal was a prefix: move it to the end of the name
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s
d = {"first": "_first", "second": "_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]
This will give the columns:
id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third
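Since the question asks about regex, the same renaming could also be sketched with the re module. This is only an illustration: the helper name to_measure_ordinal and the ORDINALS pattern are made up here, and it assumes every ordinal is one of first, second or third and sits at the very start or very end of the column name.

```python
import re

# Hypothetical list of ordinals; extend it to match the real column names.
ORDINALS = "first|second|third"

def to_measure_ordinal(col):
    """Rewrite 'firstestimate' or 'ratingfirst' as '<measure>_<ordinal>'."""
    # Ordinal used as a prefix, e.g. 'firstestimate' -> 'estimate_first'
    m = re.fullmatch(rf"({ORDINALS})(.+)", col)
    if m:
        return f"{m.group(2)}_{m.group(1)}"
    # Ordinal used as a suffix, e.g. 'ratingfirst' -> 'rating_first'
    m = re.fullmatch(rf"(.+?)({ORDINALS})", col)
    if m:
        return f"{m.group(1)}_{m.group(2)}"
    # Anything else (such as 'id') is left untouched
    return col

columns = ['id', 'ratingfirst', 'ratingsecond', 'ratingthird',
           'firstestimate', 'secondestimate', 'thirdestimate',
           'firsttype', 'secondtype', 'thirdtype']
renamed = [to_measure_ordinal(col) for col in columns]
print(renamed)
```

Assigning renamed to df.columns should give the same headers as listed above.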
Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:
import janitor
(df
    .pivot_longer(
        column_names="*_*",
        names_to=(".value", "category"),
        names_sep="_")
)

Fastest way to locate rows of a dataframe from two lists and concatenate them?

Apologies if this has already been asked, I haven't found anything specific enough although this does seem like a general question. Anyways, I have two lists of values which correspond to values in a dataframe, and I need to pull those rows which contain those values and make them into another dataframe. The code I have works, but it seems quite slow (14 seconds per 250 items). Is there a smart way to speed it up?
row_list = []
for i, x in enumerate(datetime_list):
    row_list.append(df.loc[(df["datetimes"] == x) & (df["b"] == b_list[i])])
data = pd.concat(row_list)
Edit: Sorry for the vagueness @anky, here's an example dataframe:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'datetimes' : [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
'b' : [0, 1, 2],
'c' : [500, 600, 700]})

IIUC, try this:
dfi = df.set_index(['datetimes', 'b'])
data = dfi.loc[list(zip(datetime_list, b_list)), :].reset_index()
Without test data in the question, it is hard to verify whether this is correct.
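Another option that avoids the Python loop is an inner merge against a small frame built from the two lists. A rough sketch using the example dataframe from the question; the contents of datetime_list and b_list are made up here, and keys is just an illustrative name:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'datetimes': [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 3), datetime(2020, 6, 14, 4)],
                   'b': [0, 1, 2],
                   'c': [500, 600, 700]})

# Assumed lookup lists: the i-th datetime is paired with the i-th b value.
datetime_list = [datetime(2020, 6, 14, 2), datetime(2020, 6, 14, 4)]
b_list = [0, 2]

# One row per (datetime, b) pair to look up
keys = pd.DataFrame({'datetimes': datetime_list, 'b': b_list})

# Keep only the rows of df whose (datetimes, b) pair appears in keys
data = df.merge(keys, on=['datetimes', 'b'], how='inner')
print(data)
```

Pairs with no matching row are silently dropped, and a pair listed twice returns its row twice, which mirrors what the loop in the question does.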

I need to return a value from a DataFrame cell as a variable, not a Series

I have the following issue:
when I use the .loc function, it returns a Series rather than a single value with no index.
I need to do some math operations with the selected cells. The code I am using is:
import pandas as pd
data = [[82,1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns = ['Ah-Step', 'State'])
df['Ah-Step'].loc[df['State']==2]+ df['Ah-Step'].loc[df['State']==3]

.values[0] will do what OP wants.
Assuming one wants to obtain the value 30, the following will do the work:
df.loc[df['State'] == 2, 'Ah-Step'].values[0]
[Out]: 30.0
So, in OP's specific case, the operation 30+3.7 could be done as follows
df.loc[df['State'] == 2, 'Ah-Step'].values[0] + df['Ah-Step'].loc[df['State']==3].values[0]
[Out]: 33.7
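As a possible alternative to .values[0], Series.item() and .iloc[0] also return a scalar. A small sketch with the data from the question (step_2, step_3 and total are illustrative names); note that .item() raises a ValueError unless the selection holds exactly one element, which can be a useful safety check:

```python
import pandas as pd

data = [[82, 1], [30, 2], [3.7, 3]]
df = pd.DataFrame(data, columns=['Ah-Step', 'State'])

# .item() returns the single element as a Python scalar
step_2 = df.loc[df['State'] == 2, 'Ah-Step'].item()

# .iloc[0] takes the first matching element, even if several rows match
step_3 = df.loc[df['State'] == 3, 'Ah-Step'].iloc[0]

total = step_2 + step_3
print(total)
```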

How to use reindex to fill in missing timesteps?

I'm trying to test the example given in the docs that fills in missing timesteps:
import numpy as np
import pandas as pd
date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
#show how many rows are in the fragmented dataframe
print(df2.shape)
df2.reindex(date_index2)
#show how many rows after reindexing
print(df2.shape)
But running this code shows that no rows were added. What am I missing here?

reindex does not modify the DataFrame in place; it returns a new object, so the result has to be assigned back. You can do
print(df2.shape)
# assign back
df2 = df2.reindex(date_index2)
print(df2.shape)
Output:
(6, 1)
(10, 1)
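If the goal is also to fill the newly added timesteps, reindex takes a fill_value argument. A short sketch with the same data (filled is an illustrative name); as far as I know, fill_value only applies to rows introduced by the new index, so the NaN that was already present on 2010-01-03 is left alone:

```python
import numpy as np
import pandas as pd

date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')

# reindex returns a new frame; rows for dates missing from df2 get fill_value
filled = df2.reindex(date_index2, fill_value=0)

print(df2.shape)     # the original frame is unchanged
print(filled.shape)  # one row per date in date_index2
print(filled)
```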

Seaborn groupby pandas Series

I want to visualize my data as box plots grouped by another variable, as shown here in my terrible drawing:
To do this, I use a pandas Series to indicate which group each value belongs to:
import pandas as pd
import seaborn as sns
#example data for reproducibility
a = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
])
#converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
#Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, what I would have expected to get was to have two boxplots each describing only the first column, grouped by their corresponding column in the second column (the column converted to Series), while the above plot shows each column separately which is not what I want.

A column in a DataFrame is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
#example data for reproducibility
df = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit; giving the columns labels makes it a bit clearer in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your DataFrame looks like this:
    a   b  grouper
0   2   5        1
1   4   9        2
2   5   3        1
3  10   6        2
4   9   7        2
5   3  11        1
And you want boxplots for columns a and b, grouped by the column grouper, you should flatten the columns and change the groupby column to contain values like a1, a2, b1 etc.
Here is a crude way which I think should work, given the DataFrame shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the DataFrame. The flattening of the hierarchy after pivoting is especially hard to read; I don't like it.

This is a new answer to an old question, because seaborn and pandas have changed through version updates. Because of these changes, Rutger's answer no longer works.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the log:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1] ,[4, 2],[5, 1],
[10, 2],[9, 2],[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn with x and y as parameter
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
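A possible variation on Example 2 is to keep the grouper column as it is and let seaborn separate the groups through its hue parameter instead of joining the two columns into one label. The sketch below only builds the long-form frame with pandas, relying on the default melt column names variable and value:

```python
import pandas as pd

df = pd.DataFrame([[2, 5, 1], [4, 9, 2], [5, 3, 1],
                   [10, 6, 2], [9, 7, 2], [3, 11, 1]
                   ], columns=['a', 'b', 'grouper'])

# Long form: one row per (grouper, original column, value) combination
df_long = pd.melt(df, id_vars='grouper')
print(df_long.head())
```

Assuming seaborn is installed, this frame could then be plotted with sns.boxplot(x='variable', y='value', hue='grouper', data=df_long), which should draw the boxes for each grouper value side by side within a and b.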
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data: pd.DataFrame, col: str) -> pd.DataFrame:
    '''This function takes a DataFrame, groups by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    # group the DataFrame passed in as `data`, not a global df
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.

sns.boxplot() does not take a groupby argument.
You will probably see:
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
A better approach is to group the data first and pass the grouped DataFrame to boxplot:
import seaborn as sns
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg('sum').reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here, column B contains numeric values and the grouping is done on column A. The values of B are summed within each group, and the box plot is drawn from the result. Hope this helps.
