Python 3.9 - Jupyter - Pandas - DataFrame: Missing group-by column after aggregating

I have a data frame that I .groupby() and .agg(), and the data aggregates successfully. However, the label column, in this case Year, is no longer a key that can be referenced for plotting the data, even though the Year label is visible when printing the data frame.
The code that aggregates:
df_grp_by_year = df_grp_by_year.groupby(
    df_grp_by_year['Year']).agg({
        'Avg OHCL': my_stats.get_arithmetic_mean,
        'Low': min,
        'High': max,
        'Med': my_stats.get_median,                 # statistics.median
        'Var': my_stats.get_variance,               # statistics.pvariance
        'Std': my_stats.get_standard_deviation      # statistics.pstdev
    })
The printed output:
The code that fails and associated error:
plot.plot(
    df_grp_by_year['Year'],
    df_grp_by_year['Avg OHCL'])
KeyError: 'Year'
What is the work-around for this?

After the groupby and agg, the column Year has become the index of the dataframe instead of a regular column, so it can no longer be referenced as a column; hence the KeyError. For the plot you can refer to the index, or simply leave it out, because the index is plotted on the x-axis by default.
plot.plot(df_grp_by_year.index, df_grp_by_year['Avg OHCL'])
or
plot.plot(df_grp_by_year['Avg OHCL'])
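If you prefer to keep Year as a regular column instead of the index, two common alternatives are resetting the index after aggregating, or grouping by the column name with as_index=False. A minimal sketch, assuming plot is matplotlib.pyplot and df is the original, un-aggregated frame from the question:
# Option 1: turn the 'Year' index back into a column after aggregating
df_plot = df_grp_by_year.reset_index()
plot.plot(df_plot['Year'], df_plot['Avg OHCL'])

# Option 2: keep 'Year' as a column from the start
# df_grp_by_year = df.groupby('Year', as_index=False).agg({...})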

Related

Groupby returns the previous df without changing it

df = pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day', 'time'])
df_1.head()
What am I missing here? It returns the previous dataframe to me without the groupby applied.
We can print each group using the following:
df_1 = df.groupby(['day', 'time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Calling head on the grouped dataframe returns the first 5 rows of each group in their original order, so the output can look just like the original dataframe; that is how the groupby object is built. You can use #tlentali's approach to print out each group, but df_1 will not be assigned the grouped dataframe that way; instead it holds None (once per group), because that is the output of print.
The approach below gives a lot of control over how to show/display the groups and their keys, and it might also help you understand how the grouped dataframe structure in pandas works.
df_1 = df.groupby(['day', 'time'])

# for each (day, time) key and its grouped data
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # display the head of the grouped data
    print(group.head())
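If you only need one specific group rather than all of them, get_group is simpler. A minimal sketch; the ('Sun', 'Dinner') key is an assumption about the values present in this tips dataset, so substitute a (day, time) pair that actually occurs in yours:
# pull out a single group by its (day, time) key
sunday_dinner = df_1.get_group(('Sun', 'Dinner'))
print(sunday_dinner.head())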

How to select a value in a dataframe with MultiIndex?

I use the pandas library to analyze data coming from an Excel file.
I used pivot_table to get a pivot table with the information I'm interested in, and I end up with a DataFrame that has a MultiIndex.
For "OPE-2016-0001", I would like to obtain the figures for 2017, for example. I've tried lots of things and nothing works. What is the correct method to use? Thank you.
import pandas as pd
import numpy as np
from math import *
import tkinter as tk

pd.set_option('display.expand_frame_repr', False)
df = pd.read_csv('datas.csv')

def tcd_op_dataExcercice():
    global df
    new_df = df.assign(Occurence=1)
    tcd = new_df.pivot_table(index=['Numéro opération',
                                    'Libellé opération'],
                             columns=['Exercice'],
                             values=['Occurence'],
                             aggfunc=[np.sum],
                             margins=True,
                             fill_value=0,
                             margins_name='Total')
    print(tcd)
    print(tcd.xs('ALSTOM 8', level='Libellé opération', drop_level=False))

tcd_op_dataExcercice()
I get the following table (image).
How do I get the value framed in red?
You can use .loc to select rows by a DataFrame's Index labels. If the Index is a MultiIndex, it will index into the first level of the MultiIndex ('Numéro opération' in your case), though you can pass a tuple to index into both levels (e.g. if you specifically wanted ("OPE-2016-0001", "ALSTOM 8")).
It's worth noting that the columns of your pivoted data are also a MultiIndex, because you specified aggfunc, values and columns as lists rather than individual values (i.e. without the []). Pandas creates a MultiIndex because of these lists, even though each had only one element.
So you'll also need to pass a tuple to index into the columns to get the value for 2017:
tcd.loc["OPE-2016-0001", ('sum', 'Occurence', 2017)]
If you had instead just specified the aggfunc etc as individual strings, the columns would just be the years and you could select the values by:
tcd.loc["OPE-2016-0001", 2017]
Or if you specifically wanted the value for ALSTOM 8:
tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]
An alternative to indexing into a MultiIndex is to just .reset_index() after pivoting, in which case the levels of the MultiIndex become ordinary columns in the data, and you can then select rows based on the values of those columns. E.g. (assuming you specified aggfunc etc. as strings; note that column names containing spaces must be wrapped in backticks inside query):
tcd = tcd.reset_index()
tcd.query("`Numéro opération` == 'OPE-2016-0001'")[2017]
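Equivalently, a plain boolean mask avoids query's quoting rules altogether. A minimal sketch under the same assumption (aggfunc etc. passed as plain strings, so the pivoted columns are just the years), run after the same .reset_index() as above:
# select the row(s) for this operation, then the 2017 column
mask = tcd['Numéro opération'] == 'OPE-2016-0001'
tcd.loc[mask, 2017]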

Find the average of a column based on another Pandas?

I'm working in a Jupyter notebook, and I would like to get the average 'pcnt_change' based on 'day_of_week'. How do I do this?
A simple groupby call would do the trick here.
If df is the pandas dataframe:
df.groupby('day_of_week').mean()
would return a dataframe with the average of all numeric columns in the dataframe, with day_of_week as the index. If you want only certain column(s) to be returned, select only the needed columns before the groupby call, e.g.:
df[['open_price', 'high_price', 'day_of_week']].groupby('day_of_week').mean()
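If all you need is the single 'pcnt_change' average, selecting that one column after the groupby returns a Series keyed by day_of_week. A minimal sketch, assuming df is the notebook's dataframe with those column names:
# mean of one column per day_of_week, returned as a Series
avg_pcnt_change = df.groupby('day_of_week')['pcnt_change'].mean()
print(avg_pcnt_change)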

Plotly Bar Chart Based on Pandas dataframe grouped by year

I have a pandas dataframe that I've tried to group by year on 'Close Date' and then plot 'ARR (USD)' on the y-axis against the year on the x-axis.
All seems fine after grouping:
sumyr = brandarr.groupby(brandarr['Close Date'].dt.year,as_index=True).sum()
              ARR (USD)
Close Date
2017        17121174.33
2018        15383130.32
But when I try to plot:
trace = [go.Bar(
    x=sumyr['Close Date'],
    y=sumyr['ARR (USD)']
)]
I get the error: KeyError: 'Close Date'
I'm sure it's something stupid, I'm a newbie, but I've been messing with it for an hour and well, here I am. Thanks!
In your groupby call you used as_index=True, so Close Date has become the index of sumyr rather than a column, which is why sumyr['Close Date'] raises a KeyError. If you want to select rows by that index, use pandas .loc or .iloc.
To access the index values directly, use:
sumyr.index.tolist()
Check here: Pandas - how to get the data frame index as an array
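Putting it together for the plot itself, either pass the index as x, or reset the index so Close Date is a column again. A minimal sketch, assuming go is plotly.graph_objects and sumyr is the grouped frame from above:
import plotly.graph_objects as go

# x comes straight from the index (the grouped years), y from the aggregated column
fig = go.Figure(go.Bar(x=sumyr.index, y=sumyr['ARR (USD)']))
fig.show()

# alternatively, keep 'Close Date' as a column by resetting the index first
# sumyr = sumyr.reset_index()
# fig = go.Figure(go.Bar(x=sumyr['Close Date'], y=sumyr['ARR (USD)']))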

pandas DataFrame.corr returns a one-by-one DF

I can't figure out what is wrong in the way I use the df.corr() function.
For a DF with 2 columns it returns only a 1×1 resulting DF.
In:
merged_df[['Citable per Capita','Citations']].corr()
Out:
(image: a 1×1 correlation matrix)
What can be the problem here? I expected to see as many rows and columns as there were columns in the original DF.
I found the problem: it was the wrong dtype of the first column's values.
To change the type of all the columns, use:
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))
Note that apply creates a copy of df, which is why the reassignment is necessary here.
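To see why the column was being dropped and to fix it more explicitly, you can convert just the offending column and check the dtypes before calling corr() again. A minimal sketch, assuming the column names from the question; errors='coerce' turns unparseable values into NaN:
# convert the text column to numbers; values that can't be parsed become NaN
merged_df['Citable per Capita'] = pd.to_numeric(
    merged_df['Citable per Capita'], errors='coerce')

# confirm both columns are numeric now
print(merged_df[['Citable per Capita', 'Citations']].dtypes)

# corr() only uses numeric columns, so this should now be a 2x2 matrix
merged_df[['Citable per Capita', 'Citations']].corr()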