Scatter plot of Multiindex GroupBy() - pandas

I'm trying to make a scatter plot of a GroupBy() with Multiindex (http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-with-multiindex). That is, I want to plot one of the labels on the x-axis, another label on the y-axis, and the mean() as the size of each point.
df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean() returns:
Sigma_ang  Epsilon_K
3.4        30           0.647000
           40           0.602071
           50           0.619786
3.6        30           0.646538
           40           0.591833
           50           0.607769
3.8        30           0.616833
           40           0.590714
           50           0.578364
Name: RMSD, dtype: float64
And I'd like to plot something like: plt.scatter(x=Sigma, y=Epsilon, s=RMSD)
What's the best way to do this? I'm having trouble getting the proper Sigma and Epsilon values for each RMSD value.

+1 to Vaishali Garg. Based on his comment, the following works:
df_mean = df['RMSD'].groupby([df['Sigma'],df['Epsilon']]).mean().reset_index()
plt.scatter(df_mean['Sigma'], df_mean['Epsilon'], s=100.*df_mean['RMSD'])
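One thing to watch with the line above: the group means only range from about 0.58 to 0.65, so multiplying by 100 gives markers of nearly the same size. A minimal sketch of one way around that, assuming a linear min-max rescale is acceptable, is below. The helper name marker_sizes and the 20 to 400 size range are my own choices, not from the original answer; the numbers are copied from the question's output.

```python
import pandas as pd

# Group means copied from the question's output
df_mean = pd.DataFrame({
    "Sigma": [3.4, 3.4, 3.4, 3.6, 3.6, 3.6, 3.8, 3.8, 3.8],
    "Epsilon": [30, 40, 50] * 3,
    "RMSD": [0.647000, 0.602071, 0.619786, 0.646538, 0.591833,
             0.607769, 0.616833, 0.590714, 0.578364],
})


def marker_sizes(values, smin=20.0, smax=400.0):
    """Linearly map values onto [smin, smax] (hypothetical helper).

    plt.scatter interprets `s` as marker area in points**2.
    """
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All values equal: fall back to a single mid-range size
        return pd.Series((smin + smax) / 2.0, index=values.index)
    return smin + (values - lo) / (hi - lo) * (smax - smin)


sizes = marker_sizes(df_mean["RMSD"])
```

Passing s=sizes to plt.scatter in place of s=100.*df_mean['RMSD'] should then make the differences between groups visible.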

Related

Annotate text in facetgrid of sns relplot in python

Using the following data frame (utilities):
   Security_Name          Rating  Duracion   Spread
0  COLBUN 3.95 10/11/27   BBB      6.135749  132
1  ENELGX 4 1/4 04/15/24  BBB+     3.197206  124
2  PROMIG 3 3/4 10/16/29  BBB-     7.628048  243
3  IENOVA 4 3/4 01/15/51  BBB     15.911632  364
4  KALLPA 4 7/8 05/24/26  BBB-     4.792474  241
5  TGPERU 4 1/4 04/30/28  BBB+     4.935607  130
I am trying to create a seaborn relplot that annotates the scatter plot points in each facet of the grid. However, the output I get looks something like this (without the annotations):
relplot
I can't see any annotation in any plot
I have tried the following code:
sns.relplot(x="Duracion", y="Spread", col="Rating", data=utilities)
I really don't know where to start to add the annotations to this relplot using FacetGrid. The annotations should be the values of the Security_Name column.
Please advise on the modifications. Thanks in advance.
Using FacetGrid and a custom annotation function, you can get the desired result. Note that there is a good chance the annotation will overlap given the example dataframe provided:
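import matplotlib.pyplot as plt
import seaborn as sns

def annotate_points(x, y, t, **kwargs):
    # map_dataframe passes each facet's subset of rows as the 'data' keyword
    ax = plt.gca()
    data = kwargs.pop('data')
    for i, row in data.iterrows():
        ax.annotate(row[t], xy=(row[x], row[y]))

# 'utilities' is the data frame from the question
g = sns.FacetGrid(col="Rating", data=utilities)
g.map(sns.scatterplot, "Duracion", "Spread")
g.map_dataframe(annotate_points, "Duracion", "Spread", 'Security_Name')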
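On the overlap point: a rough sketch, using plain matplotlib subplots rather than FacetGrid (so this is a different technique from the answer above, not a drop-in replacement), that nudges each label a few points away from its marker and shrinks the font. The 5-point offset and font size 8 are arbitrary assumptions; the rows are copied from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the question's data frame
utilities = pd.DataFrame({
    "Security_Name": ["COLBUN 3.95 10/11/27", "ENELGX 4 1/4 04/15/24",
                      "PROMIG 3 3/4 10/16/29", "IENOVA 4 3/4 01/15/51",
                      "KALLPA 4 7/8 05/24/26", "TGPERU 4 1/4 04/30/28"],
    "Rating": ["BBB", "BBB+", "BBB-", "BBB", "BBB-", "BBB+"],
    "Duracion": [6.135749, 3.197206, 7.628048, 15.911632, 4.792474, 4.935607],
    "Spread": [132, 124, 243, 364, 241, 130],
})

# One subplot per rating, sharing the y-axis like a facet row
groups = list(utilities.groupby("Rating"))
fig, axes = plt.subplots(1, len(groups), figsize=(4 * len(groups), 4), sharey=True)
for ax, (rating, sub) in zip(axes, groups):
    ax.scatter(sub["Duracion"], sub["Spread"])
    ax.set_title(f"Rating = {rating}")
    ax.set_xlabel("Duracion")
    for _, row in sub.iterrows():
        # Place each label 5 points up and to the right of its marker
        ax.annotate(row["Security_Name"], xy=(row["Duracion"], row["Spread"]),
                    xytext=(5, 5), textcoords="offset points", fontsize=8)
axes[0].set_ylabel("Spread")
```

The same xytext and textcoords arguments could presumably be passed to ax.annotate inside annotate_points if you want to stay with FacetGrid.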

Pandas, approximating a bar plot for large dataframes

I have a dataframe (around 10k rows) of the following form:
id | voted
123 1.0
12 0.0
215 1.0
362 0.0
...
And I want to bar plot this and look at where the values are mostly 0.0 and where they are mostly 1.0. (the order of indices in the first column is essential, as the dataframe is sorted).
I tried doing a bar plot, but even if I restrict myself to a small subset of the dataframe, the plot is still not readable:
Is there a way to approximate areas that are mostly 1.0 with a single thicker bar, such as we do for histograms, when we set the bins to a higher and lower number?
Since you are looking for an interval approximation of the density of the votes, you could add a moving average:
df['ma'] = df['voted'].rolling(5).mean()
This gives you a running average that you can plot against the index as a line graph. Where the value is close to 1, you know you have a group of ids that mostly voted 1.0.
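If you would rather have the histogram-like thick bars the question asks about, one option is to cut the sorted rows into consecutive fixed-size bins and take the mean of voted in each. This is only a sketch: the data below is made up for illustration, and bin_size is an assumed parameter you would tune for 10k rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 sorted rows, mostly 0.0 first and mostly 1.0 later
df = pd.DataFrame({"voted": [0.0] * 8 + [1.0, 0.0] + [1.0] * 10})

bin_size = 5  # assumed number of consecutive rows per bar

# Label rows by position: 0,0,0,0,0,1,1,1,1,1,2,...
bins = np.arange(len(df)) // bin_size

# Share of 1.0 votes within each block of consecutive rows
share = df["voted"].groupby(bins).mean()
```

share then has one value per block, and something like share.plot.bar(width=1.0) should draw one thick bar per block whose height is the fraction of 1.0 votes in it.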

How to plot the number of unique values in each column in pandas dataframe as bar plot?

I want to plot the count of unique values per column for specific columns of my dataframe.
So if my dataframe has four columns 'col_a', 'col_b' , 'col_c' and 'col_d', and two ('col_a', 'col_b') of them are categorical features, I want to have a bar plot having 'col_a' and 'col_b' in the x-axis, and the count of unique values in 'col_a' and number of unique values in 'col_b' in the y-axis.
PS: I don't want to plot the count of each unique value in a specific column.
Actually, how to bar plot this with python?
properties_no_na.nunique()
Which returns:
neighborhood 51
block 6805
lot 1105
zip_code 41
residential_units 210
commercial_units 48
total_units 215
land_sqft_thousands 6192
gross_sqft_thousands 8469
year_built 170
tax_class_at_sale 4
building_class_at_sale 156
sale_price_millions 14135
sale_date 4440
sale_month 12
sale_year 15
dtype: int64
How would that be possible? If possible with Seaborn?
nunique() returns a pandas.Series. Convert it to a pandas.DataFrame with reset_index() and pass it to seaborn.
nu = properties_no_na.nunique().reset_index()
nu.columns = ['feature','nunique']
ax = sns.barplot(x='feature', y='nunique', data=nu)
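Since the question only wants some of the columns, the same idea can be applied to a column subset before calling nunique(). A small sketch, assuming the column names from the question (col_a to col_d) and made-up values:

```python
import pandas as pd

# Hypothetical frame using the column names from the question
df = pd.DataFrame({
    "col_a": ["x", "y", "x", "z"],
    "col_b": ["p", "p", "q", "q"],
    "col_c": [1.0, 2.0, 3.0, 4.0],
    "col_d": [10, 20, 30, 40],
})

# Count distinct values only in the categorical columns of interest
nu = (
    df[["col_a", "col_b"]]
    .nunique()
    .rename_axis("feature")        # name the index so it becomes a column
    .reset_index(name="nunique")   # Series -> two-column DataFrame
)
```

nu can then be passed to sns.barplot(x='feature', y='nunique', data=nu) as above.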
sns.displot(x=df.column_name1,col=df.column_name2,kde=True)
Note: sns is the conventional alias for the seaborn library.
The x-axis is always column_name1, and column_name2 is used for faceting, so this code gives you one distribution plot per unique value in column_name2.

Getting an error while plotting sum and average in pandas

total_income_language = pd.DataFrame(df.groupby('language')['gross'].sum())
average_income_language = pd.DataFrame(df.groupby('language')['gross'].mean())
plt.bar(total_income_language.index, total_income_language["gross"],
label="Total Income of Language")
plt.bar(average_income_language.index, average_income_language["gross"],
label="Average Income of Language")
plt.xlabel("Language")
plt.ylabel("Log Dollar Values(Gross)")
I want to plot the sum and the average for each language. I'm not sure my code does what I want, and I'm getting an error when I try to plot this. I'm not sure where I went wrong in the code. I need some assistance.
Here's the error message
You can use groupby with aggregation via agg, rename the columns with a dict, and plot with DataFrame.plot.bar.
Finally, set the labels with ax.set.
df = pd.DataFrame({'language':['en','de','en','de','sk','sk'],
'gross':[10,20,30,40,50,60]})
print (df)
   gross language
0     10       en
1     20       de
2     30       en
3     40       de
4     50       sk
5     60       sk
d = {'mean':'Average Income of Language','sum':'Total Income of Language'}
df1 = df.groupby('language')['gross'].agg(['sum','mean']).rename(columns=d)
print (df1)
          Total Income of Language  Average Income of Language
language
de                              60                          30
en                              40                          20
sk                             110                          55
ax = df1.plot.bar()
ax.set(xlabel='Language', ylabel='Log Dollar Values(Gross)')
If you want to rotate the x-axis labels:
ax = df1.plot.bar(rot=0)
ax.set(xlabel='Language', ylabel='Log Dollar Values(Gross)')
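As a possible variation, recent pandas versions support named aggregation, which sets the output column names in the agg call itself and avoids the separate rename step. A sketch on the same example frame; the column names total and average are my own placeholders:

```python
import pandas as pd

# Same example frame as in the answer above
df = pd.DataFrame({
    "language": ["en", "de", "en", "de", "sk", "sk"],
    "gross": [10, 20, 30, 40, 50, 60],
})

# Named aggregation: keyword = output column name, value = aggregation
df1 = df.groupby("language")["gross"].agg(total="sum", average="mean")
```

Because sums and means can differ by orders of magnitude on real data, df1.plot.bar(logy=True, rot=0) may be worth trying if a log scale is what the y-label is meant to imply.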
Instead of:
df.groupby('language')['gross'].sum()
Try this:
df.groupby('language').sum()
And similarly with mean(). That should get your code closer to running.
Calling the groupby() method of a DataFrame yields a groupby object upon which you then need to call an aggregation function, like sum, mean, or agg. The groupby documentation is really great: https://pandas.pydata.org/pandas-docs/stable/groupby.html
Also, you might be able to achieve your desired output in two lines:
df.groupby('language').sum().plot(kind='bar')
df.groupby('language').mean().plot(kind='bar')

Pandas series stacked bar chart normalized

I have a pandas series with a multiindex like this:
my_series.head(5)
datetime_publication my_category
2015-03-31 xxx 24
yyy 2
zzz 1
qqq 1
aaa 2
dtype: int64
I am generating a horizontal bar chart using the plot method from pandas with all those stacked categorical values divided by datetime (according to the index hierarchy) like this:
my_series.unstack(level=1).plot.barh(
    stacked=True,
    figsize=(16,6),
    colormap='Paired',
    xlim=(0,10300),
    rot=45
)
plt.legend(
    bbox_to_anchor=(0., 1.02, 1., .102),
    loc=3,
    ncol=5,
    mode="expand",
    borderaxespad=0.
)
However, I am not able to find a way to normalize all those values in the series broken down by datetime_publication, my_category. I would like to have all the horizontal bars of the same length, but right now the length depends on the absolute values in the series.
Is there built-in functionality in pandas to normalize the slices of the series, or some quick function to apply to the series that keeps track of the total taken from the multiindex combination of the levels?
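I'm not aware of a single built-in switch for this, but one common approach is to unstack first and then divide each row by its row total, so every bar sums to 1. A minimal sketch, assuming that is the normalization wanted; the 2015-03-31 values are from the question and the second date is made up so there is more than one bar:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("2015-03-31", "xxx"), ("2015-03-31", "yyy"), ("2015-03-31", "zzz"),
     ("2015-03-31", "qqq"), ("2015-03-31", "aaa"),
     ("2015-04-01", "xxx"), ("2015-04-01", "yyy")],  # second date is hypothetical
    names=["datetime_publication", "my_category"],
)
my_series = pd.Series([24, 2, 1, 1, 2, 3, 1], index=index)

# One row per date, one column per category; missing combinations become 0
wide = my_series.unstack(level=1).fillna(0)

# Divide every row by its own total so each row sums to 1
shares = wide.div(wide.sum(axis=1), axis=0)
```

shares.plot.barh(stacked=True) should then give bars of equal length, with xlim=(0, 1) instead of the absolute limit. Staying in long form, my_series / my_series.groupby(level=0).transform('sum') computes the same fractions.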