Barplot per each ax in matplotlib - pandas

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical rather than continuous, I convert it to a categorical column:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is one bar plot per fictitious place ('A' and 'B'), showing the count of each rating. This is the intended plot:
I guess a for loop over each value in id could work, but I am having trouble working out the size:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it raises an error: TypeError: 'int' object is not callable
Perhaps I am making this more complicated than it needs to be. Could you please guide me with this code?

The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be a n x 2 grid
# to fit one chart by id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
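For reference, the TypeError in the question comes from `df['rating'].size()`: `Series.size` is an integer attribute, not a method. Below is a minimal sketch of the question's loop with that fixed, assuming the goal is one subplot per id with a bar for every rating category. The `counts_by_id` dict is a hypothetical helper added only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})
df['rating_cat'] = pd.Categorical(df['rating'])
cats = df['rating_cat'].cat.categories
ids_uniques = df['id'].unique()

fig, axs = plt.subplots(1, len(ids_uniques), figsize=(8, 4), sharey=True)
counts_by_id = {}  # hypothetical helper: keeps the counts for inspection
for ax, id_ in zip(axs, ids_uniques):
    # count each rating for this id; reindex so every category gets a bar
    counts = (df.loc[df['id'] == id_, 'rating']
                .value_counts()
                .reindex(cats, fill_value=0))
    counts_by_id[id_] = counts
    ax.bar(counts.index.astype(str).tolist(), counts.to_numpy())
    ax.set_title(f"id = {id_}")
    ax.set_xlabel("rating")
axs[0].set_ylabel("count")
fig.tight_layout()
```

Reindexing on the categories keeps a zero-height slot for ratings a place never received (rating 4 for B here), so both subplots show the same x positions. Call plt.show() or fig.savefig(...) afterwards to see the figure.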

If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
    data=df,
    x='rating',
    hue='id',
)
plt.show()
plt.close()
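If installing seaborn is not an option, a similar grouped bar chart can probably be had from pandas alone. This is only a sketch: it assumes that a cross-tabulation of rating against id, plotted with the built-in pandas bar plot, is an acceptable substitute for countplot with hue.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})

# rows: rating values, columns: ids, cells: number of occurrences
counts = pd.crosstab(df['rating'], df['id'])

# one group of bars per rating, one bar per id inside each group
ax = counts.plot(kind='bar', rot=0)
ax.set_ylabel('count')
```

Passing subplots=True to counts.plot should give one axes per id instead of side-by-side bars.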

Related

How to make a scatterplot in pandas readable

I've been playing with the Titanic dataset and working through some visualisations in pandas using this tutorial: https://www.kdnuggets.com/2023/02/5-pandas-plotting-functions-might-know.html
I produced a scatterplot visual using this code.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
I was confused by the bootstrap plot result, so I moved on to the scatterplot.
pd.plotting.scatter_matrix(df, figsize=(10,10))
plt.show()
I can sort of interpret it but I'd like to put the various variables at top and bottom of every column. Is that doable?
You can use:
# `dataset` and the column names below are placeholders; substitute your own DataFrame and columns
fig, ax = plt.subplots(4, 3, figsize=(20, 15))
sns.scatterplot(x='bedrooms', y='price', data=dataset, ax=ax[0, 0])
sns.scatterplot(x='bathrooms', y='price', data=dataset, ax=ax[0, 1])
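If the aim is to stay with pd.plotting.scatter_matrix and simply repeat the variable names along the top, one possible sketch is below. It relies on scatter_matrix returning a 2D array of axes whose bottom row carries the x labels. The `demo` frame is synthetic stand-in data, not the Titanic file, and its column names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# synthetic stand-in data (assumption: three numeric columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=['Age', 'Fare', 'SibSp'])

axes = pd.plotting.scatter_matrix(demo, figsize=(8, 8))

# copy each column's bottom x label onto the top axes of that column as a title
for top_ax, bottom_ax in zip(axes[0], axes[-1]):
    top_ax.set_title(bottom_ax.get_xlabel())
```

Reading the names back from the bottom row avoids having to work out which columns scatter_matrix actually plotted when the DataFrame also holds non-numeric columns.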

Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot?

I am trying to find an equivalent of hist()'s figsize and layout parameters for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, it is laid out horizontally, and to my knowledge there is no equivalent layout parameter to wrap the plots onto several rows. 4 plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
             x_vars = X_train.select_dtypes(exclude=['object']).columns,
             y_vars = ["SalePrice"])
This is what I would like it to look like: Source
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()
What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
                            np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' will get excluded automatically via `melt`
compare_columns = X_train.select_dtypes(exclude=['object']).columns
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
                facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()
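To make the "long form" reshaping used in both examples more concrete, here is a tiny sketch of what melt does. The column names 'LotArea' and 'YearBuilt' and all values are made up for illustration.

```python
import pandas as pd

# a small "wide" frame: one column per variable (made-up values)
wide = pd.DataFrame({
    'SalePrice': [100, 200, 300],
    'LotArea': [10, 20, 30],
    'YearBuilt': [1990, 2000, 2010],
})

# long form: one row per (SalePrice, variable) pair
long_df = wide.melt(id_vars='SalePrice', value_vars=['LotArea', 'YearBuilt'])
print(long_df)
```

Each original column becomes a block of rows labelled by the 'variable' column, which is what col='variable' uses to split the data into one subplot per variable.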

How to set x axis according to the numbers in the DATAFRAME

I am using Matplotlib to show a graph of some information that I get from the users.
I want the x axis to show the users' IDs and the y axis to show the number of times they have won.
I don't understand how to put the users' IDs on the x axis.
My code:
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import pandas as pd
import numpy as np
#df = pd.read_csv('Players.csv')
df = pd.read_json('Players.json')
# df.groupby('ID').sum()['Win']
axisx = df.groupby('ID').sum()['Win'].keys()
axisy = df.groupby('ID').sum()['Win'].values
fig = pylab.gcf()
# fig.canvas.set_window_title('4 In A Row Statistic')
# img = plt.imread("Oi.jpeg")
# plt.imshow(img)
fig, ax = plt.subplots()
ax.set_xticklabels(axisx.to_list())
plt.title('Game Statistic',fontsize=20,color='r')
plt.xlabel('ID Players',color='r')
plt.ylabel('Wins',color='r')
x = np.arange(len(axisx))
rects = ax.bar(x, axisy, width=0.1)
plt.show()
Use plt.xticks(x, array_of_id). plt.xticks sets the current tick locations and labels of the x-axis; here the first argument gives the bar positions and the second the ID labels.
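A minimal sketch of that idea with the axes-level calls, assuming a frame with 'ID' and 'Win' columns as in the question. The data here is invented, since Players.json is not available.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# invented stand-in for Players.json
df = pd.DataFrame({'ID': [101, 205, 101, 317, 205, 101],
                   'Win': [1, 0, 1, 1, 1, 0]})

wins = df.groupby('ID')['Win'].sum()   # total wins per player ID
x = np.arange(len(wins))               # bar positions 0, 1, 2, ...

fig, ax = plt.subplots()
ax.bar(x, wins.to_numpy(), width=0.5)
ax.set_xticks(x)                                     # one tick per bar
ax.set_xticklabels(wins.index.astype(str).tolist())  # label ticks with the IDs
ax.set_xlabel('ID Players')
ax.set_ylabel('Wins')
ax.set_title('Game Statistic')
```

Setting the tick positions before the labels matters: calling set_xticklabels on its own, as in the question, attaches the labels to whatever default ticks happen to exist.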

Pandas histogram df.hist() group by

How to plot a histogram with pandas DataFrame.hist() using group by?
I have a data frame with 5 columns: "A", "B", "C", "D" and "Group"
There are two group classes: "yes" and "no"
Using:
df.hist()
I get the hist for each of the 4 columns.
Now I would like to get the same 4 graphs but with blue bars (group="yes") and red bars (group = "no").
I tried this without success:
df.hist(by = "group")
Using Seaborn
If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made using seaborn.FacetGrid.
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')
bins=np.linspace(df2.value.min(), df2.value.max(), 10)
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
This is not the most flexible workaround but will work for your question specifically.
def sephist(col):
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no
# subplot indices are 1-based, so start counting at 1
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    plt.hist(sephist(alpha)[0], bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(sephist(alpha)[1], bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
You could make this more generic by:
adding a df and by parameter to sephist: def sephist(df, by, col)
making the subplots loop more flexible: for num, alpha in enumerate(df.columns)
Because the first argument to matplotlib.pyplot.hist can be either a single array or a sequence of arrays, which are not required to be of the same length, an alternative would be:
# subplot indices are 1-based, so start counting at 1
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    plt.hist((sephist(alpha)[0], sephist(alpha)[1]), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
I generalized one of the other answers' solutions. Hope it helps someone out there. I added a line to ensure that the binning (number and range) is the same for every group within a column. The code should work for both binary and multi-class groupings, i.e. "by" can name a column containing N unique groups. Columns that exceed the available subplot space (4 x 5 = 20) are skipped.
import numpy as np
import matplotlib.pyplot as plt
def composite_histplot(df, columns, by, nbins=25, alpha=0.5):
    def _sephist(df, col, by):
        unique_vals = df[by].unique()
        df_by = dict()
        for uv in unique_vals:
            df_by[uv] = df[df[by] == uv][col]
        return df_by
    subplt_c = 4
    subplt_r = 5
    fig = plt.figure()
    for num, col in enumerate(columns):
        if num + 1 > subplt_c * subplt_r:
            continue
        plt.subplot(subplt_c, subplt_r, num+1)
        bins = np.linspace(df[col].min(), df[col].max(), nbins)
        for lbl, sepcol in _sephist(df, col, by).items():
            plt.hist(sepcol, bins=bins, alpha=alpha, label=lbl)
        plt.legend(loc='upper right', title=by)
        plt.title(col)
    plt.tight_layout()
    return fig
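The shared-binning step is the part that makes the overlaid histograms comparable, so here is a small sketch of it in isolation. The data is random and the variable names are only illustrative. Note that np.linspace(min, max, nbins) returns nbins edges, which gives nbins - 1 bins.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(size=300)                             # one numeric column
group = rng.choice(['yes', 'no'], size=300, p=[0.32, 0.68])

nbins = 25
# edges computed once from the whole column, regardless of group
edges = np.linspace(values.min(), values.max(), nbins)    # 25 edges -> 24 bins

# every group is histogrammed against the same edges
counts = {g: np.histogram(values[group == g], bins=edges)[0]
          for g in ('yes', 'no')}
```

Because both groups use identical edges, bar i of one group covers exactly the same value range as bar i of the other.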
TL;DR one-liner:
It won't create subplots, but it will create 4 separate plots:
[df.groupby('group')[i].plot(kind='hist',title=i).iloc[0] and plt.legend() and plt.show() for i in 'ABCD']
Full working example below:
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
[df.groupby('group')[i].plot(kind='hist',title=i).iloc[0] and plt.legend() and plt.show() for i in 'ABCD']

pandas dataframe bar plot put space between bars

I want my image to look like this:
But right now it looks like this:
How do I reduce the space between the bars without setting the bar width to 1?
Here is my code:
plot = repeat.loc['mean'].plot(kind='bar', rot=0, alpha=1, cmap='Reds',
                               yerr=repeat.loc['std'], error_kw=dict(elinewidth=0.02, ecolor='grey'),
                               align='center', width=0.2, grid=None)
plt.ylabel('')
plt.grid(False)
plt.title(cell, ha='center')
plt.xticks([])
plt.yticks([])
plt.ylim(0, 120)
plt.tight_layout()
Make the plot from scratch if the top-level functions from pandas or seaborn do not give you the desired result! :)
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(10, 10) + 1
data = data[np.argsort(np.average(data, axis=1))[::-1], :]
avg = np.average(data, axis=1)
std = np.std(data, axis=1)
# a practical helper from seaborn to quickly generate the colors
colors = sns.color_palette('Reds', n_colors=data.shape[0])
fig, ax = plt.subplots()
pos = range(10)
ax.bar(pos, avg, width=1)
for col, patch in zip(colors, ax.patches):
    patch.set_facecolor(col)
    patch.set_edgecolor('k')
for i, p in enumerate(pos):
    ax.plot([p, p], [avg[i], avg[i]+std[i]], color='k', lw=2, zorder=-1)
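If the bars should stay narrow (width below 1) while still sitting close together, one option when plotting from scratch is to shrink the distance between bar positions instead of widening the bars. This is a rough sketch with made-up heights; `step` and `width` are illustrative values. As far as I know, the pandas bar plot places bars at integer positions, which is why width=0.2 leaves large gaps there.

```python
import numpy as np
import matplotlib.pyplot as plt

heights = np.array([80, 95, 60, 110])   # made-up bar heights
width = 0.2                              # bar width, as in the question
step = 0.25                              # distance between bar centres

# the gap between neighbouring bars is step - width (0.05 here)
pos = np.arange(len(heights)) * step

fig, ax = plt.subplots()
bars = ax.bar(pos, heights, width=width, align='center',
              color='firebrick', edgecolor='k')
ax.set_xticks([])
ax.set_ylim(0, 120)
```

Choosing step equal to width makes the bars touch, and any larger step adds exactly that much extra space.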