How to make a scatterplot in pandas readable

I've been playing with the Titanic dataset and working through some visualisations in pandas using this tutorial: https://www.kdnuggets.com/2023/02/5-pandas-plotting-functions-might-know.html
I produced a scatter plot using the code below.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('train.csv')
I was confused by the bootstrap plot result, so I moved on to a scatter matrix.
pd.plotting.scatter_matrix(df, figsize=(10,10), )
plt.show()
I can sort of interpret it, but I'd like to show the variable names at the top and bottom of every column. Is that doable?

You can use seaborn and draw each pair of variables on its own labelled axes (replace dataset and the column names with your own):
fig, ax = plt.subplots(4, 3, figsize=(20, 15))
sns.scatterplot(x='bedrooms', y='price', data=dataset, ax=ax[0, 0])
sns.scatterplot(x='bathrooms', y='price', data=dataset, ax=ax[0, 1])
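If you would rather keep pd.plotting.scatter_matrix, here is a minimal sketch of labelling the top of every column. It relies on scatter_matrix returning a 2D array of Axes. The DataFrame below is synthetic stand-in data (the column names and values are assumptions, not the real Titanic file), so swap in your own df.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a few numeric columns (made-up values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(30, 12, 100),
    "Fare": rng.exponential(30, 100),
    "SibSp": rng.integers(0, 4, 100),
})

# scatter_matrix returns a 2D array of Axes: axes[i, j] plots
# column i on the y axis against column j on the x axis
axes = pd.plotting.scatter_matrix(df, figsize=(10, 10))

for j, name in enumerate(df.columns):
    axes[0, j].set_title(name)     # variable name above each column
    axes[-1, j].set_xlabel(name)   # variable name below each column
```

Call plt.show() afterwards to display the figure. The bottom labels are already drawn by pandas; setting them again just makes the intent explicit.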

Related

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical rather than continuous, I convert it to a categorical column:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is to create one bar plot per fictitious place ('A' or 'B'), showing the count for each rating. This is the intended plot:
I guess looping over each value in id could work, but I'm having trouble deciding the size:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it returns the error TypeError: 'int' object is not callable
Perhaps I am making this more complicated than it needs to be. Could you please guide me with this code?
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be a n x 2 grid
# to fit one chart by id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    # ax= tells pandas which axes to draw on
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
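As for the TypeError in the question: on a pandas Series, size is an integer attribute rather than a method, so df['rating'].size() tries to call an int. A minimal sketch of the loop from the question with that fixed, assuming the same df and using value_counts for the bar heights (the variable names counts and axs are just illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})
df['rating_cat'] = pd.Categorical(df['rating'])

ids = df['id'].unique()
fig, axs = plt.subplots(1, len(ids), figsize=(8, 4), sharey=True)

for ax, id_ in zip(axs, ids):
    # Count each rating for this id; a categorical column keeps
    # ratings with zero occurrences, so both plots share the same bars
    counts = df.loc[df['id'] == id_, 'rating_cat'].value_counts().sort_index()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(id_)
    ax.set_xlabel('rating')

axs[0].set_ylabel('count')
```

Because rating_cat is categorical, place B still gets a (zero-height) bar for rating 4, which keeps the two panels comparable.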
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
    data=df,
    x='rating',
    hue='id',
)
plt.show()
plt.close()

Show multiple columns values on labels with squarify.plot

I have a dataframe that I'd like to plot as a tree map with squarify. I'd like to show both country_name and counts on the chart by editing the label parameter, but it seems to take only one value.
Example data
import squarify
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
d = {'country_name':['USA', 'UK', 'Germany'], 'counts':[100, 200, 300]}
dd = pd.DataFrame(data=d)
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
norm = matplotlib.colors.Normalize(vmin=min(dd.counts), vmax=max(dd.counts))
colors = [matplotlib.cm.Blues(norm(value)) for value in dd.counts]
squarify.plot(label=dd.country_name, sizes=dd.counts, alpha=.7, color=colors)
plt.axis('off')
plt.show()
Expected output will have both counts and country_name on the chart.
You can create a list of labels by looping simultaneously through both columns and composing combined strings. For example:
import squarify
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib
d = {'country_name': ['USA', 'UK', 'Germany'], 'counts': [100, 200, 300]}
dd = pd.DataFrame(data=d)
labels = [f'{country}\n{count}' for country, count in zip(dd.country_name, dd.counts)]
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
norm = matplotlib.colors.Normalize(vmin=min(dd.counts), vmax=max(dd.counts))
colors = [matplotlib.cm.Blues(norm(value)) for value in dd.counts]
squarify.plot(label=labels, sizes=dd.counts, alpha=.7, color=colors)
plt.axis('off')
plt.show()
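The same labels can also be built without an explicit loop. This is only a sketch of the string-building step, using the dd frame from the question; the percentage variant is an extra I added for illustration. Either list can be passed to squarify.plot(label=...).

```python
import pandas as pd

dd = pd.DataFrame({'country_name': ['USA', 'UK', 'Germany'],
                   'counts': [100, 200, 300]})

# Vectorised concatenation: cast the numeric column to str first
labels = (dd['country_name'] + '\n' + dd['counts'].astype(str)).tolist()

# Variant that also shows each country's share of the total
total = dd['counts'].sum()
labels_pct = [f'{country}\n{count} ({count / total:.0%})'
              for country, count in zip(dd['country_name'], dd['counts'])]
```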

Seaborn boxplot custom labels beside each box

I have the code segment given below, which generates the boxplot shown. I would like to know how to add custom labels beside each box so that the boxplot is easier to read for the readers of my results. The expected diagram is also provided. I reckon there should be an easy way to do this in Seaborn/Matplotlib.
Specifically, I want to add the following labels to each box (on the left-hand side, as shown in the example provided).
The code used to generate the boxplot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from matplotlib import rcParams
from matplotlib.ticker import ScalarFormatter, FuncFormatter,FormatStrFormatter, EngFormatter#, mticker
%matplotlib inline
import seaborn as sns
range_stats = pd.read_csv(f'{snappy_data_dir}range_searcg_snappy_stats.csv')
data_stats_rs_txt = range_stats[range_stats['category'] == "t"]
data_stats_rs_seq = range_stats[range_stats['category'] == "s"]
fig, ax =plt.subplots(1,2)
rcParams['figure.figsize'] =8, 6
flierprops = dict(marker='x')
labels1 = ['R1', 'R2', 'R3', 'R4', 'R5']
sns.boxplot(x='Interval',y='Total',data=data_stats_rs_txt,palette='rainbow', ax=ax[0])
sns.boxplot(x='Interval',y='Total',data=data_stats_rs_seq,palette='rainbow', ax=ax[1])
ax[0].set(xlabel='Interval (s)', ylabel='query execution time (s)', title='Text format', ylim=(0, 290))
ax[1].set(xlabel='Interval (s)', ylabel='', title='Proposed format',ylim=(0, 290), yticklabels=[])
plt.savefig("range-query-corrected.svg")
plt.savefig('snappy_compressed_rangesearch.pdf')
Resulting figure:
Expected figure with labels:
This might help, although it is not a complete solution: the label positions are hard-coded.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
tips = sns.load_dataset('tips')
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.set_context('poster',font_scale=0.5)
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow', ax=axes[0], zorder=0)
axes[0].text(0, 45, r"$B1$", fontsize=20, color="blue")
axes[0].text(0.9, 45, r"$B2$", fontsize=20, color="blue")
axes[0].text(2.2, 45, r"$B3$", fontsize=20, color="blue")
axes[0].text(3.1, 45, r"$B4$", fontsize=20, color="blue");
sns.boxplot(x="day", y="tip", data=tips,palette='rainbow', ax=axes[1], zorder=10)
iris = sns.load_dataset("iris")
x_var = 'species'
y_var = 'sepal_width'
x_order = ['setosa', 'versicolor', 'virginica']
labels = ['R1','R2','R3']
max_vals = iris.groupby(x_var).max()[y_var].reindex(x_order)
ax = sns.boxplot(x=x_var, y=y_var, data=iris)
for x, y, l in zip(range(len(x_order)), max_vals, labels):
    ax.annotate(l, xy=[x, y], xytext=[0, 5], textcoords='offset pixels', ha='center', va='bottom')
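To put the label on the left-hand side of each box instead of above it, one option is to anchor the annotation at the box's left edge and right-align the text. The sketch below swaps in plain matplotlib boxplot and synthetic data so that it runs on its own (the interval values, sample data and box width are assumptions). The same annotate calls should work on the Axes returned by sns.boxplot, since seaborn also places categorical boxes at x = 0, 1, 2, ...

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
intervals = ['10', '20', '30']        # hypothetical Interval categories
labels = ['R1', 'R2', 'R3']           # one custom label per box
samples = [rng.normal(loc, 10, 50) for loc in (100, 150, 200)]

fig, ax = plt.subplots(figsize=(8, 6))
positions = list(range(len(samples)))  # boxes centred at x = 0, 1, 2
box_width = 0.5
ax.boxplot(samples, positions=positions, widths=box_width)
ax.set_xticks(positions)
ax.set_xticklabels(intervals)

for x, data, label in zip(positions, samples, labels):
    # Anchor at the left edge of the box, at the height of the median,
    # then nudge 5 points further left and right-align the text
    ax.annotate(label, xy=(x - box_width / 2, np.median(data)),
                xytext=(-5, 0), textcoords='offset points',
                ha='right', va='center')
```

Using an offset in points rather than data units keeps the gap between label and box the same regardless of the axis scale.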

Pandas histogram df.hist() group by

How to plot a histogram with pandas DataFrame.hist() using group by?
I have a data frame with 5 columns: "A", "B", "C", "D" and "Group"
The "Group" column has two classes: "yes" and "no".
Using:
df.hist()
I get a histogram for each of the 4 columns.
Now I would like to get the same 4 graphs, but with blue bars for group = "yes" and red bars for group = "no".
I tried this without success:
df.hist(by = "group")
Using Seaborn
If you are open to using Seaborn, a plot with multiple subplots and multiple variables within each subplot can easily be made with seaborn.FacetGrid.
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
df2 = pd.melt(df, id_vars='group', value_vars=list("ABCD"), value_name='value')
bins=np.linspace(df2.value.min(), df2.value.max(), 10)
g = sns.FacetGrid(df2, col="variable", hue="group", palette="Set1", col_wrap=2)
g.map(plt.hist, 'value', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
This is not the most flexible workaround, but it will work for your question specifically.
def sephist(col):
    # Split one column into the values for each group
    yes = df[df['group'] == 'yes'][col]
    no = df[df['group'] == 'no'][col]
    return yes, no
# Subplot indices are 1-based, so start the enumeration at 1
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    plt.hist(sephist(alpha)[0], bins=25, alpha=0.5, label='yes', color='b')
    plt.hist(sephist(alpha)[1], bins=25, alpha=0.5, label='no', color='r')
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
You could make this more generic by adding df and by parameters to sephist (def sephist(df, by, col)) and by making the subplots loop more flexible (for num, alpha in enumerate(df.columns)).
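A minimal sketch of that more generic version, under a few assumptions of mine: sephist returns a dict keyed by group value, the test data mirrors the random frame used in the Seaborn answer, and the colour mapping follows the blue/red request from the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def sephist(df, by, col):
    """Return {group value: Series of `col` for that group}."""
    return {key: grp[col] for key, grp in df.groupby(by)}

# Random example data with the same shape as in the question
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("ABCD"))
df["group"] = rng.choice(["yes", "no"], p=[0.32, 0.68], size=300)

value_cols = [c for c in df.columns if c != "group"]
colors = {"yes": "b", "no": "r"}  # blue for "yes", red for "no"

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, col in zip(axes.flat, value_cols):
    for key, values in sephist(df, "group", col).items():
        ax.hist(values, bins=25, alpha=0.5, label=key, color=colors[key])
    ax.legend(loc="upper right")
    ax.set_title(col)
fig.tight_layout()
```

Iterating over axes.flat instead of calling plt.subplot(2, 2, num) avoids having to track the 1-based subplot index by hand.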
Because the first argument to matplotlib.pyplot.hist can be either a single array or a sequence of arrays that are not required to be of the same length, an alternative would be:
# Subplot indices are 1-based, so start the enumeration at 1
for num, alpha in enumerate('ABCD', start=1):
    plt.subplot(2, 2, num)
    plt.hist((sephist(alpha)[0], sephist(alpha)[1]), bins=25, alpha=0.5, label=['yes', 'no'], color=['b', 'r'])
    plt.legend(loc='upper right')
    plt.title(alpha)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
I generalized one of the other answers. Hope it helps someone out there. I added a line to ensure the binning (number and range) is the same for each column, regardless of group. The code should work for both "binary" and "categorical" groupings, i.e. "by" can specify a column containing any number N of unique groups. Columns are also skipped once the number of columns to plot exceeds the subplot space.
import numpy as np
import matplotlib.pyplot as plt
def composite_histplot(df, columns, by, nbins=25, alpha=0.5):
    def _sephist(df, col, by):
        unique_vals = df[by].unique()
        df_by = dict()
        for uv in unique_vals:
            df_by[uv] = df[df[by] == uv][col]
        return df_by
    subplt_c = 4
    subplt_r = 5
    fig = plt.figure()
    for num, col in enumerate(columns):
        if num + 1 > subplt_c * subplt_r:
            continue
        plt.subplot(subplt_c, subplt_r, num+1)
        bins = np.linspace(df[col].min(), df[col].max(), nbins)
        for lbl, sepcol in _sephist(df, col, by).items():
            plt.hist(sepcol, bins=bins, alpha=alpha, label=lbl)
        plt.legend(loc='upper right', title=by)
        plt.title(col)
    plt.tight_layout()
    return fig
TL;DR one-liner:
It won't create subplots, but it will create 4 separate plots:
[df.groupby('group')[i].plot(kind='hist',title=i)[0] and plt.legend() and plt.show() for i in 'ABCD']
Full working example below
import numpy as np; np.random.seed(1)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(300,4), columns=list("ABCD"))
df["group"] = np.random.choice(["yes", "no"], p=[0.32,0.68],size=300)
[df.groupby('group')[i].plot(kind='hist',title=i)[0] and plt.legend() and plt.show() for i in 'ABCD']

Arrange two plots horizontally

As an exercise, I'm reproducing a plot from The Economist with matplotlib.
So far, I can generate random data and produce the two plots independently. I'm now struggling to put them next to each other horizontally.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
df1 = pd.DataFrame({"broadcast": np.random.randint(110, 150,size=8),
"cable": np.random.randint(100, 250, size=8),
"streaming" : np.random.randint(10, 50, size=8)},
index=pd.Series(np.arange(2009,2017),name='year'))
df1.plot.bar(stacked=True)
df2 = pd.DataFrame({'usage': np.sort(np.random.randint(1,50,size=7)),
'avg_hour': np.sort(np.random.randint(0,3, size=7) + np.random.ranf(size=7))},
index=pd.Series(np.arange(2009,2016),name='year'))
plt.figure()
fig, ax1 = plt.subplots()
ax1.plot(df2['avg_hour'])
ax2 = ax1.twinx()
ax2.bar(x=range(2009,2016),height=df2['usage'])
plt.show()
You should use subplots. First create a figure with plt.figure(). Then add a subplot with fig.add_subplot(121), where the first 1 is the number of rows, 2 is the number of columns, and the last 1 is the index of the first plot. Plot the first dataframe into it, making sure to pass the created axes ax1. Then add the second subplot with fig.add_subplot(122) and repeat for the second dataframe. I renamed your twin axes from ax2 to ax3, since there are now three axes on one figure. The code below produces what I believe you are looking for; you can then work on the aesthetics of each plot separately.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
df1 = pd.DataFrame({"broadcast": np.random.randint(110, 150,size=8),
"cable": np.random.randint(100, 250, size=8),
"streaming" : np.random.randint(10, 50, size=8)},
index=pd.Series(np.arange(2009,2017),name='year'))
ax1 = fig.add_subplot(121)
df1.plot.bar(stacked=True,ax=ax1)
df2 = pd.DataFrame({'usage': np.sort(np.random.randint(1,50,size=7)),
'avg_hour': np.sort(np.random.randint(0,3, size=7) + np.random.ranf(size=7))},
index=pd.Series(np.arange(2009,2016),name='year'))
ax2 = fig.add_subplot(122)
ax2.plot(df2['avg_hour'])
ax3 = ax2.twinx()
ax3.bar(x=range(2009,2016),height=df2['usage'])
plt.show()
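An equivalent layout can also be set up with a single plt.subplots(1, 2) call, which returns both axes at once. This is a sketch with the same kind of random data as above; the names ax_left, ax_right and ax_twin, the figure size and the colours are my own choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

df1 = pd.DataFrame({"broadcast": rng.integers(110, 150, size=8),
                    "cable": rng.integers(100, 250, size=8),
                    "streaming": rng.integers(10, 50, size=8)},
                   index=pd.Index(np.arange(2009, 2017), name="year"))

df2 = pd.DataFrame({"usage": np.sort(rng.integers(1, 50, size=7)),
                    "avg_hour": np.sort(rng.integers(0, 3, size=7) + rng.random(7))},
                   index=pd.Index(np.arange(2009, 2016), name="year"))

# One row, two columns: the two plots sit side by side
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))

# Left: stacked bars drawn by pandas on the given axes
df1.plot.bar(stacked=True, ax=ax_left)

# Right: line on the primary y axis, bars on a twin y axis
ax_right.plot(df2.index, df2["avg_hour"], color="tab:red")
ax_twin = ax_right.twinx()
ax_twin.bar(df2.index, df2["usage"], alpha=0.3)

fig.tight_layout()
```

Call plt.show() to display the figure. Passing figsize here is what controls the overall width, so widen it if the two panels look cramped.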