Overlaying boxplots on the relative bin of a histogram

Taking the 'tips' dataset as an example:
total_bill  tip   smoker  day  time    size
16.99       1.01  No      Sun  Dinner  2
10.34       1.66  No      Sun  Dinner  3
21.01       3.50  No      Sun  Dinner  3
23.68       3.31  No      Sun  Dinner  2
24.59       3.61  No      Sun  Dinner  4
What I'm trying to do is represent the distribution of the variable 'total_bill' and relate each of its bins to the distribution of the 'tip' values that fall into it. In this example, the graph is meant to answer the question: "What is the distribution of tips left by customers as a function of the bill they paid?"
I have more or less obtained the graph I wanted, but there is a problem, which I explain at the end.
The procedure I adopted is this:
Divide 'total_bill' into bins.
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
tips.head()
total_bill  tip   smoker  day  time    size  bins_total_bill
16.99       1.01  No      Sun  Dinner  2     (12.618, 17.392]
10.34       1.66  No      Sun  Dinner  3     (7.844, 12.618]
21.01       3.50  No      Sun  Dinner  3     (17.392, 22.166]
23.68       3.31  No      Sun  Dinner  2     (22.166, 26.94]
24.59       3.61  No      Sun  Dinner  4     (22.166, 26.94]
Create a pd.Series with:
Index: the pd.Interval of each total_bill bin
Values: number of occurrences
s = tips['bins_total_bill'].value_counts(sort=False)
s
(3.022, 7.844]       7
(7.844, 12.618]     42
(12.618, 17.392]    68
(17.392, 22.166]    51
(22.166, 26.94]     31
(26.94, 31.714]     19
(31.714, 36.488]    12
(36.488, 41.262]     7
(41.262, 46.036]     3
(46.036, 50.81]      4
Name: bins_total_bill, dtype: int64
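As a small aside on the step above, here is a minimal sketch of how the interval index can be turned into numbers. The values in `bills` are a made-up stand-in for 'total_bill', not the real dataset, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the 'total_bill' column (not the real tips data).
bills = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59, 8.77, 35.26, 48.27])

# With an integer, pd.cut splits the observed range into equal-width bins.
binned = pd.cut(bills, 4)

# sort=False keeps the bins in interval order instead of ordering by frequency.
counts = binned.value_counts(sort=False)

# Each index entry is a pd.Interval, so its numeric midpoint is available via .mid.
mids = [interval.mid for interval in counts.index]

print(counts)
print(mids)
```

The midpoints are plain floats, so they can be used as positions on a numeric axis rather than as category labels.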
Combine the barplot and the boxplot
fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()
sns.barplot(ax=ax1, x = s.index, y = s.values)
sns.boxplot(ax=ax2, x='bins_total_bill', y='tip', data=tips)
sns.stripplot(ax=ax2, x='bins_total_bill', y='tip', data=tips, size=5, color="yellow", edgecolor='red', linewidth=0.3)
#Title and axis labels
ax1.tick_params(axis='x', rotation=90)
ax1.set_ylabel('Number of bills')
ax2.set_ylabel('Tips [$]')
ax1.set_xlabel("Mid value of total_bill bins [$]")
ax1.set_title("Tips ~ Total_bill distribution")
#Reference line at average(tip) + add it to the yticks + Legend
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
ax2.set_yticks(list(ax2.get_yticks()) + [avg_tip]) #append avg(tip) to the existing ticks
ax2.legend(loc='best')
#Set labels axis x
ax1.set_xticklabels([round(interval.mid, 2) for interval in s.index])
However, this graph has a problem: because the x-axis is categorical, I cannot, for example, add a vertical line at the mean value of 'total_bill'.
How can I fix this to get the correct result?
I also wonder whether there is a more streamlined approach than the one I have adopted.

I came up with this method, which is more compact than the previous one (it can probably be improved further) and overcomes the problem of scaling on the x-axis.
Split 'total_bill' into bins and add the column to the DataFrame
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
Group the 'tip' column by the bins created above
obj_gby_tips = tips.groupby('bins_total_bill')['tip']
gby_tip = dict(list(obj_gby_tips))
Create a dictionary with:
keys: the midpoint of each bin interval
values: the grouped tips for each interval
mid_total_bill_bins = list(map(lambda bins: bins.mid, list(gby_tip.keys())))
gby_tips = gby_tip.values()
tip_gby_total_bill_bins = dict(zip(mid_total_bill_bins, gby_tips))
Create the chart, positioning each box of the boxplot at the midpoint of its bin
fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()
bp_values = list(tip_gby_total_bill_bins.values())
bp_pos = list(tip_gby_total_bill_bins.keys())
l1 = sns.histplot(tips.total_bill, bins=10, ax=ax1)
l2 = ax2.boxplot(bp_values, positions=bp_pos, manage_ticks=False, patch_artist=True, widths=2)
#Average tips as hline
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
ax2.set_yticks(list(ax2.get_yticks()) + [avg_tip]) #add value of avg(tip) to y-axis
#Average total_bill as vline
avg_total_bill=np.mean(tips.total_bill)
ax1.axvline(x=avg_total_bill, color='orange', linestyle="--", label="avg tot_bill")
This gives the result I was after.
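One detail that may be worth checking: pd.cut(..., 10) and a histogram with bins=10 compute their edges independently, and pd.cut nudges the lowest edge down slightly (compare the (3.022, 7.844] interval in the question). The sketch below is one possible variant, not the method above: it uses plain matplotlib instead of seaborn, synthetic data instead of the tips dataset, and a single set of edges from np.histogram_bin_edges for both the histogram and the grouping. All names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line to view interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the tips data (assumption: not the real dataset).
rng = np.random.default_rng(0)
total_bill = rng.uniform(3, 50, 200)
tip = 0.15 * total_bill + rng.normal(0, 1, 200)
df = pd.DataFrame({"total_bill": total_bill, "tip": tip})

# One set of edges drives both the histogram and the grouping.
edges = np.histogram_bin_edges(df["total_bill"], bins=10)
centers = (edges[:-1] + edges[1:]) / 2

# Bin index per row, using the [left, right) convention with a closed last bin.
bin_idx = np.clip(np.digitize(df["total_bill"], edges) - 1, 0, len(edges) - 2)

# Collect the tips per bin, skipping bins that received no rows.
groups, positions = [], []
for i, center in enumerate(centers):
    values = df.loc[bin_idx == i, "tip"].to_numpy()
    if len(values) > 0:
        groups.append(values)
        positions.append(center)

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.hist(df["total_bill"], bins=edges, alpha=0.5)
# manage_ticks=False keeps the numeric x-axis of the histogram untouched.
ax2.boxplot(groups, positions=positions, widths=0.5 * np.diff(edges)[0], manage_ticks=False)
# The x-axis is numeric, so a vertical line at the mean bill can be drawn directly.
ax1.axvline(df["total_bill"].mean(), color="orange", linestyle="--")
```

Because the box positions are computed from the same edges the histogram uses, each box sits at the centre of its bar.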

Related

Plotting time series box and whisker plot with missing date values for origin destination pairs

I have the following data set:
df.head(7)
Origin Dest Date Quantity
0 Atlanta LA 2021-09-09 1
1 Atlanta LA 2021-09-11 4
2 Atlanta Chicago 2021-09-16 1
3 Atlanta Seattle 2021-09-27 12
4 Seattle LA 2021-09-29 2
5 Seattle Atlanta 2021-09-13 2
6 Seattle Newark 2021-09-17 7
In short, this table represents the number of items (Quantity) that were sent from a given origin to a given destination on a given date. The table contains 1 month of data. This table was read with:
shipments = pd.read_csv('shipments.csv', parse_dates=['Date'])
Note that this is a sparse table: if Quantity=0 for a particular (Origin, Dest, Date) combination, the row is not included in the table. For example, no items were sent from Atlanta to LA on 2021-09-10, so that row is absent from the data.
I would like to visualize this data using time series box and whisker plots. The x-axis of my graph should show the day, and Quantity should be on the y-axis. A boxplot should represent the various percentiles aggregated over all (origin-destination) pairs.
Similarly, would it be possible to create a graph which, instead of every day, only shows Monday-Sunday on the x-axis (and hence shows the results per day of the week)?
To generate the rows with missing data I used the following code:
table = pd.pivot_table(data=shipments, index='Date', columns=['Origin','Dest'], values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06','2021-10-10')
table = table.reindex(idx,fill_value=0)

You could transpose the table dataframe, and use that as input for a sns.boxplot. And you could create a similar table for the day of the week. Note that with many zeros, the boxplot might look a bit strange.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# first create some test data, somewhat similar to the given data
N = 1000
cities = ['Atlanta', 'LA', 'Chicago', 'Seattle', 'Newark']
shipments = pd.DataFrame({'Origin': np.random.choice(cities, N),
                          'Dest': np.random.choice(cities, N),
                          'Date': np.random.choice(pd.date_range('2021-09-06', '2021-10-10'), N),
                          'Quantity': (np.random.uniform(1, 4, N) ** 3).astype(int)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5), gridspec_kw={'width_ratios': [3, 1]})
# create boxplots for each day
table_month = pd.pivot_table(data=shipments, index='Date', columns=['Origin', 'Dest'], values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06', '2021-10-10')
table_month = table_month.reindex(idx, fill_value=0)
sns.boxplot(data=table_month.T, ax=ax1)
labels = [day.strftime('%d\n%b %Y') if i == 0 or day.day == 1 else day.strftime('%d')
          for i, day in enumerate(table_month.index)]
ax1.set_xticklabels(labels)
# create boxplots for each day of the week
table_dow = pd.pivot_table(data=shipments, index=shipments['Date'].dt.dayofweek, columns=['Origin', 'Dest'],
                           values='Quantity', fill_value=0)
table_dow = table_dow.reindex(range(7), fill_value=0)
sns.boxplot(data=table_dow.T, ax=ax2)
labels = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
ax2.set_xticklabels(labels)
ax2.set_xlabel('')  # remove superfluous x label
fig.tight_layout()
plt.show()
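To see what the pivot and reindex steps do before any plotting, here is a minimal sketch on three made-up rows in the same shape as the question's table. The rows and the short date range are assumptions chosen only to keep the output small.

```python
import pandas as pd

# Tiny made-up sample in the same shape as the question's table.
shipments = pd.DataFrame({
    "Origin": ["Atlanta", "Atlanta", "Seattle"],
    "Dest": ["LA", "LA", "LA"],
    "Date": pd.to_datetime(["2021-09-09", "2021-09-11", "2021-09-11"]),
    "Quantity": [1, 4, 2],
})

# Wide table: one row per date that occurs, one column per (Origin, Dest) pair.
# fill_value=0 covers pairs that did not ship on a date that is present.
table = pd.pivot_table(shipments, index="Date", columns=["Origin", "Dest"],
                       values="Quantity", fill_value=0)

# Reindex over the full calendar range so days with no shipments at all become zero rows.
full_range = pd.date_range("2021-09-09", "2021-09-12")
table = table.reindex(full_range, fill_value=0)

print(table)
```

Each row of this table is one day and each column one origin-destination pair, so a boxplot per row summarises the quantities over all pairs for that day.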

Shift Matplotlib Axes to Match Overlaid Plots

I have a plot that overlays a scatter plot (red data points) on a boxplot. Everything is fine except that the red data points do not line up with the boxplots along the x-axis. I think this is because two different plot methods are used on the same axes. I have noticed that adding the scatter plot effectively shifts the boxplots to the right on the x-axis, whereas plotting the boxplot alone does not shift the x-axis values. The method I'm using below should work, but it does not. Here is my code:
sitenames = df3.plant_name.unique().tolist()
months = ['JANUARY','FEBRUARY','MARCH','APRIL','MAY','JUNE','JULY','AUGUST','SEPTEMBER','OCTOBER','NOVEMBER','DECEMBER']
from datetime import datetime
monthn = datetime.now().month
newList = list()
for i in range(monthn-1):
    newList.append(months[(monthn-1+i)%12])
print(newList)
for i, month in enumerate(newList,1):
    #plt.figure()
    fig, ax = plt.subplots()
    ax = df3[df3['month']==i].boxplot(by='plant_name',column='Var')
    df3c[df3c['month']==i].plot(kind='scatter', x='plant_name', y='Var',ax=ax,color='r',label='CY')
    plt.xticks(rotation=90, ha='right')
    plt.suptitle('1991-2020 ERA5 WIND PRODUCTION',y=1)
    plt.title(months[i-1])
    plt.xlabel('SITE')
    plt.ylabel('VARIABILITY')
    plt.legend()
    plt.show()
Here is the plot from "February" that shows the mis-aligned x axis:
Here are partial rows for df3, df3c:
df3.head(3)
Out[223]:
plant_name month year power_kwh power_kwh_rhs Var
0 BII NEE STIPA 1 1991 11905.826075 14673.281223 -18.9
1 BII NEE STIPA 1 1992 14273.927688 14673.281223 -2.7
2 BII NEE STIPA 1 1993 12559.828360 14673.281223 -14.4
df3c.head(3)
Out[224]:
plant_name month year power_kwh power_kwh_rhs Var
0 BII NEE STIPA 1 2021.0 14863.643952 14673.281223 1.3
1 BII NEE STIPA 2 2021.0 9663.393155 12388.328084 -22.0
2 DOS ARBOLITOS 1 2021.0 36819.502285 36882.205762 -0.2
I have found a similar problem but can't see how to apply its solution to my code: Shift matplotlib axes to match eachother
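One likely cause, stated as an assumption because the figure is not reproduced here: matplotlib's boxplot places its boxes at x = 1..N by default, while a scatter on a categorical x-axis starts at 0, so the two layers end up one step apart. The sketch below is one possible way around this, using plain matplotlib with explicit numeric positions shared by both layers. The site names and values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line to view interactively
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sites = ["A", "B", "C"]  # hypothetical site names
# Made-up historical values (30 per site) and one current-year value per site.
hist = pd.DataFrame({"plant_name": np.repeat(sites, 30), "Var": rng.normal(0, 10, 90)})
current = pd.DataFrame({"plant_name": sites, "Var": [1.3, -22.0, -0.2]})

# One shared set of x positions for both layers.
positions = np.arange(len(sites))
samples = [hist.loc[hist["plant_name"] == s, "Var"].to_numpy() for s in sites]
cy = current.set_index("plant_name").loc[sites, "Var"]

fig, ax = plt.subplots()
ax.boxplot(samples, positions=positions)  # boxes at 0..N-1 instead of the default 1..N
points = ax.scatter(positions, cy, color="r", label="CY", zorder=3)
ax.set_xticks(positions)
ax.set_xticklabels(sites, rotation=90)
ax.legend()
```

Since both calls receive the same `positions` array, the red points cannot drift relative to the boxes regardless of how the tick labels are set.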

waterfalls graph for two data frame in pandas

I have two data frames with exactly the same rows and columns.
df1 shows the percentage (%) of each mode:
                pt     car    bike  walk
Equilibrium     28.80  36.82  3.55  30.83
No information  28.80  36.82  3.55  30.83
start           28.82  36.83  3.55  30.80
Equilibrium2    28.51  36.95  3.56  30.98
and df2 shows the travel time (minutes):
                pt         car        bike      walk
Equilibrium     384651.50  216673.23  24136.57  88602.10
No information  397068.27  216640.15  24133.03  88565.93
start           386008.27  216664.17  24136.57  88521.93
Equilibrium2    383788.73  215751.85  26638.87  89602.90
I want a diagram in which the x-axis shows the percentage of each mode (df1) and the y-axis shows the travel time (df2) for each column name.

Let's try plotting the bars with bottoms:
colors = [f'C{i}' for i in range(4)]
fig, axes = plt.subplots(2,2, figsize=(8,8))
for idx, ax in zip(df1.index, axes.ravel()):
    s1, s2 = df1.loc[idx].cumsum(),df2.loc[idx].cumsum()
    ax.bar(s1.shift(fill_value=0),df2.loc[idx],width=df1.loc[idx],
           bottom=s2.shift(fill_value=0), align='edge', color=colors)
    ax.set_title(idx)
fig.tight_layout()
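The cumsum().shift(fill_value=0) idiom in the code above is what turns bar widths into left edges (and, for df2, heights into bottoms). A minimal sketch of just that step, using the 'Equilibrium' row of df1 from the question:

```python
import pandas as pd

# The 'Equilibrium' row of df1 from the question.
widths = pd.Series({"pt": 28.80, "car": 36.82, "bike": 3.55, "walk": 30.83})

# Running total, shifted one step: each bar starts where the previous one ended.
left_edges = widths.cumsum().shift(fill_value=0)
right_edges = left_edges + widths

print(left_edges)
print(right_edges)
```

The first bar starts at 0 and the last one ends at the sum of all widths, which is why align='edge' is needed in the ax.bar call.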

Subplot multiindex data by level

This is my multiindex data.
Month  Hour  Hi
1      9     84.39
       10    380.41
       11    539.06
       12    588.70
       13    570.62
       14    507.42
       15    340.42
       16    88.91
2      8     69.31
       9     285.13
       10    474.95
       11    564.42
       12    600.11
       13    614.36
       14    539.79
       15    443.93
       16    251.57
       17    70.51
I want to make subplots where each subplot represents one Month, with Hour on the x-axis and the Hi values of that month on the y-axis.
The following approach works nicely:
levels = df.index.levels[0]
fig, axes = plt.subplots(len(levels), figsize=(3, 25))
for level, ax in zip(levels, axes):
    df.loc[level].plot(ax=ax, title=str(level))
plt.tight_layout()
I want a 1x2 grid of subplots instead of the vertical arrangement above. Later, with more data, I want a 3x4 grid or an even larger one.
How can I do this?

You can do it in pandas
df.Hi.unstack(0).fillna(0).plot(kind='line',subplots=True, layout=(1,2))

Pass the number of rows and columns to plt.subplots:
levels = df.index.levels[0]
# first argument: number of rows, second argument: number of columns
fig, axes = plt.subplots(1, len(levels), figsize=(6, 3))
for level, ax in zip(levels, axes):
    df.loc[level].plot(ax=ax, title=str(level))
plt.tight_layout()
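For the larger grids mentioned in the question (3x4 and beyond), plt.subplots returns a 2-D array of axes, so it has to be flattened before zipping it with the levels. A minimal sketch, assuming made-up data with five months and a hypothetical choice of three columns:

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line to view interactively
import matplotlib.pyplot as plt
import pandas as pd

# Made-up MultiIndex data: 5 months with 4 hours each (not the question's values).
index = pd.MultiIndex.from_product([range(1, 6), range(9, 13)], names=["Month", "Hour"])
df = pd.DataFrame({"Hi": range(len(index))}, index=index)

months = df.index.get_level_values("Month").unique()
ncols = 3                                   # hypothetical grid width
nrows = math.ceil(len(months) / ncols)      # enough rows to fit every month

# squeeze=False always returns a 2-D array, even for a single row or column.
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
flat_axes = axes.ravel()

for month, ax in zip(months, flat_axes):
    df.loc[month].plot(ax=ax, title=str(month))

# Hide the unused axes when the grid has more cells than months.
for ax in flat_axes[len(months):]:
    ax.set_visible(False)

fig.tight_layout()
```

Changing `ncols` to 4 gives the 3x4 layout for twelve months without touching the loop.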

Plot multiple lines in a line graph using matplotlib

I am trying to plot a line graph with several lines in it, one for each group.
The x-axis would be the hour and the y-axis would be the count.
Since there are 3 groups in the dataframe, I will have 3 lines in a single line graph.
Below are my data and the code I have used, but I am not sure where I am going wrong.
Group Hour Count
G1 1 40
G2 1 300
G1 2 400
G2 2 80
G3 2 1211
Code used:
fig, ax = plt.subplots()
labels = []
for key, grp in df1.groupby(['Group']):
    ax = grp.plot(ax=ax, kind='line', x='x', y='y', c=key)
    labels.append(key)
lines, _ = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.show()

You can use df.pivot and save yourself some lines
df.pivot(index='Hour', columns='Group', values='Count').plot(kind='line', marker='o')
G3 is plotted as a point because there is only one point (2 hrs, 1211 count) associated with it.
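To make the G3 remark concrete, here is a minimal sketch of what the pivot returns for the five rows in the question. Only the intermediate table is shown, with no plotting.

```python
import pandas as pd

# The five rows from the question.
df = pd.DataFrame({
    "Group": ["G1", "G2", "G1", "G2", "G3"],
    "Hour": [1, 1, 2, 2, 2],
    "Count": [40, 300, 400, 80, 1211],
})

# One row per hour, one column per group; combinations that do not occur become NaN.
wide = df.pivot(index="Hour", columns="Group", values="Count")
print(wide)
```

G3 has a value only at hour 2 and NaN at hour 1, so there is no second point to connect it to, which is why marker='o' is needed to make it visible at all.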
