How to plot a boxplot and a line plot with different axes - pandas

I have a dataset that can be crafted in this way:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
date_range = pd.date_range(start='2021-11-20', end='2022-01-09').to_list()
rows = []
for d in date_range*3:
    if np.random.randint(0, 2) == 0:
        # DataFrame.append was removed in pandas 2.0, so collect the rows first
        rows.append({'Date': d, 'Values': np.random.randint(1, 11)})
df_left = pd.DataFrame(rows, columns=['Date', 'Values'])
df_left["year-week"] = df_left["Date"].dt.strftime("%Y-%U")
df_right = pd.DataFrame(
    {
        "Date": date_range,
        "Values": np.random.randint(0, 50, len(date_range)),
    }
)
df_right_counted = df_right.resample('W', on='Date')['Values'].sum().to_frame().reset_index()
df_right_counted["year-week"] = df_right_counted["Date"].dt.strftime("%Y-%U")
df_right_counted:
        Date  Values year-week
0 2021-12-05     135   2021-49
1 2021-12-12     219   2021-50
2 2021-12-19     136   2021-51
3 2021-12-26     158   2021-52
4 2022-01-02     123   2022-01
5 2022-01-09     222   2022-02
And df_left:
         Date  Values year-week
0  2021-12-01      10   2021-48
1  2021-12-05       1   2021-49
2  2021-12-07       5   2021-49
...
13 2022-01-07       7   2022-01
14 2022-01-08       9   2022-01
15 2022-01-09       6   2022-02
I'd like to create this graph in matplotlib: a boxplot drawn from df_left that uses the y-axis on the left, and a regular line plot drawn from df_right_counted that uses the y-axis on the right.
I have an attempt (including the fix from Javier's comment), but I am completely stuck on:
making both graphs start from the same week (I'd like to start from 2021-49)
plotting another y-axis on the right and letting the line plot use it
This is my attempt so far:
fig, ax = plt.subplots(nrows=1, ncols=1, dpi=100)
fig.tight_layout()
fig.set_tight_layout(True)
fig.set_facecolor('white')
ax2=ax.twinx()
df_left.boxplot(figsize=(31, 8), column='Values', by='year-week', ax=ax)
df_right_counted.plot(figsize=(31, 8), x='year-week', y='Values', ax=ax2)
plt.show()
Could you give me some guidance? I am still learning to use matplotlib.

One of the problems is that resample('W', on='Date') and .dt.strftime("%Y-%U") seem to lead to different numbers in both dataframes. Another problem is that boxplot internally labels the boxes starting with 1.
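To make the first problem concrete, here is a minimal sketch on one hand-picked week (the dates and the constant value 1 are illustrative assumptions, not the question's data). resample('W') labels each Monday-to-Sunday bin with its closing Sunday, while %U starts a new week on Sunday, so the two labelings disagree for six of the seven days:

```python
import pandas as pd

# One full week, Monday 2021-11-29 through Sunday 2021-12-05 (illustrative data).
dates = pd.date_range("2021-11-29", "2021-12-05")
df = pd.DataFrame({"Date": dates, "Values": 1})

# Label every day directly; %U counts weeks that start on Sunday.
df["year-week"] = df["Date"].dt.strftime("%Y-%U")

# Route 1: resample first, then label each bin by its closing Sunday.
resampled = df.resample("W", on="Date")["Values"].sum().reset_index()
resampled["year-week"] = resampled["Date"].dt.strftime("%Y-%U")

# Route 2: label first, then group by the label.
grouped = df.groupby("year-week")["Values"].sum()

print(resampled)  # a single bin carrying all seven days
print(grouped)    # the same seven days split across two week labels
```

Route 1 puts all seven days under 2021-49, whereas Route 2 puts six of them under 2021-48, which is why the two dataframes can end up one week apart.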
Some possible workarounds:
force boxplot to place the boxes starting from zero (via positions), matching the 0-based x positions of the line plot
create the counts by first extracting the year-week and then using groupby; that way the week numbers should be consistent
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
date_range = pd.date_range(start='2021-11-20', end='2022-01-09').to_list()
rows = []
for d in date_range * 3:
    if np.random.randint(0, 2) == 0:
        # DataFrame.append was removed in pandas 2.0, so collect the rows first
        rows.append({'Date': d, 'Values': np.random.randint(1, 11)})
df_left = pd.DataFrame(rows, columns=['Date', 'Values'])
df_left["year-week"] = df_left["Date"].dt.strftime("%Y-%U")
df_right = pd.DataFrame({"Date": date_range,
"Values": np.random.randint(0, 50, len(date_range))})
df_right["year-week"] = df_right["Date"].dt.strftime("%Y-%U")
df_right_counted = df_right.groupby('year-week')['Values'].sum().to_frame().reset_index()
fig, ax = plt.subplots(nrows=1, ncols=1, dpi=100)
fig.tight_layout()
fig.set_tight_layout(True)
fig.set_facecolor('white')
ax2 = ax.twinx()
df_left.boxplot(figsize=(31, 8), column='Values', by='year-week', ax=ax,
positions=np.arange(len(df_left['year-week'].unique())))
df_right_counted.plot(figsize=(31, 8), x='year-week', y='Values', ax=ax2)
plt.show()
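The positions argument used above can be checked in isolation. The following is a small sketch in plain matplotlib with made-up numbers (weeks, samples, totals and the helper box_centers are all hypothetical); it reads each box's x position back from its median line to show that the default placement starts at 1, while a line plotted against 0-based positions starts at 0:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt

weeks = ["2021-49", "2021-50", "2021-51"]      # illustrative labels
samples = [[1, 5, 9], [2, 4, 6], [3, 7, 8]]    # illustrative values per week
totals = [135, 219, 136]                       # illustrative weekly sums


def box_centers(boxes):
    """Return the x position of each box, read from its median line."""
    return [round(float(sum(line.get_xdata())) / 2, 6) for line in boxes["medians"]]


fig, (ax_default, ax_fixed) = plt.subplots(ncols=2)

# Default placement: boxes sit at x = 1, 2, 3.
default_boxes = ax_default.boxplot(samples)

# Explicit placement: boxes sit at x = 0, 1, 2, like the line below.
shifted_boxes = ax_fixed.boxplot(samples, positions=range(len(weeks)))
ax_line = ax_fixed.twinx()
(line,) = ax_line.plot(range(len(weeks)), totals, color="tab:red")

# Put the week labels back on the shared x-axis.
ax_fixed.set_xticks(range(len(weeks)))
ax_fixed.set_xticklabels(weeks)

print(box_centers(default_boxes), box_centers(shifted_boxes))
```

With the explicit positions, box i and point i of the line share the same x coordinate, so the two plots line up on the twin axes.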

Related

Changing the tick frequency on the x axis for each subplot on a FacetGrid plot in Seaborn

Problem:
I am trying to create a FacetGrid plot in Seaborn, where I have a yearWeek column as the x-axis and a conversionRate column as the y-axis. However, I want to only display every second yearWeek on the x-axis. How can I achieve this?
 
My current attempt:
 
!python --version
print(f'Seaborn version: {sns.__version__}')
 
data = {'yearWeek': ['2022-W1','2022-W2','2022-W3','2022-W4','2022-W5','2022-W6','2022-W7','2022-W8','2022-W9','2022-W10','2022-W11','2022-W12']*3,
        'country': ['US','US','US','US','US','US','US','US','US','US','US','US'] + ['India','India','India','India','India','India','India','India','India','India','India','India'] + ['Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia'],
        'conversionRate': [np.random.rand() for i in range(12*3)]
       }
 
df = pd.DataFrame(data)
 
g = sns.FacetGrid(df, col="country", aspect=1.5)
g.map_dataframe(sns.lineplot, x='yearWeek', y='conversionRate')
for ax in g.axes.flat:
    ax.set_xticks(ax.get_xticks()[::2])
    plt.setp(ax.get_xticklabels(), rotation=45)
Instead of slicing your dataframe you can just define the tick distance with matplotlib.ticker. This is very useful for all kinds of plots where you don't want to have auto-ticks.
See your modified code:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
data = {'yearWeek': ['2022-W1','2022-W2','2022-W3','2022-W4','2022-W5','2022-W6','2022-W7','2022-W8','2022-W9','2022-W10','2022-W11','2022-W12']*3,
'country': ['US','US','US','US','US','US','US','US','US','US','US','US'] + ['India','India','India','India','India','India','India','India','India','India','India','India'] + ['Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia'],
'conversionRate': [np.random.rand() for i in range(12*3)]
}
df = pd.DataFrame(data)
g = sns.FacetGrid(df, col="country", aspect=1.5)
g.map_dataframe(sns.lineplot, x='yearWeek', y='conversionRate')
for ax in g.axes.flat:
    xtick_spacing = 2
    ax.xaxis.set_major_locator(ticker.MultipleLocator(xtick_spacing))
    # ax.set_xticks(ax.get_xticks()[::2])
    plt.setp(ax.get_xticklabels(), rotation=45)
Result:
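As a rough check of which labels remain, here is a sketch in plain matplotlib rather than seaborn (an assumption made to keep it dependency-light; the rates values are placeholders). String x values are placed at positions 0, 1, 2, ..., so MultipleLocator(2) keeps every second one:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

weeks = [f"2022-W{i}" for i in range(1, 13)]
rates = [0.1 * i for i in range(12)]  # placeholder values

fig, ax = plt.subplots()
ax.plot(weeks, rates)  # string x values become categories at positions 0..11
ax.xaxis.set_major_locator(ticker.MultipleLocator(2))

# The locator may also propose ticks just outside the data range, so filter them.
kept = [weeks[int(t)] for t in ax.get_xticks() if 0 <= t < len(weeks)]
print(kept)
```

The kept labels are the odd-numbered weeks (W1, W3, ...), because the first category sits at position 0.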
If I understand correctly, the question is mostly about pandas.
You just need to slice df appropriately to get every second yearWeek: df = df[1::2].
Additionally, you can use reset_index with drop=True argument to "reset the index to the default integer index" of the DataFrame df post-slicing.
Full code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# generate reproducible `conversionRate`
rng = np.random.default_rng(12)
data = {'yearWeek': ['2022-W1','2022-W2','2022-W3','2022-W4','2022-W5','2022-W6','2022-W7','2022-W8','2022-W9','2022-W10','2022-W11','2022-W12']*3,
'country': ['US','US','US','US','US','US','US','US','US','US','US','US'] + ['India','India','India','India','India','India','India','India','India','India','India','India'] + ['Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia','Australia'],
'conversionRate': [rng.random() for i in range(12*3)]
}
df = pd.DataFrame(data)
df = df[1::2].reset_index(drop=True)
print(df)
g = sns.FacetGrid(df, col="country", aspect=1.5)
g.map_dataframe(sns.lineplot, x='yearWeek', y='conversionRate')
for ax in g.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)
plt.show()
with df as:
    yearWeek    country  conversionRate
0    2022-W2         US        0.946753
1    2022-W4         US        0.179291
2    2022-W6         US        0.230541
3    2022-W8         US        0.115079
4   2022-W10         US        0.858130
5   2022-W12         US        0.541466
6    2022-W2      India        0.257955
7    2022-W4      India        0.453616
8    2022-W6      India        0.927517
9    2022-W8      India        0.187890
10  2022-W10      India        0.946619
11  2022-W12      India        0.880250
12   2022-W2  Australia        0.936696
13   2022-W4  Australia        0.871556
14   2022-W6  Australia        0.219390
15   2022-W8  Australia        0.661634
16  2022-W10  Australia        0.201345
17  2022-W12  Australia        0.763625
Finally, if you don't want to change the original df, please check Returning view vs. copy as a precautionary measure.
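A minimal sketch of that precaution, with placeholder data and only two countries: taking an explicit .copy() of the slice keeps the original df intact. It assumes the rows are ordered by week within each country and that each country has an even number of weeks, so a stride of two picks the same weeks everywhere. Note that this approach drops rows from the plotted data, not just tick labels.

```python
import pandas as pd

weeks = [f"2022-W{i}" for i in range(1, 13)]
df = pd.DataFrame({
    "yearWeek": weeks * 2,
    "country": ["US"] * 12 + ["India"] * 12,
    "conversionRate": [i / 24 for i in range(24)],  # placeholder values
})

# Take every second row into an independent copy, leaving df untouched.
every_second = df.iloc[1::2].copy().reset_index(drop=True)

# Changing the copy does not write through to the original frame.
every_second["conversionRate"] = 0.0

print(every_second["yearWeek"].tolist()[:6])
print(len(df), len(every_second))
```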

How to draw pandas dataframe using Matplotlib hist with multiple y axes

I have a dataframe as below:
   frame_id time_stamp    pixels  step
0        50   06:34:10  0.000000     0
1       100   06:38:20  0.000000     0
2       150   06:42:30  3.770903     1
3       200   06:46:40  3.312285     1
4       250   06:50:50  3.077356     0
5       300   06:55:00  2.862603     0
I want to draw two y-axes in one plot: one for pixels and the other for step, with time_stamp on the x-axis. The plot for step should look like the green line in this image:
Here's an example that could help. Change d1 and d2 as per your variables and the respective labels as well.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed=0)
d1 = rng.normal(loc=20, scale=5, size=200)
d2 = rng.normal(loc=30, scale=5, size=500)
fig, ax1 = plt.subplots(figsize=(9,5))
#create twin axes
ax2 = ax1.twinx()
ax1.hist([d1], bins=15, histtype='barstacked', linewidth=2,
alpha=0.7)
ax2.hist([d2], bins=15, histtype='step', linewidth=2,
alpha=0.7)
ax1.set_xlabel('Interval')
ax1.set_ylabel('d1 freq')
ax2.set_ylabel('d2 freq')
plt.show()
Getting the bar labels is not easy with the two types of histograms in the same plot using matplotlib.
Instead of histograms you could use bar plots to get the desired output. I have also added in a function to help get the bar labels.
import matplotlib.pyplot as plt
import numpy as np
time = ['06:34:10','06:38:20','06:42:30','06:46:40','06:50:50','06:55:00']
step = [0,0,1,1,0,0]
pixl = [0.00,0.00,3.77,3.31,3.077,2.862]
#function to add labels
def addlabels(x,y):
    for i in range(len(x)):
        plt.text(i, y[i], y[i], ha = 'center')
fig, ax1 = plt.subplots(figsize=(9,5))
#generate twin axes
ax2 = ax1.twinx()
ax1.step(time,step, 'k',where="mid",linewidth=1)
ax2.bar(time,pixl,linewidth=1)
addlabels(time,step)
addlabels(time,pixl)
ax1.set_xlabel('Time')
ax1.set_ylabel('Step')
ax2.set_ylabel('Pixels')
plt.show()
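As a variation on the example above, here is a sketch that starts from the question's dataframe and uses Axes.bar_label (available in matplotlib 3.4 and later) instead of a hand-written label function. Putting the bars on the left axis and the step line on the right, and the "%.2f" label format, are choices made for this illustration only:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "time_stamp": ["06:34:10", "06:38:20", "06:42:30", "06:46:40", "06:50:50", "06:55:00"],
    "pixels": [0.000000, 0.000000, 3.770903, 3.312285, 3.077356, 2.862603],
    "step": [0, 0, 1, 1, 0, 0],
})

fig, ax1 = plt.subplots(figsize=(9, 5))
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

# Bars for pixels on the left axis, labelled with two decimals.
bars = ax1.bar(df["time_stamp"], df["pixels"], alpha=0.7)
texts = ax1.bar_label(bars, fmt="%.2f")

# Step line for step on the right axis.
ax2.step(df["time_stamp"], df["step"], color="green", where="mid")

ax1.set_xlabel("Time")
ax1.set_ylabel("Pixels")
ax2.set_ylabel("Step")
```

Call plt.show() or fig.savefig(...) afterwards to view the figure.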

How to format in Matplotlib the x axis ticks to the only hours and minutes and define regular intervals?

I would like to have in the x axis the following ticks numbers and intervals:
[6:00, 8:00, 10:00, 12:00, 14:00, 16:00, 18:00]. The point at '12:00' should also be in the center of the figure; right now it is shifted to the right.
I tried to convert the column 'time' to a datetime format, but I get an error:
TypeError: <class 'datetime.time'> is not convertible to datetime
My code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df3 = pd.read_excel(r'results.xlsx')
# failed attempt
df3["time"] = pd.to_datetime(df3["time"]).dt.date
# plotting
color = 'black'
ax1 = sns.lineplot(x = 'time', y = 'parabolic', data = df3, color = color,
label='Light intensity')
ax2 = ax1.twinx()
ax2 = sns.scatterplot(x = 'time', y = 'Gas-exchange_p', hue= 'Sampling',
marker='v', s=200, data = df3, label= 'Measurement time points')
plt.legend()
plt.show()
My dataframe in excel looks like this:
time y
12:00:00 AM 0
12:01:00 AM 0
12:02:00 AM 0
...
2:00:40 PM 416
...
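One possible direction, sketched under assumptions: the TypeError suggests the time column holds datetime.time objects, which can be combined with an arbitrary date to get full datetimes that matplotlib's date tools understand. The data below is a placeholder standing in for results.xlsx, and the date 2000-01-01 is arbitrary. Setting the x-limits to 6:00 and 18:00 is what centers 12:00.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data: one datetime.time per hour (stands in for the Excel sheet).
df3 = pd.DataFrame({"time": [dt.time(h, 0) for h in range(24)], "y": range(24)})

# Attach an arbitrary date so each time becomes a full datetime.
day = dt.date(2000, 1, 1)
df3["time"] = [dt.datetime.combine(day, t) for t in df3["time"]]

fig, ax = plt.subplots()
ax.plot(df3["time"], df3["y"], color="black")

# Ticks every two hours from 6:00 to 18:00, shown as hours and minutes only.
ax.xaxis.set_major_locator(mdates.HourLocator(byhour=range(6, 19, 2)))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))

# Symmetric limits around noon put 12:00 in the middle of the axis.
ax.set_xlim(dt.datetime.combine(day, dt.time(6)), dt.datetime.combine(day, dt.time(18)))

shown = [mdates.num2date(t).strftime("%H:%M") for t in ax.get_xticks()]
print(shown)
```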

Plotting annual mean and standard deviation in different colors for each year

I have data for several years and have calculated the mean and standard deviation for each year. Now I want to plot each year's mean as a scatter point and fill the band between mean minus and mean plus one standard deviation, using a different color for each year.
df_wc.set_index('Date').resample('Y')["Ratio(a/w)"].mean() returns only the last date of each year (as shown in the data set below), but I want the filled standard-deviation band to span the entire year.
Sample Data set:
Date        Mean      Std_dv
1858-12-31  1.284273  0.403052
1859-12-31  1.235267  0.373283
1860-12-31  1.093308  0.183646
1861-12-31  1.403693  0.400722
That's a very good question, and it does not have an easy answer. If I have understood the problem correctly, you need a fill plot with a different colour for each year, where the upper and lower bounds are mean + std and mean - std?
I formed a custom time series, and this is how I plotted the values with the upper and lower bounds:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection,PatchCollection
from matplotlib.colors import ListedColormap, BoundaryNorm
import pandas as pd
ts = range(10)
num_classes = len(ts)
df = pd.DataFrame(data={'TOTAL': np.random.rand(len(ts)), 'Label': list(range(0, num_classes))}, index=ts)
df['UB'] = df['TOTAL'] + 2
df['LB'] = df['TOTAL'] - 2
print(df)
colors = ['r', 'g', 'b', 'y', 'purple', 'orange', 'k', 'pink', 'grey', 'violet']
cmap = ListedColormap(colors)
norm = BoundaryNorm(range(num_classes+1), cmap.N)
points = np.array([df.index, df['TOTAL']]).T.reshape(-1, 1, 2)
pointsUB = np.array([df.index, df['UB']]).T.reshape(-1, 1, 2)
pointsLB = np.array([df.index, df['LB']]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
segmentsUB = np.concatenate([pointsUB[:-1], pointsUB[1:]], axis=1)
segmentsLB = np.concatenate([pointsLB[:-1], pointsLB[1:]], axis=1)
lc = LineCollection(segments, cmap=cmap, norm=norm, linestyles='dashed')
lc.set_array(df['Label'])
lcUB = LineCollection(segmentsUB, cmap=cmap, norm=norm, linestyles='solid')
lcUB.set_array(df['Label'])
lcLB = LineCollection(segmentsLB, cmap=cmap, norm=norm, linestyles='solid')
lcLB.set_array(df['Label'])
fig1 = plt.figure()
plt.gca().add_collection(lc)
plt.gca().add_collection(lcUB)
plt.gca().add_collection(lcLB)
for i in range(len(colors)):
    plt.fill_between(df.index, df['UB'], df['LB'], where=((df.index >= i) & (df.index <= i+1)), alpha=0.1, color=colors[i])
plt.xlim(df.index.min(), df.index.max())
plt.ylim(-3.1, 3.1)
plt.show()
And the result dataframe obtained looks like this:
      TOTAL  Label        UB        LB
0  0.681455      0  2.681455 -1.318545
1  0.987058      1  2.987058 -1.012942
2  0.212432      2  2.212432 -1.787568
3  0.252284      3  2.252284 -1.747716
4  0.886021      4  2.886021 -1.113979
5  0.369499      5  2.369499 -1.630501
6  0.765192      6  2.765192 -1.234808
7  0.747923      7  2.747923 -1.252077
8  0.543212      8  2.543212 -1.456788
9  0.793860      9  2.793860 -1.206140
And the plot looks like this:
Let me know if this helps! :)
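For comparison, a more direct sketch on the question's own sample rows, assuming each row's Date is the last day of its year: one fill_between band per year, running from 1 January to the row's date, plus a scatter point at mid-year for the mean. The tab10 colormap and the mid-year placement of the point are choices made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["1858-12-31", "1859-12-31", "1860-12-31", "1861-12-31"]),
    "Mean": [1.284273, 1.235267, 1.093308, 1.403693],
    "Std_dv": [0.403052, 0.373283, 0.183646, 0.400722],
})

cmap = plt.get_cmap("tab10")  # ten distinct colors, reused cyclically
fig, ax = plt.subplots()
bands = []
for i, row in enumerate(df.itertuples(index=False)):
    start = pd.Timestamp(year=row.Date.year, month=1, day=1)
    end = row.Date  # assumed to be the last day of the year
    color = cmap(i % 10)
    # Band from mean - std to mean + std, spanning the whole year.
    bands.append(ax.fill_between([start, end], row.Mean - row.Std_dv,
                                 row.Mean + row.Std_dv, color=color, alpha=0.3))
    # Mean drawn as a single point in the middle of the year.
    ax.scatter([start + (end - start) / 2], [row.Mean], color=color)

ax.set_xlabel("Date")
ax.set_ylabel("Mean ± Std_dv")
```

Call plt.show() or fig.savefig(...) afterwards to view the figure.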

ax.twinx label appears twice

I have been trying to make a chart from an Excel file, using Matplotlib and Seaborn. The code is from the internet, adapted to what I want.
The issue is that the legend appears twice.
Do you have any recommendations?
Excel table is:
      Month  Value (tsd eur)  Total MAE
0  Mar 2020             14.0     1714.0
1  Apr 2020             22.5     1736.5
2  Jun 2020            198.0     1934.5
3  Jan 2021             45.0     1979.5
4  Feb 2021             60.0     2039.5
5  Jan 2022             67.0     2106.5
6  Feb 2022            230.0     2336.5
7  Mar 2022            500.0     2836.5
Code is:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
mae=pd.read_excel('Book1.xlsx')
mae['Month'] = mae['Month'].apply(lambda x: pd.Timestamp(x).strftime('%b %Y'))
a=mae['Value (tsd eur)']
b=mae['Total MAE']
#Create combo chart
fig, ax1 = plt.subplots(figsize=(20,12))
color = 'tab:green'
#bar plot creation
ax1.set_title('MAE Investments', fontsize=25)
ax1.set_xlabel('Month', fontsize=23)
ax1.set_ylabel('Investments (tsd. eur)', fontsize=23)
ax1 = sns.barplot(x='Month', y='Value (tsd eur)', data = mae, palette='Blues',label="Value (tsd eur)")
ax1.tick_params(axis='y',labelsize=20)
ax1.tick_params(axis='x', which='major', labelsize=20, labelrotation=40)
#specify we want to share the same x-axis
ax2 = ax1.twinx()
color = 'tab:red'
#line plot creation
ax2.set_ylabel('Total MAE Value', fontsize=16)
ax2 = sns.lineplot(x='Month', y='Total MAE', data = mae, sort=False, color='blue',label="Total MAE")
ax2.tick_params(axis='y', color=color,labelsize=20)
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
ax1.legend(h1+h2, l1+l2, loc=2, prop={'size': 24})
for i,j in b.items():
    ax2.annotate(str(j), xy=(i, j+30))
for i,j in a.items():
    ax1.annotate(str(j), xy=(i, j+2))
#show plot
print(mae)
plt.show()
Update: found the answer here:
Secondary axis with twinx(): how to add to legend?
code used:
lines, labels =ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax2.legend(lines + lines2, labels + labels2, title="Legend", loc=2, prop={'size': 24})
instead of:
for i,j in b.items():
    ax2.annotate(str(j), xy=(i, j+30))
for i,j in a.items():
    ax1.annotate(str(j), xy=(i, j+2))
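The merged-legend idea from the linked answer can be sketched in plain matplotlib, using the first three rows of the table above as data (seaborn is left out as a simplifying assumption). A likely cause of the duplicate is that seaborn adds its own legend to the twin axes when label= is passed, so building the combined legend on ax2 replaces it instead of adding a second one on ax1:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt

months = ["Mar 2020", "Apr 2020", "Jun 2020"]
values = [14.0, 22.5, 198.0]
totals = [1714.0, 1736.5, 1934.5]

fig, ax1 = plt.subplots()
ax1.bar(months, values, color="tab:blue", label="Value (tsd eur)")

ax2 = ax1.twinx()  # second y-axis sharing the same x-axis
ax2.plot(months, totals, color="tab:red", label="Total MAE")

# Collect the handles from both axes and draw exactly one legend.
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
legend = ax2.legend(lines + lines2, labels + labels2, title="Legend", loc=2)

print([text.get_text() for text in legend.get_texts()])
```

Only ax2 carries a legend here; ax1.get_legend() stays None, so nothing is drawn twice.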