matplotlib plot from dataframe but shift dates in x labels - pandas

I have this dataframe:
dates;A;B;C
2018-01-31;1;2;5
2018-02-28;1;4;3
2018-03-31;1;5;5
2018-04-30;1;6;3
2018-05-31;1;6;7
2018-06-30;1;7;3
2018-07-31;1;9;9
2018-08-31;1;2;3
2018-09-30;1;2;10
2018-10-31;1;4;3
2018-11-30;1;7;11
2018-12-31;1;2;3
I read it:
dfr = pd.read_csv('test.dat', sep=';', header = 0, index_col=0, parse_dates=True)
and then I try to plot it:
width = 5
dfr.index = pd.to_datetime(dfr.index)
x = date2num(dfr.index)
axs.bar(x-0.5*width,dfr.iloc[:,1], width=width)
axs.bar(x+0.5*width,dfr.iloc[:,2], width=width)
axs.xaxis_date()
months = dates.MonthLocator()
axs.xaxis.set_major_formatter(dates.DateFormatter(r'\textbf{%B}'))
months_f = dates.DateFormatter('%B')
axs.xaxis.set_major_locator(months)
plt.setp( axs.xaxis.get_majorticklabels(), rotation=90)
here the modules imported:
import matplotlib.pyplot as plt
from matplotlib.dates import date2num
import datetime
import pandas as pd
import matplotlib.dates as dates
and here the result:
I do not understand why the x labels start with 'Feb'.
I would like to have 'Jan, Feb, Mar, ...' as the labels on the x axis.
Thanks in advance

The bar heights in your chart do not correspond to the labelled months: the values shown for Feb are actually those of Jan. Your data is stamped at the end of each month, while MonthLocator places its ticks on the first day of each month by default, so the bars for Jan 31 end up next to the tick for Feb 1. The problem is therefore in how the axis is labelled, not in the plot order.
I'm not very familiar with the packages you used, so here is a different way of making your plot:
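### 'dates' was read as the index (index_col=0); turn it back into a column
dfr = dfr.reset_index()
dfr['dates'] = pd.to_datetime(dfr['dates'])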
### group by months
month_vals = dfr.groupby(dfr['dates'].map(lambda x: x.month))
month_vals = sorted(month_vals, key=lambda m: m[0])
fig, axs = plt.subplots()
spacing = 0.15
### Create the list of months and the corresponding dataframes
months, df_months = zip(*month_vals)
### In your case, each month has exactly one entry, but in case there are more, sum over all of them
axs.bar([m-spacing for m in months], [df_m.loc[:,'B'].sum() for df_m in df_months], width=0.3)
axs.bar([m+spacing for m in months], [df_m.loc[:,'C'].sum() for df_m in df_months], width=0.3)
axs.set_xticks(months)
### 1900 and 1 are dummy values; we are just initializing a datetime instance here
axs.set_xticklabels([datetime.date(1900, m, 1).strftime('%b') for m in months])
Output:
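If you would rather keep the date axis from the question, one option is to move each month-end timestamp to the first day of its month, so that the bars sit where MonthLocator puts its ticks by default. This is only a minimal sketch: it assumes the data from the question is inlined through io.StringIO instead of being read from test.dat, and it selects a non-interactive backend so it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Data from the question, inlined for the sake of a self-contained example
csv_text = """dates;A;B;C
2018-01-31;1;2;5
2018-02-28;1;4;3
2018-03-31;1;5;5
2018-04-30;1;6;3
2018-05-31;1;6;7
2018-06-30;1;7;3
2018-07-31;1;9;9
2018-08-31;1;2;3
2018-09-30;1;2;10
2018-10-31;1;4;3
2018-11-30;1;7;11
2018-12-31;1;2;3
"""
dfr = pd.read_csv(io.StringIO(csv_text), sep=";", index_col=0, parse_dates=True)

# Move every month-end timestamp to the first day of the same month
month_start = dfr.index.to_period("M").to_timestamp()
x = mdates.date2num(month_start)

width = 5  # bar width in days
fig, axs = plt.subplots()
axs.bar(x - 0.5 * width, dfr["B"], width=width)
axs.bar(x + 0.5 * width, dfr["C"], width=width)

# One tick per month, labelled with the abbreviated month name
axs.xaxis.set_major_locator(mdates.MonthLocator())
axs.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
plt.setp(axs.xaxis.get_majorticklabels(), rotation=90)
```

The two bars of each month are then centred on that month's tick instead of sitting just before the next one.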

Related

Barplot per each ax in matplotlib

I have the following dataset, ratings in stars for two fictitious places:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
Since the rating is categorical (not continuous) data, I convert it to a category:
df['rating_cat'] = pd.Categorical(df['rating'])
What I want is one bar plot per fictitious place ('A' or 'B'), showing the count of each rating. This is the intended plot:
I guess a for loop over each value in id could work, but I have trouble getting the counts:
fig, ax = plt.subplots(1,2,figsize=(6,6))
axs = ax.flatten()
cats = df['rating_cat'].cat.categories.tolist()
ids_uniques = df.id.unique()
for i in range(len(ids_uniques)):
    ax[i].bar(df[df['id']==ids_uniques[i]], df['rating'].size())
But it raises the error TypeError: 'int' object is not callable
Perhaps I am making this more complicated than it needs to be. Could you please guide me with this code?
The pure matplotlib way:
from math import ceil
# Prepare the data for plotting
df_plot = df.groupby(["id", "rating"]).size()
unique_ids = df_plot.index.get_level_values("id").unique()
# Calculate the grid spec. This will be a n x 2 grid
# to fit one chart by id
ncols = 2
nrows = ceil(len(unique_ids) / ncols)
fig = plt.figure(figsize=(6,6))
for i, id_ in enumerate(unique_ids):
    # In a figure grid spanning nrows x ncols, plot into the
    # axes at position i + 1
    ax = fig.add_subplot(nrows, ncols, i+1)
    # pandas expects the target axes in the `ax` keyword
    df_plot.xs(id_).plot(ax=ax, kind="bar")
You can simplify things a lot with Seaborn:
import seaborn as sns
sns.catplot(data=df, x="rating", col="id", col_wrap=2, kind="count")
If you're ok with installing a new library, seaborn has a very helpful countplot. Seaborn uses matplotlib under the hood and makes certain plots easier.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame({'id':['A','A','A','A','A','A','A','B','B','B','B','B','B'],
'rating':[1,2,4,5,5,5,3,1,3,3,3,5,2]})
sns.countplot(
    data = df,
    x = 'rating',
    hue = 'id',
)
plt.show()
plt.close()
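As for the TypeError in the question: on a pandas Series, size is an integer attribute rather than a method, so df['rating'].size() tries to call an int. A minimal sketch of the asker's loop with that fixed, assuming value_counts is an acceptable way to count the ratings (the dictionary counts_by_id is only there for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'id': ['A','A','A','A','A','A','A','B','B','B','B','B','B'],
                   'rating': [1,2,4,5,5,5,3,1,3,3,3,5,2]})

cats = sorted(df['rating'].unique())   # every rating that occurs anywhere
ids_uniques = df['id'].unique()

# squeeze=False keeps a 2-D array of axes even when there is a single id
fig, axs = plt.subplots(1, len(ids_uniques), figsize=(6, 3),
                        sharey=True, squeeze=False)

counts_by_id = {}  # illustrative: keeps the counts for inspection
for ax, id_ in zip(axs[0], ids_uniques):
    # Count each rating for this id; ratings that never occur get a 0
    counts = (df.loc[df['id'] == id_, 'rating']
                .value_counts()
                .reindex(cats, fill_value=0))
    counts_by_id[id_] = counts
    # Strings as x values give one evenly spaced bar per category
    ax.bar([str(c) for c in cats], counts.to_numpy())
    ax.set_title(id_)
    ax.set_xlabel('rating')
```

Reindexing against the full list of ratings keeps the x axis identical across both panels, even though place B has no rating of 4.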

Plotting annual mean and standard deviation in different colors for each year

I have data for several years and have calculated the mean and standard deviation for each year. Now I want to plot each row's mean as a scatter plot and fill the band between mean minus and mean plus one standard deviation, in a different color for each year.
df_wc.set_index('Date').resample('Y')["Ratio(a/w)"].mean() returns only the last date of each year (as shown in the data set below), but I want the filled band for the standard deviation to span the entire year.
Sample Data set:
Date        Mean      Std_dv
1858-12-31  1.284273  0.403052
1859-12-31  1.235267  0.373283
1860-12-31  1.093308  0.183646
1861-12-31  1.403693  0.400722
That's a good question without an easy answer. If I understand the problem correctly, you need a fill plot with a different colour for each year, bounded above by mean + std and below by mean - std.
To illustrate, I built a custom series (integer x values and a fixed band of ±2 stand in for your dates and standard deviations) and plotted the values with their upper and lower bounds:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection,PatchCollection
from matplotlib.colors import ListedColormap, BoundaryNorm
import pandas as pd
ts = range(10)
num_classes = len(ts)
df = pd.DataFrame(data={'TOTAL': np.random.rand(len(ts)), 'Label': list(range(0, num_classes))}, index=ts)
df['UB'] = df['TOTAL'] + 2
df['LB'] = df['TOTAL'] - 2
print(df)
colors = ['r', 'g', 'b', 'y', 'purple', 'orange', 'k', 'pink', 'grey', 'violet']
cmap = ListedColormap(colors)
norm = BoundaryNorm(range(num_classes+1), cmap.N)
points = np.array([df.index, df['TOTAL']]).T.reshape(-1, 1, 2)
pointsUB = np.array([df.index, df['UB']]).T.reshape(-1, 1, 2)
pointsLB = np.array([df.index, df['LB']]).T.reshape(-1, 1, 2)
segments = np.concatenate([points[:-1], points[1:]], axis=1)
segmentsUB = np.concatenate([pointsUB[:-1], pointsUB[1:]], axis=1)
segmentsLB = np.concatenate([pointsLB[:-1], pointsLB[1:]], axis=1)
lc = LineCollection(segments, cmap=cmap, norm=norm, linestyles='dashed')
lc.set_array(df['Label'])
lcUB = LineCollection(segmentsUB, cmap=cmap, norm=norm, linestyles='solid')
lcUB.set_array(df['Label'])
lcLB = LineCollection(segmentsLB, cmap=cmap, norm=norm, linestyles='solid')
lcLB.set_array(df['Label'])
fig1 = plt.figure()
plt.gca().add_collection(lc)
plt.gca().add_collection(lcUB)
plt.gca().add_collection(lcLB)
for i in range(len(colors)):
    plt.fill_between(df.index, df['UB'], df['LB'], where=((df.index >= i) & (df.index <= i+1)), alpha=0.1, color=colors[i])
plt.xlim(df.index.min(), df.index.max())
plt.ylim(-3.1, 3.1)
plt.show()
And the result dataframe obtained looks like this:
TOTAL Label UB LB
0 0.681455 0 2.681455 -1.318545
1 0.987058 1 2.987058 -1.012942
2 0.212432 2 2.212432 -1.787568
3 0.252284 3 2.252284 -1.747716
4 0.886021 4 2.886021 -1.113979
5 0.369499 5 2.369499 -1.630501
6 0.765192 6 2.765192 -1.234808
7 0.747923 7 2.747923 -1.252077
8 0.543212 8 2.543212 -1.456788
9 0.793860 9 2.793860 -1.206140
And the plot looks like this:
Let me know if this helps! :)
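The same idea can be sketched directly on the sample data from the question. The snippet below is an assumption about what the asker wants rather than a tested solution for the full data set: each year gets a band from 1 January to the resampled year-end date, filled between mean - std and mean + std, with the mean drawn as a single point in the middle of the year. The list spans is only kept for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["1858-12-31", "1859-12-31", "1860-12-31", "1861-12-31"]),
    "Mean": [1.284273, 1.235267, 1.093308, 1.403693],
    "Std_dv": [0.403052, 0.373283, 0.183646, 0.400722],
})

fig, ax = plt.subplots()
spans = []  # illustrative: (start, end) of each yearly band
for i, row in enumerate(df.itertuples(index=False)):
    start = pd.Timestamp(year=row.Date.year, month=1, day=1)
    end = row.Date                 # year-end date produced by resample
    color = f"C{i % 10}"           # cycle through the default colour cycle
    spans.append((start, end))

    # Band covering the whole year between mean - std and mean + std
    ax.fill_between([start, end],
                    row.Mean - row.Std_dv,
                    row.Mean + row.Std_dv,
                    color=color, alpha=0.3)

    # Mean as a single point at mid-year
    mid = start + (end - start) / 2
    ax.scatter([mid], [row.Mean], color=color)

ax.set_xlabel("Date")
ax.set_ylabel("Mean ± Std_dv")
```

Because every band is drawn with its own fill_between call, there is no need for the where mask or the line collections used above.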

How to set x axis according to the numbers in the DATAFRAME

I am using Matplotlib to show a graph of some information that I get from the users.
I want the x axis to show the ID of each user and the y axis to show the number of wins they have.
I don't understand how to use the IDs of my users as the x axis labels.
My code:
import matplotlib.pyplot as plt
import matplotlib,pylab as pylab
import pandas as pd
import numpy as np
#df = pd.read_csv('Players.csv')
df = pd.read_json('Players.json')
# df.groupby('ID').sum()['Win']
axisx = df.groupby('ID').sum()['Win'].keys()
axisy = df.groupby('ID').sum()['Win'].values
fig = pylab.gcf()
# fig.canvas.set_window_title('4 In A Row Statistic')
# img = plt.imread("Oi.jpeg")
# plt.imshow(img)
fig, ax = plt.subplots()
ax.set_xticklabels(axisx.to_list())
plt.title('Game Statistic',fontsize=20,color='r')
plt.xlabel('ID Players',color='r')
plt.ylabel('Wins',color='r')
x = np.arange(len(axisx))
rects = ax.bar(x, axisy, width=0.1)
plt.show()
Use plt.xticks(x, array_of_id). xticks sets the current tick locations and labels of the x-axis, so pass the bar positions x as the locations and the IDs as the labels.
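A minimal sketch of this with the Axes methods instead of plt.xticks. Players.json is not shown in the question, so the small DataFrame below is made-up stand-in data with the same ID and Win columns:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for Players.json
df = pd.DataFrame({"ID": [101, 102, 101, 103, 102, 101],
                   "Win": [1, 0, 1, 1, 1, 0]})

wins = df.groupby("ID")["Win"].sum()   # total wins per player ID
x = np.arange(len(wins))               # one bar position per ID

fig, ax = plt.subplots()
ax.bar(x, wins.to_numpy(), width=0.5)

# Put a tick under every bar first, then label the ticks with the IDs
ax.set_xticks(x)
ax.set_xticklabels([str(i) for i in wins.index])

ax.set_title("Game Statistic", fontsize=20, color="r")
ax.set_xlabel("ID Players", color="r")
ax.set_ylabel("Wins", color="r")
```

Setting the tick locations before the labels matters: in the question's code set_xticklabels is called while the ticks are still at their automatic positions, so the labels do not line up with the bars.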

Only plotting observed dates in matplotlib, skipping range of dates

I have a simple dataframe I am plotting in matplotlib. However, the plot is showing the range of the dates, rather than just the two observed data points.
How can I only plot the two data points and not the range of the dates?
df structure:
Date Number
2018-01-01 12:00:00 1
2018-02-01 12:00:00 2
Output of the matplotlib code:
Here is what I expected (this was done using a string and not a date on the x-axis data):
df code:
import pandas as pd
df = pd.DataFrame([['2018-01-01 12:00:00', 1], ['2018-02-01 12:00:00',2]], columns=['Date', 'Number'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Date'],inplace=True)
Plot code:
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(
    figsize=(4,5),
    dpi=72
)
width = 0.75
#starts the bar chart creation
ax1.bar(df.index, df['Number'],
        width,
        align='center',
        color=('#666666', '#333333'),
        edgecolor='#FF0000',
        linewidth=2
        )
ax1.set_ylim(0,3)
ax1.set_ylabel('Score')
fig.autofmt_xdate()
#Title
plt.title('Scores by group and gender')
plt.tight_layout()
plt.show()
Try adding something like:
import matplotlib.dates as mdates
myFmt = mdates.DateFormatter('%y-%m-%d')
ax1.xaxis.set_major_formatter(myFmt)
plt.xticks(df.index)
I think the dates are converted to numbers (in units of days) at plot time, so width = 0.75 is only three quarters of a day, which is very thin on an axis spanning a month. Try something bigger (like width = 20).
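To see the scale involved, the conversion can be done explicitly with matplotlib.dates.date2num, which returns the dates as floating-point day numbers. A small check using the two timestamps from the question (the variable names are just for illustration):

```python
import numpy as np
import matplotlib.dates as mdates

# The two observed timestamps from the question
stamps = np.array(["2018-01-01T12:00:00", "2018-02-01T12:00:00"],
                  dtype="datetime64[s]")

nums = mdates.date2num(stamps)   # floats, measured in days
gap_days = nums[1] - nums[0]     # distance between the two bars in axis units
print(gap_days)
```

The two bars are 31 axis units apart, so a width of 20 fills most of the gap while 0.75 is barely visible.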
Matplotlib bar plots are numeric in nature. If you want a categorical bar plot instead, you may use pandas bar plots.
df.plot.bar()
You may then want to beautify the labels a bit
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame([['2018-01-01 12:00:00', 1], ['2018-02-01 12:00:00',2]], columns=['Date', 'Number'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Date'],inplace=True)
ax = df.plot.bar()
ax.tick_params(axis="x", rotation=0)
ax.set_xticklabels([t.get_text().split()[0] for t in ax.get_xticklabels()])
plt.show()
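Staying with plain matplotlib, a categorical axis can also be obtained by passing strings as the x values, which is what the expected plot in the question did. A minimal sketch, assuming the labels are built from the index with strftime:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame([['2018-01-01 12:00:00', 1], ['2018-02-01 12:00:00', 2]],
                  columns=['Date', 'Number'])
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Strings are treated as categories, so the bars land at x = 0, 1, ...
labels = df.index.strftime('%Y-%m-%d').tolist()

fig, ax1 = plt.subplots(figsize=(4, 5), dpi=72)
bars = ax1.bar(labels, df['Number'], width=0.75,
               color=('#666666', '#333333'), edgecolor='#FF0000', linewidth=2)
ax1.set_ylim(0, 3)
ax1.set_ylabel('Score')
```

With a categorical axis the width is measured in category slots, so 0.75 gives the proportions of the expected plot.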

Pandas boxplot side by side for different DataFrame

There are nice examples online about plotting side-by-side boxplots, but with my data in two different pandas DataFrames and some subplots already in place, I have not managed to get my boxplots next to each other instead of overlapping.
My code is as follows:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
mpl.use('agg')
fig, axarr = plt.subplots(3,sharex=True,sharey=True,figsize=(9,6))
month = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
percentiles = [90,95,98]
nr = 0
for p in percentiles:
    future_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    present_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    Future = future_data.to_numpy()
    Present = present_data.to_numpy()
    pp = axarr[nr].boxplot(Present,patch_artist=True, showfliers=False)
    fp = axarr[nr].boxplot(Future, patch_artist=True, showfliers=False)
    nr += 1
The result looks as follows:
Overlapping Boxplots
Could you help me make sure the boxes are next to each other, so I can compare them without being bothered by the overlap?
Thank you!
EDIT: I have reduced the code somewhat so it can run like this.
You need to position your boxes manually, i.e. pass the positions as an array to the positions argument of boxplot. Here it makes sense to shift one set by -0.2 and the other by +0.2 from their integer positions. You can then choose the widths so that neighbouring boxes do not overlap, i.e. each width is no larger than the difference in positions (0.4 here).
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, axarr = plt.subplots(3,sharex=True,sharey=True,figsize=(9,6))
month = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
percentiles = [90,95,98]
nr = 0
for p in percentiles:
    future_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    present_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    Future = future_data.to_numpy()
    Present = present_data.to_numpy()
    pp = axarr[nr].boxplot(Present,patch_artist=True, showfliers=False,
                           positions=np.arange(Present.shape[1])-.2, widths=0.4)
    fp = axarr[nr].boxplot(Future, patch_artist=True, showfliers=False,
                           positions=np.arange(Present.shape[1])+.2, widths=0.4)
    nr += 1
axarr[-1].set_xticks(np.arange(len(month)))
axarr[-1].set_xticklabels(month)
axarr[-1].set_xlim(-0.5,len(month)-.5)
plt.show()
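The shifts of -0.2 and +0.2 generalise to more than two data sets. The helper below is a hypothetical sketch (grouped_positions is not a matplotlib or pandas function) that splits a chosen group width evenly between the series and returns the positions and widths to hand to boxplot:

```python
import numpy as np

def grouped_positions(n_groups, n_series, group_width=0.8):
    """Return (positions, box_width) for side-by-side boxes.

    positions[s] holds the x centres of series s, one per group;
    the groups themselves sit at the integers 0 .. n_groups - 1.
    """
    box_width = group_width / n_series
    # Centre offsets of each series relative to the integer group position
    offsets = (np.arange(n_series) - (n_series - 1) / 2) * box_width
    positions = [np.arange(n_groups) + off for off in offsets]
    return positions, box_width

# Two series over 12 months reproduce the -0.2 / +0.2 shifts used above
positions, box_width = grouped_positions(n_groups=12, n_series=2)
print(positions[0][:3], positions[1][:3], box_width)
```

Each series s would then be drawn with boxplot(data_s, positions=positions[s], widths=box_width). Choosing a group_width below 1 leaves a gap between neighbouring months.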