Inconsistent internal representation of dates in matplotlib/pandas - pandas

import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-02'])
data = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (736066.7, 736469.3)
Now, if we change the last date:
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (184.8, 189.2)
The first example seems consistent with the matplotlib docs:
Matplotlib represents dates using floating point numbers specifying the number of days since 0001-01-01 UTC, plus 1
Why does the second example return something seemingly completely different? I'm using pandas version 0.22.0 and matplotlib version 2.2.2.
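As a quick sanity check of the first output, here is a minimal sketch using only the standard library. It assumes the old matplotlib convention quoted above (used before matplotlib 3.3, which moved the default epoch to 1970-01-01), where a date number coincides with Python's proleptic Gregorian ordinal. The 5% padding is an assumption about the default axis margins.

```python
from datetime import date

# Assumption: old-style matplotlib date numbers equal Python's proleptic
# Gregorian ordinal, in which 0001-01-01 is day 1.
first = date(2016, 5, 1).toordinal()
last = date(2017, 5, 2).toordinal()

# Assumption: the axis limits are padded by 5% of the data span on each side.
pad = 0.05 * (last - first)
xlim = (first - pad, last + pad)

print(first, last)
print(xlim)
```

Under these assumptions the limits come out at roughly (736066.7, 736469.3), matching the first example.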

In the second example, if you look at the plots, rather than giving dates matplotlib is giving quarter values:
The dates in this case are exactly six months and therefore two quarters apart, which is presumably why you're seeing this behavior. While I can't find it in the docs, the numbers given by xlim in this case are consistent with being the number of quarters since the Unix Epoch (Jan. 1, 1970).
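To check the "quarters since the Unix epoch" reading, here is a small sketch in plain Python. The helper quarters_since_1970 is hypothetical (it is not a pandas function), and the 5% padding is an assumption about the default axis margins.

```python
from datetime import date


def quarters_since_1970(d: date) -> int:
    """Number of whole quarters between 1970Q1 and the quarter containing d."""
    return (d.year - 1970) * 4 + (d.month - 1) // 3


first = quarters_since_1970(date(2016, 5, 1))
last = quarters_since_1970(date(2017, 5, 1))

# Assumption: the axis limits are padded by 5% of the data span on each side.
pad = 0.05 * (last - first)
xlim = (first - pad, last + pad)

print(first, last)
print(xlim)
```

With this counting the two end dates fall in quarters 185 and 189, which after padding gives approximately (184.8, 189.2), the limits reported in the second example.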

Pandas uses different units to represent dates and times on the axes, depending on the range of dates/times in use. This means that different locators are in use.
In the first case,
print(ax.xaxis.get_major_locator())
# Out: pandas.plotting._converter.PandasAutoDateLocator
in the second case
print(ax.xaxis.get_major_locator())
# pandas.plotting._converter.TimeSeries_DateLocator
You may force pandas to always use the PandasAutoDateLocator using the x_compat argument,
df.plot(x_compat=True)
This ensures that you always get the same datetime definition, consistent with the matplotlib.dates convention.
The drawback is that this removes the nice quarterly ticking and replaces it with the standard ticking.
On the other hand, it allows you to use the very customizable matplotlib.dates tickers and formatters, for example to get quarterly ticks/labels:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
'b': [4, 5, 6]}, index=index)
ax = data.plot(x_compat=True)
# Quarterly ticks
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
# Formatting:
def func(x, pos):
    q = (mdates.num2date(x).month-1)//3+1
    tx = "Q{}".format(q)
    if q == 1:
        tx += "\n{}".format(mdates.num2date(x).year)
    return tx
ax.xaxis.set_major_formatter(mticker.FuncFormatter(func))
plt.setp(ax.get_xticklabels(), rotation=0, ha="center")
plt.show()

Related

Converting dict to dataframe of Solution point values & plotting

I am trying to plot some results obtained after optimisation using Gurobi.
I have converted the dictionary to a pandas DataFrame; its shape is 96×1.
How do I use this DataFrame to plot the values row by row (1st row value, 2nd row value, and so on)?
Can anyone help me with this?
x = {}
for t in time1:
    x[t] = [price_energy[t-1]*EnergyResource[174,t].X]
df = pd.DataFrame.from_dict(x, orient='index')
df
You can try pandas.DataFrame(data=x.values()) to properly create a pandas DataFrame while using row numbers as indices.
In the example below, I have generated a (pseudo) random dictionary with 10 values and stored it as a data frame using pandas.DataFrame, naming its only column xyz. To understand how indexing works, please see Indexing and selecting data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Create a dictionary 'x'
rng = np.random.default_rng(121)
x = dict(zip(np.arange(10), rng.random((1, 10))[0]))
# Create a dataframe from 'x'
df = pd.DataFrame(x.values(), index=x.keys(), columns=["xyz"])
print(df)
print(df.index)
# Plot the dataframe
plt.plot(df.index, df.xyz)
plt.show()
This prints df as:
        xyz
0  0.632816
1  0.297902
2  0.824260
3  0.580722
4  0.593562
5  0.793063
6  0.444513
7  0.386832
8  0.214222
9  0.029993
and gives df.index as:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
and also plots the figure:

Matplotlib boxplot select method to calculate the quartile values

Using boxplot from matplotlib.pyplot the quartile values are calculated by including the median. Can this be changed to NOT include the median?
For example, consider the ordered data set
2, 3, 4, 5, 6, 7, 8
If the median is NOT included, then Q1=3 and Q3=7. However, boxplot includes the median value, i.e. 5, and generates the figure below
Is it possible to change this behavior and NOT include the median in the calculation of the quartiles? This should correspond to Method 1 as described on the Wikipedia page Quartile. The code to generate the figure is listed below.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
data = [2, 3, 4, 5, 6, 7, 8]
fig = plt.figure(figsize=(6,1))
ax = fig.add_axes([0.1,0.25,0.8,0.8])
bp = ax.boxplot(data, '',
vert=False,
positions=[0.4],
widths=[0.3])
ax.set_xlim([0,9])
ax.set_ylim([0,1])
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks([])
ax.grid(which='major',axis='x',lw=0.1)
plt.show()
The question is motivated by the fact that several educational resources around the internet do not calculate the quartiles the way matplotlib's boxplot does by default. For example, in the online course "Statistics and probability" from Khan Academy, the quartiles are calculated as described in Method 1 on the Wikipedia page Quartile, while boxplot employs Method 2.
Consider an example from Khan Academy's course "Statistics and probability", section "Comparing range and interquartile range (IQR)". The daily high temperatures are recorded in Paradise, MI, for 7 days and found to be 16, 24, 26, 26, 26, 27, and 28 degrees Celsius. Describe the data with a boxplot and calculate the IQR.
The result of using the default settings in boxplot and the one presented by Prof. Khan are very different; see the figure below.
The IQR found by matplotlib is 1.5, and that calculated by Prof. Khan is 3.
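The two numbers can be reproduced with the standard library alone. This is only a sketch: it assumes that boxplot's default quartiles match linear interpolation between data points, which is what statistics.quantiles does with method="inclusive", and it implements Method 1 as the median of each half with the overall median left out.

```python
from statistics import median, quantiles

temps = sorted([16, 24, 26, 26, 26, 27, 28])

# Linear interpolation between data points; assumed here to mirror the
# default quartiles drawn by boxplot.
q1_lin, _, q3_lin = quantiles(temps, n=4, method="inclusive")

# Method 1: split the sorted data into halves, dropping the median itself
# when the number of points is odd, and take the median of each half.
half = len(temps) // 2
lower = temps[:half]
upper = temps[half + 1:] if len(temps) % 2 else temps[half:]
q1_m1, q3_m1 = median(lower), median(upper)

iqr_linear = q3_lin - q1_lin
iqr_method1 = q3_m1 - q1_m1
print(iqr_linear, iqr_method1)
```

For this data set the interpolated quartiles are 25 and 26.5 (IQR 1.5), while Method 1 gives 24 and 27 (IQR 3).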
As pointed out in the comments by @JohanC, boxplot cannot be configured directly to follow Method 1; a custom function is required. Therefore, neglecting the calculation of outliers, I updated the code to calculate the quartiles according to Method 1, so that it is comparable with the Khan Academy course. The code is listed below; it is not very pythonic, and suggestions are welcome.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
from matplotlib.ticker import MultipleLocator
def median(x):
    """
    x - input a sorted list of numbers
    Returns the midpoint number, for example
    in a list with an odd number of elements
    [1, 2, 3, 4, 5] returns 3
    for a list with an even number of elements the arithmetic mean
    of the two middle values is returned, e.g.
    [1, 2, 3, 4] returns 2.5
    """
    if len(x)&1:
        # Odd number of elements in list, e.g. x = [1,2,3] returns 2
        index_middle = int((len(x)-1)/2)
        median = x[index_middle]
    else:
        # Even number of elements in list, e.g. x = [-1,2] returns 0.5
        index_lower = int(len(x)/2-1)
        index_upper = int(len(x)/2)
        median = (x[index_lower]+x[index_upper])/2
    return median
def method_1_quartiles(x):
    """
    x - list of numbers
    Returns Q1, Q2, Q3 where the median is excluded from both halves
    """
    # Sort a copy so the caller's list is not modified
    x = sorted(x)
    N = len(x)
    if N&1:
        # Odd number of elements
        index_middle = int((N-1)/2)
        lower = x[0:index_middle] # Up to but not including the median
        upper = x[index_middle+1:N+1]
        Q1 = median(lower)
        Q2 = x[index_middle]
        Q3 = median(upper)
    else:
        # Even number of elements
        index_lower = int(N/2)
        lower = x[0:index_lower]
        upper = x[index_lower:N]
        Q1 = median(lower)
        Q2 = (x[index_lower-1]+x[index_lower])/2
        Q3 = median(upper)
    return Q1,Q2,Q3
data = [16,24,26, 26, 26,27,28]
fig = plt.figure(figsize=(6,1))
ax = fig.add_axes([0.1,0.25,0.8,0.8])
stats = cbook.boxplot_stats(data,)[0]
Q1_default = stats['q1']
Q3_default = stats['q3']
stats['whislo']=min(data)
stats['whishi']=max(data)
IQR_default = Q3_default - Q1_default
Q1, Q2, Q3 = method_1_quartiles(data)
IQR = Q3-Q1
stats['q1'] = Q1
stats['q3'] = Q3
print(f"IQR: {IQR}")
ax.bxp([stats],vert=False,manage_ticks=False,widths=[0.3],positions=[0.4],showfliers=False)
ax.set_xlim([15,30])
ax.set_ylim([0,1])
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks([])
ax.grid(which='major',axis='x',lw=0.1)
plt.show()
The graph generated is

How to change a seaborn histogram plot to work for hours of the day?

I have a pandas dataframe with lots of time intervals of varying start times and lengths. I am interested in the distribution of start times over 24 hours, so I have another column, Hour, containing just the hour of each start time. I have plotted a histogram using seaborn to look at the distribution, but the x axis starts at 0 and runs to 24. Is there a way to change it so that it runs from 8 to 8, wrapping around from 23 to 0, to give a better visualisation of my data from a time perspective? Thanks in advance.
sns.distplot(df2['Hour'], bins = 24, kde = False).set(xlim=(0,23))
If you want to have a custom order of x-values on your bar plot, I'd suggest using matplotlib directly and plotting your histogram simply as a bar plot with width=1 to get rid of the padding between bars.
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
# prepare sample data
dates = pd.date_range(
start=datetime(2020, 1, 1),
end=datetime(2020, 1, 7),
freq="H")
random_dates = np.random.choice(dates, 1000)
df = pd.DataFrame(data={"date":random_dates})
df["hour"] = df["date"].dt.hour
# set your preferred order of hours
hour_order = [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5,6,7]
# calculate frequencies of each hour and sort them
plot_df = (
    df["hour"]
    .value_counts()
    .rename_axis("hour", axis=0)
    .reset_index(name="freq")
    .set_index("hour")
    .loc[hour_order]
    .reset_index())
# day / night colour split
day_mask = ((8 <= plot_df["hour"]) & (plot_df["hour"] <= 20))
plot_df["color"] = np.where(day_mask, "skyblue", "midnightblue")
# actual plotting - note that you have to cast hours as strings
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.bar(
x=plot_df["hour"].astype(str),
height=plot_df["freq"],
color=plot_df["color"], width=1)
ax.set_xlabel('Hour')
ax.set_ylabel('Frequency')
plt.show()
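The hard-coded hour_order list can also be generated by rotating the 24 hours with modular arithmetic. The sketch below is plain Python; start_hour and the sample hours are made up for illustration.

```python
from collections import Counter

start_hour = 8  # hypothetical: the hour that should appear first on the x axis

# Rotate 0..23 so that the sequence starts at start_hour and wraps 23 -> 0.
hour_order = [(start_hour + i) % 24 for i in range(24)]

hours = [7, 8, 8, 23, 0, 0, 0, 15]  # made-up sample of start hours
counts = Counter(hours)

# Look up each hour in display order; hours with no observations get 0.
freqs = [counts.get(h, 0) for h in hour_order]
labels = [str(h) for h in hour_order]  # string labels keep the custom order

print(hour_order)
print(freqs)
```

Filling in 0 for missing hours may be worth doing in the pandas version as well (for example with reindex and a fill value), since selecting with .loc and a list of labels raises a KeyError in recent pandas versions when an hour never occurs in the data.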

Matplotlib Bar Graph Yaxis not being set to 0 [duplicate]

My DataFrame's structure
trx.columns
Index(['dest', 'orig', 'timestamp', 'transcode', 'amount'], dtype='object')
I'm trying to plot transcode (transaction code) against amount to see how much money is spent per transaction. I made sure to convert transcode to a categorical type as seen below.
trx['transcode']
...
Name: transcode, Length: 21893, dtype: category
Categories (3, int64): [1, 17, 99]
The result I get from doing plt.scatter(trx['transcode'], trx['amount']) is the following:
[Figure: scatter plot]
While the above plot is not entirely wrong, I would like the X axis to contain just the three possible values of transcode [1, 17, 99] instead of the entire [1, 100] range.
Thanks!
From matplotlib 2.1 on, you can plot categorical variables by using strings, i.e. if you provide the column for the x values as strings, it will recognize them as categories.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
plt.scatter(df["x"].astype(str), df["y"])
plt.margins(x=0.5)
plt.show()
In order to obtain the same in matplotlib <=2.0, one would plot against some index instead.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({"x" : np.random.choice([1,17,99], size=100),
"y" : np.random.rand(100)*100})
u, inv = np.unique(df["x"], return_inverse=True)
plt.scatter(inv, df["y"])
plt.xticks(range(len(u)),u)
plt.margins(x=0.5)
plt.show()
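What the index trick does can be shown without any plotting. This is a plain-Python sketch of the same idea as np.unique with return_inverse=True: each distinct category gets a position 0, 1, 2, ..., and every data point is replaced by the position of its category. The sample values are made up.

```python
values = [17, 1, 99, 17, 1, 1]  # made-up transaction codes

# Sorted distinct categories; these become the tick labels.
categories = sorted(set(values))

# Map each category to its position on the x axis.
position = {c: i for i, c in enumerate(categories)}

# Replace every value by the position of its category.
inv = [position[v] for v in values]

print(categories)
print(inv)
```

Plotting inv on the x axis and labelling the ticks 0, 1, 2 with the entries of categories gives evenly spaced categories regardless of their numeric values.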
The same plot can be obtained using seaborn's stripplot:
import seaborn as sns
sns.stripplot(x="x", y="y", data=df)
And a potentially nicer representation can be done via seaborn's swarmplot:
sns.swarmplot(x="x", y="y", data=df)

Seaborn groupby pandas Series

I want to visualize my data into box plots that are grouped by another variable shown here in my terrible drawing:
To do this, I use a pandas Series to indicate how the values are grouped. This is what I do:
import pandas as pd
import seaborn as sns
#example data for reproduciblity
a = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
])
#converting second column to Series
a.ix[:,1] = pd.Series(a.ix[:,1])
#Plotting by seaborn
sns.boxplot(a, groupby=a.ix[:,1])
And this is what I get:
However, what I would have expected to get was to have two boxplots each describing only the first column, grouped by their corresponding column in the second column (the column converted to Series), while the above plot shows each column separately which is not what I want.
A column in a Dataframe is already a Series, so your conversion is not necessary. Furthermore, if you only want to use the first column for both boxplots, you should only pass that to Seaborn.
So:
#example data for reproduciblity
df = pd.DataFrame(
[
[2, 1],
[4, 2],
[5, 1],
[10, 2],
[9, 2],
[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn
sns.boxplot(df.a, groupby=df.b)
I changed your example a little bit; giving the columns labels makes it a bit clearer, in my opinion.
edit:
If you want to plot all columns separately, you (I think) basically want all combinations of the values in your groupby column and any other column. So if your Dataframe looks like this:
    a   b  grouper
0   2   5        1
1   4   9        2
2   5   3        1
3  10   6        2
4   9   7        2
5   3  11        1
and you want boxplots for columns a and b grouped by the column grouper, you should flatten the columns and change the groupby column to contain values like a1, a2, b1, etc.
Here is a crude way which I think should work, given the Dataframe shown above:
dfpiv = df.pivot(index=df.index, columns='grouper')
cols_flat = [dfpiv.columns.levels[0][i] + str(dfpiv.columns.levels[1][j]) for i, j in zip(dfpiv.columns.labels[0], dfpiv.columns.labels[1])]
dfpiv.columns = cols_flat
dfpiv = dfpiv.stack(0)
sns.boxplot(dfpiv, groupby=dfpiv.index.get_level_values(1))
Perhaps there are fancier ways of restructuring the Dataframe. The flattening of the hierarchy after pivoting is especially hard to read; I don't like it.
This is a new answer to an old question, because seaborn and pandas have changed through version updates. Because of these changes, Rutger's answer no longer works.
The most important changes are from seaborn==v0.5.x to seaborn==v0.6.0. I quote the log:
Changes to boxplot() and violinplot() will probably be the most disruptive. Both functions maintain backwards-compatibility in terms of the kind of data they can accept, but the syntax has changed to be more similar to other seaborn functions. These functions are now invoked with x and/or y parameters that are either vectors of data or names of variables in a long-form DataFrame passed to the new data parameter.
Let's now go through the examples:
# preamble
import pandas as pd # version 1.1.4
import seaborn as sns # version 0.11.0
sns.set_theme()
Example 1: Simple Boxplot
df = pd.DataFrame([[2, 1] ,[4, 2],[5, 1],
[10, 2],[9, 2],[3, 1]
], columns=['a', 'b'])
#Plotting by seaborn with x and y as parameter
sns.boxplot(x='b', y='a', data=df)
Example 2: Boxplot with grouper
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
# using pandas melt
df_long = pd.melt(df, "grouper", var_name='a', value_name='b')
# join two columns together
df_long['a'] = df_long['a'].astype(str) + df_long['grouper'].astype(str)
sns.boxplot(x='a', y='b', data=df_long)
Example 3: rearranging the DataFrame to pass it directly to seaborn
def df_rename_by_group(data:pd.DataFrame, col:str)->pd.DataFrame:
    '''This function takes a DataFrame, groups by one column and returns
    a new DataFrame where the old column names are extended by the group item.
    '''
    # group the DataFrame passed in as `data`, not a global variable
    grouper = data.groupby(col)
    max_length_of_group = max([len(values) for item, values in grouper.indices.items()])
    _df = pd.DataFrame(index=range(max_length_of_group))
    for i in grouper.groups.keys():
        helper = grouper.get_group(i).drop(col, axis=1).add_suffix(str(i))
        helper.reset_index(drop=True, inplace=True)
        _df = _df.join(helper)
    return _df
df = pd.DataFrame([[2, 5, 1], [4, 9, 2],[5, 3, 1],
[10, 6, 2],[9, 7, 2],[3, 11, 1]
], columns=['a', 'b', 'grouper'])
df_new = df_rename_by_group(data=df, col='grouper')
sns.boxplot(data=df_new)
I really hope this answer helps to avoid some confusion.
sns.boxplot() does not take a groupby argument.
You will probably see
TypeError: boxplot() got an unexpected keyword argument 'groupby'.
A better idea is to group the data first and pass the grouped DataFrame to boxplot as data.
import seaborn as sns
grouDataFrame = nameDataFrame.groupby(['A'])['B'].agg(sum).reset_index()
sns.boxplot(y='B', x='A', data=grouDataFrame)
Here column B contains numeric values and the grouping is done on column A. The values of B in each group are summed, and the boxplot is drawn from the result. Hope this helps.