Incorrect plot from a pandas Series with datetime indices - matplotlib

When I use matplotlib to plot a pandas Series containing three float values with datetime indices, I get an incorrect plot with a vertical line in the middle. It looks like this:
I've been struggling with this for hours. I'm finally able to reproduce it with these three data points in the following Python code:
import pandas as pd
import matplotlib.pyplot as plt
data = """\
2013-04-16 08:50:00.080120 / 56.70999
2013-04-16 08:53:34.165183 / 56.59997
2013-04-16 08:59:09.676249 / 55.70001\
"""
fmt = "%Y-%m-%d %H:%M:%S.%f"
val = [float(a.split(' / ')[1]) for a in data.split('\n')]
indx = [pd.datetime.strptime(a.split(' / ')[0], fmt) for a in data.split('\n')]
s = pd.Series(val, index=indx)
s.plot()
plt.show()
If I zoom in on the line I can see it's placed, seemingly, at the correct date (April 16), but at exactly midnight, instead of at the times specified by the data (and echoed by printing s).

Which version of matplotlib and pandas are you using?
With pandas v0.10.1.dev-f73128e and matplotlib v1.2.1 I get the correct figure whether I plot in IPython in interactive mode or from a Python script. (By the way, I use Python 2.7.)
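For anyone trying to reproduce this, a quick way to report the installed versions is sketched below. The `versions` dictionary is only an illustrative name.

```python
import matplotlib
import pandas as pd

# Collect the version strings that are worth quoting in a bug report.
versions = {
    "pandas": pd.__version__,
    "matplotlib": matplotlib.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```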

Related

Seaborn hexbin plot with marginal distributions for datetime64[ns] and category variables

I'm uploading a spreadsheet from excel to a dataframe. In this table, I am only interested in two columns. The first column is the date and time in the format %Y-%m-%d %H-%M-%S. The second column is a categorical variable, namely the type of violation (for example, being late).
There are few types of violations in total. About 6-7 types.
Running df.info() confirms that the date and time column has the datetime64[ns] dtype and the column with the types of violations has the category dtype.
I would like to use hexbin plot with marginal distributions from the seaborn library (https://seaborn.pydata.org/examples/hexbin_marginals.html ). However, the simple code available at the link above is not so simple for a variable with categories and time.
import seaborn as sns
sns.set_theme(style="ticks")
sns.jointplot(x=df['incident'], y=df['date-time'], kind="hex", color="#4CB391")
Python raises TypeError: The x variable is categorical, but one of ['numeric', 'datetime'] is required
I understand that the axis needs either a numeric variable or a date-time variable, but converting the column does not solve the problem.
This error can be reproduced using
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
ndf = pd.DataFrame({'date-time': ['2021-11-15 00:10:00','2021-11-15 00:20:00'], 'incident': ['a','b']})
print(ndf)
sns.set_theme(style="ticks")
sns.jointplot(data=ndf, x='incident', y='date-time', color="#4CB391", hue=ndf['incident'] )
plt.show()
Question: how can I get a plot that looks like the seaborn example?
Starting from the example cited in the question as the desired graph style, I replaced the x-axis data with date/time values, converted them into the numeric date format that matplotlib can handle, and labelled the x-axis ticks by converting the numbers back to date strings. Because the date/time labels overlap, the text is rotated. Please also judge for yourself the points that @JohanC raised.
import numpy as np
import seaborn as sns
sns.set_theme(style="ticks")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
rs = np.random.RandomState(11)
x = rs.gamma(2, size=1000)
date_rng = pd.date_range('2021-11-15', '2021-11-16', freq='1min')
y = -.5 * x + rs.normal(size=1000)
g = sns.jointplot(x=mdates.date2num(date_rng[:1000]), y=y, kind="hex", color="#4CB391")
# format each tick as the date/time at that tick's position
g.ax_joint.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d %H:%M:%S'))
for tick in g.ax_joint.get_xticklabels():
    tick.set_rotation(45)
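The round trip this answer relies on can be checked in isolation. The sketch below uses a made-up timestamp: it converts a datetime to matplotlib's floating-point day number with date2num and back with num2date. Note that num2date returns a timezone-aware value (UTC with the default settings), so the tzinfo is dropped before comparing.

```python
from datetime import datetime

import matplotlib.dates as mdates

# Illustrative timestamp, not taken from the question's data.
original = datetime(2021, 11, 15, 0, 10, 0)

# date2num gives a float counting days since matplotlib's epoch.
as_number = mdates.date2num(original)

# num2date converts back; drop the tzinfo to compare with the naive input.
restored = mdates.num2date(as_number).replace(tzinfo=None)

label = restored.strftime('%Y-%m-%d %H:%M:%S')
print(as_number, label)
```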

How to draw a scatter plot with Zeppelin that contains 3 types of points

I'm working in a Zeppelin notebook with the Spark interpreter. I want a scatter plot in which the points have 3 different colors.
I integrated matplotlib into Zeppelin because it makes it simple to plot several pandas DataFrames in the same figure.
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
# get object from ResourcePool
MemArb=sqlContext.table("MemArb")
VoisArb=sqlContext.table("VoisArb")
SousTest=sqlContext.table("SousTest")
MemArb=MemArb.toPandas()
VoisArb=VoisArb.toPandas()
SousTest=SousTest.toPandas()
x_client = SousTest["derx"].astype('float').iloc[0]
y_client = SousTest["dtrx"].astype('float').iloc[0]
x_memeArbre = MemArb['valx'].astype('float')
y_memeArbre = MemArb['valOx'].astype('float')
x_voisinArbre = VoisArb['vax'].astype('float')
y_voisinArbre = VoisArb['valOx'].astype('float')
y_voisinArbre.count()
figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(x_client, y_client, s=90, color='b')
plt.scatter(x_memeArbre,y_memeArbre,s=10, color='r')
plt.scatter(x_voisinArbre, y_voisinArbre, s=10, color='b')
plt.title('Nuage de points avec Matplotlib')
plt.xlabel('ONx')
plt.ylabel('OLx')
plt.show()
Is there a way to get the same result in Zeppelin without matplotlib?
You can use createOrReplaceTempView function on a dataframe and then write SQL queries to get the data. Current Zeppelin (0.8.0) has a scatter plot among offered built-in visualizations. Just make sure that each data point has a corresponding column indicating color.
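One way to prepare such a table is sketched below in pandas rather than Spark, with made-up coordinates and a hypothetical `groupe` column that marks which set each point belongs to. The combined frame would still have to be turned into a Spark DataFrame and registered with createOrReplaceTempView before Zeppelin can query it; that step is left out here because it needs a running Spark session.

```python
import pandas as pd

# Made-up coordinates standing in for the three tables in the question.
client = pd.DataFrame({"x": [1.0], "y": [2.0]})
meme_arbre = pd.DataFrame({"x": [0.5, 1.5], "y": [1.0, 2.5]})
voisin_arbre = pd.DataFrame({"x": [2.0, 3.0], "y": [0.5, 1.5]})

# Tag every row with the set it came from, then stack the frames.
# The column name "groupe" is hypothetical.
points = pd.concat(
    [
        client.assign(groupe="client"),
        meme_arbre.assign(groupe="memeArbre"),
        voisin_arbre.assign(groupe="voisinArbre"),
    ],
    ignore_index=True,
)
print(points)
```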

Formatting Seaborn Factorplot y-labels to percentages [duplicate]

I have an existing plot that was created with pandas like this:
df['myvar'].plot(kind='bar')
The y axis is formatted as floats and I want to change it to percentages. All of the solutions I found use ax.xyz syntax, and I can only place code below the line above that creates the plot (I cannot add ax=ax to that line).
How can I format the y axis as percentages without changing the line above?
Here is the solution I found but requires that I redefine the plot:
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as mtick
data = [8,12,15,17,18,18.5]
perc = np.linspace(0,100,len(data))
fig = plt.figure(1, (7,4))
ax = fig.add_subplot(1,1,1)
ax.plot(perc, data)
fmt = '%.0f%%' # Format you want the ticks, e.g. '40%'
xticks = mtick.FormatStrFormatter(fmt)
ax.xaxis.set_major_formatter(xticks)
plt.show()
Link to the above solution: Pyplot: using percentage on x axis
This is a few months late, but I have created PR#6251 with matplotlib to add a new PercentFormatter class. With this class you just need one line to reformat your axis (two if you count the import of matplotlib.ticker):
import ...
import matplotlib.ticker as mtick
ax = df['myvar'].plot(kind='bar')
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
PercentFormatter() accepts three arguments, xmax, decimals, symbol. xmax allows you to set the value that corresponds to 100% on the axis. This is nice if you have data from 0.0 to 1.0 and you want to display it from 0% to 100%. Just do PercentFormatter(1.0).
The other two parameters allow you to set the number of digits after the decimal point and the symbol. They default to None and '%', respectively. decimals=None will automatically set the number of decimal points based on how much of the axes you are showing.
Update
PercentFormatter was introduced into Matplotlib proper in version 2.1.0.
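To see what xmax and decimals do without drawing anything, the formatter's format_pct method can be called directly. This is a small sketch with illustrative values; the second argument is the displayed axis range, which only matters when decimals is None.

```python
import matplotlib.ticker as mtick

# Data in the range 0.0-1.0: xmax=1.0 maps 1.0 to 100%.
fraction_fmt = mtick.PercentFormatter(xmax=1.0, decimals=1)
quarter = fraction_fmt.format_pct(0.25, 1.0)

# Data already in percent units (the default xmax is 100).
percent_fmt = mtick.PercentFormatter(decimals=0)
forty = percent_fmt.format_pct(40, 100)

print(quarter, forty)
```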
A pandas DataFrame plot returns the Axes for you, and you can then manipulate the axes however you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,5))
# you get ax from here
ax = df.plot()
type(ax) # matplotlib.axes._subplots.AxesSubplot
# manipulate
vals = ax.get_yticks()
ax.set_yticklabels(['{:,.2%}'.format(x) for x in vals])
Jianxun's solution did the job for me but broke the y value indicator at the bottom left of the window.
I ended up using FuncFormatter instead (and also stripped the unnecessary trailing zeroes as suggested here):
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
df = pd.DataFrame(np.random.randn(100,5))
ax = df.plot()
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
Generally speaking I'd recommend using FuncFormatter for label formatting: it's reliable, and versatile.
For those who are looking for the quick one-liner:
plt.gca().set_yticklabels([f'{x:.0%}' for x in plt.gca().get_yticks()])
this assumes
import: from matplotlib import pyplot as plt
Python >=3.6 for f-String formatting. For older versions, replace f'{x:.0%}' with '{:.0%}'.format(x)
I'm late to the game, but I just realized this: ax can be replaced with plt.gca() if you do not have an explicit axes object at hand.
Echoing @Mad Physicist's answer, using the PercentFormatter class it would be:
import matplotlib.ticker as mtick
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))
#if you already have ticks in the 0 to 1 range. Otherwise see their answer
I propose an alternative method using seaborn
Working code:
import numpy as np
import pandas as pd
import seaborn as sns
data=np.random.rand(10,2)*100
df = pd.DataFrame(data, columns=['A', 'B'])
ax= sns.lineplot(data=df, markers= True)
ax.set(xlabel='xlabel', ylabel='ylabel', title='title')
# change the y tick labels
y_value=['{:,.2f}'.format(x) + '%' for x in ax.get_yticks()]
ax.set_yticklabels(y_value)
You can do this in one line without importing anything:
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter('{}%'.format))
If you want integer percentages, you can do:
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter('{:.0f}%'.format))
You can use either ax.yaxis or plt.gca().yaxis. FuncFormatter is still part of matplotlib.ticker, but you can also do plt.FuncFormatter as a shortcut.
Based on the answer of @erwanp, you can use the formatted string literals of Python 3,
x = '2'
percentage = f'{x}%' # 2%
inside the FuncFormatter() and combined with a lambda expression.
All wrapped:
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: f'{y}%'))
Another one line solution if the yticks are between 0 and 1:
plt.yticks(plt.yticks()[0], ['{:,.0%}'.format(x) for x in plt.yticks()[0]])
Add one line of code (this assumes matplotlib.ticker has been imported as ticker):
ax.yaxis.set_major_formatter(ticker.PercentFormatter())

Unusual axes range for Pandas dataframe plot

I'm trying to do simple plotting using the built-in pandas.DataFrame.plot function (I want to avoid the full-blown pyplot figure/axes object setup approach). However, I get strange results when I don't specify the x-axis range parameter (xlim). You would think that plot would pick defaults based on the max and min of the data.
(I'm using inline plotting in a Jupyter notebook with Python 2.7)
Setup code
import numpy as np
import pandas as pd
%matplotlib inline
n = 366
x = 10.0*np.random.randn(n)
time_series = pd.date_range("2016-01-01", "2016-12-31")
df = pd.DataFrame(data=x, index=time_series, columns=['X'])
df['y'] = 100 + 2.5*x + 5.0*np.random.randn(n)
df.describe()
Here is what the data looks like:
Desired code:
df.plot('X', 'y', style='.')
Here is what I get from above code:
Here is what I expect to get:
df.plot('X', 'y', xlim=(-35, 35), style='.')
Each time I run the code I get an equally-odd choice of default axis range.

How to plot data from two columns of Excel Spreadsheet using matplotlib and numpy to make a line graph/plot

Link to the Spreadsheet: https://docs.google.com/spreadsheets/d/1c2hItirdrnvz2emJ4peJHaWrQlzahoHeVqetgHHAXvI/edit?usp=sharing
I am new to Python and am very keen to learn this since I like statistics and computer programming. Any help would be appreciated!
I used matplotlib and numpy, but I don't know how to graph this spreadsheet as a line graph.
If the data are in a common CSV (comma-separated values) format, they can easily be read into Python. (Here I downloaded the file from the link in the question via File/Download as/Comma-separated values.)
Using pandas and matplotlib
You can then read in data in pandas using pandas.read_csv(). This creates a DataFrame. Usually pandas automatically understands that the first row is the column names. You can then access the columns from the Dataframe via their names.
Plotting can easily be performed with the DataFrame.plot(x, y) method, where x and y can simply be the names of the columns to plot.
import pandas as pd
import matplotlib.pyplot as plt
# reading in the dataframe from the question text
df = pd.read_csv("data/1880-2016 Temperature Data Celc.csv")
# make Date a true Datetime
df["Year"] = pd.to_datetime(df["Year"], format="%Y")
# plot dataframe
ax = df.plot("Year", "Temperature in C")
ax.figure.autofmt_xdate()
plt.show()
In case one wants a scatterplot, use
df.plot( x="Year", y="Temperature in C", marker="o", linestyle="")
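Without the downloaded file at hand, the same steps can be tried on an in-memory CSV. The sketch below uses the column names from this answer, but the three rows are invented placeholders rather than values from the spreadsheet, and the Agg backend is selected only so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import pandas as pd

# Placeholder rows; NOT the real temperature data.
csv_text = """Year,Temperature in C
1880,13.7
1881,13.8
1882,13.9
"""

# read_csv takes the first row as the column names
df = pd.read_csv(io.StringIO(csv_text))
# turn the integer years into real datetimes
df["Year"] = pd.to_datetime(df["Year"], format="%Y")

ax = df.plot(x="Year", y="Temperature in C")
ax.figure.autofmt_xdate()
```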
Using numpy and matplotlib
The same can be done with numpy. Reading in the data works with numpy.loadtxt, where one has to provide a little more information about the data, e.g. skipping the first row and using a comma as the separator. The unpacked columns can be plotted with pyplot.plot(year, temp).
import numpy as np
import matplotlib.pyplot as plt
# reading in the data
year, temp = np.loadtxt("data/1880-2016 Temperature Data Celc.csv",
                        skiprows=1, unpack=True, delimiter=",")
#plotting with pyplot
plt.plot(year, temp, label="Temperature in C")
plt.xlabel("Year")
plt.ylabel("Temperature in C")
plt.legend()
plt.gcf().autofmt_xdate()
plt.show()
The result looks roughly the same as in the pandas case (because pandas simply uses matplotlib internally).
In case one wants a scatterplot, there are two options:
plt.plot(year, temp, marker="o", ls="", label="Temperature in C")
or
plt.scatter(year, temp, label="Temperature in C")