Unusual axes range for Pandas dataframe plot - pandas

I'm trying to do simple plotting using the built-in pandas.DataFrame.plot function (I want to avoid the full-blown pyplot figure/axes object setup approach). However, I get strange results when I don't specify the x-axis range parameter (xlim). You would think that plot would pick defaults based on the max and min of the data.
(I'm using inline plotting in a Jupyter notebook with Python 2.7)
Setup code
import numpy as np
import pandas as pd
%matplotlib inline
n = 366
x = 10.0*np.random.randn(n)
time_series = pd.date_range("2016-01-01", "2016-12-31")
df = pd.DataFrame(data=x, index=time_series, columns=['X'])
df['y'] = 100 + 2.5*x + 5.0*np.random.randn(n)
df.describe()
Here is what the data looks like:
Desired code:
df.plot('X', 'y', style='.')
Here is what I get from above code:
Here is what I expect to get:
df.plot('X', 'y', xlim=(-35, 35), style='.')
Each time I run the code I get an equally-odd choice of default axis range.
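One possible workaround, sketched here under the assumption that an explicit xlim is acceptable, is to derive the limits from the data instead of hard-coding (-35, 35). The 5% margin and the names x_min, x_max, pad and xlim are illustrative choices, not part of the original question.

```python
import numpy as np
import pandas as pd

# Same setup as in the question.
n = 366
x = 10.0 * np.random.randn(n)
time_series = pd.date_range("2016-01-01", "2016-12-31")
df = pd.DataFrame(data=x, index=time_series, columns=['X'])
df['y'] = 100 + 2.5 * x + 5.0 * np.random.randn(n)

# Derive the x-range from the data, with a hypothetical 5% margin on each side.
x_min, x_max = df['X'].min(), df['X'].max()
pad = 0.05 * (x_max - x_min)
xlim = (x_min - pad, x_max + pad)

# Pass the computed limits explicitly, as in the xlim=(-35, 35) call above.
ax = df.plot(x='X', y='y', style='.', xlim=xlim)
```

This only sidesteps the odd default; it does not explain why the default range is chosen the way it is.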


Sns barplot does not sort sliced values

I want to plot from a pandas DataFrame using sns.barplot. Everything works fine with this code:
result = df.groupby(['Code departement']).size().sort_values(ascending=False)
x=result.index
y=result.values
plot=sns.barplot(x, y)
plot.set(xlabel='Code departement', ylabel='Nombre de transactions')
sns.barplot(x, y, data=df).set_title('title')
But as you can see in PLOT 1, there are too many bars so I just want the 10 highest, and when I slice x and y :
x=result[:10].index
y=result[:10].values
plot=sns.barplot(x, y)
It prints bars unordered like this :
I checked by printing the sliced x and y, and they are in the right order. I don't know what I am missing. Thank you for your help.
You didn't state the version you are using, but probably it isn't the latest. Seaborn as well as matplotlib receive quite some improvements with each new version.
With seaborn 0.11.1 you'd get a warning, because x and y should preferably be passed as keywords, i.e. sns.barplot(x=x, y=y). The warning tries to avoid confusion with the data= keyword. Apart from that, the numeric x-values would appear sorted numerically.
The order can be controlled via the order= keyword. In this case, sns.barplot(x=x, y=y, order=x). To only have the 10 highest, you can pass sns.barplot(x=x, y=y, order=x[:10]).
Also note that you are creating the bar plot twice (just to change the title?), which can be very confusing. As sns.barplot returns the ax (the subplot onto which the plot has been drawn), the usual approach is ax = sns.barplot(...) and then ax.set_title(...). (The name ax is preferred because it makes it easier to see how matplotlib and seaborn example code can be reused in new code.)
The following example code has been tested with seaborn 0.11.1:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
print(sns.__version__)
df = pd.DataFrame({'Code departement': np.random.randint(1, 51, 1000)})
result = df.groupby(['Code departement']).size().sort_values(ascending=False)
x = result.index
y = result.values
ax = sns.barplot(x=x, y=y, order=x[:10])
ax.set(xlabel='Code departement', ylabel='Nombre de transactions')
ax.set_title('title')
plt.show()

matplotlib - seaborn - the numbers on the correlation plots are not readable

The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How is it possible to show only 5 or 6 most important columns and not all of them with very low importance?
plt.figure(figsize=(20,3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T, annot=True,
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the highest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use NumPy's np.r_ index helper and write .iloc[np.r_[1:4, -3:0]].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
annot=True, annot_kws={'rotation':90, 'size': 20},
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
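The combined selection mentioned above, .iloc[np.r_[1:4, -3:0]], can be sketched without plotting. The data here is random and the column names are made up for illustration; only the indexing step is the point.

```python
import numpy as np
import pandas as pd

# Random illustrative data: 'price' plus ten hypothetical feature columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 11)), columns=['price'] + list('abcdefghij'))

# Correlations with 'price', strongest positive first ('price' itself is row 0).
corr = df.corr()[['price']].sort_values('price', ascending=False)

# np.r_ concatenates the two slices into one integer index array:
# positions 1, 2, 3 (top three after 'price') and -3, -2, -1 (bottom three).
idx = np.r_[1:4, -3:0]
selected = corr.iloc[idx]
print(selected)
```

Passing selected.T to sns.heatmap in place of the .iloc[1:7] slice would then show the three highest and the three lowest correlations in one row.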
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
columns=list(ascii_letters[26:]))
importance_index = 5 # until which idx to hide values
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str) # make a str-copy
labels.iloc[0,:importance_index] = ' ' # mask columns that you want to hide
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31, fmt='', annot_kws={'rotation':90})
plt.show()
The output on some random data:
This works but it has its limits, particularly with setting fmt='' (you can't use it to conveniently format decimals anymore and need to do it manually). I would also question whether your approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate whether we can rotate the heatmap labels (I've included that above) or leave them out completely, since they are technically redundant due to the color-coding. Alternatively, you could plot only the cells with the "important" values.
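Since fmt='' disables seaborn's own number formatting, the decimals have to be formatted while the label strings are built. A minimal sketch, assuming two decimal places are wanted (that precision is an arbitrary choice for illustration), with the same random data and masking as in the code above:

```python
import numpy as np
import pandas as pd

# Same random data as in the example above.
rs = np.random.RandomState(33)
df = pd.DataFrame(rs.normal(size=(100, 26)),
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

importance_index = 5  # until which idx to hide values, as above
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T

# Format every value as a two-decimal string instead of a plain astype(str).
labels = data.apply(lambda col: col.map('{:.2f}'.format))

# Blank out the cells that should not be annotated.
labels.iloc[0, :importance_index] = ''
print(labels)
```

The resulting labels frame can be passed as annot=labels together with fmt='' in the sns.heatmap call.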

What is the difference between doing a regression with a dataframe and ndarray?

I would like to know why would I need to convert my dataframe to ndarray when doing a regression, since I get the same result for intercept and coef when I do not convert it?
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
from sklearn import linear_model
%matplotlib inline
# import data and create dataframe
!wget -O FuelConsumption.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/FuelConsumptionCo2.csv
df = pd.read_csv("FuelConsumption.csv")
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
# Split train/ test data
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
# Modeling
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
# If I use the dataframe directly below, train[['ENGINESIZE']] for x and
# train[['CO2EMISSIONS']] for y, I get the same result.
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
Thank you very much!
So df is the loaded dataframe, cdf is another frame with selected columns, and train is selected rows.
train[['ENGINESIZE']] is a 1 column dataframe (I believe train['ENGINESIZE'] would be a pandas Series).
I believe the preferred syntax for getting an array from the dataframe is:
train[['ENGINESIZE']].values # or
train[['ENGINESIZE']].to_numpy()
though
np.asanyarray(train[['ENGINESIZE']])
is supposed to do the same thing.
Digging down through the regr.fit code I see that it calls sklearn.utils.check_X_y, which in turn calls sklearn.utils.check_array. That takes care of converting the inputs to numpy arrays, with some awareness of pandas dataframe peculiarities (such as multiple dtypes).
So it appears that if fit accepts your dataframes, you don't need to convert them ahead of time. But if you can get a nice array from the dataframe, there's no harm in doing that either. Either way the fit is done with arrays derived from the dataframe.
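The equivalence can be checked directly. This is a small sketch on synthetic data (the numbers 125, 39 and the noise level are made up, and the real FuelConsumption.csv is not used), fitting once with one-column dataframes and once with the arrays obtained via to_numpy():

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in for the training data; values are illustrative only.
rng = np.random.default_rng(0)
train = pd.DataFrame({'ENGINESIZE': rng.uniform(1, 6, 200)})
train['CO2EMISSIONS'] = 125 + 39 * train['ENGINESIZE'] + rng.normal(0, 10, 200)

# Fit with one-column dataframes.
regr_df = linear_model.LinearRegression()
regr_df.fit(train[['ENGINESIZE']], train[['CO2EMISSIONS']])

# Fit with the equivalent numpy arrays.
regr_np = linear_model.LinearRegression()
regr_np.fit(train[['ENGINESIZE']].to_numpy(), train[['CO2EMISSIONS']].to_numpy())

same = (np.allclose(regr_df.coef_, regr_np.coef_)
        and np.allclose(regr_df.intercept_, regr_np.intercept_))
print('Same coefficients and intercept:', same)
```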

How to plot Series with selective ticks?

I have a Series that I would like to plot as a bar chart: pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts()
Since I have many bars I only want to display some (equidistant) ticks.
However, unless I actively work against it, pyplot will print the wrong labels. E.g. if I leave out set_xticklabels in the code below I get
where every element from the index is taken and just displayed with the specified distance.
This code does what I want:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
mi,ma = min(s.index), max(s.index)
s = s.reindex(range(mi,ma+1,1), fill_value=0)
distance = 10
a = s.plot(kind='bar')
condition = lambda t: int(t[1].get_text()) % distance == 0
ticks_,labels_=zip(*filter(condition, zip(a.get_xticks(), a.get_xticklabels())))
a.set_xticks(ticks_)
a.set_xticklabels(labels_)
plt.show()
But I still feel like I'm being unnecessarily clever here. Am I missing a function? Is this the best way of doing that?
Consider not using a pandas bar plot if you intend to plot numeric values, because pandas bar plots are categorical in nature.
If you use a matplotlib bar plot instead, which is numeric in nature, there is no need to tinker with any ticks at all.
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
plt.bar(s.index, s)
I think you overcomplicated it. You can simply use the following. You just need to find the relationship between the ticks and the ticklabels.
a = s.plot(kind='bar')
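xticks = np.arange(0, ma + 1, distance)  # labels to show, up to the largest index value
plt.xticks(xticks + abs(mi), xticks)  # bar positions start at 0, so shift the labels by |mi|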
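The relationship between tick positions and tick labels in a categorical bar plot can be made explicit without any plotting. In this sketch the names labels and positions are illustrative: bar i sits at x-position i, so the position of an index value is just its location in the reindexed Series.

```python
import pandas as pd

s = pd.Series([-4, 2, 3, 3, 4, 5, 9, 20]).value_counts().sort_index()
mi, ma = s.index.min(), s.index.max()
s = s.reindex(range(mi, ma + 1), fill_value=0)

distance = 10

# Index values that should get a tick label (multiples of `distance`).
labels = [v for v in s.index if v % distance == 0]

# A pandas bar plot puts bar i at x = i, so look up each label's position.
positions = [s.index.get_loc(v) for v in labels]

print(labels)     # index values to display
print(positions)  # where the corresponding bars are drawn
```

After a = s.plot(kind='bar'), these could be applied with a.set_xticks(positions) and a.set_xticklabels(labels).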

Incorrect plot from a pandas Series with datetime indices

When I use matplotlib to plot a pandas Series containing three float values with datetime indices, I get an incorrect plot with a vertical line in the middle. It looks like this:
I've been struggling with this for hours. I'm finally able to reproduce it with these three data points in the following Python code:
import pandas as pd
import matplotlib.pyplot as plt
data = """\
2013-04-16 08:50:00.080120 / 56.70999
2013-04-16 08:53:34.165183 / 56.59997
2013-04-16 08:59:09.676249 / 55.70001\
"""
fmt = "%Y-%m-%d %H:%M:%S.%f"
val = [float(a.split(' / ')[1]) for a in data.split('\n')]
indx = [pd.to_datetime(a.split(' / ')[0], format=fmt) for a in data.split('\n')]  # pd.datetime no longer exists in recent pandas
s = pd.Series(val, index=indx)
s.plot()
plt.show()
If I zoom in on the line I can see it's placed, seemingly, at the correct date (April 16), but at exactly midnight, instead of at the times specified by the data (and echoed by printing s).
Which version of matplotlib and pandas are you using?
With pandas v0.10.1.dev-f73128e and matplotlib v1.2.1 I get the correct figure whether I do the plot in IPython in interactive mode or from a Python script. (By the way, I use Python 2.7.)