How Can I Add A Regression Line to pandas.plot(kind='bar')? - pandas

I'd like to add a regression line for each flavor below. How can I do that? Do I need to use subplots? Is it possible using pandas.plot or do I need to use the full matplotlib?
import pandas as pd
# initialize list of lists
data = [[1,157.842730083188,202.290991182781,244.849416438322],
[2,234.516775578511,190.104435611797,202.157088214941],
[3,198.279130213755,193.075780258345,194.112394276613],
[4,156.285653517235,198.382900113055,185.380696178104],
[5,190.653607667334,208.807038546447,202.662790911701],
[6,192.027054343382,168.768097007287,179.315293388299],
[7,144.927513854729,166.183469310198,157.338388768229],
[8,194.096584739985,177.710332802887,188.006211652239],
[9,131.613923150861,112.503607632448,128.947939049068],
[10,139.545538050778,129.935716833166,139.334073132085]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['DensityDecileRank', 'Flavor1', 'Flavor2', 'Flavor3'])
df.plot(x='DensityDecileRank',
        kind='bar',
        stacked=False)

If you don't mind using numpy to calculate the regression values explicitly,
the following code snippet can be used as a quick solution:
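import matplotlib.pyplot as plt
import numpy as np
ax = df.plot(x='DensityDecileRank', kind='bar', stacked=False)
rank, flavors = df.columns[0], df.columns[1:]
for flavor in flavors:
    # Fit a degree-1 polynomial (a straight line) to this flavor's values
    reg_func = np.poly1d(np.polyfit(df[rank], df[flavor], 1))
    # The bars sit at positions 0..n-1, which matches the default x values used here
    ax.plot(reg_func(df[rank]))
plt.show()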
The code above derives the function reg_func for each flavor, which can be used for calculating the regression values based on the rank values.
The regression lines are plotted in the order of the flavor columns to match the colors. Further formatting can be added to ax.plot.
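As a variation on the snippet above, here is a minimal sketch that passes the bar positions to ax.plot explicitly and keeps the fitted slope and intercept for later use. The small data frame, the fits dictionary and the dashed line style are assumptions made for illustration; the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption so this runs without a display
import numpy as np
import pandas as pd

# Small hypothetical data set with the same layout as the question's frame
df = pd.DataFrame({
    "DensityDecileRank": [1, 2, 3, 4, 5],
    "Flavor1": [10.0, 12.0, 9.0, 7.0, 6.0],
    "Flavor2": [8.0, 8.5, 7.0, 7.5, 5.0],
})

ax = df.plot(x="DensityDecileRank", kind="bar", stacked=False)

# pandas draws the bar groups at integer positions 0..n-1,
# regardless of the values in the x column
positions = np.arange(len(df))

fits = {}  # hypothetical container: flavor -> (slope, intercept)
for flavor in df.columns[1:]:
    slope, intercept = np.polyfit(df["DensityDecileRank"], df[flavor], 1)
    fits[flavor] = (slope, intercept)
    # Evaluate the fitted line at the rank values, draw it at the bar positions
    ax.plot(positions, slope * df["DensityDecileRank"] + intercept,
            linestyle="--", label=f"{flavor} fit")

ax.legend()
```

Keeping the coefficients in a dictionary makes it easy to annotate the chart or report the slopes afterwards.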

Related

Iterating and plotting five columns per iteration pandas

I am trying to plot five columns per iteration, but my current code plots everything five times. How can I make it plot five columns per iteration without repeating them?
n=4
for tag_1,tag_2,tag_3,tag_4,tag_5 in zip(df.columns[n:], df.columns[n+1:], df.columns[n+2:], df.columns[n+3:], df.columns[n+4:]):
    fig,ax=plt.subplots(ncols=5, tight_layout=True, sharey=True, figsize=(20,3))
    sns.scatterplot(df, x=tag_1, y='variable', ax=ax[0])
    sns.scatterplot(df, x=tag_2, y='variable', ax=ax[1])
    sns.scatterplot(df, x=tag_3, y='variable', ax=ax[2])
    sns.scatterplot(df, x=tag_4, y='variable', ax=ax[3])
    sns.scatterplot(df, x=tag_5, y='variable', ax=ax[4])
    plt.show()
You are using list slicing in the wrong way. When you use df.columns[n:], you are getting all the column names from the one with index n to the last one. The same is valid for n+1, n+2, n+3 and n+4. Zipping these slices yields overlapping windows that advance by one column at a time, which causes the repetition that you are referring to. In addition, the number of figures shown is determined by the behavior of the zip function: when used on iterables of different sizes, zip stops at the shortest one (in this case df.columns[n+4:]).
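The difference between zipping shifted slices and stepping through the columns in blocks can be seen without any plotting. The following is a small sketch on a hypothetical list of column names, using groups of three instead of five to keep it short:

```python
# Hypothetical column names standing in for df.columns
columns = list("abcdefgh")
n = 0

# Zipping shifted slices gives overlapping sliding windows,
# and zip stops at the shortest slice (columns[n+2:])
windows = list(zip(columns[n:], columns[n + 1:], columns[n + 2:]))
print(windows[:2])   # first two windows share two of their three names

# Stepping through the list with a stride gives non-overlapping groups;
# the last group may be shorter than the others
chunks = [columns[start:start + 3] for start in range(0, len(columns), 3)]
print(chunks)
```

Each name appears in up to three windows, but in exactly one chunk.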
You can achieve what you want by adapting your code as follows:
# Imports to create sample data
import string
import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Create some sample data and a sample dataframe
data = { string.ascii_lowercase[i]: [random.randint(0, 100) for _ in range(100)] for i in range(15) }
df = pd.DataFrame(data)
# Iterate in groups of five indexes
for start in range(0, len(df.columns), 5):
    # Get the next five columns. Pay attention to the case in which the number of columns is not a multiple of 5
    cols = [df.columns[idx] for idx in range(start, min(start+5, len(df.columns)))]
    # Adapt your plot and take into account that the last group can be smaller than 5
    fig,ax=plt.subplots(ncols=len(cols), tight_layout=True, sharey=True, figsize=(20,3))
    for idx in range(len(cols)):
        #sns.scatterplot(df, x=cols[idx], y='variable', ax=ax[idx])
        sns.scatterplot(df, x=cols[idx], y=df[cols[idx]], ax=ax[idx]) # In the example the values of the column are plotted
    plt.show()
In this case, the code performs the following steps:
Iterate over groups of at most five indexes ([0->4], [5->9], ...).
Recover the columns located at those indexes. The last group may hold fewer than five columns (e.g., with 18 columns, the last group consists of the columns at indexes 15, 16 and 17).
Create the plot, taking into account the corner case of a last group with fewer than five columns.
With Seaborn's objects interface, available from v0.12, we could do it like this:
from numpy import random
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import seaborn.objects as so
sns.set_theme()
First, let's create a sample dataset, just like trolloldem's answer.
random.seed(0) # To produce the same random values across multiple runs
columns = list("abcdefghij")
sample_size = 20
df_orig = pd.DataFrame(
    {c: random.randint(100, size=sample_size) for c in columns},
    index=pd.Series(range(sample_size), name="variable")
)
Then transform the data frame into a long-form for easier processing.
df = (df_orig
    .melt(value_vars=columns, var_name="tag", ignore_index=False)
    .reset_index()
)
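To make the reshaping step concrete, here is a minimal sketch on a tiny hypothetical frame with two columns and two rows. It uses the same melt and reset_index calls as above; only the data is made up.

```python
import pandas as pd

# Tiny wide-form frame: one column per tag, index named "variable"
wide = pd.DataFrame(
    {"a": [1, 2], "b": [3, 4]},
    index=pd.Series([0, 1], name="variable"),
)

# Long form: one row per (variable, tag) pair, with the cell in "value"
long = (wide
    .melt(value_vars=["a", "b"], var_name="tag", ignore_index=False)
    .reset_index()
)
print(long)
```

Every cell of the wide frame becomes one row, so a 2x2 frame yields four rows with the columns variable, tag and value. This is the shape that faceting by tag expects.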
Then finally render the figures, 5 figures per row.
(
    so.Plot(df, x="value", y="variable") # Or you might do x="variable", y="value" instead
    .facet(col="tag", wrap=5)
    .add(so.Dot())
)

Draw bar-charts with value_counts() for multiple columns in a Pandas DataFrame

I'm trying to draw bar-charts with counts of unique values for all columns in a Pandas DataFrame. Kind of what df.hist() does for numerical columns, but I have categorical columns.
I'd prefer to use the object-oriented approach, because it feels more natural and explicit to me.
I'd like to have multiple Axes (subplots) within a single Figure, in a grid fashion (again like what df.hist() does).
My solution below does exactly what I want, but it feels cumbersome. I doubt whether I really need the direct dependency on Matplotlib (and all the code for creating the Figure, removing the unused Axes etc.). I see that pandas.Series.plot has parameters subplots and layout which seem to point to what I want, but maybe I'm totally off here. I tried looping over the columns in my DataFrame and apply these parameters, but I cannot figure it out.
Does anyone know a more compact way to do what I'm trying to achieve?
import math
import matplotlib.pyplot as plt
# Defining the grid-dimensions of the Axes in the Matplotlib Figure
nr_of_plots = len(ames_train_categorical.columns)
nr_of_plots_per_row = 4
nr_of_rows = math.ceil(nr_of_plots / nr_of_plots_per_row)
# Defining the Matplotlib Figure and Axes
figure, axes = plt.subplots(nrows=nr_of_rows, ncols=nr_of_plots_per_row, figsize=(25, 50))
figure.subplots_adjust(hspace=0.5)
# Plotting on the Axes
i, j = 0, 0
for column_name in ames_train_categorical:
    if ames_train_categorical[column_name].nunique() <= 30:
        axes[i][j].set_title(column_name)
        ames_train_categorical[column_name].value_counts().plot(kind='bar', ax=axes[i][j])
        j += 1
        if j % nr_of_plots_per_row == 0:
            i += 1
            j = 0
# Cleaning up unused Axes
# plt.subplots creates a full rectangular grid of Axes. On the last row, not all Axes will always be used. Unused Axes are removed here.
axes_flattened = axes.flatten()
for ax in axes_flattened:
    if not ax.has_data():
        ax.remove()
Edit: alternative idea
Using the pyplot/state-machine way of working, you could do it like this in very few lines of code. The downside is that every graph gets its own figure, so they're not nicely arranged in a grid.
for column_name in ames_train_categorical:
    ames_train_categorical[column_name].value_counts().plot(kind='bar')
    plt.show()
Desired output
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "MS Zoning": ["RL", "FV", "RL", "RH", "RL", "RL"],
        "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
        "Alley": ["Grvl", "Grvl", "Grvl", "Grvl", "Pave", "Pave"],
        "Utilities": ["AllPub", "NoSewr", "AllPub", "AllPub", "NoSewr", "AllPub"],
        "Land Slope": ["Gtl", "Mod", "Sev", "Mod", "Sev", "Sev"],
    }
)
Here is a somewhat more idiomatic way to do it:
import math
from matplotlib import pyplot as plt
size = math.ceil(df.shape[1] ** (1/2))
fig = plt.figure()
for i, col in enumerate(df.columns):
    fig.add_subplot(size, size, i + 1)
    df[col].value_counts().plot(kind="bar", ax=plt.gca(), title=col, rot=0)
fig.tight_layout()
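If a fixed number of plots per row is preferred over a square grid, one possible variation is to create the grid with plt.subplots, flatten it, and zip the Axes with the columns. This is only a sketch: the choice of three plots per row and the variable names are assumptions, and the Agg backend is selected so it runs without a display.

```python
import math
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The toy dataframe from above
df = pd.DataFrame({
    "MS Zoning": ["RL", "FV", "RL", "RH", "RL", "RL"],
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave", "Pave"],
    "Alley": ["Grvl", "Grvl", "Grvl", "Grvl", "Pave", "Pave"],
    "Utilities": ["AllPub", "NoSewr", "AllPub", "AllPub", "NoSewr", "AllPub"],
    "Land Slope": ["Gtl", "Mod", "Sev", "Mod", "Sev", "Sev"],
})

ncols = 3  # plots per row (arbitrary choice for this sketch)
nrows = math.ceil(df.shape[1] / ncols)

# squeeze=False keeps axes two-dimensional even for a single row or column
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, squeeze=False)
flat_axes = axes.ravel()

# zip pairs each column with one Axes and stops when the columns run out
for ax, col in zip(flat_axes, df.columns):
    df[col].value_counts().plot(kind="bar", ax=ax, title=col, rot=0)

# Remove the Axes left over in the last row
for ax in flat_axes[df.shape[1]:]:
    ax.remove()

fig.tight_layout()
```

Flattening the grid avoids the manual row and column counters, and the leftover Axes are known from the column count rather than found by inspecting each one.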

Linear 1D interpolation on multiple datasets using loops

I'm interested in performing Linear interpolation using the scipy.interpolate library. The dataset looks somewhat like this:
DATAFRAME for interpolation between X, Y for different RUNs
I'd like to use this interpolated function to find the missing Y from this dataset:
DATAFRAME to use the interpolation function
The number of runs shown here is just 3, but my real dataset will run into thousands of runs. I would therefore appreciate advice on how to build the interpolation functions iteratively.
from scipy.interpolate import interp1d
for RUNNumber in range(TotalRuns):
    InterpolatedFunction[RUNNumber]=interp1d(X, Y)
As I understand it, you want a separate interpolation function defined for each run. Then you want to apply these functions to a second dataframe. I defined a dataframe df with columns ['X', 'Y', 'RUN'], and a second dataframe, new_df with columns ['X', 'Y_interpolation', 'RUN'].
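from scipy.interpolate import interp1d
interpolating_functions = dict()
# Build one interpolating function per run, using the run numbers present in the data
for run_number in df['RUN'].unique():
    run_data = df[df['RUN']==run_number][['X', 'Y']]
    interpolating_functions[run_number] = interp1d(run_data['X'], run_data['Y'])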
Now that we have interpolating functions for each run, we can use them to fill in the 'Y_interpolation' column in a new dataframe. This can be done using the apply function, which takes a function and applies it to each row in a dataframe. So let's define an interpolate function that will take a row of this new df and use the X value and the run number to calculate an interpolated Y value.
def interpolate(row):
    int_func = interpolating_functions[row['RUN']]
    # Calling the interp1d object with a scalar returns a 0-d array,
    # so convert it to a plain float
    return float(int_func(row['X']))
Now we just use apply and our defined interpolate function.
new_df['Y_interpolation'] = new_df.apply(interpolate,axis=1)
I'm using pandas version 0.20.3, and this fills the 'Y_interpolation' column of new_df with the interpolated values.
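For thousands of runs, a row-by-row apply can become slow. One possible alternative, sketched here under stated assumptions, is to interpolate all query points of a run in a single call. Note that this sketch swaps scipy's interp1d for numpy's np.interp, which also performs linear interpolation but clamps values outside the known X range to the end points instead of raising an error. The sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical known points: X and Y for each run
df = pd.DataFrame({
    "RUN": [1, 1, 1, 2, 2, 2],
    "X": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "Y": [0.0, 10.0, 20.0, 5.0, 5.0, 15.0],
})

# Hypothetical query points whose Y is missing
new_df = pd.DataFrame({"RUN": [1, 2, 2], "X": [0.5, 0.5, 1.5]})
new_df["Y_interpolation"] = np.nan

for run, known in df.groupby("RUN"):
    known = known.sort_values("X")     # np.interp needs increasing X values
    mask = new_df["RUN"] == run        # query rows belonging to this run
    new_df.loc[mask, "Y_interpolation"] = np.interp(
        new_df.loc[mask, "X"], known["X"], known["Y"]
    )

print(new_df)
```

The loop runs once per run rather than once per row, and each call handles all query points of that run as an array.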

Combining Pandas Subplots into a Single Figure

I'm having trouble understanding Pandas subplots - and how to create axes so that all subplots are shown (not over-written by subsequent plot).
For each "Site", I want to make a time-series plot of all columns in the dataframe.
The "Sites" here are 'shark' and 'unicorn', both with 2 variables. The output should be 4 plotted lines: the time-indexed plot for Var1 and Var2 at each site.
Make time-indexed data with NaNs:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    # some ways to create random data
    'Var1':np.random.randn(100),
    'Var2':np.random.randn(100),
    'Site':np.random.choice( ['unicorn','shark'], 100),
    # a date range and set of random dates
    'Date':pd.date_range('1/1/2011', periods=100, freq='D'),
    # 'f':np.random.choice( pd.date_range('1/1/2011', periods=365,
    #                       freq='D'), 100, replace=False)
})
df.set_index('Date', inplace=True)
df['Var2']=df.Var2.cumsum()
df.loc['2011-01-31' :'2011-04-01', 'Var1']=np.nan
Make a figure with a sub-plot for each site:
fig, ax = plt.subplots(len(df.Site.unique()), 1)
counter=0
for site in df.Site.unique():
    print(site)
    sitedat=df[df.Site==site]
    sitedat.plot(subplots=True, ax=ax[counter], sharex=True)
    ax[0].title=site #Set title of the plot to the name of the site
    counter=counter+1
plt.show()
However, this is not working as written. The second sub-plot ends up overwriting the first. In my actual use case, I have a variable number of sites in each dataframe (14, for example), as well as a variable number of variables ('Var1', 'Var2', ...). Thus, I need a solution that does not require creating each axis (ax0, ax1, ...) by hand.
As a bonus, I would love a title of each 'site' above that set of plots.
The current code over-writes the first 'Site' plot with the second. What am I missing with the axes here?!
When you are using DataFrame.plot(..., subplots=True) you need to provide the correct number of axes that will be used for each column (and with the right geometry, if using layout=). In your example, you have 2 columns, so plot() needs two axes, but you are only passing one in ax=, therefore pandas has no choice but to delete all the axes and create the appropriate number of axes itself.
Therefore, you need to pass an array of axes of length corresponding to the number of columns you have in your dataframe.
# the grouper function is from itertools' cookbook
from itertools import zip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
fig, axs = plt.subplots(len(df.Site.unique())*(len(df.columns)-1),1, sharex=True)
for (site,sitedat),axList in zip(df.groupby('Site'),grouper(axs,len(df.columns)-1)):
    sitedat.plot(subplots=True, ax=axList)
    axList[0].set_title(site)
plt.tight_layout()
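To see how the loop above hands each site its own block of axes, the pairing can be tried without any plotting. In this sketch the strings in axes_names are hypothetical stand-ins for the Axes objects returned by plt.subplots:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

sites = ["shark", "unicorn"]              # group keys, in groupby order
axes_names = ["ax0", "ax1", "ax2", "ax3"] # stand-ins for 2 sites x 2 variables

# Each site is paired with the next block of two axes
pairs = list(zip(sites, grouper(axes_names, 2)))
print(pairs)
```

Because the flat array of axes is cut into consecutive blocks of len(df.columns)-1, the first block goes to the first site, the second block to the second site, and so on for any number of sites.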

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo")
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
([u'Repo', u'AllTests', u'Restricted']
So the first column holds the string labels and the second and third columns hold the data values. We want one series per row, corresponding to the Galactian and the Forecast-MLlib repos.
It would seem this would be a common task and there would be a straightforward way to simply plot the DataFrame. However, the following related question does not provide any simple way: it essentially throws away the DataFrame structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series - that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining as series data points?
Update Here is a self contained snippet
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['Galactian','Forecast-MLlib'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','AllTests','Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is output
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared's help it looks like this
So the data is showing now .. but the Series and Labels are swapped. Will look further to try to line them up properly.
Another update
By flipping the columns/labels the series are coming out as desired.
The change was to:
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is
runtimes = npa([1860.,410.,140.,47.])
runtimes.shape = (2,2)
labels = npa(['AllTests','Restricted'])
labels.shape=(2,1)
rtlabels = np.concatenate((labels,runtimes),axis=1)
rtlabels.shape = (2,3)
colnames = ['Repo','Galactian','Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()
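The difference between the two calls comes from how DataFrame.plot reads the frame: it draws one series per column and uses the index for the x axis. The following minimal sketch builds the frame directly from a dictionary (an assumption made to keep it short) and shows both orientations without plotting:

```python
import pandas as pd

df = pd.DataFrame({
    "Repo": ["Galactian", "Forecast-MLlib"],
    "AllTests": [1860.0, 140.0],
    "Restricted": [410.0, 47.0],
})

# Columns are AllTests and Restricted, so .plot() would draw one line per
# test category, with the repos along the x axis
by_category = df.set_index("Repo")

# After transposing, the columns are the repos, so .plot() would draw one
# line per repo, with AllTests and Restricted along the x axis
by_repo = df.set_index("Repo").T

print(by_category)
print(by_repo)
```

Calling .plot() on by_repo gives one series per row of the original frame, which is what the question asks for.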