Exponential smoothing giving NaN values - dataframe

I have written a function to find the best values of alpha, beta and gamma for ExponentialSmoothing. When I run this code without gamma values it works fine, but when I give values for gamma it fails: the function outputs NaN values, but only for specific combinations of the three values.
Code:
import numpy as np
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def auto_hwm(timeseries, val_split_date, alpha=[None], beta=[None], gamma=[None], phi=[None],
             trend=None, seasonal=None, periods=None, verbose=False):
    '''The auto_hwm (short for auto Holt-Winters model) function searches for the best possible
    parameter combination for the Exponential Smoothing model, i.e. smoothing level, smoothing slope,
    smoothing seasonal and damping slope, based on mean absolute error.

    ****Parameters****
    timeseries: array-like
        Time series
    val_split_date: str
        The datetime at which to split the time series for validation
    alpha: list of floats (optional)
        The list of alpha values for the simple exponential smoothing parameter
    beta: list of floats (optional)
        The list of beta values for the Holt's trend method parameter
    gamma: list of floats (optional)
        The list of gamma values for the Holt-Winters seasonal method parameter
    phi: list of floats (optional)
        The list of phi values for the damped method parameter
    trend: {"add", "mul", "additive", "multiplicative", None} (optional)
        Type of trend component.
    seasonal: {"add", "mul", "additive", "multiplicative", None} (optional)
        Type of seasonal component.
    periods: int (optional)
        The number of periods in a complete seasonal cycle

    ****Returns****
    best_params: dict
        The values of alpha, beta, gamma and phi for which the
        validation data (from val_split_date onwards) gives the least mean absolute error
    '''
    best_params = []
    actual = timeseries[val_split_date:]
    print('Evaluating Exponential Smoothing model for', len(alpha) * len(beta) * len(gamma) * len(phi), 'fits\n')
    for a in alpha:
        for b in beta:
            for g in gamma:
                for p in phi:
                    if verbose:
                        print('Checking for', {'alpha': a, 'beta': b, 'gamma': g, 'phi': p})
                    model = ExponentialSmoothing(timeseries, trend=trend, seasonal=seasonal,
                                                 seasonal_periods=periods)
                    model.fit(smoothing_level=a, smoothing_slope=b, smoothing_seasonal=g,
                              damping_slope=p)
                    f_cast = model.predict(model.params, start=actual.index[0])
                    # mean absolute error relative to the actuals, clipped at zero
                    rel_err = np.float64(mean_absolute_error(actual, f_cast) / actual).mean()
                    score = np.where(rel_err > 0, rel_err, 0)
                    best_params.append({'alpha': a, 'beta': b, 'gamma': g, 'phi': p, 'mae': score})
    return min(best_params, key=lambda x: x['mae'])

auto_hwm(ts, val_split_date='2018-10-01',
         alpha=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
         beta=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
         gamma=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
         trend='mul', seasonal='mul', periods=12, verbose=True)
It gives NaN values only for specific combinations of the three values. There are no missing values in the series.
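One way to narrow this down, as a hedged sketch (assuming the same loop as above; the helper name is hypothetical and not part of the original code), is to flag which parameter combinations actually produce NaN forecasts right after the predict call, instead of letting them flow into the MAE:

import numpy as np

def has_nan(forecast):
    # Hypothetical helper: True if any value in the forecast is NaN.
    return bool(np.isnan(np.asarray(forecast, dtype=float)).any())

Calling has_nan(f_cast) inside the innermost loop, and skipping those combinations with continue, at least shows exactly which alpha/beta/gamma/phi values trigger the NaNs.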

Related

Two Pandas dataframes, how to interpolate row-wise using scipy

How can I use scipy interpolate on two dataframes, interpolating row-wise?
For example, if I have:
import pandas as pd

dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})
display(dfx)
display(dfy)
And say I want to interpolate for y(x=0.5), how can I get the results into an array that I can put in a new dataframe?
Expected result is: [0.761290323 0.284615385 1.1 -0.022727273]
For example, for the first row, you can see the expected value is 0.761290323:
import matplotlib.pyplot as plt
import scipy.interpolate

x = [0.1, 3.2]  # from dfx, row 0
y = [0.8, 0.5]  # from dfy, row 0
fig, ax = plt.subplots(1, 1)
ax.plot(x, y)
f = scipy.interpolate.interp1d(x, y)
out = f(0.5)
print(out)
I tried the following but received ValueError: x and y arrays must be equal in length along interpolation axis.
f = scipy.interpolate.interp1d(dfx, dfy)
out = np.exp(f(0.5))
print(out)
Since you are looking for linear interpolation, you can do:
def interpolate(val, dfx, dfy):
    t = (dfx['b'] - val) / (dfx['b'] - dfx['a'])
    return dfy['a'] * t + dfy['b'] * (1 - t)

interpolate(0.5, dfx, dfy)
Output:
0    0.761290
1    0.284615
2    1.100000
3   -0.022727
dtype: float64
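A scipy-based row-wise sketch is also possible: build one interp1d per row and evaluate it at 0.5 (fill_value="extrapolate" handles rows where 0.5 lies outside the row's x range, like the last one). This reproduces the expected values from the question:

import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})

# Build one interpolator per row and evaluate it at x = 0.5.
out = np.array([interp1d(dfx.iloc[i], dfy.iloc[i],
                         bounds_error=False, fill_value="extrapolate")(0.5)
                for i in range(len(dfx))])
print(out)  # [ 0.76129032  0.28461538  1.1        -0.02272727]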

Invalid RGBA argument: masked_array(data=[1.0, 0.5651961183210134, 0.0, 1.0], mask=False, when using Matplotlib

I am trying to plot a 4d graph using x,y,z labels with the fourth dimension being color. However, when trying to run this code, I run into this:
Invalid RGBA argument: masked_array(data=[1.0, 0.5651961183210134, 0.0, 1.0], mask=False,
only whenever I try to change the z variable. The only time I don't get an error is when I set the z variable to np.random.standard_normal(100).
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = rainfall
y = airport_train_adult_pax
z = airport_total_pax
c = exchange_rate
img = ax.scatter(x, y, z, c=c, cmap=plt.hot())
fig.colorbar(img)
plt.show()
Just for some background on my data: rainfall ranges from 0 to 200 with one decimal place, airport train and airport total are both in the 2,000,000 to 3,000,000 range with no decimals, and exchange_rate ranges between 0 and 1 with two decimals.
I guess you have NaNs in your data.
Try to insert this code before ax.scatter(...):
import pandas as pd

df = pd.DataFrame({'x': x, 'y': y, 'z': z, 'c': c}).dropna()
x, y, z, c = (df[col] for col in df)
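To confirm that NaNs really are the culprit before dropping anything, a quick sketch (assuming x, y, z and c are 1-D and of equal length) is to count missing values per column:

import pandas as pd

# The offending variable shows a non-zero NaN count.
print(pd.DataFrame({'x': x, 'y': y, 'z': z, 'c': c}).isna().sum())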

Matplotlib: plt.text with user-defined circle radii

Dear stackoverflow users,
I want to plot some data labels with their coordinates in an x,y-plot. Around the labels I want to put a circle with a user-defined radius, as I want to symbolize the magnitude of a data property by the radius of the circle.
An example dataset could look like the following:
point1 = ["label1", 0.5, 0.25, 1e0] # equals [label, x, y, radius]
point2 = ["label2", 0.5, 0.75, 1e1] # equals [label, x, y, radius]
I want to use code similar to the following:
import matplotlib.pyplot as plt
plt.text(point1[1], point1[2], point1[0], bbox = dict(boxstyle="circle")) # here I want to alter the radius by passing point1[3]
plt.text(point2[1], point2[2], point2[0], bbox = dict(boxstyle="circle")) # here I want to alter the radius by passing point2[3]
plt.show()
Is this possible somehow or is the plt.add_patch variant the only possible way?
Regards
In principle, you can use the box's pad parameter to define the circle size. However, this is then relative to the label: a small label would have a smaller circle around it than a larger label for the same value of pad. Also, the units of pad are font size (i.e. with a font size of 10pt, a padding of 1 corresponds to 10pt).
import matplotlib.pyplot as plt

points = [["A", 0.2, 0.25, 0],           # zero radius
          ["long label", 0.4, 0.25, 0],  # zero radius
          ["label1", 0.6, 0.25, 1]]      # one radius

for point in points:
    plt.text(point[1], point[2], point[0], ha="center", va="center",
             bbox=dict(boxstyle=f"circle,pad={point[3]}", fc="lightgrey"))
plt.show()
I don't know to what extent this is desired. I guess usually you would rather create a scatter plot at the same positions as the text:
import numpy as np
import matplotlib.pyplot as plt

points = [["A", 0.2, 0.25, 100],           # 5 pt radius
          ["long label", 0.4, 0.25, 100],  # 5 pt radius
          ["label1", 0.6, 0.25, 1600]]     # 20 pt radius

data = np.array([l[1:] for l in points])
plt.scatter(data[:, 0], data[:, 1], s=data[:, 2], facecolor="gold")
for point in points:
    plt.text(point[1], point[2], point[0], ha="center", va="center")
plt.show()
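If the radii in your dataset are meant in points, a small hedged helper (hypothetical, not from the answer above) can translate them into the s values used here: s in scatter is an area-like quantity in points squared, so a radius of r points corresponds to s = (2*r)**2, consistent with the 100 -> 5 pt and 1600 -> 20 pt comments above.

def radius_to_s(radius_pt):
    # Convert a desired circle radius in points to scatter's s argument (points**2).
    return (2 * radius_pt) ** 2

radius_to_s(5), radius_to_s(20)   # -> (100, 1600), the sizes used above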

Subtract two columns of lists in pandas

I have a dataframe with two columns of 1D lists of the same size, and I would like to form a third column with the difference of these vectors. Conceptually:
df['dV'] = df['v1'] - df['v2']
So that if df['v1'] looks like:
0 [0.2, 0.1, 0.0]
1 [0.5, -0.4, 0.0]
...
and df['v2'] looks like:
0 [0.1, 0.6, 0.0]
1 [0.5, 0.4, 0.0]
...
then the desired result df['dV'] would be:
0 [0.1, -0.5, 0.0]
1 [0.0, -0.8, 0.0]
...
I have tried the following:
df['dV'] = df['v1'] - df['v2']
which results in an "operands could not be broadcast.." error. Next, I tried:
vecsub = lambda x, y: np.subtract(x, y)
df['dV'] = list(map(vecsub, df['v1'], df['v2']))
This produces a result, but the element types are different: the entries of df['dV'] are numpy.ndarray, while those of df['v1'] are list. How might I simply get the results in dV as lists? Applying numpy's tolist around my lambda outputs <built-in method tolist of numpy.ndarray object> for every value in the dataframe.
If you want to change an ndarray to a list, just do list(df['dV']).
Broadcasting errors usually happen when arrays have different sizes. Are you sure their shapes are equal? You can use .shape to get that information. You can read more about broadcasting here.
Applying numpy's tolist around my lambda outputs <built-in method tolist of numpy.ndarray object> for every value in the dataframe.
That's because you did someArray.tolist instead of someArray.tolist(), so you are actually printing the function itself rather than calling it and printing its result.
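If the goal is explicitly to end up with plain Python lists in dV, one sketch (assuming df has the v1 and v2 columns from the question) is to convert each per-row result back with .tolist():

import numpy as np

# Subtract the per-row vectors, then turn each ndarray back into a list.
df['dV'] = [np.subtract(v1, v2).tolist() for v1, v2 in zip(df['v1'], df['v2'])]
type(df['dV'].iloc[0])   # list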

Logarithmic multi-sequence plot with equal bar widths

I have something like
import matplotlib.pyplot as plt
import numpy as np
a = [0.05, 0.1, 0.2, 1, 2, 3]
plt.hist((a * 2, a * 3), bins=[0, 0.1, 1, 10])
plt.gca().set_xscale("symlog", linthreshx=0.1)
plt.show()
which gives me the following plot:
As one can see, the bar widths are not equal. In the linear part (from 0 to 0.1) everything is fine, but after that the bar width is still on a linear scale while the axis is logarithmic, giving me uneven widths for the bars and the spaces in between (the tick is not in the middle of the bars).
Is there any way to correct this?
Inspired by https://stackoverflow.com/a/30555229/635387 I came up with the following solution:
import matplotlib.pyplot as plt
import numpy as np
d = [0.05, 0.1, 0.2, 1, 2, 3]

def LogHistPlot(data, bins):
    totalWidth = 0.8
    colors = ("b", "r", "g")
    for i, d in enumerate(data):
        heights = np.histogram(d, bins)[0]
        width = 1 / len(data) * totalWidth
        left = np.array(range(len(heights))) + i * width
        plt.bar(left, heights, width, color=colors[i], label=i)
    plt.xticks(range(len(bins)), bins)
    plt.legend(loc='best')

LogHistPlot((d * 2, d * 3, d * 4), [0, 0.1, 1, 10])
plt.show()
Which produces this plot:
The basic idea is to drop the plt.hist function, compute the histogram with numpy and plot it with plt.bar. Then you can easily use a linear x-axis, which makes the bar width calculation trivial. Lastly, the ticks are replaced by the bin edges, resulting in the logarithmic scale. And you don't even have to deal with the symlog linear/logarithmic botchery anymore.
You could use histtype='stepfilled' if you are okay with a plot where the data sets are plotted one behind the other. Of course, you'll need to carefully choose colors with alpha values, so that all your data can still be seen...
a = [0.05, 0.1, 0.2, 1, 2, 3] * 2
b = [0.05, 0.05, 0.05, 0.15, 0.15, 2]
colors = [(0.2, 0.2, 0.9, 0.5), (0.9, 0.2, 0.2, 0.5)] # RGBA tuples
plt.hist((a, b), bins=[0, 0.1, 1, 10], histtype='stepfilled', color=colors)
plt.gca().set_xscale("symlog", linthreshx=0.1)
plt.show()
I've changed your data slightly for a better illustration. This gives me:
For some reason the overlap color seems to be going wrong (matplotlib 1.3.1 with Python 3.4.0; is this a bug?), but it's one possible solution/alternative to your problem.
Okay, I found out the real problem: when you create the histogram with those bin-edge settings, the histogram creates bars which have equal size, and equal outside-spacing on the non-log scale.
To demonstrate, here's a zoomed-in version of the plot in the question, but in non-log scale:
Notice how the first two bars are centered around (0 + 0.1) / 2 = 0.05, with a gap of 0.1 / 10 = 0.01 at the edges, while the next two bars are centered around (0.1 + 1.0) / 2 = 0.55, with a gap of 1.1 / 10 = 0.11 at either edge.
When converting things to log scale, bar widths and edge widths all go for a huge toss. This is compounded further by the fact that you have a linear scale from 0 to 0.1, after which things become log-scale.
I know no way of fixing this, other than to do everything manually. I've used the geometric means of the bin-edges in order to compute what the bar edges and bar widths should be. Note that this piece of code will work only for two datasets. If you have more datasets, you'll need to have some function that fills in the bin-edges with a geometric series appropriately.
import numpy as np
import matplotlib.pyplot as plt
def geometric_means(a):
    """Return pairwise geometric means of adjacent elements."""
    return np.sqrt(a[1:] * a[:-1])

a = [0.05, 0.1, 0.2, 1, 2, 3] * 2
b = [0.05, 0.1, 0.2, 1, 2, 3] * 3

# Find frequencies
bins = np.array([0, 0.1, 1, 10])
a_hist = np.histogram(a, bins=bins)[0]
b_hist = np.histogram(b, bins=bins)[0]

# Find log-scale mid-points for bar edges
mid_vals = np.hstack((np.array([0.05]), geometric_means(bins[1:])))

# Compute bar left edges and bar widths
a_x = bins[:-1]
a_widths = mid_vals - bins[:-1]

b_x = mid_vals
b_widths = bins[1:] - mid_vals

plt.bar(a_x, a_hist, width=a_widths, color='b')
plt.bar(b_x, b_hist, width=b_widths, color='g')

plt.gca().set_xscale("symlog", linthreshx=0.1)
plt.show()
And the final result:
Sorry, but the neat gaps between the bars get killed. Again, this can be fixed by doing the appropriate geometric interpolation, so that everything is linear on log-scale.
Just in case someone stumbles upon this problem: this solution looks much more like the way it should be done:
plotting a histogram on a Log scale with Matplotlib
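For reference, a minimal sketch of that approach (one assumption here: the zero bin edge from the question is replaced by a small positive value, since a log axis cannot include 0) uses log-spaced bins together with a logarithmic x-axis, which keeps the bar widths visually equal:

import numpy as np
import matplotlib.pyplot as plt

a = [0.05, 0.1, 0.2, 1, 2, 3]

# Log-spaced bin edges: [0.01, 0.1, 1, 10] (0.01 stands in for the original 0 edge).
bins = np.logspace(np.log10(0.01), np.log10(10), 4)
plt.hist((a * 2, a * 3), bins=bins)
plt.xscale("log")
plt.show()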