Subtract two columns of lists in pandas

I have a dataframe with two columns of 1D lists of the same size, and I would like to form a third column with the difference of these vectors. Conceptually:
df['dV'] = df['v1'] - df['v2']
So that if df['v1'] looks like:
0 [0.2, 0.1, 0.0]
1 [0.5, -0.4, 0.0]
...
and df['v2'] looks like:
0 [0.1, 0.6, 0.0]
1 [0.5, 0.4, 0.0]
...
then the desired result df['dV'] would be:
0 [0.1, -0.5, 0.0]
1 [0.0, -0.8, 0.0]
...
I have tried the following:
df['dV'] = df['v1'] - df['v2']
which fails with an "operands could not be broadcast together" error. Next, I tried:
import numpy as np

vecsub = lambda x, y: np.subtract(x, y)
df['dV'] = list(map(vecsub, df['v1'], df['v2']))
This produces a result, but the element types differ:
type(df['dV'][0])
is numpy.ndarray
while
type(df['v1'][0])
is list.
How might I simply get the results in dV as lists? Applying numpy's tolist around my lambda outputs <built-in method tolist of numpy.ndarray object> for every value in the dataframe.

If you want to change an ndarray to a list, just do list(df['dV']).
Broadcasting errors usually happen when arrays have different shapes. Are you sure their shapes are equal? You can use .shape to check. You can read more about broadcasting here.
"Applying numpy's tolist around my lambda outputs <built-in method tolist of numpy.ndarray object> for every value in the dataframe."
That's because you wrote someArray.tolist instead of someArray.tolist(), so you were actually printing the function itself, not calling it and printing its result.
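Putting the pieces together, a minimal sketch (column names as in the question; exact floats may differ by rounding):
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [[0.2, 0.1, 0.0], [0.5, -0.4, 0.0]],
                   'v2': [[0.1, 0.6, 0.0], [0.5, 0.4, 0.0]]})

# Subtract element-wise, then call .tolist() (with parentheses!) on each result
# so that every dV entry is a plain Python list again.
df['dV'] = [np.subtract(x, y).tolist() for x, y in zip(df['v1'], df['v2'])]

print(df['dV'][0], type(df['dV'][0]))
# [0.1, -0.5, 0.0] (up to floating-point rounding)  <class 'list'>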

Related

Exponential smoothing giving NaN values

I have written a function to find the values of alpha, beta and gamma for ExponentialSmoothing. When I run this code without the gamma values it works fine, but when I give values for gamma, it gives me an error.
The function outputs NaN values, but only for a specific combination of the three values.
Code:
import numpy as np
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def auto_hwm(timeseries, val_split_date, alpha=[None], beta=[None], gamma=[None], phi=[None],
             trend=None, seasonal=None, periods=None, verbose=False):
    '''The auto_hwm (short for auto Holt-Winters model) function searches for the best
    parameter combination for the Exponential Smoothing model, i.e. smoothing level,
    smoothing slope, smoothing seasonal and damping slope, based on mean absolute error.

    ****Parameters****
    timeseries: array-like
        Time series
    val_split_date: str
        The datetime at which to split the time series for validation
    alpha: list of floats (optional)
        The list of alpha values for the simple exponential smoothing parameter
    beta: list of floats (optional)
        The list of beta values for the Holt's trend method parameter
    gamma: list of floats (optional)
        The list of gamma values for the Holt-Winters seasonal method parameter
    phi: list of floats (optional)
        The list of phi values for the damped method parameter
    trend: {"add", "mul", "additive", "multiplicative", None} (optional)
        Type of trend component.
    seasonal: {"add", "mul", "additive", "multiplicative", None} (optional)
        Type of seasonal component.
    periods: int (optional)
        The number of periods in a complete seasonal cycle

    ****Returns****
    best_params: dict
        The values of alpha, beta, gamma and phi for which the
        validation data (from val_split_date on) gives the least mean absolute error
    '''
    best_params = []
    actual = timeseries[val_split_date:]
    print('Evaluating Exponential Smoothing model for',
          len(alpha) * len(beta) * len(gamma) * len(phi), 'fits\n')
    for a in alpha:
        for b in beta:
            for g in gamma:
                for p in phi:
                    if verbose:
                        print('Checking for', {'alpha': a, 'beta': b, 'gamma': g, 'phi': p})
                    model = ExponentialSmoothing(timeseries, trend=trend, seasonal=seasonal,
                                                 seasonal_periods=periods)
                    model.fit(smoothing_level=a, smoothing_slope=b,
                              smoothing_seasonal=g, damping_slope=p)
                    f_cast = model.predict(model.params, start=actual.index[0])
                    # Mean absolute error relative to the actuals, floored at 0
                    rel_err = np.float64(mean_absolute_error(actual, f_cast) / actual).mean()
                    score = np.where(rel_err > 0, rel_err, 0)
                    best_params.append({'alpha': a, 'beta': b, 'gamma': g, 'phi': p, 'mae': score})
    return min(best_params, key=lambda x: x['mae'])

# ts is the time series being modelled (defined elsewhere)
auto_hwm(ts, val_split_date='2018-10-01',
         alpha=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
         beta=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
         gamma=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9],
         trend='mul', seasonal='mul', periods=12, verbose=True)
It gives NaN values only for specific combinations of the three values. There are no missing values in the series.
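No answer was posted for this one, but one thing stands out: the results object returned by fit() is discarded, and the forecast is taken from the unfitted model object instead. A sketch of the usual statsmodels pattern (a suggestion, not from the thread; whether it removes the NaNs is untested):
# Keep the fitted results and forecast from them.
fit_res = model.fit(smoothing_level=a, smoothing_slope=b,
                    smoothing_seasonal=g, damping_slope=p)
f_cast = fit_res.predict(start=actual.index[0], end=actual.index[-1])
Also note that multiplicative trend and seasonal components require strictly positive data, so with trend='mul' and seasonal='mul' any zero or negative value in the series is worth double-checking.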

Two Pandas dataframes, how to interpolate row-wise using scipy

How can I use scipy interpolate on two dataframes, interpolating row-wise?
For example, if I have:
import pandas as pd

dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})
display(dfx)
display(dfy)
And say I want to interpolate for y(x=0.5), how can I get the results into an array that I can put in a new dataframe?
Expected result is: [0.761290323 0.284615385 1.1 -0.022727273]
For example, for first row, you can see the expected value is 0.761290323:
import matplotlib.pyplot as plt
import scipy.interpolate

x = [0.1, 3.2]  # from dfx, row 0
y = [0.8, 0.5]  # from dfy, row 0
fig, ax = plt.subplots(1, 1)
ax.plot(x, y)
f = scipy.interpolate.interp1d(x, y)
out = f(0.5)
print(out)  # approximately 0.76129
I tried the following but received ValueError: x and y arrays must be equal in length along interpolation axis.
f = scipy.interpolate.interp1d(dfx, dfy)
out = np.exp(f(0.5))
print(out)
Since you are looking for linear interpolation, you can do:
def interpolate(val, dfx, dfy):
    t = (dfx['b'] - val) / (dfx['b'] - dfx['a'])
    return dfy['a'] * t + dfy['b'] * (1 - t)

interpolate(0.5, dfx, dfy)
Output:
0 0.761290
1 0.284615
2 1.100000
3 -0.022727
dtype: float64
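If you would rather stay with scipy as the question asks, a row-wise sketch (an adaptation, not the answer above; fill_value='extrapolate' is needed because in row 3 the query point 0.5 lies outside [0.6, 2.8]):
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

dfx = pd.DataFrame({"a": [0.1, 0.2, 0.5, 0.6], "b": [3.2, 4.1, 1.1, 2.8]})
dfy = pd.DataFrame({"a": [0.8, 0.2, 1.1, 0.1], "b": [0.5, 1.3, 1.3, 2.8]})

# Build one 1-D interpolator per row and evaluate it at x = 0.5.
out = np.array([interp1d(dfx.iloc[i], dfy.iloc[i], fill_value="extrapolate")(0.5)
                for i in range(len(dfx))])
print(out)  # approximately [ 0.76129032  0.28461538  1.1        -0.02272727]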

scipy's shgo optimizer fails to minimize variance

In order to get familiar with global optimization methods and in particular with the shgo optimizer from scipy.optimize v1.3.0 I have tried to minimize the variance var(x) of a vector x = [x1,...,xN] with 0 <= xi <= 1 under the constraint that x has a given average value:
import numpy as np
from scipy.optimize import shgo
# Constraint
avg = 0.5 # Given average value of x
cons = {'type': 'eq', 'fun': lambda x: np.mean(x)-avg}
# Minimize the variance of x under the given constraint
res = shgo(lambda x: np.var(x), bounds=6*[(0, 1)], constraints=cons)
The shgo method fails on this problem:
>>> res
fun: 0.0
message: 'Failed to find a feasible minimiser point. Lowest sampling point = 0.0'
nfev: 65
nit: 2
nlfev: 0
nlhev: 0
nljev: 0
success: False
x: array([0., 0., 0., 0., 0., 0.])
The correct solution would be the uniform distribution x = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5] and it can be easily found by using the local optimizer minimize from scipy.optimize:
from scipy.optimize import minimize
from numpy.random import random
x0 = random(6) # Random start vector
res2 = minimize(lambda x: np.var(x), x0, bounds=6*[(0, 1)], constraints=cons)
The minimize method yields the correct result for arbitrary start vectors:
>>> res2.success
True
>>> res2.x
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
My question is: why does shgo fail on this relatively simple task? Did I make a mistake, or is shgo simply not usable for this problem? Any help would be greatly appreciated.
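One thing to keep in mind while experimenting: an equality constraint defines a measure-zero feasible set, which is hard for a sampler to hit. shgo's sampling is tunable through its documented n, iters and sampling_method parameters; a sketch of knobs to try (untested, and no guarantee it fixes this particular case):
from scipy.optimize import shgo

# Denser Sobol sampling plus more refinement iterations.
res = shgo(lambda x: np.var(x), bounds=6*[(0, 1)], constraints=cons,
           n=256, iters=3, sampling_method='sobol')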

How to perform subtraction on a single element of a tensor

I have a tensor that consists of 4 floats, called label.
How do I, with a 50% chance, execute label[0] = 1 - label[0]?
Right now I have:
import tensorflow as tf

label = tf.constant([0.35, 0.5, 0.17, 0.14])  # just an example
uniform_random = tf.random_uniform([], 0, 1.0)
# Create a tensor with [1.0, 0.0, 0.0, 0.0] if uniform_random > 50%
# else it's only zeroes
inv = tf.pack([tf.round(uniform_random), 0.0, 0.0, 0.0])
label = tf.sub(inv, label)
label = tf.abs(label) # need abs because it inverted the other elements
# output will be either [0.35, 0.5, 0.17, 0.14] or [0.65, 0.5, 0.17, 0.14]
which works, but looks extremely ugly. Isn't there a smarter/simpler way of doing this?
Related question: How do I apply a certain op (e.g. sqrt) just to two elements? I'm guessing I have to remove these two elements, perform the op and then concat them back to the original vector?
tf.select and tf.cond come in handy for situations where you have to perform computations conditionally on elements of a tensor. For your example, the following would work :
label = tf.constant([0.35, 0.5, 0.17, 0.14])
inv = tf.pack([1.0, 0.0, 0.0, 0.0])
mask = tf.pack([1.0, -1.0, -1.0, -1.0])
output = tf.cond(tf.random_uniform([], 0, 1.0) > 0.5,
                 lambda: label,
                 lambda: (inv - label) * mask)
with tf.Session(''):
    print(output.eval())
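For readers on current TensorFlow: tf.pack, tf.sub and tf.select were later renamed tf.stack, tf.subtract and tf.where. A sketch of the same trick with the TF 2.x eager API (an adaptation, not from the original answer):
import tensorflow as tf

label = tf.constant([0.35, 0.5, 0.17, 0.14])
mask = tf.constant([True, False, False, False])  # which elements to touch

# Flip label[0] to 1 - label[0] with 50% probability.
flipped = tf.where(mask, 1.0 - label, label)
output = tf.cond(tf.random.uniform([]) > 0.5, lambda: flipped, lambda: label)

# The same pattern answers the related question: apply an op (e.g. sqrt) to
# selected elements only, with no slicing and re-concatenating.
sqrt_mask = tf.constant([True, True, False, False])
sqrted = tf.where(sqrt_mask, tf.sqrt(label), label)

print(output.numpy(), sqrted.numpy())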

Logarithmic multi-sequence plot with equal bar widths

I have something like
import matplotlib.pyplot as plt
import numpy as np
a=[0.05, 0.1, 0.2, 1, 2, 3]
plt.hist((a*2, a*3), bins=[0, 0.1, 1, 10])
plt.gca().set_xscale("symlog", linthreshx=0.1)
plt.show()
which gives me the following plot:
As one can see, the bar widths are not equal. In the linear part (from 0 to 0.1) everything is fine, but beyond that the bar widths are still computed on a linear scale while the axis is logarithmic, giving me uneven widths for the bars and the spaces in between (the tick is not in the middle of the bars).
Is there any way to correct this?
Inspired by https://stackoverflow.com/a/30555229/635387 I came up with the following solution:
import matplotlib.pyplot as plt
import numpy as np

d = [0.05, 0.1, 0.2, 1, 2, 3]

def LogHistPlot(data, bins):
    totalWidth = 0.8
    colors = ("b", "r", "g")
    for i, d in enumerate(data):
        heights = np.histogram(d, bins)[0]
        width = 1 / len(data) * totalWidth
        left = np.array(range(len(heights))) + i * width
        plt.bar(left, heights, width, color=colors[i], label=i)
    plt.xticks(range(len(bins)), bins)
    plt.legend(loc='best')

LogHistPlot((d*2, d*3, d*4), [0, 0.1, 1, 10])
plt.show()
Which produces this plot:
The basic idea is to drop the plt.hist function, compute the histogram with numpy and plot it with plt.bar. Then you can easily use a linear x-axis, which makes the bar width calculation trivial. Lastly, the ticks are replaced by the bin edges, giving the appearance of a logarithmic scale. And you don't even have to deal with the symlog linear/logarithmic botchery anymore.
You could use histtype='stepfilled' if you are okay with a plot where the data sets are plotted one behind the other. Of course, you'll need to carefully choose colors with alpha values, so that all your data can still be seen...
import matplotlib.pyplot as plt

a = [0.05, 0.1, 0.2, 1, 2, 3] * 2
b = [0.05, 0.05, 0.05, 0.15, 0.15, 2]
colors = [(0.2, 0.2, 0.9, 0.5), (0.9, 0.2, 0.2, 0.5)] # RGBA tuples
plt.hist((a, b), bins=[0, 0.1, 1, 10], histtype='stepfilled', color=colors)
plt.gca().set_xscale("symlog", linthreshx=0.1)
plt.show()
I've changed your data slightly for a better illustration. This gives me:
For some reason the overlap color seems to be going wrong (matplotlib 1.3.1 with Python 3.4.0; Is this a bug?), but it's one possible solution/alternative to your problem.
Okay, I found out the real problem: when you create the histogram with those bin-edge settings, the histogram creates bars which have equal size, and equal outside-spacing on the non-log scale.
To demonstrate, here's a zoomed-in version of the plot in the question, but in non-log scale:
Notice how the first two bars are centered around (0 + 0.1) / 2 = 0.05, with a gap of 0.1 / 10 = 0.01 at the edges, while the next two bars are centered around (0.1 + 1.0) / 2 = 0.55, with a gap of 1.1 / 10 = 0.11 at either edge.
When converting things to log scale, bar widths and edge widths all go for a huge toss. This is compounded further by the fact that you have a linear scale from 0 to 0.1, after which things become log-scale.
I know no way of fixing this, other than doing everything manually. I've used the geometric means of the bin edges to compute what the bar edges and bar widths should be. Note that this piece of code will work only for two datasets. If you have more datasets, you'll need some function that fills in the bin edges with a geometric series appropriately (a sketch of such a helper follows at the end of this answer).
import numpy as np
import matplotlib.pyplot as plt

def geometric_means(a):
    """Return pairwise geometric means of adjacent elements."""
    return np.sqrt(a[1:] * a[:-1])

a = [0.05, 0.1, 0.2, 1, 2, 3] * 2
b = [0.05, 0.1, 0.2, 1, 2, 3] * 3

# Find frequencies
bins = np.array([0, 0.1, 1, 10])
a_hist = np.histogram(a, bins=bins)[0]
b_hist = np.histogram(b, bins=bins)[0]

# Find log-scale mid-points for bar edges
mid_vals = np.hstack((np.array([0.05]), geometric_means(bins[1:])))

# Compute bar left edges and bar widths
a_x = bins[:-1]
a_widths = mid_vals - bins[:-1]
b_x = mid_vals
b_widths = bins[1:] - mid_vals

plt.bar(a_x, a_hist, width=a_widths, color='b')
plt.bar(b_x, b_hist, width=b_widths, color='g')
plt.gca().set_xscale("symlog", linthreshx=0.1)
plt.show()
And the final result:
Sorry, but the neat gaps between the bars get killed. Again, this can be fixed by doing the appropriate geometric interpolation, so that everything is linear on log-scale.
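As noted above, extending this beyond two datasets needs a helper that fills in the bin edges appropriately. A hypothetical sketch (log_bar_edges and its linthresh argument are made-up names, not from the answer; for two datasets it reproduces a_x/a_widths and b_x/b_widths above):
import numpy as np

def log_bar_edges(bins, n_datasets, linthresh=0.1):
    """Split each bin into n_datasets bar slots of equal visual width.

    Bins inside the symlog linear region (right edge <= linthresh) are split
    linearly, the others geometrically (equal widths on a log axis).
    """
    lefts, widths = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        if hi <= linthresh:
            edges = np.linspace(lo, hi, n_datasets + 1)
        else:
            edges = np.geomspace(max(lo, linthresh), hi, n_datasets + 1)
        lefts.append(edges[:-1])
        widths.append(np.diff(edges))
    # Transpose so that lefts[i], widths[i] belong to dataset i across all bins.
    return np.array(lefts).T, np.array(widths).T

# Usage with the variables above:
# lefts, widths = log_bar_edges(bins, 2)
# plt.bar(lefts[0], a_hist, width=widths[0], color='b')
# plt.bar(lefts[1], b_hist, width=widths[1], color='g')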
Just in case someone stumbles upon this problem: the solution in "plotting a histogram on a Log scale with Matplotlib" looks much more like the way it should be.