Cross-validation score - data-science

Can a Mean Squared Error be negative in cross-validation? If not, what is the code below trying to say?
-1 * cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error')

Mean squared error cannot be negative; it is always greater than or equal to zero, and values closer to zero are better.
scoring='neg_mean_squared_error' asks scikit-learn to return the negated MSE, because its scoring convention is that higher scores are better. The returned values are therefore all <= 0, with values closer to zero being better, and multiplying the result by -1 simply flips the sign back to the usual positive MSE.
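A minimal sketch of the convention, with small made-up data standing in for the question's x_data and y_data:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the question's data (assumption, for illustration only)
rng = np.random.default_rng(0)
x_data = pd.DataFrame({'horsepower': rng.uniform(50, 200, 100)})
y_data = 100 * x_data['horsepower'] + rng.normal(0, 500, 100)

lre = LinearRegression()
neg_mse = cross_val_score(lre, x_data[['horsepower']], y_data,
                          cv=4, scoring='neg_mean_squared_error')
print(neg_mse)        # all values <= 0, because sklearn negates the MSE
print(-1 * neg_mse)   # multiplying by -1 recovers the usual positive MSE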


How do I test binary floating point math functions

I implemented the powf(float x, float y) math function. This function is a binary floating-point operation. I need to test it for correctness, but the test can't iterate over all floating-point values. What should I do?
Consider 2 questions:
How do I test binary floating point math functions?
Break FP values into groups:
Largest set: Normal values including +/- values near 1.0 and near the extremes as well as randomly selected ones.
Subnormals
Zeros: +0.0, -0.0
NANs
Use at least combinations of 100,000s+ sample values from the first set (including +/-min, +/-1.0, +/-max), 1,000s from the second set (including +/-min, +/-max), and -0.0, +0.0, -NAN, +NAN (see the sampling sketch after this list).
Additional tests for the function's edge cases.
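For illustration, a rough sampling sketch in Python, using numpy's float32 as a stand-in for C's float (the counts and sampling scheme are illustrative, not prescriptive):

import numpy as np

rng = np.random.default_rng(0)
f32 = np.finfo(np.float32)

# Largest set: +/- values near 1.0 and near the extremes, plus random normals
# spread over the whole exponent range.
fixed = np.float32([f32.tiny, -f32.tiny, 1.0, -1.0, f32.max, -f32.max])
signs = rng.choice(np.float32([-1.0, 1.0]), 200_000)
mags = rng.uniform(0.5, 1.0, 200_000)                      # keeps every sample normal
exps = rng.integers(f32.minexp + 1, f32.maxexp, 200_000)   # spread over the exponent range
normals = np.concatenate([fixed, (signs * mags * 2.0 ** exps).astype(np.float32)])

# Subnormals: the subnormal extremes plus random magnitudes below the smallest normal.
smallest_sub = np.nextafter(np.float32(0), np.float32(1))
subnormals = np.concatenate([
    np.float32([smallest_sub, -smallest_sub]),
    (rng.uniform(-1.0, 1.0, 2_000) * f32.tiny).astype(np.float32),
])

zeros = np.float32([0.0, -0.0])
nans = np.float32([np.nan, -np.nan])
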
How do I test powf()?
How: Test powf() against pow() for result correctness (see the comparison sketch below).
Values to test against: powf() has many concerns.
pow(x,y)-like functions are notoriously difficult to code well. Errors in the little sub-calculations propagate into large final errors.
pow() has expected integral results for integral arguments. E.g. pow(2, y) is expected to be exact for all in-range results, and pow(10, y) is expected to be within 0.5 unit in the last place (ULP) for all y in range.
pow() also has expected integral results with negative x, e.g. pow(-3, 5) == -243.
There is little need to test every x, y combination. Consider that every x < 0 with a non-whole-number y leads to a NaN.
z = powf(x,y) readily underflows to 0.0. Testing of x, y values near a result of z == 0 needs some attention.
z = powf(x,y) readily overflows to ∞. Testing of x, y values near a result of z == FLT_MAX needs more attention, as a slight error results in FLT_MAX vs. INFINITY. Since overflow is so rampant with powf(x,y), this reduces the number of combinations needed: it is the edge that is important, and larger values need only light testing.
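A minimal comparison sketch in Python: numpy's float32 power stands in for the powf under test, and double-precision math.pow serves as the reference (a real harness would call your own implementation instead). ulp_error is a hypothetical helper and only handles finite, in-range results:

import math
import numpy as np

def ulp_error(tested, x, y):
    # Distance, in float32 ULPs, between the tested float32 result and the
    # double-precision reference. Finite, in-range results only.
    reference = math.pow(float(x), float(y))
    ulp = abs(float(np.spacing(np.float32(reference))))   # size of one ULP near the reference
    return abs(float(tested) - reference) / ulp

x, y = np.float32(10.0), np.float32(2.5)
z = x ** y                              # stand-in for: z = powf(x, y)
print(float(z), ulp_error(z, x, y))     # a good implementation stays within ~0.5-1 ULP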

scipy-optimize-minimize does not perform the optimization - CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL

I am trying to minimize a function defined as follows:
utility(decision) = decision * (risk - cost)
where variables take the following form:
decision = binary array
risk = array of floats
cost = constant
I know the solution will take the form of:
decision = 1 if (risk >= threshold)
decision = 0 otherwise
Therefore, in order to minimize this function, I can transform utility into a function that depends only on this threshold. My direct translation to scipy is the following:
import numpy as np
from scipy.optimize import minimize

def utility(threshold, risk, cost):
    selection_list = [float(risk[i]) >= threshold for i in range(len(risk))]
    v = np.array(risk.astype(float)) - cost
    total_utility = np.dot(v, selection_list)
    return -1.0 * total_utility

result = minimize(fun=utility, x0=0.2, args=(r, c), bounds=[(0, 1)], options={"disp": True})
This gives me the following result:
fun: array([-17750.44298655])
hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
jac: array([0.])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 2
nit: 0
status: 0
success: True
x: array([0.2])
However, I know this result is wrong because in this case the threshold must be equal to the cost. On top of that, no matter what x0 I use, it always returns that as the result. Looking at the output, I observe that the jacobian is 0 and not even one iteration is performed.
Looking more thoroughly into the function, I plot it and observe that it is not convex at the limits of the bounds, but we can clearly see the minimum at 0.1. However, no matter how much I adjust the bounds to stay within the convex part only, the result is still the same.
What could I do to minimize this function?
The error message tells you that the gradient was at some point too small and thus numerically the same as zero. This is likely due to the thresholding that you do when you calculate your selection_list. There you say float(risk[i]) >= threshold, which has derivative 0 almost everywhere. Hence, almost every starting value will give you the warning you receive.
A solution could be to apply some smoothing to the thresholding operation. So instead of float(risk[i]) >= threshold, you would use a continuous function:
def g(x):
    return 1. / (1 + np.exp(-x))
With this function, you can express the thresholding operation as
g(a*(risk[i] - threshold)), with a parameter a. The larger a is, the closer this smoothed thresholding is to what you are doing so far. At something like a=20 or so, you would probably have pretty much the same thing that you have at the moment. You would therefore solve a sequence of problems: start with a=1, take that solution as a starting value for the same problem with a=2, take that solution as a starting value for the problem with a=4, and so on. At some point, you will notice that changing a no longer changes the solution and you're done. A sketch of this continuation scheme follows below.
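A minimal sketch of that continuation scheme, with made-up risk values and cost (the r and c below are assumptions for illustration; the true optimum threshold should come out near the cost):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 0.3, 1000)   # risk values (made up for this sketch)
c = 0.1                           # cost

def smooth_utility(threshold, risk, cost, a):
    # Sigmoid-smoothed version of the hard threshold (risk >= threshold).
    selection = 1.0 / (1.0 + np.exp(-a * (risk - threshold)))
    return -np.dot(risk - cost, selection)

x0 = np.array([0.2])
for a in [1, 2, 4, 8, 16, 32]:
    result = minimize(smooth_utility, x0, args=(r, c, a), bounds=[(0, 1)])
    x0 = result.x                 # warm-start the next, sharper problem
print(result.x)                   # should end up near the cost (0.1 here)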

how to get the coefficients out of the polynomial expression?

As input I get a polynomial as a string,
and I want to get its coefficients into variables, but I have no idea how to do this.
Example: 7x^4+3x^3-6x^2+x-8. The maximum degree is not known; the coefficients are integers.
I will be very grateful for any help.
Split by plus and minus (e.g. with re.split), preserving the signs in the results. Then for each substring, split by "x" to get the leading coefficient (+1 and -1 are special cases), and take note of missing powers of x (i.e. coefficient 0).
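For illustration, a small sketch of that approach (poly_coefficients is a hypothetical helper name; it assumes a single variable x and integer coefficients, as in the question):

import re

def poly_coefficients(s):
    # Map exponent -> integer coefficient for a string like '7x^4+3x^3-6x^2+x-8'.
    coeffs = {}
    # Split into signed terms: ['7x^4', '+3x^3', '-6x^2', '+x', '-8']
    for term in re.findall(r'[+-]?[^+-]+', s.replace(' ', '')):
        if 'x' in term:
            coef_str, _, exp_str = term.partition('x')
            if coef_str in ('', '+'):       # 'x' or '+x' -> coefficient +1
                coef = 1
            elif coef_str == '-':           # '-x' -> coefficient -1
                coef = -1
            else:
                coef = int(coef_str)
            exp = int(exp_str.lstrip('^')) if exp_str else 1
        else:                               # constant term
            coef, exp = int(term), 0
        coeffs[exp] = coeffs.get(exp, 0) + coef
    return coeffs                           # missing powers simply have no entry (coefficient 0)

print(poly_coefficients('7x^4+3x^3-6x^2+x-8'))
# {4: 7, 3: 3, 2: -6, 1: 1, 0: -8}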

pd.describe returns non-zero mean?

I'm normalizing my data as such:
df = (df-df.mean())/df.std()
then proceed with
stats = df.describe()
Why do I get non-zero means, but std=1?
Your mean is very close to zero, just as your std is very close to 1. Python stores calculated values with finite precision, which means your results are only accurate up to a certain number of decimal places.
Quoting the documentation:
the stored value is an approximation to the original decimal fraction
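A quick demonstration with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000) * 5 + 3})   # arbitrary example column
df = (df - df.mean()) / df.std()
print(df.describe())
# The mean shows up as something like 1e-17 rather than exactly 0,
# because floating-point arithmetic carries a little rounding error.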

any elegant binary search tree algorithm to work this out?

I was confronted with this problem, and would like to have a quick algorithm to work it out.
Given n points in the 2D plane (none of them has an x value or y value equal to another's), find the number of all pairs of points which form a line with positive slope (say (0,0) and (1,1), with a positive slope of 45 degrees).
Since n is big (say 60000), I need an efficient algorithm to keep the running time within 1 second.
I know it is easy to do in O(n^2), but that is simply too slow, taking about 30 seconds. Is it possible to use a binary search tree to do it with O(n log n) complexity?
I appreciate anyone who would like to enlighten me on this.
It seems like you should be able to do it mathematically..
Every pair of 2 points (using combinations, I believe) will have either a positive or a negative slope, and if they are random points, on average 50% will be positive and 50% negative.. So it would be roughly the number of pairs / 2.
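A brute-force check of that averaging intuition on random points (this only illustrates the roughly 50/50 split; an exact answer for a specific input still requires actually counting the pairs):

import random

def count_positive_slope_pairs(points):
    # O(n^2) brute force: a pair has positive slope iff dx and dy share a sign.
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, y1), (x2, y2) = points[i], points[j]
            if (x2 - x1) * (y2 - y1) > 0:
                count += 1
    return count

n = 2000
points = [(random.random(), random.random()) for _ in range(n)]
total = n * (n - 1) // 2
print(count_positive_slope_pairs(points), 'of', total, 'pairs')   # close to total / 2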