How to quantify the relative value change

I have two lists of values:
accuracy: 0.6, 0.6, 0.6
hit: 1, 1, 1
Now after some processing, these two lists of values become:
accuracy: 0.59, 0.58, 0.55
hit: 0.5, 0.3, 0.1
so the drop can be calculated as:
accuracy_drop: 0.01, 0.02, 0.05
hit_drop: 0.5, 0.8, 0.9
In my case, the third case (accuracy_drop = 0.05, hit_drop = 0.9) is the optimal one, and I want to quantify this 'optimal' case with some calculation on the values I have. If I calculate hit_drop / accuracy_drop, the first case gives the highest value; if I calculate hit_drop - accuracy_drop, the third case gives the highest value, but I am not sure whether this is a reasonable quantification.
I tried to search for quantification methods for relative value changes but found nothing. Any ideas?
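For reference, here is how I compute the two candidates (a minimal sketch with the drop values above; the variable names are just for illustration):
accuracy_drop = [0.01, 0.02, 0.05]
hit_drop = [0.5, 0.8, 0.9]
# ratio: roughly 50, 40, 18 -> the first case scores highest
ratios = [h / a for h, a in zip(hit_drop, accuracy_drop)]
# difference: roughly 0.49, 0.78, 0.85 -> the third case scores highest
differences = [h - a for h, a in zip(hit_drop, accuracy_drop)]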

Pandas wrong rounding of decimals

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff', which I expect to go [0.0, 0.1, 0.2, 0.3 ...]. However, it goes something like [0.0, 0.1, 0.2, 0.30000004 ...]. I am rounding the data frame, but I still get these weird decimals. Any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a CSV file, the long unrounded decimals are still there.
Addition
I think the problem is not the CSV conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))
# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have exactly rounded decimals, the merge does not match the rows properly and creates extra rows with a new index.
This is not a rounding problem; it is behavior intrinsic to how floating point numbers work. In fact, 0.30000000000000004 is exactly what 0.1 + 0.1 + 0.1 gives (try it yourself at a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get instead is the closest representable value.
You have a few options, depending on whether you just want to improve the display or you need to work with exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
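For example (a minimal sketch with made-up frames standing in for df_merged and one of the sensor dataframes), pd.merge_asof can match keys within a tolerance instead of requiring exact float equality:
import numpy as np
import pandas as pd
left = pd.DataFrame({'Time_diff': np.arange(0, 1.0, 0.1)})          # 0.0, 0.1, ..., 0.9
right = pd.DataFrame({'Time_diff': np.arange(0, 1.0, 0.1) + 1e-9,   # same axis with tiny float noise
                      'sensor_value': range(10)})
# match each left key to the nearest right key within 10 ms; both frames must be sorted on the key
merged = pd.merge_asof(left.sort_values('Time_diff'), right.sort_values('Time_diff'),
                       on='Time_diff', direction='nearest', tolerance=0.01)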
Another option is to use the decimal module: https://docs.python.org/3/library/decimal.html which works with exact arithmetic but can be slower.
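A tiny illustration of the difference:
from decimal import Decimal
print(0.1 + 0.1 + 0.1)                                   # 0.30000000000000004
print(Decimal('0.1') + Decimal('0.1') + Decimal('0.1'))  # 0.3, exact decimal arithmetic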
In your case you said the column represents a time axis sampled at 10 Hz, so I think changing the representation so that you directly store integer steps (10, 20, 30, ..., i.e. integer multiples of the sampling step) will let you work with exact integers instead of floats.
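A sketch of that idea using integer milliseconds as the unit, reusing the 73.6 s duration printed above (dt_s and duration_s stand in for self.trials[i].dt and self.trials[i].duration):
import numpy as np
dt_s = 0.1          # 10 Hz sampling step, as in the question
duration_s = 73.6   # duration taken from the printed output above
dt_ms = int(round(dt_s * 1000))                  # 100 ms per sample, an exact integer
duration_ms = int(round(duration_s * 1000))
t_ms = np.arange(0, duration_ms + dt_ms, dt_ms)  # 0, 100, 200, ... with no float error
t_s = t_ms / 1000.0                              # convert back to seconds only for display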
If you want to see the "true" value of a floating point number in Python, you can use format(0.1*6, '.30f'), which prints the number with 30 decimal digits (still an approximation, but much closer than the default repr).

Pandas qcut: returned out and bins do not appear consistent at the left boundary

If you do simply this:
out, bins = pd.qcut(range(10), 4, retbins=True)
out is:
[(-0.001, 2.25], (-0.001, 2.25], (-0.001, 2.25], (2.25, 4.5], (2.25, 4.5], (4.5, 6.75], (4.5, 6.75], (6.75, 9.0], (6.75, 9.0], (6.75, 9.0]]
Categories (4, interval[float64]): [(-0.001, 2.25] < (2.25, 4.5] < (4.5, 6.75] < (6.75, 9.0]]
bins is:
array([0. , 2.25, 4.5 , 6.75, 9. ])
Note that all bounds from 'out' and 'bins' are consistent except the '0' (left boundary). It seemed to be -0.001 in the category interval but 0.0 in the bins array.
This causes a problem for me, since I serialize bins as-is and re-apply them to new data. The context is machine learning, where you apply the exact same binning to categorize/embed new data at inference time. Because of the difference in the left boundary, the resulting categories don't match up and I have a bug.
Does anyone know why the left boundary is -0.001 in the categorical interval but 0.0 in 'bins', while all the other boundaries match?
That looks like rounding to me.
Your code with a slight mod:
pd.qcut(np.arange(0.1,1,0.01), 4, retbins=True)
Here are the results :
([(0.099, 0.322], (0.099, 0.322], (0.099, 0.322], (0.099, 0.322], (0.099, 0.322], ..., (0.767, 0.99], (0.767, 0.99], (0.767, 0.99], (0.767, 0.99], (0.767, 0.99]]
Length: 90
Categories (4, interval[float64]): [(0.099, 0.322] < (0.322, 0.545] < (0.545, 0.767] < (0.767, 0.99]]
Bins are
array([0.1 , 0.3225, 0.545 , 0.7675, 0.99 ]))
Note that the 0.099 left boundary has been rounded off to 0.1 in the bins array.
I did the following; it feels like a hack, but it works for me for now.
As @root pointed out, the -0.001 is there because qcut uses a left-open interval, i.e. (0, 1] contains anything greater than 0 and less than or equal to 1 but excludes 0 itself, while (-0.001, 1] will include 0. We can control the precision by doing this:
out, bins = pd.qcut(range(10), 4, precision=4, retbins=True)
which gives (-0.0001, 2.25] as the first interval. I found that specifying the precision is important: if you leave the default, the left offset can end up with a very small magnitude (e.g. -1e-7, depending on your data), get 'lost' in type conversions downstream, and cause problems.
Then you fix bins[0] to match the interval's left edge:
bins[0] = out.categories[0].left
Such that now, if you perform this:
pd.cut(range(10), bins)
it will generate categorical values that match the original qcut using only the serialized bins. This solves my problem, but it still feels like a hack; I would love to hear a more robust solution.
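Putting the workaround together, a minimal end-to-end sketch of the steps described above:
import pandas as pd
out, bins = pd.qcut(range(10), 4, precision=4, retbins=True)
bins = bins.copy()
bins[0] = out.categories[0].left               # widen the serialized left edge so 0 still falls in the first bin
reapplied = pd.cut(range(10), bins)            # re-apply the serialized bins, e.g. at inference time
print((reapplied.codes == out.codes).all())    # True: the category codes line up with the original qcut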

How to do a simple matrix transpose in OpenRefine

I have data that I just need to transpose. It seems simple, but I can't make heads or tails of the transpose function.
The data looks like this:
name, requirement_1, requirement_2, requirement_3
label, 1.1, 1.2, 1.3
threshold, 10, 20, 30
objective, 100, 200, 300
floor, 0, .5, .5
I need:
name, label, threshold, objective, floor
requirement_1, 1.1, 10, 100, 0
requirement_2, 1.2, 20, 200, 0.5
requirement_3, 1.3, 30, 300, 0.5
In Power Query this is simply clicking the transpose button.
Thanks
This is a bit more complicated in OpenRefine, since you have to perform two operations: transpose cells across columns into rows, then columnize by key value.
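For comparison only (the question mentions Power Query's one-click transpose), the equivalent reshape in pandas is short; the file names here are made up:
import pandas as pd
df = pd.read_csv('requirements.csv')         # columns: name, requirement_1, requirement_2, requirement_3
out = df.set_index('name').T                 # requirement_* become rows; label/threshold/... become columns
out = out.rename_axis('name').reset_index()  # put the requirement names back into a 'name' column
out.to_csv('transposed.csv', index=False)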

How to write lists in different directions in pandas in Python

I have three lists as follows.
list_names_1 = ["Salad", "Bread"]
list_names_2 = ["Oil", "Fat", "Salt"]
list_values = [[0.2, 0.1, 0.8], [0.2, 0.9, 0.8]]
Now I want to write the aforementioned three lists to a csv file as follows.
NAMES, Oil, Fat, Salt
Salad, 0.2, 0.1, 0.8
Bread, 0.2, 0.9, 0.8
That is, I want list_names_1 in the vertical direction, list_names_2 in the horizontal direction, and list_values as the values of the two lists.
Is it possible to do this in pandas?
Use DataFrame constructor with to_csv:
df = pd.DataFrame(list_values, columns=list_names_2, index=list_names_1)
df.index.name = 'NAMES'
print (df)
       Oil  Fat  Salt
NAMES
Salad  0.2  0.1   0.8
Bread  0.2  0.9   0.8
df.to_csv('file')
NAMES,Oil,Fat,Salt
Salad,0.2,0.1,0.8
Bread,0.2,0.9,0.8
Use pd.DataFrame(data=.., columns=..., index=...) to construct the dataframe.
And use index_label in to_csv to get NAMES set as the index name in the output.
In [2167]: print (pd.DataFrame(data=list_values, columns=list_names_2, index=list_names_1)
                  .to_csv(index_label='NAMES'))
NAMES,Oil,Fat,Salt
Salad,0.2,0.1,0.8
Bread,0.2,0.9,0.8
(pd.DataFrame(data=list_values, columns=list_names_2, index=list_names_1)
 .to_csv('name.csv', index_label='NAMES'))

Linear regression slope error in numpy

I use numpy.polyfit to get a linear regression: coeffs = np.polyfit(x, y, 1).
What is the best way to calculate the error of the fit's slope using numpy?
As already mentioned by @ebarr in the comments, you can make np.polyfit return the residuals by using the keyword argument full=True.
Example:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
z, residuals, rank, singular_values, rcond = np.polyfit(x, y, 3, full=True)
residuals is then the sum of the squared residuals of the least-squares fit.
Alternatively, you can use the keyword argument cov=True to get the covariance matrix.
Example:
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
z, cov = np.polyfit(x, y, 3, cov=True)
Then the diagonal elements of cov are the variances of the coefficients in z, i.e. np.sqrt(np.diag(cov)) gives you the standard deviations of the coefficients. You can use these standard deviations to estimate the probability that the absolute error exceeds a certain value, e.g. by inserting them into an uncertainty propagation calculation. If you use e.g. 3 standard deviations in the uncertainty propagation, you obtain the error that will not be exceeded in 99.7% of cases.
One last hint: you have to choose between full=True and cov=True; cov=True only works when full=False (the default), and vice versa.
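Putting this together for the degree-1 fit from the question (reusing the example data above), a minimal sketch:
import numpy as np
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
coeffs, cov = np.polyfit(x, y, 1, cov=True)       # linear fit plus covariance of the coefficients
slope, intercept = coeffs
slope_err, intercept_err = np.sqrt(np.diag(cov))  # 1-sigma standard errors
print(f"slope = {slope:.3f} +/- {slope_err:.3f}")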