Consider the code below. I want to split a tensorflow.python.data.ops.dataset_ops.BatchDataset into inputs and labels using the function below, yet I get the error 'BatchDataset' object is not subscriptable. Can anyone help me with that?
import tensorflow as tf
input_slice=3
labels_slice=2
def split_window(features):
    inputs = features[:, input_slice, :]
    labels = features[:, labels_slice, :]

##### create a batch dataset
dataset = tf.data.Dataset.range(1, 25 + 1).batch(5)

##### split the dataset into inputs and labels
dataset = split_window(dataset)
The dataset without the split window looks like this:
tf.Tensor([1 2 3 4 5], shape=(5,), dtype=int64)
tf.Tensor([ 6 7 8 9 10], shape=(5,), dtype=int64)
tf.Tensor([11 12 13 14 15], shape=(5,), dtype=int64)
tf.Tensor([16 17 18 19 20], shape=(5,), dtype=int64)
tf.Tensor([21 22 23 24 25], shape=(5,), dtype=int64)
What I want instead is to split the data and display the inputs and labels like this:
Inputs:
[1 2 3 ]
[ 6 7 8 ]
[11 12 13 ]
[16 17 18 ]
[21 22 23 ]
Labels:
[4 5]
[9 10]
[14 15]
[19 20]
[24 25]
You can try this, using tf.slice to take the first input_slice elements of each batch as the inputs and the next labels_slice elements as the labels:
import tensorflow as tf
input_slice=3
labels_slice=2
def split_window(x):
    features = tf.slice(x, [0], [input_slice])
    labels = tf.slice(x, [input_slice], [labels_slice])
    return features, labels

dataset = tf.data.Dataset.range(1, 25 + 1).batch(5).map(split_window)

for i, j in dataset:
    print(i.numpy(), end="->")
    print(j.numpy())
[1 2 3]->[4 5]
[6 7 8]->[ 9 10]
[11 12 13]->[14 15]
[16 17 18]->[19 20]
[21 22 23]->[24 25]
You can't index a tf.data.Dataset or apply a Python function to it directly; you need to use the .map() method. Also, your function returns nothing.
import tensorflow as tf
input_slice = 3
labels_slice = 2
def split_window(features):
    inputs = tf.gather_nd(features, [input_slice])
    labels = tf.gather_nd(features, [labels_slice])
    return inputs, labels

dataset = tf.data.Dataset.range(1, 25 + 1).batch(5).map(split_window)

for x, y in dataset:
    print(x.numpy(), y.numpy())
4 3
9 8
14 13
19 18
24 23
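Note that tf.gather_nd with a single index picks out individual elements (here 4 and 3 from each batch) rather than the slices shown in the question. If you want those slices, ordinary tensor slicing also works inside the mapped function, because features there is a Tensor rather than a Dataset; a minimal sketch:
import tensorflow as tf

input_slice = 3
labels_slice = 2

def split_window(features):
    # Inside .map(), `features` is a Tensor, so plain slicing is fine.
    inputs = features[:input_slice]
    labels = features[input_slice:input_slice + labels_slice]
    return inputs, labels

dataset = tf.data.Dataset.range(1, 25 + 1).batch(5).map(split_window)
for x, y in dataset:
    print(x.numpy(), "->", y.numpy())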
I want to create a multiway contingency table from my pandas dataframe and store it in an xarray. It seems to me it ought to be straightforward enough using pandas.crosstab followed by DataFrame.to_xarray(), but I'm getting "TypeError: Cannot interpret 'interval[int64]' as a data type" in pandas v1.1.5 (v1.0.1 gives "ValueError: all arrays must be same length").
In [1]: import numpy as np
...: import pandas as pd
...: pd.__version__
Out[1]: '1.1.5'
In [2]: import xarray as xr
...: xr.__version__
Out[2]: '0.17.0'
In [3]: n = 100
...: np.random.seed(42)
...: x = pd.cut(np.random.uniform(low=0, high=3, size=n), range(5))
...: x
Out[3]:
[(1, 2], (2, 3], (2, 3], (1, 2], (0, 1], ..., (1, 2], (1, 2], (1, 2], (0, 1], (0, 1]]
Length: 100
Categories (4, interval[int64]): [(0, 1] < (1, 2] < (2, 3] < (3, 4]]
In [4]: x.value_counts().sort_index()
Out[4]:
(0, 1] 41
(1, 2] 28
(2, 3] 31
(3, 4] 0
dtype: int64
Note I need my table to include empty categories such as (3, 4].
In [6]: idx=pd.date_range('2001-01-01', periods=n, freq='8H')
...: df = pd.DataFrame({'x': x}, index=idx)
...: df['xlag'] = df.x.shift(1, 'D')
...: df['h'] = df.index.hour
...: xtab = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize='index')
...: xtab
Out[6]:
x (0, 1] (1, 2] (2, 3] (3, 4]
h xlag
0 (0, 1] 0.000000 0.700000 0.300000 0.0
(1, 2] 0.470588 0.411765 0.117647 0.0
(2, 3] 0.500000 0.333333 0.166667 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
8 (0, 1] 0.588235 0.000000 0.411765 0.0
(1, 2] 1.000000 0.000000 0.000000 0.0
(2, 3] 0.428571 0.142857 0.428571 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
16 (0, 1] 0.333333 0.250000 0.416667 0.0
(1, 2] 0.444444 0.222222 0.333333 0.0
(2, 3] 0.454545 0.363636 0.181818 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
That's fine, but my actual application has more categories and more dimensions, so this seems a clear use-case for xarray, but I get an error:
In [8]: xtab.to_xarray()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-aaedf730bb97> in <module>
----> 1 xtab.to_xarray()
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/pandas/core/generic.py in to_xarray(self)
2818 return xarray.DataArray.from_series(self)
2819 else:
-> 2820 return xarray.Dataset.from_dataframe(self)
2821
2822 @Substitution(returns=fmt.return_docstring)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in from_dataframe(cls, dataframe, sparse)
5131 obj._set_sparse_data_from_dataframe(idx, arrays, dims)
5132 else:
-> 5133 obj._set_numpy_data_from_dataframe(idx, arrays, dims)
5134 return obj
5135
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in _set_numpy_data_from_dataframe(self, idx, arrays, dims)
5062 data = np.zeros(shape, values.dtype)
5063 data[indexer] = values
-> 5064 self[name] = (dims, data)
5065
5066 @classmethod
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in __setitem__(self, key, value)
1427 )
1428
-> 1429 self.update({key: value})
1430
1431 def __delitem__(self, key: Hashable) -> None:
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in update(self, other)
3897 Dataset.assign
3898 """
-> 3899 merge_result = dataset_update_method(self, other)
3900 return self._replace(inplace=True, **merge_result._asdict())
3901
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in dataset_update_method(dataset, other)
958 priority_arg=1,
959 indexes=indexes,
--> 960 combine_attrs="override",
961 )
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
609 coerced = coerce_pandas_values(objects)
610 aligned = deep_align(
--> 611 coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
612 )
613 collected = collect_variables_and_indexes(aligned)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
428 indexes=indexes,
429 exclude=exclude,
--> 430 fill_value=fill_value,
431 )
432
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
352 if not valid_indexers:
353 # fast path for no reindexing necessary
--> 354 new_obj = obj.copy(deep=copy)
355 else:
356 new_obj = obj.reindex(
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in copy(self, deep, data)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k, v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in <dictcomp>(.0)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k, v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/variable.py in copy(self, deep, data)
2632 """
2633 if data is None:
-> 2634 data = self._data.copy(deep=deep)
2635 else:
2636 data = as_compatible_data(data)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in copy(self, deep)
1484 # 8000341
1485 array = self.array.copy(deep=True) if deep else self.array
-> 1486 return PandasIndexAdapter(array, self._dtype)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in __init__(self, array, dtype)
1407 dtype_ = array.dtype
1408 else:
-> 1409 dtype_ = np.dtype(dtype)
1410 self._dtype = dtype_
1411
TypeError: Cannot interpret 'interval[int64]' as a data type
I can avoid the error by converting x (and xlag) to a different dtype instead of pandas.Categorical before using pandas.crosstab, but then I lose any empty categories, which I need to keep in my real application.
The issue here is not the use of a CategoricalIndex but that the category labels (x.categories) form an IntervalIndex, which xarray doesn't like.
To remedy this, you can simply replace the categories within your x variable with their string representation, which coerces x.categories to the "object" dtype instead of the "interval[int64]" dtype:
x = (
pd.cut(np.random.uniform(low=0, high=3, size=n), range(5))
.rename_categories(str)
)
Then calculate your crosstab as you have already done and it should work!
To get your dataset into the coordinates you want (I think), all you need to do is stack everything into a single MultiIndex row shape (instead of the crosstab's MultiIndex-row / Index-column shape):
xtab = (
pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")
.stack()
.reorder_levels(["x", "h", "xlag"])
.sort_index()
)
xtab.to_xarray()
If you want to shorten your code and drop the explicit reordering of index levels, you can also use unstack instead of stack, which gives you the correct ordering right away:
xtab = (
pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")
.unstack([0, 1])
)
xtab.to_xarray()
Regardless of the stack() vs unstack([0, 1]) approach you use, you get this output:
<xarray.DataArray (x: 4, h: 3, xlag: 4)>
array([[[0. , 0.47058824, 0.5 , 0. ],
[0.58823529, 1. , 0.42857143, 0. ],
[0.33333333, 0.44444444, 0.45454545, 0. ]],
[[0.7 , 0.41176471, 0.33333333, 0. ],
[0. , 0. , 0.14285714, 0. ],
[0.25 , 0.22222222, 0.36363636, 0. ]],
[[0.3 , 0.11764706, 0.16666667, 0. ],
[0.41176471, 0. , 0.42857143, 0. ],
[0.41666667, 0.33333333, 0.18181818, 0. ]],
[[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ]]])
Coordinates:
* x (x) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
@Cameron-Riddell's answer is the key to my problem, but there are a couple of additional reshaping wrinkles to smooth out. Applying rename_categories(str) to my x variable as he suggests, then proceeding as in my question, allows the final line to work:
In [8]: xtab = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize='index')
...: xtab.to_xarray()
Out[8]:
<xarray.Dataset>
Dimensions: (h: 3, xlag: 4)
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
Data variables:
(0, 1] (h, xlag) float64 0.0 0.4706 0.5 0.0 ... 0.3333 0.4444 0.4545 0.0
(1, 2] (h, xlag) float64 0.7 0.4118 0.3333 0.0 ... 0.25 0.2222 0.3636 0.0
(2, 3] (h, xlag) float64 0.3 0.1176 0.1667 0.0 ... 0.3333 0.1818 0.0
(3, 4] (h, xlag) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
But I wanted a 3-d array with a single variable, not a 2-d Dataset with one variable per category. To convert it I need to apply .to_array(dim='x'). But then my dimensions are in the order x, h, xlag, and I clearly don't want h in the middle, so I also need to transpose them:
In [9]: xtab.to_xarray().to_array(dim='x').transpose('h', 'xlag', 'x')
Out[9]:
<xarray.DataArray (h: 3, xlag: 4, x: 4)>
array([[[0. , 0.7 , 0.3 , 0. ],
[0.47058824, 0.41176471, 0.11764706, 0. ],
[0.5 , 0.33333333, 0.16666667, 0. ],
[0. , 0. , 0. , 0. ]],
[[0.58823529, 0. , 0.41176471, 0. ],
[1. , 0. , 0. , 0. ],
[0.42857143, 0.14285714, 0.42857143, 0. ],
[0. , 0. , 0. , 0. ]],
[[0.33333333, 0.25 , 0.41666667, 0. ],
[0.44444444, 0.22222222, 0.33333333, 0. ],
[0.45454545, 0.36363636, 0.18181818, 0. ],
[0. , 0. , 0. , 0. ]]])
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
* x (x) <U6 '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
That's what I'd envisaged! It displays similarly to pd.crosstab, but it's a 3-d xarray instead of a pandas dataframe with a multiindex. That'll be much easier to handle in the subsequent stages of my program (the crosstab is just an intermediate step, not a result in itself).
I must say that ended up more complicated than I'd anticipated... I found a question from @kilojoules back in 2017, "When to use multiindexing vs. xarray in pandas", to which @Tkanno wrote an answer beginning "There does seem to be a transition to xarray for doing work on multi-dimensional arrays." It seems a shame to me that there isn't a version of pd.crosstab that returns an xarray - or am I asking for more pandas-xarray integration than is possible?
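For convenience, the whole recipe can be wrapped in a small helper. This is only a sketch of my own (the name crosstab_to_dataarray and its signature are not from pandas or xarray), and it assumes the categorical columns already carry string labels (e.g. via rename_categories(str)):
import pandas as pd

def crosstab_to_dataarray(df, rows, col, **kwargs):
    # Hypothetical wrapper around the steps above:
    # crosstab -> to_xarray -> stack the column categories into a new
    # dimension -> put the row dimensions first.
    xtab = pd.crosstab([df[r] for r in rows], df[col], dropna=False, **kwargs)
    return xtab.to_xarray().to_array(dim=col).transpose(*rows, col)

Usage, for the example above: da = crosstab_to_dataarray(df, ['h', 'xlag'], 'x', normalize='index')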
The following code identifies a list of prime numbers
import math
import numpy as np
import pandas as pd
import scipy
for x in range(20, 50):
    print(x, math.factorial(x - 1) % x)
the output is
20 0
21 0
22 0
23 22
24 0
25 0
26 0
27 0
28 0
29 28
30 0
31 30
32 0
33 0
34 0
35 0
36 0
37 36
38 0
39 0
40 0
41 40
42 0
43 42
44 0
45 0
46 0
47 46
48 0
49 0
The residue of the calculation is non-zero for every prime.
When I try to do the same calculation with arrays, my results are different.
from scipy.special import factorial  # assumed import for the bare factorial() call below

arr = np.arange(20, 50)
modFactArr = factorial(arr - 1) % arr
print(np.column_stack((arr, modFactArr)), '\n\n')

smodFactArr = scipy.special.factorial(arr - 1) % arr
print(np.column_stack((arr, smodFactArr)), '\n\n')
gives
[[20. 0.]
[21. 0.]
[22. 0.]
[23. 22.]
[24. 0.]
[25. 22.]
[26. 16.]
[27. 11.]
[28. 12.]
[29. 16.]
[30. 24.]
[31. 8.]
[32. 0.]
[33. 18.]
[34. 16.]
[35. 33.]
[36. 20.]
[37. 12.]
[38. 20.]
[39. 10.]
[40. 0.]
[41. 25.]
[42. 26.]
[43. 6.]
[44. 4.]
[45. 3.]
[46. 36.]
[47. 40.]
[48. 0.]
[49. 12.]]
[[20. 0.]
[21. 0.]
[22. 0.]
[23. 22.]
[24. 0.]
[25. 22.]
[26. 16.]
[27. 11.]
[28. 12.]
[29. 16.]
[30. 24.]
[31. 8.]
[32. 0.]
[33. 18.]
[34. 16.]
[35. 33.]
[36. 20.]
[37. 12.]
[38. 20.]
[39. 10.]
[40. 0.]
[41. 25.]
[42. 26.]
[43. 6.]
[44. 4.]
[45. 3.]
[46. 36.]
[47. 40.]
[48. 0.]
[49. 12.]]
Notice that numbers like 26, 27, 28, etc. now give non-zero residues. Is this an error in my code, or is there a reason that scipy and numpy do modulo arithmetic differently?
It's happening because of the NumPy dtype: the array computation overflows the fixed-width NumPy types, whereas a native Python int has arbitrary precision and cannot overflow.
From the docs:
With exact=False the factorial is approximated using the gamma function
exact=False is the default.
You can tell this is being approximated because the result is a float (hence the .0), and floats are not able to store integral results accurately beyond 2**53.
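A quick cross-check, assuming the bare factorial used above is scipy.special.factorial: compute the factorials with exact integer arithmetic (math.factorial on Python ints, or scipy.special.factorial(..., exact=True)) and the residues agree with the math.factorial loop:
import math
import numpy as np

arr = np.arange(20, 50)

# Exact arbitrary-precision factorials via Python ints, avoiding the
# float (gamma-function) approximation that loses integer precision
# beyond 2**53.
exact_mod = np.array([math.factorial(int(n) - 1) % int(n) for n in arr])
print(np.column_stack((arr, exact_mod)))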
I'm trying to fit a scikit-learn GaussianNB classifier to the following data:
X = np.array(...,dtype=np.float64).astype(np.float)
y = np.array(...,dtype=np.float64).astype(np.float)
print(X)
[[-0.5 0. ]
[ 0. 0. ]
[ 0. 0. ]
[ 0. 0. ]
[ 0. -3.5]
[-3.5 -3.5]
[-3.5 -2. ]
[-2. 0. ]
[ 0. 0. ]
[ 0. -3. ]]
print(y)
[-3.5 -3.5 -3.5 -3.5 -3.5 -2. -3. -3. -3. 1. ]
(10, 2)
(10,)
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, y)
No matter what I do I get
Traceback (most recent call last):
File "scripts/wave-pool.py", line 174, in <module>
clf.fit(X, y)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 355, in _partial_fit
if _check_partial_fit_first_call(self, classes):
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 320, in _check_partial_fit_first_call
clf.classes_ = unique_labels(classes)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([-3.5, -3. , -2. , 1. ]),)
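For reference, the check that raises this can be reproduced directly: y contains non-integer floats, so scikit-learn's label-type detection treats it as 'continuous', and classifiers such as GaussianNB only accept discrete label types. A small sketch using sklearn.utils.multiclass.type_of_target (which the unique_labels call in the traceback relies on):
import numpy as np
from sklearn.utils.multiclass import type_of_target

y = np.array([-3.5, -3.5, -3.5, -3.5, -3.5, -2., -3., -3., -3., 1.])
# Non-integer float targets are reported as 'continuous'; such labels are
# rejected by unique_labels, producing the "Unknown label type" error.
print(type_of_target(y))   # -> 'continuous'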
We are able to reproduce the following problem (only within the context of our application at the moment) on Ubuntu 15.04 and OS X with scikit-learn 0.17, when using GridSearchCV with a LogisticRegression on larger data sets.
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/pipeline.py in fit(self=Pipeline(steps=[('cpencoder', <cpml.whitebox.Lin...s', refit=True, scoring=u'roc_auc', verbose=1))]), X= Unnamed: 0 member_id loan_a... 42.993346
[152536 rows x 45 columns], y=array([0, 1, 0, ..., 1, 1, 0]), **fit_params={})
160 y : iterable, default=None
161 Training targets. Must fulfill label requirements for all steps of
162 the pipeline.
163 """
164 Xt, fit_params = self._pre_transform(X, y, **fit_params)
--> 165 self.steps[-1][-1].fit(Xt, y, **fit_params)
self.steps.fit = undefined
Xt = array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]])
y = array([0, 1, 0, ..., 1, 1, 0])
fit_params = {}
166 return self
167
168 def fit_transform(self, X, y=None, **fit_params):
169 """Fit all the transforms one after the other and transform the
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/grid_search.py in fit(self=GridSearchCV(cv=None, error_score='raise',
...jobs', refit=True, scoring=u'roc_auc', verbose=1), X=array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=array([0, 1, 0, ..., 1, 1, 0]))
799 y : array-like, shape = [n_samples] or [n_samples, n_output], optional
800 Target relative to X for classification or regression;
801 None for unsupervised learning.
802
803 """
--> 804 return self._fit(X, y, ParameterGrid(self.param_grid))
self._fit = <bound method GridSearchCV._fit of GridSearchCV(...obs', refit=True, scoring=u'roc_auc', verbose=1)>
X = array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]])
y = array([0, 1, 0, ..., 1, 1, 0])
self.param_grid = {'C': [1], 'class_weight': ['auto'], 'fit_intercept': [False], 'intercept_scaling': [1], 'penalty': ['l2']}
805
806
807 class RandomizedSearchCV(BaseSearchCV):
808 """Randomized search on hyper parameters.
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/grid_search.py in _fit(self=GridSearchCV(cv=None, error_score='raise',
...jobs', refit=True, scoring=u'roc_auc', verbose=1), X=array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=array([0, 1, 0, ..., 1, 1, 0]), parameter_iterable=<sklearn.grid_search.ParameterGrid object>)
548 )(
549 delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
550 train, test, self.verbose, parameters,
551 self.fit_params, return_parameters=True,
552 error_score=self.error_score)
--> 553 for parameters in parameter_iterable
parameters = undefined
parameter_iterable = <sklearn.grid_search.ParameterGrid object>
554 for train, test in cv)
555
556 # Out is a list of triplet: score, estimator, n_test_samples
557 n_fits = len(out)
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=2), iterable=<generator object <genexpr>>)
807 if pre_dispatch == "all" or n_jobs == 1:
808 # The iterable was consumed all at once by the above for loop.
809 # No need to wait for async callbacks to trigger to
810 # consumption.
811 self._iterating = False
--> 812 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
813 # Make sure that we get a last message telling us we are done
814 elapsed_time = time.time() - self._start_time
815 self._print('Done %3i out of %3i | elapsed: %s finished',
816 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError Mon Jan 18 11:58:09 2016
PID: 71840 Python 2.7.10: /Users/samuelhopkins/.virtualenvs/cpml/bin/python
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
67 def __init__(self, iterator_slice):
68 self.items = list(iterator_slice)
69 self._size = len(self.items)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
75 return self._size
76
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator=LogisticRegression(C=1, class_weight='auto', dua... tol=0.0001, verbose=0, warm_start=False), X=memmap([[ 0.00000000e+00, 1.29659900e+06, 5...000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=memmap([0, 1, 0, ..., 1, 1, 0]), scorer=make_scorer(roc_auc_score, needs_threshold=True), train=array([ 49100, 49101, 49102, ..., 152533, 152534, 152535]), test=array([ 0, 1, 2, ..., 57517, 57522, 57532]), verbose=1, parameters={'C': 1, 'class_weight': 'auto', 'fit_intercept': False, 'intercept_scaling': 1, 'penalty': 'l2'}, fit_params={}, return_train_score=False, return_parameters=True, error_score='raise')
1545 " numeric value. (Hint: if using 'raise', please"
1546 " make sure that it has been spelled correctly.)"
1547 )
1548
1549 else:
-> 1550 test_score = _score(estimator, X_test, y_test, scorer)
1551 if return_train_score:
1552 train_score = _score(estimator, X_train, y_train, scorer)
1553
1554 scoring_time = time.time() - start_time
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _score(estimator=LogisticRegression(C=1, class_weight='auto', dua... tol=0.0001, verbose=0, warm_start=False), X_test=memmap([[ 0.00000000e+00, 1.29659900e+06, 5...000000e+01, 0.00000000e+00, 4.29933458e+01]]), y_test=memmap([0, 1, 0, ..., 1, 1, 1]), scorer=make_scorer(roc_auc_score, needs_threshold=True))
1604 score = scorer(estimator, X_test)
1605 else:
1606 score = scorer(estimator, X_test, y_test)
1607 if not isinstance(score, numbers.Number):
1608 raise ValueError("scoring must return a number, got %s (%s) instead."
-> 1609 % (str(score), type(score)))
1610 return score
1611
1612
1613 def _permutation_test_score(estimator, X, y, cv, scorer):
ValueError: scoring must return a number, got 0.998981811748 (<class 'numpy.core.memmap.memmap'>) instead.
We have made several attempts to reproduce it outside of the context of the application, but are not having any luck. We have made the following change to cross_validation.py and it fixed our particular problem:
...
if isinstance(score, np.core.memmap):
    score = np.float(score)
if not isinstance(score, numbers.Number):
    raise ValueError("scoring must return a number, got %s (%s) instead."
...
Some more information:
we are on python 2.7
we are using a Pipeline to ensure all inputs are numeric
My questions are the following:
How might we go about reproducing this problem so as to cause the scorer to return a memmap?
Is anyone else having this particular problem?
Is the change we made in cross_validation.py actually a decent solution?
Yes, I had a similar case.
I fell in love with memmaps due to OS limits on memory allocation, and I consider them a smart tool for large-scale machine learning, using them in .fit() and other sklearn methods. (GridSearchCV() is not yet such a case, due to its adverse pre-allocation of memory on large hyperparameter grids with n_jobs=-1.)
How might we reproduce it? As far as I remember, my case was similar, and the change from an "ordinary" numpy.ndarray to a numpy.memmap() started these artifacts. So if you want to create one artificially, wrap your data into a memmapped array and have that returned, even if it contains only a single cell of data, instead of a plain number. You will receive a view into a memmapped sub-range of the array holding that cell.
Is the change a decent solution? Well, I got rid of the memmapped wrapper by explicitly returning the cell value, referencing the result's [0] component. An enforced conversion via np.float() seems fine too.
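A minimal sketch of that artifact (behaviour can differ across NumPy versions): a value that is still a view into a memmapped array is an np.memmap instance rather than a plain number, so it fails the isinstance(score, numbers.Number) check even though it prints like a float:
import numbers
import os
import tempfile

import numpy as np

# Create a tiny memmapped array holding a single score.
path = os.path.join(tempfile.mkdtemp(), "score.dat")
scores = np.memmap(path, dtype=np.float64, mode="w+", shape=(1,))
scores[0] = 0.998981811748

score = scores.reshape(())   # a 0-d view: still an np.memmap, not a plain scalar
print(type(score), isinstance(score, numbers.Number))   # memmap, False

score = float(score)         # the explicit conversion discussed above
print(type(score), isinstance(score, numbers.Number))   # float, True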