Consider the code below. I want to split a tensorflow.python.data.ops.dataset_ops.BatchDataset into inputs and labels using the function below, yet I get the error 'BatchDataset' object is not subscriptable. Can anyone help me with that?
import tensorflow as tf
input_slice=3
labels_slice=2
def split_window(features):
    inputs = features[:, input_slice, :]
    labels = features[:, labels_slice, :]

##### create a batch dataset
dataset = tf.data.Dataset.range(1, 25 + 1).batch(5)

##### split the dataset into inputs and labels
dataset = split_window(dataset)
The dataset without the split window looks like this:
tf.Tensor([1 2 3 4 5], shape=(5,), dtype=int64)
tf.Tensor([ 6 7 8 9 10], shape=(5,), dtype=int64)
tf.Tensor([11 12 13 14 15], shape=(5,), dtype=int64)
tf.Tensor([16 17 18 19 20], shape=(5,), dtype=int64)
tf.Tensor([21 22 23 24 25], shape=(5,), dtype=int64)
What I want instead is to split the data and display the inputs and labels like this:
Inputs:
[1 2 3 ]
[ 6 7 8 ]
[11 12 13 ]
[16 17 18 ]
[21 22 23 ]
Labels:
[4 5]
[9 10]
[14 15]
[19 20]
[24 25]
You can try this, using tf.slice to take the first input_slice elements of each batch as the inputs and the next labels_slice elements as the labels:
import tensorflow as tf
input_slice=3
labels_slice=2
def split_window(x):
    features = tf.slice(x, [0], [input_slice])
    labels = tf.slice(x, [input_slice], [labels_slice])
    return features, labels

dataset = tf.data.Dataset.range(1, 25 + 1).batch(5).map(split_window)

for i, j in dataset:
    print(i.numpy(), end="->")
    print(j.numpy())
[1 2 3]->[4 5]
[6 7 8]->[ 9 10]
[11 12 13]->[14 15]
[16 17 18]->[19 20]
[21 22 23]->[24 25]
You can't index a tf.data.Dataset or apply a Python function to it directly; you need to use the .map() method. Also, your function returns nothing.
import tensorflow as tf
input_slice = 3
labels_slice = 2
def split_window(features):
    inputs = tf.gather_nd(features, [input_slice])
    labels = tf.gather_nd(features, [labels_slice])
    return inputs, labels

dataset = tf.data.Dataset.range(1, 25 + 1).batch(5).map(split_window)

for x, y in dataset:
    print(x.numpy(), y.numpy())
4 3
9 8
14 13
19 18
24 23
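Note that tf.gather_nd with a single index picks out individual elements (here 4 and 3 from each batch) rather than the slices shown in the question. If you want those slices, ordinary tensor slicing also works inside the mapped function, because features there is a Tensor rather than a Dataset; a minimal sketch:
import tensorflow as tf

input_slice = 3
labels_slice = 2

def split_window(features):
    # Inside .map(), `features` is a Tensor, so plain slicing is fine.
    inputs = features[:input_slice]
    labels = features[input_slice:input_slice + labels_slice]
    return inputs, labels

dataset = tf.data.Dataset.range(1, 25 + 1).batch(5).map(split_window)
for x, y in dataset:
    print(x.numpy(), "->", y.numpy())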
I want to create a multiway contingency table from my pandas dataframe and store it in an xarray. It seems to me it ought to be straightforward enough using pandas.crosstab followed by DataFrame.to_xarray(), but I'm getting "TypeError: Cannot interpret 'interval[int64]' as a data type" in pandas v1.1.5 (v1.0.1 gives "ValueError: all arrays must be same length").
In [1]: import numpy as np
...: import pandas as pd
...: pd.__version__
Out[1]: '1.1.5'
In [2]: import xarray as xr
...: xr.__version__
Out[2]: '0.17.0'
In [3]: n = 100
...: np.random.seed(42)
...: x = pd.cut(np.random.uniform(low=0, high=3, size=n), range(5))
...: x
Out[3]:
[(1, 2], (2, 3], (2, 3], (1, 2], (0, 1], ..., (1, 2], (1, 2], (1, 2], (0, 1], (0, 1]]
Length: 100
Categories (4, interval[int64]): [(0, 1] < (1, 2] < (2, 3] < (3, 4]]
In [4]: x.value_counts().sort_index()
Out[4]:
(0, 1] 41
(1, 2] 28
(2, 3] 31
(3, 4] 0
dtype: int64
Note I need my table to include empty categories such as (3, 4].
In [6]: idx=pd.date_range('2001-01-01', periods=n, freq='8H')
...: df = pd.DataFrame({'x': x}, index=idx)
...: df['xlag'] = df.x.shift(1, 'D')
...: df['h'] = df.index.hour
...: xtab = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize='index')
...: xtab
Out[6]:
x (0, 1] (1, 2] (2, 3] (3, 4]
h xlag
0 (0, 1] 0.000000 0.700000 0.300000 0.0
(1, 2] 0.470588 0.411765 0.117647 0.0
(2, 3] 0.500000 0.333333 0.166667 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
8 (0, 1] 0.588235 0.000000 0.411765 0.0
(1, 2] 1.000000 0.000000 0.000000 0.0
(2, 3] 0.428571 0.142857 0.428571 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
16 (0, 1] 0.333333 0.250000 0.416667 0.0
(1, 2] 0.444444 0.222222 0.333333 0.0
(2, 3] 0.454545 0.363636 0.181818 0.0
(3, 4] 0.000000 0.000000 0.000000 0.0
That's fine, but my actual application has more categories and more dimensions, so this seems a clear use-case for xarray, but I get an error:
In [8]: xtab.to_xarray()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-aaedf730bb97> in <module>
----> 1 xtab.to_xarray()
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/pandas/core/generic.py in to_xarray(self)
2818 return xarray.DataArray.from_series(self)
2819 else:
-> 2820 return xarray.Dataset.from_dataframe(self)
2821
2822 @Substitution(returns=fmt.return_docstring)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in from_dataframe(cls, dataframe, sparse)
5131 obj._set_sparse_data_from_dataframe(idx, arrays, dims)
5132 else:
-> 5133 obj._set_numpy_data_from_dataframe(idx, arrays, dims)
5134 return obj
5135
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in _set_numpy_data_from_dataframe(self, idx, arrays, dims)
5062 data = np.zeros(shape, values.dtype)
5063 data[indexer] = values
-> 5064 self[name] = (dims, data)
5065
5066 @classmethod
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in __setitem__(self, key, value)
1427 )
1428
-> 1429 self.update({key: value})
1430
1431 def __delitem__(self, key: Hashable) -> None:
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in update(self, other)
3897 Dataset.assign
3898 """
-> 3899 merge_result = dataset_update_method(self, other)
3900 return self._replace(inplace=True, **merge_result._asdict())
3901
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in dataset_update_method(dataset, other)
958 priority_arg=1,
959 indexes=indexes,
--> 960 combine_attrs="override",
961 )
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
609 coerced = coerce_pandas_values(objects)
610 aligned = deep_align(
--> 611 coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
612 )
613 collected = collect_variables_and_indexes(aligned)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
428 indexes=indexes,
429 exclude=exclude,
--> 430 fill_value=fill_value,
431 )
432
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
352 if not valid_indexers:
353 # fast path for no reindexing necessary
--> 354 new_obj = obj.copy(deep=copy)
355 else:
356 new_obj = obj.reindex(
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in copy(self, deep, data)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k, v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in <dictcomp>(.0)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k, v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/variable.py in copy(self, deep, data)
2632 """
2633 if data is None:
-> 2634 data = self._data.copy(deep=deep)
2635 else:
2636 data = as_compatible_data(data)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in copy(self, deep)
1484 # 8000341
1485 array = self.array.copy(deep=True) if deep else self.array
-> 1486 return PandasIndexAdapter(array, self._dtype)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in __init__(self, array, dtype)
1407 dtype_ = array.dtype
1408 else:
-> 1409 dtype_ = np.dtype(dtype)
1410 self._dtype = dtype_
1411
TypeError: Cannot interpret 'interval[int64]' as a data type
I can avoid the error by converting x (and xlag) to a different dtype instead of pandas.Categorical before using pandas.crosstab, but then I lose any empty categories, which I need to keep in my real application.
The issue here is not the use of a CategoricalIndex but that the category labels (x.categories) form an IntervalIndex, which xarray doesn't like.
To remedy this, you can simply replace the categories within your x variable with their string representation, which coerces x.categories to the "object" dtype instead of the "interval[int64]" dtype:
x = (
pd.cut(np.random.uniform(low=0, high=3, size=n), range(5))
.rename_categories(str)
)
Then calculate your crosstab as you have already done and it should work!
To get your dataset into the coordinates you want (I think), all you need to do is stack everything into a single MultiIndex row shape (instead of the crosstab's MultiIndex-row / Index-column shape):
xtab = (
pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")
.stack()
.reorder_levels(["x", "h", "xlag"])
.sort_index()
)
xtab.to_xarray()
If you want to shorten your code and drop the explicit reordering of index levels, you can also use unstack instead of stack, which gives you the correct ordering right away:
xtab = (
pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")
.unstack([0, 1])
)
xtab.to_xarray()
Regardless of the stack() vs unstack([0, 1]) approach you use, you get this output:
<xarray.DataArray (x: 4, h: 3, xlag: 4)>
array([[[0. , 0.47058824, 0.5 , 0. ],
[0.58823529, 1. , 0.42857143, 0. ],
[0.33333333, 0.44444444, 0.45454545, 0. ]],
[[0.7 , 0.41176471, 0.33333333, 0. ],
[0. , 0. , 0.14285714, 0. ],
[0.25 , 0.22222222, 0.36363636, 0. ]],
[[0.3 , 0.11764706, 0.16666667, 0. ],
[0.41176471, 0. , 0.42857143, 0. ],
[0.41666667, 0.33333333, 0.18181818, 0. ]],
[[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ]]])
Coordinates:
* x (x) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
@Cameron-Riddell's answer is the key to my problem, but there are a couple of additional reshaping wrinkles to smooth out. Applying rename_categories(str) to my x variable as he suggests, then proceeding as in my question, allows the final line to work:
In [8]: xtab = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize='index')
...: xtab.to_xarray()
Out[8]:
<xarray.Dataset>
Dimensions: (h: 3, xlag: 4)
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
Data variables:
(0, 1] (h, xlag) float64 0.0 0.4706 0.5 0.0 ... 0.3333 0.4444 0.4545 0.0
(1, 2] (h, xlag) float64 0.7 0.4118 0.3333 0.0 ... 0.25 0.2222 0.3636 0.0
(2, 3] (h, xlag) float64 0.3 0.1176 0.1667 0.0 ... 0.3333 0.1818 0.0
(3, 4] (h, xlag) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
But I wanted a 3-d array with a single variable, not a 2-d Dataset with one variable per category. To convert it I need to apply .to_array(dim='x'). But then my dimensions are in the order x, h, xlag, and I clearly don't want h in the middle, so I also need to transpose them:
In [9]: xtab.to_xarray().to_array(dim='x').transpose('h', 'xlag', 'x')
Out[9]:
<xarray.DataArray (h: 3, xlag: 4, x: 4)>
array([[[0. , 0.7 , 0.3 , 0. ],
[0.47058824, 0.41176471, 0.11764706, 0. ],
[0.5 , 0.33333333, 0.16666667, 0. ],
[0. , 0. , 0. , 0. ]],
[[0.58823529, 0. , 0.41176471, 0. ],
[1. , 0. , 0. , 0. ],
[0.42857143, 0.14285714, 0.42857143, 0. ],
[0. , 0. , 0. , 0. ]],
[[0.33333333, 0.25 , 0.41666667, 0. ],
[0.44444444, 0.22222222, 0.33333333, 0. ],
[0.45454545, 0.36363636, 0.18181818, 0. ],
[0. , 0. , 0. , 0. ]]])
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
* x (x) <U6 '(0, 1]' '(1, 2]' '(2, 3]' '(3, 4]'
That's what I'd envisaged! It displays similarly to pd.crosstab, but it's a 3-d xarray instead of a pandas dataframe with a multiindex. That'll be much easier to handle in the subsequent stages of my program (the crosstab is just an intermediate step, not a result in itself).
I must say that ended up more complicated than I'd anticipated... I found a question from @kilojoules back in 2017, "When to use multiindexing vs. xarray in pandas", to which @Tkanno wrote an answer beginning "There does seem to be a transition to xarray for doing work on multi-dimensional arrays." It seems a shame to me that there isn't a version of pd.crosstab that returns an xarray - or am I asking for more pandas-xarray integration than is possible?
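For convenience, the whole recipe can be wrapped in a small helper. This is only a sketch of my own (the name crosstab_to_dataarray and its signature are not from pandas or xarray), and it assumes the categorical columns already carry string labels (e.g. via rename_categories(str)):
import pandas as pd

def crosstab_to_dataarray(df, rows, col, **kwargs):
    # Hypothetical wrapper around the steps above:
    # crosstab -> to_xarray -> stack the column categories into a new
    # dimension -> put the row dimensions first.
    xtab = pd.crosstab([df[r] for r in rows], df[col], dropna=False, **kwargs)
    return xtab.to_xarray().to_array(dim=col).transpose(*rows, col)

Usage, for the example above: da = crosstab_to_dataarray(df, ['h', 'xlag'], 'x', normalize='index')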
The following code identifies a list of prime numbers
import math
import numpy as np
import pandas as pd
import scipy
for x in range(20, 50):
    print(x, math.factorial(x - 1) % x)
the output is
20 0
21 0
22 0
23 22
24 0
25 0
26 0
27 0
28 0
29 28
30 0
31 30
32 0
33 0
34 0
35 0
36 0
37 36
38 0
39 0
40 0
41 40
42 0
43 42
44 0
45 0
46 0
47 46
48 0
49 0
The residue of the calculation is non-zero for every prime.
When I try to do the same calculation with arrays, my results are different.
from scipy.special import factorial  # assumed import for the bare factorial() call below

arr = np.arange(20, 50)
modFactArr = factorial(arr - 1) % arr
print(np.column_stack((arr, modFactArr)), '\n\n')

smodFactArr = scipy.special.factorial(arr - 1) % arr
print(np.column_stack((arr, smodFactArr)), '\n\n')
gives
[[20. 0.]
[21. 0.]
[22. 0.]
[23. 22.]
[24. 0.]
[25. 22.]
[26. 16.]
[27. 11.]
[28. 12.]
[29. 16.]
[30. 24.]
[31. 8.]
[32. 0.]
[33. 18.]
[34. 16.]
[35. 33.]
[36. 20.]
[37. 12.]
[38. 20.]
[39. 10.]
[40. 0.]
[41. 25.]
[42. 26.]
[43. 6.]
[44. 4.]
[45. 3.]
[46. 36.]
[47. 40.]
[48. 0.]
[49. 12.]]
[[20. 0.]
[21. 0.]
[22. 0.]
[23. 22.]
[24. 0.]
[25. 22.]
[26. 16.]
[27. 11.]
[28. 12.]
[29. 16.]
[30. 24.]
[31. 8.]
[32. 0.]
[33. 18.]
[34. 16.]
[35. 33.]
[36. 20.]
[37. 12.]
[38. 20.]
[39. 10.]
[40. 0.]
[41. 25.]
[42. 26.]
[43. 6.]
[44. 4.]
[45. 3.]
[46. 36.]
[47. 40.]
[48. 0.]
[49. 12.]]
Notice that numbers like 26, 27, 28, etc. now give non-zero residues. Is this an error in my code, or is there a reason that scipy and numpy do modulo arithmetic differently?
It's happening because of the NumPy dtype: the array computation overflows the fixed-width NumPy types, whereas a native Python int has arbitrary precision and cannot overflow.
From the docs:
With exact=False the factorial is approximated using the gamma function
exact=False is the default.
You can tell this is being approximated because the result is a float (hence the .0), and floats are not able to store integral results accurately beyond 2**53.
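A quick cross-check, assuming the bare factorial used above is scipy.special.factorial: compute the factorials with exact integer arithmetic (math.factorial on Python ints, or scipy.special.factorial(..., exact=True)) and the residues agree with the math.factorial loop:
import math
import numpy as np

arr = np.arange(20, 50)

# Exact arbitrary-precision factorials via Python ints, avoiding the
# float (gamma-function) approximation that loses integer precision
# beyond 2**53.
exact_mod = np.array([math.factorial(int(n) - 1) % int(n) for n in arr])
print(np.column_stack((arr, exact_mod)))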
I'm trying to fit a scikit-learn GaussianNB classifier to the following data:
X = np.array(...,dtype=np.float64).astype(np.float)
y = np.array(...,dtype=np.float64).astype(np.float)
print(X)
[[-0.5 0. ]
[ 0. 0. ]
[ 0. 0. ]
[ 0. 0. ]
[ 0. -3.5]
[-3.5 -3.5]
[-3.5 -2. ]
[-2. 0. ]
[ 0. 0. ]
[ 0. -3. ]]
print(y)
[-3.5 -3.5 -3.5 -3.5 -3.5 -2. -3. -3. -3. 1. ]
(10, 2)
(10,)
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, y)
No matter what I do I get
Traceback (most recent call last):
File "scripts/wave-pool.py", line 174, in <module>
clf.fit(X, y)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 192, in fit
sample_weight=sample_weight)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 355, in _partial_fit
if _check_partial_fit_first_call(self, classes):
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 320, in _check_partial_fit_first_call
clf.classes_ = unique_labels(classes)
File "/home/ben/.local/lib/python2.7/site-packages/sklearn/utils/multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array([-3.5, -3. , -2. , 1. ]),)
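For reference, the check that raises this can be reproduced directly: y contains non-integer floats, so scikit-learn's label-type detection treats it as 'continuous', and classifiers such as GaussianNB only accept discrete label types. A small sketch using sklearn.utils.multiclass.type_of_target (which the unique_labels call in the traceback relies on):
import numpy as np
from sklearn.utils.multiclass import type_of_target

y = np.array([-3.5, -3.5, -3.5, -3.5, -3.5, -2., -3., -3., -3., 1.])
# Non-integer float targets are reported as 'continuous'; such labels are
# rejected by unique_labels, producing the "Unknown label type" error.
print(type_of_target(y))   # -> 'continuous'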
We are able to reproduce the following problem (only within the context of our application at the moment) on Ubuntu 15.04 and OS X with scikit-learn 0.17, when using GridSearchCV with a LogisticRegression on larger data sets.
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/pipeline.py in fit(self=Pipeline(steps=[('cpencoder', <cpml.whitebox.Lin...s', refit=True, scoring=u'roc_auc', verbose=1))]), X= Unnamed: 0 member_id loan_a... 42.993346
[152536 rows x 45 columns], y=array([0, 1, 0, ..., 1, 1, 0]), **fit_params={})
160 y : iterable, default=None
161 Training targets. Must fulfill label requirements for all steps of
162 the pipeline.
163 """
164 Xt, fit_params = self._pre_transform(X, y, **fit_params)
--> 165 self.steps[-1][-1].fit(Xt, y, **fit_params)
self.steps.fit = undefined
Xt = array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]])
y = array([0, 1, 0, ..., 1, 1, 0])
fit_params = {}
166 return self
167
168 def fit_transform(self, X, y=None, **fit_params):
169 """Fit all the transforms one after the other and transform the
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/grid_search.py in fit(self=GridSearchCV(cv=None, error_score='raise',
...jobs', refit=True, scoring=u'roc_auc', verbose=1), X=array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=array([0, 1, 0, ..., 1, 1, 0]))
799 y : array-like, shape = [n_samples] or [n_samples, n_output], optional
800 Target relative to X for classification or regression;
801 None for unsupervised learning.
802
803 """
--> 804 return self._fit(X, y, ParameterGrid(self.param_grid))
self._fit = <bound method GridSearchCV._fit of GridSearchCV(...obs', refit=True, scoring=u'roc_auc', verbose=1)>
X = array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]])
y = array([0, 1, 0, ..., 1, 1, 0])
self.param_grid = {'C': [1], 'class_weight': ['auto'], 'fit_intercept': [False], 'intercept_scaling': [1], 'penalty': ['l2']}
805
806
807 class RandomizedSearchCV(BaseSearchCV):
808 """Randomized search on hyper parameters.
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/grid_search.py in _fit(self=GridSearchCV(cv=None, error_score='raise',
...jobs', refit=True, scoring=u'roc_auc', verbose=1), X=array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=array([0, 1, 0, ..., 1, 1, 0]), parameter_iterable=<sklearn.grid_search.ParameterGrid object>)
548 )(
549 delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
550 train, test, self.verbose, parameters,
551 self.fit_params, return_parameters=True,
552 error_score=self.error_score)
--> 553 for parameters in parameter_iterable
parameters = undefined
parameter_iterable = <sklearn.grid_search.ParameterGrid object>
554 for train, test in cv)
555
556 # Out is a list of triplet: score, estimator, n_test_samples
557 n_fits = len(out)
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=2), iterable=<generator object <genexpr>>)
807 if pre_dispatch == "all" or n_jobs == 1:
808 # The iterable was consumed all at once by the above for loop.
809 # No need to wait for async callbacks to trigger to
810 # consumption.
811 self._iterating = False
--> 812 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
813 # Make sure that we get a last message telling us we are done
814 elapsed_time = time.time() - self._start_time
815 self._print('Done %3i out of %3i | elapsed: %s finished',
816 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError Mon Jan 18 11:58:09 2016
PID: 71840 Python 2.7.10: /Users/samuelhopkins/.virtualenvs/cpml/bin/python
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
67 def __init__(self, iterator_slice):
68 self.items = list(iterator_slice)
69 self._size = len(self.items)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
75 return self._size
76
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator=LogisticRegression(C=1, class_weight='auto', dua... tol=0.0001, verbose=0, warm_start=False), X=memmap([[ 0.00000000e+00, 1.29659900e+06, 5...000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=memmap([0, 1, 0, ..., 1, 1, 0]), scorer=make_scorer(roc_auc_score, needs_threshold=True), train=array([ 49100, 49101, 49102, ..., 152533, 152534, 152535]), test=array([ 0, 1, 2, ..., 57517, 57522, 57532]), verbose=1, parameters={'C': 1, 'class_weight': 'auto', 'fit_intercept': False, 'intercept_scaling': 1, 'penalty': 'l2'}, fit_params={}, return_train_score=False, return_parameters=True, error_score='raise')
1545 " numeric value. (Hint: if using 'raise', please"
1546 " make sure that it has been spelled correctly.)"
1547 )
1548
1549 else:
-> 1550 test_score = _score(estimator, X_test, y_test, scorer)
1551 if return_train_score:
1552 train_score = _score(estimator, X_train, y_train, scorer)
1553
1554 scoring_time = time.time() - start_time
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _score(estimator=LogisticRegression(C=1, class_weight='auto', dua... tol=0.0001, verbose=0, warm_start=False), X_test=memmap([[ 0.00000000e+00, 1.29659900e+06, 5...000000e+01, 0.00000000e+00, 4.29933458e+01]]), y_test=memmap([0, 1, 0, ..., 1, 1, 1]), scorer=make_scorer(roc_auc_score, needs_threshold=True))
1604 score = scorer(estimator, X_test)
1605 else:
1606 score = scorer(estimator, X_test, y_test)
1607 if not isinstance(score, numbers.Number):
1608 raise ValueError("scoring must return a number, got %s (%s) instead."
-> 1609 % (str(score), type(score)))
1610 return score
1611
1612
1613 def _permutation_test_score(estimator, X, y, cv, scorer):
ValueError: scoring must return a number, got 0.998981811748 (<class 'numpy.core.memmap.memmap'>) instead.
We have made several attempts to reproduce it outside of the context of the application, but are not having any luck. We have made the following change to cross_validation.py and it fixed our particular problem:
...
if isinstance(score, np.core.memmap):
    score = np.float(score)
if not isinstance(score, numbers.Number):
    raise ValueError("scoring must return a number, got %s (%s) instead."
...
Some more information:
we are on python 2.7
we are using a Pipeline to ensure all inputs are numeric
My questions are the following:
How might we go about reproducing this problem so as to cause the scorer to return a memmap?
Is anyone else having this particular problem?
Is the change we made in cross_validation.py actually a decent solution?
Yes, I had a similar case.
I fell in love with memmaps due to OS limits on memory allocation, and I consider them a smart tool for large-scale machine learning, using them in .fit() and other sklearn methods. (GridSearchCV() is not yet such a case, due to its adverse pre-allocation of memory on large hyperparameter grids with n_jobs=-1.)
How might we reproduce it? As far as I remember, my case was similar, and the change from an "ordinary" numpy.ndarray to a numpy.memmap() started these artifacts. So if you want to create one artificially, wrap your data into a memmapped array and have that returned, even if it contains only a single cell of data, instead of a plain number. You will receive a view into a memmapped sub-range of the array holding that cell.
Is the change a decent solution? Well, I got rid of the memmapped wrapper by explicitly returning the cell value, referencing the result's [0] component. An enforced conversion via np.float() seems fine too.
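A minimal sketch of that artifact (behaviour can differ across NumPy versions): a value that is still a view into a memmapped array is an np.memmap instance rather than a plain number, so it fails the isinstance(score, numbers.Number) check even though it prints like a float:
import numbers
import os
import tempfile

import numpy as np

# Create a tiny memmapped array holding a single score.
path = os.path.join(tempfile.mkdtemp(), "score.dat")
scores = np.memmap(path, dtype=np.float64, mode="w+", shape=(1,))
scores[0] = 0.998981811748

score = scores.reshape(())   # a 0-d view: still an np.memmap, not a plain scalar
print(type(score), isinstance(score, numbers.Number))   # memmap, False

score = float(score)         # the explicit conversion discussed above
print(type(score), isinstance(score, numbers.Number))   # float, True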