Numpy: using np.pad() for an RGB image causing "operands could not be broadcast together with shapes (4,4,3) (4,4,5)" error

I have a function color_image_padding that takes an RGB image and adds one layer of zeros padding to the borders. The image has dimensions (Width, Height, 3), with 3 representing the 3 color channels.
My code is:
import numpy as np

def color_image_padding(image: np.ndarray) -> np.ndarray:
    return np.pad(image, pad_width=1)
I'm seeing this error:
"operands could not be broadcast together with shapes (4,4,3) (4,4,5)"
It's probably the color channels that are causing this error. Doesn't np.pad split the image into 3 matrices and add the zero padding accordingly?
Thanks in advance for your assistance!
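For reference, a scalar pad_width does not skip the channel axis; it pads every axis of the array, so a (4, 4, 3) image grows to (6, 6, 5):

import numpy as np

img = np.zeros((4, 4, 3))
np.pad(img, pad_width=1).shape  # (6, 6, 5): the channel axis is padded too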
EDIT
See comments below... It turns out that the generalized function image_padding() was throwing an error message because some greyscale images (i.e. 2D Numpy matrices) were passed in. Here's a minimal example:
bar = np.ones((1, 3))
bar.ndim  # 2

def image_padding(image: np.ndarray, amt: int) -> np.ndarray:
    return np.pad(image, pad_width=((amt, amt), (amt, amt), (0, 0)))

image_padding(bar, 2)
Full Traceback:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8116/4065018867.py in <module>
----> 1 img(bar, 3)
~\AppData\Local\Temp/ipykernel_8116/1455868751.py in img(image, amt)
1 def img(image, amt):
----> 2 return np.pad(image, pad_width=((amt, amt), (amt, amt), (0, 0)))
<__array_function__ internals> in pad(*args, **kwargs)
~\anaconda3\lib\site-packages\numpy\lib\arraypad.py in pad(array, pad_width, mode, **kwargs)
741
742 # Broadcast to shape (array.ndim, 2)
--> 743 pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
744
745 if callable(mode):
~\anaconda3\lib\site-packages\numpy\lib\arraypad.py in _as_pairs(x, ndim, as_index)
516 # Converting the array with `tolist` seems to improve performance
517 # when iterating and indexing the result (see usage in `pad`)
--> 518 return np.broadcast_to(x, (ndim, 2)).tolist()
519
520
<__array_function__ internals> in broadcast_to(*args, **kwargs)
~\anaconda3\lib\site-packages\numpy\lib\stride_tricks.py in broadcast_to(array, shape, subok)
409 [1, 2, 3]])
410 """
--> 411 return _broadcast_to(array, shape, subok=subok, readonly=True)
412
413
~\anaconda3\lib\site-packages\numpy\lib\stride_tricks.py in _broadcast_to(array, shape, subok, readonly)
346 'negative')
347 extras = []
--> 348 it = np.nditer(
349 (array,), flags=['multi_index', 'refs_ok', 'zerosize_ok'] + extras,
350 op_flags=['readonly'], itershape=shape, order='C')
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)
Testing whether the image is greyscale or color resolves the issue:
def image_padding(image: np.ndarray, amt: int) -> np.ndarray:
    if image.ndim == 2:
        return np.pad(image, pad_width=(amt, amt))
    elif image.ndim == 3:
        return np.pad(image, pad_width=((amt, amt), (amt, amt), (0, 0)))
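A dimension-agnostic sketch (an alternative formulation, not from the original post) builds pad_width from image.ndim, so a single branch handles greyscale, RGB, and RGBA images alike:

import numpy as np

def image_padding(image: np.ndarray, amt: int) -> np.ndarray:
    # Pad the two leading (spatial) axes; leave any trailing axes
    # (color channels, alpha, ...) untouched.
    pad_width = [(amt, amt), (amt, amt)] + [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad_width)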

So this reproduces your error: using the three-term pad_width on a 2D array.
It's fine with 3D:
In [194]: x = np.ones((5,5,3),int)
In [196]: amt_padding=1;np.pad(x, pad_width=((amt_padding, amt_padding), (amt_padding, amt_padding), (0, 0))).shape
Out[196]: (7, 7, 3)
but it fails if the array is 2D:
In [197]: amt_padding=1;np.pad(x[:,:,0], pad_width=((amt_padding, amt_padding), (amt_padding, amt_padding), (0, 0)))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [197], in <cell line: 1>()
----> 1 amt_padding=1;np.pad(x[:,:,0], pad_width=((amt_padding, amt_padding), (amt_padding, amt_padding), (0, 0)))
File <__array_function__ internals>:5, in pad(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\lib\arraypad.py:743, in pad(array, pad_width, mode, **kwargs)
740 raise TypeError('`pad_width` must be of integral type.')
742 # Broadcast to shape (array.ndim, 2)
--> 743 pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
745 if callable(mode):
746 # Old behavior: Use user-supplied function with np.apply_along_axis
747 function = mode
File ~\anaconda3\lib\site-packages\numpy\lib\arraypad.py:518, in _as_pairs(x, ndim, as_index)
514 raise ValueError("index can't contain negative values")
516 # Converting the array with `tolist` seems to improve performance
517 # when iterating and indexing the result (see usage in `pad`)
--> 518 return np.broadcast_to(x, (ndim, 2)).tolist()
File <__array_function__ internals>:5, in broadcast_to(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\lib\stride_tricks.py:411, in broadcast_to(array, shape, subok)
366 #array_function_dispatch(_broadcast_to_dispatcher, module='numpy')
367 def broadcast_to(array, shape, subok=False):
368 """Broadcast an array to a new shape.
369
370 Parameters
(...)
409 [1, 2, 3]])
410 """
--> 411 return _broadcast_to(array, shape, subok=subok, readonly=True)
File ~\anaconda3\lib\site-packages\numpy\lib\stride_tricks.py:348, in _broadcast_to(array, shape, subok, readonly)
345 raise ValueError('all elements of broadcast shape must be non-'
346 'negative')
347 extras = []
--> 348 it = np.nditer(
349 (array,), flags=['multi_index', 'refs_ok', 'zerosize_ok'] + extras,
350 op_flags=['readonly'], itershape=shape, order='C')
351 with it:
352 # never really has writebackifcopy semantics
353 broadcast = it.itviews[0]
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)
np.pad passes the broadcasting work to np.nditer (via broadcast_to), and that is what raises the error. That would account for why I hadn't seen this message before: I've explored nditer some, but it's not something I regularly use or recommend to others.
_as_pairs expands widths like this:
In [206]: np.lib.arraypad._as_pairs(1,3, as_index=True)
Out[206]: ((1, 1), (1, 1), (1, 1))
In [207]: np.lib.arraypad._as_pairs(((1,),(2,),(3,)),3, as_index=True)
Out[207]: [[1, 1], [2, 2], [3, 3]]
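The failing step can also be seen in isolation: a (3, 2) pad specification cannot be broadcast to the (ndim, 2) = (2, 2) shape a 2D array requires, which is exactly the final line of the traceback above:

import numpy as np

widths = np.array([(1, 1), (1, 1), (0, 0)])  # shape (3, 2)
np.broadcast_to(widths, (2, 2))
# ValueError: operands could not be broadcast together with remapped
# shapes [original->remapped]: (3,2) and requested shape (2,2)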

Related

Error resulting from ImageDataGenerator during data augmentation

Can someone please help me fix this error? The code works fine before the for loop; up to that point, an array of the image prints as expected. Is there something wrong with the for loop? The output should be augmented images of the input image, saved to a folder. The input image is a JPG.
The code I wrote:
import keras
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
from skimage import io

data_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='contrast',
    cval=125
)

x = io.imread('mona.jpg')
x = x.reshape((1, ) + x.shape)  # array with shape (1, 256, 256, 3)

i = 0
for batch in data_gen.flow(x, batch_size=16, save_to_dir='/Users/ghad/Desktop',
                           save_prefix='aug',
                           save_format='jpg'):
    i += 1
    if i > 20:
        break
The generated error:
RuntimeError Traceback (most recent call last)
Input In [14], in <cell line: 31>()
28 x = x.reshape((1, ) + x.shape) #Array with shape (1, 256, 256, 3)
30 i = 0
---> 31 for batch in data_gen.flow(x, batch_size=16,
32 save_to_dir='/Users/ghadahalhabib/Desktop',
33 save_prefix='aug',
34 save_format='jpg'):
35 i += 1
36 if i > 20:
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/preprocessing/image.py:148, in Iterator.__next__(self, *args, **kwargs)
147 def __next__(self, *args, **kwargs):
--> 148 return self.next(*args, **kwargs)
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/preprocessing/image.py:160, in Iterator.next(self)
157 index_array = next(self.index_generator)
158 # The transformation of images is not under thread lock
159 # so it can be done in parallel
--> 160 return self._get_batches_of_transformed_samples(index_array)
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/preprocessing/image.py:709, in NumpyArrayIterator._get_batches_of_transformed_samples(self, index_array)
707 x = self.x[j]
708 params = self.image_data_generator.get_random_transform(x.shape)
--> 709 x = self.image_data_generator.apply_transform(
710 x.astype(self.dtype), params)
711 x = self.image_data_generator.standardize(x)
712 batch_x[i] = x
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/preprocessing/image.py:1800, in ImageDataGenerator.apply_transform(self, x, transform_parameters)
1797 img_col_axis = self.col_axis - 1
1798 img_channel_axis = self.channel_axis - 1
-> 1800 x = apply_affine_transform(
1801 x,
1802 transform_parameters.get('theta', 0),
1803 transform_parameters.get('tx', 0),
1804 transform_parameters.get('ty', 0),
1805 transform_parameters.get('shear', 0),
1806 transform_parameters.get('zx', 1),
1807 transform_parameters.get('zy', 1),
1808 row_axis=img_row_axis,
1809 col_axis=img_col_axis,
1810 channel_axis=img_channel_axis,
1811 fill_mode=self.fill_mode,
1812 cval=self.cval,
1813 order=self.interpolation_order)
1815 if transform_parameters.get('channel_shift_intensity') is not None:
1816 x = apply_channel_shift(x,
1817 transform_parameters['channel_shift_intensity'],
1818 img_channel_axis)
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/preprocessing/image.py:2324, in apply_affine_transform(x, theta, tx, ty, shear, zx, zy, row_axis, col_axis, channel_axis, fill_mode, cval, order)
2321 final_affine_matrix = transform_matrix[:2, :2]
2322 final_offset = transform_matrix[:2, 2]
-> 2324 channel_images = [ndimage.interpolation.affine_transform( # pylint: disable=g-complex-comprehension
2325 x_channel,
2326 final_affine_matrix,
2327 final_offset,
2328 order=order,
2329 mode=fill_mode,
2330 cval=cval) for x_channel in x]
2331 x = np.stack(channel_images, axis=0)
2332 x = np.rollaxis(x, 0, channel_axis + 1)
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/keras/preprocessing/image.py:2324, in <listcomp>(.0)
2321 final_affine_matrix = transform_matrix[:2, :2]
2322 final_offset = transform_matrix[:2, 2]
-> 2324 channel_images = [ndimage.interpolation.affine_transform( # pylint: disable=g-complex-comprehension
2325 x_channel,
2326 final_affine_matrix,
2327 final_offset,
2328 order=order,
2329 mode=fill_mode,
2330 cval=cval) for x_channel in x]
2331 x = np.stack(channel_images, axis=0)
2332 x = np.rollaxis(x, 0, channel_axis + 1)
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/scipy/ndimage/interpolation.py:574, in affine_transform(input, matrix, offset, output_shape, output, order, mode, cval, prefilter)
572 npad = 0
573 filtered = input
--> 574 mode = _ni_support._extend_mode_to_code(mode)
575 matrix = numpy.asarray(matrix, dtype=numpy.float64)
576 if matrix.ndim not in [1, 2] or matrix.shape[0] < 1:
File ~/opt/anaconda3/envs/tensorflow/lib/python3.9/site-packages/scipy/ndimage/_ni_support.py:54, in _extend_mode_to_code(mode)
52 return 6
53 else:
---> 54 raise RuntimeError('boundary mode not supported')
RuntimeError: boundary mode not supported
For the code
for batch in data_gen.flow(x, batch_size=16, save_to_dir='/Users/ghad/Desktop', save_prefix='aug', save_format='jpg'):
you are feeding in only a single image but asking for 16 augmented images per batch. That won't work: normally the length of x is larger than the batch size. Set the batch size to 1. That way you will produce one augmented image each time you feed a new image into the generator.
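Separately, the RuntimeError in the traceback comes from scipy rejecting the boundary mode: fill_mode='contrast' is not a valid ImageDataGenerator option (the documented choices are 'constant', 'nearest', 'reflect', and 'wrap'). A minimal corrected sketch combining both fixes might look like:

import tensorflow as tf
from skimage import io

data_gen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='constant',  # a supported boundary mode; 'contrast' is not
    cval=125,
)

x = io.imread('mona.jpg')
x = x.reshape((1,) + x.shape)  # (1, height, width, 3)

# batch_size=1 because only a single image is fed in
for i, batch in enumerate(data_gen.flow(x, batch_size=1,
                                        save_to_dir='/Users/ghad/Desktop',
                                        save_prefix='aug',
                                        save_format='jpg')):
    if i >= 20:  # stop after ~20 augmented images
        break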

Shape of passed values is (12, 4), indices imply (4, 4)

In code that builds multiple regression models, I am getting the following error.
ValueError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes, consolidate)
1773 blocks = _form_blocks(arrays, names, axes, consolidate)
-> 1774 mgr = BlockManager(blocks, axes)
1775 except ValueError as e:
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, verify_integrity)
913
--> 914 self._verify_integrity()
915
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
920 if block.shape[1:] != mgr_shape[1:]:
--> 921 raise construction_error(tot_items, block.shape[1:], self.axes)
922 if len(self.items) != tot_items:
ValueError: Shape of passed values is (12, 4), indices imply (4, 4)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
C:\Users\ASUSVI~1\AppData\Local\Temp/ipykernel_16768/4041741565.py in <module>
19
20
---> 21 pd.DataFrame({'Train RMSE': rmse_train,'Test RMSE': rmse_test,'Training Score':scores_train,'Test Score': scores_test},
22 index=['Linear Regression','Decision Tree Regressor','Random Forest Regressor', 'ANN Regressor'])
23
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
612 elif isinstance(data, dict):
613 # GH#38939 de facto copy defaults to False only in non-dict cases
--> 614 mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
615 elif isinstance(data, ma.MaskedArray):
616 import numpy.ma.mrecords as mrecords
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in dict_to_mgr(data, index, columns, dtype, typ, copy)
462 # TODO: can we get rid of the dt64tz special case above?
463
--> 464 return arrays_to_mgr(
465 arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
466 )
~\Anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity, typ, consolidate)
133
134 if typ == "block":
--> 135 return create_block_manager_from_arrays(
136 arrays, arr_names, axes, consolidate=consolidate
137 )
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes, consolidate)
1774 mgr = BlockManager(blocks, axes)
1775 except ValueError as e:
-> 1776 raise construction_error(len(arrays), arrays[0].shape, axes, e)
1777 if consolidate:
1778 mgr._consolidate_inplace()
ValueError: Shape of passed values is (12, 4), indices imply (4, 4)
Please help!
Well well well... it looks like the answer was very simple. Just adding .values.ravel() to y_train solved the problem. For example:
i.fit(X_train, y_train.values.ravel())
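For context, a minimal illustration of what .values.ravel() changes, assuming y_train is a single-column DataFrame (as is typical when it was sliced from a frame):

import pandas as pd

y_train = pd.DataFrame({'target': [1.0, 2.0, 3.0]})
y_train.values.shape          # (3, 1): a 2-D column vector
y_train.values.ravel().shape  # (3,): the 1-D array scikit-learn expects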

Pandas dropna throwing ValueError: "Cannot convert non-finite values (NA or inf) to integer"

Pandas: 0.25.3
Python: 3.7.4
I have a data frame, and I want to remove the columns which contain only NaN values. That should be easy, because there is a Pandas DataFrame function which does exactly that—dropna. Here's my code:
long_summary = long_summary.dropna(axis='columns', how='all')
But that simple line throws an exception:
ValueError: Cannot convert non-finite values (NA or inf) to integer
I cannot see how calling dropna would lead to this exception. What is going on and how do I fix it?
I'll include the whole exception stack just-in-case that makes the problem clearer:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-88-b4926abd4d81> in <module>
----> 1 long_summary = long_summary.dropna(axis='columns', how='all')
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\frame.py in dropna(self, axis, how, thresh, subset, inplace)
4860 agg_obj = self.take(indices, axis=agg_axis)
4861
-> 4862 count = agg_obj.count(axis=agg_axis)
4863
4864 if thresh is not None:
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\frame.py in count(self, axis, level, numeric_only)
7848 result = Series(counts, index=frame._get_agg_axis(axis))
7849
-> 7850 return result.astype("int64")
7851
7852 def _count_level(self, level, axis=0, numeric_only=False):
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
5880 # else, only a single dtype is given
5881 new_data = self._data.astype(
-> 5882 dtype=dtype, copy=copy, errors=errors, **kwargs
5883 )
5884 return self._constructor(new_data).__finalize__(self)
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
579
580 def astype(self, dtype, **kwargs):
--> 581 return self.apply("astype", dtype=dtype, **kwargs)
582
583 def convert(self, **kwargs):
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
436 kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
437
--> 438 applied = getattr(b, f)(**kwargs)
439 result_blocks = _extend_blocks(applied, result_blocks)
440
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
557
558 def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559 return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
560
561 def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
641 # _astype_nansafe works fine with 1-d only
642 vals1d = values.ravel()
--> 643 values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
644
645 # TODO(extension)
c:\users\timregan\appdata\local\programs\python\python37\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
698 if not np.isfinite(arr).all():
699 raise ValueError(
--> 700 "Cannot convert non-finite values (NA or inf) to " "integer"
701 )
702
ValueError: Cannot convert non-finite values (NA or inf) to integer
(N.B. the data types of my columns are int64, Int32, and float64)
In the comments Scott asked for data to reproduce this issue. The redacted CSV is available on Dropbox here.
df = pd.read_csv('E:\\Temp\\dropna.csv')
df.dropna(axis='columns', how='all')
But be warned, the CSV is 3.3 GB and the resulting data frame has over 60 million rows. I tried cutting out rows, but it seems to need to be this long to trigger the error.
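The exception itself is easy to produce in isolation. The traceback shows dropna calling count() and then astype("int64"), and that cast is what raises when a count comes back non-finite (why count() would ever yield a non-finite value is the real bug here, possibly related to the nullable Int32 columns; that is an assumption, not something the trace proves):

import numpy as np
import pandas as pd

# Casting a series containing NaN (or inf) to int64 raises the same error:
pd.Series([1.0, np.nan]).astype("int64")
# ValueError: Cannot convert non-finite values (NA or inf) to integer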

Can't perform calculations on DataFrame values

I am trying to apply a formula to each value in a Pandas DataFrame, however, I am getting an error.
def transform_x(x):
    return x / 0.65

transformed = input_df.applymap(transform_x)
This returns the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-72-66afcc1d1b80> in <module>
3
4
----> 5 transformed = input_df.applymap(transform_x)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in applymap(self, func)
6551 return lib.map_infer(x.astype(object).values, func)
6552
-> 6553 return self.apply(infer)
6554
6555 # ----------------------------------------------------------------------
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6485 args=args,
6486 kwds=kwds)
-> 6487 return op.get_result()
6488
6489 def applymap(self, func):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
149 return self.apply_raw()
150
--> 151 return self.apply_standard()
152
153 def apply_empty_result(self):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
255
256 # compute the result using the series generator
--> 257 self.apply_series_generator()
258
259 # wrap results
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
284 try:
285 for i, v in enumerate(series_gen):
--> 286 results[i] = self.f(v)
287 keys.append(v.name)
288 except Exception as e:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in infer(x)
6549 if x.empty:
6550 return lib.map_infer(x, func)
-> 6551 return lib.map_infer(x.astype(object).values, func)
6552
6553 return self.apply(infer)
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-72-66afcc1d1b80> in transform_x(x)
1 def transform_x(x):
----> 2 return x/0.65
3
4
5 transformed = input_df.applymap(transform_x)
TypeError: ("unsupported operand type(s) for /: 'str' and 'float'", 'occurred at index (column_a)')
I have tried converting the type of the DataFrame to float, as I thought that this might be the issue; however, I am encountering a different problem.
input_df = input_df.astype(float)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-2102a8e5c505> in <module>
----> 1 input_df= input_df.astype(float)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
5689 # else, only a single dtype is given
5690 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 5691 **kwargs)
5692 return self._constructor(new_data).__finalize__(self)
5693
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
529
530 def astype(self, dtype, **kwargs):
--> 531 return self.apply('astype', dtype=dtype, **kwargs)
532
533 def convert(self, **kwargs):
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
393 copy=align_copy)
394
--> 395 applied = getattr(b, f)(**kwargs)
396 result_blocks = _extend_blocks(applied, result_blocks)
397
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
532 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
533 return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 534 **kwargs)
535
536 def _astype(self, dtype, copy=False, errors='raise', values=None,
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
631
632 # _astype_nansafe works fine with 1-d only
--> 633 values = astype_nansafe(values.ravel(), dtype, copy=True)
634
635 # TODO(extension)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
700 if copy or is_object_dtype(arr) or is_object_dtype(dtype):
701 # Explicit copy, or required since NumPy can't view from / to object.
--> 702 return arr.astype(dtype, copy=True)
703
704 return arr.view(dtype)
ValueError: could not convert string to float:
I am really not sure what is going wrong. I have tried exporting the DataFrames as a csv and, aside from the indexes which do contain text, the values are all floats. Is this something to do with the indexes perhaps?
As an addendum, I tried using pd.to_numeric outside of a lambda function but it also returned an error:
input_df = pd.to_numeric(input_df, errors='coerce')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-93-7178dce9054b> in <module>
----> 1 input_df = pd.to_numeric(input_df, errors='coerce')
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
120 values = np.array([arg], dtype='O')
121 elif getattr(arg, 'ndim', 1) > 1:
--> 122 raise TypeError('arg must be a list, tuple, 1-d array, or Series')
123 else:
124 values = arg
TypeError: arg must be a list, tuple, 1-d array, or Series
You may try something like:
input_df = input_df.apply(lambda x: pd.to_numeric(x, errors='coerce')).applymap(transform_x)
input_df is a 2-D DataFrame, but pd.to_numeric() takes only a list, tuple, 1-D array, or Series, so you cannot call it on a DataFrame directly. Hence we use the lambda to pass each column in as a Series.
Once the whole DataFrame holds numeric data, apply your function.
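Spelled out step by step (a sketch assuming input_df as defined in the question), the same idea reads:

import pandas as pd

def transform_x(x):
    return x / 0.65

# Coerce every column to numeric; strings that cannot be parsed become NaN.
numeric_df = input_df.apply(pd.to_numeric, errors='coerce')

# The element-wise transformation is now safe.
transformed = numeric_df.applymap(transform_x)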

Scoring returning a numpy.core.memmap instead of a numpy.Number in grid search

We are able to reproduce the following problem (at the moment only within the context of our application) on Ubuntu 15.04 and OS X, with scikit-learn 0.17, when using GridSearchCV with a LogisticRegression on larger data sets.
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/pipeline.py in fit(self=Pipeline(steps=[('cpencoder', <cpml.whitebox.Lin...s', refit=True, scoring=u'roc_auc', verbose=1))]), X= Unnamed: 0 member_id loan_a... 42.993346
[152536 rows x 45 columns], y=array([0, 1, 0, ..., 1, 1, 0]), **fit_params={})
160 y : iterable, default=None
161 Training targets. Must fulfill label requirements for all steps of
162 the pipeline.
163 """
164 Xt, fit_params = self._pre_transform(X, y, **fit_params)
--> 165 self.steps[-1][-1].fit(Xt, y, **fit_params)
self.steps.fit = undefined
Xt = array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]])
y = array([0, 1, 0, ..., 1, 1, 0])
fit_params = {}
166 return self
167
168 def fit_transform(self, X, y=None, **fit_params):
169 """Fit all the transforms one after the other and transform the
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/grid_search.py in fit(self=GridSearchCV(cv=None, error_score='raise',
...jobs', refit=True, scoring=u'roc_auc', verbose=1), X=array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=array([0, 1, 0, ..., 1, 1, 0]))
799 y : array-like, shape = [n_samples] or [n_samples, n_output], optional
800 Target relative to X for classification or regression;
801 None for unsupervised learning.
802
803 """
--> 804 return self._fit(X, y, ParameterGrid(self.param_grid))
self._fit = <bound method GridSearchCV._fit of GridSearchCV(...obs', refit=True, scoring=u'roc_auc', verbose=1)>
X = array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]])
y = array([0, 1, 0, ..., 1, 1, 0])
self.param_grid = {'C': [1], 'class_weight': ['auto'], 'fit_intercept': [False], 'intercept_scaling': [1], 'penalty': ['l2']}
805
806
807 class RandomizedSearchCV(BaseSearchCV):
808 """Randomized search on hyper parameters.
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/grid_search.py in _fit(self=GridSearchCV(cv=None, error_score='raise',
...jobs', refit=True, scoring=u'roc_auc', verbose=1), X=array([[ 0.00000000e+00, 1.29659900e+06, 5....000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=array([0, 1, 0, ..., 1, 1, 0]), parameter_iterable=<sklearn.grid_search.ParameterGrid object>)
548 )(
549 delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
550 train, test, self.verbose, parameters,
551 self.fit_params, return_parameters=True,
552 error_score=self.error_score)
--> 553 for parameters in parameter_iterable
parameters = undefined
parameter_iterable = <sklearn.grid_search.ParameterGrid object>
554 for train, test in cv)
555
556 # Out is a list of triplet: score, estimator, n_test_samples
557 n_fits = len(out)
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=2), iterable=<generator object <genexpr>>)
807 if pre_dispatch == "all" or n_jobs == 1:
808 # The iterable was consumed all at once by the above for loop.
809 # No need to wait for async callbacks to trigger to
810 # consumption.
811 self._iterating = False
--> 812 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
813 # Make sure that we get a last message telling us we are done
814 elapsed_time = time.time() - self._start_time
815 self._print('Done %3i out of %3i | elapsed: %s finished',
816 (len(self._output), len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError Mon Jan 18 11:58:09 2016
PID: 71840 Python 2.7.10: /Users/samuelhopkins/.virtualenvs/cpml/bin/python
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
67 def __init__(self, iterator_slice):
68 self.items = list(iterator_slice)
69 self._size = len(self.items)
70
71 def __call__(self):
---> 72 return [func(*args, **kwargs) for func, args, kwargs in self.items]
73
74 def __len__(self):
75 return self._size
76
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _fit_and_score(estimator=LogisticRegression(C=1, class_weight='auto', dua... tol=0.0001, verbose=0, warm_start=False), X=memmap([[ 0.00000000e+00, 1.29659900e+06, 5...000000e+00, 0.00000000e+00, 4.29933458e+01]]), y=memmap([0, 1, 0, ..., 1, 1, 0]), scorer=make_scorer(roc_auc_score, needs_threshold=True), train=array([ 49100, 49101, 49102, ..., 152533, 152534, 152535]), test=array([ 0, 1, 2, ..., 57517, 57522, 57532]), verbose=1, parameters={'C': 1, 'class_weight': 'auto', 'fit_intercept': False, 'intercept_scaling': 1, 'penalty': 'l2'}, fit_params={}, return_train_score=False, return_parameters=True, error_score='raise')
1545 " numeric value. (Hint: if using 'raise', please"
1546 " make sure that it has been spelled correctly.)"
1547 )
1548
1549 else:
-> 1550 test_score = _score(estimator, X_test, y_test, scorer)
1551 if return_train_score:
1552 train_score = _score(estimator, X_train, y_train, scorer)
1553
1554 scoring_time = time.time() - start_time
...........................................................................
/Users/samuelhopkins/.virtualenvs/cpml/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _score(estimator=LogisticRegression(C=1, class_weight='auto', dua... tol=0.0001, verbose=0, warm_start=False), X_test=memmap([[ 0.00000000e+00, 1.29659900e+06, 5...000000e+01, 0.00000000e+00, 4.29933458e+01]]), y_test=memmap([0, 1, 0, ..., 1, 1, 1]), scorer=make_scorer(roc_auc_score, needs_threshold=True))
1604 score = scorer(estimator, X_test)
1605 else:
1606 score = scorer(estimator, X_test, y_test)
1607 if not isinstance(score, numbers.Number):
1608 raise ValueError("scoring must return a number, got %s (%s) instead."
-> 1609 % (str(score), type(score)))
1610 return score
1611
1612
1613 def _permutation_test_score(estimator, X, y, cv, scorer):
ValueError: scoring must return a number, got 0.998981811748 (<class 'numpy.core.memmap.memmap'>) instead.
We have made several attempts to reproduce it outside of the context of the application, but are not having any luck. We have made the following change to cross_validation.py and it fixed our particular problem:
...
if isinstance(score, np.core.memmap):
    score = np.float(score)
if not isinstance(score, numbers.Number):
    raise ValueError("scoring must return a number, got %s (%s) instead."
...
Some more information:
we are on python 2.7
we are using a Pipeline to ensure all inputs are numeric
My questions are the following:
How might we go about reproducing this problem so as to cause the scorer to return a memmap?
Is anyone else having this particular problem?
Is the change we made in cross_validation.py actually a decent solution?
Yes, I had a similar case.
I fell in love with memmaps due to O/S limits on memory allocation, and I consider them a smart tool for large-scale machine learning; I use them in .fit() and other sklearn methods. (GridSearchCV() is not yet such a case, due to its adverse effect of pre-allocating memory on large hyperparameter grids with n_jobs = -1.)
How might we go about reproducing this? As far as I remember, my case was similar, and the change from an "ordinary" numpy.ndarray to a numpy.memmap() started these artifacts. So, if you want to create one artificially, wrap your data into a memmap-ed representation of the array and have that returned, even when it contains just a single cell of data, instead of a plain number. You will receive a view into a memmap-ed sub-range of the generic array representation of that cell.
Is the change a decent solution? Well, I got rid of the memmap-ed wrapper by explicitly returning the cell value, referencing the result's [0] component. An enforced conversion by float() seems fine.
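As an illustration of that fix (a sketch of the same idea, not sklearn's own code), a scorer result can be normalized before the numbers.Number check:

import numbers

import numpy as np

def ensure_plain_number(score):
    # Unwrap single-element ndarray results (np.memmap included, since it
    # subclasses ndarray) into a native Python scalar via .item();
    # pass real numbers through unchanged.
    if isinstance(score, np.ndarray) and score.size == 1:
        return score.item()
    if not isinstance(score, numbers.Number):
        raise ValueError("scoring must return a number, got %s (%s) instead."
                         % (score, type(score)))
    return score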