numpy genfromtxt - how to detect bad int input values - numpy

Here is a trivial example of passing a bad int value to numpy.genfromtxt. For some reason, I can't detect this bad value: it shows up as a valid int of -1.
>>> bad = '''a,b
0,BAD
1,2
3,4'''.splitlines()
My input here has 2 columns of ints, named a and b. Column b has a bad value: the string "BAD" instead of an integer. However, when I call genfromtxt, I cannot detect this bad value.
>>> out = np.genfromtxt(bad, delimiter=',', dtype=(np.dtype('int64'), np.dtype('int64')), names=True, usemask=True, usecols=tuple('ab'))
>>> out
masked_array(data=[(0, -1), (1, 2), (3, 4)],
             mask=[(False, False), (False, False), (False, False)],
       fill_value=(999999, 999999),
            dtype=[('a', '<i8'), ('b', '<i8')])
>>> out['b'].data
array([-1, 2, 4])
I print out column 'b' from my output, and I'm shocked to see that it has a -1 where the string "BAD" is supposed to be. The user has no idea that there was bad input. In fact, looking only at the output, this is totally indistinguishable from the following input:
>>> bad2 = '''a,b
0,-1
1,2
3,4'''.splitlines()
I feel like I must be using genfromtxt wrong. How is it possible that it can't detect bad input?

I found in np.lib._iotools a function
def _loose_call(self, value):
    try:
        return self.func(value)
    except ValueError:
        return self.default
When genfromtxt is processing a line it does
if loose:
    rows = list(
        zip(*[[conv._loose_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))
where loose is an input parameter. So in the case of the int converter it tries
int(astring)
and if that raises a ValueError it returns the default value (e.g. -1) instead of propagating the error. The float converter behaves the same way, with np.nan as its default.
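A minimal sketch of that fallback (loose_int here is a hypothetical stand-in for the converter, not numpy's own code): a bad field and a missing field both silently become the default.
def loose_int(value, default=-1):
    try:
        return int(value)
    except ValueError:
        return default

print(loose_int('4'))    # 4
print(loose_int('BAD'))  # -1, bad value
print(loose_int(''))     # -1, a missing (empty) field looks exactly the same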
The usemask parameter is applied in:
if usemask:
    append_to_masks(tuple([v.strip() in m
                           for (v, m) in zip(values,
                                             missing_values)]))
Define two converters to give more information about what's being processed:
def myint(astr):
    try:
        v = int(astr)
    except ValueError:
        print('err', astr)
        v = '-999'
    return v

def myfloat(astr):
    try:
        v = float(astr)
    except ValueError:
        print('err', astr)
        v = '-inf'
    return v
A sample text:
txt='''1,2
3,nan
,foo
bar,
'''.splitlines()
And using the converters:
In [242]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat})
err b''
err b'bar'
err b'foo'
err b''
Out[242]:
array([( 1, 2.), ( 3, nan), (-999, -inf), (-999, -inf)],
      dtype=[('f0', '<i8'), ('f1', '<f8')])
And to see what usemask does:
In [243]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat}, usemask=True)
err b''
err b'bar'
err b'foo'
err b''
Out[243]:
masked_array(data=[(1, 2.0), (3, nan), (--, -inf), (-999, --)],
             mask=[(False, False), (False, False), ( True, False),
                   (False,  True)],
       fill_value=(999999, 1.e+20),
            dtype=[('f0', '<i8'), ('f1', '<f8')])
A missing value is an empty string '', and int('') raises a ValueError just as int('BAD') does. So for a converter, whether the default one or my custom ones, a missing value looks the same as a bad one. Your converter could make the distinction, but only 'missing' values set the mask.
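For example, a converter along these lines could tell the two apart (a sketch reusing the bad sample from the question; the sentinel values are arbitrary, and whether genfromtxt passes the converter bytes or str depends on the numpy version and the encoding argument):
def strict_int(field):
    s = field.decode() if isinstance(field, bytes) else field
    if s.strip() == '':
        return -1          # missing field
    try:
        return int(s)
    except ValueError:
        return -999        # genuinely bad field, e.g. 'BAD'

out = np.genfromtxt(bad, delimiter=',', names=True, dtype='int64',
                    usemask=True, converters={1: strict_int})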

Related

Shape of data changing in Tensorflow dataset

The shape of my data after the mapping function should be (257, 1001, 1). I asserted this condition in the function and the data passed without an issue. But when extracting a vector from the dataset, the shape comes out as (1, 257, 1001, 1). Tfds never fails to be a bloody pain.
The code:
def read_npy_file(data):
    # 'data' stores the file name of the numpy binary file storing the features of a particular sound file
    # as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
    data = np.load(data.decode())
    # Shape of data is now (1, rows, columns)
    # Needs to be reshaped to (rows, columns, 1):
    data = np.reshape(data, (257, 1001, 1))
    assert data.shape == (257, 1001, 1), f"Shape of spectrogram is {data.shape}; should be (257, 1001, 1)."
    return data.astype(np.float32)
spectrogram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, labels))
spectrogram_ds = spectrogram_ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)
num_files = len(train_df)
num_train = int(0.8 * num_files)
num_val = int(0.1 * num_files)
num_test = int(0.1 * num_files)
spectrogram_ds = spectrogram_ds.shuffle(buffer_size=1000)
specgram_train_ds = spectrogram_ds.take(num_train)
specgram_test_ds = spectrogram_ds.skip(num_train)
specgram_val_ds = specgram_test_ds.take(num_val)
specgram_test_ds = specgram_test_ds.skip(num_val)
specgram, _ = next(iter(spectrogram_ds))
# The following assertion raises an error; not the one in the read_npy_file function.
assert specgram.shape == (257, 1001, 1), f"Spectrogram shape is {specgram.shape}. Should be (257, 1001, 1)"
I thought that the first dimension represented the batch size, which is 1, of course, before batching. But after batching by calling batch(batch_size=64) on the dataset, the shape of a batch was (64, 1, 257, 1001, 1) when it should be (64, 257, 1001, 1).
Would appreciate any help.
Although I still can't explain why I'm getting that output, I did find a workaround. I simply reshaped the data in another mapping like so:
def read_npy_file(data):
    # 'data' stores the file name of the numpy binary file storing the features of a particular sound file
    # as a bytes string.
    # decode() is called on the bytes string to decode it from a bytes string to a regular string
    # so that it can be passed as a parameter into np.load()
    data = np.load(data.decode())
    # Shape of data is now (1, rows, columns)
    # Needs to be reshaped to (rows, columns, 1):
    data = np.reshape(data, (257, 1001, 1))
    assert data.shape == (257, 1001, 1), f"Shape of spectrogram is {data.shape}; should be (257, 1001, 1)."
    return data.astype(np.float32)

specgram_ds = tf.data.Dataset.from_tensor_slices((specgram_files, one_hot_encoded_labels))
specgram_ds = specgram_ds.map(
    lambda file, label: tuple([tf.numpy_function(read_npy_file, [file], [tf.float32, ]), label]),
    num_parallel_calls=tf.data.AUTOTUNE)
specgram_ds = specgram_ds.map(lambda specgram, label: tuple([tf.reshape(specgram, (257, 1001, 1)), label]),
                              num_parallel_calls=tf.data.AUTOTUNE)
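A quick way to sanity-check the workaround (a sketch reusing the names above; it simply re-runs the question's assertion against the re-mapped dataset):
# After the reshape map, each element's spectrogram should be (257, 1001, 1) ...
specgram, _ = next(iter(specgram_ds))
assert specgram.shape == (257, 1001, 1), f"Got {specgram.shape}"

# ... and a batch of 64 should come out as (64, 257, 1001, 1).
for batch_specgram, batch_label in specgram_ds.batch(64).take(1):
    assert batch_specgram.shape[1:] == (257, 1001, 1)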

How do I correct this Value Error due to Buffer having the wrong dimensions in a quadprog SVM implementation?

I'm using the quadprog module to set up an SVM for speech recognition. I took a QP implementation from here: https://github.com/stephane-caron/qpsolvers/blob/master/qpsolvers/quadprog_.py
Here is their implementation:
# imports needed to run this snippet
import numpy as np
from numpy import vstack
from quadprog import solve_qp

def quadprog_solve_qp(P, q, G=None, h=None, A=None, b=None, initvals=None,
                      verbose=False):
    if initvals is not None:
        print("quadprog: note that warm-start values ignored by wrapper")
    qp_G = P
    qp_a = -q
    if A is not None:
        if G is None:
            qp_C = -A.T
            qp_b = -b
        else:
            qp_C = -vstack([A, G]).T
            qp_b = -np.insert(h, 0, 0, axis=0)
        meq = A.shape[0]
    else:  # no equality constraint
        qp_C = -G.T if G is not None else None
        qp_b = -h if h is not None else None
        meq = 0
    try:
        return solve_qp(qp_G, qp_a, qp_C, qp_b, meq)[0]
    except ValueError as e:
        if "matrix G is not positive definite" in str(e):
            # quadprog calls G what we call P (the cost matrix) in this package
            raise ValueError("matrix P is not positive definite")
        raise
Shapes:
P: (127, 127)
h: (254, 1)
q: (127, 1)
A: (1, 127)
G: (254, 127)
Also, qp_b was initially assigned an hstack of an array arr = array([0]) with h, but arr's shape of (1,) prevented numpy from concatenating the arrays. I fixed that error by using np.insert with a 0 instead (as shown above).
When I try quadprog_solve_qp(P, q, G, h, A) I get a:
File "----------------------------.py", line 95, in quadprog_solve_qp
return solve_qp(qp_G, qp_a, qp_C, qp_b, meq)[0]
File "quadprog/quadprog.pyx", line 12, in quadprog.solve_qp
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
And I have no idea where it's coming from, nor what I can do about it. If anyone has any idea how the quadprog module works, or simply what I might be doing wrong, I would be pleased to hear it.
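For what it's worth, quadprog's solve_qp expects its vector arguments (the qp_a and qp_b built above) to be 1-D, while the q and h listed here are (n, 1) column vectors, which matches the "expected 1, got 2" message. A minimal thing to try before calling the wrapper (a sketch reusing the names from the question, not a verified fix):
# Flatten the (n, 1) column vectors so that qp_a and qp_b end up 1-D.
q = q.ravel()  # (127, 1) -> (127,)
h = h.ravel()  # (254, 1) -> (254,)

x = quadprog_solve_qp(P, q, G, h, A)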

Tensorflow tf.data.Dataset API, dataset unzip function?

In tensorflow 1.12 there is the Dataset.zip function: documented here.
However, I was wondering if there is a dataset unzip function which will return back the original two datasets.
# NOTE: The following examples use `{ ... }` to represent the
# contents of a dataset.
a = { 1, 2, 3 }
b = { 4, 5, 6 }
c = { (7, 8), (9, 10), (11, 12) }
d = { 13, 14 }
# The nested structure of the `datasets` argument determines the
# structure of elements in the resulting dataset.
Dataset.zip((a, b)) == { (1, 4), (2, 5), (3, 6) }
Dataset.zip((b, a)) == { (4, 1), (5, 2), (6, 3) }
# The `datasets` argument may contain an arbitrary number of
# datasets.
Dataset.zip((a, b, c)) == { (1, 4, (7, 8)),
                            (2, 5, (9, 10)),
                            (3, 6, (11, 12)) }
# The number of elements in the resulting dataset is the same as
# the size of the smallest dataset in `datasets`.
Dataset.zip((a, d)) == { (1, 13), (2, 14) }
I would like to have the following
dataset = Dataset.zip((a, d)) == { (1, 13), (2, 14) }
a, d = dataset.unzip()
My workaround was to just use map; not sure if there might be interest in a syntactic-sugar function for unzip later, though.
a = dataset.map(lambda a, b: a)
b = dataset.map(lambda a, b: b)
TensorFlow's get_single_element() is finally around, and it can be used to unzip datasets (as asked in the question above).
This avoids the need to generate and use an iterator via .map() or iter() (which could be costly for big datasets).
get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset batched into a single element.
This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays), depending on how the original dataset was created.
import tensorflow as tf
a = [ 1, 2, 3 ]
b = [ 4, 5, 6 ]
c = [ (7, 8), (9, 10), (11, 12) ]
d = [ 13, 14 ]
# Creating datasets from lists
ads = tf.data.Dataset.from_tensor_slices(a)
bds = tf.data.Dataset.from_tensor_slices(b)
cds = tf.data.Dataset.from_tensor_slices(c)
dds = tf.data.Dataset.from_tensor_slices(d)
list(tf.data.Dataset.zip((ads, bds)).as_numpy_iterator()) == [ (1, 4), (2, 5), (3, 6) ] # True
list(tf.data.Dataset.zip((bds, ads)).as_numpy_iterator()) == [ (4, 1), (5, 2), (6, 3) ] # True
# Let's zip and unzip ads and dds
x = tf.data.Dataset.zip((ads, dds))
xa, xd = tf.data.Dataset.get_single_element(x.batch(len(x)))
xa = list(xa.numpy())
xd = list(xd.numpy())
print(xa, xd) # [1,2] [13, 14] # notice how xa is now different from a because ads was curtailed when zip was done above.
d == xd # True
Building on Ouwen Huang's answer, this function seems to work for arbitrary datasets:
def split_datasets(dataset):
    tensors = {}
    names = list(dataset.element_spec.keys())
    for name in names:
        tensors[name] = dataset.map(lambda x: x[name])
    return tensors
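A hypothetical usage sketch; this helper assumes the dataset's elements are dicts (i.e. element_spec has .keys()), such as a dataset built from a dict of tensors:
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(
    {"features": [[1.0, 2.0], [3.0, 4.0]], "label": [0, 1]}
)
parts = split_datasets(ds)
features_ds = parts["features"]  # tf.data.Dataset of feature rows
labels_ds = parts["label"]       # tf.data.Dataset of labels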
I have written a more general unzip function for tf.data.Dataset pipelines, which also handles the recursive case where a pipeline has multiple levels of zipping.
import tensorflow as tf
def tfdata_unzip(
    tfdata: tf.data.Dataset,
    *,
    recursive: bool = False,
    eager_numpy: bool = False,
    num_parallel_calls: int = tf.data.AUTOTUNE,
):
    """
    Unzip a zipped tf.data pipeline.

    Args:
        tfdata: the :py:class:`tf.data.Dataset` to unzip.
        recursive: Set to ``True`` to recursively unzip
            multiple layers of zipped pipelines.
            Defaults to ``False``.
        eager_numpy: Set this to ``True`` to return
            Python lists of primitive types or
            :py:class:`numpy.array` objects. Defaults
            to ``False``.
        num_parallel_calls: The level of parallelism to use
            each time we ``map()`` over a
            :py:class:`tf.data.Dataset`.

    Returns:
        A Python list of either :py:class:`tf.data.Dataset`
        or NumPy arrays.
    """
    if isinstance(tfdata.element_spec, tf.TensorSpec):
        if eager_numpy:
            return list(tfdata.as_numpy_iterator())
        return tfdata

    def tfdata_map(i: int) -> list:
        # Select the i-th column of every element.
        return tfdata.map(
            lambda *cols: cols[i],
            deterministic=True,
            num_parallel_calls=num_parallel_calls,
        )

    if isinstance(tfdata.element_spec, tuple):
        num_columns = len(tfdata.element_spec)
        if recursive:
            return [
                tfdata_unzip(
                    tfdata_map(i),
                    recursive=recursive,
                    eager_numpy=eager_numpy,
                    num_parallel_calls=num_parallel_calls,
                )
                for i in range(num_columns)
            ]
        else:
            return [
                tfdata_map(i)
                for i in range(num_columns)
            ]

    raise ValueError(
        "Unknown tf.data.Dataset element_spec: " +
        str(tfdata.element_spec)
    )
Here is how tfdata_unzip() works, given these example datasets:
>>> import numpy as np
>>> baby = tf.data.Dataset.from_tensor_slices([
        np.array([1,2]),
        np.array([3,4]),
        np.array([5,6]),
    ])
>>> baby.element_spec
TensorSpec(shape=(2,), dtype=tf.int64, name=None)
>>> parent = tf.data.Dataset.zip((baby, baby))
>>> parent.element_spec
(TensorSpec(shape=(2,), dtype=tf.int64, name=None),
TensorSpec(shape=(2,), dtype=tf.int64, name=None))
>>> grandparent = tf.data.Dataset.zip((parent, parent))
>>> grandparent.element_spec
((TensorSpec(shape=(2,), dtype=tf.int64, name=None),
TensorSpec(shape=(2,), dtype=tf.int64, name=None)),
(TensorSpec(shape=(2,), dtype=tf.int64, name=None),
TensorSpec(shape=(2,), dtype=tf.int64, name=None)))
This is what tfdata_unzip() returns on the above baby, parent, and grandparent datasets:
>>> tfdata_unzip(baby)
<TensorSliceDataset shapes: (2,), types: tf.int64>
>>> tfdata_unzip(parent)
[<ParallelMapDataset shapes: (2,), types: tf.int64>,
<ParallelMapDataset shapes: (2,), types: tf.int64>]
>>> tfdata_unzip(grandparent)
[<ParallelMapDataset shapes: ((2,), (2,)), types: (tf.int64, tf.int64)>,
<ParallelMapDataset shapes: ((2,), (2,)), types: (tf.int64, tf.int64)>]
>>> tfdata_unzip(grandparent, recursive=True)
[[<ParallelMapDataset shapes: (2,), types: tf.int64>,
<ParallelMapDataset shapes: (2,), types: tf.int64>],
[<ParallelMapDataset shapes: (2,), types: tf.int64>,
<ParallelMapDataset shapes: (2,), types: tf.int64>]]
>>> tfdata_unzip(grandparent, recursive=True, eager_numpy=True)
[[[array([1, 2]), array([3, 4]), array([5, 6])],
[array([1, 2]), array([3, 4]), array([5, 6])]],
[[array([1, 2]), array([3, 4]), array([5, 6])],
[array([1, 2]), array([3, 4]), array([5, 6])]]]

Error in using np.NaN in vectorize functions

I am using Python 3 on 64-bit Windows 10. I had issues with the following simple function:
import numpy as np

def skudiscounT(t):
    s = t.find("ITEMADJ")
    if s >= 0:
        t = t[s + 8:]
        if t.find("-") == 2:
            return t
    else:
        return np.nan  # if change to "" it will work fine!
I tried to use this function in np.vectorize and got the following error:
Traceback (most recent call last):
  File "C:/Users/lz09/Desktop/P3/SODetails_Clean_V1.py", line 45, in <module>
    SO["SKUDiscount"] = np.vectorize(skudiscounT)(SO['Description'])
  File "C:\PD\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2739, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "C:\PD\Anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2818, in _vectorize_call
    res = array(outputs, copy=False, subok=True, dtype=otypes[0])
ValueError: could not convert string to float: '23-126-408'
When I replace the last line [return np.nan] with [return ''] it works fine. Anyone know why this is the case? Thanks!
Without otypes the dtype of the return array is determined by the first trial result:
In [232]: f = np.vectorize(skudiscounT)
In [234]: f(['abc'])
Out[234]: array([ nan])
In [235]: _.dtype
Out[235]: dtype('float64')
I'm trying to find an argument for which your function returns a string. It looks like your function can also return None.
From the docs:
The data type of the output of vectorized is determined by calling
the function with the first element of the input. This can be avoided
by specifying the otypes argument.
With otypes:
In [246]: f = np.vectorize(skudiscounT, otypes=[object])
In [247]: f(['abc', '23-126ITEMADJ408'])
Out[247]: array([nan, None], dtype=object)
In [248]: f = np.vectorize(skudiscounT, otypes=['U10'])
In [249]: f(['abc', '23-126ITEMADJ408'])
Out[249]:
array(['nan', 'None'],
dtype='<U4')
But for returning a generic object dtype, I'd use the slightly faster:
In [250]: g = np.frompyfunc(skudiscounT, 1,1)
In [251]: g(['abc', '23-126ITEMADJ408'])
Out[251]: array([nan, None], dtype=object)
So what kind of array do you want: float, which can hold np.nan; string; or object, which can hold 'anything'?
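For completeness, a small sketch: forcing a float output dtype reproduces the original error, because the branch that returns a string cannot be cast to float (the input string here is made up so that it hits that branch).
f = np.vectorize(skudiscounT, otypes=[float])
f(['ITEMADJ 23-126-408'])
# ValueError: could not convert string to float: '23-126-408'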

Cython Typing List of Strings

I'm trying to use cython to improve the performance of a loop, but I'm running
into some issues declaring the types of the inputs.
How do I include a field in my typed struct which is a string that can be either 'front' or 'back'?
I have a np.recarray that looks like the following (note the length of the recarray is unknown at compile time):
import numpy as np
weights = np.recarray(4, dtype=[('a', np.int64), ('b', np.str_, 5), ('c', np.float64)])
weights[0] = (0, "front", 0.5)
weights[1] = (0, "back", 0.5)
weights[2] = (1, "front", 1.0)
weights[3] = (1, "back", 0.0)
as well as inputs of a list of strings and a pandas.Timestamp
import pandas as pd
ts = pd.Timestamp("2015-01-01")
contracts = ["CLX16", "CLZ16"]
I am trying to cythonize the following loop
def ploop(weights, contracts, timestamp):
    cwts = []
    for gen_num, position, weighting in weights:
        if weighting != 0:
            if position == "front":
                cntrct_idx = gen_num
            elif position == "back":
                cntrct_idx = gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((gen_num, contracts[cntrct_idx], weighting, timestamp))
    return cwts
My attempt involved typing the weights input as a struct in cython,
in a file struct_test.pyx as follows
import numpy as np
cimport numpy as np

cdef packed struct tstruct:
    np.int64_t gen_num
    char[5] position
    np.float64_t weighting
def cloop(tstruct[:] weights_array, contracts, timestamp):
    cdef tstruct w
    cdef int i
    cdef int cntrct_idx

    cwts = []
    for k in xrange(len(weights_array)):
        w = weights_array[k]
        if w.weighting != 0:
            if w.position == "front":
                cntrct_idx = w.gen_num
            elif w.position == "back":
                cntrct_idx = w.gen_num + 1
            else:
                raise ValueError("transition.columns must contain "
                                 "'front' or 'back'")
            cwts.append((w.gen_num, contracts[cntrct_idx], w.weighting,
                         timestamp))
    return cwts
But I am receiving runtime errors, which I believe are related to the
char[5] position.
import pyximport
pyximport.install()
import struct_test
struct_test.cloop(weights, contracts, ts)
ValueError: Does not understand character buffer dtype format string ('w')
In addition, I am a bit unclear on how I would go about typing contracts as well as timestamp.
Your ploop (without the timestamp variable) produces:
In [226]: ploop(weights, contracts)
Out[226]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
Equivalent function without a loop:
def ploopless(weights, contracts):
    arr_contracts = np.array(contracts)  # to allow array indexing
    wgts1 = weights[weights['c'] != 0]
    mask = wgts1['b'] == 'front'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask]]
    mask = wgts1['b'] == 'back'
    wgts1['b'][mask] = arr_contracts[wgts1['a'][mask] + 1]
    return wgts1.tolist()
In [250]: ploopless(weights, contracts)
Out[250]: [(0, 'CLX16', 0.5), (0, 'CLZ16', 0.5), (1, 'CLZ16', 1.0)]
I'm taking advantage of the fact that the returned list of tuples has the same (int, str, float) layout as the input weights array, so I'm just making a copy of weights and replacing selected values of the b field.
Note that I apply the field-selection index before the boolean mask. The boolean mask produces a copy, so we have to be careful about indexing order.
I'm guessing that the loop-less array version will be competitive in time with cloop (on realistic arrays). The string and list operations in cloop probably limit its speedup.
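A rough way to check that guess (a sketch: timings on this tiny 4-row weights array won't mean much, and ts is the Timestamp defined in the question):
import timeit

print(timeit.timeit(lambda: ploop(weights, contracts, ts), number=10000))
print(timeit.timeit(lambda: ploopless(weights, contracts), number=10000))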