Write boolean structured arrays with PyFITS

I would like to write a Boolean structured array to a FITS file with PyFITS, but I ran into some issues. Here is a simple example.
I create a test dictionary and transform it into a structured array:
In [241]: test = {'p':np.array([True]*10+[False]*10,dtype='b')}
In [242]: test = np.core.records.fromarrays(list(test.values()), names=list(test.keys()))
Here is the test structured array I would like to write to a .fit file:
In [243]: test
Out[243]:
rec.array([(1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (0,),
(0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (0,)],
dtype=[('p', 'i1')])
I write test to a .fit file using pyfits:
In [244]: pyfits.writeto('./test.fit',test,clobber=True)
In [245]: d = pyfits.open('./test.fit')
In [246]: d = d[1].data
However, all entries are now set to False:
In [247]: d
Out[247]:
FITS_rec([(False), (False), (False), (False), (False), (False), (False),
(False), (False), (False), (False), (False), (False), (False),
(False), (False), (False), (False), (False), (False)],
dtype=[('p', 'i1')])
Furthermore, it seems that the original test array is also somehow modified by pyfits.
In [248]: test
Out[248]:
rec.array([(70,), (70,), (70,), (70,), (70,), (70,), (70,), (70,), (70,),
(70,), (70,), (70,), (70,), (70,), (70,), (70,), (70,), (70,),
(70,), (70,)],
dtype=[('p', 'i1')])
Could you please help me to solve this issue? Thank you very much!

Boolean columns in FITS are poorly understood, in large part due to their unusual representation (which uses the ASCII characters 'T' and 'F' to store true and false values, hence the 70s you're getting, which are ASCII 'F's).
Nevertheless, I've put a lot of work into making this work correctly in the past, so that even if you pass in an array of 0's and 1's it should infer what you meant. There seems to be a bug here, in that the writeto "convenience" function isn't handling the boolean-ish array properly. I was able to make it work like this instead:
>>> hdu = fits.BinTableHDU.from_columns(test)
>>> hdu.writeto('test.fits', clobber=True)
>>> fits.getdata('test.fits')
FITS_rec([(True), (True), (True), (True), (True), (True), (True), (True),
(True), (True), (False), (False), (False), (False), (False),
(False), (False), (False), (False), (False)],
dtype=[('p', 'i1')])
What you did in the first place probably should have worked. Though in general I would recommend using dtype='?' (or, equivalently, dtype=bool) explicitly if you want the type to be correctly guessed as boolean (otherwise there's some ambiguity as to whether you actually wanted bytes).
Update: There's also an old bug report about the same problem here: https://github.com/astropy/astropy/issues/1901. Apparently I tried to solve this a while ago but got fed up with the ambiguities, which is strange because I thought I did fix it at one point... In any case, if you explicitly make your arrays bool dtype and use the .from_columns method as demonstrated above, it should work. I'll see about revisiting some of these bugs.
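For completeness, here is a minimal end-to-end sketch with an explicit bool dtype (my own example, not from the original post; it assumes a reasonably recent astropy, where clobber has been renamed overwrite):
import numpy as np
from astropy.io import fits

# build the structured array with an explicit bool dtype ('?')
test = np.core.records.fromarrays(
    [np.array([True] * 10 + [False] * 10, dtype='?')],
    names=['p'])

hdu = fits.BinTableHDU.from_columns(test)
hdu.writeto('test.fits', overwrite=True)  # clobber=True on older versions
print(fits.getdata('test.fits')['p'])     # [ True ...  True False ... False]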

Related

What is the best way to initialise a NumPy masked array with an existing mask?

I was expecting to just say something like
ma.zeros(my_shape, mask=my_mask, hard_mask=True)
(where the mask is the correct shape) but ma.zeros (or ma.ones or ma.empty) rather surprisingly doesn't recognise the mask argument. The simplest I've come up with is
ma.array(np.zeros(my_shape), mask=my_mask, hard_mask=True)
which seems to involve unnecessary copying of lots of zeros. Is there a better way?
Make a masked array:
In [162]: x = np.arange(5); mask=np.array([1,0,0,1,0],bool)
In [163]: M = np.ma.MaskedArray(x,mask)
In [164]: M
Out[164]:
masked_array(data=[--, 1, 2, --, 4],
mask=[ True, False, False, True, False],
fill_value=999999)
Modify x, and see the result in M:
In [165]: x[-1] = 10
In [166]: M
Out[166]:
masked_array(data=[--, 1, 2, --, 10],
mask=[ True, False, False, True, False],
fill_value=999999)
In [167]: M.data
Out[167]: array([ 0, 1, 2, 3, 10])
In [169]: M.data.base
Out[169]: array([ 0, 1, 2, 3, 10])
M.data is a view of the array used to create it. No unnecessary copies.
I haven't used functions like np.ma.zeros, but
In [177]: np.ma.zeros
Out[177]: <numpy.ma.core._convert2ma at 0x1d84a052af0>
_convert2ma is a Python class that takes a function name and returns a new callable. It does not add mask-specific parameters; study it yourself if necessary.
np.ma.MaskedArray, the class that actually subclasses ndarray, takes a copy parameter:
copy : bool, optional
Whether to copy the input data (True), or to use a reference instead.
Default is False.
and the first line of its __new__ is
_data = np.array(data, dtype=dtype, copy=copy,
                 order=order, subok=True, ndmin=ndmin)
I haven't quite sorted out whether M._data is just a reference to the source data or a view. In either case, it isn't a copy unless you say so.
I haven't worked a lot with masked arrays, but my impression is that, while they can be convenient, they shouldn't be used where you are concerned about performance. There's a lot of extra work required to maintain both the mask and the data. The extra time involved in copying the data array, if any, will be minor.
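To answer the original question directly: since copy defaults to False, wrapping a freshly allocated array in MaskedArray does not duplicate its buffer, so there is no "unnecessary copying of lots of zeros" to avoid. A quick check (a sketch using made-up my_shape and my_mask):
import numpy as np

my_shape = (3, 4)
my_mask = np.zeros(my_shape, dtype=bool)

z = np.zeros(my_shape)
m = np.ma.MaskedArray(z, mask=my_mask, hard_mask=True)

z[0, 0] = 99.0       # modify the source array in place
print(m.data[0, 0])  # 99.0 -- m shares z's buffer; nothing was copied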

.all() on a Series is returning values from the Series, not True/False

I have a Pandas Series that should be filled with all strings, but I want to make sure there are no missing values. I thought that .all() would do the trick, since empty strings are falsy (right?), but instead .all() (and .any()) return values from the Series, not True or False as I expected.
For example:
series_1 = pd.Series(['beads', 'doubloons', 'king cake'])
series_2 = pd.Series(['beads', 'doubloons', np.nan, 'king cake'])
series_3 = pd.Series(['beads', '', 'doubloons', np.nan, 'king cake'])
I would expect series_x.any() or series_x.all() to return True or False, but instead, what is happening is
series_1.any(), series_2.any(), and series_3.any() each return 'beads'
series_1.all() and series_2.all() return 'king cake' and series_3.all() returns ''
I am utterly befuddled, and frankly, the documentation is no help at all.
Thanks!
Edit: I am running pandas version 1.0.3
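For anyone reproducing this: my understanding (an inference, not something the docs state) is that on object dtype the reduction falls back to Python's or/and, which return an operand rather than a bool. Casting to bool first gives a real answer:
import numpy as np
import pandas as pd

series_3 = pd.Series(['beads', '', 'doubloons', np.nan, 'king cake'])

# On pandas 1.0.3 these return elements, as described above:
print(series_3.any())  # 'beads'
print(series_3.all())  # ''

# Casting to bool first forces a real True/False answer
# (note: bool(np.nan) is True, so NaN counts as truthy here):
print(series_3.astype(bool).all())  # False, because of the empty string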

efficient way of numpy array cache lookup?

Suppose I have a boolean array, mask = np.array([True, False, True, False]),
and every few elements could form a cache key for some value lookup.
E.g. intervals = (2, 2) -> CACHE[(0, (True, False))] -> [.1, .2] and CACHE[(1, (True, False))] -> [.2, .1],
or intervals = (1, 3) -> CACHE[(0, (True,))] -> [.1] and CACHE[(1, (False, True, False))] -> [.2, .2, .1].
Since the output value's shape would be the same as the mask's, I'm wondering what's the most efficient way to perform such a cache lookup.
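To make the question concrete, here is a straightforward non-vectorized version of the lookup I have in mind (the lookup helper and the CACHE contents are hypothetical):
import numpy as np

mask = np.array([True, False, True, False])
CACHE = {(0, (True, False)): [.1, .2],
         (1, (True, False)): [.2, .1]}

def lookup(mask, intervals, cache):
    # split the mask into chunks of the given interval lengths and
    # concatenate the cached values keyed by (chunk_index, chunk)
    out, start = [], 0
    for i, n in enumerate(intervals):
        chunk = tuple(bool(v) for v in mask[start:start + n])
        out.extend(cache[(i, chunk)])
        start += n
    return np.array(out)

print(lookup(mask, (2, 2), CACHE))  # [0.1 0.2 0.2 0.1]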

numpy genfromtxt - how to detect bad int input values

Here is a trivial example of a bad int value passed to numpy.genfromtxt. For some reason, I can't detect this bad value, as it's showing up as a valid int of -1.
>>> bad = '''a,b
0,BAD
1,2
3,4'''.splitlines()
My input here has 2 columns of ints, named a and b. b has a bad value, where we have a string "BAD" instead of an integer. However, when I call genfromtxt, I cannot detect this bad value.
>>> out = np.genfromtxt(bad, delimiter=',', dtype=(np.dtype('int64'), np.dtype('int64')), names=True, usemask=True, usecols=tuple('ab'))
>>> out
masked_array(data=[(0, -1), (1, 2), (3, 4)],
mask=[(False, False), (False, False), (False, False)],
fill_value=(999999, 999999),
dtype=[('a', '<i8'), ('b', '<i8')])
>>> out['b'].data
array([-1, 2, 4])
I print out the column 'b' from my output, and I'm shocked to see that it has a -1 where the string "BAD" is supposed to be. The user has no idea that there was bad input. In fact, if you only look at the output, this is totally indistinguishable from the following input
>>> bad2 = '''a,b
0,-1
1,2
3,4'''.splitlines()
I feel like I must be using genfromtxt wrong. How is it possible that it can't detect bad input?
In np.lib._iotools I found a function:
def _loose_call(self, value):
    try:
        return self.func(value)
    except ValueError:
        return self.default
When genfromtxt is processing a line it does:
if loose:
    rows = list(
        zip(*[[conv._loose_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))
where loose is an input parameter. So in the case of the int converter it tries
int(astring)
and if that produces a ValueError it returns the default value (e.g. -1) instead of raising an error. Similarly for float and np.nan.
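You can see that default in action by poking at the converter class directly (a private API, so this sketch may break between numpy versions):
import numpy as np
from numpy.lib._iotools import StringConverter

conv = StringConverter(np.int64)  # the converter genfromtxt builds for int64
print(conv.default)               # -1
print(conv._loose_call('BAD'))    # -1: the ValueError is swallowed
print(conv._loose_call('7'))      # 7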
The usemask parameter is applied in:
if usemask:
    append_to_masks(tuple([v.strip() in m
                           for (v, m) in zip(values,
                                             missing_values)]))
Define 2 converters to give more information on what's processed:
def myint(astr):
    try:
        v = int(astr)
    except ValueError:
        print('err', astr)
        v = '-999'
    return v

def myfloat(astr):
    try:
        v = float(astr)
    except ValueError:
        print('err', astr)
        v = '-inf'
    return v
A sample text:
txt='''1,2
3,nan
,foo
bar,
'''.splitlines()
And using the converters:
In [242]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat})
err b''
err b'bar'
err b'foo'
err b''
Out[242]:
array([( 1, 2.), ( 3, nan), (-999, -inf), (-999, -inf)],
dtype=[('f0', '<i8'), ('f1', '<f8')])
And to see what usemask does:
In [243]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat}, usemask=True)
err b''
err b'bar'
err b'foo'
err b''
Out[243]:
masked_array(data=[(1, 2.0), (3, nan), (--, -inf), (-999, --)],
mask=[(False, False), (False, False), ( True, False),
(False, True)],
fill_value=(999999, 1.e+20),
dtype=[('f0', '<i8'), ('f1', '<f8')])
A missing value is an empty string '', and int('') produces a ValueError just as int('bad') does. So for the converters, whether the default ones or my custom ones, a missing value looks the same as a bad one. Your converter could make a distinction, but only 'missing' values set the mask.
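So if you want bad values flagged rather than silently defaulted, one option (my own workaround, not a genfromtxt feature) is a converter that returns a sentinel value which you then mask on:
import numpy as np

SENTINEL = -999999  # assumed never to occur in real data

def safe_int(astr):
    try:
        return int(astr)
    except ValueError:
        return SENTINEL

out = np.genfromtxt(bad, delimiter=',', names=True, dtype=None,
                    converters={0: safe_int, 1: safe_int})
b = np.ma.masked_equal(out['b'], SENTINEL)
print(b)  # [-- 2 4]: the 'BAD' entry is now masked, not a fake -1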

Padding Labels for Tensorflow CTC Loss?

I would like to pad my labels so that they are of equal length to be passed into the ctc_loss function. Apparently, -1 is not allowed. If I apply padding, should the padding value be part of the labels for CTC?
Update
I have this code that converts dense labels into sparse ones to be passed to the ctc_loss function which I think is related to the problem.
def dense_to_sparse(dense_tensor, out_type):
    indices = tf.where(tf.not_equal(dense_tensor, tf.constant(0, dense_tensor.dtype)))
    values = tf.gather_nd(dense_tensor, indices)
    shape = tf.shape(dense_tensor, out_type=out_type)
    return tf.SparseTensor(indices, values, shape)
Actually, -1 values are allowed to be present in the y_true argument of ctc_batch_cost, with one limitation: they should not appear within the actual label "content", which is specified by label_length (here the i-th label's "content" starts at index 0 and ends at index label_length[i]).
So it is perfectly fine to pad labels with -1 so that they are of equal length, as you intended. The only thing you should take care of is to correctly calculate and pass the corresponding label_length values.
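If the labels are already padded with -1 (only at the end), the lengths can be derived rather than written by hand; a small sketch:
import numpy as np

labels = np.asarray([[0, 1, 2, 1, 0], [0, 1, 1, 0, -1]])
label_lens = np.expand_dims((labels != -1).sum(axis=1), 1)
print(label_lens)  # [[5], [4]] -- matches the hand-written values below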
Here is some sample code, a modified version of the test_ctc unit test from keras:
import numpy as np
from tensorflow.keras import backend as K
number_of_categories = 4
number_of_timesteps = 5
labels = np.asarray([[0, 1, 2, 1, 0], [0, 1, 1, 0, -1]])
label_lens = np.expand_dims(np.asarray([5, 4]), 1)
# dimensions are batch x time x categories
inputs = np.zeros((2, number_of_timesteps, number_of_categories), dtype=np.float32)
input_lens = np.expand_dims(np.asarray([5, 5]), 1)
k_labels = K.variable(labels, dtype="int32")
k_inputs = K.variable(inputs, dtype="float32")
k_input_lens = K.variable(input_lens, dtype="int32")
k_label_lens = K.variable(label_lens, dtype="int32")
res = K.eval(K.ctc_batch_cost(k_labels, k_inputs, k_input_lens, k_label_lens))
It runs perfectly fine even with -1 as the last element of the (second) labels sequence, because the corresponding label_lens item (the second) specifies that its length is 4.
If we change it to 5, or if we change some other label value to -1, then we get the "All labels must be nonnegative integers" exception that you've mentioned. But this just means that our label_lens is invalid.
Here's how I do it. I have a dense tensor labels that includes padding with -1, so that all targets in a batch have the same length. Then I use
labels_sparse = dense_to_sparse(labels, sparse_val=-1)
where
def dense_to_sparse(dense_tensor, sparse_val=0):
    """Inverse of tf.sparse_to_dense.

    Parameters:
        dense_tensor: The dense tensor. Duh.
        sparse_val: The value to "ignore": Occurrences of this value in the
            dense tensor will not be represented in the sparse tensor.
            NOTE: When/if later restoring this to a dense tensor, you
            will probably want to choose this as the default value.

    Returns:
        SparseTensor equivalent to the dense input.
    """
    with tf.name_scope("dense_to_sparse"):
        sparse_inds = tf.where(tf.not_equal(dense_tensor, sparse_val),
                               name="sparse_inds")
        sparse_vals = tf.gather_nd(dense_tensor, sparse_inds,
                                   name="sparse_vals")
        dense_shape = tf.shape(dense_tensor, name="dense_shape",
                               out_type=tf.int64)
        return tf.SparseTensor(sparse_inds, sparse_vals, dense_shape)
This creates a sparse tensor of the labels, which is what you need to feed to the CTC loss; that is, you call tf.nn.ctc_loss(labels=labels_sparse, ...). The padding (i.e. all values equal to -1 in the dense tensor) is simply not represented in this sparse tensor.
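For completeness, here is a minimal call sketch (assuming the TF1-style tf.nn.ctc_loss signature, where labels is a SparseTensor and inputs are time-major logits; the names are just for illustration):
import tensorflow as tf  # assumes the TF1 API; under TF2 use tf.compat.v1

batch, timesteps, num_classes = 2, 5, 4
labels = tf.constant([[0, 1, 2, 1, 0], [0, 1, 1, 0, -1]])  # -1 = padding
labels_sparse = dense_to_sparse(labels, sparse_val=-1)

logits = tf.zeros((timesteps, batch, num_classes))  # time-major logits
seq_len = tf.constant([timesteps, timesteps])       # per-example input lengths

loss = tf.nn.ctc_loss(labels=labels_sparse, inputs=logits,
                      sequence_length=seq_len)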