Can fromfile omit fields? - numpy

I am reading data from a given binary format, however I am only interested in a subset of the fields.
For example:
MY_DTYPE = np.dtype({'names': ('A', 'B', 'C'), 'formats': ('<f8', '<u2', 'u1')})
data = np.fromfile(infile, count=-1, dtype=MY_DTYPE)
Assume I don't really need data['C']. Is it possible to specify which fields I want to keep in the first place?

Simulate the load:
In [117]: MY_DTYPE = np.dtype({'names': ('A', 'B', 'C'), 'formats': ('<f8', '<u2', 'u1')})
In [118]: data = np.zeros(3, MY_DTYPE)
In [119]: data
Out[119]:
array([(0., 0, 0), (0., 0, 0), (0., 0, 0)],
dtype=[('A', '<f8'), ('B', '<u2'), ('C', 'u1')])
In [120]: data['C']
Out[120]: array([0, 0, 0], dtype=uint8)
In recent numpy versions (1.16 and later), multi-field indexing returns a view:
In [121]: data[['A','B']]
Out[121]:
array([(0., 0), (0., 0), (0., 0)],
dtype={'names':['A','B'], 'formats':['<f8','<u2'], 'offsets':[0,8], 'itemsize':11})
numpy provides a repack_fields function to make a proper packed copy:
In [122]: import numpy.lib.recfunctions as rf
In [123]: rf.repack_fields(data[['A','B']])
Out[123]: array([(0., 0), (0., 0), (0., 0)], dtype=[('A', '<f8'), ('B', '<u2')])
See the docs for repack_fields for more information, or look at the recent release notes.
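A minimal sketch of one way to hide a field at load time, assuming the record layout is fixed and known: describe only the wanted fields with explicit offsets and keep the full itemsize, so 'C' becomes unnamed padding. The name AB_DTYPE and the temporary sample file are illustrative only. Note that the bytes of 'C' are still read from disk; they just are not exposed as a field until repack_fields drops them.

```python
import os
import tempfile

import numpy as np
import numpy.lib.recfunctions as rf

MY_DTYPE = np.dtype({'names': ('A', 'B', 'C'), 'formats': ('<f8', '<u2', 'u1')})

# Write a small sample file so the sketch is self-contained.
full = np.zeros(3, MY_DTYPE)
full['A'] = [1.5, 2.5, 3.5]
full['B'] = [10, 20, 30]
full['C'] = [7, 8, 9]
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
full.tofile(path)

# Same 11-byte record, but only 'A' and 'B' are named (hypothetical dtype name).
AB_DTYPE = np.dtype({'names': ['A', 'B'], 'formats': ['<f8', '<u2'],
                     'offsets': [0, 8], 'itemsize': 11})
data = np.fromfile(path, dtype=AB_DTYPE)

# Optional: drop the padding bytes by making a packed copy.
packed = rf.repack_fields(data)
```

Here data still occupies 11 bytes per record, while packed uses 10.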

Related

Given a (nested) view into a numpy 2D array, how to retrieve the coords w.r.t. the original array

Consider the following:
A = np.zeros((100,100)) # TODO: populate A
filt = median_filter(A, size=5) # doesn't impact A.shape
view = filt[30:40, 30:40]
subview = view[0:5, 0:5]
Is it possible to extract from subview the corresponding rectangle within A?
I'd like to do something like:
coords = get_rect(subview)
rect_A = A[coords]
But if I constantly have to pass bounding rects through the system, the code uglifies fast.
numpy must store this information internally, but is it possible to access it?
PS I'm not doing anything fancy like view = A[::2]
PPS From reviewing the excellent answer, it looks like it should be possible to subclass numpy.ndarray, adding a .parent property and a .get_global_rect() method. But it looks like a HARD task.
In [40]: x = np.arange(24).reshape(4,6)
__array_interface__ is a way of viewing everything about a numpy array.
In [41]: x.__array_interface__
Out[41]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (4, 6),
'version': 3}
In [42]: x.strides
Out[42]: (48, 8)
For a view:
In [43]: y = x[:3,1:4]
In [44]: y
Out[44]:
array([[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]])
In [45]: y.__array_interface__
Out[45]:
{'data': (43385720, False),
'strides': (48, 8),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 3),
'version': 3}
In [46]: y.base
Out[46]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23])
x.base is the same array, the original np.arange(24).
The key differences in y are the shape, the explicit strides, and the data value, which "points" 8 bytes further along.
So while one could, in theory, deduce the indexing used to create y, numpy does not have a function or method to do that for us. Keeping track of your own "coordinates" is the best option.
Another way to put it, y is a numpy.ndarray, just like x. It does not carry any extra information about how it was created. The same applies to z, a view of y.
As for the 1d base:
In [48]: x.base.strides
Out[48]: (8,)
In [49]: x.base.shape
Out[49]: (24,)
In [50]: x.base.__array_interface__
Out[50]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (24,),
'version': 3}
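To illustrate the "in theory" deduction, here is a rough sketch that recovers the rectangle from the data pointers. rect_in_base is a hypothetical helper, not a numpy function, and it only holds under the stated assumptions: the base is C-contiguous and the view came from basic slices with step 1 (nothing like A[::2]).

```python
import numpy as np

def rect_in_base(view, base):
    """Slices locating `view` inside `base` (hypothetical helper).

    Assumes `base` is C-contiguous and `view` was produced only by
    basic slices with step 1, so both have the same number of dimensions.
    """
    view_addr = view.__array_interface__['data'][0]
    base_addr = base.__array_interface__['data'][0]
    # Flat index in `base` of the view's first element.
    first = (view_addr - base_addr) // base.itemsize
    start = np.unravel_index(first, base.shape)
    return tuple(slice(int(s), int(s) + n) for s, n in zip(start, view.shape))

x = np.arange(24).reshape(4, 6)
y = x[:3, 1:4]
z = y[1:, 1:]            # a view of a view
rect = rect_in_base(z, x)
```

Since a filtered copy has the same shape as the original, a rect computed against the copy can be used to index the original as well. Tracking the coordinates yourself is still the more robust choice.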

pd.MultiIndex from product

from pandas documentation:
numbers = [0, 1, 2]
colors = ['green', 'purple']
pd.MultiIndex.from_product([numbers, colors],names=['number', 'color'])
MultiIndex([(0, 'green'),
(0, 'purple'),
(1, 'green'),
(1, 'purple'),
(2, 'green'),
(2, 'purple')],
names=['number', 'color'])
what I got:
MultiIndex(levels=[[0, 1, 2], ['green', 'purple']],
codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['numbers', 'colors'])
Can someone please help me understand why I got this output from the same code?
That was how previous Pandas versions represented the MultiIndex. On my system, Pandas 1.0.3 gives the former and 0.24.2 gives the latter. Make sure your installed version matches that of the docs.
See the "Better repr for MultiIndex" enhancement, which was released in v0.25.0.
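A quick way to check which case applies, and to confirm that only the repr differs, is to print the installed version and inspect the levels and codes directly. This is a small sketch; the variable names are illustrative, and .codes assumes pandas 0.24 or later (older versions called it .labels).

```python
import pandas as pd

numbers = [0, 1, 2]
colors = ['green', 'purple']
mi = pd.MultiIndex.from_product([numbers, colors], names=['number', 'color'])

# The repr style depends on this version.
print(pd.__version__)

# The underlying data is the same in both repr styles.
level_values = [level.tolist() for level in mi.levels]
code_values = [codes.tolist() for codes in mi.codes]
tuples = mi.tolist()
```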

Pandas: apply tupleize_cols to dataframe without to_csv()?

I like the tupleize_cols option in the to_csv() function. Is this function available on an in-memory dataframe? I would like to clean up the tuples of the multi-indexed columns to 'reportable' column names automatically.
Thanks,
Luc
Just use .values on the index
In [1]: i = pd.MultiIndex.from_product([[1,2,3],['a','b','c']])
In [2]: i
Out[2]:
MultiIndex(levels=[[1, 2, 3], [u'a', u'b', u'c']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]])
In [3]: i.values
Out[3]:
array([(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c'),
(3, 'a'), (3, 'b'), (3, 'c')], dtype=object)
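Building on .values, one possible way to get 'reportable' names is to join each tuple into a single string. This is only a sketch: the '_' separator and the sample frame are assumptions, not something the question specifies.

```python
import pandas as pd

cols = pd.MultiIndex.from_product([[1, 2], ['a', 'b']])
df = pd.DataFrame([[10, 11, 12, 13]], columns=cols)

# Join the parts of each column tuple into one flat label.
flat = df.copy()
flat.columns = ['_'.join(str(part) for part in tup) for tup in df.columns.values]
```

After this, flat has ordinary string column labels such as '1_a' and df is left unchanged.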

Add another column(index) into the array

I have an array
a =array([ 0.74552751, 0.70868784, 0.7351144 , 0.71597612, 0.77608263,
0.71213591, 0.77297658, 0.75637376, 0.76636106, 0.76098067,
0.79142821, 0.71932262, 0.68984604, 0.77008623, 0.76334351,
0.76129872, 0.76717526, 0.78413129, 0.76483804, 0.75160062,
0.7532506 ], dtype=float32)
I want to store my array in (item, value) format and can't seem to get it right.
I'm trying to get this format:
a = [(0, 0.001497),
(1, 0.0061543),
..............
(46, 0.001436781),
(47, 0.00654533),
(48, 0.0027139),
(49, 0.00462962)],
Numpy arrays have a fixed data type that you must specify. It looks like a data type of
int for your item and float for your value would work best. Something like:
import numpy as np
dtype = [("item", int), ("value", float)]
a = np.array([(0, 0.), (1, .1), (2, .2)], dtype=dtype)
The string part of the dtype is the name of each field. The names allow you to access the fields more easily like this:
print(a['value'])
# [ 0., 0.1, 0.2]
a['value'] = [7, 8, 9]
print(a)
# [(0, 7.0) (1, 8.0) (2, 9.0)]
If you need to copy another array into the array I describe above, you can do it just by using the field name:
new = np.empty(len(a), dtype)
new['item'] = [3, 4, 5]
new['value'] = a['value']
print(new)
# [(3, 7.0) (4, 8.0) (5, 9.0)]
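Applied to the array in the question, a sketch could look like this, assuming the item is simply the position of each value. Only the first three values are used here to keep it short, and the names values, pairs and as_list are illustrative.

```python
import numpy as np

values = np.array([0.74552751, 0.70868784, 0.7351144], dtype=np.float32)

dtype = [("item", int), ("value", float)]
pairs = np.empty(len(values), dtype=dtype)
pairs["item"] = np.arange(len(values))   # 0, 1, 2, ...
pairs["value"] = values

# If a plain Python list of (item, value) tuples is enough:
as_list = list(enumerate(values.tolist()))
```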

Creating structured array from flat list

Suppose I have a flat list L like this:
In [97]: L
Out[97]: [2010.5, 1, 2, 3, 4, 5]
...and I want to get a structured array like this:
array((2010.5, [1, 2, 3, 4, 5]),
dtype=[('A', '<f4'), ('B', '<u4', (5,))])
How can I most efficiently make this conversion? I can not pass the latter dtype directly to array(L, ...) or to array(tuple(L)):
In [98]: dtp
Out[98]: [('A', '<f4', 1), ('B', '<u4', 5)]
In [99]: array(L, dtp)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-99-32809e0456a7> in <module>()
----> 1 array(L, dtp)
TypeError: expected an object with a buffer interface
In [101]: array(tuple(L), dtp)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-101-4d0c49a9f01d> in <module>()
----> 1 array(tuple(L), dtp)
ValueError: size of tuple must match number of fields.
What does work is to pass a temporary dtype where each field has one entry, then view this with the dtype I actually want:
In [102]: tdtp
Out[102]:
[('a', numpy.float32),
('b', numpy.uint32),
('c', numpy.uint32),
('d', numpy.uint32),
('e', numpy.uint32),
('f', numpy.uint32)]
In [103]: array(tuple(L), tdtp).view(dtp)
Out[103]:
array((2010.5, [1, 2, 3, 4, 5]),
dtype=[('A', '<f4'), ('B', '<u4', (5,))])
But to create this temporary dtype is an additional step that I would like to avoid if possible.
Is it possible to go directly from my flat list to my structured dtype, without using the intermediate dtype shown above?
(Note: in my real use case I have a reading routine reading a custom file format and many values per entry; so I would prefer to avoid a situation where I need to construct both the temporary and the actual dtype by hand.)
Instead of passing tuple(L) to array, pass an argument with nested values that match the nesting of the dtype. For the example you showed, you could pass in (L[0], L[1:]):
In [28]: L
Out[28]: [2010.5, 1, 2, 3, 4, 5]
In [29]: dtp
Out[29]: [('A', '<f4', 1), ('B', '<u4', 5)]
In [30]: array((L[0], L[1:]), dtype=dtp)
Out[30]:
array((2010.5, [1L, 2L, 3L, 4L, 5L]),
dtype=[('A', '<f4'), ('B', '<u4', (5,))])
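For records with many fields, the nesting can be derived from the dtype itself rather than written by hand. The following is a rough sketch, assuming a non-nested structured dtype whose fields are scalars or subarrays; nest_for_dtype is a hypothetical helper, not part of numpy.

```python
import numpy as np

def nest_for_dtype(flat, dtype):
    """Group a flat sequence to match the fields of `dtype` (hypothetical helper)."""
    dtype = np.dtype(dtype)
    nested, pos = [], 0
    for name in dtype.names:
        shape = dtype.fields[name][0].shape   # () for scalar fields
        count = int(np.prod(shape))           # the product of () is 1
        chunk = flat[pos:pos + count]
        nested.append(np.reshape(chunk, shape).tolist() if shape else chunk[0])
        pos += count
    return tuple(nested)

L = [2010.5, 1, 2, 3, 4, 5]
dtp = [('A', '<f4'), ('B', '<u4', (5,))]
nested = nest_for_dtype(L, dtp)
arr = np.array(nested, dtype=dtp)
```

The field 'A' is declared without the explicit shape 1 here, since newer numpy versions warn about that spelling.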