Does the pandas dataframe ._is_view work as expected when dropping a column?

I have the following example (pandas v1.5.1):
import numpy as np
import pandas as pd
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
index=["a", "c", "d"],
columns=["Ohio", "Texas", "California"])
# drop a row
frame2 = frame.drop(index='c')
print(frame)
print(frame2)
# frame2 seems to be a copy
print('frame : ', frame.to_numpy().__array_interface__)
print('frame2: ', frame2.to_numpy().__array_interface__)
# frame2._is_view returns True
print(f'{frame2._is_view=}')
that prints
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
Ohio Texas California
a 0 1 2
d 6 7 8
frame : {'data': (2332905904160, False), 'strides': None, 'descr': [('', '<i4')], 'typestr': '<i4', 'shape': (3, 3), 'version': 3}
frame2: {'data': (2332905739696, False), 'strides': None, 'descr': [('', '<i4')], 'typestr': '<i4', 'shape': (2, 3), 'version': 3}
frame2._is_view=True
Given that the address of the underlying NumPy ndarray has changed I would assume that we created a fresh dataframe (copy). This is confirmed by the fact that changing a value in frame2 does not alter frame.
What does ._is_view exactly return?
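A minimal sketch of how the copy hypothesis above could be checked without relying on the private ._is_view attribute. It assumes np.shares_memory on the arrays returned by to_numpy() is an acceptable test for a single-dtype frame; it does not say anything about what ._is_view itself reports.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame2 = frame.drop(index="c")

# Do the two frames expose overlapping memory?
shares = np.shares_memory(frame.to_numpy(), frame2.to_numpy())
print("shares memory:", shares)

# Writing to frame2 should leave frame untouched if frame2 is a copy.
frame2.loc["a", "Ohio"] = 100
print("frame value after write:", frame.loc["a", "Ohio"])
```

If both checks agree (no shared memory, original unchanged), frame2 behaves as a copy regardless of the flag.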

Related

Given a (nested) view into a numpy 2D array, how to retrieve the coords w.r.t. the original array

Consider the following:
import numpy as np
from scipy.ndimage import median_filter
A = np.zeros((100,100)) # TODO: populate A
filt = median_filter(A, size=5) # doesn't impact A.shape
view = filt[30:40, 30:40]
subview = view[0:5, 0:5]
Is it possible to extract from subview the corresponding rectangle within A?
I'd like to do something like:
coords = get_rect(subview)
rect_A = A[coords]
But if I'm constantly having to pass bounding-rects thru the system the code uglifies fast.
numpy must store this information internally, but is it possible to access it?
PS I'm not doing anything fancy like view = A[::2]
PPS From reviewing the excellent answer, it looks like it should be possible to subclass numpy.ndarray, adding a .parent property and a .get_global_rect() method. But it looks like a HARD task.
In [40]: x = np.arange(24).reshape(4,6)
__array_interface__ is a way of viewing everything about a numpy array.
In [41]: x.__array_interface__
Out[41]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (4, 6),
'version': 3}
In [42]: x.strides
Out[42]: (48, 8)
For a view:
In [43]: y = x[:3,1:4]
In [44]: y
Out[44]:
array([[ 1, 2, 3],
[ 7, 8, 9],
[13, 14, 15]])
In [45]: y.__array_interface__
Out[45]:
{'data': (43385720, False),
'strides': (48, 8),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 3),
'version': 3}
In [46]: y.base
Out[46]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23])
x.base is the same, the original np.arange(24).
The key difference in y is the shape, and data value, which "points" 8 bytes further along.
So while one could, in theory, deduce the indexing used to create y, numpy does not have a function or method to do that for us. Keeping track of your own "coordinates" is the best option.
Another way to put it, y is a numpy.ndarray, just like x. It does not carry any extra information about how it was created. The same would apply to a view of y.
As for the 1d base
In [48]: x.base.strides
Out[48]: (8,)
In [49]: x.base.shape
Out[49]: (24,)
In [50]: x.base.__array_interface__
Out[50]:
{'data': (43385712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (24,),
'version': 3}
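The "in theory" deduction mentioned above can be sketched as follows. This is only an illustration under stated assumptions: get_rect is a hypothetical helper (not a numpy function), the parent is 2D, and the view was produced by basic slicing with step 1. It compares the data pointers from __array_interface__ and divides the byte offset by the parent's strides.

```python
import numpy as np

def get_rect(view, parent):
    """Hypothetical helper: slices locating `view` inside the 2D `parent`.

    Assumes `view` came from `parent` via basic slicing with step 1.
    """
    if not np.shares_memory(view, parent):
        raise ValueError("view does not share memory with parent")
    # Byte distance between the first element of the view and of the parent.
    offset = (view.__array_interface__["data"][0]
              - parent.__array_interface__["data"][0])
    row, rest = divmod(offset, parent.strides[0])
    col = rest // parent.strides[1]
    return (slice(row, row + view.shape[0]),
            slice(col, col + view.shape[1]))

A = np.arange(100 * 100).reshape(100, 100)
view = A[30:40, 30:40]
subview = view[0:5, 0:5]

rows, cols = get_rect(subview, A)
print(rows, cols)
```

In the question the views are taken from filt rather than A, so the parent passed in would be filt; since filt has the same shape as A, the resulting slices can then be applied to A as well.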

how numpy arrays are stored in memory locations?

import numpy as np
a=np.array([[1,2],[4,5]])
b=a.T
print(a is b)
print(np.shares_memory(a,b))
for i in a:
    for j in i:
        print(i,j,id(j))
print('************')
for i in b:
    for j in i:
        print(i,j,id(j))
The output of the above code is
False
True
[1 2] 1 2027214431408
[1 2] 2 2027214431184
[4 5] 4 2027214431408
[4 5] 5 2027214431184
************
[1 4] 1 2027214431632
[1 4] 4 2027214431184
[2 5] 2 2027214431632
[2 5] 5 2027214431184
My question is why alternate integer objects have the same location in the above output, given that Python assigns a different memory location to each distinct object.
In [64]: a=np.array([[1,2],[4,5]])
...: b=a.T
The data value from __array_interface__ tells us (in some sense) where the data-buffer of the array is located. a and b have the same value, indicating that they share the buffer.
In [65]: a.__array_interface__
Out[65]:
{'data': (74597280, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 2),
'version': 3}
In [66]: b.__array_interface__
Out[66]:
{'data': (74597280, False),
'strides': (8, 16),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 2),
'version': 3}
b is a view of a, using the same buffer, but with its own shape and strides.
In [67]: a.strides
Out[67]: (16, 8)
In [68]: b.strides
Out[68]: (8, 16)
Asking for the id of an indexed element tells us nothing about that data-buffer. It just identifies the object that was derived from the array - by value, not by reference. list stores elements by reference, arrays do not.
In [70]: type(a[0,1])
Out[70]: numpy.int64
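A small sketch of the point above, assuming nothing beyond numpy itself. Indexing an element builds a new scalar object each time, so its id says nothing about the buffer; the repeated ids in the question are most likely the interpreter reusing memory of temporaries that were already freed. Sharing is better tested with np.shares_memory or by writing through the view.

```python
import numpy as np

a = np.array([[1, 2], [4, 5]])
b = a.T

# Two lookups of the same element give two separate scalar objects.
x = a[0, 1]
y = a[0, 1]
print(x is y)

# A list, by contrast, hands back the very object it stores.
lst = [object(), object()]
print(lst[0] is lst[0])

# The buffer is shared: writing through b shows up in a.
print(np.shares_memory(a, b))
b[0, 1] = 99          # b[0, 1] is the same memory as a[1, 0]
print(a[1, 0])
```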

numpy - why Z[(0,2)] is view but Z[(0, 2), (0)] is copy?

Question
Why are the numpy tuple indexing behaviors inconsistent? Please explain the rationale or design decision behind these behaviors. In my understanding, Z[(0,2)] and Z[(0, 2), (0)] are both tuple indexing, so I expected consistent behavior for copy/view. If this is incorrect, please explain.
import numpy as np
Z = np.arange(36).reshape(3, 3, 4)
print("Z is \n{}\n".format(Z))
b = Z[
(0,2) # Select Z[0][2]
]
print("Tuple indexing Z[(0,2)] is \n{}\nIs view? {}\n".format(
b,
b.base is not None
))
c = Z[ # Select Z[0][0][1] & Z[0][2][1]
(0,2),
(0)
]
print("Tuple indexing Z[(0, 2), (0)] is \n{}\nIs view? {}\n".format(
c,
c.base is not None
))
Z is
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
[[24 25 26 27]
[28 29 30 31]
[32 33 34 35]]]
Tuple indexing Z[(0,2)] is
[ 8 9 10 11]
Is view? True
Tuple indexing Z[(0, 2), (0)] is
[[ 0 1 2 3]
[24 25 26 27]]
Is view? False
Numpy indexing is confusing, and I wonder how people build an understanding of it. If there is a good way to understand it, or a cheat sheet, please advise.
It's the comma that creates a tuple. The () just set boundaries where needed.
Thus
Z[(0,2)]
Z[0,2]
are the same; they select on the first 2 dimensions. Whether that returns an element or an array depends on how many dimensions Z has.
The same interpretation applies to the other case.
Z[(0, 2), (0)]
Z[( np.array([0,2]), 0)]
Z[ np.array([0,2]), 0]
are the same - the first dimension is indexed with a list/array, and thus is advanced indexing. It's a copy.
[ 8 9 10 11]
is a row of the 3d array; it's a contiguous block of Z
[[ 0 1 2 3]
[24 25 26 27]]
is 2 rows from Z, picked with an index list. Advanced indexing always returns a copy, because an arbitrary list of indices cannot in general be described with just shape and strides (and offset in the databuffer).
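The two cases above can be checked directly. A short sketch, using np.shares_memory as the view/copy test instead of .base:

```python
import numpy as np

Z = np.arange(36).reshape(3, 3, 4)

# The parentheses are optional: both spellings pass the same tuple (0, 2).
same = np.array_equal(Z[(0, 2)], Z[0, 2])

basic = Z[0, 2]          # integers only: basic indexing
advanced = Z[(0, 2), 0]  # first index is a sequence: advanced indexing

print(same)                              # True
print(np.shares_memory(Z, basic))        # True  (view)
print(np.shares_memory(Z, advanced))     # False (copy)

# The advanced result is the pair of rows Z[0, 0] and Z[2, 0].
print(np.array_equal(advanced, np.stack([Z[0, 0], Z[2, 0]])))  # True
```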
details
__array_interface__ gives details about the underlying data of an array
In [146]: Z = np.arange(36).reshape(3,3,4)
In [147]: Z.__array_interface__
Out[147]:
{'data': (38255712, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 3, 4),
'version': 3}
In [148]: Z.strides
Out[148]: (96, 32, 8)
For the view:
In [149]: Z1 = Z[0,2]
In [150]: Z1
Out[150]: array([ 8, 9, 10, 11])
In [151]: Z1.__array_interface__
Out[151]:
{'data': (38255776, False), # 38255712+8*8
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (4,),
'version': 3}
The data buffer pointer is 8 elements further along in Z's buffer. Shape is much reduced.
In [152]: Z2 = Z[[0,2],0]
In [153]: Z2
Out[153]:
array([[ 0, 1, 2, 3],
[24, 25, 26, 27]])
In [154]: Z2.__array_interface__
Out[154]:
{'data': (31443104, False), # an entirely different location
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 4),
'version': 3}
Z2 is the same as two selections:
In [158]: Z[0,0]
Out[158]: array([0, 1, 2, 3])
In [159]: Z[2,0]
Out[159]: array([24, 25, 26, 27])
It is not
Z[0][0][1] & Z[0][2][1]
Z[0,0,1] & Z[0,2,1]
Compare that with a 2 row slice:
In [156]: Z3 = Z[0:2,0]
In [157]: Z3.__array_interface__
Out[157]:
{'data': (38255712, False), # same as Z's
'strides': (96, 8),
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (2, 4),
'version': 3}
A view is returned if the new array can be described with shape, strides and all or part of the original data buffer.
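To make "described with shape, strides and offset" concrete, here is a sketch that rebuilds two of the views by hand with the np.ndarray constructor on top of Z's buffer. The variable names are illustrative only. It also shows that the same two rows picked by a list can be reached as a view when written as a slice with a step, while the list form is still a copy.

```python
import numpy as np

Z = np.arange(36).reshape(3, 3, 4)
s0, s1, s2 = Z.strides   # bytes per step along each dimension

# Z[0, 2]: start 2 rows into the first block, then 4 consecutive elements.
row = np.ndarray(shape=(4,), dtype=Z.dtype, buffer=Z,
                 offset=2 * s1, strides=(s2,))

# Z[0:2, 0]: start at the beginning, one block per row, one element per column.
two_rows = np.ndarray(shape=(2, 4), dtype=Z.dtype, buffer=Z,
                      offset=0, strides=(s0, s2))

print(np.array_equal(row, Z[0, 2]))
print(np.array_equal(two_rows, Z[0:2, 0]))

# Same values, different mechanism: a stepped slice is a view, a list is a copy.
stepped = Z[::2, 0]
listed = Z[[0, 2], 0]
print(np.array_equal(stepped, listed))
print(np.shares_memory(Z, stepped), np.shares_memory(Z, listed))
```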

List of the (row, col) of the n largest values in a numeric pandas DataFrame?

Given a Pandas DataFrame of numeric values, how can one produce a list of the .loc cell locations that can then be used to obtain the corresponding n largest values in the entire DataFrame?
For example:
        A       B       C       D       E
X     1.3     3.6      33   61.38     0.3
Y    3.14    2.71      64    23.2      21
Z    1024      42      66     137    22.2
T  63.123     111    1.23   14.16   50.49
An n of 3 would produce the (row,col) pairs for the values 1024, 137 and 111.
These locations could then, as usual, be fed to .loc to extract those values from the DataFrame. i.e.
df.loc['Z','A']
df.loc['Z','D']
df.loc['T','B']
Note: It is easy to mistake this question for one that involves .idxmax. That isn't applicable due to the fact that there may be multiple values selected from a row and/or column in the n largest.
You could try:
>>> data = {0 : [1.3, 3.14, 1024, 63.123], 1: [3.6, 2.71, 42, 111], 2 : [33, 64, 66, 1.23], 3 : [61.38, 23.2, 137, 14.16], 4 : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data)
>>> df
0 1 2 3 4
0 1.300 3.60 33.00 61.38 0.30
1 3.140 2.71 64.00 23.20 21.00
2 1024.000 42.00 66.00 137.00 22.20
3 63.123 111.00 1.23 14.16 50.49
>>>
>>> a = list(zip(*df.stack().nlargest(3).index.labels))
>>> a
[(2, 0), (2, 3), (3, 1)]
>>> # then ...
>>> df.loc[a[0]]
1024.0
>>>
>>> # all sorted in decreasing order ...
>>> list(zip(*df.stack().nlargest(20).index.labels))
[(2, 0), (2, 3), (3, 1), (2, 2), (1, 2), (3, 0), (0, 3), (3, 4), (2, 1), (0, 2), (1, 3), (2, 4), (1, 4), (3, 3), (0, 1), (1, 0), (1, 1), (0, 0), (3, 2), (0, 4)]
Edit: In pandas 0.24.0, MultiIndex.labels was deprecated in favor of MultiIndex.codes (see Deprecations in What’s new in 0.24.0 (January 25, 2019)). In newer versions where labels has been removed, the above code throws AttributeError: 'MultiIndex' object has no attribute 'labels' and needs to be updated as follows:
>>> a = list(zip(*df.stack().nlargest(3).index.codes))
>>> a
[(2, 0), (2, 3), (3, 1)]
Edit 2: This question has become a "moving target", as the OP keeps changing it (this is my last update/edit). In the last update, OP's dataframe looks as follows:
>>> data = {'A' : [1.3, 3.14, 1024, 63.123], 'B' : [3.6, 2.71, 42, 111], 'C' : [33, 64, 66, 1.23], 'D' : [61.38, 23.2, 137, 14.16], 'E' : [0.3, 21, 22.2, 50.49] }
>>> df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])
>>> df
A B C D E
X 1.300 3.60 33.00 61.38 0.30
Y 3.140 2.71 64.00 23.20 21.00
Z 1024.000 42.00 66.00 137.00 22.20
T 63.123 111.00 1.23 14.16 50.49
The desired output can be obtained using:
>>> a = df.stack().nlargest(3).index
>>> a
MultiIndex([('Z', 'A'),
('Z', 'D'),
('T', 'B')],
)
>>>
>>> df.loc[a[0]]
1024.0
The trick is to use np.unravel_index on the result of np.argsort.
Example:
import numpy as np
import pandas as pd
N = 5
df = pd.DataFrame([[11, 3, 50, -3],
[5, 73, 11, 100],
[75, 9, -2, 44]])
s_ix = np.argsort(df.values, axis=None)[::-1][:N]
labels = np.unravel_index(s_ix, df.shape)
labels = list(zip(*labels))
print(labels) # --> [(1, 3), (2, 0), (1, 1), (0, 2), (2, 3)]
print(df.loc[labels[0]]) # --> 100
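The argsort example works with .loc only because the frame has the default integer index, so positions and labels coincide. A possible adaptation for the labelled frame from the question, as a sketch: the positions from np.unravel_index are translated through df.index and df.columns. The names flat_pos, labels and values are illustrative.

```python
import numpy as np
import pandas as pd

data = {'A': [1.3, 3.14, 1024, 63.123],
        'B': [3.6, 2.71, 42, 111],
        'C': [33, 64, 66, 1.23],
        'D': [61.38, 23.2, 137, 14.16],
        'E': [0.3, 21, 22.2, 50.49]}
df = pd.DataFrame(data, index=['X', 'Y', 'Z', 'T'])
n = 3

# Flat positions of the n largest values, largest first.
flat_pos = np.argsort(df.to_numpy(), axis=None)[::-1][:n]

# Convert flat positions to (row, col) positions, then to labels.
rows, cols = np.unravel_index(flat_pos, df.shape)
labels = list(zip(df.index[rows], df.columns[cols]))
print(labels)

values = [df.loc[r, c] for r, c in labels]
print(values)
```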

Multi-column label-encoding: Print mappings

The following code can be used to transform strings into categorical labels:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
['A','E','H','F','G','I','K','','',''],
['A','C','I','F','H','G','','','','']],
columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])
pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
0 1 2 3 4 5 6 7 9 10 8
1 1 5 8 6 7 9 10 0 0 0
2 1 3 9 6 8 7 0 0 0 0
Question:
How can I query the mappings (it appears they are sorted alphabetically)?
I.e. a list like:
A: 1
B: 2
C: 3
...
I: 9
K: 10
Thank you!
Yes, it's possible if you define the LabelEncoder separately and query its classes_ attribute later.
import numpy as np
le = LabelEncoder()
data = le.fit_transform(df.values.flatten())
dict(zip(le.classes_[1:], np.arange(1, len(le.classes_))))
{'A': 1,
'B': 2,
'C': 3,
'D': 4,
'E': 5,
'F': 6,
'G': 7,
'H': 8,
'I': 9,
'K': 10}
The classes_ attribute stores the classes in the order in which they were encoded.
le.classes_
array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K'], dtype=object)
So you may safely assume that the element at position i in classes_ is encoded as i: the empty string '' as 0, 'A' as 1, and so on.
To reverse encodings, use le.inverse_transform.
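A compact sketch of both directions, assuming scikit-learn is installed. The values list is a shortened stand-in for df.values.flatten(), and mapping and decoded are illustrative names. Feeding classes_ back through transform gives the full mapping, including the empty string.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for the flattened DataFrame values (includes the empty string).
values = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'I', 'K', 'H', '']

le = LabelEncoder()
le.fit(values)

# Forward mapping: class label -> integer code.
mapping = {str(label): int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)

# Reverse direction: integer codes -> original labels.
decoded = [str(label) for label in le.inverse_transform([1, 10])]
print(decoded)
```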
I think you can use transform in LabelEncoder:
le=LabelEncoder()
le.fit(df.values.flatten())
dict(zip(df.values.flatten(),le.transform(df.values.flatten()) ))
Out[137]:
{'': 0,
'A': 1,
'B': 2,
'C': 3,
'D': 4,
'E': 5,
'F': 6,
'G': 7,
'H': 8,
'I': 9,
'K': 10}