Creating a Pandas DataFrame from a NumPy masked array? - pandas

I am trying to create a Pandas DataFrame from a NumPy masked array, which I understand is a supported operation. This is an example of the source array:
import numpy.ma as ma
import pandas as pd

a = ma.array([(1, 2.2), (42, 5.5)],
             dtype=[('a', int), ('b', float)],
             mask=[(True, False), (False, True)])
which outputs as:
masked_array(data=[(--, 2.2), (42, --)],
             mask=[( True, False), (False, True)],
             fill_value=(999999, 1.e+20),
             dtype=[('a', '<i8'), ('b', '<f8')])
Attempting to create a DataFrame with pd.DataFrame(a) returns:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-a4c5236a3cd4> in <module>
----> 1 pd.DataFrame(a)
/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
636 # a masked array
637 else:
--> 638 data = sanitize_masked_array(data)
639 mgr = ndarray_to_mgr(
640 data,
/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_masked_array(data)
452 """
453 mask = ma.getmaskarray(data)
--> 454 if mask.any():
455 data, fill_value = maybe_upcast(data, copy=True)
456 data.soften_mask() # set hardmask False if it was True
/usr/local/anaconda/lib/python3.8/site-packages/numpy/core/_methods.py in _any(a, axis, dtype, out, keepdims, where)
54 # Parsing keyword arguments is currently fairly slow, so avoid it for now
55 if where is True:
---> 56 return umr_any(a, axis, dtype, out, keepdims)
57 return umr_any(a, axis, dtype, out, keepdims, where=where)
58
TypeError: cannot perform reduce with flexible type
Is this operation indeed supported? I'm currently using pandas 1.3.3 and NumPy 1.20.3.
Update
Is this supported?
According to the Pandas documentation here:
Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
The code above was me asking "what will I get?" when passing a NumPy masked array to pandas; a DataFrame with the masked entries treated as missing is exactly the result I was hoping for. The snippet above is the simplest example I could come up with.
I do expect each Series/column in Pandas to be of a single type.
Update 2
Anyone interested in this should probably see this Pandas GitHub issue; it's noted there that Pandas has "deprecated support for MaskedRecords".
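Given the deprecation, the workaround I'm currently leaning towards (just a sketch; it builds the frame one field at a time and turns masked entries into NaN, so the int column gets upcast to float) looks like this:
import numpy as np
import numpy.ma as ma
import pandas as pd

a = ma.array([(1, 2.2), (42, 5.5)],
             dtype=[('a', int), ('b', float)],
             mask=[(True, False), (False, True)])

# each field of a structured masked array is itself a masked array,
# so its mask can be applied column by column
df = pd.DataFrame({name: a[name].astype(float).filled(np.nan)
                   for name in a.dtype.names})
#       a    b
# 0   NaN  2.2
# 1  42.0  NaN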

If the array has a simple dtype, the dataframe creation works (as documented):
In [320]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: mask=[(True,False),(False,True)])
In [321]: a
Out[321]:
masked_array(
data=[[--, 2.2],
[42.0, --]],
mask=[[ True, False],
[False, True]],
fill_value=1e+20)
In [322]: import pandas as pd
In [323]: pd.DataFrame(a)
Out[323]:
0 1
0 NaN 2.2
1 42.0 NaN
This a has shape (2, 2), and the result has 2 rows and 2 columns.
With the compound dtype, the shape is 1d:
In [326]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: dtype=[('a',int),('b',float)],
...: mask=[(True,False),(False,True)])
In [327]: a.shape
Out[327]: (2,)
The error is the result of a test on the mask; "flexible type" refers to your compound dtype:
In [330]: a.mask.any()
Traceback (most recent call last):
File "<ipython-input-330-8dc32ee3f59d>", line 1, in <module>
a.mask.any()
File "/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py", line 57, in _any
return umr_any(a, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
The documented pandas feature clearly does not apply to structured arrays. Without studying the pandas code I can't say exactly what it's trying to do at this point, but it's clear the code was not written with structured arrays in mind.
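If you do want to ask "is anything masked?" yourself on a structured array, one way (a sketch, assuming a numpy new enough to have structured_to_unstructured, 1.16+) is to flatten the structured boolean mask first:
import numpy.lib.recfunctions as rf

# a.mask is itself a structured array of booleans, so reductions fail on it;
# flattening it to a plain 2-D bool array first makes .any() work
rf.structured_to_unstructured(a.mask).any()    # True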
The non-masked part does work, with the desired column dtypes:
In [332]: pd.DataFrame(a.data)
Out[332]:
a b
0 1 2.2
1 42 5.5
Using the default fill:
In [344]: a.filled()
Out[344]:
array([(999999, 2.2e+00), ( 42, 1.0e+20)],
dtype=[('a', '<i8'), ('b', '<f8')])
In [345]: pd.DataFrame(a.filled())
Out[345]:
a b
0 999999 2.200000e+00
1 42 1.000000e+20
I'd have to look more at the np.ma docs/code to see whether it's possible to apply a different fill to the two fields. Filling with nan doesn't work for the int field; numpy doesn't have an equivalent of pandas' nullable integer. I haven't worked enough with that pandas feature to know whether the resulting dtype stays integer or is changed to object.
Anyway, you are pushing the bounds of both np.ma and pandas with this task.
edit
The default fill_value is a tuple, one for each field:
In [350]: a.fill_value
Out[350]: (999999, 1.e+20)
So we can fill the fields differently, and make a frame from that:
In [351]: a.filled((-1, np.nan))
Out[351]: array([(-1, 2.2), (42, nan)], dtype=[('a', '<i8'), ('b', '<f8')])
In [352]: pd.DataFrame(a.filled((-1, np.nan)))
Out[352]:
a b
0 -1 2.2
1 42 NaN
Looks like I can make a structured array with a pandas dtype, and its associated fill_value:
In [363]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: dtype=[('a',pd.Int64Dtype),('b',float)],
...: mask=[(True,False),(False,True)],
...: fill_value=(pd.NA,np.nan))
In [364]: a
Out[364]:
masked_array(data=[(--, 2.2), (42, --)],
mask=[( True, False), (False, True)],
fill_value=(<NA>, nan),
dtype=[('a', 'O'), ('b', '<f8')])
In [366]: pd.DataFrame(a.filled())
Out[366]:
a b
0 <NA> 2.2
1 42 NaN
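As a follow-up (a sketch only, not something I've explored deeply): the 'a' column of that frame ends up as object dtype, but it can be cast to pandas' nullable integer afterwards:
df = pd.DataFrame(a.filled())
df['a'] = df['a'].astype('Int64')   # pandas nullable integer, keeps the <NA>
df.dtypes
# a      Int64
# b    float64
# dtype: object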

The question is what would you expect to get? It would be ambiguous for pandas to convert your data.
If you want to get the original data:
>>> pd.DataFrame(a.data)
a b
0 1 2.2
1 42 5.5
If you want to consider masked values invalid:
>>> pd.DataFrame(a.filled(np.nan))
But for this to work, all fields in the masked array should be of float type, as shown below.
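One way to get there for the structured array in the question (a sketch, assuming numpy 1.16+ for structured_to_unstructured; the int column becomes float64):
import numpy as np
import numpy.lib.recfunctions as rf
import pandas as pd

data = rf.structured_to_unstructured(a.data).astype(float)  # (2, 2) float array
mask = rf.structured_to_unstructured(a.mask)                # (2, 2) bool array
data[mask] = np.nan
pd.DataFrame(data, columns=a.dtype.names)
#       a    b
# 0   NaN  2.2
# 1  42.0  NaN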

Related

How to do simple condition on pandas rows which match float("nan") [duplicate]

Let's say I have the following pandas DataFrame:
import pandas as pd
df = pd.DataFrame({"A":[1,pd.np.nan,2], "B":[5,6,0]})
Which would look like:
>>> df
A B
0 1.0 5
1 NaN 6
2 2.0 0
First option
I know one way to check if a particular value is NaN, which is as follows:
>>> df.isnull().ix[1,0]
True
Second option (not working)
I thought the option below, using ix, would work as well, but it does not:
>>> df.ix[1,0]==pd.np.nan
False
I also tried iloc with same results:
>>> df.iloc[1,0]==pd.np.nan
False
However if I check for those values using ix or iloc I get:
>>> df.ix[1,0]
nan
>>> df.iloc[1,0]
nan
So, why is the second option not working? Is it possible to check for NaN values using ix or iloc?
Try this:
In [107]: pd.isnull(df.iloc[1,0])
Out[107]: True
UPDATE: in newer pandas versions, use pd.isna():
In [7]: pd.isna(df.iloc[1,0])
Out[7]: True
The above answer is excellent. Here is the same with an example for better understanding.
>>> import pandas as pd
>>>
>>> import numpy as np
>>>
>>> pd.Series([np.nan, 34, 56])
0 NaN
1 34.0
2 56.0
dtype: float64
>>>
>>> s = pd.Series([np.nan, 34, 56])
>>> pd.isnull(s[0])
True
>>>
I also tried a couple of alternatives; the following attempts did not work. Thanks to @MaxU.
>>> s[0]
nan
>>>
>>> s[0] == np.nan
False
>>>
>>> s[0] is np.nan
False
>>>
>>> s[0] == 'nan'
False
>>>
>>> s[0] == pd.np.nan
False
>>>
pd.isna(cell_value) can be used to check if a given cell value is nan. Alternatively, pd.notna(cell_value) to check the opposite.
From source code of pandas:
def isna(obj):
    """
    Detect missing values for an array-like object.

    This function takes a scalar or array-like object and indicates
    whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
    in object arrays, ``NaT`` in datetimelike).

    Parameters
    ----------
    obj : scalar or array-like
        Object to check for null or missing values.

    Returns
    -------
    bool or array-like of bool
        For scalar input, returns a scalar boolean.
        For array input, returns an array of boolean indicating whether each
        corresponding element is missing.

    See Also
    --------
    notna : Boolean inverse of pandas.isna.
    Series.isna : Detect missing values in a Series.
    DataFrame.isna : Detect missing values in a DataFrame.
    Index.isna : Detect missing values in an Index.

    Examples
    --------
    Scalar arguments (including strings) result in a scalar boolean.

    >>> pd.isna('dog')
    False

    >>> pd.isna(np.nan)
    True
df.isnull().loc[1,0]
I tried the above syntax and it worked.
I made up some workaround:
x = [np.nan]
In [4]: x[0] == np.nan
Out[4]: False
but:
In [5]: np.nan in x
Out[5]: True
You can look at how list containment (the __contains__ method) is implemented to understand why this works: it checks identity before equality, and np.nan is identical to itself even though it does not compare equal to itself.
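A quick demonstration of that identity-versus-equality difference (the float('nan') line is just to show that a different nan object is not found):
import numpy as np

x = [np.nan]

np.nan is x[0]        # True  -- the very same object
np.nan == x[0]        # False -- nan never compares equal, even to itself
np.nan in x           # True  -- `in` checks identity before equality
float('nan') in x     # False -- a different nan object is neither identical nor equal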

Numpy interpolation on pandas TimeStamp data works if it's a pandas series but not if it's a single object?

I'm trying to use np.interp to interpolate a float value based on pandas Timestamp data. However, I noticed that np.interp works if the input x is a pandas Series of Timestamps, but not if it's a single Timestamp object.
Here's the code to illustrate this:
import pandas as pd
import numpy as np
coarse = pd.DataFrame({'start': ['2016-01-01 07:00:00.00000+00:00',
                                 '2016-01-01 07:30:00.00000+00:00']})
fine = pd.DataFrame({'start': ['2016-01-01 07:00:02.156657+00:00',
                               '2016-01-01 07:00:15+00:00',
                               '2016-01-01 07:00:32+00:00',
                               '2016-01-01 07:11:17+00:00',
                               '2016-01-01 07:14:00+00:00',
                               '2016-01-01 07:15:55+00:00',
                               '2016-01-01 07:33:04+00:00'],
                     'price': [0, 1, 2, 3, 4, 5, 6]})
coarse['start'] = pd.to_datetime(coarse['start'])
fine['start'] = pd.to_datetime(fine['start'])
np.interp(x=coarse.start, xp=fine.start, fp=fine.price) # works
np.interp(x=coarse.start.iloc[-1], xp=fine.start, fp=fine.price) # doesn't work
The latter gives the error
TypeError: float() argument must be a string or a number, not 'Timestamp'
I am wondering why the latter doesn't work, while the former does?
The input x of interp must be "array-like" (iterable); you can use .iloc[[-1]]:
np.interp(x=coarse.start.iloc[[-1]], xp=fine.start, fp=fine.price)
Output: array([5.82118562])
Look at what you get when selecting an item from the Series:
In [8]: coarse.start
Out[8]:
0 2016-01-01 07:00:00+00:00
1 2016-01-01 07:30:00+00:00
Name: start, dtype: datetime64[ns, UTC]
In [9]: coarse.start.iloc[-1]
Out[9]: Timestamp('2016-01-01 07:30:00+0000', tz='UTC')
With the list index, it's a Series:
In [10]: coarse.start.iloc[[-1]]
Out[10]:
1 2016-01-01 07:30:00+00:00
Name: start, dtype: datetime64[ns, UTC]
I was going to scold you for not showing the full error message, but I see that it's a compiled piece of code that raises the error. Keep in mind that interp is a numpy function, which works with numpy arrays and, for math like this, float-dtype ones.
So it's a good guess that interp is trying to make a float array from your argument.
In [14]: np.asarray(coarse.start, dtype=float)
Out[14]: array([1.4516316e+18, 1.4516334e+18])
In [15]: np.asarray(coarse.start.iloc[[1]], dtype=float)
Out[15]: array([1.4516334e+18])
In [16]: np.asarray(coarse.start.iloc[1], dtype=float)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[16], line 1
----> 1 np.asarray(coarse.start.iloc[1], dtype=float)
TypeError: float() argument must be a string or a number, not 'Timestamp'
It can't make a float value from a single pandas Timestamp object.
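So another option for the scalar case (a sketch, no better than the accepted answer) is to do the float conversion yourself, using the Timestamp's integer nanosecond value:
x_val = float(coarse.start.iloc[-1].value)   # Timestamp -> ns since epoch -> float
xp = np.asarray(fine.start, dtype=float)     # works for the Series, as shown above
np.interp(x=x_val, xp=xp, fp=fine.price)     # ~5.82, matching the Series result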

dtype is ignored when using multilevel columns

When using pandas.read_csv with multi-level columns (read with header=), pandas seems to ignore the dtype= keyword.
Is there a way to make pandas use the passed types?
I am reading large data sets from CSV and therefore try to read the data already in the correct format to save CPU and memory.
I tried passing a dict to dtype= with tuples as well as strings as keys. It seems that dtype expects plain strings. At least I observed that if I pass the level-0 labels, the types are assigned, but unfortunately that would mean that all columns with the same level-0 label get the same type. In the example below, columns (A, int16) and (A, int32) would get type object, and (B, float32) and (B, int16) would get float32.
import pandas as pd
df = pd.DataFrame({
    ('A', 'int16'): pd.Series([1, 2, 3, 4], dtype='int16'),
    ('A', 'int32'): pd.Series([132, 232, 332, 432], dtype='int32'),
    ('B', 'float32'): pd.Series([1.01, 1.02, 1.03, 1.04], dtype='float32'),
    ('B', 'int16'): pd.Series([21, 22, 23, 24], dtype='int16')})
print(df)
df.to_csv('test_df.csv')
print(df.dtypes)
# full column name tuples with level 0/1 labels don't work
df_new = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    dtype={
        ('A', 'int16'): 'int16',
        ('A', 'int32'): 'int32'
    })
print(df_new.dtypes)
# using the level 0 labels for dtype= seems to work
df_new2 = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    dtype={
        'A': 'object',
        'B': 'float32'
    })
print(df_new2.dtypes)
I'd expect the second call to print the dtypes to output the same column types as the first print(df.dtypes), but read_csv does not seem to use the dtype= argument at all; it infers the types, resulting in much more memory-intensive ones.
Was I missing something?
Thank you in advance Jottbe
This is a bug, that is also present in the current version of pandas. I filed a bug report here.
But also for the current version there is a workaround. It works perfectly if the engine is switched to python:
df_new = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    engine='python',
    dtype={
        ('A', 'int16'): 'int16',
        ('A', 'int32'): 'int32'
    })
print(df_new.dtypes)
The output is:
Unnamed: 0_level_0 Unnamed: 0_level_1 int64
A int16 int16
int32 int32
B float32 float64
int16 int64
So the "A-columns" are typed as specified in dtypes.

ValueError: total size of new array must be unchanged (numpy for reshape)

I want to reshape my data vector, but when I run the code
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get this error:
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
my data
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
This isn't giving you a column of floats in a dataframe with the dates as the index; it is putting each whole line, date and value, into the index. I would think that you want to add delimiter=' ' so that it splits the lines properly:
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
Secondly, the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape, you can't reshape to a layout that has a different number of elements than the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
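Note also that the line B = np.reshape(10,10) never references A at all; it tries to reshape the literal number 10. A corrected sketch (assuming your CSV really does contain 100 values):
A = np.array(series)
print(A.size)                 # must be exactly 100 for a (10, 10) reshape

if A.size == 100:
    B = A.reshape(10, 10)
else:
    B = A.reshape(-1, 1)      # e.g. fall back to a single column
print(B.shape)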

Selecting columns from numpy recarray

I have an object of type numpy.core.records.recarray. I want to use it effectively as a pandas dataframe. More precisely, I want to use a subset of its columns in order to obtain a new recarray, the same way you would do pandas_dataframe[[selected_columns]].
What's the easiest way to achieve this?
Without using pandas you can select a subset of the fields of a structured array (recarray). For example:
In [338]: dt=np.dtype('i,f,i,f')
In [340]: A=np.ones((3,),dtype=dt)
In [341]: A[:]=(1,2,3,4)
In [342]: A
Out[342]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
Take a subset of the fields:
In [343]: B=A[['f1','f3']].copy()
In [344]: B
Out[344]:
array([(2.0, 4.0), (2.0, 4.0), (2.0, 4.0)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
that can be modified independently of A:
In [346]: B['f3']=[.1,.2,.3]
In [347]: B
Out[347]:
array([(2.0, 0.10000000149011612), (2.0, 0.20000000298023224),
(2.0, 0.30000001192092896)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
In [348]: A
Out[348]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
Selecting a subset of fields like this is not highly developed. A[['f0','f1']] is enough for viewing, but it will warn or give an error if you try to modify that subset. That's why I used copy for B.
There's a set of functions that facilitate adding and removing fields from recarrays. I'll have to look up the access pattern, but mostly they construct a new dtype and an empty array, and then copy fields by name.
import numpy.lib.recfunctions as rf
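For example, two of those helpers (just a sketch):
# drop fields -> new array with a smaller dtype
A_small = rf.drop_fields(A, ['f0', 'f2'], usemask=False)

# append a field by name -> new array with an extra field
A_big = rf.append_fields(A, 'f4', data=[7, 8, 9], usemask=False)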
update
With newer numpy versions, the multi-field index has changed
In [17]: B=A[['f1','f3']]
In [18]: B
Out[18]:
array([(2., 4.), (2., 4.), (2., 4.)],
dtype={'names':['f1','f3'], 'formats':['<f4','<f4'], 'offsets':[4,12], 'itemsize':16})
This B is a true view, referencing the same data buffer as A. The offsets let it skip the unselected fields. Those gaps can be removed with repack_fields, as documented for numpy.lib.recfunctions.
But when putting this into a dataframe, it doesn't look like we need to do that.
In [19]: df = pd.DataFrame(A)
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f0 3 non-null int32
1 f1 3 non-null float32
2 f2 3 non-null int32
3 f3 3 non-null float32
dtypes: float32(2), int32(2)
memory usage: 176.0 bytes
In [22]: df = pd.DataFrame(B)
In [24]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 f1 3 non-null float32
1 f3 3 non-null float32
dtypes: float32(2)
memory usage: 152.0 bytes
The frame created from B is smaller.
Sometimes when making a dataframe from an array, the array itself is used as the frame's memory. Changing values in the source array will change the values in the frame. But with structured arrays, pandas makes a copy of the data, with a different memory layout.
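A quick way to check that (using the A and df from above):
df = pd.DataFrame(A)
A['f1'][0] = 99.0          # modify the source structured array
df.loc[0, 'f1']            # still 2.0 -- consistent with the frame holding its own copy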
Columns of matching dtype are grouped into a common NumericBlock:
In [42]: pd.DataFrame(A)._data
Out[42]:
BlockManager
Items: Index(['f0', 'f1', 'f2', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 5, 2), 2 x 3, dtype: float32
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int32
In [43]: pd.DataFrame(B)._data
Out[43]:
BlockManager
Items: Index(['f1', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: float32
In addition to @hpaulj's answer: you'll want to repack the copy, otherwise the copied subset will have the same memory footprint as the original.
import numpy as np
# note that you have to import this library explicitly
import numpy.lib.recfunctions
# B has a subset of "columns" but uses the same amount of memory as A
B = A[['f1','f3']].copy()
# C has a smaller memory footprint
C = np.lib.recfunctions.repack_fields(B)
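Comparing the itemsize per row shows the effect (the numbers assume the i4/f4 dtype from the example above and a recent numpy):
print(B.dtype.itemsize)    # 16 -- still carries the gaps left by f0 and f2
print(C.dtype.itemsize)    # 8  -- repacked down to just the two float32 fields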