Numpy interpolation on pandas Timestamp data works if it's a pandas Series but not if it's a single object? - pandas

I'm trying to use np.interp to interpolate a float value based on pandas Timestamp data. However, I noticed that np.interp works if the input x is a pandas Series of Timestamps, but not if it's a single Timestamp object.
Here's the code to illustrate this:
import pandas as pd
import numpy as np
coarse = pd.DataFrame({'start': ['2016-01-01 07:00:00.00000+00:00',
                                 '2016-01-01 07:30:00.00000+00:00']})
fine = pd.DataFrame({'start': ['2016-01-01 07:00:02.156657+00:00',
                               '2016-01-01 07:00:15+00:00',
                               '2016-01-01 07:00:32+00:00',
                               '2016-01-01 07:11:17+00:00',
                               '2016-01-01 07:14:00+00:00',
                               '2016-01-01 07:15:55+00:00',
                               '2016-01-01 07:33:04+00:00'],
                     'price': [0, 1, 2, 3, 4, 5, 6]})
coarse['start'] = pd.to_datetime(coarse['start'])
fine['start'] = pd.to_datetime(fine['start'])
np.interp(x=coarse.start, xp=fine.start, fp=fine.price) # works
np.interp(x=coarse.start.iloc[-1], xp=fine.start, fp=fine.price) # doesn't work
The latter gives the error
TypeError: float() argument must be a string or a number, not 'Timestamp'
I am wondering why the latter doesn't work while the former does.

The x input of np.interp must be array-like (iterable); you can use .iloc[[-1]] to pass a one-element Series instead of a scalar Timestamp:
np.interp(x=coarse.start.iloc[[-1]], xp=fine.start, fp=fine.price)
Output: array([5.82118562])
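Continuing the example above, if you'd rather have a plain Python float than a one-element array, you can pull the scalar out of the result:
res = np.interp(x=coarse.start.iloc[[-1]], xp=fine.start, fp=fine.price)
val = res.item()   # or res[0]; extracts the single interpolated value
print(val)         # 5.82118562..., the same value as the array output above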

Look at what you get when selecting an item from the Series:
In [8]: coarse.start
Out[8]:
0 2016-01-01 07:00:00+00:00
1 2016-01-01 07:30:00+00:00
Name: start, dtype: datetime64[ns, UTC]
In [9]: coarse.start.iloc[-1]
Out[9]: Timestamp('2016-01-01 07:30:00+0000', tz='UTC')
With the list index, it's a Series:
In [10]: coarse.start.iloc[[-1]]
Out[10]:
1 2016-01-01 07:30:00+00:00
Name: start, dtype: datetime64[ns, UTC]
I was going to scold you for not showing the full error message, but I see that it's a compiled piece of code that raises the error. Keep in mind that interp is a numpy function: it works with numpy arrays, and for math like this, float-dtype ones.
So it's a good guess that interp is trying to make a float array from your argument.
In [14]: np.asarray(coarse.start, dtype=float)
Out[14]: array([1.4516316e+18, 1.4516334e+18])
In [15]: np.asarray(coarse.start.iloc[[1]], dtype=float)
Out[15]: array([1.4516334e+18])
In [16]: np.asarray(coarse.start.iloc[1], dtype=float)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[16], line 1
----> 1 np.asarray(coarse.start.iloc[1], dtype=float)
TypeError: float() argument must be a string or a number, not 'Timestamp'
It can't make a float value from a single Python Timestamp object.
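If you do want to interpolate at a single Timestamp and get a scalar back, one workaround (a sketch, mirroring the float conversion that interp does internally for the Series case, as shown in In [14] above) is to convert the timestamps to numbers yourself:
# Timestamp.value is the nanoseconds-since-epoch integer; np.asarray(..., dtype=float)
# does the equivalent conversion for the whole Series
x_val = float(coarse.start.iloc[-1].value)
xp_arr = np.asarray(fine.start, dtype=float)
val = np.interp(x=x_val, xp=xp_arr, fp=fine.price)
print(val)   # a plain float, matching the array result from .iloc[[-1]]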

Related

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
    [
        ["Jay", "MS"],
        ["Jay", "Music"],
        ["Dorsey", "Music"],
        ["Dorsey", "Piano"],
        ["Mark", "MS"],
    ],
    columns=["Name", "Course"],
)
pandas_df_json = (
    courses_df.groupby(["Name"])
    .apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
    .reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation:
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
    name="courses_json"
).compute()
And the error I get is
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from Dask should be the same as from pandas, that is:
     Name                             courses_json
0  Dorsey  [{"Course":"Music"},{"Course":"Piano"}]
1     Jay     [{"Course":"MS"},{"Course":"Music"}]
2    Mark                        [{"Course":"MS"}]
How do I achieve this in Dask?
My try so far
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument and also want the second column to have a meaningful name like courses_json.
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this it's entirely possible that Dask may infer faulty datatypes. One partition could for example be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records"),
    meta=("Name", "O")
).to_frame()

# rename columns
new_df.columns = ["courses_json"]

# use a numeric int index instead of Name, as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a Dask Series, not a DataFrame, which is why you need to specify the meta with numpy/pandas dtypes (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists only of an index and values, so you cannot name the second column directly without converting it back to a dataframe using the .to_frame() method.
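A hedged variant (untested here), assuming the name given in the meta tuple carries through to the resulting Series so that .to_frame() picks it up as the column name, which would let you skip the manual rename:
new_df = (
    df.groupby(["Name"])
      .apply(lambda x: x.drop(columns="Name").to_json(orient="records"),
             meta=("courses_json", "object"))
      .to_frame()       # column takes the Series name, i.e. courses_json
      .reset_index()
)
new_df.compute()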

Creating a Pandas DataFrame from a NumPy masked array?

I am trying to create a Pandas DataFrame from a NumPy masked array, which I understand is a supported operation. This is an example of the source array:
import numpy.ma as ma

a = ma.array([(1, 2.2), (42, 5.5)],
             dtype=[('a', int), ('b', float)],
             mask=[(True, False), (False, True)])
which outputs as:
masked_array(data=[(--, 2.2), (42, --)],
             mask=[( True, False), (False, True)],
       fill_value=(999999, 1.e+20),
            dtype=[('a', '<i8'), ('b', '<f8')])
Attempting to create a DataFrame with pd.DataFrame(a) returns:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-40-a4c5236a3cd4> in <module>
----> 1 pd.DataFrame(a)
/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
636 # a masked array
637 else:
--> 638 data = sanitize_masked_array(data)
639 mgr = ndarray_to_mgr(
640 data,
/usr/local/anaconda/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_masked_array(data)
452 """
453 mask = ma.getmaskarray(data)
--> 454 if mask.any():
455 data, fill_value = maybe_upcast(data, copy=True)
456 data.soften_mask() # set hardmask False if it was True
/usr/local/anaconda/lib/python3.8/site-packages/numpy/core/_methods.py in _any(a, axis, dtype, out, keepdims, where)
54 # Parsing keyword arguments is currently fairly slow, so avoid it for now
55 if where is True:
---> 56 return umr_any(a, axis, dtype, out, keepdims)
57 return umr_any(a, axis, dtype, out, keepdims, where=where)
58
TypeError: cannot perform reduce with flexible type
Is this operation indeed supported? Currently using Pandas 1.3.3 and NumPy 1.20.3.
Update
Is this supported?
According to the Pandas documentation here:
Alternatively, you may pass a numpy.MaskedArray as the data argument to the DataFrame constructor, and its masked entries will be considered missing.
The code above was my way of asking "What will I get?" if I passed a NumPy masked array to Pandas; the documented behavior (masked entries treated as missing) was the result I was hoping for. The above was the simplest example I could come up with.
I do expect each Series/column in Pandas to be of a single type.
Update 2
Anyone interested in this should probably see this Pandas GitHub issue; it's noted there that Pandas has "deprecated support for MaskedRecords".
If the array has a simple dtype, the dataframe creation works (as documented):
In [320]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: mask=[(True,False),(False,True)])
In [321]: a
Out[321]:
masked_array(
data=[[--, 2.2],
[42.0, --]],
mask=[[ True, False],
[False, True]],
fill_value=1e+20)
In [322]: import pandas as pd
In [323]: pd.DataFrame(a)
Out[323]:
0 1
0 NaN 2.2
1 42.0 NaN
This a has shape (2, 2), and the result has 2 rows and 2 columns.
With the compound dtype, the shape is 1d:
In [326]: a = np.ma.array([(1, 2.2), (42, 5.5)],
...: dtype=[('a',int),('b',float)],
...: mask=[(True,False),(False,True)])
In [327]: a.shape
Out[327]: (2,)
The error is the result of a test on the mask; "flexible type" refers to your compound dtype:
In [330]: a.mask.any()
Traceback (most recent call last):
File "<ipython-input-330-8dc32ee3f59d>", line 1, in <module>
a.mask.any()
File "/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py", line 57, in _any
return umr_any(a, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type
The documented pandas feature clearly does not apply to structured arrays. Without studying the pandas code I can't say exactly what it's trying to do at this point, but it's clear the code was not written with structured arrays in mind.
The non-masked part does work, with the desired column dtypes:
In [332]: pd.DataFrame(a.data)
Out[332]:
a b
0 1 2.2
1 42 5.5
Using the default fill:
In [344]: a.filled()
Out[344]:
array([(999999, 2.2e+00), ( 42, 1.0e+20)],
dtype=[('a', '<i8'), ('b', '<f8')])
In [345]: pd.DataFrame(a.filled())
Out[345]:
a b
0 999999 2.200000e+00
1 42 1.000000e+20
I'd have to look more at the ma docs/code to see if it's possible to apply a different fill to the two fields. Filling with nan doesn't work for the int field; numpy doesn't have pandas' nullable integer NA. I haven't worked enough with that pandas feature to know whether the resulting dtype stays int or is changed to object.
Anyways, you are pushing the bounds of both np.ma and pandas with this task.
edit
The default fill_value is a tuple, one for each field:
In [350]: a.fill_value
Out[350]: (999999, 1.e+20)
So we can fill the fields differently, and make a frame from that:
In [351]: a.filled((-1, np.nan))
Out[351]: array([(-1, 2.2), (42, nan)], dtype=[('a', '<i8'), ('b', '<f8')])
In [352]: pd.DataFrame(a.filled((-1, np.nan)))
Out[352]:
a b
0 -1 2.2
1 42 NaN
Looks like I can make a structured array with a pandas dtype, and its associated fill_value:
In [363]: a = np.ma.array([(1, 2.2), (42, 5.5)],
     ...:                 dtype=[('a', pd.Int64Dtype), ('b', float)],
     ...:                 mask=[(True, False), (False, True)],
     ...:                 fill_value=(pd.NA, np.nan))
In [364]: a
Out[364]:
masked_array(data=[(--, 2.2), (42, --)],
mask=[( True, False), (False, True)],
fill_value=(<NA>, nan),
dtype=[('a', 'O'), ('b', '<f8')])
In [366]: pd.DataFrame(a.filled())
Out[366]:
a b
0 <NA> 2.2
1 42 NaN
The question is what would you expect to get? It would be ambiguous for pandas to convert your data.
If you want to get the original data:
>>> pd.DataFrame(a.data)
a b
0 1 2.2
1 42 5.5
If you want to consider masked values invalid:
>>> pd.DataFrame(a.filled(np.nan))
BUT, for this to work, all fields of the masked array should be of float type.
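A minimal sketch of what that could look like for this structured array, assuming you're willing to cast the int field to float so that masked entries can become NaN:
import numpy as np
import numpy.ma as ma
import pandas as pd

# Build one float column per field, filling masked entries with NaN
df = pd.DataFrame({name: a[name].astype(float).filled(np.nan)
                   for name in a.dtype.names})
#       a    b
# 0   NaN  2.2
# 1  42.0  NaN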

ValueError: total size of new array must be unchanged (numpy for reshape)

I want to reshape my data vector, but when I run the code
from pandas import read_csv
import numpy as np
#from pandas import Series
#from matplotlib import pyplot
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
A= np.array(series)
B = np.reshape(10,10)
print (B)
I get the error
result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: total size of new array must be unchanged
my data
Month xxx
1749-01 58
1749-02 62.6
1749-03 70
1749-04 55.7
1749-05 85
1749-06 83.5
1749-07 94.8
1749-08 66.3
1749-09 75.9
1749-10 75.5
1749-11 158.6
1749-12 85.2
1750-01 73.3
.... ....
.... ....
There seem to be two issues with what you are trying to do. The first relates to how you read the data in pandas:
series = read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
print(series)
>>>>Empty DataFrame
Columns: []
Index: [1749-01 58, 1749-02 62.6, 1749-03 70, 1749-04 55.7, 1749-05 85, 1749-06 83.5, 1749-07 94.8, 1749-08 66.3, 1749-09 75.9, 1749-10 75.5, 1749-11 158.6, 1749-12 85.2, 1750-01 73.3]
This isn't giving you a column of floats in a dataframe with the dates as the index; it is putting each whole line into the index, date and value together. I would think that you want to add delimiter=' ' so that it splits the lines properly:
series =read_csv('book1.csv', header=0, parse_dates=[0], index_col=0, delimiter=' ', squeeze=True)
>>>> Month
1749-01-01 58.0
1749-02-01 62.6
1749-03-01 70.0
1749-04-01 55.7
1749-05-01 85.0
1749-06-01 83.5
1749-07-01 94.8
1749-08-01 66.3
1749-09-01 75.9
1749-10-01 75.5
1749-11-01 158.6
1749-12-01 85.2
1750-01-01 73.3
Name: xxx, dtype: float64
This gives you the dates as the index with the 'xxx' value in the column.
Secondly, the reshape. The error is quite descriptive in this case. If you want to use numpy.reshape, you can't reshape to a layout that has a different number of elements from the original data. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6]) # size 6 array
a.reshape(2, 3)
>>>> [[1, 2, 3],
[4, 5, 6]]
This is fine because the array starts out length 6, and I'm reshaping to 2 x 3, and 2 x 3 = 6.
However, if I try:
a.reshape(10, 10)
>>>> ValueError: cannot reshape array of size 6 into shape (10,10)
I get the error, because I need 10 x 10 = 100 elements to do this reshape, and I only have 6.
Without the complete dataset it's impossible to know for sure, but I think this is the same problem you are having, although you are converting your whole dataframe to a numpy array.
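One more thing worth noting from the question's code: B = np.reshape(10, 10) never actually passes A to reshape, so numpy ends up trying to reshape the scalar 10. A minimal sketch of the corrected call, assuming the series really does contain exactly 100 values:
A = np.array(series)
assert A.size == 100, "a 10x10 reshape needs exactly 100 values"
B = A.reshape(10, 10)      # equivalently: np.reshape(A, (10, 10))
print(B)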

How do I get pandas update function to correctly handle numpy.datetime64?

I have a dataframe with a column that may contain None and another dataframe with the same index that has datetime values populated. I am trying to update the first from the second using pandas.update.
import numpy as np
import pandas as pd
df = pd.DataFrame([{'id': 0, 'as_of_date': np.datetime64('2017-05-08')}])
print(df.as_of_date)
df2 = pd.DataFrame([{'id': 0, 'as_of_date': None}])
print(df2.as_of_date)
df2.update(df)
print(df2.as_of_date)
print(df2.apply(lambda x: x['as_of_date'] - np.timedelta64(1, 'D'), axis=1))
This results in
0 2017-05-08
Name: as_of_date, dtype: datetime64[ns]
0 None
Name: as_of_date, dtype: object
0 1494201600000000000
Name: as_of_date, dtype: object
0 -66582 days +10:33:31.122941
dtype: timedelta64[ns]
So basically update converts the datetime to an integer (nanoseconds since the epoch), but keeps the column's dtype as object. Then if I try to do date math on it, I get wacky results because numpy doesn't know how to treat it.
I was hoping df2 would look like df after updating. How can I fix this?
Try this:
In [391]: df2 = df2.combine_first(df)
In [392]: df2
Out[392]:
as_of_date id
0 2017-05-08 0
In [396]: df2.dtypes
Out[396]:
as_of_date datetime64[ns]
id int64
dtype: object
A two-step approach:
1. Fill None data in df2 using data from df:
df2 = df2.combine_first(df)
2. Update all elements in df2 using elements from df:
df2.update(df)
Without the 2nd step, df2 will only take values from df to fill its Nones.
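If you prefer to stick with update alone, a hedged alternative (a sketch based on the output shown in the question) is to convert the column back to datetime afterwards, since update leaves it as an object column of nanosecond integers:
df2.update(df)
# pd.to_datetime interprets plain integers as nanoseconds since the epoch,
# which is what update left behind in the object-dtype column
df2['as_of_date'] = pd.to_datetime(df2['as_of_date'])
print(df2.as_of_date)
# 0   2017-05-08
# Name: as_of_date, dtype: datetime64[ns]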

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
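Alternatively, pd.to_numeric with errors='coerce' both locates the non-numeric values and gives you a column you can aggregate; a minimal sketch using the toy df above:
# Non-numeric entries (like the empty string) become NaN after coercion
numeric_A = pd.to_numeric(df['A'], errors='coerce')
print(df['A'][numeric_A.isna()])   # shows the offending value(s)
# Replace the column with the coerced version so mean aggregation works
df['A'] = numeric_A
print(df.groupby(['B']).agg(np.mean))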