Why does Series.min(skipna=True) throws an error caused by na value? - pandas

I work with timestamps (having mixed DST values). Tried in Pandas 1.0.0:
s = pd.Series(
[pd.Timestamp('2020-02-01 11:35:44+01'),
np.nan, # same result with pd.Timestamp('nat')
pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What I am missing here? Is it a bug?

I think problem here is pandas working with Series with different timezones like objects, so max and min here failed.
s = pd.Series(
[pd.Timestamp('2020-02-01 11:35:44+01'),
np.nan, # same result with pd.Timestamp('nat')
pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0 2020-02-01 11:35:44+01:00
1 NaN
2 2019-04-13 12:10:20+02:00
dtype: object
So if convert to datetimes (but not with mixed timezones) it working well:
print (pd.to_datetime(s, utc=True))
0 2020-02-01 10:35:44+00:00
1 NaT
2 2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]
print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution if need different timezones is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00

Related

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because there are 2 dates with this special time: with DST and without DST:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
.tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted #Corralien's answer, and I also wanted to show what workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.

Pandas error with multidimensional key using .loc and a boolean

Been running into this same error for 2 weeks, even though the code worked before. Not sure if I updated pandas as part of another library install, and maybe something changed there. Currently on version 23.4. Expected outcome is returning just the row with that identifier value.
In [42]: df.head()
Out[43]:
index Identifier ...
0 51384710 ...
1 74838J10 ...
2 80589M10 ...
3 67104410 ...
4 50241310 ...
[5 rows x 14 columns]
In [43]: df.loc[df.Identifier.isin(['51384710'])].head()
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-44-a3dbf43451ef>", line 1, in <module>
df.loc[df.Identifier.isin(['51384710'])].head()
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1899, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
**ValueError: Cannot index with multidimensional key**
Code Snippet
Fixed it. I'd done df.columns = [column_list] where column_list = [...], which caused df to be treated as if it had a multiindex, even though there was only one level. removed the brackets from the df.columns assignment.
Try changing
df.loc[df.Identifier.isin(['51384710'])].head()
to
df[df.Identifier.isin(['51384710'])].head()

How to merge pandas series off of a column of dates

I have two series:
date DEF
0 1/31/1986 0.0140
1 2/28/1986 0.0150
2 3/31/1986 0.0160
3 4/30/1986 0.0120
4 5/30/1986 0.0120
date PE
0 1/31/1900 12.71
1 2/28/1900 12.94
2 3/31/1900 13.04
3 4/30/1900 13.21
4 5/31/1900 12.58
I need to iterate over several DataFrames of this nature and combine them all into one big DataFrame, where only values that align with the dates get added. My function so far:
def get_combined_vars(start, end):
rows = pd.date_range(start=start, end=end, freq='BM')
df1 = pd.DataFrame(rows, columns=['date'])
for key in variables.keys():
check = variables[key][0]
if check == 1:
df2 = pd.DataFrame(variables[key][1]())
print(df2.head(5))
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
df2,
right_on='date',
left_on='datekey',
direction='nearest',
suffixes=('_x',''))
print(df1.head(10))
return df1
I can't seem to find the right command to merge DataFrames based off of a column.
Desired output:
date DEF PE
0 1/31/1900 0.0140 12.71
1 2/28/1900 0.0150 12.94
2 3/31/1900 0.0160 13.04
3 4/30/1900 0.0120 13.21
4 5/31/1900 0.0120 12.58
Merge_asof issue:
runfile('H:/Market Timing/Files/market_timing.py', wdir='H:/Market Timing/Files')
date BY
0 1/31/1963 0.98
1 2/28/1963 1
2 3/29/1963 1.01
3 4/30/1963 1.01
4 5/31/1963 1.01
Traceback (most recent call last):
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 303, in _convert_listlike
values, tz = tslib.datetime_to_datetime64(arg)
File "pandas\_libs\tslib.pyx", line 1884, in pandas._libs.tslib.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "H:/Market Timing/Files/market_timing.py", line 88, in <module>
print(get_combined_vars('1/31/1995', '1/31/2005').head(10))
File "H:/Market Timing/Files/market_timing.py", line 43, in get_combined_vars
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 373, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 306, in _convert_listlike
raise e
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 294, in _convert_listlike
require_iso8601=require_iso8601
File "pandas\_libs\tslib.pyx", line 2156, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2379, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2373, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 1182, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 581, in parse
ret = default.replace(**repl)
ValueError: day is out of range for month
I believe on the third pass of these two DataFrames attempting to be combined it runs into this error: ValueError: day is out of range for month
Can a buffer be added for discrepancies in data like this?
You can use pd.merge_asof, however, first you'll need to get your dates on a common year.
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
df2,
right_on='date',
left_on='datekey',
direction='nearest',
suffixes=('_x',''))[['date','DEF','PE']]
Output:
date DEF PE
0 1900-01-31 0.014 12.71
1 1900-02-28 0.015 12.94
2 1900-03-31 0.016 13.04
3 1900-04-30 0.012 13.21
4 1900-05-31 0.012 12.58
You would use pandas.Merge (or DataFrame.join as shorthand) to do this:
import pandas as pd
pd.Merge(df1, df2, on="date")
...But as Scott Boston mentioned in his comment, the data doesn't align so you won't get your expected results.

df.Change[-1] producing errors.

I'm trying to slice the last value of the series Change from my dataframe df.
The dataframe looks something like this
Change
0 1.000000
1 0.917727
2 1.000000
3 0.914773
4 0.933182
5 0.936136
6 0.957500
14466949 1.998392
14466950 2.002413
14466951 1.998392
14466952 1.974266
14466953 1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I'm getting an output, but when I'm input
df.Change[-1]
I'm getting the following error
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use for slicing is resulting in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issues with df.Change[100] because 100 is in its index. -1 is not. You happen to have your index the same as if you were using ordinal positions. To explicitly get ordinal positions, use iloc.
df.Change.iloc[-1]
or
df.Change.values[-1]

How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
['a', 2],
['a', 3],
['a', 4],
['b', 5],
['b', 6],
['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
P Q
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 b 7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
Q
P
a 10
b 18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
P Q
P
a 0.344457 0.344457
b 0.990507 0.990507
I have not been able to find anything in the documentation that would explain
why should gdf.aggregate(mysum) fail? (IOW, does this failure agree with documented behavior, or is it a bug in pandas?)
why should gdf.aggregate(lambda *a, **k: rn.random()) produce a two-column output while gdf.aggregate(sum) produce a one-column output?
what signatures (input and output) should an aggregation function foo have so that gdf.aggregate(foo) will return a table having only column Q (like the result of gdf.aggregate(sum))?
Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and computed statistics on Q. To do that use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's sum. Groupby.sum() is written in Cython and only acts on numeric dtypes. That's why you get the error about adding ints to strs.
Your other two questions are related to that. You're inputing two columns into gdf.agg, P and Q so you get two columns out for your gdf.aggregate(lambda *a, **k: rn.random()). gdf.sum() ignores the string column.