Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505

It's ambiguous because there are two instants with this wall-clock time on that date: one with DST and one without:
# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
        .tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)
# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
        .tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')

I accepted @Corralien's answer, and I also wanted to show the workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
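A minimal sketch of what those helpers might look like, written to match that description (shown as plain functions rather than methods; this is my own illustration, not code from the original post):
import numpy as np
import pandas as pd

# Container types that should be treated as array-like column values
ARRAY_TYPES = (set, list, tuple, np.ndarray, pd.Series, pd.Index)

def is_array(x):
    # True if x is one of the recognized container types
    return isinstance(x, ARRAY_TYPES)

def is_scalar(x):
    # Anything that is not a recognized container is treated as a scalar
    return not is_array(x)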
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.

Related

trouble deleting specific columns in genfromtxt function

I made a Python script which takes pdbqt files as input and returns a txt file. Since not all the lines have the same number of columns, it's not able to read the files. How can I ignore those lines?
sample pdbqt and txt files
the code
from __future__ import division
import numpy as np

def function(filename):
    data = np.genfromtxt(filename, dtype=float, usecols=(6, 7, 8), skip_footer=1)

import os
all_filenames = os.listdir()
import glob
all_filenames = glob.glob('*.pdbqt')
print(all_filenames)
for filename in all_filenames:
    function(filename)
the error I am getting
Traceback (most recent call last):
File "cen7.py", line 45, in <module>
function(filename)
File "cen7.py", line 7, in function
data = np.genfromtxt(filename, dtype = float , usecols = (6, 7, 8), skip_footer=1)
File "/home/../.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 2261, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #3037 (got 4 columns instead of 3)
Line #6066 (got 4 columns instead of 3)
Line #9103 (got 4 columns instead of 3)
Line #12140 (got 4 columns instead of 3)
Line #15177 (got 4 columns instead of 3)
Let's make a sample csv:
In [75]: txt = """1,2,3,4
...: 5,6,7,8,9
...: """.splitlines()
This error is to be expected - the number of columns in the 2nd line is larger than in the previous line:
In [76]: np.genfromtxt(txt, delimiter=',')
Traceback (most recent call last):
Input In [76] in <cell line: 1>
np.genfromtxt(txt, delimiter=',')
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #2 (got 5 columns instead of 4)
I can avoid that with usecols. It isn't bothered by the extra columns in line 2:
In [77]: np.genfromtxt(txt, delimiter=',',usecols=(1,2,3))
Out[77]:
array([[2., 3., 4.],
[6., 7., 8.]])
But if the line is too short for the usecols, I get an error:
In [78]: np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
Traceback (most recent call last):
Input In [78] in <cell line: 1>
np.genfromtxt(txt, delimiter=',',usecols=(2,3,4))
File /usr/local/lib/python3.8/dist-packages/numpy/lib/npyio.py:2261 in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #1 (got 4 columns instead of 3)
The wording of the error isn't quite right, but it is clear which line is the problem.
That should give you something to look for when scanning the problem lines in your csv.
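If you would rather just skip the inconsistent lines in the pdbqt files, genfromtxt also accepts invalid_raise=False, which emits a warning and drops the offending lines instead of raising. Roughly (my own suggestion, adapted from the question's code):
import numpy as np

def function(filename):
    # invalid_raise=False makes genfromtxt warn about and skip any line whose
    # column count does not match, instead of raising ValueError
    return np.genfromtxt(filename, dtype=float, usecols=(6, 7, 8),
                         skip_footer=1, invalid_raise=False)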

Trying to sum columns in Pandas dataframe, issue with index it seems

Trying to sum columns in Pandas dataframe, issue with index it seems...
Part of dataset looks like this, for multiple years:
snapshot of dataset
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',columns='UNITSSTR', aggfunc='size')
The dataframe looks like this now, and these are its properties:
Trying to sum multi-family units so I am specifying the columns to sum
cols = ['05', '06']
CA_HousingTrend['sum_stats'] = CA_HousingTrend[cols].sum(axis=1)
This is the error I get:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/code.py", line 90, in runcode
exec(code, self.locals)
File "", line 5, in
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/frame.py", line 3511, in getitem
indexer = self.columns._get_indexer_strict(key, "columns")1
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5782, in _get_indexer_strict
self._raise_if_missing(keyarr, indexer, axis_name)
File "/Users/alexandramaxim/Documents/Py/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 5842, in _raise_if_missing
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['05', '06'], dtype='object', name='UNITSSTR')] are in the [columns]"
Not sure if you need the index, but the pivot probably created a multi-index. Try this.
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',columns='UNITSSTR', aggfunc='size')
# A new dataframe just so you have something new to play with.
new_df = CA_HousingTrend.reset_index()
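A minimal sketch of how that plays out, using the question's variable names (the print is my addition: the KeyError above shows that '05' and '06' are not among the pivoted column labels, so it's worth checking what they actually are, e.g. numeric 5 and 6):
CA_HousingTrend = CA_HousingTrend_temp.pivot_table(index='YEAR',
                                                   columns='UNITSSTR',
                                                   aggfunc='size')
new_df = CA_HousingTrend.reset_index()

# Inspect the real column labels before selecting them
print(new_df.columns.tolist())

cols = ['05', '06']          # replace with whatever the printout shows
new_df['sum_stats'] = new_df[cols].sum(axis=1)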

Why does Series.min(skipna=True) throw an error caused by an NA value?

I work with timestamps that have mixed DST offsets. Tried in pandas 1.0.0:
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What am I missing here? Is it a bug?
I think the problem here is that pandas stores a Series with mixed timezones with object dtype, so max and min fail.
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0 2020-02-01 11:35:44+01:00
1 NaN
2 2019-04-13 12:10:20+02:00
dtype: object
So if you convert to datetimes (though no longer with mixed timezones), it works well:
print (pd.to_datetime(s, utc=True))
0 2020-02-01 10:35:44+00:00
1 NaT
2 2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]
print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution, if you need to keep the different timezones, is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00

Pandas error with multidimensional key using .loc and a boolean

Been running into this same error for 2 weeks, even though the code worked before. Not sure if I updated pandas as part of another library install, and maybe something changed there. Currently on version 0.23.4. The expected outcome is returning just the row with that identifier value.
In [42]: df.head()
Out[43]:
index Identifier ...
0 51384710 ...
1 74838J10 ...
2 80589M10 ...
3 67104410 ...
4 50241310 ...
[5 rows x 14 columns]
In [43]: df.loc[df.Identifier.isin(['51384710'])].head()
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-44-a3dbf43451ef>", line 1, in <module>
df.loc[df.Identifier.isin(['51384710'])].head()
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1899, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
ValueError: Cannot index with multidimensional key
Fixed it: I'd done df.columns = [column_list] where column_list = [...], which caused df to be treated as if it had a MultiIndex even though there was only one level. Removing the brackets from the df.columns assignment solved it.
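A rough reconstruction of that mistake and the fix (the data is made up for illustration; the commented-out line is where the original error appears):
import pandas as pd

df = pd.DataFrame({'Identifier': ['51384710', '74838J10'], 'value': [1, 2]})
column_list = ['Identifier', 'value']

df.columns = [column_list]     # extra brackets -> a one-level MultiIndex
print(type(df.columns))        # <class 'pandas.core.indexes.multi.MultiIndex'>
# df.loc[df.Identifier.isin(['51384710'])]   # raises the multidimensional-key ValueError

df.columns = column_list       # plain list -> a regular Index
print(df.loc[df.Identifier.isin(['51384710'])])   # returns just the matching row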
Try changing
df.loc[df.Identifier.isin(['51384710'])].head()
to
df[df.Identifier.isin(['51384710'])].head()

How to implement aggregation functions for pandas groupby objects?

Here's the setup for this question:
import numpy as np
import pandas as pd
import collections as co
data = [['a', 1],
        ['a', 2],
        ['a', 3],
        ['a', 4],
        ['b', 5],
        ['b', 6],
        ['b', 7]]
varnames = tuple('PQ')
df = pd.DataFrame(co.OrderedDict([(varnames[i], [row[i] for row in data])
                                  for i in range(len(varnames))]))
gdf = df.groupby(df.ix[:, 0])
After evaluating the above, df looks like this:
>>> df
P Q
0 a 1
1 a 2
2 a 3
3 a 4
4 b 5
5 b 6
6 b 7
gdf is a DataFrameGroupBy object associated with df, where the groups are determined by the values in the first column of df.
Now, watch this:
>>> gdf.aggregate(sum)
Q
P
a 10
b 18
...but repeating the same thing after replacing sum with a pass-through wrapper for it, bombs:
>>> mysum = lambda *a, **k: sum(*a, **k)
>>> mysum(range(10)) == sum(range(10))
True
>>> gdf.aggregate(mysum)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1699, in aggregate
result = self._aggregate_generic(arg, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1757, in _aggregate_generic
return self._aggregate_item_by_item(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1782, in _aggregate_item_by_item
result[item] = colg.aggregate(func, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1426, in aggregate
result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "/home/yt/.virtualenvs/yte/lib/python2.7/site-packages/pandas/core/groupby.py", line 1508, in _aggregate_named
output = func(group, *args, **kwargs)
File "<stdin>", line 1, in <lambda>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Here's a subtler (though probably related) issue. Recall that the result of gdf.aggregate(sum) was a dataframe with a single column, Q. Now, note the result below contains two columns, P and Q:
>>> import random as rn
>>> gdf.aggregate(lambda *a, **k: rn.random())
P Q
P
a 0.344457 0.344457
b 0.990507 0.990507
I have not been able to find anything in the documentation that would explain:
1. Why should gdf.aggregate(mysum) fail? (IOW, does this failure agree with documented behavior, or is it a bug in pandas?)
2. Why should gdf.aggregate(lambda *a, **k: rn.random()) produce a two-column output while gdf.aggregate(sum) produces a one-column output?
3. What signatures (input and output) should an aggregation function foo have so that gdf.aggregate(foo) will return a table having only column Q (like the result of gdf.aggregate(sum))?
Your problems all come down to the columns that are included in the GroupBy. I think you want to group by P and compute statistics on Q. To do that, use
gdf = df.groupby('P')
instead of your method. Then any aggregations will not include the P column.
The sum in your function is Python's built-in sum. GroupBy.sum() is written in Cython and only acts on numeric dtypes; that's why you get the error about adding ints to strs.
Your other two questions are related to that. You're passing two columns into gdf.agg, P and Q, so you get two columns out of gdf.aggregate(lambda *a, **k: rn.random()). gdf.sum() ignores the string column.
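A quick sketch of that suggestion applied to the question's data (my own illustration; the commented output is approximate):
import pandas as pd

df = pd.DataFrame({'P': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                   'Q': [1, 2, 3, 4, 5, 6, 7]})

gdf = df.groupby('P')              # group by the column name, not by df.ix[:, 0]

mysum = lambda *a, **k: sum(*a, **k)
print(gdf.aggregate(mysum))        # only Q is aggregated, so Python's sum sees ints
#    Q
# P
# a  10
# b  18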