How to merge pandas Series on a column of dates - pandas

I have two series:
        date     DEF
0  1/31/1986  0.0140
1  2/28/1986  0.0150
2  3/31/1986  0.0160
3  4/30/1986  0.0120
4  5/30/1986  0.0120

        date     PE
0  1/31/1900  12.71
1  2/28/1900  12.94
2  3/31/1900  13.04
3  4/30/1900  13.21
4  5/31/1900  12.58
I need to iterate over several DataFrames like these and combine them all into one big DataFrame, where only the values whose dates align get added. My function so far:
def get_combined_vars(start, end):
    rows = pd.date_range(start=start, end=end, freq='BM')
    df1 = pd.DataFrame(rows, columns=['date'])
    for key in variables.keys():
        check = variables[key][0]
        if check == 1:
            df2 = pd.DataFrame(variables[key][1]())
            print(df2.head(5))
            pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
                          df2,
                          right_on='date',
                          left_on='datekey',
                          direction='nearest',
                          suffixes=('_x', ''))
    print(df1.head(10))
    return df1
I can't seem to find the right command to merge DataFrames based on a column.
Desired output:
        date     DEF     PE
0  1/31/1900  0.0140  12.71
1  2/28/1900  0.0150  12.94
2  3/31/1900  0.0160  13.04
3  4/30/1900  0.0120  13.21
4  5/31/1900  0.0120  12.58
merge_asof issue:
runfile('H:/Market Timing/Files/market_timing.py', wdir='H:/Market Timing/Files')
date BY
0 1/31/1963 0.98
1 2/28/1963 1
2 3/29/1963 1.01
3 4/30/1963 1.01
4 5/31/1963 1.01
Traceback (most recent call last):
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 303, in _convert_listlike
values, tz = tslib.datetime_to_datetime64(arg)
File "pandas\_libs\tslib.pyx", line 1884, in pandas._libs.tslib.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Developer\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "H:/Market Timing/Files/market_timing.py", line 88, in <module>
print(get_combined_vars('1/31/1995', '1/31/2005').head(10))
File "H:/Market Timing/Files/market_timing.py", line 43, in get_combined_vars
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 373, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 306, in _convert_listlike
raise e
File "C:\Developer\Anaconda\lib\site-packages\pandas\core\tools\datetimes.py", line 294, in _convert_listlike
require_iso8601=require_iso8601
File "pandas\_libs\tslib.pyx", line 2156, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2379, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 2373, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 1182, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Developer\Anaconda\lib\site-packages\dateutil\parser.py", line 581, in parse
ret = default.replace(**repl)
ValueError: day is out of range for month
I believe that on the third pass, when these two DataFrames are being combined, it runs into this error: ValueError: day is out of range for month
Can a buffer be added for discrepancies like this in the data?

You can use pd.merge_asof; however, first you'll need to get your dates onto a common year.
pd.merge_asof(df1.assign(datekey=pd.to_datetime(df1['date'].dt.strftime('%m-%d') + '-1900')),
              df2,
              right_on='date',
              left_on='datekey',
              direction='nearest',
              suffixes=('_x', ''))[['date', 'DEF', 'PE']]
Output:
        date    DEF     PE
0 1900-01-31  0.014  12.71
1 1900-02-28  0.015  12.94
2 1900-03-31  0.016  13.04
3 1900-04-30  0.012  13.21
4 1900-05-31  0.012  12.58
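The "day is out of range for month" error in the question most likely comes from a leap day: a business month-end of 2/29 (e.g. from 1996 or 2000) cannot be mapped onto 1900, which is not a leap year. A minimal sketch of one workaround, assuming both frames already have datetime64 'date' columns (the helper name to_common_year is mine): use a leap year such as 2000 as the common year, and sort both keys as merge_asof requires.

import pandas as pd

def to_common_year(dates, year=2000):
    # Remap every date onto one common *leap* year so month/day pairs
    # like 2/29 stay valid (1900 is not a leap year; 2000 is).
    return pd.to_datetime(dates.dt.strftime('%m-%d') + '-' + str(year),
                          format='%m-%d-%Y')

# merge_asof requires both keys to be sorted
merged = pd.merge_asof(
    df1.assign(datekey=to_common_year(df1['date'])).sort_values('datekey'),
    df2.assign(date=to_common_year(df2['date'])).sort_values('date'),
    left_on='datekey',
    right_on='date',
    direction='nearest',
    suffixes=('_x', ''))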

You would use pandas.merge (or the DataFrame.merge method) to do this:
import pandas as pd

pd.merge(df1, df2, on="date")
...But as Scott Boston mentioned in his comment, the data doesn't align, so you won't get the results you expect.
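If an exact merge is what you want anyway, one option (a sketch of my own, not part of this answer, assuming both 'date' columns are datetime64) is to merge on month/day keys so the differing years stop mattering:

import pandas as pd

combined = pd.merge(
    df1.assign(m=df1['date'].dt.month, d=df1['date'].dt.day),
    df2.assign(m=df2['date'].dt.month, d=df2['date'].dt.day),
    on=['m', 'd'],
    suffixes=('', '_y')).drop(columns=['m', 'd', 'date_y'])

Note that rows whose month/day genuinely differ (5/30 vs 5/31 above) still drop out, which is why merge_asof with direction='nearest' is the more forgiving tool here.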

Related

Pandas tells me non-ambiguous time is ambiguous

I have the following test code:
import pandas as pd
dt = pd.to_datetime('2021-11-07 01:00:00-0400').tz_convert('America/New_York')
pd.DataFrame({'datetime': dt,
              'value': [3, 4, 5]})
When using pandas version 1.1.5, this runs successfully. But under pandas version 1.2.5 or 1.3.4, it fails with the following error:
Traceback (most recent call last):
File "test.py", line 5, in <module>
'value': [3, 4, 5]})
File "venv/lib/python3.7/site-packages/pandas/core/frame.py", line 614, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 465, in dict_to_mgr
arrays, data_names, index, columns, dtype=dtype, typ=typ, consolidate=copy
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 124, in arrays_to_mgr
arrays = _homogenize(arrays, index, dtype)
File "venv/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 590, in _homogenize
val, index, dtype=dtype, copy=False, raise_cast_failure=False
File "venv/lib/python3.7/site-packages/pandas/core/construction.py", line 514, in sanitize_array
data = construct_1d_arraylike_from_scalar(data, len(index), dtype)
File "venv/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1907, in construct_1d_arraylike_from_scalar
subarr = cls._from_sequence([value] * length, dtype=dtype)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 336, in _from_sequence
return cls._from_sequence_not_strict(scalars, dtype=dtype, copy=copy)
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 362, in _from_sequence_not_strict
ambiguous=ambiguous,
File "venv/lib/python3.7/site-packages/pandas/core/arrays/datetimes.py", line 2098, in sequence_to_dt64ns
data.view("i8"), tz, ambiguous=ambiguous
File "pandas/_libs/tslibs/tzconversion.pyx", line 284, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2021-11-07 01:00:00, try using the 'ambiguous' argument
I am aware that Daylight Saving Time is happening on November 7. But this data looks explicit to me, and fully localized; why is pandas forgetting its timezone information, and why is it refusing to put it in a DataFrame? Is there some kind of workaround here?
Update:
I remembered that I'd actually filed a bug about this a few months ago, but it was only of somewhat academic interest to us until this week when we're starting to see actual DST-transition dates in production: https://github.com/pandas-dev/pandas/issues/42505
It's ambiguous because this wall-clock time occurs twice on 2021-11-07: once with DST and once without:

# Timestamp('2021-11-07 01:00:00-0500', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=False).dst()
datetime.timedelta(0)

# Timestamp('2021-11-07 01:00:00-0400', tz='America/New_York')
>>> pd.to_datetime('2021-11-07 01:00:00') \
...     .tz_localize('America/New_York', ambiguous=True).dst()
datetime.timedelta(seconds=3600)
Workaround
dt = pd.to_datetime('2021-11-07 01:00:00-0400')
df = pd.DataFrame({'datetime': dt,
                   'value': [3, 4, 5]})
df['datetime'] = df['datetime'].dt.tz_convert('America/New_York')
I accepted @Corralien's answer, and I also wanted to show the workaround I finally decided to go with:
# Work around Pandas DST bug, see https://github.com/pandas-dev/pandas/issues/42505 and
# https://stackoverflow.com/questions/69846645/pandas-tells-me-non-ambiguous-time-is-ambiguous
max_len = max(len(x) if self.is_array(x) else 1 for x in data.values())
if max_len > 0 and self.is_scalar(data['datetime']):
    data['datetime'] = [data['datetime']] * max_len
df = pd.DataFrame(data)
The is_array() and is_scalar() functions check whether x is an instance of any of set, list, tuple, np.ndarray, pd.Series, pd.Index.
It's not perfect, but hopefully the duct tape will hold until this can be fixed in Pandas.
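For reference, a minimal sketch of what those two helpers might look like, based only on the description above (the post doesn't show their implementation):

import numpy as np
import pandas as pd

ARRAY_TYPES = (set, list, tuple, np.ndarray, pd.Series, pd.Index)

def is_array(x):
    # treat the common container types as array-like
    return isinstance(x, ARRAY_TYPES)

def is_scalar(x):
    return not is_array(x)

(Written as free functions here; in the post they are methods, hence self.is_array.)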

Why does Series.min(skipna=True) throw an error caused by an NA value?

I work with timestamps that have mixed DST offsets. Tried in pandas 1.0.0:
import numpy as np
import pandas as pd

s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
Asking for min() or max() fails:
s.min(), s.max() # same result with s.min(skipna=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 11216, in stat_func
f, name, axis=axis, skipna=skipna, numeric_only=numeric_only
File "C:\Anaconda\lib\site-packages\pandas\core\series.py", line 3892, in _reduce
return op(delegate, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 125, in f
result = alt(values, axis=axis, skipna=skipna, **kwds)
File "C:\Anaconda\lib\site-packages\pandas\core\nanops.py", line 837, in reduction
result = getattr(values, meth)(axis)
File "C:\Anaconda\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
return umr_minimum(a, axis, None, out, keepdims, initial, where)
TypeError: '<=' not supported between instances of 'Timestamp' and 'float'
Workaround:
s.loc[s.notna()].min(), s.loc[s.notna()].max()
(Timestamp('2019-04-13 12:10:20+0200', tz='pytz.FixedOffset(120)'), Timestamp('2020-02-01 11:35:44+0100', tz='pytz.FixedOffset(60)'))
What I am missing here? Is it a bug?
I think the problem here is that pandas stores a Series with mixed timezones as object dtype, so max and min fail here.
s = pd.Series(
    [pd.Timestamp('2020-02-01 11:35:44+01'),
     np.nan,  # same result with pd.Timestamp('nat')
     pd.Timestamp('2019-04-13 12:10:20+02')])
print (s)
0    2020-02-01 11:35:44+01:00
1                          NaN
2    2019-04-13 12:10:20+02:00
dtype: object
So if you convert to datetimes (normalizing away the mixed timezones), it works well:
print (pd.to_datetime(s, utc=True))
0   2020-02-01 10:35:44+00:00
1                         NaT
2   2019-04-13 10:10:20+00:00
dtype: datetime64[ns, UTC]

print (pd.to_datetime(s, utc=True).max())
2020-02-01 10:35:44+00:00
Another possible solution, if you need to keep the different timezones, is:
print (s.dropna().max())
2020-02-01 11:35:44+01:00
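A small follow-up (my addition, not part of the answer): if you need a NaT-safe min/max but want the result in a specific timezone, you can compare in UTC and convert the winner afterwards; the target timezone here is just illustrative:

print (pd.to_datetime(s, utc=True).max().tz_convert('Europe/Paris'))
2020-02-01 11:35:44+01:00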

Changing an object column into datetime format

I am currently working with a Google Sheet imported into Python. When I import the sheet, the columns come in as object dtype; I converted the numeric columns to float, but when I try to change the format of the Date column, I get an error.
Following is the DataFrame I have to work on:
df.head()
Out[21]:
                  Date Avg_Energy Avg_Voltage
1  24-06-2018 12-50-02    2452.93
2  24-06-2018 12-50-03    2452.98      228.03
3  24-06-2018 12-50-04    2453.04       228.7
4  24-06-2018 12-50-05     2453.1       228.4
5  24-06-2018 12-50-06    2453.16      228.74
I applied the following code to change it into datetime format:
df['DateTime'] = pd.to_datetime(df['Date'])
It gives me the following error:
df2['DateTime'] = pd.to_datetime(df2['Date'])
Traceback (most recent call last):
File "<ipython-input-22-0636e9d0e511>", line 1, in <module>
df2['DateTime'] = pd.to_datetime(df2['Date'])
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 451, in to_datetime
values = _convert_listlike(arg._values, True, format)
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 380, in _convert_listlike
raise e
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 368, in _convert_listlike
require_iso8601=require_iso8601
File "pandas\_libs\tslib.pyx", line 492, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 739, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 733, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 99, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\dateutil\parser\_parser.py", line 1356, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Users\Hussnain\Anaconda3\lib\site-packages\dateutil\parser\_parser.py", line 648, in parse
raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '24-06-2018 12-50-100')
You have an unorthodox datetime format. Use the format argument.
pd.to_datetime(df.Date, format='%d-%m-%Y %H-%M-%S')
0 2018-06-24 12:50:02
1 2018-06-24 12:50:03
2 2018-06-24 12:50:04
3 2018-06-24 12:50:05
4 2018-06-24 12:50:06
Name: Date, dtype: datetime64[ns]
See http://strftime.org/ for more information.
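One caveat worth adding (my note, not part of the original answer): the traceback above actually fails on the malformed value '24-06-2018 12-50-100', which an explicit format string will reject as well, since 100 is not a valid second. If such rows exist, errors='coerce' turns them into NaT instead of raising:

pd.to_datetime(df.Date, format='%d-%m-%Y %H-%M-%S', errors='coerce')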
On my end I tested with just:
pd.to_datetime(df.Date)
and it worked. It also appears that you are missing the first Avg_Voltage value.
                  Date   Energy    Voltage
1  24-06-2018 12-50-02  2452.93  322323.00
2  24-06-2018 12-50-03  2452.98     228.03
3  24-06-2018 12-50-04  2453.04     228.70
4  24-06-2018 12-50-05  2453.10     228.40
5  24-06-2018 12-50-06  2453.16     228.74

1    2018-06-24 12:00:00-02:00
2    2018-06-24 12:00:00-03:00
3    2018-06-24 12:00:00-04:00
4    2018-06-24 12:00:00-05:00
5    2018-06-24 12:00:00-06:00
Name: Date, dtype: object
You may use:
pd.to_datetime(df.Date).dt.strftime('%Y-%m-%d %H:%M:%S')
to get a more readable format.

Pandas error with multidimensional key using .loc and a boolean

Been running into this same error for two weeks, even though the code worked before. Not sure if I updated pandas as part of another library install and something changed there. Currently on version 0.23.4. The expected outcome is returning just the row with that identifier value.
In [42]: df.head()
Out[43]:
index Identifier ...
0 51384710 ...
1 74838J10 ...
2 80589M10 ...
3 67104410 ...
4 50241310 ...
[5 rows x 14 columns]
In [43]: df.loc[df.Identifier.isin(['51384710'])].head()
Traceback (most recent call last):
File "C:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-44-a3dbf43451ef>", line 1, in <module>
df.loc[df.Identifier.isin(['51384710'])].head()
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1478, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1899, in _getitem_axis
raise ValueError('Cannot index with multidimensional key')
**ValueError: Cannot index with multidimensional key**
Fixed it. I'd done df.columns = [column_list] where column_list = [...], which caused df to be treated as if it had a MultiIndex even though there was only one level. Removing the brackets from the df.columns assignment fixed it.
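A minimal sketch reproducing that mistake (column names are illustrative):

import pandas as pd

df = pd.DataFrame({'Identifier': ['51384710', '74838J10']})
column_list = ['Identifier']

df.columns = [column_list]   # bug: a list *containing* a list becomes a MultiIndex
print(type(df.columns))      # <class 'pandas.core.indexes.multi.MultiIndex'>

df.columns = column_list     # fix: assign the flat list directly
print(type(df.columns))      # <class 'pandas.core.indexes.base.Index'>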
Try changing
df.loc[df.Identifier.isin(['51384710'])].head()
to
df[df.Identifier.isin(['51384710'])].head()

df.Change[-1] producing errors.

I'm trying to select the last value of the Series Change from my DataFrame df.
The DataFrame looks something like this:
            Change
0         1.000000
1         0.917727
2         1.000000
3         0.914773
4         0.933182
5         0.936136
6         0.957500
...            ...
14466949  1.998392
14466950  2.002413
14466951  1.998392
14466952  1.974266
14466953  1.966224
When I input the following code
df.Change[0]
df.Change[100]
df.Change[100000]
I get output, but when I input
df.Change[-1]
I'm getting the following error
Traceback (most recent call last):
File "<pyshell#188>", line 1, in <module>
df.Change[-1]
File "C:\Python27\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
result = self.index.get_value(self, key)
File "C:\Python27\lib\site-packages\pandas\indexes\base.py", line 2139, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas/index.pyx", line 105, in pandas.index.IndexEngine.get_value (pandas\index.c:3338)
File "pandas/index.pyx", line 113, in pandas.index.IndexEngine.get_value (pandas\index.c:3041)
File "pandas/index.pyx", line 151, in pandas.index.IndexEngine.get_loc (pandas\index.c:3898)
KeyError: -1
Pretty much any negative number I use for slicing is resulting in an error, and I'm not exactly sure why.
Thanks.
There are several ways to do this. What's happening is that pandas has no issue with df.Change[100] because 100 is a label in its index; -1 is not. Your index just happens to coincide with the ordinal positions. To index by ordinal position explicitly, use iloc.
df.Change.iloc[-1]
or
df.Change.values[-1]
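A quick illustration of the difference (my own sketch): with a default RangeIndex, label lookup and positional lookup coincide for nonnegative integers, but only positional lookup accepts -1:

import pandas as pd

s = pd.Series([1.0, 0.917727, 1.966224], name='Change')
print(s.iloc[-1])    # 1.966224 -- position-based, negative indices work
print(s.values[-1])  # 1.966224 -- plain numpy indexing on the underlying array
# s[-1]              # raises KeyError: -1 -- label-based; -1 is not in the index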