Pandas timestamp doesn't play nicely with numpy datetime?

I'm just beginning to use pandas 0.9 and am seeing unexpected behavior with pandas timestamps. After setting an index with a datetime, the timeindex doesn't seem to correctly convert to anything else. I'm probably not using it correctly, so please set me straight:
from pandas import *
import datetime
import numpy as np
version.version
# 0.9.1
np.version.version
# 1.6.2
ndx = ['a', 'b', 'b']
date = [datetime.datetime(2013, 2, 16, 15, 0),
        datetime.datetime(2013, 2, 16, 11, 0),
        datetime.datetime(2013, 2, 16, 2, 0)]
vals = [1, 2, 3]
df = DataFrame({'ndx': ndx, 'date': date, 'vals': vals})
df2 = df.groupby(['ndx', 'date']).sum()
df2.index.get_level_values('date')
# array([1970-01-16 143:00:00, 1970-01-16 130:00:00, 1970-01-16 139:00:00], dtype=datetime64[ns])
df.set_index([ndx,date]).reset_index()['level_1'].unique() # fetch from index
# array([1970-01-16 143:00:00, 1970-01-16 139:00:00, 1970-01-16 130:00:00], dtype=datetime64[ns])
df.set_index([ndx,date]).reset_index()['date'].unique() # fetch from column
# array([2013-02-16 15:00:00, 2013-02-16 11:00:00, 2013-02-16 02:00:00], dtype=object)
I wouldn't expect anything with 1970 as a result of these operations. Thoughts?

This is a numpy bug; see the following:
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#numpy-datetime64-dtype-and-1-6-dependency
https://github.com/pydata/pandas/issues/2872
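For reference, on a modern pandas/numpy stack the same operations round-trip correctly; here is a sketch of the original example rewritten with explicit imports (nothing here depends on the 0.9.1/1.6.2 versions above):

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    'ndx': ['a', 'b', 'b'],
    'date': [datetime.datetime(2013, 2, 16, 15, 0),
             datetime.datetime(2013, 2, 16, 11, 0),
             datetime.datetime(2013, 2, 16, 2, 0)],
    'vals': [1, 2, 3],
})
df2 = df.groupby(['ndx', 'date']).sum()

# The level values keep their 2013 dates instead of collapsing to 1970
print(df2.index.get_level_values('date'))
```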

Related

Numpy interpolation on pandas TimeStamp data works if it's a pandas series but not if it's a single object?

I'm trying to use np.interp to interpolate a float value based on pandas Timestamp data. However, I noticed that np.interp works if the input x is a pandas Series of Timestamps, but not if it's a single Timestamp object.
Here's the code to illustrate this:
import pandas as pd
import numpy as np
coarse = pd.DataFrame({'start': ['2016-01-01 07:00:00.00000+00:00',
                                 '2016-01-01 07:30:00.00000+00:00']})
fine = pd.DataFrame({'start': ['2016-01-01 07:00:02.156657+00:00',
                               '2016-01-01 07:00:15+00:00',
                               '2016-01-01 07:00:32+00:00',
                               '2016-01-01 07:11:17+00:00',
                               '2016-01-01 07:14:00+00:00',
                               '2016-01-01 07:15:55+00:00',
                               '2016-01-01 07:33:04+00:00'],
                     'price': [0, 1, 2, 3, 4, 5, 6]})
coarse['start'] = pd.to_datetime(coarse['start'])
fine['start'] = pd.to_datetime(fine['start'])
np.interp(x=coarse.start, xp=fine.start, fp=fine.price) # works
np.interp(x=coarse.start.iloc[-1], xp=fine.start, fp=fine.price) # doesn't work
The latter gives the error
TypeError: float() argument must be a string or a number, not 'Timestamp'
I am wondering why the latter doesn't work, while the former does?
The x input of np.interp must be array-like (iterable); you can use .iloc[[-1]] to pass a one-element Series instead of a scalar Timestamp:
np.interp(x=coarse.start.iloc[[-1]], xp=fine.start, fp=fine.price)
Output: array([5.82118562])
Look at what you get when selecting an item from the Series:
In [8]: coarse.start
Out[8]:
0 2016-01-01 07:00:00+00:00
1 2016-01-01 07:30:00+00:00
Name: start, dtype: datetime64[ns, UTC]
In [9]: coarse.start.iloc[-1]
Out[9]: Timestamp('2016-01-01 07:30:00+0000', tz='UTC')
With the list index, it's a Series:
In [10]: coarse.start.iloc[[-1]]
Out[10]:
1 2016-01-01 07:30:00+00:00
Name: start, dtype: datetime64[ns, UTC]
I was going to scold you for not showing the full error message, but I see that it's a compiled piece of code that raises the error. Keep in mind that interp is a numpy function, which works with numpy arrays, and for math like this, float dtype ones.
So it's a good guess that interp is trying to make a float array from your argument.
In [14]: np.asarray(coarse.start, dtype=float)
Out[14]: array([1.4516316e+18, 1.4516334e+18])
In [15]: np.asarray(coarse.start.iloc[[1]], dtype=float)
Out[15]: array([1.4516334e+18])
In [16]: np.asarray(coarse.start.iloc[1], dtype=float)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[16], line 1
----> 1 np.asarray(coarse.start.iloc[1], dtype=float)
TypeError: float() argument must be a string or a number, not 'Timestamp'
It can't make a float value from a Python TimeStamp object.

Converting dict to dataframe of Solution point values & plotting

I am trying to plot some results obtained after optimisation using Gurobi.
I have converted the dictionary to a pandas DataFrame; it is 96x1. But now how do I use this DataFrame to plot the values against the row positions (1st row vs. value, 2nd row vs. value, and so on)? I am attaching a snapshot of the same.
Can anyone please help me with this?
x = {}
for t in time1:
    x[t] = [price_energy[t-1] * EnergyResource[174, t].X]
df = pd.DataFrame.from_dict(x, orient='index')
df
You can try pandas.DataFrame(data=x.values()) to properly create a pandas DataFrame while using row numbers as indices.
In the example below, I have generated a (pseudo) random dictionary with 10 values, and stored it as a data frame using pandas.DataFrame giving a name to the only column as xyz. To understand how indexing works, please see Indexing and selecting data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Create a dictionary 'x'
rng = np.random.default_rng(121)
x = dict(zip(np.arange(10), rng.random((1, 10))[0]))
# Create a dataframe from 'x'
df = pd.DataFrame(x.values(), index=x.keys(), columns=["xyz"])
print(df)
print(df.index)
# Plot the dataframe
plt.plot(df.index, df.xyz)
plt.show()
This prints df as:
xyz
0 0.632816
1 0.297902
2 0.824260
3 0.580722
4 0.593562
5 0.793063
6 0.444513
7 0.386832
8 0.214222
9 0.029993
and gives df.index as:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
and also plots the figure:

pandas resample: what is the 3M equivalent of Q

I have a time series, e.g.:
import pandas as pd
df = pd.DataFrame.from_dict({'count': {
    pd.Timestamp('2016-02-29'): 1, pd.Timestamp('2016-03-31'): 2,
    pd.Timestamp('2016-04-30'): 4, pd.Timestamp('2016-05-31'): 8,
    pd.Timestamp('2016-06-30'): 16, pd.Timestamp('2016-07-31'): 32,
}})
df
And can resample it to get counts per Quarter with e.g:
df.resample('Q').agg('sum')
I am trying to do the same with '3M' but no matter what I try, I fail to get the same result, e.g.:
df.resample('3M', closed='right', origin='start', label='right').agg('sum')
gives:
How can I achieve the result of resample('Q') using resample('3M')?
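The heart of the discrepancy is where the bin edges are anchored: 'Q' snaps them to calendar quarter ends, while '3M' counts month ends from the first timestamp in the index. A quick comparison of the bin labels (just a demonstration of the difference; no exact '3M' equivalent of 'Q' is claimed here):

```python
import pandas as pd

df = pd.DataFrame.from_dict({'count': {
    pd.Timestamp('2016-02-29'): 1, pd.Timestamp('2016-03-31'): 2,
    pd.Timestamp('2016-04-30'): 4, pd.Timestamp('2016-05-31'): 8,
    pd.Timestamp('2016-06-30'): 16, pd.Timestamp('2016-07-31'): 32,
}})

q = df.resample('Q').sum()    # bins end at calendar quarter ends
m3 = df.resample('3M').sum()  # bins anchored at the first month end in the data
print(q.index.tolist())
print(m3.index.tolist())
```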

interp1d for pandas time series

I want to interpolate between times in a pandas time series. I would like to use scipy.interpolate.interp1d (or similar). The pandas interpolate function is undesirable, as it would require inserting nan values and then replacing them using interpolation (and thus modifying the dataset). What I tried to do:
from datetime import datetime
import scipy.interpolate as si
import pandas as pd
d1 = datetime(2019, 1, 1)
d2 = datetime(2019, 1, 5)
d3 = datetime(2019, 1, 10)
df = pd.DataFrame([1, 4, 2], index=[d1, d2, d3], columns=['conc'])
f = si.interp1d(df.index, df.conc)
f(datetime(2019, 1, 3))
All is good until the last line, where I get a ValueError: object arrays are not supported. Strangely enough, f.x nicely shows the dates as dtype='datetime64[ns]', so I was hoping it would work. Anybody know how to get this to work?
This works, but is arguably a bit ugly:
f = si.interp1d(pd.to_numeric(df.index), df.conc)
f(pd.to_numeric(pd.to_datetime(['2019-1-3'])))
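Wrapped in a small helper this reads a bit cleaner; df.index.astype('int64') yields the same nanoseconds-since-epoch integers as the pd.to_numeric call, and Timestamp.value gives the matching scalar. A sketch (interp_at is a hypothetical helper name, not a library function):

```python
from datetime import datetime

import pandas as pd
import scipy.interpolate as si

df = pd.DataFrame([1, 4, 2],
                  index=[datetime(2019, 1, 1),
                         datetime(2019, 1, 5),
                         datetime(2019, 1, 10)],
                  columns=['conc'])

# Build the interpolator on nanoseconds-since-epoch integers
f = si.interp1d(df.index.astype('int64'), df['conc'])

def interp_at(ts):
    """Interpolate 'conc' at any timestamp-like value."""
    return float(f(pd.Timestamp(ts).value))

print(interp_at('2019-01-03'))  # 2.5: halfway between 1 (Jan 1) and 4 (Jan 5)
```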

Inconsistent internal representation of dates in matplotlib/pandas

import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-02'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (736066.7, 736469.3)
Now, if we change the last date.
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot()
print(ax.get_xlim())
# Out: (184.8, 189.2)
The first example seems consistent with the matplotlib docs:
Matplotlib represents dates using floating point numbers specifying the number of days since 0001-01-01 UTC, plus 1
Why does the second example return something seemingly completely different? I'm using pandas version 0.22.0 and matplotlib version 2.2.2.
In the second example, if you look at the plots, rather than giving dates matplotlib is giving quarter values:
The dates in this case are exactly six months and therefore two quarters apart, which is presumably why you're seeing this behavior. While I can't find it in the docs, the numbers given by xlim in this case are consistent with being the number of quarters since the Unix Epoch (Jan. 1, 1970).
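That reading can be sanity-checked with pandas Period ordinals, which count periods elapsed since 1970 (a cross-check, not taken from the plotting internals):

```python
import pandas as pd

# Quarters since the epoch: 1970Q1 has ordinal 0
lo = pd.Period('2016-05-01', freq='Q').ordinal  # 2016Q2
hi = pd.Period('2017-05-01', freq='Q').ordinal  # 2017Q2
print(lo, hi)  # consistent with the reported xlim of (184.8, 189.2)
```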
Pandas uses different units to represents dates and times on the axes, depending on the range of dates/times in use. This means that different locators are in use.
In the first case,
print(ax.xaxis.get_major_locator())
# Out: pandas.plotting._converter.PandasAutoDateLocator
in the second case
print(ax.xaxis.get_major_locator())
# pandas.plotting._converter.TimeSeries_DateLocator
You may force pandas to always use the PandasAutoDateLocator using the x_compat argument,
df.plot(x_compat=True)
This ensures you always get the same datetime representation, consistent with the matplotlib.dates convention.
The drawback is that this removes the nice quarterly ticking and replaces it with the standard ticking.
On the other hand, it then allows you to use the very customizable matplotlib.dates tickers and formatters. For example, to get quarterly ticks/labels:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import pandas as pd
index = pd.to_datetime(['2016-05-01', '2016-11-01', '2017-05-01'])
data = pd.DataFrame({'a': [1, 2, 3],
                     'b': [4, 5, 6]}, index=index)
ax = data.plot(x_compat=True)
# Quarterly ticks
ax.xaxis.set_major_locator(mdates.MonthLocator((1,4,7,10)))
# Formatting:
def func(x, pos):
    q = (mdates.num2date(x).month - 1) // 3 + 1
    tx = "Q{}".format(q)
    if q == 1:
        tx += "\n{}".format(mdates.num2date(x).year)
    return tx
ax.xaxis.set_major_formatter(mticker.FuncFormatter(func))
plt.setp(ax.get_xticklabels(), rotation=0, ha="center")
plt.show()