Querying a HDF-store - pandas

I created an HDF5 file with:
hdf=pandas.HDFStore(pfad)
hdf.append('df', df, data_columns=True)
I have a list called expirations that contains numpy.datetime64 values, and I am trying to read into a dataframe the portion of the HDF5 table whose values in column "expiration" lie between expirations[0] and expirations[1]. Entries in the expiration column have the format Timestamp('2002-05-18 00:00:00').
I use the following command:
df=hdf.select('df', where=('expiration<expirations[1] & expiration>=expirations[0]'))
However, I get ValueError: Unable to parse x
How should this be correctly done?
df.dtypes
Out[37]:
adjusted stock close price float64
expiration datetime64[ns]
strike int64
call put object
ask float64
bid float64
volume int64
open interest int64
unadjusted stock price float64
df.info
Out[36]:
<bound method DataFrame.info of adjusted stock close price expiration strike call put ask date
2002-05-16 5047.00 2002-05-18 4300 C 802.000
There are more columns, but they aren't relevant to the query.

Problem solved!
I obtained expirations by
df_expirations=df.drop_duplicates(subset='expiration')
expirations=df['expiration'].values
This apparently changed the values from pandas Timestamps into raw numpy.datetime64 values.
I fixed this by using
expirations=df['expiration']
Now this query is working:
del df
df=hdf.select('df', where=('expiration=expirations[1]'))
Thanks for pointing me to the datetime format problem.
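The type difference described above can be seen without touching the store. This is only a sketch on made-up data: the small frame and the names unique_exp, raw, start and end are illustrative, not taken from the post. The commented-out select at the end assumes an open HDFStore called hdf with the table 'df', as in the question.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real table; only the "expiration" column matters here.
df = pd.DataFrame({"expiration": pd.to_datetime(["2002-05-18", "2002-05-18", "2002-06-22"])})

unique_exp = df["expiration"].drop_duplicates()

raw = unique_exp.values            # NumPy array of numpy.datetime64 values
expirations = unique_exp.tolist()  # list of pandas.Timestamp objects

print(type(raw[0]), type(expirations[0]))

# Plain Timestamp variables can then be referenced by name in the where string
# (assumes an open store `hdf` as in the question):
start, end = expirations[0], expirations[1]
# subset = hdf.select('df', where='expiration >= start & expiration < end')
```

Binding the bounds to simple local names also keeps the where expression easy to read.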

Related

Pandas resample based on datetime column where there're duplicate datetimes then plot

Suppose I have two columns time (e.g. 2019-02-13T22:31:47.000000000) and amount (e.g. 15). The time column might have duplicates.
What's the best way to resample amount into daily/monthly/yearly then plot?
I tried:
df.resample('M', on='time').sum().plot(x='time', y='amount')
but it says:
raise KeyError(key) from err
if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 'time'
Already verified that time is datetime (without null values):
df['time'].isnull().any()
False
Amount is float as well.
The documentation says
The object must have a datetime-like index
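So set time as the index first. After resampling, time is the index rather than a column, so drop x='time' from the plot call:
df.set_index('time').resample('M').sum().plot(y='amount')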
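A minimal sketch of the resampling step on made-up data, including a duplicated timestamp. The frame and the name monthly are illustrative. It uses the 'MS' (month start) alias instead of 'M' purely as an assumption to sidestep alias changes between pandas versions. The plot call is left as a comment so the snippet runs without matplotlib.

```python
import pandas as pd

# Made-up data with a duplicated timestamp, as described in the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-02-13 22:31:47",
        "2019-02-13 22:31:47",
        "2019-03-01 08:00:00",
    ]),
    "amount": [15.0, 5.0, 10.0],
})

# After resampling, "time" is the index of the result, not a column,
# which is why asking plot() for x='time' raises KeyError.
monthly = df.resample("MS", on="time")["amount"].sum()
print(monthly)

# Plotting uses the index as the x axis by default:
# monthly.plot()
```

Duplicate timestamps need no special handling here, since they simply fall into the same bin and are summed.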

Pandas reindex Dates To Subset of Dates from List

I've read the online documentation and examples, but I still don't understand. I have a pandas df with an index of dates in datetime format (yyyy-mm-dd), and I'm trying to resample or reindex this dataframe based on a subset of dates in the same format (yyyy-mm-dd) that are in a list. I have converted the df.index values to datetime using:
dfmla.index = pd.to_datetime(dfmla.index)
I've tried various things and I keep getting NaN's after applying the reindex. I know this must be a datatypes problem and my df is in the form of:
df.dtypes
Out[30]:
month int64
mean_mon_flow float64
std_mon_flow float64
monthly_flow_ln float64
std_anomaly float64
dtype: object
My data looks like this:
df.head(5)
Out[31]:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1949-10-01 10 8.565828 0.216126 8.848631 1.308506
1949-11-01 11 8.598055 0.260254 8.368006 -0.883938
1949-12-01 12 8.612080 0.301156 8.384662 -0.755149
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
My month_list (list datatype) looks like this:
month_list[0:2]
Out[37]: ['1950-08-01', '1950-09-01']
I need my condensed, new reindexed df to look like this:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
Thank you for your suggestions.
If you're certain that all entries of month_list are in the index, you can do df.loc[month_list]; otherwise you can use reindex:
df.reindex(pd.to_datetime(month_list))
Output:
month mean_mon_flow std_mon_flow monthly_flow_ln std_anomaly
date
1950-08-01 8 8.614236 0.310865 8.173776 -1.416887
1950-09-01 9 8.663943 0.351730 8.437089 -0.644967
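A small sketch of how reindex behaves when the list contains a date that is not in the index. The data is cut down to one column, and the extra date '1951-01-01' is added purely for illustration. The isin filter is shown as one possible alternative, not as part of the answer above.

```python
import pandas as pd

# Cut-down version of the frame from the question (one column only).
df = pd.DataFrame(
    {"month": [10, 8, 9]},
    index=pd.to_datetime(["1949-10-01", "1950-08-01", "1950-09-01"]),
)
df.index.name = "date"

# '1951-01-01' is deliberately absent from the index.
month_list = ["1950-08-01", "1950-09-01", "1951-01-01"]
wanted = pd.to_datetime(month_list)

# reindex keeps every requested date and fills missing rows with NaN.
reindexed = df.reindex(wanted)

# Filtering with isin keeps only the dates that actually exist.
present = df.loc[df.index.isin(wanted)]

print(reindexed)
print(present)
```

Converting the strings with pd.to_datetime first matters: reindexing a DatetimeIndex with plain strings finds no matching labels, which is one way to end up with all-NaN rows.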

Changing Excel Dates (As integers) mixed with timestamps in single column - Have tried str.extract

I have a dataframe with date columns. Unfortunately, my import (using read_excel) brought in some dates as datetimes and others as Excel serial dates stored as integers.
What I am seeking is a column with dates only, in the format %Y-%m-%d.
From my research, Excel starts counting at 1900-01-00, so I could add these integers to that origin. I tried to use str.extract with a regex to separate the column into two, one of datetimes and the other of integers. However, the result is NaN.
Here is an input code example
df = pd.DataFrame({'date_from': [pd.Timestamp('2022-09-10 00:00:00'),44476, pd.Timestamp('2021-02-16 00:00:00')], 'date_to': [pd.Timestamp('2022-12-11 00:00:00'),44455, pd.Timestamp('2021-12-16 00:00:00')]})
Attempt to first separate the columns by extracting the integers (dates imported from MS Excel):
df.date_from.str.extract(r'(\d\d\d\d\d)')
However, this gives NaN.
The reason I tried to separate the integers out of the column is that I get an error when trying to act on the Excel dates within the mixed column (in other words, an error using the following code):
def convert_excel_time(excel_time):
    return pd.to_datetime('1900-01-01') + pd.to_timedelta(excel_time,'D')
Any guidance on how I might get a column of dates only? I find the datetime modules and aspects of pandas and python the most frustrating of all to get to grips with!
thanks
Convert the values to timedeltas with to_timedelta and errors='coerce', which yields NaT for anything that is not an integer, and add the Timestamp d as the origin. Then convert the datetimes with to_datetime and errors='coerce', and use Series.fillna to combine the two results in a custom function:
def f(x):
    #https://stackoverflow.com/a/9574948/2901002
    d = pd.Timestamp(1899, 12, 30)
    timedeltas = pd.to_timedelta(x, unit='d', errors='coerce')
    dates = pd.to_datetime(x, errors='coerce')
    return (timedeltas + d).fillna(dates)
cols = ['date_from','date_to']
df[cols] = df[cols].apply(f)
print (df)
date_from date_to
0 2022-09-10 2022-12-11
1 2021-10-07 2021-09-16
2 2021-02-16 2021-12-16
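As a quick check on the 1899-12-30 origin used in f, a single serial number can be converted on its own. This is a sketch using the unit and origin parameters of pd.to_datetime. The variable names are illustrative.

```python
import pandas as pd

# One Excel serial date from the example frame.
serial = 44476

# Count days from the 1899-12-30 origin.
converted = pd.to_datetime(serial, unit="D", origin="1899-12-30")

# The same arithmetic written out by hand.
by_hand = pd.Timestamp(1899, 12, 30) + pd.to_timedelta(serial, unit="D")

print(converted.strftime("%Y-%m-%d"), by_hand.strftime("%Y-%m-%d"))
```

Both forms give the date shown for row 1 of date_from in the output above.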

Resolving error when merging dataframes on two columns

I am trying to merge two dataframes (D1 & R1) on two columns (Date & Symbol) but I'm receiving this error "You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat".
I've been using pd.merge and I've tried different dtypes. I don't want to concatenate these because I just want to add D1 to the right side of R1.
D2 = pd.merge(D1, R2, on=['Date','Symbol'])
D1.dtypes
Date object
Symbol object
High float64
Low float64
Open float64
Close float64
Volume float64
Adj Close float64
pct_change_1D float64
Symbol_above object
NE bool
R1.dtypes
gvkey int64
datadate int64
fyearq int64
fqtr int64
indfmt object
consol object
popsrc object
datafmt object
tic object
curcdq object
datacqtr object
datafqtr object
rdq int64
costat object
ipodate float64
Report_Today int64
Symbol object
Date int64
Ideally, the columns not in the index of R1 (gvkey - Report_Today) will be on the right side of the columns in D1.
Any help is appreciated. Thanks.
From your description of the DataFrames we can see that:
in D1, column Date has type object;
in R1, column Date has type int64.
Make the types of these columns the same and the merge will work.
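One way this could look, as a sketch only. The post does not show the actual Date values, so the string format in D1 and the YYYYMMDD integer format in R1 are assumptions, and the tiny frames below are made up.

```python
import pandas as pd

# Made-up stand-ins: D1 stores dates as strings (object), R1 as integers.
D1 = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03"],
    "Symbol": ["AAA", "BBB"],
    "Close": [1.0, 2.0],
})
R1 = pd.DataFrame({
    "Date": [20200102, 20200103],  # assumed YYYYMMDD encoding
    "Symbol": ["AAA", "BBB"],
    "gvkey": [1001, 1002],
})

# Bring both key columns to the same datetime type before merging.
D1["Date"] = pd.to_datetime(D1["Date"])
R1["Date"] = pd.to_datetime(R1["Date"].astype(str), format="%Y%m%d")

# A left merge keeps every row of D1 and adds the R1 columns on the right.
merged = pd.merge(D1, R1, on=["Date", "Symbol"], how="left")
print(merged)
```

If the integers in R1 encode dates differently, only the format string in the second conversion needs to change.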

HDFStore error 'correct atom type -> [dtype->uint64'

I'm using read_hdf for the first time and love it. I want to use it to combine a bunch of smaller *.h5 files into one big file. I plan on calling append() on an HDFStore; later I will add chunking to conserve memory.
Example table looks like this
Int64Index: 220189 entries, 0 to 220188
Data columns (total 16 columns):
ID 220189 non-null values
duration 220189 non-null values
epochNanos 220189 non-null values
Tag 220189 non-null values
dtypes: object(1), uint64(3)
code:
import pandas as pd
print(pd.__version__)  # I am running 0.11.0
dest_h5f = pd.HDFStore('c:\\t3_combo.h5', complevel=9)
df = pd.read_hdf('\\t3\\t3_20130319.h5', 't3', mode='r')
print(df)
dest_h5f.append(tbl, df, data_columns=True)  # tbl holds the destination table name (not shown in the post)
dest_h5f.close()
Problem: the append traps this exception
Exception: cannot find the correct atom type -> [dtype->uint64,items->Index([InstrumentID], dtype=object)] 'module' object has no attribute 'Uint64Col'
this feels like a problem with some version of pytables or numpy
pytables = v 2.4.0 numpy = v 1.6.2
We normally represent epoch seconds as int64 and use datetime64[ns]. Try using datetime64[ns]; it will make your life easier. In any event, nanoseconds since 1970 are well within the range of int64 anyhow (and uint64 only buys you 2x this range), so there is no real advantage to using unsigned ints.
We use int64 because the min value (-9223372036854775808) is used to represent NaT, an integer marker for Not a Time.
In [11]: (Series([Timestamp('20130501')])-
Series([Timestamp('19700101')]))[0].astype('int64')
Out[11]: 1367366400000000000
In [12]: np.iinfo('int64').max
Out[12]: 9223372036854775807
You can then represent time from about the year 1677 until 2262 at the nanosecond level.
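A short sketch of the suggested conversion, assuming the epochNanos column holds nanoseconds since 1970 as uint64. The sample value is the one computed above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# One sample value stored as uint64, as in the question's table.
epoch_nanos = pd.Series(np.array([1367366400000000000], dtype="uint64"))

# Cast to int64 first (safe while values stay below 2**63),
# then interpret the integers as nanoseconds since the epoch.
as_int64 = epoch_nanos.astype("int64")
as_datetime = pd.to_datetime(as_int64, unit="ns")

print(as_datetime.iloc[0])
print(pd.Timestamp.min, pd.Timestamp.max)
```

The resulting datetime64[ns] column avoids the uint64 atom type that the append call could not find.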