HDFStore error "cannot find the correct atom type -> [dtype->uint64]" - pandas

I am using read_hdf for the first time and I love it. I want to use it to combine a bunch of smaller *.h5 files into one big file, by calling append() on an HDFStore; later I will add chunking to conserve memory.
An example table looks like this:
Int64Index: 220189 entries, 0 to 220188
Data columns (total 16 columns):
ID 220189 non-null values
duration 220189 non-null values
epochNanos 220189 non-null values
Tag 220189 non-null values
dtypes: object(1), uint64(3)
Code:
import pandas as pd
print pd.__version__  # I am running 0.11.0
dest_h5f = pd.HDFStore('c:\\t3_combo.h5', complevel=9)
df = pd.read_hdf('\\t3\\t3_20130319.h5', 't3', mode='r')
print df
dest_h5f.append(tbl, df, data_columns=True)  # tbl is the destination table key, defined elsewhere
dest_h5f.close()
Problem: the append raises this exception:
Exception: cannot find the correct atom type -> [dtype->uint64,items->Index([InstrumentID], dtype=object)] 'module' object has no attribute 'Uint64Col'
This feels like a problem with some version of PyTables or NumPy.
pytables = v2.4.0, numpy = v1.6.2

We normally represent epoch times as int64 and use datetime64[ns]. Try using datetime64[ns]; it will make your life easier. In any event, nanoseconds since 1970 are well within the range of int64 anyhow (and uint64 only buys you 2x that range), so there is no real advantage to using unsigned ints.
We use int64 because the minimum value (-9223372036854775808) is used to represent NaT, an integer marker for Not a Time.
In [11]: (Series([Timestamp('20130501')]) - Series([Timestamp('19700101')]))[0].astype('int64')
Out[11]: 1367366400000000000
In [12]: np.iinfo('int64').max
Out[12]: 9223372036854775807
You can then represent times from about the year 1677 through 2262 at nanosecond resolution.
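If you do want to keep those columns for now, one workaround (a minimal sketch; it assumes epochNanos holds nanoseconds since the epoch and that 't3' is the destination key, neither of which is shown in your code) is to cast the uint64 columns before appending: the timestamp column to datetime64[ns] and anything left over to int64.
import pandas as pd

df = pd.read_hdf('\\t3\\t3_20130319.h5', 't3', mode='r')

# reinterpret the uint64 epoch-nanosecond column as datetime64[ns]
df['epochNanos'] = df['epochNanos'].astype('int64').astype('datetime64[ns]')

# downcast any remaining uint64 columns to int64 so HDFStore has an atom type for them
for col in df.columns:
    if df[col].dtype == 'uint64':
        df[col] = df[col].astype('int64')

dest_h5f = pd.HDFStore('c:\\t3_combo.h5', complevel=9)
dest_h5f.append('t3', df, data_columns=True)  # 't3' assumed as the destination key
dest_h5f.close()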

Related

Pandas Interpolation: {ValueError}Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

I am trying to interpolate time series data, df, which looks like:
id data lat notes analysis_date
0 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
1 17358709 NaN 26.125979 None 2019-09-20 12:00:00+00:00
2 17352742 -2.331365 26.125979 None 2019-09-20 12:00:00+00:00
3 17358709 -4.424366 26.125979 None 2019-09-20 12:00:00+00:00
I try df.groupby(['lat', 'lon']).apply(lambda group: group.interpolate(method='linear')), and it throws ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear.
I suspect the issue is with the fact that I have None values, and I do not want to interpolate those. What is the solution?
df.dtypes gives me:
id int64
data float64
lat float64
notes object
analysis_date datetime64[ns, psycopg2.tz.FixedOffsetTimezone...
dtype: object
DataFrame.interpolate has issues with timezone-aware datetime64[ns] columns, which leads to that rather cryptic error message. For example:
import pandas as pd
df = pd.DataFrame({'time': pd.to_datetime(['2010', '2011', 'foo', '2012', '2013'],
                                          errors='coerce')})
df['time'] = df.time.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
df.interpolate()
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
In this case interpolating that column is unnecessary, so only interpolate the column you need. We still want DataFrame.interpolate, so select with [[ ]] (Series.interpolate leads to some odd reshaping):
df['data'] = df.groupby(['lat', 'lon']).apply(lambda x: x[['data']].interpolate())
This error happens because one of the columns you are interpolating has the object data type. Interpolation works only for numerical data types such as integer or float.
If you need to interpolate an object or categorical column, first convert it to a numerical data type by encoding it. The following piece of code will resolve your problem:
from sklearn.preprocessing import LabelEncoder
notes_encoder = LabelEncoder()
df['notes'] = notes_encoder.fit_transform(df['notes'])
After doing this, check the column's data type. It must be int. If it is categorical, then change its type to int using the following code:
df['notes']=df['notes'].astype('int32')

Why isn't a list of pd.Interval recognized by a DataFrame automatically?

intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
pd.DataFrame({'d':intervals}).dtypes
Produces dtype as Object not Interval:
>>> d object
>>> dtype: object
But at the same time a list of, for example, Timestamps is recognized on the fly:
datetimes = [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]
pd.DataFrame({'d':datetimes}).dtypes
>>> d datetime64[ns]
>>> dtype: object
Is the situation with intervals somewhat like that with a list of strings, where the default dtype of the column will also be object because the DataFrame doesn't 'know' whether we want to treat the column as objects (for dumping to disk, ...), as strings (for concatenation, ...), or even as elements of a category type? If so, what might the different use cases for intervals be? If not, what is going on here?
This is a bug in pandas: https://github.com/pandas-dev/pandas/issues/23563
For now, the cleanest workaround is to wrap the list with pd.array:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.24.2'
In [2]: intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
In [3]: pd.DataFrame({'d': pd.array(intervals)}).dtypes
Out[3]:
d interval[float64]
dtype: object
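As a small follow-up (a sketch, assuming pandas 0.24+ where Series.array and IntervalArray are available), having the proper interval dtype also gives you interval-specific accessors such as the left and right endpoints:
import pandas as pd

intervals = [pd.Interval(0, 0.1), pd.Interval(1, 5)]
df = pd.DataFrame({'d': pd.array(intervals)})

# endpoint accessors live on the IntervalArray backing the column
print(df['d'].array.left)   # left endpoints: 0.0 and 1.0
print(df['d'].array.right)  # right endpoints: 0.1 and 5.0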

Resampling/interpolating/extrapolating columns of a pandas dataframe

I am interested in knowing how to interpolate/resample/extrapolate columns of a pandas dataframe for pure numerical and datetime type indices. I'd like to perform this with either straight-forward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
0 1
0 0.937961 0.943746
2 1.687854 0.866076
4 0.410656 -0.025926
6 -2.042386 0.956386
8 1.153727 -0.505902
10 -1.546215 0.081702
12 0.922419 0.614947
14 0.865873 -0.014047
16 0.225841 -0.831088
18 -0.048279 0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices which possibly extend beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method for a pandas dataframe seems to interpolate only NaNs and thus requires those new indices (which may or may not coincide with the original indices) to already be part of the dataframe. I guess I could append a dataframe of NaNs at the new indices to the original dataframe (excluding any indices appearing in both the old and new dataframes), call interpolate, and then remove the original time indices. Or I could do everything in scipy and create a new dataframe at the desired time indices.
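For concreteness, here is a rough sketch of that reindex-and-interpolate workaround: union the old and new indices, interpolate against the index values, then keep only the new grid. Note that interpolate does not truly extrapolate, so points beyond the original range would just hold the last value.
t = np.arange(0, 40, .6)

combined = df.index.union(t)                # old and new indices together
df_dense = (df.reindex(combined)            # NaN at every new grid point
              .interpolate(method='index')  # linear interpolation against the index values
              .loc[t])                      # drop the original indices, keep only the new grid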
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
print(df.groupby(['B']).agg(np.mean))
except Exception as err:
print(err)
# No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
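Once the offending values are found, one way to fix the column (a sketch, assuming the empty strings should become NaN and that you are on pandas 0.17+ where pd.to_numeric exists; older versions had convert_objects(convert_numeric=True) for the same purpose) is to coerce it to numeric:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})

# coerce non-numeric entries to NaN so the column becomes float64
df['A'] = pd.to_numeric(df['A'], errors='coerce')

print(df['A'].dtype)                   # float64
print(df.groupby(['B']).agg(np.mean))  # mean now aggregates without error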

Pandas df.head() Error

I'm having a basic problem with df.head(). When the function is executed, it usually displays a nice HTML-formatted table of the top 5 rows, but now it only appears to slice the dataframe and outputs something like the below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
survived 5 non-null values
pclass 5 non-null values
name 5 non-null values
sex 5 non-null values
age 5 non-null values
sibsp 5 non-null values
parch 5 non-null values
fare 5 non-null values
embarked 5 non-null values
dtypes: float64(2), int64(4), object(3)
After looking at this thread I tried
pd.util.terminal.get_terminal_size()
and received the expected output (80, 25). Manually setting the print options with
pd.set_printoptions(max_columns=10)
yields the same sliced dataframe results as above.
This was confirmed after diving into the documentation here and using the
get_option("display.max_rows")
get_option("display.max_columns")
and getting the correct default 60 rows and 10 columns.
I've never had a problem with df.head() before, but now it's an issue in all of my IPython Notebooks.
I'm running pandas 0.11.0 and IPython 0.13.2 in google chrome.
In pandas 0.11.0, I think the minimum of display.height and display.max_rows (and of display.width and display.max_columns) is used, so you need to manually change those too.
I don't like this; I posted a GitHub issue about it previously.
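For example, a sketch of bumping both sets of options (using the option names as they existed around pandas 0.11; display.height was later deprecated in favor of display.max_rows):
import pandas as pd

# raise the limits so head() renders the full HTML table again
pd.set_option('display.height', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 50)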
Try using the following to display the top 10 rows:
from IPython.display import HTML
HTML(users.head(10).to_html())  # 'users' here stands for your DataFrame
I think the pandas 0.11.0 head function is totally unintuitive; it should simply have remained as head() and you get your HTML.