Check whether a PyTables node in a pandas HDFStore is tabular - pandas

Is there a preferred way to check whether a PyTables node in a pandas HDFStore is tabular? This works, but NoSuchNodeError doesn't seem like part of the API, so maybe I should not rely on it.
In [34]: from tables.table import NoSuchNodeError
In [35]: def is_tabular(store, key):
   ....:     try:
   ....:         store.get_node(key).table
   ....:     except NoSuchNodeError:
   ....:         return False
   ....:     return True
   ....:
In [36]: is_tabular(store, 'first_600')
Out[36]: False
In [37]: is_tabular(store, 'features')
Out[37]: True

You could do something like this. The pandas_type and table_type metadata are stored in the PyTables attribute set _v_attrs at the top level of the node.
In [28]: store = pd.HDFStore('test.h5',mode='w')
In [29]: store.append('df',DataFrame(np.random.randn(10,2),columns=list('AB')))
In [30]: store
Out[30]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])
In [31]: store._handle.root.df._v_attrs
Out[31]:
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := u'',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0']]
In [33]: getattr(getattr(getattr(store._handle.root,'df',None),'_v_attrs',None),'pandas_type',None)
Out[33]: 'frame_table'
In [34]: store.close()
In [35]:
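Building on the attribute dump above, here is a minimal sketch of an is_tabular helper that avoids importing NoSuchNodeError. It relies on HDFStore.get_node returning None for a missing key, and it assumes that only table-format nodes carry a table_type attribute (as in the dump above); that second point is an assumption drawn from the examples here, not a documented guarantee.

```python
def is_tabular(store, key):
    """Return True if the node under `key` carries table-format metadata."""
    # HDFStore.get_node returns None when the key does not exist,
    # so no PyTables exception has to be caught here.
    node = store.get_node(key)
    attrs = getattr(node, "_v_attrs", None)
    # Assumption: only table-format nodes have a `table_type` attribute;
    # fixed-format nodes only have `pandas_type` (e.g. 'frame').
    return getattr(attrs, "table_type", None) is not None
```

With the store from the question, is_tabular(store, 'features') would be expected to return True and is_tabular(store, 'first_600') False.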

Related

Boolean Indexing in Numpy involving two arrays

I was reading a book on Data Analysis with Python where there's a topic on Boolean Indexing.
This is the Code given in the Book:
>>> import numpy as np
>>> names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
>>> data = np.random.randn(7,4)
>>> names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
>>> data
array([[ 0.35214065, -0.6258314 , -1.18156785, -0.75981437],
[-0.54500574, -0.21700484, 0.34375588, -0.99216205],
[ 0.29883509, -3.08641931, 0.61289669, 0.58233649],
[ 0.32047465, 0.05380018, -2.29797299, 0.04553794],
[ 0.35764077, -0.51405297, -0.21406197, -0.88982479],
[-0.59219242, -1.87402141, -2.66339726, 1.30208623],
[ 0.32612407, 0.19612659, -0.63334406, 1.0275622 ]])
>>> names == 'Bob'
array([ True, False, False, True, False, False, False])
Up to this point it's perfectly clear, but I don't understand what happens when they do data[names == 'Bob']
>>> data[names == 'Bob']
array([[ 0.35214065, -0.6258314 , -1.18156785, -0.75981437],
[ 0.32047465, 0.05380018, -2.29797299, 0.04553794]])
>>> data[names == 'Bob', 2:]
array([[-1.18156785, -0.75981437],
[-2.29797299, 0.04553794]])
How is this happening?
data[names == 'Bob']
is the same as:
data[[True, False, False, True, False, False, False]]
And this just means: take row 0 and row 3 from data (the positions where the mask is True).
data[names == 'Bob',2:]
gives the same rows, but now restricts the columns to those from column 2 onward. The part before the comma selects rows; the part after the comma selects columns.
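To make the equivalence concrete, here is a small sketch that converts the boolean mask into integer row positions and checks that both forms select the same rows. The data array is a deterministic stand-in (np.arange) rather than the random values from the book, so the numbers differ from the output above.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.arange(28).reshape(7, 4)   # stand-in for np.random.randn(7, 4)

mask = names == 'Bob'                # one boolean per row of data
rows = np.flatnonzero(mask)          # integer positions where mask is True

by_mask = data[mask]                 # boolean indexing
by_position = data[rows]             # same rows, selected by position
bob_tail = data[mask, 2:]            # same rows, columns 2 and onward

print(rows)                                     # [0 3]
print(np.array_equal(by_mask, by_position))     # True
print(bob_tail)
```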

Calculate min_itemsize of Pandas HDFStore column string before appending

I have CSV files containing Asian characters (let's say UTF-8 encoded).
To convert the CSV files to a pandas HDFStore,
I need to handle unicode and set min_itemsize before appending to the HDFStore.
How can I find the maximum size of a DataFrame column containing UTF-8 (Asian character) strings?
EDIT: asian text :
SMALL_AREA_NAME,PREF_NAME,COUPON_ID_hash
埼玉,埼玉県,6b263844241eea98c5a97f1335ea82af
新宿・高田馬場・中野・吉祥寺,東京都,e0a410ff611abefbfb57ca262dcdf42e
銀座・新橋・東京・上野,東京都,b286f6fb50a4f849e4382c9752405d7a
EDIT 2:
It seems HDFStore has issues with unicode: the append below raises an error
(Python 2.7; I cannot use Python 3 due to conflicts with other packages).
for col in col_list:
    df_i[col] = df_i[col].map(lambda x: x.encode('utf-8'))
    max_size = df_i[col].str.len().max()
store.append(tablename, df_i, format='table', encoding="utf-8", min_itemsize=max_size)
returns this error:
Traceback (most recent call last):
File "D:\_devs\Python01\Anaconda27\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-69-e96ff71ee569>", line 26, in <module>
store.append(tablename, df_i, format='table', encoding="utf-8", min_itemsize=max_size)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 919, in append
**kwargs)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 1264, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 3787, in write
**kwargs)
File "D:\_devs\Python01\Anaconda27\lib\site-packages\pandas\io\pytables.py", line 3460, in create_axes
raise e
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
UPDATE: test under Python 2.7
Python 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: df = pd.read_clipboard()
In [2]: df
Out[2]:
a b
0 1 hi
1 2 привіт
2 3 Grüßi
In [3]: store = pd.HDFStore('d:/temp/test_py27.h5')
In [4]: store.append('test', df)
In [5]: store.get_storer('test').table
Out[5]:
/test/table (Table(3,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
"values_block_1": StringCol(itemsize=12, shape=(1,), dflt='', pos=2)}
byteorder := 'little'
chunkshape := (2340,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
Old answer:
you can use Series.str.len().max():
Demo:
In [91]: df
Out[91]:
A
0 aaa.bbbbbbb
1 ccc,xxxxxxxxxxxxxx
2 xxxxx.zzz
In [92]: df.A.str.len()
Out[92]:
0 11
1 18
2 9
Name: A, dtype: int64
In [93]: df.A.str.len().max()
Out[93]: 18
For the record, this version works in Python 2.7 (dropping the encoding argument from the append call):
for col in col_list:
    df_i[col] = df_i[col].map(lambda x: str(x.encode('utf-8')))
    max_size = df_i[col].str.len().max()
store.append(tablename, df_i, format='table', min_itemsize=max_size)
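One point worth spelling out: string columns in the table are fixed-size byte strings, so the relevant size is the encoded byte length, not the character count. The sketch below assumes Python 3 str values (the thread itself uses Python 2.7); utf8_max_len is a hypothetical helper name, not a pandas function. The sample strings are taken from the question and from the Python 2.7 test above, where the 6-character 'привіт' matches StringCol(itemsize=12).

```python
def utf8_max_len(values):
    """Largest UTF-8 encoded length, in bytes, among the given strings."""
    # Hypothetical helper: works on any iterable of str, including a pandas
    # Series of strings without missing values.
    return max((len(v.encode("utf-8")) for v in values), default=0)

area_names = ["埼玉", "新宿・高田馬場・中野・吉祥寺", "銀座・新橋・東京・上野"]

max_chars = max(len(v) for v in area_names)   # 14 characters
max_bytes = utf8_max_len(area_names)          # 42 bytes (3 bytes per character)

print(max_chars, max_bytes)
print(utf8_max_len(["hi", "привіт", "Grüßi"]))  # 12
```

The byte length could then be supplied per column, for example as min_itemsize={'SMALL_AREA_NAME': max_bytes}, since min_itemsize also accepts a dict keyed by column name.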

How to store large columnar text+numeric data in Python?

To save to disk without building a columnar DB, there are:
SQLite,
HDF5: only numeric/fixed-length strings
pickle serialization
csv
compressed csv.
....
Just wondering which one is the most efficient in terms of speed?
Thanks
I'd consider Feather or HDF5. MySQL or PostgreSQL might also be an option, depending on how you are going to query your data.
Here is a demo for HDF5:
In [33]: df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 3)), columns=list('abc'))
In [34]: df['txt'] = 'X' * 300
In [35]: df
Out[35]:
a b c txt
0 689347 129498 770470 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
1 954132 97912 783288 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
2 40548 938326 861212 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
3 869895 39293 242473 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
4 938918 487643 362942 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
...
In [37]: df.to_hdf('c:/temp/test_str.h5', 'test', format='t', data_columns=['a', 'c'])
In [38]: store = pd.HDFStore('c:/temp/test_str.h5')
In [39]: store.get_storer('test').table
Out[39]:
/test/table (Table(10000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
"values_block_1": StringCol(itemsize=300, shape=(1,), dflt=b'', pos=2), # <---- NOTE
"a": Int32Col(shape=(), dflt=0, pos=3),
"c": Int32Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (204,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"c": Index(6, medium, shuffle, zlib(1)).is_csi=False}

Pandas HDFStore strange behaviour on shape

I am facing this strange behaviour: I have an HDFStore containing DataFrames.
For two keys in the store, the shape information differs depending on how they are queried.
Example:
In [1]: mystore = pandas.HDFStore('/store')
In [2]: mystore
Out[2]:
<class 'pandas.io.pytables.HDFStore'>
File path: /store
/chunk_data frame (shape->[1,1])
/enrich_data_kb frame (shape->[1,11])
/inputs frame (shape->[105,4])
/prepare_data frame (shape->[105,7])
/reduce_data frame (shape->[18,4])
In [3]: mystore['chunk_data'].shape
Out[3]: (0, 1)
In [4]: mystore['enrich_data_kb'].shape
Out[4]: (18, 11)
In [5]: mystore['inputs'].shape
Out[5]: (105, 4)
Any idea?
As Jeff suggested, here is the output of ptdump (restricted to the enrich_data_kb key):
/enrich_data_kb (Group) ''
/enrich_data_kb._v_attrs (AttributeSet), 13 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
axis0_variety := 'regular',
axis1_variety := 'regular',
block0_items_variety := 'regular',
block1_items_variety := 'regular',
block2_items_variety := 'regular',
encoding := None,
nblocks := 3,
ndim := 2,
pandas_type := 'frame',
pandas_version := '0.15.2']
/enrich_data_kb/axis0 (Array(11,)) ''
atom := StringAtom(itemsize=10, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/enrich_data_kb/axis0._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
kind := 'string',
name := None,
transposed := True]
/enrich_data_kb/axis1 (Array(18,)) ''
atom := Int64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := None
/enrich_data_kb/axis1._v_attrs (AttributeSet), 7 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
kind := 'integer',
name := None,
transposed := True]
/enrich_data_kb/block0_items (Array(8,)) ''
atom := StringAtom(itemsize=10, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/enrich_data_kb/block0_items._v_attrs (AttributeSet), 8 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
freq := None,
kind := 'string',
name := None,
transposed := True]
/enrich_data_kb/block0_values (VLArray(1,)) ''
atom = ObjectAtom()
byteorder = 'irrelevant'
nrows = 1
flavor = 'numpy'
/enrich_data_kb/block0_values._v_attrs (AttributeSet), 5 attributes:
[CLASS := 'VLARRAY',
PSEUDOATOM := 'object',
TITLE := '',
VERSION := '1.4',
transposed := True]
/enrich_data_kb/block1_items (Array(2,)) ''
atom := StringAtom(itemsize=10, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/enrich_data_kb/block1_items._v_attrs (AttributeSet), 8 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
freq := None,
kind := 'string',
name := None,
transposed := True]
/enrich_data_kb/block1_values (Array(18, 2)) ''
atom := Float64Atom(shape=(), dflt=0.0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := None
/enrich_data_kb/block1_values._v_attrs (AttributeSet), 5 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
transposed := True]
/enrich_data_kb/block2_items (Array(1,)) ''
atom := StringAtom(itemsize=8, shape=(), dflt='')
maindim := 0
flavor := 'numpy'
byteorder := 'irrelevant'
chunkshape := None
/enrich_data_kb/block2_items._v_attrs (AttributeSet), 8 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
freq := None,
kind := 'string',
name := None,
transposed := True]
/enrich_data_kb/block2_values (Array(18, 1)) ''
atom := Int64Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := None
/enrich_data_kb/block2_values._v_attrs (AttributeSet), 5 attributes:
[CLASS := 'ARRAY',
FLAVOR := 'numpy',
TITLE := '',
VERSION := '2.4',
transposed := True]

Pandas HDFStore: slow on query for non-matching string

My issue is that when I try to look for a string that is NOT contained in the DataFrame (which is stored in an hdf5 file), it takes a very long time to complete the query. For example:
I have a df that contains 2*10^9 rows. It is stored in an HDF5 file. I have a string column named "code", that was marked as "data_column" (therefore it is indexed).
When I search for a code that exists in the dataset ( store.select('df', 'code=valid_code') ) it takes around 10 seconds to get 70K rows.
However, when I search for a code that does NOT exist in the dataset ( store.select('df', 'code=not_valid_code') ) it takes around 980 seconds to get the result of the query (0 rows).
I create the store like:
store = pd.HDFStore('data.h5', complevel=1, complib='zlib')
And the first append is like:
store.append('df', chunk, data_columns=['code'], expectedrows=2318185498)
Is this behavior normal or is there something wrong going on?
Thanks!
PS: this question is probably related to this other question
UPDATE:
Following Jeff's advice, I replicated his experiment, and I got the following results on a Mac. This is the table that was generated:
!ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(50000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (8192,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string64',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 50000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
And these are the results:
In [8]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 277 ms per loop
In [9]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 391 ms per loop
In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 533 ms per loop
In [11]: %timeit pd.read_hdf('test_zlib2.h5','df',where='A = "bar"')
1 loops, best of 3: 504 ms per loop
Since the differences were maybe not big enough, I tried the same experiment but with a bigger dataframe. Also, I did this experiment on a different machine, one with Linux.
This is the code (I just multiplied the original dataset by 10):
import pandas as pd
df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(500000) ]})
df = pd.concat([ df ] * 20)
store = pd.HDFStore('test.h5',mode='w')
for i in range(50):
    print "%s" % i
    store.append('df',df,data_columns=['A'])
This is the table:
!ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(500000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=9, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (15420,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string72',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 500000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
These are the files:
-rw-rw-r-- 1 user user 8.2G Oct 5 14:00 test.h5
-rw-rw-r-- 1 user user 9.9G Oct 5 14:30 test_zlib.h5
And these are the results:
In [9]:%timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 1.02 s per loop
In [10]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 980 ms per loop
In [11]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 7.02 s per loop
In [12]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
1 loops, best of 3: 7.27 s per loop
These are my versions of Pandas and Pytables:
user@host:~/$ pip show tables
---
Name: tables
Version: 3.1.1
Location: /usr/local/lib/python2.7/dist-packages
Requires:
user@host:~/$ pip show pandas
---
Name: pandas
Version: 0.14.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: python-dateutil, pytz, numpy
However, I am quite sure that the issue is not related to Pandas, since I have observed similar behavior when using only Pytables without Pandas.
UPDATE 2:
I have switched to Pytables 3.0.0 and the problem got fixed. This is using the same files that were generated with Pytables 3.1.1.
In [4]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 205 ms per loop
In [4]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 101 ms per loop
I think your issue is one for which we filed a bug with the PyTables developers a while ago. Essentially, using a compressed store AND specifying expectedrows AND using an indexed column causes mis-indexing.
The solution is simply NOT to use expectedrows, and instead to ptrepack the file with a specified chunkshape (or AUTO). This is good practice anyhow. Further, I'm not sure whether you are specifying compression up front, but it is IMHO better to do this via ptrepack (see the docs). There is also a question on SO about this (I can't find it right now); essentially, if you are creating the file, don't index up front, but rather once you are done appending, if you can.
In any event, creating a test store:
In [1]: df = DataFrame({'A' : [ 'foo%05d' % i for i in range(50000) ]})
In [2]: df = pd.concat([ df ] * 20)
Append 50M rows.
In [4]: store = pd.HDFStore('test.h5',mode='w')
In [6]: for i in range(50):
   ...:     print "%s" % i
   ...:     store.append('df',df,data_columns=['A'])
   ...:
Here is the table
In [9]: !ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(50000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (8192,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string64',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 50000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
Create a blosc and zlib version.
In [12]: !ptrepack --complib blosc --chunkshape auto --propindexes test.h5 test_blosc.h5
In [13]: !ptrepack --complib zlib --chunkshape auto --propindexes test.h5 test_zlib.h5
In [14]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 866182540 Oct 4 20:31 test.h5
-rw-rw-r-- 1 jreback users 976674013 Oct 4 20:36 test_blosc.h5
-rw-rw-r-- 1 jreback users 976674013 Oct 4 2014 test_zlib.h5
Perf is pretty similar (for the found rows)
In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 337 ms per loop
In [15]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "foo00002"')
1 loops, best of 3: 345 ms per loop
In [16]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 347 ms per loop
And missing rows (though the compressed do perform better here).
In [11]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
10 loops, best of 3: 82.4 ms per loop
In [17]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "bar"')
10 loops, best of 3: 32.2 ms per loop
In [18]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 32.3 ms per loop
So, try without the expectedrows specifier, and use ptrepack.
Another possibility, if you are expecting a relatively low density of entries for this column (e.g. a small number of unique entries), is to select the entire column, store.select_column('df','A').unique() in this case, and use that as a quick lookup mechanism (so you don't search at all).
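The lookup idea can be sketched roughly as follows. select_if_known is a hypothetical helper name, and the sketch assumes that value is a plain string without double quotes and that the set of unique values fits comfortably in memory.

```python
def select_if_known(store, key, column, value, known_values=None):
    """Query `store` only when `value` occurs in `column`; otherwise return None."""
    if known_values is None:
        # One full pass over the column. In practice, build this set once
        # and pass it back in on later calls to avoid repeating the scan.
        known_values = set(store.select_column(key, column).unique())
    if value not in known_values:
        return None  # value is absent, so the slow query is skipped entirely
    # Assumption: `value` contains no double quotes.
    return store.select(key, where='{} == "{}"'.format(column, value))
```

For the store in the question this would be called as select_if_known(store, 'df', 'code', some_code), returning None immediately for codes that are not in the dataset.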
Thanks to Jeff's help I fixed the issue by downgrading Pytables to the version 3.0.0. The issue has been reported to the devs of Pytables.