How to store large columnar text+numeric data in Python? - pandas

To save data on disk without building a columnar database, the options include:
SQLite,
HDF5: only numeric/fixed-length strings,
pickle serialization,
CSV,
compressed CSV,
...
Which one is the most efficient in terms of speed?
Thanks

I'd consider Feather or HDF5. MySQL or PostgreSQL might also be an option, depending on how you are going to query your data.
Here is a demo for HDF5:
In [33]: df = pd.DataFrame(np.random.randint(0, 10**6, (10**4, 3)), columns=list('abc'))
In [34]: df['txt'] = 'X' * 300
In [35]: df
Out[35]:
a b c txt
0 689347 129498 770470 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
1 954132 97912 783288 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
2 40548 938326 861212 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
3 869895 39293 242473 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
4 938918 487643 362942 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
...
In [37]: df.to_hdf('c:/temp/test_str.h5', key='test', format='t', data_columns=['a', 'c'])
In [38]: store = pd.HDFStore('c:/temp/test_str.h5')
In [39]: store.get_storer('test').table
Out[39]:
/test/table (Table(10000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
"values_block_1": StringCol(itemsize=300, shape=(1,), dflt=b'', pos=2), # <---- NOTE
"a": Int32Col(shape=(), dflt=0, pos=3),
"c": Int32Col(shape=(), dflt=0, pos=4)}
byteorder := 'little'
chunkshape := (204,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"c": Index(6, medium, shuffle, zlib(1)).is_csi=False}

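Since the question is about speed, the most reliable answer comes from timing the candidates on your own data. Below is a minimal sketch that times one write and one read for pickle, CSV, gzip-compressed CSV and SQLite, using only pandas and the standard library. The helper names (make_frame, benchmark) and file names are made up for illustration; Feather and HDF5 are left out only because they need optional dependencies (pyarrow, PyTables), and they could be added to the same table of callables.

```python
import os
import sqlite3
import tempfile
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=10_000):
    """Build a mixed numeric + text frame shaped like the demo above."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.integers(0, 10**6, (n_rows, 3)), columns=list("abc"))
    df["txt"] = "X" * 300
    return df


def _timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def benchmark(df, workdir):
    """Time one write and one read per format; return a dict of results."""
    pkl = os.path.join(workdir, "t.pkl")
    csv = os.path.join(workdir, "t.csv")
    gz = os.path.join(workdir, "t.csv.gz")
    db = os.path.join(workdir, "t.sqlite")

    def sqlite_write():
        conn = sqlite3.connect(db)
        try:
            df.to_sql("t", conn, index=False, if_exists="replace")
        finally:
            conn.close()

    def sqlite_read():
        conn = sqlite3.connect(db)
        try:
            return pd.read_sql("SELECT * FROM t", conn)
        finally:
            conn.close()

    # Each entry is (write callable, read callable).
    formats = {
        "pickle": (lambda: df.to_pickle(pkl), lambda: pd.read_pickle(pkl)),
        "csv": (lambda: df.to_csv(csv, index=False), lambda: pd.read_csv(csv)),
        "csv.gz": (lambda: df.to_csv(gz, index=False, compression="gzip"),
                   lambda: pd.read_csv(gz, compression="gzip")),
        "sqlite": (sqlite_write, sqlite_read),
    }
    results = {}
    for name, (write, read) in formats.items():
        _, write_s = _timed(write)
        back, read_s = _timed(read)
        results[name] = {"write_s": write_s, "read_s": read_s, "rows": len(back)}
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, res in benchmark(make_frame(), tmp).items():
            print(name, res)
```

A single run per format is noisy; for a real decision, repeat each measurement several times and test with data of realistic size.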

Nested Dict To DataFrame

Can anyone help?
msg = {'e': 'kline',
'E': 1672157513375,
's': 'BTCUSDT',
'k': {
't': 1672157460000, #REQUIRE, CONVERT MS TO DATETIME,
#RENAME AS TIME, AS INDEX
'T': 1672157519999,
's': 'BTCUSDT',
'i': '1m',
'f': 2388965371,
'L': 2388969270,
'o': '16787.32000000', #REQUIRE RENAME AS OPEN
'c': '16783.23000000', #REQUIRE RENAME AS CLOSE
'h': '16789.41000000', #REQUIRE RENAME AS HIGH
'l': '16782.69000000', #REQUIRE RENAME AS LOW
'v': '149.27507000', #REQUIRE RENAME AS VOLUME
'n': 3900,
'x': False,
'q': '2505669.98288240',
'V': '59.70465000',
'Q': '1002207.92308370',
'B': '0'
}
}
Required columns:
Time = k['t'], datetime
Open = k['o'], dtype float
High = k['h'], dtype float
Low = k['l'], dtype float
Close = k['c'], dtype float
Volume = k['v'], dtype float
The index should be k['t'], converted from milliseconds to datetime.
Language: Python.
What I tried:
def getdata(msg):
    frame = pd.DataFrame(msg)
    #DONT UNDERSTOOD
    frame = frame.loc[frame['k']['t'],frame['k']['t'],frame['k']['t'],
                      frame['k']['t'],frame['k']['t'],frame['k']['t']]
    #SOME UNDERSTOOD
    frame.columns = ["Time","Open","High","Low","Close","Volume"]
    frame.set_index("Time",inplace=True)
    frame.index = pd.to_datetime(frame.index,unit='ms')
    frame = frame.astype(float)
    return frame
getdata(msg)
Required output:
Time                 Open     High     Low      Close    Volume
2022-12-27 16:11:00  16787.7  16789.4  16782.6  16783.2  149
Using json_normalize():
import pandas as pd

df = (pd
    .json_normalize(data=msg["k"])[["t", "o", "h", "l", "c", "v"]]
    .rename(columns={"t": "Time", "o": "Open", "h": "High", "l": "Low", "c": "Close", "v": "Volume"})
)
df["Time"] = (
    pd.to_datetime(df["Time"], unit="ms")
    .dt.tz_localize("UTC")
    .dt.tz_localize(None)
    .dt.floor("s")  # lowercase "s" is the seconds alias; "S" is deprecated in pandas 2.2+
)
print(df)
Output:
Time Open High Low Close Volume
0 2022-12-27 16:11:00 16787.32000000 16789.41000000 16782.69000000 16783.23000000 149.27507000
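The output above still holds the prices as strings and uses a default integer index, while the question asks for float columns indexed by Time. A minimal sketch of one way to get there, assuming the same msg layout as in the question; the helper name kline_to_frame and the RENAMES mapping are made up for illustration:

```python
import pandas as pd

# Illustrative mapping from the kline keys to the requested column names.
RENAMES = {"t": "Time", "o": "Open", "h": "High", "l": "Low", "c": "Close", "v": "Volume"}


def kline_to_frame(msg):
    """Return a one-row frame indexed by Time with float price/volume columns."""
    k = msg["k"]
    # Pick only the required fields, already under their new names.
    row = {new: k[old] for old, new in RENAMES.items()}
    frame = pd.DataFrame([row])
    # 't' is a millisecond epoch timestamp.
    frame["Time"] = pd.to_datetime(frame["Time"], unit="ms")
    frame = frame.set_index("Time")
    # The remaining columns arrive as strings; convert them to float.
    return frame.astype(float)
```

Calling kline_to_frame(msg) on each incoming message and concatenating the results with pd.concat would give a growing frame of candles.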

Pyspark remove field in struct column

I want to remove a part of a value in a struct and save that version of the value as a new column in my dataframe, which looks something like this:
column
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}
I want to remove the field C and its values and save the rest as one new column, without splitting the A, B and D fields into separate columns. What I want should look like this:
column | newColumn
{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"} | {"A": "2022-01-26T14:21:32.214+0000", "B": 69, "D": "XD"}
I have successfully removed C by converting my dataframe to a dict, but now I can't manage to convert it back into ONE column. My attempt at removing C looks like this:
dfTemp = df.select('column').collect()[0][0].asDict(True)
dfDict = {}
for k in dfTemp:
    if k != 'C':
        dfDict[k] = dfTemp[k]
If you know a better way to remove part of a struct like mine while keeping the result in one column and not adding rows, or if you know how to convert a dict to a dataframe without splitting the key-value pairs into separate columns, please leave a suggestion.
Assuming your column is of type string and contains JSON, you can first parse it into a StructType using from_json like this:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ('{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}',)
], ["column"])
df = df.withColumn(
    "parsed_column",
    F.from_json("column", "struct<A:string,B:int,C:struct<CA:int,CB:string>,D:string>")
)
Now removing the field C from the struct column:
Spark >=3.1
Use dropFields method:
result = df.withColumn("newColumn", F.to_json(F.col("parsed_column").dropFields("C"))).drop("parsed_column")
result.show(truncate=False)
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|column |newColumn |
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
#|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|{"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}|
#+-----------------------------------------------------------------------------------------+----------------------------------------------------+
Spark <3.1
Recreate the struct column and filter the field C
result = df.withColumn(
    "newColumn",
    F.to_json(
        F.struct(*[
            F.col(f"parsed_column.{c}").alias(c)
            for c in df.selectExpr("parsed_column.*").columns if c != 'C'
        ])
    )
).drop("parsed_column")
Another method is to parse the JSON string values into a MapType and then apply map_filter to remove the key C:
result = df.withColumn(
    "newColumn",
    F.to_json(
        F.map_filter(
            F.from_json("column", "map<string,string>"),
            lambda k, v: k != "C"
        )
    )
)
Well, it's not as trivial as it would seem. First, your approach is not meant for Spark, unless you're working with very little data (in which case you don't need Spark and are better off using pure Python, as you tried). collect() fetches all the data onto the driver, which will not work with large data.
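For the small-data case mentioned above, where plain Python is enough, the field can be dropped directly on the JSON string. A minimal sketch, assuming each value is a JSON object serialized as text; the function name drop_json_field is made up for illustration:

```python
import json


def drop_json_field(json_text, field="C"):
    """Parse a JSON object, remove one top-level key, and serialize it again."""
    record = json.loads(json_text)
    record.pop(field, None)  # no error if the key is absent
    return json.dumps(record)


if __name__ == "__main__":
    text = '{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}'
    print(drop_json_field(text))
```

The same function could in principle be wrapped in a Spark UDF, but the built-in functions shown in the other answer avoid the Python serialization overhead and are usually the better choice on large data.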
The distributed approach for this is as follows:
infer schema on part of your JSON data (unless you want to do it manually - which is tedious)
transform your dataframe with this schema to have access to named attributes
select attributes as needed and back to JSON
I tried to decompose as much as I could here:
from pyspark.sql.types import IntegerType, StructType, StringType, StructField
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import json
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Create input data
data = [json.dumps({"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"})]
df = spark.createDataFrame(data, "string").toDF("colA")
df.show()
+-----------------------------------------------------------------------------------------+
|colA |
+-----------------------------------------------------------------------------------------+
|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|
+-----------------------------------------------------------------------------------------+
# Infer schema - inferring on first 10 rows
s = df.select(F.col("colA").alias("s")).rdd.map(lambda x: x.s).take(10)
schema = spark.read.json(sc.parallelize(s)).schema
print(schema)
# StructType(List(StructField(A,StringType,true),StructField(B,LongType,true),StructField(C,StructType(List(StructField(CA,LongType,true),StructField(CB,StringType,true))),true),StructField(D,StringType,true)))
# read JSON string with schema
new_df = df.withColumn("colB", F.from_json("colA", schema))
new_df.show(truncate=False)
+-----------------------------------------------------------------------------------------+---------------------------------------------------+
|colA |colB |
+-----------------------------------------------------------------------------------------+---------------------------------------------------+
|{"A": "2022-01-26T14:21:32.214+0000", "B": 69, "C": {"CA": 42, "CB": "Hello"}, "D": "XD"}|{2022-01-26T14:21:32.214+0000, 69, {42, Hello}, XD}|
+-----------------------------------------------------------------------------------------+---------------------------------------------------+
# Finally ...
new_df.select(F.to_json(F.struct("colB.A", "colB.B", "colB.D")).alias("colC")).show(truncate=False)
+----------------------------------------------------+
|colC |
+----------------------------------------------------+
|{"A":"2022-01-26T14:21:32.214+0000","B":69,"D":"XD"}|
+----------------------------------------------------+

Failing to generate scalar predictions from NuPIC CLA model

I'm failing to get scalar predictions out of a CLA model.
Here's a self-contained example. It uses config to create a model using the ModelFactory. Then it trains it with a simple data set ({input_field=X, output_field=X} where X is random between 0-1). Then it attempts to extract predictions with input of the form {input_field=X, output_field=None}.
#!/usr/bin/python
import random
from nupic.frameworks.opf.modelfactory import ModelFactory
config = {
'model': "CLA",
'version': 1,
'modelParams': {
'inferenceType': 'NontemporalClassification',
'sensorParams': {
'verbosity' : 0,
'encoders': {
'_classifierInput': {
'classifierOnly': True,
'clipInput': True,
'fieldname': u'output_field',
'maxval': 1.0,
'minval': 0.0,
'n': 100,
'name': '_classifierInput',
'type': 'ScalarEncoder',
'w': 21},
u'input_field': {
'clipInput': True,
'fieldname': u'input_field',
'maxval': 1.0,
'minval': 0.0,
'n': 100,
'name': u'input_field',
'type': 'ScalarEncoder',
'w': 21},
},
},
'spEnable': False,
'tpEnable' : False,
'clParams': {
'regionName' : 'CLAClassifierRegion',
'clVerbosity' : 0,
'alpha': 0.001,
'steps': '0',
},
},
}
model = ModelFactory.create(config)
ROWS = 100
def sample():
    return random.uniform(0.0, 1.0)
# training data is {input_field: X, output_field: X}
def training():
    for r in range(ROWS):
        value = sample()
        yield {"input_field": value, "output_field": value}
# testing data is {input_field: X, output_field: None} (want output_field predicted)
def testing():
    for r in range(ROWS):
        value = sample()
        yield {"input_field": value, "output_field": None}
model.enableInference({"predictedField": "output_field"})
model.enableLearning()
for row in training():
    model.run(row)
#model.finishLearning() fails in clamodel.py
model.disableLearning()
for row in testing():
    result = model.run(row)
    print result.inferences # Shows None as value
The output I see is high confidence None rather than what I expect, which is something close to the input value (since the model was trained on input==output).
{'multiStepPredictions': {0: {None: 1.0}}, 'multiStepBestPredictions': {0: None}, 'anomalyScore': None}
{'multiStepPredictions': {0: {None: 0.99999999999999978}}, 'multiStepBestPredictions': {0: None}, 'anomalyScore': None}
{'multiStepPredictions': {0: {None: 1.0000000000000002}}, 'multiStepBestPredictions': {0: None}, 'anomalyScore': None}
{'multiStepPredictions': {0: {None: 1.0}}, 'multiStepBestPredictions': {0: None}, 'anomalyScore': None}
'NontemporalClassification' seems to be the right inferenceType, because it's a simple classification. But does that work with scalars?
Is there a different way of expressing that I want a prediction other than output_field=None?
I need output_field to be classifierOnly=True. Is there related configuration missing or wrong?
Thanks for your help.
Here's the working example. The key changes were:
1. Use TemporalMultiStep as recommended by @matthew-taylor (adding the required parameters).
2. Use "implementation": "py" in clParams. My values are in the range 0.0-1.0. The fast classifier always returns None for values in that range. The same code with the "py" implementation returns valid values. Change the range to 10-100 and the fast algorithm also returns valid values. It was this change that finally produced non-None results.
3. Less significant than #2: to improve the results, I repeat each training row so the model has time to learn it, which makes sense for training.
To see the classifier bug, comment out the "implementation": "py" line in clParams. The results will be None. Then change MIN_VAL to 10 and MAX_VAL to 100 and watch the results come back.
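The observation that the fast classifier returns valid values once the range is 10-100 suggests a possible workaround for anyone who wants to keep the default implementation: rescale the data into that range before feeding it to the model, and map the predictions back afterwards. This is only a sketch of the rescaling step. It was not part of the original answer, has not been tested against NuPIC, and the helper names are made up for illustration.

```python
def to_model_range(value, src=(0.0, 1.0), dst=(10.0, 100.0)):
    """Linearly map a value from the src interval to the dst interval."""
    (s0, s1), (d0, d1) = src, dst
    return d0 + (value - s0) * (d1 - d0) / (s1 - s0)


def from_model_range(value, src=(0.0, 1.0), dst=(10.0, 100.0)):
    """Inverse of to_model_range: map a value from dst back to src."""
    (s0, s1), (d0, d1) = src, dst
    return s0 + (value - d0) * (s1 - s0) / (d1 - d0)
```

With this approach the encoders would be configured with minval=10 and maxval=100, each row would pass through to_model_range before model.run, and each non-None prediction would pass through from_model_range.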
#!/usr/bin/python
import random
from nupic.frameworks.opf.modelfactory import ModelFactory
from nupic.support import initLogging
from nupic.encoders import ScalarEncoder
import numpy
MIN_VAL = 0.0
MAX_VAL = 1.0
config = {
'model': "CLA",
'version': 1,
'predictAheadTime': None,
'modelParams': {
'clParams': {
"implementation": "py", # cpp version fails with small numbers
'regionName' : 'CLAClassifierRegion',
'clVerbosity' : 0,
'alpha': 0.001,
'steps': '1',
},
'inferenceType': 'TemporalMultiStep',
'sensorParams': {
'encoders': {
'_classifierInput': {
'classifierOnly': True,
'clipInput': True,
'fieldname': 'output_field',
'maxval': MAX_VAL,
'minval': MIN_VAL,
'n': 200,
'name': '_classifierInput',
'type': 'ScalarEncoder',
'w': 21},
u'input_field': {
'clipInput': True,
'fieldname': 'input_field',
'maxval': MAX_VAL,
'minval': MIN_VAL,
'n': 100,
'name': 'input_field',
'type': 'ScalarEncoder',
'w': 21},
},
'sensorAutoReset' : None,
'verbosity' : 0,
},
'spEnable': True,
'spParams': {
'columnCount': 2048,
'globalInhibition': 1,
'spatialImp': 'cpp',
},
'tpEnable' : True,
'tpParams': { 'activationThreshold': 12,
'cellsPerColumn': 32,
'columnCount': 2048,
'temporalImp': 'cpp',
},
'trainSPNetOnlyIfRequested': False,
},
}
# end of config dictionary
model = ModelFactory.create(config)
TRAINING_ROWS = 100
TESTING_ROWS = 100
def sample(r = 0.0):
    return random.uniform(MIN_VAL, MAX_VAL)
def training():
    for r in range(TRAINING_ROWS):
        value = sample(r / TRAINING_ROWS)
        for rd in range(5):
            yield {
                "input_field": value,
                "output_field": value,
                '_reset': 1 if (rd==0) else 0,
            }
def testing():
    for r in range(TESTING_ROWS):
        value = sample()
        yield {
            "input_field": value,
            "output_field": None,
        }
model.enableInference({"predictedField": "output_field"})
for row in training():
    model.run(row)
for row in testing():
    result = model.run(row)
    prediction = result.inferences['multiStepBestPredictions'][1]
    if prediction is None:
        print "Input %f, Output None" % (row['input_field'])
    else:
        print "Input %f, Output %f (err %f)" % (row['input_field'], prediction, prediction - row['input_field'])
The inferenceType you want is TemporalMultiStep.
See this example for a complete walkthrough.

Pandas HDFStore: slow on query for non-matching string

My issue is that when I try to look for a string that is NOT contained in the DataFrame (which is stored in an hdf5 file), it takes a very long time to complete the query. For example:
I have a df that contains 2*10^9 rows. It is stored in an HDF5 file. I have a string column named "code", that was marked as "data_column" (therefore it is indexed).
When I search for a code that exists in the dataset ( store.select('df', 'code=valid_code') ) it takes around 10 seconds to get 70K rows.
However, when I search for a code that does NOT exist in the dataset ( store.select('df', 'code=not_valid_code') ) it takes around 980 seconds to get the result of the query (0 rows).
I create the store like:
store = pd.HDFStore('data.h5', complevel=1, complib='zlib')
And the first append is like:
store.append('df', chunk, data_columns=['code'], expectedrows=2318185498)
Is this behavior normal or is there something wrong going on?
Thanks!
PS: this question is probably related to this other question
UPDATE:
Following Jeff's advice, I replicated his experiment, and I got the following results on a Mac. This is the table that was generated:
!ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(50000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (8192,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string64',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 50000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
And these are the results:
In [8]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 277 ms per loop
In [9]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 391 ms per loop
In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 533 ms per loop
In [11]: %timeit pd.read_hdf('test_zlib2.h5','df',where='A = "bar"')
1 loops, best of 3: 504 ms per loop
Since the differences were maybe not big enough, I tried the same experiment but with a bigger dataframe. Also, I did this experiment on a different machine, one with Linux.
This is the code (I just multiplied the original dataset by 10):
import pandas as pd
df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(500000) ]})
df = pd.concat([ df ] * 20)
store = pd.HDFStore('test.h5',mode='w')
for i in range(50):
    print "%s" % i
    store.append('df',df,data_columns=['A'])
This is the table:
!ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(500000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=9, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (15420,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string72',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 500000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
These are the files:
-rw-rw-r-- 1 user user 8.2G Oct 5 14:00 test.h5
-rw-rw-r-- 1 user user 9.9G Oct 5 14:30 test_zlib.h5
And these are the results:
In [9]:%timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 1.02 s per loop
In [10]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 980 ms per loop
In [11]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 7.02 s per loop
In [12]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
1 loops, best of 3: 7.27 s per loop
These are my versions of Pandas and Pytables:
user@host:~/$ pip show tables
---
Name: tables
Version: 3.1.1
Location: /usr/local/lib/python2.7/dist-packages
Requires:
user@host:~/$ pip show pandas
---
Name: pandas
Version: 0.14.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: python-dateutil, pytz, numpy
However, I am quite sure that the issue is not related to pandas, since I have observed similar behavior when using only PyTables without pandas.
UPDATE 2:
I have switched to Pytables 3.0.0 and the problem got fixed. This is using the same files that were generated with Pytables 3.1.1.
In [4]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 205 ms per loop
In [4]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 101 ms per loop
I think your issue is one for which we filed a bug with the PyTables developers a while ago (here). Essentially, using a compressed store AND specifying expectedrows AND using an indexed column causes mis-indexing.
The solution is simply NOT to use expectedrows, and instead to ptrepack the file with a specified chunkshape (or AUTO). This is good practice anyhow. Further, I'm not sure whether you are specifying compression up-front, but it is IMHO better to do this via ptrepack; see the docs here. There is also a question on SO about this (I can't find it right now): essentially, if you are creating the file, don't index up-front; index once you are done appending, if you can.
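The advice to skip indexing while appending and build the index once at the end can be sketched with the HDFStore API. This is only a sketch: it needs PyTables installed, the function name append_then_index is made up for illustration, and the optlevel and kind values are illustrative choices rather than settings taken from this answer. Compression and chunkshape would still be applied afterwards with ptrepack.

```python
import pandas as pd


def append_then_index(path, chunks, key="df", data_columns=("A",)):
    """Append every chunk with indexing turned off, then index once at the end."""
    cols = list(data_columns)
    with pd.HDFStore(path, mode="w") as store:
        for chunk in chunks:
            # index=False defers index creation; no expectedrows is passed.
            store.append(key, chunk, data_columns=cols, index=False)
        # Build the column index a single time over the full table.
        store.create_table_index(key, columns=cols, optlevel=9, kind="full")
```

Queries such as pd.read_hdf(path, 'df', where='A == "bar"') can then use the index built in the last step.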
In any event, creating a test store:
In [1]: df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(50000) ]})
In [2]: df = pd.concat([ df ] * 20)
Append 50M rows.
In [4]: store = pd.HDFStore('test.h5',mode='w')
In [6]: for i in range(50):
   ...:     print "%s" % i
   ...:     store.append('df',df,data_columns=['A'])
   ...:
Here is the table
In [9]: !ptdump -av test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['A'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['A']]
/df/table (Table(50000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
byteorder := 'little'
chunkshape := (8192,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[A_dtype := 'string64',
A_kind := ['A'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := '',
FIELD_1_NAME := 'A',
NROWS := 50000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
Create a blosc and zlib version.
In [12]: !ptrepack --complib blosc --chunkshape auto --propindexes test.h5 test_blosc.h5
In [13]: !ptrepack --complib zlib --chunkshape auto --propindexes test.h5 test_zlib.h5
In [14]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 866182540 Oct 4 20:31 test.h5
-rw-rw-r-- 1 jreback users 976674013 Oct 4 20:36 test_blosc.h5
-rw-rw-r-- 1 jreback users 976674013 Oct 4 2014 test_zlib.h5
Perf is pretty similar (for the found rows)
In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 337 ms per loop
In [15]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "foo00002"')
1 loops, best of 3: 345 ms per loop
In [16]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 347 ms per loop
And missing rows (though the compressed do perform better here).
In [11]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
10 loops, best of 3: 82.4 ms per loop
In [17]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "bar"')
10 loops, best of 3: 32.2 ms per loop
In [18]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 32.3 ms per loop
So, try without the expectedrows specifier, and use ptrepack.
Another possibility, if you are expecting a relatively low density of entries for this column (e.g. a small number of unique entries), is to select the entire column (store.select_column('df','A').unique() in this case) and use that as a quick lookup mechanism, so you don't search at all.
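The lookup idea can be sketched as two small helpers: one reads the column once and keeps its distinct values in a set, the other consults that set before touching the store. The helper names are made up for illustration, and the sketch assumes the distinct values fit comfortably in memory; reading the whole column of a very large table is itself a one-off cost.

```python
def build_lookup(store, key, column):
    """Read one data column and return its distinct values as a set."""
    return set(store.select_column(key, column).unique())


def select_if_present(store, key, column, value, known_values):
    """Query the store only when the value is known to exist in the column."""
    if value not in known_values:
        return None  # caller treats None as "no matching rows"
    return store.select(key, where='%s == "%s"' % (column, value))
```

Typical use would be known = build_lookup(store, 'df', 'code') once after opening the store, then select_if_present(store, 'df', 'code', some_code, known) for every query, so that non-matching codes never reach PyTables.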
Thanks to Jeff's help I fixed the issue by downgrading Pytables to the version 3.0.0. The issue has been reported to the devs of Pytables.

Check whether a PyTables node in a pandas HDFStore is tabular

Is there a preferred way to check whether a PyTables node in a pandas HDFStore is tabular? This works, but NoSuchNodeError doesn't seem like part of the API, so maybe I should not rely on it.
In [34]: from tables.table import NoSuchNodeError
In [35]: def is_tabular(store, key):
   ....:     try:
   ....:         store.get_node(key).table
   ....:     except NoSuchNodeError:
   ....:         return False
   ....:     return True
   ....:
In [36]: is_tabular(store, 'first_600')
Out[36]: False
In [37]: is_tabular(store, 'features')
Out[37]: True
You could do something like this. The pandas_type, table_type meta-data will be present in the pytables attribute _v_attrs at the top-level of the node.
In [28]: store = pd.HDFStore('test.h5',mode='w')
In [29]: store.append('df',DataFrame(np.random.randn(10,2),columns=list('AB')))
In [30]: store
Out[30]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index])
In [31]: store._handle.root.df._v_attrs
Out[31]:
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := u'',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0']]
In [33]: getattr(getattr(getattr(store._handle.root,'df',None),'_v_attrs',None),'pandas_type',None)
Out[33]: 'frame_table'
In [34]: store.close()
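Building on the attributes shown above, the check from the question can be written without catching NoSuchNodeError. This is a minimal sketch, assuming that only table-format nodes carry a table_type attribute in _v_attrs (fixed-format nodes are assumed to have pandas_type only) and that store.get_node returns None for a missing key:

```python
def is_tabular(store, key):
    """Return True if the node at `key` was written in table format."""
    node = store.get_node(key)  # None when the key does not exist
    attrs = getattr(node, "_v_attrs", None)
    # table_type (e.g. 'appendable_frame') is the marker checked here.
    return getattr(attrs, "table_type", None) is not None
```

The getattr defaults make the function return False for missing keys and for nodes without pandas metadata instead of raising.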