I have pulled some big tables (typically 100M rows x 100 columns) from a database, with both numerical and textual columns. Now I need to save them to a remote machine where setting up a database is not allowed. Once the data are saved there, the use case is occasional reads of these files by different collaborators, so in-place queries would be preferable.
So I figured that HDF5 would be the file format to use. However, the performance seems really bad for both reads and writes, even worse than .csv files. I used a random subset (1M rows) of a big table to test the saving process in Python. Some info about this test DataFrame:
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 1 to 1000000
Data columns (total 49 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 xid 1000000 non-null int64
1 yid 1000000 non-null int64
2 timestamp 1000000 non-null datetime64[ns]
...
7 content1 361892 non-null object
8 content2 168397 non-null object
...
dtypes: datetime64[ns](14), float64(14), int64(5), object(16)
memory usage: 381.5+ MB
>>> df['content2'].str.len().describe()
count 168397.000000
mean 111.846975
std 427.148959
min 4.000000
25% 72.000000
50% 73.000000
75% 73.000000
max 13320.000000
I allocated 128 GB of memory for this process. When I saved this DataFrame:
>>> timeit(lambda: df.to_csv('test.csv', index=False), number=5)
157.54342390294187
>>> store = pd.HDFStore('test.h5')
>>> timeit(lambda: store.put(key='df', value=df, append=False, index=False, complevel=9, complib='blosc:zstd', format='table', min_itemsize={'content1': 65535, 'content2': 65535}), number=5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 1030, in put
self._write_to_group(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 1697, in _write_to_group
s.write(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 4137, in write
table = self._create_axes(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 3806, in _create_axes
data_converted = _maybe_convert_for_string_atom(
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 4812, in _maybe_convert_for_string_atom
data_converted = _convert_string_array(data, encoding, errors).reshape(data.shape)
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/pandas/io/pytables.py", line 4857, in _convert_string_array
data = np.asarray(data, dtype=f"S{itemsize}")
File "/opt/apps/anaconda/2020.07/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
MemoryError: Unable to allocate 61.0 GiB for an array with shape (1, 1000000) and data type |S65535
Even with smaller subsets (like 100K rows) that don't trigger this MemoryError, saving to HDF5 is significantly slower than to_csv. This surprises me given the touted advantages of HDF5 for handling large datasets. Is it because HDF5 is still bad at handling text data? Note that the textual columns do have long-tail distributions and the content can be as long as 60K characters (which is why I set min_itemsize to 65535).
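If I understand the table format correctly, every row reserves the full min_itemsize for a string column, which matches the size in the error (the fixed-width array has to be allocated in memory before any compression happens):
rows = 1_000_000
itemsize = 65_535                  # bytes reserved per cell by min_itemsize=65535
print(rows * itemsize / 2**30)     # ~61.04, i.e. the "61.0 GiB" in the MemoryError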
I would appreciate any suggestion on the most appropriate file format to use, if not HDF5, or any fixes to my current approach. Thanks!
Related
A pandas DataFrame's file size increases significantly after saving it as .h5 the first time. If I save the loaded DataFrame again, the file size doesn't increase further. This makes me suspect that some kind of metadata is being saved the first time. What is the reason for this increase?
Is there an easy way to avoid it?
I can compress the file but I am making comparisons without compression. Would the problem scale differently with compression?
Example code below. The file size increases from 15.3 MB to 22.9 MB
import numpy as np
import pandas as pd

x = np.random.normal(0, 1, 1000000)
y = x * 2
dataset = pd.DataFrame({'Column1': x, 'Column2': y})
print(dataset.info(memory_usage='deep'))

dataset.to_hdf('data.h5', key='df', mode='w')
dataset2 = pd.read_hdf("data.h5")
print(dataset2.info(memory_usage='deep'))

dataset2.to_hdf('data2.h5', key='df', mode='w')
dataset3 = pd.read_hdf("data2.h5")
print(dataset3.info(memory_usage='deep'))
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 15.3 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
Column1 1000000 non-null float64
Column2 1000000 non-null float64
dtypes: float64(2)
memory usage: 22.9 MB
None
This is happening because the RangeIndex is converted to an Int64Index on save. Is there a way to optimise this? It looks like there's no way to drop the index:
https://github.com/pandas-dev/pandas/issues/8319
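To see where the extra space goes, the datasets pandas writes can be listed with h5py (assuming it is installed); with the default fixed format the index is materialised as its own array alongside the value blocks:
import h5py

with h5py.File('data.h5', 'r') as f:
    f.visit(print)   # e.g. df, df/axis0, df/axis1 (the index), df/block0_items, df/block0_values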
The best solution I have found so far is to save as pickle:
dataset.to_pickle("datapkl.pkl")
A less convenient option is to convert to NumPy and save with h5py, but then loading and converting back to pandas takes a lot of time:
import h5py

a = dataset.to_numpy()
h5f = h5py.File('datah5.h5', 'w')
h5f.create_dataset('dataset_1', data=a)
h5f.close()
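For completeness, loading back and rebuilding the DataFrame by hand looks roughly like this (continuing the snippet above; the column names have to be re-attached manually, since h5py stores only the raw array):
with h5py.File('datah5.h5', 'r') as h5f:
    arr = h5f['dataset_1'][:]
dataset4 = pd.DataFrame(arr, columns=['Column1', 'Column2'])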
I was experimenting with HDF and it seems pretty great because my data is not normalized and it contains a lot of text. I love being able to query when I read data into pandas.
loc2 = r'C:\\Users\Documents\\'
(my dataframe with data is called 'export')
from pandas import HDFStore

hdf = HDFStore(loc2 + 'consolidated.h5')
hdf.put('raw', export, format='table', complib='blosc', complevel=9, data_columns=True, append=True)
I have 21 columns and about 12 million rows so far, and I will add about 1 million rows per month.
1 Date column [I convert this to datetime64]
2 Datetime columns (one of them for each row and the other one is null about 70% of the time) [I convert this to datetime64]
9 text columns [I convert these to categorical which saves a ton of room]
1 float column
8 integer columns, 3 of these can reach a max of maybe a couple of hundred and the other 5 can only be 1 or 0 values
I made a nice small h5 table and it was perfect until I tried to append more data to it (literally just one day of data, since I am receiving daily raw .csv files). I received errors showing that the dtypes were not matching up for each column, even though I used the exact same IPython notebook.
Is my hdf.put code correct? If I have append=True, does that mean it will create the file if it does not exist, but append the data if it does? I will basically be appending to this file every day.
For columns which only contain 1 or 0, should I specify a dtype like int8 or int16 - will this save space, or should I keep it at int64? It looks like some of my columns are randomly float64 (although they have no decimals) and int64. I guess I need to specify the dtype for each column individually. Any tips?
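For concreteness, one common way to keep dtypes stable across daily files is to pin them explicitly before each append; a sketch, with invented column names, assuming those columns have no missing values:
import pandas as pd

# Hypothetical column names -- replace with the real ones.
dtype_map = {'flag_a': 'int8', 'flag_b': 'int8', 'count_a': 'int16'}
daily = pd.read_csv('daily.csv', dtype=dtype_map, parse_dates=['Date'])
daily = daily.astype(dtype_map)   # re-assert the dtypes right before hdf.put/append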
I have no idea what blosc compression is. Is it the most efficient one to use? Any recommendations here? This file is mainly used to quickly read data into a dataframe to join to other .csv files which Tableau is connected to.
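For the compression question, a rough way to decide is simply to write the same frame with a few of the complibs PyTables supports and compare file sizes and write times (a sketch; 'lzo' needs the optional lzo library installed):
import os

for lib in ('blosc', 'zlib', 'lzo'):
    path = 'consolidated_%s.h5' % lib
    export.to_hdf(path, key='raw', format='table', complib=lib, complevel=9)
    print('%s: %d bytes' % (lib, os.path.getsize(path)))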
Hello All and thanks in advance.
I'm trying to do periodic storing of financial data to a database for later querying. I am using pandas for almost all of the data handling. I want to append a DataFrame I have created to an HDF store. I read the CSV into a DataFrame and index it by timestamp; the DataFrame looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 900 entries, 1378400701110 to 1378410270251
Data columns (total 23 columns):
....
...Columns with numbers of non-null values....
.....
dtypes: float64(19), int64(4)
store = pd.HDFStore('store1.h5')
store.append('df', df)
print store
<class 'pandas.io.pytables.HDFStore'>
File path: store1.h5
/df frame_table (typ->appendable,nrows->900,ncols->23,indexers->[index])
But when I then try to do anything with store,
print store['df']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 289, in __getitem__
return self.get(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 422, in get
return self._read_group(group)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 930, in _read_group
return s.read(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 3175, in read
mgr = BlockManager([block], [cols_, index_])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1007, in __init__
self._set_ref_locs(do_refs=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 1117, in _set_ref_locs
"does not have _ref_locs set" % (block,labels))
AssertionError: cannot create BlockManager._ref_locs because block
[FloatBlock: [LastTrade, Bid1, Bid1Volume,....., Ask5Volume], 19 x 900, dtype float64]
with duplicate items
[Index([u'LastTrade', u'Bid1', u'Bid1Volume',..., u'Ask5Volume'], dtype=object)]
does not have _ref_locs set
I guess I am doing something wrong with the index; I'm quite new at this and have little know-how.
EDIT:
The data frame construction looks like:
columns = ['TimeStamp', 'LastTrade', 'Bid1', 'Bid1Volume', 'Bid1', 'Bid1Volume', 'Bid2', 'Bid2Volume', 'Bid3', 'Bid3Volume', 'Bid4', 'Bid4Volume',
'Bid5', 'Bid5Volume', 'Ask1', 'Ask1Volume', 'Ask2', 'Ask2Volume', 'Ask3', 'Ask3Volume', 'Ask4', 'Ask4Volume', 'Ask5', 'Ask5Volume']
df = pd.read_csv('/20130905.csv', names=columns, index_col=[0])
df.head() looks like:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1378400701110 to 1378400703105
Data columns (total 21 columns):
LastTrade 5 non-null values
Bid1 5 non-null values
Bid1Volume 5 non-null values
Bid1 5 non-null values
.................values
Ask4 5 non-null values
Ask4Volume 5 non-null values
dtypes: float64(17), int64(4)
There are too many columns for it to print out the contents, but for example:
print df['LastTrade'].iloc[10]
LastTrade 1.31202
Name: 1378400706093, dtype: float64
and Pandas version:
>>> pd.__version__
'0.12.0'
Any ideas would be thoroughly appreciated, thank you again.
Do you really have duplicate 'Bid1' and 'Bid1Volume' columns?
Unrelated, but you should also set the index to a datetime index:
import pandas as pd
df.index = pd.to_datetime(df.index, unit='ms')
This is a bug; it happens because the duplicate columns cross dtypes (not a big deal, but it goes undetected).
The easiest fix is simply not to have duplicate columns.
Will be fixed in 0.13, see here: https://github.com/pydata/pandas/pull/4768
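For concreteness, a sketch of that workaround (the 'Bid1b' names are invented): if the repeated 'Bid1'/'Bid1Volume' entries were just a copy-paste slip in the names= list, drop them; if the CSV genuinely has those extra columns, rename them so every column is unique before appending:
import pandas as pd

columns = ['TimeStamp', 'LastTrade',
           'Bid1', 'Bid1Volume', 'Bid1b', 'Bid1bVolume',   # renamed duplicates (hypothetical)
           'Bid2', 'Bid2Volume', 'Bid3', 'Bid3Volume', 'Bid4', 'Bid4Volume', 'Bid5', 'Bid5Volume',
           'Ask1', 'Ask1Volume', 'Ask2', 'Ask2Volume', 'Ask3', 'Ask3Volume', 'Ask4', 'Ask4Volume',
           'Ask5', 'Ask5Volume']
df = pd.read_csv('/20130905.csv', names=columns, index_col=[0])
df.index = pd.to_datetime(df.index, unit='ms')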
I'm using read_hdf for the first time and I love it. I want to use it to combine a bunch of smaller *.h5 files into one big file. I plan on calling append() on an HDFStore, and later I will add chunking to conserve memory.
An example table looks like this:
Int64Index: 220189 entries, 0 to 220188
Data columns (total 16 columns):
ID 220189 non-null values
duration 220189 non-null values
epochNanos 220189 non-null values
Tag 220189 non-null values
dtypes: object(1), uint64(3)
code:
import pandas as pd
print pd.__version__  # I am running 0.11.0

dest_h5f = pd.HDFStore('c:\\t3_combo.h5', complevel=9)
df = pd.read_hdf('\\t3\\t3_20130319.h5', 't3', mode='r')
print df
dest_h5f.append('t3', df, data_columns=True)   # 't3' is the key in the destination store
dest_h5f.close()
Problem: the append raises this exception:
Exception: cannot find the correct atom type -> [dtype->uint64,items->Index([InstrumentID], dtype=object)] 'module' object has no attribute 'Uint64Col'
This feels like a problem with my versions of PyTables or NumPy.
PyTables = v2.4.0, NumPy = v1.6.2
We normally represent epoch seconds as int64 and use datetime64[ns]. Try using datetime64[ns]; it will make your life easier. In any event, nanoseconds since 1970 are well within the range of int64 anyhow (and uint64 only buys you 2x this range), so there is no real advantage to using unsigned ints.
We use int64 because the minimum value (-9223372036854775808) is used to represent NaT, an integer marker for Not a Time.
In [11]: (Series([Timestamp('20130501')])-
Series([Timestamp('19700101')]))[0].astype('int64')
Out[11]: 1367366400000000000
In [12]: np.iinfo('int64').max
Out[12]: 9223372036854775807
You can then represent time from about the year 1677 until 2262 at the nanosecond level.
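A minimal sketch of that conversion for the table above, assuming epochNanos really is nanoseconds since the epoch and a reasonably recent pandas (cast through int64 first so pandas never hands PyTables a uint64 column):
# continuing the snippet from the question
df['epochNanos'] = pd.to_datetime(df['epochNanos'].astype('int64'))  # int64 ns -> datetime64[ns]
dest_h5f.append('t3', df, data_columns=True)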
I need to create a pandas DataFrame from a large file with space-delimited values and a row structure that depends on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
On one line or several, line breaks are ignored.
End result looks like this, if number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows depends on the number of columns, which is stated in the meta part of the file.
Currently I read the file into one big list and then split it into rows:
from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
(code is from itertools examples)
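For illustration, the current approach for the three-column case then looks roughly like this (`tokens` is just my name for the flat list of values):
import pandas as pd

tokens = open('raw_data.csv').read().split()   # one big in-memory list
rows = list(grouper(3, tokens))                # a second copy, grouped into rows of 3
df = pd.DataFrame(rows)                        # and a third copy inside pandas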
Problem is, I end up with multiple copies of the data in memory. With 500MB+ files this eats up the memory fast and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there an other way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re
filename = 'raw_data.csv'
class FileLike(file):
    """ Modeled after FileWrapper
    http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
    """
    def __init__(self, *args):
        super(FileLike, self).__init__(*args)
        self.buffer = []

    def next(self):
        # Refill the buffer with the 3-token chunks found on the next physical line.
        if not self.buffer:
            line = super(FileLike, self).next()
            self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
        if self.buffer:
            line = self.buffer.pop(0)  # pop from the front to preserve row order
        return line

with FileLike(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')
    print(len(df))
When I try using FileLike on a 5.8 MB file (consisting of 200,000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import sys
import re
filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace=True, backup='.bak'):
    for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
        print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
    df = pd.read_table(f, header=None, delimiter='\s+')
    print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to separate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None)
df = pd.DataFrame(s.values.reshape(len(s)/n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]:
0 1 2
0 2008231 4891866 383842
1 2036693 4924388 375170