As a test, I'm trying to read a small 25 MB csv file using pandas.HDFStore:
import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()
It causes my computer to thrash and when it finally completes, file.h5 is 6.7 gigs. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe.
If I read the csv in without chunking and then add it to the store, I have no problems.
Update 1:
I'm running Anaconda with Python 2.7.6, HDF5 1.8.9, NumPy 1.8.0, PyTables 3.1.0, pandas 0.13.1, on Ubuntu 12.04.
The data is proprietary, so I can't post the chunk information online. I do have some mixed types. It still crashes if I try to read everything in as object.
Update 2:
Dropped all the columns with mixed type and I'm still getting the same issue. I have some very large text columns if that makes any difference.
Update 3:
The problem seems to be loading the dataframe into the hdfstore. I drastically reduced the size of my file, but kept one of my very wide columns (1259 characters). Whereas the size of the csv file is 878.6kb, the size of the hdfstore is 53 megs. Is pytables unable to handle very wide columns? Is there a threshold above which I should truncate?
The wide object columns are definitely the problem. My solution has been to truncate the object columns while reading them in. If I truncate to a width of 20 characters, the h5 file is only about twice as large as a csv file. However, if I truncate to 100 characters, the h5 file is about 6 times larger.
I include my code below as an answer, but if anyone has any idea how to reduce this size disparity without having to truncate so much text, I'd be grateful.
import numpy as np
import pandas as pd

def truncateCol(ser, width=100):
    # truncate wide object (string) columns so the store doesn't balloon
    if ser.dtype == np.object:
        ser = ser.str[:width] if ser.str.len().max() > width else ser
    return ser

store = pd.HDFStore(filepath, 'w')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t',
                         na_values="null", error_bad_lines=False):
    chunk = chunk.apply(truncateCol)
    store.append(table, chunk)
store.close()
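One possibility for shrinking the file without truncating as aggressively (untested here): the table format stores object columns as fixed-width strings, so opening the store with compression and pre-declaring the string width via min_itemsize may reclaim much of the padding. A rough sketch, reusing filepath, f and table from the snippet above; 'wide_col' and the width of 1300 are placeholders:
import pandas as pd

store = pd.HDFStore(filepath, mode='w', complevel=9, complib='blosc')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t', na_values="null"):
    # 'wide_col' stands in for the wide text column; 1300 comfortably covers 1259 characters
    store.append(table, chunk, min_itemsize={'wide_col': 1300})
store.close()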
Related
I'm posting this under the pandas, numpy, and spark tags because I'm not sure of the best approach to solve this problem within those three systems.
I have a large parquet file that a downstream process is having trouble opening because it exceeds the system's memory (~63 GB in memory if opened all at once). I was writing the file like this:
FULL_MAIN.write.mode("overwrite").parquet(PATH+"/FULL_MAIN.parquet")
but the file was too big, so I tried to break it into smaller chunks:
split_factor = [.1,.1,.1,.1,.1,.1,.1,.1,.1,.1]
FULL_MAIN_RDD1,FULL_MAIN_RDD2,FULL_MAIN_RDD3,FULL_MAIN_RDD4,FULL_MAIN_RDD5, FULL_MAIN_RDD6,FULL_MAIN_RDD7,FULL_MAIN_RDD8,FULL_MAIN_RDD9,FULL_MAIN_RDD10 = FULL_MAIN.randomSplit(split_factor)
FULL_MAIN_RDD1.write.mode("overwrite").parquet(PATH+"/FULL_MAIN_RDD1.parquet")
FULL_MAIN_RDD2.write.mode("overwrite").parquet(PATH+"/FULL_MAIN_RDD2.parquet")
...
The problem with this approach is that there are other dataframes whose rows I need to keep aligned with this one, and the random split breaks that alignment.
So my two questions are:
Is there a way to split multiple dataframes into relatively equal parts when I don't have any row numbers or numeric counter for each row in my dataset?
Is there a way to read parquet files in batches in pandas or numpy? This would basically solve my problem on the downstream system. I can't figure out how to open the parquet file in batches (I've tried opening it in pandas, splitting the rows, and saving each part to a file, but loading the dataframe crashes my system). I am not sure it's possible without exceeding memory.
The Parquet file format supports row groups. Install pyarrow and use row_group_size when creating the parquet file:
df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")
Then you can read it group by group (or even only a specific group):
import pyarrow.parquet as pq

pq_file = pq.ParquetFile("filename.parquet")
n_groups = pq_file.num_row_groups
for grp_idx in range(n_groups):
    df = pq_file.read_row_group(grp_idx, use_pandas_metadata=True).to_pandas()
    process(df)
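To read just one specific row group, for example the third one:
df = pq_file.read_row_group(2, use_pandas_metadata=True).to_pandas()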
If you don't have control over creation of the parquet file, you are still able to read only part of the file at a time:
pq_file = pq.ParquetFile("filename.parquet")
batch_size = 10000  # records
batches = pq_file.iter_batches(batch_size, use_pandas_metadata=True)  # batches is a generator
for batch in batches:
    df = batch.to_pandas()
    process(df)
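If memory is still tight, iter_batches also accepts a columns argument, so you can stream only the fields you need (the column names below are placeholders):
for batch in pq_file.iter_batches(batch_size, columns=["id", "value"]):
    df = batch.to_pandas()
    process(df)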
I am not sure if you are using Spark. If you want to provide the downstream process with smaller chunks of the file, you can repartition to a desired number of chunks and rewrite the parquet file. You can change the repartition number as per your need.
df = spark.read.parquet('filename.parquet')
df.repartition(200).write.mode('overwrite').parquet('targetPath')
I have a Windows 10 machine with 8 GB RAM and 5 cores.
I have created a parquet file compressed with gzip. The size of the file after compression is 137 MB.
When I try to read the parquet file with pandas, Dask and Vaex, I get memory issues:
Pandas :
df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Dask:
import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed
Vaex:
df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Since pandas/Python is meant to be efficient and a 137 MB file is small by its standards, are there any recommended ways to create efficient dataframes? Libraries like Vaex and Dask claim to be very efficient.
For a single machine, I would recommend Vaex with the HDF5 file format. The data resides on disk, so you can work with bigger datasets. There is a built-in function in Vaex that will read and convert a bigger CSV file into HDF5 format:
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
Dask is optimized for distributed systems. You read the big file in chunks and then scatter it among worker machines.
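Even on a single machine, though, Dask only materializes data when you call .compute(), so a sketch like the following keeps the full file out of memory as long as the final result is small ('some_column' is a placeholder for a real column in your data):
import dask.dataframe as dd

ddf = dd.read_parquet("C:\\files\\test.parquet")        # lazy: nothing is loaded yet
result = ddf.groupby("some_column").size().compute()    # only the small aggregate comes back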
It is entirely possible that a 137 MB parquet file expands to ~4 GB in memory, due to efficient compression and encoding in parquet. You may have some options on load; please show your schema. Are you using fastparquet or pyarrow?
Since all of the engines you are trying to use are capable of loading one row group at a time, I suppose you only have one row group, so splitting won't work. You could load only a selection of columns to save memory, if that accomplishes your task (all the loaders support this).
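For example, with pandas the column selection would look like this (the column names are placeholders):
df = pd.read_parquet("C:\\files\\test.parquet", columns=["col_a", "col_b"])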
Check that you are using the latest version of pyarrow. A few times updating has helped me.
pip install -U pyarrow
pip install pyarrow==0.15.0 worked for me.
I have a large h5 file containing a 5-dimensional NumPy array, stored in HDFS. The file is ~130 GB. I am facing memory issues while loading the file: the process gets killed with an OOM error even though the machine has 256 GB RAM. How can I write the file in chunks and load it back in chunks? I looked around and found that h5py provides a method to chunk the dataset like so, but how do I load back the data in chunks? Also, will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
Idea is to load the file in batches for less I/O time as well as memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make them resizeable. You can allocate an initial size, then use the .resize() method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify to use a 5-d array/dataset.
import numpy as np
import h5py

with h5py.File('SO_64645940.h5', 'w') as h5w:
    img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f',
                                maxshape=(None,480,640), chunks=(10,480,640))
    next_img_row = 0
    arr = np.random.random(100*480*640).reshape(100,480,640)
    for cnt in range(1, 10):
        # print(cnt, img_ds.len(), next_img_row)
        if img_ds.len() == next_img_row:
            img_ds.resize(100*cnt, axis=0)
            print('new ds size=', img_ds.len())
        h5w['Images'][next_img_row:next_img_row+100] = arr
        next_img_row += 100

with h5py.File('SO_64645940.h5', 'r') as h5r:
    for cnt in range(10):
        print('get slice#', str(cnt))
        img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contiguously, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be to build a function yourself to load the data chunkwise. For example, I did it this way to get the data in chunks:
import itertools

def get_chunked(data, chunk_size=100):
    for i in give_chunk(len(data), chunk_size):
        chunked_array = data[i]
        yield chunked_array

def give_chunk(length, chunk_size):
    it = iter(range(length))
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
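As a rough sketch of that write path (the shapes and file names below are assumptions, not from the original question): create the dataset up front, then assign one slice at a time.
import h5py
import numpy as np

data = np.random.random((1000, 480, 640)).astype('f4')  # stand-in for your real array
with h5py.File('chunk_written.h5', 'w') as f:
    dset = f.create_dataset('Images', shape=data.shape, dtype='f4', chunks=(100, 480, 640))
    for start in range(0, data.shape[0], 100):
        # write 100 rows at a time via slicing
        dset[start:start + 100] = data[start:start + 100]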
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/
I am trying to import and manipulate compressed .csv files (that are each about 500MB in compressed form) in Google Colaboratory. There are 7 files. Using pandas.read_csv(), I "use all the available RAM" just after 2 files are imported and I have to restart my runtime.
I have searched forever on here looking for answers and have tried all the ones I came across, but none work. I have the files in my google drive and am mounted to it.
How can I read all of the files and manipulate them without using all the RAM? I have 12.72 GB of RAM and 358.27 GB of disk.
Buying more RAM isn't an option.
To solve my problem, I created 7 cells (one for each data file). Within each cell I read the file, manipulated it, saved what I needed, then deleted everything:
import pandas as pd
import gc
df = pd.read_csv('Google drive path', compression = 'gzip')
filtered_df = df.query('my query condition here')
filtered_df.to_csv('new Google drive path', compression = 'gzip')
del df
del filtered_df
gc.collect()
After all 7 files, each about 500MB, for a total row-by-column size of 7,000,000 by 100, my RAM has stayed under 1MB.
Just using del didn't free up enough RAM. I had to use gc.collect() after in each cell.
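An alternative sketch that avoids ever holding a full file in RAM (assuming the same query works per chunk) is to stream each CSV with chunksize:
import pandas as pd

chunks = pd.read_csv('Google drive path', compression='gzip', chunksize=500_000)
filtered_df = pd.concat(chunk.query('my query condition here') for chunk in chunks)
filtered_df.to_csv('new Google drive path', compression='gzip')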
Pandas has a method .to_hdf() to save a dataframe as an HDF table.
However, each time the command .to_hdf(path, key) is run, the size of the file increases.
import os
import string
import pandas as pd
import numpy as np
size = 10**4
df = pd.DataFrame({"C": np.random.randint(0, 100, size),
                   "D": np.random.choice(list(string.ascii_lowercase), size=size)})

for iteration in range(4):
    df.to_hdf("a_file.h5", "key1")
    print(os.path.getsize("a_file.h5"))
And the output clearly shows that the size of the file is increasing:
# 1240552
# 1262856
# 1285160
# 1307464
Since the same dataframe is saved to the same key each time, the hdf file size should stay constant.
While the increase seems quite modest for a small df, with larger dataframes it quickly leads to hdf files that are significantly bigger than the file after the first save.
Sizes I get with a 10**7 long dataframe after 7 iterations:
29MB, 48MB, 67MB, 86MB, 105MB, 125MB, 144MB
Why is the hdf file size not constant, and why does it increase with each new to_hdf()?
This behavior is easy to miss if you only skim the documentation (which is 2973 PDF pages long), but it can be found in #1643 and in the warning in the IO Tools / "delete from a table" section of the documentation:
If you do not specify anything, the default write mode is 'a' (append), which is what a plain df.to_hdf('a_path.h5', 'a_key') uses, and it will nearly double the size of your hdf file each time you run your script.
The solution is to use write mode: df.to_hdf('a_path.h5', 'a_key', mode='w')
However, this behavior only happens with the fixed format (the default format), not with the table format (unless append is set to True).
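Applied to the loop from the question (reusing df from the snippet above), the fix looks like this:
for iteration in range(4):
    df.to_hdf("a_file.h5", "key1", mode="w")   # rewrite the file instead of appending to it
    print(os.path.getsize("a_file.h5"))        # the size now stays constant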