How can I read and manipulate large csv files in Google Colaboratory while not using all the RAM? - pandas

I am trying to import and manipulate compressed .csv files (that are each about 500MB in compressed form) in Google Colaboratory. There are 7 files. Using pandas.read_csv(), I "use all the available RAM" just after 2 files are imported and I have to restart my runtime.
I have searched forever on here looking for answers and have tried all the ones I came across, but none work. I have the files in my google drive and am mounted to it.
How can I read all of the files and manipulate them without using all the RAM? I have 12.72GB of RAM and 358.27GM of Disk.
Buying more RAM isn't an option.

To solve my problem, I created 7 cells (one for each data file). Within each cell I read the file, manipulated it, saved what I needed, then deleted everything:
import pandas as pd
import gc
df = pd.read_csv('Google drive path', compression = 'gzip')
filtered_df = df.query('my query condition here')
filtered_df.to_csv('new Google drive path', compression = 'gzip')
del df
del filtered_df
gc.collect()
After all 7 files, each about 500MB, for a total row-by-column size of 7,000,000 by 100, my RAM has stayed under 1MB.
Just using del didn't free up enough RAM. I had to use gc.collect() after in each cell.

Related

reading hdf5 file from s3 to sagemaker, is the whole file transferred?

I'm reading a file from my S3 bucket in a notebook in sagemaker studio (same account) using the following code:
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
h5_file = h5py.File(s3.open(s3url,'rb'), 'r')
data = h5_file.get(dataset_path_in_h5)
But I don't know what actually append behind the scene, does the whole h5 file is being transferred ? that's seems unlikely as the code is executed quite fast while the whole file is 20GB. Or is just the dataset in dataset_path_in_h5 is transferred ?
I suppose that if the whole file is transferred at each call it could cost me a lot.
When you open the file, a file object is created. It has a tiny memory footprint. The dataset values aren't read into memory until you access them.
You are returning data as a NumPy array. That loads the entire dataset into memory. (NOTE: the .get() method you are using is deprecated. Current syntax is provided in the example.)
As an alternative to returning an array, you can create a dataset object (which also has a small memory foorprint). When you do, the data is read into memory as you need it. Dataset objects behave like NumPy arrays. (Use of a dataset object vs NumPy array depends on downstream usage. Frequently you don't need an array, but sometimes they are required.) Also, if chunked I/O was enabled when the dataset was created, datasets are read in chunks.
Differences shown below. Note, I used Python's file context manager to open the file. It avoids problems if the file isn't closed properly (you forget or the program exits prematurely).
dataset_path_in_h5="/Mode1/SingleFault/SimulationCompleted/IDV2/Mode1_IDVInfo_2_100/Run1/processdata"
s3 = s3fs.S3FileSystem()
with h5py.File(s3.open(s3url,'rb'), 'r') as h5_file:
# your way to get a numpy array -- .get() is depreciated:
data = h5_file.get(dataset_path_in_h5)
# this is the preferred syntax to return an array:
data_arr = h5_file[dataset_path_in_h5][()]
# this returns a h5py dataset object:
data_ds = h5_file[dataset_path_in_h5] # deleted [()]

How can I process a large parquet file from spark in numpy/pandas?

I'm posting this for pandas, numpy and spark tags because I'm not really sure the best approach to solve this problem within those three systems.
I have a large parquet file that a downstream process is having trouble opening because it exceeds the system's memory(~63gb in memory if opened at once). I was writing the file as such:
FULL_MAIN.write.mode("overwrite").parquet(PATH+"/FULL_MAIN.parquet")
but the file was too big, so I tried to do this to break the file into smaller chucks:
split_factor = [.1,.1,.1,.1,.1,.1,.1,.1,.1,.1]
FULL_MAIN_RDD1,FULL_MAIN_RDD2,FULL_MAIN_RDD3,FULL_MAIN_RDD4,FULL_MAIN_RDD5, FULL_MAIN_RDD6,FULL_MAIN_RDD7,FULL_MAIN_RDD8,FULL_MAIN_RDD9,FULL_MAIN_RDD10 = FULL_MAIN.randomSplit(split_factor)
FULL_MAIN_RDD1.write.mode("overwrite").parquet(PATH+"/FULL_MAIN_RDD1.parquet")
FULL_MAIN_RDD2.write.mode("overwrite").parquet(PATH+"/FULL_MAIN_RDD2.parquet")
...
The problem with this approach is there are other dataframes that I needed to keep the rows aligned to and doing this random split is making the dataframes not aligned.
So my two questions are:
Is there way to split multiple dataframes by relative equal amounts when I don't have any row numbers or numeric counter for each row in my dataset?
Is there a way to read parquet files in batches in pandas or numpy? This would basically solve my problem on the downstream system. I can't figure out how to open the parquet in batches(I've tried to open it in pandas and then split the rows and save each file but when I load the dataframe it crashes my system). I am not sure if it's possible without exceeding memory.
Parquet file format supports row groups. Install pyarrow and use row_groups when creating parquet file:
df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")
Then you can read group-by-group (or even only specific group):
import pyarrow.parquet as pq
pq_file = pq.ParquetFile("filename.parquet")
n_groups = pq_file.num_row_groups
for grp_idx in range(n_groups):
df = pq_file.read_row_group(grp_idx, use_pandas_metadata=True).to_pandas()
process(df)
If you don't have control over creation of the parquet file, you still able to read only part of the file:
pq_file = pq.ParquetFile("filename.parquet")
batch_size = 10000 # records
batches = pq_file.iter_batches(batch_size, use_pandas_metadata=True) # batches will be a generator
for batch in batches:
df = batch.to_pandas()
process(df)
I am not sure if you are having spark . If you want to provide downstream smaller chunks of file , you use repartition to a desired number of chunks and rewrite the parquet file .
You can change the repartition number as per your need.
df = spark.read.parquet('filename.parquet')
df.repartition(200).mode('overwrite').save('targetPath')

Create Dataframe in Pandas - Out of memory error while reading Parquet files

I have a Windows 10 machine with 8 GB RAM and 5 cores.
I have created a parquet file compressed with gzip. The size of the file after compression is 137 MB.
When I am trying to read the parquet file through Pandas, dask and vaex, I am getting memory issues:
Pandas :
df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Dask:
import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed
Vaex:
df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Since Pandas /Python is meant for efficiency and 137 mb file is below par size , are there any recommended ways to create efficient dataframes? Libraries like Vaex, Dask claims to be very efficient.
For single machine, I would recommend Vaex with HDF file format. The data resides on hard disk and thus you can use bigger data sets. There is a built-in function in vaex that will read and convert bigger csv file into hdf file format.
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
Dask is optimized for distributed system. You read the big file in chunks and then scatter it among worker machines.
It is totally possible that a 137MB parquet file expands to 4GB in memory, due to efficient compression and encoding in parquet. You may have some options on load, please show your schema. Are you using fastparquet or pyarrow?
Since all of the engines you are trying to use are capable of loading one "row-group" at a time, I suppose you only have one row group, and so splitting won't work. You could load only a selection of columns to save memory, if this can accomplish your task (all the loaders support this).
Check that you are using the latest version of pyarrow. A few times updating has helped me.
pip install -U pyarrow
pip install pyarrow==0.15.0 worked for me.

How to load a large h5 file in memory?

I have a large h5 file with 5-dimensional numpy array in HDFS. File size is ~130Gb. I am facing memory issues while loading the file with process gets killed with OOM Error even though machine has 256Gb RAM. How can I write the file in chunks and load back in chunks? I looked around and found that h5py provides method to chunk the dataset like so but how do I load back the data in chunks? Also will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
Idea is to load the file in batches for less I/O time as well as memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make
it resizeable. You can allocate an initial size, then use the ".resize()" method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify to use a 5-d array/dataset.
import numpy as np
import h5py
with h5py.File('SO_64645940.h5','w') as h5w:
img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f', maxshape=(None,480,640),chunks=(10,480,640))
next_img_row = 0
arr = np.random.random(100*480*640).reshape(100,480,640)
for cnt in range(1,10):
# print(cnt,img_ds.len(),next_img_row)
if img_ds.len() == next_img_row :
img_ds.resize(100*cnt,axis=0)
print('new ds size=',img_ds.len())
h5w['Images'][next_img_row:next_img_row+100] = arr
next_img_row += 100
with h5py.File('SO_64645940.h5','r') as h5r:
for cnt in range(10):
print('get slice#',str(cnt))
img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contigous, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be that you build a function yourself to load the data chunkwise.
I made it for example this way for getting the data chunked:
def get_chunked(data, chunk_size=100):
for i in give_chunk(len(data), chunk_size):
chunked_array = data[i]
yield chunked_array
def give_chunk(length, chunk_size):
it = iter(range(length))
while True:
chunk = list(itertools.islice(it, chunk_size))
if not chunk:
break
yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/

Pandas reading csv into hdfstore thrashes, creates huge file

As a test, I'm trying to read a small 25 mg csv file using pandas.HDFStore:
store = pd.HDFStore('file.h5',mode='w')
for chunk in read_csv('file.csv',chunksize=50000):
store.append('df',chunk)
store.close()
It causes my computer to thrash and when it finally completes, file.h5 is 6.7 gigs. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe.
If I read the csv in without chunking and then add it to the store, I have no problems.
Update 1:
I'm running Anaconda, using python 2.7.6, HDF5 version 1.8.9, numpy 1.8.0, pytables 3.1.0, pandas 13.1, ubuntu 12.04.
The data is proprietary, so I can't post the chunk information online. I do have some mixed types. It still crashes if I try to read everything in as object.
Update 2:
Dropped all the columns with mixed type and I'm still getting the same issue. I have some very large text columns if that makes any difference.
Update 3:
The problem seems to be loading the dataframe into the hdfstore. I drastically reduced the size of my file, but kept one of my very wide columns (1259 characters). Whereas the size of the csv file is 878.6kb, the size of the hdfstore is 53 megs. Is pytables unable to handle very wide columns? Is there a threshold above which I should truncate?
The wide object columns are definitely the problem. My solution has been to truncate the object columns while reading them in. If I truncate to a width of 20 characters, the h5 file is only about twice as large as a csv file. However, if I truncate to 100 characters, the h5 file is about 6 times larger.
I include my code below as an answer, but if anyone has any idea how to reduce this size disparity without having to truncate so much text, I'd be grateful.
store = pd.HDFStore(filepath, 'w')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t',
na_values="null", error_bad_lines=False):
chunk = chunk.apply(truncateCol)
store.append(table, chunk)
def truncateCol(ser, width=100):
if ser.dtype == np.object:
ser = ser.str[:width] if ser.str.len().max() > width else ser
return ser