Jupyter kernel dies while reading file - pandas

I am reading a 22.2 GB csv file into a pandas DataFrame in a Jupyter notebook on an EC2 instance, but I keep getting this error:
The instance is a t3.2xlarge; while reading the file the CPU utilization is 13.4% and the total volume size is 60 GB.
I am not sure what is causing this issue. Any ideas?

Related

Memory Leak - After every request hit on Flask API running in a container

I have a Flask app running in a container on EC2. When the container starts, docker stats shows memory usage close to 48 MB. After the first API call (reading a 2 GB file from S3), the usage rises to 5.72 GB, and even after the call completes the usage does not go down.
Each request increases the usage by roughly twice the file size, and after a few requests the server starts throwing a memory error.
Also, when running the same Flask app outside a container, we do not see any such increase in memory used.
Output of "docker stats <container_id>" before hitting the API-
Output of "docker stats <container_id>" after hitting the API
Flask app (app.py) contains-
import os
import json
import pandas as pd
import flask

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    s3_path = json_input['s3_path']
    # reading file directly from s3 - without downloading
    df = pd.read_csv(s3_path)
    print(df.head(5))
    # clearing df
    df = None
    return json_input

@app.route('/healthcheck', methods=['GET'])
def HealthCheck():
    return "Success"

if __name__ == '__main__':
    app.run(host="0.0.0.0", port='8898')
Dockerfile contains-
FROM python:3.7.10
RUN apt-get update -y && apt-get install -y python-dev
# We copy just the requirements.txt first to leverage Docker cache
COPY . /app_abhi
WORKDIR /app_abhi
EXPOSE 8898
RUN pip3 install flask boto3 pandas fsspec s3fs
CMD [ "python","-u", "app.py" ]
I tried reading the file directly from S3 as well as downloading it and then reading it, but the memory behaviour was the same.
Any leads on getting this memory utilization back down to the initial consumption would be a great help!
You can try the following possible solutions:
Update the dtype of the columns:
Pandas, by default, tries to infer the dtype of each column when it creates a DataFrame. Certain inferred types can result in large memory allocations. You can reduce memory by specifying the dtypes of such columns explicitly, e.g. downcasting integer columns to np.int8 and float columns to np.float16. Refer to this: Pandas/Python memory spike while reading 3.2 GB file
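For example, a minimal sketch (the column names and path here are placeholders, not from the question):

import numpy as np
import pandas as pd

# Hypothetical columns; map the real columns of your file to the narrowest dtype that fits.
dtypes = {"user_id": np.int32, "score": np.float32, "flag": np.int8}
df = pd.read_csv("s3://my-bucket/big_file.csv", dtype=dtypes)
print(df.dtypes)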
Read data in chunks:
You can read the data in chunks of a given size, perform the required processing on each chunk, and then move on to the next one, so the entire dataset is never held in memory at once. Reading in chunks can be slower than reading everything at once, but it is memory efficient.
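A minimal sketch of that pattern (the path and the per-chunk work are placeholders):

import pandas as pd

total_rows = 0
for chunk in pd.read_csv("s3://my-bucket/big_file.csv", chunksize=100_000):
    # do the required processing per chunk, e.g. aggregate or filter, then discard it
    total_rows += len(chunk)
print(total_rows)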
Try a different library: Dask DataFrame is used in situations where pandas is commonly needed but fails due to data size or computation speed. Not every built-in pandas operation is available in Dask, though. https://docs.dask.org/en/latest/dataframe.html
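A rough sketch of the Dask equivalent (the path is a placeholder; dd.read_csv is lazy and works partition by partition):

import dask.dataframe as dd

ddf = dd.read_csv("s3://my-bucket/big_file.csv")  # lazy: nothing is read yet
print(ddf.head())   # materializes only the first partition
print(len(ddf))     # row count, computed partition by partition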
The memory growth is almost certainly caused by constructing the dataframe.
df = None doesn't return that memory to the operating system, though it does return memory to the heap managed within the process. There's an explanation for that in How do I release memory used by a pandas dataframe?
I had a similar problem (see question Google Cloud Run: script requires little memory, yet reaches memory limit)
Finally, I was able to solve it by adding
import gc
...
gc.collect()
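Applied to the handler in the question, it looks roughly like this (a sketch; whether the RSS reported by docker stats actually drops still depends on the allocator handing memory back to the OS):

import gc
import flask
import pandas as pd

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    df = pd.read_csv(json_input['s3_path'])
    print(df.head(5))
    del df          # drop the only reference to the DataFrame
    gc.collect()    # ask the collector to reclaim it right away
    return json_input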

Create Dataframe in Pandas - Out of memory error while reading Parquet files

I have a Windows 10 machine with 8 GB RAM and 5 cores.
I have created a parquet file compressed with gzip. The size of the file after compression is 137 MB.
When I am trying to read the parquet file through Pandas, dask and vaex, I am getting memory issues:
Pandas :
df = pd.read_parquet("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Dask:
import dask.dataframe as dd
df = dd.read_parquet("C:\\files\\test.parquet").compute()
OSError: Out of memory: realloc of size 3915749376 failed
Vaex:
df = vaex.open("C:\\files\\test.parquet")
OSError: Out of memory: realloc of size 3915749376 failed
Since pandas/Python are meant to be efficient and a 137 MB file is well below a problematic size, are there any recommended ways to create memory-efficient DataFrames? Libraries like Vaex and Dask claim to be very efficient.
For a single machine, I would recommend Vaex with the HDF5 file format. The data resides on disk and is memory-mapped, so you can work with datasets bigger than RAM. Vaex has a built-in function that reads a large CSV file and converts it into HDF5:
df = vaex.from_csv('./my_data/my_big_file.csv', convert=True, chunk_size=5_000_000)
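Once the conversion has run, you can reopen the generated HDF5 file directly and it will be memory-mapped rather than loaded into RAM (the exact output filename is an assumption; vaex normally writes it next to the source CSV):

import vaex

df = vaex.open('./my_data/my_big_file.csv.hdf5')  # memory-mapped, loads lazily
print(df.shape)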
Dask is optimized for distributed systems: you read the big file in chunks and then scatter them among worker machines.
It is entirely possible that a 137 MB parquet file expands to 4 GB in memory, due to the efficient compression and encoding in parquet. You may have some options at load time; please show your schema. Are you using fastparquet or pyarrow?
Since all of the engines you are trying to use are capable of loading one "row-group" at a time, I suppose you only have one row group, and so splitting won't work. You could load only a selection of columns to save memory, if this can accomplish your task (all the loaders support this).
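For example (a sketch; the column names are placeholders for whichever subset your task actually needs):

import pandas as pd

# Parquet is columnar, so unselected columns are never read from disk.
df = pd.read_parquet("C:\\files\\test.parquet", columns=["col_a", "col_b"])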
Check that you are using the latest version of pyarrow. A few times updating has helped me.
pip install -U pyarrow
pip install pyarrow==0.15.0 worked for me.

How to load a large h5 file in memory?

I have a large h5 file containing a 5-dimensional numpy array in HDFS. The file size is ~130 GB. I am facing memory issues while loading it: the process gets killed with an OOM error even though the machine has 256 GB of RAM. How can I write the file in chunks and load it back in chunks? I looked around and found that h5py provides a method to chunk the dataset, like so, but how do I load the data back in chunks? Also, will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
The idea is to load the file in batches to reduce I/O time as well as avoid memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make them resizeable. You can allocate an initial size, then use the .resize() method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify it to use a 5-d array/dataset.
import numpy as np
import h5py

with h5py.File('SO_64645940.h5','w') as h5w:
    img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f', maxshape=(None,480,640), chunks=(10,480,640))
    next_img_row = 0
    arr = np.random.random(100*480*640).reshape(100,480,640)
    for cnt in range(1,10):
        # print(cnt,img_ds.len(),next_img_row)
        if img_ds.len() == next_img_row:
            img_ds.resize(100*cnt,axis=0)
            print('new ds size=',img_ds.len())
        h5w['Images'][next_img_row:next_img_row+100] = arr
        next_img_row += 100

with h5py.File('SO_64645940.h5','r') as h5r:
    for cnt in range(10):
        print('get slice#',str(cnt))
        img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contiguously, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be that you build a function yourself to load the data chunkwise.
For example, I did it this way to read the data in chunks:
import itertools

def get_chunked(data, chunk_size=100):
    for i in give_chunk(len(data), chunk_size):
        chunked_array = data[i]
        yield chunked_array

def give_chunk(length, chunk_size):
    it = iter(range(length))
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
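A minimal sketch of that write pattern (the shapes, names and random data are made up for illustration):

import numpy as np
import h5py

data = np.random.random((500, 64, 64)).astype('f')  # small stand-in for your big array
with h5py.File('chunked_write.h5', 'w') as h5f:
    dset = h5f.create_dataset('Images', shape=data.shape, dtype='f',
                              chunks=(100, 64, 64))
    for start in range(0, data.shape[0], 100):
        dset[start:start + 100] = data[start:start + 100]  # write one batch of 100 at a time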
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/

How can I read and manipulate large csv files in Google Colaboratory while not using all the RAM?

I am trying to import and manipulate compressed .csv files (that are each about 500MB in compressed form) in Google Colaboratory. There are 7 files. Using pandas.read_csv(), I "use all the available RAM" just after 2 files are imported and I have to restart my runtime.
I have searched forever on here looking for answers and have tried all the ones I came across, but none work. I have the files in my Google Drive, which is mounted.
How can I read all of the files and manipulate them without using all the RAM? I have 12.72 GB of RAM and 358.27 GB of disk.
Buying more RAM isn't an option.
To solve my problem, I created 7 cells (one for each data file). Within each cell I read the file, manipulated it, saved what I needed, then deleted everything:
import pandas as pd
import gc
df = pd.read_csv('Google drive path', compression = 'gzip')
filtered_df = df.query('my query condition here')
filtered_df.to_csv('new Google drive path', compression = 'gzip')
del df
del filtered_df
gc.collect()
After all 7 files, each about 500MB, for a total row-by-column size of 7,000,000 by 100, my RAM has stayed under 1MB.
Just using del didn't free up enough RAM; I had to call gc.collect() afterwards in each cell.

Pandas reading csv into hdfstore thrashes, creates huge file

As a test, I'm trying to read a small 25 MB csv file using pandas.HDFStore:
store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()
It causes my computer to thrash and when it finally completes, file.h5 is 6.7 gigs. I don't know what is causing the file size to balloon: when I look at the store afterwards, the only thing in there is the small dataframe.
If I read the csv in without chunking and then add it to the store, I have no problems.
Update 1:
I'm running Anaconda with Python 2.7.6, HDF5 1.8.9, NumPy 1.8.0, PyTables 3.1.0, and pandas 0.13.1 on Ubuntu 12.04.
The data is proprietary, so I can't post the chunk information online. I do have some mixed types. It still crashes if I try to read everything in as object.
Update 2:
Dropped all the columns with mixed type and I'm still getting the same issue. I have some very large text columns if that makes any difference.
Update 3:
The problem seems to be loading the dataframe into the hdfstore. I drastically reduced the size of my file, but kept one of my very wide columns (1259 characters). Whereas the size of the csv file is 878.6kb, the size of the hdfstore is 53 megs. Is pytables unable to handle very wide columns? Is there a threshold above which I should truncate?
The wide object columns are definitely the problem. My solution has been to truncate the object columns while reading them in. If I truncate to a width of 20 characters, the h5 file is only about twice as large as a csv file. However, if I truncate to 100 characters, the h5 file is about 6 times larger.
I include my code below as an answer, but if anyone has any idea how to reduce this size disparity without having to truncate so much text, I'd be grateful.
import numpy as np
import pandas as pd

def truncateCol(ser, width=100):
    if ser.dtype == np.object:
        ser = ser.str[:width] if ser.str.len().max() > width else ser
    return ser

# filepath, f and table are defined elsewhere in my script
store = pd.HDFStore(filepath, 'w')
for chunk in pd.read_csv(f, chunksize=5000, sep='\t',
                         na_values="null", error_bad_lines=False):
    chunk = chunk.apply(truncateCol)
    store.append(table, chunk)
store.close()