Extremely high memory usage with pyarrow reading gzipped parquet files

Extremely high memory usage with pyarrow reading gzipped parquet files - pandas

I have a (set of) gzipped parquet files with about 210 columns, of which I am loading about 100 columns into a pandas dataframe. It works fine and very fast when the file size is about 1 MB (with about 50 rows); the python3 process consumes < 500 MB of RAM. However when the file is > 1.5 MB (70+ rows) it starts consuming 9-10 GB of RAM without ever loading the dataframe. If I specify just 2-3 columns, it is able to load them from the "big" file (still consuming that kind of RAM), but anything beyond that seems impossible. All columns are text.
I am currently using pandas.read_parquet, but I have also tried pyarrow.read_table with same results.
Any ideas what could be going on? I just don't understand why loading that amount of data should blow up RAM like that and become unusable. My objective with this is to load the data in parquet to a database, so if there are better ways to do it that would be great to know as well.
The code is below; it's just a simple usage of pandas.read_parquet.
import pandas as pd
df = pd.read_parquet(bytesIO_from_file, columns=[...])

There was a memory usage issue in pyarrow 0.14 that has been resolved: https://issues.apache.org/jira/browse/ARROW-6060
The upcoming 0.15 release will have this fix, as well as a bunch of other optimizations in Parquet reading. If you're curious to try it now, see the docs for installing the development version.

Related

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for awhile I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple python function with a Dask-delayed decorator on it. In my real use-case scenario this would be a "black box" type function within which I don't care what happens, so long as it stays with a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of ID's, and in the end builds up a result dataframe (which I could then do more stuff with)
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from?? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory
Please help me understand this behavior and, hopefully, implement a more memory-safe approach
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask
#dask.delayed
def do_pandas_thing(id):
print(f"STARTING: {id}")
N = 1.5e8
df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
print(
f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
)
# Simulate a "long" computation
time.sleep(5)
return df.iloc[[-1]] # return the last row
if __name__ == "__main__":
cluster = LocalCluster(
n_workers=2,
memory_limit="4GB",
threads_per_worker=1,
processes=True,
)
client = Client(cluster)
# Evaluate "black box" functions with pandas inside
results = []
for i in range(10):
results.append(do_pandas_thing(i))
# compute
r = dask.compute(results)[0]
print(pd.concat(r, ignore_index=True))

I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but it's a good idea to have workloads that are a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source

Dask-Rapids data movment and out of memory issue

I am using dask (2021.3.0) and rapids(0.18) in my project. In this, I am performing preprocessing task on the CPU, and later the preprocessed data is transferred to GPU for K-means clustering. But in this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(before using GPU memory completely it gave the error i.e. it is not using GPU memory completely)
I have a single GPU of size 40 GB.
Ram size 512 GB.
I am using following snippet of code:
cluster=LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
##perform my preprocessing on data and get output on variable A
# convert A varible to cupy
x = A.map_blocks(cp.asarray)
km =KMeans(n_clusters=4)
predict=km.fit_predict(x).compute()
I am also looking for a solution so that the data larger than GPU memory can be preprocessed, and whenever there is a spill in GPU memory the spilled data is transferred into temp directory or CPU (as we do with dask where we define temp directory when there is a spill in RAM).
Any help will be appriciated.

There are several ways to run larger than GPU datasets.
Check out Nick Becker's blog, which has a few methods well documented
Check out BlazingSQL, which is built on top of RAPIDS and can perform out of core processings. You can try it at beta.blazingsql.com.

Pyspark, dask, or any other python: how to pivot a large table without crashing laptop?

I can pivot a smaller dataset fine using pandas, dask, or pyspark.
However when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1000 columns and about 1.5 million rows. I suspect that on the way to the pivot table there must be some huge RAM usage that exceeds system memory, which I don't understand how pyspark or dask is used and useful if intermediate steps won't fit in ram at all times.
I thought dask and pyspark would allow larger than ram datasets even with just 8gb of ram. I also thought these libraries would chunk the data for me and never exceed the amount of ram that I have available. I realize that I could read in my huge dataset in very small chunks, and then pivot a chunk, and then immediately write the result of the pivot to a parquet or hdf5 file, manually. This should never exceed ram. But then wouldn't this manual effort defeat the purpose of all of these libraries? I am under the impression that what I am describing is definitely included right out of the box with these libraries, or am I wrong here?
If I have 100gb file of 300 million rows and want to pivot this using a laptop, it is even possible (I can wait a few hours if needed).
Can anyone help out here? I'll go ahead and add a bounty for this.
Simply please show me how to take a large parquet file that itself is too large for ram; pivot that into a table that is too large for ram, never exceeding available ram (say 8gb) at all times.
#df is a pyspark dataframe
df_pivot = df.groupby(df.id).pivot("city").agg(count(cd.visit_id))

Processing data on disk with a Pandas DataFrame

Is there a way to take a very large amount of data on disk (a few 100 GB) and interact with it on disk as a pandas dataframe?
Here's what I've done so far:
Described the data using pytables and this example:
http://www.pytables.org/usersguide/introduction.html
Run a test by loading a portion of the data (a few GB) into an HDF5 file
Converted the data into a dataframe using pd.DataFrame.from_records()
This last step loads all the data in memory.
I've looked for some way to describe the data as a pandas dataframe in step 1 but haven't been able to find a good set of instructions to do that. Is what I want to do feasible?

blaze is a nice way to interact with out-of-core data by using lazy expression evaluation. This uses pandas and PyTables under the hood (as well as a host of conversions with odo)

Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM.
I initially created the memory mapped file using numpy.memmap by reading in a lot of smaller files and processing their data and then writing the processed data to the memmap file. During the creation of the memmap file, I had no memory issues (I was using memmap.flush() periodically). Here's how I create the memory mapped file:
mmapData = np.memmap(mmapFile,mode='w+', shape=(large_no1,large_no2))
for i1 in np.arange(numFiles):
auxData = load_data_from(file[i1])
mmapData[i1,:] = auxData
mmapData.flush() % Do this every 10 iterations or so
However, when I try to access small portions (<10 MB) of the memmap file, it floods my whole ram when the memmap object is created. The machine slows down drastically and I can't do anything. Here's how I try to read in the data from the memory mapped file:
mmapData = np.memmap(mmapFile, mode='r',shape=(large_no1,large_no2))
aux1 = mmapData[5,1:1e7]
I thought using mmap or numpy.memmap should allow me to access parts of massive arrays without trying to load the whole thing to memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored in disk?

Could it be that you're looking at virtual, rather than physical memory consumption, and the slowdown is coming from something else?

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas