Pandas v0.25 groupby with many columns gives memory error

After updating to pandas v0.25.2, a script doing a groupby over many columns on a large dataframe no longer works. I get a memory error:
MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64
Doing a bit of research, I found issue #14942 reported on GitHub for an earlier version:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')
# killed after 6 minutes of 100% cpu and 90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()
Running this code (on version 0.25.2) also gives a memory error. Am I doing something wrong (has the groupby syntax changed in pandas v0.25?), or has this issue, which is marked as resolved, returned?

Use observed=True to fix it and prevent the groupby from expanding all possible combinations of the categorical (factor) variables:
df.groupby(index, observed=True)
There is a related GitHub Issue: PERF: groupby with many empty groups memory blowup.
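Applied to the code from the question, a minimal sketch of the fix looks like this (only the keyword argument changes):
grouped = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
With observed=True, only the category combinations that actually occur in the data are materialized, instead of the full Cartesian product of all categories.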

While the proposed solution addresses the issue, another problem is likely to arise when dealing with larger datasets: pandas groupby is slow and memory hungry and may need 5-10x the memory of the dataset. A more effective solution is to use a tool that is an order of magnitude faster, less memory hungry, and seamlessly integrates with pandas: it reads directly from the dataframe's memory, so there is no data round trip and typically no need for extensive data chunking.
My new tool of choice for quick data aggregation is https://duckdb.org. It takes your existing dataframe df and queries it directly, without even importing it into the database. Using your dataframe generation code, the whole aggregation finished in about 0.45 s (a sketch of the query is shown after the wrapper class below). Not sure why pandas does not use DuckDB for the groupby under the hood.
The db object is created using this small wrapper class, which lets you simply type db = DuckDB() and be ready to explore the data in any project. You can expand it further, or even simplify it to a %sql magic as described in the DuckDB documentation. By the way, sql() returns a dataframe, so you can also chain calls like db.sql(...).pivot_table(...); it is that simple.
import duckdb

class DuckDB:
    def __init__(self, db=None):
        self.db_loc = db or ':memory:'
        self.db = duckdb.connect(self.db_loc)

    def sql(self, sql=""):
        return self.db.execute(sql).fetchdf()

    def __del__(self):
        self.db.close()
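For illustration, here is a rough sketch of the groupby from the question run through the wrapper (it assumes df from the question is defined at the top level of the session, so DuckDB's replacement scan can find it by name):
db = DuckDB()
# DuckDB scans the pandas dataframe df in place; no import/copy step is needed
grouped = db.sql("""
    SELECT cat, int_id, other_id, COUNT(*) AS n
    FROM df
    GROUP BY cat, int_id, other_id
""")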
Note: DuckDB is good but not perfect; still, it turned out to be far more stable than Dask or even PySpark, with a much simpler setup. For larger datasets you may need a real database, but for datasets that fit in memory it is great. Regarding memory usage: with a larger dataset, make sure you limit DuckDB's memory with a pragma, as it will otherwise eat all of your RAM in no time. The limit simply spills the extra to disk, without you having to deal with data chunking. Also, treat it as an in-memory database rather than a persistent one: if you need some results stored, export them to parquet instead of saving the database file, because the file format is not stable between releases and you would have to export to parquet anyway to move from one version to the next.
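As a rough illustration of both points (the exact limit and paths are placeholders, tune them to your machine):
# cap DuckDB's memory and give it a spill directory instead of letting it take all RAM
db.sql("PRAGMA memory_limit='8GB'")
db.sql("PRAGMA temp_directory='/tmp/duckdb_spill'")
# persist results as parquet rather than relying on the database file format
db.sql("COPY (SELECT cat, int_id, other_id, COUNT(*) AS n FROM df GROUP BY cat, int_id, other_id) TO 'grouped.parquet' (FORMAT PARQUET)")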
I expanded this dataframe to 300 million records (around 1.2 billion values in total, or roughly 9 GB). It still completed your groupby and other summary stats on a 32 GB machine with about 18 GB still free.

Related

Unmanaged memory jamming cluster during dask's merge_asof method

I am trying to merge large dataframes using dask.dataframe.multi.merge_asof, but I am running into issues with accumulating unmanaged memory on the cluster.
I have boiled down the problem to the following. It essentially just creates some sample data with timestamps using dask, converts them to pandas dataframes, and then runs dask's merge_asof implementation. It quickly exceeds the memory of the workers with unmanaged memory, and eventually the cluster just gets stuck (it does not crash, it just stops doing anything).
It looks like this:
from distributed import Client, LocalCluster
import pandas as pd
import dask
import dask.dataframe
client = Client(LocalCluster(n_workers=40))
# make sample datasets, and convert them to pandas dataframes
left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index().compute()
right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index().compute()
left = dask.dataframe.from_pandas(left, npartitions=250)
right = dask.dataframe.from_pandas(right, npartitions=250)
# Alternative to above block (no detour via pandas)
#left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index()
#right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index()
# (dask crashes on datetime, so convert to int first)
left['timestamp'] = left['timestamp'].values.astype('int64')
right['timestamp'] = right['timestamp'].values.astype('int64')
dask.dataframe.multi.merge_asof(
    left=left.sort_values(by='timestamp'),
    right=right.sort_values(by='timestamp'),
    tolerance=pd.to_timedelta(10000000, unit='seconds').delta,
    direction='forward',
    left_on='timestamp',
    right_on='timestamp',
).compute()
Note that I deliberately call compute() on the sample data to get pandas dataframes, since in the final use case I also need to start from pandas dataframes, so I need to "model" that step as well.
Interestingly, not making this detour works much better: if I comment out the first data creation block and use the one labelled as an alternative instead, the merge is successful (still some unmanaged memory, but not enough to stop the cluster).
While I suspect that keeping the local pandas dataframes in machine memory may add to the unmanaged memory (?), the data is far from taking up all of the machine's RAM (200 GiB). So my questions are:
Why does the cluster hang despite memory being available on the machine?
Why does going through pandas dataframes as an intermediate step affect memory management during merge_asof?
I have tried various things with regard to garbage collection, thinking there might be some pandas data sticking around in memory while merge_asof is executed, but to no avail. I have also tried garbage collection on the workers, as well as automatic and manual memory trimming, as described in this blog post and this video, neither of which showed any effect worth mentioning.
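Roughly, the manual trimming follows the pattern from the linked post:
import ctypes
import gc

def trim_memory() -> int:
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

client.run(gc.collect)    # force a garbage collection pass on every worker
client.run(trim_memory)   # ask glibc to return freed arenas to the OS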

Pandas to Koalas (Databricks) conversion code for big scoring dataset

I have been encountering OOM errors while trying to score a huge dataset. The dataset shape is (15 million, 230). Since the working environment is Databricks, I decided to update the scoring code to Koalas and take advantage of the Spark architecture to alleviate my memory issues.
However, I've run into some issues trying to convert part of my code from pandas to koalas. Any help into how to work around this issue is much appreciated.
Currently, I'm trying to add a few adjusted columns to my dataframe but I'm getting a PandasNotImplementedError : The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Code/Problem area :
df[new_sixmon_cols] = df[sixmon_cols].div([min(6,i) for i in df['mob']],axis=0)
df[new_twelvemon_cols] = df[twelvemon_cols].div([min(12,i) for i in df['mob']],axis=0)
df[new_eighteenmon_cols] = df[eighteenmon_cols].div([min(18,i) for i in df['mob']],axis=0)
df[new_twentyfourmon_cols] = df[twentyfourmon_cols].div([min(24,i) for i in df['mob']],axis=0)
print('The shape of df after add adjusted columns for all non indicator columns is:')
print(df.shape)
I believe the problem area is the div([min(6,i) for i in df['mob']]) part, but I'm not certain how to convert this particular piece of code efficiently, or, in general, how to handle scoring a big dataset leveraging Databricks or the cloud environment.
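For reference, the same adjustment written without iterating over the Series would be something like this in plain pandas (I have not verified whether it works unchanged in Koalas):
# clip 'mob' at the cap instead of building a Python list element by element
df[new_sixmon_cols] = df[sixmon_cols].div(df['mob'].clip(upper=6), axis=0)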
Some pointers about the data/model:
The data is feature reduced and selected of course.
I built the model with 2.5m records and now I'm trying to work on scoring files.

Pandas is using a lot of memory

I have Flask app code where an API is exposed to dump data from an Oracle database to a Postgres database.
I am using pandas to copy the content of the tables from Oracle, MySQL and Postgres to Postgres.
After running constantly for 15 days or so, the CPU and memory consumption is very high.
It usually transfers at least 5 million records every two days.
Can anyone help me optimize the pandas write?
If you have a preprocessing step, I suggest using dask. Dask offers parallel computation and does not fill up memory unless you explicitly force it to (the "force" being the computation of a task on the dataframe). Refer to the documentation here for the dask read_sql_table method.
import dask.dataframe as dd

# read the data as a dask dataframe
# (read_csv is only a placeholder here -- swap in the reader that matches your
#  source, e.g. read_sql_table for a database table)
df = dd.read_csv('path/to/file')

# ... do the preprocessing steps on the data; they stay lazy ...

# finally write it out -- only at this point is the computation actually run
This solution comes in very handy if you have to deal with a large dataset and a preprocessing step, ideally one that reduces the data. Refer to the documentation here for more information. It may give a significant improvement depending on your preprocessing step.
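A minimal sketch of that approach (connection strings, table and column names are placeholders; the partitioning column should be an indexed numeric or datetime column):
import dask.dataframe as dd

# read the source table in partitions instead of loading it all at once
df = dd.read_sql_table(
    'source_table',
    'oracle+cx_oracle://user:pass@host:1521/?service_name=svc',
    index_col='id',
)

# ... lazy preprocessing on df ...

# write to Postgres partition by partition
df.to_sql(
    'target_table',
    'postgresql://user:pass@host:5432/dbname',
    if_exists='append',
    parallel=True,
)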
Alternatively, you can use the chunksize parameter of pandas, as @TrigonaMinima suggested. This lets your machine retrieve the data in chunks, "x rows at a time", so you can process each chunk with your preprocessing as above; it may require you to create temp files and append to them.
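A minimal sketch of the chunked variant (connection objects and table names are placeholders):
import pandas as pd

# stream the source table in chunks instead of loading it whole
for chunk in pd.read_sql('SELECT * FROM source_table', oracle_conn, chunksize=50000):
    # ... per-chunk preprocessing here ...
    chunk.to_sql('target_table', postgres_engine, if_exists='append', index=False)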

Is there any way to Drop pyspark Dataframe after converting it into pandas Dataframe via toPandas()

I create a Spark DataFrame from an input text file of size 4 GB using PySpark, then apply some conditions like:
df.cache() #cache df for fast execution of later instruction
df_pd = df.where(df.column1=='some_value').toPandas() #around 70% of data
Now I am doing all operations on the pandas DataFrame df_pd, and my memory usage becomes around 13 GB.
Why is so much memory consumed?
How can I make my computation faster and more efficient? # here df.cache() took 10 minutes for caching
I tried to free up the PySpark DF memory by using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
Note: I am mainly using PySpark because it uses the CPU cores efficiently, while pandas only uses a single core of my machine for the file read operation.
Why is so much memory consumed?
I would say it is the duplication of the dataframe in memory, as you suggested.
How can I make my computation faster and more efficient? # here df.cache() took 10 minutes to run
df.cache() is only useful if you're going to use this df multiple times. Think of it as a checkpoint, only useful when you need to do multiple operations on the same dataframe. Here, it is not necessary since you're only performing one operation. More info here.
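A small sketch of the distinction, reusing the column from your snippet:
# cache() pays off only when the same df feeds several actions
df.cache()
matched = df.where(df.column1 == 'some_value').count()   # first action materializes the cache
others = df.where(df.column1 != 'some_value').count()    # later actions reuse the cached data
# with a single action, as in your code, the caching step is pure overhead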
I tried to free up the PySpark DF memory by using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
unpersist is the right thing to do. As for sqlContext.clearCache(), I don't know which version of Spark you're using, but you may want to take a look at spark.catalog.clearCache().
Although I know this does not directly answer your question, I hope it may help!
What about trying to delete the PySpark df? :
del(df)
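A rough cleanup sequence combining the suggestions above (it assumes spark is the active SparkSession):
df.unpersist()              # drop the cached blocks for this DataFrame
spark.catalog.clearCache()  # drop any remaining cached tables/DataFrames
del df                      # remove the Python reference so it can be garbage collected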

Memory is not released when taking a small slice of a DataFrame

Summary
adataframe is a DataFrame with 800k rows. Naturally, it consumes a bit of memory. When I do this:
adataframe = adataframe.tail(144)
memory is not released.
You could argue that the memory is released, that it only appears to be used because it is marked free and will be reused by Python. However, if I attempt to create a new 800k-row DataFrame and again keep only a small slice, memory usage grows. If I do it again, it grows again, ad infinitum.
I'm using Debian Jessie's Python 3.4.2 with Pandas 0.18.1 and numpy 1.11.1.
Demonstration with minimal program
With the following program I create a dictionary
data = {
    0: a_DataFrame_loaded_from_a_CSV,_only_the_last_144_rows,
    1: same_thing,
    # ...
    9: same_thing,
}
and I monitor memory usage while I'm creating the dictionary. Here it is:
#!/usr/bin/env python3
from resource import getrusage, RUSAGE_SELF
import pandas as pd

def print_memory_usage():
    print(getrusage(RUSAGE_SELF).ru_maxrss)

def read_dataframe_from_csv(f):
    result = pd.read_csv(f, parse_dates=[0],
                         names=('date', 'value', 'flags'),
                         usecols=('date', 'value', 'flags'),
                         index_col=0, header=None,
                         converters={'flags': lambda x: x})
    result = result.tail(144)
    return result

print_memory_usage()
data = {}
for i in range(10):
    with open('data.csv') as f:
        data[i] = read_dataframe_from_csv(f)
    print_memory_usage()
Results
If data.csv only contains a few rows (e.g. 144, in which case the slicing is redundant), memory usage grows very slowly. But if data.csv contains 800k rows, the results are similar to these:
52968
153388
178972
199760
225312
244620
263656
288300
309436
330568
349660
(Adding gc.collect() before print_memory_usage() does not make any significant difference.)
What can I do about it?
As @Alex noted, slicing a dataframe only gives you a view of the original frame, but does not delete it; you need to use .copy() for that. However, even when I used .copy(), memory usage grew and grew, albeit at a slower rate.
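In code, the change is simply:
# copying the slice drops the reference to the original 800k-row frame
adataframe = adataframe.tail(144).copy()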
I suspect that this has to do with how Python, numpy and pandas use memory. A dataframe is not a single object in memory; it contains pointers to other objects (especially, in this particular case, to strings, which is the "flags" column). When the dataframe is freed, and these objects are freed, the reclaimed free memory space can be fragmented. Later, when a huge new dataframe is created, it might not be able to use the fragmented space, and new space might need to be allocated. The details depend on many little things, such as the Python, numpy and pandas versions, and the particulars of each case.
Rather than investigating these little details, I decided that reading a huge time series and then slicing it is a no go, and that I must read only the part I need right from the start. I like some of the code I created for that, namely the textbisect module and the FilePart class.
You could argue that the memory is released, that it only appears to be used because it is marked free and will be reused by Python.
Correct, that is how maxrss works (it measures peak memory usage). See here.
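If you want to see the current footprint rather than the peak, something like psutil works (psutil is not part of the original script, it is just an assumption for illustration):
import psutil

def print_current_memory_usage():
    # resident set size right now, in bytes (ru_maxrss only ever grows)
    print(psutil.Process().memory_info().rss)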
So the question then is why is the garbage collector not cleaning up the original DataFrames after they have been subsetted.
I suspect it is because subsetting returns a DataFrame that acts as a proxy to the original one (so values don't need to be copied). This would result in a relatively fast subset operation but also memory leaks like the one you found and weird speed characteristics when setting values.