Pandas is using a lot of memory - pandas

I have a Flask app where an API is exposed to dump data from an Oracle database to a Postgres database.
I am using pandas to copy the contents of tables from Oracle, MySQL and Postgres into Postgres.
After running constantly for 15 days or so, the memory consumption is very high.
It transfers at least 5 million records every two days.
Can anyone help me optimize the pandas writes?

If you have a preprocessing step, I suggest using Dask. Dask offers parallel computation and does not fill memory unless you explicitly force it, where forcing means computing a task on the dataframe. Refer to the Dask documentation for the read_sql_table method.
import dask.dataframe as dd

# Read the data as a dask dataframe. read_csv is only a placeholder here;
# swap in dd.read_sql_table(...) or whatever matches your actual source.
df = dd.read_csv('path/to/file')

# ... do the preprocessing steps on the (still lazy) dataframe ...

# Finally, write it out, e.g. with df.to_sql(...) or df.to_parquet(...).
This approach comes in very handy if you have to deal with a large dataset and a preprocessing step, especially a reduction. Refer to the Dask documentation for more information. It may bring a significant improvement, depending on your preprocessing step.
Alternatively, you can use the chunksize parameter of pandas, as @TrigonaMinima suggested. This lets your machine retrieve the data in chunks, "x rows at a time", so you can preprocess each chunk as above; this may require you to create a temp file and append the chunks to it.
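As a minimal sketch of that chunked approach (using SQLite via the standard library so it is runnable end to end; the table name "records" and the connections are made up, your real Oracle/Postgres connections slot in the same way):

```python
import sqlite3

import pandas as pd

# Stand-ins for the real source and destination connections; any
# DBAPI/SQLAlchemy connection works the same way with pandas.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

# Seed the source with some sample rows (hypothetical table "records").
pd.DataFrame({"id": range(10), "val": range(10)}).to_sql(
    "records", src, index=False
)

# Stream the table in chunks: only `chunksize` rows are held in memory at a
# time, and each chunk is appended to the destination table as it arrives.
for chunk in pd.read_sql("SELECT * FROM records", src, chunksize=4):
    chunk.to_sql("records", dst, index=False, if_exists="append")

n = pd.read_sql("SELECT COUNT(*) AS n FROM records", dst)["n"].iloc[0]
print(n)  # 10
```

With this pattern peak memory is bounded by the chunk size rather than the table size, which is usually what matters for a long-running transfer service.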

Related

Unmanaged memory jamming cluster during dask's merge_asof method

I am trying to merge large dataframes using dask.dataframe.multi.merge_asof, but I am running into issues with unmanaged memory accumulating on the cluster.
I have boiled the problem down to the following. It essentially just creates some sample data with timestamps using Dask, converts it to pandas dataframes, and then runs Dask's merge_asof implementation. It quickly exceeds the memory of the workers with unmanaged memory, and eventually the cluster just gets stuck (it does not crash, it just stops doing anything).
It looks like this:
from distributed import Client, LocalCluster
import pandas as pd
import dask
import dask.dataframe
client = Client(LocalCluster(n_workers=40))
# make sample datasets, and convert them to pandas dataframes
left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index().compute()
right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index().compute()
left = dask.dataframe.from_pandas(left, npartitions=250)
right = dask.dataframe.from_pandas(right, npartitions=250)
# Alternative to above block (no detour via pandas)
#left = dask.datasets.timeseries(start='2021-01-01', end='2022-12-31', partition_freq="3d").reset_index()
#right = dask.datasets.timeseries(start='2021-01-02', end='2023-01-01', partition_freq="3d").reset_index()
# (dask crashes on datetime, so convert to int first)
left['timestamp'] = left['timestamp'].values.astype('int64')
right['timestamp'] = right['timestamp'].values.astype('int64')
dask.dataframe.multi.merge_asof(
    left=left.sort_values(by='timestamp'),
    right=right.sort_values(by='timestamp'),
    tolerance=pd.to_timedelta(10000000, unit='seconds').delta,
    direction='forward',
    left_on='timestamp',
    right_on='timestamp',
).compute()
Note that I deliberately call compute() on the sample data to get pandas dataframes, since in the final use case I also need to start from pandas dataframes, so I need to "model" that step as well.
Interestingly, not making this detour works much better. So if I comment out the first data-creation block and use the one labelled as an alternative instead, the merge is successful (there is still some unmanaged memory, but not enough to stop the cluster).
While I suspect that occupying machine memory with the local pandas dataframes may increase the unmanaged memory (?), the data size is far from taking all of the machine's RAM (200 GiB). So my questions are:
Why does the cluster hang despite memory being available on the machine?
Why does going through pandas dataframes as an intermediate step affect memory management during merge_asof?
I have tried various things with regard to garbage collection, thinking there might be some pandas data sticking around in memory while merge_asof is executed, but to no avail. I have also tried garbage collection on the workers, as well as automatic and manual memory trimming, as described in this blog post and this video, none of which showed any effect worth mentioning.
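For readers following along, the manual trimming mentioned above is typically a small ctypes call into glibc's malloc_trim, executed on every worker. A sketch, assuming Linux with glibc (it does not apply on other allocators):

```python
import ctypes

def trim_memory() -> int:
    # Ask glibc's allocator to return free heap pages to the OS (Linux only).
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# On a running cluster you would execute it on every worker:
# client.run(trim_memory)
```

malloc_trim returns 1 if any memory was released back to the OS and 0 otherwise; it shrinks heap fragmentation but, as noted, it did not resolve the hang here.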

What's the real advantage of using TensorFlow Transform?

Today I'm using pandas as my main data pre-processing tool in my project, where I need to apply some transformations to the data to ensure it is in the correct format that my Python class expects.
So I heard about TF Transform and tested it a little, but I didn't see any obvious advantage (obviously I'm referring to the data transformation itself, not to a machine learning pipeline).
For example, I wrote a simple piece of TFT code to uppercase all values in my dataframe column:
upper = tf.strings.upper(input)
The execution time of this pre-processing function is: 17.1880068779 seconds
This, on the other hand, is the code I use to do exactly the same thing on a dataframe:
x = dataset['CITY'].str.upper()
The execution time is: 0.0188028812408 seconds
So, am I doing something wrong? I think that if we had dozens of transformations and a dataset of millions of lines, maybe TFT would win this comparison, but for a 100k-row dataframe it doesn't seem so useful.

Pandas v 0.25 groupby with many columns gives memory error

After updating to pandas v0.25.2, a script doing a groupby over many columns on a large dataframe no longer works; I get a memory error:
MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64
Doing a bit of research, I found issue #14942, reported on GitHub for an earlier version:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')
# killed after 6 minutes of 100% cpu and 90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()
Running this code (on version 0.25.2) also gives a memory error. Am I doing something wrong (has the groupby syntax changed in pandas v0.25?), or has this issue, which is marked as resolved, returned?
Use observed=True to fix it and prevent the groupby from expanding all possible combinations of the factor variables:
df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
There is a related GitHub issue: PERF: groupby with many empty groups memory blowup.
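To see why observed=True matters, here is a small self-contained sketch (with made-up sizes, scaled down from the question) comparing the number of result groups with and without it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat": pd.Categorical(rng.integers(0, 10, size=1_000).astype(str)),
    "int_id": rng.integers(0, 10, size=1_000),
    "other_id": rng.integers(0, 50, size=1_000),
    "foo": 0,
})

# With a categorical key, observed=False materialises every combination of
# the category levels with the other key values -- mostly empty groups,
# which is what blows up memory at the question's scale. observed=True
# keeps only the combinations that actually occur in the data.
sparse = df.groupby(["cat", "int_id", "other_id"], observed=False).count()
dense = df.groupby(["cat", "int_id", "other_id"], observed=True).count()
print(len(dense) <= 1_000, len(dense) <= len(sparse))  # True True
```

At the question's scale (255 categories x 255 x 10000 ids) the expanded index alone is what triggers the "Unable to allocate array" error.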
While the proposed solution addresses the issue, another problem is likely to arise when dealing with larger datasets. pandas groupby is slow and memory hungry; it may need 5-10x the memory of the dataset. A more effective solution is to use a tool that is an order of magnitude faster and less memory hungry, and that integrates seamlessly with pandas by reading directly from the dataframe's memory. There is no data round trip, and typically no need for extensive data chunking.
My new tool of choice for quick data aggregation is https://duckdb.org. It takes your existing dataframe df and queries it directly, without even importing it into the database. Using your dataframe-generation code, the total time was 0.45 sec. I am not sure why pandas does not use DuckDB for the groupby under the hood.
The db object is created using the small wrapper class below, which lets you simply type db = DuckDB() and you are ready to explore the data in any project. You can expand it further, or even simplify it with the %sql magic described in the DuckDB documentation. By the way, sql() returns a dataframe, so you can also do db.sql(...).pivot_table(...); it is that simple.
import duckdb

class DuckDB:
    def __init__(self, db=None):
        self.db_loc = db or ':memory:'
        self.db = duckdb.connect(self.db_loc)

    def sql(self, sql=""):
        return self.db.execute(sql).fetchdf()

    def __del__(self):
        self.db.close()
Note: DuckDB is good but not perfect; still, it turned out to be much more stable than Dask or even PySpark, with a much simpler setup. For larger datasets you may need a real database, but for datasets that fit in memory this is great. Regarding memory usage: if you have a larger dataset, make sure you limit DuckDB's memory with pragmas, as it will otherwise eat it all in no time. The limit simply spills the extra onto disk, without you having to deal with data chunking. Also, do not treat it as a persistent database; treat it as an in-memory database, and if you need results stored, export them to Parquet instead of saving the database file, because the file format is not stable between releases and you would have to export to Parquet anyway to move from one version to the next.
I expanded this data frame to 300 million records, so in total it had around 1.2 billion records, or around 9 GB. It still completed your groupby and the other summary stats on a 32 GB machine, with 18 GB still free.

Is there any way to drop a PySpark dataframe after converting it into a pandas dataframe via toPandas()?

I create a Spark dataframe from a 4 GB input text file using PySpark, then filter it with a condition like:
df.cache() #cache df for fast execution of later instruction
df_pd = df.where(df.column1=='some_value').toPandas() #around 70% of data
Now I am doing all operations on the pandas dataframe df_pd, and my memory usage has become around 13 GB.
Why is so much memory consumed?
What can I do to make my computation faster and more efficient? (Here, df.cache() took 10 minutes for caching.)
I tried to free the PySpark df's memory using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
Note: I am mainly using PySpark because it uses CPU cores efficiently, whereas pandas only uses a single core of my machine for the file-read operation.
Why is so much memory consumed?
I would say duplication of the dataframe in memory, as you suggested.
What can I do to make my computation faster and more efficient? (Here, df.cache() took 10 minutes to run.)
df.cache() is only useful if you're going to use this df multiple times. Think of it as a checkpoint, useful only when you need to do multiple operations on the same dataframe. Here, it is not necessary since you're doing only one pass. More info here.
I tried to free the PySpark df's memory using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
unpersist is the right thing to do. As for sqlContext.clearCache(), I don't know which version of Spark you're using, but you may want to take a look at spark.catalog.clearCache().
Although I know this does not directly answer your question, I hope it helps!
What about trying to delete the PySpark df?
del df

Processing data on disk with a Pandas DataFrame

Is there a way to take a very large amount of data on disk (a few hundred GB) and interact with it on disk as a pandas dataframe?
Here's what I've done so far:
Described the data using pytables and this example:
http://www.pytables.org/usersguide/introduction.html
Run a test by loading a portion of the data (a few GB) into an HDF5 file
Converted the data into a dataframe using pd.DataFrame.from_records()
This last step loads all the data into memory.
I've looked for a way to describe the data as a pandas dataframe in step 1, but haven't been able to find a good set of instructions for doing that. Is what I want to do feasible?
Blaze is a nice way to interact with out-of-core data using lazy expression evaluation. It uses pandas and PyTables under the hood (as well as a host of conversions via odo).