Large Dataframe slow with Lifelines Survival Analysis - pandas

I'm trying to run a survival analysis on a large dataset (about 80 rows x 12,000 columns) in Python.
Currently I'm using:
from lifelines import CoxPHFitter
cf = CoxPHFitter()
cf.fit(df, duration_col='Time', event_col='Status')
But it is extremely slow. Breaking the dataframe into chunks of 100 columns and running cf.fit on each chunk is slightly faster, but it still clocks in at around 80 s. This is notably slower than R's coxph, and I'd really prefer not to use rpy2 to run the analysis in R.
I'm a bit at a loss for how to make this faster, so any suggestions would be greatly appreciated.
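For reference, the chunked approach mentioned above looks roughly like this (a minimal sketch; the fit_in_chunks helper and the chunk size are mine, while the 'Time'/'Status' column names come from the question):
from lifelines import CoxPHFitter
import pandas as pd

def fit_in_chunks(df, chunk_size=100, duration_col='Time', event_col='Status'):
    # Fit a separate Cox model on each chunk of covariate columns.
    covariates = [c for c in df.columns if c not in (duration_col, event_col)]
    summaries = []
    for start in range(0, len(covariates), chunk_size):
        cols = covariates[start:start + chunk_size]
        cf = CoxPHFitter()
        cf.fit(df[cols + [duration_col, event_col]],
               duration_col=duration_col, event_col=event_col)
        summaries.append(cf.summary)
    return pd.concat(summaries)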

Related

Unaccountable Dask memory usage

I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use case this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and in the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory.
Please help me understand this behavior and, hopefully, implement a more memory-safe approach.
Many thanks!
BTW:
In case it is helpful, I'm using Python 3.8.8, dask 2021.4.0, and distributed 2021.4.0.
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason at far lower memory usage (N=3e7, or roughly 500 MB).
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)
    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))
    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but in general it's a good idea to keep each workload to a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
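As a rough illustration of that rule of thumb applied to the question's N = 1.5e8 (a sketch; psutil and the arithmetic below are my additions):
import psutil

# Estimated footprint for 1.5e8 rows x 2 float64 columns, as in the question.
n_rows, n_cols, bytes_per_value = 1.5e8, 2, 8
footprint_gb = n_rows * n_cols * bytes_per_value / 2**30
ram_gb = psutil.virtual_memory().total / 2**30
print(f"dataset: {footprint_gb:.2f} GiB, RAM: {ram_gb:.2f} GiB, "
      f"ratio: {ram_gb / footprint_gb:.1f}x (rule of thumb above: 5-10x)")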

OpenRefine speed puzzlement

I am curious about OpenRefine's speed. I have two projects, each about 5 MB in size and 35,000-40,000 lines in length.
This dataset works normally: https://raw.githubusercontent.com/whanley/egypt-data/main/exp-manifests-rough(1).tsv
This dataset works slowly: https://raw.githubusercontent.com/whanley/b-g/master/bslc-members/bslc-members-to-1900-tsv.tsv
I notice the slow speed when faceting. For example, faceting the second dataset column "Surname" by count is very slow.
I have tried increasing memory, etc. What confuses me is the difference in speeds between two quite similar projects. Any insight or tips would be appreciated.

Is there any way to Drop pyspark Dataframe after converting it into pandas Dataframe via toPandas()

I create a Spark dataframe from a 4 GB input text file using pyspark, then apply a condition like:
df.cache() #cache df for fast execution of later instruction
df_pd = df.where(df.column1=='some_value').toPandas() #around 70% of data
Now I am doing all operations on the pandas dataframe df_pd, and my memory usage becomes around 13 GB.
Why is so much memory consumed?
How can I make my computation faster and more efficient? (Here df.cache() took around 10 minutes.)
I tried to free up the pyspark DF memory by using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
Note: I am mainly using pyspark because it uses CPU cores efficiently, while pandas only uses a single core of my machine for the file-read operation.
Why is so much memory consumed?
I would say duplication of the dataframe in memory, as you suggested.
How can I make my computation faster and more efficient? (Here df.cache() took 10 minutes to run.)
df.cache() is only useful if you're going to use this df multiple times. Think of it as a checkpoint, only useful when you need to do multiple operations on the same dataframe. Here, it is not necessary since you're doing only one pass. More info here.
I tried to free up the pyspark DF memory by using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
unpersist is the right thing to do. About sqlContext.clearCache(), I don't know which version of Spark you're using, but you may want to take a look at spark.catalog.clearCache().
Although I know this does not directly answer your question, I hope it may help!
What about trying to delete the PySpark df?
del(df)
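Putting the suggestions together, a rough sketch (the file path is a placeholder, column1/'some_value' come from the question, and reading the text file as CSV is an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Assumption: the 4 GB text file is delimited and can be read as a dataframe.
df = spark.read.csv("input.txt", header=True, inferSchema=True)

# Only one pass over df, so no df.cache() needed.
df_pd = df.where(df.column1 == 'some_value').toPandas()

# Release the Spark-side copies once the pandas dataframe exists.
df.unpersist()
spark.catalog.clearCache()
del df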

Alternative for Pandas resample

I am looking for a solution to resample time series data on a big scale (tens or hundreds of millions of records). Pandas resample() worked well until about 10 million records; after that it stopped working because the hardware did not have enough memory. I have had this problem several times with pandas on huge datasets. But if I just used a for loop over the same datasets, I could read the data and work with it, even if it was much slower. Does anybody know a good solution to resample time series data without pandas?
The source of the data is a MySQL server and the records contain OHLC data and a timestamp. The frequency of the time series is 1 minute and the resampling frequencies are 5 min, 30 min, 1 h, 6 h, 1 d, 1 w, 1 m, all of which I store in different tables. I am considering switching to MongoDB in the future.
Look at this:
Pandas Panel resampling alternatives
The package is now called xarray. You can also check out dask, which together with xarray offers fast, parallel resampling (and many other numpy and pandas functions).
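For example, a rough sketch of 1-minute-to-5-minute OHLC resampling with dask (the parquet path and column names are assumptions; the data would first have to be exported from MySQL):
import dask.dataframe as dd

# Assumed layout: columns timestamp, open, high, low, close.
ddf = dd.read_parquet("ohlc_1min.parquet")
ddf = ddf.set_index("timestamp", sorted=True)  # known divisions enable resample

bars_5min = ddf.resample("5min").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last"}
)
bars_5min.to_parquet("ohlc_5min.parquet")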

Pyspark, dask, or any other python: how to pivot a large table without crashing laptop?

I can pivot a smaller dataset fine using pandas, dask, or pyspark.
However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1,000 columns and about 1.5 million rows. I suspect that on the way to the pivot table there must be some huge RAM usage that exceeds system memory, and I don't understand how pyspark or dask is useful if the intermediate steps won't fit in RAM.
I thought dask and pyspark would allow larger-than-RAM datasets even with just 8 GB of RAM. I also thought these libraries would chunk the data for me and never exceed the amount of RAM that I have available. I realize that I could read in my huge dataset in very small chunks, pivot each chunk, and then immediately write the result of the pivot to a parquet or hdf5 file, manually. This should never exceed RAM. But then wouldn't this manual effort defeat the purpose of all of these libraries? I am under the impression that what I am describing is included right out of the box with these libraries, or am I wrong here?
If I have a 100 GB file of 300 million rows and want to pivot it on a laptop, is that even possible? (I can wait a few hours if needed.)
Can anyone help out here? I'll go ahead and add a bounty for this.
Please just show me how to take a large parquet file that is itself too large for RAM and pivot it into a table that is too large for RAM, never exceeding available RAM (say 8 GB) at any point.
from pyspark.sql.functions import count

# df is a pyspark dataframe
df_pivot = df.groupby(df.id).pivot("city").agg(count(df.visit_id))
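And a rough sketch of the manual chunked approach described in the question, in pandas (the file names, chunk size, and the caveat about ids spanning chunks are my additions):
import pandas as pd

# Read the source in chunks, pivot each chunk, and write each partial result
# straight to disk so only one chunk's pivot is ever held in memory.
reader = pd.read_csv("visits.csv", chunksize=1_000_000)
for i, chunk in enumerate(reader):
    partial = chunk.pivot_table(index="id", columns="city",
                                values="visit_id", aggfunc="count",
                                fill_value=0)
    partial.to_parquet(f"pivot_part_{i}.parquet")

# Note: if the same id can appear in several chunks, the partial files need a
# second pass to sum the duplicate rows.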