PySpark, Dask, or any other Python: how to pivot a large table without crashing my laptop?

I can pivot a smaller dataset fine using pandas, dask, or pyspark.
However, once the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1,000 columns and about 1.5 million rows. I suspect that on the way to the pivot there is some intermediate step whose RAM usage exceeds system memory, and I don't understand how PySpark or Dask are useful if the intermediate steps don't fit in RAM at all times.
I thought Dask and PySpark could handle larger-than-RAM datasets even with just 8 GB of RAM, and that these libraries would chunk the data for me and never exceed the amount of RAM I have available. I realize I could read my huge dataset in very small chunks, pivot each chunk, and immediately write each pivoted result to a parquet or HDF5 file, all by hand; that should never exceed RAM. But wouldn't this manual effort defeat the purpose of these libraries? My impression is that exactly what I am describing comes right out of the box with these libraries, or am I wrong here?
If I have a 100 GB file of 300 million rows and want to pivot it on a laptop, is that even possible? (I can wait a few hours if needed.)
Can anyone help out here? I'll go ahead and add a bounty for this.
Please simply show me how to take a large parquet file that is itself too big for RAM and pivot it into a table that is also too big for RAM, while never exceeding available RAM (say 8 GB) at any point.
# df is a PySpark DataFrame
from pyspark.sql.functions import count
df_pivot = df.groupBy(df.id).pivot("city").agg(count(df.visit_id))
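For what it's worth, Dask does ship an out-of-core pivot: dask.dataframe.pivot_table requires the pivot column to be categorical with known categories (so the output schema is fixed up front), and the result can be written straight back to parquet instead of being collected in memory. A rough sketch, assuming the parquet file holds the same id / city / visit_id columns as the snippet above (paths are placeholders):
import dask.dataframe as dd

ddf = dd.read_parquet("visits.parquet")   # hypothetical input path
ddf = ddf.categorize(columns=["city"])    # pivot column must be categorical with known categories
df_pivot = dd.pivot_table(ddf, index="id", columns="city",
                          values="visit_id", aggfunc="count")
df_pivot.to_parquet("visits_pivoted/")    # stream the result to disk rather than into RAM
Whether this stays inside 8 GB still depends on partition sizes and on how wide the result is, so treat it as a starting point rather than a guarantee.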

Related

dask dataframe optimal partition size for 70GB data join operations

I have a dask dataframe of around 70 GB and 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64 GB of RAM, running a local Dask cluster.
I have to take each of the 3 columns and join them to another even larger dataframe.
The documentation recommends partition sizes of 100 MB. However, given this amount of data, joining 700 partitions seems to be a lot more work than, for example, joining 70 partitions of 1,000 MB each.
Is there a reason to keep it at 700 x 100MB partitions?
If not which partition size should be used here?
Does this also depend on the number of workers I use?
1 x 50GB worker
2 x 25GB worker
3 x 17GB worker
Optimal partition size depends on many different things, including available RAM, the number of threads you're using, how large your dataset is, and in many cases the computation that you're doing.
For example, in the case of your join/merge code it could be that your data is highly repetitive, so your 100 MB partitions may quickly expand 100x into 10 GB partitions and fill up memory. Or they might not; it depends on your data. On the other hand, join/merge code does produce n*log(n) tasks, so reducing the number of tasks (and thus increasing partition size) can be highly advantageous.
Determining optimal partition size is challenging. Generally the best we can do is to provide insight about what is going on. That is available here:
https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
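To experiment with that trade-off directly, you can repartition before the join and watch memory on the dashboard; a rough sketch (the paths and join key are placeholders, not taken from the question):
import dask.dataframe as dd

left = dd.read_parquet("left_70gb/")     # the 70 GB, 3-column table
right = dd.read_parquet("right_big/")    # the even larger table to join against

left = left.repartition(partition_size="1GB")   # try fewer, larger partitions and compare

joined = left.merge(right, on="key", how="inner")   # "key" is a hypothetical join column
joined.to_parquet("joined/")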

Extremely high memory usage with pyarrow reading gzipped parquet files

I have a (set of) gzipped parquet files with about 210 columns, of which I am loading about 100 columns into a pandas dataframe. It works fine and very fast when the file size is about 1 MB (with about 50 rows); the python3 process consumes < 500 MB of RAM. However, when the file is > 1.5 MB (70+ rows), it starts consuming 9-10 GB of RAM without ever loading the dataframe. If I specify just 2-3 columns, it is able to load them from the "big" file (still consuming that kind of RAM), but anything beyond that seems impossible. All columns are text.
I am currently using pandas.read_parquet, but I have also tried pyarrow.read_table with the same results.
Any ideas what could be going on? I just don't understand why loading that amount of data should blow up RAM like that and become unusable. My objective is to load the parquet data into a database, so if there are better ways to do it, that would be great to know as well.
The code is below; it's just a simple usage of pandas.read_parquet.
import pandas as pd
df = pd.read_parquet(bytesIO_from_file, columns=[...])
There was a memory usage issue in pyarrow 0.14 that has been resolved: https://issues.apache.org/jira/browse/ARROW-6060
The upcoming 0.15 release will have this fix, as well as a bunch of other optimizations in Parquet reading. If you're curious to try it now, see the docs for installing the development version.
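Independently of that fix, since the goal is to push the data into a database, one way to bound memory is to read the file one row group at a time with pyarrow, so only a slice is ever materialized. A sketch, assuming the file was written with more than one row group (path and column names are placeholders):
import pyarrow.parquet as pq

pf = pq.ParquetFile("big_file.parquet")
wanted = ["col_a", "col_b"]              # the subset of the ~210 columns you need

for i in range(pf.num_row_groups):       # no help if the file has a single huge row group
    chunk = pf.read_row_group(i, columns=wanted).to_pandas()
    # ... insert `chunk` into the database, then let it go out of scope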

Is there any way to Drop pyspark Dataframe after converting it into pandas Dataframe via toPandas()

I create a Spark DataFrame from an input text file of about 4 GB using PySpark, then apply a condition like:
df.cache()  # cache df so later operations run faster
df_pd = df.where(df.column1=='some_value').toPandas()  # around 70% of the data
Now I am doing all my operations on the pandas DataFrame df_pd, and my memory usage has grown to around 13 GB.
Why is so much memory consumed?
What can I do to make my computation faster and more efficient? (Here, df.cache() took 10 minutes to run.)
I tried to free up the PySpark DataFrame's memory using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
Note: I am mainly using PySpark because it uses the CPU cores efficiently, whereas pandas only uses a single core of my machine to read the file.
Why is so much memory consumed?
I would say duplication of the dataframe in memory, as you suggested.
What can I do to make my computation faster and more efficient? (Here, df.cache() took 10 minutes to run.)
df.cache() is only useful if you're going to use this df multiple times. Think of it as a checkpoint, only useful when you need to do multiple operations on the same dataframe. Here, it is not necessary since you're doing only one operation. More info here.
I tried to free up the PySpark DataFrame's memory using df.unpersist() and sqlContext.clearCache(), but it doesn't help.
unpersist is the right thing to do. About sqlContext.clearCache(), I don't know which version of Spark you're using, but you may want to take a look at spark.catalog.clearCache().
Although I know this does not directly answer your question, I hope it helps!
What about trying to delete the PySpark df?
del df
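Putting the suggestions together, a minimal sketch of the cleanup sequence (assuming df is the Spark DataFrame built from the text file and spark is a Spark 2.x+ session object):
df_pd = df.where(df.column1 == 'some_value').toPandas()   # single pass, so df.cache() is not needed

df.unpersist()               # release any cached Spark-side copy (a no-op if df was never cached)
spark.catalog.clearCache()   # clear anything else still cached in this session
del df                       # drop the Python reference to the Spark DataFrame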

Large Dataframe slow with Lifelines Survival Analysis

I'm trying to run a survival analysis on a large dataset (about 80 rows x 12,000 cols) in python.
Currently I'm using:
from lifelines import CoxPHFitter
cf = CoxPHFitter()
cf.fit(df, duration_col='Time', event_col='Status')
But it is extremely slow. Breaking up the dataframe into chunks of 100 and running cf.fit multiple times is slightly faster, but it's still clocking in at around 80s. This is notably slower than R's coxph, and I'd really prefer not to use rpy2 to run the analysis in R.
I'm a bit at a loss for how to make this faster, so any suggestions would be greatly appreciated.
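For reference, the column-chunking workaround mentioned above looks roughly like this; the chunk size and the assumption that the chunks are blocks of covariate columns are mine, not the poster's:
from lifelines import CoxPHFitter

chunk_size = 100
covariates = [c for c in df.columns if c not in ("Time", "Status")]

summaries = []
for i in range(0, len(covariates), chunk_size):
    cols = covariates[i:i + chunk_size] + ["Time", "Status"]
    cf = CoxPHFitter()
    cf.fit(df[cols], duration_col="Time", event_col="Status")
    summaries.append(cf.summary)   # per-chunk coefficients, hazard ratios, p-values
Note that fitting separate models per chunk is not equivalent to a single joint Cox model over all 12,000 covariates, so it only makes sense if per-block estimates are acceptable.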

Alternative for Pandas resample

I am looking for a solution to resample time series data at large scale (tens or hundreds of millions of records). Pandas resample() worked well up to about 10 million records; beyond that it simply stopped working because the hardware did not have enough memory. I have had this problem several times with pandas on huge datasets. If I just used a for loop instead, I could read and process the data, even if it was much slower. Does anybody know a good solution for resampling time series data without pandas?
The source of the data is a MySQL server and the records contain OHLC data and a timestamp. The frequency of the time series is 1 minute and the resampling frequencies are 5 min, 30 min, 1 h, 6 h, 1 d, 1 w, and 1 m, each of which I store in a separate table. I am considering switching to MongoDB in the future.
Look at this:
Pandas Panel resampling alternatives
The package is now called xarray. You can also check out Dask, which together with xarray offers fast, parallel resampling (and many other NumPy and pandas functions).
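If the 1-minute data can be exported from MySQL to files (Dask can also read straight from SQL via dd.read_sql_table), a Dask version of the resample keeps memory bounded by partition size. A rough sketch for the 5-minute bars, with made-up paths and column names:
import dask.dataframe as dd

ddf = dd.read_parquet("ohlc_1min/")              # 1-minute OHLC records exported to parquet
ddf = ddf.set_index("timestamp", sorted=True)    # resample needs a datetime index with known divisions

agg = {"open": "first", "high": "max", "low": "min", "close": "last"}
bars_5min = ddf.resample("5min").agg(agg)
bars_5min.to_parquet("ohlc_5min/")               # write out without collecting everything in RAM
The same pattern applies to the 30 min, 1 h, 6 h, 1 d, 1 w, and 1 m tables by changing the resample frequency.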