I have two datasets: one is around 45GB and contains daily transactions for one year, and the second is 3.6GB and contains customer IDs and details. I want to merge the two on a common column to create a single dataset, which exceeds the memory of the server since there can be multiple transactions per customer. I am working on a Windows server with 16 cores and 64GB RAM, which I understand is very limited hardware for this type of work.
Methodology
Read the big dataset into a dask dataframe and set the index to be the customer ID. Read the 3.6GB dataset in pandas and set the index to be the customer ID. Launch a local cluster, with the parameter memory_limit='50GB' and processes=False.
Merge the dask dataframe with the pandas dataframe on index (left_index=True, right_index=True).
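For reference, the code looks roughly like this (file and column names are placeholders for my actual data):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# local cluster as described above
cluster = LocalCluster(processes=False, memory_limit='50GB')
client = Client(cluster)

# ~45GB of daily transactions
transactions = dd.read_csv('transactions.csv')
transactions = transactions.set_index('customer_id')   # shuffle by the join key

# 3.6GB of customer details, read into plain pandas
customers = pd.read_csv('customers.csv').set_index('customer_id')

merged = transactions.merge(customers, left_index=True, right_index=True)
merged.to_parquet('merged/')   # write to disk rather than computing into memory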
This method creates 75000 tasks which eventually blow up the memory.
Is what I am trying to accomplish possible? Have I chosen the wrong tools for that? I am running out of ideas and I would desperately need some help.
Yes, what you want to do is possible, but you may need to play with partition sizes a bit. If there is a lot of repetition in your join key then the Pandas merge within each partition may suddenly produce very large outputs. You can address this by ...
Using smaller partitions (maybe)
Reducing the amount of parallelism (perhaps try dask.config.set(scheduler="single-threaded") to see if that helps)
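A rough sketch of both suggestions, using placeholder names and assuming a setup like the one described above:

import dask
import dask.dataframe as dd
import pandas as pd

# smaller partitions: lower the blocksize when reading the big file ...
transactions = dd.read_csv('transactions.csv', blocksize='64MB')
# ... or repartition an existing dask dataframe
transactions = transactions.repartition(partition_size='100MB')
transactions = transactions.set_index('customer_id')

customers = pd.read_csv('customers.csv').set_index('customer_id')

# less parallelism: force the whole computation through the single-threaded scheduler
with dask.config.set(scheduler='single-threaded'):
    merged = transactions.merge(customers, left_index=True, right_index=True)
    merged.to_parquet('merged/')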
Related
Summary: I am facing performance issues merging pandas dataframes. I am looking for a technique to merge thousands of columns into a single dataframe. I have tried
.merge()
pre-creating cols with reindex()
dask
Background
I have a Postgres database with hundreds of millions of rows that represent 1.5 million features per patient... I'm trying to boil it down into something more manageable, about 100,000 features.
I have a patient dataframe with 6000 rows representing patients: unique ID, and a few features. I have a method to identify the most important features, and now I want to merge those into my dataframe on the patient ID key. This works fine with only a few thousand features (3000), but as I scale, it starts to crawl.
Pandas merge is single-threaded, and is elegant but too slow for this many features. It works, but takes a few days -- I think it re-allocates memory with each merge.
I tried pre-creating the columns with null values to ensure everything is sorted correctly, and then updating each column, but it is still very slow. It has been running for 5 hours now, using only 1 core. Not sure how long it will take. My logs say it has only finished 4000/100,000 so far. That doesn't seem right, but I forgot to flush the print so I'll have to be patient.
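Roughly, what I mean by pre-creating the columns is something like this (the feature names and the loading helper are made up):

import pandas as pd

# placeholder: 6,000 patients indexed by patient ID
patients = pd.DataFrame(index=pd.Index(range(6000), name='patient_id'))

# placeholder: the ~100,000 selected feature names
feature_names = ['feat_1', 'feat_2', 'feat_3']

def load_feature(name):
    # stand-in for the real query that pulls one feature from Postgres
    return pd.Series(0.0, index=patients.index)

# pre-create all feature columns, filled with NaN, so the layout is fixed up front
patients = patients.reindex(columns=list(patients.columns) + feature_names)

# then fill one column at a time instead of merging
for name in feature_names:
    patients[name] = load_feature(name).reindex(patients.index)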
I tried dask but it doesn't appear to perform a multithreaded merge - it sticks to one CPU at the point where I merge them. (The dask dataframe was created from a pandas dataframe in case that matters).
Postgres has a 1600-column limit, so I can't transpose the data into a table.
I'm not limited to Python/pandas... but I would like to stick to my current stack (Postgres, Python, and R) -- can Python scale to a dataframe of that size (6000x100000)?
Any thoughts?
I have millions of devices, and from these devices I am going to get counter values at periodic intervals.
These values have to be held in memory, and I have to do further processing on top of them.
This is my problem. To handle it, I am using a Pandas DataFrame with 2 columns: one column is the timestamp and the other is the actual counter value.
There are multiple options that I am looking at.
1.) One DataFrame for all 1 million devices.
2.) The 1 million devices are logically grouped by the location of the server where each device resides, so instead of having all 1 million device counters in a single DataFrame, I can create multiple DataFrames based on location.
The second option comes to roughly 1000 DataFrames, i.e. 1 DataFrame per group. The reason for splitting into 1000 DataFrames is the parallelism (both read and write) I can get with multiprocessing, and calculations on a smaller DataFrame should also be quicker.
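A rough sketch of what the second option would look like (the column names and helper functions are just illustrative):

import pandas as pd

# one DataFrame per location group (~1000 groups)
frames = {}   # location -> DataFrame with columns ['timestamp', 'counter']

def append_sample(location, timestamp, counter):
    row = pd.DataFrame({'timestamp': [timestamp], 'counter': [counter]})
    if location in frames:
        frames[location] = pd.concat([frames[location], row], ignore_index=True)
    else:
        frames[location] = row

def drop_old(location, cutoff):
    # delete old data for one group based on some condition (here: a timestamp cutoff)
    df = frames[location]
    frames[location] = df[df['timestamp'] >= cutoff]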
Some of the questions that I have are:
1.) What is the maximum number of DataFrames that we can create in Pandas on a system that has 1-2GB of RAM?
2.) Will having a larger number of DataFrames induce any overhead in pandas?
I need some opinion on the best way to go here - should I use 1000 DataFrames or 1 huge DataFrame?
Please note: I am getting the data from the devices over the network at a periodic interval (e.g. every 5 minutes), and there are approximately 1,000,000 devices in total.
Also, depending on some conditions, I regularly need to delete old data. It's not just reading and appending data to the DataFrames, but also deleting old data.
I'm working with a large Spark application on EMR that processes data in the hundreds of GBs. At the moment, I'm reading in a number of parquet files that are transformed into Spark dataframes and subsequently performing data enrichment.
During this step, I'm joining several of these dataframes together in order to compute intermediate values or derive new columns from existing ones. As I'm learning more about Spark, it's become obvious to me as this job continues to scale that I need to be mindful of potential performance optimizations.
After looking into the Spark UI for my application, I notice that a number of my executors are not getting any input and there appears to be skew in the shuffle reads and writes.
There is an ID key in my dataframes that I am joining on which should be evenly distributed, so it makes me feel like I need to repartition my dataframes on that key when reading them in from the original parquet files. Is this line of thinking correct? Should I be calling repartition manually on my dataframes on that ID key, or is Spark already handling this and what I am seeing above is actually a symptom of something else?
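Concretely, I am wondering whether I should be doing something like the following right after reading the parquet files (the paths and the key name are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.read.parquet("s3://my-bucket/left/")     # placeholder paths
right = spark.read.parquet("s3://my-bucket/right/")

# repartition both sides on the join key before the join
left = left.repartition("id")
right = right.repartition("id")

enriched = left.join(right, on="id")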
Thanks!
I have a 500 MB CSV file which I am reading as a dataframe.
I am looking for the optimal number of partitions for this dataframe.
I need to do some wide transformations and join this dataframe with another CSV, so I now have the 3 approaches below for re-partitioning it:
df.repartition(no. of cores)
Re-partitioning the dataframe as per the calculation 500MB/128MB ≈ 4 partitions, so as to have at least 128MB of data in each partition
Re-partitioning the dataframe using specific columns of the CSV, so as to co-locate related data in the same partitions
I want to know which of these options will be best for parallel computation and processing in Spark 2.4.
If you know the data very well, then using the columns to partition the data works best. Repartitioning based on the block size or the number of cores, on the other hand, is subject to change whenever the cluster configuration changes, and you would have to adjust those values for every environment where the cluster configuration differs. So, all in all, data-driven repartitioning is the better approach.
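A minimal sketch of the data-driven (column-based) approach, assuming a join key such as customer_id (all names here are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)
other = spark.read.csv("other.csv", header=True, inferSchema=True)

# co-locate rows with the same join key in the same partitions before the join
df = df.repartition("customer_id")
other = other.repartition("customer_id")

joined = df.join(other, on="customer_id")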
I have a very complex query that needs to join 9 or more tables with some 'group by' expressions. Most of these tables have almost the same number of rows. These tables also have some columns that can be used as the 'key' to partition the tables.
Previously, the app ran fine, but now the data set is 3-4 times as large as before. My tests turned out that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably just dozens of GBs.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the join should be done on each node. It is puzzling why Spark is not 'smart' enough to do so.
I could split the data set (with the 'key' I mentioned above) into many data sets that could be dealt with independently, but the burden would be on me... and it discounts the very reason to use Spark. What other approaches could help?
I use Spark 2.0 over Hadoop YARN.
My tests turned out that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data
When joining datasets, if the size of one side is less than a certain configurable threshold, Spark broadcasts the entire table to each executor so that the join can be performed locally everywhere. Your observation above is consistent with this. You can also provide the broadcast hint to Spark explicitly, like so: df1.join(broadcast(df2))
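For example, in the DataFrame API (the paths and the join key are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.parquet("big_table/")     # the large side
df2 = spark.read.parquet("small_table/")   # small enough to ship to every executor

# explicit broadcast hint: the join is then performed locally on each executor
joined = df1.join(broadcast(df2), on="key")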
Other than that, can you please provide more specifics about your problem?
[Some time ago I was also grappling with joins and shuffles for one of our jobs that had to handle a couple of TBs. We were using RDDs (and not the Dataset API). I wrote about my findings [here]; these may be of some use as you try to reason about the underlying data shuffle.]
Update: According to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property key. Its default value is 10 MB, and it does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run.
So apparently, this is supported only for Hive tables.
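If you want to experiment with the threshold itself, it can be set like any other SQL configuration property (the value is in bytes; the numbers below are just examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# raise the threshold to roughly 100 MB (the value is in bytes) ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
# ... or set it to -1 to disable automatic broadcast joins entirely
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")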