I'm working with a large Spark application on EMR that processes data in the hundreds of GBs. At the moment, I'm reading in a number of parquet files that are transformed into Spark dataframes and subsequently performing data enrichment.
During this step, I'm joining several of these dataframes together in order to compute intermediate values or derive new columns from existing ones. As I learn more about Spark and this job continues to scale, it's become obvious to me that I need to be mindful of potential performance optimizations.
After looking into the Spark UI for my application, I notice that a number of my executors are not receiving any input, and there appears to be skew in the shuffle reads and writes.
The ID key in my dataframes that I am joining on should be evenly distributed, which makes me think I need to repartition my dataframes on that key after reading them in from the original parquet files. Is this line of thinking correct? Should I be calling repartition manually on that ID key, or is Spark already handling this and is what I'm seeing above actually a symptom of something else?
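For concreteness, this is roughly the kind of call I have in mind before the joins (the id column name here is just a stand-in for my actual join key, and the paths and partition count are arbitrary placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("enrichment").getOrCreate()

# Placeholder paths for the original parquet inputs
transactions = spark.read.parquet("s3://my-bucket/transactions/")
reference = spark.read.parquet("s3://my-bucket/reference/")

# Explicitly repartition both sides on the join key before joining
transactions = transactions.repartition(200, "id")
reference = reference.repartition(200, "id")

enriched = transactions.join(reference, on="id", how="left")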
Thanks!
Related
Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas dataframe, which is written to an Azure SQL Database table.
Approach 2: I implemented Delta Lake: the output pandas dataframe is converted to a Spark dataframe and the data is then inserted into a partitioned Delta table. The process is simple, and the time required to convert the pandas dataframe to a Spark dataframe is in milliseconds. But the performance compared to Approach 1 is poor: with Approach 1, I am able to finish in less than half the time required by Approach 2.
I tried different optimization techniques such as ZORDER, compaction (bin-packing), and using insertInto rather than saveAsTable, but none really improved the performance.
Please let me know if I have missed any performance tuning methods. If there are none, I am curious why Delta Lake did not perform better than the pandas + database approach. I would also be happy to hear about other, better approaches; for example, I came across Dask.
Many thanks in advance for your answers.
Regards,
Chaitanya
You don't give enough information to answer your question. What exactly is not performant: the whole process of data ingest?
Z-ordering doesn't give you an advantage while you are writing the data into the Delta lake; if anything, it is more likely to slow that step down. It gives you an advantage when you read the data back afterwards. Z-ordering by, for example, ID tries to store rows with the same ID in the same file(s), which enables Spark to use data skipping and avoid reading unnecessary data.
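As a sketch (assuming Delta Lake with the SQL extensions enabled; the table name events and the column id are placeholders):

# One-off maintenance step, run after the data has been written, not on every ingest
spark.sql("OPTIMIZE events ZORDER BY (id)")

# Subsequent reads that filter on id can skip files whose id range doesn't match
df = spark.read.table("events").where("id = 42")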
Also, how big is your data actually? If we are talking about a few GBs of data in the end, pandas and a traditional database will perform faster.
I can give you an example:
Let's say you have a daily batch job that processes 4 GB of data. If it's just about processing that 4 GB and storing it somewhere, Spark will not necessarily perform faster, as I already mentioned.
But now consider that job running for a year, which gives you ~1.5 TB of data at the end of the year. Now you can perform analytics on the entire history of data, and in this scenario you will probably be much faster than a database and pandas.
As a side note, you say you are reading in a bunch of JSON files, converting them to pandas, and then to Delta Lake.
If there is no specific reason to do so in Approach 2, I would just use:
spark.read.json("path")
to avoid the step of converting from pandas to Spark dataframes.
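A minimal sketch of that path (the paths and the partition column are placeholders):

# Read the raw JSON directly into a Spark dataframe
df = spark.read.json("/mnt/raw/input/*.json")

# Write straight to a partitioned Delta table, skipping pandas entirely
(df.write
   .format("delta")
   .mode("append")
   .partitionBy("date")
   .save("/mnt/delta/output"))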
I have a 500 MB CSV file which I am reading as a dataframe.
I am looking for the optimal number of partitions for this dataframe.
I need to do some wide transformations and join this dataframe with another CSV, so I am considering the three approaches below for repartitioning this dataframe:
1. df.repartition(<number of cores>)
2. Repartitioning the dataframe based on size: 500 MB / 128 MB ≈ 4 partitions, so that each partition holds roughly 128 MB of data
3. Repartitioning the dataframe by specific CSV columns so as to co-locate related data in the same partitions
I want to know which of these options will be best for parallel computation and processing in Spark 2.4.
If you know the data very well, then using the columns to partition the data works best. Repartitioning based on the block size or the number of cores is subject to change whenever the cluster configuration changes, and you would need to adjust it for every environment if your cluster configuration differs in higher environments. So, all in all, data-driven repartitioning is the better approach.
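A rough sketch of that column-driven option (file names and the column name are placeholders):

df = spark.read.csv("input.csv", header=True, inferSchema=True)
other = spark.read.csv("other.csv", header=True, inferSchema=True)

# Repartition both sides on the join column so matching keys end up in the same partitions
df = df.repartition("customer_id")
other = other.repartition("customer_id")

joined = df.join(other, on="customer_id")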
I have two datasets: one is around 45GB and contains daily transactions for 1 year, and the second is 3.6GB and contains customer IDs and details. I want to merge the two together on a common column to create a single dataset, which exceeds the memory of the server since there can be multiple transactions per customer. I am working on a Windows server with 16 cores and 64GB RAM, which I understand are very limited specs for this type of work.
Methodology
Read the big dataset into a dask dataframe and set the index to the customer ID. Read the 3.6 GB dataset into pandas and set the index to the customer ID. Launch a local cluster with memory_limit='50GB' and processes=False.
Merge the dask dataframe with the pandas dataframe on index (left_index=True, right_index=True).
This method creates 75000 tasks which eventually blow up the memory.
Is what I am trying to accomplish possible? Have I chosen the wrong tools for that? I am running out of ideas and I would desperately need some help.
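In code, the setup looks roughly like this (file paths and the column name are placeholders):

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(memory_limit="50GB", processes=False)
client = Client(cluster)

# 45 GB of daily transactions as a dask dataframe, indexed by customer ID
transactions = dd.read_csv("transactions_*.csv")
transactions = transactions.set_index("customer_id")

# 3.6 GB of customer details fits comfortably in pandas
customers = pd.read_csv("customers.csv").set_index("customer_id")

# The merge is where the task graph explodes and memory blows up
merged = transactions.merge(customers, left_index=True, right_index=True)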
Yes, what you want to do is possible, but you may need to play with partition sizes a bit. If there is a lot of repetition in your data, then the merge may suddenly produce very large results in pandas. You can address this by ...
Using smaller partitions (maybe)
Reducing the amount of parallelism (perhaps try dask.config.set(scheduler="single-threaded") to see if that helps)
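A sketch of those two adjustments (the sizes shown are just starting points to experiment with):

import dask
import dask.dataframe as dd

# Smaller partitions: read with a smaller blocksize, or repartition afterwards
transactions = dd.read_csv("transactions_*.csv", blocksize="64MB")
# transactions = transactions.repartition(partition_size="64MB")

# Less parallelism: fall back to the single-threaded scheduler to see if it helps
dask.config.set(scheduler="single-threaded")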
I'm new to the world of big data. My goal is to maintain an input data stream in some kind of data structure so I can perform queries and aggregation operations on it.
Taking a continuous data stream as input, I store it in a DataFrame via Spark Structured Streaming. My questions are:
Is a DataFrame a volatile data structure?
In case the program crashes, is the DataFrame maintained?
Is a DataFrame distributed on the various nodes of the cluster or is it kept on the node that executes the code?
Is it possible to create an index on a DataFrame to speed up the response of some queries?
If you are planning to use Spark 2.4+, then go with DataFrames.
Yes, DataFrames are volatile; they vanish once the Spark job finishes, but you can save them to disk as Parquet files and query them later via Spark / Hive or any tool that can read the Parquet file format.
If the program crashes, the DataFrame is not recoverable unless you happened to save it before the crash, in which case you can read it back once your Spark job is up again.
A DataFrame is a distributed data structure used and understood by Spark, so yes, it is partitioned/split across the Spark nodes.
A DataFrame's partitions can be tuned to improve query performance and minimize data shuffling.
Apart from the above points, Spark has a built-in checkpointing mechanism to make sure there is no data loss when your job crashes. Detailed documentation can be found in the Spark docs.
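A rough sketch of writing a streaming DataFrame to Parquet with checkpointing enabled (the source and paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-parquet").getOrCreate()

# Example source; any Structured Streaming source is handled the same way
stream_df = (spark.readStream
             .format("rate")               # built-in test source
             .option("rowsPerSecond", 10)
             .load())

# The checkpoint location lets the query resume where it left off after a crash
query = (stream_df.writeStream
         .format("parquet")
         .option("path", "/data/events_parquet")
         .option("checkpointLocation", "/data/checkpoints/events")
         .outputMode("append")
         .start())

# The Parquet output can later be read back for batch queries:
# spark.read.parquet("/data/events_parquet")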
I have a very complex query that needs to join 9 or more tables with some 'group by' expressions. Most of these tables have almost the same number of rows. These tables also have some columns that can be used as a 'key' to partition the tables.
Previously, the app ran fine, but now the data set is 3-4 times as large as before. My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data and stalls (no matter how I adjust the memory, partitions, executors, etc.). The actual data is probably just dozens of GBs.
I would think that if the partitioning works properly, Spark shouldn't shuffle so much and the join should be done on each node. It is puzzling why Spark is not 'smart' enough to do so.
I could split the data set (using the 'key' I mentioned above) into many data sets that can be dealt with independently, but the burden would be on me... which defeats the very reason to use Spark. What other approaches could help?
I use Spark 2.0 over Hadoop YARN.
My tests showed that if the row count of each table is less than 4,000,000, the application still runs pretty nicely. However, if the count is more than that, the application writes hundreds of terabytes of shuffle data
When joining datasets, if the size of one side is less than a certain configurable threshold, Spark broadcasts that entire table to each executor so that the join may be performed locally everywhere. Your observation above is consistent with this. You can also provide the broadcast hint explicitly to Spark, like so: df1.join(broadcast(df2))
Other than that, can you please provide more specifics about your problem?
[Some time ago I was also grappling with the issue of joins and shuffles for one of our jobs that had to handle a couple of TBs. We were using RDDs (and not the Dataset API). I wrote about my findings here; they may be of some use as you try to reason about the underlying data shuffle.]
Update: According to the documentation, spark.sql.autoBroadcastJoinThreshold is the configurable property key. Its default value is 10 MB, and it does the following:
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1, broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
So apparently, this is supported only for Hive tables.
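For reference, a sketch of both knobs discussed above (the dataframe paths and the join key are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

big_df = spark.read.parquet("/data/big_table")      # placeholder inputs
small_df = spark.read.parquet("/data/small_table")

# Explicit hint: broadcast the smaller side regardless of table statistics
joined = big_df.join(broadcast(small_df), on="key")

# Or raise the automatic threshold (value in bytes; here ~100 MB)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))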