Optimal way of repartition in Spark - dataframe

I have a 500 MB CSV file which I am reading as a dataframe.
I am looking for the optimal number of partitions for this dataframe.
I need to do some wide transformations and join this dataframe with another CSV, so I now have the 3 approaches below for re-partitioning this dataframe:
df.repartition(no. of cores)
Re-partitioning the dataframe as per the calculation 500 MB / 128 MB ~ 4 partitions, so as to have at least 128 MB of data in each partition
Re-partitioning the dataframe using specific columns of the CSV, so as to co-locate data in the same partitions
I want to know which of these options will be best for parallel computation and processing in Spark 2.4.

If you know the data very well, then using the columns to partition the data works best. However, repartitioning based on either the block size or the number of cores is subject to change whenever the cluster configuration changes, and you would need to adjust those values for every environment whose cluster configuration differs in higher environments. So, all in all, going with data-driven repartitioning is the better approach.
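As a rough PySpark sketch of the three options (the file path, the core count of 8, and the column name join_key are placeholders, not from the question):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/path/to/file.csv", header=True)
# Option 1: match the number of cores (assumed to be 8 here)
df_by_cores = df.repartition(8)
# Option 2: target ~128 MB per partition (500 MB / 128 MB ~ 4)
df_by_size = df.repartition(4)
# Option 3: data-driven, co-locating rows that share the join column
df_by_key = df.repartition("join_key")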

Related

When exactly should I repartition my dataframes?

I'm working with a large Spark application on EMR that processes data in the hundreds of GBs. At the moment, I'm reading in a number of parquet files that are transformed into Spark dataframes and subsequently performing data enrichment.
During this step, I'm joining several of these dataframes together in order to compute intermediate values or derive new columns from existing ones. As I'm learning more about Spark, it's become obvious to me as this job continues to scale that I need to be mindful of potential performance optimizations.
After looking into the Spark UI for my application, I notice that a number of my executors are not getting input, and there is seemingly skew in the shuffle reads and writes.
There is an ID key in my dataframes that I am joining on that should be evenly distributed, so it makes me feel like I need to repartition my dataframes on that key when reading them in from the original parquet files. Is this line of thinking correct? Should I be calling repartition manually on my dataframe on that ID key, or is Spark already handling this, and is what I am seeing above actually a symptom of something else?
Thanks!

Delta Lake: Performance Challenges

Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas dataframe, which is written to an Azure SQL Database table.
Approach 2: I implemented a delta lake, where the output pandas dataframe is converted to a Spark dataframe and the data is then inserted into a partitioned Delta table. The process is simple, and the time required to convert a pandas dataframe to a Spark dataframe is in milliseconds. But the performance compared to Approach 1 is bad: with Approach 1, I am able to finish in less than half the time required by Approach 2.
I tried different optimization techniques like ZORDER, compaction (bin-packing), and using insertInto rather than saveAsTable, but none really improved the performance.
Please let me know if I have missed any performance tuning methods. If there are none, I am curious to know why Delta Lake did not perform better than the pandas+database approach. I am also happy to hear about other, better approaches; for example, I came across dask.
Many thanks for your answers in advance.
Regards,
Chaitanya
You don't give enough information to answer your question. What exactly is not performant: the whole process of data ingest?
Z-ordering doesn't give you an advantage while you are processing the data into the delta lake; if anything, it is more likely to slow you down. It gives you an advantage when you are reading the data afterwards. Z-ordering by, for example, ID tries to store rows with the same ID in the same file(s), which lets Spark use data skipping to avoid reading in unnecessary data.
Also, how big is your data actually? If we are talking about a few GBs of data in the end, pandas and a traditional database will perform faster.
I can give you an example:
Let's say you have a daily batch job that processes 4 GB of data. If it's just about processing those 4 GB to store them somewhere, Spark will not necessarily perform faster, as I already mentioned.
But now consider that you have that job running for a year, which gives you ~1.5 TB of data at the end of the year. Now you can perform analytics on the entire history of data, and in this scenario you will probably be much faster than a database and pandas.
As a side note, you say you are reading in a bunch of JSON files to convert them to pandas and then to Delta Lake.
If there is no specific reason to do so, in approach 2 I would just use:
spark.read.json("path")
to avoid the process of converting from pandas to Spark dataframes.
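A minimal sketch of that direct route (the paths and the partition column event_date are placeholders; this assumes the delta-spark package is available on the cluster):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the raw JSON straight into a Spark dataframe, skipping pandas entirely
df = spark.read.json("/path/to/json/")
# Insert into a partitioned Delta table
df.write.format("delta").mode("append").partitionBy("event_date").save("/path/to/delta-table/")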

Merge large datasets with dask

I have two datasets: one is around 45 GB and contains daily transactions for 1 year, and the second one is 3.6 GB and contains customer IDs and details. I want to merge the two together on a common column to create a single dataset, which exceeds the memory of the server since there can be multiple transactions per customer. I am working on a Windows server with 16 cores and 64 GB RAM, which I understand are very limited specs for this type of work.
Methodology
Read the big dataset into a dask dataframe and set the index to be the customer ID. Read the 3.6 GB dataset in pandas and set the index to be the customer ID. Launch a local cluster with the parameters memory_limit='50GB' and processes=False.
Merge the dask dataframe with the pandas dataframe on index (left_index=True, right_index=True).
This method creates 75000 tasks which eventually blow up the memory.
Is what I am trying to accomplish possible? Have I chosen the wrong tools for that? I am running out of ideas and I would desperately need some help.
Yes, what you want to do is possible, but you may need to play with partition sizes a bit. If there is a lot of repetition in your data then Pandas may suddenly produce very large values. You can address this by (see the sketch after this list):
Using smaller partitions (maybe)
Reducing the amount of parallelism (perhaps try dask.config.set(scheduler="single-threaded")) to see if that helps
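A minimal sketch of those two knobs (the file names and the customer_id column are placeholders):
import dask
import dask.dataframe as dd
import pandas as pd
# Optionally reduce parallelism while debugging memory blow-ups
dask.config.set(scheduler="single-threaded")
# Small partitions keep each merge task's output manageable
transactions = dd.read_csv("transactions.csv", blocksize="64MB")
transactions = transactions.set_index("customer_id")
# The 3.6 GB customer table fits in RAM, so plain pandas is fine
customers = pd.read_csv("customers.csv").set_index("customer_id")
# Merge on the index and stream the result to disk instead of collecting it in memory
merged = transactions.merge(customers, left_index=True, right_index=True)
merged.to_parquet("merged/")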

How to save an input data stream to a performant data structure for SQL queries in Spark?

I'm new to the world of big data. My goal is to maintain an input data stream in some kind of data structure so that I can perform queries and aggregation operations on it.
Taking a continuous data stream as input, via Spark Structured Streaming, I store it in a DataFrame. My questions are:
Is a DataFrame a volatile data structure?
In case the program crashes, is the DataFrame maintained?
Is a DataFrame distributed on the various nodes of the cluster, or is it kept on the node that executes the code?
Is it possible to create an index on a DataFrame to speed up the response of some queries?
If you are planning to use Spark 2.4+, then go with DataFrames.
Yes, DataFrames are volatile; they vanish once the Spark job finishes. But you can save them to disk as Parquet files and query them later via Spark, Hive, or any tool that can read the Parquet file format.
In case the program crashes, the DataFrame is not recoverable unless you happened to save it before the crash, in which case you can read it back once your Spark job is up again.
A DataFrame is a distributed data structure used and understood by Spark, so yes, it is partitioned/split across the Spark nodes.
A DataFrame has partitions that you can tune to improve query performance and minimize data shuffling.
Apart from the above points, Spark has a built-in checkpointing mechanism to make sure there is no data loss when your job crashes. Detailed documentation can be found on the Spark site.
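A minimal PySpark sketch of persisting a stream with checkpointing (the socket source and all paths are placeholders):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Any streaming source works; a socket source keeps the example short
stream = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Parquet sink plus a checkpoint location, so Spark can recover
# its position in the stream after a crash and restart
query = (stream.writeStream
    .format("parquet")
    .option("path", "/data/stream-output/")
    .option("checkpointLocation", "/data/checkpoints/")
    .start())
query.awaitTermination()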

Difference between df.repartition and DataFrameWriter partitionBy?

What is the difference between DataFrame repartition() and DataFrameWriter partitionBy() methods?
I assume both are used to "partition data based on a dataframe column"? Or is there any difference?
Watch out: I believe the accepted answer is not quite right! I'm glad you asked this question, because the behavior of these similarly-named functions differs in important and unexpected ways that are not well documented in the official Spark documentation.
The first part of the accepted answer is correct: calling df.repartition(COL, numPartitions=k) will create a dataframe with k partitions using a hash-based partitioner. COL here defines the partitioning key--it can be a single column or a list of columns. The hash-based partitioner takes each input row's partition key, hashes it into a space of k partitions via something like partition = hash(partitionKey) % k. This guarantees that all rows with the same partition key end up in the same partition. However, rows from multiple partition keys can also end up in the same partition (when a hash collision between the partition keys occurs) and some partitions might be empty.
In summary, the unintuitive aspects of df.repartition(COL, numPartitions=k) are that (see the demonstration below):
partitions will not strictly segregate partition keys
some of your k partitions may be empty, whereas others may contain rows from multiple partition keys
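A small PySpark demonstration of both effects (toy data; 5 distinct keys hashed into 8 partitions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("mon",), ("tue",), ("wed",), ("thu",), ("fri",)], ["day"])
# 5 distinct keys into 8 partitions: some partitions stay empty, and
# two keys can collide into the same partition via hash(key) % 8
repartitioned = df.repartition(8, "day")
repartitioned.groupBy(spark_partition_id().alias("pid")).count().show()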
The behavior of df.write.partitionBy is quite different, in a way that many users won't expect. Let's say that you want your output files to be date-partitioned, and your data spans 7 days. Let's also assume that df has 10 partitions to begin with. When you run df.write.partitionBy('day'), how many output files should you expect? The answer is 'it depends'. If each of your starting partitions in df contains data from every day, then the answer is 70. If each of your starting partitions in df contains data from exactly one day, then the answer is 10.
How can we explain this behavior? When you run df.write, each of the original partitions in df is written independently. That is, each of your original 10 partitions is sub-partitioned separately on the 'day' column, and a separate file is written for each sub-partition.
I find this behavior rather annoying and wish there were a way to do a global repartitioning when writing dataframes.
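One common workaround is to repartition on the same column right before the write, so that all rows for a given day sit in a single in-memory partition and each output directory typically receives a single file (a sketch; df and the output path are assumed from context):
# One in-memory partition per distinct day, then one file per day= directory
df.repartition("day").write.partitionBy("day").csv("/path/to/output/")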
If you run repartition(COL) you change the partitioning during calculations - you will get spark.sql.shuffle.partitions (default: 200) partitions. If you then call .write you will get one directory with many files.
If you run .write.partitionBy(COL) then as a result you will get as many directories as there are unique values in COL. This speeds up further data reading (if you filter by the partitioning column) and saves some space in storage (the partitioning column is removed from the data files).
UPDATE: See #conradlee's answer. He explains in detail not only how the directory structure will look after applying the different methods but also what the resulting number of files will be in both scenarios.
repartition() is used to partition data in memory and partitionBy is used to partition data on disk. They're often used in conjunction.
Both repartition() and partitionBy can be used to "partition data based on dataframe column", but repartition() partitions the data in memory and partitionBy partitions the data on disk.
repartition()
Let's play around with some code to better understand partitioning. Suppose you have the following CSV data.
first_name,last_name,country
Ernesto,Guevara,Argentina
Vladimir,Putin,Russia
Maria,Sharapova,Russia
Bruce,Lee,China
Jack,Ma,China
df.repartition(col("country")) will repartition the data by country in memory.
Let's write out the data so we can inspect the contents of each memory partition.
val outputPath = new java.io.File("./tmp/partitioned_by_country/").getCanonicalPath
df.repartition(col("country"))
.write
.csv(outputPath)
Here's how the data is written out on disk:
partitioned_by_country/
  part-00002-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
  part-00044-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
  part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv
Each file contains data for a single country - the part-00059-95acd280-42dc-457e-ad4f-c6c73be6226f-c000.csv file contains this China data for example:
Bruce,Lee,China
Jack,Ma,China
partitionBy()
Let's write out data to disk with partitionBy and see how the filesystem output differs.
Here's the code to write out the data to disk partitions.
val outputPath = new java.io.File("./tmp/partitionedBy_disk/").getCanonicalPath
df
.write
.partitionBy("country")
.csv(outputPath)
Here's what the data looks like on disk:
partitionedBy_disk/
  country=Argentina/
    part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000.csv
  country=China/
    part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000.csv
  country=Russia/
    part-00000-906f845c-ecdc-4b37-a13d-099c211527b4.c000.csv
Why partition data on disk?
Partitioning data on disk can make certain queries run much faster: when a query filters on the partition column, Spark can skip the directories for all the other partition values (partition pruning) instead of scanning the full dataset.
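For example (a PySpark sketch reusing the hypothetical output path from above), a filter on the partition column lets Spark read only the matching directory:
# Partition discovery turns the country=.../ directories into a column;
# this filter only scans the country=Russia/ directory
df = spark.read.csv("./tmp/partitionedBy_disk/")
df.where(df["country"] == "Russia").show()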