Apache Spark dataframe lineage trimming via RDD and role of cache - dataframe

There is the following trick how to trim Apache Spark dataframe lineage, especially for iterative computations:
def getCachedDataFrame(df: DataFrame): DataFrame = {
val rdd = df.rdd.cache()
df.sqlContext.createDataFrame(rdd, df.schema)
}
It looks like some sort of pure magic, but right now I'm wondering why do we need to invoke cache() method on RDD? What is the purpose of having cache in this lineage trimming logic?

To understand the purpose of caching, it helps to understand the different types of RDD operations: transformations and actions. From the docs:
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
Also consider this bit:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
So Spark's transformations (like map for example) are all lazy, because this helps Spark in being smart about which calculations need to be done while creating a query plan.
What does this have to do with caching?
Consider the following code:
// Reading in some data
val df = spark.read.parquet("some_big_file.parquet")
// Applying some transformations on the data (lazy operations here)
val cleansedDF = df
.filter(filteringFunction)
.map(cleansingFunction)
// Executing an action, triggering the transformations to be calculated
cleansedDF.write.parquet("output_file.parquet")
// Executing another action, triggering the transformations to be calculated
// again
println(s"You have ${cleansedDF.count} rows in the cleansed data")
In here, we reading in some file, applying some transformations and applying 2 actions to the same dataframe: cleansedDF.write.parquet and cleansedDF.count.
As the comments in the code explain, if we run the code like this we will be actually computing those transformations twice. Since the transformations are lazy, they will only get executed if an action requires them to be executed.
How can we prevent this double calculation? With caching: we can tell Spark to keep "save" result of some transformations so that they don't have to be calculated multiple times. This can be either on disk/memory/....
So with this knowledge, our code could be something like this:
// Reading in some data
val df = spark.read.parquet("some_big_file.parquet")
// Applying some transformations on the data (lazy operations here) AND caching the result of this calculation
val cleansedDF = df
.filter(filteringFunction)
.map(cleansingFunction)
.cache
// Executing an action, triggering the transformations to be calculated AND the results to be cached
cleansedDF.write.parquet("output_file.parquet")
// Executing another action, reusing the cached data
println(s"You have ${cleansedDF.count} rows in the cleansed data")
I've adjusted the comments in this code block to highlight the difference with the previous block.
Note that .persist also exists. With .cache you use the default storage level, with .persist you can specify which storage level as this SO answer nicely explains.
Hope this helps!

Related

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single node Hadoop and I wrote a job to output a sorted dataframe with only one partition to one single csv file. And I discovered several outcomes when using repartition differently.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried to discard repartition function, but the output was only a part of the records. I realized without using repartition spark will output 200 CSV files instead of 1, even though I am working on a one partition dataframe.
Thus, what I did next were placing repartition(1), repartition(1, "column of partition"), repartition(20) function before orderBy. Yet output remained the same with 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single partitioned dataframe has to use repartition and coalesce, and how the aforesaid processes affect the output. Grateful if someone can elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like sort in your case, it triggers something called a shuffle operation
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe to spark.sql.shuffle.partitions partitions.
I also struggled with the same problem as you do and did not find any elegant solution.
Spark generally doesn’t have a great concept of ordered data, because all your data is split accross multiple partitions. And every time you call an operation that requires a shuffle your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger
As Miroslav points out your data gets shuffled between partitions every time you trigger what’s called a shuffle stage (this is things like grouping or join or window operations)
You can set the number of shuffle partitions in the spark Config - the default is 200
Calling repartition before a group by operation is kind of pointless because spark needs to reparation your data again to execute the groupby
Coalesce operations sometimes get pushed into the shuffle stage by spark. So maybe that’s why it worked. Either that or because you called it after the groupby operation
A good way to understand what’s going on with your query is to start using the spark UI - it’s normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html

Is there an idiomatic way to cache Spark dataframes?

I have a large parquet dataset that I am reading with Spark. Once read, I filter for a subset of rows which are used in a number of functions that apply different transformations:
The following is similar but not exact logic to what I'm trying to accomplish:
df = spark.read.parquet(file)
special_rows = df.filter(col('special') > 0)
# Thinking about adding the following line
special_rows.cache()
def f1(df):
new_df_1 = df.withColumn('foo', lit(0))
return new_df_1
def f2(df):
new_df_2 = df.withColumn('foo', lit(1))
return new_df_2
new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)
output_df = new_df_1.union(new_df_2)
output_df.write.parquet(location)
Because a number of functions might be using this filtered subset of rows, I'd like to cache or persist it in order to potentially speed up execution speed / memory consumption. I understand that in the above example, there is no action called until my final write to parquet.
My questions is, do I need to insert some sort of call to count(), for example, in order to trigger the caching, or if Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself.
If yes, is this an idiomatic approach? Does this mean in production and large scale Spark jobs that rely on caching, random operations that force an action on the dataframe pre-emptively are frequently used, such as a call to count?
there is no action called until my final write to parquet.
and
Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself.
are correct. If you do output_df.explain(), you will see the query plan, which will show that what you said is correct.
Thus, there is no need to do special_rows.cache(). Generally, cache is only necessary if you intend to reuse the dataframe after forcing Spark to calculate something, e.g. after write or show. If you see yourself intentionally calling count(), you're probably doing something wrong.
You might want to repartition after running special_rows = df.filter(col('special') > 0). There can be a large number of empty partitions after running a filtering operation, as explained here.
The new_df_1 will make cache special_rows which will be reused by new_df_2 here new_df_1.union(new_df_2). That's not necessarily a performance optimization. Caching is expensive. I've seen caching slow down a lot of computations, even when it's being used in a textbook manner (i.e. caching a DataFrame that gets reused several times downstream).
Counting does not necessarily make sure the data is cached. Counts avoid scanning rows whenever possible. They'll use the Parquet metadata when they can, which means they don't cache all the data like you might expect.
You can also "cache" data by writing it to disk. Something like this:
df.filter(col('special') > 0).repartition(500).write.parquet("some_path")
special_rows = spark.read.parquet("some_path")
To summarize, yes, the DataFrame will be cached in this example, but it's not necessarily going to make your computation run any faster. It might be better to have no cache or to "cache" by writing data to disk.

When i need to persist dataframe

I am curious to know when i need to persist my dataframe in spark and when not. Cases:-
If i need data from file ( Do i need to persist it? if i apply repetitive count like:-
val df=spark.read.json("file://root/Download/file.json")
df.count
df.count
Do i need to persist df?? because according to me it should store df in memory after first count and use same df in second count. Record in file is 4 , Because when i practically check it , it read file again and again, So why spark doesn't store it in memory
Second question is in spark read is an action or transformation?
DataFrames by design are immutable, so every transformation done on them would create a new data frame altogether. A spark pipeline generally involves multiple transformations leading to multiple data frames being created. If spark stores all of these data frames, the memory requirement would be huge. So spark leaves the responsibility of persisting data frames to the user. Whichever data frame you are planning on re using, you can persist them and later unpersist them when done.
I don't think we can define spark read as an action or a transformation. Action/Transformation is applied over a data frame. To identify the difference, you should remember that the transformation operation will return a new dataframe while action will return some value/s.

cache table in pyspark using sql [duplicate]

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is a RDD and is available in all/some of the node's memory.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To quick answer the question, after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data, only a HadoopRDD is constructed, using the file as source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD wordsRDD that contains a reference to testFile and a function to be applied when needed.
Only when an action is called upon an RDD, like wordsRDD.count, the RDD chain, called lineage will be executed. That is, the data, broken down in partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into a count for positive and negative words. You could do this like that:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing.
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
The RDD data stored in a distributed way in the memory by default?
No!
And these are the reasons why :
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
Below are the three situations you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Adding another reason to add (or temporarily add) cache method call.
for debug memory issues
with cache method, spark will give debugging informations regarding the size of the RDD. so in the spark integrated UI, you will get RDD memory consumption info. and this proved very helpful diagnosing memory issues.

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, takes a long time to load into Spark and also to query from. I think Spark partitions the data into too many nodes. I want to be able to set that manually. I know when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. Also, I have seen repartition(), but I am not sure if it can solve my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: commonly referred to as "partition key" , where a column is selected and indexed to speed up query (that is not what i want). Definition two: (this is where my concern is) suppose you have a data set, Spark decides it is going to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this may further slow down the process. How can i set that value
By default it partitions into 200 sets. You can change it by using set command in sql context sqlContext.sql("set spark.sql.shuffle.partitions=10");. However you need to set it with caution based up on your data characteristics.
You can call repartition() on dataframe for setting partitions. You can even set spark.sql.shuffle.partitions this property after creating hive context or by passing to spark-submit jar:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
Number of "input" partitions are fixed by the File System configuration.
1 file of 1Go, with a block size of 128M will give you 10 tasks. I am not sure you can change it.
repartition can be very bad, if you have lot of input partitions this will make lot of shuffle (data traffic) between partitions.
There is no magic method, you have to try, and use the webUI to see how many tasks are generated.