Memory Allocation in Spark(df) - dataframe

I have defined a df in spark and have applied some transformations(filter) and have stored it in the same df what happens to the memory allocated to the df.
df=rdd.filter1
df=df.fitler2
df.filter3
df.fitler4

There are two ways to look at it-
Dataframe is immutable, thus for every operation that you apply on it
a new Dataframe is created(with fresh memory allocated). So, the 'df'
will ultimately point to the dataframe returned by the last 'filter'
operation. Since everytime a new dataframe object is created - your
question about 'change in memory allocation' stands invalid.
If you mean - multiple filter operation will reduce the data and the
memory required. The answer is 'Yes'. Due to filter operations the
dataframe partitions will shrink and 'less' memory would be occupied.

Actually, nothing until you invoke some action like collecting data to driver or writing DF to file. All transformations in Spark are lazy.

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe with several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe and it's important that the functions all share the same list. However, when I add a row with pd.concat a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location as given by id().
As a result the functions are no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied. Once that happens, referrers to the original dataframe
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures, they aren't dependent on each others output, so this may be a feasable option.

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single node Hadoop and I wrote a job to output a sorted dataframe with only one partition to one single csv file. And I discovered several outcomes when using repartition differently.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried to discard repartition function, but the output was only a part of the records. I realized without using repartition spark will output 200 CSV files instead of 1, even though I am working on a one partition dataframe.
Thus, what I did next were placing repartition(1), repartition(1, "column of partition"), repartition(20) function before orderBy. Yet output remained the same with 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single partitioned dataframe has to use repartition and coalesce, and how the aforesaid processes affect the output. Grateful if someone can elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like sort in your case, it triggers something called a shuffle operation
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe to spark.sql.shuffle.partitions partitions.
I also struggled with the same problem as you do and did not find any elegant solution.
Spark generally doesn’t have a great concept of ordered data, because all your data is split accross multiple partitions. And every time you call an operation that requires a shuffle your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger
As Miroslav points out your data gets shuffled between partitions every time you trigger what’s called a shuffle stage (this is things like grouping or join or window operations)
You can set the number of shuffle partitions in the spark Config - the default is 200
Calling repartition before a group by operation is kind of pointless because spark needs to reparation your data again to execute the groupby
Coalesce operations sometimes get pushed into the shuffle stage by spark. So maybe that’s why it worked. Either that or because you called it after the groupby operation
A good way to understand what’s going on with your query is to start using the spark UI - it’s normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html

When i need to persist dataframe

I am curious to know when i need to persist my dataframe in spark and when not. Cases:-
If i need data from file ( Do i need to persist it? if i apply repetitive count like:-
val df=spark.read.json("file://root/Download/file.json")
df.count
df.count
Do i need to persist df?? because according to me it should store df in memory after first count and use same df in second count. Record in file is 4 , Because when i practically check it , it read file again and again, So why spark doesn't store it in memory
Second question is in spark read is an action or transformation?
DataFrames by design are immutable, so every transformation done on them would create a new data frame altogether. A spark pipeline generally involves multiple transformations leading to multiple data frames being created. If spark stores all of these data frames, the memory requirement would be huge. So spark leaves the responsibility of persisting data frames to the user. Whichever data frame you are planning on re using, you can persist them and later unpersist them when done.
I don't think we can define spark read as an action or a transformation. Action/Transformation is applied over a data frame. To identify the difference, you should remember that the transformation operation will return a new dataframe while action will return some value/s.

PySpark - Does DataFrame.count() Cause a cache()?

If I have a very large DataFrame on my PySpark cluster, does calling df.count() on it cause the entire DataFrame df to be brought into memory of a single node, or do all the individual nodes count their part of the structure and return it somewhere to be aggregated as a final result?
I don't see anything in the documentation to indicate this one way or the other. Basically I don't want to call count() on a DataFrame that's too big to fit in the memory of any individual node.
count is something that can be distributed across executors. So, for each executor, count their number of records. Then send the aggregated number of records to be counted together. Spark optimizations will take care of those simple details.
If you call collect() then, that's what causes driver to be flooded with complete dataframe and most likely resulting in failure.
The best practice on the spark is not to usee count and it's recommended to use isEmpty method instead of count method if it's possible. Also, all of the spark actions except collect method will run on the spark executors and the only result will return to the spark driver

cache table in pyspark using sql [duplicate]

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is a RDD and is available in all/some of the node's memory.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To quick answer the question, after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data, only a HadoopRDD is constructed, using the file as source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD wordsRDD that contains a reference to testFile and a function to be applied when needed.
Only when an action is called upon an RDD, like wordsRDD.count, the RDD chain, called lineage will be executed. That is, the data, broken down in partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into a count for positive and negative words. You could do this like that:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing.
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
The RDD data stored in a distributed way in the memory by default?
No!
And these are the reasons why :
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
Below are the three situations you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Adding another reason to add (or temporarily add) cache method call.
for debug memory issues
with cache method, spark will give debugging informations regarding the size of the RDD. so in the spark integrated UI, you will get RDD memory consumption info. and this proved very helpful diagnosing memory issues.