Multiple persists in the Spark Execution plan - dataframe

I currently have some spark code (pyspark), which loads in data from S3 and applies several transformations on it. The current code is structured in such a way that there are a few persists along the way in the following format
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df = df.transformation5
df = df.transformation6
df = df.transformation7
df.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
df = df.transformationN-1
df = df.transformationN
df.persist(MEMORY_AND_DISK)
When I do df.explain() at the very end of all transformations, as expected, there are multiple persists in the execution plan. Now when I do the following at the end of all these transformations
print(df.count())
All transformations get triggered, including the persists. Since Spark flows through the execution plan, it will execute all these persists. Is there any way I can tell Spark to unpersist the (N-1)th persist when performing the Nth persist, or is Spark smart enough to do this itself? My issue stems from the fact that later on in the program I run out of disk space, i.e. Spark errors out with the following error:
No space left on device
An easy solution is of course to increase the underlying number of instances. But my hypothesis is that the high number of persists eventually causes the disk to run out of space.
My question is: do these lingering persists cause this issue? If they do, what is the best way/practice to structure the code so that I can unpersist the (N-1)th persist automatically?

I'm more experienced with Scala Spark, but it's definitely possible to unpersist a DataFrame.
In fact, the PySpark method of a DataFrame is also called unpersist. So in your example, you could do something like this (it's quite crude):
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df1 = df.transformation5
df1 = df1.transformation6
df1 = df1.transformation7
df.unpersist()
df1.persist(MEMORY_AND_DISK)
.
.
.
dfM = dfM-1.transformationN-2
dfM = dfM.transformationN-1
dfM = dfM.transformationN
dfM-1.unpersist()
dfM.persist(MEMORY_AND_DISK)
Now, the way this code looks raises some questions for me. It might be that you've mostly written this as pseudocode to be able to ask the question, but the following points might still help you further:
I only see transformations in there, and no actions. If this is the case, do you even need to persist?
Also, you only seem to have 1 source of data (the spark.read.csv bit): this also seems to hint at not necessarily needing to persist.
This is more of a point about style (and maybe opinionated, so don't worry if you don't agree). As I said at the beginning, I have no experience with PySpark, but the way I would write something similar in Scala Spark would be something like this:
df = spark.read.csv(s3path)
       .transformation1
       .transformation2
       .transformation3
       .transformation4
       .persist(MEMORY_AND_DISK)
df = df.transformation5
       .transformation6
       .transformation7
       .persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
       .transformationN-1
       .transformationN
       .persist(MEMORY_AND_DISK)
This is less verbose and IMO a little more "true" to what really happens, just a chaining of transformations on the original dataframe.
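If you do need the intermediate caches, the "unpersist the previous stage when you persist the next one" step can also be wrapped in a small helper. The sketch below is mine, not the original poster's code; it assumes PySpark, keeps the question's placeholder transformation names, and materialises each new stage with a count before dropping the old cache (otherwise Spark would simply recompute the new stage from the source):

from pyspark import StorageLevel
from pyspark.sql import DataFrame

def persist_and_replace(new_df: DataFrame, old_df: DataFrame = None) -> DataFrame:
    # cache the new stage, force it to materialise, then release the previous cache
    new_df.persist(StorageLevel.MEMORY_AND_DISK)
    new_df.count()                      # action that actually fills the cache
    if old_df is not None:
        old_df.unpersist()
    return new_df

df = spark.read.csv(s3path)             # s3path as in the question
stage1 = persist_and_replace(df.transformation1().transformation2())
stage2 = persist_and_replace(stage1.transformation3().transformation4(), old_df=stage1)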


loading input dataset for every iteration in a for loop in Palantir

I have a @transform_pandas code which loads the input file for computing.
Inside the compute function I have a for loop which has to read the complete input data and filter it accordingly for every iteration.
@transform_pandas(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
)
In the code below I'm trying to read the source_df dataset for every iteration of the for loop, filter the dataset to the specific year and family, and do the computation.
def compute(source_df):
    for entire_row in vhcl_df.itertuples():
        modyr = entire_row[1]
        fam = str(entire_row[2])
        # source_df should be read again here
        source_df = source_df.loc[source_df['i_yr'] == modyr]
        source_df = source_df.loc[source_df['fam'] == fam]
        ...
Is there a way to achieve this? Thank you for your support.
As already suggested by @nicornk in the comments, you should create a new .copy() of your source_df right after you declare the transform.
The two filtering steps can also be merged into one, if you don't need to work on the "modyr-filtered" source_df separately.
Note that, since modyr and fam are actual column names of vhcl_df, it is actually sufficient to do
@transform_pandas(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
    vhcl_df=Input(path)
)
def compute(source_df, vhcl_df):
    # iterate over the (modyr, fam) pairs of vhcl_df row by row
    for modyr, fam in zip(vhcl_df['modyr'], vhcl_df['fam']):
        temp_df = source_df.copy()
        temp_df = temp_df.loc[temp_df['i_yr'] == modyr]
        temp_df = temp_df.loc[temp_df['fam'] == str(fam)]
which can be written more concisely and cleanly as
def compute(source_df, vhcl_df):
    for modyr, fam in zip(vhcl_df['modyr'], vhcl_df['fam']):
        temp_df = source_df.copy()
        filtered_temp_df = temp_df[(temp_df.i_yr == modyr) & (temp_df.fam == str(fam))]
PS: Remember that if source_df is big, you should proceed with PySpark (see the Foundry docs):
Note that transform_pandas should only be used on datasets that can fit into memory. If you have larger datasets that you wish to filter down first before converting to Pandas, you should write your transformation using the transform_df() decorator and the pyspark.sql.SparkSession.createDataFrame() method.
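For reference, a rough sketch of what the PySpark route could look like; the dataset paths and the @transform_df usage below are assumptions based on the standard Foundry transforms API, not code from the original post:

from transforms.api import transform_df, Input, Output
import pyspark.sql.functions as F

@transform_df(
    Output("/FCA_Foundry/dataset1"),               # output path taken from the question
    source_df=Input("/FCA_Foundry/source"),        # hypothetical input paths
    vhcl_df=Input("/FCA_Foundry/vehicles"),
)
def compute(source_df, vhcl_df):
    # keep only the (i_yr, fam) combinations present in vhcl_df,
    # letting Spark do the filtering before anything reaches pandas
    keys = vhcl_df.select(F.col("modyr").alias("i_yr"), "fam").distinct()
    return source_df.join(keys, on=["i_yr", "fam"], how="inner")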

splitting columns with str.split() not changing the outcome

I have to use str.split() for an exercise. I have a column called title that I need to split into two columns, Name and Season.
The following code does not throw an error, but it doesn't seem to be doing anything either when I test it with df.head():
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
Any help as to why?
The code you have in your question is correct and should be working. The issue could be coming from the execution order of your code, though, if you're using a Jupyter Notebook or some other environment that allows code to be executed out of order.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing those lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
print(df.columns)
Alternatively, you could add an assertion step to your code to ensure it's working:
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"
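For reference, here is a small self-contained check of what the split should produce, using made-up titles since the real data isn't shown in the question:

import pandas as pd

# hypothetical titles; the real values come from the exercise dataset
df = pd.DataFrame({'title': ['Stranger Things: Season 1', 'Dark: Season 2']})

df[['Name', 'Season']] = df['title'].str.split(':', n=1, expand=True)
print(df.head())
# 'Name' holds the text before the first ':' and 'Season' the remainder
# (note the leading space, which can be removed with .str.strip())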

How should I merge monthly datasets into one dataset for cleaning?

I am working on a case study for a ride share. The data is broken up into monthly datasets, and in order to analyze it over the last year I need to merge it. I uploaded all the data to both BigQuery and RStudio but am unsure of the best way to make one large dataset.
I may not even have to do this, but I believe that to find trends I should have all the data in one data table. If this is not the case, then I will clean the data one month at a time.
Maybe use purrr::map_dfr()? It's like lapply() and rbind() rolled into one.
library(tidyverse)

all_the_tables <-
  map_dfr(                              # union as it loops over the function
    .x = list.files(pattern = ".csv"),  # input for the function
    .f = read_csv                       # the function
  )
If it's more complicated and you need the source to vary with each file, you can use something like this.
map_dfr(
  .x = list.files(pattern = ".csv"),
  .f = # the tilde lets you build a more complex sequence of steps
    ~ read_csv(file = .x) |>
        mutate(source = .x)
)
If it's a lot of files, consider using vroom::vroom().

A value is trying to be set on a copy of a slice from a DataFrame - don't understand it

I know this topic has been discussed a lot, and I am sorry that I still don't find the solution, even though the difference between a view and a copy is easy to understand (in other languages).
def hole_aktienkurse_und_berechne_hist_einstandspreis(index, start_date, end_date):
    df_history = pdr.get_data_yahoo(symbols=index, start=start_date, end=end_date)
    df_history['HistEK'] = df_history['Adj Close']
    df_only_trd_index = df_group_trade.loc[index].copy()
    for i_hst, r_hst in df_history.iterrows():
        df_bis = df_only_trd_index[(df_only_trd_index['DateClose'] <= i_hst) & (df_only_trd_index['OpenPos'] == 0)].copy()
        # here comes the part that causes the trouble:
        df_history.loc[i_hst]['HistEK'] = df_history.loc[i_hst]['Adj Close'] - df_bis['Total'].sum()/100.0
    return df_history
I think I tried nearly everything, but I don't get it. Python is not easy when it comes to this topic.
When you have to specify both the index and the column in .loc, you have to put them together in a single call; otherwise the annoying message about views appears.
df_history.loc[i_hst, 'HistEK'] = df_history.loc[i_hst, 'Adj Close'] - df_bis['Total'].sum()/100.0
Look at the examples here.
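A minimal illustration of the difference, with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'HistEK': [10.0, 20.0, 30.0]})

# Chained indexing: df.loc[0] returns an intermediate object, so the
# assignment may land on a copy and trigger the SettingWithCopy warning.
# df.loc[0]['HistEK'] = 99.0

# A single .loc call with both the row label and the column label
# writes into df directly.
df.loc[0, 'HistEK'] = 99.0
print(df)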

How to make good reproducible Apache Spark examples

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE but sometimes getting them to show some sample input/output data is like pulling teeth.
Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.
So how does one go about creating a good, reproducible example?
Provide small sample data, that can be easily recreated.
At the very least, posters should provide a couple of rows and columns of their dataframe, along with code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible to demonstrate your problem.
I have the following dataframe:
+-----+---+-----+----------+
|index| X|label| date|
+-----+---+-----+----------+
| 1| 1| A|2017-01-01|
| 2| 3| B|2017-01-02|
| 3| 5| A|2017-01-03|
| 4| 7| B|2017-01-04|
+-----+---+-----+----------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
Show the desired output.
Ask your specific question and show us your desired output.
How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?
Desired output:
+-----+---+-----+----------+------------+
|index| X|label| date|is_divisible|
+-----+---+-----+----------+------------+
| 1| 1| A|2017-01-01| yes|
| 2| 3| B|2017-01-02| yes|
| 3| 5| A|2017-01-03| yes|
| 4| 7| B|2017-01-04| no|
+-----+---+-----+----------+------------+
Explain how to get your output.
Explain, in great detail, how you get your desired output. It helps to show an example calculation.
For instance in row 1, the X = 1 and date = 2017-01-01. Adding 7 days to date yields 2017-01-08. The day of the month is 8 and since 8 is divisible by 1, the answer is 'yes'.
Likewise, for the last row X = 7 and the date = 2017-01-04. Adding 7 to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
Share your existing code.
Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.
(*You can leave out the code to create the spark context, but you should include all imports.)
I know how to add a new column that is date plus 7 days but I'm having trouble getting the day of the month as an integer.
from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))
Include versions, imports, and use syntax highlighting
Full details in this answer written by desertnaut.
For performance tuning posts, include the execution plan
Full details in this answer written by Alper t. Turker.
It helps to use standardized names for contexts.
Parsing Spark output files
MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
Other notes.
Be sure to read how to ask and How to create a Minimal, Complete, and Verifiable example first.
Read the other answers to this question, which are linked above.
Have a good, descriptive title.
Be polite. People on SO are volunteers, so ask nicely.
Performance tuning
If the question is related to performance tuning please include following information.
Execution Plan
It is best to include the extended execution plan. In Python:
df.explain(True)
In Scala:
df.explain(true)
or extended execution plan with statistics. In Python:
print(df._jdf.queryExecution().stringWithStats())
in Scala:
df.queryExecution.stringWithStats
Mode and cluster information
mode - local, client, cluster.
Cluster manager (if applicable) - none (local mode), standalone, YARN, Mesos, Kubernetes.
Basic configuration information (number of cores, executor memory).
Timing information
Slow is relative, especially when you port a non-distributed application or expect low latency. Exact timings for different tasks and stages can be retrieved from the Spark UI (sc.uiWebUrl) or the Spark REST UI.
Use standardized names for contexts
Using established names for each context allows us to quickly reproduce the problem.
sc - for SparkContext.
sqlContext - for SQLContext.
spark - for SparkSession.
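For example, a short PySpark setup that binds all three to those names (a sketch; in Spark 2.x+ the SQLContext is a legacy wrapper around the session):

from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("mcve").getOrCreate()   # SparkSession
sc = spark.sparkContext                                      # SparkContext
sqlContext = SQLContext(sc)                                  # SQLContext (legacy API)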
Provide type information (Scala)
Powerful type inference is one of the most useful features of Scala, but it makes it hard to analyze code taken out of context. Even if the type is obvious from the context, it is better to annotate the variables. Prefer
val lines: RDD[String] = sc.textFile("path")
val words: RDD[String] = lines.flatMap(_.split(" "))
over
val lines = sc.textFile("path")
val words = lines.flatMap(_.split(" "))
Commonly used tools can assist you:
spark-shell / Scala shell
use :t
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> :t rdd
org.apache.spark.rdd.RDD[String]
IntelliJ IDEA
Use Alt + =
Some additional suggestions to what has been already offered:
Include your Spark version
Spark is still evolving, although not so rapidly as in the days of 1.x. It is always (but especially if you are using a somewhat older version) a good idea to include your working version. Personally, I always start my answers with:
spark.version
# u'2.2.0'
or
sc.version
# u'2.2.0'
Including your Python version, too, is never a bad idea.
Include all your imports
If your question is not strictly about Spark SQL & dataframes, e.g. if you intend to use your dataframe in some machine learning operation, be explicit about your imports - see this question, where the imports were added to the OP only after an extensive exchange in the (now removed) comments (and it turned out that these wrong imports were the root cause of the problem).
Why is this necessary? Because, for example, this LDA
from pyspark.mllib.clustering import LDA
is different from this LDA:
from pyspark.ml.clustering import LDA
the first coming from the old, RDD-based API (formerly Spark MLlib), while the second one from the new, dataframe-based API (Spark ML).
Include code highlighting
OK, I'll confess this is subjective: I believe that PySpark questions should not be tagged as python by default; the thing is, the python tag automatically gives code highlighting (and I believe this is a main reason for those who use it for PySpark questions). Anyway, if you happen to agree and you would still like nice, highlighted code, simply include the relevant markdown directive:
<!-- language-all: lang-python -->
somewhere in your post, before your first code snippet.
[UPDATE: I have requested automatic syntax highlighting for pyspark and sparkr tags, which has been implemented indeed]
This small helper function might help to parse Spark output files into a DataFrame:
PySpark:
from pyspark.sql.functions import col, when

def read_spark_output(file_path):
    step1 = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", "|") \
        .option("parserLib", "UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("comment", "+") \
        .csv("file://{}".format(file_path))
    # drop the unnamed border columns produced by the leading/trailing '|'
    step2 = step1.select([c for c in step1.columns if not c.startswith("_")])
    # turn the literal string 'null' into real nulls
    return step2.select(*[when(~col(col_name).eqNullSafe("null"), col(col_name)).alias(col_name)
                          for col_name in step2.columns])
Scala:
// read Spark output fixed-width table:
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
  val step1 = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("parserLib", "UNIVOCITY")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("comment", "+")
    .csv(filePath)

  val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)

  val columns = step2.columns
  columns.foldLeft(step2)((acc, c) => acc.withColumn(c, when(col(c) =!= "null", col(c))))
}
Usage:
df = read_spark_output("/tmp/spark.out")
PS: For PySpark, eqNullSafe is available from Spark 2.3.