How to make good reproducible Apache Spark examples - dataframe

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them to post an MCVE but sometimes getting them to show some sample input/output data is like pulling teeth.
Perhaps part of the problem is that people just don't know how to easily create an MCVE for spark-dataframes. I think it would be useful to have a spark-dataframe version of this pandas question as a guide that can be linked.
So how does one go about creating a good, reproducible example?

Provide small sample data, that can be easily recreated.
At the very least, posters should provide a few rows and columns of their dataframe, along with code that can be used to easily create it. By easy, I mean cut and paste. Make it as small as possible while still demonstrating your problem.
I have the following dataframe:
+-----+---+-----+----------+
|index|  X|label|      date|
+-----+---+-----+----------+
|    1|  1|    A|2017-01-01|
|    2|  3|    B|2017-01-02|
|    3|  5|    A|2017-01-03|
|    4|  7|    B|2017-01-04|
+-----+---+-----+----------+
which can be created with this code:
df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)
Show the desired output.
Ask your specific question and show us your desired output.
How can I create a new column 'is_divisible' that has the value 'yes' if the day of month of 'date' plus 7 days is divisible by the value in column 'X', and 'no' otherwise?
Desired output:
+-----+---+-----+----------+------------+
|index|  X|label|      date|is_divisible|
+-----+---+-----+----------+------------+
|    1|  1|    A|2017-01-01|         yes|
|    2|  3|    B|2017-01-02|         yes|
|    3|  5|    A|2017-01-03|         yes|
|    4|  7|    B|2017-01-04|          no|
+-----+---+-----+----------+------------+
Explain how to get your output.
Explain, in great detail, how you get your desired output. It helps to show an example calculation.
For instance, in row 1, X = 1 and date = 2017-01-01. Adding 7 days to the date yields 2017-01-08. The day of the month is 8, and since 8 is divisible by 1, the answer is 'yes'.
Likewise, for the last row, X = 7 and date = 2017-01-04. Adding 7 days to the date yields 11 as the day of the month. Since 11 % 7 is not 0, the answer is 'no'.
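The worked examples above can be sanity-checked in plain Python. This is just the arithmetic, not a Spark solution, and the helper name is mine:

```python
from datetime import date, timedelta

def is_divisible(d, x):
    # add 7 days, take the day of the month, check divisibility by x
    day_of_month = (d + timedelta(days=7)).day
    return "yes" if day_of_month % x == 0 else "no"

print(is_divisible(date(2017, 1, 1), 1))  # row 1: day 8, 8 % 1 == 0, so 'yes'
print(is_divisible(date(2017, 1, 4), 7))  # row 4: day 11, 11 % 7 != 0, so 'no'
```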
Share your existing code.
Show us what you have done or tried, including all* of the code even if it does not work. Tell us where you are getting stuck and if you receive an error, please include the error message.
(*You can leave out the code to create the spark context, but you should include all imports.)
I know how to add a new column that is date plus 7 days but I'm having trouble getting the day of the month as an integer.
from pyspark.sql import functions as f
df.withColumn("next_week", f.date_add("date", 7))
Include versions, imports, and use syntax highlighting
Full details in this answer written by desertnaut.
For performance tuning posts, include the execution plan
Full details in this answer written by Alper t. Turker.
It helps to use standardized names for contexts.
Parsing spark output files
MaxU provided useful code in this answer to help parse Spark output files into a DataFrame.
Other notes.
Be sure to read How to Ask and How to create a Minimal, Complete, and Verifiable example first.
Read the other answers to this question, which are linked above.
Have a good, descriptive title.
Be polite. People on SO are volunteers, so ask nicely.

Performance tuning
If the question is related to performance tuning, please include the following information.
Execution Plan
It is best to include the extended execution plan. In Python:
df.explain(True)
In Scala:
df.explain(true)
or the extended execution plan with statistics. In Python:
print(df._jdf.queryExecution().stringWithStats())
In Scala:
df.queryExecution.stringWithStats
Mode and cluster information
mode - local, client, cluster.
Cluster manager (if applicable) - none (local mode), standalone, YARN, Mesos, Kubernetes.
Basic configuration information (number of cores, executor memory).
Timing information
"Slow" is relative, especially when you port a non-distributed application or when you expect low latency. Exact timings for different tasks and stages can be retrieved from the Spark UI (sc.uiWebUrl) jobs page or from the Spark REST API.
Use standardized names for contexts
Using established names for each context allows us to quickly reproduce the problem.
sc - for SparkContext.
sqlContext - for SQLContext.
spark - for SparkSession.
Provide type information (Scala)
Powerful type inference is one of the most useful features of Scala, but it makes it hard to analyze code taken out of context. Even if the type is obvious from the context, it is better to annotate the variables. Prefer
val lines: RDD[String] = sc.textFile("path")
val words: RDD[String] = lines.flatMap(_.split(" "))
over
val lines = sc.textFile("path")
val words = lines.flatMap(_.split(" "))
Commonly used tools can assist you:
spark-shell / Scala shell
use :t
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> :t rdd
org.apache.spark.rdd.RDD[String]
IntelliJ IDEA
Use Alt + =

Some additional suggestions to what has been already offered:
Include your Spark version
Spark is still evolving, although not as rapidly as in the days of 1.x. It is always a good idea to include your working version (especially if you are using a somewhat older one). Personally, I always start my answers with:
spark.version
# u'2.2.0'
or
sc.version
# u'2.2.0'
Including your Python version, too, is never a bad idea.
Include all your imports
If your question is not strictly about Spark SQL and dataframes, e.g. if you intend to use your dataframe in some machine learning operation, be explicit about your imports; see this question, where the imports were added by the OP only after an extensive exchange in the (now removed) comments, and it turned out that those wrong imports were the root cause of the problem.
Why is this necessary? Because, for example, this LDA
from pyspark.mllib.clustering import LDA
is different from this LDA:
from pyspark.ml.clustering import LDA
The first comes from the old, RDD-based API (formerly Spark MLlib), while the second comes from the new, dataframe-based API (Spark ML).
Include code highlighting
OK, I'll confess this is subjective: I believe that PySpark questions should not be tagged as python by default. The thing is, the python tag gives automatic code highlighting, and I believe this is the main reason people use it for PySpark questions. Anyway, if you happen to agree and you would still like nice, highlighted code, simply include the relevant markdown directive:
<!-- language-all: lang-python -->
somewhere in your post, before your first code snippet.
[UPDATE: I have requested automatic syntax highlighting for pyspark and sparkr tags, which has been implemented indeed]

This small helper function might help to parse Spark output files into DataFrame:
PySpark:
from pyspark.sql.functions import col, when

def read_spark_output(file_path):
    step1 = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("delimiter", "|") \
        .option("parserLib", "UNIVOCITY") \
        .option("ignoreLeadingWhiteSpace", "true") \
        .option("ignoreTrailingWhiteSpace", "true") \
        .option("comment", "+") \
        .csv("file://{}".format(file_path))
    # drop the empty columns produced by the leading/trailing '|' separators
    step2 = step1.select([c for c in step1.columns if not c.startswith("_")])
    # replace the literal string 'null' with real nulls
    return step2.select(*[
        when(~col(c).eqNullSafe("null"), col(c)).alias(c)
        for c in step2.columns
    ])
Scala:
// read a Spark-output fixed-width table:
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
  val step1 = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("parserLib", "UNIVOCITY")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("comment", "+")
    .csv(filePath)

  val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)

  val columns = step2.columns
  columns.foldLeft(step2)((acc, c) => acc.withColumn(c, when(col(c) =!= "null", col(c))))
}
Usage:
df = read_spark_output("file:///tmp/spark.out")
PS: For pyspark, eqNullSafe is available from spark 2.3.

Related

Multiple persists in the Spark Execution plan

I currently have some spark code (pyspark), which loads in data from S3 and applies several transformations on it. The current code is structured in such a way that there are a few persists along the way in the following format
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df = df.transformation5
df = df.transformation6
df = df.transformation7
df.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
df = df.transformationN-1
df = df.transformationN
df.persist(MEMORY_AND_DISK)
When I do df.explain() at the very end of all transformations, as expected, there are multiple persists in the execution plan. Now when I do the following at the end of all these transformations
print(df.count())
All transformations get triggered, including the persists. Since Spark flows through the execution plan, it will execute all of these persists. Is there any way I can tell Spark to unpersist the (N-1)th persist when performing the Nth persist, or is Spark smart enough to do this itself? My issue stems from the fact that later on in the program I run out of disk space, i.e., Spark errors out with the following error:
No space left on device
An easy solution is of course to increase the number of underlying instances, but my hypothesis is that the high number of persists eventually causes the disk to run out of space.
My question is: do these chained persists cause this issue? If they do, what is the best way/practice to structure the code so that the (N-1)th persist is unpersisted automatically?
I'm more experienced with Scala Spark, but it's definitely possible to unpersist a DataFrame. In fact, the PySpark method on a DataFrame is also called unpersist. So in your example, you could do something like this (it's quite crude):
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df1 = df.transformation5
df1 = df1.transformation6
df1 = df1.transformation7
df.unpersist()
df1.persist(MEMORY_AND_DISK)
.
.
.
dfM = dfM-1.transformationN-2
dfM = dfM.transformationN-1
dfM = dfM.transformationN
dfM-1.unpersist()
dfM.persist(MEMORY_AND_DISK)
Now, the way this code looks triggers some questions in me. It might be that you've mostly written this as pseudocode to be able to ask this question, but still maybe the following questions might help you further:
I only see transformations in there, and no actions. If this is the case, do you even need to persist?
Also, you only seem to have 1 source of data (the spark.read.csv bit): this also seems to hint at not necessarily needing to persist.
This is more of a point about style (and maybe opinionated, so don't worry if you don't agree). As I said in the beginning, I have no experience with Pyspark but the way that I would write (in Scala Spark) something similar to what you have written would be something like this:
df = spark.read.csv(s3path)
    .transformation1
    .transformation2
    .transformation3
    .transformation4
    .persist(MEMORY_AND_DISK)
df = df.transformation5
    .transformation6
    .transformation7
    .persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
    .transformationN-1
    .transformationN
    .persist(MEMORY_AND_DISK)
This is less verbose and IMO a little more "true" to what really happens, just a chaining of transformations on the original dataframe.

How to split a column in many different columns?

I have a dataset, and in one of its columns I have many values that I want to convert to new columns:
"{'availabilities': {'bikes': 4, 'stands': 28, 'mechanicalBikes': 4, 'electricalBikes': 0, 'electricalInternalBatteryBikes': 0, 'electricalRemovableBatteryBikes': 0}, 'capacity': 32}"
I tried to use str.split() and received an error because of the pattern.
bikes_table_ready[['availabilities',
                   'bikes',
                   'stands',
                   'mechanicalBikes',
                   'electricalBikes',
                   'electricalInternalBatteryBikes',
                   'electricalRemovableBatteryBikes',
                   'capacity']] = bikes_table_ready.totalStands.str.extract('{.}', expand=True)
ValueError: pattern contains no capture groups
Which patterns should I use to have it done?
IIUC, use ast.literal_eval with pandas.json_normalize.
With a dataframe df with two columns, an id and the column to be split (col), it gives this:
import ast
import pandas as pd

df["col"] = df["col"].apply(lambda x: ast.literal_eval(x.strip('"')))

out = df.join(pd.json_normalize(df.pop("col").str["availabilities"]))
# Output :
print(out.to_string())
      id  bikes  stands  mechanicalBikes  electricalBikes  electricalInternalBatteryBikes  electricalRemovableBatteryBikes
0  id001      4      28                4                0                               0                                0
Welcome to Stack Overflow! Please provide a minimal reproducible example demonstrating the problem. To learn more about this community and how we can help you, please start with the tour and read How to Ask and its linked resources.
That being said, it seems that the data on which you are trying to use the method str.split() is not actually a string. Check this to find out more about data types. It seems you are trying to retrieve the information from a Python list ([...]) or a dictionary ({"key": "value"}). If that's the case, try checking this link, which talks about Python lists, or this one, which talks about dictionaries.
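As a self-contained sketch of the accepted approach (the id value and column name are made up; this assumes the cell really is a stringified Python dict, hence ast.literal_eval rather than json.loads, since the keys use single quotes):

```python
import ast
import pandas as pd

raw = ("{'availabilities': {'bikes': 4, 'stands': 28, 'mechanicalBikes': 4, "
       "'electricalBikes': 0, 'electricalInternalBatteryBikes': 0, "
       "'electricalRemovableBatteryBikes': 0}, 'capacity': 32}")
df = pd.DataFrame({"id": ["id001"], "totalStands": [raw]})

# parse the string into a real dict, then flatten the nested dict into columns
parsed = df["totalStands"].apply(ast.literal_eval)
avail = pd.json_normalize(parsed.apply(lambda d: d["availabilities"]).tolist())
out = df.drop(columns="totalStands").join(avail)
print(out)
```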

How can I put several extracted values from a Json in an array in Kusto?

I'm trying to write a query that returns the vulnerabilities found by "Built-in Qualys vulnerability assessment" in log analytics.
It was all going smoothly: I was getting the values from the properties JSON and turning them into separate strings, but then I found out that some of the fields contain more than one value, and I need to get all of them in a single cell.
My query currently looks like this:
securityresources | where type =~ "microsoft.security/assessments/subassessments"
| extend assessmentKey=extract(@"(?i)providers/Microsoft.Security/assessments/([^/]*)", 1, id), IdAzure=tostring(properties.id)
| extend IdRecurso = tostring(properties.resourceDetails.id)
| extend NomeVulnerabilidade=tostring(properties.displayName),
Correcao=tostring(properties.remediation),
Categoria=tostring(properties.category),
Impacto=tostring(properties.impact),
Ameaca=tostring(properties.additionalData.threat),
severidade=tostring(properties.status.severity),
status=tostring(properties.status.code),
Referencia=tostring(properties.additionalData.vendorReferences[0].link),
CVE=tostring(properties.additionalData.cve[0].link)
| where assessmentKey == "1195afff-c881-495e-9bc5-1486211ae03f"
| where status == "Unhealthy"
| project IdRecurso, IdAzure, NomeVulnerabilidade, severidade, Categoria, CVE, Referencia, status, Impacto, Ameaca, Correcao
Ignore the awkward names of the columns, for they are in Portuguese.
As you can see in the "Referencia" and "CVE" columns, I'm able to extract the value at a specific index of the array, but I want all the links in the whole array.
Without sample input and expected output it's hard to understand what you need, so trying to guess here...
I think that summarize make_list(...) by ... will help you (see this to learn how to use make_list)
If this is not what you're looking for, please delete the question, and post a new one with minimal sample input (using datatable operator), and expected output, and we'll gladly help.
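For illustration, here is a hedged sketch of what that could look like for the vendor-reference links (the mv-expand alias and the grouping key are assumptions; adjust them to your schema):

```kusto
securityresources
| where type =~ "microsoft.security/assessments/subassessments"
// expand the array so each link becomes a row, then collapse the links
// back into a single list per record
| mv-expand ref = properties.additionalData.vendorReferences
| summarize Referencia = make_list(tostring(ref.link)) by IdAzure = tostring(properties.id)
```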

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question, I couldn't find anything in the docs.
Currently my workflow looks something like this: I'm taking a number of input files created as part of this workflow and summarizing them.
Is there a way to avoid this manual regex step to parse the wildcards in the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I am unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom, cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
expand() uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)
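A quick sketch of that idea (the values here are stand-ins for config["chromosomes"] and cross_ids), showing that iterating product() yields the wildcard pairs in the same order that expand() enumerates the files:

```python
from itertools import product

chromosomes = ["chr1", "chr2"]   # stand-in for config["chromosomes"]
cross_ids = ["crossA", "crossB"]

# expand("output/{chrom}/{cross}.in", chrom=chromosomes, cross=cross_ids)
# enumerates the same combinations as product(), so zipping the two
# recovers (chrom, cross) for each input file without any regex parsing
pairs = list(product(chromosomes, cross_ids))
files = ["output/{}/{}.in".format(chrom, cross) for chrom, cross in pairs]
print(files)
```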

pseudo randomization in loop PsychoPy

I know other people have asked similar questions in past but I am still stuck on how to solve the problem and was hoping someone could offer some help. Using PsychoPy, I would like to present different images, specifically 16 emotional trials, 16 neutral trials and 16 face trials. I would like to pseudo randomize the loop such that there would not be more than 2 consecutive emotional trials. I created the experiment in Builder but compiled a script after reading through previous posts on pseudo randomization.
I have read the previous posts that suggest creating randomized excel files and using those, but considering how many trials I have, I think that would be too many and was hoping for some help with coding. I have tried to implement and tweak some of the code that has been posted for my experiment, but to no avail.
Does anyone have any advice for my situation?
Thank you,
Rae
Here's an approach that will always converge very quickly, given that you have 16 of each type and only reject runs of more than two emotion trials. @brittUWaterloo's suggestion to generate trials offline is very good; this is what I do myself, typically. (I like to have a small number of random orders, do them forward for some subjects and backwards for others, and prescreen them to make sure there are no weird or unintended juxtapositions.) But the algorithm below is certainly safe enough to do within an experiment if you prefer.
This first example assumes that you can represent a given trial using a string, such as 'e' for an emotion trial, 'n' neutral, 'f' face. This would work with 'emo', 'neut', 'face' as well, not just single letters, just change eee to emoemoemo in the code:
import random

trials = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while 'eee' in ''.join(trials):
    random.shuffle(trials)
print(trials)
Here's a more general way of doing it, where the trial codes are not restricted to be strings (although they are strings here for illustration):
import random

def run_of_3(trials, obj):
    # detect whether there's a run of at least 3 objects 'obj'
    for i in range(2, len(trials)):
        if trials[i-2: i+1] == [obj] * 3:
            return True
    return False

tr = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while run_of_3(tr, 'e'):
    random.shuffle(tr)
print(tr)
Edit: To create a PsychoPy-style conditions file from the trial list, just write the values into a file like this:
with open('emo_neu_face.csv', 'w') as f:
    f.write('stim\n')       # this is a 'header' row
    f.write('\n'.join(tr))  # these are the values
Then you can use that as a conditions file in a Builder loop in the regular way. You could also open this in Excel, and so on.
This is not quite right, but hopefully it will give you some ideas. I think you could occasionally get caught in an infinite cycle in the elif statement if the last three items ended up the same, but you could add some sort of counter there. In any case, this shows a strategy you could adapt. Rather than put this in the experimental code, I would generate the trial sequence separately at the command line, and then save a successful output as a list in the experimental code to show to all participants, so you know things won't crash during an actual run.
import random as r

# making some dummy data
abc = ['f'] * 10 + ['e'] * 10 + ['d'] * 10

def f(l1, l2):
    # just looking at the output to see how it works; can delete
    print("l1 = " + str(l1))
    print(l2)
    if not l2:
        # the second list is empty, so we are done
        return list(l1)
    elif l1[-1] == l1[-2] and l1[-1] == l2[0]:
        # shuffling changes the list in place; copy it before recursing
        r.shuffle(l2)
        t = list(l2)
        return f(l1, t)
    else:
        print("i am here")
        l1.append(l2.pop(0))
        return f(l1, l2)
You would then run it with something like newlist = f(abc[0:2], abc[2:])