Databricks - automatic parallelism and Spark SQL - apache-spark-sql

I have a general question about Databrick cells and auto-parallelism with Spark SQL. I have a summary table that has a number of fields of which most have a complex logic behind them.
If I put blocks (%SQL) of individual field logic in individual cells, will the scheduler automatically attempt to allocate the cells to different nodes on the cluster to improve performance ( depending on how many nodes my cluster has) ? Alternatively are their PySpark functions I can use to organise the parallel running myself ? I cant find much about this elsewhere...
I am using LTS 10.4 (Spark 3.2.1 Scala 2.12)
Many thanks
Richard

If you write python "pyspark" code over multiple cells there is something called "lazy execution" meaning the actual work only happens at the last possible moment (for example when data is written or displayed). So before you run for example a display(df) no actual work is done on the cluster. So technically here the code of multiple code cells is parallelized efficiently.
However, in Databricks Spark SQL a single cell is executed to completion before the next one is started. If you want to run those concurrently you can take a look at running multiple notebooks at the same time (or multiple parameterized instances of the same notebook) with dbutils.notebook.run(). Then the cluster will automatically split the resources evenly between those queries running at the same time.

You can try run the sql statements using spark.sql() and assign the outputs to different dataframes. In the last step, you could execute an operation (for ex: join) that brings all into one dataframe. The lazy evaluation should then evaluate all dataframes (i.e. your sql queries) in parallel.

Related

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single node Hadoop and I wrote a job to output a sorted dataframe with only one partition to one single csv file. And I discovered several outcomes when using repartition differently.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried to discard repartition function, but the output was only a part of the records. I realized without using repartition spark will output 200 CSV files instead of 1, even though I am working on a one partition dataframe.
Thus, what I did next were placing repartition(1), repartition(1, "column of partition"), repartition(20) function before orderBy. Yet output remained the same with 200 CSV files.
So I used the coalesce(1) function before orderBy, and the problem was fixed.
I do not understand why working on a single partitioned dataframe has to use repartition and coalesce, and how the aforesaid processes affect the output. Grateful if someone can elaborate a little.
Spark has relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform operations like sort in your case, it triggers something called a shuffle operation
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
That will split your dataframe to spark.sql.shuffle.partitions partitions.
I also struggled with the same problem as you do and did not find any elegant solution.
Spark generally doesn’t have a great concept of ordered data, because all your data is split accross multiple partitions. And every time you call an operation that requires a shuffle your ordering will be changed.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger
As Miroslav points out your data gets shuffled between partitions every time you trigger what’s called a shuffle stage (this is things like grouping or join or window operations)
You can set the number of shuffle partitions in the spark Config - the default is 200
Calling repartition before a group by operation is kind of pointless because spark needs to reparation your data again to execute the groupby
Coalesce operations sometimes get pushed into the shuffle stage by spark. So maybe that’s why it worked. Either that or because you called it after the groupby operation
A good way to understand what’s going on with your query is to start using the spark UI - it’s normally available at http://localhost:4040
More info here https://spark.apache.org/docs/3.0.0-preview/web-ui.html

Pyspark - identical dataframe filter operation gives different output

I’m facing a particularly bizarre issue while firing filter queries on a spark dataframe. Here's a screenshot of the filter command I'm trying to run:
As you can see, I'm trying to run the same command multiple times. Each time, it's giving a different number of rows. It is actually meant to return 6 records, but it ends up showing a random number of records every time.
FYI, The underlying data source (from which I'm creating the dataframe) is an Avro file in a Hadoop data lake.
This query only gives me consistent results if I cache the dataframe. But this is not always possible for me because the dataframe might be very huge and hence would choke up memory resources if I cache it.
What might be the possible reasons for this random behavior? Any advice on how to fix it?
Many thanks :)

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using:
data = sqlContext.read.json("s3n://.....")
Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the data:
data.registerTempTable("table")
SqlData = sqlContext.sql("SELECT * FROM table")
Context: The data is not too big, takes a long time to load into Spark and also to query from. I think Spark partitions the data into too many nodes. I want to be able to set that manually. I know when dealing with RDDs and sc.parallelize I can pass the number of partitions as an input. Also, I have seen repartition(), but I am not sure if it can solve my problem. The variable data is a DataFrame in my example.
Let me define partition more precisely. Definition one: commonly referred to as "partition key" , where a column is selected and indexed to speed up query (that is not what i want). Definition two: (this is where my concern is) suppose you have a data set, Spark decides it is going to distribute it across many nodes so it can run operations on the data in parallel. If the data size is too small, this may further slow down the process. How can i set that value
By default it partitions into 200 sets. You can change it by using set command in sql context sqlContext.sql("set spark.sql.shuffle.partitions=10");. However you need to set it with caution based up on your data characteristics.
You can call repartition() on dataframe for setting partitions. You can even set spark.sql.shuffle.partitions this property after creating hive context or by passing to spark-submit jar:
spark-submit .... --conf spark.sql.shuffle.partitions=100
or
dataframe.repartition(100)
Number of "input" partitions are fixed by the File System configuration.
1 file of 1Go, with a block size of 128M will give you 10 tasks. I am not sure you can change it.
repartition can be very bad, if you have lot of input partitions this will make lot of shuffle (data traffic) between partitions.
There is no magic method, you have to try, and use the webUI to see how many tasks are generated.

Optimizing R code for ETL

I have both an R script and a Pentaho (PDI) ETL transformation for loading data from a SQL database and performing a calculation. The initial data set has 1.28 million rows of 21 variables and is equivalent in both R and PDI. In fact, I originally wrote the R code and then subsequently "ported" to a transformation in PDI.
The PDI transformation runs in 30s (and includes an additional step of writing the output to a separate DB table). The R script takes between 45m and one hour total. I realize that R is a scripting language and thus interpreted, but it seems like I'm missing some optimization opportunities here.
Here's an outline of the code:
Read data from a SQL DB into a data frame using sqlQuery() from the RODBC package (~45s)
str_trim() two of the columns (~2 - 4s)
split() the data into partitions to prepare for performing a quantitative calculation (separate function) (~30m)
run the calculation function in parallel for each partition of the data using parLapply() (~15-20m)
rbind the results together into a single resulting data frame (~10 - 15m)
I've tried using ddply() instead of split(), parLapply() and rbind(), but it ran for several hours (>3) without completing. I've also modified the SQL select statement to return an artificial group ID that is the dense rank of the rows based on the unique pairs of two columns, in an effort to increase performance. But it didn't seem to have the desired effect. I've tried using isplit() and foreach() %dopar%, but this also ran for multiple hours with no end.
The PDI transformation is running Java code, which is undoubtedly faster than R in general. But it seems that the equivalent R script should take no more than 10 minutes (i.e. 20X slower than PDI/Java) rather than an hour or longer.
Any thoughts on other optimization techniques?
update: step 3 above, split(), was resolved by using indexes as suggested here Fast alternative to split in R
update 2: I tried using mclapply() instead of parLapply(), and it's roughly the same (~25m).
update 3: rbindlist() instead of rbind() runs in under 2s, which resolves step 5

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always gives the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm made aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in Hbase sorted by insertion date, if you record somewhere timestamps of successful summary runs, and open the filter on inputs that are dated later than last successful summary -- you'll save some significant scanning time.
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability