U-SQL Query Optimizer behavior - azure-data-lake

Okay, here is what I am doing. I have a U-SQL script which does the following.
Step 1. INSERT a record into a txn table 'A', say "PROCESSING STARTED", recording the start of Step 2.
Step 2. Extract from a file
Step 3. Insert into table 'B' using the rowset from Step 2.
Step 4. INSERT a record into a txn table 'A', say "PROCESSING FINISHED", recording the successful execution of Step 2.
When I coded the above, I was hoping the steps would execute in the order listed. To my surprise they did not; when I looked closely into the algebra, I saw that the query optimizer had reordered all my tasks and runs them as below.
All Extracts
All Splits, Aggregates, Partitions
All Writes (if you notice, there are 2 tables I am inserting into)
So the question I have here is: how do I ensure that Steps 2 and 3 execute only after Step 1? I am not bothered about Step 4 as of now. I could possibly run it as separate jobs, as below, but I was hoping there would be some other options.
Job 1 (Step 1)
Job 2 (Steps 2, 3)
Job 3 (Step 4)
Can you please help out?

U-SQL is designed to optimize your query so it can be scaled out across multiple nodes, resulting in efficient execution of your query. What you are observing is by design: since there is no dependency between Steps 1 and 2 in your code, there is an opportunity to parallelize their execution.
One option I can think of for executing them in a certain sequence is to introduce a dependency in Step 2 on a result from Step 1, so the optimizer can no longer treat them as independent.
Having said that, if you are looking for a sequential execution pattern, I'm curious as to why you chose U-SQL (which is designed for massively parallelized applications).

Related

BigQuery - how to decrease slot time of Coalesce execution step?

I have a pretty complex query with about 70 execution steps. The query was somewhat optimized for performance, so most of the steps in the execution plan run pretty fast, except the Coalesce steps, which take about 10 to 100 times more slot time compared to the others. As far as I understand, a Coalesce step prepares data for the following Join step, but why does it take so long even when the actual number of records processed by the step is low? The most extreme case I saw looks like this (ZERO records processed by this step, but it still takes 8 seconds!):
S46: Coalesce
Slot time: 8223 ms
Duration: 92 ms
Bytes Shuffled: 0 B
I wasn't able to find any hints regarding this "Coalesce" step and ways to optimize it in Google's documentation, so perhaps you can give me some advice about it or point me to actual documentation that explains it.

Apache Spark: count vs head(1).isEmpty

For a given Spark DataFrame, I want to know whether a certain column has null values or not. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) {//throw exception}
This was taking a long time and was being called 2 times for 1 df since I was checking 2 columns. Each time it was called, I saw a job for count, so 2 jobs for 1 df.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) {//throw exception}
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
Update: added a screenshot showing the jobs for both cases. The left side shows the run with count and the right side the one with head. That's the only line that differs between the 2 runs.
dataframe.head(1) does 2 things -
1. Executes the action behind the dataframe on the executor(s).
2. Collects the 1st row of the result from the executor(s) to the driver.
dataframe.count() does 2 things -
1. Executes the action behind the dataframe on the executor(s). If there are no transformations on the file and the Parquet format is used, then it is basically scanning the statistics of the file(s).
2. Collects the count from the executor(s) to the driver.
Because the source of the dataframe is a file which stores statistics, and there are no transformations, count() can run faster than head().
I am not 100% sure why there are 2 jobs vs 4. Can you please paste the screenshot?
It is hard to say just by looking at this line of code, but there is one reason why head can take more time. head is a deterministic request: if you have a sort or orderBy anywhere in the plan, it will require a shuffle so that the same first row is always returned. In the case of count you don't need the result ordered, so there is no need to shuffle; it is basically a simple map-reduce step. That is probably why your head can take more time.
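For what it's worth, if the goal is simply to fail fast when either column contains nulls, one more option is to combine the two checks into a single filter so each validation scans the data once instead of once per column. This is just a sketch, assuming Spark 2.4+ (where Dataset.isEmpty is available) and hypothetical column names:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: colName1/colName2 are hypothetical placeholders for the two columns being validated.
def failIfAnyNull(df: DataFrame, colName1: String, colName2: String): Unit = {
  // One filter covers both columns, so the data is scanned once per validation instead of once per column.
  val anyNulls = !df.filter(col(colName1).isNull || col(colName2).isNull).isEmpty
  if (anyNulls) {
    throw new IllegalStateException(s"Null values found in $colName1 or $colName2")
  }
}
Note that isEmpty is still a take-style action under the hood, so it may launch more than one job on a large input, but at least both columns are covered by a single filter rather than two separate checks.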

how to handle query execution time (performance issue ) in oracle

I have a situation where I need to execute a patch script for a million rows of data. The current query execution time does not meet expectations even for a small number of rows (18,000), which takes around 4 hours (testing data before deploying to live).
The patch script actually selects a million rows of data in a loop and updates them according to the specification. I just wonder how long it could take for a million rows of data, since it takes around 4 hours for just 18,000 rows.
To overcome this problem I decided to create a temp table holding the entire select-statement result and proceed with the patch process using the temp table, where the process could be a bit faster compared to selecting and updating directly.
Are there any other ways I can use to handle this situation? Any suggestions and ways to solve this are welcome.
(Due to company policy I am unable to post the PL/SQL script here.)
Since it seems no one here can answer my question, I am posting my own answer.
In Oracle there is Parallel Execution, which allows spreading the processing of a single SQL statement across multiple threads.
By using this method I reduced my long-running query from 4 hours to 6 minutes.
For more information:
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
http://www.oracle.com/technetwork/articles/database-performance/geist-parallel-execution-1-1872400.html

Optimizing R code for ETL

I have both an R script and a Pentaho (PDI) ETL transformation for loading data from a SQL database and performing a calculation. The initial data set has 1.28 million rows of 21 variables and is equivalent in both R and PDI. In fact, I originally wrote the R code and then subsequently "ported" to a transformation in PDI.
The PDI transformation runs in 30s (and includes an additional step of writing the output to a separate DB table). The R script takes between 45m and one hour total. I realize that R is a scripting language and thus interpreted, but it seems like I'm missing some optimization opportunities here.
Here's an outline of the code:
Read data from a SQL DB into a data frame using sqlQuery() from the RODBC package (~45s)
str_trim() two of the columns (~2 - 4s)
split() the data into partitions to prepare for performing a quantitative calculation (separate function) (~30m)
run the calculation function in parallel for each partition of the data using parLapply() (~15-20m)
rbind the results together into a single resulting data frame (~10 - 15m)
I've tried using ddply() instead of split(), parLapply() and rbind(), but it ran for several hours (>3) without completing. I've also modified the SQL select statement to return an artificial group ID that is the dense rank of the rows based on the unique pairs of two columns, in an effort to increase performance. But it didn't seem to have the desired effect. I've tried using isplit() and foreach() %dopar%, but this also ran for multiple hours with no end.
The PDI transformation is running Java code, which is undoubtedly faster than R in general. But it seems that the equivalent R script should take no more than 10 minutes (i.e. 20X slower than PDI/Java) rather than an hour or longer.
Any thoughts on other optimization techniques?
Update: step 3 above, split(), was resolved by using indexes, as suggested here: Fast alternative to split in R
Update 2: I tried using mclapply() instead of parLapply(), and it's roughly the same (~25m).
Update 3: rbindlist() instead of rbind() runs in under 2s, which resolves step 5.

How do I do the Delayed::Job equivalent of Process#waitall?

I have a large task that proceeds in several major steps: Step A must complete before Step B can be started, etc. But each major step can be divided up across multiple processes, in my case, using Delayed::Job.
The question: Is there a simple technique for starting Step B only after all the processes have completed working on Step A?
Note 1: I don't know a priori how many external workers have been spun up, so keeping a reference count of completed workers won't help.
Note 2: I'd prefer not to create a worker whose sole job is to busy wait for the other jobs to complete. Heroku workers cost money!
Note 3: I've considered having each worker examine the Delayed::Job queue in the after callback to decide if it's the last one working on Step A, in which case it could initiate Step B. This could work, but seems potentially fraught with gotchas. (In the absence of better answers, this is the approach I'm going with.)
I think it really depends on the specifics of what you are doing, but you could set priority levels such that any jobs from Step A run first. Depending on the specifics, that might be enough. From the github page:
By default all jobs are scheduled with priority = 0, which is top priority. You can change this by setting Delayed::Worker.default_priority to something else. Lower numbers have higher priority.
So if you set Step A to run at priority = 0, and Step B to run at priority = 100, nothing in Step B will run until Step A is complete.
There are some cases where this will be problematic -- in particular, if you have a lot of jobs and are running a lot of workers, you will probably have some workers running Step B jobs before the work in Step A is finished. Ideally in this setup, Step B has some sort of check to see whether it can be run or not.