Apache Spark: count vs head(1).isEmpty - dataframe

For a given Spark df, I want to know whether a certain column has a null value or not. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) {//throw exception}
This was taking a long time and was being called 2 times for 1 df, since I was checking 2 columns. Each time it was called, I saw a job for count, so 2 jobs for 1 df.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) {//throw exception}
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
N
Update: added a screenshot showing the jobs for both cases. The left side shows the run with count and the right side the one with head. That's the only line that differs between the 2 runs.

dataframe.head(1) does 2 things -
1. Runs the query behind the dataframe on the executor(s).
2. Collects the 1st row of the result from the executor(s) to the driver.
dataframe.count() does 2 things -
1. Runs the query behind the dataframe on the executor(s). If there are no transformations and the source is in Parquet format, this is basically just a scan of the file statistics.
2. Collects the count from the executor(s) to the driver.
If the dataframe's source is a file format that stores statistics and there are no transformations, count() can therefore run faster than head.
I am not 100% sure why there are 2 jobs vs 4. Can you please paste the screenshot?
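For reference, here is a minimal PySpark sketch of the two checks (the question's snippet is Scala, but the idea is the same); the column name and sample data are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "name"])

# variant 1: count every matching row, then compare with 0
has_null_via_count = df.filter(col("name").isNull()).count() > 0

# variant 2: ask only for the first matching row and check whether one exists
has_null_via_head = len(df.filter(col("name").isNull()).head(1)) > 0

print(has_null_via_count, has_null_via_head)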

It's hard to say just by looking at this line of code, but there is one reason why head can take more time. head is a deterministic request: if there is a sort or orderBy anywhere in the plan, it will require a shuffle so that the same first row is always returned. In the case of count you don't need the result ordered, so no shuffle is needed, basically a simple map-reduce step. That is probably why your head can take more time.

Related

What is 'showString' in Spark UI? Are the code blocks below different?

(1)
df.createOrReplaceTempView("dftable")
sqldf = spark.sql('SELECT COUNT(*) FROM dftable')
sqldf.show()

(2)
df.createOrReplaceTempView("dftable")
sqldf = spark.sql('SELECT * FROM dftable')
sqldf.count()
What is the difference between the two code blocks above? (1) takes 20 seconds to run while (2) only takes 5 seconds. The only difference I was able to notice is that, in their corresponding stages, (1) shows something like "showString at NativeMethodAccessorImpl.java:0" while (2) shows "count at NativeMethodAccessorImpl.java:0". I attached their corresponding stages too: Stage 46 is for (1) and Stage 43 is for (2).
https://i.stack.imgur.com/Vcf9A.png
https://i.stack.imgur.com/pGOxC.png
As you already mentioned in your question, the 1st query uses .show() while the 2nd uses .count(), which are two completely different actions. Therefore you can see that they have different Spark jobs (blue boxes) when you check their execution DAGs. Because they use different actions, and some actions are more expensive than others (especially those that bring data back to the driver node, e.g. .collect()), you can't expect them to take the same time.
Back to your example, .count() is faster than .show() because .count() is distributed: each partition is counted on the executors and the final count is just the sum of the per-partition counts, computed on the driver. .show(), on the other hand, needs to fetch the amount of data you requested (20 rows by default) back to the driver.
You can try another expensive action like .collect() to see how much time and resources different actions require.
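As a rough, self-contained PySpark sketch of the two variants (spark.range(1000) is just a stand-in for the question's dataframe):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)  # stand-in for the question's dataframe

df.createOrReplaceTempView("dftable")

# (1) the aggregation runs in SQL, then .show() runs a "showString" job that
#     formats up to 20 rows and ships them to the driver (here just one row)
spark.sql('SELECT COUNT(*) FROM dftable').show()

# (2) .count() only needs the per-partition row counts, which are summed on
#     the driver, so no actual row data is collected
spark.sql('SELECT * FROM dftable').count()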

Apply np.random.rand to groups - optimization issue

I need to optimize a single line of code that will be executed tens of thousands of times during the calculations, so timing becomes an issue. It seems simple, but I am really stuck.
The line is:
df['Random']=df['column'].groupby(level=0).transform(lambda x: np.random.rand())
So I want to assign the same random number to each group and then "ungroup". Since rand() is called many times with this implementation, the code is very inefficient.
Can someone help in vectorizing this?
Try this!
import numpy as np
import pandas as pd

# example data: one random number is drawn per unique value of 'column'
df = pd.DataFrame(np.sort(np.random.randint(2, 5, 50)), columns=['column'])
uniques = df['column'].unique()
final = df.merge(pd.Series(np.random.rand(len(uniques)), index=uniques, name='Random').to_frame(),
                 left_on='column', right_index=True)
You can store the uniques and then rerun just the last line every time you need new random numbers to join with df.
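A hedged alternative sketch (with made-up data) that keeps the question's groupby(level=0) semantics: draw one random number per index-level group up front and map it back onto the rows, instead of calling np.random.rand() inside transform for every group:

import numpy as np
import pandas as pd

# hypothetical frame whose index level 0 defines the groups
df = pd.DataFrame({'column': range(6)}, index=pd.Index([0, 0, 1, 1, 2, 2], name='grp'))

groups = df.index.get_level_values(0)
# one random draw per distinct group label, then mapped back to every row
rand_per_group = pd.Series(np.random.rand(groups.nunique()), index=groups.unique())
df['Random'] = groups.map(rand_per_group)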

how to handle query execution time (performance issue) in Oracle

I have a situation where I need to execute a patch script for a million rows of data. The current query execution time does not meet expectations even for a few rows (18,000), which take around 4 hours (testing data before deploying to live).
The patch script actually selects a million rows of data in a loop and updates them according to the specification. I am just wondering how long it could take for a million rows, since it takes around 4 hours for just 18,000 rows.
To overcome this problem I decided to create a temp table holding the entire result of the select statement and to proceed with the patch process using the temp table, where the process could be a bit faster compared to select-and-update.
Are there any other ways I can use to handle this situation? Any suggestions and ways to solve this?
(Due to company policy I am unable to post the PL/SQL script here.)
Since it seems no one can answer my question here, I am posting my own answer.
In Oracle there is Parallel Execution, which allows the processing of a single SQL statement to be spread across multiple processes.
By using this method I brought my long-running query down from 4 hours to 6 minutes.
For more information:
https://docs.oracle.com/cd/E11882_01/server.112/e25523/parallel002.htm
http://www.oracle.com/technetwork/articles/database-performance/geist-parallel-execution-1-1872400.html
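Since the original PL/SQL can't be shared, here is only a minimal sketch of the idea from Python via the python-oracledb driver; the connection details, table and column names are made-up placeholders. The key parts are the ALTER SESSION and the PARALLEL hint:

import oracledb  # assumed driver; credentials below are placeholders

conn = oracledb.connect(user="patch_user", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# allow the UPDATE itself (not only the scan) to run in parallel
cur.execute("ALTER SESSION ENABLE PARALLEL DML")

# the PARALLEL hint asks Oracle to spread the statement over 8 parallel servers
cur.execute("""
    UPDATE /*+ PARALLEL(t, 8) */ patch_target t
       SET t.some_col = t.new_value
     WHERE t.needs_patch = 'Y'
""")
conn.commit()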

looping in a Kettle transformation

I want to repetitively execute an SQL query looking like this:
SELECT '${date.i}' AS d,
COUNT(DISTINCT xid) AS n
FROM table
WHERE date
BETWEEN DATE_SUB('${date.i}', INTERVAL 6 DAY)
AND '${date.i}'
;
It is basically a grouping by time spans, except that the spans overlap, which prevents the use of GROUP BY.
That is why I want to execute the query repetitively for every day in a certain time span. But I am not sure how I should implement the loop. What solution would you suggest?
The Kettle variable date.i is initialized from a global variable. The transformation is just one of several in the same transformation bundle. The "stop trafo" would perhaps be implemented implicitly by just not re-entering the loop.
Here's the flow chart of the transformation:
In step "INPUT" I create a result set with three identical fields keeping the dates from ${date.from} until ${date.until} (Kettle variables). (for details on this technique check out my article on it - Generating virtual tables for JOIN operations in MySQL).
In step "SELECT" I set the data source to be used ("INPUT") and that I want "SELECT" to be executed for every row in the served result set. Because Kettle maps parameters 1 on 1 by a faceless question-mark I have to serve three times the same paramter - for each usage.
The "text file output" finally outputs the result in a generical fashion. Just a filename has to be set.
Content of the resulting text output for 2013-01-01 until 2013-01-05:
d;n
2013/01/01 00:00:00.000;3038
2013/01/02 00:00:00.000;2405
2013/01/03 00:00:00.000;2055
2013/01/04 00:00:00.000;2796
2013/01/05 00:00:00.000;2687
I am not sure if this is the slickest solution but it does the trick.
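For readers without Kettle at hand, this is roughly what the transformation does, expressed as a hedged Python sketch (the mysql.connector driver and the connection details are assumptions; table and column names are taken from the question): one row per date drives the parameterized query, and the same date is bound to every placeholder:

from datetime import date, timedelta
import mysql.connector  # assumed driver; credentials below are placeholders

conn = mysql.connector.connect(host="dbhost", user="user", password="pw", database="db")
cur = conn.cursor()

# table, xid and date are the names used in the question
query = """
SELECT %s AS d,
       COUNT(DISTINCT xid) AS n
FROM `table`
WHERE `date` BETWEEN DATE_SUB(%s, INTERVAL 6 DAY) AND %s
"""

start, end = date(2013, 1, 1), date(2013, 1, 5)
for i in range((end - start).days + 1):
    d = start + timedelta(days=i)
    cur.execute(query, (d, d, d))  # the same date is supplied once per placeholder
    print(cur.fetchone())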
In Kettle you want to avoid loops, as they can cause real trouble in transformations. Instead you should do this by adding a step that puts a row in the stream for each date you want (with the value stored in a field) and then using that field value in the query.
ETA: The stream is the thing that moves rows (records) between steps. It may help to think of it as consisting of a table at each hop that temporarily holds rows between steps.
You want to avoid loops because a Kettle transformation is only sequential at the row level: rows may be processed in parallel and out of order, and the only guarantee is that each row will pass through the steps in order. Because of this, a loop in a transformation does not function as you would intuitively expect.
FYI, it also sounds like you might need to go through some of the Kettle tutorials if you are still unclear about what the stream is.

mysqldumpslow: What do these fields indicate?

Recently we have started optimizing live slow queries. As part of that, we thought we'd use mysqldumpslow to prioritize the slow queries. I am new to this tool. I am able to understand some basic info, but I would like to know what exactly the fields below in the output tell us.
OUTPUT: Count: 6 Time=22.64s (135s) Lock=0.00s (0s) Rows=1.0 (6)
What about the fields below?
Time: Is it the average time taken over all 6 occurrences?
135s: What are these 135 seconds?
Rows=1.0 (6): again, what does this mean?
I couldn't find a good explanation anywhere. Many thanks in advance.
Regards,
UDAY
I did some research on this because I wanted to know it too.
I have a log from a pretty heavily used DB server.
The mysqldumpslow command has several optional parameters (https://dev.mysql.com/doc/refman/5.7/en/mysqldumpslow.html), including the sort order (-s).
Thanks to the many queries I could work with, I can tell that:
the value before the brackets is the average over all occurrences of the same query grouped together ('Count' of them in total), and the value within the brackets is the total summed over all of those occurrences. Meaning, in your case:
you have a query that was called 6 times; each execution takes 22.64 seconds on average, and all 6 together took about 135 seconds (6 × 22.64s ≈ 135s). The same applies to locks (if provided) and rows: on average it returns about one row per execution, and it returned 6 rows in total over the 6 calls.
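A tiny sanity check of that reading, using only the numbers from the question's output line:

# numbers from "Count: 6 Time=22.64s (135s) ... Rows=1.0 (6)"
count = 6
avg_time, total_time = 22.64, 135
avg_rows, total_rows = 1.0, 6

# average * count should reproduce the bracketed totals (up to rounding)
assert abs(avg_time * count - total_time) < 1
assert avg_rows * count == total_rows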