Why is only 1 task running in 1 executor of Spark - dataframe

I am running a Spark job where the last step is to group the data by date and calculate the count.
This step was taking a long time, so I checked the Spark UI and saw that only one task was running, in one executor.
All the other executors were idle.
df_missing = spark.sql("""
    select a.tid, a.load_dt
    from df_all_dates_full a
    left join df_Del_base b
      on a.tid = b.tid and a.load_dt = b.load_dt
    where b.load_dt is null
""").distinct().repartition(100)
df_missing.createOrReplaceTempView("df_missing")
print("Running group by")
spark.sql("select load_dt,count(*) as TID_DEL from df_missing group by load_dt").show()
I have tried changing the DataFrame partition count and changing the executor and driver memory.

I think you are only seeing the last stage of this query, and the final groupBy is executed on that one executor. You have shuffle reads on the other workers as well, so they did their job, and at the end all the data has to be pushed to one place to execute the grouping.
Try running explain on the final DataFrame; it should be relatively obvious.
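A minimal sketch (PySpark) of what that might look like, assuming df_missing is already registered as a temp view as in the question:

result = spark.sql(
    "select load_dt, count(*) as TID_DEL from df_missing group by load_dt"
)

# Print the physical plan. Look for Exchange (shuffle) and HashAggregate nodes:
# partial aggregates run on every executor, the partial counts are then shuffled
# by load_dt and merged; with few distinct dates that final stage can collapse
# into very few (even one) busy tasks.
result.explain()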

What is 'showString' in Spark UI? Are the code blocks below different?

(1)
df.createOrReplaceTempView("dftable")
sqldf = spark.sql('SELECT COUNT(*) FROM dftable')
sqldf.show()

(2)
df.createOrReplaceTempView("dftable")
sqldf = spark.sql('SELECT * FROM dftable')
sqldf.count()
What is the difference between the two code blocks above? (1) takes 20 seconds while (2) takes only 5 seconds. The only difference I was able to notice is that in their corresponding stages, (1) shows "showString at NativeMethodAccessorImpl.java:0" while (2) shows "count at NativeMethodAccessorImpl.java:0". I attached their corresponding stages too; Stage 46 is for (1) and Stage 43 is for (2).
https://i.stack.imgur.com/Vcf9A.png
https://i.stack.imgur.com/pGOxC.png
As you have already mentioned in your question, the 1st query uses .show() while the 2nd uses .count(), which are two completely different actions. That is why you see different Spark jobs (the blue boxes) when you check their execution DAGs. Because they use different actions, and some actions are more expensive than others (especially actions that bring data back to the driver node, e.g. .collect()), you can't expect them to take the same time.
Back to your example: the reason .count() is faster than .show() is that .count() is distributed, and the final count (a single number) is just the sum of all the partition counts, computed by the driver, while .show() needs to fetch the number of rows you requested (20 by default) back to the driver.
You can try other expensive actions like .collect() to see how much time and resources different actions require.
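As a rough, machine-dependent illustration (not part of the original post), here is a sketch one could run to compare how long different actions take on the same view:

import time

def timed(label, action):
    # Helper: run an action and print roughly how long it took.
    start = time.time()
    action()
    print(label, round(time.time() - start, 1), "seconds")

df.createOrReplaceTempView("dftable")

# count() only ships one number per partition back to the driver.
timed("count", lambda: spark.sql("SELECT * FROM dftable").count())

# show() runs the query and then formats rows (up to 20 by default) on the driver.
timed("show", lambda: spark.sql("SELECT COUNT(*) FROM dftable").show())

# collect() pulls the whole result set to the driver, usually the most expensive.
timed("collect", lambda: spark.sql("SELECT * FROM dftable").collect())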

How changing the number of workers will affect the Glue job

I have 2,200,000 records to process in a Glue job, which is leading to a timeout; the timeout is set to 2 days by default and the number of workers is 10. Will increasing the number of workers help the Glue job run faster?
Increasing the number of workers will help the job run faster if it has transformations that can run in parallel, since you are allocating more executor nodes.
2,200,000 records isn't that much though, and you should check whether something is wrong with the code if it takes more than 2 days.
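Purely as an illustration, a minimal boto3 sketch of starting a run with more workers; the job name, worker type, and counts here are placeholders that depend on your job:

import boto3

glue = boto3.client("glue")

# Start a run of a (hypothetical) job with 20 workers instead of the default 10.
response = glue.start_job_run(
    JobName="my-glue-job",    # placeholder job name
    WorkerType="G.1X",        # assumes a Glue Spark job that uses G.1X workers
    NumberOfWorkers=20,
)
print(response["JobRunId"])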

Apache Spark: count vs head(1).isEmpty

For a given Spark DataFrame, I want to know whether a certain column has any null values. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) {//throw exception}
This was taking long and was being called 2 times per DataFrame, since I was checking 2 columns. Each time it was called, I saw a job for count, so 2 jobs per DataFrame.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) {//throw exception}
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
N
Update: added a screenshot showing the jobs for both cases. The left side shows the run with count, the right side the one with head. That's the only line that differs between the 2 runs.
dataframe.head(1) does 2 things -
1. Executes the action behind the dataframe on executor(s).
2. Collects 1st row of the result from executor(s) to the driver.
dataframe.count() does 2 things -
1. Executes the action behind the dataframe on executor(s). If there are no transformations and the Parquet format is used, this is basically just a scan of the statistics of the file(s).
2. Collects count from executor(s) to the driver.
Given a DataFrame whose source is a file format that stores statistics and which has no transformations applied, count() can run faster than head.
I am not 100% sure why there are 2 jobs vs 4. Can you please paste the screenshot?
It's hard to say just from looking at this line of code, but there is one reason head can take more time: head is a deterministic request, so if you have a sort or orderBy anywhere in the plan, a shuffle is required to always return the same first row. With count you don't need the result ordered, so there is no need for that shuffle, basically just a simple map-reduce step. That is probably why your head can take more time.
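For comparison (the question's code is Scala; this sketch is PySpark with placeholder names), the two null-check variants side by side, which makes it easy to see in the UI which jobs each one triggers:

from pyspark.sql.functions import col

def has_nulls_via_count(df, col_name):
    # Always runs a full count job over the filtered data.
    return df.filter(col(col_name).isNull()).count() > 0

def has_nulls_via_head(df, col_name):
    # head(1) can stop as soon as one matching row is found, but Spark may
    # launch several jobs, scanning more partitions each time, until it
    # either finds a row or exhausts the data.
    return len(df.filter(col(col_name).isNull()).head(1)) > 0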

Dataflow Apache beam Python job stuck at Group by step

I am running a Dataflow job which reads from BigQuery, scans around 8 GB of data, and results in more than 50,000,000 records. At the group-by step I want to group by a key and concatenate one column. After concatenation, the size of the concatenated column becomes more than 100 MB, which is why I have to do that group by in the Dataflow job; it cannot be done at the BigQuery level due to the 100 MB row size limit.
The Dataflow job scales well while reading from BigQuery but gets stuck at the group-by step. I have 2 versions of the Dataflow code, and both get stuck at the group by. When I checked the Stackdriver logs, they say processing is stuck/lulled for more than 1010 seconds (a message along those lines), along with messages like "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".
I expect the group-by step to complete within 20 minutes, but it has been stuck for more than an hour and never finishes.
I figured it out myself.
Below are the 2 changes that I made in my pipeline:
1. I added a Combine function just after the Group by Key (see screenshot, and the sketch below).
2. Since the Group by Key, when running on multiple workers, causes a lot of network traffic between them, and by default the network we use does not allow inter-worker communication, I had to create a firewall rule allowing traffic from one worker to another (i.e. allow the workers' IP range on the network).
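A minimal Beam (Python) sketch of change 1, i.e. concatenating per key with a combiner so partial strings are merged on each worker before the shuffle; the sample elements and the separator are placeholders:

import apache_beam as beam

def concat_values(values):
    # Combiner: merge all values for a key into one comma-separated string.
    # Joining already-joined partial strings still yields the full string,
    # so this also works when Beam merges partial results across workers.
    return ",".join(values)

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([("k1", "a"), ("k1", "b"), ("k2", "c")])
        | "ConcatPerKey" >> beam.CombinePerKey(concat_values)
        | "Print" >> beam.Map(print)
    )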

Any faster way to count rows in Pig

I followed this Stack Overflow question, which shows how to count rows in Pig.
The problem I found is that this is incredibly time consuming if I do some regex filter matching and other operations before trying to count the rows of the filtered relation.
Here is my code
all_data = load '/logs/chat1.log' USING TextLoader() as line:chararray;
match_filter_1 = filter all_data by ( line matches 'some regex');
inputGroup = GROUP match_filter_1 ALL;
totalLine = foreach inputGroup generate COUNT(match_filter_1);
dump totalLine;
So, is there any way to get the result faster?
Use the PARALLEL clause to increase the parallelism of a job:
PARALLEL sets the number of reduce tasks for the MapReduce jobs generated by Pig. The default value is 1 (one reduce task).
PARALLEL only affects the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
If you don’t specify PARALLEL, you still get the same map parallelism but only one reduce task.
A = LOAD 'myfile' AS (t, u, v);
B = GROUP A BY t PARALLEL 18;
Hope this helps!