I am a Spark newbie and have a simple Spark application that uses Spark SQL/HiveContext to:
1. select data from a Hive table (1 billion rows)
2. do some filtering and aggregation, including row_number over a window function to select the first row, group by, count() and max(), etc.
3. write the result into HBase (hundreds of millions of rows)
I submit the job to a YARN cluster (100 executors) and it's slow. When I looked at the DAG Visualization in the Spark UI, it seems only the Hive table scan tasks ran in parallel; the rest of the steps (#2 and #3 above) ran in only one instance, which should probably be possible to parallelize?
The application looks like:
Step 1:
val input = hiveContext
.sql("""
SELECT
user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
FROM
(
SELECT
user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
, row_number() over
(partition by user_id, address, phone_number, first_name, last_name order by user_id, address, phone_number, first_name, last_name, server_ts desc, age) AS rn
FROM
(
SELECT
user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
FROM
table
WHERE
phone_number <> '911' AND
server_date >= '2015-12-01' and server_date < '2016-01-01' AND
user_id IS NOT NULL AND
first_name IS NOT NULL AND
last_name IS NOT NULL AND
address IS NOT NULL AND
phone_number IS NOT NULL
) all_rows
) all_rows_with_row_number
WHERE rn = 1""")
input.registerTempTable("input_tbl")
Step 2:
val result = hiveContext.sql("""
SELECT state,
phone_number,
address,
COUNT(*) as hash_count,
MAX(server_ts) as latest_ts
FROM
( SELECT
udf_getState(address) as state
, user_id
, address
, age
, phone_number
, first_name
, last_name
, server_ts
FROM
input_tbl ) input
WHERE state IS NOT NULL AND state != ''
GROUP BY state, phone_number, address""")
Step 3:
result.cache()
result.map(x => ...).saveAsNewAPIHadoopDataset(conf)
The DAG Visualization looks like:
As you can see, the "Filter", "Project" and "Exchange" in stage 0 run in only one instance, and so do stage 1 and stage 2, so a few questions (apologies if they are dumb):
Do "Filter", "Project" and "Exchange" happen in the driver after data is shuffled from each executor?
What code maps to "Filter", "Project" and "Exchange"?
How could I run "Filter", "Project" and "Exchange" in parallel to optimize the performance?
Is it possible to run stage 1 and stage 2 in parallel?
You're not reading the DAG graph correctly - the fact that each step is visualized using a single box does not mean that it isn't using multiple tasks (and therefore cores) to calculate that step.
You can see how many tasks are used for each step by drilling down into the stage view, which displays all tasks for that stage.
For example, here's a sample DAG visualization similar to yours:
You can see each stage is depicted by a "single" column of steps.
But if we look at the table below, we can see the number of tasks per stage:
One of them is using only 2 tasks, but the other uses 220, which means data is split into 220 partitions and partitions are processed in parallel, given enough available resources.
If you drill down into that stage, you can see again that it used 220 tasks, along with details for all of them.
Only tasks reading data from disk are shown in the graph as having these "multiple dots", to help you understand how many files were read.
SO - as Rashid's answer suggests, check the number of tasks for each stage.
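If the drill-down really does show too few tasks after the shuffle, the main knob in Spark SQL is spark.sql.shuffle.partitions, which sets how many partitions every Exchange produces (a sketch; 200 is the Spark 1.x default, and 400 below is just an illustrative value):
-- Run before the queries, e.g. via hiveContext.sql(...):
-- raises the number of post-shuffle partitions, and therefore the number
-- of tasks in the stages that follow each Exchange.
SET spark.sql.shuffle.partitions=400;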
It is not obvious, so I would do the following things to zero in on the problem.
Calculate the execution time of each step.
The first step may be slow if your table is in text format; Spark usually works better if the data is stored in Hive in Parquet format (see the sketch after this list).
See if your table is partitioned by the column used in the WHERE clause.
If saving data to HBase is slow, then you may need to pre-split the HBase table, as by default the data is stored in a single region.
Look at the Stages tab in the Spark UI to see how many tasks are started for each stage, and also look at the data locality level, as described here.
Hopefully, you will be able to zero in on the problem.
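For the Parquet point in the list above, here is a minimal sketch (the column types and the dynamic-partition settings are assumptions; table and column names follow the question) of converting the source table so the server_date filter can prune partitions:
-- Store the source data as Parquet, partitioned by server_date, so the
-- WHERE server_date >= ... AND server_date < ... filter reads only the
-- matching partitions instead of scanning all 1 billion rows.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE table_parquet (
  user_id      BIGINT,
  address      STRING,
  age          INT,
  phone_number STRING,
  first_name   STRING,
  last_name    STRING,
  server_ts    TIMESTAMP
)
PARTITIONED BY (server_date STRING)
STORED AS PARQUET;

INSERT OVERWRITE TABLE table_parquet PARTITION (server_date)
SELECT user_id, address, age, phone_number,
       first_name, last_name, server_ts, server_date
FROM table;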
Related
I have a table with a large amount of data; does anyone know how to optimize the count statements?
Eg: table Booking(id, email, mobile,....) (about 30 fields).
Function GetBookingCount(p_email, p_mobile) return number
Select count(id)
from Booking
Where email = p_email
or mobile = p_mobile
Function GetBookingStatus3Count(p_email, p_mobile) return number
Select count(id)
from Booking
Where (email = p_email or mobile = p_mobile)
and status = 3;
Final select:
Select GetBookingCount(email, mobile) as BookingCount
, GetBookingStatus3Count(email, mobile) as BookingStatus3Count
, ...
From Booking
where ....
Solution 1: index the columns used in the WHERE clauses of the counts, i.e. the email, mobile and status columns.
Solution 2: create a new table with just the few columns needed for counting.
New table: Booking_Stats(id, email, mobile, status).
Thanks for any suggestion.
select count(*) count_all, count( case when status=3 then 1 else null end ) count_status_3
from Booking
where email = p_email or mobile = p_mobile
-- NOTE: query is written from the top of my head, not tested
You could consider creating an index on (email, mobile) or on (email, mobile, status), depending on how many rows you get for a given (email, mobile) pair and whether you are willing to pay the cost of updating the index on every status change (if allowed). In the case of many status updates for the same rows, you might prefer indexing only (email, mobile) [a read/write cost trade-off].
Email is likely very selective (one value filters out most of the rows). If that is not the case, consider changing the order to (mobile, email) if the mobile column is the better candidate.
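In DDL terms, the two options discussed above look like this (index names are illustrative; you would create one or the other, not both):
-- Narrow index: cheaper to maintain when status changes often.
CREATE INDEX booking_email_mobile_ix ON booking (email, mobile);

-- Wider index: also covers the status filter, at a higher write cost.
CREATE INDEX booking_email_mobile_status_ix ON booking (email, mobile, status);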
It seems likely all those GetBookingBlahBlah() functions are not helpful and in fact quite injurious to performance.
You haven't posted a complete set of requirements (what is meant by ...?), so it's difficult to be certain, but it seems likely that a solution along these lines would be more performant:
with bk as (
select *
from booking
where email = p_email
or mobile = p_mobile
)
select count(*) as BookingCount
, count(case when bk.status = 3 then 1 end) as BookingStatus3Count
, ...
from bk
The idea is to query the base table once, getting all the data necessary to calculate all the counts, and crunching the aggregates on the smallest result set possible.
An index on booking(email,mobile) might be useful but probably not. A better solution would be to have different queries for each of p_email and p_mobile, with single column indexes supporting each query.
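A sketch of that "different queries" idea: probe each single-column index separately and de-duplicate on id before aggregating (untested, same assumptions as the query above):
-- Each branch can use its own single-column index; UNION removes the
-- rows matched by both email and mobile so nothing is counted twice.
select count(*) as BookingCount
     , count(case when bk.status = 3 then 1 end) as BookingStatus3Count
from (
  select id, status from booking where email = p_email
  union
  select id, status from booking where mobile = p_mobile
) bk;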
The booking table should have an index on email, mobile and status. You should use this select:
WITH B1 AS
(
SELECT ID,
       COUNT(ID) OVER () CNT1,
       STATUS
FROM BOOKING
WHERE EMAIL = P_EMAIL
  AND MOBILE = P_MOBILE
)
SELECT MAX(CNT1) CNT1,
       COUNT(ID) CNT2
FROM B1
WHERE STATUS = 3;
I have the following query that I would like to cut into smaller queries
and execute in parallel:
insert into users (
SELECT "user", company, "date"
FROM (
SELECT
json_each(json -> 'users') AS "user",
json ->> 'company' AS company,
date(datetime) AS "date"
FROM companies
WHERE date(datetime) = '2015-05-18'
) AS s
);
I could try to do it manually: launch workers that would connect to PostgreSQL, and every worker would take 1000 companies, extract the users and insert them into the other table. But is it possible to do it in plain SQL?
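The closest I know of in plain SQL is to make the slices independent yourself, e.g. with a modulo predicate on the company id, and run one statement per session (a sketch; it assumes companies has a numeric id column, and each worker k substitutes its own bucket number):
-- Bucket 0 of 4: each of the 4 sessions runs the same statement with its
-- own "id % 4 = k" predicate, so the inserts cover disjoint company sets.
-- The concurrency comes from the sessions, not from the statement itself.
insert into users
select "user", company, "date"
from (
  select json_each(json -> 'users') as "user",
         json ->> 'company' as company,
         date(datetime) as "date"
  from companies
  where date(datetime) = '2015-05-18'
    and id % 4 = 0
) s;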
Say I got this SQL schema.
Table Job:
id,title, type, is_enabled
Table JobFileCopy:
job_id,from_path,to_path
Table JobFileDelete:
job_id, file_path
Table JobStartProcess:
job_id, file_path, arguments, working_directory
There are many other tables with varying numbers of columns, and they all have a foreign key job_id which is linked to id in table Job.
My questions:
Is this the right approach? I don't have a requirement to delete anything at any time; I will mostly need to select and insert.
Secondly, what is the best approach to get the list of jobs with the relevant details from all the different tables in a single database hit? E.g. I would like to select the top 20 jobs with their details; the details can be in any table (depending on the type column in table Job), which I don't know until runtime.
select (case when type = 'type1'
             then (select field from table1)
             else (select field from table2)
        end) as a
from table;
Could it be a solution for you?
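An alternative sketch for the "single database hit" requirement is to LEFT JOIN every detail table and let the non-matching ones come back as NULLs (TOP is SQL Server syntax, an assumption here; swap in the paging clause of your RDBMS):
-- One round trip: each job row carries the columns of its own detail
-- table, while the columns of the other detail tables are simply NULL.
SELECT TOP 20
       j.id, j.title, j.type,
       fc.from_path, fc.to_path,
       fd.file_path AS delete_path,
       sp.file_path AS process_path, sp.arguments, sp.working_directory
FROM Job j
LEFT JOIN JobFileCopy fc ON fc.job_id = j.id
LEFT JOIN JobFileDelete fd ON fd.job_id = j.id
LEFT JOIN JobStartProcess sp ON sp.job_id = j.id
WHERE j.is_enabled = 1
ORDER BY j.id;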
I'm building a BI report for a client where there is a 1-n related join involved.
The joined table has a field for employee ID (EmplId).
The query that I've built for this report is supposed to give a 1 in its field "OneEmployee" if all the related rows have the same employee in the EmplId field, and null if there are different employees, i.e.:
TaskTrans
TaskTransHours > EmplId: 'John'
TaskTransHours > EmplId: 'John'
This should give a 1 in the said field in the query
TaskTrans
TaskTransHours > EmplId: 'John'
TaskTransHours > EmplId: 'George'
This should leave the said field blank
The idea is to create a field where a case function checks this and returns the correct value. But my problem is whether there is a way to check for this through SQL.
select not count(*) from your_table
where employee_id = GIVEN_ID
and your_field not in ( select min(your_field)
from your_table
where employee_id = GIVEN_ID);
Note: my first idea was to use LIMIT 1 in the inner query, but MySQL didn't like it, so MIN it was - the point is to use any value, but only one. MIN should work, but the field should be indexed; then this query will actually execute rather fast, as only indexes would be used (obviously employee_id should also be indexed).
Note 2: do not get too confused by the NOT in front of COUNT(*). You want 1 when no row is different; I count the different ones and then give you NOT COUNT(*), which will be 1 if the count is 0, and otherwise 0.
Seems a job for a window COUNT():
SELECT
…,
CASE COUNT(DISTINCT TaskTransHours.EmplId) OVER () WHEN 1 THEN 1 END
AS OneEmployee
FROM …
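If your engine rejects DISTINCT inside a window function (several do), a grouped MIN/MAX comparison is an equivalent sketch (the TaskTransId join column is an assumption about the schema):
-- OneEmployee is 1 when every TaskTransHours row of a task carries the
-- same EmplId (MIN = MAX over non-null values), and NULL otherwise.
SELECT t.id,
       CASE WHEN MIN(h.EmplId) = MAX(h.EmplId) THEN 1 END AS OneEmployee
FROM TaskTrans t
JOIN TaskTransHours h ON h.TaskTransId = t.id
GROUP BY t.id;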
I have a table called Vehicle_Location containing the columns (and more):
ID NUMBER(10)
SEQUENCE_NUMBER NUMBER(10)
TIME DATE
and I'm trying to get the min/max/avg number of records per day per id.
So far, I have
select id, to_char(time), count(*) as c
from vehicle_location
group by id, to_char(time) having id = 16
which gives me:
ID TO_CHAR(TIME) COUNT(*)
---------------------- ------------- ----------------------
16 11-05-31 159
16 11-05-23 127
16 11-06-03 56
So I'd like to get the min/max/avg of the count(*) column. I am using Oracle as my RDBMS.
I don't have an Oracle instance to test on, but you should be able to just wrap the aggregator around your SELECT as a subquery/derived table/inline view.
So it would be (UNTESTED!!)
SELECT
AVG(s.c)
, MIN(s.c)
, MAX(s.c)
, s.ID
FROM
--Note this is just your query
(select id, to_char(time), count(*) as c from vehicle_location group by id, to_char(time) having id = 16) s
GROUP BY s.ID
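A slightly tidied variant of the same idea (equally untested): move the id filter into WHERE and truncate the date explicitly rather than relying on the session's to_char format:
-- Inner query: records per id per day; outer query: min/max/avg of those
-- daily counts.
SELECT s.id,
       MIN(s.c) AS min_per_day,
       MAX(s.c) AS max_per_day,
       AVG(s.c) AS avg_per_day
FROM (
  SELECT id, TRUNC(time) AS day, COUNT(*) AS c
  FROM vehicle_location
  WHERE id = 16
  GROUP BY id, TRUNC(time)
) s
GROUP BY s.id;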
Here's some reading on it:
http://www.devshed.com/c/a/Oracle/Inserting-SubQueries-in-SELECT-Statements-in-Oracle/3/
EDIT: Though normally it is a bad idea to select both the MIN and MAX in a single query.
EDIT2: The min/max issue is related to how some RDBMSs (including Oracle) handle aggregations on indexed columns. It may not affect this particular query, but the premise is that it's easy to use the index to find either the MIN or the MAX, but not both at the same time, because the index may not be used effectively for both.
Here's some reading on it:
http://momendba.blogspot.com/2008/07/min-and-max-functions-in-single-query.html