I am running the SQL below on a table with around 434,836,959 records. It takes more than 3 minutes to return the result.
select distinct col_1,col_2,col_3,col_4,col_5,
to_date(concat(year(col_6),'-',month(col_6), '-1')) as col_6_new,col_7,
cast(first_value(col_11) over (partition by col_1,col_2,col_5,col_4,concat(year(col_6),'-',month(col_6)) order by col_6) as double) as col_9,
cast(first_value(col_11) over (partition by col_1,col_2,col_5,col_4,concat(year(col_6),'-',month(col_6)) order by col_6 desc) as double) as col_10,
min(to_date(concat(year(col_6),'-',month(col_6), '-1'))) over (partition by col_1,col_2,col_5,col_4) as col_8
from my_table
When I checked the execution in the background, I could see that only 1 job and 1 stage were running at a time. Is there a way to parallelize this?
I even tried the settings below, but the jobs/stages still do not run in parallel.
spark.sql("set hive.exec.parallel=true")
spark.sql("set hive.exec.parallel.thread.number=16")
spark.sql("set hive.vectorized.execution = true")
spark.sql("set hive.vectorized.execution.enabled = true")
The Spark version I am using is 2.3.
Any help is greatly appreciated.
Every stage in the Spark execution plan corresponds to a set of operations that do not require shuffling data.
As far as I can see, you need at least 2 shuffles in your query:
One for calculating the window functions with clauses like partition by col_1,col_2,col_5,col_4,concat(year(col_6),'-',month(col_6)) order by col_6
One for calculating the final distinct operation.
Thus, 2 shuffles result in 3 Spark stages.
Since you cannot compute the distinct before all the window functions have been calculated, the stages must execute one after another and cannot be parallelized.
To verify this on your side, you can look at the execution DAG in the Spark UI.
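If you want to see exactly where those shuffle boundaries fall, you can prefix the query with EXPLAIN and look for Exchange operators in the physical plan. A minimal sketch with a shortened column list (not your full query), just to illustrate:

EXPLAIN
select distinct col_1, col_2,
       cast(first_value(col_11) over (partition by col_1, col_2 order by col_6) as double) as col_9
from my_table
-- each "Exchange hashpartitioning(...)" node in the printed physical plan
-- is a shuffle, i.e. a stage boundary in the job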
I use a ROW_NUMBER() function inside a CTE that causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but its cost has recently spiked after I increased the number of columns read from the CTE/query.
What confuses me is that the increase in cost is not proportional to the column count. Normally I can increase the column count without much issue. However, it seems my query has passed some threshold and now costs so much that the execution time has doubled from 1 hour to over 2 hours.
I can't figure out what has caused the spike in cost, and it's having an impact on the business. Any ideas or next steps for troubleshooting you can advise?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
  AND [versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00'
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns, which experiences the cost spike (the plans for the other versions look the same, except that this operator takes much less time, 40-50 minutes):
I did a comparison of three executions, each with a different column count in the INSERT INTO ... SELECT ... FROM statement. I can't share the spreadsheet, but I will try to convey my findings so far. The following is true of the query with the most columns:
It takes more than twice as long to execute as the other two executions
It performs more logical and physical reads and scans
It uses more CPU time
It reads the most from tempdb
The increase in execution time is not proportional to the increase in reads or the other metrics mentioned
It is true that there is a level 8 memory spill happening. I have tried updating statistics, but it didn't help, and all versions of the query suffer from the same spill, so the comparison is still like-for-like.
I know it can be hard to help with this kind of problem without being able to poke around, but I would be grateful if anyone could point me in the direction of what to check / try next.
P.S. The table it reads from is a heap and the table it joins to is indexed. The heap table needs to stay a heap, otherwise inserts into it will take too long and the problem is just kicked down the road.
Also, when I say I added more columns, I mean in the SELECT ... FROM versioned_events statement. The columns are replaced with "...." in the example above.
UPDATE
Using a temp table halved the execution time at the high column count that caused the issue, but it actually takes longer at the reduced column count. This supports the idea that some threshold is crossed when the column count is increased :(. In any event, we've used a temp table for now to see if it helps in production.
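For anyone trying the same workaround, this is roughly the shape of the change, a sketch only (the #versioned_events temp table name is made up, and the real column list is elided): the CTE is materialised once into a temp table, and both references in the final INSERT ... SELECT read from it instead of re-evaluating the CTE.

-- materialise the CTE once instead of evaluating it twice
SELECT [event].*
      ,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
            ELSE [event].[involvement_id]
       END AS [derived_involvement_id]
      ,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version]
                          ORDER BY [event_created_date] DESC, [timestamp] DESC) AS [latest_version]
INTO #versioned_events
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] AS [inv]
       ON [event].[service_delivery_id] = [inv].[service_delivery_id]
      AND [inv].[role_type_code] = 't'
      AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL
  AND (event.handle_space IS NULL
       OR (event.handle_space NOT LIKE 'x%' AND event.handle_space NOT LIKE 'y%'));

-- the INSERT then joins #versioned_events to the delta subquery exactly as before,
-- with both references to the CTE replaced by the temp table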
Since June 2nd we have been having issues with analytic functions. When the query (not the partitions) passes a certain size, the query fails with the following error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 125% of limit. Top memory consumer(s): analytic OVER() clauses: 97%, other/unattributed: 3%. at [....]
Has anyone encountered the same problem?
BigQuery chooses the number of parallel workers for OVER() clauses based on the size of the tables the query runs on. You can see a resources-exceeded error when too much data is being processed by the workers that BigQuery assigned to your query.
I assume this issue comes from the OVER() clause and the amount of data used. You'll need to try to tune your query script a bit (especially the OVER() clauses), as the error message says.
To read more about the error, take a look at the official documentation.
What would also help is slots, the unit of computational capacity required to execute SQL queries:
When you enroll in a flat-rate pricing plan, you purchase a dedicated number of slots to use for query processing. You can specify the allocation of slots by location for all projects attached to that billing account. Contact your sales representative if you are interested in flat-rate pricing.
I hope you find the above pieces of information useful.
We were able to overcome this limitation by splitting the original data into several shards and applying the analytic function to each shard.
In essence (for 8 shards):
WITH
t AS (
SELECT
RAND () AS __aux,
*
FROM <original table>
)
SELECT
* EXCEPT (__aux),
F () OVER (...) AS ...
FROM t
WHERE MOD (CAST (__aux*POW(10,9) AS INT64), 8) = 0
UNION ALL
....
SELECT
* EXCEPT (__aux),
F () OVER (...) AS ...
FROM t
WHERE MOD (CAST (__aux*POW(10,9) AS INT64), 8) = 7
Currently, I am using Hive with S3 storage.
I have 1,000,000 partitions in total right now. I am facing the following problem.
If I do:
select sum(metric) from foo where pt_partition_number = 'bar1'
select sum(metric) from foo where pt_partition_number = 'bar2'
the query execution time is less than 1 second for each.
But if I do:
select sum(metric) from foo where pt_partition_number IN ('bar1','bar2')
the query takes about 30 seconds. I think Hive is doing a directory scan in the case of the second query.
Is there a way to optimize the query? My request pattern always accesses data from two partitions.
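If the extra time really is spent listing partition directories for the IN predicate, one workaround to try (just a sketch, reusing the foo table and pt_partition_number values from the question and assuming metric is numeric) is to scan each partition separately and combine the results:

-- two single-partition scans, each pruning to exactly one directory,
-- combined with UNION ALL and summed in an outer query
select sum(metric_sum)
from (
  select sum(metric) as metric_sum from foo where pt_partition_number = 'bar1'
  union all
  select sum(metric) as metric_sum from foo where pt_partition_number = 'bar2'
) t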
I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just one expression in the PARTITION BY (e.g. address_token only, nothing after it). Sometimes the query times out (I use Mode Analytics and get this error: "An I/O error occurred while sending to the backend"). So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make window functions, especially the PARTITION BY part, run faster? E.g. use certain data types over others, or avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window function's partitioning clause should not have a big impact on performance. Do realize that your query returns all the rows in the table, so there might be a very large result set.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
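For reference, the suggested index would look roughly like this (a sketch; the index name is made up and the exact syntax may vary with the database behind Mode):

-- covering index so the window function can be computed from the index alone
create index idx_table1_token_date_price
    on table1 (address_token, list_date, original_list_price);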
You could try writing the query as:
select t1.*,
(select max(t2.original_list_price)
from table1 t2
where t2.address_token = t1.address_token and t2.list_date = t1.list_date
) as max_list_price
from table1 t1;
This should start returning results more quickly, because it doesn't have to calculate the window function value for all rows before returning any of them.
I have a very large table, CLAIMS, with the following columns:
p_key
c_key
claim_type
Each row is uniquely defined by p_key, c_key. Often there will be multiple c_keys for each p_key. The table would look like this:
p_key  c_key  claim_type
1      1      A
1      2      A
2      3      B
2      5      C
3      1      B
I want to find the minimum c_key for each p_key. This is my query:
SELECT p_key,
min(c_key) as min_ckey
from CLAIMS
GROUP BY p_key
The issue is, when I run this as a MapReduce job through the Hive CLI (0.13), the reduce portion takes 30 minutes to even get to 5% done. I'm not entirely sure what could cause such a simple query to take so long. This query exhibits the same issue:
SELECT p_key,
row_number() OVER(PARTITION BY p_key ORDER BY c_key) as RowNum
from CLAIMS
So my question is: why would the reduce portion of a seemingly simple MapReduce job take so long? Any suggestions on how to investigate or improve the query would also be appreciated.
Do you know if the data is imbalanced? If there is one p_key with a very large number of c_key values compared to the average case, then the reducer that deals with that p_key will take a very long time.
Alternatively, is it possible that there are only a small number of distinct p_key values overall? Since you're grouping by p_key, that would limit the number of reducers doing useful work.
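A quick way to check both possibilities (a sketch against the CLAIMS table from the question) is to look at the distribution of c_key counts per p_key:

-- largest groups first: one huge p_key indicates skew,
-- and very few rows overall indicates few distinct keys
select p_key, count(*) as c_key_count
from CLAIMS
group by p_key
order by c_key_count desc
limit 20;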
The reduce phase occurs in three stages: up to 33% progress is the shuffle, 33% to 66% is the sort, and 67% onwards is the actual reduce.
Your job sounds like it is getting hung up in the shuffle portion of the reduce phase. My guess is that your data is spread all over and this portion is IO bound: your observations are being moved to the reducers.
You can try bucketing your data:
create table claim_bucket (p_key string, c_key string, claim_type string)
clustered by (p_key) into 6 buckets
row format delimited fields terminated by ",";
You may want more or fewer buckets, and this will require some heavy lifting by Hive initially, but it should speed up subsequent queries of the table where p_key is used.
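Populating the bucketed table would look something like this (a sketch; on Hive 0.13 you need to enable bucketing enforcement so the rows actually land in the 6 buckets declared above):

set hive.enforce.bucketing = true;

-- rewrite the existing data into the bucketed layout
insert overwrite table claim_bucket
select p_key, c_key, claim_type
from CLAIMS;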
Of course, you haven't left much else to go on here. If you post an edit with more information, you might get a better answer. Good luck.