Optimize Hive query requesting data from two partitions

Currently, I am using Hive with S3 storage.
I have 1,000,000 partitions in total right now. I am facing a problem where:
If I run either of these queries:
select sum(metric) from foo where pt_partition_number = 'bar1'
select sum(metric) from foo where pt_partition_number = 'bar2'
execution time is less than 1 second.
But if I do
select sum(metric) from foo where pt_partition_number IN ('bar1','bar2')
the query takes close to 30 seconds. I am thinking Hive is doing a directory scan in the case of the second query.
Is there a way to optimize the query?
My request pattern always accesses data from two partitions.
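One workaround that might be worth trying (a sketch, not from the original post): since each single-partition query is fast, rewrite the IN filter as a UNION ALL of two single-partition scans, so each branch can be pruned to exactly one partition directory:
-- Sketch: combine the two fast single-partition scans instead of using IN
select sum(metric) from (
    select metric from foo where pt_partition_number = 'bar1'
    union all
    select metric from foo where pt_partition_number = 'bar2'
) t;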

Related

Athena pagination and performance issue

I have a huge data set in S3 and I am trying to query it using AWS Athena; the 3 parameters below are inputs for my query:
marketplaceId
startIndex
endIndex
But it takes 16 seconds to query just 50 records. (I am using Python to query data from Athena --> S3.)
What am I doing wrong here? And is the way I implemented pagination right or not?
SQL query which I am executing:
SELECT
    dataset_date,
    marketplace_id,
    gl_product_group,
    gl_product_group_desc,
    browse_node_id,
    browse_node_name,
    root_browse_node_id,
    browse_root_name,
    wt_xref_id,
    gt_xref_id,
    node_path,
    total_count_of_asins,
    buyable_asin_count,
    glance_view_count_t12m,
    ord_cnt,
    price_p50,
    price_p90,
    price_p100,
    row_num
FROM
(
    SELECT
        dataset_date,
        marketplace_id,
        gl_product_group,
        gl_product_group_desc,
        browse_node_id,
        browse_node_name,
        root_browse_node_id,
        browse_root_name,
        wt_xref_id,
        gt_xref_id,
        node_path,
        total_count_of_asins,
        buyable_asin_count,
        glance_view_count_t12m,
        ord_cnt,
        price_p50,
        price_p90,
        price_p100,
        row_number() over (
            order by
                browse_node_id,
                gl_product_group,
                glance_view_count_t12m desc
        ) as row_num
    from
    (
        select *
        from category_info
        WHERE marketplace_id = '<marketplaceId>'
    )
)
WHERE
    row_num between '<startIndex>' and '<endIndex>';
Update
After debugging my issue with timestamps I found it's taking 6 seconds for 1 query, and I am running two queries:
1st - to get the data, the query I mentioned above.
2nd - to get the count of the total number of rows in my table.
That's why it's taking 12-16 seconds.
So is there any way to get the total number of rows without the second query (select count(*) from category_info)?
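One possible way to avoid the second query (a sketch, not from the original post; it assumes Athena's window-function support, where count(*) over an empty window is allowed) is to compute the total row count in the same pass as the pagination:
-- Sketch: total_rows carries the full filtered count on every returned row
SELECT
    browse_node_id,
    glance_view_count_t12m,
    row_number() over (
        order by browse_node_id, gl_product_group, glance_view_count_t12m desc
    ) as row_num,
    count(*) over () as total_rows
FROM category_info
WHERE marketplace_id = '<marketplaceId>';
The row_num between '<startIndex>' and '<endIndex>' filter can then be applied in an outer query as before, and total_rows gives the total without running a separate count(*).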

Partition column useful or not while taking count

Which one will be best to use from the perspective of cost, time and processing? Here etl_batch_date is the partition column for the table.
1. Query - this query will process 607.7 KB when run
Table size: 9.77 MB
SELECT count(*) from demo
WHERE etlbatchid = '20200003094244327' and etl_batch_date = '2020-06-03'
2. Query - this query will process 427.6 KB when run
Table size: 9.77 MB
SELECT count(*) from demo WHERE etlbatchid = '20200003094244327'
Also, when you write the second query, does it read the data from every partition?
Your valuable comments will be appreciated.
Rule of thumb: Always use the partitioned column to filter data.
Play with this query:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
# 2.2 GB processed
For every date you add to the filter, roughly an extra gigabyte of data will be queried. That's because:
Filtering by datehour implies a read of the datehour column, so the query goes over more data.
But since datehour is the partitioning column, it only scans the days you asked for.
Now, if I add another filter:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
AND wiki='en'
# 686.8 MB processed
That processed less data!
That's because wiki is the main clustering column.
So try to always use partitions and clusters - even though for smaller tables the results might look less intuitive.

Get the first row of a nested field in BigQuery

I have been struggling with a question that seems simple, yet eludes me.
I am dealing with the public BigQuery table on bitcoin and I would like to extract the first transaction of each block that was mined. In other words, to replace a nested field with its first row, as it appears in the table preview. There is no field that can identify it, only the order in which it was stored in the table.
I ran the following query:
#StandardSQL
SELECT timestamp,
block_id,
FIRST_VALUE(transactions) OVER (ORDER BY (SELECT 1))
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
But it processes 492 GB when run and throws the following error:
Error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for OVER(ORDER BY) used too much memory..
It seems so simple, I must be missing something. Do you have an idea about how to handle such task?
#standardSQL
SELECT * EXCEPT(transactions),
(SELECT transaction FROM UNNEST(transactions) transaction LIMIT 1) transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
Recommendation: while playing with a large table like this one, I would recommend creating a smaller version of it, so it incurs less cost for your dev/test. The query below can help with this - you can run it in the BigQuery UI with a destination table which you will then use for your dev. Make sure you set Allow Large Results and unset Flatten Results so you preserve the original schema.
#legacySQL
SELECT *
FROM [bigquery-public-data:bitcoin_blockchain.blocks#1529518619028]
The value of 1529518619028 is taken from the query below (at the time of running). The reason I went four days back is that I know the number of rows in this table at that time was just 912 vs the current 528,858.
#legacySQL
SELECT INTEGER(DATE_ADD(USEC_TO_TIMESTAMP(NOW()), -24*4, 'HOUR')/1000)
An alternative approach to Mikhail's: Just ask for the first row of an array with [OFFSET(0)]:
#StandardSQL
SELECT timestamp,
block_id,
transactions[OFFSET(0)] first_transaction
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 10
That first row from the array still has some nested data that you might want to flatten to only their first rows too:
#standardSQL
SELECT timestamp
, block_id
, transactions[OFFSET(0)].transaction_id first_transaction_id
, transactions[OFFSET(0)].inputs[OFFSET(0)] first_transaction_first_input
, transactions[OFFSET(0)].outputs[OFFSET(0)] first_transaction_first_output
FROM `bigquery-public-data.bitcoin_blockchain.blocks`
LIMIT 1000

SQL Server query behavior

vw_Project is a view which involves 20 CTEs, joins them multiple times and returns 56 columns.
Many of these CTEs are self-joins (the classic "last row per group"; in our case we get the last related object - product / customer / manager - per Project).
Most of the tables involved (maybe 40?) don't exceed 1000 rows, and the view itself returns 634 rows.
We are trying to improve the very poor performance of this view.
We denormalized (went from TPT to TPH) and reduced the number of joins by half, with almost no impact.
But I don't understand the following results I am obtaining:
select * from vw_Project (TPT)
2 sec
select * from vw_Project (TPH)
2 sec
select Id from vw_Project (TPH , TPT is instant)
6 sec
select 1 from vw_Project (TPH , TPT is instant)
6 sec
select count(1) from vw_Project (TPH , TPT is instant)
6 sec
Execution plan for the last one (6 sec):
https://www.brentozar.com/pastetheplan/?id=r1DqRciBW
Execution plan after sp_updatestats:
https://www.brentozar.com/pastetheplan/?id=H1Cuwsor-
To me, that seems absurd. I don't understand what's happening, and it's hard to know whether my optimization strategies are relevant since I have no idea what justifies the apparently irrational behavior I'm observing...
Any clue ?
CTEs have no guaranteed order of execution, and 20 CTEs are far too many in my opinion. You can use OPTION (FORCE ORDER) to force execution from top to bottom.
For selecting a few thousand rows, however, anything more than 1 second is not acceptable regardless of complexity. I would choose a table function approach so I would have the luxury of creating hash tables or table variables inside, to have full control of each step. This way you limit the optimizer's scope to each step alone.
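A minimal sketch of the hint mentioned above, applied to one of the slow statements from the question (the dbo schema is an assumption):
-- FORCE ORDER preserves the join order as written in the view definition,
-- instead of letting the optimizer reorder the joins across the 20 CTEs.
SELECT COUNT(1)
FROM dbo.vw_Project
OPTION (FORCE ORDER);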

Optimizing Hive GROUP BY when rows are sorted

I have the following (very simple) Hive query:
select user_id, event_id, min(time) as start, max(time) as end,
count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;
The table has the following structure:
user_id event_id time interaction
Ex833Lli36nxTvGTA1Dv juCUv6EnkVundBHSBzQevw 1430481530295 0
Ex833Lli36nxTvGTA1Dv juCUv6EnkVundBHSBzQevw 1430481530295 1
n0w4uQhOuXymj5jLaCMQ G+Oj6J9Q1nI1tuosq2ZM/g 1430512179696 0
n0w4uQhOuXymj5jLaCMQ G+Oj6J9Q1nI1tuosq2ZM/g 1430512217124 0
n0w4uQhOuXymj5jLaCMQ mqf38Xd6CAQtuvuKc5NlWQ 1430512179696 1
I know for a fact that rows are sorted first by user_id and then by event_id.
The question is: is there a way to "hint" the Hive engine to optimize the query given that rows are sorted? The purpose of the optimization is to avoid keeping all groups in memory, since it's only necessary to keep one group at a time.
Right now this query, running on a 6-node 16 GB Hadoop cluster with roughly 300 GB of data, takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution will probably have a very small memory footprint and also be faster (since there is no need to look up group keys).
Create a bucketed sorted table. The optimizer will know it is sorted from the metadata.
See example here (official docs): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables
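A minimal sketch of such a table for the data above (the table name, bucket count and file format are my own illustrative choices, not from the original answer):
-- Bucketed, sorted copy of events_all; the metastore records the sort order.
CREATE TABLE events_all_bucketed (
    user_id     STRING,
    event_id    STRING,
    `time`      BIGINT,
    interaction INT
)
CLUSTERED BY (user_id, event_id)
SORTED BY (user_id, event_id)
INTO 64 BUCKETS
STORED AS ORC;

-- Older Hive versions need these settings to enforce bucketing/sorting on insert.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

INSERT OVERWRITE TABLE events_all_bucketed
SELECT user_id, event_id, `time`, interaction
FROM events_all;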
Count only interaction = 1: count(case when interaction = 1 then 1 end) as clicks - the CASE marks matching rows with 1 and everything else with NULL, and COUNT counts only the 1s.
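Putting that together with the query from the question (the start_time/end_time aliases are my renaming, since start and end can clash with keywords in some Hive versions):
select user_id,
       event_id,
       min(`time`) as start_time,
       max(`time`) as end_time,
       count(*) as total,
       count(case when interaction = 1 then 1 end) as clicks
from events_all
group by user_id, event_id;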