Hive Query Efficiency - sql

Could you help me with a Hive Query Efficiency problem? I have two queries working for the same problem. I just cannot figure out why one is much faster than the other. If you know please feel free to provide insight. Any info is welcomed!
Problem: I am trying to check the minimum value of a bunch of variables in a Hive parquet table.
Queries: I tried two queries as follows:
query 1
drop table if exists tb_1 purge;
create table if not exists tb_1 as
select 'v1' as name, min(v1) as min_value from src_tb union all
select 'v2' as name, min(v2) as min_value from src_tb union all
select 'v3' as name, min(v3) as min_value from src_tb union all
...
select 'v200' as name, min(v200) as min_value from src_tb
;
query 2
drop table if exists tb_2 purge;
create table if not exists tb_2 as
select min(v1) as min_v1
, min(v2) as min_v2
, min(v3) as min_v3
...
, min(v200) as min_v200
from src_tb
;
Result: Query 2 is much faster than query 1. The second query took about 5 minutes to finish. I don't know how long query 1 would take. After I submitted the first query, it took a long time to respond at all. Usually, after I submit a query, the system starts analyzing it and prints some compilation information in the terminal. For my first query, however, the system did not react at all after submission, so I just killed it.
What do you think? Thank you in advance.

Query execution time depends on the environment in which the query runs.
In MSSQL:
It is tempting to reason about a query as if it were an algorithm from a textbook, but in practice the execution time depends on other things as well.
For example both of your queries have SELECT statement that perform on a table and at first glance, they need to read all rows, but database server must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (Please see this for more information about query-processing architecture)
You can view the execution plan in MSSQL by following this article, and I think you will understand the difference better once you see the execution plans for both of your queries.
Edit (Hive)
Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.
Please see LanguageManual Explain for more information.
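As a rough illustration, the two query shapes from the question can be compared with EXPLAIN. This is only a sketch, cut down to two columns (the real queries list v1 through v200), and the exact plan output depends on the Hive version and execution engine.

```sql
-- Shape of query 1: one SELECT per column, glued together with UNION ALL.
-- Each branch appears in the plan with its own scan of src_tb.
EXPLAIN
SELECT 'v1' AS name, min(v1) AS min_value FROM src_tb
UNION ALL
SELECT 'v2' AS name, min(v2) AS min_value FROM src_tb;

-- Shape of query 2: a single SELECT with several aggregates,
-- so the plan contains a single scan of src_tb.
EXPLAIN
SELECT min(v1) AS min_v1,
       min(v2) AS min_v2
FROM src_tb;
```

Counting the TableScan operators in the two outputs is usually enough to see why the first form is so much more expensive.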

What is surprising? The first query has to read src_tb a total of 200 times. The second reads the data once and performs 200 aggregations. It is a no brainer that it is faster.
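If the long (name, min_value) layout of query 1 is still wanted, one option is to aggregate once and then unpivot the single result row. The following is a sketch using Hive's built-in stack() table function, shown for three columns only; it assumes the v columns share a compatible type, which the question does not state.

```sql
-- Aggregate in one pass over src_tb, then turn the single wide row
-- into one (name, min_value) row per column.
SELECT t.name, t.min_value
FROM (
  SELECT min(v1) AS min_v1,
         min(v2) AS min_v2,
         min(v3) AS min_v3
  FROM src_tb
) m
LATERAL VIEW stack(3,
                   'v1', m.min_v1,
                   'v2', m.min_v2,
                   'v3', m.min_v3) t AS name, min_value;
```

The first argument of stack() is the number of output rows, so it has to match the number of name/value pairs that follow.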

Related

Optimize SELECT MAX(timestamp) query

I would like to run this query about once every 5 minutes to be able to run an incremental query to MERGE to another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the query? Or is it just the columnar nature of BigQuery will make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS table within your dataset. Doc here.
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS table holds metadata, with one record for each of your partitions. It is therefore much smaller than your table, so querying it is an easy way to cut your query costs (it is also much faster to query).
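On the partitioning part of the question: a filter on the partitioning column lets BigQuery prune partitions, so the MAX should only read recent data. A minimal sketch, assuming myTable is partitioned on the timestamp column and that new rows always fall within the last day (both are assumptions, not stated in the question):

```sql
-- Only partitions that can contain rows from the last day are scanned.
-- The 1 DAY window is an illustrative value; widen it if rows can arrive late.
SELECT MAX(`timestamp`) AS max_ts
FROM dataset.myTable
WHERE `timestamp` >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);
```

If the window contains no rows the query returns NULL, so the caller would need a fallback to a wider window or to the previous high-water mark.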

Is there performance impact when Non-Aggregate SQL functions are used in a SELECTed Column?

We have a report that uses a long and complex query that has the SELECT statement like below:
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
NVL(NrDostawcy,'BRAK') supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
We recently made modifications to this Query and just modified the 3rd SELECTed column to add a REGEXP_LIKE
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
--NVL(NrDostawcy,'BRAK') supplier_registration,
Case When (NrDostawcy is not null and regexp_like(substr(NrDostawcy,1,2),'^[a-zA-Z]*$')) Then substr(NrDostawcy,3) else NVL(NrDostawcy,'BRAK') End supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
I checked the explain plans of both queries and they turned out to have the same plan hash value.
Does this mean there's no impact on performance if I use seeded, non-aggregate SQL functions in SELECTed columns?
I believe there is a performance impact if they're used in the WHERE clause, but I wasn't sure whether the same applies to the SELECTed columns.
Apologies in advance, as I can't provide the exact query: it's proprietary and very long and complex.
I also don't think I can create a good enough sample that would match the explain plan of the actual query, as it joins over 10 tables with thousands of rows of data.
Thank you!
Since you are running this query on Oracle, here's my advice. Run the query with the Oracle hint /*+ gather_plan_statistics */, once without the regex and once with it. Then find both statements in the shared pool (v$sql). The hint gives you the exact buffer gets, physical reads and time spent in each step of the plan. With that data you can analyze in detail how much more time the query with the regex needed to execute. I advise you to do this on data that returns more than, let's say, 10k rows, so that the difference is visible (if you run this with 100 rows, no difference will be seen).
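A minimal sketch of that procedure, using a cut-down stand-in for the report query. It assumes NrDostawcy can be selected from table1 alone, which the question does not confirm; the real statement joins many tables.

```sql
-- Stand-in for the report query, with the hint added.
-- Fetch all rows before looking at the statistics.
SELECT /*+ gather_plan_statistics */
       CASE
         WHEN NrDostawcy IS NOT NULL
              AND REGEXP_LIKE(SUBSTR(NrDostawcy, 1, 2), '^[a-zA-Z]*$')
         THEN SUBSTR(NrDostawcy, 3)
         ELSE NVL(NrDostawcy, 'BRAK')
       END AS supplier_registration
FROM table1 t1;

-- In the same session, right afterwards: show the plan of the last
-- executed statement with actual rows, buffer gets and time per step.
SELECT *
FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
```

Running the same pair of statements with the plain NVL version of the column gives the second set of numbers to compare against.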
The execution plan is the same as it needs to query exactly the same data from the same tables. You should also see the amount of data (logical IO) unchanged.
What will not be the same however is the execution time, as the regexp_like will consume more CPU, even if you see the logical IO unchanged.
Note that if you changed which columns are selected, the execution plan could change: if all selected columns were part of an index, the optimizer might skip the table access and read the data from the index only.
It depends on the query and the I/O being done to get the data. Sometimes you can try creating an Oracle function-based index; you may see some improvement.
Check this link, it could help you.
https://jeffkemponoracle.com/2007/11/will-oracle-use-my-regexp-function-based-index/
thanks

Performance Tuning for an insert query

Can someone help me tune this query? I am new to performance tuning in Oracle.
INSERT INTO mdm_id_relation
SELECT
pat_key, hub_pat_id, msa_pat_id, pat_id
FROM
ods_raw_patient_mdm_process p1
WHERE NVL (pat_id, 'NULL') IN (SELECT pat_id
FROM mdm_id_relation)
AND NOT EXISTS (SELECT pat_key
FROM mdm_id_relation p2
WHERE p1.pat_key = p2.pat_key);
To tune an INSERT query, you'll need the following ingredients:
A place to test your query. Ideally a separate database, but a separate schema might do as well. Minimally a copy of the tables and indexes that are involved. Reason: the INSERT will change data, you'll need to run different versions of the query until you are happy with the performance.
The test tables need to have exactly the same structure as the real table and roughly the same amount of data as the real thing. Reason: The performance of the INSERT depends heavily on both structure and amount.
Up to date statistics: Look up DBMS_STATS.GATHER_TABLE_STATS and how to use it. Reason: Give the query optimizer a chance to find a good query plan.
A way to measure performance (wall clock seconds or Oracle costs etc.), and, even better, access to the query plan (SQL Developer: Explain Plan button, or have a look at William's script).
When I need to tune INSERT statements, I normally start with the SELECT part and work on it until I am happy with it. First I run SELECT ...; when that is fine, I run a CREATE TABLE foo NOLOGGING AS SELECT ... to measure the SELECT over all the rows. When that's fine, I test the whole INSERT ... SELECT ... statement.
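The statistics and query plan steps might look like the sketch below. It assumes the test copies of both tables live in the current schema (hence ownname => USER); adjust the owner if they do not.

```sql
-- Refresh optimizer statistics on the test copies of both tables.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MDM_ID_RELATION');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'ODS_RAW_PATIENT_MDM_PROCESS');
END;
/

-- Look at the plan of the SELECT part on its own, before testing the INSERT.
EXPLAIN PLAN FOR
SELECT pat_key, hub_pat_id, msa_pat_id, pat_id
FROM ods_raw_patient_mdm_process p1
WHERE NVL(pat_id, 'NULL') IN (SELECT pat_id FROM mdm_id_relation)
  AND NOT EXISTS (SELECT pat_key
                  FROM mdm_id_relation p2
                  WHERE p1.pat_key = p2.pat_key);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```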
Any issue with performance is going to be the select, not the insert. I think this is an equivalent query:
INSERT INTO mdm_id_relation (pat_key, hub_pat_id, msa_pat_id, pat_id) -- always list the columns!
SELECT pat_key, hub_pat_id, msa_pat_id, pat_id
FROM ods_raw_patient_mdm_process p1
WHERE EXISTS (SELECT 1
FROM mdm_id_relation mir
WHERE mir.pat_id = p1.pat_id
) AND
NOT EXISTS (SELECT 1
FROM mdm_id_relation mir
WHERE p1.pat_key = mir.pat_key
);
For this query, you want two indexes: mdm_id_relation(pat_id) and mdm_id_relation(pat_key). These should be a big help on performance.
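Those two indexes could be created as follows. The index names are made up for illustration; use whatever naming convention the schema already follows.

```sql
-- Supports the EXISTS lookup on pat_id.
CREATE INDEX mdm_id_relation_pat_id_ix ON mdm_id_relation (pat_id);

-- Supports the NOT EXISTS lookup on pat_key.
CREATE INDEX mdm_id_relation_pat_key_ix ON mdm_id_relation (pat_key);
```

Because the INSERT also writes to mdm_id_relation, each extra index adds some maintenance cost per inserted row, so it is worth timing the full statement and not only the SELECT.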
Note: Test the select first before doing the insert.

Impala Query Performance

I am running a POC environment with only one name node and one data node. The Impala daemon runs on the data node. Each node has 128GB of memory. I have set mem_limit to 60GB.
I have two big tables in Impala. The first table has around 635 million records, while the second has around 250000 records. I inner join these two tables on a common parameter. The SQL statement is as follows:
select a.*, b.* from table_a a inner join table_b b on a.param=b.param order by a.t_date desc
When I use EXPLAIN, it shows Estimated Per-Host Requirements: Memory=992.03MB VCores=2. When I ran this query, it took more than one hour and the result had yet to be returned. I am wondering why it takes so long. Is this related to the mem_limit setting? How can I tune such a query?
Try the tuning advice in Impala performance.
Some ideas:
Put the big table first in the join: big_table join small_table.
Partition on the param column.
If many queries execute at the same time, you should enable Admission Control (2) and Dynamic Resource Pools (3).
Run summary in impala-shell after executing your query to see which step takes a long time.
And please post the full output of the EXPLAIN statement.
P.S. Sorry, I don't have enough reputation to post more than 2 links.
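In impala-shell, the EXPLAIN and summary steps could look like this sketch, which reuses the query from the question unchanged:

```sql
-- Show the full plan before running anything.
EXPLAIN
SELECT a.*, b.*
FROM table_a a
INNER JOIN table_b b ON a.param = b.param
ORDER BY a.t_date DESC;

-- Run the query itself.
SELECT a.*, b.*
FROM table_a a
INNER JOIN table_b b ON a.param = b.param
ORDER BY a.t_date DESC;

-- impala-shell command: per-operator timings and row counts
-- for the query that just ran.
SUMMARY;
```

The operator with the largest time in the SUMMARY output is the place to start. With ORDER BY over the full join result and no LIMIT, the sort is a likely candidate, though only the output can confirm that.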

Fastest execution time for querying on Big size table

I need advice on how to get the fastest results when querying a big table.
I am using SQL Server 2012, and my situation is like this:
I have 5 tables containing transaction records; each table has 35 million records.
All tables have 14 columns. The columns I need to search are GroupName, CustomerName, and NoRegistration. I also have a view that combines all 5 tables.
The GroupName, CustomerName, and NoRegistration values are not unique in each table.
My application has a function to search these columns.
The queries are like this:
Search by Group Name:
SELECT DISTINCT(GroupName) FROM TransactionRecords_view WHERE GroupName LIKE ''+@GroupName+'%'
Search by Name:
SELECT DISTINCT(CustomerName) AS 'CustomerName' FROM TransactionRecords_view WHERE CustomerName LIKE ''+@Name+'%'
Search by NoRegistration:
SELECT DISTINCT(NoRegistration) FROM TransactionRecords_view WHERE LOWER(NoRegistration) LIKE LOWER(@NoRegistration)+'%'
My question is: how can I achieve the fastest execution time for searching?
Under my current conditions, every search takes 3 to 5 minutes.
My idea is to make a new table containing the distinct GroupName, CustomerName, and NoRegistration values from all 5 tables.
Would my idea make execution faster? Or is there any other idea?
Thank you
EDIT:
This is the query for the view "TransactionRecords_view":
CREATE VIEW TransactionRecords_view
AS
SELECT * FROM TransactionRecords_1507
UNION ALL
SELECT * FROM TransactionRecords_1506
UNION ALL
SELECT * FROM TransactionRecords_1505
UNION ALL
SELECT * FROM TransactionRecords_1504
UNION ALL
SELECT * FROM TransactionRecords_1503
You must show the SQL of TransactionRecords_view. Do you have indexes? What is the collation of the NoRegistration column? Paste the Actual Execution Plan for each query.
Ok, so you don't need to make those new tables. If you create Non-Clustered indexes based upon these fields it will (in effect) do what you're after. The index will only store data on the columns that you indicate, not the whole table. Be aware, however, that indexes are excellent to aid in SELECT statements but will negatively affect any write statements (INSERT, UPDATE etc).
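As a sketch, the indexes for one of the five underlying tables might look like this. The index names are hypothetical, and the same three statements would be repeated for TransactionRecords_1506 through TransactionRecords_1503.

```sql
-- One narrow index per searched column, so a prefix search such as
-- LIKE 'abc%' can seek on the index and not scan the 14-column table.
CREATE NONCLUSTERED INDEX IX_TransactionRecords_1507_GroupName
    ON TransactionRecords_1507 (GroupName);

CREATE NONCLUSTERED INDEX IX_TransactionRecords_1507_CustomerName
    ON TransactionRecords_1507 (CustomerName);

CREATE NONCLUSTERED INDEX IX_TransactionRecords_1507_NoRegistration
    ON TransactionRecords_1507 (NoRegistration);
```

Wrapping the column in a function, as in LOWER(NoRegistration), generally prevents an index seek. If the column's collation is already case-insensitive, the LOWER() calls can probably be dropped, which is why the collation question matters.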
Next you want to run the queries with the actual execution plan switched on. This will show you how the optimizer has decided to run each query (in the back end). Are there any particular issues here? Are any of the steps taking up a lot of the overall operator cost? There are plenty of great instructional videos about execution plans on YouTube; check them out if you haven't looked at execution plans before.
Did you check the actual execution plan for missing indexes?
Moreover, since you use a LIKE clause on varchar columns, Full-Text Search may be useful for you:
https://msdn.microsoft.com/en-us/library/ms142571(v=sql.120).aspx