Is there a performance impact when non-aggregate SQL functions are used in a SELECTed column? - sql

We have a report that uses a long and complex query whose SELECT statement looks like this:
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
NVL(NrDostawcy,'BRAK') supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
We recently modified this query, changing only the 3rd SELECTed column to add a REGEXP_LIKE:
SELECT
NVL(nazwawystawcy,'BRAK') supplier_name,
NVL(AdresDostawcy,'BRAK') supplier_address,
--NVL(NrDostawcy,'BRAK') supplier_registration,
CASE
  WHEN NrDostawcy IS NOT NULL
       AND REGEXP_LIKE(SUBSTR(NrDostawcy,1,2),'^[a-zA-Z]*$')
  THEN SUBSTR(NrDostawcy,3)
  ELSE NVL(NrDostawcy,'BRAK')
END supplier_registration,
DowodZakupu document_number,
DataZakupu document_issue_date,
DataWplywu document_recording_date,
trx_id,
KodKrajuNadaniaTIN country_code,
DokumentZakupu document_type_code,
payment_split MPP,
box_number box_number,
box_amount box_amount,
box_type box_type,
display_order display_order
...
FROM table1 t1
,table2 t2
....
I checked the explain plans of both queries and they turned out to have the same plan hash value.
Does this mean there's no impact on performance if I use seeded, non-aggregate SQL functions in SELECTed columns?
I believe there is a performance impact if they're used in the WHERE clause, but I wasn't sure whether the same applies to the SELECTed columns.
Apologies in advance: I can't provide the exact query since it's proprietary and very long and complex.
I also don't think I can create a good enough sample that would match the explain plan of the actual query, as it joins over 10 tables with thousands of rows of data.
Thank you!

Since you are running this query on Oracle, here's my advice. Run the query with the Oracle hint /*+ gather_plan_statistics */, once without the regex and once with it. Then find both statements in the shared pool (v$sql). The hint will give you the exact buffer gets, physical reads and also the time spent in each step of the plan. With that data you can analyze in detail how much more time the query with the regex needed to execute. I advise you to do this on data that returns more than, let's say, 10k rows. That way the difference should be visible (if you run this with 100 rows, no difference will be seen).
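The steps above can be sketched roughly as follows. The cut-down SELECT (and the assumption that NrDostawcy lives in table1) and the v$sql filter text are illustrative only, not the real report query. DBMS_XPLAN.DISPLAY_CURSOR is not named in the answer; it is one common way to read the per-step figures that the hint collects.

```sql
-- 1) Run the statement with row-source statistics enabled.
--    Illustrative cut-down query: the real one joins many more tables,
--    and the placement of NrDostawcy in table1 is an assumption.
SELECT /*+ gather_plan_statistics */
       NVL(NrDostawcy, 'BRAK') supplier_registration
FROM   table1 t1;

-- 2) In the same session, show the plan of the last executed statement
--    with actual rows, buffer gets and elapsed time per step.
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));

-- 3) Or locate the cursor in the shared pool and compare the totals.
--    cpu_time and elapsed_time are reported in microseconds.
SELECT sql_id, child_number, executions,
       buffer_gets, disk_reads, cpu_time, elapsed_time
FROM   v$sql
WHERE  sql_text LIKE 'SELECT /*+ gather_plan_statistics */%';
```

Fetch all rows of the hinted query before reading the figures, and repeat the same three steps for the regex variant so the two sets of numbers can be compared side by side.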

The execution plan is the same because the query needs to read exactly the same data from the same tables. You should also see the amount of data read (logical I/O) unchanged.
What will not be the same, however, is the execution time: regexp_like will consume more CPU, even though the logical I/O is unchanged.
Note that if you change which columns are selected, the execution plan could change: if all selected columns are part of an index, the optimizer might skip the table access and read the data from the index only.

It depends on the query and the I/O being done to get the data. Sometimes you can try creating an Oracle function-based index; you may see some improvement.
Check this link; it could help you.
https://jeffkemponoracle.com/2007/11/will-oracle-use-my-regexp-function-based-index/
thanks
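For illustration only, a function-based index in Oracle looks like the sketch below. The index name is hypothetical, and it is assumed here that NrDostawcy and trx_id are columns of table1. An index like this targets predicates that use the same expression; it is unlikely to reduce the CPU cost of an expression that is only evaluated in the select list.

```sql
-- Hypothetical index on the two-character prefix of NrDostawcy.
CREATE INDEX t1_nr_prefix_idx
    ON table1 (SUBSTR(NrDostawcy, 1, 2));

-- A predicate written with the same expression can use the index.
SELECT trx_id
FROM   table1
WHERE  SUBSTR(NrDostawcy, 1, 2) = 'PL';
```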

Related

Hive Query Efficiency

Could you help me with a Hive query efficiency problem? I have two queries that solve the same problem, and I cannot figure out why one is much faster than the other. If you know, please feel free to provide insight. Any info is welcome!
Problem: I am trying to check the minimum value of a bunch of variables in a Hive parquet table.
Queries: I tried two queries, as follows:
query 1
drop table if exists tb_1 purge;
create table if not exists tb_1 as
select 'v1' as name, min(v1) as min_value from src_tb union all
select 'v2' as name, min(v2) as min_value from src_tb union all
select 'v3' as name, min(v3) as min_value from src_tb union all
...
select 'v200' as name, min(v200) as min_value from src_tb
;
query 2
drop table if exists tb_2 purge;
create table if not exists tb_2 as
select min(v1) as min_v1
, min(v2) as min_v2
, min(v3) as min_v3
...
, min(v200) as min_v200
from src_tb
;
Result: Query 2 is much faster than query 1. The second query took roughly 5 minutes to finish. I don't know how long query 1 would take: normally, after I submit a query, the system starts analyzing it and prints some compile information in the terminal, but for the first query nothing appeared for a long time after submission, so I killed it.
What do you think? Thank you in advance.
Query execution time depends on the environment in which you execute it.
Take MSSQL as an example.
It is tempting to think that query execution follows the algorithms described in theoretical resources, but in practice it depends on other things.
For example, both of your queries have SELECT statements that run against a table and, at first glance, need to read all rows. But the database server must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (Please see this for more information about query-processing architecture)
You can see the execution plan in MSSQL by reading this article, and I think you will understand better by looking at the execution plans for both of your queries.
Edit (Hive)
Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.
Please see LanguageManual Explain for more information.
What is surprising? The first query has to read src_tb a total of 200 times. The second reads the data once and performs 200 aggregations. It is a no-brainer that the second is faster.
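If the long (name, min_value) layout of query 1 is what you need, a single scan can still produce it. The sketch below is one possible approach, not something proposed in the answers: it computes the minimums in one pass and then unpivots the single result row with Hive's stack() table function. Only three columns are shown, and it assumes v1, v2 and v3 share a data type.

```sql
-- Inspect the stages Hive would run for the single-pass aggregation.
EXPLAIN
SELECT min(v1) AS min_v1, min(v2) AS min_v2, min(v3) AS min_v3
FROM src_tb;

-- One scan of src_tb, then unpivot the single result row into
-- (name, min_value) pairs with the stack() table function.
-- Assumes v1, v2 and v3 share a data type.
SELECT s.name, s.min_value
FROM (
  SELECT min(v1) AS min_v1, min(v2) AS min_v2, min(v3) AS min_v3
  FROM src_tb
) m
LATERAL VIEW stack(3,
                   'v1', m.min_v1,
                   'v2', m.min_v2,
                   'v3', m.min_v3) s AS name, min_value;
```

Running EXPLAIN on the UNION ALL version as well should make the difference in the number of table scans visible.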

DB2 query using zero value in IN CLAUSE is causing table scan, index on column is ignored

SELECT * FROM TABLE1 WHERE COL1 in( 597966104, 597966100);
SELECT * FROM TABLE1 WHERE COL1 in( 0, 597966100)
In the above two queries, the first uses the index created on COL1 but the second does not. The only difference between them is that zero (0) is used in the IN clause of the second query. Why does the zero cause the index to be ignored? This leads to a table scan and slows down the query. Is there any solution for this problem? Any help on this issue is welcome and appreciated. The database used is DB2.
DB2 has a cost-based optimizer. It tries to figure out the best access plan and uses its statistics and configuration to determine it.
In your case the number of rows with col1 = 0 could really matter. For example, when col1 = 0 for 40% of your data, it could be cheaper to do the table scan.
If you want to figure out more details, explain the query and you will see how the data is accessed and how many rows the optimizer estimates for the result set.
Make sure you have correct and up-to-date statistics by running runstats for the table(s), as this will be the most important source of information for the optimizer.
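A minimal sketch of those two checks, assuming Db2 for Linux/UNIX/Windows (where RUNSTATS can be invoked through the SYSPROC.ADMIN_CMD procedure). MYSCHEMA is a placeholder schema name, not something given in the question.

```sql
-- How many rows does each value in the IN list actually match?
SELECT COL1, COUNT(*) AS row_count
FROM   TABLE1
WHERE  COL1 IN (0, 597966100)
GROUP BY COL1;

-- Refresh statistics, including value distribution, so the optimizer
-- knows how common COL1 = 0 is. MYSCHEMA is a placeholder schema name.
CALL SYSPROC.ADMIN_CMD(
  'RUNSTATS ON TABLE MYSCHEMA.TABLE1 WITH DISTRIBUTION AND DETAILED INDEXES ALL'
);
```

If the count for COL1 = 0 turns out to be a large share of the table, the table scan may well be the cheaper plan and not a problem to fix.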

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading off storage space and execution time, ideally, to achieve something running in under a minute?
I am not a database expert and have been reading the Oracle performance tuning docs, but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table were fixed and I simply needed to apply different filters to the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
Apart from indexes, also try the following. My two cents!
Try running this query with the PARALLEL option to employ multiple processors: /*+ PARALLEL(table1,4) */.
NVL is evaluated for millions of rows, and this will have some impact. Is there any way the data can be organised to avoid it?
If you know the dates in advance, you could split this query into two steps: first fetch the ids from TABLE2 using the start date and end date, then JOIN them to TABLE1 using a view or temp table. This way the index (with id as its leading edge) is used optimally.
Thanks!
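As a sketch of where the PARALLEL hint would go: it sits directly after the SELECT keyword and, when the table has an alias, refers to the alias. The question's simplified date column is written as obs_date here (a hypothetical name) because DATE is a reserved word in Oracle, and the degree of 4 is simply the value from the suggestion above.

```sql
SELECT /*+ PARALLEL(t1, 4) */
       t1.obs_date,
       SUM(t1.value * NVL(t2.factor, 1)) AS total
FROM   table1 t1
LEFT JOIN table2 t2
       ON  t1.id = t2.id
       AND t1.obs_date BETWEEN t2.start_date AND t2.end_date
GROUP BY t1.obs_date;
```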

In which sequence are queries and sub-queries executed by the SQL engine?

Hello, I took a SQL test and am curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The answers were:
1. primary query -> sub query -> sub sub query and so on
2. sub sub query -> sub query -> prime query
3. the whole query is interpreted at one time
4. There is no fixed sequence of interpretation, the query parser takes a decision on fly
I chose the last answer (just supposing that it is the most reliable compared with the others).
Now the curiosity:
Where can I read about this and, briefly, what is the mechanism behind all of it?
Thank you.
I think answer 4 is correct. There are a few considerations:
Type of subquery: is it correlated or not? Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
here the subquery is correlated - there is no way to compute the subquery unless t1.type is known. Since the value for t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
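To illustrate why the written form does not fix an execution order, the correlated IN query above can be expressed in two other ways that return the same rows. This is only a sketch of equivalent formulations; whether a given optimizer actually transforms one into another depends on the product and its statistics.

```sql
-- Correlated IN from the text above, written as EXISTS.
SELECT *
FROM   t1
WHERE  EXISTS (
         SELECT 1
         FROM   t2
         WHERE  t2.id   = t1.id
         AND    t2.type = t1.type
       );

-- The same rows as a join against the distinct (id, type) pairs of t2.
SELECT t1.*
FROM   t1
JOIN  (SELECT DISTINCT id, type FROM t2) d
       ON  d.id   = t1.id
       AND d.type = t1.type;
```

In the join form nothing has to be evaluated once per outer row, which is the kind of freedom the optimizer has with the original query too.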
Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order. But it's not quite "on the fly".
Even with identical servers, schema, queries, and data, I've seen execution plans differ.
The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.
It usually depends on your DBMS, but I think the second answer is more plausible.
The primary query usually can't be calculated without the sub-query results.

Oracle SQL query - unexpected query plan

I have a very simple query that's giving me unexpected results. Hints on where to troubleshoot it would be welcome.
Simplified, the query is:
SELECT Obs.obsDate,
Obs.obsValue,
ObsHead.name
FROM ml.Obs Obs
JOIN ml.ObsHead ObsHead ON ObsHead.hdId = Obs.hdId
WHERE obs.hdId IN (53, 54)
This gives me a query cost of: 963. However, if I change the query to:
SELECT Obs.obsDate,
Obs.obsValue,
ObsHead.name
FROM ml.Obs Obs
JOIN ml.ObsHead ObsHead ON ObsHead.hdId = Obs.hdId
WHERE ObsHead.name IN ('BP SYSTOLIC', 'BP DIASTOLIC')
Although it (should) return the same data, the estimated cost shoots up to 17688. Where is the problem here likely to lie? Thanks.
Edit: The query plan says that the index on ObsHead.Name is being used for a range scan, and the table access on ObsHead only costs 4. There's another index on Obs.hdId that's being used for a range scan costing 94: it's the Nested Loops join between the tables that jumps up to 17K.
As has already been stated, the plan's cost is not intended for comparing two different queries, only for comparing different paths for the same query.
This is only a guess, but in this case, the cardinality field of the plan might be more useful to you. If the index on OBSHEAD is not unique and the statistics were gathered using an estimate, then the optimizer may not know exactly how many rows to expect when querying that table. The cardinality will tell you whether this is true or not (ideally, you'll be seeing a cardinality of 2 for OBSHEAD).
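One way to look at the estimated cardinality is to explain the slower query and display the plan. This is a sketch using the query from the question and the standard DBMS_XPLAN.DISPLAY function.

```sql
EXPLAIN PLAN FOR
SELECT Obs.obsDate, Obs.obsValue, ObsHead.name
FROM   ml.Obs Obs
JOIN   ml.ObsHead ObsHead ON ObsHead.hdId = Obs.hdId
WHERE  ObsHead.name IN ('BP SYSTOLIC', 'BP DIASTOLIC');

-- The Rows column is the optimizer's cardinality estimate per step.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```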
Another suggestion is to check the statistics on OBS. It seems likely that is a table that grows frequently, in which case, January 28th is not recent enough to have gathered the statistics. Assuming monitoring is turned on for this table, the queries below can tell you if the statistics are stale and need to be refreshed.
select owner, table_name, last_analyzed, stale_stats
from all_tab_statistics
where owner = 'ML' and table_name = 'OBS';
select owner, index_name, last_analyzed, stale_stats
from all_ind_statistics
where owner = 'ML' and table_name = 'OBS';
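If those queries report stale statistics, they can be refreshed with DBMS_STATS. A minimal sketch, assuming you have the privileges to gather statistics on the ML schema:

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'ML',
    tabname => 'OBS',
    cascade => TRUE   -- also refresh statistics on the table's indexes
  );
END;
/
```

Re-running the explain plan afterwards shows whether the estimates, and possibly the join method, have changed.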
There is probably an index on hdId (which there is if it's the primary key, which I suspect is the case) and not on name which means that the second query will have to do a full table scan.
Costs are only useful for comparing different plans for one query; they're not so useful for comparing different queries.
You need to look at the plans and compare them in terms of the actions they perform.
I suspect the actual performance of these queries will be similar - however it would be interesting to know whether the first query uses a hash join, which might help things if the percentage of records in obs that are matched is significant.
I find the costs supplied by the optimizer to be interesting but not particularly useful. The best way I've found to compare queries is to run them and see how they perform relative to one another.
Share and enjoy.