Improve performance of query - SQL

I am using SQLite.
I have a query which gets records after going through 6 different tables.
Each table contains many records.
The query below has been written based on the PK-FK relationships, but it is taking too much time to retrieve the data.
I am not able to alter the database or add indexes.
SELECT DISTINCT A.LINK_ID AS LINK_ID,
       B.POI_ID
FROM RDF_LINK AS A,
     RDF_POI AS B,
     RDF_POI_ADDRESS AS c,
     RDF_LOCATION AS d,
     RDF_ROAD_LINK AS e,
     RDF_NAV_LINK AS f
WHERE B.[CAT_ID] = '5800'
  AND B.[POI_ID] = c.[POI_ID]
  AND c.[LOCATION_ID] = d.[LOCATION_ID]
  AND d.[LINK_ID] = A.[LINK_ID]
  AND A.[LINK_ID] = e.[LINK_ID]
  AND A.[LINK_ID] = f.[LINK_ID]
Am I using the wrong method? Do I need to use IN?
Output of the EXPLAIN QUERY PLAN command:
0 0 3 SCAN TABLE RDF_LOCATION AS d (~101198 rows)
0 1 0 SEARCH TABLE RDF_LINK AS A USING COVERING INDEX sqlite_autoindex_RDF_LINK_1 (LINK_ID=?) (~1 rows)
0 2 5 SEARCH TABLE RDF_NAV_LINK AS f USING COVERING INDEX sqlite_autoindex_RDF_NAV_LINK_1 (LINK_ID=?) (~1 rows)
0 3 4 SEARCH TABLE RDF_ROAD_LINK AS e USING COVERING INDEX NX_RDFROADLINK_LINKID (LINK_ID=?) (~2 rows)
0 4 1 SEARCH TABLE RDF_POI AS B USING AUTOMATIC COVERING INDEX (CAT_ID=?) (~7 rows)
0 5 2 SEARCH TABLE RDF_POI_ADDRESS AS c USING COVERING INDEX sqlite_autoindex_RDF_POI_ADDRESS_1 (POI_ID=? AND LOCATION_ID=?) (~1 rows)
0 0 0 USE TEMP B-TREE FOR DISTINCT

There is an AUTOMATIC index on RDF_POI.CAT_ID.
This means that the database thinks it is worthwhile to create a temporary index just for this query.
You should create this index permanently:
CREATE INDEX whatever ON RDF_POI(CAT_ID);
Furthermore, the CAT_ID lookup does not appear to have a high selectivity.
Run ANALYZE so that the database has a better idea of the shape of your data.
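The effect of the suggested index plus ANALYZE can be sketched on a toy stand-in for RDF_POI (SQLite via Python's sqlite3; the table contents and the index name idx_poi_cat are made up for illustration):

```python
import sqlite3

# Toy stand-in for RDF_POI: 1,000 rows, 1% of them in category 5800
# (the real table's contents are only known from the question).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE RDF_POI (POI_ID INTEGER PRIMARY KEY, CAT_ID TEXT)")
con.executemany("INSERT INTO RDF_POI (CAT_ID) VALUES (?)",
                [("5800" if i % 100 == 0 else "1000",) for i in range(1000)])

def plan(sql):
    """Join the detail column of EXPLAIN QUERY PLAN output into one string."""
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT POI_ID FROM RDF_POI WHERE CAT_ID = '5800'"
before = plan(query)  # full table scan: no index on CAT_ID yet

con.execute("CREATE INDEX idx_poi_cat ON RDF_POI (CAT_ID)")  # make it permanent
con.execute("ANALYZE")  # give the planner statistics about the new index
after = plan(query)   # now an index search on idx_poi_cat

print(before)
print(after)
```

The first plan reports a table scan, the second a search using idx_poi_cat, which is exactly the work the AUTOMATIC index was doing on every query execution.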

Related

How to optimize a 2 table query where data can be only discriminated based on both tables?

I have the following 2 tables and data distribution:
drop table if exists line;
drop table if exists header;
create table header (header_id serial primary key, type character);
create table line (line_id serial primary key, header_id serial not null, type character, constraint line_header foreign key (header_id) references header (header_id)) ;
create index inv_type_idx on header (type);
create index line_type_idx on line (type);
insert into header (type) select case when floor(random()*2+1) = 1 then 'A' else 'B' end from generate_series(1,100000);
insert into line (header_id, type) select header_id, case when floor(random()*10000+1) = 1 then (case when type ='A' then 'B' else 'A' end) else type end from header, generate_series(1,5);
header table has 100K rows: 50% of type A and 50% of B
line table has 500K rows:
each header has 5 lines
overall there are 50% of lines of type A and 50% of B
type of a line is the same as its header in 99.99% of the cases, in only 0.01% they are different
Data distribution:
# select h.type header_type, l.type line_type, count(*) from line l inner join header h on l.header_id = h.header_id group by 1,2 order by 1,2;
header_type | line_type | count
-------------+-----------+--------
A | A | 250865
A | B | 25
B | A | 29
B | B | 249081
(4 rows)
I need to get all the lines with type B whose header is A. Even though the total amount is very limited (25 out of 500,000 rows), the plan I obtain (PostgreSQL 10) is the following, which performs a sequential scan on both tables:
explain
select * from line l
inner join header h on l.header_id = h.header_id
where h.type ='A' and l.type='B';
QUERY PLAN
---------------------------------------------------------------------------
Hash Join (cost=2323.29..14632.89 rows=125545 width=19)
Hash Cond: (l.header_id = h.header_id)
-> Seq Scan on line l (cost=0.00..11656.00 rows=248983 width=13)
Filter: (type = 'B'::bpchar)
-> Hash (cost=1693.00..1693.00 rows=50423 width=6)
-> Seq Scan on header h (cost=0.00..1693.00 rows=50423 width=6)
Filter: (type = 'A'::bpchar)
(7 rows)
Is there any way to optimize this kind of query, where data discrimination is very high but only when combining information from more than one table?
Of course, as a workaround I could denormalize by storing header information in the lines, which would make this query much more performant. But if possible, I'd prefer not to do so, because I'd need to maintain this duplicated information.
alter table line add column compound_type char(2);
create index compound_idx on line (compound_type);
update line l
set compound_type = h.type || l.type
from header h
where h.header_id = l.header_id;
# explain select * from line where compound_type = 'BA';
QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using compound_idx on line (cost=0.42..155.58 rows=50 width=13)
Index Cond: (compound_type = 'BA'::bpchar)
(2 rows)
1) You can use a materialized view with a proper index. It can be refreshed in the background. Otherwise it's similar to your composed index on line.
2) You can reverse the search from header to line if you create an index on (line.header_id, line.type) and force a subquery like this:
select header_id
from header h
where type='A' and
exists(select * from line l where l.header_id=h.header_id and l.type='B')
After you get all the headers, run another select for the lines with the corresponding header_ids.
It might be a good idea to include type in one of the header indexes, so that two indexes are all that is needed for the lookup.
Still, it will read ~50K entries in the header index and look up each of them in the second index. In general this is not efficient, but if the indexes fit fully in memory, it might not be so bad.
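The two-step reversal above can be sketched on a scaled-down version of the data (SQLite via Python's sqlite3 rather than PostgreSQL, so the plans will differ, but the two-step result matches the single join):

```python
import sqlite3

# Scaled-down header/line tables: 1,000 headers x 5 lines, with a line type
# that disagrees with its header's type for a deterministic handful of rows.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE header (header_id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE line (line_id INTEGER PRIMARY KEY, header_id INTEGER NOT NULL,
                       type TEXT,
                       FOREIGN KEY (header_id) REFERENCES header (header_id));
""")
for h in range(1, 1001):
    htype = "A" if h % 2 else "B"
    con.execute("INSERT INTO header VALUES (?, ?)", (h, htype))
    for i in range(5):
        flip = (h % 7 == 0 and i == 0)  # the rare disagreeing line
        ltype = ("B" if htype == "A" else "A") if flip else htype
        con.execute("INSERT INTO line (header_id, type) VALUES (?, ?)", (h, ltype))
con.executescript("""
    CREATE INDEX header_type_idx ON header (type);
    CREATE INDEX line_hid_type_idx ON line (header_id, type);
""")

# Step 1: headers of type A that have at least one line of type B
headers = [r[0] for r in con.execute("""
    SELECT header_id FROM header h
    WHERE type = 'A'
      AND EXISTS (SELECT * FROM line l
                  WHERE l.header_id = h.header_id AND l.type = 'B')
""")]
# Step 2: fetch the matching lines for those header_ids
qmarks = ",".join("?" * len(headers))
lines = con.execute(
    f"SELECT * FROM line WHERE header_id IN ({qmarks}) AND type = 'B'",
    headers).fetchall()

# Sanity check against the original single-join query
direct = con.execute("""
    SELECT l.* FROM line l JOIN header h ON l.header_id = h.header_id
    WHERE h.type = 'A' AND l.type = 'B'
""").fetchall()
print(len(lines), sorted(lines) == sorted(direct))
```

The composite (header_id, type) index is what makes the EXISTS probe in step 1 a pure index lookup.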

Do indexes work in NOT IN or <> clause?

I have read that normal indexes in (at least Oracle) databases are basically B-tree structures, and hence store the records relative to an appropriate root node. Records 'less than' the root are stored iteratively in the left portion of the tree, while records 'greater than' the root are stored in the right portion. It is this storage approach that enables faster scans through tree traversal, since depth and breadth are reduced.
However, while creating indexes or performance-tuning a WHERE clause, most guides recommend first prioritizing the columns with equality conditions (IN or = clauses) and only then moving to the columns with inequality conditions (NOT IN, <>). What is the reason for this advice? Should it not be as feasible to determine that a given value does not exist as it is to determine that it exists, using tree traversal?
Do indexes not work with negation?
The issue is locality within the index. If you have two columns, with letters in col1 and numbers in col2, then an index might look like:
Ind col1 col2
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
7 B 2
8 B 3
9 B 3
10 C 2
11 C 3
(ind is the position in the index. The record locator is left out.)
If you are looking for col1 = 'B', then you can find position 5 and then scan the index until position 9. If you are looking for col1 <> 'B', then you need to scan from the start up to the first 'B', then skip ahead and scan again from the first record after the last 'B'. This becomes worse with IN and NOT IN.
An additional factor is that if only a relative handful of records satisfy the equality condition, then almost all records fail it -- and indexes are often not useful when almost all records need to be read. One sometimes-exception to this is clustered indexes.
Oracle has better index optimizations than most databases -- it will do multiple scans starting in different locations. Even so, an inequality is often much less useful for an index.
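The equality-vs-inequality contrast can be seen directly in SQLite's planner (a sketch in Python's sqlite3, on a made-up table t; Oracle's optimizer is more sophisticated, as noted above, but the contrast is the same):

```python
import sqlite3

# A table with 10 letter values in col1, 100 rows each, indexed on col1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(c, n) for c in "ABCDEFGHIJ" for n in range(100)])
con.execute("CREATE INDEX t_col1 ON t (col1)")

def plan(where):
    """Join the detail column of EXPLAIN QUERY PLAN output into one string."""
    rows = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE " + where)
    return " ".join(r[-1] for r in rows)

eq_plan = plan("col1 = 'B'")    # equality: one contiguous index range
neq_plan = plan("col1 <> 'B'")  # inequality: SQLite falls back to a scan
print(eq_plan)
print(neq_plan)
```

The equality predicate produces a SEARCH using t_col1; the inequality produces a plain SCAN, because the matching rows are no longer one contiguous range of the index.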

Performance of MERGE vs. UPDATE with subquery

Note that I've modified table/field names etc. for readability. Some of the original names are quite confusing.
I have three different tables:
Retailer (Id+Code is a unique key)
- Id
- Code
- LastReturnDate
- ...
Delivery/DeliveryHistory (combination of Date+RetailerId is unique)
- Date
- RetailerId
- HasReturns
- ...
Delivery and DeliveryHistory are almost identical. Data is periodically moved to the history table, and there's no surefire way to know when this last happened. In general, the Delivery-table is quite small -- usually less than 100,000 rows -- while the history table will typically have millions of rows.
My task is to update the LastReturnDate field for each retailer based on the current highest date value for which HasReturns is true in Delivery or DeliveryHistory.
Previously this has been solved with a view defined as follows:
SELECT Id, Code, MAX(Date) Date
FROM Delivery
WHERE HasReturns = 1
GROUP BY Id, Code
UNION
SELECT Id, Code, MAX(Date) Date
FROM DeliveryHistory
WHERE HasReturns = 1
GROUP BY Id, Code
And the following UPDATE statement:
UPDATE Retailer SET LastReturnDate = (
SELECT MAX(Date) FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHERE Code = :Code AND EXISTS (
SELECT * FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code
HAVING
MAX(Date) > LastReturnDate OR
(LastReturnDate IS NULL AND MAX(Date) IS NOT NULL))
The EXISTS clause guards against updating fields where the current value is greater than the new one, but this is actually not a significant concern, because it's hard to see how that could ever happen during normal program execution. Note also that the MAX(Date) IS NOT NULL part is in fact superfluous, since Date can never be null in DeliveryView. Yet the EXISTS clause appears to actually improve performance slightly.
However, the performance of the UPDATE has recently been horrendous. In a database where the Retailer table contains only 1,000-2,000 relevant entries, the UPDATE has been taking more than five minutes to run. It does this even if I remove the entire EXISTS clause, i.e. with this very simple statement:
UPDATE Retailer SET LastReturnDate = (
SELECT MAX(Date) FROM DeliveryView
WHERE Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHERE Code = :Code
I've therefore been looking into a better solution. My first idea was to create a temporary table, but after a while I tried to write it as a MERGE statement:
MERGE INTO Retailer
USING (SELECT Id, Code, MAX(Date) Date FROM DeliveryView GROUP BY Id, Code)
ON (Retailer.Id = DeliveryView.Id AND Retailer.Code = DeliveryView.Code)
WHEN MATCHED THEN
UPDATE SET LastReturnDate = Date WHERE Code = :Code
This seems to work, and it's more than an order of magnitude faster than the UPDATE.
I have three questions:
Can I be certain that this will have the same effect as the UPDATE in all cases (disregarding the edge case of LastReturnDate already being larger than MAX(Date))?
Why is it so much faster?
Is there some better solution?
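Regarding the speed difference: the correlated UPDATE re-evaluates the view's aggregate once per retailer row, while the MERGE aggregates the view once and joins the result to Retailer. SQLite has no MERGE, but the same "aggregate once, then join" shape can be sketched with a temporary table (Python's sqlite3; table and column names follow the question but are simplified to join on Id only, and the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Retailer (Id INTEGER, Code TEXT, LastReturnDate TEXT);
    CREATE TABLE Delivery (Date TEXT, RetailerId INTEGER, HasReturns INTEGER);
    CREATE TABLE DeliveryHistory (Date TEXT, RetailerId INTEGER, HasReturns INTEGER);
    INSERT INTO Retailer VALUES (1, 'X', NULL), (2, 'X', '2020-01-01');
    INSERT INTO Delivery VALUES ('2021-05-01', 1, 1), ('2021-06-01', 2, 0);
    INSERT INTO DeliveryHistory VALUES ('2020-03-01', 1, 1), ('2021-04-01', 2, 1);
""")

# Aggregate the union once into a keyed temp table, then update each retailer
# with a single indexed lookup -- the same shape of work a MERGE does.
con.executescript("""
    CREATE TEMP TABLE LastReturns AS
        SELECT RetailerId, MAX(Date) AS Date
        FROM (SELECT Date, RetailerId FROM Delivery WHERE HasReturns = 1
              UNION ALL
              SELECT Date, RetailerId FROM DeliveryHistory WHERE HasReturns = 1)
        GROUP BY RetailerId;
    CREATE INDEX temp.lr_idx ON LastReturns (RetailerId);
    UPDATE Retailer
    SET LastReturnDate = (SELECT Date FROM LastReturns lr
                          WHERE lr.RetailerId = Retailer.Id)
    WHERE EXISTS (SELECT 1 FROM LastReturns lr
                  WHERE lr.RetailerId = Retailer.Id);
""")
result = con.execute(
    "SELECT Id, LastReturnDate FROM Retailer ORDER BY Id").fetchall()
print(result)  # → [(1, '2021-05-01'), (2, '2021-04-01')]
```

Retailer 2 gets its date from the history table because its only Delivery row has HasReturns = 0, mirroring the UNION in DeliveryView.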
Query plans
MERGE plan
Cost: 25,831, Bytes: 1,143,828
Plain language
Every row in the table SCHEMA.Delivery is read.
The rows were sorted in order to be grouped.
Every row in the table SCHEMA.DeliveryHistory is read.
The rows were sorted in order to be grouped.
Return all rows from steps 2, 4 - including duplicate rows.
The rows from step 5 were sorted to eliminate duplicate rows.
A view definition was processed, either from a stored view SCHEMA.DeliveryView or as defined by steps 6.
The rows were sorted in order to be grouped.
A view definition was processed, either from a stored view SCHEMA. or as defined by steps 8.
Every row in the table SCHEMA.Retailer is read.
The result sets from steps 9, 10 were joined (hash).
A view definition was processed, either from a stored view SCHEMA. or as defined by steps 11.
Rows were merged.
Rows were remotely merged.
Technical
Plan Cardinality Distribution
14 MERGE STATEMENT REMOTE ALL_ROWS
Cost: 25 831 Bytes: 1 143 828 3 738
13 MERGE SCHEMA.Retailer ORCL
12 VIEW SCHEMA.
11 HASH JOIN
Cost: 25 831 Bytes: 1 192 422 3 738
9 VIEW SCHEMA.
Cost: 25 803 Bytes: 194 350 7 475
8 SORT GROUP BY
Cost: 25 803 Bytes: 194 350 7 475
7 VIEW VIEW SCHEMA.DeliveryView ORCL
Cost: 25 802 Bytes: 194 350 7 475
6 SORT UNIQUE
Cost: 25 802 Bytes: 134 550 7 475
5 UNION-ALL
2 SORT GROUP BY
Cost: 97 Bytes: 25 362 1 409
1 TABLE ACCESS FULL TABLE SCHEMA.Delivery [Analyzed] ORCL
Cost: 94 Bytes: 210 654 11 703
4 SORT GROUP BY
Cost: 25 705 Bytes: 109 188 6 066
3 TABLE ACCESS FULL TABLE SCHEMA.DeliveryHistory [Analyzed] ORCL
Cost: 16 827 Bytes: 39 333 636 2 185 202
10 TABLE ACCESS FULL TABLE SCHEMA.Retailer [Analyzed] ORCL
Cost: 27 Bytes: 653 390 2 230
UPDATE plan
Cost: 101,492, Bytes: 272,060
Plain language
Every row in the table SCHEMA.Retailer is read.
One or more rows were retrieved using index SCHEMA.DeliveryHasReturns . The index was scanned in ascending order.
Rows from table SCHEMA.Delivery were accessed using rowid got from an index.
The rows were sorted in order to be grouped.
One or more rows were retrieved using index SCHEMA.DeliveryHistoryHasReturns . The index was scanned in ascending order.
Rows from table SCHEMA.DeliveryHistory were accessed using rowid got from an index.
The rows were sorted in order to be grouped.
Return all rows from steps 4, 7 - including duplicate rows.
The rows from step 8 were sorted to eliminate duplicate rows.
A view definition was processed, either from a stored view SCHEMA.DeliveryView or as defined by steps 9.
The rows were sorted in order to be grouped.
A view definition was processed, either from a stored view SCHEMA. or as defined by steps 11.
Rows were updated.
Rows were remotely updated.
Technical
Plan Cardinality Distribution
14 UPDATE STATEMENT REMOTE ALL_ROWS
Cost: 101 492 Bytes: 272 060 1 115
13 UPDATE SCHEMA.Retailer ORCL
1 TABLE ACCESS FULL TABLE SCHEMA.Retailer [Analyzed] ORCL
Cost: 27 Bytes: 272 060 1 115
12 VIEW SCHEMA.
Cost: 90 Bytes: 52 2
11 SORT GROUP BY
Cost: 90 Bytes: 52 2
10 VIEW VIEW SCHEMA.DeliveryView ORCL
Cost: 90 Bytes: 52 2
9 SORT UNIQUE
Cost: 90 Bytes: 36 2
8 UNION-ALL
4 SORT GROUP BY
Cost: 15 Bytes: 18 1
3 TABLE ACCESS BY INDEX ROWID TABLE SCHEMA.Delivery [Analyzed] ORCL
Cost: 14 Bytes: 108 6
2 INDEX RANGE SCAN INDEX SCHEMA.DeliveryHasReturns [Analyzed] ORCL
Cost: 2 12
7 SORT GROUP BY
Cost: 75 Bytes: 18 1
6 TABLE ACCESS BY INDEX ROWID TABLE SCHEMA.DeliveryHistory [Analyzed] ORCL
Cost: 74 Bytes: 4 590 255
5 INDEX RANGE SCAN INDEX SCHEMA.DeliveryHistoryHasReturns [Analyzed] ORCL
Cost: 6 509

How can I speed up queries that are looking for the root node of a transitive closure?

I have a historical transitive closure table that represents a tree.
create table TRANSITIVE_CLOSURE
(
CHILD_NODE_ID number not null enable,
ANCESTOR_NODE_ID number not null enable,
DISTANCE number not null enable,
FROM_DATE date not null enable,
TO_DATE date not null enable,
constraint TRANSITIVE_CLOSURE_PK unique (CHILD_NODE_ID, ANCESTOR_NODE_ID, DISTANCE, FROM_DATE, TO_DATE)
);
Here's some sample data:
CHILD_NODE_ID | ANCESTOR_NODE_ID | DISTANCE
--------------------------------------------
1 | 1 | 0
2 | 1 | 1
2 | 2 | 0
3 | 1 | 2
3 | 2 | 1
3 | 3 | 0
Unfortunately, my current query for finding the root node causes a full table scan:
select *
from transitive_closure tc
where
distance = 0
and not exists (
select null
from transitive_closure tci
where tc.child_node_id = tci.child_node_id
and tci.distance <> 0
);
On the surface, it doesn't look too expensive, but as I approach 1 million rows, this particular query is starting to get nasty... especially when it's part of a view that grabs the adjacency tree for legacy support.
Is there a better way to find the root node of a transitive closure? I would like to rewrite all of our old legacy code, but I can't... so I need to build the adjacency list somehow. Getting everything except the root node is easy, so is there a better way? Am I thinking about this problem the wrong way?
Query plan on a table with 800k rows.
OPERATION OBJECT_NAME OPTIONS COST
SELECT STATEMENT 2301
HASH JOIN RIGHT ANTI 2301
Access Predicates
TC.CHILD_NODE_ID=TCI.CHILD_NODE_ID
TABLE ACCESS TRANSITIVE_CLOSURE FULL 961
Filter Predicates
TCI.DISTANCE = 1
TABLE ACCESS TRANSITIVE_CLOSURE FULL 962
Filter Predicates
DISTANCE=0
How long does the query take to execute, and how long do you want it to take? (You usually do not want to use the cost for tuning. Very few people know what the explain plan cost really means.)
On my slow desktop the query took only 1.5 seconds for 800K rows, and then 0.5 seconds once the data was in memory. Are you getting something significantly worse, or will this query be run very frequently?
I don't know what your data looks like, but I'd guess that a full table scan will always be best for this query. Assuming that your hierarchical data is relatively shallow, i.e. there are many distances of 0 and 1 but very few distances of 100, the most important column will not be very distinct. This means that any of the index entries for distance will point to a large number of blocks. It will be much cheaper to read the whole table at once using multi-block reads than to read a large amount of it one block at a time.
Also, what do you mean by historical? Can you store the results of this query in a materialized view?
Another possible idea is to use analytic functions. This replaces the second table scan with a sort. This approach is usually faster, but for me this query actually takes longer: 5.5 seconds instead of 1.5. But maybe it will do better in your environment.
select * from
(
select
max(case when distance <> 0 then 1 else 0 end)
over (partition by child_node_id) has_non_zero_distance
,transitive_closure.*
from transitive_closure
)
where distance = 0
and has_non_zero_distance = 0;
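A quick way to check that the analytic rewrite returns the same rows as the original NOT EXISTS query is to run both on the sample data (sketched here in SQLite via Python's sqlite3 rather than Oracle; SQLite supports these window functions from version 3.25):

```python
import sqlite3

# The sample rows from the question: node 1 is the root of the tree 1->2->3.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE transitive_closure
               (child_node_id INTEGER, ancestor_node_id INTEGER,
                distance INTEGER)""")
con.executemany("INSERT INTO transitive_closure VALUES (?, ?, ?)",
                [(1, 1, 0), (2, 1, 1), (2, 2, 0),
                 (3, 1, 2), (3, 2, 1), (3, 3, 0)])

# Original approach: distance 0 and no non-zero distance for the same child
not_exists = con.execute("""
    SELECT * FROM transitive_closure tc
    WHERE distance = 0
      AND NOT EXISTS (SELECT NULL FROM transitive_closure tci
                      WHERE tc.child_node_id = tci.child_node_id
                        AND tci.distance <> 0)
""").fetchall()

# Analytic rewrite: flag each child that has any non-zero distance
analytic = con.execute("""
    SELECT child_node_id, ancestor_node_id, distance FROM (
        SELECT max(CASE WHEN distance <> 0 THEN 1 ELSE 0 END)
                   OVER (PARTITION BY child_node_id) AS has_non_zero_distance,
               transitive_closure.*
        FROM transitive_closure)
    WHERE distance = 0 AND has_non_zero_distance = 0
""").fetchall()

print(not_exists)  # node 1 is the only root
print(analytic == not_exists)
```

Both queries return only the (1, 1, 0) row, the root of the sample tree.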
Can you try adding an index on distance and child_node_id, or changing the order of these columns in the existing unique index? I think it should then be possible for the outer query to access the table via the index on distance, while the inner query needs only access to the index.
Add ONE root node from which all your current root nodes are descended. Then you would simply query the children of your one root. Problem solved.

SQL optimization - execution plan changes based on constraint value - Why?

I've got a table ItemValue full of data on a SQL 2005 Server running in 2000 compatibility mode that looks something like (it's a User-Defined values table):
ID ItemCode FieldID Value
-- ---------- ------- ------
1 abc123 1 D
2 abc123 2 287.23
4 xyz789 1 A
5 xyz789 2 3782.23
6 xyz789 3 23
7 mno456 1 W
9 mno456 3 45
... and so on.
FieldID comes from the ItemField table:
ID FieldNumber DataFormatID Description ...
-- ----------- ------------ -----------
1 1 1 Weight class
2 2 4 Cost
3 3 3 Another made up description
. . x xxx
. . x xxx
. . x xxx
x 91 (we have 91 user-defined fields)
Because I can't PIVOT in 2000 mode, we're stuck building an ugly query using CASEs and GROUP BY to get the data to look how it should for some legacy apps, which is:
ItemNumber Field1 Field2 Field3 .... Field51
---------- ------ ------- ------
abc123 D 287.23 NULL
xyz789 A 3782.23 23
mno456 W NULL 45
You can see we only need this table to show values up to the 51st UDF. Here's the query:
SELECT
iv.ItemNumber
,MAX(CASE WHEN f.FieldNumber = 1 THEN iv.[Value] ELSE NULL END) [Field1]
,MAX(CASE WHEN f.FieldNumber = 2 THEN iv.[Value] ELSE NULL END) [Field2]
,MAX(CASE WHEN f.FieldNumber = 3 THEN iv.[Value] ELSE NULL END) [Field3]
...
,MAX(CASE WHEN f.FieldNumber = 51 THEN iv.[Value] ELSE NULL END) [Field51]
FROM ItemField f
LEFT JOIN ItemValue iv ON f.ID = iv.FieldID
WHERE f.FieldNumber <= 51
GROUP BY iv.ItemNumber
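Scaled down to three fields instead of 51, the CASE/GROUP BY pivot behaves like this (sketched in SQLite via Python's sqlite3 rather than SQL Server 2005, using the sample rows from the question):

```python
import sqlite3

# Miniature ItemField/ItemValue tables with the question's sample data.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE ItemField (ID INTEGER, FieldNumber INTEGER);
    CREATE TABLE ItemValue (ID INTEGER, ItemNumber TEXT,
                            FieldID INTEGER, Value TEXT);
    INSERT INTO ItemField VALUES (1, 1), (2, 2), (3, 3);
    INSERT INTO ItemValue VALUES
        (1, 'abc123', 1, 'D'), (2, 'abc123', 2, '287.23'),
        (4, 'xyz789', 1, 'A'), (5, 'xyz789', 2, '3782.23'), (6, 'xyz789', 3, '23'),
        (7, 'mno456', 1, 'W'), (9, 'mno456', 3, '45');
""")
# MAX over the CASE picks the one non-NULL value per (item, field) pair.
rows = con.execute("""
    SELECT iv.ItemNumber
          ,MAX(CASE WHEN f.FieldNumber = 1 THEN iv.Value END) AS Field1
          ,MAX(CASE WHEN f.FieldNumber = 2 THEN iv.Value END) AS Field2
          ,MAX(CASE WHEN f.FieldNumber = 3 THEN iv.Value END) AS Field3
    FROM ItemField f
    LEFT JOIN ItemValue iv ON f.ID = iv.FieldID
    WHERE f.FieldNumber <= 3
    GROUP BY iv.ItemNumber
    ORDER BY iv.ItemNumber
""").fetchall()
for r in rows:
    print(r)
# ('abc123', 'D', '287.23', None)
# ('mno456', 'W', None, '45')
# ('xyz789', 'A', '3782.23', '23')
```

Missing user-defined values come out as NULL, matching the desired legacy layout.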
When the FieldNumber constraint is <= 51, the execute plan goes something like:
SELECT <== Compute Scalar <== Stream Aggregate <== Sort (Cost: 70%) <== Hash Match <== (Clustered Index Seek && Table Scan)
and it's fast! I can pull back 100,000+ records in about a second, which suits our needs.
However, if we had more UDFs and I change the constraint to anything above 66 (yes, I tested them one by one), or if I remove it completely, I lose the Sort in the execution plan; it gets replaced with a whole bunch of Parallelism blocks that gather, repartition, and distribute streams, and the entire thing is slow (30 seconds for even just 1 record).
FieldNumber has a clustered, unique index and is part of a composite primary key with the ID column (non-clustered index) in the ItemField table. The ItemValue table's ID and ItemNumber columns make up its PK, and there is an extra non-clustered index on the ItemNumber column.
What is the reasoning behind this? Why does changing my simple integer constraint change the entire execution plan?
And if you're up to it... what would you do differently? There's a SQL upgrade planned for a couple months from now but I need to get this problem fixed before that.
SQL Server is smart enough to take CHECK constraints into account when optimizing queries.
Your f.FieldNumber <= 51 is optimized out, and the optimizer sees that the two tables should be joined in full (which is best done with a HASH JOIN).
If you don't have the constraint, the engine needs to check the condition, and most probably uses index traversal to do this. This may be slower.
Could you please post the whole plans for the queries? Just run SET SHOWPLAN_TEXT ON and then run the queries.
Update:
What is the reasoning behind this? Why does changing my simple integer constraint change the entire execution plan?
If by a constraint you mean the WHERE condition, this is probably the other thing.
Set operations (which is what SQL performs) have no single most efficient algorithm: the efficiency of each algorithm depends heavily on the data distribution in the sets.
Say, for taking a subset (which is what the WHERE clause does), you can either find the range of records in the index and use the index record pointers to locate the data rows in the table, or just scan all records in the table and filter them with the WHERE condition.
The efficiency of the former operation is m × const, that of the latter is n, where m is the number of records satisfying the condition, n is the total number of records in the table, and const > 1.
This means that for larger values of m the fullscan is more efficient.
SQL Server is aware of that and changes execution plans accordingly to the constants that affect the data distribution in the set operations.
To do this, SQL Server maintains statistics: aggregated histograms of the data distribution in each indexed column, which it uses to build the query plans.
So changing the integer in the WHERE condition in fact affects the size and the data distribution of the underlying sets, and makes SQL Server reconsider the algorithms best fitted to sets of that size and layout.
it gets replaced with a whole bunch of Parallelism blocks
Try this:
SELECT
iv.ItemNumber
,MAX(CASE WHEN f.FieldNumber = 1 THEN iv.[Value] ELSE NULL END) [Field1]
,MAX(CASE WHEN f.FieldNumber = 2 THEN iv.[Value] ELSE NULL END) [Field2]
,MAX(CASE WHEN f.FieldNumber = 3 THEN iv.[Value] ELSE NULL END) [Field3]
...
,MAX(CASE WHEN f.FieldNumber = 51 THEN iv.[Value] ELSE NULL END) [Field51]
FROM ItemField f
LEFT JOIN ItemValue iv ON f.ID = iv.FieldID
WHERE f.FieldNumber <= 51
GROUP BY iv.ItemNumber
OPTION (Maxdop 1)
Using OPTION (MAXDOP 1) should prevent the parallelism in the execution plan.
At 66 you are hitting some internal cost-estimate threshold that decides it is better to use one plan vs. the other. What that threshold is and why it happens is not really important. Note that your queries differ with each FieldNumber value: you are not only changing the WHERE, you also change the pseudo-'pivot' projected fields.
Now I don't know all the details of your tables, your queries, and your insert/update/delete patterns, but for the particular query you posted, the proper clustered index structure for the ItemValue table is this:
CREATE CLUSTERED INDEX [cdxItemValue] ON ItemValue (FieldID, ItemNumber);
This structure eliminates the need for an intermediate sort of the results for this 'pivot' query.