I have a database table (sqlite) containing items that form a tree hierarchy. Each item has an id field (for itself) and a parentId for its parent. Now given an item, I must retrieve the whole chain from the root to the item.
Basically the algorithm in pseudocode looks like:
1. cursor is item
2. retrieve parentItem for cursor by parentId
3. if parentItem is not rootItem, then cursor = parentItem and goto 2.
So I have to perform an SQL SELECT query for each item.
Is it possible to retrieve the whole chain rootItem -> ... -> item by performing only one SQL query?
There are lots of creative ways of organizing hierarchical data in a database, but I consistently find it easiest to bring back the data in non-hierarchical format, then match up parent and child records programmatically.
Total amount of effort: 1 query + 1 programmatic pass through your dataset to create the hierarchy.
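A minimal sketch of the "1 query + 1 programmatic pass" idea, using Python's built-in sqlite3 module. The table name items and the sample rows are made up for illustration, and the code assumes a root is marked by a NULL parentId and that the data contains no cycles:

```python
import sqlite3

def chain_to_root(conn, item_id):
    """Return the ids from the root down to item_id using a single SELECT."""
    # One query: load every (id, parentId) pair into a dict.
    parent_of = dict(conn.execute("SELECT id, parentId FROM items"))
    chain = []
    cursor = item_id
    while cursor is not None:        # a NULL parentId marks a root
        chain.append(cursor)
        cursor = parent_of[cursor]
    chain.reverse()                  # collected item -> root, return root -> item
    return chain

# Hypothetical sample data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, parentId INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, None), (2, 1), (3, None), (4, 2), (5, 4), (7, 5)])
print(chain_to_root(conn, 7))
```

This loads the whole table, so it only makes sense while the table is small enough to hold in memory.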
Alternative approach:
I've used this method in the past with limited success. You can store the path of each item in your tree using a varchar(max) column as follows:
ID  ParentID  Path
--  --------  ----
1   null      1/
2   1         1/2/
3   null      3/
4   2         1/2/4/
5   4         1/2/4/5/
6   null      6/
7   5         1/2/4/5/7/
9   5         1/2/4/5/9/
From that point, getting all of the nodes under ID = 5 is very simple:
SELECT *
FROM table
WHERE Path LIKE (SELECT Path FROM table WHERE ID = 5) + '%'
(The + string concatenation is SQL Server syntax; in SQLite use || instead.)
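Here is a rough sketch of the same path-column idea on SQLite, run through Python's sqlite3 module. The table is called tree here (an assumed name, since TABLE is a reserved word), and the rows are the ones from the listing above. The second query is an assumption on my part about how to turn the trick around to get the root-to-item chain asked about in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (ID INTEGER PRIMARY KEY, ParentID INTEGER, Path TEXT)")
conn.executemany("INSERT INTO tree VALUES (?, ?, ?)", [
    (1, None, "1/"), (2, 1, "1/2/"), (3, None, "3/"), (4, 2, "1/2/4/"),
    (5, 4, "1/2/4/5/"), (6, None, "6/"), (7, 5, "1/2/4/5/7/"), (9, 5, "1/2/4/5/9/"),
])

# Node 5 and everything under it: rows whose Path starts with 5's Path.
descendants = [r[0] for r in conn.execute(
    "SELECT ID FROM tree "
    "WHERE Path LIKE (SELECT Path FROM tree WHERE ID = ?) || '%' "
    "ORDER BY Path", (5,))]

# Root-to-item chain for node 7: rows whose Path is a prefix of 7's Path.
ancestors = [r[0] for r in conn.execute(
    "SELECT ID FROM tree "
    "WHERE (SELECT Path FROM tree WHERE ID = ?) LIKE Path || '%' "
    "ORDER BY length(Path)", (7,))]

print(descendants, ancestors)
```

The trailing slash on every path matters: without it, a path such as 1 would also look like a prefix of 11.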
Not with plain joins in ANSI standard SQL it isn't, no (recursive queries aside, where your database supports them). Well, that's not strictly true. You can do left outer joins and put in enough to cover the likely maximum depth, but unless you restrict the max depth and include that many joins, it won't always work.
If your set of rows is sufficiently small (say less than 1000), just retrieve them all and then figure it out. It'll be faster than single read traversals in all likelihood.
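For what it's worth, recursive common table expressions were added to the SQL standard in SQL:1999, and SQLite supports them from version 3.8.3, so where they are available the whole chain can come back from one query. A sketch under those assumptions, with a made-up items table and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, parentId INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, None), (2, 1), (3, None), (4, 2), (5, 4), (7, 5)])

# Start at the item, then repeatedly join to the parent of the last row found.
# depth counts steps away from the item, so the root has the largest depth.
QUERY = """
WITH RECURSIVE chain(id, parentId, depth) AS (
    SELECT id, parentId, 0 FROM items WHERE id = ?
    UNION ALL
    SELECT i.id, i.parentId, c.depth + 1
    FROM items AS i JOIN chain AS c ON i.id = c.parentId
)
SELECT id FROM chain ORDER BY depth DESC
"""

root_to_item = [row[0] for row in conn.execute(QUERY, (7,))]
print(root_to_item)
```

On an older SQLite without WITH RECURSIVE, the join-based and fetch-everything approaches are still the fallback.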
You could batch the parent traversal. Have a query like:
SELECT t1.id id1, t1.parent parent1,
t2.id id2, t2.parent parent2,
t3.id id3, t3.parent parent3,
t4.id id4, t4.parent parent4,
t5.id id5, t5.parent parent5
FROM mytable t1
LEFT OUTER JOIN mytable t2 ON t1.parent = t2.id
LEFT OUTER JOIN mytable t3 ON t2.parent = t3.id
LEFT OUTER JOIN mytable t4 ON t3.parent = t4.id
LEFT OUTER JOIN mytable t5 ON t4.parent = t5.id
WHERE t1.id = 1234
and extend it to whatever number you want. If the last retrieved parent isn't null you aren't at the top of the tree yet so run the query again. This way you should hopefully reduce it to 1-2 roundtrips.
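The batching loop described above might look roughly like this in Python with sqlite3. It is only a sketch: it uses three levels per round trip instead of five to keep it short, the helper name ancestors_batched is invented, and the sample rows are hypothetical.

```python
import sqlite3

# Three levels per round trip; extend with more joins as needed.
BATCH_QUERY = """
SELECT t1.id, t1.parent, t2.id, t2.parent, t3.id, t3.parent
FROM mytable t1
LEFT OUTER JOIN mytable t2 ON t1.parent = t2.id
LEFT OUTER JOIN mytable t3 ON t2.parent = t3.id
WHERE t1.id = ?
"""

def ancestors_batched(conn, item_id):
    """Return (root-to-item chain, number of queries issued)."""
    chain, round_trips = [], 0
    next_id = item_id
    while next_id is not None:
        row = conn.execute(BATCH_QUERY, (next_id,)).fetchone()
        round_trips += 1
        if row is None:              # unknown id: nothing more to walk
            break
        for i in range(0, len(row), 2):
            node_id, parent = row[i], row[i + 1]
            if node_id is None:      # ran past the root inside this batch
                next_id = None
                break
            chain.append(node_id)
            next_id = parent         # where the next batch would start
    chain.reverse()
    return chain, round_trips

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, parent INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [(1, None), (2, 1), (4, 2), (5, 4), (7, 5)])
print(ancestors_batched(conn, 7))
```

With the five-node chain in the sample data, the item at the bottom needs two queries instead of five.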
Other than that you could look at ways of encoding that data in the ID. This isn't recommended but if you limit, say, each node to having 100 children you could say that node with an ID 10030711 has path of 10 -> 03 -> 07 -> 11. That of course has other problems (like max ID length) and of course it's hacky.
It's also worth noting that there are two basic models for hierarchical data in SQL: adjacency lists and nested sets. Your way (which is pretty common) is an adjacency list. Nested sets wouldn't really help with this situation though, and they are complicated to do inserts on.
Are you able to change the table structure? It looks like storing left and right nodes would be easier to work with than storing just a parent, because then a single select is possible. See the following links:
http://www.mail-archive.com/sqlite-users@sqlite.org/msg23867.html
http://weblogs.asp.net/aghausman/archive/2009/03/16/storing-retrieving-hierarchical-data-in-sql-server-database.aspx (this is SQLServer, but they have a diagram that might help.)
From a table with column structure (parent, child) I need:
1. For a particular parent I need all children.
2. From the result of (1) I need the children's children too.
For example for parent=1:
parent|child    parent|child    parent|child
  1      a        a      d        b      f
  1      b        a      e        b      g
This gets you the information you say you want, I think. Two columns: child and grandchild (if any, or else NULL). Not sure if it's the schema you'd like, since you don't specify. You may add JOINs to increase the recursion depth.
select t1.child, t2.child
from T as t1 left join T as t2
on t1.child = t2.parent
where t1.parent = 1
This works on SQLite; I think it's quite standard. Regarding schema if this one doesn't serve you, hopefully it may give you ideas; or else please specify more.
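To check the query above, here is a small sketch using Python's sqlite3 module. The rows follow the example in the question, plus one extra child c with no children of its own (my addition) to show the NULL case. The ORDER BY is also added only to make the output order predictable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (parent TEXT, child TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [
    ("1", "a"), ("1", "b"), ("1", "c"),   # children of parent 1 ('c' is a leaf)
    ("a", "d"), ("a", "e"),               # children of a
    ("b", "f"), ("b", "g"),               # children of b
])

# Each child of the given parent, paired with each of its own children.
# The LEFT JOIN keeps children that have no children (grandchild is NULL).
rows = conn.execute("""
    SELECT t1.child, t2.child
    FROM T AS t1 LEFT JOIN T AS t2 ON t1.child = t2.parent
    WHERE t1.parent = ?
    ORDER BY t1.child, t2.child
""", ("1",)).fetchall()

for child, grandchild in rows:
    print(child, grandchild)
```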
I have written the following query in order to achieve the following:
1) Select all regulatory languages that do not have a specified ID.
2) Link those regulatory languages based on a hierarchy field (RL_ID_DEFINED - this field is the ID of the parent regulatory language).
My first variation used NOT IN, but after looking into it I decided that NOT EXISTS would be a more efficient approach. Additionally, I was thinking that adding a WITH clause might make it run a bit faster, since in my current code it is running the nested SELECT statement for each ID in the iteration. Would it be possible to rewrite with using a WITH clause for that nested SELECT?
SELECT
T1.ID
FROM
REGULATORY_LANGUAGES T1
WHERE
T1.INACTIVE_DATE IS NULL
AND NOT EXISTS (
SELECT
NULL
FROM
REGULATORY_LANGUAGES T2,
REVIEW_REGULATIONS T3
WHERE
T3.RVWTYPYR_ID = ?
AND T3.RL_ID = T2.ID
AND T1.ID = T2.ID)
START WITH
RL_ID_DEFINED IS NULL
AND INACTIVE_DATE IS NULL
CONNECT BY
PRIOR ID = RL_ID_DEFINED
The problem I'm running into is that when I look at the structure of a WITH clause, I would be creating it prior to my main SELECT. However, that would require me to have defined my T1 table already. Any thoughts?
(Note - this is being called in a java method, hence the ? in the line T3.RVWTYPYR_ID = ?. When I test this in the database editor via Toad, I just hard code a value for the ?).
While speed is important, so is accuracy. You mentioned that you switched from NOT IN to NOT EXISTS for efficiency. They do different things: if the subquery returns a NULL, NOT IN matches no rows at all, whereas NOT EXISTS is unaffected. There is another way to speed up the logic of NOT IN. Instead of this:
where someField not in
(select someField
from etc
)
Do this
where someField in
(select someField
from etc
where whatever
minus
select someField
from etc
where whatever
and more filters that identify records to exclude
)
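To illustrate why NOT IN and NOT EXISTS are not interchangeable, here is a small sketch using Python's sqlite3 module. The tables a and b and their rows are invented for the demonstration, and since SQLite has no MINUS, the third query uses EXCEPT, which plays the same role:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (x INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO b VALUES (?)", [(2,), (None,)])  # note the NULL

def ids(sql):
    return [row[0] for row in conn.execute(sql)]

# NOT IN: the NULL in b makes every comparison "unknown", so nothing qualifies.
not_in = ids("SELECT x FROM a WHERE x NOT IN (SELECT x FROM b) ORDER BY x")

# NOT EXISTS: only asks whether a matching row exists, so the NULL is harmless.
not_exists = ids("SELECT x FROM a WHERE NOT EXISTS "
                 "(SELECT 1 FROM b WHERE b.x = a.x) ORDER BY x")

# IN plus a set difference, the rewrite suggested above (EXCEPT instead of MINUS).
in_except = ids("SELECT x FROM a WHERE x IN "
                "(SELECT x FROM a EXCEPT SELECT x FROM b) ORDER BY x")

print(not_in, not_exists, in_except)
```

If the column in the subquery can never be NULL, the three forms return the same rows.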
Now for the with keyword. It speeds up performance when you want to run the exact same subquery more than once. So, instead of this:
where field1 in
(sql for subquery)
and field2 in
(exact same sql as above)
you do this:
with temp as (sql for subquery)
select etc
where field1 in
(select something from temp)
and field2 in
(select something from temp)
However, that's not your situation. What you probably want to do is to investigate ways to send a list of parameters from java so that your query looks like this:
T3.RVWTYPYR_ID in (?,?,etc)
Then you wouldn't have to repeat the subquery.
Much thanks to Tom H for his insight. I've rewritten the query using JOIN:
SELECT
T1.ID
FROM
REGULATORY_LANGUAGES T1
LEFT JOIN (
SELECT
T2.ID ID
FROM
REGULATORY_LANGUAGES T2
INNER JOIN
REVIEW_REGULATIONS T3
ON
T3.RVWTYPYR_ID = ?
AND T3.RL_ID = T2.ID) T_JOIN
ON T1.ID = T_JOIN.ID
WHERE
T1.INACTIVE_DATE IS NULL
AND T_JOIN.ID IS NULL
START WITH
T1.RL_ID_DEFINED IS NULL
AND T1.INACTIVE_DATE IS NULL
CONNECT BY
PRIOR T1.ID = T1.RL_ID_DEFINED
My query runs slowly when the result set is empty. When there is something to return, it is lightning fast.
;with tree(NodeId,CategoryId,ParentId) as (
select ct.NodeId, ct.CategoryId, ct.ParentId
from dbo.CategoryTree as ct
where ct.ParentId = 6
union all
select t.NodeId, t.CategoryId, t.ParentId from dbo.CategoryTree as t
inner join tree as t2 on t.ParentId = t2.NodeId
), branch(NodeId,CategoryId,ParentId) as
(
select NodeId, CategoryId, ParentId from dbo.CategoryTree as t
where t.NodeId = 6
union all
select NodeId, CategoryId, ParentId
from tree as t
),facil(FacilityId) as(
select distinct fct.FacilityId
from dbo.FacilitiesCategoryTree as fct
inner join branch b on b.NodeId = fct.CategoryNodeId
)
select top 51 f.Id, f.CityId, f.NameGEO,
f.NameENG, f.NameRUS, f.DescrGEO, f.DescrENG,
f.DescrRUS, f.MoneyMin, f.MoneyAvg, f.Lat, f.Lng, f.SortIndex,
f.FrontImgUrl from dbo.Facilities f
inner join facil t2 on t2.FacilityId = f.Id
and f.EnabledUntil > 'Jan 14 2015 10:23PM'
order by f.SortIndex
Principal tables are:
Facilities table holds facilities, 256k records.
CategoryTree is used to group categories in a hierarchy.
NodeId int,
CategoryId int,
ParentId int
FacilitiesCategoryTree is used to link CategoryTree to Facilities.
Given NodeId, the second CTE returns all the nodes that are descendant of the given node including itself. Then there is a third CTE that returns facility ids that belong to these nodes.
Finally, the last CTE is joined to actual facilities table. The result is ordered by SortIndex which is used to manually indicate the order of facilities.
This query runs very fast when there is something to return even if I include many more predicates including full-text search and others, but when the given branch does not have any facilities, this query takes approx. 2 seconds to run.
If I exclude the order by clause, the query runs very fast again. All these tables are indexed and the query optimizer does not suggest any improvements.
What do you think is the problem and what can be done to improve the performance of queries with empty results?
Thank you.
Update1:
I am adding execution plans.
http://www.filedropper.com/withorderby
http://www.filedropper.com/withoutorderby
Update2:
I went through the recommendations of oryol and tried to save facility IDs from tree to the table variable and join it with facilities table and order by SortIndex. It eliminated the problem with empty results, but increased the execution time of queries with a result set from 250ms to 950ms.
I also changed the query to select from facil and join to the Facilities and added option (force order). The result was the same as above.
Finally, I denormalized facility/category mapping table to include SortIndex in this table. It increased the execution time of normal queries slightly from 250ms to 300ms, but it resolved the empty result set problem. I guess, I’ll stick to this method.
The first thing - you can slightly simplify the first two CTEs to just one:
with tree(NodeId,CategoryId,ParentId) as (
select ct.NodeId, ct.CategoryId, ct.ParentId
from dbo.CategoryTree as ct
where ct.NodeId = 6
union all
select t.NodeId, t.CategoryId, t.ParentId from dbo.CategoryTree as t
inner join tree as t2 on t.ParentId = t2.NodeId
)
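As a sanity check that the single CTE returns the same nodes as the original tree + branch pair, here is a sketch run on SQLite through Python's sqlite3 module. The dbo. schema prefix is dropped and RECURSIVE is spelled out because this is SQLite rather than SQL Server, and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CategoryTree (NodeId INTEGER, CategoryId INTEGER, ParentId INTEGER)")
conn.executemany("INSERT INTO CategoryTree VALUES (?, ?, ?)",
                 [(6, 60, None), (7, 70, 6), (8, 80, 7), (9, 90, None)])

# Original shape: descendants in one CTE, then the node itself added in a second.
TWO_CTES = """
WITH RECURSIVE tree(NodeId, CategoryId, ParentId) AS (
    SELECT NodeId, CategoryId, ParentId FROM CategoryTree WHERE ParentId = ?
    UNION ALL
    SELECT t.NodeId, t.CategoryId, t.ParentId
    FROM CategoryTree AS t JOIN tree AS t2 ON t.ParentId = t2.NodeId
), branch(NodeId, CategoryId, ParentId) AS (
    SELECT NodeId, CategoryId, ParentId FROM CategoryTree WHERE NodeId = ?
    UNION ALL
    SELECT NodeId, CategoryId, ParentId FROM tree
)
SELECT NodeId FROM branch ORDER BY NodeId
"""

# Simplified shape: anchor on the node itself, so it is included from the start.
ONE_CTE = """
WITH RECURSIVE tree(NodeId, CategoryId, ParentId) AS (
    SELECT NodeId, CategoryId, ParentId FROM CategoryTree WHERE NodeId = ?
    UNION ALL
    SELECT t.NodeId, t.CategoryId, t.ParentId
    FROM CategoryTree AS t JOIN tree AS t2 ON t.ParentId = t2.NodeId
)
SELECT NodeId FROM tree ORDER BY NodeId
"""

two = [r[0] for r in conn.execute(TWO_CTES, (6, 6))]
one = [r[0] for r in conn.execute(ONE_CTE, (6,))]
print(two, one)
```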
The main problem is that the optimizer doesn't know, or incorrectly estimates, the number of facilities that will be returned for your categories. And because you need the facilities ordered by SortIndex, the optimizer decides to:
Go through all facilities ordered by SortIndex (using the appropriate index)
Skip rows that are excluded by the other filters (EnabledUntil)
Using the given facility Id, look for a matching row among the facilities from the category tree. If one exists, return the result row. If not, skip this facility.
Repeat these iterations until 51 rows have been returned
So, in the worst case (if there are fewer than 51 such facilities, or they have a very large SortIndex), it requires a scan of the whole idx_Facilities_SortIndex index, which takes a lot of time.
There are several ways to resolve this issue (including hints that tell the optimizer about row counts or join order); to find the best one, it's better to work with the real database. The first option to try is to change the query to:
Save facility IDs from tree to the table variable
Join it with facilities table and order by SortIndex
Another option (which can also be combined with the first one) is to try the FORCE ORDER query hint. In that case you will need to modify your select statement to select from facil and join it to Facilities, and add the option (force order) query hint to the end of the statement.
The query without order by selects all facilities from the tree, and only after that fetches the other facility fields from the facilities table.
Also, it's important to know the actual number of facilities in the tree (according to the estimates in the execution plan without order by it's really big: 395982). Is this estimate (more or less) correct?
If you really do get a large number of facilities back after joining with the category tree and the facility/category mapping table, then the best solution will be to denormalize the facility/category mapping table to include SortIndex, and add an index on this table by NodeId and SortIndex.
So really, we need to test the queries and indexes with real data, or at least know some statistics about the data:
Number of categories
Number of facilities per category and total number of rows in facilities / categories mapping table
SortIndex distribution (is it unique?)
etc.
I have five results to retrieve from a table and I want to write a stored procedure that will return all the desired rows.
For now I can write the query like this:
Select * from Table where Id = 1 OR Id = 2 or Id = 3
I suppose I need to receive a list of Ids and split it, but how do I write the WHERE clause?
So, if you're just trying to learn SQL, this is a short and good example to get to know the IN operator. The following query has the same result as your attempt.
SELECT *
FROM TABLE
WHERE ID IN (SELECT ID FROM TABLE2)
This is the closest equivalent to your attempt, and judging by your attempt, it might be the simplest version for you to understand. In the future, though, I would recommend using a JOIN.
A JOIN can produce the same result as the previous code and is often the better alternative. If you are curious to read more about JOINs, here are a few links from the most important sources
Joins - wikipedia
and also a visual representation of how different types of JOIN work
Another way to do it. The inner join will only include rows from T1 that match up with a row from T2 via the Id field.
select T1.* from T1 inner join T2 on T1.Id = T2.Id
In practice, inner joins are usually preferable to subqueries for performance reasons.
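One caveat when swapping IN for an inner join: the two only return the same rows when the joined column is unique in T2. A small sketch with Python's sqlite3 module, using invented rows where T2 contains Id 2 twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (Id INTEGER)")
conn.execute("CREATE TABLE T2 (Id INTEGER)")
conn.executemany("INSERT INTO T1 VALUES (?)", [(1,), (2,), (3,), (4,)])
conn.executemany("INSERT INTO T2 VALUES (?)", [(1,), (2,), (2,)])  # 2 appears twice

# IN: each T1 row is returned at most once, however many matches T2 has.
in_ids = [r[0] for r in conn.execute(
    "SELECT T1.Id FROM T1 WHERE T1.Id IN (SELECT Id FROM T2) ORDER BY T1.Id")]

# INNER JOIN: one output row per matching (T1, T2) pair, so duplicates multiply.
join_ids = [r[0] for r in conn.execute(
    "SELECT T1.Id FROM T1 INNER JOIN T2 ON T1.Id = T2.Id ORDER BY T1.Id")]

print(in_ids, join_ids)
```

If T2.Id is a primary key or otherwise unique, both queries give the same result.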
Like the title says, if anyone has the answer I would like to know. I've been googling but couldn't find a straight answer.
Example:
This works
SELECT COUNT(*) FROM Table1 TB1, Table2 TB2
WHERE TB1.Field1 = TB2.Table2
This seems to take hours
SELECT COUNT(*) FROM Table1 TB1, Table2 TB2
WHERE TB1.Field1 <> TB2.Table2
Because they are different SQL statements. In the first one, you are joining two tables on the Field1 and Table2 fields, probably returning a few records.
In the second one, your query is probably returning a lot of records, since you are effectively doing a cross join and most row pairs will satisfy your Field1 <> Table2 condition.
A very simplified example
Table1
Field1
------
1
2
5
9
Table2
Table2
------
3
4
5
6
9
Query1 will return 2 since only 5 and 9 are common.
Query2 will return 18: the cross join produces 4 x 5 = 20 row pairs, and all but the 2 matching pairs satisfy the condition.
If you have tables with a lot of records, it will take a while to process your second query.
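The simplified example can be reproduced with Python's sqlite3 module. This sketch uses the same table and column names and the same values as the listing above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Field1 INTEGER)")
conn.execute("CREATE TABLE Table2 (Table2 INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?)", [(1,), (2,), (5,), (9,)])
conn.executemany("INSERT INTO Table2 VALUES (?)", [(3,), (4,), (5,), (6,), (9,)])

# Query1: pairs with equal values (only 5 and 9 match).
equal_count = conn.execute(
    "SELECT COUNT(*) FROM Table1 TB1, Table2 TB2 "
    "WHERE TB1.Field1 = TB2.Table2").fetchone()[0]

# Query2: every other pair of the 4 x 5 cross join.
unequal_count = conn.execute(
    "SELECT COUNT(*) FROM Table1 TB1, Table2 TB2 "
    "WHERE TB1.Field1 <> TB2.Table2").fetchone()[0]

print(equal_count, unequal_count)
```

With N and M rows, the second count grows roughly like N x M, which is why it gets slow on large tables.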
It's important to realize that SQL is a declarative language and not an imperative one. You describe what conditions you want your data to fit and not how those comparisons should be executed. It's the job of the database to find the fastest way to give you an answer (a task taken over by the query optimizer). This means that a seemingly small change in your query can result in a wildly different query plan, which in turn results in a wildly different runtime behaviour.
The = comparison can be converted to and optimized the same way as a simple join on the two fields. This means that normal indices can be used to execute the query very fast, probably without reading the actual data and using only the indices instead.
A <> comparison on the other hand requires a full cartesian product to be calculated and checked for the condition, usually (there might be a way to optimize this with the correct index, but usually an index won't help here). It will also usually return a lot more results, which adds to the execution time.
Probably, the second query processes way more rows than the first one.
(Thinking back to a similar question)
Are you trying to count the rows in Table1 for which there is no matching record in Table2?
If so you could use this
SELECT COUNT(*) FROM Table1 TB1
WHERE NOT EXISTS
(SELECT * FROM Table2 TB2
WHERE TB1.Field1 = TB2.Field2 )
or this for example
SELECT COUNT(*)
FROM
(
SELECT Field1 FROM Table1
MINUS
SELECT Field2 FROM Table2
) T
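Both variants can be tried out on SQLite with Python's sqlite3 module. This is a sketch with invented sample rows; MINUS is Oracle syntax, so the second query uses EXCEPT, which is what SQLite provides for the same set difference:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Field1 INTEGER)")
conn.execute("CREATE TABLE Table2 (Field2 INTEGER)")
conn.executemany("INSERT INTO Table1 VALUES (?)", [(1,), (2,), (5,), (9,)])
conn.executemany("INSERT INTO Table2 VALUES (?)", [(3,), (4,), (5,), (6,), (9,)])

# Rows of Table1 that have no matching row in Table2.
not_exists_count = conn.execute("""
    SELECT COUNT(*) FROM Table1 TB1
    WHERE NOT EXISTS
      (SELECT * FROM Table2 TB2 WHERE TB1.Field1 = TB2.Field2)
""").fetchone()[0]

# Distinct Field1 values that do not appear in Table2.
except_count = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT Field1 FROM Table1
        EXCEPT
        SELECT Field2 FROM Table2
    ) T
""").fetchone()[0]

print(not_exists_count, except_count)
```

The two counts agree here because Field1 has no duplicates. The set difference removes duplicates, so with repeated Field1 values the second count could be lower.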