Recursive subdirectory SQL problem

This is a mental exercise that has been bothering me for a while. What strategy would you use to solve this sort of problem?
Let's consider the following simple database structure. We have directories, obviously a tree of them. Also we have content items, which always reside in some directories.
create table directory (
directoryId integer generated always as identity primary key,
parentId integer default null,
directoryName varchar(100)
);
create table content (
contentId integer generated always as identity primary key,
directory integer references directory(directoryId),
contentTitle varchar(100),
contentText varchar(32000)
);
Now let's assume that our directory tree is massive and the amount of content is massive. The solution must scale well.
The main problem: How to efficiently retrieve all content items that are found from the specified directory and its subdirectories?
The way I see it SQL can not be used to get easily all the directoryIds for a subselect. Am I correct?
One could solve this on the application side with a simple recursive loop. That might become very heavy, though, and require tricky caching, especially to guarantee reasonable first-access times.
One could also perhaps build a materialized query table and add multi-dimensional indexes dynamically for it. Possible but an implementation mess. Too complex.
My favorite solution by far would probably be to add a new table like
create table subdirectories (
directoryId integer,
subdirectoryId integer,
constraint thekey primary key (directoryId,subdirectoryId)
)
and make sure I always updated it manually when directories are moved/deleted/created. Thus I could always select by directoryId and get the ids of all subdirectories, including as a subselect in more complex queries. I also like the fact that the RDBMS is able to optimize such queries well.
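A minimal sketch of how this closure-style table could be maintained on insert, using SQLite via Python purely for illustration (the `add_directory` helper is hypothetical, not part of the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table directory (
  directoryId integer primary key,
  parentId integer,
  directoryName varchar(100)
);
create table subdirectories (
  directoryId integer,
  subdirectoryId integer,
  primary key (directoryId, subdirectoryId)
);
""")

def add_directory(dir_id, parent_id, name):
    conn.execute("insert into directory values (?, ?, ?)",
                 (dir_id, parent_id, name))
    # Each directory counts as its own subdirectory, so one IN (...) works.
    conn.execute("insert into subdirectories values (?, ?)", (dir_id, dir_id))
    # Copy every ancestor relationship of the parent down to the new node.
    if parent_id is not None:
        conn.execute("""
            insert into subdirectories (directoryId, subdirectoryId)
            select directoryId, ? from subdirectories where subdirectoryId = ?
        """, (dir_id, parent_id))

add_directory(1, None, "root")
add_directory(2, 1, "docs")
add_directory(3, 2, "drafts")

rows = conn.execute(
    "select subdirectoryId from subdirectories where directoryId = 1 order by 1"
).fetchall()
print([r[0] for r in rows])  # [1, 2, 3] -- everything under (and including) 1
```

Moves and deletes require the corresponding delete-and-reinsert of closure rows, which is exactly the manual bookkeeping the question worries about.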
What do you guys think?

In SQL Server 2005, PostgreSQL 8.4 and Oracle 11g:
WITH
-- uncomment the next line in PostgreSQL
-- RECURSIVE
q AS
(
SELECT directoryId
FROM directory
WHERE directoryId = 1
UNION ALL
SELECT d.directoryId
FROM q
JOIN directory d
ON d.parentId = q.directoryId
)
SELECT c.*
FROM q
JOIN content c
ON c.directory = q.directoryId
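SQLite (3.8.3 and later) supports `WITH RECURSIVE` as well, so the query above can be run end to end; a self-contained sketch with a small assumed data set:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table directory (
  directoryId integer primary key,
  parentId integer,
  directoryName varchar(100)
);
create table content (
  contentId integer primary key,
  directory integer references directory(directoryId),
  contentTitle varchar(100)
);
insert into directory values
  (1, null, 'root'), (2, 1, 'docs'), (3, 2, 'drafts'), (4, null, 'other');
insert into content values (10, 3, 'a draft'), (11, 4, 'unrelated');
""")

# Recursive CTE: seed with the starting directory, then repeatedly
# join children onto the rows found so far.
rows = conn.execute("""
    with recursive q as (
        select directoryId from directory where directoryId = 1
        union all
        select d.directoryId
        from q join directory d on d.parentId = q.directoryId
    )
    select c.contentTitle
    from q join content c on c.directory = q.directoryId
""").fetchall()
print([r[0] for r in rows])  # ['a draft'] -- content under directory 1 only
```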
In Oracle before 11g:
SELECT c.*
FROM (
SELECT directoryId
FROM directory
START WITH
directoryId = 1
CONNECT BY
parentId = PRIOR directoryId
) q
JOIN content c
ON c.directory = q.directoryId
For PostgreSQL 8.3 and below see this article:
Hierarchical queries in PostgreSQL
For MySQL, see this article:
Hierarchical queries in MySQL

This is a standard -- and well understood -- "hard problem" in SQL.
All arc-node graph theory problems are hard because they involve transitive relationships.
There are standard solutions.
Loops, with an explicit stack used to manage the list of unvisited nodes of the tree.
Recursion. This is remarkably efficient. It doesn't "require tricky caching"; it's really simple and really effective. The recursion stack is the list of unvisited nodes.
Creating a "transitive closure" of the directory tree.
SQL extensions to handle transitive relationships like a directory tree.
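The explicit-stack variant is short enough to sketch; this assumes a simple adjacency-list table and issues one children query per node (SQLite via Python, for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table directory (directoryId integer primary key, parentId integer)")
conn.executemany("insert into directory values (?, ?)",
                 [(1, None), (2, 1), (3, 2), (4, 1), (5, None)])

def descendant_ids(root_id):
    # The stack holds the unvisited nodes, exactly as described above.
    found, stack = [], [root_id]
    while stack:
        node = stack.pop()
        found.append(node)
        children = conn.execute(
            "select directoryId from directory where parentId = ?", (node,))
        stack.extend(row[0] for row in children)
    return found

print(sorted(descendant_ids(1)))  # [1, 2, 3, 4] -- node 5 is a separate root
```

The cost is one round trip per node, which is why the transitive-closure and recursive-CTE options scale better on deep trees.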

Related

Is storing a doubly linked list in a table wholly redundant?

Let's say we have a table called nodes with columns id, next_id and prev_id.
For purpose of below, let's also say $this is some node in memory being used to form the query.
I observe the following queries can be issued:
1) "Follow the node's own links":
SELECT * from nodes where id = $this.next_id;
SELECT * from nodes where id = $this.prev_id;
2) "Look up backlinks to this node elsewhere in the table" (foreign keys, obviously)
This uses the fact that in SQL the data lives in a table, so we can simply look up the backward-pointing links. (You could not do this on a raw linked list data structure in memory, because there would be no index enabling it.)
SELECT * from nodes where next_id = $this.id;
SELECT * from nodes where prev_id = $this.id;
This leads me to conclude that "links" in sql are effectively bi-directional by default.
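The observation can be checked directly; a small sketch in SQLite (the three-node list is an assumed example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table nodes (id integer primary key, next_id integer, prev_id integer)")
# A three-node list: 1 <-> 2 <-> 3
conn.executemany("insert into nodes values (?, ?, ?)",
                 [(1, 2, None), (2, 3, 1), (3, None, 2)])

this_id = 2
# "Follow the node's own links": use node 2's stored prev_id.
own = conn.execute(
    "select id from nodes where id = (select prev_id from nodes where id = ?)",
    (this_id,)).fetchone()
# "Look up backlinks": find the row whose next_id points at node 2.
back = conn.execute(
    "select id from nodes where next_id = ?", (this_id,)).fetchone()
print(own[0], back[0])  # 1 1 -- both directions reach the same neighbour
```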
I know to some experts this may seem a very elementary point but I believe it worth stressing because it may get to the core of something quite important.
A side issue is I am a little confused that when I search for "doubly linked list in sql", there are a few people talking about doing this.
Have they simply got it wrong?
Linked lists -- of the doubly linked sort or not -- are almost never needed in SQL databases. They are a data structure used to optimize certain types of data access, particularly for data stored in memory.
SQL databases store data using different mechanisms. Data is stored on data pages, which typically contain multiple records. Access is improved though the use of indexes, which come in multiple flavors (particularly in Postgres). The assumption is that data is larger than memory -- although many "smaller" databases now do readily fit into available memory.
What are used are foreign key references. In this case nodes could be defined as:
create table nodes (
nodeId serial primary key,
prev_nodeId int references nodes(nodeId),
next_nodeId int references nodes(nodeId)
);
However, nothing (in this declaration) would guarantee that next-->prev = prev-->next, for instance. Enforcing that constraint is a little tricky.
What you are really trying to do with a doubly-linked list is to maintain order. In SQL, you would do that with some sort of ordering column:
create table nodes (
nodeId serial primary key,
ordering int unique not null
);
This would guarantee an ordering. With this approach, you can get the next and previous using lead() and lag():
lead(nodeId) over (order by ordering) as next_nodeId
lag(nodeId) over (order by ordering) as prev_nodeId
If the linked list is merely used to remember "insertion" order, then serial already does that for you and an additional column is not necessary.
In SQL, you get an unordered array of rows automatically when data is in a table. You can readily add a column that specifies the ordering. Under these circumstances, linked lists are much less useful.
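Putting the lead()/lag() suggestion together, with an assumed three-row table (SQLite 3.25+ supports window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table nodes (nodeId integer primary key, ordering integer unique not null)")
# Insertion order deliberately differs from the ordering column.
conn.executemany("insert into nodes values (?, ?)", [(10, 3), (20, 1), (30, 2)])

# prev/next neighbours are derived from the ordering column on the fly,
# with no stored pointers to keep consistent.
rows = conn.execute("""
    select nodeId,
           lag(nodeId)  over (order by ordering) as prev_nodeId,
           lead(nodeId) over (order by ordering) as next_nodeId
    from nodes
    order by ordering
""").fetchall()
print(rows)  # [(20, None, 30), (30, 20, 10), (10, 30, None)]
```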

Reconstruct Create-Table-statement from Select-Statements

Question
Are there any tools or APIs that can do the following conversion/transformation?
Input: A Select-statement of arbitrary complexity (e.g can contain multiple joins, unions etc. pp.) for example SELECT x, y, z FROM A LEFT JOIN B on A.p = B.p LEFT JOIN C on B.q = C.q
Output: A Create-statement that creates all necessary tables.
Background:
I have 50+ Select-statements for which I need to generate tables. This is a somewhat tedious task.
Additional Questions
I think it is possible to automate. Correct me if I am wrong in this assumption, and provide an explanation in case it is not possible. I know the Select-statements can lack information that the real database would have.
The trouble with this is that you do not have the relevant information to create the tables. You would have to make a lot of assumptions.
You are missing the following info:
Datatypes CHAR(10), INT, BIT
Constraints NOT NULL, DEFAULT()
Index information
...and that is just the start; not every query will list all fields, and you'll probably find that you've missed stuff the next time a table is referenced - what do you do then?
The only way this could be done is if you have very strict coding standards in place.
For example:
All columns called Modified are of type DATETIME NOT NULL
All primary keys take the name of the table plus ID, eg. TablenameID INT IDENTITY(1,1) PRIMARY KEY NOT NULL (you can use also use the CONSTRAINT option to create the key with a specific name)
... you get the idea.
I would just bite the bullet and get on with it manually.
Edit
The real question here is why are you doing it this way? Surely the queries have been created against an existing database?
Is it the case that you are migrating from one RDBMS to another?

Would "dereferencing" a foreign key in SQL theoretically work?

Suppose we have 2 tables, a and b:
CREATE TABLE a (
id_a INT NOT NULL PRIMARY KEY,
id_b INT NOT NULL,
INDEX fk_id_b (id_b ASC),
CONSTRAINT fk_id_b FOREIGN KEY (id_b)
REFERENCES b (id_b)
);
CREATE TABLE b (id_b INT NOT NULL PRIMARY KEY, b_data INT NOT NULL);
So a has the following columns: id_a and id_b, where id_b is a foreign key to b's id_b.
When I want to get the associated b_data from a, I have to do a join:
SELECT id_a, b_data FROM a JOIN b ON a.id_b=b.id_b;
It works fine, but it's long and I repeat myself (which I shouldn't, according to the Ruby guys), so I thought of a way to make this query shorter, easier to understand and still unambiguous:
SELECT id_a, id_b->b_data FROM a;
foreign_key->column would behave like a pointer to a structure, the database would automatically join the needed tables.
I know this doesn't exist, that making it a standard would probably take so much time I wouldn't live to see it in production ready database systems and some people wouldn't want it as "it looks weird", but I would at least like to know, if it would be possible, and if not, why.
First
Ruby isn't SQL, SQL isn't Ruby
SQL also predates almost every current mainstream or fashionable language
However, one thing to bear in mind, and the most important...
Repeating the JOIN is not the same as repeating the query. You'll have different
WHERE filters
SELECT list
Maybe an aggregate
Each of these is a different query and will require different indexes/plans.
Using a view to mask the JOIN will be the next suggestion, to "encapsulate" it. However, you'll end up with view joining view joining view, and a view is just a macro that expands, so your queries will start to run poorly.
Using an indexed view may not be a solution because of different filters etc
Edit, from Dems:
These types of ideas work in simple cases, but create more problems in complex cases. The current syntax handles expression of set based queries equally well/poorly across a very wide range of complexity.
One of the major advantages of the relational model of data is that it eliminates the need to rely on hard coded links/pointers/navigational structures between tables. Data access is via table and attribute names using relational expressions like joins.
A model that persisted navigational structures in the database would be less flexible and dynamic - when you changed table structures you would invalidate or have to change the navigational structures as well. Your question also only addresses those joins which happen to be equijoins on foreign keys. Joins are much more general than that.
SQL has a NATURAL JOIN operator e.g. your query would be:
SELECT DISTINCT *
FROM a NATURAL JOIN b;
However, it looks like you want to do a semi-join, for which SQL has no specific operator :(
As you are interested in language design, consider the truly relational language Tutorial D (designed for academic purposes) has a semi-join operator MATCHING e.g. your query would simply be:
a MATCHING b;

Adjacency list tree - how to prevent circular references?

I have an adjacency list in a database with ID and ParentID to represent a tree structure:
-a
--b
---c
-d
--e
Of course in a record the ParentID should never be the same as ID, but I also have to prevent circular references to prevent an endless loop. These circular references could in theory involve more than 2 records. ( a->b, b->c, c->a , etc.)
For each record I store the paths in a string column like this :
a a
b a/b
c a/b/c
d d
e d/e
My question is now :
when inserting/updating, is there a way to check if a circular reference would occur?
I should add that I know all about the nested set model, etc. I chose the adjacency method with stored path's because I find it much more intuitive. I got it working with triggers and a separate paths-table, and it works like a charm, except for the possible circular references.
If you're storing the path like that, you could put in a check that the path does not contain the id.
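That path-containment check can live in the trigger or in application code; a minimal sketch of the rule itself, using the stored paths from the question (the `can_reparent` helper is hypothetical):

```python
# Stored materialized paths, exactly as in the question's example tree.
paths = {"a": "a", "b": "a/b", "c": "a/b/c", "d": "d", "e": "d/e"}

def can_reparent(node_id, new_parent_id):
    # Moving a node under itself or under one of its own descendants
    # would create a cycle, and in both cases the node's id already
    # appears in the prospective parent's path.
    parent_path = paths[new_parent_id].split("/")
    return node_id not in parent_path

print(can_reparent("c", "e"))  # True: c may move under d/e
print(can_reparent("a", "c"))  # False: a appears in path a/b/c
```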
If you are using Oracle you can implement a check for cycles using the CONNECT BY syntax: the count of all nodes should equal the count of descendants from the root node.
CHECK (
(SELECT COUNT(*) Nodes
FROM Tree) =
(SELECT COUNT(*) Descendants
FROM Tree
START WITH parent_node IS NULL -- Root Node
CONNECT BY parent_node = PRIOR child_node))
Note, you will still need other checks to enforce the tree. IE
A single root node with a NULL parent.
Each node has exactly one parent.
You cannot create a check constraint with a subquery, so this will need to go to a view or trigger.

Arrays in database tables and normalization

Is it smart to keep arrays in table columns? More precisely I am thinking of the following schema which to my understanding violates normalization:
create table Permissions(
GroupID int not null default(-1),
CategoryID int not null default(-1),
Permissions varchar(max) not null default(''),
constraint PK_GroupCategory primary key clustered(GroupID,CategoryID)
);
and this:
create table Permissions(
GroupID int not null default(-1),
CategoryID int not null default(-1),
PermissionID int not null default(-1),
constraint PK_GroupCategory primary key clustered(GroupID,CategoryID)
);
UPD3: I envision Permissions as a comma-delimited string since MSSQL is our primary deployment target.
UPD: Forgot to mention, in the scope of this concrete question we will consider that the "fetch rows that have permission X" won't be performed, instead all the lookups will be made by GroupID and CategoryID only
UPD2: I envision the typical usage scenario as following:
int category_id=42;
int[] array_of_groups=new int[]{40,2,42};
if(!Permissions.Check(category_id, array_of_groups, Permission.EatAndDrink)) {
throw new StarveToDeathException();
}
Thoughts?
Thanks in advance!
I'd suggest to take the normalized road for the following reasons:
By having a table containing all possible permissions, you have self-documenting data. You may add a description to each permission. This definitely beats concatenated id values without any meaning.
You get all the advantages of referential integrity and can be sure that there are no bogus permission ids in your data.
Inserting and deleting permissions will be easier - you add or delete records. With the concatenated string you will be updating a column, and delete the record only when you remove the last permission.
Your design is future-proof - you say you only want to query by CategoryID and GroupID, you can do this already with normalized tables. On top of that, you will also for example be able to add other properties to your permissions, query by permission, etc.
Performance-wise, I think it will actually be faster to get a resultset of id's than having to parse a string to integers. To be measured with actual data and implementation...
Your second example should probably be:
constraint PK_GroupCategory primary key clustered(GroupID,CategoryID,PermissionID)
Your first example would violate normal form (and string parsing might not be a good use of your processing time), but that doesn't mean it's necessarily wrong for your application. It really depends how you use the data.
Is it smart
Occasionally, it depends. I'd say it depends how narrowly you define the things being normalised.
If you can see no way in which a table with one row for each item would ever be useful then I'd suggest that the encapsulate-in-a-string might be considered.
In the example given, I'd want to be sure that executing a query to find all group/category combinations for a specified permission would not cause me a problem if I had to write a WHERE clause that used string pattern matching. Of course, if I never have to perform such a query then it's a moot point.
In general I'm happiest with this approach when the data being assembled thus has no significance in isolation: the data only makes sense when considered as a complete set. If there's a little more structure, say a list of data/value pairs, then formatting with XML or JSON can be useful.
If you're only querying by GroupID and/or CategoryID then there's nothing wrong with it. Normalizing would mean more tables, rows, and joins. So for large databases this can have a negative performance impact.
If you're absolutely certain you'll never need a query which processes Permissions, and it's only parsed by your application, there's nothing improper about this solution. It could also be preferable if you always want the complete set of permissions (i.e. you're not querying just to get part of the string, but always want all of its values).
The problem with the first implementation is that it doesn't actually use an array but a concatenated string.
This means that you won't easily be able to use the value stored in that string to perform set based queries such as finding all people with a specific permission or specific set of permissions.
If you were using a database that natively supports arrays as an atomic value, such as PostgreSQL, then the argument would be different.
Based upon the second requirement of the proposed query I'd have to suggest the second one is best as you can simply query SELECT count(*) FROM Permissions WHERE CategoryID = 42 AND GroupID IN (40, 2, 42) AND PermissionID = 2 (assuming EatAndDrink has an ID of 2). The first version however would require retrieving all the permissions for each group and parsing the string before you can test if it includes the requested permission.
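A runnable sketch of that normalized check, with assumed sample rows and the assumed EatAndDrink = 2 mapping (SQLite via Python, for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table Permissions (
    GroupID integer, CategoryID integer, PermissionID integer,
    primary key (GroupID, CategoryID, PermissionID))""")
# Sample data: three groups hold various permissions in category 42.
conn.executemany("insert into Permissions values (?, ?, ?)",
                 [(40, 42, 1), (2, 42, 2), (42, 42, 3)])

# EatAndDrink is assumed to have PermissionID = 2, as in the answer above.
count = conn.execute("""
    select count(*) from Permissions
    where CategoryID = 42 and GroupID in (40, 2, 42) and PermissionID = 2
""").fetchone()[0]
print(count > 0)  # True -- group 2 grants the permission
```

With the comma-delimited-string design, the same check means fetching every matching row and splitting the strings in the application before you can answer at all.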