Sql recursion without recursion - sql

I have four tables
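CREATE TABLE entities (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(255)
);
CREATE TABLE users (
    id    INTEGER REFERENCES entities (id), -- fk to entities
    email VARCHAR(255)
);
CREATE TABLE groups (
    id INTEGER PRIMARY KEY REFERENCES entities (id) -- fk to entities
);
CREATE TABLE group_members (
    group_id  INTEGER REFERENCES groups (id),   -- fk to groups
    entity_id INTEGER REFERENCES entities (id)  -- fk to entities
);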
I want a query that returns every group a user belongs to, directly or indirectly. The obvious solution is to recurse at the application level. What changes can I make to my data model to reduce the number of database round trips and so get better performance?

In Oracle:
SELECT group_id
FROM group_members
START WITH
entity_id = :user_id
CONNECT BY
entity_id = PRIOR group_id
In SQL Server:
WITH q AS
(
SELECT group_id, entity_id
FROM group_members
WHERE entity_id = @user_id
UNION ALL
SELECT gm.group_id, gm.entity_id
FROM group_members gm
JOIN q
ON gm.entity_id = q.group_id
)
SELECT group_id
FROM q
In PostgreSQL 8.4:
WITH RECURSIVE
q AS
(
SELECT group_id, entity_id
FROM group_members
WHERE entity_id = :user_id
UNION ALL
SELECT gm.group_id, gm.entity_id
FROM group_members gm
JOIN q
ON gm.entity_id = q.group_id
)
SELECT group_id
FROM q
In PostgreSQL 8.3 and below:
CREATE OR REPLACE FUNCTION fn_group_members(INT)
RETURNS SETOF group_members
AS
$$
SELECT group_members
FROM group_members
WHERE entity_id = $1
UNION ALL
SELECT fn_group_members(group_members.group_id)
FROM group_members
WHERE entity_id = $1;
$$
LANGUAGE 'sql';
SELECT group_id
FROM fn_group_members(:myuser) gm
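For anyone who wants to try the recursive CTE without an Oracle or SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module (SQLite also supports WITH RECURSIVE). The sample rows and the helper name groups_for_entity are made up for illustration. Note that it uses UNION rather than UNION ALL, which is my own change: it removes duplicates and stops the recursion if the membership graph happens to contain a cycle.

```python
import sqlite3


def groups_for_entity(conn, entity_id):
    """Return every group the entity belongs to, directly or through nested groups."""
    rows = conn.execute(
        """
        WITH RECURSIVE q(group_id) AS (
            -- anchor: groups the entity is a direct member of
            SELECT group_id FROM group_members WHERE entity_id = ?
            UNION
            -- recursive step: groups that contain a group we already found
            SELECT gm.group_id
            FROM group_members gm
            JOIN q ON gm.entity_id = q.group_id
        )
        SELECT group_id FROM q ORDER BY group_id
        """,
        (entity_id,),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE group_members (group_id INTEGER, entity_id INTEGER)")
# Hypothetical data: user 1 is in group 10, group 10 is in group 20,
# group 20 is in group 30; user 2 is only in group 40.
conn.executemany(
    "INSERT INTO group_members VALUES (?, ?)",
    [(10, 1), (20, 10), (30, 20), (40, 2)],
)

print(groups_for_entity(conn, 1))  # [10, 20, 30]
```

The whole lookup is one round trip to the database, which is what the question was after.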

There are ways of avoiding recursion in tree hierarchy queries (contrary to what people have said here).
The one I've used most is Nested Sets.
As with all life and technical decisions, however, there are trade-offs to be made. Nested Sets are often slower to update but much faster to query. There are clever and complicated ways of speeding up updates to the hierarchy, but that brings another trade-off: performance vs. code complexity.
A simple example of a nested set...
Tree View:
-Electronics
 |
 |-Televisions
 | |
 | |-Tube
 | |-LCD
 | |-Plasma
 |
 |-Portable Electronics
   |
   |-MP3 Players
   | |
   | |-Flash
   |
   |-CD Players
   |-2 Way Radios
Nested Set Representation
+-------------+----------------------+-----+-----+
| category_id | name                 | lft | rgt |
+-------------+----------------------+-----+-----+
|           1 | ELECTRONICS          |   1 |  20 |
|           2 | TELEVISIONS          |   2 |   9 |
|           3 | TUBE                 |   3 |   4 |
|           4 | LCD                  |   5 |   6 |
|           5 | PLASMA               |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |  10 |  19 |
|           7 | MP3 PLAYERS          |  11 |  14 |
|           8 | FLASH                |  12 |  13 |
|           9 | CD PLAYERS           |  15 |  16 |
|          10 | 2 WAY RADIOS         |  17 |  18 |
+-------------+----------------------+-----+-----+
You'll want to read the article I linked to understand this fully, but I'll try to give a short explanation.
An item is a member of another item if the child's "lft" (left) value is greater than the parent's "lft" value AND the child's "rgt" (right) value is less than the parent's "rgt" value.
"Flash" is therefore a member of "MP3 Players", "Portable Electronics" and "Electronics".
Or, conversely, the members of "Portable Electronics" are:
- MP3 Players
- Flash
- CD Players
- 2 Way Radios
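To make the membership rule concrete, here is a small sketch in Python with the built-in sqlite3 module. The table name category and the helper names ancestors and descendants are my own choices for illustration; the rows are the ones from the table above. Both lookups are plain joins on the lft/rgt ranges, with no recursion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (category_id INTEGER, name TEXT, lft INTEGER, rgt INTEGER)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", 1, 20),
        (2, "TELEVISIONS", 2, 9),
        (3, "TUBE", 3, 4),
        (4, "LCD", 5, 6),
        (5, "PLASMA", 7, 8),
        (6, "PORTABLE ELECTRONICS", 10, 19),
        (7, "MP3 PLAYERS", 11, 14),
        (8, "FLASH", 12, 13),
        (9, "CD PLAYERS", 15, 16),
        (10, "2 WAY RADIOS", 17, 18),
    ],
)

# The same range condition serves both directions of the lookup.
RANGE_JOIN = """
    FROM category AS child
    JOIN category AS parent
      ON child.lft > parent.lft AND child.rgt < parent.rgt
"""


def ancestors(conn, name):
    """Everything the named item is a member of, outermost first."""
    sql = "SELECT parent.name" + RANGE_JOIN + "WHERE child.name = ? ORDER BY parent.lft"
    return [r[0] for r in conn.execute(sql, (name,))]


def descendants(conn, name):
    """Every member of the named item, in tree order."""
    sql = "SELECT child.name" + RANGE_JOIN + "WHERE parent.name = ? ORDER BY child.lft"
    return [r[0] for r in conn.execute(sql, (name,))]


print(ancestors(conn, "FLASH"))
print(descendants(conn, "PORTABLE ELECTRONICS"))
```

Applied to the group-membership question, a user's groups would be the ancestors of the user's row, but only if each entity sits in exactly one place in the tree; that is an assumption the original schema does not guarantee.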
Joe Celko has an entire book on "Trees and Hierarchies in SQL". There are more options than you think, but lots of trade-offs to make.
Note: Never say something can't be done; some mofo will turn up to show you that it can.

Can you clarify the difference between an entity and a user? Otherwise, your tables look OK. You are making an assumption that there is a many-to-many relationship between groups and entities.
In any case, with standard SQL use this query:
SELECT name, group_id
FROM entities JOIN group_members ON entities.id = group_members.entity_id;
This will give you a list of names and group_ids, one pair per line. If an entity is a member of multiple groups, the entity will be listed several times.
If you're wondering why there's no JOIN to the groups table, it's because there's no data from the groups table that isn't already in the group_members table. If you included, say, a group name in the groups table, and you wanted that group name to be shown, then you'd have to join with groups, too.
Some SQL variants have commands related to reporting. They would allow you to list multiple groups on the same line as a single entity. But it's not standard and wouldn't work across all platforms.

If you want a truly theoretically infinite level of nesting, then recursion is the only option, which precludes any sane version of SQL. If you're willing to limit it, then there are a number of other options.
Check out this question.

You can do the following:
Use the START WITH / CONNECT BY PRIOR constructs.
Create a PL/SQL function.

I don't think there is a need for recursion here as the solution posted by barry-brown seems adequate. If you need a group to be able to be a member of a group, then the tree traversal method offered by Dems works well. Inserts, deletes and updates are pretty straightforward with this scheme, and retrieving the entire hierarchy is accomplished with a single select.
I would suggest including a parent_id field in your group_members table (assuming that is the point at which your recursive relationship occurs). In a navigation editor I've created a nodes table like so:
tbl_nodes
----------
node_id
parent_id
left
right
level
...
My editor creates hierarchically-related objects from a C# node class
class Node {
public int NodeID { get; set; }
public Node Parent { get; set; }
public int Left { get; set; }
public int Right { get; set; }
public Dictionary<int,Node> Nodes { get; set; }
public int Level {
get {
return (Parent!=null) ? Parent.Level+1 : 1;
}
}
}
The Nodes property contains the child nodes, keyed by node id. When the business layer loads the hierarchy, it rectifies the parent/child relationships. When the nav editor saves, I recursively set the left and right property values, then save to the database. That lets me get the data out in the correct order, meaning I can set parent/child references during retrieval instead of having to make a second pass. It also means that anything else that needs to display the hierarchy (say, a report) can easily get the node list out in the correct order.
Without a parent_id field, you can retrieve a breadcrumb trail to the current node with
select n1.*
from nodes n1, nodes n2
where n1.lft <= n2.lft and n1.rgt >= n2.rgt
and n2.id = @id
order by n1.lft;
where @id is the id of the node you're interested in.
Pretty obvious stuff, really, but it applies to items such as nested group membership that might not be obvious, and as others have said it eliminates the need for slow recursive SQL.
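The "recursively set the left and right values" step can be sketched as follows. This is not the C# editor code described above, just a minimal Python illustration under my own assumptions: the input is a plain dict mapping node id to parent id, siblings are visited in id order, and number_nested_set is a made-up name.

```python
from collections import defaultdict


def number_nested_set(parents):
    """Assign nested-set bounds by a depth-first walk.

    parents maps node_id -> parent_id (None for a root).
    Returns a dict mapping node_id -> (left, right).
    """
    children = defaultdict(list)
    roots = []
    for node, parent in parents.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        left = counter          # number handed out on the way down
        counter += 1
        for child in sorted(children[node]):
            visit(child)
        bounds[node] = (left, counter)  # number handed out on the way back up
        counter += 1

    for root in sorted(roots):
        visit(root)
    return bounds


# Hypothetical three-level tree: 1 is the root, 2 and 3 are its children,
# 4 is a child of 2.
print(number_nested_set({1: None, 2: 1, 3: 1, 4: 2}))
# {4: (3, 4), 2: (2, 5), 3: (6, 7), 1: (1, 8)}
```

The resulting pairs would then be written back to the left and right columns in a single save, as described above.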

Related

Keep a relation map in Objection.js while removing the table

I'm developing a reddit-like site where votes are stored per-user (instead of per-post). Here's my relevant schema:
content
id | author_id | title | text
---|-----------|-------------|---
1 | 1 (adam) | First Post | This is a test post by adam
vote: All the votes ever voted by anyone on any post
id | voter_id | content_id | category_id
---|-------------|------------------|------------
1 | 1 (adam) | 1 ("First Post") | 1 (upvote)
2 | 2 (bob) | 1 ("First Post") | 1 (upvote)
vote_count: Current tally ("count") of total votes received by a post by all users
id | content_id | category_id | count
---|------------------|--------------|-------
1 | 1 ("First Post") | 1 (upvote) | 2
I've defined a voteCount relation in Objection.js model for the content table:
class Content extends Model {
static tableName = 'content';
static relationMappings = {
voteCount: {
relation: Model.HasManyRelation,
modelClass: VoteCount,
join: {
from: 'content.id',
to: 'vote_count.content_id'
}
}
}
}
But I recently (learned and) decided that I don't need to keep (and update) a separate vote_count table, when in fact I can just query the vote table and essentially get the same table as a result:
SELECT content_id
, category_id
, COUNT(*) AS count
FROM vote
GROUP
BY content_id
, category_id
So now I wanna get rid of the vote_count table entirely.
But it seems that would break my voteCount relation, since there would no longer be a VoteCount model (not shown here, but it's the corresponding model for the vote_count table). (Right?)
How do I keep voteCount relation while getting rid of vote_count table (and thus VoteCount model with it)?
Is there a way to somehow specify in the relation that instead of looking at a concrete table, it should look at the result of a query? Or is it possible to define a model class for the same?
My underlying database is PostgreSQL, if that helps.
Thanks to @Belayer. Views were exactly the solution to this problem.
Objection.js supports using views (instead of tables) in a Model class, so all I had to do was create a view based on the above query.
I'm also using Knex's migration strategy to create/version my database, and although it doesn't (yet) support creating views out of the box, I found you can just use raw queries:
module.exports.up = async function(knex) {
await knex.raw(`
CREATE OR REPLACE VIEW "vote_count" AS (
SELECT content_id
, category_id
, COUNT(*) AS count
FROM vote
GROUP
BY content_id
, category_id
)
`);
};
module.exports.down = async function(knex) {
await knex.raw('DROP VIEW "vote_count";');
};
The above migration step replaces my vote_count table with the equivalent view, and the Objection.js Model class for it (VoteCount) worked as usual without needing any change, and so did the voteCount relation on the Content class.
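If you want to convince yourself that the view behaves like the old tally table from the reader's side, here is a minimal sketch using Python's built-in sqlite3 module instead of PostgreSQL and Knex. The sample votes, category 2 as a second vote type, and the helper name tallies are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vote (
        id          INTEGER PRIMARY KEY,
        voter_id    INTEGER,
        content_id  INTEGER,
        category_id INTEGER
    );
    INSERT INTO vote (voter_id, content_id, category_id)
    VALUES (1, 1, 1), (2, 1, 1), (3, 1, 2);

    -- The view is read exactly like the table it replaces.
    CREATE VIEW vote_count AS
        SELECT content_id, category_id, COUNT(*) AS "count"
        FROM vote
        GROUP BY content_id, category_id;
    """
)


def tallies(conn):
    """Read the view as if it were the old vote_count table."""
    return conn.execute(
        'SELECT content_id, category_id, "count" FROM vote_count '
        "ORDER BY content_id, category_id"
    ).fetchall()


print(tallies(conn))  # [(1, 1, 2), (1, 2, 1)]

# A new vote shows up in the view immediately; there is nothing to keep in sync.
conn.execute("INSERT INTO vote (voter_id, content_id, category_id) VALUES (4, 1, 1)")
print(tallies(conn))  # [(1, 1, 3), (1, 2, 1)]
```

The trade-off is that the count is computed on every read instead of being stored.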

My SQL query is taking too long, is there another approach?

I want to store 3D vector images in my MariaDB, but I'm finding that retrieving the data is taking far too long to be practical.
I have a few tables:
a points table containing the x,y and z coordinates plus the entity id,
an entity table containing a unique id, an entity type (text, line, polyline,etc) other common attributes such as colour and linetype,
and some auxiliary tables containing additional values like text, text height, line thicknesses and flags split into separate tables based on field type (varchar, int or float).
I am accessing the data through PHP as follows:
if($result = mysqli_query($conn, "SELECT entityID,X,Y,Z FROM dwgpoints WHERE drawing=".$DrawingID." AND blockID=".$blockID.";"))
{
$previous_eID=0;
while($row = mysqli_fetch_array($result))
{
$eID=$row['entityID'];
if($previous_eID!=$eID)
{
if($previous_eID)// confirm it's not zero
renderEntity($image_handle,$DrawingID,$previous_eID,$etype,$colour,$ltype,$points, $transformation, $clip);
$previous_eID=$eID;
if($eResult=mysqli_query($conn,"SELECT colour,ltype,etype FROM entity WHERE drawing=".$ID." AND eID=".$eID.";")){
$erow=mysqli_fetch_assoc($eResult);
$colour=$erow['colour'];
$ltype=$erow['ltype'];
$etype=$erow['etype'];
$points=[[$row['X'],$row['Y'],$row['Z']]];
}
}else{
$points[]=[$row['X'],$row['Y'],$row['Z']];
}
}
}
This process is taking up to ten minutes, but I know that OpenStreetMap, for example, renders tiles from similar amounts of data.
The output of EXPLAIN is as follows:
MariaDB [wptest_11]> EXPLAIN SELECT entityID,X,Y,Z FROM dwgpoints WHERE drawing=2 AND blockID=-1;
+------+-------------+-----------+------+---------------+--------+---------+-------------+-------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+------+---------------+--------+---------+-------------+-------+-----------------------+
| 1 | SIMPLE | dwgpoints | ref | idx_id | idx_id | 9 | const,const | 24939 | Using index condition |
+------+-------------+-----------+------+---------------+--------+---------+-------------+-------+-----------------------+
Is it possible to streamline my data searches to make the process time manageable? I'm on a cheap VPS, so there may be hardware performance issues, in which case would an upgrade make much difference? Or do I need to rethink my approach?
Any advice would be most welcome.
For reference, I have run millions of records on a Raspberry Pi 3 with better performance. Hardware helps, but I would first look at:
Minimizing your calls, if it makes sense for your application
Adding indexes on the fields in your WHERE clause to improve performance
Single Query
Adjust the following query to fit your needs. Then see what the timing is. For 26k results, you should be well under a second.
SELECT
a.entityID
, a.X
, a.Y
, a.Z
, b.colour
, b.ltype
, b.etype
, b.drawing
FROM dwgpoints a
LEFT JOIN entity b
ON b.eID = a.entityID
AND b.drawing = a.drawing
WHERE
a.drawing = ".$ID."
AND a.blockID = ".$blockID.
;
Indexes
Create some indexes:
The combination of drawing and blockID for the dwgpoints table.
The combination of drawing and eID for the entity table.
Re-run the above query; there should be an improvement.
I would also look at adding a foreign key between the tables, which will help improve the data integrity of your database.
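To show the "one query instead of one query per entity" idea end to end, here is a rough sketch in Python with the built-in sqlite3 module rather than PHP and MariaDB. The column types, the sample rows, the index names and the helper load_entities are all assumptions for illustration; only the table and column names come from the question. The rows come back ordered by entity, so consecutive rows can be grouped in memory.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (drawing INTEGER, eID INTEGER, colour INTEGER, ltype TEXT, etype TEXT);
    CREATE TABLE dwgpoints (drawing INTEGER, blockID INTEGER, entityID INTEGER, X REAL, Y REAL, Z REAL);
    -- Composite indexes matching the WHERE clause and the join condition.
    CREATE INDEX idx_points ON dwgpoints (drawing, blockID, entityID);
    CREATE INDEX idx_entity ON entity (drawing, eID);

    INSERT INTO entity VALUES (2, 1, 7, 'solid', 'line'), (2, 2, 3, 'dashed', 'polyline');
    INSERT INTO dwgpoints VALUES
        (2, -1, 1, 0, 0, 0), (2, -1, 1, 1, 1, 0),
        (2, -1, 2, 0, 0, 0), (2, -1, 2, 2, 0, 0), (2, -1, 2, 2, 2, 0);
    """
)


def load_entities(conn, drawing_id, block_id):
    """Fetch points and entity attributes in one query, then group the rows per entity."""
    rows = conn.execute(
        """
        SELECT a.entityID, b.etype, b.colour, b.ltype, a.X, a.Y, a.Z
        FROM dwgpoints a
        LEFT JOIN entity b
               ON b.eID = a.entityID AND b.drawing = a.drawing
        WHERE a.drawing = ? AND a.blockID = ?
        ORDER BY a.entityID, a.rowid
        """,
        (drawing_id, block_id),  # bound parameters instead of string concatenation
    )
    entities = []
    # The first four columns identify the entity; the rest is one point.
    for key, group in groupby(rows, key=lambda r: r[:4]):
        entities.append((*key, [tuple(r[4:]) for r in group]))
    return entities


for entity_id, etype, colour, ltype, points in load_entities(conn, 2, -1):
    print(entity_id, etype, colour, ltype, len(points))
```

In the PHP loop the same structure would replace the inner mysqli_query call: each entity's attributes arrive on the point rows, so rendering can happen whenever the entity id changes and once more after the last row.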

Check if a value exists in the child-parent tree

I'm creating a simple directory listing page where you can specify what kind of thing you want to list in the directory e.g. a person or a company.
Each user has a UserTypeID and there is a dbo.UserType lookup table. The dbo.UserType lookup table is like this:
UserTypeID | UserTypeParentID | Name
1 NULL Person
2 NULL Company
3 2 IT
4 3 Accounting Software
In the dbo.Users table we have records like this:
UserID | UserTypeID | Name
1 1 Jenny Smith
2 1 Malcolm Brown
3 2 Wall Mart
4 3 Microsoft
5 4 Sage
My SQL (so far) is very simple: (excuse the pseudo-code style)
DECLARE @UserTypeID int
SELECT
*
FROM
dbo.Users u
INNER JOIN
dbo.UserType ut
ON ut.UserTypeID = u.UserTypeID
WHERE
ut.UserTypeID = @UserTypeID
The problem here is that when people want to search for companies they will enter '2' as the UserTypeID. But Microsoft and Sage won't show up, because their UserTypeIDs are 3 and 4 respectively. It's the final UserTypeParentID that tells me they're both companies.
How could I rewrite the SQL to return records where the UserTypeID = @UserTypeID or where the final UserTypeParentID is also equal to @UserTypeID? Or am I going about this the wrong way?
Schema Change
I would suggest breaking this schema down a little more to make your queries and your life simpler. With the current schema you will end up writing a recursive query every time you want to get the simplest data from your Users table, and trust me, you don't want to do this to yourself.
I would break these tables down as follows:
dbo.Users
UserID | UserName
1 | Jenny
2 | Microsoft
3 | Sage
dbo.UserTypes_Type
TypeID | TypeName
1 | Person
2 | IT
3 | Company
4 | Accounting Software
dbo.UserTypes
UserID | TypeID
1 | 1
2 | 2
2 | 3
3 | 2
3 | 3
3 | 4
You say that you are "creating" this - excellent because you have the opportunity to reconsider your whole approach.
Dealing with hierarchical data in a relational database is problematic because it is not designed for it - the model you choose to represent it will have a huge impact on the performance and ease of construction of your queries.
You have opted for an Adjacency List model, which is great for inserts (and deletes) but a bugger for selects, because the query has to effectively reconstruct the hierarchy path. By the way, an Adjacency List is the model almost everyone goes for on their first attempt.
Everything is a trade-off, so you should decide which queries will be most common: selects (and updates) or inserts (and deletes). See this question for starters. Also, since SQL Server 2008, there is a native hierarchyid data type (see this) which may be of assistance.
Of course, you could store your data in an XML file (in SQL Server or not) which is designed for hierarchical data.
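For completeness, if you do stay with the adjacency list, the "reconstruct the hierarchy" query is a recursive CTE. Below is a minimal sketch in Python with the built-in sqlite3 module, using the sample rows from the question (without the dbo. schema prefix); users_of_type is a made-up helper name. SQL Server accepts the same shape, written WITH instead of WITH RECURSIVE and with UNION ALL in place of UNION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE UserType (UserTypeID INTEGER PRIMARY KEY, UserTypeParentID INTEGER, Name TEXT);
    CREATE TABLE Users (UserID INTEGER PRIMARY KEY, UserTypeID INTEGER, Name TEXT);
    INSERT INTO UserType VALUES
        (1, NULL, 'Person'), (2, NULL, 'Company'), (3, 2, 'IT'), (4, 3, 'Accounting Software');
    INSERT INTO Users VALUES
        (1, 1, 'Jenny Smith'), (2, 1, 'Malcolm Brown'), (3, 2, 'Wall Mart'),
        (4, 3, 'Microsoft'), (5, 4, 'Sage');
    """
)


def users_of_type(conn, type_id):
    """Users whose type is type_id or any type nested below it."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtypes(UserTypeID) AS (
            -- anchor: the requested type itself
            SELECT UserTypeID FROM UserType WHERE UserTypeID = ?
            UNION
            -- recursive step: types whose parent is already in the set
            SELECT ut.UserTypeID
            FROM UserType ut
            JOIN subtypes s ON ut.UserTypeParentID = s.UserTypeID
        )
        SELECT u.Name
        FROM Users u
        JOIN subtypes s ON u.UserTypeID = s.UserTypeID
        ORDER BY u.UserID
        """,
        (type_id,),
    ).fetchall()
    return [r[0] for r in rows]


print(users_of_type(conn, 2))  # ['Wall Mart', 'Microsoft', 'Sage']
```

Searching for type 2 now returns Microsoft and Sage as well, because IT and Accounting Software are found by walking down from Company.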

Creating new table from data of other tables

I'm very new to SQL and I hope someone can help me with some SQL syntax. I have a database with these tables and fields,
DATA: data_id, person_id, attribute_id, date, value
PERSONS: person_id, parent_id, name
ATTRIBUTES: attribute_id, attribute_type
attribute_type can be "Height" or "Weight"
Question 1
Given a person's "Name", I would like to return a table of "Weight" measurements for each child, i.e. if John has 3 children named Alice, Bob and Carol, then I want a table like this
| date | Alice | Bob | Carol |
I know how to get a long list of children's weights like this:
select d.date,
d.value
from data d,
persons child,
persons parent,
attributes a
where parent.name='John'
and child.parent_id = parent.person_id
and d.person_id = child.person_id
and d.attribute_id = a.attribute_id
and a.attribute_type = 'Weight';
but I don't know how to create a new table that looks like:
| date | Child 1 name | Child 2 name | ... | Child N name |
Question 2
Also, I would like to select the attributes to be between a certain range.
Question 3
What happens if the dates are not consistent across the children? For example, suppose Alice is 3 years older than Bob, then there's no data for Bob during the first 3 years of Alice's life. How does the database handle this if we request all the data?
1) It might not be so easy. MS SQL Server can PIVOT a table on an axis, but dumping the resultset to an array and sorting there (assuming this is tied to some sort of program) might be the simpler way right now if you're new to SQL.
If you can manage to do it in SQL it still won't be enough info to create a new table, just return the data you'd use to fill it in, so some sort of external manipulation will probably be required. But you can probably just use INSERT INTO [new table] SELECT [...] to fill that new table from your select query, at least.
2) You can join on attributes for each unique attribute:
SELECT [...] FROM data AS d
JOIN persons AS p ON d.person_id = p.person_id
LEFT JOIN attributes AS weight ON d.attribute_id = weight.attribute_id
AND weight.attribute_type = 'Weight'
LEFT JOIN attributes AS height ON d.attribute_id = height.attribute_id
AND height.attribute_type = 'Height'
[...]
(The comma-separated joins in your original query are just shorthand for [INNER] JOIN .. ON; it's the same thing, except that the attribute filter moves from the WHERE clause into the ON clause. LEFT JOIN is used here so that a row matching only one of the two attribute types is still returned.)
3) It depends on the type of JOIN you use to match parent/child relationships, and any dates you're filtering on in the WHERE, if I'm reading that right (entirely possible I'm not). I'm not sure quite what you're looking for, or what kind of database you're using, so no good answer. If you're new enough to SQL that you don't know the different kinds of JOINs and what they can do, it's very worthwhile to learn them - they put the R in RDBMS.
When you do a select, you need to specify the exact columns you want. In other words, you can't return the Nth child's name, i.e. this isn't possible:
1/2/2010 | Child_1_name | Child_2_name | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name | Child_2_name
Each record needs to have the same number of columns. So you might be able to make a select that does this:
1/2/2010 | Child_1_name
1/2/2010 | Child_2_name
1/2/2010 | Child_3_name
1/3/2010 | Child_1_name
1/4/2010 | Child_1_name
1/4/2010 | Child_2_name
And then remap it in a report to display it the way you want.
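One way to get the date-by-child layout is to look up the children first and then build one aggregate column per child. Here is a minimal sketch in Python with the built-in sqlite3 module; the sample rows and the helper name weight_table are invented for illustration. It also answers Question 3 for this approach: a child with no measurement on a given date simply gets NULL (None in Python) in that cell.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE persons (person_id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY, attribute_type TEXT);
    CREATE TABLE data (data_id INTEGER PRIMARY KEY, person_id INTEGER,
                       attribute_id INTEGER, date TEXT, value REAL);
    INSERT INTO persons VALUES (1, NULL, 'John'), (2, 1, 'Alice'), (3, 1, 'Bob');
    INSERT INTO attributes VALUES (1, 'Height'), (2, 'Weight');
    INSERT INTO data (person_id, attribute_id, date, value) VALUES
        (2, 2, '2010-01-02', 20.0), (2, 2, '2010-01-03', 20.5),
        (3, 2, '2010-01-03', 4.0),  (2, 1, '2010-01-02', 110.0);
    """
)


def weight_table(conn, parent_name):
    """Return (header, rows) with one row per date and one column per child."""
    children = conn.execute(
        """
        SELECT c.person_id, c.name
        FROM persons c
        JOIN persons p ON c.parent_id = p.person_id
        WHERE p.name = ?
        ORDER BY c.person_id
        """,
        (parent_name,),
    ).fetchall()
    if not children:
        return ["date"], []
    # The ids are integers read from the database, so inlining them is safe here.
    ids = [int(pid) for pid, _ in children]
    columns = ", ".join(
        f"MAX(CASE WHEN d.person_id = {pid} THEN d.value END)" for pid in ids
    )
    id_list = ", ".join(str(pid) for pid in ids)
    sql = f"""
        SELECT d.date, {columns}
        FROM data d
        JOIN attributes a ON a.attribute_id = d.attribute_id
        WHERE a.attribute_type = 'Weight' AND d.person_id IN ({id_list})
        GROUP BY d.date
        ORDER BY d.date
    """
    header = ["date"] + [name for _, name in children]
    return header, conn.execute(sql).fetchall()


header, rows = weight_table(conn, "John")
print(header)
for row in rows:
    print(row)
```

A range filter for Question 2 would be one more condition in the WHERE clause, for example on d.value or d.date with BETWEEN.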

cloning hierarchical data

Let's assume I have a self-referencing hierarchical table built the classical way, like this one:
CREATE TABLE test
(name text,id serial primary key,parent_id integer
references test);
insert into test (name,id,parent_id) values
('root1',1,NULL),('root2',2,NULL),('root1sub1',3,1),('root1sub2',4,1),
('root2sub1',5,2),('root2sub2',6,2);
testdb=# select * from test;
name | id | parent_id
-----------+----+-----------
root1 | 1 |
root2 | 2 |
root1sub1 | 3 | 1
root1sub2 | 4 | 1
root2sub1 | 5 | 2
root2sub2 | 6 | 2
What I need now is a function (preferably in plain SQL) that would take the id of a test record and clone all attached records (including the given one). The cloned records need to have new ids, of course. The desired result would look like this, for example:
Select * from cloningfunction(2);
name | id | parent_id
-----------+----+-----------
root2 | 7 |
root2sub1 | 8 | 7
root2sub2 | 9 | 7
Any pointers? I'm using PostgreSQL 8.3.
Pulling this result in recursively is tricky (although possible). However, it's typically not very efficient and there is a much better way to solve this problem.
Basically, you augment the table with an extra column which traces the tree to the top - I'll call it the "Upchain". It's just a long string that looks something like this:
name | id | parent_id | upchain
root1 | 1 | NULL | 1:
root2 | 2 | NULL | 2:
root1sub1 | 3 | 1 | 1:3:
root1sub2 | 4 | 1 | 1:4:
root2sub1 | 5 | 2 | 2:5:
root2sub2 | 6 | 2 | 2:6:
root1sub1sub1 | 7 | 3 | 1:3:7:
It's very easy to keep this field updated by using a trigger on the table. (Apologies for terminology but I have always done this with SQL Server). Every time you add or delete a record, or update the parent_id field, you just need to update the upchain field on that part of the tree. That's a trivial job because you just take the upchain of the parent record and append the id of the current record. All child records are easily identified using LIKE to check for records with the starting string in their upchain.
What you're doing effectively is trading a bit of extra write activity for a big saving when you come to read the data.
When you want to select a complete branch in the tree it's trivial. Suppose you want the branch under node 1. Node 1 has an upchain '1:' so you know that any node in the branch of the tree under that node must have an upchain starting '1:...'. So you just do this:
SELECT *
FROM test
WHERE upchain LIKE '1:%'
This is extremely fast (index the upchain field of course). As a bonus it also makes a lot of activities extremely simple, such as finding partial trees, level within the tree, etc.
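Here is a minimal sketch of the upchain idea in Python with the built-in sqlite3 module. Note the assumption: the upchain is maintained in application code when a row is inserted, not by a trigger as described above, and add_node and branch are made-up helper names. The data is the sample table from this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER, upchain TEXT)"
)
conn.execute("CREATE INDEX idx_upchain ON test (upchain)")


def add_node(conn, node_id, name, parent_id=None):
    """Insert a node; its upchain is the parent's upchain plus its own id."""
    prefix = ""
    if parent_id is not None:
        prefix = conn.execute(
            "SELECT upchain FROM test WHERE id = ?", (parent_id,)
        ).fetchone()[0]
    conn.execute(
        "INSERT INTO test (id, name, parent_id, upchain) VALUES (?, ?, ?, ?)",
        (node_id, name, parent_id, f"{prefix}{node_id}:"),
    )


def branch(conn, node_id):
    """Names of the node and everything below it, found with one LIKE."""
    upchain = conn.execute(
        "SELECT upchain FROM test WHERE id = ?", (node_id,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT name FROM test WHERE upchain LIKE ? ORDER BY upchain",
        (upchain + "%",),
    ).fetchall()
    return [r[0] for r in rows]


add_node(conn, 1, "root1")
add_node(conn, 2, "root2")
add_node(conn, 3, "root1sub1", 1)
add_node(conn, 4, "root1sub2", 1)
add_node(conn, 5, "root2sub1", 2)
add_node(conn, 6, "root2sub2", 2)
add_node(conn, 7, "root1sub1sub1", 3)

print(branch(conn, 1))
# ['root1', 'root1sub1', 'root1sub1sub1', 'root1sub2']
```

The trailing colon matters: a prefix such as '1:' cannot accidentally match a node whose id is 11, because that node's upchain starts with '11:'.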
I've used this in applications that track large employee reporting hierarchies but you can use it for pretty much any tree structure (parts breakdown, etc.)
Notes (for anyone who's interested):
I haven't given a step-by-step of the SQL code but once you get the principle, it's pretty simple to implement. I'm not a great programmer so I'm speaking from experience.
If you already have data in the table you need to do a one time update to get the upchains synchronised initially. Again, this isn't difficult as the code is very similar to the UPDATE code in the triggers.
This technique is also a good way to identify circular references which can otherwise be tricky to spot.
Joe Celko's method, which is similar to njreed's answer but more generic, can be found here:
Nested-Set Model of Trees (at the middle of the article)
Nested-Set Model of Trees, part 2
Trees in SQL -- Part III
@Maximilian: You are right, we forgot your actual requirement. How about a recursive stored procedure? I am not sure if this is possible in PostgreSQL, but here is a working SQL Server version:
CREATE PROCEDURE CloneNode
@to_clone_id int, @parent_id int
AS
SET NOCOUNT ON
DECLARE @new_node_id int, @child_id int
INSERT INTO test (name, parent_id)
SELECT name, @parent_id FROM test WHERE id = @to_clone_id
SET @new_node_id = @@IDENTITY
DECLARE @children_cursor CURSOR
SET @children_cursor = CURSOR FOR
SELECT id FROM test WHERE parent_id = @to_clone_id
OPEN @children_cursor
FETCH NEXT FROM @children_cursor INTO @child_id
WHILE @@FETCH_STATUS = 0
BEGIN
EXECUTE CloneNode @child_id, @new_node_id
FETCH NEXT FROM @children_cursor INTO @child_id
END
CLOSE @children_cursor
DEALLOCATE @children_cursor
Your example is accomplished by EXECUTE CloneNode 2, null (the second parameter is the new parent node).
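The same copy-then-recurse logic can be sketched outside the database. This is a minimal Python illustration using the built-in sqlite3 module, not a PostgreSQL function; clone_node is a made-up name, and it relies on SQLite assigning the next free id to an INTEGER PRIMARY KEY column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (name TEXT, id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany(
    "INSERT INTO test (name, id, parent_id) VALUES (?, ?, ?)",
    [
        ("root1", 1, None), ("root2", 2, None),
        ("root1sub1", 3, 1), ("root1sub2", 4, 1),
        ("root2sub1", 5, 2), ("root2sub2", 6, 2),
    ],
)


def clone_node(conn, to_clone_id, new_parent_id=None):
    """Copy one node under new_parent_id, then copy its children under the copy."""
    row = conn.execute("SELECT name FROM test WHERE id = ?", (to_clone_id,)).fetchone()
    if row is None:
        raise ValueError(f"no node with id {to_clone_id}")
    cur = conn.execute(
        "INSERT INTO test (name, parent_id) VALUES (?, ?)", (row[0], new_parent_id)
    )
    new_id = cur.lastrowid
    # Read the children before recursing so freshly inserted copies are never revisited.
    children = conn.execute(
        "SELECT id FROM test WHERE parent_id = ? ORDER BY id", (to_clone_id,)
    ).fetchall()
    for (child_id,) in children:
        clone_node(conn, child_id, new_id)
    return new_id


new_root = clone_node(conn, 2)
for row in conn.execute("SELECT name, id, parent_id FROM test WHERE id >= ? ORDER BY id", (new_root,)):
    print(row)
# ('root2', 7, None)
# ('root2sub1', 8, 7)
# ('root2sub2', 9, 7)
```

This costs one round trip per node, so it is only reasonable for small subtrees; for large ones a set-based approach would be worth the extra effort.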
This sounds like an exercise from "SQL For Smarties" by Joe Celko...
I don't have my copy handy, but I think it's a book that'll help you quite a bit if this is the kind of problems you need to solve.