cloning hierarchical data - sql

Let's assume I have a self-referencing hierarchical table built the classical way, like this one:
CREATE TABLE test
(name text, id serial primary key, parent_id integer
references test);
insert into test (name, id, parent_id) values
('root1',1,NULL),('root2',2,NULL),('root1sub1',3,1),('root1sub2',4,1),
('root2sub1',5,2),('root2sub2',6,2);
testdb=# select * from test;
name | id | parent_id
-----------+----+-----------
root1 | 1 |
root2 | 2 |
root1sub1 | 3 | 1
root1sub2 | 4 | 1
root2sub1 | 5 | 2
root2sub2 | 6 | 2
What I need now is a function (preferably in plain SQL) that takes the id of a test record and
clones all attached records (including the given one). The cloned records need new ids, of course. The desired result
would look like this, for example:
Select * from cloningfunction(2);
name | id | parent_id
-----------+----+-----------
root2 | 7 |
root2sub1 | 8 | 7
root2sub2 | 9 | 7
Any pointers? I'm using PostgreSQL 8.3.

Pulling this result in recursively is tricky (although possible). However, it's typically not very efficient and there is a much better way to solve this problem.
Basically, you augment the table with an extra column which traces the tree to the top - I'll call it the "Upchain". It's just a long string that looks something like this:
name | id | parent_id | upchain
root1 | 1 | NULL | 1:
root2 | 2 | NULL | 2:
root1sub1 | 3 | 1 | 1:3:
root1sub2 | 4 | 1 | 1:4:
root2sub1 | 5 | 2 | 2:5:
root2sub2 | 6 | 2 | 2:6:
root1sub1sub1 | 7 | 3 | 1:3:7:
It's very easy to keep this field updated by using a trigger on the table. (Apologies for terminology but I have always done this with SQL Server). Every time you add or delete a record, or update the parent_id field, you just need to update the upchain field on that part of the tree. That's a trivial job because you just take the upchain of the parent record and append the id of the current record. All child records are easily identified using LIKE to check for records with the starting string in their upchain.
What you're doing effectively is trading a bit of extra write activity for a big saving when you come to read the data.
When you want to select a complete branch in the tree it's trivial. Suppose you want the branch under node 1. Node 1 has an upchain '1:' so you know that any node in the branch of the tree under that node must have an upchain starting '1:...'. So you just do this:
SELECT *
FROM test
WHERE upchain LIKE '1:%'
This is extremely fast (index the upchain field of course). As a bonus it also makes a lot of activities extremely simple, such as finding partial trees, level within the tree, etc.
I've used this in applications that track large employee reporting hierarchies but you can use it for pretty much any tree structure (parts breakdown, etc.)
Notes (for anyone who's interested):
I haven't given a step-by-step of the SQL code but once you get the principle, it's pretty simple to implement. I'm not a great programmer so I'm speaking from experience.
If you already have data in the table you need to do a one time update to get the upchains synchronised initially. Again, this isn't difficult as the code is very similar to the UPDATE code in the triggers.
This technique is also a good way to identify circular references which can otherwise be tricky to spot.
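As a rough illustration of the upchain idea, here is a minimal sketch in Python using the standard-library sqlite3 module instead of SQL Server or PostgreSQL triggers. The helper names backfill_upchains and branch_ids are made up for this example, and the backfill assumes the existing data contains no circular references.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (
        name      TEXT,
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES test,
        upchain   TEXT                      -- path from the root, e.g. '1:3:7:'
    );
    INSERT INTO test (name, id, parent_id) VALUES
        ('root1', 1, NULL), ('root2', 2, NULL),
        ('root1sub1', 3, 1), ('root1sub2', 4, 1),
        ('root2sub1', 5, 2), ('root2sub2', 6, 2),
        ('root1sub1sub1', 7, 3);
""")

def backfill_upchains(conn):
    """One-time fill of upchain for existing rows (assumes no cycles)."""
    parent_of = dict(conn.execute("SELECT id, parent_id FROM test"))
    for node_id in parent_of:
        chain, current = [], node_id
        while current is not None:          # walk up to the root
            chain.append(str(current))
            current = parent_of[current]
        upchain = ":".join(reversed(chain)) + ":"
        conn.execute("UPDATE test SET upchain = ? WHERE id = ?", (upchain, node_id))

def branch_ids(conn, node_id):
    """Ids of the node and everything below it, found with a prefix LIKE."""
    (prefix,) = conn.execute(
        "SELECT upchain FROM test WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT id FROM test WHERE upchain LIKE ? ORDER BY id", (prefix + "%",)
    )
    return [row[0] for row in rows]

backfill_upchains(conn)
print(branch_ids(conn, 1))  # [1, 3, 4, 7]
```

In a real deployment the upchain would be maintained by triggers as described above; the Python loop here only stands in for the one-time synchronisation step.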

Joe Celko's method, which is similar to njreed's answer but more generic, can be found here:
Nested-Set Model of Trees (at the middle of the article)
Nested-Set Model of Trees, part 2
Trees in SQL -- Part III

@Maximilian: You are right, we forgot your actual requirement. How about a recursive stored procedure? I am not sure if this is possible in PostgreSQL, but here is a working SQL Server version:
CREATE PROCEDURE CloneNode
    @to_clone_id int, @parent_id int
AS
SET NOCOUNT ON
DECLARE @new_node_id int, @child_id int
INSERT INTO test (name, parent_id)
    SELECT name, @parent_id FROM test WHERE id = @to_clone_id
SET @new_node_id = @@IDENTITY
DECLARE @children_cursor CURSOR
SET @children_cursor = CURSOR FOR
    SELECT id FROM test WHERE parent_id = @to_clone_id
OPEN @children_cursor
FETCH NEXT FROM @children_cursor INTO @child_id
WHILE @@FETCH_STATUS = 0
BEGIN
    EXECUTE CloneNode @child_id, @new_node_id
    FETCH NEXT FROM @children_cursor INTO @child_id
END
CLOSE @children_cursor
DEALLOCATE @children_cursor
Your example is accomplished by EXECUTE CloneNode 2, null (the second parameter is the new parent node).
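For readers without SQL Server at hand, the same recursive idea can be sketched in Python with the standard-library sqlite3 module. This is only an illustration, not a PostgreSQL 8.3 solution: the function name clone_node is made up here, SQLite's INTEGER PRIMARY KEY stands in for the serial column, and the sketch assumes the new parent is not inside the subtree being cloned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (
        name      TEXT,
        id        INTEGER PRIMARY KEY,      -- auto-assigned when omitted
        parent_id INTEGER REFERENCES test
    );
    INSERT INTO test (name, id, parent_id) VALUES
        ('root1', 1, NULL), ('root2', 2, NULL),
        ('root1sub1', 3, 1), ('root1sub2', 4, 1),
        ('root2sub1', 5, 2), ('root2sub2', 6, 2);
""")

def clone_node(conn, to_clone_id, new_parent_id=None):
    """Copy one node under new_parent_id, then recurse into its children.

    Assumes new_parent_id is not a descendant of to_clone_id.
    Returns the id of the copy.
    """
    row = conn.execute(
        "SELECT name FROM test WHERE id = ?", (to_clone_id,)
    ).fetchone()
    if row is None:
        raise ValueError(f"no node with id {to_clone_id}")
    # Read the original children before inserting anything new.
    children = conn.execute(
        "SELECT id FROM test WHERE parent_id = ? ORDER BY id", (to_clone_id,)
    ).fetchall()
    cur = conn.execute(
        "INSERT INTO test (name, parent_id) VALUES (?, ?)", (row[0], new_parent_id)
    )
    new_id = cur.lastrowid
    for (child_id,) in children:
        clone_node(conn, child_id, new_id)
    return new_id

new_root = clone_node(conn, 2)
clones = conn.execute(
    "SELECT name, id, parent_id FROM test WHERE id >= ? ORDER BY id", (new_root,)
).fetchall()
print(clones)  # [('root2', 7, None), ('root2sub1', 8, 7), ('root2sub2', 9, 7)]
```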

This sounds like an exercise from "SQL For Smarties" by Joe Celko...
I don't have my copy handy, but I think it's a book that'll help you quite a bit if this is the kind of problem you need to solve.

Related

Open sums in SQL / dynamic selection of tables

Much ink has been spilled on the topic of sum types in SQL. The standard solutions are called absorption, separation, and partition; see, e.g.: https://www.inf.unibz.it/~montali/teaching/1415/dpm/slides/4.relational-mapping.pdf .
I want to ask about how to encode open sums. Normal sums allow a field to be one of a fixed set of several different types; with open sums, this set is not fixed.
The basic setup in our program: There is a list of "triggers," where each trigger can be one of many different things. Plugins can be written defining new trigger types, although the set of trigger types can be assumed to be known at compile time.
We want a table of all triggers.
Our current best idea:
1. Dynamically create a materialized view of the following form:
id | id_in_plugin_table | thing_in_main_program_it_refs | plugin_name
---------------------------------------------------------------------
1 | 27 | 8 | RegexTrigger
2 | 27 | 12 | RidiculouslyUnsafeCustomJSTrigger
This relation is automatically generated from the various plugin tables, each of which has its own ID and a thing_in_main_program_it_refs field.
For illustration, here's what the referenced tables may look like.
RegexTrigger table:
id | thing_in_main_program_it_refs | regex
---------------------------------------------------------------------
27 | 8 | hel*o
RidiculouslyUnsafeCustomJSTrigger table:
id | thing_in_main_program_it_refs | custom_js
---------------------------------------------------------------------
27 | 12 | (x) => isPrime(x.length())
2. Either use two roundtrips to look up the plugin table and then query it, or combine them into a single SQL program which uses EXEC.
I'm happy with part 1, but not with part 2. Neither option sounds efficient, and the latter option uses EXEC.
So, we're looking for either (a) a better way to dynamically select a table in a query, or (b) a different approach to open sums.

Check if a value exists in the child-parent tree

I'm creating a simple directory listing page where you can specify what kind of thing you want to list in the directory e.g. a person or a company.
Each user has a UserTypeID and there is a dbo.UserType lookup table. The dbo.UserType lookup table is like this:
UserTypeID | UserTypeParentID | Name
1          | NULL             | Person
2          | NULL             | Company
3          | 2                | IT
4          | 3                | Accounting Software
In the dbo.Users table we have records like this:
UserID | UserTypeID | Name
1      | 1          | Jenny Smith
2      | 1          | Malcolm Brown
3      | 2          | Wall Mart
4      | 3          | Microsoft
5      | 4          | Sage
My SQL (so far) is very simple (excuse the pseudo-code style):
DECLARE @UserTypeID int
SELECT
    *
FROM
    dbo.Users u
INNER JOIN
    dbo.UserType ut ON ut.UserTypeID = u.UserTypeID
WHERE
    ut.UserTypeID = @UserTypeID
The problem here is that when people want to search for companies they will enter '2' as the UserTypeID. But neither Microsoft nor Sage will show up, because their UserTypeIDs are 3 and 4 respectively. It's the final UserTypeParentID that tells me they're both Companies.
How could I rewrite the SQL to return records where the UserTypeID = @UserTypeID or where its final UserTypeParentID is also equal to @UserTypeID? Or am I going about this the wrong way?
Schema Change
I would suggest breaking this schema down a little more to make your queries and your life simpler. With the current schema you will end up writing a recursive query every time you want to get the simplest data from your Users table, and trust me, you don't want to do this to yourself.
I would break the schema down into these tables as follows:
dbo.Users
UserID | UserName
1 | Jenny
2 | Microsoft
3 | Sage
dbo.UserTypes_Type
TypeID | TypeName
1 | Person
2 | IT
3 | Company
4 | Accounting Software
dbo.UserTypes
UserID | TypeID
1 | 1
2 | 2
2 | 3
3 | 2
3 | 3
3 | 4
You say that you are "creating" this - excellent because you have the opportunity to reconsider your whole approach.
Dealing with hierarchical data in a relational database is problematic because it is not designed for it - the model you choose to represent it will have a huge impact on the performance and ease of construction of your queries.
You have opted for an Adjacency List model, which is great for inserts (and deletes) but a bugger for selects because the query has to effectively reconstruct the hierarchy path. By the way, an Adjacency List is the model almost everyone goes for on their first attempt.
Everything is a trade-off, so you should decide which queries will be most common: selects (and updates) or inserts (and deletes). See this question for starters. Also, since SQL Server 2008, there is a native hierarchyid data type (see this) which may be of assistance.
Of course, you could store your data in an XML file (in SQL Server or not) which is designed for hierarchical data.
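If the adjacency list is kept as it is, the recursive query mentioned above might look roughly like the following. This is a sketch run through Python's standard-library sqlite3 module rather than SQL Server, with the table names simplified (no dbo. prefix) and a made-up helper users_of_type. SQL Server accepts the same common table expression without the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserType (
        UserTypeID INTEGER PRIMARY KEY, UserTypeParentID INTEGER, Name TEXT);
    CREATE TABLE Users (
        UserID INTEGER PRIMARY KEY, UserTypeID INTEGER, Name TEXT);
    INSERT INTO UserType VALUES
        (1, NULL, 'Person'), (2, NULL, 'Company'),
        (3, 2, 'IT'), (4, 3, 'Accounting Software');
    INSERT INTO Users VALUES
        (1, 1, 'Jenny Smith'), (2, 1, 'Malcolm Brown'),
        (3, 2, 'Wall Mart'), (4, 3, 'Microsoft'), (5, 4, 'Sage');
""")

# Collect the requested type plus all of its descendant types,
# then join the users against that set.
USERS_IN_TYPE_TREE = """
WITH RECURSIVE type_tree(UserTypeID) AS (
    SELECT UserTypeID FROM UserType WHERE UserTypeID = ?
    UNION ALL
    SELECT ut.UserTypeID
    FROM UserType AS ut
    JOIN type_tree AS tt ON ut.UserTypeParentID = tt.UserTypeID
)
SELECT u.Name
FROM Users AS u
JOIN type_tree AS tt ON tt.UserTypeID = u.UserTypeID
ORDER BY u.UserID
"""

def users_of_type(conn, user_type_id):
    """Names of users whose type is user_type_id or any type beneath it."""
    return [name for (name,) in conn.execute(USERS_IN_TYPE_TREE, (user_type_id,))]

print(users_of_type(conn, 2))  # ['Wall Mart', 'Microsoft', 'Sage']
```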

Creating a list tree with SQLite

I'm trying to make a hierarchical list with PHP and an SQLite table setup like this:
| itemid | parentid | name |
-----------------------------------------
| 1 | null | Item1 |
| 2 | null | Item2 |
| 3 | 1 | Item3 |
| 4 | 1 | Item4 |
| 5 | 2 | Item5 |
| 6 | 5 | Item6 |
The lists would be built with unordered lists and allow for this type of tree structure:
Item1
|_Item3
|_Item4
Item2
|_Item5
|_Item6
I've seen this done with directories and flat arrays, but I can't seem to make it work right with this structure and without a depth limit.
You're using a textbook design for storing hierarchical data in an SQL database. This design is called Adjacency List, i.e. each node in the hierarchy has a parentid foreign key to its immediate parent.
With this design, you can't generate a tree like you describe and support arbitrary depth for the tree. You've already figured this out.
Most other SQL databases (PostgreSQL, Microsoft SQL Server, Oracle, IBM DB2) support recursive queries, which solve this problem.
Update:
MySQL supports recursive CTE queries from version 8.0.1 (2017-04-10). See https://dev.mysql.com/doc/refman/8.0/en/with.html
SQLite supports recursive CTE queries from version 3.8.3 (2014-02-03). See https://www.sqlite.org/lang_with.html
If you use an older version, you need another solution to store the hierarchy. There are several solutions for this. See my presentation Models for Hierarchical Data with PHP and MySQL for descriptions and examples.
I usually prefer a design I call Closure Table, but each design has strengths and weaknesses. Which one is best for your project depends on what kinds of queries you need to do efficiently with your data. So you should go study the solutions and choose one for yourself.
I know this was asked a long time ago, but with a current SQLite version this is trivial to do, and there is no need for a depth limit as @Bill-Karwin says. So the correct answer should be reconsidered :)
My table has columns TMPLID and REF_TMPLID, and the starting node of my structure is called Root.
CREATE TABLE MyStruct (
`TMPLID` text,
`REF_TMPLID` text
);
INSERT INTO MyStruct
(`TMPLID`, `REF_TMPLID`)
VALUES
('Root', NULL),
('Item1', 'Root'),
('Item2', 'Root'),
('Item3', 'Item1'),
('Item4', 'Item1'),
('Item5', 'Item2'),
('Item6', 'Item5');
And here is the main query, that builds tree structure
WITH RECURSIVE
under_root(name,level) AS (
VALUES('Root',0)
UNION ALL
SELECT tmpl.TMPLID, under_root.level+1
FROM MyStruct as tmpl JOIN under_root ON tmpl.REF_TMPLID=under_root.name
ORDER BY 2 DESC
)
SELECT substr('....................',1,level*3) || name as TreeStructure FROM under_root
And here is result
Root
...Item1
......Item3
......Item4
...Item2
......Item5
.........Item6
I'm sure this can be modified to work with the OP's table structure, so let this sample be a starting point.
Documentation and some samples https://www.sqlite.org/lang_with.html#rcex1
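One possible adaptation to the table from the question (itemid, parentid, name), sketched in Python with the standard-library sqlite3 module rather than PHP. The table name items and the CTE name tree are assumptions made for this example. The ORDER BY inside the recursive part sorts by depth descending and then by itemid, which yields a depth-first listing with siblings in a stable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (itemid INTEGER PRIMARY KEY, parentid INTEGER, name TEXT);
    INSERT INTO items VALUES
        (1, NULL, 'Item1'), (2, NULL, 'Item2'),
        (3, 1, 'Item3'), (4, 1, 'Item4'),
        (5, 2, 'Item5'), (6, 5, 'Item6');
""")

TREE_QUERY = """
WITH RECURSIVE tree(itemid, name, level) AS (
    SELECT itemid, name, 0 FROM items WHERE parentid IS NULL
    UNION ALL
    SELECT i.itemid, i.name, tree.level + 1
    FROM items AS i
    JOIN tree ON i.parentid = tree.itemid
    ORDER BY 3 DESC, 1          -- deepest rows first => depth-first walk
)
SELECT name, level FROM tree
"""

# Indent each name by its depth, mirroring the dotted output above.
lines = ["..." * level + name for name, level in conn.execute(TREE_QUERY)]
print("\n".join(lines))
# Item1
# ...Item3
# ...Item4
# Item2
# ...Item5
# ......Item6
```

The level column is what a PHP loop would use to decide when to open or close a nested unordered list.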

Recursion & MYSQL?

I got a really simple table structure like this:
Table Page Hits
id | title | parent | hits
---------------------------
1 | Root | | 23
2 | Child | 1 | 20
3 | ChildX | 1 | 30
4 | GChild | 2 | 40
As I don't want to have the recursion in my code, I would like to do it with recursive SQL.
Is there any SELECT statement to get the sum of Root (23+20+30+40) or Child ( 20 + 40 ) ?
You are organizing your hierarchical data using the adjacency list model. The fact that such recursive operations are difficult is in fact one major drawback of this model.
Some DBMSes, such as SQL Server 2005, Postgres 8.4 and Oracle 11g, support recursive queries using common table expressions with the WITH keyword.
As for MySQL, you may be interested in checking out the following article which describes an alternative model (the nested set model), which makes recursive operations easier (possible):
Mike Hillyer: Managing Hierarchical Data in MySQL
Not in 1 select statement, no.
If you knew the maximum depth of the relationship (i.e. parent->child->child or parent->child->child->child) you could write a select statement which would give you a bunch of numbers that you would then have to sum up separately (1 number per level of depth).
You could, however, do it with a mysql stored procedure which is recursive.
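On engines that do support recursive common table expressions (MySQL added WITH RECURSIVE in the 8.0 series), the subtree sum fits in a single statement. Below is a minimal sketch run through Python's standard-library sqlite3 module; the table name page_hits and the helper subtree_hits are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_hits (
        id INTEGER PRIMARY KEY, title TEXT, parent INTEGER, hits INTEGER);
    INSERT INTO page_hits VALUES
        (1, 'Root',   NULL, 23),
        (2, 'Child',  1,    20),
        (3, 'ChildX', 1,    30),
        (4, 'GChild', 2,    40);
""")

# Start at the requested page, pull in every descendant, then sum the hits.
SUBTREE_HITS = """
WITH RECURSIVE subtree(id, hits) AS (
    SELECT id, hits FROM page_hits WHERE id = ?
    UNION ALL
    SELECT p.id, p.hits
    FROM page_hits AS p
    JOIN subtree AS s ON p.parent = s.id
)
SELECT SUM(hits) FROM subtree
"""

def subtree_hits(conn, page_id):
    """Total hits of a page plus all pages below it."""
    return conn.execute(SUBTREE_HITS, (page_id,)).fetchone()[0]

print(subtree_hits(conn, 1))  # 113  (23 + 20 + 30 + 40)
print(subtree_hits(conn, 2))  # 60   (20 + 40)
```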

How should I go about implementing an "autonumber" field in SQL Server 2005?

I'm aware of IDENTITY fields but I have a feeling that I couldn't use one to solve my problem.
Let's say I have multiple clients. Each client has multiple orders. Each client needs to have their orders numbered sequentially, specific to them.
Example table structure:
Orders:
OrderID | ClientID | ClientOrderID | etc...
Some example rows for this table would be:
OrderID | ClientID | ClientOrderID | etc...
1 | 1 | 1 | ...
2 | 1 | 2 | ...
3 | 2 | 1 | ...
4 | 3 | 1 | ...
5 | 1 | 3 | ...
6 | 2 | 2 | ...
I know the naive way would be to take the MAX ClientOrderID for any client and use that value for INSERTs, but that would be subject to concurrency issues. I was considering using a transaction, but I'm not quite sure what the broadest isolation scope is that can be used for this. I'll be using LINQ to SQL, but I have a feeling that isn't relevant.
Somebody correct me if I'm wrong, but as long as your MAX() call is in the same step as your insert, you won't have a problem with concurrency.
So, you could not do
select @newOrderID=max(ClientOrderID) + 1
from orders
where clientid=@myClientID;
insert into orders ( ClientID, ClientOrderID, ...)
values( @myClientID, @newOrderID, ...);
But you can do
insert into orders ( ClientID, ClientOrderID, ...)
select @myClientID, isnull(max(ClientOrderID), 0) + 1, ...
from orders
where clientid=@myClientID;
I'm assuming OrderID is an identity column.
Again, if I'm incorrect on this, please let me know. Preferably with a URL
You could use a Repository pattern to handle your Orders and let it control the numbering of each client's orders. If you implement the OrderRepository correctly, it could control the concurrency and number the order before saving it to the database (let the repository and not the db set the number).
Repository pattern: http://martinfowler.com/eaaCatalog/repository.html
One possibility (though I don't like to do this) is to have a lookup table that would tell you the greatest Order Number given for each vendor. Inside of a transaction, you'd fetch the most recent one from VendorOrderNumber, save your new order, increment the value in VendorOrderNumber, commit transaction.
This is an odd way to store data, but assuming you need it, there is nothing built-in that you can use.
Your suggestion of Max(ClientOrderID) is straightforward and pretty easy to implement (follow John MacIntyre's advice). It will probably work acceptably well on tables with a few thousand orders. As the table grows this approach will of course slow down.
Nick DeVore's suggestion of a lookup table is a little messier to implement but won't substantially be affected by data growth.
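The lookup-table approach could be sketched as follows, here in Python with the standard-library sqlite3 module rather than SQL Server and LINQ to SQL. The table ClientOrderCounter, its columns, and the function add_order are hypothetical names chosen for this example. On a server database, the UPDATE of the counter row inside the transaction is typically what makes concurrent inserts for the same client wait for each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (
        OrderID       INTEGER PRIMARY KEY,
        ClientID      INTEGER NOT NULL,
        ClientOrderID INTEGER NOT NULL,
        UNIQUE (ClientID, ClientOrderID)    -- safety net against duplicates
    );
    -- Hypothetical lookup table: the last number handed out per client.
    CREATE TABLE ClientOrderCounter (
        ClientID    INTEGER PRIMARY KEY,
        LastOrderNo INTEGER NOT NULL
    );
""")

def add_order(conn, client_id):
    """Insert an order and return its per-client sequence number."""
    with conn:  # one transaction: commit on success, roll back on error
        # Make sure the client has a counter row, then bump it.
        conn.execute(
            "INSERT OR IGNORE INTO ClientOrderCounter (ClientID, LastOrderNo) "
            "VALUES (?, 0)", (client_id,))
        conn.execute(
            "UPDATE ClientOrderCounter SET LastOrderNo = LastOrderNo + 1 "
            "WHERE ClientID = ?", (client_id,))
        (order_no,) = conn.execute(
            "SELECT LastOrderNo FROM ClientOrderCounter WHERE ClientID = ?",
            (client_id,)).fetchone()
        conn.execute(
            "INSERT INTO Orders (ClientID, ClientOrderID) VALUES (?, ?)",
            (client_id, order_no))
    return order_no

numbers = [add_order(conn, client) for client in (1, 1, 2, 3, 1, 2)]
print(numbers)  # [1, 2, 1, 1, 3, 2]
```

The sequence of calls reproduces the ClientOrderID column of the example rows in the question.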
Depending on where/when you actually need the ClientOrderID, you could calculate the id when needed like this:
SELECT *,
ROW_NUMBER() OVER(ORDER BY OrderID) AS ClientOrderID
FROM Orders
WHERE ClientID = 1
This assumes that the ClientOrderIDs are in the same sequence as the OrderID. Without actually persisting the ID, it is awkward to use as a key to anything else. This approach should not be affected by data growth.