Store multidimensional array in database: relational or multidimensional? - sql

I have read numerous posts along the lines of multidimensional to single dimension, multidimensional database, and so on, but none of the answers helped. I did find a lot of documentation on Google but that only provided background information and didn't answer the question at hand.
I have a lot of strings that are related to one another. They are needed in a PHP script. The structure is hierarchical. Here is an example.
A:
AA:
AAA
AAC
AB
AE:
AEA
AEE:
AEEB
B:
BA:
BAA
BD:
BDC:
BDCB
BDCE
BDD:
BDDA
BE:
BED:
BEDA
C:
CC:
CCB:
CCBC
CCBE
CCC:
CCCA
CCCE
CE
Each indent supposes a new level in the multidimensional array.
The goal is to retrieve an element with PHP by name and all its descendants. If for instance I query for A, I want to receive an array of string containing array('A', 'AA', 'AAA', 'AAC', 'AB', 'AE', 'AEA', 'AEE', 'AEEB'). The 'issue' is that queries can also be made to lower-level elements. If I query AEE, I want to get array('AEE', 'AEEB').
As I understand the concept of relational databases, this means that I cannot use a relational database because there is no common 'key' between elements. The solution that I thought is possible, is assigning PARENT elements to each cell. So, in a table:
CELL | PARENT
A NULL
AA A
AAA AA
AAC AA
AB A
AE A
AEA AE
AEE AE
AEEB AEE
By doing so, I think you should be able to query the given string, and all items that share this parent, and then recursively go down this path until no more items are found. However, this seems rather slow to me because the whole search space would need to be looked through on each level - which is exactly what you don't want in a multidimensional array.
So I am a bit at loss. Note that there are actually around 100,000 strings structured in this way, so speed is important. Luckily the database is static and would not change. How can I store such a data structure in a database without having to deal with long loops and search times? And which kind of database software and data type is best suited for this? It has come to my attention that PostgreSQL is already present on our servers so I'd rather stick with that.
As I said I am new to databases but I am very eager to learn. Therefore, I am looking for an extensive answer that goes into detail and provides advantages and disadvantages of a certain approach. Performance is key. An expected answer would contain the best database type and language for this use case, and also script in that language to build such a structure.

The goal is to retrieve an element with PHP by name and all its descendants.
If that is all you need, you can use a LIKE search
SELECT *
FROM Table1
WHERE CELL LIKE 'AEE%';
With an index beginning with CELL this is a range check, which is fast.
If your data doesn't look like that, you can create a path column which looks like a directory path and contains all nodes "on the way/path" from root to the element.
| id | CELL | parent_id | path |
|====|======|===========|==========|
| 1 | A | NULL | 1/ |
| 2 | AA | 1 | 1/2/ |
| 3 | AAA | 2 | 1/2/3/ |
| 4 | AAC | 2 | 1/2/4/ |
| 5 | AB | 1 | 1/5/ |
| 6 | AE | 1 | 1/6/ |
| 7 | AEA | 6 | 1/6/7/ |
| 8 | AEE | 6 | 1/6/8/ |
| 9 | AEEB | 8 | 1/6/8/9/ |
To retrieve all descendants of 'AE' (including itself) your query would be
SELECT *
FROM tree t
WHERE path LIKE '1/6/%';
or (MySQL specific concatenation)
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
AND t.path LIKE CONCAT(r.path, '%');
Result:
| id | CELL | parent_id | path |
|====|======|===========|==========|
| 6 | AE | 1 | 1/6/ |
| 7 | AEA | 6 | 1/6/7/ |
| 8 | AEE | 6 | 1/6/8/ |
| 9 | AEEB | 8 | 1/6/8/9/ |
Demo
Performance
I have created 100K rows of fake data on MariaDB with the sequence plugin using the following script:
drop table if exists tree;
CREATE TABLE tree (
`id` int primary key,
`CELL` varchar(50),
`parent_id` int,
`path` varchar(255),
unique index (`CELL`),
unique index (`path`)
);
DROP TRIGGER IF EXISTS `tree_after_insert`;
DELIMITER //
CREATE TRIGGER `tree_after_insert` BEFORE INSERT ON `tree` FOR EACH ROW BEGIN
if new.id = 1 then
set new.path := '1/';
else
set new.path := concat((
select path from tree where id = new.parent_id
), new.id, '/');
end if;
END//
DELIMITER ;
insert into tree
select seq as id
, conv(seq, 10, 36) as CELL
, case
when seq = 1 then null
else floor(rand(1) * (seq-1)) + 1
end as parent_id
, null as path
from seq_1_to_100000
;
DROP TRIGGER IF EXISTS `tree_after_insert`;
-- runtime ~ 4 sec.
Tests
Count all elements under the root:
SELECT count(*)
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '1'
AND t.path LIKE CONCAT(r.path, '%');
-- result: 100000
-- runtime: ~ 30 ms
Get subtree elements under a specific node:
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '3B0'
AND t.path LIKE CONCAT(r.path, '%');
-- runtime: ~ 30 ms
Result:
| id | CELL | parent_id | path |
|=======|======|===========|=====================================|
| 4284 | 3B0 | 614 | 1/4/11/14/614/4284/ |
| 6560 | 528 | 4284 | 1/4/11/14/614/4284/6560/ |
| 8054 | 67Q | 6560 | 1/4/11/14/614/4284/6560/8054/ |
| 14358 | B2U | 6560 | 1/4/11/14/614/4284/6560/14358/ |
| 51911 | 141Z | 4284 | 1/4/11/14/614/4284/51911/ |
| 55695 | 16Z3 | 4284 | 1/4/11/14/614/4284/55695/ |
| 80172 | 1PV0 | 8054 | 1/4/11/14/614/4284/6560/8054/80172/ |
| 87101 | 1V7H | 51911 | 1/4/11/14/614/4284/51911/87101/ |
PostgreSQL
This also works for PostgreSQL. Only the string concatenation syntax has to be changed:
SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
AND t.path LIKE r.path || '%';
Demo: sqlfiddle - rextester
How does the search work
If you look at the test example, you'll see that all paths in the result begin with '1/4/11/14/614/4284/'. That is the path of the subtree root with CELL='3B0'. If the path column is indexed, the engine will find them all efficiently, because the index is sorted by path. It's like you would want to find all the words that begin with 'pol' in a dictionary with 100K words. You wouldn't need to read the entire dictionary.

Performance
As others have already mentioned, performance shouldn't be an issue as long as you use a suitable indexed primary key and ensure that relations use foreign keys. In general, an RDBMS is highly optimised to efficiently perform joins on indexed columns and referential integrity can also provide the advantage of preventing orphans. 100,000 may sound a lot of rows but this isn't going to stretch an RDBMS as long as the table structure and queries are well designed.
Choice of RDBMS
One factor in answering this question lies in choosing a database with the ability to perform a recursive query via a Common Table Expression (CTE), which can be very useful to keep the queries compact or essential if there are queries that do not limit the number of descendants being traversed.
Since you've indicated that you are free to choose the RDBMS but it must run under Linux, I'm going to throw PostgreSQL out there as a suggestion since it has this feature and is freely available. (This choice is of course very subjective and there are advantages and disadvantages of each but a few other contenders I'd be tempted to rule out are MySQL since it doesn't currently support CTEs, MariaDB since it doesn't currently support *recursive* CTEs, SQL Server since it doesn't currently support Linux. Other possibilities such as Oracle may be dependent on budget / existing resources.)
SQL
Here's an example of the SQL you'd write to perform your first example of finding all the descendants of 'A':
WITH RECURSIVE rcte AS (
SELECT id, letters
FROM cell
WHERE letters = 'A'
UNION ALL
SELECT c.id, c.letters
FROM cell c
INNER JOIN rcte r
ON c.parent_cell_id = r.id
)
SELECT letters
FROM rcte
ORDER BY letters;
Explanation
The above SQL sets up a "Common Table Expression", i.e. a SELECT to run whenever its alias (in this case rcte) is referenced. The recursion happens because this is referenced within itself. The first part of the UNION picks the cell at the top of the hierarchy. Its descendants are all found by carrying on joining on children in the second part of the UNION until no further records are found.
Demo
The above query can be seen in action on the sample data here: http://rextester.com/HVY63888

You absolutely can do that (if I've read your question correctly).
Depending on your RDBMS you might have to choose a different way.
Your basic structure of having a parent is correct.
SQL Server use recursive common table expression (CTE) to anchor the start and work down
https://technet.microsoft.com/en-us/library/ms186243(v=sql.105).aspx
Edit: For Linux use the same in PostgreSQL https://www.postgresql.org/docs/current/static/queries-with.html
Oracle has a different approach, though I think you might be able to use the CTE as well.
https://oracle-base.com/articles/misc/hierarchical-queries
For 100k rows I don't imagine performance will be an issue, though I'd still index PK & FK because that's the right thing to do. If you're really concerned about speed then reading it into memory and building a hash table of linked lists might work.
Pros & cons - it pretty much comes down to readability and suitability for your RDBMS.
It's an already solved problem (again, assuming I've not missed anything) so you'll be fine.

I have two words for you... "RANGE KEYS"
You may find this technique to be incredibly powerful and flexible. You'll be able to navigate your hierarchies with ease, and support variable depth aggregation without the need for recursion.
In the demonstration below, we'll build the hierarchy via a recursive CTE. For larger hierarchies 150K+, I'm willing to share a much faster build in needed.
Since your hierarchies are slow moving (like mine), I tend to store them in a normalized structure and rebuild as necessary.
How about some actual code?
Declare #YourTable table (ID varchar(25),Pt varchar(25))
Insert into #YourTable values
('A' ,NULL),
('AA' ,'A'),
('AAA' ,'AA'),
('AAC' ,'AA'),
('AB' ,'A'),
('AE' ,'A'),
('AEA' ,'AE'),
('AEE' ,'AE'),
('AEEB','AEE')
Declare #Top varchar(25) = null --<< Sets top of Hier Try 'AEE'
Declare #Nest varchar(25) ='|-----' --<< Optional: Added for readability
IF OBJECT_ID('TestHier') IS NOT NULL
Begin
Drop Table TestHier
End
;with cteHB as (
Select Seq = cast(1000+Row_Number() over (Order by ID) as varchar(500))
,ID
,Pt
,Lvl=1
,Title = ID
From #YourTable
Where IsNull(#Top,'TOP') = case when #Top is null then isnull(Pt,'TOP') else ID end
Union All
Select cast(concat(cteHB.Seq,'.',1000+Row_Number() over (Order by cteCD.ID)) as varchar(500))
,cteCD.ID
,cteCD.Pt
,cteHB.Lvl+1
,cteCD.ID
From #YourTable cteCD
Join cteHB on cteCD.Pt = cteHB.ID)
,cteR1 as (Select Seq,ID,R1=Row_Number() over (Order By Seq) From cteHB)
,cteR2 as (Select A.Seq,A.ID,R2=Max(B.R1) From cteR1 A Join cteR1 B on (B.Seq like A.Seq+'%') Group By A.Seq,A.ID )
Select B.R1
,C.R2
,A.ID
,A.Pt
,A.Lvl
,Title = Replicate(#Nest,A.Lvl-1) + A.Title
Into dbo.TestHier
From cteHB A
Join cteR1 B on A.ID=B.ID
Join cteR2 C on A.ID=C.ID
Order By B.R1
Show The Entire Hier I added the Title and Nesting for readability
Select * from TestHier Order By R1
Just to state the obvious, the Range Keys are R1 and R2. You may also notice that R1 maintains the presentation sequence. Leaf nodes are where R1=R2 and Parents or rollups define the span of ownership.
To Show All Descendants
Declare #GetChildrenOf varchar(25) = 'AE'
Select A.*
From TestHier A
Join TestHier B on B.ID=#GetChildrenOf and A.R1 Between B.R1 and B.R2
Order By R1
To Show Path
Declare #GetParentsOf varchar(25) = 'AEEB'
Select A.*
From TestHier A
Join TestHier B on B.ID=#GetParentsOf and B.R1 Between A.R1 and A.R2
Order By R1
Clearly these are rather simple illustrations. Over time, I have created a series of helper functions, both Scalar and Table Value Functions. I should also state that you should NEVER hard code range key in your work because they will change.
In Summary
If you have a point (or even a series of points), you'll have its range and therefore you'll immediately know where it resides and what rolls into it.

This approach does not depend on the existence of a path or parent column. It is relational not recursive.
Since the table is static create a materialized view containing just the leaves to make searching faster:
create materialized view leave as
select cell
from (
select cell,
lag(cell,1,cell) over (order by cell desc) not like cell || '%' as leave
from t
) s
where leave;
table leave;
cell
------
CCCE
CCCA
CCBE
CCBC
BEDA
BDDA
BDCE
BDCB
BAA
AEEB
AEA
AB
AAC
AAA
A materialized view is computed once at creation not at each query like a plain view. Create an index to speed it up:
create index cell_index on leave(cell);
If eventually the source table is altered just refresh the view:
refresh materialized view leave;
The search function receives text and returns a text array:
create or replace function get_descendants(c text)
returns text[] as $$
select array_agg(distinct l order by l)
from (
select left(cell, generate_series(length(c), length(cell))) as l
from leave
where cell like c || '%'
) s;
$$ language sql immutable strict;
Pass the desired match to the function:
select get_descendants('A');
get_descendants
-----------------------------------
{A,AA,AAA,AAC,AB,AE,AEA,AEE,AEEB}
select get_descendants('AEE');
get_descendants
-----------------
{AEE,AEEB}
Test data:
create table t (cell text);
insert into t (cell) values
('A'),
('AA'),
('AAA'),
('AAC'),
('AB'),
('AE'),
('AEA'),
('AEE'),
('AEEB'),
('B'),
('BA'),
('BAA'),
('BD'),
('BDC'),
('BDCB'),
('BDCE'),
('BDD'),
('BDDA'),
('BE'),
('BED'),
('BEDA'),
('C'),
('CC'),
('CCB'),
('CCBC'),
('CCBE'),
('CCC'),
('CCCA'),
('CCCE'),
('CE');

For your scenario, I would suggest you to use Nested Sets Approach in PostgreSQL. It is XML tags based querying using Relational database.
Performance
If you index on lft and rgt columns, then you don't require recursive queries to get the data. Even though, the data seems huge, the retrieval will be very fast.
Sample
/*1A:
2 AA:
3 AAA
4 AAC
5 AB
6 AE:
7 AEA
8 AEE:
9 AEEB
10B:
*/
CREATE TABLE tree(id int, CELL varchar(4), lft int, rgt int);
INSERT INTO tree
("id", CELL, "lft", "rgt")
VALUES
(1, 'A', 1, 9),
(2, 'AA', 2, 4),
(3, 'AAA', 3, 3),
(4, 'AAC', 4, 4),
(5, 'AB', 5, 5),
(6, 'AE', 6, 9),
(7, 'AEA', 7, 7),
(8, 'AEE', 8, 8),
(9, 'AEEB', 9, 9)
;
SELECT hc.*
FROM tree hp
JOIN tree hc
ON hc.lft BETWEEN hp.lft AND hp.rgt
WHERE hp.id = 2
Demo
Querying using Nested Sets approach

Related

Aggregating or Bundle a Many to Many Relationship in SQL Developer

So I have 1 single table with 2 columns : Sales_Order called ccso, Arrangement called arrmap
The table has distinct values for this combination and both these fields have a Many to Many relationship
1 ccso can have Multiple arrmap
1 arrmap can have Multiple ccso
All such combinations should be considered as one single bundle
Objective :
Assign a final map to each of the Sales Order as the Largest Arrangement in that Bundle
Example:
ccso : 100-10015 has 3 arrangements --> Now each of those arrangements have a set of Sales Orders --> Now those sales orders will also have a list of other arrangements and so on
(Image : 1)
Therefore the answer definitely points to something recursively checking. Ive managed to write the below code / codes and they work as long as I hard code a ccso in the where clause - But I don't know how to proceed after this now. (I'm an accountant by profession but finding more passion in coding recently) I've searched the forums and web for things like
Recursive CTEs,
many to many aggregation
cartesian product etc
and I'm sure there must be a term for this which I don't know yet. I've also tried
I have to use sqldeveloper or googlesheet query and filter formulas
sqldeveloper has restrictions on on some CTEs. If recursive is the way I'd like to know how and if I can control the depth to say 4 or 5 iterations
Ideally I'd want to update a third column with the final map if possible but if not, then a select query result is just fine
Codes I've tried
Code 1: As per Screenshot
WITH a1(ccso, amap) AS
(SELECT distinct a.ccso, a.arrmap
FROM rg_consol_map2 A
WHERE a.ccso = '100-10115' -- this condition defines the ultimate ancestors in your chain, change it as appropriate
UNION ALL
SELECT m.ccso, m.arrmap
FROM rg_consol_map2 m
JOIN a1
ON M.arrmap = a1.amap -- or m.ccso=a1.ccso
) /*if*/ CYCLE amap SET nemap TO 1 /*else*/ DEFAULT 0
SELECT DISTINCT amap FROM (SELECT ccso, amap FROM a1 ORDER BY 1 DESC) WHERE ROWNUM = 1
In this the main challenge is how to remove the hardcoded ccso and do a join for each of the ccso
Code 2 : Manual CTEs for depth
Here again the join outside the CTE gives me an error and sqldeveloper does not allow WITH clause with UPDATE statement - only works for select and cannot be enclosed within brackets as subtable
SELECT distinct ccso FROM
(
WITH ar1 AS
(SELECT distinct arrmap
FROM rg_consol_map
WHERE ccso = a.ccso
)
,so1 AS
(SELECT DISTINCT ccso
FROM rg_consol_map
WHERE arrmap IN (SELECT arrmap FROM ar1)
)
,ar2 AS
(SELECT DISTINCT ccso FROM rg_consol_map
where arrmap IN (select distinct arrmap FROM rg_consol_map
WHERE ccso IN (SELECT ccso FROM so1)
))
SELECT ar1.arrmap, NULL ccso FROM ar1
union all
SELECT null, ar2.ccso FROM ar2
UNION ALL
SELECT NULL arrmap, so1.ccso FROM so1
)
Am I Missing something here or is there an easier way to do this? I read something about MERGE and PROC SQL JOIN but was unable to get them to work but if that's the way to go ahead I will try further if someone can point me in the direction
(Image : 2)
(CSV File : [3])
Edit : Fixing CSV file link
https://github.com/karan360note/karanstackoverflow.git
I suppose can be downloaded from here IC mapping many to many.csv
Oracle 11g version is being used
Apologies in advance for the wall of text.
Your problem is a complex, multi-layered Many-to-Many query; there is no "easy" solution to this, because that is not a terribly ideal design choice. The safest best does literally include multiple layers of CTE or subqueries in order to achieve all the depths you want, as the only ways I know to do so recursively rely on an anchor column (like "parentID") to direct the recursion in a linear fashion. We don't have that option here; we'd go in circles without a way to track our path.
Therefore, I went basic, and with several subqueries. Every level checks for a) All orders containing a particular ARRMAP item, and then b) All additional items on those orders. It's clear enough for you to see the logic and modify to your needs. It will generate a new table that contains the original CCSO, the linking ARRMAP, and the related CCSO. Link: https://pastebin.com/un70JnpA
This should enable you to go back and perform the desired updates you want, based on order # or order date, etc... in a much more straightforward fashion. Once you have an anchor column, a CTE in the future is much more trivial (just search for "CTE recursion tree hierarchy").
SELECT DISTINCT
CCSO, RELATEDORDER
FROM myTempTable
WHERE CCSO = '100-10115'; /* to find all orders by CCSO, query SELECT DISTINCT RELATEDORDER */
--WHERE ARRMAP = 'ARR10524'; /* to find all orders by ARRMAP, query SELECT DISTINCT CCSO */
EDIT:
To better explain what this table generates, let me simplify the problem.
If you have order
A with arrangements 1 and 2;
B with arrangement 2, 3; and
C with arrangement 3;
then, by your initial inquiry and image, order A should related to orders B and C, right? The query generates the following table when you SELECT DISTINCT ccso, relatedOrder:
+-------+--------------+
| CCSO | RelatedOrder |
+----------------------+
| A | B |
| A | C |
+----------------------+
| B | C |
| B | A |
+----------------------+
| C | A |
| C | B |
+-------+--------------+
You can see here if you query WHERE CCSO = 'A' OR RelatedOrder = 'A', you'll get the same relationships, just flipped between the two columns.
+-------+--------------+
| CCSO | RelatedOrder |
+----------------------+
| A | B |
| A | C |
+----------------------+
| B | A |
+----------------------+
| C | A |
+-------+--------------+
So query only CCSO or RelatedOrder.
As for the results of WHERE CCSO = '100-10115', see image here, which includes all the links you showed in your Image #1, as well as additional depths of relations.

Filter list of points using list of Polygons

Given a list of points and a list of polygons. How do you return a list of points (subset of original list of points) that is in any of the polygons on the list
I've removed other columns in the sample tables to simplify things
Points Table:
| Longitude| Latitude |
|----------|-----------|
| 7.07491 | 51.28725 |
| 3.674765 | 51.40205 |
| 6.049105 | 51.86624 |
LocationPolygons Table:
| LineString |
|----------------------|
| CURVEPOLYGON (COMPOUNDCURVE (CIRCULARSTRING (-122.20 47.45, -122.81 47.0, -122.942505 46.687131 ... |
| MULTIPOLYGON (((-110.3086 24.2154, -110.30842 24.2185966, -110.3127...
If I had row from the LocationPolygons table I could do something like
DECLARE #homeLocation geography;
SET #homeLocation = (select top 1 GEOGRAPHY::STGeomFromText(LineString, 4326)
FROM LocationPolygon where LocationPolygonId = '123abc')
select Id, Longitude, Latitude, #homeLocation.STContains(geography::Point(Latitude, Longitude, 4326))
as IsInLocation from Points PointId in (1, 2, 3,)
which would return what I want in a format like the below. However this is only true for just one location on the list
| Id | Longitude| Latitude | IsInLocation |
|----|----------|-----------|--------------|
| 1 | 7.07491 | 51.28725 | 0 |
| 2 | 3.674765 | 51.40205 | 1 |
| 3 | 6.049105 | 51.86624 | 0 |
How do I handle the scenario with multiple rows of the LocationPolygon table?
I'd like to know
if any of the points are in any of the locationPolygons?
what specific location polygon they are in? or if they are in more than one polygon.
Question 2 is more of an extra. Can someone help?
Update #1
In response to #Ben-Thul answer.
Unfortunately I don't have access/permission to make changes to the original tables, I can request access but not certain it'll be given. So not certain I'll be able to add the columns or create the index. Although I can create temp tables in a stored proc, I might be able to use test your solution that way
I stumbled on an answer like the below, but slightly worried about performance implications of using a cross join.
WITH cte AS (
select *, (GEOGRAPHY::STGeomFromText(LineString, 4326)).STContains(geography::Point(Latitude, Longitude, 4326)) as IsInALocation from
(
select Longitude, Latitude from Points nolock
) a cross join (
select LineString FROM LocationPolygons nolock
) b
)
select * from cte where IsInALocation = 1
Obviously, it's better to look at a query plan but is the solution I stumbled upon essentially the same as yours? Are there any potential issues that I missed. Apologies for this but my sql isn't very good.
Question 1 shouldn't be too bad. First, some set up:
alter table dbo.Points add Point as (GEOGRAPHY::Point(Latitude, Longitude, 4326));
create spatial index IX_Point on dbo.Points (Point) with (online = on);
alter table dbo.LocationPolygon add Polygon as (GEOGRAPHY::STGeomFromText(LineString, 4326));
create spatial index IX_Polygon on dbo.LocationPolygon (Polygon) with (online = on);
This will create a computed column on each of your tables that is of type geography that has a spatial index on it.
From there, you should be able to do something like this:
select pt.ID,
pt.Longitude,
pt.Latitude,
coalesce(pg.IsInLocation, 0) as IsInLocation
from Points as pt
outer apply (
select top(1) 1 as IsInLocation
from dbo.LocationPolygon as pg
where pg.Polygon.STContains(p.Point) = 1
) as pg;
Here, you're selecting every row from the Points table and using outer apply to see if any polygons contain that point. If one does (it doesn't matter which one), that query will return a 1 in the result set and bubble that back up to the driving select.
To extend this to Question 2, you can remove the top() from the outer apply and have it return either the IDs from the Polygon table or whatever you want. Note though that it'll return one row per polygon that contains the point, potentially changing the cardinality of your result set!

How to write a recursive query which does not add already visited values?

Let's suppose that any given client can have multiple boxes. Each box can contain multiple items as well as multiple boxes (sub-boxes).
BoxA -> Item1, Item2, BoxB, BoxC
Unfortunately, due to the business rules, it's possible to create a circular cycle.
BoxA -> Item1, BoxB, BoxC
BoxB -> BoxA, BoxD
As you can see, BoxA contains BoxB, and BoxB contains BoxA.
The problem I'm attempting to solve is to get all the sub-boxes for a given list of boxes in a client.
So if I was looking for the sub-boxes for BoxA (from the previous example), I would get the following: BoxB, BoxC, BoxA, BoxD.
This is what I have so far:
WITH box_info AS (
-- This is typically a bit more complicated, that's why I have it in a seperate WITH clause
SELECT sub_box_id
FROM client_box
WHERE box_id = 1
),
all_sub_boxes(sub_box_id) AS (
SELECT sub_box_id
FROM box_info
WHERE sub_box_id IS NOT NULL
UNION ALL
SELECT cb.sub_box_id
FROM client_box cb, all_sub_boxes asb
WHERE cb.box_id = asb.sub_box_id AND cb.sub_box_id IS NOT NULL
-- AND cb.sub_box_id NOT IN (SELECT sub_box_id FROM all_sub_boxes)
)
SELECT sub_box_id FROM all_sub_boxes;
However, since it's possible to get stuck in a recursive loop, the "all_sub_boxes" WITH clause will fail. The commented out line is what I would intuitively put in since it prevents already visited sub-boxes from getting added to the recursive list, but it seems that we can't reference "all_sub_boxes" from within.
So essentially, I need a way to not include already included sub-boxes in the recursive query.
Perhaps I could create a temp table? But I don't even know if it's possible to insert into a table during a recursive query. And additionally, what is the cost each time this query is run if we do create a temp table every time?
I'm trying to write this query so that it could be used across different commercial DBs, so if I can avoid non-standard sql, that would be great. But I understand that if it's not possible, then it is what it is.
Edit
For the sake of clarity, here's how the client_box table might look:
+--------+---------+------------+
| BOX_ID | ITEM_ID | SUB_BOX_ID |
+--------+---------+------------+
| BoxA | Item1 | (null) |
| BoxA | (null) | BoxB |
| BoxA | (null) | BoxC |
| BoxB | (null) | BoxA |
| BoxB | (null) | BoxD |
+--------+---------+------------+
You were on the right track, and you seem to just need a little help dealing with cycles. See the CYCLE clause right at the end of the recursive CTE definition (even though the CYCLE clause comes AFTER the closing parenthesis for the recursive CTE, it is still part of it):
with
-- Begin simulated data.
client_box ( box_id, item_id, sub_box_id ) as (
select 'BoxA', 'Item1', null from dual union all
select 'BoxA', null , 'BoxB' from dual union all
select 'BoxA', null , 'BoxC' from dual union all
select 'BoxB', null , 'BoxA' from dual union all
select 'BoxB', null , 'BoxD' from dual
),
-- End of simulated data (for testing only, not part of the solution).
-- SQL query consists of the keyword WITH (above) and the lines below.
-- Use your actual table and column names.
-- Use whatever mechanism works for you in the ANCHOR branch of r (below).
r ( box_id ) as (
select 'BoxA' from dual -- Modify this for inputs
union all
select c.sub_box_id
from client_box c join r on c.box_id = r.box_id
where c.sub_box_id is not null
)
cycle box_id set cycle to 1 default 0
select box_id
from r
where cycle = 0
;
BOX_ID
------
BoxA
BoxB
BoxC
BoxD

Visiting a directed graph as if it were an undirected one, using a recursive query

I need your help about the visit of a directed graph stored in a database.
Consider the following directed graph
1->2
2->1,3
3->1
A table stores those relations:
create database test;
\c test;
create table ownership (
parent bigint,
child bigint,
primary key (parent, child)
);
insert into ownership (parent, child) values (1, 2);
insert into ownership (parent, child) values (2, 1);
insert into ownership (parent, child) values (2, 3);
insert into ownership (parent, child) values (3, 1);
I'd like to extract all the semi-connected edges (i.e. the connected edges ignoring the direction) of the graph reachable from a node. I.e., if I start from parent=1, I'd like to have the following output
1,2
2,1
2,3
3,1
I'm using postgresql.
I've modified the example on Postgres' manual which explains recursive queries, and I've adapted the join condition to go "up" and "down" (doing so I ignore the directions). My query is the following one:
\c test
WITH RECURSIVE graph(parent, child, path, depth, cycle) AS (
SELECT o.parent, o.child, ARRAY[ROW(o.parent, o.child)], 0, false
from ownership o
where o.parent = 1
UNION ALL
SELECT
o.parent, o.child,
path||ROW(o.parent, o.child),
depth+1,
ROW(o.parent, o.child) = ANY(path)
from
ownership o, graph g
where
(g.parent = o.child or g.child = o.parent)
and not cycle
)
select g.parent, g.child, g.path, g.cycle
from
graph g
its output follows:
parent | child | path | cycle
--------+-------+-----------------------------------+-------
1 | 2 | {"(1,2)"} | f
2 | 1 | {"(1,2)","(2,1)"} | f
2 | 3 | {"(1,2)","(2,3)"} | f
3 | 1 | {"(1,2)","(3,1)"} | f
1 | 2 | {"(1,2)","(2,1)","(1,2)"} | t
1 | 2 | {"(1,2)","(2,3)","(1,2)"} | t
3 | 1 | {"(1,2)","(2,3)","(3,1)"} | f
1 | 2 | {"(1,2)","(3,1)","(1,2)"} | t
2 | 3 | {"(1,2)","(3,1)","(2,3)"} | f
1 | 2 | {"(1,2)","(2,3)","(3,1)","(1,2)"} | t
2 | 3 | {"(1,2)","(2,3)","(3,1)","(2,3)"} | t
1 | 2 | {"(1,2)","(3,1)","(2,3)","(1,2)"} | t
3 | 1 | {"(1,2)","(3,1)","(2,3)","(3,1)"} | t
(13 rows)
I have a problem: the query extracts the same edges many times, as they are reached through different paths, and I'd like to avoid this. If I modify the outer query into
select distinct g.parent, g.child from graph
I have the desired result, but inefficiencies remain in the WITH query, as unneeded joins are done.
So, is there a solution to extract the reachable edges of a graph in a db, starting from a given one, without using distinct?
I also have another problem (this one is solved, look at the bottom): as you can see from the output, cycles stop only when a node is reached for the second time. I.e. I have (1,2) (2,3) (1,2).
I'd like to stop the cycle before cycling over that last node again, i.e. having (1,2) (2,3).
I've tried to modify the where condition as follows
where
(g.parent = o.child or g.child = o.parent)
and (ROW(o.parent, o.child) <> any(path))
and not cycle
to avoid visiting already visited edges, but it doesn't work and I cannot understand why ((ROW(o.parent, o.child) <> any(path)) should avoid cycling before going on the cycled edge again but doesn't work). How can I do to stop cycles one step before the node that closes the cycle?
Edit: as danihp suggested, to solve the second problem I used
where
(g.parent = o.child or g.child = o.parent)
and not (ROW(o.parent, o.child) = any(path))
and not cycle
and now the output contains no cycles. Rows went from 13 to 6, but I still have duplicates, so the main (the first) problem of extracting all the edges without duplicates and without distinct is still alive. Current output with and not ROW
parent | child | path | cycle
--------+-------+---------------------------+-------
1 | 2 | {"(1,2)"} | f
2 | 1 | {"(1,2)","(2,1)"} | f
2 | 3 | {"(1,2)","(2,3)"} | f
3 | 1 | {"(1,2)","(3,1)"} | f
3 | 1 | {"(1,2)","(2,3)","(3,1)"} | f
2 | 3 | {"(1,2)","(3,1)","(2,3)"} | f
(6 rows)
Edit #2:: following what Erwin Brandstetter suggested, I modified my query, but if I'm not wrong, the proposed query gives MORE rows than mine (ROW comparison is still there as it seems more clear to me, even I understood that string comparison will be more efficient).
Using the new query, I obtain 20 rows, while mine gives 6 rows
WITH RECURSIVE graph(parent, child, path, depth) AS (
SELECT o.parent, o.child, ARRAY[ROW(o.parent, o.child)], 0
from ownership o
where 1 in (o.child, o.parent)
UNION ALL
SELECT
o.parent, o.child,
path||ROW(o.parent, o.child),
depth+1
from
ownership o, graph g
where
g.child in (o.parent, o.child)
and ROW(o.parent, o.child) <> ALL(path)
)
select g.parent, g.child from graph g
Edit 3: so, as Erwin Brandstetter pointed out, the last query was still wrong, while the right one can be found in his answer.
When I posted my first query, I hadn't understood that I was missing some joins, as it happens in the following case: if I start with the node 3, the db selects the rows (2,3) and (3,1). Then, the first inductive step of the query would select, joining from these rows, the rows (1,2), (2,3) and (3,1), missing the row (2,1) that should be included in the result as conceptually the algorithm would imply ( (2,1) is "near" (3,1) )
When I tried to adapt the example in Postgresql manual, I was right trying to join ownership's parent and child, but I was wrong not saving the value of graph that had to be joined in each step.
These type of queries seem to generate a different set of rows depending on the starting node (i.e. depending on the set of rows selected in the base step). So, I think it could be useful to select just one row containing the starting node in the base step, as you'll get any other "adjacent" node anyway.
Could work like this:
WITH RECURSIVE graph AS (
SELECT parent
,child
,',' || parent::text || ',' || child::text || ',' AS path
,0 AS depth
FROM ownership
WHERE parent = 1
UNION ALL
SELECT o.parent
,o.child
,g.path || o.child || ','
,g.depth + 1
FROM graph g
JOIN ownership o ON o.parent = g.child
WHERE g.path !~~ ('%,' || o.parent::text || ',' || o.child::text || ',%')
)
SELECT *
FROM graph
You mentioned performance, so I optimized in that direction.
Major points:
Traverse the graph only in the defined direction.
No need for a column cycle, make it an exclusion condition instead. One less step to go. That is also the direct answer to:
How can I do to stop cycles one step before the node that closes the
cycle?
Use a string to record the path. Smaller and faster than an array of rows. Still contains all necessary information. Might change with very big bigint numbers, though.
Check for cycles with the LIKE operator (~~), should be much faster.
If you don't expect more that 2147483647 rows over the course of time, use plain integer columns instead of bigint. Smaller and faster.
Be sure to have an index on parent. Index on child is irrelevant for my query. (Other than in your original where you traverse edges in both directions.)
For huge graphs I would switch to a plpgsql procedure, where you can maintain the path as temp table with one row per step and a matching index. A bit of an overhead, that will pay off with huge graphs, though.
Problems in your original query:
WHERE (g.parent = o.child or g.child = o.parent)
There is only one endpoint of your traversal at any point in the process. As you wlak the directed graph in both directions, the endpoint can be either parent or child - but not both of them. You have to save the endpoint of every step, and then:
WHERE g.child IN (o.parent, o.child)
The violation of the direction also makes your starting condition questionable:
WHERE parent = 1
Would have to be
WHERE 1 IN (parent, child)
And the two rows (1,2) and (2,1) are effectively duplicates this way ...
Additional solution after comment
Ignore direction
Still walk any edge only once per path.
Use ARRAY for path
Save original direction in path, not actual direction.
Note, that this way (2,1) and (1,2) are effective duplicates, but both can be used in the same path.
I introduce the column leaf which saves the actual endpoint of every step.
WITH RECURSIVE graph AS (
SELECT CASE WHEN parent = 1 THEN child ELSE parent END AS leaf
,ARRAY[ROW(parent, child)] AS path
,0 AS depth
FROM ownership
WHERE 1 in (child, parent)
UNION ALL
SELECT CASE WHEN o.parent = g.leaf THEN o.child ELSE o.parent END -- AS leaf
,path || ROW(o.parent, o.child) -- AS path
,depth + 1 -- AS depth
FROM graph g
JOIN ownership o ON g.leaf in (o.parent, o.child)
AND ROW(o.parent, o.child) <> ALL(path)
)
SELECT *
FROM graph

SQL magic - query shouldn't take 15 hours, but it does

Ok, so i have one really monstrous MySQL table (900k records, 180 MB total), and i want to extract from subgroups records with higher date_updated and calculate weighted average in each group. The calculation runs for ~15 hours, and i have a strong feeling i'm doing it wrong.
First, monstrous table layout:
category
element_id
date_updated
value
weight
source_prefix
source_name
Only key here is on element_id (BTREE, ~8k unique elements).
And calculation process:
Make hash for each group and subgroup.
CREATE TEMPORARY TABLE `temp1` (INDEX ( `ds_hash` ))
SELECT `category`,
`element_id`,
`source_prefix`,
`source_name`,
`date_updated`,
`value`,
`weight`,
MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable` WHERE `date_updated` <= '2009-04-28'
I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, i presume.
Find maximum date for each subgroup
CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate` , `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
Join temp1 with temp2 to find weighted average values for categories
CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
`temp1`.`category`,
`temp1`.`source_prefix`,
`temp1`.`source_name`,
`temp1`.`date_updated`,
AVG(`temp1`.`value`) AS `avg_value`,
SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1` LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
(now that i looked through it and wrote it all down, it seems to me that i should use INNER JOIN in that last query (to avoid 900k*900k temp table)).
Still, is there a normal way to do so?
UPD: some picture for reference:
removed dead ImageShack link
UPD: EXPLAIN for proposed solution:
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| 1 | SIMPLE | cur | ALL | NULL | NULL | NULL | NULL | 893085 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | next | ref | prefix | prefix | 1074 | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id | 1 | 100.00 | Using where |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
Using hashses is one of the ways in which a database engine can execute a join. It should be very rare that you'd have to write your own hash-based join; this certainly doesn't look like one of them, with a 900k rows table with some aggregates.
Based on your comment, this query might do what you are looking for:
SELECT cur.source_prefix,
cur.source_name,
cur.category,
cur.element_id,
MAX(cur.date_updated) AS DateUpdated,
AVG(cur.value) AS AvgValue,
SUM(cur.value * cur.weight) / SUM(cur.weight) AS Rating
FROM eev0 cur
LEFT JOIN eev0 next
ON next.date_updated < '2009-05-01'
AND next.source_prefix = cur.source_prefix
AND next.source_name = cur.source_name
AND next.element_id = cur.element_id
AND next.date_updated > cur.date_updated
WHERE cur.date_updated < '2009-05-01'
AND next.category IS NULL
GROUP BY cur.source_prefix, cur.source_name,
cur.category, cur.element_id
The GROUP BY performs the calculations per source+category+element.
The JOIN is there to filter out old entries. It looks for later entries, and then the WHERE statement filters out the rows for which a later entry exists. A join like this benefits from an index on (source_prefix, source_name, element_id, date_updated).
There are many ways of filtering out old entries, but this one tends to perform resonably well.
Ok, so 900K rows isn't a massive table, it's reasonably big but and your queries really shouldn't be taking that long.
First things first, which of the 3 statements above is taking the most time?
The first problem I see is with your first query. Your WHERE clause doesn't include an indexed column. So this means that it has to do a full table scan on the entire table.
Create an index on the "data_updated" column, then run the query again and see what that does for you.
If you don't need the hash's and are only using them to avail of the dark magic then remove them completely.
Edit: Someone with more SQL-fu than me will probably reduce your whole set of logic into one SQL statement without the use of the temporary tables.
Edit: My SQL is a little rusty, but are you joining twice in the third SQL staement? Maybe it won't make a difference but shouldn't it be :
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 LEFT JOIN temp2 ON temp1.subcat_hash = temp2.subcat_hash
WHERE temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;
or
SELECT temp1.element_id,
temp1.category,
temp1.source_prefix,
temp1.source_name,
temp1.date_updated,
AVG(temp1.value) AS avg_value,
SUM(temp1.value * temp1.weight) / SUM(weight) AS rating
FROM temp1 temp2
WHERE temp2.subcat_hash = temp1.subcat_hash
AND temp1.date_updated = temp2.maxdate
GROUP BY temp1.cat_hash;