I want to represent a file-and-folder hierarchy in a Postgres 10 database. A structure like
Photos/
|-- Dog.jpg
|-- Cat.jpg
|-- Places/
|   |-- Paris.jpg
|   |-- Berlin.jpg
Songs/
|-- Happy.mp3
would be represented as something like
| id | filename | parent_id |
|----|------------|-----------|
| 1 | Photos | null |
| 2 | Songs | null |
| 3 | Cat.jpg | 1 |
| 4 | Happy.mp3 | 2 |
| 5 | Places | 1 |
| 6 | Berlin.jpg | 5 |
| 7 | Dog.jpg | 1 |
| 8 | Paris.jpg | 5 |
The database would track multiple users, and each user would have their own file-folder hierarchy.
I've been reading up on Postgres's ltree extension, and it seems like the solution to my problem, but I don't know for certain, and it's difficult to test. The labels seem to be arbitrary strings: is it possible to tell Postgres that a label should be an ID field in the same table? Would I need to create one initial root node for each user, only let them attach children to that node or to its descendants, and then issue a select * from nodes where path <@ that root node's path? Can you select descendants that way?
Or am I forcing Postgres to do something it was never intended to do, when I should be looking at other kinds of database?
The correct answer is: it depends on your use case.
If your trees are fairly static (child nodes do not change very often), then ltree is a really good choice. It gives you very fast and convenient queries for subtrees and ordering. In that case I would use a single root node per user, as you mentioned.
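As a rough illustration, here is a minimal ltree sketch assuming the column names from the question (using each row's numeric id as its label and adding a GiST index are my choices, not requirements of ltree):

CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE nodes (
    id       integer PRIMARY KEY,
    filename text  NOT NULL,
    path     ltree NOT NULL   -- e.g. '1.5.8' = Photos/Places/Paris.jpg
);
CREATE INDEX nodes_path_idx ON nodes USING GIST (path);

-- ltree labels may only contain [A-Za-z0-9_], so the numeric id is a
-- safe label; the ids follow the question's example table:
INSERT INTO nodes (id, filename, path) VALUES
    (1, 'Photos',    '1'),
    (5, 'Places',    '1.5'),
    (8, 'Paris.jpg', '1.5.8');

-- All descendants of the root node with id 1 (the root included):
SELECT * FROM nodes WHERE path <@ '1';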
On the other hand, moving a subtree in ltree can force a huge rewrite of the stored paths. E.g. if you have the paths 1.1.1, 1.1.2 and 1.2.1 and you want to change the order of the subtrees at the first level below the root, all three values have to change.
So if your tree structure is very dynamic, I would try the structure you mentioned above: store the parent node as an adjacency list (plus perhaps an index column for ordering) and run the queries with recursive CTEs (see the sketch after the links below):
https://www.postgresql.org/docs/current/static/queries-with.html
http://schinckel.net/2014/09/13/long-live-adjacency-lists/
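For the dynamic case, a minimal recursive CTE sketch against the adjacency list from the question (the root id 1 is the example's Photos folder):

WITH RECURSIVE subtree AS (
    SELECT id, filename, parent_id
    FROM   nodes
    WHERE  id = 1                 -- the user's root node
    UNION ALL
    SELECT n.id, n.filename, n.parent_id
    FROM   nodes n
    JOIN   subtree s ON n.parent_id = s.id
)
SELECT * FROM subtree;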
Last but not least you could try it with the "nested sets" structure (https://en.wikipedia.org/wiki/Nested_set_model).
Every approach has its own benefits and drawbacks. You have to do a detailed analysis, and maybe build some prototypes, to find out which one is best for you.
Further reading:
https://medium.com/notes-from-a-messy-desk/representing-trees-in-postgresql-cbcdae419022
https://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/
I feel I have reached a fundamental dilemma in writing BDD scenarios as a tester.
When writing BDD scenarios from testing perspective, I tend to end up using concrete examples with concrete data and observing the state, i.e. Given these initial values, When user performs an action, Then these final values should be observed. Example with an initial dataset given in Background:
Background:
  Given following items are in the store
    | type   | name | X  | Y    | Z    | tags |
    | single | el1  | 10 | 20   | 1.03 | t1   |
    | multi  | el2  | 10 | 20   | 30   | t2   |
    | single | el3  | 10 | 3.02 | 30   | t3   |
Scenario: Adding tag to multi-type item
  Given Edit Item Popup is opened for item: el2
  When user adds tag NEWTAG
  And user clicks on Apply changes button
  Then item store should display following items
    | type   | name | X  | Y    | Z    | tags       |
    | single | el1  | 10 | 20   | 1.03 | t1         |
    | multi  | el2  | 10 | 20   | 30   | t2, NEWTAG |
    | single | el3  | 10 | 3.02 | 30   | t3         |
The initial dataset from Background can be reused in all (or most) scenarios that deal with modifying, adding or deleting items in relation to a particular feature. I can also iterate the scenario over some data set that explores the problem space, boundary conditions, etc. (a trivial example: tags with too many or forbidden characters).
But when requirements are not entirely clear, I sometimes go with a different approach and start from a more abstract description of the behavior (so that the scenarios can become the specification), which seems to me the more (for lack of a better word) correct way of doing BDD. So I end up with behavior descriptions that are perfectly clear when a human reads them from the requirement-analysis position, but appear extremely vague when you shift to the testing perspective:
Scenario: Adding tag to multi-type item
  Given Edit Item Popup is opened for multi-type item
  When user adds a new tag
  And user clicks on Apply changes button
  Then that item should have that tag displayed in item store
For some reason I feel way better writing a scenario like that, as it seems closer to BDD ideals (describing the behavior, doh!). But at the same time I feel terrible for 2 reasons:
A lot of details are implicit here and thus hidden deep in the implementation. Because of that, while implementing, we need to ask ourselves a ton of questions like 'what initial data should I use here?', 'how do we keep track of which item we are handling?', 'how deeply should I examine the final state?'. This all goes away when you just compare the final state with a reference table, as in the first approach.
(Possibly more serious) I am not exploring the problem space here at all, while bugs often await us somewhere in dark corners of that space.
One could argue that these 2 approaches I presented are just extreme ends of a spectrum, but I still see them as fundamentally different, and I often find myself wondering which approach to choose.
So, how do you write your BDD (test) scenarios? Data-driven and state-comparing, or full blown abstract descriptions of behavior?
Somebody pointed out that my data structure architecture sucks.
The task
I have a locations table which stores the name of a location. Then I have a tags table which stores information about those locations. The locations form a hierarchy, which I want to use to fetch all tags.
Example
Locations:
USA <- California <- San Francisco <- Mission St
Tags:
USA: English
California: Sunny
California: West coast
San Francisco: Sea side
Mission St: Cable car station
If somebody requests information about Mission St, I want to deliver all tags of it and its ancestors (["English", "Sunny", "West coast", "Sea side", "Cable car station"]). If I request all tags of California, the answer would be ["English", "Sunny", "West coast"].
I'm looking for the best read performance! I don't care about write performance. This data is not changed very often. And I don't care about table sizes either. If I need more or larger tables to solve this quicker, so be it.
The tables
So currently I'm thinking about setting up these tables:
locations
id | name
---|--------------
1 | USA
2 | California
3 | San Francisco
4 | Mission St
tags
id | location_id | name
---|-------------|------------------
1 | 1 | English
2 | 2 | Sunny
3 | 2 | West coast
4 | 3 | Sea side
5 | 4 | Cable car station
ancestors
I added a position field to store each ancestor's distance in the hierarchy (1 = direct parent).
| id | location_id | ancestor_id | position |
|----|-------------|-------------|----------|
| 1 | 2 | 1 | 1 |
| 2 | 3 | 2 | 1 |
| 3 | 3 | 1 | 2 |
| 4 | 4 | 3 | 1 |
| 5 | 4 | 2 | 2 |
| 6 | 4 | 1 | 3 |
Question
Is this a good solution for the problem, or is there a better one? I want to select, as fast as possible, all tags of any given location including all the tags of its ancestors. I'm using a PostgreSQL database, but I think this is a pure SQL architecture problem.
Your problem seems to consist of two challenges. The most interesting is "how do I store hierarchies in a relational database". There are lots of answers to that - the one you've proposed is the most common.
There's an alternative called "nested set" which is faster for reading (in your example, finding all locations within a particular hierarchy would be a "between x and y" query).
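A hedged sketch of that nested-set read, assuming illustrative lft/rgt columns on the locations table (they are not part of the question's schema):

-- Ancestors of a node (itself included) are exactly the rows whose
-- [lft, rgt] interval encloses the node's own interval:
SELECT a.name
FROM   locations a
JOIN   locations n ON a.lft <= n.lft AND a.rgt >= n.rgt
WHERE  n.name = 'Mission St';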
Postgres also has dedicated support for hierarchies (the ltree extension); I'd assume this would likewise provide great performance.
The second part of your question is "given a path in my hierarchy, retrieve all matching tags". The easiest option is to join to the tags table as you suggest.
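With the ancestors table from the question, that join could look like this sketch (Mission St's id 4 comes from the example data):

SELECT t.name
FROM   tags t
WHERE  t.location_id = 4                   -- the location itself
   OR  t.location_id IN (SELECT ancestor_id
                         FROM   ancestors
                         WHERE  location_id = 4);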
The final aspect is "should you denormalize/precalculate". I usually recommend building and optimizing the "normalized" solution and only denormalize when you need to.
If you want to deliver all tags for a particular location, then I would recommend replicating the data and storing the tags in a tags array on a row for each location.
You say that the locations don't change very much, so I would simply batch-create the entire table whenever any underlying data changes.
Modifying the data in situ is rather problematic. A single update could end up affecting a zillion different rows -- consider a tag change on USA. Recalculating the entire table is going to be more efficient.
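A hedged sketch of that batch rebuild in Postgres (the location_tags table and its array column are my naming, not the question's):

-- Recreate the denormalized table from scratch whenever the
-- underlying data changes:
DROP TABLE IF EXISTS location_tags;
CREATE TABLE location_tags AS
SELECT l.id AS location_id,
       array_agg(t.name) AS tags
FROM   locations l
JOIN   tags t
  ON   t.location_id = l.id
   OR  t.location_id IN (SELECT ancestor_id
                         FROM   ancestors
                         WHERE  location_id = l.id)
GROUP  BY l.id;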
If you need to search on the tags as well as return them, then I would go for a more traditional structure of a table with two important columns, location and tag. Then you can have indexes on both (location) and (tag) to facilitate searching in either direction.
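A sketch of that layout with one index per search direction (names are illustrative):

CREATE TABLE location_tag (
    location_id integer NOT NULL,
    tag         text    NOT NULL
);
CREATE INDEX location_tag_by_location ON location_tag (location_id);
CREATE INDEX location_tag_by_tag      ON location_tag (tag);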
If write performance is not crucial, I would go for denormalization of the database. That means you use the above structure for your write operations and fill a table for your read operations via a trigger or some async job, if you are afraid of triggers. Then the read performance is optimal, but you have to invest a bit more in the write logic.
Using the above structure directly for read operations is indeed not a smart solution, because you don't know how deep the tree can get.
I have a linked table to an Outlook MailItem folder in my Access database. This is handy in that it keeps itself constantly updated, but I can't add an extra field to relate these records to a parent table.
My workaround was to put an automatically generated/added ID String into the Subject so I could work from there. In order to make my form work the way I need it to, I'm trying to create a query that takes the fields I need from the linked table and adds a calculated field with the extracted ID so it can be referenced for relating records in the form.
The query works fine (I get all the records and their IDs extracted) but when I try to filter records from this query by the calculated field I get:
This expression is typed incorrectly, or it is too complex to be evaluated. For example, a numeric expression may contain too many complicated elements. Try simplifying the expression by assigning parts of the expression to variables.
I tried separating the calculated field out into three fields so it's easier to read, hoping that would make it easier to evaluate for Access, but I still get the same error. My base query is currently:
SELECT InStr(Subject,"Support Project #CS")+19 AS StartID,
InStr(StartID,Subject," ") AS EndID,
Int(Mid(Subject,StartID,EndID-StartID)) AS ID,
ProjectEmails.Subject,
ProjectEmails.[From],
ProjectEmails.To,
ProjectEmails.Received,
ProjectEmails.Contents
FROM ProjectEmails
WHERE (((ProjectEmails.[Subject]) Like "*Support Project [#]CS*"));
I've tried to bind a subform to this query on qryProjectEmailWithID.ID = SupportProject.ID where the main form is bound to SupportProject, and I get the above error. I tried building a query that selects all records from that query where the ID = a given parameter and I still get the same error.
The working query that adds Support Project IDs would look like:
+----+--------------------------------------+----------------------+----------------------+------------+----------------------------------+
| ID | Subject | To | From | Received | Contents |
+----+--------------------------------------+----------------------+----------------------+------------+----------------------------------+
| 1 | RE: Support Project #CS1 ID Extra... | questions@so.com | Isaac.Reefman@so.com | 2019-03-11 | Trying to work out how to add... |
| 1 | RE: Support Project #CS1 ID Extra... | isaac.reefman@so.com | questions@so.com | 2019-03-11 | Thanks for your question. The... |
| 1 | RE: Support Project #CS1 ID Extra... | isaac.reefman@so.com | questions@so.com | 2019-03-11 | You should use a different me... |
| 2 | RE: Support Project #CS2 IT issue... | support@domain.com | someone@company.com | 2019-02-21 | I really need some help with ... |
| 2 | RE: Support Project #CS2 IT issue... | someone@company.com | support@domain.com | 2019-02-21 | Thanks for your question. The... |
| 2 | RE: Support Project #CS2 IT issue... | someone@company.com | support@domain.com | 2019-02-21 | Have you tried turning it off... |
| 3 | RE: Support Project #CS3 email br... | support@domain.com | someone@company.com | 2019-02-12 | my email server is malfunccti... |
| 3 | RE: Support Project #CS3 email br... | someone@company.com | support@domain.com | 2019-02-12 | Thanks for your question. The... |
| 3 | RE: Support Project #CS3 email br... | someone@company.com | support@domain.com | 2019-02-13 | I've just re-started the nece... |
+----+--------------------------------------+----------------------+----------------------+------------+----------------------------------+
The view in question would populate a datasheet that looks the same, with just the items whose ID matches the ID of the current SupportProject record, updating when a new record is selected. A separate text box should show the full content of whichever record is selected in that grid, like this:
Have you tried turning it off and on again?
From: support@domain.com
On: 21/02/2019
Thanks for your question. The matter has been assigned to Support Project #CS2, and a support staff member will be in touch shortly to help you out. As it is considered of medium priority, you should expect daily updates.
Thanks,
Support
From: someone@company.com
On: 21/02/2019
I really need some help with my computer. It seems really slow and I can't do my work efficiently.
Neither of these things happens when I try to use the calculated number to relate to the PK of the SupportProject table...
I don't know if this is part of the problem, but whether I use Int(Mid(Subject... or Val(Mid(Subject... I still apparently get a Double, whereas the ID field (as an autoincrement ID) is a Long. I can't work out how to force it to return a Long, so I can't test whether that's the problem.
So that is the output resulting from the posted SQL? I really wanted raw data, but close enough. If the requirement is to extract the number after ...CS, calculate it in a query and save the query:
Val(Mid([Subject],InStr([Subject],"CS")+2))
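So the saved query (a hedged reconstruction of the question's qryProjectEmailWithID, using only names from the question) might look like:

SELECT Val(Mid([Subject],InStr([Subject],"CS")+2)) AS ID,
       ProjectEmails.Subject,
       ProjectEmails.[From],
       ProjectEmails.To,
       ProjectEmails.Received,
       ProjectEmails.Contents
FROM ProjectEmails
WHERE ProjectEmails.[Subject] Like "*Support Project [#]CS*";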
Then build another query to join the first query to the table.
SELECT qryProjectEmailWithID.*, SupportProject.tst
FROM qryProjectEmailWithID
INNER JOIN SupportProject ON qryProjectEmailWithID.ID = SupportProject.ID;
Filter criteria can be applied to either ID field.
A subform can display the related child records synchronized with SupportProject records on main form.
I tested the ID calculation with your data and then with a link to my own Inbox. No issue with the query join.
I'm afraid this might be a somewhat simple question, but I can't seem to figure it out.
I have a spreadsheet with many objects, each of which has many attributes (one per column), like this (sorry, I can't post images, so this is the best I can do):
OBJECT ID | PERIOD | COLOR | REPRESENTATION
1 | Early Intermediate | Bichrome | Abstract
2 | Middle Horizon | Multicolored | Representational
… and I'd like each attribute to become a separate row, which would mean that each object would be listed several times. Like this:
OBJECT | ATTRIBUTE
Object 1 | Early Intermediate
Object 1 | Bichrome
Object 1 | Abstract
Object 2 | Middle Horizon
Object 2 | Multicolored
Object 2 | Representational
I'm not seeing an obvious way to do this, and I can't find an answer here, though perhaps I'm not using the right search terms.
Thanks for any help you can offer!
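For comparison, the requested reshape is what SQL calls an unpivot; a minimal Postgres-style sketch, assuming the rows sat in a hypothetical table named objects with one column per attribute:

SELECT 'Object ' || id AS object, attr AS attribute
FROM   objects
CROSS JOIN LATERAL
       (VALUES (period), (color), (representation)) AS v(attr);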
I have a table containing hierarchical data. There are currently ~8 levels in this hierarchy.
I really like the way the data is structured, but performance is dismal when I need to know if a record at level 8 is a child of a record at level 1.
I have PL/SQL stored functions which do these lookups for me, each having a select * from tbl start with ... connect by... statement. This works fine when I'm querying a handful of records, but I'm in a situation now where I need to query ~10k records at once and for each of them run this function. It's taking 2-3 minutes where I need it to run in just a few seconds.
Using some heuristics based on my knowledge of the current data, I can get rid of the lookup function and just do childrecord.key LIKE parentrecord.key || '%', but that's a really dirty hack and will not always work.
So now I'm thinking that for this hierarchically-defined table I need a separate parent-child table which will contain every ancestor/descendant relationship: for a single chain going from level 1 to level 8 that is 28 records (8 choose 2), associating 1 with 2, 1 with 3, ..., 1 with 8, then 2 with 3, 2 with 4, ..., 2 with 8, and so forth.
My thought is that I would need an insert trigger that basically runs the connect by query and, for every match going up the hierarchy, inserts a record into the lookup table. And to deal with old data I'll just set up foreign keys to the main table with cascading deletes.
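A hedged sketch of that trigger's core insert (Oracle; the ancestor_lookup table and its columns are illustrative). Deriving the new row's ancestors from the lookup table itself, instead of re-running connect by against the base table, also sidesteps Oracle's mutating-table restriction:

-- In a row-level AFTER INSERT trigger: record the new row's parent,
-- plus every ancestor of that parent, as ancestors of :new.id.
INSERT INTO ancestor_lookup (ancestor_id, descendant_id)
SELECT :new.parent_id, :new.id FROM dual
WHERE  :new.parent_id IS NOT NULL
UNION ALL
SELECT al.ancestor_id, :new.id
FROM   ancestor_lookup al
WHERE  al.descendant_id = :new.parent_id;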
Are there better options than this? Am I missing another way that I could determine these distant ancestor/descendant relationships more quickly?
EDIT: This appears to be exactly what I'm thinking about: http://evolt.org/working_with_hierarchical_data_in_sql_using_ancestor_tables
So what you want is to materialize the transitive closure. That is, given this application table ...
ID | PARENT_ID
------+----------
1 |
2 | 1
3 | 2
4 | 2
5 | 4
... the graph table would look like this:
PARENT_ID | CHILD_ID
-----------+----------
1 | 2
1 | 3
1 | 4
1 | 5
2 | 3
2 | 4
2 | 5
4 | 5
It is possible to maintain a table like this in Oracle, although you will need to roll your own framework for it. The question is whether it is worth the overhead. If the source table is volatile then keeping the graph data fresh may cost more cycles than you will save on the queries. Only you know your data's profile.
I don't think you can maintain such a graph table with CONNECT BY queries and cascading foreign keys. Too much indirect activity, too hard to get right. Also a materialized view is out, because we cannot write a SQL query which will zap the 1->5 record when we delete the source record for ID=4.
So I suggest you read a paper called Maintaining Transitive Closure of Graphs in SQL by Dong, Libkin, Su and Wong. It contains a lot of theory and some gnarly (Oracle) SQL, but it will give you the grounding to build the PL/SQL you need to maintain a graph table.
"can you expand on the part about it
being too difficult to maintain with
CONNECT BY/cascading FKs? If I control
access to the table and all
inserts/updates/deletes take place via
stored procedures, what kinds of
scenarios are there where this would
break down?"
Consider the record 1->5, which is a short-circuit of 1->2->4->5. Now what happens if, as I said before, we delete the source record for ID=4? Cascading foreign keys could delete the entries for 2->4 and 4->5. But that leaves 1->5 (and indeed 2->5) in the graph table, although they no longer represent a valid edge in the graph.
What might work (I think, I haven't done it) would be to use an additional synthetic key in the source table, like this.
ID | PARENT_ID | NEW_KEY
------+-----------+---------
1 | | AAA
2 | 1 | BBB
3 | 2 | CCC
4 | 2 | DDD
5 | 4 | EEE
Now the graph table would look like this:
PARENT_ID | CHILD_ID | NEW_KEY
-----------+----------+---------
1 | 2 | BBB
1 | 3 | CCC
1 | 4 | DDD
1 | 5 | DDD
2 | 3 | CCC
2 | 4 | DDD
2 | 5 | DDD
4 | 5 | DDD
So the graph table has a foreign key referencing the relationship in the source table which generated it, rather than linking to the ID. Then deleting the record for ID=4 would cascade deletes of all records in the graph table where NEW_KEY=DDD.
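A hedged DDL sketch of that cascade (Oracle; the table names are illustrative, and NEW_KEY needs a unique constraint before it can be referenced):

ALTER TABLE source_tbl
  ADD CONSTRAINT source_tbl_new_key_uq UNIQUE (new_key);

CREATE TABLE graph (
    parent_id NUMBER       NOT NULL,
    child_id  NUMBER       NOT NULL,
    new_key   VARCHAR2(10) NOT NULL,
    -- deleting a source relationship removes every edge it generated:
    CONSTRAINT graph_new_key_fk FOREIGN KEY (new_key)
        REFERENCES source_tbl (new_key) ON DELETE CASCADE
);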
This would work if any given ID can only have zero or one parent IDs. But it won't work if it is permissible for this to happen:
ID | PARENT_ID
------+----------
5 | 2
5 | 4
In other words the edge 1->5 represents both 1->2->4->5 and 1->2->5. So, what might work depends on the complexity of your data.