Hierarchical tagging in SQL

I have a PHP web application which uses a MySQL database for object tagging, in which I've used the tag structure accepted as the answer to this SO question.
I'd like to implement a tag hierarchy, where each tag can have a unique parent tag. Searches for a parent tag T would then match all descendants of T (i.e. T itself, tags whose parent is T (children of T), grandchildren of T, etc.).
The easiest way of doing this seems to be to add a ParentID field to the tag table, which contains the ID of a tag's parent tag, or some magic number if the tag has no parent. Searching for descendants, however, then requires repeated full searches of the database to find the tags in each 'generation', which I'd like to avoid.
A (presumably) faster, but less normalised way of doing this would be to have a table containing all the children of each tag, or even all the descendants of each tag. This however runs the risk of inconsistent data in the database (e.g. a tag being the child of more than one parent).
Is there a good way to make queries to find descendants fast, while keeping the data as normalised as possible?

I implemented it using two columns. I simplify it a little here: I actually kept the tag name in a separate field/table, because I had to localize it for different languages:
tag
path
Look at these rows for example:
tag path
--- ----
database database/
mysql database/mysql/
mysql4 database/mysql/mysql4/
mysql4-1 database/mysql/mysql4-1/
oracle database/oracle/
sqlserver database/sqlserver/
sqlserver2005 database/sqlserver/sqlserver2005/
sqlserver2008 database/sqlserver/sqlserver2008/
etc.
Using the like operator on the path field you can easily get all needed tag rows:
SELECT * FROM tags WHERE path LIKE 'database/%'
There are some implementation details (when you move a node in the hierarchy you have to update the paths of all its descendants too, for example), but it's not hard.
Also make sure your path column is long enough. In my case I built the path not from the tag name but from another field, to make sure the paths don't get too long.
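A minimal sketch of this materialized-path approach, using an in-memory SQLite database for illustration (the table and column names are placeholders, not from the original schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag TEXT, path TEXT)")
conn.executemany("INSERT INTO tags VALUES (?, ?)", [
    ("database", "database/"),
    ("mysql", "database/mysql/"),
    ("mysql4", "database/mysql/mysql4/"),
    ("oracle", "database/oracle/"),
])

# All descendants of 'mysql' (including 'mysql' itself, since
# LIKE's % wildcard also matches the empty string) in one query
descendants = conn.execute(
    "SELECT tag FROM tags WHERE path LIKE 'database/mysql/%'"
).fetchall()
```

Note that no self-join or recursion is needed; one range-style LIKE scan finds the whole subtree, which is the main appeal of this layout.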

Ali's answer has a link to Joe Celko's Trees and Hierarchies in SQL for Smarties, which confirms my suspicion - there isn't a simple database structure that offers the best of all worlds. The best for my purpose seems to be the "Frequent Insertion Tree" detailed in this book, which is like the "Nested Set Model" of Ali's link, but with non-consecutive indexing. This allows O(1) insertion (a la unstructured BASIC line numbering), with occasional index reorganisation as and when needed.
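As a rough illustration of the gapped-numbering idea (my own sketch of the general technique, not Celko's exact scheme): give each node left/right bounds with large gaps, so a new child fits between existing bounds without renumbering, while descendant queries stay a simple range test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (tag TEXT, lft INTEGER, rgt INTEGER)")
# Non-consecutive bounds leave room for O(1) inserts,
# a la unstructured BASIC line numbering
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", [
    ("database", 100, 900),
    ("mysql",    200, 400),
    ("mysql4",   250, 300),
    ("oracle",   500, 600),
])

# Descendants of 'mysql': everything strictly inside its bounds
rows = conn.execute(
    "SELECT tag FROM tags WHERE lft > 200 AND rgt < 400"
).fetchall()

# A new child of 'oracle' can take e.g. bounds (520, 540)
# with no renumbering of existing rows
```

When the gaps are exhausted in some region, the occasional index reorganisation mentioned above renumbers that subtree.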

A few ways here

You could build what Kimball calls a Hierarchy Helper Table.
Say your hierarchy looks like this: A -> B | B -> C | C -> D
You'd insert records into a table that looks like this:
ParentID, ChildID, Depth, Highest Flag, Lowest Flag
A, A, 0, Y, N
A, B, 1, N, N
A, C, 2, N, N
A, D, 3, N, Y
B, B, 0, N, N
B, C, 1, N, N
B, D, 2, N, Y
C, C, 0, N, N
C, D, 1, N, Y
D, D, 0, N, Y
I think I have that correct... anyway. The point is that you still store your hierarchy correctly; you just build this table FROM your proper table. THIS table queries like a Banshee. Say you want to know everything at the first level below B:
WHERE parentID = 'B' and Depth = 1
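A hedged sketch of such a helper table in SQLite (column names as in the answer above; the Highest/Lowest flags are omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hierarchy (ParentID TEXT, ChildID TEXT, Depth INTEGER)"
)
# One row per ancestor/descendant pair for the chain A -> B -> C -> D
conn.executemany("INSERT INTO hierarchy VALUES (?, ?, ?)", [
    ("A", "A", 0), ("A", "B", 1), ("A", "C", 2), ("A", "D", 3),
    ("B", "B", 0), ("B", "C", 1), ("B", "D", 2),
    ("C", "C", 0), ("C", "D", 1),
    ("D", "D", 0),
])

# Everything exactly one level below B
children = conn.execute(
    "SELECT ChildID FROM hierarchy WHERE ParentID = 'B' AND Depth = 1"
).fetchall()
```

Dropping the Depth filter gives the whole subtree under B, which is exactly the descendant search the question asks about, with no recursion at query time.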

I would use some kind of array to store the children tags; this should be a lot faster than joining a table on itself (especially if you have a large number of tags). As far as I can tell, MySQL has no native array data type, but you can emulate one by using a text column and storing a serialized array in it. If you want to speed things up further, you should be able to put a full-text index on that column to find out which tags are related.
[Edit]
After reading Ali's article, I did some more hunting and found this presentation on a number of approaches for implementing hierarchies in Postgres. It might still be helpful for explanatory purposes.

Related

I'm having problems understanding some Cypher concepts - naming nodes

I'm following a tutorial and I'm having a hard time understanding the very basic things, so I need your help. The code is:
CREATE (u:User {name: "Alice"})-[:Likes]->(m:Software {name: "Memgraph"});
The explanation for this code is: The query above will create 2 nodes in the database, one labeled "User" with name "Alice" and the other labeled "Software" with name "Memgraph". It will also create a relationship that "Alice" likes "Memgraph".
This part I get.
What I don't get is this:
To find created nodes and relationships, execute the following query:
MATCH (u:User)-[r]->(x) RETURN u, r, x;
If I've created a node with a variable (or whatever u is called), why is the relationship referred to as r, and the software as x? When the relationship was created, it was just defined as :Likes, and the software was m.
Where do r and x come from? Is there any connection between CREATE and MATCH, or is only the order important and the names not important at all?
When you were creating nodes and a relationship between them, you used variables, but you could have done it like this:
CREATE (:User {name: "Alice"})-[:Likes]->(:Software {name: "Memgraph"});
The above query would also create the nodes and the relationship. We use variables when we want to refer to certain objects in other parts of the same query. For example, if you also wanted to return the created nodes, you would execute the following query:
CREATE (u:User {name: "Alice"})-[:Likes]->(m:Software {name: "Memgraph"})
RETURN u, m;
Notice that you needed the variables u and m in order to return them. Also, you can't return the relationship object, since you did not assign a variable to it.
Now to answer the other part of your question: variables are used within one query and are not remembered in any way for use in another query. So if you ran:
CREATE (u:User {name: "Alice"})-[:Likes]->(m:Software {name: "Memgraph"});
and then want to get u and m in some other query, those won't be the Alice and Memgraph nodes anymore. Hence, when you execute:
MATCH (u:User)-[r]->(x) RETURN u, r, x;
you get every node labeled User (Alice included, since she is one), along with all other nodes that those User nodes are related to and the relationships between them. You choose your own variable names, so it does not matter whether they are u, r and x or m, n and l.

OrientDB graph query that match specific relationship

I am developing an application using OrientDB as a database. The database is already filled, and now I need to make some queries to obtain specific information.
I have 3 classes and 3 edges to be concerned. What I need to do is query the database to see if some specific relationship exists. The relationship is like this:
ParlamentarVertex --Realiza> TransacaoVertex --FornecidaPor> EmpresaFornecedoraVertex AND ParlamentarVertex --SocioDe> EmpresaFornecedoraVertex
The names ending in Vertex are of course vertices, and the arrows are the edges between them.
I've tried to do this:
SELECT TxNomeParlamentar, SgPartido, SgUF FROM Parlamentar where ...
SELECT EXPAND( out('RealizaTransacao').out('FornecidaPor') ) FROM Parlamentar
But I do not know how to specify the relationships in the where clause.
I've also tried to use match
MATCH {class: Parlamentar, as: p} -Realiza-> {as:realiza}
But I am not sure how to specify the and clause that is really important for my query.
Does anyone have some tip, so I can go in the right direction?
Thanks in advance!
EDIT 1
I've managed to use the query below:
SELECT EXPAND( out('RealizaTransacao').out('FornecidaPor').in('SocioDe') ) FROM Parlamentar
It almost works, but it returns some relationships incorrectly. It looks like a join where I did not bind the PK and FK.
The easiest thing here is to use a MATCH as follows:
MATCH
{class:ParlamentarVertex, as:p} -Realiza-> {class:TransacaoVertex, as:t}
-FornecidaPor-> {class:EmpresaFornecedoraVertex, as:e},
{as:p} -SocioDe-> {as:e}
RETURN p, p.TxNomeParlamentar, p.SgPartido, p.SgUF, t, e
(or RETURN whatever you need)
As you can see, the AND is represented as the conjunction of multiple patterns, separated by a comma.

SQL: List-Field contains sublist

Quick preface: I use the SQL implementation persistent (Haskell) and esqueleto.
Anyway, I want to have a SQL table with a column of type [String], i.e. a list of strings. Now I want to make a query which gives me all the records where a given list is a sublist of the one in the record.
For instance the table with
ID Category
0 ["math", "algebra"]
1 ["personal", "life"]
2 ["algebra", "university", "personal"]
with a query of ["personal", "algebra"] would return only the record with ID=2, since ["personal", "algebra"] is a sublist of ["algebra", "university", "personal"].
Is a query like this possible with variable-length of my sought-after sublist and "basic" SQL operators?
If someone knows their way around persistent/esqueleto that would of course be awesome.
Thanks.
Expanding on the comment of Gordon Linoff and the previous answer:
SQL databases are sometimes limited in their power. Since the order of the Strings in your [String] does not seem to matter, you are trying to put something like a set into a relational database, and for your query you suggest something like an "is a subset of" operator.
If there were a database engine that provided those structures, there would be nothing wrong with using it (I don't know of any). However, approximating your set logic (or any logic that is not natively supported by the database) has disadvantages:
You have to explicitly deal with edge cases (cf. xnyhps' answer)
Instead of hiding the complexity of storing data, you need to explicitly deal with it in your code
You need to study the database engine rather than writing your Haskell code
The interface between database and Haskell code becomes blurry
A more robust approach is to reformulate your storage task as something that fits naturally into the relational database concept, i.e. to put it in terms of relations.
Entities and relations are simple, so you avoid edge cases. You don't need to bother with how exactly the db backend stores your data; in fact you don't have to bother with the database much at all. And your interface is reduced to rather straightforward queries (making use of joins). Everything that cannot be (comparatively) easily realized with a query belongs (probably) in the Haskell code.
Of course, the details differ based on the specific circumstances.
In your specific case, you could use something like this:
Table: Category
ID Description
0 math
1 algebra
2 personal
3 life
4 university
Table: CategoryGroup
ID CategoryID
0 0
0 1
1 2
1 3
2 1
2 4
2 2
... where the foreign key relation allows you to have groups of categories. Here you are using the relational database where it excels. To query for a CategoryGroup you would join the two tables, yielding a result of type
[(Entity CategoryGroup, Entity Category)]
which I would transform in Haskell to something like
[(Entity CategoryGroup, [Entity Category])]
where the Category entities are collected for each CategoryGroup (this requires deriving (Eq, Ord) in your CategoryGroup model).
The set logic described above, for a given list cs :: [Entity Category], would then go like:
import qualified Data.Set as Set
import Data.Set (isSubsetOf)
let s = Set.fromList ["personal", "algebra"]
let s0 = Set.fromList $ map (categoryDescription . entityVal) cs
if s `isSubsetOf` s0 -- ... ?
Getting used to the restrictions of relational databases can be annoying in the beginning. I guess that for something of central importance (persisting data), a robust concept is often better than a mighty one, and it pays to always know exactly what your database is doing.
By using [String], persistent converts the entire list to a quoted string, making it very hard to work with from SQL.
You can do something like:
mapM (\cat ->
       where_ (x ^. Category `like` (%) ++. val (show cat) ++. (%)))
     ["personal", "algebra"]
But this is very fragile (it may break when the categories contain a " character, etc.).
Better approaches are:
You could do the filtering in Haskell if the database is small enough.
It would be much easier to model your data as:
Objects:
ID ...
0 ...
1 ...
2 ...
ObjectCategories:
ObjectID Category
0 math
0 algebra
1 personal
1 life
2 algebra
2 university
2 personal
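With that relational layout, the "contains this sublist" query becomes standard relational division: keep the objects that match every requested category. A sketch in SQLite (table and column names taken from the model above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ObjectCategories (ObjectID INTEGER, Category TEXT)")
conn.executemany("INSERT INTO ObjectCategories VALUES (?, ?)", [
    (0, "math"), (0, "algebra"),
    (1, "personal"), (1, "life"),
    (2, "algebra"), (2, "university"), (2, "personal"),
])

wanted = ["personal", "algebra"]
placeholders = ",".join("?" for _ in wanted)
# Objects having ALL of the wanted categories: restrict to wanted rows,
# then demand one distinct match per wanted category
query = (
    "SELECT ObjectID FROM ObjectCategories "
    f"WHERE Category IN ({placeholders}) "
    "GROUP BY ObjectID "
    "HAVING COUNT(DISTINCT Category) = ?"
)
rows = conn.execute(query, wanted + [len(wanted)]).fetchall()
```

This works for any length of the sought-after sublist, using only basic SQL operators, which is what the question asked for.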

SQL to Return Id Arguments of Records That Are Found not to Exist

I have a query whereby I'm working with an array of property ids, e.g. [1,2,3,4]. I need to check a table to see if a record with each id exists; I'm interested in the ones that don't exist rather than the ones that do.
At the moment I'm looping over the array in Ruby then making separate SELECT requests for each one. This works, but often the array is very long and it seems very inefficient making many separate requests.
I was wondering if there's a way to pass the whole array to Postgres and have Postgres hand me back all of the ids that don't exist.
Thanks,
Chris
Okay figured it out:
SELECT *
FROM unnest(ARRAY[1,2,3,4,5]) as s
WHERE s NOT IN (SELECT id FROM my_table);
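The same idea can be sketched outside Postgres too. SQLite has no unnest, but a VALUES list in a CTE plays the same role; note that NOT IN misbehaves if id can be NULL, so NOT EXISTS is the safer spelling in general:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(1,), (3,), (5,)])

ids = [1, 2, 3, 4, 5]
values = ",".join("(?)" for _ in ids)
# ids from the input list with no matching row in my_table
query = (
    f"WITH s(id) AS (VALUES {values}) "
    "SELECT id FROM s "
    "WHERE NOT EXISTS (SELECT 1 FROM my_table t WHERE t.id = s.id)"
)
missing = conn.execute(query, ids).fetchall()
```

Either way, one round trip replaces the per-id SELECT loop from Ruby.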

Django query for large number of relationships

I have Django models setup in the following manner:
model A has a one-to-many relationship to model B
each record in A has between 3,000 to 15,000 records in B
What is the best way to construct a query that will retrieve the newest (greatest pk) record in B that corresponds to a record in A for each record in A? Is this something that I must use SQL for in lieu of the Django ORM?
Create a helper function for safely extracting the 'top' item from any queryset. I use this all over the place in my own Django apps.
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    # Extract a single-element collection holding the top item
    result = queryset[0:1]
    # Return that element, or None if there weren't any matches
    return result[0] if result else None
This uses a bit of a trick w/ the slice operator to add a limit clause onto your SQL.
Now use this function anywhere you need to get the 'top' item of a query set. In this case, you want to get the top B item for a given A where the B's are sorted by descending pk, as such:
latest = top_or_none(B.objects.filter(a=my_a).order_by('-pk'))
There's also the recently added 'Max' function in Django Aggregation which could help you get the max pk, but I don't like that solution in this case since it adds complexity.
P.S. I don't really like relying on the 'pk' field for this type of query, as some RDBMSs don't guarantee that sequential pks match logical creation order. If I have a table that I know I will need to query in this fashion, I usually add my own 'creation' datetime column that I can order by instead of pk.
Edit based on comment:
If you'd rather use queryset[0], you can modify the 'top_or_none' function thusly:
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    try:
        return queryset[0]
    except IndexError:
        return None
I didn't propose this initially because I was under the impression that queryset[0] would pull back the entire result set, then take the 0th item. Apparently Django adds a 'LIMIT 1' in this scenario too, so it's a safe alternative to my slicing version.
Edit 2
Of course you can also take advantage of Django's related manager construct here and build the queryset through your 'A' object, depending on your preference:
latest = top_or_none(my_a.b_set.order_by('-pk'))
I don't think the Django ORM can do this (but I've been pleasantly surprised before...). If there's a reasonable number of A records (or if you're paging), I'd just add a method to the A model that returns this 'newest' B record. If you want to get a lot of A records, each with its own newest B, I'd drop to SQL.
Remember that no matter which route you take, you'll need a suitable composite index on the B table, perhaps adding ordering = ('a_fk', '-id') to the Meta subclass.
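The "drop to SQL" route above can be sketched as a greatest-per-group query; here a correlated subquery picks the max pk per parent (SQLite for illustration; the table and column names, including a_fk, are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a_fk INTEGER)")
conn.executemany("INSERT INTO b (id, a_fk) VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 20), (4, 20), (5, 20)])

# Newest (greatest pk) B per A in a single pass; the composite
# (a_fk, id) index mentioned above is what makes this cheap
rows = conn.execute(
    "SELECT a_fk, id FROM b AS outer_b "
    "WHERE id = (SELECT MAX(id) FROM b WHERE a_fk = outer_b.a_fk) "
    "ORDER BY a_fk"
).fetchall()
```

In Django this could be fed through a raw query, or per-A via the top_or_none helper shown earlier in the thread.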