I have the following DAG:
A --> B
|     |
v     v
C --> D
Here is the closure table:
| Ancestor | Descendant | Depth |
|----------|------------|-------|
| A        | A          | 0     |
| A        | B          | 1     |
| A        | C          | 1     |
| A        | D          | 2     |
| A        | D          | 2     |
| B        | B          | 0     |
| B        | D          | 1     |
| C        | C          | 0     |
| C        | D          | 1     |
| D        | D          | 0     |
How would I go about removing the path B > D (thus removing A > B > D) without also removing A > C > D and C > D?
Right now I'm using the following query, but it only works when every node has exactly one parent.
DELETE FROM `Closure`
WHERE `Descendant` IN (SELECT `Descendant` FROM `Closure` WHERE `Ancestor`=#Node)
AND `Ancestor` NOT IN (SELECT `Descendant` FROM `Closure` WHERE `Ancestor`=#Node);
Given your current model I'm not sure it's possible. I'd propose that you add a column that counts the number of paths, i.e. tracks how many different ways there are to get from any node X to node Y.
So rather than your table, you end up with:
| Ancestor | Descendant | Depth | Refs |
|----------|------------|-------|------|
| A        | A          | 0     | 1    |
| A        | B          | 1     | 1    |
| A        | C          | 1     | 1    |
| A        | D          | 2     | 2    |
| B        | B          | 0     | 1    |
| B        | D          | 1     | 1    |
| C        | C          | 0     | 1    |
| C        | D          | 1     | 1    |
| D        | D          | 0     | 1    |
Removing a node would then entail an update statement followed by a delete statement.
The update would, instead of deleting the entries it finds, decrement the number of references for that entry. Then you could delete the entries with 0 or fewer references afterwards.
Coming up with a SQL query which does the update is escaping me at the moment but in theory this should work without having to completely reconstruct the closure table...
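A possible sketch of that update/delete pair, using Python's sqlite3 so it can be run as-is. The decrement rule is my own derivation (every path through the removed edge parent -> child contributed Refs(ancestor, parent) * Refs(child, descendant) paths to each affected pair), and Depth maintenance is left out entirely:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Closure (Ancestor TEXT, Descendant TEXT, Depth INT, Refs INT);
INSERT INTO Closure VALUES
 ('A','A',0,1),('A','B',1,1),('A','C',1,1),('A','D',2,2),
 ('B','B',0,1),('B','D',1,1),
 ('C','C',0,1),('C','D',1,1),
 ('D','D',0,1);
""")

def remove_edge(conn, parent, child):
    # Every path that used the edge parent -> child contributed
    # Refs(anc, parent) * Refs(child, desc) paths to the pair (anc, desc),
    # so that is exactly how much each affected pair must be decremented.
    deltas = conn.execute("""
        SELECT p.Ancestor, c.Descendant, SUM(p.Refs * c.Refs)
        FROM Closure p, Closure c
        WHERE p.Descendant = ? AND c.Ancestor = ?
        GROUP BY p.Ancestor, c.Descendant
    """, (parent, child)).fetchall()
    conn.executemany(
        "UPDATE Closure SET Refs = Refs - ? WHERE Ancestor = ? AND Descendant = ?",
        [(n, a, d) for a, d, n in deltas])
    # Pairs whose last remaining path went through the edge are now dead.
    conn.execute("DELETE FROM Closure WHERE Refs <= 0")

remove_edge(conn, 'B', 'D')
left = {(a, d): r for a, d, r in
        conn.execute("SELECT Ancestor, Descendant, Refs FROM Closure")}
# (B, D) is gone, while (A, D) survives with Refs dropped from 2 to 1
```

The deltas are computed in a separate SELECT before any row is touched, which avoids relying on whether the engine lets an UPDATE's correlated subquery see mid-statement changes.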
First, I believe there is a duplicate entry in your table. (A,D) appears twice. Second, after removing the edge (B,D), the following paths should remain:
Node self-maps: (A,A), (B,B), (C,C), (D,D)
(A,B)
(A,C)
(A,D) (through C)
Thus, to remove the edge (B,D) in this example, all that is required is to remove that one row:
DELETE FROM MyTable
WHERE Ancestor = 'B'
  AND Descendant = 'D'
A closure table still only maps relations between two nodes. What makes it special is that it maps every indirect relation as if it were a direct relation. The edge (B,D) simply says that you can get from B to D. That row alone says nothing about how you got to B, nor about how many nodes it took to get from B to D; it simply says you can get from B to D. Thus, there is no edge listed for A > B > D per se. Rather, all that is captured is that you can get from A to B and from A to D, which is still true even if the edge (B,D) is removed.
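To illustrate, here is a runnable sketch (sqlite3, using the MyTable name and the closure rows from the question, minus the duplicate):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (Ancestor TEXT, Descendant TEXT, Depth INT);
INSERT INTO MyTable VALUES
 ('A','A',0),('A','B',1),('A','C',1),('A','D',2),
 ('B','B',0),('B','D',1),
 ('C','C',0),('C','D',1),
 ('D','D',0);
""")

# Removing the edge (B, D) is just removing that one row...
conn.execute("DELETE FROM MyTable WHERE Ancestor = 'B' AND Descendant = 'D'")

remaining = {(a, d) for a, d, _ in conn.execute("SELECT * FROM MyTable")}
# ...and (A, D) is still present: A can still reach D through C
```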
In natural language, that would be: "Delete the ancestor-descendant relationship to D, if there is no parent of D besides B that is also a descendant of A". Is that correct?
(Edit: no, that's not correct; not only relationships to D must be removed, but also relationships to every descendant of D. Thus, that criterion is not valid...)
My tentative SQL would then be:
DELETE a
FROM Closure AS a
INNER JOIN Closure AS b ON a.Descendant = b.Descendant
WHERE
a.Descendant IN (SELECT Descendant FROM Closure WHERE Ancestor = {Child}) AND
b.Depth = 1 AND
b.Ancestor != {Parent} AND
a.Ancestor NOT IN (SELECT Ancestor FROM Closure WHERE Descendant = b.Ancestor)
(Sorry if I got the query wrong - or used non-standard features - I'm not actually experienced with that. But my natural language description should give an insight for what actually needs to go on the query)
Update: On second thought, I don't believe my query will work for all cases. Consider this:
A --> B --> D --> E --> F
F is a descendant of D (True)
E is a parent of F (True)
E is not B (True)
A is not an ancestor of E (False)
Thus, A >> F won't be removed, even though it should. Sorry I couldn't help, but that seems a problem too big to fit in a single query. I'd suggest looking for an algorithmic solution first, then seeing how that could be implemented in your case.
While tracking depth and allowing multiple parents at the same time is probably possible, I get a code smell from it, especially when the solution involves having duplicate pairs. Thomas' answer outlines that issue.
I'm going to simplify the question a bit to just focus on unlinking a node when multiple parents are allowed, because it's a tricky enough problem on its own. I'm discarding the depth column entirely and assuming there are no duplicate pairs.
You have to take into account children of D, which gets complicated fast. Say we have:
A --> B
|     |
v     v
C --> D --> E
We want to unlink D from B, which means we also have to remove links between E and B. But what if they're connected like this:
A --> B --> C
|     |
v     v
D --> E <-- G
|
v
H --> F
In this case if we disconnect B > D, we don't want to unlink B and E anymore, because E is still linked to B through C. But we do want to disconnect F from B.
I'll go through my solution below using that second example. I know that D only has one parent in this example, but it still works perfectly if D has multiple parents; I can more easily demonstrate some of the edge cases this way so that's why I'm doing it like this.
Here's what the table would look like:
| Ancestor | Descendant |
-------------------------
| A | A |
| A | B |
| A | C |
| A | D |
| A | E |
| A | F |
| B | B |
| B | C |
| B | D |
| B | E |
| B | F |
| C | C |
| C | E |
| D | D |
| D | E |
| D | F |
| E | E |
| F | F |
| G | G |
| G | E |
| H | H |
| H | F |
Query 1: Get all descendants of D, including D
SELECT `Descendant`
FROM `Closure`
WHERE `Ancestor` = #Node
This will return: D, E, F
Query 2: Get all ancestors of B, including B
SELECT `Ancestor`
FROM `Closure`
WHERE `Descendant` = #ParentNode
This will return: A, B
Query 3a: Get all ancestors of the items in Query 1 that do not appear in Query 1 or 2
SELECT DISTINCT `Ancestor`
FROM `Closure`
WHERE `Descendant` IN (#Query1)
AND `Ancestor` NOT IN (#Query1)
AND `Ancestor` NOT IN (#Query2)
This will return: C, G, H
The goal here is to get all parents of E and F that may reconnect farther up the chain.
Query 3b: this is exactly the same as Query 3a, except it returns both ancestors and descendants
SELECT DISTINCT `Ancestor`, `Descendant`
[ ... ]
This will return: (C, E), (G, E), (H, F)
We'll need this later.
Query 4: Filter results of Query 3a down to nodes that reconnect farther up the chain
SELECT `Ancestor`, `Descendant`
FROM `Closure`
WHERE `Descendant` IN (#Query3a)
AND (`Ancestor` IN (#Query2) OR `Ancestor` = `Descendant`)
This will return: (A, C), (B, C), (C, C)
We now have references to all parents of C that should not be unlinked. Note that we have no links to parents of F. That's because F is not connected to B and A except through D (which we're unlinking).
Query 5: Construct the pairs that should be kept, using the results of Query 3b as a bridge between Queries 1 and 4
SELECT `Query4`.`Ancestor`, `Query3b`.`Descendant`
FROM (#Query3b) AS `Query3b`
INNER JOIN (#Query4) AS `Query4`
ON `Query4`.`Descendant` = `Query3b`.`Ancestor`
WHERE `Query3b`.`Descendant` IN (#Query1)
This will return: (A, E), (B, E)
Query 6: Run the regular query for orphaning a node and its children, except exclude all pairs returned by Query 5
DELETE FROM `Closure`
WHERE `Descendant` IN (#Query1)
AND `Ancestor` IN (#Query2)
AND (`Ancestor`, `Descendant`) NOT IN (#Query5)
After this operation, we will have removed the following links:
| Ancestor | Descendant |
-------------------------
| A | D |
| A | F |
| B | D |
| B | F |
Both D and F are correctly unlinked, and E correctly retains its connections to A and B, leaving:
A --> B --> C
|
v
D --> E <-- G
|
v
H --> F
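For completeness, here is one way the six queries could be wired together, sketched in Python with sqlite3 so the walkthrough above can be replayed end to end. The intermediate sets are computed in Python rather than as nested SQL, which sidesteps dialect quirks (row-value `(a, b) NOT IN (...)` support varies); treat it as a sketch of the procedure, not the only way to phrase it:

```python
import sqlite3

PAIRS = [('A','A'),('A','B'),('A','C'),('A','D'),('A','E'),('A','F'),
         ('B','B'),('B','C'),('B','D'),('B','E'),('B','F'),
         ('C','C'),('C','E'),('D','D'),('D','E'),('D','F'),
         ('E','E'),('F','F'),('G','G'),('G','E'),('H','H'),('H','F')]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Closure (Ancestor TEXT, Descendant TEXT)")
conn.executemany("INSERT INTO Closure VALUES (?, ?)", PAIRS)

def unlink(conn, node, parent):
    # Query 1: all descendants of `node`, including itself
    q1 = [d for (d,) in conn.execute(
        "SELECT Descendant FROM Closure WHERE Ancestor = ?", (node,))]
    # Query 2: all ancestors of `parent`, including itself
    q2 = [a for (a,) in conn.execute(
        "SELECT Ancestor FROM Closure WHERE Descendant = ?", (parent,))]
    p1, p2 = ",".join("?" * len(q1)), ",".join("?" * len(q2))
    # Query 3b: outside parents of the subtree that may reconnect it higher up
    q3b = conn.execute(f"""
        SELECT DISTINCT Ancestor, Descendant FROM Closure
        WHERE Descendant IN ({p1})
          AND Ancestor NOT IN ({p1}) AND Ancestor NOT IN ({p2})""",
        q1 + q1 + q2).fetchall()
    q3a = sorted({a for a, _ in q3b})
    # Query 4: which outside parents actually stay linked to the chain above
    q4 = []
    if q3a:
        p3 = ",".join("?" * len(q3a))
        q4 = conn.execute(f"""
            SELECT Ancestor, Descendant FROM Closure
            WHERE Descendant IN ({p3})
              AND (Ancestor IN ({p2}) OR Ancestor = Descendant)""",
            q3a + q2).fetchall()
    # Query 5: pairs to keep, bridging Query 4 to Query 1 through Query 3b
    keep = {(a4, d3) for a4, d4 in q4 for a3, d3 in q3b
            if d4 == a3 and d3 in q1}
    # Query 6: orphan the subtree, except for the kept pairs
    conn.executemany(
        "DELETE FROM Closure WHERE Ancestor = ? AND Descendant = ?",
        [(a, d) for a in q2 for d in q1 if (a, d) not in keep])

unlink(conn, 'D', 'B')
remaining = {tuple(r) for r in conn.execute("SELECT * FROM Closure")}
removed = set(PAIRS) - remaining
# removed == {('A','D'), ('A','F'), ('B','D'), ('B','F')}
```

Running it reproduces the table above: exactly (A,D), (A,F), (B,D), (B,F) are removed, and E keeps its links to A and B through C.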
Let me know if I've missed anything! I just solved this problem myself today and I may run into more edge cases as time goes on. If I find any I'll update this answer.
I'm honestly not sure how to title this - so apologies if it is unclear.
I have two tables I need to compare. One table contains tree names and nodes that belong to that tree. Each Tree_name/Tree_node combo will have its own line. For example:
Table: treenode
| TREE_NAME | TREE_NODE |
|-----------|-----------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | C |
| 3 | D |
| 3 | E |
| 3 | F |
I have another table that contains names of queries and what tree_nodes they use. Example:
Table: queryrecord
| QUERY | TREE_NODE |
|---------|-----------|
| Alpha | A |
| Alpha | B |
| Alpha | D |
| BRAVO | A |
| BRAVO | B |
| BRAVO | D |
| CHARLIE | A |
| CHARLIE | B |
| CHARLIE | F |
I need to create a SQL query where I input the QUERY name, and it returns any ‘TREE_NAME’ that includes all the nodes associated with the query. So if I input ‘ALPHA’, it would return TREE_NAME 1 & 2. If I ask it for CHARLIE, it would return nothing.
I only have read access, and don’t believe I can create temp tables, so I’m not sure if this is possible. Any advice would be amazing. Thank you!
You can use group by and having as follows:
SELECT t.tree_name
FROM treenode t
JOIN queryrecord q ON t.tree_node = q.tree_node
WHERE q.query = 'ALPHA'
GROUP BY t.tree_name
HAVING COUNT(DISTINCT t.tree_node) =
       (SELECT COUNT(DISTINCT tree_node) FROM queryrecord WHERE query = 'ALPHA');
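The approach can be checked with a quick sqlite3 session, using the table and column names from the question (note that the sample data capitalizes 'Alpha' but not the other two query names, and SQL string comparison is case sensitive):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE treenode (tree_name INT, tree_node TEXT);
INSERT INTO treenode VALUES
 (1,'A'),(1,'B'),(1,'C'),(1,'D'),(1,'E'),
 (2,'A'),(2,'B'),(2,'D'),
 (3,'C'),(3,'D'),(3,'E'),(3,'F');
CREATE TABLE queryrecord (query TEXT, tree_node TEXT);
INSERT INTO queryrecord VALUES
 ('Alpha','A'),('Alpha','B'),('Alpha','D'),
 ('BRAVO','A'),('BRAVO','B'),('BRAVO','D'),
 ('CHARLIE','A'),('CHARLIE','B'),('CHARLIE','F');
""")

SQL = """
SELECT t.tree_name
FROM treenode t
JOIN queryrecord q ON t.tree_node = q.tree_node
WHERE q.query = ?
GROUP BY t.tree_name
HAVING COUNT(DISTINCT t.tree_node) =
       (SELECT COUNT(DISTINCT tree_node) FROM queryrecord WHERE query = ?)
"""

alpha = sorted(r[0] for r in conn.execute(SQL, ('Alpha', 'Alpha')))
charlie = [r[0] for r in conn.execute(SQL, ('CHARLIE', 'CHARLIE'))]
# alpha == [1, 2]; charlie == []
```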
Using an IN condition (a semi-join, which saves time over a join):
with prep (tree_node) as (select tree_node from queryrecord where query = :q)
select tree_name
from treenode
where tree_node in (select tree_node from prep)
group by tree_name
having count(*) = (select count(*) from prep)
;
:q in the prep subquery (in the with clause) is the bind variable to which you will assign the various QUERY values at runtime.
EDIT
I don't generally set up the test case on online engines; but in a comment below this answer, the OP said the query didn't work for him. So, I set up the example on SQLFiddle, here:
http://sqlfiddle.com/#!4/b575e/2
A couple of notes: for some reason, SQLFiddle thinks table names should be at most eight characters, so I had to change the second table name to queryrec (instead of queryrecord). I changed the name in the query, too, of course. And, second, I don't know how I can give bind values on SQLFiddle; I hard-coded the name 'Alpha'. (Note also that in the OP's sample data, this query value is not capitalized, while the other two are; of course, text values in SQL are case sensitive, so one should pay attention when testing.)
You can do this with a join and aggregation. The trick is to count the number of nodes in query_record before joining:
select qr.query, t.tree_name
from (select qr.*,
count(*) over (partition by query) as num_tree_node
from query_record qr
) qr join
tree_node t
on t.tree_node = qr.tree_node
where qr.query = 'ALPHA'
group by qr.query, t.tree_name, qr.num_tree_node
having count(*) = qr.num_tree_node;
Here is a db<>fiddle.
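The same window-function shape can also be tried in SQLite (3.25+ supports window functions), here assuming the question's table names treenode and queryrecord rather than the underscored names used in the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE treenode (tree_name INT, tree_node TEXT);
INSERT INTO treenode VALUES
 (1,'A'),(1,'B'),(1,'C'),(1,'D'),(1,'E'),
 (2,'A'),(2,'B'),(2,'D'),
 (3,'C'),(3,'D'),(3,'E'),(3,'F');
CREATE TABLE queryrecord (query TEXT, tree_node TEXT);
INSERT INTO queryrecord VALUES
 ('Alpha','A'),('Alpha','B'),('Alpha','D'),
 ('BRAVO','A'),('BRAVO','B'),('BRAVO','D'),
 ('CHARLIE','A'),('CHARLIE','B'),('CHARLIE','F');
""")

result = conn.execute("""
    SELECT qr.query, t.tree_name
    FROM (SELECT qr.*, COUNT(*) OVER (PARTITION BY query) AS num_tree_node
          FROM queryrecord qr) qr
    JOIN treenode t ON t.tree_node = qr.tree_node
    WHERE qr.query = 'Alpha'
    GROUP BY qr.query, t.tree_name, qr.num_tree_node
    HAVING COUNT(*) = qr.num_tree_node
    ORDER BY t.tree_name
""").fetchall()
# [('Alpha', 1), ('Alpha', 2)]
```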
So I searched a bit and found that you can do a product of rows in Oracle using a GROUP BY and a nifty mathematical formula: exp(sum(ln(some_col))). It's pretty awesome, but unfortunately it doesn't work when some_col is potentially zero, because ln(0) is undefined (ln(x) tends to negative infinity as x approaches zero).
Example query:
select
  a, b, c,
  sum(d) d,
  sum(e) e,
  exp(sum(ln(f))) f
from x
group by a, b, c;
Obviously since this is a product of values, if one of them is zero, the product would be zero. The immediate thought would be to use a case, but that would require the case statement to be on an aggregate value or something in the GROUP BY... which it isn't. I can't just exclude those rows because I still need sum(d) and sum(e).
Any thoughts on a good way to do this while dealing with potential zeroes? I was thinking about something involving over(partition by ...), but in reality, my query groups by 12 columns and there are 20 other columns being aggregated. That query could get reeaaaal ugly, but if it's the only solution, I suppose I don't have a choice.
Side question: is there any particular reason there isn't a product function in Oracle? Seems like it'd be such a basic thing to include like sum is.
Note: This is Oracle 12c.
Example:
If I had an input table like this (matching with the query above):
| a | b | c | d | e | f |
+-----+-----+-----+---+---+---+
| foo | bar | hoo | 1 | 2 | 2 |
| foo | bar | hoo | 3 | 4 | 3 |
| foo | bar | hoo | 2 | 5 | 0 |
| foo | bar | mee | 1 | 2 | 2 |
| foo | bar | mee | 3 | 4 | 3 |
I would expect output like this:
| a | b | c | d | e | f |
+-----+-----+-----+---+----+---+
| foo | bar | hoo | 6 | 11 | 0 |
| foo | bar | mee | 4 | 6 | 6 |
However, because the third row has a 0 for f, we naturally get ORA-01428: argument '0' is out of range for ln(0).
First, log(0) is not undefined - it's negative infinity.
Second: in Oracle you can generate a negative infinity, but you'll have to use BINARY_FLOAT.
select a, b, c,
sum(d) d,
sum(e) e,
exp(sum(CASE WHEN f <> 0 THEN ln(f) ELSE -1/0F END)) f
from x
group by a, b, c;
Using your data this generates:
A B C D E F
foo bar hoo 6 11 0
foo bar mee 4 6 6.0000001304324524E+000
Note that introducing logarithms and power functions will introduce some rounding issues, but this should at least get you started.
dbfiddle here
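The arithmetic behind the trick is easy to sanity-check outside Oracle. Here is a minimal Python sketch (function name is mine) showing that mapping zero to negative infinity makes exp(sum(...)) come out as exactly 0.0, which is what the CASE expression above intends:

```python
import math

def product_via_logs(values):
    # exp(sum(ln(v))): a zero contributes -infinity to the sum,
    # and exp(-inf) == 0.0, so the product correctly collapses to zero.
    # (Like the Oracle trick, this only handles non-negative inputs.)
    total = sum(math.log(v) if v != 0 else float("-inf") for v in values)
    return math.exp(total)

print(product_via_logs([2, 3]))     # ~6.0, up to float rounding
print(product_via_logs([2, 3, 0]))  # 0.0
```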
TO NEGATIVE INFINITY...AND BEYOND!!!!!!
:-)
I'm using sqlite with python. Suppose that I have a datatable that looks like this:
Table 1
1 | 2 | 3 | 4 | 5
__|___|___|___|__
A | B | B | C | D
B | D | B | D | C
A | D | C | C | A
B | D | B | D | C
D | B | B | C | D
D | B | B | C | D
Question: How can I create (very quickly/efficiently/viable for very large databases) an index column for each row where if row x and row y are identical they get assigned the same index? For the example database I would want something like this:
Table 1
Index | 1 | 2 | 3 | 4 | 5
------|---|---|---|---|---
  23  | A | B | B | C | D
  32  | B | D | B | D | C
 106  | A | D | C | C | A
  32  | B | D | B | D | C
  80  | D | B | B | C | D
  80  | D | B | B | C | D
I don't care what the actual indexes are, as long as duplicate rows (like the last two in the example) get the same index.
You COULD create an index made up of every field in the table.
create index idx_all on table1 (field1, field2, field3, field4, field5)
But that's probably not a good idea. It makes a huge index that will be slow to build and slow to process. Some database engines won't let you create an index where the combination of fields is over a certain length. I'm not sure if there's such a limit in SQLite or what it might be.
The normal thing to do is to pick some field or combination of a small number of fields that is likely to be short and well distributed.
By "short" I mean literally and simply, the data in the field only takes a few bytes. It's an int or a varchar with a small length, varchar(4) or some such. There's no absolute rule about how short "short" is, but you should pick the shortest otherwise suitable field. A varchar(4000) would be a bad choice.
By "well distributed" I mean that there are many different values. Ideally, each row has a unique value, that is, there is no value that is the same for any two rows. If there is no such field, then pick one that comes as close to this as possible. A field where sometimes 2 or 3 rows share a value but rarely more than that is good. A field where half the records all have the same value is not.
If there is no one field that is well distributed, you can create an index on a combination of two or three fields. But if you use too many fields, you start breaking the "short" condition.
If you can parse your file row by row why not use a dict with the row as a string or a tuple?
my_dico = {}
index_counter = 0
with open(my_db) as my_database, open(out_file, 'w') as out:
    for row in my_database:
        my_row_as_a_tuple = tuple(row.strip().split())
        if my_row_as_a_tuple in my_dico:
            # duplicate row: reuse the index assigned the first time we saw it
            out.write(my_dico[my_row_as_a_tuple] + '<your separator>' + row)
        else:
            # new row: assign the next index and remember it
            index_counter += 1
            my_dico[my_row_as_a_tuple] = str(index_counter)
            out.write(str(index_counter) + '<your separator>' + row)
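The same idea, condensed to the sample table so it can be run standalone (the indexes come out sequential here rather than the arbitrary values shown in the question, which is fine since only equality of indexes matters):

```python
rows = """A B B C D
B D B D C
A D C C A
B D B D C
D B B C D
D B B C D""".splitlines()

my_dico = {}
index_counter = 0
indexed = []
for row in rows:
    key = tuple(row.split())          # the whole row is the dict key
    if key not in my_dico:
        index_counter += 1
        my_dico[key] = str(index_counter)
    indexed.append((my_dico[key], row))
# identical rows (2nd/4th and the last two) end up sharing an index
```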
I have something like these 2 tables (but millions of rows in real):
items:
| X | Y |
---------
| 1 | 2 |
| 3 | 4 |
---------
details:
| X | A | B |
-------------
| 1 | a | b |
| 1 | c | d |
| 3 | e | f |
| 3 | g | h |
-------------
I have to aggregate several rows of one table details for one row in another table items to show them in a GridView like this:
| items.X | items.Y | details.A | details.B |
---------------------------------------------
| 1 | 2 | a, c | b, d |
| 3 | 4 | e, g | f, h |
---------------------------------------------
I already read this and the related questions and I know about GROUP_CONCAT, but I am not allowed to install it on the customer system. Because I don't have a chance to do this natively, I created a stored function (which I'm allowed to create), which returns a table with the columns X, A and B. This function works fine so far, but I don't seem to get these columns added to my result set.
Currently I'm trying to join the function result with a query on items, join-criterion would be the X-column in the example above. I made a minimal example with the AdventureWorks2012 database, which contains a table function dbo.ufnGetContactInformation(#PersonID INT) to join with the [Person].[EmailAddress] table on BusinessEntityID:
SELECT
[EmailAddress]
-- , p.[FirstName]
-- , p.[LastName]
FROM
[Person].[EmailAddress] e
INNER JOIN
dbo.ufnGetContactInformation(e.[BusinessEntityID]) p
ON p.[PersonID] = e.[BusinessEntityID]
The 2 commented lines indicate, what I try to do in reality, but if not commented, they hide the actual error I get:
Msg 4104, Level 16, State 1, Line 6
The multi-part identifier "e.BusinessEntityID" could not be bound.
I understand, that during the joining process there is no value for e.[BusinessEntityID] yet. So I cannot select a specific subset in the function by using the function parameters, this should be in the join criteria anyway. Additionally I cannot have the function return all rows or create a temporary table, because this is insanely slow and expensive regarding both time and space in my specific situation.
Is there another way to achieve something like this with 2 existing tables and a table function?
Use APPLY.
CROSS APPLY is similar to an inner join; OUTER APPLY is similar to a left join.
SELECT
[EmailAddress]
-- , p.[FirstName]
-- , p.[LastName]
FROM
[Person].[EmailAddress] e
cross apply
dbo.ufnGetContactInformation(e.[BusinessEntityID]) p
I'm trying to write a query that pairs workers who work at the same place. The relational model I'm asking about looks like this:
Employee(EmNum, name)
Work(FiNum*, EmNum*)
Field(FiNum, Title)
(bold indicates primary key)
right now my code looks like
SELECT work.finum, e1.name,e1.emnum,e2.name,e2.emnum
FROM employee e1
INNER JOIN employee e2
on e1.EmNum = e2.EmNum
INNER JOIN work
on e1.emnum = work.emnum
This gives me result like
| finum | name | emnum | name_1 | emnum_1 |
| 1 | a | 1 | a | 1 |
| 1 | b | 2 | b | 2 |
| 2 | c | 3 | c | 3 |
| 3 | d | 4 | d | 4 |
| 3 | e | 5 | e | 5 |
while I want the result to be like
| finum | name | emnum | name_1 | emnum_1 |
| 1 | a | 1 | b | 2 |
| 1 | b | 2 | a | 1 |
| 3     | d    | 4     | e      | 5       |
| 3     | e    | 5     | d      | 4       |
I'm quite new at sql so I can't really think of a way to do this. Any help or input would be helpful.
Thanks
Your question is slightly unclear, but my guess is that you're trying to find employees that worked at the same place, i.e. the same finum in work but a different row. You can do that this way:
SELECT w1.finum, e1.name,e1.emnum, e2.name,e2.emnum
from work w1
join work w2 on w1.finum = w2.finum and w1.emnum != w2.emnum
join employee e1 on e1.emnum = w1.emnum
join employee e2 on e2.emnum = w2.emnum
If you don't want mirrored pairs (both 1 <-> 2 and 2 <-> 1), change the != in the join to > or <.
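With the sample data from the question, this join behaves as desired (a sqlite3 sketch; the work assignments are inferred from the expected output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (EmNum INT, name TEXT);
CREATE TABLE work (FiNum INT, EmNum INT);
INSERT INTO employee VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');
INSERT INTO work VALUES (1,1),(1,2),(2,3),(3,4),(3,5);
""")

rows = conn.execute("""
    SELECT w1.FiNum, e1.name, e1.EmNum, e2.name, e2.EmNum
    FROM work w1
    JOIN work w2 ON w1.FiNum = w2.FiNum AND w1.EmNum != w2.EmNum
    JOIN employee e1 ON e1.EmNum = w1.EmNum
    JOIN employee e2 ON e2.EmNum = w2.EmNum
    ORDER BY w1.FiNum, e1.EmNum
""").fetchall()
# [(1,'a',1,'b',2), (1,'b',2,'a',1), (3,'d',4,'e',5), (3,'e',5,'d',4)]
```

The lone worker on finum 2 is correctly excluded, since the != condition leaves nothing for them to pair with.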
I'm trying to make a query that pair a worker that work on the same place.
Presumably the "places" are represented by the Field table. If you want to pair up employees on that basis then you should be performing a join conditioned on field numbers being the same, as opposed to one conditioned on employee numbers being the same.
It looks like your main join wants to be a self-join of Work to Work on records with matching FiNum. To get the employee names in the result, you will also need to join Employee twice. To avoid employees being paired with themselves, you will want to filter those cases out via a WHERE clause.