Product in a GROUP BY with potential zeroes - sql

So I searched a bit and found that you can do a product of rows in Oracle using a GROUP BY and a nifty mathematical formula: exp(sum(ln(some_col))). It's pretty awesome, but unfortunately it doesn't work when some_col is potentially zero, because ln(0) is undefined (ln(x) tends to negative infinity as x approaches zero).
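For reference, the identity behind the trick is plain logarithm algebra, valid as long as every factor is strictly positive:

exp(ln(x1) + ln(x2) + ... + ln(xn)) = x1 * x2 * ... * xn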
Example query:
select
    a, b, c,
    sum(d) d,
    sum(e) e,
    exp(sum(ln(f))) f
from x
group by a, b, c;
Obviously since this is a product of values, if one of them is zero, the product would be zero. The immediate thought would be to use a case, but that would require the case statement to be on an aggregate value or something in the GROUP BY... which it isn't. I can't just exclude those rows because I still need sum(d) and sum(e).
Any thoughts on a good way to do this while dealing with potential zeroes? I was thinking about something involving over(partition by ...), but in reality, my query groups by 12 columns and there are 20 other columns being aggregated. That query could get reeaaaal ugly, but if it's the only solution, I suppose I don't have a choice.
Side question: is there any particular reason there isn't a product function in Oracle? Seems like it'd be such a basic thing to include like sum is.
Note: This is Oracle 12c.
Example:
If I had an input table like this (matching with the query above):
| a | b | c | d | e | f |
+-----+-----+-----+---+---+---+
| foo | bar | hoo | 1 | 2 | 2 |
| foo | bar | hoo | 3 | 4 | 3 |
| foo | bar | hoo | 2 | 5 | 0 |
| foo | bar | mee | 1 | 2 | 2 |
| foo | bar | mee | 3 | 4 | 3 |
I would expect output like this:
| a | b | c | d | e | f |
+-----+-----+-----+---+----+---+
| foo | bar | hoo | 6 | 11 | 0 |
| foo | bar | mee | 4 | 6 | 6 |
However, because the third row has a 0 for f, we naturally get ORA-01428: argument '0' is out of range for ln(0).

First, ln(0) is not just an error case - it tends to negative infinity, and IEEE floating point can represent exactly that.
Second: in Oracle you can generate a negative infinity, but you'll have to use BINARY_FLOAT.
select a, b, c,
       sum(d) d,
       sum(e) e,
       exp(sum(CASE WHEN f <> 0 THEN ln(f) ELSE -1/0F END)) f
from x
group by a, b, c;
Using your data this generates:
A    B    C    D  E   F
foo  bar  hoo  6  11  0
foo  bar  mee  4  6   6.0000001304324524E+000
Note that introducing logarithms and power functions will introduce some rounding issues, but this should at least get you started.
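If the rounding bothers you, one alternative (a sketch, assuming f is never negative - that assumption is mine, not the question's) is to keep ln() from ever seeing a zero. Note that aggregates are evaluated regardless of any CASE wrapped around them, so the guard has to sit inside ln(); nullif(f, 0) does that, and the outer CASE forces any group containing a zero to 0:

select a, b, c,
       sum(d) d,
       sum(e) e,
       case when min(f) = 0 then 0
            else exp(sum(ln(nullif(f, 0))))
       end f
from x
group by a, b, c;

Rounding the result (e.g. with round()) is still advisable when f is known to be integral.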
dbfiddle here
TO NEGATIVE INFINITY...AND BEYOND!!!!!!
:-)


Oracle SQL query comparing multiple rows with same identifier

I'm honestly not sure how to title this - so apologies if it is unclear.
I have two tables I need to compare. One table contains tree names and nodes that belong to that tree. Each Tree_name/Tree_node combo will have its own line. For example:
Table: treenode
| TREE_NAME | TREE_NODE |
|-----------|-----------|
| 1 | A |
| 1 | B |
| 1 | C |
| 1 | D |
| 1 | E |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | C |
| 3 | D |
| 3 | E |
| 3 | F |
I have another table that contains names of queries and what tree_nodes they use. Example:
Table: queryrecord
| QUERY | TREE_NODE |
|---------|-----------|
| Alpha | A |
| Alpha | B |
| Alpha | D |
| BRAVO | A |
| BRAVO | B |
| BRAVO | D |
| CHARLIE | A |
| CHARLIE | B |
| CHARLIE | F |
I need to create a SQL query where I input the QUERY name, and it returns any TREE_NAME that includes all the nodes associated with that query. So if I input 'ALPHA', it would return TREE_NAME 1 & 2. If I ask it for CHARLIE, it would return nothing.
I only have read access, and don’t believe I can create temp tables, so I’m not sure if this is possible. Any advice would be amazing. Thank you!
You can use group by and having as follows:
select t.tree_name
from tree_node t
join query_record q
  on t.tree_node = q.tree_node
where q.query = 'ALPHA'
group by t.tree_name
having count(distinct t.tree_node)
     = (select count(distinct q.tree_node) from query_record q where q.query = 'ALPHA');
Using an IN condition (a semi-join, which saves time over a join):
with prep (tree_node) as (select tree_node from queryrecord where query = :q)
select tree_name
from treenode
where tree_node in (select tree_node from prep)
group by tree_name
having count(*) = (select count(*) from prep)
;
:q in the prep subquery (in the with clause) is the bind variable to which you will assign the various QUERY values at runtime.
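For example, in SQL*Plus or SQL Developer the bind variable can be created and set like this before running the query above (a sketch; any variable name works):

variable q varchar2(30)
exec :q := 'Alpha'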
EDIT
I don't generally set up the test case on online engines; but in a comment below this answer, the OP said the query didn't work for him. So, I set up the example on SQLFiddle, here:
http://sqlfiddle.com/#!4/b575e/2
A couple of notes: for some reason, SQLFiddle thinks table names should be at most eight characters, so I had to change the second table name to queryrec (instead of queryrecord). I changed the name in the query, too, of course. And, second, I don't know how I can give bind values on SQLFiddle; I hard-coded the name 'Alpha'. (Note also that in the OP's sample data, this query value is not capitalized, while the other two are; of course, text values in SQL are case sensitive, so one should pay attention when testing.)
You can do this with a join and aggregation. The trick is to count the number of nodes in query_record before joining:
select qr.query, t.tree_name
from (select qr.*,
count(*) over (partition by query) as num_tree_node
from query_record qr
) qr join
tree_node t
on t.tree_node = qr.tree_node
where qr.query = 'ALPHA'
group by qr.query, t.tree_name, qr.num_tree_node
having count(*) = qr.num_tree_node;
Here is a db<>fiddle.

How to assign indexes to your datatable rows in sqlite in large databases?

I'm using sqlite with python. Suppose that I have a datatable that looks like this:
Table 1
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| A | B | B | C | D |
| B | D | B | D | C |
| A | D | C | C | A |
| B | D | B | D | C |
| D | B | B | C | D |
| D | B | B | C | D |
Question: How can I create (very quickly/efficiently, viable for very large databases) an index column such that, if row x and row y are identical, they get assigned the same index? For the example database I would want something like this:
Table 1
| Index | 1 | 2 | 3 | 4 | 5 |
|-------|---|---|---|---|---|
| 23    | A | B | B | C | D |
| 32    | B | D | B | D | C |
| 106   | A | D | C | C | A |
| 72    | B | D | B | D | C |
| 80    | D | B | B | C | D |
| 80    | D | B | B | C | D |
I don't care what the actual indexes are, as long as duplicate rows (like the last two in the example) get the same index.
You COULD create an index made up of every field in the table.
create index idx_table1_all on table1 (field1, field2, field3, field4, field5)
But that's probably not a good idea. It makes a huge index that will be slow to build and slow to process. Some database engines won't let you create an index where the combination of fields is over a certain length. I'm not sure if there's such a limit in SQLite or what it might be.
The normal thing to do is to pick some field or combination of a small number of fields that is likely to be short and well distributed.
By "short" I mean literally and simply, the data in the field only takes a few bytes. It's an int or a varchar with a small length, varchar(4) or some such. There's no absolute rule about how short "short" is, but you should pick the shortest otherwise suitable field. A varchar(4000) would be a bad choice.
By "well distributed" I mean that there are many different values. Ideally, each row has a unique value, that is, there is no value that is the same for any two rows. If there is no such field, then pick one that comes as close to this as possible. A field where sometimes 2 or 3 rows share a value but rarely more than that is good. A field where half the records all have the same value is not.
If there is no one field that is well distributed, you can create an index on a combination of two or three fields. But if you use too many fields, you start breaking the "short" condition.
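To gauge how well distributed a candidate field is, a quick aggregate query helps (a sketch; field1 stands in for whichever column you are considering):

select field1, count(*) as cnt
from table1
group by field1
order by cnt desc
limit 10;

If the top counts are all small, the field is well distributed; if one value dominates, look elsewhere.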
If you can parse your file row by row, why not use a dict with the row as a string or a tuple?
my_dico = {}
index_counter = 1
with open(my_db) as my_database, open(out_file, 'w') as out:
    for row in my_database:
        my_row_as_a_tuple = tuple(row.strip().split())
        if my_row_as_a_tuple in my_dico:
            out.write(my_dico[my_row_as_a_tuple] + '<your separator>' + row)
        else:
            index_counter += 1
            out.write(str(index_counter) + '<your separator>' + row)
            my_dico[my_row_as_a_tuple] = str(index_counter)
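For completeness, on SQLite 3.25 or newer a window function can compute such an index directly in SQL (a sketch; I'm assuming the five columns are named c1 through c5):

select dense_rank() over (order by c1, c2, c3, c4, c5) as row_index, *
from table1;

dense_rank() assigns identical rows the same number, which is exactly the "same index for duplicate rows" requirement.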

Combine column x to n in OpenRefine

I have a table with an unknown number of columns, and I need to combine all columns after a certain point. Consider the following:
| A | B | C | D | E |
|----|----|---|---|---|
| 24 | 25 | 7 | | |
| 12 | 3 | 4 | | |
| 5 | 5 | 5 | 5 | |
Columns A-C are known, and the information in them is correct. But columns D to N (an unknown number of columns starting with D) need to be combined, as they are all parts of the same string. How can I combine an unknown number of columns in OpenRefine?
As some columns may have empty cells (the string may be of various lengths) I also need to disregard empty cells.
There is a two step approach to this that should work for you.
From the first column you want to merge (Col D in this case) choose Transpose->Transpose cells across columns into rows
You will be asked to set some options. You'll want to choose 'From Column' D and 'To Column' N. Then choose to transpose into One Column, assign a name to that column, and make sure the option to 'Ignore Blank Cells' is checked (it should be checked by default). Then click Transpose.
You'll get the values that were previously in cols D-N appearing in rows. e.g.
| A | B | C | D | E | F |
|----|----|---|---|---|---|
| 1 | 2 | 3 | 4 | 5 | 6 |
Transposes to:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 4 |
| | | | 5 |
| | | | 6 |
You can then use the dropdown menu from the head of the 'new' column to choose
Edit cells->Join multi-value cells
You'll be asked what character you want to use to separate the values in the joined cell. In your use case you can probably delete the joining character and combine the cells without any separator. This will give you:
| A | B | C | new |
|----|----|---|-----|
| 1 | 2 | 3 | 456 |

How to count columns where values differ?

I have a large table and I need to check for similar rows. I don't need all column values to be the same, just similar. The rows must not be "distant" (determined by a query over another table), no value may be too different (I have already done the queries for these conditions), and most other values must be the same. I must expect some ambiguity, so one or two different values shouldn't break the "similarity" (well, I could get better performance by accepting only "completely equal" rows, but this simplification could cause errors; I will do this as an option).
The way I am going to solve this is through PL/pgSQL: to make a FOR LOOP iterating through the results of previous queries. For each column, I have an IF testing whether it differs; if yes, I increment a difference counter and go on. At the end of each loop, I compare the value to a threshold and see if I should keep the row as "similar" or not.
Such a PL/pgSQL-heavy approach seems slow in comparison to a pure SQL query, or to an SQL query with some PL/pgSQL functions involved. It would be easy to test for rows with all but X columns equivalent if I knew which columns should be different, but the difference can occur in any of some 40 columns. Is there any way to solve this with a single query? If not, is there any faster way than to examine all the rows?
EDIT: I mentioned a table, in fact it is a group of six tables linked by 1:1 relationship. I don't feel like explaining what is what, that's a different question. Extrapolating from doing this over one table to my situation is easy for me. So I simplified it (but not oversimplified it - it should demonstrate all the difficulties I have there) and made an example demonstrating what I need. Null and anything else should count as "different". No need to make a script testing it all - I just need to find out whether it is possible to do in any way more efficient than I thought about.
The point is that I don't need to count rows (as usual), but columns.
EDIT2: previous fiddle - this wasn't so short, so I left it here just for archival reasons.
EDIT3: simplified example here - just NOT NULL integers, preprocessing omitted. Current state of data:
select * from foo;
id | bar1 | bar2 | bar3 | bar4 | bar5
----+------+------+------+------+------
1 | 4 | 2 | 3 | 4 | 11
2 | 4 | 2 | 4 | 3 | 11
3 | 6 | 3 | 3 | 5 | 13
When I run select similar_records( 1 );, I should get only row 2 (2 columns with different values; this is within the limit), not row 3 (4 different values - outside the limit of two differences at most).
To find rows that only differ on a given maximum number of columns:
WITH cte AS (
   SELECT id
        , unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) AS col  -- more
        , unnest(ARRAY[bar1::text, bar2::text, bar3::text
                     , bar4::text, bar5::text]) AS val                  -- more
   FROM   foo
   )
SELECT b.id, count(a.val <> b.val OR NULL) AS cols_different
FROM  (SELECT * FROM cte WHERE id =  1) a
JOIN  (SELECT * FROM cte WHERE id <> 1) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) < 3  -- max. diffs allowed
ORDER  BY 2;
I ignored all the other distracting details in your question.
Demonstrating with 5 columns. Add more as required.
If columns can be NULL you may want to use IS DISTINCT FROM instead of <>.
This is using the somewhat unorthodox, but handy parallel unnest(). Both arrays must have the same number of elements to work. Details:
Is there something like a zip() function in PostgreSQL that combines two arrays?
SQL Fiddle (building on yours).
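Since the question asks for a similar_records( id ) call, the query above can be wrapped in a function. A minimal sketch (the parameter names p_id and p_max_diffs and the output column names are mine):

CREATE OR REPLACE FUNCTION similar_records(p_id int, p_max_diffs int DEFAULT 2)
  RETURNS TABLE (match_id int, diff_count bigint)
  LANGUAGE sql AS
$$
WITH cte AS (
   SELECT id
        , unnest(ARRAY['bar1', 'bar2', 'bar3', 'bar4', 'bar5']) AS col
        , unnest(ARRAY[bar1::text, bar2::text, bar3::text
                     , bar4::text, bar5::text]) AS val
   FROM   foo
   )
SELECT b.id, count(a.val <> b.val OR NULL)
FROM  (SELECT * FROM cte WHERE id =  p_id) a
JOIN  (SELECT * FROM cte WHERE id <> p_id) b USING (col)
GROUP  BY b.id
HAVING count(a.val <> b.val OR NULL) <= p_max_diffs;
$$;

With the sample data, select * from similar_records(1); returns row 2 (2 differences) and filters out row 3 (4 differences).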
Instead of a loop to compare each row to all the others, do a self join:
select f0.id, f1.id
from foo f0
inner join foo f1 on f0.id < f1.id
where
    f0.bar1 = f1.bar1 and f0.bar2 = f1.bar2
    and
    abs(f0.bar3 - f1.bar3) <= 1
    and
    f0.bar4 = f1.bar4 and f0.bar5 = f1.bar5
    or
    f0.bar4 = f1.bar5 and f0.bar5 = f1.bar4
    and
    abs(f0.bar6 - f1.bar6) <= 2
    and
    f0.bar7 is not null and f1.bar7 is not null and abs(f0.bar7 - f1.bar7) <= 5
    or
    f0.bar7 is null and f1.bar7 <= 3
    or
    f1.bar7 is null and f0.bar7 <= 3
    and
    f0.bar8 = f1.bar8
    and
    abs(f0.bar11 - f1.bar11) <= 5
;
id | id
----+----
1 | 4
1 | 5
4 | 5
(3 rows)
select * from foo;
id | bar1 | bar2 | bar3 | bar4 | bar5 | bar6 | bar7 | bar8 | bar9 | bar10 | bar11
----+------+------+------+------+------+------+------+------+------+-------+-------
1 | abc | 4 | 2 | 3 | 4 | 11 | 7 | t | t | f | 42.1
2 | abc | 5 | 1 | 6 | 2 | 8 | 39 | t | t | t | 19.6
3 | xyz | 4 | 2 | 3 | 5 | 14 | 82 | t | f | | 95
4 | abc | 4 | 2 | 4 | 3 | 11 | 7 | t | t | f | 42.1
5 | abc | 4 | 2 | 3 | 4 | 13 | 6 | t | t | | 37.7
Are you aware that the and operator has precedence over the or operator? I'm asking because it looks like the where clause in your function is not what you want. I mean, in your expression it is enough for f0.bar7 is null and f1.bar7 <= 3 to be true to include the pair.
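To illustrate the grouping (a generic sketch, not the actual clause): a condition written as

a or b and c

is evaluated as

a or (b and c)

so each or in that where clause starts an independent alternative unless parentheses are added.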

Moving in Closure Table with Multiple Parents

I have the following DAG
A --> B
|     |
v     v
C --> D
Here is the closure table
| Ancestor | Descendant | Depth |
---------------------------------
| A | A | 0 |
| A | B | 1 |
| A | C | 1 |
| A | D | 2 |
| A | D | 2 |
| B | B | 0 |
| B | D | 1 |
| C | C | 0 |
| C | D | 1 |
| D | D | 0 |
How would I go about removing the path B > D (thus removing A > B > D) without also removing A > C > D and C > D?
Right now I'm using the following query but it only works when every node only has 1 parent.
DELETE FROM `Closure`
WHERE `Descendant` IN (SELECT `Descendant` FROM `Closure` WHERE `Ancestor`=#Node)
AND `Ancestor` NOT IN (SELECT `Descendant` FROM `Closure` WHERE `Ancestor`=#Node);
Given your current model I'm not sure it's possible. I'd propose that you add a column that counts the number of paths, i.e. how many different ways there are to get from any node X to node Y.
So rather than your table you end up with
| Ancestor | Descendant | Depth | Refs |
-----------------------------------------
| A | A | 0 | 1 |
| A | B | 1 | 1 |
| A | C | 1 | 1 |
| A | D | 2 | 2 |
| B | B | 0 | 1 |
| B | D | 1 | 1 |
| C | C | 0 | 1 |
| C | D | 1 | 1 |
| D | D | 0 | 1 |
Removing a node would then entail an update statement followed by a delete statement.
The update would, instead of deleting the entries it finds, decrement the number of references for that entry. Then you could delete the entries with 0 or fewer references afterwards.
Coming up with a SQL query which does the update is escaping me at the moment but in theory this should work without having to completely reconstruct the closure table...
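For what it's worth, here is a rough sketch of the shape that update/delete pair might take (my guess, untested; MySQL syntax to match the question, removing the edge B -> D). The number of paths from X to Y that die with that edge is (paths from X to B) times (paths from D to Y):

UPDATE `Closure` c
JOIN (
    SELECT up.`Ancestor`, down.`Descendant`,
           up.`Refs` * down.`Refs` AS removed
    FROM `Closure` up
    CROSS JOIN `Closure` down
    WHERE up.`Descendant` = 'B'
      AND down.`Ancestor` = 'D'
) p ON  c.`Ancestor`   = p.`Ancestor`
    AND c.`Descendant` = p.`Descendant`
SET c.`Refs` = c.`Refs` - p.`removed`;

DELETE FROM `Closure` WHERE `Refs` < 1;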
First, I believe there is a duplicate entry in your table. (A,D) appears twice. Second, after removing the edge (B,D), the following paths should remain:
Node self-maps: (A,A), (B,B), (C,C), (D,D)
(A,B)
(A,C)
(A,D) (through C)
Thus, to remove the edge (B,D) in this example, all that is required is to remove that one row:
Delete From MyTable
Where Ancestor = 'B'
And Descendant = 'D'
A closure table is still only mapping relations between two nodes. What makes it special is that it maps every indirect relation effectively as a direct relation. The edge (B,D) is simply saying that you can get from B to D. That row alone says nothing about how you got to B, nor does it say anything about how many nodes it took to get from B to D; it simply says you can get from B to D. Thus, there is no edge listed for A > B > D per se. Rather, all that is captured is that you can get from A to B and from A to D, which is still true even if the edge (B,D) is removed.
In natural language, that would be: "Delete the ancestor-descendant relationship to D if there is no parent of D besides B that is also a descendant of A". Is that correct?
(Edit: no, that's not correct; not only relationships to D must be removed, but also relationships to every descendant of D. Thus, that criterion is not valid...)
My tentative SQL would then be:
DELETE a
FROM Closure AS a
INNER JOIN Closure AS b ON a.Descendant = b.Descendant
WHERE
a.Descendant IN (SELECT Descendant FROM Closure WHERE Ancestor = {Child}) AND
b.Depth = 1 AND
b.Ancestor != {Parent} AND
a.Ancestor NOT IN (SELECT Ancestor FROM Closure WHERE Descendant = b.Ancestor)
(Sorry if I got the query wrong - or used non-standard features - I'm not actually experienced with this. But my natural language description should give an insight into what actually needs to go into the query.)
Update: On second thought, I don't believe my query will work for all cases. Consider this:
A --> B --> D --> E --> F
F is a descendant of D (True)
E is a parent of F (True)
E is not B (True)
A is not an ancestor of E (False)
Thus, A >> F won't be removed, even though it should. Sorry I couldn't help, but that seems a problem too big to fit in a single query. I'd suggest looking for an algorithmic solution first, then seeing how that could be implemented in your case.
While tracking depth and allowing multiple parents at the same time is probably possible, I get a code smell from it, especially when the solution involves having duplicate pairs. Thomas' answer outlines that issue.
I'm going to simplify the question a bit to just focus on unlinking a node when multiple parents are allowed, because it's a tricky enough problem on its own. I'm discarding the depth column entirely and assuming there are no duplicate pairs.
You have to take into account children of D, which gets complicated fast. Say we have:
A --> B
|     |
v     v
C --> D --> E
We want to unlink D from B, which means we also have to remove links between E and B. But what if they're connected like this:
A --> B --> C
      |     |
      v     v
      D --> E <-- G
      |
      v
H --> F
In this case if we disconnect B > D, we don't want to unlink B and E anymore, because E is still linked to B through C. But we do want to disconnect F from B.
I'll go through my solution below using that second example. I know that D only has one parent in this example, but it still works perfectly if D has multiple parents; I can more easily demonstrate some of the edge cases this way so that's why I'm doing it like this.
Here's what the table would look like:
| Ancestor | Descendant |
-------------------------
| A | A |
| A | B |
| A | C |
| A | D |
| A | E |
| A | F |
| B | B |
| B | C |
| B | D |
| B | E |
| B | F |
| C | C |
| C | E |
| D | D |
| D | E |
| D | F |
| E | E |
| F | F |
| G | G |
| G | E |
| H | H |
| H | F |
Query 1: Get all descendants of D, including D
SELECT `Descendant`
FROM `Closure`
WHERE `Ancestor` = #Node
This will return: D, E, F
Query 2: Get all ancestors of B, including B
SELECT `Ancestor`
FROM `Closure`
WHERE `Descendant` = #ParentNode
This will return: A, B
Query 3a: Get all ancestors of the items in Query 1 that do not appear in Query 1 or 2
SELECT DISTINCT `Ancestor`
FROM `Closure`
WHERE `Descendant` IN (#Query1)
AND `Ancestor` NOT IN (#Query1)
AND `Ancestor` NOT IN (#Query2)
This will return: C, G, H
The goal here is to get all parents of E and F that may reconnect farther up the chain.
Query 3b: this is exactly the same as Query 3a, except it returns both ancestors and descendants
SELECT DISTINCT `Ancestor`, `Descendant`
[ ... ]
This will return: (C, E), (G, E), (H, F)
We'll need this later.
Query 4: Filter results of Query 3a down to nodes that reconnect farther up the chain
SELECT `Ancestor`,`Descendant`
FROM `Closure`
WHERE `Descendant` IN (#Query3a)
AND (`Ancestor` IN (#Query2) OR `Ancestor` = `Descendant`)
This will return: (A, C), (B, C), (C, C)
We now have references to all parents of C that should not be unlinked. Note that we have no links to parents of F. That's because F is not connected to B and A except through D (which we're unlinking).
Query 5: Construct the pairs that should be kept, using the results of Query 3b as a bridge between Queries 1 and 4
SELECT `Query4`.`ancestor`, `Query3b`.`descendant`
FROM (#Query3b) as `Query3b`
LEFT JOIN (#Query4) as `Query4`
  ON `Query4`.`descendant` = `Query3b`.`ancestor`
WHERE `Query3b`.`descendant` IN (#Query1)
This will return: (A, E), (B, E)
Query 6: Run the regular query for orphaning a node and its children, except exclude all pairs returned by Query 5
DELETE FROM `Closure`
WHERE `Descendant` IN (#Query1)
AND `Ancestor` IN (#Query2)
AND (`Ancestor`, `Descendant`) NOT IN (#Query5)
After this operation, we will have removed the following links:
| Ancestor | Descendant |
-------------------------
| A | D |
| A | F |
| B | D |
| B | F |
Both D and F are correctly unlinked, and E correctly retains its connections to A and B, leaving:
A --> B --> C
            |
            v
      D --> E <-- G
      |
      v
H --> F
Let me know if I've missed anything! I just solved this problem myself today and I may run into more edge cases as time goes on. If I find any I'll update this answer.