Find first common parent for multiple children from mixed hierarchy levels - sql

Find the first common parent, if any, from many different children.
Example:
1
/ \
2 3
/ / \
7 8 9
/ \
10 11
Input: [10, 9]
Output: 3 (first common parent for this elements)
Table example:
+------------------+-----------+------+
|EmployeePositionId|Subdivision|Parent|
+------------------+-----------+------+
|4718 |485 |42 |
|4719 |5064 |485 |
|4720 |5065 |5064 |
|4721 |5065 |5064 |
|4722 |3000 |null |
+------------------+-----------+------+
If I try to search for EmployeePositionId [4719, 4720, 4721],
I would like to get the Subdivision 5064, because it is the closest common subdivision for both employees (5065 nested in 5064).
If I were looking for 4719, 4720, 4721, 4722, then I would like to get null, because these elements do not have a common parent.
Or the answer will help me how get the data so that later solve this in Python

This class of problems is hard for SQL.
It's even harder with your particular table. It's not properly normalized. There is no level indicator. And input IDs can be from mixed hierarchy levels.
Setup
You clarified in a later comment that every path is terminated with a row that has "Parent" IS NULL (root), even if sample data in the question suggest otherwise. That helps a bit.
I assume valid "EmployeePositionId" as input. And no loops in your tree or the CTE enters an endless loop.
If you don't have a level of hierarchy in the table, add it. It's a simple task. If you can't add it, create a VIEW or, preferably, a MATERIALIZED VIEW instead:
CREATE MATERIALIZED VIEW mv_tbl AS
WITH RECURSIVE cte AS (
SELECT *, 0 AS level
FROM tbl
WHERE "Parent" IS NULL
UNION ALL
SELECT t.*, c.level + 1
FROM cte c
JOIN tbl t ON t."Parent" = c."Subdivision"
)
TABLE cte;
These would be the perfect indices for the task:
CREATE UNIQUE INDEX mv_tbl_id_uni ON mv_tbl ("EmployeePositionId") INCLUDE ("Subdivision", "Parent", level);
CREATE INDEX mv_tbl_subdivision_idx ON mv_tbl ("Subdivision") INCLUDE ("Parent", level);
See:
Covering index for top read performance
Query
Pure SQL solution with recursive CTE, based on a table with level indicator (or the MV from above):
WITH RECURSIVE init AS (
SELECT "Subdivision", "Parent", level
FROM mv_tbl
WHERE "EmployeePositionId" IN (4719, 4720, 4721) -- input
)
, cte AS (
TABLE init
UNION
SELECT c."Parent", t."Parent", c.level - 1
FROM cte c
JOIN mv_tbl t ON t."Subdivision" = c."Parent" -- recursion terminated at "Parent" IS NULL
)
, agg AS (
SELECT level, min("Subdivision") AS "Subdivision", count(*) AS ct
FROM cte
GROUP BY level
)
SELECT "Subdivision"
FROM agg a
WHERE ct = 1 -- no other live branch
AND level < (SELECT max(level) FROM cte WHERE "Parent" IS NULL) IS NOT TRUE -- no earlier dead end
AND level <= (SELECT min(level) FROM init) -- include highest (least) level
ORDER BY level DESC -- pick earliest (greatest) qualifying level
LIMIT 1;
db<>fiddle here
Covers all possible input, works for any modern version of Postgres.
I added basic explanation in the code.
Related:
How to aggregate a table with tree-structure to a single nested JSON object?
How to turn a set of flat trees into a single tree with multiple leaves?
Legal, lower-case, unquoted identifiers make your life with Postgres easier. See:
Are PostgreSQL column names case-sensitive?

Related

Bigquery SQL: convert array to columns

I have a table with a field A where each entry is a fixed length array A of integers (say length=1000). I want to know how to convert it into 1000 columns, with column name given by index_i, for i=0,1,2,...,999, and each element is the corresponding integer. I can have it done by something like
A[OFFSET(0)] as index_0,
A[OFFSET(1)] as index_1
A[OFFSET(2)] as index_2,
A[OFFSET(3)] as index_3,
A[OFFSET(4)] as index_4,
...
A[OFFSET(999)] as index_999,
I want to know what would be an elegant way of doing this. thanks!
The first thing to say is that, sadly, this is going to be much more complicated than most people expect. It can be conceptually easier to pass the values into a scripting language (e.g. Python) and work there, but clearly keeping things inside BigQuery is going to be much more performant. So here is an approach.
Cross-joining to turn array fields into long-format tables
I think the first thing you're going to want to do is get the values out of the arrays and into rows.
Typically in BigQuery this is accomplished using CROSS JOIN. The syntax is a tad unintuitive:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw
CROSS JOIN UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
UNNEST(raw.a) is taking those arrays of values and turning each array into a set of (five) rows, every single one of which is then joined to the corresponding value of name (the definition of a CROSS JOIN). In this way we can 'unwrap' a table with an array field.
This will yields results like
name | vals
-------------
A | 1
A | 2
A | 3
A | 4
A | 5
B | 5
B | 4
B | 3
B | 2
B | 1
Confusingly, there is a shorthand for this syntax in which CROSS JOIN is replaced with a simple comma:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals
FROM raw, UNNEST(raw.a) AS vals
)
SELECT * FROM long_format
This is more compact but may be confusing if you haven't seen it before.
Typically this is where we stop. We have a long-format table, created without any requirement that the original arrays all had the same length. What you're asking for is harder to produce - you want a wide-format table containing the same information (relying on the fact that each array was the same length.
Pivot tables in BigQuery
The good news is that BigQuery now has a PIVOT function! That makes this kind of operation possible, albeit non-trivial:
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN (0,1,2,3,4)
)
This makes use of WITH OFFSET to generate an extra offset column (so that we know which order the values in the array originally had).
Also, in general pivoting requires us to aggregate the values returned in each cell. But here we expect exactly one value for each combination of name and offset, so we simply use the aggregation function ANY_VALUE, which non-deterministically selects a value from the group you're aggregating over. Since, in this case, each group has exactly one value, that's the value retrieved.
The query yields results like:
name vals_0 vals_1 vals_2 vals_3 vals_4
----------------------------------------------
A 1 2 3 4 5
B 5 4 3 2 1
This is starting to look pretty good, but we have a fundamental issue, in that the column names are still hard-coded. You wanted them generated dynamically.
Unfortunately expressions for the pivot column values aren't something PIVOT can accept out-of-the-box. Note that BigQuery has no way to know that your long-format table will resolve neatly to a fixed number of columns (it relies on offset having the values 0-4 for each and every set of records).
Dynamically building/executing the pivot
And yet, there is a way. We will have to leave behind the comfort of standard SQL and move into the realm of BigQuery Procedural Language.
What we must do is use the expression EXECUTE IMMEDIATE, which allows us to dynamically construct and execute a standard SQL query!
(as an aside, I bet you - OP or future searchers - weren't expecting this rabbit hole...)
This is, of course, inelegant to say the least. But here is the above toy example, implemented using EXECUTE IMMEDIATE. The trick is that the executed query is defined as a string, so we just have to use an expression to inject the full range of values you want into this string.
Recall that || can be used as a string concatenation operator.
EXECUTE IMMEDIATE """
WITH raw AS (
SELECT "A" AS name, [1,2,3,4,5] AS a
UNION ALL
SELECT "B" AS name, [5,4,3,2,1] AS a
),
long_format AS (
SELECT name, vals, offset
FROM raw, UNNEST(raw.a) AS vals WITH OFFSET
)
SELECT *
FROM long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
|| """
)
)
"""
Ouch. I've tried to make that as readable as possible. Near the bottom there is an expression that generates the list of column suffices (pivoted values of offset):
(SELECT STRING_AGG(CAST(x AS STRING)) FROM UNNEST(GENERATE_ARRAY(0,4)) AS x)
This generates the string "0,1,2,3,4" which is then concatenated to give us ...FOR offset IN (0,1,2,3,4)... in our final query (as in the hard-coded example before).
REALLY dynamically executing the pivot
It hasn't escaped my notice that this is still technically insisting on your knowing up-front how long those arrays are! It's a big improvement (in the narrow sense of avoiding painful repetitive code) to use GENERATE_ARRAY(0,4), but it's not quite what was requested.
Unfortunately, I can't provide a working toy example, but I can tell you how to do it. You would simply replace the pivot values expression with
(SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM long_format)
But doing this in the example above won't work, because long_format is a Common Table Expression that is only defined inside the EXECUTE IMMEDIATE block. The statement in that block won't be executed until after building it, so at build-time long_format has yet to be defined.
Yet all is not lost. This will work just fine:
SELECT *
FROM d.long_format PIVOT(
ANY_VALUE(vals) AS vals
FOR offset IN ("""
|| (SELECT STRING_AGG(DISTINCT CAST(offset AS STRING)) FROM d.long_format)
|| """
)
)
... provided you first define a BigQuery VIEW (for example) called long_format (or, better, some more expressive name) in a dataset d. That way, both the job that builds the query and the job that runs it will have access to the values.
If successful, you should see both jobs execute and succeed. You should then click 'VIEW RESULTS' on the job that ran the query.
As a final aside, this assumes you are working from the BigQuery console. If you're instead working from a scripting language, that gives you plenty of options to either load and manipulate the data, or build the query in your scripting language rather than massaging BigQuery into doing it for you.
Consider below approach
execute immediate ( select '''
select * except(id) from (
select to_json_string(A) id, * except(A)
from your_table, unnest(A) value with offset
)
pivot (any_value(value) index for offset in ('''
|| (select string_agg('' || val order by offset) from unnest(generate_array(0,999)) val with offset) || '))'
)
If to apply to dummy data like below (with 10 instead of 1000 elements)
select [10,11,12,13,14,15,16,17,18,19] as A union all
select [20,21,22,23,24,25,26,27,28,29] as A union all
select [30,31,32,33,34,35,36,37,38,39] as A
the output is

Recursive query within recursive query

I would like to solve a problem consisting of 2 recursions.
In one of the 2 recursions I find out the answer to one question which is "What is the leaf member of a specific input (template)?" This is already solved.
In a second recursion I would like to run this query for a number of other inputs (templates).
1st part of the problem:
I have a tree and would like to find the leaf of it. This part of the recursion can be solved using this query:
with recursive full_tree as (
select id, "previousVersionId", 1 as level
from template
where
template."id" = '5084520a-bb07-49e8-b111-3ea8182dc99f'
union all
select c.id, c."previousVersionId", p.level + 1
from template c
inner join full_tree p on c."previousVersionId" = p.id
)
select * from full_tree
order by level desc
limit 1
The query output is one record including the leaf id I'm interested in. This is fine.
2nd part of the query:
Here's the problem. I would like to run the first query n times.
Currently I can run the query only if it's just one id ('5084520a-bb07-49e8-b111-3ea8182dc99f' in the example). But what If I have a list of 100 such ids.
My ultimate goal is to get one id response (the leaf id) to each of the 100 template ids in the list.
In theory, a query that allows me to run above query for each of my e.g. 100 template ids would solve my problem.

Retrieve hierarchical groups ... with infinite recursion

I've a table like this which contains links :
key_a key_b
--------------
a b
b c
g h
a g
c a
f g
not really tidy & infinite recursion ...
key_a = parent
key_b = child
Require a query which will recompose and attribute a number for each hierarchical group (parent + direct children + indirect children) :
key_a key_b nb_group
--------------------------
a b 1
a g 1
b c 1
**c a** 1
f g 2
g h 2
**link responsible of infinite loop**
Because we have
A-B-C-A
-> Only want to show simply the link as shown.
Any idea ?
Thanks in advance
The problem is that you aren't really dealing with strict hierarchies; you're dealing with directed graphs, where some graphs have cycles. Notice that your nbgroup #1 doesn't have any canonical root-- it could be a, b, or c due to the cyclic reference from c-a.
The basic way of dealing with this is to think in terms of graph techniques, not recursion. In fact, an iterative approach (not using a CTE) is the only solution I can think of in SQL. The basic approach is explained here.
Here is a SQL Fiddle with a solution that addresses both the cycles and the shared-leaf case. Notice it uses iteration (with a failsafe to prevent runaway processes) and table variables to operate; I don't think there's any getting around this. Note also the changed sample data (a-g changed to a-h; explained below).
If you dig into the SQL you'll notice that I changed some key things from the solution given in the link. That solution was dealing with undirected edges, whereas your edges are directed (if you used undirected edges the entire sample set is a single component because of the a-g connection).
This gets to the heart of why I changed a-g to a-h in my sample data. Your specification of the problem is straightforward if only leaf nodes are shared; that's the specification I coded to. In this case, a-h and g-h can both get bundled off to their proper components with no problem, because we're concerned about reachability from parents (even given cycles).
However, when you have shared branches, it's not clear what you want to show. Consider the a-g link: given this, g-h could exist in either component (a-g-h or f-g-h). You put it in the second, but it could have been in the first instead, right? This ambiguity is why I didn't try to address it in this solution.
Edit: To be clear, in my solution above, if shared braches ARE encountered, it treats the whole set as a single component. Not what you described above, but it will have to be changed after the problem is clarified. Hopefully this gets you close.
You should use a recursive query. In the first part we select all records which are top level nodes (have no parents) and using ROW_NUMBER() assign them group ID numbers. Then in the recursive part we add to them children one by one and use parent's groups Id numbers.
with CTE as
(
select t1.parent,t1.child,
ROW_NUMBER() over (order by t1.parent) rn
from t t1 where
not exists (select 1 from t where child=t1.parent)
union all
select t.parent,t.child, CTE.rn
from t
join CTE on t.parent=CTE.Child
)
select * from CTE
order by RN,parent
SQLFiddle demo
Painful problem of graph walking using recursive CTEs. This is the problem of finding connected subgraphs in a graph. The challenge with using recursive CTEs is to prevent unwarranted recursion -- that is, infinite loops In SQL Server, that typically means storing them in a string.
The idea is to get a list of all pairs of nodes that are connected (and a node is connected with itself). Then, take the minimum from the list of connected nodes and use this as an id for the connected subgraph.
The other idea is to walk the graph in both directions from a node. This ensures that all possible nodes are visited. The following is query that accomplishes this:
with fullt as (
select keyA, keyB
from t
union
select keyB, keyA
from t
),
CTE as (
select t.keyA, t.keyB, t.keyB as last, 1 as level,
','+cast(keyA as varchar(max))+','+cast(keyB as varchar(max))+',' as path
from fullt t
union all
select cte.keyA, cte.keyB,
(case when t.keyA = cte.last then t.keyB else t.keyA
end) as last,
1 + level,
cte.path+t.keyB+','
from fullt t join
CTE
on t.keyA = CTE.last or
t.keyB = cte.keyA
where cte.path not like '%,'+t.keyB+',%'
) -- select * from cte where 'g' in (keyA, keyB)
select t.keyA, t.keyB,
dense_rank() over (order by min(cte.Last)) as grp,
min(cte.Last)
from t join
CTE
on (t.keyA = CTE.keyA and t.keyB = cte.keyB) or
(t.keyA = CTE.keyB and t.keyB = cte.keyA)
where cte.path like '%,'+t.keyA+',%' or
cte.path like '%,'+t.keyB+',%'
group by t.id, t.keyA, t.keyB
order by t.id;
The SQLFiddle is here.
you might want to check with COMMON TABLE EXPRESSIONS
here's the link

Is there an SQL function which generates a given range of sequential numbers?

I need to generate an array of sequential integers with a given range in order to use it in:
SELECT tbl.pk_id
FROM tbl
WHERE tbl.pk_id NOT IN (sequential array);
If you have a given range - ie a start point and an end point - of sequential integers you should just be able to use the BETWEEN keyword:
SELECT tbl.pk_id
FROM tbl
WHERE tbl.pk_id NOT BETWEEN START_INT AND END_INT
or am I misunderstanding your question..?
Because you say you've already got a number table, I would suggest this:
SELECT element
FROM series
WHERE element NOT IN (SELECT pk_id
FROM tbl)
Might possibly be more efficient than the query you've tried.
Two thoughts . . .
First, there's no standard SQL function that does that. But some systems include a non-standard function that does generates a series. In PostgreSQL, for example, you can use the generate_series() function.
select generate_series(1,100000);
1
2
3
...
100000
That function essentially returns a table; it can be used in joins.
If Informix doesn't have a function that does something similar, maybe you can write an Informix SPL function that does.
Second, you could just create a one-column table and populate it with a series of integers. This works on all platforms, and doesn't require programming. It requires only minimal maintenance. (You need to keep more integers in this table than you're using in your production table.)
create table integers (
i integer primary key
);
Use a spreadsheet or a utility program to generate a series of integers to populate it. The easiest way if you have a Unix, Linux, or Cygwin environment laying around is to use seq.
$ seq 1 5 > integers
$ cat integers
1
2
3
4
5
Informix has a free developer version you can download. Maybe you can build a compelling demo with it, and management will let you upgrade.
i'll suggest a generic solution to create a result set containing the positive integers 0 .. 2^k-1 for a given k for subsequent use as a subquery, view or materialized view.
the code below illustrates the technique for k=2.
SELECT bv0 + 2* bv1 + 4*bv2 val
FROM (
SELECT *
FROM
(
SELECT 0 bv0 FROM DUAL
UNION
SELECT 1 bv0 FROM DUAL
) bit0
CROSS JOIN (
SELECT 0 bv1 FROM DUAL
UNION
SELECT 1 bv1 FROM DUAL
) bit1
CROSS JOIN (
SELECT 0 bv2 FROM DUAL
UNION
SELECT 1 bv2 FROM DUAL
) bit2
) pow2
;
i hope that helps you with your task
best regards,
carsten

Flattening a Recursive Hierarchy Into a Dimension with SSIS

I have a recursive hierarchy in a relational database, this reflects teams and their position within a hierarchy.
I wish to flatten this hierarchy into a dimension for data warehousing, it's a SQL Server database, using SSIS to SSAS.
I have a table, teams:
teamid Teamname
1 Team 1
2 Team 2
And a table teamhierarchymapping:
Teamid heirarchyid
1 4
2 2
And a table hierarchy:
sequenceid parentsequenceid Name
1 null root
2 1 Level 1.1
3 1 Level 1.2
4 3 Level 1.2 1
Giving
Level 1.1 (Contains Team 2)
root <
Level 1.2 <
Level 1.2 1 (Contains Team 1)
I want to flatten this to a dimension like:
Team Name Level 1 Level 2 Level 3
Team 1 Root Level 1.1 [None]
Team 2 Root Level 1.2 Level 1.2 1
I've tried various nasty sets of SQL to try and bring that together, and various piping around in SSIS (which I am just starting to pick up), and I'm not finding a solution that brings it together.
Can anyone help?
(Edit corrected issue with sample data, I think)
Do you have an error in your sample data? I can't see how the hierarchy mapping connects to the hierarchy table to get the results you want, unless the hierarchy mapping is teamid 1 => hierid 2 and teamid 2 => hierid 4.
SSIS may not be able to do it (easily), so it may be better to create a OLEDB Source that executes SQL of the following format. Note this does assume you're using SQL Server 2008 as the 'PIVOT' function was introduced there...
WITH hier AS (
SELECT parentseqid, sequenceid, hiername as parentname, hiername FROM TeamHierarchy
UNION ALL
SELECT hier.parentseqid, TH.sequenceid, hier.parentname, TH.hiername FROM hier
INNER JOIN TeamHierarchy TH ON TH.parentseqid = hier.sequenceid
),
teamhier AS (
SELECT T.*, THM.hierarchyid FROM Teams T
INNER JOIN TeamHierarchyMapping THM ON T.teamid = THM.teamid
)
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY teamname ORDER BY teamname, sequenceid, parentseqid) AS 'Depth', hier.parentname, teamhier.teamname
FROM hier
INNER JOIN teamhier ON hier.sequenceid = teamhier.hierarchyid
) as t1
PIVOT (MAX(parentname) FOR Depth IN ([1],[2],[3],[4],[5],[6],[7],[8],[9])) AS pvtTable
ORDER BY teamname;
There's a few different elements to this, and there may be a better way to do it, but for flattening hierarchies, CTE's are ideal.
Two CTEs are created: 'hier' which takes care of flattening the hierarchy and 'teamhier' which is just a helper "view" to make the joins later on simpler. IF you just take the hier CTE and run it, you'll get your flattened view:
WITH hier AS (
SELECT parentseqid, sequenceid, hiername as parentname, hiername FROM TeamHierarchy
UNION ALL
SELECT hier.parentseqid, TH.sequenceid, hier.parentname, TH.hiername FROM hier
INNER JOIN TeamHierarchy TH ON TH.parentseqid = hier.sequenceid
)
SELECT * FROM hier ORDER BY parentseqid, sequenceid
The next part of it basically takes this flattened view, joins it to your team tables (to get the team name) and uses SQL Server's PIVOT to rotate it round and get everything aligned as you want it. More information on PIVOT is available on the MSDN.
If you're using SQL Server 2005, then you can just take the hierarchy flattening bit and you should be able to use SSIS's native 'PIVOT' transformation block to hopefully do the dirty pivoting work.
Is there a reason why you need to flatten the hierarchy? Consider the parent-child hierarchy dimension type in SSAS, this handles variable-depth hierarchies, and would allow extra functions/features that a flattened hierarchy would not:
http://msdn.microsoft.com/en-us/library/ms174846.aspx
http://msdn.microsoft.com/en-us/library/ms174935.aspx