Denormalized table. SQL Select

As I understand it, I have a denormalized table. Here are some of its columns:
... C, F, T, C1, F1, T1, .... C8, T8, F8.....
Is it possible to select those values as rows?
Something like this:
C, F, T
C1, F1, T1
......
C8, F8, T8

You can do it easily with a union all:
select C, F, T from table t
union all
select C1, F1, T1 from table t
union all
. . .
select C8, F8, T8 from table t;
Note the use of union all instead of union. union does automatic duplicate elimination, so you might not get all your values with union (as well as it being a more expensive operation).
This will generally result in the table being scanned 9 times. If you have a large table, there are other methods that are likely to be more efficient.
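The difference between union and union all can be checked with a small experiment. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than DB2, and the table name wide and its single row are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-group version of the denormalized table.
conn.execute("CREATE TABLE wide (C, F, T, C1, F1, T1)")
# One row whose two column groups happen to hold identical values.
conn.execute("INSERT INTO wide VALUES (1, 2, 3, 1, 2, 3)")

union_all_rows = conn.execute(
    "SELECT C, F, T FROM wide UNION ALL SELECT C1, F1, T1 FROM wide"
).fetchall()
union_rows = conn.execute(
    "SELECT C, F, T FROM wide UNION SELECT C1, F1, T1 FROM wide"
).fetchall()

print(len(union_all_rows))  # 2: both column groups are kept
print(len(union_rows))      # 1: the duplicate row was eliminated
```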
EDIT:
A more efficient method is likely to be a cross join and case. In DB2, I think this would be:
select (case n.n when 0 then C
when 1 then C1
. . .
when 8 then C8
end) as C,
(case n.n when 0 then F
when 1 then F1
. . .
when 8 then F8
end) as F,
(case n.n when 0 then T
when 1 then T1
. . .
when 8 then T8
end) as T
from table t cross join
(select 0 as n from sysibm.sysdummy1 union all select 1 from sysibm.sysdummy1 union all . . .
select 8 from sysibm.sysdummy1
) n;
This may seem like more work, but it should only be reading the bigger table once, with the rest of the work being in-memory operations.
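As a rough check of the cross join idea, here is a minimal sketch run against SQLite through Python's sqlite3 module instead of DB2. The table name wide, the three column groups, and the sample values are assumptions made for the example; a CTE of numbers stands in for the sysibm.sysdummy1 subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with three column groups instead of nine.
conn.execute("CREATE TABLE wide (C, F, T, C1, F1, T1, C2, F2, T2)")
conn.execute("INSERT INTO wide VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9)")

query = """
WITH n(n) AS (VALUES (0), (1), (2))   -- one number per column group
SELECT CASE n.n WHEN 0 THEN C WHEN 1 THEN C1 WHEN 2 THEN C2 END AS C,
       CASE n.n WHEN 0 THEN F WHEN 1 THEN F1 WHEN 2 THEN F2 END AS F,
       CASE n.n WHEN 0 THEN T WHEN 1 THEN T1 WHEN 2 THEN T2 END AS T
FROM wide CROSS JOIN n
ORDER BY n.n
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
```

The list of numbers has to match the CASE branches exactly: a number with no matching branch would produce an extra row of NULLs for every source row.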

select c,f,t from table
union all
select c1,f1,t1 from table
union all
select c8,f8,t8 from table
If you need to filter the rows, make sure to repeat the WHERE clause in each SELECT statement.

A trick similar to Gordon's is to use LATERAL to avoid multiple scans:
with t(c1,t1,f1,c2,t2,f2) as ( values (1,2,3,4,5,6) )
select y.c, y.t, y.f
from t x
cross join lateral ( values (x.c1, x.t1, x.f1)
, (x.c2, x.t2, x.f2) ) y(c,t,f)
C T F
----------- ----------- -----------
1 2 3
4 5 6

Related

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
|id_1|id_2|
|----|----|
|1|a|
|2|a|
|2|b|
|3|b|
|4|c|
|5|c|
|6|d|
|6|e|
|7|f|
I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links, that is, each collection of IDs that are linked together.
The desired output for the example above would look something like this:
|id_1_coll|id_2_coll|
|---------|---------|
|1, 2, 3|a, b|
|4, 5|c|
|6|d, e|
|7|f|
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 so that we get all the links from id_1 to itself, use a recursive common table expression to traverse all the possible paths between id_1 values, and then aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are:
1. Remodel the relationship into a series of self-joins for id_1
2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
3. Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
|id_1|linked_id|lowest_linked_id|
|----|---------|----------------|
|1|2|1|
|2|1|1|
|2|3|2|
|3|2|2|
|4|5|4|
|5|4|4|
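The base iteration can be reproduced outside BigQuery as a quick sanity check. This is only a sketch using Python's sqlite3 module; SQLite's two-argument MIN() is used here as a stand-in for BigQuery's LEAST(), and the ORDER BY is added just to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id_1 INTEGER, id_2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
     (5, "c"), (6, "d"), (6, "e"), (7, "f")],
)

# Self-join on id_2 to get id_1 -> id_1 links; MIN(x, y) replaces LEAST(x, y).
base_iter = conn.execute("""
    SELECT a.id_1, b.id_1 AS linked_id, MIN(a.id_1, b.id_1) AS lowest_linked_id
    FROM t AS a
    INNER JOIN t AS b ON a.id_2 = b.id_2
    WHERE a.id_1 != b.id_1
    ORDER BY a.id_1, b.id_1
""").fetchall()
print(base_iter)
# [(1, 2, 1), (2, 1, 1), (2, 3, 2), (3, 2, 2), (4, 5, 4), (5, 4, 4)]
```

Note that id_1 values 6 and 7 do not appear, because no other id_1 shares an id_2 with them. That is why the complete query links back to t with a LEFT JOIN and IFNULL.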
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
which we can now link back to the original table for the id_2 values and then aggregate (3.), as shown in the complete query below.
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
|grp|id1_coll|id2_coll|
|---|--------|--------|
|1|[1,2,3]|[a,b]|
|4|[4,5]|[c]|
|6|[6]|[d,e]|
|7|[7]|[f]|
Limitations/Issues
Unfortunately, this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails in the real-world case, where we have several million join rows. When trying to execute on this data, BigQuery runs up a huge "Slot time consumed" and eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id that is already in the list of linked_ids, we don't need to check it further).
Using ROW_NUMBER(), the query is as follows:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1 : Append with row numbers.
t2 : Extract rows matching either id_1 or id_2 by recursive query.
t3 : Make arrays from id_1 and id_2 with ARRAY_AGG().
However, it may not resolve the performance problems described under Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order, with one or both IDs moving to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates, as done by @RomanPekar here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.
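To make the graph-walking framing concrete, here is a minimal sketch of the grouping done outside SQL with a union-find (disjoint-set) structure, which merges groups as links are read instead of enumerating every path. This is a swapped-in technique for illustration, not something proposed in the answers above; the function name connected_groups and the tagging of nodes as ("id_1", value) / ("id_2", value) are assumptions of the sketch.

```python
from collections import defaultdict

def connected_groups(links):
    """Group (id_1, id_2) links into connected components."""
    parent = {}

    def find(node):
        # Return the representative of node's group, shortening paths as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag each value with its column so id_1 and id_2 values cannot collide.
    for id_1, id_2 in links:
        union(("id_1", id_1), ("id_2", id_2))

    groups = defaultdict(lambda: (set(), set()))
    for id_1, id_2 in links:
        root = find(("id_1", id_1))
        groups[root][0].add(id_1)
        groups[root][1].add(id_2)
    return sorted((sorted(a), sorted(b)) for a, b in groups.values())

links = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (4, "c"),
         (5, "c"), (6, "d"), (6, "e"), (7, "f")]
result = connected_groups(links)
for id_1_coll, id_2_coll in result:
    print(id_1_coll, id_2_coll)
# [1, 2, 3] ['a', 'b']
# [4, 5] ['c']
# [6] ['d', 'e']
# [7] ['f']
```

Each link is processed once, so the work grows roughly with the number of link rows rather than with the number of paths. Whether exporting the links to run something like this is practical depends on the data volume and the environment.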

Search duplicated values and show their context

I'm not sure if the solution for this question is incredibly simple or not possible in pure SQL.
I have a simple table with 2 columns
Number Text
1 a
1 b
2 m
3 x
3 y
3 z
Now the task is:
Search all repeated numbers and show the "Text" which uses these duplicated numbers.
We see: 1 is used twice (with a and b), 3 is used with x and y and z. But no line is completely duplicated.
Edit:
So I expect something like this.
Dup_Num Text
1 a
1 b
3 x
3 y
3 z
Finding the duplicates is easy, but I don't know how to connect them with "Text", because when I add "Text" to my SELECT I also have to add it to the GROUP BY, and then there are no duplicates.
Thanks for help on a lousy day ..
If I understand correctly, you can use exists:
select t.*
from t
where exists (select 1 from t t2 where t2.number = t.number and t2.text <> t.text)
order by t.number;
For performance, you want an index on (number, text).
The canonical way to find duplicates in SQL is the self join.
In your example:
select s1.*
from stuff s1
inner join stuff s2
on s1.number = s2.number
and s1.text <> s2.text
In your case, you might want to use LISTAGG to collect the values of the other column for each number:
SQL> with result
as (
select '1' as c1 , 'a' as c2 from dual union all
select '1' as c1 , 'b' as c2 from dual union all
select '2' as c1 , 'm' as c2 from dual union all
select '3' as c1 , 'x' as c2 from dual union all
select '3' as c1 , 'y' as c2 from dual union all
select '3' as c1 , 'z' as c2 from dual )
select c1, listagg(c2,',') within group(order by c2) as c3 from result
group by c1;
C C3
- --------------------
1 a,b
2 m
3 x,y,z
SQL>
This might also help you.
select * from t where Number in
(select Number from t group by Number having count(*) > 1)
order by Text
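The EXISTS and IN ... HAVING variants can be compared on the sample data. This is a small sketch using Python's sqlite3 module; the table name t and the lowercase column names number and text follow the first answer and are otherwise assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (number INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "m"), (3, "x"), (3, "y"), (3, "z")],
)

# Rows whose number also occurs with a different text.
with_exists = conn.execute("""
    SELECT t.number, t.text FROM t
    WHERE EXISTS (SELECT 1 FROM t AS t2
                  WHERE t2.number = t.number AND t2.text <> t.text)
    ORDER BY t.number, t.text
""").fetchall()

# Rows whose number occurs more than once.
with_having = conn.execute("""
    SELECT number, text FROM t
    WHERE number IN (SELECT number FROM t GROUP BY number HAVING COUNT(*) > 1)
    ORDER BY number, text
""").fetchall()

print(with_exists)
# [(1, 'a'), (1, 'b'), (3, 'x'), (3, 'y'), (3, 'z')]
print(with_exists == with_having)  # True
```

The two queries agree here because no (number, text) pair is repeated. If a complete row were duplicated, the HAVING version would return it, while the EXISTS version would only do so if the number also appeared with a different text.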

Is there an alternative approach to using Union for this scenario in SQL?

I have a table A that has data like the following
Index Chapter1 Chapter2 .... Chapter20
1 CHE MTH NULL
2 ML NULL BIO
3 NULL DB HIST
I want to build a view on top of the table which should give me result like the following
Index Chapter
1 CHECh1
1 MTHCh2
..
2 MLCh1
..
2 BIOCh2
3 DBCh2
...and so on
I could build a view which gives me the correct result by using UNION operation.
SELECT Index, CASE WHEN Chapter1 IS NOT NULL THEN Chapter1||'Ch1' END
from A
union
SELECT Index, CASE WHEN Chapter2 IS NOT NULL THEN Chapter2||'Ch2' END
from A
...
..
SELECT Index, CASE WHEN Chapter20 IS NOT NULL THEN Chapter20||'Ch20' END
from A
Is there any other optimized approach of doing this? Any help would be appreciated.
Various forms of SQL support lateral joins. One syntax looks like:
select t.index, v.chapter || 'Ch' || v.n
from t cross join lateral
(values (chapter1, 1), (chapter2, 2), . . . ) v(chapter, n)
where chapter is not null;
Do not despair, though, if your database does not. You can do something similar with a cross join:
select id, chapter || 'Ch' || n
from (select t.id,
(case when n = 1 then chapter1
when n = 2 then chapter2
. . .
when n = 20 then chapter20
end) as chapter, v.n
from t cross join
(values (1), (2), (3), . . . ) v(n) -- or the equivalent for your database to get a table with 20 numbers
) tc
where chapter is not null;
Both of these approaches only scan the table once, which should be a performance improvement. If the "table" is really a complicated query, this can be a big improvement.
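The cross join variant, including the NULL filter, can be tried on a cut-down version of the table. This sketch uses SQLite via Python's sqlite3 module; the table is reduced to three chapter columns, the sample values are arranged for the example, and the key column is named idx because INDEX is a reserved word in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down, hypothetical version of table A with three chapter columns.
conn.execute(
    "CREATE TABLE a (idx INTEGER, chapter1 TEXT, chapter2 TEXT, chapter3 TEXT)"
)
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?, ?)",
    [(1, "CHE", "MTH", None), (2, "ML", None, "BIO"), (3, None, "DB", "HIST")],
)

rows = conn.execute("""
    WITH v(n) AS (VALUES (1), (2), (3))   -- one number per chapter column
    SELECT idx, chapter || 'Ch' || n AS label
    FROM (SELECT a.idx, v.n,
                 CASE v.n WHEN 1 THEN chapter1
                          WHEN 2 THEN chapter2
                          WHEN 3 THEN chapter3 END AS chapter
          FROM a CROSS JOIN v) AS tc
    WHERE chapter IS NOT NULL             -- drop the empty chapter slots
    ORDER BY idx, n
""").fetchall()
print(rows)
# [(1, 'CHECh1'), (1, 'MTHCh2'), (2, 'MLCh1'), (2, 'BIOCh3'), (3, 'DBCh2'), (3, 'HISTCh3')]
```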

Array column and aggregation in BigQuery SQL: Why the values are not all aggregated?

I've executed the code below in BigQuery
SELECT ( --inner query
SELECT STRING_AGG(c) FROM t1.array_column c
)
FROM (
select 1 as f1, ['1','2','3'] as array_column
union all
select 2 as f1, ['5','6','7'] as array_column
) t1;
I expected something like
Row|f0_
1 | 1,2,3,5,6,7
because there is no GROUP BY in the inner query. So, I'm expecting STRING_AGG to be evaluated on all the lines.
SELECT STRING_AGG(c) FROM t1.array_column c
Instead I'm getting something like this:
Row|f0_
1 |1,2,3
2 |5,6,7
I'm having trouble understanding why I get this result.
This is your query:
SELECT (SELECT STRING_AGG(c) FROM t1.array_column c
)
FROM (select 1 as f1, ['1', '2', '3'] as array_column
union all
select 2 as f1, ['5', '6', '7'] as array_column
) t1;
First, I'm surprised it works. I thought you needed unnest():
SELECT (SELECT STRING_AGG(c) FROM UNNEST(t1.array_column) c
)
What is happening? Well, this would be more obvious if you selected f1. Then you would get:
1 1,2,3
2 5,6,7
This should make it more clear. For each row in t1 (and there are two rows), your code is:
unnesting the array into rows with a column called c.
reaggregating those rows into a string (with no name)
If you want to combine the elements in the arrays, use array_concat_agg():
SELECT array_concat_agg(array_column)
FROM (select 1 as f1, ['1','2','3'] as array_column
union all
select 2 as f1, ['5','6','7'] as array_column
) t1;
If you want this represented as a string instead of an array, use array_to_string():
SELECT array_to_string(array_concat_agg(array_column), ',')
FROM (select 1 as f1, ['1','2','3'] as array_column
union all
select 2 as f1, ['5','6','7'] as array_column
) t1;
Below is for BigQuery Standard SQL
#standardSQL
SELECT STRING_AGG((SELECT STRING_AGG(c) FROM t1.array_column c))
FROM (
SELECT 1 AS f1, ['1','2','3'] AS array_column UNION ALL
SELECT 2 AS f1, ['5','6','7'] AS array_column
) t1
and produces
Row f0_
1 1,2,3,5,6,7
Note 1: you were almost there - you were just missing an extra STRING_AGG that does the final grouping of the strings created from the respective array in each row
Note 2: because array_column is of ARRAY type, it is treated as an inner table referenced as t1.array_column, and as such FROM t1.array_column c is equivalent to FROM UNNEST(array_column) c - very cool hidden feature :o)
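The per-row behaviour of the inner query can be mirrored in plain Python, which may make the two levels of aggregation easier to see. This is only an analogy, not BigQuery code: the list of dictionaries stands in for t1, and str.join stands in for STRING_AGG.

```python
# Stand-in for t1: two rows, each holding an array.
rows = [
    {"f1": 1, "array_column": ["1", "2", "3"]},
    {"f1": 2, "array_column": ["5", "6", "7"]},
]

# Inner (correlated) aggregation: evaluated once per outer row.
per_row = [",".join(row["array_column"]) for row in rows]
print(per_row)   # ['1,2,3', '5,6,7']

# Outer aggregation over the per-row strings: a single result.
combined = ",".join(per_row)
print(combined)  # 1,2,3,5,6,7
```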

SELECT DISTINCT for data groups

I have the following table:
ID Data
1 A
2 A
2 B
3 A
3 B
4 C
5 D
6 A
6 B
etc. In other words, I have groups of data per ID. You will notice that the data group (A, B) occurs multiple times. I want a query that can identify the distinct data groups and number them, such as:
DataID Data
101 A
102 A
102 B
103 C
104 D
So DataID 102 would resemble data (A,B), DataID 103 would resemble data (C), etc. In order to be able to rewrite my original table in this form:
ID DataID
1 101
2 102
3 102
4 103
5 104
6 102
How can I do that?
PS. Code to generate the first table:
CREATE TABLE #t1 (id INT, data VARCHAR(10))
INSERT INTO #t1
SELECT 1, 'A'
UNION ALL SELECT 2, 'A'
UNION ALL SELECT 2, 'B'
UNION ALL SELECT 3, 'A'
UNION ALL SELECT 3, 'B'
UNION ALL SELECT 4, 'C'
UNION ALL SELECT 5, 'D'
UNION ALL SELECT 6, 'A'
UNION ALL SELECT 6, 'B'
In my opinion, you have to create a custom aggregate that concatenates data (for strings, a CLR approach is recommended for performance reasons).
Then I would group by ID and select distinct from the grouping, adding a row_number() or dense_rank() function, your choice. Anyway, it should look like this:
with groupings as (
select concat(data) groups
from Table1
group by ID
)
select groups, row_number() over (order by groups) from groupings
The following query using CASE will give you the result shown below.
From there on, getting the distinct datagroups and proceeding further should not really be a problem.
SELECT
id,
MAX(CASE data WHEN 'A' THEN data ELSE '' END) +
MAX(CASE data WHEN 'B' THEN data ELSE '' END) +
MAX(CASE data WHEN 'C' THEN data ELSE '' END) +
MAX(CASE data WHEN 'D' THEN data ELSE '' END) AS DataGroups
FROM t1
GROUP BY id
ID DataGroups
1 A
2 AB
3 AB
4 C
5 D
6 AB
However, this kind of logic will only work if the "Data" values are both fixed and known beforehand.
In your case, you do say that is the case. However, considering that you also say that there are 1000 of them, this would frankly be a ridiculous-looking query :-)
LuckyLuke's suggestion above would be the more generic and probably saner way to implement the solution in your case.
From your sample data (having added the missing 2,'A' tuple), the following gives the renumbered (and uniquified) data:
with NonDups as (
select t1.id
from #t1 t1 left join #t1 t2
on t1.id > t2.id and t1.data = t2.data
group by t1.id
having COUNT(t1.data) > COUNT(t2.data)
), DataAddedBack as (
select ID,data
from #t1 where id in (select id from NonDups)
), Renumbered as (
select DENSE_RANK() OVER (ORDER BY id) as ID,Data from DataAddedBack
)
select * from Renumbered
Giving:
1 A
2 A
2 B
3 C
4 D
I think then, it's a matter of relational division to match up rows from this output with the rows in the original table.
Just to share my own dirty solution that I'm using for the moment:
SELECT DISTINCT t1.id, D.data
FROM #t1 t1
CROSS APPLY (
SELECT CAST(Data AS VARCHAR) + ','
FROM #t1 t2
WHERE t2.id = t1.id
ORDER BY Data ASC
FOR XML PATH('') )
D ( Data )
And then proceeding analogously to LuckyLuke's solution.
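The overall idea shared by these answers (build one signature per ID from its sorted data values, then number the distinct signatures) can be sketched in plain Python. This is an illustration of the logic, not a T-SQL solution; the starting DataID of 101 follows the question, and numbering signatures in order of first appearance by ID is an assumption.

```python
from collections import defaultdict

rows = [(1, "A"), (2, "A"), (2, "B"), (3, "A"), (3, "B"),
        (4, "C"), (5, "D"), (6, "A"), (6, "B")]

# Collect the set of data values belonging to each ID.
members = defaultdict(set)
for row_id, data in rows:
    members[row_id].add(data)

data_ids = {}       # signature (sorted tuple of data values) -> DataID
id_to_data_id = {}  # ID -> DataID
for row_id in sorted(members):
    signature = tuple(sorted(members[row_id]))
    if signature not in data_ids:
        data_ids[signature] = 101 + len(data_ids)  # next unused DataID
    id_to_data_id[row_id] = data_ids[signature]

print(data_ids)
# {('A',): 101, ('A', 'B'): 102, ('C',): 103, ('D',): 104}
print(id_to_data_id)
# {1: 101, 2: 102, 3: 102, 4: 103, 5: 104, 6: 102}
```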