How to complare one Master table against several child tables - sql

I have one master table (PO_BreakOutAll) with ~3000 rows made up of only two columns (PO_ID, PO_LN_NO) which together make up the primary key. I also have several other tables each with a subset of the data from the master table (or so it is supposed to be). All the tables are the same schema as the master table.
All tables have this exact schema:
PO_ID char(5) PK
PO_LN_NO int PK
I need to do two different types of comparisons for validation and to find duplicates.
First to make sure that every row in the master table exists in one, and only one, of the other child tables.
Second I need to make sure that no row is duplicated across any of the child tables. The same row can exist in two or more child tables and I need to find them.
I can do each table in a separate query but have not figured out how to write one query that compares all the child tables at once.
Here is what I have so far but it does not work:
SELECT a.PO_ID as all_PO,
a.PO_LN_NO,
c.PO_ID as Cummings_PO,
c.PO_LN_NO,
f.PO_ID as filter_PO,
f.PO_LN_NO,
fo.PO_ID as fixedObl_PO,
fo.PO_LN_NO
FROM
PO_BreakOutAll a
LEFT OUTER JOIN
PO_Cummins c ON (c.PO_ID = a.PO_ID AND c.PO_LN_NO = a.PO_LN_NO)
LEFT OUTER JOIN
PO_Filters f ON (f.PO_ID = a.PO_ID AND f.PO_LN_NO = a.PO_LN_NO)
LEFT OUTER JOIN
PO_FixedOblig fo ON (fo.PO_ID = a.PO_ID AND fo.PO_LN_NO = a.PO_LN_NO)

I think #gordon linoff has an overall solution. If you want to work with the CTE paradigm here is an example based on your Fiddle that answers the duplicate question:
WITH CTE (PO_ID,PO_LN_NO,TableName) AS
(SELECT
PO_ID,
PO_LN_NO,
'Cummings' as TableName
FROM PO_Cummins
UNION ALL
SELECT
PO_ID,
PO_LN_NO,
'Filters' as TableName
FROM PO_Filters
UNION ALL
SELECT
PO_ID,
PO_LN_NO,
'Office' as TableName
FROM PO_Office )
SELECT
PO_BreakOutAll.PO_ID,
PO_BreakOutAll.PO_LN_NO,
CHILD_DATA.TABLENAME AS DUP_TABLENAME
FROM
PO_BreakOutAll
INNER JOIN (
SELECT PO_ID, PO_LN_NO, COUNT(1) AS DUP_COUNTER
FROM CTE
GROUP BY PO_ID, PO_LN_NO
HAVING COUNT(1) > 1
) DUPS ON DUPS.PO_ID = PO_BreakOutAll.PO_ID AND DUPS.PO_LN_NO = PO_BreakOutAll.PO_LN_NO
INNER JOIN (
SELECT PO_ID, PO_LN_NO, TABLENAME
FROM CTE
) CHILD_DATA
ON CHILD_DATA.PO_ID = PO_BreakOutAll.PO_ID AND CHILD_DATA.PO_LN_NO = PO_BreakOutAll.PO_LN_NO
ORDER BY PO_ID, PO_LN_NO, DUP_TABLENAME

I wouldn't use join for this; I would use union all. Here is a way to get a count of how the records overlap among the tables:
select isAll, isCummins, isFilters, isOblig, count(*)
from (select PO_ID, PO_LN_NO, sum(isAll) as isAll, sum(isCummins) as isCummins,
sum(isFilters) as isFilters, sum(isOblig) as isOblig
from ((select PO_ID, PO_LN_NO, 1 as isAll, 0 as isCummins, 0 as isFilters, 1 as isOblig
from PO_BreakOutAll
) union all
(select PO_ID, PO_LN_NO, 0, 1, 0, 0
from PO_Cummins
) union all
(select PO_ID, PO_LN_NO, 0, 0, 1, 0
from PO_Filters
) union all
(select PO_ID, PO_LN_NO, 0, 0, 0, 1
from PO_FixedOblig
)
) t
group by PO_ID, PO_LN_NO
) t
group by isAll, isCummins, isFilters, isOblig;
If you want to find rows that fail your test, just use the subquery with where conditions:
select PO_ID, PO_LN_NO, sum(isAll) as isAll, sum(isCummins) as isCummins,
sum(isFilters) as isFilters, sum(isOblig) as isOblig
from ((select PO_ID, PO_LN_NO, 1 as isAll, 0 as isCummins, 0 as isFilters, 1 as isOblig
from PO_BreakOutAll
) union all
(select PO_ID, PO_LN_NO, 0, 1, 0, 0
from PO_Cummins
) union all
(select PO_ID, PO_LN_NO, 0, 0, 1, 0
from PO_Filters
) union all
(select PO_ID, PO_LN_NO, 0, 0, 0, 1
from PO_FixedOblig
)
) t
group by PO_ID, PO_LN_NO
having sum(isAll) <> 1 or
(sum(isAll) = 1 and (sum(isCummins) + sum(isFilters) + sum(isOblig) <> 1)
);

Related

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
id_1
id_2
1
a
2
a
2
b
3
b
4
c
5
c
6
d
6
e
7
f
I would like to be able recursively join then aggregate rows in order to find each disconnected sub-graph represented by these links - that is each collection of IDs that are linked together:
The desired output for the example above would look something like this:
id_1_coll
id_2_coll
1, 2, 3
a, b
4, 5
c
6
d, e
7
f
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even although there is no explicit link row because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 such that we get all the links from id_1 to itself then use a recursive common table expression to traverse all the possible paths between id_1 values then aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are
Remodel the relationship into a series of self-joins for id_1
Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column and grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) we can tweak the query above to also give us the lowest (LEAST) of the values for id_1 at each link then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
id_1
linked_id
lowest_linked_id
1
2
1
2
1
1
2
3
2
3
2
2
4
5
4
5
4
4
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the n+1th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
which we can now link back to the original table for the id_2 values then aggregate (3.) as shown in the complete query below
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id ] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids )
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
grp
id1_coll
id2_coll
1
[1,2,3]
[a,b]
4
[4,5]
[c]
6
[6]
[d,e]
7
[7]
[f]
Limitations/Issues
Unfortunately this approach is inneficient (we have to traverse every single pathway before aggregating it back together) and fails with the real-world case where we have several million join rows. When trying to execute on this data BigQuery runs up a huge "Slot time consumed" then eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id in already in the list of linked_ids we dont need to check it further).
Using ROW_NUMBER() the query is as the follow:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1 : Append with row numbers.
t2 : Extract rows matching either id_1 or id_2 by recursive query.
t3 : Make arrays from id_1 and id_2 with ARRAY_AGG().
However, it may not help your Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming auto-incrementing order/one or both id's move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem as referenced in the other answers, and in the response from #GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates as done by #RomanPekar
here (which I also tested). The main consideration seems to be performance. I'd assume dbms have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.

Simplifying SELECT statement

so I have a statement I believe should work... However it feels pretty suboptimal and I can't for the life of me figure out how to optimise it.
I have the following tables:
Transactions
[Id] is PRIMARY KEY IDENTITY
[Hash] has a UNIQUE constraint
[BlockNumber] has an Index
Transfers
[Id] is PRIMARY KEY IDENTITY
[TransactionId] is a Foreign Key referencing [Transactions].[Id]
TokenPrices
[Id] is PRIMARY KEY IDENTITY
TokenPriceAttempts
[Id] is PRIMARY KEY IDENTITY
[TransferId] is a Foreign Key referencing [Transfers].[Id]
What I want to do, is select all the transfers, with a few bits of data from their related transaction (one transaction to many transfers), where I don't currently have a price stored in TokenPrices related to that transfer.
In the first part of the query, I am getting a list of all the transfers, and calculating the DIFF between the nearest found token price. If one isn't found, this is null (what I ultimately want to select). I am allowing 3 hours either side of the transaction timestamp - if nothing is found within that timespan, it will be null.
Secondly, I am selecting from this set, ensuring first that diff is null as this means the price is missing, and finally that token price attempts either doesn't have an entry for attempting to get a price, or if it does than it has fewer than 5 attempts listed and the last attempt was more than a week ago.
The way I have laid this out results in essentially 3 of the same / similar SELECT statements within the WHERE clause, which feels hugely suboptimal...
How could I improve this approach?
WITH [transferDateDiff] AS
(
SELECT
[t1].[Id],
[t1].[TransactionId],
[t1].[From],
[t1].[To],
[t1].[Value],
[t1].[Type],
[t1].[ContractAddress],
[t1].[TokenId],
[t2].[Hash],
[t2].[Timestamp],
ABS(DATEDIFF(SECOND, [tp].[Timestamp], [t2].[Timestamp])) AS diff
FROM
[dbo].[Transfers] AS [t1]
LEFT JOIN
[dbo].[Transactions] AS [t2]
ON [t1].[TransactionId] = [t2].[Id]
LEFT JOIN
(
SELECT
*
FROM
[dbo].[TokenPrices]
)
AS [tp]
ON [tp].[ContractAddress] = [t1].[ContractAddress]
AND [tp].[Timestamp] >= DATEADD(HOUR, - 3, [t2].[Timestamp])
AND [tp].[Timestamp] <= DATEADD(HOUR, 3, [t2].[Timestamp])
WHERE
[t1].[Type] < 2
)
SELECT
[tdd].[Id],
[tdd].[TransactionId],
[tdd].[From],
[tdd].[To],
[tdd].[Value],
[tdd].[Type],
[tdd].[ContractAddress],
[tdd].[TokenId],
[tdd].[Hash],
[tdd].[Timestamp]
FROM
[transferDateDiff] AS tdd
WHERE
[tdd].[diff] IS NULL AND
(
(
SELECT
COUNT(*)
FROM
[dbo].[TokenPriceAttempts] tpa
WHERE
[tpa].[TransferId] = [tdd].[Id]
)
= 0 OR
(
(
SELECT
COUNT(*)
FROM
[dbo].[TokenPriceAttempts] tpa
WHERE
[tpa].[TransferId] = [tdd].[Id]
)
< 5 AND
(
DATEDIFF(DAY,
(
SELECT
MAX([tpa].[Created])
FROM
[dbo].[TokenPriceAttempts] tpa
WHERE
[tpa].[TransferId] = [tdd].[Id]
),
CURRENT_TIMESTAMP
) >= 7
)
)
)
Here is an attempt to help simplify. I stripped out all the [brackets] that really are not required unless you are running into something like a reserved keyword, or columns with spaces in their name (bad to begin with).
Anyhow, your main query had 3 instances of a select per ID. To eliminate that, I did a LEFT JOIN to a subquery that pulls all transfers of type < 2 AND JOINS to the price attempts ONCE. This way, the result will have already pre-aggregated the count(*) and Max(Created) done ONCE for the same basis of transfers in question with your WITH CTE declaration. So you dont have to keep running the 3 queries each time, and you dont have to query the entire table of ALL transfers, just those with same underlying type < 2 condition. The result subquery alias "PQ" (preQuery)
This now simplifies the readability of the outer WHERE clause from the redundant counts per Id.
WITH transferDateDiff AS
(
SELECT
t1.Id,
t1.TransactionId,
t1.From,
t1.To,
t1.Value,
t1.Type,
t1.ContractAddress,
t1.TokenId,
t2.Hash,
t2.Timestamp,
ABS( DATEDIFF( SECOND, tp.Timestamp, t2.Timestamp )) AS diff
FROM
dbo.Transfers t1
LEFT JOIN dbo.Transactions t2
ON t1.TransactionId = t2.Id
LEFT JOIN dbo.TokenPrices tp
ON t1.ContractAddress = tp.ContractAddress
AND tp.Timestamp >= DATEADD(HOUR, - 3, t2.Timestamp)
AND tp.Timestamp <= DATEADD(HOUR, 3, t2.Timestamp)
WHERE
t1.Type < 2
)
SELECT
tdd.Id,
tdd.TransactionId,
tdd.From,
tdd.To,
tdd.Value,
tdd.Type,
tdd.ContractAddress,
tdd.TokenId,
tdd.Hash,
tdd.Timestamp
FROM
transferDateDiff tdd
LEFT JOIN
( SELECT
t1.Id,
COUNT(*) Attempts,
MAX(tpa.Created) MaxCreated
FROM
dbo.Transfers t1
JOIN dbo.TokenPriceAttempts tpa
on t1.Id = tpa.TransferId
WHERE
t1.Type < 2
GROUP BY
t1.Id ) PQ
on tdd.Id = PQ.Id
WHERE
tdd.diff IS NULL
AND ( PQ.Attempts IS NULL
OR PQ.Attempts = 0
OR ( PQ.Attempts < 5
AND DATEDIFF(DAY, PQ.MaxCreated, CURRENT_TIMESTAMP ) >= 7
)
)
REVISED to remove the WITH CTE into a single query
SELECT
t1.Id,
t1.TransactionId,
t1.From,
t1.To,
t1.Value,
t1.Type,
t1.ContractAddress,
t1.TokenId,
t2.Hash,
t2.Timestamp
FROM
-- Now, this pre-query is left-joined to token price attempts
-- so ALL Transfers of type < 2 are considered
( SELECT
t1.Id,
coalesce( COUNT(*), 0 ) Attempts,
MAX(tpa.Created) MaxCreated
FROM
dbo.Transfers t1
LEFT JOIN dbo.TokenPriceAttempts tpa
on t1.Id = tpa.TransferId
WHERE
t1.Type < 2
GROUP BY
t1.Id ) PQ
-- Now, we can just directly join to transfers for the rest
JOIN dbo.Transfers t1
on PQ.Id = t1.Id
-- and the rest from the WITH CTE construct
LEFT JOIN dbo.Transactions t2
ON t1.TransactionId = t2.Id
LEFT JOIN dbo.TokenPrices tp
ON t1.ContractAddress = tp.ContractAddress
AND tp.Timestamp >= DATEADD(HOUR, - 3, t2.Timestamp)
AND tp.Timestamp <= DATEADD(HOUR, 3, t2.Timestamp)
WHERE
ABS( DATEDIFF( SECOND, tp.Timestamp, t2.Timestamp )) IS NULL
AND ( PQ.Attempts = 0
OR ( PQ.Attempts < 5
AND DATEDIFF(DAY, PQ.MaxCreated, CURRENT_TIMESTAMP ) >= 7 )
)
I don't understand why you are doing this:
LEFT JOIN
(
SELECT
*
FROM
[dbo].[TokenPrices]
)
AS [tp]
ON [tp].[ContractAddress] = [t1].[ContractAddress]
AND [tp].[Timestamp] >= DATEADD(HOUR, - 3, [t2].[Timestamp])
AND [tp].[Timestamp] <= DATEADD(HOUR, 3, [t2].[Timestamp])
Isn't this a
LEFT JOIN [dbo].[TokenPrices] as TP ...
This:
SELECT
COUNT(*)
FROM
[dbo].[TokenPriceAttempts] tpa
WHERE
[tpa].[TransferId] = [tdd].[Id]
could be another CTE instead of being a sub...
In fact any of your sub queries could be CTE's, that's part of a CTE is making things easier to read.
,TPA
AS
(
SELECT COUNT(*)
FROM [dbo].[TokenPriceAttempts] tpa
WHERE [tpa].[TransferId] = [tdd].[Id]
)

Count Joins from Multiple Tables

For reference, I am using Postgres 9.2.23.
I have several tables where one table (user_group) is related to some other tables (eg: posts, group_invites, and some more other ones). There is, also a groups table, but it doesn't hold any data that I need for the purposes of these queries.
Table user_group:
fk_user_group_id, fk_user_id, fk_group_id, fk_invite_id user_status, ...
Table message:
pk_message_id, fk_user_id, fk_group_id, child_message_id, ...
Table group_prospective_user:
pk_prospective_user_id, fk_group_id, ...
I want to get some statistics for each of the related tables for a list of specified group ids if the user is a member of the group.
Right now I do this with one query for each related table, eg:
select
"public"."user_group"."fk_group_id" as "groupId",
count(case
when (
"public"."message"."child_message_id" is null
and "public"."message"."pk_message_id" is not null
) then "public"."message"."pk_message_id"
end) as "numDiscussions",
count("public"."message"."pk_message_id") as "numDiscussionPosts"
from "public"."user_group"
left outer join "public"."message"
on "public"."message"."fk_group_id" = "public"."user_group"."fk_group_id"
where (
"public"."user_group"."fk_group_id" in (
1, 11, 23, 530, 1070
)
and "public"."user_group"."role" in (
'ADMINISTRATOR', 'MODERATOR', 'MEMBER'
)
and "public"."user_group"."fk_user_id" = 17517
)
group by "public"."user_group"."fk_group_id"
And for invites:
select
"public"."user_group"."fk_group_id" as "groupId",
count(case
when "public"."prospective_user"."status" = 1 then "public"."prospective_user"."pk_prospective_user_id"
end) as "numInviteesExternal"
from "public"."user_group"
left outer join "public"."prospective_user"
on "public"."prospective_user"."fk_group_id" = "public"."user_group"."fk_group_id"
where (
"public"."user_group"."fk_group_id" in (
1, 11, 23, 530, 6176
)
and "public"."user_group"."role" in (
'ADMINISTRATOR', 'MODERATOR', 'MEMBER'
)
and "public"."user_group"."fk_user_id" = 17517
)
group by "public"."user_group"."fk_group_id"
The query to count the number of group invites is very similar to the above query. Just the count when and join on change.
Each of the queries to these tables has the same related logic for checking the groups to which the current user is an active member. Is there efficient way to merge multiple similar queries like this into a single query?
I tried using multiple LEFT JOINs with select count distinct, but that ran into performance issues on groups with both lots of messages, and lots of invites. Is there a way to easily/efficiently do this with, say, a subquery?
The answer from user #Parfait was the most scalable solution I could find. I based my queries on this tutorial: https://www.sqlteam.com/articles/using-derived-tables-to-calculate-aggregate-values.
While this isn't perfect, and results in a bunch of subqueries running, it does get me all the data at once, and with a single trip to the DB.
It ended up like this:
"groups"."groupId",
coalesce(
"members"."member_count",
0
) as "numActiveMembers",
coalesce(
"members"."invitee_count",
0
) as "numInviteesInternal",
coalesce(
"discussions"."discussions_count",
0
) as "numDiscussions",
coalesce(
"discussions"."posts_count",
0
) as "numDiscussionPosts"
from (
select "public"."user_group"."fk_group_id" as "groupId"
from "public"."user_group"
where (
"public"."user_group"."fk_group_id" in (
1, 2, 3, 4, 5
)
and "public"."user_group"."role" = 'ADMINISTRATOR'
and "public"."user_group"."fk_user_id" = 123
)
group by "public"."user_group"."fk_group_id"
) as "groups"
left outer join (
select
"public"."user_group"."fk_group_id" as "members_group_id",
count(distinct case
when "public"."user_group"."role" in (
'ADMINISTRATOR', 'MODERATOR', 'MEMBER'
) then "public"."user_group"."pk_user_group_id"
end) as "member_count",
count(distinct case
when "public"."user_group"."role" = 'INVITEE' then "public"."user_group"."pk_user_group_id"
end) as "invitee_count"
from "public"."user_group"
group by "public"."user_group"."fk_group_id"
) as "members"
on "members_group_id" = "groupId"
left outer join (
select
"public"."message"."fk_group_id" as "discussions_group_id",
count(case
when (
"public"."message"."child_message_id" is null
and "public"."message"."pk_message_id" is not null
) then "public"."message"."pk_message_id"
end) as "discussions_count",
count("public"."message"."pk_message_id") as "posts_count"
from "public"."message"
group by "public"."message"."fk_group_id"
) as "discussions"
on "discussions_group_id" = "groupId"```

Maximum number of statements within CTE

WITH abidAccount AS
(
SELECT
[ID], AzureBlobInsertDate = MAX(AzureBlobInsertDate)
FROM
[dba].[CurAccounts]
GROUP BY
ID
),
recentAccount AS
(
SELECT ca.*
FROM [dba].[CurAccounts] ca
JOIN abidAccount aa ON aa.[id] = ca.[ID]
AND aa.azureblobInsertDate = ca.azureBlobInsertDate
),
abidDevice AS
(
SELECT deviceID, azureBlobInsertDate = MAX(azureBlobInsertDate)
FROM [dba].[CurrentDevices]
GROUP BY DeviceID
),
recentDevice AS
(
SELECT cd.*
FROM [RZRExploreLayer3].[CurrentDevices] cd
JOIN abidDevice ad ON ad.DeviceID = cd.DeviceID
AND ad.azureBlobInsertDate = cd.azureblobinsertdate
)
SELECT
rd.deviceId,
rd.[DeviceReturned],
rd.accountNumber,
rd.[DeviceFormat],
rd.[DeviceLabel],
MAX(ra.azureBlobInsertDate) AS AzureBlobInsertDate
FROM
recentAccount ra
JOIN
recentDevice rd ON ra.[id] = rd.accountNumber
WHERE
rd.deviceReturned NOT LIKE 'Null'
GROUP BY
rd.deviceId, rd.[DeviceReturned], rd.accountNumber,
rd.[DeviceFormat], rd.[DeviceLabel]
/* deviceID, rd.[DeviceReturned], rd.accountNumber, ra.azureBlobInsertDate */
HAVING
COUNT(1) > 1
How do I combine multiple CTE into one query?
My query is attempting to determine if there are duplicate records and if so only keep the max(AzureBlobInsertDate) record and remove other duplicates. then combine all the results from the CurAccounts & Devices tables.
Any assistance you can offer is greatly appreciated.

How to create a view from existing table records, but also adding new records that do not exist

I am trying to create a view from an existing views data, but also if there are certain lines that do not exist per part/date combo, then have those lines be created. I have the below query that is showing what I currently have for the particular s_date/part_no combos:
SELECT
s_date,
part_no,
issue_group,
s_level,
qty_filled
FROM
current_view
WHERE
part_no = 'xxxxx'
AND s_date IN (
'201802',
'201803'
)
ORDER BY
s_date,
part_no,
issue_group,
DECODE(s_level, '80', 1, '100', 2, 'Late', 3)
Which produces the below:
I know how to create a view with that data, that's the easy part. But what I'm needing is a line for each issue_group and s_level combo, and if it's a created line, to put 0 as the qty_filled.
Every part_no / s_date combo should have 6 rows that go with it
- issue_group = '1' / s_level = '80'
- issue_group = '1' / s_level = '100'
- issue_group = '1' / s_level = 'Late'
- issue_group = '2/3 ' / s_level = '80'
- issue_group = '2/3 ' / s_level = '100'
- issue_group = '2/3 ' / s_level = 'Late'
So if one of the above combinations already exists for the current s_date/part_no, then it obviously takes the qty_filled info from the current view. If not, a new line is created, and qty_filled = 0. So I'm trying to get it to look like this:
I've only shown 1 part, with a couple dates, just to get the point across. There are 10k+ parts within the table and there will never be more than 1 part/date combo for each of the 6 issue_group/s_level combos.
The idea is to generate the rows using CROSS JOIN and then bring in the extra information with a LEFT JOIN. In Oracle syntax, this looks like:
WITH v as (
SELECT v.*
FROM current_view v
WHERE part_no = 'xxxxx' AND
s_date IN ('201802', '201803')
)
SELECT d.s_date, ig.part_no, ig.issue_group, l.s_level,
COALESCE(v.qty_filled, 0) as qty_filled
FROM (SELECT DISTINCT s_date FROM v) d CROSS JOIN
(SELECT DISTINCT part_no, ISSUE_GROUP FROM v) ig CROSS JOIN
(SELECT '80' as s_level FROM DUAL UNION ALL
SELECT '100' FROM DUAL UNION ALL
SELECT 'LATE' FROM DUAL
) l LEFT JOIN
v
ON v.s_date = d.s_date AND v.part_no = ig.part_no AND
v.issue_group = ig.issue_group AND v.s_level = l.s_level
ORDER BY s_date, part_no, issue_group,
(CASE s_level WHEN '80' THEN 1 WHEN '100' THEN 2 WHEN 'Late' THEN 3 END)
One solution could be to generate a cartesian product of all expected rows using a cartesian product between the (fixed) list of values, and then LEFT JOIN it with current_view.
The following query guarantees that you will get a record for each given s_date/part_no/issue_group/s_level tuple. If no record matches in current_view, the query will display a 0 quantity.
SELECT
sd.s_date,
pn.part_no,
ig.issue_group,
sl.s_level,
COALESCE(cv.qty_filled, 0) qty_filled
FROM
(SELECT '201802' AS s_date UNION SELECT '201803') AS sd
CROSS JOIN (SELECT 'xxxxx' AS part_no) AS pn
CROSS JOIN (SELECT '1' AS issue_group UNION SELECT '2') AS ig
CROSS JOIN (SELECT '80' AS s_level UNION SELECT '100' UNION SELECT 'Late') AS sl
LEFT JOIN current_view cv
ON cv.s_date = sd.s_date
AND cv.part_no = pn.part_no
AND cv.issue_group = ig.issue_group
AND cv.s_level = ig.s_level
ORDER BY
sd.s_date,
pn.part_no,
ig.issue_group,
DECODE(sl.s_level, '80', 1, '100', 2, 'Late', 3)
NB : you did not tag your RDBMS. This should work on most of them, excepted Oracle, where you need to add FROM DUAL to each select in the queries that list the allowed values, like :
(SELECT '201802' AS s_date FROM DUAL UNION SELECT '201803' FROM DUAL) AS sd