Efficiently outer join two array columns row-wise in BigQuery table - sql

I'll first state the question as simply as possible, then elaborate with more detail and an example.
Concise question without context
I have a table with rows containing columns of arrays. I need to outer join the elements of some pairs of these, compute some variables, and then aggregate the results back into a new array. I'm currently using a pattern where I:
unnest each column in the pair to be joined (cross join to PK of row)
full outer join the two on the PK and compute desired fields
group by PK to get back to single row with array column that summarizes the results
Is there a way to do this without the multiple unnesting and grouping back down?
More context and an example
I have a table which represents edits to an entity that is made up of multiple sub-records. Each row represents a single entity. There is a column before that contains the records before the edit, and another after that contains the records afterwards.
My goal is to label each sub-record with exactly one of the four valid edit types:
DELETE - record exists in before but not after
ADD - record exists in after but not before
EDIT - record exists in both before and after but any field was changed
NONE - record exists in both before and after and no fields were changed
Each of the sub-record values is represented by its ID and a hash of all of its fields. I've created some fake data and provided my initial implementation below. This works, but it seems very roundabout.
WITH source_data AS (
SELECT
1 AS pkField,
[
STRUCT(1 AS id, 1 AS fieldHash),
STRUCT(2 AS id, 2 AS fieldHash),
STRUCT(3 AS id, 3 AS fieldHash)
] AS before,
[
STRUCT(1 AS id, 1 AS fieldHash),
STRUCT(2 AS id, 0 AS fieldHash), -- record 2 edited
-- record 3 deleted
STRUCT(4 AS id, 4 AS fieldHash), -- record 4 added
STRUCT(5 AS id, 5 AS fieldHash) -- record 5 added
] AS after
)
SELECT
pkField,
ARRAY_AGG(STRUCT(
id,
CASE
WHEN beforeHash IS NULL THEN "ADD"
WHEN afterHash IS NULL THEN "DELETE"
WHEN beforeHash <> afterHash THEN "EDIT"
ELSE "NONE"
END AS editType
)) AS edits
FROM (
SELECT pkField, id, fieldHash AS beforeHash
FROM source_data
CROSS JOIN UNNEST(source_data.before)
)
FULL OUTER JOIN (
SELECT pkField, id, fieldHash AS afterHash
FROM source_data
CROSS JOIN UNNEST(source_data.after)
)
USING (pkField, id)
GROUP BY pkField
Is there a simpler and/or more efficient way to do this? Perhaps something that avoids the multiple unnesting and grouping back down?

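I think what you have is already a simple and efficient approach.
In the meantime, you can consider the optimized version below, which runs the outer join inside an array subquery for each row, so nothing has to be grouped back by the PK: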
select pkField,
array(select struct(
id, case
when b.fieldHash is null then 'ADD'
when a.fieldHash is null then 'DELETE'
when b.fieldHash != a.fieldHash then 'EDIT'
else 'NONE'
end as editType
) edits
from (select id, fieldHash from t.before) b
full outer join (select id, fieldHash from t.after) a
using(id)
) edits
from source_data t
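If applied to the sample data in your question, the output is a single row with pkField = 1 and an edits array containing (1, NONE), (2, EDIT), (3, DELETE), (4, ADD), (5, ADD); the order of elements within the array is not guaranteed.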
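To sanity-check the labelling rules outside BigQuery, here is a minimal Python sketch of the same per-row full outer join on id. The function name `label_edits` and the dict-based record layout are assumptions made for illustration; they are not part of the original query.

```python
def label_edits(before, after):
    """Full outer join of two lists of {id, fieldHash} records on id."""
    before_by_id = {rec["id"]: rec["fieldHash"] for rec in before}
    after_by_id = {rec["id"]: rec["fieldHash"] for rec in after}
    edits = []
    # The union of ids plays the role of FULL OUTER JOIN ... USING (id).
    for rec_id in sorted(before_by_id.keys() | after_by_id.keys()):
        if rec_id not in before_by_id:
            edit_type = "ADD"
        elif rec_id not in after_by_id:
            edit_type = "DELETE"
        elif before_by_id[rec_id] != after_by_id[rec_id]:
            edit_type = "EDIT"
        else:
            edit_type = "NONE"
        edits.append({"id": rec_id, "editType": edit_type})
    return edits


# Same sample data as in the question.
before = [{"id": 1, "fieldHash": 1}, {"id": 2, "fieldHash": 2}, {"id": 3, "fieldHash": 3}]
after = [
    {"id": 1, "fieldHash": 1},
    {"id": 2, "fieldHash": 0},  # record 2 edited
    {"id": 4, "fieldHash": 4},  # record 4 added
    {"id": 5, "fieldHash": 5},  # record 5 added
]
edits = label_edits(before, after)
print(edits)
```

The sketch sorts by id only to make the result deterministic; the SQL version gives no such ordering guarantee.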


Use a CASE expression without typing matched conditions manually using PostgreSQL

I have a long and wide table; the following is just an example. The table structure might look a bit horrible for SQL, but I was wondering whether there's a way to extract each ID's price with a CASE expression without typing out the column names to match in the expression.
IDs | A_Price | B_Price | C_Price | ...
A   | 23      |         |         | ...
B   |         | 65      | 82      | ...
C   |         |         |         | ...
A   | 10      |         |         | ...
..  | ...     | ...     | ...     | ...
Table I want to achieve:
IDs | price
A   | 23;10
B   | 65
C   | 82
..  | ...
I tried:
SELECT IDs, string_agg(CASE IDs WHEN 'A' THEN A_Price
WHEN 'B' THEN B_Price
WHEN 'C' THEN C_Price
end::text, ';') as price
FROM table
GROUP BY IDs
ORDER BY IDs
To avoid typing A, B, A_Price, B_Price etc, I tried to format their names and call them from a subquery, but it seems that SQL cannot recognise them as columns and cannot call the corresponding values.
WITH CTE AS (
SELECT IDs, IDs||'_Price' as t FROM ID_list
)
SELECT IDs, string_agg(CASE IDs WHEN CTE.IDs THEN CTE.t
end::text, ';') as price
FROM table
LEFT JOIN CTE cte.IDs=table.IDs
GROUP BY IDs
ORDER BY IDs
You can use a document type like json or hstore as a stepping stone:
Basic query:
SELECT t.ids
, to_json(t.*) ->> (t.ids || '_price') AS price
FROM tbl t;
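to_json() converts the whole row to a JSON object, from which you can then pick a (dynamically concatenated) key. Note that unquoted column names are folded to lower case in Postgres, so the concatenated key has to match the lower-case column name (wrap the ID in lower() if your IDs are upper case).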
Your aggregation:
SELECT t.ids
, string_agg(to_json(t.*) ->> (t.ids || '_price'), ';') AS prices
FROM tbl t
GROUP BY 1
ORDER BY 1;
Converting the whole (big?) row adds some overhead, but you have to read the whole table for your query anyway.
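The idea of treating a row as a key/value document and looking up the column by a computed name can be sketched in plain Python. The sample rows and the function name `aggregate_prices` are assumptions for illustration; the column names are lower case to mirror how Postgres folds unquoted identifiers.

```python
from collections import defaultdict

# Hypothetical sample rows, one dict per table row (like to_json(t.*)).
rows = [
    {"ids": "a", "a_price": 23, "b_price": None, "c_price": None},
    {"ids": "b", "a_price": None, "b_price": 65, "c_price": None},
    {"ids": "c", "a_price": None, "b_price": None, "c_price": 82},
    {"ids": "a", "a_price": 10, "b_price": None, "c_price": None},
]


def aggregate_prices(rows):
    """Pick the '<id>_price' value from each row and join the values per id with ';'."""
    grouped = defaultdict(list)
    for row in rows:
        values = grouped[row["ids"]]  # make sure every id gets a group
        value = row.get(row["ids"] + "_price")  # dynamically built key
        if value is not None:  # string_agg also skips NULLs
            values.append(str(value))
    # A group with no values maps to None, like string_agg returning NULL.
    return {key: (";".join(vals) if vals else None) for key, vals in sorted(grouped.items())}


prices = aggregate_prices(rows)
print(prices)
```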
A union would be one approach here:
SELECT IDs, A_Price FROM yourTable WHERE A_Price IS NOT NULL
UNION ALL
SELECT IDs, B_Price FROM yourTable WHERE B_Price IS NOT NULL
UNION ALL
SELECT IDs, C_Price FROM yourTable WHERE C_Price IS NOT NULL;

Query to return nonmatching lines in two arbitrary tables

I have two sets of tables (i.e. a.1, a.2, a.3, b.1, b.2, b.3, etc.) created using slightly different logic. The analogous tables in the two schemas have the exact same columns (i.e. a.1 has the same columns as b.1). My belief is that the tables in the two schemas should contain the exact same information, but I want to test that belief. Therefore I want to write a query that compares two analogous tables and returns the rows that are not in both tables. Is there an easy way to write a query to do that without manually writing the join? In other words, can I have a query that produces the results I want where I only have to change the names of the tables I want to compare, while leaving the rest of the query unchanged?
To be a bit more explicit, I'm looking to do something like the following:
select *
from a.1
where (all columns in a.1) not in (select * from b.1);
If I could write something like this then all I would have to do to compare a.2 to b.2 would be to change the table names. However, it's not clear to me how to come up with the (all columns in a.1) piece in a general way.
Based on a recommendation in the comments, I've created the following showing the kind of thing I'd like to see:
https://dbfiddle.uk/?rdbms=db2_11.1&fiddle=ad0141b0daf8f8f92e6e3fa8d57e67ad
I was looking for the except clause.
So
select *
from a.1
where (all columns in a.1) not in (select * from b.1);
can be written as
select * from a.1
except
select * from b.1
In the db-fiddle I give an explicit example of what I wanted.
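As a rough, runnable illustration of the EXCEPT approach applied in both directions, here is a sketch using Python's built-in sqlite3 module. The table names `a1` and `b1`, their columns, and the helper `table_diff` are assumptions for the example; the original tables live in DB2.

```python
import sqlite3


def table_diff(conn, left, right):
    """Return rows found in only one of two same-shaped tables, tagged by source."""
    # Table names cannot be bound as parameters, so they are interpolated;
    # only pass trusted identifiers here.
    query = f"""
        SELECT '{left}' AS src, * FROM (SELECT * FROM {left} EXCEPT SELECT * FROM {right})
        UNION ALL
        SELECT '{right}' AS src, * FROM (SELECT * FROM {right} EXCEPT SELECT * FROM {left})
    """
    return sorted(conn.execute(query).fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a1 (c1 INTEGER, c2 TEXT);
    CREATE TABLE b1 (c1 INTEGER, c2 TEXT);
    INSERT INTO a1 VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO b1 VALUES (1, 'x'), (2, 'y'), (4, 'w');
""")
diff = table_diff(conn, "a1", "b1")
print(diff)
```

Only the two table names change between comparisons. Keep in mind that EXCEPT works on sets, so a row that appears twice in one table and once in the other is not reported as a difference.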
If you have a primary key to match rows between the tables, then you can try a full anti-join. For example:
select a.id as aid, b.id as bid
from a
full join b on b.id = a.id
where a.id is null or b.id is null
If the tables are:
A: 1, 2, 3
B: 1, 2, 4
The result is:
AID BID
---- ----
null 4 -- means it's present in B, but not in A
3 null -- means it's present in A, but not in B
See running example at db<>fiddle.
Of course, if your tables do not have a primary key, or if the rows are inconsistent (same PK, different data), then you'll need to adjust the query.
As an alternative you can try this:
select 'a1' t,* from (
select a1.*,row_number() over (partition by c1 order by 1) as rn from a1
minus
select b1.*,row_number() over (partition by c1 order by 1) as rn from b1
)
union all
select 'b1' t,* from (
select b1.*,row_number() over (partition by c1 order by 1) as rn from b1
minus
select a1.*,row_number() over (partition by c1 order by 1) as rn from a1
)
edit: you can shorten the query by precalculating the rn part, instead of doing the same calculation again.

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem. They're both remits for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per claim:
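You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID) as latest_remit -- no ORDER BY here: SQL Server rejects it in a derived table without TOP
on latest_remit.claim_id = p.claim_id;
This gives you one latest timestamp per claim. Note that joining on claim_id alone still returns every remittance of a claim, so you would also need to match on the timestamp to narrow the result. Untested (so please run and make changes).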
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
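The row_number() idea can be tried out on a tiny data set with Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). This is only a sketch: the sample rows are invented, and the extra `remittance_uuid` sort key is an assumption added so that ties on the timestamp are broken deterministically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claims_group2 (
        remittance_uuid TEXT, claim_id INTEGER,
        billed_amount REAL, create_datetime TEXT
    );
    INSERT INTO claims_group2 VALUES
        ('u1', 1, 100.0, '2018-08-15 13:07:50.933'),
        ('u2', 1, 100.0, '2018-08-15 13:07:50.933'),  -- same timestamp as u1
        ('u3', 1,  50.0, '2018-01-01 09:00:00.000'),
        ('u4', 2,  75.0, '2017-12-31 10:00:00.000');
""")

latest = conn.execute("""
    SELECT remittance_uuid, claim_id, create_datetime
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY claim_id
            ORDER BY create_datetime DESC, remittance_uuid  -- tie-breaker (assumption)
        ) AS rn
        FROM claims_group2
        WHERE billed_amount > 0
    ) AS t
    WHERE rn = 1
    ORDER BY claim_id
""").fetchall()
print(latest)
```

Without a tie-breaker in the ORDER BY, which of the tied rows gets rn = 1 is up to the database.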
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Order by data as per supplied Id in sql

Query:
SELECT *
FROM [MemberBackup].[dbo].[OriginalBackup]
where ration_card_id in
(
1247881,174772,
808454,2326154
)
Right now the data is ordered by the auto ID or by whatever I pass in the ORDER BY clause.
But I want the rows to come back in the same sequence as the IDs I have passed.
Expected Output:
All Data for 1247881
All Data for 174772
All Data for 808454
All Data for 2326154
Note:
The number of IDs to be passed will be 300 000.
One option would be to create a CTE containing the ration_card_id values and the order you are imposing on them, and then join to this table:
WITH cte AS (
SELECT 1247881 AS ration_card_id, 1 AS position
UNION ALL
SELECT 174772, 2
UNION ALL
SELECT 808454, 3
UNION ALL
SELECT 2326154, 4
)
SELECT t1.*
FROM [MemberBackup].[dbo].[OriginalBackup] t1
INNER JOIN cte t2
ON t1.ration_card_id = t2.ration_card_id
ORDER BY t2.position
Edit:
If you have many IDs, then neither the answer above nor the answer given using a CASE expression will suffice. In this case, your best bet would be to load the list of IDs into a table, containing an auto increment ID column. Then, each number would be labelled with a position as its record is being loaded into your database. After this, you can join as I have done above.
If the desired order does not reflect a sequential ordering of some preexisting data, you will have to specify the ordering yourself. One way to do this is with a case statement:
SELECT *
FROM [MemberBackup].[dbo].[OriginalBackup]
where ration_card_id in
(
1247881,174772,
808454,2326154
)
ORDER BY CASE ration_card_id
WHEN 1247881 THEN 0
WHEN 174772 THEN 1
WHEN 808454 THEN 2
WHEN 2326154 THEN 3
END
Stating the obvious, but note that this ordering is most likely not backed by any index, so the sort cannot use one.
Insert your ration_card_id values into a #temp table that has an identity column.
Rewrite your SQL query as:
SELECT a.*
FROM [MemberBackup].[dbo].[OriginalBackup] a
JOIN #temp b
on a.ration_card_id = b.ration_card_id
order by b.id
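The "load the IDs into a table with an auto-increment column, then join and order by it" approach can be sketched end to end with Python's built-in sqlite3 module. The table and column names (`original_backup`, `wanted`, `member`, `auto_id`) and the sample rows are assumptions for the example; in SQL Server the lookup table would be a #temp table with an identity column.

```python
import sqlite3

# The IDs in the order the caller wants the rows back.
wanted_ids = [1247881, 174772, 808454, 2326154]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE original_backup ("
    "auto_id INTEGER PRIMARY KEY, ration_card_id INTEGER, member TEXT)"
)
conn.executemany(
    "INSERT INTO original_backup (ration_card_id, member) VALUES (?, ?)",
    [(174772, "b1"), (2326154, "d1"), (1247881, "a1"), (808454, "c1"), (1247881, "a2")],
)

# The auto-increment id records each value's position in the supplied list.
conn.execute(
    "CREATE TEMP TABLE wanted ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, ration_card_id INTEGER)"
)
conn.executemany(
    "INSERT INTO wanted (ration_card_id) VALUES (?)", [(i,) for i in wanted_ids]
)

rows = conn.execute("""
    SELECT a.ration_card_id, a.member
    FROM original_backup AS a
    JOIN wanted AS b ON a.ration_card_id = b.ration_card_id
    ORDER BY b.id, a.auto_id
""").fetchall()
print(rows)
```

The secondary sort on `auto_id` is only there to make the order within one ID deterministic. With hundreds of thousands of IDs, a bulk load into the lookup table scales better than a long IN list or CASE expression.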

SQL identifying where the previous row number contained xyz

I'd like to know how to write a SQL query for the following. I have pulled out some data and assigned row numbers (partitioned by trade ID). When a row has the specific action 'AMEND' (e.g. the row 2 record had the AMEND action), I want to see what the row 3 action was.
If the row 3 action was 'UPDATE', I want a new field to say 'REMOVE' on both rows; if there was no row 3 for that trade, or the row 3 action was not UPDATE, I want the new field to say 'KEEP'.
Is this easily done?
Thanks
You have to self-join the table so that you can pull out the previous row's data:
with cte as (
select *
, row_number() over (partition by trade_id order by......) as seq
from table
)
select * from cte t1
left join cte t2 on t1.trade_id = t2.trade_id and t1.seq = t2.seq + 1
Let's say the fields from the two sides of the self-join are aliased as A and B. Now insert the generated results into a temp table.
Then, you can structure your CASE expression based on the ACTION from the previous row:
select
a.action
, case when (b.action = 'UPDATE' and a.action = 'AMEND') then 'REMOVE'
else 'KEEP'
end NEWField
from #temp
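For a runnable variant of the self-join on row numbers, here is a sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). It follows the wording of the question: an AMEND row that is directly followed by an UPDATE row is marked REMOVE, and so is that UPDATE row. The table `trade_events`, the ordering column `event_time`, and the sample data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trade_events (trade_id INTEGER, event_time INTEGER, action TEXT);
    INSERT INTO trade_events VALUES
        (1, 1, 'NEW'), (1, 2, 'AMEND'), (1, 3, 'UPDATE'),
        (2, 1, 'NEW'), (2, 2, 'AMEND'),
        (3, 1, 'AMEND'), (3, 2, 'CANCEL');
""")

flagged = conn.execute("""
    WITH cte AS (
        SELECT trade_id, action,
               ROW_NUMBER() OVER (PARTITION BY trade_id ORDER BY event_time) AS seq
        FROM trade_events
    )
    SELECT cur.trade_id, cur.seq, cur.action,
           CASE
               WHEN cur.action = 'AMEND'  AND nxt.action = 'UPDATE' THEN 'REMOVE'
               WHEN cur.action = 'UPDATE' AND prv.action = 'AMEND'  THEN 'REMOVE'
               ELSE 'KEEP'
           END AS new_field
    FROM cte AS cur
    -- next and previous row of the same trade; NULL when there is none
    LEFT JOIN cte AS nxt ON nxt.trade_id = cur.trade_id AND nxt.seq = cur.seq + 1
    LEFT JOIN cte AS prv ON prv.trade_id = cur.trade_id AND prv.seq = cur.seq - 1
    ORDER BY cur.trade_id, cur.seq
""").fetchall()
print(flagged)
```

Joining on trade_id as well as on the row number keeps rows of different trades from being paired with each other.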