Capture changes between 2 datasets with duplicates - sql

This is a follow-up to the question "Capture changes in 2 datasets".
I need to capture the changes between two datasets based on one or more keys: one dataset is the historical version and the other is the current version of the same data (both share the same schema). The datasets can also contain duplicate rows. In the example below, id is the key used for comparison:
-- Table t_curr
-------
id col
-------
1 A
1 B
2 C
3 F
-- Table t_hist
-------
id col
-------
1 B
2 C
2 D
4 G
-- Expected output t_change
----------------
id col change
----------------
1 A modified -- change status is 'modified' as first row for id=1 is different for both tables
1 B inserted
2 C same
2 D deleted
3 F inserted
4 G deleted
I'm looking for an efficient solution to get the desired output.
EDIT
Explanation: assume the records are fetched from t_curr in the order shown and are ranked within each id:
1/A is the first and 1/B the second record in t_curr
1/B is the first record in t_hist
The first records of both datasets are compared, i.e. 1/A in t_curr is compared with 1/B in t_hist, hence 1/A is marked as modified in t_change
The second record, 1/B, exists only in t_curr (t_hist has no second record for id=1), so it is marked as inserted

I was able to do it using full outer join and row_number(). Query:
with t_hist as (
select 1 as id, 'B' as col union all
select 2 as id, 'C' as col union all
select 2 as id, 'D' as col union all
select 4 as id, 'G' as col
),
t_curr as (
select 1 as id1, 'A' as col1 union all
select 1 as id1, 'B' as col1 union all
select 2 as id1, 'C' as col1 union all
select 3 as id1, 'F' as col1
)
select
  coalesce(id1, id) as id_,
  coalesce(col1, col) as col_,
  case
    when id is null then 'inserted'
    when id1 is null then 'deleted'
    when col = col1 then 'same'
    else 'modified'
  end as change
from (
  -- pair the n-th row of each id in t_curr with the n-th row of the same id in t_hist
  -- note: ordering by the partition key leaves the row order within an id up to the engine
  select c.*, h.*
  from (select *, row_number() over (partition by id1 order by id1) as r1 from t_curr) c
  full outer join
       (select *, row_number() over (partition by id order by id) as r from t_hist) h
    on id1 = id and r1 = r
) paired
order by id_
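To make the pairing logic easier to follow outside of SQL, here is a minimal Python sketch of the same idea. It is only an illustration, not the query above: the function name capture_changes and the in-memory lists are made up for this example, rows are paired in list order, and col is assumed never to be NULL (None is used as the "no partner" marker).

```python
from collections import defaultdict
from itertools import zip_longest


def capture_changes(curr, hist):
    """Pair the n-th row of each id in curr with the n-th row of the same id in hist."""
    curr_by_id, hist_by_id = defaultdict(list), defaultdict(list)
    for id_, col in curr:
        curr_by_id[id_].append(col)
    for id_, col in hist:
        hist_by_id[id_].append(col)

    changes = []
    for id_ in sorted(curr_by_id.keys() | hist_by_id.keys()):
        # zip_longest plays the role of the full outer join on (id, row_number):
        # the shorter side is padded with None.
        for c, h in zip_longest(curr_by_id[id_], hist_by_id[id_]):
            if h is None:
                changes.append((id_, c, "inserted"))
            elif c is None:
                changes.append((id_, h, "deleted"))
            elif c == h:
                changes.append((id_, c, "same"))
            else:
                changes.append((id_, c, "modified"))
    return changes


# Sample data from the question (assumption: list order is the fetch order).
t_curr = [(1, "A"), (1, "B"), (2, "C"), (3, "F")]
t_hist = [(1, "B"), (2, "C"), (2, "D"), (4, "G")]
result = capture_changes(t_curr, t_hist)
```

With the sample data, result contains the six rows of the expected t_change table in the same order.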

Related

Split one row to many in same database table

We have a requirement to split one row into many rows (in the same table) based on some conditions.
Let's suppose we have this table:
ID Value
-- -----
1 V1
2 V2
3 V3
The requirement is:
if ID=1, split this row into two more rows where IDs of new rows will be 4 and 5 and the value will be V1 (same as ID = 1 value) only.
if ID=2, don't split.
if ID=3, split this row into one more row where ID of the new row will be 6 and value will be V3 (same as ID = 3 value) only.
The final output will be:
ID Value
-- -----
1 V1
4 V1
5 V1
2 V2
3 V3
6 V3
I am looking for a SQL script or stored procedure that will help me achieve this.
You can generate the rows with a join and a derived table, and then use union all to bring in the existing rows:
select id, value
from t
union all
select x.new_id, t.value
from t join
(select 1 as old_id, 4 as new_id from dual union all
select 1 as old_id, 5 as new_id from dual union all
select 3 as old_id, 6 as new_id from dual
) x
on t.id = x.old_id;
If you just want to insert the values, use insert with the second query.
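As a rough sketch of the insert variant, the snippet below runs the second query as an INSERT ... SELECT against an in-memory SQLite database through Python's sqlite3 module. This is an assumption made for the sake of a runnable example: the answer targets Oracle, so the "from dual" parts are dropped here because SQLite has no dual table, and the table name t is taken from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, value TEXT);
    INSERT INTO t VALUES (1, 'V1'), (2, 'V2'), (3, 'V3');
""")

# Insert only the generated rows; the existing rows are already in the table.
# The old_id -> new_id mapping is the one given in the question.
conn.execute("""
    INSERT INTO t (id, value)
    SELECT x.new_id, t.value
    FROM t
    JOIN (SELECT 1 AS old_id, 4 AS new_id UNION ALL
          SELECT 1, 5 UNION ALL
          SELECT 3, 6) x
      ON t.id = x.old_id
""")

rows = conn.execute("SELECT id, value FROM t ORDER BY id").fetchall()
```

After the insert, the table holds ids 1 to 6, with 4 and 5 carrying V1 and 6 carrying V3.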
You can join your table with numbers as follows:
select case when t.id = 2 then t.id
when t.id = 3 then t.id * lvl
when t.id = 1 and lvl > 1 then lvl+2
else lvl
end as id, t.value
from your_table t
cross join (select level as lvl from dual connect by level <=3)
where t.id = 1 or (t.id=2 and lvl=1) or (t.id = 3 and lvl <= 2)

How to pass list of columns from one table to another table in bigquery

Following is my Table-A
Column_Name Flag
----------- ----
col-A 1
col-B 1
col-C 1
col-D 2
Columns col-A, col-B, col-C, col-D are present in Table-B as follows
ID col-A col-B col-C col-D
1 a b c d
I want to write a query something like
select (select column_name from table-A where flag = 1) from table-B
Above query should translate to something like
select col-A, col-B, col-C from table-B.
I tried the following:
select Array_to_String(Array(select column_name from table-A where flag = 1)) from table-B
But the above function Array_to_string gives me the list of columns as a single string.
Below is an example for BigQuery Standard SQL.
I've added another table (tableC) to hold the association between the group (flag) and the respective action. Overall it is simplified (as your question is) but shows a possible approach, which you will still need to adapt to your specific use case.
Note: based on your comments, I assume you apply aggregation on columns within the rows.
#standardSQL
WITH `tableA` AS (
SELECT 'colA' Column_Name, 1 Flag UNION ALL
SELECT 'colB', 1 UNION ALL
SELECT 'colC', 1 UNION ALL
SELECT 'colD', 2 UNION ALL
SELECT 'colE', 2
), `tableB` AS (
SELECT 1 id, 1 colA, 2 colB, 3 colC, 4 colD, 5 colE UNION ALL
SELECT 2, 5, 6, 7, 8, 9
), `tableC` AS (
SELECT 1 flag, 'SUM' action UNION ALL
SELECT 2, 'BIT_AND'
)
SELECT id,
CASE action
WHEN 'SUM' THEN SUM(CAST(val AS INT64))
WHEN 'BIT_AND' THEN BIT_AND(CAST(val AS INT64))
END val
FROM (
SELECT id,
SPLIT(kv, ':')[OFFSET(0)] col,
SPLIT(kv, ':')[OFFSET(1)] val
FROM `tableB` t,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'^{|}$|"', ''))) kv
)
JOIN `tableA` ON col = Column_Name AND flag = 1 -- set flag here
JOIN `tableC` USING(flag)
GROUP BY id, action
You just need to set the value for flag in the join condition.
If you set flag = 1, the result will be the SUM of the columns
Row id val
1 1 6
2 2 18
If you set flag = 2, the result will be BIT_AND applied to the remaining ("survived") columns
Row id val
1 1 4
2 2 8
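The idea behind the query (turn each row into column/value pairs, keep the columns whose flag matches, then apply the action tied to that flag) can be sketched in plain Python. This is only an illustration of the data flow, not BigQuery code; the dictionaries and the aggregate function are names made up for this example.

```python
from functools import reduce
from operator import and_

# tableA: column name -> flag
table_a = {"colA": 1, "colB": 1, "colC": 1, "colD": 2, "colE": 2}

# tableB: the data rows
table_b = [
    {"id": 1, "colA": 1, "colB": 2, "colC": 3, "colD": 4, "colE": 5},
    {"id": 2, "colA": 5, "colB": 6, "colC": 7, "colD": 8, "colE": 9},
]

# tableC: flag -> action (SUM for flag 1, BIT_AND for flag 2)
table_c = {1: sum, 2: lambda values: reduce(and_, values)}


def aggregate(flag):
    """Apply the action for `flag` to the columns carrying that flag, per row."""
    selected = [col for col, f in table_a.items() if f == flag]
    action = table_c[flag]
    return [(row["id"], action([row[col] for col in selected])) for row in table_b]
```

Here aggregate(1) and aggregate(2) give the same id/val pairs as the two result tables above.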

ORA-00905: missing keyword when using Case in order by

I have the query below: if the edit date is not null, the most recent record needs to be returned and it should also be randomized; otherwise the records should be randomized. I tried the order by below, but I am getting the missing keyword error.
SELECT * FROM ( SELECT c.id,c.edit_date, c.name,l.title
FROM tableA c, tableb l
WHERE c.id = l.id
AND c.published_ind = 'Y'
AND lc.type_id != 4
AND TRIM(c.img_file) IS NOT NULL
ORDER BY DBMS_RANDOM.VALUE
)
WHERE ROWNUM = 1
order by case when c.edit_date = 'null'
then DBMS_RANDOM.VALUE
else DBMS_RANDOM.VALUE, c.edit_date desc
end
If I understand you correctly, you are trying to get one record per ID with either the highest date (a random one if several records share that date) or a NULL date (again a random one if several NULL records exist for the same ID).
So assuming this data
ID EDIT_DATE TEXT
---------- ------------------- ----
1 01.01.2015 00:00:00 A
1 01.01.2016 00:00:00 B
1 01.01.2016 00:00:00 C
2 01.01.2015 00:00:00 D
2 01.01.2016 00:00:00 E
2 F
2 G
You expect either B or C for ID = 1 and either F or G for ID = 2.
This query does it.
The features used are ordering with NULLS FIRST and adding a random value as the last ordering column, to get a random result if all preceding columns are the same.
with dta as (
select 1 id, to_date('01012015','ddmmyyyy') edit_date, 'A' text from dual union all
select 1 id, to_date('01012016','ddmmyyyy') edit_date, 'B' text from dual union all
select 1 id, to_date('01012016','ddmmyyyy') edit_date, 'C' text from dual union all
select 2 id, to_date('01012015','ddmmyyyy') edit_date, 'D' text from dual union all
select 2 id, to_date('01012016','ddmmyyyy') edit_date, 'E' text from dual union all
select 2 id, NULL edit_date, 'F' text from dual union all
select 2 id, NULL edit_date, 'G' text from dual),
dta2 as (
select ID, EDIT_DATE, TEXT,
row_number() over (partition by ID order by edit_date DESC NULLS first, DBMS_RANDOM.VALUE) as rn
from dta)
select *
from dta2 where rn = 1
order by id
;
ID EDIT_DATE TEXT RN
---------- ------------------- ---- ----------
1 01.01.2016 00:00:00 B 1
2 F 1
Hopefully you can reuse the idea if you need a slightly different result.
The WHERE clause is always applied before ORDER BY. So in your query, WHERE ROWNUM = 1 is applied first, and only after that is order by case ... applied, to a single record.
Perhaps you need to add another subquery that executes the ORDER BY first, to get the rowset in the proper order, and only then applies WHERE ROWNUM = 1 to select a single row.
The clause ORDER BY ... DBMS_RANDOM.VALUE, c.edit_date looks strange. In fact, the recordset will be sorted by DBMS_RANDOM.VALUE, and only if several rows happen to have an equal DBMS_RANDOM.VALUE will they additionally be sorted by c.edit_date.
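To illustrate "sort first, then keep one row" with the date ahead of the random value in the sort key, here is a small sketch using SQLite through Python's sqlite3 module. It is an analogy under stated assumptions, not Oracle code: SQLite has no ROWNUM, so LIMIT 1 (which is applied after ORDER BY) stands in for the outer WHERE ROWNUM = 1, and RANDOM() stands in for DBMS_RANDOM.VALUE. The table dta and its rows are borrowed from the sample data in the other answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dta (id INTEGER, edit_date TEXT, text TEXT);
    INSERT INTO dta VALUES
        (1, '2015-01-01', 'A'),
        (1, '2016-01-01', 'B'),
        (1, '2016-01-01', 'C');
""")

# Sort first, then keep one row: the latest edit_date wins,
# and RANDOM() only decides between rows that share that date.
picked = conn.execute("""
    SELECT text
    FROM dta
    ORDER BY edit_date DESC, RANDOM()
    LIMIT 1
""").fetchone()[0]
```

picked is either B or C, never A. In Oracle the equivalent shape would be an inner query carrying the ORDER BY, wrapped in an outer query that filters on ROWNUM = 1.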

Find matching column data between two rows in the same table

I want to find the matching value between two rows in the same sqlite table. For example, if I have the following table:
rowid, col1, col2, col3
----- ---- ---- ----
1 5 3 1
2 3 6 9
3 9 12 5
So comparing row 1 and 2, I get the value 3.
Row 2 and 3 will give 9.
Row 3 and 1 will give 5.
There will always be one and only one matching value between any two rows in the table.
What is the correct sqlite query for this?
I hardcoded the values for the rows because I do not know how to declare variables in SQLite.
select t1.rowid as r1, t2.rowid as r2, t2.col as matchvalue from <yourtable> t1 join
(
select rowid, col1 col from <yourtable> where rowid = 3 union all
select rowid, col2 from <yourtable> where rowid = 3 union all
select rowid, col3 from <yourtable> where rowid = 3
) t2
on t2.col in (t1.col1, t1.col2, t1.col3)
and t1.rowid < t2.rowid -- you don't need this if you have two specific rows
and t1.rowid = 1
select col from
(
select rid, c1 as col from yourtable
union
select rid, c2 from yourtable
union
select rid, c3 from yourtable
) v
where rid in (3,2)
group by col
order by COUNT(*) desc
limit 1
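Both queries above hardcode the row ids. When the query is issued from a host language, bound parameters can take the place of variables. Below is a minimal sketch using Python's sqlite3 module with named placeholders; the function name matching_value and the table name yourtable are assumptions for this example, and the query is a variation on the first answer rather than a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourtable (col1 INTEGER, col2 INTEGER, col3 INTEGER);
    INSERT INTO yourtable (rowid, col1, col2, col3) VALUES
        (1, 5, 3, 1), (2, 3, 6, 9), (3, 9, 12, 5);
""")


def matching_value(conn, row_a, row_b):
    """Return the value shared by the two given rows, or None if there is none."""
    row = conn.execute("""
        SELECT a.col
        FROM (SELECT col1 AS col FROM yourtable WHERE rowid = :a UNION
              SELECT col2 FROM yourtable WHERE rowid = :a UNION
              SELECT col3 FROM yourtable WHERE rowid = :a) a
        JOIN yourtable b
          ON b.rowid = :b AND a.col IN (b.col1, b.col2, b.col3)
    """, {"a": row_a, "b": row_b}).fetchone()
    return row[0] if row else None
```

For the sample table, matching_value(conn, 1, 2) gives 3, matching_value(conn, 2, 3) gives 9, and matching_value(conn, 3, 1) gives 5.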

How to select row based on existence of value in other column

I realise the title to this question may be vague but I am not sure how to phrase it. I have the following table:
i_id option p_id
---- ------ ----
1 A 4
1 B 8
1 C 6
2 B 3
2 C 5
3 A 7
3 B 3
4 E 11
How do I select a row based on the value of the option column for each unique i_id: if 'C' exists, select the row, else select row with 'B' else with 'A' so that result set is:
i_id option p_id
---- ------ ----
1 C 6
2 C 5
3 B 3
select i_id, option, p_id
from (
select
i_id,
option,
p_id,
row_number() over (partition by i_id order by case option when 'C' then 0 when 'B' then 1 when 'A' then 2 end) takeme
from thetable
where option in ('A', 'B', 'C')
) foo
where takeme = 1
This will give you the values ordered by C, B, A, while removing any i_id record that does not have one of these values.
WITH ranked AS
(
SELECT i_id, [option], p_id
, ROW_NUMBER() OVER (PARTITION BY i_id ORDER BY CASE [option]
WHEN 'C' THEN 1
WHEN 'B' THEN 2
WHEN 'A' THEN 3
ELSE 4
END) AS rowNumber
FROM yourTable
WHERE [option] IN ('A', 'B', 'C')
)
SELECT r.i_id, r.[option], r.p_id
FROM ranked AS r
WHERE r.rowNumber = 1
create table t2 (
id int,
options varchar(1),
pid int
)
insert into t2 values(1, 'A', 4)
insert into t2 values(1, 'B', 8)
insert into t2 values(1, 'C', 6)
insert into t2 values(1, 'E', 7)
select t2.* from t2,
(select id, MAX(options) as op from t2
where options <> 'E'
group by id) t
where t2.id = t.id and t2.options = t.op
Well, I would suggest that this problem can be made easier if you can assign a numeric "score" to each letter, such that "better" letters have higher scores. Then you can use MAX to find, for each group, the row with the highest "score" for the option. Since 'A' < 'B' < 'C', we could cheat here and use option as the score, and thus:
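SELECT t1.i_id, t1.option, t1.p_id
FROM thetable t1
INNER JOIN (SELECT t2.i_id, MAX(t2.option) AS max_option
            FROM thetable t2
            WHERE t2.option IN ('A', 'B', 'C') -- ignore options outside the ranking
            GROUP BY t2.i_id) AS maximums
ON t1.i_id = maximums.i_id
AND t1.option = maximums.max_option -- keep only the best-scoring row per i_id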
This assumes that {i_id, option} is a natural key of the table (i.e., that no two rows will have the same combination of values for those two columns; or, alternatively, that you have a uniqueness constraint on that pair of columns).
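If the preferred order did not happen to match alphabetical order, the "score" could be spelled out explicitly. The sketch below shows that variant against an in-memory SQLite database via Python's sqlite3 module; the score values 3/2/1 and the CTE name scored are assumptions chosen for this example, and the data is the table from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE thetable (i_id INTEGER, "option" TEXT, p_id INTEGER);
    INSERT INTO thetable VALUES
        (1, 'A', 4), (1, 'B', 8), (1, 'C', 6),
        (2, 'B', 3), (2, 'C', 5),
        (3, 'A', 7), (3, 'B', 3),
        (4, 'E', 11);
""")

rows = conn.execute("""
    WITH scored AS (
        -- assign an explicit numeric score; higher means preferred
        SELECT i_id, "option", p_id,
               CASE "option" WHEN 'C' THEN 3 WHEN 'B' THEN 2 WHEN 'A' THEN 1 END AS score
        FROM thetable
        WHERE "option" IN ('A', 'B', 'C')
    )
    SELECT s.i_id, s."option", s.p_id
    FROM scored s
    JOIN (SELECT i_id, MAX(score) AS best FROM scored GROUP BY i_id) m
      ON s.i_id = m.i_id AND s.score = m.best
    ORDER BY s.i_id
""").fetchall()
```

rows matches the result set requested in the question, and i_id 4 drops out because it has none of the ranked options.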