Compare and find differences in two tables in Oracle - sql

I have 2 tables:
account: ID, ACC, AE_CCY, DRCR_IND, AMOUNT, MODULE
flex: ID, ACC, AE_CCY, DRCR_IND, AMOUNT, MODULE
I want to show differences comparing only by: AE_CCY, DRCR_IND, AMOUNT, MODULE and ACC by first 4 characters
Example:
ID ACC AE_CCY DRCR_IND AMOUNT MODULE
-- --------- ------ -------- ------ ------
1 734647674 USD D 100 OP
and in flex:
ID ACC AE_CCY DRCR_IND AMOUNT MODULE
-- --------- ------ -------- ------ ------
1 734647654 USD D 100 OP
2 734665474 USD D 100 OP
9 734611111 USD D 100 OP
IDs 2 and 9 should be shown as differences.
If I use FULL JOIN I get no differences, because substr(account.ACC,1,4) = substr(flex.ACC,1,4) holds for every row and the other fields are equal too. MINUS doesn't work because the IDs are different.

Do you mean you want to group by the first 4 characters of ACC, then diff them?
And, if not, why is Flex:ID=1 NOT a difference to account:ID=1, if ID=2 and ID=9 are, especially since it reads that ID is not a comparison field?

A brute-force set theory answer:
SELECT * FROM ACCOUNT
UNION
SELECT * FROM FLEX
MINUS
(SELECT * FROM ACCOUNT
INTERSECT
SELECT * FROM FLEX)

I think what you want is the full join with an additional condition. Something like:
select F.ID, F.AE_CCY, F.DRCR_IND, F.AMOUNT, F.MODULE, F.ACC
from account a join flex f
on substr(a.ACC,1,4) = substr(f.ACC,1,4)
where a.AE_CCY <> f.AE_CCY
or a.DRCR_IND <> f.DRCR_IND
or a.AMOUNT <> f.AMOUNT
or a.MODULE <> f.MODULE
or a.ACC <> f.ACC
This way, the join is still performed on the first 4 characters, but the where condition checks the entire field (as well as the other four).
Revised solution: This is something of a stab in the dark, but I'm wondering if what you're really looking for is a list of records that don't have a match in the other table. In that case, a full outer join might be the answer:
select coalesce(F.ID,a.ID) as ID,
coalesce(F.AE_CCY,a.AE_CCY) as AE_CCY,
coalesce(F.DRCR_IND,a.DRCR_IND) as DRCR_IND,
coalesce(F.AMOUNT,a.AMOUNT) as AMOUNT,
coalesce(F.MODULE,a.MODULE) as MODULE,
coalesce(F.ACC,a.ACC) as ACC
from account a full outer join flex f
on substr(a.ACC,1,4) = substr(f.ACC,1,4)
and a.AE_CCY = f.AE_CCY
and a.DRCR_IND = f.DRCR_IND
and a.AMOUNT = f.AMOUNT
and a.MODULE = f.MODULE
where a.id is null
or f.id is null
Third attempted solution: Thinking about it further, I think you're saying that you want each record from the first table to match exactly one record in the second table (and vice versa). That's a difficult problem because relational databases aren't really designed to work that way.
The solution below uses the full outer join again, to get only rows that don't appear in the other table. This time, we're adding ROW_NUMBER to assign a unique number to each member of a set of duplicate values found in either table. In the example from your comment, with 5 identical rows in one table and 1 of the same row in another, the rows in the first table will be numbered 1-5 and the row in the second will be numbered 1. Therefore, by adding that as a join condition, we ensure that each row has only one match. The one flaw in this design is that a perfect match on ACC is not guaranteed to take precedence over another value. Making that work would be quite a bit more difficult.
select coalesce(F.ID,a.ID) as ID,
coalesce(F.AE_CCY,a.AE_CCY) as AE_CCY,
coalesce(F.DRCR_IND,a.DRCR_IND) as DRCR_IND,
coalesce(F.AMOUNT,a.AMOUNT) as AMOUNT,
coalesce(F.MODULE,a.MODULE) as MODULE,
coalesce(F.ACC,a.ACC) as ACC
from (select a.*,
row_number()
over (partition by AE_CCY,DRCR_IND,AMOUNT,MODULE,substr(ACC,1,4)
order by acc) as rn
from account a) a
full outer join
(select f.*,
row_number()
over (partition by AE_CCY,DRCR_IND,AMOUNT,MODULE,substr(ACC,1,4)
order by acc) as rn
from flex f) f
on substr(a.ACC,1,4) = substr(f.ACC,1,4)
and a.AE_CCY = f.AE_CCY
and a.DRCR_IND = f.DRCR_IND
and a.AMOUNT = f.AMOUNT
and a.MODULE = f.MODULE
and a.RN = f.RN
where a.id is null
or f.id is null
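The one-to-one pairing idea behind the ROW_NUMBER join can be sketched outside the database as well. The following is a minimal Python sketch, not the answer's SQL: the helper names (compare_key, unmatched, make_row) are made up for illustration, and the rows are the sample data from the question held as plain dictionaries.

```python
from collections import defaultdict

def compare_key(row):
    # Fields the question compares on: first 4 characters of ACC plus the other four columns.
    return (row["acc"][:4], row["ae_ccy"], row["drcr_ind"], row["amount"], row["module"])

def unmatched(account, flex):
    """Pair rows one-to-one within each key group and return the leftovers."""
    groups = defaultdict(lambda: {"account": [], "flex": []})
    for source, rows in (("account", account), ("flex", flex)):
        for row in rows:
            groups[compare_key(row)][source].append(row)

    leftovers = []
    for sides in groups.values():
        # Sorting by ACC mirrors "order by acc" in the ROW_NUMBER window.
        a_rows = sorted(sides["account"], key=lambda r: r["acc"])
        f_rows = sorted(sides["flex"], key=lambda r: r["acc"])
        paired = min(len(a_rows), len(f_rows))
        leftovers += [("account", r["id"]) for r in a_rows[paired:]]
        leftovers += [("flex", r["id"]) for r in f_rows[paired:]]
    return leftovers

def make_row(row_id, acc):
    # Hypothetical helper for the sample data; the other columns are identical in the example.
    return {"id": row_id, "acc": acc, "ae_ccy": "USD", "drcr_ind": "D", "amount": 100, "module": "OP"}

account = [make_row(1, "734647674")]
flex = [make_row(1, "734647654"), make_row(2, "734665474"), make_row(9, "734611111")]

result = unmatched(account, flex)
print(result)  # [('flex', 1), ('flex', 2)]
```

On the sample data this reports flex IDs 1 and 2 rather than 2 and 9, because pairing in ACC order matches the account row with 734611111. That is the flaw described above: a closer match on ACC does not take precedence.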

I like to use:
SELECT min(which) which, id, ae_ccy, drcr_ind, amount, module, acc
FROM (SELECT DISTINCT 'account' which, id, ae_ccy, drcr_ind, amount, module,
substr(acc, 1, 4) acc
FROM ACCOUNT
UNION ALL
SELECT DISTINCT 'flex' which, id, ae_ccy, drcr_ind, amount, module,
substr(acc, 1, 4) acc
FROM flex)
GROUP BY id, ae_ccy, drcr_ind, amount, module, acc
HAVING COUNT(*) != 2
ORDER BY id, 1
It will show the new rows, the old missing rows, and any differences.

Related

How to find combination of intersection from many tables?

I have a list of different channels that could potentially bring users to a website (organic, SEO, online marketing, etc.). I would like to find an efficient way to count the daily active users that come from each combination of these channels. Each channel has its own table and tracks its respective users.
The tables look like the following:
channel A
date user_id
2020-08-01 A
2020-08-01 B
2020-08-01 C
channel B
date user_id
2020-08-01 C
2020-08-01 D
2020-08-01 G
channel C
date user_id
2020-08-01 A
2020-08-01 C
2020-08-01 F
I want to know the following combinations
Only visit channel A
Only visit channel A & B
Only visit channel B & C
Only visit channel B
etc.
However, when there are a lot of channels (I have around 8), the number of combinations becomes large. What I've done so far is roughly as simple as this (this one covers the combinations that include channel A):
SELECT
a.date,
COUNT(DISTINCT IF(b.user_id IS NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a,
COUNT(DISTINCT IF(b.user_id IS NOT NULL AND c.user_id IS NULL, a.user_id, NULL)) AS dau_a_b,
...
FROM a LEFT JOIN b ON a.user_id = b.user_id AND a.date = b.date
LEFT JOIN c ON a.user_id = c.user_id AND a.date = c.date
GROUP BY 1
but this is extremely tedious when the total number of channels is 8 (28 variations for combinations of 2, 56 for 3, 70 for 4, and many more).
Any smart ideas to solve this? I was thinking of using FULL OUTER JOIN but can't quite get the hang of it. Answers are really appreciated.
I would approach this with union all and two levels of aggregation:
select date, channels, count(*) as num_users
from (select date, user_id, string_agg(channel order by channel) as channels
from ((select distinct date, user_id, 'a' as channel from a) union all
(select distinct date, user_id, 'b' as channel from b) union all
(select distinct date, user_id, 'c' as channel from c)
) abc
group by date, user_id
) c
group by date, channels;
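The two levels of aggregation can be illustrated in plain Python. This is only a sketch under assumptions: the three channel tables are held in a dictionary named channel_tables (a made-up name) with the sample rows from the question, and count_by_combination is a hypothetical helper, not part of any library.

```python
from collections import Counter, defaultdict

# Sample rows from the question: (date, user_id) per channel table.
channel_tables = {
    "A": [("2020-08-01", "A"), ("2020-08-01", "B"), ("2020-08-01", "C")],
    "B": [("2020-08-01", "C"), ("2020-08-01", "D"), ("2020-08-01", "G")],
    "C": [("2020-08-01", "A"), ("2020-08-01", "C"), ("2020-08-01", "F")],
}

def count_by_combination(tables):
    # Level 1: collect the set of channels each (date, user) appears in.
    # Using a set plays the role of SELECT DISTINCT in the SQL.
    visited = defaultdict(set)
    for channel, rows in tables.items():
        for date, user_id in rows:
            visited[(date, user_id)].add(channel)

    # Level 2: count users per (date, exact channel combination).
    counts = Counter()
    for (date, _user_id), channels in visited.items():
        counts[(date, " & ".join(sorted(channels)))] += 1
    return counts

dau = count_by_combination(channel_tables)
for (date, combination), users in sorted(dau.items()):
    print(date, combination, users)
```

Only combinations that actually occur are produced, so nothing has to be enumerated up front, however many channels there are.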
The question says: "However, when there are a lot of channels (I have around 8 channels) the combination is a lot ... extremely tedious when the total channels is 8 (28 variations for 2 combinations, 56 for 3, 70 for 4, and many more). Any smart ideas to solve this?"
The query below is for BigQuery Standard SQL and addresses exactly that aspect of the OP's concerns:
#standardSQL
CREATE TEMP FUNCTION generate_combinations(a ARRAY<INT64>)
RETURNS ARRAY<STRING>
LANGUAGE js AS '''
var combine = function(a) {
var fn = function(n, src, got, all) {
if (n == 0) {
if (got.length > 0) {
all[all.length] = got;
} return;
}
for (var j = 0; j < src.length; j++) {
fn(n - 1, src.slice(j + 1), got.concat([src[j]]), all);
} return;
}
var all = []; for (var i = 1; i < a.length; i++) {
fn(i, a, [], all);
}
all.push(a);
return all;
}
return combine(a)
''';
with users as (
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
), visits as (
select date, user_id,
string_agg(channel, ' & ' order by channel) combination
from users
group by date, user_id
), channels AS (
select channel, cast(row_number() over(order by channel) as string) channel_num
from (select distinct channel from users)
), combinations as (
select string_agg(channel, ' & ' order by channel_num) combination
from unnest(generate_combinations(generate_array(1,(select count(1) from channels)))) AS items,
unnest(split(items)) AS channel_num
join channels using(channel_num)
group by items
)
select date,
combination as channels_visited_only,
count(distinct user_id) dau
from visits
join combinations using (combination)
group by date, combination
order by combination
Applied to the sample data from your question, the output is
Some explanations to help with using the above:
CTE users simply unions all the tables and adds a channel column to distinguish which table each row came from
CTE visits extracts the list of all visited channels for each user-date combination
CTE channels simply prepares the list of channels and assigns each one a number for later use
CTE combinations uses the JS UDF to generate all combinations of channel numbers and then joins them back to channels to build the channel combinations
the final SELECT statement simply looks for those users whose list of visited channels matches a channel combination generated in the previous step
Some recommendations for further streamlining the above code:
assuming your channel table names follow the channel_* pattern,
you can use the wildcard tables feature in the users CTE and, instead of
select distinct date, user_id, 'A' channel from channel_A union all
select distinct date, user_id, 'B' from channel_B union all
select distinct date, user_id, 'C' from channel_C
you can use something like the line below, so just one line instead of as many lines as you have channels
select distinct date, user_id, _TABLE_SUFFIX as channel from channel_*
I think you could use set operators to answer your questions: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#set_operators
E.g.
"Only visit channel A" is (A except B) except C
"Only visit channel A & B" is A intersect B
etc.
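The set-operator idea maps directly onto Python sets, which makes the semantics easy to check on the sample data. A small sketch, assuming a single date so that each channel table reduces to a set of user IDs (the variable names are made up for illustration):

```python
# User IDs per channel for 2020-08-01, taken from the sample tables in the question.
a = {"A", "B", "C"}
b = {"C", "D", "G"}
c = {"A", "C", "F"}

# EXCEPT corresponds to set difference, INTERSECT to set intersection.
only_a = a - b - c            # visited A and nothing else
only_a_and_b = (a & b) - c    # visited A and B, but not C
only_a_and_c = (a & c) - b    # visited A and C, but not B
all_three = a & b & c         # visited every channel

print(only_a, only_a_and_b, only_a_and_c, all_three)
```

Note that "only A & B" needs the final difference against C; a bare intersection of A and B would also include users who visited all three channels.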
I am thinking full join and aggregation:
select date, a.channel_a, b.channel_b, c.channel_c, count(*) cnt
from (select 'a' channel_a, a.* from channel_a a) a
full join (select 'b' channel_b, b.* from channel_b b) b using (date, user_id)
full join (select 'c' channel_c, c.* from channel_c c) c using (date, user_id)
group by date, a.channel_a, b.channel_b, c.channel_c

SQL query to join same table multiple times

I have a scenario where I need to join the same table multiple times to get the desired output. For example, I have two tables, TABLE A and TABLE B.
Step 1: I want to take all the parties from TABLE A which have
the lowest idate. The lowest idate is determined from the partyid and idate
columns.
Step 2: Then, based on the CID fetched from TABLE A in step 1,
we need to fetch the corresponding MID from TABLE B which has
MIDTYPE=130300.
Step 3: Then, based on the MID fetched in step 2, we need to traverse
the same table, find the latest record for the same MID based
on idate in TABLE B, and fetch the corresponding CID for that MID.
Step 4: Now, for that CID, we need to fetch the MID value for MIDTYPE
130307 in the same table (TABLE B). My final output should be the combination of the MID
fetched in step 3 and the MID fetched for 130307 in step 4.
I wrote a query like this, but it takes a long time to run because we go through the same table (TABLE B) multiple times and TABLE B has millions of rows. Is there any way to rewrite this query differently? Could someone help me with this?
SELECT
ident.mid mid1,
b.mid mid2
FROM
(
SELECT
*
FROM
tableb
WHERE
midtype = 130307
) ident
INNER JOIN (
SELECT
s.cid,
s.mid,
s.midtype
FROM
(
SELECT
cid,
partyid,
admin_sys_tp_cd,
mid,
ilast
FROM
(
SELECT
cq.cid,
RANK() OVER(
PARTITION BY cq.partyid
ORDER BY
cq.idate ASC
) rnk,
cq.idate,
cq.partyid,
i.mid,
i.idate AS ilast
FROM
tablea cq
INNER JOIN tableb i ON cq.cid = i.cid
INNER JOIN tablec c ON i.cid = c.cid
WHERE
i.midtype = 130300
)
WHERE
rnk = 1
) a
INNER JOIN (
SELECT
*
FROM
(
SELECT
cid,
mid,
midtype,
RANK() OVER(
PARTITION BY mid
ORDER BY
idate DESC
) rnk_mpid
FROM
tableb
)
WHERE
rnk_mpid = 1
) s ON a.mid = s.mid
AND s.midtype = 130300
) b ON ident.cid = b.cid
AND ident.midtype = 130307
This is not what you asked, but before others and I spend time trying out different approaches for you, let's make sure the basics are covered.
No matter how you write the SQL query, it will never perform well against a table with millions of rows if you don't have the proper indexes for it, especially in your case, as you have to access the table at least 3 times.
Just from looking at your detailed steps, I would say that you should have at least 3 different indexes to support this query:
TableA_Index1 (PARTYID, IDATE, INCLUDES CID)
TableB_Index1 (CID, MIDTYPE, INCLUDES MID)
TableB_Index2 (MID, IDATE, INCLUDES CID)
Do you have them ?
Have you ever tried to run this query on db2-advisor (db2advis) to get recommended indexes for it ?

Find duplicates in MS SQL table

I know that this question has been asked several times but I still cannot figure out why my query is returning values which are not duplicates. I want my query to return only the records which have identical value in the column Credit. The query executes without any errors but values which are not duplicated are also being returned. This is my query:
Select
_bvGLTransactionsFull.AccountDesc,
_bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate,
_bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit,
_bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName
From
_bvGLAccountsFinancial Inner Join
_bvGLTransactionsFull On _bvGLAccountsFinancial.AccountLink =
_bvGLTransactionsFull.AccountLink
Where
_bvGLTransactionsFull.Credit
IN
(SELECT Credit AS NumOccurrences
FROM _bvGLTransactionsFull
GROUP BY Credit
HAVING (COUNT(Credit) > 1 ) )
Group By
_bvGLTransactionsFull.AccountDesc, _bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate, _bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit, _bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName, _bvGLAccountsFinancial.Master_Sub_Account,
IsNumeric(_bvGLTransactionsFull.Reference), _bvGLTransactionsFull.TrCode
Having
_bvGLTransactionsFull.TxDate > 01 / 11 / 2014 And
_bvGLTransactionsFull.Reference Like '5_____' And
_bvGLTransactionsFull.Credit > 0.01 And
_bvGLAccountsFinancial.Master_Sub_Account = '90210'
That's because you're matching on the credit field back to your table, which contains duplicates. You need to isolate the rows that are duplicated with ROW_NUMBER:
;WITH CTE AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY CREDIT ORDER BY (SELECT NULL)) AS RN
FROM _bvGLTransactionsFull)
Select
CTE.AccountDesc,
_bvGLAccountsFinancial.Description,
CTE.TxDate,
CTE.Description,
CTE.Credit,
CTE.Reference,
CTE.UserName
From
_bvGLAccountsFinancial Inner Join
CTE On _bvGLAccountsFinancial.AccountLink = CTE.AccountLink
WHERE CTE.RN > 1
Group By
CTE.AccountDesc, _bvGLAccountsFinancial.Description,
CTE.TxDate, CTE.Description,
CTE.Credit, CTE.Reference,
CTE.UserName, _bvGLAccountsFinancial.Master_Sub_Account,
IsNumeric(CTE.Reference), CTE.TrCode
Having
CTE.TxDate > 01 / 11 / 2014 And
CTE.Reference Like '5_____' And
CTE.Credit > 0.01 And
_bvGLAccountsFinancial.Master_Sub_Account = '90210'
Just as a side note, I would consider using aliases to shorten your queries and make them more readable. Prefixing the table name before each column in a join is very difficult to read.
I trust that your code extracts all the data per your criteria, so let me take a different approach and keep your script as-is. First, let's store all the records in a temp table.
Select
_bvGLTransactionsFull.AccountDesc,
_bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate,
_bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit,
_bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName
-- temp table
INTO #tmpTable
From
_bvGLAccountsFinancial Inner Join
_bvGLTransactionsFull On _bvGLAccountsFinancial.AccountLink =
_bvGLTransactionsFull.AccountLink
Where
_bvGLTransactionsFull.Credit
IN
(SELECT Credit AS NumOccurrences
FROM _bvGLTransactionsFull
GROUP BY Credit
HAVING (COUNT(Credit) > 1 ) )
Group By
_bvGLTransactionsFull.AccountDesc, _bvGLAccountsFinancial.Description,
_bvGLTransactionsFull.TxDate, _bvGLTransactionsFull.Description,
_bvGLTransactionsFull.Credit, _bvGLTransactionsFull.Reference,
_bvGLTransactionsFull.UserName, _bvGLAccountsFinancial.Master_Sub_Account,
IsNumeric(_bvGLTransactionsFull.Reference), _bvGLTransactionsFull.TrCode
Having
_bvGLTransactionsFull.TxDate > 01 / 11 / 2014 And
_bvGLTransactionsFull.Reference Like '5_____' And
_bvGLTransactionsFull.Credit > 0.01 And
_bvGLAccountsFinancial.Master_Sub_Account = '90210'
Then remove the "single occurrence" data by creating a row index and removing every row whose index is 1.
SELECT * FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY Credit) AS rowIdx
, *
FROM #tmpTable) AS innerTmp
WHERE
rowIdx != 1
You can change the grouping through PARTITION BY <column name>.
Should you have any concerns, please raise them first, as this is how I have understood your case so far.
EDIT: To include all rows for the credits that have duplicates:
SELECT
tmp1.*
FROM #tmpTable tmp1
RIGHT JOIN (
SELECT
Credit
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY Credit ORDER BY Credit) AS rowIdx
, *
FROM #tmpTable) AS innerTmp
WHERE
rowIdx != 1
) AS tmp2
ON tmp1.Credit = tmp2.Credit
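The difference between the two filters above (row index not equal to 1 versus joining back on the duplicated credits) is easy to see on a toy list. A minimal Python sketch with made-up credit values, not taken from the question's data:

```python
from collections import Counter

# Hypothetical Credit values, in table order.
credits = [100.0, 250.0, 100.0, 75.5, 250.0, 100.0]

# Equivalent of joining back on duplicated credits:
# keep every row whose Credit occurs more than once.
counts = Counter(credits)
all_duplicated = [c for c in credits if counts[c] > 1]

# Equivalent of "row index != 1": the first occurrence
# of each Credit is dropped, only the repeats remain.
seen = set()
repeats_only = []
for c in credits:
    if c in seen:
        repeats_only.append(c)
    seen.add(c)

print(all_duplicated)  # [100.0, 250.0, 100.0, 250.0, 100.0]
print(repeats_only)    # [100.0, 250.0, 100.0]
```

In both cases the value 75.5, which occurs once, is excluded; the two filters differ only in whether the first row of each duplicated group is kept.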

Recursive calculation in SQL (Oracle)

I'm having a tough time finding a solution to ETL some data into my resulting table. I think I cannot accomplish this using pure SQL and need to use PL/SQL due to the looping. Could the SQL gurus point me in the right direction or provide some pointers to solve this problem?
Here's the scenario:
Tables: TABLEA and TABLEB.
Steps:
1. Group records in TABLEA by A_CD and SUM the A_AMT field. (Let's assume A_FLAG is always the same for any A_CD.) Let's call the grouped result set TABLEA_GRP (this is not a table, it is a grouped query).
2. Pick a row from TABLEB. If B_FLAG is 'N', pick all rows in TABLEA_GRP where A_FLAG is 'N'. If B_FLAG is 'Y', pick all rows in TABLEA_GRP.
3. Starting with the first of the rows picked in step 2, calculate the ratio of its TOTAL_AMT to the SUM of all TOTAL_AMT for the selected rows. Multiply the ratio by B_AMT, add the resulting amount to the row's TOTAL_AMT, and store it in RESULTING_AMT. Repeat this calculation for all rows picked in step 2.
4. Repeat steps 2 and 3, now taking the starting TOTAL_AMT value from the RESULTING_AMT value of the previous calculation for the same A_CD.
The RESULTING_RATIO field does not need to be saved; it is given just for demo purposes. How would you do this?
Basically I want to get data in RESULTING_TABLE from TABLEA and TABLEB
Could anyone help? Thanks a lot in advance for any guidance.
EDIT: I added A_DATE and B_DATE to support a join between the two tables. For simplicity you can just use A.A_DATE = B.B_DATE, as in this basic join:
SELECT
A.A_CD,
SUM(A.A_AMT) AS TOTAL_AMT,
A.A_FLAG,
A.A_DATE,
B.B_ID,
B.B_AMT,
B.B_FLAG
FROM
TABLEA A
JOIN TABLEB B
ON A.A_DATE = B.B_DATE
GROUP BY
A.A_CD,
A.A_FLAG,
A.A_DATE,
B.B_ID,
B.B_AMT,
B.B_FLAG
;
Okay I think I've got the solution. The numbers are a bit different to yours, but I'm fairly sure mine is doing what you want. We can do everything in steps 1 & 2 using a single query (main_sql). 3 and 4 have to be done using a recursive statement (recur_sql).
with main_sql as (
select a.*,
b.*,
sum(a_amt) over (partition by b_id) as cd_amt,
rank() over (partition by a_cd order by b_id) as rnk
from (select a_cd, a_flag, sum(a_amt) as a_amt
from tablea
group by a_cd, a_flag) a,
tableb b
where a.a_flag = case when b.b_flag = 'Y' then a.a_flag else b.b_flag end
order by b_id, a_cd
),
recur_sql (a_cd, b_id, total_amt, cd_amt, resulting_ratio, resulting_amt, rnk) as (
select m.a_cd,
m.b_id,
m.a_amt as total_amt,
m.cd_amt, m.a_amt / m.cd_amt as resulting_ratio,
m.a_amt + (m.a_amt / m.cd_amt * m.b_amt) as resulting_amt,
rnk
from main_sql m
where rnk = 1
union all
select m.a_cd,
m.b_id,
r.resulting_amt as total_amt,
m.cd_amt,
r.resulting_amt / m.cd_amt as resulting_ratio,
r.resulting_amt + (r.resulting_amt / m.cd_amt * m.b_amt) as resulting_amt,
m.rnk
from recur_sql r,
main_sql m
where m.rnk > 1
and r.a_cd = m.a_cd
and m.rnk - 1 = r.rnk
)
select a_cd, b_id, total_amt, resulting_ratio, resulting_amt
from recur_sql
order by 2, 1
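For comparison, the iterative allocation can be written as a plain loop. The sketch below follows one reading of the question's four steps and is not a translation of the query above: in particular it recomputes the sum of the selected totals for every TABLEB row, so its numbers may differ from the SQL. The function name allocate and the sample values are made up for illustration.

```python
def allocate(groups, b_rows):
    """groups: {a_cd: (total_amt, a_flag)}, i.e. TABLEA already grouped by A_CD.
    b_rows: list of (b_id, b_amt, b_flag) in processing order.
    Assumes the selected totals never sum to zero."""
    totals = {cd: amt for cd, (amt, _flag) in groups.items()}
    history = []
    for b_id, b_amt, b_flag in b_rows:
        # Step 2: 'Y' selects every group, otherwise only groups flagged 'N'.
        picked = [cd for cd, (_amt, flag) in groups.items()
                  if b_flag == "Y" or flag == "N"]
        base = sum(totals[cd] for cd in picked)
        # Step 3: compute every ratio against the same base before updating.
        updates = {}
        for cd in picked:
            ratio = totals[cd] / base
            updates[cd] = totals[cd] + ratio * b_amt
            history.append((cd, b_id, totals[cd], ratio, updates[cd]))
        # Step 4: the resulting amounts become the next starting totals.
        totals.update(updates)
    return history, totals

# Hypothetical data: two A_CD groups and two TABLEB rows.
groups = {"X": (100.0, "N"), "Y": (300.0, "Y")}
b_rows = [(1, 40.0, "Y"), (2, 10.0, "N")]

history, totals = allocate(groups, b_rows)
print(totals)  # {'X': 120.0, 'Y': 330.0}
```

Here the first TABLEB row spreads 40 over both groups in a 1:3 ratio, and the second row adds its full 10 to group X, the only group flagged 'N'.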

Optimization of multiple aggregate sorting in SQL

I have a Postgres query written for a Spree Commerce store that sorts all of its products in the following order: In stock (then first available), Backorder (then first available), Sold out (then first available).
In order to chain it with Rails scopes I had to put it in the ORDER BY clause as opposed to anywhere else. The query itself works, and is fairly performant, but complex. I was curious if anyone with a bit more knowledge could discuss a better way to do it? I'm interested in performance, but also in different ways to approach the problem.
ORDER BY (
SELECT
CASE
WHEN tt.count_on_hand > 0
THEN 2
WHEN zz.backorderable = true
THEN 1
ELSE 0
END
FROM (
SELECT
row_number() OVER (dpartition),
z.id,
bool_or(backorderable) OVER (dpartition) as backorderable
FROM (
SELECT DISTINCT ON (spree_variants.id) spree_products.id, spree_stock_items.backorderable as backorderable
FROM spree_products
JOIN "spree_variants" ON "spree_variants"."product_id" = "spree_products"."id" AND "spree_variants"."deleted_at" IS NULL
JOIN "spree_stock_items" ON "spree_stock_items"."variant_id" = "spree_variants"."id" AND "spree_stock_items"."deleted_at" IS NULL
JOIN "spree_stock_locations" ON spree_stock_locations.id=spree_stock_items.stock_location_id
WHERE spree_stock_locations.active = true
) z window dpartition as (PARTITION by id)
) zz
JOIN (
SELECT
row_number() OVER (dpartition),
t.id,
sum(count_on_hand) OVER (dpartition) as count_on_hand
FROM (
SELECT DISTINCT ON (spree_variants.id) spree_products.id, spree_stock_items.count_on_hand as count_on_hand
FROM spree_products
JOIN "spree_variants" ON "spree_variants"."product_id" = "spree_products"."id" AND "spree_variants"."deleted_at" IS NULL
JOIN "spree_stock_items" ON "spree_stock_items"."variant_id" = "spree_variants"."id" AND "spree_stock_items"."deleted_at" IS NULL
) t window dpartition as (PARTITION by id)
) tt ON tt.row_number = 1 AND tt.id = spree_products.id
WHERE zz.row_number = 1 AND zz.id=spree_products.id
) DESC, available_on DESC
The FROM shown above determines whether or not a product is backorderable, and the JOIN shown above determines the stock in inventory. Note that these are very similar queries, except that I need to determine whether something is backorderable based on a location's ability to support backorders and its state, WHERE spree_stock_locations.active=true.
Thanks for any advice!
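To make the intended ordering explicit, the ranking in the CASE expression can be modelled as a sort key. This is only an illustrative Python sketch with made-up product rows, assuming count_on_hand has already been summed per product and backorderable already OR-ed across active stock locations:

```python
from datetime import date

def stock_rank(count_on_hand, backorderable):
    # Same ranking as the CASE expression: in stock = 2, backorder = 1, sold out = 0.
    if count_on_hand > 0:
        return 2
    if backorderable:
        return 1
    return 0

# Hypothetical, already-aggregated product rows.
products = [
    {"name": "sold_out", "count_on_hand": 0, "backorderable": False, "available_on": date(2015, 4, 1)},
    {"name": "in_stock_old", "count_on_hand": 5, "backorderable": False, "available_on": date(2015, 1, 1)},
    {"name": "backorder", "count_on_hand": 0, "backorderable": True, "available_on": date(2015, 3, 1)},
    {"name": "in_stock_new", "count_on_hand": 2, "backorderable": True, "available_on": date(2015, 2, 1)},
]

# ORDER BY rank DESC, available_on DESC becomes a descending sort on a tuple key.
ordered = sorted(
    products,
    key=lambda p: (stock_rank(p["count_on_hand"], p["backorderable"]), p["available_on"]),
    reverse=True,
)
names = [p["name"] for p in ordered]
print(names)  # ['in_stock_new', 'in_stock_old', 'backorder', 'sold_out']
```

Seen this way, the query only needs two aggregated values per product, which suggests both could come from a single grouped subquery rather than two nearly identical ones.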