Update statement using a WHERE clause that contains columns with null values - SQL

I am updating a column on one table using data from another table. The WHERE clause is based on multiple columns, and some of those columns are null. From my thinking, these nulls are what are throwing off the standard UPDATE TABLE SET X=Y WHERE A=B statement.
See this SQL Fiddle of the two tables, where I am trying to update table_one based on data from table_two.
My query currently looks like this:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.invoice_number = table_two.invoice_number AND
table_one.submitted_by = table_two.submitted_by AND
table_one.passport_number = table_two.passport_number AND
table_one.driving_license_number = table_two.driving_license_number AND
table_one.national_id_number = table_two.national_id_number AND
table_one.tax_pin_identification_number = table_two.tax_pin_identification_number AND
table_one.vat_number = table_two.vat_number AND
table_one.ggcg_number = table_two.ggcg_number AND
table_one.national_association_number = table_two.national_association_number
The query fails for some rows: table_one.x isn't updated when any of the columns in either table are null, i.e. it is only updated when all columns have some data.
This question is related to my earlier one here on SO where I was getting distinct values from a large data set using Distinct On. What I now want is to populate the large data set with a value from the table that has unique fields.
UPDATE
I used the first update statement provided by @binoternary. For small tables, it runs in a flash; for example, one table with 20,000 records was updated in about 20 seconds. But another table with 9 million plus records has been running for 20 hours so far! See below for the output of the EXPLAIN function:
Update on table_one  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
  ->  Nested Loop  (cost=0.00..210634237338.87 rows=13615011125 width=1996)
        Join Filter: ((((my_update_statement_here))))
        ->  Seq Scan on table_one  (cost=0.00..610872.62 rows=9661262 width=1986)
        ->  Seq Scan on table_two  (cost=0.00..6051.98 rows=299998 width=148)
The EXPLAIN ANALYZE option also took forever, so I canceled it.
Any ideas on how to make this type of update faster, even if it means using a different update statement or a custom function to loop through and do the update?

Since null = null does not evaluate to true (the comparison yields null), you need to check whether two fields are both null in addition to the equality check:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
(table_one.invoice_number = table_two.invoice_number
 OR (table_one.invoice_number is null AND table_two.invoice_number is null))
AND
(table_one.submitted_by = table_two.submitted_by
 OR (table_one.submitted_by is null AND table_two.submitted_by is null))
AND
-- etc
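As an aside, PostgreSQL also has a dedicated null-safe comparison operator, IS NOT DISTINCT FROM, which treats two nulls as equal. A minimal sketch of the same update using it (note that the planner generally cannot use a plain index for this operator):
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.invoice_number IS NOT DISTINCT FROM table_two.invoice_number
AND table_one.submitted_by IS NOT DISTINCT FROM table_two.submitted_by
AND -- etc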
You could also use the coalesce function, which is more readable:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc
But you need to be careful about the default values (the last argument to coalesce).
Their data type should match the column type (so that you don't end up comparing dates with numbers, for example), and the default should be a value that doesn't appear in the data.
E.g. coalesce(null, 1) = coalesce(1, 1) is a situation you'd want to avoid.
Update (regarding performance):
Seq Scan on table_two - this suggests that you don't have any indexes on table_two.
So if you update a row in table_one, then to find a matching row in table_two the database basically has to scan through all the rows one by one until it finds a match.
The matching rows could be found much faster if the relevant columns were indexed.
On the flip side, if table_one has any indexes, then they slow down the update.
According to this performance guide:
Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.
Another suggestion from the same guide that might be helpful is:
If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.
So, for example, if table_one has an id column you could add something like
and table_one.id between x and y
to the where condition and run the query several times changing the values of x and y so that all rows are covered.
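For instance, here is a minimal sketch of one batch, assuming table_one has an integer id column (the bounds 0 and 100000 are illustrative; rerun the statement with the window shifted until all ids are covered):
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.id > 0 AND table_one.id <= 100000
AND coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc (the remaining null-safe comparisons from above)
Each run then commits on its own, which keeps the individual transactions small.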
The EXPLAIN ANALYZE option also took forever
You might want to be careful when using the ANALYZE option with EXPLAIN when dealing with statements that have side effects.
According to documentation:
Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.
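If you still want EXPLAIN ANALYZE timings for an UPDATE without keeping its effects, one common pattern is to wrap it in a transaction and roll back; a sketch:
BEGIN;
EXPLAIN ANALYZE
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, ''); -- etc
ROLLBACK; -- discards the update; the plan and timings have already been printed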

Try the query below; it is similar to @binoternary's answer above. They just beat me to the answer.
update table_one
set column_x = (select column_y from table_two
    where ((table_two.invoice_number = table_one.invoice_number) OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
    and ((table_two.submitted_by = table_one.submitted_by) OR (table_two.submitted_by IS NULL AND table_one.submitted_by IS NULL))
    and ((table_two.passport_number = table_one.passport_number) OR (table_two.passport_number IS NULL AND table_one.passport_number IS NULL))
    and ((table_two.driving_license_number = table_one.driving_license_number) OR (table_two.driving_license_number IS NULL AND table_one.driving_license_number IS NULL))
    and ((table_two.national_id_number = table_one.national_id_number) OR (table_two.national_id_number IS NULL AND table_one.national_id_number IS NULL))
    and ((table_two.tax_pin_identification_number = table_one.tax_pin_identification_number) OR (table_two.tax_pin_identification_number IS NULL AND table_one.tax_pin_identification_number IS NULL))
    and ((table_two.vat_number = table_one.vat_number) OR (table_two.vat_number IS NULL AND table_one.vat_number IS NULL))
    and ((table_two.ggcg_number = table_one.ggcg_number) OR (table_two.ggcg_number IS NULL AND table_one.ggcg_number IS NULL))
    and ((table_two.national_association_number = table_one.national_association_number) OR (table_two.national_association_number IS NULL AND table_one.national_association_number IS NULL))
);
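One caveat with this scalar-subquery form: rows in table_one that have no match in table_two get column_x set to NULL, whereas the UPDATE ... FROM form leaves them untouched. A sketch of a guard that repeats the same null-safe conditions in an EXISTS clause:
update table_one
set column_x = (select column_y from table_two
    where ((table_two.invoice_number = table_one.invoice_number) OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
    -- etc, the remaining conditions as above
)
where exists (select 1 from table_two
    where ((table_two.invoice_number = table_one.invoice_number) OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
    -- etc, the remaining conditions as above
);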

You can use a null check function like Oracle's NVL.
For Postgres, you will have to use coalesce.
i.e. your query can look like:
UPDATE table_one SET x = (select table_two.y from table_two
WHERE
coalesce(table_one.invoice_number, table_two.invoice_number, '') = coalesce(table_two.invoice_number, table_one.invoice_number, '')
AND
coalesce(table_one.submitted_by, table_two.submitted_by, '') = coalesce(table_two.submitted_by, table_one.submitted_by, ''))
where table_one.table_one_pk in (select table_one.table_one_pk from table_one, table_two
WHERE
coalesce(table_one.invoice_number, table_two.invoice_number, '') = coalesce(table_two.invoice_number, table_one.invoice_number, '')
AND
coalesce(table_one.submitted_by, table_two.submitted_by, '') = coalesce(table_two.submitted_by, table_one.submitted_by, ''));

Your current query joins the two tables using a Nested Loop, which means that the server processes
9,661,262 * 299,998 = 2,898,359,277,476
rows. No wonder it takes forever.
To make the join efficient you need an index on all joined columns. The problem is NULL values.
If you apply a function to the joined columns, a plain index on those columns generally can't be used.
If you use an expression like this in the JOIN:
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
an index can't be used.
So, we need an index, and we need to do something with NULL values to make the index usable.
We don't need to make any changes in table_one, because it has to be scanned in full in any case.
But, table_two definitely can be improved. Either change the table itself, or create a separate (temporary) table. It has only 300K rows, so it should not be a problem.
Make all columns that are used in the JOIN NOT NULL.
CREATE TABLE table_two (
id int4 NOT NULL,
invoice_number varchar(30) NOT NULL,
submitted_by varchar(20) NOT NULL,
passport_number varchar(30) NOT NULL,
driving_license_number varchar(30) NOT NULL,
national_id_number varchar(30) NOT NULL,
tax_pin_identification_number varchar(30) NOT NULL,
vat_number varchar(30) NOT NULL,
ggcg_number varchar(30) NOT NULL,
national_association_number varchar(30) NOT NULL,
column_y int,
CONSTRAINT table_two_pkey PRIMARY KEY (id)
);
Update the table and replace NULL values with '', or some other appropriate value.
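If you go the separate-table route, the NULL replacement can happen during the copy. A sketch, assuming the new table (here called table_two_clean, an illustrative name) was created with the DDL above:
INSERT INTO table_two_clean
SELECT id,
       coalesce(invoice_number, ''),
       coalesce(submitted_by, ''),
       coalesce(passport_number, ''),
       coalesce(driving_license_number, ''),
       coalesce(national_id_number, ''),
       coalesce(tax_pin_identification_number, ''),
       coalesce(vat_number, ''),
       coalesce(ggcg_number, ''),
       coalesce(national_association_number, ''),
       column_y
FROM table_two;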
Create an index on all columns that are used in the JOIN, plus column_y. column_y has to be included last in the index. I assume that your UPDATE is well-formed, so the index should be unique.
CREATE UNIQUE INDEX IX ON table_two
(
invoice_number,
submitted_by,
passport_number,
driving_license_number,
national_id_number,
tax_pin_identification_number,
vat_number,
ggcg_number,
national_association_number,
column_y
);
The query will become
UPDATE table_one SET x = table_two.column_y
FROM table_two
WHERE
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
Note that COALESCE is used only on the table_one columns; the table_two side no longer contains NULLs.
It is also a good idea to do UPDATE in batches, rather than the whole table at once. For example, pick a range of ids to update in a batch.
UPDATE table_one SET x = table_two.column_y
FROM table_two
WHERE
table_one.id >= <some_starting_value> AND
table_one.id < <some_ending_value> AND
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number

You can use the coalesce function, which returns the first non-null argument passed to it. A null check function will help you here.
Null-related functions here.
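A quick illustration of what coalesce actually does (it returns the first non-null argument, not a boolean):
SELECT coalesce(NULL, NULL, 'fallback');  -- returns 'fallback'
SELECT coalesce('first', 'second');       -- returns 'first'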

Related

Multiple Columns and different Join Conditions for Oracle update

I have 2 tables: one is Location and the other one is a lookup table. I have to look in the lookup table for the location values and, if they are present, mark them as 'Y' or 'N' along with their corresponding values.
I have written individual update statements as below:
**Location1, L1value**
Update Location
set (Location1, L1value) =
(select UPPER(Value), 'Y' from Location_lookup where trim(Location1) = Location)
where exists (select 1 from Location_lookup where trim(Location1) = Location);
commit;
**Location2, L2value**
Update Location
set (Location2, L2value) =
(select UPPER(Value), 'Y' from Location_lookup where trim(Location2) = Location)
where exists (select 1 from Location_lookup where trim(Location2) = Location);
commit;
Similarly for 3rd flag and value.
Is there a way to write a single update for all three conditions? The reason I am looking for a single update is that I have 10+ million records and I do not want to scan the records three different times. The lookup table has > 32 million records.
Here is a solution which uses Oracle's bulk FORALL ... UPDATE capability. This is not quite as performant as a pure SQL solution, but it is simpler to code, and the efficiency difference probably won't matter much for 10 million rows on a modern enterprise server, especially if this is a one-off exercise.
Points to note:
You don't say whether LOCATION has a primary key. For this answer I have assumed it has an ID column. The solution won't work if there isn't a primary key, but if your table doesn't have a primary key you've likely got bigger problems.
Your question mentions setting the FLAG columns "as 'Y' and 'N'" but the required output only shows the 'Y' setting. I have included processing for 'N', but see the coda underneath.
declare
    cursor get_locations is
        with lkup as (
            select *
            from   location_lookup
        )
        select locn.id
              ,locn.location1
              ,upper(lup1.value) as l1value
              ,nvl2(lup1.value, 'Y', 'N') as l1flag
              ,locn.location2
              ,upper(lup2.value) as l2value
              ,nvl2(lup2.value, 'Y', 'N') as l2flag
              ,locn.location3
              ,upper(lup3.value) as l3value
              ,nvl2(lup3.value, 'Y', 'N') as l3flag
        from   location locn
               left outer join lkup lup1 on trim(locn.location1) = lup1.location
               left outer join lkup lup2 on trim(locn.location2) = lup2.location
               left outer join lkup lup3 on trim(locn.location3) = lup3.location
        where  lup1.location is not null
        or     lup2.location is not null
        or     lup3.location is not null;

    type t_locations_type is table of get_locations%rowtype index by binary_integer;
    t_locations t_locations_type;
begin
    open get_locations;
    loop
        fetch get_locations bulk collect into t_locations limit 10000;
        exit when t_locations.count() = 0;

        forall idx in t_locations.first() .. t_locations.last()
            update location
            set    l1value = t_locations(idx).l1value
                  ,l1flag  = t_locations(idx).l1flag
                  ,l2value = t_locations(idx).l2value
                  ,l2flag  = t_locations(idx).l2flag
                  ,l3value = t_locations(idx).l3value
                  ,l3flag  = t_locations(idx).l3flag
            where  id = t_locations(idx).id;
    end loop;
    close get_locations;
end;
/
There is a working demo on db<>fiddle here. The demo output doesn't exactly match the sample output posted in the question, because that output doesn't match the given input data.
Setting flags to 'Y' or 'N'?
The code above uses left outer joins on the lookup table. If a row is found, the NVL2() function returns 'Y'; otherwise it returns 'N'. This means the flag columns are always populated, regardless of whether the value columns are. The exception is rows which have no match in LOCATION_LOOKUP for any location (ID=4000 in my demo); in this case the flag columns will be null. This inconsistency follows from the inconsistencies in the question.
To resolve it:
if you want all flag columns to be populated with 'N' remove the WHERE clause from the get_locations cursor query.
if you don't want to set flags to 'N' change the NVL2() function calls accordingly: nvl2(lup1.value, 'Y', null) as l1flag

Update columns in DB2 using randomly chosen static values provided at runtime

I would like to update rows with values chosen randomly from a set of possible values.
Ideally I would be able to provide these values at runtime, using JdbcTemplate from a Java application.
Example:
In a table, the column "name" can contain any name. The goal is to run through the table and change all names to either "Bob" or "Alice".
I know that this can be done by creating a SQL function. I tested it and it was fine, but I wonder if it is possible to just use a simple query?
This will not work; it seems that the value is computed once and applied to all rows:
UPDATE test.table
SET first_name =
(SELECT a.name
FROM
(SELECT a.name, RAND() idx
FROM (VALUES('Alice'), ('Bob')) AS a(name) ORDER BY idx FETCH FIRST 1 ROW ONLY) as a)
;
I tried using MERGE INTO, but it won't even run (possible_names is not found in the SET query). I have yet to figure out why:
MERGE INTO test.table
USING
(SELECT
names.fname
FROM
(VALUES('Alice'), ('Bob'), ('Rob')) AS names(fname)) AS possible_names
ON ( test.table.first_name IS NOT NULL )
WHEN MATCHED THEN
UPDATE SET
-- select random name
first_name = (SELECT fname FROM possible_names ORDER BY idx FETCH FIRST 1 ROW ONLY)
;
EDIT: If possible, I would like to only focus on fields being updated and not depend on knowing primary keys and such.
Db2 seems to be optimizing away the subselect that returns your supposedly random name, materializing it only once, hence all rows in the target table receive the same value.
To force subselect execution for each row you need to somehow correlate it to the table being updated, for example:
UPDATE test.table
SET first_name =
(SELECT a.name
FROM (VALUES('Alice'), ('Bob')) AS a(name)
ORDER BY RAND(ASCII(SUBSTR(first_name, 1, 1)))
FETCH FIRST 1 ROW ONLY)
or maybe even
UPDATE test.table
SET first_name =
(SELECT a.name
FROM (VALUES('Alice'), ('Bob')) AS a(name)
ORDER BY first_name, RAND()
FETCH FIRST 1 ROW ONLY)
Now that the result of the subselect seems to depend on the value of the corresponding row in the target table, there's no choice but to execute it for each row.
If your table has a primary key, this would work. I've assumed the PK is column id.
UPDATE test.table t
SET first_name =
( SELECT name from
( SELECT *, ROW_NUMBER() OVER(PARTITION BY id ORDER BY R) AS RN FROM
( SELECT *, RAND() R
FROM test.table, TABLE(VALUES('Alice'), ('Bob')) AS d(name)
)
)
AS u
WHERE t.id = u.id and rn = 1
)
;
There might be a nicer/more efficient solution, but I'll leave that to others.
FYI I used the following DDL and data to test the above.
create table test.table(id int not null primary key, first_name varchar(32));
insert into test.table values (1,'Flo'),(2,'Fred'),(3,'Sue'),(4,'John'),(5,'Jim');

Comparing datetime value of columns from different tables

I have my SQL query like this:
if not exists(select RowId from dbo.Cache where StringSearched = @FirstName and colName = 'FirstName')
begin
--some code here
end
The purpose of the above if statement is to not execute the piece of code inside it if the value of StringSearched is already present in the Cache table, which means it has been looked up before, so there is no need to make the calculations again. The code inside the if statement, if executed, returns the row numbers of rows from Table B, and those are then inserted into the Cache table to keep maintaining the cache. Anyway, I need the records to be picked from Cache only if the ModifiedAt column of the Cache table is newer than the ModifiedAt column of the rows of Table B.
Note: I understand that I may need to use a subquery in the where clause, but within that where clause I need to check the ModifiedAt column of Table B only for RowIds returned by the outer select query.
How can I proceed without making it too complex?
You can use a subquery in the current query along with the WHERE clause. You didn't specify which columns determine the rows to compare, so I assumed your tableB also has StringSearched and colName columns, and I take max(ModifiedAt) for that string value.
IF NOT EXISTS (SELECT * from dbo.Cache as c WHERE StringSearched = @FirstName
AND colName = 'FirstName'
AND ModifiedAt > (Select MAX(ModifiedAt) FROM tableB as tabB WHERE tabB.RowID = c.RowID ))
BEGIN
--your query
END

Update query comparing two tables - Oracle

I am trying to change the sign of the amount field in the tdataseg table when the account_type in the aif_hyp_acct_type table is ' ', 'R', 'L', or 'Q'. aif_hyp_acct_type is the master table; it has loadid, account, and account_type fields. The tdataseg table has account, amount, and many other fields.
I tried this query and got an ORA-01427 error:
Single-row subquery returns more than one row.
update tdataseg
set tdataseg.amount =
(select decode(sign(tdataseg.amount),-1,abs(tdataseg.amount),1,-abs(tdataseg.amount),0)
from tdataseg, aif_hyp_acct_type
where tdataseg.loadid = aif_hyp_acct_type.loadid
and tdataseg.account = aif_hyp_acct_type.account
and aif_hyp_acct_type.account_type in (' ','R','L','Q'))
Presumably the problem is that you think you have a correlated subquery, but you don't: the outer table is repeated in the inner query's FROM clause. You need to remove that reference:
update tdataseg
set tdataseg.amount = (select decode(sign(tdataseg.amount), -1, abs(tdataseg.amount),
1,-abs(tdataseg.amount), 0)
from aif_hyp_acct_type
where tdataseg.loadid = aif_hyp_acct_type.loadid and
tdataseg.account = aif_hyp_acct_type.account and
aif_hyp_acct_type.account_type in (' ','R','L','Q')
);
EDIT:
You would get that error if there were no matches and the column were declared not null. Here is one fix:
update tdataseg
set tdataseg.amount = (select coalesce(max(decode(sign(tdataseg.amount), -1, abs(tdataseg.amount),
1,-abs(tdataseg.amount), 0)), 0)
from aif_hyp_acct_type
where tdataseg.loadid = aif_hyp_acct_type.loadid and
tdataseg.account = aif_hyp_acct_type.account and
aif_hyp_acct_type.account_type in (' ','R','L','Q')
);
That will set non-matches to 0.
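Since the DECODE above flips the sign of amount in every branch, another option is an EXISTS guard, which leaves non-matching rows untouched instead of zeroing them; a sketch:
update tdataseg
set tdataseg.amount = -tdataseg.amount
where exists (select 1
              from aif_hyp_acct_type
              where tdataseg.loadid = aif_hyp_acct_type.loadid
                and tdataseg.account = aif_hyp_acct_type.account
                and aif_hyp_acct_type.account_type in (' ','R','L','Q'));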
I'd try to solve such update problems with a MERGE statement. I find it a bit more natural to write.
MERGE INTO tdataseg
USING (
SELECT loadid, account
FROM aif_hyp_acct_type
WHERE account_type IN (' ','R','L','Q')
) q
ON (tdataseg.loadid=q.loadid AND tdataseg.account=q.account)
WHEN MATCHED THEN UPDATE SET amount = - ABS(amount);
You put the table you want to change after the MERGE INTO. The master table is subsetted in the USING query. The join condition goes into the ON clause, and the actual data change is specified after WHEN MATCHED THEN UPDATE SET.

Compare data between 2 tables

Hey, I have a requirement to compare two tables of the same structure.
Table1
EmpNO - Pkey
EmpName
DeptName
FatherName
IssueDate
ValidDate
I need to pass the EMPNO as a parameter, and I need to check whether any of the columns have changed and return a YES or NO value.
Can I do that using a PL/SQL function? I was thinking of using the built-in CONCAT function to do that.
I'm trying the below:
Table1Concat = Select CONCAT(Column1.....6) from table1 where emp_no = in_empno;
Table2Concat = Select CONCAT(Column1.....6) from table2 where emp_no = in_empno;
IF (Table1Concat <> Table2Concat) THEN return data_changed := 'YES';
else data_changed := 'NO';
END;
If you only want to detect whether any value is different then ...
select count(*)
from (select * from table1 where emp_no = my_emp_no
union
select * from table2 where emp_no = my_emp_no
)
If it returns 1 then the rows are the same, if it returns 2 then there is a difference.
The columns must be in the same order for this to work, or you'll have to list out all the column names in the order in which they match.
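To answer the PL/SQL part of the question directly, the UNION trick wraps naturally in a function; a minimal sketch (the function name, parameter, and %type anchor are illustrative):
create or replace function row_changed (in_empno in table1.empno%type)
return varchar2
is
    l_count pls_integer;
begin
    select count(*)
      into l_count
      from (select * from table1 where empno = in_empno
            union
            select * from table2 where empno = in_empno);
    return case when l_count > 1 then 'YES' else 'NO' end;
end row_changed;
/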
If you wanted to do this in bulk for a great many rows then you'd most likely use a different solution, so do not loop through every emp_no running this code for each one.
For bulk data where all emp_id's are present in both tables, use a query of the form:
select table1.emp_no,
       case when table1.column1 = table2.column1 and
                 table1.column2 = table2.column2 and
                 table1.column3 = table2.column3 and
                 ...
            then 'Yes'
            else 'No'
       end columns_match
from table1
join table2 on table1.emp_no = table2.emp_no
You can insert this result directly into a logging table.
Take care of null values though. "any_value = null" never evaluates to true, and "any_value != null" never evaluates to true either (both yield NULL), so you might need to add logic to handle cases where one or both values are null.
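If this is Oracle, one idiomatic trick is that DECODE treats two NULLs as equal, so each comparison can be made null-safe; a sketch for the first two columns, as a drop-in replacement inside the CASE above:
case when decode(table1.column1, table2.column1, 1, 0) = 1 and
          decode(table1.column2, table2.column2, 1, 0) = 1
          -- ... and so on for the remaining columns
     then 'Yes'
     else 'No'
end columns_match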