Oracle SQL statement to update column values based on specific condition

I have a table with three columns: PID, LOCID, ISMGR. In the existing scenario, a person is set to ISMGR=true only for certain location IDs.
As per the new requirement, we have to set ISMGR=true on every row of any person who has at least one row with ISMGR=true (that is, if he is a manager for any one location, he should be a manager for all locations).
Table Data before running the script:
PID | LOCID | ISMGR
  1 |     1 |     1
  1 |     2 |     0
  1 |     3 |     0
  2 |     1 |     0
  2 |     2 |     1
Table Data after running the script:
PID | LOCID | ISMGR
  1 |     1 |     1
  1 |     2 |     1
  1 |     3 |     1
  2 |     1 |     1
  2 |     2 |     1
Any help will be highly appreciated.
Thanks in advance.

I would be inclined to write this using exists:
update t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from t t2 where t2.pid = t.pid and t2.ismgr = 1);
exists should be more efficient than doing a subquery with an aggregation.
This will work best with indexes on t(pid, ismgr) and t(ismgr).
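For reference, creating those indexes would look something like this (t stands in for your actual table name; the index names are just placeholders):
-- placeholder index names; adjust to your own naming convention
create index t_pid_ismgr_idx on t(pid, ismgr);
create index t_ismgr_idx on t(ismgr);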

This is not an answer but a test of the two solutions offered so far - I will call them the "EXISTS" and the "AGGREGATE" solutions or approaches.
Details of the tests are below, but here are two overall conclusions:
Both approaches have comparable execution times; on average the AGGREGATE approach worked a little faster than the EXISTS approach, but by a very small margin (smaller than the differences between running times from one trial to the next). Without indexes on any columns, the run times were (first number for the EXISTS approach, second for AGGREGATE):
Trial 1:  8.19s   8.08s
Trial 2:  8.98s   8.22s
Trial 3:  9.46s   9.55s
Note - Estimated optimizer costs should be used only to compare different execution plans for the same statement, not for different solutions using different approaches. Even so, someone will inevitably ask; so: for the EXISTS approach the lowest cost the Optimizer found was 4766; for AGGREGATE, 2665. Again, though, this is completely meaningless.
If a lot of rows need to be updated, indexes will hurt performance much more than they help it: when rows are updated, the indexes must be updated as well. If only a small number of rows must be updated, then the indexes will help, because most of the time is spent finding the rows that must be updated and only a little time is spent on the updates themselves. In my example almost 25% of the rows had to be updated, and with the indexes in place the AGGREGATE solution took 51.2 seconds and the EXISTS solution took 59.3 seconds!
RECOMMENDATION: If you expect that a large number of rows may need to be updated, and you already have indexes on the table, you may be better off DROPPING them and re-creating them after the updates! Or perhaps there are other solutions to this problem; I am not an expert (keep that in mind!)
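For the indexes created for the separate test below, that would mean something along these lines (just a sketch):
drop index pid_ismgr_idx;
drop index ismgr_ids;
-- ... run the UPDATE here ...
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);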
To test properly, after I created the test table and committed, I ran each solution by itself, then I rolled back and, logged in as SYS (in a different session), I ran alter system flush buffer_cache to make sure performance is not randomly helped by cache hits or hurt by misses. In all cases everything is done from disk storage.
I created a table with id's from 1 to 1.2 million and a random integer between 1 and 3, with probabilities 40%, 40% and 20% respectively (see the use of dbms_random below). Then from this prep data I created the test table: each pid was included one, two or three times based on this random integer; and a random 0 or 1 was added as ismgr (with 50-50 probability) in each row. I also added a random integer between 1 and 4 as locid just to simulate the actual data; I didn't worry about duplicate locid since that column plays no role in the problem.
Of the 1.2 million pids, approximately 480,000 (40%) appear just once in the test table, another ~480,000 appear twice and ~240,000 three times. Total rows should be about 2,160,000 - that's the cardinality of the base table (in reality it ended up being 2,160,546). Then:
- none of the ~480,000 rows with a unique pid need to be changed;
- half of the 480,000 pids with a count of 2 will have the same ismgr in both rows (so no change) and the other half will be split, so we will need to change 240,000 rows from these;
- a simple combinatorial argument shows that 3/8, or 270,000, of the 720,000 rows for pids that appear three times in the table must be changed.
So we should expect about 510,000 rows to be changed. In fact the update statements resulted in 510,132 rows updated (same for both solutions). These sanity checks show that the test was probably set up correctly. Below I also show a small sample from the base table, as another sanity check.
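As an additional check, the expected number of rows to be changed can be confirmed directly, before running either update, with a count query of the same shape as the EXISTS solution (a sketch):
select count(*) as rows_to_change
from tbl t
where t.ismgr = 0
  and exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
-- expected: ~510,000 (510,132 in this particular run)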
CREATE TABLE statement:
create table tbl as
with prep ( pid, dup ) as (
    select level,
           round( dbms_random.value(0.5, 3) ) as dup
    from   dual
    connect by level <= 1200000
)
select pid,
       round( dbms_random.value(0.5, 4.5) ) as locid,
       round( dbms_random.value(0, 1) )     as ismgr
from   prep
connect by level <= dup
       and prior pid = pid
       and prior sys_guid() is not null
;
commit;
Sanity checks:
select count(*) from tbl;
  COUNT(*)
----------
   2160546

select * from tbl where pid between 324720 and 324730;

       PID      LOCID      ISMGR
---------- ---------- ----------
    324720          4          1
    324721          1          0
    324721          4          1
    324722          3          0
    324723          1          0
    324723          3          0
    324723          3          1
    324724          3          1
    324724          2          0
    324725          4          1
    324725          2          0
    324726          2          0
    324726          1          0
    324727          3          0
    324728          4          1
    324729          1          0
    324730          3          1
    324730          3          1
    324730          2          0

19 rows selected
UPDATE statements:
update tbl t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
rollback;
update tbl
    set ismgr = 1
    where ismgr = 0
      and pid in ( select  pid
                   from    tbl
                   group by pid
                   having  max(ismgr) = 1);
rollback;
-- statements to create indexes, used in separate testing:
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);

Why PL/SQL? All you need is a plain SQL statement. For example:
update your_table t     -- enter your actual table name here
    set ismgr = 1
    where ismgr = 0
      and pid in ( select  pid
                   from    your_table
                   group by pid
                   having  max(ismgr) = 1)
;

The existing solutions are perfectly fine, but I prefer to use merge any time I'm updating rows from a correlated sub-query. I find it to be more readable and the performance is typically commensurate with the exists method.
MERGE INTO t
USING (SELECT DISTINCT pid
       FROM t
       WHERE ismgr = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
    UPDATE SET ismgr = 1
    WHERE ismgr = 0;
As @mathguy pointed out, in this case using group by and having is more efficient than distinct. Using that with merge is just a matter of changing the sub-query:
MERGE INTO t
USING (SELECT pid
       FROM t
       GROUP BY pid
       HAVING MAX(ismgr) = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
    UPDATE SET ismgr = 1
    WHERE ismgr = 0;

Related

Max match same numbers from each row

Generating a 1-million-row report with the below-mentioned script takes almost 2 days, so I would really appreciate it if somebody could help me with a different script that can generate the report within 10-15 minutes.
The requirement of the report is as follows:
Table "cover" contains 5 million rows and 6 columns of data, and likewise table "data" contains 500,000 rows and 6 columns.
Each row in table cover has to be checked against table data to find the maximum number of matches.
For instance, as shown in the tables below, there could be 3 matches in row #1, 2 matches in row #2 and 5 matches in row #3, so the script has to select the maximum, which is 5 in row #3.
Sample table
UPDATE public.cover_sheet AS fc
SET maxmatch = (SELECT MAX(tmp.mtch)
                FROM (
                    SELECT (SELECT CASE WHEN fc.a=drwo.a THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.b=drwo.b THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.c=drwo.c THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.d=drwo.d THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.e=drwo.e THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.f=drwo.f THEN 1 ELSE 0 END) AS mtch
                    FROM public.data AS drwo
                ) AS tmp)
WHERE fc.code>0;

SELECT *
FROM public.cover_sheet AS fc
WHERE fc.maxmatch>0;
As @a_horse_with_no_name mentioned in a comment on the question, your question is not clear...
It seems you want to get the number of records for which all 6 fields from both tables are equal.
I'd suggest that you:
- reduce the number of SELECT statements, which will speed up query execution,
- split your query into a few smaller ones (good practice), to check your logic,
- use a join to get equal data, see: Visual Representation of SQL Joins,
- use a subquery or CTE to get a result with which you'll be able to update the table.
I think you want to get a result like the following:
SELECT COUNT(*) mtch
FROM public.cover_sheet AS fc INNER JOIN public.data AS drwo ON
fc.a=drwo.a AND fc.b=drwo.b AND fc.c=drwo.c AND fc.d=drwo.d AND fc.e=drwo.e AND fc.f=drwo.f
If I'm not wrong and the above query is correct, its execution time should drop to about 1-2 minutes.
Finally, the update query may look like:
WITH qry AS
(
-- proper select statement here
)
UPDATE public.cover_sheet AS fc
SET maxmatch = qry.<fieldname>
FROM qry
WHERE fc.code>0 AND fc.<key> = qry.<key>;
Note:
I cannot see your data and I know nothing about its structure, relationships, etc., so you will have to adapt the above query to your needs.
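For illustration only, here is one way the placeholders might be filled in, using a boolean-to-integer cast as shorthand for the CASE expressions; this assumes cover_sheet has a unique key column (called id here purely as a placeholder) and is meant to show the shape of the statement, not to guarantee the performance target:
WITH qry AS
(
    SELECT fc.id,
           MAX( (fc.a=drwo.a)::int + (fc.b=drwo.b)::int + (fc.c=drwo.c)::int +
                (fc.d=drwo.d)::int + (fc.e=drwo.e)::int + (fc.f=drwo.f)::int ) AS mtch
    FROM public.cover_sheet AS fc
    CROSS JOIN public.data AS drwo
    GROUP BY fc.id
)
UPDATE public.cover_sheet AS fc
SET maxmatch = qry.mtch
FROM qry
WHERE fc.code > 0 AND fc.id = qry.id;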

Subset large table for use in multiple UNIONs

Suppose I have a table with the following structure:
id  measure_1_actual  measure_1_predicted  measure_2_actual  measure_2_predicted
 1          1                  0                   0                  0
 2          1                  1                   1                  1
 3          .                  .                   0                  0
I want to create the following table, for each ID (shown is an example for id = 1):
measure  actual  predicted
   1       1        0
   2       0        0
Here's one way I could solve this problem (I haven't tested this, but you get the general idea, I hope):
SELECT 1 AS measure,
       measure_1_actual AS actual,
       measure_1_predicted AS predicted
FROM tb
WHERE id = 1
UNION
SELECT 2 AS measure,
       measure_2_actual AS actual,
       measure_2_predicted AS predicted
FROM tb
WHERE id = 1
In reality, I have five of these "measures" and tens of millions of people; subsetting such a large table five times for each member does not seem like the most efficient way of doing this. This is a real-time API receiving tens of requests a minute, so I think I'll need a better approach. My other thought was to create a temp table/view for each member once the request is received, and then UNION based off of that subsetted table.
Does anyone have a more efficient way of doing this?
You can use a lateral join:
select t.id, v.*
from t cross join lateral
     (values (1, measure_1_actual, measure_1_predicted),
             (2, measure_2_actual, measure_2_predicted)
     ) v(measure, actual, predicted);
Lateral joins were introduced in Postgres 9.4. You can read about them in the documentation.
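For the real-time, single-member case described in the question, the same query can simply be filtered by id (a quick sketch, using id = 1 as in the example):
select v.*
from t cross join lateral
     (values (1, measure_1_actual, measure_1_predicted),
             (2, measure_2_actual, measure_2_predicted)
     ) v(measure, actual, predicted)
where t.id = 1;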

SQL - Update top n records for each value in column a where n = count of column b

I have one table with the following columns and sample values:
[test]
ID | Sample | Org | EmployeeNumber
 1 |        | 100 | 6513241
 2 |        | 200 | 3216542
 3 |        | 300 | 5649841
 4 |        | 100 | 9879871
 5 |        | 200 | 6546548
 6 |        | 100 | 1116594
My example count query based on [test] returns these sample values grouped by Org:
Org | Count of EmployeeNumber
100 | 3
200 | 2
300 | 1
My question is can I use this count to update test.Sample to 'x' for the top 3 records of Org 100, the top 2 records of Org 200, and the top 1 record of Org 300? It does not matter which records are updated, as long as the number of records updated for the Org = the count of EmployeeNumber.
I realize that I could just update all records in this example but I have 175 Orgs and 900,000 records and my real count query includes an iif that only returns a partial count based on other columns.
The db that I am taking over uses a recordset and loop to update. I am trying to write this in one SQL update statement. I have tried several variations of nested select statements but can't quite figure it out. Any help would save my brain from exploding. Thanks!
Assuming that id is the unique ID of the row, you could use a correlated subquery: for each row, count the rows of the same organization whose id is less than or equal to the current row's id, and check that this count is less than or equal to the number of records you want to designate for that organization.
For example to mark 3 records of the organization 100 you could use:
UPDATE test
SET sample = 'x'
WHERE org = 100
  AND (SELECT count(*)
       FROM test t
       WHERE t.org = test.org
         AND t.id <= test.id) <= 3;
And analog for the other cases.
(Disclaimer: I don't have access to Access (ha, ha, pun), so I could not test it. But I guess it's basic enough, to work in almost every DBMS, also in Access.)
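If the per-org limit really is just a count over the same table (as in the simplified example), the constants 100 and 3 could in principle be replaced by another correlated subquery so that a single statement covers all Orgs; a rough, untested sketch (your real count query, with its iif, would go in the second subquery):
UPDATE test
SET sample = 'x'
WHERE (SELECT count(*)
       FROM test t
       WHERE t.org = test.org
         AND t.id <= test.id)
   <= (SELECT count(t2.EmployeeNumber)
       FROM test t2
       WHERE t2.org = test.org);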

MonetDB: Enumerate groups of rows based on a given "boundary" condition

Consider the following table:
id  gap  groupID
 0    0        1
 2    3        1
 3    7        2
 4    1        2
 5    5        2
 6    7        3
 7    3        3
 8    8        4
 9    2        4
Where groupID is the desired, computed column, such that its value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it's already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked in several other entries here in StackOverflow, and I've seen the usage of sum as an aggregate for a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_num). I can't use triggers (to modify the record insertion before it takes place) either because I need to keep the data mentioned above within a stored function in a local temporary table -- and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column value by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this has failed as well because it gives the groupID 0 to all records:
declare threshold int;
set threshold = 6;
insert into newTable( id, gap, groupID )
select A.id, A.gap,
       case when A.gap > threshold then
            (select case when max(groupID) is null then 0 else max(groupID)+1 end from newTable)
       else
            (select case when max(groupID) is null then 0 else max(groupID) end from newTable)
       end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. It's been a long time already trying to figure this out.
BTW: Cursors are not supported in MonetDB either --
You can assign the group using a correlated subquery. Simply count, for each row, how many gap values at or before it (by id) exceed 6:
select id, gap,
       (select 1 + count(*)
        from t as t2
        where t2.id <= t.id and t2.gap > 6
       ) as Groupid
from t;
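Adapted to the setup in the question (source table A, the threshold variable, and the temporary newTable), that might look roughly like this; an untested sketch:
declare threshold int;
set threshold = 6;
insert into newTable( id, gap, groupID )
select A.id, A.gap,
       (select 1 + count(*)
        from A as t2
        where t2.id <= A.id and t2.gap > threshold) as groupID
from A
order by A.id asc;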

SQL based data diff: longest common subsequence

I'm looking for research papers or writings on applying the Longest Common Subsequence algorithm to SQL tables for obtaining a data diff view. Other suggestions on how to solve a table diff problem are also welcome. The challenge is that SQL tables have this nasty habit of getting rather BIG, and applying straightforward algorithms designed for text processing may result in a program that never ends...
so given a table Original:
Key  Content
 1   This row is unchanged
 2   This row is outdated
 3   This row is wrong
 4   This row is fine as it is
and the table New:
Key  Content
 1   This row was added
 2   This row is unchanged
 3   This row is right
 4   This row is fine as it is
 5   This row contains important additions
I need to find out the Diff:
+++ 1 This row was added
--- 2 This row is outdated
--- 3 This row is wrong
+++ 3 This row is right
+++ 5 This row contains important additions
If you export your tables into csv files, you can use http://sourceforge.net/projects/csvdiff/
Quote:
csvdiff is a Perl script to diff/compare two csv files with the
possibility to select the separator. Differences will be shown like:
"Column XYZ in record 999" is different. After this, the actual and the
expected result for this column will be shown.
This is probably too simple for what you're after, and it's not research :-), but just conceptual. I imagine you're looking to compare different methods for processing overhead (?).
--This is half of what you don't want ( A )
SELECT o.Key FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content

--This is the other half of what you don't want ( B )
SELECT n.Key FROM tbl_Original o INNER JOIN tbl_New n ON o.Content = n.Content

--This is half of what you DO want ( C )
SELECT '+++' as diff, n.Key, n.Content FROM tbl_New n WHERE n.Key NOT IN ( B )

--This is the other half of what you DO want ( D )
SELECT '---' as diff, o.Key, o.Content FROM tbl_Original o WHERE o.Key NOT IN ( A )

--Combining C & D
( C )
Union
( D )
Order By diff, Key
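Assembled into a single statement (the same logic with the A/B placeholders inlined; a sketch under the same assumptions, with Key bracketed since it is a reserved word in some DBMSs):
SELECT '+++' as diff, n.[Key], n.Content
FROM tbl_New n
WHERE n.[Key] NOT IN ( SELECT n2.[Key] FROM tbl_Original o INNER JOIN tbl_New n2 ON o.Content = n2.Content )
UNION
SELECT '---' as diff, o.[Key], o.Content
FROM tbl_Original o
WHERE o.[Key] NOT IN ( SELECT o2.[Key] FROM tbl_Original o2 INNER JOIN tbl_New n ON o2.Content = n.Content )
ORDER BY diff, [Key]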
Improvements...
- try creating indexed views of the base tables first
- try reducing the length of the content field to its minimum for uniqueness (trial/error), and then use that shorter result to do your comparisons
-- e.g. to get min length (1000 is arbitrary -- just need an exit)
declare @i int
set @i = 1
while @i < 1000 and exists (
    select left(Content, @i)
    from tbl_Original            -- or tbl_New
    group by left(Content, @i)
    having count(*) > 1 )
begin
    set @i = @i + 1
end
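The resulting @i could then be used to run the content comparisons on the shortened field, for example (again just a sketch of the idea suggested above):
-- content-match keys, as in ( A ) above, but comparing only the shortened content
SELECT o.[Key]
FROM tbl_Original o INNER JOIN tbl_New n
     ON left(o.Content, @i) = left(n.Content, @i)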