DB2 large update from another table

I have a table with 600 000+ rows called asset. The customer has added a new column and would like it populated with a value from another table:
ASSET
| id   | ... | newcol |
-----------------------
| 0001 | ... | -      |

TEMP
| id   | condition |
--------------------
| 0001 | 3         |
If I try to update it all at once, it times out/claims there is a dead lock:
update asset set newcol = (
select condition from temp where asset.id = temp.id
) where newcol is null;
The way I got around it was by only doing a 100 rows at a time:
update (select id, newcol from asset where newcol is null
fetch first 100 rows only) a1
set a1.newcol = (select condition from temp a2 where a1.id = a2.id);
At the moment I am making good use of the copy/paste utility, but I'd like to know of a more elegant way to do it (as well as a faster way).
I have tried putting it in a PL/SQL loop but I can't seem to get it to work with DB2 as a standalone script.
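For what it's worth, here is a rough sketch of how that 100-row batch could be looped in a single block of DB2 SQL PL rather than pasted repeatedly. It assumes DB2 for LUW 9.7 or later (which accepts anonymous compound statements) and that the script is run from the CLP with a non-default statement terminator, e.g. db2 -td@ -f script.sql; the variable name and batch size are arbitrary:

BEGIN
  DECLARE v_rows INT DEFAULT 1;
  WHILE v_rows > 0 DO
    -- same 100-row batched update as above
    UPDATE (SELECT id, newcol FROM asset WHERE newcol IS NULL
            FETCH FIRST 100 ROWS ONLY) a1
       SET a1.newcol = (SELECT condition FROM temp a2 WHERE a1.id = a2.id);
    -- how many rows did the last statement touch?
    GET DIAGNOSTICS v_rows = ROW_COUNT;
    -- commit each batch to release locks; if your client rejects COMMIT here,
    -- wrap the same body in a stored procedure and call that instead
    COMMIT;
  END WHILE;
END@

A larger batch size (say 5000 rows) usually finishes far sooner than 100, as long as it stays below whatever lock limit is triggering the deadlocks and timeouts.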

Related

Merge duplicate table rows, then delete duplicates without timeout

I have the following (simplified) table structure:
table_user_book
id | book_id | user_id | page
---+---------+---------+------
 1 | b1      | c1      |  16
 2 | b1      | c1      | 130
 3 | b2      | c1      |  54
 4 | b4      | c2      |  97
 5 | b5      | c2      | 625
 6 | b5      | p3      | 154
As you can see, I have duplicates. I want a single entry per user/book pair, where page is the last page read by the user. The following query gives me the correct merged data, but I now want to update the table and delete the wrong records.
select
min("id") as "id",
max("page") as "page",
"book_id",
"user_id",
count(*) as "duplicates"
from myschema."table_user_book"
group by "book_id", "user_id"
order by count(*) DESC
I've run an update query using the above query as the source, matching on "id".
I've then tried the classic delete query with WHERE a.id < b.id AND ... to delete the wrong rows, but my DB times out.
I don't have the exact delete query I used, but it was quite similar to:
delete from myschema."table_user_book" a
using myschema."table_user_book" b
where a.id > b.id
  and a."book_id" = b."book_id"
  and a."user_id" = b."user_id";
I've also tried to use sets to delete the rows, but that is an even worse solution.
delete FROM
myschema."table_user_book"
where "id" not in ( ... above filter query with only ids selected ... )
I have around 400k rows in that table and 10-15% of them are duplicates!
For now I'm creating a tmp table with the right values, dropping the original one, and recreating it:
create table myschema.tmp as ( ... query above ... )
drop table myschema.table_user_book
create table myschema.table_user_book as ( select * from myschema."tmp" )
For deleting only the duplicated records you have two options:
Create a recursive function that detects and deletes the duplicates.
Use a CTE (WITH ... AS) that materializes the duplicates, and delete them without recursion in a single execution of the query.
For a sample query, please visit this URL.
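As a rough sketch of that CTE approach, assuming PostgreSQL (suggested by the USING delete in the question) and the column names above; note that it keeps, per book/user pair, the row with the highest page rather than updating min("id") first, so it is a variant of the original plan rather than an exact reproduction:

with ranked as (
    select "id",
           row_number() over (partition by "book_id", "user_id"
                              order by "page" desc, "id") as rn
    from myschema."table_user_book"
)
delete from myschema."table_user_book" t
using ranked r
where t."id" = r."id"
  and r.rn > 1;

Because the CTE is evaluated once and the delete is then driven by primary-key matches, this tends to avoid the self-join on book_id/user_id that made the a.id > b.id version time out.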

SQL - Set column value to the SUM of all references

I want the column "CurrentCapacity" to be the SUM of a specific column over all rows that reference it.
Let's say there are three rows in SecTable that all have FirstTableID = 1, with Size values 1, 1 and 3.
The row in FirstTable that has ID = 1 should then have a value of 5 in the CurrentCapacity column.
How can I do this, and how can I make it happen automatically on insert, update and delete?
Thanks!
FirstTable
+----+-------------+-------------------------+
| ID | MaxCapacity | CurrentCapacity |
+----+-------------+-------------------------+
| 1 | 5 | 0 (desired result = 5) |
+----+-------------+-------------------------+
| 2 | 5 | 0 |
+----+-------------+-------------------------+
| 3 | 5 | 0 |
+----+-------------+-------------------------+
SecTable
+----+-------------------+------+
| ID | FirstTableID (FK) | Size |
+----+-------------------+------+
| 1 | 1 | 2 |
+----+-------------------+------+
| 2 | 1 | 3 |
+----+-------------------+------+
In general, a view is a better solution than trying to keep a calculated column up-to-date. For your example, you could use this:
CREATE VIEW capacity AS
SELECT f.ID, f.MaxCapacity, COALESCE(SUM(s.Size), 0) AS CurrentCapacity
FROM FirstTable f
LEFT JOIN SecTable s ON s.FirstTableID = f.ID
GROUP BY f.ID, f.MaxCapacity
Then you can simply
SELECT *
FROM capacity
to get the results you desire. For your sample data:
ID MaxCapacity CurrentCapacity
1 5 5
2 5 0
3 5 0
Demo on SQLFiddle
I got this to work with the following trigger:
CREATE TRIGGER UpdateCurrentCapacity
ON SecTable
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON
    DECLARE @Iteration INT
    SET @Iteration = 1
    WHILE @Iteration <= 100
    BEGIN
        UPDATE FirstTable
        SET FirstTable.CurrentCapacity = (SELECT COALESCE(SUM(SecTable.Size), 0)
                                          FROM SecTable
                                          WHERE FirstTableID = @Iteration)
        WHERE ID = @Iteration;
        SET @Iteration = @Iteration + 1
    END
END
GO
Personally, I would not use a trigger, nor store CurrentCapacity as a value at all, since it breaks normalization rules for database design. You have a relation, and you can already get the results by creating a view or by making CurrentCapacity a calculated column.
Your view can look like this:
SELECT Id, MaxCapacity, ISNULL(O.SumSize,0) AS CurrentCapacity
FROM dbo.FirstTable FT
OUTER APPLY
(
SELECT ST.FirstTableId, SUM(ST.Size) as SumSize FROM SecTable ST
WHERE ST.FirstTableId = FT.Id
GROUP BY ST.FirstTableId
) O
Sure, you could fire a proc every time a row is updated/inserted or deleted in the second table and recalculate the column, but you might as well calculate it on the fly. If it's not required to have the column accurate, you can have a job update the values every X hours. You could combine this with your view to have both a "live" and "cached" version of the capacity data.
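If you do go the proc/job route, here is a rough sketch of the set-based recalculation it could run (T-SQL, reusing the table and column names above; note it recomputes every row rather than only the changed ones):

UPDATE ft
SET ft.CurrentCapacity = COALESCE(s.SumSize, 0)
FROM dbo.FirstTable AS ft
LEFT JOIN (
    -- total Size per parent row; parents with no children get NULL, hence the COALESCE
    SELECT FirstTableID, SUM(Size) AS SumSize
    FROM dbo.SecTable
    GROUP BY FirstTableID
) AS s ON s.FirstTableID = ft.ID;

A scheduled job could run just this one statement every X hours, while the view above serves the always-accurate numbers.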

Update column in one table for a user based on count of required records in another table for same user without cursor

I have two tables, A and B. I need to update a column in table A for all user IDs based on the count of records that user ID has in table B, according to defined rules: if the count of records in table B is 3 and 3 are required for that user ID, mark IsCorrect as 1, otherwise 0; if the count is 2 and 5 are required, IsCorrect is 0. Below is what I am trying to achieve.
Table A
UserID | Required | IsCorrect
----------------------------------
1 | SO;GO;PE | 1
2 | SO;GO;PE;PR | 0
3 | SO;GO;PE | 1
Table B
UserID | PPName
-----------------------
1 | SO
1 | GO
1 | PE
2 | SO
2 | GO
3 | SO
3 | GO
3 | PE
I tried writing an UPDATE that joins the other table, but could not come up with one. I also do not want to use cursors because of their overhead. I know I will have to create a stored procedure for the rules, but how to pass the user IDs to it without a cursor is what I am looking for.
This is an update for my earlier question. Thanks for the help.
Here's a solution for PostgreSQL:
update TableA
set IsCorrect =
    case when string_to_array(Required, ';') <@
              (select array_agg(PPName)
               from TableB
               where TableA.UserID = TableB.UserID)
         then 1
         else 0
    end;
You can also see it live on SQL Fiddle.
Use a sub-query with an aggregate function, and then CASE WHEN for the conditional update:
update TableA as A
inner join
(
    select B.UserID, count(*) as cnt
    from TableB as B
    group by B.UserID
) as T on A.UserID = T.UserID
set A.IsCorrect = case when T.cnt >= 3 then 1 else 0 end;
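A hedged variant of that join which derives the expected count from the semicolon-separated Required column instead of hard-coding 3 (this assumes MySQL-style update syntax, and it only checks that the counts match, not that the individual PPName values match; the array answer above does the stricter check):

update TableA as A
inner join
(
    select B.UserID, count(*) as cnt
    from TableB as B
    group by B.UserID
) as T on A.UserID = T.UserID
set A.IsCorrect =
    -- number of required items = separator count + 1
    case when T.cnt = length(A.Required) - length(replace(A.Required, ';', '')) + 1
         then 1
         else 0
    end;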

Finding & updating duplicate rows

I need to implement a query (or maybe a stored procedure) that will perform soft de-duplication of data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update another.
The similarity is based on a score. Score is calculated the following way:
from both records, take values of column A,
values equal? add A1 to the score,
values not equal? subtract A2 from the score,
move on to the next column.
As soon as all desired value pairs are checked:
is the resulting score more than X?
yes – the records are duplicates: mark the older record as "duplicate" and append its id to a duplicate_ids column on the newer record.
no – do nothing.
How would I approach solving this task in SQL?
The table in question is called people. People records are entered by different admins. The de-duplication process exists to make sure no two same people exists in the system.
The motivation for the task is simple: performance.
Right now the solution is implemented in scripting language via several sub-par SQL queries and logic on top of them. However, the volume of data is expected to grow to tens of millions of records, and script will eventually become very slow (it should run via cron every night).
I'm using postgresql.
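For reference, a minimal sketch of the pairwise scoring described above, assuming PostgreSQL; col_a, col_b, the weights 10/-5 and the threshold 15 are hypothetical placeholders for the real columns and constants:

SELECT *
FROM (
    SELECT p1.id AS older_id,
           p2.id AS newer_id,
           -- add A1 when a column matches, subtract A2 when it does not
           (CASE WHEN p1.col_a = p2.col_a THEN 10 ELSE -5 END
          + CASE WHEN p1.col_b = p2.col_b THEN 10 ELSE -5 END) AS score
    FROM people p1
    JOIN people p2 ON p1.id < p2.id   -- assuming a lower id means an older record
) scored
WHERE score > 15;   -- threshold X

The self-join compares every record with every other one, so the work grows quadratically with the table size; that is exactly why this gets painful at tens of millions of rows and why blocking-based tools exist.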
It appears that the de-duplication is generally a tough problem.
I found this: https://github.com/dedupeio/dedupe. There's a good description of how this works: https://dedupe.io/documentation/how-it-works.html.
I'm going to explore dedupe. I'm not going to try to implement it in SQL.
If I understand you correctly, this could help.
You can use PostgreSQL Window Functions to get all the duplicates and use "weights" to determine which records are duplicated so you can do whatever you like with them.
Here is an example:
-- Temporal table for the test, primary key is id and
-- we have A,B,C columns with a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text,creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
-- id | colA | colB | colC | creation_date
-- ----+-------+-------+-------+---------------
-- 1 | A | B | C | 2017-05-01
-- 2 | D | E | F | 2017-06-01
-- 3 | A | B | D | 2017-08-01 <-- Duplicate A,B
-- 4 | A | B | R | 2017-09-01 <-- Duplicate A,B
-- 5 | C | J | K | 2017-09-01
-- 6 | A | C | J | 2017-10-01
-- 7 | C | W | K | 2017-10-01 <-- Duplicate C,K
-- 8 | R | T | Y | 2017-11-01
-- Here is the query you can use to get the id's from the duplicate records
-- (the comments are backwards):
-- third, you select the id of the duplicates
SELECT id
FROM
(
-- Second, select all the columns needed and weight the duplicates.
-- You don't need to select every column, if only the id is needed
-- then you can only select the id
-- Query this SQL to see results:
SELECT
id,"colA", "colB", "colC",creation_date,
-- The weights are simple, if the row count is more than 1 then assign 1,
-- if the row count is 1 then assign 0, sum all and you have a
-- total weight of 'duplicity'.
CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
FROM
(
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
FROM test ORDER BY id
) count_column_duplicates
) duplicates
-- HERE IS DEFINED WHICH WEIGHT TO SELECT; for the test,
-- I defined it as the rows with a weight greater than 1
WHERE weight>1
-- The total SQL returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can add this query to a stored procedure so you can run it whenever you like. Hope it helps.

SQL optimisation - Word count in string - Postgresql

I am trying to update a large table (about 1M rows) with the count of words in a field on Postgresql.
This query works, and sets the token_count field counting the words (tokens) in longtext in table my_table:
UPDATE my_table mt SET token_count =
(select count(token) from
(select unnest(regexp_matches(t.longtext, E'\\w+','g')) as token
from my_table as t where mt.myid = t.myid)
as tokens);
myid is the primary key of the table.
\\w+ is necessary because I want to count words, ignoring special characters.
For example, A test . ; ) would return 5 with space-based count, while 2 is the right value.
The issue is that it's horribly slow, and 2 days are not enough to complete it on 1M rows.
What would you do to optimise it? Are there ways to avoid the join?
How can I split the batch into blocks, using for example limit and offset?
Thanks for any tips,
Mulone
UPDATE: I measured the performance of the array_split approach, and the update is going to be slow anyway. So maybe a solution would consist of parallelising it. If I run different queries from psql, only one query works and the others wait for it to finish. How can I parallelise an update?
Have you tried using array_length?
UPDATE my_table mt
SET token_count = array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1)
http://www.postgresql.org/docs/current/static/functions-array.html
# select array_length(regexp_split_to_array(trim(' some long text '), E'\\W+'), 1);
array_length
--------------
3
(1 row)
UPDATE my_table
SET token_count = array_length(regexp_split_to_array(longtext, E'\\s+'), 1)
Or your original query without a correlation
UPDATE my_table
SET token_count = (
select count(*)
from (select unnest(regexp_matches(longtext, E'\\w+','g'))) s
);
Using tsvector and ts_stat to get statistics of a tsvector column:
SELECT *
FROM ts_stat($$
SELECT to_tsvector(t.longtext)
FROM my_table AS t
$$);
No sample data to try it but it should work.
Sample Data
CREATE TEMP TABLE my_table
AS
SELECT $$A paragraph (from the Ancient Greek παράγραφος paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.$$::text AS longtext;
SELECT *
FROM ts_stat($$
SELECT to_tsvector(t.longtext)
FROM my_table AS t
$$);
word | ndoc | nentry
--------------+------+--------
παράγραφος | 1 | 1
written | 1 | 1
write | 1 | 2
unit | 1 | 1
sentenc | 1 | 1
self-contain | 1 | 1
self | 1 | 1
point | 1 | 1
particular | 1 | 1
paragrapho | 1 | 1
paragraph | 1 | 2
one | 1 | 1
idea | 1 | 1
greek | 1 | 1
discours | 1 | 1
deal | 1 | 1
contain | 1 | 1
consist | 1 | 1
besid | 1 | 2
ancient | 1 | 1
(20 rows)
Make sure myid is indexed, being the first field in the index.
Consider doing this outside the DB in the first place. It's hard to say without benchmarking, but the counting may be more costly than the select+update, so it may be worth it.
Use the COPY command (the BCP equivalent for Postgres) to copy the table data out to a file efficiently in bulk (see the sketch after this list).
Run a simple Perl script to count. 1 million rows should take from a couple of minutes to an hour for Perl, depending on how slow your IO is.
Use COPY to copy the table back into the DB (possibly into a temp table and then update from that temp table; or, better yet, truncate the main table and COPY straight into it if you can afford the downtime).
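A rough sketch of that round trip using psql's client-side \copy (the file paths and the intermediate counting step are placeholders):

-- 1. dump the key and the text column
\copy (SELECT myid, longtext FROM my_table) TO '/tmp/longtext.tsv'

-- 2. count the \w+ tokens per line outside the DB (Perl, awk, ...),
--    producing tab-separated myid/count pairs in /tmp/token_counts.tsv

-- 3. load the counts and update from them in one set-based statement
CREATE TEMP TABLE token_counts (myid integer, token_count integer);
\copy token_counts FROM '/tmp/token_counts.tsv'

UPDATE my_table mt
SET token_count = tc.token_count
FROM token_counts tc
WHERE mt.myid = tc.myid;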
For both your approach and the last step of my approach #2, update token_count in batches of 5000 rows (e.g. set rowcount to 5000, and loop the updates, adding WHERE token_count IS NULL to the query).
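A minimal sketch of that batching loop for PostgreSQL, as a PL/pgSQL DO block (the 5000-row batch size is arbitrary, and it reuses the array_length counting from the earlier answer; note a DO block runs in a single transaction, so on newer PostgreSQL a procedure with a COMMIT after each batch releases locks sooner):

DO $$
DECLARE
    rows_done integer;
BEGIN
    LOOP
        -- process the next 5000 not-yet-counted rows;
        -- coalesce to 0 so rows with NULL longtext are not re-selected forever
        UPDATE my_table
        SET token_count = coalesce(array_length(regexp_split_to_array(trim(longtext), E'\\W+'), 1), 0)
        WHERE myid IN (
            SELECT myid
            FROM my_table
            WHERE token_count IS NULL
            LIMIT 5000
        );
        GET DIAGNOSTICS rows_done = ROW_COUNT;
        EXIT WHEN rows_done = 0;
    END LOOP;
END $$;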