Optimize INSERT / UPDATE / DELETE operation

I wonder if the following script can be optimized somehow. It does write a lot to disk because it deletes possibly up-to-date rows and reinserts them. I was thinking about applying something like "insert ... on duplicate key update" and found some possibilities for single-row updates but I don't know how to apply it in the context of INSERT INTO ... SELECT query.
CREATE OR REPLACE FUNCTION update_member_search_index() RETURNS VOID AS $$
DECLARE
member_content_type_id INTEGER;
BEGIN
member_content_type_id :=
(SELECT id FROM django_content_type
WHERE app_label='web' AND model='member');
DELETE FROM watson_searchentry WHERE content_type_id = member_content_type_id;
INSERT INTO watson_searchentry (engine_slug, content_type_id, object_id
, object_id_int, title, description, content
, url, meta_encoded)
SELECT 'default',
member_content_type_id,
web_member.id,
web_member.id,
web_member.name,
'',
web_user.email||' '||web_member.normalized_name||' '||web_country.name,
'',
'{}'
FROM web_member
INNER JOIN web_user ON (web_member.user_id = web_user.id)
INNER JOIN web_country ON (web_member.country_id = web_country.id)
WHERE web_user.is_active=TRUE;
END;
$$ LANGUAGE plpgsql;
EDIT: Schemas of web_member, watson_searchentry, web_user, web_country: http://pastebin.com/3tRVPPVi.
The main point is to update columns title and content in watson_searchentry. There is a trigger on the table that sets value of column search_tsv based on these columns.
(content_type_id, object_id_int) in watson_searchentry is unique pair in the table but atm the index is not present (there is no use for it).
This script should be run at most once a day for full rebuilds of search index and occasionally after importing some data.

Modified table definition
If you really need those columns to be NOT NULL and you really need the string 'default' as default for engine_slug, I would advise introducing column defaults:
COLUMN | TYPE | Modifiers
-----------------+-------------------------+---------------------
id | INTEGER | NOT NULL DEFAULT ...
engine_slug | CHARACTER VARYING(200) | NOT NULL DEFAULT 'default'
content_type_id | INTEGER | NOT NULL
object_id | text | NOT NULL
object_id_int | INTEGER |
title | CHARACTER VARYING(1000) | NOT NULL
description | text | NOT NULL DEFAULT ''
content | text | NOT NULL
url | CHARACTER VARYING(1000) | NOT NULL DEFAULT ''
meta_encoded | text | NOT NULL DEFAULT '{}'
search_tsv | tsvector | NOT NULL
...
DDL statement would be:
ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';
Etc.
Then you don't have to insert those values manually every time.
Also: object_id text NOT NULL, object_id_int INTEGER? That's odd. I guess you have your reasons ...
I'll go with your updated requirement:
The main point is to update columns title and content in watson_searchentry
Of course, you must add a UNIQUE constraint to enforce your requirements:
ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int);
The accompanying index will be used. By this query for starters.
BTW, I almost never use varchar(n) in Postgres. Just text. Here's one reason.
Query with data-modifying CTEs
This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.
Additionally, this query only deletes what has to be deleted, and updates what can be updated.
WITH ctyp AS (
SELECT id AS content_type_id
FROM django_content_type
WHERE app_label = 'web'
AND model = 'member'
)
, sel AS (
SELECT ctyp.content_type_id
,m.id AS object_id_int
,m.id::text AS object_id -- explicit cast!
,m.name AS title
,concat_ws(' ', u.email,m.normalized_name,c.name) AS content
-- other columns have column default now.
FROM web_user u
JOIN web_member m ON m.user_id = u.id
JOIN web_country c ON c.id = m.country_id
CROSS JOIN ctyp
WHERE u.is_active
)
, del AS ( -- only if you want to del all other entries of same type
DELETE FROM watson_searchentry w
USING ctyp
WHERE w.content_type_id = ctyp.content_type_id
AND NOT EXISTS (
SELECT 1
FROM sel
WHERE sel.object_id_int = w.object_id_int
)
)
, up AS ( -- update existing rows
UPDATE watson_searchentry w
SET object_id = s.object_id
,title = s.title
,content = s.content
FROM sel s
WHERE w.content_type_id = s.content_type_id
AND w.object_id_int = s.object_id_int
)
-- insert new rows
INSERT INTO watson_searchentry (
content_type_id, object_id_int, object_id, title, content)
SELECT sel.* -- safe to use, because col list is defined accordingly above
FROM sel
LEFT JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE w1.content_type_id IS NULL;
The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.
The CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.
In the CTE del I avoid deleting rows that can be updated.
In the CTE up those rows are updated instead.
Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.
Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.
Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.
Replacing the UPDATEs with DELETE and INSERT may or may not be a lot more expensive. Internally every UPDATE results in a new row version anyways, due to the MVCC model.
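On Postgres 9.5 or later, the "insert ... on duplicate key update" idea from the question maps to INSERT ... ON CONFLICT DO UPDATE, which also works with INSERT ... SELECT. A minimal sketch, assuming the UNIQUE constraint ws_uni and the column defaults suggested above are in place and that search_tsv keeps being set by the existing trigger. It does not remove stale rows, so a separate DELETE would still be needed for those:

```sql
-- Sketch only: assumes Postgres 9.5+, constraint ws_uni, and the column defaults from above.
INSERT INTO watson_searchentry AS w
       (content_type_id, object_id_int, object_id, title, content)
SELECT ct.id
     , m.id
     , m.id::text
     , m.name
     , concat_ws(' ', u.email, m.normalized_name, c.name)
FROM   web_member  m
JOIN   web_user    u ON u.id = m.user_id
JOIN   web_country c ON c.id = m.country_id
CROSS  JOIN (
   SELECT id
   FROM   django_content_type
   WHERE  app_label = 'web'
   AND    model = 'member'
   ) ct
WHERE  u.is_active
ON     CONFLICT ON CONSTRAINT ws_uni DO UPDATE
SET    object_id = EXCLUDED.object_id
     , title     = EXCLUDED.title
     , content   = EXCLUDED.content
-- only rewrite rows whose values actually changed
WHERE  (w.object_id, w.title, w.content)
       IS DISTINCT FROM (EXCLUDED.object_id, EXCLUDED.title, EXCLUDED.content);
```

The WHERE clause on DO UPDATE skips rows whose values did not change, which should avoid writing new row versions for them.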
Speed first
If you don't really care about preserving old rows, your simpler approach may be faster: Delete everything and insert new rows. Also, wrapping into a plpgsql function saves a bit of planning overhead. Your function basically, with a couple of minor simplifications and observing the defaults added above:
CREATE OR REPLACE FUNCTION update_member_search_index()
RETURNS VOID AS
$func$
DECLARE
_ctype_id int := (
SELECT id
FROM django_content_type
WHERE app_label='web'
AND model = 'member'
); -- you can assign at declaration time. saves another statement
BEGIN
DELETE FROM watson_searchentry
WHERE content_type_id = _ctype_id;
INSERT INTO watson_searchentry
(content_type_id, object_id, object_id_int, title, content)
SELECT _ctype_id, m.id::text, m.id, m.name
,u.email || ' ' || m.normalized_name || ' ' || c.name
FROM web_member m
JOIN web_user u ON u.id = m.user_id
JOIN web_country c ON c.id = m.country_id
WHERE u.is_active;
END
$func$ LANGUAGE plpgsql;
I even refrain from using concat_ws(): It is safe against NULL values and simplifies code, but a bit slower than simple concatenation.
Also:
There is a trigger on the table that sets value of column search_tsv
based on these columns.
It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.

Related

How to find intersecting geographies between two tables recursively

I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies may have 150,000,000 rows, paths may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))
Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geometries?
In other words, if an initial geography has a corresponding intersecting path we need to also find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we've found all possible intersections.
The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more elegant and faster way of doing this than what I've come up with.
Your function can be radically simplified.
Setup
I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry for both columns just as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
Solution 1: Your function optimized
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text)
LANGUAGE plpgsql AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _path p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
Major points
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct) to avoid redundant work.
Be sure to have spatial GiST indexes on geographies.geography and paths.path:
CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5 index-only scans would be an option for GiST indexes. You might add id as second index column. The benefit depends on many factors, you'd have to test. However, there is no fitting GiST operator class for the uuid type. It would work with bigint after installing the extension btree_gist:
Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. Default btree index, fk_id must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE in this case. Related:
How to check if a table exists in a given schema
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
Count rows affected by DELETE
You need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
It may help to ANALYZE temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
_currentGeographyLength - _geographyLength > 0 is an awkward and more expensive way of saying _currentGeographyLength > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:
It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).
I don't know s2cell id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[].
Also since you commented:
There are always exactly 13 fk_ids passed in.
This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(VARIADIC _fk_ids text[]) ...
Details:
Pass multiple values in single parameter
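To illustrate how a VARIADIC parameter is declared and called, here is a small self-contained sketch. The function count_fk_ids is hypothetical and only counts its arguments; it stands in for the real function, which would keep its body unchanged:

```sql
-- Hypothetical helper, for illustration only: returns the number of values passed.
CREATE OR REPLACE FUNCTION public.count_fk_ids(VARIADIC _fk_ids text[])
  RETURNS int
  LANGUAGE sql AS
$func$
SELECT cardinality(_fk_ids);
$func$;

-- Pass the values as a plain argument list ...
SELECT public.count_fk_ids('foo', 'bar', 'baz');

-- ... or pass an existing array with the VARIADIC keyword in the call.
SELECT public.count_fk_ids(VARIADIC '{foo,bar,baz}'::text[]);
```

Inside the function, _fk_ids is an ordinary text[] either way, so an expression like fk_id = ANY(_fk_ids) works as before.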
Solution 2: Plain SQL with recursive CTE
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($fk_ids) -- your input array here
UNION
SELECT COALESCE(p.id, g.id), g.geography::text, p.path::text
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
Major additional points
The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, we can ignore the geography columns for this. Cast back to geography for the join. Shouldn't cost too much extra.
We need two LEFT JOIN so not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary
intermediate table. The manual:
For UNION (but not UNION ALL), discard duplicate rows and rows that
duplicate any previous result row. Include all remaining rows in the
result of the recursive query, and also place them in a temporary
intermediate table.
So which is faster?
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
See the OP's feedback in the comment.
I figured it'd be good to post my own solution here even if it isn't optimal.
Here is what I came up with (using Steve Chambers' advice):
CREATE OR REPLACE FUNCTION public.function_name(
_fk_ids character varying[])
RETURNS TABLE(id uuid, type character varying)
LANGUAGE 'plpgsql'
COST 100.0
VOLATILE
ROWS 1000.0
AS $function$
DECLARE
_pathLength bigint;
_geographyLength bigint;
_currentPathLength bigint;
_currentGeographyLength bigint;
BEGIN
DROP TABLE IF EXISTS _pathIds;
DROP TABLE IF EXISTS _geographyIds;
CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);
-- get all geographies in the specified _fk_ids
INSERT INTO _geographyIds
SELECT g.id
FROM geographies g
WHERE g.fk_id= ANY(_fk_ids);
_pathLength := 0;
_geographyLength := 0;
_currentPathLength := 0;
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- _pathIds := ARRAY[]::uuid[];
WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
_pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
-- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids
INSERT INTO _pathIds
SELECT DISTINCT p.id
FROM paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE
g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
p.id NOT IN (SELECT _pathIds.id from _pathIds);
-- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
INSERT INTO _geographyIds
SELECT DISTINCT g.id
FROM geographies g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE
p.id IN (SELECT _pathIds.id FROM _pathIds) AND
g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);
_currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
_currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
END LOOP;
RETURN QUERY
SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
UNION ALL
SELECT _pathIds.id, 'path' AS type FROM _pathIds;
END;
$function$;
Sample plot and data from this script
It can be pure relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. The point is easier to create test data with and to test than a generic geography but it should be simple to adapt.
create table path (
path_text text primary key,
path geometry(linestring) not null
);
create table point (
point_text text primary key,
point geometry(point) not null
);
A type to keep the aggregate function's state:
create type mpath_mpoint as (
mpath geometry(multilinestring),
mpoint geometry(multipoint)
);
The state building function:
create or replace function path_point_intersect (
_i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$
with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
select array_agg((mpath, mpoint)::mpath_mpoint)
from (
select
st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
(
select st_collect(gd)
from (
select gd from st_dump(i.mpath) a (a, gd)
union all
select gd from st_dump(e.mpath) b (a, gd)
) s
) as mpath
from i inner join e on st_intersects(i.mpoint, e.mpoint)
union all
select i.mpoint, i.mpath
from i inner join e on not st_intersects(i.mpoint, e.mpoint)
union all
select e.mpoint, e.mpath
from e
where not exists (
select 1 from i
where st_intersects(i.mpoint, e.mpoint)
)
) s;
$$ language sql;
The aggregate:
create aggregate path_point_agg (mpath_mpoint) (
sfunc = path_point_intersect,
stype = mpath_mpoint[]
);
This query will return a set of multilinestring, multipoint strings containing the matched paths/points:
select st_astext(mpath), st_astext(mpoint)
from unnest((
select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
from (
select path, st_union(point) as mpoint
from
path
inner join
point on st_intersects(path, point)
group by path
) s
)) m (mpath, mpoint)
;
st_astext | st_astext
-----------------------------------------------------------+-----------------------------
MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6)) | MULTIPOINT(-8 -8,2 -8)
MULTILINESTRING((-7 -4,-3 4,-5 6)) | MULTIPOINT(-6 -2)

Update statement using a WHERE clause that contains columns with null Values

I am updating a column on one table using data from another table. The WHERE clause is based on multiple columns and some of the columns are null. From my thinking, these nulls are what is throwing off the standard UPDATE TABLE SET X=Y WHERE A=B statement.
See this SQL Fiddle of the two tables, where I am trying to update table_one based on data from table_two.
My query currently looks like this:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.invoice_number = table_two.invoice_number AND
table_one.submitted_by = table_two.submitted_by AND
table_one.passport_number = table_two.passport_number AND
table_one.driving_license_number = table_two.driving_license_number AND
table_one.national_id_number = table_two.national_id_number AND
table_one.tax_pin_identification_number = table_two.tax_pin_identification_number AND
table_one.vat_number = table_two.vat_number AND
table_one.ggcg_number = table_two.ggcg_number AND
table_one.national_association_number = table_two.national_association_number
The query fails for some rows in that table_one.x isn't getting updated when any of the columns in either table are null. i.e. it only gets updated when all columns have some data.
This question is related to my earlier one here on SO where I was getting distinct values from a large data set using DISTINCT ON. What I want now is to populate the large data set with a value from the table which has unique fields.
UPDATE
I used the first update statement provided by @binoternary. For small tables, it runs in a flash. For example, one table with 20,000 records was updated in about 20 seconds. But another table with 9 million plus records has been running for 20 hrs so far! See below the output of EXPLAIN:
Update on table_one (cost=0.00..210634237338.87 rows=13615011125 width=1996)
-> Nested Loop (cost=0.00..210634237338.87 rows=13615011125 width=1996)
Join Filter: ((((my_update_statement_here))))
-> Seq Scan on table_one (cost=0.00..610872.62 rows=9661262 width=1986)
-> Seq Scan on table_two (cost=0.00..6051.98 rows=299998 width=148)
The EXPLAIN ANALYZE option took also forever so I canceled it.
Any ideas on how to make this type of update faster? Even if it means using a different update statement or even using a custom function to loop through and do the update.
Since null = null evaluates to null rather than true (and WHERE only keeps rows where the condition is true), you need to check whether both fields are null in addition to the equality check:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
(table_one.invoice_number = table_two.invoice_number
OR (table_one.invoice_number is null AND table_two.invoice_number is null))
AND
(table_one.submitted_by = table_two.submitted_by
OR (table_one.submitted_by is null AND table_two.submitted_by is null))
AND
-- etc
You could also use the coalesce function which is more readable:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc
But you need to be careful about the default values (last argument to coalesce).
Its data type should match the column type (so that you don't end up comparing dates with numbers, for example) and the default should be a value that doesn't appear in the data.
E.g. coalesce(null, 1) = coalesce(1, 1) is a situation you'd want to avoid.
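Postgres also offers the null-safe comparison IS NOT DISTINCT FROM, which treats two nulls as equal and needs no placeholder value. A sketch using the question's table and column names, with only two of the nine columns shown. Whether the planner can use an index for this comparison is something to verify on your version:

```sql
-- How the comparisons behave:
SELECT NULL = NULL                     AS plain_equals  -- yields NULL, so WHERE rejects the row
     , NULL IS NOT DISTINCT FROM NULL  AS both_null     -- yields true
     , 1 IS NOT DISTINCT FROM NULL     AS one_null;     -- yields false

-- Applied to the update (extend the WHERE clause with the remaining columns):
UPDATE table_one t1
SET    x = t2.y
FROM   table_two t2
WHERE  t1.invoice_number IS NOT DISTINCT FROM t2.invoice_number
AND    t1.submitted_by   IS NOT DISTINCT FROM t2.submitted_by;
```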
Update (regarding performance):
Seq Scan on table_two - this suggests that you don't have any indexes on table_two.
So if you update a row in table_one then to find a matching row in table_two the database basically has to scan through all the rows one by one until it finds a match.
The matching rows could be found much faster if the relevant columns were indexed.
On the flipside if table_one has any indexes then that slows down the update.
According to this performance guide:
Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.
Another suggestion from the same guide that might be helpful is:
If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.
So for example if table_one has an id column you could add something like
and table_one.id between x and y
to the where condition and run the query several times changing the values of x and y so that all rows are covered.
The EXPLAIN ANALYZE option took also forever
You might want to be careful when using the ANALYZE option with EXPLAIN when dealing with statements with side effects.
According to documentation:
Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.
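One way to measure a data-modifying statement without keeping its changes is to run EXPLAIN ANALYZE inside a transaction and roll it back. A minimal sketch, assuming table_one has an id column (not confirmed by the question) so the test can be limited to a small range, and showing only one of the join columns:

```sql
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE table_one t1
SET    x = t2.y
FROM   table_two t2
WHERE  t1.id BETWEEN 1 AND 1000   -- assumed id column; a small range keeps the test short
AND    t1.invoice_number = t2.invoice_number;

ROLLBACK;  -- undo the changes the UPDATE actually made
```

The rows are still really updated before the rollback, so the timing reflects the actual work, including index and trigger overhead.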
Try the below, similar to @binoternary's answer above. Just beat me to the answer.
update table_one
set column_x = (select column_y from table_two
where
(( table_two.invoice_number = table_one.invoice_number)OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
and ((table_two.submitted_by=table_one.submitted_by)OR (table_two.submitted_by IS NULL AND table_one.submitted_by IS NULL))
and ((table_two.passport_number=table_one.passport_number)OR (table_two.passport_number IS NULL AND table_one.passport_number IS NULL))
and ((table_two.driving_license_number=table_one.driving_license_number)OR (table_two.driving_license_number IS NULL AND table_one.driving_license_number IS NULL))
and ((table_two.national_id_number=table_one.national_id_number)OR (table_two.national_id_number IS NULL AND table_one.national_id_number IS NULL))
and ((table_two.tax_pin_identification_number=table_one.tax_pin_identification_number)OR (table_two.tax_pin_identification_number IS NULL AND table_one.tax_pin_identification_number IS NULL))
and ((table_two.vat_number=table_one.vat_number)OR (table_two.vat_number IS NULL AND table_one.vat_number IS NULL))
and ((table_two.ggcg_number=table_one.ggcg_number)OR (table_two.ggcg_number IS NULL AND table_one.ggcg_number IS NULL))
and ((table_two.national_association_number=table_one.national_association_number)OR (table_two.national_association_number IS NULL AND table_one.national_association_number IS NULL))
);
You can use a null check function like Oracle's NVL.
For Postgres, you will have to use coalesce.
i.e. your query can look like :
UPDATE table_one t1 SET x = t2.y
FROM table_two t2
WHERE
-- note: written this way, a NULL on either side matches any value on the other side
coalesce(t1.invoice_number, t2.invoice_number, '1') = coalesce(t2.invoice_number, t1.invoice_number, '1')
AND
coalesce(t1.submitted_by, t2.submitted_by, '1') = coalesce(t2.submitted_by, t1.submitted_by, '1');
Your current query joins two tables using Nested Loop, which means that the server processes
9,661,262 * 299,998 = 2,898,359,277,476
rows. No wonder it takes forever.
To make the join efficient you need an index on all joined columns. The problem is NULL values.
If you use a function on the joined columns, generally the index can't be used.
If you use an expression like this in the JOIN:
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
an index can't be used.
So, we need an index and we need to do something with NULL values to make index usable.
We don't need to make any changes in table_one, because it has to be scanned in full in any case.
But, table_two definitely can be improved. Either change the table itself, or create a separate (temporary) table. It has only 300K rows, so it should not be a problem.
Make all columns that are used in the JOIN to be NOT NULL.
CREATE TABLE table_two (
id int4 NOT NULL,
invoice_number varchar(30) NOT NULL,
submitted_by varchar(20) NOT NULL,
passport_number varchar(30) NOT NULL,
driving_license_number varchar(30) NOT NULL,
national_id_number varchar(30) NOT NULL,
tax_pin_identification_number varchar(30) NOT NULL,
vat_number varchar(30) NOT NULL,
ggcg_number varchar(30) NOT NULL,
national_association_number varchar(30) NOT NULL,
column_y int,
CONSTRAINT table_two_pkey PRIMARY KEY (id)
);
Update the table and replace NULL values with '', or some other appropriate value.
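If you modify the existing table_two instead of creating a new one, the replacement step could look roughly like this. It is a sketch with only three of the nine columns shown, and it assumes the empty string never occurs as a real value in those columns:

```sql
-- Replace NULL with '' in the join columns (extend both lists with the remaining columns).
UPDATE table_two
SET    invoice_number  = COALESCE(invoice_number, '')
     , submitted_by    = COALESCE(submitted_by, '')
     , passport_number = COALESCE(passport_number, '')
WHERE  invoice_number  IS NULL
   OR  submitted_by    IS NULL
   OR  passport_number IS NULL;

-- Then forbid NULL in these columns from now on.
ALTER TABLE table_two
   ALTER COLUMN invoice_number  SET NOT NULL
 , ALTER COLUMN submitted_by    SET NOT NULL
 , ALTER COLUMN passport_number SET NOT NULL;
```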
Create an index on all columns that are used in JOIN plus column_y. column_y has to be included last in the index. I assume that your UPDATE is well-formed, so index should be unique.
CREATE UNIQUE INDEX IX ON table_two
(
invoice_number,
submitted_by,
passport_number,
driving_license_number,
national_id_number,
tax_pin_identification_number,
vat_number,
ggcg_number,
national_association_number,
column_y
);
The query will become
UPDATE table_one SET column_x = table_two.column_y
FROM table_two
WHERE
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
Note, that COALESCE is used only on table_one columns.
It is also a good idea to do UPDATE in batches, rather than the whole table at once. For example, pick a range of ids to update in a batch.
UPDATE table_one SET column_x = table_two.column_y
FROM table_two
WHERE
table_one.id >= <some_starting_value> AND
table_one.id < <some_ending_value> AND
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
You can use the coalesce function, which returns its first non-null argument, to substitute a default for null values before comparing them.
Null-related functions here.

PostgreSQL - Check foreign key exists when doing a SELECT

Suppose I have the following data:
Table some_table:
some_table_id | value | other_table_id
--------------------------------------
1 | foo | 1
2 | bar | 2
Table other_table:
other_table_id | value
----------------------
1 | foo
2 | bar
Here, some_table has a foreign key that references column other_table_id of other_table from its own column of the same name.
With the following query in PostgreSQL:
SELECT *
FROM some_table
WHERE other_table_id = 3;
As you can see, 3 does not exist in other_table, so this query will obviously return 0 results.
Without doing a second query, is there a way to know if the foreign key that I am using as a filter effectively does not exist in the other_table?
Ideally as an error that could later be parsed (as happens when doing an INSERT or an UPDATE with a wrong foreign key, for example).
You can exploit a feature of PL/pgSQL to implement this very cheaply:
CREATE OR REPLACE FUNCTION f_select_from_some_tbl(int)
RETURNS SETOF some_table AS
$func$
BEGIN
RETURN QUERY
SELECT *
FROM some_table
WHERE other_table_id = $1;
IF NOT FOUND THEN
RAISE WARNING 'Call with non-existing other_table_id >>%<<', $1;
END IF;
END
$func$ LANGUAGE plpgsql;
A final RETURN; is optional in this case.
The WARNING is only raised if your query doesn't return any rows. I am not raising an ERROR in the example, since this would roll back the whole transaction (but you can do that if it fits your needs).
We've added a code example to the manual with Postgres 9.3 to demonstrate this.
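If an error is preferred over a warning, as the question asks, the same pattern can be tightened. The following is only a sketch: the function name f_select_from_some_tbl_strict and the parameter name _other_table_id are made up for illustration. It checks other_table only when the main query returned nothing, so the extra lookup costs nothing in the common case, and it reuses the standard foreign_key_violation condition name so that callers can parse the error like a real constraint violation:

```sql
-- Hypothetical strict variant: raises an error only when the id is
-- really missing from other_table, not merely unused in some_table.
CREATE OR REPLACE FUNCTION f_select_from_some_tbl_strict(_other_table_id int)
  RETURNS SETOF some_table AS
$func$
BEGIN
   RETURN QUERY
   SELECT *
   FROM   some_table
   WHERE  other_table_id = _other_table_id;

   IF NOT FOUND THEN
      -- No rows returned: find out whether the referenced id exists at all.
      IF NOT EXISTS (SELECT 1 FROM other_table
                     WHERE  other_table_id = _other_table_id) THEN
         RAISE EXCEPTION 'other_table_id % does not exist in other_table',
                         _other_table_id
               USING ERRCODE = 'foreign_key_violation';
      END IF;
   END IF;
END
$func$ LANGUAGE plpgsql;

-- With the sample data from the question this call raises the error,
-- because other_table only contains the ids 1 and 2:
SELECT * FROM f_select_from_some_tbl_strict(3);
```

Keep in mind that, unlike the warning, the exception aborts the surrounding transaction unless it is caught.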
If you perform an INSERT or UPDATE on some_table, specifying an other_table_id value that does not in fact exist in other_table, then you will get an error arising from violation of the foreign key constraint. SELECT queries are therefore your primary concern.
One way you could address the issue with SELECT queries would be to transform your queries to perform an outer join with other_table, like so:
SELECT st.*
FROM
other_table ot
LEFT JOIN some_table st ON st.other_table_id = ot.other_table_id
WHERE ot.other_table_id = 3;
That query will always return at least one row if any other_table row has other_table_id = 3. In that case, if there is no matching some_table row, then it will return exactly one row, with that row having all columns NULL (given that it selects only columns from some_table), even columns that are declared not null.
If you want such queries to raise an error then you'll probably need to write a custom function to assist, but it can be done. I'd probably implement it in PL/pgSQL, using that language's RAISE statement.

PLpgSQL (or ANSI SQL?) Conditional calculation on a column

I want to write a stored procedure that performs a conditional calculation on a column. Ideally the implementation of the SP will be db agnostic, if possible. If not, the underlying db is PostgreSQL (v8.4), so that takes precedence.
The underlying tables being queried looks like this:
CREATE TABLE treatment_def ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo_group_def ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL
);
CREATE TABLE foo ( id SERIAL PRIMARY KEY,
name VARCHAR(16) NOT NULL,
trtmt_id INT REFERENCES treatment_def(id) ON DELETE RESTRICT,
foo_grp_id INT REFERENCES foo_group_def(id) ON DELETE RESTRICT,
is_male BOOLEAN NOT NULL,
cost REAL NOT NULL
);
I want to write a SP that returns the following 'table' result set:
treatment_name, foo_group_name, averaged_cost
where the averaged cost is calculated differently, depending on whether the row's is_male flag is set to true or false.
For the purpose of this question, let's assume that if the is_male flag is set to true, then the averaged cost is calculated as the SUM of the cost values for the grouping, and if the is_male flag is set to false, then the cost value is calculated as the AVERAGE of the cost values for the grouping.
(Obviously) the data is being grouped by trtmt_id, foo_grp_id (and is_male?).
I have a rough idea about how to write the SQL if there was no conditional test on the is_male flag. However, I could do with some help in writing the SP as defined above.
Here is my first attempt:
CREATE TYPE FOO_RESULT AS (treatment_name VARCHAR(16), foo_group_name VARCHAR(64), averaged_cost DOUBLE);
// Outline plpgsql (Pseudo code)
CREATE FUNCTION somefunc() RETURNS SETOF FOO_RESULT AS $$
BEGIN
RETURN QUERY SELECT t.name treatment_name, g.name group_name, averaged_cost FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY f.trtmt_id, f.foo_grp_id;
END;
$$ LANGUAGE plpgsql;
I would appreciate some help on how to write this SP correctly to implement the conditional calculation in the column results.
Could look like this:
CREATE FUNCTION somefunc()
RETURNS TABLE (
treatment_name varchar(16)
, foo_group_name varchar(16)
, averaged_cost double precision)
AS
$BODY$
SELECT t.name -- AS treatment_name
, g.name -- AS group_name
, CASE WHEN f.is_male THEN sum(f.cost)
ELSE avg(f.cost) END -- AS averaged_cost
FROM foo f
JOIN treatment_def t ON t.id = f.trtmt_id
JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
$BODY$ LANGUAGE sql;
Major points
I used an sql function, not plpgsql. You can use either, I just did it to shorten the code. plpgsql might be slightly faster, because the query plan is cached.
I skipped the custom composite type. You can do that simpler with RETURNS TABLE.
I would generally advise to use the data type text instead of varchar(n). Makes your life easier.
Be careful not to use names of the RETURN parameter without table-qualifying (tbl.col) in the function body, or you will create naming conflicts. That is why I commented the aliases.
I adjusted the GROUP BY clause. The original didn't work. (Neither does the one in @Ken's answer.)
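As a quick check of the function above, here is a small usage sketch. The sample rows are invented for illustration, and it assumes freshly created tables, so that the serial ids of the first treatment and the first group are both 1:

```sql
-- Hypothetical sample data: one treatment, one group, two males, two females.
INSERT INTO treatment_def (name) VALUES ('t1');
INSERT INTO foo_group_def (name) VALUES ('g1');
INSERT INTO foo (name, trtmt_id, foo_grp_id, is_male, cost) VALUES
  ('a', 1, 1, true,  10)
, ('b', 1, 1, true,  20)
, ('c', 1, 1, false, 10)
, ('d', 1, 1, false, 20);

SELECT * FROM somefunc();
-- Two rows come back for (t1, g1), in no guaranteed order:
--   averaged_cost = 30  for is_male = true  (sum of 10 and 20)
--   averaged_cost = 15  for is_male = false (average of 10 and 20)
```

Because is_male is part of the GROUP BY but not of the result, each treatment and group combination can appear twice in the output. Add f.is_male to the SELECT list and to the RETURNS TABLE definition if the two rows need to be told apart.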
You should be able to use a CASE statement:
SELECT t.name treatment_name, g.name group_name,
CASE is_male WHEN true then SUM(cost)
ELSE AVG(cost) END AS averaged_cost
FROM foo f
INNER JOIN treatment_def t ON t.id = f.trtmt_id
INNER JOIN foo_group_def g ON g.id = f.foo_grp_id
GROUP BY 1, 2, f.is_male;
I'm not familiar with PLpgSQL, so I'm not sure of the exact syntax for the BOOLEAN column, but the above should at least get you started in the right direction.

SQL pivoted table is read-only and cells can't be edited?

If I create a VIEW using this pivot table query, it isn't editable. The cells are read-only and give me the SQL2005 error: "No row was updated. The data in row 2 was not committed. Update or insert of view or function 'VIEWNAME' failed because it contains a derived or constant field."
Any ideas on how this could be solved OR is a pivot like this just never editable?
SELECT n_id,
MAX(CASE field WHEN 'fId' THEN c_metadata_value ELSE ' ' END) AS fId,
MAX(CASE field WHEN 'sID' THEN c_metadata_value ELSE ' ' END) AS sID,
MAX(CASE field WHEN 'NUMBER' THEN c_metadata_value ELSE ' ' END) AS NUMBER
FROM metadata
GROUP BY n_id
Assuming you have a unique constraint on (n_id, field), which means that at most one row can match, you can (in theory at least) use an INSTEAD OF trigger.
This would be easier with MERGE (but that is not available until SQL Server 2008) as you need to cover UPDATES of existing data, INSERTS (Where a NULL value is set to a NON NULL one) and DELETES where a NON NULL value is set to NULL.
One thing you would need to consider here is how to cope with UPDATES that set all of the columns in a row to NULL. I did this while testing the code below and was quite confused for a minute or two until I realised that this had deleted all the rows in the base table for an n_id (which meant the operation was not reversible via another UPDATE statement). This issue could be avoided by having the VIEW definition OUTER JOIN onto whatever table n_id is the PK of.
An example of the type of thing is below. You would also need to consider potential race conditions in the INSERT/DELETE code indicated and whether you need some additional locking hints in there.
CREATE TRIGGER trig
ON pivoted
INSTEAD OF UPDATE
AS
BEGIN
SET nocount ON;
DECLARE @unpivoted TABLE (
n_id INT,
field VARCHAR(10),
c_metadata_value VARCHAR(10))
/*List the columns explicitly so the mapping does not rely on column order*/
INSERT INTO @unpivoted (n_id, field, c_metadata_value)
SELECT n_id, col, data
FROM inserted UNPIVOT (data FOR col IN (fid, sid, NUMBER) ) AS unpvt
WHERE data IS NOT NULL
UPDATE m
SET m.c_metadata_value = u.c_metadata_value
FROM metadata m
JOIN @unpivoted u
ON u.n_id = m.n_id
AND u.field = m.field;
/*You need to consider race conditions below*/
/*Restrict the DELETE to the n_ids touched by this UPDATE*/
DELETE FROM metadata
WHERE metadata.n_id IN (SELECT n_id FROM inserted)
AND NOT EXISTS(SELECT *
FROM @unpivoted u
WHERE metadata.n_id = u.n_id
AND u.field = metadata.field)
INSERT INTO metadata
SELECT u.n_id,
u.field,
u.c_metadata_value
FROM @unpivoted u
WHERE NOT EXISTS (SELECT *
FROM metadata m
WHERE m.n_id = u.n_id
AND u.field = m.field)
END
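With such a trigger in place, the view can be updated like an ordinary table. The statements below are only a usage sketch: they assume the pivot query was saved as a view named pivoted (the name used in the trigger), that the view yields NULL rather than ' ' for missing fields, and the n_id 1 and the value 'abc' are made up for illustration:

```sql
-- The UPDATE targets the view; the INSTEAD OF trigger translates it
-- into changes on the metadata base table.
UPDATE pivoted
SET    sID = 'abc'
WHERE  n_id = 1;

-- Inspect the base table to see the resulting row for field 'sID'.
SELECT n_id, field, c_metadata_value
FROM   metadata
WHERE  n_id = 1;
```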
You'll have to create a trigger on the view, because a direct update is not possible:
CREATE TRIGGER TrMyViewUpdate on MyView
INSTEAD OF UPDATE
AS
BEGIN
SET NOCOUNT ON;
UPDATE MyTable
SET ...
FROM INSERTED...
END