Say I have a sample table:
id_pk  value
------------
1      a
2      b
3      c
And I have a sample PL/SQL block, which has a query that currently selects a single row into an array:
declare
type t_table is table of myTable%rowtype;
n_RequiredId myTable.id_pk%type := 1;
t_Output t_table := t_table();
begin
select m.id_pk, m.value
bulk collect into t_Output
from myTable m
where m.id_pk = n_RequiredId;
end;
What I need to do is to implement the ability to select a single row into an array, as shown in the block above, OR to select all rows into an array if n_RequiredId, which is actually a user-input parameter, is set to null.
And the question is: what's the best practice for handling such a situation?
I can think of modifying where clause of my query to something like this:
where m.id_pk = nvl(n_RequiredId, m.id_pk);
but I suppose that's going to slow down the query when the parameter isn't null, and I remember Kyte saying something really bad about this approach.
I can also think of implementing the following PL/SQL logic:
if n_RequiredId is null then
select m.id_pk, m.value bulk collect into t_Output from myTable m;
else
select m.id_pk, m.value bulk collect
into t_Output
from myTable m
where m.id_pk = n_RequiredId;
end if;
But this would become too complex if I encounter more than one parameter of this kind.
What would you advise?
Yes. All of the following:
WHERE m.id_pk = NVL(n_RequiredId, m.id_pk);
WHERE m.id_pk = COALESCE(n_RequiredId, m.id_pk);
WHERE (n_RequiredId IS NULL OR m.id_pk = n_RequiredId);
...are not sargable. They will work, but they will perform the worst of the available options.
If you only have one parameter, the IF/ELSE and separate, tailored statements are a better alternative.
The next option after that is dynamic SQL. But coding dynamic SQL is pointless if you carry over the non-sargable predicates from the first example. Dynamic SQL allows you to tailor the query while accommodating numerous paths. But it also risks SQL injection, so it should be performed with parameterized queries (preferably within stored procedures/functions in packages).
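A minimal sketch of that idea for the original single-parameter example (the predicate-appending logic and the variable names are illustrative, not a fixed recipe):
declare
  type t_table is table of myTable%rowtype;
  n_RequiredId myTable.id_pk%type := 1;   -- user input; may be null
  t_Output     t_table := t_table();
  v_sql        varchar2(4000);
begin
  v_sql := 'select m.id_pk, m.value from myTable m where 1 = 1';
  if n_RequiredId is not null then
    -- only add the predicate (and its bind) when the parameter is supplied
    v_sql := v_sql || ' and m.id_pk = :id';
    execute immediate v_sql bulk collect into t_Output using n_RequiredId;
  else
    execute immediate v_sql bulk collect into t_Output;
  end if;
end;
/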
OMG_Ponies' and Rob van Wijk's answers are entirely correct, this is just supplemental.
There's a nice trick to make it easy to use bind variables and still use dynamic SQL. If you put all of the binds in a with clause at the beginning, you can always bind the same set of variables, whether or not you're going to use them.
For instance, say you have three parameters, representing a date range and an ID. If you want to just search on the ID, you could put the query together like this:
with parameters as (
select :start_date as start_date,
:end_date as end_date,
:search_id as search_id
from dual)
select *
from your_table
inner join parameters
on parameters.search_id = your_table.id;
On the other hand, if you need to search on the ID and date range, it could look like this:
with parameters as (
select :start_date as start_date,
:end_date as end_date,
:search_id as search_id
from dual)
select *
from your_table
inner join parameters
on parameters.search_id = your_table.id
and your_table.create_date between parameters.start_date
and parameters.end_date;
This may seem like a roundabout way of handling this, but the end result is that no matter how complicated your dynamic SQL gets, as long as it only needs those three parameters, the PL/SQL call is always something like:
execute immediate v_SQL using v_start_date, v_end_date, v_search_id;
In my experience it's better to make the SQL construction slightly more complicated in order to ensure that there's only one line where it actually gets executed.
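Putting it together, here is a hedged sketch of how the construction and the single execution step might look (v_SQL, your_table and the bind values are illustrative; I use OPEN ... FOR rather than EXECUTE IMMEDIATE because a standalone query needs a cursor or an INTO target):
declare
  v_SQL        varchar2(4000);
  v_start_date date   := null;   -- unused in this call, but still bound
  v_end_date   date   := null;
  v_search_id  number := 42;     -- hypothetical value
  v_result     sys_refcursor;
begin
  v_SQL := 'with parameters as (
              select :start_date as start_date,
                     :end_date   as end_date,
                     :search_id  as search_id
              from dual)
            select your_table.*
            from your_table
            inner join parameters
               on parameters.search_id = your_table.id';
  -- append further predicates conditionally; the bind list never changes
  if v_start_date is not null and v_end_date is not null then
    v_SQL := v_SQL || ' and your_table.create_date between parameters.start_date and parameters.end_date';
  end if;
  open v_result for v_SQL using v_start_date, v_end_date, v_search_id;
  -- fetch from v_result as needed ...
end;
/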
The NVL approach will usually work fine. The optimizer recognizes this pattern and will build a dynamic plan. The plan uses an index for a single value and a full table scan for a NULL.
Sample table and data
drop table myTable;
create table myTable(
id_pk number,
value varchar2(100),
constraint myTable_pk primary key (id_pk)
);
insert into myTable select level, level from dual connect by level <= 100000;
commit;
Execute with different predicates
--Execute predicates that return one row if the ID is set, or all rows if ID is null.
declare
type t_table is table of myTable%rowtype;
n_RequiredId myTable.id_pk%type := 1;
t_Output t_table := t_table();
begin
select /*+ SO_QUERY_1 */ m.id_pk, m.value
bulk collect into t_Output
from myTable m
where m.id_pk = nvl(n_RequiredId, m.id_pk);
select /*+ SO_QUERY_2 */ m.id_pk, m.value
bulk collect into t_Output
from myTable m
where m.id_pk = COALESCE(n_RequiredId, m.id_pk);
select /*+ SO_QUERY_3 */ m.id_pk, m.value
bulk collect into t_Output
from myTable m
where (n_RequiredId IS NULL OR m.id_pk = n_RequiredId);
end;
/
Get execution plans
select sql_id, child_number
from gv$sql
where lower(sql_text) like '%so_query_%'
and sql_text not like '%QUINE%'
and sql_text not like 'declare%';
select * from table(dbms_xplan.display_cursor(sql_id => '76ucq3bkgt0qa', cursor_child_no => 1, format => 'basic'));
select * from table(dbms_xplan.display_cursor(sql_id => '4vxf8yy5xd6qv', cursor_child_no => 1, format => 'basic'));
select * from table(dbms_xplan.display_cursor(sql_id => '457ypz0jpk3np', cursor_child_no => 1, format => 'basic'));
Bad plans for COALESCE and IS NULL OR
EXPLAINED SQL STATEMENT:
------------------------
SELECT /*+ SO_QUERY_2 */ M.ID_PK, M.VALUE FROM MYTABLE M WHERE M.ID_PK
= COALESCE(:B1 , M.ID_PK)
Plan hash value: 1229213413
-------------------------------------
| Id | Operation | Name |
-------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS FULL| MYTABLE |
-------------------------------------
EXPLAINED SQL STATEMENT:
------------------------
SELECT /*+ SO_QUERY_3 */ M.ID_PK, M.VALUE FROM MYTABLE M WHERE (:B1 IS
NULL OR M.ID_PK = :B1 )
Plan hash value: 1229213413
-------------------------------------
| Id | Operation | Name |
-------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | TABLE ACCESS FULL| MYTABLE |
-------------------------------------
Good plan for NVL
The FILTER operations allow the optimizer to choose a different plan at run time, depending on the input values.
EXPLAINED SQL STATEMENT:
------------------------
SELECT /*+ SO_QUERY_1 */ M.ID_PK, M.VALUE FROM MYTABLE M WHERE M.ID_PK
= NVL(:B1 , M.ID_PK)
Plan hash value: 730481884
----------------------------------------------------
| Id | Operation | Name |
----------------------------------------------------
| 0 | SELECT STATEMENT | |
| 1 | CONCATENATION | |
| 2 | FILTER | |
| 3 | TABLE ACCESS FULL | MYTABLE |
| 4 | FILTER | |
| 5 | TABLE ACCESS BY INDEX ROWID| MYTABLE |
| 6 | INDEX UNIQUE SCAN | MYTABLE_PK |
----------------------------------------------------
Warnings
FILTER operations and this NVL trick are not well documented. I'm not sure which version introduced these features, but they work in 11g. I've had problems getting the FILTER to work correctly with some complicated queries, but for simple queries like these it is reliable.
I am trying to get unique values from multiple columns, but since the data structure is an array I can't directly do DISTINCT on all columns. I am using UNNEST() for each column and performing a UNION ALL for each column.
My idea is to create a UDF so that I can simply give the column name each time instead of performing the select every time.
I would like to replace this Query with a UDF since there are many feature columns and I need to do many UNION ALL.
SELECT DISTINCT user_log as unique_value,
'user_log' as feature
FROM `my_table`
left join UNNEST(user_Log) AS user_log
union all
SELECT DISTINCT page_name as unique_value,
'user_login_page_name' as feature
FROM `my_table`
left join UNNEST(PageName) AS page_name
order by feature;
My UDF
CREATE TEMP FUNCTION get_uniques(feature_name ARRAY<STRING>, feature STRING)
AS (
(SELECT DISTINCT feature as unique_value,
'feature' as feature
FROM `my_table`
left join UNNEST(feature_name) AS feature));
SELECT get_uniques(user_Log, log_feature);
However, the UDF to select the column doesn't really work and gives the error:
Scalar subquery cannot have more than one column unless using SELECT AS STRUCT to build STRUCT values; failed to parse CREATE [TEMP] FUNCTION statement at [8:1]
There is probably a better way of doing this. Appreciate your help.
Reading what you are trying to achieve, which is:
My idea is to create a UDF so that i can simply give the column name each time instead of performing the select every time.
One approach could be to use FORMAT in combination with EXECUTE IMMEDIATE to create your custom query and get the desired output.
The example below shows a function using FORMAT to return a custom query and EXECUTE IMMEDIATE to retrieve the final query output. I'm using a public data set so you can also try it out on your side:
CREATE TEMP FUNCTION GetUniqueValues(table_name STRING, col_name STRING, nest_col_name STRING)
AS (format("SELECT DISTINCT %s.%s as unique_val,'%s' as featured FROM %s ", col_name,nest_col_name,col_name,table_name));
EXECUTE IMMEDIATE (
select CONCAT(
(SELECT GetUniqueValues('bigquery-public-data.github_repos.commits','Author','name'))
,' union all '
,(SELECT GetUniqueValues('bigquery-public-data.github_repos.commits','Committer','name'))
,' limit 100'))
Output
Row | unique_val | featured
1 | Sergio Garcia Murillo | Committer
2 | klimek | Committer
3 | marclaporte#gmail.com | Committer
4 | acoul | Committer
5 | knghtbrd | Committer
... | ... | ...
100 | Gustavo Narea | Committer
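Since the question's columns are arrays, the generated query still needs an UNNEST. Here is a hedged variant of the same FORMAT / EXECUTE IMMEDIATE idea for array columns (the table path and the column names user_Log and PageName are taken from the question and are otherwise placeholders):
CREATE TEMP FUNCTION GetUniqueArrayValues(table_name STRING, col_name STRING)
AS (FORMAT("SELECT DISTINCT v AS unique_value, '%s' AS feature FROM `%s`, UNNEST(%s) AS v", col_name, table_name, col_name));
EXECUTE IMMEDIATE (
  SELECT CONCAT(
    (SELECT GetUniqueArrayValues('my_project.my_dataset.my_table', 'user_Log'))
    ,' union all '
    ,(SELECT GetUniqueArrayValues('my_project.my_dataset.my_table', 'PageName'))
    ,' order by feature'))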
Let's imagine that the table my_table is divided into 1000 partitions as the following example:
P1, P2, P3, ... , P997, P998, P999, P1000
Partitions are organized by dates, mostly a partition per day. E.g.:
P0 < 01/01/2000 => Contains around 472M records
P1 = 01/01/2000 => Contains around 15k records
P2 = 02/01/2000 => Contains around 15k records
P3 = 03/01/2000 => Contains around 15k records
... = ../../.... => Contains around ... records
P997 = 07/04/2000 => Contains around 15k records
P998 = 08/04/2000 => Contains around 15k records
P999 = 09/04/2000 => Contains around 15k records
P1000 = 10/04/2000 => Contains around 15k records
Please notice that P0 is < to 01/01/2000, NOT =
CURRENT SITUATION:
When looking for a specific record without knowing the date, I am doing a:
SELECT * FROM my_schema.my_table WHERE ... ;
But this takes too much time because it includes P0 (30s).
IMPOSSIBLE SOLUTION:
So the best idea would be to execute an SQL query such as:
SELECT * FROM my_schema.my_table PARTITION(P42) WHERE ... ;
But we never know which partition the record is in. We don't know the date associated with the partition either. And of course we won't loop over all partitions one by one.
BAD SOLUTION:
I could be clever and go 5 by 5:
SELECT * FROM my_schema.my_table PARTITION(P40,P41,P42,P43,P44) WHERE ... ;
However, same issue as above: I won't loop over all partitions, even 5 by 5.
LESS BAD SOLUTION:
Nor will I do this (excluding P0 from the list):
SELECT * FROM my_schema.my_table PARTITION(P1,P2,...,P99,P100) WHERE ... ;
The query would be too long and I would have to compute the list of partition names for each request, since it would not always start at P1 or end at P100 (each day some partitions are dropped and some are created).
CLEVER SOLUTION (but does it exist?):
How can I do something like this?
SELECT * FROM my_schema.my_table NOT IN PARTITION(P0) WHERE ... ;
or
SELECT * FROM my_schema.my_table PARTITION(*,-P0) WHERE ... ;
or
SELECT * FROM my_schema.my_table LESS PARTITION(P0) WHERE ... ;
or
SELECT * FROM my_schema.my_table EXCLUDE PARTITION(P0) WHERE ... ;
Is there any way to do that?
I mean a way to select all partitions except one or some of them?
Note: I don't know the value of dateOfSale in advance. Inside the table, we have something like
CREATE TABLE my_table
(
recordID NUMBER(16) NOT NULL, --not primary
dateOfSale DATE NOT NULL, --unknown
....
<other fields>
)
Before you answer, read the following:
Index usage: yes, it is already optimized, but remember, we do not know the partitioning date
No, we won't drop records in P0; we need to keep them for at least a few years (3, 5 and sometimes 10, according to each country's laws)
We can "split" P0 into several partitions, but that won't solve the issue with a global SELECT
We cannot move that data into a new table; we need it kept in this table since we have multiple services and pieces of software performing selects on it. We would have to edit too much code to add a query for the second table in each service and back end.
We cannot add a WHERE date > 2019 clause and index the date field, for multiple reasons that would take too much time to explain here.
The query below, i.e. two queries in a UNION ALL from which I only want 1 row, will stop as soon as a row is found. We do not need to go into the second part of the UNION ALL if we get a row from the first.
SQL> select * from
2 ( select x
3 from t1
4 where x = :b1
5 union all
6 select x
7 from t2
8 where x = :b1
9 )
10 where rownum = 1
11 /
See https://connor-mcdonald.com/golden-oldies/first-match-written-15-10-2007/ for a simple proof of this.
I'm assuming that you're working under the assumption that, most of the time, the record you are interested in is in your most recent, smaller partitions. In the absence of any other information to home in on the right partition, you could do
select * from
( select ...
from tab
where trans_dt >= DATE'2000-01-01'
and record_id = :my_record
union all
select ...
from tab
where trans_dt < DATE'2000-01-01'
and record_id = :my_record
)
where rownum = 1
which will only scan the big partition if we fall through and don't find it anywhere else.
But your problem does seem to be screaming out for an index to avoid all this work.
Let's simplify your partitioned table as follows
CREATE TABLE tab
( trans_dt DATE
)
PARTITION BY RANGE (trans_dt)
( PARTITION p0 VALUES LESS THAN (DATE'2000-01-01')
, PARTITION p1 VALUES LESS THAN (DATE'2000-01-02')
, PARTITION p2 VALUES LESS THAN (DATE'2000-01-03')
, PARTITION p3 VALUES LESS THAN (DATE'2000-01-04')
);
If you want to skip your large partition P0 in a query, you simply (as it is the first partition) constrain the partition key with trans_dt >= DATE'2000-01-01'.
You would need two predicates combined with OR to skip a partition in the middle (an example follows the execution plan below).
The query
select * from tab
where trans_dt >= DATE'2000-01-01';
Checking the execution plan, you see the expected behaviour in Pstart = 2 (i.e. the 1st partition is pruned).
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 9 | 2 (0)| 00:00:01 | | |
| 1 | PARTITION RANGE ITERATOR | | 1 | 9 | 2 (0)| 00:00:01 | 2 | 4 |
| 2 | TABLE ACCESS STORAGE FULL| TAB | 1 | 9 | 2 (0)| 00:00:01 | 2 | 4 |
---------------------------------------------------------------------------------------------------
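As mentioned above, skipping a partition in the middle takes two range predicates combined with OR. A small sketch against the same sample table, assuming p2 is the partition to be skipped (whether it is actually pruned shows up again in Pstart/Pstop):
select * from tab
where trans_dt <  DATE'2000-01-02'
   or trans_dt >= DATE'2000-01-03';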
Remember, if you scan a partitioned table without constraining the partition key, you will have to scan all partitions.
If you know that most of the query results are in the recent, small partitions, simply scan them in the first query
select * from tab
where trans_dt >= DATE'2000-01-01' and <your filter>
and only if you fail to get the row, scan the large partition
select * from tab
where trans_dt < DATE'2000-01-01' and <your filter>
You will get a much better response time on average if the assumption holds that the queries mostly refer to recent data.
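A minimal PL/SQL sketch of that two-step lookup, assuming tab also carries a record_id column like the question's my_table (v_id is an illustrative value):
declare
  v_id  number := 42;      -- the record we are hunting for (illustrative)
  v_rec tab%rowtype;
begin
  begin
    -- try the recent, small partitions first
    select * into v_rec
    from tab
    where trans_dt >= DATE'2000-01-01'
    and record_id = v_id
    and rownum = 1;
  exception
    when no_data_found then
      -- only now touch the large partition P0
      select * into v_rec
      from tab
      where trans_dt < DATE'2000-01-01'
      and record_id = v_id
      and rownum = 1;
  end;
end;
/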
Although there is no syntax to exclude a specific partition, you can build a pipelined table function that dynamically builds a query that uses every partition except for one.
The table function builds a query like the one below. The function uses the data dictionary view USER_TAB_PARTITIONS to get the partition names to build the SQL, uses dynamic SQL to execute the query, and then pipes the results back to the caller.
select * from my_table partition (P1) union all
select * from my_table partition (P2) union all
...
select * from my_table partition (P1000);
Sample schema
CREATE TABLE my_table
(
recordID NUMBER(16) NOT NULL, --not primary
dateOfSale DATE NOT NULL, --unknown
a NUMBER
)
partition by range (dateOfSale)
(
partition p0 values less than (date '2000-01-01'),
partition p1 values less than (date '2000-01-02'),
partition p2 values less than (date '2000-01-03')
);
insert into my_table
select 1,date '1999-12-31',1 from dual union all
select 2,date '2000-01-01',1 from dual union all
select 3,date '2000-01-02',1 from dual;
commit;
Package and function
create or replace package my_table_pkg is
type my_table_nt is table of my_table%rowtype;
function get_everything_but_p0 return my_table_nt pipelined;
end;
/
create or replace package body my_table_pkg is
function get_everything_but_p0 return my_table_nt pipelined is
v_sql clob;
v_results my_table_nt;
v_cursor sys_refcursor;
begin
--Build SQL that references all partitions.
for partitions in
(
select partition_name
from user_tab_partitions
where table_name = 'MY_TABLE'
and partition_name <> 'P0'
) loop
v_sql := v_sql || chr(10) || 'union all select * from my_table ' ||
'partition (' || partitions.partition_name || ')';
end loop;
v_sql := substr(v_sql, 12);
--Print the query for debugging:
dbms_output.put_line(v_sql);
--Gather the results in batches and pipe them out.
open v_cursor for v_sql;
loop
fetch v_cursor bulk collect into v_results limit 100;
exit when v_results.count = 0;
for i in 1 .. v_results.count loop
pipe row (v_results(i));
end loop;
end loop;
close v_cursor;
end;
end;
/
The package uses 12c's ability to define types in package specifications. If you build this in 11g or below, you'll need to create SQL types instead. This package only works for one table, but if necessary there are ways to create functions that work with any table (using Oracle data cartridge or 18c's polymorphic table functions).
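For 11g, here is a hedged sketch of the SQL types that could replace the package-level type (the names are illustrative; the function would then return my_table_tab and pipe my_table_row objects):
create or replace type my_table_row as object
(
  recordID   number(16),
  dateOfSale date,
  a          number
);
/
create or replace type my_table_tab as table of my_table_row;
/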
Sample query
SQL> select * from table(my_table_pkg.get_everything_but_p0);
RECORDID DATEOFSAL A
---------- --------- ----------
2 01-JAN-00 1
3 02-JAN-00 1
Performance
This function should perform almost as well as the clever solution you were looking for. There will be overhead because the rows get passed through PL/SQL. But most importantly, the function builds a SQL statement that partition prunes away the large P0 partition.
One possible issue with this function is that the optimizer has no visibility inside it and can't create a good row cardinality estimate. If you use the function as part of another large SQL statement, be aware that the optimizer will blindly guess that the function returns 8168 rows. That bad cardinality guess may lead to a bad execution plan.
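One hedged workaround is the (undocumented) CARDINALITY hint, which tells the optimizer roughly how many rows to expect from the table function; treat the number as an assumption about your own data volumes:
select /*+ cardinality(t 100000) */ *
from table(my_table_pkg.get_everything_but_p0) t;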
The MariaDB INSERT IGNORE... SELECT syntax is defined at https://mariadb.com/kb/en/insert/.
I am also relying on the assumption that each INSERT in this situation will be performed in the order of the SELECT clause result set.
Test case:
DROP TEMPORARY TABLE IF EXISTS index_and_color;
CREATE TEMPORARY TABLE index_and_color (`index` INT PRIMARY KEY, color TEXT);
INSERT IGNORE INTO index_and_color SELECT 5, "Red" UNION ALL SELECT 5, "Blue";
SELECT * FROM index_and_color;
Intuitively, I see that the "first" row in the SELECT clause result set has (5, "Red") and then the "second" row is ignored, containing the same key with "Blue".
From what I see, this is undefined behavior because another server implementing the same documentation could handle the rows in a different order.
Is my usage here indeed relying on undefined behavior?
What is "undefined behavior" is the ordering of the rows returned by an UNION ALL query. Use an ORDER BY clause if you want a stable result:
INSERT IGNORE INTO index_and_color (`index`, color)
SELECT 5 `index`, 'Red' color UNION ALL SELECT 5, 'Blue' ORDER BY color;
The SQL statement SELECT * FROM index_and_color should give you exactly what you asked for: all data rows from the given table.
If you expect the result set to be delivered in a specific order, then you need to add an ORDER BY clause; otherwise it might be ordered by the index, or based on whichever algorithm the optimizer expects to produce the data the fastest.
For example:
CREATE TABLE t1 (a int, b int, index(a));
INSERT INTO t1 VALUES (2,1),(1,2);
/* Here we get the order from the last insert statement */
SELECT a,b FROM t1;
MariaDB [test]> select a,b from t1;
+------+------+
| a | b |
+------+------+
| 2 | 1 |
| 1 | 2 |
+------+------+
/* Here we get a different order since the optimizer will deliver results from the index */
SELECT a FROM t1;
+------+
| a |
+------+
| 1 |
| 2 |
+------+
I have an Oracle database where there is a column called SESID which has a DATA_TYPE of CHAR(8 BYTE). We have an index set up on this column; however, when I have a look at the execution plan, we appear not to be using the index. The simple query that I would be using is
SELECT * FROM TestTable WHERE SESID = 12345
Having a look at the execution plan, it is not using the index because it has to do a TO_NUMBER call on the SESID column; this prevents Oracle from considering the index in the query plan.
Here is the execution plan information which details this:
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(TO_NUMBER("SESID")=12345)
My question is this: is there any way to change the query so that it considers the number 12345 as a CHAR array? My intuition told me that this might work:
SELECT * FROM TestTable WHERE SESID = '12345'
But it obviously did not... Does anybody know how I could do this?
I'm using the standard OracleClient provided in .NET 4 to connect to the Oracle DB and run the query.
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE tbl ( SESID CHAR(8 BYTE) );
INSERT INTO tbl VALUES ( '1' );
INSERT INTO tbl VALUES ( '12' );
INSERT INTO tbl VALUES ( '123' );
INSERT INTO tbl VALUES ( '1234' );
INSERT INTO tbl VALUES ( '12345' );
INSERT INTO tbl VALUES ( '123456' );
INSERT INTO tbl VALUES ( '1234567' );
INSERT INTO tbl VALUES ( '12345678' );
Query 1:
A CHAR column will right-pad the string with space characters. You can see this with the following query:
SELECT SESID, LENGTH( SESID ), LENGTH( TRIM( SESID ) )
FROM tbl
Results:
| SESID | LENGTH(SESID) | LENGTH(TRIM(SESID)) |
|----------|---------------|---------------------|
| 1 | 8 | 1 |
| 12 | 8 | 2 |
| 123 | 8 | 3 |
| 1234 | 8 | 4 |
| 12345 | 8 | 5 |
| 123456 | 8 | 6 |
| 1234567 | 8 | 7 |
| 12345678 | 8 | 8 |
Query 2:
This query explicitly converts the number to a character string:
SELECT SESID
FROM tbl
WHERE SESID = TO_CHAR( 12345 )
Results:
However, the SESID you want to match is 12345___ (where ___ represents the three trailing spaces) and does not equal 12345 (without the trailing spaces), so no rows are returned.
Query 3:
Instead you can ensure that the number is padded to the correct length when it is converted to a character string:
SELECT SESID
FROM tbl
WHERE SESID = RPAD( TO_CHAR( 12345 ), 8, ' ' )
Results:
| SESID |
|----------|
| 12345 |
An Alternative Solution
Change the column definition from CHAR(8 BYTE) to VARCHAR2(8); then you can use Query 2 without issues.
SELECT * FROM TestTable WHERE SESID = '12345'
or
SELECT * FROM TestTable WHERE SESID = TO_CHAR( 12345 )
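If you go that route, here is a minimal migration sketch (assuming the existing padded values should also lose their trailing blanks, since the data type conversion itself keeps them):
ALTER TABLE TestTable MODIFY ( SESID VARCHAR2(8) );
UPDATE TestTable SET SESID = RTRIM( SESID );   -- converted values keep their CHAR padding, so trim it
COMMIT;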
Oracle is slightly odd in that, if you compare a literal to a column and an implicit type conversion is required, it will always try to convert the column rather than the literal.
It's worth watching out for on every SQL statement, and adding an explicit conversion on the literal to make sure that you don't get caught out.
Having said this, having an index and a SQL statement that could use one does not guarantee that Oracle will use it - there also needs to be enough data in the table for Oracle to think an index will be of use.
If you want to find out why, you need to do a LOT of reading up on the "cost based optimizer".
Aside:
If you find that you can't do a conversion on the literal for some reason (e.g. the SQL is generated by a library over which you have no control) then you can create a function-based index.
I.e. an index that is based on the conversion that will take place.
E.g. CREATE INDEX test_table_index ON test_table ( TO_NUMBER( sesid ) )
This would only be possible if all data in sesid is numeric.
Which raises the point that when the column is converted it may convert more data than you intend and sometimes it is not possible.
E.g. you perform your original SELECT on a table that contains a mix of numeric and non-numeric data. Since no index exists that supports the select, Oracle will need to do a full table scan. It therefore needs to look at every record in TestTable and convert sesid to a number. But it can't for the non-numeric values, so it throws an exception even though you didn't want the non-numeric record anyway.
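If the data is mixed like that, a hedged alternative is to index a conversion that is safe for non-numeric values, and then use exactly the same expression in the query (the expression below is illustrative):
CREATE INDEX test_table_num_idx
  ON test_table ( CASE WHEN REGEXP_LIKE( TRIM( sesid ), '^[0-9]+$' )
                       THEN TO_NUMBER( sesid ) END );
SELECT *
FROM test_table
WHERE CASE WHEN REGEXP_LIKE( TRIM( sesid ), '^[0-9]+$' )
           THEN TO_NUMBER( sesid ) END = 12345;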
Final Aside:
Object names in Oracle are not case sensitive (unless you use "quotes" to state that they should be) so general practice is to use identifiers with underscores instead of camel case so that they are easier to read when output by Oracle.
E.g. test_table rather than TestTable
I wonder if the following script can be optimized somehow. It writes a lot to disk because it deletes possibly up-to-date rows and reinserts them. I was thinking about applying something like "insert ... on duplicate key update", and I found some possibilities for single-row updates, but I don't know how to apply it in the context of an INSERT INTO ... SELECT query.
CREATE OR REPLACE FUNCTION update_member_search_index() RETURNS VOID AS $$
DECLARE
member_content_type_id INTEGER;
BEGIN
member_content_type_id :=
(SELECT id FROM django_content_type
WHERE app_label='web' AND model='member');
DELETE FROM watson_searchentry WHERE content_type_id = member_content_type_id;
INSERT INTO watson_searchentry (engine_slug, content_type_id, object_id
, object_id_int, title, description, content
, url, meta_encoded)
SELECT 'default',
member_content_type_id,
web_member.id,
web_member.id,
web_member.name,
'',
web_user.email||' '||web_member.normalized_name||' '||web_country.name,
'',
'{}'
FROM web_member
INNER JOIN web_user ON (web_member.user_id = web_user.id)
INNER JOIN web_country ON (web_member.country_id = web_country.id)
WHERE web_user.is_active=TRUE;
END;
$$ LANGUAGE plpgsql;
EDIT: Schemas of web_member, watson_searchentry, web_user, web_country: http://pastebin.com/3tRVPPVi.
The main point is to update columns title and content in watson_searchentry. There is a trigger on the table that sets value of column search_tsv based on these columns.
(content_type_id, object_id_int) in watson_searchentry is unique pair in the table but atm the index is not present (there is no use for it).
This script should be run at most once a day for full rebuilds of search index and occasionally after importing some data.
Modified table definition
If you really need those columns to be NOT NULL and you really need the string 'default' as the default for engine_slug, I would advise introducing column defaults:
 Column          | Type                     | Modifiers
-----------------+--------------------------+----------------------------
 id              | INTEGER                  | NOT NULL DEFAULT ...
 engine_slug     | CHARACTER VARYING(200)   | NOT NULL DEFAULT 'default'
 content_type_id | INTEGER                  | NOT NULL
 object_id       | text                     | NOT NULL
 object_id_int   | INTEGER                  |
 title           | CHARACTER VARYING(1000)  | NOT NULL
 description     | text                     | NOT NULL DEFAULT ''
 content         | text                     | NOT NULL
 url             | CHARACTER VARYING(1000)  | NOT NULL DEFAULT ''
 meta_encoded    | text                     | NOT NULL DEFAULT '{}'
 search_tsv      | tsvector                 | NOT NULL
...
DDL statement would be:
ALTER TABLE watson_searchentry ALTER COLUMN engine_slug SET DEFAULT 'default';
Etc.
Then you don't have to insert those values manually every time.
Also: object_id text NOT NULL, object_id_int INTEGER? That's odd. I guess you have your reasons ...
I'll go with your updated requirement:
The main point is to update columns title and content in watson_searchentry
Of course, you must add a UNIQUE constraint to enforce your requirements:
ALTER TABLE watson_searchentry
ADD CONSTRAINT ws_uni UNIQUE (content_type_id, object_id_int)
The accompanying index will be used - by this query, for starters.
BTW, I almost never use varchar(n) in Postgres. Just text. Here's one reason.
Query with data-modifying CTEs
This could be rewritten as a single SQL query with data-modifying common table expressions, also called "writeable" CTEs. Requires Postgres 9.1 or later.
Additionally, this query only deletes what has to be deleted, and updates what can be updated.
WITH ctyp AS (
SELECT id AS content_type_id
FROM django_content_type
WHERE app_label = 'web'
AND model = 'member'
)
, sel AS (
SELECT ctyp.content_type_id
,m.id AS object_id_int
,m.id::text AS object_id -- explicit cast!
,m.name AS title
,concat_ws(' ', u.email,m.normalized_name,c.name) AS content
-- other columns have column default now.
FROM web_user u
JOIN web_member m ON m.user_id = u.id
JOIN web_country c ON c.id = m.country_id
CROSS JOIN ctyp
WHERE u.is_active
)
, del AS ( -- only if you want to del all other entries of same type
DELETE FROM watson_searchentry w
USING ctyp
WHERE w.content_type_id = ctyp.content_type_id
AND NOT EXISTS (
SELECT 1
FROM sel
WHERE sel.object_id_int = w.object_id_int
)
)
, up AS ( -- update existing rows
UPDATE watson_searchentry w
SET object_id = s.object_id
,title = s.title
,content = s.content
FROM sel s
WHERE w.content_type_id = s.content_type_id
AND w.object_id_int = s.object_id_int
)
-- insert new rows
INSERT INTO watson_searchentry (
content_type_id, object_id_int, object_id, title, content)
SELECT sel.* -- safe to use, because col list is defined accordingly above
FROM sel
LEFT JOIN watson_searchentry w1 USING (content_type_id, object_id_int)
WHERE w1.content_type_id IS NULL;
The subquery on django_content_type always returns a single value? Otherwise, the CROSS JOIN might cause trouble.
The first CTE sel gathers the rows to be inserted. Note how I pick matching column names to simplify things.
In the CTE del I avoid deleting rows that can be updated.
In the CTE up those rows are updated instead.
Accordingly, I avoid inserting rows that were not deleted before in the final INSERT.
Can easily be wrapped into an SQL or PL/pgSQL function for repeated use.
Not secure for heavy concurrent use. Much better than the function you had, but still not 100% robust against concurrent writes. But that's not an issue according to your updated info.
Replacing the UPDATEs with DELETE and INSERT may or may not be a lot more expensive. Internally every UPDATE results in a new row version anyways, due to the MVCC model.
Speed first
If you don't really care about preserving old rows, your simpler approach may be faster: delete everything and insert new rows. Also, wrapping this into a plpgsql function saves a bit of planning overhead. Your function would basically look like this, with a couple of minor simplifications and observing the defaults added above:
CREATE OR REPLACE FUNCTION update_member_search_index()
RETURNS VOID AS
$func$
DECLARE
_ctype_id int := (
SELECT id
FROM django_content_type
WHERE app_label='web'
AND model = 'member'
); -- you can assign at declaration time; saves another statement
BEGIN
DELETE FROM watson_searchentry
WHERE content_type_id = _ctype_id;
INSERT INTO watson_searchentry
(content_type_id, object_id, object_id_int, title, content)
SELECT _ctype_id, m.id, m.id::int, m.name
,u.email || ' ' || m.normalized_name || ' ' || c.name
FROM web_member m
JOIN web_user u ON u.id = m.user_id
JOIN web_country c ON c.id = m.country_id
WHERE u.is_active;
END
$func$ LANGUAGE plpgsql;
I even refrain from using concat_ws(): It is safe against NULL values and simplifies code, but a bit slower than simple concatenation.
Also:
There is a trigger on the table that sets value of column search_tsv
based on these columns.
It would be faster to incorporate the logic into this function - if this is the only time the trigger is needed. Else, it's probably not worth the fuss.
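A hedged sketch of what that could look like, assuming the trigger simply builds search_tsv from title and content with to_tsvector (the 'english' configuration is an assumption; use whatever configuration the trigger actually uses, and disable the trigger for this statement so the column isn't maintained twice). The INSERT inside the function would then become something like:
INSERT INTO watson_searchentry
       (content_type_id, object_id, object_id_int, title, content, search_tsv)
SELECT _ctype_id
     , m.id::text
     , m.id
     , m.name
     , u.email || ' ' || m.normalized_name || ' ' || c.name
     , to_tsvector('english', m.name || ' ' || u.email || ' ' || m.normalized_name || ' ' || c.name)
FROM   web_member  m
JOIN   web_user    u ON u.id = m.user_id
JOIN   web_country c ON c.id = m.country_id
WHERE  u.is_active;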