Query using multiple replace function calls - How to improve performance?

I have a query that requires strings from tables to be stripped of special characters before they are compared against each other. I created a function that takes in a string and removes certain special characters before returning it. The problem is that the query does a lot of comparisons, so I found myself calling the function many times, and this significantly slowed down performance.
So I have this function I created:
create or replace FUNCTION F_REMOVE_SPECIAL_CHARACTERS
(
    IN_PARAM_EMAIL_NAME  IN VARCHAR2,
    IN_PARAM_NUMBER_FLAG IN VARCHAR2 DEFAULT 'N'
) RETURN VARCHAR2 AS
BEGIN
    /* If flag is Y then remove all numbers too. Otherwise, keep numbers in the string */
    IF IN_PARAM_NUMBER_FLAG = 'Y' THEN
        RETURN replace(regexp_replace(IN_PARAM_EMAIL_NAME, '[-,._0-9]', ''), ' ', '');
    ELSE
        RETURN replace(regexp_replace(IN_PARAM_EMAIL_NAME, '[-,._]', ''), ' ', '');
    END IF;
END F_REMOVE_SPECIAL_CHARACTERS;
I also have a query that goes like this:
SELECT a.ID, LISTAGG(b.BUSINESS_EMAIL) WITHIN GROUP (ORDER BY a.ID)
FROM tableA a, tableB b
WHERE UPPER(F_REMOVE_SPECIAL_CHARACTERS(b.LAST_NAME)) IN
      (SELECT UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) FROM tableC c
       WHERE UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) IN
             (SELECT UPPER(F_REMOVE_SPECIAL_CHARACTERS(c.NAME)) FROM tableC c
              WHERE UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) = UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.LAST_NAME))))
GROUP BY a.ID
The actual query is bigger and more complicated, but the point is that I need to remove special characters from certain column values, and those comparisons are repeated multiple times in the query. This means calling the function multiple times, which causes a significant slowdown.
Does anyone have an idea how to reduce the slowdown caused by multiple function calls in a query? Thanks.

Assuming you need this as a function (because you use it in many places), you could clean it up and simplify it (and make it more efficient) like so:
create or replace function f_remove_special_characters
(
    in_param_email_name  in varchar2,
    in_param_number_flag in varchar2 default 'N'
)
return varchar2
deterministic
as
    pragma udf; -- if on Oracle 12.1 or higher, and the function is only for SQL use
    /* If flag is Y then remove all numbers too.
       Otherwise, keep numbers in the string
    */
    chars_to_remove varchar2(16) := 'z-,._ ' ||
        case in_param_number_flag when 'Y' then '0123456789' end;
begin
    return translate(in_param_email_name, chars_to_remove, 'z');
end f_remove_special_characters;
/
The silly trick with the 'z' in translate (in the second and third arguments) is due to Oracle's odd treatment of null: in translate, if any of the arguments is null the result is null, in contrast with Oracle's treatment of null in other string operations.
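A quick sanity check of the rewritten function (hypothetical input, not from the question):

select f_remove_special_characters('Smith-Jones_3', 'Y') as cleaned from dual;
-- returns 'SmithJones': the dash, underscore, digit and any spaces are gone

Because the function is declared DETERMINISTIC, Oracle may also cache results for repeated argument values, which by itself can reduce the number of actual calls.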

If you are on 12c or above, then as a quick fix you can use the WITH FUNCTION clause.
As I recall, this eliminates PL/SQL <-> SQL context switches, so your query should perform better.
I've never tested that, but it is very likely to be faster, possibly even 30-50 times.
Let me know how fast it turns out to be, because I'm curious.
WITH FUNCTION F_REMOVE_SPECIAL_CHARACTERS
(
    IN_PARAM_EMAIL_NAME  IN VARCHAR2,
    IN_PARAM_NUMBER_FLAG IN VARCHAR2 DEFAULT 'N'
) RETURN VARCHAR2 AS
BEGIN
    /* If flag is Y then remove all numbers too. Otherwise, keep numbers in the string */
    IF IN_PARAM_NUMBER_FLAG = 'Y' THEN
        RETURN replace(regexp_replace(IN_PARAM_EMAIL_NAME, '[-,._0-9]', ''), ' ', '');
    ELSE
        RETURN replace(regexp_replace(IN_PARAM_EMAIL_NAME, '[-,._]', ''), ' ', '');
    END IF;
END;
SELECT a.ID, LISTAGG(b.BUSINESS_EMAIL) WITHIN GROUP (ORDER BY a.ID)
FROM tableA a, tableB b
WHERE UPPER(F_REMOVE_SPECIAL_CHARACTERS(b.LAST_NAME)) IN
      (SELECT UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) FROM tableC c
       WHERE UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) IN
             (SELECT UPPER(F_REMOVE_SPECIAL_CHARACTERS(c.NAME)) FROM tableC c
              WHERE UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) = UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.LAST_NAME))))
GROUP BY a.ID
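A further trick worth benchmarking (my suggestion, not from the answers above): wrapping each call in a scalar subquery enables Oracle's scalar subquery caching, so the function body typically runs far fewer times than once per row:

SELECT a.ID
FROM tableA a, tableB b
WHERE (SELECT UPPER(F_REMOVE_SPECIAL_CHARACTERS(b.LAST_NAME)) FROM DUAL) =
      (SELECT UPPER(F_REMOVE_SPECIAL_CHARACTERS(a.NICK_NAME)) FROM DUAL)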


Oracle function with select all from tables

SELECT DISTINCT L.* FROM LABALES L , MATCHES M
WHERE M.LIST LIKE '%ENG'
ORDER BY L.ID
I need to create a function with this select. I tried the following, but it doesn't work:
CREATE OR REPLACE FUNCTION getSoccerLists
RETURN varchar2 IS
list varchar2(2000);
BEGIN
SELECT DISTINCT L.* FROM LABALES L , MATCHES M
WHERE M.LIST LIKE '%ENG'
ORDER BY L.ID
return list;
END;
How do I create a function that returns everything from table L?
Thanks
You may return an implicit result using DBMS_SQL.RETURN_RESULT (Oracle 12c and above) in a procedure that opens a cursor over your query.
CREATE OR REPLACE PROCEDURE getSoccerLists
AS
    x SYS_REFCURSOR;
BEGIN
    OPEN x FOR SELECT DISTINCT L.* FROM LABALES L
               JOIN MATCHES M ON (1 = 1) -- join condition
               WHERE M.LIST LIKE '%ENG'
               ORDER BY L.ID;
    DBMS_SQL.RETURN_RESULT(x);
END;
/
then simply call the procedure
EXEC getSoccerLists;
For lower versions (Oracle 11g), you may use the PRINT command to display the cursor's output, passing the ref cursor as an OUT parameter.
CREATE OR REPLACE PROCEDURE getSoccerLists (x OUT SYS_REFCURSOR)
AS
BEGIN
    OPEN x FOR SELECT DISTINCT L.* FROM LABALES L
               JOIN MATCHES M ON (1 = 1) -- join condition
               WHERE M.LIST LIKE '%ENG'
               ORDER BY L.ID;
END;
/
Then, in SQL*Plus, or when running as a script in SQL Developer or Toad, you can get the results like this:
VARIABLE r REFCURSOR;
EXEC getSoccerLists (:r);
PRINT r;
Another option is to use a TABLE function, by defining a collection of the record type of the result within a package.
Refer Create an Oracle function that returns a table
I guess this question is a repetition of your previously asked question, where you wanted to get all the columns of the tables into separate columns. I already answered there, stating that you cannot do this if you call your function via a SELECT statement. If you call your function in an anonymous block, you can display the result in separate columns.
Here Oracle function returning all columns from tables
Alternatively, you can get the results separated by a comma (,) or pipe (|) as below:
CREATE OR REPLACE FUNCTION getSoccerLists
    RETURN VARCHAR2
IS
    list VARCHAR2(2000);
BEGIN
    SELECT col1
        || ','
        || col2
        || ','
        || col3
    INTO list
    FROM SOCCER_PREMATCH_LISTS L,
         SOCCER_PREMATCH_MATCHES M
    WHERE M.LIST LIKE '%' || L.SUB_LIST || '%'
    AND TO_TIMESTAMP((M.M_DATE || ' ' || M.M_TIME), 'DD.MM.YYYY HH24:MI') >
        (SELECT SYSTIMESTAMP AT TIME ZONE 'CET' FROM DUAL)
    ORDER BY L.ID;
    RETURN list;
END;
Note that if the combined string exceeds 2000 characters, you will again lose data.
Edit:
From your comments
I want it to return a table set of results.
You then need to create a table of varchar and then return it from the function. See below:
CREATE TYPE var IS TABLE OF VARCHAR2(2000);
/
CREATE OR REPLACE FUNCTION getSoccerLists
    RETURN var
IS
    -- Initialization
    list var := var();
BEGIN
    SELECT NSO || ',' || NAME BULK COLLECT INTO list FROM TEST;
    RETURN list;
END;
Execution:
select * from table(getSoccerLists);
Note: in the function I have used a table called TEST and its columns. Replace these with your own table and column names.
Edit 2:
-- Create an object with columns matching your select statement
CREATE TYPE v_var IS OBJECT
(
    col1 NUMBER,
    col2 VARCHAR2(10)
);
/
-- Create a table of your object
CREATE OR REPLACE TYPE var IS TABLE OF v_var;
/
CREATE OR REPLACE FUNCTION getSoccerLists
    RETURN var
IS
    -- Initialization
    list var := var();
BEGIN
    -- The object above must have the same columns, with the same data types, as you select here
    SELECT v_var(NSO, NAME) BULK COLLECT INTO list FROM TEST;
    RETURN list;
END;
/
Execution:
select * from table(getSoccerLists);
This is not an answer on how to build a function for this; instead, I'd recommend making this a view:
CREATE OR REPLACE VIEW view_soccer_list AS
SELECT *
FROM soccer_prematch_lists l
WHERE EXISTS
(
    SELECT *
    FROM soccer_prematch_matches m
    WHERE m.list LIKE '%' || l.sub_list || '%'
    AND TO_TIMESTAMP((m.m_date || ' ' || m.m_time), 'DD.MM.YYYY HH24:MI') >
        (SELECT SYSTIMESTAMP AT TIME ZONE 'CET' FROM DUAL)
);
Then call it in a query:
SELECT * FROM view_soccer_list ORDER BY id;
(It makes no sense to put an ORDER BY clause in a view, because you access the view like a table, and table data is considered unordered, so you could not rely on that order. The same is true for a pipelined function you'd access with FROM TABLE (getSoccerLists). Always put the ORDER BY clause in your final queries instead.)

If parameter is empty, change where clause

I have an Oracle function that takes 3 parameters and uses them to set WHERE clause values in multiple select statements that are UNIONed together. Here is the pseudocode:
create or replace function fn_newfunction
    (IN_1_id in VARCHAR2, IN_2 in VARCHAR2, IN_3 in VARCHAR2)
RETURN T_varchar_table AS
    v_tab T_varchar_table;
begin
    select
        cast(multiset (
            -- add users
            select * from table1 opt where opt.col2 = IN_2 and opt.col3 = IN_3 and opt.col1 = IN_1_id
            union
            ...
            <insert 10+ select statements here with same values>
        ) as T_varchar_table)
    into v_tab
    from dual;
    return v_tab;
end;
A use case has come up to pass blank values into the function for any of the IN parameters and have the query match ANY value in the where clause where the parameter is blank. For example, if IN_1_id is passed a blank value, the first select statement should match any value (even null) in opt.col1. How can I make this happen? Thank you!
The logic I most frequently use, albeit not in Oracle, is the following. I've written it as pseudocode simply because, as I mentioned, I believe this is a question of method rather than of syntax.
The Logic
Function (#Parameter1, #Parameter2)
SELECT * FROM MyTable
WHERE
--Parameter1
(#Parameter1 IS NULL OR MyTable.Parameter1 = #Parameter1)
AND
--Parameter2
(#Parameter2 IS NULL OR MyTable.Parameter2 = #Parameter2)
Why It works
If you don't pass #Parameter1:
--This evaluates to TRUE for every single row, because the first condition has been met
(#Parameter1 IS NULL OR MyTable.Parameter1 = #Parameter1)
If you do pass #Parameter1:
--The first condition will never be met, because #Parameter1 is NOT NULL.
--The second condition will only be met for rows that match the parameter.
(#Parameter1 IS NULL OR MyTable.Parameter1 = #Parameter1)
Using this method, you can conditionally add fields to your WHERE clause.
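Translated to the Oracle function from the question (a sketch using the question's parameter and column names; note that Oracle treats an empty VARCHAR2 as NULL, so "blank" and NULL amount to the same test):

select * from table1 opt
where (IN_1_id is null or opt.col1 = IN_1_id)
and   (IN_2    is null or opt.col2 = IN_2)
and   (IN_3    is null or opt.col3 = IN_3)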
You can use dynamic SQL to address your issue:
if IN_1_id is not null then
    lv_where := ' and opt.col1 = :b1';
else
    lv_where := ' and (:b1 is null)'; -- dummy predicate so the bind list stays the same
end if;
EXECUTE IMMEDIATE
    'select * from table1 opt where opt.col2 = :b2 and opt.col3 = :b3' || lv_where
    BULK COLLECT INTO v_tab
    USING IN_2, IN_3, IN_1_id;

Mass-Coalescing of Null Values

I have a table in a Postgres database with monthly columns from 2012 to the end of 2018:
create table sales_data (
part_number text not null,
customer text not null,
qty_2012_01 numeric,
qty_2012_02 numeric,
qty_2012_03 numeric,
...
qty_2018_10 numeric,
qty_2018_11 numeric,
qty_2018_12 numeric,
constraint sales_data_pk primary key (part_number, customer)
);
The data is populated from a large function that pulls data from an extremely wide variety of sources. It involves many left joins -- for example, in combining history with future data, where a single item may have history but not future demand or vice versa. Or, certain customers may not have data as far back or forward as we want.
The problem I'm coming up with is that, due to the left joins (and the nature of the data I'm pulling), a significant number of the values I am pulling are null. I would like any null to simply be zero, to simplify queries against this table, specifically arithmetic across columns, where 1 + null + 2 = null.
I could modify the function and add hundreds of coalesce statements. However, I was hoping there was another way around this, even if it means modifying the values after the fact. That said, this would mean adding 84 update statements at the end of the function:
update sales_data set qty_2012_01 = 0 where qty_2012_01 is null;
update sales_data set qty_2012_02 = 0 where qty_2012_02 is null;
update sales_data set qty_2012_03 = 0 where qty_2012_03 is null;
... 78 more like this...
update sales_data set qty_2018_10 = 0 where qty_2018_10 is null;
update sales_data set qty_2018_11 = 0 where qty_2018_11 is null;
update sales_data set qty_2018_12 = 0 where qty_2018_12 is null;
I'm missing something, right? Is there an easier way?
I was hoping the default setting on the column would force a zero, but it doesn't work when the function is explicitly telling it to insert a null. Likewise, if I make the column non-nullable, it just pukes on my insert -- I was hoping that might force the invocation of the default.
By the way, the insert-then-update strategy is one I chastise others for, so I understand this is less than ideal. This function is a bit of a beast, and it does require some occasional maintenance (long story). My primary goal is to keep the function as readable and maintainable as possible -- NOT to make the function uber-efficient. The table itself is not huge -- less than a million records after all is said and done -- and we run the function to populate it once or twice a month.
There is no built-in feature for that (that I know of). Short of spelling out COALESCE(col, 0) everywhere, you can write a function to replace all NULL values with 0 in all numeric columns of a table:
CREATE OR REPLACE FUNCTION f_convert_numeric_null(_tbl regclass)
  RETURNS void AS
$func$
BEGIN
   RAISE NOTICE '%', -- test output for debugging
-- EXECUTE           -- payload
   (SELECT 'UPDATE ' || _tbl
       || ' SET '    || string_agg(format('%1$s = COALESCE(%1$s, 0)', col), ', ')
       || ' WHERE '  || string_agg(col || ' IS NULL', ' OR ')
    FROM (
       SELECT quote_ident(attname) AS col
       FROM   pg_attribute
       WHERE  attrelid = _tbl                -- valid, visible, legal table name
       AND    attnum >= 1                    -- exclude tableoid & friends
       AND    NOT attisdropped               -- exclude dropped columns
       AND    NOT attnotnull                 -- exclude columns defined NOT NULL
       AND    atttypid = 'numeric'::regtype  -- only numeric columns
       ORDER  BY attnum
       ) sub
   );
END
$func$ LANGUAGE plpgsql;
Concatenates and executes a query of the form:
UPDATE sales_data
SET qty_2012_01 = COALESCE(qty_2012_01, 0)
, qty_2012_02 = COALESCE(qty_2012_02, 0)
, qty_2012_03 = COALESCE(qty_2012_03, 0)
...
WHERE qty_2012_01 IS NULL OR
qty_2012_02 IS NULL OR
qty_2012_03 IS NULL ... ;
Works for any table with any column names. All numeric columns are updated. Only rows that actually change are touched.
Since the function is massively invasive, I added a child-safety device: comment out the RAISE NOTICE line and uncomment EXECUTE to arm the bomb.
Call:
SELECT f_convert_numeric_null('sales_data');
My primary goal is to keep the function as readable and maintainable as possible.
That should do it.
SQL Fiddle.
The parameter is of type regclass, so pass the table name, possibly schema-qualified; non-standard identifiers must be double-quoted, like "mySchema"."0dumb tablename".
Write your query results to a temporary table, run the function on the temp table and then INSERT into the actual table.
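That could look roughly like this (a sketch; tmp_sales is a hypothetical name, and the function's EXECUTE branch must be armed as described above):

CREATE TEMP TABLE tmp_sales AS
SELECT * FROM sales_data WHERE false;        -- same structure, no rows
-- ... populate tmp_sales from the various sources ...
SELECT f_convert_numeric_null('tmp_sales');  -- zero out the NULLs
INSERT INTO sales_data SELECT * FROM tmp_sales;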
Related:
Replace empty strings with null values
Table name as a PostgreSQL function parameter
Generate DEFAULT values in a CTE UPSERT using PostgreSQL 9.3
Applying COALESCE(col_name, 0) in the INSERT statement itself will fix the issue. You can also add NOT NULL constraints to maintain data integrity.
Assuming you insert the data from a temp table:
INSERT INTO sales_data (qty_2012_01, qty_2012_02)
SELECT COALESCE(qty_2012_01, 0), COALESCE(qty_2012_02, 0)
FROM temp_sales_data;
Single Update
UPDATE sales_data SET
qty_2012_01 = COALESCE(qty_2012_01, 0),
qty_2012_02 = COALESCE(qty_2012_02, 0)
..
..
WHERE qty_2012_01 IS NULL
OR qty_2012_02 IS NULL
...
....
The above query will update all the columns in a single update statement.

Update substrings using lookup table and replace function

Here's my setup:
Table 1 (table_with_info): Contains a list of varchars with substrings that I'd like to replace.
Table 2 (sub_info): Contains two columns: the substring in table_with_info that I'd like to replace and the string I'd like to replace it with.
What I'd like to do is replace all the substrings in table_with_info with their substitutions in sub_info.
This works up to a point, but the issue is that select replace(...) returns a new row for each substituted word and doesn't apply all of the substitutions within an individual row.
I'm explaining this the best I can, but I don't know if it's clear. The example below shows what's happening and what I'd like to happen.
Here's my code:
create table table_with_info
(
val varchar
);
insert into table_with_info values
('this this is test data');
create table sub_info
(
word_from varchar,
word_to varchar
);
insert into sub_info values
('this','replace1')
, ('test', 'replace2');
update table_with_info set val = (select replace("val", "word_from", "word_to")
from "table_with_info", "sub_info");
The UPDATE doesn't work, as the SELECT returns two rows:
Row 1: replace1 replace1 is test data
Row 2: this this is replace2 data
so what I'd like for it for the select statement to return is:
Row 1: replace1 replace1 is test data
Any thoughts? I can't create UDFs on the system I'm running.
Your UPDATE statement is incorrect in multiple ways. Consult the manual before you try to run anything like this again. You introduce two cross joins that would make this statement extremely expensive, besides yielding nonsense.
To do this properly, you need to apply each UPDATE sequentially. In a single statement, one row version eliminates the other, while each replace would use the same original row version. You can use a DO statement for this or wrap it in a plpgsql function, for instance:
DO
$do$
DECLARE
    r sub_info;
BEGIN
    FOR r IN
        TABLE sub_info
        -- SELECT * FROM sub_info ORDER BY ???  -- order can be relevant
    LOOP
        UPDATE table_with_info
        SET    val = replace(val, r.word_from, r.word_to)
        WHERE  val LIKE ('%' || r.word_from || '%');  -- avoid empty updates
    END LOOP;
END
$do$;
Be aware that the order in which the updates are applied can make a difference! If the first update creates a string that the second one then matches (but not otherwise), the result depends on that order.
So order the rows in sub_info if that can be relevant.
Avoid empty updates. Without the additional WHERE clause, you would write many new row versions without changing anything. Expensive and useless.
double-quotes are optional for legal, lower-case names.
SQL Fiddle.
Expanding on Erwin's answer, a do block with dynamic SQL can do the trick as well:
do $$
declare
    rec  record;
    repl text;
begin
    repl := 'val'; -- quote_ident() this if needed
    for rec in select word_from, word_to from sub_info
    loop
        repl := 'replace(' || repl || ', '
             || quote_literal(rec.word_from) || ', '
             || quote_literal(rec.word_to) || ')';
    end loop;
    -- now do them all in a single query
    execute 'update ' || 'table_with_info'::regclass || ' set val = ' || repl;
end;
$$ language plpgsql;
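For the sample data above, the loop builds one nested replace() per substitution, so the statement finally executed is:

update table_with_info
set val = replace(replace(val, 'this', 'replace1'), 'test', 'replace2');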
Optionally, build a like parameter in a similar way to avoid updating rows needlessly.

How can I get a hash of an entire table in postgresql?

I would like a fairly efficient way to condense an entire table to a hash value.
I have some tools that generate entire data tables, which can then be used to generate further tables, and so on. I'm trying to implement a simplistic build system to coordinate build runs and avoid repeating work. I want to be able to record hashes of the input tables so that I can later check whether they have changed. Building a table takes minutes or hours, so spending several seconds building hashes is acceptable.
A hack I have used is to just pipe the output of pg_dump to md5sum, but that requires transferring the entire table dump over the network to hash it on the local box. Ideally I'd like to produce the hash on the database server.
Finding the hash value of a row in postgresql gives me a way to calculate a hash for a row at a time, which could then be combined somehow.
Any tips would be greatly appreciated.
Edit to post what I ended up with: tinychen's answer didn't work for me directly, because I couldn't use 'plpgsql' apparently. When I implemented the function in SQL instead, it worked, but was very inefficient for large tables. So instead of concatenating all the row hashes and then hashing that, I switched to using a "rolling hash", where the previous hash is concatenated with the text representation of a row and then that is hashed to produce the next hash. This was much better; apparently running md5 on short strings millions of extra times is better than concatenating short strings millions of times.
create function zz_concat(text, text) returns text as
'select md5($1 || $2);' language 'sql';
create aggregate zz_hashagg(text) (
sfunc = zz_concat,
stype = text,
initcond = '');
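Usage then looks like this (my sketch, assuming a table foo with primary key id; as an answer below stresses, the ORDER BY must go inside the aggregate call):

select zz_hashagg(CAST((f.*) AS text) order by id) from foo f;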
I know this is an old question; however, this is my solution:
SELECT
md5(CAST((array_agg(f.* order by id))AS text)) /* id is a primary key of table (to avoid random sorting) */
FROM
foo f;
SELECT md5(array_agg(md5((t.*)::varchar))::varchar)
FROM (
SELECT *
FROM my_table
ORDER BY 1
) AS t
Just do it like this to create a hash aggregation function:
create function pg_concat( text, text ) returns text as '
begin
if $1 isnull then
return $2;
else
return $1 || $2;
end if;
end;' language 'plpgsql';
create function pg_concat_fin(text) returns text as '
begin
return $1;
end;' language 'plpgsql';
create aggregate pg_concat (
basetype = text,
sfunc = pg_concat,
stype = text,
finalfunc = pg_concat_fin);
Then you can use the pg_concat function to calculate the table's hash value:
select md5(pg_concat(md5(CAST((f.*)AS text)))) from f order by id
I had a similar requirement, to use when testing a specialized table replication solution.
@Ben's rolling MD5 solution (which he appended to the question) seems quite efficient, but there were a couple of traps which tripped me up.
The first (mentioned in some of the other answers) is that you need to ensure that the aggregate is performed in a known order over the table you are checking. The syntax for that is e.g.:
select zz_hashagg(CAST((example.*)AS text) order by id) from example;
Note the order by is inside the aggregate.
The second is that using CAST((example.*) AS text) will not give identical results for two tables with the same column contents unless the columns were created in the same order. In my case that was not guaranteed, so to get a true comparison I had to list the columns separately, for example:
select zz_hashagg(CAST((example.id, example.a, example.c)AS text) order by id) from example;
For completeness (in case a subsequent edit should remove it) here is the definition of the zz_hashagg from @Ben's question:
create function zz_concat(text, text) returns text as
'select md5($1 || $2);' language 'sql';
create aggregate zz_hashagg(text) (
sfunc = zz_concat,
stype = text,
initcond = '');
Tomas Greif's solution is nice, but for a big enough table an invalid memory alloc request size error will occur. This can be overcome with two options.
Option 1. Without batches
If the table is not too big, use string_agg and the bytea data type.
select
md5(string_agg(c.row_hash, '' order by c.row_hash)) table_hash
from
foo f
cross join lateral(select ('\x' || md5(f::text))::bytea row_hash) c
;
Option 2. With batches
If the query in the previous option ends with an error like
SQL Error [54000]: ERROR: out of memory
Detail: Cannot enlarge string buffer containing 1073741808 bytes by 16 more bytes.
then the row count limit is 1073741808 / 16 = 67108863, and the table should be divided into batches.
select
md5(string_agg(t.batch_hash, '' order by t.batch_hash)) table_hash
from(
select
md5(string_agg(c.row_hash, '' order by c.row_hash)) batch_hash
from
foo f
cross join lateral(select ('\x' || md5(f::text))::bytea row_hash) c
group by substring(row_hash for 3)
) t
;
The 3 in the group by clause divides the row hashes into 16,777,216 batches (2 would give 65,536 batches, 1 gives 256). Other batching methods (e.g. ntile) will also work.
P.S. If you need to compare two tables this post may help.
Great answers.
In case someone needs to avoid aggregate functions while keeping support for tables several GiB in size, the following function has only a small performance penalty compared to the best answers above on the largest tables.
CREATE OR REPLACE FUNCTION table_md5(
      table_name CHARACTER VARYING
    , VARIADIC order_key_columns CHARACTER VARYING [])
RETURNS CHARACTER VARYING AS $$
DECLARE
    order_key_columns_list CHARACTER VARYING;
    query                  CHARACTER VARYING;
    first                  BOOLEAN;
    i                      SMALLINT;
    working_cursor         REFCURSOR;
    working_row_md5        CHARACTER VARYING;
    partial_md5_so_far     CHARACTER VARYING;
BEGIN
    order_key_columns_list := '';
    first := TRUE;
    FOR i IN 1..array_length(order_key_columns, 1) LOOP
        IF first THEN
            first := FALSE;
        ELSE
            order_key_columns_list := order_key_columns_list || ', ';
        END IF;
        order_key_columns_list := order_key_columns_list || order_key_columns[i];
    END LOOP;
    query := (
        'SELECT ' ||
            'md5(CAST(t.* AS TEXT)) ' ||
        'FROM (' ||
            'SELECT * FROM ' || table_name || ' ' ||
            'ORDER BY ' || order_key_columns_list ||
        ') t');
    OPEN working_cursor FOR EXECUTE (query);
    -- RAISE NOTICE 'opened cursor for query: ''%''', query;
    first := TRUE;
    LOOP
        FETCH working_cursor INTO working_row_md5;
        EXIT WHEN NOT FOUND;
        IF first THEN
            first := FALSE;
            SELECT working_row_md5 INTO partial_md5_so_far;
        ELSE
            SELECT md5(working_row_md5 || partial_md5_so_far)
            INTO partial_md5_so_far;
        END IF;
        -- RAISE NOTICE 'partial md5 so far: %', partial_md5_so_far;
    END LOOP;
    -- RAISE NOTICE 'final md5: %', partial_md5_so_far;
    RETURN partial_md5_so_far :: CHARACTER VARYING;
END;
$$ LANGUAGE plpgsql;
Used as:
SELECT table_md5(
'table_name', 'sorting_col_0', 'sorting_col_1', ..., 'sorting_col_n'
);
As for the algorithm, you could XOR all the individual MD5 hashes, or concatenate them and hash the concatenation.
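A minimal sketch of the XOR variant (my own illustration, not from an answer): XOR is order-independent, so no sort is needed, but be aware that identical duplicate rows cancel each other out:

-- state function: XOR two 128-bit values; STRICT lets the first row seed the state
create function md5_xor(bit(128), bit(128)) returns bit(128) as
  'select $1 # $2' language sql strict immutable;
create aggregate md5_xor_agg(bit(128)) (
  sfunc = md5_xor,
  stype = bit(128));

-- hex md5 text -> bit(128), then XOR all row hashes (foo is a placeholder table)
select md5_xor_agg(('x' || md5(f::text))::bit(128)) from foo f;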
If you want to do this completely server-side you probably have to create your own aggregation function, which you could then call.
select my_table_hash(md5(CAST((f.*) AS text)) order by id) from f;
As an intermediate step, instead of copying the whole table to the client, you could just select the MD5 results for all rows, and run those through md5sum.
Either way you need to establish a fixed sort order, otherwise you might end up with different checksums even for the same data.
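The intermediate step can be as simple as this (a sketch; my_table and the id ordering are assumptions), with the client piping the resulting rows through md5sum:

SELECT md5(t::text) FROM my_table t ORDER BY id;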