PostgreSQL function confusion - SQL

if I write a query as such:
with WordBreakDown (idx, word, wordlength) as (
    select
        row_number() over () as idx,
        word,
        character_length(word) as wordlength
    from
        unnest(string_to_array('yo momma so fat', ' ')) as word
)
select
    cast(wbd.idx + (
        select SUM(wbd2.wordlength)
        from WordBreakDown wbd2
        where wbd2.idx <= wbd.idx
    ) - wbd.wordlength as integer) as position,
    cast(wbd.word as character varying(512)) as part
from
    WordBreakDown wbd;
... I get a table of 4 rows like so:
1;"yo"
4;"momma"
10;"so"
13;"fat"
... this is what I want. HOWEVER, if I wrap this into a function like so:
drop type if exists split_result cascade;
create type split_result as(
position integer,
part character varying(512)
);
drop function if exists split(character varying(512), character(1));
create function split(
    _s character varying(512),
    _sep character(1)
) returns setof split_result as $$
begin
    return query
    with WordBreakDown (idx, word, wordlength) as (
        select
            row_number() over () as idx,
            word,
            character_length(word) as wordlength
        from
            unnest(string_to_array(_s, _sep)) as word
    )
    select
        cast(wbd.idx + (
            select SUM(wbd2.wordlength)
            from WordBreakDown wbd2
            where wbd2.idx <= wbd.idx
        ) - wbd.wordlength as integer) as position,
        cast(wbd.word as character varying(512)) as part
    from
        WordBreakDown wbd;
end;
$$ language plpgsql;
select * from split('yo momma so fat', ' ');
... I get:
1;"yo momma so fat"
I'm scratching my head on this. What am I screwing up?
UPDATE
Per the suggestions below, I have replaced the function as such:
CREATE OR REPLACE FUNCTION split(_string character varying(512), _sep character(1))
  RETURNS TABLE (postition int, part character varying(512)) AS
$BODY$
BEGIN
RETURN QUERY
WITH wbd AS (
    SELECT (row_number() OVER ())::int AS idx
          ,word
          ,length(word) AS wordlength
    FROM   unnest(string_to_array(_string, rpad(_sep, 1))) AS word
    )
SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
      ,word::character varying(512)  -- AS part
FROM   wbd;
END;
$BODY$ LANGUAGE plpgsql;
... which keeps my original function signature for maximum compatibility, and the lion's share of the performance gains. Thanks to the answerers, I found this to be a multifaceted learning experience. Your explanations really helped me understand what was going on.
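For reference, the replaced function returns the same four rows as the stand-alone query at the top:
select * from split('yo momma so fat', ' ');
1;"yo"
4;"momma"
10;"so"
13;"fat"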

Observe this:
select length(' '::character(1));
length
--------
0
(1 row)
A cause of this confusion is the bizarre definition of the character type in the SQL standard. From the Postgres documentation for character types:
Values of type character are physically padded with spaces to the specified width n, and are stored and displayed that way. However, the padding spaces are treated as semantically insignificant. Trailing spaces are disregarded when comparing two values of type character, and they will be removed when converting a character value to one of the other string types.
So you should use string_to_array(_s, rpad(_sep,1)).
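A quick way to see the effect on string_to_array itself (not from the original post, but it follows directly from the quoted rule): the character(1) separator collapses to an empty string when it is implicitly cast to text, and string_to_array treats an empty delimiter as "don't split":
select string_to_array('yo momma so fat', ' '::character(1));            -- {"yo momma so fat"}
select string_to_array('yo momma so fat', rpad(' '::character(1), 1));   -- {yo,momma,so,fat}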

You had several constructs that probably did not do what you think they would.
Here is a largely simplified version of your function, that is also quite a bit faster:
CREATE OR REPLACE FUNCTION split(_string text, _sep text)
  RETURNS TABLE (postition int, part text) AS
$BODY$
BEGIN
RETURN QUERY
WITH wbd AS (
    SELECT (row_number() OVER ())::int AS idx
          ,word
          ,length(word) AS wordlength
    FROM   unnest(string_to_array(_string, _sep)) AS word
    )
SELECT (sum(wordlength) OVER (ORDER BY idx))::int + idx - wordlength
      ,word  -- AS part
FROM   wbd;
END;
$BODY$ LANGUAGE plpgsql;
Explanation
Use another window function to sum up the word lengths. Faster, simpler and cleaner. This makes for most of the performance gain; a lot of correlated sub-queries slow you down. (See the stand-alone sketch after this list.)
Use the data type text instead of character varying or even character(n). character varying and character are awful types, mostly just there for compatibility with the SQL standard and for historical reasons. There is hardly anything you can do with those that could not better be done with text. In the meantime @Tometzky has explained why character(1) was a particularly bad choice for the parameter type. I fixed that by using text instead.
As @Tometzky demonstrated, unnest(string_to_array(..)) is faster than regexp_split_to_table(..) - even if just by a tiny bit for small strings like the ones we use here (max. 512 characters). So I switched back to your original expression.
length() does the same as character_length().
In a query with only one table source (and no other possible naming conflicts) you might as well not table-qualify column names. Simplifies the code.
We need an integer value in the end, so I cast all numerical values (bigint in this case) to integer right away, so additions and subtractions are done with integer arithmetic which is generally fastest.
'value'::int is just shorter syntax for cast('value' as integer) and otherwise equivalent.
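For illustration, here is the running sum from the first point on its own, outside the function (same data as the question; just a sketch to show what replaces the correlated sub-query):
WITH wbd AS (
    SELECT (row_number() OVER ())::int AS idx
          ,word
          ,length(word) AS wordlength
    FROM   unnest(string_to_array('yo momma so fat', ' ')) AS word
    )
SELECT idx, word, wordlength
      ,sum(wordlength) OVER (ORDER BY idx) AS running_length
FROM   wbd;
The final position is then running_length + idx - wordlength, which yields 1, 4, 10, 13 as in the question.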

I found the answer, but I don't understand it.
The string_to_array(_s, _sep) function does not split when the separator is of type character (non-varying); even when I wrote it like this it did not work:
string_to_array(_s, cast(_sep as character varying(1)))
BUT if I redefined the parameters as such:
drop function if exists split(character varying(512), character(1));
create function split(
_s character varying(512),
_sep character varying(1)
... all of a sudden it works as I expected. Dunno what to make of this, and really not the answer I wanted... now I have changed the signature of the function, which is not what I wanted to do.
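For comparison with the character(1) behavior shown in the answer above, a character varying separator keeps its trailing space (a quick check, not from the original post):
select length(' '::character varying(1));
 length
--------
      1
(1 row)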

Related

I wrote a procedure, but only one column is filled in - can you tell me where the error is?

I tried to write a procedure that fills the table with random data using a pseudo-random sequence formula, but only one column gets filled. Please help me find my error; here is my code:
CREATE OR REPLACE PROCEDURE pr_zakaz2(Номер integer, Сумма integer)
AS
$$
BEGIN
INSERT INTO Заказ(Номер, Сумма)
SELECT Заказ(Номер, Сумма) FROM generate_series(1, 100000), i WHERE
result = next * 1103515245+12345;
END;
$$ language 'plpgsql';
I think the syntax you want is:
INSERT INTO Заказ(Номер, Сумма)
SELECT i, (i::bigint * 1103515245 + 12345) % 101
FROM generate_series(1, 100000) AS gs(i);
Note the % 101, which is typically used in this case. It limits the result to a finite range -- here 0 to 100. Also, the two numbers used would typically be prime (or at least relatively prime) -- and numbers ending in 5 are not.
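A hedged sketch of a complete procedure built around that INSERT (the parameters from the original signature are dropped here because they shadow the column names; table and column names are taken from the question):
CREATE OR REPLACE PROCEDURE pr_zakaz2()
LANGUAGE plpgsql
AS $$
BEGIN
    -- Номер gets the series index, Сумма a pseudo-random value in 0..100;
    -- the cast to bigint avoids int4 overflow for large i
    INSERT INTO Заказ(Номер, Сумма)
    SELECT i, (i::bigint * 1103515245 + 12345) % 101
    FROM generate_series(1, 100000) AS gs(i);
END;
$$;
CALL pr_zakaz2();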

Improvement of a (working) function to obtain the rows where a decimal number has a certain number of decimal positions

I have created three test tables: users, teams and memberships, which relates users and teams.
The users table contains a user_id column which is the primary key.
The memberships table contains a user_id foreign key and another column called cost, with a decimal value.
Then I proposed myself the following SQL challenge, based on some interview questions I have read:
"Write the SQL code needed in order to get users with a cost that has at most N decimal places"
The SQL code must use SQL functions (and I use PostgreSQL).
The actual code I have written is:
CREATE OR REPLACE FUNCTION GET_NTH_DEC_RESIDUE(NUMERIC, INTEGER) RETURNS NUMERIC AS
$function$
SELECT CAST(CAST($1 AS NUMERIC) * POW(10,$2) - FLOOR(CAST($1 AS NUMERIC)*POW(10,$2)) AS NUMERIC);
$function$
LANGUAGE SQL;
SELECT user_id, cost, nth_pos_dec
FROM (SELECT user_id, cost, GET_NTH_DEC_RESIDUE(CAST(memberships.cost AS NUMERIC), 2) AS nth_pos_dec
      FROM users
      JOIN memberships
      USING (user_id)) AS T
WHERE NOT (nth_pos_dec >= 0.0 AND nth_pos_dec < 1.0);
The function GET_NTH_DEC_RESIDUE gets the residual decimal number (e.g. for 0.345 and 2 decimal positions, the function returns 0.5, for 0.12345 and 3 decimal positions it returns 0.45). The cost values we are looking for are then those which are not in the range [0,1).
By "applying" the function to the joined users+memberships view, it generates a new column with the residual decimal numbers and the right rows can be chosen.
This solution seems to do the job pretty well, but I am not fully satisfied with it.
I tried to wrap the logical comparison into another SQL function so that the main query gets simplified, but I did not manage to make it work.
Is anyone able to devise a more elegant way to do this? (note that I am interested in using SQL functions and I do not want to do string conversions).
Thanks!
Your function is a nice one. Here are some things that I think could be optimized:
you don't need to cast the first parameter to NUMERIC - it's already of this type, so a first optimization could be:
CREATE OR REPLACE FUNCTION GET_NTH_DEC_RESIDUE(NUMERIC, INTEGER) RETURNS NUMERIC AS
$function$
SELECT CAST($1 * POW(10, $2) - FLOOR($1 * POW(10, $2)) AS NUMERIC);
$function$
LANGUAGE SQL;
when you check the return value of the function, there is no need to check whether it is less than 1 - it can never be equal to or greater than 1, so you can just check whether it is equal to 0 (I've checked that the function works fine with negative values too):
...
WHERE nth_pos_dec = 0
if you don't need the numeric value returned by this function, you could change it to return boolean and use it only in the WHERE clause (notice, that you wouldn't need the cast at all):
CREATE OR REPLACE FUNCTION GET_NTH_DEC_RESIDUE(NUMERIC, INTEGER) RETURNS BOOLEAN AS
$function$
SELECT $1 * POW(10, $2) - FLOOR($1 * POW(10, $2)) = 0;
$function$
LANGUAGE SQL;
SELECT
user_id,
cost
FROM
users
JOIN memberships
USING (user_id)
WHERE
GET_NTH_DEC_RESIDUE(CAST(memberships.cost AS NUMERIC), 2);
you can calculate POW(10, $2) only once (I'm not sure whether the query planner would do that anyway):
CREATE OR REPLACE FUNCTION GET_NTH_DEC_RESIDUE(NUMERIC, INTEGER) RETURNS BOOLEAN AS
$function$
WITH precalc AS (
SELECT POW(10, $2) AS power
)
SELECT
$1 * power - FLOOR($1 * power) = 0
FROM
precalc;
$function$
LANGUAGE SQL;
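A couple of spot checks for the boolean version (values worked out by hand, not taken from the question; 1.25 is used because it is exact in binary, so any floating-point rounding inside POW() cannot interfere):
SELECT GET_NTH_DEC_RESIDUE(1.25, 2);  -- true:  1.25 has at most 2 decimal places
SELECT GET_NTH_DEC_RESIDUE(1.25, 1);  -- false: 1.25 needs 2 decimal places, more than 1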

Hex string to integer conversion in Amazon Redshift

Amazon Redshift is based on ParAccel which is based on Postgres. From my research it seems that the preferred way to perform hexadecimal string to integer conversion in Postgres is via a bit field, as outlined in this answer.
In the case of bigint, this would be:
select ('x'||lpad('123456789abcdef',16,'0'))::bit(64)::bigint
Unfortunately, this fails on Redshift with:
ERROR: cannot cast type text to bit [SQL State=42846]
What other ways are there to perform this conversion in Postgres 8.1ish (that's close to the Redshift level of compatibility)? UDFs are not supported in Redshift and neither are array, regex functions or set generating functions...
It looks like they added a function for this at some point: STRTOL
Syntax
STRTOL(num_string, base)
Return type
BIGINT. If num_string is null, returns NULL.
For example
SELECT strtol('deadbeef', 16);
Returns: 3735928559
Assuming that you want a simple digit-by-digit ordinal position conversion (i.e. you're not worried about two's complement negatives, etc.) I think this should work on an 8.1-equivalent DB:
CREATE OR REPLACE FUNCTION hex2dec(text) RETURNS bigint AS $$
SELECT sum(CASE WHEN v >= ascii('a') THEN v - ascii('a') + 10 ELSE v - ascii('0') END * 16^ordpos)::bigint
FROM (
SELECT n-1, ascii(substring(reverse($1), n, 1))
FROM generate_series(1, length($1)) n
) AS x(ordpos, v);
$$ LANGUAGE sql IMMUTABLE;
The function form is optional, it just makes it easier to avoid repeating the argument a bunch of times. It should get inlined anyway. Efficiency will probably be awful, but most of the tools available to do this smarter don't seem to be available on versions that old, and this at least works:
regress=> CREATE TABLE t AS VALUES ('c13b'), ('a'), ('f');
regress=> SELECT hex2dec(column1) FROM t;
hex2dec
---------
49467
10
15
(3 rows)
If you can use regexp_split_to_array and generate_subscripts it might be faster. Or slower. I haven't tried. Another possible trick is to use a digit mapping array instead of the CASE, like:
'[48:102]={0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,11,12,13,14,15}'::integer[]
which you can use with:
CREATE OR REPLACE FUNCTION hex2dec(text) RETURNS bigint AS $$
SELECT sum(
('[48:102]={0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,10,11,12,13,14,15}'::integer[])[ v ]
* 16^ordpos
)::bigint
FROM (
SELECT n-1, ascii(substring(reverse($1), n, 1))
FROM generate_series(1, length($1)) n
) AS x(ordpos, v);
$$ LANGUAGE sql IMMUTABLE;
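As a quick sanity check (assuming the array-mapping variant above has replaced the earlier definition), it should return the same result as the CASE-based version for the value used earlier:
regress=> SELECT hex2dec('deadbeef');
 hex2dec
------------
 3735928559
(1 row)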
Personally, I'd do it client-side instead, rather than wrangling the limited capabilities of an old PostgreSQL fork, especially one you can't load your own sensible user-defined C functions on, or use PL/Perl, etc.
In real PostgreSQL I'd just use this:
hex2dec.c:
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "errno.h"
#include "limits.h"
#include <stdlib.h>
PG_MODULE_MAGIC;
Datum from_hex(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(hex2dec);
Datum
hex2dec(PG_FUNCTION_ARGS)
{
char *endpos;
const char *hexstr = text_to_cstring(PG_GETARG_TEXT_PP(0));
long decval = strtol(hexstr, &endpos, 16);
if (endpos[0] != '\0')
{
ereport(ERROR, (ERRCODE_INVALID_PARAMETER_VALUE, errmsg("Could not decode input string %s as hex", hexstr)));
}
if (decval == LONG_MAX && errno == ERANGE)
{
ereport(ERROR, (ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE, errmsg("Input hex string %s overflows int64", hexstr)));
}
PG_RETURN_INT64(decval);
}
Makefile:
MODULES = hex2dec
DATA = hex2dec--1.0.sql
EXTENSION = hex2dec
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
hex2dec.control:
comment = 'Utility function to convert hex strings to decimal'
default_version = '1.0'
module_pathname = '$libdir/hex2dec'
relocatable = true
hex2dec--1.0.sql:
CREATE OR REPLACE FUNCTION hex2dec(hexstr text) RETURNS bigint
AS 'hex2dec','hex2dec'
LANGUAGE c IMMUTABLE STRICT;
COMMENT ON FUNCTION hex2dec(hexstr text)
IS 'Decode the hex string passed, which may optionally have a leading 0x, as a bigint. Does not attempt to consider negative hex values.';
Usage:
CREATE EXTENSION hex2dec;
postgres=# SELECT hex2dec('7fffffffffffffff');
hex2dec
---------------------
9223372036854775807
(1 row)
postgres=# SELECT hex2dec('deadbeef');
hex2dec
------------
3735928559
(1 row)
postgres=# SELECT hex2dec('12345');
hex2dec
---------
74565
(1 row)
postgres=# select hex2dec(to_hex(-1));
hex2dec
------------
4294967295
(1 row)
postgres=# SELECT hex2dec('8fffffffffffffff');
ERROR: Input hex string 8fffffffffffffff overflows int64
postgres=# SELECT hex2dec('0x7abcz123');
ERROR: Could not decode input string 0x7abcz123 as hex
The performance difference is ... noteworthy. Given sample data:
CREATE TABLE randhex AS
SELECT '0x'||to_hex( abs(random() * (10^((random()-.5)*10)) * 10000000)::bigint) AS h
FROM generate_series(1,1000000);
conversion from hex to decimal takes about 1.3 seconds from a warm cache using the C extension, which isn't great for a million rows. Reading them without any transformation takes 0.95s. It took 36 seconds for the SQL-based hex2dec approach to process the same rows. Frankly, I'm really impressed that the SQL approach was as fast as that, and surprised the C extension was that slow.
A likely explanation is that the cast from text to bit(n) relies on undocumented behavior. I repeat the quote from Tom Lane:
This is relying on some undocumented behavior of the bit-type input
converter, but I see no reason to expect that would break. A possibly
bigger issue is that it requires PG >= 8.3 since there wasn't a text
to bit cast before that.
And Amazon's derivative obviously does not allow this undocumented feature. Not surprising, since it is based on Postgres 8.1, where there was no such cast at all.
Previously quoted in this closely related answer:
Convert hex in text representation to decimal number

Pass multiple values in single parameter

I want to call a function by passing multiple values on single parameter, like this:
SELECT * FROM jobTitle('270,378');
Here is my function.
CREATE OR REPLACE FUNCTION test(int)
RETURNS TABLE (job_id int, job_reference int, job_job_title text
, job_status text) AS
$$
BEGIN
RETURN QUERY
select jobs.id,jobs.reference, jobs.job_title,
ltrim(substring(jobs.status,3,char_length(jobs.status))) as status
FROM jobs ,company c
WHERE jobs."DeleteFlag" = '0'
and c.id= jobs.id and c.DeleteFlag = '0' and c.active = '1'
and (jobs.id = $1 or -1 = $1)
order by jobs.job_title;
END;
$$ LANGUAGE plpgsql;
Can someone help with the syntax? Or even provide sample code?
VARIADIC
Like @mu provided, VARIADIC is your friend. One more important detail:
You can also call a function using a VARIADIC parameter with an array type directly. Add the key word VARIADIC in the function call:
SELECT * FROM f_test(VARIADIC '{1, 2, 3}'::int[]);
is equivalent to:
SELECT * FROM f_test(1, 2, 3);
Other advice
In Postgres 9.1 or later right() with a negative length is faster and simpler to trim leading characters from a string:
right(j.status, -2)
is equivalent to:
substring(j.status, 3, char_length(j.status))
You have j."DeleteFlag" as well as j.DeleteFlag (without double quotes) in your query. This is probably incorrect. See:
PostgreSQL Error: Relation already exists
"DeleteFlag" = '0' indicates another problem. Unlike other RDBMS, Postgres properly supports the boolean data type. If the flag holds boolean data (true / false / NULL) use the boolean type. A character type like text would be inappropriate / inefficient.
Proper function
You don't need PL/pgSQL here. You can use a simpler SQL function:
CREATE OR REPLACE FUNCTION f_test(VARIADIC int[])
RETURNS TABLE (id int, reference int, job_title text, status text)
LANGUAGE sql AS
$func$
SELECT j.id, j.reference, j.job_title
, ltrim(right(j.status, -2)) AS status
FROM company c
JOIN job j USING (id)
WHERE c.active
AND NOT c.delete_flag
AND NOT j.delete_flag
AND (j.id = ANY($1) OR '{-1}'::int[] = $1)
ORDER BY j.job_title
$func$;
Don't do strange and horrible things like converting a list of integers to a CSV string, this:
jobTitle('270,378')
is not what you want. You want to say things like this:
jobTitle(270, 378)
jobTitle(array[270, 378])
If you're going to be calling jobTitle by hand then a variadic function would probably be easiest to work with:
create or replace function jobTitle(variadic int[])
returns table (...) as $$
-- $1 will be an array of integers in here, so UNNEST, IN, ANY, ... as needed
Then you can jobTitle(6), jobTitle(6, 11), jobTitle(6, 11, 23, 42), ... as needed.
If you're going to be building the jobTitle arguments in SQL then the explicit-array version would probably be easier to work with:
create or replace function jobTitle(int[])
returns table (...) as $$
-- $1 will be an array of integers in here, so UNNEST, IN, ANY, ... as needed
Then you could jobTitle(array[6]), jobTitle(array[6, 11]), ... as needed and you could use all the usual array operators and functions to build argument lists for jobTitle.
I'll leave the function's internals as an exercise for the reader.
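For completeness, a minimal sketch of what those internals could look like, using = ANY over the array parameter (column names from the question; the company join and flag checks from the other answer are left out for brevity):
create or replace function jobTitle(variadic int[])
returns table (id int, reference int, job_title text, status text) as $$
    select j.id, j.reference, j.job_title
         , ltrim(right(j.status, -2)) as status
    from   jobs j
    where  j.id = any($1)   -- $1 is the int[] built from the variadic arguments
    order  by j.job_title;
$$ language sql;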

Convert numeric to string inside a user-defined function

I am trying to call/convert a numeric variable into a string inside a user-defined function. I was thinking about using to_char, but it didn't work.
My function is like this:
create or replace function ntile_loop(x numeric)
returns setof numeric as
$$
select
max("billed") as _____(to_char($1,'99')||"%"???) from
(select "billed", "id","cm",ntile(100)
over (partition by "id","cm" order by "billed")
as "percentile" from "table_all") where "percentile"=$1
group by "id","cm","percentile";
$$
language sql;
My purpose is to define a new variable with "x%" as its name, with x varying as the function input. In context, x is numeric and will be used again later in the function as a numeric (that part of the code wasn't included in the sample above).
What I want to return:
I simply want to return a block of code so that every time I change the percentile number, I don't have to run this block of code again and again. I'd like to calculate 5, 10, 20, 30, ....90th percentile and display all of them in the same table for each id+cm group.
That's why I was thinking about macro or function, but didn't find any solutions I like.
Thank you for your answers. Yes, I will definitely read basics while I am learning. Today's my second day to use SQL, but have to generate some results immediately.
Converting numeric to text is the least of your problems.
My purpose is to define a new variable "x%" as its name, with x
varying as the function input.
First of all: there are no variables in an SQL function. SQL functions are just wrappers for valid SQL statements. Input and output parameters can be named, but names are static, not dynamic.
You may be thinking of a PL/pgSQL function, where you have procedural elements including variables. Parameter names are still static, though. There are no dynamic variable names in plpgsql. You can execute dynamic SQL with EXECUTE but that's something different entirely.
While it is possible to declare a static variable with a name like "123%" it is really exceptionally uncommon to do so. Maybe for deliberately obfuscating code? Other than that: Don't. Use proper, simple, legal, lower case variable names without the need to double-quote and without the potential to do something unexpected after a typo.
Since the window function ntile() returns integer and you run an equality check on the result, the input parameter should be integer, not numeric.
To assign a variable in plpgsql you can use the assignment operator := for a single variable or SELECT INTO for any number of variables. Either way, you want the query to return a single row or you have to loop.
If you want the maximum billed from the chosen percentile, you don't GROUP BY x, y. That might return multiple rows and does not do what you seem to want. Use plain max(billed) without GROUP BY to get a single row.
You don't need to double quote perfectly legal column names.
A valid function might look like this. It's not exactly what you were trying to do, which cannot be done. But it may get you closer to what you actually need.
CREATE OR REPLACE FUNCTION ntile_loop(x integer)
RETURNS SETOF numeric as
$func$
DECLARE
myvar text;
BEGIN
SELECT INTO myvar max(billed)
FROM (
SELECT billed, id, cm
,ntile(100) OVER (PARTITION BY id, cm ORDER BY billed) AS tile
FROM table_all
) sub
WHERE sub.tile = $1;
-- do something with myvar, depending on the value of $1 ...
END
$func$ LANGUAGE plpgsql;
Long story short, you need to study the basics before you try to create sophisticated functions.
Plain SQL
After Q update:
I'd like to calculate 5, 10, 20, 30, ....90th percentile and display
all of them in the same table for each id+cm group.
This simple query should do it all:
SELECT id, cm, tile, max(billed) AS max_billed
FROM (
SELECT billed, id, cm
,ntile(100) OVER (PARTITION BY id, cm ORDER BY billed) AS tile
FROM table_all
) sub
WHERE (tile%10 = 0 OR tile = 5)
AND tile <= 90
GROUP BY 1,2,3
ORDER BY 1,2,3;
% .. modulo operator
GROUP BY 1,2,3 .. positional parameter
It looks like you're looking for return query execute, returning the result from a dynamic SQL statement:
http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html
http://www.postgresql.org/docs/current/static/plpgsql-statements.html
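Since the links alone don't show the shape of it, here is a minimal, hypothetical sketch of RETURN QUERY EXECUTE with a dynamic column name (the function name is made up; table and column names are taken from the question):
CREATE OR REPLACE FUNCTION max_billed_for_tile(_col text, _tile int)
  RETURNS SETOF numeric
  LANGUAGE plpgsql AS
$func$
BEGIN
   -- format() quotes the identifier with %I; the value is passed safely via USING
   RETURN QUERY EXECUTE format(
      'SELECT max(%I)
       FROM  (SELECT %I, ntile(100) OVER (PARTITION BY id, cm ORDER BY %I) AS tile
              FROM   table_all) sub
       WHERE  sub.tile = $1'
    , _col, _col, _col)
   USING _tile;
END
$func$;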