I have a table in my Postgres database that I'm trying to determine fill rates for (that is, I'm trying to understand how often data is or isn't missing). I need to write a function that, for each column in a list of a couple dozen columns I've selected, counts the number and percentage of rows with non-null values.
The problem is, I don't really know how to iterate through a list of columns in a programmatic way, because I don't know how to reference a column from a string of its name. I've read about how you can use the EXECUTE command to run dynamically-written SQL, but I haven't been able to get it to work. Here's my current function:
CREATE OR REPLACE FUNCTION get_fill_rates() RETURNS TABLE (field_name text, fill_count integer, fill_percentage float) AS $$
DECLARE
fields text[] := array['column_a', 'column_b', 'column_c'];
total_rows integer;
BEGIN
SELECT reltuples INTO total_rows FROM pg_class WHERE relname = 'my_table';
FOR i IN array_lower(fields, 1) .. array_upper(fields, 1)
LOOP
field_name := fields[i];
EXECUTE 'SELECT COUNT(*) FROM my_table WHERE $1 IS NOT NULL' INTO fill_count USING field_name;
fill_percentage := fill_count::float / total_rows::float;
RETURN NEXT;
END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT * FROM get_fill_rates() ORDER BY fill_count DESC;
This function, as written, returns every field as having a 100% fill rate, which I know to be false. How can I make this function work?
I know you already solved it, but let me suggest that you avoid concatenating identifiers into dynamic queries. You can use format() with an identifier placeholder (%I) instead:
CREATE OR REPLACE FUNCTION get_fill_rates() RETURNS TABLE (field_name text, fill_count integer, fill_percentage float) AS $$
DECLARE
fields text[] := array['column_a', 'column_b', 'column_c'];
table_name name := 'my_table';
total_rows integer;
BEGIN
SELECT reltuples INTO total_rows FROM pg_class WHERE relname = table_name;
FOREACH field_name IN ARRAY fields
LOOP
EXECUTE format('SELECT COUNT(*) FROM %I WHERE %I IS NOT NULL', table_name, field_name) INTO fill_count;
fill_percentage := fill_count::float / total_rows::float;
RETURN NEXT;
END LOOP;
END;
$$ LANGUAGE plpgsql;
Doing it this way will help you prevent SQL injection attacks and will reduce query parse overhead a bit. More info here.
I figured out the solution after I wrote my question but before I submitted it. Since I've already done the work of writing the question, I'll just go ahead and share the answer. The problem was in my EXECUTE statement, specifically with that USING field_name bit. I think the column name was being treated as a string literal, which meant the query was evaluating whether 'a string literal' IS NOT NULL, which is of course always true.
Instead of parameterizing the column name, I need to inject it directly into the query string. So, I changed my EXECUTE line to the following:
EXECUTE 'SELECT COUNT(*) FROM my_table WHERE ' || field_name || ' IS NOT NULL' INTO fill_count;
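To see the difference side by side, here is a minimal sketch. The temp table fill_demo and the variable names are made up for illustration and are not part of the original question. Values passed with USING are bound as data, never as identifiers, so the first query tests the string itself rather than the column:

```sql
-- Hypothetical demo table: column_a is NULL in one of three rows.
CREATE TEMP TABLE fill_demo (column_a int);
INSERT INTO fill_demo VALUES (1), (NULL), (3);

DO $$
DECLARE
    as_value      bigint;
    as_identifier bigint;
BEGIN
    -- $1 is bound to the text value 'column_a', which is never NULL,
    -- so every row passes the filter.
    EXECUTE 'SELECT count(*) FROM fill_demo WHERE $1 IS NOT NULL'
    INTO  as_value
    USING 'column_a'::text;

    -- %I splices the name in as a quoted identifier, so the column itself is tested.
    EXECUTE format('SELECT count(*) FROM fill_demo WHERE %I IS NOT NULL', 'column_a')
    INTO  as_identifier;

    RAISE NOTICE 'bound as value: %, used as identifier: %', as_value, as_identifier;
END
$$;
```

With this data the notice should report 3 for the bound value and 2 for the identifier.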
Some problems in the code aside (see below), this can be substantially faster and simpler with a single scan over the table in a plain query:
SELECT v.*
FROM (
SELECT count(column_a) AS ct_column_a
, count(column_b) AS ct_column_b
, count(column_c) AS ct_column_c
, count(*)::numeric AS ct
FROM my_table
) sub
, LATERAL (
VALUES
(text 'column_a', ct_column_a, round(ct_column_a / ct, 3))
, (text 'column_b', ct_column_b, round(ct_column_b / ct, 3))
, (text 'column_c', ct_column_c, round(ct_column_c / ct, 3))
) v(field_name, fill_count, fill_percentage);
The crucial point here is that count(column) only counts non-null values to begin with, so no extra tricks are required.
I rounded the percentage to 3 decimal digits, which is optional. For this I cast to numeric.
Use a VALUES expression to unpivot the results and get one row per field.
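As a quick illustration of that counting behaviour, here is a small sketch on a hypothetical temp table (count_demo and its data are assumptions made up for this example):

```sql
-- Hypothetical demo data: 4 rows, each column filled in 2 of them.
CREATE TEMP TABLE count_demo (column_a int, column_b text);
INSERT INTO count_demo VALUES (1, 'x'), (NULL, 'y'), (3, NULL), (NULL, NULL);

SELECT count(*)        AS all_rows   -- counts every row: 4
     , count(column_a) AS filled_a   -- counts rows where column_a is not null: 2
     , count(column_b) AS filled_b   -- counts rows where column_b is not null: 2
FROM   count_demo;
```

All three aggregates are computed in the same single pass over the table, which is why the combined query avoids one scan per column.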
For repeated use or if you have a long list of columns to process, you can generate and execute the query dynamically. But, again, don't run a separate count for each column. Just build the above query dynamically:
CREATE OR REPLACE FUNCTION get_fill_rates(tbl regclass, fields text[])
RETURNS TABLE (field_name text, fill_count bigint, fill_percentage numeric) AS
$func$
BEGIN
RETURN QUERY EXECUTE (
-- RAISE NOTICE '%', ( -- to debug if needed
SELECT
'SELECT v.*
FROM (
SELECT count(*)::numeric AS ct
, ' || string_agg(format('count(%I) AS %I', fld, 'ct_' || fld), ', ') || '
FROM ' || tbl || '
) sub
, LATERAL (
VALUES
(text ' || string_agg(format('%L, %2$I, round(%2$I/ ct, 3))', fld, 'ct_' || fld), ', (') || '
) v(field_name, fill_count, fill_pct)
ORDER BY v.fill_count DESC'
FROM unnest(fields) fld
);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM get_fill_rates('my_table', '{column_a, column_b, column_c}');
As you can see, this works for any given table and column list now.
And all identifiers are properly quoted automatically, using format() or by the built-in virtues of the regclass type.
Related:
Table name as a PostgreSQL function parameter
How to unpivot a table in PostgreSQL
Query for crosstab view
Convert one row into multiple rows with fewer columns
Your original query could be improved like this, but this is just lipstick on a pig. Do not use this inefficient approach.
CREATE OR REPLACE FUNCTION get_fill_rates()
RETURNS TABLE (field_name text, fill_count bigint, fill_percentage float) AS
$$
DECLARE
fields text[] := '{column_a, column_b, column_c}'; -- must be legal identifiers!
total_rows float; -- use float right away
BEGIN
SELECT reltuples INTO total_rows FROM pg_class WHERE relname = 'my_table';
FOREACH field_name IN ARRAY fields -- use FOREACH
LOOP
EXECUTE 'SELECT COUNT(*) FROM my_table WHERE ' || field_name || ' IS NOT NULL'
INTO fill_count;
fill_percentage := fill_count / total_rows; -- already type float
RETURN NEXT;
END LOOP;
END
$$ LANGUAGE plpgsql;
Plus, pg_class.reltuples is only an estimate. Since you are counting anyway, use an actual count.
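A rough way to see the gap between the estimate and the real count is sketched below. The temp table est_demo is hypothetical. Looking the table up by its regclass OID is also an assumption of this sketch rather than something the original function does; it avoids matching same-named tables in other schemas.

```sql
-- Hypothetical demo table with exactly 1000 rows.
CREATE TEMP TABLE est_demo (id int);
INSERT INTO est_demo SELECT generate_series(1, 1000);

-- Planner estimate: only refreshed by VACUUM, ANALYZE and a few DDL commands,
-- so on a freshly filled table it is usually not the real row count.
SELECT reltuples FROM pg_class WHERE oid = 'est_demo'::regclass;

-- Exact number of rows right now.
SELECT count(*) FROM est_demo;

-- After refreshing statistics the estimate should be close to the count.
ANALYZE est_demo;
SELECT reltuples FROM pg_class WHERE oid = 'est_demo'::regclass;
```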
Related:
Iterating over integer[] in PL/pgSQL
Fast way to discover the row count of a table in PostgreSQL
Related
The following function identifies columns with null values. How can I extend the where clause to check null or empty value?
coalesce(TRIM(string), '') = ''
CREATE OR REPLACE FUNCTION public.is_column_empty(IN table_name varchar, IN column_name varchar)
RETURNS bool
LANGUAGE plpgsql
AS $function$
declare
count integer;
BEGIN
execute FORMAT('SELECT COUNT(*) from %s WHERE %s IS NOT NULL', table_name, quote_ident(column_name)) into count;
RETURN (count = 0);
END;
$function$
;
There are more possibilities. For example, you can use custom string delimiters (dollar quoting):
CREATE OR REPLACE FUNCTION public.is_column_empty(IN table_name varchar,
IN column_name varchar)
RETURNS bool
LANGUAGE plpgsql
AS $function$
DECLARE _found boolean; /* attention "count" is keyword */
BEGIN
EXECUTE format($_$SELECT EXISTS(SELECT * FROM %I WHERE COALESCE(trim(%I), '') <> '')$_$,
table_name, column_name)
INTO _found;
RETURN NOT _found;
END;
$function$;
Your example has more issues:
Don't use count() when you don't really need to know the number of rows (items); it can be pretty slow on bigger tables. EXISTS is enough here.
Keywords are usually written in uppercase.
Don't use variable names that are SQL or PL/pgSQL keywords (reserved or unreserved); they can cause problems in some contexts (count, user, ...).
This is a classic example of chaos in the data: you should disallow empty strings. Then you can use an index and the predicate COLNAME IS NOT NULL, which can be pretty fast.
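The last point can be sketched as follows. The table notes_demo, its columns and the index name are hypothetical, and the CHECK constraint is just one way to rule out empty strings (it still lets whitespace-only strings through):

```sql
-- Hypothetical table: NULL is allowed in body, the empty string is rejected.
CREATE TEMP TABLE notes_demo (
    id   int PRIMARY KEY,
    body text CHECK (body <> '')   -- a NULL body still passes the check
);
CREATE INDEX notes_demo_body_idx ON notes_demo (body);

INSERT INTO notes_demo VALUES (1, 'hello'), (2, NULL);

-- With empty strings ruled out, "has a value" is simply IS NOT NULL,
-- and EXISTS can stop at the first matching row instead of counting all rows.
SELECT NOT EXISTS (SELECT 1 FROM notes_demo WHERE body IS NOT NULL) AS is_column_empty;
```

Here the query returns false, because row 1 has a value.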
You need to double up the quotation marks, like this:
CREATE OR REPLACE FUNCTION public.is_column_empty(IN table_name varchar, IN column_name varchar)
RETURNS bool
LANGUAGE plpgsql
AS $function$
declare
count integer;
BEGIN
execute FORMAT('SELECT COUNT(*) from %s WHERE COALESCE(TRIM(%s),'''') <> ''''', table_name, quote_ident(column_name)) into count;
RETURN (count = 0);
END;
$function$
;
EDIT:
Re-reading your question, I was a little unsure that you are getting what you want. As it stands the function returns false if at least one row has a value in the given column, even if all the other rows are empty. Is this really what you want, or are you rather looking for columns where any row has this column empty?
Here's my example table:
CREATE TABLE IF NOT EXISTS public.cars
(
id serial PRIMARY KEY,
make varchar(32) not null,
model varchar(32),
has_automatic_transmission boolean not null default false,
created_on_date timestamptz not null DEFAULT NOW()
);
I have a function that allows my data service to insert a car into the database. It looks like this:
drop function if exists cars_insert;
create function cars_insert
(
in make_in text,
in model_in text,
in has_automatic_transmission_in boolean,
in created_on_date_in timestamptz
)
returns public.cars as
$$
declare result_set public.cars;
begin
insert into cars
(
make,
model,
has_automatic_transmission,
created_on_date
)
values
(
make_in,
model_in,
has_automatic_transmission_in,
created_on_date_in
)
returning * into result_set;
return result_set;
end;
$$
language 'plpgsql';
This works really well until the service wants to insert a car with no value for has_automatic_transmission or created_on_date. In that case they'd send null for those parameters and would expect the database to use a default value. But instead the database rejects that null for obvious reasons (NOT NULL!).
What I want to do is have the insert routine do a coalesce to DEFAULT, but that doesn't work. Here's the logic I want for the insert:
insert into cars
(
make,
model,
has_automatic_transmission,
created_on_date
)
values
(
make_in,
model_in,
COALESCE(has_automatic_transmission_in, DEFAULT),
COALESCE(created_on_date_in, DEFAULT)
)
How can I effectively achieve that? Ideally it'd be some method I can apply inline to every column so that we don't need special knowledge of which columns do or don't have defaults, but I'll take anything at this point...
Except I'd like to avoid Dynamic SQL if possible.
While you need to pass values to a function, and want to insert default values instead of NULL dynamically, you could look them up like this (but see disclaimer below!):
CREATE OR REPLACE FUNCTION cars_insert (make_in text
, model_in text
, has_automatic_transmission_in boolean
, created_on_date_in timestamptz)
RETURNS public.cars AS
$func$
INSERT INTO cars(make, model, has_automatic_transmission, created_on_date)
VALUES (make_in
, model_in
, COALESCE(has_automatic_transmission_in
, (SELECT pg_get_expr(d.adbin, d.adrelid)::bool -- default_value
FROM pg_catalog.pg_attribute a
JOIN pg_catalog.pg_attrdef d ON (d.adrelid, d.adnum) = (a.attrelid, a.attnum)
WHERE a.attrelid = 'public.cars'::regclass
AND a.attname = 'has_automatic_transmission'))
, COALESCE(created_on_date_in
, (SELECT pg_get_expr(d.adbin, d.adrelid)::timestamptz -- default_value
FROM pg_catalog.pg_attribute a
JOIN pg_catalog.pg_attrdef d ON (d.adrelid, d.adnum) = (a.attrelid, a.attnum)
WHERE a.attrelid = 'public.cars'::regclass
AND a.attname = 'created_on_date'))
)
RETURNING *;
$func$
LANGUAGE sql;
db<>fiddle here
You also have to know the column type to cast the text returned from pg_get_expr().
I simplified to an SQL function, as nothing here requires PL/pgSQL.
See:
Get the default values of table columns in Postgres?
However, this only works for constants and types where a cast from text is defined. Other expressions (incl. functions) are not evaluated without dynamic SQL. now() in the example only happens to work by coincidence, as 'now' (ignoring parentheses) is a special input string for timestamptz that evaluates to the same as the function now(). Misleading coincidence. See:
Difference between now() and current_timestamp
To make it work for expressions that have to be evaluated, dynamic SQL is required - which you ruled out. But if dynamic SQL is allowed, it's much more efficient to build the target list of the INSERT dynamically and omit columns that are supposed to get default values. Or keep the target list constant and switch NULL values for the DEFAULT keyword. See:
Function to INSERT dynamic list of columns in multiple tables
Test for null in function with varying parameters
Generate DEFAULT values in a CTE UPSERT using PostgreSQL 9.3
I like Erwin's solution from the playfulness point of view, but it is quite expensive to have these subqueries in every INSERT. For practical purposes, I would recommend one of the following:
Have four INSERT statements in the function, one for each combination of default/non-default arguments, and use IF statements to pick the right one.
Don't use DEFAULT, but write a BEFORE INSERT trigger that replaces NULLs with the appropriate value.
Of course this will add overhead too. You should benchmark the different options.
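The trigger option could look roughly like the sketch below. The table cars_demo and the function and trigger names are hypothetical stand-ins for the real cars table, and EXECUTE FUNCTION assumes PostgreSQL 11 or later (older versions spell it EXECUTE PROCEDURE). BEFORE ROW triggers fire before NOT NULL constraints are checked, which is what makes this work.

```sql
-- Hypothetical, trimmed-down copy of the cars table.
CREATE TEMP TABLE cars_demo (
    id                         serial PRIMARY KEY,
    make                       varchar(32) NOT NULL,
    has_automatic_transmission boolean     NOT NULL DEFAULT false,
    created_on_date            timestamptz NOT NULL DEFAULT now()
);

-- Replace NULLs with the intended values before the row is stored.
-- The values are repeated here by hand; they are not read from the column defaults.
CREATE OR REPLACE FUNCTION cars_demo_fill_nulls() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
    NEW.has_automatic_transmission := COALESCE(NEW.has_automatic_transmission, false);
    NEW.created_on_date            := COALESCE(NEW.created_on_date, now());
    RETURN NEW;
END
$$;

CREATE TRIGGER cars_demo_fill_nulls
BEFORE INSERT ON cars_demo
FOR EACH ROW EXECUTE FUNCTION cars_demo_fill_nulls();

-- Explicit NULLs no longer violate the NOT NULL constraints.
INSERT INTO cars_demo (make, has_automatic_transmission, created_on_date)
VALUES ('Saab', NULL, NULL)
RETURNING *;
```

The trade-off is that the fallback values live in two places (the column defaults and the trigger), so they have to be kept in sync by hand.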
Building on the suggestions made by previous commentators, I would write a function that generates, in a dynamic fashion, an insert function for each table.
The advantage of such approach is that the resulting insert function will not use dynamic SQL at all.
Function generating function:
CREATE OR REPLACE FUNCTION f_generate_insert_function(tableid regclass) RETURNS VOID LANGUAGE PLPGSQL AS
$$
DECLARE
tablename text := tableid::text;
funcname text := tablename || '_insert';
ddl text := $ddl$
CREATE OR REPLACE FUNCTION %s (%s) RETURNS %s LANGUAGE PLPGSQL AS $func$
DECLARE
result_set %s;
BEGIN
INSERT INTO %s
(
%s
)
VALUES
(
%s
)
RETURNING * INTO result_set;
RETURN result_set;
END;
$func$
$ddl$;
argument_list text := '';
column_list text := '';
value_list text := '';
r record;
BEGIN
FOR r IN
SELECT attname nam, pg_catalog.format_type(atttypid, atttypmod) typ, pg_catalog.pg_get_expr(adbin, adrelid) def
FROM pg_catalog.pg_attribute
JOIN pg_catalog.pg_type t
ON t.oid = atttypid
LEFT JOIN pg_catalog.pg_attrdef
ON adrelid = attrelid AND adnum = attnum AND atthasdef
WHERE attrelid = tableid
AND attnum > 0
LOOP
IF r.def LIKE 'nextval%' THEN
CONTINUE;
END IF;
argument_list := argument_list || r.nam || '_in ' || r.typ || ',';
column_list := column_list || r.nam || ',';
IF r.def IS NULL THEN
value_list := value_list || r.nam || '_in,';
ELSE
value_list := value_list || 'coalesce(' || r.nam || '_in,' || r.def || '),';
END IF;
END LOOP;
argument_list := rtrim(argument_list, ',');
column_list := rtrim(column_list, ',');
value_list := rtrim(value_list, ',');
EXECUTE format(ddl, funcname, argument_list, tablename, tablename, tablename, column_list, value_list);
END;
$$;
In your case, the resulting insert function will be:
CREATE OR REPLACE FUNCTION public.cars_insert(make_in character varying, model_in character varying, has_automatic_transmission_in boolean, created_on_date_in timestamp with time zone)
RETURNS cars
LANGUAGE plpgsql
AS $function$
DECLARE
result_set cars;
BEGIN
INSERT INTO cars
(
make,model,has_automatic_transmission,created_on_date
)
VALUES
(
make_in,model_in,coalesce(has_automatic_transmission_in,false),coalesce(created_on_date_in,now())
)
RETURNING * INTO result_set;
RETURN result_set;
END;
$function$
You need two INSERT statements: one that supplies values for the columns that have defaults, and another that omits those columns, because a default is only applied when the column is not referenced in the INSERT (or when the DEFAULT keyword is given in its place).
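A small sketch of that behaviour, using a hypothetical temp table default_demo:

```sql
-- Hypothetical demo table with a NOT NULL column that has a default.
CREATE TEMP TABLE default_demo (
    id   int,
    flag boolean NOT NULL DEFAULT false
);

INSERT INTO default_demo (id)       VALUES (1);            -- column omitted: default applies
INSERT INTO default_demo (id, flag) VALUES (2, DEFAULT);   -- DEFAULT keyword: default applies
INSERT INTO default_demo (id, flag) VALUES (3, true);      -- explicit value wins

-- This one would fail with a NOT NULL violation, because an explicit NULL
-- is a value like any other and does not trigger the default:
-- INSERT INTO default_demo (id, flag) VALUES (4, NULL);

SELECT * FROM default_demo ORDER BY id;
```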
I am trying to write a function that opens a cursor with a dynamic column name in it.
I am concerned about the obvious possibility of SQL injection here.
I was happy to see in the fine manual that this can be done easily, but when I try it in my example, it fails with
error: column does not exist.
My current attempt can be condensed into this SQL Fiddle. Below, I present formatted code for this fiddle.
The goal of the tst() function is to count the distinct values in any given column of a constant query.
I am asking for a hint about what I am doing wrong, or for an alternative way to achieve the same goal safely.
CREATE TABLE t1 (
f1 character varying not null,
f2 character varying not null
);
CREATE TABLE t2 (
f1 character varying not null,
f2 character varying not null
);
INSERT INTO t1 (f1,f2) VALUES ('a1','b1'), ('a2','b2');
INSERT INTO t2 (f1,f2) VALUES ('a1','c1'), ('a2','c2');
CREATE OR REPLACE FUNCTION tst(p_field character varying)
RETURNS INTEGER AS
$BODY$
DECLARE
v_r record;
v_cur refcursor;
v_sql character varying := 'SELECT count(DISTINCT(%I)) as qty
FROM t1 LEFT JOIN t2 ON (t1.f1=t2.f1)';
BEGIN
OPEN v_cur FOR EXECUTE format(v_sql,lower(p_field));
FETCH v_cur INTO v_r;
CLOSE v_cur;
return v_r.qty;
END;
$BODY$
LANGUAGE plpgsql;
Test execution:
SELECT tst('t1.f1')
Provides error message:
ERROR: column "t1.f1" does not exist
Hint: PL/pgSQL function tst(character varying) line 1 at OPEN
This would work:
SELECT tst('f1');
The problem you are facing: format() interprets a parameter inserted with %I as one identifier. You are trying to pass a table-qualified column name that consists of two identifiers, which is interpreted as "t1.f1" (one name, double-quoted to preserve the otherwise illegal dot in the name).
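The quoting can be checked directly with a plain query. This is only an illustration of how format() treats %I; the column aliases and the 'My Col' input are made up:

```sql
SELECT format('%I', 't1.f1')       AS one_identifier   -- "t1.f1"  (a single, quoted name)
     , format('%I.%I', 't1', 'f1') AS qualified_name   -- t1.f1    (two identifiers)
     , format('%I', 'My Col')      AS needs_quotes;    -- "My Col" (quoted because of case and space)
```

Only the second form refers to column f1 of table t1.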
If you want to pass table and column name, use two parameters:
CREATE OR REPLACE FUNCTION tst2(_col text, _tbl text = NULL)
RETURNS int AS
$func$
DECLARE
v_r record;
v_cur refcursor;
v_sql text := 'SELECT count(DISTINCT %s) AS qty
FROM t1 LEFT JOIN t2 USING (f1)';
BEGIN
OPEN v_cur FOR EXECUTE
format(v_sql, CASE WHEN _tbl <> '' -- rule out NULL and ''
THEN quote_ident(lower(_tbl)) || '.' ||
quote_ident(lower(_col))
ELSE quote_ident(lower(_col)) END);
FETCH v_cur INTO v_r;
CLOSE v_cur;
RETURN v_r.qty;
END
$func$ LANGUAGE plpgsql;
Aside: it's DISTINCT f1, with no parentheses around the column name, unless you want to make it a row type.
Actually, you don't need a cursor for this at all. Faster, simpler:
CREATE OR REPLACE FUNCTION tst3(_col text, _tbl text = NULL, OUT ct bigint) AS
$func$
BEGIN
EXECUTE format('SELECT count(DISTINCT %s) AS qty
FROM t1 LEFT JOIN t2 USING (f1)'
, CASE WHEN _tbl <> '' -- rule out NULL and ''
THEN quote_ident(lower(_tbl)) || '.' ||
quote_ident(lower(_col))
ELSE quote_ident(lower(_col)) END)
INTO ct;
RETURN;
END
$func$ LANGUAGE plpgsql;
I provided NULL as parameter default for convenience. This way you can call the function with just a column name or with column and table name. But not without column name.
Call:
SELECT tst3('f1', 't1');
SELECT tst3('f1');
SELECT tst3(_col := 'f1');
Same as for tst2().
SQL Fiddle.
Related answer:
Table name as a PostgreSQL function parameter
I want a select that returns all fields in a table that are of type = "character varying". It needs to run across multiple tables, so needs to be dynamic.
I was trying to use a subquery to first get the text columns, and then run the query:
SELECT (SELECT STRING_AGG(QUOTE_IDENT(column_name), ', ') FROM
information_schema.columns WHERE table_name = foo
AND data_type = 'character varying') FROM foo;
But that's not working, I just get a list of column names but not the values. Does anyone know how I can make it work or a better way to do it?
Thank you,
Ben
You need PL/pgSQL for this, as PostgreSQL doesn't support dynamic SQL in its plain SQL dialect.
CREATE OR REPLACE FUNCTION get_cols(target_table text) RETURNS SETOF record AS $$
DECLARE
cols text;
BEGIN
cols := (SELECT STRING_AGG(QUOTE_IDENT(column_name), ', ')
FROM information_schema.columns
WHERE table_name = target_table
AND data_type = 'character varying');
RETURN QUERY EXECUTE 'SELECT '||cols||' FROM '||quote_ident(target_table)||';';
END;
$$
LANGUAGE plpgsql;
However, you'll find this hard to call, as you need to know the result column list to be able to call it. That kind of defeats the point. You'll need to massage the result into a concrete type. I convert to hstore here, but you could return json or an array or whatever, really:
CREATE OR REPLACE FUNCTION get_cols(target_table text) RETURNS SETOF hstore AS $$
DECLARE
cols text;
BEGIN
cols := (SELECT STRING_AGG(QUOTE_IDENT(column_name), ', ')
FROM information_schema.columns
WHERE table_name = target_table
AND data_type = 'character varying');
RETURN QUERY EXECUTE 'SELECT hstore(ROW('||cols||')) FROM '||quote_ident(target_table)||';';
END;
$$
LANGUAGE plpgsql;
Dynamic SQL is a pain, consider doing this at the application level.
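If hstore is not installed, a jsonb variant along the same lines might look like the sketch below. The function name get_varchar_cols_json and the demo table are hypothetical. It assumes PostgreSQL 9.5 or later for jsonb_build_object, reads pg_attribute instead of information_schema, and takes a regclass so the table name is validated and quoted automatically. Since functions accept at most 100 arguments by default, this form should be limited to about 50 columns.

```sql
CREATE OR REPLACE FUNCTION get_varchar_cols_json(target_table regclass)
RETURNS SETOF jsonb
LANGUAGE plpgsql AS
$$
DECLARE
    pairs text;
BEGIN
    -- Build "'col', col, ..." pairs for every varchar column of the table.
    SELECT string_agg(format('%L, %I', attname, attname), ', ' ORDER BY attnum)
    INTO   pairs
    FROM   pg_attribute
    WHERE  attrelid = target_table
    AND    atttypid = 'varchar'::regtype
    AND    attnum > 0
    AND    NOT attisdropped;

    IF pairs IS NULL THEN
        RETURN;   -- no varchar columns: return an empty set
    END IF;

    RETURN QUERY EXECUTE
        format('SELECT jsonb_build_object(%s) FROM %s', pairs, target_table);
END
$$;

-- Hypothetical demo table and call.
CREATE TEMP TABLE json_demo (id int, name varchar, city varchar(40), note text);
INSERT INTO json_demo VALUES (1, 'Ann', 'Oslo', 'n/a'), (2, 'Bob', NULL, 'n/a');

SELECT * FROM get_varchar_cols_json('json_demo');
```

Each result row is one jsonb object holding only the name and city keys, so the caller does not need a column definition list.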
I want to pass a table name as input and get that table's details as the result. I tried these functions:
The 1st function is working.
The 2nd function is not working.
1)
DECLARE
All_columns varchar;
Tab_name ALIAS FOR $1 ;
BEGIN
FOR All_columns IN SELECT column_name
FROM information_schema.columns
WHERE table_name=Tab_name
loop
raise notice 'Columns:%',All_columns;
end loop;
return All_columns;
END;
select test_levelfunction1('country_table');
It shows all columns of country table
2)
DECLARE
All_columns varchar ;
Tab_name ALIAS FOR $1 ;
BEGIN
FOR All_columns IN SELECT Tab_name.*
FROM Tab_name
loop
raise notice 'Columns:%',All_columns;
end loop;
return All_columns;
END;
The call select test_levelfunction1('country_table'); results in an error.
I need all the details from country_table.
How can I fix this function?
Neither function works, as far as I can read them. Or else you expect the first to return your input instead of column names.
You probably want to be using dynamic sql in both functions, e.g.:
EXECUTE $x$SELECT * FROM $x$ || Tab_name::regclass
http://www.postgresql.org/docs/current/static/plpgsql-statements.html
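Put together, a dynamic version of the second function might look like the following sketch. The function name notice_rows and the demo table are hypothetical, and it returns nothing because it only reports each row as a notice:

```sql
CREATE OR REPLACE FUNCTION notice_rows(tab_name regclass)
RETURNS void
LANGUAGE plpgsql AS
$$
DECLARE
    rec record;
BEGIN
    -- The table name cannot be a plain variable in FROM; it has to be
    -- spliced into the query text. regclass validates and quotes it.
    FOR rec IN EXECUTE format('SELECT * FROM %s', tab_name)
    LOOP
        RAISE NOTICE 'Row: %', rec;
    END LOOP;
END
$$;

-- Hypothetical demo table and call.
CREATE TEMP TABLE country_demo (code text, name text);
INSERT INTO country_demo VALUES ('NO', 'Norway'), ('SE', 'Sweden');

SELECT notice_rows('country_demo');
```

Passing a name that is not a visible table fails at the regclass cast, before any dynamic SQL is built.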
You can largely simplify this task. This SQL function does the job:
CREATE OR REPLACE FUNCTION f_columns_of_tbl(_tbl regclass)
RETURNS SETOF text AS
$func$
SELECT quote_ident(attname) AS col
FROM pg_attribute
WHERE attrelid = $1 -- valid, visible table name
AND attnum >= 1 -- exclude tableoid & friends
AND NOT attisdropped -- exclude dropped columns
ORDER BY attnum
$func$ LANGUAGE sql;
Call:
SELECT f_columns_of_tbl('myschema.mytable'); -- optionally schema-qualified name
For more details, links and a plpgsql version consider the related answer to your last question:
PLpgSQL function to find columns with only NULL values in a given table