Mass-Coalescing of Null Values - sql

I have a table in a Postgres database with monthly columns from 2012 to the end of 2018:
create table sales_data (
part_number text not null,
customer text not null,
qty_2012_01 numeric,
qty_2012_02 numeric,
qty_2012_03 numeric,
...
qty_2018_10 numeric,
qty_2018_11 numeric,
qty_2018_12 numeric,
constraint sales_data_pk primary key (part_number, customer)
);
The data is populated from a large function that pulls data from an extremely wide variety of sources. It involves many left joins -- for example, in combining history with future data, where a single item may have history but not future demand or vice versa. Or, certain customers may not have data as far back or forward as we want.
The problem I'm running into is that, due to the left joins (and the nature of the data I'm pulling), a significant number of the values end up null. I would like any null to simply be zero, to simplify queries against this table -- specifically calculations across the monthly columns, where 1 + null + 2 = null.
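For illustration (not part of the original question), this is the NULL propagation being described, and what COALESCE does about it:
SELECT 1 + NULL + 2;                                         -- yields NULL
SELECT COALESCE(1, 0) + COALESCE(NULL, 0) + COALESCE(2, 0);  -- yields 3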
I could modify the function and add hundreds of coalesce statements. However, I was hoping there was another way around this, even if it means modifying the values after the fact. That said, this would mean adding 84 update statements at the end of the function:
update sales_data set qty_2012_01 = 0 where qty_2012_01 is null;
update sales_data set qty_2012_02 = 0 where qty_2012_02 is null;
update sales_data set qty_2012_03 = 0 where qty_2012_03 is null;
... 78 more like this...
update sales_data set qty_2018_10 = 0 where qty_2018_10 is null;
update sales_data set qty_2018_11 = 0 where qty_2018_11 is null;
update sales_data set qty_2018_12 = 0 where qty_2018_12 is null;
I'm missing something, right? Is there an easier way?
I was hoping the default setting on the column would force a zero, but it doesn't work when the function is explicitly telling it to insert a null. Likewise, if I make the column non-nullable, it just pukes on my insert -- I was hoping that might force the invocation of the default.
By the way, the insert-then-update strategy is one I chastise others for, so I understand this is less than ideal. This function is a bit of a beast, and it does require some occasional maintenance (long story). My primary goal is to keep the function as readable and maintainable as possible -- NOT to make the function uber-efficient. The table itself is not huge -- less than a million records after all is said and done -- and we run the function to populate it once or twice a month.

There is no built-in feature for this (that I would know of). Short of spelling out COALESCE(col, 0) everywhere, you can write a function to replace all NULL values with 0 in all numeric columns of a table:
CREATE OR REPLACE FUNCTION f_convert_numeric_null(_tbl regclass)
  RETURNS void AS
$func$
BEGIN
   RAISE NOTICE '%',  -- test output for debugging
   -- EXECUTE         -- payload
   (SELECT 'UPDATE ' || _tbl
        || ' SET '   || string_agg(format('%1$s = COALESCE(%1$s, 0)', col), ', ')
        || ' WHERE ' || string_agg(col || ' IS NULL', ' OR ')
    FROM  (
       SELECT quote_ident(attname) AS col
       FROM   pg_attribute
       WHERE  attrelid = _tbl                 -- valid, visible, legal table name
       AND    attnum >= 1                     -- exclude tableoid & friends
       AND    NOT attisdropped                -- exclude dropped columns
       AND    NOT attnotnull                  -- exclude columns defined NOT NULL
       AND    atttypid = 'numeric'::regtype   -- only numeric columns
       ORDER  BY attnum
       ) sub
   );
END
$func$ LANGUAGE plpgsql;
Concatenates and executes a query of the form:
UPDATE sales_data
SET qty_2012_01 = COALESCE(qty_2012_01, 0)
, qty_2012_02 = COALESCE(qty_2012_02, 0)
, qty_2012_03 = COALESCE(qty_2012_03, 0)
...
WHERE qty_2012_01 IS NULL OR
qty_2012_02 IS NULL OR
qty_2012_03 IS NULL ... ;
Works for any table with any column names. All numeric columns are updated. Only rows that actually change are touched.
Since the function is massively invasive, I added a child-safety device: as posted, it only prints the generated statement. Comment out the RAISE NOTICE line and uncomment EXECUTE to arm it.
Call:
SELECT f_convert_numeric_null('sales_data');
You wrote that your primary goal is to keep the function as readable and maintainable as possible.
That should do it.
SQL Fiddle.
The parameter is of type regclass, so pass the table name, optionally schema-qualified; non-standard identifiers must be double-quoted - names like "mySchema"."0dumb tablename".
To avoid the insert-then-update pattern on the target table, write your query results to a temporary table, run the function on the temp table, and then INSERT into the actual table.
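A minimal sketch of that workflow (the temp table name is illustrative; your function would fill it):
CREATE TEMP TABLE tmp_sales_data (LIKE sales_data INCLUDING ALL);
-- ... populate tmp_sales_data with the existing query ...
SELECT f_convert_numeric_null('tmp_sales_data');  -- zero out NULLs in place
INSERT INTO sales_data SELECT * FROM tmp_sales_data;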
Related:
Replace empty strings with null values
Table name as a PostgreSQL function parameter
Generate DEFAULT values in a CTE UPSERT using PostgreSQL 9.3

Applying COALESCE(col_name, 0) in the INSERT statement itself will fix the issue. You can also add NOT NULL constraints to maintain data integrity.
Assuming the data is inserted from a temp table:
INSERT INTO sales_data (qty_2012_01, qty_2012_02)
SELECT COALESCE(qty_2012_01, 0), COALESCE(qty_2012_02, 0)
FROM temp_sales_data;
Single Update
UPDATE sales_data SET
qty_2012_01 = COALESCE(qty_2012_01, 0),
qty_2012_02 = COALESCE(qty_2012_02, 0)
..
..
WHERE qty_2012_01 IS NULL
OR qty_2012_02 IS NULL
...
....
The above query will update all the columns in a single UPDATE statement.

Related

PostgreSQL Varchar UID to Int UID while preserving uniqueness

Say I have a unique column of VarChar(32).
ex. 13bfa574e23848b68f1b7b5ff6d794e1.
I want to preserve the uniqueness of this while converting the column to int. I figure I can convert all of the letters to their ascii equivalent, while retaining the numbers and character position. To do this, I will use the translate function.
pseudo code: select translate(uid, '[^0-9]', ascii('[^0-9]'))
My issue is finding all of the letters in the VarChar column originally.
I've tried
select uid, substring(uid from '[^0-9]') from test_table;
But it only returns the first letter it encounters. Using the above example, I would be looking for bfaebfbbffde
Any help is appreciated!
First off, I agree with the two commenters who said you should use a UID datatype.
That aside...
Your UID looks like a traditional one, in that it's not alphanumeric, it's hex. If this is the case, you can convert the hex to the numeric value using this solution:
PostgreSQL: convert hex string of a very large number to a NUMERIC
Note that the accepted solution there (mine, to my shame) is not as good as the other solution listed, as mine will not work for hex values this large.
That said, yikes, what a huge number. Holy smokes.
Depending on how many records are in your table and the frequency of insert/update, I would consider a radically different approach. In a nutshell, I would create another column to store your numeric ID whose value would be determined by a sequence.
If you really want to make it bulletproof, you can also create a cross-reference table to store the relationships. That would:
- Reuse an ID if it ever repeated (I know UIDs don't, but this would cover cases where a record is deleted by mistake, re-appears, and you want to retain the original id)
- Cover the case where UIDs repeat (for example, if this is a child table with multiple records per UID)
If neither of these apply, you could dumb it down quite a bit.
The solution would look something like this:
Add an ID column that will be your numeric equivalent to the UID:
alter table test_table
add column id bigint
Create a sequence:
CREATE SEQUENCE test_id
create a cross-reference table (again, not necessary for the dumbed down version):
create table test_id_xref (
uid varchar(32) not null,
id bigint not null,
constraint test_id_xref_pk primary key (uid)
)
Then do a one-time update to assign a surrogate ID to each UID for both the cross-reference and actual tables:
insert into test_id_xref
with uids as (
select distinct uid
from test_table
)
select uid, nextval ('test_id')
from uids;
update test_table tt
set id = x.id
from test_id_xref x
where tt.uid = x.uid;
And finally, for all future inserts, create a trigger to assign the next value:
CREATE OR REPLACE FUNCTION test_table_insert_trigger()
  RETURNS trigger AS
$BODY$
BEGIN
   select t.id
   into   NEW.id
   from   test_id_xref t
   where  t.uid = NEW.uid;

   if NEW.id is null then
      NEW.id := nextval('test_id');
      insert into test_id_xref values (NEW.uid, NEW.id);
   end if;

   return NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
CREATE TRIGGER insert_test_table_trigger
BEFORE INSERT
ON test_table
FOR EACH ROW
EXECUTE PROCEDURE test_table_insert_trigger();
Create a function which replaces the characters you do not need in the string with blanks (note that this answer uses MySQL syntax):
DELIMITER $$
CREATE FUNCTION replace_char(v_string VARCHAR(32) CHARSET utf8) RETURNS VARCHAR(32)
DETERMINISTIC
BEGIN
DECLARE v_return_string VARCHAR(32) DEFAULT '';
DECLARE v_remove_char VARCHAR(200) DEFAULT '1,2,3,4,5,6,7,8,9,0';
DECLARE v_length, j INT(3) DEFAULT 0;
SET v_length = LENGTH(v_string);
WHILE(j < v_length) DO
IF ( FIND_IN_SET( SUBSTR(v_string, (j+1), 1), v_remove_char ) = 0) THEN
SET v_return_string = CONCAT(v_return_string, SUBSTR(v_string, (j+1), 1) );
END IF;
SET j = j+1;
END WHILE;
RETURN v_return_string;
END$$
DELIMITER ;
Now you just need to call this function in a query:
select uid, replace_char(uid) from test_table;
It will give you the string you need (bfaebfbbffde).
If you want the integer number only, i.e. 13574238486817567941, then change the value of the variable as below, and also change the column data type to decimal(50,0); decimal can store large numbers, and with 0 decimal places it will store the integer value as a decimal.
v_remove_char = 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z';
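Since the question is about PostgreSQL, the same result can be had there without a custom function, using regexp_replace (a sketch, not part of the original answer):
select uid, regexp_replace(uid, '[0-9]', '', 'g') as letters_only from test_table;
select uid, regexp_replace(uid, '[^0-9]', '', 'g') as digits_only from test_table;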

Exclude a range of values that are encompassed by another range, using variables

I have 2 variables that describe a range: from and to. However, there is some overlap between the ranges defined in my code. I would like to exclude part of one range while still using variables, because I am passing these variables as parameters in the package to call other programs.
For example, if I have this code:
IF name = 'AA'
THEN
from := '101-0000-0000';
to := '101-9999-9999';
ELSIF name = 'BB'
THEN
from := '200-0000-0000';
to := '200-9999-9999';
ELSIF name = 'CC'
THEN
from := '100-0000-0000';
to := '120-9999-0000';
ELSIF name = 'DD'
THEN
from := '400-0000-0000';
to := '402-9999-9999';
END IF;
I want to exclude the 101-****-**** values from the name = 'CC' range because they are already in use by the name = 'AA' range. The CC from value is 100-0000-0000 and to value is 120-9999-9999, which completely covers 101-****-****.
You can't do that in the manner which you'd like to. You're going to have to have more variables, for instance from2 and to2 (or put them in an array). If you need to exclude 2 ranges then you're going to need 3 sets of variables, etc. This method is, therefore, not extensible and leads only to unmaintainable code bloat. I'd argue that you're already getting there.
You do, however, have a database to hand. Use it for what it's good at. Create a table.
create table name_ranges (
name varchar2(2) not null
, min_value varchar2(13) not null
, max_value varchar2(13) not null
, constraint pk_name_ranges primary key (name, min_value)
, constraint uk_name_ranges_min unique (min_value)
, constraint uk_name_ranges_max unique (max_value)
, constraint ck_name_ranges_min_max check (min_value <= max_value)
);
insert into name_ranges values ('AA', '101-0000-0000', '101-9999-9999');
insert into name_ranges values ('BB', '200-0000-0000', '200-9999-9999');
insert into name_ranges values ('CC', '100-0000-0000', '100-9999-9999');
insert into name_ranges values ('CC', '102-0000-0000', '120-9999-9999');
insert into name_ranges values ('DD', '400-0000-0000', '402-9999-9999');
I've left these numbers as VARCHARs; but I'd consider changing them to NUMBERs. The separators can be added in when you're required to display the data, but they just make managing numbers more difficult for the rest of the time. Though I've added a load of constraints, there's no way of guaranteeing that there are no overlapping ranges; you're going to have to be as careful as you are when declaring these in code.
Now, whenever you need to select data from another table based on these ranges you can join with a normal non-equi join:
select my.*
from my_table my
join name_ranges nr
on my.name = nr.name
and my.column_name between nr.min_value and nr.max_value;
This has the benefit of simplifying the code and reducing the volume. It also means that if you ever need to change anything you only have to change a table. Nothing else. It saves so much hassle in the longer run it's unbelievable. If the reasoning behind some of the choices you've made for your ranges is unclear, add a DESCRIPTION column to the table and populate it with free text to explain them.
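A hypothetical example of such a column (size and wording are illustrative only):
alter table name_ranges add description varchar2(200);

update name_ranges
set description = 'CC is split in two because 101-xxxx-xxxx belongs to AA'
where name = 'CC';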
If you absolutely have to have these as variables in code, then declare a collection type based on the table's rowtype and bulk collect into it.
declare
type t__name_ranges is table of name_ranges%rowtype index by binary_integer;
t_name_ranges t__name_ranges;
begin
select * bulk collect into t_name_ranges
from name_ranges;
-- do_something
end;

Select a row and insert it with different IDs n times

I am trying to come up with a script in Postgres that will select the first row in a table and insert that row x number of times back into the same table.
Here is what I have:
INSERT INTO campaign (select column_name from campaign)
SELECT x.id from generate_series(50, 500) as x(id);
The above obviously doesn't work.
Just get the syntax for the INSERT statement right:
INSERT INTO campaign (id, column_name)
SELECT g.g, t.column_name
FROM (SELECT column_name FROM campaign LIMIT 1) t -- picking arbitrary row
,generate_series(50, 500) g(g); -- 451 times
The CROSS JOIN to generate_series() multiplies each selected row.
Selecting one arbitrary row, since the question didn't define "first". There is no natural order in a table. To pick a certain row, add ORDER BY and/or WHERE.
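For instance, a sketch that copies one specific row instead of an arbitrary one (the filter value is illustrative; it assumes you can identify the row by its id):
INSERT INTO campaign (id, column_name)
SELECT g.g, t.column_name
FROM (SELECT column_name FROM campaign WHERE id = 1) t -- the row to copy
,generate_series(50, 500) g(g);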
There is no syntactical shortcut to select all columns except the one named "id". You have to use the complete row or provide a list of selected columns.
Automation with dynamic SQL
To get around this, build the query string from catalog tables (or the information schema) and use EXECUTE in a plpgsql function (or some other procedural language). The following only uses pg_attribute.
format() requires Postgres 9.1 or later.
CREATE OR REPLACE FUNCTION f_multiply_row(_tbl regclass
                                        , _idname text
                                        , _minid int
                                        , _maxid int)
  RETURNS void AS
$func$
BEGIN
   EXECUTE (
      SELECT format('INSERT INTO %1$s (%2$I, %3$s)
                     SELECT g.g, %3$s
                     FROM  (SELECT * FROM %1$s LIMIT 1) t
                          , generate_series($1, $2) g(g)'
                   , _tbl
                   , _idname
                   , string_agg(quote_ident(attname), ', ')
             )
      FROM   pg_attribute
      WHERE  attrelid = _tbl
      AND    attname <> _idname   -- exclude id column
      AND    NOT attisdropped     -- no dropped (dead) columns
      AND    attnum > 0           -- no system columns
      )
   USING _minid, _maxid;
END
$func$ LANGUAGE plpgsql;
Call in your case:
SELECT f_multiply_row('campaign', 'id', 50, 500);
SQL Fiddle.
Major points
Properly escape identifiers to avoid SQL injection. Using format() and regclass for the table name. Details:
Table name as a PostgreSQL function parameter
_idname is the column name to exclude ('id' in your case). Case sensitive!
Pass values in the USING clause. $1 and $2 in generate_series($1, $2) reference those parameters (not the function parameters).
More explanation in related answers. Try a search:
https://stackoverflow.com/search?q=[plpgsql]+[dynamic-sql]+format+pg_attribute

How insert rows with max(order_field) + 1 transactionally in PostgreSQL

I need to insert into a PostgreSQL table a row with a column containing the max value + 1 for this same column over a subset of the rows of the table. That column is used to order the rows in that subset.
I'm trying to update the column value in an after insert trigger but I'm obtaining duplicate values for this column in different rows.
What's the best way to do that avoiding duplicate values for the ordering column in the subset in a concurrent environment with a lot of inserts in a short time?
Thanks in advance
EDIT:
The subset is defined by another column of the same table: this column has the same value for all the related rows.
If that column is used only for ordering then use a sequence:
create table t (
column1 integer,
ordering_column serial
);
http://www.postgresql.org/docs/current/static/datatype-numeric.html#DATATYPE-NUMERIC-TABLE
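Usage sketch: with a serial column, new rows pick up the next sequence value automatically unless you set it explicitly:
insert into t (column1) values (42); -- ordering_column is filled from the sequence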
New transactional-safe answer:
To make it in a transactional-safe way you could use this trigger, which creates sequences for each different "set_id" value:
create or replace function calculate_index() returns trigger
as $$
declare
   my_indexer_name text;
begin
   my_indexer_name := 'my_indexer_name_' || NEW.my_set_id;

   if NOT EXISTS (SELECT * FROM pg_class WHERE relname = my_indexer_name)
   then
      execute 'create sequence ' || my_indexer_name;
   end if;

   select nextval(my_indexer_name) into NEW.my_index;
   return new;
end
$$
language plpgsql;
CREATE TRIGGER my_indexer_trigger
BEFORE INSERT ON my_table FOR EACH ROW
EXECUTE PROCEDURE calculate_index();
You could also manually create sequences named 'my_indexer_name_1', 'my_indexer_name_2', etc. beforehand, if the possible set_id values are known; then you could eliminate the if-then from the trigger function above.
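A sketch, assuming set_id values 1, 2 and 3 are known in advance:
create sequence my_indexer_name_1;
create sequence my_indexer_name_2;
create sequence my_indexer_name_3;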
This was my initial and not transactional-safe answer:
I would create a new helper table, let's call it set_indexes:
create table set_indexes( set_id integer, max_index integer );
Each record has the set_id and the max index value of that set, e.g.:
set_id | max_index
-------+----------
     1 |        53
     2 |        12
     3 |        43
In the trigger code you would:
select max_index + 1 from set_indexes where set_indexes.set_id = NEW.my_set_id
into NEW.my_index;

-- Check if the set_id is new:
if NEW.my_index is null then
   insert into set_indexes( set_id, max_index) values (NEW.my_set_id, 0);
   NEW.my_index := 0;
else
   update set_indexes set max_index = NEW.my_index where set_indexes.set_id = NEW.my_set_id;
end if;

Update multiple columns in a trigger function in plpgsql

Given the following schema:
create table account_type_a (
id SERIAL UNIQUE PRIMARY KEY,
some_column VARCHAR
);
create table account_type_b (
id SERIAL UNIQUE PRIMARY KEY,
some_other_column VARCHAR
);
create view account_type_a_view AS select * from account_type_a;
create view account_type_b_view AS select * from account_type_b;
I try to create a generic trigger function in plpgsql, which enables updating the view:
create trigger trUpdate instead of UPDATE on account_type_a_view
for each row execute procedure updateAccount();
create trigger trUpdate instead of UPDATE on account_type_b_view
for each row execute procedure updateAccount();
An unsuccessful effort of mine was:
create function updateAccount() returns trigger as $$
declare
target_table varchar := substring(TG_TABLE_NAME from '(.+)_view');
cols varchar;
begin
execute 'select string_agg(column_name,$1) from information_schema.columns
where table_name = $2' using ',', target_table into cols;
execute 'update ' || target_table || ' set (' || cols || ') = select ($1).*
where id = ($1).id' using NEW;
return NULL;
end;
$$ language plpgsql;
The problem is the update statement. I am unable to come up with a syntax that would work here. I have successfully implemented this in PL/Perl, but would be interested in a plpgsql-only solution.
Any ideas?
Update
As #Erwin Brandstetter suggested, here is the code for my PL/Perl solution. I incorporated some of his suggestions.
create function f_tr_up() returns trigger as $$
use strict;
use warnings;
my $target_table = quote_ident($_TD->{'table_name'}) =~ s/^([\w]+)_view$/$1/r;
my $NEW = $_TD->{'new'};
my $cols = join(',', map { quote_ident($_) } keys %$NEW);
my $vals = join(',', map { quote_literal($_) } values %$NEW);
my $query = sprintf(
"update %s set (%s) = (%s) where id = %d",
$target_table,
$cols,
$vals,
$NEW->{'id'});
spi_exec_query($query);
return;
$$ language plperl;
While #Gary's answer is technically correct, it fails to mention that PostgreSQL does support this form:
UPDATE tbl
SET (col1, col2, ...) = (expression1, expression2, ..)
Read the manual on UPDATE.
It's still tricky to get this done with dynamic SQL. I'll assume a simple case where views consist of the same columns as their underlying tables.
CREATE VIEW tbl_view AS SELECT * FROM tbl;
Problems
The special record NEW is not visible inside EXECUTE. I pass NEW as a single parameter with the USING clause of EXECUTE.
As discussed, UPDATE with list-form needs individual values. I use a subselect to split the record into individual columns:
UPDATE ...
FROM (SELECT ($1).*) x
(Parentheses around $1 are not optional.) This allows me to simply use two column lists built with string_agg() from the catalog table: one with and one without table qualification.
It's not possible to assign a row value as a whole to individual columns. The manual:
According to the standard, the source value for a parenthesized
sub-list of target column names can be any row-valued expression
yielding the correct number of columns. PostgreSQL only allows the
source value to be a row constructor or a sub-SELECT.
INSERT is implemented more simply. If the structure of view and table is identical, we can omit the column definition list. (Can be improved, see below.)
Solution
I made a couple of updates to your approach to make it shine.
Trigger function for UPDATE:
CREATE OR REPLACE FUNCTION f_trg_up()
  RETURNS TRIGGER
  LANGUAGE plpgsql AS
$func$
DECLARE
   _tbl  regclass := quote_ident(TG_TABLE_SCHEMA) || '.'
                  || quote_ident(substring(TG_TABLE_NAME from '(.+)_view$'));
   _cols text;
   _vals text;
BEGIN
   SELECT INTO _cols, _vals
          string_agg(quote_ident(attname), ', ')
        , string_agg('x.' || quote_ident(attname), ', ')
   FROM   pg_attribute
   WHERE  attrelid = _tbl
   AND    NOT attisdropped   -- no dropped (dead) columns
   AND    attnum > 0;        -- no system columns

   EXECUTE format('
      UPDATE %s
      SET   (%s) = (%s)
      FROM  (SELECT ($1).*) x', _tbl, _cols, _vals)
   USING NEW;

   RETURN NEW;   -- Don't return NULL unless you know what you're doing
END
$func$;
Trigger function for INSERT:
CREATE OR REPLACE FUNCTION f_trg_ins()
  RETURNS TRIGGER
  LANGUAGE plpgsql AS
$func$
DECLARE
   _tbl regclass := quote_ident(TG_TABLE_SCHEMA) || '.'
                 || quote_ident(substring(TG_TABLE_NAME FROM '(.+)_view$'));
BEGIN
   EXECUTE format('INSERT INTO %s SELECT ($1).*', _tbl)
   USING NEW;

   RETURN NEW;   -- Don't return NULL unless you know what you're doing
END
$func$;
Triggers:
CREATE TRIGGER trg_instead_up
INSTEAD OF UPDATE ON a_view
FOR EACH ROW EXECUTE FUNCTION f_trg_up();
CREATE TRIGGER trg_instead_ins
INSTEAD OF INSERT ON a_view
FOR EACH ROW EXECUTE FUNCTION f_trg_ins();
Before Postgres 11 the syntax (oddly) was EXECUTE PROCEDURE instead of EXECUTE FUNCTION - which also still works.
db<>fiddle here - demonstrating INSERT and UPDATE
Old sqlfiddle
Major points
Include the schema name to make the table reference unambiguous. There can be multiple tables of the same name in one database with multiple schemas!
Query pg_catalog.pg_attribute instead of information_schema.columns. Less portable, but much faster, and it allows using the table OID.
How to check if a table exists in a given schema
Table names are NOT safe against SQLi when concatenated as strings for dynamic SQL. Escape with quote_ident() or format() or with an object-identifier type. This includes the special trigger function variables TG_TABLE_SCHEMA and TG_TABLE_NAME!
Cast to the object identifier type regclass to assert the table name is valid and get the OID for the catalog look-up.
Optionally use format() to build the dynamic query string safely.
No need for dynamic SQL for the first query on the catalog tables. Faster, simpler.
Use RETURN NEW instead of RETURN NULL in these trigger functions unless you know what you are doing. (NULL would cancel the INSERT for the current row.)
This simple version assumes that every table (and view) has a unique column named id. A more sophisticated version might use the primary key dynamically.
The function for UPDATE allows the columns of view and table to be in any order, as long as the set is the same.
The function for INSERT expects the columns of view and table to be in identical order. If you want to allow arbitrary order, add a column definition list to the INSERT command, just like with UPDATE (see the sketch after these points).
The updated version also covers changes to the id column by additionally using OLD.
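A minimal sketch of that INSERT variant with an explicit column list, reusing the same _tbl, _cols and _vals as in f_trg_up above (a sketch, not part of the original answer):
EXECUTE format('INSERT INTO %s (%s) SELECT %s FROM (SELECT ($1).*) x', _tbl, _cols, _vals)
USING NEW;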
Postgresql doesn't support updating multiple columns using the set (col1,col2) = select val1,val2 syntax.
To achieve the same in postgresql you'd use
update target_table
set col1 = d.val1,
col2 = d.val2
from source_table d
where d.id = target_table.id
This is going to make the dynamic query a bit more complex to build, as you'll need to expand the column name list into individual assignments. I'd suggest you use array_agg instead of string_agg, as an array is easier to process than splitting the string again.
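A rough sketch of that expansion (hypothetical column names), building one "col = d.col" assignment per array element:
SELECT string_agg(format('%I = d.%I', col, col), ', ') AS set_clause
FROM unnest(ARRAY['col1', 'col2']) AS col;
-- yields: col1 = d.col1, col2 = d.col2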
Postgresql UPDATE syntax
documentation on array_agg function