Batch delete rows matching multiple columns in postgres using an array - sql

i'd like to delete these rows from a table, passing them into a postgres function as an array parameter:
{id1: 4, id2: 8}
{id1: 4, id2: 9}
{id1: 5, id2: 8}
that is, something like this:
delete from mytable
where (id1, id2) = ANY(Array [(4,8), (4,9), (5,8)])
which doesn't work* UPDATE: yes it does, but not in supabase. see below
but i know you can do this:
delete from mytable
where (id1, id2) in ((4,8), (4,9), (5,8))
and this:
delete from othertable
where id = ANY(Array [1,2])
i'd love to know if anyone has insight on how to accurately combine these. i've tried every combination i can think of. maybe i'm missing something obvious, or maybe i could use a temporary table somehow?
*the error for that attempt is cannot compare dissimilar column types bigint and integer at record column 1, but both column types are definitely bigint
Update:
so actually this works fine, as demonstrated in the comments:
delete from mytable where (id1, id2) = ANY(Array [(1,3), (2,1)]);
my errors are from a bug in the supabase client that i was using to run postgres commands, i believe.
but here's a_horse_with_no_name's answer (and critical work-around, thank you!) implemented in a function; putting it here for posterity:
create or replace function bulk_delete (id1s bigint[], id2s bigint[])
returns setof bigint
language plpgsql
as $$
begin
  return query with deleted as (
    delete from mytable
    where (id1, id2) in (
      select id1, id2
      from unnest(id1s, id2s) as x(id1, id2)
    )
    returning *
  )
  select count(*) from deleted;
end;
$$;
select bulk_delete(Array [1,1], Array [1,2]);
response:
|bulk_delete|
|-----------|
| 2 |
adding my supabase bug report for completeness
Update 2
the issue was conflicting types between implied ints and supabase's default int8
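if you hit the same error on plain postgres, explicitly casting the literals so the record types match should avoid it (a sketch, not tested through supabase):
delete from mytable
where (id1, id2) = ANY(ARRAY[(4::bigint, 8::bigint), (4::bigint, 9::bigint), (5::bigint, 8::bigint)]);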

The only way I can think of is to pass two arrays (of the same length), then unnest them both together so that you can use an IN condition:
delete from the_table
where (id1, id2) in (select id1, id2
                     from unnest(array_with_id1, array_with_id2) as x(id1, id2));
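For the ids from the question, the call would look something like this (casting the arrays to bigint[] so the element types match the columns):
delete from the_table
where (id1, id2) in (select id1, id2
                     from unnest(array[4,4,5]::bigint[], array[8,9,8]::bigint[]) as x(id1, id2));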

Related

SQL aggregation function to choose the only value

I have a rowset with two columns: technical_id and natural_id. The rowset is actually the result of a complex query. The mapping between column values is assumed to be bijective (i.e. two rows with the same technical_id have the same natural_id, and distinct technical_ids have distinct natural_ids). The (technical_id, natural_id) pairs are not unique in the rowset because of joins in the original query. Example:
with t (technical_id, natural_id, val) as (values
(1, 'a', 1),
(1, 'a', 2),
(2, 'b', 3),
(2, 'b', 2),
(3, 'c', 0),
(3, 'c', 1),
(4, 'd', 1)
)
Unfortunately, the bijection is enforced only by application logic. The natural_id is actually collected from multiple tables and composed using a coalesce-based expression, so its uniqueness can hardly be enforced by a db constraint.
I need to aggregate the rows of the rowset by technical_id, assuming the natural_id is unique per technical_id. If it isn't (for example if the tuple (4, 'x', 1) were added to the sample data), the query should fail. In an ideal SQL world I would use some hypothetical aggregate function:
select technical_id, only(natural_id), sum(val)
from t
group by technical_id;
I know there is no such function in SQL. Is there some alternative or workaround? Postgres-specific solutions are also ok.
Note that group by technical_id, natural_id and select technical_id, max(natural_id) - though working well in the happy case - are both unacceptable (the first because technical_id must be unique in the result under all circumstances, the second because the value is potentially arbitrary and masks data inconsistency).
Thanks for tips :-)
UPDATE: the expected answer is
|technical_id|v  |sum|
|------------|---|---|
|1           |a  |3  |
|2           |b  |5  |
|3           |c  |1  |
|4           |d  |1  |
or fail when 4,x,1 is also present.
You can get only the "unique" natural ids using:
select technical_id, max(natural_id), sum(val)
from t
group by technical_id
having min(natural_id) = max(natural_id);
If you want the query to actually fail, that is a little hard to guarantee. Here is a hacky way to do it:
select technical_id, max(natural_id), sum(val)
from t
group by technical_id
having (case when min(natural_id) = max(natural_id) then 0 else 1 / (count(*) - count(*)) end) = 0;
And a db<>fiddle illustrating this.
Seems I've finally found a solution, based on the single-row cardinality of a correlated subquery in the select clause:
select technical_id,
(select v from unnest(array_agg(distinct natural_id)) as u(v)) as natural_id,
sum(val)
from t
group by technical_id;
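The safety net is the single-row rule for scalar subqueries: if array_agg(distinct natural_id) ever holds more than one element, unnest returns more than one row and the whole query errors out. A quick check with the inconsistent row from the question added to the sample data:
with t (technical_id, natural_id, val) as (values
  (1, 'a', 1), (1, 'a', 2),
  (2, 'b', 3), (2, 'b', 2),
  (3, 'c', 0), (3, 'c', 1),
  (4, 'd', 1), (4, 'x', 1)
)
select technical_id,
       (select v from unnest(array_agg(distinct natural_id)) as u(v)) as natural_id,
       sum(val)
from t
group by technical_id;
-- ERROR: more than one row returned by a subquery used as an expression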
This is the simplest solution for my situation at the moment, so I'll resort to self-accepting. If any disadvantages show up, I will describe them here and re-accept another answer. I appreciate all the other proposals and believe they will be valuable for others too.
You can use
SELECT technical_id, max(natural_id), count(DISTINCT natural_id)
...
GROUP BY technical_id;
and throw an error whenever the count is not 1.
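For example (assuming the rowset is materialized as t, e.g. a view or temporary table over the complex query), the check could look like this:
do $$
begin
  if exists (select 1
             from t
             group by technical_id
             having count(distinct natural_id) > 1) then
    raise exception 'natural_id is not unique per technical_id';
  end if;
end;
$$;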
If you want to guarantee the constraint with the database, you could do one of these:
Do away with the artificial primary key.
Do something complicated like this:
CREATE TABLE id_map (
technical_id bigint UNIQUE NOT NULL,
natural_id text UNIQUE NOT NULL,
PRIMARY KEY (technical_id, natural_id)
);
ALTER TABLE t
ADD FOREIGN KEY (technical_id, natural_id) REFERENCES id_map;
You can create your own aggregates. ONLY is a keyword, so it's best not to use it as the name of an aggregate. Not willing to put much time into naming, I called it only2.
CREATE OR REPLACE FUNCTION public.only_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE plpgsql
IMMUTABLE
AS $function$
BEGIN
if $1 is null then return $2; end if;
if $2 is null then return $1; end if;
if $1=$2 then return $1; end if;
raise exception 'not only';
END $function$;
create aggregate only2 (anyelement) ( sfunc = only_agg, stype = anyelement);
It might not do the thing you want with NULL inputs, but I don't know what you want in that case.
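With the sample data from the question it is used just like the hypothetical only():
select technical_id, only2(natural_id), sum(val)
from t
group by technical_id;
This raises the 'not only' exception as soon as some technical_id maps to two different natural_ids.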

How to get arrays from a normalised table that stores array elements by index?

I have a table storing array elements by the array they belong to and
their index in the array. It seemed smart because the arrays were
expected to be sparse, and have their elements updated individually.
Let's say this is the table:
CREATE TABLE "values" (
pk TEXT,
i INTEGER,
value REAL,
PRIMARY KEY (pk, i)
);
pk | i | value
----+---+-------
A | 0 | 17.5
A | 1 | 32.7
A | 3 | 5.3
B | 1 | 13.5
B | 2 | 4.8
B | 4 | 89.1
Now I would like to get these as real arrays, i.e. {17.5, 32.7, NULL, 5.3} for A and {NULL, 13.5, 4.8, NULL, 89.1} for B.
I would have expected that it's easily possible with a grouping query
and an appropriate aggregate function. However, it turned out that there
is no such function that puts elements into an array by their index (or
subscript, as postgres calls it). It would've been much simpler if the
elements were successive - I just could've used array_agg with
ORDER BY i. But I want the null values in the result
arrays.
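For example, plain array_agg just packs the present values together and loses the positions, which is not what I want:
SELECT pk, array_agg(value ORDER BY i) FROM "values" GROUP BY pk;
-- A: {17.5,32.7,5.3}   B: {13.5,4.8,89.1}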
What I ended up with was this monster:
SELECT
pk,
ARRAY( SELECT
( SELECT value
FROM "values" innervals
WHERE innervals.pk = outervals.pk AND i = generate_series
)
FROM generate_series(0, MAX(i))
ORDER BY generate_series -- is this really necessary?
)
FROM "values" outervals
GROUP BY pk;
Having to SELECT … FROM values twice is ugly, and the query planner doesn't seem to be able to optimise this.
Is there a simple way to refer to the grouped rows as a relation in a subquery, so that I could just SELECT value FROM generate_series(0, MAX(i)) LEFT JOIN ????
Would it be more appropriate to solve this by defining a custom aggregate function?
Edit: It seems what I was looking for is possible with multiple-argument unnest and array_agg, although it is not particularly elegant:
SELECT
pk,
ARRAY( SELECT val
FROM generate_series(0, MAX(i)) AS series (series_i)
LEFT OUTER JOIN
unnest( array_agg(value ORDER BY i),
array_agg(i ORDER BY i) ) AS arr (val, arr_i)
ON arr_i = series_i
ORDER BY series_i
)
FROM "values"
GROUP BY pk;
The query planner even seems to realise that it can do a sorted merge join on the sorted series_i and arr_i, although I need to put some more effort into really understanding the EXPLAIN output.
Edit 2: It's actually a hash join between series_i and arr_i; only the outer group aggregation uses a "sorted" strategy.
Not sure if this qualifies as "simpler" - I personally find it easier to follow though:
with idx as (
select pk,
generate_series(0, max(i)) as i
from "values"
group by pk
)
select idx.pk,
array_agg(v.value order by idx.i) as vals
from idx
left join "values" v on v.i = idx.i and v.pk = idx.pk
group by idx.pk;
The CTE idx generates all possible index values for each pk value, and the outer query then uses that to aggregate the values.
Online example
Would it be more appropriate to solve this by defining a custom aggregate function?
It does at least simplify the query significantly:
SELECT pk, array_by_subscript(i+1, value)
FROM "values"
GROUP BY pk;
Using
CREATE FUNCTION array_set(arr anyarray, index int, val anyelement) RETURNS anyarray
AS $$
BEGIN
arr[index] = val;
RETURN arr;
END
$$ LANGUAGE plpgsql STRICT;
CREATE FUNCTION array_fillup(arr anyarray) RETURNS anyarray
AS $$
BEGIN
-- necessary for nice to_json conversion of arrays that don't start at subscript 1
IF array_lower(arr, 1) > 1 THEN
arr[1] = NULL;
END IF;
RETURN arr;
END
$$ LANGUAGE plpgsql STRICT;
CREATE AGGREGATE array_by_subscript(int, anyelement) (
sfunc = array_set,
stype = anyarray,
initcond = '{}',
finalfunc = array_fillup
);
Online example. It also has a nice query plan that does a simple linear scan on the values; I'll have to benchmark how efficient array_set is at growing the array.
This is in fact the fastest solution, according to an EXPLAIN ANALYZE benchmark on a reasonably-sized test data set. It took 55ms, compared to about 80ms of the ARRAY + UNNEST solution, and is considerably faster than the 160ms of the join against the common table expression.
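For reference, with the sample data it returns exactly the arrays asked for in the question:
 pk | array_by_subscript
----+---------------------------
 A  | {17.5,32.7,NULL,5.3}
 B  | {NULL,13.5,4.8,NULL,89.1}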
I think this qualifies as a solution (much better than my original attempt), so I'll post it as an answer. From this answer I realised that I can indeed put multiple values into the array_agg by using record syntax; it only forces me to declare the types in the column definition list:
SELECT
pk,
ARRAY( SELECT val
FROM generate_series(0, MAX(i)) AS series (series_i)
LEFT OUTER JOIN
unnest(array_agg( (value, i) )) AS arr (val real, arr_i integer)
-- ^^^^^^^^^^ ^^^^ ^^^^^^^
ON arr_i = series_i
ORDER BY series_i
)
FROM "values"
GROUP BY pk;
It still uses a hash left join followed by sorting instead of a sorting followed by a merge join, but maybe the query planner does optimisation better than my naive assumption.

How to get the first id from the INSERT query

Let's imagine that we have a plpgsql (PostgreSQL 10.7) function where there is a query like
INSERT INTO "myTable"
SELECT * FROM "anotherTable"
INNER JOIN "otherTable"
...
So, this query will insert several rows into myTable. In the next query I want to collect the ids which were just inserted, subject to some condition. So, my idea was to do the following:
INSERT INTO "resultTable" rt
SELECT FROM "myTable"
INNER JOIN ...
WHERE rt."id" >= firstInsertedId;
Now the question: how to find this firstInsertedId?
My solution:
select nextval(''"myTable.myTable_id_seq"'') into firstInsertedId;
if firstInsertedId > 1 then
perform setval(''"myTable.myTable_id_seq"'', (firstInsertedId - 1));
end if;
I don't really like the solution as I don't think that it is good for the performance to generate the id, then go back, then generate it again during the insertion.
Thoughts:
was thinking about inserting the ids into a variable array and then finding the minimum, but no luck.
was considering using the lastval() function, but it seems that it doesn't work for me, even though in a very similar implementation in MySQL LAST_INSERT_ID() worked just fine.
Any suggestions?
You can do both things in a single statement using a data modifying common table expression. You don't really need PL/pgSQL for that.
with new_rows as (
INSERT INTO my_table
SELECT *
FROM anotherTable
JOIN "otherTable" ...
returning my_table.id
)
insert into resulttable (new_id)
select id
from new_rows;
Another option would be to store the generated IDs in an array.
declare
l_ids integer[];
begin
....
with new_rows as (
INSERT INTO my_table
SELECT *
FROM anotherTable
JOIN "otherTable" ...
returning my_table.id
)
select array_agg(id)
into l_ids
from new_rows;
....
end;
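Inside the function you can then use that array in the next statement, e.g. (a sketch, reusing the names from the question):
insert into "resultTable"
select m.id
from "myTable" m
  join ...
where m.id = any(l_ids);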

in postgres, is there an easy way to select several attr/val rows into one record?

if I have
create table t1( attr text primary key, val text );
insert into t1 values( 'attr1', 'val1' );
insert into t1 values( 'attr2', 'val3' );
insert into t1 values( 'attr3', 'val3' );
would like a select to return one row
attr1=>val1, attr2=>val2, attr3=>val3
right now doing conversion in javascript, but would be nice for pg to return the row itself
Answer
based on @mu's answer, the query:
select replace( replace( replace( array_agg( hstore( attr, val ) )::text,
                '"\"', '"'),
                '\""', '"'),
                '\"=>\"', '":"') from t1;
results in:
{"attr1":"val1","attr2":"val2","attr3":"val3"}
which is quite nice JSON (as long as there are no quotes in the values)
I expected it to be possible to use array_to_json with array_agg. See the PostgreSQL 9.2 json documentation for usage, and the json91 module that backports the JSON functionality for use in PostgreSQL 9.1 until 9.2 is out, or use a 9.2 beta.
Unfortunately, it turns out there doesn't seem to be any support for merging, aggregating, etc. of json at this point. That makes it surprisingly difficult to build JSON values. I ended up just doing it with regular text operators, but that doesn't account for quoting issues.
regress=# SELECT '{'||string_agg('"'||attr||'": "'||val||'"', ', ')||'}' FROM t1;
?column?
-----------------------------------------------------
{"attr1": "val1", "attr2": "val3", "attr3": "val3"}
(1 row)
See:
regress=# insert into t1 (attr,val) values ('at"tr', 'v"a"l');
INSERT 0 1
regress=# SELECT '{'||string_agg('"'||attr||'": "'||val||'"', ', ')||'}' FROM t1;
?column?
-----------------------------------------------------------------------
{"attr1": "val1", "attr2": "val3", "attr3": "val3", "at"tr": "v"a"l"}
(1 row)
regress=# SELECT ('{'||string_agg('"'||attr||'": "'||val||'"', ', ')||'}')::json FROM t1;
ERROR: invalid input syntax for type json
DETAIL: line 1: Token "tr" is invalid.
The same issue exists in the solution you added to your answer. For a good answer to that, we need a function something like json_escape_literal, and there isn't currently anything like that exposed to SQL.
The only safe approach I see with Pg's current json feature set is to produce an array of pairs, but that's no better than what you get with an ordinary row-oriented query.
regress=# SELECT array_to_json( array_agg( array_to_json( ARRAY[attr, val] ) )) FROM t1;
array_to_json
---------------------------------------------------------------------------
[["attr1","val1"],["attr2","val3"],["attr3","val3"],["at\"tr","v\"a\"l"]]
You can probably combine hstore and json to do what you want, but that's getting into extension soup. What this really needs is a json object constructor function that's equivalent to hstore(text[],text[]) so you can do the json equivalent of:
select hstore( array_agg(attr), array_agg(val) ) from t1;
UPDATE: pgsql-general mailing list post on this topic
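For completeness: PostgreSQL 9.4 and later added json_object_agg, which is exactly the kind of json object constructor aggregate described above and handles the quoting for you:
select json_object_agg(attr, val) from t1;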
If you have hstore installed then you could use array_agg:
select array_agg(attr => val) from t1;
That would give you exactly the output you're looking for. Of course, whatever interface you're using would have to understand hstore and arrays or you'd have to unpack the results yourself; and if that was the case, it would probably be simpler to iterate over a simple select attr, val from t1 query and build the data structure in JavaScript.

SQL SELECT INSERT INTO Generate Unique Id

I'm attempting to select a table of data and insert this data into another file with similar column names (it's essentially duplicate data). Current syntax as follows:
INSERT INTO TABLE1 (id, id2, col1, col2)
SELECT similiarId, similiarId2, similiarCol1, similiarCol2
FROM TABLE2
The problem I have is generating unique key fields (declared as integers) for the newly inserted records. I can't use table2's keys, as table1 has existing data and will error on duplicate key values.
I cannot change the table schema and these are custom id columns not generated automatically by the DB.
Does table1 have an auto-increment on its id field? If so, can you lose similiarId from the insert and let the auto-increment take care of unique keys?
INSERT INTO TABLE1 (id2, col1, col2) SELECT similiarId2, similiarCol1, similiarCol2
FROM TABLE2
As per your requirement you need to do your query like this:
INSERT INTO TABLE1 (id, id2, col1, col2)
SELECT (ROW_NUMBER() OVER (ORDER BY similiarId ASC))
       + (SELECT MAX(id) FROM TABLE1) AS similiarId,
       similiarId2, similiarCol1, similiarCol2
FROM TABLE2
What have I done here:
Added ROW_NUMBER(), which starts from 1, plus MAX(id) of the destination table, so the generated ids continue after the existing ones.
For a better explanation see this SQLFiddle.
I'm not sure if I understand you correctly: you want to copy all data from TABLE2 but be sure that TABLE2.similiarId is not already in TABLE1.id. Maybe this is a solution for your problem:
DECLARE #idmax INT
SELECT #idmax = MAX(id) FROM TABLE1
INSERT INTO TABLE1 (id, id2, col1, col2)
SELECT similiarId + #idmax, similiarId2, similiarCol1, similiarCol2
FROM TABLE2
Now the insert will not fail because of a primary key violation, since every inserted id will be greater than any id that was already there.
If the id field is defined as an auto-id and you leave it out of the insert statement, then SQL will generate unique ids from the available pool.
In SQL Server we have the function ROW_NUMBER, and if I have understood you correctly the following code will do what you need:
INSERT INTO TABLE1 (id, id2, col1, col2)
SELECT (ROW_NUMBER( ) OVER ( ORDER BY similiarId2 ASC )) + 6 AS similiarId,
similiarId2, similiarCol1, similiarCol2
FROM TABLE2
ROW_NUMBER returns the number of each row, and you can add a "magic value" to it to make those values different from the current max ID of TABLE1. Let's say your current max ID is 6; then adding 6 to each result of ROW_NUMBER will give you 7, 8, 9, and so on. This way you won't collide with TABLE1's existing primary key values.
I asked Google and it told me that Sybase has the ROW_NUMBER function too (http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbusage/ug-olap-s-51258147.html), so I think you can try it.
If you want to make an identical table, why not simply use the (quick and dirty) SELECT INTO method?
SELECT * INTO TABLE2
FROM TABLE1
Hope This helps.
Make the table1 ID IDENTITY if it is not a custom id.
or
Create a new primary key in table1 and make it IDENTITY, and you can keep the previous IDs in the same format (but not as the primary key).
Your best bet may be to add an additional column on Table2 to hold the corresponding Table1.Id. This way you keep both sets of keys.
(If you are busy with a data merge, retaining Table1.Id may be important for any foreign keys which may still reference Table1.Id - you will then need to 'fix up' foreign keys in tables referencing Table1.Id, which now need to reference the applicable key in table 2).
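One way that could look (table1_id is just a placeholder column name; offsetting by MAX(id) keeps the new Table1 keys clear of the existing ones):
ALTER TABLE TABLE2 ADD table1_id INT;
UPDATE TABLE2
SET table1_id = similiarId + (SELECT MAX(id) FROM TABLE1);
INSERT INTO TABLE1 (id, id2, col1, col2)
SELECT table1_id, similiarId2, similiarCol1, similiarCol2
FROM TABLE2;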
If you need your 2nd table to keep the same values as the 1st table, then do not apply auto-increment on the 2nd table.
If you have a large range, want a quick and easy fix, and don't care about the ID values, here is an example with CONCAT:
INSERT INTO session SELECT CONCAT('3000', id) AS id, cookieid FROM `session2`;
but you could also use REPLACE.