PostgreSQL Query Optimization - sql

I'm trying to generate new IDs for a large table. The IDs have to be consecutive and need to start at 0 (so I can't use a sequence). What I've come up with so far is the following function:
CREATE OR REPLACE FUNCTION genIds() RETURNS integer AS $$
DECLARE
    edge RECORD;
    i INTEGER := 0;
BEGIN
    FOR edge IN SELECT * FROM network LOOP
        UPDATE network SET id = i WHERE id = edge.id;
        i := i + 1;
    END LOOP;
    RETURN i;
END;
$$ LANGUAGE plpgsql;
I would much rather not have to match on id = edge.id, since I don't really care about the old IDs anyway. Is there a way to avoid having count(network) updates?
Cheers, Daniel

Is there a way to avoid having count(network) updates?
If your question is whether this can be done with a single statement instead of a loop, then yes, it is possible:
with numbered as (
    select id as old_id,
           row_number() over (order by id) as new_id
    from network
)
update network nt
set id = nb.new_id - 1 -- -1 to start at 0
from numbered nb
where nb.old_id = nt.id;
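As a quick sanity check, the statement can be exercised against a throwaway copy of the table (a minimal sketch; the temp table and sample values are made up):

CREATE TEMP TABLE network (id integer);
INSERT INTO network VALUES (17), (42), (99);

with numbered as (
    select id as old_id,
           row_number() over (order by id) as new_id
    from network
)
update network nt
set id = nb.new_id - 1
from numbered nb
where nb.old_id = nt.id;

SELECT id FROM network ORDER BY id; -- now returns 0, 1, 2

One caveat: if id has a non-deferrable unique constraint, the update can hit transient duplicates while the new numbers overlap the old ones.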

Related

Change the parameter each time and run the SQL script

I am quite new to SQL and have been trying to parameterize the following script.
This is my code:
select dc.deviceid,
       dc.kernel_time,
       dc.crash_time,
       dc.crash_process,
       dps.start_time,
       dps.end_time,
       dps.start_kernel_time,
       dps.end_kernel_time,
       case
           when dc.kernel_time between dps.start_kernel_time and dps.end_kernel_time then 1
           when dc.crash_time between dps.start_time and dps.end_time then 2
           else 3
       end as flag,
       row_number() over (partition by dc.deviceid, dc.kernel_time,
                          dc.crash_time, dc.crash_process order by flag) row_num
from dummy.dummy_crashes dc
left outer join (select *
                 from dummy.dummy_power) as dps
       on dc.deviceid = dps.deviceid
      and ((dc.kernel_time between (dps.start_kernel_time + 10000) and (dps.end_kernel_time + 10000))
           or (dc.crash_time between dps.start_time and dps.end_time))
order by dc.crash_time;
I need to test this script by changing start_kernel_time and end_kernel_time by a certain int parameter value (10000 in the example shown) each time. So, instead of modifying it in the code, I would like to create a function with an int parameter of choice and run this script. Would that be possible?
I am really clueless as to how to achieve that.
Ideally, it would be something like this:
get_crashes(10000); <-- get records with adding int parameter (in start_kernel_time and end_kernel_time) as 10000
get_crashes(30000); <-- get records with adding int parameter as 30000
get_crashes(80000); <-- get records with adding int parameter as 80000
How could I achieve this?
I can't write a comment because I don't have 50 rep, but here is my answer:
You can create a temp table with the values you want to pass, and open a cursor over it with a simple query like:
SELECT [value] FROM *temptable*
After that, inside the cursor loop, run the script with each single value from the temp table.
UPDATE
-- Wrapped in a function so that RETURN NEXT has something to return into
-- (the function name and the integer type are illustrative):
CREATE OR REPLACE FUNCTION iterate_temp_table() RETURNS SETOF integer AS $$
DECLARE
    cur CURSOR FOR select col1 from tempTable;
    test_cur RECORD;
BEGIN
    open cur;
    LOOP
        fetch cur into test_cur;
        exit when not found; -- "test_cur = null" would never become true; NOT FOUND is the plpgsql idiom
        if test_cur.col1 IS NOT NULL then
            return next test_cur.col1;
        end if;
    END LOOP;
    close cur;
END;
$$ LANGUAGE plpgsql;
One note - I never write PostgreSQL, just have general SQL knowledge and found this code on the internet, so you may need to check the documentation.
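For completeness, the shape the question asks for (get_crashes(10000)) can be had with a set-returning SQL function that takes the offset as a parameter. A minimal sketch wrapping the query from the question - all column types here are guesses, so adjust them to the real schema; the row_number column is left out for brevity:

CREATE OR REPLACE FUNCTION get_crashes(p_offset bigint)
RETURNS TABLE (
    deviceid          text,      -- types assumed, not from the question
    kernel_time       bigint,
    crash_time        timestamp,
    crash_process     text,
    start_time        timestamp,
    end_time          timestamp,
    start_kernel_time bigint,
    end_kernel_time   bigint,
    flag              integer
) AS $$
    select dc.deviceid,
           dc.kernel_time,
           dc.crash_time,
           dc.crash_process,
           dps.start_time,
           dps.end_time,
           dps.start_kernel_time,
           dps.end_kernel_time,
           case
               when dc.kernel_time between dps.start_kernel_time and dps.end_kernel_time then 1
               when dc.crash_time between dps.start_time and dps.end_time then 2
               else 3
           end as flag
    from dummy.dummy_crashes dc
    left outer join dummy.dummy_power dps
           on dc.deviceid = dps.deviceid
          and ((dc.kernel_time between (dps.start_kernel_time + p_offset) and (dps.end_kernel_time + p_offset))
               or (dc.crash_time between dps.start_time and dps.end_time))
    order by dc.crash_time;
$$ LANGUAGE sql STABLE;

-- Usage, matching the question's pseudocode:
SELECT * FROM get_crashes(10000);
SELECT * FROM get_crashes(30000);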

Iterate through table, perform calculation on each row

I would like to preface this by saying I am VERY new to SQL, but my work now requires that I work in it.
I have a dataset containing topographical point data (x,y,z). I am trying to build a KNN model based on this data. For every point 'P', I search for the 100 points in the data set nearest P (nearest meaning geographically nearest). I then average the values of these points (this average is known as a residual), and add this value to the table in the 'resid' column.
As a proof of concept, I am trying to simply iterate over the table, and set the value of the 'resid' column to 1.0 in every row.
My query is this:
CREATE OR REPLACE FUNCTION LoopThroughTable() RETURNS VOID AS '
DECLARE
    row table%rowtype;
BEGIN
    FOR row in SELECT * FROM table LOOP
        SET row.resid = 1.0;
    END LOOP;
END
' LANGUAGE 'plpgsql';
SELECT LoopThroughTable() as output;
This code executes and returns successfully, but when I check the table, no alterations have been made. What is my error?
Doing updates row-by-row in a loop is almost always a bad idea and will be extremely slow and won't scale. You should really find a way to avoid that.
Having said that:
All your function is doing is changing the value of a variable in memory - the table itself is never touched. If you want to update the data, you need an UPDATE statement inside the loop:
CREATE OR REPLACE FUNCTION LoopThroughTable()
RETURNS VOID
AS
$$
DECLARE
    t_row the_table%rowtype;
BEGIN
    FOR t_row in SELECT * FROM the_table LOOP
        update the_table
        set resid = 1.0
        where pk_column = t_row.pk_column; --<<< !!! important !!!
    END LOOP;
END;
$$
LANGUAGE plpgsql;
Note that you have to add a where condition on the primary key to the update statement otherwise you would update all rows for each iteration of the loop.
A slightly more efficient solution is to use a cursor, and then do the update using where current of
CREATE OR REPLACE FUNCTION LoopThroughTable()
RETURNS VOID
AS $$
DECLARE
    t_curs cursor for
        select * from the_table;
    t_row the_table%rowtype;
BEGIN
    FOR t_row in t_curs LOOP
        update the_table
        set resid = 1.0
        where current of t_curs;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
So if I execute the UPDATE query after the loop has finished, will that commit the changes to the table?
No. The call to the function runs in the context of the calling transaction. So you need to commit after running SELECT LoopThroughTable() if you have disabled auto commit in your SQL client.
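A minimal sketch of that calling pattern with an explicit transaction:

BEGIN;
SELECT LoopThroughTable();
COMMIT; -- the function's updates become visible to other sessions here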
Note that the language name is an identifier, do not use single quotes around it. You should also avoid using keywords like row as variable names.
Using dollar quoting (as I did) also makes writing the function body easier.
I'm not sure if the proof of concept example does what you want. In general, with SQL, you almost never need a FOR loop. While you can use a function, if you have PostgreSQL 9.3 or later, you can use a LATERAL subquery to perform subqueries for each row.
For example, create 10,000 random 3D points with a random value column:
CREATE TABLE points(
    gid serial primary key,
    geom geometry(PointZ),
    value numeric
);
CREATE INDEX points_geom_gist ON points USING gist (geom);

INSERT INTO points(geom, value)
SELECT ST_SetSRID(ST_MakePoint(random()*1000, random()*1000, random()*100), 0), random()
FROM generate_series(1, 10000);
For each point, search for the 100 nearest points (except the point in question), and find the residual between the points' value and the average of the 100 nearest:
SELECT p.gid, p.value - avg(l.value) AS residual
FROM points p,
     LATERAL (
        SELECT value
        FROM points j
        WHERE j.gid <> p.gid
        ORDER BY p.geom <-> j.geom
        LIMIT 100
     ) l
GROUP BY p.gid
ORDER BY p.gid;
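Since the original goal was to store the residual in a resid column, the same query can feed an UPDATE ... FROM. A sketch under the assumptions above (the resid column is added here; it is not part of the answer's sample table):

ALTER TABLE points ADD COLUMN resid numeric;

UPDATE points p
SET resid = r.residual
FROM (
    SELECT p2.gid, p2.value - avg(l.value) AS residual
    FROM points p2,
         LATERAL (
            SELECT value
            FROM points j
            WHERE j.gid <> p2.gid
            ORDER BY p2.geom <-> j.geom
            LIMIT 100
         ) l
    GROUP BY p2.gid   -- p2.value is allowed in the select list because gid is the primary key
) r
WHERE p.gid = r.gid;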
Following is a simple example to update rows in a table:
Assuming the row id field is id.
Update all rows:
UPDATE my_table SET field1='some value'
WHERE id IN (SELECT id FROM staff)
Selective row update
UPDATE my_table SET field1='some value'
WHERE id IN (SELECT id FROM staff WHERE field2='same value')
You don't need a function for that.
All you need is to run this query:
UPDATE table SET resid = 1.0;
If you want to do it with a function, you can use an SQL function:
CREATE OR REPLACE FUNCTION LoopThroughTable()
RETURNS VOID AS
$BODY$
    UPDATE table SET resid = 1.0;
$BODY$
LANGUAGE sql VOLATILE;
If you want to use plpgsql, then the function would be:
CREATE OR REPLACE FUNCTION LoopThroughTable()
RETURNS void AS
$BODY$
begin
    UPDATE table SET resid = 1.0;
end;
$BODY$
LANGUAGE plpgsql VOLATILE;
Note that it is not recommended to use plpgsql functions for tasks that can be done with plain SQL functions.

How to select distinct in esql?

I have a subflow in esql (IBM WebSphere Message Broker) where I need to achieve something similar to select distinct functionality.
Some background: I have a table in an Oracle database group_errcode_ref. This table is pretty much a fixed link/mapping of ERROR_CODE and ID. ERROR_CODE is unique, but ID can be duplicated. For example error code 4000 and 4001 can both be linked to ID 1.
In my esql subflow, I have an array of error codes that varies based on the current data coming into the flow.
So what I need to do is I need to take the input error code array, and select the ID for all the error codes in the array from my table group_errcode_ref
What I have now:
declare db row;
set db.rows[] = (select d.ID from Database.group_errcode_ref as d where d.ERROR_CODE in (select D from errCodes.Code[] as D));
errCodes is the array of error codes from the input; db.rows[] is the array of all IDs that correspond to the error codes.
This is fine, but I want to remove duplicates from the db.rows[] array.
I'm not certain of the best way to do this in esql, but it does not support distinct, group by, or order by.
If you are using the PASSTHRU statement, then all the functionality of your database manager is supported, so distinct as well.
The only thing you have to overcome is that you cannot directly mix database and messagetree queries in PASSTHRU, everything you pass to it goes directly to the database.
So your original solution would look something like this:
set db.rows[] = PASSTHRU 'select distinct d.ID from SCHEMA.group_errcode_ref as d where d.ERROR_CODE in ' || getErrorCodesFromInput(errCodes) TO Database.DSN1;
Here getErrorCodesFromInput is a function that returns character, which contains the error codes in your input, formatted correctly for the query, e.g. (ec1, ec2, ...)
My workaround ended up not using select distinct or sorting at all. Basically I iterate through the entire array of error codes, query for the ID that corresponds to each one, and then select count(*) on the table I insert into to avoid duplicates.
This works for my particular application only because I insert the ID/Issue pairs.
Basically it looks like:
declare db row;  -- declares moved above the loop so they are not re-declared on each iteration
declare db2 row;
for x as errs.Error[] do
    set db.rows[] = passthru('select ID from my_static_map_table where error_code = ?;' values(x.Code));
    set db2.rows[] = passthru('select count(*) as cnt from my_table_2 where guid = ? and id = ?;' values(guid, db.rows.ID));
    -- count aliased as cnt so it is addressable; ESQL comparison uses =, not ==
    if db2.rows.cnt = 0 then
        -- Here I do an insert into my_table_2 with ID and a few other values
    end if;
end for;
Not really a proper answer, but it works for my specific application. Basically, loop through every error code and select one at a time, rather than sending in the entire array. Then do an insert into the other table, avoiding duplicates with another select to check whether the row has already been inserted.
I'll still wait a week to see if there's a better answer and accept that one.
UPDATE
I've changed my code to match Attila's solution - which is much better and what I was looking for originally.
Only thing I will add is my function that formats the error codes - which is really simple:
create function FlattenErrorCodesArray(in err row) returns char begin
    declare idx int 1;
    declare ret char;
    for x as err.Error[] do -- loop over the parameter (the original iterated errs, which is out of scope here)
        if idx = 1 then
            set ret = '(' || cast(x.Code as char);
        else
            set ret = ret || ',' || cast(x.Code as char);
        end if;
        set idx = idx + 1;
    end for;
    set ret = ret || ')';
    return ret; -- the original omitted the return
end;
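Plugged into Attila's PASSTHRU query, the call would then look something like this (a sketch; the DSN and variable names are reused from the earlier snippets):

set db.rows[] = PASSTHRU('select distinct d.ID from SCHEMA.group_errcode_ref as d where d.ERROR_CODE in ' || FlattenErrorCodesArray(errs) TO Database.DSN1);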

Display Number of Rows based on input parameter

CREATE OR REPLACE PROCEDURE test_max_rows (
    max_rows IN NUMBER DEFAULT 1000
)
IS
    CURSOR cur_test ( max_rows IN number ) IS
        SELECT id FROM test_table
        WHERE user_id = 'ABC'
        AND ROWNUM <= max_rows;
    id test_table.id%TYPE;
BEGIN
    OPEN cur_test(max_rows);
    LOOP
        FETCH cur_test INTO id;
        EXIT WHEN cur_test%NOTFOUND;
        DBMS_OUTPUT.PUT_LINE('ID:' || id);
    END LOOP;
    CLOSE cur_test; -- close the cursor once done
END;
My requirement is to modify the above code so that when I pass -1 for max_rows, the proc should return all the rows returned by the query. Otherwise, it should limit the rows as per max_rows.
For example:
EXECUTE test_max_rows(-1);
This command should return all the rows returned by the SELECT statement above.
EXECUTE test_max_rows(10);
This command should return only 10 rows.
You can do this with an OR clause; change:
AND ROWNUM <= max_rows;
to:
AND (max_rows < 1 OR ROWNUM <= max_rows);
Then passing zero, -1, or any negative number will fetch all rows, and any positive number will return a restricted list. You could also replace the default 1000 clause with default null, and then test for null instead, which might be a bit more obvious:
AND (max_rows is null OR ROWNUM <= max_rows);
Note that which rows you get with a passed value will be indeterminate because you don't have an order by clause at the moment.
Doing this in a procedure also seems a bit odd, and you're assuming whoever calls it will be able to see the output - i.e. will have done set serveroutput on or the equivalent for their client - which is not a very safe assumption. An alternative, if you can't specify the row limit in a simple query, might be to use a pipelined function instead - you could at least then call that from plain SQL.
CREATE OR REPLACE FUNCTION test_max_rows (max_rows IN NUMBER DEFAULT NULL)
RETURN sys.odcinumberlist PIPELINED
AS
BEGIN
    FOR r IN (
        SELECT id FROM test_table
        WHERE user_id = 'ABC'
        AND (max_rows IS NULL OR ROWNUM <= max_rows)
    ) LOOP
        PIPE ROW (r.id);
    END LOOP;
    RETURN; -- a pipelined function ends with a bare RETURN
END;
/
And then call it as:
SELECT * FROM TABLE(test_max_rows);
or
SELECT * FROM TABLE(test_max_rows(10));
Here's a quick SQL Fiddle demo. But you should still consider whether you can do the whole thing in plain SQL and avoid PL/SQL altogether.
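For comparison, the same optional-limit predicate works as plain SQL with a bind variable and no PL/SQL at all (a sketch; :max_rows is supplied by the client, NULL meaning no limit):

SELECT id
FROM test_table
WHERE user_id = 'ABC'
AND (:max_rows IS NULL OR ROWNUM <= :max_rows);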

PL/pgSQL checking if a row exists

I'm writing a function in PL/pgSQL, and I'm looking for the simplest way to check if a row exists.
Right now I'm SELECTing an integer into a boolean, which doesn't really work. I'm not experienced with PL/pgSQL enough yet to know the best way of doing this.
Here's part of my function:
DECLARE person_exists boolean;
BEGIN
person_exists := FALSE;
SELECT "person_id" INTO person_exists
FROM "people" p
WHERE p.person_id = my_person_id
LIMIT 1;
IF person_exists THEN
-- Do something
END IF;
END; $$ LANGUAGE plpgsql;
Update - I'm doing something like this for now:
DECLARE person_exists integer;
BEGIN
person_exists := 0;
SELECT count("person_id") INTO person_exists
FROM "people" p
WHERE p.person_id = my_person_id
LIMIT 1;
IF person_exists < 1 THEN
-- Do something
END IF;
Simpler, shorter, faster: EXISTS.
IF EXISTS (SELECT FROM people p WHERE p.person_id = my_person_id) THEN
-- do something
END IF;
The query planner can stop at the first row found - as opposed to count(), which scans all (qualifying) rows regardless. Makes a big difference with big tables. The difference is small for a condition on a unique column: only one row qualifies and there is an index to look it up quickly.
Only the existence of at least one qualifying row matters. The SELECT list can be empty - in fact, that's shortest and cheapest. (Some other RDBMS don't allow an empty SELECT list on principle.)
Improved with @a_horse_with_no_name's comments.
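If you prefer the question's original shape with a variable, EXISTS also assigns directly to a boolean. A minimal sketch reusing the question's names:

DECLARE
    person_exists boolean;
BEGIN
    person_exists := EXISTS (SELECT FROM people p WHERE p.person_id = my_person_id);
    IF person_exists THEN
        -- Do something
    END IF;
END;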
Use count(*)
declare
cnt integer;
begin
SELECT count(*) INTO cnt
FROM people
WHERE person_id = my_person_id;
IF cnt > 0 THEN
-- Do something
END IF;
Edit (for the downvoter who didn't read the statement and others who might be doing something similar)
The solution is only effective because there is a where clause on a column (and the name of the column suggests that it's the primary key - so the where clause is highly effective).
Because of that where clause there is no need to use a LIMIT or something else to test the presence of a row that is identified by its primary key. It is an effective way to test this.