Optimising function which extracts records with a minimum gap in timestamps - sql

I have a big table of timestamps in Postgres 9.4.5:
CREATE TABLE vessel_position (
posid serial NOT NULL,
mmsi integer NOT NULL,
"timestamp" timestamp with time zone,
the_geom geometry(PointZ,4326),
CONSTRAINT "PK_posid_mmsi" PRIMARY KEY (posid, mmsi)
);
Additional index:
CREATE INDEX vessel_position_timestamp_idx ON vessel_position ("timestamp");
I want to extract every row where the timestamp is at least x minutes after the previous row. I've tried a few different SELECT statements using LAG() which all kind of worked, but didn't give me the exact result I require. The function below gives me what I need, but I feel it could be quicker:
CREATE OR REPLACE FUNCTION _getVesslTrackWithInterval(mmsi integer, startTime character varying (25) ,endTime character varying (25), interval_min integer)
RETURNS SETOF vessel_position AS
$func$
DECLARE
count integer DEFAULT 0;
posids varchar DEFAULT '';
tbl CURSOR FOR
SELECT
posID
,EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY posid asc)))::int as diff
FROM vessel_position vp WHERE vp.mmsi = $1 AND vp.timestamp BETWEEN $2::timestamp AND $3::timestamp;
BEGIN
FOR row IN tbl
LOOP
count := coalesce(row.diff,0) + count;
IF count >= $4*60 OR count = 0 THEN
posids:= posids || row.posid || ',';
count:= 0;
END IF;
END LOOP;
RETURN QUERY EXECUTE 'SELECT * from vessel_position where posid in (' || TRIM(TRAILING ',' FROM posids) || ')';
END
$func$ LANGUAGE plpgsql;
I can't help thinking getting all the posids as a string and then selecting them all again at the very end is slowing things down.
Within the IF statement, I already have access to each row I want to keep, so I could potentially store them in a temp table and then return the temp table at the end of the loop.
Can this function be optimised - to improve performance in particular?

Query
Your function has all kinds of expensive, unnecessary overhead. A single query should be many times faster, doing the same:
CREATE OR REPLACE FUNCTION _get_vessel_track_with_interval
    (mmsi int, starttime timestamptz, endtime timestamptz, min_interval interval)
  RETURNS SETOF vessel_position AS
$func$
BEGIN
   RETURN QUERY
   SELECT (vp).*  -- parentheses required for decomposing the row type
   FROM  (
      SELECT vp  -- whole row (!)
           , timestamp - lag(timestamp) OVER (ORDER BY posid) AS diff
      FROM   vessel_position vp
      WHERE  vp.mmsi = $1
      AND    vp.timestamp >= $2  -- typically you'd include the lower bound
      AND    vp.timestamp <  $3  -- ... and exclude the upper
      ) sub
   WHERE  diff >= $4
   ORDER  BY posid;
END
$func$ LANGUAGE plpgsql STABLE;
Could also just be an SQL function, or the bare SELECT without any wrapper (maybe as a prepared statement).
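A hedged sketch of the SQL-function variant, with the same signature and logic as the plpgsql version above:
CREATE OR REPLACE FUNCTION _get_vessel_track_with_interval
    (mmsi int, starttime timestamptz, endtime timestamptz, min_interval interval)
  RETURNS SETOF vessel_position AS
$func$
   SELECT (vp).*
   FROM  (
      SELECT vp
           , timestamp - lag(timestamp) OVER (ORDER BY posid) AS diff
      FROM   vessel_position vp
      WHERE  vp.mmsi = $1
      AND    vp.timestamp >= $2
      AND    vp.timestamp <  $3
      ) sub
   WHERE  diff >= $4
   ORDER  BY posid;
$func$ LANGUAGE sql STABLE;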
Note how starttime and endtime are passed as timestamptz. (It makes no sense to pass them as text and cast.) And the minimum interval min_interval is an actual interval. Pass any interval of your choosing.
Index
If the predicate on mmsi is in any way selective, the two indexes you currently have (PK ON (posid, mmsi) and idx on (timestamp)) are not very useful. If you reverse the column order of your PK to (mmsi, posid), it becomes far more useful for the query at hand. See:
Is a composite index also good for queries on the first field?
The optimal index for this would typically be on vessel_position(mmsi, timestamp); a sketch follows the links below. Related:
Multicolumn index and performance
PostgreSQL performance with (col = value or col is NULL)
Query does not hit the index - are these the proper columns to index?
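A minimal sketch of that index (index name assumed):
CREATE INDEX vessel_position_mmsi_timestamp_idx ON vessel_position (mmsi, "timestamp");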
Aside: Avoid keywords as identifiers. That's asking for trouble. Plus, a column timestamp that actually holds timestamptz is misleading.

Related

How to use upsert with Postgres

I want to convert this Postgres code to something shorter that does the same. I read about upsert, but I couldn't find a good way to implement it in my code.
What I wrote works fine, but I want to find a more elegant way to write it.
Hope someone here can help me! This is the query:
CREATE OR REPLACE FUNCTION insert_table(
in_guid character varying,
in_x_value character varying,
in_y_value character varying
)
RETURNS TABLE(response boolean) LANGUAGE 'plpgsql'
DECLARE _id integer;
BEGIN
-- guid exists and it's been 10 minutes from created_date:
IF ((SELECT COUNT (*) FROM public.tbl_client_location WHERE guid = in_guid AND created_date < NOW() - INTERVAL '10 MINUTE') > 0) THEN
RETURN QUERY (SELECT FALSE);
-- guid exists but 10 minutes haven't passed yet:
ELSEIF ((SELECT COUNT (*) FROM public.tbl_client_location WHERE guid = in_guid) > 0) THEN
UPDATE
public.tbl_client_location
SET
x_value = in_x_value,
y_value = in_y_value,
updated_date = now()
WHERE
guid = in_guid;
RETURN QUERY (SELECT TRUE);
-- guid not exist:
ELSE
INSERT INTO public.tbl_client_location
( guid , x_value , y_value )
VALUES
( in_guid, in_x_value, in_y_value )
RETURNING id INTO _id;
RETURN QUERY (SELECT TRUE);
END IF;
END
This can indeed be a lot simpler:
CREATE OR REPLACE FUNCTION insert_table(in_guid text
, in_x_value text
, in_y_value text
, OUT response bool) -- ④
-- RETURNS record -- optional noise -- ④
LANGUAGE plpgsql AS -- ①
$func$ -- ②
-- DECLARE
-- _id integer; -- what for?
BEGIN
INSERT INTO tbl AS t
( guid, x_value, y_value)
VALUES (in_guid, in_x_value, in_y_value)
ON CONFLICT (guid) DO UPDATE -- guid exists
SET ( x_value, y_value, updated_date)
= (EXCLUDED.x_value, EXCLUDED.y_value, now()) -- ⑤
WHERE t.created_date >= now() - interval '10 minutes' -- ③ have not passed yet
-- RETURNING id INTO _id -- what for?
;
response := FOUND; -- ⑥
END
$func$;
Assuming guid is defined UNIQUE or PRIMARY KEY, and created_date is defined NOT NULL DEFAULT now().
① Language name is an identifier - better without quotes.
② Quotes around function body were missing (invalid command). See:
What are '$$' used for in PL/pgSQL
③ UPDATE only if 10 min have not passed yet. Keep in mind that timestamps are those from the beginning of the respective transactions. So keep transactions short and simple. See:
Difference between now() and current_timestamp
④ A function with OUT parameter(s) and no RETURNS clause returns a single row (record) automatically. Your original was declared as set-returning function (0-n returned rows), which didn't make sense. See:
Return multiple fields as a record in PostgreSQL with PL/pgSQL
⑤ It's generally better to use the special EXCLUDED row than to spell out values again. See:
How could this UPSERT query be made shorter?
⑤ Also using short syntax for updating multiple columns. See:
Update multiple columns that start with a specific string
⑥ To see whether a row was written, use the special variable FOUND. Subtle difference: unlike your original, you get true or false after the fact, saying that a row has actually been written (or not). In your original, the INSERT or UPDATE might still be skipped (without raising an exception) by a trigger or rule, and the function result would be misleading in that case. See:
IS NOT NULL test for a record does not return TRUE when variable is set
Further reading:
Postgres ON CONFLICT ON CONSTRAINT triggering errors in the error log
How to use RETURNING with ON CONFLICT in PostgreSQL?
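A hypothetical call of the function above, with invented values, just to show the shape of the result:
SELECT * FROM insert_table('some-guid', '31.78', '35.21');
-- returns one row: response = true if a row was inserted or updated, false otherwise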
You might just run the single SQL statement instead, providing your values once:
INSERT INTO tbl AS t(guid, x_value,y_value)
VALUES ($in_guid, $in_x_value, $in_y_value) -- your values here, once
ON CONFLICT (guid) DO UPDATE
SET (x_value,y_value, updated_date)
= (EXCLUDED.x_value, EXCLUDED.y_value, now())
WHERE t.created_date >= now() - interval '10 minutes';
I finally solved it. I made another function that is called first to check whether the guid already exists and whether the time has passed; then I can do the upsert without any problems.
That's what I did at the end:
CREATE OR REPLACE FUNCTION fnc_check_table(
in_guid character varying)
RETURNS TABLE(response boolean)
LANGUAGE 'plpgsql'
COST 100
VOLATILE
ROWS 1000
AS $BODY$
BEGIN
IF EXISTS (SELECT FROM tbl WHERE guid = in_guid AND created_date < NOW() - INTERVAL '10 MINUTE' ) THEN
RETURN QUERY (SELECT FALSE);
ELSEIF EXISTS (SELECT FROM tbl WHERE guid = in_guid AND created_date > NOW() - INTERVAL '10 MINUTE') THEN
RETURN QUERY (SELECT TRUE);
ELSE
RETURN QUERY (SELECT TRUE);
END IF;
END
$BODY$;
CREATE OR REPLACE FUNCTION fnc_insert_table(
in_guid character varying,
in_x_value character varying,
in_y_value character varying)
RETURNS TABLE(response boolean)
LANGUAGE 'plpgsql'
COST 100
VOLATILE
ROWS 1000
AS $BODY$
BEGIN
IF (fnc_check_table(in_guid)) THEN
INSERT INTO tbl (guid, x_value, y_value)
VALUES (in_guid,in_x_value,in_y_value)
ON CONFLICT (guid)
DO UPDATE SET x_value=in_x_value, y_value=in_y_value, updated_date=now();
RETURN QUERY (SELECT TRUE);
ELSE
RETURN QUERY (SELECT FALSE);
END IF;
END
$BODY$;

How to use a counter in a Postgresql function loop when generating a timestamp type for each iteration

CREATE OR REPLACE FUNCTION inserts_table_foo(startId INT,
endId INT,
stepTimeZone INT default 0,
period INT default 0) RETURNS void AS
$$
DECLARE
nameVar TEXT;
dateVar TIMESTAMP;
begin
for idVar in startId..endId
loop
nameVar := md5(random()::text);
dateVar :=
(now()::timestamp with time zone +
make_interval(hours := stepTimeZone) +
make_interval(days := period)
);
period := period + 1;
insert into foo(id, name, data) values (idVar, nameVar, dateVar);
end loop;
end ;
$$ LANGUAGE plpgsql;
Unfortunately, the period variable does not work as intended. I assumed the counter would increase by 1 each time, so that the generated date would differ on every iteration, but in fact the date turns out to be the same for the entire duration of the loop.
How can the date be generated anew for each iteration? This won't do here:
generate_series ( TIMESTAMP WITHOUT TIME ZONE '2022-01-01', CURRENT_DATE, '1 day' )
because I need to insert 100,000,000 records into a table where date fields can appear in several places (dateStart, dateEnd, startDateReg, etc.) and are placed out of order. Therefore, I want a function that accepts a number of variables with which I can set the periods of the generated values; in this example, the date and time.
I would like to clearly see which variable the value goes into, and also to see the code that inserts this value into the table (that is, the INSERT INTO statement).
What is the problem here and how can it be solved?
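Since no answer is recorded here, one hedged observation: now() returns the transaction start time, so the base timestamp is identical for every iteration; only the period offset distinguishes the rows. If the goal is a timestamp generated anew on each iteration, clock_timestamp() advances during execution. A minimal sketch under that assumption (function name hypothetical):
CREATE OR REPLACE FUNCTION inserts_table_foo2(startId int,
                                              endId int,
                                              stepTimeZone int default 0,
                                              period int default 0) RETURNS void AS
$$
DECLARE
    nameVar text;
    dateVar timestamp;
BEGIN
    FOR idVar IN startId..endId LOOP
        nameVar := md5(random()::text);
        -- clock_timestamp() keeps advancing, unlike now(),
        -- which is frozen at transaction start
        dateVar := clock_timestamp()::timestamp
                   + make_interval(hours := stepTimeZone)
                   + make_interval(days := period);
        period := period + 1;
        INSERT INTO foo(id, name, data) VALUES (idVar, nameVar, dateVar);
    END LOOP;
END;
$$ LANGUAGE plpgsql;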

how to get the count of rows returned of last executed selected query in postgresql

I have just started with PostgreSQL and created a function that returns a table based on a select query with a where clause and limit.
I also want the function to return the count of all the rows that satisfy the condition, without considering the limit clause.
Here is the function
CREATE OR REPLACE FUNCTION public.get_notifications(
search_text character,
page_no integer,
count integer)
RETURNS TABLE(id integer, head character, description text, img_url text, link text, created_on timestamp without time zone)
LANGUAGE 'plpgsql'
COST 100
VOLATILE
ROWS 1000
AS $BODY$
DECLARE
query text;
skip_records int = page_no * count;
BEGIN
query = concat('SELECT id,head,description,img_url,link,created_on from notifications where head ilike ''%',
search_text, '%''', ' offset ', skip_records, ' limit ', count);
RETURN QUERY Execute query;
END;
$BODY$;
Here is the call
select * from get_notifications('s',0,5)
You can use a window function to return the count of the entire rowset:
select col1
, col2
, count(*) over () as total_rows -- < Window function
from YourTable
order by
col1
limit 10
The over clause specifies the window for count to operate on. An empty over () makes the window span the entire result set, so total_rows counts every row that matches the where clause, before the limit is applied.
Working example on SQL Fiddle.
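Applied to the function from the question, a hedged sketch (the extra total_rows column carries the pre-limit count; passing the pattern with EXECUTE ... USING also avoids the SQL injection risk of concatenating search_text into the query string):
CREATE OR REPLACE FUNCTION public.get_notifications(
    search_text character,
    page_no integer,
    count integer)
RETURNS TABLE(id integer, head character, description text, img_url text,
              link text, created_on timestamp without time zone, total_rows bigint)
LANGUAGE plpgsql AS
$BODY$
BEGIN
   RETURN QUERY EXECUTE
      'SELECT id, head, description, img_url, link, created_on
            , count(*) OVER () AS total_rows  -- counted before offset/limit
       FROM   notifications
       WHERE  head ILIKE $1
       OFFSET $2 LIMIT $3'
   USING '%' || search_text || '%', page_no * count, count;
END;
$BODY$;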

PostgreSQL Varchar UID to Int UID while preserving uniqueness

Say I have a unique column of VarChar(32).
ex. 13bfa574e23848b68f1b7b5ff6d794e1.
I want to preserve the uniqueness of this while converting the column to int. I figure I can convert all of the letters to their ascii equivalent, while retaining the numbers and character position. To do this, I will use the translate function.
pseudo code: select translate(uid, '[^0-9]', ascii('[^0-9]'))
My issue is finding all of the letters in the VarChar column originally.
I've tried
select uid, substring(uid from '[^0-9]') from test_table;
But it only returns the first letter it encounters. Using the above example, I would be looking for bfaebfbbffde
Any help is appreciated!
First off, I agree with the two commenters who said you should use a uuid datatype.
That aside...
Your UID looks like a traditional one, in that it's not alphanumeric, it's hex. If this is the case, you can convert the hex to the numeric value using this solution:
PostgreSQL: convert hex string of a very large number to a NUMERIC
Notice the accepted solution (mine, shame) is not as good as the other solution listed, as mine will not work for hex values this large.
That said, yikes, what a huge number. Holy smokes.
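To illustrate, a minimal sketch of that hex-to-numeric idea (hex_to_numeric is a hypothetical helper; the linked answer has more robust variants):
CREATE OR REPLACE FUNCTION hex_to_numeric(hexstr text) RETURNS numeric AS
$$
DECLARE
   result numeric := 0;
   c      text;
BEGIN
   FOREACH c IN ARRAY string_to_array(lower(hexstr), NULL) LOOP
      -- position() is 1-based ('0' yields 1), so subtract 1 for the digit value
      result := result * 16 + position(c IN '0123456789abcdef') - 1;
   END LOOP;
   RETURN result;
END
$$ LANGUAGE plpgsql IMMUTABLE;
-- SELECT hex_to_numeric('13bfa574e23848b68f1b7b5ff6d794e1');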
Depending on how many records are in your table and the frequency of insert/update, I would consider a radically different approach. In a nutshell, I would create another column to store your numeric ID whose value would be determined by a sequence.
If you really want to make it bulletproof, you can also create a cross-reference table to store the relationships, which would:
- Reuse an ID if it ever repeated (I know UIDs don't, but this would cover cases where a record is deleted by mistake, re-appears, and you want to retain the original id)
- Cover the case where UIDs repeat (e.g. this is a child table with multiple records per UID)
If neither of these applies, you could dumb it down quite a bit.
The solution would look something like this:
Add an ID column that will be your numeric equivalent to the UID:
alter table test_table
add column id bigint
Create a sequence:
CREATE SEQUENCE test_id
Create a cross-reference table (again, not necessary for the dumbed-down version):
create table test_id_xref (
uid varchar(32) not null,
id bigint not null,
constraint test_id_xref_pk primary key (uid)
)
Then do a one-time update to assign a surrogate ID to each UID for both the cross-reference and actual tables:
insert into test_id_xref
with uids as (
select distinct uid
from test_table
)
select uid, nextval ('test_id')
from uids;
update test_table tt
set id = x.id
from test_id_xref x
where tt.uid = x.uid;
And finally, for all future inserts, create a trigger to assign the next value:
CREATE OR REPLACE FUNCTION test_table_insert_trigger()
RETURNS trigger AS
$BODY$
BEGIN
select t.id
from test_id_xref t
into NEW.id
where t.uid = NEW.uid;
if NEW.id is null then
NEW.id := nextval('test_id');
insert into test_id_xref values (NEW.uid, NEW.id);
end if;
return NEW;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
CREATE TRIGGER insert_test_table_trigger
BEFORE INSERT
ON test_table
FOR EACH ROW
EXECUTE PROCEDURE test_table_insert_trigger();
Create a function which removes the characters you don't need from the string:
CREATE OR REPLACE FUNCTION replace_char(v_string varchar(32)) RETURNS varchar(32) AS
$$
DECLARE
   v_return_string varchar(32) := '';
   v_remove_char   text := '1,2,3,4,5,6,7,8,9,0';
   v_length        int;
   j               int := 0;
BEGIN
   v_length := length(v_string);
   WHILE j < v_length LOOP
      -- keep the character only if it is not in the comma-separated remove list
      IF NOT (substr(v_string, j + 1, 1) = ANY (string_to_array(v_remove_char, ','))) THEN
         v_return_string := v_return_string || substr(v_string, j + 1, 1);
      END IF;
      j := j + 1;
   END LOOP;
   RETURN v_return_string;
END
$$ LANGUAGE plpgsql IMMUTABLE;
Now you just need to call this function in a query:
select uid, replace_char(uid) from test_table;
It will give you the string you need (bfaebfbbffde).
If you want the int number only, i.e. 13574238486817567941, then change the value of the variable, and also change the column datatype to decimal(50,0); decimal can store large numbers, and with 0 decimal places it will store the int value as a decimal:
v_remove_char = 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z';
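Aside: in PostgreSQL itself, regexp_replace() can do this without a loop; a hedged one-liner matching the two variants above:
SELECT uid
     , regexp_replace(uid, '[0-9]', '', 'g')  AS letters_only  -- bfaebfbbffde
     , regexp_replace(uid, '[^0-9]', '', 'g') AS digits_only   -- 13574238486817567941
FROM   test_table;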

How to structure SQL - select first X rows for each value of a column?

I have a table with the following type of data:
create table store (
n_id serial not null primary key,
n_place_id integer not null references place(n_id),
dt_modified timestamp not null,
t_tag varchar(4),
n_status integer not null default 0
...
(about 50 more fields)
);
There are indices on n_id, n_place_id, dt_modified and all other fields used in the query below.
This table contains about 100,000 rows at present, but may grow to closer to a million or even more. Yet, for now let's assume we're staying at around the 100K mark.
I'm trying to select rows from this table where one of two conditions is met:
All rows where n_place_id is in a specific subset (this part is easy); or
For all other n_place_id values the first ten rows sorted by dt_modified (this is where it becomes more complicated).
Doing it in one SQL seems to be too painful, so I'm happy with a stored function for this. I have my function defined thus:
create or replace function api2.fn_api_mobile_objects()
returns setof store as
$body$
declare
maxres_free integer := 10;
resulter store%rowtype;
mcnt integer := 0;
previd integer := 0;
begin
create temporary table paid on commit drop as
select n_place_id from payments where t_reference is not null and now()::date between dt_paid and dt_valid;
for resulter in
select * from store where n_status > 0 and t_tag is not null order by n_place_id, dt_modified desc
loop
if resulter.n_place_id in (select n_place_id from paid) then
return next resulter;
else
if previd <> resulter.n_place_id then
mcnt := 0;
previd := resulter.n_place_id;
end if;
if mcnt < maxres_free then
return next resulter;
mcnt := mcnt + 1;
end if;
end if;
end loop;
end;$body$
language 'plpgsql' volatile;
The problem is that
select * from api2.fn_api_mobile_objects()
takes about 6-7 seconds to execute. Considering that after that this resultset needs to be joined to 3 other tables with a bunch of additional conditions applied and further sorting applied, this is clearly unacceptable.
Well, I still do need to get this data, so either I am missing something in the function or I need to rethink the entire algorithm. Either way, I need help with this.
CREATE TABLE store
( n_id serial not null primary key
, n_place_id integer not null -- references place(n_id)
, dt_modified timestamp not null
, t_tag varchar(4)
, n_status integer not null default 0
);
INSERT INTO store(n_place_id,dt_modified,n_status)
SELECT n,d,n%4
FROM generate_series(1,100) n
, generate_series('2012-01-01'::date ,'2012-10-01'::date, '1 day'::interval ) d
;
WITH zzz AS (
SELECT n_id AS n_id
, rank() OVER (partition BY n_place_id ORDER BY dt_modified) AS rnk
FROM store
)
SELECT st.*
FROM store st
JOIN zzz ON zzz.n_id = st.n_id
WHERE st.n_place_id IN ( 1,22,333)
OR zzz.rnk <=10
;
Update: here is the same self-join construct as a subquery (CTEs are treated a bit differently by the planner):
SELECT st.*
FROM store st
JOIN ( SELECT sx.n_id AS n_id
, rank() OVER (partition BY sx.n_place_id ORDER BY sx.dt_modified) AS zrnk
FROM store sx
) xxx ON xxx.n_id = st.n_id
WHERE st.n_place_id IN ( 1,22,333)
OR xxx.zrnk <=10
;
After much struggle, I managed to get the stored function to return the results in just over 1 second (which is a huge improvement). Now the function looks like this (I added the additional condition, which didn't affect the performance much):
create or replace function api2.fn_api_mobile_objects(t_search varchar)
returns setof store as
$body$
declare
maxres_free integer := 10;
resulter store%rowtype;
mid integer := 0;
begin
create temporary table paid on commit drop as
select n_place_id from payments where t_reference is not null and now()::date between dt_paid and dt_valid
union
select n_place_id from store where n_status > 0 and t_tag is not null group by n_place_id having count(1) <= 10;
for resulter in
select * from store
where n_status > 0 and t_tag is not null
and (t_name ~* t_search or t_description ~* t_search)
and n_place_id in (select n_place_id from paid)
loop
return next resulter;
end loop;
for mid in
select distinct n_place_id from store where n_place_id not in (select n_place_id from paid)
loop
for resulter in
select * from store where n_status > 0 and t_tag is not null and n_place_id = mid order by dt_modified desc limit maxres_free
loop
return next resulter;
end loop;
end loop;
end;$body$
language 'plpgsql' volatile;
This runs in just over 1 second on my local machine and in about 0.8-1.0 seconds on live. For my purpose, this is good enough, although I am not sure what will happen as the amount of data grows.
As a simple suggestion, the way I like to do this sort of troubleshooting is to construct a query that gets me most of the way there, and optimize it properly, and then add the necessary pl/pgsql stuff around it. The major advantage to this approach is that you can optimize based on query plans.
Also, if you aren't dealing with a lot of rows, array_agg() and unnest() are your friends, as they allow you (on Pg 8.4 and later!) to dispense with the temporary table management overhead and simply construct and query an array of tuples in memory as a relation. It may also perform better if you are just hitting an array in memory instead of a temp table (less planning overhead and less query overhead too).
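For illustration, a hedged sketch of that array_agg()/unnest() idea, assuming the payments and store tables from the question (the DO block, 9.0+, is only there to make the sketch self-contained):
DO $$
DECLARE
   paid_ids int[];
BEGIN
   -- build the set once, in memory, instead of a temp table
   SELECT array_agg(n_place_id) INTO paid_ids
   FROM   payments
   WHERE  t_reference IS NOT NULL
   AND    now()::date BETWEEN dt_paid AND dt_valid;

   -- then query the array as a relation wherever the temp table was used
   PERFORM * FROM store
   WHERE  n_place_id IN (SELECT unnest(paid_ids));
END
$$;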
Also on your updated query I would look at replacing that final loop with a subquery or a join, allowing the planner to decide when to do a nested loop lookup or when to try to find a better way.
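For example, a hedged sketch of the second loop from the updated function rewritten as one window query (assuming the paid temp table already exists):
SELECT *
FROM  (
   SELECT s.*
        , row_number() OVER (PARTITION BY n_place_id
                             ORDER BY dt_modified DESC) AS rn
   FROM   store s
   WHERE  n_status > 0
   AND    t_tag IS NOT NULL
   AND    n_place_id NOT IN (SELECT n_place_id FROM paid)
   ) sub
WHERE sub.rn <= 10;  -- maxres_free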