PostgreSQL SQL query to find number of occurrences of substring in string - sql

I’m trying to wrap my head around a problem but I’m hitting a blank. I know SQL quite well, but I’m not sure how to approach this.
My problem:
Given a string and a table of possible substrings, I need to find the number of occurrences.
The search table consists of a single column:
searchtable
| pattern TEXT PRIMARY KEY|
|-------------------------|
| my |
| quick |
| Earth |
Given the string "Earth is my home planet and where my friends live", the expected outcome is 3 (2x "my" and 1x "Earth").
In my function, I have variable bodytext which is the string to examine.
I know I can do IN (SELECT pattern FROM searchtable) to get the list of substrings, and I could possibly use a LIKE ANY clause to get matches, but how can I count occurrences of the substrings in the table within the search string?

This is easily done without a custom function:
select count(*)
from (values ('Earth is my home planet and where my friends live')) v(str) cross join lateral
regexp_split_to_table(v.str, ' ') word join
patterns p
on word = p.pattern
Just break the original string into "words". Then match on the words.
Another method uses regular expression matching:
select (select count(*) from regexp_matches(v.str, p.rpattern, 'g'))
from (values ('Earth is my home planet and where my friends live')) v(str) cross join
(select string_agg(pattern, '|') as rpattern
from patterns
) p;
This stuffs all the patterns into a single regular expression. Note that this version does not take word breaks into account.
Here is a db<>fiddle.
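To make the word-break caveat concrete, here is a minimal Python sketch of the same "one big alternation" idea (Python rather than SQL, purely to illustrate the logic). The function name count_pattern_matches and the whole_words switch are hypothetical, and escaping the patterns is an added assumption: the SQL above passes them to the regex engine unescaped.

```python
import re

def count_pattern_matches(text, patterns, whole_words=False):
    """Count non-overlapping matches of any pattern in text."""
    # Skip empty patterns and try longer patterns first, so a pattern is
    # never shadowed by a shorter pattern that is its prefix.
    ordered = sorted((p for p in patterns if p), key=len, reverse=True)
    if not ordered:
        return 0
    # re.escape keeps characters such as '.' or '|' in a pattern literal.
    alternation = "|".join(re.escape(p) for p in ordered)
    if whole_words:
        # \b only matches at a word boundary, so "my" no longer matches in "army".
        alternation = r"\b(?:" + alternation + r")\b"
    return len(re.findall(alternation, text))
```

For the sentence in the question both modes give 3, but for "my army" the plain alternation counts 2 and the word-boundary version counts 1.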

I solved the problem with the following code:
CREATE OR REPLACE FUNCTION count_matches(body TEXT, OUT matches INTEGER) AS $$
DECLARE
results INTEGER := 0;
matchlist RECORD;
BEGIN
FOR matchlist IN (SELECT pattern FROM searchtable)
LOOP
results := results + (SELECT LENGTH(body) -
LENGTH(REPLACE(body, matchlist.pattern, ''))) /
LENGTH(matchlist.pattern);
END LOOP;
matches := results;
END;
$$ LANGUAGE plpgsql;
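The arithmetic behind this function (length of the text minus length of the text with the pattern removed, divided by the pattern length) can be checked outside the database. Below is a small Python sketch of the same calculation. The name count_matches is reused only for illustration, and skipping empty patterns is an added assumption, since an empty pattern would otherwise cause a division by zero.

```python
def count_matches(body, patterns):
    """Count non-overlapping substring occurrences of every pattern in body."""
    total = 0
    for pattern in patterns:
        if not pattern:
            continue  # assumption: ignore empty patterns to avoid dividing by zero
        # Each removed occurrence shortens the string by len(pattern) characters.
        removed_chars = len(body) - len(body.replace(pattern, ""))
        total += removed_chars // len(pattern)
    return total
```

Like the PL/pgSQL version, this counts substrings rather than words, so "my" is also counted inside "army".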

Related

Concatenate rows in function PostgreSQL

Assume there's a table projects containing project name, location, team id, start and end years. How can I concatenate rows so that the same names would combine the other information into one string?
name location team_id start end
Library Atlanta 2389 2015 2017
Library Georgetown 9920 2003 2007
Museum Auckland 3092 2005 2007
Expected output would look like this:
name Records
Library Atlanta, 2389, 2015-2017
Georgetown, 9920, 2003-2007
Museum Auckland, 3092, 2005-2007
Each line should contain end-of-line / new line character.
I have a function for this, but I don't think it would work with just using CONCAT. What are other ways this can be done? What I tried:
CREATE OR REPLACE TYPE projects (name TEXT, records TEXT);
CREATE OR REPLACE FUNCTION records (INT)
RETURNS SETOF projects AS
$$
RETURN QUERY
SELECT p.name
CONCAT(p.location, ', ', p.team_id, ', ', p.start, '-', p.end, CHAR(10))
FROM projects($1) p;
$$
LANGUAGE PLpgSQL;
I tried using CHAR(10) for the new line, but it's giving a syntax error (not sure why?).
The sample above concatenates the strings but, as expected, leaves out the duplicated names.
You do not need PL/pgSQL for that.
First eliminate duplicate names using DISTINCT and then in a subquery you can concat the columns into a single string. After that use array_agg to create an array out of it. It will then "merge" multiple arrays, in case the subquery returns more than one row. Finally, get rid of the commas and curly braces using array_to_string. Instead of using the char value of a newline, you can simply use E'\n' (E stands for escape):
WITH j (name,location,team_id,start,end_) AS (
VALUES ('Library','Atlanta',2389,2015,2017),
('Library','Georgetown',9920,2003,2007),
('Museum','Auckland',3092,2005,2007)
)
SELECT
DISTINCT q1.name,
array_to_string(
(SELECT array_agg(concat(location,', ',team_id,', ',start,'-', end_, E'\n'))
FROM j WHERE name = q1.name),'') AS records
FROM j q1;
name | records
---------+----------------------------
Library | Atlanta, 2389, 2015-2017
| Georgetown, 9920, 2003-2007
|
Museum | Auckland, 3092, 2005-2007
Note: try not to use reserved words or keywords (e.g. end, name, start) as column names. Although PostgreSQL lets you use them (some only when double-quoted, such as "end"), it is considered bad practice.
Demo: db<>fiddle
A slightly simpler query:
select
name,
string_agg( concat(location, ', ', team_id, ', ', start, '-', "end"), E'\n') AS records
FROM t
group by name;
PostgreSQL fiddle
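For readers who want to check the grouping logic outside the database, here is a rough Python sketch of what string_agg with GROUP BY computes. The function name records_by_name and the in-memory rows list are illustrative assumptions, not part of either answer.

```python
from collections import defaultdict

rows = [
    ("Library", "Atlanta", 2389, 2015, 2017),
    ("Library", "Georgetown", 9920, 2003, 2007),
    ("Museum", "Auckland", 3092, 2005, 2007),
]

def records_by_name(rows):
    """Group rows by name and join the formatted records with newlines."""
    grouped = defaultdict(list)
    for name, location, team_id, start, end in rows:
        grouped[name].append(f"{location}, {team_id}, {start}-{end}")
    # The newline is a delimiter between records, so there is no trailing newline.
    return {name: "\n".join(lines) for name, lines in grouped.items()}
```

Because the newline acts as a separator here, each group ends without the extra blank line that appears when E'\n' is appended to every record.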

SAP HANA SQL SUBSTR_REGEXPR Match Aggregation

I am using HANA and am trying to create a new column based on the following:
Regex Example 1: SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\d{2,6})' in "TEXT") as "Location"
How can I get this to return all results instead of just the first? Is it a string agg of this expression repeated? There would be at most 6 matches in each text field (per row).
Regex Example 1 Current Output:
Row Text Location(new column)
1 msdfmsfmdf PT2222, ST 43434 asdasdas PT2222
Regex Example 1 Desired Output:
Row Text Location(new column)
1 msdfmsfmdf PT2222, ST 43434 asdasdas PT2222, ST43434
I also have varying formats so I need to be able to use multiple variations of that regex to be able to capture all matches and put them into the new "Location" column as a delimited aggregation. Is this possible?
One of the other variations is where I would need to pull the numbers from this series:
"Locations 1, 2, 35 & 5 lkfaskjdlsaf .282 lkfdsklfjlkdsj 002"
So far I have:
Regex Example 2: "Locations (\d{1,2}.?){1,5}"
but I know that is not working. When I remove the "Locations" it picks up the numbers but also picks up the .282 and 002 which I do not want.
Regex Example 2 Current Output:
Row Text Location(new column)
1 msdfmsfmdf Locations 3,5,7 & 9" asdasdas Locations 3
Regex Example 2 Desired Output:
Row Text Location(new column)
1 msdfmsfmdf Locations 3,5,7 & 9" asdasdas 3,5,7,9
Sometimes the "Location" in the text field is in the format which would require Example 1s Regex and sometimes it is in the format requiring example 2s regex so I would need to have the regex searching for both possible formats.
Example 3 Regex in Select Statement:
Select "Primary Key",
"Text",
STRING_AGG(SUBSTR_REGEXPR('([PpSs][Tt][Ss]?\w?\d{2,6})' OR '(\d{1,2}.?){1,5})' in "Text" ),',') as "Location"
FROM Table
Needs to capture both example 1 and 2 location formats using some sort of OR condition in the create column SQL
Regex Example 3 Current Output:
Not working, no output
Regex Example 3 Desired Output:
Row Text Location(new column)
1 msdfmsfmdf Locations 3,5,7 & 9" asdasdas 3,5,7,9
2 msdfmsfmdf PT2222, ST 43434 asdasdas PT2222, ST43434
Other Tools I have access to are SAS and python. Any alternate recommendations to simplify the process are welcome. I did already try in Tableau but same problem with only returning the first match. Aggregating them makes the calculation super slow and very long.
Please help me figure this out. Any help is much appreciated.
Thanks.
For a single input string value, the following script can be used.
"Use of SubStr_RegExpr with Series_Generate_Integer to split string using SQLScript in HANA" can help in understanding how the series_generate function is used here.
declare pString nvarchar(5000);
pString := 'msdfmsfmdf PT2222, ST 43434 asdasdas';
select
STRING_AGG(SUBSTR_REGEXPR( '([PpSs][Tt][Ss]?\w?\d{2,6})' IN Replace(pString,' ','') OCCURRENCE NT.Element_Number GROUP 1),',') as "Location"
from
DUMMY as SplitString,
SERIES_GENERATE_INTEGER(1, 0, 10 ) as NT;
The output is PT2222,ST43434
Thanks for adding the necessary requirement examples. This makes it a lot easier to work through the problem.
In this case, your requirement is to match multiple strings against multiple patterns and to apply multiple formatting operations on the output.
This cannot be done in a single regular expression in SAP HANA.
Basically, SAP HANA SQL allows two kinds of regex operations:
Match against a pattern and return one occurrence
Match against a pattern and replace one or ALL occurrences of this match
That means for this transformation we basically can try to remove everything that does not match the pattern or loop over the input string and pick out everything that matches.
The problem with the remove-approach (e.g. using REPLACE_REGEXPR()) is that the matching patterns are not guaranteed not to overlap. That means we could remove matches for other patterns in the process.
Instead, I would use the first kind of operation and try to pick all matches against all patterns and return those.
For that a scalar user-defined function can be created like this:
drop function extract_locators;
create function extract_locators(IN input_text NVARCHAR(1000))
returns location_text NVARCHAR(1000)
as
begin
declare matchers NVARCHAR(100) ARRAY;
declare part_res NVARCHAR(100) := '';
declare full_res NVARCHAR (2000) := '';
declare occn integer;
declare curr_matcher integer;
-- setting up matchers
matchers[1] := '(PT\s*[[:digit:]]+)|(ST\s*[[:digit:]]+)'; -- matches PTxxxx, pt xxxx , St ... , STxxxx
matchers[2] := '(?>\s)[1-9][0-9]*'; -- matches 21, 1, 23, 34
curr_matcher :=0;
-- loop over all matchers
while (:curr_matcher < cardinality(:matchers)) do
curr_matcher := :curr_matcher + 1;
-- loop over all occurrences
occn := 1;
part_res := '';
while (:part_res IS NOT NULL) do
part_res := SUBSTR_REGEXPR(:matchers[:curr_matcher]
FLAG 'i'
IN :input_text
OCCURRENCE :occn);
if (:part_res IS NOT NULL) then
occn := :occn + 1;
full_res := :full_res
|| MAP(LENGTH(:full_res), 0, '', ',')
|| IFNULL(:part_res, '');
else
BREAK;
end if;
end while; -- occurrences
-- if current matcher matched, don't apply the others
if (:full_res !='') then
BREAK;
end if;
end while; -- matchers
-- remove spaces
location_text := replace (:full_res, ' ', '');
end;
With your test data in a table like the following:
drop table loc_data;
create column table loc_data ("CASE" integer primary key,
"INPUT_TEXT" NVARCHAR(2000));
-- PT and ST
insert into loc_data values (1, 'msdfmsfmdf PT2222, ST 43434 asdasdas');
-- Locations
insert into loc_data values (2, 'Locations 1, 2, 35 & 5 lkfaskjdlsaf .282 lkfdsklfjlkdsj 002');
You can now simply run
select
*
, extract_locators("INPUT_TEXT") as location_text
from
loc_data;
To get
1 | msdfmsfmdf PT2222, ST 43434 asdasdas | PT2222,ST43434
2 | Locations 1, 2, 35 & 5 lkfaskjdlsaf .282 lkfdsklfjlkdsj 002 | 1,2,35,5
This approach also allows for keeping the matching rules in a separate table and use a cursor (instead of the array) to loop over them. In addition to that, it keeps the single regular expressions rather small and relatively easy to understand, which is probably the biggest benefit here.
The runtime performance obviously can be an issue, therefore I would probably try and save the results of the operation and only run the function when the data changes.
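Since Python was mentioned as an available tool, the same "try each matcher in order, collect every occurrence, stop at the first matcher that hits" logic can be sketched there. This is only an illustration under assumptions: the Python function extract_locators and the MATCHERS list are hypothetical names, and the second pattern uses a plain \s prefix instead of the atomic group from the HANA version.

```python
import re

# Ordered list of matchers; the first one that finds anything wins.
MATCHERS = [
    re.compile(r"(?:PT|ST)\s*\d+", re.IGNORECASE),  # PTxxxx, pt xxxx, ST xxxx, ...
    re.compile(r"\s[1-9][0-9]*"),                   # whitespace followed by a number
]

def extract_locators(text):
    """Return all matches of the first matching pattern as a comma-separated string."""
    for matcher in MATCHERS:
        found = matcher.findall(text)
        if found:
            # Drop any whitespace inside each match, e.g. "ST 43434" becomes "ST43434".
            return ",".join("".join(match.split()) for match in found)
    return ""
```

With the two test strings above, this gives 'PT2222,ST43434' and '1,2,35,5', the same values as the SQLScript function.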

Merged multiple values in one record value using SQL [duplicate]

I have a table and I'd like to pull one row per id with field values concatenated.
In my table, for example, I have this:
TM67 | 4 | 32556
TM67 | 9 | 98200
TM67 | 72 | 22300
TM99 | 2 | 23009
TM99 | 3 | 11200
And I'd like to output:
TM67 | 4,9,72 | 32556,98200,22300
TM99 | 2,3 | 23009,11200
In MySQL I was able to use the aggregate function GROUP_CONCAT, but that doesn't seem to work here... Is there an equivalent for PostgreSQL, or another way to accomplish this?
Since 9.0 this is even easier:
SELECT id,
string_agg(some_column, ',')
FROM the_table
GROUP BY id
This is probably a good starting point (version 8.4+ only):
SELECT id_field, array_agg(value_field1), array_agg(value_field2)
FROM data_table
GROUP BY id_field
array_agg returns an array, but you can CAST that to text and edit as needed (see clarifications, below).
Prior to version 8.4, you have to define it yourself prior to use:
CREATE AGGREGATE array_agg (anyelement)
(
sfunc = array_append,
stype = anyarray,
initcond = '{}'
);
(paraphrased from the PostgreSQL documentation)
Clarifications:
The result of casting an array to text is that the resulting string starts and ends with curly braces. Those braces need to be removed by some method if they are not desired.
Casting ANYARRAY to TEXT best simulates CSV output, as elements that contain embedded commas are double-quoted in the output in standard CSV style. Neither array_to_string() nor string_agg() (the "group_concat" function added in 9.0) quotes strings with embedded commas, resulting in an incorrect number of elements in the resulting list.
The string_agg() function added in 9.0 does NOT cast the inner results to TEXT first. So "string_agg(value_field, ',')" would generate an error if value_field is an integer; "string_agg(value_field::text, ',')" would be required. The array_agg() method requires only one cast after the aggregation (rather than a cast per value).
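The embedded-comma problem is not specific to PostgreSQL. As an analogy only (Python's csv module is not the same as PostgreSQL's array output format), the sketch below contrasts a plain join with a CSV-quoted join; join_plain and join_csv are hypothetical helper names.

```python
import csv
import io

def join_plain(values, sep=","):
    """Join values without any quoting, like array_to_string or string_agg."""
    return sep.join(str(v) for v in values)

def join_csv(values):
    """Join values as one CSV record, quoting fields that contain the delimiter."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()
```

For the two values 'Smith, John' and 'Doe', the plain join yields a string that splits into three parts on commas, while the quoted form still parses back into two fields.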
SELECT array_to_string(array(SELECT a FROM b),', ');
Will do as well.
Try like this:
select field1, array_to_string(array_agg(field2), ',')
from table1
group by field1;
Assuming that the table your_table has three columns (name, id, value), the query is this one:
select name,
array_to_string(array_agg(id), ','),
array_to_string(array_agg(value), ',')
from your_table
group by name
order by name
;
"TM67" "4,9,72" "32556,98200,22300"
"TM99" "2,3" "23009,11200"
and the version to work on the array type:
select
array_to_string(
array(select distinct unnest(zip_codes) from table),
', '
);
My suggestion for PostgreSQL:
SELECT cpf || ';' || nome || ';' || telefone
FROM (
SELECT cpf
,nome
,STRING_AGG(CONCAT_WS( ';' , DDD_1, TELEFONE_1),';') AS telefone
FROM (
SELECT DISTINCT *
FROM temp_bd
ORDER BY cpf DESC ) AS y
GROUP BY 1,2 ) AS x
In my case the column type was bigint, so the code below worked for me. I am using PostgreSQL 12.
Note the type cast (::text):
string_agg(some_column::text, ',')
For Oracle, the following query should work:
Select First_column,LISTAGG(second_column,',')
WITHIN GROUP (ORDER BY second_column) as Sec_column,
LISTAGG(third_column,',')
WITHIN GROUP (ORDER BY second_column) as thrd_column
FROM tablename
GROUP BY first_column

How to get the first field from an anonymous row type in PostgreSQL 9.4?

=# select row(0, 1) ;
row
-------
(0,1)
(1 row)
How can I get 0 within the same query? I found that the query below sort of works, but is there a simpler way?
=# select json_agg(row(0, 1))->0->'f1' ;
?column?
----------
0
(1 row)
No luck with array-like syntax [0].
Thanks!
Your row type is anonymous and therefore you cannot access its elements easily. What you can do is create a TYPE and then cast your anonymous row to that type and access the elements defined in the type:
CREATE TYPE my_row AS (
x integer,
y integer
);
SELECT (row(0,1)::my_row).x;
Like Craig Ringer commented in your question, you should avoid producing anonymous rows to begin with, if you can help it, and type whatever data you use in your data model and queries.
If you just want the first element from any row, convert the row to JSON and select f1...
SELECT row_to_json(row(0,1))->'f1'
Or, if you are always going to have two integers or a strict structure, you can create a temporary table (or type) and a function that selects the first column.
CREATE TABLE tmptable(f1 int, f2 int);
CREATE FUNCTION gettmpf1(tmptable) RETURNS int AS 'SELECT $1.f1' LANGUAGE SQL;
SELECT gettmpf1(ROW(0,1));
Resources:
https://www.postgresql.org/docs/9.2/static/functions-json.html
https://www.postgresql.org/docs/9.2/static/sql-expressions.html
The json solution is very elegant. Just for fun, this is a solution using regexp (much uglier):
WITH r AS (SELECT row('quotes, "commas",
and a line break".',null,null,'"fourth,field"')::text AS r)
--WITH r AS (SELECT row('',null,null,'')::text AS r)
--WITH r AS (SELECT row(0,1)::text AS r)
SELECT CASE WHEN r.r ~ '^\("",' THEN ''
WHEN r.r ~ '^\("' THEN regexp_replace(regexp_replace(regexp_replace(right(r.r, -2), '""', '\"', 'g'), '([^\\])",.*', '\1'), '\\"', '"', 'g')
ELSE (regexp_matches(right(r.r, -1), '^[^,]*'))[1] END
FROM r
When converting a row to text, PostgreSQL uses quoted CSV formatting. I couldn't find any tools for importing quoted CSV into an array, so the above is a crude text manipulation via mostly regular expressions. Maybe someone will find this useful!
With Postgresql 13+, you can just reference individual elements in the row with .fN notation. For your example:
select (row(0, 1)).f1; --> returns 0.
See https://www.postgresql.org/docs/13/sql-expressions.html#SQL-SYNTAX-ROW-CONSTRUCTORS

How to replace all subsets of characters based on values of other tables in pl/pgsql?

I've been researching how to replace substrings in a single row based on column values from rows of another table, but I have not managed it, because the update only uses the values from the first row of the other table. So I'm planning to do this in a loop in a plpgsql function.
Here is a snippet of my tables. Main table:
Table "public.tbl_main"
Column | Type | Modifiers
-----------------------+--------+-----------
maptarget | text |
expression | text |
maptarget | expression
-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
43194-0 | 363787002:70434600=(386053000:704347000=(414237002:704320005=259470008,704318007=118539007,704319004=50863008),704327008=122592007,246501002=703690001,370132008=30766002)
Look-up table:
Table "public.tbl_values"
Column | Type | Modifiers
-----------------------+--------+-----------
conceptid | bigint |
term | text |
conceptid | term
-----------+------------------------------------------
386053000 | Patient evaluation procedure (procedure)
363787002 | Observable entity (observable entity)
704347000 | Observes (attribute)
704320005 | Towards (attribute)
704318007 | Property type (attribute)
I want to create a function that replaces all numeric values in the tbl_main.expression column with their corresponding tbl_values.term, using tbl_values.conceptid as the link to each numeric value in the expression string.
I'm currently stuck on the looping part, since I'm new to LOOP in plpgsql. Here is the rough draft of my function.
--create first a test table
drop table if exists tbl_test;
create table tbl_test as select * from tbl_main limit 1;
--
create or replace function test ()
RETURNS SETOF tbl_main
LANGUAGE plpgsql
AS $function$
declare
resultItem tbl_main;
v_mapTarget text;
v_expression text;
ctr int;
begin
v_mapTarget:='';
v_expression:='';
ctr:=1;
for resultItem in (select * from tbl_test) loop
v_mapTarget:=resultItem.mapTarget;
select into v_expression expression from ee;
raise notice 'parameter used: %',v_mapTarget;
raise notice 'current expression: %',v_expression;
update ee set expression=replace(v_expression, new_exp::text, term) from (select new_exp::text, term from tbl_values offset ctr limit 1) b ;
ctr:=ctr+1;
raise notice 'counter: %', ctr;
v_expression:= (select expression from ee);
resultItem.expression:= v_expression;
raise notice 'current expression: %',v_expression;
return next resultItem;
end loop;
return;
end;
$function$;
Any further information will be much appreciated.
My Postgres version:
PostgreSQL 9.3.6 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu
4.8.2-19ubuntu1) 4.8.2, 64-bit
PL/pgSQL function with dynamic SQL
Looping is always a measure of last resort. Even in this case it is substantially cheaper to concatenate a query string using a query, and execute it once:
CREATE OR REPLACE FUNCTION f_make_expression(_expr text, OUT result text) AS
$func$
BEGIN
EXECUTE (
SELECT 'SELECT ' || string_agg('replace(', '') || '$1,'
|| string_agg(format('%L,%L)', conceptid::text, v.term), ','
ORDER BY conceptid DESC)
FROM (
SELECT conceptid::bigint
FROM regexp_split_to_table($1, '\D+') conceptid
WHERE conceptid <> ''
) m
JOIN tbl_values v USING (conceptid)
)
USING _expr
INTO result;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT *, f_make_expression(expression) FROM tbl_main;
However, if not all conceptid have the same number of digits, the operation could be ambiguous. Replace the conceptid with more digits first to avoid that - ORDER BY conceptid DESC does that - and make sure that replacement strings do not introduce ambiguity (numbers that might be replaced in the next step). Related answer with more on these pitfalls:
Replace a string with another string from a list depending on the value
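The ordering pitfall is easy to reproduce outside the database. The Python sketch below (not the SQL from this answer) applies plain string replacements in sequence, once longest-first and once shortest-first; the function name replace_ids and the short id 7043 are made up for the demonstration.

```python
def replace_ids(expression, terms, longest_first=True):
    """Apply plain string replacements one id at a time, in a fixed order."""
    # Sort by digit count, then by value; reverse=True puts the longest ids first.
    ids = sorted(terms, key=lambda c: (len(c), c), reverse=longest_first)
    for concept_id in ids:
        expression = expression.replace(concept_id, terms[concept_id])
    return expression

terms = {"7043": "short", "704347000": "Observes (attribute)"}
good = replace_ids("704347000=7043", terms)                      # longest id first
bad = replace_ids("704347000=7043", terms, longest_first=False)  # shortest id first
```

Longest-first yields 'Observes (attribute)=short', whereas shortest-first corrupts the long id into 'short47000=short'. Sequential replacement can still go wrong if a replacement text itself contains digits that match a later id.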
The token $1 is used two different ways here, don't be misled:
regexp_split_to_table($1, '\D+')
This one references the first function parameter _expr. You could as well use the parameter name.
|| '$1,'
This concatenates into the SQL string a reference to the first expression passed via the USING clause to EXECUTE. Parameters of the outer function are not visible inside EXECUTE; you have to pass them explicitly.
It's pure coincidence that $1 (_expr) of the outer function is passed as $1 to EXECUTE. Might as well hand over $7 as third expression in the USING clause ($3) ...
I added a debug function to the fiddle. With a minor modification you can output the generated SQL string to inspect it.
SQL function
Here is a pure SQL alternative. Probably also faster:
CREATE OR REPLACE FUNCTION f_make_expression_sql(_expr text)
RETURNS text AS
$func$
SELECT string_agg(CASE WHEN $1 ~ '^\d'
THEN txt || COALESCE(v.term, t.conceptid)
ELSE COALESCE(v.term, t.conceptid) || txt END
, '' ORDER BY rn) AS result
FROM (
SELECT *, row_number() OVER () AS rn
FROM (
SELECT regexp_split_to_table($1, '\D+') conceptid
, regexp_split_to_table($1, '\d+') txt
) sub
) t
LEFT JOIN tbl_values v ON v.conceptid = NULLIF(t.conceptid, '')::int
$func$ LANGUAGE sql STABLE;
In Postgres 9.4 this can be much more elegant with two new features:
ROWS FROM to replace the old (weird) technique to sync set-returning functions
WITH ORDINALITY to get row numbers on the fly reliably:
PostgreSQL unnest() with element number
CREATE OR REPLACE FUNCTION f_make_expression_sql(_expr text)
RETURNS text AS
$func$
SELECT string_agg(CASE WHEN $1 ~ '^\d'
THEN txt || COALESCE(v.term, t.conceptid)
ELSE COALESCE(v.term, t.conceptid) || txt END
, '' ORDER BY rn) AS result
FROM ROWS FROM (
regexp_split_to_table($1, '\D+')
, regexp_split_to_table($1, '\d+')
) WITH ORDINALITY AS t(conceptid, txt, rn)
LEFT JOIN tbl_values v ON v.conceptid = NULLIF(t.conceptid, '')::int
$func$ LANGUAGE sql STABLE;
SQL Fiddle demonstrating all for Postgres 9.3.
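The idea behind the pure SQL variant (split the string into digit runs and the text between them, look each number up once, keep unknown numbers as they are) corresponds to a single-pass substitution. A minimal Python sketch of that logic, with make_expression as a hypothetical name and the lookup table held in a dict keyed by integer:

```python
import re

def make_expression(expression, terms):
    """Replace every run of digits with its term; unknown numbers stay unchanged."""
    def lookup(match):
        number = match.group()
        # Fall back to the original digits when the id is not in the lookup table.
        return terms.get(int(number), number)
    return re.sub(r"\d+", lookup, expression)

terms = {
    386053000: "Patient evaluation procedure (procedure)",
    363787002: "Observable entity (observable entity)",
    704347000: "Observes (attribute)",
}
result = make_expression("363787002:70434600=(386053000:704347000=259470008)", terms)
```

Because each number is matched as a whole and replaced exactly once, ids of different lengths cannot clash and replacement text is never scanned again.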
There's also another way, without creating functions: using "WITH RECURSIVE". I have used it with a lookup table of thousands of rows.
You'll need to change the following table and column names to your own:
tbl_main, strsourcetext, strreplacedtext;
lookuptable, strreplacefrom, strreplaceto.
WITH RECURSIVE replaced AS (
(SELECT
strsourcetext,
strreplacedtext,
array_agg(strreplacefrom ORDER BY length(strreplacefrom) DESC, strreplacefrom, strreplaceto) AS arrreplacefrom,
array_agg(strreplaceto ORDER BY length(strreplacefrom) DESC, strreplacefrom, strreplaceto) AS arrreplaceto,
count(1) AS intcount,
1 AS intindex
FROM tbl_main, lookuptable WHERE tbl_main.strsourcetext LIKE '%' || strreplacefrom || '%'
GROUP BY strsourcetext)
UNION ALL
SELECT
strsourcetext,
replace(strreplacedtext, arrreplacefrom[intindex], arrreplaceto[intindex]) AS strreplacedtext,
arrreplacefrom,
arrreplaceto,
intcount,
intindex+1 AS intindex
FROM replaced WHERE intindex<=intcount
)
SELECT strsourcetext,
(array_agg(strreplacedtext ORDER BY intindex DESC))[1] AS strreplacedtext
FROM replaced
GROUP BY strsourcetext