How to use regexp_matches() in an UPDATE statement? - sql

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:
<u><font color="#0000FF">VA Lidar</font></u> OR <u><font color="#0000FF">InPort Metadata</font></u>
I would like to update the column by keeping only the html links, and separating them with a coma if there are more than one. Ideally I would do something like this:
UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');
But unfortunately this returns an error in Postgres 10:
ERROR: set-returning functions are not allowed in UPDATE
I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?

Notes
1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.
2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
3.
I use a regular expression without lookahead and lookbehind constraints and parentheses instead: href="([^"]+)
See query 1.
This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():
If a match is found, and the pattern contains no parenthesized
subexpressions, then the result is a single-element text array
containing the substring matching the whole pattern. If a match is
found, and the *pattern* contains parenthesized subexpressions, then the
result is a text array whose n'th element is the substring matching
the n'th parenthesized subexpression of the pattern
And for regexp_matches():
This function returns no rows if there is no match, one row if there
is a match and the g flag is not given, or N rows if there are N
matches and the g flag is given. Each returned row is a text array
containing the whole matched substring or the substrings matching
parenthesized subexpressions of the pattern, just as described above
for regexp_match.
4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). Does not occur with this regexp, every array in the result holds a single element. But future readers shall not be lead into a trap:
When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) that produces a 2-D array - which is only even possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
However, quoting the manual:
inputs must all have same dimensionality, and cannot be empty or NULL
I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:
Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.
Each approach works here, too.
Query 1
UPDATE tbl t
SET col = (
SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
);
Columns with no match are set to '' (empty string).
Query 2
UPDATE tbl
SET col = (
SELECT string_agg(t.arr[1], ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
);
Columns with no match are set to NULL.
Query 3
UPDATE tbl
SET col = (
SELECT string_agg(elem, ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
, unnest(t.arr) elem
);
Columns with no match are set to NULL.
db<>fiddle here (with extended test case)

You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:
update mytable
set column = (
select array_to_string(array_agg(x), ',')
from (
select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
from t t2
where t2.id = t.id
) dt(x)
)
You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.

Building on the approach of mu is too short with slightly different regex and a COALESCE function to retain values that do not contain href-links:
UPDATE a
SET bad_data = COALESCE(
(SELECT Array_to_string(Array_agg(x), ',')
FROM (SELECT Regexp_matches(a.bad_data,
'(?<=href=")[^"]+', 'g'
) AS x
FROM a a2
WHERE a2.id = a.id) AS sub), bad_data
);
SQL Fiddle

Related

count number of times a regex pattern occurs in hive

I have a string variable stored in hive as follows
stringvar
AA1,BB3,CD4
AA12,XJ5
I would like to count (and filter on) how many times the regex pattern \w\w\d occurs. In the example, in the first row there are obviously three such examples. How can I do that without resorting to lateral views and explosions of stringvar (too expensive)?
Thanks!
You can split string by pattern and calculate size of result array - 1.
Demo:
select size(split('AA1,BB3,CD4','\\w\\w\\d'))-1 --returns 3
select size(split('AA12,XJ5','\\w\\w\\d'))-1 --returns 2
select size(split('AAxx,XJx','\\w\\w\\d'))-1 --returns 0
select size(split('','\\w\\w\\d'))-1 --returns 0
If column is null-able than special care should be taken. For example like this (depends on what you need to be returned in case of NULL):
select case when col is null then 0
else size(split(col,'\\w\\w\\d'))-1
end
Or simply convert NULL to empty string using NVL function:
select size(split(NVL(col,''),'\\w\\w\\d'))-1
The solution above is the most flexible one, you can count the number of occurrences and use it for complex filtering/join/etc.
In case you just need to filter records with fixed number of pattern occurrences or at least fixed number and do not need to know exact count then simple RLIKE without splitting is the cheapest method.
For example check for at least 2 repeats:
select 'AA1,BB3,CD4' rlike('\\w\\w\\d+,\\w\\w\\d+') --returns true, can be used in WHERE

Get max on comma separated values in column

How to get max on comma separated values in Original_Ids column and get max value in one column and remaining ids in different column.
|Original_Ids | Max_Id| Remaining_Ids |
|123,534,243,345| 534 | 123,234,345 |
Upadte -
If I already have Max_id and just need below equation?
Remaining_Ids = Original_Ids - Max_id
Thanks
Thanks to the excellent possibilities of array manipulation in Postgres, this could be done relatively easy by converting the string to an array and from there to a set.
Then regular queries on that set are possible. With max() the maximum can be selected and with EXCEPT ALL the maximum can be removed from the set.
A set can then be converted to an array and with array_to_string() and the array can be converted to a delimited string again.
SELECT ids original_ids,
(SELECT max(un.id::integer)
FROM unnest(string_to_array(ids,
',')) un(id)) max_id,
array_to_string(ARRAY((SELECT un.id::integer
FROM unnest(string_to_array(ids,
',')) un(id)
EXCEPT ALL
SELECT max(un.id::integer)
FROM unnest(string_to_array(ids,
',')) un(id))),
',') remaining_ids
FROM elbat;
Another option would have been regexp_split_to_table() which directly produces a set (or regexp_split_to_array() but than we'd had the possible regular expression overhead and still had to convert the array to a set).
But nevertheless you just should (almost) never use delimited lists (nor arrays). Use a table, that's (almost) always the best option.
SQL Fiddle
You can use a window function (https://www.postgresql.org/docs/current/static/tutorial-window.html) to get the max element per unnested array. After that you can reaggregate the elements and remove the calculated max value from the array.
Result:
a max_elem remaining
123,534,243,345 534 123,243,345
3,23,1 23 3,17
42 42
56,123,234,345,345 345 56,123,234
This query needs only one split/unnest as well as only one max calculation.
SELECT
a,
max_elem,
array_remove(array_agg(elements), max_elem) as remaining -- C
FROM (
SELECT
*,
MAX(elements) OVER (PARTITION BY a) as max_elem -- B
FROM (
SELECT
a,
unnest((string_to_array(a, ','))::int[]) as elements -- A
FROM arrays
)s
)s
GROUP BY a, max_elem
A: string_to_array converts the string list into an array. Because the arrays are treated as string arrays you need the cast them into integer arrays by adding ::int[]. The unnest() expands all array elements into own rows.
B: window function MAX gives the maximum value of the single arrays as max_elem
C: array_agg reaggregates the elements through the GROUP BY id. After that array_remove removes the max_elem value from the array.
If you do not like to store them as pure arrays but as string list again you could add array_to_string. But I wouldn't recommend this because your data are integer arrays and not strings. For every further calculation you would need this string cast. A even better way (as already stated by #stickybit) is not to store the elements as arrays but as unnested data. As you can see in nearly every operation should would do the unnest before.
Note:
It would be better to use an ID to adress the columns/arrays instead of the origin string as in SQL Fiddle with IDs
If you install the extension intarray this is quite easy.
First you need to create the extension (you have to be superuser to do that):
create extension intarray;
Then you can do the following:
select original_ids,
original_ids[1] as max_id,
sort(original_ids - original_ids[1]) as remaining_ids
from (
select sort_desc(string_to_array(original_ids,',')::int[]) as original_ids
from bad_design
) t
But you shouldn't be storing comma separated values to begin with

Postgres query to calculate matching strings

I have following table:
id description additional_info
123 XYZ XYD
And an array as:
[{It is known to be XYZ},{It is know to be none},{It is know to be XYD}]
I need to map both the content in such a way that for every record of table I'm able to define the number of successful match.
The result of the above example will be:
id RID Matches
1 123 2
Only the content at position 0 and 2 match the record's description/additional_info so Matches is 2 in the result.
I am struggling to transform this to a query in Postgres - dynamic SQL to create a VIEW in a PL/pgSQL function to be precise.
It's undefined how to deal with array elements that match both description and additional_info at the same time. I'll assume you want to count that as 1 match.
It's also undefined where id = 1 comes from in the result.
One way is to unnest() the array and LEFT JOIN the main table to each element on a match on either of the two columns:
SELECT 1 AS id, t.id AS "RID", count(a.txt) AS "Matches"
FROM tbl t
LEFT JOIN unnest(my_arr) AS a(txt) ON a.txt ~ t.description
OR a.txt ~ t.additional_info
GROUP BY t.id;
I use a regular expression for the match. Special characters like (.\?) etc. in the strings to the right have special meaning. You might have to escape those if possible.
Addressing your comment
You should have mentioned that you are using a plpgsql function with EXECUTE. Probably 2 errors:
The variable array_content is not visible inside EXECUTE, you need to pass the value with a USING clause - or concatenate it as string literal in a CREATE VIEW statement which does not allow parameters.
Missing single quotes around the string 'brand_relevance_calculation_‌​view'. It's still a string literal before you concatenate it as identifier. You did good to use format() with %I there.
Demo:
DO
$do$
DECLARE
array_content varchar[]:= '{FREE,DAY}';
BEGIN
EXECUTE format('
CREATE VIEW %I AS
SELECT id, description, additional_info, name, count(a.text) AS business_objectives
, multi_city, category IS NOT NULL AS category
FROM initial_events i
LEFT JOIN unnest(%L::varchar[]) AS a(text) ON a.text ~ i.description
OR a.text ~ i.additional_info'
, 'brand_relevance_calculation_‌​view', array_content);
END
$do$;

How to use ANY instead of IN in a WHERE clause?

I used to have a query like in Rails:
MyModel.where(id: ids)
Which generates sql query like:
SELECT "my_models".* FROM "my_models"
WHERE "my_models"."id" IN (1, 28, 7, 8, 12)
Now I want to change this to use ANY instead of IN. I created this:
MyModel.where("id = ANY(VALUES(#{ids.join '),('}))"
Now when I use empty array ids = [] I get the folowing error:
MyModel Load (53.0ms) SELECT "my_models".* FROM "my_models" WHERE (id = ANY(VALUES()))
ActiveRecord::JDBCError: org.postgresql.util.PSQLException: ERROR: syntax error at or near ")"
ActiveRecord::StatementInvalid: ActiveRecord::JDBCError: org.postgresql.util.PSQLException: ERROR: syntax error at or near ")"
Position: 75: SELECT "social_messages".* FROM "social_messages" WHERE (id = ANY(VALUES()))
from arjdbc/jdbc/RubyJdbcConnection.java:838:in `execute_query'
There are two variants of IN expressions:
expression IN (subquery)
expression IN (value [, ...])
Similarly, two variants with the ANY construct:
expression operator ANY (subquery)
expression operator ANY (array expression)
A subquery works for either technique, but for the second form of each, IN expects a list of values (as defined in standard SQL) while = ANY expects an array.
Which to use?
ANY is a later, more versatile addition, it can be combined with any binary operator returning a boolean value. IN burns down to a special case of ANY. In fact, its second form is rewritten internally:
IN is rewritten with = ANY
NOT IN is rewritten with <> ALL
Check the EXPLAIN output for any query to see for yourself. This proves two things:
IN can never be faster than = ANY.
= ANY is not going to be substantially faster.
The choice should be decided by what's easier to provide: a list of values or an array (possibly as array literal - a single value).
If the IDs you are going to pass come from within the DB anyway, it is much more efficient to select them directly (subquery) or integrate the source table into the query with a JOIN (like #mu commented).
To pass a long list of values from your client and get the best performance, use an array, unnest() and join, or provide it as table expression using VALUES (like #PinnyM commented). But note that a JOIN preserves possible duplicates in the provided array / set while IN or = ANY do not. More:
Optimizing a Postgres query with a large IN
In the presence of NULL values, NOT IN is often the wrong choice and NOT EXISTS would be right (and faster, too):
Select rows which are not present in other table
Syntax for = ANY
For the array expression Postgres accepts:
an array constructor (array is constructed from a list of values on the Postgres side) of the form: ARRAY[1,2,3]
or an array literal of the form '{1,2,3}'.
To avoid invalid type casts, you can cast explicitly:
ARRAY[1,2,3]::numeric[]
'{1,2,3}'::bigint[]
Related:
PostgreSQL: Issue with passing array to procedure
How to pass custom type array to Postgres function
Or you could create a Postgres function taking a VARIADIC parameter, which takes individual arguments and forms an array from them:
Passing multiple values in single parameter
How to pass the array from Ruby?
Assuming id to be integer:
MyModel.where('id = ANY(ARRAY[?]::int[])', ids.map { |i| i})
But I am just dabbling in Ruby. #mu provides detailed instructions in this related answer:
Sending array of values to a sql query in ruby?

Check if value exists in Postgres array

Using Postgres 9.0, I need a way to test if a value exists in a given array. So far I came up with something like this:
select '{1,2,3}'::int[] #> (ARRAY[]::int[] || value_variable::int)
But I keep thinking there should be a simpler way to this, I just can't see it. This seems better:
select '{1,2,3}'::int[] #> ARRAY[value_variable::int]
I believe it will suffice. But if you have other ways to do it, please share!
Simpler with the ANY construct:
SELECT value_variable = ANY ('{1,2,3}'::int[])
The right operand of ANY (between parentheses) can either be a set (result of a subquery, for instance) or an array. There are several ways to use it:
SQLAlchemy: how to filter on PgArray column types?
IN vs ANY operator in PostgreSQL
Important difference: Array operators (<#, #>, && et al.) expect array types as operands and support GIN or GiST indices in the standard distribution of PostgreSQL, while the ANY construct expects an element type as left operand and can be supported with a plain B-tree index (with the indexed expression to the left of the operator, not the other way round like it seems to be in your example). Example:
Index for finding an element in a JSON array
None of this works for NULL elements. To test for NULL:
Check if NULL exists in Postgres array
Watch out for the trap I got into: When checking if certain value is not present in an array, you shouldn't do:
SELECT value_variable != ANY('{1,2,3}'::int[])
but use
SELECT value_variable != ALL('{1,2,3}'::int[])
instead.
but if you have other ways to do it please share.
You can compare two arrays. If any of the values in the left array overlap the values in the right array, then it returns true. It's kind of hackish, but it works.
SELECT '{1}' && '{1,2,3}'::int[]; -- true
SELECT '{1,4}' && '{1,2,3}'::int[]; -- true
SELECT '{4}' && '{1,2,3}'::int[]; -- false
In the first and second query, value 1 is in the right array
Notice that the second query is true, even though the value 4 is not contained in the right array
For the third query, no values in the left array (i.e., 4) are in the right array, so it returns false
unnest can be used as well.
It expands array to a set of rows and then simply checking a value exists or not is as simple as using IN or NOT IN.
e.g.
id => uuid
exception_list_ids => uuid[]
select * from table where id NOT IN (select unnest(exception_list_ids) from table2)
Hi that one works fine for me, maybe useful for someone
select * from your_table where array_column ::text ilike ANY (ARRAY['%text_to_search%'::text]);
"Any" works well. Just make sure that the any keyword is on the right side of the equal to sign i.e. is present after the equal to sign.
Below statement will throw error: ERROR: syntax error at or near "any"
select 1 where any('{hello}'::text[]) = 'hello';
Whereas below example works fine
select 1 where 'hello' = any('{hello}'::text[]);
When looking for the existence of a element in an array, proper casting is required to pass the SQL parser of postgres. Here is one example query using array contains operator in the join clause:
For simplicity I only list the relevant part:
table1 other_name text[]; -- is an array of text
The join part of SQL shown
from table1 t1 join table2 t2 on t1.other_name::text[] #> ARRAY[t2.panel::text]
The following also works
on t2.panel = ANY(t1.other_name)
I am just guessing that the extra casting is required because the parse does not have to fetch the table definition to figure the exact type of the column. Others please comment on this.