Wildcard search for array<string> in Athena - sql

I have a table in Athena where one of the columns is of type array.
I tried the below query to get output containing earth but doesn't work.
How do I perform a wildcard search in this column?
Expected output after wildcard search:
select * from mytable
where contains(myarr,'eart%');

This is from memory, so it might need a bit of tweaking, but you can use a filter on the array elements
where cardinality(filter(myarr, q -> q like 'eart%')) > 0
filter creates an array of matches and cardinality tests for one or more elements in the array

Related

How to get value string with regexp in bigquery

Hi i have string in BigQuery column like this
cancellation_amount: 602000
after_cancellation_transaction_amount: 144500
refund_time: '2022-07-31T06:05:55.215203Z'
cancellation_amount: 144500
after_cancellation_transaction_amount: 0
refund_time: '2022-08-01T01:22:45.94919Z'
i already using this logic to get cancellation_amount
regexp_extract(file,r'.*cancellation_amount:\s*([^\n\r]*)')
but the output only amount 602000, i need the output 602000 and 144500 become different column
Appreciate for helping
If your lines in the input (which will eventually become columns) are fixed you can use multiple regexp_extracts to get all the values.
SELECT
regexp_extract(file,r'cancellation_amount:\s*([^\n\r]*)') as cancellation_amount
regexp_extract(file,r'. after_cancellation_transaction_amount:\s*([^\n\r]*)') as after_cancellation_transaction_amount
FROM table_name
One issue I found with your regex expression is that .*cancellation_amount won't match after_cancellation_transaction_amount.
There is also a function called regexp_extract_all which returns all the matches as an array which you can later explode into columns, but if you have finite values separating them out in different columns would be a easier.

How to access nested arrays and JSON in AWS Athena

I'm trying to process some data from s3 logs in Athena that has a complex type I cannot figure out how to work with.
I have a table with rows such as:
data
____
"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"
I'd like to treat it as (1) an array to extract the first element, and then that first element as the JSON that it is.
Everything is confused because the data naturally is a string, that contains an array, that contains json and I don't even know where to start
You can use the following combination of JSON commands:
SELECT
JSON_EXTRACT_SCALAR(
JSON_EXTRACT_SCALAR('"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"','$'),
'$[0].k1'
)
The inner JSON_EXTRACT_SCALAR will return the JSON ARRAY [{"k1":"value1", "key2":"value2"...}] and the outer will return the relevant value value1
Another similar option is to use CAST(JSON :
SELECT
JSON_EXTRACT_SCALAR(
CAST(JSON '"[{\"k1\":\"value1\", \"key2\":\"value2\"...}]"' as VARCHAR),
'$[0].k1'
)

How to use regexp_matches() in an UPDATE statement?

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:
<u><font color="#0000FF">VA Lidar</font></u> OR <u><font color="#0000FF">InPort Metadata</font></u>
I would like to update the column by keeping only the html links, and separating them with a coma if there are more than one. Ideally I would do something like this:
UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');
But unfortunately this returns an error in Postgres 10:
ERROR: set-returning functions are not allowed in UPDATE
I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?
Notes
1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.
2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
3.
I use a regular expression without lookahead and lookbehind constraints and parentheses instead: href="([^"]+)
See query 1.
This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():
If a match is found, and the pattern contains no parenthesized
subexpressions, then the result is a single-element text array
containing the substring matching the whole pattern. If a match is
found, and the *pattern* contains parenthesized subexpressions, then the
result is a text array whose n'th element is the substring matching
the n'th parenthesized subexpression of the pattern
And for regexp_matches():
This function returns no rows if there is no match, one row if there
is a match and the g flag is not given, or N rows if there are N
matches and the g flag is given. Each returned row is a text array
containing the whole matched substring or the substrings matching
parenthesized subexpressions of the pattern, just as described above
for regexp_match.
4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). Does not occur with this regexp, every array in the result holds a single element. But future readers shall not be lead into a trap:
When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) that produces a 2-D array - which is only even possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
However, quoting the manual:
inputs must all have same dimensionality, and cannot be empty or NULL
I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:
Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.
Each approach works here, too.
Query 1
UPDATE tbl t
SET col = (
SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
);
Columns with no match are set to '' (empty string).
Query 2
UPDATE tbl
SET col = (
SELECT string_agg(t.arr[1], ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
);
Columns with no match are set to NULL.
Query 3
UPDATE tbl
SET col = (
SELECT string_agg(elem, ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
, unnest(t.arr) elem
);
Columns with no match are set to NULL.
db<>fiddle here (with extended test case)
You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:
update mytable
set column = (
select array_to_string(array_agg(x), ',')
from (
select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
from t t2
where t2.id = t.id
) dt(x)
)
You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.
Building on the approach of mu is too short with slightly different regex and a COALESCE function to retain values that do not contain href-links:
UPDATE a
SET bad_data = COALESCE(
(SELECT Array_to_string(Array_agg(x), ',')
FROM (SELECT Regexp_matches(a.bad_data,
'(?<=href=")[^"]+', 'g'
) AS x
FROM a a2
WHERE a2.id = a.id) AS sub), bad_data
);
SQL Fiddle

Get an average value for element in column of arrays of json data in postgres

I have some data in a postgres table that is a string representation of an array of json data, like this:
[
{"UsageInfo"=>"P-1008366", "Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0},
{"Role"=>"Text", "ProjectCode"=>"", "PublicationCode"=>"", "RetailPrice"=>2},
{"Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0, "ParentItemId"=>"396487"}
]
This is is data in one cell from a single column of similar data in my database.
The datatype of this stored in the db is varchar(max).
My goal is to find the average RetailPrice of EVERY json item with "Role"=>"Abstract", including all of the json elements in the array, and all of the rows in the database.
Something like:
SELECT avg(json_extract_path_text(json_item, 'RetailPrice'))
FROM (
SELECT cast(json_items to varchar[]) as json_item
FROM my_table
WHERE json_extract_path_text(json_item, 'Role') like 'Abstract'
)
Now, obviously this particular query wouldn't work for a few reasons. Postgres doesn't let you directly convert a varchar to a varchar[]. Even after I had an array, this query would do nothing to iterate through the array. There are probably other issues with it too, but I hope it helps to clarify what it is I want to get.
Any advice on how to get the average retail price from all of these arrays of json data in the database?
It does not seem like Redshift would support the json data type per se. At least, I found nothing in the online manual.
But I found a few JSON function in the manual, which should be instrumental:
JSON_ARRAY_LENGTH
JSON_EXTRACT_ARRAY_ELEMENT_TEXT
JSON_EXTRACT_PATH_TEXT
Since generate_series() is not supported, we have to substitute for that ...
SELECT tbl_id
, round(avg((json_extract_path_text(elem, 'RetailPrice'))::numeric), 2) AS avg_retail_price
FROM (
SELECT *, json_extract_array_element_text(json_items, pos) AS elem
FROM (VALUES (0),(1),(2),(3),(4),(5)) a(pos)
CROSS JOIN tbl
) sub
WHERE json_extract_path_text(elem, 'Role') = 'Abstract'
GROUP BY 1;
I substituted with a poor man's solution: A dummy table counting from 0 to n (the VALUES expression). Make sure you count up to the maximum number of possible elements in your array. If you need this on a regular basis create an actual numbers table.
Modern Postgres has much better options, like json_array_elements() to unnest a json array. Compare to your sibling question for Postgres:
Can get an average of values in a json array using postgres?
I tested in Postgres with the related operator ->>, where it works:
SQL Fiddle.

PostgreSQL query for array in any array

I need a PostgreSQL command that matches from a list of emails against an array type column.
my column is declared as this:
emails character varying(255)[] DEFAULT '{}'::character varying[]
And I need to search against it using one of many potential matches.
Normally I'd search using the IN operator like so: SELECT * FROM identities WHERE emails IN ['test#test.com']; but I can't seem to find an example of how to generate an IN query when searching against arrays.
Ideally it'd be something like this (which clearly doesn't work):
SELECT * FROM identities WHERE ('jaylan.jones#runte.name','qwe#qwe.com') IN ANY (emails);
The overlap && operator will check if there are elements in common
SELECT *
FROM identities
WHERE array['jaylan.jones#runte.name','qwe#qwe.com']::varchar[] && emails;
http://www.postgresql.org/docs/current/static/functions-array.html#ARRAY-OPERATORS-TABLE