How to implement a function like regexp_split_to_table() in Trino?

Hi everyone,
I am new to Trino and cannot find a function like regexp_split_to_table() from Greenplum or PostgreSQL. How can I get the same result in Trino? For example:
select regexp_split_to_table( sensor_type, E',+' ) as type from hydrology.device_info;

Trino has a regexp_split(string, pattern) function that returns an array, which you can then unnest.
Demo:
select s.str as original_str, u.str as exploded_value
from
(select 'one,two,,,three' as str)s
cross join unnest(regexp_split(s.str,',+')) as u(str)
Result:
original_str     exploded_value
one,two,,,three  one
one,two,,,three  two
one,two,,,three  three
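
For readers who want to see the row-level logic outside SQL, here is a minimal Python sketch of what regexp_split followed by unnest does to each input row. It only models the semantics: `explode_rows` is a hypothetical helper, and Python's `re.split` stands in for Trino's regexp_split, so edge cases such as leading or trailing delimiters may behave differently in Trino.

```python
import re

def explode_rows(values, pattern=r",+"):
    """Yield (original, part) pairs, one per split element,
    roughly what CROSS JOIN UNNEST(regexp_split(...)) produces."""
    for original in values:
        for part in re.split(pattern, original):
            yield original, part

rows = list(explode_rows(["one,two,,,three"]))
for original, part in rows:
    print(original, part)
```

Because the pattern is `,+`, the run of three commas counts as a single delimiter, so no empty values appear between "two" and "three".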

Related

How to flatten nested array data into row in bigquery

I am trying to flatten inside_array, a sub-array of nested array data, into table rows.
I am able to flatten array_data, which is the outer array.
Does anybody have a suggestion? Thanks in advance.
#standardSQL
SELECT ...
FROM `project.dataset.table`,
UNNEST(array_data) AS array_data_rec,
UNNEST(array_data_rec.inside_array) AS inside_array_rec
To handle the case where inside_array is empty ("no data inside the inside_array"), use LEFT JOIN instead, as in the example below:
#standardSQL
SELECT ...
FROM `project.dataset.table`,
UNNEST(array_data) AS array_data_rec
LEFT JOIN UNNEST(array_data_rec.inside_array) AS inside_array_rec
You can do the following:
...
FROM
AA.nested_array,
UNNEST(array_data) as array_data,
UNNEST(array_data.inside_array) as array_data_inside_array
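
To make the difference between the comma join and the LEFT JOIN variant concrete, here is a small Python sketch that models both on an in-memory structure. It is not BigQuery code: `unnest_inner` and the `name` field are hypothetical, made up for illustration, and the sketch assumes each outer element is a record holding an `inside_array` list.

```python
def unnest_inner(rows, keep_empty=False):
    """Flatten rows shaped like {'array_data': [{'name': ..., 'inside_array': [...]}]}.

    keep_empty=False models the comma (cross) join: an outer element whose
    inside_array is empty or missing produces no output row.
    keep_empty=True models LEFT JOIN UNNEST: it produces one row with None.
    """
    out = []
    for row in rows:
        for rec in row.get("array_data") or []:
            inner = rec.get("inside_array") or []
            if not inner and keep_empty:
                out.append((rec["name"], None))
            for item in inner:
                out.append((rec["name"], item))
    return out

table = [{"array_data": [{"name": "a", "inside_array": [1, 2]},
                         {"name": "b", "inside_array": []}]}]

cross = unnest_inner(table)
left = unnest_inner(table, keep_empty=True)
print(cross)
print(left)
```

In this model the element "b" disappears from the cross join output and survives, paired with None, in the left join output.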

Issue with array_agg method when aggregating arrays of different lengths

Here is a lateral query which is part of a bigger query:
lateral (
select
array_agg(
sh.dogsfilters
) filter (
where
sh.dogsfilters is not null
) as dependencyOfFoods
from
shelter sh
where
sh.shelterid = ${shelterid}
) filtersOfAnimals,
The problem is that array_agg fails when the input arrays have different lengths, such as [[7, 9], [7, 9, 8], [8]].
This would be easy to solve using json_agg, but later in the query there is an any check like this:
...
where
cd.dogsid = any(filtersOfAnimals.dependencyOfFoods)
and
...
...
But any does not work on JSON data produced by json_agg, so I can't use json_agg in place of array_agg.
What might be a better solution to this?
Unnest the arrays and re-aggregate:
lateral
(select array_agg(dogfilter) filter (where dogfilter is not null) as dependencyOfFoods
from shelter sh cross join
unnest(sh.dogsfilters) dogfilter
where sh.shelterid = ${shelterid}
) filtersOfAnimals,
It is interesting that Postgres doesn't have a function that does this. BigQuery offers array_concat_agg() which does exactly what you want.
It is ugly, but it works:
regexp_split_to_array(
array_to_string(
array_agg(
array_to_string(value,',')
),','
),',')::integer[]
I don't know whether this is a valid solution from a performance point of view ...
In PostgreSQL, you can define your own aggregates. I think that this one does what you want:
create function array_concat_agg_tran(anyarray,anyarray) returns anyarray language sql
as $$ select $1||$2 $$;
create aggregate array_concat_agg(anyarray) (sfunc=array_concat_agg_tran, stype=anyarray);
Then:
select array_concat_agg(x) from (values (ARRAY[1,2]),(ARRAY[3,4,5])) f(x);
array_concat_agg
------------------
{1,2,3,4,5}
With a bit more work, you could make it parallelizable as well.
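
As a rough model of what the unnest-and-re-aggregate query and the custom aggregate compute, here is a short Python sketch. It only illustrates the intended result on the sample values from the question; `array_concat_agg` here is a plain Python function written for this example, not the PostgreSQL aggregate, and skipping None inputs is an assumption about how NULL arrays should be treated.

```python
from itertools import chain

def array_concat_agg(arrays):
    """Concatenate arrays of any length into one flat list,
    skipping None (standing in for SQL NULL) inputs."""
    return list(chain.from_iterable(a for a in arrays if a is not None))

dogsfilters = [[7, 9], None, [7, 9, 8], [8]]
dependency_of_foods = array_concat_agg(dogsfilters)
print(dependency_of_foods)

# Rough equivalent of: cd.dogsid = any(filtersOfAnimals.dependencyOfFoods)
print(9 in dependency_of_foods)
```

The flat result is what makes the later any check possible, since membership is tested against individual elements rather than against sub-arrays.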

explode function in hive

I have the following sample data and I am trying to explode it in Hive. I used split, but I know I am missing something.
["[[-80.742426,35.23248],[-80.740424,35.23184],[-80.739583,35.231562],[-80.735935,35.23041],[-80.728624,35.228069],[-80.727753,35.227836],[-80.727294,35.227741],[-80.726762,35.227647],[-80.726321,35.227594],[-80.725687,35.227544],[-80.725134,35.227535],[-80.721502,35.227615],[-80.691298,35.216202],[-80.688009,35.215396],[-80.686516,35.215016],[-80.598433,35.234307]]"]
I used the query below:
select explode(split(col, ',')) from sample2;
and the result is this:
["[[-80.742426
35.23248]
[-80.740424
35.23184]
[-80.739583
35.231562]
[-80.735935
35.23041]
[-80.728624
35.228069]
[-80.727753
35.227836]
[-80.71143
35.227831]
[-80.711007
35.227795]
[-80.710638
35.227741]
[-80.673884
35.21014]
[-80.672358
35.209481]
[-80.672036
35.209356]
[-80.671686
35.209234]
[-80.67124
35.209099]
[-80.670815
35.209006]
[-80.670267
35.208906]
[-80.669612
35.208833]
[-80.668924
35.208806]
[-80.598433
35.234307]]"]
I need it in the format below:
[-80.742426,35.23248]
[-80.740424,35.23184]
[-80.739583,35.231562]
[-80.735935,35.23041]
[-80.728624,35.228069]
[-80.727753,35.227836]
[-80.727294,35.227741]
[-80.726762,35.227647]
[-80.726321,35.227594]
[-80.725687,35.227544]
[-80.725134,35.227535]
[-80.721502,35.227615]
[-80.691298,35.216202]
[-80.688009,35.215396]
[-80.686516,35.215016]
[-80.684281,35.214466]
[-80.68396,35.214395]
[-80.683375,35.214231]
[-80.682908,35.214079]
[-80.682444,35.213905]
[-80.682045,35.213733]
[-80.68062,35.213112]
[-80.678078,35.211983]
[-80.676836,35.211447]
[-80.598433,35.234307]
Any help here?
Your data is an array of arrays and you want to explode it at the first level only, so use LATERAL VIEW explode(colname).
Below is the SELECT query with explode():
SELECT col1 FROM sample2 LATERAL VIEW EXPLODE(col) explodeVal AS col1;
Output generated from your input data set:
[-80.742426,35.23248]
[-80.740424,35.23184]
[-80.739583,35.231562]
[-80.735935,35.23041]
[-80.728624,35.228069]
[-80.727753,35.227836]
[-80.727294,35.227741]
[-80.726762,35.227647]
[-80.726321,35.227594]
[-80.725687,35.227544]
[-80.725134,35.227535]
[-80.721502,35.227615]
[-80.691298,35.216202]
[-80.688009,35.215396]
[-80.686516,35.215016]
[-80.684281,35.214466]
[-80.68396,35.214395]
[-80.683375,35.214231]
[-80.682908,35.214079]
[-80.682444,35.213905]
[-80.682045,35.213733]
[-80.68062,35.213112]
[-80.678078,35.211983]
[-80.676836,35.211447]
[-80.598433,35.234307]
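
The split(col, ',') attempt breaks because commas appear both between pairs and inside each pair. The Python sketch below shows the first-level explode on a shortened copy of the sample. It rests on an assumption: that the value is text shaped like the sample, a JSON array holding one JSON-encoded string. If the Hive column is already an array of arrays, the LATERAL VIEW query above applies directly and no parsing is needed. `explode_pairs` is a hypothetical helper.

```python
import json

# Shortened copy of the sample value from the question
raw = '["[[-80.742426,35.23248],[-80.740424,35.23184],[-80.739583,35.231562]]"]'

def explode_pairs(text):
    """Return one [lon, lat] pair per row from the doubly encoded value."""
    rows = []
    for encoded in json.loads(text):       # outer array of strings
        rows.extend(json.loads(encoded))   # inner array of coordinate pairs
    return rows

pairs = explode_pairs(raw)
for pair in pairs:
    print(pair)
```

Parsing the structure first and then iterating one level deep keeps each coordinate pair intact, which is the behaviour the question asks for.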

H2DB WITH clause

I'm writing a unit test for a method with the following sql
WITH temptab(
i__id , i__name, i__effective, i__expires, i__lefttag, i__righttag,
hier_id, hier_dim_id, parent_item_id, parent_hier_id, parent_dim_id,
ancestor, h__id, h__name, h__level, h__effective, h__expires, rec_lvl)
AS (
SELECT
item.id as i__id,
item.name as i__name,
item.effectivets as i__effective,
item.expirests as i__expires,
item.lefttag as i__lefttag,
item.righttag as i__righttag,
hier_id, hier_dim_id,
parent_item_id,
parent_hier_id,
parent_dim_id, 1 as ancestor,
hier.id as h__id, hier.name as h__name,
hier.level as h__level, hier.effectivets as h__effective,
hier.expirests as h__expires, 1 as rec_lvl FROM metro.item item,
metro.hierarchy hier WHERE item.id = 'DI' AND hier_id = '69' AND hier_dim_id= '36' AND hier.id =item.hier_id
)
SELECT
i__id, i__name, i__effective, i__expires, i__lefttag,
i__righttag, hier_id, hier_dim_id, parent_item_id,
parent_hier_id, parent_dim_id, ancestor,
h__id, h__name, h__level, h__effective, h__expires
FROM temptab
This query returns an empty dataset, but I expect 1 row.
The data is correct, as a similar simple query without the WITH clause works fine.
I investigated the problem and found the question "Sub Query with WITH-CLAUSE in H2DB", but that solution did not help.
So, does anyone know how H2 supports the WITH clause?
Thanks in advance for your time.
According to the H2 database grammar, it looks like the WITH clause is not supported in H2, except for experimental support for recursive queries (see "h2 recursive queries").
It's supported now, for non-recursive queries as well: http://www.h2database.com/html/grammar.html

bigquery url decode

Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:
http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
I extract the "url" parameter like so:
SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url
from [mydataset.mytable]
which gives me:
http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345
What I would like to do is something like:
SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url
from [mydataset.mytable]
thereby returning:
http://www.example.com/hello?v=12345
I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.
Ideas?
The query below is built on top of @sigpwned's answer, slightly refactored and wrapped in a SQL UDF (which does not have the limitations of a JS UDF, so it is safe to use):
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
SELECT
column_name,
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`
It can be tested with the example from the question, as below:
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
WITH `project.dataset.table` AS (
SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
column_name
FROM `project.dataset.table`
with result:
Row  url                                   column_name
1    http://www.example.com/hello?v=12345  http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
Update: a further optimized SQL UDF:
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
It's a good feature request, but currently there is no built-in BigQuery function that provides URL decoding.
One more workaround is using a user-defined function.
#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
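try {
  // decodeURIComponent also decodes reserved characters such as %3A and %2F,
  // which decodeURI leaves untouched
  return decodeURIComponent(enc);
} catch (e) { return null; }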
""";
SELECT ven_session,
URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327
WHERE para like '%&kw=%'
LIMIT 10
I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:
SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b'')) FROM (SELECT
id,
ARRAY_AGG(CASE
WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
ELSE CAST(y AS bytes)
END ORDER BY i) AS ps
FROM (SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x) AS x
CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i GROUP BY id);
In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as the input. It should work with your input, too.
Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.
EDIT: @MikhailBerylyant's answer has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!
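
The SQL answers above all share one idea: split the input into percent-escapes and literal runs, turn everything into bytes, concatenate, and decode as UTF-8 only once, so that multi-byte characters survive. Here is a minimal Python sketch of that same logic, for checking expected outputs outside BigQuery. `url_decode` is a hypothetical helper. Using `errors="replace"` is meant to approximate SAFE_CONVERT_BYTES_TO_STRING, which substitutes U+FFFD for invalid UTF-8.

```python
import re
from urllib.parse import unquote

# Same token pattern as in the SQL UDFs: one %XX escape or a run of non-% text
TOKEN = re.compile(r"%[0-9a-fA-F]{2}|[^%]+")

def url_decode(url):
    """Decode percent-escapes by collecting bytes, then decoding once as UTF-8."""
    chunks = []
    for token in TOKEN.findall(url):
        if token.startswith("%"):
            chunks.append(bytes.fromhex(token[1:]))  # '%c5' -> b'\xc5'
        else:
            chunks.append(token.encode("utf-8"))     # literal text as bytes
    return b"".join(chunks).decode("utf-8", errors="replace")

samples = [
    "http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345",
    "gabu%c5%82t%c3%b3w",
    "domodossola%e2%80%93locarno railway",
]
for s in samples:
    print(url_decode(s))
```

For these inputs the sketch agrees with Python's standard `urllib.parse.unquote`, which is a convenient reference when validating the BigQuery results.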