Issue with array_agg when aggregating arrays of different lengths (SQL)

Here is a lateral query which is part of a bigger query:
lateral (
  select
    array_agg(sh.dogsfilters)
      filter (where sh.dogsfilters is not null) as dependencyOfFoods
  from shelter sh
  where sh.shelterid = ${shelterid}
) filtersOfAnimals,
The problem is with array_agg: it fails when the aggregated arrays have different lengths, e.g. [[7, 9], [7, 9, 8], [8]].
The problem is easy to solve with json_agg, but later in the query there is an any check like this:
...
where
cd.dogsid = any(filtersOfAnimals.dependencyOfFoods)
and
...
...
But any() does not work on JSON data produced by json_agg, so I can't use it in place of array_agg.
What would be a better solution to this?

Unnest the arrays and re-aggregate:
lateral (
  select array_agg(dogfilter) filter (where dogfilter is not null) as dependencyOfFoods
  from shelter sh
  cross join unnest(sh.dogsfilters) dogfilter
  where sh.shelterid = ${shelterid}
) filtersOfAnimals,
It is interesting that Postgres doesn't have a function that does this. BigQuery offers array_concat_agg() which does exactly what you want.
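The unnest-and-reaggregate step is easy to sanity-check outside SQL. Here is a rough Python analogue (the `unnest_reagg` helper is made up for illustration, not part of the query): expand every per-row array into individual elements, skip NULL rows, and collect everything into one flat array.

```python
# Conceptual sketch of unnest + array_agg (hypothetical helper, not SQL):
# flatten each row's array into elements, skipping NULL (None) rows.
def unnest_reagg(arrays):
    return [v for arr in arrays if arr is not None for v in arr]

print(unnest_reagg([[7, 9], [7, 9, 8], [8]]))  # [7, 9, 7, 9, 8, 8]
```

The arrays of differing lengths from the question pose no problem here, because the result is a flat one-dimensional array rather than a 2-D array with uniform row lengths.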

It is ugly, but it works:
regexp_split_to_array(
  array_to_string(
    array_agg(
      array_to_string(value, ',')
    ), ','
  ), ','
)::integer[]
I don't know whether this is a valid solution from a performance point of view, though.

In PostgreSQL, you can define your own aggregates. I think that this one does what you want:
create function array_concat_agg_tran(anyarray, anyarray) returns anyarray
  language sql as $$ select $1 || $2 $$;

create aggregate array_concat_agg(anyarray) (
  sfunc = array_concat_agg_tran,
  stype = anyarray
);
Then:
select array_concat_agg(x) from (values (ARRAY[1,2]),(ARRAY[3,4,5])) f(x);
array_concat_agg
------------------
{1,2,3,4,5}
With a bit more work, you could make it parallelizable as well.
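The aggregate's semantics are just a left fold with array concatenation; a quick Python analogue (the function name mirrors the SQL aggregate but is otherwise mine):

```python
from functools import reduce

# The custom aggregate is a left fold: the state array starts empty and each
# input array is concatenated onto it, mirroring sfunc = "select $1 || $2".
def array_concat_agg(rows):
    return reduce(lambda state, x: state + x, rows, [])

print(array_concat_agg([[1, 2], [3, 4, 5]]))  # [1, 2, 3, 4, 5]
```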

Related

Does Trino implement a function like regexp_split_to_table()?

I am new to Trino, and I can't find a function like regexp_split_to_table() from Greenplum or PostgreSQL. How can I achieve the same result?
select regexp_split_to_table( sensor_type, E',+' ) as type from hydrology.device_info;
There is a regexp_split(string, pattern) function that returns an array; you can unnest it.
Demo:
select s.str as original_str, u.str as exploded_value
from (select 'one,two,,,three' as str) s
cross join unnest(regexp_split(s.str, ',+')) as u(str)
Result:
original_str     exploded_value
one,two,,,three  one
one,two,,,three  two
one,two,,,three  three
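The way the ',+' pattern swallows runs of consecutive commas can be checked with an ordinary regex split; a small Python sketch (the helper name is for illustration only):

```python
import re

# Splitting on the pattern ',+' treats a run of consecutive commas as a
# single separator, so no empty values appear between them.
def split_sensor_types(s):
    return re.split(r",+", s)

print(split_sensor_types("one,two,,,three"))  # ['one', 'two', 'three']
```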

PostgreSQL: Convert SQL XML Code to SQL JSON Code

How do I convert XML SQL code to JSON SQL code?
Example:
SELECT XMLELEMENT(NAME "ORDER", XMLFOREST(PURCHASE_ORDER AS OD_NO)) AS "XMLELEMENT" FROM
TBL_SALES
Now, how do I convert this XMLELEMENT and XMLFOREST into JSON functions? Is there an equivalent of XMLELEMENT/XMLFOREST among the JSON functions?
xml:
<order><OD_NO>4524286167</OD_NO><order_date>2020-06-15</order_date><sales_office>CH</sales_office></order>
json:
{ "OD_NO": "4524286167", "order_date": "2020-06-15", "sales_office": "CH" }
Here row_to_json will do the job.
You can write your query like below:
select row_to_json(x)
from (select purchase_order "OD_NO", order_date, sales_office from tbl_sales) x
If you want to aggregate all the results into a single JSON array, use json_agg with row_to_json:
select json_agg(row_to_json(x))
from (select purchase_order "OD_NO", order_date, sales_office from tbl_sales) x
These PostgreSQL functions,
json_build_object(VARIADIC "any") and
jsonb_build_object(VARIADIC "any"),
are semantically close to XMLELEMENT and very convenient for 'embroidering' whatever complex JSON you may need. Your query might look like this:
select json_build_object(
  'OD_NO', order_number,  -- or whatever the name of the column is
  'order_date', order_date,
  'sales_office', sales_office
) as json_order
from tbl_sales;
I do not think there is an XMLFOREST equivalent, however.
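The variadic key/value pairing that json_build_object performs can be mimicked in a few lines of Python (the helper below is an illustration of the semantics, not the PostgreSQL function itself):

```python
import json

# json_build_object takes a variadic list of alternating keys and values;
# zipping even and odd positions reproduces that pairing.
def json_build_object(*args):
    return dict(zip(args[::2], args[1::2]))

order = json_build_object("OD_NO", "4524286167",
                          "order_date", "2020-06-15",
                          "sales_office", "CH")
print(json.dumps(order))
```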

H2DB WITH clause

I'm writing a unit test for a method that uses the following SQL:
WITH temptab(
i__id , i__name, i__effective, i__expires, i__lefttag, i__righttag,
hier_id, hier_dim_id, parent_item_id, parent_hier_id, parent_dim_id,
ancestor, h__id, h__name, h__level, h__effective, h__expires, rec_lvl)
AS (
SELECT
item.id as i__id,
item.name as i__name,
item.effectivets as i__effective,
item.expirests as i__expires,
item.lefttag as i__lefttag,
item.righttag as i__righttag,
hier_id, hier_dim_id,
parent_item_id,
parent_hier_id,
parent_dim_id, 1 as ancestor,
hier.id as h__id, hier.name as h__name,
hier.level as h__level, hier.effectivets as h__effective,
hier.expirests as h__expires, 1 as rec_lvl FROM metro.item item,
metro.hierarchy hier WHERE item.id = 'DI' AND hier_id = '69' AND hier_dim_id= '36' AND hier.id =item.hier_id
)
SELECT
i__id, i__name, i__effective, i__expires, i__lefttag,
i__righttag, hier_id, hier_dim_id, parent_item_id,
parent_hier_id, parent_dim_id, ancestor,
h__id, h__name, h__level, h__effective, h__expires
FROM temptab
This query returns an empty dataset, but I expect 1 row.
The data is correct: a similar simple query without the WITH clause works fine.
I investigated the problem and found Sub Query with WITH-CLAUSE in H2DB, but that solution did not help.
So, does anyone know how H2 supports with clause?
Thanks in advance for your time.
According to the H2 database grammar, it looks like the WITH clause is not supported in H2, apart from experimental support for recursive queries: h2 recursive queries
It's supported now (http://www.h2database.com/html/grammar.html), for non-recursive queries as well.

Filter results if not contained in another column

Is there an equivalent way to do the following SQL command with Django's QuerySet API?
select id, childid from mysite_nodetochild
where childid NOT IN (Select "Nodeid" from mysite_nodetochild)
I would prefer not to use raw SQL if possible, but I can't get a clean working version using Django's QuerySet API.
Try
nodetochild.objects.exclude(childid=nodetochild.objects.values_list('Nodeid', flat=True)).only('id', 'childid')
This should evaluate to, more or less:
SELECT "mysite_nodetochild"."id", "mysite_nodetochild"."childid" FROM "mysite_nodetochild" WHERE NOT ("mysite_nodetochild"."childid" = (SELECT U0."nodeid" FROM "mysite_nodetochild" U0))
Or, if you need the IN condition:
nodetochild.objects.exclude(childid__in=nodetochild.objects.values_list('Nodeid', flat=True)).only('id', 'childid')
Would evaluate to:
SELECT "mysite_nodetochild"."id", "mysite_nodetochild"."childid" FROM "mysite_nodetochild" WHERE NOT ("mysite_nodetochild"."childid" IN (SELECT U0."nodeid" FROM "mysite_nodetochild" U0))

BigQuery URL decode

Is there an easy way to do URL decoding within the BigQuery query language? I'm working with a table that has a column containing URL-encoded strings in some values. For example:
http://xyz.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
I extract the "url" parameter like so:
SELECT REGEXP_EXTRACT(column_name, "url=([^&]+)") as url
from [mydataset.mytable]
which gives me:
http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345
What I would like to do is something like:
SELECT URL_DECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) as url
from [mydataset.mytable]
thereby returning:
http://www.example.com/hello?v=12345
I would like to avoid using multiple REGEXP_REPLACE() statements (replacing %20, %3A, etc...) if possible.
Ideas?
Below builds on @sigpwned's answer, slightly refactored and wrapped in a SQL UDF (which has none of the limitations a JS UDF has, so it is safe to use):
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
SELECT
column_name,
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url
FROM `project.dataset.table`
It can be tested with the example from the question:
#standardSQL
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT SAFE_CONVERT_BYTES_TO_STRING(
ARRAY_TO_STRING(ARRAY_AGG(
IF(STARTS_WITH(y, '%'), FROM_HEX(SUBSTR(y, 2)), CAST(y AS BYTES)) ORDER BY i
), b''))
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}|[^%]+")) AS y WITH OFFSET AS i
));
WITH `project.dataset.table` AS (
SELECT 'http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz' column_name
)
SELECT
URLDECODE(REGEXP_EXTRACT(column_name, "url=([^&]+)")) AS url,
column_name
FROM `project.dataset.table`
with the result:
Row url column_name
1 http://www.example.com/hello?v=12345 http://example.com/example.php?url=http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345&foo=bar&abc=xyz
Update, with a further optimized SQL UDF:
CREATE TEMP FUNCTION URLDECODE(url STRING) AS ((
SELECT STRING_AGG(
IF(REGEXP_CONTAINS(y, r'^%[0-9a-fA-F]{2}'),
SAFE_CONVERT_BYTES_TO_STRING(FROM_HEX(REPLACE(y, '%', ''))), y), ''
ORDER BY i
)
FROM UNNEST(REGEXP_EXTRACT_ALL(url, r"%[0-9a-fA-F]{2}(?:%[0-9a-fA-F]{2})*|[^%]+")) y
WITH OFFSET AS i
));
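The tokenize-and-decode idea behind these UDFs is easy to prototype outside BigQuery. A Python sketch of the same algorithm (the `urldecode` name is mine): split the input into %XX escapes and plain runs, turn each escape into its byte, and decode the concatenated bytes as UTF-8.

```python
import re

# Same approach as the SQL UDF: tokenize into %XX escapes and plain runs,
# convert escapes to raw bytes, then decode the whole byte string as UTF-8
# so multi-byte sequences like %c3%b3 come out as a single character.
def urldecode(url):
    parts = re.findall(r"%[0-9a-fA-F]{2}|[^%]+", url)
    raw = b"".join(bytes.fromhex(p[1:]) if p.startswith("%") else p.encode("utf-8")
                   for p in parts)
    return raw.decode("utf-8", errors="replace")

print(urldecode("http%3A%2F%2Fwww.example.com%2Fhello%3Fv%3D12345"))
# http://www.example.com/hello?v=12345
```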
It's a good feature request, but currently there is no built-in BigQuery function that provides URL decoding.
One more workaround is to use a user-defined function:
#standardSQL
CREATE TEMPORARY FUNCTION URL_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
try {
  return decodeURIComponent(enc);
} catch (e) {
  return null;
}
""";
SELECT ven_session,
URL_DECODE(REGEXP_EXTRACT(para,r'&kw=(\w|[^&]*)')) AS q
FROM raas_system.weblog_20170327
WHERE para like '%&kw=%'
LIMIT 10
I agree with everyone here that URLDECODE should be a native function. However, until that happens, it is possible to write a "native" URLDECODE:
SELECT id, SAFE_CONVERT_BYTES_TO_STRING(ARRAY_TO_STRING(ps, b''))
FROM (
  SELECT
    id,
    ARRAY_AGG(CASE
        WHEN REGEXP_CONTAINS(y, r"^%") THEN FROM_HEX(SUBSTR(y, 2))
        ELSE CAST(y AS bytes)
      END ORDER BY i) AS ps
  FROM (
    SELECT x AS id, REGEXP_EXTRACT_ALL(x, r"%[0-9a-fA-F]{2}|[^%]+") AS element
    FROM UNNEST(ARRAY['domodossola%e2%80%93locarno railway', 'gabu%c5%82t%c3%b3w']) AS x
  ) AS x
  CROSS JOIN UNNEST(x.element) AS y WITH OFFSET AS i
  GROUP BY id
);
In this example, I've tried and tested the implementation with a couple of percent-encoded page names from Wikipedia as the input. It should work with your input, too.
Obviously, this is extremely unwieldy! For that reason, I'd suggest building a materialized join table, or wrapping this in a view, rather than using this expression "naked" in your query. However, it does appear to get the job done, and it doesn't hit the UDF limits.
EDIT: @MikhailBerylyant's post below has wrapped this cumbersome implementation into a nice, tidy little SQL UDF. That's a much better way to handle this!