What is the order of execution of SQL functions in BigQuery

I need to transform the string elements matched by REGEXP_REPLACE, but the SUBSTR function appears to be executed before REGEXP_REPLACE.
I was unable to find any documented limitations on what the replacement can be.
WITH tabA AS
(SELECT "[\"5cc623dd-41f5-42d9-9637-a169af42e2b1\",\"6cc623dd-41f5-42d9-9637-a169af42e2b1\"]" as myids
union all
SELECT "[\"5cc623dd-41f5-42d9-9637-a169af42e2b1\"]" as myids
)
SELECT
myids,
REGEXP_REPLACE(myids,"\"(.{36})\"",SUBSTR("\\1",0,8)) as ccc,
SUBSTR(myids,0,8) as ddd
FROM tabA;
I expect only the first 8 characters of each regex match to be output, but I am getting all 36 instead.
Expected to see:
5cc623dd,6cc623dd
5cc623dd

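In your query, SUBSTR(...) is an argument to REGEXP_REPLACE, so it must be evaluated before REGEXP_REPLACE() is called. SUBSTR("\\1",0,8) simply returns "\\1", because that string is shorter than 8 characters, so the replacement is still the whole captured group.
You must do the SUBSTR equivalent as part of your regular expression, as in the example below: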
WITH tabA AS
(SELECT "[\"5cc623dd-41f5-42d9-9637-a169af42e2b1\",\"6cc623dd-41f5-42d9-9637-a169af42e2b1\"]" as myids
union all
SELECT "[\"5cc623dd-41f5-42d9-9637-a169af42e2b1\"]" as myids
)
SELECT
myids,
REGEXP_REPLACE(myids,"\"(.{8}).{28}\"","\\1") as ccc
FROM tabA;
Output:
+---------------------------------------------------------------------------------+---------------------+
| myids                                                                           | ccc                 |
+---------------------------------------------------------------------------------+---------------------+
| ["5cc623dd-41f5-42d9-9637-a169af42e2b1","6cc623dd-41f5-42d9-9637-a169af42e2b1"] | [5cc623dd,6cc623dd] |
| ["5cc623dd-41f5-42d9-9637-a169af42e2b1"]                                        | [5cc623dd]          |
+---------------------------------------------------------------------------------+---------------------+
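The evaluation order described above can be reproduced outside BigQuery. The following is a minimal Python sketch, assuming Python's re module as a stand-in for BigQuery's regex functions (the two engines differ in details, so this is only an analogy). The helper names naive and fixed are hypothetical. It shows that the replacement argument is computed first, and that truncating the two-character string \1 to 8 characters leaves it unchanged.

```python
import re

MYIDS = '["5cc623dd-41f5-42d9-9637-a169af42e2b1","6cc623dd-41f5-42d9-9637-a169af42e2b1"]'

def naive(myids):
    # The argument is evaluated before re.sub runs: slicing the 2-character
    # string r"\1" to 8 characters returns r"\1" unchanged.
    replacement = r"\1"[:8]
    return re.sub(r'"(.{36})"', replacement, myids)

def fixed(myids):
    # Do the truncation inside the pattern: capture 8 characters, skip the other 28.
    return re.sub(r'"(.{8}).{28}"', r"\1", myids)

print(naive(MYIDS))  # full 36-character ids, quotes stripped
print(fixed(MYIDS))  # only the first 8 characters of each id
```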

Below is for BigQuery Standard SQL
#standardSQL
WITH tabA AS (
SELECT "[\"5cc623dd-41f5-42d9-9637-a169af42e2b1\",\"6cc623dd-41f5-42d9-9637-a169af42e2b1\"]" AS myids UNION ALL
SELECT "[\"5cc623dd-41f5-42d9-9637-a169af42e2b1\"]" AS myids
)
SELECT myids,
( SELECT STRING_AGG(id)
FROM UNNEST(REGEXP_EXTRACT_ALL(myids, r'"(.{8})-')) id
) ids
FROM tabA
with result
Row myids ids
1 ["5cc623dd-41f5-42d9-9637-a169af42e2b1","6cc623dd-41f5-42d9-9637-a169af42e2b1"] 5cc623dd,6cc623dd
2 ["5cc623dd-41f5-42d9-9637-a169af42e2b1"] 5cc623dd
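The extract-then-aggregate idea in the query above can be sketched in Python as well. This assumes re.findall as a rough counterpart of REGEXP_EXTRACT_ALL and str.join as a counterpart of STRING_AGG with its default comma separator; short_ids is a hypothetical helper name.

```python
import re

def short_ids(myids):
    # findall returns the captured group of every match (like REGEXP_EXTRACT_ALL);
    # join concatenates them with commas (like STRING_AGG).
    return ",".join(re.findall(r'"(.{8})-', myids))

two = '["5cc623dd-41f5-42d9-9637-a169af42e2b1","6cc623dd-41f5-42d9-9637-a169af42e2b1"]'
one = '["5cc623dd-41f5-42d9-9637-a169af42e2b1"]'
print(short_ids(two))
print(short_ids(one))
```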

Related

Split record into 2 records with distinct values based on a unique id

I have a table with some IDs that correspond to duplicate data that I would like to get rid of. They are linked by a GroupID. Currently my data looks like this:
|GroupID|NID1 |NID2 |
|S1 |644763|643257|
|T2 |4759 |84689 |
|W3 |96676 |585876|
In order for the software to run, I need the data in the following format:
|GroupID|NID |
|S1 |644763|
|S1 |643257|
|T2 |4759 |
|T2 |84689 |
|W3 |96676 |
|W3 |585876|
Thank you for your time.
You want UNION ALL:
select groupid, nid1 as nid
from t
union all -- use "union" instead if you don't want duplicate rows
select groupid, nid2
from t;
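To try the UNION ALL unpivot on the sample data, here is a small sketch that runs it in an in-memory SQLite database from Python. SQLite is only a stand-in for the asker's database, the table name t is assumed, and the src column is a hypothetical addition used to keep NID1 ahead of NID2 within each group.

```python
import sqlite3

def unpivot_union_all():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (groupid TEXT, nid1 INTEGER, nid2 INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [("S1", 644763, 643257), ("T2", 4759, 84689), ("W3", 96676, 585876)],
    )
    rows = conn.execute(
        """
        SELECT groupid, nid1 AS nid, 1 AS src FROM t
        UNION ALL
        SELECT groupid, nid2, 2 FROM t
        ORDER BY groupid, src
        """
    ).fetchall()
    conn.close()
    # Drop the helper column; keep (GroupID, NID) pairs.
    return [(groupid, nid) for groupid, nid, _src in rows]

for row in unpivot_union_all():
    print(row)
```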
In Oracle 12C+, you can use lateral joins:
select t.groupid, v.nid
from t cross apply
(select t.nid1 as nid from dual union all
select t.nid2 as nid from dual
) v;
This is more efficient than union all because it only scans the table once.
You can also express this as:
select t.groupid,
(case when n.n = 1 then t.nid1 when n.n = 2 then t.nid2 end) as nid
from t cross join
(select 1 as n from dual union all select 2 from dual) n;
A little more complicated, but still only one scan of the table.
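The cross join against a two-row number table can be tried in the same way. This sketch again assumes an in-memory SQLite database as a stand-in (so dual is dropped, since SQLite allows SELECT without FROM) and a table named t; it says nothing about how Oracle actually plans the query.

```python
import sqlite3

def unpivot_cross_join():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (groupid TEXT, nid1 INTEGER, nid2 INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [("S1", 644763, 643257), ("T2", 4759, 84689), ("W3", 96676, 585876)],
    )
    rows = conn.execute(
        """
        SELECT t.groupid,
               CASE WHEN n.n = 1 THEN t.nid1 WHEN n.n = 2 THEN t.nid2 END AS nid
        FROM t CROSS JOIN (SELECT 1 AS n UNION ALL SELECT 2) n
        ORDER BY t.groupid, n.n
        """
    ).fetchall()
    conn.close()
    return rows

for row in unpivot_cross_join():
    print(row)
```

Each source row is paired with n = 1 and n = 2, and the CASE expression picks the matching column, so every input row yields exactly two output rows.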

Translate code string into desc in hive

We have a hyphenated string such as 0-1-3... whose length is not fixed.
We also have a DETAIL table in Hive that explains the meaning of each code.
DETAIL
| code | desc |
+ ---- + ---- +
| 0 | AAA |
| 1 | BBB |
| 2 | CCC |
| 3 | DDD |
Now we need a Hive query to convert the code string into a description string.
For example, 0-1-3 should produce the string AAA-BBB-DDD.
Any advice on how to do that?
Split your string to get an array, explode the array, and join it with the detail table to get the desc for each code (my example uses a subquery in place of the table; use your real table instead). Then assemble the string: collect_list(desc) builds an array and concat_ws() concatenates it:
select concat_ws('-',collect_list(d.desc)) as code_desc
from
( --initial string explode
select explode(split('0-1-3','-')) as code
) s
inner join
(-- use your table instead of this subquery
select 0 code, 'AAA' desc union all
select 1, 'BBB' desc union all
select 2, 'CCC' desc union all
select 3, 'DDD' desc
) d on s.code=d.code;
Result:
OK
AAA-BBB-DDD
Time taken: 114.798 seconds, Fetched: 1 row(s)
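The split, join and concatenate steps of the query above can be sketched in plain Python to make the data flow explicit. This is only an illustration, not Hive: translate is a hypothetical helper, a dict stands in for the DETAIL table, and a Python list keeps element order on its own, which Hive does not guarantee after a join.

```python
DETAIL = {"0": "AAA", "1": "BBB", "2": "CCC", "3": "DDD"}

def translate(code_string, detail=DETAIL):
    # split(): string -> array of codes
    codes = code_string.split("-")
    # inner join with the detail table: codes without a match are dropped
    descs = [detail[code] for code in codes if code in detail]
    # concat_ws('-', collect_list(desc))
    return "-".join(descs)

print(translate("0-1-3"))
```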
If you need to preserve the original order, use posexplode: it returns each element together with its position in the original array. You can then order by record ID and pos before collect_list().
If your string is a table column, use lateral view to select the exploded values.
Here is a more complicated example with the order preserved and a lateral view:
select str as original_string, concat_ws('-',collect_list(s.desc)) as transformed_string
from
(
select s.str, s.pos, d.desc
from
( --initial string explode with ordering by str and pos
--(better use your table PK, like ID instead of str for ordering), pos
select str, pos, code from ( --use your table instead of this subquery
select '0-1-3' as str union all
select '2-1-3' as str union all
select '3-2-1' as str
)s
lateral view outer posexplode(split(s.str,'-')) v as pos,code
) s
inner join
(-- use your table instead of this subquery
select 0 code, 'AAA' desc union all
select 1, 'BBB' desc union all
select 2, 'CCC' desc union all
select 3, 'DDD' desc
) d on s.code=d.code
distribute by s.str -- this should be record PK candidate
sort by s.str, s.pos --sort on each reducer
)s
group by str;
Result:
OK
0-1-3 AAA-BBB-DDD
2-1-3 CCC-BBB-DDD
3-2-1 DDD-CCC-BBB
Time taken: 67.534 seconds, Fetched: 3 row(s)
Note that distribute by + sort by is used instead of simply order by str, pos. distribute by + sort by works in fully distributed mode; order by would also produce the correct result, but on a single reducer.
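Why the position is needed can be seen in a small Python sketch of the same pipeline. It is an analogy only: translate_ordered is a hypothetical helper, and the shuffle is an assumption used to imitate a distributed join returning rows in arbitrary order.

```python
import random
from collections import defaultdict

DETAIL = {"0": "AAA", "1": "BBB", "2": "CCC", "3": "DDD"}

def translate_ordered(strings, detail=DETAIL, seed=0):
    # posexplode: one (str, pos, code) row per element, keeping its position.
    exploded = [(s, pos, code)
                for s in strings
                for pos, code in enumerate(s.split("-"))]
    # Imitate a join whose output order is arbitrary.
    random.Random(seed).shuffle(exploded)
    # Inner join with the detail table.
    joined = [(s, pos, detail[code]) for s, pos, code in exploded if code in detail]
    # distribute by str / sort by str, pos: restore the order within each string.
    joined.sort(key=lambda row: (row[0], row[1]))
    # group by str, then concat_ws('-', collect_list(desc)).
    grouped = defaultdict(list)
    for s, _pos, desc in joined:
        grouped[s].append(desc)
    return {s: "-".join(descs) for s, descs in grouped.items()}

print(translate_ordered(["0-1-3", "2-1-3", "3-2-1"]))
```

As in the query, grouping by the string itself merges duplicate strings; a real primary key would be the safer grouping column.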

Combine Columns with same ID and search the data

I have a table (tblABC) which looks like
-------------------------------
BasicID | Filter1 | Filter2 |
-------------------------------
100 1 2
100 3 4
101 8 9
What I want to do is select the BasicID that has Filter1=1 and Filter2=4, i.e. I want to get the output
100
Can I create a view or something similar by combining the rows for each BasicID? Something that looks like
--------------------------------
BasicID | Filter1| Filter2 |
--------------------------------
100 1,3 2,4
101 8 9
Once this is done I can search using a simple query like
'select BasicID from tblNewlyCreatedTable where Filter1=1 and Filter2=4'
and get the output as 100.
To solve this I have tried the following methods, both of which were too inefficient because I have around 12 filters to filter on. Also, not all of the filters are applied every time: sometimes it is 4 filters, sometimes 2, and sometimes all 12.
1.
select * from tblABC
where BasicID in
(
select BasicID from tblABC
where Filter1 IN (1)
)
and BasicID in
(
select BasicID from tblABC
where Filter2 IN (4)
)
2.
Using a separate SELECT to find the results for Filter1 and Filter2 individually, and INTERSECT to intersect them.
One more question: would it be any good to create another table where all the Filter fields are varchar instead of int, and then search on the text? Many people advised me to avoid this, as it would also cause efficiency problems. But would filtering on 12 varchar fields in a single select query be more efficient than running 12 select queries on int fields and combining them?
For MySQL
SELECT BasicID,
GROUP_CONCAT(FILTER1 ORDER BY FILTER1 ASC SEPARATOR ', ') AS FILTER1,
GROUP_CONCAT(FILTER2 ORDER BY FILTER2 ASC SEPARATOR ', ') AS FILTER2
FROM tblABC
GROUP BY BASICID
with filters as
(select 1 as val from dual union select 2 as val from dual
--add more values when needed
)
select basicid from tblABC where Filter1 in (select val from filters)
intersect
select basicid from tblABC where Filter2 in (select val from filters)
This uses a CTE with all the filter values.
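Since the number of active filters varies, the INTERSECT query can be assembled from whichever filters are set. The sketch below assumes an in-memory SQLite database as a stand-in for the real one; matching_ids, demo_conn and the ALLOWED whitelist are hypothetical names, and only the two filter columns from the sample table are listed.

```python
import sqlite3

ALLOWED = {"Filter1", "Filter2"}  # extend with the remaining filter columns

def demo_conn():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tblABC (BasicID INTEGER, Filter1 INTEGER, Filter2 INTEGER)")
    conn.executemany("INSERT INTO tblABC VALUES (?, ?, ?)",
                     [(100, 1, 2), (100, 3, 4), (101, 8, 9)])
    return conn

def matching_ids(conn, filters):
    """filters maps a column name to the value it must have on some row."""
    if not filters:
        raise ValueError("at least one filter is required")
    for col in filters:
        # Column names cannot be bound as parameters, so check them explicitly.
        if col not in ALLOWED:
            raise ValueError("unknown filter column: " + col)
    sql = " INTERSECT ".join(
        "SELECT BasicID FROM tblABC WHERE {} = ?".format(col) for col in filters
    )
    return [row[0] for row in conn.execute(sql, list(filters.values()))]

conn = demo_conn()
print(matching_ids(conn, {"Filter1": 1, "Filter2": 4}))
```

Each filter contributes one SELECT, so a BasicID qualifies when every filter is satisfied by at least one of its rows, not necessarily the same row.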

Joining arrays within group by clause

We have a problem grouping arrays into a single array.
We want to join the values from two columns into one single array and aggregate these arrays of multiple rows.
Given the following input:
| id | name | col_1 | col_2 |
| 1 | a | 1 | 2 |
| 2 | a | 3 | 4 |
| 4 | b | 7 | 8 |
| 3 | b | 5 | 6 |
We want the following output:
| a | { 1, 2, 3, 4 } |
| b | { 5, 6, 7, 8 } |
The order of the elements is important and should correlate with the id of the aggregated rows.
We tried the array_agg() function:
SELECT array_agg(ARRAY[col_1, col_2]) FROM mytable GROUP BY name;
Unfortunately, this statement raises an error:
ERROR: could not find array type for data type character varying[]
It seems to be impossible to merge arrays in a group by clause using array_agg().
Any ideas?
UNION ALL
You could "unpivot" with UNION ALL first:
SELECT name, array_agg(c) AS c_arr
FROM (
SELECT name, id, 1 AS rnk, col1 AS c FROM tbl
UNION ALL
SELECT name, id, 2, col2 FROM tbl
ORDER BY name, id, rnk
) sub
GROUP BY 1;
Adapted to produce the order of values you later requested. The manual:
The aggregate functions array_agg, json_agg, string_agg, and xmlagg,
as well as similar user-defined aggregate functions, produce
meaningfully different result values depending on the order of the
input values. This ordering is unspecified by default, but can be
controlled by writing an ORDER BY clause within the aggregate call, as
shown in Section 4.2.7. Alternatively, supplying the input values from
a sorted subquery will usually work.
Bold emphasis mine.
LATERAL subquery with VALUES expression
LATERAL requires Postgres 9.3 or later.
SELECT t.name, array_agg(c) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
CROSS JOIN LATERAL (VALUES (t.col1), (t.col2)) v(c)
GROUP BY 1;
Same result. Only needs a single pass over the table.
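The effect of sorting first and then emitting two values per row can be sketched in Python. This is an analogy to the query above, not Postgres itself: aggregate is a hypothetical helper and the tuples mirror the question's sample rows.

```python
ROWS = [  # (id, name, col1, col2), in the question's unsorted order
    (1, "a", 1, 2),
    (2, "a", 3, 4),
    (4, "b", 7, 8),
    (3, "b", 5, 6),
]

def aggregate(rows):
    result = {}
    # ORDER BY name, id in the subquery
    for _id, name, col1, col2 in sorted(rows, key=lambda r: (r[1], r[0])):
        # VALUES (col1), (col2): two values per input row, in this order
        result.setdefault(name, []).extend([col1, col2])
    return result

print(aggregate(ROWS))
```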
Custom aggregate function
Or you could create a custom aggregate function as discussed in these related answers:
Selecting data into a Postgres array
Is there something like a zip() function in PostgreSQL that combines two arrays?
CREATE AGGREGATE array_agg_mult (anyarray) (
SFUNC = array_cat
, STYPE = anyarray
, INITCOND = '{}'
);
Then you can:
SELECT name, array_agg_mult(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Or, typically faster, while not standard SQL:
SELECT name, array_agg_mult(ARRAY[col1, col2]) AS c_arr
FROM (SELECT * FROM tbl ORDER BY name, id) t
GROUP BY 1;
The added ORDER BY id (which can be appended to such aggregate functions) guarantees your desired result:
a | {1,2,3,4}
b | {5,6,7,8}
Or you might be interested in this alternative:
SELECT name, array_agg_mult(ARRAY[ARRAY[col1, col2]] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Which produces 2-dimensional arrays:
a | {{1,2},{3,4}}
b | {{5,6},{7,8}}
The last one can be replaced (and should be, as it's faster!) with the built-in array_agg() in Postgres 9.5 or later - with its added capability of aggregating arrays:
SELECT name, array_agg(ARRAY[col1, col2] ORDER BY id) AS c_arr
FROM tbl
GROUP BY 1
ORDER BY 1;
Same result. The manual:
input arrays concatenated into array of one higher dimension (inputs
must all have same dimensionality, and cannot be empty or null)
So it is not exactly the same as our custom aggregate function array_agg_mult().
select n, array_agg(c) as c
from (
select n, unnest(array[c1, c2]) as c
from t
) s
group by n
Or simpler
select
n,
array_agg(c1) || array_agg(c2) as c
from t
group by n
To address the new ordering requirement:
select n, array_agg(c order by id, o) as c
from (
select
id, n,
unnest(array[c1, c2]) as c,
unnest(array[1, 2]) as o
from t
) s
group by n
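The difference between concatenating two aggregates and ordering by (id, o) can be illustrated with a short Python sketch. It is only an analogy: both helper names are hypothetical, and the rows are assumed to arrive in id order, which Postgres does not guarantee for array_agg without an ORDER BY.

```python
ROWS = [(1, "a", 1, 2), (2, "a", 3, 4)]  # (id, n, c1, c2) for group "a"

def concat_of_aggregates(rows):
    # array_agg(c1) || array_agg(c2): all c1 values first, then all c2 values
    return [r[2] for r in rows] + [r[3] for r in rows]

def ordered_unnest(rows):
    # unnest(array[c1, c2]) paired with unnest(array[1, 2]) as o,
    # then array_agg(c order by id, o)
    tagged = [(r[0], o, c) for r in rows for o, c in ((1, r[2]), (2, r[3]))]
    return [c for _id, _o, c in sorted(tagged)]

print(concat_of_aggregates(ROWS))
print(ordered_unnest(ROWS))
```

The first form groups values by column, while the second keeps the two values of each row adjacent, which is what the ordering requirement asks for.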

Postgres - How to check for an empty array

I'm using Postgres and I'm trying to write a query like this:
select count(*) from table where datasets = ARRAY[]
i.e. I want to know how many rows have an empty array for a certain column, but postgres doesn't like that:
select count(*) from super_eds where datasets = ARRAY[];
ERROR: syntax error at or near "]"
LINE 1: select count(*) from super_eds where datasets = ARRAY[];
^
The syntax should be:
SELECT
COUNT(*)
FROM
table
WHERE
datasets = '{}'
You use quotes plus curly braces to show array literals.
You can use the fact that the array_upper and array_lower functions return NULL for empty arrays, so you can write:
select count(*) from table where array_upper(datasets, 1) is null;
If you find this question in 2020, like I did, the correct answer is
select count(*) from table where cardinality(datasets) = 0
cardinality was added in PostgreSQL 9.4, which was released in December 2014:
https://www.postgresql.org/docs/9.4/functions-array.html
Solution Query:
select id, name, employee_id from table where array_column = ARRAY[NULL]::array_datatype;
Example:
tab_emp:
id (int)| name (character varying) | (employee_id) (uuid[])
1 | john doe | {4f1fabcd-aaaa-bbbb-cccc-f701cebfabcd, 2345a3e3-xxxx-yyyy-zzzz-f69d6e2edddd }
2 | jane doe | {NULL}
select id, name, employee_id from tab_emp where employee_id = ARRAY[NULL]::uuid[];
-------
2 | jane doe | {NULL}
SELECT COUNT(*)
FROM table
WHERE datasets = ARRAY(SELECT 1 WHERE FALSE)