How to unnest BigQuery nested records into multiple columns - sql

I am trying to unnest the table below, using the following UNNEST query to flatten it:
SELECT
  id,
  name,
  keyword
FROM `project_id.dataset_id.table_id`,
  unnest(`groups`) as `groups`
where id = 204358
The problem is that this duplicates the rows (except name), as is the case when flattening a table.
How can I modify the query to put the names in two different columns rather than in rows?
Expected output below -

That's because the comma is a cross join - in combination with an unnested array it is a lateral cross join, so the parent row is repeated for every row in the array.
One problem with pivoting arrays is that an array can have a variable number of rows, but a table must have a fixed number of columns.
So you need a way to decide which array row becomes which column. E.g. with
SELECT
  id,
  name,
  groups[ordinal(1)] as firstArrayEntry,
  groups[ordinal(2)] as secondArrayEntry,
  keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
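Note that ordinal() raises an out-of-range error if an array has fewer elements than the index; if your arrays vary in length, BigQuery's safe_ordinal() returns NULL instead. A minimal variant of the query above:
SELECT
  id,
  name,
  groups[safe_ordinal(1)] as firstArrayEntry,
  groups[safe_ordinal(2)] as secondArrayEntry,  -- NULL when the array has fewer than 2 entries
  keyword
FROM `project_id.dataset_id.table_id`
where id = 204358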
If your array had a key-value pair you could decide using the key. E.g.
SELECT
  id,
  name,
  (select value from unnest(groups) where key = 'key1') as key1,
  keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but it has restrictions too, and I'm not sure how computation-heavy it is.

Consider the simple solution below:
select * from (
  select id, name, keyword, offset
  from `project_id.dataset_id.table_id`,
  unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (1, 2))
If applied to the sample data in your question, the output is:
Note: when you apply this to your real case, you just need to know how many name_N columns to expect and extend the list accordingly - for example, for offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns.
If for whatever reason you want to improve on this, use the version below, where everything is built dynamically for you, so you don't need to know in advance how many columns there will be in the output:
execute immediate (select '''
select * from (
  select id, name, keyword, offset
  from `project_id.dataset_id.table_id`,
  unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg(cast(pos as string), ', ') || '''))
'''
from (select pos from (
  select max(array_length(`groups`)) cnt
  from `project_id.dataset_id.table_id`
), unnest(generate_array(1, cnt)) pos
))

Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
       (SELECT g.name
        FROM UNNEST(t.groups) g WITH OFFSET n
        WHERE key = 'OVG'
        ORDER BY n
        LIMIT 1
       ) as name_1,
       (SELECT g.name
        FROM UNNEST(t.groups) g WITH OFFSET n
        WHERE key = 'OVG'
        ORDER BY n
        LIMIT 1 OFFSET 1
       ) as name_2,
       'OVG' as keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;

Related

Big Query SQL concatenating or linking values in piped strings or arrays of strings

I have two fields in a big BQ table. Both fields are strings, and both are formatted to represent several values separated by pipes, with the same number of values in each string. I need to associate each value in the first set with the corresponding value in the second set. For example:
id  names    nums
x   "a|b|c"  "3|9|5"
y   "d"      "1"
z   "e|f"    "4|7"
I need to come out with a result like:
x,a,3
x,b,9
x,c,5
y,d,1
z,e,4
z,f,7
The second input string is a sequence of numbers but I don't mind if this comes out as numeric or string (I'll figure out the casting).
Seems obvious I need to use split() at some point to convert the strings to arrays, but how can I pair the arrays up element by element rather than cross-joining them?
I know I can use a double unnest with an index and select only the rows where the indexes are equal, but I don't want to use this method because it makes the already large input table massive (before the equal indexes are selected).
Thanks for any thoughts!
Here is one method that generates a series of subscripts and uses arrays to extract the values:
select id, split(names, '|')[safe_ordinal(n)] as names,
       split(nums, '|')[safe_ordinal(n)] as nums
from (select 'x' as id, 'a|b|c' as names, '3|9|5' as nums union all
      select 'y', 'd', '1' union all
      select 'z', 'e|f', '4|7'
     ) t cross join
     unnest(generate_array(1, array_length(split(names, '|')))) as n;
This uses the length of names to determine how many values there are; safe_ordinal() returns NULL rather than erroring if nums should happen to have fewer entries.
Alternative option for BigQuery Standard SQL
#standardSQL
SELECT id, name, num
FROM `project.dataset.table`,
UNNEST(SPLIT(names, '|')) name WITH OFFSET
JOIN UNNEST(SPLIT(nums, '|')) num WITH OFFSET
USING(OFFSET)
If applied to the sample data in your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'x' id, 'a|b|c' names, '3|9|5' nums UNION ALL
  SELECT 'y', 'd', '1' UNION ALL
  SELECT 'z', 'e|f', '4|7'
)
SELECT id, name, num
FROM `project.dataset.table`,
UNNEST(SPLIT(names, '|')) name WITH OFFSET
JOIN UNNEST(SPLIT(nums, '|')) num WITH OFFSET
USING(OFFSET)
The result is:
Row  id  name  num
1    x   a     3
2    x   b     9
3    x   c     5
4    y   d     1
5    z   e     4
6    z   f     7
The above can be refactored to use a UDF so the main query stays simple and readable - as in the example below.
#standardSQL
CREATE TEMP FUNCTION xxx(arr1 ANY TYPE, arr2 ANY TYPE)
RETURNS ARRAY<STRUCT<name STRING, num STRING>> AS (
  ARRAY(
    SELECT AS STRUCT el1 AS name, el2 AS num
    FROM UNNEST(SPLIT(arr1, '|')) el1 WITH OFFSET
    JOIN UNNEST(SPLIT(arr2, '|')) el2 WITH OFFSET
    USING(OFFSET)
  ));
SELECT id, x.*
FROM `project.dataset.table`, UNNEST(xxx(names, nums)) x
obviously with the same output
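For a quick self-contained check, the inline sample data from above can be reused; run this after the CREATE TEMP FUNCTION statement:
WITH `project.dataset.table` AS (
  SELECT 'x' id, 'a|b|c' names, '3|9|5' nums UNION ALL
  SELECT 'y', 'd', '1' UNION ALL
  SELECT 'z', 'e|f', '4|7'
)
SELECT id, x.*
FROM `project.dataset.table`, UNNEST(xxx(names, nums)) x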

PostgreSQL: How to return a subarray dynamically using array slices in postgresql

I need to sum a subarray from an array using PostgreSQL.
I need to create a PostgreSQL query that will do this dynamically, as the upper and lower indexes will be different for each array.
These indexes will come from two other columns within the same table.
I had the query below, which gets the subarray:
SELECT
  SUM(t) AS summed_index_values
FROM
  (SELECT UNNEST(int_array_column[34:100]) AS t
   FROM array_table
   WHERE id = 1) AS t;
...but I then realised I couldn't use variables or SELECT statements when using array slices to make the query dynamic:
int_array_column[SELECT array_index_lower FROM array_table WHERE id = 1; : SELECT array_index_upper FROM array_table WHERE id = 1;]
...does anyone know how I can achieve this query dynamically?
No need for sub-selects, just use the column names:
SELECT SUM(t) AS summed_index_values
FROM (
SELECT UNNEST(int_array_column[tb.array_index_lower:tb.array_index_upper]) AS t
FROM array_table tb
WHERE id = 1
) AS t;
Note that it's not recommended to use set-returning functions (unnest) in the SELECT list. It's better to put that into the FROM clause:
SELECT sum(t.val)
FROM (
  SELECT t.val
  FROM array_table tb
    CROSS JOIN UNNEST(int_array_column[tb.array_index_lower:tb.array_index_upper]) AS t(val)
  WHERE id = 1
) AS t;
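A minimal self-contained check (hypothetical sample data; the slice [2:3] covers 10 and 15):
CREATE TABLE array_table (
  id                int,
  int_array_column  int[],
  array_index_lower int,
  array_index_upper int
);
INSERT INTO array_table VALUES (1, ARRAY[5, 10, 15, 20], 2, 3);
-- either query above now returns 25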

tricky SQL with substrings

I have a table (postgres) with a varchar field that has content structured like:
".. John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9 .."
The uuid can occur in more than one record. But it must not occur for more than one combination of [givenname];[surname], according to a business rule.
That is, if the John Smith example above is present in the table, then if uuid 7c32e9e1.. occurs in any other record, the field in that record must also contain ".. John;Smith; .."
The problem is, this business rule has been violated due to some bug. And I would like to know how many rows in the table contain a uuid that occurs in more than one place with different combinations of [givenname];[surname].
I'd appreciate if someone could help me out with the SQL to accomplish this.
Use regular expressions to extract the UUID and the name from the string. Then aggregate per UUID and either count distinct names or compare minimum and maximum name:
select
  substring(col, 'uuid=([[:alnum:]-]+)') as uuid,
  string_agg(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ' | ') as names
from mytable
group by substring(col, 'uuid=([[:alnum:]-]+)')
having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1;
Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=907a283a754eb7427d4ffbf50c6f0028
If you only want to count:
select
  count(*) as cnt_uuids,
  sum(num_names) as cnt_names,
  sum(num_rows) as cnt_rows
from
(
  select
    count(*) as num_rows,
    count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) as num_names
  from mytable
  group by substring(col, 'uuid=([[:alnum:]-]+)')
  having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1
) flaws;
But as has been mentioned already: This is not how a database should be used.
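A minimal reproduction (hypothetical two-row sample where one uuid carries two different names):
CREATE TABLE mytable(col text);
INSERT INTO mytable VALUES
  ('junk;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9;junk'),
  ('junk;John;Jay;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9;junk');
-- the first query above now reports that uuid with both names aggregated,
-- e.g. 'John;Jay | John;Smith'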
I assume you know all the reasons why this is a bad data format, but you are stuck with it. Here is my approach:
select v.user_id, array_agg(distinct names)
from (select v.id,
             max(el) filter (where n = un) as user_id,
             array_agg(el order by el) filter (where n in (un - 2, un - 1)) as names
      from (select v.id, u.*,
                   max(u.n) filter (where el like 'uuid=%') over (partition by v.id) as un
            from (values (1, 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (2, 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (3, 'junkgoeshere;John;Smith;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
                         (4, 'junkgoeshere;John;Jay;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..')
                 ) v(id, str) cross join lateral
                 unnest(regexp_split_to_array(v.str, ';')) with ordinality u(el, n)
           ) v
      where n between un - 2 and un
      group by v.id
     ) v
group by user_id
having min(names) <> max(names);
Here is a db<>fiddle.
This assumes that the fields are separated by semicolons. Your data format is just awful, not just as a string but because the names are not identified. So, I am assuming they are the two fields before the user_id field.
So, this implements the following logic:
Breaks up the string by semicolons, with an identifying number.
Finds the number for the user_id.
Extracts the previous two fields together and the user_id column.
Then uses aggregation to find cases where there are multiple matches.

Returning the lowest integer not in a list in SQL

Suppose you have a table T(A) with only positive integers allowed, like:
1,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18
In the above example, the result is 10. We can always use ORDER BY and DISTINCT to sort and remove duplicates. However, to find the lowest integer not in the list, I came up with the following SQL query:
select list.x + 1
from (select x from (select distinct a as x from T order by a)) as list, T
where list.x + 1 not in T limit 1;
My idea is to start a counter at 1 and check whether that counter is in the list: if it is not, return it; otherwise increment and look again. That query works in most cases, but there are corner cases, like when 1 itself is missing. How can I accomplish that in SQL, or should I go in a completely different direction to solve this problem?
Because SQL works on sets, the intermediate SELECT DISTINCT a AS x FROM t ORDER BY a is redundant.
The basic technique of looking for a gap in a column of integers is to find where the current entry plus 1 does not exist. This requires a self-join of some sort.
Your query is not far off, but I think it can be simplified to:
SELECT MIN(a) + 1
FROM t
WHERE a + 1 NOT IN (SELECT a FROM t)
The NOT IN acts as a sort of self-join. This won't produce anything from an empty table, but should be OK otherwise.
SQL Fiddle
select min(y.a) as a
from t x
right join (
  select a + 1 as a from t
  union
  select 1
) y on y.a = x.a
where x.a is null
It will work even on an empty table.
SELECT min(t.a) - 1
FROM t
LEFT JOIN t t1 ON t1.a = t.a - 1
WHERE t1.a IS NULL
AND t.a > 1; -- exclude 0
This finds the smallest number greater than 1 whose next-smaller number is not in the table; that missing next-smaller number is returned.
This works even for a missing 1. There are multiple answers checking in the opposite direction. All of them would fail with a missing 1.
SQL Fiddle.
You can do the following, although you may also want to define a range - in which case you might need a couple of UNIONs
SELECT x.id + 1
FROM my_table x
LEFT JOIN my_table y ON x.id + 1 = y.id
WHERE y.id IS NULL
ORDER BY x.id
LIMIT 1;
You can always create a table with all of the numbers from 1 to X and then join it with the table you are comparing. Then just find the TOP value in your SELECT statement that isn't present in the table you are comparing:
SELECT TOP 1 table_with_all_numbers.number, table_with_missing_numbers.number
FROM table_with_all_numbers
LEFT JOIN table_with_missing_numbers
ON table_with_missing_numbers.number = table_with_all_numbers.number
WHERE table_with_missing_numbers.number IS NULL
ORDER BY table_with_all_numbers.number ASC;
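If you don't already have such a numbers table, a quick way to materialize one (a sketch, assuming SQL Server given the TOP syntax; the 1000 range is arbitrary):
WITH numbers(number) AS (
  SELECT 1
  UNION ALL
  SELECT number + 1 FROM numbers WHERE number < 1000
)
SELECT number
INTO table_with_all_numbers
FROM numbers
OPTION (MAXRECURSION 1000);  -- the default recursion cap is 100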
In SQLite 3.8.3 or later, you can use a recursive common table expression to create a counter.
Here, we stop counting when we find a value not in the table:
WITH RECURSIVE counter(c) AS (
SELECT 1
UNION ALL
SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter;
(This works for an empty table or a missing 1.)
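A quick self-contained check (a sketch, assuming SQLite, which accepts c IN t as shorthand for c IN (SELECT a FROM t)):
CREATE TABLE t(a INTEGER);
INSERT INTO t VALUES (1),(1),(2),(3),(4),(5),(6),(7),(8),(9),
                     (11),(12),(13),(14),(15),(16),(17),(18);
-- the recursive query above now returns 10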
This query ranks (starting from rank 1) each distinct number in ascending order and selects the lowest rank that's less than its number. If no rank is lower than its number (i.e. there are no gaps in the table) the query returns the max number + 1.
select coalesce(min(number), 1) from (
  select min(cnt) number
  from (
    select
      number,
      (select count(*) from (select distinct number from numbers) b
       where b.number <= a.number) as cnt
    from (select distinct number from numbers) a
  ) t1 where number > cnt
  union
  select max(number) + 1 number from numbers
) t1
http://sqlfiddle.com/#!7/720cc/3
Just another method, using EXCEPT this time:
SELECT a + 1 AS missing FROM T
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;
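Like the other answers that only look upward from existing values, this misses the case where 1 itself is absent; adding 1 as a candidate covers it (a sketch, relying on the left-to-right evaluation of set operators in SQLite and PostgreSQL):
SELECT a + 1 AS missing FROM T
UNION
SELECT 1
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;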

Index number for records within a pipe-delimited field inside a csv

I have a csv that I'm bringing into a SQL table. The csv has a field within it for CrimeType. That field is pipe delimited. So, I'm using cross apply to break up the pipe, like this:
SELECT CrimeRecords.CaseNum, CrimeRecords.Offense, PrimaryCrime.PrimaryCrime
FROM (SELECT CaseNum, x.i.value('.', 'varchar(20)') AS Offense
      FROM (SELECT CaseNum, CONVERT(XML, '<i>' + REPLACE(CrimeType, '|', '</i><i>') + '</i>') AS d
            FROM CrimeView.dbo.tblCrimeData) x1
      CROSS APPLY d.nodes('i') AS x(i)) AS CrimeRecords
Can someone help me add a step to create a field for a sequence number? Basically I just want to return the order of the items in the pipe.
For rows like:
1, Burglary|Assault
2, Burglary
3, Assault|Assault-Weapon|Theft
My result table would look like this:
CaseNum  CrimeType       SeqNum
1        Burglary        1
1        Assault         2
2        Burglary        1
3        Assault         1
3        Assault-Weapon  2
3        Theft           3
Edited to show that the Sequence Number resets for each CaseNum.
Edited the tags to clarify that this is Microsoft SQL, not MySQL.
Try including the ROW_NUMBER() function in your SELECT statement (http://technet.microsoft.com/en-us/library/ms186734.aspx).
i.e.
SELECT ROW_NUMBER() OVER (PARTITION BY CrimeRecords.CaseNum ORDER BY CrimeRecords.CaseNum) AS Idx,
       CrimeRecords.CaseNum, CrimeRecords.Offense, PrimaryCrime.PrimaryCrime
FROM (SELECT CaseNum, x.i.value('.', 'varchar(20)') AS Offense
      FROM (SELECT CaseNum, CONVERT(XML, '<i>' + REPLACE(CrimeType, '|', '</i><i>') + '</i>') AS d
            FROM CrimeView.dbo.tblCrimeData) x1
      CROSS APPLY d.nodes('i') AS x(i)) AS CrimeRecords
Edit: Included Partition By to reset the sequence for each case.
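One caveat: ORDER BY CaseNum inside a partition that is already grouped by CaseNum leaves the row order technically arbitrary, so Idx is not guaranteed to follow the pipe order. A sketch (untested against your schema) that instead orders by each node's actual document position via XQuery:
SELECT CaseNum, Offense,
       ROW_NUMBER() OVER (PARTITION BY CaseNum ORDER BY pos) AS SeqNum
FROM (SELECT CaseNum,
             x.i.value('.', 'varchar(20)') AS Offense,
             -- 0-based position of this <i> node among its siblings
             x.i.value('for $s in . return count(../*[. << $s])', 'int') AS pos
      FROM (SELECT CaseNum, CONVERT(XML, '<i>' + REPLACE(CrimeType, '|', '</i><i>') + '</i>') AS d
            FROM CrimeView.dbo.tblCrimeData) x1
      CROSS APPLY d.nodes('i') AS x(i)) AS CrimeRecords;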
If you have a simple table CrimeRecords like CaseNum | CrimeType,
you can do something like this:
SELECT CaseNum, CrimeType, @row := @row + 1 AS SeqNum
FROM CrimeRecords a JOIN (SELECT @row := 0) b;
OK, I can't see clearly what's going on in your query and I can't try it in a DB, but try to use the query I shared.
It is just an example of how you can number elements in the rows in order 1, 2, 3 ... x, so mix this code into your query and restart @row each time the group changes, and you'll get it.
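A sketch of that per-group reset in the same MySQL-variable style (hypothetical, since the question is actually Microsoft SQL, where the ROW_NUMBER() answer above is the better tool):
SELECT CaseNum, CrimeType,
       @row := IF(CaseNum = @prev, @row + 1, 1) AS SeqNum,  -- restart at 1 when CaseNum changes
       @prev := CaseNum AS prev_case
FROM (SELECT CaseNum, CrimeType FROM CrimeRecords ORDER BY CaseNum) a
JOIN (SELECT @row := 0, @prev := NULL) b;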