There is a lot of information online about going from flattened data to arrays or structs, but I need to do the opposite and I am having a hard time achieving it. I am using Google BigQuery.
I have something like:
| Id | Value1 | Value2 |
| 1 | 1 | 2 |
| 1 | 3 | 4 |
| 2 | 5 | 6 |
| 2 | 7 | 8 |
I would like to get for the example above:
1, [(1, 2), (3, 4)]
2, [(5, 6), (7, 8)]
If I try to put an array in the SELECT together with a GROUP BY, it is not a valid statement.
For example:
SELECT Id, [ STRUCT(Value1, Value2) ] as Value
FROM `table.dataset`
GROUP BY Id
Which (when run without the GROUP BY) returns:
1, (1, 2)
1, (3, 4)
2, (5, 6)
2, (7, 8)
Which is not what I am looking for. The structure I get is Id, Value.Value1, Value.Value2, but I want Id, [ Value(V1, V2), Value(V1, V2), ... ].
You can do that with SELECT Id, ARRAY_AGG(STRUCT(Value1, Value2)) ... GROUP BY Id
Below is for BigQuery Standard SQL
#standardSQL
select id, array_agg((select as struct t.* except(id))) as `value`
from `project.dataset.table` t
group by id
Applied to the sample data in your question, this produces the expected output.
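BigQuery arrays and structs can't be reproduced locally, but the grouping semantics of `ARRAY_AGG(STRUCT(Value1, Value2)) ... GROUP BY Id` can be sketched in plain Python over the sample rows from the question:

```python
from itertools import groupby

# Sample rows from the question: (Id, Value1, Value2)
rows = [(1, 1, 2), (1, 3, 4), (2, 5, 6), (2, 7, 8)]

# Group consecutive rows by Id (rows are already sorted by Id) and
# collect the remaining columns into a list of tuples, mirroring
# ARRAY_AGG(STRUCT(Value1, Value2)) ... GROUP BY Id.
grouped = {
    id_: [(v1, v2) for _, v1, v2 in grp]
    for id_, grp in groupby(rows, key=lambda r: r[0])
}

print(grouped)  # {1: [(1, 2), (3, 4)], 2: [(5, 6), (7, 8)]}
```

This matches the `1, [(1, 2), (3, 4)]` / `2, [(5, 6), (7, 8)]` shape asked for above.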
I have the following table
cust_id | category | counts
1 | food | 2
1 | pets | 5
3 | pets | 3
I would like to get this output
cust_id | food_count | pets_count
1 | 2 | 5
3 | 0 | 3
Where the number of columns maps to all unique values in the category column. Do you know how that can be done in Presto SQL? If I were doing this in PySpark I would use CountVectorizer, but I'm struggling a bit with SQL.
You can use GROUP BY and conditional sums, for example using the if function:
-- sample data
WITH dataset (cust_id, category, counts) AS (
VALUES (1, 'food', 2),
(1, 'pets', 5),
(3, 'pets', 3)
)
--query
select cust_id,
       sum(if(category = 'food', counts, 0)) as food_counts,
       sum(if(category = 'pets', counts, 0)) as pets_counts
from dataset
group by cust_id
Output:
cust_id | food_counts | pets_counts
1 | 2 | 5
3 | 0 | 3
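As a quick way to verify the logic, the same conditional aggregation runs under SQLite via Python's built-in sqlite3 module; SQLite has no if() function, so CASE WHEN stands in for Presto's if(cond, a, b):

```python
import sqlite3

# Same sample data as the Presto VALUES clause above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dataset (cust_id INT, category TEXT, counts INT);
    INSERT INTO dataset VALUES (1, 'food', 2), (1, 'pets', 5), (3, 'pets', 3);
""")

# CASE WHEN plays the role of Presto's if(category = '...', counts, 0).
rows = con.execute("""
    SELECT cust_id,
           SUM(CASE WHEN category = 'food' THEN counts ELSE 0 END) AS food_counts,
           SUM(CASE WHEN category = 'pets' THEN counts ELSE 0 END) AS pets_counts
    FROM dataset
    GROUP BY cust_id
    ORDER BY cust_id
""").fetchall()

print(rows)  # [(1, 2, 5), (3, 0, 3)]
```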
I have a table, Foo
ID | Name
-----------
1 | ONE
2 | TWO
3 | THREE
And another, Bar:
ID | FooID | Value
------------------
1 | 1 | Alpha
2 | 1 | Alpha
3 | 1 | Alpha
4 | 2 | Beta
5 | 2 | Gamma
6 | 2 | Beta
7 | 3 | Delta
8 | 3 | Delta
9 | 3 | Delta
I would like a query that joins these tables, returning one row for each row in Foo, rolling up the 'value' column from Bar. I can get back the first Bar.Value for each FooID:
SELECT * FROM Foo f OUTER APPLY
(
SELECT TOP 1 Value FROM Bar WHERE FooId = f.ID
) AS b
Giving:
ID | Name | Value
---------------------
1 | ONE | Alpha
2 | TWO | Beta
3 | THREE | Delta
But that's not what I want, and I haven't been able to find a variant that will bring back a rolled-up value; that is, the single Bar.Value if it is the same for every corresponding Foo row, or a static string like '(multiple)' if not:
ID | Name | Value
---------------------
1 | ONE | Alpha
2 | TWO | (multiple)
3 | THREE | Delta
I have found some solutions that would bring back concatenated values (albeit not very elegantly) like 'Alpha, Alpha, Alpha', 'Beta, Gamma, Beta', etc., but that's not what I want either.
One method, using a CASE expression and assuming that [Value] cannot have a value of NULL:
WITH Foo AS
(SELECT *
FROM (VALUES (1, 'ONE'),
(2, 'TWO'),
(3, 'THREE')) V (ID, [Name])),
Bar AS
(SELECT *
FROM (VALUES (1, 1, 'Alpha'),
(2, 1, 'Alpha'),
(3, 1, 'Alpha'),
(4, 2, 'Beta'),
(5, 2, 'Gamma'),
(6, 2, 'Beta'),
(7, 3, 'Delta'),
(8, 3, 'Delta'),
(9, 3, 'Delta')) V (ID, FooID, [Value]))
SELECT F.ID,
F.[Name],
CASE COUNT(DISTINCT B.[Value]) WHEN 1 THEN MAX(B.Value) ELSE '(Multiple)' END AS [Value]
FROM Foo F
JOIN Bar B ON F.ID = B.FooID
GROUP BY F.ID,
F.[Name];
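The COUNT(DISTINCT) rollup above isn't SQL Server-specific; as a sketch, the same query runs under SQLite via Python's built-in sqlite3 module with the sample data from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Foo (ID INT, Name TEXT);
    CREATE TABLE Bar (ID INT, FooID INT, Value TEXT);
    INSERT INTO Foo VALUES (1, 'ONE'), (2, 'TWO'), (3, 'THREE');
    INSERT INTO Bar VALUES (1, 1, 'Alpha'), (2, 1, 'Alpha'), (3, 1, 'Alpha'),
                           (4, 2, 'Beta'), (5, 2, 'Gamma'), (6, 2, 'Beta'),
                           (7, 3, 'Delta'), (8, 3, 'Delta'), (9, 3, 'Delta');
""")

# One distinct value -> show it; more than one -> '(multiple)'.
rows = con.execute("""
    SELECT F.ID, F.Name,
           CASE COUNT(DISTINCT B.Value)
               WHEN 1 THEN MAX(B.Value) ELSE '(multiple)'
           END AS Value
    FROM Foo F JOIN Bar B ON F.ID = B.FooID
    GROUP BY F.ID, F.Name
    ORDER BY F.ID
""").fetchall()

print(rows)  # [(1, 'ONE', 'Alpha'), (2, 'TWO', '(multiple)'), (3, 'THREE', 'Delta')]
```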
You can also try below:
SELECT F.ID, F.Name, (case when B.Value like '%,%' then '(Multiple)' else B.Value end) as Value
FROM Foo F
outer apply
(
select SUBSTRING((
SELECT distinct ', '+ isnull(Value,',') FROM Bar WHERE FooId = F.ID
FOR XML PATH('')
), 2 , 9999) as Value
) as B
I have a table with the following structure, representing a file system.
Every item, whether a file or a folder, has a unique id. If it is a category (folder), it can contain other items.
level indicates the directory depth.
|id |parent_id|is_category|level|
|:-:|:-:|:-:|:-:|
|0 | -1 | true | 0 |
|1 | 0 | true | 1 |
|2 | 0 | true | 1 |
|3 | 1 | true | 2 |
|4 | 2 | false | 2 |
|5 | 3 | true | 3 |
|6 | 5 | false | 4 |
|7 | 5 | false | 4 |
|8 | 5 | true | 4 |
|9 | 5 | false | 4 |
Task:
Fetch all subitems with level <= 3 inside the folder with id == 1.
The result ids should be [1,3,5]
My current implementation uses recursive queries: for the example above, my program would fetch id == 1 first and then find all items with is_category == true and level <= 3.
It doesn't feel like an efficient way.
Any advice will be appreciated.
You don't mention the database you are using so I'll assume PostgreSQL.
You can retrieve the rows you want using a single query that uses a "Recursive CTE". Recursive CTEs are implemented by several database engines, such as Oracle, DB2, PostgreSQL, SQL Server, MariaDB, MySQL, HyperSQL, H2, Teradata, etc.
The query should take a form similar to:
with recursive x as (
select * from t where id = 1
union all
select t.*
from x
join t on t.parent_id = x.id and t.level <= 3
)
select id from x
For the record, the data script I used to test it is:
create table t (
id int,
parent_id int,
level int
);
insert into t (id, parent_id, level) values (0, -1, 0);
insert into t (id, parent_id, level) values (1, 0, 1);
insert into t (id, parent_id, level) values (2, 0, 1);
insert into t (id, parent_id, level) values (3, 1, 2);
insert into t (id, parent_id, level) values (4, 2, 2);
insert into t (id, parent_id, level) values (5, 3, 3);
insert into t (id, parent_id, level) values (6, 5, 4);
insert into t (id, parent_id, level) values (7, 5, 4);
insert into t (id, parent_id, level) values (8, 5, 4);
insert into t (id, parent_id, level) values (9, 5, 4);
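Since SQLite implements the same WITH RECURSIVE syntax, the query and data script can be sanity-checked locally with Python's built-in sqlite3 module (a sketch, not PostgreSQL itself):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t (id INT, parent_id INT, level INT);
    INSERT INTO t VALUES (0, -1, 0), (1, 0, 1), (2, 0, 1), (3, 1, 2),
                         (4, 2, 2), (5, 3, 3), (6, 5, 4), (7, 5, 4),
                         (8, 5, 4), (9, 5, 4);
""")

# Anchor on id = 1, then repeatedly join children whose level <= 3.
ids = [r[0] for r in con.execute("""
    WITH RECURSIVE x AS (
        SELECT * FROM t WHERE id = 1
        UNION ALL
        SELECT t.* FROM x JOIN t ON t.parent_id = x.id AND t.level <= 3
    )
    SELECT id FROM x ORDER BY id
""")]

print(ids)  # [1, 3, 5]
```

This matches the expected result ids [1,3,5] from the question.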
As others have said, recursive CTEs are a fast and typically efficient method to retrieve the data you're looking for. If you wanted to avoid recursive CTEs, since they aren't infinitely scalable and are thus prone to erratic behavior in certain use cases, you could also take a more direct approach by implementing the recursive search via a WHILE loop. Note that this is not more efficient than the recursive CTE, but it gives you more control over what happens in the recursion. In my sample, I am using Transact-SQL.
First, setup code, like @The Impaler provided:
drop table if exists
dbo.folder_tree;
create table dbo.folder_tree
(
id int not null constraint [PK_folder_tree] primary key clustered,
parent_id int not null,
fs_level int not null,
is_category bit not null constraint [DF_folder_tree_is_category] default(0),
constraint [UQ_folder_tree_parent_id] unique(parent_id, id)
);
insert into dbo.folder_tree
(id, parent_id, fs_level, is_category)
values
(0, -1, 0, 1), --|0 | -1 | true | 0 |
(1, 0, 1, 1), --|1 | 0 | true | 1 |
(2, 0, 1, 1), --|2 | 0 | true | 1 |
(3, 1, 2, 1), --|3 | 1 | true | 2 |
(4, 2, 2, 0), --|4 | 2 | false | 2 |
(5, 3, 3, 1), --|5 | 3 | true | 3 |
(6, 5, 4, 0), --|6 | 5 | false | 4 |
(7, 5, 4, 0), --|7 | 5 | false | 4 |
(8, 5, 4, 1), --|8 | 5 | true | 4 |
(9, 5, 4, 0); --|9 | 5 | false | 4 |
And then the code for implementing a recursive search of the table via WHILE loop:
drop function if exists
dbo.folder_traverse;
go
create function dbo.folder_traverse
(
    @start_id int,
    @max_level int = null
)
returns @result table
(
    id int not null primary key,
    parent_id int not null,
    fs_level int not null,
    is_category bit not null
)
as
begin
    insert into @result
    select id, parent_id, fs_level, is_category
    from dbo.folder_tree
    where id = @start_id;

    while @@ROWCOUNT > 0
    begin
        insert into @result
        select f.id, f.parent_id, f.fs_level, f.is_category
        from @result r
        inner join dbo.folder_tree f on r.id = f.parent_id
        where f.is_category = 1
          and (@max_level is null or f.fs_level <= @max_level)
        except
        select id, parent_id, fs_level, is_category
        from @result;
    end;

    return;
end;
go
In closing, the only reason I'd recommend this approach is if you have a large number of recursive members, or need to add logging or some other process in between actions. This approach is slower in most use cases, and adds complexity to the code, but is an alternative to the recursive CTE and meets your required criteria.
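The same frontier-expansion idea, independent of any SQL engine, can be sketched in plain Python; the tuples below mirror the sample table, and folder_traverse is just a name chosen to match the function above:

```python
# Sample rows as (id, parent_id, fs_level, is_category) tuples.
rows = [
    (0, -1, 0, True), (1, 0, 1, True), (2, 0, 1, True), (3, 1, 2, True),
    (4, 2, 2, False), (5, 3, 3, True), (6, 5, 4, False), (7, 5, 4, False),
    (8, 5, 4, True), (9, 5, 4, False),
]

def folder_traverse(start_id, max_level=None):
    result = {r[0] for r in rows if r[0] == start_id}
    frontier = set(result)
    while frontier:  # same role as WHILE @@ROWCOUNT > 0
        frontier = {
            r[0] for r in rows
            if r[1] in frontier                       # child of the last batch
            and r[3]                                  # is_category = 1
            and (max_level is None or r[2] <= max_level)
            and r[0] not in result                    # the EXCEPT clause
        }
        result |= frontier
    return sorted(result)

print(folder_traverse(1, 3))  # [1, 3, 5]
```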
I need to create a temp table with the data in table below in Netezza. The typical way I would create a temp table in Netezza is via
CREATE TEMP TABLE temp_table1 AS
(
-- statement to fill the data
) DISTRIBUTE ON RANDOM;
How do I go about constructing the statement to be used inside so that the data below is available in the temp table ?
+---------+----------+
| bin_val | bin_cnt |
+---------+----------+
| 0 | 2 |
| 4 | 10 |
| 8 | 15 |
| 12 | 12 |
| 16 | 6 |
| 20 | 1 |
+---------+----------+
A PostgreSQL solution would also be helpful.
Is this what you want?
select v.*
from (values (0, 2), (4, 10), (8, 15), (12, 12), (16, 6), (20, 1)
) v(bin_val, bin_cnt)
This will probably not work in Netezza, because it uses a very old version of Postgres. Instead, I think you can do:
select 0 as bin_val, 2 as bin_cnt union all
select 4, 10 union all
select 8, 15 union all
select 12, 12 union all
select 16, 6 union all
select 20, 1
CREATE TEMPORARY TABLE MY_TABLE AS
SELECT
A,
B,
C
FROM
DB1.TABLE1
WHERE A NOTNULL
LIMIT 100;
------ DROP TABLE MY_TABLE
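Neither Netezza nor an old Postgres is needed to try the shape of this pattern; as an illustration, SQLite (via Python's sqlite3) accepts the same inline VALUES constructor inside CREATE TEMP TABLE ... AS (DISTRIBUTE ON is Netezza-specific and omitted here):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# SQLite names VALUES-derived columns column1, column2, ...;
# alias them to the bin_val/bin_cnt names from the question.
con.execute("""
    CREATE TEMP TABLE temp_table1 AS
    SELECT column1 AS bin_val, column2 AS bin_cnt
    FROM (VALUES (0, 2), (4, 10), (8, 15), (12, 12), (16, 6), (20, 1))
""")

rows = con.execute("SELECT * FROM temp_table1 ORDER BY bin_val").fetchall()
print(rows)  # [(0, 2), (4, 10), (8, 15), (12, 12), (16, 6), (20, 1)]
```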
Say you have a table like this:
ID | Type | Reference #1 | Reference #2
0 | 1 | [A] | {a}
1 | 2 | [B] | {b}
2 | 2 | [B] | {c}
3 | 1 | [C] | {d}
4 | 1 | [D] | {d}
5 | 1 | [E] | {d}
6 | 1 | [C] | {e}
Is there any good way to group by "Reference #1" and "Reference #2" as a "fallback", for lack of a better way of putting it...
For example, I would like to group the following IDs together:
{0} [Unique Reference #1],
{1,2} [Same Reference #1],
{3,4,5,6} [{3,4,5} have same Reference #2 and {3,6} have same Reference #1]
I am at a total loss as to how to do this... Any thoughts?
In mellamokb's query, the groupings are dependent on the order of the input.
i.e.:
VALUES
(0, 1, '[A]', '{a}'),
(1, 2, '[B]', '{b}'),
(2, 2, '[B]', '{c}'),
(3, 1, '[C]', '{d}'), // group 3
(4, 1, '[D]', '{d}'), // group 3
(5, 1, '[E]', '{d}'), // group 3
(6, 1, '[C]', '{e}'); // group 3
produces a different result than
VALUES
(0, 1, '[A]', '{a}'),
(1, 2, '[B]', '{b}'),
(2, 2, '[B]', '{c}'),
(3, 1, '[C]', '{e}'), //group 3
(4, 1, '[D]', '{d}'), // group 4
(5, 1, '[E]', '{d}'), // group 4
(6, 1, '[C]', '{d}'); // group 3
This might be intended if there is some natural order to the References that you could specify, but it's a problem if there is not. The way to 'solve' this, or to specify a different problem, is to say that all equal Reference1s create a set of elements whose members are themselves and those elements whose Reference2 is equal to at least one member of that set.
In SQL:
with groupings as (
select
ID,Reference1,Reference2,
(select min(ID) from Table1 t2
where t2.Reference1=t1.Reference1 or t2.Reference2=t1.Reference2 ) as minID
from
Table1 t1
)
select
t1.ID,t1.Reference1,t1.Reference2,t1.minid as round1,
(select min(t2.minid)
from groupings t2
where t2.Reference1 = t1.Reference1 or t2.Reference2 = t1.Reference2
) as minID
from
groupings t1
This should produce the full grouping each time.
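The order-independent grouping described above (rows merge when they share Reference1, and groups then transitively absorb rows sharing Reference2 with any member) is exactly what a union-find structure computes; a sketch in Python over the sample rows:

```python
# Union-find over the sample rows: rows sharing Reference1 or Reference2
# are merged into one group, transitively, independent of input order.
rows = [
    (0, 1, '[A]', '{a}'), (1, 2, '[B]', '{b}'), (2, 2, '[B]', '{c}'),
    (3, 1, '[C]', '{d}'), (4, 1, '[D]', '{d}'), (5, 1, '[E]', '{d}'),
    (6, 1, '[C]', '{e}'),
]

parent = {r[0]: r[0] for r in rows}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Merge each row with the first earlier row carrying the same reference.
seen = {}  # (which_column, reference value) -> first row id carrying it
for rid, _, ref1, ref2 in rows:
    for key in (('r1', ref1), ('r2', ref2)):
        if key in seen:
            union(rid, seen[key])
        else:
            seen[key] = rid

groups = {}
for rid, *_ in rows:
    groups.setdefault(find(rid), []).append(rid)

print(sorted(sorted(g) for g in groups.values()))
# [[0], [1, 2], [3, 4, 5, 6]]
```

Because union-find chases links transitively, reordering the input rows (as in the two VALUES lists above) yields the same groups.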