Idiomatic equivalent to map structure - google-bigquery

My analytics involves aggregating rows and storing the number of occurrences of each distinct value of a field someField across all the rows.
Sample data structure
[someField, someKey]
I'm trying to GROUP BY someKey and then, for each of the results, know how many times each someField value occurred.
Example:
[someField: a, someKey: 1],
[someField: a, someKey: 1],
[someField: b, someKey: 1],
[someField: c, someKey: 2],
[someField: d, someKey: 2]
What I would like to achieve:
[someKey: 1, fields: {a: 2, b: 1}],
[someKey: 2, fields: {c: 1, d: 1}],

Does this work for you?
WITH data AS (
  SELECT 'a' AS someField, 1 AS someKey UNION ALL
  SELECT 'a', 1 UNION ALL
  SELECT 'b', 1 UNION ALL
  SELECT 'c', 2 UNION ALL
  SELECT 'd', 2
)
SELECT
  someKey,
  ARRAY_AGG(STRUCT(someField, freq)) AS fields
FROM (
  SELECT
    someField,
    someKey,
    COUNT(someField) AS freq
  FROM data
  GROUP BY 1, 2
)
GROUP BY 1
Results:
someKey  fields.someField  fields.freq
1        a                 2
         b                 1
2        c                 1
         d                 1
It won't give you exactly the structure you are looking for, but it can serve the same queries your desired result would. As you said, for each key you can retrieve how many times each someField value occurred (the freq column).
I've been looking for a way to aggregate structs and couldn't find one, but retrieving the results as an ARRAY of STRUCTs turned out to be quite straightforward.
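If you later need a map-style lookup from that array, one way is to UNNEST it again. A minimal sketch (my addition, reusing the data CTE from above; not part of the original answer):
WITH data AS (
  SELECT 'a' AS someField, 1 AS someKey UNION ALL
  SELECT 'a', 1 UNION ALL
  SELECT 'b', 1 UNION ALL
  SELECT 'c', 2 UNION ALL
  SELECT 'd', 2
)
SELECT someKey, f.freq
FROM (
  SELECT someKey, ARRAY_AGG(STRUCT(someField, freq)) AS fields
  FROM (
    SELECT someField, someKey, COUNT(someField) AS freq
    FROM data
    GROUP BY 1, 2
  )
  GROUP BY 1
), UNNEST(fields) AS f
WHERE f.someField = 'a'  -- frequency of 'a' per someKey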

There's probably a smarter way to do this (and to get it in the format you want, e.g. using an array for the second column), but this might be enough for you:
WITH sample AS (
  SELECT 'a' AS someField, 1 AS someKey UNION ALL
  SELECT 'a' AS someField, 1 AS someKey UNION ALL
  SELECT 'b' AS someField, 1 AS someKey UNION ALL
  SELECT 'c' AS someField, 2 AS someKey UNION ALL
  SELECT 'd' AS someField, 2 AS someKey
)
SELECT
  someKey,
  SUM(IF(someField = 'a', 1, 0)) AS a,
  SUM(IF(someField = 'b', 1, 0)) AS b,
  SUM(IF(someField = 'c', 1, 0)) AS c,
  SUM(IF(someField = 'd', 1, 0)) AS d
FROM
  sample
GROUP BY
  someKey
ORDER BY
  someKey ASC
Results:
someKey  a  b  c  d
-------------------
1        2  1  0  0
2        0  0  1  1
This is a well-used technique in BigQuery.

I'm trying to GROUP BY someKey and then, for each of the results, know how many times each someField value occurred.
#standardSQL
SELECT
someKey,
someField,
COUNT(someField) freq
FROM yourTable
GROUP BY 1, 2
-- ORDER BY someKey, someField
What I would like to achieve:
[someKey: 1, fields: {a: 2, b: 1}],
[someKey: 2, fields: {c: 1, d: 1}],
This is different from what you expressed in words - it is called pivoting, and based on your comment - The a, b, c, and d keys are potentially infinite - it is most likely not what you need. At the same time, pivoting is easily doable (if you have some finite number of field values), and you can find plenty of related posts.
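For completeness, here is a minimal pivot sketch in BigQuery Standard SQL, assuming a small, known set of field values (the PIVOT operator is a later BigQuery feature, not part of the original answer):
SELECT *
FROM (SELECT someKey, someField FROM yourTable)
PIVOT (COUNT(*) FOR someField IN ('a', 'b', 'c', 'd'))
-- one row per someKey, with columns a, b, c, d holding the counts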

Related

PostgreSQL: Select unique rows where distinct values are in list

Say that I have the following table:
with data as (
select 'John' "name", 'A' "tag", 10 "count"
union all select 'John', 'B', 20
union all select 'Jane', 'A', 30
union all select 'Judith', 'A', 40
union all select 'Judith', 'B', 50
union all select 'Judith', 'C', 60
union all select 'Jason', 'D', 70
)
I know there are a number of distinct tag values, namely (A, B, C, D).
I would like to select the unique names that only have the tag A
I can get close by doing
-- wrong!
select
distinct("name")
from data
group by "name"
having count(distinct tag) = 1
however, this will include unique names that only have 1 distinct tag, regardless of which tag it is.
I am using PostgreSQL, although having more generic solutions would be great.
You're almost there - you already have groups with one tag, now just test if it is the tag you want:
select
distinct("name")
from data
group by "name"
having count(distinct tag) = 1 and max(tag)='A'
(Note: max could be min as well - SQL just doesn't have a single() aggregate function, but that's a different story.)
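In PostgreSQL specifically, the "all tags are A" condition can also be written with bool_and (my own variant, not part of the answer above):
select "name"
from data
group by "name"
having bool_and(tag = 'A');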
You can use not exists here:
select distinct "name"
from data d
where "tag" = 'A'
and not exists (
select * from data d2
where d2."name" = d."name" and d2."tag" != d."tag"
);
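One caveat worth adding (my note, not from the answer): if "tag" could be NULL, the != comparison would never match the NULL rows, and such names could slip through. A null-safe variant uses IS DISTINCT FROM:
select distinct "name"
from data d
where "tag" = 'A'
and not exists (
  select * from data d2
  where d2."name" = d."name" and d2."tag" is distinct from d."tag"
);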
This is one possible way of solving it:
select
distinct("name")
from data
where "name" not in (
-- create list of names we want to exclude
select distinct name from data where "tag" != 'A'
)
But I don't know if it's the best or most efficient one.
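For comparison, the same exclusion can also be phrased with EXCEPT (my sketch; standard SQL that PostgreSQL supports, and EXCEPT already removes duplicates):
select "name" from data where "tag" = 'A'
except
select "name" from data where "tag" != 'A';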

How to query on fields from nested records without referring to the parent records in BigQuery?

I have data structured as follows:
{
"results": {
"A": {"first": 1, "second": 2, "third": 3},
"B": {"first": 4, "second": 5, "third": 6},
"C": {"first": 7, "second": 8, "third": 9},
"D": {"first": 1, "second": 2, "third": 3},
... },
...
}
i.e. nested records, where the lowest level has the same schema for all records in the level above. The schema would be similar to this:
results RECORD NULLABLE
results.A RECORD NULLABLE
results.A.first INTEGER NULLABLE
results.A.second INTEGER NULLABLE
results.A.third INTEGER NULLABLE
results.B RECORD NULLABLE
results.B.first INTEGER NULLABLE
...
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level? Put differently, can I do a query on first for all records in results without having to specify A, B, ... in my query?
I would for example want to achieve something like
SELECT SUM(results.*.first) FROM table
in order to get 1+4+7+1 = 13,
but SELECT results.*.first isn't supported.
(I've tried playing around with STRUCTs, but haven't gotten far.)
The trick below is for BigQuery Standard SQL
#standardSQL
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
You can test and play with the above using the dummy/sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id, (
SELECT AS STRUCT
SUM(first) AS sum_first,
SUM(second) AS sum_second,
SUM(third) AS sum_third
FROM UNNEST([a]||[b]||[c]||[d])
).*
FROM `project.dataset.table`,
UNNEST([results])
with output:
Row  id  sum_first  sum_second  sum_third
1    1   13         17          21
Is there a way to do (e.g. aggregate) queries in BigQuery on fields from the lowest level, without knowledge of the keys on the (direct) parent level?
Below is for BigQuery Standard SQL and totally avoids referencing parent records (A, B, C, D, etc.)
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
Applied to the sample data from your question, as in the example below:
#standardSQL
CREATE TEMP FUNCTION Nested_SUM(entries ANY TYPE, field_name STRING) AS ((
SELECT SUM(CAST(SPLIT(kv, ':')[OFFSET(1)] AS INT64))
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
));
WITH `project.dataset.table` AS (
SELECT 1 AS id, STRUCT(
STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
STRUCT(4 AS first, 5 AS second, 6 AS third) AS B,
STRUCT(7 AS first, 8 AS second, 9 AS third) AS C,
STRUCT(1 AS first, 2 AS second, 3 AS third) AS D
) AS results
)
SELECT id,
Nested_SUM(results, 'first') AS first_sum,
Nested_SUM(results, 'second') AS second_sum,
Nested_SUM(results, 'third') AS third_sum,
Nested_SUM(results, 'forth') AS forth_sum
FROM `project.dataset.table`
output is:
Row  id  first_sum  second_sum  third_sum  forth_sum
1    1   13         17          21         null
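To see why the regular expression works, it helps to look at what TO_JSON_STRING produces for the results struct. A small illustration (my addition):
SELECT TO_JSON_STRING(STRUCT(
  STRUCT(1 AS first, 2 AS second, 3 AS third) AS A,
  STRUCT(4 AS first, 5 AS second, 6 AS third) AS B))
-- {"A":{"first":1,"second":2,"third":3},"B":{"first":4,"second":5,"third":6}}
-- The pattern r'":{(.*?)}' captures each inner key/value list, which is then
-- split on commas and colons to find the requested field_name.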
I adapted Mikhail's answer in order to support grouping on the values of the lowest-level fields:
#standardSQL
CREATE TEMP FUNCTION Nested_AGGREGATE(entries ANY TYPE, field_name STRING) AS ((
SELECT ARRAY(
SELECT AS STRUCT TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') AS value, COUNT(SPLIT(kv, ':')[OFFSET(1)]) AS count
FROM UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(entries), r'":{(.*?)}')) entry,
UNNEST(SPLIT(entry)) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = field_name
GROUP BY TRIM(SPLIT(kv, ':')[OFFSET(1)], '"')
)
));
SELECT id,
Nested_AGGREGATE(results, 'first') AS first_agg,
Nested_AGGREGATE(results, 'second') AS second_agg,
Nested_AGGREGATE(results, 'third') AS third_agg
FROM `project.dataset.table`
Output for the same WITH `project.dataset.table` sample data as in the previous example:
Row  id  first_agg.value  first_agg.count  second_agg.value  second_agg.count  third_agg.value  third_agg.count
1    1   1                2                2                 2                 3                2
         4                1                5                 1                 6                1
         7                1                8                 1                 9                1

SQL querying the same table twice with criteria

I have 1 table
table contains something like:
ID, parent_item, Comp_item
1, 123, a
2, 123, b
3, 123, c
4, 456, a
5, 456, b
6, 456, d
7, 789, b
8, 789, c
9, 789, d
10, a, a
11, b, b
12, c, c
13, d, d
I need to return only the parent_items that have Comp_item values of both a and b
so I should only get:
123
456
Here is a canonical way to do this:
SELECT parent_item
FROM yourTable
WHERE Comp_item IN ('a', 'b')
GROUP BY parent_item
HAVING COUNT(DISTINCT Comp_item) = 2
The idea here is to aggregate by parent_item, restricting to records having a Comp_item of a or b, then asserting that the number of distinct Comp_item values is 2.
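The same pattern generalizes to any required set of values, as long as the HAVING count matches the length of the list. A sketch with a hypothetical third required value d:
SELECT parent_item
FROM yourTable
WHERE Comp_item IN ('a', 'b', 'd')
GROUP BY parent_item
HAVING COUNT(DISTINCT Comp_item) = 3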
Alternatively you could use INTERSECT:
select parent_item from my_table where comp_item = 'a'
intersect
select parent_item from my_table where comp_item = 'b';
If you have a parent item table, the most efficient method is possibly:
select p.*
from parent_items p
where exists (select 1 from t1 where t1.parent_id = p.parent_id and t1.comp_item = 'a') and
exists (select 1 from t1 where t1.parent_id = p.parent_id and t1.comp_item = 'b');
For optimal performance, you want an index on t1(parent_id, comp_item).
I should emphasize that I very much like the aggregation solution by Tim. I bring this up because performance was brought up in a comment. Both intersect and group by expend effort aggregating (in the first case to remove duplicates, in the second explicitly). An approach like this does not incur that cost -- assuming that a table with unique parent ids is available.
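For reference, the index suggested above could be created like this (standard syntax, using the table and column names assumed in the answer):
CREATE INDEX idx_t1_parent_comp ON t1 (parent_id, comp_item);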

how to use Pivot SQL

For Example,
select A,B,C,D,E,YEAR
FROM t1
where t1.year = 2018
UNION ALL
select A,B,C,D,E,YEAR
FROM t2
where t2.year = 2017
executes like this:
A --- B----C----D----E----YEAR
2 --- 4----6----8----10---2018
1 --- 3----5----7----9----2017
I would like to have a result like this
   2018  2017
A  2     1
B  4     3
C  6     5
D  8     7
E  10    9
I know I should use pivot, and I have googled around, but I cannot figure out how to write the code to get a result like the above.
Thanks
Assuming you are using Oracle 11.1 or above, you can use the pivot and unpivot operators. In your problem, the data is already "pivoted" one way, but you want it pivoted the other way; so you must un-pivot first, and then re-pivot the way you want it. In the solution below, the data is read from the table (I use a WITH clause to generate the test data, but you don't need the WITH clause, you can start at SELECT and use your actual table and column names). The data is fed through unpivot and then immediately to pivot - you don't need subqueries or anything like that.
Note about column names: don't use year; it is an Oracle keyword and will cause confusion, if not (much) worse. And in the output, you can't have 2018 and the like as column names - identifiers must begin with a letter. You can get around these limitations using names in double quotes; that is a very poor practice though, best left to the Oracle parser and not used by us humans. You will see I called the input column yr and the output columns y2018 and such.
with
inputs ( a, b, c, d, e, yr ) as (
select 2, 4, 6, 8, 10, 2018 from dual union all
select 1, 3, 5, 7, 9, 2017 from dual
)
select col, y2018, y2017
from inputs
unpivot ( val for col in (a as 'A', b as 'B', c as 'C', d as 'D', e as 'E') )
pivot ( min(val) for yr in (2018 as y2018, 2017 as y2017) )
order by col -- if needed
;
COL Y2018 Y2017
--- ---------- ----------
A 2 1
B 4 3
C 6 5
D 8 7
E 10 9
ADDED:
Here is how this used to be done (before the pivot and unpivot were introduced in Oracle 11.1). Unpivoting was done with a cross join to a small helper table, with a single column and as many rows as there were columns to unpivot in the base table - in this case, five columns, a, b, c, d, e need to be unpivoted, so the helper table has five rows. And pivoting was done with conditional aggregation. Both can be combined into a single query - there is no need for subqueries (other than to create the helper "table" or inline view).
Note, importantly, that the base table is read just once. Other methods of unpivoting are much more inefficient, because they require reading the base table multiple times.
select decode(lvl, 1, 'A', 2, 'B', 3, 'C', 4, 'D', 5, 'E') as col,
max(case yr when 2018 then decode(lvl, 1, a, 2, b, 3, c, 4, d, 5, e) end) as y2018,
max(case yr when 2017 then decode(lvl, 1, a, 2, b, 3, c, 4, d, 5, e) end) as y2017
from inputs cross join ( select level as lvl from dual connect by level <= 5 )
group by decode(lvl, 1, 'A', 2, 'B', 3, 'C', 4, 'D', 5, 'E')
order by decode(lvl, 1, 'A', 2, 'B', 3, 'C', 4, 'D', 5, 'E')
;
This looks worse than it is; the same decode() function is called three times, but with exactly the same arguments, so it is calculated only once, the value is cached and it is reused in the other places. (It is calculated for group by and then reused for select and for order by.)
To test, you can use the same WITH clause as above - or your actual data.
decode() is proprietary to Oracle, but the same can be written with case expressions (essentially identical to the decode() approach, just different syntax) and it will work in most other database products.
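For example, the first decode() above translates to a case expression like this (my sketch of the portable form; it replaces the decode() call wherever it appears in the query):
case lvl when 1 then 'A' when 2 then 'B' when 3 then 'C'
         when 4 then 'D' when 5 then 'E' end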
This is a bit tricky -- unpivoting and repivoting. Here is one way:
select col,
max(case when year = 2018 then val end) as val_2018,
max(case when year = 2017 then val end) as val_2017
from ((select 'A' as col, A as val, YEAR from t1 where year = 2018) union all
(select 'B' as col, B as val, YEAR from t1 where year = 2018) union all
(select 'C' as col, C as val, YEAR from t1 where year = 2018) union all
(select 'D' as col, D as val, YEAR from t1 where year = 2018) union all
(select 'E' as col, E as val, YEAR from t1 where year = 2018) union all
(select 'A' as col, A as val, YEAR from t2 where year = 2017) union all
(select 'B' as col, B as val, YEAR from t2 where year = 2017) union all
(select 'C' as col, C as val, YEAR from t2 where year = 2017) union all
(select 'D' as col, D as val, YEAR from t2 where year = 2017) union all
(select 'E' as col, E as val, YEAR from t2 where year = 2017)
) tt
group by col;
You don't specify the database, but this is pretty database independent.

SQL Server stored procedure: iterate through table, compare values, insert to different table on value change

I can't tell whether my particular situation has already been covered given the titles of other questions, so apologies if an answer already exists.
I have a database that records values as strings, and another table that records runs of particular types within those values.
I require a stored procedure that iterates through the values (I understand this is related to the concept of cursors), recording each value to temporary tables to track the count for a particular run type (odd/even numbers, for example, or vowels/consonants). When a given value indicates that a particular type of run has stopped (i.e. an odd number has ended a run of even numbers, and vice versa), the run is counted, the count is inserted into a runs table with a related type value (0 = odd/even, 1 = vowel/consonant, etc.), the temporary table contents are deleted, and the value that caused the count/clear is inserted into the temp table.
As I am completely new to stored procedures, I don't know exactly how to structure this kind of procedure, and the examples I've found don't:
Describe how to implement Cursors in a straightforward, understandable manner
Provide insights into comparisons between a given value and a stored comparison value
Allow for recognition of changes to an established pattern to initiate a section of a procedure
Let me know if any of this needs clarifying.
EDIT:
Version in use: MS SQL Server 2012
table structure for the raw values:
ID: Int PK AI
DateTimeStamp: Datetime
SelectedValue: Char(2)
UserId: Int
table structure for value runs:
ID: Int PK AI
DateTimeStamp: Datetime
Type: Int
Run: Int
Sample data: [following presented as comma-delimited string for brevity, input by one user]
e, 00, 1, t, r, 2, 4, 3, 5, 7, a, i, w, q, u, o, 23, 25, 24, 36, 12, e ...
groups would be:
vowels/consonants
even numbers/odd numbers
00
numbers under/over 20
numbers/letters
From the above, the runs are:
e (vowels/consonants: vowels)
e (numbers/letters: letters)
00 (00)
1 (odd/even: odd)
1 (numbers/letters: numbers)
t, r (vowels/consonants: consonants)
t, r (numbers/letters: letters)
2, 4 (odd/even: even)
3, 5, 7 (odd/even: odd)
2, 4, 3, 5, 7 (numbers/letters: numbers)
a, i (vowels/consonants: vowels)
w, q (vowels/consonants: consonants)
a, i, w, q, u, o (numbers/letters: letters)
1, 2, 4, 3, 5, 7 (under/over 20: under 20)
23, 25 (odd/even: odd)
23, 25, 24, 36 (under/over 20: over 20)
24, 36, 12 (odd/even: even)
u, o, e (vowels/consonants: vowels)
Which would make entries in the runs table as follows:
Type: vowels/consonants, run: 1
Type: numbers/letters, run: 1
Type: 00, run: 1
Type: odd/even, run: 1
Type: numbers/letters, run: 1
Type: odd/even, run: 2
Type: odd/even, run: 3
Type: numbers/letters, run: 5
Type: vowels/consonants, run: 2
Type: vowels/consonants, run: 2
Type: numbers/letters, run: 6
Type: under/over 20, run: 6
Type: odd/even, run: 2
Type: under/over 20, run: 4
Type: odd/even, run: 3
Type: vowels/consonants, run: 3
EDIT Updated based on clarification of the original question.
This might not be the cleanest solution, but it should get you started:
WITH cteClassifications (ID, GroupNo, Type, Description) As
(
-- Vowels:
SELECT
ID,
1,
1,
'Vowels'
FROM
RawData
WHERE
SelectedValue In ('a', 'e', 'i', 'o', 'u')
UNION ALL
-- Consonants:
SELECT
ID,
1,
2,
'Consonants'
FROM
RawData
WHERE
SelectedValue Between 'a' And 'z'
And
SelectedValue Not In ('a', 'e', 'i', 'o', 'u')
UNION ALL
-- Even numbers:
SELECT
ID,
2,
1,
'Even numbers'
FROM
RawData
WHERE
SelectedValue != '00'
And
SelectedValue Not Between 'a' And 'z'
And
(TRY_PARSE(SelectedValue As tinyint) & 1) = 0
UNION ALL
-- Odd numbers:
SELECT
ID,
2,
2,
'Odd numbers'
FROM
RawData
WHERE
SelectedValue != '00'
And
SelectedValue Not Between 'a' And 'z'
And
(TRY_PARSE(SelectedValue As tinyint) & 1) = 1
UNION ALL
-- "00":
SELECT
ID,
3,
1,
'"00"'
FROM
RawData
WHERE
SelectedValue = '00'
UNION ALL
-- Numbers under 20:
SELECT
ID,
4,
1,
'Numbers under 20'
FROM
RawData
WHERE
SelectedValue != '00'
And
SelectedValue Not Between 'a' And 'z'
And
TRY_PARSE(SelectedValue As tinyint) < 20
UNION ALL
-- Numbers over 20:
SELECT
ID,
4,
2,
'Numbers over 20'
FROM
RawData
WHERE
SelectedValue != '00'
And
SelectedValue Not Between 'a' And 'z'
And
TRY_PARSE(SelectedValue As tinyint) > 20
UNION ALL
-- Numbers:
SELECT
ID,
5,
1,
'Numbers'
FROM
RawData
WHERE
SelectedValue != '00'
And
SelectedValue Not Between 'a' And 'z'
And
TRY_PARSE(SelectedValue As tinyint) Is Not Null
UNION ALL
-- Letters:
SELECT
ID,
5,
2,
'Letters'
FROM
RawData
WHERE
SelectedValue Between 'a' And 'z'
),
cteOrderedClassifications (ID, GroupNo, Type, Description, PrevType, RN) As
(
SELECT
ID,
GroupNo,
Type,
Description,
LAG(Type, 1, 0) OVER (PARTITION BY GroupNo ORDER BY ID),
ROW_NUMBER() OVER (PARTITION BY GroupNo ORDER BY ID)
FROM
cteClassifications
),
cteGroupedClassifications (ID, GroupNo, Type, Description, RN, ORN) As
(
SELECT
ID,
GroupNo,
Type,
Description,
RN,
RN
FROM
cteOrderedClassifications As C
WHERE
Type != PrevType
UNION ALL
SELECT
C.ID,
G.GroupNo,
G.Type,
G.Description,
G.RN,
C.RN
FROM
cteGroupedClassifications As G
INNER JOIN cteOrderedClassifications As C
ON C.GroupNo = G.GroupNo
And C.Type = G.Type
And C.RN = G.ORN + 1
),
cteRuns (ID, GroupNo, Type, Description, RN, Run) As
(
SELECT
Min(ID),
GroupNo,
Type,
MAX(Description),
RN,
Count(1)
FROM
cteGroupedClassifications
GROUP BY
GroupNo,
Type,
RN
)
SELECT
ROW_NUMBER() OVER (ORDER BY ID) As ID,
GroupNo,
Type,
Description,
Run
FROM
cteRuns
ORDER BY
ID
;
Once you're happy that the query is working, you can replace the final SELECT with an INSERT INTO Runs (ID, Type, Run) SELECT ID, Type, Run FROM cteRuns to populate the table in a single pass.
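Spelled out, that last step might look like this (a sketch; the WITH chain above stays the same, and if Runs.ID is an identity column you would omit it or enable IDENTITY_INSERT first):
INSERT INTO Runs (ID, Type, Run)
SELECT ROW_NUMBER() OVER (ORDER BY ID), Type, Run
FROM cteRuns;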
SQL Fiddle example