Querying missing rows within a group in BigQuery

I have a table that looks something like this:
with base_tbl as (
select
"A" as name, 123 as roll_num, "chemistry" as subject, 1 as slot
union all
select
"A" as name, 123 as roll_num, "chemistry" as subject, 2 as slot
union all
select
"A" as name, 123 as roll_num, "physics" as subject, 1 as slot
union all
select
"B" as name, 234 as roll_num, "physics" as subject, 1 as slot
union all
select
"B" as name, 234 as roll_num, "physics" as subject, 2 as slot
)
The column subject can only take values physics or chemistry and the column slot can take values 1 or 2.
I'm looking for recommendations on how to flag students who have either a subject or a slot missing. For the example above, the expected output would be:
student  roll_num  subject_missing  slot_missing
A        123       physics          2
B        234       chemistry        1
B        234       chemistry        2
My real data has about 170M rows, with several other grouping columns (student and roll_num here). Essentially I am trying to gauge the "completeness" of the dataset.

Using a set operation,
SELECT t.* REPLACE(sj AS subject, sl AS slot)
FROM base_tbl t,UNNEST(["physics", "chemistry"]) sj, UNNEST([1, 2]) sl
EXCEPT DISTINCT
SELECT * FROM base_tbl;

A slightly optimized version of the original answer:
SELECT *
FROM (SELECT DISTINCT * EXCEPT(subject, slot) FROM base_tbl) t,
UNNEST(["physics", "chemistry"]) subject, UNNEST([1, 2]) slot
EXCEPT DISTINCT
SELECT * FROM base_tbl WHERE subject IN ("physics", "chemistry");
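The set-operation idea in the answers above (build every combination that should exist, then subtract the ones that do) can be tried locally. Below is a minimal sketch using Python's built-in sqlite3 rather than BigQuery, so the UNNEST arrays are replaced by small inline subqueries; that substitution is an assumption made only to keep the example runnable. It uses the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base_tbl (name TEXT, roll_num INTEGER, subject TEXT, slot INTEGER)"
)
conn.executemany(
    "INSERT INTO base_tbl VALUES (?, ?, ?, ?)",
    [
        ("A", 123, "chemistry", 1),
        ("A", 123, "chemistry", 2),
        ("A", 123, "physics", 1),
        ("B", 234, "physics", 1),
        ("B", 234, "physics", 2),
    ],
)

# Every (student, subject, slot) combination that should exist,
# minus the combinations that actually exist.
missing = conn.execute(
    """
    SELECT s.name, s.roll_num, sj.subject, sl.slot
    FROM (SELECT DISTINCT name, roll_num FROM base_tbl) AS s
    CROSS JOIN (SELECT 'physics' AS subject UNION ALL SELECT 'chemistry') AS sj
    CROSS JOIN (SELECT 1 AS slot UNION ALL SELECT 2) AS sl
    EXCEPT
    SELECT name, roll_num, subject, slot FROM base_tbl
    ORDER BY 1, 2, 3, 4
    """
).fetchall()

for row in missing:
    print(row)
```

Taking the distinct students first keeps the cross join small: it produces one row per student per combination instead of one per input row per combination, which is the point of the optimized version.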

Consider the approach below:
select * except(missing_subjects_slots)
from (
select name, roll_num,
array(
select as struct subject as subject_missing, slot as slot_missing
from unnest(['chemistry', 'physics']) subject, unnest([1, 2]) slot
where not (subject, slot) in (
select as struct subject, slot
from t1.subjects_slots
)
) missing_subjects_slots
from (
select name, roll_num,
array_agg(struct(subject, slot)) subjects_slots,
from base_tbl
group by name, roll_num
) t1
) t2, t2.missing_subjects_slots


How to remove Repeated field in BigQuery schema?

I have a schema that has a repeated field nested inside another repeated field, like so: person.children.toys. I want to make this inner field not repeated (so each child can have only a single, nullable toy). I know that for such a change I need to create a new table with the new schema and run a SQL query that inserts the modified results into it, but I don't know how to write the query. It needs to select the first toy (or null) for each child and insert the resulting objects into the new table. It is guaranteed that in the source table no child has more than 1 toy.
Below is for BigQuery Standard SQL
I know - it might look over-complicated - but it totally preserves original schema while eliminating all but first (or null) toys. This can be handy if your real schema has more than just few fields so you don't need to worry about them
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, STRUCT([STRUCT('mike' AS name, ['woody'] AS toys)] AS children) AS person UNION ALL
SELECT 2 id, STRUCT([STRUCT('nik', ['buzz', 'bobeep']), ('john', ['car', 'buzz', 'bobeep'])] AS children) AS person UNION ALL
SELECT 3 id, STRUCT([STRUCT('vincent', IF(TRUE,[],['']))] AS children) AS person
)
SELECT *
REPLACE(
(SELECT AS STRUCT *
REPLACE (
(SELECT ARRAY_AGG(t) FROM
(SELECT * REPLACE((SELECT toy FROM UNNEST(toys) toy WITH OFFSET ORDER BY OFFSET LIMIT 1) AS toys) FROM UNNEST(children)) t)
AS children)
FROM UNNEST([person]))
AS person)
FROM `project.dataset.table`
If applied to the data below
Row  id  person.children.name  person.children.toys
1    1   mike                  toy1
2    2   nik                   toy2
                               toy3
         john                  toy4
                               toy5
                               toy6
3    3   vincent
the result will be
Row  id  person.children.name  person.children.toys
1    1   mike                  toy1
2    2   nik                   toy2
         john                  toy4
3    3   vincent               null
Note: the toys field, originally REPEATED STRING, becomes just STRING
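To make the transformation concrete outside SQL, here is a rough Python sketch of the same "keep the first toy or null" rule applied to nested records. The dictionary layout and the helper name `keep_first_toy` are hypothetical; the rows mirror the sample CTE in the query above.

```python
people = [
    {"id": 1, "person": {"children": [{"name": "mike", "toys": ["woody"]}]}},
    {"id": 2, "person": {"children": [
        {"name": "nik", "toys": ["buzz", "bobeep"]},
        {"name": "john", "toys": ["car", "buzz", "bobeep"]},
    ]}},
    {"id": 3, "person": {"children": [{"name": "vincent", "toys": []}]}},
]


def keep_first_toy(row):
    """Return a copy of the row where each child's toy list becomes one nullable value."""
    children = [
        # First element if the list is non-empty, otherwise None (SQL NULL)
        {**child, "toys": child["toys"][0] if child["toys"] else None}
        for child in row["person"]["children"]
    ]
    # All other fields of the row and of person are carried over unchanged
    return {**row, "person": {**row["person"], "children": children}}


flattened = [keep_first_toy(row) for row in people]
for row in flattened:
    print(row)
```

Like the SELECT * REPLACE query, the helper copies every other field untouched and only swaps out the one nested field.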
I could give you a better answer if you had a better described schema, but with the data provided:
CREATE OR REPLACE TABLE `temp.flat` AS
WITH data AS (
SELECT 1 id, STRUCT([STRUCT(['woody']AS toy)] AS children) AS person
UNION ALL
SELECT 2 id, STRUCT([STRUCT(['buzz', 'bobeep'])] AS children) AS person
UNION ALL
SELECT 3 id, STRUCT([STRUCT(IF(true,[],['']))] AS children) AS person
)
SELECT id, person.children[SAFE_OFFSET(0)].toy[SAFE_OFFSET(0)] first_toy
FROM `data`

Exclude certain products based on date range

For example, I have sales data for 1 year, and some of the products are not available during a specific date range.
I currently handle a single date range, but what is the best practice if I have multiple exclusions?
SELECT * FROM XXX
WHERE
IF(Date BETWEEN '2018-11-22' AND '2019-03-28',
ID IN (8467,8468,8469,8470),
ID IN (8467,8468,8469,8470,9551,9552,9553)
)
In particular, how do I handle the case where the date ranges overlap?
If you are trying to exclude values, I am thinking:
SELECT *
FROM XXX
WHERE ID IN (8467, 8468, 8469, 8470, 9551, 9552, 9553) AND
(Date BETWEEN '2018-11-22' AND '2019-03-28' AND
ID NOT IN (9551, 9552, 9553) OR
Date NOT BETWEEN '2018-11-22' AND '2019-03-28'
);
You can add multiple pairs for other dates.
For a full solution, you might want to create a table with columns such as:
product_id
start_exclusion_date
end_exclusion_date
And then phrase the query as:
select xxx.*
from xxx left join
exclusions e
on xxx.id = e.product_id and
xxx.date >= e.start_exclusion_date and
xxx.date <= e.end_exclusion_date
where xxx.id in ( . . . ) and
e.product_id is null;  -- keep only rows with no matching exclusion
This is likely to be easier to maintain in the long term.
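As a rough illustration of the exclusion-table idea, the sketch below uses Python's built-in sqlite3 with made-up rows. The `sales` table name and all sample values are assumptions; only the `exclusions` column names come from the answer. It writes the anti-join as NOT EXISTS, and the two exclusion ranges overlap on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (id INTEGER, date TEXT);
    CREATE TABLE exclusions (
        product_id INTEGER,
        start_exclusion_date TEXT,
        end_exclusion_date TEXT
    );
    INSERT INTO sales VALUES
        (8467, '2018-12-01'),
        (9551, '2018-11-01'),
        (9551, '2018-12-01'),
        (9551, '2019-03-15'),
        (9551, '2019-05-01');
    -- Two overlapping exclusion windows for the same product
    INSERT INTO exclusions VALUES
        (9551, '2018-11-22', '2019-03-28'),
        (9551, '2019-03-01', '2019-04-15');
    """
)

# Keep a sale only if no exclusion window covers it.
# ISO-8601 date strings compare correctly as text.
kept = conn.execute(
    """
    SELECT s.id, s.date
    FROM sales AS s
    WHERE NOT EXISTS (
        SELECT 1
        FROM exclusions AS e
        WHERE e.product_id = s.id
          AND s.date BETWEEN e.start_exclusion_date AND e.end_exclusion_date
    )
    ORDER BY s.id, s.date
    """
).fetchall()

for row in kept:
    print(row)
```

Because the filter only asks whether any exclusion covers the row, overlapping windows need no special handling: the sale on 2019-03-15 falls inside both windows and is simply dropped once.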
Try this,
select * from xxx
where not(date between '2018-11-22' and '2019-03-28' and id in(9551,9552,9553))
order by id, date
Below is an example for BigQuery Standard SQL and shows direction for building "complete picture" with whitelist and blacklist rules (all with quite simplified dummy data just to demonstrate it in action)
#standardSQL
WITH `project.dataset.xxx` AS (
SELECT 1 id, DATE '2018-11-22' `date` UNION ALL
SELECT 2, '2018-11-23' UNION ALL
SELECT 3, '2018-11-24' UNION ALL
SELECT 4, '2018-11-25' UNION ALL
SELECT 1, '2018-11-26' UNION ALL
SELECT 2, '2018-11-27' UNION ALL
SELECT 3, '2018-11-28' UNION ALL
SELECT 8, '2018-11-29'
), `project.dataset.whitelist` AS (
SELECT DATE '2018-11-22' start, DATE '2018-11-29' finish, [2,3] ids UNION ALL
SELECT '2018-11-22', '2018-11-22', [1]
), `project.dataset.blacklist` AS (
SELECT DATE '2018-11-26' start, DATE '2018-11-28' finish, [1,3] ids UNION ALL
SELECT '2018-11-22', '2018-11-22', [10]
)
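SELECT DISTINCT t.*
FROM `project.dataset.xxx` t
JOIN `project.dataset.whitelist` w
ON (t.`date` BETWEEN w.start AND w.finish AND t.id IN UNNEST(w.ids))
LEFT JOIN (
SELECT start, finish, id FROM `project.dataset.blacklist`, UNNEST(ids) id
) b
ON t.id = b.id AND t.`date` BETWEEN b.start AND b.finish
WHERE b.id IS NULL -- anti-join: drop rows matched by any blacklist rule
with result
Row id date
1 1 2018-11-22
2 2 2018-11-27
3 2 2018-11-23
4 3 2018-11-24
Note that the blacklist has to be applied as an anti-join: a plain JOIN ... ON NOT(...) keeps a row as long as at least one blacklist rule does not match it, so with more than one rule nothing gets excluded (here, id 3 on 2018-11-28 would slip through).
Obviously, in a real case all the tables involved are real tables and the query will look just like below
#standardSQL
SELECT DISTINCT t.*
FROM `project.dataset.xxx` t
JOIN `project.dataset.whitelist` w
ON (t.`date` BETWEEN w.start AND w.finish AND t.id IN UNNEST(w.ids))
LEFT JOIN (
SELECT start, finish, id FROM `project.dataset.blacklist`, UNNEST(ids) id
) b
ON t.id = b.id AND t.`date` BETWEEN b.start AND b.finish
WHERE b.id IS NULL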

SQL with nested condition

EDIT: added third requirement after playing with solution from Tim Biegeleisen
EDIT2: modified Robbie's DOB to be before his parent's marriage date
I am trying to create a query that will look at two tables and determine the difference in dates based on a percentage. I know, super confusing... Let me try and explain using the tables below:
Bob and Mary are married on 2010-01-01 and expect 4 kids (Parent table)
I want to know how many years it took until they met 50% of their expected kids (i.e. 2/4 kids). Using the Child table to see the DOBs of their 4 kids, we know that Frankie is the second child, which meets our 50% threshold, so we take the difference between Frankie's DOB and his parents' marriage date and end up with 3 years!
If the goal isn't reached then display no value e.g. Mick and Jo only had 1 child so far so they haven't yet reached their goal
Hoping this is doable using BigQuery standard SQL.
Parent table
id married_couple married_at expected_kids
--------------------------------------
1 Bob and Mary 2010-01-01 4
2 Mick and Jo 2010-01-01 4
Child table
id child_name parent_id date_of_birth
--------------------------------------
1 Eddie 1 2012-01-01
2 Frankie 1 2013-01-01
3 Robbie 1 2005-01-01
4 Duncan 1 2015-01-01
5 Rick 2 2014-01-01
Expected SQL result
parent_id half_goal_reached(years)
--------------------------------------
1 3
2
Below are two solutions for BigQuery Standard SQL
The first one is in a more classic SQL style, the second one is more BigQuery style (I think)
First Solution: with an analytic function
#standardSQL
SELECT
parent_id,
IF(
MAX(pos) = MAX(CAST(expected_kids / 2 AS INT64)),
MAX(DATE_DIFF(date_of_birth, married_at, YEAR)),
NULL
) AS half_goal_reached
FROM (
SELECT c.parent_id, c.date_of_birth, expected_kids, married_at,
ROW_NUMBER() OVER(PARTITION BY c.parent_id ORDER BY c.date_of_birth) AS pos
FROM `child` AS c
JOIN `parent` AS p
ON c.parent_id = p.id
)
WHERE pos <= CAST(expected_kids / 2 AS INT64)
GROUP BY parent_id
Second Solution: with use of ARRAY
#standardSQL
SELECT
parent_id,
DATE_DIFF(dates[SAFE_ORDINAL(CAST(expected_kids / 2 AS INT64))], married_at, YEAR) AS half_goal_reached
FROM (
SELECT
parent_id,
ARRAY_AGG(date_of_birth ORDER BY date_of_birth) AS dates,
MAX(expected_kids) AS expected_kids,
MAX(married_at) AS married_at
FROM `child` AS c
JOIN `parent` AS p
ON c.parent_id = p.id
GROUP BY parent_id
)
Dummy Data
You can test / play with both solutions using below dummy data
#standardSQL
WITH `parent` AS (
SELECT 1 id, 'Bob and Mary' married_couple, DATE '2010-01-01' married_at, 4 expected_kids UNION ALL
SELECT 2, 'Mick and Jo', DATE '2010-01-01', 4
),
`child` AS (
SELECT 1 id, 'Eddie' child_name, 1 parent_id, DATE '2012-01-01' date_of_birth UNION ALL
SELECT 2, 'Frankie', 1, DATE '2013-01-01' UNION ALL
SELECT 3, 'Robbie', 1, DATE '2014-01-01' UNION ALL
SELECT 4, 'Duncan', 1, DATE '2015-01-01' UNION ALL
SELECT 5, 'Rick', 2, DATE '2014-01-01'
)
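For readers who want to check the logic outside BigQuery, here is a small Python sketch of the second (array-based) solution using the dummy data above, in which Robbie is born in 2014. The helper name `half_goal_reached` is hypothetical. The rounding line is meant to mimic how CAST(... AS INT64) rounds halves away from zero for positive values, and the year subtraction mirrors DATE_DIFF(..., YEAR), which counts calendar-year boundaries.

```python
from datetime import date

parents = {
    1: {"married_at": date(2010, 1, 1), "expected_kids": 4},
    2: {"married_at": date(2010, 1, 1), "expected_kids": 4},
}
children = [
    ("Eddie", 1, date(2012, 1, 1)),
    ("Frankie", 1, date(2013, 1, 1)),
    ("Robbie", 1, date(2014, 1, 1)),
    ("Duncan", 1, date(2015, 1, 1)),
    ("Rick", 2, date(2014, 1, 1)),
]


def half_goal_reached(parent_id):
    """Years from marriage to the birth that reaches half the expected kids, or None."""
    parent = parents[parent_id]
    # Equivalent of ARRAY_AGG(date_of_birth ORDER BY date_of_birth)
    births = sorted(dob for _, pid, dob in children if pid == parent_id)
    # Round half up (same as rounding away from zero for positive numbers)
    needed = int(parent["expected_kids"] / 2 + 0.5)
    # Like SAFE_ORDINAL: an out-of-range position yields no value
    if needed < 1 or len(births) < needed:
        return None
    return births[needed - 1].year - parent["married_at"].year


result = {pid: half_goal_reached(pid) for pid in parents}
print(result)
```

With this data the second-born child for parent 1 is Frankie (2013), giving 3 years, while parent 2 has only one child and gets no value.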
Try the following query; its logic takes a few steps to explain. I join the parent and child tables, lining up the parent id, the number of years elapsed since marriage, the running number of children, and the expected number of children. With this information in hand, we can easily find the first row whose running number of children matches or exceeds half of the expected number.
SELECT parent_id, num_years AS half_goal_reached
FROM
(
SELECT parent_id, num_years, cnt, expected_kids,
ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY num_years) rn
FROM
(
SELECT
t2.parent_id,
EXTRACT(YEAR FROM t2.date_of_birth) - EXTRACT(YEAR FROM t1.married_at) AS num_years,
(SELECT COUNT(*) FROM child c
WHERE c.parent_id = t2.parent_id AND
c.date_of_birth <= t2.date_of_birth) AS cnt,
t1.expected_kids
FROM parent t1
INNER JOIN child t2
ON t1.id = t2.parent_id
) t
WHERE
cnt >= expected_kids / 2
) t
WHERE t.rn = 1;
Note that there may be issues with how I computed the yearly differences, or how I compute the threshold for half the number of expected children. Also, if we were using a recent enterprise database we could have used an analytic function to get the running number of children instead of a correlated subquery, but I was unsure if BigQuery would support that, so I used the latter.

Count number of records with specific values

I have a table:
Table Teams
Id_team member_1 member_2 member_3
1 Alice Ben
2 Ben
3 Charles Alice Ben
4 Ben Alice
I will need to know in how many different teams Alice is a member (doesn't count if she is the first member, second or third). In my sample, the right answer is 2 (with Ben in Id_team 1 and 4, with Ben and Charles in Id_team = 3). Thank you!
You have to count the "Alice" values in each column separately to ensure they are distinct per column.
What you appear to be checking is:
SELECT
COUNT(DISTINCT CASE WHEN member_1 = 'Alice' THEN member_1 END) +
COUNT(DISTINCT CASE WHEN member_2 = 'Alice' THEN member_2 END) +
COUNT(DISTINCT CASE WHEN member_3 = 'Alice' THEN member_3 END)
FROM tablename
WHERE 'Alice' IN(member_1, member_2, member_3);
Update: fixed COUNT
Okay, so you want the same team with members in different positions (e.g. Alice & Ben, Ben & Alice) to count as one.
To do this, take the rows with Alice in each position, put the other two members in ascending order, and count the distinct results (this returns 2 for your example):
SELECT COUNT(*) FROM
(
SELECT
least( member_2, member_3) AS l,
greatest(member_2, member_3) AS g
FROM teams
WHERE
member_1 = 'Alice'
UNION
SELECT
least( member_1, member_3) AS l,
greatest(member_1, member_3) AS g
FROM teams
WHERE
member_2 = 'Alice'
UNION
SELECT
least( member_1, member_2) AS l,
greatest(member_1, member_2) AS g
FROM teams
WHERE
member_3 = 'Alice'
) q
;
Note that this only works for the special case of 3-member teams, because least and greatest can pick out the two other members; for a member count of 4 or more, a more complex solution is needed.
You can try to concatenate the fields (sorted alphabetically) in order to turn each team into a single string.
Then run a DISTINCT on these strings (so it lists each separate team once).
Then count how many of the strings contain Alice.
The hardest part is the alphabetical concatenation, as I couldn't really find a good function for it, but a GROUP_CONCAT over separate SELECTs and UNIONs that convert the fields into rows should do it:
SELECT COUNT(*)
FROM (
SELECT DISTINCT team_as_string
FROM (
SELECT id tid, GROUP_CONCAT(q ORDER BY q ASC SEPARATOR ',') team_as_string
FROM (
SELECT id, member_1 q FROM teams
UNION SELECT id, member_2 q FROM teams
UNION SELECT id, member_3 q FROM teams
/* add more fields if needed */
) c
GROUP BY tid
) b
) a
WHERE team_as_string LIKE '%Alice%'
I haven't checked it for syntax errors, but it should be fine logically. Tested and gives the correct answer (2)
This can be enhanced for more members, if needed.
Of course if the members are in a separate join table, then the whole group_concat part can be simplified.
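The "treat each team as an unordered set of members" idea is easy to sanity-check in a few lines of Python. This is only a sketch: it assumes the blank cells in the sample table are NULL or empty, and it ignores them when building each set.

```python
# Sample rows from the question; None stands for a blank member cell (assumption)
teams = {
    1: ("Alice", "Ben", None),
    2: ("Ben", None, None),
    3: ("Charles", "Alice", "Ben"),
    4: ("Ben", "Alice", None),
}

# A frozenset ignores member order, so (Alice, Ben) and (Ben, Alice) collapse into one
compositions = {
    frozenset(member for member in members if member)
    for members in teams.values()
}

alice_teams = [team for team in compositions if "Alice" in team]
alice_count = len(alice_teams)
print(alice_count)
```

Teams 1 and 4 collapse into the same set {Alice, Ben}, team 3 contributes {Alice, Ben, Charles}, and team 2 does not contain Alice, which gives the expected count of 2. Unlike the LEAST/GREATEST query, this works for any number of member columns.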
I will need to know in how many different teams Alice is a member
Try this:
SELECT 'Alice', COUNT(id_team)
FROM tablename
WHERE 'Alice' IN(member_1, member_2, member_3);
The result:
| ALICE | THECOUNT |
--------------------
| Alice | 3 |
Fiddle Demo.
If id_team is not unique, use COUNT(DISTINCT id_team).

SQL Percentage of True columns

I have a table where each row has a description field as well as a boolean value. I'm trying to write a query where I can group by each respective description, and see the percentage of times that the boolean was true.
Example table:
PID Gender SeniorCitizen
1 M 1
2 M 1
3 F 0
4 F 1
5 M 0
And I want a query that will return this:
Gender SeniorPct
M .66
F .50
I've got to the point where I have a query that will calculate the individual percentages for a male or female - but I want to see both results at once
SELECT Gender, COUNT(*) * 1.0 /
(SELECT COUNT(*) FROM MyTable WHERE Gender='M')
FROM MyTable WHERE Gender='M' and SeniorCitizen=1;
I've been trying to include a "GROUP BY Gender" statement in my outer SELECT above, but I can't seem to figure out how to tweak the inner SELECT to get the correct results after tweaking the outer SELECT as such.
(I tested this under MySQL; please check whether the same idea can be applied to SQLite.)
To find the number of seniors (per gender), we can treat the bits as numbers and simply sum them up:
SELECT
Gender,
SUM(SeniorCitizen) Seniors
FROM MyTable
GROUP BY Gender
GENDER SENIORS
M 2
F 1
Based on that, we can easily calculate percentages:
SELECT
Gender,
SUM(SeniorCitizen) / COUNT(*) * 100 SeniorsPct
FROM MyTable
GROUP BY Gender
GENDER SENIORSPCT
M 66.6667
F 50
You can play with it in this SQL Fiddle.
UPDATE: Very similar idea works under SQLite as well. Please take a look at another SQL Fiddle.
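One thing to watch when porting the MySQL expression: SQLite divides two integers as integers, so SUM(SeniorCitizen) / COUNT(*) truncates to 0 there unless one side is made a real number. The sketch below, using Python's built-in sqlite3 and the sample rows from the question, shows the pitfall and one possible workaround with AVG, which returns a floating-point value. It is an illustration, not the content of the linked fiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (PID INTEGER, Gender TEXT, SeniorCitizen INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(1, "M", 1), (2, "M", 1), (3, "F", 0), (4, "F", 1), (5, "M", 0)],
)

# Integer division: 2 / 3 and 1 / 2 both truncate to 0 before the * 100
integer_pct = dict(
    conn.execute(
        "SELECT Gender, SUM(SeniorCitizen) / COUNT(*) * 100 FROM MyTable GROUP BY Gender"
    )
)

# AVG of a 0/1 column is the fraction of rows where the flag is set
float_pct = dict(
    conn.execute("SELECT Gender, AVG(SeniorCitizen) FROM MyTable GROUP BY Gender")
)

print(integer_pct)
print(float_pct)
```

Multiplying by 1.0 before dividing, as in the question's own attempt, is another way to force real-number division.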
Try the following:
CREATE TABLE #MyTable
(
PID INT,
Gender VARCHAR(1),
SeniorCitizen BIT
)
INSERT INTO #MyTable
(
PID,
Gender,
SeniorCitizen
)
SELECT 1, 'M', 1 UNION
SELECT 2, 'M', 1 UNION
SELECT 3, 'F', 0 UNION
SELECT 4, 'F', 1 UNION
SELECT 5, 'M', 0
SELECT
Gender,
COUNT(CASE WHEN SeniorCitizen = 1 THEN 1 END), -- Count of SeniorCitizens grouped by Gender
COUNT(1), -- Count of all users grouped by Gender
CONVERT(DECIMAL(3, 2), -- Optional; DECIMAL(3, 2) rather than (2, 2) so a ratio of exactly 1.00 still fits
COUNT(CASE WHEN SeniorCitizen = 1 THEN 1 END) * 1.0 / COUNT(1) -- Your ratio
)
FROM
#MyTable
GROUP BY
Gender