Lead & Analytical Functions in BigQuery - sql

Assume my table is this
I am trying to modify my table with this information
I have added two columns. WhenWasLastBasicSubjectDone shows the semester in which the student most recently completed a Basic course (ordered by Semester). TotalBasicSubjectsDoneTillNow shows how many times the student has completed a Basic course (subject) so far (ordered by Semester).
I think this is easy to solve with Joins as well as with UDFs but I want to use the power of existing analytical functions in BigQuery and solve it without joins.

You can use window functions for this -- assuming you have a column that specifies ordering. Let me assume that column is semester:
select t.*,
max( case when subject = 'Basic' then semester end ) over (partition by student order by semester) as lastbasic,
sum( case when subject = 'Basic' then 1 else 0 end ) over (partition by student order by semester) as numbasictillnow
from t

Below is for BigQuery Standard SQL
#standardSQL
SELECT *,
LAST_VALUE(IF(subject='Basic',semester,NULL) IGNORE NULLS) OVER(win) AS WhenWasLastBasicSubjectDone ,
COUNTIF(subject='Basic') OVER(win) AS TotalBasicSubjectsDoneTillNow
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY student ORDER BY semester)
You can test and play with the above using dummy data from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 Student, 'Sub1' Subject, 'Sem1' Semester UNION ALL
SELECT 1, 'Sub2', 'Sem2' UNION ALL
SELECT 1, 'Basic', 'Sem3' UNION ALL
SELECT 1, 'Basic', 'Sem4' UNION ALL
SELECT 1, 'Sub3', 'Sem5' UNION ALL
SELECT 1, 'Sub2', 'Sem6' UNION ALL
SELECT 1, 'Sub3', 'Sem7' UNION ALL
SELECT 1, 'Sub4', 'Sem8'
)
SELECT *,
LAST_VALUE(IF(subject='Basic',semester,NULL) IGNORE NULLS) OVER(win) AS WhenWasLastBasicSubjectDone ,
COUNTIF(subject='Basic') OVER(win) AS TotalBasicSubjectsDoneTillNow
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY student ORDER BY semester)
-- ORDER BY Semester
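
For anyone without a BigQuery project at hand, here is a minimal sketch that runs the same running-window idea locally. It assumes Python's bundled sqlite3 module with SQLite 3.25 or newer (needed for window functions), and it uses the portable MAX(CASE ...) and SUM(CASE ...) forms from the first answer rather than LAST_VALUE ... IGNORE NULLS and COUNTIF, which are BigQuery syntax. The table name t, the function name running_basic_stats, and the output aliases are made up for the illustration. Note that semester is compared as a string here, so a value like 'Sem10' would sort before 'Sem2'.

```python
import sqlite3

# Same dummy rows as in the answer above: (student, subject, semester).
ROWS = [
    (1, "Sub1", "Sem1"), (1, "Sub2", "Sem2"), (1, "Basic", "Sem3"),
    (1, "Basic", "Sem4"), (1, "Sub3", "Sem5"), (1, "Sub2", "Sem6"),
    (1, "Sub3", "Sem7"), (1, "Sub4", "Sem8"),
]

# With ORDER BY and no explicit frame, the window covers every row from the
# start of the partition up to the current row, which gives running values.
QUERY = """
SELECT student, subject, semester,
       MAX(CASE WHEN subject = 'Basic' THEN semester END)
           OVER (PARTITION BY student ORDER BY semester) AS last_basic,
       SUM(CASE WHEN subject = 'Basic' THEN 1 ELSE 0 END)
           OVER (PARTITION BY student ORDER BY semester) AS basic_so_far
FROM t
ORDER BY student, semester
"""


def running_basic_stats(rows):
    """Load rows into an in-memory table (assumed name: t) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (student INTEGER, subject TEXT, semester TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in running_basic_stats(ROWS):
        print(row)
```

Rows before the first Basic course get NULL (None in Python) for last_basic and 0 for basic_so_far; from Sem4 onward both columns stay at 'Sem4' and 2.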

Related

BigQuery SQL: Sum of first N related items

I would like to know the sum of a value over the first n items in a related table. For example, I want to get the sum of a company's first 6 invoices (the invoices can be sorted by ID asc).
Current SQL:
SELECT invoices.company_id, SUM(invoices.amount)
FROM invoices
JOIN companies on invoices.company_id = companies.id
GROUP BY invoices.company_id
This seems simple but I can't wrap my head around it.
Consider also below approach
select company_id, (
select sum(amount)
from t.amounts amount
) as top_six_invoices_amount
from (
select invoices.company_id,
array_agg(invoices.amount order by invoices.invoice_id limit 6) amounts
from your_table invoices
group by invoices.company_id
) t
You can assign row numbers to the rows in each partition, ordered by invoice id, and filter on that number, something like this:
with array_table as (
select 'a' field, * from unnest([3, 2, 1 ,4, 6, 3]) id
union all
select 'b' field, * from unnest([1, 2, 1, 7]) id
)
select field, sum(id) from (
select field, id, row_number() over (partition by a.field order by id desc) rownum
from array_table a
)
where rownum < 3
group by field
More examples of analytic functions here:
https://medium.com/#aliz_ai/analytic-functions-in-google-bigquery-part-1-basics-745d97958fe2
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
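
The row-number approach can be tried locally with a small sketch like the one below. It assumes Python's sqlite3 module with SQLite 3.25 or newer; the invoices rows, the column names and the helper sum_first_n are invented for the illustration and are not taken from the question's real schema. It sums the first n invoices per company ordered by invoice_id ascending, which is what the question asks for (the array example above orders descending instead).

```python
import sqlite3

# Hypothetical invoices: (invoice_id, company_id, amount).
INVOICES = [
    (1, "a", 10), (2, "a", 20), (3, "b", 5),
    (4, "a", 30), (5, "b", 7), (6, "a", 40),
]

# Number each company's invoices by invoice_id, keep the first n, then sum.
QUERY = """
SELECT company_id, SUM(amount) AS first_n_amount
FROM (
    SELECT company_id, amount,
           ROW_NUMBER() OVER (PARTITION BY company_id
                              ORDER BY invoice_id) AS rn
    FROM invoices
)
WHERE rn <= ?
GROUP BY company_id
ORDER BY company_id
"""


def sum_first_n(invoices, n):
    """Return [(company_id, sum of its first n invoice amounts), ...]."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (invoice_id INTEGER, company_id TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)", invoices)
    result = conn.execute(QUERY, (n,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(sum_first_n(INVOICES, 3))
```

A company with fewer than n invoices simply contributes all of them, as company b does here with n = 3.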

Nth result in BigQuery Group By

I have a derived table like:
id, desc, total, account
1, one, 10, a
1, one, 9, b
1, one, 3, c
2, two, 27, c
I can do a simple
select id, desc, sum(total) as total from mytable group by id
but I want to add the equivalent first(account), first(total), second(account), second(total) to the output so it'd be:
id, desc, total, first_account, first_account_total, second_account, second_account_total
1, one, 22, a, 10, b, 9
2, two, 27, c, 27, null, 0
Any pointers?
Thanks in advance!
Below is for BigQuery Standard SQL
#standardSQL
SELECT id, `desc`, total,
arr[OFFSET(0)].account AS first_account,
arr[OFFSET(0)].total AS first_account_total,
arr[SAFE_OFFSET(1)].account AS second_account,
arr[SAFE_OFFSET(1)].total AS second_account_total
FROM (
SELECT id, `desc`, SUM(total) total,
ARRAY_AGG(STRUCT(account, total) ORDER BY total DESC LIMIT 2) arr
FROM `project.dataset.table`
GROUP BY id, `desc`
)
When more than the first 2 bins are required, I would use the pattern below, which avoids repeating heavy lines like arr[SAFE_OFFSET(1)].total AS second_account_total
#standardSQL
SELECT * FROM (SELECT NULL id, '' `desc`, NULL total, '' first_account, NULL first_account_total, '' second_account, NULL second_account_total) WHERE FALSE
UNION ALL
SELECT id, `desc`, total, arr[OFFSET(0)].*, arr[SAFE_OFFSET(1)].*
FROM (
SELECT id, `desc`, SUM(total) total,
ARRAY_AGG(STRUCT(account, total) ORDER BY total DESC LIMIT 2) arr
FROM `project.dataset.table`
GROUP BY id, `desc`
)
In the above, the first SELECT sets the layout of the output while returning no rows at all because of WHERE FALSE, so I don't need to explicitly parse the struct's elements and provide aliases.
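
Outside BigQuery, where ARRAY_AGG(STRUCT(...)) is not available, roughly the same output can be sketched with ROW_NUMBER plus conditional aggregation. This is a swapped-in technique, not the answer's method. The sketch assumes Python's sqlite3 module with SQLite 3.25 or newer; the column is named descr because desc is a reserved word, and grand_total and top_two_accounts are names invented for the illustration.

```python
import sqlite3

# The question's sample rows: (id, descr, total, account).
ROWS = [
    (1, "one", 10, "a"), (1, "one", 9, "b"),
    (1, "one", 3, "c"), (2, "two", 27, "c"),
]

# Rank accounts inside each id by total, then pick rank 1 and rank 2
# with CASE expressions while aggregating the whole group.
QUERY = """
SELECT id, descr, SUM(total) AS grand_total,
       MAX(CASE WHEN rn = 1 THEN account END) AS first_account,
       MAX(CASE WHEN rn = 1 THEN total END)   AS first_account_total,
       MAX(CASE WHEN rn = 2 THEN account END) AS second_account,
       MAX(CASE WHEN rn = 2 THEN total END)   AS second_account_total
FROM (
    SELECT id, descr, account, total,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY total DESC) AS rn
    FROM mytable
)
GROUP BY id, descr
ORDER BY id
"""


def top_two_accounts(rows):
    """Return one row per id with its sum and its two largest accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (id INTEGER, descr TEXT, total INTEGER, account TEXT)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_two_accounts(ROWS):
        print(row)
```

As with SAFE_OFFSET in the answer, a missing second account comes back as NULL rather than the 0 shown in the question's desired output; wrapping that column in COALESCE(..., 0) would be one way to get 0.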

Replacement for row_number() in clickhouse

ROW_NUMBER() is not supported by the ClickHouse database, so I am looking for an alternative function.
SELECT company_name AS company,
DOMAIN,
city_name AS city,
state_province_code AS state,
country_code AS country,
location_revenue AS revenueRange,
location_TI_industry AS industry,
location_employeecount_range AS employeeSize,
topic,
location_duns AS duns,
rank AS intensityRank,
dnb_status_code AS locationStatus,
rank_delta AS intensityRankDelta,
company_id,
ROW_NUMBER() OVER (PARTITION BY DOMAIN) AS rowNumber
FROM company_intent c
WHERE c.rank > 0
AND c.rank <= 10
AND c.signal_count > 0
AND c.topic IN ('Cloud Computing')
AND c.country_code = 'US'
AND c.rank IN (7, 8, 9, 10)
GROUP BY c.location_duns,
company_name,
DOMAIN,
city_name,
state_province_code,
country_code,
location_revenue,
location_TI_industry,
location_employeecount_range,
topic,
rank,
dnb_status_code,
rank_delta,
company_id
ORDER BY intensityRank DESC
LIMIT 15
SELECT COUNT (DISTINCT c.company_id) AS COUNT
FROM company_intent c
WHERE c.rank > 0
AND c.rank <= 10
AND c.signal_count > 0
AND c.topic IN ('Cloud Computing')
AND c.country_code = 'US'
AND c.rank IN (7, 8, 9, 10)
When executed the above query got the below error.
Expected one of: SETTINGS, FORMAT, WITH, HAVING, LIMIT, FROM, PREWHERE, token, UNION ALL, Comma, WHERE, ORDER BY, INTO OUTFILE, GROUP BY
Any suggestions are appreciated.
Solution #1
SELECT
*,
rowNumberInAllBlocks()
FROM
(
-- YOUR SELECT HERE
)
https://clickhouse.com/docs/en/sql-reference/functions/other-functions/#rownumberinallblocks says:
rowNumberInAllBlocks() Returns the ordinal number of the row in the data block. This function only considers the affected data blocks.
Solution #2
SELECT
row_number() OVER (),
...
FROM
...
https://clickhouse.com/docs/en/sql-reference/window-functions/
In my tests, both solutions show identical results. However, keep in mind that as of the beginning of 2022, window functions ran in single-threaded mode.
ClickHouse doesn't support Window Functions for now. There is a rowNumberInAllBlocks function that might be interesting to you.
SELECT *, rowNumberInAllBlocks() as row_count FROM (SELECT .....)
Something like this (it looks terrible but works well):
SELECT *, rn +1 -min_rn current, max_rn - min_rn + 1 last FROM (
SELECT *, rowNumberInAllBlocks() rn FROM (
SELECT i_device, i_time
FROM tbl
ORDER BY i_device, i_time
) t
) t1 LEFT JOIN (
SELECT i_device, min(rn) min_rn, max(rn) max_rn FROM (
SELECT *, rowNumberInAllBlocks() rn FROM (
SELECT i_device, i_time
FROM tbl
ORDER BY i_device, i_time
) t
) t GROUP BY i_device
) t2 USING (i_device)
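
The arithmetic in that last query is easier to follow on a tiny example. The sketch below is plain Python, not ClickHouse: it stands in for rowNumberInAllBlocks() with a single global counter over the sorted rows and then derives the per-device position (current) and group size (last) exactly as the query does with rn + 1 - min_rn and max_rn - min_rn + 1. The sample rows and the function name number_within_groups are invented for the illustration.

```python
from itertools import groupby

# Hypothetical (i_device, i_time) rows, deliberately unsorted.
ROWS = [("d2", 5), ("d1", 3), ("d1", 1), ("d2", 4), ("d1", 2)]


def number_within_groups(rows):
    """Return (i_device, i_time, current, last) for every row.

    current is the 1-based position of the row inside its device group and
    last is the size of that group. Both are derived from one global
    counter; the arithmetic is the same whatever value the counter starts at.
    """
    ordered = sorted(rows)  # ORDER BY i_device, i_time
    numbered = [(dev, t, rn) for rn, (dev, t) in enumerate(ordered)]
    out = []
    for dev, group in groupby(numbered, key=lambda r: r[0]):
        group = list(group)
        min_rn = group[0][2]   # min(rn) per device
        max_rn = group[-1][2]  # max(rn) per device
        for _, t, rn in group:
            out.append((dev, t, rn + 1 - min_rn, max_rn - min_rn + 1))
    return out


if __name__ == "__main__":
    for row in number_within_groups(ROWS):
        print(row)
```

The trick only works because the inner query sorts by i_device first, so each device occupies a contiguous run of counter values.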

Differences between row in google big query

I'm currently attempting to calculate differences between rows in google big query. I actually have a working query.
SELECT
id, record_time, level, lag,
(level - lag) as diff
FROM (
SELECT
id, record_time, level,
LAG(level) OVER (ORDER BY id, record_time) as lag
FROM (
SELECT
*
FROM
TABLE_QUERY(MY_TABLES))
ORDER BY
1, 2 ASC
)
GROUP BY 1, 2, 3, 4
ORDER BY 1, 2 ASC
But I'm working with big data, and sometimes I get a memory limit warning that prevents the query from running. So I would like to understand why I can't use an optimized query like the one below. I think it would let me process more records without hitting the memory limit.
SELECT
id, record_time, level,
level - LAG(level, 1) OVER (ORDER BY id, record_time) as diff
FROM (
SELECT
*
FROM
TABLE_QUERY(MY_TABLES))
ORDER BY
1, 2 ASC
When the query is executed, an expression of the form level - LAG(level, 1) OVER (ORDER BY id, record_time) as diff returns the error
Missing function in Analytic Expression
on BigQuery.
I also tried wrapping the function in parentheses, but that does not work either.
Thanks for helping me!
It works fine for me. Maybe you forgot to enable standard SQL? Here is an example:
WITH Input AS (
SELECT 1 AS id, TIMESTAMP '2017-10-17 00:00:00' AS record_time, 2 AS level UNION ALL
SELECT 2, TIMESTAMP '2017-10-16 00:00:00', 3 UNION ALL
SELECT 1, TIMESTAMP '2017-10-16 00:00:00', 4
)
SELECT
id, record_time, level, lag,
(level - lag) as diff
FROM (
SELECT
id, record_time, level,
LAG(level) OVER (ORDER BY id, record_time) as lag
FROM Input
)
GROUP BY 1, 2, 3, 4
ORDER BY 1, 2 ASC;
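
To show the inline form the question was after (subtracting LAG directly in the select list, with no outer query or GROUP BY), here is a minimal local sketch. It assumes Python's sqlite3 module with SQLite 3.25 or newer as a stand-in for BigQuery standard SQL; the table name readings and the function level_diffs are made up, and timestamps are stored as text. The extra diff_per_id column is an assumption about what is probably wanted: partitioning by id so that the first row of each id does not get compared with the last row of the previous id.

```python
import sqlite3

# The answer's sample rows: (id, record_time, level).
ROWS = [
    (1, "2017-10-17 00:00:00", 2),
    (2, "2017-10-16 00:00:00", 3),
    (1, "2017-10-16 00:00:00", 4),
]

QUERY = """
SELECT id, record_time, level,
       -- same window as in the question: one global ordering
       level - LAG(level) OVER (ORDER BY id, record_time) AS diff,
       -- assumed variant: restart the comparison for every id
       level - LAG(level) OVER (PARTITION BY id
                                ORDER BY record_time) AS diff_per_id
FROM readings
ORDER BY id, record_time
"""


def level_diffs(rows):
    """Return (id, record_time, level, diff, diff_per_id) for every row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (id INTEGER, record_time TEXT, level INTEGER)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in level_diffs(ROWS):
        print(row)
```

With the unpartitioned window, the single row for id 2 gets a diff of 1 against the last row of id 1; with PARTITION BY id it gets NULL instead.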

"Group" some rows together before sorting (Oracle)

I'm using Oracle Database 11g.
I have a query that selects, among other things, an ID and a date from a table. Basically, what I want to do is keep the rows that have the same ID together, and then sort those "groups" of rows by the most recent date in the "group".
So if my original result was this:
ID Date
3 11/26/11
1 1/5/12
2 6/3/13
2 10/15/13
1 7/5/13
The output I'm hoping for is:
ID Date
3 11/26/11 <-- (Using this date for "group" ID = 3)
1 1/5/12
1 7/5/13 <-- (Using this date for "group" ID = 1)
2 6/3/13
2 10/15/13 <-- (Using this date for "group" ID = 2)
Is there any way to do this?
One way to get this is by using analytic functions; I don't have an example of that handy.
This is another way to get the specified result, without using an analytic function (this is ordering first by the most_recent_date for each ID, then by ID, then by Date):
SELECT t.ID
, t.Date
FROM mytable t
JOIN ( SELECT s.ID
, MAX(s.Date) AS most_recent_date
FROM mytable s
WHERE s.Date IS NOT NULL
GROUP BY s.ID
) r
ON r.ID = t.ID
ORDER
BY r.most_recent_date
, t.ID
, t.Date
The "trick" here is to return "most_recent_date" for each ID, and then join that to each row. The result can be ordered by that first, then by whatever else.
(I also think there's a way to get this same ordering using Analytic functions, but I don't have an example of that handy.)
You can use the MAX ... KEEP function with your aggregate to create your sort key:
with
sample_data as
(select 3 id, to_date('11/26/11','MM/DD/RR') date_col from dual union all
select 1, to_date('1/5/12','MM/DD/RR') date_col from dual union all
select 2, to_date('6/3/13','MM/DD/RR') date_col from dual union all
select 2, to_date('10/15/13','MM/DD/RR') date_col from dual union all
select 1, to_date('7/5/13','MM/DD/RR') date_col from dual)
select
id,
date_col,
-- For illustration purposes, does not need to be selected:
max(date_col) keep (dense_rank last order by date_col) over (partition by id) sort_key
from sample_data
order by max(date_col) keep (dense_rank last order by date_col) over (partition by id);
Here is the query using analytic functions:
select
id
, date_
, max(date_) over (partition by id) as max_date
from table_name
order by max_date, id, date_
;
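
The analytic version can be checked against the question's sample data with a small local sketch. It assumes Python's sqlite3 module with SQLite 3.25 or newer rather than Oracle 11g, stores the dates as ISO text so that string order matches date order, and uses invented names (sample_data, date_col, grouped_order). The final sort key on date_col is added so that rows inside each group come out in date order, as in the desired output.

```python
import sqlite3

# The question's rows as (id, date_col), with dates written as ISO text.
ROWS = [
    (3, "2011-11-26"), (1, "2012-01-05"), (2, "2013-06-03"),
    (2, "2013-10-15"), (1, "2013-07-05"),
]

# Attach each group's latest date to every row, then sort by it.
QUERY = """
SELECT id, date_col
FROM (
    SELECT id, date_col,
           MAX(date_col) OVER (PARTITION BY id) AS max_date
    FROM sample_data
)
ORDER BY max_date, id, date_col
"""


def grouped_order(rows):
    """Return rows grouped by id, groups sorted by their most recent date."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sample_data (id INTEGER, date_col TEXT)")
    conn.executemany("INSERT INTO sample_data VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in grouped_order(ROWS):
        print(row)
```

The group for ID 3 comes first because its latest date is the earliest of the three, followed by ID 1 and then ID 2.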