BigQuery SQL running totals - google-bigquery

Any idea how to calculate a running total in BigQuery SQL?
id  value  running total
--  -----  -------------
 1      1              1
 2      2              3
 3      4              7
 4      7             14
 5      9             23
 6     12             35
 7     13             48
 8     16             64
 9     22             86
10     42            128
11     57            185
12     58            243
13     59            302
14     60            362
Not a problem for traditional SQL servers, using either a correlated scalar subquery:
SELECT a.id, a.value, (SELECT SUM(b.value)
FROM RunTotalTestData b
WHERE b.id <= a.id)
FROM RunTotalTestData a
ORDER BY a.id;
or a join:
SELECT a.id, a.value, SUM(b.Value)
FROM RunTotalTestData a,
RunTotalTestData b
WHERE b.id <= a.id
GROUP BY a.id, a.value
ORDER BY a.id;
But I couldn't find a way to make it work in BigQuery...

2018 update: The query in the original question works without modification now.
#standardSQL
WITH RunTotalTestData AS (
SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)])
)
SELECT a.id, a.value, (SELECT SUM(b.value)
FROM RunTotalTestData b
WHERE b.id <= a.id) runningTotal
FROM RunTotalTestData a
ORDER BY a.id;
2013 update: You can use SUM() OVER() to calculate running totals.
In your example:
SELECT id, value, SUM(value) OVER(ORDER BY id)
FROM [your.table]
A working example:
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30;
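For reference, in today's standard SQL the same working example would look roughly like this (assuming the public table is referenced as bigquery-public-data.samples.shakespeare, the standard-SQL name for the sample table above):
#standardSQL
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word) AS running_total
FROM `bigquery-public-data.samples.shakespeare`
WHERE corpus = 'hamlet'
  AND word > 'a'
ORDER BY word
LIMIT 30;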

You probably figured it out already, but here is one (not the most efficient) way:
A JOIN can only be done using equality comparisons, i.e. b.id <= a.id cannot be used as the join condition.
https://developers.google.com/bigquery/docs/query-reference#joins
This is pretty lame if you ask me, but there is one workaround: use an equality comparison on some dummy value to get the Cartesian product, and then apply the <= condition in the WHERE clause. This is crazily suboptimal, but if your tables are small it is going to work.
SELECT a.id, SUM(b.value) as rt
FROM RunTotalTestData a
JOIN RunTotalTestData b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
You can manually constrain the time as well:
SELECT a.id, SUM(b.value) as rt
FROM (
SELECT id, dummy
FROM RunTotalTestData
WHERE timestamp >= foo
AND timestamp < bar
) AS a
JOIN (
SELECT id, dummy, value
FROM RunTotalTestData
WHERE timestamp >= foo AND timestamp < bar
) b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
Update:
You don't need a special property. You can just use
SELECT 1 AS one
and join on that.
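For example, a rough sketch of that constant-value join against the same RunTotalTestData table (illustrative only, not the original poster's exact query):
SELECT a.id, SUM(b.value) AS rt
FROM (SELECT id, 1 AS one FROM RunTotalTestData) a
JOIN (SELECT id, value, 1 AS one FROM RunTotalTestData) b
  ON a.one = b.one
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY a.id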
As far as billing goes, the joined table counts toward the data processed.

The problem with the second query is that BigQuery treats the comma-separated tables in the FROM clause as a UNION of the two tables, not a join.
I'm not sure about the first one, but it may be that BigQuery doesn't accept subselects in the SELECT list, only in the FROM clause. So you would need to move the subquery into the FROM clause and JOIN the results.
Also, you could give our JDBC driver a try:
Starschema BigQuery JDBC Driver
Simply load it into SQuirreL SQL, RazorSQL, or pretty much any tool that supports JDBC drivers, and make sure you turn on the query transformer by setting:
transformQuery=true
in the properties or in the JDBC URL; all the details can be found on the project page. After you have done this, try running the second query: it will be transformed into a BigQuery-compatible join.

It's easy if we're allowed to use window functions.
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
With that we can do it like this:
WITH RunTotalTestData AS (
SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)])
)
select *, sum(value) over(order by id) as running_total
from RunTotalTestData

Related

Bigquery: WHERE clause using column from outside the subquery

I'm new to BigQuery, and googling could not really point me to a solution to the problem.
I am trying to use a WHERE clause in a subquery to filter and pick the latest row for each row in the main query. In Postgres I'd normally do it like this:
SELECT
*
FROM
table_a AS a
LEFT JOIN LATERAL
(
SELECT
score,
CONCAT( 'AB', id ) AS id
FROM
table_b AS b
WHERE
id = a.company_id
and
b.date < a.date
ORDER BY
b.date DESC
LIMIT
1
) ON true
WHERE
id LIKE 'AB%'
ORDER BY
createdAt DESC
so this would essentially run the subquery against each row and pick the latest row from table B based on a given row's date from table A.
So if table A would have a row
id  date
12  2021-05-XX
and table B:
id  date        value
12  2022-01-XX  99
12  2021-02-XX  98
12  2020-03-XX  97
12  2019-04-XX  96
It would have joined only the row with 2021-02-XX to table A.
In another example, with
Table A:
id  date
15  2021-01-XX
Table B:
id  date        value
15  2022-01-XX  99
15  2021-02-XX  98
15  2020-03-XX  97
15  2019-04-XX  96
it would join only the row with date: 2020-03-XX, value: 97.
Hope that is clear; I'm not really sure how to write this query so that it works.
Thanks for the help!
You can replace some of your correlated sub-select logic with a simple join and a QUALIFY clause.
Try the following:
SELECT *
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
WHERE b.date < a.date
QUALIFY ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.date desc) = 1
With your sample data it produces the expected result.
This should work for both truncated dates (YYYY-MM) as well as full dates (YYYY-MM-DD).
Something like the below should work for your requirements:
WITH
latest_record AS (
SELECT
a.id,
value,b.date, a.createdAt
FROM
`gcp-project-name.data-set-name.A` AS a
JOIN
`gcp-project-name.data-set-name.B` b
ON
( a.id = b.id
AND b.date < a.updatedAt )
ORDER BY
b.date DESC
LIMIT
1 )
SELECT
*
FROM
latest_record
I ran this with sample data in table A and table B and got the expected result.

SQL - Comparing Dates in 2 Tables, Retrieving Most Recent

I have two large MS SQL Server tables (A & B) that I'll generalize here. I'm attempting to create a new column in TableA derived from TableB, containing the most recent TableB.refDate PRIOR to each row's TableA.dataDate. Extremely large datasets, but these DISTINCT date queries run quickly. Simply concerned with the distinct dates in each table, no further matching criteria required.
SELECT DISTINCT dataDate FROM TableA
> 2019-02-13
> 2019-02-09
> 2019-02-05
SELECT DISTINCT refDate FROM TableB
> 2019-02-13
> 2019-02-12
> 2019-02-10
> 2019-02-07
> 2019-02-05
> 2019-02-04
The end result should be something like:
dataDate mostRecentRefDate
2019-02-13 2019-02-12
2019-02-09 2019-02-07
2019-02-05 2019-02-04
Something along these lines should work in theory, but the datasets are far too large:
SELECT
DISTINCT a.dataDate as dataDate,
(SELECT MAX(b.refDate) FROM TableB b WHERE a.dataDate > b.refDate) as mostRecentRefDate
FROM TableA a
Is there a better way to perform this utilizing the results of those initial DISTINCT date queries? Then reference the results to quickly insert the new column?
You may want to try this:
SELECT
a.dataDate as dataDate,
MAX(b.refDate) mostRecentRefDate
FROM TableA a
inner join TableB b
on a.dataDate > b.refDate
group by a.dataDate
Assuming you want other columns besides the dates, I would recommend apply:
select a.*, b.*
from a outer apply
(select top (1) b.*
from b
where b.refdate < a.datadate
order by b.refdate desc
) b;
You can try the following
WITH T(date, source) AS
(
SELECT DISTINCT dataDate, 'A'
FROM TableA
UNION ALL
SELECT DISTINCT refDate, 'B'
FROM TableB
), T2 AS
(
SELECT * ,
mostRecentRefDate = MAX(CASE WHEN source = 'B' THEN date END)
OVER (ORDER BY date, source ROWS UNBOUNDED PRECEDING)
FROM T
)
SELECT date AS dataDate, mostRecentRefDate
FROM T2
WHERE source = 'A'
The plan for this looks pretty good (though yours may differ depending on how the DISTINCT is carried out)

Joining tables that compute values between dates

So I have the following two tables:
Table A
Date num
01-16-15 10
02-20-15 12
03-20-15 13
Table B
Date Value
01-02-15 100
01-03-15 101
. .
01-17-15 102
01-18-15 103
. .
02-22-15 104
. .
03-20-15 110
And I want to create a table that has the following output in Impala:
Date Value
01-17-15 102*10
01-18-15 103*10
02-22-15 104*12
. .
. .
So the idea is that we only consider dates between 01-16-15 and 02-20-15, and between 02-20-15 and 03-20-15, exclusively. We use the num from the starting date of the period, say 01-16-15, and multiply it by every day's value in that period, i.e. 1-16 to 2-20.
I understand it should be done with a join, but I am not sure how to join in this case.
Thanks!
Hmmm. In standard SQL you can do:
select b.*,
(select a.num
from a
where a.date <= b.date
order by a.date desc
fetch first 1 row only
) * value as new_value
from b;
I don't think this meets the range conditions, but I don't understand your description of that.
I also don't know if Impala supports correlated subqueries. An alternative is probably faster on complex data:
with ab as (
select a.date, a.num as a_value, null as b_value, 'a' as which
from a
union all
select b.date, null as a_value, b.value as b_value, 'b' as which
from b
)
select date, b_value * a_real_value
from (select ab.*,
max(a_value) over (partition by a_date) as a_real_value
from (select ab.*,
max(case when which = 'a' then date end) over (order by date, which) as a_date
from ab
) ab
) ab
where which = 'b';
This works on MariaDB (MySQL), and it's pretty basic, so hopefully it works on Impala too.
SELECT b.date, b.value * a.num
FROM tableB b, tableA a
WHERE b.date >= a.date
AND (b.date < (SELECT MIN(c.date) FROM tableA c WHERE c.date > a.date)
OR NOT EXISTS(SELECT c.date FROM tableA c WHERE c.date > a.date))
The last NOT EXISTS... was needed to include dates after the last date in table A
Update
In the revised version of the question the date in B is never later than the last date in A, so the query can be written as:
SELECT b.date, b.value * a.num
FROM tableB b, tableA a
WHERE b.date >= a.date
AND b.date <= (SELECT MIN(c.date) FROM tableA c WHERE c.date > a.date)

Find consecutive free numbers in table

I have a table, containing numbers (phone numbers) and a code (free or not available).
Now, I need to find series of 30 consecutive numbers, like 079xxx100 - 079xxx130, where all of them have free status.
Here is an example of what my table looks like:
CREATE TABLE numere
(
value int,
code varchar(10)
);
INSERT INTO numere (value,code)
Values
(123100, 'free'),
(123101, 'free'),
...
(123107, 'booked'),
(123108, 'free'),
(...
(123130, 'free'),
(123131, 'free'),
...
(123200, 'free'),
(123201, 'free'),
...
(123230, 'free'),
(123231, 'free'),
...
I need a SQL query that, in this example, gets me the 123200-123230 range (and all the next available ranges).
Now, I found an example doing more or less what I need:
select value, code
from numere
where value >= (select a.value
from numere a
left join numere b on a.value < b.value
and b.value < a.value + 30
and b.code = 'free'
where a.code = 'free'
group by a.value
having count(b.value) + 1 = 30)
limit 30
but this is returning only the first 30 available numbers, and not within my range (0-30) (and it takes 13 minutes to execute, hehe..).
If anyone has an idea, please let me know (I am using SQL Server)
This seems like it works in my dataset. Modify the select and see if it works with your table name.
DECLARE @numere TABLE
(
value int,
code varchar(10)
);
INSERT INTO @numere (value,code) SELECT 123100, 'free'
WHILE (SELECT COUNT(*) FROM @numere)<=30
BEGIN
INSERT INTO @numere (value,code) SELECT MAX(value)+1, 'free' FROM @numere
END
UPDATE @numere
SET code='booked'
WHERE value=123105
select *
from @numere n1
inner join @numere n2 ON n1.value=n2.value-30
AND n1.code='free'
AND n2.code='free'
LEFT JOIN @numere n3 ON n3.value>=n1.value
AND n3.value<=n2.value
AND n3.code<>'free'
WHERE n3.value IS NULL
This is the usual Islands and Gaps problem.
; with cte as
(
select *, grp = row_number() over (order by value)
- row_number() over (partition by code order by value)
from numere
),
grp as
(
select grp
from cte
group by grp
having count(*) >= 30
)
select c.grp, c.value, c.code
from grp g
inner join cte c on g.grp = c.grp
You can query the table data for gaps between booked numbers using the following SQL query, where the LEAD() analytic function is used:
;with cte as (
select
value, lead(value) over (order by value) nextValue
from numere
where code = 'booked'
), cte2 as (
select
value gapstart, nextValue gapend,
(nextValue - value - 1) [number count in gap] from cte
where value < nextValue - 1
)
select *
from cte2
where [number count in gap] >= 30
You can check the SQL tutorial Find Missing Numbers and Gaps in a Sequence using SQL.
I hope it helps.
Can't test it at the moment, but this might work:
SELECT a.Value
FROM (SELECT Value
FROM numere
WHERE Code='free'
) a INNER Join
(SELECT Value
FROM numere
WHERE code='free'
) b ON b.Value BETWEEN a.Value+1 AND a.Value+29
GROUP BY a.Value
HAVING COUNT(b.Value) >= 29
ORDER BY a.Value ASC
The output should be all numbers that have 29 free numbers following (so it's 30 consecutive numbers)

SUM Column in SQL

I have a table in SQL Server, and I need to sum a column, like the example below:
CREATE TABLE B
(
ID int,
Qty int
)
INSERT INTO B VALUES (1,2)
INSERT INTO B VALUES (2,7)
INSERT INTO B VALUES (3,2)
INSERT INTO B VALUES (4,11)
SELECT *, '' AS TotalQty FROM B
ORDER BY ID
In this example, what I need is for the column TotalQty to give me values like:
2
9
11
22
How can it be achieved?
You can use SUM in a correlated subquery or CROSS APPLY like this:
Correlated subquery
SELECT ID,(SELECT SUM(Qty) FROM B WHERE B.id <= C.id) FROM B as C
ORDER BY ID
Using CROSS APPLY
SELECT ID,D.Qty FROM B as C
CROSS APPLY
(
SELECT SUM(Qty) Qty
FROM B WHERE B.id <= C.id
)AS D
ORDER BY ID
Output
1 2
2 9
3 11
4 22
If you were using SQL Server 2012 or above, SUM() with an OVER() clause could have been used like this:
SELECT ID, SUM(Qty) OVER(ORDER BY ID ASC) FROM B as C
ORDER BY ID
Edit
Another way to do this in SQL Server 2008 is using a recursive CTE, something like this.
Note: This method is based on the answer by Roman Pekar in the thread Calculate a Running Total in SQL Server. Based on his observation, this should perform better than both the correlated subquery and CROSS APPLY.
;WITH CTE as
(
SELECT ID,Qty,ROW_NUMBER()OVER(ORDER BY ID ASC) as rn
FROM B
), CTE_Running_Total as
(
SELECT Id,rn,Qty,Qty as Running_Total
FROM CTE
WHERE rn = 1
UNION ALL
SELECT C1.Id,C1.rn,C1.Qty,C1.Qty + C2.Running_Total as Running_Total
FROM CTE C1
INNER JOIN CTE_Running_Total C2
ON C1.rn = C2.rn + 1
)
SELECT *
FROM CTE_Running_Total
ORDER BY Id
OPTION (maxrecursion 0)