New to BigQuery, and googling could not really point me to a solution for this problem.
I am trying to use a WHERE clause in a subquery to filter and pick the latest matching row for each row in the main query. In Postgres I'd normally do it like this:
SELECT *
FROM table_a AS a
LEFT JOIN LATERAL (
    SELECT score,
           CONCAT('AB', id) AS id
    FROM table_b AS b
    WHERE id = a.company_id
      AND b.date < a.date
    ORDER BY b.date DESC
    LIMIT 1
) AS latest ON true
WHERE id LIKE 'AB%'
ORDER BY createdAt DESC
So this essentially runs the subquery against each row and picks the latest row from table B that precedes the given row's date in table A.
So if table A had the row

id  date
12  2021-05-XX

and table B:

id  date        value
12  2022-01-XX  99
12  2021-02-XX  98
12  2020-03-XX  97
12  2019-04-XX  96

it would join only the row with date 2021-02-XX to table A.
In another example, with table A:

id  date
15  2021-01-XX

and table B:

id  date        value
15  2022-01-XX  99
15  2021-02-XX  98
15  2020-03-XX  97
15  2019-04-XX  96

it would join only the row with date 2020-03-XX, value 97.
Hope that is clear; I'm not really sure how to write this query so it works.
Thanks for any help!
You can replace your correlated sub-select logic with a simple join plus a QUALIFY clause.
Try the following:
SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.id = b.id
 AND b.date < a.date
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY b.date DESC) = 1
With your sample data it produces the single latest preceding row per id. This should work for both truncated dates (YYYY-MM) and full dates (YYYY-MM-DD).
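For reference, a self-contained sketch of that approach with the sample data inlined (concrete dates substituted for the XX placeholders; untested against your real tables):
WITH table_a AS (
  SELECT 12 AS id, DATE '2021-05-01' AS date
),
table_b AS (
  SELECT * FROM UNNEST([
    STRUCT(12 AS id, DATE '2022-01-01' AS date, 99 AS value),
    (12, DATE '2021-02-01', 98),
    (12, DATE '2020-03-01', 97),
    (12, DATE '2019-04-01', 96)])
)
SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.id = b.id
 AND b.date < a.date
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY b.date DESC) = 1
It should return the single 2021-02-01 row joined to table A's row. The WHERE TRUE is there only because BigQuery has historically required QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause; it is harmless either way.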
Something like the below should work for your requirements:
WITH latest_record AS (
  SELECT a.id, b.value, b.date, a.createdAt
  FROM `gcp-project-name.data-set-name.A` AS a
  JOIN `gcp-project-name.data-set-name.B` AS b
    ON a.id = b.id
   AND b.date < a.updatedAt
  ORDER BY b.date DESC
  LIMIT 1
)
SELECT *
FROM latest_record
I ran this with sample data in tables A and B and got the expected latest record as the result.
So I have the following two tables:
Table A
Date num
01-16-15 10
02-20-15 12
03-20-15 13
Table B
Date Value
01-02-15 100
01-03-15 101
. .
01-17-15 102
01-18-15 103
. .
02-22-15 104
. .
03-20-15 110
And I want to create a table that has the following output in Impala:
Date Value
01-17-15 102*10
01-18-15 103*10
02-22-15 104*12
. .
. .
So the idea is that we only consider dates between 01-16-15 and 02-20-15, and between 02-20-15 and 03-20-15, exclusive. We use the num from the starting date of each period, say 01-16-15, and multiply it by the value of every day in that period, i.e. 1-16 to 2-20.
I understand it should be done with a join, but I am not sure how to join in this case.
Thanks!
Hmmm. In standard SQL you can do:
select b.*,
       (select a.num
        from a
        where a.date <= b.date
        order by a.date desc
        fetch first 1 row only
       ) * value as new_value
from b;
I don't think this meets the range conditions, but I don't understand your description of that.
I also don't know if Impala supports correlated subqueries. An alternative is probably faster on complex data:
with ab as (
      select a.date, a.num as a_num, null as b_value, 'a' as which
      from a
      union all
      select b.date, null as a_num, b.value as b_value, 'b' as which
      from b
     )
select date, b_value * a_real_num
from (select ab.*,
             max(a_num) over (partition by a_date) as a_real_num
      from (select ab.*,
                   max(case when which = 'a' then date end)
                       over (order by date, which) as a_date
            from ab
           ) ab
     ) ab
where which = 'b';
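The trick here is that tagging each row 'a' or 'b' and taking a running max of the a-dates over the interleaved, date-ordered stream assigns every b row the date of the most recent preceding a row; the outer query then resolves that date to the matching num.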
This works on MariaDB (MySQL) and it's pretty basic, so hopefully it works on Impala too.
SELECT b.date, b.value * a.num
FROM tableB b, tableA a
WHERE b.date >= a.date
AND (b.date < (SELECT MIN(c.date) FROM tableA c WHERE c.date > a.date)
OR NOT EXISTS(SELECT c.date FROM tableA c WHERE c.date > a.date))
The last NOT EXISTS... was needed to include dates after the last date in table A
Update
In the revised version of the question, the date in B is never later than the last date in A, so the query can be written as:
SELECT b.date, b.value * a.num
FROM tableB b, tableA a
WHERE b.date >= a.date
AND b.date <= (SELECT MIN(c.date) FROM tableA c WHERE c.date > a.date)
I have a table with records:
DATE NAME AGE ADDRESS
01/13/2014 abc 27 us
01/29/2014 abc 27 ma <- duplicate
02/03/2014 abc 27 ny <- duplicate
02/03/2014 def 28 ca
I want to delete records 2 and 3 since they are duplicates of record 1 based on name and age. The DATE column is a timestamp recording when the record was added (a SQL date) and is considered unique.
I found this SQL but am not sure it will work, and I'm a bit concerned because the table has 2 million records and deleting the wrong ones would be a bad idea:
SELECT A.DATE, A.NAME, A.AGE
FROM table A
WHERE EXISTS (SELECT B.DATE
FROM table B
WHERE B.NAME = A.NAME
AND B.AGE = A.AGE);
There are many instances of such records, so could someone help me write the SQL to delete them?
Query
DELETE FROM tbl t1
WHERE dt IN
(
SELECT t1.dt
FROM tbl t1
JOIN tbl t2 ON
(
t2.name = t1.name
AND t2.age=t1.age
AND t2.dt > t1.dt
)
);
Fiddle demo
delete from table
where (date, name, age) not in ( select max( date ), name, age from table group by name, age )
Before deleting, verify with:
select * from table
where (date, name, age) not in ( select max( date ), name, age from table group by name, age )
The ROW_NUMBER analytic function will be helpful (supported by Oracle and SQL Server).
The logic of assigning a unique, ordered number to each row inside a partition needs to be implemented carefully in the ORDER BY clause.
SELECT A_TABLE.*,
       ROW_NUMBER() OVER (PARTITION BY NAME, AGE
                          ORDER BY DATE DESC) seq_no
FROM A_TABLE;
Then you may use the result for delete operation:
DELETE FROM A_TABLE
WHERE (DATE, NAME, AGE) IN
(
  SELECT DATE, NAME, AGE
  FROM (
    SELECT A_TABLE.*,
           ROW_NUMBER() OVER (PARTITION BY NAME, AGE
                              ORDER BY DATE DESC) seq_no
    FROM A_TABLE
  )
  WHERE seq_no != 1
)
I don't understand what I'm doing wrong. I'm trying to get a weekly COUNT of every ID that meets criteria A OR criteria B.
select CREATE_WEEK, count(A.PK)
from TABLE1 A
where (A.PK not in (select distinct B.FK
                    from TABLE2 B
                    where B.CREATE_TIMESTAMP > '01-Jan-2013')
       or A.PK in (select PK
                   from (select A.PK, A.CREATE_TIMESTAMP as A_CRT,
                                min(B.CREATE_TIMESTAMP) as FIRST_B
                         from TABLE1 A, TABLE2 B
                         where A.PK = B.FK
                           and A.CREATE_TIMESTAMP > '01-Jan-2013'
                           and B.CREATE_TIMESTAMP > '01-Jan-2013'
                         group by A.PK, A.CREATE_TIMESTAMP)
                   where A_CRT < FIRST_B))
  and A.CREATE_TIMESTAMP > '01-Jan-2013'
  and CREATE_WEEK >= 2
  and THIS_WEEK - CREATE_WEEK >= 1
group by CREATE_WEEK
order by CREATE_WEEK asc
Note: PK in table1 = FK in table2, so in the first subquery I'm checking whether the PK from table1 exists as an FK in table2. Week comes from TO_CHAR(TO_DATE(TRUNC(A.CREATE_TIMESTAMP, 'IW')), 'IW').
When I take out the OR and run the query on either subquery the results are returned in 1-2 seconds. But when I try to run the combined query, the results aren't returned after 20 minutes.
I know I can run them separately and then sum them in a spreadsheet, but I'd rather just get one number.
I'm trying to get a weekly COUNT of every ID that meets criteria A OR criteria B
However your code is:
ID NOT IN (subquery A) OR ID IN (subquery B)
The NOT is at odds with your requirement.
Assuming you want the IDs that meet either criterion, use:
ID in (
  select ...  -- this is subquery A
  union
  select ...  -- this is subquery B
)
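Applied to the query in the question, the rewrite would look roughly like this (a sketch, untested; inner aliases renamed to avoid reusing A):
select CREATE_WEEK, count(A.PK)
from TABLE1 A
where A.PK in (
        -- subquery A: PKs with no TABLE2 rows since 2013
        select A2.PK
        from TABLE1 A2
        where A2.PK not in (select B.FK
                            from TABLE2 B
                            where B.CREATE_TIMESTAMP > '01-Jan-2013')
        union
        -- subquery B: PKs created before their first TABLE2 row
        select T.PK
        from (select A3.PK, A3.CREATE_TIMESTAMP as A_CRT,
                     min(B.CREATE_TIMESTAMP) as FIRST_B
              from TABLE1 A3, TABLE2 B
              where A3.PK = B.FK
                and A3.CREATE_TIMESTAMP > '01-Jan-2013'
                and B.CREATE_TIMESTAMP > '01-Jan-2013'
              group by A3.PK, A3.CREATE_TIMESTAMP) T
        where T.A_CRT < T.FIRST_B)
  and A.CREATE_TIMESTAMP > '01-Jan-2013'
  and CREATE_WEEK >= 2
  and THIS_WEEK - CREATE_WEEK >= 1
group by CREATE_WEEK
order by CREATE_WEEK asc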
I have a column like this:
ID
--------
1
2
3
4
5
7
10
and I want to get the following resultset:
ID
--------
1-5
7
10
Is there a way to achieve this with (Oracle) SQL only?
Yes:
select (case when min(id) < max(id)
then cast(min(id) as varchar2(255)) || '-' || cast(max(id) as varchar2(255))
else cast(min(id) as varchar2(255))
end)
from (select id, id - rownum as grp
from t
order by id
) t
group by grp
order by min(id);
Here is a SQL Fiddle demonstrating it.
The idea behind the query is that subtracting rownum from a gapless ascending sequence of numbers yields the same constant for every row in the run. You can then use that constant for grouping.
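For the sample data, the intermediate values look like this (a worked illustration; rownum is assigned in id order):
id  rownum  grp (= id - rownum)
 1    1      0
 2    2      0
 3    3      0
 4    4      0
 5    5      0
 7    6      1
10    7      3
Grouping by grp then collapses 1 through 5 into the single row 1-5 and leaves 7 and 10 alone.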
Self joins are necessary... I think this will work:
Select a.id, b.id
From table a  -- to get the beginning of each range
Join table b  -- to get the end of each range
  On b.id >= a.id  -- guarantees that b is after a
 And Not Exists (Select * From table m  -- guarantees all values between are present
                 Where m.id Between a.id+1 And b.id
                   And Not Exists (Select * From table
                                   Where id = m.id-1))
 And Not Exists (Select * From table  -- guarantees that a.id starts the range
                 Where id = a.id-1)
 And Not Exists (Select * From table  -- guarantees that b.id ends the range
                 Where id = b.id+1)
Any idea how to calculate a running total in BigQuery SQL?
id value running total
-- ----- -------------
1 1 1
2 2 3
3 4 7
4 7 14
5 9 23
6 12 35
7 13 48
8 16 64
9 22 86
10 42 128
11 57 185
12 58 243
13 59 302
14 60 362
Not a problem for traditional SQL servers, using either a correlated scalar subquery:
SELECT a.id, a.value, (SELECT SUM(b.value)
FROM RunTotalTestData b
WHERE b.id <= a.id)
FROM RunTotalTestData a
ORDER BY a.id;
or a join:
SELECT a.id, a.value, SUM(b.Value)
FROM RunTotalTestData a,
RunTotalTestData b
WHERE b.id <= a.id
GROUP BY a.id, a.value
ORDER BY a.id;
But I couldn't find a way to make it work in BigQuery...
2018 update: The query in the original question works without modification now.
#standardSQL
WITH RunTotalTestData AS (
SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)])
)
SELECT a.id, a.value, (SELECT SUM(b.value)
FROM RunTotalTestData b
WHERE b.id <= a.id) runningTotal
FROM RunTotalTestData a
ORDER BY a.id;
2013 update: You can use SUM() OVER() to calculate running totals.
In your example:
SELECT id, value, SUM(value) OVER(ORDER BY id)
FROM [your.table]
A working example:
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30;
You probably figured it out already. But here is one, not the most efficient, way:
JOINs can only be done using equality comparisons, i.e. b.id <= a.id cannot be used.
https://developers.google.com/bigquery/docs/query-reference#joins
This is pretty lame if you ask me, but there is one workaround: just use an equality comparison on some dummy value to get the cartesian product, and then use WHERE for the <=. This is crazily suboptimal, but if your tables are small it is going to work.
SELECT a.id, SUM(b.value) as rt
FROM RunTotalTestData a
JOIN RunTotalTestData b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
You can manually constrain the time as well:
SELECT a.id, SUM(b.value) as rt
FROM (
  SELECT id, timestamp, value, 1 AS dummy
  FROM RunTotalTestData
  WHERE timestamp >= foo
    AND timestamp < bar
) a
JOIN (
  SELECT id, timestamp, value, 1 AS dummy
  FROM RunTotalTestData
  WHERE timestamp >= foo
    AND timestamp < bar
) b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
Update:
You don't need a special property. You can just use
SELECT 1 AS one
and join on that.
Billing-wise, the joined table counts toward the bytes processed.
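A minimal sketch of that pattern (the one column exists only to force the cartesian product through an equi-join):
SELECT a.id, SUM(b.value) AS rt
FROM (SELECT id, value, 1 AS one FROM RunTotalTestData) a
JOIN (SELECT id, value, 1 AS one FROM RunTotalTestData) b
  ON a.one = b.one
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY a.id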
The problem with the second query is that BigQuery (legacy SQL) treats the comma between the two tables in the FROM clause as a UNION of the tables, not a join.
I'm not sure about the first one, but it's possible that BigQuery doesn't support subselects in the SELECT list, only in the FROM clause. So you need to move the subquery into the FROM clause and JOIN the results.
Also, you could give our JDBC driver a try:
Starschema BigQuery JDBC Driver
Simply load it into SQuirreL SQL, RazorSQL, or just about any tool that supports JDBC drivers, and make sure you turn on the Query Transformer by setting:
transformQuery=true
in the properties or in the JDBC URL; all the details can be found on the project page. After you do this, try to run the 2nd query: it will be transformed into a BigQuery-compatible join.
It's easy if we're allowed to use window functions:
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
With that we can do it like this:
WITH RunTotalTestData AS (
SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)])
)
select *, sum(value) over(order by id) as running_total
from RunTotalTestData
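With the inline sample data this returns running totals 1, 1, 2, 3, 5, 8.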