Joining tables to compute values between dates - SQL

So I have the following two tables:
Table A

Date      num
01-16-15  10
02-20-15  12
03-20-15  13

Table B

Date      Value
01-02-15  100
01-03-15  101
.         .
01-17-15  102
01-18-15  103
.         .
02-22-15  104
.         .
03-20-15  110
And I want to create a table that has the following output in Impala:

Date      Value
01-17-15  102*10
01-18-15  103*10
02-22-15  104*12
.         .
.         .
So the idea is that we only consider dates between 01-16-15 and 02-20-15, and between 02-20-15 and 03-20-15, exclusively, and use the num from the starting date of each period, say 01-16-15, multiplying it by the value for every day in that period, i.e. 01-16 to 02-20.
I understand it should be done with a join, but I am not sure how to do the join in this case.
Thanks!

Hmmm. In standard SQL you can do:
select b.*,
       (select a.num
        from a
        where a.date <= b.date
        order by a.date desc
        fetch first 1 row only
       ) * value as new_value
from b;
I don't think this meets the range conditions, but I don't understand your description of that.
I also don't know if Impala supports correlated subqueries. An alternative, which is probably faster on complex data:
with ab as (
      select a.date, a.num as a_value, null as b_value, 'a' as which
      from a
      union all
      select b.date, null as a_value, b.value as b_value, 'b' as which
      from b
     )
select date, b_value * a_real_value
from (select ab.*,
             max(a_value) over (partition by a_date) as a_real_value
      from (select ab.*,
                   -- date of the most recent 'a' row at or before this row
                   max(case when which = 'a' then date end) over (order by date, which) as a_date
            from ab
           ) ab
     ) ab
where which = 'b';

This works on MariaDB (MySQL), and it's pretty basic, so hopefully it works on Impala too.
SELECT b.date, b.value * a.num
FROM tableB b, tableA a
WHERE b.date >= a.date
  AND (b.date < (SELECT MIN(c.date) FROM tableA c WHERE c.date > a.date)
       OR NOT EXISTS (SELECT c.date FROM tableA c WHERE c.date > a.date))
The last NOT EXISTS was needed to include dates after the last date in table A.
Update
In the revised version of the question, the dates in B are never later than the last date in A, so the query can be written as:
SELECT b.date, b.value * a.num
FROM tableB b, tableA a
WHERE b.date >= a.date
  AND b.date <= (SELECT MIN(c.date) FROM tableA c WHERE c.date > a.date)
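If your Impala version supports analytic functions (Impala 2.0+), a hedged alternative avoids correlated subqueries entirely: LEAD() turns each row of tableA into an explicit [start_date, end_date) period, and a CROSS JOIN plus WHERE applies the non-equi condition. Table and column names follow the question; treat this as a sketch, not tested Impala code:

SELECT b.`date`, b.`value` * a.num AS new_value
FROM tableB b
CROSS JOIN (
  SELECT `date` AS start_date,
         num,
         -- the next row's date closes this period; NULL for the last period
         LEAD(`date`) OVER (ORDER BY `date`) AS end_date
  FROM tableA
) a
WHERE b.`date` >= a.start_date
  AND (b.`date` < a.end_date OR a.end_date IS NULL);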


Bigquery: WHERE clause using column from outside the subquery

New to BigQuery, and googling could not really point me to a solution for this problem.
I am trying to use a WHERE clause in a subquery to filter and pick the latest row from table B for each row in the main query. In Postgres I'd normally do it like this:
SELECT *
FROM table_a AS a
LEFT JOIN LATERAL (
  SELECT score,
         CONCAT('AB', id) AS id
  FROM table_b AS b
  WHERE id = a.company_id
    AND b.date < a.date
  ORDER BY b.date DESC
  LIMIT 1
) AS t ON true
WHERE id LIKE 'AB%'
ORDER BY createdAt DESC
So this would essentially run the subquery for each row and pick the latest row from table B, based on the given row's date from table A.
So if table A would have the row:

id  date
12  2021-05-XX

and table B:

id  date        value
12  2022-01-XX  99
12  2021-02-XX  98
12  2020-03-XX  97
12  2019-04-XX  96

it would have joined only the row with 2021-02-XX to table A.
In another example, with

Table A:

id  date
15  2021-01-XX

Table B:

id  date        value
15  2022-01-XX  99
15  2021-02-XX  98
15  2020-03-XX  97
15  2019-04-XX  96

it would join only the row with date: 2020-03-XX, value: 97.
Hope that is clear; I'm not really sure how to write this query so that it works.
Thanks for the help!
You can replace some of your correlated sub-select logic with a simple JOIN and a QUALIFY clause.
Try the following:
SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.id = b.id
WHERE b.date < a.date
QUALIFY ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.date DESC) = 1
With your sample data it produces the expected latest row for each id.
This should work for truncated dates (YYYY-MM) as well as full dates (YYYY-MM-DD).
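One hedged caveat: filtering on b.date in the WHERE clause turns the LEFT JOIN into an inner join, because table_a rows with no earlier table_b match have a NULL b.date and get filtered out. If you need to keep those rows, a sketch that moves the predicate into the ON clause (partitioning by a.id here is an assumption about the desired grouping):

SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.id = b.id
 AND b.date < a.date
WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY b.date DESC) = 1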
Something like the below should work for your requirements:
WITH latest_record AS (
  SELECT a.id, value, b.date, a.createdAt
  FROM `gcp-project-name.data-set-name.A` AS a
  JOIN `gcp-project-name.data-set-name.B` b
    ON (a.id = b.id AND b.date < a.updatedAt)
  ORDER BY b.date DESC
  LIMIT 1
)
SELECT *
FROM latest_record
I ran this with sample data for table A and table B and got the expected single latest record as the result.

Count Two Tables on shared date in PostgreSQL

I have two separate customer tables, A and B. I am trying to count the customers created in A and B by date in the same query. I can get the right data with UNION ALL, but not properly grouped.
I want the data like so:
date,count A created, count B created
4/15/2015,1,5
Instead of:
date, count
4/15/2015, 1
4/15/2015, 5
Appreciate the help!
Just use a CTE; you just have to be careful if you don't have data for every day. In that case you would need a date table to get 0 when there are no sales.
Also, try not to use reserved words like date as field names.
with countA as (
  SELECT date, count(*) as CountA
  from tableA
  group by date
),
countB as (
  SELECT date, count(*) as CountB
  from tableB
  group by date
)
SELECT A.date, A.CountA, B.CountB
FROM CountA A
INNER JOIN CountB B ON A.date = B.date
With a table AllDates to handle days without sales:
SELECT T.date,
       CASE WHEN A.CountA IS NULL THEN 0 ELSE A.CountA END as CountA,
       CASE WHEN B.CountB IS NULL THEN 0 ELSE B.CountB END as CountB
FROM AllDates T
LEFT JOIN CountA A ON T.date = A.date
LEFT JOIN CountB B ON T.date = B.date
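As a hedged side note, in PostgreSQL the AllDates table doesn't have to be a physical table: generate_series can build it inline (the date range below is an assumption), and COALESCE is a compact substitute for the CASE expressions:

WITH AllDates AS (
  -- hypothetical range; substitute the span you actually need
  SELECT d::date AS date
  FROM generate_series('2015-01-01'::date, '2015-12-31'::date, interval '1 day') AS d
),
CountA AS (SELECT date, count(*) AS CountA FROM tableA GROUP BY date),
CountB AS (SELECT date, count(*) AS CountB FROM tableB GROUP BY date)
SELECT T.date,
       COALESCE(A.CountA, 0) AS CountA,
       COALESCE(B.CountB, 0) AS CountB
FROM AllDates T
LEFT JOIN CountA A ON T.date = A.date
LEFT JOIN CountB B ON T.date = B.date;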
select a.dte,
       a.count a_created,
       b.count b_created
from (select dte, count(*) from table_a group by dte) a,
     (select dte, count(*) from table_b group by dte) b
where b.dte = a.dte
OR

You can achieve this by using PostgreSQL's tablefunc extension.
Start by creating the extension: CREATE EXTENSION IF NOT EXISTS tablefunc;
Then use the following as an example:
create table table_a (dte date,is_created int);
create table table_b (dte date,is_created int);
insert into table_a values('2015-10-07',1);
insert into table_a values('2015-10-07',1);
insert into table_a values('2015-10-07',1);
insert into table_a values('2015-10-07',1);
insert into table_a values('2015-10-07',1);
insert into table_b values('2015-10-07',2);
By using crosstab(), the select should be:
SELECT *
FROM crosstab(
  'select dte, ''a_created'' col, count(*) created from table_a group by dte
   union all
   select dte, ''b_created'' col, count(*) created from table_b group by dte
   order by 1, 2')
AS ct("date" DATE, "a_created" BIGINT, "b_created" BIGINT);
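One hedged caveat: the one-argument crosstab form fills the output columns from left to right, so if a date appears in only one of the tables its count can land under the wrong heading. The two-argument form with an explicit category list avoids that (a sketch reusing the same sample tables; missing categories come back as NULL rather than 0):

SELECT *
FROM crosstab(
  $$select dte, col, created
    from (select dte, 'a_created' col, count(*) created from table_a group by dte
          union all
          select dte, 'b_created' col, count(*) created from table_b group by dte) s
    order by 1, 2$$,
  $$VALUES ('a_created'), ('b_created')$$
) AS ct("date" DATE, "a_created" BIGINT, "b_created" BIGINT);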

select count(ID) where ID IN a or b

I don't understand what I'm doing wrong. I'm trying to get a weekly COUNT of every ID that meets criteria A OR criteria B.
select CREATE_WEEK, count(A.PK)
from TABLE1 A
where (A.PK not in (select distinct B.FK
                    from TABLE2 B
                    where B.CREATE_TIMESTAMP > '01-Jan-2013')
       or A.PK in (select A.PK
                   from (select A.PK, A.CREATE_TIMESTAMP as A_CRT, min(B.CREATE_TIMESTAMP) as FIRST_B
                         from TABLE1 A, TABLE2 B
                         where A.PK = B.FK
                           and A.CREATE_TIMESTAMP > '01-Jan-2013'
                           and B.CREATE_TIMESTAMP > '01-Jan-2013'
                         group by A.PK, A.CREATE_TIMESTAMP)
                   where A_CRT < FIRST_B))
  and A.CREATE_TIMESTAMP > '01-Jan-2013'
  and CREATE_WEEK >= 2
  and THIS_WEEK - CREATE_WEEK >= 1
group by CREATE_WEEK
order by CREATE_WEEK asc
Note: PK in table1 = FK in table2, so in the first subquery I'm checking whether the PK from table1 exists as an FK in table2. Week comes from TO_CHAR(TO_DATE(TRUNC(A.CREATE_TIMESTAMP, 'IW')), 'IW').
When I take out the OR and run the query with either subquery alone, the results are returned in 1-2 seconds. But when I try to run the combined query, the results haven't returned after 20 minutes.
I know I can run them separately and then sum them in a spreadsheet, but I'd rather just get one number.
I'm trying to get a weekly COUNT of every ID that meets criteria A OR criteria B
However your code is:
ID NOT IN (subquery A) OR ID IN (subquery B)
The NOT is at odds with your requirement.
Assuming you can have IDs that meet both criteria, use:
ID in (select ...   -- this is subquery A
       union
       select ...   -- this is subquery B
      )
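A minimal sketch of that shape against the question's TABLE1, with hypothetical views crit_a and crit_b standing in for the two subqueries (both names are placeholders, not the actual criteria):

select CREATE_WEEK, count(A.PK)
from TABLE1 A
where A.PK in (select PK from crit_a   -- IDs meeting criteria A
               union                   -- UNION also removes duplicates
               select PK from crit_b)  -- IDs meeting criteria B
group by CREATE_WEEK
order by CREATE_WEEK;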

BigQuery SQL running totals

Any idea how to calculate running total in BigQuery SQL?
id  value  running total
--  -----  -------------
 1      1        1
 2      2        3
 3      4        7
 4      7       14
 5      9       23
 6     12       35
 7     13       48
 8     16       64
 9     22       86
10     42      128
11     57      185
12     58      243
13     59      302
14     60      362
Not a problem for traditional SQL servers, using either a correlated scalar subquery:
SELECT a.id, a.value, (SELECT SUM(b.value)
                       FROM RunTotalTestData b
                       WHERE b.id <= a.id)
FROM RunTotalTestData a
ORDER BY a.id;
or join:
SELECT a.id, a.value, SUM(b.Value)
FROM RunTotalTestData a,
     RunTotalTestData b
WHERE b.id <= a.id
GROUP BY a.id, a.value
ORDER BY a.id;
But I couldn't find a way to make it work in BigQuery...
2018 update: The query in the original question works without modification now.
#standardSQL
WITH RunTotalTestData AS (
  SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value), (2,0), (3,1), (4,1), (5,2), (6,3)])
)
SELECT a.id, a.value, (SELECT SUM(b.value)
                       FROM RunTotalTestData b
                       WHERE b.id <= a.id) runningTotal
FROM RunTotalTestData a
ORDER BY a.id;
2013 update: You can use SUM() OVER() to calculate running totals.
In your example:
SELECT id, value, SUM(value) OVER(ORDER BY id)
FROM [your.table]
A working example:
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30;
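A hedged aside: when the ORDER BY column can contain duplicates, the default window frame is RANGE, which adds all tied rows at once; an explicit ROWS frame gives a strictly row-by-row running total. In standard SQL (the table name below is a placeholder):

SELECT id, value,
       SUM(value) OVER (ORDER BY id
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM `your_dataset.your_table`  -- hypothetical table name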
You probably figured it out already. But here is one, not the most efficient, way:
JOIN can only be done using equality comparisons, i.e. b.id <= a.id cannot be used.
https://developers.google.com/bigquery/docs/query-reference#joins
This is pretty lame if you ask me, but there is a workaround: use an equality comparison on some dummy value to get the Cartesian product, and then use WHERE for the <=. This is crazily suboptimal, but if your tables are small it is going to work.
SELECT a.id, SUM(b.value) as rt   -- sum b's values for the running total
FROM RunTotalTestData a
JOIN RunTotalTestData b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
You can manually constrain the time range as well:

SELECT a.id, SUM(b.value) as rt
FROM (
  SELECT id, value, dummy
  FROM RunTotalTestData
  WHERE timestamp >= foo
    AND timestamp < bar
) a
JOIN (
  SELECT id, value, dummy
  FROM RunTotalTestData
  WHERE timestamp >= foo
    AND timestamp < bar
) b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
Update:
You don't need a special property. You can just use

SELECT 1 AS one

and join on that.
As far as billing goes, the joined table counts toward the data processed.
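Putting that together, a minimal sketch of the dummy-key join (the constant column one exists only to satisfy the equality-join requirement):

SELECT a.id, SUM(b.value) AS running_total
FROM (SELECT id, value, 1 AS one FROM RunTotalTestData) a
JOIN (SELECT id, value, 1 AS one FROM RunTotalTestData) b
  ON a.one = b.one        -- equality join on the constant = Cartesian product
WHERE b.id <= a.id        -- keep only rows up to and including the current id
GROUP BY a.id
ORDER BY a.id;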
The problem with the second query is that BigQuery treats the comma in the FROM clause as a UNION ALL of the two tables rather than a join.
I'm not sure about the first one, but it's possible that BigQuery doesn't allow subselects in the SELECT list, only in the FROM clause. So you need to move the subquery into the FROM clause and JOIN the results.
Also, you could give our JDBC driver a try:
Starschema BigQuery JDBC Driver
Simply load it into SQuirreL SQL, RazorSQL, or pretty much any tool that supports JDBC drivers, and make sure you turn on the Query Transformer by setting:
transformQuery=true
in the properties or in the JDBC URL; all the details can be found on the project page. After you do this, try to run the second query; it will be transformed into a BigQuery-compatible join.
It's easy if we are allowed to use window functions.
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
With that we can do it like this:

WITH RunTotalTestData AS (
  SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value), (2,0), (3,1), (4,1), (5,2), (6,3)])
)
select *, sum(value) over(order by id) as running_total
from RunTotalTestData

How to optimize this query?

I have tables:
A (ID_A, VALID_FROM, DATA ...)
B (ID_B, ID, T1, T2, T3, DATE)
Table A can contain historical data (e.g. data valid for a given period).
I need to select records from table B joined with the appropriate records from table A (from table A I need the row where b.id = a.id_a and the record was valid at b.date):
select *
from B, (select *
         from (select *
               from A
               where a.id_a = b.id
                 and a.valid_from <= b.date
               order by valid_from desc)
         where rownum = 1)
where b.id = a.id_a
Sounds like you're looking for a JOIN: http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/queries006.htm
This isn't much more optimal, but is probably more readable:
select *
from A a, B b
where a.id_a = b.id
  and a.valid_from = (select max(valid_from)
                      from A
                      where id_a = b.id
                        and valid_from <= b.date)
order by valid_from desc
I've seen this problem before, and the best way I know of to optimise it is to put a valid_to column onto table A.
For the latest record, this should contain the biggest date Oracle can handle.
Whenever you create a newer version of a record, update the previous version's valid_to with the time the new record is created (minus a millisecond to avoid overlaps), so you have something like this:

ID  Valid_from                Valid_to
1   01/01/2011 12.34.56.0000  02/01/2011 12.34.56.0000
1   02/01/2011 12.34.56.0001  03/01/2011 12.34.56.0000
1   03/01/2011 12.34.56.0001  31/12/9999 23.59.59.9999
Then you can query it like this:
select *
from A a, B b
where a.id_a = b.id
  and b.date between a.valid_from and a.valid_to
order by valid_from desc
With an index on the date columns, the performance should be OK.
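A hedged sketch of such an index (the name and column order are assumptions; leading with the join key lets Oracle probe by id before range-scanning the dates):

CREATE INDEX a_validity_idx ON A (ID_A, VALID_FROM, VALID_TO);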
I've taken StevieG's answer and expanded on it. Without a valid_to column there are tricky subqueries to write. I would propose using the LEAD analytic function to find the end of the current validity period and work with that. This is an alternative to the subqueries and the valid_to column.
The LEAD analytic function looks over the rows in the current data set and finds the next valid_from date and uses that as the end of the current period.
My query is shown below. It incorporates the sample data you provided, in a with clause.
with table_a as (
  select 1 as id, 'XXX1' as data, date '2009-01-01' as valid_from from dual union all
  select 1 as id, 'XXX2' as data, date '2009-05-30' as valid_from from dual union all
  select 1 as id, 'XXX3' as data, date '2010-01-11' as valid_from from dual union all
  select 2 as id, 'YYY' as data, date '1999-01-01' as valid_from from dual
),
table_b as (
  select 1 as id, 1 as id_a, date '2009-02-01' as date_col from dual union all
  select 2 as id, 2 as id_a, date '2009-09-12' as date_col from dual union all
  select 3 as id, 1 as id_a, date '2009-06-30' as date_col from dual
)
select *
from table_b b
join (
  select id,
         data,
         valid_from,
         -- the next row's valid_from (or end-of-time for the last row) closes this period
         lead(valid_from, 1, date '9999-12-31') over (partition by a.id order by a.valid_from) as valid_to
  from table_a a
) a on (a.id = b.id_a)
where
  a.valid_from <= b.date_col and
  b.date_col < a.valid_to