SQL - Comparing Dates in 2 Tables, Retrieving Most Recent

I have two large MS SQL Server tables (A & B) that I'll generalize here. I'm attempting to create a new column in TableA derived from TableB, containing the most recent TableB.refDate PRIOR to each row's TableA.dataDate. The datasets are extremely large, but these DISTINCT date queries run quickly. I only care about the distinct dates in each table; no further matching criteria are required.
SELECT DISTINCT dataDate FROM TableA
> 2019-02-13
> 2019-02-09
> 2019-02-05
SELECT DISTINCT refDate FROM TableB
> 2019-02-13
> 2019-02-12
> 2019-02-10
> 2019-02-07
> 2019-02-05
> 2019-02-04
The end result should be something like:
dataDate mostRecentRefDate
2019-02-13 2019-02-12
2019-02-09 2019-02-07
2019-02-05 2019-02-04
Something along these lines should work in theory, but the datasets are far too large:
SELECT
DISTINCT a.dataDate as dataDate,
(SELECT MAX(b.refDate) FROM TableB b WHERE a.dataDate > b.refDate) as mostRecentRefDate
FROM TableA a
Is there a better way to perform this utilizing the results of those initial DISTINCT date queries? Then reference the results to quickly insert the new column?

You may want to try this:
SELECT
a.dataDate as dataDate,
MAX(b.refDate) mostRecentRefDate
FROM TableA a
inner join TableB b
on a.dataDate > b.refDate
group by a.dataDate

Assuming you want other columns besides the dates, I would recommend apply:
select a.*, b.*
from a outer apply
(select top (1) b.*
from b
where b.refdate < a.datadate
order by b.refdate desc
) b;

You can try the following
WITH T(date, source) AS
(
SELECT DISTINCT dataDate, 'A'
FROM TableA
UNION ALL
SELECT DISTINCT refDate, 'B'
FROM TableB
), T2 AS
(
SELECT * ,
mostRecentRefDate = MAX(CASE WHEN source = 'B' THEN date END)
OVER (ORDER BY date, source ROWS UNBOUNDED PRECEDING)
FROM T
)
SELECT date AS dataDate, mostRecentRefDate
FROM T2
WHERE source = 'A'
The plan for this looks pretty good (though yours may differ depending on how the DISTINCT is carried out)
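To get from the small date mapping back to the new column the question asks about, one possible sketch for SQL Server follows. It assumes TableA already has a nullable mostRecentRefDate column; the temp table name #DateMap and the aliases are made up for illustration, and this has not been tried against tables of the size described.

```sql
-- Assumption: the target column already exists, e.g.
--   ALTER TABLE TableA ADD mostRecentRefDate date NULL;

-- Step 1: build the lookup from the distinct dates only (small inputs).
SELECT d.dataDate,
       (SELECT MAX(r.refDate)
        FROM (SELECT DISTINCT refDate FROM TableB) AS r
        WHERE r.refDate < d.dataDate) AS mostRecentRefDate
INTO #DateMap   -- hypothetical temp table name
FROM (SELECT DISTINCT dataDate FROM TableA) AS d;

-- Step 2: copy the looked-up value onto every TableA row by equality join.
UPDATE a
SET a.mostRecentRefDate = m.mostRecentRefDate
FROM TableA AS a
JOIN #DateMap AS m
  ON m.dataDate = a.dataDate;
```

The inequality comparison only runs over the distinct dates; the large table is touched once, through an equality join on dataDate.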

Related

Finding the most recent records from duplicates

I'm trying to get the most recent records from a table where there are duplicates for each row.
Every month a new row for some IDs is getting added to the table, but some other records might not have a new row each month so the data will be like this
ID Date
1 8/30/2022
1 7/30/2022
3 8/30/2022
3 7/30/2022
3 6/30/2022
4 1/11/2021
The query result should be
ID Date
1 8/30/2022
3 8/30/2022
4 1/11/2021
I tried to use a sub-query, but it only returns records that have the most recent date for the whole table, not per ID, so it only returns those that have a record on 8/30/2022.
This is my query
create table test as (
select * from table1 inner join
(select EmpID, max(Record_Date) as maxdate
from table1 group by EmpID) ms
on table1.EmpID = ms.EmpID and Record_Date=maxdate)
WITH DATA;
You can use the NOT EXISTS operator with a correlated subquery, as follows:
SELECT T.ID, T.Date
FROM Table1 T
WHERE NOT EXISTS(SELECT * FROM Table1 D WHERE D.ID=T.ID AND D.Date>T.Date)
And of course, if you want to create a new table from this statement the query will be:
CREATE TABLE test AS
(
SELECT T.ID, T.Date
FROM Table1 T
WHERE NOT EXISTS(SELECT * FROM Table1 D WHERE D.ID=T.ID AND D.Date>T.Date)
) WITH DATA;
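An alternative sketch, not taken from the answer above, ranks each ID's rows by date with ROW_NUMBER() and keeps the first. It assumes the ID and Date column names from the sample and a DBMS with window functions; the names ranked and rn are made up here, and Date may need quoting where it is a reserved word.

```sql
SELECT ID, Date
FROM (
    SELECT T.ID,
           T.Date,
           -- rn = 1 marks the newest row within each ID
           ROW_NUMBER() OVER (PARTITION BY T.ID ORDER BY T.Date DESC) AS rn
    FROM Table1 T
) ranked
WHERE rn = 1;
```

Unlike the NOT EXISTS form, this returns exactly one row per ID even when two rows share the same latest date.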

Get one row from multiple rows

I have data like
tableid name status uuid date
1 a none 1 2019-12-02
1 a none 2 2019-12-02
1 a done 4 2019-12-02
2 b none 6 2019-12-02
2 b done 7 2019-12-02
3 c none 8 2019-12-02
If a table has multiple rows, I want to select the row of that table which has status done. If a table doesn't have any row with status 'done', I want to return its 'none' row.
tableid name status uuid date
1 a done 4 2019-12-02
2 b done 7 2019-12-02
3 c none 8 2019-12-02
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY tableid ORDER BY status ASC) as RN,
tableid,
name,
status,
uuid,
date
FROM
SAMPLE
)T
WHERE T.RN =1;
CHECK THIS : http://sqlfiddle.com/#!17/562f6b/4
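Note that ORDER BY status ASC in the query above picks the done row only because 'done' happens to sort before 'none' alphabetically. A variant that spells the priority out, as a sketch assuming the same SAMPLE table on Postgres; the uuid tie-breaker is an assumption added here so the pick is deterministic:

```sql
SELECT tableid, name, status, uuid, date
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY tableid
               -- 0 for 'done', 1 for anything else, so 'done' always ranks first
               ORDER BY CASE WHEN status = 'done' THEN 0 ELSE 1 END,
                        uuid
           ) AS rn
    FROM sample s
) t
WHERE t.rn = 1;
```

This keeps working if further status values are introduced later, since it no longer depends on how the strings sort.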
You can try the script below if the Status column only ever contains none and done:
WITH your_table(tableid,name,status,uuid,date)
AS
(
SELECT 1,'a','none',1,'2019-12-02' UNION ALL
SELECT 1,'a','none',2,'2019-12-02' UNION ALL
SELECT 1,'a','done',4,'2019-12-02' UNION ALL
SELECT 2,'b','none',6,'2019-12-02' UNION ALL
SELECT 2,'b','done',7,'2019-12-02' UNION ALL
SELECT 3,'c','none',8,'2019-12-02'
)
SELECT tableid,name, MIN(status),MAX(uuid),MAX(date)
FROM your_table
GROUP BY tableid,name
ORDER BY tableid
Since you are using Postgres, I would recommend DISTINCT ON, which is generally more efficient than other approaches (and is also much simpler to write):
select distinct on(tableid) t.*
from mytable t
order by tableid, status, date desc
The second sorting criterion is there to consistently break ties on status, if any (i.e. if there are several records with status = none and no record with status = done, only the latest record will be picked).
You want distinct on for this, but the correct formulation is:
select distinct on (tableid) t.*
from mytable t
order by tableid,
(status = 'done') desc,
(status = 'none') desc,
date desc;
Your question is unclear on what you want if there are no nones or dones.
If there is at most one done and you want all nones if there is no done, then a different approach is not exists:
select t.*
from mytable t
where t.status = 'done' or
(t.status = 'none' and
not exists (select 1
from mytable t2
where t2.tableid = t.tableid and
t2.status = 'done'
)
);

select count(ID) where ID IN a or b

I don't understand what I'm doing wrong. I'm trying to get a weekly COUNT of every ID that meets criteria A OR criteria B.
select CREATE_WEEK, count ( A.PK )
from TABLE1 A
where ( A.PK not in (select distinct ( B.FK )
from TABLE2 B
where B.CREATE_TIMESTAMP > '01-Jan-2013')
or A.PK in (select A.PK
from ( select A.PK, A.CREATE_TIMESTAMP as A_CRT, min ( B.CREATE_TIMESTAMP ) as FIRST_B
from TABLE1 A, TABLE2 B
where A.PK = B.FK
and A.CREATE_TIMESTAMP > '01-Jan-2013'
and B.CREATE_TIMESTAMP > '01-Jan-2013'
group by A.PK, A.CREATE_TIMESTAMP)
where A_CRT < FIRST_B) )
and A.CREATE_TIMESTAMP > '01-Jan-2013'
and CREATE_WEEK >= 2
and THIS_WEEK - CREATE_WEEK >= 1
group by CREATE_WEEK
order by CREATE_WEEK asc
Note: PK in table1 = FK in table2, so in the first subquery, I'm checking whether the PK from table1 exists as FK in table2. Week comes from TO_CHAR (TO_DATE (TRUNC (A.CREATE_TIMESTAMP, 'IW')), 'IW').
When I take out the OR and run the query on either subquery the results are returned in 1-2 seconds. But when I try to run the combined query, the results aren't returned after 20 minutes.
I know I can run them separately and then sum them in a spreadsheet, but I'd rather just get one number.
I'm trying to get a weekly COUNT of every ID that meets criteria A OR criteria B
However your code is:
ID NOT IN (subquery A) OR ID IN (subquery B)
The NOT is at odds with your requirement.
Assuming you want IDs that meet either criterion, use:
ID in (
select ... -- this is subquery A
union
select ... -- this is subquery B
)

BigQuery SQL running totals

Any idea how to calculate running total in BigQuery SQL?
id value running total
-- ----- -------------
1 1 1
2 2 3
3 4 7
4 7 14
5 9 23
6 12 35
7 13 48
8 16 64
9 22 86
10 42 128
11 57 185
12 58 243
13 59 302
14 60 362
Not a problem for traditional SQL servers, using either a correlated scalar subquery:
SELECT a.id, a.value, (SELECT SUM(b.value)
FROM RunTotalTestData b
WHERE b.id <= a.id)
FROM RunTotalTestData a
ORDER BY a.id;
or join:
SELECT a.id, a.value, SUM(b.Value)
FROM RunTotalTestData a,
RunTotalTestData b
WHERE b.id <= a.id
GROUP BY a.id, a.value
ORDER BY a.id;
But I couldn't find a way to make it work in BigQuery...
2018 update: The query in the original question works without modification now.
#standardSQL
WITH RunTotalTestData AS (
SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)])
)
SELECT a.id, a.value, (SELECT SUM(b.value)
FROM RunTotalTestData b
WHERE b.id <= a.id) runningTotal
FROM RunTotalTestData a
ORDER BY a.id;
2013 update: You can use SUM() OVER() to calculate running totals.
In your example:
SELECT id, value, SUM(value) OVER(ORDER BY id)
FROM [your.table]
A working example:
SELECT word, word_count, SUM(word_count) OVER(ORDER BY word)
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'hamlet'
AND word > 'a' LIMIT 30;
You probably figured it out already. But here is one, not the most efficient, way:
JOIN can only be done using equality comparisons i.e. b.id <= a.id cannot be used.
https://developers.google.com/bigquery/docs/query-reference#joins
This is pretty lame if you ask me. But there is one work around. Just use equality comparison on some dummy value to get the cartesian product and then use WHERE for <=. This is crazily suboptimal. But if your tables are small this is going to work.
SELECT a.id, SUM(b.value) as rt
FROM RunTotalTestData a
JOIN RunTotalTestData b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
You can manually constrain the time as well:
SELECT a.id, SUM(b.value) as rt
FROM (
SELECT id, timestamp, dummy FROM RunTotalTestData
WHERE timestamp >= foo
AND timestamp < bar
) AS a
JOIN (
SELECT id, timestamp, value, dummy FROM RunTotalTestData
WHERE timestamp >= foo AND timestamp < bar
) b ON a.dummy = b.dummy
WHERE b.id <= a.id
GROUP BY a.id
ORDER BY rt
Update:
You don't need a special property. You can just use
SELECT 1 AS one
and join on that.
As far as billing goes, the joined table counts toward the data processed.
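Put together, the constant-column workaround might look like the sketch below. It assumes RunTotalTestData resolves to a table with id and value columns; the column name one is just the constant suggested above, and the query sums b.value so that each row accumulates every value up to its own id.

```sql
SELECT a.id, SUM(b.value) AS rt
FROM (SELECT id, 1 AS one FROM RunTotalTestData) a
JOIN (SELECT id, value, 1 AS one FROM RunTotalTestData) b
  ON a.one = b.one        -- equality on a constant yields the cartesian product
WHERE b.id <= a.id        -- keep only rows at or before the current id
GROUP BY a.id
ORDER BY a.id
```

Since every row is paired with every other row before the WHERE filter applies, this is only reasonable for small tables.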
The problem with the second query is that BigQuery will UNION the 2 tables listed in the FROM clause instead of joining them.
I'm not sure about the first one, but it's possible that BigQuery doesn't accept subselects in the SELECT list, only in the FROM clause. If so, you need to move the subquery into the FROM clause and JOIN the results.
You could also give our JDBC driver a try:
Starschema BigQuery JDBC Driver
Load it into Squirrel SQL, RazorSQL, or any other tool that supports JDBC drivers, and make sure you turn on the Query Transformer by setting:
transformQuery=true
in the properties or in the JDBC URL; all the details can be found on the project page. After that, try running the 2nd query; it will be transformed into a BigQuery-compatible join.
It's easy if we are allowed to use window functions.
https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts
With that we can do it like this :
WITH RunTotalTestData AS (
SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value),(2,0),(3,1),(4,1),(5,2),(6,3)])
)
select *, sum(value) over(order by id) as running_total
from RunTotalTestData
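For reference, here is the same window-function idea with the frame written out explicitly, run over the first five rows of the table in the question. This is a sketch in BigQuery standard SQL; the inline data and the running_total alias are only for illustration.

```sql
#standardSQL
WITH RunTotalTestData AS (
  SELECT * FROM UNNEST([STRUCT(1 AS id, 1 AS value), (2, 2), (3, 4), (4, 7), (5, 9)])
)
SELECT id,
       value,
       -- sum everything from the first row up to and including the current one
       SUM(value) OVER (
         ORDER BY id
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM RunTotalTestData
ORDER BY id;
-- running_total: 1, 3, 7, 14, 23
```

With unique id values, the explicit ROWS frame gives the same result as the shorter OVER(ORDER BY id) form; it only differs when the ordering column has ties.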

Does table1 UNION ALL table2 guarantee output order table1, table2?

SELECT a FROM b
UNION ALL
SELECT a FROM c
UNION ALL
SELECT a FROM d
Does UNION ALL guarantee to print out records from tables b, c, d in that order? I.e., no records from c before any from b. This question is not for a specific DBMS.
No order by, no order guarantee whatsoever - that's for every database.
And for standard SQL, an ORDER BY is applied to the results from all the unioned queries.
To guarantee the order, use:
Select 1 as TableNo,* from a
union all
select 2 as TableNo,* from b
union all
select 3 as TableNO,* from c
order by TableNo, [desired column]
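Applied to the tables b, c, d and column a from the question, the same idea could be sketched as follows. The src column and the u alias are made up here; wrapping the union in a derived table keeps the helper column out of the output.

```sql
SELECT a
FROM (
    SELECT 1 AS src, a FROM b
    UNION ALL
    SELECT 2 AS src, a FROM c
    UNION ALL
    SELECT 3 AS src, a FROM d
) u
ORDER BY src, a;   -- all rows from b first, then c, then d
```

The second ORDER BY key is optional; without it, the order of rows within each source table is still unspecified.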