I have two tables that need to be joined together, and currently when I query them I get back many near-identical rows where only one column differs. I suspect the join is matching every combination of rows, so I tried a LEFT JOIN to tie the values to the date, but that appears to be incorrect.
I am trying to get the last entered value for each day across the following columns.
Symbol is not unique, but all of the other columns can hold different values for the same symbol at different points in time.
So my questions are:
How would I select the max value for each date across each column and return it in a single row?
How would I select the most recent value for each date across each column and return it in a single row?
My current query looks like this.
SELECT
DISTINCT(shortdata.date),
shortdata.symbol,
shortdata.fee_rate,
shortdata.rebate_rate,
shortdata.short_shares_available,
tss.total_volume as share_volume,
FROM `db1.table1` as shortdata
LEFT JOIN `db1.table2` as tss
ON shortdata.symbol = tss.ticker
AND shortdata.date = tss.date
WHERE shortdata.symbol in ('AMC')
ORDER BY date desc
LIMIT 25
Which is returning multiple rows per date, one for every combination of matching rows from the two tables.
date symbol fee_rate rebate_rate short_shares_available share_volume
2023-01-20 AMC 100.3354 -96.0154 200000 30313546
2023-01-20 AMC 97.206 -92.886 65000 30313546
2023-01-20 AMC 100.3354 -96.0154 200000 489689
2023-01-20 AMC 97.206 -92.886 65000 31271340
2023-01-20 AMC 100.3354 -96.0154 200000 31271340
2023-01-20 AMC 97.206 -92.886 65000 489689
2023-01-19 AMC 122.3875 -118.0675 300000 29182367
2023-01-19 AMC 117.3614 -113.0414 300000 29734773
2023-01-19 AMC 113.7761 -109.4561 300000 801000
2023-01-19 AMC 113.7761 -109.4561 300000 29182367
2023-01-19 AMC 122.3875 -118.0675 300000 801000
2023-01-19 AMC 113.7761 -109.4561 300000 29734773
2023-01-19 AMC 122.3875 -118.0675 300000 29734773
2023-01-19 AMC 117.3614 -113.0414 300000 29182367
2023-01-19 AMC 117.3614 -113.0414 300000 801000
2023-01-18 AMC 106.2183 -101.8983 150000 2294874
2023-01-18 AMC 106.2183 -101.8983 150000 61230037
2023-01-18 AMC 106.2183 -101.8983 150000 62117798
2023-01-17 AMC 105.4728 -101.1528 100000 57591052
2023-01-17 AMC 105.4436 -101.1236 150000 759340
2023-01-17 AMC 120.211 -115.891 150000 759340
2023-01-17 AMC 105.4436 -101.1236 150000 57591052
2023-01-17 AMC 105.4728 -101.1528 100000 56101661
2023-01-17 AMC 105.4436 -101.1236 150000 56101661
2023-01-17 AMC 107.1329 -102.8129 200000 57591052
The expected end result would be something like:
date symbol fee_rate rebate_rate short_shares_available share_volume
2023-01-20 AMC 100.3354 -96.0154 200000 31271340
2023-01-19 AMC 122.3875 -118.0675 300000 29182367
2023-01-18 AMC 106.2183 -101.8983 150000 2294874
2023-01-17 AMC 107.1329 -102.8129 200000 57591052
Here is a sample of the data from each table:
db1
date symbol fee_rate rebate_rate short_shares_available
2023-01-20 AMC 100.3354 -96.0154 200000
2023-01-20 AMC 97.206 -92.886 65000
2023-01-19 AMC 117.3614 -113.0414 300000
2023-01-19 AMC 113.7761 -109.4561 300000
2023-01-19 AMC 122.3875 -118.0675 300000
2023-01-18 AMC 106.2183 -101.8983 150000
2023-01-17 AMC 107.1329 -102.8129 200000
2023-01-17 AMC 105.4728 -101.1528 100000
2023-01-17 AMC 105.4436 -101.1236 150000
2023-01-17 AMC 120.211 -115.891 150000
db2
Note: tape_time is shown only to indicate where the most recent value could be taken from. An entry whose tape_time falls early on the 21st is still the last entry for the 20th, and so on.
date ticker tape_time share_volume
2023-01-20 AMC 2023-01-21 00:59:54.000000 UTC 31271340
2023-01-20 AMC 2023-01-20 14:28:56.000000 UTC 489689
2023-01-20 AMC 2023-01-20 20:59:58.000000 UTC 30313546
2023-01-19 AMC 2023-01-19 14:29:56.000000 UTC 801000
2023-01-19 AMC 2023-01-19 20:59:58.000000 UTC 29182367
2023-01-19 AMC 2023-01-20 00:59:45.000000 UTC 29734773
2023-01-18 AMC 2023-01-19 00:58:06.000000 UTC 62117798
2023-01-18 AMC 2023-01-18 20:59:59.000000 UTC 61230037
2023-01-18 AMC 2023-01-18 14:29:40.000000 UTC 2294874
2023-01-17 AMC 2023-01-18 00:59:42.000000 UTC 57591052
I have found snippets on how this could theoretically be done, but I'm unable to connect the dots further.
Any help would be appreciated.
You can consider the approach below for your requirement. QUALIFY filters on the result of the window function, keeping only the first row of each partition. Note that ordering the window by d1.date inside a partition on d1.date is a no-op (every row in the partition has the same date), so the window is ordered by d2.tape_time descending instead, which makes that first row the last entered value for each day.
SELECT distinct d1.date, d1.symbol, d1.fee_rate, d1.rebate_rate,
d1.short_shares_available, d2.share_volume
FROM `project.dataset.db1` d1 left join `project.dataset.db2` d2
on d1.symbol=d2.ticker and d1.date = d2.date
WHERE true
QUALIFY 1 = ROW_NUMBER() OVER (PARTITION BY d1.date ORDER BY d2.tape_time DESC)
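QUALIFY is BigQuery-specific; in engines without it, the same filter can be written as a subquery around the window function. Below is a minimal sketch of that portable pattern, run against SQLite with a cut-down version of the sample data. Table and column names follow the question, but the fee_rate tiebreak in the ORDER BY is my own addition to make the result deterministic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shortdata (date TEXT, symbol TEXT, fee_rate REAL)")
con.execute("CREATE TABLE tss (date TEXT, ticker TEXT, tape_time TEXT, total_volume INTEGER)")
con.executemany("INSERT INTO shortdata VALUES (?,?,?)", [
    ("2023-01-20", "AMC", 100.3354),
    ("2023-01-20", "AMC", 97.206),
])
con.executemany("INSERT INTO tss VALUES (?,?,?,?)", [
    ("2023-01-20", "AMC", "2023-01-21 00:59:54", 31271340),
    ("2023-01-20", "AMC", "2023-01-20 14:28:56", 489689),
    ("2023-01-20", "AMC", "2023-01-20 20:59:58", 30313546),
])

# SQLite has no QUALIFY, so the window function goes in a subquery
# and the filter on rn happens outside it.
rows = con.execute("""
    SELECT date, symbol, fee_rate, share_volume FROM (
        SELECT s.date, s.symbol, s.fee_rate,
               t.total_volume AS share_volume,
               ROW_NUMBER() OVER (
                   PARTITION BY s.date, s.symbol
                   -- latest tape_time first; fee_rate DESC breaks ties deterministically
                   ORDER BY t.tape_time DESC, s.fee_rate DESC
               ) AS rn
        FROM shortdata s
        LEFT JOIN tss t ON s.symbol = t.ticker AND s.date = t.date
    )
    WHERE rn = 1
""").fetchall()
print(rows)  # [('2023-01-20', 'AMC', 100.3354, 31271340)]
```

The six joined rows for 2023-01-20 collapse to the single row carrying the latest tape_time's share_volume.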
The initial issue seems to be a misunderstanding of DISTINCT. It is not a function, and no matter how it is parenthesized it always applies to the entire row: duplicate rows are removed, but if even a single column differs, the row is not a duplicate and is therefore kept. That is the issue you have. Postgres, however, offers a solution: DISTINCT ON (...). It returns only the first row of each set of rows sharing the specified value(s). In your case, perhaps:
SELECT DISTINCT ON (shortdata.date) -- selects just the first row for each date
shortdata.date, -- not output by the clause above; listed again to return it
shortdata.symbol,
shortdata.fee_rate,
shortdata.rebate_rate,
shortdata.short_shares_available,
tss.total_volume as share_volume
FROM db1.table1 as shortdata
LEFT JOIN db1.table2 as tss
ON shortdata.symbol = tss.ticker
AND shortdata.date = tss.date
WHERE shortdata.symbol in ('AMC')
ORDER BY shortdata.date desc, tss.tape_time desc -- must lead with the DISTINCT ON expression; the second key makes "first row" mean the last entered value
LIMIT 25;
Related
I am trying to list a count of how many times an EmployeeId shows up for each date, one row per occurrence, in the 0000 format.
Each time a specific EmployeeId occurs on a specific date, the count goes up: EmployeeId 143 appears twice on 2023-01-18, so its first row is 0001 and its second is 0002.
SELECT
FORMAT(COUNT(e.EmployeeId), '0000') AS [Count]
, e.EmployeeId
, c.CheckDate
FROM dbo.Check c
JOIN dbo.Employees e
ON e.EmployeeId = c.CreatedBy
GROUP BY c.CheckDate, e.EmployeeId
ORDER BY c.CheckDate DESC;
What I'm currently getting:
COUNT EmployeeId CheckDate
0002  143        2023-01-18 00:00:00.000
0002  143        2023-01-17 00:00:00.000
0002  427        2023-01-17 00:00:00.000
0007  607        2023-01-17 00:00:00.000
What I am wanting is:
COUNT EmployeeId CheckDate
0001  143        2023-01-18 00:00:00.000
0002  143        2023-01-18 00:00:00.000
0001  143        2023-01-17 00:00:00.000
0002  143        2023-01-17 00:00:00.000
0001  427        2023-01-17 00:00:00.000
0002  427        2023-01-17 00:00:00.000
etc.
My take on your issue is that you are aggregating when you really need a window function. To number the duplicate <CheckDate, EmployeeId> pairs, you can use the ROW_NUMBER window function.
SELECT FORMAT(ROW_NUMBER() OVER(
PARTITION BY c.CheckDate, e.EmployeeId
ORDER BY e.EmployeeId -- arbitrary within the pair; use a timestamp column here if the numbering should be chronological
), '0000') AS [Count]
, e.EmployeeId
, c.CheckDate
FROM dbo.Check c
INNER JOIN dbo.Employees e
ON e.EmployeeId = c.CreatedBy
ORDER BY c.CheckDate DESC;
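Here is a small check of the idea against SQLite, with SQL Server's FORMAT replaced by SQLite's printf and a made-up subset of the data. The table name `checks` is mine; the key point is that ROW_NUMBER restarts at 1 for every (CheckDate, EmployeeId) pair:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE checks (CheckDate TEXT, EmployeeId INTEGER)")
con.executemany("INSERT INTO checks VALUES (?,?)", [
    ("2023-01-18", 143), ("2023-01-18", 143),
    ("2023-01-17", 143), ("2023-01-17", 427),
])

# ROW_NUMBER restarts at 1 per (CheckDate, EmployeeId) partition,
# giving the per-occurrence counter; printf('%04d', ...) zero-pads it.
rows = con.execute("""
    SELECT printf('%04d', ROW_NUMBER() OVER (
               PARTITION BY CheckDate, EmployeeId
               ORDER BY EmployeeId)) AS [Count],
           EmployeeId, CheckDate
    FROM checks
    ORDER BY CheckDate DESC, EmployeeId, [Count]
""").fetchall()
for r in rows:
    print(r)
```

The duplicate EmployeeId 143 on 2023-01-18 comes out as 0001 and 0002, as the question asks.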
I would like to create a new column indicating when the last change was made to the price column: if the price changed on this row's date, we want to see that date, and if it is unchanged, we want the date of the most recent change. Everything should be written without loops or DECLARE because it has to work in Impala.
Input:
date price
2023-01-31 150
2023-01-30 150
2023-01-29 100
2023-01-28 100
2023-01-27 100
2023-01-26 50
Output:
date price valid_from
2023-01-31 150 2023-01-30
2023-01-30 150 2023-01-30
2023-01-29 100 2023-01-27
2023-01-28 100 2023-01-27
2023-01-27 100 2023-01-27
2023-01-26 50 2023-01-26
Thanks.
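One standard loop-free approach: use LAG to flag the rows where the price differs from the previous day, then carry the latest change date forward with a running MAX. The sketch below runs on SQLite using the question's data; I have not verified this exact query on Impala, though Impala supports LAG and MAX OVER (and its `IS DISTINCT FROM` plays the role of SQLite's `IS NOT` here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (date TEXT, price INTEGER)")
con.executemany("INSERT INTO prices VALUES (?,?)", [
    ("2023-01-31", 150), ("2023-01-30", 150), ("2023-01-29", 100),
    ("2023-01-28", 100), ("2023-01-27", 100), ("2023-01-26", 50),
])

# Step 1: LAG fetches the previous day's price.
# Step 2: rows where the price changed get their own date as change_date.
# Step 3: a running MAX carries the latest change date forward.
rows = con.execute("""
    WITH lagged AS (
        SELECT date, price,
               LAG(price) OVER (ORDER BY date) AS prev_price
        FROM prices
    ),
    marked AS (
        SELECT date, price,
               CASE WHEN price IS NOT prev_price THEN date END AS change_date
        FROM lagged
    )
    SELECT date, price,
           MAX(change_date) OVER (ORDER BY date) AS valid_from
    FROM marked
    ORDER BY date DESC
""").fetchall()
for r in rows:
    print(r)
```

This reproduces the expected output, including valid_from = 2023-01-26 for the first row, since LAG returns NULL there and NULL `IS NOT` any price.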
Trying to calculate a running cumulative sum.
It needs to be ordered by two columns: Delivery, then Date.
Query:
SELECT Date, Delivery, Balance, SUM(Balance) OVER ( ORDER BY Delivery, Date) AS cumsum
FROM t
Results:
Contract_Date Delivery Balance cumsum
2020-02-25 2020-03-01 308.100000 308.100000
2020-03-05 2020-03-01 -2.740000 305.360000
2020-03-06 2020-04-01 176.820000 682.180000
2020-03-06 2020-04-01 200.000000 682.180000
2020-03-09 2020-04-01 300.000000 1082.180000
2020-03-09 2020-04-01 100.000000 1082.180000
2020-03-13 2020-04-01 129.290000 1211.470000
2020-03-16 2020-04-01 200.000000 1711.470000
2020-03-16 2020-04-01 300.000000 1711.470000
2020-03-17 2020-04-01 300.000000 2011.470000
2020-04-01 2020-04-01 86.600000 2098.070000
2020-04-03 2020-04-01 200.000000 2298.070000
Expected results:
Contract_Date Delivery Balance cumsum
25/2/2020 1/3/2020 308.1 308.1
5/3/2020 1/3/2020 -2.74 305.36
6/3/2020 1/4/2020 176.82 482.18
6/3/2020 1/4/2020 200 682.18
9/3/2020 1/4/2020 300 982.18
9/3/2020 1/4/2020 100 1082.18
13/3/2020 1/4/2020 129.29 1211.47
16/3/2020 1/4/2020 200 1411.47
16/3/2020 1/4/2020 300 1711.47
17/3/2020 1/4/2020 300 2011.47
1/4/2020 1/4/2020 86.6 2098.07
3/4/2020 1/4/2020 200 2298.07
Version:
Microsoft SQL Server 2017
With only Delivery and Date in the ORDER BY, rows that tie on both columns are peers, and SUM() OVER with its default RANGE frame gives every peer the same running total, which is exactly why the tied rows above share a cumsum. You need a third column in the ORDER BY clause to break the ties. It is not obvious which one to use; here is one option using the Balance column:
SELECT
Date,
Delivery,
Balance,
SUM(Balance) OVER ( ORDER BY Delivery, Contract_Date, Balance) AS cumsum
FROM t
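An alternative to adding a tiebreak column is to change the frame itself from the default RANGE to ROWS, which advances the running total one physical row at a time even across peers. SQL Server 2017 supports `ROWS UNBOUNDED PRECEDING`; the sketch below demonstrates the difference in SQLite with two tied rows and simplified balances of my own choosing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (Contract_Date TEXT, Delivery TEXT, Balance REAL)")
# Two rows tied on both ORDER BY columns (balances simplified for clarity).
con.executemany("INSERT INTO t VALUES (?,?,?)", [
    ("2020-03-06", "2020-04-01", 100.0),
    ("2020-03-06", "2020-04-01", 200.0),
])

# Default RANGE frame: tied rows are peers, so both see the full 300.
range_sums = [r[0] for r in con.execute(
    "SELECT SUM(Balance) OVER (ORDER BY Delivery, Contract_Date) FROM t")]

# ROWS frame: the running total advances one physical row at a time.
rows_sums = [r[0] for r in con.execute(
    """SELECT SUM(Balance) OVER (ORDER BY Delivery, Contract_Date
                                 ROWS UNBOUNDED PRECEDING) FROM t""")]

print(range_sums)  # [300.0, 300.0]
print(rows_sums)   # every row now has its own running total
```

Note that with ROWS, which of the two tied rows comes first is still arbitrary, so the tiebreak column in the answer above remains the way to make the output fully deterministic.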
I have a table that contains Id, Date and a float value as below:
ID startDt Days
1328 2015-04-01 00:00:00.000 15
2444 2015-04-03 00:00:00.000 5.7
1658 2015-05-08 00:00:00.000 6
1329 2015-05-12 00:00:00.000 28.5
1849 2015-06-23 00:00:00.000 28.5
1581 2015-06-30 00:00:00.000 25.5
3535 2015-07-03 00:00:00.000 3
3536 2015-08-13 00:00:00.000 13.5
2166 2015-09-22 00:00:00.000 28.5
3542 2015-11-05 00:00:00.000 13.5
3543 2015-12-18 00:00:00.000 6
2445 2015-12-25 00:00:00.000 5.7
4096 2015-12-31 00:00:00.000 7.5
2446 2016-01-01 00:00:00.000 5.7
4287 2016-02-11 00:00:00.000 13.5
4288 2016-02-18 00:00:00.000 13.5
4492 2016-03-02 00:00:00.000 19.7
2447 2016-03-25 00:00:00.000 5.7
I am using a stored procedure which adds up the Days then subtracts it from a fixed value stored in a variable.
The total in the table is 245 and the variable is set to 245, so I should get a value of 0 when subtracting the two. However, I am getting a value of 5.6843418860808E-14 instead. I can't figure out why this is the case, and I have gone and re-entered each number in the table, but I still get the same result.
This is my sql statement that I am using to calculate the result:
Declare @AL_Taken as float
Declare @AL_Remaining as float
Declare @EntitledLeave as float
Set @EntitledLeave = 245
Set @AL_Taken = (select sum(Days) from tblALMain)
Set @AL_Remaining = @EntitledLeave - @AL_Taken
Select @EntitledLeave, @AL_Taken, @AL_Remaining
The select returns the following:
245, 245, 5.6843418860808E-14
Can anyone suggest why I am getting this number when I should be getting 0?
Thanks for the help
Rob
I changed the data type to Decimal as Tab Allenman suggested and this resolved my issue. I still don't understand why I didn't get zero when using float, as all the values added up to 245 exactly (I even re-entered the values manually) and 245 - 245 should have given me 0.
Thanks again for all the comments and explanations.
Rob
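The underlying cause is that values like 5.7 and 19.7 have no exact binary floating-point representation, so each addition can pick up a tiny rounding error; a decimal type stores those values exactly. The sketch below reproduces the effect in Python with the Days values from the question (the exact float residue depends on summation order, so it may differ from SQL Server's 5.68E-14):

```python
from decimal import Decimal

# The Days values from the question; on paper they sum to exactly 245.
days = [15, 5.7, 6, 28.5, 28.5, 25.5, 3, 13.5, 28.5,
        13.5, 6, 5.7, 7.5, 5.7, 13.5, 13.5, 19.7, 5.7]

# Binary floats cannot represent 5.7 or 19.7 exactly, so each addition
# can accumulate a tiny rounding error.
float_total = sum(days)
print(245 - float_total)  # a tiny residue, not necessarily 0

# Decimal stores these values exactly, so the subtraction is exactly 0 --
# which is why switching the column to DECIMAL fixed the issue.
decimal_total = sum(Decimal(str(d)) for d in days)
print(Decimal(245) - decimal_total)  # 0
```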
I have a logging table collecting values from many probes:
CREATE TABLE [Log]
(
[LogID] int IDENTITY (1, 1) NOT NULL,
[Minute] datetime NOT NULL,
[ProbeID] int NOT NULL DEFAULT 0,
[Value] FLOAT(24) NOT NULL DEFAULT 0.0,
CONSTRAINT Log_PK PRIMARY KEY([LogID])
)
GO
CREATE INDEX [Minute_ProbeID_Value] ON [Log]([Minute], [ProbeID], [Value])
GO
Typically, each probe generates a value every minute or so. Some example output:
LogID Minute ProbeID Value
====== ================ ======= =====
873875 2014-07-27 09:36 1972 24.4
873876 2014-07-27 09:36 2001 29.7
873877 2014-07-27 09:36 3781 19.8
873878 2014-07-27 09:36 1963 25.6
873879 2014-07-27 09:36 2002 22.9
873880 2014-07-27 09:36 1959 -30.1
873881 2014-07-27 09:36 2005 20.7
873882 2014-07-27 09:36 1234 23.8
873883 2014-07-27 09:36 1970 19.9
873884 2014-07-27 09:36 1991 22.4
873885 2014-07-27 09:37 1958 1.7
873886 2014-07-27 09:37 1962 21.3
873887 2014-07-27 09:37 1020 23.1
873888 2014-07-27 09:38 1972 24.1
873889 2014-07-27 09:38 3781 20.1
873890 2014-07-27 09:38 2001 30
873891 2014-07-27 09:38 2002 23.4
873892 2014-07-27 09:38 1963 26
873893 2014-07-27 09:38 2005 20.8
873894 2014-07-27 09:38 1234 23.7
873895 2014-07-27 09:38 1970 19.8
873896 2014-07-27 09:38 1991 22.7
873897 2014-07-27 09:39 1958 1.4
873898 2014-07-27 09:39 1962 22.1
873899 2014-07-27 09:39 1020 23.1
What is the most efficient way to get just the latest reading for each Probe?
e.g. of desired output (note: the "Value" is not, e.g., a Max() or an Avg()):
LogID Minute ProbeID Value
====== ================= ======= =====
873899 27-Jul-2014 09:39 1020 23.1
873894 27-Jul-2014 09:38 1234 23.7
873897 27-Jul-2014 09:39 1958 1.4
873880 27-Jul-2014 09:36 1959 -30.1
873898 27-Jul-2014 09:39 1962 22.1
873892 27-Jul-2014 09:38 1963 26
873895 27-Jul-2014 09:38 1970 19.8
873888 27-Jul-2014 09:38 1972 24.1
873896 27-Jul-2014 09:38 1991 22.7
873890 27-Jul-2014 09:38 2001 30
873891 27-Jul-2014 09:38 2002 23.4
873893 27-Jul-2014 09:38 2005 20.8
873889 27-Jul-2014 09:38 3781 20.1
Here is another approach, using a correlated subquery:
select *
from log l
where minute =
(select max(x.minute) from log x where x.probeid = l.probeid)
You can compare the execution plans with a fiddle: http://sqlfiddle.com/#!3/1d3ff/3/0
Try this:
SELECT T1.*
FROM Log T1
INNER JOIN (SELECT Max(Minute) Minute,
ProbeID
FROM Log
GROUP BY ProbeID) T2
ON T1.ProbeID = T2.ProbeID
AND T1.Minute = T2.Minute
You can play around with it on SQL Fiddle
Your question is: "What is the most efficient way to get just the latest reading for each Probe?"
To really answer this question, you need to test the different solutions. I would generally go with the row_number() method suggested by @jyparask. However, the following might have better performance:
select l.*
from log l
where not exists (select 1
from log l2
where l2.probeid = l.probeid and
l2.minute > l.minute
);
For performance, you want an index on log(probeid, minute).
Although not exactly your problem, here is an example of where not exists performs better than other methods on SQL Server.
;WITH MyCTE AS
(
SELECT LogID,
Minute,
ProbeID,
Value,
ROW_NUMBER() OVER(PARTITION BY ProbeID ORDER BY Minute DESC) AS rn
FROM LOG
)
SELECT LogID,
Minute,
ProbeID,
Value
FROM MyCTE
WHERE rn = 1
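All of the approaches above (correlated MAX, join to a grouped MAX, NOT EXISTS, and the ROW_NUMBER CTE) return the same rows; they differ only in execution plan. As a sanity check, here is the ROW_NUMBER version run in SQLite against a small subset of the sample log rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE log (LogID INTEGER, Minute TEXT, ProbeID INTEGER, Value REAL)")
con.executemany("INSERT INTO log VALUES (?,?,?,?)", [
    (873875, "2014-07-27 09:36", 1972, 24.4),
    (873888, "2014-07-27 09:38", 1972, 24.1),
    (873885, "2014-07-27 09:37", 1958, 1.7),
    (873897, "2014-07-27 09:39", 1958, 1.4),
])

# ROW_NUMBER ranks each probe's readings newest-first; rn = 1 keeps the latest.
rows = con.execute("""
    SELECT LogID, Minute, ProbeID, Value FROM (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY ProbeID ORDER BY Minute DESC) AS rn
        FROM log
    )
    WHERE rn = 1
    ORDER BY ProbeID
""").fetchall()
print(rows)
```

Each probe contributes exactly one row: its reading with the greatest Minute, with the Value taken from that same row rather than aggregated.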