Struggling with my FIRST_VALUE function in BigQuery

Struggling with my FIRST_VALUE function in BigQuery - sql

I have a table containing national drug codes and their price changes over the years. The columns are ndc (string) old_nadac_per_unit (numeric) and end_date (datetime)
The ndc shows up multiple times for every time the date changes, for instance ndc 1 might show up 5 times in the table because it has 5 different dates where the price changes. I wanted to do a query that only showed me the price at the earliest part of 2021. I entered this query
select
ndc,
old_nadac_per_unit,
MIN(end_date) AS start_date
FROM
`projectId.dataset.tableId`
GROUP BY
ndc,
old_nadac_per_unit
but when I ran the query it returned every row with a different price change, not just the first, so then I tried incorporating a FIRST VALUE function in a nested select statement.
This is the new query
SELECT
ndc,
old_nadac_per_unit,
MIN(end_date) AS first_date
FROM
`projectId.dataset.tableId`
WHERE old_nadac_per_unit =
(SELECT FIRST_VALUE (old_nadac_per_unit) OVER (PARTITION BY ndc ORDER BY ndc) AS first_price
FROM
`projectId.dataset.tableId`)
GROUP BY
ndc, old_nadac_per_unit
ORDER BY
ndc
Now it says I can run the query but that there is no data to display. I am at a loss. Any advice is appreciated.

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I am in need of an equivalent query based on window functions to report the total value of the column named balance for all the days up to and including the given date .
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above on the entire dataset up to and including the desired day. For example, for day 2009-01-31, the result is 97.13522530000000000000, or for day 2009-01-15 when we filter time as time < '2009-01-16 00:00:00' it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The reason for differences in result sets of the queries was on the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.

Consider calculating running total via window function after aggregating data to day level. And since you aggregate with a single condition, FILTER condition can be converted to basic WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily

How to get the second row by date and by id, without group by (not sure about row number)

My dataset is about sales, each line corresponds to an invoice. It is possible to have 2 registers in the same day for the same customer, if he had bought twice in that day.
As you can see in the image below, the blue square shows us that customer 355122 (id_cliente = customer_id) bought twice (275831N and 275826N invoice's id) in the same day (2020-12-19) (penult_data = second-last date). This query is meant to be a support table, to left join the main table and bring those results.
First of all, i've created a row number just over customer_id (blue arrow, aux), so that I could just join with aux = 2 (that should be the second-last register), but in cases that the customer bought twice that day, the second-last invoice is not the second-last date he bought. I need the second-last DATE. He can buy 1,2,3,4,5 times a day, so I cannot assume a correct aux number to filter.
Then, for some reason, I also created an aux2, it's a row number over customer and date, but it really didn't help. I needed something that would repeat the index for the same date, so that index = 2 would be the second-last date.
I cannot use group by because i'm retrieving the salesman id (penult_vend), the store id (penult_empe), and so on from the second-last date
This is the output of part of the query I'm using (as I said, the support table to left join the main table). I'm filtering to this customer's id.
Does somebody knows any function or method to make this work?
I'm using google big query.
Thanks

Assuming the column penult_data has only date information without time of the day, you can find the second to last "date" and then the last "invoice" on that date by using the DENSE_RANK() and ROW_NUMBER() functions:
dense_rank() over(partition by id_cliente
order by penult_data desc) as rnd,
row_number() over(partition by id_cliente, penult_data
order by penult_nf desc) as rnf,
Then, you can use the filtering condition:
where rnd = 2 and rnf = 1

I want NAV price as per (Today date minus 1) date

I have two tables. One is NAV where product daily new price is updated. Second is TDK table where item wise stock is available.
Now I want to get a summery report as per buyer name where all product wise total will come and from table one latest price will come.
I have tried below query...
SELECT dbo.TDK.buyer, dbo.NAV.Product_Name, sum(dbo.TDK.TD_UNITS) as Units, sum(dbo.TDK.TD_AMT) as 'Amount',dbo.NAV.NAValue
FROM dbo.TDK INNER JOIN
dbo.NAV
ON dbo.TDK.Products = dbo.NAV.Product_Name
group by dbo.TDK.buyer, dbo.NAV.Product_Name, dbo.NAV.NAValue
Imnportant: Common columns in both tables...
Table one NAV has column as Products
Table two TDK has column as Product_Name
If I have NAValue 4 records for one product then this query shows 4 lines with same total.
What I need??
I want this query to show only one line with latest NAValue price.
I want display one more line with Units*NAValue (latest) as "Latest Market Value".
Please guide.

What field contains the quote date? I am assuming you have a DATIME field, quoteDate, in dbo.NAV table and my other assumption is that you only store the Date part (i.e. mid-night, time = 00:00:00).
SELECT
t.buyer,
n.Product_Name,
sum(t.TD_UNITS) as Units,
sum(t.TD_AMT) as 'Amount',
n.NAValue
FROM dbo.TDK t
INNER JOIN dbo.NAV n
ON t.Products = n.Product_Name
AND n.quoteDate > getdate()-2
group by t.buyer, n.Product_Name, n.NAValue, n.QuoteDate
GetDate() will give you the current date and time. Subtracting 2 would get it before yesterday but after the day before yesterday.
Also, add n.quoteDate in your select and group by. Even though you don't need it, in case that one day you have a day of bad data with double record in NAV table, one with midnight time and another with 6 PM time.

Your code looks like SQL Server. I think you just want APPLY:
SELECT t.buyer, n.Product_Name, t.TD_UNITS as Units, t.TD_AMT as Amount, n.NAValue
FROM dbo.TDK t CROSS APPLY
(SELECT TOP (1) n.*
FROM dbo.NAV n
WHERE t.Products = n.Product_Name
ORDER BY ?? DESC -- however you define "latest"
) n;

How to calculate difference between two rows in a date interval?

I'm trying to compare data from an Access 2010 database based on a date interval. Example I have items from various purchase orders and I want to maintain the history of these item's delivery to a warehouse. So my purchase order has a request for a quantity of 10 of a material, for example, and it can be partially delivered in many deliveries and I want to know how this delivery varied in a date interval. To fill the date field the criteria used is the following: if the item had an update in the QtyPending field, I copy the current row deactivating it with a booelan field, create a new entry with the current update date updating the QtyPending field, so the active record is the actual state of the item. So I have a table that holds informations about these items like that
PO POItem QtyPending Date Active
4500000123 10 10 01/09/2014 FALSE
4500000123 10 8 05/09/2014 TRUE
4500000122 30 5 03/09/2014 FALSE
4500000122 30 1 04/09/2014 TRUE
With this example, for the first item, it means that from date 01/09 to 04/09 the QtyPending field didn't suffer a variation, meaning that the supplier didn't make any delivery to me, but from 01/09 to 05/08 he delivered me a qty of 2 of a material. For the second one, from date 03/09 to 04/09 the supplier delivered me a qty of 4 of a material. So, if I were to be making a report query from 02/09/2014 to 04/09/2014, the expected output is like this:
PO POItem QtyDelivered
4500000123 10 0
4500000122 30 4
And a report from 31/08/2014 to 10/09/2014, would have this output
PO POItem QtyDelivered
4500000123 10 2
4500000122 30 4
I'm not coming up with a query to make this report. Can anyone help me?

There are many ways of solving this. The easiest one would be to simply make a query of all the necessary records between two dates, loop over them and insert into a temporary table the result. This temporary table can then be the source of your report. A lot of people will scream at you for not using a big query instead but getting the result that you want in the fastest and simplest way should be your priority.
Your problem with your schema is that you don't have the QtyDelivered stored for each record. If you would have it, it would be an easy thing to sum over it in order to get needed result. By not storing this value, you have transformed a simple and fast query into a much harder and slower one because you need to recalculate this value in some way or other and you must do this without forgetting the fact that it's possible to have more than two records.
For calculating this value, you can either use a sub-query to retrieve the value from the previous row or a Left join do to the same. Once you have this value, you can subtract these two to get the needed difference; allowing for the possibility of Null value if there is no previous row. Once you have these values, you can now sum over them to get the final result with a Group By. Notice that in order to perform these calculations, you need to have one or two more levels of subquery. The first query should be something like:
Select PO, POItem, QtyPending, (Select Top 1 QtyPending from MyTable T2 where T1.PO = T2.PO and T2.Date < T1.Date And (T2.Date between #Date1 and #Date2) Order by T2.Date Desc) as QtyPending2 from MyTable T1 Where T1.Date between #Date1 and #Date2) ...
With this as either another subquery or as a View, you can then compute the desired difference by comparing the values of QtyPending and QtyPending2; without forgetting that QtyPendin2 may be Null. The remaining steps are easy to do.
Notice that the above example is for SQL-Server, you might have to change it a little for Access. In any case, you can find here many examples on how to compare two rows under Access. As noted earlier, you can also use a Left Join instead of a subquery to compare your rows.

I came up with this query that solved the problem, it wasn't that simple
SELECT
ItmDtIni.PO
,ItmDtIni.POItem AS [PO Item]
,ROUND(ItmDtIni.QtyPending - ItmDtEnd.QtyPending, 3) AS [Qty Delivered]
,ROUND((ItmDtIni.QtyPending - ItmDtEnd.QtyPending) * ItmDtEnd.Price, 2) AS [Value delivered(US$)]
//Filtering subqueries to bring only the items in the date interval to make a self join
FROM (((SELECT
PO
,POItem
,QtyPending
,MIN(Date) AS MinDate
FROM Item
WHERE Date BETWEEN FORMAT(begin_date, 'dd/mm/yyyy') AND FORMAT(end_date, 'dd/mm/yyyy')
GROUP BY
PO
,POItem
,QtyPending) AS ItmDtIni
//Self join filtering to bring only items in the date interval with the previously filtered table
INNER JOIN (SELECT
PO
,POItem
,QtyPending
,Price
,MAX(Date) AS MaxDate
FROM Item
WHERE Date BETWEEN FORMAT(begin_date, 'dd/mm/yyyy') AND FORMAT(end_date, 'dd/mm/yyyy')
GROUP BY
PO
,POItem
,QtyPending
,Price) AS ItmDtEnd
ON ItmDtIni.PO = ItmDtEnd.PO
AND ItmDtIni.POItem = ItmDtEnd.POItem)
INNER JOIN PO
ON ItmDtEnd.PO = PO.Numero)
WHERE
//Showing only items that had a variation in the date interval
ROUND(ItmDtIni.QtyPending - ItmDtEnd.QtyPending, 3) <> 0
//Anchoring min date in the interval for each item found by the first subquery
AND ItmDtIni.MinDate = (SELECT MIN(Item.Date)
FROM Item
WHERE
ItmDtIni.PO = Item.PO
AND ItmDtIni.POItem = Item.POItem
AND Date BETWEEN FORMAT(begin_date, 'dd/mm/yyyy') AND FORMAT(end_date, 'dd/mm/yyyy'))
//Anchoring max date in the interval for each item found by the second subquery
AND ItmDtEnd.MaxDate = (SELECT MAX(Item.Date)
FROM Item
WHERE
ItmDtEnd.PO = Item.PO
AND ItmDtEnd.POItem = Item.POItem
AND Date BETWEEN FORMAT(begin_date, 'dd/mm/yyyy') AND FORMAT(end_date, 'dd/mm/yyyy'))

analyze range and if true tell me

I want to see if the price of a stock has changed by 5% this week. I have data that captures the price everyday. I can get the rows from the last 7 days by doing the following:
select price from data where date(capture_timestamp)>date(current_timestamp)-7;
But then how do I analyze that and see if the price has increased or decreased 5%? Is it possible to do all this with one sql statement? I would like to be able to then insert any results of it into a new table but I just want to focus on it printing out in the shell first.
Thanks.

It seems odd to have only one stock in a table called data. What you need to do is bring the two rows together for last week's and today's values, as in the following query:
select d.price
from data d cross join
data dprev
where cast(d.capture_timestamp as date = date(current_timestamp) and
cast(dprev.capture_timestamp as date) )= cast(current_timestamp as date)-7 and
d.price > dprev.price * 1.05
If the data table contains the stock ticker, the cross join would be an equijoin.

You may be able to use query from the following subquery for whatever calculations you want to do. This is assuming one record per day. The 7 preceding rows is literal.
SELECT ticker, price, capture_ts
,MIN(price) OVER (PARTITION BY ticker ORDER BY capture_ts ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS min_prev_7_records
,MAX(price) OVER (PARTITION BY ticker ORDER BY capture_ts ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS max_prev_7_records
FROM data

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas