"Running Product" aggregate/ windowed function in PostgreSql?

I am trying to normalize End-of-Day stock prices in PostgreSQL.
Let's say I have a stock table defined as such:
create table eod (
date date not null,
stock_id int not null,
eod_split decimal(16,8) not null,
close decimal(12,6) not null,
constraint pk_eod primary key (date, stock_id)
);
Data in this table might look like this:
"date","stock_id","eod_split","close"
"2014-06-13",14010920,"1.00000000","182.560000"
"2014-06-13",14010911,"1.00000000","91.280000"
"2014-06-13",14010923,"1.00000000","41.230000"
"2014-06-12",14010911,"1.00000000","92.290000"
"2014-06-12",14010920,"1.00000000","181.220000"
"2014-06-12",14010923,"1.00000000","40.580000"
"2014-06-11",14010920,"1.00000000","182.250000"
"2014-06-11",14010911,"1.00000000","93.860000"
"2014-06-11",14010923,"1.00000000","40.860000"
"2014-06-10",14010911,"1.00000000","94.250000"
"2014-06-10",14010923,"1.00000000","41.110000"
"2014-06-10",14010920,"1.00000000","184.290000"
"2014-06-09",14010920,"1.00000000","186.220000"
"2014-06-09",14010911,"7.00000000","93.700000"
"2014-06-09",14010923,"1.00000000","41.270000"
"2014-06-06",14010923,"1.00000000","41.480000"
"2014-06-06",14010911,"1.00000000","645.570000"
"2014-06-06",14010920,"1.00000000","186.370000"
"2014-06-05",14010920,"1.00000000","185.980000"
"2014-06-05",14010911,"1.00000000","647.350000"
"2014-06-05",14010923,"1.00000000","41.210000"
...
"2005-03-04",14010920,"1.00000000","92.370000"
"2005-03-04",14010911,"1.00000000","42.810000"
"2005-03-04",14010923,"1.00000000","25.170000"
"2005-03-03",14010923,"1.00000000","25.170000"
"2005-03-03",14010911,"1.00000000","41.790000"
"2005-03-03",14010920,"1.00000000","92.410000"
"2005-03-02",14010920,"1.00000000","92.920000"
"2005-03-02",14010923,"1.00000000","25.260000"
"2005-03-02",14010911,"1.00000000","44.121000"
"2005-03-01",14010920,"1.00000000","93.300000"
"2005-03-01",14010923,"1.00000000","25.280000"
"2005-03-01",14010911,"1.00000000","44.500000"
"2005-02-28",14010923,"1.00000000","25.160000"
"2005-02-28",14010911,"2.00000000","44.860000"
"2005-02-28",14010920,"1.00000000","92.580000"
"2005-02-25",14010923,"1.00000000","25.250000"
"2005-02-25",14010920,"1.00000000","92.800000"
"2005-02-25",14010911,"1.00000000","88.990000"
"2005-02-24",14010923,"1.00000000","25.370000"
"2005-02-24",14010920,"1.00000000","92.640000"
"2005-02-24",14010911,"1.00000000","88.930000"
"2005-02-23",14010923,"1.00000000","25.200000"
"2005-02-23",14010911,"1.00000000","88.230000"
"2005-02-23",14010920,"1.00000000","92.100000"
...
"2003-02-24",14010920,"1.00000000","78.560000"
"2003-02-24",14010911,"1.00000000","14.740000"
"2003-02-24",14010923,"1.00000000","24.070000"
"2003-02-21",14010920,"1.00000000","79.950000"
"2003-02-21",14010923,"1.00000000","24.630000"
"2003-02-21",14010911,"1.00000000","15.000000"
"2003-02-20",14010911,"1.00000000","14.770000"
"2003-02-20",14010920,"1.00000000","79.150000"
"2003-02-20",14010923,"1.00000000","24.140000"
"2003-02-19",14010920,"1.00000000","79.510000"
"2003-02-19",14010911,"1.00000000","14.850000"
"2003-02-19",14010923,"1.00000000","24.530000"
"2003-02-18",14010923,"2.00000000","24.960000"
"2003-02-18",14010911,"1.00000000","15.270000"
"2003-02-18",14010920,"1.00000000","79.330000"
"2003-02-14",14010911,"1.00000000","14.670000"
"2003-02-14",14010920,"1.00000000","77.450000"
"2003-02-14",14010923,"1.00000000","48.300000"
"2003-02-13",14010920,"1.00000000","75.860000"
"2003-02-13",14010911,"1.00000000","14.540000"
"2003-02-13",14010923,"1.00000000","46.990000"
Note the "split" column. When a split value other than 1 is recorded, it basically means that the stock shares split by that factor. IOW, when the split is 2.0, the number of the outstanding shares doubled, but the value of each individual share is halved from that point on. If the stock was worth $100 per share, it's now worth $50 per share.
If you graph this with raw numbers, this sort of thing is truly ugly. Sharp cliffs show up, when the overall value of the company did not significantly change... and when you have multiple splits, you end up with a graph that does not properly reflect the trending of the company, often by a large margin. In the above example, where there was a 2:1 split, your close prices for a stock would look something like 100, 100, 100, 50, 50, 50.
I want to use this table to create a "normalized" price, in a reasonably efficient manner (there's quite a few records to chunk through). Continuing the sample, this would show the stock prices at 50, 50, 50, 50, 50, 50. If there were multiple splits, the data should still be consistent and smooth, if we ignored actual market value changes.
My idea is, if I can create a CTE of a "running product" aggregate of the split value, going back in time, I can define date ranges per stock and what the modifier value to apply to the closing cost should be, then join that back to the eod table and select into a new table the adjusted close value for each stock.
...the problem is, I cannot wrap my head around how to do that in anything other than a whole bunch of temp tables and multi-step processes. I do not know of any built-in functionality to make this easier, either.
Can someone show me how I can generate the normalized data?

You don't need a CTE. You just need a cumulative product. Postgres doesn't have one built in. But, arithmetic to the rescue!
select eod.*,
       exp(sum(ln(eod_split)) over (partition by stock_id order by date)) as cume_split,
       (close *
        exp(sum(ln(eod_split)) over (partition by stock_id order by date))
       ) as normalized_price
from eod;
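The arithmetic behind this relies on the identity product(x) = exp(sum(ln(x))), which holds only for strictly positive values. Here is a minimal sketch in Python that checks it against a direct running product; the split factors are made up for illustration and are not taken from the table above.

```python
import math
from itertools import accumulate

# Hypothetical split factors for one stock, ordered by date ascending.
splits = [1.0, 2.0, 1.0, 7.0, 1.0]

# Running product via logarithms, mirroring exp(sum(ln(x)) over (...)).
# Only valid for strictly positive factors: ln(0) and ln(negative) are undefined.
running_log_sum = accumulate(math.log(s) for s in splits)
cume_split = [math.exp(v) for v in running_log_sum]

# Direct running product, for comparison.
direct = list(accumulate(splits, lambda a, b: a * b))

for via_logs, exact in zip(cume_split, direct):
    # Logs introduce small floating-point error, so compare with a tolerance.
    assert math.isclose(via_logs, exact, rel_tol=1e-9)

print(direct)
```

Because the log route goes through floating-point functions, the result can differ from the exact product in the last digits, so rounding the final price to a fixed number of decimals may be worthwhile.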

Hilarious, looking for this solution, I find that an associate already asked about it. Here is the basic algebra behind this ingenious solution: https://blog.prepscholar.com/natural-log-rules

Related

Big Query Error When Using CAST and determining decimals

I have linked a Big Query Project to my Google Ads Account. Within that, we have a campaignBasicStats table.
I want to pull the cost of a campaign from my Google Ads account into a big query workspace to apply some additional logic.
The cost column is coming through as an INTEGER and is described like this:
INTEGER NULLABLE
The sum of your cost-per-click (CPC) and cost-per-thousand impressions (CPM) costs during this period. Values can be one of: a) a money amount in micros, b) "auto: x" or "auto" if this field is a bid and AdWords is automatically setting the bid via the chosen bidding strategy, or c) "--" if this field is a bid and no bid applies to the row.
If I query the table, the cost is returned in this form, for example:
2590000.0
965145.0
In Google Ads, the two costs for these campaigns are £25.90 and £96.51
So I have this code in my Big Query Workspace.
SELECT CAST(Cost AS FLOAT64)
FROM `db_table`
WHERE COST > 0
LIMIT 1000
The column returns these numbers:
2590000.0
965145.0
However, I need the numbers as currency amounts. For example, the first value, 2590000.0, should be 25.90 and the second one should be 96.51.
I changed my code to this:
SELECT CAST(Cost AS FLOAT64(4,2))
FROM `db_table`
WHERE COST > 0
LIMIT 1000
And now I get this error:
FLOAT64 does not support type parameters at [1:28]
Is there something I'm missing? How do I convert to a decimal value and specify where I want the decimal point to be in BQ?
Thanks,

It appears you are using a Google Ads Data Transfer operation as detailed here.
In this case, it's important to note the Description of the Cost column in p_CampaignBasicStats:
The sum of your cost-per-click (CPC) and cost-per-thousand impressions
(CPM) costs during this period. Values can be one of: a) a money
amount in micros, b) "auto: x" or "auto" if this field is a bid and
AdWords is automatically setting the bid via the chosen bidding
strategy, or c) "--" if this field is a bid and no bid applies to the
row.
1 micro is 1-millionth of the fundamental currency. Thus, we need to transform this amount as such: cost / 1000000
Then, we simply need to ROUND to get the appropriate unit. If you prefer to always round up, see my answer regarding the correct way to do that here.
First, we'll set up an example table with the example values you've given:
CREATE TEMP TABLE ex_db_table ( Cost INTEGER );
INSERT INTO
ex_db_table
VALUES
( 2590000 );
INSERT INTO
ex_db_table
VALUES
( 965145 );
Then we'll select the data in your preferred unit:
SELECT
ROUND(Cost / 1000000, 2) as currency_cost
FROM
ex_db_table;
Of note, your math in your question is incorrect here as the actual values of your Cost examples equate to 2.59 and 0.97.
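To double-check the arithmetic outside BigQuery, here is a small Python sketch of the same micros conversion. The function name and the choice of Decimal are mine, not part of the Google Ads transfer; it only assumes that 1 micro is one millionth of the currency unit, as stated above.

```python
from decimal import Decimal, ROUND_HALF_UP

MICROS_PER_UNIT = Decimal(1_000_000)  # 1 micro = one millionth of the currency unit


def micros_to_currency(cost_micros: int) -> Decimal:
    """Convert a cost in micros to currency units, rounded to 2 decimal places."""
    units = Decimal(cost_micros) / MICROS_PER_UNIT
    # ROUND_HALF_UP is an assumption chosen for this sketch; for positive
    # amounts it rounds halfway cases upward.
    return units.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


costs = [2590000, 965145]  # the two example values from the question
converted = [micros_to_currency(c) for c in costs]
print(converted)  # [Decimal('2.59'), Decimal('0.97')]
```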

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window. The time window can range from "one day" to "many years". There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary a lot, say from a few hundred to several thousand or millions.
Those values are meant to be visualized in a chart on a webpage.
If the chart is - lets say - 800px wide, it does not make sense to get thousands of rows from database if time-window is quite big. I cannot show more than 800 values on this chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate functions. But how can I, e.g., aggregate 100k rows from a big time window to, let's say, 800 final rows?
Just getting those 100k rows and let the chart do the magic is not the preferred option. Transfer-size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or a simple magic to just skip every #th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DB.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, truncate it, average the values and group by the truncated date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structure seems not really optimized for this: the query for the last year took ~75 sec (slow test machine) and in the end returned only one value per day. This can be combined with avg(value), but that further increases query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially with aggregation in combination with "group by".
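For reference, the bucketing idea in update 3 (subtract the remainder modulo 86400, then average per bucket) can be sketched in plain Python. The sample rows and the function name are hypothetical; this only illustrates the grouping logic, not the database performance.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def daily_averages(rows):
    """rows: iterable of (unix_timestamp, value). Returns {bucket_start: mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in rows:
        bucket = ts - (ts % SECONDS_PER_DAY)  # same expression as in the query
        sums[bucket] += value
        counts[bucket] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical measurements: two on day 0, two on day 1, one on day 2.
rows = [(0, 1.0), (60, 3.0), (86400, 10.0), (86460, 20.0), (172800, 5.0)]
result = daily_averages(rows)
print(result)  # {0: 2.0, 86400: 15.0, 172800: 5.0}
```

Changing SECONDS_PER_DAY to another bucket width (for example, the window length divided by the chart width in pixels) gives roughly one value per pixel.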

There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . . -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;
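To make the behaviour of that query concrete, here is a rough Python sketch of the same idea: split the ordered rows into equally sized buckets the way ntile() does (sizes differ by at most one, with the larger buckets first) and keep the first row of each. The function name is made up for illustration.

```python
def ntile_first_rows(rows, tiles):
    """Split ordered rows into `tiles` buckets like ntile() and keep the first row of each."""
    base, extra = divmod(len(rows), tiles)
    picked = []
    start = 0
    for i in range(tiles):
        size = base + (1 if i < extra else 0)  # earlier buckets absorb the remainder
        if size == 0:
            break  # fewer rows than tiles: the remaining buckets are empty
        picked.append(rows[start])
        start += size
    return picked


print(ntile_first_rows(list(range(10)), 3))  # [0, 4, 7]
```

With fewer rows than buckets, every row is returned, so short time windows are not thinned out.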

Advice needed on weighted averaging in power query or power pivot

I need to create a weighted average that multiplies a column of volume manufactured for multiple manufacturing plants by a column containing the cost to manufacture at each plant, and returns one weighted average value for a specific product type for all plants.
I've tried adding this as a calculated column using:
=sumx('Plant','Plant'[Cost]*'Plant'[Tonnage])/sum('Plant'[Tonnage])
But this goes row by row, so it doesn't give me the overriding average that I need for the company. I can aggregate the data, but I really want to see the average lined up against each individual plant for benchmarking.
Any ideas how I can do this?

You can do this in multiple ways. You can either make a single, more complex calculation, or you can make a few calculated columns to make the final calculation more transparent. I will pick the latter approach here, because it makes it easier to show what is going on. I'm going to use the following DAX functions: CALCULATE, SUM, and ALLEXCEPT.
First, create three new calculated columns.
The first one should contain the [Volume] times [Cost] for each record:
VolumeTimesCost:=[Volume] * [Cost]
The second one should contain the sum of [VolumeTimesCost] for all plants within a given product type. It could look like this:
TotalProductTypeCost:=CALCULATE(SUM('Plant'[VolumeTimesCost]),ALLEXCEPT('Plant','Plant'[Product Type]))
ALLEXCEPT takes the table as its first argument (assumed here to be 'Plant', as in your formula). ALLEXCEPT('Plant','Plant'[Product Type]) removes the filters from every column of that table except the [Product Type] column.
The third calculated column should contain the SUM of [Volume] for all plants within a given product type. It could look like this:
TotalProductTypeVolume:=CALCULATE(SUM('Plant'[Volume]),ALLEXCEPT('Plant','Plant'[Product Type]))
You can then create your measure based on the two calculated columns, dividing [TotalProductTypeCost] by [TotalProductTypeVolume].
I hope that helps you solve the issue correctly. Otherwise feel free to let me know!
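As a sanity check of the three-column approach, here is the same calculation sketched in Python with made-up plant data: the per-row product, the two totals per product type, and the resulting weighted average attached to every plant row. The field names are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical plant rows; field names are illustrative, not from the workbook.
plants = [
    {"plant": "A", "product_type": "X", "volume": 100.0, "cost": 10.0},
    {"plant": "B", "product_type": "X", "volume": 300.0, "cost": 20.0},
    {"plant": "C", "product_type": "Y", "volume": 50.0, "cost": 8.0},
]

total_cost = defaultdict(float)    # sum of volume * cost per product type
total_volume = defaultdict(float)  # sum of volume per product type
for p in plants:
    total_cost[p["product_type"]] += p["volume"] * p["cost"]
    total_volume[p["product_type"]] += p["volume"]

# Attach the product-type-wide weighted average to each plant row for benchmarking.
for p in plants:
    t = p["product_type"]
    p["weighted_avg_cost"] = total_cost[t] / total_volume[t]

print([p["weighted_avg_cost"] for p in plants])  # [17.5, 17.5, 8.0]
```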

Access 2013 SQL to perform linear interpolation where necessary

I have a database in which there are 13 different products, sold in 6 different countries.
Prices increase once a year.
Prices need to be calculated using a linear interpolation method.  I have 21 different price and quantity increments for each product for each country for each year.
The user needs to be able to see how much an order would cost for any given value (as you would expect).
What the database needs to do (in English!) is to:
If there is a matching quantity from TblOrderDetail in the TblPrices,
use the price for the current product, country and year
if there isn't a matching quantity but the quantity required is greater than 1000 for one product (GT) and greater than 100 for every other product:
Find the highest quantity for the product, country and year (so, 1000 or 100, depending on the product), and calculate a pro-rated price.  eg.  If someone wanted 1500 of product GT for the UK for 2015, we'd look at the price for 1000 GT in the UK for 2015 and multiply it by 1.5.  If 1800 were required, we'd multiply it by 1.8.  I haven't been able to get this working yet as I'm looking at it alongside the formula for the next possibility...
If there isn't a matching quantity and the quantity required is less than 1000 for the product GT, or less than 100 for the other products (this is the norm)...
Find the quantity and price for the increment directly below the quantity required by the user for the required product, country and year (let's call these quantitybelow and pricebelow)
Find the quantity and price for the increment directly above the quantity required by the user for the required product, country and year (let's call these quantityabove and priceabove)
Calculate the price for the required number of products for an account holder in a particular country for a given year using this formula.
ActualPrice: PriceBelow + ((PriceAbove - PriceBelow) * (The quantity required in the order detail - QuantityBelow) / (QuantityAbove - QuantityBelow))
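For clarity, the formula above written out as a small Python function with one worked example. The step quantities and prices in the example are invented, and the function name is only illustrative.

```python
def interpolate_price(qty, qty_below, price_below, qty_above, price_above):
    """Linear interpolation between the price steps directly below and above qty."""
    if qty_above == qty_below:
        raise ValueError("the two steps must have different quantities")
    fraction = (qty - qty_below) / (qty_above - qty_below)
    return price_below + (price_above - price_below) * fraction


# Hypothetical steps: 50 units cost 400.0 and 100 units cost 700.0; the order is for 75.
price = interpolate_price(75, 50, 400.0, 100, 700.0)
print(price)  # 550.0
```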
I have spent days on this and have sought advice about this before but I am still getting very stuck.
The tables I've been working with to try and make this work are as follows:
TblAccount (primary key is AccountID, it also has a Country field which joins to the TblCountry.Code (primary key)
TblOrders (primary key is Order ID) which joins to TblAccount via the AccountID field; TblOrderDetail via the OrderID.  This table also holds the OrderDate and Recipient ID which links to a person in TblContact - I don't need that here but will need it later to generate an invoice 
TblOrderDetail (primary key is DetailID) which joins to TblOrders via OrderID field; TblProducts via ProductID field, and holds the Quantity required as well as the product
TblProducts (primary key is ProductCode) which as well as joining to TblOrderDetail, also joins to TblPrice via the Product field
TblPrices links to the TblProducts (as you have just read).  I've also created an Alias for the TblCountry (CountryAliasForProductCode) so I can link it to the TblPrices to show the country link. I'm not sure if I needed to do this - it doesn't work if I do or I don't do it, so I seek guidance again here.
This is the code I've been trying to use (and failing) to get my price and quantity steps above and I hope to replicate it, making a couple of tweaks to get the steps below:
SELECT MIN(TblPrices.stepquantity) AS QuantityAbove, MIN(TblPrices.StepPrice) AS PriceAbove, TblOrders.OrderID, TblOrders.OldOrderID, TblOrders.AccountID, TblOrders.OrderDate, TblOrders.RecipientID, TblOrders.OrderStatus, TblOrderDetail.DetailID, TblOrderDetail.Product, TblOrderDetail.Quantity
FROM (TblCountry INNER JOIN ((TblAccount INNER JOIN TblOrders ON TblAccount.AccountID = TblOrders.AccountID) INNER JOIN (TblOrderDetail INNER JOIN TblProducts ON TblOrderDetail.Product = TblProducts.ProductCode) ON TblOrders.OrderID = TblOrderDetail.OrderID) ON TblCountry.Code = TblAccount.Country) INNER JOIN (TblCountry AS CountryAliasForProduct INNER JOIN TblPrices ON CountryAliasForProduct.Code = TblPrices.CountryCode) ON TblProducts.ProductCode = TblPrices.Product
WHERE (StepQuantity >= TblOrderDetails.Quantity)
AND (TblPrices.CountryCode = TblAccount.Country)
AND (TblOrderDetail.Product = TblPrices.Product)
AND (DATEPART('yyyy', TblPrices.DateEffective) = DATEPART('yyyy', TblOrders.OrderDate));
I've also tried...
I've even tried going back to basics and trying again to generate the steps below in 1 query, then try the steps above in another and finally, create the final calculation in another query.
This is what I have been trying to get my prices and quantities below:
SELECT Max(StepQuantity) AS quantity_below, Max(StepPrice) AS price_below, TblOrderDetails.Quantity, TblAccounts.Country
FROM 
(TblProducts INNER JOIN TblPrices ON TblProducts.ProductCode = TblPrices.Product)
(TblOrderDetail INNER JOIN TblProducts ON TblOrderDetail.Product = TblProducts.ProductCode)
(TblOrders INNER JOIN TblOrderDetail ON TblOrders.OrderID = TblOrderDetail.OrderID)
(TblAccount INNER JOIN TblOrders ON TblAccount.AccountID = TblOrders.AccountID),
WHERE (((TblPrices.StepQuantity)<=(TblOrderDetail.Quantity)) AND ((TblPrices.CountryCode)=([TblAccounts].[country])) AND ((TblPrices.Product)=([TblOrderDetail].[product])) AND ((DatePart('yyyy',[TblPrices].[DateApplicable]))=(DatePart('yyyy',[TblOrders].[OrderDate]))));
You may be able to see glaring errors in this but I'm afraid I can't.  I've tried re-jigging it and I'm getting nowhere.
I need to be able to tie the information in to the OrderDetail records as the price generated will need to be added to a financial transactions table as a debit amount and will show as an amount owing on statements.
I'm really not very good at SQL.  I've read and worked though several self-study books and I have asked part of this question before; but I really am struggling with it.  If anyone has any ideas on how to proceed, or even where I've gone wrong with my code, I'd be delighted, even if you tell me I shouldn't be using SQL. For the record, I originally posted this question on a different forum under Visual Basic. Responses from that forum brought me to SQL - however, anything that works would be good!
I've even tried, using Excel, concatenating the Year&Product&Country&Quantity to get a unique product code, interpolating the prices for every quantity between 1 and 1000 for each product, country and year and bringing them into a TblProductsAndPrices table. In Access, I created a query to concatenate the Year(of order date from tblOrders)&Product(of tblorderdetails)&Country(of tblAccount) in order to get the required product code for the order. Another query would find a price for me. However, any product code that doesn't appear on the list (such as where a quantity isn't listed in the tblProductsAndPrices as it is larger than the highest price increment) doesn't have a price.
If there was a workable solution to what I've just described that would generate a price for everything, then I'd be so pleased.
I'd really like to be able to generate an order for any quantity of any product for any account based in any country on any date and retrieve a price which will be used to "debit" a financial account in the database, show in a transaction history for an account and appear on statements. I'd also like to be able to do an ad-hoc price check on the spot.
Thank you very much for taking the time to read this.  I really appreciate it. If you could offer any help or words of encouragement, I'd be very grateful.
Many thanks
Karen

An easy solution may not be obvious here, since not everyone is used to thinking in database terms.
Easy solution: create one view that returns all the calculated values, each as its own column, not just the final one you need. You can then build a second view on top of it that uses one of those values for some rows and another value for other rows.
The way to think about it is in reverse. Instead of thinking "if this, then I need to calculate that, else I need this other value", think "I need that value and I need this other value; both are columns of an intermediate view". The top-level "if" then goes into another view, which selects the correct value and ignores the rest.
Never try to solve everything in one step; that can be a really big headache.
Pros: you can isolate the calculated values (needed or not), and the SQL is much easier to write and maintain.
Cons: resource use is higher than the minimum, but most of the time the extra calculated values do not have a big impact.
In the terms tutorials use: instead of a top-down method, use a bottom-up method.
Sometimes it is better (as in your example) to calculate all three values (the sentences you wrote in bold) ignoring the "if" part, so that you have all three possible values for your order, and then discard the ones you don't want, than to try to calculate only one.
Trying to calculate only one value is procedural thinking. When working with databases, you usually have to drop that mindset and think in reverse: first do the innermost part of the procedural logic so that all the data is collected, then do the outer selection.
Note: if one of the values cannot be calculated, just generate a Null.
I know it is hard to think in a first-in, last-out (bottom-up) model, but it works well for problems like yours.
Step 1 (in a specific view, or a join of one view per calculation):
Calculate column 1 as the price for the current product, country and year
Calculate column 2 as the pro-rated price based on the 1000 step
Calculate column 3 as the pro-rated price based on the 100 step
Calculate column 4 as etc.
Calculate column N as etc.
Step 2 (another view, the one you want):
Calculate the "if" part, so that you choose the appropriate column from the previous view (you can use an immediate If (IIf) or a calculated auxiliary field).
I hope you can follow this way of thinking. I have solved many problems like this one (and more complex ones) this way, but it does not come easily and takes some extra effort.
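The two steps can be sketched outside the database as well. The Python below is only an illustration of the pattern, not Access SQL: step 1 computes every candidate value (None standing in for Null), and step 2 picks the applicable one. The price steps, names and priority order are assumptions based on the rules in the question.

```python
from bisect import bisect_left

# Hypothetical price steps for one product, country and year: (quantity, price),
# sorted by quantity.
STEPS = [(10, 100.0), (50, 400.0), (100, 700.0)]


def candidate_prices(qty, steps):
    """Step 1: compute every candidate column; None where a value cannot be calculated."""
    quantities = [q for q, _ in steps]
    max_qty, max_price = steps[-1]
    i = bisect_left(quantities, qty)  # index of the first step with quantity >= qty

    exact = steps[i][1] if i < len(steps) and quantities[i] == qty else None
    prorated = max_price * qty / max_qty if qty > max_qty else None
    interpolated = None
    if exact is None and 0 < i < len(steps):
        (q_lo, p_lo), (q_hi, p_hi) = steps[i - 1], steps[i]
        interpolated = p_lo + (p_hi - p_lo) * (qty - q_lo) / (q_hi - q_lo)
    return {"exact": exact, "prorated": prorated, "interpolated": interpolated}


def final_price(qty, steps):
    """Step 2: pick the applicable column and ignore the rest."""
    columns = candidate_prices(qty, steps)
    for key in ("exact", "prorated", "interpolated"):
        if columns[key] is not None:
            return columns[key]
    return None  # below the smallest step: the question gives no rule for this case


print(final_price(50, STEPS), final_price(75, STEPS), final_price(150, STEPS))
```

In a query, step 1 would correspond to the intermediate view with one column per candidate, and step 2 to the outer view that selects among them.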

Calculating a Ratio using Column A & Column B - in Powerpivot/MDX/DAX, not in SQL

I have a query to pull clickthrough for a funnel, where if a user hit a page it records as "1", else NULL --
SELECT datestamp
,COUNT(visits) as Visits
,count([QE001]) as firstcount
,count([QE002]) as secondcount
,count([QE004]) as thirdcount
,count([QE006]) as finalcount
,user_type
,user_loc
FROM
dbname.dbo.loggingtable
GROUP BY user_type, user_loc
I want to have a column for each ratio, e.g. firstcount/Visits, secondcount/firstcount, etc. as well as a total (finalcount/Visits).
I know this can be done
in an Excel PivotTable by adding a "calculated field"
in SQL by grouping
in PowerPivot by adding a CalculatedColumn, e.g.
=IFERROR(QueryName[finalcount]/QueryName[Visits],0)
BUT I need to give the report consumer the option of slicing by just user_type or just user_loc, etc., and Excel will tend to ADD the proportions, which won't work b/c
SUM(A/B) != SUM(A)/SUM(B)
Is there a way in DAX/MDX/PowerPivot to add a calculated column/measure, so that it will be calculated as SUM(finalcount)/SUM(Visits), for any user-defined subset of the data (daterange, user type, location, etc.)?

Yes, via calculated measures. Calculated columns are for creating values that you want to see on rows/columns/report header...calculated measures are for creating values that you want to see in the values section of a pivot table and can slice/dice by the columns in the model.
The easiest way would be to create 3 calculated "measures" in the calculation area of the powerpivot sheet.
TotalVisits:=SUM(QueryName[visits])
TotalFinalCount:=SUM(QueryName[finalcount])
TotalFinalCount2VisitsRatio:=[TotalFinalCount]/[TotalVisits]
You can then slice the calculated measure [TotalFinalCount2VisitsRatio] by user_type or just user_loc (or whatever) and the value will be calculated correctly. The difference here is that you are explicitly telling the xVelocity engine to SUM-then-DIVIDE. If you create the calculated column, then the engine thinks you want to DIVIDE-then-SUM.
Also, you don't have to break down the measure into 3 separate measures...it's just good practice. If you're interested in learning more, I'd recommend this book...the author is the PowerPivot/DAX guru and the book is very straightforward.
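The difference between SUM-then-DIVIDE and DIVIDE-then-SUM is easy to see with two made-up rows. This Python sketch is only a numeric illustration; the values are invented.

```python
# Hypothetical aggregated rows, one per user_type.
rows = [
    {"user_type": "new", "visits": 100, "finalcount": 10},
    {"user_type": "returning", "visits": 300, "finalcount": 90},
]

# DIVIDE-then-SUM: what adding up a calculated column of ratios does.
sum_of_ratios = sum(r["finalcount"] / r["visits"] for r in rows)

# SUM-then-DIVIDE: what the measure [TotalFinalCount] / [TotalVisits] does.
ratio_of_sums = sum(r["finalcount"] for r in rows) / sum(r["visits"] for r in rows)

print(sum_of_ratios, ratio_of_sums)
```

Here the per-row ratios are 10% and 30%, which add up to about 0.4, while the overall clickthrough is 100 / 400 = 0.25.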
