Calculating RMSE / R-squared / forecast error metrics through SQL (BigQuery)

I'm having trouble figuring out how to evaluate different sets of forecast values using GoogleSQL.
I have a table as follows:
date     | country | actual_revenue | forecast_rev_1 | forecast_rev_2
---------|---------|----------------|----------------|---------------
x/x/xxxx | ABC     | 134644.64      | 153557.44      | 103224.35
...      | ...     | ...            | ...            | ...
The table is partitioned by date and country, and consists of 60 days' worth of actual revenue and forecast revenue from different forecast modelling configurations.
3 questions here:
1. Is calculating R-squared / Root Mean Square Error (RMSE) the best way to evaluate model accuracy in SQL?
2. If yes to #1, do I calculate the R-squared / RMSE for each row and then take the average of those values?
3. Is there some sort of GoogleSQL functionality, or a better-optimized way, of doing this?
Not quite sure about the best way of doing this as I'm not familiar with the statistics territory.
Appreciate any guidance here. Thank you!
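For what it's worth, RMSE and R-squared are aggregate metrics: they are computed once over all 60 rows per forecast column, not per row and then averaged. A minimal sketch of both directly in GoogleSQL, assuming a placeholder table name my_project.my_dataset.revenue_forecasts with the columns shown above:

-- Sketch only: the table name is a placeholder for the real one.
-- RMSE = sqrt(mean squared error); R-squared = 1 - SS_res / SS_tot.
WITH stats AS (
  SELECT
    actual_revenue,
    forecast_rev_1,
    forecast_rev_2,
    AVG(actual_revenue) OVER () AS mean_actual
  FROM `my_project.my_dataset.revenue_forecasts`
)
SELECT
  SQRT(AVG(POW(actual_revenue - forecast_rev_1, 2))) AS rmse_forecast_1,
  SQRT(AVG(POW(actual_revenue - forecast_rev_2, 2))) AS rmse_forecast_2,
  1 - SUM(POW(actual_revenue - forecast_rev_1, 2))
        / SUM(POW(actual_revenue - mean_actual, 2)) AS r2_forecast_1,
  1 - SUM(POW(actual_revenue - forecast_rev_2, 2))
        / SUM(POW(actual_revenue - mean_actual, 2)) AS r2_forecast_2
FROM stats;

Whether RMSE or R-squared is the "best" metric depends on what you care about; for revenue forecasts, MAPE (mean absolute percentage error) is another common choice and can be written the same way with AVG(ABS(actual_revenue - forecast_rev_1) / actual_revenue).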

Related

Different Correlation Coefficient for Different Time Ranges

I built a DataFrame containing the following data:
Daily price of the Gas Future of day N;
Daily price of the Petroil Future of day N;
Daily price of the Day-Ahead Electricity Market in Italy.
The data are taken from the 2010 to 2022 time range, so 12 years of historical data.
The DataFrame head looks like this:
PETROIL GAS ELECTRICITY
0 64.138395 2.496172 68.608696
1 65.196161 2.482612 113.739130
2 64.982403 2.505938 112.086957
3 64.272606 2.500000 110.043478
4 65.993436 2.521739 95.260870
So on this DataFrame I tried to build the correlation matrix through the Pandas method .corr() and faced one big issue:
If I take all 12 years of data I get:
almost zero correlation between the Electricity and Petroil prices;
a low correlation (0.12) between the Electricity and Gas prices;
while if I split the data into three time ranges (2010-2014; 2014-2018; 2018-2022) I get really high correlations (in both cases between 0.40 and 0.60).
So I am here asking these two questions:
Why do I get such a big difference when I split the time ranges?
Considering I am doing this kind of analysis to use the Petroil and Gas prices to predict the electricity price, which of these two analyses should I consider? The first one (with low correlation) that considers the entire time range, or the second one (with higher correlation) that is split into different time ranges?
Thank you for your answers.
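As an aside, the same split-period check can be expressed in SQL (kept in GoogleSQL for consistency with the main question); the table and column names below are assumptions and the year boundaries are approximate:

-- Sketch only: hypothetical daily-price table; CORR() is the Pearson correlation aggregate.
SELECT
  CASE
    WHEN EXTRACT(YEAR FROM price_date) < 2014 THEN '2010-2014'
    WHEN EXTRACT(YEAR FROM price_date) < 2018 THEN '2014-2018'
    ELSE '2018-2022'
  END AS period,
  CORR(electricity, petroil) AS corr_elec_petroil,
  CORR(electricity, gas) AS corr_elec_gas
FROM `my_project.my_dataset.daily_energy_prices`
GROUP BY period
ORDER BY period;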

Azure Analytics: Kusto query dashboard grouped by client_CountryOrRegion shows a deviation in results whenever the UTC time range (e.g. past 24 hours) is changed

pageViews
| where (url contains "https://***.com")
| summarize TotalUserCount = dcount(user_Id)
| project TotalUserCount
Now when summarizing by client_CountryOrRegion, there is a deviation in the result for different selected time ranges, i.e. for 24 days, 2 days, 3 days, 7 days, etc. The user count by country does not match the total count. Is it due to the UTC timezone?
pageViews
| where (url contains "https://***.com")
| summarize Users= dcount(user_Id) by client_CountryOrRegion
Any help or suggestion would be like oxygen.
Quoting the documentation of dcount:
The dcount() aggregation function is primarily useful for estimating the cardinality of huge sets. It trades performance for accuracy, and may return a result that varies between executions. The order of inputs may have an effect on its output.
You can increase the accuracy of the estimation by providing the accuracy argument to dcount, e.g. dcount(user_Id, 4). Note that this will improve the estimation (at the cost of query performance), but still won't be 100% accurate. You can read more about this in the doc.

Writing equations in SQL using multiple variables

I'm trying to use data that is labeled by year (2012 - 2016) to calculate CAGR. The data was originally in one column indicating the total population, with another column indicating the year. I've isolated the 2012 and 2016 data into two separate columns and am trying to use SQL to calculate the CAGR rate: ((data from 2016) / (data from 2012))^(1/4) - 1.
Is this the correct way to calculate CAGR / cumulative growth? I've tried simply using the two columns of data, but because they are mismatched and have nulls, it doesn't work. Please let me know if you have any ideas.
Compound Annual Growth Rate (CAGR) doesn't really lend itself to what you're trying to do.
Usually this is used when you, say, invest $1,000 in a fund and you calculate the annual growth based on the ending value.
Example - if you invest $1000 and in 5 years it's worth $5000:
(5,000 / 1,000)^(1/5) - 1 = 0.37973 = 37.97%
If I was to write that in SQL Server it would be:
SELECT SUM(POWER((5000.0/1000.0),(1.0/5.0))-1.0)
You can replace the 5000 and 1000 with the specific columns you want to compare, or a range of data you need to compare.
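For example, applied to the year-labelled population data in the question, conditional aggregation sidesteps the null/mismatch problem of keeping 2012 and 2016 in separate columns. A sketch, with the table and column names assumed:

-- Sketch only: table/column names are assumptions; 4 is the number of years between 2012 and 2016.
SELECT POWER(
         MAX(CASE WHEN data_year = 2016 THEN total_population END) * 1.0
         / MAX(CASE WHEN data_year = 2012 THEN total_population END),
         1.0 / 4.0) - 1.0 AS cagr_2012_2016
FROM population_by_year;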
If you elaborate your question I will update this answer.

HR Cube in SSAS

I have to design a cube for student attendance. We have four statuses (Present, Absent, Late, On vacation). The cube has to let me know the number of students who are not present over a period of time (day, month, year, etc.) and that number as a percentage of the total.
I built a fact table like this:
City ID | Class ID | Student ID | Attendance Date | Attendance State | Total Students number
--------------------------------------------------------------------------------------------
1 | 1 | 1 | 2016-01-01 | ABSENT | 20
But in my SSRS project I couldn't use this to get the correct numbers. I have to filter by date, city and attendance status.
For example, I must know that on date X there are 12 students not present, which corresponds to 11% of the total.
Any suggestion of a good structure to achieve this?
I assume this is homework.
Your fact table is wrong.
Don't store aggregated data (Total Students) in the fact as it can make calculations difficult.
Don't store text values like 'Absent' in the fact table. Attributes belong in the dimension.
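As a rough illustration of that separation (all names here are made up, not a prescribed design), the status text moves into a dimension and the fact keeps one row per student per day:

-- Sketch only: illustrative star schema, not a complete design.
CREATE TABLE DimAttendanceStatus (
    AttendanceStatusKey INT PRIMARY KEY,
    StatusName          VARCHAR(20)    -- 'Present', 'Absent', 'Late', 'On vacation'
);

CREATE TABLE FactAttendance (
    DateKey             INT NOT NULL,  -- references a date dimension
    CityKey             INT NOT NULL,
    ClassKey            INT NOT NULL,
    StudentKey          INT NOT NULL,
    AttendanceStatusKey INT NOT NULL   -- references DimAttendanceStatus
);

-- Counts and percentages ("12 not present = 11% of the total on date X") are then
-- calculated in the cube or query, not stored in the fact.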
Reading homework for you:
Difference between a Fact and Dimension and how they work together
What is the grain of a Fact and how does that affect aggregations and calculations.
There is a wealth of information at the Kimball Group's pages. Start with the lower-numbered tips, as they get more advanced as you move on.

Identifying trends and classifying using SQL

I have a table xyz with three columns: rcvr_id, mth_id and tpv. rcvr_id is an id given to a customer; mth_id is a column which stores the month number, calculated as (year - 1900) * 12 + the month number (1, 2, 3, ..., depending on the month). So, for example, Dec 2011 will have an mth_id of 1344 and Jan 2012 will have 1345. tpv is a variable which shows the customer's transaction amount.
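(For reference, that convention can be derived from a date column like this; the column and table names below are assumed.)

-- Sketch only: derives mth_id as (year - 1900) * 12 + month,
-- e.g. Dec 2011 -> 1344, Jan 2012 -> 1345.
SELECT (EXTRACT(YEAR FROM txn_date) - 1900) * 12 + EXTRACT(MONTH FROM txn_date) AS mth_id
FROM some_source_table;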
Example table:
rcvr_id  mth_id  tpv
1        1344    23
2        1344    27
3        1344    54
1        1345    98
3        1345    102
... and so on
P.S. If a customer does not have a transaction in a given month, his row for that month won't exist.
Now, the question. Based on transactions for the months 1327 to 1350, I need to classify a customer as steady or sporadic.
Here is a description.
The above image is for 1 customer; I have millions of customers.
How do I go about it? I have no clue how to identify trends in SQL, or rather how to do it the best way possible.
Also, I am working on Teradata.
OK, I have found out how to get the standard deviation. Now the important question is: how do I set a standard deviation limit on my own? I can't just randomly say "if the standard deviation is above 40% he is sporadic, else steady". I thought of calculating the average of the standard deviations for all customers and, if a customer is above that, classifying him as sporadic, otherwise steady. But I feel there could be better logic.
I would suggest the STDDEV_POP function - a higher value indicates a greater variation in values.
select rcvr_id, STDDEV_POP(tpv)
from yourtable
group by rcvr_id
STDDEV_POP is the function for Standard Deviation
If this doesn't differentiate enough, you may need to look at regression functions and variance.
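Building on that, the threshold idea from the question (compare each customer's standard deviation to the average across all customers) can be sketched as follows; the table and column names come from the question, but the cut-off itself is the question's own assumption rather than a recommendation:

-- Sketch only: classifies each customer against the average of all customers' standard deviations.
SELECT rcvr_id,
       tpv_stddev,
       CASE WHEN tpv_stddev > avg_stddev THEN 'sporadic' ELSE 'steady' END AS customer_type
FROM (
    SELECT rcvr_id,
           tpv_stddev,
           AVG(tpv_stddev) OVER () AS avg_stddev
    FROM (
        SELECT rcvr_id, STDDEV_POP(tpv) AS tpv_stddev
        FROM xyz
        WHERE mth_id BETWEEN 1327 AND 1350
        GROUP BY rcvr_id
    ) per_customer
) with_avg;

Dividing each customer's standard deviation by their average tpv (a coefficient of variation) is a common refinement, so that large customers are not flagged as sporadic just for having large absolute swings.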