I built a DataFrame containing the following data:
Daily price of the Gas future (N-day);
Daily price of the Petroil future (N-day);
Daily price of the Day-Ahead Electricity Market in Italy.
The data cover the 2010 to 2022 time range, so 12 years of historical data.
The DataFrame head looks like this:
PETROIL GAS ELECTRICITY
0 64.138395 2.496172 68.608696
1 65.196161 2.482612 113.739130
2 64.982403 2.505938 112.086957
3 64.272606 2.500000 110.043478
4 65.993436 2.521739 95.260870
On this DataFrame I tried to build the correlation matrix with the pandas method .corr() and ran into one big issue:
If I take all 12 years of data, I get:
almost zero correlation between the Electricity and Petroil prices;
low correlation (0.12) between the Electricity and Gas prices.
Whereas if I split the data into three time ranges (2010-2014; 2014-2018; 2018-2022), I get much higher correlations (in both cases between 0.40 and 0.60).
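For reference, the two computations described above can be sketched as follows. This is only an illustration under assumptions: the prices are synthetic random numbers (so the printed correlations mean nothing), the DatetimeIndex is assumed because the real DataFrame appears to use an integer index, and the sub-period boundaries are hypothetical, non-overlapping versions of the ranges mentioned.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): the real prices are not available here.
rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-01", "2021-12-31", freq="D")
df = pd.DataFrame(
    {
        "PETROIL": rng.normal(65, 5, len(dates)),
        "GAS": rng.normal(2.5, 0.2, len(dates)),
        "ELECTRICITY": rng.normal(100, 15, len(dates)),
    },
    index=dates,  # assumed DatetimeIndex, needed for the date slicing below
)

# Correlation matrix over the whole time range.
full_corr = df.corr()

# Correlation matrix for each sub-period (hypothetical boundaries).
periods = {
    "2010-2013": ("2010-01-01", "2013-12-31"),
    "2014-2017": ("2014-01-01", "2017-12-31"),
    "2018-2021": ("2018-01-01", "2021-12-31"),
}
period_corr = {
    name: df.loc[start:end].corr() for name, (start, end) in periods.items()
}

print(full_corr.loc["ELECTRICITY", ["PETROIL", "GAS"]])
for name, corr in period_corr.items():
    print(name, corr.loc["ELECTRICITY", ["PETROIL", "GAS"]].round(3).to_dict())
```

Comparing the ELECTRICITY row of `full_corr` with the same row in each entry of `period_corr` reproduces the full-range versus per-period comparison.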
So I am here asking these two questions:
Why do I get such a large difference when I split the time ranges?
Considering that I am doing this analysis in order to use the Petroil and Gas
prices to predict the electricity price, which of these two analyses
should I rely on? The first one (with low correlation), which
considers the entire time range, or the second one (with higher
correlation), which is split into different time ranges?
Thank you for your answers.
I am a stock trader who visualizes data in QuickSight. I identify the trades I want to submit to the market, sometimes for the same stock, at the same time, but in opposite directions depending on the price of the stock at that time. See below for an example of trades I might identify for 1/19/22 0800:
Date     Hour  Stock      Direction  Price  Volume
1/19/22  0800  Apple      BUY        $10    2
1/19/22  0800  Apple      SELL       $20    1
1/19/22  0800  Microsoft  BUY        $15    3
Using QuickSight, I want to visualize (in pivot tables and charts) the volume that I trade, using the maximum possible trade volume. For example, QuickSight simply sums the Volume column to 6, when really I want it to sum to 5, because the max possible trade volume for that hour is 5. The Apple trades in the example are mutually exclusive: the stock price cannot be both beneath $10 (triggering a BUY) and above $20 (triggering a SELL) at the same date-time. Therefore, I want the day's traded volume to reflect the MAX possible volume I could have traded (2+3).
I have used the maxOver() function like so: maxOver({volume}, [{stock}, {date}, {hour}], PRE_AGG), but I would like to view my trade volume rolled up to the day, like so:
Date  Volume
1/19  5
Is there a way to do this using QuickSight calculated fields? Should this aggregation be done with a SQL custom field?
Add a new calculated field called volume_direction_specifier:
{Volume} * 10 + ifelse({Direction}='BUY', 1, 2)
This is a single number that will indicate the direction and volume. (this is needed in cases where the max possible volume is the same for both the BUY and SELL entries within the same hour).
Then compute the maxOver on this new field in a calculated field called max_volume_direction_specifier
maxOver({volume_direction_specifier}, [{stock}, {date}, {hour}], PRE_AGG)
Add a new field called volume_for_max_trade_volume_per_hour, which gives the Volume for the rows that have the max volume_direction_specifier per hour:
ifelse({volume_direction_specifier} = {max_volume_direction_specifier}, {volume}, null)
And finally, you should be able to add volume_for_max_trade_volume_per_hour to your table (grouped by day) and its SUM will give the maximum possible trade volume per day.
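To check the logic outside QuickSight, the same three steps can be mirrored in pandas. This is a sketch, not QuickSight syntax: the DataFrame holds the example trades from the question, and the column names `spec`, `max_spec` and `vol_max` are hypothetical stand-ins for the three calculated fields.

```python
import pandas as pd

# Example trades from the question.
trades = pd.DataFrame(
    {
        "date": ["1/19/22"] * 3,
        "hour": ["0800"] * 3,
        "stock": ["Apple", "Apple", "Microsoft"],
        "direction": ["BUY", "SELL", "BUY"],
        "volume": [2, 1, 3],
    }
)

# Step 1 (volume_direction_specifier): volume * 10 + 1 for BUY, + 2 for SELL.
trades["spec"] = trades["volume"] * 10 + trades["direction"].map({"BUY": 1, "SELL": 2})

# Step 2 (maxOver ... PRE_AGG): max specifier per stock, date and hour.
trades["max_spec"] = trades.groupby(["stock", "date", "hour"])["spec"].transform("max")

# Step 3: keep the volume only on the row that holds the max specifier.
trades["vol_max"] = trades["volume"].where(trades["spec"] == trades["max_spec"])

# Roll up to the day; missing values are skipped by sum().
daily = trades.groupby("date")["vol_max"].sum()
print(daily)
```

For the example data the daily total is 5 (2 from the Apple BUY plus 3 from the Microsoft BUY), which is the roll-up the question asks for.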
pageViews
| where (url contains "https://***.com")
| summarize TotalUserCount = dcount(user_Id)
| project TotalUserCount
Now, when summarizing by client_CountryOrRegion, the results deviate for the different time ranges selected (e.g. 24 days, 2 days, 3 days, 7 days, etc.): the user count by country does not add up to the total count. Is this due to the UTC timezone?
pageViews
| where (url contains "https://***.com")
| summarize Users= dcount(user_Id) by client_CountryOrRegion
Any help or suggestion would be like oxygen.
Quoting the documentation of dcount:
The dcount() aggregation function is primarily useful for estimating the cardinality of huge sets. It trades performance for accuracy, and may return a result that varies between executions. The order of inputs may have an effect on its output.
You can increase the accuracy of the estimation by providing the accuracy argument to dcount, e.g. dcount(user_Id, 4). Note that this will improve the estimation (at the cost of query performance), but still won't be 100% accurate. You can read more about this in the doc.
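Separately from the estimation error described above, per-group distinct counts need not add up to the overall distinct count even when counting is exact, because one user can appear under more than one country. Whether that happens in your data is an assumption; the sketch below uses hypothetical rows and exact counting in pandas (not Kusto's dcount) just to show the effect.

```python
import pandas as pd

# Hypothetical page views: user "u1" is seen from two countries.
page_views = pd.DataFrame(
    {
        "user_Id": ["u1", "u1", "u2", "u3"],
        "client_CountryOrRegion": ["Italy", "France", "Italy", "France"],
    }
)

# Exact distinct users overall.
total_users = page_views["user_Id"].nunique()

# Exact distinct users per country; "u1" is counted once in each country.
users_by_country = page_views.groupby("client_CountryOrRegion")["user_Id"].nunique()

print(total_users)                  # 3
print(int(users_by_country.sum()))  # 4
```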
I have 7 months of telecom data. The definition of churn for telecom is "a customer who does not pay the monthly bill for 3 consecutive months has churned". Now I have to predict who will churn in the future. So my question is: how can I divide my data into train and test sets to get better results? So far I dropped one month and divided the remaining data into two parts of 3 months each; I built the churn flag from the test part and attached it to the train part, but the results are not good. Can you help me out?
I'm trying to use data that is labeled by year (2012 - 2016) to calculate CAGR. The data was originally in one column indicating the total population, with another column indicating the year. I've isolated the 2012 and 2016 data into two separate columns and am trying to use SQL to calculate the CAGR: ((data from 2016)/(data from 2012)^(1/4))-1.
Is this the correct way to calculate CAGR/cumulative growth? I've tried simply using the two columns of data, but because they are mismatched and contain nulls, it doesn't work. Please let me know if you have any ideas.
Compound Annual Growth Rate (CAGR) doesn't really lend itself to what you're trying to do.
It is usually used when you, say, invest $1,000 in a fund and calculate the annual growth rate from the ending value.
Example - if you invest $1,000 and in 5 years it's worth $5,000:
(5,000 / 1,000)^(1/5) - 1 = 0.37973 = 37.97%
If I were to write that in SQL Server it would be:
SELECT SUM(POWER((5000.0/1000.0),(1.0/5.0))-1.0)
You can replace the 5000 and 1000 to be the specific columns you want to compare, or a range of data you need to compare.
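The same arithmetic can be checked with a small Python sketch. The function name `cagr` and its parameter names are hypothetical, chosen only for this illustration; the numbers are the ones from the example above.

```python
def cagr(begin_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate: (end / begin) ** (1 / years) - 1."""
    if begin_value <= 0 or years <= 0:
        raise ValueError("begin_value and years must be positive")
    return (end_value / begin_value) ** (1.0 / years) - 1.0


# $1,000 growing to $5,000 over 5 years.
rate = cagr(1000.0, 5000.0, 5)
print(f"{rate:.5f}")  # 0.37973
```

For data running from 2012 to 2016 the number of periods would be 4, as in the question.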
If you elaborate your question I will update this answer.
I have a problem and was wondering if anyone could help or if it is even possible to have an algorithm for something like this.
I need to create a predictive ordering wizard. Based on previous sales, we determine that a certain amount of an item is required, e.g. 31 apples. Now I need to work out the number of cases that need to be ordered. If the cases come in sizes of, say, 60, 30, 15 and 10 apples, the order should be one case of 30 and one case of 10 apples.
The number of items that need to be ordered changes in each row of the result set. The case sizes can also change for each item, so some items may have a choice of 5 different cases and some items may end up with only one case size.
Other examples: I need 39 cans of Coke and the cases come only in 24 per case, so I need 2 cases. I need 2 shots of Baileys and the bottles of Baileys come in 50cl or 70cl, so I need the 50cl bottle.
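One greedy rule that reproduces all three examples above is: keep taking the largest case that still fits into the remaining quantity, and when no case fits, take the smallest case to cover the remainder. The Python sketch below only illustrates that rule under assumptions: it is not SQL, it is not guaranteed to minimise over-ordering, the function name `choose_cases` is hypothetical, and the Baileys example assumes 2 shots equal 5cl.

```python
def choose_cases(quantity_required, pack_sizes):
    """Greedy pick: largest case that fits, else the smallest case."""
    if quantity_required <= 0 or not pack_sizes:
        return []
    if any(size <= 0 for size in pack_sizes):
        raise ValueError("pack sizes must be positive")
    sizes = sorted(pack_sizes, reverse=True)
    remaining = quantity_required
    order = []
    while remaining > 0:
        # Cases no larger than what is still needed, largest first.
        fitting = [size for size in sizes if size <= remaining]
        # If nothing fits, the smallest case covers the remainder.
        size = fitting[0] if fitting else sizes[-1]
        order.append(size)
        remaining -= size
    return order


print(choose_cases(31, [60, 30, 15, 10]))  # [30, 10]
print(choose_cases(39, [24]))              # [24, 24]
print(choose_cases(5, [50, 70]))           # [50]
```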
The result set's columns are ItemName, ItemSize, QuantityRequired, PackSize and PackSizeMultiple.
ItemName is the item to be ordered. ItemSize is the size the item is used in, e.g. a can of Coke. QuantityRequired is how many of the item (in this case cans of Coke) need to be ordered. PackSize is the size of the case. PackSizeMultiple is the number to multiply the item by to work out how many of the items are in the case.
ps. this will be a query in SQL Server 2008
It sounds like you need a UOM (Unit of Measure) table and a function to calculate the co-pack measure count and the unit-count measure quantity, with the UOM type based on the time between orders. You would also need a cron cycle and a freeze table managed by week or other time interval, so that you have a frozen view of the quantity sold each week and the number of units since the last order.
Based on the 2 orders preceding your prior order, you would set the current prediction from the minimum time between the last 2 freeze cycles containing an order and the number of days between them. From the average time between orders and the unit quantity in each order, you can derive a unit decay ratio (a percentage per day) and store it in each slice going forward. With this data as a reference, you can create a prediction that triggers a notice to sales or a message to the client to reorder.
In addition, if you collect response data from sales based on unit-count feedback from the client, you can compare against actuals and tune your decay rate against your prediction. You should also consider managing and rolling up these freezes by month, so that you can view historical trends and forecast revenue based on reorder velocity and the same period last year. Basically this is similar to sales forecasting, with the opportunity's percentage chance of closing swapped for the predicted percentage of quantity remaining.