How can I divide data into train and test sets for churn prediction? - pandas

I have seven months of telecom data. The definition of churn for telecom is: a customer who does not pay the monthly bill for three consecutive months is a churner. Now I have to predict who will churn in the future. So, my question is: how can I divide my data into train and test sets to get better results? I tried something like this: I dropped one month, divided the remaining data into two parts of three months each, built the churn flag from the test part and attached it to the train part, but the results are not good. Can you help me out?
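One common way to frame this with only seven months of data is a rolling, time-based split rather than a random one: build features from an observation window and derive the churn label from the three months that follow it, then use a later window for testing. A minimal pandas sketch under those assumptions (the table layout and the column names customer_id, month and paid are hypothetical, not from the question):

import pandas as pd

# Hypothetical monthly billing table: one row per customer per month,
# paid == 0 meaning the bill for that month was not paid.
payments = pd.DataFrame({
    "customer_id": [1] * 7 + [2] * 7,
    "month":       list(range(1, 8)) * 2,
    "paid":        [1, 1, 1, 1, 0, 0, 0,   # customer 1: months 5-7 unpaid -> churn
                    1, 0, 1, 1, 1, 0, 1],  # customer 2: no 3-month unpaid streak
})

def make_snapshot(payments, feature_months, label_months):
    """Features from an observation window, churn label from the
    three consecutive months that follow it."""
    feats = (payments[payments["month"].isin(feature_months)]
             .groupby("customer_id")["paid"]
             .agg(months_paid="sum", payment_rate="mean"))
    label_window = payments[payments["month"].isin(label_months)]
    # Churn = all three label months unpaid.
    feats["churn"] = (label_window.groupby("customer_id")["paid"].sum() == 0).astype(int)
    return feats

train = make_snapshot(payments, [1, 2, 3], [4, 5, 6])   # fit the model here
test  = make_snapshot(payments, [2, 3, 4], [5, 6, 7])   # evaluate here

The point of this arrangement is that each snapshot's flag comes strictly from months after its own feature window, and the test snapshot sits later in time than the training one, so the label is never built from one part of the data and attached to another.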

Related

Different Correlation Coefficients for Different Time Ranges

I built a DataFrame with the following data:
Daily price of the Gas future for day N;
Daily price of the Petroil future for day N;
Daily price of the Day-Ahead Electricity Market in Italy.
The data span the 2010 to 2022 time range, so 12 years of historical data.
The DataFrame head looks like this:
     PETROIL       GAS  ELECTRICITY
0  64.138395  2.496172    68.608696
1  65.196161  2.482612   113.739130
2  64.982403  2.505938   112.086957
3  64.272606  2.500000   110.043478
4  65.993436  2.521739    95.260870
On this DataFrame I tried to build the correlation matrix through the pandas method .corr() and faced one big issue:
If I take all 12 years of data I get:
almost zero correlation between the Electricity and Petroil prices;
low correlation (0.12) between the Electricity and Gas prices.
Whereas if I split into three time ranges (2010-2014, 2014-2018, 2018-2022) I get really high correlations (in both cases between 0.40 and 0.60).
So I am asking these two questions:
Why do I get such a big difference when I split the time ranges?
Considering that I am doing this analysis in order to use the Petroil and Gas prices to predict the electricity price, which of the two analyses should I consider: the first one (with low correlation) that covers the entire time range, or the second one (with higher correlations) that is split into different time ranges?
Thank you for your answers.
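For reproducibility, a minimal pandas sketch of the comparison described above; the DataFrame here is a synthetic stand-in, and only the column names and date boundaries are taken from the question:

import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: daily prices over 2010-2022.
idx = pd.date_range("2010-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "PETROIL":     64.0 + rng.standard_normal(len(idx)).cumsum() * 0.10,
    "GAS":          2.5 + rng.standard_normal(len(idx)).cumsum() * 0.01,
    "ELECTRICITY": 70.0 + rng.standard_normal(len(idx)).cumsum() * 0.50,
}, index=idx)

# Pooled correlation over the full 12 years.
print(df.corr()["ELECTRICITY"])

# Correlation within each sub-range (partial string slicing is inclusive).
for start, end in [("2010", "2014"), ("2014", "2018"), ("2018", "2022")]:
    print(start, end)
    print(df.loc[start:end].corr()["ELECTRICITY"])

As for why the numbers differ so much: pooling regimes whose price levels shifted at different times can wash out a correlation that holds within each regime (a Simpson's-paradox-like effect), which is the usual cause of exactly this pattern in long energy-price series.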

Calculating RMSE / R-squared / forecast error metrics through SQL BigQuery

I'm having trouble figuring out how to evaluate different sets of forecast values using GoogleSQL.
I have a table as follows:
date     | country | actual_revenue | forecast_rev_1 | forecast_rev_2
---------|---------|----------------|----------------|---------------
x/x/xxxx | ABC     | 134644.64      | 153557.44      | 103224.35
...
The table is partitioned by date and country, and consists of 60 days' worth of actual revenue and forecast revenue from different forecast modelling configurations.
Three questions here:
Is calculating R-squared / Root Mean Square Error (RMSE) the best way to evaluate model accuracy in SQL?
If yes to #1, do I calculate the R-squared / RMSE for each row and take the average value of them?
Is there some GoogleSQL functionality or a better-optimized way of doing this?
Not quite sure about the best way of doing this, as I'm not familiar with the statistics territory.
Appreciate any guidance here. Thank you!
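On question #2 specifically: RMSE and R-squared are single numbers computed over all rows at once, not per-row values to be averaged. A minimal sketch of both computations in pandas, assuming the table has been pulled out of BigQuery into a DataFrame (the sample values below are placeholders, with only the first row taken from the question):

import numpy as np
import pandas as pd

def rmse(actual, forecast):
    # One number over all rows: square root of the mean squared error.
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def r_squared(actual, forecast):
    # 1 - (residual sum of squares / total sum of squares).
    ss_res = np.sum((actual - forecast) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

df = pd.DataFrame({
    "actual_revenue": [134644.64, 120000.00, 98000.50],
    "forecast_rev_1": [153557.44, 118500.00, 101000.00],
    "forecast_rev_2": [103224.35, 125000.00, 97500.00],
})

for col in ["forecast_rev_1", "forecast_rev_2"]:
    print(col, rmse(df["actual_revenue"], df[col]),
          r_squared(df["actual_revenue"], df[col]))

The same aggregation can stay in SQL if you prefer: for example SQRT(AVG(POW(actual_revenue - forecast_rev_1, 2))) grouped by country gives a per-country RMSE, since AVG, POW and SQRT are all standard GoogleSQL functions.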

Calculate average with VBA from a dynamic table

I have a table in Excel with the week number and the weight (in kg).
Somebody can enter their weight every day during a week, or just once, or not at all; I can't control that.
So my table can literally change, from zero to seven or even more lines a week (like the yellow side of the image).
What I want to do is calculate the average weight per week, so that I end up with one line for each week in which I have at least one weight. A week without any weight shouldn't produce a line at all. Also, a weight for week 2 can easily appear between the week 5 and week 6 rows in the yellow table; that happens if someone enters their weight after the others.
How can I say these two rows belong to the same week, so that we calculate the average of those two weights?
I hope it's clear enough with this picture.
Use the formula below in column C to calculate the average (assuming Week is in column A and Weight in column B):
=AVERAGEIF(A:A,A2,B:B)
Then copy the Average column and Paste Special as values only,
and Remove Duplicates based on Week and the new Average column.
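For comparison, the same weekly aggregation as a pandas sketch, assuming two columns named Week and Weight; weeks with no entries simply never appear in the groupby output, which matches the "no line for empty weeks" requirement:

import pandas as pd

# Hypothetical stand-in for the Excel table; note the second week-2
# entry arriving out of order, as described in the question.
df = pd.DataFrame({
    "Week":   [1, 1, 2, 5, 2, 6],
    "Weight": [80.5, 81.0, 79.8, 78.9, 80.2, 78.5],
})

# One row per week that has at least one weight.
weekly_avg = df.groupby("Week", as_index=False)["Weight"].mean()
print(weekly_avg)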

QlikView: aggregating a calculated expression

I have a table that is used to calculate a daily completion score by individuals at various locations. Example: on day 1, 9/10 people completed the task, so the location score is 90%. The dimension is "ReferenceDate." The expression is a calculation of count(distinct if(taskcompleted=yes, AccountNumber)) / count(distinct AccountNumber).
Now, I want to report on the average scores per month. I DO NOT want to aggregate all the data and then divide; I want the daily average. Example:
day 1: 9/10 = 90%
day 2: 90/100 = 90% (many more people showed up at the same location)
The average of the two days is 90%.
It's not 99/110,
and it is also not distinct(99) / distinct(110). It is the simpler (0.9 + 0.9) / 2.
Does this make sense?
What I have now is a line graph showing the daily trend across many months. I need to roll that up into bar charts by month and then compare multiple locations, so we can see which locations have the lowest average completion scores.
You need to use the aggr() function to tell QlikView to compute the score day by day and then average the answers.
It should look something like this (I just split the lines to show which terms work together):
avg(
    aggr(
        count(distinct if(taskcompleted=yes, AccountNumber))
            / count(distinct AccountNumber),
        ReferenceDate
    )
)
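To make the distinction concrete outside QlikView, here is the question's arithmetic as a small pandas sketch (the numbers are the two example days from the question; with these particular figures the two methods happen to coincide at 0.9, and they diverge as soon as the daily scores differ):

import pandas as pd

# The question's two example days.
daily = pd.DataFrame({
    "day":       [1, 2],
    "completed": [9, 90],
    "total":     [10, 100],
})

daily["score"] = daily["completed"] / daily["total"]

print(daily["score"].mean())                            # (0.9 + 0.9) / 2 = 0.9, what aggr() averages
print(daily["completed"].sum() / daily["total"].sum())  # 99/110 = 0.9 here, the pooled ratio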

Predictive Ordering Logic

I have a problem and was wondering if anyone could help or if it is even possible to have an algorithm for something like this.
I need to create a predictive ordering wizard. Based on previous sales, we will determine that a certain amount of an item is required, e.g. 31 apples. Now I need to work out the number of cases that need to be ordered. If the cases come in, say, 60, 30, 15, and 10 apples, the order should be a case of 30 and a case of 10 apples.
The number of items that need to be ordered changes in each row of the result set. The case sizes could also change for each item, so some items may have an option of 5 different cases and some items may end up with an option of only one case.
Other examples: I need 39 cans of coke and the cases come in only 24 per case, therefore I need 2 cases. I need 2 shots of Baileys and the bottles of Baileys come in 50cl or 70cl, therefore I need the 50cl.
The result set's columns are ItemName, ItemSize, QuantityRequired, PackSize and PackSizeMultiple.
ItemName is the item to be ordered. ItemSize is the size the item is used in, e.g. a can of coke. QuantityRequired is how many of the item, in this case cans of coke, need to be ordered. PackSize is the size of the case. PackSizeMultiple is the number to multiply the item by to work out how many of the items are in the case.
PS: this will be a query in SQL Server 2008.
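All three examples follow a greedy pattern: take the largest case that fits the remaining quantity, and cover any leftover with the smallest available case. A minimal Python sketch of that logic (pick_cases is a hypothetical helper for illustration; the question itself targets SQL Server 2008):

def pick_cases(required, case_sizes):
    """Greedy case selection: repeatedly take the largest case that
    fits the remaining need, then cover any leftover with the
    smallest available case."""
    sizes = sorted(case_sizes, reverse=True)
    order = {}
    remaining = required
    while remaining > 0:
        # Largest case that does not overshoot the remaining quantity.
        fit = next((s for s in sizes if s <= remaining), None)
        if fit is None:
            fit = sizes[-1]  # nothing fits: the smallest case covers the rest
        order[fit] = order.get(fit, 0) + 1
        remaining -= fit
    return order

print(pick_cases(31, [60, 30, 15, 10]))  # {30: 1, 10: 1}
print(pick_cases(39, [24]))              # {24: 2}
print(pick_cases(2, [70, 50]))           # {50: 1}

One caveat worth deciding up front: the greedy choice matches all three examples but does not always minimise over-supply (for 31 apples, one case of 15 plus two of 10 totals 35, less than the 40 that 30 + 10 gives), so pick the objective before porting the logic to SQL.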
Sounds like you need a UOM (Unit of Measure) table and a function to calculate the co-pack measure count and the unit count measure quantity, with the UOM type based on time between orders. You would also need to create a cron cycle and a freeze table managed by week/time interval, in order to create a frozen view of the current quantity sold each week and the number of units since the last order.

Based on the two orders previous to your prior order, you would set the current prediction from the minimum time between the last two freeze cycles containing an order and the duration in days between them. Based on the average time between orders and the unit quantity in each order, you can create a unit decay ratio percentage based on days and store it in each slice going forward. With a reference to this data you will be able to create a prediction that lets you trigger a notice to sales, or a message to the client, to reorder. In addition, if you capture response data from sales based on unit count feedback from the client, you can reference an actual and tune your decay rate against your prediction.

You should also consider managing and rolling up these freezes by month, so that you can view historical trends and forecast revenue based on reorder velocity and the same period last year. Basically this is similar to sales forecasting, where we swap out your opportunity percentage of close for a Predicted Remaining Quantity percentage.