Expected Value formula - vb.net

I have a table as follows
Day Savings
1 : 251
2 : 722
3 : 1132
4 : 1929
5 : 3006
6 : 4522
7 : 8467
...
14 : x
These savings are growing day by day, I want to find a formula to expect the final value of day 14 which is x!

I didn't look at the data in any detail, but it seems like an exponential growth situation. If that's the case, then you can estimate the growth rate by fitting an exponential curve to the data using least squares approximation to get an estimated interest rate, r. If you find the data not conducive to that, you could try fitting it to some other curve. You can then use the estimated interest rate to compute the expected funds using the standard a = p*exp(r*i) where p is the initial principal and i is the elapsed time.
This all assumes compounding interest which is an exponential growth situation. If that assumption is incorrect, this approach is probably not going to work for you.

Related

Suitable Clustering Approach

I've got a total of 9 sensors in the ground, which measure the water content of the soil. 1-3 are in a depth of 1m, 4-6 are in a depth of 2m and sensors 7-9 are in a depth of 3m.
My dataset also contains the precipiation of the location. It is hourly data:
Time
Sensor-ID
Precipitation
Soil Water Content
2022-01-01 11:00
1
74
120
2022-01-01 11:00
2
74
100
2022-01-01 11:00
3
74
110
...
...
...
...
2022-01-01 11:00
9
74
30
The goal now is to find out if the different ground / soil depths behave differently regarding the water content after raining (over time).
I thought about a clustering method to find out if the sensors can be clustered based on the data and confirm this. Since I'm not very experienced in data science, would that be the right approach and is it even possible to analyse it with clustering?
For clustering, you can add a new column with three new classes to your data - for 1-3 sensors : Class 1, for 4-6 sensors : Class 2, for 7-9 sensors : Class 3 and perform your analysis using the new classes. Either can be done using Python, Power BI or Excel.
You should start by analyzing different variables w.r.t to the sensors at different ground depths: Use univariate, Bi-Variate and Multi-Variate plots to derive your goal.

Summing time series with slight variance in timestamps

I imagine that I have several time series like following, from different "sources":
time events
0 1000 1080000
1 2003 2122386
2 3007 3043985
3 4007 3872544
4 5007 4853763
Here, an monotonic increasing count events is sampled every 1000 ms. The sampling is not exact so most of the timestamps vary from their ideal values by a few ms - e.g., the second point is at 2003 instead of 2000.
I want to sum several of these time series: they will all be sampled at ~1000 ms but may not agree to the exact millsecond. E.g another time series could be:
time events
0 1000 1070000
1 2002 2122486
2 3006 3063985
3 4007 3872544
4 5009 4853763
I'd like something reasonable in terms of the final result. For example the same number of rows as each of the input dataframes, with a timestamp column the same as the first, or average of the inputs times. As long as the inputs are smooth, the outputs should be too.
I'd suggest DataFrame.reindex() with nearest method. Example:
def combine_datasources(reference_df, extra_dfs, tolerance_ms=100):
reindexed_df_list = [df.reindex(reference_df.index, method='nearest', tolerance=tolerance_ms) for df in extra_dfs]
combined = pd.concat([reference_df, *reindexed_df_list])
return combined.groupby(combined.index).sum()
combine_datasources(df_a, [df_b])
This code changes the index on the dataframes in the extra_dfs list to match the index for the reference dataframe. Then, it concatenates all of the dataframes together. It uses groupby to do the sum, which requires that the indexes match exactly to work. The timestamps will be the same as the one on the reference dataframe.
Note that if you have data from a time period not covered by the reference dataframe, that data will be dropped.
Here's the output for the dataset in your question:
events
time
1000 2150000
2003 4244872
3007 6107970
4007 7745088
5007 9707526

SQL Compound Growth Calculation based on previous rows (varying rate)

Given a column for 'Growth Factors' and a starting value I need to compute future values. For example, if a starting value of 1 is provided then the computed 'Value' column would be as shown below. Thus, Value(t2) = Value(t1) x Growth_Factor(t2). Base condition is Value(t1) = Starting_Value x Growth_Factor(t1). Example shown below.
How do I compute this in SQL (or Presto) where the computed value is dependent on previous computed values?
Growth Factor
Value
Time
1.2
1.2
1
1.1
1.32
2
1.5
1.98
3
1.7
3.366
4
You could sum the logarithms and invert when finished. This will work other than some possibility of small floating point error. But you're also going to introduce error once you multiply more than a few numbers with doubling decimal places at every iteration.
exp(
sum(ln(growth)) over (order by time)
)

Weighted Activity Selection Problem with allowing shifting starting time

I have some activities with weights, and I would like to select non overlapping activities by maximizing the total weight. This is known problem and solution exists.
In my case, I am allowed to shift the start time of activities in some extend while duration remains same. This will give me some flexibility and I might increase my utilization.
Example scenario is something like the following where all activities are supposed to be in interval (0-200):
(start, end, profit)
a1: 10 12 120
a2: 10 13 100
a3: 14 18 150
a4: 14 20 100
a5: 120 125 100
a6: 120 140 150
a7: 126 130 100
Without shifting flexibility, I would choose (a1, a3, a6) and that is it. On the other hand I have shifting flexibility to the left/right by at most t units for any task where t is given. In that case I might come up with this schedule and all tasks can be selected except a7 since conflict cannot be avoided by shift .
t: 5
a1: 8 10 120 (shifted -2 to left)
a2: 10 13 100
a3: 14 18 150
a4: 18 24 100 (shifted +4 to right)
a5: 115 120 100 (shifted -5 to left)
a6: 120 140 150
In my problem, total time I have is very big with respect to activity duration. While activities are like 10sec on average, total time I have would even be 10000sec. However that does not mean all of activities can be selected since shifting flexibility would not be enough for some activities to non-overlap.
Also in my problem, there are clusters of activities which overlaps and very big empty space where no activities and there comes another cluster of overlapping activities i.e a1, a2, a3 and a4 are let say cluster1 and a5, a6 and a7 is cluster2. Each cluster can be expanded in time by shifting some of them to left and right. By doing that, I can select more activities than the original activity selection problem. However, I do not know how to decide which tasks to be shifted to left or right.
My expectation is to find an near-optimal solution where total profit would be somehow local optima. I do not need global optimum value. Also I do not have any criteria about cluster utilization., i.e I do not have a guarantee about a minimum number of activity per cluster etc. Actually, these clusters something I visually describe. There is not defined cluster. However, in time domain, activities are separated as clusters somehow.
Also activity start and end times are all integers since I can dismiss fractions. I would have around 50 activities whose duration would be 10 on average. And time window is like 10000.
Are there any feasible solution to this problem?
You mentioned that you can partition the activities into clusters that don't overlap even if activities within them are shifted to the extent. Each of these clusters can be considered independently, and the optimal results computed for each cluster simply summed up for the final answer. So the first step of the algorithm could be a trial run that extends all activities in both directions, finds which ones form clusters, and process each cluster independently. In the worst case, all of the activities might form a single cluster.
Depending on the maximum size of the remaining clusters, there are several approaches. If it's under 20 (or even 30, depending on whether you want your program to run in seconds or minutes), you could combine search over all subsets of activities in the given cluster with a greedy approach. In other words: if you are processing a subset of N elements, try every one of its 2^N possible subsets (okay, 2^N-1 if we forget the empty subset), check whether the activities in this specific subset can be scheduled in non-overlapping manner, and pick the subset that is eligible and has maximum sum.
How do we check that a given subset of activities can be scheduled in non-overlapping manner? Let's sort them in ascending order of their end and process them from left to right. For every activity, we try to schedule it as early as possible, making sure it does no intersect with activities we already considered. So, the first activity in the cluster is always started time t earlier than originally planned, the second one is started either when the first one ends, or t earlier than originally planned, whichever is larger, and so on. If at any point we can't schedule the next activity in a way that it does not overlap with previous one, then there is no way to schedule the activities in this subset in a non-overlapping manner. This algorithm takes O(NlogN) time, and overall each cluster is processed in O(2^N * NlogN). Once again, note that this function grows very quickly, so if you are dealing with large enough clusters, this approach goes out the window.
===
Another approach is specific to the additional restrictions you provided. If the activities' starts and ends and parameter t are all measured in integer number of seconds, and t is about 2 minutes, then the problem for each cluster is set in a small discrete space. Even though you could position a task to start at a non-integer second value, there always is an optimal solution that uses only integers. (To prove it, consider an optimal solution that does not use integers - since t is integer, you can always shift tasks, starting from the leftmost, to the left a bit so that it starts at an integer value.)
Knowing that the start and end times are discrete, you can build a DP solution: process the activities in the ascending order of their end*, and memoize the maximum possible sum of weights you can obtain from the first 1, 2, ..., N activities for each x from activity_start - t to activity_start + t if a given activity ends at time x. If we denote this memoized function as f[activity][end_time], then the recurrence relation is f[a][e] = weight[a] + max(f[i][j] over all i < a, j <= e - (end[a] - start[a]), which roughly translates to "if activity a ended at time e, the previous activity must have ended at or before start of a - so let's pick the maximum total weight over previous activities and their ends, and add the current activity's weight".
*Again, we can prove that there is at least one optimal answer where this ordering is preserved, even though there might be other optimal answers which do not possess this property
We could go further and eliminate the iteration over previous activities, instead encoding this information in f. Its definition would then change to "f[a][e] is the maximum possible total weight of the first a activities if none of them ends after e", and recurrence relation would become f[a][e] = max(f[a-1][e], weight[a] + max(f[a-1][i] over i <= e - (end[a] - start[a])])), and its computational complexity would be O(X * N), where X is the total span of the discrete space where task starts/ends are placed.
I assume you need to compute not just the maximum possible weight, but also the activities you need to select to obtain it, and possibly even the exact time each of them needs to be started. Thankfully, we can derive all of this from the values of f, or compute it at the same time as we compute f. The latter is easier to reason about, so let's introduce a second function g[activity][end]. g[activity][end] returns a pair (last_activity, last_activity_end), essentially pointing us to the exact activity and its timing that the optimal weight in f[activity][end] uses.
Let's go through the example you provided to illustrate how this works:
(start, end, profit)
a1: 10 12 120
a2: 10 13 100
a3: 14 18 150
a4: 14 20 100
a5: 120 125 100
a6: 120 140 150
a7: 126 130 100
We order the activities by their end time, thereby swapping a7 and a6.
We initialize the values of f and g for the first activity:
f[1][7] = 120, f[1][8] = 120, ..., f[1][17] = 120, meaning that the first activity could end anywhere from 7 to 17, and costs 120. f[1][i] for all other i should be set to 0.
g[1][7] = (1, 7), g[1][8] = (1, 8), ..., g[1][17] = (1, 17), meaning that the last activity that was included in f[1][i] values was a1, and it ended at i. g[1][i] for all i outside [7, 17] is undefined/irrelevant.
That's where something interesting begins. For each i such that a2 cannot end at time i, let's assign f[2][i] = f[1][i], g[2][i] = g[1][i], which essentially means that we wouldn't be using activity a2 in those answers. For all other i, namely, in [8..18] interval, we have:
f[2][8] = max(f[1][8], 100 + max(f[1][0..5])) = f[1][8]
f[2][9] = max(f[1][9], 100 + max(f[1][0..6])) = f[1][9]
f[2][10] = max(f[1][10], 100 + max(f[1][0..7])). This is the first time when the second clause is not just plain 100, as f[1][7]>0. It is, in fact, 100+f[1][7]=220, meaning that we can take activity a2, shift it in a way that puts its end at time 10, and get a total weight of 220. We continue computing f[2][i] this way for all i <= 18.
The values of g are: g[2][8]=g[1][8]=(1, 8), g[2][9]=g[1][9]=(1, 9), g[2][10]=(2, 10), because it was optimal to take activity a2 and end it at time 10 in this case.
I hope the pattern of how this continues is visible - we compute all the values of f and g through the end, and then pick the maximum f[N][e] over all possible end times e of the last activity. Armed with the auxiliary function g, we can traverse the values backwards to figure out the exact activities and times. Namely, the last activity we use and its timing is in g[N][e]. Let's call them A and T. We know that A began at T-(end[A]-start[A]). Then, the previous activity must have ended at that point or before - so let's look at g[A-1][T-(end[A]-start[A]) for it, and so on.
Note that this approach works even if you don't partition anything into clusters, but with the partitioning, the size of the space in which tasks can be scheduled is reduced, and with it the runtime.
You might notice that neither of these solutions is polynomial in the size of input. I have a feeling that your problem doesn't have a general polynomial solution, but I was unable to prove it by reducing another NP-complete problem to it. Would be really curious to read a reduction / better general solution!

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments which have different resolution because of the sampling rate of each instrument. For a specific time, one of the sets have >10k entries while the other has ~2.5k. They however capture data over the same time interval, and I want to plot them on top of each other even though they have different resolution in data. The minimum and maximum x of both sets are the same however one of them have more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake.. I was trying to plot both datasets using the time data from the first instrument. Of course they need to be plotted with their respective time data and I put the first time data in the second plot by mistake..