I have a time series data set that is almost monotonically increasing. The values dip every now and then, but the series generally increases. The dips are due to errors in a sensor reading: either previous values are too high or later values are too low. My goal is to apply a pre-processing transformation to get a more stable signal. Here is an image for reference:
Pre-processing vs raw
I can do this relatively easily in python, but I'm having a difficult time doing this efficiently in Power BI. Here is an example table of raw and processed data to give an example of what I'm hoping to do:
Data table
Here is the DAX code that I've tried to apply to create the ProcessedValue column:
ProcessedValue =
VAR CurrentIndex = Query1[Index]
RETURN
    CALCULATE (
        MAX ( Query1[Value] ),
        FILTER (
            ALL ( Query1 ),
            Query1[Index] < CurrentIndex
                && Query1[Index] > CurrentIndex - 50
        )
    )
The two issues that I'm running into are 1) I run out of memory and 2) I don't know if the code is even doing what I'm intending it to do (because it doesn't complete). The table has ~ 1M data points. I'm fairly new to Power BI and I'm not aware of the tips and tricks needed to do fast calculations, so any help would be greatly appreciated. For reference here is the code that I'm using in python:
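import numpy as np

def backward_max(a, axis=0):
    # For each position i, take the max of all earlier values (position 0 keeps its own value).
    bmax = np.array([np.max(a[: max(1, i)], axis=axis) for i in range(len(a))])
    return bmax

df['ProcessedValue'] = backward_max(df['Value'].values)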
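The list comprehension above recomputes the maximum over a growing slice at every position, so the work grows quadratically with the length of the series. As a rough sketch, assuming the intent is "the largest value seen strictly before this row", the same output can be produced in a single pass with a running maximum. This version uses only the standard library; the sample values in `raw` are made up for illustration.

```python
from itertools import accumulate


def backward_max(values):
    """Return, for each position i, the max of values[:max(1, i)]."""
    values = list(values)
    if not values:
        return []
    # running[i] is the max of values[:i + 1], computed in one pass.
    running = list(accumulate(values, max))
    # Shift by one so position i only sees earlier readings;
    # position 0 keeps its own value, as in the original function.
    return [running[0]] + running[:-1]


raw = [1, 3, 2, 5, 4, 6]  # made-up readings with two dips
processed = backward_max(raw)
```

Note that this looks back over the whole history, whereas the DAX attempt only looks back over the previous 49 rows, so the two are not equivalent as written.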
To explain the question it's best to start with this
picture
I am modeling an optimization decision problem and a feature that I'm trying to implement is heat transfer between the process stages (a = 1, 2) taking into account which equipment type is chosen (j = 1, 2, 3) by the binary decision variable y.
The temperatures for the equipment are fixed values and my goal is to find (in the case of the picture) dT = 120 - 70 = 50 while keeping the temperature difference as a parameter (I want to keep the problem linear and need to multiply the temperature difference with a variable later on).
Things I have tried:
dT = T[a,j] - T[a-1,j]
(this obviously gives T = 80 for T[a-1,j] which is incorrect)
T[a-1] = sum(T[a-1,j] * y[a-1,j] for j in (1,2,3))
This will make the problem non-linear when I multiply with another variable.
I am using pyomo and the linear "glpk" solver. Thank you for reading my post and if someone could help me with this it is greatly appreciated!
If you only have 2 stages and 3 pieces of equipment at each stage, you could reformulate: let a binary decision variable Y[i] represent each of the 9 possible connections, and let delta_T[i] be a parameter holding the temperature difference for the same connection. Those 9 differences are easy to precompute and store as a model parameter.
If you want to keep it double-indexed, and assuming that only 1 piece of equipment will be selected at each stage, you could take the sum-product of the selection variable and the temperatures at each stage and subtract them.
dT[a] = sum(T[a, j]*y[a, j] for j in J) - sum(T[a-1, j]*y[a-1, j] for j in J)
for a ∈ {2, 3, ..., N}
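To check the arithmetic of the sum-product expression, here is a small plain-Python sketch. The temperatures in `T` and the selection in `y` are hypothetical values chosen only so that the result matches the 120 - 70 = 50 example from the question. In a Pyomo model `y` would be a binary variable rather than fixed integers, and the expression stays linear in `y` because `T` is a parameter.

```python
J = (1, 2, 3)

# Hypothetical fixed equipment temperatures T[a, j] (illustrative values only).
T = {
    (1, 1): 70, (1, 2): 80, (1, 3): 90,
    (2, 1): 110, (2, 2): 120, (2, 3): 130,
}

# Hypothetical selection: equipment 1 at stage 1, equipment 2 at stage 2.
y = {(a, j): 0 for a in (1, 2) for j in J}
y[1, 1] = 1
y[2, 2] = 1


def delta_t(a):
    """Temperature of the unit selected at stage a minus the one selected at stage a - 1."""
    return (sum(T[a, j] * y[a, j] for j in J)
            - sum(T[a - 1, j] * y[a - 1, j] for j in J))


dT = {a: delta_t(a) for a in (2,)}
```

With exactly one unit selected per stage, each sum collapses to the temperature of the selected unit, so `dT[2]` is the difference between the two selected temperatures.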
I've fitted Prophet with logistic growth and multiplicative seasonality over time series data (daily observations spanning several years, no additional regressors; just ds, y) and have the forecast dataframe. How do I use the values from forecast to remove seasonality?
Within forecast, I understand that the weekly and yearly columns deal with seasonality on a weekly/yearly basis, and that multiplicative_terms deals with changing magnitudes over time; but I don't know how to put this together to remove seasonality from my data.
I have tried the following possible way of removing seasonality, but I believe what I'm doing is wrong.
#R
df$y - forecast$trend * forecast$multiplicative_terms * forecast$weekly * forecast$yearly
For reference, when using seasonal_decompose, I've had to use the following to get rid of seasonality; this doesn't hold for Prophet because of the additional terms.
#Python
df.y - trend* decomposition.seasonal
Edit- After doing a bit of research, I'm currently running this which looks right, but I was wondering if anyone can confirm whether this is the correct way to remove seasonality + trend?
df$y -
(forecast$trend +
(forecast$trend * forecast$weekly) +
(forecast$trend * forecast$yearly) +
(forecast$trend * forecast$multiplicative_terms)
)
I am trying to find the 95% credible interval for each of 50 sample means. Sample sizes range from 2 to 600, and the values in each sample are bounded between 1 and 5.
ex:
sample 1 = (1,3.5,2.8,5,4.6)
sample 2 = (1,5)
sample 3 = (4.1,1.1,5,3.5,2,2.4,...)
Samples of size 10 or more follow a lognormal distribution, so I used JAGS for Bayesian estimation of the log-normal parameters, adapted from John K. Kruschke, with the model specification below:
modelstring = "
model {
  for( i in 1 : N ) {
    y[i] ~ dlnorm( muOfLogY , 1/sigmaOfLogY^2 )
  }
  sigmaOfLogY ~ dunif( 0.001*sdOfLogY , 1000*sdOfLogY )
  muOfLogY ~ dunif( 0.001*meanOfLogY , 1000*meanOfLogY )
  muOfY <- exp(muOfLogY+sigmaOfLogY^2/2)
  modeOfY <- exp(muOfLogY-sigmaOfLogY^2)
  sigmaOfY <- sqrt(exp(2*muOfLogY+sigmaOfLogY^2)*(exp(sigmaOfLogY^2)-1))
}
"
The model works fine with sample size > 10. However, with 3 <= sample size < 10 I got extreme values for the upper limit (e.g., 3000), which exceed the maximum possible value of the mean (i.e., 5).
With sample size = 2, I got the error below:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
NA/NaN/Inf in 'y'
I am new to JAGS and can't figure out how to solve these issues. I think that for samples < 10 the distribution is no longer lognormal!
Any ideas?
Thank you
First a semantic note. You are not using JAGS to find sample means. You are using JAGS to find the means of the populations from which the samples arose. If you wanted to find the sample (log)means, you could just take the mean of the (logarithms of the) sample values.
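To make the distinction concrete, this is roughly what "just take the mean" looks like for the first sample listed in the question. It is only a sketch of the sample statistics, not of the Bayesian estimate of the population mean.

```python
import math
from statistics import mean

sample = [1, 3.5, 2.8, 5, 4.6]  # "sample 1" from the question

# Sample mean: the plain average of the observed values.
sample_mean = mean(sample)

# Sample log-mean: the average of the logarithms of the observed values.
sample_log_mean = mean(math.log(v) for v in sample)
```

Neither number carries any uncertainty about the population; that is what the posterior from JAGS is for.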
Now, if the values in each sample are bounded between 1 and 5 (due to some external constraint), then the sample is NEVER drawn from a log-normal distribution, which inherently puts probability mass over values greater than five.
Let's imagine, for the sake of argument, that the samples do arise from lognormal sampling (and therefore aren't inherently bounded between 1 and 5). Then JAGS is simply telling you that the sample does not contain enough information to get a good estimate of the mean of the population from which it is drawn. I wouldn't worry about understanding the error when the sample size is two, because there is literally no way to get good inference about the population mean from two observations. This is true even if you know that the population is indeed log-normally distributed. And since your populations are not actually log-normally distributed (they are bounded between 1 and 5), the entire inferential procedure is invalid anyway.
I'm working with billions of rows of data, and each row has an associated start latitude/longitude, and end latitude/longitude. I need to calculate the distance between each start/end point - but it is taking an extremely long time.
I really need to make what I'm doing more efficient.
Currently I use a function (below) to calculate the hypotenuse between points. Is there some way to make this more efficient?
I should say that I have already tried casting the lat/longs as spatial geographies and using the built-in SQL STDistance() function (not indexed), but this was even slower.
Any help would be much appreciated. I'm hoping there is some way to speed up the function, even if it degrades accuracy a little (nearest 100m is probably ok).
Thanks in advance!
DECLARE @l_distance_m float
, @l_long_start FLOAT
, @l_long_end FLOAT
, @l_lat_start FLOAT
, @l_lat_end FLOAT
, @l_x_diff FLOAT
, @l_y_diff FLOAT
SET @l_lat_start = @lat_start
SET @l_long_start = @long_start
SET @l_lat_end = @lat_end
SET @l_long_end = @long_end
-- NOTE 2 x PI() x (radius of earth) / 360 = 111
SET @l_y_diff = 111 * (@l_lat_end - @l_lat_start)
SET @l_x_diff = 111 * (@l_long_end - @l_long_start) * COS(RADIANS((@l_lat_end + @l_lat_start) / 2))
SET @l_distance_m = 1000 * SQRT(@l_x_diff * @l_x_diff + @l_y_diff * @l_y_diff)
RETURN @l_distance_m
I haven't done any SQL programming since around 1994, however I'd make the following observations:
The formula that you're using works as long as the distances between your coordinates don't get too big. It'll have big errors for working out the distance between e.g. New York and Singapore, but for working out the distance between New York and Boston it should be fine to within 100m.
I don't think there's any approximation formula that would be faster, however I can see some minor implementation improvements that might speed it up:
(1) Why do you bother to assign @l_lat_start from @lat_start? Can't you just use @lat_start directly (and the same for @long_start, @lat_end, @long_end)?
(2) Instead of having 111 in the formulas for @l_y_diff and @l_x_diff, you could get rid of it there, saving a multiplication, and instead of 1000 in the formula for @l_distance_m you could have 111000.
(3) Using COS(RADIANS(@l_lat_end)) or COS(RADIANS(@l_lat_start)) won't degrade the accuracy as long as the points aren't too far apart, or if the points are all within the same city you could just work out the cosine once for any point in the city.
Apart from that, I think you'd need to look at other ideas, such as creating a table with the results and updating that results table whenever points are added to or deleted from the source table.
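Here is a rough Python sketch of suggestions (1) and (2): the arguments are used directly and the two constants are folded into a single 111000 multiplier. A haversine function is included only as a reference for checking how far off the approximation is on your own data. The city coordinates are approximate values assumed for illustration, and the 6,371 km Earth radius is the usual mean-radius assumption.

```python
import math

METERS_PER_DEGREE = 111_000.0  # 111 * 1000, the rounded constant from the question


def approx_distance_m(lat_start, long_start, lat_end, long_end):
    """Flat-earth (equirectangular) approximation with one final multiplication."""
    y_diff = lat_end - lat_start
    x_diff = (long_end - long_start) * math.cos(
        math.radians((lat_start + lat_end) / 2))
    return METERS_PER_DEGREE * math.sqrt(x_diff * x_diff + y_diff * y_diff)


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance on a sphere, used here only as a reference."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def relative_error(start, end):
    """Relative difference between the approximation and the reference."""
    reference = haversine_m(*start, *end)
    return abs(approx_distance_m(*start, *end) - reference) / reference


# Approximate (latitude, longitude) pairs, assumed for illustration.
new_york = (40.7128, -74.0060)
boston = (42.3601, -71.0589)
singapore = (1.3521, 103.8198)

short_hop_error = relative_error(new_york, boston)
long_haul_error = relative_error(new_york, singapore)
```

Running the same comparison on a sample of real rows would show whether the rounded 111 constant is accurate enough for the "nearest 100m" target before changing the SQL function.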
I am a newbie with Matlab and I have the following scenario (which is part of a larger problem).
Matrix A is 4754x1024 and matrix B is 6800x1024.
For every row in matrix A I need to calculate the Euclidean distance to every row in matrix B. I am using the following technique to calculate the distance, but I find that it is very inefficient and very time-consuming in Matlab.
for i=1:row_A
    A_data=A_test(i,:);
    for j=1:row_B
        B_data=B_train(j,:);
        X=[A_data;B_data];
        %calculate distance
        d=pdist(X,'euclidean');
        dist(j,i)=d;
    end
end
Any suggestions to optimise this? The final step involves performing this operation on 50 such sets of A and B.
Thanks and Regards,
Bhavya
I'm not sure what your code is actually doing.
Assuming your data has the following properties
assert(size(A,2) == size(B,2))
Try
d = zeros(size(A,1), size(B,1));
for i = 1:size(A,1)
    d(i,:) = sqrt(sum(bsxfun(@minus, B, A(i,:)).^2, 2));
end
Or possibly better organised by columns (See "Store and Access Data in Columns" in http://www.mathworks.co.uk/company/newsletters/news_notes/june07/patterns.html):
At = A.'; Bt = B.';
d = zeros(size(At,2), size(Bt,2));
for i = 1:size(At,2)
    d(i,:) = sqrt(sum(bsxfun(@minus, Bt, At(:,i)).^2, 1));
end