Different results from forecast::checkresiduals and Box.test - testing

I'm trying to fit an ar modelo with exogenous regressors, in particular seasonal dummies and trend with AR(3) error. For this I'm using the following code:
modelo<-Arima(log.licor, order = c(3,0,0), xreg = tend_esta, include.mean = F)
there is no mean included since I'm not leaving any seasonal dummy out of the regression.
The result of
forecast::checkresiduals(modelo, test = "LB")
is:
Ljung-Box test
data: Residuals from Regression with ARIMA(3,0,0) errors
Q* = 77.787, df = 7, p-value = 3.886e-14
Model df: 17. Total lags used: 24
but the result of
Box.test(residuals(modelo), type = "Ljung-Box")
is
Box-Ljung test
data: residuals(modelo)
X-squared = 1.3407, df = 1, p-value = 0.2469
am I doing something wrong with the arguments? the implication of each results are completly different.

I had the same problem. Play around with the lag and fitdf parameters in Box.test. So for your problem, see how it says "Model df: 17" and "Total lags used: 24" in the forecast version? Try those in the Box.test version.

Related

How to calculate rolling.agg('max') utilising a dataframe column as input to my function

I'm working with a kline dataframe. I'm adding a Swing_High and Swing_Low column to my df.
I've picked up an error where during low volatile periods my Close == Swing_Low price. This gives me a inf error in another function I have where close / Swing_Low.
To fix this I need to calculate the max/min value based on whether Close == Swing_Low or not. Default is for the rolling period to be 10 but if the above is true then increase the rolling period to 15.
Below is how I calculated the Swing_High and Swing_Low up to encountering Inf error.
import pandas as pd
df = pd.read_csv('Data/bybit_BTCUSD_15m.csv')
df["Date"] = df["Date"].astype('datetime64[ns]')
# Calculate the swing high and low for a given length
df['Swing_High'] = df['High'].rolling(10).agg('max')
df['Swing_Low'] = df['Low'].rolling(10).agg('min')
I tried the below function but it gives me a ValueError: The truth value of a Series is ambiguous
def swing_high(close, high, period1, period2):
a = high.rolling(period1).agg('max')
b = high.rolling(period2).agg('max')
if a != close:
return a
else:
return b
df['Swing_High'] = swing_high(df['Close'], df['High'], 10, 15)
How do I fix this or is there a better way to achieve my desired outcome?
A simple solution for what you're trying to achieve :
using the where function:
here’s the basic syntax using the pandas where() function:
df['col'] = (value_if_false).where(condition, value_if_true)
df['Swing_High_10']=df['High'].rolling(10).agg('max')
df['Swing_High_15']=df['High'].rolling(15).agg('max')
df['Swing_High']=(df['Swing_High_15']).where(df['Swing_High_10']!=df['Close'], df['Swing_High_15'])

Configuring Auto Arima for SARIMAX

I have a weekly time series from Jan 2018 to Dec 2021 that I am trying to build a model for. I am also using a weekly series of exogenous variables representing COVID cases.
I am not getting great results and I am unsure if it's because I've not considered something obvious (I'm new to time series prediction) or if it's because the data is just hard to predict. Would anyone be able to provide any advice on how to move forward in this situation?
Screenshots of the data.
Raw
Seasonal
Trend
Residual
Below is the code I'm using to run auto_arima. I set the seasonal differencing to 1 to force seasonal differencing as per tips on the auto_arima site.
from pmdarima.arima import auto_arima
y = merged_ts['y']
exogenous = merged_ts['exogenous']
train_size = int(len(y) * 0.8)
train_y = y[:train_size]
test_y = y[train_size:]
train_exogenous = merged_ts['exogenous'][:train_size]
test_exogenous = merged_ts['exogenous'][train_size:]
exogenous_df = pd.DataFrame(train_exogenous)[:train_size]
step_wise=auto_arima(
train_y,
X=exogenous_df,
m=52,
D=1,
seasonal=True,
trace=True,
stepwise = True,
n_jobs = -1,
error_action='ignore',
suppress_warnings=True)
best_order = step_wise.order
best_seasonal_order = step_wise.seasonal_order
I get best_order = (0, 0, 0) and best_seasonal_order = (1, 1, 0, 52)
The AIC for the best model is 827.
I then configure SARIMAX as follows
from pandas._libs.algos import take_1d_int16_float64
from statsmodels.tsa.statespace.sarimax import SARIMAX
start = len(train_y)
end = len(train_y) + len(test_y)
model = SARIMAX(
train_y,
exogenous=exogenous_df,
order=best_order,
seasonal_order=best_seasonal_order,
enforce_invertibility=False)
results = model.fit()
forecasting_window_for_validation = len(test_y)
forecast = results.predict(start, end, typ='levels')
forecast_based_on_forecasting_window=
pd.DataFrame(forecast[:forecasting_window_for_validation])
forecast_based_on_forecasting_window.set_index(
test_y.index[:forecasting_window_for_validation],
inplace=True)
forecast_based_on_forecasting_window = pd.merge(
forecast_based_on_forecasting_window,
test_y,
left_index=True,
right_index=True,
how='left')
forecast_based_on_forecasting_window.columns = ['Forecast', 'Actual']
forecast_based_on_forecasting_window['Forecast'] =
forecast_based_on_forecasting_window[['Forecast']]
forecast_based_on_forecasting_window['Actual'] =
forecast_based_on_forecasting_window[['Actual']]
forecast_based_on_forecasting_window.plot()
The mean squared error I get is 146.
Result
I would love some pointers in what I might be doing wrong or ways to improve it. My main issue is that I'm not sure if it's my lack of experience or the weak predictive power of the data, although I can see a seasonal pattern. I've tried using random walk approach, using a simple moving average and last value models but it feels like a seasonal model should be doable, but I'm not sure.
Thank you for any tips at all!

How do I do an OLS on GLS time series regression in python?

I am attempting to transfer my team's Eview code to Python and I got stock with the following line in Eviews:
equation eq_LSTrend.ls(cov=hac) log({Price})=c(1) * #trend + c(2).
Here, the time regression analysis of a certain time window is to be performed on the log(price) and the intercept c(1) as well as the slope c(2) have to be determined.
Let's say I have the following df:
import pandas as pd
Range = pd.date_range('1990-01-01', periods=8, freq='D')
log_price = [5.0835, 5.0906, 5.0946, 5.0916, 5.0825, 5.0833, 5.0782, 5.0709]
df = pd.DataFrame({ 'Date': Range, 'Log Price': log_price })
df.set_index('Date', inplace=True)
And the df looks like this:
Date Log Price
1990-01-01 5.0835
1990-01-02 5.0906
1990-01-03 5.0946
1990-01-04 5.0916
1990-01-05 5.0825
1990-01-06 5.0833
1990-01-07 5.0782
1990-01-08 5.0709
How could I, for example, take a rolling 5 period window, do a OLS or GLS analysis and get the wanted parameters (the slope and the intercept parameters?)
Also, which library would be appropriate for it (statsmodels or maybe some other library)?
Ideally, the code would look something like this:
df_window = df.rolling(window = 5)
slope_output = sm.GLS(df_window).slope
or if separate columns have to be provided as an input (in this case I would leave "Date" as a separate column in df)
df_window = df.rolling(window = 5)
slope_output = sm.GLS(depend_var = df_window["Log Price"], independ_var = df_window["Date"]).slope
I am quite new to python so please pardon my bad coding!

How to convert a mimira object (Cox regression model, from multiple imputations and a propensity score matching (MatchThem pkg)) into a Forest plot

Dear StackOverflow community,
as a surgeon, and full of enthusiasm for 6 months for R learning in self-taught mode (StackOverflow, and so many websites), I beg your indulgence in the triviality of my concern.
The background:
Briefly, my objective is to run a survival cox model regression for a dataset of cancer patients. Due to the retrospective aspect, I planned to make a matching 1:3 with propensity score matching (PSM). The missing data were dealt with multiple imputations ("mice" pkg). The PSM was managed with "MatchThem" pkg.
I used "survey" pkg for pooling the survival (svycoxph() pooled through with() function). This leads us to a mimira object, which I can easily print out into a beautiful Table, with tbl_regression ("gtsummary" pkg).
The issue:
As a usually print my cox regressions into a Hazard ratios Table and a graphical version (Forest plot with ggforest(), from "survminer" pkg), this time I am really stuck. The function ggforest doesn't recognize the mimira object as a "coxph object" and send this error :
Error in ggforest(tbl_regression_object, data = mimira_object) :
inherits(model, "coxph") is not TRUE
I guess that adding a PSM to my multiple imputations is the problem, as I had no problem for printing cox regression of multiple imputations with Forest plot (ggforest is able to deal mira objects without problem with pool_and_tidy_mice() function).
Here is the script:
#Data
library(fabricatr)
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
N = 2000,
age = runif(N, min = 18, max = 85),
is_female = draw_binary(prob = 0.5, N = N),
is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
treatment = draw_binary(prob = 0.5, N = N),
kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
lambdas = 0.1, gammas = 1.8,
x = participant_data,
betas = c(is_female = -0.2, is_smoker = 1.2,
treatment = -0.4, kps = -0.005,
disease_stage = 0.2),
maxt = 5)
# Merging df
library(dplyr)
mydata_complete <- bind_cols(survival_data, participant_data)
# generating missing value
library(missMethods)
mydata_uncomp <- delete_MCAR(mydata_complete, 0.3)
mydata <- mydata_uncomp
#1 imputation with "mice"
library(mice)
mydata$nelsonaalen <- nelsonaalen(mydata, eventtime, status)
mydata_mice_imp_m3 <- mice(mydata, maxit = 2, m = 3, seed = 20200801) # m=3 is for testing
#2 matching (PSM 1:3) with "MatchThem"
library(MatchThem)
mydata_imp_m3_psm <- matchthem(treatment ~ age + is_female + disease_stage, data = mydata_mice_imp_m3, approach = "within" ,ratio= 1, method = "optimal")
#3 Pooling Coxph models in multiple imputed datasets and PSM with "survey"
library(survey)
mimira_object <- with(data = mydata_imp_m3_psm, expr = svycoxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage))
pool_and_tidy_mice(mimira_object, exponentiate = TRUE, conf.int=TRUE) -> pooled_imp_m3_cph
# estimates with pool_and_tidy_mice() works with mimira_object but cannot bring me de degree of freedoms. Warning message :
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_imp_m3_cph
term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
1 age 0.9995807 0.001961343 -0.2138208 NaN NaN NaN 1.489769e-06 NaN Inf NaN 0.5163574 3 1.067643 1.860509e-06
2 is_smoker 2.8626952 0.093476026 11.2516931 NaN NaN NaN 4.182884e-03 NaN Inf NaN 0.6382842 3 1.764601 3.160589e-03
3 disease_stage 1.2386947 0.044092483 4.8547535 NaN NaN NaN 8.995628e-04 NaN Inf NaN 0.6169374 3 1.610540 7.447299e-04
#4 Table summary of the pooled results
library(gtsummary)
tbl_regression_object <- tbl_regression(mimira_object, exp=TRUE, conf.int = TRUE) # 95% CI and p-value are missing due to an issue with an other issue in the pooling of the mimira_object. The Matchthem:::get.2dfcom function gives a dfcom = 999999 (another issue to be solved in my concern)
#5 What it should looks like as graphical summary
library(survival)
mydata.cox <- coxph(Surv(eventtime, status) ~ age+ is_smoker + disease_stage, mydata_uncomp) # (df mydata_uncomp is without imputation and PSM)
#with gtsummary
forestGT <-
mydata.cox %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
(forestGT) # See picture GT_plot1. Almost perfect. Would have been great to know how to add N, 95% CI, HR, p-value and parameters of the model (AIC, events, concordance, etc.)
#with survminer
HRforest <-
survminer::ggforest(mydata.cox, data = mydata_uncomp)
(HRforest) # See picture Ggforest. Everything I need to know about my cox regression is all in there. For me it is just a great regression cox forest plot.
#6 Actually what happens when I do the same thing with imputed and matched df
#with gtsummary
forestGT_imp_psm <-
mimira_object %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot() # WARNING message : In get.dfcom(object, dfcom) : Infinite sample size assumed.
(forestGT_imp_psm) # See picture GT_plot2. The plot is rendered but without 95% IC
#with survminer
HRforest_imp_psm <-
ggforest(mimira_object, data = mydata_imp_m3_psm) # ERROR:in ggforest(mimira_object, data = mydata_imp_m3_psm) : inherits(model, "coxph") is not TRUE
(HRforest_imp_psm)
#7 The lucky and providential step
# your solution/advise
Would greatly appreciate your help.
cheers.
AK
Picture GT_plot1
(not allowed to embed images in this post, here is sharelink : GT_plot1
Picture Ggforest_plot
Ggforest_plot
Picture GT_plot2
GT_plot2
It seems that there are two distinct problems here:
Problem #1. getting gtsummary() to produce a table with p values and confidence intervals of the pooled, matched data
Problem #2. producing a ggforest() to produce a plot of the pooled estimates.
Problem #1:
Let us follow the instructions in the paper "MatchThem:: Matching and Weighting after Multiple Imputation" (https://arxiv.org/ftp/arxiv/papers/2009/2009.11772.pdf) [page 15]
and modify your block #3. Instead of calling pool_and_tidy_mice() we do the following:
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)
This produces the following:
term estimate std.error statistic df p.value 2.5 % 97.5 %
1 age -0.0005997864 0.001448251 -0.4141453 55.266353 6.803707e-01 -0.003501832 0.00230226
2 is_smoker 1.1157796620 0.077943244 14.3152839 9.961064 5.713387e-08 0.942019234 1.28954009
3 disease_stage 0.2360965310 0.051799813 4.5578645 3.879879 1.111782e-02 0.090504018 0.38168904
This means that performing the imputation with mice and then matching with MatchThem works, since you do get the p values and the confidence intervals.
Compare to the output from pool_and_tidy_mice():
term estimate std.error statistic p.value b df dfcom fmi lambda m
1 age -0.0005997864 0.001448251 -0.4141453 NaN 2.992395e-07 NaN Inf NaN 0.1902260 3
2 is_smoker 1.1157796620 0.077943244 14.3152839 NaN 2.041627e-03 NaN Inf NaN 0.4480827 3
3 disease_stage 0.2360965310 0.051799813 4.5578645 NaN 1.444843e-03 NaN Inf NaN 0.7179644 3
riv ubar
1 0.2349124 1.698446e-06
2 0.8118657 3.352980e-03
3 2.5456522 7.567636e-04
Where everything is the same except for df and p.value which were not calculated in the latter table.
I therefore think this is an issue with the pool_and_tidy_mice() and you should post this as an issue on GitHub at gtsummary.
For right now, you can bypass this problem by changing svycoxph() to survival::coxph() in block #3 when you call the with() function. If you do that, then eventually you will get a gtsummary table with p.values and confidence intervals. Ultimately, the problem is probably some interaction between svycoxph() and pool_and_mice(), hence why I believe that you should post this on GitHub.
Problem #2:
The short answer is that there cannot be a ggforest plot with all the data that you are looking for.
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/pool reads:
A common error is to reverse steps 2 and 3, i.e., to pool the multiply-imputed data instead of the estimates. Doing so may severely bias the estimates of scientific interest and yield incorrect statistical intervals and p-values. The pool() function will detect this case.
This means that there is no "real" dataset for the pooled estimates (i.e. you cannot really combine the datasets for imputations 1-3), which means that ggforest() cannot compute the desired plot (since it needs to have a dataset and that cannot be used because it would lead to erroneous estimates).
What you could do, is present all the ggforest plots for each imputation (so if you did 3 imputations, you will get 3 slightly different ggforest plots) and finally add the pooled estimates plot by using plot() as suggested above.
To create each ggforest plot you need the following line of code:
ggforest(mimira_object$analyses[[1]], complete(mydata_imp_m3_psm, 1))
This will create the ggforest plot for your first imputation. Change the numbers to 2 and 3 to check the remaining imputations.
I hope this helped,
Alex
If you provide a reproducible example (i.e. an example on a data set that we can all run on our machines), we can better help you out.
The gtsummary package exports a plot() method you can use to construct a forest plot. Example below!
library(gtsummary)
library(survival)
ggforest <-
coxph(Surv(ttdeath, death) ~ trt + grade, trial) %>%
tbl_regression(exponentiate = TRUE,
add_estimate_to_reference_rows = TRUE) %>%
plot()
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
ggforest
Created on 2021-08-26 by the reprex package (v2.0.1)

Pseudoinverse calculation in Python

Problem
I was working on the problem described here. I have two goals.
For any given system of linear equations, figure out which variables have unique solutions.
For those variables with unique solutions, return the minimal list of equations such that knowing those equations determines the value of that variable.
For example, in the following set of equations
X = a + b
Y = a + b + c
Z = a + b + c + d
The appropriate output should be c and d, where X and Y determine c and Y and Z determine d.
Parameters
I'm provided a two columns pandas DataFrame entitled InputDataSet where the two columns are Equation and Variable. Each row represents a variable's membership in a given equation. For example, the above set of equations would be represented as
InputDataSet = pd.DataFrame([['X','a'],['X','b'],['Y','a'],['Y','b'],['Y','c'],
['Z','a'],['Z','b'],['Z','c'],['Z','d']],columns=['Equation','Variable'])
The output will be stored in a 2 column DataFrame named OutputDataSet as well, where the first contains the variables that have unique solution, and the second is a comma delimited string of the minimal set of equations needed to solve the given variable. For example, the correct OutputDataSet would look like
OutputDataSet = pd.DataFrame([['c','X,Y'],['d','Y,Z']],columns=['Variable','EquationList'])
Current Solution
My current solution takes the InputDataSet and converts it into a NetworkX graph. After splitting the graph into connected subgraphs, it then converts the graph into a biadjacency matrix (since the graph by nature is bipartite). After this conversion, the SVD is computed, and the nullspace and pseudoinverse are calculated from the SVD (To see how they are calculated, see here and here: look at the source code for numpy.linalg.pinv and the cookbook function for nullspace. I fused the two functions since they both use SVD).
After calculating nullspace and pseudo-inverse, and rounding to a given tolerance, I find all rows in the nullspace where all of the coefficients are 0, and return those variables as those with a unique solution, and return those equations with non-zero coefficients for those variables in the pseudo-inverse.
Here is the code:
import networkx as nx
import pandas as pd
import numpy as np
import numpy.core as cr
def svd_lite(a, tol=1e-2):
wrap = getattr(a, "__array_prepare__", a.__array_wrap__)
rcond = cr.asarray(tol)
a = a.conjugate()
u, s, vt = np.linalg.svd(a)
nnz = (s >= tol).sum()
ns = vt[nnz:].conj().T
shape = a.shape
if shape[0]>shape[1]:
u = u[:,:shape[1]]
elif shape[1]>shape[0]:
vt = vt[:shape[0]]
cutoff = rcond[..., cr.newaxis] * cr.amax(s, axis=-1, keepdims=True)
large = s > cutoff
s = cr.divide(1, s, where=large, out=s)
s[~large] = 0
res = cr.matmul(cr.swapaxes(vt, -1, -2), cr.multiply(s[..., cr.newaxis],
cr.swapaxes(u, -1, -2)))
return (wrap(res),ns)
cols = InputDataSet.columns
tolexp=2
graphs = nx.connected_component_subgraphs(nx.from_pandas_dataframe(InputDataSet,cols[0],
cols[1]))
OutputDataSet = []
Eqs = InputDataSet[cols[0]].unique()
Vars = InputDataSet[cols[1]].unique()
for i in graphs:
EqList = np.array([val for val in np.array(i.nodes) if val in Eqs])
VarList = [val for val in np.array(i.nodes) if val in Vars]
pinv,nulls = svd_lite(nx.bipartite.biadjacency_matrix(i,EqList,VarList,format='csc')
.astype(float).todense(),tol=10**-tolexp)
df2 = np.where(~np.round(nulls,tolexp).any(axis=1))[0]
df3 = np.round(np.array(pinv),tolexp)
OutputDataSet.extend([[VarList[i],",".join(EqList[np.nonzero(df3[i])])] for i in df2])
OutputDataSet = pd.DataFrame(OutputDataSet)
Issues
On the data that I've tested this algorithm on, it performs pretty well with decent execution time. However, the main issue is that it suggests far too many equations as required to determine a given variable.
Often, with datasets of 10,000 equations, the algorithm will claim that 8,000 of those 10,000 are required to determine a given variable, which most definitely is not the case.
I tried raising the tolerance (what I round the coefficients in the pseudo-inverse) to .1, but even then, nearly 5000 equations had non-zero coefficients.
I had conjectured that perhaps the pseudo-inverse is collapsing upon a non-optimal set of coefficients, but the Moore-Penrose pseudoinverse is unique, so that isn't a possibility.
Am I doing something wrong here? Or is the approach I'm taking not going to give me what I desire?
Further Notes
All of the coefficients of all of the variables are 1
The results the current algorithm is producing are reliable ... When I multiply any vector of equation totals by the pseudoinverse generated by the algorithm, I get values essentially equal to those claimed to have a unique solution, which is promising.
What I want to know here is either whether I'm doing something wrong in how I'm extrapolating information from the pseudo-inverse, or whether my approach is completely wrong.
I apologize for not posting any actual results, but not only are they quite large, but they are somewhat unintuitive since they are reformatted into an XML which would probably take another question to explain anyways.
Thank you for you time!