How to convert a mimira object (Cox regression model from multiple imputation and propensity score matching (MatchThem pkg)) into a forest plot - r-mice

Dear StackOverflow community,
As a surgeon who has spent the last 6 months enthusiastically teaching himself R (StackOverflow, and so many websites), I beg your indulgence if my concern is trivial.
The background:
Briefly, my objective is to run a Cox survival regression on a dataset of cancer patients. Because of the retrospective design, I planned 1:3 propensity score matching (PSM). The missing data were handled with multiple imputation ("mice" pkg), and the PSM was managed with the "MatchThem" pkg.
I used the "survey" pkg for the survival models (svycoxph() run across the imputed datasets through the with() function). This yields a mimira object, which I can easily print as a beautiful table with tbl_regression() ("gtsummary" pkg).
The issue:
As I usually present my Cox regressions both as a hazard-ratio table and in graphical form (a forest plot made with ggforest() from the "survminer" pkg), this time I am really stuck. ggforest() does not recognize the mimira object as a coxph object and throws this error:
Error in ggforest(tbl_regression_object, data = mimira_object) :
inherits(model, "coxph") is not TRUE
I guess that adding PSM on top of my multiple imputations is the problem, since I had no trouble producing a forest plot for a Cox regression on multiply imputed data alone (ggforest() can handle mira objects without problem via the pool_and_tidy_mice() function).
Here is the script:
#Data
library(fabricatr)
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
  N = 2000,
  age = runif(N, min = 18, max = 85),
  is_female = draw_binary(prob = 0.5, N = N),
  is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
  disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
  treatment = draw_binary(prob = 0.5, N = N),
  kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
  lambdas = 0.1, gammas = 1.8,
  x = participant_data,
  betas = c(is_female = -0.2, is_smoker = 1.2,
            treatment = -0.4, kps = -0.005,
            disease_stage = 0.2),
  maxt = 5)
# Merging df
library(dplyr)
mydata_complete <- bind_cols(survival_data, participant_data)
# generating missing value
library(missMethods)
mydata_uncomp <- delete_MCAR(mydata_complete, 0.3)
mydata <- mydata_uncomp
#1 imputation with "mice"
library(mice)
mydata$nelsonaalen <- nelsonaalen(mydata, eventtime, status)
mydata_mice_imp_m3 <- mice(mydata, maxit = 2, m = 3, seed = 20200801) # m=3 is for testing
#2 matching (PSM 1:3) with "MatchThem"
library(MatchThem)
mydata_imp_m3_psm <- matchthem(treatment ~ age + is_female + disease_stage, data = mydata_mice_imp_m3, approach = "within", ratio = 1, method = "optimal")
#3 Pooling coxph models across the multiply imputed, matched datasets with "survey"
library(survey)
mimira_object <- with(data = mydata_imp_m3_psm, expr = svycoxph(Surv(eventtime, status) ~ age + is_smoker + disease_stage))
pool_and_tidy_mice(mimira_object, exponentiate = TRUE, conf.int = TRUE) -> pooled_imp_m3_cph
# estimation with pool_and_tidy_mice() works on the mimira_object but cannot give me the degrees of freedom. Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.
> pooled_imp_m3_cph
term estimate std.error statistic p.value conf.low conf.high b df dfcom fmi lambda m riv ubar
1 age 0.9995807 0.001961343 -0.2138208 NaN NaN NaN 1.489769e-06 NaN Inf NaN 0.5163574 3 1.067643 1.860509e-06
2 is_smoker 2.8626952 0.093476026 11.2516931 NaN NaN NaN 4.182884e-03 NaN Inf NaN 0.6382842 3 1.764601 3.160589e-03
3 disease_stage 1.2386947 0.044092483 4.8547535 NaN NaN NaN 8.995628e-04 NaN Inf NaN 0.6169374 3 1.610540 7.447299e-04
#4 Table summary of the pooled results
library(gtsummary)
tbl_regression_object <- tbl_regression(mimira_object, exp = TRUE, conf.int = TRUE) # 95% CI and p-value are missing due to another issue in the pooling of the mimira_object: the MatchThem:::get.2dfcom function gives dfcom = 999999 (another issue to be solved in my concern)
#5 What it should look like as a graphical summary
library(survival)
mydata.cox <- coxph(Surv(eventtime, status) ~ age + is_smoker + disease_stage, mydata_uncomp) # (df mydata_uncomp is without imputation and PSM)
#with gtsummary
forestGT <-
  mydata.cox %>%
  tbl_regression(exponentiate = TRUE,
                 add_estimate_to_reference_rows = TRUE) %>%
  plot()
(forestGT) # See picture GT_plot1. Almost perfect. It would have been great to know how to add N, 95% CI, HR, p-value and model parameters (AIC, events, concordance, etc.)
#with survminer
HRforest <-
  survminer::ggforest(mydata.cox, data = mydata_uncomp)
(HRforest) # See picture Ggforest. Everything I need to know about my Cox regression is in there. For me it is just a great Cox regression forest plot.
#6 What actually happens when I do the same thing with the imputed and matched df
#with gtsummary
forestGT_imp_psm <-
  mimira_object %>%
  tbl_regression(exponentiate = TRUE,
                 add_estimate_to_reference_rows = TRUE) %>%
  plot() # WARNING message: In get.dfcom(object, dfcom) : Infinite sample size assumed.
(forestGT_imp_psm) # See picture GT_plot2. The plot is rendered but without the 95% CI
#with survminer
HRforest_imp_psm <-
  ggforest(mimira_object, data = mydata_imp_m3_psm) # ERROR: in ggforest(mimira_object, data = mydata_imp_m3_psm) : inherits(model, "coxph") is not TRUE
(HRforest_imp_psm)
#7 The lucky and providential step
# your solution/advice
Would greatly appreciate your help.
cheers.
AK
(Images not embedded; share links in the original post: Picture GT_plot1, Picture Ggforest_plot, Picture GT_plot2.)

It seems that there are two distinct problems here:
Problem #1: getting gtsummary to produce a table with p values and confidence intervals from the pooled, matched data.
Problem #2: getting ggforest() to produce a plot of the pooled estimates.
Problem #1:
Let us follow the instructions in the paper "MatchThem:: Matching and Weighting after Multiple Imputation" (https://arxiv.org/ftp/arxiv/papers/2009/2009.11772.pdf) [page 15]
and modify your block #3. Instead of calling pool_and_tidy_mice(), we do the following:
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)
This produces the following:
term estimate std.error statistic df p.value 2.5 % 97.5 %
1 age -0.0005997864 0.001448251 -0.4141453 55.266353 6.803707e-01 -0.003501832 0.00230226
2 is_smoker 1.1157796620 0.077943244 14.3152839 9.961064 5.713387e-08 0.942019234 1.28954009
3 disease_stage 0.2360965310 0.051799813 4.5578645 3.879879 1.111782e-02 0.090504018 0.38168904
This means that performing the imputation with mice and then matching with MatchThem works, since you do get the p values and the confidence intervals.
Compare to the output from pool_and_tidy_mice():
term estimate std.error statistic p.value b df dfcom fmi lambda m riv ubar
1 age -0.0005997864 0.001448251 -0.4141453 NaN 2.992395e-07 NaN Inf NaN 0.1902260 3 0.2349124 1.698446e-06
2 is_smoker 1.1157796620 0.077943244 14.3152839 NaN 2.041627e-03 NaN Inf NaN 0.4480827 3 0.8118657 3.352980e-03
3 disease_stage 0.2360965310 0.051799813 4.5578645 NaN 1.444843e-03 NaN Inf NaN 0.7179644 3 2.5456522 7.567636e-04
Everything is the same except for df and p.value, which were not calculated in the latter table.
I therefore think this is an issue with pool_and_tidy_mice(), and you should post it as an issue on the gtsummary GitHub.
For right now, you can bypass this problem by changing svycoxph() to survival::coxph() in block #3 when you call the with() function. If you do that, you will eventually get a gtsummary table with p values and confidence intervals. Ultimately, the problem is probably some interaction between svycoxph() and pool_and_tidy_mice(), which is why I believe you should post this on GitHub.
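For example, block #3 would then read as follows (a minimal sketch: same data and formula as in your script, only the modeling function is swapped):
library(survival)
mimira_object <- with(data = mydata_imp_m3_psm,
                      expr = coxph(Surv(eventtime, status) ~ age + is_smoker + disease_stage))
# pooling and the gtsummary table then work as before
matched.results <- pool(mimira_object)
summary(matched.results, conf.int = TRUE)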
Problem #2:
The short answer is that there cannot be a ggforest plot with all the data that you are looking for.
https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/pool reads:
A common error is to reverse steps 2 and 3, i.e., to pool the multiply-imputed data instead of the estimates. Doing so may severely bias the estimates of scientific interest and yield incorrect statistical intervals and p-values. The pool() function will detect this case.
This means that there is no "real" dataset behind the pooled estimates (i.e. you cannot really combine the datasets from imputations 1-3), which means that ggforest() cannot compute the desired plot (it needs a dataset, and no such combined dataset can be used, because it would lead to erroneous estimates).
What you could do is present a ggforest() plot for each imputation (so if you did 3 imputations, you will get 3 slightly different ggforest plots) and finally add the pooled-estimates plot by using plot() as suggested above.
To create each ggforest plot you need the following line of code:
ggforest(mimira_object$analyses[[1]], complete(mydata_imp_m3_psm, 1))
This will create the ggforest plot for your first imputation. Change the number to 2 or 3 to check the remaining imputations, or loop over all of them as sketched below.
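A compact way to do that (a sketch assuming m = 3 imputations, as in your script):
plots <- lapply(1:3, function(k) {
  # pair the k-th fitted model with the k-th matched, completed dataset
  ggforest(mimira_object$analyses[[k]], complete(mydata_imp_m3_psm, k))
})
plots[[1]] # then plots[[2]] and plots[[3]]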
I hope this helped,
Alex

If you provide a reproducible example (i.e. an example on a data set that we can all run on our machines), we can better help you out.
The gtsummary package exports a plot() method you can use to construct a forest plot. Example below!
library(gtsummary)
library(survival)
ggforest <-
  coxph(Surv(ttdeath, death) ~ trt + grade, trial) %>%
  tbl_regression(exponentiate = TRUE,
                 add_estimate_to_reference_rows = TRUE) %>%
  plot()
#> Registered S3 method overwritten by 'GGally':
#>   method from
#>   +.gg   ggplot2
ggforest
Created on 2021-08-26 by the reprex package (v2.0.1)

Related

Specific calculations for unique column values in DataFrame

I want to make a beta calculation in my dataframe, where beta = Σ[(daily return - mean daily return) * (daily market return - mean market return)] / Σ(daily market return - mean market return)**2.
But I want my beta calculation to apply to specific firms. In my dataframe, each firm has an ID code (specified in column 1), and I want each ID code to be associated with its own beta.
I tried groupby, loc and a for loop, but it always seems to return an error, since the beta calculation is quite long and requires many parentheses when inserted.
Any idea how to solve this problem? Thank you!
Dataframe:
index  ID  price  daily_return  mean_daily_return_per_ID  daily_market_return  mean_daily_market_return  date
0      1   27.50  0.008         0.0085                    0.0023               0.03345                   01-12-2012
1      2   33.75  0.0745        0.0745                    0.00458              0.0895                    06-12-2012
2      3   29,20  0.00006       0.00006                   0.0582               0.0045                    01-05-2013
3      4   20.54  0.00486       0.005125                  0.0009               0.0006                    27-11-2013
4      1   21.50  0.009         0.0085                    0.0846               0.04345                   04-05-2014
5      4   22.75  0.00539       0.005125                  0.0003               0.0006
I assume the following form of your equation is what you intended: beta = Σ[(daily_return - mean_daily_return) * (daily_market_return - mean_daily_market_return)] / Σ(daily_market_return - mean_daily_market_return)**2.
Then the following should compute the beta value for each group
identified by ID.
Method 1: Creating our own function to output beta
import pandas as pd
import numpy as np

# beta_data.csv is a csv version of the sample data frame you provided.
df = pd.read_csv("./beta_data.csv")

def beta(daily_return, daily_market_return):
    """
    Returns the beta calculation for two pandas columns of equal length.
    Will return NaN for groups that have just one row each. Adjust
    this function to account for groups that have only a single value.
    """
    mean_daily_return = np.sum(daily_return) / len(daily_return)
    mean_daily_market_return = np.sum(daily_market_return) / len(daily_market_return)
    num = np.sum(
        (daily_return - mean_daily_return)
        * (daily_market_return - mean_daily_market_return)
    )
    denom = np.sum((daily_market_return - mean_daily_market_return) ** 2)
    return num / denom

# groupby the column ID, then 'apply' the function we created above
# to the two desired columns of each group (note: the columns are
# selected with a list, which newer pandas versions require)
betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
Method 2: Using pandas' built-in statistical functions
Notice that beta as stated above is just the covariance of DR and DMR divided by the variance of DMR, i.e. beta = Cov(DR, DMR) / Var(DMR). Therefore we can write the above program much more concisely as follows.
import pandas as pd

df = pd.read_csv("./beta_data.csv")

def beta(dr, dmr):
    """
    dr: daily_return (pandas column)
    dmr: daily_market_return (pandas column)
    TODO: fix the divide-by-zero errors etc.
    """
    num = dr.cov(dmr)
    denom = dmr.var()
    return num / denom

betas = df.groupby("ID")[["daily_return", "daily_market_return"]].apply(
    lambda x: beta(x["daily_return"], x["daily_market_return"])
)
print(f"betas: {betas}")
The output in both cases is:
ID
1 0.012151
2 NaN
3 NaN
4 -0.883333
dtype: float64
The reason for getting NaNs for IDs 2 and 3 is that they only have a single row each. You should modify the function beta to accommodate these corner cases, as in the sketch below.
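One minimal way to guard that corner case (a sketch; the name beta_safe is just illustrative):
def beta_safe(dr, dmr):
    # with fewer than two rows, variance and covariance are undefined,
    # so return NaN explicitly instead of relying on pandas defaults
    if len(dr) < 2:
        return float("nan")
    return dr.cov(dmr) / dmr.var()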
Maybe you can start like this?
id_list = list(set(df["ID"].values.tolist()))
for firm_id in id_list:
    new_df = df.loc[df["ID"] == firm_id]

Computing empirical markov transition probabilities in Julia Data Frames

I want to use Julia DataFrames to construct a 3x3 Markov transition matrix, i.e. a frequency matrix that tells me the likelihood of transitioning from each of 3 states to the others. I am trying to learn data frames, and I would like to learn the best way to do this; this is more about general learning than about this particular example.
Here's some code I tried so far with some example data but I am not really familiar enough with how to think about dataframes to know how to proceed.
Any suggestions? Thank you.
state=[2,2,3,1,1,3,3,2,1,1,3,1,2,3,2,3,1,2,3,3,1]
statelag=[1,2,2,3,1,1,3,3,2,1,1,3,1,2,3,2,3,1,2,3,3]
df = DataFrame(state=state, statelag=statelag)
markov = combine(groupby(df, [:statelag, :state]), nrow => :cat_countmar)
sort!(markov, [:statelag, :state]) # this gives the number of occurrences of each transition
total = combine(groupby(df, :statelag), nrow => :cat_count)
# this gives the number of occurrences of each state
trans = Array{Float64}(undef, (3,3))
# trans should give probability of transitioning between different states
I basically need to "divide" cat_countmar by cat_count, so that I'm dividing the number of occurrences of a transition from state i to state j by the number of occurrences of state i. This will give the desired transition frequency. But I don't see how to put markov and total together in one data frame and easily carry out this computation.
You can use the transform function to get results of the same length as your original dataframe. See the code below for an example.
state=[2,2,3,1,1,3,3,2,1,1,3,1,2,3,2,3,1,2,3,3,1]
statelag=[1,2,2,3,1,1,3,3,2,1,1,3,1,2,3,2,3,1,2,3,3]
df = DataFrame(state=state, statelag=statelag)
df = transform(groupby(df, [:statelag, :state]), nrow => :cat_countmar)
# get the number of occurrences of each transition
df = transform(groupby(df, :statelag), nrow => :cat_count)
# get the number of occurrences of each state
df[:, :prob] = df[:, :cat_countmar] ./ df[:, :cat_count]
# get the transition probability
trans = unique(df[:, [:statelag, :state, :prob]])
# remove unnecessary rows and columns
reshape_trans = unstack(trans, :state, :prob)
# reshape into a wide, matrix-like format
trans_mat = convert(Matrix{Float64}, reshape_trans[:, 2:4])
# finally, convert the dataframe to a matrix
Final Output:
3×3 Array{Float64,2}:
0.285714 0.428571 0.285714
0.166667 0.166667 0.666667
0.5 0.25 0.25
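Alternatively, staying closer to your original two-data-frame approach, you could join markov and total and then divide (a sketch, assuming a DataFrames.jl version that provides leftjoin):
markov = leftjoin(markov, total, on = :statelag)
markov[:, :prob] = markov[:, :cat_countmar] ./ markov[:, :cat_count]
# markov now holds one row per observed transition with its probability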

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary data (Y). BernoulliNB predicts the probability of X given Y in the test data, based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
from sklearn.naive_bayes import BernoulliNB
import pandas as pd
import numpy as np

BNB = BernoulliNB()

# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})

# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})

# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns) - set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0

# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)
df_Te_Y = df_Te_Y[df_Tr_Y.columns]  # align column order with the training data

# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)

# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)

# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)

# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        return df.get(j, {}).get(i, np.nan)
    return lu

df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S):
This correctly gives "NaN" for the first 2 rows, because the training data contains no information about classes X=5 or X=0.
OK, there are a couple of issues here. I have a full working example below, but first those issues. Mainly, the assertion that "this correctly gives a NaN for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and is not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set); that nan is coming from your work going from df_R to df_S.
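You can verify this directly on the fitted model: its set of known labels comes only from the training data. A quick check, using the objects from the code above:
print(BNB.classes_)  # [1 2 3 9] - the test labels 5 and 0 can never be predicted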
This leads to the second issue, which is the line df_Te_Y = TEST.iloc[:, 1:]. That line should be df_Te_Y = TEST.iloc[:, 2:], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd

BNB = BernoulliNB()

# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0],
                         'X3': [0,0,0,0], 'X4': [1,0,0,0]})

# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                        'X1': [1,1,0,1,0,1,0,0,0],
                        'X2': [1,0,1,0,1,0,1,0,1],
                        'X3': [1,1,0,1,1,0,0,0,0],
                        'X4': [1,1,0,1,1,0,0,0,0]})

X = train_df.drop('Y', axis=1)    # Known training data - all but the 'Y' column.
Y = train_df['Y']                 # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1)  # Test data.
Y_te = test_df['Y']               # Only used to measure accuracy of prediction - if desired.

Ar_R = BNB.fit(X, Y).predict_proba(X_te)  # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_  # Rename as per class labels.

# Columns are class labels and rows are observations.
# Each entry is the probability of that observation being assigned to that class label.
print(df_R)

predicted_labels = df_R.idxmax(axis=1).values  # For each row, take the column with the highest prob in that row.
print(predicted_labels)  # [1 1 3 1 3 2 3 3 3]

print(accuracy_score(Y_te, predicted_labels))  # Percent accuracy of prediction.

print(BNB.fit(X, Y).predict(X_te))  # [1 1 3 1 3 2 3 3 3], can be used in one line if predicted_labels is all we want.

# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So the probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.

Xgboost cox survival time entry

In the new implementation of the Cox PH survival model in xgboost 0.81, how does one specify the start and end time of an event?
Thanks
The equivalent R function would be, for example:
cph_mod = coxph(Surv(Start, Stop, Status) ~ Age + Sex + SBP, data=data)
XGBoost does not allow for start times (i.e. delayed entry). If it makes sense for the application, you can always change the underlying time scale so that all subjects start at time = 0; a sketch of this follows below. However, XGBoost does allow for right-censored data. It seems impossible to find any documentation/example of how to implement a Cox model, but from the source code you can read "Cox regression for censored survival data (negative labels are considered censored)."
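A rough sketch of that time-scale change, together with the sign-based censoring encoding quoted above (the column names start, stop and status are hypothetical):
import pandas as pd
# toy records with delayed entry; status 1 = event, 0 = censored
df = pd.DataFrame({"start": [0, 2, 5], "stop": [4, 9, 7], "status": [1, 0, 1]})
# re-base every subject to time = 0 by using time-on-study as the outcome
df["duration"] = df["stop"] - df["start"]
# survival:cox codes censoring by sign: negative values = censored
df["y_xgb"] = df["duration"].where(df["status"] == 1, -df["duration"])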
Here is a short example for anyone who wants to try XGBoost with obj="survival:cox". We can compare the results to the scikit-learn survival package sksurv. To make XGBoost more similar to that framework, we use a linear booster instead of a tree booster.
import pandas as pd
import xgboost as xgb
from sksurv.datasets import load_aids
from sksurv.linear_model import CoxPHSurvivalAnalysis
# load and inspect the data
data_x, data_y = load_aids()
data_y[10:15]
Out[586]:
array([(False, 334.), (False, 285.), (False, 265.), ( True, 206.),
       (False, 305.)], dtype=[('censor', '?'), ('time', '<f8')])
# Since XGBoost only allow one column for y, the censoring information
# is coded as negative values:
data_y_xgb = [x[1] if x[0] else -x[1] for x in data_y]
data_y_xgb[10:15]
Out[3]: [-334.0, -285.0, -265.0, 206.0, -305.0]
data_x = data_x[['age', 'cd4']]
data_x.head()
Out[4]:
age cd4
0 34.0 169.0
1 34.0 149.5
2 20.0 23.5
3 48.0 46.0
4 46.0 10.0
# Since sksurv output log hazard ratios (here relative to 0 on predictors)
# we must use 'output_margin=True' for comparability.
estimator = CoxPHSurvivalAnalysis().fit(data_x, data_y)
gbm = xgb.XGBRegressor(objective='survival:cox',
                       booster='gblinear',
                       base_score=1,
                       n_estimators=1000).fit(data_x, data_y_xgb)
prediction_sksurv = estimator.predict(data_x)
predictions_xgb = gbm.predict(data_x, output_margin=True)
d = pd.DataFrame({'xgb': predictions_xgb,
                  'sksurv': prediction_sksurv})
d.head()
Out[13]:
sksurv xgb
0 -1.892490 -1.843828
1 -1.569389 -1.524385
2 0.144572 0.207866
3 0.519293 0.502953
4 1.062392 1.045287
d.plot.scatter('xgb', 'sksurv')
Note that these are predictions on the same data that was used to fit the model. It seems that XGBoost gets the values right, but sometimes with a linear transformation. I do not know why. Play around with base_score and n_estimators. Perhaps someone can add to this answer.

Pseudoinverse calculation in Python

Problem
I was working on the problem described here. I have two goals.
For any given system of linear equations, figure out which variables have unique solutions.
For those variables with unique solutions, return the minimal list of equations such that knowing those equations determines the value of that variable.
For example, in the following set of equations
X = a + b
Y = a + b + c
Z = a + b + c + d
The appropriate output should be c and d, where X and Y determine c and Y and Z determine d.
Parameters
I'm provided a two-column pandas DataFrame entitled InputDataSet, where the two columns are Equation and Variable. Each row represents a variable's membership in a given equation. For example, the above set of equations would be represented as
InputDataSet = pd.DataFrame([['X','a'],['X','b'],['Y','a'],['Y','b'],['Y','c'],
                             ['Z','a'],['Z','b'],['Z','c'],['Z','d']], columns=['Equation','Variable'])
The output will be stored in a two-column DataFrame named OutputDataSet as well, where the first column contains the variables that have a unique solution, and the second is a comma-delimited string of the minimal set of equations needed to solve the given variable. For example, the correct OutputDataSet would look like
OutputDataSet = pd.DataFrame([['c','X,Y'],['d','Y,Z']],columns=['Variable','EquationList'])
Current Solution
My current solution takes the InputDataSet and converts it into a NetworkX graph. After splitting the graph into connected subgraphs, it converts each subgraph into a biadjacency matrix (since the graph is by nature bipartite). After this conversion, the SVD is computed, and the nullspace and pseudoinverse are calculated from the SVD (to see how they are calculated, look at the source code for numpy.linalg.pinv and the cookbook function for nullspace; I fused the two functions, since they both use the SVD).
After calculating nullspace and pseudo-inverse, and rounding to a given tolerance, I find all rows in the nullspace where all of the coefficients are 0, and return those variables as those with a unique solution, and return those equations with non-zero coefficients for those variables in the pseudo-inverse.
Here is the code:
import networkx as nx
import pandas as pd
import numpy as np
import numpy.core as cr

def svd_lite(a, tol=1e-2):
    # Fused pseudoinverse + nullspace computation, both based on the SVD.
    wrap = getattr(a, "__array_prepare__", a.__array_wrap__)
    rcond = cr.asarray(tol)
    a = a.conjugate()
    u, s, vt = np.linalg.svd(a)
    nnz = (s >= tol).sum()
    ns = vt[nnz:].conj().T  # nullspace: right singular vectors for tiny singular values
    shape = a.shape
    if shape[0] > shape[1]:
        u = u[:, :shape[1]]
    elif shape[1] > shape[0]:
        vt = vt[:shape[0]]
    cutoff = rcond[..., cr.newaxis] * cr.amax(s, axis=-1, keepdims=True)
    large = s > cutoff
    s = cr.divide(1, s, where=large, out=s)
    s[~large] = 0
    res = cr.matmul(cr.swapaxes(vt, -1, -2), cr.multiply(s[..., cr.newaxis],
                                                         cr.swapaxes(u, -1, -2)))
    return (wrap(res), ns)

cols = InputDataSet.columns
tolexp = 2
graphs = nx.connected_component_subgraphs(nx.from_pandas_dataframe(InputDataSet, cols[0],
                                                                   cols[1]))
OutputDataSet = []
Eqs = InputDataSet[cols[0]].unique()
Vars = InputDataSet[cols[1]].unique()
for g in graphs:
    EqList = np.array([val for val in np.array(g.nodes) if val in Eqs])
    VarList = [val for val in np.array(g.nodes) if val in Vars]
    pinv, nulls = svd_lite(nx.bipartite.biadjacency_matrix(g, EqList, VarList, format='csc')
                           .astype(float).todense(), tol=10**-tolexp)
    df2 = np.where(~np.round(nulls, tolexp).any(axis=1))[0]
    df3 = np.round(np.array(pinv), tolexp)
    OutputDataSet.extend([[VarList[i], ",".join(EqList[np.nonzero(df3[i])])] for i in df2])
OutputDataSet = pd.DataFrame(OutputDataSet)
Issues
On the data that I've tested this algorithm on, it performs pretty well with decent execution time. However, the main issue is that it suggests far too many equations as required to determine a given variable.
Often, with datasets of 10,000 equations, the algorithm will claim that 8,000 of those 10,000 are required to determine a given variable, which most definitely is not the case.
I tried raising the tolerance (to which I round the coefficients in the pseudo-inverse) to 0.1, but even then, nearly 5000 equations had non-zero coefficients.
I had conjectured that perhaps the pseudo-inverse is collapsing upon a non-optimal set of coefficients, but the Moore-Penrose pseudoinverse is unique, so that isn't a possibility.
Am I doing something wrong here? Or is the approach I'm taking not going to give me what I desire?
Further Notes
All of the coefficients of all of the variables are 1
The results the current algorithm is producing are reliable ... When I multiply any vector of equation totals by the pseudoinverse generated by the algorithm, I get values essentially equal to those claimed to have a unique solution, which is promising.
What I want to know here is either whether I'm doing something wrong in how I'm extrapolating information from the pseudo-inverse, or whether my approach is completely wrong.
I apologize for not posting any actual results: not only are they quite large, they are also somewhat unintuitive, since they are reformatted into an XML that would probably take another question to explain anyway.
Thank you for your time!