Can't deploy random forest model - dataframe

Hello, I'm new to Shiny and R. I can't deploy my random forest model because the dataframe that I created doesn't exist, and I get this error:
Error in $<-.data.frame: replacement has 0 rows, data has 1
Does anyone have the same problem? Thank you.


Feature Selection for Text Classification with Information Gain in R

I'm trying to prepare my dataset for binary document classification with an SVM algorithm in R.
The dataset is a combination of 150171 labelled variables and 2099 observations stored in a dataframe. The variables are a combination of uni- and bigrams which were retrieved from a text dataset.
When I try to calculate the information gain as a feature selection method, the error "cannot allocate vector of size X Gb" occurs, although I have already extended my memory and I'm running a 64-bit operating system. I tried the following package:
install.packages("FSelector")
library(FSelector)
value <- information.gain(Usefulness ~., dat_SentimentAnalysis)
Does anybody know a solution/any trick for this problem?
Thank you very much in advance!

pyomo matrix product

I would like to use pyomo to solve a constrained multiple linear regression.
To do so, I have 3 matrices:
X (named tour1 in the following code): my inputs (600x13) (bureaux*t1 in pyomo)
Y (named tour2 in the following code): the matrix I want to predict (600x3) (bureaux*t2 in pyomo)
T (named transfer in the code) (13x3) (t1*t2 in pyomo)
I would like to do the following:
ypred = X*T
minimize (ypred-y)**2
subject to
0<T<1
and Sum_i(Tij)=1
To that end, I started the following code:
import numpy as np
import pandas as pd
from pyomo.environ import *
tour1=pd.DataFrame(np.random.random(size=(60,13)),columns=["X"+str(i) for i in range(13)],index=["B"+str(i) for i in range(60)])
tour2=pd.DataFrame(np.random.random(size=(60,3)),columns=["Y"+str(i) for i in range(3)],index=["B"+str(i) for i in range(60)])
def gettour1(model,i,j):
    return tour1.loc[i,j]
def gettour2(model,i,j):
    return tour2.loc[i,j]
def cost(model):
    return sum((sum(model.tour1[i,k] * model.transfer[k,j] for k in model.t1) - model.tour2[i,j] )**2 for i in model.bureaux for j in model.tour2)
model = ConcreteModel()
model.bureaux = Set(initialize=tour1.index.tolist())
model.t1 = Set(initialize=tour1.columns)
model.t2 = Set(initialize=tour2.columns)
model.tour1 = Param(model.bureaux, model.t1,initialize=gettour1)
model.tour2 = Param(model.bureaux, model.t2,initialize=gettour2)
model.transfer = Var(model.t1,model.t2,bounds=[0,1])
model.obj=Objective(rule=cost, sense=minimize)
Unfortunately, I get an error at this stage:
KeyError: "Index '('X0', 'B0', 'Y0')' is not valid for indexed component 'transfer'"
Does anyone know how I can compute the objective?
Any help with the constraints would also be appreciated :-)
A couple of things...
First, the error you are getting. The error message contains information that helps identify the problem: the construction is trying to index transfer with a 3-part index (x, b, y), which is clearly outside of t1 x t2. If you look at the sum in your objective, you are mistakenly iterating over model.tour2 instead of model.t2, so j becomes a (bureau, y) pair rather than a single t2 element.
Also, your bounds parameter should be a tuple: bounds=(0,1).
While building the model, you should be pprint()-ing the model very frequently to look for these types of issues. That only works well if you have "small" data. 60 x 13 may be the normal problem size, but it is a total pain to troubleshoot. So, start with something tiny, maybe 3 x 4. Make a Set, pprint(). Make a Constraint, pprint()... Once the model computes/solves with "tiny" data, just pop in the real stuff.

I can't understand a numpy array concept in sklearn

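My code:
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
diabetes_x=np.array([[1],[2],[3]])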
diabetes_x_train=diabetes_x
diabetes_x_test=diabetes_x
diabetes_y_train=np.array([3,2,4])
diabetes_y_test=np.array([3,2,4])
model=linear_model.LinearRegression()
model.fit(diabetes_x_train,diabetes_y_train)
diabetes_y_predict=model.predict(diabetes_x_test)
print("Mean Squared error is :",mean_squared_error(diabetes_y_test,diabetes_y_predict))
print("weights : ",model.coef_)
print("intercept : ",model.intercept_)
In this code, diabetes_x is a 2-D array, but diabetes_y_train and diabetes_y_test are 1-D arrays. Why? Can someone please explain the concept behind diabetes_x and diabetes_y?
In machine learning terminology, X is the input variable and y is the output variable.
Suppose there is a dataset with 5 columns where the last column is the result. The input then consists of all columns except the last, and the last column is used to check whether the learned mapping is correct, i.e. to calculate the error during training or validation. That is why X is 2-D (one row per sample, one column per feature, even when there is only one feature) while y is 1-D (one target value per sample).
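A short sketch of the shapes involved, reusing the numbers from the question. The names `X`, `y`, `flat`, and `X_from_flat` are mine. scikit-learn estimators expect X with shape (n_samples, n_features) and, for a single target, y with shape (n_samples,).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3]])  # 3 samples, 1 feature -> shape (3, 1)
y = np.array([3, 2, 4])        # 3 targets            -> shape (3,)

print(X.shape, y.shape)

model = LinearRegression().fit(X, y)
print(model.coef_.shape)       # one coefficient per feature

# A flat list of feature values has to be turned into a single column first
flat = np.array([1, 2, 3])
X_from_flat = flat.reshape(-1, 1)  # -1 lets numpy infer the number of rows
print(X_from_flat.shape)
```

With a second feature, X would become shape (3, 2) and `coef_` would hold two values, while y would stay shape (3,).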

How do I remove the corresponding index in the label column?

I have a dataset of news articles.
After the cleaning stage, I noticed that 3 articles had become empty, so I deleted them.
Now I want to delete the corresponding indices from the label column (a pandas Series) so that the two Series (cleaned and label) have the same length. Otherwise the train/test split for classification fails with an error like this:
ValueError: Found input variables with inconsistent numbers of samples: [997, 1000]
You can use reset_index, which should solve the problem.
df.reset_index(inplace=True)
In general, first perform feature engineering (if necessary) and clean your data, and only then split it into X_train (dataframe with features) and y_train (target feature).
That way you avoid such issues.
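If the text and the labels already live in two separate Series, one way to bring them back in line is to select the labels by the index of the rows that survived cleaning. This is a sketch under the assumption that both Series still share the original index (i.e. neither has been reset yet); the sample data and the names `news`, `labels`, `cleaned`, and `labels_aligned` are made up for illustration.

```python
import pandas as pd

news = pd.Series(["good story", "   ", "breaking news", "", "weather report"])
labels = pd.Series([1, 0, 1, 0, 1])

# Cleaning step: strip whitespace and drop rows that end up empty.
# Boolean filtering keeps the original index values (0, 2, 4 here).
cleaned = news.str.strip()
cleaned = cleaned[cleaned != ""]

# Keep only the labels whose index is still present in the cleaned text
labels_aligned = labels.loc[cleaned.index]

# Optional: renumber both from 0 once they are aligned
cleaned = cleaned.reset_index(drop=True)
labels_aligned = labels_aligned.reset_index(drop=True)

print(len(cleaned), len(labels_aligned))
```

Keeping text and label as two columns of one DataFrame and dropping rows there avoids the mismatch in the first place, since each row carries its own label.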

MLflow: is it possible to log the confusion matrix at every step?

Is it possible to log the confusion matrix with MLflow at every step, like a simple metric?
If so, can it have a visualization like this?
For every run, you could log the individual confusion matrix values:
import mlflow
from sklearn.metrics import confusion_matrix
# get confusion matrix values
# sklearn layout for binary labels [0, 1]: rows = true class, columns = predicted class
conf_matrix = confusion_matrix(y_test,y_pred)
true_negative = conf_matrix[0][0]
false_positive = conf_matrix[0][1]
false_negative = conf_matrix[1][0]
true_positive = conf_matrix[1][1]
mlflow.log_metric("true_positive", true_positive)
mlflow.log_metric("true_negative", true_negative)
mlflow.log_metric("false_positive", false_positive)
mlflow.log_metric("false_negative", false_negative)
And then log_artifact(<your_plot>, "confusion_matrix")
Since you already have the confusion matrix as a visualization, you can save it to a file and log that file with the mlflow.log_artifact() call.
The official documentation also has an example. It uses a txt file, but serializing the visualization to a file instead should be no problem.
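For the "every step" part of the question, the metric-logging calls accept a `step` argument, so the four counts can be tracked over time like any other metric. Below is a minimal sketch for the binary case; the helper names `confusion_counts` and `log_confusion_counts` are hypothetical, and it assumes the labels are 0 and 1 and that an MLflow run is already active when logging.

```python
from sklearn.metrics import confusion_matrix


def confusion_counts(y_true, y_pred):
    """Return the four binary confusion-matrix counts as plain ints."""
    # For labels [0, 1], ravel() yields the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "true_negative": int(tn),
        "false_positive": int(fp),
        "false_negative": int(fn),
        "true_positive": int(tp),
    }


def log_confusion_counts(y_true, y_pred, step):
    """Log the counts to the active MLflow run, tagged with a step number."""
    # Imported here so confusion_counts can be used without MLflow installed
    import mlflow

    mlflow.log_metrics(confusion_counts(y_true, y_pred), step=step)
```

Calling `log_confusion_counts(y_val, y_pred, step)` once per epoch inside a `with mlflow.start_run():` block should give one curve per count in the MLflow UI. The plot itself is still an artifact, so it would be saved and logged separately as described above.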