getting error ReconcileDynamicAxis - cntk

I am using WeightedLogistic in BrainScript (BS), or weighted_binary_cross_entropy in Python.
After defining the weights as a simple array of two floats, I get the following error:
FrameRange's dynamic axis is inconsistent with matrix. They are compatible though--are you missing a ReconcileDynamicAxis operation?

Pandas to Koalas (Databricks) conversion code for big scoring dataset

I have been encountering OOM errors while getting to score a huge dataset. The dataset shape is (15million,230). Since the working environment is Databricks, I decided to update the scoring code to Koalas and take advantage of the Spark architecture to alleviate my memory issues.
However, I've run into some issues converting part of my code from pandas to Koalas. Any help with working around this issue is much appreciated.
Currently, I'm trying to add a few adjusted columns to my dataframe but I'm getting a PandasNotImplementedError : The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Code/Problem area :
df[new_sixmon_cols] = df[sixmon_cols].div([min(6,i) for i in df['mob']],axis=0)
df[new_twelvemon_cols] = df[twelvemon_cols].div([min(12,i) for i in df['mob']],axis=0)
df[new_eighteenmon_cols] = df[eighteenmon_cols].div([min(18,i) for i in df['mob']],axis=0)
df[new_twentyfourmon_cols] = df[twentyfourmon_cols].div([min(24,i) for i in df['mob']],axis=0)
print('The shape of df after add adjusted columns for all non indicator columns is:')
print(df.shape)
I believe the problem is the list comprehension passed to div(), e.g. [min(6,i) for i in df['mob']], but I'm not certain how to convert this piece of code efficiently or, more generally, how to score a big dataset on Databricks or in a cloud environment.
Some pointers about the data/model:
The data is feature reduced and selected of course.
I built the model with 2.5m records and now I'm trying to work on scoring files.
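The list comprehension in the code above iterates over df['mob'] row by row, which is what triggers the __iter__ error. A minimal sketch of a vectorized alternative follows, written in plain pandas so it can be run anywhere: Series.clip(upper=6) yields the same values as min(6, i) for each row. The column names and data are made up for illustration. The same per-column division is expected to carry over to Koalas because all columns come from one DataFrame, but that is an assumption and has not been checked against a specific Koalas version.

```python
import pandas as pd

# Hypothetical sample data; "mob" stands in for the column from the question.
df = pd.DataFrame({
    "mob": [3, 6, 10, 24],
    "spend_6m": [30.0, 60.0, 120.0, 240.0],
    "visits_6m": [3.0, 12.0, 6.0, 18.0],
})
sixmon_cols = ["spend_6m", "visits_6m"]

# Equivalent of [min(6, i) for i in df["mob"]] without iterating over rows.
divisor = df["mob"].clip(upper=6)

# Divide each source column by the capped value and store it in a new column.
for col in sixmon_cols:
    df["adj_" + col] = df[col] / divisor

print(df)
```

The twelve-, eighteen- and twenty-four-month groups would follow the same pattern with a different upper bound passed to clip().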

Does the sklearn.ensemble.GradientBoostingRegressor support sparse input samples?

I’m using sklearn.ensemble.GradientBoostingRegressor on data that is sometimes lacking some values. I can’t easily impute these data because they have a great variance and the estimate is very sensitive to them. They are also almost never 0.
The documentation of the fit method says about the first parameter X:
The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
This has led me to think that the GradientBoostingRegressor can work with sparse input data.
But internally it calls check_array with implicit force_all_finite=True (the default), so that I get the following error if I put in a csr_matrix with NaN values:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
Does the GradientBoostingRegressor not actually support sparse data?
Update:
I’m lucky in that I don’t have any meaningful zeros. My calling code now looks like this:
predictors['foobar'] = predictors['foobar'].fillna(0)  # for columns that contain NaNs
predictor_matrix = scipy.sparse.csr_matrix(
    predictors.values.astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent
)
predictor_matrix.eliminate_zeros()
model.fit(predictor_matrix, regressands)
This avoids the exception above. Unfortunately there is no eliminate_nans() method. (When I print a sparse matrix with NaNs, it lists them explicitly, so sparseness must be something other than containing NaNs.)
But the prediction performance hasn’t (noticeably) changed.
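To illustrate the observation about sparseness: a SciPy sparse matrix leaves out only zeros, so a NaN is stored like any other value. The sketch below, using made-up data, shows that NaNs count toward nnz and that the effect of a hypothetical eliminate_nans() can be had by zeroing the NaN entries in the data array and then calling eliminate_zeros(). Note that this treats missing values as zeros, just like the fillna(0) approach above.

```python
import numpy as np
import scipy.sparse

# Made-up dense data with both true zeros and NaNs.
dense = np.array([[0.0, np.nan, 2.0],
                  [0.0, 0.0, np.nan]])

m = scipy.sparse.csr_matrix(dense)
nnz_before = m.nnz  # NaNs are stored explicitly, zeros are not

# Replace stored NaNs by zeros, then drop all explicitly stored zeros.
m.data[np.isnan(m.data)] = 0.0
m.eliminate_zeros()
nnz_after = m.nnz

print(nnz_before, nnz_after)
```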
Perhaps you could try using LightGBM. Here is a discussion in Kaggle about how it handles missing values:
https://www.kaggle.com/c/home-credit-default-risk/discussion/57918
Good luck

Fractional power of batch matrices using numpy, scipy or torch

Is there a way to calculate fractional power of a batch/3d array (e.g. shape : 4*3*3) using any scientific computing libraries?
I've come across scipy.linalg fractional_matrix_power, but it doesn't seem to work for batch matrices. Currently, I'm using list comprehension to iterate over the batch, but it doesn't seem very efficient.
Is there any workaround or libraries to parallelize the task?
D_nsqrt = fractional_matrix_power(D, -0.5)
Code above throws error : ValueError: expected square array_like input.
But following works fine:
D_nsqrt = fractional_matrix_power(D[0], -0.5)
Shape of D : 4*3*3
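One possible workaround, assuming every matrix in the batch is symmetric positive definite (this is an assumption, the question does not say so): np.linalg.eigh accepts stacked matrices, so the fractional power can be computed for the whole batch as V diag(w^p) V^T without a Python loop. The helper name batched_fractional_power and the test data are made up for this sketch. For general non-symmetric matrices this does not apply.

```python
import numpy as np

def batched_fractional_power(mats, power):
    """Fractional power of a stack of symmetric positive-definite matrices.

    mats has shape (..., n, n); the result has the same shape.
    """
    # eigh works on the last two axes, so the whole batch is handled at once.
    w, v = np.linalg.eigh(mats)  # w: (..., n), v: (..., n, n)
    # Scale the columns of v by w**power, then multiply by v transposed.
    return (v * (w ** power)[..., None, :]) @ np.swapaxes(v, -1, -2)

# Made-up batch of 4 symmetric positive-definite 3x3 matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3, 3))
D = A @ np.swapaxes(A, -1, -2) + 3.0 * np.eye(3)

D_nsqrt = batched_fractional_power(D, -0.5)
print(D_nsqrt.shape)
```

A quick sanity check is that D_nsqrt @ D_nsqrt @ D is close to the identity for every matrix in the batch.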

H2OTwoDimTable seems to be missing functionality

I discovered that I can get a collection of eigenvectors from glrm_model (H2O Generalized Low Rank Model Estimator, glrm; sorry, I can't put this in the tags) this way:
EV = glrm_model._model_json["output"]['eigenvectors']
However the type of EV is H2OTwoDimTable which is not very capable.
If I try to do (where M is an H2O Data Frame):
M.mult(EV)
I get the error
AttributeError: 'H2OTwoDimTable' object has no attribute 'nrows'
If I try to convert EV to a numpy matrix:
EV.as_matrix()
I get the error:
AttributeError: 'H2OTwoDimTable' object has no attribute 'as_matrix'
I can convert EV to a pandas DataFrame and then convert that to a numpy matrix, which is an extra step, and then do the matrix multiplication.
IMHO, it would be better if the eigenvector reference return an H2O Data Frame.
Also, it would be good if H2OTwoDimTable could better support matrix multiplication either as a left or right operand.
And EV.as_data_frame() has no use_pandas=False option.
Here's the python code which could be modified to better support matrix type things:
https://github.com/h2oai/h2o-3/blob/master/h2o-py/h2o/two_dim_table.py
The "TwoDimTable" class is used to store lightweight tabular data in a model. I am in agreement with you about using H2OFrames instead of TwoDimTables, but it's a design choice that was made a long time ago (can't change it now).
Since H2OFrames can contain non-numeric data, there is an .as_data_frame() method to convert from an H2OFrame or TwoDimTable to a pandas DataFrame. So you can chain .as_data_frame().as_matrix() together to get a matrix (numpy.ndarray) if that's what you're looking for (on pandas 1.0 and later, where as_matrix() was removed, use .to_numpy() instead). Here's an example:
import h2o
from h2o.estimators.glrm import H2OGeneralizedLowRankEstimator
h2o.init()
data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glrm_test/cancar.csv")
# Train a GLRM model with recover_svd=True to keep eigenvectors
glrm = H2OGeneralizedLowRankEstimator(k=4,
                                      transform="NONE",
                                      loss="Quadratic",
                                      regularization_x="None",
                                      regularization_y="None",
                                      max_iterations=1000,
                                      recover_svd=True)
glrm.train(x=data.names, training_frame=data)
# Get eigenvector TwoDimTable from the model
EV = glrm._model_json["output"]['eigenvectors']
# Convert to various formats
evdf = EV.as_data_frame()  # pandas.core.frame.DataFrame
evmat = evdf.to_numpy()    # numpy.ndarray (as_matrix() was removed in pandas 1.0)
# or directly
evmat = EV.as_data_frame().to_numpy()
If you're interested in adding a .as_matrix() method to the TwoDimTable class, you could submit a pull request on the h2o-3 repo for that. I think that would be a useful extension. There's more info about how to contribute to H2O in our contributing guide.

Tensorflow error using while_loop: "List of Tensors when single Tensor expected"

I'm getting a TypeError("List of Tensors when single Tensor expected") when I run a Tensorflow while_loop. The error is from the third parameter, which should be a list of Tensors, according to the documentation. x, W, Win, Y, temp, and Wout are all previously declared as floats and arrays of floats. cond2 and test2 are functions I've written to be the condition and body. I use an almost identical call earlier in the program with no issues.
t=0
t,x,W,Win,Y,temp,Wout = sess.run(tf.while_loop(cond2, test2,
    [t, tf.Variable(x), tf.constant(W),
     tf.constant(Win), tf.Variable(Y),
     tf.Variable(temp), tf.constant(Wout)],
    shape_invariants=[tf.TensorShape(None),
                      tf.TensorShape(None),
                      tf.TensorShape(None),
                      tf.TensorShape(None),
                      tf.TensorShape(None),
                      tf.TensorShape(None),
                      tf.TensorShape(None)]))
I fixed the error by removing the tf.constant() for Wout, since Wout was already declared as a tensor.
This would be easier to diagnose with (a) your definitions for condition and body, and (b) the full error output from TensorFlow (it usually also outputs a full dump of the input tensors when issuing these errors.)
With that said, the source of the problem seems to be that TensorFlow is viewing your loop_vars list as a single Tensor, and/or your cond2 and test2 functions only accept a single argument each. If neither of these is true, then providing more detail would help answer the question (specifically the full error message and the definition of every value/tensor/function you're passing to tf.while_loop). I've found that the majority of while_loop errors can be fixed by paying attention to the tensors in the error output.
The while_loop can throw pretty confusing errors at times so I'd like to help; I'll check back and update/edit my answer if more info is provided.
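For reference, here is a minimal sketch of the calling convention described above, assuming TensorFlow 2.x with eager execution (the question uses the 1.x session API, so details will differ). The names cond, body, i and total are made up. The point is that the condition and the body each take one argument per entry in loop_vars, and the body returns the same number of values in the same order.

```python
import tensorflow as tf

# One argument per loop variable, in the same order as loop_vars.
def cond(i, total):
    return i < 5

# The body returns the updated loop variables, again in the same order.
def body(i, total):
    return i + 1, total + tf.cast(i, tf.float32)

i_final, total_final = tf.while_loop(
    cond, body, loop_vars=[tf.constant(0), tf.constant(0.0)])

print(i_final.numpy(), total_final.numpy())
```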