label encoding in dask_cudf dataframe - xgboost

I am trying to use dask_cudf to preprocess a very large dataset (150,000,000+ records) for multi-class xgboost training and am having trouble encoding the class column (dtype is string). I tried using the 'replace' function, but the error message said the two dtypes must match. I tried using dask_ml.LabelEncoder, but it said string arrays aren't supported in cudf. I tried using compute() in various ways, but I kept running into out-of-memory errors (I'm assuming because operations on a cudf dataframe require a smaller dataset). I also tried pulling the class column out, encoding it, and then merging it back with the dataframe, but the partitions do not line up. I tried manually lining them up, but dask_cudf seemingly does not support repartitioning using the 'divisions' parameter (I got an error saying something like 'old and new partitions do not match'). Any help on how to do this would be much appreciated.

Strings aren't supported in xgboost. Not having seen your data, here are a few quick and dirty ways I've modified string columns for training, as generally the strings themselves may not matter:
If the strings were actually numeric (like dates), converting them to int (int8, int16, int32).
Hashmapping the strings and then running xgboost (basically creating a reversible conversion between string and integer, as long as you don't change the integer), then training on the column, now hashed as an integer.
If the strings are classes, manually assigning class numbers (0, 1, 2, ..., n) in a new column and training on that one.
There are definitely other, better ways. As for the second part of your question, I left a comment.
Now, your XGBoost model and your dask-cudf dataframe per-GPU allocation must fit on a single GPU, or you will get memory errors. If your model will be considering a large amount of data, please train on the cluster with the largest GPU memory you can. A100s can have 40GB or 80GB. Some older compute GPUs, such as the V100 and GV100, have 32GB. The A6000 and RTX8000 have 48GB. Then it goes to 24, 16, and lower from there. Please size your GPUs accordingly.
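The manual class-numbering idea above can be sketched roughly as follows. This is a minimal illustration in plain pandas on toy data, not something tested on dask_cudf; the column names "label" and "label_id" are made up for the example, and whether `map` with a dict behaves the same on a cudf string column is an assumption you would need to check.

```python
import pandas as pd

# Hypothetical toy data; the column name "label" is an assumption.
df = pd.DataFrame({"label": ["cat", "dog", "bird", "dog", "cat"]})

# Build a fixed, reversible string -> integer mapping from the sorted unique classes.
classes = sorted(df["label"].unique())
class_to_id = {name: i for i, name in enumerate(classes)}
id_to_class = {i: name for name, i in class_to_id.items()}

# Encode the classes into a new integer column to train on.
df["label_id"] = df["label"].map(class_to_id).astype("int32")

# Decode integer ids (for example, model predictions) back to the original strings.
decoded = df["label_id"].map(id_to_class)
```

Keeping `class_to_id` around (and never changing it) is what makes the conversion reversible after training.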

Related

Pandas to Koalas (Databricks) conversion code for big scoring dataset

I have been encountering OOM errors while getting to score a huge dataset. The dataset shape is (15million,230). Since the working environment is Databricks, I decided to update the scoring code to Koalas and take advantage of the Spark architecture to alleviate my memory issues.
However, I've run into some issues trying to convert part of my code from pandas to Koalas. Any help on how to work around this issue is much appreciated.
Currently, I'm trying to add a few adjusted columns to my dataframe but I'm getting a PandasNotImplementedError : The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Code/Problem area :
df[new_sixmon_cols] = df[sixmon_cols].div([min(6,i) for i in df['mob']],axis=0)
df[new_twelvemon_cols] = df[twelvemon_cols].div([min(12,i) for i in df['mob']],axis=0)
df[new_eighteenmon_cols] = df[eighteenmon_cols].div([min(18,i) for i in df['mob']],axis=0)
df[new_twentyfourmon_cols] = df[twentyfourmon_cols].div([min(24,i) for i in df['mob']],axis=0)
print('The shape of df after add adjusted columns for all non indicator columns is:')
print(df.shape)
I believe the problem area is div([min(6,i)] but I'm not certain how to go about converting this particular piece of code efficiently or in general how to handle scoring a big dataset leveraging Databricks or the cloud environment.
Some pointers about the data/model:
The data is feature reduced and selected of course.
I built the model with 2.5m records and now I'm trying to work on scoring files.
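For reference, the list comprehension in the snippet above is what iterates over the Series and triggers `__iter__`. One way to express the same `min(6, mob)` divisor without iterating is `clip(upper=6)`. The sketch below is plain pandas on toy data with made-up column names; whether it ports unchanged to Koalas is an assumption, which is why it divides column by column rather than relying on `DataFrame.div` with a Series.

```python
import pandas as pd

# Hypothetical toy data; "spend_6m" and "spend_6m_adj" are made-up column names.
df = pd.DataFrame({"mob": [3, 6, 10], "spend_6m": [6.0, 12.0, 30.0]})
sixmon_cols = ["spend_6m"]
new_sixmon_cols = ["spend_6m_adj"]

# Elementwise min(6, mob) without a Python-level loop over the rows.
divisor = df["mob"].clip(upper=6)

# Divide each source column by the capped divisor and store it under the new name.
for old, new in zip(sixmon_cols, new_sixmon_cols):
    df[new] = df[old] / divisor
```

The same pattern would repeat with `clip(upper=12)`, `clip(upper=18)` and `clip(upper=24)` for the other column groups.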

Best way to convert TensorProto to TensorFlow tensor

As far as I can tell, there are at least two different ways to recover a Tensor from a TensorProto in Tensorflow 2.3. Say, for the sake of example, that we have
tensor = tf.range(10)
tproto = tf.make_tensor_proto(tensor)
Then:
You can use tf.make_ndarray like so
tf.constant(tf.make_ndarray(tproto))
Or you can use tf.io.parse_tensor like so
tf.io.parse_tensor(tproto.SerializeToString(), out_type=tf.int32)
I feel both of these are a bit artificial, since in the former you end up with an intermediate numpy array, and in the latter you have to serialize the TensorProto to a string and parse it back. Additionally, parse_tensor won't automatically recover the correct data type from the TensorProto. So:
Is there a function to do the conversion in a single step? I'd like to see something like tf.from_tensor_proto doing the conversion all at once optimizing for speed and memory allocation (or, if tf.constant(tf.make_ndarray(tproto)) is the best you can do, just wrapping this up).
Otherwise, which of the two options above should be preferred (in terms of efficiency, memory usage, etc.)?

Does the sklearn.ensemble.GradientBoostingRegressor support sparse input samples?

I’m using sklearn.ensemble.GradientBoostingRegressor on data that is sometimes lacking some values. I can’t easily impute these data because they have a great variance and the estimate is very sensitive to them. They are also almost never 0.
The documentation of the fit method says about the first parameter X:
The input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csr_matrix.
This has led me to think that the GradientBoostingRegressor can work with sparse input data.
But internally it calls check_array with implicit force_all_finite=True (the default), so that I get the following error if I put in a csr_matrix with NaN values:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
Does the GradientBoostingRegressor not actually support sparse data?
Update:
I’m lucky in that I don’t have any meaningful zeros. My calling code now looks like this:
predictors['foobar'] = predictors['foobar'].fillna(0) # for columns that contain NaNs
predictor_matrix = scipy.sparse.csr_matrix(
    predictors.values.astype(np.float64)  # the np.float alias was removed in NumPy 1.24
)
predictor_matrix.eliminate_zeros()
model.fit(predictor_matrix, regressands)
This avoids the exception above. Unfortunately there is no eliminate_nans() method. (When I print a sparse matrix with NaNs, it lists them explicitly, so sparseness must be something other than containing NaNs.)
But the prediction performance hasn’t (noticeably) changed.
Perhaps you could try using LightGBM. Here is a discussion in Kaggle about how it handles missing values:
https://www.kaggle.com/c/home-credit-default-risk/discussion/57918
Good luck
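On the observation in the question's update that sparseness is something other than containing NaNs: a sparse matrix only leaves out entries that are exactly zero, so a NaN is stored like any other value. A small sketch on a made-up array (the variable names are only for illustration):

```python
import numpy as np
from scipy import sparse

# Toy array with three zeros, two ordinary values and one NaN.
dense = np.array([[0.0, 1.0, np.nan],
                  [0.0, 0.0, 2.0]])

matrix = sparse.csr_matrix(dense)

# Number of explicitly stored entries before and after dropping stored zeros.
stored_before = matrix.nnz
matrix.eliminate_zeros()
stored_after = matrix.nnz

# NaNs are still among the stored values.
nan_count = int(np.isnan(matrix.data).sum())
```

So passing a sparse matrix does not hide missing values from the estimator's finiteness check; it only changes how zeros are stored.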

Caching a dataset with examples of varied length

My dataset consists of audio segments between 5 and 180 seconds long. The number of examples is small enough to allow caching it in memory, instead of reading from the disk over and over. Storing the data in a constant tensor / variable and using tf.train.slice_input_producer will allow me to cache the dataset in memory, but it requires storing all the data in one matrix. Since some examples are much longer than others, this matrix might be unnecessarily large and perhaps too large for the RAM.
I can simply have a list of numpy arrays for my data, and do the whole input reading-randomizing-preprocessing in a non-tensorflow way with a feed_dict, but I wonder if there is a way to do it without completely giving up on tensorflow for the input reading-randomizing-preprocessing part.
Thanks!
The more recent tf.data library provides a tf.data.Dataset.cache method to cache an entire dataset into memory or into a file.
For instance:
dataset = ...
dataset = dataset.map(preprocessing_fn) # apply preprocessing
dataset = dataset.cache() # cache entire dataset in memory after preprocessing
I've provided more details on how to use cache() in this answer.

Numpy/Scipy pinv and pinv2 behave differently

I am working with bidimensional arrays on Numpy for Extreme Learning Machines. One of my arrays, H, is random, and I want to compute its pseudoinverse.
If I use scipy.linalg.pinv2 everything runs smoothly. However, if I use scipy.linalg.pinv, sometimes (30-40% of the time) problems arise.
The reason why I am using pinv2 is because I read (here: http://vene.ro/blog/inverses-pseudoinverses-numerical-issues-speed-symmetry.html ) that pinv2 performs better on "tall" and on "wide" arrays.
The problem is that, if H has a column j of all 1, pinv(H) has huge coefficients at row j.
This is in turn a problem because, in such cases, np.dot(pinv(H), Y) contains some nan values (Y is an array of small integers).
Now, I am not into linear algebra and numeric computation enough to understand if this is a bug or some precision related property of the two functions. I would like you to answer this question so that, if it's the case, I can file a bug report (honestly, at the moment I would not even know what to write).
I saved the arrays with np.savetxt(fn, a, '%.2e', ';'): please, see https://dl.dropboxusercontent.com/u/48242012/example.tar.gz to find them.
Any help is appreciated. In the provided file, you can see in pinv(H).csv that rows 14, 33, 55, 56 and 99 have huge values, while in pinv2(H) the same rows have more decent values.
Your help is appreciated.
In short, the two functions implement two different ways to calculate the pseudoinverse matrix:
scipy.linalg.pinv uses least squares, which may be quite compute intensive and take up a lot of memory.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html#scipy.linalg.pinv
scipy.linalg.pinv2 uses SVD (singular value decomposition), which should run with a smaller memory footprint in most cases.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv2.html#scipy.linalg.pinv2
numpy.linalg.pinv also implements this method.
As these are two different evaluation methods, the resulting matrices will not be the same. Each method has its own advantages and disadvantages, and it is not always easy to determine which one should be used without deeply understanding the data and what the pseudoinverse will be used for. I'd simply suggest some trial-and-error and use the one which gives you the best results for your classifier.
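To make the SVD route concrete, here is a minimal sketch of a pseudoinverse computed from the singular value decomposition, with small singular values cut off. It is an illustration of the idea, not the actual SciPy or NumPy implementation; the function name `pinv_svd` and the `rcond` default are assumptions chosen for the example.

```python
import numpy as np

def pinv_svd(a, rcond=1e-15):
    """Pseudoinverse of a real 2-D array via SVD (illustrative sketch)."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Treat singular values below rcond * largest as zero to avoid huge coefficients.
    cutoff = rcond * s.max()
    s_inv = np.zeros_like(s)
    mask = s > cutoff
    s_inv[mask] = 1.0 / s[mask]
    # pinv(A) = V * diag(1/s) * U^T
    return (vt.T * s_inv) @ u.T

# Toy "tall" matrix standing in for H.
rng = np.random.default_rng(0)
h = rng.random((5, 3))
p = pinv_svd(h)
```

The cutoff is the part that matters for ill-conditioned inputs: inverting a near-zero singular value is what produces very large entries in the result.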
Note that in some cases these functions cannot converge to a solution, and will then raise a LinAlgError (numpy.linalg.LinAlgError). In that case you may try the other pinv implementation, which can greatly reduce the number of errors you receive.
Starting from SciPy 1.7.0, pinv2 is deprecated and scipy.linalg.pinv itself uses an SVD-based solution.
DeprecationWarning: scipy.linalg.pinv2 is deprecated since SciPy 1.7.0, use scipy.linalg.pinv instead
That means numpy.linalg.pinv, scipy.linalg.pinv and scipy.linalg.pinv2 now all compute equivalent solutions. They are also roughly equally fast, with the scipy versions being slightly faster.
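import numpy as np
import scipy.linalg  # import the submodule explicitly; a bare "import scipy" may not load it
arr = np.random.rand(1000, 2000)
res1 = np.linalg.pinv(arr)
res2 = scipy.linalg.pinv(arr)
# pinv2 is deprecated since SciPy 1.7.0 and is no longer available in recent releases
res3 = scipy.linalg.pinv2(arr)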
np.testing.assert_array_almost_equal(res1, res2, decimal=10)
np.testing.assert_array_almost_equal(res1, res3, decimal=10)