Is it possible to train an xgboost model in Python and run it in C/C++? - xgboost

How much cross compatibility is there between the different language APIs?
For example, is it possible to train and save a model in Python and run it in C/C++ or any other language?
I would try this myself, but my skills in non-Python languages are very limited.

You can dump the model into a text file like this:
model.get_booster().dump_model('xgb_model.txt')
Then you should parse the text dump and reproduce the prediction function in C++.
I have implemented this in a little library that I call FastForest, if you want to save some time and want to make sure you use a fast implementation:
https://github.com/guitargeek/XGBoost-FastForest
The mission of the library is to be:
Easy: deploying your xgboost model should be as painless as it can be
Fast: thanks to efficient structure-of-array data structures for storing the trees, this library goes very easy on your CPU and memory (it is about 3 to 5 times faster than xgboost in prediction)
Safe: the FastForest objects are immutable, and therefore they are an excellent choice in multithreading environments
Portable: FastForest has no dependency other than the C++ standard library
Here is a little usage example, loading the model you have dumped before and assuming the model requires 5 features:
std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
FastForest fastForest("xgb_model.txt", features);
std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};
float output = fastForest(input.data());
When you create the FastForest you have to tell it in which order you intend to pass the features, because the text file does not store the order of the features.
Also note that the FastForest does not do the logistic transformation for you, so in order to reproduce predict_proba() you need to apply the logistic transformation:
float proba = 1./(1. + std::exp(-output));

The treelite package (research paper, documentation) enables compilation of tree-based models, including XGBoost, to optimized C code, making inference much faster than with native model libraries.

You could consider dumping your model in a text file using
model.get_booster().dump_model('xgb_model.txt', with_stats=True)
then, after some parsing, you can easily reproduce the .predict() function in C/C++. Beyond that, I am not aware of any native port of xgboost to C.
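The "parse the text dump and reproduce the prediction function" step from the answers above can be sketched in a few lines before porting it to C/C++. This is only a minimal illustration, assuming the usual text dump layout (a booster[i]: header followed by split and leaf lines), feature names without spaces, a binary:logistic objective, and a base score that contributes no offset to the margin. The sample dump and all function names are made up for the example. xgboost also compares in 32-bit floats, so values exactly on a threshold may behave differently here.

```python
import math
import re

# Split line, e.g. "0:[f0<0.5] yes=1,no=2,missing=1" (optional stats after it are ignored)
SPLIT_RE = re.compile(r"(\d+):\[([^<\]]+)<([^\]]+)\] yes=(\d+),no=(\d+),missing=(\d+)")
# Leaf line, e.g. "1:leaf=-0.4" (optionally followed by ",cover=...")
LEAF_RE = re.compile(r"(\d+):leaf=([^,\s]+)")

# Hypothetical two-tree dump, written by hand for this example.
SAMPLE_DUMP = """
booster[0]:
0:[f0<0.5] yes=1,no=2,missing=1
    1:leaf=-0.4
    2:[f1<1.5] yes=3,no=4,missing=3
        3:leaf=0.1
        4:leaf=0.6
booster[1]:
0:[f1<1.0] yes=1,no=2,missing=2
    1:leaf=-0.2
    2:leaf=0.3
"""


def parse_dump(text):
    """Turn the text dump into a list of trees; each tree maps node id -> node tuple."""
    trees = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("booster["):
            trees.append({})
            continue
        split = SPLIT_RE.match(line)
        if split:
            node_id, feature, threshold, yes, no, missing = split.groups()
            trees[-1][int(node_id)] = (
                "split", feature, float(threshold), int(yes), int(no), int(missing))
            continue
        leaf = LEAF_RE.match(line)
        if leaf:
            trees[-1][int(leaf.group(1))] = ("leaf", float(leaf.group(2)))
    return trees


def predict_margin(trees, features):
    """Sum the leaf values reached in every tree (the raw, untransformed score)."""
    total = 0.0
    for tree in trees:
        node = tree[0]
        while node[0] == "split":
            _, feature, threshold, yes, no, missing = node
            value = features.get(feature)
            if value is None or math.isnan(value):
                node = tree[missing]  # follow the branch stored for missing values
            else:
                node = tree[yes if value < threshold else no]
        total += node[1]
    return total


def predict_proba(trees, features):
    """Apply the logistic transformation to the raw score."""
    return 1.0 / (1.0 + math.exp(-predict_margin(trees, features)))


trees = parse_dump(SAMPLE_DUMP)
print(predict_margin(trees, {"f0": 0.7, "f1": 2.0}))
print(predict_proba(trees, {"f0": 0.7, "f1": 2.0}))
```

The same structure (a table of nodes per tree, a loop that walks from the root to a leaf, a sum over trees, then the logistic function) carries over directly to a C or C++ implementation.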

Related

Convert a .npy file to wav following tacotron2 training

I am training the Tacotron2 model using TensorflowTTS for a new language.
I managed to train the model (performed pre-processing, normalization, and decoded the few generated output files).
The files in the output directory are .npy files, which makes sense as they are mel-spectrograms.
I am trying to find a way to convert said files to a .wav file in order to check if my work has been fruitful.
I used this:
melspectrogram = librosa.feature.melspectrogram(
"/content/prediction/tacotron2-0/paol_wavpaol_8-norm-feats.npy", sr=22050,
window=scipy.signal.hanning, n_fft=1024, hop_length=256)
print('melspectrogram.shape', melspectrogram.shape)
print(melspectrogram)
audio_signal = librosa.feature.inverse.mel_to_audio(
melspectrogram, sr22050, n_fft=1024, hop_length=256, window=scipy.signal.hanning)
print(audio_signal, audio_signal.shape)
sf.write('test.wav', audio_signal, sample_rate)
But it gives me this error: Audio data must be of type numpy.ndarray.
This happens even though I believe I am already giving it a numpy.ndarray.
Does anyone know where the issue might be, and if anyone knows a better way to do it?
I'm not sure what your error is, but the output of a Tacotron 2 system is a set of log Mel spectral features, and you can't just apply the inverse Fourier transform to get a waveform, because the phase information is missing and because the features are not invertible. You can learn about why this is at places like Speech.Zone (https://speech.zone/courses/).
Instead of using librosa like you are doing, you need to use a vocoder like HiFiGan (https://github.com/jik876/hifi-gan) that is trained to reconstruct a waveform from log Mel spectral features. You can use a pre-trained model, and most off-the-shelf vocoders will do, but make sure that the sample rate, Mel range, FFT size, hop size and window size are all the same between your Tacotron2 feature prediction network and whatever vocoder you choose; otherwise you'll just get noise!
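One way to guard against the mismatch described above is to compare the feature settings of the two models before synthesizing anything. The sketch below is only an illustration: the key names and the numeric values are hypothetical, and real configuration files (for TensorFlowTTS or for a given vocoder) may name the same quantities differently.

```python
# Hypothetical setting names; map them to whatever your two configs actually use.
FEATURE_KEYS = ("sampling_rate", "n_fft", "hop_size", "win_length",
                "num_mels", "fmin", "fmax")


def find_mismatches(acoustic_cfg, vocoder_cfg, keys=FEATURE_KEYS):
    """Return {key: (acoustic_value, vocoder_value)} for every setting that differs."""
    mismatches = {}
    for key in keys:
        acoustic_value = acoustic_cfg.get(key)
        vocoder_value = vocoder_cfg.get(key)
        if acoustic_value != vocoder_value:
            mismatches[key] = (acoustic_value, vocoder_value)
    return mismatches


# Illustrative values only, not taken from any real model.
tacotron_cfg = {"sampling_rate": 22050, "n_fft": 1024, "hop_size": 256,
                "win_length": 1024, "num_mels": 80, "fmin": 80, "fmax": 7600}
vocoder_cfg = dict(tacotron_cfg, fmax=8000)

print(find_mismatches(tacotron_cfg, vocoder_cfg))
```

An empty result means the listed settings agree. Any entry in the result is a setting to fix (or a reason to pick a different vocoder) before expecting intelligible audio.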

How to optimize SpaCy pipe for NER only (using an existing model, no training)

I am looking to use SpaCy v3 to extract named entities from a large list of sentences. What I have works, but it seems slower than it should be, and before investing in more machines, I'd like to know if I am doing more work than I need to in the pipe.
I've used nltk to parse everything into sentences as an iterator, then process these using "pipe" to get the named entities. All of this appears to work well, and Python appears to be hitting every CPU core on my machine fairly heavily, which is good.
import spacy
nlp = spacy.load("en_core_web_trf")
# lines must yield (text, context) tuples because as_tuples=True
for (doc, context) in nlp.pipe(lines, as_tuples=True, batch_size=1000):
    for ent in doc.ents:
        pass  # handle each entity
I understand that I can use nlp.disable_pipes to disable certain elements. Is there anything I can disable that won't impact accuracy and that isn't required for NER?
For NER only with the transformer model en_core_web_trf, you can disable ["tagger", "parser", "attribute_ruler", "lemmatizer"].
If you want to use a non-transformer model like en_core_web_lg (much faster but slightly lower accuracy), you can disable ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"] and use nlp.pipe(n_process=-1) for multiprocessing on all CPUs (or n_process=N to restrict to N CPUs).
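Put together, the suggestion above could look roughly like the following sketch. The component lists are the ones given in the answer. The dictionary and function names are made up for the example, and spaCy is imported inside the function so the snippet can be read and loaded without the models installed.

```python
# Components the answer says can be disabled when only NER is needed.
NER_ONLY_DISABLE = {
    "en_core_web_trf": ["tagger", "parser", "attribute_ruler", "lemmatizer"],
    "en_core_web_lg": ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
}


def extract_entities(lines, model_name="en_core_web_trf", batch_size=1000, n_process=1):
    """Yield (context, entity_text, entity_label) for (text, context) tuples in lines."""
    import spacy  # imported lazily; requires spaCy v3 and the chosen model

    nlp = spacy.load(model_name, disable=NER_ONLY_DISABLE[model_name])
    docs = nlp.pipe(lines, as_tuples=True, batch_size=batch_size, n_process=n_process)
    for doc, context in docs:
        for ent in doc.ents:
            yield context, ent.text, ent.label_
```

For the non-transformer model, calling extract_entities(lines, model_name="en_core_web_lg", n_process=-1) would correspond to the multiprocessing variant mentioned in the answer. It is worth timing both models on a sample of your data before deciding.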

What is the difference between TFRecordDataset and FixedLengthRecordDataset?

It will be great to get a use case possibly from a project and explain the use of each. Thanks in advance.
TFRecordDataset, FixedLengthRecordDataset and TextLineDataset are all subclasses of Dataset.
Dataset is a base class containing methods to create and transform datasets. It also allows you to initialize a dataset from data in memory, or from a Python generator.
Since release 1.4, Datasets are a new way to create input pipelines for TensorFlow models. This API is much more performant than using feed_dict or the queue-based pipelines, and it's cleaner and easier to use.
As a use case, you can think of the pre-processing of data to feed it into a model for training (Examples in the links below are pretty self-explanatory).
TFRecordDataset: Reads records from TFRecord files (Example 1, Example 2).
#Python
dataset = tf.data.TFRecordDataset("/path/to/file.tfrecord")
FixedLengthRecordDataset: Reads fixed size records from binary files (Example).
#Python
images = tf.data.FixedLengthRecordDataset(
images_file, 28 * 28, header_bytes=16).map(decode_image)
TextLineDataset: Reads lines from text files.
See this documentation (TextLineDataset example included)
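To make the difference concrete, here is a plain-Python sketch of the two file layouts, without TensorFlow. It is a conceptual illustration, not how tf.data is implemented. The function names are made up, and the length-prefixed reader is deliberately simplified: real TFRecord files also store checksums around each record, which are left out here.

```python
import io
import struct


def read_fixed_length_records(stream, record_bytes, header_bytes=0):
    """Fixed-length layout: skip a header, then cut the rest into equal-sized records."""
    stream.read(header_bytes)
    while True:
        record = stream.read(record_bytes)
        if len(record) < record_bytes:
            break  # end of file (or a trailing partial record)
        yield record


def read_length_prefixed_records(stream):
    """Simplified TFRecord-like layout: each record is preceded by its own length."""
    while True:
        prefix = stream.read(8)
        if len(prefix) < 8:
            break
        (length,) = struct.unpack("<Q", prefix)  # little-endian unsigned 64-bit length
        yield stream.read(length)


# Fixed-length file: 16-byte header followed by three 4-byte records.
fixed_file = io.BytesIO(b"H" * 16 + b"aaaa" + b"bbbb" + b"cccc")
print(list(read_fixed_length_records(fixed_file, record_bytes=4, header_bytes=16)))

# Length-prefixed file: records of different sizes, each with its own length field.
prefixed_file = io.BytesIO(struct.pack("<Q", 2) + b"hi" + struct.pack("<Q", 5) + b"world")
print(list(read_length_prefixed_records(prefixed_file)))
```

The fixed-length reader needs the record size up front (28 * 28 bytes per image and a 16-byte header in the example above), while the length-prefixed format can hold records of any size, such as serialized tf.train.Example messages.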

Learning to rank how to save model

I successfully managed to implement learning to rank by following the tutorial TF-Ranking for sparse features using the ANTIQUE question answering dataset.
Now my goal is to save the learned model to disk so that I can easily load it without training again. According to the TensorFlow docs, the estimator.export_saved_model() method seems to be the way to go. But I can't wrap my head around how to tell TensorFlow what my feature structure looks like. According to the docs here, the easiest way seems to be calling tf.estimator.export.build_parsing_serving_input_receiver_fn(), which returns the required input receiver function that I have to pass to the export_saved_model function. But how do I tell TensorFlow what the features of my learning-to-rank model look like?
From my current understanding the model has context feature specs and example feature specs. So I guess I somehow have to combine those two specs into one feature description, which I then can pass to the build_parsing_serving_input_receiver_fn function?
I think you are on the right track.
You can get a build_ranking_serving_input_receiver_fn like this (substitute context_feature_columns(...) and example_feature_columns(...) with the defs you probably already have for creating your own context and example structures for your training data):
def example_serving_input_fn():
    context_feature_spec = tf.feature_column.make_parse_example_spec(
        context_feature_columns(_VOCAB_PATHS).values())
    example_feature_spec = tf.feature_column.make_parse_example_spec(
        list(example_feature_columns(_VOCAB_PATHS).values()))
    servingInputReceiver = tfr.data.build_ranking_serving_input_receiver_fn(
        data_format=tfr.data.ELWC,
        context_feature_spec=context_feature_spec,
        example_feature_spec=example_feature_spec,
        list_size=_LIST_SIZE,
        receiver_name="input_ranking_data",
        default_batch_size=None)
    return servingInputReceiver
And then pass this to export_saved_model like this:
ranker.export_saved_model('path_to_save_model', example_serving_input_fn())
(ranker here is a tf.estimator.Estimator, maybe you called this 'estimator' in your code)
ranker = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=_MODEL_DIR,
    config=run_config)

Has anyone managed to make Asynchronous advantage actor critic work with Mujoco experiments?

I'm using an open-source A3C implementation in TensorFlow which works reasonably well for Atari 2600 experiments. However, when I modify the network for Mujoco, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open-source implementation of A3C work with continuous-domain problems, for example Mujoco?
I have implemented A3C with continuous actions on Pendulum, and it works well.
First, you build your neural network so that it outputs a mean (mu) and a standard deviation (sigma) for selecting an action.
The essential part of handling continuous actions is to use a normal distribution. I'm using TensorFlow, so the code looks like:
normal_dist = tf.contrib.distributions.Normal(mu, sigma)
log_prob = normal_dist.log_prob(action)
exp_v = log_prob * td_error
entropy = normal_dist.entropy() # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v
When you want to sample an action, use the function TensorFlow provides:
sampled_action = normal_dist.sample(1)
The full code for Pendulum can be found on my GitHub: https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py
I was hung up on this for a long time, hopefully this helps someone in my shoes:
Advantage Actor-critic in discrete spaces is easy: if your actor does better than you expect, increase the probability of doing that move. If it does worse, decrease it.
In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.
That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a std-dev for each output-feature. To choose an action, you pass your inputs in to create a mean/stddev for each output-feature, and then sample each feature from this normal distribution.
If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.
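The idea in both answers can be checked with a small scalar example in plain Python. This is a sketch under simplifying assumptions (one action dimension, a hand-written gradient for the mean only, made-up function names and learning rate), not a replacement for the TensorFlow code above. It mirrors the same quantities: the log-probability of the sampled action, the entropy bonus with coefficient 0.01, and the actor loss.

```python
import math
import random


def gaussian_log_prob(action, mu, sigma):
    """Log-density of a normal distribution N(mu, sigma^2) at the given action."""
    return (-((action - mu) ** 2) / (2 * sigma ** 2)
            - math.log(sigma) - 0.5 * math.log(2 * math.pi))


def gaussian_entropy(sigma):
    """Entropy of N(mu, sigma^2); it depends on sigma only."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def actor_loss(action, mu, sigma, td_error, entropy_coef=0.01):
    """Negative of (log_prob * td_error + entropy bonus), as in the snippet above."""
    exp_v = gaussian_log_prob(action, mu, sigma) * td_error
    exp_v += entropy_coef * gaussian_entropy(sigma)
    return -exp_v


def mu_gradient_step(action, mu, sigma, td_error, lr=0.1):
    """One gradient-descent step on the actor loss with respect to mu only."""
    # d(loss)/d(mu) = -td_error * (action - mu) / sigma^2 (the entropy has no mu term)
    grad = -td_error * (action - mu) / sigma ** 2
    return mu - lr * grad


mu, sigma = 0.0, 1.0
sampled_action = random.gauss(mu, sigma)  # how an action would be drawn from the policy

# Using a fixed action so the effect of the sign of td_error is easy to see:
print(mu_gradient_step(0.8, mu, sigma, td_error=+1.0))  # mean moves toward the action
print(mu_gradient_step(0.8, mu, sigma, td_error=-1.0))  # mean moves away from the action
```

A positive TD error (the action did better than expected) pulls the mean toward the sampled action, and a negative one pushes it away, which is the continuous counterpart of raising or lowering the probability of a discrete move.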