I'm using mxnet to do deep reinforcement learning. I have a simple generator that yields observations from a random walk through a game (from openai gym):
import mxnet as mx
from mxnet import *
from mxnet.ndarray import *
import gym
def random_walk(env_id):
env, done = gym.make(env_id), True
min_rew, max_rew = env.reward_range
while True:
if done:
obs = env.reset()
action = env.action_space.sample()
obs, rew, done, info = env.step(action)
# some preprocessing ommited...
yield obs, rew, action # all converted to ndarrays now
I want to be able to save this data to a big file containing rows of (observation, reward, action), so I can later easily load, shuffle, and batch them with mxnet.
Is it possible to do using mxnet if so, how?
It doesn't sound like an MXNet task to do the saving. If you can serialize observation, reward and action to string, then you can use regular python to create and save data to file https://docs.python.org/3.7/tutorial/inputoutput.html#reading-and-writing-files
To do shuffling and batching in mxnet, you load your file first using regular python into a python list and then create SimpleDataset or ArrayDataset, depending if you have separate Y list or not. Then you pass the dataset object to DataLoader which can do shuffling and batching for you.
Take a look here for a full example: https://mxnet.incubator.apache.org/tutorials/gluon/datasets.html
Related
I trained a Faster R-CNN from the TF Object Detection API and saved it using export_inference_graph.py. I have the following directory structure:
weights
|-checkpoint
|-frozen_inference_graph.pb
|-model.ckpt-data-00000-of-00001
|-model.ckpt.index
|-model.ckpt.meta
|-pipeline.config
|-saved_model
|--saved_model.pb
|--variables
I would like to load the first and second stages of the model separately. That is, I would like the following two models:
A model containing each variable in the scope FirstStageFeatureExtractor which accepts an image (or serialized tf.data.Example) as input, and outputs the feature map and RPN proposals.
A model containing each variable in the scopes SecondStageFeatureExtractor and SecondStageBoxPredictor which accepts a feature map and RPN proposals as input, and outputs the bounding box predictions and scores.
I basically want to be able to call _predict_first_stage and _predict_second_stage separately on my input data.
Currently, I only know how to load the entire model:
model = tf.saved_model.load("weights/saved_model")
model = model.signatures["serving_default"]
EDIT 6/7/2020:
For Model 1, I may be able to extract detection_features as in this question, but I'm still not sure about Model 2.
This was more difficult when Object Detection was only compatible with TF1, but is now pretty simple in TF2. There's a good example in this colab.
from object_detection.builders import model_builder
from object_detection.utils import config_util
# Set path names
model_name = 'centernet_hg104_512x512_kpts_coco17_tpu-32'
pipeline_config = os.path.join('models/research/object_detection/configs/tf2/',
model_name + '.config')
model_dir = 'models/research/object_detection/test_data/checkpoint/'
# Load pipeline config and build a detection model
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']
detection_model = model_builder.build(model_config=model_config,
is_training=False)
# Restore checkpoint
ckpt = tf.compat.v2.train.Checkpoint(
model=detection_model)
ckpt.restore(os.path.join(model_dir, 'ckpt-0')).expect_partial()
From here one can call detection_model.predict() and associated methods such as _predict_first_stage and _predict_second_stage.
I am creating a nlp model to detect the intent from the provided utterance from a excel file which I am using for training having 2 columns like shown below:
Utterence Intent
hi can I have an Apple Watch service
how much I will be paying monthly service
you still around YOU_THERE
are you still there YOU_THERE
you there YOU_THERE
Speak to me if you are there. YOU_THERE
you around YOU_THERE
There are like around 3000 utterances in the training files and many intents.
I trained my model using scikit learn module and my code looks like this.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import re
def preprocessing(userQuery):
letters_only = re.sub("[^a-zA-Z\\d]", " ", userQuery)
words = letters_only.lower().split()
return( " ".join(words ))
#read utterance data from a xlsx file
train = pd.read_excel('training.xlsx')
query_features = train['Utterence']
#create tfidf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
new_query = [preprocessing(query) for query in query_features]
features = tfidf_vectorizer.fit_transform(new_query).toarray()
#create random forest classification model
model = RandomForestClassifier()
model.fit(features, train['Intent'])
#intent prediction on user query
userQuery = "I want apple watch"
userQueryList=[]
userQueryList.append(preprocessing(userQuery))
utfidf = tfidf_vectorizer.transform(userQueryList)
print(" prediction: ", model.predict(utfidf))
The one of problem for me here is for example: when i run for utterance I want apple watch it gives predicted intent as you_there instead of service as shown below(confirmation on training snapshot above):
C:\U\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
prediction: ['YOU_THERE']
Please help me how should i train my model and what changes should I make to fix such issues and how i can check accuracy? Also I want to see graphical visualization and ROC curve how it can achieved using random forest. I am not very verse in NLP any help would be appreciated.
You are using word bags approach which does not perform well on sequence data.
For your problem, sequential is material to classification.
I would suggest to you that use LSTM (performs better on sequence data)
Let's address your first issue:
how should i train my model and what changes should I make to fix such issues
Below I'm using word2vec approach which rather than just converting the utterances to vectors using TFIDF approach (losing the semantic information contained in that particular sentence), it maintains the semantic info.
To understand more about word2vec, refer this blog :
[1]https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Below is the code for predicted the intent using word2vec approach (Note - It's same as your code, just instead of using TFIDFVectorizer, I'm using word2vec to obtain the vectors. Also the code is divided into different functions to get a good overview of logics that will be evident by there names).
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
def preprocess_lower(token):
#utility for preprocessing
return token.lower()
def load_data(file_name):
# load a csv file in memory
return pd.read_csv(file_name)
def process_training_data(training_data):
# process the training data and split it between independent and dependent variables
training_sentences = [list(map(preprocess_lower,sentence.split(" "))) for sentence in list(training_data.Utterence.values)]
target_class = training_data.Intent.values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(target_class))
return target_class, training_sentences, label_encoded_Y
def process_user_query(training_data):
# process the training data and split it between independent and dependent variables
training_sentences = [list(map(preprocess_lower,sentence.split(" "))) for sentence in training_data]
return training_sentences
def train_word2vec_model(train_sentences_list):
# training word2vec on sentences list (inputted by user)
model = Word2Vec(train_sentences_list, size=100, window=4, min_count=1, workers=4)
return model
def convert_training_data_vectors(model, train_sentences_list):
#get the sentences average vector
training_sectences_vector = list()
for sentence in train_sentences_list:
sentence_vetor = [list(model.wv[token]) for token in sentence if token in model.wv.vocab ]
training_sectences_vector.append(list(np.mean(sentence_vetor, axis=0)))
return training_sectences_vector
def training_rf_prediction_model(training_data_vectors, label_encoded_Y):
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# training model on user inputted data
rf_model = RandomForestClassifier()
# here use the split function and divide the data into training and testing
x_train,x_test,y_train,y_test=train_test_split(training_data_vectors,label_encoded_Y,
train_size=0.8,test_size=0.2)
rf_model.fit(x_train, y_train)
y_pred = rf_model.predict(x_test)
print(accuracy_score(y_test, y_pred))
return rf_model
def training_svm_prediction_model(training_data_vectors, label_encoded_Y):
svm_model = SVC(gamma='auto')
svm_model.fit(training_data_vectors, label_encoded_Y)
return svm_model
def process_data_flow(file_name):
training_data = load_data(file_name)
target_class, training_sentences, label_encoded_Y = process_training_data(training_data)
word2vec_model = train_word2vec_model(train_sentences_list=training_sentences)
training_data_vectors = convert_training_data_vectors(word2vec_model, train_sentences_list=training_sentences)
prediction_model = training_rf_prediction_model(training_data_vectors, label_encoded_Y)
#intent prediction on user query
userQuery = ["i want apple watch"]
user_query_vectors = convert_training_data_vectors(word2vec_model, process_user_query(userQuery))
predicted_class = prediction_model.predict(user_query_vectors)[0]
predicted_intent = target_class[list(label_encoded_Y).index(predicted_class)]
return predicted_intent
print("Predicted class: ", process_data_flow("sample_intent_data.csv"))
sample data file is in csv format, you just need to format and paste the data in below format :
#sample_input_data.csv
Utterence,Intent
hi can I have an Apple Watch,service
how much I will be paying monthly,service
you still around,YOU_THERE
are you still there,YOU_THERE
you there,YOU_THERE
Speak to me if you are there,YOU_THERE
you around,YOU_THERE
Also note, your training data should contain good amount of training utterances for each intents for the approach to work.
For accuracy, you can use below approach:
Divide the data into training and testing (mention the split ratio) :
x_train,x_test,y_train,y_test=train_test_split(training_vectors,label_encoded_Y,
train_size=0.8,
test_size=0.2)
And after training the model, use predict function on x_test to get the predictions. Now just match the prediction for testing data from the model and actual from the data set and you will be able to easily determine the accuracy.
Edit: Added the accuracy score calculation while predicting.
I'd like to calculate Word Mover's Distance with Universal Sentence Encoder on TensorFlow Hub embedding.
I have tried the example on spaCy for WMD-relax, which loads 'en' model from spaCy, but I couldn't find another way to feed other embeddings.
In gensim, it seems that it only accepts load_word2vec_format file (file.bin) or load file (file.vec).
As I know, someone has written a Bert to token embeddings based on pytorch, but it's not generalized to other models on tf-hub.
Is there any other approach to transfer pretrained models on tf-hub to spaCy format or word2vec format?
You need two different things.
First tell SpaCy to use an external vector for your documents, spans or tokens. This can be done by setting the user_hooks:
- user_hooks["vector"] is for the document vector
- user_span_hooks["vector"] is for the span vector
- user_token_hooks["vector"] is for the token vector
Given the fact that you have a function that retrieves from TF Hub the vectors for a Doc/Span/Token (all of them have the property text):
import spacy
import tensorflow_hub as hub
model = hub.load(TFHUB_URL)
def embed(element):
# get the text
text = element.text
# then get your vector back. The signature is for batches/arrays
results = model([text])
# get the first element because we queried with just one text
result = np.array(results)[0]
return result
You can write the following pipe component, that tells spacy how to retrieve the custom embedding for documents, spans and tokens:
def overwrite_vectors(doc):
doc.user_hooks["vector"] = embed
doc.user_span_hooks["vector"] = embed
doc.user_token_hooks["vector"] = embed
# add this to your nlp pipeline to get it on every document
nlp = spacy.blank('en') # or any other Language
nlp.add_pipe(overwrite_vectors)
For your question related to the custom distance, there is a user hook also for this one:
def word_mover_similarity(a, b):
vector_a = a.vector
vector_b = b.vector
# your distance score needs to be converted to a similarity score
similarity = TODO_IMPLEMENT(vector_a, vector_b)
return similarity
def overwrite_similarity(doc):
doc.user_hooks["similarity"] = word_mover_similarity
doc.user_span_hooks["similarity"] = word_mover_similarity
doc.user_token_hooks["similarity"] = word_mover_similarity
# as before, add this to the pipeline
nlp.add_pipe(overwrite_similarity)
I have an implementation of the TF Hub Universal Sentence Encoder that uses the user_hooks in this way: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub
Here is the implementation of WMD in spacy. You can create a WMD object and load your own embeddings:
import numpy
from wmd import WMD
embeddings_numpy_array = # your array with word vectors
calc = WMD(embeddings_numpy_array, ...)
Or, as shown in this example., you can create your own class:
import spacy
spacy_nlp = spacy.load('en_core_web_lg')
class SpacyEmbeddings(object):
def __getitem__(self, item):
return spacy_nlp.vocab[item].vector # here you can return your own vector instead
calc = WMD(SpacyEmbeddings(), documents)
...
...
calc.nearest_neighbors("some text")
...
I want to train a model on about 2TB of image data on gcloud storage. I saved the image data as separate tfrecords and tried to use the tensorflow data api following this example
https://medium.com/#moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
But it seems like keras' model.fit(...) doesn't support validation for tfrecord datasets based on
https://github.com/keras-team/keras/pull/8388
Is there a better approach for processing large amounts of data with keras from ml-engine that I'm missing?
Thanks a lot!
If you are willing to use tf.keras instead of actual Keras, you can instantiate a TFRecordDataset with the tf.data API and pass that directly to model.fit(). Bonus: you get to stream directly from Google Cloud storage, no need to download the data first:
# Construct a TFRecordDataset
ds_train tf.data.TFRecordDataset('gs://') # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)
model.fit(ds_train)
To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
Final note: you'll need to specify the steps_per_epoch argument. A hack that I use to know the total number of examples in all TFRecordfiles, is to simply iterate over the files and count:
import tensorflow as tf
def n_records(record_list):
"""Get the total number of records in a collection of TFRecords.
Since a TFRecord file is intended to act as a stream of data,
this needs to be done naively by iterating over the file and counting.
See https://stackoverflow.com/questions/40472139
Args:
record_list (list): list of GCS paths to TFRecords files
"""
counter = 0
for f in record_list:
counter +=\
sum(1 for _ in tf.python_io.tf_record_iterator(f))
return counter
Which you can use to compute steps_per_epoch:
n_train = n_records([gs://path-to-tfrecords/record1,
gs://path-to-tfrecords/record2])
steps_per_epoch = n_train // batch_size
I am using the object detection API to train with a different dataset and I would like to know if it is possible to have sample images of what is reaching the network during the training.
I ask this because I am trying to find a good combination of data augmentation options (here the options), but the result adding them has been worse. Seeing what reaches the network in training would be very helpful.
Another question is if it is possible to get the API to help with balancing the classes, in case that the dataset passed have them unbalanced.
Thank you!
Yes, it is possible. Shortly speaking, you need to get an instance of tf.data.Dataset. Then, you can iterate over it and get the network input data as NumPy arrays. Saving it to image files using PIL or OpenCV is trivial then.
Assuming you use TF2 the pseudo-code is like this:
ds = ... get dataset object somehow
sample_num = 0
for features, _ in ds:
images = features[fields.InputDataFields.image] # is a [batch_size, H, W, C] float32 tensor with preprocessed images
batch_size = images.shape[0]
for i in range(batch_size):
image = np.array(images[i] * 255).astype(np.uint8) # assuming input data is only scaled to [0..1]
cv2.imwrite(output_path, image)
sample_num += 1
if sample_num >= MAX_SAMPLES:
break
The trick here is to get the Dataset instance. Google object detection API is very sophisticated, but I guess you should start with calling train_input function here: https://github.com/tensorflow/models/blob/3c8b6f1e17e230b68519fd8d58c4dd9e9570d789/research/object_detection/inputs.py#L763
It requires pipeline config sub-parts describing training, train_input and the model.
You can find some code snippets on how to work with pipeline here: Dynamically Editing Pipeline Config for Tensorflow Object Detection
import argparse
import tensorflow as tf
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2
def parse_arguments():
parser = argparse.ArgumentParser(description='')
parser.add_argument('pipeline')
parser.add_argument('output')
return parser.parse_args()
def main():
args = parse_arguments()
pipeline_config = pipeline_pb2.TrainEvalPipelineConfig()
with tf.gfile.GFile(args.pipeline, "r") as f:
proto_str = f.read()
text_format.Merge(proto_str, pipeline_config)