How do I create gold data for TextCategorizer training? - spacy

I want to train a TextCategorizer model with the following (text, label) pairs.
Label COLOR:
The door is brown.
The barn is red.
The flower is yellow.
Label ANIMAL:
The horse is running.
The fish is jumping.
The chicken is asleep.
I am copying the example code in the documentation for TextCategorizer.
textcat = TextCategorizer(nlp.vocab)
losses = {}
optimizer = nlp.begin_training()
textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
The doc variables will presumably be just nlp("The door is brown.") and so on. What should be in gold1 and gold2? I'm guessing they should be GoldParse objects, but I don't see how you represent text categorization information in those.

According to this example train_textcat.py it should be something like {'cats': {'ANIMAL': 0, 'COLOR': 1}} if you want to train a multi-label model. Also, if you have only two classes, you can simply use {'cats': {'ANIMAL': 1}} for label ANIMAL and {'cats': {'ANIMAL': 0}} for label COLOR.
You can use the following minimal working example for single-category text classification:
import spacy
nlp = spacy.load('en')
train_data = [
    (u"That was very bad", {"cats": {"POSITIVE": 0}}),
    (u"it is so bad", {"cats": {"POSITIVE": 0}}),
    (u"so terrible", {"cats": {"POSITIVE": 0}}),
    (u"I like it", {"cats": {"POSITIVE": 1}}),
    (u"It is very good.", {"cats": {"POSITIVE": 1}}),
    (u"That was great!", {"cats": {"POSITIVE": 1}}),
]
textcat = nlp.create_pipe('textcat')
nlp.add_pipe(textcat, last=True)
textcat.add_label('POSITIVE')
optimizer = nlp.begin_training()
for itn in range(100):
    for doc, gold in train_data:
        nlp.update([doc], [gold], sgd=optimizer)
doc = nlp(u'It is good.')
print(doc.cats)
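For the COLOR/ANIMAL sentences from the question, the annotation dicts could be built along these lines. This is only a sketch: the helper name make_textcat_examples is made up for illustration and is not part of spaCy; it just produces (text, {'cats': ...}) tuples in the same shape as train_data above.

```python
def make_textcat_examples(pairs, labels):
    """Turn (text, label) pairs into (text, {"cats": {...}}) tuples.

    Hypothetical helper, not a spaCy API: every known label gets a score,
    1 for the label assigned to the text and 0 for all the others.
    """
    examples = []
    for text, label in pairs:
        cats = {name: 1 if name == label else 0 for name in labels}
        examples.append((text, {"cats": cats}))
    return examples


pairs = [
    ("The door is brown.", "COLOR"),
    ("The barn is red.", "COLOR"),
    ("The flower is yellow.", "COLOR"),
    ("The horse is running.", "ANIMAL"),
    ("The fish is jumping.", "ANIMAL"),
    ("The chicken is asleep.", "ANIMAL"),
]
train_data = make_textcat_examples(pairs, ["COLOR", "ANIMAL"])
print(train_data[0])
```

Each label would still need to be registered with textcat.add_label() before training, as in the example above.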

Related

How to maintain lookup ability of training data embedded in Tensorflow

I'm using universal-sentence-encoder to compare similarity of article texts.
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
def embed(input):
    return model(input)
I then load an array of strings to the model...
articles = [
    'article about this thing',
    'news about war',
    'tabloid british monarchy'
]
trained = trained_model = embed(articles)
This works as expected. I can use cosine similarity to do multiple types of semantic search/matching.
def semantic_search(query, data, vectors):
    print("Extracting features...")
    query_vec = embed([query])[0].ravel()
    res = []
    for i, d in enumerate(data):
        qvec = vectors[i].ravel()
        sim = cosine_similarity(query_vec, qvec)
        res.append((sim, d, i))
    return sorted(res, key=lambda x: x[0], reverse=True)
The problem is that I can't reference the articles back when I find matches. I need (I think, maybe I'm wrong) some type of ID accompanying the output.
What comes out...
[(0.9202662,
'war in ukraine heats up', 1),
(0.74028295,
'china eyes taiwan', 1776)]
What I need...
[(0.9202662,
'war in ukraine heats up',
<SOME_ID_TO_REFERENCE>),
(0.74028295,
'china eyes taiwan',
<SOME_ID_TO_REFERENCE>)]
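One way this could be approached is to keep a list of IDs in the same order as the embedded texts and zip the two together at search time. The sketch below is an illustration only: the IDs ("a-101" and so on), the function name semantic_search_with_ids, and the toy 2-d vectors standing in for real embeddings are all assumptions, and cosine similarity is written in plain Python so the example runs without TensorFlow.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search_with_ids(query_vec, ids, texts, vectors):
    """Return (similarity, text, id) tuples, best match first.

    ids, texts and vectors must be parallel lists: position i in each
    list refers to the same article.
    """
    results = [
        (cosine_similarity(query_vec, vec), text, article_id)
        for article_id, text, vec in zip(ids, texts, vectors)
    ]
    return sorted(results, key=lambda r: r[0], reverse=True)


# Hypothetical IDs and toy vectors standing in for real embeddings.
article_ids = ["a-101", "a-102", "a-103"]
articles = ["article about this thing", "news about war", "tabloid british monarchy"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

results = semantic_search_with_ids([0.0, 2.0], article_ids, articles, vectors)
print(results[0])
```

The embedding model never sees the IDs; they only travel alongside the vectors, so the same list can later be used to look the article up in a database.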

colab_utils.annotate(), annotation format

I am following the Tensorflow notebook for Few shot learning ( https://colab.research.google.com/github/tensorflow/models/blob/master/research/object_detection/colab_tutorials/eager_few_shot_od_training_tf2_colab.ipynb#scrollTo=RW1FrT2iNnpy )
In it, I saw that the images are annotated using colab_utils.annotate(). I can't work out which annotation format it uses (YOLO, COCO, or something else). Another problem is that the classes can't be specified while drawing the bounding boxes, so I have to remember the order in which I annotated the images and classes and add them in code later on.
Can someone tell me what the format is, so that I can annotate the images locally on my PC rather than on Colab? That would save a lot of time.
Any help would be appreciated.
Regards
The colab_utils annotation tool is only practical for a single class. Below is the format from the source code:
[
// stuff for image 1
[
// stuff for rect 1
{x, y, w, h},
// stuff for rect 2
{x, y, w, h},
...
],
// stuff for image 2
[
// stuff for rect 1
{x, y, w, h},
// stuff for rect 2
{x, y, w, h},
...
],
...
]
As the annotations don't include any reference ID to the source image, order matters and you have to match the order of the box array with the order of your images; this tool is probably not practical for a large training set. The example from the colab you provided, below, is thus the example to follow.
gt_boxes = [
    np.array([[0.436, 0.591, 0.629, 0.712]], dtype=np.float32),
    np.array([[0.539, 0.583, 0.73, 0.71]], dtype=np.float32),
    np.array([[0.464, 0.414, 0.626, 0.548]], dtype=np.float32),
    np.array([[0.313, 0.308, 0.648, 0.526]], dtype=np.float32),
    np.array([[0.256, 0.444, 0.484, 0.629]], dtype=np.float32)
]
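If you annotate locally with a tool that exports pixel rectangles, a conversion into the gt_boxes layout might look like the sketch below. It rests on two assumptions that you should check against your own data: that (x, y) is the top-left corner of the rectangle in pixels, and that each gt_boxes row is [ymin, xmin, ymax, xmax] normalized to the 0..1 range, which is the usual TensorFlow Object Detection API convention. The function name is made up for illustration.

```python
def xywh_to_normalized_yxyx(box, image_width, image_height):
    """Convert a pixel (x, y, w, h) rectangle to [ymin, xmin, ymax, xmax].

    Assumes (x, y) is the top-left corner in pixels and that the target
    format is normalized to the 0..1 range.
    """
    x, y, w, h = box
    return [
        y / image_height,
        x / image_width,
        (y + h) / image_height,
        (x + w) / image_width,
    ]


# A 200x100 px rectangle whose top-left corner is at (100, 50)
# in a 400x200 px image.
normalized = xywh_to_normalized_yxyx((100, 50, 200, 100), 400, 200)
print(normalized)  # [0.25, 0.25, 0.75, 0.75]
```

Each image would then get one array holding its converted rows, in the same order as the images themselves.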

GridSearch for XGBoost

I am reading the grid search tutorial for XGBoost on Analytics Vidhya.
It has the following code:
param_test1 = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(
        learning_rate=0.1,
        n_estimators=140,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        scale_pos_weight=1,
        seed=27),
    param_grid=param_test1,
    scoring='roc_auc',
    n_jobs=4, iid=False,
    cv=5)
My question is: if the grid search is used to find better values for max_depth and min_child_weight,
why are these two parameters set in gsearch1 to 5 and 1, respectively?
Moreover, in my own code, when I comment these two out, the result changes. Why is that?
Thanks

Output score, class and id extraction using TensorFlow object detection

How can I extract the scores, classes, and IDs of the objects detected in an image by the TensorFlow Object Detection model?
I want to store these details in individual variables so that they can later be saved to a database.
I am using the same code as in this notebook:
https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb
Please help me find a solution to this problem.
I've tried:
print(str(output_dict['detection_classes'][0] ) , ":" , str(output_dict['detection_scores'][0]))
This works and gives the object ID and score for the class with the highest probability. But I also want to extract the class name, as well as the scores, IDs, and names for all objects present in the image.
Example of output:
There are two dogs in the image. When I print the result, I get the ID and score for the object with the highest probability (94% in this case). I want to print the object name too, along with the same details for all other objects in the image.
You may need some background knowledge of TensorFlow object detection, but this short and quick solution might be what you are looking for:
with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        for image_path in TEST_IMAGE_PATHS:
            image = Image.open(image_path)
            image_np = load_image_into_numpy_array(image)
            image_np_expanded = np.expand_dims(image_np, axis=0)
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')
            # Actual detection.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)
            objects = []
            threshold = 0.5  # in order to get higher percentages you need to lower this number; usually at 0.01 you get 100% predicted objects
            for index, value in enumerate(classes[0]):
                object_dict = {}
                if scores[0, index] > threshold:
                    object_dict[(category_index.get(value)).get('name').encode('utf8')] = \
                        scores[0, index]
                    objects.append(object_dict)
            print(objects)
            print(len(np.where(scores[0] > threshold)[0])/num_detections[0])
            plt.figure(figsize=IMAGE_SIZE)
            plt.imshow(image_np)
Hope this helps.
It gives you the class with the highest score because output tensors are sorted from highest score to lowest and you are asking for the highest score by indexing to the first element [0].
Look at object_detection/inference/detection_inference for inspiration.
As for class names, you can use the label map to create a category index dictionary to translate class ids to names.
To get the class name, your label map should be able to help here.
import os
from object_detection.utils import label_map_util
label_map_path = os.path.join(annotations_dir, 'label_map.pbtxt')
label_map_dict = label_map_util.get_label_map_dict(label_map_path)
# Invert the name -> id mapping (use .iteritems() instead of .items() on Python 2).
label_map_dict_number_to_name = {v: k for k, v in label_map_dict.items()}
class_number = output_dict['detection_classes'][index]
class_name = label_map_dict_number_to_name[class_number]
Please paste your code, so we can figure out why only one box is in y
Code:
my_classes = detections['detection_classes'][0].numpy() + label_id_offset
my_scores = detections['detection_scores'][0].numpy()
min_score = 0.5
print([{'class': category_index.get(value), 'score': my_scores[index]}
       for index, value in enumerate(my_classes)
       if my_scores[index] > min_score])
Sample output:
[{'class': {'id': 1, 'name': 'person'}, 'score': 0.7414642},
{'class': {'id': 77, 'name': 'cell phone'}, 'score': 0.6765256}]

Cloud ML Engine batch predictions - How to simply match returned predictions with input data?

According to the ML Engine documentation, an instance key is required to match the returned predictions with the input data. For simplicity purposes, I would like to use a DNNClassifier but apparently canned estimators don't seem to support instance keys yet (only custom or tensorflow core estimators).
So I looked at the Census code examples of Custom/TensorflowCore Estimators but they look quite complex for what I am trying to achieve.
I would prefer to use an approach similar to the one described in this stackoverflow answer (wrapping a DNNClassifier into a custom estimator), but I cannot make it work: I get an error saying that 'DNNClassifier' object has no attribute 'model_fn'...
How can I achieve this in a simple manner?
In version 1.2, the contrib estimators (tf.contrib.learn.DNNClassifier, for example) were changed to inherit from the core estimator class tf.estimator.Estimator which, unlike its predecessor, hides the model function as a private class member.
Try estimator._model_fn rather than estimator.model_fn. You should be able to leave everything else in my previous answer the same.
EDIT: I've updated my original answer here: https://stackoverflow.com/a/44443380/3597868 to reflect the necessary changes with version 1.2
My code as per Eli's example:
def key_model_fn_gen(estimator):
    def _model_fn(feature_columns, labels, mode):
        key = feature_columns.pop(KEY)
        params = estimator.params
        model_fn_ops = estimator._model_fn(features=feature_columns,
                                           labels=labels,
                                           mode=mode,
                                           params=params)
        model_fn_ops.predictions[KEY] = key
        return model_fn_ops
    return _model_fn
but I am still unable to get the instance key to appear in the results of ML Engine batch predictions...
What do I need to change in the Experiment (or maybe in the export strategy) to make it work?
System/Version Info
Canned census example committed on 2017_06_22_15:06:37.
TensorFlow 1.2.
Python 3
GCP ML Engine 1.2
Approach
Fabrice, I had the same question as you and it took me a while to figure this one out (with the generous help of Eli). I took a slightly different approach. Instead of trying to create an instance key, I made the assumption that the instance key would be in the data (training, evaluation, and prediction).
Here, I use the gender field as the instance key. Obviously, I would not use the gender field in reality as an instance key, I'm only using it here for illustration purposes.
Other than the changes described here, I am not making any updates to functions or constants from the original script, apart from changing some things from Python 2 to Python 3, e.g., changing dict.iteritems() to dict.items().
Here is a gist of my modified model.py file. I did not make any changes to the task.py file.
Updating the key_model_fn_gen() function
This code relies on guidance I got from Eli. The insight for me was that I need to modify the output_alternatives dictionary in order to return the key and that I do not need to modify the predictions dictionary. (Additionally, I learned that I could get the params as an attribute of the estimator from your (Fabrice's) example, thanks for that.)
KEY = 'gender'
def key_model_fn_gen(estimator):
    def _model_fn(features, labels, mode):
        key = features.pop(KEY)
        params = estimator.params
        model_fn_ops = estimator._model_fn(features=features, labels=labels, mode=mode, params=params)
        model_fn_ops.output_alternatives[None][1]['key'] = key
        return model_fn_ops
    return _model_fn
Updating the build_estimator() function
I remove gender from deep_columns list and wide_columns list so that it is not used as a feature for training and evaluation.
I modify the return to include the key wrapper per Eli's guidance.
I get the model_dir as an attribute of config.
Here is the full code:
def build_estimator(config, embedding_size=8, hidden_units=None):
    """Build an estimator."""
    (gender, race, education, marital_status, relationship,
     workclass, occupation, native_country, age,
     education_num, capital_gain, capital_loss, hours_per_week) = INPUT_COLUMNS
    # Reused Transformations.
    # Continuous columns can be converted to categorical via bucketization
    age_buckets = tf.feature_column.bucketized_column(
        age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
    # Wide columns and deep columns.
    wide_columns = [
        # Interactions between different categorical features can also
        # be added as new virtual features.
        tf.feature_column.crossed_column(
            ['education', 'occupation'], hash_bucket_size=int(1e4)),
        tf.feature_column.crossed_column(
            [age_buckets, race, 'occupation'], hash_bucket_size=int(1e6)),
        tf.feature_column.crossed_column(
            ['native_country', 'occupation'], hash_bucket_size=int(1e4)),
        native_country,
        education,
        occupation,
        workclass,
        marital_status,
        relationship,
        age_buckets,
    ]
    deep_columns = [
        # Use indicator columns for low dimensional vocabularies
        tf.feature_column.indicator_column(workclass),
        tf.feature_column.indicator_column(education),
        tf.feature_column.indicator_column(marital_status),
        tf.feature_column.indicator_column(relationship),
        tf.feature_column.indicator_column(race),
        # Use embedding columns for high dimensional vocabularies
        tf.feature_column.embedding_column(
            native_country, dimension=embedding_size),
        tf.feature_column.embedding_column(occupation, dimension=embedding_size),
        age,
        education_num,
        capital_gain,
        capital_loss,
        hours_per_week,
    ]
    return tf.contrib.learn.Estimator(
        model_fn=key_model_fn_gen(
            tf.contrib.learn.DNNLinearCombinedClassifier(
                config=config,
                linear_feature_columns=wide_columns,
                dnn_feature_columns=deep_columns,
                dnn_hidden_units=hidden_units or [100, 70, 50, 25],
                fix_global_step_increment_bug=True)
        ),
        model_dir=config.model_dir
    )
Input format for batch predictions
After the version has been uploaded to ML Engine, the prediction input takes the following form:
{"native_country":" United-States","race":" Black","age":"44","relationship":" Other-relative","gender":" Male","marital_status":" Never-married","hours_per_week":"32","capital_gain":"0","education_num":"9","education":" HS-grad","occupation":" Other-service","capital_loss":"0","workclass":" Private"}
{"native_country":" United-States","race":" White","age":"35","relationship":" Not-in-family","gender":" Male","marital_status":" Divorced","hours_per_week":"40","capital_gain":"0","education_num":"9","education":" HS-grad","occupation":" Craft-repair","capital_loss":"0","workclass":" Private"}
{"native_country":" United-States","race":" White","age":"20","relationship":" Husband","gender":" Male","marital_status":" Married-civ-spouse","hours_per_week":"40","capital_gain":"0","education_num":"10","education":" Some-college","occupation":" Craft-repair","capital_loss":"0","workclass":" Private"}
{"native_country":" United-States","race":" White","age":"43","relationship":" Husband","gender":" Male","marital_status":" Married-civ-spouse","hours_per_week":"50","capital_gain":"0","education_num":"10","education":" Some-college","occupation":" Farming-fishing","capital_loss":"0","workclass":" Self-emp-not-inc"}
{"native_country":" England","race":" White","age":"33","relationship":" Husband","gender":" Male","marital_status":" Married-civ-spouse","hours_per_week":"40","capital_gain":"0","education_num":"13","education":" Bachelors","occupation":" Farming-fishing","capital_loss":"0","workclass":" Private"}
{"native_country":" United-States","race":" White","age":"38","relationship":" Unmarried","gender":" Female","marital_status":" Divorced","hours_per_week":"56","capital_gain":"0","education_num":"13","education":" Bachelors","occupation":" Prof-specialty","capital_loss":"0","workclass":" Private"}
{"native_country":" United-States","race":" White","age":"53","relationship":" Not-in-family","gender":" Female","marital_status":" Never-married","hours_per_week":"35","capital_gain":"8614","education_num":"14","education":" Masters","occupation":" ?","capital_loss":"0","workclass":" ?"}
{"native_country":" China","race":" Asian-Pac-Islander","age":"64","relationship":" Husband","gender":" Male","marital_status":" Married-civ-spouse","hours_per_week":"60","capital_gain":"0","education_num":"14","education":" Masters","occupation":" Prof-specialty","capital_loss":"2057","workclass":" Private"}
Output format of batch prediction
After completing the batch prediction job, I get the following output:
{"probabilities": [0.9633187055587769, 0.036681365221738815], "classes": ["0", "1"], "key": [" Male"]}
{"probabilities": [0.9452069997787476, 0.05479296296834946], "classes": ["0", "1"], "key": [" Male"]}
{"probabilities": [0.8586776852607727, 0.1413223296403885], "classes": ["0", "1"], "key": [" Male"]}
{"probabilities": [0.7370017170906067, 0.2629982531070709], "classes": ["0", "1"], "key": [" Male"]}
{"probabilities": [0.48797568678855896, 0.5120242238044739], "classes": ["0", "1"], "key": [" Male"]}
{"probabilities": [0.8111950755119324, 0.18880495429039001], "classes": ["0", "1"], "key": [" Female"]}
{"probabilities": [0.5560402274131775, 0.4439597725868225], "classes": ["0", "1"], "key": [" Female"]}
{"probabilities": [0.3235422968864441, 0.6764576435089111], "classes": ["0", "1"], "key": [" Male"]}
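With a key that is unique per instance (unlike gender, which is only used above for illustration), the output lines could be matched back to the inputs roughly as follows. This is a sketch with made-up data: the "id" field, the "row-1"/"row-2" values, and the probabilities are hypothetical, and the only thing taken from the output above is its shape, where "key" is a one-element list.

```python
import json


def index_predictions_by_key(prediction_lines):
    """Map each instance key to its parsed prediction record.

    Assumes one JSON object per line and a "key" field holding a
    one-element list, as in the batch prediction output above.
    """
    by_key = {}
    for line in prediction_lines:
        record = json.loads(line)
        by_key[record["key"][0]] = record
    return by_key


# Hypothetical inputs carrying a unique "id" field used as the instance key.
inputs = [{"id": "row-1", "age": "44"}, {"id": "row-2", "age": "35"}]

# Toy prediction lines, deliberately in a different order than the inputs.
prediction_lines = [
    '{"probabilities": [0.2, 0.8], "classes": ["0", "1"], "key": ["row-2"]}',
    '{"probabilities": [0.9, 0.1], "classes": ["0", "1"], "key": ["row-1"]}',
]

by_key = index_predictions_by_key(prediction_lines)
matched = [(row, by_key[row["id"]]["probabilities"]) for row in inputs]
print(matched)
```

Looking predictions up by key means the join does not depend on the order in which the output lines are written.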