Guys,
As per the Google Cloud ML requirements (AutoML Entity Extraction project):
Each label must have at least 50 annotations. Though ideally, each
label should have at least 100 annotations. Fewer annotations often
result in inaccurate precision and recall. You must also have at least
10 annotations each assigned to your Train, Validation and Test sets.
We have more than 50 annotations per label, and at least 10 annotations assigned to each set.
But we still can't click the Start Training button.
For Entity Extraction, AutoML requires at least 100 training items to be able to train your model, as mentioned in the documentation [1].
Here is a more detailed reference regarding Entity Extraction [2]:
You supply between 50 and 100,000 documents to use for training your
custom model. You use between one and 100 unique labels to annotate
the entities you want the model to learn to extract. Each annotation
is a span of text and an associated label. Label names can be between
2 and 30 characters, and can be used to annotate between one and 10
words. We recommend using each label at least 200 times in your
training data set.
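As a quick sanity check before trying the Start Training button again, it can help to count annotations per label and per split against the minimums quoted above. Below is a minimal plain-Python sketch; the list of (label, split) pairs is a hypothetical export format, so adapt it to however your annotations are stored.

```python
from collections import Counter

# Hypothetical export: one (label, split) pair per annotation.
annotations = [
    ("PERSON", "TRAIN"), ("PERSON", "VALIDATION"), ("PERSON", "TEST"),
    # ... the rest of your annotations ...
]

per_label = Counter(label for label, _ in annotations)
per_label_split = Counter(annotations)

for label, total in per_label.items():
    if total < 50:
        print(f"{label}: only {total} annotations (minimum 50, ideally 100+)")
    for split in ("TRAIN", "VALIDATION", "TEST"):
        n = per_label_split[(label, split)]
        if n < 10:
            print(f"{label}: only {n} annotations in {split} (minimum 10)")
```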
Related
Let's say I have 3 images (an apple, an orange, a banana) and another 1000 arbitrary images. What I want to do is to see if those 1000 arbitrary images contain object(s) similar to the former 3 images, and if yes, draw a bounding box to indicate those objects. However, none of these 1003 images or objects are labelled or have any annotations.
I have done some research on the internet and tried to find deep learning object detection approaches (e.g. Faster R-CNN, YOLOv3), but I couldn't work out how they relate to my task.
I have also noticed that there is a term called template matching, but it does not seem closely related to deep learning.
So my question is:
Is there any good approach or deep learning model that could meet my needs?
Will I benefit from any pre-trained Faster R-CNN or YOLOv3 models? (e.g. if they are trained on a car, people, dog, and cat image set, will those meaningful features also apply to a new domain?)
What I want to do is to see if those 1000 arbitrary images contain object(s) similar to the former 3 images
What did you mean by "similar?"
If you meant "I want to see if the 1000 images contain objects from the target classes: orange, apple, and banana", then here's the answer:
If your models were pre-trained with your target classes (orange,
apple, and banana), then you can use those pre-trained models to
detect the objects in your 1003 images. You can just select orange,
apple, and banana as the classes' names in the configuration.
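As an illustration of this first case: COCO already contains apple, banana, and orange, so a COCO-pretrained detector can be used directly. A minimal sketch with a recent torchvision (the image path and the 0.5 score threshold are just placeholders):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights,
)
from torchvision.transforms.functional import convert_image_dtype

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
categories = weights.meta["categories"]            # COCO class names
targets = {"apple", "banana", "orange"}

model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = convert_image_dtype(read_image("some_image.jpg"), torch.float)
with torch.no_grad():
    detections = model([img])[0]

for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    name = categories[label.item()]
    if name in targets and score > 0.5:            # keep confident target-class boxes
        print(name, round(score.item(), 3), box.tolist())
```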
If your pre-trained models weren't trained on your target classes and you only have your 1003 images, you will need to do what is called fine-tuning, which is training the last layer of the model. 1003 images might not be enough for training the model and you might need to perform data augmentation to expand your data. Also, consider making your classes balanced (meaning having the same number of objects per class).
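If you do end up fine-tuning, the usual torchvision pattern is to swap the detector's box predictor for one sized to your own classes and retrain on your (augmented) 1003 images. A rough sketch, not a complete training loop:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 4  # background + orange, apple, banana

# Start from COCO-pretrained weights and replace the classification head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Optionally freeze the backbone so only the new head ("the last layers") is trained.
for p in model.backbone.parameters():
    p.requires_grad = False

# ... build a dataset with augmentation (flips, colour jitter, etc.) and train as usual ...
```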
For something close to "similarity score," you can consider the
confidence score for class x, which is the likelihood the bounding box contains an object x. However, this confidence score mainly depends on "how well trained" the model is on class x. For example, different models may differ in their confidence scores for the same images. Also, the same model may have different confidence scores for the same object in different angles, lighting, and orientation. Thus, it might be a better idea for you to fine-tune the models anyway so that they can be more "robust" to any representations of your target classes.
I have a dataset of many images where each sample has 5 magnifications (x10, x20, x30, x40, x50) of the same class, but they are not sequence data. All images are in RGB mode with size 512x512, and I want to give these 5 images as one input to the CNN, but I don't know how.
Also, there is another problem: once the model has been trained well on the 5-image pipeline, will it still work when I have only one image (one magnification, x10 as an example)?
You have asked two questions.
For the 1st one, the most direct way is to design the model so that the input size is 5×512×512×3, and train the model on that.
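For example (a hypothetical PyTorch sketch, with layer sizes chosen arbitrarily), you can stack the five magnifications along the channel axis so that each 5×512×512×3 sample becomes one 15-channel image:

```python
import torch
import torch.nn as nn

class StackedMagnificationCNN(nn.Module):
    """Treats the 5 magnifications of one sample as a single 15-channel image."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(5 * 3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                        # x: (batch, 5, 3, 512, 512)
        x = x.reshape(x.shape[0], 15, 512, 512)  # stack magnifications as channels
        return self.classifier(self.features(x).flatten(1))

model = StackedMagnificationCNN(num_classes=2)
out = model(torch.randn(4, 5, 3, 512, 512))      # a batch of 4 samples
```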
For your 2nd question, you need to design your model so that it can handle absent or missing inputs. One way I can think of is to design the model as follows:
You have 5 inputs, one per image, and each image goes through its own CNN branch; after one or a few layers you merge the branches together.
For each input, you can add an additional feature, a boolean, to indicate whether the current image is present and should be considered in training. During training, you should use combinations of all 5 images and also cases where some are absent, so that your model learns to handle the absence of one or more of the 5 input images.
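A rough sketch of that idea (layer sizes and names are my own, not from the original answer): each magnification gets its own small CNN branch, the branch output is multiplied by a 0/1 presence flag, and the masked features are merged before the classifier.

```python
import torch
import torch.nn as nn

class MultiMagnificationCNN(nn.Module):
    def __init__(self, num_classes, n_inputs=5):
        super().__init__()
        # One small CNN branch per magnification.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            for _ in range(n_inputs)
        ])
        self.classifier = nn.Linear(16 * n_inputs, num_classes)

    def forward(self, images, present):
        # images: (batch, 5, 3, 512, 512); present: (batch, 5) with 1 = given, 0 = missing
        feats = []
        for i, branch in enumerate(self.branches):
            f = branch(images[:, i])                 # (batch, 16)
            feats.append(f * present[:, i:i + 1])    # zero out absent magnifications
        return self.classifier(torch.cat(feats, dim=1))

model = MultiMagnificationCNN(num_classes=2)
imgs = torch.randn(4, 5, 3, 512, 512)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 0, 0, 0, 0],
                     [1, 1, 0, 1, 0],
                     [0, 0, 1, 0, 0]], dtype=torch.float)
out = model(imgs, mask)  # during training, vary the mask so the model sees missing inputs
```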
I hope I was clear enough and it helps.
Good luck.
I have approximately 100,000 images for doing detection task.
On average there are 10 target objects in each individual image. However, not all of the objects are labeled; for example, out of 10 objects, 5 have boxes and the others do not have any boxes. Do you think training the network with this data would be a good idea, or is it required that all the objects in the image have bounding boxes?
I started using the TF Object Detection API just two weeks ago, and managed to train a model to recognize a custom object, in my case a Mecanum wheel.
Here's the details:
No. of training images = 125
All training images are around 500 x 500 (give or take)
Transfer Learning
Model used = ssd_mobilenet_v1_coco
batch size = 2
total steps ran = 12715
loss is around 0.5000 - 2.5000; sometimes it fluctuates to more than 10, and I am not sure why
Here's the result:
The first image is encouraging.
The second image starts to disappoint me a little. I expected the model to detect FOUR Mecanum wheels (four boxes). Why doesn't it?
Then I suspected that there was something wrong with my trained model. I tried the sample test images, the third and fourth images, and then I was sure that this is totally not the model I was aiming for.
I have been reading this post, and I think our problems are quite similar (and he managed to solve it). He mentioned that the input image needs to be less than 600 x 1024, so I tried with the fifth image and, unsurprisingly, the result is again disappointing.
I went through the tutorial series by sentdex, and in the comment sections I noticed that many people face this problem too. So, what to do now?
125 images? You will not be able to get very good results with that few images. If you want to validate that this is indeed the problem, try training with just subsets of your original 125 images.
For example, how bad is the output when you train on 10 images?
Does it get better when you use 50 images?
Does it get better yet when you use 125 images?
If the accuracy improves with increasing dataset size, you can extrapolate and guess that with 1000 images you will be able to do even better. I would guess that this is your problem.
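One way to run that experiment is sketched below; train_and_evaluate is a placeholder for your own training and evaluation pipeline, not a real API.

```python
import random

def train_and_evaluate(image_paths):
    """Placeholder: train your detector on these images and return a validation metric (e.g. mAP)."""
    raise NotImplementedError

all_images = [...]  # paths to your 125 labelled images
random.shuffle(all_images)

for n in (10, 50, 125):
    score = train_and_evaluate(all_images[:n])
    print(f"{n} images -> validation score {score:.3f}")
# If the score keeps climbing with dataset size, more data (e.g. ~1000 images) will likely help.
```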
I have a use case where I have around 100 images each of 10,000 unique items. I have 10 items with me, which are all from the 10,000 set, and I know which 10 items they are, but only at the time of testing on live data. I now have to match the 10 items with their names. What would be an efficient way to recognise these items? I have full control of the training environment background and the testing environment background. If I make one model of all 10,000 items, will it scale? Or should I make 10,000 different models and run the 10 items on the 10 models I have pretrained?
Your question is about something called "one-vs-all classification"; you can do a Google search for that, and the first hit is a video lecture by Andrew Ng that's almost certainly worth watching.
The question has long been studied, and in a plethora of contexts. The answer to your question very much depends on what model you use. But I'll assume that, if you're doing image classification, you are using convolutional neural networks, because, after all, they're state of the art for most such image classification tasks.
In the context of convolutional networks, there is something called "multi-task learning" that you should read up on. Boiled down to a single sentence, the concept is that the more you ask the network to learn, the better it is at the individual tasks. So, in this case, you're almost certain to perform better training 1 model on 10,000 classes than 10,000 models, each performing a one-vs-all classification scheme.
Take, for example, the 1,000-class ImageNet dataset and CIFAR-10's 10-class dataset. It has been demonstrated in numerous papers that first training against ImageNet's 1,000-class dataset and then simply replacing the last layer with a 10-class output and re-training on CIFAR-10 will produce a better result than just training on CIFAR-10 alone. There are admittedly multiple reasons for this result; ImageNet is a larger dataset, for one. But the richness of class labels in the ImageNet dataset, i.e. multi-task learning, is certainly among the reasons.
So that was a long-winded way of saying: use one model with 10,000 classes.
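Concretely, in a hypothetical PyTorch setup that single model can be as simple as an ImageNet-pretrained backbone with its last layer replaced by a 10,000-way output, just like the ImageNet-to-CIFAR-10 example above:

```python
import torch.nn as nn
from torchvision import models

num_classes = 10_000

# ImageNet-pretrained backbone; only the final layer is new.
model = models.resnet50(weights="DEFAULT")
model.fc = nn.Linear(model.fc.in_features, num_classes)

criterion = nn.CrossEntropyLoss()  # one model, one softmax over all 10,000 items
```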
An aside:
If you want to get really, really interesting, and jump into the realm of research-level thinking, you might consider a 1-hot vector of 10,000 classes rather sparse and start thinking about whether you could reduce the dimensionality of your output layer using an embedding. An embedding would be a dense vector, let's say of size 100 as a good starting point. Now class labels turn into clusters of points in your 100-dimensional space. I bet your network will perform even better under these conditions.
If this little aside didn't make sense, it's completely safe to ignore it; your 10,000-class output is fine. But if it did pique your interest, look up information on Word2Vec, and read this really nice post on how face recognition is achieved using embeddings: https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78. You might also consider using an Auto Encoder to generate an embedding for the images (though I myself favor triplet embeddings, as typically used in face recognition).
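And if the embedding aside did pique your interest, here is a minimal sketch of the triplet-embedding variant (the 100-dimensional size and the 224x224 inputs are arbitrary choices, not requirements):

```python
import torch
import torch.nn as nn
from torchvision import models

embedding_dim = 100  # dense vector instead of a 10,000-way one-hot output

backbone = models.resnet50(weights="DEFAULT")
backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)

triplet_loss = nn.TripletMarginLoss(margin=1.0)

# anchor/positive: two images of the same item; negative: an image of a different item
anchor, positive, negative = (torch.randn(8, 3, 224, 224) for _ in range(3))
loss = triplet_loss(backbone(anchor), backbone(positive), backbone(negative))
# At test time, an item is recognised by nearest-neighbour search in the embedding space.
```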