I am using huggingface transformer models for text-summarization.
Currently I am testing different models such as T5 and Pegasus.
These models were trained to summarize long texts into very short summaries, at most about two sentences. My task, however, needs summaries that are about half the length of the input text, so the generated summaries are too short for my purpose.
My question is whether there is a way to tell the model that another sentence came before.
Kind of similar to the logic inside stateful RNNs (although I know they work completely differently).
If so, I could summarize small windows of sentences, always passing along the information about what came before.
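A minimal sketch of that windowing idea, assuming a hypothetical `summarize` callable that wraps whichever model is used (T5, Pegasus, ...). The function name, `window_size`, and `context_size` are illustrative assumptions and not part of any library. The only "memory" here is that the previous window's summary is prepended to the next window as plain text; whether a given model actually benefits from that is untested.

```python
from typing import Callable, List


def summarize_in_windows(
    sentences: List[str],
    summarize: Callable[[str], str],  # hypothetical wrapper around the model
    window_size: int = 4,
    context_size: int = 1,
) -> List[str]:
    """Summarize consecutive windows of sentences, carrying earlier summaries along."""
    summaries: List[str] = []
    for start in range(0, len(sentences), window_size):
        window = sentences[start:start + window_size]
        # Prepend the most recent summaries so the model sees what came before.
        context = summaries[-context_size:] if context_size > 0 else []
        summaries.append(summarize(" ".join(context + window)))
    return summaries
```

Joining the returned list gives a summary whose length grows with the number of windows, so a smaller `window_size` yields a longer overall summary.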
Is this just wishful thinking on my part? I can't believe I am the only one who wants summaries that are shorter than the source but longer than one or two sentences.
Thank you
Why not transfer learning? Train them on your specific texts and summaries.
I trained T5 on a specific, limited set of texts for 5 epochs and got very good results. I adapted the code from here to my needs: https://github.com/patil-suraj/exploring-T5/blob/master/t5_fine_tuning.ipynb
Let me know if you have any specific training questions.
I have a dataset of images in which each sample comes at 5 magnifications (x10, x20, x30, x40, x50) that all belong to the same class, but they are not sequence data. All images are RGB with size 512x512. I want to give these 5 images as input to a CNN, and I don't know how.
There is also another problem: once the model has been trained well on the 5-image pipeline, will it still work when I have only one image (one magnification, x10 for example)?
You have asked two questions.
For the first one, there are two ways to do it. 1. You can design the model so that its input size is 5×512×512×3 and train it on that.
For your second question, you need to design your model so that it can handle absent or missing features. A more complicated design that I can think of is the following.
You have 5 inputs, one per image. Each image goes through one or more CNN layers, and after one or a few layers you merge the branches together.
For each input, you can add an extra feature, a boolean indicating whether that image is present or absent and should therefore be considered. During training, you should feed combinations of all 5 images and also cases where some are absent, so that your model learns to handle the absence of one or more of the 5 images in the input.
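As a rough sketch of the presence-flag idea, the snippet below only shows how the training combinations could be enumerated; it is not a model. The names `presence_masks` and `apply_mask`, and the choice of replacing an absent image with a blank placeholder, are assumptions made for illustration.

```python
from itertools import product

MAGNIFICATIONS = ("x10", "x20", "x30", "x40", "x50")


def presence_masks(n_inputs=len(MAGNIFICATIONS)):
    """All 0/1 presence patterns with at least one image present."""
    return [mask for mask in product((0, 1), repeat=n_inputs) if any(mask)]


def apply_mask(images, mask, blank):
    """Replace absent images with a blank placeholder (e.g. an all-zero image)."""
    return [image if present else blank for image, present in zip(images, mask)]
```

Each training sample would then be fed as the masked images plus the mask itself, so the network sees both the pixels and the present/absent booleans. With 5 inputs there are 31 such patterns, including the single-magnification cases.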
I hope I was clear enough and it helps.
Good luck.
I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using BERT and GPT2 for text classification tasks. However, I'm not sure which one I should pick to start with. Which of these recent NLP models, such as the original Transformer model, BERT, GPT2, or XLNet, would you start with, and why? I'd rather implement it in TensorFlow, but I'm flexible enough to go for PyTorch too.
Thanks!
It depends heavily on your dataset, and it is part of the data scientist's job to find which model is more suitable for a particular task in terms of the selected performance metric, training cost, model complexity, etc.
When you work on the problem you will probably test all of the above models and compare them. Which one should you choose first? Andrew Ng, in "Machine Learning Yearning", suggests starting with a simple model so you can quickly iterate and test your idea, data preprocessing pipeline, etc.
"Don't start off trying to design and build the perfect system. Instead, build and train a basic system quickly, perhaps in just a few days."
According to this suggestion, you can start with a simpler model such as ULMFiT as a baseline, verify your ideas and then move on to more complex models and see how they can improve your results.
Note that modern NLP models contain a large number of parameters, and it is difficult to train them from scratch without a large dataset. That's why you may want to use transfer learning: you can download a pre-trained model, use it as a basis, and fine-tune it on your task-specific dataset to achieve better performance and reduce training time.
I agree with Max's answer, but if the constraint is to use a state-of-the-art large pretrained model, there is a really easy way to do this: the library by HuggingFace called pytorch-transformers. Whether you choose BERT, XLNet, or whatever, they're easy to swap out. Here is a detailed tutorial on using that library for text classification.
EDIT: I just came across this repo, pytorch-transformers-classification (Apache 2.0 license), which is a tool for doing exactly what you want.
Well, like others mentioned, it depends on the dataset; multiple models should be tried and the best one chosen.
However, to share my experience, XLNet has beaten all the other models I tried so far by a good margin. Hence, if learning is not the objective, I would simply start with XLNet, then try a few more down the line and conclude. It just saves time in exploring.
The repo below is excellent for doing all this quickly. Kudos to them.
https://github.com/microsoft/nlp-recipes
It uses hugging face transformers and makes them dead simple. 😃
I have used XLNet, BERT, and GPT2 for summarization tasks (English only). Based on my experience, GPT2 works the best among all 3 on short paragraph-size notes, while BERT performs better for longer texts (up to 2-3 pages). You can use XLNet as a benchmark.
I am new to the machine learning field, and based on what I have seen on YouTube and read on the internet, I conjectured that it might be possible to count pedestrians in a video using TensorFlow's object detection API.
Consequently, I did some research on TensorFlow, read the documentation on how to install it, and then downloaded and installed it. Using the sample files provided on GitHub, I adapted the code from the object_detection notebook provided here: https://github.com/tensorflow/models/tree/master/research/object_detection.
I ran the adapted code on the videos that I collected, while changing the visualization_utils.py script so that it reports the number of objects that cross a defined region of interest on the screen. That is, I collected the bounding box dimensions (left, right, top, bottom) of the person class and counted all the detections that crossed the defined region of interest (imagine a set of two virtual vertical lines on the video frame with left and right pixel values, and then comparing the detected bounding box's left and right values with those predefined values). However, when I use this procedure I miss a lot of pedestrians even though they are detected by the program. That is, the program correctly classifies them as persons, but sometimes they don't meet the criteria that I defined for counting, and as such they are not counted. I want to know if there is a better way of counting unique pedestrians than the simplistic method that I am trying to develop. Is the approach that I am using the right one? Could there be better approaches? I would appreciate any kind of help.
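To make the counting criterion concrete, here is a minimal sketch. The exact comparison used in the modified visualization_utils.py is not shown, so `box_inside_roi` is an assumption (the whole box must lie between the two vertical lines), and `center_inside_roi` is a hypothetical, more lenient variant shown only for comparison. Boxes are assumed to be `(left, right, top, bottom)` tuples in pixels.

```python
def box_inside_roi(left, right, roi_left, roi_right):
    """Strict criterion (assumed): the whole box lies between the two lines."""
    return roi_left <= left and right <= roi_right


def center_inside_roi(left, right, roi_left, roi_right):
    """Lenient variant (hypothetical): only the box center must lie between the lines."""
    center = (left + right) / 2
    return roi_left <= center <= roi_right


def count_detections(boxes, roi_left, roi_right, criterion):
    """Count boxes given as (left, right, top, bottom) that satisfy the criterion."""
    return sum(
        1
        for (left, right, top, bottom) in boxes
        if criterion(left, right, roi_left, roi_right)
    )
```

A strict criterion like the first one can miss people whose boxes are wider than the gap between the lines or only partly overlap it. Note that either function counts detections in a single frame, not unique pedestrians across frames.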
Please go easy on me as I am not a machine learning expert and just a novice.
You are using a pretrained model which is trained to identify people in general. I think you're saying that some people are pedestrians whereas other people are not. For example, someone standing and waiting at the light is a pedestrian, but someone standing in their garden behind the street is not.
If I'm right, then you've reached the limitations of what you'll get with this model, and you will probably have to train a model yourself to do what you want.
Since you're new to ML, building your own dataset and training your own model probably sounds like a tall order, and there's a learning curve to be sure. So I'll suggest the easiest way forward. That is, use the object detection model to identify people, then train a new binary classification model (about the easiest model to train) to identify whether a particular person is a pedestrian or not (you will create a dataset of images and 1/0 values to identify them as pedestrian or not). I suggest this because a binary classification model is about as easy a model as you can get, and there are dozens of tutorials you can follow. Here's a good one:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/neural_network.ipynb
A few things to note when doing this:
When you build your dataset you will want a set of images, at least a few thousand along with the 1/0 classification for each (pedestrian or not pedestrian).
You will get much better results if you start with a model that is pretrained on imagenet than if you train it from scratch (though this might be a reasonable step-2 as it's an extra task). Especially if you only have a few thousand images to train it on.
Since your images will have multiple people in them, you have the problem of identifying which person you want the model to classify as a pedestrian or not. There's not necessarily one right way to do this. If you draw a yellow box around the person, the network may be successful in learning this notation. Another valid approach might be to remove the other people that were detected in the image by deleting them and leaving that area black. Centering on the target person may also be a reasonable approach.
My last bullet point illustrates a problem with the idea as I've proposed it. The best solution would be to alter the object detection network to output both a bounding box per person and a pedestrian/non-pedestrian classification with it, or to train the model to identify only pedestrians in the first place. I mention this as more optimal, but I consider it a more advanced task than my first suggestion, with a more complex dataset to manage. It's probably not the first thing you want to tackle as you learn your way around ML.
I have a use case where I have around 100 images each of 10000 unique items. I have 10 items with me, all of which are from the 10000 set, and I also know which 10 items they are, but only at the time of testing on live data. I now have to match the 10 items with their names. What would be an efficient way to recognise these items? I have full control of the training environment background and the testing environment background. If I make one model of all 10000 items, will it scale? Or should I make 10000 different models and run the 10 items on the 10 corresponding pretrained models?
Your question is about something called "one-vs-all classification". You can do a Google search for that; the first hit is a video lecture by Andrew Ng that's almost certainly worth watching.
The question has been long studied and in a plethora of contexts. The answer to your question does very much depend on what model you use. But I'll assume that, if you're doing image classification, you are using convolutional neural networks, because, after all, they're state of the art for most such image classification tasks.
In the context of convolutional networks, there is something called "multi-task learning" that you should read up on. Boiled down to a single sentence, the concept is that the more you ask the network to learn, the better it is at the individual tasks. So, in this case, you're almost certain to perform better training 1 model on 10,000 classes than training 10,000 models that each perform a one-vs-all classification.
Take for example the 1,000 class Imagenet dataset and CIFAR-10's 10 class dataset. It has been demonstrated in numerous papers that first training against Imagenet's 1,000 class dataset, and then simply replacing the last layer with a 10 class output and re-training on CIFAR-10's dataset, will produce a better result than training on CIFAR-10's dataset alone. There are admittedly multiple reasons for this result; for one, Imagenet is a larger dataset. But the richness of class labels in the Imagenet dataset, which acts as multi-task learning, is certainly among the reasons for this result.
So that was a long-winded way of saying: use one model with 10,000 classes.
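Since the question states that the 10 candidate items are known at test time, one way to use a single 10,000 class model is to compare scores only among those candidates. This is a sketch of that idea, not something taken from a particular library: `scores` is assumed to be the model's per-class output for one image, and `candidate_ids` the class indices of the 10 known items.

```python
def predict_among_candidates(scores, candidate_ids):
    """Return the candidate class id with the highest model score.

    scores: sequence of per-class scores (e.g. softmax outputs), indexed by class id.
    candidate_ids: class ids known to be possible for this test session.
    """
    return max(candidate_ids, key=lambda class_id: scores[class_id])
```

For example, with `scores = [0.1, 0.7, 0.05, 0.15]` and candidates `[0, 2, 3]`, class 1 is ignored even though it has the highest score overall, and class 3 is returned.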
An aside:
If you want to get really, really interesting, and jump into the realm of research level thinking, you might consider a 1-hot vector of 10,000 classes rather sparse and start thinking about whether you could reduce the dimensionality of your output layer using an embedding. An embedding would be a dense vector, let's say size 100 as a good starting point. Now class labels turn into clusters of points in your 100 dimensional space. I bet your network will perform even better under these conditions.
If this little aside didn't make sense, it's completely safe to ignore it; your 10,000 class output is fine. But if it did pique your interest, look up information on Word2Vec, and read this really nice post on how face recognition is achieved using embeddings: https://medium.com/@ageitgey/machine-learning-is-fun-part-4-modern-face-recognition-with-deep-learning-c3cffc121d78 You might also consider using an autoencoder to generate an embedding for the images (though I myself favor triplet embeddings as typically used in face recognition).
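To illustrate what "class labels turn into clusters of points" could mean at prediction time, here is a minimal sketch that assigns an embedding to the nearest class centroid. It assumes the network already outputs a dense embedding vector and that one centroid per class (for instance the mean of that class's training embeddings) has been computed; both the function name and the Euclidean distance are illustrative choices.

```python
import math


def nearest_class(embedding, class_centroids):
    """Return the class whose centroid is closest (Euclidean) to the embedding.

    class_centroids: dict mapping class name -> centroid vector of the same length.
    """
    return min(
        class_centroids,
        key=lambda name: math.dist(embedding, class_centroids[name]),
    )
```

With 2-dimensional toy centroids `{"mug": [0.0, 0.0], "pen": [1.0, 1.0]}`, the embedding `[0.9, 0.8]` is assigned to `"pen"`. A real setup would use something like the 100 dimensions mentioned above.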
I have to analyse some images of drops, taken with a microscope, which may contain a cell. What would be the best way to do this?
Every acquisition returns around a thousand pictures: every picture contains a drop, and I have to determine whether the drop has a cell inside or not. Every acquisition dataset has very different contrast and brightness, and the shape of the cells is slightly different on every setup due to micro variations in the focus of the microscope.
I have tried to create a classification model following the guide "TensorFlow for poets", defining two classes: empty drops and drops containing a cell. Unfortunately the result wasn't successful.
I have also tried labelling the cells and giving them to an object detection algorithm using DIGITS 5, but it does not detect anything.
I was wondering if these algorithms are designed to recognise more complex object or if I have done something wrong during the setup. Any solution or hint would be helpful!
Thank you!
This is a collage of drops from different samples: the cells are a bit different from every acquisition, due to the different setup and ambient lights
This kind of problem should definitely be solvable. I would suggest starting with a CIFAR-10 convolutional neural network tutorial and customizing it for your problem.
In future posts you should tell us how your training is progressing. Make sure you're outputting the following information every few steps (maybe every 10-100 steps):
Loss/cost function output, you should see your loss decreasing over time.
Classification accuracy on the current batch of your training data
Classification accuracy on a held out test set (if you've implemented test set evaluation, you might implement this second)
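A framework-independent sketch of the kind of progress line this refers to. The helper names and the output format are assumptions for illustration; `predictions` and `labels` are assumed to be plain sequences of class ids taken from whatever training loop is in use.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def format_progress(step, loss, batch_accuracy, test_accuracy=None):
    """Build one log line; test accuracy is optional until evaluation is implemented."""
    line = f"step {step:6d} | loss {loss:.4f} | batch acc {batch_accuracy:.3f}"
    if test_accuracy is not None:
        line += f" | test acc {test_accuracy:.3f}"
    return line
```

Inside a training loop this could be called every 10 to 100 steps, for example `print(format_progress(step, loss_value, accuracy(batch_predictions, batch_labels)))`, where the three variables come from the loop itself.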
There are many, many, many things that can go wrong, from bad learning rates to preprocessing steps that go awry. Neural networks are very hard to debug. They are very resilient to bugs, which makes it hard to even know if you have a bug in your code. For that reason, make sure you're visualizing everything.
Another very important step is to save the images exactly as you are passing them to TensorFlow. You will have them in matrix form, and you can save that matrix as an image. Do that immediately before you pass the data to TensorFlow. Make sure you are giving the network what you expect it to receive. I can't tell you how many times I and others I know have unknowingly passed garbage into the network. Assume the worst and prove yourself wrong!
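One dependency-free way to do this check, sketched under the assumption that a single grayscale image is available as a 2D list of numbers right before it is fed to the network. It uses the plain-text PGM format, which most image viewers can open; the helper names are illustrative. Values are stretched to 0..255 for viewing, so the raw minimum and maximum are returned separately to check the actual value range.

```python
def value_range(matrix):
    """Return (min, max) of the raw values, to verify the expected input scaling."""
    values = [v for row in matrix for v in row]
    return min(values), max(values)


def to_pgm_text(matrix):
    """Render a 2D matrix as plain-text PGM (P2), stretched to 0..255 for viewing."""
    lo, hi = value_range(matrix)
    scale = 255 / (hi - lo) if hi > lo else 0
    lines = ["P2", f"{len(matrix[0])} {len(matrix)}", "255"]
    for row in matrix:
        lines.append(" ".join(str(int((v - lo) * scale)) for v in row))
    return "\n".join(lines) + "\n"
```

Writing the result with `open("debug_input.pgm", "w").write(to_pgm_text(image))` (the file name is arbitrary) gives a file you can open and compare with what you expected the network to see.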
Your next post should look something like this:
I'm training a convolutional neural network in tensorflow
My loss function (sigmoid cross entropy) is decreasing consistently (show us a picture!)
My input images look like this (show us a picture of what you ACTUALLY FEED to the network)
My learning rate and other parameters are A, B, and C
I preprocessed the data by doing M and N
The accuracy the network achieves on training data (and/or test data) is Y
In answering those questions you're likely to solve 10 problems along the way, and we'll help you find the 11th and, with some luck, last one. :)
Good luck!