Dataset for training text classifier - text-mining

I'm new to data mining and I am trying to build a classifier that is able to classify student theses abstracts into a predefined set of categories under the area of Computer Science, e.g. Machine learning, Image Processing...etc.
I do not have enough classified abstracts to be used as a training dataset so would you please direct me to a dataset that can be used for this particular purpose.

You can use the DBLP data (downloadable from http://dblp.uni-trier.de/xml/) to generate a list of publications. Based on conferences/journal you can generate your classes e.g. MLJR is alway Machine Learning.
The abstracts you can acquire using:
https://github.com/arc12/Text-Mining-Weak-Signals/blob/master/Abstract%20Acquisition%20Scripts/DBLP%20XML%20fetch%20abstracts%20.pl

Related

Types of algirthms using to predict item inventory stock

I want to predict stock by analyzing stock mouvement.
Stock mouvement
enter image description here
STOCK :
enter image description here
Which step to start analzying data using machine learning.
Which algorithme ML and DL to use.
Thanks a lot
I need :
learn step to start analzying data using machine learning and deep learning.
Type of algorithme ML and DL to use.
Mohamed:
Your question requires a long and broad answer. I'll try to provide some steps and reference for you to explore further.
First of all you need the right data. The prediction will be as good as your data. So you need economic data, market trends, and company-specific events.
Then you need to follow the standard steps:
Collect and Preprocess Data: You'll need to gather stock data such as daily closing prices, trading volumes, and other relevant financial information. Preprocessing the data involves cleaning and transforming the data so that it can be used for modeling.
Feature Engineering: This involves creating new features from the existing data that can be used as inputs for the ML/DL models. For example, you can calculate technical indicators such as moving averages, relative strength index (RSI), and Bollinger Bands.
Split the Data: You'll need to split the data into training, validation, and testing sets. The training set is used to train the ML/DL models, the validation set is used to fine-tune the models, and the testing set is used to evaluate the performance of the models.
Select an Algorithm: There are various ML algorithms that can be used for stock price prediction. Some popular algorithms include:
ML Algorithms: Linear Regression, Random Forest, XGBoost, Support Vector Machines (SVM), etc.
Hope that helps.
Reference Book: https://learning.oreilly.com/api/v1/continue/9781800560796/
Machine Learning Engineering with MLflow - Chapter 2

How does custom object detection actually work?

I am currently testing out custom object detection using the Tensorflow API. But I don't quite seem to understand the theory behind it.
So if I for example download a version of MobileNet and use it to train on, lets say, red and green apples. Does it forget all the things that is has already been trained on? And if so, why does it then benefit to use MobileNet over building a CNN from scratch.
Thanks for any answers!
Does it forget all the things that is has already been trained on?
Yes, if you re-train a CNN previously trained on a large database with a new database containing fewer classes it will "forget" the old classes. However, the old pre-training can help learning the new classes, this is a training strategy called "transfert learning" of "fine tuning" depending on the exact approach.
As a rule of thumb it is generally not a good idea to create a new network architecture from scratch as better networks probably already exist. You may want to implement your custom architecture if:
You are learning CNN's and deep learning
You have a specific need and you proved that other architectures won't fit or will perform poorly
Usually, one take an existing pre-trained network and specialize it for their specific task using transfert learning.
A lot of scientific literature is available for free online if you want to learn. you can start with the Yolo series and R-CNN, Fast-RCNN and Faster-RCNN for detection networks.
The main concept behind object detection is that it divides the input image in a grid of N patches, and then for each patch, it generates a set of sub-patches with different aspect ratios, let's say it generates M rectangular sub-patches. In total you need to classify MxN images.
In general the idea is then analyze each sub-patch within each patch . You pass the sub-patch to the classifier in your model and depending on the model training, it will classify it as containing a green apple/red apple/nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the object detected.
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (Of course, the more sub-patches, the slower your model will be) and,
The classifier. The classifier is normally an already exisiting network (MobileNeet, VGG, ResNet...). This part is commonly used as the "backbone" and it will extract the features of the input image. With the classifier you can either choose to training it "from zero", therefore your weights will be adjusted to your specific problem, OR, you can load the weigths from other known problem and use them in your problem so you won't need to spend time training them. In this case, they will also classify the objects for which the classifier was training for.
Take a look at the Mask-RCNN implementation. I find very interesting how they explain the process. In this architecture, you will not only generate a bounding box but also segment the object of interest.

How to know what Tensorflow actually "see"?

I'm using cnn built by keras(tensorflow) to do visual recognition.
I wonder if there is a way to know what my own tensorflow model "see".
Google had a news showing the cat face in the AI brain.
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to take out the image in my own cnn networks.
For example, what my own cnn model recognize a car?
We have to distinguish between what Tensorflow actually see:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it as a result of some kind of reverse deconvolution (which is not a real math deconvolution in fact) process.
To answer to your real question, there is a lot of good example solution out there, one you can study it with success: Visualizing output of convolutional layer in tensorflow.
When you are building a model to perform visual recognition, you actually give it similar kinds of labelled data or pictures in this case to it to recognize so that it can modify its weights according to the training data. If you wish to build a model that can recognize a car, you have to perform training on a large train data containing labelled pictures. This type of recognition is basically a categorical recognition.
You can experiment with the MNIST dataset which provides with a dataset of pictures of digits for image recognition.

Tensorflow Stored Learning

I haven't tried Tensorflow yet but still curious, how does it store, and in what form, data type, file type, the acquired learning of a machine learning code for later use?
For example, Tensorflow was used to sort cucumbers in Japan. The computer used took a long time to learn from the example images given about what good cucumbers look like. In what form the learning was saved for future use?
Because I think it would be inefficient if the program should have to re-learn the images again everytime it needs to sort cucumbers.
Ultimately, a high level way to think about a machine learning model is three components - the code for the model, the data for that model, and metadata needed to make this model run.
In Tensorflow, the code for this model is written in Python, and is saved in what is known as a GraphDef. This uses a serialization format created at Google called Protobuf. Common serialization formats include Python's native Pickle for other libraries.
The main reason you write this code is to "learn" from some training data - which is ultimately a large set of matrices, full of numbers. These are the "weights" of the model - and this too is stored using ProtoBuf, although other formats like HDF5 exist.
Tensorflow also stores Metadata associated with this model - for instance, what should the input look like (eg: an image? some text?), and the output (eg: a class of image aka - cucumber1, or 2? with scores, or without?). This too is stored in Protobuf.
During prediction time, your code loads up the graph, the weights and the meta - and takes some input data to give out an output. More information here.
Are you talking about the symbolic math library, or the idea of tensor flow in general? Please be more specific here.
Here are some resources that discuss the library and tensor flow
These are some tutorials
And here is some background on the field
And this is the github page
If you want a more specific answer, please give more details as to what sort of work you are interested in.
Edit: So I'm presuming your question is more related to the general field of tensor flow than any particular application. Your question still is too vague for this website, but I'll try to point you toward a few resources you might find interesting.
The tensorflow used in image recognition often uses an ANN (Artificial Neural Network) as the object on which to act. What this means is that the tensorflow library helps in the number crunching for the neural network, which I'm sure you can read all about with a quick google search.
The point is that tensorflow isn't a form of machine learning itself, it more serves as a useful number crunching library, similar to something like numpy in python, in large scale deep learning simulations. You should read more here.

Retrain TF object detection API to detect a specific car model -- How to prepare the training data?

I am new to object detection and trying to retrain object-detection API in TensorFlow to detect a specific car model in photos. When preparing my own training data to retrain the model, besides things like drawing bounding boxes, etc, my question is, should I also prepare negative examples in the training data (cars that are not the model I am interested in) to reach good performance?
I have read through some tutorials and they usually give example in detecting one type of object, and they prepared training data with the label only for that type. I was thinking, since the model first proposal some area of interest, then try to classify those areas, should I also prepare negative examples if I want to detect very specific stuff from photos.
I am retaining faster_rcnn based model. Thanks for the help.
Yes, you will need negative examples also for better performance. Seems like are you thinking about using transfer learning to train a pre-trained faster_rcnn model to add a new class for your custom car. You should start an equal number of positive and negative examples (images with labelled bounding boxes). You will need have examples of several negative classes (e.g. negative car type 1, negative car type 2, negative car type 3) in addition to your target car type.
You can look at examples of one positive class and several negative classes training data for transfer learning in the data folder of the my github repo at: PSV Detector Github