Training data in sentiment analysis - data-science

I'm doing sentiment analysis of tweets related to the recent acquisition of Twitter by Elon Musk. I have a corpus of 10,000 tweets and I'd like to apply machine learning models such as SVM and Linear Regression. My question is: to train the models correctly, do I have to manually tag a large portion of those 10,000 collected tweets as either positive or negative, or can I train on some other, already-tagged dataset of tweets that is unrelated to this topic? Thank you for your answers!

Related

When we use XGBoost in binary classification, do we need to min-max scale the input features?

Given the retweet dynamics of a tweet over a certain period T, we want to predict whether the tweet will spread widely.
The model we selected is XGBoost.
I have about 10 features. If I apply min-max scaling (or another standard scaling method), the performance of the model improves. If I skip this step, the performance does not improve.
However, many people have told me that there is no need to do this with XGBoost.
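The usual argument for "no scaling needed" is that a tree split depends only on the ordering of a feature's values, and min-max scaling preserves that ordering. The sketch below illustrates this with a toy single-split classifier scored by Gini impurity. It is not XGBoost's actual split-finding code, and the feature values and labels are invented for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def gini(indices, labels):
    """Gini impurity of the binary labels at the given row indices."""
    if not indices:
        return 0.0
    p = sum(labels[i] for i in indices) / len(indices)
    return 2 * p * (1 - p)


def best_split_partition(values, labels):
    """Return the set of row indices sent to the left by the best single split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = float("inf"), None
    for cut in range(1, len(order)):
        # A threshold cannot separate two rows that share the same value.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini(left, labels)
                 + len(right) * gini(right, labels)) / len(order)
        if score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical data: retweet counts and whether the tweet spread widely.
retweets = [3, 250, 12, 980, 45, 7, 610, 30]
widely_spread = [0, 1, 0, 1, 0, 0, 1, 0]

raw_split = best_split_partition(retweets, widely_spread)
scaled_split = best_split_partition(min_max_scale(retweets), widely_spread)
print(raw_split == scaled_split)
```

The rows end up on the same side of the split with or without scaling. If scaling still changes your results, the difference may come from somewhere else in the pipeline, for example scaling statistics computed on different data or a preprocessing step that also changes missing-value handling.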

TensorFlow Kitti-trained models: Detailed underlying training procedure

For my ML project I want to use the faster_rcnn_resnet101_kitti model from the TensorFlow model zoo. As the number of images in the Kitti dataset is extremely small for deep learning (about 7000 images), I was wondering how this small amount of data leads to such decent inference performance (mAP@0.5 = 87). One answer I can imagine is that the network was first trained on a different, richer dataset and then fine-tuned on Kitti, but I am not sure about it.
How can I find out the exact underlying training procedure (apart from pipeline.config) for the models published in the TF model zoo?
Thanks

When should I stop the object detection model training while mAP is not stable?

I am re-training SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and evaluating on 100 images from that dataset.
The problem is that after about 24 hours of training, the TotalLoss seems unable to go below 2.0.
And the corresponding mAP score is quite unstable.
In fact, I have tried training for about 48 hours, and the TotalLoss still cannot go below 2.0; it stays somewhere between 2.5 and 3.0. During that time, the mAP is even lower.
So here is my question: given my situation (I really don't need a "high-precision" model; as you can see, I picked 900 images for training and would simply like to do a PoC model training/prediction and that's it), when should I stop the training to obtain a reasonably performing model?
Indeed, for detection you need to fine-tune the network. Since you are using SSD, there are already some sources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (This one is specifically for an SSD model. It uses MXNet, but you can apply the same approach with TF.)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general, your error can be attributed to many factors, such as the learning rate you are using or the characteristics of the images themselves (are they normalized?). If the SSD network you are using was trained with normalized data and you don't normalize when retraining, you'll get stuck while learning. Also, what learning rate is being used?
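To make the normalization point concrete, here is a minimal sketch of scaling 8-bit pixel values to the range [-1, 1]. The target range and the `mean`/`scale` defaults are assumptions for illustration. The preprocessing your pretrained model actually expects, and whether your training framework already applies it internally, should be checked against the model's own configuration.

```python
def normalize_pixels(pixels, mean=127.5, scale=127.5):
    """Map pixel values from [0, 255] to [-1, 1].

    The mean and scale defaults are assumed values; use whatever the
    pretrained model was originally trained with.
    """
    return [(p - mean) / scale for p in pixels]


# One row of grayscale pixel values, purely illustrative.
row = [0, 127.5, 255]
normalized_row = normalize_pixels(row)
```

Whatever normalization you choose, it has to be applied identically at training, evaluation, and inference time.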
From the model zoo I can see that for SSD there are models trained on COCO and models trained on Open Images.
If, for example, you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration. Also make sure the input sizes are the same as the ones you provide to the model.
You can get very good detections even with little data if you also include many augmentations and take into account the other things I mentioned. More details on your code would help to see where the problem lies.
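As one example of an augmentation for detection, a horizontal flip has to mirror the bounding boxes along with the pixels. The sketch below assumes an image stored as a list of pixel rows and boxes given as normalized `(xmin, ymin, xmax, ymax)` tuples. Both representations are assumptions made for illustration, not the format of any particular library.

```python
def horizontal_flip(image, boxes):
    """Flip an image left-to-right and mirror its bounding boxes.

    image: list of rows, each row a list of pixel values.
    boxes: list of (xmin, ymin, xmax, ymax) with coordinates in [0, 1].
    """
    flipped_image = [row[::-1] for row in image]
    # After the flip, the old right edge becomes the new left edge.
    flipped_boxes = [
        (1.0 - xmax, ymin, 1.0 - xmin, ymax)
        for (xmin, ymin, xmax, ymax) in boxes
    ]
    return flipped_image, flipped_boxes


# Tiny illustrative image with one box covering the left half.
image = [[1, 2, 3],
         [4, 5, 6]]
boxes = [(0.0, 0.25, 0.5, 0.75)]
flipped_image, flipped_boxes = horizontal_flip(image, boxes)
```

Forgetting to transform the boxes is a common way for augmentation to hurt rather than help, since the labels then no longer match the image.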

Low validation accuracy after mobilenet transfer learning

I need a tensorflow model which recognizes a dog's breed. I downloaded the Stanford Dogs Dataset - 20,580 images in 120 categories (=breeds). I followed the procedure described in TensorFlow For Poets to retrain mobilenet_1.0_224. I used --how_many_training_steps=4000 and defaults for everything else. I got this tensorboard graph:
Training and validation accuracy
The validation accuracy is only about 80%.
What can I do to improve it?
In the research paper MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, the test accuracy using the 'MobileNet_1.0_224' architecture on the Stanford Dogs dataset is 83.3%, which seems in line with your results.
When you visually examine the Stanford Dogs Dataset, you will find that a lot of the breeds look similar, which makes it hard to reach a higher accuracy, even with state-of-the-art image classifiers. You might improve your results by grouping similar-looking breeds into larger categories.
Alternatively, you might tweak the training settings of the retrain.py script in the TensorFlow for Poets tutorial, but the gains will likely be marginal.
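The grouping idea can be sketched as a label mapping applied before scoring. The breed names, the groups, and the predictions below are hypothetical and chosen only to show the mechanics. Which breeds to merge is a judgment call that depends on your application.

```python
# Hypothetical grouping of visually similar breeds into coarser categories.
BREED_TO_GROUP = {
    "siberian_husky": "sled_dog",
    "malamute": "sled_dog",
    "eskimo_dog": "sled_dog",
}


def to_group(breed):
    """Map a breed to its coarse group; ungrouped breeds map to themselves."""
    return BREED_TO_GROUP.get(breed, breed)


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Invented labels and predictions for four images.
true_breeds = ["siberian_husky", "malamute", "pug", "eskimo_dog"]
pred_breeds = ["malamute", "malamute", "pug", "siberian_husky"]

fine_accuracy = accuracy(true_breeds, pred_breeds)
coarse_accuracy = accuracy(
    [to_group(b) for b in true_breeds],
    [to_group(b) for b in pred_breeds],
)
```

Confusions between breeds in the same group no longer count as errors, so accuracy rises, at the cost of a less specific prediction.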

How to train CNN models with additional categorical / numerical features?

In addition to the image itself in RGB, I also have a list of metadata / categorical / numerical features attached to each image.
e.g. Local time of day, day of week of when the photo was taken, GPS / city name of the photo, and a brief description of the photo (written by human).
How do you train a CNN model using tensorflow with additional features?
Generally, it is a hard problem to flexibly include all the various features (images, time, description, etc.) in one model. A CNN is designed for extracting information from images. Operations like convolution and pooling apply naturally to images, and generalizing them to other kinds of input takes a lot of design effort.
However, a CNN does help you summarize the information contained in an image. You could use the predictions of your CNN as features and feed them into another model.
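A minimal sketch of that last suggestion: treat the CNN output as one block of features and concatenate it with encoded metadata before passing the result to a second model. Here `cnn_features` stands in for an output you have already computed, such as class probabilities. The cyclical hour encoding and one-hot day encoding are illustrative choices, not something prescribed above.

```python
import math


def encode_hour(hour):
    """Encode hour of day (0-23) on a circle so that 23:00 is close to 00:00."""
    angle = 2 * math.pi * hour / 24
    return [math.sin(angle), math.cos(angle)]


def one_hot(index, size):
    """Return a list of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_feature_vector(cnn_features, hour, day_of_week):
    """Concatenate CNN output with encoded time-of-day and day-of-week.

    cnn_features: precomputed CNN output for one image (assumed given).
    hour: local hour the photo was taken, 0-23.
    day_of_week: 0 (Monday) through 6 (Sunday).
    """
    return list(cnn_features) + encode_hour(hour) + one_hot(day_of_week, 7)


# Hypothetical CNN output for one photo taken at 06:00 on a Wednesday.
features = build_feature_vector([0.1, 0.9], hour=6, day_of_week=2)
```

The resulting vectors can then be used to train any tabular model. Free-text descriptions would need their own encoding step, which is left out here.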