Objective / subjective text classifier - text-mining

I am trying to build a classifier for subjective and objective text using IMDb data. For objective data points I use a movie's plot summary as input, whereas for subjective data points I use reviews of the movies.
I treat each complete plot summary as one data point, whereas for reviews each review by a single user is one data point. In my database, different reviews of the same movie by different users are stored as separate data points.
After that I stripped the words of special characters, removed stop words, calculated the information gain to build the word dictionary, and applied Naive Bayes using word frequencies to estimate the probabilities.
Now my questions are:
Is my algorithm for building the classifier correct?
My classifier is heavily biased toward the objective class. Am I making a mistake in the creation of the training data?
I want to create a generic classifier that can be used for tweets or text extracted from blogs. Is movie review data sufficient? Right now it is not working even on the movie review data.
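A minimal sketch of the pipeline described above, using scikit-learn's CountVectorizer and MultinomialNB in place of the hand-built information-gain dictionary; the toy texts and the fit_prior=False choice (a uniform class prior, which can help when one class dominates the data) are illustrative assumptions, not the asker's actual code:
'''
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real IMDb data: plot summaries form the objective
# class, user reviews the subjective class.
plot_summaries = [
    "A young lion prince flees his kingdom after the murder of his father.",
    "Two imprisoned men bond over the years and find redemption through acts of decency.",
]
reviews = [
    "I absolutely loved this film, the acting was breathtaking!",
    "Terribly boring, I regret wasting two hours of my life on it.",
]
texts = plot_summaries + reviews
labels = ["objective"] * len(plot_summaries) + ["subjective"] * len(reviews)

model = make_pipeline(
    # drops stop words and punctuation during tokenization, builds the vocabulary
    CountVectorizer(stop_words="english"),
    # word-frequency Naive Bayes; fit_prior=False uses a uniform class prior
    MultinomialNB(fit_prior=False),
)
model.fit(texts, labels)
print(model.predict(["What a stunning, emotional rollercoaster of a movie!"]))
'''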

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text, I can't simply put every word of the text into the training data, because the length would not be uniform across all training examples.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
is not the same length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
Therefore I could not make a column out of each word and then assign the distance in a row, because the lengths would differ.
How could I go about representing this in training data? I was under the assumption that training data had to be of the same type and length.
To solve this problem you can use something called pad_sequences. The process is roughly as follows: first transform the textual data into numbers with some vectorization technique (a tokenizer, TF-IDF, or any other method); once the text has been converted into sequences of numbers, use the shape (or the longest sequence) to find the maximum length you have, and then pass that maximum to the pad_sequences method. Here is how you apply it:
'''
from keras.preprocessing.sequence import pad_sequences

# sequences: your integer-encoded texts; max_len: the length of the longest sequence
padded_data = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
'''
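For completeness, a small end-to-end sketch of the same idea, assuming an older Keras where keras.preprocessing is still available (as in the snippet above) and a Keras Tokenizer for the integer encoding; the two example sentences are the ones from the question:
'''
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = [
    "I have a dream that my four little children will one day live in a nation "
    "where they will not be judged by the color of their skin but by the content "
    "of their character. I have a dream today.",
    "My fellow Americans, ask not what your country can do for you, ask what you "
    "can do for your country.",
]

# Integer-encode the words, then pad every sequence to the same length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

max_len = max(len(s) for s in sequences)
padded_data = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded_data.shape)  # (2, max_len): every row now has the same length
'''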

How to predict the winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this,
where each tournament has team members constantly shifting teams.
Based on the inputted teammates, the model should predict the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model; we need to convert it to numerical form before any model can consume it.
Since the maximum team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if there are fewer than 3 members in the team). Now we can use either label encoding or one-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder, and then use its transform function on each column individually (since names may be shared across the 3 columns).
With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would prefer not to use it for an initial simple model.
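A minimal sketch of that encoding step, assuming a toy dataframe with columns player1/player2/player3 and a position label (all names and numbers below are made up for illustration):
'''
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy tournament data: one row per team, up to three members, final position as the label
df = pd.DataFrame({
    "player1": ["alice", "bob",  "carol"],
    "player2": ["dave",  "erin", "alice"],
    "player3": ["frank", "",     "bob"],   # blank when the team has fewer than 3 members
    "position": [1, 3, 2],
})

player_cols = ["player1", "player2", "player3"]

# Fit one encoder on the combined list of names from all three columns,
# so the same player maps to the same integer regardless of the column.
encoder = LabelEncoder()
encoder.fit(pd.concat([df[c] for c in player_cols]))

for c in player_cols:
    df[c] = encoder.transform(df[c])

X, y = df[player_cols], df["position"]   # ready for a tree-based model
print(df)
'''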

Can I use labeled data and rule-based matching for multiclass text classification with Spacy?

I have some labeled data (as few as 1,000 examples; shape: text, category) and up to 10k unlabeled examples. I want to use Spacy's rule-based matching linguistic tool to define a pattern for every category. After that, I would like to train a new model using the rules together with the data I have already labeled. Is this possible? I've seen a tutorial on YouTube (linked below) that does something similar, but it uses the labeled data to determine whether a sentence contains some entity, whereas I want to put a label on an entire paragraph.
https://www.youtube.com/watch?v=IqOJU1-_Fi0
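One rough sketch of the idea: use Spacy's Matcher to assign a category to each unlabeled paragraph whenever one of the hand-written patterns fires, then merge those weakly labeled paragraphs with the existing labeled data before training a text classifier (for example Spacy's textcat component). The category names and patterns below are purely illustrative assumptions:
'''
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")            # a blank pipeline is enough for rule matching
matcher = Matcher(nlp.vocab)

# Hypothetical rule patterns, one match id per category
matcher.add("SPORTS", [[{"LOWER": {"IN": ["goal", "match", "tournament"]}}]])
matcher.add("FINANCE", [[{"LOWER": {"IN": ["stock", "dividend", "earnings"]}}]])

def weak_label(text):
    """Return a category name if any rule fires on the paragraph, else None."""
    doc = nlp(text)
    matches = matcher(doc)
    if matches:
        match_id, _, _ = matches[0]
        return nlp.vocab.strings[match_id]
    return None

unlabeled = [
    "The striker scored a late goal to win the match.",
    "The company raised its dividend after strong earnings.",
]
weakly_labeled = [(t, weak_label(t)) for t in unlabeled if weak_label(t)]
print(weakly_labeled)
# These (text, category) pairs can then be combined with the hand-labeled data
# and used to train the classifier.
'''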

Multiple trained models vs Multiple features and one model

I'm trying to build a regression-based ML model using TensorFlow to estimate an object's ETA based on the following:
distance from target
distance from target (X component)
distance from target (Y component)
speed
The object travels on specific journeys. This could be represented as from A->B or from A->C or from D->F (POINT 1 -> POINT 2). There are 500 specific journeys (between a set of points).
These journeys aren't completely straight lines, and every journey is different (i.e. the shape of the route taken varies).
I have two ways of getting around this problem:
I can have 500 different models with 4 features and one label(the training ETA data).
I can have 1 model with 5 features and one label.
My dilemma is that option 1 adds complexity, but it will be more accurate since every model will be specific to one journey.
If I use option 2, the model will be pretty simple, but I don't know whether it would work properly. The new feature I would add is originCode + destinationCode. Unfortunately these are not quantifiable in a way that makes numerical sense or forms a pattern; they're just text that defines the journey (for journey A->B, the feature would be 'AB').
Is there some way that I can use one model and categorize the features so that one feature is just a 'grouping' feature (in order to separate the training data with respect to the journey)?
In ML, I believe that option 2 is generally the better option. We prefer general models rather than tailoring many models to specific tasks, as that gets dangerously close to hardcoding, which is what we're trying to get away from by using ML!
I think that, depending on the training data you have available and the model size, a one-hot vector could be used to describe the start/end points for the model. For example, say we have 5 points (A, B, C, D, E) and we are going from position B to position C; this could be represented by the vector:
0100000100
as in, the first five values correspond to the origin spot whereas the second five are the destination. It is also possible to combine these if you want to reduce your input feature space to:
01100
There are other things to consider, as Scott has said in the comments:
How much data do you have? Maybe the feature space will be too big this way, I can't be sure. If you have enough data, the model will effectively learn the general distances between points (not explicitly, but implicitly from the data).
If you have enough data, you might even be able to accurately predict between two points you don't have data for!
If it does come down to not having enough data, then finding representative features of the journey will come into use, ie. length of journey, shape of the journey, elevation travelled etc. Also a metric for distance travelled from the origin could be useful.
Best of luck!
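A small sketch of the one-hot idea, assuming the journeys are given as (origin, destination) letter codes and pandas is used for the encoding; the column names and example rows are made up:
'''
import pandas as pd

# Example journeys plus the numeric features from the question
df = pd.DataFrame({
    "origin":      ["B", "A", "D"],
    "destination": ["C", "B", "F"],
    "distance":    [12.4, 30.1, 8.7],
    "distance_x":  [10.0, 25.0, 5.0],
    "distance_y":  [7.3, 16.8, 7.1],
    "speed":       [4.2, 6.0, 3.3],
})

# One-hot encode origin and destination separately (the "0100000100" layout);
# get_dummies creates one 0/1 column per observed point code.
features = pd.get_dummies(df, columns=["origin", "destination"])
print(features.columns.tolist())

# The combined "01100" layout ORs the two indicator blocks together instead:
points = sorted(set(df["origin"]) | set(df["destination"]))
combined = pd.DataFrame({
    p: ((df["origin"] == p) | (df["destination"] == p)).astype(int) for p in points
})
print(combined)
'''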
I would be inclined to lean toward individual models. This is because, for a given position along a given route and a constant speed, the ETA is a deterministic function of time. If one moves monotonically closer to the target along the route, it is also a deterministic function of distance to target. Thus, there is no information to transfer from one route to the next, i.e. "lumping" their parameters offers no a priori benefit. This is assuming, of course, that you have several "trips" worth of data along each route (i.e. (distance, speed) collected once per minute, or some such). If you have only, say, one datum per route then lumping the parameters is a must. However, in such a low-data scenario, I believe that including a dummy variable for "which route" would ultimately be fruitless, since that would introduce a number of parameters that rivals the size of your dataset.
As a side note, NEITHER of the models you describe could handle new routes. I would be inclined to build an individual model per route, data quantity permitting, and a single model neglecting the route identity entirely just for handling new routes, until sufficient data is available to build a model for that route.
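To make that two-tier idea concrete, a rough sketch using scikit-learn linear regressions in place of the TensorFlow model; the column names, toy rows, and route codes are illustrative assumptions:
'''
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["distance", "distance_x", "distance_y", "speed"]

# One row per observation, with the route id and the observed ETA as the label
trips = pd.DataFrame({
    "route":      ["AB", "AB", "AC", "AC", "DF"],
    "distance":   [12.0, 6.0, 20.0, 10.0, 15.0],
    "distance_x": [10.0, 5.0, 16.0, 8.0, 12.0],
    "distance_y": [6.6, 3.3, 12.0, 6.0, 9.0],
    "speed":      [4.0, 4.0, 5.0, 5.0, 3.0],
    "eta":        [3.0, 1.5, 4.0, 2.0, 5.0],
})

# One model per route, data quantity permitting...
route_models = {
    route: LinearRegression().fit(group[FEATURES], group["eta"])
    for route, group in trips.groupby("route")
}

# ...plus a single route-agnostic fallback model for routes with no data yet.
fallback = LinearRegression().fit(trips[FEATURES], trips["eta"])

def predict_eta(route, features):
    model = route_models.get(route, fallback)
    return float(model.predict(features)[0])

query = pd.DataFrame([[8.0, 6.5, 4.5, 4.0]], columns=FEATURES)
print(predict_eta("AB", query))   # uses the route-specific model
print(predict_eta("ZZ", query))   # unseen route, falls back to the general model
'''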

Tensorflow: pattern training and generation

Imagine I have hundreds of rectangular patterns that look like the following:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
Say the only variables for the different patterns are the dimensions (width by height in characters) and the value at each cell within the rectangle, with possible characters _ y x z 0. So another pattern might look like this:
yx0x_x
xz_x0_
_yy0x_
zyy0__
and another like this:
xx0z00yy_z0x000
zzx_0000_xzzyxx
_yxy0y__yx0yy_z
_xz0z__0_y_xz0z
y__x0_0_y__x000
xz_x0_z0z__0_x0
These simplified examples were randomly generated, but imagine there is a deeper structure and relation between dimensions and layout of characters.
I want to train on this dataset in an unsupervised fashion (no labels) in order to generate similar output. Assuming I have created my dataset appropriately with tf.data.Dataset and categorical identity columns:
what is a good general purpose model for unsupervised training (no labels)?
is there a Tensorflow premade estimator that would represent such a model well enough?
once I've trained the model, what is a general approach to using it to generate patterns based on what it has learned? I have in mind Google Magenta, which can be trained on a dataset of musical melodies in order to generate similar ones from a kind of seed/primer melody.
I'm not looking for a full implementation (that's the fun part!), just some suggested tutorials and next steps to follow. Thanks!
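On the data-preparation side only, here is one possible way to turn the character grids into integer tensors and wrap them in a tf.data.Dataset; the fixed padding size and the character-to-index mapping are assumptions, not anything the question specifies:
'''
import tensorflow as tf

# Each pattern is a list of equal-length strings over the alphabet _ y x z 0
patterns = [
    ["_yx_0zzyxx",
     "_0__yz_0y_",
     "x0_0x000yx"],
    ["yx0x_x",
     "xz_x0_",
     "_yy0x_"],
]

# Map each character to an integer id; 0 is reserved for padding
vocab = {"_": 1, "y": 2, "x": 3, "z": 4, "0": 5}
MAX_H, MAX_W = 8, 16   # assumed upper bounds on pattern height and width

def encode(pattern):
    grid = [[vocab[c] for c in row] for row in pattern]
    t = tf.constant(grid, dtype=tf.int32)
    # Pad every pattern to the same MAX_H x MAX_W shape so they can be batched
    return tf.pad(t, [[0, MAX_H - t.shape[0]], [0, MAX_W - t.shape[1]]])

dataset = tf.data.Dataset.from_tensor_slices(tf.stack([encode(p) for p in patterns]))
for example in dataset.take(1):
    print(example.shape)   # (8, 16)
'''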