Can I use labeled data and rule-based matching for multiclass text classification with spaCy?

I have some labeled data (as few as 1,000 rows; shape: text, category) and up to 10k unlabeled rows. I want to use spaCy's rule-based matching to define a pattern for every category. After that, I would like to train a new model using the rules together with the data I've labeled. Is this possible? I've seen a tutorial on YouTube (linked below) that does something similar, but it uses the labeled data to determine whether a sentence contains some entity. I, on the other hand, want to put a label on an entire paragraph.
https://www.youtube.com/watch?v=IqOJU1-_Fi0
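To make the idea concrete, here is a rough sketch of what I have in mind (the patterns and category names below are made up): use spaCy's Matcher to weakly label the unlabeled paragraphs, then merge those weak labels with the hand-labeled rows to train a text classifier.
'''
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# One illustrative token pattern per category (these are made up).
matcher.add("SPORTS", [[{"LOWER": {"IN": ["match", "tournament", "goal"]}}]])
matcher.add("FINANCE", [[{"LOWER": {"IN": ["stock", "market", "dividend"]}}]])

def weak_label(text):
    """Return the category whose patterns match most often in the paragraph."""
    doc = nlp(text)
    counts = {}
    for match_id, start, end in matcher(doc):
        label = nlp.vocab.strings[match_id]
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get) if counts else None

# Run this over the ~10k unlabeled paragraphs, keep the confident hits,
# and merge them with the ~1,000 hand-labeled rows before training a classifier.
print(weak_label("The tournament ended with a dramatic late goal."))  # SPORTS
'''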

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text in advance, I can't simply use every word in the text as training data, because the feature vectors would not be uniform across all training examples.
For example, the text:
"I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
is not the same length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
So I could not make a column for each word and then record the distances in a row, because the lengths would be different.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.
To solve this problem you can use padding (pad_sequences). The process: first transform the data through some word-embedding or vectorization technique such as TF-IDF (or any other algorithm); after converting the textual data into vectors, use the shape method to find the maximum length you have; then pass that maximum to pad_sequences. Here is how you implement this method:
'''
from keras.preprocessing.sequence import pad_sequences

# sequences: your data as a list of variable-length numeric sequences
# max_length: the maximum sequence length found in the data
padded_data = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
'''
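For concreteness, a minimal self-contained example (the toy sequences are made up):
'''
from keras.preprocessing.sequence import pad_sequences

sequences = [[3, 7, 1], [5, 2], [9, 4, 8, 6, 2]]  # tokenized texts of different lengths
max_length = max(len(s) for s in sequences)       # 5 in this toy example

padded = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
print(padded)
# [[3 7 1 0 0]
#  [5 2 0 0 0]
#  [9 4 8 6 2]]
'''
Every row now has the same length, so the result can be fed to a model as a uniform training matrix.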

How to predict the winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this: one row per team per tournament, with team members constantly shifting teams between tournaments.
Based on the input teammates, the model should predict the team's position. Does anyone have suggestions on how I can build a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
As for how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model; we need to convert it to numerical form before any model can consume it.
Since the maximum team size is 3 in most cases, we can split the names across three columns (leaving a column blank if a team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call transform on each column individually (since the same names may appear in any of the 3 columns). With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, as there will be many names, so I would prefer not to use it for an initial simple model.
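A minimal sketch of the encoding step described above (the column names and data are illustrative):
'''
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative data: one row per team, up to three members, and the final position.
df = pd.DataFrame({
    'member1': ['alice', 'bob', 'carol'],
    'member2': ['dave', 'alice', 'bob'],
    'member3': ['erin', None, 'dave'],
    'position': [1, 3, 2],
})

member_cols = ['member1', 'member2', 'member3']
df[member_cols] = df[member_cols].fillna('')  # blank slot when a team has fewer than 3 members

# Fit one encoder on the pooled names so the same player gets the same
# integer no matter which column they appear in.
encoder = LabelEncoder()
encoder.fit(pd.concat([df[c] for c in member_cols]))

for c in member_cols:
    df[c] = encoder.transform(df[c])

print(df)  # numeric member columns, ready for a tree-based model
'''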

How to pass a list of numbers as a single feature to a neural network?

I am trying to cluster sentences by clustering their sentence embeddings taken from a fastText model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I tried t-SNE, PCA, and UMAP, and I wanted to see how an autoencoder works for my data.
Now, does it make sense to pass those 300 numbers for each sentence as separate features to the NN, or should they be passed as a single entity? If the latter, is there a way to pass a list as a feature to a NN?
I tried passing the 300 numbers as individual features and clustering the output. I got very few meaningful clusters; the rest were either noise or clusters of dissimilar sentences grouped together (with other techniques like UMAP I got considerably more meaningful clusters). Any leads would be helpful. Thanks in advance :)
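To make the setup concrete, here is a rough sketch of the kind of autoencoder I have in mind (the layer sizes and training settings are illustrative, and the random array stands in for my fastText embeddings):
'''
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

embeddings = np.random.rand(1000, 300).astype('float32')  # stand-in for fastText sentence embeddings

inputs = Input(shape=(300,))
hidden = Dense(128, activation='relu')(inputs)
bottleneck = Dense(50, activation='linear')(hidden)   # the 50-dimensional representation
hidden2 = Dense(128, activation='relu')(bottleneck)
outputs = Dense(300, activation='linear')(hidden2)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, bottleneck)  # reuse the encoder half after training

autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(embeddings, embeddings, epochs=10, batch_size=32, verbose=0)

reduced = encoder.predict(embeddings)  # shape (1000, 50), ready for clustering
'''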

Tensorflow: pattern training and generation

Imagine I have hundreds of rectangular patterns that look like the following:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
Say the only variables across patterns are the dimensions (width by height in characters) and the value at each cell within the rectangle, drawn from the possible characters _ y x z 0. So another pattern might look like this:
yx0x_x
xz_x0_
_yy0x_
zyy0__
and another like this:
xx0z00yy_z0x000
zzx_0000_xzzyxx
_yxy0y__yx0yy_z
_xz0z__0_y_xz0z
y__x0_0_y__x000
xz_x0_z0z__0_x0
These simplified examples were randomly generated, but imagine there is a deeper structure and relation between dimensions and layout of characters.
I want to train on this dataset in an unsupervised fashion (no labels) in order to generate similar output. Assuming I have created my dataset appropriately with tf.data.Dataset and categorical identity columns:
what is a good general purpose model for unsupervised training (no labels)?
is there a Tensorflow premade estimator that would represent such a model well enough?
once I've trained the model, what is a general approach to using it to generate patterns based on what it has learned? I have in mind Google Magenta, which can be trained on a dataset of musical melodies in order to generate similar ones from a kind of seed/primer melody.
I'm not looking for a full implementation (that's the fun part!), just some suggested tutorials and next steps to follow. Thanks!
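To make the dataset assumption concrete, here is a rough sketch of just the encoding step (the vocabulary mapping and flatten-then-pad scheme are assumptions on my part, not part of any model):
'''
import tensorflow as tf

# Map each cell character to an integer id; 0 is reserved for padding.
vocab = {'_': 1, 'y': 2, 'x': 3, 'z': 4, '0': 5}

patterns = [
    ["_yx_0zzyxx", "_0__yz_0y_", "x0_0x000yx"],
    ["yx0x_x", "xz_x0_", "_yy0x_", "zyy0__"],
]

def encode(grid):
    # Flatten the grid row by row into a sequence of integer ids.
    return [vocab[c] for row in grid for c in row]

encoded = [encode(g) for g in patterns]

# Pad the flattened sequences to a common length so they can be batched.
padded = tf.ragged.constant(encoded).to_tensor(default_value=0)
dataset = tf.data.Dataset.from_tensor_slices(padded)

for example in dataset.take(1):
    print(example)
'''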

Objective vs. subjective text classifier

I am trying to build a classifier for subjective and objective text using IMDb data. For objective data points I use a movie's plot summary as input, whereas for subjective data points I use movie reviews.
I took each complete plot summary as one data point, whereas for reviews each review by a single user is a single data point. In my database, different reviews of the same movie by different users are entered as separate data points.
After this I stripped special characters from the words, removed stop words, calculated the information gain to create the word dictionary, and applied Naive Bayes using word frequencies to calculate the probabilities.
Now my questions are:
Is my algorithm for building the classifier correct?
My classifier is heavily biased toward objective. Am I making a mistake in creating the training data?
I want to create a generic classifier that can be used for tweets or text extracted from blogs. Is movie review data sufficient? Right now it is not working even for movie review data.
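To make my pipeline concrete, here is a rough scikit-learn equivalent of what I am doing (it skips my information-gain step, uses CountVectorizer plus MultinomialNB instead of my hand-rolled Naive Bayes, and the example texts are placeholders):
'''
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data: plot summaries labeled objective, reviews labeled subjective.
texts = [
    "A detective investigates a series of murders in a small coastal town.",
    "I absolutely loved this film; the acting was phenomenal!",
]
labels = ["objective", "subjective"]

# CountVectorizer handles lowercasing, tokenization, and stop-word removal;
# MultinomialNB then works from the resulting word frequencies.
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(),
)
model.fit(texts, labels)

print(model.predict(["The plot follows a young woman who moves to Paris."]))
'''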