How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text in advance, I can't simply use every word in the text as training data, because the input length would not be uniform across all training examples.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
Therefore I could not make a column out of each word and then assign the distances in a row, because the lengths would differ between texts.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.

To solve this problem you can use padding via pad_sequences. First convert the textual data into numeric sequences, e.g. by tokenizing each text into a list of integer word IDs (note that pad_sequences expects token-ID sequences rather than TF-IDF vectors, which are already fixed-length). Then determine the maximum sequence length in your data (for example with the shape method, or a simple max over the lengths) and pass it as maxlen to pad_sequences, which pads shorter sequences and truncates longer ones so that every example has the same length. Here is how you use this method:
'''
from keras.preprocessing.sequence import pad_sequences

# sequences: a list of integer token-ID lists, one per document
# max_len: the length of the longest sequence in your data
padded_data = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
'''
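For context, here is a minimal end-to-end sketch using the Keras Tokenizer (the two texts from the question stand in for your corpus; the variable names are illustrative):

'''
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

texts = [
    "I have a dream that my four little children will one day live in a nation "
    "where they will not be judged by the color of their skin but by the content "
    "of their character. I have a dream today.",
    "My fellow Americans, ask not what your country can do for you, ask what you "
    "can do for your country.",
]

# Map each word to an integer ID and convert the texts to ID sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad every sequence to the length of the longest one
max_len = max(len(s) for s in sequences)
padded_data = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded_data.shape)  # (2, max_len): uniform length, ready for training
'''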

Related

how to find closeness between two keras pad_sequences?

I am writing a small proof of concept where I turn a catalog into a JSON that has a URL and a label that explains the web page. I read this JSON in Python, tokenize it, and create padded sequences with pad_sequences.
I then need to compare some free-flow texts to find which index of the padded sequences shares the most words with the free-flow text.
I am generating padded sequences from that text too, but I am not sure whether I can somehow compare the two sequences for closeness.
Please help.
You can use cosine similarity or euclidean distance to compare two vectors.
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/CosineSimilarity
https://www.tutorialexample.com/calculate-euclidean-distance-in-tensorflow-a-step-guide-tensorflow-tutorial/
For sequences, first embed them into fixed-length vectors of the same dimension, then compare those vectors.
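A minimal sketch of both comparisons (the vectors here are placeholders; in practice they would be your fixed-length embeddings):

'''
import tensorflow as tf

# Two fixed-length vectors (illustrative values)
a = tf.constant([[1.0, 2.0, 3.0]])
b = tf.constant([[2.0, 4.0, 6.0]])

# Cosine similarity via the Keras metric linked above (1.0 = same direction)
m = tf.keras.metrics.CosineSimilarity(axis=-1)
m.update_state(a, b)
print(m.result().numpy())  # 1.0 here, since b is a scaled copy of a

# Euclidean distance: 0.0 would mean identical vectors
print(tf.norm(a - b, axis=-1).numpy())
'''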

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this (the example table from the original post is not shown), where each tournament has team members constantly shifting teams.
Based on the inputted teammates, the model should make a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
As to how to create this sheet: you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model, since any model needs the input in numerical form.
Since the maximum team size is 3 in most cases, you can split the names into three columns (leaving a column blank if the team has fewer than 3 members). Then use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually, since the same names may appear in any of the three columns.
With label encoding you can easily use tree-based models. One-hot encoding might lead to a curse of dimensionality, since there will be many distinct names, so I would not use it for an initial simple model. A sketch of the encoding step is shown below.
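A minimal sketch of the shared-encoder idea (the names and positions are made up for illustration; '<none>' is an arbitrary marker for an empty slot):

'''
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: up to three teammates per team plus the finishing position
df = pd.DataFrame({
    'member1': ['alice', 'bob', 'carol'],
    'member2': ['dave', 'erin', 'alice'],
    'member3': ['frank', None, 'bob'],  # blank slot for a two-person team
    'position': [1, 3, 2],
})

member_cols = ['member1', 'member2', 'member3']
df[member_cols] = df[member_cols].fillna('<none>')

# Fit one encoder on the combined pool of names so that a name shared
# across columns always maps to the same integer ID
encoder = LabelEncoder()
encoder.fit(pd.concat([df[c] for c in member_cols]))
for c in member_cols:
    df[c] = encoder.transform(df[c])

X, y = df[member_cols], df['position']  # ready for a tree-based model
'''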

How to pass a list of numbers as a single feature to a neural network?

I am trying to cluster sentences by clustering their sentence embeddings, taken from a fastText model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I tried t-SNE, PCA, and UMAP, and I wanted to see how an autoencoder works for my data.
Now, does it make sense to pass those 300 numbers for each sentence as separate features to the NN, or should they be passed as a single entity? If the latter, is there a way to pass a list as a single feature to an NN?
I tried passing the 300 numbers as individual features and clustered the output. I could get only a few meaningful clusters; the rest were either noise or clusters grouping dissimilar sentences (whereas with other techniques like UMAP I got far more meaningful clusters in greater number). Any leads would be helpful. Thanks in advance :)
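For what it's worth, in a framework like Keras the 300 numbers are passed together as one input vector of shape (300,) rather than as 300 unrelated features. A minimal autoencoder sketch under that assumption (the layer sizes come from the question; everything else is illustrative):

'''
from keras.models import Model
from keras.layers import Input, Dense

# Each sentence embedding is one input vector of shape (300,)
inp = Input(shape=(300,))
encoded = Dense(50, activation='relu')(inp)          # 50-dim bottleneck
decoded = Dense(300, activation='linear')(encoded)   # reconstruct the embedding

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X, X, ...)  # X: array of shape (n_sentences, 300)

encoder = Model(inp, encoded)  # encoder.predict(X) gives the 50-dim codes
'''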

Inverse transform word count vector to original document

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse transform the word count vectors back to the original documents. CountVectorizer does have an inverse_transform(X) function, but it only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?
CountVectorizer is "lossy": for a document such as
"This is the amazing string in amazing program", it will only store the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it you can create a document with the same words repeated the same number of times, but their original order in the document cannot be recovered.
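A small sketch demonstrating the loss (assuming scikit-learn >= 1.0 for get_feature_names_out):

'''
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

# Only word counts survive, not positions
print(dict(zip(vec.get_feature_names_out(), X.toarray()[0])))
# {'amazing': 2, 'in': 1, 'is': 1, 'program': 1, 'string': 1, 'the': 1, 'this': 1}

# inverse_transform returns the unique non-zero tokens, in no particular order
print(vec.inverse_transform(X))
'''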

Does Word2Vec maintain the sequential information of the input text?

I ask because I'd like to use it to process the text input for my LSTM.
Any feedback would be much appreciated.
As the name suggests, it is "word" to vector: it represents each individual word in vector form, placing similar words close together in the vector space.
For example, 'cat' and 'kitten' have similar meanings, so their vector representations will be close to each other, whereas the vector for 'man' will sit much farther away in the same space. Word2Vec itself therefore does not encode word order; it maps each word to a vector independently, and the sequential information is preserved simply by feeding the word vectors to your LSTM in the original order.
Here is a beautiful blog post which talks about word2vec in detail.
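A tiny sketch of the lookup in gensim (>= 4.0; the toy corpus is illustrative, so the similarity value is meaningless here):

'''
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [
    ['the', 'cat', 'chased', 'the', 'kitten'],
    ['the', 'man', 'walked', 'the', 'dog'],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

vec = model.wv['cat']  # one fixed vector per word; no position information
print(model.wv.similarity('cat', 'kitten'))

# For an LSTM, keep the original token order and look up each word's vector
sequence = [model.wv[w] for w in sentences[0]]  # 5 vectors of dimension 50
'''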