How to choose num_words parameter for keras Tokenizer? - tensorflow

tokenizer = Tokenizer(num_words=my_max)
I am using the Keras preprocessing Tokenizer to process a corpus of text for a machine learning model. One of its parameters is num_words, which defines the number of words in the dictionary. How should this parameter be picked? I could choose a huge number and guarantee that every word is included, but words that appear only once might be more useful if grouped together under a single "out of vocabulary" token. What is the strategy for setting this parameter?
My particular use case is a model processing tweets, so every entry is below 140 characters and there is some overlap in the types of words that are used. The model is for a Kaggle competition about extracting the text that exemplifies a sentiment (e.g. "my boss is bullying me" returns "bullying me").

The base question here is "What kinds of words establish sentiment, and how often do they occur in tweets?"
Which, of course, has no hard and fast answer.
Here's how I would solve this:
Preprocess your data so you remove conjunctions, stop words, and "junk" from tweets.
Get the number of unique words in your corpus. Are all of these words essential to convey sentiment?
Analyze the words with the top frequencies. Are these words that convey sentiment? Could they be removed in your preprocessing? The tokenizer keeps words by frequency, not by order of appearance: only the most common num_words-1 words are used when texts are converted to sequences, so these popular words are the ones that will end up in your vocabulary.
Then, I would begin experimenting with different values, and see the effects on your output.
Apologies for no "real" answer. I would argue that there is no single true strategy to choosing this value. Instead, the answer should come from leveraging the characteristics and statistics of your data.
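One way to ground the experimentation suggested above is to measure how much of the corpus a given vocabulary size covers. The sketch below is only an illustration: it uses plain whitespace tokenization rather than the Keras Tokenizer, and the function name coverage_by_vocab_size and the sample tweets are made up for the example.

```python
from collections import Counter


def coverage_by_vocab_size(texts, sizes):
    """Map each vocabulary size to the fraction of token occurrences it covers."""
    # Naive whitespace tokenization (an assumption; Keras also strips punctuation).
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    # Word counts sorted from most to least frequent.
    ranked = [count for _, count in counts.most_common()]
    return {size: sum(ranked[:size]) / total for size in sizes}


# Hypothetical sample data, for illustration only.
tweets = [
    "my boss is bullying me",
    "my day is great",
    "my boss is great",
]
coverage = coverage_by_vocab_size(tweets, sizes=[1, 3, 7])
for size, fraction in coverage.items():
    print(f"top {size} words cover {fraction:.0%} of all tokens")
```

If coverage flattens out well before the full vocabulary size, the remaining words are rare and are reasonable candidates for the out-of-vocabulary token. Keep in mind that Keras keeps num_words-1 words, so a target of N kept words means passing num_words=N+1.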

Related

How to extract the keywords on which universal sentence encoder was trained on?

I am using Universal Sentence Encoder (USE) to encode some documents into 512-dimensional embeddings. These are then used to find items similar to a search query, which is also encoded using USE. USE works pretty well on general English words in search queries and documents, but performs really badly when the search query contains rare keywords such as people's names. I am thinking of adding a re-ranker over the search results that takes into account the number of rare words present in the search query and in the retrieved document. This should boost the scores of documents that contain known words while reducing the scores of documents that contain unknown words.
My question is: how do I get the grammar of Universal Sentence Encoder in order to implement such a re-ranker?

Interpret the Doc2Vec Vectors Clusters Representation

I am new to Doc2Vec, please bear with the naive questions.
I have generated Doc2Vec vectors, i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use model.similar for doc1 and get the output that doc5 and doc10 are similar to doc1.
Q1) How can I summarize, in code, the important words or a high-level summary of what this document holds?
In addition, if I take the array output and run K-means to get 5 clusters, how do I define what each cluster represents?
Q2) I can read the documents but the number of documents is very high and doing a manual read to find the cluster definition is not possible.
There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of same).
Theoretically, the model could do something that's sort of the opposite of doc-vector inference. It could take a doc-vector – perhaps one corresponding to an existing document – and then provide it to the model, run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative-sampling, those nodes map one-to-one with known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N "most-associated" words with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
(I'm not sure what your Q2 question actually is.)
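To make the "run the model forward" idea a bit more concrete, here is a minimal pure-Python sketch. It is not gensim code and not an existing capability: the function top_words_for_doc_vector and the toy weights are hypothetical, and it simply assumes you can obtain one output-layer weight vector per vocabulary word. It scores each word by a dot product with the doc-vector and normalizes the scores with a softmax.

```python
import math


def top_words_for_doc_vector(doc_vector, output_weights, topn=3):
    """Rank vocabulary words by softmax-normalized activation for a doc-vector."""
    # Activation of each output node: dot product of doc-vector and word weights.
    scores = {
        word: sum(d * w for d, w in zip(doc_vector, weights))
        for word, weights in output_weights.items()
    }
    # Subtract the max score before exponentiating, for numerical stability.
    max_score = max(scores.values())
    exp_scores = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exp_scores.values())
    ranked = sorted(exp_scores.items(), key=lambda pair: pair[1], reverse=True)
    return [(word, value / total) for word, value in ranked[:topn]]


# Toy, made-up weights: one 2-dimensional output vector per vocabulary word.
toy_output_weights = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.3],
    "tax": [-0.7, 0.6],
}
toy_doc_vector = [1.0, 0.2]
top = top_words_for_doc_vector(toy_doc_vector, toy_output_weights, topn=2)
```

For a cluster, the same function could be applied to the cluster centroid instead of a single doc-vector, though whether the resulting words are meaningful would need to be checked on real data.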

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') is much smaller, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: depending on the size of your corpus, your data may have too many dimensions (one dimension per unique word). You could use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
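As a rough illustration of the approach described above, here is a small self-contained sketch of TF-IDF plus cosine similarity. It assumes whitespace tokenization and a smoothed IDF of log((1 + N) / (1 + df)) + 1; the function names and the three sample sentences are made up for the example, and a library implementation may weight terms slightly differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF, so words present in every document still get weight 1.
    idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}
    return [
        {word: tf * idf[word] for word, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = ["the cat and a dog", "a dog and the cat", "the bird and a fish"]
vectors = tfidf_vectors(docs)
same_words = cosine(vectors[0], vectors[1])       # same words, different order
different_words = cosine(vectors[0], vectors[2])  # only function words shared
```

Because the vectors ignore word order, the two reordered sentences from the question get a similarity of essentially 1, while the third sentence, which shares only "the", "and" and "a", scores much lower.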
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, but 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher frequency. Document B's score is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know Elasticsearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets the higher score?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, and that it depends on the norms for the tags field, which are taken into account when computing the score using the tf/idf similarity (default).
In fact, Lucene does take into account the term frequency, in other words the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, in other words how frequent the term is in the index, in order to compare it with other terms in the query (in your case it doesn't make any difference, since you are searching for a single term).
But there's another factor called norms, which rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the other ones that contain that tag multiple times but a lot of other tags as well. If you don't like this behaviour, you can just disable norms in your mapping for the tags field. They should be enabled by default if the field is "index":"analyzed" (default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (it usually makes sense but depends on your data and domain) or add the "omit_norms": true option in the mapping for your tags field.
Are the documents found on different shards? From the Elasticsearch documentation:
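The effect of norms on the two documents from the question can be worked through with a little arithmetic. The sketch below is a simplification, not Lucene's full scoring function: it assumes the classic TF/IDF similarity with tf = sqrt(term frequency) and a length norm of 1 / sqrt(number of terms in the field), and it ignores idf, boosts, query normalization and the lossy encoding Lucene uses when storing norms. The function name classic_field_score is made up for the example.

```python
import math


def classic_field_score(term_freq, field_length, use_norms=True):
    """Simplified per-field score: tf factor times an optional length norm."""
    tf = math.sqrt(term_freq)
    # The length norm rewards shorter fields; omitting norms sets it to 1.
    norm = 1.0 / math.sqrt(field_length) if use_norms else 1.0
    return tf * norm


# Document A: 1 tag, "Twitter" once. Document B: 100 tags, "Twitter" 3 times.
doc_a = classic_field_score(term_freq=1, field_length=1)
doc_b = classic_field_score(term_freq=3, field_length=100)
doc_a_no_norms = classic_field_score(term_freq=1, field_length=1, use_norms=False)
doc_b_no_norms = classic_field_score(term_freq=3, field_length=100, use_norms=False)

print(f"with norms:    A={doc_a:.3f}  B={doc_b:.3f}")
print(f"without norms: A={doc_a_no_norms:.3f}  B={doc_b_no_norms:.3f}")
```

Under these assumptions Document A wins while norms are enabled, and Document B wins once they are omitted, which matches the behaviour described in the question.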
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.

Does Lucene use Extended Boolean Model retrieval?

Some time ago I came across the extended Boolean model, which combines Boolean retrieval logic with the ability to rank documents in a way similar to the Vector Space Model.
As far as I understand, this is exactly how Lucene does its job of ranking documents. Am I right?
It is a combination of the Vector Space Model and the Boolean Model. Check out the Scoring docs page:
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification.
If you compare the formulas at Similarity with the classic VSM formula you'll note that they are similar (though not equal).
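The two-stage idea in the quoted documentation (Boolean logic narrows the candidates, then a VSM-style weight ranks them) can be sketched in a few lines. This is a toy illustration only and not Lucene's actual Similarity formula: the search function, the tf * log(1 + N / df) weight and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def search(docs, required_terms):
    """Filter with a Boolean AND, then rank the survivors by a tf-idf style sum."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Boolean step: keep only documents containing every required term.
    matches = {
        doc_id: tokens
        for doc_id, tokens in tokenized.items()
        if all(term in tokens for term in required_terms)
    }
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    def score(tokens):
        # Ranking step: more occurrences of rarer terms give a higher score.
        tf = Counter(tokens)
        return sum(tf[term] * math.log(1 + n_docs / df[term]) for term in required_terms)

    scored = [(doc_id, score(tokens)) for doc_id, tokens in matches.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up documents, for illustration only.
docs = {
    "a": "lucene ranks documents",
    "b": "lucene lucene scoring uses lucene",
    "c": "boolean retrieval only",
}
results = search(docs, ["lucene"])
```

Document "c" is removed by the Boolean step because it does not contain the query term, and "b" ranks above "a" because the term occurs more often in it.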