How to extract the keywords on which the Universal Sentence Encoder was trained? - tensorflow

I am using the Universal Sentence Encoder (USE) to encode documents into 512-dimensional embeddings. These are then used to find items similar to a search query, which is also encoded with USE. USE works pretty well on general English words in the search query and documents, but performs really badly when the search query contains rare keywords such as people's names. I am thinking of adding a re-ranker over the search results that takes into account the number of rare words present in the search query and the retrieved document. This should boost the scores of documents that contain known words while reducing the scores of documents that contain unknown words.
My question is: how do I get the vocabulary of the Universal Sentence Encoder so that I can implement such a re-ranker?
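The re-ranking step itself can be sketched independently of where that vocabulary comes from. Below is a minimal sketch, assuming known_vocab is a set of in-vocabulary words obtained separately (which is exactly the open question) and results holds (document, USE cosine similarity) pairs; the names and the boost heuristic are illustrative only.

def rerank(query, results, known_vocab, boost=0.1):
    """results: list of (document_text, use_similarity) pairs.
    Rare query words (anything not in known_vocab) are matched exactly
    against each document: exact hits get a boost, misses a penalty."""
    rare_words = {w for w in query.lower().split() if w not in known_vocab}
    reranked = []
    for text, score in results:
        doc_words = set(text.lower().split())
        hits = len(rare_words & doc_words)
        misses = len(rare_words - doc_words)
        reranked.append((text, score + boost * (hits - misses)))
    return sorted(reranked, key=lambda r: r[1], reverse=True)

# known_vocab and the scores are placeholders; the scores would come from
# cosine similarity between the USE embeddings of the query and each document.
known_vocab = {"quarterly", "report", "meeting", "notes", "by"}
results = [("meeting notes by Priyanka", 0.62), ("quarterly report", 0.71)]
print(rerank("report by Priyanka", results, known_vocab))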

Related

How to choose num_words parameter for keras Tokenizer?

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=my_max)
I am using the Keras preprocessing Tokenizer to process a corpus of text for a machine learning model. One of the Tokenizer's parameters is num_words, which defines the number of words kept in the dictionary. How should this parameter be picked? I could choose a huge number and guarantee that every word would be included, but certain words that only appear once might be more useful if grouped together under a single "out of vocabulary" token. What is the strategy for setting this parameter?
My particular use case is a model processing tweets, so every entry is below 140 characters and there is some overlap in the types of words used. The model is for a Kaggle competition about extracting the text that exemplifies a sentiment (e.g. "my boss is bullying me" returns "bullying me").
The base question here is "What kinds of words establish sentiment, and how often do they occur in tweets?"
Which, of course, has no hard and fast answer.
Here's how I would solve this:
Preprocess your data so you remove conjunctions, stop words, and "junk" from tweets.
Get the number of unique words in your corpus. Are all of these words essential to convey sentiment?
Analyze the words with the top frequencies. Are these words that convey sentiment? Could they be removed in your preprocessing? The Tokenizer keeps only the num_words most frequent words, so these popular words are the ones most likely to end up in your dictionary.
Then, I would begin experimenting with different values, and see the effects on your output.
Apologies for no "real" answer. I would argue that there is no single true strategy to choosing this value. Instead, the answer should come from leveraging the characteristics and statistics of your data.
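For instance, a quick way to look at those statistics is to fit the Tokenizer on the full corpus and check how many words are needed to cover most of the tokens. A minimal sketch, with a placeholder list of tweets standing in for your real corpus:

from keras.preprocessing.text import Tokenizer

# Placeholder corpus; in practice this is your full list of tweets
tweets = ["my boss is bullying me", "loving the weekend", "so bored at work"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)

# word_counts maps each word to its frequency in the corpus
counts = sorted(tokenizer.word_counts.values(), reverse=True)
total = sum(counts)

# Smallest vocabulary size that covers, say, 95% of all tokens
target = 0.95
running = 0
for rank, count in enumerate(counts, start=1):
    running += count
    if running / total >= target:
        print("num_words of about", rank, "covers", target, "of all tokens")
        break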

Machine Learning text comparison model

I am creating a machine learning model that essentially scores how correct one text is relative to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Words that occur in many documents (say, 90% of your sentences/documents contain the conjunction 'and') receive much less emphasis, which essentially gives more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (the same number of dimensions as unique words). You could use stemming/lemmatization to mitigate this problem to some degree.
You can then calculate the similarity between two TF-IDF vectors using, for example, cosine similarity.
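A minimal sketch with scikit-learn, using the two example sentences above (in practice you would fit the vectorizer on a larger corpus so the IDF weights can actually downweight common words):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["the cat and a dog", "a dog and the cat"]

# Fit TF-IDF on the corpus and compare the two sentences
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # 1.0 here: identical bags of words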
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Spell-checking homophones and numbers with spacy

I am working with a pipeline where speech is transcribed into text. The speech is a sort of technical professional jargon, with occurrences of acronyms as well as phrases that describe amounts and values.
There is a high chance that the stock speech-to-text engine will misrepresent the terms and phrases I am interested in. I have a corpus where I can tag those words and phrases.
Is there a way to use spacy (or maybe some other NLP tool) to apply spell-checking, so for instance 'pee-age' or 'pee-oh-to' would be corrected to 'pH' and 'pO2', respectively?
Also, for phrases where numbers and amounts are spelled out, could, for example, 'one twenty cee-cee' be corrected to '120cc'?
Thanks in advance
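One possible starting point is spaCy's PhraseMatcher. A minimal sketch, assuming the mis-transcribed variants (such as 'pee-age') can be enumerated by hand; converting spelled-out numbers like 'one twenty' to digits would need separate rules or a number-parsing library:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # only the tokenizer is needed for exact phrase matching

# Hypothetical mapping from spoken variants to canonical jargon terms
canonical = {"pee-age": "pH", "pee-oh-to": "pO2"}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for variant in canonical:
    matcher.add(variant, [nlp(variant)])

def normalize(text):
    """Replace known mis-transcriptions with their canonical form."""
    doc = nlp(text)
    corrected = text
    for match_id, start, end in matcher(doc):
        variant = nlp.vocab.strings[match_id]
        corrected = corrected.replace(doc[start:end].text, canonical[variant])
    return corrected

print(normalize("the pee-age value looked low"))  # -> "the pH value looked low"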

Dictionary API used for stressed syllables

This might end up being a very general question, but hopefully it will be useful to others as well.
I want to be able to request a word that has x syllables, with the stress on syllable y. I've found plenty of APIs that return both of these pieces of information, such as Wordnik, but I'm not sure how to approach the search aspect. The URL to get the syllables is
GET /word.json/{word}/hyphenation
but I won't know the word ahead of time to make this request. They also have this:
GET /words.json/randomWords
which returns a list of random words.
Is there a way to achieve what I want with this API without asking for random words over and over and checking if they meet my needs? That just seems like it would be really slow and push me over my usage limits.
Do I need to build my own data structure with the words and syllables to query locally?
I doubt you'll find this kind of specialized query on any of the big dictionary APIs. You'll need to download an English dictionary and create your own data structure to do this kind of thing.
The Moby Project has a hyphenated dictionary with about 185,000 words in it. There are many other dictionary projects available. A good place to start looking is http://www.dicts.info/dictionaries.php.
Once you've downloaded the dictionary, you'll need to preprocess it to build your data structure. You should be able to construct a dictionary or hash map that is indexed by (syllables, emphasis) and whose value is a list of words. So you'd have an entry like (4, 2) (4-syllable words with emphasis on the 2nd syllable) that maps to a list of all such words.
To query it, then, you'd just pack the query into a structure and look up that key in the hash map. Then pick a random word from the resulting list.
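A minimal sketch of that structure in Python, assuming you have already parsed the downloaded dictionary into (word, syllable_count, stressed_syllable) entries (the parsing itself depends on the file format you choose):

import random
from collections import defaultdict

# (syllable_count, stressed_syllable) -> list of matching words
index = defaultdict(list)

def add_word(word, syllable_count, stressed_syllable):
    index[(syllable_count, stressed_syllable)].append(word)

def random_word(syllable_count, stressed_syllable):
    """Pick a random word with the given syllable count and stress position."""
    candidates = index.get((syllable_count, stressed_syllable), [])
    return random.choice(candidates) if candidates else None

# Example entries: 4-syllable words stressed on the 2nd syllable
add_word("America", 4, 2)
add_word("celebrity", 4, 2)
print(random_word(4, 2))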

Lucene 4.*: FuzzyQuery and finding positions of hits

The various examples I see about how to find the positions of the matches returned by an IndexSearcher either require retrieving the document's content and searching a TokenStream, or indexing positions and offsets in the term vectors, turning the query into a term, and finding it in the document's term vector.
But what happens when I use a FuzzyQuery? Is there a way to know which term(s) exactly matched in the hit, so that I can look for them in that document's term vector?
In case it's of any value: I'm new to Lucene, and my goal here is to annotate a set of documents (the ones indexed in Lucene) with a set of terms. The documents come from scanned texts and contain OCR errors, so I must use a FuzzyQuery. I thought about using lucene-suggest to do some spell checking beforehand, but it occurred to me that that boils down to finding fuzzy matches anyway.
In case that's of any value, I'm new to Lucene and my goal here is to annotate a set of documents (the ones indexed in Lucene) with a set of terms, but the documents are from scanned texts and contain OCR errors, therefore I must use a FuzzyQuery. I thought about using lucene-suggest to do some spellchecking beforehand but it occured to me that it boiled down to trying to find fuzzy matches.