Identify irrelevant queries based on context - text-mining

I'm working on keyphrase extraction from individual queries using R/Python.
However, is there any way I can detect whether the query is valid in a particular context or not ?
Find an example scenario here:
Query1 = 'I need to order a birthday cake. Cherry topped vanilla cream over chocolate.'
Query2 = 'I wish to have a butterscotch cake with caramel sauce on top.'
Query3 = 'I need to throw my chocolate wrapper into the dustbin.'
Assuming the context is Cake customization, let's say Query1, Query2 are valid, but Query3 is not. Query3 is clearly not about cake customization, while the first two queries are about cake customization.
If I want to eliminate the invalid queries prior to my key-phrase extraction step. Any help in the approach to be taken is welcome.
Thanks!

You can think of doing some clustering as pre-processing step. Cluster the queries so that no queries in a cluster looks irrelevant. To compare queries for clustering, you can generate some query representation using say Bag-of-Word model using TF-IDF weights and then use cosine similarity to find out which queries are similar.
If you are interested in advanced approach, then you can think of deep learning techniques. There are many deep learning techniques which is designed to do this kind of task. But to keep things simple, you can train a deep neural network to generate query/sentence representations.
Actually you don't need to that because there exists many pre-trained models. For example, skip-thought vectors can produce sentence representation (vectors) for your queries. Then you can use it to compare queries for clustering.

Related

Can RDF/SPARQL be used for sub-graph matching?

I would like to build a knowledge graph of a set of instances, where each instance is itself a collection of ordered sub-instances. As a simple example, let's assume my instances are chains of marbles {CHAIN1, CHAIN2, CHAIN3, ...} and the sub-instances are colored marbles {CHAIN1: YELLOW-RED-BLUE-RED; CHAIN2: BLUE-YELLOW-GREEN; CHAIN3: GREEN-RED-BLUE-RED}.
Just to clarify, an incorrect approach would define CHAIN1 something like this:
:CHAIN1 :has_marble :YELLOW, :RED, :BLUE, :RED
but querying this would clearly only yield a "bag of marbles" situation.
I would like to be able to:
Query the knowledge graph such that I can get back the marbles for each chain in the correct order.
Match sequences of marbles between different chains. For example, I might want to get all the chains that have the sequence :RED-:BLUE-:RED as a sub-sequence (i.e., CHAIN1 and CHAIN3).
Questions:
What would be the best way of building this knowledge graph? Should I store the marbles as RDF sequences using rdf:first/rdf:rest? Or is there a better, more flexible option? If possible, I would like to be able to define the type of relation between the marbles, say :RED :is_followed_by :BLUE.
Is the type of graph matching I'm after possible? And how about if I'd like to match the sequences using some properties that describe each marble? Say, :BLUE :has_shape :SQUARE, and match the sequence of marbles by their shape?
Note: What I really want to model are chains of DNA and protein sequences, so if anyone has specific recommendations for such applications, that would be even more helpful.

What is the general name of this vector sum problem?

Given the USDA nutrient database: n vectors where each dimension is a particular nutrient,
find a set S of foods F whose vectors sum to .ge. RDA and .lt. any toxic value. Add various other constraints to the model, e.g., calories, mass.
Solve for any combination of vectors that meet the constraints.
Currently available websites allow one to choose foods one at a time and build a "recipe". I'm looking for a computational solution. I suspect that this is a trivial problem that someone has already solved. I am looking for the search terms that describe this sort of scenario.
"Deep learning" looks for patterns, but the goal "pattern" is an input. Probability is not involved, so a sizable chunk of ML is not applicable. I intuit that some sort of tree-traversal might be useful.
This is a combination of set theory and vector math. I expect that there exists a large solution set of sets.
I can set up the input vectors as parameterized SQL queries. I have downloaded the USDA nutrient database and loaded it into mariadb.
pseudocode:
Select *
from subset_nutrients
join rda_nutrs on nutrients.nut-1 = rda-nuts.nut-1 join toxicity on nutrients.nut-1 = toxicity.nut-t1
where sum(nut-1scalar) >= rda-1scalar and sum(nut-2scalar) >=rda-2scalar {etc} and sum(nut-1scalar) < toxicity.nut1_t_scalar and sum(nut-2scalar) < toxicity.nut2_t_scalar {etc}
SQL might actually solve the problem all by itself?
I am looking for human-suggested search terms to find original sources of information. Thank you for your suggestions.

Interpret the Doc2Vec Vectors Clusters Representation

I am new to Doc2Vec, please bear with the naive questions.
I have generated Doc2vector score i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use the model.similar for doc1 and get the output - doc5 and doc10 are similar to doc1.
Q1) How to summarize using the code what are the important words or high-level summary this document holds?
In addition, If I use the array output and run K- means to get 5 clusters. How to define the cluster definition.
Q2) I can read the documents but the number of documents is very high and doing a manual read to find the cluster definition is not possible.
There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of same).
Theoretically, the model could do something that's sort-of the opposition of doc-vector inference. It could take a doc-vector – perhaps one corresponding to a existing document – and then provide it to the model, run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative-sampling, those nodes map one-to-one with known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N "most-associated" words with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
(I'm not sure what your Q2 question actually is.)

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit to your problem, because:
Emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjuction word 'and') is much smaller, essentially giving more weight to the more document specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: Your data, depending on the size of corpus, may have too many dimensions (the same number of dimensions as unique words), you could use stemming/lemmatization in order to mitigate this problem to some degree.
You may calculate similiarity between two TF-IDF vector using cosine similiarity for example.
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Model clause in Oracle

I am recently inclined towards in Oracle jargon and the more I am looking into the more is attracting me.
I have recently come across the MODEL clause but to be honest I am not understanding the behaviour of this. Can any one with some examples please let me know about the same.
Thanks in advance
Some examples of MODEL are given here.
Personally I've looked at MODEL several times and not yet succeeded in finding a use case for it. While it first appears to be useful, there's a lot of places where only literals work (rather than binds or variables) which restrict its flexibility. For example, on inter-row calculations, you can't readily refer to the 'previous' or 'next' row, but have to be able to absolutely identify it by its attributes. So you can't say 'take the value of the row with the same date in the previous month' but can only code a specific date.
It might be used (internally) by some analytical tools. But as an end-user tool, I never 'got' it. I have previously recommended that, if you ever find a problem you think can be solved by the application of the MODEL clause, go and have a lie down until the feeling wears off.
I think the MODEL clause is quite simple to understand, when you slowly read the official whitepaper. In my opinion, the whitepaper nicely explains the MODEL clause step by step, adding one feature at a time to the examples, leaving out the most advanced features to the official documentation.
From the whitepaper, I also find it easy to understand when to actually use the MODEL clause. In some examples, it is a lot simpler to do "Excel-spreadsheet-like" operations using MODEL rather than, for instance, using window functions, CONNECT BY, or subquery factoring. Think about Excel. Whenever you want to define a complex rule set for Excel columns, use the MODEL clause. Example Excel spreadsheet rules:
A10 = A9 + A8
B10 = A10 * 5
C10 = MAX(A1:A9)
D10 = C10 / A10
In other words, MODEL is a very powerful SQL spreadsheet!
The best explanation is in the official white paper. It uses the SH demo schema and you really need it installed.
http://www.oracle.com/technetwork/middleware/bi-foundation/10gr1-twp-bi-dw-sqlmodel-131067.pdf
I don't think they do a very good job explaining this. It basically lets you load up data into an array and and then loop through array using straight SQL, instead of having to write procedural logic. Alot of the terms are based on spreadsheet terms (they are used in the Excel Help). So if you have them in excel, this would be confusing.
They should have drawn a picture for each of the queries and shown the array created than shown how you look through the array. The syntax looks to be based on Excel syntax. I'm not sure if this is common to all spreadsheet tools or not.
It has uses. Bin fitting is the most common. See the 2nd example. This is basically a complex group by where you are grouping by a range, but that range can change. It requires procedural logic. The example gives 3 ways to do it one of which is the model clause.
http://www.oracle.com/technetwork/issue-archive/2012/12-mar/o22asktom-1518271.html
I think people (often managers) who do complex spreadsheet calculations may have an easier time seeing uses for this and getting the lingo.