I have written a plugin in Lucene which annotates certain terms and stores their spans in this fashion: <term>,<span>;<term>,<span>;...
Now I need to handle span-near queries using only these stored spans, not the spans Lucene stores by default. This is because not all similar terms are annotated. So, if I query for terms within k tokens of each other, I should be able to get their span distance by subtracting their corresponding stored spans. How can I do this in Lucene? I'm a newbie, so please be as descriptive as possible.
Thanks,
Ananth.
A good general rule I follow in Lucene is to put specially processed data into its own field, so there is little chance of a mix-up. That way, you can perform your nearness queries on that field in the way you want. (This will make your index bigger.)
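For illustration, a minimal sketch against a 4.x/5.x-era Lucene API (the field names, the class name, and the helper methods are my own; your plugin would produce the contents of the dedicated field):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class AnnotatedFieldExample {

    // Indexing: keep the raw text and the annotator's output in separate fields.
    public static Document buildDoc(String rawText, String annotatedText) {
        Document doc = new Document();
        doc.add(new TextField("body", rawText, Field.Store.YES));
        doc.add(new TextField("annotated", annotatedText, Field.Store.NO));
        return doc;
    }

    // Searching: a span-near query against the dedicated field only, so token
    // positions (and hence distances) come from your annotated stream.
    public static SpanNearQuery withinK(String term1, String term2, int k) {
        SpanQuery first = new SpanTermQuery(new Term("annotated", term1));
        SpanQuery second = new SpanTermQuery(new Term("annotated", term2));
        return new SpanNearQuery(new SpanQuery[] { first, second }, k, true);
    }
}

The point is simply that the positions SpanNearQuery compares are whatever your analyzer emitted for the "annotated" field, so you control the distances.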
Related
In my project, I am asked to implement a text query service on the database we are using, PostgreSQL. I have used PostgreSQL's full text search features, which perform fairly well in terms of time. One problem with full text search is that it has no fuzzy search abilities. On the other hand, there is an extension named pg_trgm providing functions and operators for determining the similarity of alphanumeric text. There are several examples of text search using pg_trgm, like:
select actor
from products
where actor % 'tomy';
And, as you know, an example of Postgres FTS looks like this:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
So, the main question is: what is the difference between these two search strategies? Which one is the more appropriate way to search text? Is it possible to mix them? I should also say that performance is an important concern. Thanks in advance!
They do completely different things. About the only thing that is not different between them is that they operate on text and can benefit from indexes. From your question, it seems like you already have a good sense of the differences. The appropriate one is the one that does what you want. If one of them were always appropriate, we probably wouldn't have created the other one.
You can mix them, but you will need different indexes for each one; they cannot share an index. You probably need different tables as well, as full text search is more appropriate for sentences or paragraphs, while trigram search suits individual words or short phrases.
One way to mix them would be to have one table of full texts, and a second table listing each distinct word present in any of the full texts. The second table could be used to detect probable typos in the query; once those are fixed by suggestions from trigram searching, run the corrected query against the first table.
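A rough sketch of that flow over JDBC (a hedged example: the table names words and docs and their columns are assumptions, and it presumes a pg_trgm index on words.word plus a tsvector index on docs.body):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FuzzyThenExact {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");

        // Step 1: correct a probable typo against the word list with pg_trgm.
        String word = "tomy";
        PreparedStatement fix = conn.prepareStatement(
                "SELECT word FROM words WHERE word % ? " +
                "ORDER BY similarity(word, ?) DESC LIMIT 1");
        fix.setString(1, word);
        fix.setString(2, word);
        ResultSet rs = fix.executeQuery();
        if (rs.next()) {
            word = rs.getString(1);  // e.g. "tommy"
        }

        // Step 2: run the corrected word through full text search.
        PreparedStatement fts = conn.prepareStatement(
                "SELECT title FROM docs " +
                "WHERE to_tsvector('english', body) @@ plainto_tsquery('english', ?)");
        fts.setString(1, word);
        ResultSet hits = fts.executeQuery();
        while (hits.next()) {
            System.out.println(hits.getString("title"));
        }
        conn.close();
    }
}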
The difference is quite huge: in fuzzy search you're searching for a similar result; in full text search, for an exact match. Whether one is more appropriate than the other is a matter of use case.
If you don't need fuzziness, don't use it. It carries a large performance overhead, because it cannot just match the text exactly but must also try other combinations.
I have a table with Id and Text fields. The Text field holds sentences, averaging 50 words. There are >1,000,000 rows.
This is part of a web app where users need to be able to search through these sentences. Here's the twist, though: I need to run a custom search function, written in C#, that uses machine learning instead.
From what I understand, this means I'll have to download the entire database of >1,000,000 rows every time a user makes a search! This seems really inefficient to me.
How would you implement this in the most efficient/fast way possible?
If this is relevant: I'm using EF Core with LINQ's .Where(my_custom_search_function), against a PostgreSQL database.
I think I've found the solution. PostgreSQL full text search currently provides two ranking functions. In this case, "sorting" in the question and "ranking" here refer to the same thing.
Postgresql docs state:
However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
These functions can be written as any of the four kinds of functions PostgreSQL supports.
Then they answer this exact question:
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
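To make this concrete, here is a hedged sketch using ts_rank, one of the two built-in ranking functions (the table name sentences and the column names id and body are placeholders for the question's Id and Text fields; the same SQL could be issued from EF Core as a raw query). The idea is to let the database match, rank, and truncate, instead of shipping all >1,000,000 rows to the client:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class RankedSearch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");

        // Match and rank inside the database, returning only the top rows.
        PreparedStatement ps = conn.prepareStatement(
                "SELECT id, body, ts_rank(to_tsvector('english', body), q) AS rank " +
                "FROM sentences, plainto_tsquery('english', ?) q " +
                "WHERE to_tsvector('english', body) @@ q " +
                "ORDER BY rank DESC LIMIT 100");
        ps.setString(1, "some search words");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getLong("id") + ": " + rs.getString("body"));
        }
        conn.close();
    }
}

One way to combine this with the question's custom function would be to re-rank only these top rows with the C# machine learning code, which keeps the expensive custom scoring off the full table.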
Credit to @Used_By_Already for pointing me to PostgreSQL full text search.
I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do this, because Lucene doesn't keep it in the index. So I thought that while indexing I would build a map from docID to document length, and then use it during query evaluation. But I have no idea where I should put this map and where I should update it.
You are exactly right: storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm; that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
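A minimal sketch of the TFIDFSimilarity route (the class name is my own; it extends Lucene 5.3's DefaultSimilarity, which already supplies encodeNormValue and decodeNormValue, so only lengthNorm is overridden). Bear in mind that the default encoding packs the norm into a single byte, so the recovered length is only approximate for long documents:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class DocLengthSimilarity extends DefaultSimilarity {

    @Override
    public float lengthNorm(FieldInvertState state) {
        // getLength() is the number of tokens indexed for this field;
        // getNumOverlap() discounts tokens at the same position (synonyms).
        int numTokens = Math.max(1, state.getLength() - state.getNumOverlap());
        // Store 1/length; decodeNormValue then yields roughly 1/length
        // at search time, so length ~= 1 / decoded value.
        return 1.0f / numTokens;
    }
}

At search time, inside your scoring code, the stored value comes back through the norms, along the lines of:

NumericDocValues norms = leafReader.getNormValues("body"); // field name is a placeholder
float decoded = similarity.decodeNormValue(norms.get(docID));
int approxLength = Math.round(1.0f / decoded);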
Suppose I am searching against two types, [cars] and [buildings], and I want the results separated. Is there a way to group results by type?
I understand one simple way would be to query each type separately, but other use cases may actually require querying tens or hundreds of types together. Is there a native way, or a hacky way (like using sort), to achieve this?
This type of grouping behavior is (currently) not available in Elasticsearch. It has been a long-standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying, and then bucket those yourself.
Using multi-query (the Multi Search API). This allows you to easily pass down some number of queries in a single batch, but will have potential scaling problems if the number of queries gets too large.
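With the Java client of that era, the multi-query variant looks roughly like this (a hedged sketch: the index name, field names, and query terms are invented, and it assumes the legacy transport-client API with prepareMultiSearch):

import org.elasticsearch.action.search.MultiSearchResponse;
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

public class PerTypeSearch {

    public static void searchSeparated(Client client) {
        // One query per type; responses come back separated, in request order.
        SearchRequestBuilder cars = client.prepareSearch("myindex")
                .setTypes("cars")
                .setQuery(QueryBuilders.matchQuery("name", "tesla"));
        SearchRequestBuilder buildings = client.prepareSearch("myindex")
                .setTypes("buildings")
                .setQuery(QueryBuilders.matchQuery("name", "tower"));

        MultiSearchResponse msr = client.prepareMultiSearch()
                .add(cars)
                .add(buildings)
                .execute().actionGet();

        for (MultiSearchResponse.Item item : msr.getResponses()) {
            SearchResponse response = item.getResponse();
            // each response holds the hits for one type
        }
    }
}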
Result grouping is one feature that Solr has and Elasticsearch doesn't, but I have never tried it. I used a similar feature in Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated into groups of documents, you're going to have to restructure your documents, since Elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add a different type, say [price_aggregator], and put the fields [type] and [price] into it. Then you could easily build your query against just that one type. This requires some additional work while indexing, and more memory to store the index, but it is much more performant when you search.
Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs × terms and find the terms that occur in the same documents most often. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
So, my needs are:
To retrieve the most associated terms within particular field.
To retrieve the term, that is closest to the specified one within particular field.
I will rate answers in the following way:
Ideally, I would like to find a Solr component that directly covers the specified needs, that is, something to get associated terms directly.
If that is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
If this is not an option too, I would like to know the most straightforward way to 1) get all terms and 2) get ids (numbers) of documents these terms occur in.
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout and then use Mahout to compute the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no other answers to my question, I have to write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the central piece of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:
All terms, or at least the most frequent ones, because rare terms, by their nature, won't affect the result of association mining.
Documents where these terms occur; again, at least the top documents.
Both these tasks may be easily done with standard Solr components.
To retrieve terms, TermsComponent or faceted search may be used. We can get only the top terms (by default) or all terms (by setting the maximum number of terms to take; see the documentation of the particular feature for details).
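For instance, with TermsComponent the request could look something like this (a hedged example: it assumes a /terms handler is enabled in solrconfig.xml; terms.limit=-1 asks for all terms instead of the default top 10):
http://ip:port/solr/someinstance/terms?terms.fl=text&terms.limit=-1&wt=json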
Getting the documents where a term occurs is simply a search for this term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides information about the count of occurrences of the current term in a found document.
Having this, it is easy to build the co-occurrence matrix. To mine associations, it is possible to use other software like Weka, or to write your own implementation of, say, the Apriori algorithm.
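For reference, the direct-Lucene variant mentioned in the question looks roughly like this (a sketch against the pre-4.0 Lucene API the question refers to; the field name "text" and the class name are placeholders):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class PostingsCollector {

    // Collect, for every term in the "text" field, the ids of the documents
    // containing it. The co-occurrence count of terms a and b is then the
    // size of the intersection of their two posting lists.
    public static Map<String, List<Integer>> collect(IndexReader reader)
            throws Exception {
        Map<String, List<Integer>> termToDocs = new HashMap<String, List<Integer>>();
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            if (!"text".equals(t.field())) {
                continue;  // only interested in one field
            }
            List<Integer> docs = new ArrayList<Integer>();
            TermDocs td = reader.termDocs(t);
            while (td.next()) {
                docs.add(td.doc());  // td.freq() would give the in-doc count
            }
            td.close();
            termToDocs.put(t.text(), docs);
        }
        terms.close();
        return termToDocs;
    }
}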
You can get the count of occurrences of the current term in a found document with the following query:
http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json