Semantic search on PostgreSQL - sql

I know PostgreSQL has trigram similarity search and even indexing optimized for it (CREATE INDEX trgm_idx ON table USING gist (column gist_trgm_ops);), which can be used directly from Django (the web framework):
Model.objects.filter(attribute__trigram_similar=query_string)
But what if, instead of surface similarity, I wanted to perform a semantic similarity query on database objects? (This is obviously quite different from classic trigram similarity.)
A good example would be Google's Universal Sentence Encoder, where I would convert all strings into 512-dimensional embedding vectors (using the library) and perform the query by calculating the normalized dot product (cosine similarity), yielding the object with the highest similarity (or perhaps the n objects with similarity >= 0.50).
The simplest thing to do is to iterate (at the framework level) over the database objects, but this is highly inefficient (especially if the database is large), so I would rather find a way to perform the query at the database level (and, if possible, set up indexing optimized for semantic search).
What would be the best way to perform this custom similarity search on the database of pre-vectorized objects?
What if I computed the dot product against all objects in the pre-vectorized database manually?
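To make this concrete, here is a rough sketch of what I have in mind on the Django side (the cosine_similarity SQL function, table name and field names are hypothetical; embeddings would be stored as a float array column):

# Sketch only: assumes a PostgreSQL function cosine_similarity(float8[], float8[])
# has been created separately (e.g. in PL/pgSQL), and that each object stores its
# pre-computed 512-dimensional embedding in an ArrayField.
from django.contrib.postgres.fields import ArrayField
from django.db import models

class Document(models.Model):
    text = models.TextField()
    embedding = ArrayField(models.FloatField(), size=512)

def semantic_search(query_vector, limit=10):
    # Pushes the similarity computation down to PostgreSQL instead of
    # iterating over every object at the framework level.
    return Document.objects.raw(
        """
        SELECT id, text,
               cosine_similarity(embedding, %s::float8[]) AS similarity
        FROM myapp_document
        WHERE cosine_similarity(embedding, %s::float8[]) >= 0.50
        ORDER BY similarity DESC
        LIMIT %s
        """,
        [query_vector, query_vector, limit],
    )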
Thank you!

Related

Is it possible to use a full-text index to find closest-match strings? What does Statistical Semantics do in Full Text Indexing?

I am looking at SQL Server 2016 full-text indexes, and they are great for searches that find strings containing multiple words.
When I try to create the full-text index, it shows Statistical Semantics as a checkbox. What does statistical semantics do?
Moreover, I want to implement "did you mean" queries.
For example, let's say I have a record "house" and the user types "hause".
Can I use a full-text index to return the closest match for "hause" and show the user "did you mean house?" efficiently? Thank you.
I have tried SOUNDEX, but the results it generates are terrible: it returns many unrelated words.
And since there are so many records in my database and I need very fast results, I need something SQL Server supports natively.
Any ideas? Is there any way to achieve this using indexes?
I know there are multiple algorithms, but they are not efficient enough for me to use online; I mean things like calculating the edit distance against every record. They could work for offline projects, but I need this efficiency in an online dictionary that will receive thousands of requests constantly.
I already have a plan in mind: storing not-found queries in the database, calculating their closest matches offline, and using them as a cache. However, I wonder whether any online/live solution exists? Consider that there will be over 100M nvarchar records.
The short answer is no: Full-Text Search cannot search for words that are similar but different.
Full Text Search uses stemmers and thesaurus files:
The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").
A Full-Text Search thesaurus defines a set of synonyms for a specific language.
Both stemmers and thesaurus files are configurable, and you can easily have Full-Text Search match "house" for a search on "hause", but only if you add "hause" as a synonym for "house". This is obviously a non-solution, as it would require you to add every possible typo as a synonym...
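For illustration, a minimal sketch via pyodbc (the DSN, table and column names are hypothetical); FORMSOF(INFLECTIONAL, ...) exercises the stemmer, and FORMSOF(THESAURUS, ...) the thesaurus file:

import pyodbc

conn = pyodbc.connect("DSN=MyDictionaryDb")  # hypothetical ODBC data source
cur = conn.cursor()
cur.execute("""
    SELECT word
    FROM dbo.words
    WHERE CONTAINS(word, 'FORMSOF(INFLECTIONAL, run)')   -- stemmer: run, ran, running, runner
       OR CONTAINS(word, 'FORMSOF(THESAURUS, house)')    -- thesaurus: only configured synonyms
""")
print(cur.fetchall())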
Semantic search is a different topic; it allows you to search for documents that are semantically close to a given example.
What you want is to find records that have a short Levenshtein distance from a given word (a.k.a. 'fuzzy' search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exist.
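For reference, the distance itself is a short dynamic-programming computation; a minimal Python sketch (nothing SQL Server specific) looks like this:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: O(len(a) * len(b)) per pair,
    # which is why a full scan of 100M rows per lookup is impractical.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("hause", "house"))  # -> 1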

Performance Impact of turning Columns into Rows

I'm planning to use JavaDB (Derby) or PostgreSQL.
I have the following problem: I need to store a large set of vectors. Currently all vectors contain a fixed number of elements. Hence the appropriate way of storing the set is using one row per vector and a column per element. However, the number of elements might change over time. Additionally, in my case, from a software engineering perspective, having a fixed number of columns reflects knowledge about a software component which the general model should be unaware of.
Therefore I'm thinking about "linearizing" the layout and using a general table that stores elements instead of vectors.
The first element of vector 5 could then be queried like this:
SELECT value FROM elements where v_id = 5 and e_id = 1;
In general, I do not need full table reads, and only a relatively small subset of the vectors is accessed.
Maybe database-savvy people can judge what the performance impact will be?
Many thanks in advance.
This is a variant of what's referred to in general database terms as Entity-Attribute-Value or EAV design. It's a bit of a relational database design anti-pattern and should be avoided in most cases. Performance tends to be poor due to the need for many self-joins, and queries are ugly at best.
In PostgreSQL, look into the intarray extension; it should solve your problem pretty much ideally if the values are simple integers. Otherwise consider PostgreSQL's standard array types. They've got their own issues, but they're generally a lot better than EAV, though not lovely to work with from JDBC.
Otherwise, if all you're storing is these vectors, maybe consider a non-relational DB.
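To illustrate the array-column alternative, a minimal sketch with psycopg2 (the table name and connection details are hypothetical):

# Stores each vector as a single integer[] column instead of one row per element (EAV).
import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS vectors (
            v_id  integer PRIMARY KEY,
            elems integer[] NOT NULL   -- the whole vector in one row
        )
    """)
    cur.execute("INSERT INTO vectors (v_id, elems) VALUES (%s, %s)", (5, [10, 20, 30]))
    # Equivalent of "SELECT value FROM elements WHERE v_id = 5 AND e_id = 1":
    cur.execute("SELECT elems[1] FROM vectors WHERE v_id = %s", (5,))
    print(cur.fetchone()[0])  # -> 10 (PostgreSQL arrays are 1-based)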

Lucene: Query at least

I'm trying to find out whether there's a way in Lucene to find all documents where there is at least one word that does not match a particular word.
E.g. I want to find all documents where there is at least one word besides "test"; i.e. "test" may or may not be present, but there should be at least one word other than "test". Is there a way to do this in Lucene?
thanks,
Purushotham
Lucene could do this, but this wouldn't be a good idea.
The performance of query execution is bound by two factors:
the time to intersect the query with the term dictionary,
the time to retrieve the docs for every matching term.
Performant queries are the ones which can be quickly intersected with the term dictionary, and match only a few terms so that the second step doesn't take too long. For example, in order to prohibit too complex boolean queries, Lucene limits the number of clauses to 1024 by default.
With a TermQuery, intersecting the term dictionary requires (by default) O(log(n)) operations (where n is the size of the term dictionary) in memory and then one random access on disk plus the streaming of at most 16 terms. Another example is this blog entry from Lucene committer Mike McCandless which describes how FuzzyQuery performance improved when a brute-force implementation of the first step was replaced by something more clever.
However, the query you are describing would require examining every single term of the term dictionary and dismissing documents which appear in the "test" document set only!
You should give more details about your use-case so that people can think about a more efficient solution to your problem.
If you need a query with a single negative condition, then use a BooleanQuery with a MatchAllDocsQuery and a TermQuery with Occur.MUST_NOT. There is no way to additionally enforce the existential constraint ("must contain at least one term that is not excluded"). You'll have to check that separately, once you retrieve Lucene's results. Depending on the ratio of favorable results to all the results returned from Lucene, this kind of solution can range from perfectly fine to a performance disaster.
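A minimal sketch of that query via PyLucene (the bindings mirror the Java API, exact class names vary by Lucene version, and "body" is a hypothetical field name):

import lucene
lucene.initVM()
from org.apache.lucene.index import Term
from org.apache.lucene.search import BooleanClause, BooleanQuery, MatchAllDocsQuery, TermQuery

# Match every document except those containing the term "test" ...
builder = BooleanQuery.Builder()
builder.add(MatchAllDocsQuery(), BooleanClause.Occur.MUST)
builder.add(TermQuery(Term("body", "test")), BooleanClause.Occur.MUST_NOT)
query = builder.build()
# ... then post-filter the hits, keeping only documents that actually contain
# at least one term other than "test" (e.g. by inspecting stored fields or
# term vectors), since Lucene cannot express that constraint in the query.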

Determining search results quality in Lucene

I have been researching score normalization in Lucene for a few days (now I know this can't be done), using the mailing list, wiki, blog posts, etc. I'm going to describe my problem, because I'm not sure that score normalization is what our project needs.
Background:
In our project, we are using Solr on top of Lucene with custom RequestHandlers and SearchComponents. For a given query, we need to detect when it got poor results, in order to trigger different actions.
Assumptions:
Immutable index (once indexed, it is not updated) and the same query typology (dismax query parser with the same field boosting, without boost functions or boost queries).
Problem:
We know that score normalization is not implementable. But is there any way to determine (using the TF/IDF and field boosting assumptions above) when the quality of the search result matches is poor?
Example: we've got one index with science papers and another with medical care centre info. When a user queries the first index and gets poor results (inferred from the score?), we want to query the second index and merge the results using some threshold (a score threshold?).
Thanks in advance
You're right that normalization of scores across different queries doesn't make sense, because nearly all similarity measures are based on term frequency, which is of course local to a query.
However, I think it is viable to compare the scores in the very special case you are describing, provided you override the default similarity to use IDF calculated jointly for both indexes. For instance, you could achieve this easily by keeping all the documents in one index and adding an extra (hidden from users) 'type' field. Then you could compare the absolute values returned by these queries.
Generally, it could be possible to determine low quality results by looking at some features, like for example very small number of results, or some odd distributions of scores, but I don't think it actually solves your problem. It looks more similar to the issue of merging of isolated search results, which is discussed for example in this paper.
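As a rough sketch of the single-index-plus-type-field idea (using pysolr, with a hypothetical core name, field names, type values, and an arbitrary threshold to tune):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/combined", timeout=10)

def search_with_fallback(user_query, threshold=1.0):
    # Because both document types share one index (hence one joint IDF), their
    # absolute scores are comparable; the threshold itself is an assumption to tune.
    papers = solr.search(user_query, fq="type:science_paper", fl="*,score", rows=10)
    if papers.hits == 0 or max(doc["score"] for doc in papers) < threshold:
        return solr.search(user_query, fq="type:medcare_centre", fl="*,score", rows=10)
    return papers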

Multiple or single index in Lucene?

I have to index different kinds of data (text documents, forum messages, user profile data, etc.) that should be searched together (i.e., a single search would return results across the different kinds of data).
What are the advantages and disadvantages of having multiple indexes, one for each type of data?
And the advantages and disadvantages of having a single index for all kinds of data?
Thank you.
If you want to search all types of documents with one search, it's better to keep all types in one index. In the index you can define more field types, choosing which fields you want to tokenize or store term vectors for.
It takes time to introduce to each IndexSearcher a directory that includes the indexes.
If you want to search the types separately, it's better to index each type in its own index.
A single index is more structured than multiple indexes.
On the other hand, we can balance our load with multiple indexes.
Not necessarily answering your direct questions, but... ;)
I'd go with one index and add a Keyword (indexed, stored) field for the type; it'll let you filter if needed, as well as tell apart the results you receive back.
(And maybe, in the vein of your questions... using separate indexes will allow each corpus to have its own relevancy score; I don't know whether excessively repeated terms in one corpus would throw off the relevancy of documents in others?)
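A rough sketch of that approach with PyLucene (the class names mirror the Java API and vary by Lucene version; the field names and type values are made up):

import lucene
lucene.initVM()
from org.apache.lucene.document import Document, Field, StringField, TextField
from org.apache.lucene.index import Term
from org.apache.lucene.search import BooleanClause, BooleanQuery, TermQuery

# Indexing: each document carries its content plus a non-tokenized "type" keyword.
doc = Document()
doc.add(TextField("body", "text of a forum message", Field.Store.YES))
doc.add(StringField("type", "forum_message", Field.Store.YES))

# Searching: query all types at once, or restrict to one type with a required clause.
query = (BooleanQuery.Builder()
         .add(TermQuery(Term("body", "message")), BooleanClause.Occur.MUST)
         .add(TermQuery(Term("type", "forum_message")), BooleanClause.Occur.MUST)
         .build())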
You should think logically about what each dataset contains and design your indexes by subject matter or other criteria (such as geography, business unit, etc.). As a general rule, your index architecture should mirror how you would design databases (you likely wouldn't combine an accounting database with a personnel database, for example, even if technically feasible).
As #llama pointed out, creating a single uber-index affects relevance scores, raises security/access issues, among other things, and causes a whole new set of headaches.
In summary: think of a logical partitioning structure depending on your business need. Would be hard to explain without further background.
Agree that each kind of data should have its own index, so that all the index options can be set accordingly - like analyzers for the fields, what is stored for the fields, term vectors and similar - and also so that IndexReaders/Writers can be reopened/committed on different schedules for different kinds of data.
One obvious disadvantage is the need to handle several indexes instead of one. To make it easier, and because I always use more than one index, I created a small library to handle it: Multi Index Lucene Manager.