What exactly distinguishes fuzzy search from full-text search? - sql

In my project, I have been asked to implement a text query service on the database we are using, PostgreSQL. I have used PostgreSQL's full-text search features, which perform reasonably well in terms of time. One problem with full-text search is that it has no fuzzy search abilities. On the other hand, there is an extension named pg_trgm that provides functions and operators for determining the similarity of alphanumeric text. There are also several examples of text search using pg_trgm, such as:
select actor
from products
where actor % 'tomy';
And here is an example of PostgreSQL full-text search:
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
So the main question is: what is the difference between these two search strategies? Which one is the more appropriate way to search text? Is it possible to mix them? Performance is an important concern as well. Thanks in advance!

They do completely different things. About the only thing they have in common is that they operate on text and can benefit from indexes. From your question, it seems like you already have a good sense of the differences. The appropriate one is the one that does what you want. If one of them were always appropriate, we probably wouldn't have created the other one.
You can mix them, but you will need a different index for each one; they cannot share an index. You will probably need different tables as well, since full-text search is more appropriate for sentences or paragraphs, while trigram search suits individual words or short phrases.
One way to mix them would be to have one table of full texts, and another table which lists only each distinct word present in any of the full texts. The 2nd table could be used to detect probable typos in the query, and then once those are fixed by suggestions from trigram searching, run the fixed query against the 1st table.
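The two-table idea can be sketched outside the database to show what the trigram step does. This is only an illustration in Python, not the pg_trgm implementation: the vocabulary set stands in for the table of distinct words, the function names are made up, and it handles single alphanumeric words only. The padding (two leading spaces, one trailing space) and the default threshold of 0.3 follow what the pg_trgm documentation describes.

```python
def trigrams(word):
    # Pad with two leading spaces and one trailing space, then take every
    # 3-character window (the padding rule described for pg_trgm).
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    # Shared trigrams divided by all distinct trigrams of both words.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def suggest(word, vocabulary, threshold=0.3):
    # Return the most similar known word, or the input if nothing is close enough.
    best = max(vocabulary, key=lambda v: similarity(word, v))
    return best if similarity(word, best) >= threshold else word


def fix_query(query, vocabulary):
    # Leave known words alone; replace unknown ones with the best suggestion.
    words = query.lower().split()
    return " ".join(w if w in vocabulary else suggest(w, vocabulary) for w in words)


# Hypothetical contents of the "distinct words" table.
vocabulary = {"friend", "tommy", "actor", "search"}
fixed = fix_query("frend tomy", vocabulary)
print(fixed)
```

The corrected string would then be passed to the full-text query against the first table. In the database itself, the suggestion step would be a trigram-indexed query on the word table rather than a Python loop.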

The difference is huge: with fuzzy search you are looking for similar results, while with full-text search you are looking for exact matches (after normalization such as stemming). Whether one is more appropriate than the other is a matter of use case.
If you don't need fuzziness, don't use it: it adds significant performance overhead, because it has to consider near matches as well as exact ones.

Related

Is sorting the database via a custom function inefficient?

I have a table with Id and Text fields. The Text field holds sentences, averaging 50 words. There are >1,000,000 rows.
This is part of a web app where users need to be able to search through these sentences. Here's the twist though - I need to be able to run a custom search function written in C# that uses Machine Learning instead.
From what I understand, this means I'll have to download the entire database of >1,000,000 rows every time a user makes a search! This seems really inefficient to me.
How would you implement this in the most efficient/fast way possible?
If this is relevant, I'm using EF Core with LINQ .Where(my_custom_search_function), with a PostgreSQL database
I think I've found the solution. PostgreSQL full-text search currently provides two ranking functions. In this case, "sorting" in the question and "ranking" here refer to the same thing.
The PostgreSQL docs state:
However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
These functions can be any of the four kinds of functions that PostgreSQL supports.
Then they answer this exact question:
Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.
Credits to @Used_By_Already for pointing me to PostgreSQL full-text search.
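To make "combine their results with additional factors" concrete, here is a minimal Python sketch. It assumes the database has already narrowed the table down to the matching rows and returned a text rank for each, so the custom step only touches candidates rather than every row. The Candidate fields, the 30-day half-life and the decay rule are all assumptions for illustration, not anything the PostgreSQL docs prescribe.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: int
    text_rank: float  # rank value returned by the database for this match
    age_days: float   # days since the document was last modified


def combined_score(c, half_life_days=30.0):
    # Assumed rule: halve the text rank for every half_life_days of age.
    return c.text_rank * 0.5 ** (c.age_days / half_life_days)


def rerank(candidates, limit=10):
    # Sort the already-filtered candidates by the combined score.
    return sorted(candidates, key=combined_score, reverse=True)[:limit]


candidates = [
    Candidate(doc_id=1, text_rank=0.9, age_days=90),
    Candidate(doc_id=2, text_rank=0.6, age_days=0),
    Candidate(doc_id=3, text_rank=0.3, age_days=30),
]
top = rerank(candidates, limit=2)
print([c.doc_id for c in top])
```

A machine-learning scorer could take the place of combined_score in the same way, as long as it only runs on the rows the index has already matched.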

Azure Search - issues with Phonetic Analyzer

Our clients query our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by turning the client's input name into a phrase and adding a proximity value of 3. Because we search using a phrase, we cannot use the Fuzzy Search capability of the Lucene analyzer, as it only works on single words.
We were therefore in search of a solution for being able to bring back results with names that weren't spelled exactly as the client input them. We came across the phonetic analyzer, and have just implemented the Metaphone algorithm into our index. We've run some tests and while it gets us closer to what we need, we still see some issues:
1. The analyzer's scope is so wide that it's bringing back a lot of false positives. For example, when searching on Kenneth Gooden, it brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or can something be done to boost some other parameter to remedy this?
2. When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent, I can maybe see why it's scored that way (b/c of the 2nd one being last name first) but then... not really. The 2nd result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? Still doesn't make sense to me. (Side note - this query also brings back "Barh Super" - see issue #1 above)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, OR perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the other available phonetic algorithms yet, only b/c it seems Metaphone is the best and most commonly used. But we're open to suggestions regarding the other algorithms as well.
Thanks.
You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (phonetic token filter) or the Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm into our index", but there are several phonetic token filters you can choose from in the Azure Search custom analysis stack. Synonyms is a newer feature only available in preview; you can take a look here. For synonyms, you will need to define synonym rules, say 'Nate, Nathan, Nathaniel' for example, and at query time, searching for one automatically includes the results for the others.
Okay, then how should I use these building blocks to control relevance for my search? One way to model this is to use a separate field for each expansion strategy. For example, instead of a single field for the name, you can have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field, 'name', is for exact matches; 'name_synonym' has synonyms enabled; and the third uses a phonetic analyzer and broadens the search the most. You can then use a scoring profile to boost scores from matches in each field. You can give a boost value of 10 for exact matches, 5 for synonyms and 1 for phonetic expansions, for example. Your search will be issued against these three internal fields.
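The effect of the three fields and their boosts can be simulated in a few lines of Python. This is a toy model, not how Azure Search computes scores: each field is reduced to a per-token transform, the score is simply boost times number of matching query tokens, the synonym map is a hypothetical encoding of the 'Nate, Nathan, Nathaniel' rule, and phonetic_key is a crude consonant-skeleton stand-in that is explicitly NOT Metaphone.

```python
import re

# Boost values from the example: exact 10, synonyms 5, phonetic 1.
FIELD_BOOSTS = {"name": 10, "name_synonym": 5, "name_phonetic": 1}

# Hypothetical synonym rule: map every variant to one shared form.
SYNONYMS = {"nate": "nathan", "nathaniel": "nathan"}


def tokens(text):
    # Lowercase and split on anything that is not a letter.
    return re.findall(r"[a-z]+", text.lower())


def phonetic_key(word):
    # Crude stand-in for a phonetic filter (NOT Metaphone): keep the first
    # letter, drop later vowels and y, skip a consonant equal to the last kept one.
    key = word[0]
    for ch in word[1:]:
        if ch not in "aeiouy" and ch != key[-1]:
            key += ch
    return key


# One token transform per field, mimicking a different analyzer on each.
ANALYZERS = {
    "name": lambda w: w,
    "name_synonym": lambda w: SYNONYMS.get(w, w),
    "name_phonetic": phonetic_key,
}


def score(query, name):
    # Sum, over the three fields, boost * number of query tokens that match.
    total = 0
    for field, analyze in ANALYZERS.items():
        indexed = {analyze(w) for w in tokens(name)}
        matched = sum(1 for w in tokens(query) if analyze(w) in indexed)
        total += FIELD_BOOSTS[field] * matched
    return total


exact_hit = score("Barry Soper", "Soper, Barry Russell")
phonetic_hit = score("Barry Soper", "Barry Spear")
print(exact_hit, phonetic_hit)
```

In this toy model 'soper' and 'spear' collapse to the same phonetic key, so the phonetic field alone cannot tell the two names apart; it is the higher boosts on the exact and synonym fields that push the exact match to the top.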
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear': after phonetic analysis, the words 'soper' and 'spear' reduce to the same form at both indexing and query time and are treated as if they were identical terms. In computing the score and ranking, the search engine uses the analyzed form of the terms, and the degree of phonetic similarity has no influence on the score. That's why secondary factors, like field length, play a more significant role in the relevance score.
Hope this helps. I provided one example of how to model this, but you could also take a look at term boosting in the full Lucene query syntax.
Let me know if you have any additional questions.
Nate

Relational DB's View of View of View AntiPattern?

I have inherited a database that's causing me issues.
I need to describe something horrible to stakeholders. So far, using the names of anti-patterns and sending people away to do a Google search on them has been the most efficient way to buy me some time.
Trouble is, I have not come across this before. Here's what's happening.
I have a simple single table, with a couple of columns. One of these columns contains values like:
660x90_SomeCity_SomeCountryISO_ImageName_SomeRubbish
or
SomeIataAirportCode_SomeCountry_660x90_SomeRubbish_ImageName
Now the database contains logic (admittedly faultless so far, on current data) to extract and look up things so that the output has additional columns such as:
AdSize
Country
City
The trouble is that this is achieved through gradual conversions implemented in a labyrinth of 50 (not joking) different views. I've now got to formalize the logic into something like:
View One: Extract the first column and work out its length.
View Two: Now split off the 2nd column using the length.
View Three: If after replacing the x in the first column the value is numeric, store the value in "AdSize", and place the second value in the "CityCandidateOne" column.
To me this is a horrible antipattern and should all be done either in custom functions, or preferably during the ETL process, in one place so the logic can be captured.
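As a rough sketch of what "in one place" could look like during ETL, the first step of the view chain (spotting the ad size) fits in a single Python function. The function name and output keys are made up, and the city, country and airport lookups are deliberately left out because their rules are not given here; the remaining tokens are returned unresolved.

```python
import re

# An ad size looks like digits, an "x", then digits, e.g. 660x90.
AD_SIZE = re.compile(r"^\d+x\d+$")


def parse_label(raw):
    # Split on underscores and pick out the first token that looks like an ad size.
    parts = raw.split("_")
    ad_size = next((p for p in parts if AD_SIZE.match(p)), None)
    # Everything else still needs the lookup logic that lives in the views.
    unresolved = [p for p in parts if p != ad_size]
    return {"ad_size": ad_size, "unresolved": unresolved}


first = parse_label("660x90_SomeCity_SomeCountryISO_ImageName_SomeRubbish")
second = parse_label("SomeIataAirportCode_SomeCountry_660x90_SomeRubbish_ImageName")
print(first)
print(second)
```

Because the ad size is found by pattern rather than by position, both column layouts go through the same code path, which is the part the chain of views spreads over several steps.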
However I'm not being given the time and wonder if this is a known anti pattern. Usually I can then use the credibility of a Google search to buy a little time to really sort this out.
I'd start with this answer which covers the violation of First Normal Form.
I also found this free ebook that might be of value.
I understand that what you are facing is something on a grander scale than just putting a couple of values in a field with a comma or other token to separate them, but I don't know of any antipattern that covers such a baroque mess.
Finally, here you can find more about "replacing SQL logic with Views" as an antipattern (just look for "Views as SQL Building Blocks Anti-Pattern" in the article), but take into account that in that case the problem seems to be about inefficient access to the data.
Last minute edit: maybe this is just a special case of the general Golden Hammer antipattern? (see also: http://en.wikipedia.org/wiki/Golden_hammer)
Why not simply rewrite the SQL the way you would rather do it, then print out the execution plans of both versions and show the performance and timing of each? That should be enough to show them that it needs to change (and if there is no major performance difference, then your only other argument is maintainability, which you will have to make by showing them what it takes to make changes).

Elasticsearch querying multiple types and grouped by types?

Suppose I am to search against two types [cars] and [buildings], and I would want the results to be separated. Is there a way one can group results by types?
I understand one simple way would be to query each type separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way or a hacky way (like using sort) to achieve this?
This type of grouping behavior is (currently) not available in elasticsearch. It has been a long standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client-side aggregation. Request a lot more results than you plan on displaying and then bucket those.
Using multi-query. This allows you to easily pass down some number of queries in a single batch, but will have potential scaling problems if the number of queries gets too large.
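The client-side option is easy to sketch in Python. The hit dictionaries below are hand-written stand-ins shaped like search hits with `_type`, `_id` and `_score` keys; that shape and the per_type limit are assumptions for illustration, and no Elasticsearch client is involved.

```python
from collections import defaultdict


def bucket_by_type(hits, per_type=3):
    # Walk the hits from best to worst score and keep the first
    # per_type hits seen for each type.
    buckets = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h["_score"], reverse=True):
        if len(buckets[hit["_type"]]) < per_type:
            buckets[hit["_type"]].append(hit)
    return dict(buckets)


# Over-fetched results from a single query across both types (made-up data).
hits = [
    {"_type": "cars", "_id": "a", "_score": 1.2},
    {"_type": "buildings", "_id": "b", "_score": 2.0},
    {"_type": "cars", "_id": "c", "_score": 3.1},
    {"_type": "cars", "_id": "d", "_score": 0.4},
]
grouped = bucket_by_type(hits, per_type=2)
print({t: [h["_id"] for h in hs] for t, hs in grouped.items()})
```

The weakness is the one already mentioned: a type whose hits all score low may not appear in the over-fetched page at all, so its bucket can come back short or empty.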
This is one feature that Solr has that elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated into groups of documents, you're going to have to restructure your documents, since elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents; then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add a separate type like [price_aggregator] and put the fields [type] and [price] into it. You could then easily build your query against just one type. This requires some additional work while indexing and more memory to store the index, but it's much more performant when you search.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is to break down content by word count and/or create a complex set of search rules to give one piece of content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows of a way or a product out there that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting the index search to those documents that have that website in that field.
The other approach is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written, so it may simply allow you to collect your searches into more logical units, which helps you with your solution.
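Here is a tiny Python stand-in for that idea, just to show the shape of a partitioned search. It is not Lucene: the documents are plain dictionaries, the field names (id, website, body) are hypothetical, and a naive count of term occurrences takes the place of real relevance scoring.

```python
def search(docs, term, website=None):
    # Optionally restrict the search to one website, then order the
    # matching documents by how often the term occurs in the body.
    term = term.lower()
    results = []
    for doc in docs:
        if website is not None and doc["website"] != website:
            continue
        count = doc["body"].lower().split().count(term)
        if count:
            results.append((count, doc["id"]))
    # Highest count first; ties keep their original order.
    return [doc_id for count, doc_id in sorted(results, key=lambda r: -r[0])]


# Pages gathered from several sources into one index (made-up data).
docs = [
    {"id": 1, "website": "a.example", "body": "search tips and search tricks"},
    {"id": 2, "website": "b.example", "body": "search basics"},
    {"id": 3, "website": "a.example", "body": "cooking tips"},
]
everywhere = search(docs, "search")
only_b = search(docs, "search", website="b.example")
print(everywhere, only_b)
```

With all pages in one index and the origin stored as a fixed field, the same query can rank across every source or be narrowed to one of them without maintaining separate indexes.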
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.