Why is Typesense returning only a few of the matches in the index?

I have a Typesense server running and working, but I have found some strange behaviour. The index contains documents with a string field called "product_number", holding product numbers in the format "BKP001", "BKP002", ... "BKP999". There are 60 such documents.
However, when I query Typesense and search for "BKP", it finds only 4 seemingly random matching documents. Strangely, if I do a more specific search for "BKP01", it returns 4 documents: "BKP010", "BKP011", "BKP012" and "BKP013".
And when I search for "BKP03", it returns 4 documents: "BKP031", "BKP032", "BKP033" and "BKP034". So it is clear that all the documents are correctly indexed.
What could be the reason it doesn't find all the documents?

When there are several possible prefix matches for a particular keyword, Typesense limits the number of candidates it picks, for performance reasons, until more of the keyword is typed. If you want all results to be returned, add max_candidates=1000 and exhaustive_search=true as search parameters.
Also make sure you're using the latest version of Typesense, as relevance improvements are regularly released with each version.
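As a rough illustration of the two parameters above, here is a minimal Python sketch that builds the search URL by hand. The host, the collection name "products" and the per_page value are assumptions made for the example, not details from the question. per_page is included only because Typesense pages its results, so a single request would otherwise not return all 60 hits.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, query, query_by):
    """Build a Typesense search URL that asks for exhaustive prefix matching."""
    params = {
        "q": query,
        "query_by": query_by,
        # The two parameters suggested in the answer above.
        "max_candidates": 1000,
        "exhaustive_search": "true",
        # Assumption: raise the page size so one request can hold all matches.
        "per_page": 100,
    }
    return f"{base_url}/collections/{collection}/documents/search?{urlencode(params)}"


# Hypothetical host and collection name, used for illustration only.
url = build_search_url("http://localhost:8108", "products", "BKP", "product_number")
print(url)
```

The request itself still has to be sent with your API key in the X-TYPESENSE-API-KEY header; that part is left out here.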

Related

Boosting CloudSearch results which match facet

I'm using AWS CloudSearch for a search index, and the user can currently search over it for records which match the name field and a few others. However, we have users in different languages and I would like to give a boost to results which match their local language. Every record has a locale field, which could be a facet. However, I don't want to simply exclude results which don't match, and nor do I want to simply sort it so that everything in their language always comes first regardless of 'relevance' - I simply want to give a 'boost' to any result where locale=<my locale>.
In other words, I would like highly relevant matches in a different locale to still beat barely relevant matches in my own language, but relevant matches in my own language should definitely rank higher than matches in a different language.
Is there a way to do this when I query CloudSearch or should I just do the reordering client side once I have fetched all of the results?
So this was painfully convoluted, but I did manage to get it to work in the end. If you are doing a structured search (q.parser=structured), then you can perform a query that receives a boost if the locale field matches a given value.
Sadly, it gets a bit cumbersome: it doesn't really seem to be designed for boosting things that match something other than your main query, so by default it filters out anything that doesn't match the locale and excludes it entirely from the results. So you have to combine two versions of the query with an (or): one with the boost and one without.
So my basic query (which in my case was already (or 'un' (prefix 'un'))) now becomes (or (and <OLD_QUERY> (term field=locale boost=2.5 'en')) <OLD_QUERY>)
In other words: EITHER a match for the original query where locale='en', boosted by 2.5, OR a match for that same query in any locale without a boost.
Painful, but it works!
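If you build the query string in code, the wrapping described above can be done with a small helper. This is only a sketch: the function name boost_locale is made up for the example, and it does no escaping of quotes in the locale value.

```python
def boost_locale(query, locale, boost=2.5, field="locale"):
    """Wrap a structured query so that matches in `locale` get a boost.

    The result is: (or (and <query> (term field=... boost=... '<locale>')) <query>)
    """
    boosted = f"(and {query} (term field={field} boost={boost} '{locale}'))"
    # The unboosted copy keeps matches from other locales in the results.
    return f"(or {boosted} {query})"


old_query = "(or 'un' (prefix 'un'))"
print(boost_locale(old_query, "en"))
```

The resulting string is what you would pass as q, together with q.parser=structured.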

Sql Server search entire Json document for value

I have a few thousand rows in my table (SQL Server 2016).
One of the columns stores JSON documents (NVARCHAR(max)).
The JSON documents are quite complex in terms of nesting etc., and they can also differ a lot from one another.
My goal is to search each document for a certain match. Say: "MagicNo":"999000".
So if the document has a property "MagicNo" and if the value is 999000 then it's a match.
I know you can navigate through the document using JSON_VALUE with $. followed by the path, but since those docs can be very different, the "MagicNo" property may appear pretty much anywhere in the document (a lot of nesting). So a fixed path expression is out of the question here.
Is there some kind of wild card I could use with JSON_VALUE to say search the entire doc and return it if the match is found?
The simple like '%999000%' and CONTAINS searches on the VARCHAR column are out of the question here due to poor performance.
Any thoughts?
Thanks.

Lucene not giving results when specifying field

I have a database which I have indexed in Lucene (using Pylucene) by section (specified by markup in the document) using lucene's fields. This index seems to work fine. I can search it using the default field which is simply the entire document and get reasonable results.
The problem is, when I search it using a specific section (not the default), I expect to get a certain number of results back (as specified by IndexSearcher.search(query, results)), but instead it might simply return nothing. So my question is: how can I get it to return a ranked list with the number of results I specify?
The only place I specify the field is in the QueryParser, by calling:
QueryParser(Version.LUCENE_CURRENT, field, StandardAnalyzer)
I would verify the index using Luke (which is something I do often when modifying my index strategy).

Lucene - Which field contains search term?

I have developed a search application with Lucene. I have created the basic search. Basically, my app works as follows:
My index has many fields. (Around 40)
The user can enter a query against multiple fields, e.g.: +NAME:John +SURNAME:Doe
Queries can contain wildcards such as ? and *, e.g.: +NAME:J?hn +SURNAME:Do*
Queries can also contain fuzzy terms, e.g.: +NAME:Jahn~0.5
Now I want to find out which field(s) contain my search term(s). As I am using wildcard and fuzzy queries, I cannot just do a string comparison. How can I do it?
If you need it for debugging purposes, you could use IndexSearcher.explain.
Otherwise, this problem looks like highlighting, so you should be able to find out the fields that matched by:
re-analyzing your document,
or using its term vectors.

Alfresco: Lucene query by ID returns 2 rows

I'm using Alfresco 3.4d and imported some nodes as well as created a few with NodeService. Today I noticed that a Lucene query by ID sometimes returns two rows instead of just one. Not all nodes show this behavior.
For example, the following Lucene query, executed in the Alfresco Node Browser, returns two rows: ID:"workspace://SpacesStore/96c0cc27-cb8c-49cf-977d-a966e5c5e9ca"
How is it even possible that a query by ID can return more than one row? I tried rebuilding the Lucene index, but it didn't help. When I delete the node, the query returns 0 rows. What can I do to remove those "ghost" nodes from the query result?
I also ran across this problem and asked Alfresco support for advice. They told me that it is perfectly normal to have duplicate entries in the Lucene ID field and that this is related to whether an ANCESTOR is present or not. They recommended using the sys:node-uuid field when doing a Lucene search for the node's ID, e.g.:
#sys\:node-uuid:f13a21dd-b020-4c70-aa21-1a0e5c89d42b
I've seen this problem since Alfresco 3.2r, but maybe it is even older! I used the Lucene index viewer "Luke" (http://www.getopt.org/luke/) to check the index directly and saw that the corrupt index entry contains almost no information. As a workaround, we combined our search with some basic information such as node type or aspect. I will ask a colleague if he has more information about this.
I don't know exactly how this is possible, but in the code where you retrieve the nodes you could always check node.isDocument or node.isContainer, or check that the type is cm:content or cm:folder, to keep only the real result.
You could also try to re-index, but I doubt that will be of any help.