CrateDB custom sorting

For example, we have elements like
12345678
23487653
12475805
23382349
When we search for 234, I want to see the results in this order:
23487653
12345678
23382349
For searching, I am using an ngram analyzer, but I am not able to figure out how to do the sorting. I tried sorting on _score, but _score is the same for many values.
PostgreSQL has a position function, but it is currently not supported in CrateDB.
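The ordering asked for above amounts to sorting by the index at which the search term first appears. A minimal sketch in Python of that ordering, applied on the client side to rows that have already been fetched; the helper name rank_by_match_position is hypothetical and this is not a CrateDB feature:

```python
def rank_by_match_position(values, term):
    """Keep values containing `term`, ordered by where the match starts."""
    matches = [v for v in values if term in v]
    # str.find returns the 0-based index of the first occurrence;
    # ties are broken by the value itself to keep the order predictable.
    return sorted(matches, key=lambda v: (v.find(term), v))


elements = ["12345678", "23487653", "12475805", "23382349"]
ranked = rank_by_match_position(elements, "234")
print(ranked)
```

Re-ranking like this only makes sense for a bounded result page, since it happens after the database has returned the matches.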

Related

PostgreSQL Full Text Search with substrings

I'm trying to find the fastest way to search 80+ million records in PostgreSQL (version 9.4) across multiple columns.
I would like to try and use standard PostgreSQL, and not Solr etc.
I'm currently testing Full Text Search, following https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like some more flexible way to search.
Currently, if I have one column containing e.g. "Volvo" and another containing "Blue", I am able to find the record with the search string "volvo blue", but I would also like to find the record using "volvo blu", as if I used LIKE with '%blu%'.
Is that possible with full text search?
The only option for something like this is the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
If that is not good enough for your purposes, you'll have to resort to a third-party solution.
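To make the two notes concrete, here is a rough sketch in Python of trigram similarity followed by a second filtering condition. It only approximates pg_trgm (words are lowercased and padded with two leading spaces and one trailing space, but splitting is on whitespace only); the function names and the 0.3 threshold used here are assumptions for illustration:

```python
def trigrams(text):
    """Return the set of 3-character sequences, padding each word roughly like pg_trgm."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def search(values, term, threshold=0.3):
    # Step 1: the similarity test may let through false positives.
    candidates = [v for v in values if similarity(v, term) > threshold]
    # Step 2: a LIKE-style substring check removes them.
    return [v for v in candidates if term.lower() in v.lower()]


print(search(["Blue", "Blur", "Red"], "blue"))
```

Here "Blur" passes the similarity test against "blue" but is dropped by the substring check, which is the kind of false positive the first note warns about.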

Optimising LIKE expressions that start with wildcards

I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.
SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'
I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.
I think I'm right in saying that indexes won't be used for seek operations because of the preceding wildcard.
Using full text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string with Guil* in the query below and this would return results). Running the following certainly returns no results:
SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')
Is there any way to optimise queries with preceding wildcards?
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address, each carrying the primary key of Table.
When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:
abcd
bcd
cd
d
along with the unique id of the row in Table. (This can all be done using a trigger.)
Create an index on AddressSubstrings(AddressSubstring).
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
AddressSubstrings ads
ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with nham, so LIKE can make use of an index (and a full text index also works).
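The idea can be sketched outside the database as well. The following Python example is an illustration only, with hypothetical names: a sorted list of (suffix, id) pairs stands in for the indexed AddressSubstrings table, and a binary search plays the role of the index seek for LIKE 'nham%':

```python
import bisect


def build_suffix_index(rows):
    """rows: iterable of (row_id, text). Returns a sorted list of (suffix, row_id)."""
    index = []
    for row_id, text in rows:
        lowered = text.lower()
        for start in range(len(lowered)):
            index.append((lowered[start:], row_id))
    index.sort()
    return index


def search(index, term):
    """Find ids of rows containing `term` via a prefix range scan over the suffixes."""
    term = term.lower()
    # (term,) sorts before every (term..., id) pair, so this is the first candidate.
    pos = bisect.bisect_left(index, (term,))
    ids = set()
    while pos < len(index) and index[pos][0].startswith(term):
        ids.add(index[pos][1])
        pos += 1
    return sorted(ids)


rows = [
    (1, "1 Farnham Road, Guildford, Surrey, GU2XFF"),
    (2, "10 High Street, Woking"),
]
suffix_index = build_suffix_index(rows)
print(search(suffix_index, "nham"))
```

The cost is storage: every address contributes one row per character, which is part of why the approach is "not really recommended".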
If you are interested in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but using n-grams. The only problem with n-grams for your particular problem is that they require rewriting the comparison as well as changing the storage.
I can't offer a complete solution to this difficult problem.
But if you're looking to create a suffix search capability, in which, for example, you'd be able to find the row containing HWilson with ilson and the row containing ABC123000654 with 654, here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course, this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL Server, allow the definition, and indexing, of computed or virtual columns.
I've deployed this technique, to the delight of end users, in a health-care system with lots of record IDs like ABC123000654.
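A small Python sketch of why the reversal helps, under the assumption that the reversed value is stored and indexed (the sorted list below stands in for an index on a computed column; the function names are hypothetical): a suffix match on the original string becomes a prefix match on the reversed one, and prefix matches can be answered by a range scan.

```python
import bisect


def build_reversed_index(values):
    """Sorted (reversed_value, original_value) pairs, like an index on REVERSE(col)."""
    return sorted((v[::-1], v) for v in values)


def ends_with(index, suffix):
    """Return the original values ending in `suffix` using a prefix scan."""
    key = suffix[::-1]
    pos = bisect.bisect_left(index, (key,))
    found = []
    while pos < len(index) and index[pos][0].startswith(key):
        found.append(index[pos][1])
        pos += 1
    return found


reversed_index = build_reversed_index(["HWilson", "ABC123000654", "JSmith"])
print(ends_with(reversed_index, "ilson"))
print(ends_with(reversed_index, "654"))
```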
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious - any search path optimisation - leading to the decision whether an index is used, or which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal to) or range checking (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It would boil down to Vertica's text search feature, where that problem is solved. See here:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Tables/TextSearch/UsingTextSearch.htm
For any other database platform, including MS SQL, you'll have to do that manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id|the_search_col                           |other_col
42|The Restaurant at the End of the Universe|Arthur Dent
43|The Hitch-Hiker's Guide to the Galaxy    |Ford Prefect
Your auxiliary search table could contain:
id|seq|search_token
42|  1|Restaurant
42|  2|End
42|  3|Universe
43|  1|Hitch-Hiker
43|  2|Guide
43|  3|Galaxy
Normally, you suppress typical "fillers" like articles, prepositions and apostrophe-s, and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find splitting token candidates .... :-]
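As an over-simplified illustration of how the auxiliary rows above could be produced, here is a Python sketch. The stop-word list and the function names are assumptions made for this example; a real implementation would live in the database or in the load process:

```python
# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "at", "of", "to", "a", "an", "in", "on"}


def tokenize(text):
    """Split on white space, trim punctuation, drop apostrophe-s and fillers."""
    tokens = []
    for word in text.split():
        word = word.strip(".,;:!?\"()")
        if word.lower().endswith("'s"):
            word = word[:-2]
        if word and word.lower() not in STOPWORDS:
            tokens.append(word)
    return tokens


def build_aux_rows(base_rows):
    """base_rows: iterable of (id, text). Returns (id, seq, search_token) rows."""
    aux = []
    for row_id, text in base_rows:
        for seq, token in enumerate(tokenize(text), start=1):
            aux.append((row_id, seq, token))
    return aux


base = [
    (42, "The Restaurant at the End of the Universe"),
    (43, "The Hitch-Hiker's Guide to the Galaxy"),
]
for row in build_aux_rows(base):
    print(row)
```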
You could start by the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause, like here:
Pivot sql convert rows to columns
Then, use a combination of, probably nested, CHARINDEX() and SUBSTRING() using the index you get from the CROSS JOIN with a series of index integers as described in my post suggested above, and use that very index as the sequence for the auxiliary search table.
Lay an index on search_token and you'll have a very fast access path to a big table.
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane

Any way to use strings as the scores in a Redis sorted set (zset)?

Or maybe the question should be: What's the best way to represent a string as a number, such that sorting their numeric representations would give the same result as if sorted as strings? I devised a way that could sort up to 9 characters per string, but it seems like there should be a much better way.
In advance, I don't think using Redis's lexicographical commands will work. (See the following example.)
Example: Suppose I want to presort all of the names linked to some ID so that I can use ZINTERSTORE to quickly get an ordered list of IDs based on their names (without using Redis's SORT command). Ideally I would have the IDs as the zset's members, and the numeric representation of each name would be the zset's scores.
Does that make sense? Or am I going about it wrong?
You're trying to use an order preserving hash function to generate a score for each id. While it appears you've written one, you've already found out that the score's range allows you to use only the first 9 characters (it would be interesting to see your function btw).
Instead of this approach, here's a simpler one that would be easier IMO - use set members of the form <name>:<id> and set the score to 0. You'll be able to use lexicographical ordering this way and use something like split(':') to get the id from the set's members.
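A sketch of the member format in Python, without a Redis connection: since members with equal scores are ordered lexicographically, sorting the <name>:<id> strings locally shows the order the zset would return. The helper names and sample data are made up for illustration:

```python
def make_member(name, record_id):
    """Build a zset member of the form <name>:<id>."""
    return f"{name}:{record_id}"


def split_member(member):
    """Split on the last ':' so that names containing ':' still work."""
    name, record_id = member.rsplit(":", 1)
    return name, record_id


people = {"17": "carol", "4": "alice", "9": "bob"}

# With every score set to 0, the zset order is the lexicographic member order.
members = sorted(make_member(name, rid) for rid, name in people.items())
ordered_ids = [split_member(m)[1] for m in members]
print(ordered_ids)
```

Splitting from the right is a small design choice: the id is assumed never to contain ':' while a name might.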

Sort by preference nhibernate or sql server

I have a list of users in a table and, when performing a search on the table, I want the usernames that begin with the search key to appear at the top of the list, followed by users who have the search key elsewhere in their username.
For example, consider the list of usernames:
rene
irene
adler
leroy
Argog
harry
evan
I am providing "suggestions" as the user types in a search box when they are trying to search for other users. If the user types va into the search box, more often than not they will be looking for the user vain, but because I'm sorting the users by username in ascending order, evan is always on top. What I'd want is to order like so:
searching for va
vain
evan
searching for re
rene
irene
searching for ar
Argog
harry
Of course, if they enter one more character it will be narrowed down further.
Thinking about it, this is what I want to do: put the usernames that start with the search key on top (if multiple usernames start with the search key, sort them alphabetically). Then, if the required number of usernames isn't complete, append the other usernames that contain the search key, alphabetically.
I'm paginating the results in SQL itself, and I'm using NHibernate QueryOver to perform the task. By "if the required number of usernames isn't complete", I mean that if the page size is 10 and I have only 7 usernames, the other usernames that contain the search key should be appended.
I can't currently see a way to do all this in one query. Do I have to split the query into two parts and contact the database twice to get this done, or is there a way to sort by the position of the string?
Any hints about how to do this efficiently would be very helpful. I can't even think of the query that would do this in plain SQL.
thanks.
Solution
The accepted answer pushed me in the right direction and this is what finally worked for me:
.OrderBy(Projections.Conditional(
    Restrictions.Where(r => r.Username.IsLike(searchKey + "%")),
    Projections.Constant(0),
    Projections.Constant(1))).Asc();
In plain SQL you could craft an ORDER BY clause such as:
ORDER BY CASE WHEN field LIKE 'VA%' THEN 0
              WHEN field LIKE '%VA%' THEN 1
              ELSE 2
         END
Of course you can use variables/field names instead.
Not sure as to the rest of your question.
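To see the CASE ordering work end to end in a single query, including pagination, here is a small sketch using Python's built-in sqlite3 module rather than SQL Server or NHibernate. The table name, the suggest helper and the extra user vain (mentioned in the question but missing from its list) are assumptions for the example; SQLite's LIKE is case-insensitive for ASCII by default:

```python
import sqlite3


def suggest(conn, search_key, page_size=10):
    """Usernames containing search_key, prefix matches first, then alphabetical."""
    # Note: '%' and '_' inside search_key are not escaped in this sketch.
    anywhere = "%" + search_key + "%"
    prefix = search_key + "%"
    rows = conn.execute(
        """
        SELECT username
        FROM users
        WHERE username LIKE ?
        ORDER BY CASE WHEN username LIKE ? THEN 0 ELSE 1 END,
                 username COLLATE NOCASE
        LIMIT ?
        """,
        (anywhere, prefix, page_size),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT)")
names = ["rene", "irene", "adler", "leroy", "Argog", "harry", "evan", "vain"]
conn.executemany("INSERT INTO users VALUES (?)", [(n,) for n in names])

print(suggest(conn, "va"))
print(suggest(conn, "re"))
print(suggest(conn, "ar"))
```

Because the ranking happens in ORDER BY, the LIMIT fills the page with prefix matches first and only then with the remaining matches, so no second round trip is needed.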
QueryOver based on Goat_CO's idea:
session.QueryOver<YourClass>()
    .OrderBy(
        Projections.Conditional(
            Restrictions.Like(Projections.Property<YourClass>(x => x.Pro),
                              searchString,
                              MatchMode.Anywhere),
            Projections.Constant(0),
            Projections.Constant(1)))
    .Asc;

Lucene numeric range search with LUKE

I have a number of numeric Lucene indexed fields:
60000
78500
105000
If I use LUKE to query for 78500 as follows:
price:78500
It returns the correct record; however, if I try to return all three records as a range, I get no results.
price:[60000 TO 105000]
I realise this is due to padding, as numbers are treated as strings by Lucene; however, I just wish to know what I should be putting into LUKE to return the three records.
Many thanks for any help.
If the fields are indexed as NumericField, you must use the "Use XML Query Parser" option in the query parser tab and version 3.5 of Luke:
https://code.google.com/p/luke/downloads/detail?name=lukeall-3.5.0.jar&can=2&q=
An example of query with a string and numeric field is:
<BooleanQuery>
  <Clause fieldName="colour" occurs="must">
    <TermQuery>rojo</TermQuery>
  </Clause>
  <Clause fieldName="price" occurs="must">
    <NumericRangeQuery type="int" lowerTerm="4000" upperTerm="5000" />
  </Clause>
</BooleanQuery>
The solution I used was to add the price values to the index in zero-padded form. I then query using the padded values, which works great. The new values in the index were:
060000
078500
105000
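The effect of padding can be checked with plain string comparison, which is how a term range over string fields behaves. A minimal Python sketch, assuming a fixed width of 6 digits as in the values above (the helper names are hypothetical):

```python
def pad(value, width=6):
    """Left-pad a number with zeros so that string order matches numeric order."""
    return str(value).zfill(width)


def in_range(term, low, high):
    """Lexicographic range check, the way string terms are compared."""
    return low <= term <= high


prices = [60000, 78500, 105000]

# Unpadded: "60000" sorts after "105000", so the range is empty.
raw_hits = [p for p in prices if in_range(str(p), "60000", "105000")]

# Padded: "060000" <= "078500" <= "105000", so all three match.
padded_hits = [p for p in prices if in_range(pad(p), pad(60000), pad(105000))]

print(raw_hits)
print(padded_hits)
```

The width has to cover the largest value that will ever be indexed, and the query bounds must be padded the same way as the indexed values.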
This solution was tied into an Examine search issue for Umbraco, so there is a thread on the forum with an end-to-end walkthrough of how to implement a numeric range search, if anyone requires it. It is located here:
Umbraco Forum Thread
Zero padding won't come into this particular query since all the numbers you've shown have the same number of digits
The range query you've shown has too many zeros on the second part of the range
So the query for the data you've shown would be price:[10500 TO 78500]
Hope this helps,
I assume these fields are indexed as NumericFields. The problem with them is that Lucene/Luke does not know how to parse numeric queries automatically. You need to override Lucene's QueryParser and provide your own logic for how these numbers should be interpreted.
As far as I know, Luke allows plugging in your custom parser; it just needs to be present in the CLASSPATH.
Have a look at this thread on Lucene mailing list:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201102.mbox/%3CAANLkTi=XUpyw09tcbjuTzNRpMJa730Cq-6_1agMAjYz6#mail.gmail.com%3E