How to find href=blah but not href=/blah with Full-text search

I'm currently using the query
SELECT Url FROM Link WHERE CONTAINS(Url, 'href=blah')
It is including results with href=/blah. Any way I can tell the query to act more like WHERE Url LIKE '%href=blah%' and still use the full-text catalog?

The problem is that = and / are both word breakers; in other words, SQL Server full-text search is actually searching for the words href and blah.
There are a couple of options you could try. First, you could narrow the search domain using the full-text engine, then search that subset of the data using LIKE. You'll need to experiment to see which combination performs best.
The other option, if href=blah is a consistent term, is to add it to a custom dictionary. A good article on this is here.
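To make the first option concrete, here is a minimal Python sketch of the two-phase idea. It is not SQL Server code: break_words is a hypothetical stand-in that only mimics the effect of a word breaker, and the sample URLs are invented. Phase 1 plays the role of CONTAINS, phase 2 the role of LIKE '%href=blah%'.

```python
import re

def break_words(text):
    # Hypothetical stand-in for a word breaker: split on anything that is
    # not a letter or digit. Real word breakers are language-specific.
    return [w.lower() for w in re.split(r"[^0-9A-Za-z]+", text) if w]

def contains_phrase(text, phrase):
    """Phase 1: word-level phrase match, roughly what the full-text engine sees."""
    words, wanted = break_words(text), break_words(phrase)
    return any(words[i:i + len(wanted)] == wanted
               for i in range(len(words) - len(wanted) + 1))

def two_phase_search(rows, phrase):
    """Phase 2: exact substring check (like LIKE) on the phase 1 survivors only."""
    candidates = [r for r in rows if contains_phrase(r, phrase)]
    # Note: this check is case-sensitive; LIKE depends on the column collation.
    return [r for r in candidates if phrase in r]

urls = ['<a href=blah>', '<a href=/blah>', '<a href=other>']
candidates = [u for u in urls if contains_phrase(u, 'href=blah')]
matches = two_phase_search(urls, 'href=blah')
print(candidates)  # both href=blah and href=/blah survive phase 1
print(matches)     # only the exact href=blah row survives phase 2
```

The point of the split is that the expensive substring scan runs only on the rows the index has already narrowed down.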

Related

SQL Server Efficient Search for LIKE '%str%'

In SQL Server, I have a table containing 46 million rows.
I want to search the "Title" column of the table. The search word may appear at any position in the field value.
For example:
Value in table: BROTHERS COMPANY
Search string: ROTHER
I want this search to match the given record. This is exactly what LIKE '%ROTHER%' does. However, a LIKE '%str%' pattern should not be used on large tables because of performance issues. How can I achieve this?

Though I don't know your requirements, your best approach may be to challenge them. Middle-of-the-string searches are usually not very practical. If you can get your users to perform prefix searches (broth%) then you can easily use Full Text's wildcard search (CONTAINS(*, '"broth*"')). Full Text can also handle suffix searches (%rothers) with a little extra work.
But when it comes to middle-of-the-string searches with SQL Server, you're stuck using LIKE. However you may be able to improve performance of LIKE by using a binary collation as explained in this article. (I hate to post a link without including its content but it is way too long of an article to post here and I don't understand the approach enough to sum it up.)
If that doesn't help and if middle-of-the-string searches are that important of a requirement then you should consider using a different search solution like Lucene.
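The answer does not spell out the "little extra work" for suffix searches. One common approach (an assumption here, not necessarily what the author had in mind) is to index each word reversed, so that a suffix search becomes a prefix search. A minimal Python sketch with invented sample titles:

```python
import bisect

titles = ["BROTHERS COMPANY", "MOTHERS LTD", "OTHER GOODS"]

# Index every word reversed, so "ends with X" becomes "starts with reversed X".
reversed_index = sorted(
    (word[::-1], title) for title in titles for word in title.split()
)

def suffix_search(suffix):
    """Return titles containing a word that ends with `suffix`."""
    key = suffix[::-1]
    # Jump to the first reversed word >= key, then walk while the prefix matches.
    start = bisect.bisect_left(reversed_index, (key,))
    found = []
    for rev_word, title in reversed_index[start:]:
        if not rev_word.startswith(key):
            break
        found.append(title)
    return sorted(set(found))

result = suffix_search("ROTHERS")
print(result)
```

This still does not cover a true middle-of-the-string search such as ROTHER, which matches neither the start nor the end of BROTHERS.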

Add a full-text index if you want.
You can then search the table using CONTAINS:
SELECT *
FROM YourTable
WHERE CONTAINS(TableColumnName, 'SearchItem')

SQL like '%term%' except without letters

I'm searching against a table of news articles. The two relevant columns are ArticleTitle and ArticleText. When I want to search an article for a particular term, I started out with
column LIKE '%term%'.
However, that gave me a lot of articles with the term inside anchor links, for example <a href="example.com/*term*">, which could return an irrelevant article.
So then I switched to
column LIKE '% term %'.
The problem with this query is that it didn't find articles whose title or text began or ended with the term. It also didn't match things like term- or term's, which I do want.
It seems like the query I want should be able to do something like this
'%[^a-z]term[^a-z]%'
This should exclude terms within anchor links but match everything else. I think this query still excludes strings that begin or end with the term. Is there a better solution? Does SQL Server's full-text indexing solve this problem?
Additionally, would it be a good idea to store ArticleTitle and ArticleText as HTML-free columns? Then I could use '%term%' without getting anchor links. These would be two extra columns, though, because eventually I will need the original HTML for formatting purposes.
Thanks.

SQL Server's LIKE lets you define regex-like patterns such as the one you described.
A better option is to use full-text search:
WHERE CONTAINS(ArticleTitle, 'term')
This uses the index properly (the LIKE '%term%' query is slow) and brings other benefits from the search algorithm.
Additionally, you might benefit from storing a plaintext version of the article alongside the HTML version and running your search queries against it.
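On the begin/end problem from the question: a common workaround is to pad the searched value with a space on each side before applying the '%[^a-z]term[^a-z]%' pattern. The Python sketch below emulates that with a regular expression; has_term is a hypothetical helper, and lowercasing both sides is an assumption standing in for a case-insensitive collation.

```python
import re

def has_term(text, term):
    """True if `term` occurs with no ASCII letter directly before or after it."""
    # Padding lets the term match at the very start or end of the text.
    padded = " " + text.lower() + " "
    pattern = "[^a-z]" + re.escape(term.lower()) + "[^a-z]"
    return re.search(pattern, padded) is not None

print(has_term("Term limits debated", "term"))   # starts with the term
print(has_term("A long-term plan", "term"))      # punctuation before the term
print(has_term("The term's meaning", "term"))    # apostrophe after the term
print(has_term("Airport terminal", "term"))      # letters follow, so no match
```

Note that a URL such as example.com/term still matches, because / and " are not letters. The boundary pattern alone does not exclude anchor links, which is where the plaintext copy helps.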

SQL is not designed to interpret HTML strings. As such, you'd only be postponing the problem until a harder case comes along (for example, a comment node that contains your search terms as part of a plain sentence).
You can still use full-text search as a prefilter and then run an HTML analysis in the application layer to further filter your result set.
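A rough sketch of what the application-layer step could look like, using Python's standard html.parser. The class and function names and the sample articles are made up for illustration. It keeps only text nodes, so attribute values and comments are ignored.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text nodes only; tags, attributes and comments are skipped."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

def mentions(html, term):
    # Simple substring check on the extracted text; swap in a stricter
    # word-boundary test if needed.
    return term.lower() in visible_text(html).lower()

articles = [
    '<p>See <a href="example.com/term">this page</a>.</p>',
    '<p>The term appears here.</p>',
    '<p>Nothing</p><!-- the term is only in a comment -->',
]
hits = [a for a in articles if mentions(a, "term")]
print(hits)
```

The contents of script and style elements would still arrive through handle_data, so a real filter would probably need to skip those as well.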

Cloudant Search: Why are my wildcards not working?

I have a Cloudant database with a search index. In the search index I index the titles of my documents. For instance, search for 'rijkspersoneel':
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspersoneel
Returns 48 rows.
However, when I replace the 'o' with a ? wildcard:
http://wetten.cloudant.com/regelingen/_design/RegelingInfo/_search/regeling?q=title:rijkspers?neel
I get 0 results. Why is that? The Cloudant docs say that this should match 'rijkspersoneel' as well!

My previous answer was definitely mistaken. Internal wildcards do appear to be supported. Try:
title:rijkspe*on*
title:rijksper?on*
I'm fairly sure this is an analysis issue: you are probably using a stemming analyzer. I'm not all that familiar with Cloudant and its implementation of this, but in Lucene, wildcard queries are not subject to the same analysis as term queries. I'm guessing that the analysis of this field includes a stemmer, in which case "rijkspersoneel" is actually indexed as "rijkspersone".
So, when you search for
rijkspersonee*
or
rijksper?oneel
you find no matches, since the "el" is missing from the end of the indexed form. When you just search for rijkspersoneel, the query does get analyzed, so you search for the stemmed form of the word and find matches.
Stemming and wildcards just don't get along.
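The mismatch can be reproduced with a toy model in Python. Everything here is an assumption for illustration: toy_stem is a made-up stemmer that just strips a trailing "el" to mirror the guess above, and fnmatch stands in for wildcard matching. The term query is analyzed before lookup; the wildcard query is compared against the indexed (stemmed) terms as typed.

```python
from fnmatch import fnmatchcase

def toy_stem(word):
    # Hypothetical stemmer: strips a trailing "el", mirroring the guess above.
    return word[:-2] if word.endswith("el") else word

# Terms are stemmed at index time.
indexed_terms = {toy_stem(w) for w in ["rijkspersoneel", "regeling"]}

def term_query(word):
    """Term queries go through the same analysis as the indexed text."""
    return toy_stem(word.lower()) in indexed_terms

def wildcard_query(pattern):
    """Wildcard queries skip stemming and match the indexed terms directly."""
    return sorted(t for t in indexed_terms if fnmatchcase(t, pattern.lower()))

print(term_query("rijkspersoneel"))      # analyzed, so it finds the stemmed term
print(wildcard_query("rijkspers?neel"))  # pattern expects the unstemmed ending
print(wildcard_query("rijkspe*on*"))     # trailing * tolerates the stemmed ending
```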

Basic Lucene Beginners q: Index and Autocomplete

I am using Lucene.NET and have a basic question:
Do I need to make an additional Index for Autocompletion?
I created an index based on two different tables from a database.
Here are two Docs:
stored,indexed,tokenized,termVector<URL:/Service/Zahlungsmethoden/Teilzahlung>
stored,indexed,tokenized,termVector<Website:Body:The Text of the first Page>
stored,indexed,tokenized,termVector<Website:ID:19>
stored,indexed,tokenized,termVector<Website:Title:Teilzahlung>
stored,indexed,tokenized,termVector<URL:/Service/Kundenservice/Kinderbetreeung>
stored,indexed,tokenized,termVector<Website:Body:The text of the second Page>
stored,indexed,tokenized,termVector<Website:ID:13>
stored,indexed,tokenized,termVector<Website:Title:Kinderbetreuung>
I need to create a dropdown for a search with suggestions:
eg: term "Pag" should suggest "Page"
so I assume that for every word (token) in every doc, I need a list like:
p
pa
pag
page
is this correct?
Where do I store these?
In an additional Index?
Or how would I re-arrange the existing structure of my index to hold the autocompletion-suggestions?
Thank you!

1) Like femtoRgon said above, look at the Lucene Suggest API.
2) That being said, one cheap way to perform auto-suggest is to look for all words that start with the string you've typed so far, like 'pa' returning 'pa' + 'pag' + 'page'. A wildcard query would return those results -- in Lucene query syntax, a query like 'pa*'. (You might want to restrict the suggestions to only those strings of length 2+.)

Mark Leighton Fisher has the right approach for a cheap solution, but a wildcard query only returns the matching documents, not the matching words. It may be better to look at the implementation of WildcardQuery: you need to take the Terms object retrieved from the IndexReader and iterate through the terms in the index.
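The idea of iterating over terms rather than documents can be sketched in Python with a plain sorted list standing in for the index's term dictionary. This is not the Lucene.NET API: build_term_dictionary and suggest are hypothetical helpers, and the sketch assumes terms are available in sorted order.

```python
import bisect

def build_term_dictionary(docs):
    """Collect the distinct lowercased words of all documents, sorted."""
    terms = set()
    for text in docs:
        terms.update(word.lower() for word in text.split())
    return sorted(terms)

def suggest(terms, prefix, limit=10):
    """Seek to the first term >= prefix, then read terms while they match."""
    prefix = prefix.lower()
    start = bisect.bisect_left(terms, prefix)
    out = []
    for term in terms[start:]:
        if not term.startswith(prefix) or len(out) >= limit:
            break
        out.append(term)
    return out

docs = ["The Text of the first Page", "The text of the second Page"]
terms = build_term_dictionary(docs)
suggestions = suggest(terms, "Pag")
print(suggestions)
```

With this approach there is no need to store p, pa, pag and page separately: the prefix lookup runs against the existing list of whole terms.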

How do I get accurate search result in Lucene using Query syntax

So far I have been testing the keywords that I entered in Sitecore using the query syntax, but the search results do not rank the expected page first.
For example, if I use the query syntax on the word book: (title:book)^1
I want the index page that is named book to appear first in the search results, not bookmark.
Also, every time I publish a new page in Sitecore, the results for the keyword Book get pushed down to the end or don't appear on the search page at all.
How do I get accurate results from Lucene for the search page?
I've also been following http://www.lucenetutorial.com/lucene-query-syntax.html on how to boost search results, but it doesn't work.
Can someone explain how boosting a search term works?

I recommend you use the Advanced Database Crawler to get the most out of Lucene.NET with Sitecore. It comes with a config file for the indexes that has a section called <dynamicFields ... >. In that section, you can specify an individual Sitecore field and adjust its boost attribute. The default boost for every field is 1f, that is, the floating-point value 1.
More reading:
Sitecore Searcher and Advanced Database Crawler
Source code for the ADC
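As for how a boost works in general, the toy Python model below may help. It is deliberately not Lucene's real scoring formula: it only assumes that a field boost acts as a multiplier on that field's contribution to the score. The documents, field names and boost value are invented.

```python
def score(doc, term, boosts):
    """Toy score: per-field count of the exact word, times that field's boost."""
    total = 0.0
    for field, text in doc.items():
        count = text.lower().split().count(term.lower())
        total += boosts.get(field, 1.0) * count  # default boost is 1
    return total

docs = {
    "book": {"title": "Book", "body": "A page about reading"},
    "bookmark": {"title": "Bookmark",
                 "body": "Keep your place in a book with a book ribbon"},
}

def rank(term, boosts):
    """Order document names by descending toy score."""
    return sorted(docs, key=lambda name: score(docs[name], term, boosts),
                  reverse=True)

default_order = rank("book", {})
boosted_order = rank("book", {"title": 3.0})
print(default_order)  # body matches outweigh the single title match
print(boosted_order)  # the boosted title match now ranks first
```

A boost of 1, as in (title:book)^1, leaves the ranking unchanged in this model, which may be why the query in the question had no visible effect.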
