Lucene gotchas with punctuation

Whilst building some unit tests for my Lucene queries I noticed some strange behavior related to punctuation, in particular around parentheses.
What are some of the best ways to deal with search fields that contain significant amounts of punctuation?

If you haven't customized the query parser, Lucene should behave according to the default query parser syntax. Are you getting something different than that? Do you want punctuation to have a special meaning or just to remove the punctuation from searches?
The other usual suspect here is the Analyzer, which determines how your field is indexed and how the query is broken into pieces for searching. Can you post specific examples of bad behavior?
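To see what the analyzer is doing, it can help to print the tokens it emits for a sample string. Below is a minimal sketch against a recent Java Lucene (the exact API differs across versions and in Lucene.Net); the field name "body" and the sample text are just placeholders.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {
    public static void main(String[] args) throws Exception {
        // Print the tokens StandardAnalyzer produces for text containing punctuation.
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", "Foo-Bar (baz) qux.")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString()); // foo, bar, baz, qux
            }
            stream.end();
        }
    }
}
```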

It is not just parentheses; other punctuation such as the colon, hyphen, etc. will cause issues. One way to deal with them is to escape the query parser's special characters before parsing, as in the sketch below.
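A minimal sketch of that, assuming the classic Java QueryParser (the Lucene.Net equivalent is very similar); the field name "body" is a placeholder. QueryParser.escape backslash-escapes the characters the parser treats as syntax, so punctuation in user input no longer breaks parsing.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class EscapePunctuation {
    public static void main(String[] args) throws Exception {
        // Characters such as ( ) : ^ - are query syntax; escape them so the
        // user's input is parsed as plain text instead of operators.
        String userInput = "report (draft): part-1";
        String escaped = QueryParser.escape(userInput);

        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        Query query = parser.parse(escaped);
        System.out.println(query);
    }
}
```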

Related

Lucene query-time boosting culture code

I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. As you can see from the Luke tool's Explain output, that isn't doing anything to boost the documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have an academic question of how the fields are different.
Even though the standard analyzer ignores hyphens, I don't believe it will treat the two parts of your culture code as separate terms. Therefore, under normal circumstances, a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with en.
However, in your case you want to match the end of the term. Unfortunately, Lucene syntax doesn't support wildcards at the start of terms for some reason (according to this reference). Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the Whitespace analyzer fits my needs best. I've just tried your scenario with the Whitespace analyzer and found that vpn (_culture:en-jp)^4 will give you what you need.
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!
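For what it's worth, here is a minimal sketch of that query going through the whitespace analyzer on a recent Java Lucene; it assumes the index itself was also built with the whitespace analyzer, and the default field name "content" is a placeholder.

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class CultureBoost {
    public static void main(String[] args) throws Exception {
        // With the whitespace analyzer the hyphenated culture code survives as
        // a single term, so the boosted clause targets the exact stored value.
        QueryParser parser = new QueryParser("content", new WhitespaceAnalyzer());
        Query query = parser.parse("vpn (_culture:en-jp)^4");
        System.out.println(query); // e.g. content:vpn (_culture:en-jp)^4.0
    }
}
```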

Lucene searching stack traces: splitting on dots

I'm writing an app that embeds Lucene to search for, amongst other things, parts of stack traces, including class names etc. For example, if a document contained:
java.lang.NullPointerException
The documents can also contain ordinary English text.
I'd like to be able to query for either NullPointerException or java.lang.NullPointerException and find the document. Using the StandardAnalyzer, I only get a match if I search for the full java.lang.NullPointerException.
What's the best way to go about supporting this? Can I get multiple tokens emitted? e.g. java, lang, NullPointerException and java.lang.NullPointerException? Or would I be better replacing all the . characters with spaces up front? Or something else?
The dot character is considered an "ambiguous terminator" for the purposes of the algorithm used by StandardAnalyzer. Lucene attempts to be intelligent about this and make the best possible guess for the situation.
You have a couple of options here:
If you don't want Lucene to apply a bunch of complicated lexical tokenization rules, you can try a simpler analyzer, such as SimpleAnalyzer, which will just create tokens of uninterrupted strings of letters.
Implement a filter that applies your own specialized rules, and incorporate it into an Analyzer similar to the StandardAnalyzer. This would allow you to test whatever identification techniques you like to recognize that a token is an exception, and split it up during the analysis phase; a rough sketch follows after this list.
As you said, you can replace the periods with spaces before they ever hit the analyzer at all.
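Here is a rough sketch of the second option, assuming a recent Java Lucene: a hypothetical TokenFilter that passes every token through unchanged and, for dotted names, also emits each dot-separated part at the same position. A production version would manage offsets and other attribute state more carefully, but it illustrates the idea.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Emits the original token, then each dot-separated part at the same position. */
public final class DottedNameFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private final Deque<String> pending = new ArrayDeque<>();

    public DottedNameFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
            // Emit a queued part ("java", "lang", "NullPointerException") as a
            // synonym-style token at the same position as the full name.
            termAtt.setEmpty().append(pending.pop());
            posIncAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.indexOf('.') >= 0) {
            for (String part : term.split("\\.")) {
                if (!part.isEmpty()) {
                    pending.add(part);
                }
            }
        }
        return true; // pass the full token ("java.lang.NullPointerException") through first
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}
```

You would wire this into a custom Analyzer's createComponents, after a tokenizer that keeps the dotted name intact (as StandardTokenizer appears to here), and use the same analyzer at both index and query time.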

Stemming + wildcarding: unexpected effects

I am editing a lucene .net implementation (2.3.2) at work to include stemming and automatic wildcarding (adding of * at the ends of words).
I have found that exact words with wildcarding don't work (so stack* works for stackoverflow, but stackoverflow* does not get a hit), and I was wondering what causes this and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make the query always a prefix query so I don't have to do any nasty appending of "*"s to queries, so we will see if anything becomes clear then.
Edit: Only words that are stemmed do not work wildcarded. For example, Silicate* doesn't work, but silic* does.
The reason it doesn't work is that you stem the content, thus changing the Term.
For example consider the word "valve". The snowball analyzer will stem it down to "valv".
So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed Term "valv" will yield a match on both "valve" and "valves" occurrences.
But now, since in the Index you stored the Term "valv", a query for "valve*" will not match anything. That is because the QueryParser does not run the Analyzer on Wildcard Queries.
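A small sketch of the mismatch, using a stemming analyzer on a recent Java Lucene (EnglishAnalyzer standing in for the Snowball analyzer mentioned above; the 2.3.x API differs). The parser stems ordinary terms but leaves wildcard terms alone, so only the already-stemmed prefix matches the index.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class WildcardVsStemming {
    public static void main(String[] args) throws Exception {
        // Index time: the stemming analyzer stores "valv" for "valve"/"valves".
        Analyzer analyzer = new EnglishAnalyzer();
        QueryParser parser = new QueryParser("body", analyzer);

        // Wildcard terms are not stemmed by the parser, so the literal prefix is used.
        Query miss = parser.parse("valve*"); // prefix "valve" -> nothing in the stemmed index
        Query hit  = parser.parse("valv*");  // prefix "valv"  -> matches the stemmed term

        System.out.println(miss); // body:valve*
        System.out.println(hit);  // body:valv*
    }
}
```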
There is the AnalyzingQueryParser that can handle some of these cases, but I don't think it was in the 2.3.x versions of Lucene. Anyway, it's not a universal fit; the documentation says:
Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).
The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.

Search Query Optimization

I haven't ever dug into cleaning/reformatting search queries too much in the past, at least not beyond general security things like preventing SQL injection.
I am realizing that I should be implementing keywords like AND, OR, NOT, etc., and doing things like clearing punctuation such as apostrophes and hyphens, since when a user types "Smiths" in a search box, the query would not return "Smith's" (with an apostrophe).
What other things can I do to improve my user's search queries (without being damaging to them)?
I am coming from a PHP MySQL-FTS setup; however, I'm sure that this could be extended to multiple platforms.
EDIT
Let me clarify that I'm not so interested in the SQL query to the database, what I'm interested in optimizing is the query that the user provides in the search box.
NEAR keyword
double quotes for "exact phrases"
remove short/common words ("a", "an", "the", etc.); see the sketch after this list
stemming (remove common prefixes and suffixes)
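A rough sketch of the punctuation and stop-word cleanup from the list above (written in Java here for illustration, although the asker's stack is PHP/MySQL; stemming and NEAR/phrase handling are left out):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class QueryCleaner {
    // A tiny, illustrative stop-word list; a real one would be larger.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "or", "of"));

    /** Lowercase, strip punctuation such as apostrophes and hyphens, drop stop words. */
    public static String clean(String rawQuery) {
        StringJoiner cleaned = new StringJoiner(" ");
        for (String word : rawQuery.toLowerCase().split("\\s+")) {
            String stripped = word.replaceAll("[^\\p{L}\\p{Nd}]", ""); // keep letters and digits
            if (!stripped.isEmpty() && !STOP_WORDS.contains(stripped)) {
                cleaned.add(stripped);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("The Smith's of London")); // smiths london
    }
}
```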
I'd suggest reading through the answers to this similar question: Optimizing a simple search algorithm and also this article on some of Google's features.
Create an index on the "where" clause columns of your search queries.
To enable naive spell correction, perhaps you could also store the soundex of the column you would like to offer spell-checking for.
Enable logging for slow-queries which would help you in tracking down performance issues.
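MySQL has a built-in SOUNDEX() function you can store and compare; the same idea is sketched here with Apache Commons Codec, purely for illustration.

```java
import org.apache.commons.codec.language.Soundex;

public class SoundexSketch {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // Store the code in an extra column at write time, then compare codes
        // at search time so near-miss spellings still match.
        System.out.println(soundex.encode("Smith"));  // S530
        System.out.println(soundex.encode("Smyth"));  // S530 -> same code, likely match
        System.out.println(soundex.encode("Smithe")); // S530
    }
}
```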

Is htmlencoding a suitable solution to avoiding SQL injection attacks?

I've heard it claimed that the simplest solution to preventing SQL injection attacks is to html encode all text before inserting into the database. Then, obviously, decode all text when extracting it. The idea being that if the text only contains ampersands, semi-colons and alphanumerics then you can't do anything malicious.
While I see a number of cases where this may seem to work, I foresee the following problems in using this approach:
It claims to be a silver bullet. Potentially stopping users of this technique from understanding all the possible related issues - such as second-order attacks.
It doesn't necessarily prevent any second-order / delayed payload attacks.
It's using a tool for a purpose other than that which it was designed for. This may lead to confusion amongst future users/developers/maintainers of the code. It's also likely to be far from optimal in performance or effect.
It adds a potential performance hit to every read and write of the database.
It makes the data harder to read directly from the database.
It increases the size of the data on disk. (Each character now being ~5 characters - In turn this may also impact disk space requirements, data paging, size of indexes and performance of indexes and more?)
There are potential issues with high range unicode characters and combining characters?
Some html [en|de]coding routines/libraries behave slightly differently (e.g. Some encode an apostrophe and some don't. There may be more differences.) This then ties the data to the code used to read & write it. If using code which [en|de]codes differently the data may be changed/corrupted.
It potentially makes it harder to work with (or at least debug) any text which is already similarly encoded.
Is there anything I'm missing?
Is this actually a reasonable approach to the problem of preventing SQL injection attacks?
Are there any fundamental problems with trying to prevent injection attacks in this way?
You should prevent SQL injection by using parameter bindings (i.e. never concatenate your SQL strings with user input; use placeholders for your parameters and let the framework you use do the right escaping). HTML encoding, on the other hand, should be used to prevent cross-site scripting.
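A minimal JDBC sketch of parameter binding (the connection URL, table and column names are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ParameterBindingSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; substitute your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/shop", "app_user", "secret")) {
            String sql = "SELECT id, name FROM customers WHERE last_name = ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                // The value travels separately from the SQL text, so input like
                // "Smith's" or "x'; DROP TABLE customers; --" is only ever data.
                stmt.setString(1, "Smith's");
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                    }
                }
            }
        }
    }
}
```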
Absolutely not.
SQL injection should be prevented by parametrized queries, or in the worst case by escaping the parameter for SQL, not HTML. Each database has its own rules about this; the MySQL API (and most frameworks), for example, provides a particular function for that. The data itself should not be modified when stored in the database.
Escaping HTML entities prevents XSS and other attacks when returning web content to clients' browsers.
Where do you get the idea that HTML-encoded text only contains ampersands, semicolons and alphanumerics after decoding?
I can easily encode a "'" in HTML, and that is one of the things needed to get you into trouble (as it is the string delimiter in SQL).
So it works ONLY if you put the HTML-encoded text into the database.
THEN you have quite some trouble with any text search... and with presenting readable text outside the application (like in a SQL manager). I would consider that a really badly architected situation, as you have not solved the issue, just duct-taped over an obvious attack vector.
Numeric fields are still problematic, unless your HTML handling is perfect, which I would not assume given that workaround.
Use SQL parameters ;)
The single character that enables SQL injection is the SQL string delimiter ', also known as hex 27 or decimal 39.
This character is represented in the same way in SQL and in HTML. So an HTML encode does not affect SQL injection attacks at all.
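To illustrate, here is a quick check using Apache Commons Text's HTML4 escaping as an example encoder (as noted above, some encoders do escape the apostrophe and some don't):

```java
import org.apache.commons.text.StringEscapeUtils;

public class HtmlEncodeQuote {
    public static void main(String[] args) {
        String payload = "'; DROP TABLE users; --";
        // escapeHtml4 rewrites & < > " but leaves the single quote alone,
        // so the character that terminates an SQL string literal survives.
        System.out.println(StringEscapeUtils.escapeHtml4(payload));
        // '; DROP TABLE users; --
    }
}
```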