I have a lot of Lucene queries that contain characters with special meaning, like colons, slashes, quotation marks, etc.
I am aware that it is possible to escape a single character with '\', but is it possible to enclose a whole sentence in something so it is matched exactly in a query, without any of the symbols being interpreted?
Thanks.
Yes, QueryParser.escape escapes everything in the string passed to it.
Also, using phrase queries generally makes most query syntax irrelevant (myfield:"I +do +not have:to /worry/ about^22 -query -syntax here~2"), with the exception of quotes. That is, if a phrase is what you are actually trying to search for.
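A minimal sketch of the escape call (assuming a Lucene 5+ style setup, where the classic QueryParser lives in org.apache.lucene.queryparser.classic and takes a field name and an analyzer; the field name myfield is just an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        String userInput = "path:/foo/bar +has \"special\" chars~2";

        // Backslash-escape every character the query syntax treats as special
        // (colons, slashes, quotes, +, -, ~, ^, parentheses, ...).
        String escaped = QueryParser.escape(userInput);

        QueryParser parser = new QueryParser("myfield", new StandardAnalyzer());
        Query q = parser.parse(escaped);  // input is now matched literally,
                                          // no operators are interpreted
        System.out.println(q);
    }
}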
I know that I should use single quotes when I deal with data of TEXT type (and, I guess, the types that fall back to TEXT), but is that the only case?
Example:
UPDATE names SET name='Mike' WHERE id=3
I'm writing SQL query auto-generation in C++, so I want to make sure I don't miss any cases where I have to add quotes.
Single quotes (') denote textual data, as you noted (e.g., 'Mike' in your example). Numeric data (e.g., 3 in your example), object names (table, column, etc.) and syntactic elements (e.g., update, set, where) should not be wrapped in quotes.
The single quote is the delimiter for the string. It lets the parser know where the string starts and where it ends, as well as that it is a string. You will find that sometimes you get away with a double quote too.
The only way to be certain you don't miss any cases is to escape the input; otherwise this will be vulnerable to abuse whenever a single quote ends up in the text.
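As a sketch of that last point (illustrative Java rather than your C++, and the helper name quote is made up): the standard way to escape a single quote inside a SQL string literal is to double it, though parameterized statements are safer still.

public class SqlLiteral {
    // Hypothetical helper: wrap a text value for inclusion in generated SQL.
    // Doubling any embedded single quote is the standard SQL escape for it.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    public static void main(String[] args) {
        String name = "O'Brien";
        String sql = "UPDATE names SET name=" + quote(name) + " WHERE id=3";
        System.out.println(sql); // UPDATE names SET name='O''Brien' WHERE id=3
    }
}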
Lately I have been doing a security pass on a PHP application, and I've already found and fixed one XSS vulnerability (both by validating the input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE char since you want to allow dashes -.
For the same reason that you shouldn't validate input against a black-list (i.e. a list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (e.g. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database and then flag the exceptions.
The reason is that there are simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly once you get into character encoding, just looking for things like angle brackets and quotes is not going to get you very far.
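As a rough illustration of whitelisting rather than blacklisting (a sketch only; the allowed set here just mirrors the symbols mentioned in the question):

import java.util.regex.Pattern;

public class WhitelistCheck {
    // Accept only letters, digits, spaces, dashes and hashes; reject everything else.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z0-9 #-]*");

    static boolean isClean(String value) {
        return ALLOWED.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isClean("Order #42 - pending"));           // true
        System.out.println(isClean("<script>alert('xss')</script>")); // false
    }
}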
I have noticed that in both Oracle and SQLite, queries like this are perfectly valid:
SELECT*FROM(SELECT a,MAX(b)i FROM c GROUP BY a)WHERE(a=1)OR(i=2);
Is that a “feature” of SQL that keywords or words of a query need not be surrounded with whitespace? If so, why was it designed this way? SQL has been designed to be readable, this seems to be a form of obfuscation (particularly the MAX(b)i thing where i is a token which serves as an alias).
The SQL-92 BNF grammar here explicitly states that delimiters (brackets, whitespace, *, etc.) are valid ways to break up tokens, which makes whitespace optional in the many cases where another delimiter already separates the tokens.
This is true not only for SQLite and Oracle, but also for at least MySQL and SQL Server (the ones I work with and have tested), since it is specified in the language definition.
Whitespace is optional in pretty much any language where it is not absolutely necessary to preserve boundaries between keywords and/or identifiers. You could write code in C# that looked similar to your SQL, and as long as the compiler can still parse the identifiers and keywords, it doesn't care.
Case in point: the subquery of your statement is the only place where whitespace is needed to separate keywords from other alphabetic characters. Everywhere else, some non-alphanumeric character (none of which appear in any SQL keyword) separates the keywords, so the SQL parser can still digest the statement. As long as that is true, whitespace is purely for human readability.
Most of this is valid simply because you've enclosed key sections in parentheses where white space would ordinarily be required.
I think this is a side effect of the parser.
Usually compilers ignore whitespace via SKIP tokens, which the lexer discards but which still cause errors if they appear in the middle of a reserved word. For example, in C 'while' is valid but 'whi le' is not, even though the whitespace is a SKIP token.
The reason is that this simplifies the parser; otherwise it would have to manage all the whitespace, which can get quite complex unless you impose strict rules the way Python does. That would be both hard to impose on vendors like Oracle and would make SQL more complex than it needs to be.
And that simplification has the (unintended?) side effect of letting you remove most (not all) whitespace. Be aware that in some cases removing whitespace causes errors (you can't remove the space in GROUP BY, as it is part of that token).
We're currently replacing all special characters and spaces in our URLs with hyphens (-). From an SEO and readability point of view this works fine. However, in some cases we feed parts of the URL into a search after stripping the hyphens out. The problem occurs when the search term should itself contain hyphens: it returns no results once they get stripped. We could modify the search algorithm we're using, but this would slow it down (especially bad since we're using it with an AJAX search box and it needs to be fast).
The best option to deal with this, as far as we can tell, is to replace pre-existing hyphens with pipes (|). I have a feeling this will have a negative impact on SEO for those terms, as the pipe character will be treated as part of the word rather than as a separator. As far as I can tell, the only characters that are considered separators are hyphens and forward slashes (/).
So my questions are:
Are there alternative characters we can use to represent hyphens?
If we can't use any other characters, how much impact will using a pipe character have on a search engine?
Cheers,
Zac
Would ~ (tilde) work?
Edit: Google now treats underscores and dashes as word separators so you can use dashes as dashes and underscores as spaces.
Why not use URL encoding? Most frameworks have built-in utilities to do this.
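For example, in Java (a sketch using the standard URLEncoder; the Charset overload assumes Java 10+):

import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlEncodeDemo {
    public static void main(String[] args) {
        String term = "stack-overflow & more";

        // Percent-encode the term so it survives the round trip through the URL.
        String encoded = URLEncoder.encode(term, StandardCharsets.UTF_8);
        System.out.println(encoded);  // stack-overflow+%26+more

        // Decode it again on the way back into the search.
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.UTF_8));
    }
}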
I was going to say the same thing about URL encoding, but if you're trying to get rid of the special characters, I suppose you don't want URLs with percent signs, right?
What about altering the algorithm that "feeds parts of the URL into a search"? Couldn't you add some logic to not replace hyphens within the search query part of the URL?
I'm working on a search module that searches in text columns containing HTML code. The queries are constructed like: WHERE htmlcolumn LIKE '% searchterm %';
By default the module searches with spaces at both ends of the search terms; with wildcards at the beginning and/or end of a search term those spaces are removed (*searchterm -> LIKE '%searchterm %'). I've also added the possibility to exclude results containing certain words (-searchterm -> NOT LIKE '% searchterm %'). So far so good.
The problem is that words preceded by an HTML tag are not found (<br/>searchterm is not found when searching on LIKE '% searchterm %'), and the same goes for words that come after a comma or end with a period, etc.
What I would like to do is search for words that are not preceded or followed by the characters A-Z or a-z. Any other character is OK.
Any ideas how i should achieve this? Thanks!
Look into MySQL's full-text search; it can use non-letter characters as delimiters. It will also be much, much faster than a %term% search, since that requires a full table scan.
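A rough sketch of what that could look like from Java (the connection details and table name are placeholders, and it assumes a FULLTEXT index already exists on htmlcolumn):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FulltextSearch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT * FROM mytable WHERE MATCH(htmlcolumn) AGAINST (?)")) {
            ps.setString(1, "searchterm");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("htmlcolumn"));
                }
            }
        }
    }
}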
You could use a regular expression: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
Generally speaking, it is better to use full text search facilities, but if you really want a small SQL, here it is:
SELECT * FROM `t` WHERE `htmlcolumn` REGEXP '[[:<:]]term[[:>:]]'
It returns all records that contain the word 'term', whether it is surrounded by spaces, punctuation, special characters, etc.
I don't think SQL's "LIKE" operator alone is the right tool for the job you are trying to do. Consider using Lucene, or something like it. I was able to integrate Lucene.NET into my application in a couple days. You'll spend more time than that trying to salvage your current approach.
If you have no choice but to make your current approach work, then consider storing the text in two columns in your database. The first column holds the pure text, with punctuation etc. The second column holds text that has been pre-processed: just words, no punctuation, normalized so as to be easier for your "LIKE" approach.
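A sketch of the kind of preprocessing meant here (the method name is made up; it strips tags, turns punctuation into spaces and lower-cases the result before the row is written):

public class Normalize {
    // Produce the "search" version of the text for the second column.
    static String forSearchColumn(String html) {
        return html
            .replaceAll("<[^>]*>", " ")        // drop HTML tags
            .replaceAll("[^A-Za-z0-9]+", " ")  // punctuation and symbols become spaces
            .trim()
            .toLowerCase();
    }

    public static void main(String[] args) {
        String original = "<br/>Hello, world. It's a test!";
        System.out.println(forSearchColumn(original)); // hello world it s a test
    }
}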