What is causing the LIKE statement to disregard html-tags, words after commas, or end in periods? - sql

I'm working on a search module that searches in text columns that contains html code. The queries are constructed like: WHERE htmlcolumn LIKE '% searchterm %';
Default the modules searches with spaces at both end of the searchterms, with wildcards at the beginning and/or the end of the searchterms these spaces are removed (*searchterm -> LIKE '%searchterm %'; Also i've added the possibility to exclude results with certain words (-searchterm -> NOT LIKE '% searchterm %'). So far so good.
The problem is that words that that are preceded by an html-tag are not found (<br/>searchterm is not found when searching on LIKE '% searchterm.., also words that come after a comma or end with a period etc.).
What i would like to do is search for words that are not preceded or followed by the characters A-Z and a-z. Every other characters are ok.
Any ideas how i should achieve this? Thanks!

Look into MySQLs fulltextsearch, it might be able to use non-letter characters as delimiters. It will alsow be much much faster than a %term% search since that requires a full table-scan.

You could use a regular expression: http://dev.mysql.com/doc/refman/5.0/en/regexp.html

Generally speaking, it is better to use full text search facilities, but if you really want a small SQL, here it is:
SELECT * FROM `t` WHERE `htmlcolumn` REGEXP '[[:<:]]term[[:>:]]'
It returns all records that contain word 'term' whether it is surrounded with spaces, punctuation, special characters etc

I don't think SQL's "LIKE" operator alone is the right tool for the job you are trying to do. Consider using Lucene, or something like it. I was able to integrate Lucene.NET into my application in a couple days. You'll spend more time than that trying to salvage your current approach.
If you have no choice but to make your current approach work, then consider storing the text in two columns in your database. The first column is for the pure text, with punctuation etc. The second column is the text that has been pre-preprocessed, just words, no punctuation, normalized so as to be easier for your "LIKE" approach.

Related

Pattern matching in Big query vs SSMS-Return strings which contain special characters or numerics

I'm a bit lost.
I've had a look at the documentation but I'm not sure if you can use LIKE and pattern match in Big Query the same as SSMS.
The code shown here works in SSMS but the results are not correct in Big Query, so was wondering if there was another way to do it.
WHERE column_name NOT LIKE '[a-Z]%'
I'm looking to return strings which contain special characters or numerics.
Use REGEXP_CONTAINS instead
where not regexp_contains(column_name, r'[a-zA-Z]')
Meantime, LIKE is also supported as a comparison operator

Is there a way to exclude a character(s) from the SQL wildcard %?

I am attempting to do file searches with SQL. I need to keep the searches within the directory I am doing the comparisons against. When doing a LIKE or NOT LIKE comparison in SQL, the wildcard '%' is used to represent 0, 1, or many characters. Are there ways to exclude or delimit characters involved in that wildcard search? I need to do a comparison like the following:
LIKE 'C:/Data/%.txt'
However, I do not want the '%' to find any records where there is a slash '/' character involved in the wildcard search. I would want it to return examples like:
C:/Data/file1.txt
C:/Data/records_2017.txt
C:/Data/listing#7.txt
But I do NOT want the wildcard to find records with slash in it, because in my situation, it goes beyond the current folder I am interested in doing file searches in:
C:/Data/OtherFolder/file1.txt
C:/Data/System/SomethingElse/records_2016.txt
The above examples would be returned, because everything between the slash after Data and the .txt extension is all free game for the wildcard. I do NOT want the above examples to be returned in my wildcard search.
I tried doing character sets, such as [^/], but it seems to only work to exclude strings BEGINNING with a slash '/' character. I need to prevent wildcard from using a slash ANYWHERE in the string.
You can't do this with a single LIKE, but you could do this:
WHERE MyColumn LIKE 'C:/Data/%.txt'
AND MyColumn NOT LIKE 'C:/Data/%/%.txt'
Most databases support some form of regular expressions. For instance:
col regexp '^C:/Data/[^/]?.txt$'
The specific match operator/function varies by database.
If you are using Oracle, you can use REGEXP_LIKE instead of LIKE for finer control of what records will match. Check this document for details on how to use it.

Lucene: problems when keeping punctuation

I need Lucene to keep some punctuation marks when indexing my texts, so I'm now using a WhitespaceAnalyzer which doesn't remove the symbols.
If there is a sentence like oranges, apples and bananas in the text, I want the phrase query "oranges, apples" to be a match (not without the comma), and this is working fine.
However, I'd also want the simple query oranges to produce a hit, but it seems the indexed token contains the comma too (oranges,) so it won't be a match unless I write the comma in the query too, which is undesirable.
Is there any simple way to make this work the way I need?
Thanks in advance.
I know this is a very old question, but I'm bored and I'll reply anyway. I see two ways of doing this:
Creating a TokenFilter that will create a synonym (e.g. insert a token into the stream with a 0 position length) each time a word contains punctuation with the non-puncuation version.
Add another field with the same content but using a standard tokenizer that removes all punctuation. Both fields will be matched.

SQL2008 fulltext index search without word breakers

I are trying to search an FTI using CONTAINS for Twitter-style usernames, e.g. #username, but word breakers will ignore the # symbol. Is there any way to disable word breakers? From research, there is a way to create a custom word breaker DLL and install it and assign it but that all seems a bit intensive and, frankly, over my head. I disabled stop words so that dashes are not ignored but I need that # symbol. Any ideas?
You're not going to like this answer. But full text indexes only consider the characters _ and ` while indexing. All the other characters are ignored and the words get split where these characters occur. This is mainly because full text indexes are designed to index large documents and there only proper words are considered to make it a more refined search.
We faced a similar problem. To solve this we actually had a translation table, where characters like #,-, / were replaced with special sequences like '`at`','`dash`','`slash`' etc. While searching in the full text, u've to again replace ur characters in the search string with these special sequences and search. This should take care of the special characters.

How can you query a SQL database for malicious or suspicious data?

Lately I have been doing a security pass on a PHP application and I've already found and fixed one XSS vulnerability (both in validating input and encoding the output).
How can I query the database to make sure there isn't any malicious data still residing in it? The fields in question should be text with allowable symbols (-, #, spaces) but shouldn't have any special html characters (<, ", ', >, etc).
I assume I should use regular expressions in the query; does anyone have prebuilt regexes especially for this purpose?
If you only care about non-alphanumerics and it's SQL Server you can use:
SELECT *
FROM MyTable
WHERE MyField LIKE '%[^a-z0-9]%'
This will show you any row where MyField has anything except a-z and 0-9.
EDIT:
Updated pattern would be: LIKE '%[^a-z0-9!-# ]%' ESCAPE '!'
I had to add the ESCAPE char since you want to allow dashes -.
For the same reason that you shouldn't be validating input against a black-list (i.e. list of illegal characters), I'd try to avoid doing the same in your search. I'm commenting without knowing the intent of the fields holding the data (i.e. name, address, "about me", etc.), but my suggestion would be to construct your query to identify what you do want in your database then identify the exceptions.
Reason being there are just simply so many different character patterns used in XSS. Take a look at the XSS Cheat Sheet and you'll start to get an idea. Particularly when you get into character encoding, just looking for things like angle brackets and quotes is not going to get you too far.