Rewrite this exceedingly long query - sql

I just stumbled across this gem in our code:
my $str_rep="lower(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'-',''),'',''),'.',''),'_',''),'+',''),',',''),':',''),';',''),'/',''),'|',''),'\',''),'*',''),'~','')) like lower('%var%')";
I'm not really an expert in DB, but I have a hunch it can be rewritten in a more sane manner. Can it?

It depends on the DBMS you are using. I'll post some examples (feel free to edit this answer to add more).
MySQL
There is really not much to do; the only way to replace all the characters is to nest REPLACE functions, as has already been done in your code.
Oracle DB
Your clause can be rewritten by using the TRANSLATE function.
SQL Server
As in MySQL, there is no function similar to Oracle's TRANSLATE (SQL Server 2017 and later do provide one). I have found some (much longer) alternatives in the answers to this question. In general, however, the queries become very long, and I don't see any real advantage in doing so besides having a more structured query that can be easily extended.
Firebird
As suggested by Mark Rotteveel, you can use SIMILAR TO to rewrite the entire clause.
If you are allowed to build your query string via Perl you can also use a for loop against an array containing all the special characters.
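A minimal sketch of that loop idea, written in Python rather than Perl purely for illustration. The names SPECIAL_CHARS and build_clause are hypothetical, and the "?" bind placeholder (instead of interpolating '%var%') is an assumption about how the query is executed; string-literal escaping rules also differ between DBMSs.

```python
# Characters to strip, taken from the original nested REPLACE expression.
SPECIAL_CHARS = ["-", ".", "_", "+", ",", ":", ";", "/", "|", "\\", "*", "~"]

def build_clause(column, chars=SPECIAL_CHARS):
    """Wrap `column` in one replace(...) call per character to strip."""
    expr = column
    for ch in chars:
        # Double any single quote so the SQL string literal stays valid.
        literal = ch.replace("'", "''")
        expr = f"replace({expr},'{literal}','')"
    # Assumption: the search term is bound separately as '%term%'.
    return f"lower({expr}) like lower(?)"

if __name__ == "__main__":
    print(build_clause("field"))
```

Adding or removing a character then means editing the list, not the nesting.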
EDIT: Sorry I did not see you indicated the DB in the tags. Consider only the last part of my answer.

You flagged this as Perl, but it's probably not?
Here is a Perl solution anyway:
$var =~ s/[\-\.\_\+\,\:\;\/\|\\\*\~]+//g;

Sorry I don't know the languages concerned, but a couple of things come to mind.
Firstly, you could look for a replace-text function that does more than just a single character. Many languages have them. Some also do regular-expression-based find and replace.
Secondly the code looks like it is attempting to strip a specific list of characters. This list may not include all that is necessary which means a relatively high (pain in the butt) maintenance problem. A simpler solution might be to invert the problem and ask what characters do you want to keep? Inverting like this sometimes yields a simpler solution.
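As a rough illustration of the inverted approach, here is a Python sketch that keeps only ASCII letters and digits instead of listing every character to strip. The function name keep_alnum and the choice of "letters and digits" as the keep-list are assumptions; the right keep-list depends on the data.

```python
import re

def keep_alnum(text):
    """Drop everything except ASCII letters and digits, then lowercase."""
    return re.sub(r"[^A-Za-z0-9]", "", text).lower()

if __name__ == "__main__":
    print(keep_alnum("Ab-c.d_e+1"))
```

The same normalization would have to be applied to both the stored value and the search term for the comparison to be meaningful.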

Related

Processing text of SQL script

I want to develop a tool that will prettify SQL scripts: make all special words and commands (SELECT, JOIN, FROM, etc.) upper/lower case, add square brackets, and a couple of other things (yes, ). I'm going to implement it as an extension for my IDE or as an external tool; I haven't decided yet.
I was going to split a script by spaces, brackets, commas and periods to get separate words, and check each word against the list of keywords. If it matches, capitalize or lowercase the word depending on the settings. If not, leave it as it was.
But then I thought that there may be other solutions.
I thought about using RegEx (unfortunately I don't know much about it). I suppose it would work more efficiently, and would therefore be preferable.
Is RegEx the best way to achieve my goal? Or is my initial approach also appropriate?
Are there other ways?
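One possible shape for the regex variant, sketched in Python under some assumptions: the keyword list is a tiny illustrative subset, and the names KEYWORDS, TOKEN and recase_keywords are made up for this example. The pattern matches quoted strings and line comments as whole tokens so that keywords inside them are left alone. It does not handle block comments or bracketed identifiers.

```python
import re

# Illustrative subset only; a real tool would need the full keyword list.
KEYWORDS = {"select", "from", "join", "where", "on", "and", "or"}

# Three alternatives, tried in order at each position:
# a single-quoted string, a "--" line comment, or a bare word.
TOKEN = re.compile(r"'(?:[^']|'')*'|--[^\n]*|\b[A-Za-z_]\w*\b")

def recase_keywords(sql, upper=True):
    """Change the case of known keywords; leave strings and comments as-is."""
    def fix(match):
        tok = match.group(0)
        if tok[0] == "'" or tok.startswith("--"):
            return tok  # string literal or comment: untouched
        if tok.lower() in KEYWORDS:
            return tok.upper() if upper else tok.lower()
        return tok  # ordinary identifier
    return TOKEN.sub(fix, sql)

if __name__ == "__main__":
    print(recase_keywords("select a from t where name = 'from here' -- select"))
```

Splitting on delimiters and checking each word does the same job; the regex mainly saves writing the splitting loop by hand.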
P.S. I know that similar tools may already exist, and I would appreciate it if you shared them. But I want to implement my own tool for self-education reasons.

Prevent use of pre ANSI-92 old syntax

I wonder if there's a way to prevent the creation of objects that contain the old ANSI join syntax, maybe with server triggers. Can anyone help me?
You can create a DDL trigger and mine the eventdata() XML for the content of the proc. If you can detect the old syntax using some fancy string-parsing functions (maybe looking for commas between known table names or looking for *= or =*), then you can roll back the creation of the proc or function.
First reaction - code reviews and a decent QA process!
I've had some success looking at sys.syscomments.text. A simple where text like '%*=%' should do. Be aware that long SQL strings may be split across multiple rows. I realise this won't prevent objects getting in there in the first place. But then DDL triggers won't tell you how big your current problem is.
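To illustrate the point about split rows, here is a small Python sketch of the check once the text chunks have been fetched. It assumes the chunks for one object arrive in order as a list of strings; the name uses_old_outer_join is hypothetical. It is only a heuristic: a search for *= can also hit things like the compound assignment operator.

```python
import re

# Old-style outer join operators: *= and =*
OLD_OUTER_JOIN = re.compile(r"\*=|=\*")

def uses_old_outer_join(chunks):
    """Join the text chunks of one object first, then search.

    Searching each chunk separately would miss an operator that
    happens to be split across two rows.
    """
    definition = "".join(chunks)
    return OLD_OUTER_JOIN.search(definition) is not None

if __name__ == "__main__":
    rows = ["select * from a, b where a.id *", "= b.id"]
    print(uses_old_outer_join(rows))
```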
Although I fully understand your effort, I believe that this type of action is the wrong way of getting where you want. First of all, you might get into serious trouble with your boss and, depending on where you work, get fired.
Second, as stated before, do code reviews and explain why the old syntax sucks. You have to have a decent reason why one should avoid the *= stuff. 'Because you don't like it' is not a convincing argument. In fact, there are quite a few articles around showing that certain problems are just not solvable using this type of syntax.
Third, you might want to point out that separating conditions into grouping (JOIN ... ON...) and filtering conditions (WHERE...) increases readability and might therefore be an option.
Collect your arguments and convince your colleagues rather than punishing them in quite an arrogant way.

What, if any, are the disadvantages of SQL::Interp over SQL::Abstract?

I'm currently looking at some light-weight SQL abstraction modules. My workflow is such that I usually write SELECT queries manually, and INSERT/UPDATE queries via subs which take hashes.
Both of these modules seem perfect for my needs and I have a hard time deciding. SQL::Interp claims SQL::Abstract cannot provide full expressivity in SQL, but discusses no other differences.
Does it have any disadvantages? If so, which?
I can't speak to SQL::Interp, but I use SQL::Abstract and it's pretty good. In conjunction with DBIx::Connector and plain old DBI, I was able to totally eliminate the use of an ORM in my system with very little downside.
The only limitation I have run into is that it's not possible to write GROUP BY queries directly (although it's easy to do by simply appending to the generated query), and LIMIT queries are handled by the extension SQL::Abstract::Limit.
I used SQL::Abstract for over a year, and then switched to SQL::Interp, which I've stuck with since.
SQL::Abstract had trouble with complex clauses. For the ones it could support, you would end up with a nest of "(", "[" and "{" characters, which you had to mentally translate back to meaning "AND", "OR" or actual parentheses.
SQL::Interp has no such limitations and uses no middle representation. Your SQL looks like SQL with bind variables where you want them. It works for complex queries as well as simple ones. I find SQL::Interp especially pleasant to use in combination with DBIx::Simple's built-in support for it. DBIx::Simple+SQL::Interp is a friendly and intuitive replacement for using raw DBI. I use the combination in a 100,000k+ LoC mod_perl web app.

When should Regex be used over String.IndexOf()? or String.Contains()?

I'm currently working on my first project in .NET 4.0, and it requires several thousand string comparisons (I'm searching directories and sometimes entire drives for certain files). For the most part, the strings are quite short because I'm only looking at file paths, so I have just made use of String.Contains() to see if the file path string contains my needle string.
I was wondering though, would Regex be a better idea? At what point will the Regex be faster than a standard string comparison? Is it based on the length of the strings being compared or the number of strings being compared?
It's variable. Comparison performance is a complex function of the input data, the culture being used for comparing, case sensitivity and CompareOptions. A Regex object is more expensive to instantiate (unless it's in the Regex cache), so if you're doing a lot of one-off comparisons, it's not that great to use, and I've found it's typically slower than IndexOf(), but YMMV.
Keep in mind that String.IndexOf(string) compares using the culture under which the user/thread is running (String.Contains(string), by contrast, performs an ordinal comparison). The culture can have a significant impact on performance. Not all cultures are as fast.
The Invariant culture is a very fast culture. If you use a CompareInfo directly, rather than doing String.IndexOf(), it will be somewhat faster still.
CultureInfo.InvariantCulture.CompareInfo.IndexOf(..)
The only way to have some confidence in making the right choice is to benchmark. That said, unless you're sifting through many megabytes of strings, it won't make a difference that matters to anyone. As ChrisF said earlier, focus on readable/maintainable code in that case.
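A benchmark of this kind can be very small. The sketch below uses Python rather than .NET, so its timings say nothing about String.Contains() or Regex themselves; it only shows the shape of the measurement. The path list, the needle and the function names are made up for the example.

```python
import re
import timeit

# Hypothetical test data: 1000 short file paths and one needle.
paths = [f"C:/data/dir{i}/file{i}.txt" for i in range(1000)]
needle = "dir42"
pattern = re.compile(re.escape(needle))  # compiled once, outside the loop

def with_contains():
    """Plain substring test."""
    return [p for p in paths if needle in p]

def with_regex():
    """Same search through a precompiled regex."""
    return [p for p in paths if pattern.search(p)]

if __name__ == "__main__":
    for fn in (with_contains, with_regex):
        seconds = timeit.timeit(fn, number=200)
        print(f"{fn.__name__}: {seconds:.4f}s for 200 runs")
```

Whatever the platform, measure with data that resembles the real paths and needle, since the result depends on both.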
Here's a good article on getting the most out of regex:
Optimizing Regular Expression Performance
If your search expression is simple, then I don't think it's worth moving to a Regex. No matter how good you are at coding and reading them, it will take you more time to understand the code when you (or more importantly, someone else) look at it again in six months' time.
If the speed improvements are only marginal stay with the more readable, maintainable code.
I'm just guessing, but I suspect that for simple substring searches there will be little difference in performance between String.Contains(), String.IndexOf() and regex (if anything, I'd guess that regex would never be faster, but might be slower by a minuscule amount).
You shouldn't give any thought about moving to regex unless your requirements are (or become) such that you need to match on something more complex than a substring.
In .NET 4.0 there is an issue with the String.IndexOf call (see Hotfix 2467309); it may help you decide.

Search Query Optimization

I haven't ever dug into cleaning/reformatting search queries too much in the past, at least not more than general security things like preventing sql injection.
I am realizing that I should be implementing keywords like AND, OR, NOT, etc., and doing things like clearing punctuation such as apostrophes and hyphens. For example, when a user types "Smiths" in a search box, the query does not return "Smith's" (with an apostrophe).
What other things can I do to improve my user's search queries (without being damaging to them)?
I am coming from a PHP MySQL-FTS setup; however, I'm sure that this could be extended to multiple platforms.
EDIT
Let me clarify that I'm not so interested in the SQL query to the database, what I'm interested in optimizing is the query that the user provides in the search box.
NEAR keyword
double quotes for "exact phrases"
remove short/common words ("a", "an", "the", etc)
stemming (remove common prefixes and suffixes)
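A few of the items above can be sketched in Python as a small pre-processing step. This is only an illustration under assumptions: the stop-word set is a three-word sample, the name normalize_query is hypothetical, and the AND/OR/NOT and NEAR operators and stemming are not handled here.

```python
import re

# Tiny sample; a real list would be much longer.
STOP_WORDS = {"a", "an", "the"}

def normalize_query(raw):
    """Split a raw search string into exact phrases and cleaned words."""
    # Text inside double quotes is kept verbatim as an exact phrase.
    phrases = re.findall(r'"([^"]+)"', raw)
    rest = re.sub(r'"[^"]*"', " ", raw)
    # Remove apostrophes and hyphens so "Smith's" becomes "smiths".
    rest = re.sub(r"['\-]", "", rest.lower())
    words = [w for w in re.findall(r"[a-z0-9]+", rest) if w not in STOP_WORDS]
    return phrases, words

if __name__ == "__main__":
    print(normalize_query('the "exact phrase" Smith\'s co-op'))
```

For "Smiths" to match "Smith's", the indexed text needs the same punctuation stripping as the query.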
I'd suggest reading through the answers to this similar question: Optimizing a simple search algorithm and also this article on some of Google's features.
Create an index on the "where" clause columns of your search queries.
To enable naive spell correction, you could also store the Soundex code of the column you would like to offer spell-check for.
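To show what such a stored code looks like, here is a sketch of the classic American Soundex rules in Python. A database's built-in SOUNDEX function may differ in details (for example, in output length), so treat this as an illustration of the idea, not a drop-in equivalent.

```python
def soundex(word):
    """Classic 4-character Soundex code; returns "" if no letters."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

if __name__ == "__main__":
    print(soundex("Smiths"), soundex("Smith's"))
```

Words that share a code, such as "Smiths" and "Smith's", can then be offered as "did you mean" candidates.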
Enable logging for slow-queries which would help you in tracking down performance issues.