Processing text of SQL script - sql

I want to develop tool which will prettify SQL scripts - make all special words and commands (SELECT, JOIN, FROM, etc.) upper/lower case; add square brackets; and couple other things (yes, ). I'm going to implement it as extension for my IDE or as external tool - I'm not decided it yet.
I was going to split a script by spaces, brackets, commas and periods - get separate words - and check each word to match to one of the keywords. If it matches - then capitalize/lowercase word depending on settings. If not - leave it as it was.
But then I thought that it may be other solutions.
I thought about using RegEx (unfortunately I don't know much about it). I suppose that it will work more efficient. And therefore using it will be more preferred.
Is RegEx the best way to achieve my goal? Or my initial approach is also appropriate?
Is there other ways?
P.S. I know that similar tools can already exist out there. And I will appreciate if you share them. But I want to implement my own tool for self-education reasons.

Related

Rewrite this exceedingly long query

I just stumbled across this gem in our code:
my $str_rep="lower(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'-',''),'',''),'.',''),'_',''),'+',''),',',''),':',''),';',''),'/',''),'|',''),'\',''),'*',''),'~','')) like lower('%var%')";
I'm not really an expert in DB, but I have a hunch it can be rewritten in a more sane manner. Can it?
It depends on the DBMS you are using. I'll post some examples (feel free to edit this answer to add more).
MySQL
There is really not much to do; the only way to replace all the characters is nesting REPLACE functions as it has already been done in your code.
Oracle DB
Your clause can be rewritten by using the TRANSLATE function.
SQL Server
Like in MySQL there aren't any functions similar to Oracle's TRANSLATE. I have found some (much longer) alternatives in the answers to this question. In general, however, queries become very long. I don't see any real advantages of doing so, besides having a more structured query that can be easily extended.
Firebird
As suggested by Mark Rotteveel, you can use SIMILAR TO to rewrite the entire clause.
If you are allowed to build your query string via Perl you can also use a for loop against an array containing all the special characters.
EDIT: Sorry I did not see you indicated the DB in the tags. Consider only the last part of my answer.
Your flagged this as Perl, but it's probably not?
Here is a Perl solution anyway:
$var =~ s/[\-\.\_\+\,\:\;\/\|\\\*\~]+//g;
Sorry I don't know the languages concerned, but a couple of things come to mind.
Firstly you could look for a replace text function that does more that just a single character. Many languages have them. Some also do regular expression based find and replace.
Secondly the code looks like it is attempting to strip a specific list of characters. This list may not include all that is necessary which means a relatively high (pain in the butt) maintenance problem. A simpler solution might be to invert the problem and ask what characters do you want to keep? Inverting like this sometimes yields a simpler solution.

Search Query Optimization

I haven't ever dug into cleaning/reformatting search queries too much in the past, at least not more than general security things like preventing sql injection.
I am realizing that I should be implementing keywords like AND, OR, NOT, etc... and doing things like clearing punctuation such as apostrophes, hyphens, etc... As when a user types "Smiths" in a searchbox, the query would not return "Smith's" (with an apostrophe).
What other things can I do to improve my user's search queries (without being damaging to them)?
I am coming from a PHP MySQL-FTS setup; however, I'm sure that this could be extended to multiple platforms.
EDIT
Let me clarify that I'm not so interested in the SQL query to the database, what I'm interested in optimizing is the query that the user provides in the search box.
NEAR keyword
double quotes for "exact phrases"
remove short/common words ("a", "an", "the", etc)
stemming (remove common prefixes and suffixes)
I'd suggest reading through the answers to this similar question: Optimizing a simple search algorithm and also this article on some of Google's features.
Create an index on the "where" clause columns of your search queries.
To enable naive spell Correction perhaps, you could also store the soundex of the column you would like to offer spell-check for.
Enable logging for slow-queries which would help you in tracking down performance issues.

Diff Tool That Ignores Newlines [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I frequently need to compare SQL procedures to determine what has changed in the newest version. The problem is, everyone has their own style of formatting, and SQL doesn't (usually) care about where one puts their newlines (e.g. where clauses all on one line vs. newline before each AND).
This makes it very difficult (especially for long procedures) to see the actual differences. I cannot seem to find a free diff/merge utility that will allow me to ignore newlines (i.e. treat as whitespace). So far I've tried WinMerge and Beyond Compare without any luck. Does anyone know of a diff tool (ideally free) that would see these two examples as identical?
Ex. 1:
the quick
brown
Ex. 2:
the
quick
brown
Thanks in advance.
I really like SourceGear's DiffMerge!
It works on all platforms and has built in rulesets, but allows you to create and add your own. Which means that you can ignore what you want, when you want it.
Bonus, it is free!
What i've done in my own similar case is to use a sql prettifier which will organize two sets of semi-disparate SQL in very similar fashion automatically. i then paste and compare the results with WinMerge.
It's a two-step process but it's much more palatable than many other options, especially when many lines of code are involved.
Link to web-based Sql Pretty printer that's decent.
I love Araxis merge. Not free but well worth it. it can, among other things, ignore any kind of whitespace if you want.
You can use The DTP (Data Tool Project) of the Eclipse IDE.
To show it I created two almost identical SQL files and let eclipse show me the differences. After clicking "show next" I took a screenshot.
As you can see it still highlights the newlines, but by the way it does you can immediately see that they contain no substantial change to the SQL. It's easy to spot where I changed the ID from 1 to 2.
Here's the result.
Compare++ is an option, you can try "Ignore code style changes" in the 'smart' menu. It support structured comparison for many langugages such as C/C++, JavaScript, C#, Java, ...
Regardless on your definition of "Free" (beer vs speech/libre), Poor Man's T-SQL Formatter is also available to do this, either with WinMerge (using the winmerge plugin) or Beyond Compare and other comparison tools that allow for command-line pre-formatting, using the command-line bulk formatter.
If you prefer to take it for a whirl without downloading anything, it's available for immediate use online (like its non-libre counterparts T-SQL Tidy, Instant SQL Formatter, etc):
http://poorsql.com
Our SD Smart Differencer compares two source programs according to their precise grammatical syntax and structure, rather than according to raw text. It does so by parsing (SQL) source
the way a compiler would, and comparing the corresponding compiler data structures (e.g., abstract syntax trees). The SmartDifference consequently does not care about newlines, whitespace or intervening comments.
It reports differences, not in terms of line breaks, but rather in terms of programming language structures (variables, expressions, statements, blocks, functions, ...) and in terms close to programmer intentions (delete, insert, move, copy, rename) rather than line-insert or line delete.
SQL (like many other computer language names) is the name of a family of computer languages that are similar in syntax but differ in detail. So for the Smart Differencer, which dialect of SQL you are using matters. We have SQL front ends (therefore SmartDifferncers) for PLSQL and SQL2011. To the extent you SQL stays within the bounds of either of these, the Smart Differencer can work for you; to the extent you use extra goodies of SQL Server or Postgres, the SmartDifferencer presently can't help you. [We develop language parsers as part of our business, so I expect this is a matter of delay rather than never].
While the OP asked about SQL in the details, his headline question is language agnostic.
There are SmartDifferencers already for many other widely used languages other than SQL too: C, C++, C#, Java, ...
Another alternative is Emacs' Ediff. Works great if you are not afraid of Emacs.
You can use the command-line tool wdiff to ignore newlines. wdiff is a GNU tool for comparing files on a word-by-word basis. It can ignore newlines with the -n option.
Suppose I put your 2 example files into ex1.txt and ex2.txt. Then we can run:
$> wdiff -n ex1.txt ex2.txt
the
quick
brown
The output is actually the contents of the first file. Note that there are no + or - signs, which means the files have the same strings.
If I had added "fox" to the end of ex1.txt, then the output would look like this:
the
quick
brown [-fox-]
If seeing the common words still bothers you, you can add -3 or --no-common. Here's the example again where I added "fox" to the first file:
$> wdiff -n -3 /tmp/ex1.txt /tmp/ex2.txt
======================================================================
[-fox-]
======================================================================
PHPStorm's diff tool's "ignore white space: all" command does it perfectly as you want. And it has integrated support for many VCS like SVN, git, etc. As well as integrated SQL support!
Not free but time isn't free either. Want to waste time doing it the hard way? Go ahead.
I still can't believe it's 2014 and this wasn't a standard feature of all diff tools!!
BTW I believe WebStorm's diff tool would also work.
Have you tried KDiff? I'm certain you can ignore whitespace with it, and if it's not powerful enough for you it allows you to run a preprocessor over the file. Best of all it's free and open source.
If you're on Windows, WinMerge is pretty slick. Under Linux (and maybe OS X), there's Meld.
Both are free as in beer and work pretty well. Not quite as cool as Araxis, but then we don't want you drooling on your desk.
Both are all-purpose diff tools with such features as white space ignoring. I'm not absolutely certain they ignore blank lines, but chances are good they can.

What packages are available to typeset SQL in LaTeX?

I'm looking for a package to typeset SQL statements in LaTeX. So far I have heard of listings and lgrind, are there any other suggestions?
[edit] Added requirement: I'd like the package to be able in intelligently insert page breaks, so that where possible statements do not span multiple pages. Still reading documentation, so it is possible that either of the a/m are able to do this already- Please let me know if this is the case.
Related: question
I use the listings package, but mostly for snippets. I haven't needed to worry about page breaks in general. One of the great things about listings is the high degree of flexibility available. For instance, I don't capitalize my SQL, but I can print my listings with capitalized keywords:
\makeatletter
\newcommand{\lstuppercase}{\uppercase\expandafter{\expandafter\lst#token
\expandafter{\the\lst#token}}}
\newcommand{\lstlowercase}{\lowercase\expandafter{\expandafter\lst#token
\expandafter{\the\lst#token}}}
\makeatother
\lstdefinestyle{Oracle}{basicstyle=\ttfamily,
keywordstyle=\lstuppercase,
emphstyle=\itshape,
showstringspaces=false,
}
And define more keywords as I need them:
\lstdefinelanguage[Oracle]{SQL}[]{SQL}{
morekeywords={ACCESS, MOD, NLS_DATE_FORMAT, NVL, REPLACE, SYSDATE,
TO_CHAR, TO_NUMBER, TRUNC},
}
To make use of these definitions:
\lstset{language=[Oracle]SQL,
style=Oracle,
}
If I were to print out larger pieces of code, I'd either not worry about page breaks or write a preprocessor to divide the code up before passing it to LaTeX.
You want to use the listings package. Is there a specific thing you want to do with it, or are you just asking which package works best in general? I've never encountered any big problems with listings, though getting it to do Exactly What I Want is sometimes tricky (it's LaTeX; to expect anything else would be folly).
Edit (to address your edit): intelligent page breaking might be problematic; it's certainly beyond my abilities. listings might be able to do it with explicit markup (escape to LaTeX and insert a negative page break penalty at appropriate place; likely macro-izable), but I don't think listings can do it automatically, and I doubt LGrind can do it either. You might have better luck searching or asking on a LaTeX-specific list (comp.text.tex on Usenet is a great place to try), but page breaking in TeX has never been as good as line breaking, and so I wouldn't hold out too much hope, unfortunately.

What are the things should we consider while writing a Spell Checker?

I want to write a very simple Spell Checker. The spell checker will try to match the input word with equivalent words form the dictionary.
What can be done to find those 'equivalent words'? What analysis can be preformed on two words to mark them equivalent?
Before investing too much trying to unravel that i'd first look to already existing implementations like Aspell or netspell for two main reasons
Not much point in re-inventing the wheel. Spell checking is much trickier than it first appears and it makes sense to build on work that has already been done
If your interest is finding out how to do it, the source code and community will be a great benefit should you decide to implement your own anyway
Much depends on your use case. For example:
Is your dictionary very small (about twenty words)? In this case it probably is better to precompute all possible nearby mistaken words and use a table/hash lookup.
What is your error model? Aspell has at least two (one for spelling errors caused by nearby letters on the keyboard, and the other for spelling errors caused by the way a word sounds).
How dynamic is your dictionary? Can you afford to do a massive preparation in order to get an efficient retrieval?
You may need a "word equivalence" measure like Double Metaphone, in addition to edit distance.
You can get some feel by reading Peter Norvig's great description of spelling correction.
And, of course, whenever possible, steal code. Do not reinvent the wheel without a reason - a reason could be a very special domain, a special way your users make spelling mistakes, or just to learn how it's done.
Edit Distance is the theory you need to write a spell checker. You also need a dictionary. Most UNIX systems come with a dictionary already installed for your locale.
I just finished implementing a spell checker and used a combination of the following in getting a list of "suggested" words
Phonetic hashing of the "misspelled" word to lookup a hash of identical dictionary hashed real words (for java check out Apache Commons Codec for a suitable library). The phonetic hash of your dictionary file can be precomputed.
Edit distance between the input and the potentials (this is reasonably expensive so you need to reduce the list first with something like a phonetic hash, assuming a higher volume load - in my case, a server based spell check)
A known list of common misspellings, e.g. recieve vs. receive.
An ordered list of the most common words in the english language
Essentially I weighted each potential word primarily based on edit-distance and commonality. e.g. if word probability is a percentage, then
weight = edit-distance * 100 / probability
(lower weights are better)
But then I also also override any result with the known common misspellings (i.e. these always float to the top suggested result).
There may be better ways, but this worked pretty well.
You may also wish to ignore ALL CAPS words, initials etc, so choosing what to ignore is also something to think about.
Under linux/unix you have ispell. Why reinventing the whell.