Are there any open source tools that can generate a natural language description of a given SQL query? If not, some general pointers would be appreciated.
I don't know much about NLP, so I am not sure how difficult this is, although I saw from some previous discussion that the vice versa conversion is still an active area of research. It might help to say that the SQL tables I will be handling are not arbitrary in any sense, yet mine, which means that I know exact semantics of each table and its columns.
I can devise two approaches:
SQL was intended to be "legible" to non-technical people. A naïve and simpler way would be to perform a series of replacements right on the SQL query: "SELECT" -> "display"; "X=Y" -> "when the field X equals to value Y"... in this approach, using functions may be problematic.
Use a SQL parser and use a series of templates to realize the parsed structure in a textual form: "(SELECT (SUM(X)) (FROM (Y)))" -> "(display (the summation of (X)) (in the table (Y))"...
ANTLR has a grammar of SQL you can use: https://github.com/antlr/grammars-v4/blob/master/sqlite/SQLite.g4 and there are a couple SQL parsers:
http://www.sqlparser.com/sql-parser-java.php
https://github.com/facebook/presto/tree/master/presto-parser/src/main
http://db.apache.org/derby/
Parsing is a core process for executing a SQL query, check this for more information: https://decipherinfosys.wordpress.com/2007/04/19/parsing-of-sql-statements/
There is a new project (I am part of) called JustQuery.Me which intends to do just that with NLP and google's SyntaxNet. You can go to the https://github.com/justquery-me/justqueryme page for more info. Also, sign up for the mailing list at justqueryme-development#googlegroups.com and we will notify you when we have a proof of concept ready.
Related
My client is making database searches using a django webapp that I've written. The query sends a regex search to the database and outputs the results.
Because the regex searches can be pretty long and unintuitive, the client has asked for certain custom "wildcards" to be created for the regex searches. For example.
Ω := [^aeiou] (all non-vowels)
etc.
This could be achieved with a simple permanent string substitution in the query, something like
query = query.replace("Ω", "[^aeiou]")
for all the elements in the substitution list. This seems like it should be safe, but I'm not really sure.
He has also asked that it be possible for the user to define custom wildcards for their searches on the fly. So that there would be some other input box where a user could define
∫ := some other regex
And to store them you might create a model
class RegexWildcard(models.Model):
symbol = ...
replacement = ...
I'm personally a bit wary of this, because it does not seem to add a whole lot of functionality, but does seem to add a lot of complexity and potential problems to the code. Clients can now write their queries to a db. Can they overwrite each other's symbols?
That I haven't seen this done anywhere before also makes me kind of wary of the idea.
Is this possible? Desirable? A great idea? A terrible idea? Resources and any guidance appreciated.
Well, you're getting paid by the hour....
I don't see how involving the Greek alphabet is to anyone's advantage. If the queries are stored anywhere, everyone approaching the system would have to learn the new syntax to understand them. Plus, there's the problem of how to type the special symbols.
If the client creates complex regular expressions they'd like to be able to reuse, that's understandable. Your application could maintain a list of such expressions that the user could add to and choose from. Notionally, the user would "click on" an expression, and it would be inserted into the query.
The saved expressions could have user-defined names, to make them easier to remember and refer to. And you could define a syntax that referenced them, something otherwise invalid in SQL, such as ::name. Before submitting the query to the DBMS, you substitute the regex for the name.
You still have the problem of choosing good names, and training.
To prevent malformed SQL, I imagine you'll want to ensure the regex is valid. You wouldn't want your system to store a ; drop table CUSTOMERS; as a "regular expression"! You'll either have to validate the expression or, if you can, treat the regex as data in a parameterized query.
The real question to me, though, is why you're in the vicinity of standardized regex queries. That need suggests a database design issue: it suggests the column being queried is composed of composite data, and should be represented as multiple columns that can be queried directly, without using regular expressions.
Having read almost all topics related to dynamic where clauses, I still can't find a way through.
Here is my source table:
Source Table
And the result I want is:
Results
In fact I want to return all values satisfying the Test value condition but don't know how to implement it dynamicly (I have a table with 700K lines).
Thanks a lot for your help.
EDIT:
Following your answers, I will detailed a bit more the approach.
Unfortunately, as I'm a new user, I'm not allowed to post pictures directly in the post.
I'm basicly performing segregation of duties controls over the SAP system.
Basicly, I want to test if some of the access of a customer are conflictual based on SAP extractions against a knowledge template stating potential conflicts.
Here is an simplified example of the SAP extract:
And here is a simplified example of the Potential conflict template:
<table><tbody><tr><th>Field</th><th>Value</th></tr><tr><td>ACTVT</td><td>AZ0220</td></tr><tr><td>KOART</td><td>K</td></tr><tr><td>BUKRS</td><td>*</td></tr><tr><td>EKGRP</td><td>03</td></tr></tbody></table>
This is a faxe example of the raw data of the customer:
<table><tbody><tr><th>FIELD</th><th>RANGE_START</th><th>RANGE_END</th></tr><tr><td>ACTVT</td><td>AZ01*</td><td>AZ99*</td></tr><tr><td>KOART</td><td>A</td><td>L</td></tr><tr><td>BUKRS</td><td>011</td><td>099</td></tr><tr><td>EKGRP</td><td>2</td><td>10</td></tr></tbody></table>
I thought a way of doing this is to use dynamic where clause.
Thanks a lot for your help
I have a function that uses the SQL DIFFERENCE function to see if the name of a client is similar to a client already in the database
SELECT ID FROM People p
WHERE DIFFERENCE(p.FullName, #fullName) = 4
Being #fullname a variable passed to the function. The issue I'm having is that if I pass "pedro sanchez" as a parameter, the query will bring me all the Peter's in the database, or if I enter "pablo sanchez", it'll bring record "PEOPLE'S CREDIT UNION".
As I understand the DIFFERENCE function should returns 4 when the two strings are almost identical, but the results I'm having say otherwise.
Is there a way to further specify the resemblance to the DIFFERENCE function, or maybe another approach in finding similar names ?
Difference() is based on soundex(), which in turn -- to be frank -- is a lousy system for comparing strings. Let me add a caveat: it is pretty good for the purpose it was designed for, which is matching last names of people in English. You can read about the rules here and you can try it out here. Using the latter link, you can see that "Pedro" and "People" have the same code, P-140.
Soundex encodes the consonants and basically the first four matching consonants the list it cares about. (Some languages, such as Hawaiian and other Polynesian languages are rather light in consonants. One assumes the designers were not thinking about names in such languages.)
When you are looking for proximity among written strings, Levenshtein distance is a common metric. Unfortunately, SQL Server does not have this functionality built-in, but you can easily find implementations on the web. For most real applications, Levenshtein distance is too slow. Happily, the functionality of the full text search component is usually sufficient for most purposes.
I would like to build a safe dynamic select statement that can handle multiple WHERE clauses.
For example the base SQL would look like:
SELECT * FROM Books Where Type='Novel'
I would pass the function something like:
SafeDynamicSQL("author,=,Herman Melville","pages,>,1000");
Which would sanitize inputs and concatenate like:
SELECT * FROM Books Where Type='Novel' AND author=#author AND pages>#pages
The function would sanitize the column name by checking against an array of predefined column names. The operator would only be allowed to be >,<,=. The value would be added as a normal paramater.
Would this still be vulnerable to SQL injection?
There will be some string manipulation and small loops which will affect performance but my thoughts are that this will only take a few milliseconds compared to the request which on average take 200ms. Would this tax the server more than I am thinking if these requests are made about once a second?
I know this isn't best practice by any means, but it will greatly speed up development. Please give me any other reasons why this could be a bad idea.
It looks like you're reinventing any number of existing ORM solutions which offer a similar API for creating WHERE clauses.
The answer to your question hinges on what you mean by "The value would be added as a normal paramater." If by that you mean performing string concatenation to produce the string you showed then yes, that would still be subject to SQL injection attack. If you mean using an actual parameterized query then you would be safe. In the later case, you would produce something like
SELECT * FROM Books Where Type='Novel' AND author=? AND pages > ?
and then bind that to a list of values like ['Herman Melville', 1000]. Exactly what it would look like depends on what programming language you're using.
Finally, if you pursue this path I would strongly recommend changing from comma-delimited arguments to three separate arguments, you'd save yourself a lot of programming time.
Pretty much any code that appends together (or interpolates) strings to create SQL is bad form from a security point of view, and is probably subject to some SQLi attack vector. Just use bound parameters and avoid the entire problem; avoiding SQL injection is super-easy.
Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope however, if a user mispells dinosaurs? IE:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Any way programatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzt Logic answer provided by #Neil Knight (+1 to that, for me!).
This stackoverflow article also details possible sources for implentations for Fuzzy Logic in TSQL. Once respondant also outlined Full text Indexing as a potential that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
You can also try the SubString(), to eliminate the first 3 or so characters . Below is an example of how that can be achieved
SELECT Fname, Lname
FROM Table1 ,Table2
WHERE substr(Table1.Fname, 1,3) || substr(Table1.Lname,1 ,3) = substr(Table2.Fname, 1,3) || substr(Table2.Lname, 1 , 3))
ORDER BY Table1.Fname;