I have a repository of SQL queries and I want to understand which queries use certain tables or fields.
Let's say I want to understand what queries use the email field, how can I write it?
Example SQL query:
select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2,
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users
So, to state the problem more accurately: you are sorting through a list of SQL queries (as text), and you need to find the queries that use certain fields, using SQL and RegEx (regular expressions) in PostgreSQL. (Please tag the question so that StackOverflow indexes it correctly and, more importantly, so that readers have more context about it.)
PostgreSQL has regular expression support OOTB (out of the box), so we can skip exploring other ways to do this. (If you are reading this as a Microsoft SQL Server person, I strongly suggest reading this brilliant article on Microsoft's website on defining a table-valued UDF (user-defined function).)
The simplest approach I could think of is to first throw away what we don't want from the query text, and then filter what's left.
After throwing away the parts you don't need, you are left with a set of "tokens" that you can easily filter. I'm putting "token" in quotes because we are not really parsing the SQL language, but if we were, extracting tokens would be the first step. (:
Take this query for example:
With Queries (
Id
, QueryText
) As (
values (1, 'select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2,
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users')
)
Select QueryText
, found
From (
Select Id
, QueryText
, regexp_split_to_table (QueryText, '(--[\s\w]+|select|from|as|where|[ \s\n,])') As found
From Queries
) As Result
Where found != ''
And found = 'back_email'
I have simulated the "query repository" with a WITH statement to keep the example self-contained.
I have also selected a few words and characters to split QueryText on, such as select, where, etc. We don't need these in our 'found' set.
In the end, as you can see above, I simply took what was left as found and filtered it by the field name you are looking for (assuming you know which field that is).
You could improve on my RegEx, or change the method as you wish to make it better, but I think the general concept addresses what you need to achieve. One problem I can see with my solution right off the bat is that you can search for anything really, not just the names of the selected fields, which raises the question: why use RegEx and not LIKE? But again, as I mentioned, you can improve the RegEx to address specific requirements you may have, and LIKE might limit you in that direction. (In other words, only you know what's good for you. I can't say that from here.)
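As one possible refinement of the same "throw away, then filter" idea, here is a minimal sketch in Python rather than PostgreSQL. The repository contents, the `tokens` and `queries_using` helpers, and the choice to match on the last dotted component (so that `users.email` counts as `email`) are all assumptions made for illustration; this is still not a real SQL parser.

```python
import re

# Hypothetical in-memory "query repository": id -> query text.
QUERIES = {
    1: """select
users.email as email_user
,users.email as email_user_too
,email as email_user_too_2,
email as email_user_too_3,
back_email as wrong_email -- wrong field
from users""",
    2: "select id, name from users",
}

COMMENT = re.compile(r"--[^\n]*")        # line comments, up to end of line
SEPARATORS = re.compile(r"[\s,()]+")     # whitespace, commas, parentheses
KEYWORDS = {"select", "from", "as", "where"}


def tokens(query_text):
    """Strip comments, split on separators, and drop keywords and empties."""
    without_comments = COMMENT.sub(" ", query_text)
    parts = SEPARATORS.split(without_comments)
    return [p for p in parts if p and p.lower() not in KEYWORDS]


def queries_using(field, queries):
    """Return ids of queries with a token whose last dotted part equals field."""
    wanted = field.lower()
    hits = []
    for query_id, text in queries.items():
        if any(tok.lower().split(".")[-1] == wanted for tok in tokens(text)):
            hits.append(query_id)
    return hits


print(queries_using("email", QUERIES))
```

Because keywords are removed only as whole tokens, a name such as `back_email` or `email_user` is never mistaken for `email`, and text inside `--` comments is ignored.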
You can play with the query online here: db-fiddle query and use https://regex101.com/ for testing your RegEx.
Disclaimer: I'm not a PostgreSQL developer. There must be other, perhaps better ways of doing this. (:
I recently came across a program that stores its SQL statements in a table, with a code for each statement, rather than having specific SQL statements in the program itself.
So, rather than having code like this:
string query = "SELECT id, name from [Users]";
cmd.ExecuteQuery(query);
They use code like this: (simplified)
string firstQuery = "SELECT queryText from [Queries] where queryCode = 'SELECT_ALL_USERS'";
string userQuery = cmd.ExecuteQuery(firstQuery);//pretend this directly returns the result of the first query
cmd.ExecuteQuery(userQuery);
The logic behind this as far as I've heard is that it makes the program easier to maintain as the developer is free to change the "user sql" without having to actually change the program.
However, this struck me as maybe a little counterproductive. Would this kind of code be considered a good idea?
EDIT: I'm not looking for suggestions like "use an ORM". Assume that sql queries are the only option.
In my opinion, this approach is ridiculous. There is value (maintainability, modularity) in separating as much SQL from the middle tier as possible, but to accomplish this end, I would recommend using stored procedures.
No, I really don't think it's a good idea to proceed with this design.
As a test or learning exercise it's a different matter, but going forward with such an implementation is definitely not advisable.
Pros:
1. We get complete modularity. The real business schema can change at any time, and we do not need to modify the running application to get results from a different schema (provided the result format doesn't change).
Cons:
1. With this implementation we fire two SQL statements at the database every time we want to execute one. I/O calls, including database calls, are always a performance hit, and this implementation doubles the number of round trips, which is definitely not advisable.
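To make the trade-off discussed in this thread concrete, here is a minimal sketch of the lookup-then-execute pattern using Python's built-in `sqlite3` module. The in-memory database, the sample rows, and the `run_stored_query` helper are assumptions for illustration; the table and column names follow the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO Users (id, name) VALUES (1, 'alice'), (2, 'bob');
    CREATE TABLE Queries (queryCode TEXT PRIMARY KEY, queryText TEXT NOT NULL);
    INSERT INTO Queries (queryCode, queryText)
    VALUES ('SELECT_ALL_USERS', 'SELECT id, name FROM Users ORDER BY id');
""")


def run_stored_query(conn, query_code):
    """Hypothetical helper: fetch the SQL text by code, then execute it."""
    # First statement: look up the query text.
    row = conn.execute(
        "SELECT queryText FROM Queries WHERE queryCode = ?", (query_code,)
    ).fetchone()
    if row is None:
        raise KeyError(query_code)
    # Second statement: run whatever text was stored.
    return conn.execute(row[0]).fetchall()


before = run_stored_query(conn, "SELECT_ALL_USERS")

# The "user SQL" changes in the table; the program code does not.
conn.execute(
    "UPDATE Queries SET queryText = 'SELECT id, name FROM Users WHERE id = 1' "
    "WHERE queryCode = 'SELECT_ALL_USERS'"
)
after = run_stored_query(conn, "SELECT_ALL_USERS")

print(before, after)
```

The sketch shows both sides of the argument: the result changes without touching the program, and every call issues two statements instead of one.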
When I try to run this query in Access through the ODBC interface into a MySQL database, I get an "Expression too complex in query expression" error. The essential thing I'm trying to do is translate abbreviated language codes into their full English names. I was curious whether there is some way to "trick" Access into thinking the expression is smaller by using subqueries, or whether someone has a better idea of how to solve this problem. I thought about making a temporary table and doing a join on it, but that's not supported in Access SQL.
Just as an FYI, the query worked fine until I added the big long IIf chain. I tested the query with a smaller IIf chain for three languages and that wasn't an issue, so the problem definitely stems from the huge IIf chain (it's 26 deep). Also, I might be able to drop some of the options (for example by combining the different forms of Chinese or Portuguese).
As a test, I was able to get the SQL query to work after paring it down to 14 IIf() calls, but that's a far cry from the 26 languages I'd like to represent.
SELECT TOP 5 Count( * ) AS [Number of visits by language], IIf(login.lang="ar","Arabic",IIf(login.lang="bg","Bulgarian",IIf(login.lang="zh_CN","Chinese (Simplified Han)",IIf(login.lang="zh_TW","Chinese (Traditional Han)",IIf(login.lang="cs","Czech",IIf(login.lang="da","Danish",IIf(login.lang="de","German",IIf(login.lang="en_US","United States English",IIf(login.lang="en_GB","British English",IIf(login.lang="es","Spanish",IIf(login.lang="fr","French",IIf(login.lang="el","Greek",IIf(login.lang="it","Italian",IIf(login.lang="ko","Korean",IIf(login.lang="hu","Hungarian",IIf(login.lang="nl","Dutch",IIf(login.lang="pl","Polish",IIf(login.lang="pt_PT","European Portuguese",IIf(login.lang="pt_BR","Brazilian Portuguese",IIf(login.lang="ru","Russian",IIf(login.lang="sk","Slovak",IIf(login.lang="sl","Slovenian",IIf(login.lang="fi","Finnish",IIf(login.lang="sv","Swedish",IIf(login.lang="tr","Turkish","Unknown"))))))))))))))))))))))))) AS [Language]
FROM login, reservations, reservation_users, schedules
WHERE (reservations.start_date Between DATEDIFF('s','1970-01-01 00:00:00',[Starting Date in the Following Format YYYY/MM/DD]) And DATEDIFF('s','1970-01-01 00:00:00',[Ending Date in the Following Format YYYY/MM/DD])) And reservations.is_blackout=0 And reservation_users.memberid=login.memberid And reservation_users.resid=reservations.resid And reservation_users.invited=0 And reservations.scheduleid=schedules.scheduleid And scheduletitle=[Schedule Title]
GROUP BY login.lang
ORDER BY Count( * ) DESC;
@Michael Todd
I completely agree. The list of languages should have been a table in the database and the login.lang should have been a FK into that table. Unfortunately this isn't how the database was written, and it's not really mine to modify. The languages are placed into the login.lang field by the PHP running on top of the database.
I thought about making a temporary table and doing a join on it, but that's not supported in Access SQL.
Did you try making a table of languages within Access, and joining it to the MySQL tables?
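The lookup-table idea can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in, not Access or MySQL, and the `languages` table, its columns, and the sample rows are assumptions for illustration. A LEFT JOIN plus COALESCE plays the role of the final "Unknown" branch of the IIf chain.

```python
import sqlite3

# Hypothetical subset of the code -> name mapping from the question.
LANGUAGES = [
    ("ar", "Arabic"),
    ("de", "German"),
    ("en_US", "United States English"),
    ("fr", "French"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE login (memberid INTEGER PRIMARY KEY, lang TEXT);
    CREATE TABLE languages (code TEXT PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO login (memberid, lang)
    VALUES (1, 'de'), (2, 'de'), (3, 'fr'), (4, 'xx');
""")
conn.executemany("INSERT INTO languages (code, name) VALUES (?, ?)", LANGUAGES)

# Codes missing from the lookup table fall through to 'Unknown'.
rows = conn.execute("""
    SELECT COALESCE(languages.name, 'Unknown') AS lang_name,
           COUNT(*) AS visits
    FROM login
    LEFT JOIN languages ON languages.code = login.lang
    GROUP BY COALESCE(languages.name, 'Unknown')
    ORDER BY visits DESC, lang_name
""").fetchall()

print(rows)
```

Adding a language then means inserting one row rather than nesting another expression.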
You could try the expression below. I split your expression into two parts, and a final IIf check combines them. You will end up with two additional fields, which you can ignore. I had the same situation and this worked well for me. PS: You may need to double-check the closing brackets in the expression below; I wrote it quickly.
Thanks,
Shibin
IIf(login.lang="ar","Arabic",IIf(login.lang="bg","Bulgarian",IIf(login.lang="zh_CN","Chinese (Simplified Han)",IIf(login.lang="zh_TW","Chinese (Traditional Han)",IIf(login.lang="cs","Czech",IIf(login.lang="da","Danish",IIf(login.lang="de","German",IIf(login.lang="en_US","United States English",IIf(login.lang="en_GB","British English",IIf(login.lang="es","Spanish",IIf(login.lang="fr","French",IIf(login.lang="el","Greek",IIf(login.lang="it","Italian",""))))))))))))) as l1,
IIf(login.lang="ko","Korean",IIf(login.lang="hu","Hungarian",IIf(login.lang="nl","Dutch",IIf(login.lang="pl","Polish",IIf(login.lang="pt_PT","European Portuguese",IIf(login.lang="pt_BR","Brazilian Portuguese",IIf(login.lang="ru","Russian",IIf(login.lang="sk","Slovak",IIf(login.lang="sl","Slovenian",IIf(login.lang="fi","Finnish",IIf(login.lang="sv","Swedish",IIf(login.lang="tr","Turkish","Unknown")))))))))))) as l2,
IIf(l1="",l2,l1) AS [Language]
If you can't use a lookup table, create a custom VB function, so that instead of 26 IIf statements, you have one function call.
I'm entirely new at SQL, but let's say that on the StackExchange Data Explorer, I just want to list the top 15 users by reputation, and I wrote something like this:
SELECT TOP 15
DisplayName, Id, Reputation, Reputation/1000 As RepInK
FROM
Users
WHERE
RepInK > 10
ORDER BY Reputation DESC
Currently this gives an Error: Invalid column name 'RepInK', which makes sense, I think, because RepInK is not a column in Users. I can easily fix this by saying WHERE Reputation/1000 > 10, essentially repeating the formula.
So the questions are:
Can I actually use the RepInK "column" in the WHERE clause?
Do I perhaps need to create a virtual table/view with this column, and then do a SELECT/WHERE query on it?
Can I name an expression, e.g. Reputation/1000, so I only have to repeat the names in a few places instead of the formula?
What do you call this? A substitution macro? A function? A stored procedure?
Is there an SQL quicksheet, glossary of terms, language specification, anything I can use to quickly pick up the syntax and semantics of the language?
I understand that there are different "flavors"?
Can I actually use the RepInK "column" in the WHERE clause?
No, but you can generally expect your database to evaluate (Reputation / 1000) only once, even if you use it both in the SELECT fields and within the WHERE clause.
Do I perhaps need to create a virtual table/view with this column, and then do a SELECT/WHERE query on it?
Yes, a view is one option to simplify complex queries.
Can I name an expression, e.g. Reputation/1000, so I only have to repeat the names in a few places instead of the formula?
You could create a user-defined function, called something like convertToK, which receives the rep value as an argument and returns it divided by 1000. However, this is often not practical for a trivial case like the one in your example.
Is there an SQL quicksheet, glossary of terms, language specification, anything I can use to quickly pick up the syntax and semantics of the language?
I suggest practice. You may want to start following the mysql tag on Stack Overflow, where many beginner questions are asked every day. Download MySQL, and when you think there's a question within your reach, try to work out the solution. I think this will help you pick up speed, as well as awareness of the language's features. There's no need to post the answer at first, because there are some pretty fast guns on the topic over here, but with some practice I'm sure you'll be able to bring home some points :)
I understand that there are different "flavors"?
The flavors are actually extensions to ANSI SQL. Database vendors usually augment the SQL language with extensions such as Transact-SQL and PL/SQL.
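You could simply rewrite the WHERE clause against the raw column. Since Reputation is an integer column and SQL Server's integer division truncates, Reputation/1000 > 10 is the same as:
where reputation >= 11000
This won't always be convenient. Alternatively, you can use an inline view: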
SELECT
a.DisplayName, a.Id, a.Reputation, a.RepInK
FROM
(
SELECT TOP 15
DisplayName, Id, Reputation, Reputation/1000 As RepInK
FROM
Users
ORDER BY Reputation DESC
) a
WHERE
a.RepInK > 10
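As a rough check of the inline-view approach, the sketch below runs a derived-table query next to a direct filter on the raw column. It uses Python's built-in `sqlite3` as a stand-in for SQL Server, so `LIMIT` replaces `TOP`, and the sample rows are invented for illustration. Here the alias is defined in the inner query and filtered in the outer one, before the limit is applied. Both engines truncate integer division, which is why the direct filter is `>= 11000` rather than `> 10000`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT, Reputation INTEGER)"
)
conn.executemany(
    "INSERT INTO Users (Id, DisplayName, Reputation) VALUES (?, ?, ?)",
    [(1, "a", 9500), (2, "b", 10999), (3, "c", 11000), (4, "d", 25000)],
)

# Derived table: the alias RepInK is defined once, then filtered outside.
derived = conn.execute("""
    SELECT DisplayName, Reputation, RepInK
    FROM (
        SELECT DisplayName, Reputation, Reputation / 1000 AS RepInK
        FROM Users
    ) AS u
    WHERE RepInK > 10
    ORDER BY Reputation DESC
    LIMIT 15
""").fetchall()

# Direct filter on the raw column, written to match integer division.
direct = conn.execute("""
    SELECT DisplayName, Reputation, Reputation / 1000 AS RepInK
    FROM Users
    WHERE Reputation >= 11000
    ORDER BY Reputation DESC
    LIMIT 15
""").fetchall()

print(derived == direct)
```

Note that the user with 10999 reputation is excluded by both queries, because 10999 / 1000 evaluates to 10.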
Regarding something like named expressions, while there are several possible alternatives, the query optimizer is going to do best just writing out the formula Reputation / 1000 long-hand. If you really need to run a whole group of queries using the same evaluated value, your best bet is to create view with the field defined, but you wouldn't want to do that for a one-off query.
As an alternative, (and in cases where performance is not much of an issue), you could try something like:
SELECT TOP 15
DisplayName, Id, Reputation, RepInk
FROM (
SELECT DisplayName, Id, Reputation, Reputation / 1000 as RepInk
FROM Users
) AS t
WHERE t.RepInk > 10
ORDER BY Reputation DESC
though I don't believe that's supported by all SQL dialects and, again, the optimizer may do a much worse job with this kind of thing (since it may run the SELECT against the full Users table and then filter that result). Still, for some situations this sort of query, commonly called a derived table or inline view, is appropriate.
Personally, when I started out with SQL, I found the W3Schools reference to be my constant stopping-off point. It fits my style as something I can glance at to find a quick answer and move on. Eventually, however, to really take advantage of the database it is necessary to delve into the vendor's documentation.
Although SQL is "standardized", unfortunately (though, to some extent, fortunately) each database vendor implements its own version with its own extensions, which can lead to quite different syntax being the most appropriate (for a discussion of the incompatibilities of various databases on one issue, see the SQLite documentation on NULL handling). In particular, standard functions, e.g. for handling DATEs and TIMEs, tend to differ per vendor, and there are other, more drastic differences (particularly in whether subselects are supported and how JOINs are handled). If you care for some of the details, this document provides both the standard forms and deviations for several major databases.
You CAN refer to RepInK in the Order By clause, but in the Where clause you must repeat the expression. But, as others have said, it will only be executed once.
There are good answers for the technical problem already, so I'll only address some of the rest of your questions.
If you're just working with the Data Explorer, you'll want to familiarize yourself with SQL Server syntax, since that's what it's running. The best place to find that, of course, is MSDN's reference.
Yes, there are different variations in SQL syntax. For example, the TOP clause in the query you gave is SQL Server specific; in MySQL you'd use the LIMIT clause instead (and these keywords don't necessarily appear in the same spot in the query!).
Are there any formal techniques for refactoring SQL similar to this list here that is for code?
I am currently working on a massive query for a particular report and I'm sure there's plenty of scope for refactoring here which I'm just stumbling through myself bit by bit.
I have never seen an exhaustive list like the sample you provided.
The most effective way to refactor SQL that I have seen is to use the WITH statement.
It allows you to break the SQL up into manageable parts, which can frequently be tested independently. In addition, it can enable the reuse of query results, sometimes by the use of a system temporary table. It is well worth the effort to examine.
Here is a silly example:
WITH
mnssnInfo AS
(
SELECT SSN,
UPPER(LAST_NAME) AS LAST_NAME,
UPPER(FIRST_NAME) AS FIRST_NAME,
TAXABLE_INCOME,
CHARITABLE_DONATIONS
FROM IRS_MASTER_FILE
WHERE STATE = 'MN' AND -- limit to Minne-so-tah
TAXABLE_INCOME > 250000 AND -- is rich
CHARITABLE_DONATIONS > 5000 -- might donate too
),
doltishApplicants AS
(
SELECT SSN, SAT_SCORE, SUBMISSION_DATE
FROM COLLEGE_ADMISSIONS
WHERE SAT_SCORE < 100 -- Not as smart as the average moose.
),
todaysAdmissions AS
(
SELECT doltishApplicants.SSN,
TRUNC(SUBMISSION_DATE) SUBMIT_DATE,
LAST_NAME, FIRST_NAME,
TAXABLE_INCOME
FROM mnssnInfo,
doltishApplicants
WHERE mnssnInfo.SSN = doltishApplicants.SSN
)
SELECT 'Dear ' || FIRST_NAME ||
' your admission to WhatsaMattaU has been accepted.'
FROM todaysAdmissions
WHERE SUBMIT_DATE = TRUNC(SYSDATE) -- For stuff received today only
Another thing I like about this form is that it allows you to separate the filtering from the joining. As a result, you can frequently copy out the subqueries and execute them standalone to view the result set associated with each.
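The "copy out a subquery and run it standalone" workflow can be sketched like this. It uses Python's built-in `sqlite3` instead of the Oracle dialect of the example above, and the table names, sample rows, and the `RICH_MN` / `LOW_SCORES` constants are assumptions for illustration. Each filtering step lives in its own named fragment, which can be executed alone or composed into the full WITH query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxpayers (
        ssn TEXT PRIMARY KEY, first_name TEXT, state TEXT, taxable_income INTEGER
    );
    CREATE TABLE applicants (ssn TEXT, sat_score INTEGER);
    INSERT INTO taxpayers VALUES
        ('1', 'Ann', 'MN', 300000),
        ('2', 'Bo',  'MN', 50000),
        ('3', 'Cy',  'WI', 400000);
    INSERT INTO applicants VALUES ('1', 90), ('2', 80), ('3', 95);
""")

# Each fragment only filters; the join happens in the final SELECT.
RICH_MN = (
    "SELECT ssn, UPPER(first_name) AS first_name FROM taxpayers "
    "WHERE state = 'MN' AND taxable_income > 250000"
)
LOW_SCORES = "SELECT ssn, sat_score FROM applicants WHERE sat_score < 100"

# A fragment can be run on its own to inspect its result set.
rich_rows = conn.execute(RICH_MN).fetchall()

full_query = f"""
    WITH rich_mn AS ({RICH_MN}),
         low_scores AS ({LOW_SCORES})
    SELECT rich_mn.first_name, low_scores.sat_score
    FROM rich_mn
    JOIN low_scores ON low_scores.ssn = rich_mn.ssn
"""
matches = conn.execute(full_query).fetchall()

print(rich_rows, matches)
```

Composing fragments with an f-string is only safe here because they are fixed constants in the program, never user input.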
There is a book on the subject: "Refactoring Databases". I haven't read it, but it got 4.5/5 stars on Amazon and is co-authored by Scott Ambler, which are both good signs.
Not that I've ever found. I've mostly done SQL Server work and the standard techniques are:
Parameterise hard-coded values that might change (so the query can be cached)
Review the execution plan, check where the big monsters are and try changing them
Index tuning wizard (but beware you don't cause chaos elsewhere from any changes you make for this)
If you're still stuck, many reports don't depend on 100% live data - try precalculating portions of the data (or the whole lot) on a schedule such as overnight.
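The first technique in the list, parameterising hard-coded values, might look like the following minimal sketch. It uses Python's built-in `sqlite3` rather than SQL Server, and the `orders` table, its rows, and the `region_report` helper are invented for illustration. The point is only that the SQL text stays constant while the values vary, which is what lets a database reuse its work for the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, total INTEGER);
    INSERT INTO orders (region, total)
    VALUES ('north', 120), ('north', 80), ('south', 200);
""")

# One fixed statement; the values that might change are placeholders.
REPORT_SQL = (
    "SELECT COUNT(*), SUM(total) FROM orders WHERE region = ? AND total >= ?"
)


def region_report(conn, region, min_total):
    """Hypothetical report helper: same SQL text for every call."""
    return conn.execute(REPORT_SQL, (region, min_total)).fetchone()


print(region_report(conn, "north", 100))
print(region_report(conn, "south", 0))
```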
Not about techniques as much, but this question might help you find SQL refactoring tools:
Is there a tool for refactoring SQL, a bit like a ReSharper for SQL