Is it possible to concisely tell SQL SELECT to omit some temporary/dummy columns?

Suppose I have this query:
select
  x + y as _total,
  abs(x - y) / _total as _err,
  round(100 * _err) as pct_err,
  x,
  y
from foo;
This assumes I have a table with x and y, and calculates an error between them. Note that columns I prefixed with _ are dummy columns - they're only there to show the steps of the calculation more clearly. Is there a way to omit them from the result?
I don't want to simply collapse the three columns into a single expression. That would be messier; consider also a calculation with 10 steps and much longer field names.
I don't want to make this a CTE and then re-select only the columns I want. That seems too much hassle for such a simple thing.
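For reference, the chained-CTE version I'm dismissing would look something like this (a sketch; one CTE per step, with a final select that re-lists only the columns I want to keep):

with s1 as (
  select x, y, x + y as _total
  from foo
), s2 as (
  select *, abs(x - y) / _total as _err
  from s1
)
select round(100 * _err) as pct_err, x, y
from s2;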
It would be okay if I could just put the dummy columns at the end where they would be out of the way, but SQL doesn't seem to allow referencing a column that comes after.
Note that "no" is an acceptable answer, if you have reasonably comprehensive knowledge of SQL syntax :)

Related

Replacing a query with SELECT * FROM X WHERE Y is not NULL

What does this query try to achieve?
SELECT * FROM X WHERE (X.Y in (select Y from X))
As far as I can tell, it yields the same result as
SELECT * FROM X WHERE Y is not NULL
Is there anything more to the first query? The first query is actually very slow with a large dataset and hence I want to know whether I can replace it with the second query.
You are right, the two queries are equivalent.
It is unclear why the first query was written this way. Maybe it looked different once.
As it is, your second query is better, because it is easier to read and understand (and, as you say, even faster).
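To convince yourself, consider a hypothetical table t(y int) holding the values 1, 2, and NULL. NULL compared with anything is unknown, so NULL never satisfies IN, and both queries drop the NULL row:

SELECT y FROM t WHERE y IN (SELECT y FROM t);  -- returns 1 and 2
SELECT y FROM t WHERE y IS NOT NULL;           -- also returns 1 and 2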
Your second query is better than the first one, because in the first query you may get unexpected (NULL-related) behavior if column Y contains NULL values, while the second query handles NULLs explicitly.
So, depending on the values in your table, the two queries can behave in two different ways.

PostgreSQL - How to ignore gross errors when using AVG() and MAX()?

I have data in a table and I'd like to deliberately ignore some of the obviously incorrect data and take an average of the more plausible data.
Here's a simplified example of what I mean. Let's say I have a table that lists people and their height in cms.
I might use this to get the average height.....
SELECT AVG(height) FROM people;
That's fine if the data was all added correctly, but if there are (say) ten people in the database with correct heights, and one person whose height has been recorded as a billion centimetres, then AVG() won't return a sensible value: a classic example of GIGO (garbage in, garbage out).
Is there any way to adjust the above SQL to ignore the outlying data points, i.e. the data that is so different from all the rest that it's got to be wrong?
I'm pretty sure the solution will involve one of the functions listed here, but I'm having trouble finding plain-English explanations of what they do and how they work.
UPDATE: My quoted example using height was selected for simplicity of explanation. Any proposed solution CAN'T simply filter between sensible values (e.g. height above 1.5 m and below 2 m), because for the actual data I'm using I don't know what the sensible values are! The solution needs to reject data that is massively different from the majority of the other data, so I guess that's where a knowledge of stats comes in handy.
UPDATE 2: Sorry, I'm going to have to un-accept the answer I previously accepted (helpful though it was!). The standard deviation gives a value for the 'spread' of the data, but doesn't give any idea of where the outlying data is (i.e. stupidly tall people, or stupidly short people), so a clause like this...
WHERE height BETWEEN (SELECT a-2*sd FROM cte) AND (SELECT a+2*sd FROM cte);
...doesn't just remove the one stupidly tall person from one end of the range; it also removes all of the 'normal height' people from the other end of the range!
I can adjust the WHERE clause like this....
WHERE height BETWEEN (SELECT a-(sd/100) FROM cte) AND (SELECT a+(sd/100) FROM cte);
But I'm looking for a solution that doesn't require individual tweaking for each different set of data.
You could use FILTER:
SELECT AVG(height) FILTER (WHERE height BETWEEN x AND y) AS avg_height
FROM people;
-- or `WHERE`:
SELECT AVG(height) AS avg_height
FROM people
WHERE height BETWEEN x AND y;
x and y are plausible values.
Alternatively, you could filter out values that fall outside the range AVG() ± 2 * STDDEV():
WITH cte AS (
  SELECT AVG(height) a, STDDEV(height) sd
  FROM people
)
SELECT AVG(height)
FROM people
WHERE height BETWEEN (SELECT a - 2*sd FROM cte) AND (SELECT a + 2*sd FROM cte);
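Following up on the second update: if a band around the mean is too sensitive to the outliers themselves (one extreme value inflates both AVG() and STDDEV()), a more robust variant keys off the median and the median absolute deviation instead. This is a sketch assuming PostgreSQL's percentile_cont ordered-set aggregate; the 3x cutoff is an arbitrary starting point, not a magic number:

WITH med AS (
  SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY height) AS m
  FROM people
),
mad AS (
  -- median absolute deviation: the median distance from the median
  SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY abs(height - med.m)) AS d
  FROM people, med
)
SELECT AVG(height)
FROM people, med, mad
WHERE abs(height - med.m) <= 3 * mad.d;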

SQL - HAVING (execution vs structure)

I'm a beginner, studying on my own... please help me clarify something about a query. I am working with a soccer database and trying to answer this question: list all seasons with an average goals-per-match rate of over 1, counting only matches that didn't end in a draw.
The right query for it is:
select season,
       round(sum(home_team_goal + away_team_goal) * 1.0 / count(id), 3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having ratio > 1
I don't understand 2 things about this query:
Why do I multiply by 1.0? Why is it necessary?
I know that the execution in SQL is by this order:
from
where
group
having
select
So how can this query include "having ratio > 1" if "ratio" is only defined in the SELECT, which is executed AFTER the HAVING?
Am I confused?
Thanks in advance for the help!
The multiplication is added as a typecast to convert INT to FLOAT: by default the sum of ints is an int, and dividing two ints loses the decimal places.
HAVING: you can think of HAVING as a WHERE applied to the query's results. Imagine the query being executed first without HAVING, and then the HAVING condition being applied to the result rows, leaving only the suitable ones.
In your case, you first select the grouped data and calculate the aggregated results, and then skip the unnecessary aggregation results.
The *1.0 is used for its ".0" part: it tells the system to treat the expression as a decimal and thus avoid integer division, which would cut off the decimal part (e.g. 1 instead of 1.33).
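A quick illustration, as a sketch (exact output formatting varies by engine):

select 4 / 3;        -- 1: integer division cuts off the decimal part
select 4 * 1.0 / 3;  -- 1.333...: the 1.0 forces decimal division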
About the second part: SELECT being at the end just means that the last thing to be done is showing the data. However, assigning an alias to a calculated field happens, you could say, at first priority. Still, I am a bit doubtful; I am almost certain field aliases cannot be used in the WHERE/GROUP BY/HAVING in, say, SQL Server.
There is no order of execution of a SQL query. SQL is a descriptive language not a procedural language. A SQL query describes the result set that the query is producing. The SQL engine can execute it however it likes. In fact, most SQL engines compile the query into a directed acyclic graph, which looks nothing like the original query.
What you are referring to might be better phrased as the "order of interpretation". This is more simply described by simple rules. Column aliases can be used in the ORDER BY clause in any database. They cannot be used in the FROM, WHERE, or GROUP BY clauses. Some databases -- such as SQLite -- allow them to be referenced in the HAVING clause.
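A portable version of the query therefore repeats the expression in the HAVING clause instead of referencing the alias (a sketch):

select season,
       round(sum(home_team_goal + away_team_goal) * 1.0 / count(id), 3) as ratio
from match
where home_team_goal != away_team_goal
group by season
having sum(home_team_goal + away_team_goal) * 1.0 / count(id) > 1;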
As for the * 1.0, it is because some databases -- such as SQLite -- do integer arithmetic. However, the logic that you want is probably more simply expressed as:
round(avg((home_team_goal + away_team_goal) * 1.0), 3)

Sum two counts in a new column without repeating the code

I have one possibly stupid question.
Look at the query :
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum the two columns without repeating the code?
Something like:
select count(a) as A, count(b) as b, A+B as C
From X
For the sake of completeness, using a CTE:
WITH V AS (
  SELECT COUNT(a) as A, COUNT(b) as B
  FROM X
)
SELECT A, B, A + B as C
FROM V
This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
  SELECT COUNT(a) as A, COUNT(b) as B
  FROM X
) T
You may get the two individual counts from the same table and then take the sum of those counts, like below:
SELECT
  (SELECT COUNT(a) FROM X) +
  (SELECT COUNT(b) FROM X) AS C
Let's agree on one point: SQL is not like most programming languages. When we think of computer languages, we usually think of procedural languages (you use the language to describe, step by step, how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural language, your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language, your main concern is: is this the best description of the desired result?
In SQL, most of your effort will go into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much down to aggregating and formatting (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may let you mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plans of the queries as you show them -- the CTE and the subquery -- in most systems they will probably be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to express what originally took one step in one line, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better to start with the CTE or subquery and change it to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.

Repeating operations vs multilevel queries

I've always been bothered by how I should approach these, and which solution is better. I guess the sample code will explain it better.
Let's imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, with each calculation being based on a previous one. In other words, something like this:
SELECT
    *,
    Value + 10 AS NewValue1,
    Value / NewValue1 AS SomeOtherValue,
    (Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
    MyTable
WHERE
    Name LIKE 'A%'
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
    *,
    Value + 10 AS NewValue1,
    Value / (Value + 10) AS SomeOtherValue,
    (Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
    MyTable
WHERE
    Name LIKE 'A%'
The other one involves constructing a multilevel query like this:
SELECT
    t2.*,
    (t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
    SELECT
        t1.*,
        t1.Value / t1.NewValue1 AS SomeOtherValue
    FROM
    (
        SELECT
            *,
            Value + 10 AS NewValue1
        FROM
            MyTable
        WHERE
            Name LIKE 'A%'
    ) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that a "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tried a number of different combinations of calculations in both variants. They always produced the same execution plan, so it can be assumed that there is no difference in performance. From a code-usability perspective the first approach is obviously better, as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes back to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the alias is defined before it is evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.
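For illustration, here is the multilevel example from the question rewritten with CTEs -- a sketch that describes the same result, so most engines should produce the same plan for it:

WITH t1 AS (
    SELECT *, Value + 10 AS NewValue1
    FROM MyTable
    WHERE Name LIKE 'A%'
),
t2 AS (
    SELECT *, Value / NewValue1 AS SomeOtherValue
    FROM t1
)
SELECT
    *,
    (Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM t2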