SQL Error when I use distinct order by upper/lower - sql

This does not work in postgres 8.4:
SELECT DISTINCT col1 FROM mytable
ORDER BY UPPER(col1);
but this works:
SELECT DISTINCT col1 FROM mytable
ORDER BY col1;
I know it might be bit confusing for the database whether to apply DISTINCT first and then UPPER or first convert to UPPER and then apply DISTINCT. Based on order how it applies one may get different result. Not sure if SQL standard says anything in this regards.
Any help will be highly appreciated.

Many SQL engines only allow you to sort on columns that are being selected. So the fix to add the UPPER(col1) to the select.
SELECT DISTINCT UPPER(col1), col1 FROM mytable ORDER BY UPPER(col1)

Just stumbled across the same problem and in my particular context the following was simpler to implement:
select * from (select distinct col1 from mytable) x order by upper(col1)
I did not bother doing any performance testing (in my case the data volume is fairly low), but possibly this might even improve speed as the sorting may happen on significantly less data (Postgres doc says that sorting happens before applying DISTINCT, while the way above we first DISTINCT, then sort).

Related

Fetch data sequentially using union operator

I am fetching data using Union operator. I want my output to be in the same order as my select queries are fetching but instead Union sorts it in alphabetical order. Can you suggest me a way to avoid getting it sorted by default.
Try to do it in a subquery like this:
select * from (select x , y ,z from table1
UNION ALL
select x,y,z from table2)
order by y
Thilo is correct: to be safe, a query should always explicitly order the results. Relying on implicit sorting has caused many problems in the past and will continue to cause more problems in the future.
The suggestion that UNION ALL will avoid the sorting is almost always correct. It should work in 11g and below. But 12c introduced concurrent execution of union all, which does not guarantee the order of results any more.
Even if implicit sorting works right now it's always a good idea to add an ORDER BY.
You should not rely on the order if you do not specify any ORDER BY clause.
UNION guaranties uniquines of the result. Pre 10g versions used sort to remove duplicates. Newer Oracle version also might (but not must) use a hash-table to remove duplicates - therefore the result is not necessarily sorted.
UNION ALL does not care about uniquines.
You can simply type:
select x , y ,z from table1
UNION ALL
select x,y,z from table2
order by y
Order by applies onto the whole result.

sql function column not available for where clause

I have a query:
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
where RowNum = 1
and I'm getting the following error:
Invalid column name 'RowNum'.
I did some search here and found that column aliasing is not available in WHERE.
so I tried the the following and it worked:
select *
from
(
SELECT ROW_NUMBER() OVER(ORDER BY LogId) AS RowNum
FROM [Log] l
) as t
where t.RowNum = 1
Is there a better way, from performance point of view, to make this query?
Thanks in advance.
That's just the way it is.
Column aliases can not be used on the same logical level where they were defined. You will have to use the derived table (sub-query) as you have found out.
If you are concerned about performance, then don't. The derived table is mere syntactical sugar, it won't make the query slower (compared to the solution you tried first).
An alternative to this specific query, which won't perform any different but is simpler to write:
SELECT TOP 1 <col list> FROM dbo.[Log] ORDER BY LogId;
As #a_horse explained, don't be concerned that because your second query looks like more code that it is more expensive. If you want to measure the efficiency of different queries that get the same results, compare their execution plans, not code complexity.

How to get all results, except one row based on a timestamp?

I have an simple question (?) about SQL. I have come across this problem a few times before and I have always solved it, but I'm looking for a more elegant solution and perhaps a faster solution.
The problem is that I would like to select all rows in a table except the one with the max value in a timestampvalue (in this case this is a summary row but it's not marked as this is any way, and it's not releveant to my result).
I could do something like this:
select * from [table] t
where loggedat < (select max(loggedat) from [table] and somecolumn='somevalue')
and somecolumn='somevalue'
But when working with large tables this seems kind of slow. Any suggestions?
If you don't want to change your DB structure, then your query (or one with a slight variation using <> instead of <) is the way to go.
You could add a column IsSummary bit to the table, and always mark the most recent row as true (and all others false). Then your query would change to:
Select * from [table] where IsSummary = 0 and somecolumn = 'somevalue'
This would sacrifice slower speed on inserts (since an insert would also trigger an update of the IsSummary value) in exchange for faster speed on the select query.
If only you don't mind one tiny (4 byte) extra column, then you might possibly go like this:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY loggedat DESC) AS rownum
FROM [table] t
WHERE somecolumn = 'somevalue'
/* and all the other filters you want */
) s
WHERE rownum > 1
In case you do mind the extra column, you'll just have to list the necessary columns explicitly in the outer SELECT.
It may not be the elegant SQL query you're looking for, but it would be trivial to do it in Java, PHP, etc, after fetching the results. To make it as simple as possible, use ORDER BY timestamp DESC and discard the first row.

How do I calculate a moving average using MySQL?

I need to do something like:
SELECT value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
Except in addition to value_column1, I also need to retrieve a moving average of the previous 20 values of value_column1.
Standard SQL is preferred, but I will use MySQL extensions if necessary.
This is just off the top of my head, and I'm on the way out the door, so it's untested. I also can't imagine that it would perform very well on any kind of large data set. I did confirm that it at least runs without an error though. :)
SELECT
value_column1,
(
SELECT
AVG(value_column1) AS moving_average
FROM
Table1 T2
WHERE
(
SELECT
COUNT(*)
FROM
Table1 T3
WHERE
date_column1 BETWEEN T2.date_column1 AND T1.date_column1
) BETWEEN 1 AND 20
)
FROM
Table1 T1
Tom H's approach will work. You can simplify it like this if you have an identity column:
SELECT T1.id, T1.value_column1, avg(T2.value_column1)
FROM table1 T1
INNER JOIN table1 T2 ON T2.Id BETWEEN T1.Id-19 AND T1.Id
I realize that this answer is about 7 years too late. I had a similar requirement and thought I'd share my solution in case it's useful to someone else.
There are some MySQL extensions for technical analysis that include a simple moving average. They're really easy to install and use: https://github.com/mysqludf/lib_mysqludf_ta#readme
Once you've installed the UDF (per instructions in the README), you can include a simple moving average in a select statement like this:
SELECT TA_SMA(value_column1, 20) AS sma_20 FROM table1 ORDER BY datetime_column1
When I had a similar problem, I ended up using temp tables for a variety of reasons, but it made this a lot easier! What I did looks very similar to what you're doing, as far as the schema goes.
Make the schema something like ID identity, start_date, end_date, value. When you select, do a subselect avg of the previous 20 based on the identity ID.
Only do this if you find yourself already using temp tables for other reasons though (I hit the same rows over and over for different metrics, so it was helpful to have the small dataset).
My solution adds a row number in table. The following example code may help:
set #MA_period=5;
select id1,tmp1.date_time,tmp1.c,avg(tmp2.c) from
(select #b:=#b+1 as id1,date_time,c from websource.EURUSD,(select #b:=0) bb order by date_time asc) tmp1,
(select #a:=#a+1 as id2,date_time,c from websource.EURUSD,(select #a:=0) aa order by date_time asc) tmp2
where id1>#MA_period and id1>=id2 and id2>(id1-#MA_period)
group by id1
order by id1 asc,id2 asc
In my experience, Mysql as of 5.5.x tends not to use indexes on dependent selects, whether a subquery or join. This can have a very significant impact on performance where the dependent select criteria change on every row.
Moving average is an example of a query which falls into this category. Execution time may increase with the square of the rows. To avoid this, chose a database engine which can perform indexed look-ups on dependent selects. I find postgres works effectively for this problem.
In mysql 8 window function frame can be used to obtain the averages.
SELECT value_column1, AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING) as ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
This calculates the average of the current row and 19 preceding rows.

Why does SQL force me to repeat all non-aggregated fields from my SELECT clause in my GROUP BY clause? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
This has bugged me for a long time.
99% of the time, the GROUP BY clause is an exact copy of the SELECT clause, minus the aggregate functions (MAX, SUM, etc.).
This breaks the Don't Repeat Yourself principle.
When can the GROUP BY clause not contain an exact copy of the SELECT clause minus the aggregate functions?
edit
I realise that some implementations allow you to have different fields in the GROUP BY than in the SELECT (hence 99%, not 100%), but surely that's a very minor exception?
Can someone explain what is supposed to be returned if you use different fields?
Thanks.
I tend to agree with you - this is one of many cases where SQL should have slightly smarter defaults to save us all some typing. For example, imagine if this were legal:
Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By *
where "*" meant "all the non-aggregate fields". If everybody knew that's how it worked, then there would be no confusion. You could sub in a specific list of fields if you wanted to do something tricky, but the splat means "all of 'em" (which in this context means, all the possible ones).
Granted, "*" means something different here than in the SELECT clause, so maybe a different character would work better:
Select ClientName, InvoiceAmount, Sum(PaymentAmount) Group By !
There are a few other areas like that where SQL just isn't as eloquent as it could be. But at this point, it's probably too entrenched to make many big changes like that.
Because they are two different things, you can group by items that aren't in the select clause
EDIT:
Also, is it safe to make that assumption?
I have a SQL statement
Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Is it "correct" for the server to assume I want to group by ClientName AND InvoiceAmount?
I personally prefer (and think it's safer) to have this code
Select ClientName, InvAmt, Sum(PayAmt) as PayTot
Group By ClientName
throw an error, prompting me to change the code to
Select ClientName, Sum(InvAmt) as InvTot, Sum(PayAmt) as PayTot
Group By ClientName
I hope/expect we'll see something more comprehensive soon; a SQL history lesson on the subject would be useful and informative. Anyone? Anyone? Bueller?
In the meantime, I can observe the following:
SQL predates the DRY principle, at least as far as it it was documented in The Pragmatic Programmer.
Not all DBs require the full list: Sybase, for example, will happily execute queries like
SELECT a, b, COUNT(*)
FROM some_table
GROUP BY a
... which (at least every time I accidentally ran such a monster) often leads to such enormous inadvertent recordsets that panic-stricken requests quickly ensue, begging the DBAs to bounce the server. The result is a sort of partial Cartesian product, but I think it may mostly be a failure on Sybase's part to implement the SQL standard properly.
Perhaps we need a shorthand form - call it GroupSelect
GroupSelect Field1, Field2, sum(Field3) From SomeTable Where (X = "3")
This way, the parser need only throw an error if you leave out an aggregate function.
The good reason for it is that you would get incorrect results more often than not if you did not specify all columns. Suppose you have three columns, col1, col2 and col3.
Suppose your data looks like this:
Col1 Col2 Col3
a b 1
a c 1
b b 2
a b 3
select col1, col2, sum(col3) from mytable group by col1, col2
would give the following results:
Col1 Col2 Col3
a b 4
a c 1
b b 2
How would it interpret
select col1, col2, sum(col3) from mytable group by col1
My guess would be
Col1 Col2 Col3
a b 5
a c 5
b b 2
These are clearly bad results. Of course the more complex the query and the more joins the less likely it would be that the query would return correct results or that the programmer would even know if they were incorrect.
Personally I'm glad that group by requires the fields.
I agree with GROUP BY ALL, GROUP BY *, or something similar. As mentioned in the original post, in 99% (perhaps more) of the cases you want to group by all non-aggregate columns/expressions.
Here is however one example where you would need GROUP BY columns, for backward compatibility reasons.
SELECT
MIN(COUNT(*)) min_same_combination_cnt,
MAX(COUNT(*)) max_same_comb_cnt,
AVG(COUNT(*)) avg_same_comb_cnt,
SUM(COUNT(*)) total_records,
COUNT(COUNT(*)) distinct_combinations_cnt
FROM <some table>
GROUP BY <list of columns>
This works in Oracle. I use it to estimate selectivity on columns. The group by is applied to the inner aggregate function. Then, the outer aggregate is applied.
It would be nice to put forward a suggestion for this improvement to the SQL standard. I just don't know how that works.
Actually, wouldn't that be 100% of the time? Is there a case in which you can have a (non-aggregate) column in the select that is not in the GROUP BY?
I don't have an answer though. It certainly does seem like a awkward moment for the language.
I share the op's view that repeating is a bit annoying, especially if the non-aggregate fields contain elaborate statements like ifs and functions and a whole lot of other things. It would be nice if there could be some shorthand in the group by clause - at least a column alias. Referring to the columns by number may be another option, albeit one that probably has their own problems.
There could be a situation that you needed to extract one id of all the rows grouped, and sum of their quantities - for example. In this case you would i.e. group them by name and leave ids not grouped. SQLite seems to work this way.
Since group by result in single tuple for a whole group of tuples so other non group by attributes must be used in aggregate function only. If u add non group by attribute in select then sql cant decide which which value to be select from that group.