Is SELECT DISTINCT ON (col) * valid? - sql

SELECT DISTINCT ON (some_col)
*
FROM my_table
I'm wondering if this is valid and will work as expected. Meaning, will this return all columns from my_table, based on distinct some_col? I've read the Postgres docs and don't see any reason why this wouldn't work as expected, but have read old comments here on SO which state that columns need to be explicitly listed when using distinct on.
I do know it's best practice to explicitly list columns, and also to use order by when doing the above.
Background that you probably don't need or care about
For background, the reason I ask is that we are migrating from MySQL to Postgres. MySQL has a very non-standards-compliant "trick" that allows SELECT * ... GROUP BY, which makes it easy to select * based on a group by. Previous answers and comments about migrating this non-standard trick to Postgres are murky at best.

SELECT DISTINCT ON (some_col) *
FROM my_table;
I'm wondering if this is valid
Yes. Typically, you want ORDER BY to go with it to determine which row to pick from each set of peers. But choosing an arbitrary row (without ORDER BY) is a valid (and sometimes useful!) application. You just need to know what you are doing. Maybe add a comment for the afterworld?
See:
Select first row in each GROUP BY group?
will this return all columns from my_table, based on distinct some_col?
It will return all columns. One arbitrary row per distinct value of some_col.
Note how I used the word "arbitrary", not "random". Returned rows are not chosen randomly at all. Just arbitrarily, depending on current implementation details. Typically the physically first row per distinct value, but that depends.
I do know it's best practice to explicitly list columns.
That really depends. Often it is. Sometimes it is not. Like when I want to get all columns to match a given row type.
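To make the "one row per distinct value" behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite has no DISTINCT ON, so the Postgres statement is shown only as a comment and the executed query is a portable emulation of it. The table contents and the id and payload columns are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, some_col TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO my_table (id, some_col, payload) VALUES (?, ?, ?)",
    [(1, "a", "first a"), (2, "a", "second a"), (3, "b", "first b"), (4, "b", "second b")],
)

# Postgres form (not executed here):
#   SELECT DISTINCT ON (some_col) * FROM my_table ORDER BY some_col, id;
# Portable emulation: keep the row with the smallest id for each some_col.
rows = conn.execute(
    """
    SELECT t.*
    FROM my_table AS t
    WHERE t.id = (SELECT MIN(m.id) FROM my_table AS m WHERE m.some_col = t.some_col)
    ORDER BY t.some_col
    """
).fetchall()
print(rows)
```

Every column comes back, and the explicit tie-breaker (smallest id) plays the role that ORDER BY plays with DISTINCT ON. Without such a tie-breaker the chosen row would be arbitrary.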

Related

What does * mean in sql?

For example, I know what SELECT * FROM example_table; means. However, I feel uncomfortable not knowing what each part of the code means.
The second part of a SQL query is the name of the column you want to retrieve for each record you are getting.
You can obviously retrieve multiple columns for each record, and (only if you want to retrieve all the columns) you can replace the list of them with *, which means "all columns".
So, in a SELECT statement, writing * is the same as listing all the columns the entity has.
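As a small sketch of that equivalence, the snippet below uses Python's built-in sqlite3 module with a made-up example_table (the column names are assumptions) and compares SELECT * with the spelled-out column list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns, chosen only for the demonstration.
conn.execute("CREATE TABLE example_table (id INTEGER, name TEXT, price REAL)")
conn.execute("INSERT INTO example_table VALUES (1, 'widget', 9.5)")

# SELECT * expands to every column of the table, in table order.
star = conn.execute("SELECT * FROM example_table")
star_columns = [d[0] for d in star.description]
star_rows = star.fetchall()

# The same query with the columns written out by hand.
listed = conn.execute("SELECT id, name, price FROM example_table")
listed_columns = [d[0] for d in listed.description]
listed_rows = listed.fetchall()

print(star_columns, star_rows)
```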
Here you can find probably the best tutorial for SQL learning.
I am answering by separating each part of the code.
SELECT == It tells the database to select (retrieve) data from the table named later in the statement.
(*) == means all columns {up to here the code means: select every column}.
FROM == It says where the data should be selected from.
example_table == This is the name of the table from which the data is selected.
The overall meaning is:
select all columns of every row from the table whose name is example_table.
thanks.
For a beginner, knowing the following concepts can be really useful:
SELECT refers to attributes that you want to have displayed in your final query result. There are different 'SELECT' statements such as 'SELECT DISTINCT' which returns only unique values (if there were duplicate values in the original query result)
FROM basically means from which table you want the data. There can be one or many tables listed under the 'FROM' statement.
WHERE means the condition you want the rows to satisfy. You can also order the result with a separate ORDER BY clause, for example 'ORDER BY col DESC' (writing ASC is optional, because ORDER BY sorts in ascending order by default).
Refer to W3schools for a better understanding.

How does SQL choose which row to show when grouping multiple rows together?

Consider the following table:
CREATE TABLE t
(
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER,
PRIMARY KEY (a, b)
)
Now if I do this:
SELECT a,b,c FROM t GROUP BY a;
I expect to get each distinct value of a only once. But since I'm asking for b and c as well, it's going to give me one row for each value of a. Therefore, if, for a single value of a, there are many rows to choose from, how can I predict which row SQL will choose? My tests show that it chooses to return the row for which b is the greatest. But what is the logic in that? How would this apply to strings or blobs or dates or anything else?
My question: How does SQL choose which row to show when grouping multiple rows together?
btw: My particular problem concerns SQLITE3, but I'm guessing this is an SQL issue not dependent of the DBMS...
That shouldn't actually work in a decent DBMS :-)
Any column not used in the group by clause should be subject to an aggregation function, such as:
select a, max(b), sum(c) from t group by a
If it doesn't complain in SQLite (and I have no immediate reason to doubt you), I'd just put it down to the way the DBMS is built. From memory, there are a few areas where it doesn't worry too much about the "purity" of the data (such as every column being able to hold multiple types, the type belonging to the data in that row/column intersection rather than to the column specification).
All the SQL engines that I know will complain about the query that you mentioned with an error message like "b and c appear in the field list but not in the group by list". You are only allowed to use b or c in an aggregate function (like MAX / MIN / COUNT / AVG whatever) or you'll be forced to add them in the GROUP BY list.
You're not quite correct in your assumption that this is RDBMS-independent. Most RDBMSs don't allow selecting fields that are not also in the GROUP BY clause. Exceptions to this (to my knowledge) are SQLite and MySQL. In general, you shouldn't do this, because values for b and c are chosen pretty arbitrarily (depending on the applied grouping algorithm). Even if this may be documented in your database, it's always better to express a query in a way that fully and unambiguously specifies the outcome.
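A minimal sketch of queries that specify the outcome fully, run through Python's built-in sqlite3 module against the table t from the question. The sample rows are made up for illustration. The first query aggregates every non-grouped column. The second returns the whole row with the greatest b for each a by joining back to a grouped subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (a INTEGER NOT NULL, b INTEGER NOT NULL, c INTEGER, PRIMARY KEY (a, b))"
)
# Hypothetical sample data.
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [(1, 1, 10), (1, 2, 20), (2, 1, 30)])

# Every column outside GROUP BY goes through an aggregate function.
aggregated = conn.execute(
    "SELECT a, MAX(b), SUM(c) FROM t GROUP BY a ORDER BY a"
).fetchall()

# Whole row with the greatest b per a, stated explicitly instead of relying
# on whatever row the engine happens to pick.
greatest_b_rows = conn.execute(
    """
    SELECT t.a, t.b, t.c
    FROM t
    JOIN (SELECT a, MAX(b) AS max_b FROM t GROUP BY a) AS m
      ON m.a = t.a AND m.max_b = t.b
    ORDER BY t.a
    """
).fetchall()

print(aggregated)
print(greatest_b_rows)
```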
It's not a matter of what the database will choose, but the order your data are going to be returned.
Your primary key is handling your sort order by default since you didn't provide one.
You can use Order By a, c if that's what you want.

SQL Query: Which one should i use? count("columnname") or count(1)

In my SQL query I just need to check whether data exists for a particular userid.
I always only want one row that will be returned when data exist.
I have two options
1. select count(columnname) from table where userid=:userid
2. select count(1) from tablename where userid=:userid
I am thinking second one is the one I should use because it may have a better response time as compared with first one.
There can be differences between count(*) and count(column). count(*) is often fastest for reasons discussed here. Basically, with count(column) the database has to check whether column is null or not in each row. With count(*) it just returns the total number of rows in the table, which it probably has on hand. The exact details may depend on the database and the version of the database.
Short answer: use count(*) or count(1). Hell, forget the count and select userid.
You should also make sure the where clause is performing well and that its using an index. Look into EXPLAIN.
I'd like to point out that this:
select count(*) from tablename where userid=:userid
has the same effect as your second solution, with the advantage that count(*) unambiguously means "count all rows".
The * in COUNT(*) will not expand into all columns - that is to say, the * in SELECT COUNT(*) is not the same as in SELECT *. So you need not worry about performance when writing COUNT(*)
The disadvantage of writing COUNT(1) is that it is less clear: what did you mean? A literal one (1) may look like a lower case L (this: l) in some fonts.
Will give different results if columnname can be NULL, otherwise identical performance.
The optimiser (SQL Server at least) realises COUNT(1) is trivial. You can also use COUNT(1/0)
It depends what you want to do.
The first one counts rows with non-null values of columnname. The second one counts ALL rows.
Which behaviour do you want? From the way your question is worded, I guess that you want the second one.
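The difference is easy to see with a nullable column. The sketch below uses Python's built-in sqlite3 module. The table contents are made up, and the table and column names simply follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (userid INTEGER, columnname TEXT)")
# Hypothetical rows: user 7 has one NULL in columnname.
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?)", [(7, "x"), (7, None), (8, "y")]
)

def scalar(sql, userid):
    # Run a one-value query with the named parameter :userid.
    return conn.execute(sql, {"userid": userid}).fetchone()[0]

count_column = scalar("SELECT COUNT(columnname) FROM tablename WHERE userid = :userid", 7)
count_one = scalar("SELECT COUNT(1) FROM tablename WHERE userid = :userid", 7)
count_star = scalar("SELECT COUNT(*) FROM tablename WHERE userid = :userid", 7)

print(count_column, count_one, count_star)
```

COUNT(columnname) skips the row where columnname is NULL, while COUNT(1) and COUNT(*) count both rows for the user.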
To count the number of records you should use the second option, or rather:
select count(*) from tablename where userid=:userid
You could also use the exists() function:
select case when exists(select * from tablename where userid=:userid) then 1 else 0 end
It might be possible for the database to do the latter more efficiently in some cases, as it can stop looking as soon as a match is found instead of comparing all records.
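A minimal sketch of the EXISTS variant wrapped in a helper, using Python's built-in sqlite3 module. The function name user_has_rows and the sample data are hypothetical. Whether EXISTS is actually faster than COUNT depends on the database and its plan, so treat this as an illustration of the query shape only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (userid INTEGER, columnname TEXT)")
# Hypothetical rows for user 7 only.
conn.executemany("INSERT INTO tablename VALUES (?, ?)", [(7, "x"), (7, None)])

def user_has_rows(userid):
    # Always returns exactly one row containing 1 or 0.
    row = conn.execute(
        "SELECT CASE WHEN EXISTS (SELECT 1 FROM tablename WHERE userid = :userid) "
        "THEN 1 ELSE 0 END",
        {"userid": userid},
    ).fetchone()
    return row[0] == 1

print(user_has_rows(7), user_has_rows(8))
```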
Hey how about Select count(userid) from tablename where userid=:userid ? That way the query looks more friendly.

SQL Server UNION - What is the default ORDER BY Behaviour

If I have a few UNION Statements as a contrived example:
SELECT * FROM xxx WHERE z = 1
UNION
SELECT * FROM xxx WHERE z = 2
UNION
SELECT * FROM xxx WHERE z = 3
What is the default order by behaviour?
The test data I'm seeing essentially does not return the data in the order that is specified above. I.e. the data is ordered, but I wanted to know what are the rules of precedence on this.
Another thing is that in this case xxx is a View. The view joins 3 different tables together to return the results I want.
There is no default order.
Without an Order By clause the order returned is undefined. That means SQL Server can bring them back in any order it likes.
EDIT:
Based on what I have seen, without an Order By, the order that the results come back in depends on the query plan. So if there is an index that it is using, the result may come back in that order but again there is no guarantee.
In regards to adding an ORDER BY clause:
This is probably elementary to most here, but I thought I'd add this.
Sometimes you don't want the results mixed, so you want the first query's results then the second and so on. To do that I just add a dummy first column and order by that. Because of possible issues with forgetting to alias a column in unions, I usually use ordinals in the order by clause, not column names.
For example:
SELECT 1, * FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, * FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, * FROM xxx WHERE z = 'ghi'
ORDER BY 1
The dummy ordinal column is also useful for times when I'm going to run two queries and I know only one is going to return any results. Then I can just check the ordinal of the returned results. This saves me from having to do multiple database calls and most empty resultset checking.
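The dummy-column pattern can be tried out with Python's built-in sqlite3 module. This is a sketch on SQLite rather than SQL Server, with a made-up table xxx (columns z and v are assumptions) and an alias src for the dummy column. A second sort key is added so the order within each branch is also defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (z TEXT, v INTEGER)")
# Hypothetical rows, deliberately inserted out of order.
conn.executemany(
    "INSERT INTO xxx VALUES (?, ?)", [("ghi", 1), ("abc", 2), ("def", 3), ("abc", 4)]
)

rows = conn.execute(
    """
    SELECT 1 AS src, * FROM xxx WHERE z = 'abc'
    UNION ALL
    SELECT 2, * FROM xxx WHERE z = 'def'
    UNION ALL
    SELECT 3, * FROM xxx WHERE z = 'ghi'
    ORDER BY 1, 3
    """
).fetchall()

# The first column tells the caller which branch produced each row.
print(rows)
```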
Just found the actual answer.
Because UNION removes duplicates it does a DISTINCT SORT. This is done before all the UNION statements are concatenated (check out the execution plan).
To stop a sort, do a UNION ALL and this will also not remove duplicates.
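The duplicate handling can be checked with a small sketch using Python's built-in sqlite3 module (SQLite, not SQL Server, and with made-up data). Both queries carry an explicit ORDER BY, since neither form guarantees an order on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (z INTEGER, v TEXT)")
# Hypothetical rows: the value 'x' appears under both z = 1 and z = 2.
conn.executemany("INSERT INTO xxx VALUES (?, ?)", [(1, "x"), (2, "x"), (2, "y")])

union_rows = conn.execute(
    "SELECT v FROM xxx WHERE z = 1 UNION SELECT v FROM xxx WHERE z = 2 ORDER BY 1"
).fetchall()

union_all_rows = conn.execute(
    "SELECT v FROM xxx WHERE z = 1 UNION ALL SELECT v FROM xxx WHERE z = 2 ORDER BY 1"
).fetchall()

print(union_rows)      # duplicates removed
print(union_all_rows)  # duplicates kept
```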
If you care what order the records are returned, you MUST use an order by.
If you leave it out, it may appear organized (based on the indexes chosen by the query plan), but the results you see today may NOT be the results you expect, and it could even change when the same query is run tomorrow.
Edit: Some good, specific examples: (all examples are MS SQL server)
Dave Pinal's blog describes how two very similar queries can show a different apparent order, because different indexes are used:
SELECT ContactID FROM Person.Contact
SELECT * FROM Person.Contact
Conor Cunningham shows how the apparent order can change when the table gets larger (if the query optimizer decides to use a parallel execution plan).
Hugo Kornelis proves that the apparent order is not always based on primary key. Here is his follow-up post with explanation.
A UNION can be deceptive with respect to result set ordering because a database will sometimes use a sort method to provide the DISTINCT that is implicit in UNION , which makes it look like the rows are deliberately ordered -- this doesn't apply to UNION ALL for which there is no implicit distinct, of course.
However there are algorithms for the implicit distinct, such as Oracle's hash method in 10g+, for which no ordering will be applied.
As DJ says, always use an ORDER BY
It's very common to come across poorly written code that assumes table data is returned in insert order, and 95% of the time the coder gets away with it and is never aware that this is a problem, because on many common databases (MSSQL, Oracle, MySQL) it often happens to work. It is of course a complete fallacy and should always be corrected when you come across it; always, without exception, use an ORDER BY clause yourself.

Is there a difference between Select * and Select [list each col] [duplicate]

I'm using MS SQL Server 2005. Is there a difference, to the SQL engine, between
SELECT * FROM MyTable;
and
SELECT ColA, ColB, ColC FROM MyTable;
When ColA, ColB, and ColC represent every column in the table?
If they are the same, is there a reason why you should use the 2nd one anyway? I have a project that's heavy on LINQ, and I'm not sure if the standard SELECT * it generates is a bad practice, or if I should always be calling .Select() on it to specify which cols I want.
EDIT: Changed "When ColA, ColB, and ColC are all the columns to the table?" to "When ColA, ColB, and ColC represent every column in the table?" for clarity.
Generally, it's better to be explicit, so Select col1, col2 from Table is better. The reason being that at some point, an extra column may be added to that table, and would cause unneeded data to be brought back from the query.
This isn't a hard and fast rule though.
1) The second one is more explicit about which columns are returned. The value of the 2nd one then is how much you value explicitly knowing which columns come back.
2) This involves potentially less data being returned when there are more columns than the ones explicitly used as well.
3) If you change the table by adding a new column, the first query changes and the second does not. If you have code like "for all columns returned do ..." then the results change if you use the first, but not the 2nd.
I'm going to get a lot of people upset with me, but especially if I'm adding columns later on, I usually like to use SELECT * FROM table. I've been called lazy for this, because if I make any modifications to my tables, I'd like not to track down all the stored procs that use that table, and just change it in the data access layer classes in my application. There are cases in which I will specify the columns, but in the case where I'm trying to get a complete "object" from the database, I'd rather just use the "*". And, yes, I know people will be hating me for this, but it has allowed me to be quicker and less bug-prone while adding fields to my applications.
The two sides of the issue are this: Explicit column specification gives better performance as new columns are added, but * specification requires no maintenance as new columns are added.
Which to use depends on what kind of columns you expect to add to the table, and what the point of the query is.
If you are using your table as a backing store for an object (which seems likely in the LINQ-to-SQL case), you probably want any new columns added to this table to be included in your object, and vice-versa. You're maintaining them in parallel. For this reason, for this case, * specification in the SELECT clause is right. Explicit specification would give you an extra bit of maintenance every time something changed, and a bug if you didn't update the field list correctly.
If the query is going to return a lot of records, you are probably better off with explicit specification for performance reasons.
If both things are true, consider having two different queries.
You should specify an explicit column list. SELECT * will bring back more columns than you need creating more IO and network traffic, but more importantly it might require extra lookups even though a non-clustered covering index exists (On SQL Server).
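The covering-index point can be illustrated with a sketch in Python's built-in sqlite3 module. Note the assumptions: this uses SQLite as a stand-in, not SQL Server, the index name idx_cola is made up, and the plan text is SQLite's own wording, so only the general effect carries over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ColA INTEGER, ColB TEXT, ColC TEXT)")
# Hypothetical index that contains only ColA.
conn.execute("CREATE INDEX idx_cola ON MyTable (ColA)")

def plan(sql):
    # The last field of each EXPLAIN QUERY PLAN row is a readable detail string.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Only ColA is needed, so the index alone can answer the query.
narrow_plan = plan("SELECT ColA FROM MyTable WHERE ColA = 1")
# SELECT * needs ColB and ColC too, which forces a lookup into the table.
star_plan = plan("SELECT * FROM MyTable WHERE ColA = 1")

print(narrow_plan)
print(star_plan)
```

In SQLite the first plan reports a covering index and the second does not, which is the same kind of extra lookup the answer describes for SQL Server.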
Some reasons not to use the first statement (select *) are:
If you add some large fields (a BLOB column would be very bad) later to that table, you could suffer performance problems in the application
If the query was a JOIN query with two or more tables, some of the fields could have the same name. It would be better to assure that your field names are different.
The purpose of the query is clearer with the second statement from a programming aesthetics viewpoint
When you select each field individually, it is more clear which fields are actually being selected.
SELECT * is a bad practice in most places.
What if someone adds a 2gb BLOB column to that table?
What if someone adds really any column to that table?
It's a bug waiting to happen.
A couple things:
A good number of people have posted here recommending against using *, and given several good reasons for those answers. Out of 10 other responses so far only one doesn't recommend listing columns.
People often make exceptions to that rule when posting to help sites like StackOverflow, because they often don't know what columns are in your table or are important to your query. For that reason, you'll see a lot of code here and elsewhere on the web that uses the * syntax, even though the poster would tend to avoid it in his own code.
It's good for forward compatibility.
When you use
SELECT * FROM myTable
and "myTable" has 3 columns, you get the same results as
SELECT Column1, Column2, Column3 FROM myTable
But if you add a new column in the future, you get different results.
Of course, if you change the name of one of the existing columns, in the first case you still get results and in the second case you get an error (I think this is the correct behaviour for an application).
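A small sketch of the "new column" case using Python's built-in sqlite3 module. The added column name Column4 and the sample row are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (Column1 INTEGER, Column2 INTEGER, Column3 INTEGER)")
conn.execute("INSERT INTO myTable VALUES (1, 2, 3)")

before_star = conn.execute("SELECT * FROM myTable").fetchall()

# Later schema change: a hypothetical fourth column is added.
conn.execute("ALTER TABLE myTable ADD COLUMN Column4 INTEGER")

after_star = conn.execute("SELECT * FROM myTable").fetchall()
after_listed = conn.execute("SELECT Column1, Column2, Column3 FROM myTable").fetchall()

print(before_star)   # three values per row
print(after_star)    # four values per row, the new one is NULL (None)
print(after_listed)  # still three values per row
```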
If your code relies on certain columns being in a certain order, you need to list the columns. If not, it doesn't really make a difference if you use "*" or write the column names out in the select statement.
An example is if you insert a column into a table.
Take this table:
ColA ColB ColC
You might have a query:
SELECT *
FROM myTable
Then the code might be:
rs = executeSql("SELECT * FROM myTable")
while (rs.read())
    Print "Col A" + rs[0]
    Print "Col B" + rs[1]
    Print "Col C" + rs[2]
If you add a column between ColB and ColC, the query wouldn't return what you're looking for.
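The same effect can be reproduced with Python's built-in sqlite3 module. This sketch builds the table twice, once as ColA, ColB, ColC and once with a hypothetical ColX placed before ColC, and compares positional access with access by column name. The helper name and sample values are made up.

```python
import sqlite3

def third_value_and_colc(create_sql):
    # Build a fresh in-memory table from the given schema and read one row back.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows row["ColC"] as well as row[2]
    conn.execute(create_sql)
    conn.execute("INSERT INTO myTable (ColA, ColB, ColC) VALUES ('a', 'b', 'c')")
    row = conn.execute("SELECT * FROM myTable").fetchone()
    return row[2], row["ColC"]

original = third_value_and_colc(
    "CREATE TABLE myTable (ColA TEXT, ColB TEXT, ColC TEXT)"
)
changed = third_value_and_colc(
    "CREATE TABLE myTable (ColA TEXT, ColB TEXT, ColX TEXT, ColC TEXT)"
)

print(original)  # position 2 and ColC agree
print(changed)   # position 2 is now ColX, ColC is still found by name
```

Reading by name keeps working after the schema change, while reading by position silently returns a different column.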
For LinqToSql, if you plan to modify those records later, you should pull the whole record into memory.
It depends on what you mean by "difference". There is the obvious syntax difference, but the real difference is one of performance.
When you say SELECT * FROM MyTable, you are telling the SQL query engine to return a data set with all of the columns from that table, while SELECT ColA, ColB, ColC FROM MyTable tells the query engine to return a data set with only ColA, ColB, and ColC from the table.
Say you have a table with 100 columns defined as CHAR(10). SELECT * will return 100 columns * 10 bytes worth of data per row, while SELECT ColA, ColB, ColC will return 3 columns * 10 bytes worth of data per row. This is a huge difference in the amount of data that is being passed back across the wire.
Specifying the column list also makes it much clearer what columns you are interested in. The drawback is that if you add/remove a column from the table you need to ensure that the column list is updated as well, but I think that's a small price compared to the performance gain.
SELECT * FROM MyTable
select * is dependent on the column order in the schema so if you refer to the result set by the index # of the collection you will be looking at the wrong column.
SELECT Col1,Col2,Col3 FROM MyTable
this query will give you a collection that stays the same over time, but how often are you changing the column order anyways?
A quick look at the query execution plan shows that the queries are the same.
The general rule of thumb is that you will want to limit your queries to only the fields that you need returned.
Selecting each column is better than just * because if a column is added or deleted you HAVE to look at the code and check what you were doing with the retrieved data.
Also, it helps you understand your code better and allows you to use aliases as column names (in case you're performing a join of tables with a column sharing the name)
An example as to why you never (imho) should use SELECT *. This does not relate to MSSQL, but rather MySQL. Versions prior to 5.0.12 returned columns from certain types of joins in a non-standard manner. Of course, if your queries define which columns you want and in which order, you have no problem. Imagine the fun if they don't.
(One possible exception: Your query SELECTs from just one table and you identify columns in your programming language of choice by name rather than position.)
Using "SELECT *" optimizes for programmer typing. That's it. That's the only advantage.