How does the SQL SELECT statement work?

I created a table with id as primary key, firstname, lastname, email as fields. I issued the following query:
"SELECT email,COUNT(email) FROM mytable;"
The result had the first value for email, i.e., the one from the first record, and the total number of values in the 'email' column. Why did it not display all the different email values from the different records?

https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html says:
Without GROUP BY, there is a single group and it is nondeterministic which name value to choose for the group.
Like with any grouping query, COUNT(email) returns the count of rows in the group that have a non-NULL email, and the result is reduced to one row per group.
But since you don't have any GROUP BY clause, the whole table is treated as one big group, and the result only returns one row.
The email column returns one of the email values in the group. Technically this value is chosen arbitrarily from one of the rows in the group.
In practice, MySQL's implementation chooses the value from the first row read from the group, the order of which depends on which index was used to scan the table. However, this behavior of picking the first row is not documented, and there is no guarantee that it will stay the same from version to version.
It's better to avoid depending on queries that return an arbitrary, implementation-dependent value. It makes your code harder to maintain, because there's a risk it will change if you upgrade MySQL and they change their undocumented behavior.
To protect you from making these sorts of arbitrary queries, newer versions of MySQL (5.7.5 and later) enable an SQL mode called ONLY_FULL_GROUP_BY by default. That mode makes it an error to run a query like the one you show. For what it's worth, the query is already an error in most other brands of SQL database.
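If the goal was one row per distinct email together with its count, adding GROUP BY email makes the query well defined. Here is a minimal sketch of that, using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows are made up for illustration, and SQLite does not enforce ONLY_FULL_GROUP_BY, so this only shows the deterministic forms of the query.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    "id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (firstname, lastname, email) VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "ann@example.com"),
        ("Bob", "Ray", "bob@example.com"),
        ("Ann", "Lee", "ann@example.com"),
    ],
)

# One row per distinct email, each with its own count.
per_email = conn.execute(
    "SELECT email, COUNT(email) FROM mytable GROUP BY email ORDER BY email"
).fetchall()

# A pure aggregate with no bare column: a single, well-defined row.
total = conn.execute("SELECT COUNT(email) FROM mytable").fetchone()[0]

print(per_email)  # [('ann@example.com', 2), ('bob@example.com', 1)]
print(total)      # 3
```

Either form avoids asking the database to pick an arbitrary email value for the whole table.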

Related

Removing rows with duplicated column values based on another column's value

Hey guys, maybe this is a basic SQL question. Say I have this very simple table, and I need to run a simple SQL statement to return a result like this:
Basically, the goal is to dedup Name based on its row's Value column: whichever row has the larger Value should stay.
Thanks!
Framing the problem correctly would help you figure it out.
"Deduplication" suggests altering the table: starting with a state with duplicates and ending with a state without them. That is usually done in three steps (copying the rows without duplicates into a temp table, removing the original table, renaming the temp table).
"Removing rows with duplicated column values" also suggests altering the data and derails the train of thought.
What you do want is to get the entire table, and in cases where the columns you care about have multiple values attached get the highest one. One could say... group by columns you care about? And attach them to the highest value, a maximum value?
select id, name, max(value) from mytable group by id, name
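A small runnable sketch of that idea, using Python's built-in sqlite3 module. The question's actual table is not shown, so the table name `items`, its two columns, and the rows are all assumptions made for illustration; here the grouping is on name only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and data; the question's actual table is not shown.
conn.execute("CREATE TABLE items (name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO items (name, value) VALUES (?, ?)",
    [("a", 1), ("a", 5), ("b", 3), ("c", 2), ("c", 2)],
)

# One row per name, carrying the largest value seen for that name.
largest = conn.execute(
    "SELECT name, MAX(value) FROM items GROUP BY name ORDER BY name"
).fetchall()
print(largest)  # [('a', 5), ('b', 3), ('c', 2)]
```

Note that the table itself is left untouched; only the result set is collapsed.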

When should I use distinct in a query?

This is a general question:
Does anyone have a tip as to how I can know when I should use distinct in my queries? I am struggling to understand when to use it exactly. I tend to use it when I don't need it and not when I do.
Thank you all very much.
Basically, there is little reason to use select distinct, although it is sometimes convenient shorthand.
If it can be avoided, avoid it! The database incurs overhead for removing duplicates, even if there are no duplicates. So, select distinct is generally slower than select.
Often select distinct is more appropriately written using group by -- because often you want some column to be aggregated (such as the maximum date/time).
That said, it can be convenient shorthand, so it should not be avoided altogether, just used rarely.
There is no general rule for when to use DISTINCT; it depends on your requirement. For example, when a column contains the same value more than once but you need each value only once, use DISTINCT.
Suppose you have a list of banks and branches in a city, and you need to know which unique banks operate in the city (the number of rows returned tells you how many). Then you would write:
select distinct bank_name from city;
I use distinct when I want to ensure rows are not duplicated in a query that could have duplicate records for the field combination I am selecting. Generally, this would be when selecting a set of columns that do not include a primary/unique key and are not guaranteed to be unique when the selected fields are taken together.
For example, suppose I were selecting customers who had purchased this year in order to send them a letter. Customers can have more than one order in a single year, and I want to send only one letter per person and address, so I would use Distinct to get one occurrence of each unique customer name / address combination.
--could return multiple records for repeat customers if Distinct was not present
Select Distinct BillingName, BillingAddress
from Orders
where OrderDate > '2019-08-01'
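To see the difference, here is a rough sketch that runs the query with and without DISTINCT, using Python's built-in sqlite3 module. Only the column names come from the query above; the rows are invented, and the dates are stored as ISO-formatted text so that the string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Orders table; only the column names come from the answer.
conn.execute(
    "CREATE TABLE Orders (BillingName TEXT, BillingAddress TEXT, OrderDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        ("Ann Lee", "1 Main St", "2019-08-15"),
        ("Ann Lee", "1 Main St", "2019-09-02"),  # repeat customer
        ("Bob Ray", "9 Oak Ave", "2019-10-20"),
        ("Cy Dow", "4 Elm Rd", "2019-07-30"),    # before the cutoff
    ],
)

# Same query twice; the placeholder is filled with "" or "DISTINCT".
query = (
    "SELECT {} BillingName, BillingAddress FROM Orders "
    "WHERE OrderDate > '2019-08-01' ORDER BY BillingName"
)
with_dupes = conn.execute(query.format("")).fetchall()
deduped = conn.execute(query.format("DISTINCT")).fetchall()

print(len(with_dupes))  # 3
print(deduped)          # [('Ann Lee', '1 Main St'), ('Bob Ray', '9 Oak Ave')]
```

Without DISTINCT the repeat customer shows up once per order; with it, once per name / address pair.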

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily by:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select and the IDs don't always follow a pattern (which means I can't use something like BETWEEN x AND y)? Does this select statement still work, or are there better ways of doing so?
The actual application is this. Filters are specified by users and are compared to some attributes of the records. From those filters, we create a subset of the data which is of interest to a particular user. There are about 30 million records, each with roughly 3000 attributes (stored in roughly 30 tables, but every table has ID as a primary key). So every time someone makes a query about their desired subset of records, we'd have to join many tables, apply those filters, and figure out what their subset looks like. In order to avoid joining many tables all the time, I thought maybe it's a better idea to join the tables once and figure out the ids of the selected subset. This way, each time a new query is made, all we have to do is select the relevant columns of the rows that match the filtered ids.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions. And that could start to get into limits on the length of the query -- either specified by the database, the tool you are using, or intermediate libraries.
If you have so many ids, I would strongly recommend that you load them into a table in the database with the id as the primary key. Then use join or exists to identify the rows in your table that match.
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.
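A minimal sketch of the "load the ids into a table and join" approach, using Python's built-in sqlite3 module. The temp table name `wanted` and the sample ids are assumptions for illustration; the exact temp-table syntax and bulk-loading mechanism will differ from database to database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(i, f"a{i}", f"b{i}", f"c{i}") for i in range(1, 1001)],
)

# Hypothetical list of wanted ids; in practice this could hold millions.
wanted_ids = [3, 17, 256, 999]

# Load the ids into their own table, keyed on id, instead of building a huge IN list.
conn.execute("CREATE TEMP TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted (id) VALUES (?)", [(i,) for i in wanted_ids])

# Join against the id table to pick out the matching rows.
rows = conn.execute(
    "SELECT m.col1, m.col2, m.col3 "
    "FROM mytable AS m JOIN wanted AS w ON w.id = m.id "
    "ORDER BY m.id"
).fetchall()
print(rows[0])    # ('a3', 'b3', 'c3')
print(len(rows))  # 4
```

The same selection could be written with `WHERE EXISTS (SELECT 1 FROM wanted AS w WHERE w.id = m.id)`; which form performs better depends on the database.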

View query performance scenarios

Let's say I have a view named AllUsers which returns a result set of data for all users in a system. Let's say that the underlying query is somewhat complex.
Let's say that I need the ability to get a user by ID. All of the attributes should be returned for the single user as is returned for the full user set.
I'm assuming that SQL Server is smart enough to optimize performance when I apply a where clause to a view so that it optimizes the query as a single record query as opposed to selecting all records and then filtering the records based on the single ID provided in the where clause. Can you please confirm this?
Add a Where clause at the end of the Select statement. For example:
Select id, name, department
from AllUsers
Where id = 1
Whether this is optimized depends on whether the table behind AllUsers has an index that includes the id column.
If the id column is not part of an index, SQL Server will scan the whole table looking for your record.
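One way to check rather than assume is to ask the engine for its plan. The sketch below does this with Python's built-in sqlite3 module, so it shows SQLite's behavior, not SQL Server's; on SQL Server you would inspect the execution plan instead. The base table `users` and its rows are assumptions, since the real definition of AllUsers is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical base table behind the view; the real AllUsers definition is not shown.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ann", "Sales"), (2, "Bob", "IT"), (3, "Cy", "IT")],
)
conn.execute("CREATE VIEW AllUsers AS SELECT id, name, department FROM users")

# Filter the view exactly as you would filter a table.
user = conn.execute(
    "SELECT id, name, department FROM AllUsers WHERE id = ?", (1,)
).fetchone()
print(user)  # (1, 'Ann', 'Sales')

# Ask the engine how it plans to run the filtered view query.
# The plan text is engine- and version-specific, so it is only printed here.
for step in conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, name, department FROM AllUsers WHERE id = 1"
):
    print(step)
```

If the plan shows a keyed lookup rather than a full scan, the filter is being applied inside the view's query instead of after it.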

How are these tasks done in SQL?

I have a table, and there is no column which stores a field of when the record/row was added. How can I get the latest entry into this table? There would be two cases in this:
Loop through the entire table and get the largest ID, if a numeric ID is being used as the identifier. But this would be very inefficient for a large table.
If a random string is being used as the identifier (which is probably very, very bad practice), then this would require more thinking (I personally have no idea other than my first point above).
If I have one field in each row of my table which is numeric, and I want to add it up to get a total (so row 1 has a field which is 3, row 2 has a field which is 7, I want to add all these up and return the total), how would this be done?
Thanks
1) If the id is incremental, "select max(id) as latest from mytable". If a random string was used, there should still be an incremental numeric primary key in addition. Add it. There is no reason not to have one, and databases are optimized to use such a primary key for relations.
2) "select sum(mynumfield) as total from mytable"
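Both queries can be tried out in a few lines with Python's built-in sqlite3 module. This is only a sketch: the table layout is assumed, and the values 3 and 7 are the ones from the question plus one extra row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: an auto-incrementing id plus the numeric field to add up.
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY AUTOINCREMENT, mynumfield INTEGER)"
)
conn.executemany("INSERT INTO mytable (mynumfield) VALUES (?)", [(3,), (7,), (10,)])

# Largest id, which is the latest row only if ids are assigned incrementally.
latest = conn.execute("SELECT MAX(id) AS latest FROM mytable").fetchone()[0]
# Total of the numeric column across all rows.
total = conn.execute("SELECT SUM(mynumfield) AS total FROM mytable").fetchone()[0]
print(latest)  # 3
print(total)   # 20
```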
For the last thing, use SUM():
SELECT SUM(OrderPrice) AS OrderTotal FROM Orders
This assumes they are all in the same column.
Your first question is a bit unclear, but if you want to know when a row was inserted (or updated), then the only way is to record the time when the insert/update occurs. Typically, you use a DEFAULT constraint for inserts and a trigger for updates.
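As a rough illustration of that pattern, here is a sketch using Python's built-in sqlite3 module. The syntax shown is SQLite's, not SQL Server's, and the table `notes`, the trigger `notes_touch`, and the columns `created_at` and `updated_at` are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; created_at and updated_at are illustrative column names.
conn.execute(
    "CREATE TABLE notes ("
    "id INTEGER PRIMARY KEY, body TEXT, "
    "created_at TEXT DEFAULT CURRENT_TIMESTAMP, "
    "updated_at TEXT)"
)
# Stamp updated_at whenever a row's body changes.
conn.execute(
    "CREATE TRIGGER notes_touch AFTER UPDATE OF body ON notes "
    "BEGIN "
    "UPDATE notes SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.id; "
    "END"
)

conn.execute("INSERT INTO notes (body) VALUES ('first draft')")
conn.execute("UPDATE notes SET body = 'second draft' WHERE id = 1")

created_at, updated_at = conn.execute(
    "SELECT created_at, updated_at FROM notes WHERE id = 1"
).fetchone()
print(created_at is not None, updated_at is not None)  # True True
```

With such columns in place, the latest entry can be found by ordering on the timestamp instead of guessing from the identifier.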
If you want to know the maximum value (which may not necessarily be the last inserted row) then use MAX, as others have said:
SELECT MAX(SomeColumn) FROM dbo.SomeTable
If the column is indexed, MSSQL does not need to read the whole table to answer this query.
For the second question, just do this:
SELECT SUM(SomeColumn) FROM dbo.SomeTable
You might want to look into some SQL books and tutorials to pick up the basic syntax.