When should I use distinct in a query? - sql

This is a general question:
Does anyone have a tip on how I can know when I should use DISTINCT in my queries? I am struggling to understand exactly when to use it. I tend to use it when I don't need it and not when I do.
Thank you all very much.

Basically, there is little reason to use select distinct, although it is sometimes a convenient shorthand.
If it can be avoided, avoid it! The database incurs overhead for removing duplicates, even if there are no duplicates. So, select distinct is generally slower than a plain select.
Often select distinct is more appropriately written using group by, because often you want some column to be aggregated (such as the maximum date/time).
That said, it can be convenient shorthand, so it should not be avoided altogether, just used rarely.
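To make the DISTINCT versus GROUP BY point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The logins table and its columns are made up for illustration. The first query only removes duplicates; the second also returns the latest date per user, which DISTINCT alone cannot do.

```python
import sqlite3

# Hypothetical table: one row per login, so user names repeat.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_name TEXT, login_at TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [("ann", "2019-01-05"), ("ann", "2019-03-10"), ("bob", "2019-02-01")],
)

# DISTINCT: one row per user name, nothing else.
distinct_rows = conn.execute(
    "SELECT DISTINCT user_name FROM logins ORDER BY user_name"
).fetchall()

# GROUP BY: one row per user name plus an aggregated column.
grouped_rows = conn.execute(
    "SELECT user_name, MAX(login_at) FROM logins "
    "GROUP BY user_name ORDER BY user_name"
).fetchall()

print(distinct_rows)
print(grouped_rows)
```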

There is no general rule as to when to use DISTINCT; it depends on your requirement, i.e., when a column contains the same value more than once but you need each value only once, you use distinct.
Suppose you have a list of banks and branches in a city, and you need to know which unique banks are operating in the city (and so how many there are). Then you would write
select distinct bank_name from city;
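A small sketch of the bank example, run through Python's sqlite3 module. The sample rows and the branch column are assumptions. If only the number of banks is needed, COUNT(DISTINCT ...) gives it directly.

```python
import sqlite3

# Hypothetical data: one row per branch, so bank names repeat.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (bank_name TEXT, branch TEXT)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Alpha Bank", "North"), ("Alpha Bank", "South"), ("Beta Bank", "Central")],
)

# The list of unique banks.
bank_names = conn.execute(
    "SELECT DISTINCT bank_name FROM city ORDER BY bank_name"
).fetchall()

# The number of unique banks, without fetching the list.
bank_count = conn.execute(
    "SELECT COUNT(DISTINCT bank_name) FROM city"
).fetchone()[0]

print(bank_names, bank_count)
```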

I use distinct when I want to ensure rows are not duplicated in a query that could have duplicate records for the field combination I am selecting. Generally, this would be when selecting a set of columns that do not include a primary/unique key and are not guaranteed to be unique when the selected fields are taken together.
For example, suppose I am selecting customers who purchased this year in order to send them a letter. Customers can have more than one order in a single year, and I want to send only one letter per person and address, so I would use Distinct to get one occurrence of each unique customer name / address combination.
--could return multiple records for repeat customers if Distinct was not present
Select Distinct BillingName, BillingAddress
from Orders
where OrderDate > '2019-08-01'
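Here is a runnable sketch of that query against made-up rows (the sample data is an assumption, and dates are stored as ISO text so the string comparison works). It shows that Distinct applies to the combination of selected columns: the same name at two addresses still yields two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (BillingName TEXT, BillingAddress TEXT, OrderDate TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        ("Ann Lee", "1 Main St", "2019-08-15"),
        ("Ann Lee", "1 Main St", "2019-09-02"),  # repeat customer, same address
        ("Ann Lee", "9 Oak Ave", "2019-10-01"),  # same name, different address
        ("Bob Ray", "2 Elm St", "2019-07-01"),   # filtered out by the date
    ],
)

where = "FROM Orders WHERE OrderDate > '2019-08-01'"

# Without Distinct: one row per matching order.
all_rows = conn.execute(
    f"SELECT BillingName, BillingAddress {where}"
).fetchall()

# With Distinct: one row per name/address combination.
letter_rows = conn.execute(
    f"SELECT DISTINCT BillingName, BillingAddress {where} "
    "ORDER BY BillingName, BillingAddress"
).fetchall()

print(len(all_rows), letter_rows)
```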

Related

How does SQL SELECT statement work?

I created a table with id as primary key, firstname, lastname, email as fields. I issued the following query:
"SELECT email,COUNT(email) FROM mytable;"
The result had the first value for email, i.e., from the first record, and the total number of values in the column 'email'. Why did it not display all the different values for email from the different records?
https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html says:
Without GROUP BY, there is a single group and it is nondeterministic which name value to choose for the group.
Like with any grouping query, the COUNT(*) returns the count of rows in the group, and reduces the result to one row per group.
But since you don't have any GROUP BY clause, the whole table is treated as one big group, and the result only returns one row.
The email column returns one of the email values in the group. Technically this value is chosen arbitrarily from one of the rows in the group.
In practice, MySQL's implementation chooses the value from the first row read from the group, the order of which depends on which index was used to scan the table. However, this behavior of picking the first row is not documented, and there is no guarantee that it will stay the same from version to version.
It's better to avoid depending on queries that return an arbitrary, implementation-dependent value. It makes your code harder to maintain, because there's a risk it will change if you upgrade MySQL and they change their undocumented behavior.
To protect you from making these sorts of arbitrary queries, newer versions of MySQL enforce an SQL mode called ONLY_FULL_GROUP_BY. That option makes it an error to run a query like the one you show. For what it's worth, the query is already an error in most other brands of SQL database.
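For illustration, here is a sketch of the two well-defined versions of that query, run with Python's sqlite3 module rather than MySQL (the sample rows are made up). One counts over the whole table as a single group; the other adds GROUP BY email to get one row per distinct email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    "id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO mytable (firstname, lastname, email) VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "a@example.com"),
        ("Anna", "Lee", "a@example.com"),
        ("Bob", "Ray", "b@example.com"),
    ],
)

# No GROUP BY: the whole table is one group, so exactly one row comes back.
total = conn.execute("SELECT COUNT(email) FROM mytable").fetchone()[0]

# GROUP BY email: one row per distinct email, each with its own count.
per_email = conn.execute(
    "SELECT email, COUNT(*) FROM mytable GROUP BY email ORDER BY email"
).fetchall()

print(total, per_email)
```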

SQL or statement vs multiple select queries

I have a table with an id and a name.
I'm getting a list of ids and I need their names.
As far as I know, I have two options.
Create a for loop in my code which executes:
SELECT name from table where id=x
where x is always a number.
Or write a single query like this:
SELECT name from table where id=1 OR id=2 OR id=3
The list of ids and names is enormous, so I think you wouldn't want that.
The problem with the ids is that an id is not always a number but can be a randomly generated id containing numbers and characters. So talking about ranges is not a solution.
I'm asking this from a performance point of view.
What's a nice solution for this problem?
SQLite has limits on the size of a query, so if there is no known upper limit on the number of IDs, you cannot use a single query.
When you are reading multiple rows (note: IN (1, 2, 3) is easier than many ORs), you don't know to which ID a name belongs unless you also SELECT that, or sort the results by the ID.
There should be no noticeable difference in performance; SQLite is an embedded database without client/server communication overhead, and the query does not need to be parsed again if you use a prepared statement.
A "nice" solution is using the IN operator:
SELECT name from table where id in (1,2,3)
Also, the IN operator is syntactic sugar built for exactly this purpose.
SELECT name from table where id IN (1,2,3,4,5,6.....)
Assuming the list of IDs you need names for arrives as an input temp table #InputIDTable, you can write:
SELECT name from table WHERE ID IN (SELECT id from #InputIDTable)
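As a sketch of the IN approach from application code, the following uses Python's sqlite3 module with one bound placeholder per id. The items table and its contents are hypothetical. Selecting the id alongside the name lets each name be matched to its id, as noted above. SQLite caps the number of bound parameters per statement, so a very long list would need to be split into chunks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; ids are text because they are not always numeric.
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("a1", "apple"), ("b2", "banana"), ("c3", "cherry")],
)

wanted_ids = ["a1", "c3"]

# Build "?, ?" with one placeholder per id; the values are bound, not
# interpolated, so odd characters in an id cannot break the query.
placeholders = ", ".join("?" for _ in wanted_ids)
query = f"SELECT id, name FROM items WHERE id IN ({placeholders})"

# Each result row is an (id, name) pair, so dict() maps id -> name.
names_by_id = dict(conn.execute(query, wanted_ids))

print(names_by_id)
```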

Joining multiple Tables in Oracle gives out duplicated records

I am a newbie to SQL. I have three tables: mr1, mr2, mr3. Caseid is the primary key in all these tables. I need to join the columns of all these tables and display the result.
The problem is that I don't know which join to use.
When I joined them all as in the query below:
select mr1.col1,mr1.col2,mr2.col1,mr2.col2,mr3.col1,mr3.col2
from mr1,mr2,mr3
where mr1.caseid = mr2.caseid
and mr2.caseid = mr3.caseid;
it displays 4 records, even though the maximum number of records is two, which is in table mr2.
The records are duplicated. Can anyone help me in this regard?
Distinct will do it, but it's not the correct approach.
You need to add another join (mr1.caseid = mr3.caseid) because mr2 and mr3 rows must be related to the same row in mr1; otherwise you end up with 2 pairs, one for each table joined to your primary table (mr2).
First answer on SO, so forgive me if it wasn't that clear.
Your problem is that your tables are in a one-to-many relationship. When you join them, it is expected that the number of rows will go up unless you take steps to limit the records returned. How to fix it depends on the meaning of the data.
If all the fields are exactly the same, then adding DISTINCT will fix the problem. However, it may be faster, depending on the size of the tables and the number of records you are returning, to use a derived table to limit the records in the join to only one from the table with multiple records.
If at least one of the fields is different, however, then you need to know the business rule that will allow you to pick the correct record. That might be accomplished by adding a where clause, by using an aggregate function and group by, or even both. This really depends on the meaning of the result set, which is why you need to ask further questions in your own organization: they are the only ones who know which of the multiple records is the correct one to pick from the perspective of the people who will be using the results of the query. Further, the business might actually want to see all of the records, in which case you have no problem at all.
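The row multiplication can be reproduced in a small sketch with Python's sqlite3 module. It assumes caseid is not actually unique in mr2 and mr3 (two rows each for the same case), which is what 4 result rows would suggest. The derived table keeps one mr3 row per caseid using MAX, a stand-in for whatever business rule really applies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mr1 (caseid INTEGER, col1 TEXT);
CREATE TABLE mr2 (caseid INTEGER, col1 TEXT);
CREATE TABLE mr3 (caseid INTEGER, col1 TEXT);
INSERT INTO mr1 VALUES (1, 'a');
INSERT INTO mr2 VALUES (1, 'x'), (1, 'y');
INSERT INTO mr3 VALUES (1, 'p'), (1, 'q');
""")

# Plain join: every mr2 row pairs with every mr3 row of the same case (2 x 2).
fan_out = conn.execute("""
SELECT mr1.col1, mr2.col1, mr3.col1
FROM mr1
JOIN mr2 ON mr2.caseid = mr1.caseid
JOIN mr3 ON mr3.caseid = mr1.caseid
""").fetchall()

# Derived table: reduce mr3 to one row per caseid before joining.
limited = conn.execute("""
SELECT mr1.col1, mr2.col1, m3.col1
FROM mr1
JOIN mr2 ON mr2.caseid = mr1.caseid
JOIN (SELECT caseid, MAX(col1) AS col1 FROM mr3 GROUP BY caseid) AS m3
  ON m3.caseid = mr1.caseid
ORDER BY mr2.col1
""").fetchall()

print(len(fan_out), limited)
```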

SQL Select If Then

The table contains numerous rows for the same person and all columns contain the same data. For instance:
FullName     Gender   DOB
---------------------------------
Mary Jones   Female   2012-05-01
I would like to select one row for each individual to appear in a report. I thought the easiest way might be to use an if statement to check if the next row is the same as the first unique row.
As others have said, since all the rows are the same you can use the distinct keyword:
SELECT DISTINCT <columns>
FROM table
However, you're likely to find that the reason for these duplicate records is that at least one column somewhere in the table is different. In that case, you may need to use GROUP BY instead:
SELECT <columns>
FROM table
GROUP BY <columns>
If you need to show data that is not part of the grouped columns, you only list the columns that match in the GROUP BY clause and then you have to use an aggregate function or sub query to bring data from the unique columns into the select list. GROUP BY queries can get really complex and make your head hurt.
If the columns really are consistently the same, something in your schema or application design isn't right. It's rare that you should ever have truly duplicate records in a table, and it's more likely to mean something is wrong.
Finally, I need to comment about your IF/THEN request. This idea is way off. SQL has a SET-based or declarative programming style. Thinking about problems using procedural tools like IF/THEN will only lead to trouble. Even if you really do need a conditional expression, the way to do it within an SQL statement is via a CASE expression.
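As a rough illustration of a CASE expression, here is a sketch run with Python's sqlite3 module. The people table, the short codes, and the 'unknown' fallback are all made up for the example. The condition is evaluated per row inside the one statement, with no procedural loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (FullName TEXT, Gender TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Mary Jones", "Female"), ("John Smith", "Male"), ("Sam Roe", None)],
)

# CASE picks a value per row; a NULL Gender matches no WHEN branch
# and falls through to ELSE.
labels = conn.execute("""
SELECT FullName,
       CASE WHEN Gender = 'Female' THEN 'F'
            WHEN Gender = 'Male'   THEN 'M'
            ELSE 'unknown'
       END AS gender_code
FROM people
ORDER BY FullName
""").fetchall()

print(labels)
```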
If the data is the same in all columns, then you can use the DISTINCT keyword:
SELECT DISTINCT FullName, Gender, DOB
FROM yourtable
This compares the values in all selected columns and removes any duplicate rows.
Another way to write this is using a GROUP BY on all fields:
SELECT FullName, Gender, DOB
FROM yourtable
GROUP BY FullName, Gender, DOB
You should use the distinct clause instead, if all the columns are in fact completely the same.
select distinct FullName, Gender, DOB
from <your_table_name>;
Having duplicate rows is usually a sign of something wrong (maybe the data is being loaded multiple times). You might have to investigate to find the actual reason.
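A short sketch with Python's sqlite3 module and made-up rows shows that DISTINCT and GROUP BY on all columns return the same result here. It also shows one way to start investigating: a HAVING COUNT(*) > 1 query that lists which rows are duplicated and how often.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (FullName TEXT, Gender TEXT, DOB TEXT)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        ("Mary Jones", "Female", "2012-05-01"),
        ("Mary Jones", "Female", "2012-05-01"),
        ("Mary Jones", "Female", "2012-05-01"),
        ("John Smith", "Male", "2011-01-15"),
    ],
)

distinct_rows = conn.execute(
    "SELECT DISTINCT FullName, Gender, DOB FROM yourtable ORDER BY FullName"
).fetchall()

grouped_rows = conn.execute(
    "SELECT FullName, Gender, DOB FROM yourtable "
    "GROUP BY FullName, Gender, DOB ORDER BY FullName"
).fetchall()

# Which rows occur more than once, and how many times?
duplicates = conn.execute(
    "SELECT FullName, Gender, DOB, COUNT(*) FROM yourtable "
    "GROUP BY FullName, Gender, DOB HAVING COUNT(*) > 1"
).fetchall()

print(distinct_rows == grouped_rows, duplicates)
```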

change sql code to eliminate repeating

Got a question regarding SQL and ColdFusion: I can't write SQL code properly, so that it won't repeat the variables twice. So far I've got:
<cfquery name="get_partner_all" datasource="#dsn#">
SELECT
C.COMPANY_ID,
C.FULLNAME,
CP.MOBILTEL,
CP.MOBIL_CODE,
CP.IMCAT_ID,
CP.COMPANY_PARTNER_TEL,
CP.COMPANY_PARTNER_TELCODE,
CP.COMPANY_PARTNER_TEL_EXT,
CP.MISSION,
CP.DEPARTMENT,
CP.TITLE,
CP.COMPANY_PARTNER_SURNAME,
CP.COMPANY_PARTNER_NAME,
CP.PARTNER_ID,
CP.COMPANY_PARTNER_EMAIL,
CP.HOMEPAGE,
CP.COUNTY,
CP.COUNTRY,
CP.COMPANY_PARTNER_ADDRESS,
CP.COMPANY_PARTNER_FAX,
CC.COMPANYCAT,
CRM.BAKIYE,
CRM.BORC,
CRM.ALACAK
FROM
COMPANY_PARTNER CP,
COMPANY C,
COMPANY_CAT CC,
#DSN2_ALIAS#.COMPANY_REMAINDER_MONEY CRM
WHERE
C.COMPANY_ID = CP.COMPANY_ID
AND C.COMPANY_ID = CRM.COMPANY_ID
AND C.COMPANYCAT_ID = CC.COMPANYCAT_ID
</cfquery>
As you can see, C.COMPANY_ID appears twice in the join conditions, so the output is also shown twice, but I need this (CRM) definition to display some money information.
Can anyone show me how I can define it in a different way so that the output of this code won't repeat the variables?
I assume you mean that you get multiple columns in the result set, each with the name "COMPANY_ID". The solution to this is to specify specific columns from all of the tables, instead of SELECT * (not just for the COMPANY_CAT table, alias CC).
If you're getting "repeated" rows, then you need to examine the contents of these rows. What's happening there is that one or more rows from another table are matching one row from the "COMPANY" table. Each matching pair of rows generates a row in the output. Now that you've expanded your column list, compare a pair of rows which have the same COMPANY_ID: in which columns do they differ? If it's in, say, the last 3 columns, then there are multiple rows in CRM which match the same COMPANY_ID.
Once you've identified the other table that is causing duplicates to occur, you need to decide how to limit them - should you be aggregating values from that table (e.g. SUM or MAX), or is there a way to further refine which row from the other table you want to match to the row in COMPANY.
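As a sketch of the aggregation option, the following uses Python's sqlite3 module with cut-down versions of the COMPANY and COMPANY_REMAINDER_MONEY tables. The sample rows are made up, and summing BAKIYE per company is only an assumed business rule. The money table is collapsed to one row per COMPANY_ID before the join, so each company appears once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COMPANY (COMPANY_ID INTEGER, FULLNAME TEXT);
CREATE TABLE COMPANY_REMAINDER_MONEY (COMPANY_ID INTEGER, BAKIYE INTEGER);
INSERT INTO COMPANY VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO COMPANY_REMAINDER_MONEY VALUES (1, 100), (1, 50), (2, 70);
""")

# Plain join: company 1 appears twice because it has two money rows.
repeated = conn.execute("""
SELECT C.COMPANY_ID, C.FULLNAME, CRM.BAKIYE
FROM COMPANY C
JOIN COMPANY_REMAINDER_MONEY CRM ON CRM.COMPANY_ID = C.COMPANY_ID
""").fetchall()

# Aggregate first, then join: one row per company.
one_per_company = conn.execute("""
SELECT C.COMPANY_ID, C.FULLNAME, CRM.BAKIYE
FROM COMPANY C
JOIN (SELECT COMPANY_ID, SUM(BAKIYE) AS BAKIYE
      FROM COMPANY_REMAINDER_MONEY
      GROUP BY COMPANY_ID) AS CRM
  ON CRM.COMPANY_ID = C.COMPANY_ID
ORDER BY C.COMPANY_ID
""").fetchall()

print(len(repeated), one_per_company)
```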
At a guess though, I'd speculate that one company could have multiple partners...
Don't use select table.*. Instead, name each column explicitly and don't repeat columns, as follows:
select
c.company_id,
c.blah_blah,
-- don't select cp.company_id
cp.foo_bar,
-- etc
You just need to remove * and replace it with a list of column names. It is always advisable to write a column list instead of * from a performance point of view. Also, if you add a column to a database table and use * to get the data, the query result will sometimes not reflect the new column because of caching.
In your case, just keep company_id from any one of the tables. That's it.