The table contains numerous rows for the same person, and all columns contain the same data. For instance:
FullName     Gender   DOB
---------------------------------
Mary Jones   Female   2012-05-01
I would like to select one row for each individual to appear in a report. I thought the easiest way might be to use an IF statement to check whether the next row is the same as the first unique row.
As others have said, since all the rows are the same you can use the distinct keyword:
SELECT DISTINCT <columns>
FROM table
However, you're likely to find that the reason for these duplicate records is that at least one column somewhere in the table is different. In that case, you may need to use GROUP BY instead:
SELECT <columns>
FROM table
GROUP BY <columns>
If you need to show data that is not part of the grouped columns, list only the matching columns in the GROUP BY clause, then use an aggregate function or a subquery to bring data from the differing columns into the select list. GROUP BY queries can get really complex and make your head hurt.
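As a minimal sketch of that pattern, assume a hypothetical table named people with an extra LoadedAt column that differs between otherwise identical rows. The table name, the LoadedAt column, and the sample rows are invented for illustration and are not part of the question.

```sql
-- Hypothetical table: LoadedAt is an assumed column that differs per row.
CREATE TABLE people (
    FullName VARCHAR(100),
    Gender   VARCHAR(10),
    DOB      DATE,
    LoadedAt DATE
);

-- Invented sample rows: same person, loaded three times.
INSERT INTO people (FullName, Gender, DOB, LoadedAt) VALUES
    ('Mary Jones', 'Female', '2012-05-01', '2020-01-10'),
    ('Mary Jones', 'Female', '2012-05-01', '2020-02-15'),
    ('Mary Jones', 'Female', '2012-05-01', '2020-03-20');

-- Group by the columns that match; aggregate the column that differs.
SELECT FullName, Gender, DOB,
       MAX(LoadedAt) AS LastLoadedAt,
       COUNT(*)      AS RowsFound
FROM people
GROUP BY FullName, Gender, DOB;
```

With these sample rows the query should return a single row for Mary Jones, carrying the latest load date and a count of 3.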
If the columns really are consistently the same, something in your schema or application design isn't right. It's rare that you should ever have truly duplicate records in a table, and it's more likely to mean something is wrong.
Finally, I need to comment about your IF/THEN request. This idea is way off. SQL has a SET-based or declarative programming style. Thinking about problems using procedural tools like IF/THEN will only lead to trouble. Even if you really do need a conditional expression, the way to do it within an SQL statement is via a CASE expression.
If the data is the same in all columns, then you can use the DISTINCT keyword:
SELECT DISTINCT FullName, Gender, DOB
FROM yourtable
DISTINCT compares the values of all the selected columns and returns one row for each unique combination, removing any duplicates.
Another way to write this is using a GROUP BY on all fields:
SELECT FullName, Gender, DOB
FROM yourtable
GROUP BY FullName, Gender, DOB
You should use the DISTINCT clause instead, if all the columns are in fact completely the same.
select distinct FullName, Gender, DOB
from <your_table_name>;
Having duplicate rows is usually a sign that something is wrong (maybe the data is being loaded multiple times). You might have to investigate to find the actual reason.
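One way to start that investigation is to list which combinations are duplicated and how often. The sketch below assumes the three columns from the question; the table definition and the sample rows (including John Smith) are made up so the example can run on its own.

```sql
-- Assumed definition of the table from the question.
CREATE TABLE yourtable (
    FullName VARCHAR(100),
    Gender   VARCHAR(10),
    DOB      DATE
);

-- Invented sample rows: one duplicated person, one unique person.
INSERT INTO yourtable (FullName, Gender, DOB) VALUES
    ('Mary Jones', 'Female', '2012-05-01'),
    ('Mary Jones', 'Female', '2012-05-01'),
    ('John Smith', 'Male',   '2010-03-14');

-- List each combination that occurs more than once, with its count.
SELECT FullName, Gender, DOB, COUNT(*) AS Copies
FROM yourtable
GROUP BY FullName, Gender, DOB
HAVING COUNT(*) > 1;
```

With these rows, only Mary Jones should come back, with Copies = 2.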
I'm working on a query that pulls demographic information for people who have visited a location. However, the fields required for this report aren't all available in the DB, and some will need to be added manually from a separate Excel file after I've exported the results of my query. The person who will be responsible for merging the two files asked if it would be possible to create blank columns so they can more easily see where the missing data needs to go, and I wasn't sure how to go about that. Obviously those blank columns could just be created in the exported spreadsheet, but I wondered if there was a way to add them in the SQL query itself.
My SELECT statement currently looks something like this—I've just added comments to mark the missing fields so I can keep track of what order all the fields for this report need to be in.
SELECT DISTINCT
PersonID,
PersonFName,
PersonLName,
PersonDOB,
VisitID,
--StaffFName,
--StaffLName,
FacilityPhone,
FacilityAddress,
...and so on
Since those two staff name fields don't exist in my DB, I obviously can't actually include them in the SELECT list. But is there a way to still include those two fields as blank columns in my query results? Something along the lines of "select [nothing] as StaffFName"?
Just add literal nulls to the select clause:
SELECT DISTINCT
PersonID,
PersonFName,
PersonLName,
PersonDOB,
VisitID,
null as StaffFName,
null as StaffLName,
FacilityPhone,
FacilityAddress,
...
Or, if you prefer, you can use empty strings instead:
...
'' as StaffFName,
'' as StaffLName,
...
But null is the canonical way to represent the absence of data.
This is a general question:
Does anyone have a tip on how I can know when I should use DISTINCT in my queries? I am struggling to understand exactly when to use it. I tend to use it when I don't need it and not when I do.
Thank you all very much.
Basically, there is little reason to use select distinct, although it is sometimes a convenient shorthand.
If it can be avoided, avoid it! The database incurs overhead for removing duplicates, even if there are none, so select distinct is typically slower than a plain select.
Often select distinct is more appropriately written using group by, because often you want some column to be aggregated (such as the maximum date/time).
That said, it can be convenient shorthand, so it should not be avoided altogether, just used rarely.
There is no general rule as to when to use DISTINCT; it depends on your requirement. For example, when a column contains the same value more than once but you need each value only once, you use DISTINCT.
Suppose you have a list of banks and branches in a city, but you need the list of unique banks operating in the city (to get just the number, you would select count(distinct bank_name) instead). Then you would write:
select distinct bank_name from city;
I use distinct when I want to ensure rows are not duplicated in a query that could have duplicate records for the field combination I am selecting. Generally, this would be when selecting a set of columns that do not include a primary/unique key and are not guaranteed to be unique when the selected fields are taken together.
For example, if I was selecting customers that had purchased this year to send a letter to, and customers can have more than one order in a single year and I want to ensure that I send only one letter per person and address, I would use Distinct to ensure that I get one occurrence of each unique customer name / address combination.
--could return multiple records for repeat customers if Distinct was not present
Select Distinct BillingName, BillingAddress
from Orders
where OrderDate > '2019-08-01'
I have 250+ columns in my customer table. According to my process, there should be only one row per customer; however, I've found a few customers who have more than one entry in the table.
After running DISTINCT on the entire table for such a customer, it still returns two rows. I suspect that one of the columns has a trailing space or junk characters from the source tables, resulting in two rows with the same information.
select distinct * from ( select * from customer_table where custoemr = '123' ) a;
The above query returns two rows. Looking at the results with the naked eye, there is no difference in any of the columns.
I could identify which column is causing the duplicates by running a DISTINCT query for each column in turn, but that would be a very manual task for 250+ columns.
This may sound like a very dumb question, but I'm kind of stuck here. Please suggest a better way to identify this if you have one. Thank you.
Solving this one-time issue with SQL is too much effort. Simply copy and paste the two rows into Excel, transpose the data so each column becomes a row, and use a simple comparison function along the lines of "if a==b then 1 else 0".
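Once a suspect column has been narrowed down (for example with the spreadsheet comparison above), a quick check in SQL can confirm whether trailing whitespace is the cause. This is only a sketch: the table and column names are hypothetical, and it assumes SQLite or PostgreSQL, where || concatenates strings and LENGTH and TRIM are available.

```sql
-- Hypothetical table and column names, invented for this sketch.
CREATE TABLE customer_demo (
    customer_id   VARCHAR(10),
    customer_name VARCHAR(50)
);

-- Invented sample rows: the second name ends with a space,
-- which is invisible in most result grids.
INSERT INTO customer_demo (customer_id, customer_name) VALUES
    ('123', 'Acme Ltd'),
    ('123', 'Acme Ltd ');

-- Wrap the value in brackets and compare lengths before and after trimming.
SELECT customer_id,
       '[' || customer_name || ']'  AS bracketed_name,
       LENGTH(customer_name)        AS raw_length,
       LENGTH(TRIM(customer_name))  AS trimmed_length
FROM customer_demo
WHERE customer_id = '123';
```

A row where raw_length and trimmed_length differ has leading or trailing spaces. TRIM with no arguments removes only spaces, so other junk characters such as tabs would need a separate check.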
I have a table with an id and a name.
I get a list of IDs and I need their names.
As far as I know, I have two options.
Create a for loop in my code which executes:
SELECT name from table where id=x
where x is always a number.
Or I write a single query like this:
SELECT name from table where id=1 OR id=2 OR id=3
The list of IDs and names is enormous, so I think you wouldn't want that.
A further problem is that the id is not always a number but a randomly generated id containing numbers and characters, so working with ranges is not a solution.
I'm asking this from a performance point of view.
What's a nice solution for this problem?
SQLite has limits on the size of a query, so if there is no known upper limit on the number of IDs, you cannot use a single query.
When you are reading multiple rows (note: IN (1, 2, 3) is easier than many ORs), you don't know to which ID a name belongs unless you also SELECT that, or sort the results by the ID.
There should be no noticeable difference in performance; SQLite is an embedded database without client/server communication overhead, and the query does not need to be parsed again if you use a prepared statement.
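The point about selecting the ID together with the name can be sketched as follows. The table name items, its definition, and the sample rows are assumptions made up for illustration; the alphanumeric IDs mirror the kind described in the question.

```sql
-- Hypothetical table with alphanumeric IDs.
CREATE TABLE items (
    id   VARCHAR(20) PRIMARY KEY,
    name VARCHAR(100)
);

-- Invented sample rows.
INSERT INTO items (id, name) VALUES
    ('a1', 'Alpha'),
    ('b2', 'Beta'),
    ('c3', 'Gamma');

-- Return the id alongside the name so each name can be matched to its id;
-- ORDER BY makes the row order predictable.
SELECT id, name
FROM items
WHERE id IN ('a1', 'c3')
ORDER BY id;
```

In application code the IN list would normally be built from bound parameters rather than literals, which keeps the prepared-statement benefit mentioned above.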
A "nice" solution is using the IN operator:
SELECT name from table where id in (1,2,3)
Also, the IN operator is syntactic sugar built for exactly this purpose.
SELECT name from table where id IN (1,2,3,4,5,6.....)
Assuming you receive the list of IDs to look up as an input temp table, #InputIDTable, you can write:
SELECT name from table WHERE ID IN (SELECT id from #InputIDTable)
I'm pulling data from a CMDB into another repository. The problem is that the CMDB data has misspelled/duplicate records (e.g., some assets have a Department Name of Marketing, or Markting, or Marketing&, when they are all just in Marketing). I want to run a select query that displays all incorrectly named department records under the single, correct name. Any help on how to approach this?
You can use a CASE expression to display "Marketing" in place of the wrong entries. But the query can get complicated depending on how many variations there are.
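A minimal sketch of the CASE approach, using the three spellings from the question. The table name assets, its columns, the asset names, and the Finance row are assumptions added so the example can run on its own.

```sql
-- Hypothetical table; only the department spellings come from the question.
CREATE TABLE assets (
    AssetName      VARCHAR(100),
    DepartmentName VARCHAR(100)
);

-- Invented sample rows.
INSERT INTO assets (AssetName, DepartmentName) VALUES
    ('Laptop 1',  'Marketing'),
    ('Laptop 2',  'Markting'),
    ('Printer 1', 'Marketing&'),
    ('Server 1',  'Finance');

-- Map the known bad spellings to the correct name; leave other values alone.
SELECT AssetName,
       CASE
           WHEN DepartmentName IN ('Markting', 'Marketing&') THEN 'Marketing'
           ELSE DepartmentName
       END AS DepartmentName
FROM assets;
```

Every known misspelling has to be listed explicitly, which is why this gets unwieldy as the number of variations grows.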
A better and easier way is a global search and replace on the column. The following article describes it:
http://www.codecandle.com/articles/sql/update/483-sql-update-with-global-search-and-replace.html
For cleaning up duplicate rows, the following article may help:
http://www.codecandle.com/articles/sql/windowing/503-deleting-duplicate-rows-using-windowing.html
I'm sure the moment has passed, but http://openrefine.org/ would probably help you clean the messy data.
You can use the SELECT DISTINCT statement, which returns only distinct (different) values.
Put the DISTINCT keyword before the column names in the SELECT statement.
e.g.: select distinct name  -- name is the column name
from table_name;