SQL select distinct by 2 or more columns - sql

I have a table with a lot of columns, and I need to write a select that returns only unique rows. The main problem is that I need to check three columns at the same time: a row counts as a duplicate only if all three columns match another row's values (each column compared with the same column, not with the other two). The idea would be something like distinct(column1 and column2 and column3).
Any ideas? Let me know if you need more information; I'm not sure everybody gets what I have in mind.
Here is an example. The select should return two rows from it: one where the last column is Yes and another where it is No.

This is exactly what the distinct keyword is for:
SELECT distinct col1, col2, col3
FROM mytable
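As a small illustration of how DISTINCT treats the listed columns as one combined key, here is a sketch with made-up data. The table contents are assumptions for illustration, not the asker's actual rows:

```sql
-- Hypothetical sample data; only the names mytable and col1..col3 come from the answer above.
CREATE TABLE mytable (col1 INT, col2 INT, col3 VARCHAR(3));

INSERT INTO mytable (col1, col2, col3) VALUES
    (1, 2, 'Yes'),
    (1, 2, 'Yes'),   -- exact repeat of the first row: collapsed by DISTINCT
    (1, 2, 'No');    -- differs in col3 only: kept as a separate row

-- A row is dropped only when all three listed columns match another row.
SELECT DISTINCT col1, col2, col3
FROM mytable;
-- Returns two rows: (1, 2, 'Yes') and (1, 2, 'No'), in no guaranteed order.
```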

Related

Copying all rows from one table to another without writing out all of the columns

I'm trying to copy all rows from one table into another that are distinct on one column (using a PostgreSQL database). I know that this can be done like so:
INSERT INTO table2(col1, col2, col3, ...)
SELECT
DISTINCT ON (col1) col1, col2, col3, ...
FROM table1;
The problem I'm having is that table1 has 100+ columns and so I don't want to write out all of the column names. I tried to do something like:
INSERT INTO table2 (*)
SELECT
DISTINCT ON (col1) *
FROM table1;
which resulted in a syntax error. Could someone please provide a code snippet with the correct syntax?
If the columns exactly line up, you can use:
INSERT INTO table2
SELECT DISTINCT ON (col1) t1.*
FROM table1 t1
ORDER BY col1;
Very importantly: When using DISTINCT ON, you should always have an ORDER BY, where the keys for the ORDER BY match the expressions in parentheses.
Leaving out the explicit columns in the INSERT is dangerous -- precisely because there might be some slip-up (columns out of order or a different number of columns). Sometimes when you are writing scripts and you know that the destination table really does match the source table, though, it can be handy.
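To make the ORDER BY point concrete, here is a minimal PostgreSQL sketch. The sample rows and the choice of col2 as a tie-breaker are assumptions for illustration only:

```sql
-- PostgreSQL. Column types and sample rows are hypothetical.
CREATE TABLE table1 (col1 INT, col2 INT, col3 TEXT);
CREATE TABLE table2 (LIKE table1);   -- same columns in the same order

INSERT INTO table1 VALUES (1, 10, 'old'), (1, 20, 'new'), (2, 5, 'only');

-- ORDER BY starts with the DISTINCT ON expression; the extra key (col2 DESC)
-- decides which row of each col1 group is kept (here: the one with the highest col2).
INSERT INTO table2
SELECT DISTINCT ON (col1) t1.*
FROM table1 t1
ORDER BY col1, col2 DESC;

SELECT * FROM table2 ORDER BY col1;
-- Returns (1, 20, 'new') and (2, 5, 'only').
```

Without the second sort key, which of the two col1 = 1 rows survives is not predictable.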

Adding a Records Count row to top of query results in SQL

I want a row count above my query results. I found an article that suggested using sections but the summary and select query do not have matching columns/data types.
Ex.
Total Records 25
Col1 Col2 Col3...
XXXX XXXX XXXX
Here is the example from the suggestion I found, but my columns and data types do not match between the two queries:
SELECT * FROM (SELECT [Section]=2, Col1, Col2, ..., Value1, Value2
FROM #TEMP
UNION ALL
SELECT [Section]=1, 'Total', '----', ..., SUM(Value1), SUM(Value2)
FROM #TEMP
) AS T
ORDER BY [Section], Col1, ...
To be polite, you are not using the tool the way it is meant to be used. The contents of those columns are strongly typed. Each one contains strings, dates, numbers, etc, and you're adding another row with strings on top.
The only way I can see this working is if you were to convert all of your columns to VARCHAR columns and cast all of your data to VARCHAR(MAX) in the query.
Otherwise, I think that the most reasonable solution would be to perform a second query for the totals.
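Both options might look roughly like this in SQL Server. The #TEMP definition and its column types are assumptions, since the question does not show them:

```sql
-- SQL Server. Hypothetical stand-in for the #TEMP table in the question.
CREATE TABLE #TEMP (Col1 VARCHAR(20), Value1 INT, Value2 DECIMAL(10, 2));
INSERT INTO #TEMP VALUES ('a', 1, 1.50), ('b', 2, 2.50);

-- Option 1: cast every data column to VARCHAR so the count row can share the columns.
SELECT Col1, Value1, Value2
FROM (
    SELECT [Section] = 1, 'Total Records' AS Col1,
           CAST(COUNT(*) AS VARCHAR(MAX)) AS Value1,
           '' AS Value2
    FROM #TEMP
    UNION ALL
    SELECT [Section] = 2, Col1,
           CAST(Value1 AS VARCHAR(MAX)),
           CAST(Value2 AS VARCHAR(MAX))
    FROM #TEMP
) AS T
ORDER BY [Section], Col1;   -- the count row (Section 1) sorts first

-- Option 2: keep the data typed and fetch the count with a second query.
SELECT COUNT(*) AS TotalRecords FROM #TEMP;
```

With option 1 the numeric columns come back as text, so any client-side sorting or arithmetic on them no longer behaves numerically.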

Turning multiple rows into single row based on ID, and keeping null values

I have tried some of the various solutions posted on Stack for this issue but none of them keep null values (and it seems like the entire query is built off that assumption).
I have a table with 1 million rows. There are 10 columns. The first column is the id. Each id is unique to "item" (in my case a sales order) but has multiple rows. Each row is either completely null or has a single value in one of the columns. No two rows with the same ID have data for the same column. I need to merge these multiple rows into a single row based on the ID. However, I need to keep the null values. If the first column is null in all rows I need to keep that in the final data.
Can someone please help me with this query? I've been stuck on it for 2 hours now.
id - Age - firstname - lastname
1 13 null null
1 null chris null
should output
1 13 chris null
It sounds like you want an aggregation query:
select id, max(col1) as col1, max(col2) as col2, . . .
from t
group by id;
If all values are NULL, then this will produce NULL. If one of the rows (for an id) has a value, then this will produce that value.
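Applied to the sample rows from the question, the aggregation works out as follows. The column types are assumed:

```sql
-- Column names follow the question's example; the types are assumptions.
CREATE TABLE t (id INT, age INT, firstname VARCHAR(50), lastname VARCHAR(50));

INSERT INTO t (id, age, firstname, lastname) VALUES
    (1, 13,   NULL,    NULL),
    (1, NULL, 'chris', NULL);

-- MAX() skips NULLs, so each column yields its single non-NULL value,
-- or NULL when every row in the group is NULL for that column.
SELECT id,
       MAX(age)       AS age,
       MAX(firstname) AS firstname,
       MAX(lastname)  AS lastname
FROM t
GROUP BY id;
-- Returns one row: 1, 13, 'chris', NULL
```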
select id, max(col1), max(col2).. etc
from mytable
group by id
As some others have mentioned, you should use an aggregation query to achieve this.
select t1.id, max(t1.col1), max(t1.col2)
from tableone t1
group by t1.id
This should return nulls. If you're having issues handling your nulls, maybe implement some logic using ISNULL(). Make sure your data fields really are nulls and not empty strings.
If nulls aren't being returned, check that EVERY row with a particular ID has ONLY nulls in that column. MAX() ignores nulls, so if even one of those rows holds an empty string, the result will be that empty string rather than null.

duplicate data in opposite column

I have a table which has two fields, let's say col1 and col2.
DATA AS
col1,col2
10,age
20,30
30,param
age,10
30,20
param,30
Each row has a duplicate with the columns in reverse order, say
10,age
age,10
In my final output I want just a single row from each duplicate pair, so the final
output will be like
col1,col2
10,age
20,30
30,param
Only three rows will be left; the rest will be ignored according to the given scenario.
I have tried many different ways but can't find the solution.
If any of you can help, or just suggest an approach, it would be a great help.
Thanks
select distinct col1,col2 from t t1
where col1<=col2
or not exists (select 1 from t where t.col1=t1.col2
and
t.col2=t1.col1)
You can do the same using Greatest and Least
select distinct
least("col1","col2") AS "col1"
,greatest("col1","col2") as "col2"
from Table1
order by "col1"
Shouldn't this be enough?
SELECT *
FROM Data
WHERE Col1 < Col2
(if you are sure every row has a duplicate pair)
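To see how the LEAST/GREATEST approach and the plain Col1 < Col2 filter differ when a row has no mirrored partner, here is a sketch using the question's data plus one extra unpaired row. The VARCHAR types and the ('z', 'a') row are assumptions; LEAST and GREATEST are available in PostgreSQL, Oracle and MySQL:

```sql
-- Both columns are assumed to be strings, since the sample mixes numbers and words.
CREATE TABLE Table1 (col1 VARCHAR(10), col2 VARCHAR(10));

INSERT INTO Table1 (col1, col2) VALUES
    ('10', 'age'), ('20', '30'), ('30', 'param'),
    ('age', '10'), ('30', '20'), ('param', '30'),
    ('z', 'a');    -- hypothetical row with no mirrored duplicate

-- Normalises each pair so the smaller value is always in col1, then de-duplicates.
-- The unpaired row is kept, but comes back swapped as ('a', 'z').
SELECT DISTINCT
       LEAST(col1, col2)    AS col1,
       GREATEST(col1, col2) AS col2
FROM Table1
ORDER BY col1;

-- Keeps rows as stored, but silently drops the unpaired ('z', 'a') row.
SELECT *
FROM Table1
WHERE col1 < col2;
```

The comparisons here are string comparisons, so the exact ordering of values depends on the database collation.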

Fast way to eyeball possible duplicate rows in a table?

Similar: How can I delete duplicate rows in a table
I have a feeling this is impossible and I'm going to have to do it the tedious way, but I'll see what you guys have to say.
I have a pretty big table, about 4 million rows, and 50-odd columns. It has a column that is supposed to be unique, Episode. Unfortunately, Episode is not unique - the logic behind this was that occasionally other fields in the row change, despite Episode being repeated. However, there is an actually unique column, Sequence.
I want to try and identify rows that have the same episode number, but something different between them (aside from sequence), so I can pick out how often this occurs, and whether it's worth allowing for or I should just nuke the rows and ignore possible mild discrepancies.
My hope is to create a table that shows the Episode number, and a column for each table column, identifying the value on both sides, where they are different:
SELECT a.Episode,
CASE WHEN a.Value1<>b.Value1
THEN a.Value1 + ',' + b.Value1
ELSE '' END AS Value1,
CASE WHEN a.Value2<>b.Value2
THEN a.Value2 + ',' + b.Value2
ELSE '' END AS Value2
FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode
WHERE a.Value1<>b.Value1
OR a.Value2<>b.Value2
(That is probably full of holes, but the idea of highlighting changed values comes through, I hope.)
Unfortunately, making a query like that for fifty columns is pretty painful. Obviously, it doesn't exactly have to be rock-solid if it will only be used the once, but at the same time, the more copy-pasta the code, the more likely something will be missed. As far as I know, I can't just do a search for DISTINCT, since Sequence is distinct and the same row will pop up as different.
Does anyone have a query or function that might help? Either something that will output a query result similar to the above, or a different solution? As I said, right now I'm not really looking to remove the duplicates, just identify them.
Use:
SELECT DISTINCT t.*
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
DISTINCT is just shorthand for writing a GROUP BY with all the columns involved. Grouping by all the columns will show you all the unique groups of records associated with the episode column in this case. So there's a risk of not having an accurate count of duplicates, but you will have the values so you can decide what to remove when you get to that point.
50 columns is a lot, but setting the ORDER BY will allow you to eyeball the list. Another alternative would be to export the data to Excel if you don't want to construct the ORDER BY, and use Excel's sorting.
UPDATE
I didn't catch that the sequence column would be a unique value. In that case you'd have to provide a list of all the columns you want to see, i.e.:
SELECT DISTINCT t.episode, t.column1, t.column2 --etc.
FROM TABLE t
ORDER BY t.episode --, and whatever other columns
There's no notation that will let you use t.* but not this one column. Once the sequence column is omitted from the output, the duplicates will become apparent.
Instead of typing out all 50 columns, you could do this:
select column_name from information_schema.columns where table_name = 'your table name'
then paste them into a query that groups by all of the columns EXCEPT sequence, and filters by count > 1:
select
count(episode)
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
This should give you a list of the column-value combinations that occur on more than one row, but neither the sequence nor the episode numbers themselves. Here's the rub: you will need to join this result set to YourTable on ALL the columns except sequence and episode, since you don't have those columns here.
Here's where I like to use SQL to generate more SQL. This should get you started:
select 't1.' + column_name + ' = t2.' + column_name
from information_schema.columns where table_name = 'YourTable'
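A slightly fuller version of that generator might look like this in SQL Server. The excluded column names 'Sequence' and 'Episode' are taken from the question; the QUOTENAME wrapping and the trailing AND are additions for convenience:

```sql
-- SQL Server. Emits one join predicate per column, ready to paste into the ON clause.
-- Remove the trailing AND from the last generated line before using it.
SELECT 't1.' + QUOTENAME(column_name) + ' = t2.' + QUOTENAME(column_name) + ' AND'
FROM information_schema.columns
WHERE table_name = 'YourTable'
  AND column_name NOT IN ('Sequence', 'Episode')   -- not present in the grouped subquery
ORDER BY ordinal_position;

-- Caveat: t1.col = t2.col is never true when both sides are NULL,
-- so rows with NULLs in a compared column will not match themselves.
```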
You'll plug in those join parameters to this query:
select * from YourTable t1
inner join (
select
count(episode) 'epcount'
, col1
, col2
, col3
, ...
from YourTable
group by
col1
, col2
, col3
, ...
having count(episode) > 1
) t2 on
...plug in all those join parameters here...
select count distinct ....
Should show you without having to guess. You can get your columns by viewing your table definition so you can copy/paste your non-sequence columns.
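One possible reading of this suggestion, sketched with placeholder column names (column1 and column2 stand in for the real non-sequence columns, and YourTable for the real table):

```sql
-- For each repeated episode, count how many different values each column takes.
SELECT episode,
       COUNT(*)                AS row_count,
       COUNT(DISTINCT column1) AS column1_values,
       COUNT(DISTINCT column2) AS column2_values
FROM YourTable
GROUP BY episode
HAVING COUNT(*) > 1
ORDER BY episode;
```

A value above 1 in any of the *_values columns means that column differs between rows of the same episode. COUNT(DISTINCT ...) ignores NULLs, so a column that is NULL on one row and filled on another still counts as 1.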
I think something like this is what you want:
select *
from t
where t.episode in (select episode from t group by episode having count(episode) > 1)
order by episode
This will give all rows that have episodes that are duplicated. Non-duplicate rows should stick out fairly obviously.
Of course, if you have access to some sort of scripting, you could just write a script to generate your query for you. It seems pretty straightforward (i.e. describe t and iterate over all the fields).
Also, your query should have some sort of ordering, like FROM Table1 a INNER JOIN Table1 b ON a.Episode = b.Episode AND a.Sequence < b.Sequence, otherwise you'll get duplicate non-duplicates.
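With that join condition added, the query from the question might look like this. It is a sketch in SQL Server syntax that assumes Value1 and Value2 are character columns, since + is used for concatenation:

```sql
-- a.Sequence < b.Sequence pairs each two rows of an episode exactly once
-- and never pairs a row with itself.
SELECT a.Episode,
       a.Sequence AS first_sequence,
       b.Sequence AS second_sequence,
       CASE WHEN a.Value1 <> b.Value1
            THEN a.Value1 + ',' + b.Value1
            ELSE '' END AS Value1,
       CASE WHEN a.Value2 <> b.Value2
            THEN a.Value2 + ',' + b.Value2
            ELSE '' END AS Value2
FROM Table1 a
INNER JOIN Table1 b
        ON a.Episode = b.Episode
       AND a.Sequence < b.Sequence
WHERE a.Value1 <> b.Value1
   OR a.Value2 <> b.Value2;
```

Note that <> does not flag a difference when one side is NULL, so NULL-versus-value changes would need an extra IS NULL check.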
A relatively simple solution that Ponies sparked:
SELECT t.*
FROM Table t
INNER JOIN ( SELECT episode
FROM Table
GROUP BY Episode
HAVING COUNT(*) > 1
) AS x ON t.episode = x.episode
ORDER BY t.episode
And then, copy-paste into Excel, and use this as conditional highlighting for the entire result set:
=AND($C2=$C1,A2<>A1)
Column C is Episode. This way, you get a visual highlight when the data's different from the row above (as long as both rows have the same value for episode).
Generate and store a hash key for each row, designed so the hash values mirror your definition of sameness. Depending on the complexity of your rows, updating the hash might be a simple trigger on modifying the row.
Query for duplicates of the hash key, which are your "very probably" identical rows.
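A rough sketch of the idea in SQL Server, computing the hash on the fly rather than storing it. The column list (Episode, Value1, Value2) is a placeholder for whichever columns define "sameness", and the '|' separator is an arbitrary choice:

```sql
-- SQL Server 2012 or later (CONCAT and SHA2_256 support).
-- CONCAT turns NULL into an empty string, so NULL and '' hash identically here.
SELECT row_hash, COUNT(*) AS copies
FROM (
    SELECT HASHBYTES('SHA2_256',
                     CONCAT(Episode, '|', Value1, '|', Value2)) AS row_hash
    FROM Table1
) AS h
GROUP BY row_hash
HAVING COUNT(*) > 1;
```

To follow the answer more literally, the same expression could populate a stored column kept up to date by a trigger, which can then be indexed for the duplicate lookup.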