T-SQL: Eliminating duplicate rows while ignoring certain columns

I'm struggling to find the proper statements to select non-duplicate entries that are duplicates only for particular columns. As an example, in the following table I only care about rows that have unique values in col1, col2, and col3 and the values in col4 and col5 do not matter. This means I would consider row 1 and row 2 to be duplicates and row 4 and row 5 to be duplicates:
col1 col2 col3 col4 col5
A 2 p 0 2
A 2 p 1 8
A 3 r 4 12
B 0 f 3 1
B 0 f 6 5
And I would want to select only the following:
col1 col2 col3 col4 col5
A 2 p 0 2
A 3 r 4 12
B 0 f 3 1
Is there a way to combine multiple DISTINCT statements to achieve this or specify certain columns to ignore when comparing rows for duplicates?

You have to choose which row to keep from each group of duplicates. The ROW_NUMBER() function lets you do that:
SELECT col1, col2, col3, col4, col5
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col4 DESC) AS RowRank
      FROM table -- replace "table" with your table name
     ) sub
WHERE RowRank = 1
Change the ORDER BY clause to control which row you keep and which you discard. As written, ORDER BY col4 DESC keeps the row with the highest col4 in each group; the sample output in the question keeps the lowest, which ORDER BY col4 ASC would give. ROW_NUMBER() assigns a sequential number to each row. Because you want one row per combination of col1, col2, col3, you PARTITION BY those columns, so the numbering restarts at 1 for each combination. Run just the inner query to see the numbering.
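To try the ROW_NUMBER() approach without a SQL Server instance, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. The table name `demo` is made up for illustration, SQLite 3.25 or newer is assumed for window-function support, and the ordering is col4 ascending so that the result matches the sample output in the question.

```python
import sqlite3

# In-memory SQLite database; window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (col1 TEXT, col2 INTEGER, col3 TEXT, col4 INTEGER, col5 INTEGER)"
)
# The five sample rows from the question.
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?, ?)",
    [("A", 2, "p", 0, 2), ("A", 2, "p", 1, 8), ("A", 3, "r", 4, 12),
     ("B", 0, "f", 3, 1), ("B", 0, "f", 6, 5)],
)

dedup_sql = """
SELECT col1, col2, col3, col4, col5
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                                ORDER BY col4 ASC) AS RowRank
      FROM demo) AS sub
WHERE RowRank = 1
ORDER BY col1, col2, col3
"""
# One row per (col1, col2, col3) group: the one with the smallest col4.
rows = conn.execute(dedup_sql).fetchall()
for row in rows:
    print(row)
```

Switching `ASC` to `DESC` keeps the row with the largest col4 in each group instead.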
Alternatively, you could use GROUP BY with aggregate functions, e.g.:
SELECT col1, col2, col3, MAX(col4), MAX(col5)
FROM table
GROUP BY col1, col2, col3
The downside here is that the MAX() of col4 and col5 might come from different rows, so you're not necessarily returning one single row from your original table, but if you don't care which row you return then it doesn't matter.
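The caveat about MAX() mixing values from different rows is easy to see on a small case. The sketch below uses two made-up rows (not the question's data) in an in-memory SQLite table named `demo`, chosen so that the largest col4 and the largest col5 sit on different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo (col1 TEXT, col2 INTEGER, col3 TEXT, col4 INTEGER, col5 INTEGER)"
)
# Hypothetical rows: the highest col4 and the highest col5 are on different rows.
source_rows = [("A", 2, "p", 0, 8), ("A", 2, "p", 1, 2)]
conn.executemany("INSERT INTO demo VALUES (?, ?, ?, ?, ?)", source_rows)

grouped = conn.execute(
    "SELECT col1, col2, col3, MAX(col4), MAX(col5) "
    "FROM demo GROUP BY col1, col2, col3"
).fetchall()

# The grouped row combines col4 = 1 from one row with col5 = 8 from the other,
# so it does not correspond to any single row in the table.
print(grouped)
print(grouped[0] in source_rows)
```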

Related

Sum of columnA where another columnB is specific value without showing columnB

I have a table that I'm grouping data from. I want the sum of Col2 for rows where Col3 has a specific value, without showing Col3.
Table X:
Col1 Col2 Col3 Col4
A 4 tt 6y
B 5 tt 6y
C 4 ee 7y
A 3 ee 7u
A 4 ee 6y
B 5 tt 8u
C 4 tt 7y
A 3 xx 8u
My Select grouping is
select Col1, Sum(Col2), Col4
from table x
group by Col1, Col4
I need to add 2 new columns in the group, the sum of column Col2 where Col3 is tt and another is the sum of Col2 where Col3 is ee. I do not need to show the value of Col3 and do not want to group by Col3.
I have looked at a partition by but I can't figure out how to specify the partition to the value of the column.
You need conditional aggregation:
Select
Col1, Col4,
Sum(Col2) sumcol2,
Sum(case when col3 = 'tt' then Col2 else 0 end) sumtt,
Sum(case when col3 = 'ee' then Col2 else 0 end) sumee
from table x
group by Col1,Col4
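As a quick check of the conditional aggregation, this sketch loads the question's Table X into an in-memory SQLite database from Python and runs the same query. The table is simply named `x` here and an ORDER BY is added so the output order is stable; both are choices made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (Col1 TEXT, Col2 INTEGER, Col3 TEXT, Col4 TEXT)")
# The eight rows of Table X from the question.
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?, ?)",
    [("A", 4, "tt", "6y"), ("B", 5, "tt", "6y"), ("C", 4, "ee", "7y"),
     ("A", 3, "ee", "7u"), ("A", 4, "ee", "6y"), ("B", 5, "tt", "8u"),
     ("C", 4, "tt", "7y"), ("A", 3, "xx", "8u")],
)

totals = conn.execute("""
    SELECT Col1, Col4,
           SUM(Col2) AS sumcol2,
           -- Each CASE contributes Col2 only when Col3 matches, otherwise 0.
           SUM(CASE WHEN Col3 = 'tt' THEN Col2 ELSE 0 END) AS sumtt,
           SUM(CASE WHEN Col3 = 'ee' THEN Col2 ELSE 0 END) AS sumee
    FROM x
    GROUP BY Col1, Col4
    ORDER BY Col1, Col4
""").fetchall()

for row in totals:
    print(row)
```

Col3 never appears in the SELECT list or the GROUP BY; it is only consulted inside the CASE expressions, which is what the question asked for.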
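HAVING filters groups after aggregation, so it can only reference grouped columns or aggregates. Col3 is neither here, so SQL Server rejects HAVING Col3 LIKE 'ee'. To restrict the sum to a single Col3 value, filter with WHERE before grouping.
Something like:
SELECT Col1, Sum(Col2), Col4
FROM table x
WHERE Col3 = 'ee'
GROUP BY Col1, Col4
As I am not sitting at my SQL machine, I cannot test it; test it yourself.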

How to easily list columns in BigQuery

I have a table with many columns in BigQuery.
I want to list its columns in a SELECT query, but writing out all of them is tedious.
I want to do something like this:
SELECT
col1,
col2,
col3,
...
SOME_METHOD(col30),
...
col50
FROM
foo.bar;
Is there an easy way to write such a query?
Below is for BigQuery Standard SQL
SELECT * EXCEPT(col30), SOME_METHOD(col30)
FROM foo.bar
or
SELECT * REPLACE(SOME_METHOD(col30) as col30)
FROM foo.bar
for example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 col1, 2 col2, 3 col3, 4 col4, 5 col5
)
SELECT * EXCEPT(col3), 2 * col3 AS col3
FROM `project.dataset.table`
with result
Row col1 col2 col4 col5 col3
1 1 2 4 5 6
or
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 col1, 2 col2, 3 col3, 4 col4, 5 col5
)
SELECT * REPLACE(2 * col3 AS col3)
FROM `project.dataset.table`
with result
Row col1 col2 col3 col4 col5
1 1 2 6 4 5
This is untested in BigQuery, but one trick that is available in other databases, such as SQL Server, is to use SELECT * and also list the other items you want to select. So you may try one of the following:
SELECT *, SOME_METHOD(col30) AS output
FROM yourTable;
Or
SELECT SOME_METHOD(col30) AS output, *
FROM yourTable;
Note that, depending on what else you list explicitly, the same column (and name) could appear more than once in the result set.

SQL Server: remove multiple row null value

I have a table like this:
Result Col1 Col2 Col3
-----------------------------
Row1 null 1 null
Row1 2 null null
Row1 null null 3
Row1 1 null null
Row1 null 2 null
Row1 null null 3
and I would like to get the result like
Result Col1 Col2 Col3
-----------------------------
Row1 2 1 3
Row1 1 2 3
How to get this done in the SQL Server table? I know that if I use the MAX of Col1, Col2, Col3 I will get only one row. But I need to get the two rows.
How can I do this?
This is tricky. You can assign a sequential value using row_number() to each value and then aggregate.
Your data lacks ordering -- SQL tables represent unordered sets. Assuming you have an ordering column and you have only one non-NULL value per row:
select t.result, max(col1) as col1, max(col2) as col2, max(col3) as col3
from (select t.*,
             row_number() over (partition by case when col1 is not null then 1
                                                  when col2 is not null then 2
                                                  when col3 is not null then 3
                                             end
                                order by ? -- the ordering column
                               ) as seqnum
      from t
     ) t
group by t.result, seqnum;
If you can have multiple non-NULL values per row, then the question is ill-defined. Ask another question and provide sample data and desired results.
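For a concrete run of the row_number() idea, here is a sketch against an in-memory SQLite database from Python (SQLite 3.25 or newer assumed). The `id` column is hypothetical: the question's table shows no ordering column, so one is invented here and numbered in the order the rows are listed in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `id` is a hypothetical ordering column; the original table does not show one.
conn.execute(
    "CREATE TABLE t (id INTEGER, result TEXT, col1 INTEGER, col2 INTEGER, col3 INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [(1, "Row1", None, 1, None), (2, "Row1", 2, None, None),
     (3, "Row1", None, None, 3), (4, "Row1", 1, None, None),
     (5, "Row1", None, 2, None), (6, "Row1", None, None, 3)],
)

collapsed = conn.execute("""
    SELECT s.result, MAX(s.col1) AS col1, MAX(s.col2) AS col2, MAX(s.col3) AS col3
    FROM (SELECT t.*,
                 -- Number the non-NULL values of each column separately, in id order.
                 ROW_NUMBER() OVER (
                     PARTITION BY CASE WHEN col1 IS NOT NULL THEN 1
                                       WHEN col2 IS NOT NULL THEN 2
                                       WHEN col3 IS NOT NULL THEN 3
                                  END
                     ORDER BY id
                 ) AS seqnum
          FROM t) AS s
    GROUP BY s.result, s.seqnum
    ORDER BY s.seqnum
""").fetchall()

# The n-th non-NULL value of each column ends up on the n-th output row.
for row in collapsed:
    print(row)
```

With this ordering the output matches the two rows requested in the question; a different ordering column would pair the values differently.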

Postgres - does DISTINCT ON take preference for non-null values in column?

If I use DISTINCT ON on 2 columns, and there is a third column that can have null values, does DISTINCT ON always try to return a row where that third column is not null, or is it just down to the ORDER BY?
So for example with this table:
col1 col2 col3
1 88 8
1 88 9
1 88 1
2 88 3
2 88
3 88
I want to be able to SELECT DISTINCT ON (col1, col2) and get rows where col3 is not null, unless the DISTINCT ON (col1, col2) does not have a row where col3 is not null.
It is entirely based on what you ORDER BY. If you want to prefer rows with a non-NULL col3, just include that in your ordering:
SELECT DISTINCT ON (col1, col2) ... ORDER BY col1, col2, col3 ASC NULLS LAST.
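To see the effect of the ordering without a Postgres instance, the sketch below emulates it in SQLite from Python. SQLite has no DISTINCT ON, so this swaps in ROW_NUMBER() with `WHERE rn = 1` (SQLite 3.25 or newer assumed), and it writes `col3 IS NULL, col3` in the ORDER BY as a portable stand-in for `col3 ASC NULLS LAST`. The table name `t` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
# The sample rows from the question; None stands for a NULL col3.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 88, 8), (1, 88, 9), (1, 88, 1), (2, 88, 3), (2, 88, None), (3, 88, None)],
)

picked = conn.execute("""
    SELECT col1, col2, col3
    FROM (SELECT t.*,
                 -- `col3 IS NULL` is 0 for real values and 1 for NULL,
                 -- so non-NULL rows sort first within each (col1, col2) group.
                 ROW_NUMBER() OVER (PARTITION BY col1, col2
                                    ORDER BY col3 IS NULL, col3) AS rn
          FROM t) AS ranked
    WHERE rn = 1
    ORDER BY col1, col2
""").fetchall()

# A NULL col3 is returned only for the group that has no non-NULL value.
for row in picked:
    print(row)
```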
select distinct col1, col2, col3
from table t
where t.col3 is not null
or not exists(select 1 from table tt
where tt.col1 = t.col1 and tt.col2 = t.col2 and tt.col3 is not NULL)

Element-wise quotient of two columns in SQL

How can I combine the columns returned by two SELECT statements to give their element-wise quotient?
Query 1:
SELECT COUNT(*) AS count
FROM table1
WHERE col2 = 1 AND col3 > 5
GROUP BY col4
ORDER BY col4
Query 2:
SELECT COUNT(*) AS count
FROM table1
WHERE col2 = 1
GROUP BY col4
ORDER BY col4
So if they return something like:
Query 1 Query 2
count count
-----------------------
1 5
2 4
I will get:
quotient
-------
0.2
0.5
With the 4-column version of the question, we can assume that the quotient is between groups with the same value in col4. So, the answer becomes:
SELECT col4, SUM(CASE WHEN col3 > 5 THEN 1 ELSE 0 END) / COUNT(*) AS quotient
FROM table1
WHERE col2 = 1
GROUP BY col4;
I've retained col4 in the output because I don't think the ratios (quotients) will be useful without something to identify which group each quotient belongs to, though strictly speaking the question doesn't ask for that column in the output.
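One thing to watch when running the single-query version: MySQL's `/` returns a decimal, but SQLite, SQL Server, and PostgreSQL truncate when both operands are integers, so the quotient would come out as 0. The sketch below shows this in SQLite from Python and one common workaround, multiplying by 1.0. The rows are made up so that the two groups give the 1/5 and 2/4 from the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col2 INTEGER, col3 INTEGER, col4 INTEGER)")
# Hypothetical rows: col4 = 1 has 5 rows (1 with col3 > 5),
# col4 = 2 has 4 rows (2 with col3 > 5).
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, 9, 1), (1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 0, 1),
     (1, 9, 2), (1, 9, 2), (1, 0, 2), (1, 0, 2)],
)

# Integer / integer: SQLite truncates, so both quotients become 0.
int_quot = conn.execute("""
    SELECT col4, SUM(CASE WHEN col3 > 5 THEN 1 ELSE 0 END) / COUNT(*) AS quotient
    FROM table1
    WHERE col2 = 1
    GROUP BY col4
    ORDER BY col4
""").fetchall()

# Multiplying by 1.0 forces floating-point division.
real_quot = conn.execute("""
    SELECT col4, 1.0 * SUM(CASE WHEN col3 > 5 THEN 1 ELSE 0 END) / COUNT(*) AS quotient
    FROM table1
    WHERE col2 = 1
    GROUP BY col4
    ORDER BY col4
""").fetchall()

print(int_quot)
print(real_quot)
```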
In this case, you don't need two separate queries at all:
SELECT SUM(col3 > 5) / COUNT(*)
FROM table1
WHERE col2 = 1
GROUP BY col4
ORDER BY col4
In case your actual queries cannot be simplified as per the other answers, you can join the subqueries, like this:
select j1.count / j2.count as quotient
from (
SELECT col4, COUNT(*) AS count
FROM table1
WHERE col2 = 1 AND col3 > 5
GROUP BY col4
) j1
join (
SELECT col4, COUNT(*) AS count
FROM table1
WHERE col2 = 1
GROUP BY col4
) j2 on j1.col4=j2.col4