SQL: Use distinct on groups of similar data

Hello mates, I have the following problem in a Vertica database. I have a large table:
+------+------+------+
| Date | Col1 | Col2 |
+------+------+------+
| 1 | A | B |
| 2 | A | B |
| 3 | D | E |
| 2 | C | D |
| 1 | C | D |
+------+------+------+
As you can see, I have redundant data that was just taken on different dates (rows 1 & 2 and rows 4 & 5). I would like a table that removes the redundancy by deleting the rows with the lower date, giving me a result like this:
+------+------+------+
| Date | Col1 | Col2 |
+------+------+------+
| 2 | A | B |
| 2 | C | D |
| 3 | D | E |
+------+------+------+
Using DISTINCT would not work, since it would drop rows arbitrarily without considering the date, so I might end up with a table like this:
SELECT DISTINCT Col1, Col2 from Table
+------+------+------+
| Date | Col1 | Col2 |
+------+------+------+
| 2 | A | B |
| 1 | C | D |
| 3 | D | E |
+------+------+------+
which is not desired.
Is there any way to accomplish that?
Thanks mates

Do a GROUP BY on your 2 columns and aggregate on the highest date:
SELECT MAX(Date), col1, col2
FROM table
GROUP BY Col1, Col2

I'm just generalizing the patterns here and adding one. For the exact question asked, any of these methods would probably work; the devil is in the details.
The aggregate method proposed by @Thomas_G works because you only have one column outside the grouping. If you had two, it could mix and match (some data from one row, some from another), which is not likely what you want as a duplicate handling strategy.
The analytical method proposed by @Gordon_Linoff is good, but be aware that if the max date is duplicated within a group in the source data, you'll get every row that falls on that max date. This might be what you want, but maybe not.
Another method is to just peel off the top row in the window. It will choose the first row in the partition based on your window ordering. If there are multiple rows at the max date, you can't guarantee which one will be chosen unless you include something more in the window order. But at least you know you'll only get one row per group, for what it's worth.
select t.*
from (select t.*, row_number() over (partition by col1, col2 order by date desc) as rn
from t
) t
where rn = 1;
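To make the difference between the two window approaches concrete, here is a minimal sketch that runs both against an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (for window functions); the extra `note` column and the tied rows are hypothetical data added for illustration, not part of the original question.

```python
import sqlite3

# Hypothetical data: the question's table plus an extra "note" column,
# with two rows tied on the max date for the (A, B) group.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date INTEGER, col1 TEXT, col2 TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "A", "B", "old"),
        (2, "A", "B", "tie-1"),
        (2, "A", "B", "tie-2"),
        (3, "D", "E", "only"),
        (2, "C", "D", "new"),
        (1, "C", "D", "old"),
    ],
)

# max() over the partition keeps every row on the max date, ties included.
max_rows = conn.execute(
    """
    SELECT date, col1, col2, note
    FROM (SELECT t.*, MAX(date) OVER (PARTITION BY col1, col2) AS maxd
          FROM t) AS x
    WHERE date = maxd
    ORDER BY col1, col2, note
    """
).fetchall()

# row_number() keeps exactly one row per group; "note" is added to the
# window ordering as a tie-breaker so the choice is deterministic.
top_rows = conn.execute(
    """
    SELECT date, col1, col2, note
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY col1, col2
                                    ORDER BY date DESC, note) AS rn
          FROM t) AS x
    WHERE rn = 1
    ORDER BY col1, col2
    """
).fetchall()

print(len(max_rows), len(top_rows))
```

With this data the max() version returns both tied rows for (A, B), while the row_number() version returns a single row per group.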

If there are other columns that you care about, you can use window functions:
select t.*
from (select t.*, max(date) over (partition by col1, col2) as maxd
from t
) t
where date = maxd;

Related

SQL: Select Most Recent Sequentially Distinct Value w/ Grouping

I am having trouble writing a query that would select the last "new" sequentially distinct value (let's call this column Col A) grouped based on another column (Col B). Since this is a bit ambiguous/confusing, here is an example to explain (assume row number is indicative of sequence inside groups; in my issue the rows are ordered by date):
|--------|-------|-------|
| RowNum | Col A | Col B |
|--------|-------|-------|
| 1 | A | A |
| 2 | B | A |
| 3 | C | A |
| 4 | B | B |
| 5 | A | B |
| 6 | B | B |
Would select:
| 3 | C | A |
| 6 | B | B |
Note that although B also appears in row 4, the fact that row 5 contains A means that the B in row 6 is sequentially distinct. But if table looked like this:
|--------|-------|-------|
| RowNum | Col A | Col B |
|--------|-------|-------|
| 1 | A | A |
| 2 | B | A |
| 3 | C | A |
| 4 | B | B |
| 5 | A | B |
| 6 | A | B | <--
Then we would want to select:
| 3 | C | A |
| 5 | A | B |
I think that this would be an easier problem if I wasn't concerned with values being distinct but not sequential. I'm not really sure how to even consider sequence when making a query.
I have attempted to solve this by calculating the min/max row numbers where each value of Col A appears. That calculation (using the second sample table) would produce a result like this:
|--------|--------|--------|--------|
| ColA | ColB | MinRow | MaxRow |
|--------|--------|--------|--------|
| A | A | 1 | 1 |
| B | A | 2 | 2 |
| C | A | 3 | 3 |
| A | B | 5 | 6 |
| B | B | 4 | 4 |
A solution raised in a related post (SQL: Select Row with Last New Sequentially Distinct Value) went down a similar path, essentially taking the most recent RowNum whose ColA differs from the last ColA and then picking the next row. However, in that question I failed to address the need for the query to work for multiple groups, hence the new post.
Any help with this problem, if it is at all possible to do in SQL, would be greatly appreciated. I am running SQL Server 2008 SP4.
Hmmm . . . One method is to get the last value. Then choose all the last rows with that value and aggregate:
select min(rownum), colA, colB
from (select t.*,
first_value(colA) over (partition by colB order by rownum desc) as last_colA
from t
) t
where rownum > all (select t2.rownum
from t t2
where t2.colB = t.colB and t2.colA <> t.last_colA
)
group by colA, colB;
Or, without the aggregation:
select t.*
from (select t.*,
first_value(colA) over (partition by colB order by rownum desc) as last_colA,
lag(colA) over (partition by colB order by rownum) as prev_colA
from t
) t
where rownum > all (select t2.rownum
from t t2
where t2.colB = t.colB and t2.colA <> t.last_colA
) and
(prev_colA is null or prev_colA <> colA);
But first_value() and lag() are not available in SQL Server 2008 (they were added in SQL Server 2012), so let's treat this as a gaps-and-islands problem:
select t.*
from (select t.*,
min(rownum) over (partition by colB, colA, (seqnum_b - seqnum_ab) ) as min_rownum_group,
max(rownum) over (partition by colB, colA, (seqnum_b - seqnum_ab) ) as max_rownum_group
from (select t.*,
row_number() over (partition by colB order by rownum) as seqnum_b,
row_number() over (partition by colB, colA order by rownum) as seqnum_ab,
max(rownum) over (partition by colB) as max_rownum
from t
) t
) t
where rownum = min_rownum_group and -- first row in the group defined by adjacent colA, colB
max_rownum_group = max_rownum; -- last group for each colB
This identifies each group of adjacent rows with the same colA using a difference of row numbers. It then calculates the maximum rownum within each group and the maximum rownum overall for each colB. The two are equal only for the last group in each colB.
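As a sanity check on the gaps-and-islands idea, here is a minimal sketch that runs it against the question's second sample table in an in-memory SQLite database from Python. It assumes SQLite 3.25 or later for window functions, and in this sketch the per-colB maximum is computed with a plain `PARTITION BY colB` (no window ORDER BY) so that it is the overall maximum rather than a running one. The subquery aliases `s` and `g` are just names chosen here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (rownum INTEGER, colA TEXT, colB TEXT)")
# Second sample table from the question (rows 5 and 6 both have colA = 'A').
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "A", "A"), (2, "B", "A"), (3, "C", "A"),
     (4, "B", "B"), (5, "A", "B"), (6, "A", "B")],
)

query = """
SELECT rownum, colA, colB
FROM (SELECT s.*,
             -- rows in the same island share colB, colA and the difference
             MIN(rownum) OVER (PARTITION BY colB, colA, seqnum_b - seqnum_ab)
                 AS min_rownum_group,
             MAX(rownum) OVER (PARTITION BY colB, colA, seqnum_b - seqnum_ab)
                 AS max_rownum_group
      FROM (SELECT t.*,
                   ROW_NUMBER() OVER (PARTITION BY colB ORDER BY rownum)
                       AS seqnum_b,
                   ROW_NUMBER() OVER (PARTITION BY colB, colA ORDER BY rownum)
                       AS seqnum_ab,
                   MAX(rownum) OVER (PARTITION BY colB) AS max_rownum
            FROM t) AS s) AS g
WHERE rownum = min_rownum_group        -- first row of its island
  AND max_rownum_group = max_rownum    -- island reaches the end of the colB group
ORDER BY colB
"""
result = conn.execute(query).fetchall()
print(result)
```

For this input the query returns rows 3 and 5, which matches what the question asks for in the second example.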

Oracle SQL statement without duplicates

I have a requirement to write a SQL statement to return 2 columns, however there cannot be duplicates in either of these columns. For example:
|---------------------|------------------|
| 10 | A |
|---------------------|------------------|
| 11 | B |
|---------------------|------------------|
| 12 | C |
|---------------------|------------------|
| 13 | A | <--- Don't return
|---------------------|------------------|
Using distinct doesn't work, since the row highlighted above is distinct. It also doesn't matter which of the duplicates is returned.
Does anyone know of a way to do this? It feels as though I'm missing something obvious.
Thanks.
You can number the rows within each col2 value with ROW_NUMBER() and keep only the rows where rn = 1.
CREATE TABLE T(
col1 int,
col2 varchar(5)
);
insert into t values (10,'A');
insert into t values (11,'B');
insert into t values (12,'C');
insert into t values (13,'A');
Query 1:
SELECT t1.col1,t1.col2
FROM (
SELECT t1.*,ROW_NUMBER() OVER(PARTITION BY col2 ORDER BY col1) rn
FROM T t1
)t1
WHERE t1.rn = 1
Results:
| COL1 | COL2 |
|------|------|
| 10 | A |
| 11 | B |
| 12 | C |
If you just want the lowest value from the first column, do:
SELECT MIN(column1), column2
FROM YourTable
GROUP BY column2
This is not possible in one query, because each column has a different number of unique values.

In SQL is there a way to partition by a value if it's not continuous

I would like to rank the values over a partition with two columns. col1 will be the key, and col2 will be some value that is also going to be used in ORDER BY. I would like to start a new partition only when col2 is discontinuous. For example, I would like to get the following:
+------+------+------+
| col1 | col2 | rank |
+------+------+------+
| a | 1 | 1 |
| a | 2 | 2 |
| a | 3 | 3 |
| a | 9 | 1 |
| a | 10 | 2 |
| b | 1 | 1 |
| b | 2 | 2 |
| b | 8 | 1 |
+------+------+------+
I'm thinking of something along the lines of:
SELECT col1, RANK() OVER (PARTITION BY col1, SOMETHING HERE??? ORDER BY col2 DESC)
Does anyone have any ideas?
If I understand correctly, you want to enumerate by "islands" of adjoining sequential values. You can do so with a simple observation: subtracting a sequence from col2 will be constant for each group. So, let's use this observation:
select t.*,
row_number() over (partition by col1, grp order by col2) as rnk
from (select t.*,
(col2 - row_number() over (partition by col1 order by col2)) as grp
from t
) t
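To see the "col2 minus a sequence is constant per island" observation in action, here is a minimal sketch against the question's sample data, run in an in-memory SQLite database from Python (SQLite 3.25 or later is assumed for window functions). In this sketch the final numbering is ordered by col2 inside each island, and it assumes col2 values are unique within each col1; duplicates would break the constant difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1), ("a", 2), ("a", 3), ("a", 9), ("a", 10),
     ("b", 1), ("b", 2), ("b", 8)],
)

rows = conn.execute(
    """
    SELECT col1, col2,
           ROW_NUMBER() OVER (PARTITION BY col1, grp ORDER BY col2) AS rnk
    FROM (SELECT t.*,
                 -- consecutive col2 values share the same difference
                 col2 - ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2) AS grp
          FROM t) AS g
    ORDER BY col1, col2
    """
).fetchall()

for row in rows:
    print(row)
```

For col1 = 'a' the difference is 0 for the values 1 to 3 and 5 for the values 9 and 10, so the numbering restarts at 9, as in the desired output.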

Selecting a row after multiple groupings in postgres

I have a table in a Postgres DB which has the following structure:
id | date | groupme1 | groupme2 | value
----------------------------------------
1 |
2 |
3 |
Now I want to achieve the following:
Group the table by groupme1 and groupme2
Get the value for every group
But only the last entry for each group combination (ordered by date)
Example:
id | date | groupme1 | groupme2 | value
---------------------------------------
| | A | 1 | 4
| | A | 2 | 7
| | A | 3 | 3
| | B | 1 | 9
My current approach looks like this:
SELECT a.*
FROM table AS a
JOIN (SELECT max(id) AS id
FROM table
GROUP BY groupme1, groupme2) AS b
ON a.id = b.id
The problems with this approach:
it assumes that higher dates have a higher id
it takes long
Is there a faster and better way of doing this? Can window functions help with this?
I think you just want window functions:
select t.*
from (select t.*,
row_number() over (partition by groupme1, groupme2 order by date desc) as seqnum
from t
) t
where seqnum = 1;
Or, a better way to do this in Postgres uses distinct on:
select distinct on (groupme1, groupme2) t.*
from t
order by groupme1, groupme2, date desc;
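To illustrate why ordering by date matters more than taking max(id), here is a minimal sketch using an in-memory SQLite database from Python. SQLite has no DISTINCT ON, so only the row_number() form is shown, next to the max(id) join from the question; SQLite 3.25 or later is assumed for window functions. The rows are hypothetical and deliberately include a case where the newest date has the lowest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, date TEXT, groupme1 TEXT, "
    "groupme2 INTEGER, value INTEGER)"
)
# Hypothetical rows: in group (A, 1) the newest date belongs to the lowest id.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2020-01-03", "A", 1, 4),
        (2, "2020-01-01", "A", 1, 5),
        (3, "2020-01-02", "A", 2, 7),
        (4, "2020-01-01", "B", 1, 9),
    ],
)

# The question's approach: pick the row with the highest id per group.
by_max_id = conn.execute(
    """
    SELECT a.id, a.value
    FROM t AS a
    JOIN (SELECT MAX(id) AS id FROM t GROUP BY groupme1, groupme2) AS b
      ON a.id = b.id
    ORDER BY a.id
    """
).fetchall()

# The window-function approach: pick the row with the latest date per group.
by_date = conn.execute(
    """
    SELECT id, value
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY groupme1, groupme2
                                    ORDER BY date DESC) AS seqnum
          FROM t) AS x
    WHERE seqnum = 1
    ORDER BY id
    """
).fetchall()

print(by_max_id)
print(by_date)
```

The two results differ only for group (A, 1), where the max(id) join returns the older row.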

Remove duplicates from query, while repeating

I have an SQL table with some data like this, it is sorted by date:
+----------+------+
| Date | Col2 |
+----------+------+
| 12:00:01 | a |
| 12:00:02 | a |
| 12:00:03 | b |
| 12:00:04 | b |
| 12:00:05 | c |
| 12:00:06 | c |
| 12:00:07 | a |
| 12:00:08 | a |
+----------+------+
So, I want my select result to be the following:
+----------+------+
| Date | Col2 |
+----------+------+
| 12:00:01 | a |
| 12:00:03 | b |
| 12:00:05 | c |
| 12:00:07 | a |
+----------+------+
I have used the DISTINCT clause, but it removes the last two rows with Col2 = 'a'.
You can use lag (SQL Server 2012+) to get the value in the previous row and then compare it with the current row value. If they are equal assign them to one group (1 here) and a different group (0 here) otherwise. Finally select the required rows.
select dt,col2
from (
select dt,col2,
case when lag(col2,1,0) over(order by dt) = col2 then 1 else 0 end as somecol
from t) x
where somecol=0
If you are using Microsoft SQL Server 2012 or later, you can do this:
select date, col2
from (
select date, col2,
case when isnull(lag(col2) over (order by date, col2), '') = col2 then 1 else 0 end as ignore
from (yourtable)
) x
where ignore = 0
This should work as long as col2 cannot contain nulls and if the empty string ('') is not a valid value for col2. The query will need some work if either assumption is not valid.
Same as the accepted answer (+1), just moving the conditions into the WHERE clause.
Assumes col2 is not null.
select dt, col2
from ( select dt, col2,
lag(col2, 1) over(order by dt) as lagCol2
from t
) x
where x.lagCol2 is null or x.lagCol2 <> x.col2
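For anyone who wants to try the lag() comparison without a SQL Server instance, here is a minimal sketch against the question's sample data in an in-memory SQLite database from Python. It assumes SQLite 3.25 or later for window functions and, like the answers above, that col2 contains no nulls; the column name `prev_col2` is just a name chosen for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (dt TEXT, col2 TEXT)")
# Sample data from the question: 12:00:01 .. 12:00:08 with values a a b b c c a a.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(f"12:00:0{i}", v) for i, v in enumerate("aabbccaa", start=1)],
)

rows = conn.execute(
    """
    SELECT dt, col2
    FROM (SELECT dt, col2,
                 LAG(col2) OVER (ORDER BY dt) AS prev_col2
          FROM t) AS x
    -- keep the very first row, and every row whose value differs
    -- from the row immediately before it
    WHERE prev_col2 IS NULL OR prev_col2 <> col2
    ORDER BY dt
    """
).fetchall()

for row in rows:
    print(row)
```

This keeps the first row of each run of equal values, so the second run of 'a' at 12:00:07 survives, unlike with DISTINCT.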