Only select first row of repeating value in a column in SQL - sql

I have table that has a column that may have same values in a burst. Like this:
+----+---------+
| id | Col1 |
+----+---------+
| 1 | 6050000 |
+----+---------+
| 2 | 6050000 |
+----+---------+
| 3 | 6050000 |
+----+---------+
| 4 | 6060000 |
+----+---------+
| 5 | 6060000 |
+----+---------+
| 6 | 6060000 |
+----+---------+
| 7 | 6060000 |
+----+---------+
| 8 | 6060000 |
+----+---------+
| 9 | 6050000 |
+----+---------+
| 10 | 6000000 |
+----+---------+
| 11 | 6000000 |
+----+---------+
Now I want to prune rows where the value of Col1 is repeated and only select the first occurrence.
For the above table the result should be:
+----+---------+
| id | Col1 |
+----+---------+
| 1 | 6050000 |
+----+---------+
| 4 | 6060000 |
+----+---------+
| 9 | 6050000 |
+----+---------+
| 10 | 6000000 |
+----+---------+
How can I do this in SQL?
Note that only burst rows should be removed and values can be repeated in non-burst rows! id=1 & id=9 are repeated in sample result.
EDIT:
I achieved it using this:
select id,col1 from data as d1
where not exists (
Select id from data as d2
where d2.id=d1.id-1 and d1.col1=d2.col1 order by id limit 1)
But this only works when ids are sequential. With gaps between ids (deleted ones) the query breaks. How can I fix this?

You can use a EXISTS semi-join to identify candidates:
Select wanted rows:
SELECT * FROM tbl t
WHERE NOT EXISTS (
SELECT *
FROM tbl
WHERE col1 = t.col1
AND id = t.id - 1
)
ORDER BY id;
Get rid of unwanted rows:
DELETE FROM tbl AS t
-- SELECT * FROM tbl t -- check first?
WHERE EXISTS (
SELECT *
FROM tbl
WHERE col1 = t.col1
AND id = t.id - 1
);
This effectively deletes every row, where the preceding row has the same value in col1, thereby arriving at your set goal: only the first row of every burst survives.
I left the commented SELECT statement because you should always check what is going to be deleted before you do the deed.
Solution for non-sequential IDs:
If your RDBMS supports CTEs and window functions (like PostgreSQL, Oracle, SQL Server, ... but not SQLite prior to v3.25, MS Access or MySQL prior to v8.0.1), there is an elegant way:
WITH cte AS (
SELECT *, row_number() OVER (ORDER BY id) AS rn
FROM tbl
)
SELECT id, col1
FROM cte c
WHERE NOT EXISTS (
SELECT *
FROM cte
WHERE col1 = c.col1
AND rn = c.rn - 1
)
ORDER BY id;
Another way doing the job without those niceties (should work for you):
SELECT id, col1
FROM tbl t
WHERE (
SELECT col1 = t.col1
FROM tbl
WHERE id < t.id
ORDER BY id DESC
LIMIT 1) IS NOT TRUE
ORDER BY id;

select min(id), Col1 from tableName group by Col1

If your RDBMS supports Window Aggregate functions and/or LEAD() and LAG() functions you can leverage them to accomplish what you are trying to report. The following SQL will help get you started down the right path:
SELECT id
, Col AS CurCol
, MAX(Col)
OVER(ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS PrevCol
, MIN(COL)
OVER(ORDER BY id ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS NextCol
FROM MyTable
From there you can put that SQL in a derived table with some CASE logic that if the NextCol or PrevCol is the same as CurCol then set CurCol = NULL. Then you can collapse eliminate all the id records CurCol IS NULL.
If you don't have the ability to use window aggregates or LEAD/LAG functions your task is a little more complex.
Hope this helps.

Since id is always sequential, with no gaps or repetitions, as per your comment, you could use the following method:
SELECT t1.*
FROM atable t1
LEFT JOIN atable t2 ON t1.id = t2.id + 1 AND t1.Col1 = t2.Col1
WHERE t2.id IS NULL
The table is (outer-)joined to itself on the condition that the left side's id is one greater than the right side's and their Col1 values are identical. In other words, the condition is ‘the previous row contains the same Col1 value as the current row’. If there's no match on the right, then the current record should be selected.
UPDATE
To account for non-sequential ids (which, however, are assumed to be unique and defining the order of changes of Col1), you could also try the following query:
SELECT t1.*
FROM atable t1
LEFT JOIN atable t2 ON t1.id > t2.id
LEFT JOIN atable t3 ON t1.id > t3.id AND t3.id > t2.id
WHERE t3.id IS NULL
AND (t2.id IS NULL OR t2.Col1 <> t1.Col1)
The third self-join is there to ensure that the second one yields the row directly preceding that of t1. That is, if there's no match for t3, then either t2 contains the preceding row or it's got no match either, the latter meaning that t1's current row is the top one.

Related

Assistance with SQL Query in Oracle

I heed your help with the following:
I have a table like this:
Table_Values
ID | Value | Date
1 | ASD | 01-Jan-2019
2 | ZXC | 10-Jan-2019
3 | ASD | 01-Jan-2019
4 | QWE | 05-Jan-2019
5 | RTY | 15-Jan-2019
6 | QWE | 29-Jan-2019
That I need is to get the values that are duplicated and have a different Date, for example the value "QWE" is duplicated and has different date:
ID | Value | Date
4 | QWE | 05-Jan-2019
6 | QWE | 29-Jan-2019
With EXISTS:
select * from Table_Values t
where exists (
select 1 from Table_Values
where value = t.value and date <> t.date
)
Using Join:
select
t1.*
from
Table_Values t1
join
Table_Values t2
on t1.Value = t2.Value
and t1.Date <> t2.Date
However, Exists approach is better.
You want all rows where there is more than one date per value. You can use COUNT OVER for this.
One method (featured as of Oracle 12c):
select id, value, date
from mytable
order by case when count(distinct date) over (partition by value) > 1 then 1 else 2 end
fetch first row with ties
But you'll have to put this into a subquery (derived table / cte), if you want the result sorted.
And another method without FETCH FIRST clause (valid as of Oracle 8i):
select id, value, date
from
(
select id, value, date, count(distinct date) over (partition by value) as cnt
from mytable
)
where cnt > 1
order by id, value, date;
forpas' solution with EXISTS may be faster, though. Well, pick whichever method you like better :-)
With EXISTS, "correlated subquery" is used. So I don't think it's better than JOIN.
However, Oracle optimizer could re-write "EXISTS" to JOIN.
I like to use JOIN in classic way :)
SELECT t1.*
FROM table_values t1, table_values t2
WHERE t1.f_value = t2.f_value
AND t1.f_date <> t2.f_date
ORDER BY 1;

How to get a single result with columns from multiple records in a single table?

Platform: Oracle 10g
I have a table (let's call it t1) like this:
ID | FK_ID | SOME_VALUE | SOME_DATE
----+-------+------------+-----------
1 | 101 | 10 | 1-JAN-2013
2 | 101 | 20 | 1-JAN-2014
3 | 101 | 30 | 1-JAN-2015
4 | 102 | 150 | 1-JAN-2013
5 | 102 | 250 | 1-JAN-2014
6 | 102 | 350 | 1-JAN-2015
For each FK_ID I wish to show a single result showing the two most recent SOME_VALUEs. That is:
FK_ID | CURRENT | PREVIOUS
------+---------+---------
101 | 30 | 20
102 | 350 | 250
There is another table (lets call it t2) for the FK_ID, and it is here that there is a reference
saying which is the 'CURRENT' record. So a table like:
ID | FK_CURRENT | OTHER_FIELDS
----+------------+-------------
101 | 3 | ...
102 | 6 | ...
I was attempting this with a flawed sub query join along the lines of:
SELECT id, curr.some_value as current, prev.some_value as previous FROM t2
JOIN t1 curr ON t2.fk_current = t1.id
JOIN t1 prev ON t1.id = (
SELECT * FROM (
SELECT id FROM (
SELECT id, ROW_NUMBER() OVER (ORDER BY SOME_DATE DESC) as rno FROM t1
WHERE t1.fk_id = t2.id
) WHERE rno = 2
)
)
However the t1.fk_id = t2.id is flawed (i.e. wont run), as (I now know) you can't pass a parent
field value into a sub query more than one level deep.
Then I started wondering if Common Table Expressions (CTE) are the tool for this, but then I've no
experience using these (so would like to know I'm not going down the wrong track attempting to use them - if that is the tool).
So I guess the key complexity that is tripping me up is:
Determining the previous value by ordering, but while limiting it to the first record (and not the whole table). (Hence the somewhat convoluted sub query attempt.)
Otherwise, I can just write some code to first execute a query to get the 'current' value, and then
execute a second query to get the 'previous' - but I'd love to know how to solve this with a single
SQL query as it seems this would be a common enough thing to do (sure is with the DB I need to work
with).
Thanks!
Try an approach with LAG function:
SELECT FK_ID ,
SOME_VALUE as "CURRENT",
PREV_VALUE as Previous
FROM (
SELECT t1.*,
lag( some_value ) over (partition by fk_id order by some_date ) prev_value
FROM t1
) x
JOIN t2 on t2.id = x.fk_id
and t2.fk_current = x.id
Demo: http://sqlfiddle.com/#!4/d3e640/15
Try out this:
select t1.FK_ID ,t1.SOME_VALUE as CURRENT,
(select SOME_VALUE from t1 where p1.id2=t1.id and t1.fk_id=p1.fk_id) as PREVIOUS
from t1 inner join
(
select t1.fk_id, max(t1.id) as id1,max(t1.id)-1 as id2 from t1 group by t1.FK_ID
) as p1 on t1.id=p1.id1

Query to skip first row after id changes in SQL Server

I have a long table like the following. The table adds two similar rows after the id changes. E.g in the following table when ID changes from 1 to 2 a duplicate record is added. All I need is a SELECT query to skip this and all other duplicate records only if the ID changes.
# | name| id
--+-----+---
1 | abc | 1
2 | abc | 1
3 | abc | 1
4 | abc | 1
5 | abc | 1
5 | abc | 2
6 | abc | 2
7 | abc | 2
8 | abc | 2
9 | abc | 2
and so on
You could use NOT EXISTS to eliminate the duplicates:
SELECT *
FROM yourtable AS T
WHERE NOT EXISTS
( SELECT 1
FROM yourtable AS T2
WHERE T.[#] = T2.[#]
AND T2.ID > T.ID
);
This will return:
# name ID
------------------
. ... .
4 abc 1
5 abc 2
6 abc 2
. ... .
... (Some irrelevant rows have been removed from the start and the end)
If you wanted the first record to be retained, rather than the last, then just change the condition T2.ID > T.ID to T2.ID < T.ID.
You can use the following CTEs to simulate LAG window function not available in SQL Server 2008:
;WITH CTE_RN AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY [#], id) AS rn
FROM #mytable
), CTE_LAG AS (
SELECT t1.[#], t1.name,
t1.id AS curId, t2.id AS prevId,
t1.[#] AS cur#, t2.[#] AS lag#
FROM CTE_RN t1
LEFT JOIN CTE_RN t2 ON t1.rn = t2.rn + 1 )
You can now filter out the 'duplicate' records using the above CTE_LAG and the following predicate in your WHERE clause:
;WITH (
... cte definitions here
) SELECT *
FROM CTE_LAG
WHERE (NOT ((prevId <> curId) AND (cur# = lag#))) OR (prevId IS NULL)
If prevId <> curId and cur# = lag#, then there is a change in the value of the id column and the following record has the same [#] value as the previous one, i.e. it is a duplicate.
Hence, using NOT on (prevId <> curId) AND (cur# = lag#), filters out all 'duplicate' records. This means record (5, abc, 2) will be eliminated.
SQL Fiddle Demo here
P.S. You can also add column name in the logical expression of the WHERE clause, depending on what defines a 'duplicate'.
So I achieved it by using the following query in SQL server.
select #, name, id
from table
group by #, name, id
having count(*) > 0

SQL Select First column and for each row select unique ID and the last date

I have a problems this mornig , I have tried many solutions and nothing gave me the expected result.
I have a table that looks like this :
+----+----------+-------+
| ID | COL2 | DATE |
+----+----------+-------+
| 1 | 1 | 2001 |
| 1 | 2 | 2002 |
| 1 | 3 | 2003 |
| 1 | 4 | 2004 |
| 2 | 1 | 2001 |
| 2 | 2 | 2002 |
| 2 | 3 | 2003 |
| 2 | 4 | 2004 |
+----+----------+-------+
And I have a query that returns a result like this :
I have the unique ID and for this ID I want to take the last date of the ID
+----+----------+-------+
| ID | COL2 | DATE |
+----+----------+-------+
| 1 | 4 | 2004 |
| 2 | 4 | 2004 |
+----+----------+-------+
But I don't have any idea how I can do that.
I tried Join , CROSS APPLY ..
If you have some idea ,
Thank you
Clement FAYARD
declare #t table (ID INT,Col2 INT,Date INT)
insert into #t(ID,Col2,Date)values (1,1,2001)
insert into #t(ID,Col2,Date)values (1,2,2001)
insert into #t(ID,Col2,Date)values (1,3,2001)
insert into #t(ID,Col2,Date)values (1,4,2001)
insert into #t(ID,Col2,Date)values (2,1,2002)
insert into #t(ID,Col2,Date)values (2,2,2002)
insert into #t(ID,Col2,Date)values (2,3,2002)
insert into #t(ID,Col2,Date)values (2,4,2002)
;with cte as(
select
*,
rn = row_number() over(partition by ID order by Col2 desc)
from #t
)
select
ID,
Col2,
Date
from cte
where
rn = 1
SELECT ID,MAX(Col2),MAX(Date) FROM tableName GROUP BY ID
If col2 and date allways the highest value in combination than you can try
SELECT ID, MAX(COL2), MAX(DATE)
FROM Table1
GROUP BY ID
But it is not realy good.
The alternative is a subquery with:
SELECT yourtable.ID, sub1.COL2, sub1.DATE
FROM yourtable
INNER JOIN -- try with CROSS APPLY for performance AND without ON 1=1
(SELECT TOP 1 COL2, DATE
FROM yourtable sub2
WHERE sub2.ID = topquery.ID
ORDER BY COL2, DATE) sub1 ON 1=1
You didn't tell what's the name of your table so I'll assume below it is tbl:
SELECT m.ID, m.COL2, m.DATE
FROM tbl m
LEFT JOIN tbl o ON m.ID = o.ID AND m.DATE < o.DATE
WHERE o.DATE is NULL
ORDER BY m.ID ASC
Explanation:
The query left joins the table tbl aliased as m (for "max") against itself (alias o, for "others") using the column ID; the condition m.DATE < o.DATE will combine all the rows from m with rows from o having a greater value in DATE. The row having the maximum value of DATE for a given value of ID from m has no pair in o (there is no value greater than the maximum value). Because of the LEFT JOIN this row will be combined with a row of NULLs. The WHERE clause selects only these rows that have NULL for o.DATE (i.e. they have the maximum value of m.DATE).
Check the SQL Antipatterns: Avoiding the Pitfalls of Database Programming book for other SQL tips.
In order to do this you MUST exclude COL2 Your query should look like this
SELECT ID, MAX(DATE)
FROM table_name
GROUP BY ID
The above query produces the Maximum Date for each ID.
Having COL2 with that query does not makes sense, unless you want the maximum date for each ID and COL2
In that case you can run:
SELECT ID, COL2, MAX(DATE)
GROUP BY ID, COL2;
When you use aggregation functions(like max()), you must always group by all the other columns you have in the select statement.
I think you are facing this problem because you have some fundemental flaws with the design of the table. Usually ID should be a Primary Key (Which is Unique). In this table you have repeated IDs. I do not understand the business logic behind the table but it seems to have some flaws to me.

Returning max values of query based on different column

My table stores various versions of the few documents.
-------------------------------
| id | doc_type | download |
-------------------------------
| 1 | type1 | file |
-------------------------------
| 2 | type2 | file |
-------------------------------
| 3 | type3 | file |
-------------------------------
| 4 | type1 | file |
-------------------------------
The table stores different versions of the same type of documents. I need to build a query which will return distinct types of doc_type having max(id) - which is the newest version of the file. Number of doc_types is not limited and is dynamic. My query so far:
select max(id) from template_table
where doc_type in (select distinct doct_type from template_table);
This returns only one largest result. If I could sort results by id ASC and the limit result to 4 largest but it will not guarantee that it will return distinct doc_types. Also number of document types in DB might be changing from 4 it needs to count how many there is.
select * from template_table
order by id limit 4;
Thanks for any help.
Query:
SELECT t1.id,
t1.doc_type,
t1.download
FROM template_table t1
JOIN (SELECT MAX(id) AS id,
doc_typ
FROM template_table
GROUP BY doc_type) t2
ON t2.doc_type = t1.doc_type
AND t2.id = t1.id
OR:
SELECT t1.id,
t1.doc_type,
t1.download
FROM template_table t1
WHERE t1.id = (SELECT MAX(t2.id)
FROM template_table t2
WHERE t2.doc_type = t1.doc_type)
you can use GROUP BY to get the desired result
select
doc_type
, max(id) AS last_id
, max(download) KEEP (DENSE_RANK FIRST order by id desc) AS last_download
from template_table
group by doc_type
;