SQL command to remove groups of entries where all are equal (not merely DISTINCT) - sql

I am so green in SQL that I don't even know how to properly phrase my question or look for an existing answer on Stack Overflow or anywhere else. Sorry!
Assume I have three columns: one is an ID and two are data columns A and B. A single ID can have multiple entries. I would like to remove all entries where A and B are the same for a given ID. Here is an example:
ID | A | B
---+---+---
01 | x | y
01 | x | y
01 | x | y
02 | x | y
02 | x | z
02 | x | y
In this table I would like to remove all three entries that belong to ID 01, since A and B are x and y, respectively, in every row. For ID 02, however, column B differs between the first and second entry, so I would like to keep ID 02. I hope this illustrates the idea sufficiently :-).
I am looking for a 'scalable' solution, as I am not only looking at two data columns A and B, but actually at four different columns.
Does anyone know how to set a proper filter in SQL to remove those entries according to my needs?
Many thanks.
Benjamin

It basically doesn't matter how many columns you actually have, as long as they are identical. This can then be used as the joining basis for a DELETE:
WITH CTE AS
  (SELECT DISTINCT "ID", "A", "B" FROM tab1),
CTE2 AS
  (SELECT "ID", COUNT(*) AS count_ FROM CTE GROUP BY "ID" HAVING COUNT(*) > 1)
SELECT "ID" FROM CTE2
| ID |
| -: |
| 2 |
db<>fiddle here
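For the DELETE itself, the IDs to remove are the ones that do not show up in that result. A minimal sketch, reusing the same tab1 table (dedup is just a name for the DISTINCT subquery; note that some engines won't let you reference the table being deleted from inside a subquery, in which case materialize the ID list first):
DELETE FROM tab1
WHERE "ID" NOT IN (SELECT "ID"
                   FROM (SELECT DISTINCT "ID", "A", "B" FROM tab1) AS dedup
                   GROUP BY "ID"
                   HAVING COUNT(*) > 1);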

SQL pivot with text based fields

Forgive me, but I can't get this working.
I can find lots of complex pivots using numeric values, but nothing basic based on strings to build upon.
Let's suppose this is my source query from a temp table; I can't change this:
select * from #tmpTable
This provides 12 rows:
Row | Name | Code
---------------------------------
1 | July 2019 | 19/20-01
2 | August 2019 | 19/20-02
3 | September 2019 | 19/20-03
.. .. ..
12 | June 2020 | 19/20-12
I want to pivot this and return the data like this:
Data Type | [0] | [1] | [3] | [12]
---------------------------------------------------------------------------
Name | July 2019 | August 2019 | September 2019 | June 2020
Code | 19/20-01 | 19/20-02 | 19/20-03 | 19/20-12
Thanks in advance.
Strings and numbers aren't much different in pivot terms; it's just that you can't use numeric aggregators like SUM or AVG on them. MAX will be fine, and in this case you'll only have one value per cell, so nothing will be lost.
You need to pull your data out into a taller key/value representation before pivoting it back to face the other way round from how it looks now.
Unpivot the data (note the temp table is #tmpTable, as in your question):
WITH upiv AS (
    SELECT 'Name' AS t, [Row] AS r, [Name] AS v FROM #tmpTable
    UNION ALL
    SELECT 'Code', [Row], [Code] FROM #tmpTable
)
Now the data can be regrouped and conditionally aggregated on the r column:
SELECT
    t,
    MAX(CASE WHEN r = 1 THEN v END) AS r1,
    MAX(CASE WHEN r = 2 THEN v END) AS r2,
    ...
    MAX(CASE WHEN r = 12 THEN v END) AS r12
FROM upiv
GROUP BY t
You'll need to put the two SQL blocks I present here together so that they form a single statement. If you want to know more about how this works, I suggest you run the statement inside the WITH block on its own and look at the output, and also remove the GROUP BY and the MAX calls from the full statement and look at that result. You'll see the WITH block makes the data taller: essentially key/value pairs that track what type each value is (name or code). When you run the full SQL without the GROUP BY and MAX, you'll see the tall data spread out sideways, giving a lot of NULLs and a diagonal set of cell data (if ordered by r). The GROUP BY collapses all these NULLs, because MAX will pick any value over NULL (and there is only one non-NULL value per group and column).
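Put together, the combined statement looks roughly like this (a sketch, using the #tmpTable name and the Row/Name/Code columns from your question; columns r3 to r11 follow the same pattern):
WITH upiv AS (
    SELECT 'Name' AS t, [Row] AS r, [Name] AS v FROM #tmpTable
    UNION ALL
    SELECT 'Code', [Row], [Code] FROM #tmpTable
)
SELECT
    t,
    MAX(CASE WHEN r = 1 THEN v END) AS r1,
    MAX(CASE WHEN r = 2 THEN v END) AS r2,
    -- ... r3 through r11 follow the same pattern ...
    MAX(CASE WHEN r = 12 THEN v END) AS r12
FROM upiv
GROUP BY t;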
You could also do this as an UNPIVOT followed by a PIVOT. I've always preferred the form above, because not every database supports the UNPIVOT/PIVOT keywords. Arguably, UNPIVOT/PIVOT could perform better, because there may be specific optimizations the developers can make (for example, UNPIVOT can scan the table once, whereas this UNION approach may require multiple scans, and the ways around that can be more memory intensive), but in this case it's only 12 rows. I suspect you're using SQL Server, but if you're using a database that doesn't understand WITH, you can place the bracketed statement of the WITH (including the brackets) between the FROM and the upiv to make it a subquery, in the pattern SELECT ... FROM (SELECT ... UNION ALL SELECT ...) upiv GROUP BY ...; there is no difference.
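For reference, that WITH-free form of the same statement looks like this (again a sketch, with columns r3 to r11 elided):
SELECT
    t,
    MAX(CASE WHEN r = 1 THEN v END) AS r1,
    -- ... r2 through r11 follow the same pattern ...
    MAX(CASE WHEN r = 12 THEN v END) AS r12
FROM
(
    SELECT 'Name' AS t, [Row] AS r, [Name] AS v FROM #tmpTable
    UNION ALL
    SELECT 'Code', [Row], [Code] FROM #tmpTable
) upiv
GROUP BY t;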
I'll leave renaming the output columns as an exercise for you, but I would urge you to consider not putting spaces or square brackets in the column names as you show in your question.

Same entity from different tables/procedures

I have two procedures (say A and B). They both return data with a similar column set (Id, Name, Count). To be more concrete, example results from the procedures are listed below:
A:
Id Name Count
1 A 10
2 B 11
B:
Id Name Count
1 E 14
2 F 15
3 G 16
4 H 17
The IDs are generated with ROW_NUMBER(), as I don't have my own identifiers for these records because they are aggregated values.
In code I query over both results using the same class, NameAndCountView.
And finally, my problem: when I look at the results after executing both procedures sequentially, I get the following:
A:
Id Name Count
1 A 10 ->|
2 B 11 ->|
|
B: |
Id Name Count |
1 A 10 <-|
2 B 11 <-|
3 G 16
4 H 17
As you can see, the results in the second set are replaced with the results that have the same IDs from the first. Of course the problem takes place because I use the same class for retrieving the data, right?
The question is: how do I make this work without creating an additional NameAndCountView2-like class?
If possible, and if you don't really mind about the original Id values, maybe you can try having the first query return even Ids:
ROW_NUMBER() over (order by .... )*2
while the second returns odd Ids:
ROW_NUMBER() over (order by .... )*2+1
This would also allow you to know where the Ids come from.
I guess this would be repeatable with n queries by having query number i (for i = 0 to n-1) select:
ROW_NUMBER() over (order by .... )*n+i
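As a minimal sketch with the columns from the question (SourceA and SourceB are hypothetical stand-ins for whatever the two procedures currently select from, and the ORDER BY column is just a placeholder for the ordering they already use):
-- first result set: even Ids
SELECT ROW_NUMBER() OVER (ORDER BY [Name]) * 2 AS Id, [Name], [Count]
FROM SourceA;
-- second result set: odd Ids
SELECT ROW_NUMBER() OVER (ORDER BY [Name]) * 2 + 1 AS Id, [Name], [Count]
FROM SourceB;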
Hope this will help

SQL Get n last unique entries by date

I have an Access database that I'm well aware is quite poorly designed; unfortunately it is what I must use. It looks a little like the following:
(Row# is not a column in the database; it's just there to help me describe what I'm after.)
Row# ID Date Misc
1 001 01/8/2013 A
2 001 01/8/2013 B
3 001 01/8/2013 C
4 002 02/8/2013 D
5 002 02/8/2013 A
6 003 04/8/2013 B
7 003 04/8/2013 D
8 003 04/8/2013 D
What I'm trying to do is obtain all information entered for the last n (by date) 'entries', where an 'entry' is all rows sharing the same ID.
So if I want the last 1 entry I will get rows 6, 7 and 8. The last two entries will get me rows 4-8 etc.
I've tried to get the SN's needed in a subselect and then select all entries where those SN's appear, but I couldn't get it to work. Any help appreciated.
Thanks.
The proper Access syntax:
select *
from t
where ID in (select top 10 ID
from t
group by ID
order by max([date]) desc
)
I think this will work:
select *
from table
where Date in (
    select distinct Date as unique_date
    from table
    order by unique_date desc
    limit <num>
)
The idea is to use the subselect with a LIMIT to identify only the dates you care about.
EDIT: Some databases do not allow a LIMIT in a subquery (I'm looking at you, MySQL). In that case, you'll have to make a temporary table out of the subquery and then select * from it.
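A sketch of that workaround in MySQL-style syntax, assuming the table is named t as in the first answer and you want the last two dates (recent_dates is just a throwaway name):
CREATE TEMPORARY TABLE recent_dates AS
    SELECT DISTINCT `Date` AS unique_date
    FROM t
    ORDER BY unique_date DESC
    LIMIT 2;

SELECT t.*
FROM t
INNER JOIN recent_dates ON t.`Date` = recent_dates.unique_date;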

Aggregation over order-dependent partition?

I have a source data set like this (simplified to be more clear):
Key F1 F2
1 X 4
2 X 5
3 Y 6
4 X 9
5 X 7
6 X 8
7 Y 9
8 X 6
9 X 5
10 Y 3
The data is sorted by the Key field. Now, I want to compute an aggregate of the F2 field over partitions that are defined by the F1 field: A partition starts at the first X value and ends with the first subsequent Y value.
So, for example, I might want to compute the MIN() over the partitions defined as described above. Then the result set would look like this:
rownum MIN(F2)
1 4
2 7
3 3
I have tried a number of resources (including our own intranet community and of course Stack Overflow) but found nothing for my case. Usually partitioning only works with a field that can be used to identify the partitions. Here, the partitions are defined by a change in a field's content with respect to a given order.
Although I am aware that I may have to resort to writing a procedural solution I would prefer to solve this in pure SQL.
Any ideas how such a partitioning could be achieved with a SQL select statement?
Thanks and regards
Kai.
A slightly shorter solution: http://sqlfiddle.com/#!12/7390d/24
Query:
select min(f2)
from t t1
group by (select max(key)
          from t t2
          where t2.f1 = 'Y'
            and t1.key > t2.key)
Result:
| MIN |
-------
| 4 |
| 7 |
| 3 |
The idea is to find the key of the preceding 'Y' for each row and group by it. It should work with any SQL engine.
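To see what that grouping key looks like, you can run the correlated subquery on its own (a sketch against the same table t from the fiddle); each row is paired with the key of the last preceding 'Y' row, which is NULL for keys 1-3, 3 for keys 4-7 and 7 for keys 8-10:
select t1.key, t1.f1, t1.f2,
       (select max(t2.key)
        from t t2
        where t2.f1 = 'Y'
          and t1.key > t2.key) as prev_y_key
from t t1
order by t1.key;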
You didn't specify engine or dialect or version so I assumed SQL Server 2012.
Example that you can run to see the solution: http://sqlfiddle.com/#!6/f5d38/21
You solve it by creating the correct partitions in your set. The code looks like this:
WITH groupLimits AS
(
    SELECT
        [Key] AS groupend,
        COALESCE(LAG([Key]) OVER (ORDER BY [Key]), 0) + 1 AS groupstart
    FROM sourceData
    WHERE F1 = 'Y'
)
SELECT
    MIN(sourceData.F2)
FROM groupLimits
INNER JOIN sourceData
    ON sourceData.[Key] BETWEEN groupLimits.groupstart AND groupLimits.groupend
GROUP BY groupLimits.groupstart
ORDER BY groupLimits.groupstart

Select distinct values for a particular column choosing arbitrarily from duplicates

I have health data relating to deaths. An individual should die at most once. In the database they sometimes don't, probably because causes of death were changed but the original entry was not deleted. I don't really understand how this was allowed to happen, but it has. So, as a made-up example, I have:
Row_number | Individual_ID | Cause_of_death | Date_of_death
------------+---------------+-----------------------+---------------
1 | 1 | Stroke | 3 march 2008
2 | 2 | Myocardial infarction | 1 jan 2009
3 | 2 | Pulmonary Embolus | 1 jan 2009
I want each individual to have only one cause of death.
In the example, I want a query that returns row 1 and either row 2 or row 3 (not both). I have to make an arbitrary choice between rows 2 and 3 because there is no timestamp in any of the fields that can be used to determine which is the revision; it's not ideal but is unavoidable.
I can't make the SQL work to do this. I've tried inner joining distinct Individual_ID to the other fields, but this still gives all the rows. I've tried adding a 'having count(Individual_ID) = 1' clause with it. This leaves out people with more than one cause of death completely. Suggestions on the internet seem to be based on using a timestamped field to choose the most recent, but I don't have that.
IBM DB2. Windows XP. Any thoughts gratefully received.
Have you tried using MIN (or MAX) against the cause of death (and the date of death, if they died on two different dates)?
SELECT Individual_ID, MIN(Cause_of_death), MIN(Date_of_death)
FROM deaths
GROUP BY Individual_ID
I don't know DB2 so I'll answer in general. There are two main approaches:
select *
from T
join (
    select keys, min(ID) as MinID
    from T
    group by keys
) on T.ID = MinID
And
select *, row_number() over (partition by keys) as r
from T
where r = 1
Both return every row, whether it has duplicates or not, but they return only one of the duplicates per "key".
Notice that both statements are pseudo-SQL.
The row_number() approach is probably preferable from a performance standpoint. Here is usr's example, in DB2 syntax:
select * from (
select T.*, row_number() over (partition by Individual_ID) as r
from T
)
where r=1;