Discard rows which are not the MAX in their group - SQL

I have data like this:
   a   |   b    |   c
-------|--------|--------
100 | 3 | 50
100 | 4 | 60
101 | 3 | 70
102 | 3 | 70
102 | 4 | 80
102 | 5 | 90
a : key
b : sub_id
c : value
I want to set c to NULL for each row whose c is not the maximum within its a group.
My resulting table must look like:
   a   |   b    |   c
-------|--------|--------
100 | 3 | NULL
100 | 4 | 60
101 | 3 | 70
102 | 3 | NULL
102 | 4 | NULL
102 | 5 | 90
How can I do this with an SQL Query?
UPDATE:
My relational table has about a billion rows. Please keep that in mind when providing an answer. I cannot wait a couple of hours or a day for the update to execute.

Updated after the requirement was changed to "update the table":
with max_values as (
   select a,
          b,
          max(c) over (partition by a) as max_c
   from the_table
)
update the_table
set c = null
from max_values mv
where mv.a = the_table.a
  and mv.b = the_table.b
  and mv.max_c <> the_table.c;
SQLFiddle: http://sqlfiddle.com/#!15/1e739/1
Another possible solution, which might be faster (but you need to check the execution plan):
update the_table t1
set c = null
where exists (select 1
              from the_table t2
              where t2.a = t1.a
                and t1.c < t2.c);
SQLFiddle: http://sqlfiddle.com/#!15/1e739/2
But with "billion" rows there is no way this is going to be really fast.
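As a sanity check of the idea (not the fiddles above, which target PostgreSQL), here is the correlated-subquery variant run against SQLite via Python's sqlite3 module; the table and column names follow the question. Note the update is order-safe: the group's max row never satisfies the predicate, so MAX() is never computed over an already-NULLed maximum.

```python
import sqlite3

# In-memory table mirroring the question's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (a INT, b INT, c INT)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(100, 3, 50), (100, 4, 60), (101, 3, 70),
     (102, 3, 70), (102, 4, 80), (102, 5, 90)],
)

# NULL out c wherever it is below the maximum c for the same a.
conn.execute("""
    UPDATE the_table
    SET c = NULL
    WHERE c < (SELECT MAX(t2.c) FROM the_table t2 WHERE t2.a = the_table.a)
""")

rows = conn.execute("SELECT a, b, c FROM the_table ORDER BY a, b").fetchall()
print(rows)
# [(100, 3, None), (100, 4, 60), (101, 3, 70),
#  (102, 3, None), (102, 4, None), (102, 5, 90)]
```

This matches the expected result table from the question, including the single-row group (101), which keeps its value because c < MAX(c) is false there.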

DECLARE @TAB TABLE (A INT, B INT, C INT)

INSERT INTO @TAB VALUES
(100,3,50),
(100,4,60),
(101,3,70),
(102,3,70),
(102,4,80),
(102,5,90)

UPDATE X
SET C = NULL
FROM @TAB X
LEFT JOIN (
    SELECT A, MAX(C) C
    FROM @TAB
    GROUP BY A) LU ON X.A = LU.A AND X.C = LU.C
WHERE LU.A IS NULL

SELECT * FROM @TAB
Result: the final SELECT returns the table with every non-max C value set to NULL. This approach should help you.

How about this formulation?
select a, b,
       (case when c = max(c) over (partition by a) then c end) as c
from mytable t;
I'm not sure if you can get this faster. An index on a, c might help.
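For what it's worth, the window-function formulation can be checked against SQLite (3.25+ supports window functions; recent Python builds bundle a new-enough library). A runnable sketch, with the table name assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INT, b INT, c INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                 [(100, 3, 50), (100, 4, 60), (101, 3, 70),
                  (102, 3, 70), (102, 4, 80), (102, 5, 90)])

# Read-only variant: mask c in the output unless it equals the
# per-group maximum, leaving the stored data untouched.
rows = conn.execute("""
    SELECT a, b,
           CASE WHEN c = MAX(c) OVER (PARTITION BY a) THEN c END AS c
    FROM mytable
    ORDER BY a, b
""").fetchall()
print(rows)
```

Because it is a SELECT rather than an UPDATE, nothing is rewritten, which may be the better fit for a billion-row table if the NULLs are only needed at read time.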

SELECT a, b,
       CASE ROW_NUMBER() OVER (PARTITION BY a ORDER BY c DESC) WHEN 1 THEN c END AS c
FROM mytable

Related

Get count of foreign key from multiple tables

I have 3 tables, with Tables B & C referencing Table A via foreign key. I want to write a query in PostgreSQL to get all ids from A together with their total number of occurrences in B & C.
a | b | c
-----------------------------------
id | txt | id | a_id | id | a_id
---+---- | ---+----- | ---+------
1 | a | 1 | 1 | 1 | 3
2 | b | 2 | 1 | 2 | 4
3 | c | 3 | 3 | 3 | 4
4 | d | 4 | 4 | 4 | 4
Output desired (just the id from A & total count in B & C) :
id | Count
---+-------
1 | 2 -- twice in B
2 | 0 -- occurs nowhere
3 | 2 -- once in B & once in C
4 | 4 -- once in B & thrice in C
SQL so far (SQL Fiddle):
SELECT a_id, COUNT(a_id)
FROM
( SELECT a_id FROM b
UNION ALL
SELECT a_id FROM c
) AS union_table
GROUP BY a_id
The query I wrote fetches from B & C and counts the occurrences. But if a key doesn't occur in B or C, it doesn't show up in the output (e.g. id=2 in the desired output). How can I start my selection from table A and join/union B & C to get the desired output?
If the query involves large parts of b and / or c, it is more efficient to aggregate first and join later.
I expect these two variants to be considerably faster:
SELECT a.id
     , COALESCE(b.ct, 0) + COALESCE(c.ct, 0) AS bc_ct
FROM   a
LEFT   JOIN (SELECT a_id, count(*) AS ct FROM b GROUP BY 1) b ON b.a_id = a.id
LEFT   JOIN (SELECT a_id, count(*) AS ct FROM c GROUP BY 1) c ON c.a_id = a.id;
You need to account for the possibility that some ids are not present at all in b and / or c. count() never returns NULL, but that's cold comfort in the face of LEFT JOIN, which leaves you with NULL values for missing rows nonetheless. You must prepare for NULL. Use COALESCE().
Or UNION ALL a_id from both tables, aggregate, then JOIN:
SELECT a.id
     , COALESCE(ct.bc_ct, 0) AS bc_ct
FROM   a
LEFT   JOIN (
   SELECT a_id, count(*) AS bc_ct
   FROM  (
      SELECT a_id FROM b
      UNION ALL
      SELECT a_id FROM c
      ) bc
   GROUP  BY 1
   ) ct ON ct.a_id = a.id;
Probably slower. But still faster than the solutions presented so far. And you could do without COALESCE() and still not lose any rows; you might just get occasional NULL values for bc_ct in this case.
Another option:
SELECT
a.id,
(SELECT COUNT(*) FROM b WHERE b.a_id = a.id) +
(SELECT COUNT(*) FROM c WHERE c.a_id = a.id)
FROM
a
Use left join with a subquery:
SELECT a.id, COUNT(x.id)
FROM a
LEFT JOIN (
SELECT id, a_id FROM b
UNION ALL
SELECT id, a_id FROM c
) x ON (a.id = x.a_id)
GROUP BY a.id;
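The aggregate-first-then-LEFT-JOIN approach is easy to verify on the sample data; here is a small SQLite reproduction via Python's sqlite3 (table and column names as in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INT, txt TEXT);
    CREATE TABLE b (id INT, a_id INT);
    CREATE TABLE c (id INT, a_id INT);
    INSERT INTO a VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d');
    INSERT INTO b VALUES (1,1),(2,1),(3,3),(4,4);
    INSERT INTO c VALUES (1,3),(2,4),(3,4),(4,4);
""")

# Aggregate each child table down to one count per a_id first, then
# LEFT JOIN the small result back to a; COALESCE turns "no match" into 0.
rows = conn.execute("""
    SELECT a.id, COALESCE(b.ct, 0) + COALESCE(c.ct, 0) AS bc_ct
    FROM a
    LEFT JOIN (SELECT a_id, COUNT(*) AS ct FROM b GROUP BY a_id) b ON b.a_id = a.id
    LEFT JOIN (SELECT a_id, COUNT(*) AS ct FROM c GROUP BY a_id) c ON c.a_id = a.id
    ORDER BY a.id
""").fetchall()
print(rows)  # [(1, 2), (2, 0), (3, 2), (4, 4)]
```

Starting the query from table a is what keeps id=2 in the result; the original UNION ALL-only query could never produce a row for an id that appears in neither b nor c.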

Select greatest number of unique pairs from table

I have the following table:
| a | b |
|---|---|
| 2 | 4 | x
| 2 | 5 |
| 3 | 1 | x
| 6 | 4 |
| 6 | 5 | x
| 7 | 5 |
| 7 | 4 |
|---|---|
I want to select the greatest number of unique pairs possible, where neither a nor b is repeated. The entries marked with an x are the ones the select should grab. Any ideas how to do this?
Currently I have some SQL that does the opposite: it selects the pairs that aren't unique and deletes them, but it has not been working the way I want it to. This is the SQL I have right now, but I think I'm going to scrap it and approach the problem from the angle stated above.
delete t
from #temp2 t
where (exists(select * from #temp2
              where b = t.b
                and a < t.a)
    or exists(select * from #temp2
              where a = t.a
                and b < t.b))
  and (not exists(select * from #temp2
                  where b = t.b
                    and a < t.a)
    or not exists(select * from #temp2
                  where a = t.a
                    and b < t.b))
Thanks!
I'm assuming here that being non-unique and unique are mutually exclusive and will encompass all records in your table. If so, use your existing script, write it to a CTE, then join to the CTE from your source table selecting those records not in the CTE.
With Non_Unique_Records as (
--Insert your existing script here
)
Select t.a
, t.b
From #temp2 t
Left Outer Join Non_Unique_Records CTE
on t.a = CTE.a
and t.b = CTE.b
Where CTE.b is null
Then just delete the records that the Select statement returns.

Trying to select multiple columns where one is unique

I am trying to select several columns from a table where one of the columns is unique. The select statement looks something like this:
select a, distinct b, c, d
from mytable
The table looks something like this:
| a | b | c | d | e |...
|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 5
| 1 | 2 | 3 | 4 | 6
| 2 | 5 | 7 | 1 | 9
| 7 | 3 | 8 | 6 | 4
| 7 | 3 | 8 | 6 | 7
So the query should return something like this:
| a | b | c | d |
|---|---|---|---|
| 1 | 2 | 3 | 4
| 2 | 5 | 7 | 1
| 7 | 3 | 8 | 6
I just want to remove all of the rows where b is duplicated.
EDIT: There seems to be some confusion about which row I want to be selected in the case of duplicate b values. I don't care because the a, c, and d should (but are not guaranteed to) be the same.
Try this:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY b ORDER BY a) AS NO, *
      FROM TableName) AS T1
WHERE NO = 1
I think you are nearly there with DISTINCT try:
SELECT DISTINCT a, b, c, d
FROM myTable
You haven't said how to pick a row for each b value, but this will pick one for each.
Select
a,
b,
c,
d,
e
From (
Select
a,
b,
c,
d,
e,
row_number() over (partition by b order by b) rn
From
mytable
) x
Where
x.rn = 1
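The ROW_NUMBER() approach above can be exercised against SQLite through Python's sqlite3 (the table name follows the question; which row survives per b-group is arbitrary, but since a, c, and d match within each group here, the projection is deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (a INT, b INT, c INT, d INT, e INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
                 [(1, 2, 3, 4, 5), (1, 2, 3, 4, 6), (2, 5, 7, 1, 9),
                  (7, 3, 8, 6, 4), (7, 3, 8, 6, 7)])

# Number the rows within each distinct b value and keep only the first,
# i.e. one arbitrary representative per b.
rows = conn.execute("""
    SELECT a, b, c, d
    FROM (SELECT a, b, c, d,
                 ROW_NUMBER() OVER (PARTITION BY b ORDER BY b) AS rn
          FROM mytable)
    WHERE rn = 1
    ORDER BY a
""").fetchall()
print(rows)  # [(1, 2, 3, 4), (2, 5, 7, 1), (7, 3, 8, 6)]
```

Changing the ORDER BY inside the window (e.g. ORDER BY e DESC) is how you would make the choice of surviving row deterministic.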
If you don't care what values you get for B, C, D, and E, as long as they're appropriate for that key, you can group by A:
SELECT A, MIN(B), MIN(C), MIN(D), MIN(E)
FROM MyTable
GROUP BY A
Note that MAX() would be just as valid. Some RDBMSs support a FIRST() aggregate, or similar, for exactly these circumstances where you don't care which value you get (from a certain population).
This will return what you're looking for but I think your example is flawed because you've no determinism over which value from the e column is returned.
Create Table A1 (a int, b int, c int, d int, e int)
INSERT INTO A1 (a,b,c,d,e) VALUES (1,2,3,4,5)
INSERT INTO A1 (a,b,c,d,e) VALUES (1,2,3,4,6)
INSERT INTO A1 (a,b,c,d,e) VALUES (2,5,7,1,9)
INSERT INTO A1 (a,b,c,d,e) VALUES (7,3,8,6,4)
INSERT INTO A1 (a,b,c,d,e) VALUES (7,3,8,6,7)
SELECT * FROM A1
SELECT a,b,c,d
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY b ORDER BY a) RowNum ,*
FROM A1
) As InnerQuery WHERE RowNum = 1
You cannot put DISTINCT on a single column. You should put it right after the SELECT:
SELECT DISTINCT a, b, c, d
FROM mytable
It returns the result you need for your sample table. However, if you need to remove duplicates based only on a single column (which DISTINCT alone cannot do), you have probably misunderstood something. Give us more description and samples, and we will try to guide you in the right direction.

Return count(*) even if 0

I have the following query:
select bb.Name, COUNT(*) as Num from BOutcome bo
JOIN BOffers bb ON bo.ID = bb.BOutcomeID
WHERE bo.EventID = 123 AND bo.OfferTypeID = 321 AND bb.NumA > bb.NumB
GROUP BY bb.Name
The table looks like:
Name | Num A | Num B
A | 10 | 3
B | 2 | 3
C | 10 | 3
A | 9 | 3
B | 2 | 3
C | 9 | 3
The expected output should be:
Name | Count
A | 2
B | 0
C | 2
Because when the name is A or C, Num A is bigger than Num B in two records, and when the name is B, Num A is lower than Num B in both records.
My current output is:
Name | Count
A | 2
C | 2
Because B's count is 0, I am not getting it back in my query.
What is wrong with my query? How can I get it back?
Here is my guess. I think this is a much simpler approach than all of the left/right join hoops people have been spinning their wheels on. Since the output of the query relies only on columns in the left table, there is no need for an explicit join at all:
SELECT
bb.Name,
[Count] = SUM(CASE WHEN bb.NumA > bb.NumB THEN 1 ELSE 0 END)
-- just FYI, the above could also be written as:
-- [Count] = COUNT(CASE WHEN bb.NumA > bb.NumB THEN 1 END)
FROM dbo.BOffers AS bb
WHERE EXISTS
(
SELECT 1 FROM dbo.BOutcome
WHERE ID = bb.BOutcomeID
AND EventID = 123
AND OfferTypeID = 321
)
GROUP BY bb.Name;
Of course, we're not really sure that both Name and NumA/NumB are in the left table, since the OP talks about two tables but only shows one table in the sample data. My guess is based on the query he says is "working" but missing rows because of the explicit join.
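The conditional-aggregation core of this answer (SUM over a CASE, so zero-count groups survive) can be demonstrated on the sample data alone; a stripped-down SQLite sketch via Python's sqlite3, omitting the BOutcome filter since the question only shows one table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BOffers (Name TEXT, NumA INT, NumB INT)")
conn.executemany("INSERT INTO BOffers VALUES (?, ?, ?)",
                 [('A', 10, 3), ('B', 2, 3), ('C', 10, 3),
                  ('A', 9, 3), ('B', 2, 3), ('C', 9, 3)])

# Count only the rows satisfying the condition, but keep every group:
# unlike a WHERE filter, a non-matching group yields 0 instead of vanishing.
rows = conn.execute("""
    SELECT Name,
           SUM(CASE WHEN NumA > NumB THEN 1 ELSE 0 END) AS Cnt
    FROM BOffers
    GROUP BY Name
    ORDER BY Name
""").fetchall()
print(rows)  # [('A', 2), ('B', 0), ('C', 2)]
```

The row for B appears with a count of 0, which is exactly what the WHERE-clause version in the question could never produce.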
Another wild guess. Feel free to downvote:
SELECT ba.Name, COUNT(bb.BOutcomeID) as Num
FROM
( SELECT DISTINCT ba.Name
FROM
BOutcome AS b
JOIN
BOffers AS ba
ON ba.BOutcomeID = b.ID
WHERE b.EventID = 123
AND b.OfferTypeID = 321
) AS ba
LEFT JOIN
BOffers AS bb
ON bb.Name = ba.Name
AND bb.NumA > bb.NumB
GROUP BY ba.Name ;

SQL problem / challenge

I want to get
id a b c
--------------------
1 1 100 90
6 2 50 100
...from:
id a b c
--------------------
1 1 100 90
2 1 300 50
3 1 200 20
4 2 200 30
5 2 300 70
6 2 50 100
It's the row with the smallest b, grouped by a.
How can I do this with SQL?
EDIT
I thought it can be achieved by
select * from table group by a having min(b);
which I found later it's wrong.
But is it possible to do it with having statement?
I'm using MySQL
SELECT t1.*
FROM mytable t1
LEFT OUTER JOIN mytable t2
ON (t1.a=t2.a AND t1.b>t2.b)
WHERE t2.a IS NULL;
This works because there should be no matching row t2 with the same a and a lesser b.
update: This solution has the same issue with ties that other folks have identified. However, we can break ties:
SELECT t1.*
FROM mytable t1
LEFT OUTER JOIN mytable t2
ON (t1.a=t2.a AND (t1.b>t2.b OR t1.b=t2.b AND t1.id>t2.id))
WHERE t2.a IS NULL;
Assuming for instance that in the case of a tie, the row with the lower id should be the row we choose.
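The anti-join with the tie-break runs unchanged on SQLite; a self-contained check via Python's sqlite3, using the question's sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INT, a INT, b INT, c INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)",
                 [(1, 1, 100, 90), (2, 1, 300, 50), (3, 1, 200, 20),
                  (4, 2, 200, 30), (5, 2, 300, 70), (6, 2, 50, 100)])

# Anti-join: keep a row only when no other row in the same a-group has a
# smaller b (ties broken in favour of the lower id).
rows = conn.execute("""
    SELECT t1.*
    FROM mytable t1
    LEFT OUTER JOIN mytable t2
      ON t1.a = t2.a AND (t1.b > t2.b OR (t1.b = t2.b AND t1.id > t2.id))
    WHERE t2.a IS NULL
    ORDER BY t1.id
""").fetchall()
print(rows)  # [(1, 1, 100, 90), (6, 2, 50, 100)]
```

That is exactly the desired result set: one row per a, the one with the minimum b.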
This doesn't do the trick:
select * from table group by a having min(b);
Because HAVING MIN(b) only tests that the least value in the group is not false (which in MySQL means not zero). The condition in a HAVING clause is for excluding groups from the result, not for choosing the row within the group to return.
In MySQL:
select t1.* from test as t1
inner join
(select t2.a, min(t2.b) as min_b from test as t2 group by t2.a) as subq
on subq.a=t1.a and subq.min_b=t1.b;
Here is the proof:
mysql> create table test (id int unsigned primary key auto_increment, a int unsigned not null, b int unsigned not null, c int unsigned not null) engine=innodb;
Query OK, 0 rows affected (0.55 sec)
mysql> insert into test (a,b,c) values (1,100,90), (1,300,50), (1,200,20), (2,200,30), (2,300,70), (2,50,100);
Query OK, 6 rows affected (0.39 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> select * from test;
+----+---+-----+-----+
| id | a | b | c |
+----+---+-----+-----+
| 1 | 1 | 100 | 90 |
| 2 | 1 | 300 | 50 |
| 3 | 1 | 200 | 20 |
| 4 | 2 | 200 | 30 |
| 5 | 2 | 300 | 70 |
| 6 | 2 | 50 | 100 |
+----+---+-----+-----+
6 rows in set (0.00 sec)
mysql> select t1.* from test as t1 inner join (select t2.a, min(t2.b) as min_b from test as t2 group by t2.a) as subq on subq.a=t1.a and subq.min_b=t1.b;
+----+---+-----+-----+
| id | a | b | c |
+----+---+-----+-----+
| 1 | 1 | 100 | 90 |
| 6 | 2 | 50 | 100 |
+----+---+-----+-----+
2 rows in set (0.00 sec)
Use:
SELECT DISTINCT
x.*
FROM TABLE x
JOIN (SELECT t.a,
MIN(t.b) 'min_b'
FROM TABLE T
GROUP BY t.a) y ON y.a = x.a
AND y.min_b = x.b
You're right: select min(b), a from table group by a. If you want the entire row, then you'd use an analytic function. That depends on the database software.
It depends on the implementation, but this is usually faster than the self-join method:
SELECT id, a, b, c
FROM
(
    SELECT id, a, b, c
         , ROW_NUMBER() OVER (PARTITION BY a ORDER BY b ASC) AS [b IN a]
    FROM mytable
) AS SubqueryA
WHERE [b IN a] = 1
Of course it does require that your SQL implementation be fairly up-to-date with the standard.