Distinct values by multiple columns after already applied GROUP BY - sql

Basically, I have the following query (actually more complex, but I think this simplification is ok):
SELECT a, b, x
FROM table
output:
a | b | x
-----------
1 | 2 | 34
1 | 3 | 35
1 | 3 | 36
1 | 4 | 37
2 | 3 | 38
2 | 3 | 39
2 | 4 | 40
3 | 4 | 41
3 | 5 | 42
To count the number of occurrences of each "pair of a and b", I'm using GROUP BY:
SELECT a, b, COUNT(x) AS count
FROM table
GROUP BY a, b
ORDER BY count
output:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1
2 | 4 | 1
3 | 4 | 1
3 | 5 | 1
1 | 3 | 2
2 | 3 | 2
What bothers me is the multiple occurrences of a and b. I would like to keep the "count" as it is, but remove every following row if its a or b already appeared in a previous row. As a nice-to-have, it would also remove a row if its value of "a" appeared in a previous row as "b", and vice versa.
Preferred expected output:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1 <- should not be in output since we had a=1
2 | 4 | 1 <- should not be in output since we had b=
3 | 4 | 1
3 | 5 | 1 <- should not be in output since we had a=3
1 | 3 | 2 <- should not be in output since we had a=1 / a=3
2 | 3 | 2 <- should not be in output since we had b=2 / a=3
Therefore, this:
a | b | count
--------------
1 | 2 | 1
3 | 4 | 1
Alternative expected output, if the above would be too complex:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1 <- should not be in output since we had a=1
2 | 4 | 1
3 | 4 | 1 <- should not be in output since we had b=4
3 | 5 | 1
1 | 3 | 2 <- should not be in output since we had a=1
2 | 3 | 2 <- should not be in output since we had a=2
Therefore, this:
a | b | count
--------------
1 | 2 | 1
2 | 4 | 1
3 | 5 | 1

This is rather a mess of a problem, but here's something to consider:
SELECT a, b, count
FROM (
    SELECT a, b, count,
        rank() over (partition by b order by count, a) as b_rank
    FROM (
        SELECT a, b, count,
            rank() over (partition by a order by count, b) as a_rank
        FROM (
            SELECT a, b, COUNT(*) AS count
            FROM t
            GROUP BY a, b
            ORDER BY count
        ) pc
    ) pc2
    WHERE a_rank < 3
) pc3
WHERE b_rank = 1
Each a value will appear at most twice in the results, but b values will be unique. Some b values appearing in low-count pairs may not be reflected in the results. There is a trade-off between possible duplication of a and the number of b values that may be missed altogether: allowing more duplicates of a (e.g., by changing the filter to WHERE a_rank < 4) reduces the number of b values that may be missed.
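To see that trade-off on the question's sample data, here is a small sketch that runs the same nested rank() query through Python's built-in sqlite3 module. The in-memory database and the alias cnt (used instead of count) are assumptions made for this illustration, and it needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# In-memory copy of the question's sample rows (a, b, x).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, x INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 2, 34), (1, 3, 35), (1, 3, 36), (1, 4, 37), (2, 3, 38),
     (2, 3, 39), (2, 4, 40), (3, 4, 41), (3, 5, 42)],
)

# Same shape as the answer's query: rank within each a, keep the two
# lowest-count pairs per a, then keep the best-ranked pair per b.
ranked_pairs = conn.execute("""
    SELECT a, b, cnt
    FROM (
        SELECT a, b, cnt,
               rank() OVER (PARTITION BY b ORDER BY cnt, a) AS b_rank
        FROM (
            SELECT a, b, cnt,
                   rank() OVER (PARTITION BY a ORDER BY cnt, b) AS a_rank
            FROM (SELECT a, b, COUNT(*) AS cnt FROM t GROUP BY a, b) pc
        ) pc2
        WHERE a_rank < 3
    ) pc3
    WHERE b_rank = 1
    ORDER BY cnt, a, b
""").fetchall()

for row in ranked_pairs:
    print(row)
```

On this data the query keeps the pairs (1, 2), (1, 4), (3, 5) and (2, 3): every b is unique, while a = 1 shows up twice, which is the duplication the answer warns about.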

This query will give you the desired output.
DECLARE @id INT = 1,
        @a INT,
        @b INT,
        @count INT
DECLARE @tbl TABLE
(
    id INT IDENTITY(1,1),
    a INT,
    b INT,
    count INT
)
-- Number the pairs in ascending order of count, then a, then b
INSERT INTO @tbl
SELECT a, b, COUNT(1) AS COUNT FROM dbo.myTable
GROUP BY a, b
ORDER BY COUNT,a,b
SELECT @count = COUNT(1) FROM @tbl
-- Walk the rows in order; delete a row if an earlier surviving row shares its a or its b
WHILE @id <= @count
BEGIN
    SELECT TOP 1 @a = a,@b = b FROM @tbl WHERE id = @id
    IF EXISTS(SELECT 1 FROM @tbl WHERE id < @id AND (a = @a OR b = @b))
        DELETE @tbl WHERE id = @id
    SET @id += 1
END
SELECT a,b,count FROM @tbl
Check it on SQLFiddle
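The loop is a greedy pass over the pairs sorted by count, and the same idea can be sketched outside the database. The function name greedy_pairs and the cross_columns flag below are made up for this illustration; the flag switches between the question's "alternative" rule (a compared with earlier a values, b with earlier b values) and the "preferred" rule (a value is blocked once it has appeared in either column of a kept row).

```python
from collections import Counter

# Sample rows (a, b, x) from the question.
rows = [(1, 2, 34), (1, 3, 35), (1, 3, 36), (1, 4, 37), (2, 3, 38),
        (2, 3, 39), (2, 4, 40), (3, 4, 41), (3, 5, 42)]


def greedy_pairs(rows, cross_columns=False):
    """Keep a pair only if no previously kept pair shares a value with it."""
    counts = Counter((a, b) for a, b, _ in rows)
    # Same ordering as the SQL: ascending count, then a, then b.
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0][0], kv[0][1]))
    seen_a, seen_b = set(), set()
    kept = []
    for (a, b), n in ordered:
        if cross_columns:
            used = seen_a | seen_b
            clash = a in used or b in used
        else:
            clash = a in seen_a or b in seen_b
        if clash:
            continue  # dropped rows do not block later rows
        kept.append((a, b, n))
        seen_a.add(a)
        seen_b.add(b)
    return kept


alternative = greedy_pairs(rows)
preferred = greedy_pairs(rows, cross_columns=True)
print(alternative)
print(preferred)
```

With the sample data, the default call reproduces the alternative expected output, and cross_columns=True reproduces the preferred one.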

Related

Select rows with a column equal to 1 OR those that are referenced in an other table

I have 2 tables A and B, as so:
A:              B:
A_id | Val      B_id | A_id
----------      -----------
1    | 1        1    | 1
2    | 1        2    | 1
3    | 2        3    | 2
4    | 1        4    | 2
5    | 3        5    | 5
                6    | 1
I would like to select rows from A where Val = 1 OR those that are referenced in B.
So for this particular example, the select would retrieve:
A_id | Val
----------
1 | 1
2 | 1
4 | 1
5 | 3
Note that row 4 is not referenced in table B, but Val is equal to one, and Val from row 5 is != 1 but the row is referenced in table B.
I tried using the DISTINCT keyword but the problem is that it doesn't select row 4 because it is not referenced in table B.
Thanks for your help.
You can use EXISTS:
SELECT A.*
FROM A
WHERE A.Val = 1
OR EXISTS (SELECT 1 FROM B WHERE B.A_id = A.A_id);
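As a quick check of the EXISTS query against the question's data, the sketch below loads both tables into an in-memory SQLite database through Python's sqlite3 module. The SQLite setup is an assumption for illustration only, and the trailing "6 | 1" row is read as a row of B, which is the reading consistent with the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (A_id INTEGER, Val INTEGER)")
conn.execute("CREATE TABLE B (B_id INTEGER, A_id INTEGER)")
conn.executemany("INSERT INTO A VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2), (4, 1), (5, 3)])
# The last row (6, 1) is assumed to belong to B.
conn.executemany("INSERT INTO B VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2), (4, 2), (5, 5), (6, 1)])

# Each row of A is returned at most once, however many rows of B reference it.
selected = conn.execute("""
    SELECT A.*
    FROM A
    WHERE A.Val = 1
       OR EXISTS (SELECT 1 FROM B WHERE B.A_id = A.A_id)
    ORDER BY A.A_id
""").fetchall()

print(selected)
```

Because EXISTS only tests for a match, rows 1 and 2 are not duplicated even though B references each of them more than once, so no DISTINCT is needed.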

generate serial number in decreasing order given a variable in netezza aginity sql

Is there any syntax in Netezza SQL that, given a NUMBER column, generates one row per value in decreasing order down to 0?
Below is an example of what I'm trying to do
BEFORE
ID | NUMBER
-----------
A | 4
B | 5
AFTER
ID | NUMBER
-----------
A | 4
A | 3
A | 2
A | 1
B | 5
B | 4
B | 3
B | 2
B | 1
You can use the _v_vector_idx table for this purpose
select id, idx
from test
join _v_vector_idx on idx <= number
order by id asc, idx desc;
Here's the example in action
select * from test
ID | NUMBER
-------+--------
A | 4
B | 5
(2 rows)
select id, idx from test join _v_vector_idx on
idx <= number order by id asc, idx desc ;
ID | IDX
-------+-----
A | 4
A | 3
A | 2
A | 1
A | 0
B | 5
B | 4
B | 3
B | 2
B | 1
B | 0
(11 rows)
insert into test values ('C', 3);
INSERT 0 1
select * from test;
ID | NUMBER
-------+--------
A | 4
B | 5
C | 3
(3 rows)
select id, idx from test join _v_vector_idx
on idx <= number order by id asc, idx desc ;
ID | IDX
-------+-----
A | 4
A | 3
A | 2
A | 1
A | 0
B | 5
B | 4
B | 3
B | 2
B | 1
B | 0
C | 3
C | 2
C | 1
C | 0
(15 rows)
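The output above includes a row for 0, while the AFTER table in the question stops at 1. The sketch below shows the same join-against-a-numbers-table idea with a lower bound added. It runs on SQLite through Python's sqlite3 module rather than on Netezza, and the numbers table is a hypothetical stand-in for _v_vector_idx.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, number INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [("A", 4), ("B", 5)])

# Hypothetical helper table holding 0..10, standing in for _v_vector_idx.
conn.execute("CREATE TABLE numbers (idx INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(11)])

# BETWEEN 1 AND number drops the idx = 0 row; use 0 as the lower bound to keep it.
expanded = conn.execute("""
    SELECT id, idx
    FROM test
    JOIN numbers ON idx BETWEEN 1 AND number
    ORDER BY id ASC, idx DESC
""").fetchall()

for row in expanded:
    print(row)
```

The helper table only needs to cover the largest value in the number column.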

Group observations with SQL and Specifying in same group

I have a table consisting of two columns (X,Y) that represent correlations between observations like below.
X Y
1 2
2 3
3 4
A B
B C
I want to create a new column that represents the relation between observations: 1 becomes 2, 2 becomes 3, and 3 becomes 4, so I want to show these values in the same group (1, 2, 3, and 4 belong to the same group). The table should look like this:
X Y Z
1 2 Group 1
2 3 Group 1
3 4 Group 1
A B Group 2
B C Group 2
I am using SAS Enterprise Guide. The solution would be great with proc sql or any sql type. I need the logic.
Note: I have no additional information except this table.
Try the following. The demo is in PostgreSQL, but you may be able to use the same logic.
with cte as
(
    select
        *,
        lag(y) over (order by x) as rnk
    from myTable
)
select
    x,
    y,
    concat('Group ', sum(case when x = rnk then 0 else 1 end) over (order by x)) as z
from cte;
Output:
| x | y | z |
| --- | --- | ------- |
| 1 | 2 | Group 1 |
| 2 | 3 | Group 1 |
| 3 | 4 | Group 1 |
| A | B | Group 2 |
| B | C | Group 2 |
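The lag() approach relies on each chain appearing as consecutive rows when ordered by x. If that cannot be assumed, the grouping is a connected-components problem. Below is a minimal union-find sketch of that alternative in Python; it is a different technique from the query above, and the function name label_groups is made up for this example.

```python
def label_groups(pairs):
    """Label each (x, y) pair with the connected component it belongs to."""
    parent = {}

    def find(v):
        # Register unseen values as their own root, then walk up to the root.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Union the two endpoints of every pair.
    for x, y in pairs:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Number the components in order of first appearance.
    group_no = {}
    labelled = []
    for x, y in pairs:
        root = find(x)
        if root not in group_no:
            group_no[root] = len(group_no) + 1
        labelled.append((x, y, f"Group {group_no[root]}"))
    return labelled


pairs = [("1", "2"), ("2", "3"), ("3", "4"), ("A", "B"), ("B", "C")]
labelled = label_groups(pairs)
for row in labelled:
    print(row)
```

Unlike the window-function version, this gives the same grouping when the rows arrive in any order, at the cost of leaving plain SQL.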

PostgreSQL distinct rows joined with a count of distinct values in one column

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
This works with md5(a) in place of a, since a can obviously be very long and you already have an index on md5(a).
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
       t.a, t.b, t.u, t.t, a.z
FROM (
    SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
    FROM t
    GROUP BY 1
    HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM (
    SELECT DISTINCT ON (1, 2, 3, 4)
           md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
    FROM t
) t
JOIN (
    SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
    FROM t
    GROUP BY 1
    HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
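To check the logic of the two steps (deduplicate the rows, then join the per-a count of distinct b values) against the question's sample data, here is a simplified sketch using Python's sqlite3 module. It is not the PostgreSQL query above: SQLite has neither DISTINCT ON nor a built-in md5(), so the sketch uses plain SELECT DISTINCT and joins on a directly, and the aliases d and zc are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER, u INTEGER, t INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    ("foo", 1, 1, 10), ("foo", 1, 2, 11), ("foo", 1, 2, 11), ("foo", 2, 4, 1),
    ("foo", 3, 5, 2), ("bar", 1, 6, 2), ("bar", 2, 7, 2), ("bar", 2, 8, 3),
    ("bar", 3, 9, 4), ("bar", 4, 10, 5), ("bar", 5, 11, 6), ("baz", 1, 12, 1),
    ("baz", 1, 13, 2), ("baz", 1, 13, 2), ("baz", 1, 13, 3),
])

result = conn.execute("""
    SELECT d.a, d.b, d.u, d.t, zc.z
    FROM (SELECT DISTINCT a, b, u, t FROM t) AS d        -- step 1: deduplicate
    JOIN (
        SELECT a, COUNT(DISTINCT b) AS z                 -- step 2: distinct b per a
        FROM t
        GROUP BY a
        HAVING COUNT(DISTINCT b) > 2
    ) AS zc ON zc.a = d.a
    ORDER BY d.a, d.b, d.u, d.t
""").fetchall()

for row in result:
    print(row)
```

This returns the ten foo and bar rows from the desired result (sorted by a, so bar comes first) and drops baz, which has only one distinct b.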

The Rows Holding the Group-wise Maximum of a Certain Column (how to kill duplicates...)

I asked a similar question earlier, but the similarity is superficial; the problem lies somewhere deeper...
So consider the following MS SQL Server 2008 table:
ID | X | Y
------+-------+-------
1 | 1 | 1
2 | 1 | 2
3 | 1 | 2
4 | 1 | 3
5 | 1 | 3
6 | 2 | 4
7 | 2 | 5
8 | 2 | 5
9 | 2 | 5
10 | 3 | 1
11 | 3 | 10
12 | 3 | 10
I need to receive ONE of the following results (doesn't really matter which it would be):
ID | X | Y
------+-------+-------
4 | 1 | 3
7 | 2 | 5
11 | 3 | 10
Or
ID | X | Y
------+-------+-------
!! 5 | 1 | 3
7 | 2 | 5
11 | 3 | 10
Or
.....
Or
ID | X | Y
------+-------+-------
5 | 1 | 3
9 | 2 | 5
12 | 3 | 10
I need to:
- Group the table by X
- Select the maximum Y
- Select the ID of that maximum Y
- The result should also be grouped by X
There shouldn't be results like:
ID | X | Y
------+-------+-------
4 | 1 | 3
5 | 1 | 3
7 | 2 | 5
8 | 2 | 5
9 | 2 | 5
11 | 3 | 10
12 | 3 | 10
DECLARE @Data TABLE (ID INTEGER, X INTEGER, Y INTEGER)
INSERT @Data VALUES (1,1,1),(2,1,2),(3,1,2),(4,1,3),(5,1,3),
(6,2,4),(7,2,5),(8,2,5),(9,2,5),(10,3,1),(11,3,10),(12,3,10)
;WITH CTE AS
(
    SELECT ID, X, Y,
        ROW_NUMBER() OVER(PARTITION BY X ORDER BY Y DESC, ID ASC) AS RowNo
    FROM @Data
)
SELECT ID, X, Y FROM CTE WHERE RowNo = 1
This uses ROW_NUMBER() to assign each row an incremental number that resets to 1 for each new X value. Within the rows that share an X value, the numbers are assigned in order of Y descending and then ID ascending, so for a particular X value, row number 1 goes to the row with the highest Y value and, among ties, the lowest ID. We then add a restriction to return only the rows where RowNo is 1.
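For readers without a SQL Server instance at hand, the same ROW_NUMBER() pattern can be tried in SQLite through Python's sqlite3 module. This is only a sketch: the table name data is an assumption, and it needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER, X INTEGER, Y INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (1, 1, 1), (2, 1, 2), (3, 1, 2), (4, 1, 3), (5, 1, 3), (6, 2, 4),
    (7, 2, 5), (8, 2, 5), (9, 2, 5), (10, 3, 1), (11, 3, 10), (12, 3, 10),
])

# Number the rows within each X by Y descending, then ID ascending,
# and keep only the first row of each partition.
top_rows = conn.execute("""
    SELECT ID, X, Y
    FROM (
        SELECT ID, X, Y,
               ROW_NUMBER() OVER (PARTITION BY X ORDER BY Y DESC, ID ASC) AS RowNo
        FROM data
    ) AS numbered
    WHERE RowNo = 1
    ORDER BY X
""").fetchall()

print(top_rows)
```

Changing ID ASC to ID DESC in the window's ORDER BY would pick the highest ID among the tied rows instead, which the question says is equally acceptable.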
This returns the first row for each duplicated combination of X and Y:
SELECT t1.*
FROM tablename t1
INNER JOIN
(SELECT MIN(id) id FROM tablename GROUP BY X,Y HAVING COUNT(*)>1)
t2 ON t1.id = t2.id
Testing the code got the following result:
ID X Y
2 1 2
4 1 3
7 2 5
11 3 10
Here is one more elegant solution for this type of task:
DECLARE @Data TABLE (ID INTEGER, X INTEGER, Y INTEGER);
INSERT @Data VALUES (1,1,1);
INSERT @Data VALUES (2,1,2);
INSERT @Data VALUES (3,1,2);
INSERT @Data VALUES (4,1,3);
INSERT @Data VALUES (5,1,3);
INSERT @Data VALUES (6,2,4);
INSERT @Data VALUES (7,2,5);
INSERT @Data VALUES (8,2,5);
INSERT @Data VALUES (9,2,5);
INSERT @Data VALUES (10,3,1);
INSERT @Data VALUES (11,3,10);
INSERT @Data VALUES (12,3,10);
-- Every row numbered 1 within its X partition ties for first place, so WITH TIES returns one row per X
SELECT TOP 1 WITH TIES
    ID, X, Y
FROM @Data
ORDER BY ROW_NUMBER() OVER(PARTITION BY X ORDER BY Y DESC, ID ASC);