PostgreSQL distinct rows joined with a count of distinct values in one column - sql

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?

Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
I work with md5(a) in place of a because a can be very long and you already have indexes on md5(a).
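As a quick sanity check, the aggregate step can be tried on the sample data from the question. This is only a sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL; SQLite has no md5() function, so it groups by a directly, which is an assumption made for illustration.

```python
import sqlite3

# Sample rows from the question: (a, b, u, t)
rows = [
    ("foo", 1, 1, 10), ("foo", 1, 2, 11), ("foo", 1, 2, 11),
    ("foo", 2, 4, 1), ("foo", 3, 5, 2),
    ("bar", 1, 6, 2), ("bar", 2, 7, 2), ("bar", 2, 8, 3),
    ("bar", 3, 9, 4), ("bar", 4, 10, 5), ("bar", 5, 11, 6),
    ("baz", 1, 12, 1), ("baz", 1, 13, 2), ("baz", 1, 13, 2), ("baz", 1, 13, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER, u INTEGER, t INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# One pass: count distinct b per a, keep only groups with more than 2.
# Grouping by a instead of md5(a) is a simplification for SQLite.
counts = conn.execute("""
    SELECT a, count(DISTINCT b) AS z
    FROM t
    GROUP BY a
    HAVING count(DISTINCT b) > 2
    ORDER BY a
""").fetchall()

print(counts)
```

Only foo and bar survive the HAVING filter; baz has a single distinct b.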
Since your table is big, you need an efficient query. This should be among the fastest possible solutions, given adequate index support. Your index on (md5(a), b) is instrumental, but an index on (md5(a), b, u, t) would be even better for the second step of the query (the join back to t), assuming b, u, and t are small columns.
Your desired end result:
SELECT DISTINCT ON (md5(t.a), t.b, t.u, t.t)
       t.a, t.b, t.u, t.t, a.z
FROM  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) a
JOIN   t ON md5(t.a) = a.md5_a
ORDER  BY md5(t.a), t.b, t.u, t.t;  -- optional; if present, it must start with the DISTINCT ON expressions
Or probably faster, yet:
SELECT a, b, u, t, z
FROM  (
   SELECT DISTINCT ON (1, 2, 3, 4)
          md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
   FROM   t
   ) t
JOIN  (
   SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
   FROM   t
   GROUP  BY 1
   HAVING count(DISTINCT b) > 2
   ) z USING (md5_a)
ORDER  BY 1, 2, 3, 4;  -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
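To see the shape of the final result, here is a rough sketch that reproduces the expected output from the question. It uses Python's sqlite3 as a stand-in: SQLite supports neither DISTINCT ON nor md5(), so plain SELECT DISTINCT and a join on a are swapped in. Those substitutions are assumptions for illustration and say nothing about performance in PostgreSQL.

```python
import sqlite3

# Sample rows from the question: (a, b, u, t)
rows = [
    ("foo", 1, 1, 10), ("foo", 1, 2, 11), ("foo", 1, 2, 11),
    ("foo", 2, 4, 1), ("foo", 3, 5, 2),
    ("bar", 1, 6, 2), ("bar", 2, 7, 2), ("bar", 2, 8, 3),
    ("bar", 3, 9, 4), ("bar", 4, 10, 5), ("bar", 5, 11, 6),
    ("baz", 1, 12, 1), ("baz", 1, 13, 2), ("baz", 1, 13, 2), ("baz", 1, 13, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER, u INTEGER, t INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# d: deduplicated rows; c: distinct-b count per a, filtered to z > 2.
result = conn.execute("""
    SELECT d.a, d.b, d.u, d.t, c.z
    FROM (SELECT DISTINCT a, b, u, t FROM t) AS d
    JOIN (
        SELECT a, count(DISTINCT b) AS z
        FROM t
        GROUP BY a
        HAVING count(DISTINCT b) > 2
    ) AS c ON c.a = d.a
    ORDER BY d.a, d.b, d.u, d.t
""").fetchall()

for row in result:
    print(row)
```

The ten rows printed match the desired result in the question, apart from the sort order (bar before foo).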

Related

generate serial number in decreasing order given a variable in netezza aginity sql

Is there any syntax in Netezza SQL that, given a column NUMBER, generates one row per value in decreasing order down to 0?
Below is an example of what I'm trying to do
BEFORE
ID | NUMBER
---+--------
A  | 4
B  | 5
AFTER
ID | NUMBER
---+--------
A  | 4
A  | 3
A  | 2
A  | 1
B  | 5
B  | 4
B  | 3
B  | 2
B  | 1
You can use the _v_vector_idx table for this purpose
select
id, idx
from
test join _v_vector_idx
on idx <= number
order
by id asc, idx desc ;
Here's the example in action
select * from test
ID | NUMBER
-------+--------
A | 4
B | 5
(2 rows)
select id, idx from test join _v_vector_idx on
idx <= number order by id asc, idx desc ;
ID | IDX
-------+-----
A | 4
A | 3
A | 2
A | 1
A | 0
B | 5
B | 4
B | 3
B | 2
B | 1
B | 0
(11 rows)
insert into test values ('C', 3);
INSERT 0 1
select * from test;
ID | NUMBER
-------+--------
A | 4
B | 5
C | 3
(3 rows)
select id, idx from test join _v_vector_idx
on idx <= number order by id asc, idx desc ;
ID | IDX
-------+-----
A | 4
A | 3
A | 2
A | 1
A | 0
B | 5
B | 4
B | 3
B | 2
B | 1
B | 0
C | 3
C | 2
C | 1
C | 0
(15 rows)
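The same join can be tried outside Netezza with an ordinary numbers table. The sketch below uses Python's sqlite3; the table vector_idx is a hypothetical stand-in for _v_vector_idx, and its size of 1024 rows is an assumption. The lower bound of 1 is added so that the output matches the AFTER table in the question, whereas the query above also returns the 0 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id TEXT, number INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [("A", 4), ("B", 5)])

# Hypothetical stand-in for Netezza's _v_vector_idx: integers 0..1023.
conn.execute("CREATE TABLE vector_idx (idx INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO vector_idx VALUES (?)", [(i,) for i in range(1024)])

# Same join as above, but with idx >= 1 so the countdown stops at 1.
generated = conn.execute("""
    SELECT id, idx
    FROM test JOIN vector_idx ON idx BETWEEN 1 AND number
    ORDER BY id ASC, idx DESC
""").fetchall()

print(generated)
```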

Efficient query to Group by column name in SQL or hive

Imagine I have a table with 2 columns m_1 and m_2:
m1 | m2
3 | 17
3 | 18
4 | 17
9 | 9
I would like to get a table with 3 columns:
m is the index of the column (in my example, 1 or 2).
d is the data contained in the table.
count is the number of occurrences of each value, grouped by value and index.
In the example, the result is:
m | d | count
m_1 | 3 | 2
m_1 | 4 | 1
m_1 | 9 | 1
m_2 | 17| 2
m_2 | 18| 1
m_2 | 9 | 1
The first line must be read as 'data 3 occurs 2 times in column m_1'.
A naive solution is to execute two times a parametric query like this:
for (i in 1 .. 2)
SELECT CONCAT('m_', i), m_i, count(*) FROM table GROUP BY m_i
But this algorithm scans my table twice. This is a problem since I have 255 m columns and billions of rows.
Would the solution become easier if I used Hive instead of a relational database?
You can write this using union all and group by:
select colname, d, count(*)
from ((select 'm_1' as colname, m1 as d from t) union all
(select 'm_2' as colname, m2 as d from t)
) m12
group by colname, d;
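A small sketch of the UNION ALL approach on the sample data, run through Python's sqlite3 as a stand-in engine. The table name t follows the query above; the parentheses around each branch are dropped because SQLite does not accept them in a compound select.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (m1 INTEGER, m2 INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(3, 17), (3, 18), (4, 17), (9, 9)])

# Unpivot each column into (colname, d) pairs, then count per pair.
unpivoted = conn.execute("""
    SELECT colname, d, count(*) AS cnt
    FROM (
        SELECT 'm_1' AS colname, m1 AS d FROM t
        UNION ALL
        SELECT 'm_2' AS colname, m2 AS d FROM t
    ) AS m12
    GROUP BY colname, d
    ORDER BY colname, d
""").fetchall()

print(unpivoted)
```

With 255 columns the branches would have to be generated, for example by building the query string in a loop.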
In Hive, use posexplode(array(m1,m2)):
select concat('m_',cast(pe.pos+1 as string)) as m
,pe.val as d
,count(*) as `count`
from mytable t
lateral view posexplode(array(m1,m2)) pe
group by pos
,val
;
+------+-----+--------+
| m | d | count |
+------+-----+--------+
| m_1 | 3 | 2 |
| m_1 | 4 | 1 |
| m_1 | 9 | 1 |
| m_2 | 9 | 1 |
| m_2 | 17 | 2 |
| m_2 | 18 | 1 |
+------+-----+--------+

Combine 2 tables which doesn't have any relationship

I have a couple of tables like the ones below:
Table1:
A B C D <<Columns
1 2 3 4 <<single row
Table2:
W X Y Z << Columns
5 6 7 8 << Single row
I want to combine these 2 tables in such a way that I get the following result:
Result:
P Q R S << Column headers
1 2 3 4 << row from table1
5 6 7 8 << row from table2
Expected result will have column headers as P, Q, R, S and row from table1 and row from table2
How to achieve this using SQL?
UNION ALL will not eliminate duplicates
In set operations (UNION / INTERSECT / EXCEPT) the aliases are taken from the first query. (Currently I'm aware of only one exception: Hive requires the aliases to be the same for all queries, which I consider a bug.)
select A as P, B as Q, C as R, D as S
from table1
union all
select W,X,Y,Z
from table2
+---+---+---+---+
| p | q | r | s |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+
If table2 has only 3 columns:
select B as Q, C as R, D as S
from table1
union all
select X,Y,Z
from table2
+---+---+---+
| q | r | s |
+---+---+---+
| 2 | 3 | 4 |
| 6 | 7 | 8 |
+---+---+---+
or
select A as P, B as Q, C as R, D as S
from table1
union all
select null,X,Y,Z
from table2
+--------+---+---+---+
| p | q | r | s |
+--------+---+---+---+
| 1 | 2 | 3 | 4 |
| (null) | 6 | 7 | 8 |
+--------+---+---+---+
Updated to be more strict and more complete, thanks to @AntDC (and @Matt) and @Dudu Markovitz.
Use UNION with aliases, like this:
SELECT A AS P, B AS Q, C AS R, D AS S
FROM table1
UNION
-- or UNION ALL if you want to keep duplicate rows
SELECT W, X, Y, Z
FROM table2
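That the result columns take their names from the first query can be checked with a short sketch. It uses Python's sqlite3 as a stand-in engine; the untyped column definitions are a simplification for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (A, B, C, D)")
conn.execute("CREATE TABLE table2 (W, X, Y, Z)")
conn.execute("INSERT INTO table1 VALUES (1, 2, 3, 4)")
conn.execute("INSERT INTO table2 VALUES (5, 6, 7, 8)")

cur = conn.execute("""
    SELECT A AS P, B AS Q, C AS R, D AS S FROM table1
    UNION ALL
    SELECT W, X, Y, Z FROM table2
""")

# cursor.description holds one tuple per result column; item 0 is its name.
names = [col[0] for col in cur.description]
combined = sorted(cur.fetchall())

print(names)
print(combined)
```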

Distinct values by multiple columns after already applied GROUP BY

Basically, I have the following query (actually more complex, but I think this simplification is ok):
SELECT a, b, x
FROM table
output:
a | b | x
-----------
1 | 2 | 34
1 | 3 | 35
1 | 3 | 36
1 | 4 | 37
2 | 3 | 38
2 | 3 | 39
2 | 4 | 40
3 | 4 | 41
3 | 5 | 42
To count the number of occurrences of each "pair of a and b", I'm using GROUP BY:
SELECT a, b, COUNT(x) AS count
FROM table
GROUP BY a, b
ORDER BY count
output:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1
2 | 4 | 1
3 | 4 | 1
3 | 5 | 1
1 | 3 | 2
2 | 3 | 2
What bothers me is the multiple occurrences of a and b. I would like to keep the "count" as it is, but remove every following row if a or b was already in a previous row. It would be nice to have if it also removed a row when the value of "a" appeared in a previous row as "b" and vice versa.
Preferred expected output:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1 <- should not be in output since we had a=1
2 | 4 | 1 <- should not be in output since we had b=
3 | 4 | 1
3 | 5 | 1 <- should not be in output since we had a=3
1 | 3 | 2 <- should not be in output since we had a=1 / a=3
2 | 3 | 2 <- should not be in output since we had b=2 / a=3
Therefore, this:
a | b | count
--------------
1 | 2 | 1
3 | 4 | 1
Alternative expected output, if the above would be too complex:
a | b | count
--------------
1 | 2 | 1
1 | 4 | 1 <- should not be in output since we had a=1
2 | 4 | 1
3 | 4 | 1 <- should not be in output since we had b=4
3 | 5 | 1
1 | 3 | 2 <- should not be in output since we had a=1
2 | 3 | 2 <- should not be in output since we had a=2
Therefore, this:
a | b | count
--------------
1 | 2 | 1
2 | 4 | 1
3 | 5 | 1
This is rather a mess of a problem, but here's something to consider:
SELECT a, b, count
FROM (
    SELECT a, b, count,
           rank() over (partition by b order by count, a) as b_rank
    FROM (
        SELECT a, b, count,
               rank() over (partition by a order by count, b) as a_rank
        FROM (
            SELECT a, b, COUNT(*) AS count
            FROM t
            GROUP BY a, b
            ORDER BY count
        ) pc
    ) pc2
    WHERE a_rank < 3
) pc3
WHERE b_rank = 1
Each a value will appear at most twice in the results, but b values will be unique. Some b values appearing in low-count pairs may not be reflected in the results. There is a trade-off between possible duplication of a and the number of b values that may be missed altogether: allowing more duplicates of a (by changing to, e.g, WHERE a_rank < 4) reduces the number of b values that may be missed.
This query will give you the desired output.
DECLARE @id INT = 1,
        @a INT,
        @b INT,
        @count INT
DECLARE @tbl TABLE
(
    id INT IDENTITY(1,1),
    a INT,
    b INT,
    count INT
)
INSERT INTO @tbl
SELECT a, b, COUNT(1) AS COUNT FROM dbo.myTable
GROUP BY a, b
ORDER BY COUNT, a, b
SELECT @count = COUNT(1) FROM @tbl
WHILE @id <= @count
BEGIN
    SELECT TOP 1 @a = a, @b = b FROM @tbl WHERE id = @id
    IF EXISTS(SELECT 1 FROM @tbl WHERE id < @id AND (a = @a OR b = @b))
        DELETE @tbl WHERE id = @id
    SET @id += 1
END
SELECT a, b, count FROM @tbl
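The row-by-row logic of this loop may be easier to follow outside SQL. Below is a minimal Python sketch of the same greedy pass, not a translation of the T-SQL itself: pairs are sorted by count, a, b, and a pair is kept only if it does not clash with an already kept pair. The function name greedy_pairs and the cross flag are hypothetical; cross=True additionally treats a value seen as a as clashing with the same value seen as b, which corresponds to the preferred output in the question.

```python
from collections import Counter

def greedy_pairs(rows, cross=False):
    """Keep (a, b, count) pairs greedily, in order of count, a, b."""
    counts = Counter((a, b) for a, b, _x in rows)
    ordered = sorted(counts.items(), key=lambda kv: (kv[1], kv[0][0], kv[0][1]))
    seen_a, seen_b = set(), set()
    kept = []
    for (a, b), n in ordered:
        if cross:
            # a or b must not have appeared in either column of a kept row
            clash = bool({a, b} & (seen_a | seen_b))
        else:
            # a must be new among kept a values, b among kept b values
            clash = a in seen_a or b in seen_b
        if not clash:
            kept.append((a, b, n))
            seen_a.add(a)
            seen_b.add(b)
    return kept

# Sample rows from the question: (a, b, x)
data = [(1, 2, 34), (1, 3, 35), (1, 3, 36), (1, 4, 37), (2, 3, 38),
        (2, 3, 39), (2, 4, 40), (3, 4, 41), (3, 5, 42)]

alternative = greedy_pairs(data)
preferred = greedy_pairs(data, cross=True)
print(alternative)
print(preferred)
```

On the sample data, the default mode yields the alternative expected output and cross=True yields the preferred one.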

Select non distinct rows from two columns

My question is very similar to Multiple NOT distinct only it deals with multiple columns instead of one. I have a table like so:
A B C
1 1 0
1 2 1
2 1 2
2 1 3
2 2 4
2 3 5
2 3 6
3 1 7
3 3 8
3 1 9
And the result should be:
A B C
2 1 2
2 1 3
2 3 5
2 3 6
3 1 7
3 1 9
Essentially, like the above question, removing all unique entries only where uniqueness is determined by two columns instead of one. I already tried various tweaks to the above answer but couldn't get any of them to work.
You are using SQL Server, so this is easier than in Access:
select A, B, C
from (select t.*, count(*) over (partition by A, B) as cnt
from t
) t
where cnt > 1;
Here count(*) is used as a window function: it counts the number of rows that share the same values of A and B. The final where keeps only the rows that have more than one entry.
Another possible solution with EXISTS
SELECT a, b, c
FROM Table1 t
WHERE EXISTS
(
SELECT 1
FROM Table1
WHERE a = t.a
AND b = t.b
AND c <> t.c
)
It should be fast enough.
Output:
| A | B | C |
-------------
| 2 | 1 | 2 |
| 2 | 1 | 3 |
| 2 | 3 | 5 |
| 2 | 3 | 6 |
| 3 | 1 | 7 |
| 3 | 1 | 9 |
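Both approaches can be cross-checked on the sample data. The sketch below runs them through Python's sqlite3 as a stand-in for SQL Server; it assumes the bundled SQLite is version 3.25 or newer (needed for window functions) and uses a single table named t for both queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1, 1, 0), (1, 2, 1), (2, 1, 2), (2, 1, 3), (2, 2, 4),
    (2, 3, 5), (2, 3, 6), (3, 1, 7), (3, 3, 8), (3, 1, 9),
])

# Window-function variant: count rows per (A, B), keep groups larger than 1.
windowed = conn.execute("""
    SELECT A, B, C
    FROM (SELECT t.*, count(*) OVER (PARTITION BY A, B) AS cnt FROM t) AS s
    WHERE cnt > 1
    ORDER BY C
""").fetchall()

# EXISTS variant: keep a row if another row shares A and B but differs in C.
exists_rows = conn.execute("""
    SELECT A, B, C
    FROM t AS o
    WHERE EXISTS (
        SELECT 1 FROM t AS i
        WHERE i.A = o.A AND i.B = o.B AND i.C <> o.C
    )
    ORDER BY C
""").fetchall()

print(windowed)
print(exists_rows)
```

Note that the EXISTS variant relies on C differing between rows that share A and B; exact duplicate rows would not be picked up by it.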