matching groups of rows in two databases - sql

I have the following (simplified) situation in two databases:
| ID | Prog | T | Qt  |
|----|------|---|-----|
| a  | 1    | N | 100 |
| b  | 1    | Y | 10  |
| b  | 2    | N | 90  |
| c  | 1    | N | 25  |
| c  | 2    | Y | 25  |
| c  | 3    | Y | 25  |
| c  | 4    | Y | 25  |

| ID | Prog | T | Qt  |
|----|------|---|-----|
| 1  | 1    | Y | 10  |
| 1  | 2    | N | 90  |
| 2  | 1    | Y | 100 |
| 3  | 1    | Y | 100 |
| 4  | 1    | Y | 50  |
| 4  | 2    | Y | 25  |
| 4  | 3    | Y | 25  |
I need to compare groups of rows (primary keys are ID and Prog), to find out which groups of rows represent the same combination of factors (not considering ID).
In the example above, ID "b" in the first table and ID "1" in the second have the same combination of values for Prog, T and Qt; no other pair of IDs matches exactly across the two databases. (IDs "2" and "3" in the second table are equal to each other, but I'm not interested in comparisons within the same database.)
I hope I explained everything.

A join and aggregation should work for this purpose:
select t1.id, t2.id
from (select t1.*, count(*) over (partition by id) as cnt
      from t1
     ) t1 join
     (select t2.*, count(*) over (partition by id) as cnt
      from t2
     ) t2
     on t1.prog = t2.prog and t1.T = t2.T and t1.Qt = t2.Qt and t1.cnt = t2.cnt
group by t1.id, t2.id, t1.cnt
having count(*) = t1.cnt;
This is a little tricky. The subqueries count the number of rows for each id in each table. The on clause gets matches between the three columns -- and checks that the ids have the same count. The group by and having then get rows where number of matching rows is the total number of rows.
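To see the count-and-compare idea run end to end, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite (3.25 or later, for window functions) and the subquery aliases g1 and g2 are assumptions made for the demo; the original answer does not name a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id TEXT, prog INTEGER, T TEXT, Qt INTEGER);
CREATE TABLE t2 (id INTEGER, prog INTEGER, T TEXT, Qt INTEGER);
INSERT INTO t1 VALUES
  ('a',1,'N',100), ('b',1,'Y',10), ('b',2,'N',90),
  ('c',1,'N',25), ('c',2,'Y',25), ('c',3,'Y',25), ('c',4,'Y',25);
INSERT INTO t2 VALUES
  (1,1,'Y',10), (1,2,'N',90), (2,1,'Y',100), (3,1,'Y',100),
  (4,1,'Y',50), (4,2,'Y',25), (4,3,'Y',25);
""")

# g1 and g2 are hypothetical alias names; cnt is the number of rows per id.
query = """
SELECT g1.id, g2.id
FROM (SELECT t1.*, COUNT(*) OVER (PARTITION BY id) AS cnt FROM t1) AS g1
JOIN (SELECT t2.*, COUNT(*) OVER (PARTITION BY id) AS cnt FROM t2) AS g2
  ON g1.prog = g2.prog AND g1.T = g2.T AND g1.Qt = g2.Qt AND g1.cnt = g2.cnt
GROUP BY g1.id, g2.id, g1.cnt
HAVING COUNT(*) = g1.cnt
"""
matches = conn.execute(query).fetchall()
print(matches)  # [('b', 1)]
```

Only the pair ('b', 1) survives: every row of "b" finds a partner in "1" and both groups have two rows. Groups of different sizes never join at all because of the cnt comparison in the ON clause.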

Join the two tables on the conditions you want to match. The results will be the values that match between them.
CREATE TABLE a (ID CHAR(1), Prog INT, T CHAR(1), Qt INT);
CREATE TABLE b (ID int, Prog INT, T CHAR(1), Qt INT);
INSERT INTO dbo.a
( ID ,Prog ,T ,Qt)
VALUES ('a',1,'N',100), ('b',1,'Y',10), ('b',2,'N',90),('c',1,'N',25),('c',2,'Y',25),('c',3,'Y',25),('c',4,'Y',25);
INSERT INTO dbo.b
( ID ,Prog ,T ,Qt)
VALUES (1,1,'Y',10),(1,2,'N',90),(2,1,'Y',100),(3,1,'Y',100),(4,1,'Y',50),(4,2,'Y',25),(4,3,'Y',25);
WITH CTEa
AS (SELECT ID,
Prog,
T,
Qt,
Cnt = COUNT(ID) OVER (PARTITION BY ID)
FROM dbo.a
),
CTEb
AS (SELECT ID,
Prog,
T,
Qt,
Cnt = COUNT(ID) OVER (PARTITION BY ID)
FROM dbo.b
)
SELECT ID_A = a.ID,
ID_B = b.ID,
b.Prog,
b.T,
b.Qt,
b.Cnt
FROM CTEa AS a
INNER JOIN CTEb AS b
ON a.Prog = b.Prog
AND a.T = b.T
AND a.Qt = b.Qt
AND a.Cnt = b.Cnt;
Results:
ID_A ID_B Prog T Qt Cnt
b 1 1 Y 10 2
b 1 2 N 90 2

Related

Grouping data using PostgreSQL based on 2 fields

I have a problem with grouping data in PostgreSQL. Let's say I have a table called my_table:
some_id | description | other_id
---------|-----------------|-----------
1 | description-1 | a
1 | description-2 | b
2 | description-3 | a
2 | description-4 | a
3 | description-5 | a
3 | description-6 | b
3 | description-7 | b
4 | description-8 | a
4 | description-9 | a
4 | description-10 | a
...
I would like to group the rows by some_id and then separate the groups in which every row has the same other_id from the groups that contain different other_id values.
I am expecting two queries: one that returns the groups with a single other_id and one that returns the groups with differing other_id values.
Expected result
some_id | description | other_id
---------|-----------------|-----------
2 | description-3 | a
2 | description-4 | a
4 | description-8 | a
4 | description-9 | a
4 | description-10 | a
AND
some_id | description | other_id
---------|-----------------|-----------
1 | description-1 | a
1 | description-2 | b
3 | description-5 | a
3 | description-6 | b
3 | description-7 | b
I am open to suggestions using either Sequelize or a raw query.
Thank you.
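One approach, using MIN and MAX as analytic functions. A CTE is only visible to the single statement it is attached to, so it is repeated for each query:
-- all other_id values the same within a some_id
WITH cte AS (
    SELECT *, MIN(other_id) OVER (PARTITION BY some_id) min_other_id,
              MAX(other_id) OVER (PARTITION BY some_id) max_other_id
    FROM yourTable
)
SELECT some_id, description, other_id
FROM cte
WHERE min_other_id = max_other_id;
-- not all other_id values the same within a some_id
WITH cte AS (
    SELECT *, MIN(other_id) OVER (PARTITION BY some_id) min_other_id,
              MAX(other_id) OVER (PARTITION BY some_id) max_other_id
    FROM yourTable
)
SELECT some_id, description, other_id
FROM cte
WHERE min_other_id <> max_other_id;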
You can also do this using exists and not exists:
-- all same
select t.*
from my_table t
where not exists (select 1
from my_table t2
where t2.some_id = t.some_id and t2.other_id <> t.other_id
);
-- any different
select t.*
from my_table t
where exists (select 1
from my_table t2
where t2.some_id = t.some_id and t2.other_id <> t.other_id
);
Note that this ignores NULL values. If you want them treated as a "different" value then use is distinct from rather than <>.
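The NULL caveat is easy to miss, so here is a small sketch of it, assuming an in-memory SQLite database driven from Python. SQLite's IS NOT operator is used as a stand-in for PostgreSQL's IS DISTINCT FROM (both treat NULL as a comparable value); the two-group sample data is made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (some_id INTEGER, description TEXT, other_id TEXT);
INSERT INTO my_table VALUES
  (1, 'description-1', 'a'),
  (1, 'description-2', NULL),   -- group 1 mixes 'a' and NULL
  (2, 'description-3', 'a'),
  (2, 'description-4', 'a');    -- group 2 is uniform
""")

# {op} is filled with either the plain or the NULL-safe inequality operator.
template = """
SELECT DISTINCT t.some_id
FROM my_table t
WHERE NOT EXISTS (SELECT 1
                  FROM my_table t2
                  WHERE t2.some_id = t.some_id
                    AND t2.other_id {op} t.other_id)
ORDER BY t.some_id
"""
plain = [r[0] for r in conn.execute(template.format(op="<>"))]
null_safe = [r[0] for r in conn.execute(template.format(op="IS NOT"))]
print(plain)      # [1, 2]  (the NULL in group 1 goes unnoticed)
print(null_safe)  # [2]     (NULL counts as a different value)
```

With <>, any comparison against NULL yields NULL rather than true, so group 1 still looks like an "all same" group. The NULL-safe operator reports it as mixed.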

SQL Query - Check for Two Distinct Values

Given the below data set I want to run a query to highlight any 'pairs' that do not consist of a 'left' and 'right'.
+---------+-----------+---------------+----------------------+
| Pair_Id | Pair_Name | Individual_Id | Individual_Direction |
+---------+-----------+---------------+----------------------+
| 1 | A | A1 | Left |
| 1 | A | A2 | Right |
| 2 | B | B1 | Right |
| 2 | B | B2 | Left |
| 3 | C | C1 | Left |
| 3 | C | C2 | Left |
| 4 | D | D1 | Right |
| 4 | D | D2 | Left |
| 5 | E | E1 | Left |
| 5 | E | E2 | Right |
+---------+-----------+---------------+----------------------+
In this instance Pair 3 'C' has two lefts. Therefore, I would look to display the following:
+---------+-----------+---------------+----------------------+
| Pair_Id | Pair_Name | Individual_Id | Individual_Direction |
+---------+-----------+---------------+----------------------+
| 3 | C | C1 | Left |
| 3 | C | C2 | Left |
+---------+-----------+---------------+----------------------+
You can simply use not exists:
select t.*
from t
where not exists (select 1
from t t2
where t2.pair_id = t.pair_id and
t2.Individual_Direction <> t.Individual_Direction
) ;
With an index on (pair_id, Individual_Direction), this should not only be the most concise solution but also the fastest.
If you want to be sure that there are pairs (the above returns singletons):
select t.*
from t
where not exists (select 1
from t t2
where t2.pair_id = t.pair_id and
t2.Individual_Direction <> t.Individual_Direction
) and
exists (select 1
from t t2
where t2.pair_id = t.pair_id and
t2.Individual_ID <> t.Individual_ID
);
You can also do this using window functions:
select t.*
from (select t.*,
             count(*) over (partition by pair_id) as cnt,
             min(Individual_Direction) over (partition by pair_id) as min_direction,
             max(Individual_Direction) over (partition by pair_id) as max_direction
      from t
     ) t
where cnt > 1 and min_direction = max_direction;
One option uses aggregation:
WITH cte AS (
SELECT Pair_Name
FROM yourTable
WHERE Individual_Direction IN ('Left', 'Right')
GROUP BY Pair_Name
HAVING MIN(Individual_Direction) = MAX(Individual_Direction)
)
SELECT *
FROM yourTable
WHERE Pair_Name IN (SELECT Pair_Name FROM cte);
The HAVING clause used above asserts that a matching pair has both a minimum and maximum direction which are the same. This implies that such a pair only has one direction.
As is the case with Gordon's answer, an index on (Pair_Name, Individual_Direction) might help performance:
CREATE INDEX idx ON yourTable (Pair_Name, Individual_Direction);
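As a quick check of the MIN = MAX aggregation, the sketch below runs the same query against the sample pairs from the question. It assumes an in-memory SQLite database accessed through Python's sqlite3 module, which is only a convenient test bed and not necessarily the engine used in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourTable (
  Pair_Id INTEGER, Pair_Name TEXT,
  Individual_Id TEXT, Individual_Direction TEXT);
INSERT INTO yourTable VALUES
  (1,'A','A1','Left'),  (1,'A','A2','Right'),
  (2,'B','B1','Right'), (2,'B','B2','Left'),
  (3,'C','C1','Left'),  (3,'C','C2','Left'),
  (4,'D','D1','Right'), (4,'D','D2','Left'),
  (5,'E','E1','Left'),  (5,'E','E2','Right');
""")

# A pair whose smallest and largest direction coincide has only one direction.
rows = conn.execute("""
WITH cte AS (
  SELECT Pair_Name
  FROM yourTable
  WHERE Individual_Direction IN ('Left', 'Right')
  GROUP BY Pair_Name
  HAVING MIN(Individual_Direction) = MAX(Individual_Direction)
)
SELECT Pair_Id, Pair_Name, Individual_Id, Individual_Direction
FROM yourTable
WHERE Pair_Name IN (SELECT Pair_Name FROM cte)
ORDER BY Individual_Id
""").fetchall()
print(rows)  # [(3, 'C', 'C1', 'Left'), (3, 'C', 'C2', 'Left')]
```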
There is probably a more elegant way to use window functions than what I wrote:
WITH ranked AS
(
SELECT *, RANK() OVER(ORDER BY Pair_Id, Pair_Name, Individual_Direction) AS r
FROM pairs
),
counted AS
(
SELECT Pair_Id, Pair_Name, Individual_Direction,r, COUNT(r) as times FROM ranked
GROUP BY Pair_Id, Pair_Name, Individual_Direction, r
HAVING COUNT(r) > 1
)
SELECT ranked.Pair_Id, ranked.Pair_Name, ranked.Individual_Id, ranked.Individual_Direction FROM ranked
RIGHT JOIN counted
ON ranked.Pair_Id=counted.Pair_Id
AND ranked.Pair_Name=counted.Pair_Name
AND ranked.Individual_Direction=counted.Individual_Direction

T-SQL: Best way to replace NULL with most recent non-null value?

Assume I have this table:
+----+-------+
| id | value |
+----+-------+
| 1 | 5 |
| 2 | 4 |
| 3 | 1 |
| 4 | NULL |
| 5 | NULL |
| 6 | 14 |
| 7 | NULL |
| 8 | 0 |
| 9 | 3 |
| 10 | NULL |
+----+-------+
I want to write a query that will replace any NULL value with the last value in the table that was not null in that column.
I want this result:
+----+-------+
| id | value |
+----+-------+
| 1 | 5 |
| 2 | 4 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 14 |
| 7 | 14 |
| 8 | 0 |
| 9 | 3 |
| 10 | 3 |
+----+-------+
If no previous value existed, then NULL is OK. Ideally, this should be able to work even with an ORDER BY. So for example, if I ORDER BY [id] DESC:
+----+-------+
| id | value |
+----+-------+
| 10 | NULL |
| 9 | 3 |
| 8 | 0 |
| 7 | 0 |
| 6 | 14 |
| 5 | 14 |
| 4 | 14 |
| 3 | 1 |
| 2 | 4 |
| 1 | 5 |
+----+-------+
Or even better if I ORDER BY [value] DESC:
+----+-------+
| id | value |
+----+-------+
| 6 | 14 |
| 1 | 5 |
| 2 | 4 |
| 9 | 3 |
| 3 | 1 |
| 8 | 0 |
| 4 | 0 |
| 5 | 0 |
| 7 | 0 |
| 10 | 0 |
+----+-------+
I think this might involve some kind of analytic function - somehow partitioning over the value column - but I'm not sure where to look.
You can use a running sum to set groups and use max to fill in the null values.
select id,max(value) over(partition by grp) as value
from (select id,value,sum(case when value is not null then 1 else 0 end) over(order by id) as grp
from tbl
) t
Change the over() clause to order by value desc to get the second result in the question.
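To make the grouping step visible, here is a sketch that runs the running-sum query on the question's data and also returns the intermediate grp column. It assumes SQLite 3.25 or later via Python's sqlite3 module; the column alias filled is a name introduced for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, value INTEGER);
INSERT INTO tbl VALUES
  (1,5),(2,4),(3,1),(4,NULL),(5,NULL),(6,14),(7,NULL),(8,0),(9,3),(10,NULL);
""")

# grp increases by one at every non-NULL value, so each NULL row lands in the
# group opened by the closest preceding non-NULL row.
rows = conn.execute("""
SELECT id, grp, MAX(value) OVER (PARTITION BY grp) AS filled
FROM (SELECT id, value,
             SUM(CASE WHEN value IS NOT NULL THEN 1 ELSE 0 END)
               OVER (ORDER BY id) AS grp
      FROM tbl) AS t
ORDER BY id
""").fetchall()

groups = [r[1] for r in rows]
filled = [r[2] for r in rows]
print(groups)  # [1, 2, 3, 3, 3, 4, 4, 5, 6, 6]
print(filled)  # [5, 4, 1, 1, 1, 14, 14, 0, 3, 3]
```

Each group contains at most one non-NULL value, which is why MAX (or MIN) over the group simply returns that value.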
The best way has been covered by Itzik Ben-Gan in "The Last non NULL Puzzle".
Below is a solution that, for 10 million rows, completes in around 20 seconds on my system:
SELECT
id,
value1,
CAST(
SUBSTRING(
MAX(CAST(id AS binary(4)) + CAST(value1 AS binary(4)))
OVER (ORDER BY id
ROWS UNBOUNDED PRECEDING),
5, 4)
AS int) AS lastval
FROM dbo.T1;
This solution assumes your id column is indexed
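The binary concatenation trick can be hard to read in T-SQL, so here is a plain Python model of the same idea, not a translation of the SQL Server internals. It assumes non-negative ids and values that fit in 4 bytes; the function name last_non_null is made up for the illustration.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[int]]


def last_non_null(rows: List[Row]) -> List[Row]:
    """Carry the latest non-NULL value forward using a running max of bytes."""
    best: Optional[bytes] = None  # running MAX over (id bytes + value bytes)
    out: List[Row] = []
    for row_id, value in sorted(rows, key=lambda r: r[0]):
        if value is not None:
            # Big-endian bytes compare like the numbers themselves, and the id
            # comes first, so the largest key belongs to the highest id so far.
            key = row_id.to_bytes(4, "big") + value.to_bytes(4, "big")
            if best is None or key > best:
                best = key
        # The last 4 bytes of the winning key hold the value to carry forward.
        last = None if best is None else int.from_bytes(best[4:], "big")
        out.append((row_id, last))
    return out


data = [(1, 5), (2, 4), (3, 1), (4, None), (5, None),
        (6, 14), (7, None), (8, 0), (9, 3), (10, None)]
result = last_non_null(data)
print([v for _, v in result])  # [5, 4, 1, 1, 1, 14, 14, 0, 3, 3]
```

Rows with a NULL value contribute no key, which mirrors how the SQL version relies on MAX skipping NULLs.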
You can also try using correlated subquery
select id,
case when value is not null then value else
(select top 1 value from table
where id < t.id and value is not null order by id desc) end value
from table t
Result :
id value
1 5
2 4
3 1
4 1
5 1
6 14
7 14
8 0
9 3
10 3
If the NULLs are scattered, I use a WHILE loop to fill them in.
However, if the NULLs come in longer consecutive runs, there are faster ways to do it.
So here's one approach:
First, find a record that we want to update: it has NULL in this record and a non-NULL value in the prior record.
SELECT C.VALUE, N.ID
FROM TABLE C
INNER JOIN TABLE N
ON C.ID + 1 = N.ID
WHERE C.VALUE IS NOT NULL
AND N.VALUE IS NULL;
Use that to update: (bit hazy on this syntax but you get the idea)
UPDATE N
SET VALUE = C.Value
FROM TABLE C
INNER JOIN TABLE N
ON C.ID + 1 = N.ID
WHERE C.VALUE IS NOT NULL
AND N.VALUE IS NULL;
... now just keep doing it until you run out of rows:
-- This is needed to set @@ROWCOUNT to non-zero
SELECT 1;
WHILE @@ROWCOUNT <> 0
BEGIN
UPDATE N
SET VALUE = C.Value
FROM TABLE C
INNER JOIN TABLE N
ON C.ID + 1 = N.ID
WHERE C.VALUE IS NOT NULL
AND N.VALUE IS NULL;
END
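The repeat-until-nothing-changes loop can also be tried outside SQL Server. The sketch below assumes an in-memory SQLite database driven from Python, a hypothetical table name tbl, and a correlated subquery in place of the T-SQL UPDATE ... FROM join; cursor.rowcount plays the role of @@ROWCOUNT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER PRIMARY KEY, value INTEGER);
INSERT INTO tbl VALUES
  (1,5),(2,4),(3,1),(4,NULL),(5,NULL),(6,14),(7,NULL),(8,0),(9,3),(10,NULL);
""")

# One step: fill every NULL row whose immediate predecessor (id - 1) has a value.
fill_one_step = """
UPDATE tbl
SET value = (SELECT c.value FROM tbl AS c WHERE c.id + 1 = tbl.id)
WHERE value IS NULL
  AND EXISTS (SELECT 1 FROM tbl AS c
              WHERE c.id + 1 = tbl.id AND c.value IS NOT NULL)
"""

while True:
    changed = conn.execute(fill_one_step).rowcount  # rows touched by this pass
    if changed == 0:
        break  # nothing left to fill

final = conn.execute("SELECT id, value FROM tbl ORDER BY id").fetchall()
print([v for _, v in final])  # [5, 4, 1, 1, 1, 14, 14, 0, 3, 3]
```

Like the original, this relies on ids being consecutive: a NULL row whose predecessor id is missing is never filled.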
The other way is to use a similar query to get a range of ids to update. This works much faster if your NULLs usually sit on consecutive ids.
Here is one simple approach using OUTER APPLY:
CREATE TABLE #table(id INT, value INT)
INSERT INTO #table VALUES
(1,5),
(2,4),
(3,1),
(4,NULL),
(5,NULL),
(6,14),
(7,NULL),
(8,0),
(9,3),
(10,NULL)
SELECT t.id, ISNULL(t.value, t3.value) value
FROM #table t
OUTER APPLY(SELECT id FROM #table WHERE id = t.id AND VALUE IS NULL) t2
OUTER APPLY(SELECT TOP 1 value
FROM #table WHERE id <= t2.id AND VALUE IS NOT NULL ORDER BY id DESC) t3
OUTPUT:
id VALUE
---------
1 5
2 4
3 1
4 1
5 1
6 14
7 14
8 0
9 3
10 3
Using this sample data:
if object_id('tempdb..#t1') is not null drop table #t1;
create table #t1 (id int primary key, [value] int null);
insert #t1 values(1,5),(2,4),(3,1),(4,NULL),(5,NULL),(6,14),(7,NULL),(8,0),(9,3),(10,NULL);
I came up with:
with x(id, [value], grouper) as (
select *, row_number() over (order by id)-sum(iif([value] is null,1,0)) over (order by id)
from #t1)
select id, min([value]) over (partition by grouper)
from x;
I noticed, however, that Vamsi Prabhala beat me to it... My solution is identical to what he posted. (arghhhh!). So I thought I'd try a recursive solution. Here's a pretty efficient use of a recursive cte (provided that ID is indexed):
with sorted as (select *, seqid = row_number() over (order by id) from #t1),
firstRecord as (select top(1) * from #t1 order by id),
prev as
(
select t.id, t.[value], lastid = 1, lastvalue = null
from sorted t
where t.id = 1
union all
select t2.id, t2.[value], lastid+1, isnull(prev.[value],lastvalue)
from sorted t2
join prev on t2.id = prev.lastid+1
)
select id, [value]=isnull([value],lastvalue)--, *
from prev;
Normally I don't like recursive CTEs (rCTE for short), but in this case it offered an elegant solution and was faster than using the window aggregate functions (sum over, min over...). Note the execution plans, the rCTE on the bottom. The rCTE gets it done with two index seeks, one of which is for just one row. Unlike the window aggregate solution, the rCTE does not require a sort. Running this with statistics io on, the rCTE produces much less IO.
All this said, don't use either of these solutions. What TheGameiswar posted will perform the best by far. His solution on a properly indexed id column would be lightning fast.
The following UPDATE statement can be used; please test it before use:
update #table
set value = newvalue
from (
select
s.id, s.value,
(select top 1 t.value from #table t where t.id <= s.id and t.value is not null order by t.id desc) as newvalue
from #table S
) u
where #table.id = u.id and #table.value is null
stop worrying..here's the answer for you :)
SELECT *
INTO #TempIsNOtNull
FROM YourTable
WHERE value IS NOT NULL
SELECT *
INTO #TempIsNull
FROM YourTable
WHERE value IS NULL
UPDATE YourTable
SET YourTable.value = UpdateDtls.value
FROM YourTable
JOIN (
SELECT OuterTab1.id,
#TempIsNOtNull.value
FROM #TempIsNull OuterTab1
CROSS JOIN #TempIsNOtNull
WHERE OuterTab1.id - #TempIsNOtNull.id > 0
AND (OuterTab1.id - #TempIsNOtNull.id) = ( SELECT TOP 1
OuterTab1.id - #TempIsNOtNull.id
FROM #TempIsNull InnerTab
CROSS JOIN #TempIsNOtNull
WHERE OuterTab1.id - #TempIsNOtNull.id > 0
AND OuterTab1.id = InnerTab.id
ORDER BY (OuterTab1.id - #TempIsNOtNull.id) ASC) ) AS UpdateDtls
ON (YourTable.id = UpdateDtls.id)

Impala query - optimize a query to get the uniques for given key

I'm looking for ways to count unique users that have a specific pkey and also the count of unique users who didn't have that pkey.
Here is a sample table:
userid | pkey | pvalue
------------------------------
U1 | x | vx
U1 | y | vy
U1 | z | vz
U2 | y | vy
U3 | z | vz
U4 | null | null
This query gives me the expected results (the unique users who have pkey='y' and those who don't), but it turns out to be expensive:
WITH all_rows AS
( SELECT userid,
IF( pkey='y', pvalue, 'none' ) AS val,
SUM( IF(pkey='y',1,0) ) AS has_key
FROM some_table
GROUP BY userid, val)
SELECT val,
count(distinct(userid)) uniqs
FROM all_rows
WHERE has_key=1
GROUP BY val
UNION ALL
SELECT 'no_key_set' val,
count(distinct(userid)) uniqs
FROM all_rows a1 LEFT ANTI JOIN
all_rows a2 on (a1.userid = a2.userid and a2.has_key=1)
GROUP BY val;
Results:
val | uniqs
--------------------
vy | 2
no_key_set | 2
I'm looking to avoid using any temp tables, so any better ways this can be achieved?
Thanks!
By using EXPLAIN, you can observe that most of the cost is spent on doing excessive GROUP BY aggregations rather than on using subqueries in your original query.
Here is a straightforward implementation:
WITH t1 AS (
    SELECT pkey, COUNT(DISTINCT userid) AS cnt
    FROM some_table
    WHERE pkey IS NOT NULL
    GROUP BY pkey
), t2 AS (
    SELECT COUNT(DISTINCT userid) AS total_cnt
    FROM some_table
)
SELECT
    CONCAT('no_', pkey) AS pkey,
    (total_cnt - cnt) AS cnt
FROM t1, t2
UNION ALL
SELECT * FROM t1
t1 gets a table of unique user counts per pkey:
+------+-----+
| pkey | cnt |
+------+-----+
| x | 1 |
| z | 2 |
| y | 2 |
+------+-----+
t2 gets the total number of unique users:
+-----------+
| total_cnt |
+-----------+
| 4 |
+-----------+
We can use the result from t2 to get the complement of t1:
+------+-----+
| pkey | cnt |
+------+-----+
| no_x | 3 |
| no_z | 2 |
| no_y | 2 |
+------+-----+
A final union of the two tables gives this result:
+------+-----+
| pkey | cnt |
+------+-----+
| no_x | 3 |
| no_z | 2 |
| no_y | 2 |
| x | 1 |
| z | 2 |
| y | 2 |
+------+-----+
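The per-key counts and their complements can be reproduced on the sample rows with the sketch below. It is written for an in-memory SQLite database driven from Python rather than Impala, so two details are assumptions: the || operator stands in for CONCAT, and COUNT(DISTINCT userid) is used so that a repeated (userid, pkey) row would not be counted twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE some_table (userid TEXT, pkey TEXT, pvalue TEXT);
INSERT INTO some_table VALUES
  ('U1','x','vx'), ('U1','y','vy'), ('U1','z','vz'),
  ('U2','y','vy'), ('U3','z','vz'), ('U4',NULL,NULL);
""")

# t1: distinct users per key; t2: distinct users overall (single row).
rows = conn.execute("""
WITH t1 AS (
  SELECT pkey, COUNT(DISTINCT userid) AS cnt
  FROM some_table
  WHERE pkey IS NOT NULL
  GROUP BY pkey
), t2 AS (
  SELECT COUNT(DISTINCT userid) AS total_cnt
  FROM some_table
)
SELECT 'no_' || pkey AS pkey, total_cnt - cnt AS cnt
FROM t1, t2
UNION ALL
SELECT pkey, cnt FROM t1
ORDER BY pkey
""").fetchall()
print(rows)
# [('no_x', 3), ('no_y', 2), ('no_z', 2), ('x', 1), ('y', 2), ('z', 2)]
```

The cross join with t2 is cheap because t2 always has exactly one row.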

order by after full outer join

I create the following table on http://sqlfiddle.com in PostgreSQL 9.3.1 mode:
CREATE TABLE t
(
id serial primary key,
m varchar(1),
d varchar(1),
c int
);
INSERT INTO t
(m, d, c)
VALUES
('A', '1', 101),
('A', '2', 102),
('A', '3', 103),
('B', '1', 104),
('B', '3', 105);
table:
| ID | M | D | C |
|----|---|---|-----|
| 1 | A | 1 | 101 |
| 2 | A | 2 | 102 |
| 3 | A | 3 | 103 |
| 4 | B | 1 | 104 |
| 5 | B | 3 | 105 |
From this I want to generate such a table:
| M | D | ID | C |
|---|---|--------|--------|
| A | 1 | 1 | 101 |
| A | 2 | 2 | 102 |
| A | 3 | 3 | 103 |
| B | 1 | 4 | 104 |
| B | 2 | (null) | (null) |
| B | 3 | 5 | 105 |
but with my current statement
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
I only get the following
| M | D | ID | C |
|---|---|--------|--------|
| A | 1 | 1 | 101 |
| B | 1 | 4 | 104 |
| A | 2 | 2 | 102 |
| A | 3 | 3 | 103 |
| B | 3 | 5 | 105 |
| B | 2 | (null) | (null) |
Attempts to order it by m,d fail so far:
select * from
(select * from
(select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join
t
on kombi.d = t.d and kombi.m = t.m) as result)
order by result.m
Error message:
ERROR: subquery in FROM must have an alias: select * from (select * from (select * from (select * from (select distinct m from t) as dummy1, (select distinct d from t) as dummy2) as kombi full outer join t on kombi.d = t.d and kombi.m = t.m) as result) order by result.m
It would be cool if somebody could point out to me what I am doing wrong and perhaps show the correct statement.
select * from
(select kombi.m, kombi.d, t.id, t.c from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as kombi
full outer join t
on kombi.d = t.d and kombi.m = t.m) as result
order by result.m, result.d
I think your problem is the order. You can solve this problem with the order by clause:
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
order by combi.m, combi.d
You need to specify which data you would like to order. In this case you get back the row from the combi table, so you need to say that.
http://sqlfiddle.com/#!15/ddc0e/17
You could also use column numbers instead of names to do the ordering.
select * from
(select * from
(select distinct m from t) as dummy1,
(select distinct d from t) as dummy2) as combi
full outer join
t
on combi.d = t.d and combi.m = t.m
order by 1,2;
| M | D | ID | C |
|---|---|--------|--------|
| A | 1 | 1 | 101 |
| A | 2 | 2 | 102 |
| A | 3 | 3 | 103 |
| B | 1 | 4 | 104 |
| B | 2 | (null) | (null) |
| B | 3 | 5 | 105 |
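For anyone who wants to experiment with the ordering locally, here is a sketch using an in-memory SQLite database from Python instead of PostgreSQL. It swaps the FULL OUTER JOIN for a LEFT JOIN, which is an assumption that holds here because combi already contains every (m, d) combination present in t, so no row of t can be left unmatched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER PRIMARY KEY, m TEXT, d TEXT, c INTEGER);
INSERT INTO t (m, d, c) VALUES
  ('A','1',101), ('A','2',102), ('A','3',103), ('B','1',104), ('B','3',105);
""")

# Sorting on the combi columns keeps the unmatched (B, 2) row in place,
# because combi.m and combi.d are never NULL.
rows = conn.execute("""
SELECT combi.m, combi.d, t.id, t.c
FROM (SELECT dummy1.m, dummy2.d
      FROM (SELECT DISTINCT m FROM t) AS dummy1
      CROSS JOIN (SELECT DISTINCT d FROM t) AS dummy2) AS combi
LEFT JOIN t
  ON combi.d = t.d AND combi.m = t.m
ORDER BY combi.m, combi.d
""").fetchall()
for row in rows:
    print(row)
# ('A', '1', 1, 101)
# ('A', '2', 2, 102)
# ('A', '3', 3, 103)
# ('B', '1', 4, 104)
# ('B', '2', None, None)
# ('B', '3', 5, 105)
```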
You just need a pivot table of numbers.
The query is very simple:
select classes.M, p.i as D, t.ID, t.C
from (select M, max(D) MaxValue from t group by m) classes
inner join pivot p
on p.i <= classes.MaxValue
left join t
on t.M = classes.M
and t.D = p.i
The pivot table is just a helper table:
CREATE TABLE Pivot (
i INT,
PRIMARY KEY(i)
)
Populate it like this:
CREATE TABLE Foo(
i CHAR(1)
)
INSERT INTO Foo VALUES('0')
INSERT INTO Foo VALUES('1')
INSERT INTO Foo VALUES('2')
INSERT INTO Foo VALUES('3')
INSERT INTO Foo VALUES('4')
INSERT INTO Foo VALUES('5')
INSERT INTO Foo VALUES('6')
INSERT INTO Foo VALUES('7')
INSERT INTO Foo VALUES('8')
INSERT INTO Foo VALUES('9')
Using the 10 rows in the Foo table, you can easily populate the Pivot table with 1,000 rows. To get 1,000 rows from 10 rows, join Foo to itself three times to create a Cartesian product:
INSERT INTO Pivot
SELECT f1.i+f2.i+f3.i
FROM Foo f1, Foo F2, Foo f3
You can read about this in the Transact-SQL Cookbook by Ales Spetic and Jonathan Gennick.
You just need to order by the final column definitions, t.m and t.d. So your final SQL would be:
SELECT *
FROM (SELECT *
FROM (SELECT DISTINCT m FROM t) AS dummy1,
(SELECT DISTINCT d FROM t) AS dummy2) AS combi
FULL OUTER JOIN t
ON combi.d = t.d
AND combi.m = t.m
ORDER BY t.m,
t.d;
Also, from a query optimization perspective, it is better not to have many layers of subqueries.
I think you need another correlation name - dummy3? - after 'as result )' before the order by.