Performance difference between Select count(ID) and Select count(*) - sql

There are two queries below which return count of ID column excluding NULL values
and second query will return the count of all the rows from the table including NULL rows.
select COUNT(ID) from TableName
select COUNT(*) from TableName
My Confusion :
Is there any performance difference ?

TL/DR: Plans might not be the same, you should test on appropriate
data and make sure you have the correct indexes and then choose the best solution based on your investigations.
The query plans might not be the same depending on the indexing and the nullability of the column which is used in the COUNT function.
In the following example I create a table and fill it with one million rows.
All the columns have been indexed except column 'b'.
The conclusion is that some of these queries do result in the same execution plan but most of them are different.
This was tested on SQL Server 2014, I do not have access to an instance of 2012 at this moment. You should test this yourself to figure out the best solution.
create table t1(id bigint identity,
dt datetime2(7) not null default(sysdatetime()),
a char(800) null,
b char(800) null,
c char(800) null);
-- We will use these 4 indexes. Only column 'b' does not have any supporting index on it.
alter table t1 add constraint [pk_t1] primary key NONCLUSTERED (id);
create clustered index cix_dt on t1(dt);
create nonclustered index ix_a on t1(a);
create nonclustered index ix_c on t1(c);
insert into T1 (a, b, c)
select top 1000000
a = case when low = 1 then null else left(REPLICATE(newid(), low), 800) end,
b = case when low between 1 and 10 then null else left(REPLICATE(newid(), 800-low), 800) end,
c = case when low between 1 and 192 then null else left(REPLICATE(newid(), 800-low), 800) end
from master..spt_values
cross join (select 1 from master..spt_values) m(ock)
where type = 'p';
checkpoint;
-- All rows, no matter if any columns are null or not
-- Uses primary key index
select count(*) from t1;
-- All not null,
-- Uses primary key index
select count(id) from t1;
-- Some values of 'a' are null
-- Uses the index on 'a'
select count(a) from t1;
-- Some values of b are null
-- Uses the clustered index
select count(b) from t1;
-- No values of dt are null and the table have a clustered index on 'dt'
-- Uses primary key index and not the clustered index as one could expect.
select count(dt) from t1;
-- Most values of c are null
-- Uses the index on c
select count(c) from t1;
Now what would happen if we were more explicit in what we wanted our count to do? If we tell the query planner that we want to get only rows which have not null, will that change anything?
-- Homework!
-- What happens if we explicitly count only rows where the column is not null? What if we add a filtered index to support this query?
-- Hint: It will once again be different than the other queries.
create index ix_c2 on t1(c) where c is not null;
select count(*) from t1 where c is not null;

Related

Eliminate subquery to improve query performance

I'd like to rewrite the following subquery as it's used over and over again in a larger query. The DBMS used is Postgres and the table has the following structure table (id uuid, seq int, value int).
Given a value for id (id_value), the query finds all records in "table" where seq < seq of id_value
My naive (slow) solution so far is the following:
select * from table
where seq < (select seq from table where id = id_value)
table
id, seq, value
a, 1, 12
b, 2, 22
c, 3, 32
x, 4, 43
d, 5, 54
s, 6, 32
a, 7, 54
e.g. a query
select * from table where seq < (select seq from table where id = 'x')
returns
a, 1, 12
b, 2, 22
c, 3, 32
For testing purposes, I've tried to hardcode the relevant seq field and it improves the whole query significantly, but I really don't like to query for seq as a two-stage process. Ideally this could happen as part of the query. Any ideas or inspiration would be appreciated.
CREATE TABLE foo
(
seq integer NOT NULL,
id uuid NOT NULL,
CONSTRAINT foo_pkey PRIMARY KEY (id),
CONSTRAINT foo_id_key UNIQUE (id),
CONSTRAINT foo_seq_key UNIQUE (seq)
);
CREATE UNIQUE INDEX idx_foo_id
ON public.foo USING btree
(id)
TABLESPACE pg_default;
CREATE UNIQUE INDEX idx_foo_seq
ON public.foo USING btree
(seq)
TABLESPACE pg_default;
You may have so many redundant indexes that you are confusing Postgres. Simply defining a column as primary key or unique is sufficient. You don't need multiple index declarations.
For what you want to do, this should be optimal:
select f.*
from foo f
where f.seq < (select f2.seq from foo f2 where f2.id = :id_value)
This should use the index to fetch the seq value in the subquery. Then it should return the appropriate rows.
You could also try:
select f.*
from (select f.*, min(seq) filter (where id = :id_value) over () as min_seq
from foo f
) f
where seq < min_seq;
However, my suspicion is simply that the query is returning a large number of rows and that is affecting performance.

How to optimize this long running sqlite3 query for finding duplicates?

I've got this rather insane query for finding all but the FIRST record with a duplicate value. It takes a substantially long time to run on 38000 records; about 50 seconds.
UPDATE exr_exrresv
SET mh_duplicate = 1
WHERE exr_exrresv._id IN
(
SELECT F._id
FROM exr_exrresv AS F
WHERE Exists
(
SELECT PHONE_NUMBER,
Count(_id)
FROM exr_exrresv
WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER
AND exr_exrresv.PHONE_NUMBER != ''
AND mh_active = 1 AND mh_duplicate = 0
GROUP BY exr_exrresv.PHONE_NUMBER
HAVING Count(exr_exrresv._id) > 1)
)
AND exr_exrresv._id NOT IN
(
SELECT Min(_id)
FROM exr_exrresv AS F
WHERE Exists
(
SELECT PHONE_NUMBER,
Count(_id)
FROM exr_exrresv
WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER
AND exr_exrresv.PHONE_NUMBER != ''
AND mh_active = 1
AND mh_duplicate = 0
GROUP BY exr_exrresv.PHONE_NUMBER
HAVING Count(exr_exrresv._id) > 1
)
GROUP BY PHONE_NUMBER
);
Any tips on how to optimize it or how I should begin to go about it? I've checked out the query plan but I'm really not sure how to begin improving it. Temp tables? Better query?
Here is the explain query plan output:
0|0|0|SEARCH TABLE exr_exrresv USING INTEGER PRIMARY KEY (rowid=?) (~12 rows)
0|0|0|EXECUTE LIST SUBQUERY 0
0|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows)
1|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows)
2|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows)
3|0|0|USE TEMP B-TREE FOR GROUP BY
2|0|0|USE TEMP B-TREE FOR GROUP BY
Any tips would be much appreciated. :)
Also, I am using Ruby to make the sql query so if it makes more sense for the logic to leave SQL and be written in Ruby, that's possible.
The schema is as follows, and you can use sqlfiddle here: http://sqlfiddle.com/#!2/2c07e
_id INTEGER PRIMARY KEY
OPPORTUNITY_ID varchar(50)
CREATEDDATE varchar(50)
FIRSTNAME varchar(50)
LASTNAME varchar(50)
MAILINGSTREET varchar(50)
MAILINGCITY varchar(50)
MAILINGSTATE varchar(50)
MAILINGZIPPOSTALCODE varchar(50)
EMAIL varchar(50)
CONTACT_PHONE varchar(50)
PHONE_NUMBER varchar(50)
CallFromWeb varchar(50)
OPPORTUNITY_ORIGIN varchar(50)
PROJECTED_LTV varchar(50)
MOVE_IN_DATE varchar(50)
mh_processed_date varchar(50)
mh_control INTEGER
mh_active INTEGER
mh_duplicate INTEGER
Guessing from your post, it looks like you are trying to update the mh_duplicate column for any record that has the same phone number if it's not the first record with that phone number?
If that's correct, I think this should get you the id's to update (you may need to add back your appropriate where criteria) -- from there, the Update is straight-forward:
SELECT e._Id
FROM exr_exrresv e
JOIN
( SELECT t.Phone_Number
FROM exr_exrresv t
GROUP BY t.Phone_Number
HAVING COUNT (t.Phone_Number) > 1
) e2 ON e.Phone_Number = e2.Phone_Number
LEFT JOIN
( SELECT MIN(t2._Id) as KeepId
FROM exr_exrresv t2
GROUP BY t2.Phone_Number
) e3 ON e._Id = e3.KeepId
WHERE e3.KeepId is null
And the SQL Fiddle.
Good luck.
This considers a record duplicate if there exists an active record with a matching phone_number and with a lesser _id. (No grouping or counting needed.)
update exr_exrresv
set mh_duplicate = 1
where exr_exrresv._id in (
select target._id
from exr_exrresv as target
where target.phone_number != ''
and target.mh_active = 1
and exists (
select null from exr_exrresv as probe
where probe.phone_number = target.phone_number
and probe.mh_active = 1
and probe._id < target._id
)
)
This query will be greatly aided if there exists an index on phone_number, ideally on exr_exrresv (phone_number, _id)
SQLFiddle

Can this ORDER BY on a CASE clause be made faster?

I'm selecting results from a table of ~350 million records, and it's running extremely slowly - around 10 minutes. The culprit seems to be the ORDER BY, as if I remove it the query only takes a moment. Here's the gist:
SELECT TOP 100
(columns snipped)
FROM (
SELECT
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
Basically, I'm ordering the results by whether Files have a corresponding Record, as those without need to be handled first. I could run two queries with WHERE clauses, but I'd rather do it in a single query if possible.
Is there any way to speed this up?
The ultimate issue is that the use of CASE in the sub-query introduces an ORDER BY over something that is not being used in a sargable manner. Thus the entire intermediate result-set must first be ordered to find the TOP 100 - this is all 350+ million records!2
In this particular case, moving the CASE to the outside SELECT and use a DESC ordering (to put NULL values, which means "0" in the current RecordExists, first) should do the trick1. It's not a generic approach, though .. but the ordering should be much, much faster iff Files.ID is indexed. (If the query is still slow, consult the query plan to find out why ORDER BY is not using an index.)
Another alternative might be to include a persisted computed column for RecordExists (that is also indexed) that can be used as an index in the ORDER BY.
Once again, the idea is that the ORDER BY works over something sargable, which only requires reading sequentially inside the index (up to the desired number of records to match the outside limit) and not ordering 350+ million records on-the-fly :)
SQL Server is then able to push this ordering (and limit) down into the sub-query, instead of waiting for the intermediate result-set of the sub-query to come up. Look at the query plan differences based on what is being ordered.
1 Example:
SELECT TOP 100
-- If needed
CASE WHEN (p1.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM (
SELECT
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
-- Hopefully ID is indexed, DESC makes NULLs (!RecordExists) go first
ORDER BY p1.ID DESC
2 Actually, it seems like it could hypothetically just stop after the first 100 0's without a full-sort .. at least under some extreme query planner optimization under a closed function range, but that depends on when the 0's are encountered in the intermediate result set (in the first few thousand or not until the hundreds of millions or never?). I highly doubt SQL Server accounts for this extreme case anyway; that is, don't count on this still non-sargable behavior.
Give this form a try
SELECT TOP(100) *
FROM (
SELECT TOP(100)
0 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
WHERE NOT EXISTS (SELECT * FROM dbo.Records e2 WHERE e1.FID = e2.FID)
ORDER BY SecondaryOrderColumn
) X
UNION ALL
SELECT * FROM (
SELECT TOP(100)
1 AS RecordExists
--,(columns snipped)
FROM dbo.Files AS e1
INNER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY SecondaryOrderColumn
) X
ORDER BY SecondaryOrderColumn
Key indexes:
Records (FID)
Files (FID, SecondaryOrdercolumn)
Well the reason it is much slower is because it is really a very different query without the order by clause.
With the order by clause:
Find all matching records out of the entire 350 million rows. Then sort them.
Without the order by clause:
Find the first 100 matching records. Stop.
Q: If you say the only difference is "with/outout" the "order by", then could you somehow move the "top 100" into the inner select?
EXAMPLE:
SELECT
(columns snipped)
FROM (
SELECT TOP 100
CASE WHEN (e2.ID IS NULL) THEN
CAST(0 AS BIT) ELSE CAST(1 AS BIT) END AS RecordExists,
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
) AS p1
ORDER BY p1.RecordExists
In SQL Server, null values collate lower than any value in the domain. Given these two tables:
create table dbo.foo
(
id int not null identity(1,1) primary key clustered ,
name varchar(32) not null unique nonclustered ,
)
insert dbo.foo ( name ) values ( 'alpha' )
insert dbo.foo ( name ) values ( 'bravo' )
insert dbo.foo ( name ) values ( 'charlie' )
insert dbo.foo ( name ) values ( 'delta' )
insert dbo.foo ( name ) values ( 'echo' )
insert dbo.foo ( name ) values ( 'foxtrot' )
go
create table dbo.bar
(
id int not null identity(1,1) primary key clustered ,
foo_id int null foreign key references dbo.foo(id) ,
name varchar(32) not null unique nonclustered ,
)
go
insert dbo.bar( foo_id , name ) values( 1 , 'golf' )
insert dbo.bar( foo_id , name ) values( 5 , 'hotel' )
insert dbo.bar( foo_id , name ) values( 3 , 'india' )
insert dbo.bar( foo_id , name ) values( 5 , 'juliet' )
insert dbo.bar( foo_id , name ) values( 6 , 'kilo' )
go
The query
select *
from dbo.foo foo
left join dbo.bar bar on bar.foo_id = foo.id
order by bar.foo_id, foo.id
yields the following result set:
id name id foo_id name
-- ------- ---- ------ -------
2 bravo NULL NULL NULL
4 delta NULL NULL NULL
1 alpha 1 1 golf
3 charlie 3 3 india
5 echo 2 5 hotel
5 echo 4 5 juliet
6 foxtrot 5 6 kilo
(7 row(s) affected)
This should allow the query optimizer to use a suitable index (if such exists); however, it does not guarantee than any such index would be used.
Can you try this?
SELECT TOP 100
(columns snipped)
FROM dbo.Files AS e1
LEFT OUTER JOIN dbo.Records AS e2 ON e1.FID = e2.FID
ORDER BY e2.ID ASC
This should give you where e2.ID is null first. Also, make sure Records.ID is indexed. This should give you the ordering you were wanting.

how to fix this query performance on SQL Server Compact Edition 4

i have the following SQL Query which runs on SQL Server CE 4
SELECT [Join_ReleaseMinDatePost].[FK_MovieID]
FROM (
SELECT [FK_MovieID], MIN([DatePost]) AS [ReleaseMinDatePost]
FROM [Release]
GROUP BY [FK_MovieID]
) [Join_ReleaseMinDatePost]
INNER JOIN
(
SELECT COUNT([ID]) AS [FolderCount], [FK_MovieID]
FROM [MovieFolder]
GROUP BY [FK_MovieID]
) [Join_MovieFolder]
ON [Join_MovieFolder].[FK_MovieID] = [Join_ReleaseMinDatePost].[FK_MovieID]
this query takes a long time to execute but if i change the Part
SELECT COUNT([ID]) AS [FolderCount], [FK_MovieID] FROM [MovieFolder] GROUP BY [FK_MovieID]
To
SELECT 1 AS [FolderCount], [FK_MovieID] FROM [MovieFolder]
So the full query becomes
SELECT [Join_ReleaseMinDatePost].[FK_MovieID]
FROM ( SELECT [FK_MovieID], MIN([DatePost]) AS [ReleaseMinDatePost] FROM [Release] GROUP BY [FK_MovieID] ) [Join_ReleaseMinDatePost]
INNER JOIN (SELECT 1 AS [FolderCount], [FK_MovieID] FROM [MovieFolder] ) [Join_MovieFolder]
ON [Join_MovieFolder].[FK_MovieID] = [Join_ReleaseMinDatePost].[FK_MovieID]
then the performance becomes very fast.
the problem is that the part that was changed if taken by itself is pretty fast. but for some reason the execution plan of the first query shows that the "actual number of rows" in the index scan is 160,016 while the total number of rows in the table MovieFolder is 2,192.
and the "Estimated number of rows" is 2,192.
so i think the problem is in the number of rows but i cant figure out why its all messed up.
any help will be appreciated :)
thanks
the schema of the tables is below
CREATE TABLE [Release] (
[ID] int NOT NULL
, [FD_ForumID] int NOT NULL
, [FK_MovieID] int NULL
, [DatePost] datetime NULL
);
GO
ALTER TABLE [Release] ADD CONSTRAINT [PK__Release__0000000000000052] PRIMARY KEY ([ID]);
GO
CREATE INDEX [IX_Release_DatePost] ON [Release] ([DatePost] ASC);
GO
CREATE INDEX [IX_Release_FD_ForumID] ON [Release] ([FD_ForumID] ASC);
GO
CREATE INDEX [IX_Release_FK_MovieID] ON [Release] ([FK_MovieID] ASC);
GO
CREATE UNIQUE INDEX [UQ__Release__0000000000000057] ON [Release] ([ID] ASC);
GO
CREATE TABLE [MovieFolder] (
[ID] int NOT NULL IDENTITY (1,1)
, [Path] nvarchar(500) NOT NULL
, [FK_MovieID] int NULL
, [Seen] bit NULL
);
GO
ALTER TABLE [MovieFolder] ADD CONSTRAINT [PK_MovieFolder] PRIMARY KEY ([ID]);
GO
CREATE INDEX [IX_MovieFolder_FK_MovieID] ON [MovieFolder] ([FK_MovieID] ASC);
GO
CREATE INDEX [IX_MovieFolder_Seen] ON [MovieFolder] ([Seen] ASC);
GO
CREATE UNIQUE INDEX [UQ__MovieFolder__0000000000000019] ON [MovieFolder] ([ID] ASC);
GO
CREATE UNIQUE INDEX [UQ__MovieFolder__0000000000000020] ON [MovieFolder] ([Path] ASC);
GO
I think you're running into a correlated subquery problem. The query part you're experimenting with is part of a JOIN condition, so it is fully evaluated for every potentially matching row. You're making your SQL engine do the second 'GROUP BY' for every row produced by the FROM clause. So it's reading 2192 rows to do the group by for each and every row produced by the FROM clause.
This suggest you're getting 73 rows in the FROM clause grouping (2192 * 73 = 160 016)
When you change it to do SELECT 1, you eliminate the table-scan read for grouping.
DaveE is right about the issue with your correlated subquery. When these issues arise you often need to rethink your entire query. If anything else fails you can probably save time extracting your sub-query to a temporary table like this:
/* Declare in-memory temp table */
DECLARE #Join_MovieFolder TABLE (
count INT,
movieId INT )
/* Insert data into temp table */
INSERT INTO #Join_MovieFolder ( count, movieId )
SELECT COUNT([ID]) AS [FolderCount], [FK_MovieID]
FROM [MovieFolder]
GROUP BY [FK_MovieID]
/* Inner join the temp table to avoid excessive sub-quering */
SELECT [Join_ReleaseMinDatePost].[FK_MovieID]
FROM (
SELECT [FK_MovieID], MIN([DatePost]) AS [ReleaseMinDatePost]
FROM [Release]
GROUP BY [FK_MovieID]
) [Join_ReleaseMinDatePost]
INNER JOIN #Join_MovieFolder
ON #Join_MovieFolder.movieId = [Join_ReleaseMinDatePost].[FK_MovieID]
i think if found the problem.
but i would like people to tell me if this indeed the problem.
the problem is that the 2 sub queries create some kind of like temp table (dont know how to call it).
but these 2 temp table dont contain a clustered index on [FK_MovieID].
so when the external join tries to join them it need to scan them several times and and this is mainly the problem.
now if i can only fix this ?

Recursive MySQL query

My database schema looks like this:
Table t1:
id
valA
valB
Table t2:
id
valA
valB
What I want to do, is, for a given set of rows in one of these tables, find rows in both tables that have the same valA or valB (comparing valA with valA and valB with valB, not valA with valB). Then, I want to look for rows with the same valA or valB as the rows in the result of the previous query, and so on.
Example data:
t1 (id, valA, valB):
1, a, B
2, b, J
3, d, E
4, d, B
5, c, G
6, h, J
t2 (id, valA, valB):
1, b, E
2, d, H
3, g, B
Example 1:
Input: Row 1 in t1
Output:
t1/4, t2/3
t1/3, t2/2
t2/1
...
Example 2:
Input: Row 6 in t1
Output:
t1/2
t2/1
I would like to have the level of the search at that the row was found in the result (e.g. in Example 1: Level 1 for t1/2 and t2/1, level 2 for t1/5, ...) A limited depth of recursion is okay. Over time, I maybe want to include more tables following the same schema into the query. It would be nice if it was easy to extend the query for that purpose.
But what matters most, is the performance. Can you tell me the fastest possible way to accomplish this?
Thanks in advance!
try this although it's not fully tested but looked like it was working :P (http://pastie.org/1140339)
drop table if exists t1;
create table t1
(
id int unsigned not null auto_increment primary key,
valA char(1) not null,
valB char(1) not null
)
engine=innodb;
drop table if exists t2;
create table t2
(
id int unsigned not null auto_increment primary key,
valA char(1) not null,
valB char(1) not null
)
engine=innodb;
drop view if exists t12;
create view t12 as
select 1 as tid, id, valA, valB from t1
union
select 2 as tid, id, valA, valB from t2;
insert into t1 (valA, valB) values
('a','B'),
('b','J'),
('d','E'),
('d','B'),
('c','G'),
('h','J');
insert into t2 (valA, valB) values
('b','E'),
('d','H'),
('g','B');
drop procedure if exists find_children;
delimiter #
create procedure find_children
(
in p_tid tinyint unsigned,
in p_id int unsigned
)
proc_main:begin
declare done tinyint unsigned default 0;
declare dpth smallint unsigned default 0;
create temporary table children(
tid tinyint unsigned not null,
id int unsigned not null,
valA char(1) not null,
valB char(1) not null,
depth smallint unsigned default 0,
primary key (tid, id, valA, valB)
)engine = memory;
insert into children select p_tid, t.id, t.valA, t.valB, dpth from t12 t where t.tid = p_tid and t.id = p_id;
create temporary table tmp engine=memory select * from children;
/* http://dec.mysql.com/doc/refman/5.0/en/temporary-table-problems.html */
while done <> 1 do
if exists(
select 1 from t12 t
inner join tmp on tmp.valA = t.valA or tmp.valB = t.valB and tmp.depth = dpth) then
insert ignore into children
select
t.tid, t.id, t.valA, t.valB, dpth+1
from t12 t
inner join tmp on tmp.valA = t.valA or tmp.valB = t.valB and tmp.depth = dpth;
set dpth = dpth + 1;
truncate table tmp;
insert into tmp select * from children where depth = dpth;
else
set done = 1;
end if;
end while;
select * from children order by depth;
drop temporary table if exists children;
drop temporary table if exists tmp;
end proc_main #
delimiter ;
call find_children(1,1);
call find_children(1,6);
You can do it with stored procedures (see listings 7 and 7a):
http://www.artfulsoftware.com/mysqlbook/sampler/mysqled1ch20.html
You just need to figure out a query for the step of the recursion - taking the already-found rows and finding some more rows.
If you had a database which supported SQL-99 recursive common table expressions (like PostgreSQL or Firebird, hint hint), you could take the same approach as in the above link, but using a rCTE as the framework, so avoiding the need to write a stored procedure.
EDIT: I had a go at doing this with an rCTE in PostgreSQL 8.4, and although i can find the rows, i can't find a way to label them with the depth at which they were found. First, i create a a view to unify the tables:
create view t12 (tbl, id, vala, valb) as (
(select 't1', id, vala, valb from t1)
union
(select 't2', id, vala, valb from t2)
)
Then do this query:
with recursive descendants (tbl, id, vala, valb) as (
(select *
from t12
where tbl = 't1' and id = 1) -- the query that identifies the seed rows, here just t1/1
union
(select c.*
from descendants p, t12 c
where (p.vala = c.vala or p.valb = c.valb)) -- the recursive term
)
select * from descendants;
You would imagine that capturing depth would be as simple as adding a depth column to the rCTE, set to zero in the seed query, then somehow incremented in the recursive step. However, i couldn't find any way to do that, given that you can't write subqueries against the rCTE in the recursive step (so nothing like select max(depth) + 1 from descendants in the column list), and you can't use an aggregate function in the column list (so no max(p.depth) + 1 in the column list coupled with a group by c.* on the select).
You would also need to add a restriction to the query to exclude already-selected rows; you don't need to do that in the basic version, because of the distincting effect of the union, but if you add a count column, then a row can be included in the results more than once with different counts, and you'll get a Cartesian explosion. But you can't easily prevent it, because you can't have subqueries against the rCTE, which means you can't say anything like and not exists (select * from descendants d where d.tbl = c.tbl and d.id = c.id)!
I know all this stuff about recursive queries is of no use to you, but i find it riveting, so please do excuse me.