Avoiding repetitive conditions in the SELECT CASE and WHERE clause - SQL

I have a table say TAB1 with the following columns -
USER_ID NUMBER(5),
PHN_NO1 CHAR(20),
PHN_NO2 CHAR(20)
I have to fetch records from TAB1 into another table TAB2, keeping only records where at least one of PHN_NO1 and PHN_NO2 is 10 characters long and begins with 5.
If in a record,say only PHN_NO1 satisfies the condition and PHN_NO2 does not then, TAB2.P1 should be same as TAB1.PHN_NO1 but TAB2.P2 should be NULL.
If neither of the two satisfy the condition, then the record should not be inserted into TAB2
Structure of TAB2 would be as
USER_ID number(5)- holding the ROWID of the record selected from TAB1
P1 char(10)- holding TAB1.PHN_NO1 if it is of length 10 and begins with 5, otherwise NULL
P2 char(10)- holding TAB1.PHN_NO2 if it is of length 10 and begins with 5, otherwise NULL
I could write the below query to achieve the above, but the conditions in the CASE and WHERE are repetitive. Please suggest a way to achieve the above in a better way.
CREATE TABLE TAB2
AS
SELECT
    USER_ID,
    CASE WHEN
        (LENGTH(TRIM(PHN_NO1)) = 10 AND TRIM(PHN_NO1) LIKE '5%')
    THEN
        CAST(TRIM(PHN_NO1) AS CHAR(10))
    ELSE
        CAST(NULL AS CHAR(10))
    END AS P1,
    CASE WHEN
        (LENGTH(TRIM(PHN_NO2)) = 10 AND TRIM(PHN_NO2) LIKE '5%')
    THEN
        CAST(TRIM(PHN_NO2) AS CHAR(10))
    ELSE
        CAST(NULL AS CHAR(10))
    END AS P2
FROM TAB1
WHERE
    (LENGTH(TRIM(PHN_NO1)) = 10 AND TRIM(PHN_NO1) LIKE '5%')
    OR
    (LENGTH(TRIM(PHN_NO2)) = 10 AND TRIM(PHN_NO2) LIKE '5%')

Sure you can! You do have to use some conditions though:
INSERT INTO New_Phone
SELECT user_id, phn_no1, phn_no2
FROM (SELECT user_id,
             CASE WHEN LENGTH(TRIM(phn_no1)) = 10 AND TRIM(phn_no1) LIKE '5%'
                  THEN SUBSTR(phn_no1, 1, 10) ELSE NULL END phn_no1,
             CASE WHEN LENGTH(TRIM(phn_no2)) = 10 AND TRIM(phn_no2) LIKE '5%'
                  THEN SUBSTR(phn_no2, 1, 10) ELSE NULL END phn_no2
      FROM Old_Phone) Old
WHERE phn_no1 IS NOT NULL
   OR phn_no2 IS NOT NULL;
(I have a working SQL Fiddle example.)
This should work on any RDBMS. Note that, because of your data, this isn't likely to be less performant than your original (which would not have used an index, given the TRIM()). It's also not likely to be better, given that most major RDBMSs are able to re-use the results of deterministic functions per-row.
Oh, it should be noted that, internationally, phone numbers can be up to 15 digits long (with an in-country minimum of 6 or fewer). Maybe use VARCHAR (and save yourself some TRIM()s, too)? Also, INTEGER (or BIGINT, or maybe TINYINT) is more often used for surrogate ids; NUMBER is a little strange.
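For instance, a revised definition might look something like this (a sketch only; VARCHAR2 assumes Oracle, which the NUMBER and CHAR types suggest, so substitute VARCHAR on other platforms):
CREATE TABLE TAB1 (
    USER_ID  INTEGER,
    PHN_NO1  VARCHAR2(15),  -- no trailing padding, so no TRIM() needed
    PHN_NO2  VARCHAR2(15)
);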

Related

SQL query: should return a single record if the search condition is met, otherwise return multiple records

I have a table with billions of records. The table structure is like:
ID NUMBER PRIMARY KEY,
MY_SEARCH_COLUMN NUMBER
MY_SEARCH_COLUMN will have a numeric value up to 15 digits in length.
What I want is: if a specific record is matched, I should get that matched value only,
i.e. if I enter WHERE MY_SEARCH_COLUMN = 123454321 and the table has the value 123454321, then only this should be returned.
But if the exact value is not matched, I should get the next 10 values from the table,
i.e. if I enter WHERE MY_SEARCH_COLUMN = 123454321 and the column does not have the value 123454321, then it should return 10 values from the table which are greater than 123454321.
Both cases should be covered in a single SQL query, and I have to keep the performance of the query in mind. I have already created an index on the MY_SEARCH_COLUMN column, so other suggestions to improve performance are welcome.
This could be tricky to do without using a proc or maybe some dynamic SQL, but we can try using ROW_NUMBER here:
WITH cte AS (
    SELECT ID, MY_SEARCH_COLUMN,
           ROW_NUMBER() OVER (ORDER BY MY_SEARCH_COLUMN) rn
    FROM yourTable
    WHERE MY_SEARCH_COLUMN >= 123454321
)
SELECT *
FROM cte
WHERE rn <= CASE WHEN EXISTS (SELECT 1 FROM yourTable WHERE MY_SEARCH_COLUMN = 123454321)
                 THEN 1
                 ELSE 10 END;
The basic idea of the above query is that we assign a row number to all records matching the target or greater. Then, we query using either a row number of 1, in case of an exact match, or all row numbers up to 10 in case of no match.
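If your RDBMS supports the standard FETCH FIRST syntax (PostgreSQL, DB2, Oracle 12c and later), the same idea can be sketched without counting row numbers: take the 10 smallest values at or above the target, then keep only the exact match if one is present.
WITH nxt AS (
    SELECT ID, MY_SEARCH_COLUMN
    FROM yourTable
    WHERE MY_SEARCH_COLUMN >= 123454321
    ORDER BY MY_SEARCH_COLUMN
    FETCH FIRST 10 ROWS ONLY
)
SELECT *
FROM nxt
WHERE MY_SEARCH_COLUMN = 123454321
   OR NOT EXISTS (SELECT 1 FROM nxt WHERE MY_SEARCH_COLUMN = 123454321);
Since the CTE only ever reads 10 rows through the index on MY_SEARCH_COLUMN, this should stay cheap even on billions of records.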
SELECT *
FROM your_table AS src
WHERE src.MY_SEARCH_COLUMN = CASE WHEN EXISTS (SELECT 1 FROM your_table AS src2 WITH (NOLOCK)
                                               WHERE src2.MY_SEARCH_COLUMN = 123454321)
                                  THEN 123454321
                                  ELSE src.MY_SEARCH_COLUMN
                             END

Update statement using a WHERE clause that contains columns with null Values

I am updating a column on one table using data from another table. The WHERE clause is based on multiple columns, and some of the columns are null. From my thinking, these nulls are what are throwing off the standard UPDATE TABLE SET X=Y WHERE A=B statement.
See this SQL Fiddle of the two tables where I am trying to update table_one based on data from table_two.
My query currently looks like this:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.invoice_number = table_two.invoice_number AND
table_one.submitted_by = table_two.submitted_by AND
table_one.passport_number = table_two.passport_number AND
table_one.driving_license_number = table_two.driving_license_number AND
table_one.national_id_number = table_two.national_id_number AND
table_one.tax_pin_identification_number = table_two.tax_pin_identification_number AND
table_one.vat_number = table_two.vat_number AND
table_one.ggcg_number = table_two.ggcg_number AND
table_one.national_association_number = table_two.national_association_number
The query fails for some rows, in that table_one.x isn't getting updated when any of the columns in either table are null, i.e. it only gets updated when all columns have some data.
This question is related to my earlier one here on SO where I was getting distinct values from a large data set using DISTINCT ON. What I now want is to populate the large data set with a value from the table which has unique fields.
UPDATE
I used the first update statement provided by @binotenary. For small tables, it runs in a flash. For example, one table with 20,000 records was updated in about 20 seconds. But another table with 9 million plus records has been running for 20 hrs so far! See below the output of the EXPLAIN function:
Update on table_one (cost=0.00..210634237338.87 rows=13615011125 width=1996)
-> Nested Loop (cost=0.00..210634237338.87 rows=13615011125 width=1996)
Join Filter: ((((my_update_statement_here))))
-> Seq Scan on table_one (cost=0.00..610872.62 rows=9661262 width=1986)
-> Seq Scan on table_two (cost=0.00..6051.98 rows=299998 width=148)
The EXPLAIN ANALYZE option also took forever, so I canceled it.
Any ideas on how to make this type of update faster? Even if it means using a different update statement or even using a custom function to loop through and do the update.
Since null = null does not evaluate to true, you need to check whether two fields are both null in addition to the equality check:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
(table_one.invoice_number = table_two.invoice_number
OR (table_one.invoice_number is null AND table_two.invoice_number is null))
AND
(table_one.submitted_by = table_two.submitted_by
OR (table_one.submitted_by is null AND table_two.submitted_by is null))
AND
-- etc
You could also use the coalesce function, which is more readable:
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
AND coalesce(table_one.submitted_by, '') = coalesce(table_two.submitted_by, '')
AND -- etc
But you need to be careful about the default values (the last argument to coalesce).
Its data type should match the column type (so that you don't end up comparing dates with numbers, for example), and the default should be a value that doesn't appear in the data.
E.g. coalesce(null, 1) = coalesce(1, 1) is a situation you'd want to avoid.
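A one-line demonstration of that trap (the default 1 swallows both the NULL and a genuine 1):
SELECT coalesce(NULL, 1) = coalesce(1, 1) AS false_match;  -- returns true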
Update (regarding performance):
Seq Scan on table_two - this suggests that you don't have any indexes on table_two.
So if you update a row in table_one then to find a matching row in table_two the database basically has to scan through all the rows one by one until it finds a match.
The matching rows could be found much faster if the relevant columns were indexed.
On the flipside if table_one has any indexes then that slows down the update.
According to this performance guide:
Table constraints and indexes heavily delay every write. If possible, you should drop all the indexes, triggers and foreign keys while the update runs and recreate them at the end.
Another suggestion from the same guide that might be helpful is:
If you can segment your data using, for example, sequential IDs, you can update rows incrementally in batches.
So, for example, if table_one has an id column, you could add something like
and table_one.id between x and y
to the where condition and run the query several times, changing the values of x and y so that all rows are covered.
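A minimal batching sketch as a PostgreSQL DO block (9.0 or later); the batch size is arbitrary, and it assumes table_one.id is a reasonably dense integer key:
DO $$
DECLARE
    batch_start integer := 0;
    batch_size  integer := 100000;
    max_id      integer;
BEGIN
    SELECT max(id) INTO max_id FROM table_one;
    WHILE batch_start <= max_id LOOP
        UPDATE table_one SET x = table_two.y
        FROM table_two
        WHERE table_one.id BETWEEN batch_start AND batch_start + batch_size - 1
          AND coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '');
          -- ...the remaining eight column comparisons go here, as above
        batch_start := batch_start + batch_size;
    END LOOP;
END $$;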
The EXPLAIN ANALYZE option also took forever
You might want to be careful when using the ANALYZE option with EXPLAIN when dealing with statements that have side effects.
According to documentation:
Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual.
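If you still want the ANALYZE numbers, one common approach is to wrap the statement in a transaction and roll it back, so the timing is real but the changes are discarded (a sketch; substitute the full UPDATE):
BEGIN;
EXPLAIN ANALYZE
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '');
ROLLBACK;  -- the update's data changes are discarded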
Try the below, similar to @binoternary's answer above. (They just beat me to the answer.)
update table_one
set column_x = (select column_y from table_two
    where
        ((table_two.invoice_number = table_one.invoice_number) OR (table_two.invoice_number IS NULL AND table_one.invoice_number IS NULL))
        and ((table_two.submitted_by = table_one.submitted_by) OR (table_two.submitted_by IS NULL AND table_one.submitted_by IS NULL))
        and ((table_two.passport_number = table_one.passport_number) OR (table_two.passport_number IS NULL AND table_one.passport_number IS NULL))
        and ((table_two.driving_license_number = table_one.driving_license_number) OR (table_two.driving_license_number IS NULL AND table_one.driving_license_number IS NULL))
        and ((table_two.national_id_number = table_one.national_id_number) OR (table_two.national_id_number IS NULL AND table_one.national_id_number IS NULL))
        and ((table_two.tax_pin_identification_number = table_one.tax_pin_identification_number) OR (table_two.tax_pin_identification_number IS NULL AND table_one.tax_pin_identification_number IS NULL))
        and ((table_two.vat_number = table_one.vat_number) OR (table_two.vat_number IS NULL AND table_one.vat_number IS NULL))
        and ((table_two.ggcg_number = table_one.ggcg_number) OR (table_two.ggcg_number IS NULL AND table_one.ggcg_number IS NULL))
        and ((table_two.national_association_number = table_one.national_association_number) OR (table_two.national_association_number IS NULL AND table_one.national_association_number IS NULL))
);
You can use a null check function like Oracle's NVL.
For Postgres, you will have to use coalesce.
i.e. your query can look like :
UPDATE table_one SET x = (select table_two.y from table_one, table_two
WHERE
coalesce(table_one.invoice_number, table_two.invoice_number, 1) = coalesce(table_two.invoice_number, table_one.invoice_number, 1)
AND
coalesce(table_one.submitted_by, table_two.submitted_by, 1) = coalesce(table_two.submitted_by, table_one.submitted_by, 1))
where table_one.table_one_pk in (select table_one.table_one_pk from table_one, table_two
WHERE
coalesce(table_one.invoice_number, table_two.invoice_number, 1) = coalesce(table_two.invoice_number, table_one.invoice_number, 1)
AND
coalesce(table_one.submitted_by, table_two.submitted_by, 1) = coalesce(table_two.submitted_by, table_one.submitted_by, 1));
Your current query joins two tables using Nested Loop, which means that the server processes
9,661,262 * 299,998 = 2,898,359,277,476
rows. No wonder it takes forever.
To make the join efficient you need an index on all joined columns. The problem is NULL values.
If you use a function on the joined columns, generally the index can't be used.
If you use an expression like this in the JOIN:
coalesce(table_one.invoice_number, '') = coalesce(table_two.invoice_number, '')
an index can't be used.
So, we need an index and we need to do something with NULL values to make index usable.
We don't need to make any changes in table_one, because it has to be scanned in full in any case.
But, table_two definitely can be improved. Either change the table itself, or create a separate (temporary) table. It has only 300K rows, so it should not be a problem.
Make all columns that are used in the JOIN NOT NULL.
CREATE TABLE table_two (
id int4 NOT NULL,
invoice_number varchar(30) NOT NULL,
submitted_by varchar(20) NOT NULL,
passport_number varchar(30) NOT NULL,
driving_license_number varchar(30) NOT NULL,
national_id_number varchar(30) NOT NULL,
tax_pin_identification_number varchar(30) NOT NULL,
vat_number varchar(30) NOT NULL,
ggcg_number varchar(30) NOT NULL,
national_association_number varchar(30) NOT NULL,
column_y int,
CONSTRAINT table_two_pkey PRIMARY KEY (id)
);
Update the table and replace NULL values with '', or some other appropriate value.
Create an index on all columns that are used in the JOIN, plus column_y. column_y has to be included last in the index. I assume that your UPDATE is well-formed, so the index should be unique.
CREATE UNIQUE INDEX IX ON table_two
(
invoice_number,
submitted_by,
passport_number,
driving_license_number,
national_id_number,
tax_pin_identification_number,
vat_number,
ggcg_number,
national_association_number,
column_y
);
The query will become
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
Note that COALESCE is used only on table_one columns.
It is also a good idea to do UPDATE in batches, rather than the whole table at once. For example, pick a range of ids to update in a batch.
UPDATE table_one SET x = table_two.y
FROM table_two
WHERE
table_one.id >= <some_starting_value> AND
table_one.id < <some_ending_value> AND
COALESCE(table_one.invoice_number, '') = table_two.invoice_number AND
COALESCE(table_one.submitted_by, '') = table_two.submitted_by AND
COALESCE(table_one.passport_number, '') = table_two.passport_number AND
COALESCE(table_one.driving_license_number, '') = table_two.driving_license_number AND
COALESCE(table_one.national_id_number, '') = table_two.national_id_number AND
COALESCE(table_one.tax_pin_identification_number, '') = table_two.tax_pin_identification_number AND
COALESCE(table_one.vat_number, '') = table_two.vat_number AND
COALESCE(table_one.ggcg_number, '') = table_two.ggcg_number AND
COALESCE(table_one.national_association_number, '') = table_two.national_association_number
You can use the coalesce function, which returns the first of its arguments that is not null. A null check function will help you.
Null-related functions here.
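As a minimal illustration of the behaviour:
SELECT coalesce(NULL, NULL, 'fallback');  -- returns 'fallback', the first non-null argument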

Check a lot of columns for at least one 'true'

I have a table with a lot of columns (say 200) they are all boolean. I want to know which of those has at least one record set to true. I have come up with the following query which works fine:
SELECT sum(CASE WHEN [column1] = 1 THEN 1 ELSE 0 END) AS column1,
       sum(CASE WHEN [column2] = 1 THEN 1 ELSE 0 END) AS column2,
       sum(CASE WHEN [column3] = 1 THEN 1 ELSE 0 END) AS column3
FROM [tablename];
It will return the number of rows that are 'true' for each column. However, this is more information than I need, and thereby maybe a more expensive query than needed. The query keeps scanning all fields for all records even though that would not be necessary.
I just learned something about CHECKSUM(*) that might be useful. Try the following code:
DECLARE @T TABLE (
    b1 bit
   ,b2 bit
   ,b3 bit
);
DECLARE @T2 TABLE (
    b1 bit
   ,b2 bit
   ,b3 bit
   ,b4 bit
   ,b5 bit
);
INSERT INTO @T VALUES (0,0,0),(1,1,1);
INSERT INTO @T2 VALUES (0,0,0,0,0),(1,1,1,1,1);
SELECT CHECKSUM(*) FROM @T;
SELECT CHECKSUM(*) FROM @T2;
You will see from the results that no matter how many columns are in a row, if they are all bit columns with a value of 0, the result of CHECKSUM(*) is always 0.
This means that you could use WHERE CHECKSUM(*)<>0 in your query to save the engine the trouble of summing rows where all the values are 0. Might improve performance.
And even if it doesn't, it's a neat thing to know.
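For example, a sketch combining the filter with the original query (column and table names as in the question):
SELECT sum(CASE WHEN [column1] = 1 THEN 1 ELSE 0 END) AS column1,
       sum(CASE WHEN [column2] = 1 THEN 1 ELSE 0 END) AS column2
FROM [tablename]
WHERE CHECKSUM(*) <> 0;  -- rows where every bit column is 0 contribute nothing, so skip them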
EDIT:
You could do an EXISTS() function on each column. I understand that the EXISTS() function stops scanning when it finds a value that exists. If you have more rows than columns, it might be more performant. If you have more columns than rows, then your current query using SUM() on every column is probably the fastest thing you can do.
If you just want to know the rows that have at least one boolean field set to true, you will need to test every one of them.
Something like this (maybe):
SELECT ROW.*
FROM TABLE ROW
WHERE ROW.COLUMN_1 = 1
OR ROW.COLUMN_2 = 1
OR ROW.COLUMN_3 = 1
OR ...
OR ROW.COLUMN_N = 1;
If you actually have 200 columns/fields on one table with boolean then something like the following should work.
SELECT CASE WHEN column1 + column2 + column3 + ... + column200 >= 1 THEN 'Something was true for this record' ELSE NULL END AS My_Big_Field_Test
FROM [TableName];
I'm not in front of my machine, but you could also try the bitwise or operator:
SELECT * FROM [table name] WHERE column1 | column2 | column3 = 1
The OR answer from Arthur is the other suggestion I would offer. Try a few different suggestions and look at the query plans. Also take a look at disk reads and CPU usage. (SET STATISTICS IO ON and SET STATISTICS TIME ON).
See whatever method gives the desired results and the best performance... and then let us know :-)
You can use a query of the form
SELECT
CASE WHEN EXISTS (SELECT * FROM [Table] WHERE [Column1] = 1) THEN 1 ELSE 0 END AS 'Column1',
CASE WHEN EXISTS (SELECT * FROM [Table] WHERE [Column2] = 1) THEN 1 ELSE 0 END AS 'Column2',
...
The efficiency of this critically depends on how sparse your table is. If there are columns where every single row has a 0 value, then any query that searches for a 1 value will require a full table scan, unless an index is in place. A really good choice for this scenario (millions of rows and hundreds of columns) is a columnstore index. These are supported from SQL Server 2012 onwards; from SQL Server 2014 onwards they don't cause the table to be read-only (which is a major barrier to their adoption).
With a columnstore index in place, each subquery should require constant time, and so should the query as a whole (in fact, with hundreds of columns, this query gets so big that you might run into trouble with the input buffer and need to split it up into smaller queries). Without indexes, this query can still be effective as long as the table isn't sparse -- if it "quickly" runs into a row with a 1 value, it stops.
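A columnstore DDL sketch (SQL Server 2012 or later; the index name is a placeholder, and the column list would have to enumerate all 200 bit columns):
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_tablename_allbits
ON [tablename] ([column1], [column2], [column3] /* ...through [column200] */);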

Random select is not always returning a single row

The intention of the following (simplified) code fragment is to return one random row.
Unfortunately, when we run this fragment in the query analyzer, it returns between zero and three results.
As our input table consists of exactly 5 rows with unique IDs, and as we perform a select on this table where ID equals a random number, we are stumped that there would ever be more than one row returned.
Note: among other things, we already tried casting the checksum result to an integer, to no avail.
DECLARE @Table TABLE (
    ID INTEGER IDENTITY (1, 1)
    , FK1 INTEGER
)
INSERT INTO @Table
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
SELECT *
FROM @Table
WHERE ID = ABS(CHECKSUM(NEWID())) % 5 + 1
Edit
Our usage scenario is as follows (please don't comment on whether it is the right thing to do or not; it's the powers that be that have decided):
Ultimately, we must create a result with realistic values where the combination of producer and weights is obfuscated by selecting existing weights at random from the table itself.
The query then would become something like this (also a reason why RAND cannot be used):
SELECT t.ID
     , FK1 = (SELECT FK1 FROM @Table WHERE ID = ABS(CHECKSUM(NEWID())) % 5 + 1)
FROM @Table t
Because the inner select could return zero results, it would return a NULL value, which again is not acceptable. It is the investigation of why the inner select returns between zero and x results that spawned this question.
Answer
What turned the light on for me was the simple observation that ABS(CHECKSUM(NEWID())) % 5 + 1 was re-evaluated for each row. I was under the impression that ABS(CHECKSUM(NEWID())) % 5 + 1 would get evaluated once, then matched.
Thank you all for answering and slowly but surely leading me to a better understanding.
The reason this happens is that NEWID() gives a different value for each row in the table. For each row, independently of the others, there is a one in five chance of it being returned. Consequently, as it stands, you actually have a 1 in 3125 chance of all 5 rows being returned!
To see this, run the following query. You'll see that each row gets a different ID.
SELECT *, NEWID()
FROM @Table
This will fix your code:
DECLARE @Id int
SET @Id = ABS(CHECKSUM(NEWID())) % 5 + 1
SELECT *
FROM @Table
WHERE ID = @Id
However, I'm not sure this is the most efficient method of selecting a single random row from the table.
You might find this MSDN article useful: http://msdn.microsoft.com/en-us/library/Aa175776 (Random Sampling in T-SQL)
EDIT 1: now I think about it, this probably is the most efficient way to do it, assuming the number of rows remains fixed and the IDs are guaranteed to be contiguous.
EDIT 2: to achieve the desired result when used as a sub-query, use TOP 1 like this:
SELECT t.ID
, FK1 = (SELECT TOP 1 FK1 FROM @Table ORDER BY NEWID())
FROM @Table t
A bit of a guess, and I'm not sure SQL works this way, but wouldn't SQL evaluate ABS(CHECKSUM(NEWID())) % 5 + 1 for each row in the table? If so, then each evaluation may or may not match the ID of the current row.
Try this instead - generating the random number explicitly first, and matching on that single value:
declare @targetRandom int
set @targetRandom = ABS(CHECKSUM(NEWID())) % 5 + 1
select * from @Table where ID = @targetRandom
Try the following, so you can see what happens (the alias has to come from a derived table to be usable in the WHERE clause):
SELECT *
FROM (SELECT ABS(CHECKSUM(NEWID())) % 5 + 1 AS Number, t.*
      FROM @Table t) x
WHERE ID = Number
Or you could use RAND() instead of NEWID(), which is only evaluated once per query in MS SQL
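A sketch of that RAND() variant; because RAND() without a seed is evaluated once per statement, every row is compared against the same value:
SELECT *
FROM @Table
WHERE ID = CAST(RAND() * 5 AS int) + 1  -- one value in 1..5 for the whole query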
If you want to use CHECKSUM to obtain a random row, this is the way to do it.
SELECT TOP 1 *
FROM @Table
ORDER BY CHECKSUM(NEWID())
What about this?
SELECT t.ID
, FK1 = (SELECT TOP 1 FK1 FROM @Table ORDER BY NEWID())
FROM @Table t
This may help you understand the reasons.
Run the query multiple times. How many times does MY_FILTER = ID ?
SELECT *, ABS(CHECKSUM(NEWID())) % 5 + 1 AS MY_FILTER
FROM @Table
SELECT *, ABS(CHECKSUM(NEWID())) % 5 + 1 AS MY_FILTER
FROM @Table
SELECT *, ABS(CHECKSUM(NEWID())) % 5 + 1 AS MY_FILTER
FROM @Table
I don't know how much this will help you, but try this. As I understand it, you want one random row each time you execute the query.
select top 1 newid() as row, ID from @Table order by row
Here is the logic: each time you execute the query, a NEWID() value is assigned to each row; they are all unique, and you just order the rows by this newly generated id. Then all you need to do is select the topmost row, or however many you want.

Iterate through "linked list" in one SQL query?

I have a table that looks basically like this:
id | redirectid | data
where the redirectid is an id referring to another row. Basically, if a row is selected and it has a redirectid, then the redirectid's data should be used in its place. There may be multiple redirects until redirectid is NULL. Essentially these redirects form a linked list in the table. What I'd like to know is: given an id, is it possible to set up an SQL query that will iterate through all the possible redirects and return the id at the end of the "list"?
This is using PostgreSQL 8.3, and I'd like to do everything in one SQL query if possible (rather than iterate in my code).
Does postgresql support recursive queries that use WITH clauses? If so, something like this might work. (If you want a tested answer, provide some CREATE TABLE and INSERT statements in your question, along with the results you need for the sample data in the INSERTs.)
with Links(id,link,data) as (
select
id, redirectid, data
from T
where redirectid is null
union all
select
id, redirectid, null
from T
where redirectid is not null
union all
select
Links.id,
T.redirectid,
case when T.redirectid is null then T.data else null end
from T
join Links
on Links.link = T.id
)
select id, data
from Links
where data is not null;
Additional remarks:
If not (PostgreSQL only gained recursive WITH queries in 8.4), you can implement the recursion yourself based on the WITH expression. I don't know the postgresql syntax for sequential programming, so this is a bit pseudo:
Insert the result of this query into a new table called Links:
select
id, redirectid as link, data, 0 as depth
from T
where redirectid is null
union all
select
id, redirectid, null, 0
from T
where redirectid is not null
Also declare an integer ::depth and initialize it to zero. Then repeat the following until it no longer adds rows to Links. Links will then contain your result.
increment ::depth;
insert into Links
select
Links.id,
T.redirectid,
case when T.redirectid is null then T.data else null end,
depth + 1
from T join Links
on Links.link = T.id
where depth = ::depth-1;
end;
I think this will be better than any cursor solution. In fact, I can't really think of how cursors would be useful for this problem at all.
Note that this will not terminate if there are any cycles (redirects that are ultimately circular).
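For readers on PostgreSQL 8.4 or later, a hedged sketch of the recursive version with a cycle guard: carry the visited ids in an array and stop when a node repeats (table T and its columns as in the question).
WITH RECURSIVE chain(start_id, id, redirectid, path) AS (
    SELECT id, id, redirectid, ARRAY[id] FROM T
    UNION ALL
    SELECT chain.start_id, T.id, T.redirectid, chain.path || T.id
    FROM T JOIN chain ON chain.redirectid = T.id
    WHERE T.id <> ALL (chain.path)  -- refuse to revisit a node
)
SELECT start_id AS id, id AS final_id
FROM chain
WHERE redirectid IS NULL;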
I'd say you should create a user-defined function in this vein:
create function FindLastId (startid integer) returns integer as $$
declare
    newid integer;
    primaryid integer;
begin
    primaryid := startid;
    loop
        -- follow one redirect hop; newid is null at the end of the chain
        select redirectid into newid from T where id = primaryid;
        exit when newid is null;
        primaryid := newid;
    end loop;
    return primaryid;
end;
$$ language plpgsql;
I'm a bit shaky on the Postgres syntax, so you may have some cleanup to do. Anyway, you can then call your function like so:
select id, FindLastId(id) as EndId from T
On a table like so:
id  redirectid  data
1   3           ab
2   null        cd
3   2           ef
4   1           gh
5   null        ij
This will return:
id  EndId
1   2
2   2
3   2
4   2
5   5
Note that this will be markedly slow, but it should get you the IDs pretty quickly for a small result set on a well-indexed table.