Working with sequential numbers in SQL Server 2005 without cursors - sql-server-2005

I'm currently working on a project that needs a process that assigns "control numbers" to some records. The process also needs to be re-runnable at a later date, so that records which changed and still lack a control number get an unused number assigned. These control numbers are preassigned by an outside entity and are 9 digits long. You would usually get a range depending on how many records your company is estimated to generate. For example, one of the companies estimated they would need 50, so they assigned us the range 790123401 to 790123450.
The problem: right now I'm using cursors to assign these numbers. For each individual record, I check whether the first number in the sequence is already taken in the table; if it is, I increment the number and check again. This goes on and on for each record in the table. One of the companies has 17,000 records, which means that for each record I could, in the worst case, be iterating 17,000 times if all numbers have been taken.
I really don't mind all the repetition on the initial run, since the first run will assign control numbers to a lot of records. My problem is that if a record is later changed and now needs a control number, re-running the process means stepping through every available number until an unused one turns up.
I've seen numerous examples on how to use sequences without cursors, but most are specific to Oracle. I'm using SQL Server 2005 for this particular project.
Suggestions?

You are looking for all unassigned numbers in a range? If so, you can outer join onto a numbers table. The example below uses a CTE to create one on the fly; I would suggest a permanent one containing at least 17,000 numbers, if that is the maximum size of your range.
DECLARE @StartRange int, @EndRange int
SET @StartRange = 790123401
SET @EndRange = 790123450;

WITH YourTable(ControlNumber) AS
(
    SELECT 790123401 UNION ALL
    SELECT 790123402 UNION ALL
    SELECT 790123403 UNION ALL
    SELECT 790123406
),
Nums(N) AS
(
    SELECT @StartRange
    UNION ALL
    SELECT N + 1
    FROM Nums
    WHERE N < @EndRange
)
SELECT N
FROM Nums
WHERE NOT EXISTS (SELECT *
                  FROM YourTable
                  WHERE ControlNumber = N)
OPTION (MAXRECURSION 0)
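To actually assign the free numbers without a cursor, you can pair them with the rows that still need numbers using ROW_NUMBER(), which SQL Server 2005 supports. A minimal sketch, assuming a permanent Nums(N) table and a hypothetical Records(RecordID, ControlNumber) table:
WITH FreeNumbers AS
(
    -- unused numbers in the assigned range, ranked
    SELECT N, ROW_NUMBER() OVER (ORDER BY N) AS rn
    FROM Nums
    WHERE N BETWEEN @StartRange AND @EndRange
      AND NOT EXISTS (SELECT * FROM Records WHERE ControlNumber = N)
),
NeedsNumber AS
(
    -- records still waiting for a control number, ranked
    SELECT RecordID, ROW_NUMBER() OVER (ORDER BY RecordID) AS rn
    FROM Records
    WHERE ControlNumber IS NULL
)
UPDATE r
SET ControlNumber = f.N
FROM Records r
JOIN NeedsNumber nn ON nn.RecordID = r.RecordID
JOIN FreeNumbers f ON f.rn = nn.rn
Each waiting record gets the free number with the matching rank, so the whole assignment happens in one set-based pass.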

How to select random entry from table?

I need to display a random last name of a person who entered into an employment contract in a specified month, using the RAND function
go
CREATE OR ALTER function [dbo].[User_Surname]
(@mont int)
returns nvarchar(50)
begin
    Declare @surname nvarchar(50)
    Set @surname = (Select top(1) surname from dbo.Tenants
        inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
        where MONTH(lease_agreements.rental_start_date) = @mont and dbo.Tenants.tenant_code = (select * from randNumber))
    return @surname
end
go
select dbo.User_Surname(1)
create or alter view randNumber as
Select FLOOR((RAND() * (MAX(tenant_code + 1) - 1)) + 1) as value from Tenants
So what if tenant #42 has been removed? If the random number function returns 42, then your query will yield nothing.
To fix this problem, one approach (which would be quite difficult to implement correctly) would involve a row-sequence-number column: an integer which increments sequentially and contains no gaps. In order to avoid a gap when a row is deleted, you must pick the last row from the table and give it the row-sequence-number of the deleted row. Consistently doing so without ever forgetting seems like a tough proposition, and achieving it without concurrency problems when rows are being concurrently deleted seems even tougher. Furthermore, the possibility that the last row may be re-sequenced means that you cannot use an SQL SEQUENCE for issuing row sequence numbers, or that your RDBMS must support counting down on a sequence, which is also a tough proposition.
A better approach would be to create a random number N between zero and the number of rows instead of the maximum row id number, and then to pick the Nth row from the table. That would be something like SELECT BOTTOM 1 FROM (SELECT TOP N FROM...
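A rough T-SQL sketch of that idea (T-SQL has no BOTTOM, so the inner TOP N result is re-sorted in reverse instead; SQL Server 2005+ for TOP (@n), table and column names are illustrative, and the table is assumed non-empty):
DECLARE @n int
-- random position between 1 and the row count
SET @n = 1 + ABS(CHECKSUM(NEWID())) % (SELECT COUNT(*) FROM dbo.Tenants)
SELECT TOP (1) surname
FROM (SELECT TOP (@n) surname, tenant_code
      FROM dbo.Tenants
      ORDER BY tenant_code) AS firstN
ORDER BY tenant_code DESC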
An SQL-only solution (involving no stored procedures) would be very inefficient. It would involve joining the table of interest with a random-number function (just real random numbers between 0.0 and 1.0), essentially creating a new table which also contains a random number field, then using ORDER BY on the random field, and then using TOP 1 to get the first row. To achieve this, your RDBMS would be performing a full table scan and creating an entire new sorted temporary table, and it would do that each time you ask for a row at random, so it would be preposterously inefficient.
A performance improvement on the above idea would be to permanently add the random number column to each row (and to issue a new random number between 0.0 and 1.0 to each row inserted later), and then use a SEQUENCE for issuing sequential row index numbers, so that each time you want a new random row you pick the next number N from the sequence, compute its modulus by the number of rows in the table, and get the Nth row from the table sorted by the random-number column. It would probably be a good idea to index that random number column. The problem with this approach is that it does not truly yield records at random; it yields all records in random order. Truly yielding records at random means that the same row might be yielded twice in two successive queries, whereas this approach will only yield a record again once all other records have been yielded first.
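A loose sketch of that scheme (SQL Server 2012+ syntax for SEQUENCE; all names are made up):
-- one-time setup: a persistent random sort key plus a sequence
ALTER TABLE dbo.Tenants ADD rnd float NULL
go
UPDATE dbo.Tenants SET rnd = RAND(CHECKSUM(NEWID()))  -- RAND(seed) varies per row; later inserts must fill rnd too
CREATE SEQUENCE dbo.RowPick START WITH 1 INCREMENT BY 1
go
-- each draw: the next sequence value, modulo the row count, indexes into the random order
DECLARE @n bigint = NEXT VALUE FOR dbo.RowPick
SELECT surname
FROM (SELECT surname, ROW_NUMBER() OVER (ORDER BY rnd) AS pos
      FROM dbo.Tenants) AS shuffled
WHERE pos = 1 + (@n - 1) % (SELECT COUNT(*) FROM dbo.Tenants)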
As you want only one tenant, sort randomly and take the top row, i.e. ORDER BY NEWID(). (In SQL Server, ORDER BY RAND() would not shuffle anything, because RAND() is evaluated only once per statement. Note also that SQL Server does not allow NEWID() or RAND() directly inside a scalar function, which is presumably why the question routed RAND() through the randNumber view.)
As always with randomness, you could also get the same tenant 100 times in a row, especially when only a small number of tenants fit the bill.
This will never be fast, as the table needs to be fully scanned,
but you should at least have an index on (tenant_code, rental_start_date) so that selecting the qualifying tenants is faster.
CREATE OR ALTER function [dbo].[User_Surname]
(@mont int)
returns nvarchar(50)
begin
    Declare @surname nvarchar(50)
    -- NB: SQL Server rejects NEWID()/RAND() directly inside a function; see the note above
    Set @surname = (Select top(1) surname from dbo.Tenants
        inner join dbo.lease_agreements on Tenants.tenant_code = lease_agreements.tenant_code
        where MONTH(lease_agreements.rental_start_date) = @mont
        ORDER BY NEWID())
    return @surname
end

Are there any database implementations that allow for tables that don't contain data but generate data upon query?

I have an application that works well with database query outputs but now need to run with each output over a range of numbers. Sure, I could refactor the application to iterate over the range for me, but it would arguably be cleaner if I could just have a "table" in the database that I could CROSS JOIN with my normal query outputs. Sure, I could just make a table that contains a range of values, but that seems like unnecessary waste.
For example, a "table" that represents a range of values, say 0 to 999,999, in a column called "number", WITHOUT actually having to store a million rows, but which can be used in a query with a CROSS JOIN against another table as though such a table really existed.
I am mostly just curious if such a construct exists in any database implementation.
PostgreSQL has generate_series. SQLite has it as a loadable extension.
SELECT * FROM generate_series(0,9);
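For the cross-join use case from the question, it behaves like any other row source; a quick PostgreSQL example (mytable is an illustrative name):
SELECT t.*, gs.n
FROM mytable t
CROSS JOIN generate_series(0, 9) AS gs(n);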
On databases which support recursive CTE (SQLite, PostgreSQL, MariaDB), you can do this and then join with it.
WITH RECURSIVE cnt(x) AS (
  VALUES(1)
  UNION ALL
  SELECT x+1 FROM cnt WHERE x < 1000000
)
SELECT x FROM cnt;
The initial-select runs first and returns a single row with a single column "1". This one row is added to the queue. In step 2a, that one row is extracted from the queue and added to "cnt". Then the recursive-select is run in accordance with step 2c generating a single new row with value "2" to add to the queue. The queue still has one row, so step 2 repeats. The "2" row is extracted and added to the recursive table by steps 2a and 2b. Then the row containing 2 is used as if it were the complete content of the recursive table and the recursive-select is run again, resulting in a row with value "3" being added to the queue. This repeats 999999 times until finally at step 2a the only value on the queue is a row containing 1000000. That row is extracted and added to the recursive table. But this time, the WHERE clause causes the recursive-select to return no rows, so the queue remains empty and the recursion stops.
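To use the generated numbers the way the question asks, just join the CTE against an ordinary table, for example (orders is an illustrative table name):
WITH RECURSIVE cnt(x) AS (
  VALUES(1)
  UNION ALL
  SELECT x+1 FROM cnt WHERE x < 10
)
SELECT o.id, cnt.x
FROM orders o CROSS JOIN cnt;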
Generally speaking, this depends a lot on the database you're using. Suppose, for example, that you want to generate a sequence from 1 to 100 in SQLite. You could code it like this:
WITH basic(i) AS (
  VALUES(1)
),
seq(i) AS (
  SELECT i FROM basic
  UNION ALL
  SELECT i + 1 FROM seq WHERE i < 100
)
SELECT * FROM seq;
Hope this helps.
Looks like the answer to my question "Are there any database implementations that allow for tables that don't contain data but generate data upon query?" is yes. For example in sqlite there exists virtual tables: https://www.sqlite.org/vtab.html
In fact, it has the exact sort of thing I was looking for with generate_series: https://www.sqlite.org/series.html
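With that extension loaded, the virtual table behaves exactly like the hypothetical number table from the question, without storing a million rows (value is the column it exposes):
SELECT value FROM generate_series(0, 999999);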

What is the distribution of getting a single random row in Oracle using this SQL statement?

We are attempting to pull a semi-random row from Oracle. (We don't need a perfectly random row that would meet rigorous statistical scrutiny, but we would like something that has a chance of returning any row in the table, even if there is some degree of skew.)
We are using this approach:
SELECT PERSON_ID FROM ENCOUNTER SAMPLE(0.0001) WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020 AND ROWNUM = 1
This approach appears to be giving us just one random result each time we run it.
However, according to answers to this question, this approach gives results from the beginning of the table far more commonly.
How commonly? If that statement is true then how much more commonly are values taken from the top of the table? Our typical table has tens of millions of rows (occasionally billions.) Is there a simple heuristic or a rough estimate to understand the skew in the distribution we can expect?
We are asking for skew because other methods aren't fast enough for our use case. We are avoiding using ORDER because the source tables can be so large (i.e. billions of rows) that the reporting server will run for hours or can time out before we get an answer. Thus, our constraint is we need to use approaches like SAMPLE that respond with little database overhead.
The issue is that SAMPLE basically goes through the table in order and randomly selects rows. The problem is the ROWNUM, not the SAMPLE.
The solution is to use sample and then randomly sort:
SELECT p.*
FROM (SELECT PERSON_ID
      FROM ENCOUNTER SAMPLE(0.0001)
      WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020
      ORDER BY dbms_random.value
     ) p
WHERE ROWNUM = 1
Just for fun, here is an alternative way to select a single, uniformly distributed row out of a (uniformly distributed) "small" sample of rows from the table.
Suppose the table has millions or billions of rows, and we use the sample clause to select only a small, random (and presumably uniformly distributed) sample of rows. Let's say the sample size is 200 rows. How can we select a single row out of those 200, in such a way that the selection is not biased?
As the OP explained, if we always select the first row generated in the sample, that has a very high likelihood to be biased. Gordon Linoff has shown a perfectly valid way to fix that. Here I describe a different approach - which is even more efficient, as it only generates a single random number, and it does not need to order the 200 rows. (Admittedly this is not a lot of overhead, but it may still matter if the query must be run many times.)
Namely: Given any 200 rows, generate a (hopefully uniformly distributed) single integer between 1 and 200. Also, as the 200 rows are generated, capture ROWNUM at the same time. Then it's as simple as selecting the row where ROWNUM = <the randomly generated integer>
Unfortunately, the sample clause doesn't generate a fixed number of rows, even if the table and the percentage sampled are fixed (and even if stats on the table are current). So the solution is just slightly more complicated - first I generate the sample, then I count how many rows it contains, and then I select the one row we want.
The output will include a column for the "random row number"; if that is an issue, just list the columns from the base table instead of * in the final query. I assume the name of the base table is t.
with
p as ( select t.*, rownum as rn
from t sample(0.0001)
)
, r as ( select trunc(dbms_random.value(1, (select count(*) from p) + 1)) as rn
from dual
)
select p.*
from p join r on p.rn = r.rn
;
It's not accurate to say "[SAMPLE] gives results from the beginning of the table far more commonly," unless you're using SAMPLE wrong. However, there are some unusual cases where earlier rows are favored if those early rows are much larger than subsequent rows.
SAMPLE Isn't That Bad
If you use a large sample size, the first rows returned do appear to come from the "first" rows of the table. (But tables are unordered, and while I observe this behavior on my machine, there is no guarantee you will always see it.)
The below query does seem to do a good job of picking random rows, but not if you only look at the first N rows returned:
select * from test1 sample(99);
SAMPLE Isn't Perfect Either
The below test case shows how the row size can skew the results. If you insert 10,000 large rows and then insert 10,000 small rows, a small SAMPLE will almost always return only large rows.
--drop table test1 purge;
create table test1(a varchar2(5), b varchar2(4000));
--Insert 10K large records.
insert into test1 select 'large', lpad('A', 4000, 'A') from dual connect by level <= 10000;
--Insert 10K small records.
insert into test1 select 'small', null from dual connect by level <= 10000;
--Select about 10 rows. Notice that they are almost always a "LARGE" row.
select * from test1 sample (0.1);
However, the skew completely disappears if you insert the small rows before the large rows.
I think these results imply that SAMPLE is based on the distribution of data in blocks (8 KB of data), and is not strictly random per row. If small rows are "hidden" in a physically small part of the table, they are much less likely to show up. However, Oracle always seems to check the first part of the table, and if the small rows exist there, then the sample is evenly distributed. The rows have to be hiding very well to be missed.
The real answer depends on Oracle's implementation, which I don't have access to. Hopefully this test case will at least give you some ideas to play around and determine if SAMPLE is random enough for your needs.

Limit Records with SQL in Access

We get an Access DB (.accdb) from an external source and have no control over the structure or data. We need to ingest the data into our DB using code. This means I have control over the SQL.
Our issue is that one table contains almost 13k records (currently 12,997) and takes a long time to process. I'd like to query the data from the source DB but only a predefined number of records at a time - let's say 1000 at a time.
I tried generating my query inside a loop where I update the number of records to return with each loop. So far, the only thing I've found that comes close to working is something like this:
SELECT *
FROM (
SELECT Top + pageSize + sub.*
FROM (
SELECT TOP + startPos + [Product Description Codes].*
FROM [Product Description Codes]
ORDER BY [Product Description Codes].PRODDESCRIPCODE
) sub
ORDER BY sub.PRODDESCRIPCODE DESC
) subOrdered
ORDER BY subOrdered.PRODDESCRIPCODE
Where I increment pageSize and startPos with each loop. The problem is that it always returns 1000 rows, even on what I think should be the last loop, when it should return only 997 and then zero after that.
Can anyone help me with this? I don't have another column to filter on. Is there a way to select a certain number of records in a loop and then increment that number until I've gotten all the records, and then stop?
If PRODDESCRIPCODE is the primary key then you can simplify your select, i.e.:
SELECT TOP 1000 *
FROM [Product Description Codes]
WHERE PRODDESCRIPCODE > @pcode
ORDER BY PRODDESCRIPCODE;
and start by passing a @pcode parameter of 0 (if int, or '' if text, etc.). In the next loop you would set the parameter to the max PRODDESCRIPCODE you have received. (Note the ORDER BY: without it, TOP returns an arbitrary 1000 rows and the paging breaks.)
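A worked second iteration, assuming the first batch's largest code was 1000 (an illustrative value): stop looping once a batch comes back with fewer than 1000 rows.
-- the previous batch's largest key becomes the new lower bound
SELECT TOP 1000 *
FROM [Product Description Codes]
WHERE PRODDESCRIPCODE > 1000
ORDER BY PRODDESCRIPCODE;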
(I am not sure if by SQL you meant MS SQL Server, or how you are doing this.)
Do you absolutely have to update records, or can you afford to insert the entire Access table into your local table, slap on a timestamp field, and structure your local queries to grab the most recent entry? Based on some of your comments above, it doesn't sound like you have any cases where you are keeping a local record over an imported one.
SELECT PRODDESCRIPCODE, MAX(timestamp) FROM table GROUP BY PRODDESCRIPCODE
I ended up using a variation of the method from here:
http://www.jertix.org/en/blog/programming/implementation-of-sql-pagination-with-ms-access.html
Thank you all very much for your suggestions.

Numbering rows in a view

I am connecting to an SQL database using a PLC, and need to return a list of values. Unfortunately, the PLC has limited memory, and can only retrieve approximately 5,000 values at any one time, however the database may contain up to 10,000 values.
As such I need a way of retrieving these values in 2 operations. Unfortunately the PLC is limited in the query it can perform, and is limited to only SELECT and WHERE commands, so I cannot use LIMIT or TOP or anything like that.
Is there a way in which I can create a view, and auto number every field in that view? I could then query all records < 5,000, followed by a second query of < 10,000 etc?
Unfortunately it seems that views do not support the identity column, so this would need to be done manually.
Does anyone have any suggestions? My only realistic option at the moment seems to be to create 2 views, one with the first 5,000 and one with the next 5,000...
I am using SQL Server 2000 if that makes a difference...
There are 2 solutions. The easiest is to modify your SQL table and add an IDENTITY column. If that is not a possibility, then you'll have to do something like the query below. For 10,000 rows it shouldn't be too slow, but as the table grows it will perform worse and worse.
SELECT Col1, Col2,
       (SELECT COUNT(i.Col1)
        FROM yourtable i
        WHERE i.Col1 <= o.Col1) AS RowID
FROM yourtable o
While the code provided by Derek does what I asked - i.e. it numbers each row in the view - the performance is really poor: approximately 20 seconds to number 100 rows. As such it is not a workable solution. An alternative is to number the first 5,000 records with a 1, and the next 5,000 with a 2. This can be done with 3 simple queries, and is far quicker to execute.
The code to do so is as follows:
SELECT TOP(5000) BCode, SAPCode, 1 as GroupNo FROM dbo.DB
UNION
SELECT TOP (10000) BCode, SAPCode, 2 as GroupNo FROM dbo.DB p
WHERE ID NOT IN (SELECT TOP(5000) ID FROM dbo.DB)
Although, as pointed out by Andriy M, you should also specify an explicit sort, to ensure that you don't miss any records; a corrected version is sketched below.
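A version with the sort made explicit (SQL Server 2000 syntax; it assumes ID is the key the batches should follow):
SELECT BCode, SAPCode, 1 AS GroupNo
FROM (SELECT TOP 5000 ID, BCode, SAPCode FROM dbo.DB ORDER BY ID) a
UNION ALL
SELECT BCode, SAPCode, 2 AS GroupNo
FROM (SELECT TOP 10000 ID, BCode, SAPCode FROM dbo.DB ORDER BY ID) b
WHERE ID NOT IN (SELECT TOP 5000 ID FROM dbo.DB ORDER BY ID)
The ORDER BY must live inside derived tables, because T-SQL allows only one ORDER BY at the end of a UNION, applying to the whole result.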
One possibility might be to use a function returning a table variable, such as:
CREATE FUNCTION dbo.OrderedBCodeData()
RETURNS @Data TABLE (RowNumber int IDENTITY(1,1), BCode int, SAPCode int)
AS
BEGIN
    INSERT INTO @Data (BCode, SAPCode)
    SELECT BCode, SAPCode FROM dbo.DB ORDER BY BCode
    RETURN
END
And select from this function, such as:
SELECT * FROM dbo.OrderedBCodeData() WHERE RowNumber BETWEEN 5000 AND 10000
I haven't ever used this in production; in fact it was just a quick idea this morning, but it might be worth exploring as a neater alternative.