Are there any database implementations that allow for tables that don't contain data but generate data upon query? - sql

I have an application that works well with database query outputs, but it now needs to run each output against a range of numbers. Sure, I could refactor the application to iterate over the range for me, but it would arguably be cleaner if I could just have a "table" in the database that I could CROSS JOIN with my normal query outputs. Sure, I could just make a table that contains a range of values, but that seems unnecessarily wasteful.
For example a "table" in a database that represents a range of values, say 0 to 999,999 in a column called "number" WITHOUT having to actually store a million rows, but can be used in a query with a CROSS JOIN with another table as though there actually existed such a table.
I am mostly just curious if such a construct exists in any database implementation.

PostgreSQL has generate_series. SQLite has it as a loadable extension.
SELECT * FROM generate_series(0,9);
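For the CROSS JOIN use case described in the question, a minimal PostgreSQL sketch might look like this (my_table and its id column are hypothetical placeholders):
SELECT t.id, s.number
FROM my_table AS t
CROSS JOIN generate_series(0, 999999) AS s(number);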
On databases which support recursive CTE (SQLite, PostgreSQL, MariaDB), you can do this and then join with it.
WITH RECURSIVE cnt(x) AS (
VALUES(0)
UNION ALL
SELECT x+1 FROM cnt WHERE x < 999999
)
SELECT x FROM cnt;
The initial-select runs first and returns a single row with a single column containing 0. That row is added to a queue. Next, that row is extracted from the queue and added to "cnt", and the recursive-select is run against it, generating a single new row with value 1 that is added to the queue. The queue again holds one row, so the process repeats: the row containing 1 is extracted and added to the recursive table, it is treated as if it were the complete content of the recursive table, and the recursive-select is run again, producing a row with value 2 for the queue. This repeats 999,999 times until finally the only value on the queue is a row containing 999999. That row is extracted and added to the recursive table, but this time the WHERE clause causes the recursive-select to return no rows, so the queue remains empty and the recursion stops, leaving "cnt" holding the million values 0 through 999999.
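To use the generated range the way the question describes, the CTE can then be CROSS JOINed with an ordinary table; a minimal sketch, assuming a hypothetical table my_table and a smaller range to keep the output readable:
WITH RECURSIVE cnt(x) AS (
VALUES(0)
UNION ALL
SELECT x+1 FROM cnt WHERE x < 9
)
SELECT t.*, cnt.x AS number
FROM my_table AS t
CROSS JOIN cnt;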

Generally speaking, this depends a lot on the database you're using. In SQLite, for example, say you want to generate a sequence from 1 to 100. You could code it like this:
WITH basic(i) AS (
VALUES(1)
),
seq(i) AS (
SELECT i FROM basic
UNION ALL
SELECT i + 1 FROM seq WHERE i < 100
)
SELECT * FROM seq;
Hope this helps.

Looks like the answer to my question "Are there any database implementations that allow for tables that don't contain data but generate data upon query?" is yes. For example, SQLite has virtual tables: https://www.sqlite.org/vtab.html
In fact, it has the exact sort of thing I was looking for with generate_series: https://www.sqlite.org/series.html
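For completeness, once the extension is built and loaded (for example with .load ./series in the sqlite3 shell), it behaves like the table the question asks for; a small sketch, again against a hypothetical my_table:
SELECT value AS number FROM generate_series(0, 999999);

SELECT t.*, s.value AS number
FROM my_table AS t
CROSS JOIN generate_series(0, 999999) AS s;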

Related

What is the distribution of getting a single random row in Oracle using this SQL statement?

We are attempting to pull a semi-random row from Oracle. (We don't need a perfectly random row that meets rigorous statistical scrutiny, but we would like something that has a chance of returning any row in the table, even if there is some degree of skew.)
We are using this approach:
SELECT PERSON_ID FROM ENCOUNTER SAMPLE(0.0001) WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020 AND ROWNUM = 1
This approach appears to be giving us just one random result each time we run it.
However, according to answers to this question, this approach gives results from the beginning of the table far more commonly.
How commonly? If that statement is true then how much more commonly are values taken from the top of the table? Our typical table has tens of millions of rows (occasionally billions.) Is there a simple heuristic or a rough estimate to understand the skew in the distribution we can expect?
We are asking for skew because other methods aren't fast enough for our use case. We are avoiding using ORDER because the source tables can be so large (i.e. billions of rows) that the reporting server will run for hours or can time out before we get an answer. Thus, our constraint is we need to use approaches like SAMPLE that respond with little database overhead.
The issue is that SAMPLE is basically going through the table in order and randomly selecting rows. The issue is the ROWNUM, not the SAMPLE.
The solution is to use sample and then randomly sort:
SELECT p.*
FROM (SELECT PERSON_ID
FROM ENCOUNTER SAMPLE(0.0001)
WHERE EXTRACT(YEAR FROM REG_DT_TM) = 2020
ORDER BY dbms_random.value
) p
WHERE ROWNUM = 1
Just for fun, here is an alternative way to select a single, uniformly distributed row out of a (uniformly distributed) "small" sample of rows from the table.
Suppose the table has millions or billions of rows, and we use the sample clause to select only a small, random (and presumably uniformly distributed) sample of rows. Let's say the sample size is 200 rows. How can we select a single row out of those 200, in such a way that the selection is not biased?
As the OP explained, if we always select the first row generated in the sample, that has a very high likelihood to be biased. Gordon Linoff has shown a perfectly valid way to fix that. Here I describe a different approach - which is even more efficient, as it only generates a single random number, and it does not need to order the 200 rows. (Admittedly this is not a lot of overhead, but it may still matter if the query must be run many times.)
Namely: Given any 200 rows, generate a (hopefully uniformly distributed) single integer between 1 and 200. Also, as the 200 rows are generated, capture ROWNUM at the same time. Then it's as simple as selecting the row where ROWNUM = <the randomly generated integer>
Unfortunately, the sample clause doesn't generate a fixed number of rows, even if the table and the percentage sampled are fixed (and even if stats on the table are current). So the solution is just slightly more complicated - first I generate the sample, then I count how many rows it contains, and then I select the one row we want.
The output will include a column for the "random row number"; if that is an issue, just list the columns from the base table instead of * in the final query. I assume the name of the base table is t.
with
p as ( select t.*, rownum as rn
from t sample(0.0001)
)
, r as ( select trunc(dbms_random.value(1, (select count(*) from p) + 1)) as rn
from dual
)
select p.*
from p join r on p.rn = r.rn
;
It's not accurate to say "[SAMPLE] gives results from the beginning of the table far more commonly," unless you're using SAMPLE wrong. However, there are some unusual cases where earlier rows are favored if those early rows are much larger than subsequent rows.
SAMPLE Isn't That Bad
If you use a large sample size, the first rows returned do appear to come from the "first" rows of the table. (But tables are unordered, and while I observe this behavior on my machine there is no guarantee you will always see it.)
The below query does seem to do a good job of picking random rows, but not if you only look at the first N rows returned:
select * from test1 sample(99);
SAMPLE Isn't Perfect Either
The below test case shows how the row size can skew the results. If you insert 10,000 large rows and then insert 10,000 small rows, a small SAMPLE will almost always only return large rows.
--drop table test1 purge;
create table test1(a varchar2(5), b varchar2(4000));
--Insert 10K large records.
insert into test1 select 'large', lpad('A', 4000, 'A') from dual connect by level <= 10000;
--Insert 10K small records.
insert into test1 select 'small', null from dual connect by level <= 10000;
--Select about 10 rows. Notice that they are almost always a "LARGE" row.
select * from test1 sample (0.1);
However, the skew completely disappears if you insert the small rows before the large rows.
I think these results imply that SAMPLE is based on the distribution of data in blocks (8 KB of data), and not strictly random per row. If small rows are "hidden" in a physically small part of the table they are much less likely to show up. However, Oracle always seems to check the first part of the table, and if the small rows exist there, then the sample is evenly distributed. The rows have to be hiding very well to be missed.
The real answer depends on Oracle's implementation, which I don't have access to. Hopefully this test case will at least give you some ideas to play around and determine if SAMPLE is random enough for your needs.
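As a related experiment, Oracle also supports explicit block sampling via SAMPLE BLOCK, which samples whole blocks rather than individual rows; comparing the two forms on the same test table is one way to probe the block hypothesis above:
--Row sampling (the default) vs. explicit block sampling.
select * from test1 sample (0.1);
select * from test1 sample block (0.1);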

SQL Server 2005 - exclude rows with consecutive duplicate values in 1 field

I have a source table with 2 fields, a date, and a status code. I need a query to remove duplicate consecutive status codes, keeping only the row with the first date of a different status. For example:
Date Status
10/02/2004 A
10/12/2004 B
10/14/2004 B
11/22/2004 C
11/23/2004 C
12/03/2004 C
03/05/2006 B
The desired result set would be:
10/02/2004 A
10/12/2004 B
11/22/2004 C
03/05/2006 B
The main problem is that all the grouping functions (GROUP BY and ROW_NUMBER() OVER) don't seem to care about order, so in the example, all the "B" status records would be grouped together, which is incorrect, since the status changes from non-"B" to "B" two different times.
This problem is easy to solve using a cursor based loop to produce the result. Just remember the current value in a variable, and test each record as you loop. That works perfectly, but is dreadfully slow (over 20 minutes on real data).
This needs to run on SQL Server 2005 and later, so some newer windowing functions are not available. Is there a way to do this using a set-based query, that would presumably run much faster? It seems like it should be a simple thing to do, but maybe not. Other similar questions on SO seem to rely on additional ID or Sequence fields that we do not have available.
The reason regular grouping doesn't help in this situation is because the grouping criteria needs to reference fields in 2 different records to determine if a group break should occur. Since SQL 2005 lags behind the newer versions, we don't have a lag function to look at the prior record's value. Instead, we need to do a self join to get access to the prior record. To do that, we need to create a temporary sequence field in a CTE using ROW_NUMBER(). Then use that generated sequence in the self join to look at the prior record. We end up with something like:
;WITH tmp AS (
SELECT myDate,myStatus,ROW_NUMBER() OVER (ORDER BY myDate) as seq
FROM myTable )
SELECT tmp.* FROM tmp LEFT JOIN tmp t2 ON t2.seq = tmp.seq-1
WHERE t2.seq is null OR t2.myStatus!=tmp.myStatus
So, even though the original data doesn't have a sequence column, we can generate it on the fly in order to be able to find the prior record (if any) for any given other record using the self join. Then we get the desired result of selecting only the records where the status has changed from the prior record.
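For readers not bound by the SQL Server 2005 constraint: on SQL Server 2012 and later, LAG makes the self join unnecessary; a minimal sketch against the same table:
;WITH tmp AS (
SELECT myDate, myStatus, LAG(myStatus) OVER (ORDER BY myDate) AS prevStatus
FROM myTable )
SELECT myDate, myStatus FROM tmp
WHERE prevStatus IS NULL OR prevStatus <> myStatus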

iSeries query changes selected RRN of subquery result rows

I'm trying to make an optimal SQL query for an iSeries database table that can contain millions of rows (perhaps up to 3 million per month). The only key I have for each row is its RRN (relative record number, which is the physical record number for the row).
My goal is to join the table with another small table to give me a textual description of one of the numeric columns. However, the number of rows involved can exceed 2 million, which typically causes the query to fail due to an out-of-memory condition. So I want to rewrite the query to avoid joining a large subset with any other table. So the idea is to select a single page (up to 30 rows) within a given month, and then join that subset to the second table.
However, I ran into a weird problem. I use the following query to retrieve the RRNs of the rows I want for the page:
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
This query works just fine, returning the correct RRNs for the rows I need. However, when I attempted to join the result of the subquery with another table, the RRNs changed. So I simplified the query to a subquery within a simple outer query, without any join:
select rrn(e) as RRN, e.*
from TABLE1 as e
where rrn(e) in (
select t.RRN2 -- Gives correct RRNs
from (
select row_number() over() as SEQ,
rrn(e2) as RRN2, e2.*
from TABLE1 as e2
where e2.UPDATED between '2013-05-01' and '2013-05-31'
order by e2.UPDATED, e2.ACCOUNT
) as t
where t.SEQ > 270 and t.SEQ <= 300 -- Paging
order by t.UPDATED, t.ACCOUNT
)
order by e.UPDATED, e.ACCOUNT
The outer query simply grabs all of the columns of each row selected by the subquery, using the RRN as the row key. But this query does not work - it returns rows with completely different RRNs.
I need the actual RRN, because it will be used to retrieve more detailed information from the table in a subsequent query.
Any ideas about why the RRNs end up different?
Resolution
I decided to break the query into two calls, one to issue the simple subquery and return just the RRNs (row IDs), and the second to do the rest of the JOINs and so forth to retrieve the complete info for each row. (Since the table gets updated only once a day, and rows never get deleted, there are no potential timing problems to worry about.)
This approach appears to work quite well.
Addendum
As to the question of why an out-of-memory error occurs, this appears to be a limitation on only some of our test servers. Some can only handle up to around 2m rows, while others can handle much more than that. So I'm guessing that this is some sort of limit imposed by the admins on a server-by-server basis.
Trying to use RRN as a primary key is asking for trouble.
I find it hard to believe there isn't a key available.
Granted, there may be no explicit primary key defined in the table itself. But is there a unique key defined in the table?
It's possible there are no keys defined in the table itself (a practice that is 20 years out of date), but in that case there's usually a logical file with a unique key defined that is used by the application as the de-facto primary key to the table.
Try looking for related objects via green screen (DSPDBR) or GUI (via "Show related"). Keyed logical files show in the GUI as views. So you'd need to look at the properties to determine if they are uniquely keyed DDS logicals instead of non-keyed SQL views.
A few times I've run into tables with no existing de-facto primary key. Usually, it was possible to figure out what could be defined as one from the existing columns.
When there truly is no PK, I simply add one. Usually a generated identity column. There's a technique you can use to easily add columns without having to recompile or test any heritage RPG/COBOL programs. (and note LVLCHK(*NO) is NOT it!)
The technique is laid out in Chapter 4 of the modernizing Redbook
http://www.redbooks.ibm.com/abstracts/sg246393.html
1) Move the data to a new PF (or SQL table)
2) create new LF using the name of the existing PF
3) repoint existing LF to new PF (or SQL table)
Done properly, the record format identifiers of the existing objects don't change and thus you don't have to recompile any RPG/COBOL programs.
I find it hard to believe that querying a table of a mere 3 million rows, even when joined with something else, should cause an out-of-memory condition, so in my view you should address this issue first (or cause it to be addressed).
As for your question of why the RRNs end up different I'll take the liberty of quoting the manual:
If the argument identifies a view, common table expression, or nested table expression derived from more than one base table, the function returns the relative record number of the first table in the outer subselect of the view, common table expression, or nested table expression.
A construct of the type ...where something in (select somethingelse...) typically translates into a join, so there.
Unless you can specifically control it, e.g., via ALWCPYDTA(*NO) for STRSQL, SQL may make copies of result rows for any intermediate set of rows. The RRN() function always accesses physical record number, as contrasted with the ROW_NUMBER() function that returns a logical row number indicating the relative position in an ordered (or unordered) set of rows. If a copy is generated, there is no way to guarantee that RRN() will remain consistent.
Other considerations apply over time; but in this case it's as likely to be simple copying of intermediate result rows as anything.

oracle sql query requirements

I have some data in an Oracle table, about 10,000 rows. I want to generate a column that returns 1 for the 1st row, 2 for the 2nd, 1 for the 3rd, 2 for the 4th, 1 for the 5th, 2 for the 6th, and so on. Is there any way to do this with a SQL query, or any script that can update my column this way? I have thought about it a lot but haven't found a way to do it in SQL. Please help if there is any possibility of doing this with my table data.
You can use the combination of the ROWNUM and MOD functions.
Your query would look something like this:
SELECT ROWNUM, 2 - MOD(ROWNUM, 2) FROM ...
The MOD function will return 0 for even rows and 1 for odd rows.
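If the column actually needs to be persisted rather than computed at query time, one hedged sketch uses MERGE with ROW_NUMBER; this assumes a hypothetical key column ID, a target column ALT_FLAG, and that ordering by ID is acceptable (see also the caveat below about updating a table based on row order):
MERGE INTO mytable t
USING (
SELECT id, 2 - MOD(ROW_NUMBER() OVER (ORDER BY id), 2) AS flag
FROM mytable
) s
ON (t.id = s.id)
WHEN MATCHED THEN UPDATE SET t.alt_flag = s.flag;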
select mod(rownum,5)+1,fld1, fld2, fld3 from mytable;
Edit:
I did not misunderstand requirements, I worked around them. Adding a column and then updating a table that way is a bad design idea. Tables are seldom completely static, even rule and validation tables. The only time this might make any sense is if the table is locked against delete, insert, and update. Any change to any existing row can alter the logical order. Which was never specified. Delete means the entire sequence has to be rewritten. Update and insert can have the same effect.
And if you want to do this, you can use a sequence to insert a bogus counter: a sequence that cycles over and over, assuming you know the order and can control inserts and updates in terms of that order.

processing large table - how do i select the records page by page?

I need to do a process on all the records in a table. The table could be very big, so I would rather process the records page by page. I need to remember the records that have already been processed so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is how do I write such a select statement?
I'm using Oracle btw, but it would be nice if it could run on any other db too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go, unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to clustering factor. But it will anyway take longer than just processing it in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
select /*+ FIRST_ROWS(200) */ *
from (
-- rownum must be assigned outside the block with the ORDER BY, or it won't reflect the sorted order
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
)
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which would need to represent the end value and not the number of rows actually returned to the page to be accurate). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
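With bind variables, the rownum version might look like this sketch (same placeholder table and filters as above; :first_row and :last_row are supplied by the caller):
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= :last_row
)
where rn >= :first_row;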
For more details, you can check out this post