Oracle SQL: Slicing big results into chunks

I have a large table (too large to query in one go). I need an efficient strategy to "slice" the results into chunks, allowing incremental updates and avoiding timeouts.
I guess there is a smarter solution than
SELECT
tbl1.ID,
tbl2.*
FROM
(SELECT * FROM FOOUSER.TABLE1 ORDER BY ID) tbl1
JOIN
FOOUSER.TABLE2 tbl2 ON tbl1.ID = tbl2.ID2
WHERE
tbl1.ID > :LASTMAXVALUE
AND ROWNUM <= 1000
ORDER BY
tbl1.ID;
.. with :LASTMAXVALUE being the maximum value of ID from the last query, and ROWNUM <= 1000 giving chunks of 1000 rows.
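For comparison, a variant that applies the ORDER BY before the row limit (a sketch only; it assumes Oracle 12c's FETCH FIRST syntax and that ID is indexed) would be:
SELECT tbl1.ID, tbl2.*
FROM FOOUSER.TABLE1 tbl1
JOIN FOOUSER.TABLE2 tbl2 ON tbl1.ID = tbl2.ID2
WHERE tbl1.ID > :LASTMAXVALUE      -- resume after the last chunk
ORDER BY tbl1.ID
FETCH FIRST 1000 ROWS ONLY;        -- next chunk of 1000 rows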
Thanks in advance.

Related

What is the most efficient way to randomly sample with replacement in BigQuery?

The answers to this question explain how to randomly sample from a BigQuery table. Is there an efficient way to do this with replacement?
As an example, suppose I have a table with 1M rows and I wish to select 100K independently random sampled rows.
Found a neat solution:
1) Index the rows of the table
2) Generate a dummy table with 100K random integers between 1 and 1M
3) Inner join the tables on index = random value
Code:
# randomly sample 100K rows from `table` with replacement
with large_table as (select *, row_number() over() as rk from `table`),
num_elements as (select count(1) as n from large_table),
dummy_table as (select 1 + cast(rand() * (select n - 1 from num_elements) as int64) as i from unnest(generate_array(1, 100000)))
select * from dummy_table
inner join large_table on dummy_table.i = large_table.rk

How to query samples in relativity?

I have a large data set with about 100 million rows. I want to 'compress' it and get a 1% sample of the entire dataset while preserving the relative distribution of the data.
How can such a query be implemented?
Step 1: create the helper table
You can use aggregation to group records by visit_number, and CROSS JOIN with a query that computes the total number of records in the table to compute the distribution percentage:
CREATE TABLE my_helper AS
SELECT
t.visit_number,
COUNT(*) visit_count,
SUM(t.purchase_id) sum_purchase,
COUNT(*)/total.cnt distribution
FROM
mytable t
CROSS JOIN (SELECT COUNT(*) cnt FROM mytable) total
GROUP BY t.visit_number
Step 2: sample the main table using the helper table
Within a subquery, you can use ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) to assign a random rank to each record within groups of records sharing the same visit_number. Then, in the outer query, you can join on the helper table to select the correct number of records for each visit_number:
SELECT x.*
FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) rn
FROM mytable t
) x
INNER JOIN my_helper h ON h.visit_number = x.visit_number
WHERE x.rn <= 1000000 * h.distribution
Side notes:
this only works if there are indeed more than 1 million records in the source table
the exact number of records in the output might be slightly below or above 1 million (depending on the distribution in the original table)
it should be possible to combine both queries into a single one, which would avoid the need for a helper table (see the sketch below)
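To illustrate that last point, a combined single-query version might look roughly like this (a sketch only, reusing the same mytable/visit_number names and window functions as above):
SELECT x.*
FROM (
    SELECT
        t.*,
        ROW_NUMBER() OVER(PARTITION BY visit_number ORDER BY RANDOM()) rn,
        COUNT(*) OVER(PARTITION BY visit_number) visit_count, -- records per visit_number
        COUNT(*) OVER() total_count                           -- records in the whole table
    FROM mytable t
) x
WHERE x.rn <= 1000000.0 * x.visit_count / x.total_count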
This is doable. A quick way is to take every nth record only.
1) order by a column (probably ID)
2) apply a rownum attribute
3) apply mod(rownum, n) = 0 for whatever percent makes sense (e.g. 1% would be mod(rownum, 100) = 0)
You may need steps 1/2 in a sub query and step 3 on the outside, as in the sketch below.
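A minimal sketch of those steps (assuming an Oracle-style ROWNUM pseudocolumn and a hypothetical table mytable with an id column):
SELECT *
FROM (
    SELECT t.*, ROWNUM rn
    FROM (SELECT * FROM mytable ORDER BY id) t
)
WHERE MOD(rn, 100) = 0 -- keep roughly every 100th row, i.e. a ~1% sample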
Enjoy and good luck!

Querying query results

If I have a query for example
SELECT * FROM MY_TABLE WHERE FIRSTNAME = 'HENRY';
that returns, say, twenty results for HENRY that are identical.
Is there a way to then query the results of the original query so that only non-duplicates are returned?
This is a trivial example, but basically I have a query where I am trying to perform a SELECT DISTINCT on a large data set. If I don't specify DISTINCT I get a relatively small and fast return of some duplicate data. Is there any logic in SQL I can apply to then perform a SELECT DISTINCT on those results, essentially breaking up the query to reduce response times? Assume everything of value is indexed.
Thanks
To return the first of a group of records you can do something like this:
select *
from
(
SELECT *, row_number() over (partition by firstname order by id) r
FROM MY_TABLE
--WHERE FIRSTNAME = 'HENRY'
) x
where x.r = 1
If the records are exact duplicates, you're not worried about the first since they're all the same, so you just want distinct records:
SELECT distinct *
FROM MY_TABLE
WHERE FIRSTNAME = 'HENRY'
or to see how many duplicates:
SELECT firstname, lastname, count(*)-1 NoOfDuplicates
FROM MY_TABLE
WHERE FIRSTNAME = 'HENRY'
group by firstname, lastname --, ...
Be warned that having the database divide the data set into records that have a duplicate and records that do not is generally no more efficient than performing the actual DISTINCT, unless the number of columns on which duplication occurs is very much smaller than the total number of columns.
In some cases of very wide tables, where duplication exists only on a subset of columns and on a small proportion of the rows, it might be more efficient to do something like:
select *
from my_table t1
where not exists (
select null
from my_table t2
where t2.duplication_column = t1.duplication_column and
t2.rowid != t1.rowid)
union all
select distinct *
from my_table t1
where exists (
select null
from my_table t2
where t2.duplication_column = t1.duplication_column and
t2.rowid != t1.rowid)
This would generally not be worth doing unless it avoided something very inefficient, like a very large sort spilling to disk.
Edit: modified the query

How will Oracle optimise a record set if we specify a rownum clause

If I say:
select * from table order by col1 where rownum < 100
If the table has 10 million records, will Oracle bring back all 10 million rows, sort them, and only then show me the first ones? Or is there a way it will optimise this?
If you do this
select * from table order by col1 where rownum < 100
then Oracle will throw an error as the WHERE clause comes before the ORDER BY.
If you do this
select * from table where rownum < 100 order by col1
then Oracle will return an arbitrary 99 records (sorted afterwards), because ROWNUM is assigned before the ORDER BY is applied.
If you want to return the first 100 records, ordered by a column, you must put the ORDER BY in a sub-select.
select *
from ( select * from table order by col1 )
where rownum <= 100
Oracle will do the sort; how else would it know which records you want? However, it will be a sort with a stopkey because of the ROWNUM. Oracle doesn't actually sort the entire result set, as some optimisation goes on under the hood, but this is what you can assume takes place.
Please see this article by Tom Kyte.
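If you want to see the stopkey optimisation for yourself, something like this should do it (a sketch only; it assumes a table t with a sortable column col1, and the exact plan wording varies between Oracle versions):
EXPLAIN PLAN FOR
SELECT *
FROM (SELECT * FROM t ORDER BY col1)
WHERE ROWNUM <= 100;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- the plan typically shows SORT ORDER BY STOPKEY and COUNT STOPKEY steps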

SQL Server SELECT LAST N Rows

This is a known question but the best solution I've found is something like:
SELECT TOP N *
FROM MyTable
ORDER BY Id DESC
I have a table with lots of rows, so using that query is not possible because it takes a lot of time. So how can I select the last N rows without using ORDER BY?
EDIT
Sorry, duplicate question of this one
In MySQL you can select the last N rows with the following query (note that SQL Server does not support LIMIT; see the TOP-based answers below):
select * from tbl_name order by id desc limit N;
I tested JonVD's code, but found it was very slow, 6s.
This code took 0s.
SELECT TOP(5) ORDERID, CUSTOMERID, OrderDate
FROM Orders where EmployeeID=5
Order By OrderDate DESC
You can also do it by using the ROW_NUMBER() OVER (PARTITION BY ...) feature. A great example can be found here:
I am using the Orders table of the Northwind database... Now let us retrieve the Last 5 orders placed by Employee 5:
SELECT ORDERID, CUSTOMERID, OrderDate
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY OrderDate DESC) AS OrderedDate,*
FROM Orders
) as ordlist
WHERE ordlist.EmployeeID = 5
AND ordlist.OrderedDate <= 5
If you want to select the last N rows from a table,
the syntax will be like
select * from table_name except select top
(number of rows in the table - how many rows you want) * from table_name
These statements work, but in different ways. Thank you, guys.
select * from Products except select top (77-10) * from Products
in this way you can get the last 10 rows, but the order will show in a descending way
select top 10 * from products
order by productId desc
select * from products
where productid in (select top 10 productID from products)
order by productID desc
select * from products where productID not in
(select top((select COUNT(*) from products ) -10 )productID from products)
First you must get the record count:
Declare @TableRowsCount Int
select @TableRowsCount = COUNT(*) from <Your_Table>
And then :
In SQL Server 2012
SELECT *
FROM <Your_Table> As L
ORDER BY L.<your Field>
OFFSET @TableRowsCount - @N ROWS
FETCH NEXT @N ROWS ONLY;
In SQL Server 2008
SELECT *
FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY <your Field>) AS sequencenumber, *
FROM <Your_Table>
) AS TempTable
WHERE sequencenumber > @TableRowsCount - @N
In a very general way that also works on SQL Server:
SELECT TOP(N) *
FROM tbl_name
ORDER BY tbl_id DESC
As for performance, it is not bad (less than one second for more than 10,000 records on a server machine).
Is "Id" indexed? If not, that's an important thing to do (I suspect it is already indexed).
Also, do you need to return ALL columns? You may be able to get a substantial improvement in speed if you only need a smaller subset of columns that can be FULLY covered by the index on the Id column. For example, if you have a NONCLUSTERED index on the Id column with no other fields included in the index, the query has to do a lookup on the clustered index to actually get the rest of the columns to return, and that can make up a lot of the cost of the query. If it's a CLUSTERED index, or a NONCLUSTERED index that includes all the other fields you want to return in the query, then you should be fine.
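To illustrate the covering-index point, a rough sketch (the index, table, and column names here are made up for the example, not taken from the question):
-- covering nonclustered index: the query below can be answered from the index alone
CREATE NONCLUSTERED INDEX IX_MyTable_Id_Covering
ON MyTable (Id DESC)
INCLUDE (FirstName, OrderDate); -- include only the columns the query actually returns

SELECT TOP (100) Id, FirstName, OrderDate
FROM MyTable
ORDER BY Id DESC;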
select * from (select top 6 * from vwTable order by Hours desc) T order by Hours
Here's something you can try without an order by but I think it requires that each row is unique. N is the number of rows you want, L is the number of rows in the table.
select * from tbl_name except select top (L-N) * from tbl_name
As noted before, which rows are returned is undefined.
EDIT: this is actually dog slow. Of no value really.
A technique I use to query the MOST RECENT rows in very large tables (100+ million or 1+ billion rows) is limiting the query to "reading" only the most recent "N" percentage of rows. These are real-world applications, for example I do this for non-historic recent weather data, recent news feed searches, or recent GPS location data points.
This is a huge performance improvement if you know for certain that your rows are, for example, in the most recent TOP 5% of the table. Even if there are indexes on the tables, this further limits the candidates to only 5% of the rows in tables which have 100+ million or 1+ billion rows. This is especially true when older data would require physical disk reads and not only logical in-memory reads.
This is much more efficient than SELECT TOP | PERCENT | LIMIT as it does not select the rows, but merely limits the portion of the data to be searched.
DECLARE @RowIdTableA BIGINT
DECLARE @RowIdTableB BIGINT
DECLARE @TopPercent FLOAT
-- Given that there is a sequential identity column,
-- limit the query to only rows in the most recent TOP 5% of rows
SET @TopPercent = .05
SELECT @RowIdTableA = (MAX(TableAId) - (MAX(TableAId) * @TopPercent)) FROM TableA
SELECT @RowIdTableB = (MAX(TableBId) - (MAX(TableBId) * @TopPercent)) FROM TableB
SELECT *
FROM TableA a
INNER JOIN TableB b ON a.KeyId = b.KeyId
WHERE a.TableAId > @RowIdTableA AND b.TableBId > @RowIdTableB AND
a.SomeOtherCriteria = 'Whatever'
MS SQL Server doesn't support LIMIT in T-SQL. Most of the time I just get MAX(ID) and then subtract:
select * from ORDERS where ID >(select MAX(ID)-10 from ORDERS)
This will return fewer than 10 records when ID is not sequential.
This query returns the last N rows in the correct order, but its performance is poor:
select *
from (
select top N *
from TableName t
order by t.[Id] desc
) as temp
order by temp.[Id]
Use DESC with ORDER BY at the end of the query to get the last values.
This may not be quite the right fit to the question, but…
OFFSET clause
The OFFSET number clause enables you to skip over a number of rows and then return rows after that.
That doc link is to Postgres; I don't know if this applies to Sybase/MS SQL Server.
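For what it's worth, SQL Server 2012 and later do support an OFFSET ... FETCH clause, so a rough sketch of "last 10 rows" could look like this (MyTable and Id are placeholder names; it assumes the table has at least 10 rows):
DECLARE @skip int;
SELECT @skip = COUNT(*) - 10 FROM MyTable;

SELECT *
FROM MyTable
ORDER BY Id
OFFSET @skip ROWS -- skip everything except the last 10 rows
FETCH NEXT 10 ROWS ONLY;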
DECLARE @MYVAR NVARCHAR(100)
DECLARE @step int
SET @step = 0;
DECLARE MYTESTCURSOR CURSOR
DYNAMIC
FOR
SELECT col FROM [dbo].[table]
OPEN MYTESTCURSOR
FETCH LAST FROM MYTESTCURSOR INTO @MYVAR
print @MYVAR;
WHILE @step < 10
BEGIN
FETCH PRIOR FROM MYTESTCURSOR INTO @MYVAR
print @MYVAR;
SET @step = @step + 1;
END
CLOSE MYTESTCURSOR
DEALLOCATE MYTESTCURSOR
In order to get the result in ascending order (LIMIT syntax, e.g. MySQL/PostgreSQL):
SELECT n.*
FROM
(
SELECT *
FROM MyTable
ORDER BY id DESC
LIMIT N
) n
ORDER BY n.id ASC
I stumbled across this issue while using SQL Server.
What I did to resolve it was to order the results descending and assign row numbers to that result; afterwards I filtered the results and turned them around again.
SELECT *
FROM (
SELECT *
,[rn] = ROW_NUMBER() OVER (ORDER BY [column] DESC)
FROM [table]
) A
WHERE A.[rn] < 3
ORDER BY [column] ASC
Easy copy paste answer
To display last 3 rows without using order by:
select * from Lms_Books_Details where Book_Code not in
(select top((select COUNT(*) from Lms_Books_Details ) -3 ) book_code from Lms_Books_Details)
Try using the EXCEPT syntax.
Something like this:
SELECT *
FROM clientDetails
EXCEPT
(SELECT TOP (number of rows in the table - how many rows you want) *
FROM clientDetails)