Selecting data effectively in SQL

I have a very large table with over 1000 records and 200 columns. When I try to retrieve records matching some criteria in the WHERE clause using a SELECT statement, it takes a lot of time. But most of the time I just want to select a single record that matches the criteria in the WHERE clause rather than all the records.
I guess there should be a way to select just a single record and stop, which would minimize the retrieval time. I tried ROWNUM=1 in the WHERE clause but it didn't really work, because I guess the engine still checks all the records even after finding the first record matching the WHERE criteria. Is there a way to optimize the case where I want to select just a few records?
Thanks in advance.
Edit:
I am using Oracle 10g.
The query looks like:
Select *
from Really_Big_table
where column1 is NOT NULL
and column2 is NOT NULL
and rownum=1;
This seems to work slower than the version without rownum=1;

rownum is what you want, but you need to perform your main query as a subquery.
For example, if your original query is:
SELECT col1, col2
FROM table
WHERE condition
then you should try
SELECT *
FROM (
SELECT col1, col2
FROM table
WHERE condition
) WHERE rownum <= 1
See http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html for details on how rownum works in Oracle.

1,000 records isn't a lot of data in a table. 200 columns is a reasonably wide table. For this reason, I'd suggest you aren't dealing with a really big table - I've performed queries against millions of rows with no problems.
Here is a little experiment... how long does it take to run this compared to the "SELECT *" query?
SELECT
Really_Big_table.Id
FROM
Really_Big_table
WHERE
column1 IS NOT NULL
AND
column2 IS NOT NULL
AND
rownum=1;

An example is below:
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp )
WHERE sal_rank <= 1;
You should also index the columns that appear in the WHERE clause.

In SQL, most of the optimization would come in the form of an index on the table (as a rough guide, index the columns that appear in the WHERE and ORDER BY clauses).
You did not specify what SQL database you are using, so I can't point to a good resource.
Here is an introduction to indexes on Oracle.
Here is another tutorial.
As for queries - you should always specify the columns you are returning and not use a blanket *.
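For example, a minimal sketch in Oracle syntax, reusing the table and column names from the question above (the index name is made up):
-- Hypothetical index on the columns used in the WHERE clause
CREATE INDEX really_big_table_ix ON Really_Big_table (column1, column2);
-- Ask only for the columns you need instead of a blanket *
SELECT column1, column2
FROM Really_Big_table
WHERE column1 IS NOT NULL
AND column2 IS NOT NULL
AND rownum = 1;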

It shouldn't take a lot of time to query a 1,000-row table. There are exceptions, however; check whether you are in one of the following cases:
1. Lots of rows were deleted
The table had a massive amount of rows in the past. Since the High Water Mark (HWM) is still high (a delete won't lower it) and a FULL TABLE SCAN reads all the data up to the high water mark, it may take a lot of time to return results even if the table is now nearly empty.
Analyse your table (dbms_stats.gather_table_stats('<owner>','<table>')) and compare the space actually used by the rows (space on disk) with the effective space (data), for example:
SELECT t.avg_row_len * t.num_rows data_bytes,
(t.blocks - t.empty_blocks) * ts.block_size bytes_used
FROM user_tables t
JOIN user_tablespaces ts ON t.tablespace_name = ts.tablespace_name
WHERE t.table_name = '<your_table>';
You will have to take into account the overhead of the rows and blocks as well as the space reserved for update (PCT_FREE). If you see that you use a lot more space than required (typical overhead is below 30%, YMMV) you may want to reset the HWM, either:
ALTER TABLE <your_table> MOVE; and then rebuild the indexes (ALTER INDEX <index> REBUILD) - don't forget to collect stats afterwards, or
use DBMS_REDEFINITION
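A rough sketch of the first option (placeholders as above; after a MOVE the indexes are left unusable and must be rebuilt):
ALTER TABLE <your_table> MOVE;
ALTER INDEX <index> REBUILD;  -- repeat for every index on the table
EXEC dbms_stats.gather_table_stats(user, '<YOUR_TABLE>');  -- collect stats afterwards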
2. The table has very large columns
Check if you have columns of datatype LOB, CLOB, LONG (irk), etc. Data over 4000 bytes in any of these columns is stored out of line (in a separate segment), which means that if you don't select these columns you will only query the other smaller columns.
If you are in this case, don't use SELECT *. Either you don't need the data in the large columns, or select the rowid first and then do a second query: SELECT * ... WHERE rowid = <rowid>.
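A minimal sketch of that two-step approach (table and column names are hypothetical):
-- Step 1: locate the rows using only the small columns
SELECT rowid FROM wide_table WHERE small_col = :value;
-- Step 2: fetch the large columns only for the rows you actually need
SELECT * FROM wide_table WHERE rowid = :picked_rowid;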

Related

Does Snowflake preserve retrieval order?

Posting two questions:
1.
Let's say there is a query:
SELECT C1, C2, C3 from TABLE;
When this query is fired for the first time, it retrieves all the values in a certain order.
Next time, when the same query is fired, will the previous order be retained?
2.
There are 2 tables, TABLE1 and TABLE2, both of them have identical data.
Will (SELECT * from TABLE1) and (SELECT * from TABLE2) retrieve the same order of rows?
SQL tables represent unordered sets. Period. There is no ordering in a result set unless you explicitly include an ORDER BY.
It is that simple. If you want data in a particular order, then you need to use ORDER BY. That is how relational databases work.
The same query can return results in different orders each time the query is executed. There are no guarantees about the order -- unless the query has an ORDER BY for the outermost SELECT.
No, unless you are fetching data from result cache!
No, unless they are very small tables and your query runs with low parallelism.
Sorry for the extra answer, but I see Tim claims that the query will return the same result as long as the underlying table(s) are not modified and the query has the same execution plan.
Snowflake executes the queries in parallel, therefore the order of data is not predictable unless ORDER BY is used.
Let's create a table (big enough to be processed in parallel), and run a simple test case:
-- running on medium warehouse
create or replace table my_test_table ( id number, name varchar ) as
select seq4(), 'gokhan' || seq4() from table(generator(rowcount=>1000000000));
alter session set USE_CACHED_RESULT = false;
select * from my_test_table limit 10;
You will see that it will return different rows every time you run.
To answer both questions short: No.
If your query has no ORDER BY clause, the SELECT statement returns an unordered set. This means that even if you query the same table twice and the data didn't change, SELECT without ORDER BY can return the rows in a different order.
https://docs.snowflake.com/en/sql-reference/sql/select.html
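If you need a stable order across runs, add an explicit ORDER BY on the outermost SELECT; a sketch using the columns from the question (any deterministic, ideally unique, key works):
SELECT C1, C2, C3 FROM TABLE1 ORDER BY C1, C2, C3;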

Is there a way to select different rows each time avoiding ORDER BY clause?

I have a table with approximately 100 million rows (TABLE_A). I need to select 6 million different rows on each query; once the entire table has been selected, the process ends. TABLE_A has no index or primary key, and ORDER BY is very expensive in terms of time; I also don't need any order here, just different rows. I have tried to order using ROWID, according to this,
They are the fastest way to access a single row.
This query works but takes about 5 minutes (I would like to avoid this order by)
SELECT * FROM TABLE_A ORDER BY ROWID
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
This query works faster but makes no sense, since ROWNUM, according to this,
returns a number indicating the order in which Oracle selects the row
from a table
SELECT * FROM TABLE_A ORDER BY ROWNUM asc
OFFSET 6000000 ROWS FETCH NEXT 6000000 ROWS ONLY;
As expected, the same query returns different results each time.
This query seems to be conceptually better.
SELECT * FROM TABLE_A WHERE ROWID >= 6000000 AND ROWID <12000000;
But it can't be done this way; ROWID (UROWID datatype) has values like AAAZIUAHWAAC6XcAAI.
So, is there a way to select different rows while avoiding ordering, just addressing the rows by some kind of internal ID, maybe a location in storage or a default order? The whole approach was likely wrong, so I'm open to radical changes.
I've also tried something like this:
SELECT * FROM TABLE_A
WHERE dbms_rowid.rowid_block_number(rowid)
BETWEEN 2049 AND 261281;
It's surprisingly fast, but unfortunately a row could have more than one block number.
Based on your last comment, some things to look at:
DBMS_PARALLEL_EXECUTE
If you are going through 100 million rows, the best place to process them is on the database itself. If your processing is done with PL/SQL, then dbms_parallel_execute can manage most of the parallelisation for you, and carve up the rows.
ROWID ranges
Even if you don't process the rows on the database, you can use DBMS_PARALLEL_EXECUTE to produce the rowid ranges for you. Then use those start-end pairs as inputs to whatever app you are using to do the processing (a sketch of generating the ranges appears after the last option below).
simple MOD
Each instance of your app gets an ID from 0 to 'n-1' and each issues a query
select *
from (
  select rownum r, m.* from my_table m
)
where mod(r, :n) = :x
where :n is the number of app instances and :x is that app's ID. If you already have a numeric sequence column of some sort that is reasonably distributed, you can substitute it for the rownum.
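For the ROWID-range option above, a rough sketch of how the ranges could be generated with DBMS_PARALLEL_EXECUTE (the task name is arbitrary and the chunk size is only an example):
BEGIN
  DBMS_PARALLEL_EXECUTE.CREATE_TASK('CHUNK_TABLE_A');
  DBMS_PARALLEL_EXECUTE.CREATE_CHUNKS_BY_ROWID(
      task_name   => 'CHUNK_TABLE_A',
      table_owner => USER,
      table_name  => 'TABLE_A',
      by_row      => TRUE,       -- chunk by approximate row count
      chunk_size  => 6000000);   -- roughly 6 million rows per chunk
END;
/
-- Each worker picks up one start/end rowid pair ...
SELECT chunk_id, start_rowid, end_rowid
FROM user_parallel_execute_chunks
WHERE task_name = 'CHUNK_TABLE_A'
ORDER BY chunk_id;
-- ... and reads only its slice of the table
SELECT * FROM TABLE_A WHERE rowid BETWEEN :start_rowid AND :end_rowid;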

Conditional offset on union select

Consider we have complex union select from dozens of tables with different structure but similar fields meaning:
SELECT a1.abc as field1,
       a1.bcd as field2,
       a1.date as order_date
FROM a1_table a1
UNION ALL
SELECT a2.def as field1,
       a2.fff as field2,
       a2.ts as order_date
FROM a2_table a2
UNION ALL ...
ORDER BY order_date
Notice also that the results are, in general, sorted by the "synthetic" field order_date.
This query gives a huge number of rows, and we want to work with pages from this set of rows. Each page is defined by two parameters:
page size
field2 value of last item from previous page
The most important thing is that we cannot change the way a page is defined, i.e. it is not possible to use the row number or the date of the last item from the previous page: only the field2 value is acceptable.
The current paging algorithm is implemented in quite an ugly way:
1) the query above is wrapped in an additional select with an extra row_number() column and then wrapped in a stored procedure union_wrapper, which returns the appropriate
table ( field1 ..., field2 character varying),
2) then this complex select is performed:
RETURN QUERY
WITH tmp AS (
  SELECT rownum, field1, field2 FROM union_wrapper()
)
SELECT field1, field2
FROM tmp
WHERE rownum > (SELECT rownum
                FROM tmp
                WHERE field2 = last_field_id
                LIMIT 1)
LIMIT page_size
The problem is that we have to build the full union-select result in memory in order to later detect the row number from which to cut the new page. This is quite slow and takes an unacceptably long time.
Is there any way to restructure these operations in order to significantly reduce query complexity and increase speed?
And again: we cannot change the paging condition and we cannot change the structure of the tables - only the way the rows are retrieved.
UPD: I also cannot use temp tables, because I'm working on a read replica of the database.
You have successfully maneuvered yourself into a tight spot. The query and its ORDER BY expression contradict your paging requirements.
ORDER BY order_date is not a deterministic sort order (there could be multiple rows with the same order_date) - which you need before you do anything else here. And field2 does not seem to be unique either. You need both: a deterministic sort order and a unique indicator for page end / start. Ideally, the indicator matches the sort order. It could be (order_date, field2), with both columns defined NOT NULL and the combination UNIQUE. Your restriction "only field2 value is acceptable" contradicts your query.
That's all before thinking about how to get the best performance ...
There are proven solutions with row values and multi-column indexes for paging:
Optimize query with OFFSET on large table
But drawing from a combination of multiple source tables complicates matters. Optimization depends on the details of your setup.
If you can't get the performance you need, your only remaining alternative is to materialize the query results somehow. Temp table, cursor, materialized view - the best tool depends on details of your setup.
Of course, general performance tuning might help, too.
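A sketch of the row-value approach mentioned above, assuming the combination (order_date, field2) is made NOT NULL and unique, and that last_order_date and last_field2 hold the values from the last row of the previous page (the inner query is the UNION ALL from the question):
SELECT field1, field2, order_date
FROM ( ... the UNION ALL query ... ) u
WHERE (order_date, field2) > (last_order_date, last_field2)
ORDER BY order_date, field2
LIMIT page_size;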

Processing a large table - how do I select the records page by page?

I need to run a process on all the records in a table. The table could be very big, so I would rather process the records page by page. I need to remember the records that have already been processed, so they are not included in my second SELECT result.
Like this:
For first run,
[SELECT 100 records FROM MyTable]
For second run,
[SELECT another 100 records FROM MyTable]
and so on..
I hope you get the picture. My question is how do I write such a select statement?
I'm using Oracle btw, but it would be nice if it could run on any other DB too.
I also don't want to use a stored procedure.
Thank you very much!
Any solution you come up with to break the table into smaller chunks will end up taking more time than just processing everything in one go. Unless the table is partitioned and you can process exactly one partition at a time.
If a full table scan takes 1 minute, it will take you 10 minutes to break up the table into 10 pieces. If the table rows are physically ordered by the values of an indexed column that you can use, this will change a bit due to clustering factor. But it will anyway take longer than just processing it in one go.
This all depends on how long it takes to process one row from the table, of course. You could choose to reduce the load on the server by processing chunks of data, but from a performance perspective, you cannot beat a full table scan.
You are most likely going to want to take advantage of Oracle's stopkey optimization, so you don't end up with a full table scan when you don't want one. There are a couple of ways to do this. The first way is a little longer to write, but lets Oracle automatically figure out the number of rows involved:
select *
from
(
select rownum rn, v1.*
from (
select *
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rownum <= 200
)
where rn >= 101;
You could also achieve the same thing with the FIRST_ROWS hint:
select /*+ FIRST_ROWS(200) */ *
from (
select rownum rn, t.*
from table t
where filter_columns = 'where clause'
order by columns_to_order_by
) v1
where rn between 101 and 200;
I much prefer the rownum method, so you don't have to keep changing the value in the hint (which would need to represent the end value and not the number of rows actually returned to the page to be accurate). You can set up the start and end values as bind variables that way, so you avoid hard parsing.
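For example, a sketch of the rownum version with bind variables for the page boundaries (:first_row and :last_row are the hypothetical page limits):
select *
from (
  select rownum rn, v1.*
  from (
    select *
    from table t
    where filter_columns = 'where clause'
    order by columns_to_order_by
  ) v1
  where rownum <= :last_row
)
where rn >= :first_row;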
For more details, you can check out this post

What is the most efficient way to count rows in a table in SQLite?

I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)
The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.
To follow up on girasquid's answer, as a data point, I have a sqlite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried using SELECT rowid FROM table (thinking that rowid is a default primary indexed key), but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because the index stores a copy of the value in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.
If you are sure (really sure) that you've never deleted any row from that table, and your table has not been defined with the WITHOUT ROWID optimization, you can get the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really, really fast (milliseconds), but you must pay attention, because SQLite only says that the rowid is unique among all rows in the same table; it does not declare that the rowids are, and will always be, consecutive numbers.
The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
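If all you need is the existence check from the question (count > 0), a limit-1 probe avoids counting anything; a sketch, assuming the table is named X as in the question:
SELECT EXISTS (SELECT 1 FROM X LIMIT 1);  -- returns 1 if the table has any row, 0 if it is empty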
I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.
sp_spaceused 'table_name' (exclude single quote)
This will return the number of rows in the above table; it is the most efficient way I have come across yet.
It's more efficient than select Count(1) from 'table_name' (exclude single quote).
sp_spaceused can be used for any table; it's very helpful when the table is exceptionally big (hundreds of millions of rows), returning the number of rows right away, whereas 'select Count(1)' might take more than 10 seconds. Moreover, it does not need any column name/key field to consider.