Fast query in PostgreSQL - sql

I have a very large database (~1 TB), so running even a very simple query can take a very long time. E.g. for:
EXPLAIN select count(*) from users;
the estimated cost is 44661683.87 disk page fetches, which makes it very expensive to execute.
When I try to put a limit on the query like:
EXPLAIN select count(*) from users limit 10;
the cost of executing the query remains the same, i.e. 44661683.87 disk page fetches.
So (1) is it possible to execute a query on a subset of the data and then extrapolate to the rest of the table? The approximate row count can be found quickly using something like:
SELECT reltuples AS approximate_row_count FROM pg_class WHERE relname = 'users';
Moreover, (2) is it possible to select a randomly distributed subset of rows?

is it possible to execute a query on a subset of data and then extrapolate to the rest of the table
You could use the tablesample option:
select count(*) * 10
from the_table tablesample system (10);
tablesample system (10) will only scan 10 percent of the blocks of the table, which should be quite fast. If you multiply the resulting row count by 10 you'll have an approximation(!) of the total number of rows. The smaller the sample size, the faster this will be - but also the less accurate.
The accuracy of the number depends on how much free space your table has, because the 10% (or whatever sample size you choose) is based on the total number of blocks in the table. If there are many free (or half-free) blocks, the number will be less reliable.
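For instance, a smaller sample with a correspondingly larger multiplier (a sketch, reusing the the_table placeholder from above) trades accuracy for speed:
-- 1 percent sample: faster, but a rougher estimate (scale by 100)
select count(*) * 100 as estimated_rows
from the_table tablesample system (1);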

select count(*) . . . is an aggregation query with no group by. It returns 1 row, so the limit has no impact.
You seem to want:
select count(*)
from (select u.*
      from users u
      limit 10
     ) t;
As for your second question, Postgres introduced tablesample in version 9.5. You can investigate that.
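For example (a sketch for question (2), assuming Postgres 9.5+): BERNOULLI samples individual rows rather than whole blocks, which gives a more evenly distributed random subset than SYSTEM, at the cost of reading the whole table:
-- roughly 1 percent of rows, chosen per row (BERNOULLI) rather than per block (SYSTEM)
SELECT * FROM users TABLESAMPLE BERNOULLI (1);
-- add REPEATABLE (seed) to get the same sample across runs
SELECT * FROM users TABLESAMPLE BERNOULLI (1) REPEATABLE (42);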

If you have a primary key index on the users table (or an index on another column), you can get it to use that index for an index-only scan, which should result in a much better execution plan. But, strangely, it won't work with COUNT, so you can do a SELECT DISTINCT in a subquery and then COUNT in an outer query to force it to use the index:
EXPLAIN SELECT COUNT(*) FROM (SELECT DISTINCT id FROM users) u;

Related

SQL real limit/top function

I have a question about LIMIT/TOP. As I understand it, the whole table is processed before we get only the rows from the limit.
So if I write Select * from TABLE limit 2, first the whole table is processed and then the result is cut down to 2 rows.
Is there a way to cut it down before the whole table gets processed? For example, "take 2 random rows", so that I don't query the whole table but only a part of it.
I hope this question makes sense to you. I will appreciate your help!
In the execution plan tree a LIMIT node will stop processing the child nodes as soon as it's complete; i.e., when it receives the maximum number of rows from the child nodes (in your case 2 rows).
This will be very effective in terms of performance and response time if the child nodes are pipelined, reducing the cost drastically. For example:
select * from t limit 2;
If the child nodes are materialized, then the sub-branch is processed entirely before limiting, and the LIMIT won't significantly reduce the cost or response time. For example:
select * from t order by rand() limit 2;
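By contrast, if an index can deliver the rows already in the requested order (a sketch; the created_at column and its index are made up for illustration), the plan stays pipelined and the LIMIT node stops after two rows:
-- pipelined: rows stream out of the index in order, so only 2 rows are read
select * from t order by created_at limit 2;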
The MySQL LIMIT clause, used in a SELECT statement, restricts the number of rows returned in the result set rather than fetching the whole set from the table.
If you use Select * from TABLE limit 2 it will return the result set in an arbitrary (unspecified) order. It is better to use the LIMIT clause together with criteria so you can improve performance on the table.
For example:
SELECT * FROM TABLE
WHERE column_name >30
ORDER BY column_name DESC
LIMIT 5;

Optimizing a very slow select max group by query on Sybase ASE 15.5

I have a very simple query on a table with 60 million rows :
select id, max(version) from mytable group by id
It returns 6 million records and takes more than one hour to run. I just need to run it once because I am transferring the records to another new table that I keep updated.
I tried a few things that didn't work for me but that are often suggested here on stackoverflow:
inner query with select top 1 / order by desc: it is not supported in Sybase ASE
left outer join where a.version < b.version and b.version is null: I interrupted the query after more than one hour and barely a hundred thousand records were found
I understand that Sybase has to do a full scan.
Why could the full scan be so slow?
Is the slowness due to the Sybase ASE instance itself or specific to the query?
What are my options to reduce the running time of the query?
I am not intimately familiar with Sybase optimization. However, your query is really slow. Here are two ideas.
First, add an index on mytable(id, version desc). At a minimum, this is a covering index for the query, meaning that all columns used are in the index. Sybase is probably smart enough to eliminate the group by.
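A sketch of that index (the index name is made up; descending index keys are supported in ASE, but verify on your version):
create index ix_mytable_id_version
    on mytable (id asc, version desc)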
Another option uses the same index, but with a correlated subquery:
select t.id
from mytable t
where t.version = (select max(t2.version)
from mytable t2
where t2.id = t.id
);
This would be a full table scan (a little expensive but not an hour's worth) and an index lookup on each row (pretty cheap). The advantage of this approach is that you can select all the columns you want. The disadvantage is that if two rows have the same maximum version for an id, you will get both in the result set.
Edit: Here, Nicolas, is a more precise answer. I have no particular experience with Sybase, but I gained experience working with tons of data on a fairly small SQL Server machine. From that experience I learned that when you work with a large amount of data and your server doesn't have enough memory to handle it, you run into bottlenecks (I guess it takes time to write the temporary results to disk). I think that's your case (60 million rows), but once again, I don't know Sybase and it depends on many factors, such as the number of columns mytable has, the amount of RAM your server has, etc.
Here are the results of a small experiment I just did:
I ran these two queries on SQL Server and PostgreSQL.
Query 1 :
SELECT id, max(version)
FROM mytable
GROUP BY id
Query 2 :
SELECT id, version
FROM
(
SELECT id, version, ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) as RN
FROM mytable
) q
WHERE q.rn = 1
On PostgreSQL, mytable has 2,878,441 rows.
Query #1 takes 31.458 sec and returns 1,200,146 rows.
Query #2 takes 41.787 sec and returns 1,200,146 rows.
On SQL Server, mytable has 1,600,010 rows.
Query #1 takes 6 sec and returns 537,232 rows.
Query #2 takes 10 sec and returns 537,232 rows.
So far, your query is always faster. So I tried on bigger tables.
On PostgreSQL, mytable now has 5,875,134 rows.
Query #1 takes 100.915 sec and returns 2,796,800 rows.
Query #2 takes 98.805 sec and returns 2,796,800 rows.
On SQL Server, mytable now has 11,712,606 rows.
Query #1 takes 28 min 28 sec and returns 6,262,778 rows.
Query #2 takes 2 min 39 sec and returns 6,262,778 rows.
Now we can make an assumption. In the first part of this experiment, the two servers have enough memory to deal with the data, thus GROUP BY is faster. The second part of the experiment might prove that too much data kills the performance of GROUP BY. To prevent that bottleneck, ROW_NUMBER() seems to do the trick.
Criticism: I don't have a bigger table on PostgreSQL, nor do I have a Sybase server at hand.
For this experiment I was using PostgreSQL 9.3.5 on x86_64 and SQL Server 2012 - 11.0-2100.60 (X64).
Maybe this experiment will help you, Nicolas.
So in the end the nonclustered index on (id, version desc) did the trick without having to change the query at all. Creating the index also takes about an hour, and the query then responds in a few seconds. But I guess that's still better than maintaining another table that could cause data integrity issues.
The max() function does not help the optimizer use the index.
Perhaps you should create a function-based index on max(version):
http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc32300.1550/html/sqlug/CHDDHJIB.htm

select first rows in oracle without full table scan

I'm trying to perform the following query in Oracle.
select * from (select rownum r, account from fooTable) where r<5001;
It selects the first 5000 rows. The problem I'm running into is that fooTable holds a lot of data (35 million+ rows), which really slows down the query. According to the query analyzer it's performing a full table scan.
My question is: is there a way to speed up this statement? Since I'm only fetching the first N rows, is the full table scan necessary?
mj
I have found the /*+ FIRST_ROWS(n) */ hint to be very helpful in cases like this (such as for limiting pagination results). You replace n with whatever value you want.
select /*+ FIRST_ROWS(5000) */
account
from fooTable
where rownum <= 5000;
You still need the rownum predicate to limit rows, but the hint lets the optimizer know you only need a lazy fetch of n rows.

Fast way to discover the row count of a table in PostgreSQL

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
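To see how fresh those statistics are for a given table, you can check the standard statistics view pg_stat_user_tables (a quick sketch):
SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'mytable';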
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
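A sketch combining the two:
SELECT reltuples::bigint AS estimate
FROM   pg_class
WHERE  oid = to_regclass('myschema.mytable');  -- returns no row instead of raising an error if the table is missing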
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible that, at the moment the count passes my constant value, it will stop counting (and not wait to finish the count to inform me that the row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit; you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless, even for knowing within a factor of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
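A minimal sketch of that idea as a PL/pgSQL helper (the function name count_estimate is made up; it parses the rows= figure out of plain EXPLAIN output):
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- pick up the rows=<n> figure from the first plan line that carries one
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)')::integer;
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- usage:
SELECT count_estimate('SELECT * FROM foo');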
Reference taken from this blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column (a sketch follows below).
Again, this is to decrease the workload (scan through a narrow column index).
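A quick sketch of the hash-column idea (assumes Postgres 12+ for generated columns; the token table and text column are from the question, the text_hash name is made up):
ALTER TABLE token
  ADD COLUMN text_hash text GENERATED ALWAYS AS (md5(text)) STORED;

CREATE INDEX ON token (text_hash);

-- group on the narrow hash instead of the wide text column
SELECT text_hash, count(*)
FROM   token
GROUP  BY text_hash
ORDER  BY count(*) DESC;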
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.

Selecting data effectively sql

I have a very large table with over 1000 records and 200 columns. When I try to retrieve records matching some criteria in the WHERE clause using a SELECT statement, it takes a lot of time. But most of the time I just want to select a single record that matches the criteria in the WHERE clause rather than all the records.
I guess there should be a way to select just a single record and exit, which would minimize the retrieval time. I tried ROWNUM=1 in the WHERE clause but it didn't really work, because I guess the engine still checks all the records even after finding the first record matching the WHERE criteria. Is there a way to optimize for the case where I want to select just a few records?
Thanks in advance.
Edit:
I am using oracle 10g.
The Query looks like,
Select *
from Really_Big_table
where column1 is NOT NULL
and column2 is NOT NULL
and rownum=1;
This seems to work slower than the version without rownum=1;
rownum is what you want, but you need to perform your main query as a subquery.
For example, if your original query is:
SELECT col1, col2
FROM table
WHERE condition
then you should try
SELECT *
FROM (
SELECT col1, col2
FROM table
WHERE condition
) WHERE rownum <= 1
See http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html for details on how rownum works in Oracle.
1,000 records isn't a lot of data in a table. 200 columns is a reasonably wide table. For this reason, I'd suggest you aren't dealing with a really big table - I've performed queries against millions of rows with no problems.
Here is a little experiment... how long does it take to run this compared to the "SELECT *" query?
SELECT Really_Big_table.Id
FROM Really_Big_table
WHERE column1 IS NOT NULL
  AND column2 IS NOT NULL
  AND rownum = 1;
An example:
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp )
WHERE sal_rank <= 1;
You should also index the columns that appear in the WHERE clause.
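For example, a sketch against the table from the question (the index name is made up):
CREATE INDEX ix_really_big_table_c1_c2
    ON Really_Big_table (column1, column2);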
In SQL, most of the optimization comes in the form of indexes on the table (as a rough guide, index the columns that appear in the WHERE and ORDER BY clauses).
You did not specify what SQL database you are using, so I can't point to a good resource.
Here is an introduction to indexes on Oracle.
Here is another tutorial.
As for queries - you should always specify the columns you are returning and not use a blanket *.
It shouldn't take a lot of time to query a 1000-row table. There are exceptions, however; check whether you are in one of the following cases:
1. Lots of rows were deleted
The table had a massive number of rows in the past. Since the High Water Mark (HWM) is still high (DELETE won't lower it) and a FULL TABLE SCAN reads all the data up to the high water mark, it may take a lot of time to return results even if the table is now nearly empty.
Analyse your table (dbms_stats.gather_table_stats('<owner>','<table>')) and compare the space actually used by the rows (space on disk) with the effective space (data), for example:
SELECT t.avg_row_len * t.num_rows data_bytes,
(t.blocks - t.empty_blocks) * ts.block_size bytes_used
FROM user_tables t
JOIN user_tablespaces ts ON t.tablespace_name = ts.tablespace_name
WHERE t.table_name = '<your_table>';
You will have to take into account the overhead of the rows and blocks as well as the space reserved for update (PCT_FREE). If you see that you use a lot more space than required (typical overhead is below 30%, YMMV) you may want to reset the HWM, either:
ALTER TABLE <your_table> MOVE; and then rebuild the indexes (ALTER INDEX <index> REBUILD) - don't forget to collect stats afterwards (see the sketch after this list).
use DBMS_REDEFINITION
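A sketch of the first option (all names are placeholders; run it during a maintenance window):
ALTER TABLE your_table MOVE;
ALTER INDEX your_table_pk REBUILD;
BEGIN
  dbms_stats.gather_table_stats(ownname => user, tabname => 'YOUR_TABLE');
END;
/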
2. The table has very large columns
Check if you have columns of datatype LOB, CLOB, LONG (irk), etc. Data over 4000 bytes in any of these columns is stored out of line (in a separate segment), which means that if you don't select these columns you will only query the other smaller columns.
If that's your case, don't use SELECT *. Either you don't need the data in the large columns, or you can use SELECT rowid and then do a second query: SELECT * WHERE rowid = <rowid>.
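A sketch of that two-step approach for the query from the question (:found_rowid stands for the value returned by the first query):
-- step 1: locate the row using only the narrow columns
SELECT rowid
FROM   Really_Big_table
WHERE  column1 IS NOT NULL
AND    column2 IS NOT NULL
AND    rownum = 1;

-- step 2: fetch the full (wide) row by rowid
SELECT *
FROM   Really_Big_table
WHERE  rowid = :found_rowid;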