I use the SQL COUNT function to get the total number of rows from a table. Is there any difference between the two following statements?
SELECT COUNT(*) FROM Table
and
SELECT COUNT(TableId) FROM Table
Also, is there any difference in terms of performance and execution time?
Thilo nailed the difference precisely... COUNT( column_name ) can return a lower number than COUNT( * ) if column_name can be NULL.
However, let me take a slightly different angle at answering your question, since you seem to be focusing on performance.
First, note that issuing SELECT COUNT(*) FROM table; will potentially block writers, and it will also be blocked by other readers/writers unless you have altered the isolation level (knee-jerk tends to be WITH (NOLOCK) but I'm seeing a promising number of people finally starting to believe in RCSI). Which means that while you're reading the data to get your "accurate" count, all these DML requests are piling up, and when you've finally released all of your locks, the floodgates open, a bunch of insert/update/delete activity happens, and there goes your "accurate" count.
If you need an absolutely transactionally consistent and accurate row count (even if it is only valid for the number of milliseconds it takes to return the number to you), then SELECT COUNT( * ) is your only choice.
On the other hand, if you are trying to get a 99.9% accurate ballpark, you are much better off with a query like this:
SELECT row_count = SUM(row_count)
FROM sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID('dbo.Table')
AND index_id IN (0,1);
(The SUM is there to account for partitioned tables - if you are not using table partitioning, you can leave it out.)
This DMV maintains accurate row counts for tables, with the exception of rows that are currently participating in transactions - and those very transactions are the ones that will make your SELECT COUNT query wait (and ultimately make it inaccurate before you have time to read it). But otherwise this will lead to a much quicker answer than the query you propose, and one that is no less accurate than using WITH (NOLOCK).
count(id) needs to null-check the column (which may be optimized away for a primary key or otherwise not-null column), so count(*) or count(1) should be preferred (unless you really want to know the number of rows with a non-null value for id).
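If you want to see the NULL behaviour for yourself, here is a minimal sketch with a throwaway temp table (the names are made up for illustration):
CREATE TABLE #t (TableId int NULL);
INSERT INTO #t (TableId) VALUES (1), (2), (NULL);

SELECT COUNT(*)       AS count_all_rows,  -- 3: counts every row
       COUNT(TableId) AS count_non_null   -- 2: the NULL row is skipped
FROM #t;

DROP TABLE #t;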
Related
I have this Oracle query that takes around 1 minute to get the results:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche#fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
(SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
FROM notifiche#fe_engine2fe_gateway n2
WHERE n.id_sdi = n2.id_sdi)
--AND sysdate-data_ricezione > 15
Basically I have this table named "notifiche", where each record represents a kind of update to another type of object (invoices). I want to know which invoices have not received any update in the last 15 days. I can do it by joining against the notifiche table again (n2), getting the most recent record for each invoice, and evaluating the difference between the update date (data_ricezione) and the current date (sysdate).
When I add the commented condition, the query then takes a seemingly infinite time to complete (I mean hours; I never saw the end of it...).
How is it possible that this simple condition makes the query so slow?
How can I improve the performance?
Try to keep data_ricezione alone; if there's an index on it, it might help.
So: switch from
and sysdate - data_ricezione > 15
first (moving sysdate to the right-hand side) to
and -data_ricezione > 15 - sysdate
and then (multiplying both sides by -1, which flips the comparison) to
and data_ricezione < sysdate - 15
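Applied to the query in the question, only the commented condition changes; everything else stays as it was:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche#fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
      (SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
       FROM notifiche#fe_engine2fe_gateway n2
       WHERE n.id_sdi = n2.id_sdi)
  AND data_ricezione < sysdate - 15;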
As everything is done over the database link, see whether the driving_site hint does any good, i.e.
select /*+ driving_site (n) */ --> "n" is table's alias
trunc(sysdate-data_ricezione) as delay
from
notifiche#fe_engine2fe_gateway n
...
Use an analytic function to avoid a self-join over a database link. The below query only reads from the table once, divides the rows into windows, finds the MAX value for each window, and lets you select rows based on that maximum. Analytic functions are tricky to understand at first, but they often lead to code that is smaller and more efficient.
select id_sdi, data_ricezione
from
(
select id_sdi, data_ricezione, max(data_ricezione) over (partition by id_sdi) max_date
from notifiche#fe_engine2fe_gateway
)
where sysdate - max_date > 15;
As for why adding a simple condition can make the query slow - it's all about cardinality estimates. Cardinality, the number of rows, drives most of the database optimizer's decisions. The best way to join a small amount of data may be very different from the best way to join a large amount of data. Oracle must always guess how many rows are returned by an operation in order to know which algorithm to use.
Optimizer statistics (metadata about the tables, columns, and indexes) are what Oracle uses to make cardinality estimates. For example, to guess the number of rows that survive the filter sysdate-data_ricezione > 15, the optimizer would want to know how many rows are in the table (DBA_TABLES.NUM_ROWS), what the maximum value for the column is (DBA_TAB_COLUMNS.HIGH_VALUE), and maybe a breakdown of how many rows are in different age ranges (DBA_TAB_HISTOGRAMS).
All of that information depends on optimizer statistics being correctly gathered. If a DBA foolishly disabled automatic optimizer statistics gathering, then these problems will happen all the time. But even if your system is using good settings, the predicate you're using may be an especially difficult case. Optimizer statistics aren't free to gather, so the system only collects them when 10% of the data changes. But since your predicate involves SYSDATE, the percentage of rows will change every day even if the table doesn't change. It may make sense to manually gather stats on this table more often than the default schedule, or use a /*+ dynamic_sampling */ hint, or create a SQL Profile/Plan Baseline, or one of the many ways to manage optimizer statistics and plan stability. But hopefully none of that will be necessary if you use an analytic function instead of a self-join.
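A rough sketch of two of those options (the schema name is a placeholder; note that the table sits behind a database link, so statistics have to be gathered on the remote database where it actually lives, and the hinted query is shown with just the simplified date filter for brevity):
-- gather statistics manually, more often than the default schedule
exec dbms_stats.gather_table_stats(ownname => 'MY_SCHEMA', tabname => 'NOTIFICHE');

-- or let the optimizer sample the table at parse time for this query only
select /*+ dynamic_sampling(n 4) */
       trunc(sysdate - data_ricezione) as delay
from   notifiche#fe_engine2fe_gateway n
where  data_ricezione < sysdate - 15;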
NOTE: This is a re-posting of a question from a Stack Overflow Teams site to attract a wider audience
I have a transaction log table that has many millions of records. Many of the data items that are linked to these logs might have more than 100K rows for each item.
I have a requirement to display a warning if a user tries to delete an item when more than 1000 items in the log table exist.
We have determined that 1000 logs means this item is in use
If I try to simply query the table to lookup the total number of log rows the query takes too long to execute:
SELECT COUNT(1)
FROM History
WHERE SensorID IN (SELECT Id FROM Sensor WHERE DeviceId = 96)
Is there a faster way to determine if the entity has more than 1000 log records?
NOTE: history table has an index on the SensorId column.
You are right to use Count instead of returning all the rows and checking the record count, but we are still asking the database engine to seek across all rows.
If the requirement is not to return the exact number of rows, but just to determine whether there are more than X rows, then the first improvement I would make is to return the count of just the first X rows from the table.
So if X is 1000, your application logic does not need to change; you will still be able to tell the difference between an item with 999 logs and one with 1000+ logs.
We simply change the existing query to select the TOP(X) rows instead of the count, and then return the count of that result set. Select only the primary key or a unique indexed column, so that we are only inspecting the index and not the underlying table store.
select count(Id) FROM (
SELECT TOP(1000) -- limit the seek that the DB engine does to the limit
Id -- further constrain the seek to just the indexed column
FROM History
where SensorId IN ( -- this is the same filter condition as before, just re-formatted
SELECT Id
FROM Sensor
WHERE DeviceId = 96)
) as trunk
Changing this query to TOP(10000) still provides a sub-second response; however, with X = 100,000 the query took almost as long as the original query.
There is another seemingly 'silver bullet' approach to this type of issue if the table in question has a high transaction rate and the main reason for the execution time is waiting caused by lock contention.
If you suspect that locks are the issue, and you can accept a count response that includes uncommitted rows, then you can use the WITH (NOLOCK) table hint to allow the query to run effectively in the READ UNCOMMITTED transaction isolation level.
There is a good discussion about the effect of the NOLOCK table hint on select queries here
SELECT COUNT(1) FROM History WITH (NOLOCK)
WHERE SensorId IN (SELECT Id FROM Sensor WHERE DeviceId = 96)
Although NOLOCK is strongly discouraged in general, this is a scenario where it can reasonably be permitted; it even makes sense, as your count before the delete will take into account another user or operation that is actively adding to the log count.
After many trials, when querying for 1000 or 10K rows the TOP-based count solution is still faster than using the NOLOCK table hint. NOLOCK, however, presents an opportunity to execute the same query with minimal change, while still returning within a timely manner.
The execution time of a select with NOLOCK will still increase as the number of rows in the underlying result set increases, whereas the performance of the select that has a TOP with no ORDER BY clause should remain constant once the TOP limit has been exceeded.
I know two ways to display the number of rows: one using count(*), which is slower, and the other using user_tables, which is quick.
select table_name, num_rows from user_tables;
it displays NULL for 4 tables:
TABLE_NAME NUM_ROWS
TABLEP
TABLEU
TABLEN
TABLE1
TRANSLATE 26
but
select count(*) from tableu
gives,
COUNT(*)
6
What is the problem here? What should I do so that user_tables is updated (or whatever else is needed) to show the exact number of rows? I have already tried issuing a COMMIT statement.
num_rows is not accurate, since it depends on when the DBMS_STATS package was last run:
exec dbms_stats.gather_schema_stats('OWNER_NAME');
Run stats like above and then re-run your query.
You should not assume or expect that num_rows in user_tables is an accurate row count. The only way to get an accurate row count would be to do a count(*) against the table.
num_rows is used by the cost-based optimizer (CBO) to provide estimates that drive query plans. The actual value does not need to be particularly accurate for those estimates to produce reasonable plans: if the optimizer's guess at the number of rows an operation will produce is off by a factor of 3 or 4, that is still likely to be more than accurate enough. The estimate is generated when statistics are gathered on the table. Generally, that happens late at night and only for tables whose statistics are either missing (num_rows is NULL) or stale (by default meaning that roughly 10% of the rows have been inserted or updated since statistics were last gathered). And even then, the values that are generated are normally only estimates; they are not intended to be 100% accurate.
It is possible to call dbms_stats.gather_table_stats to force num_rows to be populated immediately before querying num_rows and to pass parameters to generate a completely accurate value. Of course, that means that gather_table_stats is doing a count(*) under the covers (plus doing additional work to gather additional statistics) so it would be easier and more efficient to have done a count(*) directly in the first place.
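For completeness, a sketch of doing exactly that (the owner name is a placeholder; estimate_percent => 100 forces a full computation, so num_rows is exact as of the moment the stats are gathered):
exec dbms_stats.gather_table_stats(ownname => 'MY_SCHEMA', tabname => 'TABLEU', estimate_percent => 100);

select table_name, num_rows
from user_tables
where table_name = 'TABLEU';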
I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value.
And whether it ...
... is possible to stop counting as soon as my constant value is surpassed (and not wait to finish the counting).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
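If you want to feed that capped total into the per-group percentage from your example, one possible shape (a sketch; the rounding and the exact percentage expression are mine, adapt them to your real calculation):
WITH capped AS (
   SELECT count(*) AS total
   FROM  (SELECT 1 FROM token LIMIT 500000) t
)
SELECT text
     , count(*) AS cnt
     , round(100.0 * count(*) / (SELECT total FROM capped), 2) AS pct
FROM   token
GROUP  BY text
ORDER  BY cnt DESC;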
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless; we couldn't even tell within a factor of 100 how many rows would be returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
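If you'd rather not scrape the EXPLAIN output in application code, the same regex idea is often wrapped in a small PL/pgSQL helper; this is a sketch along the lines of the widely used count_estimate function, with illustrative names:
CREATE OR REPLACE FUNCTION count_estimate(query text)
  RETURNS integer
  LANGUAGE plpgsql AS
$func$
DECLARE
   rec  record;
   rows integer;
BEGIN
   -- run only the planner and pull the "rows=" figure out of the first plan line
   FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
      rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
      EXIT WHEN rows IS NOT NULL;
   END LOOP;
   RETURN rows;
END
$func$;

-- usage: returns the planner's estimate, not an exact count
SELECT count_estimate('SELECT * FROM foo');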
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a hash of the text, then grouping by the hash column (a sketch follows below).
Again, this is to decrease the workload (scan through a narrow column index).
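One possible shape for that hash idea, assuming Postgres 12+ for generated columns (the column and index names are made up, and md5 merely stands in for whatever hash you prefer):
ALTER TABLE token
  ADD COLUMN text_md5 text GENERATED ALWAYS AS (md5(text)) STORED;

CREATE INDEX token_text_md5_idx ON token (text_md5);

-- group on the narrow hash instead of the wide text column
SELECT text_md5, count(*) AS cnt
FROM   token
GROUP  BY text_md5
ORDER  BY cnt DESC;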
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is. (This only approximates the row count if id is a gap-free sequence starting at 1.)
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds up the row counts from all partitions of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.
I have received a SQL query that makes use of the DISTINCT keyword. When I tried running the query, it took at least a minute to join two tables with hundreds of thousands of records and actually return something.
I then took out the DISTINCT and it came back in 0.2 seconds. Does the DISTINCT keyword really make things that bad?
Here's the query:
SELECT DISTINCT
c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
FROM management.contacts c
JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
Yes, using DISTINCT will (sometimes, according to a comment) cause the results to be sorted, and sorting hundreds of thousands of records takes time.
Try grouping by all your columns instead; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle, I noticed a significant performance gain).
Distinct always sets off alarm bells to me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes there is a large cost to using it.
What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)
Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)
Finally, make sure you have indexed the custID on the orders table (and maybe recDate too).
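If either index is missing, creating them is straightforward (the index names here are mine):
CREATE INDEX orders_custid_ix ON management.orders (custID);
CREATE INDEX orders_recdate_ix ON management.orders (recDate);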
Purpose of DISTINCT is to prune duplicate records from the result set for all the selected columns.
If any of the selected columns is unique after join you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of selected column is unique, you can drop DISTINCT.
Actually, with properly designed databases you rarely need DISTINCT, and in those cases where you do need it, it is usually obvious. The RDBMS, however, cannot leave it to chance and must actually do the work (typically a sort or hash) to eliminate duplicates.
Normally you find DISTINCT all over the place when people are not sure about JOINs and relationships between tables.
Also, in classes that talk about pure relational databases, where the result should be a proper set (with no repeating elements = records), it is quite common for people to stick DISTINCT in to guarantee this property for the sake of theoretical correctness. Sometimes this creeps into production systems.
You can try a GROUP BY like this:
SELECT c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
FROM management.contacts c,
management.orders o
WHERE c.custID = o.custID
AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
Also verify that you have an index on o.recDate.