I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close, depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like #a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless, even to know how within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
Reference taken from this Blog.
You can use below to query to find row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column.
Again, this is to decrease the workload (scan through a narrow column index)
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is
In Oracle, you could use rownum to limit the number of rows returned. I am guessing similar construct exists in other SQLs as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.
Related
I have this oracle query that takes around 1 minute to get the results:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche#fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
(SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
FROM notifiche#fe_engine2fe_gateway n2
WHERE n.id_sdi = n2.id_sdi)
--AND sysdate-data_ricezione > 15
Basically i have this table named "notifiche", where each record represents a kind of update to another type of object (invoices). I want to know which invoice has not received any update in the last 15 days. I can do it by joining the notifiche n2 table, getting the most recent record for each invoice, and evaluate the difference between the update date (data_ricezione) and the current date (sysdate).
When i add the commented condition, the query takes then infinite time to complete (i mean hours, never saw the end of it...)
How is possibile that this simple condition make the query so slow?
How can I improve the performance?
Try to keep data_ricezione alone; if there's an index on it, it might help.
So: switch from
and sysdate - data_ricezione > 15
to
and -data_ricezione > 15 - sysdate / * (-1)
to
and data_ricezione < sysdate - 15
As everything is done over the database link, see whether the driving_site hint does any good, i.e.
select /*+ driving_site (n) */ --> "n" is table's alias
trunc(sysdate-data_ricezione) as delay
from
notifiche#fe_engine2fe_gateway n
...
Use an analytic function to avoid a self-join over a database link. The below query only reads from the table once, divides the rows into windows, finds theMAX value for each window, and lets you select rows based on that maximum. Analytic functions are tricky to understand at fist, but they often lead to code that is smaller and more efficient.
select id_sdi, data_ricezion
from
(
select id_sdi, data_ricezion, max(data_ricezion) over (partition by id_sdi) max_date
from notifiche#fe_engine2fe_gateway
)
where sysdate - max_date > 15;
As for why adding a simple condition can make the query slow - it's all about cardinality estimates. Cardinality, the number of rows, drives most of the database optimizer's decision. The best way to join a small amount of data may be very different than the best way to join a large amount of data. Oracle must always guess how many rows are returned by an operation, to know which algorithm to use.
Optimizer statistics (metadata about the tables, columns, and indexes) are what Oracle uses to make cardinality estimates. For example, to guess the number of rows filtered out by sysdate-data_ricezione > 15, the optimizer would want to know how many rows are in the table (DBA_TABLES.NUM_ROWS), what the maximum value for the column is (DBA_TAB_COLUMNS.HIGH_VALUE), and maybe a break down of how many rows are in different age ranges (DBA_TAB_HISTOGRAMS).
All of that information depends on optimizer statistics being correctly gathered. If a DBA foolishly disabled automatic optimizer statistics gathering, then these problems will happen all the time. But even if your system is using good settings, the predicate you're using may be an especially difficult case. Optimizer statistics aren't free to gather, so the system only collects them when 10% of the data changes. But since your predicate involves SYSDATE, the percentage of rows will change every day even if the table doesn't change. It may make sense to manually gather stats on this table more often than the default schedule, or use a /*+ dynamic_sampling */ hint, or create a SQL Profile/Plan Baseline, or one of the many ways to manage optimizer statistics and plan stability. But hopefully none of that will be necessary if you use an analytic function instead of a self-join.
I have a script that runs several times a day, which records the row counts of several PostgreSQL tables.
Some of the tables though are read-only and never change. (No rows are added or removed, nor are any values changed.)
Is there a way I could quickly get the row count from PostgreSQL? Eg. Could I create an index on select count(*) from some_table;?
I'd prefer not to cache this in the script. If I were to cache in the script, I haven't found a reliably way to determine if a table has been changed since the last time the script has run.
Unfortunately, in postgresql SELECT COUNT(*) is often slower than mysql to which it often get's compared to.
You can use the following query as an alternative to SELECT COUNT(*).
SELECT reltuples FROM pg_class WHERE relname = 'mytable';
This is not always 100% upto date but for immutable tables it will be accurate every time. And instant. For very large tables the percentage error will be very small and thus well worth the massive saving in time.
If it does matter and the table does not contain nulls, you can use
SELECT COUNT(primary_key_column) FROM table
and this will be significantly faster than SELECT COUNT(*)
I know two ways to display no. of rows one using count() - slower, other using user_tables -quickie.
select table_name, num_rows from user_tables;
displays, null for 4 tables
TABLE_NAME NUM_ROWS
TABLEP
TABLEU
TABLEN
TABLE1
TRANSLATE 26
but
select count(*) from tableu
gives,
COUNT(*)
6
What is the problem here, what should I do so that user_tables will be updated/ or whatever to show exact no. of rows. I have already tried issuing commit statement.
num_rows is not accurate since it depends on when is the last time DBMS_STATS package was ran:
exec dbms_stats.gather_schema_stats('ONWER NAME');
Run stats like above and then re-run your query.
You should not assume or expect that num_rows in user_tables is an accurate row count. The only way to get an accurate row count would be to do a count(*) against the table.
num_rows is used by the cost-based optimizer (CBO) to provide estimates that drive query plans. The actual value does not need to be particularly accurate for those estimates to generate query plans-- if the optimizer guesses incorrectly by a factor of 3 or 4 the number of rows that an operation will produce, that's still likely to be more than accurate enough. This estimate is generated when statistics are gathered on the tables. Generally, that happens late at night and only on tables that are either missing (num_rows is NULL) or are stale (generally meaning that roughly 20% of the rows are new or updated since the last time statistics were gathered). And even then, the values that are generated are normally only estimates, they're not intended to be 100% accurate.
It is possible to call dbms_stats.gather_table_stats to force num_rows to be populated immediately before querying num_rows and to pass parameters to generate a completely accurate value. Of course, that means that gather_table_stats is doing a count(*) under the covers (plus doing additional work to gather additional statistics) so it would be easier and more efficient to have done a count(*) directly in the first place.
I've a JOIN beween two tables. It's really really slow and I can't find why.
The query takes hours in a PRODUCTION environment on a very big Client.
Can you ask me what you need to understand why it doesn't work well?
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand record. Because of the following condition:
f.eif_campo1 != c.fornitura AND and f.field29 = 'New'
Infact it should be always verified for all 18 million records
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the following record expressed as number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan [See large image][1]
![enter image description here][2]
Extra details:
I have relaxed one contition to see how many records are taken. 300 thousand.
![enter image description here][7]
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395.358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';
Your explain plan shows the following.
The database uses an index to retrieve rows from ENI_FLUSSI_HUB where
flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC')
It then winnows the rows
where flh_flag_ann = 'N'
This produces a result set which is used to access
rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id =
c.idde_identif_dati_ext_id
Finally those rows are filtered on the basis of the
remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective
column: that is, if it contains hundreds of different values, or if the values in
your list are relatively rare. It might even be a good path of the flag column
identifies relatively few columns with a value of 'N'. So you need to understand
both the distribution of your data - how many distinct values you have - and its
skew - which values appear very often or hardly at all. The overall
performance suggests that the distribution and/or skew of the
flh_tipo_processo_cod and flh_flag_ann columns is not good.
So what can you do? One approach is to follow Ben's suggestion, and use full
table scans. If you have an Enterprise Edition licence and plenty of CPU capacity
you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on
eni_flussi_hub(flh_tipo_processo_cod,flh_flag_ann,idde_identif_dati_ext_id,
flh_fornitura,flh_id_messaggio) would avoid the need to read that table. Whether
this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other
activity against the table. You might be able to benefit from index compression.
All the columns in the query projection are referenced in the WHERE clause. So
you could also use a composite index on the other table to avoid table reads. Agsin you need to understand the distribution and skew of the data. But you should probably lead with the least selective columns. Something like etl_elab_interf_flat(etl_elab_interf_flat,eif_campo200,dde_identif_dati_ext_id,eif_campo1,eif_campo198). Probably this is a new index. It's unlikely you would want to replace ETL_EIF_FK_IDX4 with this (especially if that really is an index on a foreign key constraint)..
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.
The simple answer seems to be your explain plan. You're accessing both tables by index rowid. Whilst to select a single row you cannot - to my knowledge - get faster, in your case you're selecting a lot more than a single row.
This means that for every single row you, you're going into both tables one row at a time, which when you're looking a significant proportion of a table or index is not what you want to do.
My suggestion would be to force a full scan of one or both of your tables. Try to use the smaller as a driver first:
SELECT /*+ full(c) */ c.flh_id_messaggio
, f.eif_campo1
, c.f
FROM flows c,
JOIN flet f
ON f.field19 = c.flh_id_messaggio
AND f.extra_id = c.extra_id
AND f.field1 <> c.f
WHERE ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes seem to be separate column indexes as well. For this, and if possible, I would have indexes on:
flows of id_messaggio, extra_id, f
and on flet of field19, extra_id, field1.
This will only really matter if you do not use as full scan. Or, if you have everything you're returning and selecting is in one index.
I use SQL COUNT function to get the total number or rows from a table. Is there any difference between the two following statements?
SELECT COUNT(*) FROM Table
and
SELECT COUNT(TableId) FROM Table
Also, is there any difference in terms of performance and execution time?
Thilo nailed the difference precisely... COUNT( column_name ) can return a lower number than COUNT( * ) if column_name can be NULL.
However, if I can take a slightly different angle at answering your question, since you seem to be focusing on performance.
First, note that issuing SELECT COUNT(*) FROM table; will potentially block writers, and it will also be blocked by other readers/writers unless you have altered the isolation level (knee-jerk tends to be WITH (NOLOCK) but I'm seeing a promising number of people finally starting to believe in RCSI). Which means that while you're reading the data to get your "accurate" count, all these DML requests are piling up, and when you've finally released all of your locks, the floodgates open, a bunch of insert/update/delete activity happens, and there goes your "accurate" count.
If you need an absolutely transactionally consistent and accurate row count (even if it is only valid for the number of milliseconds it takes to return the number to you), then SELECT COUNT( * ) is your only choice.
On the other hand, if you are trying to get a 99.9% accurate ballpark, you are much better off with a query like this:
SELECT row_count = SUM(row_count)
FROM sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID('dbo.Table')
AND index_id IN (0,1);
(The SUM is there to account for partitioned tables - if you are not using table partitioning, you can leave it out.)
This DMV maintains accurate row counts for tables with the exception of rows that are currently participating in transactions - and those very transactions are the ones that will make your SELECT COUNT query wait (and ultimately make it inaccurate before you have time to read it). But otherwise this will lead to a much quicker answer than the query you propose, and no less accurate than using WITH (NOLOCK).
count(id) needs to null-check the column (which may be optimized away for a primary key or otherwise not-null column), so count(*) or count(1) should be prefered (unless you really want to know the number of rows with a non-null value for id).