Trouble displaying the number of rows in a table, Oracle Express 11g DB - sql

I know two ways to display the number of rows: one using count(*), which is slower, the other using user_tables, which is quick.
select table_name, num_rows from user_tables;
displays NULL for 4 tables:
TABLE_NAME NUM_ROWS
TABLEP
TABLEU
TABLEN
TABLE1
TRANSLATE 26
but
select count(*) from tableu
gives,
COUNT(*)
6
What is the problem here, and what should I do so that user_tables is updated (or whatever else is needed) to show the exact number of rows? I have already tried issuing a commit statement.

num_rows is not accurate, since it depends on when the DBMS_STATS package was last run:
exec dbms_stats.gather_schema_stats('OWNER_NAME');
Run stats like above and then re-run your query.

You should not assume or expect that num_rows in user_tables is an accurate row count. The only way to get an accurate row count would be to do a count(*) against the table.
num_rows is used by the cost-based optimizer (CBO) to provide estimates that drive query plans. The actual value does not need to be particularly accurate for those estimates to generate reasonable plans: if the optimizer misjudges by a factor of 3 or 4 the number of rows that an operation will produce, the estimate is still likely to be accurate enough. This estimate is generated when statistics are gathered on the tables. Generally, that happens late at night, and only for tables that are either missing statistics (num_rows is NULL) or stale (generally meaning that roughly 20% of the rows are new or updated since the last time statistics were gathered). Even then, the values that are generated are normally only estimates; they are not intended to be 100% accurate.
It is possible to call dbms_stats.gather_table_stats to force num_rows to be populated immediately before querying num_rows and to pass parameters to generate a completely accurate value. Of course, that means that gather_table_stats is doing a count(*) under the covers (plus doing additional work to gather additional statistics) so it would be easier and more efficient to have done a count(*) directly in the first place.
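As a hedged sketch (the schema and table names are placeholders), forcing exact statistics for one table before reading num_rows might look like this; estimate_percent => NULL tells dbms_stats to compute rather than sample:
exec dbms_stats.gather_table_stats(ownname => 'MY_SCHEMA', tabname => 'TABLEU', estimate_percent => NULL);
select table_name, num_rows, last_analyzed
from user_tables
where table_name = 'TABLEU';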

Related

Index to get row count of read-only (immutable) PostgreSQL table?

I have a script that runs several times a day, which records the row counts of several PostgreSQL tables.
Some of the tables though are read-only and never change. (No rows are added or removed, nor are any values changed.)
Is there a way I could quickly get the row count from PostgreSQL? E.g., could I create an index on select count(*) from some_table;?
I'd prefer not to cache this in the script. If I were to cache it in the script, I haven't found a reliable way to determine whether a table has changed since the last time the script ran.
Unfortunately, in PostgreSQL SELECT COUNT(*) is often slower than in MySQL, to which it often gets compared.
You can use the following query as an alternative to SELECT COUNT(*).
SELECT reltuples FROM pg_class WHERE relname = 'mytable';
This is not always 100% up to date, but for immutable tables it will be accurate every time. And instant. For very large tables the percentage error will be very small, and thus well worth the massive saving in time.
If it does matter and the primary key column does not contain nulls, you can use
SELECT COUNT(primary_key_column) FROM table
and this will be significantly faster than SELECT COUNT(*)

Tracking row numbers in Oracle

Is there a way I could keep track of modified tables in Oracle?
Is there a master table that keeps track of every other table's rows? For example, if I add a row to table1, it would update the row count stating that table1 now has 5 rows.
I was thinking of tracking either dba_tables or all_tables or user_tables but I'm not sure which one actually counts the number of rows each table has.
You can improve on just querying user/all/dba_tab_statistics by combining them with information gathered by table monitoring.
The views DBA/ALL/USER_TAB_MODIFICATIONS are populated with the number of inserts, updates, deletes and truncates on each table since statistics were last gathered. The views are populated asynchronously, so call DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO first to flush the latest in-memory data to them.
Bear in mind that statistics themselves may be estimated, and although the accuracy is pretty good on most tables even to surprisingly low levels of estimation percent (even down to 5% or below), if you need accurate numbers you'll have to query the tables themselves with count(*). You can put together a pipelined function to do this for multiple tables with a single query.
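A rough sketch of that combination (the arithmetic is only an approximation, and partition-level rows in the modifications view are skipped via partition_name is null):
exec dbms_stats.flush_database_monitoring_info;
select t.table_name,
       t.num_rows + nvl(m.inserts, 0) - nvl(m.deletes, 0) as approx_rows
from   user_tables t
left join user_tab_modifications m
       on m.table_name = t.table_name
      and m.partition_name is null;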
SELECT TABLE_NAME, NUM_ROWS
FROM USER_TABLES
I highly doubt you're actually using Oracle 3.1. This query works at least in 11g (I don't have other instances to test at the moment).
Keep in mind that this is a data dictionary table and it won't update automatically after you insert a row in any schema table. The Gather Statistics procedure must be run to update these records.
The only difference among dba|user|all_tables is scope. user_tables limit the output to the tables you own, all_tables is basically user_tables + tables from other schemas you've been granted access to, and dba_tables is everything that exists in the database.
num_rows is a valid way to track the approximate number of rows in a table. Unfortunately, it is not calculated in real time but as part of the statistics collection operation. There is no out-of-the-box option for tracking row counts in real time that I am aware of.

Oracle: How can I find tablespace fragmentation?

I've a JOIN beween two tables. It's really really slow and I can't find why.
The query takes hours in a PRODUCTION environment on a very big Client.
Ask me for whatever you need in order to understand why it doesn't perform well.
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand records, because of the following condition:
f.eif_campo1 != c.fornitura AND f.field29 = 'New'
In fact, it should always be satisfied for all 18 million records.
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the following columns, expressed as the number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan (image omitted).
Extra details:
I have relaxed one condition to see how many records are returned: 300 thousand.
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395.358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';
Your explain plan shows the following. The database uses an index to retrieve rows from ENI_FLUSSI_HUB where flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC'). It then winnows the rows where flh_flag_ann = 'N'. This produces a result set which is used to access rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id = c.idde_identif_dati_ext_id. Finally, those rows are filtered on the basis of the remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective column: that is, if it contains hundreds of different values, or if the values in your list are relatively rare. It might even be a good path if the flag column identifies relatively few rows with a value of 'N'. So you need to understand both the distribution of your data (how many distinct values you have) and its skew (which values appear very often or hardly at all). The overall performance suggests that the distribution and/or skew of the flh_tipo_processo_cod and flh_flag_ann columns is not good.
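A quick way to eyeball that distribution and skew (a sketch using the table and column names visible in the plan) is something like:
select flh_tipo_processo_cod, flh_flag_ann, count(*) as cnt
from   eni_flussi_hub
group  by flh_tipo_processo_cod, flh_flag_ann
order  by cnt desc;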
So what can you do? One approach is to follow Ben's suggestion, and use full
table scans. If you have an Enterprise Edition licence and plenty of CPU capacity
you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on
eni_flussi_hub(flh_tipo_processo_cod,flh_flag_ann,idde_identif_dati_ext_id,
flh_fornitura,flh_id_messaggio) would avoid the need to read that table. Whether
this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other
activity against the table. You might be able to benefit from index compression.
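A sketch of that index (the index name is invented, and COMPRESS 2 is only there to illustrate the index compression suggestion; verify the useful prefix length against your data):
create index eni_flussi_hub_cov_ix
    on eni_flussi_hub (flh_tipo_processo_cod, flh_flag_ann,
                       idde_identif_dati_ext_id, flh_fornitura, flh_id_messaggio)
    compress 2;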
All the columns in the query projection are referenced in the WHERE clause, so you could also use a composite index on the other table to avoid table reads. Again, you need to understand the distribution and skew of the data, but you should probably lead with the least selective columns. Something like etl_elab_interf_flat(etl_elab_interf_flat,eif_campo200,dde_identif_dati_ext_id,eif_campo1,eif_campo198). Probably this is a new index; it's unlikely you would want to replace ETL_EIF_FK_IDX4 with this (especially if that really is an index on a foreign key constraint).
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.
The simple answer seems to be in your explain plan. You're accessing both tables by index rowid. While you cannot, to my knowledge, get faster access when selecting a single row, in your case you're selecting a lot more than a single row.
This means that for every single row, you're going into both tables one row at a time, which is not what you want to do when you're looking at a significant proportion of a table or index.
My suggestion would be to force a full scan of one or both of your tables. Try to use the smaller as a driver first:
SELECT /*+ full(c) */ c.flh_id_messaggio
, f.eif_campo1
, c.f
FROM flows c
JOIN flet f
ON f.field19 = c.flh_id_messaggio
AND f.extra_id = c.extra_id
AND f.field1 <> c.f
WHERE ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes seem to be single-column indexes as well. For this, and if possible, I would have indexes on:
flows of id_messaggio, extra_id, f
and on flet of field19, extra_id, field1 (see the sketch below).
This will only really matter if you do not use a full scan, or if everything you're returning and selecting is in one index.
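As a sketch (the index names are invented; adjust the column names to the real schema), those two composite indexes would look like:
create index flows_comp_ix on flows (id_messaggio, extra_id, f);
create index flet_comp_ix on flet (field19, extra_id, field1);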

Fast way to discover the row count of a table in PostgreSQL

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough, where "enough" is defined by the level of write activity to your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
As @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible, at the moment the count passes my constant value, to
stop the counting (and not wait until the counting finishes to report
that the row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless; it couldn't even tell us within a factor of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
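A sketch of that idea wrapped in a function, along the lines of the count_estimate helper on the Postgres wiki (the function name and regular expression are illustrative, not a built-in API):
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS
$$
DECLARE
    rec  record;
    rows integer;
BEGIN
    -- run EXPLAIN and pull the planner's row estimate from the top plan node
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql STRICT;
-- usage: SELECT count_estimate('SELECT * FROM foo');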
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column (see the sketch after this list).
Again, this is to decrease the workload (scan through a narrow column index).
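A sketch of the hash idea, assuming Postgres 12+ for generated columns (the column and index names are invented; md5 is used only because it is immutable and convenient):
ALTER TABLE token
    ADD COLUMN text_md5 uuid GENERATED ALWAYS AS (md5(text)::uuid) STORED;
CREATE INDEX ON token (text_md5);
-- group on the narrow hash column instead of the wide text column
SELECT text_md5, count(*) AS cnt
FROM token
GROUP BY text_md5
ORDER BY cnt DESC;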
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is.
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.

Count(*) vs Count(Id) in sql server 2005

I use the SQL COUNT function to get the total number of rows from a table. Is there any difference between the two following statements?
SELECT COUNT(*) FROM Table
and
SELECT COUNT(TableId) FROM Table
Also, is there any difference in terms of performance and execution time?
Thilo nailed the difference precisely... COUNT( column_name ) can return a lower number than COUNT( * ) if column_name can be NULL.
However, let me take a slightly different angle at answering your question, since you seem to be focusing on performance.
First, note that issuing SELECT COUNT(*) FROM table; will potentially block writers, and it will also be blocked by other readers/writers unless you have altered the isolation level (the knee-jerk reaction tends to be WITH (NOLOCK), but I'm seeing a promising number of people finally starting to believe in RCSI). This means that while you're reading the data to get your "accurate" count, all these DML requests are piling up; when you've finally released all of your locks, the floodgates open, a bunch of insert/update/delete activity happens, and there goes your "accurate" count.
If you need an absolutely transactionally consistent and accurate row count (even if it is only valid for the number of milliseconds it takes to return the number to you), then SELECT COUNT( * ) is your only choice.
On the other hand, if you are trying to get a 99.9% accurate ballpark, you are much better off with a query like this:
SELECT row_count = SUM(row_count)
FROM sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID('dbo.Table')
AND index_id IN (0,1);
(The SUM is there to account for partitioned tables - if you are not using table partitioning, you can leave it out.)
This DMV maintains accurate row counts for tables with the exception of rows that are currently participating in transactions - and those very transactions are the ones that will make your SELECT COUNT query wait (and ultimately make it inaccurate before you have time to read it). But otherwise this will lead to a much quicker answer than the query you propose, and no less accurate than using WITH (NOLOCK).
count(id) needs to null-check the column (which may be optimized away for a primary key or otherwise not-null column), so count(*) or count(1) should be preferred (unless you really want to know the number of rows with a non-null value for id).
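As a tiny illustration of that difference (a hypothetical temp table and values):
CREATE TABLE #t (id int NULL);
INSERT INTO #t (id) VALUES (1);
INSERT INTO #t (id) VALUES (2);
INSERT INTO #t (id) VALUES (NULL);
SELECT COUNT(*)  FROM #t; -- 3: counts every row
SELECT COUNT(id) FROM #t; -- 2: NULLs in id are skipped
SELECT COUNT(1)  FROM #t; -- 3: the constant is never NULL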