How do I speed up counting rows in a PostgreSQL table? - sql

We need to count the number of rows in a PostgreSQL table. In our case, no conditions need to be met, and it would be perfectly acceptable to get a row estimate if that significantly improved query speed.
Basically, we want select count(id) from <table> to run as fast as possible, even if that implies not getting exact results.

For a very quick estimate:
SELECT reltuples FROM pg_class WHERE relname = 'my_table';
There are several caveats, though. For one, relname is not necessarily unique in pg_class. There can be multiple tables with the same relname in multiple schemas of the database. To be unambiguous:
SELECT reltuples::bigint FROM pg_class WHERE oid = 'my_schema.my_table'::regclass;
If you do not schema-qualify the table name, a cast to regclass observes the current search_path to pick the best match. And if the table does not exist (or cannot be seen) in any of the schemas in the search_path you get an error message. See Object Identifier Types in the manual.
The cast to bigint formats the real number nicely, especially for big counts.
Also, reltuples can be more or less out of date. There are ways to make up for this to some extent. See this later answer with new and improved options:
Fast way to discover the row count of a table in PostgreSQL
And a query on pg_stat_user_tables is many times slower (though still much faster than full count), as that's a view on a couple of tables.
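One common refinement is to scale reltuples by the table's current size on disk, so a stale estimate still tracks growth. A minimal sketch, using the same my_schema.my_table as above:
SELECT (reltuples / relpages)
     * (pg_relation_size('my_schema.my_table') / current_setting('block_size')::int) AS estimate
FROM   pg_class
WHERE  oid = 'my_schema.my_table'::regclass
AND    relpages > 0;  -- guard against division by zero on never-vacuumed tables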

Count is slow for big tables, so you can get a close estimate this way:
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname='tableName';
It's extremely fast, and thanks to the bigint cast the result is an integer rather than a float. It is not exact, but it is still a close estimate.
reltuples is a column of the pg_class table. It holds the "number of rows in the table. This is only an estimate used by the planner. It is updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX" (manual).
The catalog pg_class "catalogs tables and most everything else that has columns or is otherwise similar to a table. This includes indexes (but see also pg_index), sequences, views, composite types, and some kinds of special relation" (manual).
"Why is "SELECT count(*) FROM bigtable;" slow?" : http://wiki.postgresql.org/wiki/FAQ#Why_is_.22SELECT_count.28.2A.29_FROM_bigtable.3B.22_slow.3F

Aside from running COUNT() on an indexed field (which hopefully 'id' is), the next best thing would be to cache the row count in a separate table, maintained by a trigger on INSERT. Naturally, you'd then check the cache instead.
For an approximation you can try this (from https://wiki.postgresql.org/wiki/Count_estimate):
select reltuples from pg_class where relname='tablename';

You can get an estimate from the system view "pg_stat_user_tables".
select schemaname, relname, n_live_tup
from pg_stat_user_tables
where schemaname = 'your_schema_name'
and relname = 'your_table_name';

You can keep an exact count for the table by using an AFTER INSERT OR DELETE trigger.
Something like this:
CREATE TABLE tcounter(id serial primary key, table_schema text, table_name text, count bigint);
insert into tcounter(table_schema, table_name, count) select 'my_schema', 'my_table', count(*) from my_schema.my_table;
and use trigger
CREATE OR REPLACE FUNCTION ex_count()
  RETURNS trigger AS
$BODY$
BEGIN
  IF (TG_OP = 'INSERT') THEN
    UPDATE tcounter SET count = count + 1
    WHERE table_schema = TG_TABLE_SCHEMA::TEXT AND table_name = TG_TABLE_NAME::TEXT;
  ELSIF (TG_OP = 'DELETE') THEN
    UPDATE tcounter SET count = count - 1
    WHERE table_schema = TG_TABLE_SCHEMA::TEXT AND table_name = TG_TABLE_NAME::TEXT;
  END IF;
  RETURN NEW;
END
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;

CREATE TRIGGER tg_counter AFTER INSERT OR DELETE
ON my_schema.my_table FOR EACH ROW EXECUTE PROCEDURE ex_count();
And ask for the count:
select * from tcounter where table_schema = 'my_schema' and table_name = 'my_table';
This way you run count(*) only once, to initialize the first record.

If your database is small, you can get an estimate for all your tables like @mike-sherrill-cat-recall suggested. This command will list all the tables, though.
SELECT schemaname,relname,n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
Output would be something like this:
 schemaname |      relname       | n_live_tup
------------+--------------------+------------
 public     | items              |      21806
 public     | tags               |      11213
 public     | sessions           |       3269
 public     | users              |        266
 public     | shops              |        259
 public     | quantities         |         34
 public     | schema_migrations  |         30
 public     | locations          |          8
(8 rows)

Related

Faster Sqlite insert from another table

I have an SQLite DB which I am doing updates on, and it's very slow. I am wondering if I am doing it the best way or if there is a faster way. My tables are:
create table files(
fileid integer PRIMARY KEY,
name TEXT not null,
sha256 TEXT,
created INT,
mtime INT,
inode INT,
nlink INT,
fsno INT,
sha_id INT,
size INT not null
);
create table fls2 (
fileid integer PRIMARY KEY,
name TEXT not null UNIQUE,
size INT not null,
sha256 TEXT not null,
fs2,
fs3,
fs4,
fs7
);
Table 'files' is actually in an attached DB named ttb. I am then doing this:
UPDATE fls2
SET fs3 = (
SELECT inode || 'X' || mtime || 'X' || nlink
FROM
ttb.files
WHERE
ttb.files.fsno = 3
AND
fls2.name = ttb.files.name
AND
fls2.sha256 = ttb.files.sha256
);
So the idea is that fls2 has values in 'name' which are also present in ttb.files.name. In ttb.files there are other parameters which I want to insert into the corresponding rows in fls2. The query works, but I assume the matching up of the two tables is what takes the time, and I wonder if there's a more efficient way to do it. There are indexes on each column in fls2 but none on files. I am doing it as a transaction, with pragma journal = memory (although SQLite seems to be ignoring that because a journal file is being created).
It seems slow; so far it has taken about 90 minutes for around a million rows in each table.
One CPU is pegged, so I assume it's not disk bound.
Can anyone suggest a better way to structure the query?
EDIT: EXPLAIN QUERY PLAN
|--SCAN TABLE fls2
`--CORRELATED SCALAR SUBQUERY 1
`--SCAN TABLE files
Not sure what that means, though. Does it carry out the SCAN TABLE files for each SCAN TABLE fls2 hit?
EDIT2:
Well blimey. Ctrl-C the query, which had been running 2.5 hours at that point, exit SQLite, open SQLite on the files DB, create an index on (sha256, name): 1 minute or so. Exit that, open SQLite on the main DB. EXPLAIN shows that the latter scan is now done with the index. Run the update: it takes 150 seconds. Compared to more than 150 minutes, that's a heck of a speed-up. Thanks for the assistance.
TIA, Pete
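For reference, the fix described in EDIT2 amounts to roughly this on the attached database (the index name is illustrative):
-- Index the lookup columns on ttb.files so the correlated subquery becomes
-- an index search instead of a full scan per row of fls2.
CREATE INDEX IF NOT EXISTS ttb.idx_files_sha256_name ON files (sha256, name);
-- Re-running EXPLAIN QUERY PLAN should then show something like
-- "SEARCH TABLE files USING INDEX idx_files_sha256_name" for the subquery.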
"There are indexes on each column in fls2"
Indexes are used for faster selection, but they slow down inserts and updates. Maybe removing the one on fls2.fs3 helps?
I'm not an expert on SQLite, but on some databases it is more performant to insert the joined data into a temporary table, delete the original rows, and then re-insert them from the temp table. Something like this (assuming tmptab has the same columns as fls2):
INSERT INTO tmptab
SELECT fls2.fileid,
       fls2.name,
       fls2.size,
       fls2.sha256,
       fls2.fs2,
       ttb.files.inode || 'X' || ttb.files.mtime || 'X' || ttb.files.nlink,
       fls2.fs4,
       fls2.fs7
FROM fls2
INNER JOIN ttb.files
        ON fls2.name = ttb.files.name
       AND fls2.sha256 = ttb.files.sha256;

DELETE FROM fls2
WHERE EXISTS (SELECT 1 FROM tmptab WHERE tmptab.fileid = fls2.fileid);

INSERT INTO fls2 SELECT * FROM tmptab;

How to find null and empty columns in a table with SQL

I am using Oracle SQL Developer. We are loading tables with data, and I need to validate whether all the tables are populated and whether there are any columns that are completely null (all the rows are null for that column).
Currently I click each table, look at the Data tab to check whether the table is populated, and then look through each of the columns using filters to figure out if there are any completely null columns. I am wondering if there is a faster way to do this.
Thanks,
Suresh
You're in luck - there's a fast and easy way to get this information using optimizer statistics.
After a large data load the statistics should be gathered anyway. Counting NULLs is something the statistics gathering already does. With the default settings since 11g, Oracle will count the number of NULLs 100% accurately. (But remember that the number will only reflect that one point in time. If you add data later, the statistics must be re-gathered to get newer results.)
Sample schema
create table test1(a number); --Has non-null values.
create table test2(b number); --Has NULL only.
create table test3(c number); --Has no rows.
insert into test1 values(1);
insert into test1 values(2);
insert into test2 values(null);
commit;
Gather stats and run a query
begin
dbms_stats.gather_schema_stats(user);
end;
/
select table_name, column_name, num_distinct, num_nulls
from user_tab_columns
where table_name in ('TEST1', 'TEST2', 'TEST3');
Using the NUM_DISTINCT and NUM_NULLS you can tell if the column has non-NULLs (num_distinct > 0), NULL only (num_distinct = 0 and num_nulls > 0), or no rows (num_distinct = 0 and num_nulls = 0).
TABLE_NAME   COLUMN_NAME   NUM_DISTINCT   NUM_NULLS
----------   -----------   ------------   ---------
TEST1        A                        2           0
TEST2        B                        0           1
TEST3        C                        0           0
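If you want the same three-way classification done by the query itself rather than by eye, a small sketch (same rules as above, restricted to the sample tables):
SELECT table_name,
       column_name,
       CASE
         WHEN num_distinct > 0                   THEN 'has non-NULL values'
         WHEN num_distinct = 0 AND num_nulls > 0 THEN 'NULL only'
         ELSE                                         'no rows'
       END AS column_state
FROM   user_tab_columns
WHERE  table_name IN ('TEST1', 'TEST2', 'TEST3');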
Certainly. Write a SQL script that:
Enumerates all of the tables
Enumerates the columns within the tables
Determine a count of rows in the table
Iterate over each column and count how many rows are NULL in that column.
If the number of NULL rows for a column equals the number of rows in the table, you've found what you're looking for; see the sketch below.
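A minimal PL/SQL sketch of that approach, assuming the current schema is enough; it builds dynamic SQL from user_tab_columns, so treat it as illustrative rather than hardened:
-- Loop over every column of every table in the current schema and report
-- columns where every row is NULL. Enable serveroutput to see the results.
DECLARE
  v_total   NUMBER;
  v_nonnull NUMBER;
BEGIN
  FOR c IN (SELECT col.table_name, col.column_name
            FROM   user_tab_columns col
            JOIN   user_tables t ON t.table_name = col.table_name) LOOP
    EXECUTE IMMEDIATE
      'SELECT COUNT(*), COUNT("' || c.column_name || '") FROM "' || c.table_name || '"'
      INTO v_total, v_nonnull;    -- COUNT(col) counts only non-NULL values
    IF v_total > 0 AND v_nonnull = 0 THEN
      dbms_output.put_line(c.table_name || '.' || c.column_name || ' is entirely NULL');
    END IF;
  END LOOP;
END;
/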
Here's how to do just one column in one table; if the COUNT comes back as anything higher than 0, it means there is data in that column.
SELECT COUNT(<column_name>)
FROM <table_name>
WHERE <column_name> IS NOT NULL;
This query returns what you want:
select table_name,column_name,nullable,num_distinct,num_nulls from all_tab_columns
where owner='SCHEMA_NAME'
and num_distinct is null
order by column_id;
You can use the script below to get the empty columns in a table:
SELECT column_name
FROM all_tab_cols
where table_name in (<table>)
and avg_col_len = 0;

Reuse a complex query result in other queries without redoing the complex query

I have a complex query in PostgreSQL and I want to use the result of it in other operations like UPDATEs and DELETEs, something like:
<COMPLEX QUERY>;
UPDATE WHERE <COMPLEX QUERY RESULT> = ?;
DELETE WHERE <COMPLEX QUERY RESULT> = ?;
UPDATE WHERE <COMPLEX QUERY RESULT> = ?;
I don't want to have to run the complex query once for each operation. One way to avoid this is to store the result in a table, use it in the WHERE clauses and JOINs, and drop the temporary table after finishing.
I want to know if there is another way that avoids storing the results in the database and instead uses the results already in memory.
I already use loops for this, but I think doing a single set-based operation for each step will be faster than doing the operations row by row.
You can loop through the query results like @phatfingers demonstrates (probably with a generic record variable or scalar variables instead of a rowtype, if the result type of the query doesn't match any existing rowtype). This is a good idea for a small number of resulting rows or when sequential processing is necessary.
For big result sets your original approach will perform faster by an order of magnitude. It is much cheaper to do a mass INSERT / UPDATE / DELETE with one SQL command
than to write / delete incrementally, one row at a time.
A temporary table is the right thing for reusing such results. It gets dropped automatically at the end of the session. You only have to drop it explicitly if you want to get rid of it right away, or specify that it be dropped at the end of the transaction. I quote the manual here:
Temporary tables are automatically dropped at the end of a session, or
optionally at the end of the current transaction.
For big temporary tables it might be a good idea to run ANALYZE after they are populated.
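A minimal sketch of that workflow (table and column names are invented for illustration):
-- Materialize the expensive result once, then reuse it in several statements.
CREATE TEMP TABLE complex_result AS
SELECT id                      -- stands in for the real complex query
FROM   some_big_table
WHERE  expensive_condition;

ANALYZE complex_result;        -- give the planner statistics for the follow-up statements

UPDATE target_tbl t
SET    flag = true
FROM   complex_result c
WHERE  t.id = c.id;

DELETE FROM other_tbl o
USING  complex_result c
WHERE  o.id = c.id;

DROP TABLE complex_result;     -- optional; dropped automatically at session end anyway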
Writeable CTE
Here is a demo for what Pavel added in his comment:
CREATE TEMP TABLE t1(id serial, txt text);
INSERT INTO t1(txt)
VALUES ('foo'), ('bar'), ('baz'), ('bax');
CREATE TEMP TABLE t2(id serial, txt text);
INSERT INTO t2(txt)
VALUES ('foo2'),('bar2'),('baz2');
CREATE TEMP TABLE t3 (id serial, txt text);
WITH x AS (
UPDATE t1
SET txt = txt || '2'
WHERE txt ~~ 'ba%'
RETURNING txt
)
, y AS (
DELETE FROM t2
USING x
WHERE t2.txt = x.txt
RETURNING *
)
INSERT INTO t3
SELECT *
FROM y
RETURNING *;
Read more in the chapter Data-Modifying Statements in WITH in the manual.
DECLARE
    r foo%rowtype;
BEGIN
    FOR r IN [COMPLEX QUERY]
    LOOP
        -- process r
    END LOOP;
    RETURN;
END

PostgreSQL dynamic table access

I have a products schema and some tables there.
Each table in products schema has an id, and by this id I can get this table name, e.g.
products
\ product1
\ product2
\ product3
I need to select info from dynamic access to appropriate product, e.g.
SELECT * FROM 'products.'(SELECT id from categories WHERE id = 7);
Of course, this doesn't work...
How can I do something like that in PostgreSQL?
OK, I found a solution:
CREATE OR REPLACE FUNCTION getProductById(cid int) RETURNS RECORD AS $$
DECLARE
    result RECORD;
BEGIN
    EXECUTE 'SELECT * FROM '
         || (SELECT ('products.' || (SELECT category_name FROM category WHERE category_id = cid) || '_view')::regclass)
    INTO result;
    RETURN result;
END;
$$ LANGUAGE plpgsql;
and to select:
SELECT * FROM getProductById(7) AS b (category_id int, ... );
works for PostgreSQL 9.x
If you can change your database layout to use partitioning instead, that would probably be the way to go. Then you can just access the "master" table as if it were one table rather than multiple subtables.
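In PostgreSQL 9.x that means inheritance-based partitioning; a rough sketch with invented columns:
CREATE TABLE products.product_master (
    category_id int  NOT NULL,
    name        text,
    price       numeric
);

CREATE TABLE products.product1 (CHECK (category_id = 1)) INHERITS (products.product_master);
CREATE TABLE products.product2 (CHECK (category_id = 2)) INHERITS (products.product_master);

-- With constraint_exclusion = partition (the default), a query like
--   SELECT * FROM products.product_master WHERE category_id = 2;
-- only scans the child table whose CHECK constraint matches.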
You could create a view that combines the tables with an extra column corresponding to the table it's from. If all your queries specify a value for this extra column, the planner should be smart enough to skip scanning all the rest of the tables.
Or you could write a function in PL/pgSQL, using the EXECUTE command to construct the appropriate query after fetching the table name. The function can even return a set so it can be used in the FROM clause just as you would a table reference. Or you could just do the same query construction in your application logic.
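A sketch of that set-returning variant, assuming all product tables share one row layout (represented here by a hypothetical table products.product_template) and that format() with %I is available (PostgreSQL 9.1+):
CREATE OR REPLACE FUNCTION get_product_rows(cid int)
  RETURNS SETOF products.product_template AS $$
BEGIN
    RETURN QUERY EXECUTE format(
        'SELECT * FROM products.%I',      -- %I quotes the identifier safely
        (SELECT category_name FROM category WHERE category_id = cid)
    );
END;
$$ LANGUAGE plpgsql;

-- Usable directly in FROM, no column definition list needed:
-- SELECT * FROM get_product_rows(7);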
To me, it sounds like you have a major schema design problem: shouldn't you have just one products table with a category_id column in it?
Might you be maintaining the website mentioned in this article?
http://thedailywtf.com/Articles/Confessions-The-Shopping-Cart.aspx

PL/SQL - caching two resultsets into collections and joining them together?

I have two very large tables, and I need to process a small result set from them. However, the processing is done in several functions, and each function must do some joining in order to format the data in the proper way.
I would definitely need to cache the initial result set somehow so it can be reused by the functions. What I would like to do is put the first result set in one collection, the second result set in another collection, and then manipulate these collections through SQL queries as if they were real SQL tables.
Can you suggest how this can be done?
Sounds like a job for temp tables:
CREATE GLOBAL TEMPORARY TABLE table_name (...) ON ...
The ON has two options, with different impacts:
ON COMMIT DELETE ROWS makes the temporary table transaction-specific. Data persists in the table only until the end of the transaction; when the transaction ends (because you commit or run DDL), the database truncates the table (deletes all rows). This is the default option.
ON COMMIT PRESERVE ROWS makes the temporary table session-specific. Data persists in the table until the end of the session; when the session ends (for example, you type exit in SQL*Plus), the database truncates the table (deletes all rows).
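A minimal sketch (names and columns are invented):
-- Session-private scratch table; rows survive commits and vanish at session end.
CREATE GLOBAL TEMPORARY TABLE tmp_resultset (
    id   NUMBER,
    name VARCHAR2(100)
) ON COMMIT PRESERVE ROWS;

-- Populate once from the expensive query, then let each function join against it.
INSERT INTO tmp_resultset (id, name)
SELECT id, name
FROM   some_big_table
WHERE  expensive_flag = 'Y';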
Reference:
ASKTOM
Create Temporary Table in Oracle
...but it is possible that you don't need to use temporary tables. Derived tables/inline views/subqueries (maybe pipelining) can possibly do what you want, but the info is vague so I can't recommend a particular approach.
If your collections are declared with a SQL type you can use them in SQL statements with a TABLE() function. In Oracle 10g we can merge collections using the MULTISET UNION operator. The following code shows examples of both techniques...
declare
    v1 sys.dbms_debug_vc2coll;
    v2 sys.dbms_debug_vc2coll;
    v3 sys.dbms_debug_vc2coll := sys.dbms_debug_vc2coll();
begin
    select ename
    bulk collect into v1
    from emp;

    select dname
    bulk collect into v2
    from dept;

    -- manipulate contents using SQL
    for r in ( select * from table(v1)
               intersect
               select * from table(v2) )
    loop
        dbms_output.put_line('Employee '|| r.column_value ||' has same name as a department');
    end loop;

    -- combine two collections into one
    dbms_output.put_line('V3 has '|| v3.count() ||' elements');
    v3 := v1 multiset union v2;
    dbms_output.put_line('V3 now has '|| v3.count() ||' elements');
end;
/

Employee SALES has same name as a department
V3 has 0 elements
V3 now has 23 elements

PL/SQL procedure successfully completed.
There are a number of other approaches you can employ. As a rule it is better to use SQL rather than PL/SQL, so OMG Ponies' suggestion of temporary tables might be appropriate. It really depends on the precise details of your processing needs.
You need to create a schema-level type (not inside a package) as a nested table. You can populate it, and then use it in your queries as a normal table with the table() operator. A quick example:
create type foo as table of number;  -- or a table of a record/object type, whatever
...
myfoo1 foo := foo(1, 2, 3);
myfoo2 foo := foo(3, 4, 5);
select column_value
into bar
from table(myfoo1) join table(myfoo2) using (column_value);