Using WITH + DELETE clause in a single query in PostgreSQL

I have the following table structure for a table named listens, with PRIMARY KEY on (uid, timestamp):
Column | Type | Modifiers
----------------+-----------------------------+------------------------------------------------------
id | integer | not null default nextval('listens_id_seq'::regclass)
uid | character varying | not null
date | timestamp without time zone |
timestamp | integer | not null
artist_msid | uuid |
album_msid | uuid |
recording_msid | uuid |
json | character varying |
I need to remove, for each user (uid), all entries that are older than that user's max timestamp minus a delta. Say max is 123456789 (in seconds) and delta is 100000; then all records with a timestamp older than max - 100000 should be removed.
I have managed to write a query that works when the table contains a single user, but I am unable to formulate it so that it works for every user in the database. This operation needs to be done for every user in the database.
WITH max_table as (
SELECT max(timestamp) - 10000 as max
FROM listens
GROUP BY uid)
DELETE FROM listens
WHERE timestamp < (SELECT max FROM max_table);
Any solutions?

I think all you need is to make this a correlated subquery:
WITH max_table as (
SELECT uid, max(timestamp) - 10000 as mx
FROM listens
GROUP BY uid
)
DELETE FROM listens
WHERE timestamp < (SELECT mx
FROM max_table
where max_table.uid = listens.uid);
Btw: timestamp is a horrible name for a column, especially one that doesn't contain a timestamp value. One reason is that it's also a keyword, but more importantly it doesn't document what the column contains. A registration timestamp? An expiration timestamp? A last active timestamp?

Alternatively, you could avoid the MAX() by using an EXISTS()
DELETE FROM listens d
WHERE EXISTS (
SELECT * FROM listens x
WHERE x.uid = d.uid
AND x.timestamp >= d.timestamp + 10000
);
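Before running either DELETE, it can be worth a dry run to see how many rows would go. A hedged sketch that simply reuses the same EXISTS predicate (assuming the listens table and columns from the question):
-- Dry run: counts the rows the EXISTS-based DELETE above would remove.
SELECT count(*)
FROM listens d
WHERE EXISTS (
SELECT * FROM listens x
WHERE x.uid = d.uid
AND x.timestamp >= d.timestamp + 10000
);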
BTW: timestamp is an ugly name for a column, since it is also a typename.

Related

Select multiple 'latest' rows from single table of snapshots, one per story snapshotted

I have a table with multiple snapshots of analytics data for multiple stories. The data is stored along with the timestamp it was taken, and the story_id the data is referring to.
id: integer auto_increment
story_id: string
timestamp: datetime
value: number
and I need to pull out the latest value for each story (i.e. each unique storyId) in a list of ids.
I've written a query, but it scales catastrophically.
SELECT story_id, value
FROM table
WHERE story_id IN ('1','2','3')
AND id = (SELECT id
FROM table inner
WHERE inner.story_id = table.story_id
ORDER BY timestamp DESC
LIMIT 1)
What's a more efficient way to make this query?
Nice to know:
story_id has to be a string, it's from an external data source
story_id and timestamp already have indexes
there are 2.9M rows and counting...
This is a good case for an ORDER BY-controlled DISTINCT ON.
select DISTINCT ON (story_id)
story_id, "value"
from the_table
where story_id in ('1','2','3')
ORDER BY story_id, "timestamp" desc;
An index on (story_id, timestamp), as @wildplasser suggests, will make it scale well.
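As a hedged sketch (assuming the table really is called the_table and the columns are as described; the index name is made up):
-- Composite index so DISTINCT ON (story_id) ... ORDER BY story_id, "timestamp" DESC
-- can be answered from the index instead of a full sort.
CREATE INDEX the_table_story_id_timestamp_idx
ON the_table (story_id, "timestamp" DESC);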
You can try it like this and see whether it improves your performance:
SELECT
story_id,
value
FROM (
SELECT
story_id,
timestamp,
MAX(timestamp) OVER (PARTITION BY story_id) AS max_timestamp_per_id,
value
FROM the_table) AS sub
WHERE timestamp = max_timestamp_per_id;
An equivalent rewrite of your query, using an analytic (window) function to get the last row per story_id in timestamp order, is shown below:
with last_snap as (
select
story_id, value,
row_number() over (partition by story_id order by timestamp desc) as rn
from tab
)
select story_id, value
from last_snap
where rn = 1;
This may work better than your subquery solution, but it will also not scale well beyond a few hundred thousand rows (more precisely, it scales the same way as sorting your full data set).
The correct setup for a table with multiple snapshots is a partitioned table containing one partition for each snapshot.
The query selects only the data in the last snapshot (snap_id = nnn) skipping all older data.
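A minimal sketch of that layout with declarative partitioning (names are illustrative; snap_id is the snapshot identifier mentioned above and is not part of the question's table):
-- One partition per snapshot; queries filtered on snap_id touch only that partition.
CREATE TABLE story_snap
( snap_id integer NOT NULL
, story_id text NOT NULL
, ztimestamp timestamp NOT NULL
, zvalue integer NOT NULL DEFAULT 0
) PARTITION BY LIST (snap_id);
CREATE TABLE story_snap_100 PARTITION OF story_snap FOR VALUES IN (100);
-- Selecting the latest snapshot skips all older data:
SELECT story_id, zvalue FROM story_snap WHERE snap_id = 100;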
Use the proper table definition, including a natural key on (story_id, ztimestamp).
BTW: timestamp is a data type, so better not to use it as a column name.
BTW2: you probably want story_id to be an integer field instead of a text field, and since it is a key field you may also want it to be NOT NULL.
-- DDL
DROP TABLE story CASCADE;
CREATE TABLE story
( id serial not null primary key
, story_id text NOT NULL
, ztimestamp timestamp not null
, zvalue integer not null default 0
, UNIQUE (story_id, ztimestamp) -- the natural key
);
\d+ story
EXPLAIN
SELECT * FROM story st
WHERE story_id IN('1','2','3')
AND NOT EXISTS(
SELECT *
FROM story nx
WHERE nx.story_id = st.story_id
AND nx.ztimestamp > st.ztimestamp
);
DROP TABLE
CREATE TABLE
Table "tmp.story"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------+-----------------------------+-----------+----------+-----------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('story_id_seq'::regclass) | plain | |
story_id | text | | not null | | extended | |
ztimestamp | timestamp without time zone | | not null | | plain | |
zvalue | integer | | not null | 0 | plain | |
Indexes:
"story_pkey" PRIMARY KEY, btree (id)
"story_story_id_ztimestamp_key" UNIQUE CONSTRAINT, btree (story_id, ztimestamp)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Nested Loop Anti Join (cost=1.83..18.97 rows=13 width=48)
-> Bitmap Heap Scan on story st (cost=1.67..10.94 rows=16 width=48)
Recheck Cond: (story_id = ANY ('{1,2,3}'::text[]))
-> Bitmap Index Scan on story_story_id_ztimestamp_key (cost=0.00..1.67 rows=16 width=0)
Index Cond: (story_id = ANY ('{1,2,3}'::text[]))
-> Index Only Scan using story_story_id_ztimestamp_key on story nx (cost=0.15..0.95 rows=2 width=40)
Index Cond: ((story_id = st.story_id) AND (ztimestamp > st.ztimestamp))
(7 rows)

How do I select the latest rows for all users?

I have a table similar to the following:
=> \d table
Table "public.table"
Column | Type | Modifiers
-------------+-----------------------------+-------------------------------
id | integer | not null default nextval( ...
user | bigint | not null
timestamp | timestamp without time zone | not null
field1 | double precision |
As you can see, it contains many field1 values over time for all users. Is there a way to efficiently get the latest field1 value for all users in one query (i.e. one row per user)? I'm thinking I might have to use some combination of group by and select first.
Simplest with DISTINCT ON in Postgres:
SELECT DISTINCT ON ("user")
"user", timestamp, field1
FROM tbl
ORDER BY "user", timestamp DESC;
More details:
https://dba.stackexchange.com/questions/49540/how-do-i-efficiently-get-the-most-recent-corresponding-row/49555#49555
Select first row in each GROUP BY group?
Aside: Don't use timestamp as column name. It's a reserved word in SQL and a basic type name in Postgres.
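If this has to be fast on a big table, here is a hedged sketch of a matching index (assuming the table and column names above; "user" needs quoting because it is a reserved word, and the index name is invented):
-- Lets DISTINCT ON ("user") ... ORDER BY "user", timestamp DESC walk the index.
CREATE INDEX tbl_user_timestamp_idx ON tbl ("user", "timestamp" DESC);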

PostgreSQL delete all but the oldest records

I have a PostgreSQL database that has multiple entries for the objectid, on multiple devicenames, but there is a unique timestamp for each entry. The table looks something like this:
address | devicename | objectid | timestamp
--------+------------+---------------+------------------------------
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-03 15:37:09.06065+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-03 15:48:33.93128+00
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-05 16:01:59.266779+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-05 16:13:46.843113+00
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-06 01:11:45.853361+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-06 01:23:21.204324+00
I want to delete all but the oldest entry for each objectid and devicename. In this case I want to delete all but:
1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
Is there a way to do this? Or is it possible to select the oldest entries for both "objectid and devicename" into a temp table?
This should do it:
delete from devices
using (
select ctid as cid,
row_number() over (partition by devicename, objectid order by timestamp asc) as rn
from devices
) newest
where newest.cid = devices.ctid
and newest.rn <> 1;
It creates a derived table that will assign unique numbers to each combination of (devicename, objectid), giving the earliest one (the one with the smallest timestamp value) the number 1. Then this result is used to delete all those that do not have the number 1. The virtual column ctid is used to uniquely identify those rows (it's an internal identifier supplied by Postgres).
Note that for deleting a really large amount of rows, Erwin's approach will most definitely be faster.
SQLFiddle demo: http://www.sqlfiddle.com/#!1/5d9fe/2
To distill the described result (keep only the oldest row per devicename and objectid), this would probably be simplest and fastest:
SELECT DISTINCT ON (devicename, objectid) *
FROM tbl
ORDER BY devicename, objectid, ts;
Details and explanation in this related answer.
From your sample data, I conclude that you are going to delete large portions of the original table. It is probably faster to just TRUNCATE the table (or DROP & recreate it, since you should add a surrogate pk column anyway) and write the remaining rows to it. This also gives you a pristine table, implicitly clustered (ordered) in the way that is best for your queries, and saves the work that VACUUM would otherwise have to do. And it's probably still faster overall.
I would also strongly advise to add a surrogate primary key to your table, preferably a serial column.
BEGIN;
CREATE TEMP TABLE tmp_tbl ON COMMIT DROP AS
SELECT DISTINCT ON (devicename, objectid) *
FROM tbl
ORDER BY devicename, objectid, ts;
TRUNCATE tbl;
ALTER TABLE tbl ADD column tbl_id serial PRIMARY KEY;
-- or, if you can afford to drop & recreate:
-- DROP TABLE tbl;
-- CREATE TABLE tbl (
-- tbl_id serial PRIMARY KEY
-- , address text
-- , devicename text
-- , objectid text
-- , ts timestamp);
INSERT INTO tbl (address, devicename, objectid, ts)
SELECT address, devicename, objectid, ts
FROM tmp_tbl;
COMMIT;
Do it all within a transaction to make sure you are not going to fail half way through.
This is fast as long as your setting for temp_buffers is big enough to hold the temporary table. Else the system will start swapping data to disk and performance takes a dive. You can set temp_buffers just for the current session like this:
SET temp_buffers = '1000MB';
That way you don't waste RAM that you don't normally need for temp_buffers. It has to be set before the first use of any temporary objects in the session. More information in this related answer.
Also, as the INSERT follows a TRUNCATE inside a transaction, it will be easy on the Write Ahead Log - improving performance.
Consider CREATE TABLE AS for the alternative route:
What causes large INSERT to slow down and disk usage to explode?
The only downside: You need an exclusive lock on the table. This may be a problem in databases with heavy concurrent load.
Finally, never use timestamp as column name. It's a reserved word in every SQL standard and a type name in PostgreSQL. I used ts instead.
DELETE FROM devices d WHERE d.timestamp > (SELECT min(timestamp) FROM devices WHERE objectid = d.objectid AND devicename = d.devicename);
My suggestion is to use a subquery that checks for the existence of a record with an older timestamp:
DELETE FROM tablename
WHERE EXISTS(
SELECT * FROM tablename a
WHERE tablename.address = a.address
AND tablename.devicename = a.devicename
AND tablename.objectid = a.objectid
AND a.timestamp < tablename.timestamp
)
The query for selecting the oldest records will look like this:
SELECT address, devicename, objectid, MIN(timestamp)
FROM tablename
GROUP BY address, devicename, objectid
This should work assuming that address, devicename and objectid make up a unique identifier
DELETE FROM tablename
WHERE
address || devicename || objectid || timestamp NOT IN
(SELECT
address || devicename || objectid || min(timestamp)
FROM tablename
GROUP BY address, devicename, objectid)
This uses a concatenated string of the identifying columns to tie the selects together. The subquery finds the min date for each unique combination, and the outer statement deletes every record that is not in that set. Probably not the most efficient, but it should work.
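A hedged variant of the same idea that avoids the concatenation (which can produce false matches when values happen to run together) is a row-value comparison; assuming the same columns:
-- Row-value comparison instead of concatenated strings; beware NOT IN if any column can be NULL.
DELETE FROM tablename
WHERE (address, devicename, objectid, timestamp) NOT IN
(SELECT address, devicename, objectid, min(timestamp)
FROM tablename
GROUP BY address, devicename, objectid);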

Get last record of a table in Postgres

I'm using Postgres and cannot manage to get the last record of my table:
my_query = client.query("SELECT timestamp,value,card from my_table");
How can I do that, knowing that timestamp is a unique identifier of the record?
If under "last record" you mean the record which has the latest timestamp value, then try this:
my_query = client.query("
SELECT TIMESTAMP,
value,
card
FROM my_table
ORDER BY TIMESTAMP DESC
LIMIT 1
");
you can use
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
assuming you also want to sort by timestamp.
Easy way: ORDER BY in conjunction with LIMIT
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1;
However, LIMIT is not standard, and as Wikipedia puts it, "The SQL standard's core functionality does not explicitly define a default sort order for Nulls." Finally, only one row is returned even when several records share the maximum timestamp.
Relational way:
The typical way of doing this is to check that no row has a higher timestamp than any row we retrieve.
SELECT timestamp, value, card
FROM my_table t1
WHERE NOT EXISTS (
SELECT *
FROM my_table t2
WHERE t2.timestamp > t1.timestamp
);
It is my favorite solution, and the one I tend to use. The drawback is that our intent is not immediately clear when glancing at this query.
Instructive way: MAX
To circumvent this, one can use MAX in the subquery instead of the correlation.
SELECT timestamp, value, card
FROM my_table
WHERE timestamp = (
SELECT MAX(timestamp)
FROM my_table
);
But without an index, two passes over the data will be necessary, whereas the previous query can find the solution with only one scan. That said, we should not take performance into consideration when designing queries unless necessary, as we can expect optimizers to improve over time. However, this particular kind of query is used quite often.
Show off way: Windowing functions
I don't recommend doing this, but maybe you can make a good impression on your boss or something ;-)
SELECT DISTINCT
first_value(timestamp) OVER w,
first_value(value) OVER w,
first_value(card) OVER w
FROM my_table
WINDOW w AS (ORDER BY timestamp DESC);
Actually this has the virtue of showing that a simple query can be expressed in a wide variety of ways (there are several others I can think of), and that picking one or the other form should be done according to several criteria such as:
portability (Relational/Instructive ways)
efficiency (Relational way)
expressiveness (Easy/Instructive way)
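On the efficiency point: every single-pass variant above benefits from an index on the sort column. A hedged sketch, assuming my_table as in the question (the index name is invented):
-- Lets both ORDER BY timestamp DESC LIMIT 1 and MAX(timestamp) be answered from the index.
CREATE INDEX my_table_timestamp_idx ON my_table ("timestamp");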
If your table has no ID column (such as an auto-incrementing integer) and no timestamp, you can still get the last row of a table with the following query.
select * from <tablename> offset ((select count(*) from <tablename>)-1)
For example, that could allow you to search through an updated flat file, find/confirm where the previous version ended, and copy the remaining lines to your table.
The last inserted record can be queried using this assuming you have the "id" as the primary key:
SELECT timestamp,value,card FROM my_table WHERE id=(select max(id) from my_table)
Assuming every new row inserted will use the highest integer value for the table's id.
If I may offer a tip: create an id column in this table as a serial. The default of this field will be:
nextval('table_name_field_seq'::regclass).
Then you can use a query to fetch the id of the row last inserted in the current session. Using your example:
pg_query($connection, "SELECT currval('table_name_field_seq') AS id;");
I hope this tip helps you.
To get the last row,
Get Last row in the sorted order: In case the table has a column specifying time/primary key,
Using LIMIT clause
SELECT * FROM USERS ORDER BY CREATED_TIME DESC LIMIT 1;
Using FETCH clause - Reference
SELECT * FROM USERS ORDER BY CREATED_TIME DESC FETCH FIRST ROW ONLY;
Get Last row in the rows insertion order: In case the table has no columns specifying time/any unique identifiers
Using CTID system column, where ctid represents the physical location of the row in a table - Reference
SELECT * FROM USERS WHERE CTID = (SELECT MAX(CTID) FROM USERS);
Consider the following table,
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 | //as per created time, this is the last row
3 | C | 1535012279443 |
4 | D | 1535012212311 |
5 | E | 1535012254634 | //as per insertion order, this is the last row
The query 1 and 2 returns,
userid |username | createdtime |
2 | B | 1535042279423 |
while 3 returns,
userid |username | createdtime |
5 | E | 1535012254634 |
Note: when an old row is updated, Postgres removes the old row version and inserts the updated data as a new row in the table. So after an update, the ctid-based query (3) returns the tuple whose data was modified most recently.
Now updating a row, using
UPDATE USERS SET USERNAME = 'Z' WHERE USERID='3'
the table becomes as,
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 |
4 | D | 1535012212311 |
5 | E | 1535012254634 |
3 | Z | 1535012279443 |
Now the query 3 returns,
userid |username | createdtime |
3 | Z | 1535012279443 |
Use the following
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
These are all good answers, but if you want an aggregate function that grabs the last row of the result set generated by an arbitrary query, there's a standard way to do it (taken from the Postgres wiki, but it should work in anything conforming reasonably to the SQL standard of a decade or more ago):
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
It's usually preferable to do select ... limit 1 if you have a reasonable ordering, but this is useful if you need to do this within an aggregate and would prefer to avoid a subquery.
See also this question for a case where this is the natural answer.
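A hedged usage sketch with the columns from the question (the ORDER BY inside the aggregate call decides which row counts as "last"):
-- Last value and card in timestamp order, using the custom aggregate defined above.
SELECT public.last(value ORDER BY timestamp) AS last_value,
public.last(card ORDER BY timestamp) AS last_card
FROM my_table;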
The column you ORDER BY (descending) plays the important role here:
select <COLUMN_NAME1, COLUMN_NAME2> from <TABLE_NAME> ORDER BY <COLUMN_NAME THAT MENTIONS TIME> DESC LIMIT 1;
For example, the table below (user_details) has a column named 'created_at' that holds the timestamp for each row.
SELECT userid, username FROM user_details ORDER BY created_at DESC LIMIT 1;
In Oracle SQL,
select * from (select row_number() over (order by rowid desc) rn, emp.* from emp) where rn=1;
select * from table_name LIMIT 1;

optimize mysql count query

Is there a way to optimize this further, or should I just be satisfied that it takes 9 seconds to count 11M rows?
devuser#xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "desc record_updates"
+--------------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------+------+-----+---------+-------+
| record_id | int(11) | YES | MUL | NULL | |
| date_updated | datetime | YES | MUL | NULL | |
+--------------+----------+------+-----+---------+-------+
devuser#xcmst > date; mysql --user=user --password=pass -D marctoxctransformation -e "select count(*) from record_updates where date_updated > '2009-10-11 15:33:22' "; date
Thu Dec 9 11:13:17 EST 2010
+----------+
| count(*) |
+----------+
| 11772117 |
+----------+
Thu Dec 9 11:13:26 EST 2010
devuser#xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "explain select count(*) from record_updates where date_updated > '2009-10-11 15:33:22' "
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
| 1 | SIMPLE | record_updates | index | idx_marctoxctransformation_record_updates_date_updated | idx_marctoxctransformation_record_updates_date_updated | 9 | NULL | 11772117 | Using where; Using index |
+----+-------------+----------------+-------+--------------------------------------------------------+--------------------------------------------------------+---------+------+----------+--------------------------+
devuser#xcmst > mysql --user=user --password=pass -D marctoxctransformation -e "show keys from record_updates"
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
| record_updates | 1 | idx_marctoxctransformation_record_updates_date_updated | 1 | date_updated | A | 2416 | NULL | NULL | YES | BTREE | |
| record_updates | 1 | idx_marctoxctransformation_record_updates_record_id | 1 | record_id | A | 11772117 | NULL | NULL | YES | BTREE | |
+----------------+------------+--------------------------------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+
If MySQL has to count 11M rows, there really isn't much of a way to speed up a simple count. At least not to get it to sub-1-second speed. You should rethink how you do your count. A few ideas:
Add an auto-increment field to the table. It looks like you don't delete from the table, so you can use simple math to find the record count. Select the min auto-increment number for the earlier date and the max for the later date, and subtract one from the other to get the record count. For example:
SELECT min(incr_id) min_id FROM record_updates WHERE date_updated BETWEEN '2009-10-11 15:33:22' AND '2009-10-12 23:59:59';
SELECT max(incr_id) max_id FROM record_updates WHERE date_updated > DATE_SUB(NOW(), INTERVAL 2 DAY);
Create another table summarizing the record count for each day. Then you can query that table for the total records. There would only be 365 records for each year. If you need finer-grained times, query the summary table for full days and the current table for just the record counts of the start and end days, then add them all together.
If the data isn't changing, which it doesn't seem like it is, then summary tables will be easy to maintain and update. They will significantly speed things up.
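A hedged sketch of such a summary table (all names and dates here are made up for illustration):
-- Hypothetical daily summary table, refreshed by a nightly job.
CREATE TABLE record_updates_daily (
day DATE NOT NULL PRIMARY KEY,
row_count INT UNSIGNED NOT NULL
);
-- Populate (or refresh) one day's count:
INSERT INTO record_updates_daily (day, row_count)
SELECT DATE(date_updated), COUNT(*)
FROM record_updates
WHERE date_updated >= '2010-12-08' AND date_updated < '2010-12-09'
GROUP BY DATE(date_updated)
ON DUPLICATE KEY UPDATE row_count = VALUES(row_count);
-- The big count then becomes a cheap sum over at most a few hundred rows:
SELECT SUM(row_count) FROM record_updates_daily WHERE day > '2009-10-11';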
Since > '2009-10-11 15:33:22' matches most of the records,
I would suggest doing the reverse comparison, < '2009-10-11 15:33:22' (MySQL works less hard and fewer rows are involved):
select
TABLE_ROWS -
(select count(*) from record_updates where date_updated<"2009-10-11 15:33:22")
from information_schema.tables
where table_schema = "marctoxctransformation" and table_name="record_updates"
You can combine this with a programming language (like bash shell)
to make the calculation a bit smarter,
for example by running the execution plan first to work out which comparison involves fewer rows.
From my testing (around 10M records), the normal comparison takes around 3 s,
and this cuts it down to around 0.25 s.
MySQL doesn't "optimize" count(*) queries in InnoDB because of versioning. Every item in the index has to be iterated over and checked to make sure that the version is correct for display (e.g., not an open commit). Since any of your data can be modified across the database, ranged selects and caching won't work. However, you possibly can get by using triggers. There are two methods to this madness.
The first method risks slowing down your transactions, since none of them can truly run in parallel: use AFTER INSERT and AFTER DELETE triggers to increment/decrement a counter table. Second trick: use those insert/delete triggers to call a stored procedure which feeds into an external program that similarly adjusts values up and down, or acts upon a non-transactional table. Beware that in the event of a rollback, this will result in inaccurate numbers.
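A hedged sketch of the trigger-maintained counter (the counter table and trigger names are invented for illustration):
-- Hypothetical single-row counter table kept in step by triggers.
CREATE TABLE record_updates_count (total BIGINT NOT NULL);
INSERT INTO record_updates_count SELECT COUNT(*) FROM record_updates;
CREATE TRIGGER record_updates_ai AFTER INSERT ON record_updates
FOR EACH ROW UPDATE record_updates_count SET total = total + 1;
CREATE TRIGGER record_updates_ad AFTER DELETE ON record_updates
FOR EACH ROW UPDATE record_updates_count SET total = total - 1;
-- Reading the count is then O(1), at the cost of serializing writers on the counter row:
SELECT total FROM record_updates_count;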
If you don't need an exact number, check out this query:
select table_rows from information_schema.tables
where table_name = 'foo';
Example difference: count(*): 1876668, table_rows: 1899004. The table_rows value is an estimate, and you'll get a different number every time even if your database doesn't change.
For my own curiosity: do you need exact numbers that are updated every second? If so, why?
If the historical data is not volatile, create a summary table. There are various approaches, the one to choose will depend on how your table is updated, and how often.
For example, assuming old data is rarely/never changed, but recent data is, create a monthly summary table, populated for the previous month at the end of each month (eg insert January's count at the end of February). Once you have your summary table, you can add up the full months and the part months at the beginning and end of the range:
select count(*)
from record_updates
where date_updated >= '2009-10-11 15:33:22' and date_updated < '2009-11-01';
select count(*)
from record_updates
where date_updated >= '2010-12-01';
select sum(row_count)
from record_updates_summary
where date_updated >= '2009-11-01' and date_updated < '2010-12-01';
I've left it split out above for clarity, but you can do this in one query:
select ( select count(*)
from record_updates
where date_updated >= '2009-10-11 15:33:22'
and date_updated < '2009-11-01' )
+ ( select count(*)
from record_updates
where date_updated >= '2010-12-01' )
+ ( select sum(row_count)
from record_updates_summary
where date_updated >= '2009-11-01'
and date_updated < '2010-12-01' );
You can adapt this approach to make the summary table based on whole weeks or whole days.
You should add an index on the 'date_updated' field.
Another thing you can do, if you don't mind changing the structure of the table, is to store the timestamp as an 'int' (a Unix timestamp) instead of 'datetime' format; it might be even faster.
If you decide to do so, the query will be
select count(date_updated) from record_updates where date_updated > 1291911807
There is no primary key in your table. It's possible that in this case it always scans the whole table. Having a primary key is never a bad idea.
If you need to return the total table's row count, then there is an alternative to the
SELECT COUNT(*) statement which you can use. SELECT COUNT(*) makes a full table scan to return the total table's row count, so it can take a long time. You can use the sysindexes system table instead in this case. There is a ROWS column in the sysindexes table. This column contains the total row count for each table in your database. So, you can use the following select statement instead of SELECT COUNT(*):
SELECT rows FROM sysindexes WHERE id = OBJECT_ID('table_name') AND indid < 2
This can improve the speed of your query.
EDIT: I have discovered that my answer would be correct if you were using a SQL Server database. MySQL databases do not have a sysindexes table.
It depends on a few things, but something like this may work for you.
I'm assuming this count never changes, as it is in the past, so the result can be cached somehow.
count1 = "select count(*) from record_updates where date_updated <= '2009-10-11 15:33:22'"
count2 gives you the total count of records in the table; this is an approximate value for InnoDB tables, so BEWARE, it depends on the engine:
count2 = "select table_rows from information_schema.`TABLES` where table_schema = 'marctoxctransformation' and TABLE_NAME = 'record_updates'"
Your answer:
result = count2 - count1
There are a few details I'd like you to clarify (would put into comments on the q, but it is actually easier to remove from here when you update your question).
What is the intended usage of data, insert once and get the counts many times, or your inserts and selects are approx on par?
Do you care about insert/update performance?
What is the engine used for the table? (heck you can do SHOW CREATE TABLE ...)
Do you need the counts to be exact or approximately exact (like 0.1% correct)
Can you use triggers, summary tables, change schema, change RDBMS, etc.. or just add/remove indexes?
Maybe you should also explain what this table is supposed to be? You have record_id with a cardinality that matches the number of rows, so is it a PK or an FK, or what is it? Also, the cardinality of date_updated suggests (though not necessarily correctly) that it has the same value for ~5,000 records on average, so what is that? It is ok to ask a SQL tuning question without context, but it is also nice to have some context, especially if redesigning is an option.
In the meantime, I suggest you get this tuning script and check the recommendations it gives you (it's just a general tuning script, but it will inspect your data and stats).
Instead of doing count(*), try doing count(1), like this:
select count(1) from record_updates where date_updated > '2009-10-11 15:33:22'
I took a DB2 class before, and I remember the instructor mentioning doing a count(1) when we just want to count the number of rows in the table regardless of the data, because it is technically faster than count(*). Let me know if it makes a difference.
NOTE: Here's a link you might be interested to read: http://www.mysqlperformanceblog.com/2007/04/10/count-vs-countcol/