Select multiple 'latest' rows from single table of snapshots, one per story snapshotted - sql

I have a table with multiple snapshots of analytics data for multiple stories. The data is stored along with the timestamp it was taken, and the story_id the data is referring to.
id: integer auto_increment
story_id: string
timestamp: datetime
value: number
and I need to pull out the latest value for each story (i.e. each unique story_id) in a list of ids.
I've written a query, but it scales catastrophically.
SELECT story_id, value
FROM table
WHERE story_id IN ('1','2','3')
  AND id = (SELECT id
            FROM table inner
            WHERE inner.story_id = table.story_id
            ORDER BY timestamp DESC
            LIMIT 1)
What's a more efficient way to make this query?
Nice to know:
story_id has to be a string, it's from an external data source
story_id and timestamp already have indexes
there are 2.9M rows and counting...

This is a good case for an ORDER BY-controlled DISTINCT ON.
select DISTINCT ON (story_id)
story_id, "value"
from the_table
where story_id in ('1','2','3')
ORDER BY story_id, "timestamp" desc;
An index on (story_id, timestamp), as @wildplasser suggests, will make it scale well.
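For reference, such an index might look like this (a sketch; a plain ascending index also works, since Postgres can scan it backwards):
CREATE INDEX the_table_story_ts_idx ON the_table (story_id, "timestamp" DESC);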

You can try it like this - does it improve your performance?
SELECT
  story_id,
  value
FROM (
  SELECT
    story_id,
    timestamp,
    MAX(timestamp) OVER (PARTITION BY story_id) AS max_timestamp_per_id,
    value
  FROM the_table
) sub  -- the derived table needs an alias
WHERE timestamp = max_timestamp_per_id;

An equivalent rewrite of your query, using an analytic function to get the last row per story_id in timestamp order, is as below:
with last_snap as (
select
story_id, value,
row_number() over (partition by story_id order by timestamp desc) as rn
from tab
)
select story_id, value
from last_snap
where rn = 1;
This may work better than your subquery solution, but it will still not scale for data beyond a few 100K rows (more precisely, it scales the same way as sorting your full data).
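If you only ever need the stories from the question's IN list, filtering inside the CTE keeps the sort small (a sketch reusing the question's literal ids):
with last_snap as (
  select
    story_id, value,
    row_number() over (partition by story_id order by timestamp desc) as rn
  from tab
  where story_id in ('1','2','3')  -- sort only the requested stories
)
select story_id, value
from last_snap
where rn = 1;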
The correct setup for a table with multiple snapshots is a partitioned table with one partition per snapshot. The query then selects only the data in the last snapshot (snap_id = nnn), skipping all older data.
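A minimal sketch of that setup in Postgres 10+, assuming a hypothetical snap_id column identifying each snapshot batch (all names here are illustrative):
CREATE TABLE snapshots
( snap_id   integer   NOT NULL
, story_id  text      NOT NULL
, taken_at  timestamp NOT NULL
, value     integer   NOT NULL
) PARTITION BY LIST (snap_id);

CREATE TABLE snapshots_41 PARTITION OF snapshots FOR VALUES IN (41);
CREATE TABLE snapshots_42 PARTITION OF snapshots FOR VALUES IN (42);

-- partition pruning skips every older snapshot
SELECT story_id, value FROM snapshots WHERE snap_id = 42;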

Use a proper table definition, including a natural key on (story_id, ztimestamp).
BTW: timestamp is a data type, so better not to use it as a column name.
BTW2: you probably want story_id to be an integer field instead of a text field, and since it is a key field you may also want it to be NOT NULL.
-- DDL
DROP TABLE story CASCADE;
CREATE TABLE story
( id serial not null primary key
, story_id text NOT NULL
, ztimestamp timestamp not null
, zvalue integer not null default 0
, UNIQUE (story_id, ztimestamp) -- the natural key
);
\d+ story
EXPLAIN
SELECT * FROM story st
WHERE story_id IN('1','2','3')
AND NOT EXISTS(
SELECT *
FROM story nx
WHERE nx.story_id = st.story_id
AND nx.ztimestamp > st.ztimestamp
);
DROP TABLE
CREATE TABLE
Table "tmp.story"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------+-----------------------------+-----------+----------+-----------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('story_id_seq'::regclass) | plain | |
story_id | text | | not null | | extended | |
ztimestamp | timestamp without time zone | | not null | | plain | |
zvalue | integer | | not null | 0 | plain | |
Indexes:
"story_pkey" PRIMARY KEY, btree (id)
"story_story_id_ztimestamp_key" UNIQUE CONSTRAINT, btree (story_id, ztimestamp)
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Nested Loop Anti Join (cost=1.83..18.97 rows=13 width=48)
-> Bitmap Heap Scan on story st (cost=1.67..10.94 rows=16 width=48)
Recheck Cond: (story_id = ANY ('{1,2,3}'::text[]))
-> Bitmap Index Scan on story_story_id_ztimestamp_key (cost=0.00..1.67 rows=16 width=0)
Index Cond: (story_id = ANY ('{1,2,3}'::text[]))
-> Index Only Scan using story_story_id_ztimestamp_key on story nx (cost=0.15..0.95 rows=2 width=40)
Index Cond: ((story_id = st.story_id) AND (ztimestamp > st.ztimestamp))
(7 rows)


Using WITH + DELETE clause in a single query in postgresql

I have the following table structure, for a table named listens with a PRIMARY KEY on (uid, timestamp):
Column | Type | Modifiers
----------------+-----------------------------+------------------------------------------------------
id | integer | not null default nextval('listens_id_seq'::regclass)
uid | character varying | not null
date | timestamp without time zone |
timestamp | integer | not null
artist_msid | uuid |
album_msid | uuid |
recording_msid | uuid |
json | character varying |
I need to remove all the entries for a particular user (uid) which are older than the max timestamp minus a delta. Say max is 123456789 (in seconds) and delta is 100000; then I need to delete all records older than max - 100000.
I have managed to create a query that works when the table contains a single user, but I am unable to formulate it to work for every user in the database.
WITH max_table as (
SELECT max(timestamp) - 10000 as max
FROM listens
GROUP BY uid)
DELETE FROM listens
WHERE timestamp < (SELECT max FROM max_table);
Any solutions?
I think all you need is to make this a correlated subquery:
WITH max_table as (
SELECT uid, max(timestamp) - 10000 as mx
FROM listens
GROUP BY uid
)
DELETE FROM listens
WHERE timestamp < (SELECT mx
FROM max_table
where max_table.uid = listens.uid);
Btw: timestamp is a horrible name for a column, especially one that doesn't contain a timestamp value. One reason is that it's also a keyword, but more importantly it doesn't document what that column contains. A registration timestamp? An expiration timestamp? A last active timestamp?
Alternatively, you could avoid the MAX() by using an EXISTS()
DELETE FROM listens d
WHERE EXISTS (
SELECT * FROM listens x
WHERE x.uid = d.uid
AND x.timestamp >= d.timestamp + 10000
);
BTW: timestamp is an ugly name for a column, since it is also a typename.

How do I select the latest rows for all users?

I have a table similar to the following:
=> \d table
Table "public.table"
Column | Type | Modifiers
-------------+-----------------------------+-------------------------------
id | integer | not null default nextval( ...
user | bigint | not null
timestamp | timestamp without time zone | not null
field1 | double precision |
As you can see, it contains many field1 values over time for all users. Is there a way to efficiently get the latest field1 value for all users in one query (i.e. one row per user)? I'm thinking I might have to use some combination of group by and select first.
Simplest with DISTINCT ON in Postgres:
SELECT DISTINCT ON ("user")
       "user", timestamp, field1
FROM   tbl
ORDER  BY "user", timestamp DESC;  -- one row per user: the one with the latest timestamp
More details:
https://dba.stackexchange.com/questions/49540/how-do-i-efficiently-get-the-most-recent-corresponding-row/49555#49555
Select first row in each GROUP BY group?
Aside: Don't use timestamp as a column name. It's a reserved word in SQL and a basic type name in Postgres.
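To make this fast for many rows per user, the linked answers suggest a multicolumn index matching the sort (a sketch; the index name is assumed, and "user" needs quoting as a reserved word):
CREATE INDEX tbl_user_ts_idx ON tbl ("user", "timestamp" DESC);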

Why is Oracle SQL Optimizer ignoring index predicate for this view?

I'm trying to optimize a set of stored procs which are going against many tables including this view. The view is as such:
We have TBL_A (id, hist_date, hist_type, other_columns) with two types of rows: hist_type 'O' vs. hist_type 'N'. The view self joins table A to itself and transposes the N rows against the corresponding O rows. If no N row exists for the O row, the O row values are repeated. Like so:
CREATE OR REPLACE FORCE VIEW V_A (id, hist_date, hist_type, other_columns_o, other_columns_n) AS
select
o.id, o.hist_date, o.hist_type,
o.other_columns as other_columns_o,
case when n.id is not null then n.other_columns else o.other_columns end as other_columns_n
from
TBL_A o left outer join TBL_A n
on o.id=n.id and o.hist_date=n.hist_date and n.hist_type = 'N'
where o.hist_type = 'O';
TBL_A has a unique index on: (id, hist_date, hist_type). It also has a unique index on: (hist_date, id, hist_type) and this is the primary key.
The following query is at issue (in a stored proc, with x declared as TYPE_TABLE_OF_NUMBER):
select b.id BULK COLLECT into x from TBL_B b where b.parent_id = input_id;
select v.id from v_a v
where v.id in (select column_value from table(x))
and v.hist_date = input_date
and v.status_new = 'CLOSED';
This query ignores the index on the id column when accessing TBL_A and instead does a range scan using the date to pick up all the rows for the date. It then filters that set using the values from the array. However, if I simply give the list of ids as a list of numbers, the optimizer uses the index just fine:
select v.id from v_a v
where v.id in (123, 234, 345, 456, 567, 678, 789)
and v.hist_date = input_date
and v.status_new = 'CLOSED';
The problem also doesn't exist when going against TBL_A directly (and I have a workaround that does that, but it's not ideal). Is there a way to get the optimizer to first retrieve the array values and use them as predicates when accessing the table? Or a good way to restructure the view to achieve this?
Oracle does not use the index because it assumes select column_value from table(x) returns 8168 rows.
Indexes are faster for retrieving small amounts of data. At some point it's faster to scan the whole table than repeatedly walk the index tree.
Estimating the cardinality of a regular SQL statement is difficult enough. Creating an accurate estimate for procedural code is almost impossible. But I don't know where they came up with 8168. Table functions are normally used with pipelined functions in data warehouses, where a sorta-large number makes sense.
Dynamic sampling can generate a more accurate estimate and likely generate a plan that will use the index.
Here's an example of a bad cardinality estimate:
create or replace type type_table_of_number as table of number;
explain plan for
select * from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8168 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 8168 | 00:00:01 |
-------------------------------------------------------------------------
Here's how to fix it:
explain plan for select /*+ dynamic_sampling(2) */ *
from table(type_table_of_number(1,2,3,4,5,6,7));
select * from table(dbms_xplan.display(format => '-cost -bytes'));
Plan hash value: 1748000095
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7 | 00:00:01 |
| 1 | COLLECTION ITERATOR CONSTRUCTOR FETCH| | 7 | 00:00:01 |
-------------------------------------------------------------------------
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
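Applied to the problem query, the same hint would look something like this (a sketch; whether the index then gets picked still depends on your statistics and Oracle version):
select /*+ dynamic_sampling(2) */ v.id
from v_a v
where v.id in (select column_value from table(x))
  and v.hist_date = input_date
  and v.status_new = 'CLOSED';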

Get last record of a table in Postgres

I'm using Postgres and cannot manage to get the last record of my table:
my_query = client.query("SELECT timestamp,value,card from my_table");
How can I do that, knowing that timestamp is a unique identifier of the record?
If under "last record" you mean the record which has the latest timestamp value, then try this:
my_query = client.query("
SELECT TIMESTAMP,
value,
card
FROM my_table
ORDER BY TIMESTAMP DESC
LIMIT 1
");
You can use:
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
assuming you also want to sort by timestamp?
Easy way: ORDER BY in conjunction with LIMIT
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1;
However, LIMIT is not standard, and as stated by Wikipedia, "The SQL standard's core functionality does not explicitly define a default sort order for Nulls." Finally, only one row is returned when several records share the maximum timestamp.
Relational way:
The typical way of doing this is to check that no row has a higher timestamp than any row we retrieve.
SELECT timestamp, value, card
FROM my_table t1
WHERE NOT EXISTS (
SELECT *
FROM my_table t2
WHERE t2.timestamp > t1.timestamp
);
It is my favorite solution, and the one I tend to use. The drawback is that our intent is not immediately clear when glancing at this query.
Instructive way: MAX
To circumvent this, one can use MAX in the subquery instead of the correlation.
SELECT timestamp, value, card
FROM my_table
WHERE timestamp = (
SELECT MAX(timestamp)
FROM my_table
);
But without an index, two passes on the data will be necessary, whereas the previous query can find the solution with only one scan. That said, we should not take performance into consideration when designing queries unless necessary, as we can expect optimizers to improve over time. However, this particular kind of query is quite common.
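For what it's worth, the index that lets either form stop after a single probe is just a plain one on the column (a sketch):
CREATE INDEX my_table_ts_idx ON my_table ("timestamp");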
Show off way: Windowing functions
I don't recommend doing this, but maybe you can make a good impression on your boss or something ;-)
SELECT DISTINCT
first_value(timestamp) OVER w,
first_value(value) OVER w,
first_value(card) OVER w
FROM my_table
WINDOW w AS (ORDER BY timestamp DESC);
Actually this has the virtue of showing that a simple query can be expressed in a wide variety of ways (there are several others I can think of), and that picking one or the other form should be done according to several criteria such as:
portability (Relational/Instructive ways)
efficiency (Relational way)
expressiveness (Easy/Instructive way)
If your table has no id such as an integer auto-increment, and no timestamp, you can still get the last row of a table with the following query.
select * from <tablename> offset ((select count(*) from <tablename>)-1)
For example, that could allow you to search through an updated flat file, find/confirm where the previous version ended, and copy the remaining lines to your table.
The last inserted record can be queried using this, assuming you have "id" as the primary key:
SELECT timestamp,value,card FROM my_table WHERE id=(select max(id) from my_table)
Assuming every new row inserted will use the highest integer value for the table's id.
If you accept a tip, create a serial id in this table. The default of this field will be:
nextval('table_name_field_seq'::regclass).
Then you can use a query to fetch the last inserted id in the current session. Using your example:
pg_query($connection, "SELECT currval('table_name_field_seq') AS id;");
I hope this tip helps you.
To get the last row:
Get the last row in sorted order - in case the table has a column specifying time/primary key:
1. Using the LIMIT clause:
SELECT * FROM USERS ORDER BY CREATED_TIME DESC LIMIT 1;
2. Using the FETCH clause - Reference:
SELECT * FROM USERS ORDER BY CREATED_TIME DESC FETCH FIRST ROW ONLY;
Get the last row in insertion order - in case the table has no column specifying time or any unique identifier:
3. Using the CTID system column, where ctid represents the physical location of the row in a table - Reference:
SELECT * FROM USERS WHERE CTID = (SELECT MAX(CTID) FROM USERS);
Consider the following table,
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 | //as per created time, this is the last row
3 | C | 1535012279443 |
4 | D | 1535012212311 |
5 | E | 1535012254634 | //as per insertion order, this is the last row
Queries 1 and 2 return:
userid |username | createdtime |
2 | B | 1535042279423 |
while query 3 returns:
userid |username | createdtime |
5 | E | 1535012254634 |
Note: on updating an old row, Postgres removes the old row version and inserts the updated data as a new row in the table. So the CTID query returns the tuple on which the latest data modification was done.
Now updating a row, using
UPDATE USERS SET USERNAME = 'Z' WHERE USERID='3'
the table becomes:
userid |username | createdtime |
1 | A | 1535012279455 |
2 | B | 1535042279423 |
4 | D | 1535012212311 |
5 | E | 1535012254634 |
3 | Z | 1535012279443 |
Now query 3 returns:
userid |username | createdtime |
3 | Z | 1535012279443 |
Use the following
SELECT timestamp, value, card
FROM my_table
ORDER BY timestamp DESC
LIMIT 1
These are all good answers, but if you want an aggregate function that grabs the last row in the result set generated by an arbitrary query, there's a standard way to do this (taken from the Postgres wiki, but it should work in anything conforming reasonably to the SQL standard as of a decade or more ago):
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
It's usually preferable to do select ... limit 1 if you have a reasonable ordering, but this is useful if you need to do this within an aggregate and would prefer to avoid a subquery.
See also this question for a case where this is the natural answer.
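A hypothetical usage, assuming a measurements table with uid and ts columns; the ORDER BY inside the aggregate call is what makes "last" well-defined:
SELECT uid, public.last(value ORDER BY ts) AS latest_value
FROM measurements
GROUP BY uid;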
The column name plays an important role in the descending order:
select <COLUMN_NAME1, COLUMN_NAME2> from <TABLENAME> ORDER BY <COLUMN_NAME_THAT_MENTIONS_TIME> DESC LIMIT 1;
For example: the below-mentioned table (user_details) contains the column 'created_at', which holds the timestamp for each row.
SELECT userid, username FROM user_details ORDER BY created_at DESC LIMIT 1;
In Oracle SQL,
select * from (select row_number() over (order by rowid desc) rn, emp.* from emp) where rn=1;
select * from my_table order by timestamp desc LIMIT 1;

Speeding up this big JOIN

EDIT: there was a mistake in the following question that explains the observations. I could delete the question but this might still be useful to someone. The mistake was that the actual query running on the server was SELECT * FROM t (which was silly) when I thought it was running SELECT t.* FROM t (which makes all the difference). See tobyobrian's answer and the comments to it.
I've a too-slow query in a situation with a schema as follows. Table t has data rows indexed by t_id. t adjoins tables x and y via junction tables t_x and t_y, each of which contains only the foreign keys required for the JOINs:
CREATE TABLE t (
t_id INT NOT NULL PRIMARY KEY,
data columns...
);
CREATE TABLE t_x (
t_id INT NOT NULL,
x_id INT NOT NULL,
PRIMARY KEY (t_id, x_id),
KEY (x_id)
);
CREATE TABLE t_y (
t_id INT NOT NULL,
y_id INT NOT NULL,
PRIMARY KEY (t_id, y_id),
KEY (y_id)
);
I need to export the stray rows in t, i.e. those not referenced in either junction table.
SELECT t.* FROM t
LEFT JOIN t_x ON t_x.t_id=t.t_id
LEFT JOIN t_y ON t_y.t_id=t.t_id
WHERE t_x.t_id IS NULL OR t_y.t_id IS NULL
INTO OUTFILE ...;
t has 21 M rows while t_x and t_y both have about 25 M rows. So this is naturally going to be a slow query.
I'm using MyISAM so I thought I'd try to speed it up by preloading the t_x and t_y indexes. The combined size of t_x.MYI and t_y.MYI was about 1.2 M bytes so I created a dedicated key buffer for them, assigned their PRIMARY keys to the dedicated buffer and LOAD INDEX INTO CACHE'ed them.
But as I watch the query in operation, mysqld is using about 1% CPU, the average system IO pending queue length is around 5, and mysqld's average seek size is in the 250 k range. Moreover, nearly all the IO is mysqld reading from t_x.MYI and t_x.MYD.
I don't understand:
Why is mysqld reading the .MYD files at all?
Why isn't mysqld using the preloaded t_x and t_y indexes?
Could it have something to do with the t_x and t_y PRIMARY keys being over two columns?
EDIT: The query explained:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 20980052 | |
| 1 | SIMPLE | t_x | ref | PRIMARY | PRIMARY | 4 | db.t.t_id | 235849 | Using index |
| 1 | SIMPLE | t_y | ref | PRIMARY | PRIMARY | 4 | db.t.t_id | 207947 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
Use NOT EXISTS - this will be the fastest - much better than joins or using NOT IN in this situation.
SELECT a.* FROM t a
WHERE NOT EXISTS (SELECT 1 FROM t_x b
                  WHERE b.t_id = a.t_id)
   OR NOT EXISTS (SELECT 1 FROM t_y c
                  WHERE c.t_id = a.t_id);
I can answer part 1 of your question, and I may or may not be able to answer part two if you post the output of EXPLAIN:
In order to select t.*, it needs to look in the .MYD file - only the primary key is in the index; to fetch the data columns you requested it needs the rest of the columns.
That is, your query is quite probably filtering the results very quickly; it's just struggling to copy all the data you wanted.
Also note that you will probably have duplicates in your output - if one row has no refs in t_x but 3 in t_y, you will have the same t.* row repeated 3 times. Given we think the WHERE clause is sufficiently efficient, and much time is spent on reading the actual data, this is quite possibly the source of your problems. Try changing to SELECT DISTINCT and see if that helps your efficiency, as in the sketch below.
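A sketch of that change, keeping the rest of the query intact:
SELECT DISTINCT t.*
FROM t
LEFT JOIN t_x ON t_x.t_id = t.t_id
LEFT JOIN t_y ON t_y.t_id = t.t_id
WHERE t_x.t_id IS NULL OR t_y.t_id IS NULL;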
This may be a bit more efficient:
SELECT *
FROM t
WHERE t.t_id NOT IN (
SELECT DISTINCT t_id
FROM t_x
UNION
SELECT DISTINCT t_id
FROM t_y
);
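One caveat worth noting: NOT IN against the UNION returns rows referenced in neither junction table, whereas the original LEFT JOIN ... OR version returns rows missing from at least one of them. If the OR semantics are what you want with this approach, something like this should match (a sketch):
SELECT *
FROM t
WHERE t.t_id NOT IN (SELECT t_id FROM t_x)
   OR t.t_id NOT IN (SELECT t_id FROM t_y);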