TimescaleDB: index scan performance (SQL)

I have the following tables in TimescaleDB.
Fact table:
v1.fact_program_event_audience_v1 (hypertable):
create table v1.fact_program_event_audience_v1
(
event_id uuid,
content_id text,
date_of_transmission date not null,
station_code integer,
start_time timestamp,
end_time timestamp,
duration_is_minutes integer,
program_type smallint,
program_name text,
...
audience_category_number integer,
live_viewing integer,
consolidated_viewing integer,
live_tv_viewing_excluding_playback integer,
consolidated_total_tv_viewing integer
);
create index fact_program_event_audience_v1_date_of_transmission_idx
on v1.fact_program_event_audience_v1 (date_of_transmission desc);
create index program_event_audience_v1_date_of_transmission_event_id_idx
on v1.fact_program_event_audience_v1 (date_of_transmission, event_id);
Dimension tables:
v1.dimension_program_content (PostgreSQL table):
create table v1.dimension_program_content (
content_id text not null,
content_name text not null,
episode_number numeric not null,
episode_name text not null
);
create index dimension_program_content_content_id_idx on dimension_program_content using btree (content_id);
v1.dimension_audience_category (PostgreSQL table):
create table v1.dimension_audience_category (
category_number numeric not null,
category_description text not null,
target_size_in_hundreds numeric not null
);
create index dimension_audience_category_category_number on dimension_audience_category using btree (category_number);
v1.dimension_db2_station (PostgreSQL table):
create table v1.dimension_db2_station (
stationcode numeric,
station15charname text,
stationname text
);
create index dimension_db2_station_station_code on dimension_db2_station using btree (stationcode);
The hypertable chunk interval is '1 day'.
Each day approximately 2,000,000-5,000,000 new rows arrive in the fact table.
Each dimension table has at most 1,000 rows; dimension_audience_category has only 60 rows (this will be important later).
I have the following query:
SELECT
event.start_time,
event.event_id,
event.db2_station_code,
event.end_time,
event.panel_code,
event.duration_is_minutes,
event.date_of_transmission,
event.area_flags,
event.barb_content_id,
station.stationname,
event."array",
content.episode_name,
content.episode_number,
content.content_name
FROM
(
SELECT
event.date_of_transmission,
event.event_id,
event.start_time,
event.end_time,
event.db2_station_code,
event.panel_code,
event.duration_is_minutes,
event.area_flags,
event.barb_content_id,
json_agg(json_build_object('d', category.category_description)) as "array"
FROM v1.fact_program_event_audience_v1 event
JOIN v1.dimension_audience_category category ON event.audience_category_number = category.category_number
WHERE
date_of_transmission = '2022-09-19T00:00:00Z'
GROUP BY
event.date_of_transmission,
event.event_id,
event.start_time,
event.end_time,
event.db2_station_code,
event.panel_code,
event.duration_is_minutes,
event.area_flags,
event.barb_content_id
ORDER BY
date_of_transmission,
event.event_id
LIMIT 10000
) as event
JOIN v1.dimension_db2_station station ON station.stationcode = event.db2_station_code
JOIN v1.dimension_program_content content ON content.content_id = event.barb_content_id;
This query takes approximately 10 seconds on a cold cache (first run). In the query plan I can see why: I have a separate chunk per day, and there is no limit applied inside the index scan on such a chunk, even though the data is pre-ordered thanks to my index on (date_of_transmission, event_id). Why does it take so much time just to select the first 10k rows inside a single chunk?
This is the execution plan: https://pastebin.com/4VUZB0Fd
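For reference, buffer statistics like those in the linked plan come from EXPLAIN with the ANALYZE and BUFFERS options. A stripped-down probe of just the index-scan stage in question could look like this (my sketch, not taken from the original post):
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_of_transmission, event_id
FROM v1.fact_program_event_audience_v1
WHERE date_of_transmission = '2022-09-19'
ORDER BY date_of_transmission, event_id
LIMIT 10000;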
I can see the buffers at the chunk Index Scan stage:
Shared hit: 4.3 GB
Shared read: 491 MB
The index does not seem ideal: my full chunk size is only 550 MB for the whole day (2022-09-19).
I also tried adding AND event_id > <smallest event_id value for that day> to the WHERE clause; it doesn't seem to help at all.
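That is, the WHERE clause looked roughly like this (the UUID literal below is a placeholder, not a real value):
WHERE
date_of_transmission = '2022-09-19T00:00:00Z'
AND event_id > '00000000-0000-0000-0000-000000000000'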
I'm using Timescale Cloud (16 GB RAM, 4 cores).
Thanks.

Related

mariadb not using all fields of composite index

MariaDB is not fully using a composite index. A fast select and a slow select both return the same data, but EXPLAIN shows that the slow select uses only the entity_id part of ix_test_relation and does not use its stamp part.
I tried many variants (inner join, WITH, FROM) but couldn't make MariaDB use both fields of the index together with a recursive query. I understand that I need to tell MariaDB to materialize the recursive query somehow.
Please help me optimize the slow select, which uses a recursive query, to reach a speed similar to the fast select.
Some details about the task... I need to query user activity. One user activity record may relate to multiple entities, and entities are hierarchical. I need to query user activity for some parent entity and all of its children within a specified stamp range. Stamp is simplified from TIMESTAMP to BIGINT for demonstration purposes. There can be a lot (1 million) of entities, and each entity may relate to a lot (1 million) of user activity entries. The entity hierarchy is expected to be about 10 levels deep. I assume the stamp range in use reduces the number of user activity records to 10-100. I denormalized the schema and copied stamp from test_entry to test_relation so that I could include it in the test_relation index.
I use 10.4.11-MariaDB-1:10.4.11+maria~bionic.
I can upgrade or patch or whatever mariadb if needed, I have full control over building docker image.
Schema:
CREATE TABLE test_entity(
id BIGINT NOT NULL,
parent_id BIGINT NULL,
CONSTRAINT pk_test_entity PRIMARY KEY (id),
CONSTRAINT fk_test_entity_pid FOREIGN KEY (parent_id) REFERENCES test_entity(id)
);
CREATE TABLE test_entry(
id BIGINT NOT NULL,
name VARCHAR(100) NOT NULL,
stamp BIGINT NOT NULL,
CONSTRAINT pk_test_entry PRIMARY KEY (id)
);
CREATE TABLE test_relation(
entry_id BIGINT NOT NULL,
entity_id BIGINT NOT NULL,
stamp BIGINT NOT NULL,
CONSTRAINT pk_test_relation PRIMARY KEY (entry_id, entity_id),
CONSTRAINT fk_test_relation_erid FOREIGN KEY (entry_id) REFERENCES test_entry(id),
CONSTRAINT fk_test_relation_enid FOREIGN KEY (entity_id) REFERENCES test_entity(id)
);
CREATE INDEX ix_test_relation ON test_relation(entity_id, stamp);
CREATE SEQUENCE sq_test_entry;
Test data:
CREATE OR REPLACE PROCEDURE test_insert()
BEGIN
DECLARE v_entry_id BIGINT;
DECLARE v_parent_entity_id BIGINT;
DECLARE v_child_entity_id BIGINT;
FOR i IN 1..1000 DO
SET v_parent_entity_id = i * 2;
SET v_child_entity_id = i * 2 + 1;
INSERT INTO test_entity(id, parent_id)
VALUES(v_parent_entity_id, NULL);
INSERT INTO test_entity(id, parent_id)
VALUES(v_child_entity_id, v_parent_entity_id);
FOR j IN 1..1000000 DO
SELECT NEXT VALUE FOR sq_test_entry
INTO v_entry_id;
INSERT INTO test_entry(id, name, stamp)
VALUES(v_entry_id, CONCAT('entry ', v_entry_id), j);
INSERT INTO test_relation(entry_id, entity_id, stamp)
VALUES(v_entry_id, v_parent_entity_id, j);
INSERT INTO test_relation(entry_id, entity_id, stamp)
VALUES(v_entry_id, v_child_entity_id, j);
END FOR;
END FOR;
END;
CALL test_insert;
Slow select (> 100ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (
WITH RECURSIVE recursive_child AS (
SELECT id
FROM test_entity
WHERE id IN (2, 4)
UNION ALL
SELECT C.id
FROM test_entity C
INNER JOIN recursive_child P
ON P.id = C.parent_id
)
SELECT id
FROM recursive_child
)
AND TR.stamp BETWEEN 6 AND 8
Fast select (1-2ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (2,3,4,5)
AND TR.stamp BETWEEN 6 AND 8
UPDATE 1
I can demonstrate the problem with even shorter example.
Explicitly store required entity_id records in temporary table
CREATE OR REPLACE TEMPORARY TABLE tbl
WITH RECURSIVE recursive_child AS (
SELECT id
FROM test_entity
WHERE id IN (2, 4)
UNION ALL
SELECT C.id
FROM test_entity C
INNER JOIN recursive_child P
ON P.id = C.parent_id
)
SELECT id
FROM recursive_child
Try to run the select using the temporary table (below). The select is still slow, and the only difference from the fast query now is that the IN clause queries a table instead of inline constants.
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
For your queries (both of them) it looks to me like you should, as you mentioned, flip the column order on your compound index:
CREATE INDEX ix_test_relation ON test_relation(stamp, entity_id);
Why?
Your queries have a range filter, TR.stamp BETWEEN 6 AND 8, on that column. For a range filter to use an index range scan (whether on a TIMESTAMP or a BIGINT column), the column being filtered must be first in a multicolumn index.
You also want a sargable filter, that is, something like this:
TR.stamp >= CURDATE() - INTERVAL 7 DAY
AND TR.stamp < CURDATE()
in place of
DATE(TR.stamp) BETWEEN DATE(NOW() - INTERVAL 7 DAY) AND DATE(NOW())
That is, don't put a function on the column you're scanning in your WHERE clause.
With a structured query like your first one, the query planner turns it into several queries. You can see this with ANALYZE FORMAT=JSON. The planner may choose different indexes and/or different chunks of indexes for each of those subqueries.
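For example (a sketch using the slow query from the question):
ANALYZE FORMAT=JSON
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8;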
And, a word to the wise: don't get too wrapped around the axle trying to outguess the query planner built into the DBMS. It's an extraordinarily complex and highly wrought piece of software, created by decades of programming work by world-class experts in optimization. Our job as MariaDB / MySQL users is to find the right indexes.
The order of columns in a composite index matters. (O.Jones explains it nicely -- using SQL that has been removed from the Question?!)
I would rewrite
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
as
SELECT TR.entry_id
FROM tbl
JOIN test_relation TR ON tbl.id = TR.entity_id
WHERE TR.stamp BETWEEN 6 AND 8
or
SELECT entry_id
FROM test_relation TR
WHERE TR.stamp BETWEEN 6 AND 8
AND EXISTS ( SELECT 1 FROM tbl
WHERE tbl.id = TR.entity_id )
And have these in either case:
TR: INDEX(stamp, entity_id, entry_id) -- With `stamp` first
tbl: INDEX(id) -- maybe
Since tbl is a freshly built TEMPORARY TABLE, and it seems that only 3 rows need checking, it may not be worth adding INDEX(id).
Also needed:
test_entity: INDEX(parent_id, id)
Assuming that test_relation is a many:many mapping table, it is likely that you will also need (though not necessarily for the current query):
INDEX(entity_id, entry_id)
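Spelled out as DDL, those suggestions would look roughly like this (the index names are illustrative, not from the original answer):
CREATE INDEX ix_test_relation_stamp ON test_relation(stamp, entity_id, entry_id);
CREATE INDEX ix_test_entity_parent ON test_entity(parent_id, id);
CREATE INDEX ix_test_relation_entity ON test_relation(entity_id, entry_id);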

Postgres: How to find nearest tsrange from timestamp outside of ranges?

I am modeling (in Postgres 9.6.1 / postGIS 2.3.1) a booking system for local services provided by suppliers:
create table supplier (
id serial primary key,
name text not null check (char_length(name) < 280),
type service_type,
duration interval,
...
geo_position geography(POINT,4326)
...
);
Each supplier keeps a calendar with time slots when he/she is available to be booked:
create table timeslot (
id serial primary key,
supplier_id integer not null references supplier(id),
slot tstzrange not null,
constraint supplier_overlapping_timeslot_not_allowed
exclude using gist (supplier_id with =, slot with &&)
);
For when a client wants to know which nearby suppliers are available to book at a certain time, I create a view and a function:
create view supplier_slots as
select
supplier.name, supplier.type, supplier.geo_position, supplier.duration, ...
timeslot.slot
from
supplier, timeslot
where
supplier.id = timeslot.supplier_id;
create function find_suppliers(wantedType service_type, near_latitude text, near_longitude text, at_time timestamptz)
returns setof supplier_slots as $$
declare
nearpoint geography;
begin
nearpoint := ST_GeographyFromText('SRID=4326;POINT(' || near_latitude || ' ' || near_longitude || ')');
return query
select * from supplier_slots
where type = wantedType
and tstzrange(at_time, at_time + duration) <@ slot
order by ST_Distance( nearpoint, geo_position )
limit 100;
end;
$$ language plpgsql;
All this works really well.
Now, for the suppliers that did NOT have a bookable time slot at the requested time, I would like to find their closest available timeslots, before and after the requested at_time, also sorted by distance.
This has my mind spinning a little bit and I can't find any suitable operators to give me the nearest tsrange.
Any ideas on the smartest way to do this?
The solution depends on the exact definition of what you want.
Schema
I suggest these slightly adapted table definitions to make the task simpler, enforce integrity and improve performance:
CREATE TABLE supplier (
supplier_id serial PRIMARY KEY,
supplier text NOT NULL CHECK (length(supplier) < 280),
type service_type,
duration interval,
geo_position geography(POINT,4326)
);
CREATE TABLE timeslot (
timeslot_id serial PRIMARY KEY,
supplier_id integer NOT NULL, -- references supplier(supplier_id)
slot_a timestamptz NOT NULL,
slot_z timestamptz NOT NULL,
CONSTRAINT timeslot_range_valid CHECK (slot_a < slot_z),
CONSTRAINT timeslot_no_overlapping
EXCLUDE USING gist (supplier_id WITH =, tstzrange(slot_a, slot_z) WITH &&)
);
CREATE INDEX timeslot_slot_z ON timeslot (supplier_id, slot_z);
CREATE INDEX supplier_geo_position_gist ON supplier USING gist (geo_position);
Save two timestamptz columns slot_a and slot_z instead of the tstzrange column slot - and adapt constraints accordingly. This treats all ranges as default inclusive lower and exclusive upper bounds automatically now - which avoids corner case errors / headache.
Collateral benefit: only 16 bytes for 2 timestamptz instead of 25 bytes (32 with padding) for the tstzrange.
All queries you might have had on slot keep working with tstzrange(slot_a, slot_z) as drop-in replacement.
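For instance, a containment check that previously used the slot column keeps working with the expression instead (a minimal sketch; supplier_id and timestamps are made-up values):
SELECT *
FROM timeslot
WHERE supplier_id = 1
AND tstzrange(slot_a, slot_z) @> tstzrange('2017-01-01 10:00+0', '2017-01-01 11:00+0');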
Add an index on (supplier_id, slot_z) for the query at hand.
And a spatial index on supplier.geo_position (which you probably have already).
Depending on data distribution in type, a couple of partial indexes for types common in queries might help performance:
CREATE INDEX supplier_geo_type_foo_gist ON supplier USING gist (geo_position)
WHERE type = 'foo'::service_type;
Query / Function
This query finds the X closest suppliers who offer the correct service_type (100 in the example), each with the one closest matching time slot (defined by the time distance to the start of the slot). I combined this with actually matching slots, which may or may not be what you need.
CREATE FUNCTION f_suppliers_nearby(_type service_type, _lat text, _lon text, at_time timestamptz)
RETURNS TABLE (supplier_id int
, name text
, duration interval
, geo_position geography(POINT,4326)
, distance float
, timeslot_id int
, slot_a timestamptz
, slot_z timestamptz
, time_dist interval
) AS
$func$
WITH sup_nearby AS ( -- find matching or later slot
SELECT s.supplier_id, s.name, s.duration, s.geo_position
, ST_Distance(ST_GeographyFromText('SRID=4326;POINT(' || _lat || ' ' || _lon || ')')
, geo_position) AS distance
, t.timeslot_id, t.slot_a, t.slot_z
, CASE WHEN t.slot_a IS NOT NULL
THEN GREATEST(t.slot_a - at_time, interval '0') END AS time_dist
FROM supplier s
LEFT JOIN LATERAL (
SELECT *
FROM timeslot
WHERE supplier_id = s.supplier_id
AND slot_z > at_time + s.duration -- excl. upper bound
ORDER BY slot_z
LIMIT 1
) t ON true
WHERE s.type = _type
ORDER BY distance
LIMIT 100
)
SELECT *
FROM (
SELECT DISTINCT ON (supplier_id) * -- 1 slot per supplier
FROM (
TABLE sup_nearby -- matching or later slot
UNION ALL -- earlier slot
SELECT s.supplier_id, s.name, s.duration, s.geo_position
, s.distance
, t.timeslot_id, t.slot_a, t.slot_z
, GREATEST(at_time - t.slot_a, interval '0') AS time_dist
FROM sup_nearby s
CROSS JOIN LATERAL ( -- this time CROSS JOIN!
SELECT *
FROM timeslot
WHERE supplier_id = s.supplier_id
AND slot_z <= at_time -- excl. upper bound
ORDER BY slot_z DESC
LIMIT 1
) t
WHERE s.time_dist IS DISTINCT FROM interval '0' -- exact matches are done
) sub
ORDER BY supplier_id, time_dist -- pick temporally closest slot per supplier
) sub
ORDER BY time_dist, distance; -- matches first, ordered by distance; then misses, ordered by time distance
$func$ LANGUAGE sql;
I did not use your view supplier_slots and optimized for performance instead. The view may still be convenient. You might include tstzrange(slot_a, slot_z) AS slot for backward compatibility.
The basic query to find the 100 closest suppliers is a textbook "K Nearest Neighbour" problem. A GiST index works well for this. Related:
How do I query all rows within a 5-mile radius of my coordinates?
The additional task (find the temporally nearest slot) can be split in two tasks: to find the next higher and the next lower row. The core feature of the solution is to have two subqueries with ORDER BY slot_z LIMIT 1 and ORDER BY slot_z DESC LIMIT 1, which result in two very fast index scans.
I combined the first one with finding actual matches, which is a (smart, I think) optimization, but may distract from the actual solution.
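Reduced to its core, the per-supplier lookup is just two LIMIT 1 scans (a sketch; _supplier_id and _at stand in for the actual parameter values):
-- matching or next later slot: the first one ending after the requested time
SELECT *
FROM timeslot
WHERE supplier_id = _supplier_id
AND slot_z > _at
ORDER BY slot_z
LIMIT 1;
-- next earlier slot: the last one ending at or before the requested time
SELECT *
FROM timeslot
WHERE supplier_id = _supplier_id
AND slot_z <= _at
ORDER BY slot_z DESC
LIMIT 1;
Both use the index on (supplier_id, slot_z) and return almost immediately.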

Optimizing SQL Server query / table

I have a database table that receives close to 1 million inserts a day that needs to be searchable for at least a year. Big hard drive and lots of data and not that great hardware to put it on either.
The table looks like this:
id | tag_id | value | time
----------------------------------------
279571 55 0.57 2013-06-18 12:43:22
...
tag_id identifies a tag such as AmbientTemperature or AmbientHumidity, and time is captured when the reading is taken from the sensor.
I'm querying on this table in a reporting format. I want to see all data for tags 1,55,72, and 4 between 2013-11-1 and 2013-11-28 at 1 hour intervals.
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
SELECT time, tag_name, tag_id, value,friendly_name,
ROW_NUMBER() over (partition by tag_id,datediff(hour, 0, time)/1 order by time desc) as seqnum
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE (tag_id = 1 OR tag_id = 55 OR tag_id = 72 OR tag_id = 4)
AND time >= '2013-11-1' AND time < '2013-11-28'
) k
WHERE seqnum = 1
ORDER BY time";
Can I optimize this table or my query at all? How should I set up my indexes?
It's pretty slow with a table size of 100 million + rows. It can take several minutes to get a data set of 7 days at an hourly interval with 3 tags in the query.
Filtering on the result of the row number function will make the query painfully slow. Also it will prevent optimal index use.
If your primary reporting need is hourly information you might want to consider storing which rows are the first sensor reading for a tag in a specific hour.
ALTER TABLE tag_values ADD IsHourlySensorReading BIT NULL;
In an hourly process, you calculate this column for new rows.
DECLARE @CalculateFrom DATETIME = (SELECT MIN(time) FROM tag_values WHERE IsHourlySensorReading IS NULL);
SET @CalculateFrom = dateadd(hour, datediff(hour, 0, @CalculateFrom), 0); -- truncate to the hour
UPDATE k
SET IsHourlySensorReading = CASE seqnum WHEN 1 THEN 1 ELSE 0 END
FROM (
SELECT id, row_number() over (partition by tag_id,datediff(hour, 0, time)/1 order by time desc) as seqnum
FROM tag_values tv
WHERE tv.time >= @CalculateFrom
AND tv.IsHourlySensorReading IS NULL
) as k
Your reporting query then becomes much simpler:
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
SELECT time, tag_name, tag_id, value,friendly_name
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE (tag_id = 1 OR tag_id = 55 OR tag_id = 72 OR tag_id = 4)
AND time >= '2013-11-1' AND time < '2013-11-28'
AND IsHourlySensorReading=1
) k
ORDER BY time;
The following index will help calculating the IsHourlySensorReading column. But remember, indexes will also cause your million inserts per day to take more time. Test thoroughly!
CREATE NONCLUSTERED INDEX tag_values_ixnc01 ON tag_values (time, IsHourlySensorReading) WHERE (IsHourlySensorReading IS NULL);
Use this index for reporting if you need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc02 ON tag_values (time, tag_id, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Use this index for reporting if you don't need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc03 ON tag_values (tag_id, time, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Some additional things to consider:
Is ORDER BY time really required?
Table partitioning can seriously improve both insert and query performance. Depending on your situation I would partition on either tag_id or date.
Instead of creating a column with an IsHourlySensorReading indicator, you can also create a separate table/database for specific reporting requirements and only load the relevant data into that.
I'm not an expert on SQL Server, but I would seriously consider setting this up as a partitioned table. That would also make archiving easier, as partitions can simply be dropped (rather than running an expensive DELETE FROM ... WHERE ...).
Also (with a bit of luck) the optimiser will only look in the partitions required for the data.
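A minimal sketch of what date-based partitioning could look like here (the names and boundary dates are made up; adjust to your retention needs):
CREATE PARTITION FUNCTION pf_tag_values_month (datetime)
AS RANGE RIGHT FOR VALUES ('2013-11-01', '2013-12-01', '2014-01-01');
CREATE PARTITION SCHEME ps_tag_values_month
AS PARTITION pf_tag_values_month ALL TO ([PRIMARY]);
-- rebuilding the clustered index on the scheme partitions the table by time
CREATE CLUSTERED INDEX cix_tag_values ON tag_values (time) ON ps_tag_values_month (time);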

Timestamp difference between rows in Postgresql

In PostgreSQL I have a table
CREATE TABLE cubenotification.newnotification
(
idnewnotification serial NOT NULL,
contratto text,
idlocation numeric,
typology text,
idpost text,
iduser text,
idobject text,
idwhere text,
status text DEFAULT 'valid'::text,
data_crea timestamp with time zone DEFAULT now(),
username text,
usersocial text,
url text,
apikey text,
CONSTRAINT newnotification_pkey PRIMARY KEY (idnewnotification)
)
Let's say the typology field can be "owned_task" or "fwd_task".
What I'd like to get from the DB is the timestamp difference in seconds between data_crea of the row with typology "fwd_task" and data_crea of the row with typology "owned_task", for every (idobject, iduser) pair. I'd also like to get EXTRACT(WEEK FROM data_crea) as "weeks", grouping the results by "weeks". My problem is mainly how to compute the ordered timestamp difference between two rows with the same idobject, the same iduser, and different typology.
EDIT:
Here is some sample data and an SQLFiddle link: http://sqlfiddle.com/#!12/6cd64/2
What you are looking for is the JOIN of two subselects:
SELECT EXTRACT(WEEK FROM o.data_crea) AS owned_week
, iduser, idobject
, EXTRACT('epoch' FROM (o.data_crea - f.data_crea)) AS diff_in_sek
FROM (SELECT * FROM newnotification WHERE typology = 'owned_task') o
JOIN (SELECT * FROM newnotification WHERE typology = 'fwd_task') f
USING (iduser, idobject)
ORDER BY 1,4
I order by week and timestamp difference. The week number is based on the week of 'owned_task'.
This assumes there is exactly one row for 'owned_task' and one for 'fwd_task' per (iduser, idobject), not one per week. Your specification leaves room for interpretation.
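If one aggregated row per week is actually wanted, the same join can be grouped, for example by averaging the differences (a sketch of one possible interpretation of the requirement):
SELECT EXTRACT(WEEK FROM o.data_crea) AS owned_week
, avg(EXTRACT('epoch' FROM (o.data_crea - f.data_crea))) AS avg_diff_in_sek
FROM (SELECT * FROM newnotification WHERE typology = 'owned_task') o
JOIN (SELECT * FROM newnotification WHERE typology = 'fwd_task') f
USING (iduser, idobject)
GROUP BY 1
ORDER BY 1;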

How can I calculate the top % daily price changes using MySQL?

I have a table called prices which includes the closing price of stocks that I am tracking daily.
Here is the schema:
CREATE TABLE `prices` (
`id` int(21) NOT NULL auto_increment,
`ticker` varchar(21) NOT NULL,
`price` decimal(7,2) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `ticker` (`ticker`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=2200 ;
I am trying to calculate the % price drop for anything that has a price value greater than 0 for today and yesterday. Over time, this table will be huge and I am worried about performance. I assume this will have to be done on the MySQL side rather than PHP because LIMIT will be needed here.
How do I take the last 2 dates and do the % drop calculation in MySQL though?
Any advice would be greatly appreciated.
One problem I see right off the bat is using a timestamp data type for the date. This will complicate your SQL query for two reasons: you will have to use a range or convert to an actual date in your WHERE clause, and, more importantly, since you state that you are interested in today's closing price and yesterday's closing price, you will have to keep track of the days when the market is open. So Monday's query is different from Tuesday's through Friday's, and any day the market is closed for a holiday has to be accounted for as well.
I would add a column like mktDay and increment it each day the market is open for business. Another approach might be to include a previousClose column, which makes your calculation trivial. I realize this violates normal form, but it saves an expensive self join in your query.
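With such a hypothetical previousClose column in place, the calculation indeed becomes trivial (a sketch):
SELECT ticker,
(price - previousClose) / previousClose AS pct_change
FROM prices
WHERE price > 0
AND previousClose > 0
AND date(`date`) = CURRENT_DATE
ORDER BY pct_change ASC -- biggest drops first
LIMIT 10;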
If you cannot change the structure, then you will do a self join to get yesterday's close and you can calculate the % change and order by that % change if you wish.
Below is Eric's code, cleaned up a bit; it executed on my server running MySQL 5.0.27:
select
p_today.`ticker`,
p_today.`date`,
p_yest.price as `open`,
p_today.price as `close`,
((p_today.price - p_yest.price)/p_yest.price) as `change`
from
prices p_today
inner join prices p_yest on
p_today.ticker = p_yest.ticker
and date(p_today.`date`) = date(p_yest.`date`) + INTERVAL 1 DAY
and p_today.price > 0
and p_yest.price > 0
and date(p_today.`date`) = CURRENT_DATE
order by `change` desc
limit 10
Note the back-ticks as some of your column names and Eric's aliases were reserved words.
Also note that using a WHERE clause for the first table would make the query less expensive: the WHERE gets executed first, and the self join only has to be attempted on the rows that are greater than zero and have today's date.
select
p_today.`ticker`,
p_today.`date`,
p_yest.price as `open`,
p_today.price as `close`,
((p_today.price - p_yest.price)/p_yest.price) as `change`
from
prices p_today
inner join prices p_yest on
p_today.ticker = p_yest.ticker
and date(p_today.`date`) = date(p_yest.`date`) + INTERVAL 1 DAY
and p_yest.price > 0
where p_today.price > 0
and date(p_today.`date`) = CURRENT_DATE
order by `change` desc
limit 10
Scott brings up a great point about consecutive market days. I recommend handling this with a connector table like:
CREATE TABLE `market_days` (
`market_day` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`date` DATE NOT NULL DEFAULT '0000-00-00',
PRIMARY KEY USING BTREE (`market_day`),
UNIQUE KEY USING BTREE (`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=0
;
As more market days elapse, just INSERT new date values in the table. market_day will increment accordingly.
When inserting prices data, look up LAST_INSERT_ID() or the market_day value corresponding to a given date for past values.
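Usage could look like this, using the prices schema proposed just below (a sketch; the ticker_id and price values are illustrative):
-- at the start of each trading day:
INSERT INTO market_days (`date`) VALUES (CURRENT_DATE);
SET @today_market_day = LAST_INSERT_ID();
-- price rows inserted during that day reuse the same market_day:
INSERT INTO prices (market_day, ticker_id, price)
VALUES (@today_market_day, 42, 123.45);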
As for the prices table itself, you can make storage, SELECT and INSERT operations much more efficient with a useful PRIMARY KEY and no AUTO_INCREMENT column. In the schema below, your PRIMARY KEY contains intrinsically useful information and isn't just a convention to identify unique rows. Using MEDIUMINT (3 bytes) instead of INT (4 bytes) saves an extra byte per row and more importantly 2 bytes per row in the PRIMARY KEY - all while still affording over 16 million possible dates and ticker symbols (each).
CREATE TABLE `prices` (
`market_day` MEDIUMINT(8) UNSIGNED NOT NULL DEFAULT '0',
`ticker_id` MEDIUMINT(8) UNSIGNED NOT NULL DEFAULT '0',
`price` decimal (7,2) NOT NULL DEFAULT '00000.00',
PRIMARY KEY USING BTREE (`market_day`,`ticker_id`),
KEY `ticker_id` USING BTREE (`ticker_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
;
In this schema each row is unique across each pair of market_day and ticker_id. Here ticker_id corresponds to a list of ticker symbols in a tickers table with a similar schema to the market_days table:
CREATE TABLE `tickers` (
`ticker_id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`ticker_symbol` VARCHAR(5),
`company_name` VARCHAR(50),
/* etc */
PRIMARY KEY USING BTREE (`ticker_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=0
;
This yields a similar query to the others proposed, but with two important differences:
1) There's no functional transformation on the date column, which would destroy MySQL's ability to use keys on the join; in the query below MySQL will use part of the PRIMARY KEY to join on market_day.
2) MySQL can only use one key per JOIN or WHERE clause. In this query MySQL will use the full width of the PRIMARY KEY (market_day and ticker_id), whereas in the previous query it could only use one (MySQL will usually pick the more selective of the two).
SELECT
`market_days`.`date`,
`tickers`.`ticker_symbol`,
`yesterday`.`price` AS `close_yesterday`,
`today`.`price` AS `close_today`,
(`today`.`price` - `yesterday`.`price`) / (`yesterday`.`price`) AS `pct_change`
FROM
`prices` AS `today`
LEFT JOIN
`prices` AS `yesterday`
ON /* uses PRIMARY KEY */
`yesterday`.`market_day` = `today`.`market_day` - 1 /* this will join NULL for `today`.`market_day` = 0 */
AND
`yesterday`.`ticker_id` = `today`.`ticker_id`
INNER JOIN
`market_days` /* uses first 3 bytes of PRIMARY KEY */
ON
`market_days`.`market_day` = `today`.`market_day`
INNER JOIN
`tickers` /* uses KEY (`ticker_id`) */
ON
`tickers`.`ticker_id` = `today`.`ticker_id`
WHERE
`today`.`price` > 0
AND
`yesterday`.`price` > 0
;
A finer point is the need to also join against tickers and market_days in order to display the actual ticker_symbol and date, but these operations are very fast since they make use of keys.
Essentially, you can just join the table to itself to find the given % change. Then, order by change descending to get the largest changers on top. You could even order by abs(change) if you want the largest swings.
select
p_today.ticker,
p_today.date,
p_yest.price as open,
p_today.price as close,
-- don't have to worry about 0 division here, both prices are > 0
(p_today.price - p_yest.price)/p_yest.price as change
from
prices p_today
inner join prices p_yest on
p_today.ticker = p_yest.ticker
and date(p_today.date) = date(date_add(p_yest.date, interval 1 day))
and p_today.price > 0
and p_yest.price > 0
and date(p_today.date) = CURRENT_DATE
order by change desc
limit 10