Unique assignment of closest points between two tables - sql

In my Postgres 9.5 database with PostGIS 2.2.0 installed, I have two tables with geometric data (points) and I want to assign points from one table to points from the other table, but I don't want a buildings.gid to be assigned twice. As soon as one buildings.gid is assigned, it should not be assigned to another pvanlagen.buildid.
Table definitions
buildings:
CREATE TABLE public.buildings (
gid numeric NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr numeric,
capac numeric,
instdate date,
pvid numeric,
dist numeric,
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
);
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);
pvanlagen:
CREATE TABLE public.pvanlagen (
gid integer NOT NULL DEFAULT nextval('pv_bis2010_bayern_wgs84_gid_seq'::regclass),
tso character varying(254),
tso_number numeric(10,0),
system_ope character varying(254),
system_key character varying(254),
location character varying(254),
postal_cod numeric(10,0),
street character varying(254),
capacity numeric,
voltage_le character varying(254),
energy_sou character varying(254),
beginning_ date,
end_operat character varying(254),
id numeric(10,0),
kkz numeric(10,0),
geom geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
buildid numeric,
dist numeric,
trans boolean,
CONSTRAINT pv_bis2010_bayern_wgs84_pkey PRIMARY KEY (gid),
CONSTRAINT pvanlagen_buildid_fkey FOREIGN KEY (buildid)
REFERENCES public.buildings (gid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid)
);
CREATE INDEX pv_bis2010_bayern_wgs84_geom_idx
ON public.pvanlagen
USING gist
(geom);
Query
My idea was to add a boolean column pv in the buildings table, which is set when a buildings.gid was assigned:
UPDATE pvanlagen
SET buildid=buildings.gid, dist='50'
FROM buildings
WHERE buildid IS NULL
AND buildings.pv is NULL
AND pvanlagen.gemname=buildings.gemname
AND ST_Distance(ST_Transform(pvanlagen.geom,31467)
,ST_Transform(buildings.centroid,31467))<50;
UPDATE buildings
SET pv=true
FROM pvanlagen
WHERE buildings.gid=pvanlagen.buildid;
I tested for 50 rows in buildings, but it takes too long to apply to all of them. I have 3,200,000 buildings and 260,000 PV.
The gid of the closest building shall be assigned. In case of ties, it should not matter which gid is assigned. If we need to frame a rule, we can take the building with the lower gid.
50 meters was meant to work as a limit. I used ST_Distance() because it returns the minimum distance, which should be within 50 meters. Later I raised it multiple times, until every PV Anlage was assigned.
Buildings and PV are assigned to their respective regions (gemname). This should make the assignment cheaper, since I know the nearest building must be within the same region (gemname).
I tried this query after feedback below:
UPDATE pvanlagen p1
SET buildid = buildings.gid
, dist = buildings.dist
FROM (
SELECT DISTINCT ON (b.gid)
p.id, b.gid, b.dist::numeric
FROM (
SELECT id, ST_Transform(geom, 31467)
FROM pvanlagen
WHERE buildid IS NULL -- not assigned yet
) p
, LATERAL (
SELECT b.gid, ST_Distance(ST_Transform(p1.geom, 31467), ST_Transform(b.centroid, 31467)) AS dist
FROM buildings b
LEFT JOIN pvanlagen p1 ON p1.buildid = b.gid
WHERE p1.buildid IS NULL
AND b.gemname = p1.gemname
ORDER BY ST_Transform(p1.geom, 31467) <-> ST_Transform(b.centroid, 31467)
LIMIT 1
) b
ORDER BY b.gid, b.dist, p.id -- tie breaker
) x, buildings
WHERE p1.id = x.id;
But it returns with 0 rows affected in 234 ms execution time.
Where am I going wrong?

Table schema
To enforce your rule simply declare pvanlagen.buildid UNIQUE:
ALTER TABLE pvanlagen ADD CONSTRAINT pvanlagen_buildid_uni UNIQUE (buildid);
buildings.gid is the PK, as your update revealed. To also enforce referential integrity, add a FOREIGN KEY constraint to buildings.gid.
You have implemented both by now. But it would be more efficient to run the big UPDATE below before you add these constraints.
There is a lot more that should be improved in your table definition. For one, buildings.gid as well as pvanlagen.buildid should be type integer (or possibly bigint if you burn a lot of PK values). numeric is expensive nonsense.
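A minimal sketch of that type change, assuming the numeric values are plain whole numbers and nothing else depends on the columns (the FK is dropped and recreated around the change to be safe; both statements rewrite their tables):
ALTER TABLE pvanlagen DROP CONSTRAINT pvanlagen_buildid_fkey;

ALTER TABLE buildings ALTER COLUMN gid     TYPE integer;  -- rewrites the big table
ALTER TABLE pvanlagen ALTER COLUMN buildid TYPE integer;

ALTER TABLE pvanlagen ADD CONSTRAINT pvanlagen_buildid_fkey
   FOREIGN KEY (buildid) REFERENCES buildings (gid);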
Let's focus on the core problem:
Basic Query to find closest building
The case is not as simple as it may seem. It's a "nearest neighbour" problem, with the additional complication of unique assignment.
This query finds the one nearest building for each PV (short for PV Anlage - a row in pvanlagen), where neither is assigned yet:
SELECT pv_gid, b_gid, dist
FROM (
SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
FROM pvanlagen
WHERE buildid IS NULL -- not assigned yet
) p
, LATERAL (
SELECT b.gid AS b_gid
, round(ST_Distance(p.geom31467
, ST_Transform(b.centroid, 31467))::numeric, 2) AS dist -- see below
FROM buildings b
LEFT JOIN pvanlagen p1 ON p1.buildid = b.gid -- also not assigned ...
WHERE p1.buildid IS NULL -- ... yet
-- AND p.gemname = b.gemname -- not needed for performance, see below
ORDER BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
LIMIT 1
) b;
To make this query fast, you need a spatial, functional GiST index on buildings:
CREATE INDEX build_centroid_gix ON buildings USING gist (ST_Transform(centroid, 31467));
Your table definition above shows that you have this index in place already.
Related answers with more explanation:
Spatial query on large table with multiple self joins performing slow
How do I query all rows within a 5-mile radius of my coordinates?
Further reading:
http://workshops.boundlessgeo.com/postgis-intro/knn.html
http://www.postgresonline.com/journal/archives/306-KNN-GIST-with-a-Lateral-twist-Coming-soon-to-a-database-near-you.html
With the index in place, we don't need to restrict matches to the same gemname for performance. Only do this if it's an actual rule to be enforced. If it has to be observed at all times, include the column in the FK constraint:
Restrict foreign key relationship to rows of related subtypes
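If that rule does apply, a multicolumn FK could look like the following sketch (the constraint names are made up; the UNIQUE constraint on (gid, gemname) is required before the two columns can be referenced together):
ALTER TABLE buildings ADD CONSTRAINT buildings_gid_gemname_uni UNIQUE (gid, gemname);
ALTER TABLE pvanlagen ADD CONSTRAINT pvanlagen_buildid_gemname_fkey
   FOREIGN KEY (buildid, gemname) REFERENCES buildings (gid, gemname);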
Remaining Problem
We can use the above query in an UPDATE statement. Each PV is only used once, but more than one PV might still find the same building to be closest. You only allow one PV per building. So how would you resolve that?
In other words, how would you assign objects here?
Simple solution
One simple solution would be:
UPDATE pvanlagen p1
SET buildid = sub.b_gid
, dist = sub.dist -- actual distance
FROM (
SELECT DISTINCT ON (b_gid)
pv_gid, b_gid, dist
FROM (
SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
FROM pvanlagen
WHERE buildid IS NULL -- not assigned yet
) p
, LATERAL (
SELECT b.gid AS b_gid
, round(ST_Distance(p.geom31467
, ST_Transform(b.centroid, 31467))::numeric, 2) AS dist -- see below
FROM buildings b
LEFT JOIN pvanlagen p1 ON p1.buildid = b.gid -- also not assigned ...
WHERE p1.buildid IS NULL -- ... yet
-- AND p.gemname = b.gemname -- not needed for performance, see below
ORDER BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
LIMIT 1
) b
ORDER BY b_gid, dist, pv_gid -- tie breaker
) sub
WHERE p1.gid = sub.pv_gid;
I use DISTINCT ON (b_gid) to reduce to exactly one row per building, picking the PV with the shortest distance. Details:
Select first row in each GROUP BY group?
For any building that is closest for more than one PV, only the closest PV is assigned. The PK column gid (alias pv_gid) serves as tiebreaker if two are equally close. In such a case, some PV are dropped from the update and remain unassigned. Repeat the query until all PV are assigned, as in the loop sketched below.
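One way to repeat automatically is a DO block that re-runs the UPDATE until a pass assigns nothing. This is only a sketch wrapping the statement above; it assumes every remaining PV eventually finds an unassigned building (there are far more buildings than PV):
DO
$do$
DECLARE
   _rows bigint;
BEGIN
   LOOP
      UPDATE pvanlagen p1
      SET    buildid = sub.b_gid
           , dist    = sub.dist
      FROM  (
         SELECT DISTINCT ON (b_gid)
                pv_gid, b_gid, dist
         FROM  (
            SELECT gid AS pv_gid, ST_Transform(geom, 31467) AS geom31467
            FROM   pvanlagen
            WHERE  buildid IS NULL          -- not assigned yet
            ) p
         , LATERAL (
            SELECT b.gid AS b_gid
                 , round(ST_Distance(p.geom31467
                       , ST_Transform(b.centroid, 31467))::numeric, 2) AS dist
            FROM   buildings b
            LEFT   JOIN pvanlagen p1 ON p1.buildid = b.gid
            WHERE  p1.buildid IS NULL       -- building not assigned yet
            ORDER  BY p.geom31467 <-> ST_Transform(b.centroid, 31467)
            LIMIT  1
            ) b
         ORDER  BY b_gid, dist, pv_gid      -- tie breaker
         ) sub
      WHERE  p1.gid = sub.pv_gid;

      GET DIAGNOSTICS _rows = ROW_COUNT;    -- rows assigned in this pass
      EXIT WHEN _rows = 0;                  -- stop when nothing more was assigned
   END LOOP;
END
$do$;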
This is still a simplistic algorithm, though. Looking at my diagram above, this assigns building 4 to PV 4 and building 5 to PV 5, while 4-5 and 5-4 would probably be a better solution overall ...
Aside: type for dist column
Currently you use numeric for it. Your original query assigned a constant integer; there is no point in making that numeric.
In my new query, ST_Distance() returns the actual distance in meters as double precision. If we simply assign that, we get 15 or so fractional digits in the numeric data type, and the number is not that exact to begin with. I seriously doubt you want to waste the storage.
I would rather save the original double precision from the calculation. Or, better yet, round as needed. If meters are exact enough, just cast to integer and save that (which rounds the number automatically). Or multiply by 100 first to save centimeters:
(ST_Distance(...) * 100)::int
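For illustration, a sketch of storing centimeters in an integer column (dist_cm is a hypothetical column name, not part of the schema above):
ALTER TABLE pvanlagen ADD COLUMN dist_cm integer;
-- then, in the UPDATE above, assign it like:
--   dist_cm = (ST_Distance(p.geom31467, ST_Transform(b.centroid, 31467)) * 100)::int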

Related

Counting points on a linestring

I am trying to count the number of points on a line for each row in the following table:
CREATE TABLE outils.prod(
pk INTEGER PRIMARY KEY,
cable VARCHAR (25),
PA VARCHAR (10),
Art VARCHAR(7),
FT Numeric,
BT Numeric
);
INSERT INTO outils.prod (pk)
SELECT id_ftth
FROM outils.cable
WHERE type_cable = '2' ;
SELECT ADDGEOMETRYCOLUMN('outils','prod','geom',2154,'MultiLineString',2);
I have tried to update my line table, but I have trouble getting an answer for each row.
UPDATE outils.prod SET FT=(SELECT COUNT( ST_INTERSECTION(outils.prod.geom,outils.ft.geom))
FROM outils.prod , outils.ft)
With the above code I managed to get the total number of intersections for every line, but I would like to have the count per line in my line table.
Thank you ,
Hugo
You would have to write a sub-query to do the count per line.
Also you don't need to compute the intersection (the geom), but just to check if they intersect, which is much faster.
UPDATE outils.prod
SET FT = sub.cnt
FROM (
   SELECT prod.pk, count(*) AS cnt
   FROM outils.ft
   JOIN outils.prod ON ST_Intersects(prod.geom, ft.geom)
   GROUP BY prod.pk
) sub
WHERE prod.pk = sub.pk;

Postgres: How to find nearest tsrange from timestamp outside of ranges?

I am modeling (in Postgres 9.6.1 / postGIS 2.3.1) a booking system for local services provided by suppliers:
create table supplier (
id serial primary key,
name text not null check (char_length(name) < 280),
type service_type,
duration interval,
...
geo_position geography(POINT,4326)
...
);
Each supplier keeps a calendar with time slots when he/she is available to be booked:
create table timeslot (
id serial primary key,
supplier_id integer not null references supplier(id),
slot tstzrange not null,
constraint supplier_overlapping_timeslot_not_allowed
exclude using gist (supplier_id with =, slot with &&)
);
For when a client wants to know which nearby suppliers are available to book at a certain time, I create a view and a function:
create view supplier_slots as
select
supplier.name, supplier.type, supplier.geo_position, supplier.duration, ...
timeslot.slot
from
supplier, timeslot
where
supplier.id = timeslot.supplier_id;
create function find_suppliers(wantedType service_type, near_latitude text, near_longitude text, at_time timestamptz)
returns setof supplier_slots as $$
declare
nearpoint geography;
begin
nearpoint := ST_GeographyFromText('SRID=4326;POINT(' || near_latitude || ' ' || near_longitude || ')');
return query
select * from supplier_slots
where type = wantedType
and tstzrange(at_time, at_time + duration) <@ slot
order by ST_Distance( nearpoint, geo_position )
limit 100;
end;
$$ language plpgsql;
All this works really well.
Now, for the suppliers that did NOT have a bookable time slot at the requested time, I would like to find their closest available timeslots, before and after the requested at_time, also sorted by distance.
This has my mind spinning a little bit and I can't find any suitable operators to give me the nearest tsrange.
Any ideas on the smartest way to do this?
The solution depends on the exact definition of what you want.
Schema
I suggest these slightly adapted table definitions to make the task simpler, enforce integrity and improve performance:
CREATE TABLE supplier (
supplier_id serial PRIMARY KEY,
supplier text NOT NULL CHECK (length(supplier) < 280),
type service_type,
duration interval,
geo_position geography(POINT,4326)
);
CREATE TABLE timeslot (
timeslot_id serial PRIMARY KEY,
supplier_id integer NOT NULL, -- references supplier(supplier_id)
slot_a timestamptz NOT NULL,
slot_z timestamptz NOT NULL,
CONSTRAINT timeslot_range_valid CHECK (slot_a < slot_z),
CONSTRAINT timeslot_no_overlapping
EXCLUDE USING gist (supplier_id WITH =, tstzrange(slot_a, slot_z) WITH &&)
);
CREATE INDEX timeslot_slot_z ON timeslot (supplier_id, slot_z);
CREATE INDEX supplier_geo_position_gist ON supplier USING gist (geo_position);
Save two timestamptz columns slot_a and slot_z instead of the tstzrange column slot, and adapt constraints accordingly. This treats all ranges with the default bounds (inclusive lower, exclusive upper) automatically, which avoids corner-case errors and headaches.
Collateral benefit: only 16 bytes for 2 timestamptz instead of 25 bytes (32 with padding) for the tstzrange.
All queries you might have had on slot keep working with tstzrange(slot_a, slot_z) as drop-in replacement.
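For example, a hypothetical containment check against the old range column would simply become:
SELECT *
FROM timeslot
WHERE tstzrange(slot_a, slot_z) @> now();  -- same semantics as "slot @> now()" before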
Add an index on (supplier_id, slot_z) for the query at hand.
And a spatial index on supplier.geo_position (which you probably have already).
Depending on data distribution in type, a couple of partial indexes for types common in queries might help performance:
CREATE INDEX supplier_geo_type_foo_gist ON supplier USING gist (geo_position)
WHERE type = 'foo'::service_type;
Query / Function
This query finds the X closest suppliers who offer the correct service_type (100 in the example), each with the one closest matching time slot (defined by the time distance to the start of the slot). I combined this with actually matching slots, which may or may not be what you need.
CREATE FUNCTION f_suppliers_nearby(_type service_type, _lat text, _lon text, at_time timestamptz)
RETURNS TABLE (supplier_id int
, name text
, duration interval
, geo_position geography(POINT,4326)
, distance float
, timeslot_id int
, slot_a timestamptz
, slot_z timestamptz
, time_dist interval
) AS
$func$
WITH sup_nearby AS ( -- find matching or later slot
SELECT s.supplier_id, s.supplier, s.duration, s.geo_position
, ST_Distance(ST_GeographyFromText('SRID=4326;POINT(' || _lat || ' ' || _lon || ')')
, geo_position) AS distance
, t.timeslot_id, t.slot_a, t.slot_z
, CASE WHEN t.slot_a IS NOT NULL
THEN GREATEST(t.slot_a - at_time, interval '0') END AS time_dist
FROM supplier s
LEFT JOIN LATERAL (
SELECT *
FROM timeslot
WHERE supplier_id = s.supplier_id
AND slot_z > at_time + s.duration -- excl. upper bound
ORDER BY slot_z
LIMIT 1
) t ON true
WHERE s.type = _type
ORDER BY distance
LIMIT 100
)
SELECT *
FROM (
SELECT DISTINCT ON (supplier_id) * -- 1 slot per supplier
FROM (
TABLE sup_nearby -- matching or later slot
UNION ALL -- earlier slot
SELECT s.supplier_id, s.supplier, s.duration, s.geo_position
, s.distance
, t.timeslot_id, t.slot_a, t.slot_z
, GREATEST(at_time - t.slot_a, interval '0') AS time_dist
FROM sup_nearby s
CROSS JOIN LATERAL ( -- this time CROSS JOIN!
SELECT *
FROM timeslot
WHERE supplier_id = s.supplier_id
AND slot_z <= at_time -- excl. upper bound
ORDER BY slot_z DESC
LIMIT 1
) t
WHERE s.time_dist IS DISTINCT FROM interval '0' -- exact matches are done
) sub
ORDER BY supplier_id, time_dist -- pick temporally closest slot per supplier
) sub
ORDER BY time_dist, distance; -- matches first, ordered by distance; then misses, ordered by time distance
$func$ LANGUAGE sql;
I did not use your view supplier_slots and optimized for performance instead. The view may still be convenient. You might include tstzrange(slot_a, slot_z) AS slot for backward compatibility.
The basic query to find the 100 closest suppliers is a textbook "K Nearest Neighbour" problem. A GiST index works well for this. Related:
How do I query all rows within a 5-mile radius of my coordinates?
The additional task (find the temporally nearest slot) can be split in two tasks: to find the next higher and the next lower row. The core feature of the solution is to have two subqueries with ORDER BY slot_z LIMIT 1 and ORDER BY slot_z DESC LIMIT 1, which result in two very fast index scans.
I combined the first one with finding actual matches, which is a (smart, I think) optimization, but may distract from the actual solution.
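To isolate the core trick, here is a minimal sketch of the two index scans for a single supplier (supplier_id = 1 and now() are just example values); both queries can walk the (supplier_id, slot_z) index:
-- next matching-or-later slot
SELECT * FROM timeslot
WHERE supplier_id = 1
AND slot_z > now()
ORDER BY slot_z
LIMIT 1;

-- closest earlier slot
SELECT * FROM timeslot
WHERE supplier_id = 1
AND slot_z <= now()
ORDER BY slot_z DESC
LIMIT 1;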

What is the equivalent PostgreSQL syntax to Oracle's CONNECT BY ... START WITH?

In Oracle, if I have a table defined as …
CREATE TABLE taxonomy
(
key NUMBER(11) NOT NULL CONSTRAINT taxPkey PRIMARY KEY,
value VARCHAR2(255),
taxHier NUMBER(11)
);
ALTER TABLE
taxonomy
ADD CONSTRAINT
taxTaxFkey
FOREIGN KEY
(taxHier)
REFERENCES
taxonomy(key);
With these values …
key value taxHier
0 zero null
1 one 0
2 two 0
3 three 0
4 four 1
5 five 2
6 six 2
This query syntax …
SELECT
value
FROM
taxonomy
CONNECT BY
PRIOR key = taxHier
START WITH
key = 0;
Will yield …
zero
one
four
two
five
six
three
How is this done in PostgreSQL?
Use a RECURSIVE CTE in Postgres:
WITH RECURSIVE cte AS (
SELECT key, value, 1 AS level
FROM taxonomy
WHERE key = 0
UNION ALL
SELECT t.key, t.value, c.level + 1
FROM cte c
JOIN taxonomy t ON t.taxHier = c.key
)
SELECT value
FROM cte
ORDER BY level;
Details and links to documentation in my previous answer:
Does PostgreSQL have a pseudo-column like "LEVEL" in Oracle?
Or you can install the additional module tablefunc which provides the function connectby() doing almost the same. See Stradas' answer for details.
Postgres does have an equivalent to CONNECT BY. You will need to enable the module; it's turned off by default.
It is called tablefunc. It supports some cool crosstab functionality as well as the familiar "CONNECT BY" and "START WITH". I have found it works much more elegantly and logically than the recursive CTE. If you can't get this turned on by your DBA, you should go for the way Erwin is doing it.
It is robust enough to do the "bill of materials" type query as well.
Tablefunc can be turned on by running this command:
CREATE EXTENSION tablefunc;
Here is the list of connection fields freshly lifted from the official documentation.
Parameter: Description
relname: Name of the source relation (table)
keyid_fld: Name of the key field
parent_keyid_fld: Name of the parent-key field
orderby_fld: Name of the field to order siblings by (optional)
start_with: Key value of the row to start at
max_depth: Maximum depth to descend to, or zero for unlimited depth
branch_delim: String to separate keys with in branch output (optional)
You really should take a look at the docs page. It is well written and it will give you the options you are used to. (On the doc page, scroll down; it's near the bottom.)
PostgreSQL "Connect by" extension
Below is the description of what putting that structure together should be like. There is a ton of potential so I won't do it justice, but here is a snip of it to give you an idea.
connectby(text relname, text keyid_fld, text parent_keyid_fld
[, text orderby_fld ], text start_with, int max_depth
[, text branch_delim ])
A real query will look like this. connectby_tree is the name of the table. The line starting with "AS" is how you name the columns. It does look a little upside down.
SELECT * FROM connectby('connectby_tree', 'keyid', 'parent_keyid', 'pos', 'row2', 0, '~')
AS t(keyid text, parent_keyid text, level int, branch text, pos int);
As indicated by Stradas, here is the query:
SELECT value
FROM connectby('taxonomy', 'key', 'taxHier', '0', 0, '~')
AS ct(keyid numeric, parent_keyid numeric, level int, branch text)
inner join taxonomy t on t.key = ct.keyid;
For example, we have a table in PostgreSQL named product_types. Its columns are (id, parent_id, name, sort_order).
The first (non-recursive) select returns the root (parent) row.
id = 76 is the top-level parent record in this example.
with recursive tree as (
    select
        pt0.id,
        pt0.parent_id,
        pt0.name,
        pt0.sort_order,
        0 AS level
    from product_types pt0
    where pt0.id = 76
    UNION ALL
    select
        pt1.id,
        pt1.parent_id,
        pt1.name,
        pt1.sort_order,
        (tree.level + 1) as level
    from product_types pt1
    inner join tree on (pt1.parent_id = tree.id)
)
select
    *
from tree
order by level, sort_order;

DB design: Should I use constraints within a table or a new table

I inherited a large existing DB and I'd like to know if I should refactor it because 95% of my queries require joining at least 4 tables.
The DB has 5 tables that only have an ID and Name column with fewer than 20 rows. I assume the author did this so he could change the names there and not change them in the other tables, but many of those tables are only referenced in one other table. Should I refactor these small 2-column tables into a larger table and add a constraint to the column so users can't input incorrect names, instead of having separate tables?
Resist that urge. From your description I can deduce that the existing design is solid and probably well normalized. Your refactoring may actually undo a good db structure.
If you are bothered by writing a lot of joins in your queries I would suggest creating views to mitigate the boilerplate.
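For instance, a view can bundle the recurring joins once so that most queries read from the view instead (all names below are illustrative, not from your actual schema):
-- sketch: hide the repetitive lookup joins behind a view
CREATE VIEW order_details AS
SELECT o.order_id,
       s.name AS status,
       c.name AS category
FROM orders o
JOIN statuses s ON s.status_id = o.status_id
JOIN categories c ON c.category_id = o.category_id;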
...the author did this so he could change the names there and not change
them in the other tables...
That is evidence of good design and exactly what you should strive for in a normalized database.
No.
Your db is normalized and proper.
And you save space, lookup time, and indexing by storing an int rather than a varchar name.
Small tables are optimized away if they are properly keyed.
Sounds like what you have are lookup tables. Let me tell you what happens when you decide to put all lookups in one table with an additional column to specify which type it is. First, instead of joining to 4 different tables in one query, you have to join to the same table 4 times. There ends up being more contention for the resources in the "one table to rule them all". Further, you lose FK constraints. That means you eventually lose data integrity. So if one lookup is state, nothing will prevent you from putting the id values for a different lookup, such as customer type, in the stateid column of the customeraddress table. When the lookups are separate you can enforce that relationship.
Suppose instead of one big table you decide to have a constraint on the column for customer type. Constraints are now enforced, but you have a problem when they need to change. Now you have to alter the database in order to add a new type; a sketch of that follows below. Again, this is usually a very bad idea, especially when the table gets large.
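A sketch of that situation with made-up names (the allowed values are baked into the CHECK constraint, so allowing a new type means altering the table):
ALTER TABLE customer ADD CONSTRAINT customer_type_chk
   CHECK (customer_type IN ('retail', 'wholesale'));

-- later, to allow a new type, the constraint has to be replaced:
ALTER TABLE customer DROP CONSTRAINT customer_type_chk;
ALTER TABLE customer ADD CONSTRAINT customer_type_chk
   CHECK (customer_type IN ('retail', 'wholesale', 'online'));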
Short story: Replacing strings with ID numbers has nothing to do with normalization. Using natural keys in your case might improve performance. In my tests, queries using natural keys were faster by 1 or 2 orders of magnitude.
You might have accepted an answer too quickly.
The DB has 5 tables that only have an ID and Name column with fewer
than 20 rows.
I'm assuming these tables have a structure something like this.
create table a (
a_id integer primary key,
a_name varchar(30) not null unique
);
create table b (...
-- Just like a
create table your_data (
yet_another_id integer primary key,
a_id integer not null references a (a_id),
b_id integer not null references b (b_id),
c_id integer not null references c (c_id),
d_id integer not null references d (d_id),
unique (a_id, b_id, c_id, d_id),
-- other columns go here
);
And it's obvious that your_data will require four joins (at least) to get usable information from it.
But the names in table a, b, c, and d are unique (ahem), so you can use the unique names as targets for foreign key references. You could rewrite the table your_data like this.
create table your_data (
yet_another_id integer primary key,
a_name varchar(30) not null references a (a_name),
b_name varchar(30) not null references b (b_name),
c_name varchar(30) not null references c (c_name),
d_name varchar(30) not null references d (d_name),
unique (a_name, b_name, c_name, d_name),
-- other columns go here
);
Replacing id numbers with strings doesn't change the normal form. (And replacing strings with id numbers doesn't have anything to do with normalization.) If the original table were in 5NF, then this rewrite will be in 5NF, too.
But what about performance? Aren't id numbers plus joins supposed to be faster than strings?
I tested that by inserting 20 rows into each of the four tables a, b, c, and d. Then I generated a Cartesian product to fill one test table written with id numbers, and another using the names. (So, 160K rows in each.) I updated the statistics, and ran a couple of queries.
explain analyze
select a.a_name, b.b_name, c.c_name, d.d_name
from your_data_id
inner join a on (a.a_id = your_data_id.a_id)
inner join b on (b.b_id = your_data_id.b_id)
inner join c on (c.c_id = your_data_id.c_id)
inner join d on (d.d_id = your_data_id.d_id)
...
Total runtime: 808.472 ms
explain analyze
select a_name, b_name, c_name, d_name
from your_data
Total runtime: 132.098 ms
The query using id numbers takes a lot longer to execute. I used a WHERE clause on all four columns, which returns a single row.
explain analyze
select a.a_name, b.b_name, c.c_name, d.d_name
from your_data_id
inner join a on (a.a_id = your_data_id.a_id and a.a_name = 'a one')
inner join b on (b.b_id = your_data_id.b_id and b.b_name = 'b one')
inner join c on (c.c_id = your_data_id.c_id and c.c_name = 'c one')
inner join d on (d.d_id = your_data_id.d_id and d.d_name = 'd one')
...
Total runtime: 14.671 ms
explain analyze
select a_name, b_name, c_name, d_name
from your_data
where a_name = 'a one' and b_name = 'b one' and c_name = 'c one' and d_name = 'd one';
...
Total runtime: 0.133 ms
The tables using id numbers took about 100 times longer to query.
Tests used PostgreSQL 9.something.
My advice: Try before you buy. I mean, test before you invest. Try rewriting your data table to use natural keys. Think carefully about ON UPDATE CASCADE and ON DELETE CASCADE. Test performance with representative sample data. Edit your original question and let us know what you found.
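On the cascading point: with natural keys, ON UPDATE CASCADE is what lets a renamed lookup value propagate into the data table. A sketch against the rewritten your_data above (the constraint name is the default one PostgreSQL would generate and may differ):
alter table your_data
    drop constraint your_data_a_name_fkey,
    add constraint your_data_a_name_fkey
        foreign key (a_name) references a (a_name)
        on update cascade;

-- renaming a value in the lookup table now updates every referencing row:
update a set a_name = 'a one (renamed)' where a_name = 'a one';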

How can I calculate the top % daily price changes using MySQL?

I have a table called prices which includes the closing price of stocks that I am tracking daily.
Here is the schema:
CREATE TABLE `prices` (
`id` int(21) NOT NULL auto_increment,
`ticker` varchar(21) NOT NULL,
`price` decimal(7,2) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `ticker` (`ticker`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=2200 ;
I am trying to calculate the % price drop for anything that has a price value greater than 0 for today and yesterday. Over time, this table will be huge and I am worried about performance. I assume this will have to be done on the MySQL side rather than PHP because LIMIT will be needed here.
How do I take the last 2 dates and do the % drop calculation in MySQL though?
Any advice would be greatly appreciated.
One problem I see right off the bat is using a timestamp data type for the date. This will complicate your SQL query for two reasons: you will have to use a range or convert to an actual date in your WHERE clause, but, more importantly, since you state that you are interested in today's closing price and yesterday's closing price, you will have to keep track of the days when the market is open - so Monday's query is different than Tue - Fri, and any day the market is closed for a holiday will have to be accounted for as well.
I would add a column like mktDay and increment it each day the market is open for business. Another approach might be to include a 'previousClose' column, which makes your calculation trivial (see the sketch below). I realize this violates normal form, but it saves an expensive self join in your query.
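A sketch of how trivial the query becomes with such a denormalized column (previousClose is hypothetical, not in your current schema):
select
  `ticker`,
  `date`,
  `previousClose`,
  `price`,
  (`price` - `previousClose`) / `previousClose` as `pct_change`
from prices
where `price` > 0
  and `previousClose` > 0
  and date(`date`) = CURRENT_DATE
order by `pct_change` asc  -- biggest drops first
limit 10;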
If you cannot change the structure, then you will do a self join to get yesterday's close and you can calculate the % change and order by that % change if you wish.
Below is Eric's code, cleaned up a bit; it executed on my server running MySQL 5.0.27:
select
p_today.`ticker`,
p_today.`date`,
p_yest.price as `open`,
p_today.price as `close`,
((p_today.price - p_yest.price)/p_yest.price) as `change`
from
prices p_today
inner join prices p_yest on
p_today.ticker = p_yest.ticker
and date(p_today.`date`) = date(p_yest.`date`) + INTERVAL 1 DAY
and p_today.price > 0
and p_yest.price > 0
and date(p_today.`date`) = CURRENT_DATE
order by `change` desc
limit 10
Note the back-ticks as some of your column names and Eric's aliases were reserved words.
Also note that using a WHERE clause for the first table would be a less expensive query - the WHERE gets executed first and only has to attempt to self join on the rows that are greater than zero and have today's date:
select
p_today.`ticker`,
p_today.`date`,
p_yest.price as `open`,
p_today.price as `close`,
((p_today.price - p_yest.price)/p_yest.price) as `change`
from
prices p_today
inner join prices p_yest on
p_today.ticker = p_yest.ticker
and date(p_today.`date`) = date(p_yest.`date`) + INTERVAL 1 DAY
and p_yest.price > 0
where p_today.price > 0
and date(p_today.`date`) = CURRENT_DATE
order by `change` desc
limit 10
Scott brings up a great point about consecutive market days. I recommend handling this with a connector table like:
CREATE TABLE `market_days` (
`market_day` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`date` DATE NOT NULL DEFAULT '0000-00-00',
PRIMARY KEY USING BTREE (`market_day`),
UNIQUE KEY USING BTREE (`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=0
;
As more market days elapse, just INSERT new date values in the table. market_day will increment accordingly.
When inserting prices data, look up the LAST_INSERT_ID() or the value corresponding to a given date for past values, as sketched below.
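A sketch of that flow (the ticker_id and price values are made up, and the INSERT into prices assumes the revised prices schema shown just below):
-- register today as a market day and remember its id
INSERT INTO `market_days` (`date`) VALUES (CURRENT_DATE);
SET @today = LAST_INSERT_ID();

-- for historical loads, look the id up by date instead:
-- SELECT `market_day` FROM `market_days` WHERE `date` = '2011-01-03';

INSERT INTO `prices` (`market_day`, `ticker_id`, `price`)
VALUES (@today, 1, 42.17);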
As for the prices table itself, you can make storage, SELECT and INSERT operations much more efficient with a useful PRIMARY KEY and no AUTO_INCREMENT column. In the schema below, your PRIMARY KEY contains intrinsically useful information and isn't just a convention to identify unique rows. Using MEDIUMINT (3 bytes) instead of INT (4 bytes) saves an extra byte per row and more importantly 2 bytes per row in the PRIMARY KEY - all while still affording over 16 million possible dates and ticker symbols (each).
CREATE TABLE `prices` (
`market_day` MEDIUMINT(8) UNSIGNED NOT NULL DEFAULT '0',
`ticker_id` MEDIUMINT(8) UNSIGNED NOT NULL DEFAULT '0',
`price` decimal (7,2) NOT NULL DEFAULT '00000.00',
PRIMARY KEY USING BTREE (`market_day`,`ticker_id`),
KEY `ticker_id` USING BTREE (`ticker_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
;
In this schema each row is unique across each pair of market_day and ticker_id. Here ticker_id corresponds to a list of ticker symbols in a tickers table with a similar schema to the market_days table:
CREATE TABLE `tickers` (
`ticker_id` MEDIUMINT(8) UNSIGNED NOT NULL AUTO_INCREMENT,
`ticker_symbol` VARCHAR(5),
`company_name` VARCHAR(50),
/* etc */
PRIMARY KEY USING BTREE (`ticker_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=0
;
This yields a similar query to others proposed, but with two important differences: 1) There's no functional transformation on the date column, which destroys MySQL's ability to use keys on the join; in the query below MySQL will use part of the PRIMARY KEY to join on market_day. 2) MySQL can only use one key per JOIN or WHERE clause. In this query MySQL will use the full width of the PRIMARY KEY (market_day and ticker_id) whereas in the previous query it could only use one (MySQL will usually pick the more selective of the two).
SELECT
`market_days`.`date`,
`tickers`.`ticker_symbol`,
`yesterday`.`price` AS `close_yesterday`,
`today`.`price` AS `close_today`,
(`today`.`price` - `yesterday`.`price`) / (`yesterday`.`price`) AS `pct_change`
FROM
`prices` AS `today`
LEFT JOIN
`prices` AS `yesterday`
ON /* uses PRIMARY KEY */
`yesterday`.`market_day` = `today`.`market_day` - 1 /* this will join NULL for `today`.`market_day` = 0 */
AND
`yesterday`.`ticker_id` = `today`.`ticker_id`
INNER JOIN
`market_days` /* uses first 3 bytes of PRIMARY KEY */
ON
`market_days`.`market_day` = `today`.`market_day`
INNER JOIN
`tickers` /* uses KEY (`ticker_id`) */
ON
`tickers`.`ticker_id` = `today`.`ticker_id`
WHERE
`today`.`price` > 0
AND
`yesterday`.`price` > 0
;
A finer point is the need to also join against tickers and market_days in order to display the actual ticker_symbol and date, but these operations are very fast since they make use of keys.
Essentially, you can just join the table to itself to find the given % change. Then, order by change descending to get the largest changers on top. You could even order by abs(change) if you want the largest swings.
select
p_today.ticker,
p_today.date,
p_yest.price as open,
p_today.price as close,
-- Don't have to worry about 0 division here
(p_today.price - p_yest.price)/p_yest.price as change
from
prices p_today
inner join prices p_yest on
p_today.ticker = p_yest.ticker
and date(p_today.date) = date(date_add(p_yest.date, interval 1 day))
and p_today.price > 0
and p_yest.price > 0
and date(p_today.date) = CURRENT_DATE
order by change desc
limit 10