I have two tables:
orders
| id | item_id | quantity | ordered_on |
|----|---------|----------|------------|
| 1 | 1 | 2 | 2016-03-09 |
| 2 | 1 | 2 | 2016-03-12 |
| 3 | 4 | 3 | 2016-03-15 |
| 4 | 4 | 3 | 2016-03-13 |
stocks
| id | item_id | quantity | enter_on | expire_on |
|----|---------|----------|------------|------------|
| 1 | 1 | 10 | 2016-03-07 | 2016-03-10 |
| 2 | 1 | 20 | 2016-03-11 | 2016-03-15 |
| 3 | 1 | 20 | 2016-03-14 | 2016-03-17 |
| 4 | 4 | 10 | 2016-03-14 | NULL |
| 5 | 4 | 10 | 2016-03-12 | NULL |
I'm trying to create a view that shows the orders along with the enter_on of their closest stocks, like this (include_after and include_before mark the date range during which a preordered item should be excluded, so the stock reflects correctly):
include_after is always the enter_on of the stock that came in but has not expired yet; if it expired, show NULL. include_before is always the enter_on of the next incoming stock, unless there's an expire_on that's earlier than the next enter_on.
| item_id | quantity | ordered_on | include_after | include_before |
|---------|----------|------------|---------------|----------------|
| 1 | 2 | 2016-03-09 | 2016-03-07 | 2016-03-10 |
| 1 | 2 | 2016-03-12 | 2016-03-11 | 2016-03-14 |
| 4 | 3 | 2016-03-13 | 2016-03-12 | 2016-03-14 |
| 4 | 3 | 2016-03-15 | 2016-03-14 | NULL |
So this is what I came up with:
SELECT o.item_id, o.quantity, o.ordered_on
     , (SELECT COALESCE(MAX(s.enter_on), NULL::DATE)
        FROM   stocks s
        WHERE  s.enter_on <= o.ordered_on
        AND    s.item_id = o.item_id) AS include_after
     , (SELECT COALESCE(MIN(s.enter_on), NULL::DATE)
        FROM   stocks s
        WHERE  s.enter_on > o.ordered_on
        AND    s.item_id = o.item_id) AS include_before
FROM   orders o;
It works fine (I haven't included the expire_on part yet), but I'm worried about performance issues from using two subqueries in the SELECT.
Does anyone have some alternative suggestions?
UPDATE
I'm using PostgreSQL 9.4 (can't add any more tags).
The actual problem is far more complicated than stated here: it involves a lot of tables and views joined together. I shrunk it down to just one table to grasp the concept, in case there are alternatives.
You should worry about performance when the situation arises. For the example that you provided, an index on stocks(item_id, enter_on, expire_on) should be sufficient. Actually, you might want two indexes: stocks(item_id, enter_on, expire_on) and stocks(item_id, enter_on DESC, expire_on).
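A minimal sketch of those indexes (the index names are my own):

-- Covers the correlated lookups in both directions.
CREATE INDEX stocks_item_enter_idx      ON stocks (item_id, enter_on, expire_on);
CREATE INDEX stocks_item_enter_desc_idx ON stocks (item_id, enter_on DESC, expire_on);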
If the performance is not sufficient, you have two choices. One is a GiST index for ranges. (Here is an interesting discussion of the issue.) The second is an alternative query formulation.
However, I would wait to optimize the query until there is enough data to show a performance problem. Solutions tuned on smaller amounts of data just might not scale well.
The following discusses the query you display, likewise without considering expire_on.
COALESCE for a correlated subquery
First off, the expression COALESCE(anything, NULL) never makes sense. You would replace NULL with NULL.
Aggregate functions like max() return NULL anyway if no qualifying row is found; the aggregate prevents a "no row" result. (The exception is count(), which returns 0.)
A correlated subquery that would return "no row" (like the variant with ORDER BY ... LIMIT 1 I demonstrate below) defaults to NULL for the column value.
So, if you wanted to use COALESCE in this context, you would wrap it around the correlated subquery as a whole - and provide a default for NULL.
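A sketch of what that would look like for include_after; the default value '-infinity' is just a placeholder I picked:

SELECT o.item_id, o.quantity, o.ordered_on
     , COALESCE( (SELECT max(s.enter_on)
                  FROM   stocks s
                  WHERE  s.enter_on <= o.ordered_on
                  AND    s.item_id  = o.item_id)
               , '-infinity') AS include_after  -- default replaces NULL from "no stock yet"
FROM   orders o;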
Query
I'm worried about performance issues from using two subqueries in the SELECT.
It depends.
If there are only a few rows per item_id in table stocks, and/or only an index on stocks(item_id), then it makes sense to merge the two correlated subqueries into a single LATERAL subquery with conditional aggregates:
SELECT o.item_id, o.quantity, o.ordered_on
     , s.include_after, s.include_before
FROM   orders o
     , LATERAL (
   SELECT max(enter_on) FILTER (WHERE enter_on <= o.ordered_on) AS include_after
        , min(enter_on) FILTER (WHERE enter_on >  o.ordered_on) AS include_before
   FROM   stocks
   WHERE  item_id = o.item_id
   ) s;
Since the subquery returns a row in any case due to the aggregate functions, a simple CROSS JOIN is fine. Else you might want to use LEFT JOIN LATERAL (...) ON true. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
The aggregate FILTER clause requires Postgres 9.4+. There are alternatives for older versions. See:
Aggregate columns with additional (distinct) filters
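One such alternative, for illustration, replaces FILTER with conditional CASE expressions inside the aggregates (this particular form still needs LATERAL, i.e. Postgres 9.3; for even older versions, put the CASE expressions in the original correlated subqueries):

SELECT o.item_id, o.quantity, o.ordered_on
     , s.include_after, s.include_before
FROM   orders o
     , LATERAL (
   SELECT max(CASE WHEN enter_on <= o.ordered_on THEN enter_on END) AS include_after
        , min(CASE WHEN enter_on >  o.ordered_on THEN enter_on END) AS include_before
   FROM   stocks
   WHERE  item_id = o.item_id
   ) s;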
If, on the other hand, you have many rows per item_id in table stocks and an index ON stocks (item_id, enter_on), your query might still be faster. Or this slightly adapted version (test both!):
SELECT o.item_id, o.quantity, o.ordered_on
     , (SELECT s.enter_on
        FROM   stocks s
        WHERE  s.item_id = o.item_id
        AND    s.enter_on <= o.ordered_on
        ORDER  BY 1 DESC NULLS LAST
        LIMIT  1) AS include_after
     , (SELECT s.enter_on
        FROM   stocks s
        WHERE  s.item_id = o.item_id
        AND    s.enter_on > o.ordered_on
        ORDER  BY 1
        LIMIT  1) AS include_before
FROM   orders o;
That's because each of the two correlated subqueries can be resolved with a single index lookup.
To optimize performance, you might need a second index on stocks(item_id, enter_on DESC NULLS LAST). But don't create specialized indexes unless you actually need to squeeze out more read performance for this query (keyword: premature optimization).
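If it does come to that, the second index might look like this (name is my own), matching ORDER BY enter_on DESC NULLS LAST in the first subquery:

CREATE INDEX stocks_item_enter_desc_nulls_idx ON stocks (item_id, enter_on DESC NULLS LAST);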
Detailed discussion in this related answer:
Optimize GROUP BY query to retrieve latest row per user
I have long-format data on businesses, with a row for each occurrence of a move to a different location, keyed on business_id; there can be several move events for any one business establishment.
I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.
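(For reference, crosstab() ships with that additional module; it is installed once per database:)

CREATE EXTENSION IF NOT EXISTS tablefunc;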
+-------------+-----------+-----------+----------+
| business_id | year_move | longitude | latitude |
+-------------+-----------+-----------+----------+
| 001013580   | 1991      | 71.0557   | 42.3588  |
| 001015924   | 1993      | 71.0728   | 42.3504  |
| 001015924   | 1996      | -122.28   | 37.654   |
| 001020684   | 1992      | 84.3381   | 33.5775  |
+-------------+-----------+-----------+----------+
Then I transform like so:
SELECT longbyyear.*
FROM crosstab($$
SELECT
business_id,
year_move,
max(longitude::float)
from business_moves
where year_move::int between 1991 and 2010
group by business_id, year_move
order by business_id, year_move;
$$
)
AS longbyyear(biz_id character varying
            , "long91" float, "long92" float, "long93" float, "long94" float, "long95" float
            , "long96" float, "long97" float, "long98" float, "long99" float, "long00" float
            , "long01" float, "long02" float, "long03" float, "long04" float, "long05" float
            , "long06" float, "long07" float, "long08" float, "long09" float, "long10" float);
And it mostly gets me to the desired output.
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id | long91 | long92 | long93 | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … | | | |
| 1000678 | 118.224 | | | | … | | | |
| 1002158 | 121.98 | | | | … | | | |
| 1004092 | 71.2384 | | | | … | | | |
| 1007801 | 118.0312 | | | | … | | | |
| 1007855 | 71.1769 | | | | … | | | |
| 1008697 | 71.0394 | 71.0358 | | | … | | | |
| 1008986 | 71.1013 | | | | … | | | |
| 1009617 | 119.9965 | | | | … | | | |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
The only snag is that I would ideally have populated values for each year, not just values in the move years. Thus all fields would be populated, with the most recent address carrying over to the next year. I could hack this with manual updates (if a field is blank, use the previous column), but I wondered if there is a clever way to do it, either with the crosstab() function or some other way, possibly coupled with a custom function.
In order to get the current location for each business_id for any given year you need two things:
A parameterized query to select the year, implemented as a SQL language function.
A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.
The function then looks like this:
CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
WITH last_move AS (
SELECT business_id, MAX(year_move) AS yr
FROM business_moves
WHERE year_move <= $1
GROUP BY business_id)
SELECT lm.business_id, $1::int AS yr, longitude, latitude
FROM business_moves bm, last_move lm
WHERE bm.business_id = lm.business_id
AND bm.year_move = lm.yr;
$$ LANGUAGE sql;
The sub-query selects only the most recent move for every business location. The main query then adds the longitude and latitude columns and puts the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need a record in this table that gives the establishment and initial location of each business_id, or the business will not show up until after it has moved somewhere else.
Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.
If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.
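A sketch of that tweak, assuming the function above: call it once per year with LATERAL (Postgres 9.3+) to produce one row per business and year, which could then feed a crosstab() query:

SELECT bm.*
FROM   generate_series(1991, 2010) AS y(yr)
CROSS  JOIN LATERAL business_location_in_year_x(y.yr) bm
ORDER  BY bm.business_id, bm.year_move;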
I assume you have actual dates for each business move, so we can make meaningful picks per year:
CREATE TEMP TABLE business_moves (
business_id int, -- why would you use inefficient varchar here?
move_date date,
longitude float,
latitude float);
Building on this, a more meaningful test case:
INSERT INTO business_moves VALUES
(001013580, '1991-1-1', 71.0557, 42.3588),
(001015924, '1993-1-1', 71.0728, 42.3504),
(001015924, '1993-3-3', 73.0728, 43.3504), -- 2nd move this year
(001015924, '1996-1-1', -122.28, 37.654),
(001020684, '1992-1-1', 84.3381, 33.5775);
Complete, very fast solution
SELECT *
FROM crosstab($$
SELECT business_id, year
, first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
FROM (
SELECT *
, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
) sub
$$
,'VALUES
(1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
, x91 point, x92 point, x93 point, x94 point, x95 point
, x96 point, x97 point, x98 point, x99 point, x00 point
, x01 point, x02 point, x03 point, x04 point, x05 point
, x06 point, x07 point, x08 point, x09 point, x10 point);
Result:
biz_id | x91 | x92 | x93 | x94 | x95 | x96 | x97 ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
1015924 | | | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654) | (-122.28,37.654) ...
1020684 | | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...
Step-by-step
Step 1
Repair what you had:
SELECT *
FROM crosstab($$
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS long_lat
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
$$
,'VALUES
(1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
, x91 point, x92 point, x93 point, x94 point, x95 point
, x96 point, x97 point, x98 point, x99 point, x00 point
, x01 point, x02 point, x03 point, x04 point, x05 point
, x06 point, x07 point, x08 point, x09 point, x10 point);
You want lat & lon to make it meaningful, so form a point from both. Alternatively, you could just concatenate a text representation.
You may want even more columns. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
Select first row in each GROUP BY group?
Since there can be missing values across the whole grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
PostgreSQL Crosstab Query
I adapted the query to work with move_date date instead of year_move.
Step 2
To address your request:
I would ideally have populated values for each year
Build a full grid of values (one cell per business and year) with a CROSS JOIN of businesses and years:
SELECT *
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
The set of years comes from a generate_series() call.
Distinct businesses come from a separate SELECT. Maybe you have a businesses table you could use instead (cheaper)? This would also account for businesses that never moved.
LEFT JOIN to actual business moves per year to arrive at a full grid of values.
Step 3
Fill in defaults:
with the most recent address carrying over to the next year.
SELECT business_id, year
, COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
,'(0,0)') AS x
FROM (
SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
) sub;
In subquery sub, building on the query from step 2, form groups (grp) of cells that share the same location.
For this purpose, utilize the well-known aggregate function count() as a window aggregate function. NULL values don't count, so the value increases with every actual move, thereby forming groups of cells that share the same location.
In the outer query, pick the first value per group for each row in the same group with the window function first_value(). Voilà.
To top it off, optionally(!) wrap that in COALESCE to fill the remaining cells, where the location is still unknown (no move yet), with (0,0). Then there are no remaining NULL values, and you can use the simpler form of crosstab(). That's a matter of taste.
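To see the grouping principle in isolation, a tiny standalone demo with made-up values (count(x) does not increment for NULL):

SELECT year, x
     , count(x) OVER (ORDER BY year) AS grp
FROM  (VALUES (1991, 'a'), (1992, NULL), (1993, 'b'), (1994, NULL)) v(year, x);

-- year | x | grp
-- 1991 | a | 1
-- 1992 |   | 1   <- inherits the group of the preceding move
-- 1993 | b | 2
-- 1994 |   | 2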
SQL Fiddle with base queries. crosstab() is not currently installed on SQL Fiddle.
Step 4
Use the query from step 3 in an updated crosstab() call.
All in all, this should be as fast as it gets. Indexes may help some more.
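If it comes to that, an index like this (hypothetical name) would support the DISTINCT ON subquery:

CREATE INDEX business_moves_special_idx ON business_moves (business_id, move_date DESC);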