Apologies if this has been asked elsewhere. I have been searching Stack Overflow all day and haven't found an answer yet. I am struggling to write a query that finds the highest month's sales for each state from this example data.
The data looks like this:
| order_id | month | cust_id | state | prod_id | order_total |
+-----------+--------+----------+--------+----------+--------------+
| 67212 | June | 10001 | ca | 909 | 13 |
| 69090 | June | 10011 | fl | 44 | 76 |
... etc ...
My query
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders GROUP BY `month`, `state`
ORDER BY sales;
| month | state | sales |
+------------+--------+--------+
| September | wy | 435 |
| January | wy | 631 |
... etc ...
returns a few hundred rows: the sum of sales for each month for each state. I want it to only return the month with the highest sum of sales, but for each state. It might be a different month for different states.
This query
SELECT `state`, MAX(order_sum) as topmonth
FROM (SELECT `state`, SUM(order_total) order_sum FROM orders GROUP BY `month`,`state`)
GROUP BY `state`;
| state | topmonth |
+--------+-----------+
| ca | 119586 |
| ga | 30140 |
returns the correct number of rows with the correct data. BUT I would also like the query to give me the month column. Whatever I try with GROUP BY, I cannot find a way to limit the results to one record per state. I have tried PARTITION BY without success, and have also tried unsuccessfully to do a join.
TL;DR: one query gives me the correct columns but too many rows; the other query gives me the correct number of rows (and the correct data) but insufficient columns.
Any suggestions to make this work would be most gratefully received.
I am using Apache Drill, which is apparently ANSI-SQL compliant. Hopefully that doesn't make much difference - I am assuming that the solution would be similar across all SQL engines.
This one should do the trick
SELECT t1.`month`, t1.`state`, t1.`sales`
FROM (
/* this one selects month, state and sales*/
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders
GROUP BY `month`, `state`
) AS t1
JOIN (
/* this one selects the best value for each state */
SELECT `state`, MAX(sales) AS best_month
FROM (
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders
GROUP BY `month`, `state`
) AS monthly /* alias: optional in Drill, required by many other engines */
GROUP BY `state`
) AS t2
ON t1.`state` = t2.`state` AND
t1.`sales` = t2.`best_month`
It's basically the combination of the two queries you wrote.
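To see the join approach run end to end, here is a minimal sketch using Python's built-in sqlite3 module rather than Apache Drill. The rows are made up for illustration, and the repeated sub-query is written once as a CTE (the name monthly is my own), so treat it as a sketch of the idea rather than Drill-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, month TEXT, state TEXT, order_total INTEGER)"
)
# Made-up sample rows, not the asker's real data.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "June", "ca", 13),
        (2, "June", "ca", 20),
        (3, "July", "ca", 50),
        (4, "June", "fl", 76),
        (5, "July", "fl", 10),
    ],
)

query = """
WITH monthly AS (
    -- sales per month and state
    SELECT month, state, SUM(order_total) AS sales
    FROM orders
    GROUP BY month, state
)
SELECT m.month, m.state, m.sales
FROM monthly AS m
JOIN (
    -- best monthly total per state
    SELECT state, MAX(sales) AS best_sales
    FROM monthly
    GROUP BY state
) AS best
  ON m.state = best.state AND m.sales = best.best_sales
ORDER BY m.state
"""
top_months = conn.execute(query).fetchall()
print(top_months)
```

Note that if two months tie for the best total in a state, this join returns both rows for that state.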
Try this:
SELECT `month`, `state`, SUM(order_total) FROM orders WHERE `month` IN
( SELECT TOP 1 t.month FROM ( SELECT `month` AS month, SUM(order_total) order_sum FROM orders GROUP BY `month`
ORDER BY order_sum DESC) t)
GROUP BY `month`, state ;
I have a table like:
|------------------------|
|day name trees_planted|
|------------------------|
|1 | alice | 3 |
|2 | alice | 4 |
|1 | bob | 2 |
|2 | bob | 4 |
|------------------------|
I'm using SELECT name, SUM(trees_planted) FROM year2016 GROUP BY name to get:
name | trees_planted
alice | 7
bob | 6
But then I have another table from 2015 and I want to compare the results with the previous year, if for example Alice planted more trees in 2016 than in 2015 I'd get a result like this:
name | tree_difference
alice | -2 (if she planted 5 trees the previous year, 5 - 7 = -2)
bob | 0 (planted the same number of trees last year)
You could use a sub-query to get the records from both 2016 and 2015, but negate the values from 2016. Then group and sum like you already did:
SELECT name,
SUM(trees_planted) AS tree_difference
FROM (SELECT name, trees_planted
FROM year2015
UNION ALL
SELECT name, -trees_planted
FROM year2016
) AS years
GROUP BY name
This will also work for cases where a number is only given in one of the two years.
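A small runnable check of that claim, sketched with Python's sqlite3 module. The 2015 figures and the extra names carol and dave are invented for illustration; carol appears only in 2015 and dave only in 2016.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE year2015 (day INTEGER, name TEXT, trees_planted INTEGER);
CREATE TABLE year2016 (day INTEGER, name TEXT, trees_planted INTEGER);
-- invented 2015 data; carol has no 2016 rows
INSERT INTO year2015 VALUES (1, 'alice', 5), (1, 'bob', 6), (1, 'carol', 3);
-- 2016 data from the question, plus dave who has no 2015 rows
INSERT INTO year2016 VALUES
  (1, 'alice', 3), (2, 'alice', 4), (1, 'bob', 2), (2, 'bob', 4), (1, 'dave', 1);
""")

query = """
SELECT name, SUM(trees_planted) AS tree_difference
FROM (SELECT name, trees_planted FROM year2015
      UNION ALL
      SELECT name, -trees_planted FROM year2016) AS years
GROUP BY name
ORDER BY name
"""
differences = conn.execute(query).fetchall()
print(differences)
```

Names present in only one year still get a row: a positive difference when they planted only in 2015, a negative one when they planted only in 2016.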
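Assuming you can join on the name field, you can do the following (the difference is computed as 2015 minus 2016, to match the sign in your example):
select a.name, a.tp as trees_2016, b.tp as trees_2015, b.tp - a.tp as tree_difference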
from
(
(select name, SUM(trees_planted) tp from year2016 group by name) a
inner join
(select name, SUM(trees_planted) tp from year2015 group by name) b
using(name)
)
If you can't join on name (because the 2015 and 2016 tables contain different sets of names), it is easy to add the missing rows with a couple of UNION clauses.
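One possible reading of the "couple of union clauses" idea, sketched with Python's sqlite3 module: a LEFT JOIN from 2015 covers everyone who planted in 2015, and a second branch adds the names that appear only in 2016. The 2015 figures, the names carol and dave, and the CTE names t15 and t16 are my own assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE year2015 (day INTEGER, name TEXT, trees_planted INTEGER);
CREATE TABLE year2016 (day INTEGER, name TEXT, trees_planted INTEGER);
INSERT INTO year2015 VALUES (1, 'alice', 5), (1, 'bob', 6), (1, 'carol', 3);
INSERT INTO year2016 VALUES
  (1, 'alice', 3), (2, 'alice', 4), (1, 'bob', 2), (2, 'bob', 4), (1, 'dave', 1);
""")

query = """
WITH t15 AS (SELECT name, SUM(trees_planted) AS tp FROM year2015 GROUP BY name),
     t16 AS (SELECT name, SUM(trees_planted) AS tp FROM year2016 GROUP BY name)
-- everyone who planted in 2015 (a missing 2016 total counts as 0)
SELECT t15.name, t15.tp - COALESCE(t16.tp, 0) AS tree_difference
FROM t15 LEFT JOIN t16 ON t16.name = t15.name
UNION ALL
-- names that appear only in 2016
SELECT t16.name, -t16.tp
FROM t16 LEFT JOIN t15 ON t15.name = t16.name
WHERE t15.name IS NULL
ORDER BY 1
"""
differences = conn.execute(query).fetchall()
print(differences)
```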
Here's a link with artificial data to SQLFIDDLE to try the query.
Let's say I have the following database table (date truncated for example only; the two 'id_' prefix columns join with other tables)...
+-----------+---------+------+--------------------+-------+
| id_table1 | id_tab2 | date | description | price |
+-----------+---------+------+--------------------+-------+
| 1 | 11 | 2014 | man-eating-waffles | 1.46 |
+-----------+---------+------+--------------------+-------+
| 2 | 22 | 2014 | Flying Shoes | 8.99 |
+-----------+---------+------+--------------------+-------+
| 3 | 44 | 2015 | Flying Shoes | 12.99 |
+-----------+---------+------+--------------------+-------+
...and I have a query like the following...
SELECT id, date, description FROM inventory ORDER BY date ASC;
How do I SELECT each description only once, keeping only the row with the latest year for that description? So I need the query to return the first and last rows from the sample data above; the second row is not returned because the last row has the same description and a later date.
Postgres has something called distinct on. This is usually more efficient than using window functions. So, an alternative method would be:
SELECT distinct on (description) id, date, description
FROM inventory
ORDER BY description, date desc;
The row_number window function should do the trick:
SELECT id, date, description
FROM (SELECT id, date, description,
ROW_NUMBER() OVER (PARTITION BY description
ORDER BY date DESC) AS rn
FROM inventory) t
WHERE rn = 1
ORDER BY date ASC;
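A quick way to try the ROW_NUMBER() query against the sample rows, sketched with Python's sqlite3 module (this assumes an SQLite build with window-function support, 3.25 or later). The single id column stands in for the question's id_ columns, and id is added to the final ORDER BY only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id INTEGER, date INTEGER, description TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?, ?)",
    [
        (1, 2014, "man-eating-waffles", 1.46),
        (2, 2014, "Flying Shoes", 8.99),
        (3, 2015, "Flying Shoes", 12.99),
    ],
)

query = """
SELECT id, date, description
FROM (SELECT id, date, description,
             ROW_NUMBER() OVER (PARTITION BY description
                                ORDER BY date DESC) AS rn
      FROM inventory) AS t
WHERE rn = 1            -- newest row per description
ORDER BY date ASC, id ASC
"""
latest_rows = conn.execute(query).fetchall()
print(latest_rows)
```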
I have long format data on businesses, with a row for each occurrence of a move to a different location, keyed on business id -- there can be several move events for any one business establishment.
I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.
+-------------+-----------+---------+---------+
| business_id | year_move | long | lat |
+-------------+-----------+---------+---------+
| 001013580 | 1991 | 71.0557 | 42.3588 |
| 001015924 | 1993 | 71.0728 | 42.3504 |
| 001015924 | 1996 | -122.28 | 37.654 |
| 001020684 | 1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+
Then I transform like so:
SELECT longbyyear.*
FROM crosstab($$
SELECT
business_id,
year_move,
max(longitude::float)
from business_moves
where year_move::int between 1991 and 2010
group by business_id, year_move
order by business_id, year_move;
$$
)
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float,
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);
And it --mostly-- gets me to the desired output.
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id | long91 | long92 | long93 | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … | | | |
| 1000678 | 118.224 | | | | … | | | |
| 1002158 | 121.98 | | | | … | | | |
| 1004092 | 71.2384 | | | | … | | | |
| 1007801 | 118.0312 | | | | … | | | |
| 1007855 | 71.1769 | | | | … | | | |
| 1008697 | 71.0394 | 71.0358 | | | … | | | |
| 1008986 | 71.1013 | | | | … | | | |
| 1009617 | 119.9965 | | | | … | | | |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
The only snag is that I would ideally have populated values for each year, not just for the move years. Every field would then hold a value, with the most recent address carrying over to the following years. I could hack this with manual updates (if a cell is blank, use the previous column), but I wondered whether there is a clever way to do it, either with the crosstab() function or some other way, possibly coupled with a custom function.
In order to get the current location for each business_id for any given year you need two things:
A parameterized query to select the year, implemented as a SQL language function.
A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.
The function then looks like this:
CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
WITH last_move AS (
SELECT business_id, MAX(year_move) AS yr
FROM business_moves
WHERE year_move <= $1
GROUP BY business_id)
SELECT lm.business_id, $1::int AS yr, longitude, latitude
FROM business_moves bm, last_move lm
WHERE bm.business_id = lm.business_id
AND bm.year_move = lm.yr;
$$ LANGUAGE sql;
The sub-query selects only the most recent move for each business_id. The main query then adds the longitude and latitude columns and puts the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need a record in this table that gives the establishment and initial location of each business_id, or the business will not show up until after it has moved somewhere else.
Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.
If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.
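The same "latest move at or before year X" logic can be tried outside Postgres. Below is a sketch using Python's sqlite3 module, where the SQL function is replaced by a Python helper (location_in_year is a name I made up) and the year is passed as a bound parameter. The rows are the four sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business_moves "
    "(business_id TEXT, year_move INTEGER, longitude REAL, latitude REAL)"
)
conn.executemany(
    "INSERT INTO business_moves VALUES (?, ?, ?, ?)",
    [
        ("001013580", 1991, 71.0557, 42.3588),
        ("001015924", 1993, 71.0728, 42.3504),
        ("001015924", 1996, -122.28, 37.654),
        ("001020684", 1992, 84.3381, 33.5775),
    ],
)

def location_in_year(conn, year):
    """Return each business's most recent location as of the given year."""
    query = """
    WITH last_move AS (
        SELECT business_id, MAX(year_move) AS yr
        FROM business_moves
        WHERE year_move <= :year
        GROUP BY business_id
    )
    SELECT bm.business_id, :year AS yr, bm.longitude, bm.latitude
    FROM business_moves AS bm
    JOIN last_move AS lm
      ON bm.business_id = lm.business_id AND bm.year_move = lm.yr
    ORDER BY bm.business_id
    """
    return conn.execute(query, {"year": year}).fetchall()

for row in location_in_year(conn, 1996):
    print(row)
```

Calling the helper with 1991 returns only business 001013580, which illustrates the caveat above: a business does not appear before its first recorded row.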
I assume you have actual dates for each business move, so we can make meaningful picks per year:
CREATE TEMP TABLE business_moves (
business_id int, -- why would you use inefficient varchar here?
move_date date,
longitude float,
latitude float);
Building on this, a more meaningful test case:
INSERT INTO business_moves VALUES
(001013580, '1991-1-1', 71.0557, 42.3588),
(001015924, '1993-1-1', 71.0728, 42.3504),
(001015924, '1993-3-3', 73.0728, 43.3504), -- 2nd move this year
(001015924, '1996-1-1', -122.28, 37.654),
(001020684, '1992-1-1', 84.3381, 33.5775);
Complete, very fast solution
SELECT *
FROM crosstab($$
SELECT business_id, year
, first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
FROM (
SELECT *
, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
) sub
$$
,'VALUES
(1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
, x91 point, x92 point, x93 point, x94 point, x95 point
, x96 point, x97 point, x98 point, x99 point, x00 point
, x01 point, x02 point, x03 point, x04 point, x05 point
, x06 point, x07 point, x08 point, x09 point, x10 point);
Result:
biz_id | x91 | x92 | x93 | x94 | x95 | x96 | x97 ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
1015924 | | | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654) | (-122.28,37.654) ...
1020684 | | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...
Step-by-step
Step 1
Repair what you had:
SELECT *
FROM crosstab($$
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date) AS year
, point(longitude, latitude) AS long_lat
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
$$
,'VALUES
(1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
, x91 point, x92 point, x93 point, x94 point, x95 point
, x96 point, x97 point, x98 point, x99 point, x00 point
, x01 point, x02 point, x03 point, x04 point, x05 point
, x06 point, x07 point, x08 point, x09 point, x10 point);
You want lat & lon to make it meaningful, so form a point from both. Alternatively, you could just concatenate a text representation.
You may want even more data. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
Select first row in each GROUP BY group?
As long as there can be missing values for the whole grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
PostgreSQL Crosstab Query
Adapted the function to work with move_date date instead of year_move.
Step 2
To address your request:
I would ideally have populated values for each year
Build a full grid of values (one cell per business and year) with a CROSS JOIN of businesses and years:
SELECT *
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
The set of years comes from a generate_series() call.
Distinct businesses come from a separate SELECT. If you have a table of businesses, you could use that instead (which would be cheaper). This would also account for businesses that never moved.
LEFT JOIN to actual business moves per year to arrive at a full grid of values.
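For readers without a Postgres instance at hand, the grid-building step can be approximated in SQLite through Python's sqlite3 module. This is a sketch under a few assumptions of mine: a recursive CTE stands in for generate_series(), ROW_NUMBER() stands in for DISTINCT ON, dates are stored as ISO text, longitude and latitude stay as separate columns, and the year range is cut to 1991-1996 to keep the output small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE business_moves
  (business_id INTEGER, move_date TEXT, longitude REAL, latitude REAL);
INSERT INTO business_moves VALUES
  (1013580, '1991-01-01',   71.0557, 42.3588),
  (1015924, '1993-01-01',   71.0728, 42.3504),
  (1015924, '1993-03-03',   73.0728, 43.3504),  -- 2nd move that year
  (1015924, '1996-01-01', -122.28,   37.654),
  (1020684, '1992-01-01',   84.3381, 33.5775);
""")

grid_query = """
WITH RECURSIVE years(year) AS (
    -- stand-in for generate_series(1991, 1996)
    SELECT 1991 UNION ALL SELECT year + 1 FROM years WHERE year < 1996
),
latest AS (
    -- rn = 1 marks the latest move per business and year
    SELECT business_id,
           CAST(strftime('%Y', move_date) AS INTEGER) AS year,
           longitude, latitude,
           ROW_NUMBER() OVER (
               PARTITION BY business_id, strftime('%Y', move_date)
               ORDER BY move_date DESC) AS rn
    FROM business_moves
)
SELECT b.business_id, y.year, l.longitude, l.latitude
FROM (SELECT DISTINCT business_id FROM business_moves) AS b
CROSS JOIN years AS y
LEFT JOIN latest AS l
       ON l.business_id = b.business_id AND l.year = y.year AND l.rn = 1
ORDER BY b.business_id, y.year
"""
grid = conn.execute(grid_query).fetchall()
for row in grid:
    print(row)
```

Every business gets one row per year; years without a move carry NULL coordinates, which is what the fill step afterwards works on.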
Step 3
Fill in defaults:
with the most recent address carrying over to the next year.
SELECT business_id, year
, COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
,'(0,0)') AS x
FROM (
SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
) sub;
In the subquery sub, building on the query from step 2, form groups (grp) of cells that share the same location.
For this purpose, use the well-known aggregate function count() as a window aggregate function. NULL values don't count, so the value increases with every actual move, thereby forming groups of cells that share the same location.
In the outer query, pick the first value per group for each row in the same group using the window function first_value(). Voilà.
To top it off, optionally(!) wrap that in COALESCE to fill the remaining cells with unknown location (no move yet) with (0,0). If you do that, there are no remaining NULL values, and you can use the simpler form of crosstab(). That's a matter of taste.
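The count() / first_value() trick is easier to follow on a tiny grid. Here is a sketch using Python's sqlite3 module (window functions need SQLite 3.25 or later); the table grid and the placeholder locations 'A' to 'D' are invented for illustration and stand in for the point values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE grid (business_id INTEGER, year INTEGER, x TEXT);
-- NULL means "no move recorded that year"
INSERT INTO grid VALUES
  (1, 1991, NULL), (1, 1992, 'A'), (1, 1993, NULL), (1, 1994, 'B'), (1, 1995, NULL),
  (2, 1991, 'C'),  (2, 1992, NULL), (2, 1993, NULL), (2, 1994, NULL), (2, 1995, 'D');
""")

fill_query = """
SELECT business_id, year,
       -- the first row of each group holds the location for the whole group
       first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
FROM (
    -- running count of non-NULL values: increases by one at every move
    SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
    FROM grid
) AS sub
ORDER BY business_id, year
"""
filled = conn.execute(fill_query).fetchall()
for row in filled:
    print(row)
```

Business 1 stays NULL for 1991 because no move precedes that year; that is the cell the optional COALESCE would replace.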
SQL Fiddle with base queries. crosstab() is not currently installed on SQL Fiddle.
Step 4
Use the query from step 3 in an updated crosstab() call.
All in all, this should be as fast as it gets. Indexes may help some more.