GROUP BY and aggregate sequential numeric values

Using PostgreSQL 9.0.
Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:
Example Table:
+---------+------------+------+
| company | profession | year |
+---------+------------+------+
| Google  | Programmer | 2000 |
| Google  | Sales      | 2000 |
| Google  | Sales      | 2001 |
| Google  | Sales      | 2002 |
| Google  | Sales      | 2004 |
| Mozilla | Sales      | 2002 |
+---------+------------+------+
I'm interested in a query which would output rows similar to the following:
+---------+------------+------------------+
| company | profession | year             |
+---------+------------+------------------+
| Google  | Programmer | [2000]           |
| Google  | Sales      | [2000,2001,2002] |
| Google  | Sales      | [2004]           |
| Mozilla | Sales      | [2002]           |
+---------+------------+------------------+
The essential feature is that only consecutive years shall be grouped together.

Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).
The first step is to identify non-consecutive values for the year:
Step 1) Identify non-consecutive values
select company,
       profession,
       year,
       case
         when row_number() over (partition by company, profession order by year) = 1 or
              year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
         else 0
       end as group_cnt
from qualification
This returns the following result:
company | profession | year | group_cnt
---------+------------+------+-----------
Google | Programmer | 2000 | 1
Google | Sales | 2000 | 1
Google | Sales | 2001 | 0
Google | Sales | 2002 | 0
Google | Sales | 2004 | 1
Mozilla | Sales | 2002 | 1
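To see what the CASE expression computes, here is a minimal Python sketch of the same flagging logic, run over the sample rows from the question. It is only an illustration and not part of the original answer; the function name flag_group_starts is made up for this example.

```python
from itertools import groupby

# Sample rows from the question: (company, profession, year).
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]


def flag_group_starts(rows):
    """Return (company, profession, year, group_cnt) tuples, mirroring Step 1."""
    flagged = []
    # groupby() over sorted rows plays the role of PARTITION BY ... ORDER BY year.
    for _, partition in groupby(sorted(rows), key=lambda r: r[:2]):
        prev_year = None  # corresponds to lag(year) within the partition
        for company, profession, year in partition:
            # A row starts a new group if it is the first row of its partition
            # or if there is a gap of more than one year to the previous row.
            starts_group = prev_year is None or year - prev_year > 1
            flagged.append((company, profession, year, 1 if starts_group else 0))
            prev_year = year
    return flagged


for row in flag_group_starts(rows):
    print(row)
```

The last element of each printed tuple corresponds to the group_cnt column in the result above.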
Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:
Step 2) Define group IDs
select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
from (
  select company,
         profession,
         year,
         case
           when row_number() over (partition by company, profession order by year) = 1 or
                year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
           else 0
         end as group_cnt
  from qualification
) t1
This returns the following result:
company | profession | year | group_nr
---------+------------+------+----------
Google | Programmer | 2000 | 1
Google | Sales | 2000 | 2
Google | Sales | 2001 | 2
Google | Sales | 2002 | 2
Google | Sales | 2004 | 3
Mozilla | Sales | 2002 | 4
(6 rows)
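The running sum is what turns the start flags into group numbers. A small Python sketch of that step, assuming the flagged rows from Step 1 are already sorted by company, profession and year; itertools.accumulate stands in for the sum() window function here, and the variable names are illustrative:

```python
from itertools import accumulate

# Output of Step 1: (company, profession, year, group_cnt), already sorted.
flagged = [
    ("Google", "Programmer", 2000, 1),
    ("Google", "Sales", 2000, 1),
    ("Google", "Sales", 2001, 0),
    ("Google", "Sales", 2002, 0),
    ("Google", "Sales", 2004, 1),
    ("Mozilla", "Sales", 2002, 1),
]

# Running total of the flags: it increases by one at every group start,
# so all rows of one consecutive run share the same number.
group_nrs = list(accumulate(flag for *_, flag in flagged))

numbered = [
    (company, profession, year, nr)
    for (company, profession, year, _), nr in zip(flagged, group_nrs)
]
```

Because the total only increases at rows flagged with 1, the numbers are unique across companies and professions as well, which is why the final query can group by group_nr.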
As you can see, each "group" got its own group_nr, which we can finally use for the aggregation by adding yet another derived table:
Step 3) Final query
select company,
       profession,
       array_agg(year) as years
from (
  select company,
         profession,
         year,
         sum(group_cnt) over (order by company, profession, year) as group_nr
  from (
    select company,
           profession,
           year,
           case
             when row_number() over (partition by company, profession order by year) = 1 or
                  year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
             else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
company | profession | years
---------+------------+------------------
Google | Programmer | {2000}
Google | Sales | {2000,2001,2002}
Google | Sales | {2004}
Mozilla | Sales | {2002}
(4 rows)
Which is exactly what you wanted, if I'm not mistaken.

There's much value in @a_horse_with_no_name's answer, both as a correct solution and, as I already said in a comment, as good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.
That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:
WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID
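Why this works: within a run of consecutive years, both the year and the row number grow by one per row, so their difference stays constant, and it changes as soon as a year is skipped. A minimal Python sketch of the same idea, assuming (as the query implicitly does) that a year appears at most once per company and profession; the function name consecutive_runs is made up for this illustration:

```python
from itertools import groupby

# Sample rows from the question: (company, profession, year).
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]


def consecutive_runs(rows):
    """Group years into consecutive runs per (company, profession)."""
    result = []
    for (company, profession), partition in groupby(sorted(rows), key=lambda r: r[:2]):
        # enumerate() plays the role of ROW_NUMBER(); year - position is the seqID.
        marked = [(year - pos, year) for pos, (_, _, year) in enumerate(partition, start=1)]
        # Rows with the same seqID belong to the same run of consecutive years.
        for _, run in groupby(marked, key=lambda pair: pair[0]):
            result.append((company, profession, [year for _, year in run]))
    return result


for row in consecutive_runs(rows):
    print(row)
```

For Google / Sales the differences are 1999, 1999, 1999 and 2000, so 2004 ends up in a group of its own.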

Procedural solution with PL/pgSQL
The problem is rather unwieldy for plain SQL with aggregate / window functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:
Test table:
CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
('Google', 'Programmer', 2000)
, ('Google', 'Sales', 2000)
, ('Google', 'Sales', 2001)
, ('Google', 'Sales', 2002)
, ('Google', 'Sales', 2004)
, ('Mozilla', 'Sales', 2002)
;
Function:
CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[])
  LANGUAGE plpgsql AS
$func$
DECLARE
   r  tbl;  -- use table type as row variable
   r0 tbl;
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
   LOOP
      IF ( r.company,  r.profession,  r.year)
      <> (r0.company, r0.profession, r0.year + 1) THEN  -- not true for first row
         RETURN QUERY
         SELECT r0.company, r0.profession, years;  -- output row
         years := ARRAY[r.year];  -- start new array
      ELSE
         years := years || r.year;  -- add to array - year can be NULL, too
      END IF;
      r0 := r;  -- remember last row
   END LOOP;
   RETURN QUERY  -- output last iteration
   SELECT r0.company, r0.profession, years;
END
$func$;
Call:
SELECT * FROM f_periods();
Produces the requested result.
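The control flow of the function may be easier to follow outside of PL/pgSQL. The following Python generator is a rough sketch of the same single-pass idea, not a translation of the function itself: it walks the sorted rows once, compares each row with the previous one, and emits a finished group whenever the sequence breaks. The name periods is made up for this example, and unlike the function above it emits nothing for an empty input.

```python
def periods(sorted_rows):
    """Yield (company, profession, years) for rows sorted by company, profession, year."""
    prev = None   # last row seen, like r0 in the function
    years = []    # years collected for the current group
    for company, profession, year in sorted_rows:
        continues = prev is not None and (company, profession, year) == (
            prev[0], prev[1], prev[2] + 1
        )
        if prev is not None and not continues:
            yield prev[0], prev[1], years  # output the finished group
            years = [year]                 # start a new list
        else:
            years.append(year)             # first row, or same run continues
        prev = (company, profession, year)
    if prev is not None:
        yield prev[0], prev[1], years      # output the last group


rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]
result = list(periods(sorted(rows)))
```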

Related

SQL Finding sum of rows and returning count of keys

For a database table looking something like this:
id | year | stint | sv
----+------+-------+---
mk1 | 2001 | 1 | 30
mk1 | 2001 | 2 | 20
ml0 | 1999 | 1 | 43
ml0 | 2000 | 1 | 44
hj2 | 1993 | 1 | 70
I want to get the following output:
count
-------
3
The condition is to count the ids that have sv > 40 in a single year after 1994. If there is more than one stint for the same year, add up the sv points and check whether the sum is > 40.
This is what I have written so far but it is obviously not right:
SELECT COUNT(DISTINCT id),
SUM(sv) as SV
FROM public.pitching
WHERE (year > 1994 AND sv >40);
I know the syntax is completely wrong and some of the conditions' information is missing but I'm not familiar enough with SQL and don't know how to properly do the summing of two rows in the same table with a condition (maybe with a subquery?). Any help would be appreciated! (using postgres)

You could use a nested query to get the aggregations, and wrap that for getting the count. Note that the condition on the sum must be in a having clause:
SELECT COUNT(id)
FROM (
SELECT id,
year,
SUM(sv) as SV
FROM public.pitching
WHERE year > 1994
GROUP BY id,
year
HAVING SUM(sv) > 40 ) years
If an id should only count once even if it fulfils the condition in more than one year, then use COUNT(DISTINCT id) instead of COUNT(id).
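To check both variants against the sample rows, here is a small sketch using Python's built-in sqlite3 module with an in-memory table. This is an assumption made for easy testing (the question uses Postgres); SQLite has no public schema, so the table is simply called pitching, and the column alias sv_total is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pitching (id TEXT, year INTEGER, stint INTEGER, sv INTEGER)")
conn.executemany(
    "INSERT INTO pitching VALUES (?, ?, ?, ?)",
    [
        ("mk1", 2001, 1, 30),
        ("mk1", 2001, 2, 20),
        ("ml0", 1999, 1, 43),
        ("ml0", 2000, 1, 44),
        ("hj2", 1993, 1, 70),
    ],
)

# The inner query keeps one row per (id, year) whose summed sv exceeds 40;
# the outer query counts either those rows or the distinct ids among them.
QUERY = """
SELECT COUNT({target})
FROM (
    SELECT id, year, SUM(sv) AS sv_total
    FROM pitching
    WHERE year > 1994
    GROUP BY id, year
    HAVING SUM(sv) > 40
) AS years
"""

qualifying_years = conn.execute(QUERY.format(target="id")).fetchone()[0]
qualifying_ids = conn.execute(QUERY.format(target="DISTINCT id")).fetchone()[0]
print(qualifying_years, qualifying_ids)
```

With the sample data, COUNT(id) gives 3 (mk1 in 2001, ml0 in 1999 and in 2000), while COUNT(DISTINCT id) gives 2.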

You can also use sum() as a window function, partitioned by id and year, and then count the distinct (id, year) pairs:
select count(*) from
(
  select distinct id, year
  from (
    select id, year, sum(sv) over (partition by id, year) s
    from public.pitching
    where year > 1994
  ) t
  where s > 40
) u

SQL to find max of sum of data in one table, with extra columns

Apologies if this has been asked elsewhere. I have been looking on Stackoverflow all day and haven't found an answer yet. I am struggling to write the query to find the highest month's sales for each state from this example data.
The data looks like this:
| order_id | month | cust_id | state | prod_id | order_total |
+-----------+--------+----------+--------+----------+--------------+
| 67212 | June | 10001 | ca | 909 | 13 |
| 69090 | June | 10011 | fl | 44 | 76 |
... etc ...
My query
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders GROUP BY `month`, `state`
ORDER BY sales;
| month | state | sales |
+------------+--------+--------+
| September | wy | 435 |
| January | wy | 631 |
... etc ...
returns a few hundred rows: the sum of sales for each month for each state. I want it to only return the month with the highest sum of sales, but for each state. It might be a different month for different states.
This query
SELECT `state`, MAX(order_sum) as topmonth
FROM (SELECT `state`, SUM(order_total) order_sum FROM orders GROUP BY `month`,`state`)
GROUP BY `state`;
| state | topmonth |
+--------+-----------+
| ca | 119586 |
| ga | 30140 |
returns the correct number of rows with the correct data. BUT I would also like the query to give me the month column. Whatever I try with GROUP BY, I cannot find a way to limit the results to one record per state. I have tried PartitionBy without success, and have also tried unsuccessfully to do a join.
TL;DR: one query gives me the correct columns but too many rows; the other query gives me the correct number of rows (and the correct data) but insufficient columns.
Any suggestions to make this work would be most gratefully received.
I am using Apache Drill, which is apparently ANSI-SQL compliant. Hopefully that doesn't make much difference - I am assuming that the solution would be similar across all SQL engines.

This one should do the trick
SELECT t1.`month`, t1.`state`, t1.`sales`
FROM (
/* this one selects month, state and sales*/
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders
GROUP BY `month`, `state`
) AS t1
JOIN (
/* this one selects the best value for each state */
SELECT `state`, MAX(sales) AS best_month
FROM (
SELECT `month`, `state`, SUM(order_total) AS sales
FROM orders
GROUP BY `month`, `state`
) AS s
GROUP BY `state`
) AS t2
ON t1.`state` = t2.`state` AND
t1.`sales` = t2.`best_month`
It's basically the combination of the two queries you wrote.
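Here is a runnable sketch of the same join, using Python's sqlite3 module instead of Apache Drill (an assumption made purely so the example can be executed). The rows are invented for illustration, and a CTE named monthly is used so the monthly aggregation is written only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (month TEXT, state TEXT, order_total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        # Invented sample data: ca sells 33 in June and 50 in July,
        # fl sells 76 in June and 10 in July.
        ("June", "ca", 13), ("June", "ca", 20), ("July", "ca", 50),
        ("June", "fl", 76), ("July", "fl", 10),
    ],
)

top_months = conn.execute(
    """
    WITH monthly AS (
        -- sales per month and state
        SELECT month, state, SUM(order_total) AS sales
        FROM orders
        GROUP BY month, state
    )
    SELECT m.month, m.state, m.sales
    FROM monthly AS m
    JOIN (
        -- best monthly total per state
        SELECT state, MAX(sales) AS best_month
        FROM monthly
        GROUP BY state
    ) AS best
      ON m.state = best.state AND m.sales = best.best_month
    ORDER BY m.state
    """
).fetchall()
print(top_months)
```

Note that if two months tie for the best total in a state, the join returns both rows for that state.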

Try this:
SELECT `month`, `state`, SUM(order_total) FROM orders WHERE `month` IN
( SELECT TOP 1 t.month FROM ( SELECT `month` AS month, SUM(order_total) order_sum FROM orders GROUP BY `month`
ORDER BY order_sum DESC) t)
GROUP BY `month`, state ;

SQL (sqlite) compare sums of rows grouped by another repeating row

I have a table like:
|------------------------|
|day name trees_planted|
|------------------------|
|1 | alice | 3 |
|2 | alice | 4 |
|1 | bob | 2 |
|2 | bob | 4 |
|------------------------|
I'm using SELECT name, SUM(trees_planted) FROM year2016 GROUP BY name to get:
name | trees_planted
alice | 7
bob | 6
But then I have another table from 2015 and I want to compare the results with the previous year, if for example Alice planted more trees in 2016 than in 2015 I'd get a result like this:
name | tree_difference
alice | -2 (if previous year she planted 5 trees, 5 - 7 = -2)
bob | 0 (planted the same number of trees last year)

You could use a sub-query to get the records from both 2016 and 2015, but negate the values from 2016. Then group and sum like you already did:
SELECT name,
SUM(trees_planted) AS tree_difference
FROM (SELECT name, trees_planted
FROM year2015
UNION ALL
SELECT name, -trees_planted
FROM year2016
) AS years
GROUP BY name
This will also work for cases where a number is only given in one of the two years.
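A quick way to try this is Python's sqlite3 module with in-memory tables. In the sketch below, the 2016 rows come from the question, while the 2015 rows are assumptions: alice with 5 trees and bob with 6 match the expected output in the question, and carol is an invented name that only appears in 2015.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE year2015 (day INTEGER, name TEXT, trees_planted INTEGER);
    CREATE TABLE year2016 (day INTEGER, name TEXT, trees_planted INTEGER);
    INSERT INTO year2016 VALUES (1, 'alice', 3), (2, 'alice', 4), (1, 'bob', 2), (2, 'bob', 4);
    INSERT INTO year2015 VALUES (1, 'alice', 5), (1, 'bob', 6), (1, 'carol', 3);
    """
)

differences = conn.execute(
    """
    SELECT name, SUM(trees_planted) AS tree_difference
    FROM (SELECT name, trees_planted FROM year2015
          UNION ALL
          SELECT name, -trees_planted FROM year2016  -- 2016 counts negatively
         ) AS years
    GROUP BY name
    ORDER BY name
    """
).fetchall()
print(differences)
```

carol still gets a row even though she has no 2016 data, which is the case an inner join between the two years would drop.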

Assuming you can join on the name field, you can do:
select a.name, a.tp as tp_2016, b.tp as tp_2015, b.tp - a.tp as tree_difference
from
(
(select name, SUM(trees_planted) tp from year2016 group by name) a
inner join
(select name, SUM(trees_planted) tp from year2015 group by name) b
using(name)
)
If you can't join on the name field (because you have a different set of names in 2015 and 2016), it'll be easy to add the missing information by using a couple of union clauses.
Here's a link with artificial data to SQLFIDDLE to try the query.

PostgreSQL return multiple rows with DISTINCT though only latest date per second column

Let's say I have the following database table (dates truncated for the example; the two 'id_'-prefixed columns join with other tables)...
+-----------+---------+------+--------------------+-------+
| id_table1 | id_tab2 | date | description | price |
+-----------+---------+------+--------------------+-------+
| 1 | 11 | 2014 | man-eating-waffles | 1.46 |
+-----------+---------+------+--------------------+-------+
| 2 | 22 | 2014 | Flying Shoes | 8.99 |
+-----------+---------+------+--------------------+-------+
| 3 | 44 | 2015 | Flying Shoes | 12.99 |
+-----------+---------+------+--------------------+-------+
...and I have a query like the following...
SELECT id, date, description FROM inventory ORDER BY date ASC;
How do I SELECT every description only once, and for each one only the row with the latest year? The query should return the first and last rows from the sample data above; the second is not returned because the last row has a later date.

Postgres has something called distinct on. This is usually more efficient than using window functions. So, an alternative method would be:
SELECT distinct on (description) id, date, description
FROM inventory
ORDER BY description, date desc;
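DISTINCT ON (description) keeps the first row of each description according to the ORDER BY, so sorting by date descending within each description makes that first row the latest one. A plain Python sketch of the same selection, using the three sample rows reduced to (id, date, description); it only illustrates the logic and does not show how Postgres executes the query:

```python
from itertools import groupby

# (id, date, description), taken from the sample table in the question.
inventory = [
    (1, 2014, "man-eating-waffles"),
    (2, 2014, "Flying Shoes"),
    (3, 2015, "Flying Shoes"),
]

# ORDER BY description, date DESC
ordered = sorted(inventory, key=lambda row: (row[2], -row[1]))

# DISTINCT ON (description): keep only the first row of each description.
latest = [next(group) for _, group in groupby(ordered, key=lambda row: row[2])]
print(latest)
```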

The row_number window function should do the trick:
SELECT id, date, description
FROM (SELECT id, date, description,
ROW_NUMBER() OVER (PARTITION BY description
ORDER BY date DESC) AS rn
FROM inventory) t
WHERE rn = 1
ORDER BY date ASC;

Transform long rows to wide, filling all cells

I have long format data on businesses, with a row for each occurrence of a move to a different location, keyed on business id -- there can be several move events for any one business establishment.
I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.
+-------------+-----------+---------+---------+
| business_id | year_move | long | lat |
+-------------+-----------+---------+---------+
| 001013580 | 1991 | 71.0557 | 42.3588 |
| 001015924 | 1993 | 71.0728 | 42.3504 |
| 001015924 | 1996 | -122.28 | 37.654 |
| 001020684 | 1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+
Then I transform like so:
SELECT longbyyear.*
FROM crosstab($$
SELECT
business_id,
year_move,
max(longitude::float)
from business_moves
where year_move::int between 1991 and 2010
group by business_id, year_move
order by business_id, year_move;
$$
)
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float,
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);
And it --mostly-- gets me to the desired output.
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id | long91 | long92 | long93 | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … | | | |
| 1000678 | 118.224 | | | | … | | | |
| 1002158 | 121.98 | | | | … | | | |
| 1004092 | 71.2384 | | | | … | | | |
| 1007801 | 118.0312 | | | | … | | | |
| 1007855 | 71.1769 | | | | … | | | |
| 1008697 | 71.0394 | 71.0358 | | | … | | | |
| 1008986 | 71.1013 | | | | … | | | |
| 1009617 | 119.9965 | | | | … | | | |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
The only snag is that I would ideally have populated values for each year and not just have values in move years. Thus all fields would be populated, with a value for each year, with the most recent address carrying over to the next year. I could hack this with manual updates (if a cell is blank, use the previous column), but I wondered whether there is a clever way to do it, either with the crosstab() function or some other way, possibly coupled with a custom function.

In order to get the current location for each business_id for any given year you need two things:
A parameterized query to select the year, implemented as a SQL language function.
A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.
The function then looks like this:
CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
WITH last_move AS (
SELECT business_id, MAX(year_move) AS yr
FROM business_moves
WHERE year_move <= $1
GROUP BY business_id)
SELECT lm.business_id, $1::int AS yr, longitude, latitude
FROM business_moves bm, last_move lm
WHERE bm.business_id = lm.business_id
AND bm.year_move = lm.yr;
$$ LANGUAGE sql;
The sub-query selects only the most recent moves for every business location. The main query then adds the longitude and latitude columns and puts the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need to have a record in this table that gives the establishment and initial location of each business_id or it will not show up until after it has moved somewhere else.
Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.
If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.
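The lookup the function performs ("latest move on or before the requested year") can be sketched in a few lines of Python. The rows are the sample data from the question, with business_id kept as a string; the function name locations_in_year is made up for this illustration.

```python
# (business_id, year_move, longitude, latitude) from the question's sample table.
moves = [
    ("001013580", 1991, 71.0557, 42.3588),
    ("001015924", 1993, 71.0728, 42.3504),
    ("001015924", 1996, -122.28, 37.654),
    ("001020684", 1992, 84.3381, 33.5775),
]


def locations_in_year(moves, year):
    """Map business_id -> (year, longitude, latitude) as of the given year."""
    last_move = {}
    for business_id, year_move, lon, lat in moves:
        if year_move > year:
            continue  # this move has not happened yet in the requested year
        best = last_move.get(business_id)
        if best is None or year_move > best[0]:
            last_move[business_id] = (year_move, lon, lat)
    # Report the requested year rather than the year of the move itself.
    return {b: (year, lon, lat) for b, (_, lon, lat) in last_move.items()}


print(locations_in_year(moves, 1997))
```

A business with no move on or before the requested year is missing from the result, which is the caveat mentioned above.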

I assume you have actual dates for each business move, so we can make meaningful picks per year:
CREATE TEMP TABLE business_moves (
business_id int, -- why would you use inefficient varchar here?
move_date date,
longitude float,
latitude float);
Building on this, a more meaningful test case:
INSERT INTO business_moves VALUES
(001013580, '1991-1-1', 71.0557, 42.3588),
(001015924, '1993-1-1', 71.0728, 42.3504),
(001015924, '1993-3-3', 73.0728, 43.3504), -- 2nd move this year
(001015924, '1996-1-1', -122.28, 37.654),
(001020684, '1992-1-1', 84.3381, 33.5775);
Complete, very fast solution
SELECT *
FROM crosstab($$
SELECT business_id, year
, first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
FROM (
SELECT *
, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
) sub
$$
,'VALUES
(1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
, x91 point, x92 point, x93 point, x94 point, x95 point
, x96 point, x97 point, x98 point, x99 point, x00 point
, x01 point, x02 point, x03 point, x04 point, x05 point
, x06 point, x07 point, x08 point, x09 point, x10 point);
Result:
biz_id | x91 | x92 | x93 | x94 | x95 | x96 | x97 ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
1015924 | | | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654) | (-122.28,37.654) ...
1020684 | | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...
Step-by-step
Step 1
Repair what you had:
SELECT *
FROM crosstab($$
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date) AS year
, point(longitude, latitude) AS long_lat
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
$$
,'VALUES
(1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
) AS t(biz_id int
, x91 point, x92 point, x93 point, x94 point, x95 point
, x96 point, x97 point, x98 point, x99 point, x00 point
, x01 point, x02 point, x03 point, x04 point, x05 point
, x06 point, x07 point, x08 point, x09 point, x10 point);
You want lat & lon to make it meaningful, so form a point from both. Alternatively, you could just concatenate a text representation.
You may want even more data. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
Select first row in each GROUP BY group?
As long as there can be missing values for the whole grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
PostgreSQL Crosstab Query
Adapted the function to work with move_date date instead of year_move.
Step 2
To address your request:
I would ideally have populated values for each year
Build a full grid of values (one cell per business and year) with a CROSS JOIN of businesses and years:
SELECT *
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
The set of years comes from a generate_series() call.
The distinct businesses come from a separate SELECT. If you have a table of businesses, you could use that instead (it would be cheaper). This would also account for businesses that never moved.
LEFT JOIN to actual business moves per year to arrive at a full grid of values.
Step 3
Fill in defaults:
with the most recent address carrying over to the next year.
SELECT business_id, year
, COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
,'(0,0)') AS x
FROM (
SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
FROM (SELECT DISTINCT business_id FROM business_moves) b
CROSS JOIN generate_series(1991, 2010) year
LEFT JOIN (
SELECT DISTINCT ON (1,2)
business_id
, EXTRACT('year' FROM move_date)::int AS year
, point(longitude, latitude) AS x
FROM business_moves
WHERE move_date >= '1991-1-1'
AND move_date < '2011-1-1'
ORDER BY 1,2, move_date DESC
) bm USING (business_id, year)
) sub;
In the subquery sub, built on the query from step 2, form groups (grp) of cells that share the same location.
For this purpose utilize the well known aggregate function count() as window aggregate function. NULL values don't count, so the value increases with every actual move, thereby forming groups of cells that share the same location.
In the outer query, pick the first value per group for each row in the same group using the window function first_value(). Voilà.
To top it off, optionally(!) wrap that in COALESCE to fill the remaining cells with unknown location (no move yet) with (0,0). If you do that, there are no remaining NULL values, and you can use the simpler form of crosstab(). That's a matter of taste.
SQL Fiddle with base queries. crosstab() is not currently installed on SQL Fiddle.
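The count() / first_value() combination from step 3 can be hard to picture, so here is a minimal Python sketch for a single business. It assumes the per-year cells of business 1015924 after the LEFT JOIN from step 2, with None standing in for NULL and only the years 1991 to 1997 shown; the variable names are illustrative.

```python
from itertools import accumulate, groupby

p93 = (73.0728, 43.3504)   # latest move in 1993
p96 = (-122.28, 37.654)    # move in 1996

# One cell per year; None means "no move in that year".
cells = [
    (1991, None), (1992, None), (1993, p93), (1994, None),
    (1995, None), (1996, p96), (1997, None),
]

# count(x) OVER (ORDER BY year): running count of non-NULL values.
# It increases at every actual move, so it labels stretches of equal location.
grp = list(accumulate(0 if x is None else 1 for _, x in cells))

# first_value(x) per group: the move that opened the group fills all its cells.
filled = []
for _, members in groupby(zip(grp, cells), key=lambda pair: pair[0]):
    members = list(members)
    first_x = members[0][1][1]
    filled.extend((year, first_x) for _, (year, _) in members)

for year, location in filled:
    print(year, location)
```

The years before the first move form group 0, whose first value is None; those are the cells that the optional COALESCE would replace with (0,0).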
Step 4
Use the query from step 3 in an updated crosstab() call.
All in all, this should be as fast as it gets. Indexes may help some more.
