How can I summarize data by year in SQL?

I'm sure the request is rather straightforward, but I'm stuck. I'd like to take the first table below and turn it into the second table by summing up Incremental_Inventory by Year.
+--------------+------------+-----------------------+------+
| Warehouse_ID | Date       | Incremental_Inventory | Year |
+--------------+------------+-----------------------+------+
|            1 | 03/01/2010 |                   125 | 2010 |
|            1 | 08/01/2010 |                   025 | 2010 |
|            1 | 02/01/2011 |                   150 | 2011 |
|            1 | 03/01/2011 |                   200 | 2011 |
|            2 | 03/01/2012 |                   125 | 2012 |
|            2 | 03/01/2012 |                   025 | 2012 |
+--------------+------------+-----------------------+------+
to
+--------------+------------+-----------------------------+
| Warehouse_ID | Date       | Cumulative_Yearly_Inventory |
+--------------+------------+-----------------------------+
|            1 | 03/01/2010 |                         125 |
|            1 | 08/01/2010 |                         150 |
|            1 | 02/01/2011 |                         150 |
|            1 | 03/01/2011 |                         350 |
|            2 | 03/01/2012 |                         125 |
|            2 | 03/01/2012 |                         150 |
+--------------+------------+-----------------------------+

If your DBMS, which you haven't told us, supports window functions, you could simply do something like:
SELECT warehouse_id,
       date,
       sum(incremental_inventory) OVER (PARTITION BY warehouse_id,
                                                     year(date)
                                        ORDER BY date) cumulative_yearly_inventory
FROM elbat
ORDER BY date;
year() may need to be replaced by whatever means your DBMS provides to extract the year from a date.
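For instance (common spellings, to be checked against your DBMS's documentation):
-- standard SQL / PostgreSQL / Oracle
extract(year FROM date)
-- MySQL / SQL Server
year(date)
-- SQLite (returns the year as text)
strftime('%Y', date)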
If it doesn't support window functions, you'd have to use a subquery and aggregation.
SELECT t1.warehouse_id,
       t1.date,
       (SELECT sum(t2.incremental_inventory)
        FROM elbat t2
        WHERE t2.warehouse_id = t1.warehouse_id
              AND year(t2.date) = year(t1.date)
              AND t2.date <= t1.date) cumulative_yearly_inventory
FROM elbat t1
ORDER BY t1.date;
However, if there are two equal dates, this will print the same sum for both of them. One would need another, distinct column to sort that out and as far as I can see you don't have such a column in the table.
I'm not sure if you want the sum over all warehouses or only per warehouse. If you don't want the sums split by warehouses but one sum for all warehouses together, remove the respective expressions from the PARTITION BY or inner WHERE clause.
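For example, to get one running total per year across all warehouses, a sketch of the window-function variant from above without the warehouse split:
SELECT warehouse_id,
       date,
       sum(incremental_inventory) OVER (PARTITION BY year(date)
                                        ORDER BY date) cumulative_yearly_inventory
FROM elbat
ORDER BY date;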

If you have SAS/ETS then the time series tasks will do this for you. Assuming not, here's a data step solution.
Use RETAIN to hold value across rows
Use BY to identify the first record for each year
data want;
  set have;
  by year;
  retain cum_total;
  if first.year then cum_total = incremental_inventory;
  else cum_total + incremental_inventory;
run;

Related

DATA TYPE TIME manipulation

How do I aggregate the sells by date? I want to know the total of sells for each day:
| DATE       | SELLS |
| 2022-01-27 |  48$  |
| 2022-01-27 |  25$  |
| 2022-01-27 | 150$  |
| 2022-01-25 |  55$  |
I have no idea about the query.
Perhaps I should create another table which holds only the total sells per day?
Do a GROUP BY
select date, sum(sells)
from tablename
group by date
(No need for another table. Such copying of data too often leads to data inconsistency.)

If condition TRUE in a row (that is grouped)

Table:
| Months  | ID | Commission |
| 2020-01 | 1  | 2312       |
| 2020-02 | 2  | 24412      |
| 2020-02 | 1  | 123        |
| ...     | .. | ...        |
What I need:
COUNT(Months),
ID,
SUM(Commission),
Country
GROUP BY ID...
How it should look:
| Months | ID | Commission |
| 4      | 1  | 5356       |
| 6      | 2  | 5436       |
| ...    | .. | ...        |
So I want to know how many months each ID received his commission. However (and that's the part where I need your help), if the ID is still receiving commission up to this month (the current month), I want to exclude him from the list. If he stopped receiving commission last month or last year, I want to see him in the table.
In other words, I want a table of old clients (those who don't receive commission anymore).
Use aggregation. Assuming there is one row per month:
select id, count(*)
from t
group by id
having max(months) < date_format(now(), '%Y-%m');
Note this uses MySQL syntax, which was one of the original tags.

How to use condition and aggregation in postgresql

I tried to use a SELECT query which contains both a WHERE clause and a SUM aggregation, but it shows an error while executing the query.
Below is a sample table:
| sensorid | timestamp  | reading |
===================================
| 1        | 1604192522 | 10      |
| 1        | 1604192525 | 15      |
| 2        | 1605783723 | 8.1     |
My query is
select date_trunc('day', v.timestamp) as day,sum(reading) from sensor v(timestamp,sensorid) group by (DAY) having sensorid=1;
While executing it, the below error occurred:
Cannot use column sensorid outside of an Aggregation in HAVING clause. Only GROUP BY keys allowed here.]
If you apply GROUP BY, you lose the particular values of all other columns.
Probably you want to either filter by values -> use where
select date_trunc('day', v.timestamp) as day,sum(reading) from sensor v(timestamp,sensorid) where sensorid=1 group by (DAY) ;
or filter by aggregation -> keep having but use aggregation function
select date_trunc('day', v.timestamp) as day,sum(reading) from sensor v(timestamp,sensorid) group by (DAY) having min(sensorid)=1;
It's not clear what your intention is; post the expected output if I didn't guess the right variant.

Multiple correlated subqueries with different conditions to same table

I have two tables:
orders
| id | item_id | quantity | ordered_on |
|----|---------|----------|------------|
| 1  | 1       | 2        | 2016-03-09 |
| 2  | 1       | 2        | 2016-03-12 |
| 3  | 4       | 3        | 2016-03-15 |
| 4  | 4       | 3        | 2016-03-13 |
stocks
| id | item_id | quantity | enter_on   | expire_on  |
|----|---------|----------|------------|------------|
| 1  | 1       | 10       | 2016-03-07 | 2016-03-10 |
| 2  | 1       | 20       | 2016-03-11 | 2016-03-15 |
| 3  | 1       | 20       | 2016-03-14 | 2016-03-17 |
| 4  | 4       | 10       | 2016-03-14 | NULL       |
| 5  | 4       | 10       | 2016-03-12 | NULL       |
I'm trying to create a view to show the orders along with their closest stocks' enter_on, like this. (I'm using include_after and include_before to give an overview of the dates between which I want to exclude the preordered item, so the stock is reflected correctly.)
include_after is always the enter_on of the stock that came in but has not expired yet (if it expired, show NULL); include_before always shows the next incoming stock's enter_on, unless there's an expire_on that's earlier than the next enter_on.
| item_id | quantity | ordered_on | include_after | include_before |
|---------|----------|------------|---------------|----------------|
| 1       | 2        | 2016-03-09 | 2016-03-07    | 2016-03-10     |
| 1       | 2        | 2016-03-12 | 2016-03-11    | 2016-03-14     |
| 4       | 3        | 2016-03-13 | 2016-03-12    | 2016-03-14     |
| 4       | 3        | 2016-03-15 | 2016-03-14    | NULL           |
So this is what I came up with:
SELECT
    o.item_id, o.quantity, o.order_on,
    (SELECT COALESCE(MAX(s.enter_on), NULL::DATE)
     FROM stocks s
     WHERE s.enter_on <= o.order_on AND s.item_id = o.item_id
    ) as include_after,
    (SELECT COALESCE(MIN(s.enter_on), NULL::DATE)
     FROM stocks s
     WHERE s.enter_on > o.order_on AND s.item_id = o.item_id
    ) as include_before
FROM
    orders o;
It works fine (I haven't included the expire_on part), but I'm worried about the performance of using two subqueries in the select.
Does anyone have some alternative suggestions?
UPDATE
I'm using PostgreSQL 9.4. (Can't add any more tags.)
The actual problem is way more complicated than I stated; it's a lot of tables and views joined together. I shrunk it down to just one table to grasp the concept, in case there are alternatives.
You should worry about performance when the situation arises. For the example that you provided, an index on stocks(item_id, enter_on, expire_on) should be sufficient. Then you might actually want two indexes: stocks(item_id, enter_on desc, expire_on).
If the performance is not sufficient, you have two choices. One is a GIST index for ranges. (Here is an interesting discussion of the issue.) The second is an alternative query formulation.
However, I wouldn't attempt to optimize the query until there is enough data to show a performance problem. Solutions on smaller amounts of data just might not scale well.
Discussing the query you display, also not considering expire_on.
COALESCE for a correlated subquery
First off, the expression COALESCE(anything, NULL) never makes sense. You would replace NULL with NULL.
Aggregate functions like max() return NULL anyway (preventing "no row"), even if no qualifying row is found. (The exception being count(), which returns 0).
A correlated subquery that would return "no row" (like the variant with ORDER BY ... LIMIT 1 I demonstrate below) defaults to NULL for the column value.
So, if you wanted to use COALESCE in this context, you would wrap it around the correlated subquery as a whole - and provide a default for NULL.
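A minimal sketch of that, using the first correlated subquery from the question and an arbitrary placeholder date as the default (both my choices, not from the question):
SELECT o.item_id
     , COALESCE( (SELECT max(s.enter_on)
                  FROM   stocks s
                  WHERE  s.item_id  = o.item_id
                  AND    s.enter_on <= o.order_on)
               , DATE '1900-01-01') AS include_after  -- fallback is an arbitrary placeholder
FROM   orders o;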
Query
I'm worried about the performance of using two subqueries in the select.
It depends.
If there are only a few rows per item_id in table stocks and/or only an index on stocks(item_id), then it would make sense to merge the two correlated subqueries into one LATERAL subquery with conditional aggregates:
SELECT o.item_id, o.quantity, o.order_on
     , s.include_after, s.include_before
FROM   orders o
     , LATERAL (
   SELECT max(enter_on) FILTER (WHERE enter_on <= o.order_on) AS include_after
        , min(enter_on) FILTER (WHERE enter_on >  o.order_on) AS include_before
   FROM   stocks
   WHERE  item_id = o.item_id
   ) s;
Since the subquery returns a row in any case due to the aggregate functions, a simple CROSS JOIN is fine. Else you might want to use LEFT JOIN LATERAL (...) ON true. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
The aggregate FILTER clause requires Postgres 9.4+. There are alternatives for older versions. See:
Aggregate columns with additional (distinct) filters
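One such alternative (a sketch of my own, not taken from the linked answer) replaces FILTER with a CASE expression inside the aggregate; CASE without ELSE yields NULL, which the aggregate ignores:
SELECT o.item_id, o.quantity, o.order_on
     , s.include_after, s.include_before
FROM   orders o
     , LATERAL (
   SELECT max(CASE WHEN enter_on <= o.order_on THEN enter_on END) AS include_after
        , min(CASE WHEN enter_on >  o.order_on THEN enter_on END) AS include_before
   FROM   stocks
   WHERE  item_id = o.item_id
   ) s;
LATERAL itself needs Postgres 9.3; for even older versions, keep the two plain correlated subqueries and use CASE the same way.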
If, on the other hand, you have many rows per item_id in table stocks and an index ON stocks (item_id, enter_on), your query might still be faster. Or this slightly adapted version (test both!):
SELECT o.item_id, o.quantity, o.order_on
     , (SELECT s.enter_on
        FROM   stocks s
        WHERE  s.item_id = o.item_id
        AND    s.enter_on <= o.order_on
        ORDER  BY 1 DESC NULLS LAST
        LIMIT  1) AS include_after
     , (SELECT s.enter_on
        FROM   stocks s
        WHERE  s.item_id = o.item_id
        AND    s.enter_on > o.order_on
        ORDER  BY 1
        LIMIT  1) AS include_before
FROM   orders o;
Because both correlated subqueries can be resolved to a single index lookup each.
To optimize performance, you might need a 2nd index on stocks(item_id, enter_on DESC NULLS LAST). But don't create specialized indexes unless you actually need to squeeze out more read performance for this query (key word: premature optimization).
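If measurements do show a problem, the two indexes discussed above could be created like this (a sketch; the index names are made up):
CREATE INDEX stocks_item_enter_on_idx      ON stocks (item_id, enter_on);
CREATE INDEX stocks_item_enter_on_desc_idx ON stocks (item_id, enter_on DESC NULLS LAST);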
Detailed discussion in this related answer:
Optimize GROUP BY query to retrieve latest row per user

Transform long rows to wide, filling all cells

I have long format data on businesses, with a row for each occurrence of a move to a different location, keyed on business id -- there can be several move events for any one business establishment.
I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.
+-------------+-----------+---------+---------+
| business_id | year_move | long | lat |
+-------------+-----------+---------+---------+
| 001013580 | 1991 | 71.0557 | 42.3588 |
| 001015924 | 1993 | 71.0728 | 42.3504 |
| 001015924 | 1996 | -122.28 | 37.654 |
| 001020684 | 1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+
Then I transform like so:
SELECT longbyyear.*
FROM crosstab($$
    SELECT
        business_id,
        year_move,
        max(longitude::float)
    from business_moves
    where year_move::int between 1991 and 2010
    group by business_id, year_move
    order by business_id, year_move;
    $$
)
AS longbyyear(biz_id character varying,
    "long91" float, "long92" float, "long93" float, "long94" float, "long95" float,
    "long96" float, "long97" float, "long98" float, "long99" float, "long00" float,
    "long01" float, "long02" float, "long03" float, "long04" float, "long05" float,
    "long06" float, "long07" float, "long08" float, "long09" float, "long10" float);
And it --mostly-- gets me to the desired output.
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id | long91 | long92 | long93 | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … | | | |
| 1000678 | 118.224 | | | | … | | | |
| 1002158 | 121.98 | | | | … | | | |
| 1004092 | 71.2384 | | | | … | | | |
| 1007801 | 118.0312 | | | | … | | | |
| 1007855 | 71.1769 | | | | … | | | |
| 1008697 | 71.0394 | 71.0358 | | | … | | | |
| 1008986 | 71.1013 | | | | … | | | |
| 1009617 | 119.9965 | | | | … | | | |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
The only snag is that I would ideally have populated values for each year, not just values in move years. Thus all fields would be populated, with a value for each year and the most recent address carrying over to the next year. I could hack this with manual updates (if a cell is blank, use the previous column), but I wondered if there was a clever way to do it, either with the crosstab() function or some other way, possibly coupled with a custom function.
In order to get the current location for each business_id for any given year you need two things:
A parameterized query to select the year, implemented as a SQL language function.
A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.
The function then looks like this:
CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
  WITH last_move AS (
    SELECT business_id, MAX(year_move) AS yr
    FROM business_moves
    WHERE year_move <= $1
    GROUP BY business_id)
  SELECT lm.business_id, $1::int AS yr, longitude, latitude
  FROM business_moves bm, last_move lm
  WHERE bm.business_id = lm.business_id
    AND bm.year_move = lm.yr;
$$ LANGUAGE sql;
The sub-query selects only the most recent moves for every business location. The main query then adds the longitude and latitude columns and puts the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need to have a record in this table that gives the establishment and initial location of each business_id or it will not show up until after it has moved somewhere else.
Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.
If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.
I assume you have actual dates for each business move, so we can make meaningful picks per year:
CREATE TEMP TABLE business_moves (
business_id int, -- why would you use inefficient varchar here?
move_date date,
longitude float,
latitude float);
Building on this, a more meaningful test case:
INSERT INTO business_moves VALUES
(001013580, '1991-1-1', 71.0557, 42.3588),
(001015924, '1993-1-1', 71.0728, 42.3504),
(001015924, '1993-3-3', 73.0728, 43.3504), -- 2nd move this year
(001015924, '1996-1-1', -122.28, 37.654),
(001020684, '1992-1-1', 84.3381, 33.5775);
Complete, very fast solution
SELECT *
FROM   crosstab($$
   SELECT business_id, year
        , first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
   FROM  (
      SELECT *
           , count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
      FROM  (SELECT DISTINCT business_id FROM business_moves) b
      CROSS  JOIN generate_series(1991, 2010) year
      LEFT   JOIN (
         SELECT DISTINCT ON (1,2)
                business_id
              , EXTRACT('year' FROM move_date)::int AS year
              , point(longitude, latitude) AS x
         FROM   business_moves
         WHERE  move_date >= '1991-1-1'
         AND    move_date <  '2011-1-1'
         ORDER  BY 1,2, move_date DESC
         ) bm USING (business_id, year)
      ) sub
   $$
  ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
   ) AS t(biz_id int
        , x91 point, x92 point, x93 point, x94 point, x95 point
        , x96 point, x97 point, x98 point, x99 point, x00 point
        , x01 point, x02 point, x03 point, x04 point, x05 point
        , x06 point, x07 point, x08 point, x09 point, x10 point);
Result:
biz_id | x91 | x92 | x93 | x94 | x95 | x96 | x97 ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
1015924 | | | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654) | (-122.28,37.654) ...
1020684 | | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...
Step-by-step
Step 1
Repair what you had:
SELECT *
FROM   crosstab($$
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date) AS year
        , point(longitude, latitude) AS long_lat
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   $$
  ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
   ) AS t(biz_id int
        , x91 point, x92 point, x93 point, x94 point, x95 point
        , x96 point, x97 point, x98 point, x99 point, x00 point
        , x01 point, x02 point, x03 point, x04 point, x05 point
        , x06 point, x07 point, x08 point, x09 point, x10 point);
You want lat & lon to make it meaningful, so form a point from both. Alternatively, you could just concatenate a text representation.
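A minimal sketch of the text alternative, assuming a simple 'longitude,latitude' string per cell is enough:
SELECT business_id
     , EXTRACT('year' FROM move_date)::int AS year
     , longitude::text || ',' || latitude::text AS long_lat  -- text instead of point
FROM   business_moves;
The crosstab() column definition list would then declare text instead of point columns.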
You may want even more data. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
Select first row in each GROUP BY group?
As long as there can be missing values for the whole grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
PostgreSQL Crosstab Query
Adapted the function to work with move_date date instead of year_move.
Step 2
To address your request:
I would ideally have populated values for each year
Build a full grid of values (one cell per business and year) with a CROSS JOIN of businesses and years:
SELECT *
FROM  (SELECT DISTINCT business_id FROM business_moves) b
CROSS  JOIN generate_series(1991, 2010) year
LEFT   JOIN (
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date)::int AS year
        , point(longitude, latitude) AS x
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   ) bm USING (business_id, year)
The set of years comes from a generate_series() call.
Distinct businesses come from a separate SELECT. You might have a businesses table you could use instead (cheaper)? This would also account for businesses that never moved.
LEFT JOIN to actual business moves per year to arrive at a full grid of values.
Step 3
Fill in defaults:
with the most recent address carrying over to the next year.
SELECT business_id, year
     , COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
              , '(0,0)') AS x
FROM  (
   SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
   FROM  (SELECT DISTINCT business_id FROM business_moves) b
   CROSS  JOIN generate_series(1991, 2010) year
   LEFT   JOIN (
      SELECT DISTINCT ON (1,2)
             business_id
           , EXTRACT('year' FROM move_date)::int AS year
           , point(longitude, latitude) AS x
      FROM   business_moves
      WHERE  move_date >= '1991-1-1'
      AND    move_date <  '2011-1-1'
      ORDER  BY 1,2, move_date DESC
      ) bm USING (business_id, year)
   ) sub;
In the subquery sub, build on the query from step 2 and form groups (grp) of cells that share the same location.
For this purpose, utilize the well-known aggregate function count() as a window aggregate function. NULL values don't count, so the value increases with every actual move, thereby forming groups of cells that share the same location.
In the outer query, pick the first value per group for each row in the same group using the window function first_value(). Voilà.
To top it off, optionally(!) wrap that in COALESCE to fill the remaining cells with unknown location (no move yet) with (0,0). If you do that, there are no remaining NULL values, and you can use the simpler form of crosstab(). That's a matter of taste.
SQL Fiddle with base queries. crosstab() is not currently installed on SQL Fiddle.
Step 4
Use the query from step 3 in an updated crosstab() call.
All in all, this should be as fast as it gets. Indexes may help some more.