I have a Postgres DB with a table containing data keyed by an id and a timestamp.
The table has several data columns. I want to create a plpgsql function that would allow me to get an aggregation of the data according to a time interval.
The table looks something like this:
 user_id |     created_at      | value_a | value_b | value_c | value_d | unique_key
---------+---------------------+---------+---------+---------+---------+------------
       1 | 2019-12-16 17:37:07 |       1 |       5 |       0 |       5 |          1
       2 | 2019-12-19 15:37:07 |       4 |       7 |       0 |      42 |          2
       3 | 2019-12-16 15:37:07 |      20 |       1 |      20 |     143 |          3
       2 | 2019-12-18 12:01:32 |       0 |       0 |       5 |     987 |          4
       1 | 2019-12-11 14:12:50 |       6 |       0 |       9 |       0 |          5
       2 | 2019-12-10 15:37:07 |       1 |      72 |     100 |      90 |          6
       1 | 2019-12-20 15:37:07 |       5 |       3 |      56 |    1546 |          7
       3 | 2019-12-20 15:37:07 |      30 |       4 |     789 |       3 |          8
       4 | 2019-12-01 15:37:07 |      35 |      90 |       0 |       5 |          9
(9 rows)
I want to create the function so that I can pass it a time range (before and after) and an interval; it would then group the data according to that interval (daily, for example), grouped by user_id.
I have managed to create a function with a generate_series that returns the aggregated results, but it ignores some of the data.
The aggregation uses different formulas to get the data.
Most of the answers that I have found manage to select a grouped sum of only one value, not several, i.e. they return something along the lines of:
user_id | date | value_a + value_b + value_c + value_c
But in my case I would like to manipulate the data in different ways, for example:
user_id | date | a + b | (a*b)/c | count(a)
etc. (of course I will handle divide-by-zero and the like).
So the function that I tried to create was something along the lines of:
CREATE OR REPLACE FUNCTION branch_performance_measurements_daily(
    IN after DATE,
    IN before DATE,
)
RETURNS TABLE (
    date_of_sum DATE,
    func_a INT,
    func_b INT,
    func_c INT
)
AS $$
BEGIN
    RETURN QUERY
    WITH days_series AS (
        SELECT d::date day FROM generate_series(after, before, '1 day') day)
    SELECT days_series.day AS date_of_sum,
           sum(a + b),
           sum((a*b)/c),
           count(a)
    FROM table b
    WHERE DATE(b.created_at) = DATE(days_series.day)
    GROUP BY days_series.day, b.user_id;
END;
$$ LANGUAGE plpgsql;
Sadly this type of query does not return all the available data in the table for all the dates in the range.
Could anyone point me to the proper usage of generate_series for this case?
P.S.
I am aware that the sum functions as written won't work, they're just for the example :)
Many thanks in advance!
Welcome to Stack Overflow.
Your function had a few syntax errors. This is what you might be looking for:
CREATE OR REPLACE FUNCTION branch_performance_measurements_daily(
    after DATE, before DATE)
RETURNS TABLE (
    date_of_sum DATE, func_a BIGINT, func_b BIGINT, func_c BIGINT) AS $$
BEGIN
    RETURN QUERY
    WITH days_series AS (
        SELECT generate_series(after, before, '1 day') AS d)
    SELECT
        DATE(ds.d) AS date_of_sum,
        sum(value_a + value_b),
        COALESCE(sum((value_a*value_b)/NULLIF(value_c,0)),0),
        count(value_a)
    FROM t
    JOIN days_series ds ON ds.d = DATE(t.created_at)
    GROUP BY ds.d, t.user_id
    ORDER BY ds.d;
END;
$$ LANGUAGE plpgsql;
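Note that the INNER JOIN above only returns days that actually have matching rows, so days in the range with no data are dropped. If you also want those empty days in the result (which sounds like the data you felt was missing), a LEFT JOIN from the series should do it. A sketch of the query part, with the sums COALESCEd so empty days show 0 (grouping by day only, since grouping by user_id as well would give empty days a NULL user):
    RETURN QUERY
    WITH days_series AS (
        SELECT generate_series(after, before, '1 day') AS d)
    SELECT
        DATE(ds.d) AS date_of_sum,
        COALESCE(sum(value_a + value_b), 0),
        COALESCE(sum((value_a*value_b)/NULLIF(value_c,0)), 0),
        count(value_a)
    FROM days_series ds
    LEFT JOIN t ON DATE(t.created_at) = DATE(ds.d)
    GROUP BY ds.d
    ORDER BY ds.d;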
Sample data
CREATE TEMPORARY TABLE t
(user_id INT, created_at DATE,
 value_a INT, value_b INT, value_c INT, value_d INT, unique_key INT);

INSERT INTO t VALUES
(1,'2019-12-16 17:37:07',1,5,0,5,1),
(2,'2019-12-19 15:37:07',4,7,0,42,2),
(3,'2019-12-16 15:37:07',20,1,20,143,3),
(2,'2019-12-18 12:01:32',0,0,5,987,4),
(1,'2019-12-11 14:12:50',6,0,9,0,5),
(2,'2019-12-10 15:37:07',1,72,100,90,6),
(1,'2019-12-20 15:37:07',5,3,56,1546,7),
(3,'2019-12-20 15:37:07',30,4,789,3,8),
(4,'2019-12-01 15:37:07',35,90,0,5,9);
Testing function
SELECT * FROM branch_performance_measurements_daily('2019-12-01', '2019-12-20');
date_of_sum | func_a | func_b | func_c
-------------+--------+--------+--------
2019-12-01 | 125 | 0 | 1
2019-12-10 | 73 | 0 | 1
2019-12-11 | 6 | 0 | 1
2019-12-16 | 6 | 0 | 1
2019-12-16 | 21 | 1 | 1
2019-12-18 | 0 | 0 | 1
2019-12-19 | 11 | 0 | 1
2019-12-20 | 8 | 0 | 1
2019-12-20 | 34 | 0 | 1
(9 rows)
In case you want to group just by the generated date (not together with user_id, as your query suggests), remove user_id from the GROUP BY clause and you'll get something like this:
date_of_sum | func_a | func_b | func_c
-------------+--------+--------+--------
2019-12-01 | 125 | 0 | 1
2019-12-10 | 73 | 0 | 1
2019-12-11 | 6 | 0 | 1
2019-12-16 | 27 | 1 | 2
2019-12-18 | 0 | 0 | 1
2019-12-19 | 11 | 0 | 1
2019-12-20 | 42 | 0 | 2
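For reference, only the grouping changes in that variant; the query inside the function would then read:
    SELECT
        DATE(ds.d) AS date_of_sum,
        sum(value_a + value_b),
        COALESCE(sum((value_a*value_b)/NULLIF(value_c,0)),0),
        count(value_a)
    FROM t
    JOIN days_series ds ON ds.d = DATE(t.created_at)
    GROUP BY ds.d
    ORDER BY ds.d;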
Table name: Copies
+----------+-------+----------+---------+--------------+-------------+
| group_id | my_id | previous | in_this | higher_value | most_recent |
+----------+-------+----------+---------+--------------+-------------+
| 900      | 1     | null     | Y       | 7            | May 16      |
| 900      | 2     | null     | Y       | 3            | Oct 16      |
| 900      | 3     | null     | N       | 9            | Oct 16      |
| 901      | 4     | 378      | Y       | 3            | Oct 16      |
| 901      | 5     | null     | N       | 2            | Oct 16      |
| 902      | 6     | null     | N       | 5            | May 16      |
| 902      | 7     | null     | N       | 9            | Oct 16      |
| 903      | 8     | null     | Y       | 3            | Oct 16      |
| 903      | 9     | null     | Y       | 3            | May 16      |
| 904      | 10    | null     | N       | 0            | May 16      |
| 904      | 11    | null     | N       | 0            | May 16      |
+----------+-------+----------+---------+--------------+-------------+
Output table
+----------+-------+----------+---------+--------------+-------------+
| group_id | my_id | previous | in_this | higher_value | most_recent |
+----------+-------+----------+---------+--------------+-------------+
| 900      | 1     | null     | Y       | 7            | May 16      |
| 902      | 7     | null     | N       | 9            | Oct 16      |
| 903      | 8     | null     | Y       | 3            | Oct 16      |
+----------+-------+----------+---------+--------------+-------------+
Hi all, I need help with a query that returns one record within a group based on the importance of the field. The importance is ranked as follows:
previous - if any record within a group_id has a non-null previous, then no record from that group_id is returned (because according to our rules, all records within a group should have the same previous value)
in_this - if one record is Y and the other is N within a group_id, then we keep the Y; if all records are Y or all are N, we move to the next attribute
higher_value - if all records are equal on the ‘in_this’ field, we select the record with the greater value in this field; if the records have equal values, we move to the next attribute
most_recent - if all records were of equal value in the ‘higher_value’ field, we take the newest record; if these are equal too, nothing is returned
This is a simplified version of the table I am looking at, but I just would like to get the gist of how something like this would work. Basically, my table has multiple copies of records that have been grouped through some algorithm. I have been tasked with selecting which of these records within a group is the ‘good’ one, and we are basing this on these fields.
I’d like the output to actually show all fields, because I will likely attempt to refine the query to include other fields (there are over 40 to consider), but the most important are the group_id and my_id fields. It would be neat if we could also somehow flag why each record got picked, but that isn’t necessary.
It seems like something like this should be easy, but I have a hard time wrapping my head around how to pick from within a group_id. Thanks for your help.
You can use analytic functions for this. The trick is establishing the right variables for each condition:
select t.*
from (select t.*,
max(in_this) over (partition by group_id) as max_in_this,
min(higher_value) over (partition by group_id) as min_higher_value,
max(higher_value) over (partition by group_id) as max_higher_value,
row_number() over (partition by group_id, higher_value order by my_id) as seqnum_ghv,
min(most_recent) over (partition by group_id) as min_most_recent,
max(most_recent) over (partition by group_id) as max_most_recent,
row_number() over (partition by group_id order by most_recent) as seqnum_mr
from t
) t
where max_in_this is not null and
( (min_higher_value <> max_higher_value and seqnum_ghv = 1) or
(min_higher_value = max_higher_value and min_most_recent <> max_most_recent and seqnum_mr = 1
)
);
The third condition as stated makes no sense, but you should get the idea for how to implement this.
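If it helps, the stated priorities can also be folded into a single ROW_NUMBER(). A sketch, assuming the table is named copies with the columns shown above (note that 'Oct 16' > 'May 16' happens to hold even as plain text, but a real date column would be safer):
select group_id, my_id, previous, in_this, higher_value, most_recent
from (select c.*,
             -- rule 1: any non-null "previous" disqualifies the whole group
             max(previous) over (partition by group_id) as any_previous,
             -- rules 2-4 as one composite ranking: Y before N, then the
             -- greater value, then the newest record
             row_number() over (partition by group_id
                                order by in_this desc, higher_value desc,
                                         most_recent desc) as rn,
             -- counts exact ties on all three attributes, so a complete
             -- tie (group 904) can return nothing
             count(*) over (partition by group_id, in_this, higher_value,
                            most_recent) as ties
      from copies c
     ) x
where any_previous is null and rn = 1 and ties = 1;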
I would like to filter my table with the MIN() function but still keep columns that can't be grouped.
I have this table:
+----+----------+----------------------+
| ID | distance | geom |
+----+----------+----------------------+
| 1 | 2 | DSDGSAsd23423DSFF |
| 2 | 11.2 | SXSADVERG678BNDVS4 |
| 2 | 2 | XCZFETEFD567687SDF |
| 3 | 24 | SADASDSVG3423FD |
| 3 | 10 | SDFSDFSDF343DFDGF |
| 4 | 34 | SFDHGHJ546GHJHJHJ |
| 5 | 22 | SDFSGTHHGHGFHUKJYU45 |
| 6 | 78 | SDFDGDHKIKUI45 |
| 6 | 15 | DSGDHHJGHJKHGKHJKJ65 |
+----+----------+----------------------+
This is what I would like to achieve:
+----+----------+----------------------+
| ID | distance | geom |
+----+----------+----------------------+
| 1 | 2 | DSDGSAsd23423DSFF |
| 2 | 2 | XCZFETEFD567687SDF |
| 3 | 10 | SDFSDFSDF343DFDGF |
| 4 | 34 | SFDHGHJ546GHJHJHJ |
| 5 | 22 | SDFSGTHHGHGFHUKJYU45 |
| 6 | 15 | DSGDHHJGHJKHGKHJKJ65 |
+----+----------+----------------------+
It is possible when I use MIN() on the distance column and group by ID, but then I lose my geom, which is essential.
The query looks like this:
SELECT "ID", MIN(distance) AS distance FROM somefile GROUP BY "ID"
the result is:
+----+----------+
| ID | distance |
+----+----------+
| 1 | 2 |
| 2 | 2 |
| 3 | 10 |
| 4 | 34 |
| 5 | 22 |
| 6 | 15 |
+----+----------+
but this is not what I want.
Any suggestions?
One common approach to this is to find the minimum values in a derived table that you join with:
SELECT somefile."ID", somefile.distance, somefile.geom
FROM somefile
JOIN (
SELECT "ID", MIN(distance) AS distance FROM somefile GROUP BY "ID"
) t ON t.distance = somefile.distance AND t.ID = somefile.ID;
Sample SQL Fiddle
You need a window function to do this:
SELECT "ID", distance, geom
FROM (
SELECT "ID", distance, geom, rank() OVER (PARTITION BY "ID" ORDER BY distance) AS rnk
FROM somefile) sub
WHERE rnk = 1;
This effectively partitions the set of rows by the "ID" value, orders each partition by distance, and returns the record for each "ID" where the distance is minimal - no need to do a GROUP BY. Note that rank() will return more than one row per "ID" if two rows tie for the minimal distance; use row_number() instead if you want exactly one.
select a.*, b.geom from
(SELECT "ID", MIN(distance) AS distance FROM somefile GROUP BY "ID") as a
inner join somefile as b on a."ID" = b."ID" and a.distance = b.distance
You can use the "distinct on" clause of PostgreSQL.
select distinct on ("ID") "ID", distance, geom
from somefile
order by "ID", distance;
(The ORDER BY has to start with the DISTINCT ON expression; whatever comes after it decides which row is kept per "ID" - here the smallest distance.)
I think this is exactly what you are looking for.
For more details on how "distinct on" works, refer to the documentation and the examples.
But remember, "distinct on" is not part of the SQL standard.
Assume I have this schema (tested on PostgreSQL) where the 'Scorelines' relation contains results of sports matches. (kickoff is a TIMESTAMP, but replaced by INT for readability.)
SQLFiddle here: http://sqlfiddle.com/#!12/52475/3
CREATE TABLE Scorelines (
team TEXT,
kickoff INT,
scored INT,
conceded INT
);
Now I want to produce another column 'three_matches_scored' that contains the sum of the points scored
over the 3 preceding games (determined by kickoff) of the same team. I have this:
SELECT team, kickoff, scored, conceded, SUM(scored) OVER three_matches AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
(PARTITION BY team ORDER BY kickoff
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff;
This works beautifully so far, except that I get values starting from the second game. Example:
| TEAM | KICKOFF | SCORED | CONCEDED | THREE_MATCHES_SCORED |
|------|---------|--------|----------|----------------------|
| A | 1 | 1 | 0 | (null) |
| B | 2 | 1 | 1 | (null) |
| A | 3 | 1 | 1 | 1 |
| A | 4 | 3 | 0 | 2 |
| B | 4 | 1 | 4 | 1 |
| A | 6 | 0 | 2 | 5 |
| B | 6 | 4 | 2 | 2 |
| B | 8 | 1 | 2 | 6 |
| B | 10 | 1 | 1 | 6 |
| A | 11 | 2 | 1 | 4 |
I want the column 'three_matches_scored' to be (null) for the first 3 games because there are no 3 results to sum up. How can I achieve this?
I'd prefer simple understandable solutions, performance is not critical for this particular case.
My only idea right now is to define a stored function SUM3 that results in (null) with fewer than 3 values to add up. But I have never defined a function in SQL and can't seem to figure it out.
You can use a case statement to null the rows where there are less than 3 games:
SELECT team, kickoff, scored, conceded,
CASE WHEN COUNT(scored) OVER three_matches = 3
THEN SUM(scored) OVER three_matches
ELSE NULL
END AS three_matches_scored
FROM Scorelines
WINDOW three_matches AS
(PARTITION BY team ORDER BY kickoff
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
ORDER BY kickoff;
Output:
team | kickoff | scored | conceded | three_matches_scored
------+---------+--------+----------+----------------------
A | 1 | 1 | 0 |
B | 2 | 1 | 1 |
A | 3 | 1 | 1 |
A | 4 | 3 | 0 |
B | 4 | 1 | 4 |
A | 6 | 0 | 2 | 5
B | 6 | 4 | 2 |
B | 8 | 1 | 2 | 6
B | 10 | 1 | 1 | 6
A | 11 | 2 | 1 | 4
(10 rows)
See harmic's answer above.
(My first solution follows, just for reference.)
Solution with user-defined aggregate:
-- intermediate state: the running sum, plus how many inputs are still expected
CREATE TYPE intermediate_sum AS (
    sum INT,
    count INT
);

-- state transition: add the next value and count down
CREATE FUNCTION sum_sfunc(intermediate_sum, INTEGER) RETURNS intermediate_sum AS
$$ SELECT $2 + $1.sum AS sum, $1.count - 1 AS count $$ LANGUAGE SQL;

-- final function: only return the sum once the expected 3 values have been
-- added (count has reached 0); fewer values fall through to NULL
CREATE FUNCTION sum_ffunc(intermediate_sum) RETURNS INTEGER AS
$$ SELECT (CASE WHEN $1.count > 1 THEN null
                WHEN $1.count = 0 THEN $1.sum
           END)
$$ LANGUAGE SQL;

CREATE AGGREGATE sum3(INTEGER) (
    sfunc     = sum_sfunc,
    finalfunc = sum_ffunc,
    stype     = intermediate_sum,
    initcond  = '(0,3)'  -- start at sum 0, expecting 3 values
);
The aggregate SUM3 wants at least 3 values, otherwise it returns (null). One can define other aggregates like SUM4 by changing the initcond, for example to '(0,4)'.
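A usage sketch with the window frame from the question (in PostgreSQL any aggregate can be used as a window aggregate; without moving-aggregate support it is simply recomputed for each row, which is fine here since performance is not critical):
SELECT team, kickoff, scored, conceded,
       sum3(scored) OVER (PARTITION BY team ORDER BY kickoff
                          ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS three_matches_scored
FROM Scorelines
ORDER BY kickoff;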
What is the 'correct' query to fetch a cumulative sum in MySQL?
I have a table where I keep information about files; one column contains the size of the files in bytes. (The actual files are kept on disk somewhere.)
I would like to get the cumulative file size like this:
+------------+---------+--------+----------------+
| fileInfoId | groupId | size | cumulativeSize |
+------------+---------+--------+----------------+
| 1 | 1 | 522120 | 522120 |
| 2 | 2 | 316042 | 316042 |
| 4 | 2 | 711084 | 1027126 |
| 5 | 2 | 697002 | 1724128 |
| 6 | 2 | 663425 | 2387553 |
| 7 | 2 | 739553 | 3127106 |
| 8 | 2 | 700938 | 3828044 |
| 9 | 2 | 695614 | 4523658 |
| 10 | 2 | 744204 | 5267862 |
| 11 | 2 | 609022 | 5876884 |
| ... | ... | ... | ... |
+------------+---------+--------+----------------+
20000 rows in set (19.2161 sec.)
Right now, I use the following query to get the above results
SELECT
a.fileInfoId
, a.groupId
, a.size
, SUM(b.size) AS cumulativeSize
FROM fileInfo AS a
LEFT JOIN fileInfo AS b USING(groupId)
WHERE a.fileInfoId >= b.fileInfoId
GROUP BY a.fileInfoId
ORDER BY a.groupId, a.fileInfoId
My solution is, however, extremely slow (around 19 seconds without cache).
Explain gives the following execution details
+----+--------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| 1 | SIMPLE | a | index | PRIMARY,foreignId | PRIMARY | 4 | NULL | 14905 | |
| 1 | SIMPLE | b | ref | PRIMARY,foreignId | foreignId | 4 | db.a.foreignId | 36 | Using where |
+----+--------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
My question is:
How can I optimize the above query?
Update
I've updated the question to provide the table structure and a procedure to fill the table with 20,000 records of test data.
CREATE TABLE `fileInfo` (
`fileInfoId` int(10) unsigned NOT NULL AUTO_INCREMENT
, `groupId` int(10) unsigned NOT NULL
, `name` varchar(128) NOT NULL
, `size` int(10) unsigned NOT NULL
, PRIMARY KEY (`fileInfoId`)
, KEY `groupId` (`groupId`)
) ENGINE=InnoDB;
delimiter $$
DROP PROCEDURE IF EXISTS autofill$$
CREATE PROCEDURE autofill()
BEGIN
DECLARE i INT DEFAULT 0;
DECLARE gid INT DEFAULT 0;
DECLARE nam char(20);
DECLARE siz INT DEFAULT 0;
WHILE i < 20000 DO
SET gid = FLOOR(RAND() * 250);
SET nam = CONV(FLOOR(RAND() * 10000000000000), 20, 36);
SET siz = FLOOR((RAND() * 1024 * 1024));
INSERT INTO `fileInfo` (`groupId`, `name`, `size`) VALUES(gid, nam, siz);
SET i = i + 1;
END WHILE;
END;$$
delimiter ;
CALL autofill();
About the possible duplicate question
The question linked by Forgotten Semicolon is not the same question. My question has an extra column. Because of this extra groupId column, the accepted answer there does not work for my problem. (Maybe it can be adapted to work, but I don't know how, hence my question.)
You could use a variable - it's far quicker than any join:
SELECT
    id,
    size,
    @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total := 0) AS t;
Here's a quick test case on a Pentium III with 128MB RAM running Debian 5.0:
Create the table:
DROP TABLE IF EXISTS `table1`;
CREATE TABLE `table1` (
`id` int(11) NOT NULL auto_increment,
`size` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Fill with 20,000 random numbers:
DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
DECLARE i INT DEFAULT 0;
WHILE i < 20000 DO
INSERT INTO table1 (size) VALUES (FLOOR((RAND() * 1000)));
SET i = i + 1;
END WHILE;
END;
//
DELIMITER ;
CALL autofill();
Check the row count:
SELECT COUNT(*) FROM table1;
+----------+
| COUNT(*) |
+----------+
| 20000 |
+----------+
Run the cumulative total query:
SELECT
    id,
    size,
    @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total := 0) AS t;
+-------+------+----------------+
| id | size | cumulativeSize |
+-------+------+----------------+
| 1 | 226 | 226 |
| 2 | 869 | 1095 |
| 3 | 668 | 1763 |
| 4 | 733 | 2496 |
...
| 19997 | 966 | 10004741 |
| 19998 | 522 | 10005263 |
| 19999 | 713 | 10005976 |
| 20000 | 0 | 10005976 |
+-------+------+----------------+
20000 rows in set (0.07 sec)
UPDATE
I'd missed the grouping by groupId in the original question, and that certainly made things a bit trickier. I then wrote a solution which used a temporary table, but I didn't like it—it was messy and overly complicated. I went away and did some more research, and have come up with something far simpler and faster.
I can't claim all the credit for this—in fact, I can barely claim any at all, as it is just a modified version of Emulate row number from Common MySQL Queries.
It's beautifully simple, elegant, and very quick:
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
You can remove the outer SELECT ... AS tmp if you don't mind the prev_groupId column being returned. I found that it ran marginally faster without it.
Here's a simple test case:
INSERT INTO `fileInfo` VALUES
( 1, 3, 'name0', '10'),
( 5, 3, 'name1', '10'),
( 7, 3, 'name2', '10'),
( 8, 1, 'name3', '10'),
( 9, 1, 'name4', '10'),
(10, 2, 'name5', '10'),
(12, 4, 'name6', '10'),
(20, 4, 'name7', '10'),
(21, 4, 'name8', '10'),
(25, 5, 'name9', '10');
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
+------------+---------+-------+------+----------------+
| fileInfoId | groupId | name | size | cumulativeSize |
+------------+---------+-------+------+----------------+
| 8 | 1 | name3 | 10 | 10 |
| 9 | 1 | name4 | 10 | 20 |
| 10 | 2 | name5 | 10 | 10 |
| 1 | 3 | name0 | 10 | 10 |
| 5 | 3 | name1 | 10 | 20 |
| 7 | 3 | name2 | 10 | 30 |
| 12 | 4 | name6 | 10 | 10 |
| 20 | 4 | name7 | 10 | 20 |
| 21 | 4 | name8 | 10 | 30 |
| 25 | 5 | name9 | 10 | 10 |
+------------+---------+-------+------+----------------+
Here's a sample of the last few rows from a 20,000 row table:
| 19481 | 248 | 8CSLJX22RCO | 1037469 | 51270389 |
| 19486 | 248 | 1IYGJ1UVCQE | 937150 | 52207539 |
| 19817 | 248 | 3FBU3EUSE1G | 616614 | 52824153 |
| 19871 | 248 | 4N19QB7PYT | 153031 | 52977184 |
| 132 | 249 | 3NP9UGMTRTD | 828073 | 828073 |
| 275 | 249 | 86RJM39K72K | 860323 | 1688396 |
| 802 | 249 | 16Z9XADLBFI | 623030 | 2311426 |
...
| 19661 | 249 | ADZXKQUI0O3 | 837213 | 39856277 |
| 19870 | 249 | 9AVRTI3QK6I | 331342 | 40187619 |
| 19972 | 249 | 1MTAEE3LLEM | 1027714 | 41215333 |
+------------+---------+-------------+---------+----------------+
20000 rows in set (0.31 sec)
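For what it's worth, on MySQL 8.0+ (which postdates this question) a window function computes the per-group running total directly, without user variables; a sketch:
SELECT fileInfoId, groupId, name, size,
       SUM(size) OVER (PARTITION BY groupId ORDER BY fileInfoId) AS cumulativeSize
FROM fileInfo
ORDER BY groupId, fileInfoId;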
I think that MySQL is only using one of the indexes on the table. In this case, it's choosing the index on foreignId.
Add a covering compound index that includes both primaryId and foreignId.
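A sketch of such an index, using the column names from the posted schema (where groupId plays the foreignId role); the index name here is made up, and including size makes the index covering for the SUM as well:
ALTER TABLE fileInfo
    ADD INDEX idx_group_file_size (groupId, fileInfoId, size);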