Linear extrapolate values down to 0 from variable starting points - sql

I want to build a query which allows me to flexible linear extrapolate a number down to Age 0 starting from the last known value. The table (see below) has two columns, column Age and Volume. My last known volume is 321.60 at age 11, how can I linear extrapolate the 321.60 down to age 0 in annual steps? Also, I would like to design the query in a way which allows the age to change. For example, in another scenario the last volume is at age 27. I have been experimenting with the lead function, as a result I can extrapolate the volume at age 10 but the function does not allow me to extrapolate down to 0. How can I design a query which (A) allows me to linear extrapolate to age 0 and (B) is flexible and allows different starting points for the linear extrapolation.
SELECT [age],
[volume],
Concat(CASE WHEN volume IS NULL THEN ( Lead(volume, 1, 0) OVER (ORDER BY age) ) / ( age + 1 ) *
age END, volume) AS 'Extrapolate'
FROM tbl_volume
+-----+--------+-------------+
| Age | Volume | Extrapolate |
+-----+--------+-------------+
| 0 | NULL | NULL |
| 1 | NULL | NULL |
| 2 | NULL | NULL |
| 3 | NULL | NULL |
| 4 | NULL | NULL |
| 5 | NULL | NULL |
| 6 | NULL | NULL |
| 7 | NULL | NULL |
| 8 | NULL | NULL |
| 9 | NULL | NULL |
| 10 | NULL | 292.363 |
| 11 | 321.60 | 321.60 |
| 12 | 329.80 | 329.80 |
| 13 | 337.16 | 337.16 |
| 13 | 343.96 | 343.96 |
| 14 | 349.74 | 349.74 |
+-----+--------+-------------+

If I assume that the value is 0 at 0, then you can use simple arithmetic. This seems to work in your case:
select t.*,
coalesce(t.volume, t.age * (t2.volume / t2.age)) as extrapolated_volume
from t cross join
(select top (1) t2.*
from t t2
where t2.volume is not null
order by t2.age asc
) t2;
Here is a db<>fiddle

You can use a windowing function with an empty over() for this kind of thing. As a trivial example:
create table t(j int, k decimal(3,2));
insert t values (1, null), (2, null), (3, 3), (4, 4);
select j, j * avg(k / j) over ()
from t
Note that avg() ignores nulls.

Related

Generate multiple record from existing records based on interval columns [from and to]

I have 2 types of score [M,B] in column 3, if a type is M, then the score is either an S[scored] or SB[bonus scored] in column 6. Every interval [from_hrs - to_hrs] for a type B must have a corresponding SB for type M, thus, an interval for a type B cannot have a score of S for a type M. I have several records that were unfortunately captured as seen in the table below.
CREATE TABLE SCORE_TBL
(
ID int IDENTITY(1,1) PRIMARY KEY,
PERSONID_FK int NOT NULL,
S_TYPE varchar(50) NULL,
FROM_HRS int NULL,
TO_HRS int NULL,
SCORE varchar(50) NULL,
);
INSERT INTO SCORE_TBL(PERSONID_FK,S_TYPE,FROM_HRS,TO_HRS,SCORE)
VALUES
(1, 'M' , 0,20, 'S'),
(1, 'B',6, 8, 'B'),
(2, 'B',0, 2, 'B'),
(2, 'M',0,20, 'S'),
(2, 'B', 10,13, 'B'),
(2, 'B', 18,20, 'B'),
(2, 'M', 13,18, 'S');
| ID | PERSONID_FK |S_TYPE| FROM_HRS | TO_HRS | SCORE |
|----|-------------|------|----------|--------|-------|
| 1 | 1 | M | 0 | 20 | S |
| 2 | 1 | B | 6 | 8 | B |
| 3 | 2 | B | 0 | 2 | B |
| 4 | 2 | M | 0 | 20 | S |
| 5 | 2 | B | 10 | 13 | B |
| 6 | 2 | B | 18 | 20 | B |
| 7 | 2 | M | 13 | 18 | S |
I want the data to look like this
| ID | PERSONID_FK |S_TYPE| FROM_HRS | TO_HRS | SCORE |
|----|-------------|------|----------|--------|-------|
| 1 | 1 | M | 0 | 6 | S |
| 2 | 1 | M | 6 | 8 | SB |
| 3 | 1 | B | 6 | 8 | B |
| 4 | 1 | M | 8 | 20 | S |
| 5 | 2 | B | 0 | 2 | B |
| 6 | 2 | M | 0 | 2 | SB |
| 7 | 2 | M | 2 | 10 | S |
| 8 | 2 | B | 10 | 13 | B |
| 9 | 2 | M | 10 | 13 | SB |
| 10 | 2 | M | 13 | 18 | S |
| 11 | 2 | B | 18 | 20 | B |
| 12 | 2 | S | 18 | 20 | SB |
Any ideas on how to generate this data in SQL Server select statement? Visually, this what am trying to get.
Tricky part here is that interval might need to be split in several pieces like 0..20 for person 2.
Window functions to the rescue. This query illustrates what you need to do:
WITH
deltas AS (
SELECT personid_fk, hrs, sum(delta_s) as delta_s, sum(delta_b) as delta_b
FROM (SELECT personid_fk, from_hrs as hrs,
case when score = 'S' then 1 else 0 end as delta_s,
case when score = 'B' then 1 else 0 end as delta_b
FROM score_tbl
UNION ALL
SELECT personid_fk, to_hrs as hrs,
case when score = 'S' then -1 else 0 end as delta_s,
case when score = 'B' then -1 else 0 end as delta_b
FROM score_tbl) _
GROUP BY personid_fk, hrs
),
running AS (
SELECT personid_fk, hrs as from_hrs,
lead(hrs) over (partition by personid_fk order by hrs) as to_hrs,
sum(delta_s) over (partition by personid_fk order by hrs) running_s,
sum(delta_b) over (partition by personid_fk order by hrs) running_b
FROM deltas
)
SELECT personid_fk, 'M' as s_type, from_hrs, to_hrs,
case when running_b > 0 then 'SB' else 'S' end as score
FROM running
WHERE running_s > 0
UNION ALL
SELECT personid_fk, s_type, from_hrs, to_hrs, score
FROM score_tbl
WHERE s_type = 'B'
ORDER BY personid_fk, from_hrs;
Step by step:
deltas is union of two passes on score_tbl - one for start and one for end of score/bonus interval, creating a timeline of +1/-1 events
running calculates running total of deltas over time, yielding split intervals where score/bonus are active
final query just converts score codes and unions bonus intervals (which are passed unchanged)
SQL Fiddle here.

PostgreSQL Current count of specific value

I need to achieve a view such as:
+------------+----------+--------------+----------------+------------------+
| Parent id | Expected | Parent Value | Distinct Value | Distinct Value 2 |
+------------+----------+--------------+----------------+------------------+
| 1 | 001.001 | 3 | 6/1/2017 | 5,000.00 |
| 1 | 001.002 | 3 | 9/1/2018 | 3,500.00 |
| 1 | 001.003 | 3 | 1/7/2018 | 9,000.00 |
| 2 | 002.001 | 7 | 9/1/2017 | 2,500.00 |
| 3 | 003.001 | 5 | 3/6/2017 | 1,200.00 |
| 3 | 003.002 | 5 | 16/8/2017 | 8,700.00 |
+------------+----------+--------------+----------------+------------------+
where I get distinct child objects that have same parents, but I cannot make the "Expected" column work. Those zeros don't really matter, I just need to get subindex like "1.1", "1.2" to work. I tried rank() function but it seems it doesnt really help.
Any help appreciated, thanks in advance.
My initial try looks like this:
SELECT DISTINCT
parent.parent_id,
rank() OVER ( order by parent_id ) as expected,
parent.parent_value,
ct.distinct_value,
ct.distinct_value_2
FROM parent
LEFT JOIN (crosstab (...) )
AS ct( ... )
ON ...
Use partition by parent_id in the window function and order by another_col to define the order in groups by parent_id.
with parent(parent_id, another_col) as (
values (1, 30), (1, 20), (1, 10), (2, 40), (3, 60), (3, 50)
)
select
parent_id,
another_col,
format('%s.%s', parent_id, row_number() over w) as expected
from parent
window w as (partition by parent_id order by another_col);
parent_id | another_col | expected
-----------+-------------+----------
1 | 10 | 1.1
1 | 20 | 1.2
1 | 30 | 1.3
2 | 40 | 2.1
3 | 50 | 3.1
3 | 60 | 3.2
(6 rows)

Converting rows from a table into days of the week

What I thought was going to be a fairly easy task is becoming a lot more difficult than I expected. We have several tasks that get performed sometimes several times per day, so we have a table that gets a row added whenever a user performs the task. What I need is a snapshot of the month with the initials and time of the person that did the task like this:
The 'activity log' table is pretty simple, it just has the date/time the task was performed along with the user that did it and the scheduled time (the "Pass Time" column in the image); this is the table I need to flatten out into days of the week.
Each 'order' can have one or more 'pass times' and each pass time can have zero or more initials for that day. For example, for pass time 8:00, it can be done several times during that day or not at all.
I have tried standard joins to get the orders and the scheduled pass times with no issues, but getting the days of the week is escaping me. I have tried creating a function to get all the initials for the day and just creating
'select FuncCall() as 1, FuncCall() as 2', etc. for each day of the week but that is a real performance suck.
Does anyone know of a better technique?
Update: I think the comment about PIVOT looks promising, but not quite sure because everything I can find uses an aggregate function in the PIVOT part. So if I have the following table:
create table #MyTable (OrderName nvarchar(10),DateDone date, TimeDone time, Initials nvarchar(4), PassTime nvarchar(8))
insert into #MyTable values('Order 1','2018/6/1','2:00','ABC','1st Pass')
insert into #MyTable values('Order 1','2018/6/1','2:20','DEF','1st Pass')
insert into #MyTable values('Order 1','2018/6/1','4:40','XYZ','2nd Pass')
insert into #MyTable values('Order 1','2018/6/3','5:00','ABC','1st Pass')
insert into #MyTable values('Order 1','2018/6/4','4:00','QXY','2nd Pass')
insert into #MyTable values('Order 1','2018/6/10','2:00','ABC','1st Pass')
select * from #MyTable
pivot () -- Can't figure out what goes here since all examples I see have an aggregate function call such as AVG...
drop table #MyTable
I don't see how to get this output since I am not aggregating anything other than the initials column:
Something like this?
DECLARE #taskTable TABLE(ID INT IDENTITY,Task VARCHAR(100),TaskPerson VARCHAR(100),TaskDate DATETIME);
INSERT INTO #taskTable VALUES
('Task before June 2018','AB','2018-05-15T12:00:00')
,('Task 1','AB','2018-06-03T13:00:00')
,('Task 1','CD','2018-06-04T14:00:00')
,('Task 2','AB','2018-06-05T15:00:00')
,('Task 1','CD','2018-06-06T16:00:00')
,('Task 1','EF','2018-06-06T17:00:00')
,('Task 1','EF','2018-06-06T18:00:00')
,('Task 2','GH','2018-06-07T19:00:00')
,('Task 1','CD','2018-06-07T20:00:00')
,('After June 2018','CD','2018-07-15T21:00:00');
SELECT p.*
FROM
(
SELECT t.Task
,ROW_NUMBER() OVER(PARTITION BY t.Task,CAST(t.TaskDate AS DATE) ORDER BY t.TaskDate) AS Taskindex
,CONCAT(t.TaskPerson,' ',CONVERT(VARCHAR(5),t.TaskDate,114)) AS Content
,DAY(TaskDate) AS ColumnName
FROM #taskTable t
WHERE YEAR(t.TaskDate)=2018 AND MONTH(t.TaskDate)=6
) tbl
PIVOT
(
MAX(Content) FOR ColumnName IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10]
,[11],[12],[13],[14],[15],[16],[17],[18],[19],[20]
,[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31])
) P
ORDER BY P.Task,Taskindex;
The result
+--------+-----------+------+------+----------+----------+----------+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| Task | Taskindex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
+--------+-----------+------+------+----------+----------+----------+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| Task 1 | 1 | NULL | NULL | AB 13:00 | CD 14:00 | NULL | CD 16:00 | CD 20:00 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+--------+-----------+------+------+----------+----------+----------+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| Task 1 | 2 | NULL | NULL | NULL | NULL | NULL | EF 17:00 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+--------+-----------+------+------+----------+----------+----------+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| Task 1 | 3 | NULL | NULL | NULL | NULL | NULL | EF 18:00 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+--------+-----------+------+------+----------+----------+----------+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
| Task 2 | 1 | NULL | NULL | NULL | NULL | AB 15:00 | NULL | GH 19:00 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+--------+-----------+------+------+----------+----------+----------+----------+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+------+
The first trick is, to use the day's index (DAY()) as column name. The second trick is the ROW_NUMBER(). This will add a running index per task and day thus replicating the rows per index. Otherwise you'd get just one entry per day.
You input tables will be more complex, but I think this shows the principles...
UPDATE: So we have to get it even slicker :-D
WITH prepareData AS
(
SELECT t.Task
,t.TaskPerson
,t.TaskDate
,CONVERT(VARCHAR(10),t.TaskDate,126) AS TaskDay
,DAY(t.TaskDate) AS TaskDayIndex
,CONVERT(VARCHAR(5),t.TaskDate,114) AS TimeContent
FROM #taskTable t
WHERE YEAR(t.TaskDate)=2018 AND MONTH(t.TaskDate)=6
)
SELECT p.*
FROM
(
SELECT t.Task
,STUFF((
SELECT ', ' + CONCAT(x.TaskPerson,' ',TimeContent)
FROM prepareData AS x
WHERE x.Task=t.Task
AND x.TaskDay= t.TaskDay
ORDER BY x.TaskDate
FOR XML PATH(''),TYPE
).value(N'.',N'nvarchar(max)'),1,2,'') AS Content
,t.TaskDayIndex
FROM prepareData t
GROUP BY t.Task, t.TaskDay,t.TaskDayIndex
) p--tbl
PIVOT
(
MAX(Content) FOR TaskDayIndex IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10]
,[11],[12],[13],[14],[15],[16],[17],[18],[19],[20]
,[21],[22],[23],[24],[25],[26],[27],[28],[29],[30],[31])
) P
ORDER BY P.Task;
The result
+--------+------+------+----------+----------+----------+------------------------------+----------+------+
| Task | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+--------+------+------+----------+----------+----------+------------------------------+----------+------+
| Task 1 | NULL | NULL | AB 13:00 | CD 14:00 | NULL | CD 16:00, EF 17:00, EF 18:00 | CD 20:00 | NULL |
+--------+------+------+----------+----------+----------+------------------------------+----------+------+
| Task 2 | NULL | NULL | NULL | NULL | AB 15:00 | NULL | GH 19:00 | NULL |
+--------+------+------+----------+----------+----------+------------------------------+----------+------+
This will use a well discussed XML trick within a correlated sub-query in order to get all common entries together as one. With this united content you can go the normal PIVOT path. The aggregate will not compute anything, as there is - for sure - just one value per cell.

How do I do multiple selection based on a flowchart of criteria?

Table name: Copies
+------------------------------------------------------------------------------------+
| group_id | my_id | previous | in_this | higher_value | most_recent |
+----------------------------------------------------------------------------------------------------------------
| 900 | 1 | null | Y | 7 | May16 |
| 900 | 2 | null | Y | 3 | Oct 16 |
| 900 | 3 | null | N | 9 | Oct 16 |
| 901 | 4 | 378 | Y | 3 | Oct 16 |
| 901 | 5 | null | N | 2 | Oct 16 |
| 902 | 6 | null | N | 5 | May16 |
| 902 | 7 | null | N | 9 | Oct 16 |
| 903 | 8 | null | Y | 3 | Oct 16 |
| 903 | 9 | null | Y | 3 | May16 |
| 904 | 10 | null | N | 0 | May 16 |
| 904 | 11 | null | N | 0 | May16
--------------------------------------------------------------------------------------
Output table
+---------------------------------------------------------------------------------------------------+
| group_id | my_id | previous | in_this | higher_value |most_recent|
+----------------------------------------------------------------------------------------------------
| 900 | 1 | null | Y | 7 | May16 |
| 902 | 7 | null | N | 9 | Oct 16 |
| 903 | 8 | null | Y | 3 | Oct 16 |
---------------------------------------------------------------------------------------------------------
Hi all, I need help with a query that returns one record within a group based on the importance of the field. The importance is ranked as follows:
previous- if one record within the group_id is not null, then neither record within a group_id is returned (because according to our rules, all records within a group should have the same previous value)
in_this- If one record is Y, and the other is N within a group_id, then we keep the Y; If all records are Y or all are N, then we move to the next attribute
Higher_value- If all records in the ‘in_this’ field are equal, then we need to select the record with the greater value from this field. If both records have an equal value, we move to the next attribute
Most_recent- If all records were of equal value in the ‘higher_value’ field, then we consider the newest record. If these are equal, then nothing is returned.
This is a simplified version of the table I am looking at, but I just would like to get the gist of how something like this would work. Basically, my table has multiple copies of records that have been grouped through some algorithm. I have been tasked with selecting which of these records within a group is the ‘good’ one, and we are basing this on these fields.
I’d like the output to actually show all fields, because I will likely attempt to refine the query to include other fields (there are over 40 to consider), but the most important is the group_id and my_id fields. It would be neat if we could also somehow flag why each record got picked, but that isn’t necessary.
It seems like something like this should be easy, but I have a hard time wrapping my head around how to pick from within a group_id. Thanks for your help.
You can use analytic functions for this. The trick is establishing the right variables for each condition:
select t.*
from (select t.*,
max(in_this) over (partition by group_id) as max_in_this,
min(higher_value) over (partition by group_id) as min_higher_value,
max(higher_value) over (partition by group_id) as max_higher_value,
row_number() over (partition by group_id, higher_value order by my_id) as seqnum_ghv,
min(most_recent) over (partition by group_id) as min_most_recent,
max(most_recent) over (partition by group_id) as max_most_recent,
row_number() over (partition by group_id order by most_recent) as seqnum_mr
from t
) t
where max_in_this is not null and
( (min_higher_value <> max_higher_value and seqnum_ghv = 1) or
(min_higher_value = max_higher_value and min_most_recent <> max_most_recent and seqnum_mr = 1
)
);
The third condition as stated makes no sense, but you should get the idea for how to implement this.

Optimal query to fetch a cumulative sum in MySQL

What is 'correct' query to fetch a cumulative sum in MySQL?
I've a table where I keep information about files, one column list contains the size of the files in bytes. (the actual files are kept on disk somewhere)
I would like to get the cumulative file size like this:
+------------+---------+--------+----------------+
| fileInfoId | groupId | size | cumulativeSize |
+------------+---------+--------+----------------+
| 1 | 1 | 522120 | 522120 |
| 2 | 2 | 316042 | 316042 |
| 4 | 2 | 711084 | 1027126 |
| 5 | 2 | 697002 | 1724128 |
| 6 | 2 | 663425 | 2387553 |
| 7 | 2 | 739553 | 3127106 |
| 8 | 2 | 700938 | 3828044 |
| 9 | 2 | 695614 | 4523658 |
| 10 | 2 | 744204 | 5267862 |
| 11 | 2 | 609022 | 5876884 |
| ... | ... | ... | ... |
+------------+---------+--------+----------------+
20000 rows in set (19.2161 sec.)
Right now, I use the following query to get the above results
SELECT
a.fileInfoId
, a.groupId
, a.size
, SUM(b.size) AS cumulativeSize
FROM fileInfo AS a
LEFT JOIN fileInfo AS b USING(groupId)
WHERE a.fileInfoId >= b.fileInfoId
GROUP BY a.fileInfoId
ORDER BY a.groupId, a.fileInfoId
My solution is however, extremely slow. (around 19 seconds without cache).
Explain gives the following execution details
+----+--------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| 1 | SIMPLE | a | index | PRIMARY,foreignId | PRIMARY | 4 | NULL | 14905 | |
| 1 | SIMPLE | b | ref | PRIMARY,foreignId | foreignId | 4 | db.a.foreignId | 36 | Using where |
+----+--------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
My question is:
How can I optimize the above query?
Update
I've updated the question as to provide the table structure and a procedure to fill the table with 20,000 records test data.
CREATE TABLE `fileInfo` (
`fileInfoId` int(10) unsigned NOT NULL AUTO_INCREMENT
, `groupId` int(10) unsigned NOT NULL
, `name` varchar(128) NOT NULL
, `size` int(10) unsigned NOT NULL
, PRIMARY KEY (`fileInfoId`)
, KEY `groupId` (`groupId`)
) ENGINE=InnoDB;
delimiter $$
DROP PROCEDURE IF EXISTS autofill$$
CREATE PROCEDURE autofill()
BEGIN
DECLARE i INT DEFAULT 0;
DECLARE gid INT DEFAULT 0;
DECLARE nam char(20);
DECLARE siz INT DEFAULT 0;
WHILE i < 20000 DO
SET gid = FLOOR(RAND() * 250);
SET nam = CONV(FLOOR(RAND() * 10000000000000), 20, 36);
SET siz = FLOOR((RAND() * 1024 * 1024));
INSERT INTO `fileInfo` (`groupId`, `name`, `size`) VALUES(gid, nam, siz);
SET i = i + 1;
END WHILE;
END;$$
delimiter ;
CALL autofill();
About the possible duplicate question
The question linked by Forgotten Semicolon is not the same question. My question has extra column. because of this extra groupId column, the accepted answer there does not work for my problem. (maybe it can be adapted to work, but I don't know how, hence my question)
You could use a variable - it's far quicker than any join:
SELECT
id,
size,
#total := #total + size AS cumulativeSize,
FROM table, (SELECT #total:=0) AS t;
Here's a quick test case on a Pentium III with 128MB RAM running Debian 5.0:
Create the table:
DROP TABLE IF EXISTS `table1`;
CREATE TABLE `table1` (
`id` int(11) NOT NULL auto_increment,
`size` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Fill with 20,000 random numbers:
DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
DECLARE i INT DEFAULT 0;
WHILE i < 20000 DO
INSERT INTO table1 (size) VALUES (FLOOR((RAND() * 1000)));
SET i = i + 1;
END WHILE;
END;
//
DELIMITER ;
CALL autofill();
Check the row count:
SELECT COUNT(*) FROM table1;
+----------+
| COUNT(*) |
+----------+
| 20000 |
+----------+
Run the cumulative total query:
SELECT
id,
size,
#total := #total + size AS cumulativeSize
FROM table1, (SELECT #total:=0) AS t;
+-------+------+----------------+
| id | size | cumulativeSize |
+-------+------+----------------+
| 1 | 226 | 226 |
| 2 | 869 | 1095 |
| 3 | 668 | 1763 |
| 4 | 733 | 2496 |
...
| 19997 | 966 | 10004741 |
| 19998 | 522 | 10005263 |
| 19999 | 713 | 10005976 |
| 20000 | 0 | 10005976 |
+-------+------+----------------+
20000 rows in set (0.07 sec)
UPDATE
I'd missed the grouping by groupId in the original question, and that certainly made things a bit trickier. I then wrote a solution which used a temporary table, but I didn't like it—it was messy and overly complicated. I went away and did some more research, and have come up with something far simpler and faster.
I can't claim all the credit for this—in fact, I can barely claim any at all, as it is just a modified version of Emulate row number from Common MySQL Queries.
It's beautifully simple, elegant, and very quick:
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
SELECT
fileInfoId,
groupId,
name,
size,
#cs := IF(#prev_groupId = groupId, #cs+size, size) AS cumulativeSize,
#prev_groupId := groupId AS prev_groupId
FROM fileInfo, (SELECT #prev_groupId:=0, #cs:=0) AS vars
ORDER BY groupId
) AS tmp;
You can remove the outer SELECT ... AS tmp if you don't mind the prev_groupID column being returned. I found that it ran marginally faster without it.
Here's a simple test case:
INSERT INTO `fileInfo` VALUES
( 1, 3, 'name0', '10'),
( 5, 3, 'name1', '10'),
( 7, 3, 'name2', '10'),
( 8, 1, 'name3', '10'),
( 9, 1, 'name4', '10'),
(10, 2, 'name5', '10'),
(12, 4, 'name6', '10'),
(20, 4, 'name7', '10'),
(21, 4, 'name8', '10'),
(25, 5, 'name9', '10');
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
SELECT
fileInfoId,
groupId,
name,
size,
#cs := IF(#prev_groupId = groupId, #cs+size, size) AS cumulativeSize,
#prev_groupId := groupId AS prev_groupId
FROM fileInfo, (SELECT #prev_groupId := 0, #cs := 0) AS vars
ORDER BY groupId
) AS tmp;
+------------+---------+-------+------+----------------+
| fileInfoId | groupId | name | size | cumulativeSize |
+------------+---------+-------+------+----------------+
| 8 | 1 | name3 | 10 | 10 |
| 9 | 1 | name4 | 10 | 20 |
| 10 | 2 | name5 | 10 | 10 |
| 1 | 3 | name0 | 10 | 10 |
| 5 | 3 | name1 | 10 | 20 |
| 7 | 3 | name2 | 10 | 30 |
| 12 | 4 | name6 | 10 | 10 |
| 20 | 4 | name7 | 10 | 20 |
| 21 | 4 | name8 | 10 | 30 |
| 25 | 5 | name9 | 10 | 10 |
+------------+---------+-------+------+----------------+
Here's a sample of the last few rows from a 20,000 row table:
| 19481 | 248 | 8CSLJX22RCO | 1037469 | 51270389 |
| 19486 | 248 | 1IYGJ1UVCQE | 937150 | 52207539 |
| 19817 | 248 | 3FBU3EUSE1G | 616614 | 52824153 |
| 19871 | 248 | 4N19QB7PYT | 153031 | 52977184 |
| 132 | 249 | 3NP9UGMTRTD | 828073 | 828073 |
| 275 | 249 | 86RJM39K72K | 860323 | 1688396 |
| 802 | 249 | 16Z9XADLBFI | 623030 | 2311426 |
...
| 19661 | 249 | ADZXKQUI0O3 | 837213 | 39856277 |
| 19870 | 249 | 9AVRTI3QK6I | 331342 | 40187619 |
| 19972 | 249 | 1MTAEE3LLEM | 1027714 | 41215333 |
+------------+---------+-------------+---------+----------------+
20000 rows in set (0.31 sec)
I think that MySQL is only using one of the indexes on the table. In this case, it's choosing the index on foreignId.
Add a covering compound index that includes both primaryId and foreignId.