T-SQL: return individual values instead of cumulative values

I have a table in a database that stores Incoming, Outgoing and Net values for various Account Codes over time. Although there is a date field, the sequence of events per Account Code is based on the "Version" number, where 0 is the original record for each Account Code and the number increments by 1 after each change to that Account Code.
The Incoming and Outgoing values are stored in the database as cumulative values rather than as individual transaction values, but I am looking for a way to SELECT * FROM this table and return the individual amounts rather than the cumulative ones.
Below are test scripts for the table and data, plus two examples.
If I select WHERE Code = '123' in the test table I currently get this (values are cumulative):
+------+------------+---------+----------+----------+-----+
| Code | Date       | Version | Incoming | Outgoing | Net |
+------+------------+---------+----------+----------+-----+
| 123  | 01/01/2018 | 0       | 100      | 0        | 100 |
| 123  | 07/01/2018 | 1       | 150      | 0        | 150 |
| 123  | 09/01/2018 | 2       | 150      | 100      | 50  |
| 123  | 14/01/2018 | 3       | 200      | 100      | 100 |
| 123  | 18/01/2018 | 4       | 200      | 175      | 25  |
| 123  | 23/01/2018 | 5       | 225      | 175      | 50  |
| 123  | 30/01/2018 | 6       | 225      | 225      | 0   |
+------+------------+---------+----------+----------+-----+
This is what I would like to see (each individual transaction):
+------+------------+---------+----------+----------+------+
| Code | Date       | Version | Incoming | Outgoing | Net  |
+------+------------+---------+----------+----------+------+
| 123  | 01/01/2018 | 0       | 100      | 0        | 100  |
| 123  | 07/01/2018 | 1       | 50       | 0        | 50   |
| 123  | 09/01/2018 | 2       | 0        | 100      | -100 |
| 123  | 14/01/2018 | 3       | 50       | 0        | 50   |
| 123  | 18/01/2018 | 4       | 0        | 75       | -75  |
| 123  | 23/01/2018 | 5       | 25       | 0        | 25   |
| 123  | 30/01/2018 | 6       | 0        | 50       | -50  |
+------+------------+---------+----------+----------+------+
If I had the individual transaction values and wanted to report on the cumulative, I would use SUM() with OVER (PARTITION BY ...), but is there an opposite of that?
I am not looking to redesign the create table or the process in which it is stored, I am just looking for a way to report on this from our MI environment.
Note: I've added other random Account Codes to emphasise that the data is not ordered by Code or Version, but by Date.
Thanks in advance for any help.
USE [tempdb];

IF EXISTS ( SELECT *
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_NAME = 'Table1'
              AND TABLE_SCHEMA = 'dbo')
    DROP TABLE [dbo].[Table1];
GO

CREATE TABLE [dbo].[Table1]
(
     [Code]     CHAR(3)
    ,[Date]     DATE
    ,[Version]  CHAR(3)
    ,[Incoming] DECIMAL(20,2)
    ,[Outgoing] DECIMAL(20,2)
    ,[Net]      DECIMAL(20,2)
);
GO
INSERT INTO [dbo].[Table1] VALUES
('123','2018-01-01','0','100','0','100'),
('456','2018-01-02','0','50','0','50'),
('789','2018-01-03','0','0','0','0'),
('456','2018-01-04','1','100','0','100'),
('456','2018-01-05','2','150','0','150'),
('789','2018-01-06','1','50','50','0'),
('123','2018-01-07','1','150','0','150'),
('456','2018-01-08','3','200','0','200'),
('123','2018-01-09','2','150','100','50'),
('789','2018-01-10','2','0','0','0'),
('456','2018-01-11','4','225','0','225'),
('789','2018-01-12','3','75','25','50'),
('987','2018-01-13','0','0','50','-50'),
('123','2018-01-14','3','200','100','100'),
('654','2018-01-15','0','100','0','100'),
('456','2018-01-16','5','250','0','250'),
('987','2018-01-17','1','50','50','0'),
('123','2018-01-18','4','200','175','25'),
('789','2018-01-19','4','100','25','75'),
('987','2018-01-20','2','150','125','25'),
('321','2018-01-21','0','100','0','100'),
('654','2018-01-22','1','0','0','0'),
('123','2018-01-23','5','225','175','50'),
('321','2018-01-24','1','100','50','50'),
('789','2018-01-25','5','100','50','50'),
('987','2018-01-26','3','150','150','0'),
('456','2018-01-27','6','250','250','0'),
('456','2018-01-28','7','270','250','20'),
('321','2018-01-29','2','100','100','0'),
('123','2018-01-30','6','225','225','0'),
('987','2018-01-31','4','175','150','25')
;
GO
SELECT *
FROM [dbo].[Table1]
WHERE [Code] = '123'
GO
USE [tempdb];

IF EXISTS ( SELECT *
            FROM INFORMATION_SCHEMA.TABLES
            WHERE TABLE_NAME = 'Table1'
              AND TABLE_SCHEMA = 'dbo')
    DROP TABLE [dbo].[Table1];
GO

Just use lag():
select Code, Date, Version,
       (Incoming - lag(Incoming, 1, 0) over (partition by Code order by Date)) as incoming,
       (Outgoing - lag(Outgoing, 1, 0) over (partition by Code order by Date)) as outgoing,
       (Net - lag(Net, 1, 0) over (partition by Code order by Date)) as net
from [dbo].[Table1];
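If you want to sanity-check the derived figures, you can re-accumulate them with SUM() OVER (the forward direction the question mentions) and compare the result against the stored cumulative columns. A minimal sketch against the test table above (the CTE name is just for illustration):
-- Derive the individual amounts with LAG(), then roll them back up with SUM() OVER;
-- the running totals should match the cumulative values stored in the table.
WITH perTransaction AS
(
    SELECT [Code], [Date], [Version],
           [Incoming] - LAG([Incoming], 1, 0) OVER (PARTITION BY [Code] ORDER BY [Date]) AS [Incoming],
           [Outgoing] - LAG([Outgoing], 1, 0) OVER (PARTITION BY [Code] ORDER BY [Date]) AS [Outgoing],
           [Net] - LAG([Net], 1, 0) OVER (PARTITION BY [Code] ORDER BY [Date]) AS [Net]
    FROM [dbo].[Table1]
)
SELECT [Code], [Date], [Version],
       SUM([Incoming]) OVER (PARTITION BY [Code] ORDER BY [Date]) AS CumulativeIncoming,
       SUM([Outgoing]) OVER (PARTITION BY [Code] ORDER BY [Date]) AS CumulativeOutgoing,
       SUM([Net]) OVER (PARTITION BY [Code] ORDER BY [Date]) AS CumulativeNet
FROM perTransaction
ORDER BY [Code], [Date];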

Related

Running maths over an entire database and ranking all users

I have a database of bets. Each bet has a 'Win', 'Loss', or 'Pending' state. What I want is an SQL statement that will get the last, say, 20 bets a user has placed and work out their ROI (total profit / total staked * 100).
So I'm just wondering if there is a better way to do this. Do I basically have to get the users table, loop over every user, get their last 20 bets, find the ROI and then order the results? If my Users table gets huge then this process is going to take ages, right?
Is creating a view going to save on this time?
Is there a way to do this in one statement that won't cost my life in processing time?
Here are the tables
Users
| ID | User  |
| 1  | Test1 |
| 2  | Test2 |
| 3  | Test3 |
| 4  | Test4 |
Bets
| ID | User | Amount | Odds | Result  |
| 1  | 1    | 10     | 1.35 | Win     |
| 2  | 1    | 25     | 2.55 | Win     |
| 3  | 3    | 15     | 1.65 | Loss    |
| 4  | 2    | 11     | 2.12 | Pending |
So essentially I would like a table that ranks them by ROI.
| User | AmountBet | AmountWon | ROI |
| 1    | 35        | 77        | 215 |
| 2    | 11        | 0         | 0   |
| 3    | 15        | 0         | 0   |
| 4    | 0         | 0         | 0   |
Assuming the ID of the bets table represents increasing time such that it can be used to identify "last 20", then
WITH b AS
(
    SELECT id,
           user,
           CASE WHEN result = 'Pending' THEN 0 ELSE amount END AS amount,
           CASE WHEN result = 'Win' THEN amount * odds ELSE 0 END AS winnings,
           ROW_NUMBER() OVER (PARTITION BY user ORDER BY id DESC) AS rownum
    FROM bets
)
SELECT user,
       SUM(amount) AS amount_bet,
       SUM(winnings) AS amount_won,
       CASE
           WHEN SUM(amount) > 0 THEN SUM(winnings) * 100 / SUM(amount)
           ELSE 0
       END AS roi
FROM b
WHERE rownum < 21
GROUP BY user;
dbfiddle.uk
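On the question of whether a view would help: an ordinary (non-materialised) view only stores the query text, not the results, so it is mainly a convenience rather than a performance win. A minimal sketch of wrapping the query above in a view (the view name is just an example):
-- Same query as above, saved as a view for convenience; the work still happens at query time.
CREATE VIEW user_roi AS
WITH b AS
(
    SELECT id,
           user,
           CASE WHEN result = 'Pending' THEN 0 ELSE amount END AS amount,
           CASE WHEN result = 'Win' THEN amount * odds ELSE 0 END AS winnings,
           ROW_NUMBER() OVER (PARTITION BY user ORDER BY id DESC) AS rownum
    FROM bets
)
SELECT user,
       SUM(amount) AS amount_bet,
       SUM(winnings) AS amount_won,
       CASE WHEN SUM(amount) > 0 THEN SUM(winnings) * 100 / SUM(amount) ELSE 0 END AS roi
FROM b
WHERE rownum < 21
GROUP BY user;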

How to join transactional data with customer data tables and perform case-based operations in SQL

I'm trying to run a query across two different tables and apply case-by-case logic to produce a list of call records for a specific month.
Here are my tables:
Customer table:
+----+---------------+------------+
| id | name          | number     |
+----+---------------+------------+
| 1  | John Doe      | 8973221232 |
| 2  | American Dad  | 7165531212 |
| 3  | Michael Clean | 8884731234 |
| 4  | Samuel Gatsby | 9197543321 |
| 5  | Mike Chat     | 8794029819 |
+----+---------------+------------+
Transaction data:
+----------+------------+------------+----------+---------------------+
| trans_id | incoming   | outgoing   | duration | date_time           |
+----------+------------+------------+----------+---------------------+
| 1        | 8973221232 | 9197543321 | 64       | 2018-03-09 01:08:09 |
| 2        | 3729920490 | 7651113929 | 276      | 2018-07-20 05:53:10 |
| 3        | 8884731234 | 8973221232 | 382      | 2018-05-02 13:12:13 |
| 4        | 8973221232 | 9234759208 | 127      | 2018-07-07 15:32:30 |
| 5        | 7165531212 | 9197543321 | 852      | 2018-08-02 07:40:23 |
| 6        | 8884731234 | 9833823023 | 774      | 2018-07-03 14:27:52 |
| 7        | 8273820928 | 2374987349 | 120      | 2018-07-06 05:27:44 |
| 8        | 8973221232 | 9197543321 | 79       | 2018-07-30 12:51:55 |
| 9        | 7165531212 | 7651113929 | 392      | 2018-05-22 02:27:38 |
| 10       | 5423541524 | 7165531212 | 100      | 2018-07-21 22:12:20 |
| 11       | 9197543321 | 2983479820 | 377      | 2018-07-20 17:46:36 |
| 12       | 8973221232 | 7651113929 | 234      | 2018-07-09 03:32:53 |
| 13       | 7165531212 | 2309483932 | 88       | 2018-07-16 16:22:21 |
| 14       | 8973221232 | 8884731234 | 90       | 2018-09-03 13:10:00 |
| 15       | 3820838290 | 2093482348 | 238      | 2018-04-12 21:59:01 |
+----------+------------+------------+----------+---------------------+
What am I trying to accomplish?
I'm trying to compile a list of "costs" for each of the customers that made calls in July 2018. The costs are based on:
1) If the customer received a call (incoming), the cost of the call is equal to the duration;
2) if the customer made a call (outgoing), the cost of the call is 100 if the call is 30 or less in duration. If it exceeds 30 duration, then the cost is 100 plus 5 * duration of the exceeded period.
If the customer didn't make or receive any calls during that month, they shouldn't be on the list.
Examples:
1) Customer American Dad has 3 incoming calls and 1 outgoing call, however only trans_id 10 and 13 are for the month of July. He should be paying a total of 538:
for trans_id 10 = 450 (100 for the first 30s + 5 * 70 for the remaining)
for trans_id 13 = 88
2) Customer Samuel Gatsby has 1 incoming call and 3 outgoing calls, however only trans_id 8 and 11 are for the month of July. He should be paying a total of 722:
for trans_id 8 = 345 (100 for the first 30s + 5 * 49 for the remaining)
for trans_id 11 = 377
Considering only these two examples, the output would be:
+----+---------------+------------+----------+
| id | name          | number     | billable |
+----+---------------+------------+----------+
| 2  | American Dad  | 7165531212 | 538      |
| 4  | Samuel Gatsby | 9197543321 | 722      |
+----+---------------+------------+----------+
Note: Mike Chat shouldn't be on the list as he didn't make or receive any calls for that specific month.
What have I tried so far?
I've been playing cat and mouse with this one. I'm using the number as a unique ID; I've already attempted a full outer join combined with a filter on incoming or outgoing not being null and then applying rules by CASE, and I've tried a left join and applying CASE expressions, but I'm going in circles and can't get to a final list. Whenever I get incoming or outgoing, I'm either not able to apply the case or not able to combine the two. I'd really appreciate the help!
select customer_name.name, customer_name.number,
       bill = (CASE
                   WHEN customer_name.number = transaction_data.incoming then 'sum bill'
                   else 'multiply and add'
               end)
from customer_name
left join transaction_data
    on customer_name.number = transaction_data.incoming
    or customer_name.name = transaction_data.outgoing
where strftime('%Y-%m', transaction_data.date_time) = '2018-07'
Note: I'm using sqlite to try it out online but the database is on SQL Server 2012, so I know that I can use a date format much easier, that way, but I'd like to keep as close to T-SQL as possible.
I also tried creating a CASE to determine whether it's an incoming or outgoing call, but I'm only getting 'incoming' as a result, even though trans_id 10 is outgoing:
select name, number, duration,
       case
           when customer_name.number = transaction_data.incoming then 'incoming'
           when customer_name.number = transaction_data.outgoing then 'outgoing'
       end direction
from customer_name
left join transaction_data
    on customer_name.number = transaction_data.incoming
    or customer_name.name = transaction_data.outgoing
where strftime('%Y-%m', transaction_data.date_time) = '2018-07'
Try this:
SELECT
    c."name", c.number,
    SUM(CASE c.number
            WHEN t.incoming THEN t.duration
            ELSE IIF(t.duration - 30 < 0, 0, t.duration - 30) * 5 + 100
        END) AS billable
FROM Customer AS c
INNER JOIN [Transaction] AS t
    ON c.number IN (t.incoming, t.outgoing)
WHERE t.date_time >= '20180701' AND t.date_time < '20180801'
GROUP BY c."name", c.number
Output:
+---------------+------------+----------+
| name          | number     | billable |
+---------------+------------+----------+
| John Doe      | 8973221232 | 440      |
| American Dad  | 7165531212 | 538      |
| Michael Clean | 8884731234 | 774      |
| Samuel Gatsby | 9197543321 | 722      |
+---------------+------------+----------+
Test it online with SQL Fiddle.
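Since the question mentions testing in SQLite before moving to SQL Server 2012, here is a sketch of the same calculation with the IIF swapped for a plain CASE expression, which both engines accept (table and column names assumed to match the sample data above):
SELECT
    c.name, c.number,
    SUM(CASE
            WHEN c.number = t.incoming THEN t.duration
            -- outgoing call: flat 100 for the first 30, plus 5 per unit of duration beyond that
            ELSE 100 + 5 * (CASE WHEN t.duration > 30 THEN t.duration - 30 ELSE 0 END)
        END) AS billable
FROM Customer AS c
INNER JOIN [Transaction] AS t
    ON c.number IN (t.incoming, t.outgoing)
WHERE t.date_time >= '2018-07-01' AND t.date_time < '2018-08-01'
GROUP BY c.name, c.number;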

SQL Combine two tables with two parameters

I searched the forum for an hour and didn't find anything similar.
I have this problem: I want to compare two columns, ID and DATE. If they are the same in both tables, I want to put the number from table 2 next to the row; if they are not the same, I want to fill in the yearly quota that applies on that date. I am working in Access.
table1
id | date       | state_on_date
1  | 30.12.2013 | 23
1  | 31.12.2013 | 25
1  | 1.1.2014   | 35
1  | 2.1.2014   | 12
2  | 30.12.2013 | 34
2  | 31.12.2013 | 65
2  | 1.1.2014   | 43
table2
id | date       | year_quantity
1  | 31.12.2013 | 100
1  | 31.12.2014 | 150
2  | 31.12.2013 | 200
2  | 31.12.2014 | 300
I want to get:
table 3
id | date       | state_on_date | year_quantity
1  | 30.12.2013 | 23            | 100
1  | 31.12.2013 | 25            | 100
1  | 1.1.2014   | 35            | 150
1  | 2.1.2014   | 12            | 150
2  | 30.12.2013 | 34            | 200
2  | 31.12.2013 | 65            | 200
2  | 1.1.2014   | 43            | 300
I tried joins and reading forums but didn't find a solution.
Are you looking for this?
SELECT id, date, state_on_date,
       (
           SELECT TOP 1 year_quantity
           FROM table2
           WHERE id = t.id
             AND date >= t.date
           ORDER BY date
       ) AS year_quantity
FROM table1 t
Output:
| ID | DATE       | STATE_ON_DATE | YEAR_QUANTITY |
|----|------------|---------------|---------------|
| 1  | 2013-12-30 | 23            | 100           |
| 1  | 2013-12-31 | 25            | 100           |
| 1  | 2014-01-01 | 35            | 150           |
| 1  | 2014-01-02 | 12            | 150           |
| 2  | 2013-12-30 | 34            | 200           |
| 2  | 2013-12-31 | 65            | 200           |
| 2  | 2014-01-01 | 43            | 300           |
Here is an SQLFiddle demo. It's for SQL Server but should work just fine in MS Access.

Multiple self joins plus one inner join

I have two tables: ck_startup and ck_price. The price table contains the columns cu_type, prd_type, part_cd, qty, and dllrs. The startup table is linked to the price table through a one-to-many relationship on ck_startup.prd_type_cd = ck_price.prd_type.
The price table contains multiple entries for the same product/part/qty but under different customer types. Not all customer types have the same unique combination of those three values. I'm trying to create a query that will do two things:
Join some columns from ck_startup onto ck_price (description, and some additional values).
Join ck_price onto itself with a dllrs column for each customer type. So in total I would only have one instance of each unique key of product/part/qty, and a value in each customer's price column if they have one.
I've never worked with self joining tables, and so far I can only get records to show up where both customers have the same options available.
And because someone is going to demand I post sample code, here's the crappy query that doesn't show missing prices:
select pa.*, pac.dllrs
from ck_price pa
join ck_price pac
    on pa.prd_type = pac.prd_type
   and pa.part_carbon_cd = pac.part_carbon_cd
   and pa.qty = pac.qty
where pa.cu_type = 'A'
  and pac.cu_type = 'AC';
EDIT: Here's sample data from the two tables, and how I want them to look when I'm done:
CK_STARTUP
+-----+-----------------+-------------+
| CD  | DSC             | PRD_TYPE_CD |
+-----+-----------------+-------------+
| 3D  | Stuff           | SKD3        |
| DC  | Different stuff | SKD         |
| DN2 | Similar stuff   | SKD         |
+-----+-----------------+-------------+
CK_PRICE
+---------+-------------+---------+-----+-------+
| CU_TYPE | PRD_TYPE_CD | PART_CD | QTY | DLLRS |
+---------+-------------+---------+-----+-------+
| A       | SKD3        | 1       | 100 | 10    |
| A       | SKD3        | 1       | 200 | 20    |
| A       | SKD3        | 1       | 300 | 30    |
| A       | SKD         | 1       | 100 | 50    |
| A       | SKD         | 1       | 200 | 100   |
| AC      | SKD3        | 1       | 300 | 30    |
| AC      | SKD         | 1       | 100 | 100   |
| AC      | SKD         | 1       | 200 | 200   |
| AC      | SKD         | 1       | 300 | 300   |
| AC      | SKD         | 1       | 400 | 400   |
+---------+-------------+---------+-----+-------+
COMBO:
+----+-----------------+---------+-----+---------+----------+
| CD | DSC             | PART_CD | QTY | DLLRS_A | DLLRS_AC |
+----+-----------------+---------+-----+---------+----------+
| 3D | Stuff           | 1       | 100 | 10      | null     |
| 3D | Stuff           | 1       | 200 | 20      | null     |
| 3D | Stuff           | 1       | 300 | 30      | 60       |
| DC | Different stuff | 1       | 100 | 50      | 100      |
| DC | Different stuff | 1       | 200 | 100     | 200      |
| DC | Different stuff | 1       | 300 | null    | 300      |
| DC | Different stuff | 1       | 400 | null    | 400      |
+----+-----------------+---------+-----+---------+----------+
OK, take a look at the query below and at the results:
SELECT *
FROM (SELECT cs.cd, cs.dsc, cp.part_cd, cp.qty, cp.dllrs, cp.cu_type
      FROM ck_startup cs
      JOIN ck_price cp ON (cs.prd_type_cd = cp.prd_type_cd))
PIVOT (SUM(dllrs) AS dlllrs FOR (cu_type) IN ('A' AS a, 'AC' AS ac))
ORDER BY cd, qty;
Output:
CD   DSC               PART_CD   QTY   A_DLLLRS   AC_DLLLRS
---  ----------------  --------  ----  ---------  ----------
3D   Stuff             1         100   10
3D   Stuff             1         200   20
3D   Stuff             1         300   30         30
DC   Different stuff   1         100   50         50
DC   Different stuff   1         200   100        100
DC   Different stuff   1         300              150
DC   Different stuff   1         400              200
DN2  Similar stuff     1         100   50         50
DN2  Similar stuff     1         200   100        100
DN2  Similar stuff     1         300              150
DN2  Similar stuff     1         400              200
It is not quite what you expect, because I do not understand why you have values in the DLLRS_AC column that differ from those in the CK_PRICE table. For example, why do you have 400 in the last line of your output, not 200? Why is this value doubled (as the others in the DLLRS_AC column are)?
If you are using Oracle 10g, you can achieve the same result using DECODE and GROUP BY, take a look:
SELECT cd,
       dsc,
       part_cd,
       qty,
       SUM(DECODE(cu_type, 'A', dllrs, NULL)) AS dllrs_a,
       SUM(DECODE(cu_type, 'AC', dllrs, NULL)) AS dllrs_ac
FROM (
    SELECT cs.cd, cs.dsc, cp.part_cd, cp.qty, cp.dllrs, cp.cu_type
    FROM ck_startup cs
    JOIN ck_price cp ON (cs.prd_type_cd = cp.prd_type_cd)
)
GROUP BY cd, dsc, part_cd, qty
ORDER BY cd, qty;
Result is the same.
If you want to read more about pivoting, I recommend the article by Tim Hall: Pivot and Unpivot at Oracle Base.
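For what it's worth, the DECODE version translates directly into standard CASE expressions, so roughly the same conditional aggregation should also work outside Oracle; a minimal sketch using the sample tables above:
SELECT cs.cd,
       cs.dsc,
       cp.part_cd,
       cp.qty,
       -- one conditionally-summed column per customer type
       SUM(CASE WHEN cp.cu_type = 'A'  THEN cp.dllrs END) AS dllrs_a,
       SUM(CASE WHEN cp.cu_type = 'AC' THEN cp.dllrs END) AS dllrs_ac
FROM ck_startup cs
JOIN ck_price cp ON cs.prd_type_cd = cp.prd_type_cd
GROUP BY cs.cd, cs.dsc, cp.part_cd, cp.qty
ORDER BY cs.cd, cp.qty;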

Optimal query to fetch a cumulative sum in MySQL

What is the 'correct' query to fetch a cumulative sum in MySQL?
I have a table where I keep information about files; one column contains the size of the files in bytes (the actual files are kept on disk somewhere).
I would like to get the cumulative file size like this:
+------------+---------+--------+----------------+
| fileInfoId | groupId | size   | cumulativeSize |
+------------+---------+--------+----------------+
| 1          | 1       | 522120 | 522120         |
| 2          | 2       | 316042 | 316042         |
| 4          | 2       | 711084 | 1027126        |
| 5          | 2       | 697002 | 1724128        |
| 6          | 2       | 663425 | 2387553        |
| 7          | 2       | 739553 | 3127106        |
| 8          | 2       | 700938 | 3828044        |
| 9          | 2       | 695614 | 4523658        |
| 10         | 2       | 744204 | 5267862        |
| 11         | 2       | 609022 | 5876884        |
| ...        | ...     | ...    | ...            |
+------------+---------+--------+----------------+
20000 rows in set (19.2161 sec.)
Right now, I use the following query to get the above results
SELECT a.fileInfoId
     , a.groupId
     , a.size
     , SUM(b.size) AS cumulativeSize
FROM fileInfo AS a
LEFT JOIN fileInfo AS b USING (groupId)
WHERE a.fileInfoId >= b.fileInfoId
GROUP BY a.fileInfoId
ORDER BY a.groupId, a.fileInfoId
My solution is, however, extremely slow (around 19 seconds without the cache).
Explain gives the following execution details
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| id | select_type | table | type  | possible_keys     | key       | key_len | ref            | rows  | Extra       |
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| 1  | SIMPLE      | a     | index | PRIMARY,foreignId | PRIMARY   | 4       | NULL           | 14905 |             |
| 1  | SIMPLE      | b     | ref   | PRIMARY,foreignId | foreignId | 4       | db.a.foreignId | 36    | Using where |
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
My question is:
How can I optimize the above query?
Update
I've updated the question to provide the table structure and a procedure to fill the table with 20,000 records of test data.
CREATE TABLE `fileInfo` (
      `fileInfoId` int(10) unsigned NOT NULL AUTO_INCREMENT
    , `groupId`    int(10) unsigned NOT NULL
    , `name`       varchar(128) NOT NULL
    , `size`       int(10) unsigned NOT NULL
    , PRIMARY KEY (`fileInfoId`)
    , KEY `groupId` (`groupId`)
) ENGINE=InnoDB;
delimiter $$

DROP PROCEDURE IF EXISTS autofill$$

CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    DECLARE gid INT DEFAULT 0;
    DECLARE nam CHAR(20);
    DECLARE siz INT DEFAULT 0;

    WHILE i < 20000 DO
        SET gid = FLOOR(RAND() * 250);
        SET nam = CONV(FLOOR(RAND() * 10000000000000), 20, 36);
        SET siz = FLOOR((RAND() * 1024 * 1024));
        INSERT INTO `fileInfo` (`groupId`, `name`, `size`) VALUES (gid, nam, siz);
        SET i = i + 1;
    END WHILE;
END;$$

delimiter ;

CALL autofill();
About the possible duplicate question
The question linked by Forgotten Semicolon is not the same question. My question has an extra column; because of this extra groupId column, the accepted answer there does not work for my problem. (Maybe it can be adapted to work, but I don't know how, hence my question.)
You could use a variable - it's far quicker than any join:
SELECT
    id,
    size,
    @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total := 0) AS t;
Here's a quick test case on a Pentium III with 128MB RAM running Debian 5.0:
Create the table:
DROP TABLE IF EXISTS `table1`;

CREATE TABLE `table1` (
    `id` int(11) NOT NULL auto_increment,
    `size` int(11) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Fill with 20,000 random numbers:
DELIMITER //

DROP PROCEDURE IF EXISTS autofill//

CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 20000 DO
        INSERT INTO table1 (size) VALUES (FLOOR((RAND() * 1000)));
        SET i = i + 1;
    END WHILE;
END;
//

DELIMITER ;

CALL autofill();
Check the row count:
SELECT COUNT(*) FROM table1;
+----------+
| COUNT(*) |
+----------+
|    20000 |
+----------+
Run the cumulative total query:
SELECT
    id,
    size,
    @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total := 0) AS t;
+-------+------+----------------+
| id    | size | cumulativeSize |
+-------+------+----------------+
| 1     | 226  | 226            |
| 2     | 869  | 1095           |
| 3     | 668  | 1763           |
| 4     | 733  | 2496           |
...
| 19997 | 966  | 10004741       |
| 19998 | 522  | 10005263       |
| 19999 | 713  | 10005976       |
| 20000 | 0    | 10005976       |
+-------+------+----------------+
20000 rows in set (0.07 sec)
UPDATE
I'd missed the grouping by groupId in the original question, and that certainly made things a bit trickier. I then wrote a solution which used a temporary table, but I didn't like it—it was messy and overly complicated. I went away and did some more research, and have come up with something far simpler and faster.
I can't claim all the credit for this—in fact, I can barely claim any at all, as it is just a modified version of Emulate row number from Common MySQL Queries.
It's beautifully simple, elegant, and very quick:
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
You can remove the outer SELECT ... AS tmp if you don't mind the prev_groupID column being returned. I found that it ran marginally faster without it.
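For reference, that trimmed-down variant would look like this (same logic, just without the outer wrapper, so prev_groupId comes back as an extra column):
SELECT
    fileInfoId,
    groupId,
    name,
    size,
    @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
    @prev_groupId := groupId AS prev_groupId   -- extra column returned to the client
FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
ORDER BY groupId;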
Here's a simple test case:
INSERT INTO `fileInfo` VALUES
( 1, 3, 'name0', '10'),
( 5, 3, 'name1', '10'),
( 7, 3, 'name2', '10'),
( 8, 1, 'name3', '10'),
( 9, 1, 'name4', '10'),
(10, 2, 'name5', '10'),
(12, 4, 'name6', '10'),
(20, 4, 'name7', '10'),
(21, 4, 'name8', '10'),
(25, 5, 'name9', '10');
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
+------------+---------+-------+------+----------------+
| fileInfoId | groupId | name  | size | cumulativeSize |
+------------+---------+-------+------+----------------+
| 8          | 1       | name3 | 10   | 10             |
| 9          | 1       | name4 | 10   | 20             |
| 10         | 2       | name5 | 10   | 10             |
| 1          | 3       | name0 | 10   | 10             |
| 5          | 3       | name1 | 10   | 20             |
| 7          | 3       | name2 | 10   | 30             |
| 12         | 4       | name6 | 10   | 10             |
| 20         | 4       | name7 | 10   | 20             |
| 21         | 4       | name8 | 10   | 30             |
| 25         | 5       | name9 | 10   | 10             |
+------------+---------+-------+------+----------------+
Here's a sample of the last few rows from a 20,000 row table:
| 19481 | 248 | 8CSLJX22RCO | 1037469 | 51270389 |
| 19486 | 248 | 1IYGJ1UVCQE | 937150 | 52207539 |
| 19817 | 248 | 3FBU3EUSE1G | 616614 | 52824153 |
| 19871 | 248 | 4N19QB7PYT | 153031 | 52977184 |
| 132 | 249 | 3NP9UGMTRTD | 828073 | 828073 |
| 275 | 249 | 86RJM39K72K | 860323 | 1688396 |
| 802 | 249 | 16Z9XADLBFI | 623030 | 2311426 |
...
| 19661 | 249 | ADZXKQUI0O3 | 837213 | 39856277 |
| 19870 | 249 | 9AVRTI3QK6I | 331342 | 40187619 |
| 19972 | 249 | 1MTAEE3LLEM | 1027714 | 41215333 |
+------------+---------+-------------+---------+----------------+
20000 rows in set (0.31 sec)
I think that MySQL is only using one of the indexes on the table. In this case, it's choosing the index on foreignId.
Add a covering compound index that includes both primaryId and foreignId.
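A minimal sketch of adding such an index, assuming the column names from the fileInfo schema posted in the question (groupId as the foreign key, fileInfoId as the primary key); adjust the names if your real table uses primaryId/foreignId as in the EXPLAIN output:
-- Compound index so the self-join can seek on (groupId, fileInfoId)
-- instead of relying on the single-column groupId index alone.
ALTER TABLE fileInfo
    ADD INDEX ix_groupId_fileInfoId (groupId, fileInfoId);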