Conditional Running Tally / Cumulative Sum: Looking for Formula or Script - sql

I work in a warehouse, and I'm using Google Sheets to keep track of inventory. Adding and subtracting is easy, but I've been tasked with creating a "reserve" system:
A number of pieces of stock are reserved for upcoming jobs. When that stock is ordered and received, the reserve quantity is "satisfied" and decreases by the number of pieces received. The problem with setting it up just like the ADD and SUBTRACT function is that not all stock received is "reserved", and my RESERVE totals end up being "-57", "-72", "-112", etc.
I have a large dataset of form responses logged in four columns: Timestamp, Item ID#, Action (ADD, SUBTRACT, or RESERVE), and QTY. What I'm looking for is a way to create a running tally in column E for each unique Item ID#, using the values in column D "QTY". I need for any value <0 to return "0".
Example Sheet
I've been able to create a running tally formula that satisfies my conditions for one Item ID# at a time. To avoid creating a separate column for each Item ID#, though, I need to figure out how to apply it separately to each unique Item ID#, and array it down Column E so each new form response is calculated automatically.
=if(C2="RESERVE",E1+D2,if(and(C2="ADD",(E1+D2)<0),0,E1+D2))
The closest thing to a solution I've been able to find is a script created by user79865 for this question titled: "Running Total In Google Sheets with Array". Unfortunately, trying to plug this into Google Sheets Script Editor gives me an error popup:
TypeError: Cannot read property "length" from undefined. (line 2, file "runningtotal")
I have no programming background and never dreamed I'd be looking at code just to make a running tally.
If anybody can offer any insight into this, fixing or replacing the script or offering an ARRAYFORMULA solution, I'd really appreciate it!
function runningTotal(names, dates, amounts) {
  var sum, totals = [], n = names.length;
  if (dates.length != n || amounts.length != n) {
    return 'Error: need three columns of equal length';
  }
  for (var i = 0; i < n; i++) {
    if (names[i][0]) {
      sum = 0;
      for (var j = 0; j < n; j++) {
        if (names[j][0] == names[i][0] && dates[j][0] <= dates[i][0]) {
          sum = sum + amounts[j][0];
        }
      }
    }
    else {
      sum = '';
    }
    totals.push([sum]);
  }
  return totals;
}
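For reference, the per-item tally described in the question can be sketched outside of Sheets. The Python below is only an illustration of the logic, not a drop-in Sheets solution: the function name `reserve_tally` is made up, and it assumes QTY is already signed (RESERVE rows positive, received ADD rows negative), which is what the single-item formula `E1+D2` implies.

```python
def reserve_tally(rows):
    """Running reserve per item, mirroring the single-item formula.

    rows: iterable of (item_id, action, qty) tuples in timestamp order.
    Assumption: qty is signed, so RESERVE rows are positive and
    received-stock (ADD) rows are negative.
    """
    running = {}   # item_id -> latest tally for that item
    tallies = []   # one result per input row (what column E would hold)
    for item_id, action, qty in rows:
        total = running.get(item_id, 0) + qty
        if action == "ADD" and total < 0:
            total = 0  # more stock received than reserved: floor at zero
        running[item_id] = total
        tallies.append(total)
    return tallies


rows = [
    ("A", "RESERVE", 10),
    ("B", "RESERVE", 5),
    ("A", "ADD", -4),
    ("A", "ADD", -20),
    ("B", "ADD", -2),
]
print(reserve_tally(rows))  # [10, 5, 6, 0, 3]
```

Keeping one running value per Item ID# in a dictionary is what lets a single pass handle every item at once, instead of one column per item.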

Related

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with time calculation.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses from a given match taking into account the previous match, this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame ( data[0:1000] ) I got a result in 13 seconds. This is why I think it's a time calculation problem.
I also tried to first use a groupby("clubid"), then do my for loop into every group but I drowned myself.
Something else that bothers me, I have at least 2 lines with the exact same date/hour, because I have at least two identical dates for 1 match. Because of this I can't put the date in index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
    "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
    "clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "date": [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
    "othercol": ["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
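If per-row running counts are needed rather than one total per club, the double loop from the question can be replaced by sorting each club's dates once and doing a binary search per row. This is a pure-Python sketch under assumptions: the function name `running_counts` is hypothetical, `win_bool` is taken to be strictly 0 or 1, and ties on date are counted the same way as the original `>=` comparison.

```python
from bisect import bisect_right
from collections import defaultdict


def running_counts(clubids, dates, wins):
    """For each row, count same-club wins (or losses) dated on or before it.

    Reproduces the nested loop from the question in O(n log n):
    rows with win == 1 get a win count, all other rows get a loss count.
    """
    win_dates = defaultdict(list)   # clubid -> dates of won matches
    loss_dates = defaultdict(list)  # clubid -> dates of lost matches
    for club, date, win in zip(clubids, dates, wins):
        (win_dates if win == 1 else loss_dates)[club].append(date)
    for by_club in (win_dates, loss_dates):
        for date_list in by_club.values():
            date_list.sort()

    n = len(clubids)
    nw_tot, nl_tot = [0] * n, [0] * n
    for i, (club, date, win) in enumerate(zip(clubids, dates, wins)):
        if win == 1:
            # number of wins with date <= this row's date (includes itself)
            nw_tot[i] = bisect_right(win_dates[club], date)
        else:
            nl_tot[i] = bisect_right(loss_dates[club], date)
    return nw_tot, nl_tot


clubs = [1, 1, 1, 1, 1, 1, 1]
dates = [1, 2, 1, 2, 3, 4, 5]
wins = [0, 0, 1, 0, 1, 1, 1]
print(running_counts(clubs, dates, wins))
```

Because the work per row is a single binary search, this should stay practical at 320 000 rows, where the quadratic loop does not.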

How to get same rank for same scores in Redis' ZRANK?

If I have 5 members with scores as follows
a - 1
b - 2
c - 3
d - 3
e - 5
ZRANK of c returns 2, ZRANK of d returns 3
Is there a way to get same rank for same scores?
Example: ZRANK c = 2, d = 2, e = 3
If yes, then how to implement that in spring-data-redis?
Any real solution needs to fit the requirements, which are kind of missing in the original question. My 1st answer had assumed a small dataset, but this approach does not scale as dense ranking is done (e.g. via Lua) in O(N) at least.
So, assuming that there are a lot of users with scores, the direction that for_stack suggested is better, in which multiple data structures are combined. I believe this is the gist of his last remark.
To store users' scores you can use a Hash. While conceptually you can use a single key to store a Hash of all users scores, in practice you'd want to hash the Hash so it will scale. To keep this example simple, I'll ignore Hash scaling.
This is how you'd add (update) a user's score in Lua:
local hscores_key = KEYS[1]
local user = ARGV[1]
local increment = ARGV[2]
local new_score = redis.call('HINCRBY', hscores_key, user, increment)
Next, we want to track the current count of users per discrete score value so we keep another hash for that:
local old_score = new_score - increment
local hcounts_key = KEYS[2]
local old_count = redis.call('HINCRBY', hcounts_key, old_score, -1)
local new_count = redis.call('HINCRBY', hcounts_key, new_score, 1)
Now, the last thing we need to maintain is the per score rank, with a sorted set. Every new score is added as a member in the zset, and scores that have no more users are removed:
local zdranks_key = KEYS[3]
if new_count == 1 then
  redis.call('ZADD', zdranks_key, new_score, new_score)
end
if old_count == 0 then
  redis.call('ZREM', zdranks_key, old_score)
end
This 3-piece-script's complexity is O(logN) due to the use of the Sorted Set, but note that N is the number of discrete score values, not the users in the system. Getting a user's dense ranking is done via another, shorter and simpler script:
local hscores_key = KEYS[1]
local zdranks_key = KEYS[2]
local user = ARGV[1]
local score = redis.call('HGET', hscores_key, user)
return redis.call('ZRANK', zdranks_key, score)
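To see how the three structures interact, here is a minimal in-memory model of the scripts above in Python. It is only a sketch: the class name `DenseLeaderboard` and its methods are hypothetical, plain dicts stand in for the two Hashes, and a sorted list stands in for the Sorted Set (so inserts here are O(N), unlike Redis). It also assumes a brand-new user has no previous score to decrement, a detail the Lua fragments do not spell out.

```python
from bisect import bisect_left, insort


class DenseLeaderboard:
    """In-memory model of the scores Hash, counts Hash and ranks Sorted Set."""

    def __init__(self):
        self.scores = {}    # user -> score            (models hscores_key)
        self.counts = {}    # score -> number of users (models hcounts_key)
        self.distinct = []  # sorted distinct scores   (models zdranks_key)

    def incr(self, user, increment):
        old = self.scores.get(user)
        new = (old or 0) + increment
        self.scores[user] = new
        if old is not None:
            self.counts[old] -= 1
            if self.counts[old] == 0:
                # no user holds the old score any more: drop it (ZREM)
                del self.counts[old]
                self.distinct.pop(bisect_left(self.distinct, old))
        if new not in self.counts:
            # first user to reach this score: register it (ZADD)
            self.counts[new] = 0
            insort(self.distinct, new)
        self.counts[new] += 1
        return new

    def rank(self, user):
        """Dense, zero-based rank: position of the user's score among distinct scores."""
        return bisect_left(self.distinct, self.scores[user])


board = DenseLeaderboard()
for name, score in [("a", 1), ("b", 2), ("c", 3), ("d", 3), ("e", 5)]:
    board.incr(name, score)
print(board.rank("c"), board.rank("d"), board.rank("e"))  # 2 2 3
```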
You can achieve the goal with two Sorted Sets: one for the member-to-score mapping, and one for the score-to-rank mapping.
Add
Add items to member to score mapping: ZADD mem_2_score 1 a 2 b 3 c 3 d 5 e
Add the scores to score to rank mapping: ZADD score_2_rank 1 1 2 2 3 3 5 5
Search
Get score first: ZSCORE mem_2_score c, this should return the score, i.e. 3.
Get the rank for the score: ZRANK score_2_rank 3, this should return the dense ranking, i.e. 2.
In order to run these atomically, wrap the Add and Search operations in two Lua scripts.
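The two-step lookup (member to score, then score to rank) can be checked without a Redis server. The snippet below is a plain-Python sketch of that idea, not Redis client code; the function name `dense_ranks` is made up for illustration.

```python
def dense_ranks(member_scores):
    """Map each member to the zero-based rank of its score among distinct scores."""
    # Equivalent of the score-to-rank set: every distinct score, in order.
    distinct_scores = sorted(set(member_scores.values()))
    rank_of_score = {score: rank for rank, score in enumerate(distinct_scores)}
    # Equivalent of ZSCORE followed by ZRANK on the score.
    return {member: rank_of_score[score] for member, score in member_scores.items()}


print(dense_ranks({"a": 1, "b": 2, "c": 3, "d": 3, "e": 5}))
# {'a': 0, 'b': 1, 'c': 2, 'd': 2, 'e': 3}
```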
Then there's this Pull Request - https://github.com/antirez/redis/pull/2011 - which is dead, but appears to compute dense rankings on the fly. The original issue/feature request (https://github.com/antirez/redis/issues/943) got some interest, so perhaps it is worth reviving /cc @antirez :)
The rank is unique in a sorted set, and elements with the same score are ordered (ranked) lexically.
There is no Redis command that does this "dense ranking".
You could, however, use a Lua script that fetches a range from a sorted set and reduces it to your requested form. This could work on small data sets, but you'd have to devise something more complex to scale.
unsigned long zslGetRank(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *x;
    unsigned long rank = 0;
    int i;
    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
            (x->level[i].forward->score < score ||
                (x->level[i].forward->score == score &&
                sdscmp(x->level[i].forward->ele,ele) <= 0))) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }
        /* x might be equal to zsl->header, so test if obj is non-NULL */
        if (x->ele && x->score == score && sdscmp(x->ele,ele) == 0) {
            return rank;
        }
    }
    return 0;
}
https://github.com/redis/redis/blob/b375f5919ea7458ecf453cbe58f05a6085a954f0/src/t_zset.c#L475
This is the code Redis uses to compute the rank in sorted sets. Right now, it just gives the rank based on the position in the skiplist (which is sorted by score, with equal scores ordered by member).
What does the skiplistnode variable "span" mean in redis.h? (what is span ?)

Reduce Complexity of This Formula?

I have a question about optimizing a formula I've been using in Google Sheets:
=ARRAYFORMULA(
  IF(
    IFERROR(
      MATCH($B2 & A2, ($B$1:B1) & ($A$1:A2), 0),
      0
    ) = 0,
    1,
    0))
The formula flags each value in column A (ID) the first time it appears for a given date in column B (Date), and writes the result to column C (Count).
Notice how the count values are only 0 and 1, and will only show a 1 if it is the ID's first appearance in the date range.
Example data below.
ID    Date    Count
138   Oct-13  1
138   Oct-13  0
29    Oct-13  1
29    Nov-13  1
138   Nov-13  1
138   Nov-13  0
The issue is once I get over 10000 lines to parse, the formula grinds to a slow pace, and takes upwards of an hour to finish computing. I'm wondering if anyone has a suggestion on how to optimize this formula so I don't need to have it running for so long.
Thanks,
I've been playing around with some formulas, and I think this one works better, but is still becoming quite slow after 10000 lines.
=IF(COUNTIF((FILTER($A$1:$A2, $B$1:$B2 = $B2)),$A2) = 1, 1, 0)
Edit
Here is an additional formula posted on the Google Product Forum which only has to be put in one cell, and autofills down. This is the best answer I've found so far.
=ArrayFormula(IF(LEN(A2:A),--(MATCH(A2:A&B2:B,A2:A&B2:B,0)=ROW(A2:A)-1),))
I wasn't able to find a formula-only solution that I could say outperforms what you have. I did, however, come up with a custom function that runs in linear time, so it ought to perform well. I'd be curious to know how it compares to your final solution.
/**
 * Returns 1 for rows in the given range that have not yet occurred in the range,
 * or 0 otherwise.
 *
 * @param {A2:B8} range A range of cells
 * @param {2} key_col Relative position of a column to key by, e.g. the sort
 *     column (optional; may improve performance)
 * @return 1 if the values in the row have not yet occurred in the range;
 *     otherwise 0.
 * @customfunction
 */
function COUNT_FIRST_OF_GROUP(range, key_col) {
  if (!Array.isArray(range)) {
    return 1;
  }
  const grouped = {};
  key_col = typeof key_col === 'undefined' ? 0 : key_col - 1; // convert from 1-based to 0-based
  return range.map(function(rowCells) {
    const group = groupFor_(grouped, rowCells, key_col);
    const rowStr = JSON.stringify(rowCells); // a bit of a hack to identify unique rows, but probably a good compromise
    if (rowStr in group) {
      return 0;
    } else {
      group[rowStr] = true;
      return 1;
    }
  });
}
/** @private */
function groupFor_(grouped, row, key_col) {
  if (key_col < 0) {
    return grouped; // no key column; use one big group for all rows
  }
  const key = JSON.stringify(row[key_col]);
  if (!(key in grouped)) {
    grouped[key] = {};
  }
  return grouped[key];
}
To use it, in Google Sheets go to Tools > Script editor..., paste it into the editor, and click Save. Then, in your spreadsheet, use the function like so:
=COUNT_FIRST_OF_GROUP(A2:B99, 2)
It will autofill for all rows in the range. You can see it in action here.
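The linear-time idea behind the custom function (remember every row seen so far, flag only first occurrences) can also be shown in a few lines of Python. This is a simplified sketch rather than a port: the name `first_occurrence_flags` is hypothetical, and it uses one set for all rows instead of grouping by a key column.

```python
def first_occurrence_flags(rows):
    """Return 1 for a row the first time it appears, 0 for later repeats."""
    seen = set()
    flags = []
    for row in rows:
        key = tuple(row)  # the whole row (ID, Date) identifies a duplicate
        flags.append(0 if key in seen else 1)
        seen.add(key)
    return flags


data = [
    (138, "Oct-13"),
    (138, "Oct-13"),
    (29, "Oct-13"),
    (29, "Nov-13"),
    (138, "Nov-13"),
    (138, "Nov-13"),
]
print(first_occurrence_flags(data))  # [1, 0, 1, 1, 1, 0]
```

Each row costs one set lookup, so the total work grows linearly with the number of rows, unlike the MATCH-per-row formulas.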
If certain assumptions hold, chiefly that identical ID numbers always occur together (if not, you could SORT the data by ID first and then by date), then:
=ARRAYFORMULA(1*(A2:A10000&B2:B10000<>A1:A9999&B1:B9999))
If the dates are recognised as dates, I think you could use + instead of &. Again, this relies on several assumptions.

Conditional formatting: max value, comparing rows with specific data

I am making an exercise tracking sheet with Google Sheets, and ran into a problem. The sheet has a table for raw data such as day, exercise type chosen from a validated list, and sets, reps, weight, you name it. To find the useful information for analysis, I have set up a pivot table. I want to find the max values for each type of value per exercise.
For example, comparing all the three instances of "DL-m BB" in column D, the table should highlight the highest values between all them: H9 would be the record weight, F5 record volume and so on, and for "SQ-lb BB box" H12 would be max weight and F3 max volume. Eventually the table will have several hundred rows per year, and finding max values per exercise per attribute is going to be too much of a task, time better spent elsewhere.
The conditional formatting can be set as follows for the two examples you give above. A separate rule is set for each; both are created from the same cell (H1), the second by adding another rule.
Apply to Range
H1:H1000
Custom Formula is
=$H1=max(filter($H:$H,$D:$D="DL-m BB"))
Add another rule
Apply to Range
H1:H1000
Custom Formula is
=$H1=max(filter($H:$H,$D:$D="SQ-lb box BB"))
Place the following formula on your pivot table page (try M1; it must be outside the pivot table).
It lists the max values correctly.
=UNIQUE(query($D2:$K,"SELECT D,Max(F),Max(G),Max(H),Max(I),Max(J), Max(K) Where D !='' Group By D Label Max(F) 'Max F', Max(G) 'Max G', Max(H) 'Max H', Max(I) 'Max I',Max(J) 'Max J',Max(K) 'Max K'"))
The below query lists the max for F
=query($D2:$F,"SELECT Max(F) Where D !='' Group By D label Max(F)''")
I have been trying this with conditional formatting and it almost works. Maybe you will see something I don't. Still trying.
This works.
function onOpen() {
  keepUnique();
}

function keepUnique() { // create array of unique non-blank values to find max for
  var col = 4; // column used as the data source (column D; getRange uses 1-based column numbers)
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sh = ss.getSheets()[1];
  var data = sh.getRange(2, col, sh.getLastRow()-2).getValues();
  var newdata = new Array();
  for (var nn in data) {
    var duplicate = false;
    for (var j in newdata) {
      if (data[nn][0] == newdata[j][0] || data[nn][0] == "") {
        duplicate = true;
      }
    }
    if (!duplicate) {
      newdata.push([data[nn][0]]);
    }
  }
  colorMax(newdata);
}

function colorMax(newdata) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sh = ss.getSheets()[1];
  var lc = sh.getLastColumn();
  var data = sh.getRange(2, 4, sh.getLastRow()-2, lc).getValues(); // get col 4 to last col
  for (var k = 2; k < lc; k++) {
    for (var i = 0; i < newdata.length; i++) {
      var maxVal = 0;
      var row = null; // reset per group so a stale row is never colored
      for (var j = 0; j < data.length; j++) {
        if (data[j][0] == newdata[i][0]) {
          if (data[j][k] > maxVal) { maxVal = data[j][k]; row = j + 2; } // find max value and its row number
        }
      }
      if (row === null) continue; // no positive value found for this group
      var c = sh.getRange(row, k+4, 1, 1); // get cell to format
      c.setFontColor("red"); // set font red
    }
  }
}
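Stripped of the spreadsheet calls, the script above boils down to "for each exercise and each numeric column, remember the row holding the largest value". A single-pass version of that logic is sketched below in Python; the function name `max_cells_by_group` and the sample numbers are invented for illustration and do not come from the original sheet.

```python
def max_cells_by_group(rows, key_col, value_cols):
    """Return {(group, col): row_index} for the largest value per group and column."""
    best = {}  # (group, col) -> (largest value so far, row index holding it)
    for idx, row in enumerate(rows):
        group = row[key_col]
        if group == "":
            continue  # skip rows without an exercise name
        for col in value_cols:
            value = row[col]
            if not isinstance(value, (int, float)):
                continue  # skip blanks and text cells
            if (group, col) not in best or value > best[(group, col)][0]:
                best[(group, col)] = (value, idx)
    return {cell: idx for cell, (_, idx) in best.items()}


# Made-up rows: exercise, volume, weight
rows = [
    ["DL-m BB", 300, 100],
    ["SQ-lb BB box", 450, 90],
    ["DL-m BB", 250, 120],
    ["SQ-lb BB box", 400, 95],
]
print(max_cells_by_group(rows, key_col=0, value_cols=[1, 2]))
```

The returned row indices are the cells a formatting step would then highlight; ties keep the first row that reached the maximum.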

Getting a Count of Array Items that Meet a Certain Criteria

I have an array called @friend_comparisons which is populated with a number of user objects. I then sort the array using the following:
@friend_comparisons.sort! { |a,b| b.completions.where(:list_id => @list.id).first.counter <=> a.completions.where(:list_id => @list.id).first.counter }
This is sorting the array by a certain counter associated with each user (the specifics of which are not important to the question).
I want to find out how many user objects in the array have a counter that is greater than a certain number (let's say 5). How do I do this?
Here is how I am currently solving the problem:
@friends_rank = 1
for friend in @friend_comparisons do
  if friend.completions.where(:list_id => @list.id).first.counter > @user_restaurants.count
    @friends_rank = @friends_rank + 1
  end
end
You can use Array#count directly.
@friend_comparisons.count { |friend| friend.counter > 5 }
Docs: http://ruby-doc.org/core-2.2.0/Array.html#method-i-count
(same for ruby 1.9.3)
Array#select will get the job done.
Docs: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-select
You might do something like this:
number_of_users = @friend_comparisons.select { |friend| friend.counter > 5 }.size