Reduce Complexity of This Formula? - optimization

I have a question about optimizing a formula I've been using in Google Sheets:
=ARRAYFORMULA(
IF(
IFERROR(
MATCH($B2 & A2, ($B$1:B1) & ($A$1:A2), 0),
0
) = 0,
1,
0))
The formula flags the first appearance of each ID (column A) within each date (column B) and writes the flag to column C (Count), so the flags add up to the number of unique IDs per date.
Notice that the count values are only 0 and 1: a row shows 1 only if it is the ID's first appearance for that date.
Example data below.
ID    Date    Count
138   Oct-13  1
138   Oct-13  0
29    Oct-13  1
29    Nov-13  1
138   Nov-13  1
138   Nov-13  0
The issue is once I get over 10000 lines to parse, the formula grinds to a slow pace, and takes upwards of an hour to finish computing. I'm wondering if anyone has a suggestion on how to optimize this formula so I don't need to have it running for so long.
Thanks,

I've been playing around with some formulas, and I think this one works better, but is still becoming quite slow after 10000 lines.
=IF(COUNTIF((FILTER($A$1:$A2, $B$1:$B2 = $B2)),$A2) = 1, 1, 0)
Edit
Here is an additional formula posted on the Google Product Forum which only has to be put in one cell, and autofills down. This is the best answer I've found so far.
=ArrayFormula(IF(LEN(A2:A),--(MATCH(A2:A&B2:B,A2:A&B2:B,0)=ROW(A2:A)-1),))

I wasn't able to find a formula-only solution that I could say outperforms what you have. I did, however, come up with a custom function that runs in linear time, so it ought to perform well. I'd be curious to know how it compares to your final solution.
/**
 * Returns 1 for rows in the given range that have not yet occurred in the range,
 * or 0 otherwise.
 *
 * @param {A2:B8} range A range of cells
 * @param {2} key_col Relative position of a column to key by, e.g. the sort
 *     column (optional; may improve performance)
 * @return 1 if the values in the row have not yet occurred in the range;
 *     otherwise 0.
 * @customfunction
 */
function COUNT_FIRST_OF_GROUP(range, key_col) {
  if (!Array.isArray(range)) {
    return 1;
  }
  const grouped = {};
  key_col = typeof key_col === 'undefined' ? 0 : key_col - 1; // convert from 1-based to 0-based
  return range.map(function(rowCells) {
    const group = groupFor_(grouped, rowCells, key_col);
    const rowStr = JSON.stringify(rowCells); // a bit of a hack to identify unique rows, but probably a good compromise
    if (rowStr in group) {
      return 0;
    } else {
      group[rowStr] = true;
      return 1;
    }
  });
}

/** @private */
function groupFor_(grouped, row, key_col) {
  if (key_col < 0) {
    return grouped; // no key column; use one big group for all rows
  }
  const key = JSON.stringify(row[key_col]);
  if (!(key in grouped)) {
    grouped[key] = {};
  }
  return grouped[key];
}
To use it, in Google Sheets go to Tools > Script editor..., paste it into the editor, and click Save. Then, in your spreadsheet, use the function like so:
=COUNT_FIRST_OF_GROUP(A2:B99, 2)
It will autofill for all rows in the range. You can see it in action here.
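For readers who want to check the linear-time idea outside Sheets, here is a minimal Python sketch of the same technique: remember every (ID, date) pair already seen and emit 1 only on the first sighting. The function name `first_occurrence_flags` is hypothetical, and the rows are the example data from the question.

```python
def first_occurrence_flags(rows):
    """Return 1 for the first occurrence of each row, 0 for repeats."""
    seen = set()   # rows encountered so far; set lookups are O(1) on average
    flags = []
    for row in rows:
        key = tuple(row)
        if key in seen:
            flags.append(0)
        else:
            seen.add(key)
            flags.append(1)
    return flags


# Example data from the question: (ID, Date)
rows = [
    (138, "Oct-13"),
    (138, "Oct-13"),
    (29, "Oct-13"),
    (29, "Nov-13"),
    (138, "Nov-13"),
    (138, "Nov-13"),
]
print(first_occurrence_flags(rows))  # [1, 0, 1, 1, 1, 0]
```

Each row is visited once, so the work grows linearly with the number of rows rather than quadratically as with a per-row MATCH or COUNTIF.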

If certain assumptions hold, for example that rows with the same ID always occur together (if not, you could SORT them by ID first and then by date), then:
=ARRAYFORMULA(1*(A2:A10000&B2:B10000<>A1:A9999&B1:B9999))
If the dates are recognised as dates, I think you could use + instead of &. Again, this relies on the assumptions above.

Related

Calculate all possible sums of all summands in 4 columns in R

I have 4 columns of different sizes (eg column 1: 96 rows, column 2: 36 rows; column 3: 12 rows; column 4: 401 rows)
I am now looking for a function that allows me to calculate all possible sums of these 4 summands.
So in the end I need 96*36*12*401 = 16630272 sums as a result, in a data frame, vector, or array, to make a histogram with ggplot.
I tried to solve it with a for loop, which did not work:
r = 1
for (i in 1:(length(df$column1))) {
  for (j in 1:(length(df$column2))) {
    for (h in 1:(length(df$column3))) {
      for (k in 1:(length(df$column4))) {
        (i+j) -> a
        r = r + 1
      }
    }
  }
}
Does someone have an idea how to solve this problem? Or does anyone know why my code does not work? First, it has problems storing my results in the variable a, and second, it somehow produces far too many sums (more than 16630272).
Many thanks!
Your real column1/2/3/4 are probably not columns in a single data frame, as those must be of uniform length. Anyway, if these are the true sizes, the naive solution shouldn't hog an impossible amount of memory (~126M):
d <- expand.grid(column1, column2, column3, column4)
all.sums <- mapply(sum, d[[1]], d[[2]], d[[3]], d[[4]])
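The same "one sum per combination" idea can be sketched in Python with NumPy, building the sums one column at a time with an outer addition. This is only an illustration: the function name `all_sums` is hypothetical, and the inputs are assumed to be plain numeric sequences.

```python
import numpy as np


def all_sums(*columns):
    """Return every possible sum of one element taken from each column."""
    total = np.zeros(1)
    for col in columns:
        # Outer addition pairs every partial sum with every value in col.
        total = np.add.outer(total, np.asarray(col, dtype=float)).ravel()
    return total


sums = all_sums([1, 2], [10, 20], [100])
print(len(sums))      # 4 combinations: 2 * 2 * 1
print(sorted(sums))   # [111.0, 112.0, 121.0, 122.0]
```

With columns of 96, 36, 12 and 401 values, the result holds 16630272 double-precision numbers, about 133 MB, which is in line with the memory estimate above.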

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses from a given match taking into account the previous match, this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame ( data[0:1000] ) I got a result in 13 seconds. This is why I think it's a computation time problem.
I also tried to first use groupby("clubid") and then run my for loop within every group, but I got lost.
Something else bothers me: I have at least 2 lines with the exact same date/hour, because there are at least two identical dates for 1 match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
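If the per-row running counts from the loop in the question are needed rather than per-club totals, one possible vectorised sketch uses a grouped cumulative sum. It assumes the column names from the question (clubid, startdate, win_bool); the function name `running_win_loss` is hypothetical. Unlike the loop, it fills both counters for every row, and rows of one club that share a date all receive the same counts, which mirrors the `>=` comparison.

```python
import pandas as pd


def running_win_loss(df):
    """Add per-club running win/loss counts up to and including each row's date."""
    out = df.sort_values(["clubid", "startdate"]).copy()
    wins = out["win_bool"].eq(1).astype(int)
    losses = out["win_bool"].eq(0).astype(int)
    # Running totals within each club, in date order.
    out["NW_tot"] = wins.groupby(out["clubid"]).cumsum()
    out["NL_tot"] = losses.groupby(out["clubid"]).cumsum()
    # Rows with the same club and date must count each other (>= in the loop),
    # so give every tied row the largest running total of its tie group.
    keys = [out["clubid"], out["startdate"]]
    out["NW_tot"] = out["NW_tot"].groupby(keys).transform("max")
    out["NL_tot"] = out["NL_tot"].groupby(keys).transform("max")
    return out.sort_index()  # restore the original row order


example = pd.DataFrame({
    "clubid":    [1, 2, 1, 1],
    "startdate": [2, 1, 1, 2],
    "win_bool":  [0, 1, 1, 1],
})
result = running_win_loss(example)
print(result[["NW_tot", "NL_tot"]])
```

Sorting dominates the cost, so this should scale far better than the nested loop, which compares every row with every other row.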

Conditional Running Tally / Cumulative Sum: Looking for Formula or Script

I work in a warehouse, and I'm using Google Sheets to keep track of inventory. Adding and subtracting is easy, but I've been tasked with creating a "reserve" system:
A number of pieces of stock are reserved for upcoming jobs. When that stock is ordered and received, the reserve quantity is "satisfied" and decreases by the number of pieces received. The problem with setting it up just like the ADD and SUBTRACT function is that not all stock received is "reserved", and my RESERVE totals end up being "-57", "-72", "-112", etc.
I have a large dataset of form responses logged in four columns: Timestamp, Item ID#, Action (ADD, SUBTRACT, or RESERVE), and QTY. What I'm looking for is a way to create a running tally in column E for each unique Item ID#, using the values in column D "QTY". I need for any value <0 to return "0".
Example Sheet
I've been able to create a running tally formula that satisfies my conditions for one Item ID# at a time. To avoid creating a separate column for each Item ID#, though, I need to figure out how to apply it separately to each unique Item ID#, and array it down Column E so each new form response is calculated automatically.
=if(C2="RESERVE",E1+D2,if(and(C2="ADD",(E1+D2)<0),0,E1+D2))
The closest thing to a solution I've been able to find is a script created by user79865 for this question titled: "Running Total In Google Sheets with Array". Unfortunately, trying to plug this into Google Sheets Script Editor gives me an error popup:
TypeError: Cannot read property "length" from undefined. (line 2, file "runningtotal")
I have no programming background and never dreamed I'd be looking at code just to make a running tally.
If anybody can offer any insight into this, fixing or replacing the script or offering an ARRAYFORMULA solution, I'd really appreciate it!
function runningTotal(names, dates, amounts) {
  var sum, totals = [], n = names.length;
  if (dates.length != n || amounts.length != n) {
    return 'Error: need three columns of equal length';
  }
  for (var i = 0; i < n; i++) {
    if (names[i][0]) {
      sum = 0;
      for (var j = 0; j < n; j++) {
        if (names[j][0] == names[i][0] && dates[j][0] <= dates[i][0]) {
          sum = sum + amounts[j][0];
        }
      }
    }
    else {
      sum = '';
    }
    totals.push([sum]);
  }
  return totals;
}
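To make the intended logic concrete, here is a rough Python sketch of the single-item formula from the question applied separately to each Item ID#. It is an illustration under assumptions, not a Sheets solution: the function name `reserve_tally` is hypothetical, QTY is assumed to carry its own sign as the formula implies, each item's tally is assumed to start at 0, and, as in the formula, only an ADD row is clamped to 0.

```python
def reserve_tally(rows):
    """Running tally per item; an ADD that would go below zero is clamped to 0."""
    running = {}   # latest tally for each item ID
    tallies = []
    for item_id, action, qty in rows:
        total = running.get(item_id, 0) + qty
        if action == "ADD" and total < 0:
            total = 0  # mirrors if(and(C2="ADD",(E1+D2)<0),0,E1+D2)
        running[item_id] = total
        tallies.append(total)
    return tallies


# Hypothetical form responses: (Item ID#, Action, QTY)
rows = [
    ("A", "RESERVE", 10),
    ("B", "RESERVE", 5),
    ("A", "ADD", -15),
    ("A", "RESERVE", 3),
]
print(reserve_tally(rows))  # [10, 5, 0, 3]
```

Keeping one running value per Item ID# in a dictionary is what lets a single pass over the responses replace a separate column per item.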

Conditional formatting: max value, comparing rows with specific data

I am making an exercise tracking sheet with Google Sheets, and ran into a problem. The sheet has a table for raw data such as day, exercise type chosen from a validated list, and sets, reps, weight, you name it. To find the useful information for analysis, I have set up a pivot table. I want to find the max values for each type of value per exercise.
For example, comparing all three instances of "DL-m BB" in column D, the table should highlight the highest values among them: H9 would be the record weight, F5 the record volume, and so on, and for "SQ-lb BB box" H12 would be the max weight and F3 the max volume. Eventually the table will have several hundred rows per year, and finding the max values per exercise per attribute by hand would take too long; that time is better spent elsewhere.
The conditional formatting can be set as follows for the two examples you give above. A separate rule is set for each. Both are set from the same cell (H1) by adding an additional rule.
Apply to Range
H1:H1000
Custom Formula is
=$H1=max(filter($D:$H,$D:$D="DL-m BB"))
Add another rule
Apply to Range
H1:H1000
Custom Formula is
=$H1=max(filter($D:$H,$D:$D="SQ-lb box BB"))
Place this on your pivot table page (try M1; it must be outside of the pivot table).
It lists the max values correctly.
=UNIQUE(query($D2:$K,"SELECT D,Max(F),Max(G),Max(H),Max(I),Max(J), Max(K) Where D !='' Group By D Label Max(F) 'Max F', Max(G) 'Max G', Max(H) 'Max H', Max(I) 'Max I',Max(J) 'Max J',Max(K) 'Max K'"))
The below query lists the max for F
=query($D2:$F,"SELECT Max(F) Where D !='' Group By D label Max(F)''")
I have been trying this with conditional formatting and it almost works. Maybe you will see something I don't. Still trying.
This works.
function onOpen() {
  keepUnique();
}

function keepUnique() { // create array of unique non-blank values to find max for
  var col = 4; // column to use as data source (1-indexed, as getRange expects)
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sh = ss.getSheets()[1];
  var data = sh.getRange(2, col, sh.getLastRow() - 2).getValues();
  var newdata = [];
  for (var nn = 0; nn < data.length; nn++) {
    var duplicate = false;
    for (var j = 0; j < newdata.length; j++) {
      if (data[nn][0] == newdata[j][0] || data[nn][0] == "") {
        duplicate = true;
      }
    }
    if (!duplicate) {
      newdata.push([data[nn][0]]);
    }
  }
  colorMax(newdata);
}

function colorMax(newdata) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sh = ss.getSheets()[1];
  var lc = sh.getLastColumn();
  var data = sh.getRange(2, 4, sh.getLastRow() - 2, lc).getValues(); // get col 4 to last col
  for (var k = 2; k < lc; k++) {
    for (var i = 0; i < newdata.length; i++) {
      var maxVal = 0;
      for (var j = 0; j < data.length; j++) {
        if (data[j][0] == newdata[i][0]) {
          if (data[j][k] > maxVal) { maxVal = data[j][k]; var row = j + 2; } // find max value and its row number
        }
      }
      var c = sh.getRange(row, k + 4, 1, 1); // get cell to format
      var cv = c.getValue();
      c.setFontColor("red"); // set font red
    }
  }
}

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['length'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view, and thus actually change the dataframe, or a copy, and thus leave the dataframe unchanged. It took me quite some time to explain this to her, since she argued, correctly, that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified that the assignment took place in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
See the docs here.
You don't want to set values this way for exactly the reason you pointed out: since you don't know whether it's a view, you don't know that you are actually changing the data. pandas 0.13 will raise/warn when you attempt this, but it is easiest and best to just access like:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
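Applied to all three masks from the question, the .loc pattern might look like the sketch below. The sample lengths are made up for illustration, and the function name `assign_groups` is hypothetical.

```python
import numpy as np
import pandas as pd


def assign_groups(data):
    """Fill a 'group' column from 'length' using single-step .loc assignment."""
    data = data.copy()
    data["group"] = np.nan
    # Each .loc call selects rows and the column in one indexing operation,
    # so the assignment is applied to `data` itself, never to a temporary copy.
    data.loc[data["length"] < 10, "group"] = 1
    data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
    data.loc[data["length"] > 14, "group"] = 3
    return data


sample = pd.DataFrame({"length": [0, 9, 10, 14, 15, 20]})  # illustrative values
grouped = assign_groups(sample)
print(grouped["group"].tolist())       # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print(grouped["group"].isna().sum())   # 0
```

Because no chained indexing is involved, the null count and value_counts() should agree after these assignments.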