Calculate all possible sums of all summands in 4 columns in R - sum

I have 4 columns of different sizes (eg column 1: 96 rows, column 2: 36 rows; column 3: 12 rows; column 4: 401 rows)
I am now looking for a function that allows me to calculate all possible sums of these 4 summands.
So at the end i need to have 963612*401= 16630272 summs as a result in a data frame or vector or array to make a histogramm with ggplot.
I tried solve it with a for case that did not work:
r = 1
for(i in 1:(length(df$column1))) {
for(j in 1:(length(df$column2))) {
for(h in 1:(length(df$column3))){
for(k in 1:(length(df$column4))) {
(i+j)-> a
r = r +1
}
}
}
Does someone have an idea how to solve this problem? Or does anyone have an idea why my code does not work? It has problems by storing my results in variable a and second problem it produces somehow way to much sums. (More than 16630272)
Many thanks!

Your real column1/2/3/4 are probably not columns in a single data frame - as those must be in uniform length. Anyway, if these are the true sizes the naive solution shouldn't hog an impossible amount of memory (~126M):
d <- expand.grid(column1, column2, column3, column4)
all.sums <- mapply(sum, d[[1]], d[[2]], d[[3]], d[[4]])

Related

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with time calculation.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
for j in index_data:
if data["clubid"][i] == data["clubid"][j]:
if data["win_bool"][i] == 1:
if (data["startdate"][i] >= data["startdate"][j]) & (
data["win_bool"][j] == 1
):
NW_tot[i] += 1
else:
if (data["startdate"][i] >= data["startdate"][j]) & (
data["win_bool"][j] == 0
):
NL_tot[i] += 1
The objective is to determine the number of wins and the number of losses from a given match taking into account the previous match, this for every clubid.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame ( data[0:1000] ) I got a result in 13 seconds. This is why I think it's a time calculation problem.
I also tried to first use a groupby("clubid"), then do my for loop into every group but I drowned myself.
Something else that bothers me, I have at least 2 lines with the exact same date/hour, because I have at least two identical dates for 1 match. Because of this I can't put the date in index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")

Pandas run function only on subset of whole Dataframe

Lets say i have Dataframe, which has 200 values, prices for products. I want to run some operation on this dataframe, like calculate average price for last 10 prices.
The way i understand it, right now pandas will go through every single row and calculate average for each row. Ie first 9 rows will be Nan, then from 10-200, it would calculate average for each row.
My issue is that i need to do a lot of these calculations and performance is an issue. For that reason, i would want to run the average only on say on last 10 values (dont need more) from all values, while i want to keep those values in the dataframe. Ie i dont want to get rid of those values or create new Dataframe.
I just essentially want to do calculation on less data, so it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows, we get 200 prices up to within [0, 1000).
df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would the following:
def add10(n: float) -> float:
"""An exceptionally simple function to demonstrate you can set
values, too.
"""
return n + 10
df["price"].iloc[-12:] = df["price"].iloc[-12:].apply(add10)
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns, you simply need to adjust your .apply(...) to contain an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67

Pandas identifying if any element is in a row

I have a data frame that is a single row of numerical values and I want to know if any of those values is greater than 2 and if so create a new column with the word 'Diff'
Col_,F_1,F_2
1,5,0
My dataframe is diff_df. Here is one thing I tried
c = diff_df >2
if c.any():
diff_df['difference']='Difference'
If I were to print c. it would be
Col_,F_1,F_2
False,True,False
I have tried c.all() and many iterations of other things. Clearly my inexperience is holding me back and google is not helping in this regards. Everything I try is either "The truth value of a Series (or Data Frame) is ambiguous use a.any(), a.all()...." Any help would be appreciated.
Since it is only one row, take the .max().max() of the dataframe. With one .max() you are going to get the .max() of each column. The second .max() takes the max of all the columns.
if diff_df.max().max() > 2: diff_df['difference']='Difference'
output:
Col_ F_1 F_2 difference
0 1 5 0 Difference
Use .loc accessor and .gt() to query and at the same time create new column and populate it
df.loc[df.gt(2).any(1), "difference"] = 'Difference'
Col_ F_1 F_2 difference
0 1 5 0 Difference
In addition to David's reponse you may also try this:
if ((df > 2).astype(int)).sum(axis=1).values[0] == 1:
df['difference']='Difference'

SQL Query to return which columns have different values given two rows

I have one table like this:
id status time days ...
1 optimal 60 21
2 optimal 50 21
3 no solution 60 30
4 optimal 21 31
5 no solution 34 12
.
.
.
There are many more rows and columns.
I need to make a query that will return which columns have different information, given two IDs.
Rephrasing it, I'll provide two IDs, for example 1 and 5 and I need to know if these two rows have any columns with different values. In this case, the result should be something like:
id status time days
1 optimal 60 21
5 no solution 34 12
If I provide IDs 1 and 2, for example, the result should be:
id time
1 60
2 50
The output format doesn't need to be like this, it only needs to show clearly which columns are different and their values
I can tell you off the bat that processing this data in some sort of programming language will greatly help you out in terms of simplicity and readability for this type of solution, but here a thread of how it can be done in SQL.
Compare two rows and identify columns whose values are different
If you are looking for the solution in R. Here is my solution:
df <- read.csv(file = "sf.csv", header = TRUE)
diff.eval <- function(first.id, second.id, eval.df) {
res <- eval.df[c(first.id, second.id), ]
cols <- colnames(eval.df)
for (col in cols) {
if (res[1, col] == res[2, col]) {
res[, col] <- NULL
}
}
return(res)
}
print(diff.eval(1, 5, df))
print(diff.eval(1, 2, df))
You just need to create a dataframe out of table. I just created a .csv for ease locally and used the data by importing into a dataframe.

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I cant explain. In the following, I'm not looking for a workaround (already found one) but an explanation of what is going on under the hood and how it explains the output.
One of my colleagues which I talked into using python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. she wants to divided the dateframe into groups by length range: 0 to 9 in group 1, 9 to 14 in group 2, 15 and more in group 3. her solution was to add another column, "group", and fill it with the appropriate values. she wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['phraseLength'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. the reason it is not good is because you dont know in run time whether data['group'][mask3], for example, will be a view and thus actually change the dataframe, or it will be a copy and thus the dataframe would remain unchanged. It took me quit sometime to explain it to her, since she argued correctly that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. the part the even I couldn't understand is this:
After performing this set of operation, we verified that the assignment took place in two different ways:
By typing data in the console and examining the dataframe summary. It told us we had a few thousand of null values. The number of null values was the same as the size of mask3 so we assumed the last assignment was made on a copy and not on a view.
By typing data.group.value_counts(). That returned 3 values: 1,2 and 3 (surprise) we then typed data.group.value_counts.sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1 - it didnt!
Can anyone explain this?
see docs here.
You dont' want to set values this way for exactly the reason you pointed; since you don't know if its a view, you don't know that you are actually changing the data. 0.13 will raise/warn that you are attempting to do this, but easiest/best to just access like:
data.loc[mask3,'group'] = 3
which will guarantee you inplace setitem