I have a dataframe with 4 variables DIVISION, QTR, MODEL_SCORE, MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR, and MODEL_SCORE and keep a running total of X ordered by the MONTH field, smallest to largest. The idea is that the running total would reset whenever it reaches a new permutation of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of in the level argument. It seems to work every way except the one I want.
EDIT: I know the table below isn't formatted ideally, but basically, as long as the only variable changing is MONTH the cumulative sum should continue; any other variable changing should cause it to reset.
DIVISION QTR MODEL MONTHS X CUMSUM
A 1 1 1 10 10
A 1 1 2 20 30
A 1 2 1 5 5
I'm sorry for all the trouble; I believe the answer was much simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a multi-index, and this appears to have worked:
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
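Putting the pieces together, here is a minimal sketch (column names assumed from the question; the explicit sort is there only to guarantee MONTHS runs smallest to largest within each partition):
import pandas as pd

# aggregate X, keep a flat index, sort within each partition, then cumsum per partition
df = (df.groupby(['DIVISION', 'MODEL', 'QTR', 'MONTHS'], as_index=False)['X']
        .sum()
        .sort_values(['DIVISION', 'MODEL', 'QTR', 'MONTHS']))
df['cumsum'] = df.groupby(['DIVISION', 'MODEL', 'QTR'])['X'].cumsum()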
Related
I'm trying to figure out how to add the values of one column (the amount column) to the next few rows based on the condition of another column (the days column). If the value in the days column is greater than 1, I add the amount to that many following rows. So if days is three, I add the amount to the next two rows (the first day is just the current row). I actually think this is easier if I make a copy of the amount column, so I made a copy called backlog.
So let's say I have an amount column that represents the amount of support tickets that need to be resolved each day. Each amount has a number of days it takes for the amount to be resolved. I need the total amount to be a sum of the value today and the sum of the outstanding tickets. So if I have an amount of 1 for 2 days, I have 1 ticket amount today and I add that same 1 tomorrow to the ticket amount of tomorrow. If this doesn't make sense, the below examples will. I have a solution as well, but my main issue is doing this efficiently.
Here is a sample dataframe to use:
import random
import numpy as np
import pandas as pd

amount = list(np.zeros(10)) + [random.randint(1,3) for val in range(15)]
random.shuffle(amount)
ex = pd.DataFrame({
'Amount': amount
})
ex.loc[ex['Amount']>0, 'Days'] = [random.randint(0,4) for val in range(15)]
ex.loc[ex['Amount']==0, 'Days'] = 0
ex['Days'] = ex['Days'].astype(int)
ex['Backlog'] = ex['Amount']
ex.head(10)
Input Dataframe:
Amount  Days  Backlog
     2     0        2
     1     3        1
     2     2        2
     3     0        3
Desired Output Dataframe:
Amount  Days  Backlog
     2     0        2
     1     3        1
     2     2        3
     3     0        6
In the last two values of the backlog column, I have a value of 3 (2 from the current day amount plus 1 from the prior day amount) and a value of 6 (3 for the current day + 2 from the previous day amount + 1 from two days ago).
I have made code for this below, which I think achieves the outcome:
for i in range(0, len(ex['Amount'])):
    Days = ex['Days'].iloc[i]
    if Days >= 2:
        for j in range(1, Days):
            if (i + j) >= len(ex['Amount']):
                break
            # write through the frame itself to avoid chained-assignment issues
            ex.loc[ex.index[i + j], 'Backlog'] += ex['Amount'].iloc[i]
The problem is that I'm already using two for loops to slice the data frame by two features first, so when this code is used as a function on a very large data frame it runs far too slowly. My main goal has been to implement a faster way to do this. Is there a more efficient pandas method to achieve the same outcome, possibly without slow iteration or a nested for loop? I'm at a loss.
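One possible vectorized sketch (not the asker's function, and it assumes Days stays small relative to the frame length): instead of looping over rows, loop over the possible shift distances and add each row's contribution to the rows below it.
# sketch: k loops over shift distances, never over the rows themselves
backlog = ex['Amount'].astype(float).copy()
max_days = int(ex['Days'].max())
for k in range(1, max_days):
    # row i contributes Amount[i] to row i+k whenever Days[i] > k
    backlog += ex['Amount'].shift(k).where(ex['Days'].shift(k) > k, 0).fillna(0)
ex['Backlog_fast'] = backlog   # 'Backlog_fast' is just an illustrative column name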
Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter? The intention is to form a groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1: Interpreting the following one way gives my solution below: "The use-case is to find the average price of every 75 sold amounts of the thing." If you are trying to do this calculation the "hard way" instead of with pd.cut, here is a solution that works well, but its speed and memory use depend on the cumsum() of the amount column, which you can check with df['amount'].cumsum(). The output takes roughly 1 second per 10 million of that cumsum, as that is how many rows np.repeat creates. So this solution is reasonable if your cumsum is under ~10 million (about 1 second) or even 100 million (~10 seconds):
i = 75
# expand each row into "amount" copies of its price, one row per unit sold
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
# bucket every 75 unit-rows together
g = df.index // i
df = df.groupby(g)['price'].mean()
# label the buckets as "0-75", "75-150", ...
df.index = (df.index * i).astype(str) + '-' + (df.index * i + 75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included expected output. You can create a new series with cumsum, then use pd.cut with bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and change it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286
I have a data frame called df that looks like this in Pandas:
id amt date seq
SB 450,000,000 2020-05-11 1
OM 430,000,000 2020-05-11 1
SB 450,000,000 2020-05-12 1
OM 450,000,000 2020-05-12 1
OM 130,000,000 2020-05-12 2
I need to find the value in amt for each id for each day. The issue is that on some days there are multiple cycles, as indicated by "seq".
If there are 2 cycles (aka seq=2) for any one day, I need to take the value when seq=2 for that id on that day, and drop any values for seq=1 with the same day and id. Some days there are only 1 cycle for any one id, and on those days I can just stick with the value where seq=1.
My goal is to groupby day in Pandas and then groupby id again, then check whether the seq column contains a 2 for that id and day; if it does, filter that group to include only the row where seq=2. The end result would be a data frame with only the rows where seq=2 for any day with multiple cycles, and the rows where seq=1 for days where there is only one cycle for that id.
So far I have tried:
for day in df.groupby(df['date']):
    for id in day[1].groupby(['id']):
        if 2 in id[1]['seq']:
            id[1] = id[1].apply(lambda g: g[g['seq']==2])
Which gives me:
KeyError: 'seq'
and I have also tried:
for day in df.groupby(df['date']):
    for id in day[1].groupby(['id']):
        id = list(id)
        if 2 in id[1]['seq']:
            id[1] = id[1][id[1]['seq']==2]
Which runs fine but then doesn't actually change or do anything to df (the same number of rows remain).
Can anyone help me with how I can accomplish this?
Thank you in advance!
You can do this if you groupby date + id, then get the indices of the rows where seq is at its maximum for those groupings. Once you get those indices, you can slice back into the original dataframe to get your desired subset:
max_seq_indices = df.groupby(["date", "id"])["seq"].idxmax()
print(max_seq_indices)
date        id
2020-05-11  OM    1
            SB    0
2020-05-12  OM    4
            SB    2
Name: seq, dtype: int64
Looking at the values of this Series, you can see that we have the maximum seq for ["2020-05-11", "OM"] at row 1. Likewise, there is a maximum seq for ["2020-05-11", "SB"] at row 0. And so on. If we use this to slice back into our original dataframe, we end up with the subset you described in your question:
new_df = df.loc[max_seq_indices]
print(new_df)
   id          amt        date  seq
1  OM  430,000,000  2020-05-11    1
0  SB  450,000,000  2020-05-11    1
4  OM  130,000,000  2020-05-12    2
2  SB  450,000,000  2020-05-12    1
This approach will encounter issues if you have a seq greater than 2 but only want the rows where seq is 2. However, if that is the case, leave a comment and I can update my answer with a more robust (but probably more complex) solution.
You can also work with a sorted dataframe like:
df.sort_values(['date', 'id', 'seq'], inplace=True)
Then you can use groupby to take just the last of each group
df.reset_index(drop=True).groupby(['date', 'id'])['amt'].agg('last')
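If you need the full rows rather than just amt, a hedged variant of the same sorted-dataframe idea (assuming seq decides which duplicate to keep) is to drop duplicates and keep the last row per (date, id) pair:
# sort so the highest seq comes last within each date/id, then keep that last row
df_last = (df.sort_values(['date', 'id', 'seq'])
             .drop_duplicates(subset=['date', 'id'], keep='last'))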
I am trying to calculate the mean interval without selling for a product.
I thought that a good way to get this is:
Count (Days without selling) / Count (Intervals of consecutive days without selling)
Units Sold
0 1
1 4
2 0
3 0
4 0
5 7
6 0
7 0
8 0
9 0
10 1
11 0
In this example I had:
8 days without selling
3 Intervals of consecutive days without selling
So, 8/3 = 2.7 should be my result.
To count days with no units sold, I am using this:
(x['Units Sold'] == 0).sum()
However, I haven't figured out a good approach to calculate 'Intervals of consecutive days without selling' in an efficient way (considering I will run this on multiple products).
Another approach using nunique
s = df["Units Sold"].eq(0)
d = s.sum()
i = s[s].index.to_series().diff().ne(1).cumsum().nunique()
final = d/i # 2.6666666666666665
Using eq, cumsum and diff
First we use eq(0) and sum to count the number of days where nothing was sold.
Then we take the cumsum of these flags and check whether or not there's a difference between the rows. If this difference is 0, that means there was an interval.
days = x['Units Sold'].eq(0).sum()
intervals = x['Units Sold'].eq(0).cumsum().diff().eq(0)
mask = x['Units Sold'].shift(-1).eq(0)
days / (intervals & mask).sum()
Output
2.6666666666666665
You already knew how to get the count of 0s, so try this to find the number of consecutive groups of 0s:
s = df['Units Sold'].eq(0)
(s & ~s.shift(fill_value=False)).sum()
Out[567]: 3
You can use:
df.eq(0).sum()/((df.eq(0)&df.shift().ne(0)).sum())
Output:
Units Sold    2.666667
dtype: float64
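Since the question mentions running this across multiple products, here is a hedged sketch applying the same run-counting idea per product (the 'Product' grouping column is hypothetical, not in the original data):
def mean_zero_interval(s):
    # mean length of runs of consecutive zero-sale days in a Series
    z = s.eq(0)
    runs = (z & ~z.shift(fill_value=False)).sum()   # number of zero runs
    return z.sum() / runs if runs else float('nan')

# 'Product' is a hypothetical column name used only for illustration
result = df.groupby('Product')['Units Sold'].apply(mean_zero_interval)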
The rolling function in Pandas can only calculate rolling statistics according to row counts or date/time columns. But I want to use a discrete time column for calculating a rolling sum, something like this:
key time value
A 1 10
A 2 20
A 4 30
A 7 10
B 1 15
B 2 30
B 3 15
I want to first group by key, then calculate the rolling sum of value over the nearest 3 time units:
key  time  value  output
A     1     10     10
A     2     20     30 (10+20)
A     4     30     60 (10+20+30)
A     7     10     40 (30+10)
B     1     15     15
B     2     30     45
B     3     15     60
I tried this:
grouped = input.groupby("key", as_index=False)
for name, group in grouped:
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    out = calcRollingStat(time, value, mode="avg")
    group["output"] = out  # out is a list
But then I don't know how to convert grouped back to a DataFrame. Pandas tells me that there is no reset_index attribute in grouped.
Is my code the best method to do this? How would you tackle this problem?
Thank you!
I believe you can use GroupBy.apply with a custom function:
def f(group):
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    # calcRollingStat is a custom function that outputs a list of corresponding results
    group["output"] = calcRollingStat(time, value, mode="avg")
    return group

df = input.groupby("key", as_index=False).apply(f)
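For reference, a different hedged sketch that avoids the custom calcRollingStat entirely: if the integer time column is treated as seconds, pandas' offset-based rolling window can compute the sum per group directly. This assumes the window should include every row whose time is within 3 units of the current row's time, inclusive of both ends:
import pandas as pd

tmp = input.copy()
tmp['t'] = pd.to_timedelta(tmp['time'], unit='s')    # treat the discrete time as seconds
rolled = (tmp.set_index('t')
             .groupby('key')['value']
             .rolling('3s', closed='both')            # window [t-3, t], both ends inclusive
             .sum())
print(rolled)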