Mean of consecutive days without selling - pandas

I am trying to calculate the mean length of the intervals during which a product had no sales.
I thought that a good way to get this is:
Count (Days without selling) / Count (Intervals of consecutive days without selling)
Units Sold
0 1
1 4
2 0
3 0
4 0
5 7
6 0
7 0
8 0
9 0
10 1
11 0
In this example I had:
8 days without selling
3 Intervals of consecutive days without selling
So, 8/3 = 2.7 should be my result.
To count the days with no units sold I am using this:
(x['Units Sold'] == 0).sum()
However, I haven't figured out a good approach to calculate 'Intervals of consecutive days without selling' in an efficient way (considering I will run it on multiple products).

Another approach using nunique
s = df["Units Sold"].eq(0)
d = s.sum()
i = s[s].index.to_series().diff().ne(1).cumsum().nunique()
final = d/i # 2.6666666666666665

Using eq, cumsum and diff
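First we use eq(0) and sum to count the number of days on which nothing was sold.
Then we take the cumsum of these zero-sales flags and check the difference between consecutive rows. A difference of 0 means that day had sales; if the next day has no sales (the shift(-1) mask), a new interval starts right after it. Note that an interval beginning on the very first row is not counted, because no earlier row exists to mark its start.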
days = x['Units Sold'].eq(0).sum()
intervals = x['Units Sold'].eq(0).cumsum().diff().eq(0)
mask = x['Units Sold'].shift(-1).eq(0)
days / (intervals & mask).sum()
Output
2.6666666666666665

You already know how to count the zeros, so try this to find the number of groups of consecutive zeros:
s = df['Units Sold'].eq(0)
(s & ~s.shift(fill_value=False)).sum()
Out[567]: 3

You can use:
df.eq(0).sum()/((df.eq(0)&df.shift().ne(0)).sum())
Output:
Units Sold 2.666667
dtype: float64
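Since the question mentions running this over several products, here is a minimal sketch of one way to apply the same run-start idea per product with groupby. The Product column, the function name mean_zero_run and the values for product B are assumptions made up for illustration; they do not appear in the original post.

```python
import pandas as pd

# Hypothetical long-format data: one row per product per day.
df = pd.DataFrame({
    "Product": ["A"] * 12 + ["B"] * 5,
    "Units Sold": [1, 4, 0, 0, 0, 7, 0, 0, 0, 0, 1, 0,  # product A: the example from the question
                   0, 0, 3, 0, 2],                       # product B: made-up values
})

def mean_zero_run(units: pd.Series) -> float:
    """Days without sales divided by the number of zero-sales runs."""
    zero = units.eq(0)
    # A run starts where the day is zero and the previous day was not.
    starts = zero & ~zero.shift(fill_value=False)
    n_runs = starts.sum()
    # Guard against products that never had a zero-sales day.
    return zero.sum() / n_runs if n_runs else float("nan")

result = df.groupby("Product")["Units Sold"].apply(mean_zero_run)
print(result)
```

Because the shift happens inside each group, a zero run at the end of one product cannot merge with a zero run at the start of the next.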

Related

updating the next several row values based on the value of a row in another column

I'm trying to figure out how to add the values of one column (the amount column) to the next few rows based on the condition of another column (the days column). If the condition of the days column is greater than 1, for each day greater than 1 I add the amount column to that many following rows. So if days is three, I add the amount to the next two rows (the first day is just the current row). I actually think this is easier if I make a copy of the amount column, so I made a copy called backlog.
So let's say I have an amount column that represents the amount of support tickets that need to be resolved each day. Each amount has a number of days it takes for the amount to be resolved. I need the total amount to be a sum of the value today and the sum of the outstanding tickets. So if I have an amount of 1 for 2 days, I have 1 ticket amount today and I add that same 1 tomorrow to the ticket amount of tomorrow. If this doesn't make sense, the below examples will. I have a solution as well, but my main issue is doing this efficiently.
Here is a sample dataframe to use:
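import random
import numpy as np
import pandas as pd

# 10 zero amounts plus 15 random non-zero amounts, shuffled
amount = list(np.zeros(10)) + [random.randint(1,3) for val in range(15)]
random.shuffle(amount)
ex = pd.DataFrame({
    'Amount': amount
})
# non-zero amounts get a random number of days; zero amounts get 0 days
ex.loc[ex['Amount']>0, 'Days'] = [random.randint(0,4) for val in range(15)]
ex.loc[ex['Amount']==0, 'Days'] = 0
ex['Days'] = ex['Days'].astype(int)
ex['Backlog'] = ex['Amount']
ex.head(10)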
Input Dataframe:
Amount  Days  Backlog
2       0     2
1       3     1
2       2     2
3       0     3
Desired Output Dataframe:
Amount  Days  Backlog
2       0     2
1       3     1
2       2     3
3       0     6
In the last two values of the backlog column, I have a value of 3 (2 from the current day amount plus 1 from the prior day amount) and a value of 6 (3 for the current day + 2 from the previous day amount + 1 from two days ago).
I have made code for this below, which I think achieves the outcome:
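backlog_col = ex.columns.get_loc('Backlog')
for i in range(len(ex)):
    days = ex['Days'].iloc[i]
    if days >= 2:
        # add today's amount to each of the next (days - 1) rows
        for j in range(1, days):
            if i + j >= len(ex):
                break
            # assign through ex.iloc directly to avoid chained assignment
            ex.iloc[i + j, backlog_col] += ex['Amount'].iloc[i]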
The problem is that I'm already using two for loops to slice the data frame for two features first, so when this code is used as a function for a very large data frame it runs far too slowly, and my main goal has been to implement a faster way to do this. Is there a more efficient pandas method to achieve the same outcome? Possibly without having to use slow iteration or a nested for loop? I'm at a loss.
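One possible way to avoid the nested loop is a difference array: mark where each carry-over begins and ends, then take a single cumulative sum. This is only a sketch under the rules described above (rows with Days >= 2 add their Amount to the next Days - 1 rows, clipped at the end of the frame); it is not taken from the original post, and it uses just the four rows shown in the tables.

```python
import numpy as np
import pandas as pd

# The four rows shown in the input table above.
ex = pd.DataFrame({"Amount": [2, 1, 2, 3], "Days": [0, 3, 2, 0]})

amount = ex["Amount"].to_numpy()
days = ex["Days"].to_numpy()
n = len(ex)

# Only rows with Days >= 2 carry their amount into later rows.
idx = np.flatnonzero(days >= 2)
start = idx + 1                        # first row that receives the carry-over
stop = np.minimum(idx + days[idx], n)  # one past the last receiving row, clipped to the frame

# Difference array: +amount where a carry-over begins, -amount where it ends.
delta = np.zeros(n + 1, dtype=amount.dtype)
np.add.at(delta, start, amount[idx])
np.subtract.at(delta, stop, amount[idx])

# The running sum of delta is the carried-over amount active on each row.
ex["Backlog"] = amount + np.cumsum(delta)[:n]
print(ex)
```

np.add.at is used instead of plain fancy-index assignment so that several rows starting or stopping at the same position all accumulate.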

Pandas Cumulative sum over 1 indice but not the other 3

I have a dataframe with 4 variables, DIVISION, QTR, MODEL_SCORE and MONTH, with the sum of variable X aggregated by those 4.
I would like to effectively partition the data by DIVISION, QTR and MODEL_SCORE and keep a running total ordered by the MONTH field from smallest to largest. The idea is that the total resets whenever it reaches a new combination of the other 3 columns.
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I'm trying
df['cumsum'] = df.groupby(level=3)['X'].cumsum()
having tried every number I can think of for the level argument. It seems to work every way except the one I want.
EDIT: I know the below isn't formatted ideally, but basically as long as the only variable changing was MONTH the cumulative sum would continue but any other variable would cause it to reset.
DIVISION QTR MODEL MONTHS X CUMSUM
A 1 1 1 10 10
A 1 1 2 20 30
A 1 2 1 5 5
I'm sorry for all the trouble. I believe the answer was way simpler than I was making it out to be.
After
df = df.groupby(['DIVISION','MODEL','QTR','MONTHS'])['X'].sum()
I was supposed to reset the index, since I did not want a multi-index, and this appears to have worked.
df = df.reset_index()
df['cumsum'] = df.groupby(['DIVISION','MODEL','QTR'])['X'].cumsum()
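For reference, here is a small self-contained sketch of that grouped running total, using the three rows from the table in the question. It also shows, as an assumption about what the level argument was meant to achieve, that the same result can be obtained on the multi-indexed Series by passing the three level names rather than a single level number.

```python
import pandas as pd

# Rows from the table in the question, already aggregated.
df = pd.DataFrame({
    "DIVISION": ["A", "A", "A"],
    "QTR": [1, 1, 1],
    "MODEL": [1, 1, 2],
    "MONTHS": [1, 2, 1],
    "X": [10, 20, 5],
})

# Sort so the running total follows MONTHS within each group.
df = df.sort_values(["DIVISION", "MODEL", "QTR", "MONTHS"])
df["cumsum"] = df.groupby(["DIVISION", "MODEL", "QTR"])["X"].cumsum()
print(df)

# Same idea without resetting the index: group the multi-indexed Series by level names.
s = df.set_index(["DIVISION", "MODEL", "QTR", "MONTHS"])["X"]
running = s.groupby(level=["DIVISION", "MODEL", "QTR"]).cumsum()
print(running)
```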

Is there a way to use cumsum with a threshold to create bins?

Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter? The intention is to group by the categories this creates.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc which is slow. I tried building a solution based on this answer https://stackoverflow.com/a/56904899:
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1
Taking "The use-case is to find the average price of every 75 sold amounts of the thing" literally leads to the solution below. If you want to do this calculation the "hard way" instead of with pd.cut, this works well, but speed and memory depend on the total of the amount column (the last value of df['amount'].cumsum()), because np.repeat creates that many rows. The output takes about 1 second per 10 million rows created, so this solution is not horrible for totals under ~10 million (1 second) or even 100 million (~10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i + i).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
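To make the np.repeat idea easier to follow, here is a minimal sketch on made-up toy data (two rows and a bin size of 2 instead of 75). The names toy, bin_size, per_unit and bins are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 3 units sold at price 10, then 1 unit at price 20.
toy = pd.DataFrame({"amount": [3, 1], "price": [10.0, 20.0]})
bin_size = 2  # average the price over every 2 units sold

# One entry per unit sold: [10, 10, 10, 20]
per_unit = np.repeat(toy["price"].to_numpy(), toy["amount"].to_numpy())

# Units 0-1 fall in bin 0, units 2-3 in bin 1.
bins = np.arange(len(per_unit)) // bin_size
means = pd.Series(per_unit).groupby(bins).mean()
print(means)
```

The second bin mixes one unit at 10 with one unit at 20, which is why a row can contribute to more than one bin.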
Solution #2 (I believe this is wrong but keeping just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new series with cumsum, then use pd.cut with bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then group by those bins and take the mean. Finally, use pd.IntervalIndex to clean up the format and convert it to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286

Python count number of periods in a series

I am having an error while counting the number of periods in a series.
I have tried this
series = pd.Series(['how. are. you. today.', 'i. am. fine.', 'thank. you.'])
count = series.str.count('.')
Expected results are
0 4
1 3
2 2
but instead I get
0 21
1 12
2 11
How do I solve this? Thank you in advance.
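str.count treats its pattern as a regular expression, in which an unescaped '.' matches any character. Escape the period to count it literally:
series = pd.Series(['how. are. you. today.', 'i. am. fine.', 'thank. you.'])
count = series.str.count(r'\.')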
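As an alternative to writing the backslash by hand, re.escape from the standard library can build the pattern for a literal period. A minimal sketch, which also reproduces the unexpected counts from the question for comparison:

```python
import re
import pandas as pd

series = pd.Series(['how. are. you. today.', 'i. am. fine.', 'thank. you.'])

# An unescaped '.' is a regex wildcard, so this counts every character.
all_chars = series.str.count('.')

# re.escape turns '.' into the pattern for a literal period.
periods = series.str.count(re.escape('.'))
print(periods)
```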

How to calculate the rolling sum on custom time columns?

The rolling function in Pandas can only calculate rolling statistics according to row counts or date/time columns. But I want to have a discrete time column for calculating rolling sum, something like this:
key time value
A 1 10
A 2 20
A 4 30
A 7 10
B 1 15
B 2 30
B 3 15
I want to first group by key, then calculate the rolling sum of value over the last 3 units of time:
key time value output
A 1 10 10
A 2 20 30(10+20)
A 4 30 60(10+20+30)
A 7 10 40(30+10)
B 1 15 15
B 2 30 45
B 3 15 60
I tried this:
grouped = input.groupby("key", as_index=False)
for name, group in grouped:
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    #calcRollingStat is a custom function that outputs a list of corresponding results
    out = calcRollingStat(time, value, mode="avg")
    group["output"] = out #out is a list
But then I don't know how to convert grouped back to DataFrame. Pandas tells me that there is no reset_index attribute in grouped.
Is my code the best method to do this? How would you tackle this problem?
Thank you!
I believe you can use GroupBy.apply with a custom function:
def f(group):
    group = group.sort_values("time")
    time = list(group["time"])
    value = list(group["value"])
    #calcRollingStat is a custom function that outputs a list of corresponding results
    group["output"] = calcRollingStat(time, value, mode="avg")
    return group
df = input.groupby("key", as_index=False).apply(f)
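calcRollingStat is not shown in the post, so the following is only a sketch of one way the rolling sum itself could be computed. It assumes the window covers rows whose time lies in [t - 3, t], which matches the expected output in the question. The helper name rolling_sum_by_time is hypothetical, and the frame is called df here rather than input to avoid shadowing the Python builtin.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key":   ["A", "A", "A", "A", "B", "B", "B"],
    "time":  [1, 2, 4, 7, 1, 2, 3],
    "value": [10, 20, 30, 10, 15, 30, 15],
})

def rolling_sum_by_time(times, values, window):
    """Sum of values whose time lies in [t - window, t]; times must be sorted ascending."""
    # Prefix sums let each window sum be computed as a difference of two entries.
    csum = np.concatenate(([0], np.cumsum(values)))
    # Index of the first row still inside each window.
    left = np.searchsorted(times, times - window, side="left")
    right = np.arange(1, len(times) + 1)
    return csum[right] - csum[left]

pieces = []
for _, g in df.groupby("key"):
    g = g.sort_values("time")
    sums = rolling_sum_by_time(g["time"].to_numpy(), g["value"].to_numpy(), 3)
    # Keep the original row labels so the result aligns back onto df.
    pieces.append(pd.Series(sums, index=g.index))

df["output"] = pd.concat(pieces)
print(df)
```

Collecting the per-group results as Series that keep their original index is one way to get back to a single DataFrame, since the assignment aligns on the index.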