Pandas: how to find consecutive value increases over time in time series data

I have a dataframe with three columns (titled Tool, Value and Date) as below:
Tool    Value   Date
A       52.14   1/1
A       51.5    1/7
A       52      1/10
A       52.9    2/1
B       53.1    1/5
B       51.7    1/10
B       51.9    1/21
B       52.4    1/22
B       53.0    2/1
B       51.5    2/15
I would like to find which tools have values that increased on three consecutive measurement days.
Tool B's value increased on 1/21, again on 1/22, and again on 2/1, so the outcome will be as below:
Tool    Value   Date
B       51.7    1/10
B       51.9    1/21
B       52.4    1/22
B       53.0    2/1
I am wondering how I can define a function in pandas to give the desired result.
Thanks.

Here's an answer, but I bet there is a better one. I build new Series to keep track of whether a value is less than the next one (increasing), of increasing 'runs' as groups (increase_group), and of how many consecutive increases happen in each group (consec_increases). I've kept these as stand-alone Series, but if you add them as columns to your table you can see the reasoning. Getting the 2/1 row included is a bit hacked together, because it isn't part of the same increase_group and I couldn't find a cleverer way than just adding one more index past the group's maximum:
Tool Value Date increasing increase_group consec_increases
0 A 52.14 1/1 False 1 0
1 A 51.50 1/7 True 2 2
2 A 52.00 1/10 True 2 2
3 A 52.90 2/1 False 3 0
4 B 53.10 1/5 False 4 0
5 B 51.70 1/10 True 5 3
6 B 51.90 1/21 True 5 3
7 B 52.40 1/22 True 5 3
8 B 53.00 2/1 False 6 0
9 B 51.50 2/15 False 6 0
Here's the actual code:
import pandas as pd

df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [52.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1','1/7','1/10','2/1','1/5','1/10','1/21','1/22','2/1','2/15'],
})

#Assuming your table is already sorted by Tool then by Date like your example
#Bool Series that is True when a row's Value is lt (<) the next Value for the same Tool (increasing)
increasing = df['Value'].lt(df.groupby('Tool')['Value'].shift(-1))
#df['increasing'] = increasing #uncomment if you want to see what's happening

#Group the rows by increasing 'runs'
increase_group = increasing.groupby(df['Tool'], group_keys=False).apply(lambda v: v.ne(v.shift(1))).cumsum()
#df['increase_group'] = increase_group #uncomment if you want to see what's happening

#Count the number of consecutive increases per group
consec_increases = increasing.groupby(increase_group).transform(lambda v: v.cumsum().max())
#df['consec_increases'] = consec_increases #uncomment if you want to see what's happening
#print(df) #uncomment if you want to see what's happening

#Get the row indices of each group w/ 3 or more consecutive increases
inds_per_group = (
    increase_group[consec_increases >= 3] #this is where you can set the threshold
    .groupby(increase_group)
    .apply(lambda g: list(g.index) + [max(g.index) + 1]) #get all the row inds AND one more
)

#Use the row indices to get the output table
out_df = pd.concat(df.loc[inds].assign(group=i) for i, inds in enumerate(inds_per_group))
With output:
Tool Value Date group
5 B 51.7 1/10 0
6 B 51.9 1/21 0
7 B 52.4 1/22 0
8 B 53.0 2/1 0
The reason I've added a group column is to help you distinguish when you have multiple consecutive runs of 3 or more. For example if your original table had 50.14 instead of 52.14 for the first value:
df = pd.DataFrame({
    'Tool': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Value': [50.14, 51.5, 52.0, 52.9, 53.1, 51.7, 51.9, 52.4, 53.0, 51.5],
    'Date': ['1/1','1/7','1/10','2/1','1/5','1/10','1/21','1/22','2/1','2/15'],
})
Then the output is
Tool Value Date group
0 A 50.14 1/1 0
1 A 51.50 1/7 0
2 A 52.00 1/10 0
3 A 52.90 2/1 0
5 B 51.70 1/10 1
6 B 51.90 1/21 1
7 B 52.40 1/22 1
8 B 53.00 2/1 1
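A note that isn't part of the original answer: the code above assumes the table is already sorted by Tool and then by Date. If Date is a month/day string, a plain sort is lexicographic ('1/10' sorts before '1/7'), so a hedged pre-processing sketch (assuming all dates fall in the same year) would be:
#hypothetical pre-processing step, not from the answer above
df['Date_parsed'] = pd.to_datetime(df['Date'], format='%m/%d')  #year defaults to 1900
df = df.sort_values(['Tool', 'Date_parsed']).reset_index(drop=True)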

You can first sort everything by "Tool" and "Date":
df = df.sort_values(by=["Tool", "Date"])
Then, you can use shift() on "Value" to add another column which contains the "Value" from the previous measurement day:
df["value_minus_1"] = df["Value"].shift()
This gives you a column with the previous day's measurement. Then, you can use shift(2) to get the measurement from two days before:
df["value_minus_2"] = df["Value"].shift(2)
Finally, you can drop the rows that have NAs and filter the rows:
df = df.dropna()
df = df[df["Value"] > df["value_minus_1"]]
df = df[df["value_minus_1"] > df["value_minus_2"]]
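One caveat, added here as a hedged sketch rather than part of the answer above: a plain shift() also compares the last row of one tool with the first row of the next. Shifting within each Tool group avoids that (column names as in the question's frame):
#sketch: shift within each Tool so comparisons never cross tool boundaries
df["value_minus_1"] = df.groupby("Tool")["Value"].shift(1)
df["value_minus_2"] = df.groupby("Tool")["Value"].shift(2)
rising = df.dropna(subset=["value_minus_1", "value_minus_2"])
rising = rising[(rising["Value"] > rising["value_minus_1"])
                & (rising["value_minus_1"] > rising["value_minus_2"])]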

Related

Generating a separate column that stores weighted average per group

It must not be that hard, but I can't cope with this problem.
Imagine I have a long-format dataframe with some data and want to calculate a weighted average of score per person, weighted by manager_weight, and keep it as a separate column, 'w_mean_m'.
df['w_mean_m'] = df.groupby('person')['score'].transform(lambda x: np.average(x['score'], weights=x['manager_weight']))
throws an error and I have no idea how to fix it.
Because GroupBy.transform works with each column separately, it is not possible to select multiple columns there, so GroupBy.apply is used together with Series.map to build the new column:
s = (df.groupby('contact')
       .apply(lambda x: np.average(x['score'], weights=x['manager_weight'])))
df['w_mean_m'] = df['contact'].map(s)
One hack is also possible: select the weights by the (unique) index inside the transform:
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, "manager_weight"])
df['w_mean_m1'] = df.groupby('contact')['score'].transform(f)
print (df)
manager_weight score contact w_mean_m1
0 1.0 1 a 1.282609
1 1.1 1 a 1.282609
2 1.2 1 a 1.282609
3 1.3 2 a 1.282609
4 1.4 2 b 2.355556
5 1.5 2 b 2.355556
6 1.6 3 b 2.355556
7 1.7 3 c 3.770270
8 1.8 4 c 3.770270
9 1.9 4 c 3.770270
10 2.0 4 c 3.770270
Setup:
df = pd.DataFrame(
    {
        "manager_weight": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],
        "score": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        "contact": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']
    })
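As a quick sanity check of the numbers above (this snippet is an illustration added here, not part of the original answer), the value for group 'a' can be reproduced by hand:
import numpy as np
# group 'a': scores [1, 1, 1, 2] with weights [1.0, 1.1, 1.2, 1.3]
# weighted mean = (1*1.0 + 1*1.1 + 1*1.2 + 2*1.3) / (1.0 + 1.1 + 1.2 + 1.3) = 5.9 / 4.6
print(np.average([1, 1, 1, 2], weights=[1.0, 1.1, 1.2, 1.3]))  # ~1.282609, as in w_mean_m1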

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values (more than 5 standard deviations).
I want to replace, per column, each such value with the maximum of the other values in that column.
For example,
df =
    A    B
    1    2
    1    6
    2    8
    1  115
  191    1
Will become:
df =
    A    B
    1    2
    1    6
    2    8
    1    8
    2    1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with max per column and re-sort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do
q = 100
df.loc[df['A'] > q, 'A'] = max(df.loc[df['A'] < q, 'A'])
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B.
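If you want to avoid repeating this per column (the question asks for no loop over the columns), one hedged sketch is to mask every column against the same threshold at once and refill each hole with its column's largest remaining value (this assumes the single threshold q applies to all columns):
q = 100
capped = df.mask(df > q)              # values above the threshold become NaN
df_out = capped.fillna(capped.max())  # fill each NaN with that column's max remaining value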
Calculate a column-wise z-score (if you deem a value an outlier when it lies more than a given number of standard deviations from the column mean) and then calculate a boolean mask of values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
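For example, to mirror the question (replace each flagged cell with the largest non-outlier value in its column), one hedged option, assuming numpy is imported as np, is:
df[outlier_mask] = np.nan    # blank out the flagged cells
df = df.fillna(df.max())     # refill with each column's max over the remaining values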

Generate list of values summing to 1 - within groupby?

In the spirit of "Generating a list of random numbers, summing to 1" from several years ago: is there a way to apply the np.array result of np.random.dirichlet against a groupby on the dataframe?
For example, I can loop through the unique values of the letter column and apply one at a time:
df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
But there's got to be a better way than iterating over the unique values and filtering the dataframe for each. This example is small, but I'll have potentially tens of thousands of groupings of varying sizes of ~50-100 rows each, and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe and finally merging the results, though that seems more convoluted than this. I have not found a solution where I can apply an array of groupby size to the groupby but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC (if I understand correctly), do a transform():
def direchlet(x, size=1):
    return np.array(np.random.dirichlet(np.ones(len(x)), size=size)[0])

df['prop_of_grp'] = df.groupby('letter')['value'].transform(direchlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
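A quick check, added here for illustration: since each letter gets its own Dirichlet draw, the proportions should sum to 1 within every group:
# each letter's prop_of_grp values should sum to (approximately) 1.0
print(df.groupby('letter')['prop_of_grp'].sum())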

Why use to_frame before reset_index?

Using a data set like this one
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
we often see this pattern:
df.groupby(['user_id'])['module_id'].count().to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
But we get exactly the same result from
df.groupby(['user_id'])['module_id'].count().reset_index(name='count')
(N.B. we need the additional rename in the former because reset_index on Series (here) includes a name parameter and returns a data frame, while reset_index on DataFrame (here) does not include the name parameter.)
Is there any advantage in using to_frame first?
(I wondered if it might be an artefact of earlier versions of pandas, but that looks unlikely:
Series.reset_index was added in this commit on the 27th of January 2012.
Series.to_frame was added in this commit on the 13th of October 2013.
So Series.reset_index was available over a year before Series.to_frame.)
There is no noticeable advantage to using to_frame(). Both approaches can be used to achieve the same result. It is common in pandas to use multiple approaches for solving a problem. The only advantage I can think of is that for larger sets of data, it may be more convenient to have a dataframe view first before resetting the index. If we take your dataframe as an example, you will find that to_frame() displays a dataframe view that may be useful for understanding the data in terms of a neat dataframe table vs. a count Series. Also, the usage of to_frame() makes the intent clearer to a new user who looks at your code for the first time.
The example dataframe:
In [7]: df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=['user_id','module_id','week'])
In [8]: df.head()
Out[8]:
user_id module_id week
0 3 4 4
1 1 3 4
2 1 2 2
3 1 3 4
4 1 2 2
The count() function returns a Series:
In [18]: test1 = df.groupby(['user_id'])['module_id'].count()
In [19]: type(test1)
Out[19]: pandas.core.series.Series
In [20]: test1
Out[20]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [21]: test1.index
Out[21]: Int64Index([0, 1, 2, 3, 4], dtype='int64', name='user_id')
Using to_frame makes it explicit that you intend to convert the Series to a Dataframe. The index here is user_id:
In [22]: test1.to_frame()
Out[22]:
module_id
user_id
0 2
1 7
2 4
3 6
4 1
And now we reset the index and rename the column using DataFrame.rename. As you rightly pointed out, DataFrame.reset_index() does not have a name parameter and therefore we have to rename the column explicitly.
In [24]: testdf1 = test1.to_frame().reset_index().rename({'module_id':'count'}, axis='columns')
In [25]: testdf1
Out[25]:
user_id count
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
Now let's look at the other case. We will use the same count() series test1, but rename it test2 to differentiate between the two approaches. In other words, test1 is equal to test2.
In [26]: test2 = df.groupby(['user_id'])['module_id'].count()
In [27]: test2
Out[27]:
user_id
0 2
1 7
2 4
3 6
4 1
Name: module_id, dtype: int64
In [28]: test2.reset_index()
Out[28]:
user_id module_id
0 0 2
1 1 7
2 2 4
3 3 6
4 4 1
In [30]: testdf2 = test2.reset_index(name='count')
In [31]: testdf1 == testdf2
Out[31]:
user_id count
0 True True
1 True True
2 True True
3 True True
4 True True
As you can see, both dataframes are equivalent; in the second approach we just had to use reset_index(name='count') to both reset the index and rename the column, because Series.reset_index() does have a name parameter.
The second case has less code but is less readable to new eyes, and I'd prefer the first approach of using to_frame() because it makes the intent clear: "Convert this count series to a dataframe and rename the column 'module_id' to 'count'".
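For completeness, a couple of other spellings (a sketch added here, not from the original discussion) produce the same frame:
# equivalent alternatives to the two forms compared above
df.groupby('user_id', as_index=False)['module_id'].count().rename(columns={'module_id': 'count'})
df.groupby('user_id')['module_id'].count().rename('count').reset_index()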

Count the number of times a value from one dataframe has repeated in another dataframe

I have 3 dataframes, say A, B and C, with a common column 'comm_col' in all three. I want to create a new column called 'comm_col_occurrences' in B, calculated as follows: for each value of 'comm_col' in B, check whether the value is present in A. If it is, return the number of times it occurs in A. If it is not, check whether it is present in C, and if so return the number of times it occurs there. Please tell me how to write a function for this in pandas. The sample code below demonstrates the problem.
import pandas as pd
#Given dataframes
df1 = pd.DataFrame({'comm_col': ['A', 'B', 'B', 'A']})
df2 = pd.DataFrame({'comm_col': ['A', 'B', 'C', 'D', 'E']})
df3 = pd.DataFrame({'comm_col':['A', 'A', 'D', 'E']})
# The value 'A' from df2 occurs in df1 twice, hence the output is 2.
# Similarly for 'B' the output is 2. 'C' doesn't occur in df1 or df3,
# hence the output is 0.
# 'D' and 'E' don't occur in df1 but occur in df3 once, hence
# the output for 'D' and 'E' should be 1.
# Output should be as shown below:
df2['comm_col_occurrences'] = [2, 2, 0, 1, 1]
Output:
**df1**
comm_col
0 A
1 B
2 B
3 A
**df3**
comm_col
0 A
1 A
2 D
3 E
**df2**
comm_col
0 A
1 B
2 C
3 D
4 E
**Output**
comm_col comm_col_occurrences
0 A 2
1 B 2
2 C 0
3 D 1
4 E 1
Thanks in advance
You need:
import numpy as np

result = pd.DataFrame({
    'df1': df1['comm_col'].value_counts(),
    'df2': df2['comm_col'].value_counts(),
    'df3': df3['comm_col'].value_counts()
})
result['comm_col_occurrences'] = np.nan
# df1 takes priority, so fill from df3 first and then overwrite from df1
result.loc[result['df3'].notnull(), 'comm_col_occurrences'] = result['df3']
result.loc[result['df1'].notnull(), 'comm_col_occurrences'] = result['df1']
result['comm_col_occurrences'] = result['comm_col_occurrences'].fillna(0)
result = (result.drop(['df1', 'df2', 'df3'], axis=1)
                .rename_axis('comm_col')
                .reset_index())
Output:
comm_col comm_col_occurrences
0 A 2.0
1 B 2.0
2 C 0.0
3 D 1.0
4 E 1.0
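For what it's worth, a more compact alternative (a sketch, not part of the answer above) maps df2's values against the counts from df1 first and falls back to df3, matching the priority described in the question:
counts1 = df1['comm_col'].value_counts()
counts3 = df3['comm_col'].value_counts()
df2['comm_col_occurrences'] = (df2['comm_col'].map(counts1)
                                              .fillna(df2['comm_col'].map(counts3))
                                              .fillna(0)
                                              .astype(int))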