segmentation total based on multiple conditions - pandas

data frame:-
ID spend month_diff
12 10 -1
12 10 -2
12 20 1
12 30 2
13 15 -1
13 20 -2
13 25 1
13 30 2
I want to get the total spend based on the month difference for a particular ID. A negative month_diff means the customer's spend was made last year, and a positive one means this year. So I want to compare customers' spend for the past year and this year. The conditions are as follows:
Conditions:-
if month_diff >= -2 and month_diff < 0 then cumulative spend for negative months - flag=pre
if month_diff > 0 and month_diff <= 2 then cumulative spend for positive months - flag=post
Desired data frame:-
ID spend month_diff tot_spend flag
12 10 -2 20 pre
12 30 2 50 post
13 20 -2 35 pre
13 30 2 55 post

Use numpy.sign with Series.shift, Series.ne and Series.cumsum to label consecutive runs of the same sign, then pass that key to DataFrame.groupby and aggregate with GroupBy.last and sum.
Finally, use numpy.select to assign the flag:
import numpy as np

# sign of month_diff: -1 for last year, 1 for this year
a = np.sign(df['month_diff'])
# start a new group id every time the sign changes
g = a.ne(a.shift()).cumsum()
df1 = (df.groupby(['ID', g])
         .agg({'month_diff':'last', 'spend':'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
df1['flag'] = np.select([df1['month_diff'].ge(-2) & df1['month_diff'].lt(0),
                         df1['month_diff'].gt(0) & df1['month_diff'].le(2)],
                        ['pre','post'], default='another val')
print(df1)
ID month_diff spend flag
0 12 -2 20 pre
1 12 2 50 post
2 13 -2 35 pre
3 13 2 55 post
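
To make the consecutive-group key easier to follow, here is a minimal self-contained sketch that rebuilds the sample frame from the question and runs the same steps. The frame construction and the name run_id are assumptions for illustration; the logic is the one described in the answer above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table (construction is assumed).
df = pd.DataFrame({
    "ID": [12, 12, 12, 12, 13, 13, 13, 13],
    "spend": [10, 10, 20, 30, 15, 20, 25, 30],
    "month_diff": [-1, -2, 1, 2, -1, -2, 1, 2],
})

# -1 for last year's rows, 1 for this year's rows.
sign = np.sign(df["month_diff"])

# The id increases by one each time the sign differs from the previous row,
# so every consecutive run of equal signs shares one id.
run_id = sign.ne(sign.shift()).cumsum()

df1 = (df.groupby(["ID", run_id])
         .agg({"month_diff": "last", "spend": "sum"})
         .reset_index(level=1, drop=True)
         .reset_index())

df1["flag"] = np.select(
    [df1["month_diff"].between(-2, -1), df1["month_diff"].between(1, 2)],
    ["pre", "post"],
    default="another val",
)
print(df1)
```

Note that the run key depends on row order: it only separates the pre and post blocks correctly when each ID's negative rows and positive rows are stored contiguously, as they are in this sample.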

Related

Pandas Dataframe to subtract the values with Previous Executed Value

I want to subtract the first-row value from the total test case count,
and then subtract each remaining value from the previous result.
**Input:**
Date Count
17-10-2022 20
18-10-2022 18
19-10-2022 15
20-10-2022 10
21-10-2022 5
**Code:**
df['Date'] = pd.to_datetime(df['Date'])
edate = df['Date'].max().strftime('%Y-%m-%d')
sdate = df['Date'].min().strftime('%Y-%m-%d')
df['Date'] = pd.to_datetime(df['Date']).apply(lambda x: x.date())
df=df.groupby(['Date'])['Date'].count().reset_index(name='count')
df['result'] = Test_case - df['count'].iloc[0]
df['result'] = df['result'] - df['count'].shift(1)
**Output generating:**
Date count result
0 2022-10-17 20 NaN
1 2022-10-18 18 40.0
**Expected Output:**
Date Count Result
17-10-2022 20 60(80-20) - 80 is the total Test case count for example
18-10-2022 18 42(60-18)
19-10-2022 15 27(42-15)
20-10-2022 10 17(27-10)
21-10-2022 5 12(17-5)

Is 80 an arbitrary number? If so, use the following code:
n = 80
df.assign(Result=df['Count'].cumsum().mul(-1).add(n))
output:
Date Count Result
0 17-10-2022 20 60
1 18-10-2022 18 42
2 19-10-2022 15 27
3 20-10-2022 10 17
4 21-10-2022 5 12
You can change n to whatever the total test case count is.
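
As a quick check of the running-total idea, the sketch below rebuilds the input table from the question (the frame construction is an assumption) and subtracts the cumulative sum of Count from the total. Writing it as n - cumsum is equivalent to the cumsum().mul(-1).add(n) form above.

```python
import pandas as pd

# Input copied from the question; dates are kept as plain strings here.
df = pd.DataFrame({
    "Date": ["17-10-2022", "18-10-2022", "19-10-2022", "20-10-2022", "21-10-2022"],
    "Count": [20, 18, 15, 10, 5],
})

n = 80  # total test case count, the example value used in the question

# Remaining count after each day = total minus everything executed so far.
result = df.assign(Result=n - df["Count"].cumsum())
print(result)
```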

Pandas: drop both rows if one column matches same and another don't

I want to drop both rows in a pandas data frame where the value in one column (account) is not duplicated but the value in another column (recharge_number) is duplicated. An illustrative example:
data = {'account': [43,43,43,43,45,45],
'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999] ,
'year': [2021,2021,2021,2021,2020,2020],
'month': [2,3,5,6,2,9]}
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2

Another method is to drop rows instead of keeping them (shown here on a frame with columns id and number):
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
& df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
The first condition protects all duplicated ('id', 'number') records. The second condition selects all records whose 'number' appears more than once, so only unprotected records with a repeated 'number' are dropped.

Basically, you want to keep a row when "the full row (or the two columns, for a larger dataframe) is duplicated" or "number is not duplicated".
You can use duplicated:
df[df[['id', 'number']].duplicated(keep=False) | ~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40

Solution with .crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
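
The duplicated-based answers above are written for columns named id and number. The following sketch applies the same idea to the question's own data; the variable names pair_dup, number_dup and kept are assumptions for illustration.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "account": [43, 43, 43, 43, 45, 45],
    "recharge_number": [17777, 17777, 17999, 17888, 17222, 17999],
    "year": [2021, 2021, 2021, 2021, 2020, 2020],
    "month": [2, 3, 5, 6, 2, 9],
})

# True where the same (account, recharge_number) pair occurs more than once.
pair_dup = df.duplicated(["account", "recharge_number"], keep=False)

# True where the recharge_number occurs more than once in any account.
number_dup = df.duplicated("recharge_number", keep=False)

# Keep rows whose pair repeats, or whose recharge_number is unique overall.
kept = df[pair_dup | ~number_dup]
print(kept)
```

On this sample the result matches the crosstab solution. The two approaches can differ when a recharge_number repeats inside one account and also appears in another account, so it is worth checking which behaviour is wanted in that case.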

group total based on months difference

data frame:
ID spend month_diff
12 10 -5
12 10 -4
12 10 -3
12 10 1
12 10 -2
12 20 0
12 30 2
12 10 -1
I want to get the total spend based on the month difference for a particular ID. A negative month_diff means the customer's spend was made last year, and a positive one means this year. So I want to compare customers' spend for the past year and this year. The conditions are as follows:
Conditions:
if month_diff >= -2 and month_diff < 0 then cumulative spend for negative months -> flag=pre
if month_diff > 0 and month_diff <= 2 then cumulative spend for positive months -> flag=post
Note: the number of positive and negative month_diff values is not always the same. A customer might have 4 transactions with a negative month_diff and only 2 with a positive one. In that case I want the cumulative sum over only 2 negative months and 2 positive months, and I don't want to consider spend where month_diff is 0.
Desired data frame:
ID spend month_diff spend_tot flag
12 10 -2 20 pre
12 30 2 40 post
40 is the cumulative sum of spend for month_diff +1 and +2 (i.e. 10 + 30). The same applies to month_diff -1 and -2, whose cumulative spend is 20 (i.e. 10 + 10).

Use:
import numpy as np

# keep only the months of interest
df = df[df['month_diff'].isin([1,2,-1,-2])]
# keep only months that exist with both signs (same absolute month_diff per ID)
df = df[df.assign(a=df['month_diff'].abs()).duplicated(['ID','a'], keep=False)]
# sign column: -1 for last year, 1 for this year
a = np.sign(df['month_diff'])
# aggregate: last month_diff and total spend per ID and sign
df1 = (df.groupby(['ID', a])
         .agg({'month_diff':'last', 'spend':'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
df1['flag'] = np.select([df1['month_diff'].ge(-2) & df1['month_diff'].lt(0),
                         df1['month_diff'].gt(0) & df1['month_diff'].le(2)],
                        ['pre','post'], default='another val')
print(df1)
ID month_diff spend flag
0 12 -1 20 pre
1 12 2 40 post
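
The pairing filter in the answer above (keeping only months whose absolute month_diff occurs more than once for an ID) is easiest to see with uneven data. The sketch below uses the question's rows for ID 12 plus a hypothetical ID 13 that has months -2, -1 and +1 but no +2; the ID 13 rows are invented for illustration and are not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # ID 12 rows are from the question; ID 13 rows are hypothetical.
    "ID": [12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13],
    "spend": [10, 10, 10, 10, 10, 20, 30, 10, 5, 7, 9],
    "month_diff": [-5, -4, -3, 1, -2, 0, 2, -1, -2, -1, 1],
})

# Step 1: keep only months within two months of the reference, excluding 0.
df = df[df["month_diff"].isin([1, 2, -1, -2])]

# Step 2: keep a month only if its absolute value occurs more than once
# within the ID, so ID 13 loses month -2 (there is no +2 to pair it with).
paired = df.assign(a=df["month_diff"].abs()).duplicated(["ID", "a"], keep=False)
df = df[paired]

# Step 3: sum spend per ID and sign; 'last' keeps the last month seen.
sign = np.sign(df["month_diff"])
df1 = (df.groupby(["ID", sign])
         .agg({"month_diff": "last", "spend": "sum"})
         .reset_index(level=1, drop=True)
         .reset_index())

# After the filters only -2, -1, 1, 2 remain, so the sign decides the flag.
df1["flag"] = np.where(df1["month_diff"] < 0, "pre", "post")
print(df1)
```

The month_diff shown for the pre row of ID 12 is -1 rather than the -2 in the desired frame, because 'last' returns whichever qualifying row comes last in the data; the totals are unaffected.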

remove outliers by group in sql

In my SQL Server table, I must delete outliers for each group separately. Here are my columns:
select
customer,
sku,
stuff,
action,
acnumber,
year
from
mytable
Sample data:
customer sku year stuff action
-----------------------------------
1 1 2 2017 10 0
2 1 2 2017 20 1
3 1 3 2017 30 0
4 1 3 2017 40 1
5 2 4 2017 50 0
6 2 4 2017 60 1
7 2 5 2017 70 0
8 2 5 2017 80 1
9 1 2 2018 10 0
10 1 2 2018 20 1
11 1 3 2018 30 0
12 1 3 2018 40 1
13 2 4 2018 50 0
14 2 4 2018 60 1
15 2 5 2018 70 0
16 2 5 2018 80 1
I must delete outliers from the stuff variable, but separately for each customer+sku+year group.
Everything below the 25th percentile or above the 75th percentile should be considered an outlier, and this rule must be applied within each group.
How can I clean the dataset for further work?
Note that this dataset has a variable action (it takes the values 0 and 1). It is not a grouping variable, but outliers must be deleted only for rows where action is zero (0).
In R, this is done as follows:
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
new <- remove_outliers(vpg$stuff)
vpg=cbind(new,vpg)

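Something like this, maybe. A window function cannot appear directly in a WHERE clause, so the rank is computed in a CTE first, and the action = 0 condition limits the delete to the rows the question asks about:
WITH ranked AS (
    SELECT *, PERCENT_RANK() OVER (PARTITION BY customer, sku, year ORDER BY stuff) AS pr
    FROM mytable
)
DELETE FROM ranked
WHERE action = 0 AND (pr < .25 OR pr > .75);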
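
For readers who want to check the rule outside the database, here is a rough pandas sketch of the rule as stated in the question: within each customer+sku+year group, drop rows with action 0 whose stuff is below the group's 25th percentile or above its 75th. The sample values are hypothetical (one group with enough rows to show the effect), and the quantiles are taken over all rows of the group, which is an assumption since the question does not say whether action 1 rows should count.

```python
import pandas as pd

# Hypothetical data: a single customer/sku/year group with six rows.
df = pd.DataFrame({
    "customer": [1] * 6,
    "sku": [2] * 6,
    "year": [2017] * 6,
    "stuff": [10, 20, 30, 40, 50, 1000],
    "action": [0, 0, 0, 0, 0, 1],
})

grouped = df.groupby(["customer", "sku", "year"])["stuff"]

# Per-group quartiles, broadcast back to one value per row.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))

# A row is removed only if it lies outside [q1, q3] AND has action == 0.
outlier = (df["stuff"].lt(q1) | df["stuff"].gt(q3)) & df["action"].eq(0)
cleaned = df[~outlier]
print(cleaned)
```

This follows the percentile rule from the question text, not the 1.5 * IQR rule in the R snippet; the two give different results, and PERCENT_RANK in SQL is a rank-based measure that may not agree exactly with interpolated quantiles.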

SQL - Select rows after reaching minimum value/threshold

Using Sql Server Mgmt Studio. My data set is as below.
ID Days Value Threshold
A 1 10 30
A 2 20 30
A 3 34 30
A 4 25 30
A 5 20 30
B 1 5 15
B 2 10 15
B 3 12 15
B 4 17 15
B 5 20 15
I want to run a query so only rows after the threshold has been reached are selected for each ID. Also, I want to create a new days column starting at 1 from where the rows are selected. The expected output for the above dataset will look like
ID Days Value Threshold NewDayColumn
A 3 34 30 1
A 4 25 30 2
A 5 20 30 3
B 4 17 15 1
B 5 20 15 2
It doesn't matter if the data goes below the threshold for the latter rows, I want to take the first row when threshold is crossed as 1 and continue counting rows for the ID.
Thank you!

You can use window functions for this. Here is one method:
select t.*, row_number() over (partition by id order by days) as newDayColumn
from (select t.*,
             min(case when value > threshold then days end) over (partition by id) as threshold_days
      from t
     ) t
where days >= threshold_days;
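
The logic of the query can be traced step by step in pandas, which may help when verifying results on a small extract. This is only a sketch mirroring the SQL: the frame is built from the question's table, and the names first_day and out are assumptions.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "ID": ["A"] * 5 + ["B"] * 5,
    "Days": [1, 2, 3, 4, 5] * 2,
    "Value": [10, 20, 34, 25, 20, 5, 10, 12, 17, 20],
    "Threshold": [30] * 5 + [15] * 5,
})
df = df.sort_values(["ID", "Days"])

# Equivalent of min(case when value > threshold then days end) over (partition by id):
# Days is kept only where the threshold is exceeded, then the minimum per ID
# is broadcast to every row of that ID.
first_day = (df["Days"].where(df["Value"] > df["Threshold"])
                       .groupby(df["ID"]).transform("min"))

# Equivalent of: where days >= threshold_days
out = df[df["Days"] >= first_day].copy()

# Equivalent of row_number() over (partition by id order by days)
out["NewDayColumn"] = out.groupby("ID").cumcount() + 1
print(out)
```

An ID that never exceeds its threshold gets a missing first_day, so all of its rows are excluded, which matches how the SQL comparison behaves with NULL.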

We Keep Coding
