Pandas: drop both rows if one column matches and another doesn't

I want to drop the rows in a pandas data frame where the value in one column (recharge_number) is duplicated but the value in another column (account) differs between the duplicates. Rows that share both account and recharge_number should be kept. An illustrative example:
data = {'account': [43,43,43,43,45,45],
'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999] ,
'year': [2021,2021,2021,2021,2020,2020],
'month': [2,3,5,6,2,9]}
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
Desired output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2

Another method is to drop rows instead of keeping them:
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
...            & df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
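The first condition protects every record whose ('id', 'number') pair is duplicated. The second condition selects every record whose 'number' is duplicated. Rows that match the second condition without being protected by the first are dropped.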
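To see what each condition selects, the two masks can be built separately. This is a minimal sketch that assumes the question's data, with the columns account and recharge_number standing in for the answer's id and number; the variable names pair_dup, number_dup and to_drop are illustrative only.

```python
import pandas as pd

# Data taken from the question.
df = pd.DataFrame({
    "account": [43, 43, 43, 43, 45, 45],
    "recharge_number": [17777, 17777, 17999, 17888, 17222, 17999],
    "year": [2021, 2021, 2021, 2021, 2020, 2020],
    "month": [2, 3, 5, 6, 2, 9],
})

# True where the (account, recharge_number) pair occurs more than once.
pair_dup = df.duplicated(["account", "recharge_number"], keep=False)

# True where recharge_number occurs more than once, in any account.
number_dup = df.duplicated("recharge_number", keep=False)

# Drop rows whose number repeats but whose pair does not.
to_drop = ~pair_dup & number_dup
result = df.drop(df[to_drop].index)
print(result)
```

Here only the two rows with recharge_number 17999 are flagged, because that number appears under both accounts.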

Basically, you want to keep the rows where "the full row (or the pair of columns, in a larger dataframe) is duplicated" or "number is not duplicated".
You can use duplicated (note the double brackets needed to select two columns):
df[df[['id', 'number']].duplicated(keep=False) | ~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40

Solution with .crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2
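The same idea, "keep a recharge_number only if it belongs to a single account", can also be written with groupby and nunique. This is a sketch of an alternative, not one of the posted answers; the name n_accounts is made up for illustration.

```python
import pandas as pd

# Data taken from the question.
df = pd.DataFrame({
    "account": [43, 43, 43, 43, 45, 45],
    "recharge_number": [17777, 17777, 17999, 17888, 17222, 17999],
    "year": [2021, 2021, 2021, 2021, 2020, 2020],
    "month": [2, 3, 5, 6, 2, 9],
})

# For every row, count how many distinct accounts share its recharge_number.
n_accounts = df.groupby("recharge_number")["account"].transform("nunique")

# Keep only the rows whose recharge_number is tied to exactly one account.
result = df[n_accounts.eq(1)]
print(result)
```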

Related

Cumulative Deviation of 2 Columns in Pandas DF

I have a rather simple request and have not found a suitable solution online. I have a DF that looks like this below and I need to find the cumulative deviation as shown in a new column to the DF. My DF looks like this:
year month Curr Yr LT Avg
0 2022 1 667590.5985 594474.2003
1 2022 2 701655.5967 585753.1173
2 2022 3 667260.5368 575550.6112
3 2022 4 795338.8914 562312.5309
4 2022 5 516510.1103 501330.4306
5 2022 6 465717.9192 418087.1358
6 2022 7 366100.4456 344854.2453
7 2022 8 355089.157 351539.9371
8 2022 9 468479.4396 496831.2979
9 2022 10 569234.4156 570767.1723
10 2022 11 719505.8569 594368.6991
11 2022 12 670304.78 576495.7539
And, I need the cumulative deviation new column in this DF to look like this:
Cum Dev
0.122993392
0.160154637
0.159888559
0.221628609
0.187604073
0.178089327
0.16687643
0.152866293
0.129326033
0.114260993
0.124487107
0.128058305
In Excel, the calculation would look like this with data in Excel columns Z3:Z14, AA3:AA14 for the first row: =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1 and for the next row: =SUM(Z$3:Z4)/SUM(AA$3:AA4)-1 and for the next as follows with the last row looking like this in the Excel example: =SUM(Z$3:Z14)/SUM(AA$3:AA14)-1
Thank you kindly for your help,
You can divide the cumulative sums of those 2 columns element-wise, and then subtract 1 at the end:
>>> (df["Curr Yr"].cumsum() / df["LT Avg"].cumsum()) - 1
0 0.122993
1 0.160155
2 0.159889
3 0.221629
4 0.187604
5 0.178089
6 0.166876
7 0.152866
8 0.129326
9 0.114261
10 0.124487
11 0.128058
dtype: float64
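To store the result as a new column, the expression can be assigned directly. A small sketch using only the first three rows of the question's data; the column name "Cum Dev" follows the question's desired output.

```python
import pandas as pd

# First three rows of the question's data.
df = pd.DataFrame({
    "year": [2022, 2022, 2022],
    "month": [1, 2, 3],
    "Curr Yr": [667590.5985, 701655.5967, 667260.5368],
    "LT Avg": [594474.2003, 585753.1173, 575550.6112],
})

# Running total of each column, divided element-wise, minus 1.
# This mirrors the Excel formula =SUM(Z$3:Z3)/SUM(AA$3:AA3)-1.
df["Cum Dev"] = df["Curr Yr"].cumsum() / df["LT Avg"].cumsum() - 1
print(df)
```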

Print Pandas Unique Rows by Column Condition

I am trying to print the rows whereby a data condition is met in a pandas DF based on the unique values in the DF. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need to print, for each site, the row where the maximum of the 'temp' column occurs, such as this for the final result:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
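Another option takes the maximum of every column per site. Note that each column is maximised independently, so a result row can mix values from different original rows (here A shows day 18, which does not belong to the row with temp 22):
df.groupby('site').max()
Output: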
temp month day
site
A 22 9 18
B 9 5 23
Let us use sort_values + drop_duplicates, which keeps the first (highest temp) row for each site:
df.sort_values('temp', ascending=False).drop_duplicates('site')
Out[190]:
site temp month day
2 A 22 9 3
3 B 9 4 23
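Another way to get the full row holding each site's maximum is to look up the index label of the maximum with idxmax and select those rows with loc. A minimal sketch on the question's data, offered as an alternative rather than one of the posted answers:

```python
import pandas as pd

# Data taken from the question.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "temp": [15, 11, 22, 9, 3, -1],
    "month": [7, 6, 9, 4, 2, 5],
    "day": [18, 12, 3, 23, 11, 18],
})

# idxmax returns, per site, the index label of the row with the highest temp.
max_rows = df.groupby("site")["temp"].idxmax()

# Selecting those labels keeps month and day from the same original row.
result = df.loc[max_rows]
print(result)
```

Unlike groupby('site').max(), this keeps month and day consistent with the row where the maximum occurred.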

Can I reference a prior row's value and populate it in the current row in a new column?

I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468      45.593     1
1         1    2  2003   46.538     46.66    46.47   46.673      45.593     1
2         1    3  2003   46.717    46.781    46.53   46.750      45.593     1
3         1    4  2003   46.815    46.843    46.68   46.750      45.593     1
4         1    5  2003   46.935    47.000    46.56   46.593      45.593     1
...     ...  ...   ...      ...       ...      ...      ...         ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019     389.019    43
7258     10   27  2022   383.07    385.00  379.329   379.98     389.019    43
7259     10   28  2022  379.869   389.519   379.67  389.019     389.019    43
7260     10   31  2022   386.44   388.399   385.26  386.209      385.24    44
7261     11    1  2022   390.14    390.39   383.29  384.519      385.24    44
I want to create a new column titled 'Prior_Week_Close' which will reference the prior week's 'Week Close' value (and the last week of the prior year for the first week of every year). For example, row 7260's value for Prior_Week_Close should equal 389.019
I'm trying:
SPY['prior_week_close'] = np.where(SPY['Week'].shift(1) == (SPY['Week'] - 1), SPY['Week_Close'].shift(1), np.nan)
but this raises:
TypeError: boolean value of NA is ambiguous
I thought about just using shift and creating a new column but some weeks only have 4 days and that would lead to inaccurate values.
Any help is greatly appreciated!
I was able to solve this by creating a new column called 'Overall_Week' (the week number in the entire data set, not just the calendar year) and using the following code:
def fn(s):
    # Look up the Week_Close of the previous overall week, if there is one
    result = SPY[SPY.Overall_Week == (s.iloc[0] - 1)]['Week_Close']
    if result.shape[0] > 0:
        return np.broadcast_to(result.iloc[0], s.shape)
    else:
        # np.nan instead of np.NaN, which was removed in NumPy 2.0
        return np.broadcast_to(np.nan, s.shape)

SPY['Prior_Week_Close'] = SPY.groupby('Overall_Week')['Overall_Week'].transform(fn)
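The same lookup can be sketched without a custom function by building a week-to-close mapping and using Series.map. This assumes, as in the answer above, that an Overall_Week column already exists and that Week_Close is constant within a week; the small spy frame below is hypothetical sample data, not the real SPY data.

```python
import pandas as pd

# Hypothetical miniature of the SPY frame: one row per trading day.
# Week 3 is deliberately missing to show what happens with a gap.
spy = pd.DataFrame({
    "Overall_Week": [1, 1, 1, 2, 2, 4],
    "Week_Close": [45.593, 45.593, 45.593, 46.1, 46.1, 47.2],
})

# One Week_Close value per overall week, indexed by the week number.
weekly_close = (
    spy.drop_duplicates("Overall_Week")
       .set_index("Overall_Week")["Week_Close"]
)

# For every row in week N, look up the close of week N - 1.
# Weeks with no predecessor in the data get NaN.
spy["Prior_Week_Close"] = spy["Overall_Week"].sub(1).map(weekly_close)
print(spy)
```

Because the lookup is keyed on the week number rather than on row position, weeks with only four trading days do not shift the result.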

segmentation total based on multiple condition

data frame:-
ID spend month_diff
12 10 -1
12 10 -2
12 20 1
12 30 2
13 15 -1
13 20 -2
13 25 1
13 30 2
I want to get the total spend for a particular ID based on the month difference. A negative month_diff means the spend was made by the customer last year, and a positive one means this year. So, I want to compare customers' spend for the past year and this year. The conditions are as follows:
Conditions:-
if month_diff >= -2 and < 0 then cumulative spend for negative months - flag=pre
if month_diff > 0 and <=2 then cumulative spend for positive months - flag=post
Desired data frame:-
ID spend month_diff tot_spend flag
12 10 -2 20 pre
12 30 2 50 post
13 20 -2 35 pre
13 30 2 55 post
Use numpy.sign with Series.shift, Series.ne and Series.cumsum to label consecutive groups of the same sign, pass them to DataFrame.groupby, and aggregate with GroupBy.last and sum.
Finally, use numpy.select to set the flag:
import numpy as np

a = np.sign(df['month_diff'])
g = a.ne(a.shift()).cumsum()
df1 = (df.groupby(['ID', g])
         .agg({'month_diff': 'last', 'spend': 'sum'})
         .reset_index(level=1, drop=True)
         .reset_index())
df1['flag'] = np.select([df1['month_diff'].ge(-2) & df1['month_diff'].lt(0),
                         df1['month_diff'].gt(0) & df1['month_diff'].le(2)],
                        ['pre', 'post'], default='another val')
print(df1)
ID month_diff spend flag
0 12 -2 20 pre
1 12 2 50 post
2 13 -2 35 pre
3 13 2 55 post
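The grouping key g is the least obvious step in this answer, so here is a small sketch of it in isolation. It uses the month_diff values from the question; the names sign and group_id are illustrative.

```python
import numpy as np
import pandas as pd

# month_diff column from the question (IDs 12 and 13, in order).
month_diff = pd.Series([-1, -2, 1, 2, -1, -2, 1, 2])

# -1 for last year's months, +1 for this year's months.
sign = np.sign(month_diff)

# A row starts a new group whenever its sign differs from the previous row;
# the running sum of those "starts" gives a label per consecutive run.
group_id = sign.ne(sign.shift()).cumsum()
print(group_id.tolist())
```

Each run of same-sign rows gets its own label, which is why grouping by ID together with this key separates the "pre" and "post" blocks.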

Pandas 1.0 create column of months from year and date

I have a dataframe df with values as:
df.iloc[1:4, 7:9]
Year Month
38 2020 4
65 2021 4
92 2022 4
I am trying to create a new MonthIdx column as:
df['MonthIdx'] = pd.to_timedelta(df['Year'], unit='Y') + pd.to_timedelta(df['Month'], unit='M') + pd.to_timedelta(1, unit='D')
But I get the error:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta values durations.
Following is the desired output:
df['MonthIdx']
MonthIdx
38 2020/04/01
65 2021/04/01
92 2022/04/01
You can zero-pad the month values, join them to the year, and parse the result as a datetime. Building the string from the existing columns keeps the original index, so the assignment stays aligned even with a non-default index such as the one in the question:
month = df.Month.astype(str).str.pad(width=2, side='left', fillchar='0')
df['MonthIdx'] = pd.to_datetime(df['Year'].astype(str) + month, format='%Y%m')
This will give you:
Year Month MonthIdx
0 2020 4 2020-04-01
1 2021 4 2021-04-01
2 2022 4 2022-04-01
You can then format the dates as strings to match your format exactly:
df['MonthIdx'] = df['MonthIdx'].dt.strftime('%Y/%m/%d')
Giving you:
Year Month MonthIdx
0 2020 4 2020/04/01
1 2021 4 2021/04/01
2 2022 4 2022/04/01
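An alternative that avoids string handling is to let pd.to_datetime assemble the date from year, month and day columns. A minimal sketch, assuming the question's Year and Month columns and its index of 38, 65, 92; the parts frame and the MonthIdxStr column are names introduced here for illustration.

```python
import pandas as pd

# Sample rows shaped like the question's data, including its index.
df = pd.DataFrame({"Year": [2020, 2021, 2022], "Month": [4, 4, 4]},
                  index=[38, 65, 92])

# pd.to_datetime can assemble dates from columns named year, month, day.
# Lower-case the names and add a constant day of 1.
parts = df[["Year", "Month"]].rename(columns=str.lower).assign(day=1)
df["MonthIdx"] = pd.to_datetime(parts)

# Optional string version in the question's YYYY/MM/DD format.
df["MonthIdxStr"] = df["MonthIdx"].dt.strftime("%Y/%m/%d")
print(df)
```

The result keeps the original index, so no realignment is needed when assigning it back.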