Below is a small dataset of transaction records with an ID, the DATE (day of the month), and a dummy variable Bad_CREDIT. I would like to pull out, within each ID, all the transactions from the first bad-credit record onward.
The OUTPUT column indicates the correct result, which is rows 1, 2, 3, 5, 6, 8, and 10.
This is just an example; there could be thousands of rows. SQL, R, or SPSS would all work. Thank you.
DATE  ID  Bad_CREDIT  OUTPUT
  12   A           1       1
  15   A           1       1
  18   A           0       1
   2   B           0       0
  10   B           1       1
  20   B           0       1
   5   C           0       0
  15   C           1       1
   1   D           0       0
   9   E           1       1
You can arrange the data by ID and DATE and, for each ID, assign 0 to the first row when its Bad_CREDIT value is 0:
library(dplyr)
df %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  mutate(OUTPUT = as.integer(!(first(Bad_CREDIT) == 0 & row_number() == 1)))
# DATE ID Bad_CREDIT OUTPUT
# <int> <chr> <int> <int>
# 1 12 A 1 1
# 2 15 A 1 1
# 3 18 A 0 1
# 4 2 B 0 0
# 5 10 B 1 1
# 6 20 B 0 1
# 7 5 C 0 0
# 8 15 C 1 1
# 9 1 D 0 0
#10 9 E 1 1
Data:
df <- structure(list(DATE = c(12L, 15L, 18L, 2L, 10L, 20L, 5L, 15L,
1L, 9L), ID = c("A", "A", "A", "B", "B", "B", "C", "C", "D",
"E"), Bad_CREDIT = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L)),
row.names = c(NA, -10L), class = "data.frame")
If I understand correctly, you can use window functions:
select t.*
from (select t.*,
min(case when bad_credit = 1 then date end) over (partition by id) as min_bd_date
from t
) t
where date >= min_bd_date;
You can also do this with a correlated subquery:
select t.*
from t
where t.date >= (select min(t2.date)
from t t2
where t2.id = t.id and
t2.bad_credit = 1
);
If this is in a database, then I think SQL is likely the better place to address this. However, if you already have it in R, then ...
Here's an R method, using dplyr:
library(dplyr)
dat %>%
group_by(ID) %>%
mutate(OUTPUT2 = +cumany(Bad_CREDIT)) %>%
ungroup()
# # A tibble: 10 x 5
# DATE ID Bad_CREDIT OUTPUT OUTPUT2
# <int> <chr> <int> <int> <int>
# 1 12 A 1 1 1
# 2 15 A 1 1 1
# 3 18 A 0 1 1
# 4 2 B 0 0 0
# 5 10 B 1 1 1
# 6 20 B 0 1 1
# 7 5 C 0 0 0
# 8 15 C 1 1 1
# 9 1 D 0 0 0
# 10 9 E 1 1 1
Because this is effectively a simple grouping operation, base R and data.table solutions are just as straightforward.
+ave(dat$Bad_CREDIT, dat$ID, FUN=cumany)
# [1] 1 1 1 0 1 1 0 1 0 1
library(data.table)
datDT <- as.data.table(dat)
datDT[, OUTPUT2 := +cumany(Bad_CREDIT), by = .(ID)]
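Not one of the languages asked for, but the same cumulative-any idea translates directly to pandas; a minimal sketch, assuming a data frame shaped like the sample above:
import pandas as pd

dat = pd.DataFrame({
    'DATE': [12, 15, 18, 2, 10, 20, 5, 15, 1, 9],
    'ID': list('AAABBBCCDE'),
    'Bad_CREDIT': [1, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

# Within each ID (ordered by DATE), flag every row from the first bad-credit row onward
dat = dat.sort_values(['ID', 'DATE'])
dat['OUTPUT2'] = dat.groupby('ID')['Bad_CREDIT'].cummax()
print(dat)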
You can use EXISTS as follows:
select t.* from your_table t
where exists
(select 1
from your_table tt
where t.id = tt.id
and t.date >= tt.date
and tt.bad_credit = 1);
This is for SPSS:
sort cases by ID date.
compute PullOut=Bad_CREDIT.
if $casenum>1 and ID=lag(ID) and lag(PullOut)=1 PullOut=1.
exe.
How does one add a different substring to each row based on a condition in pandas?
Here is a dummy dataframe that I created:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,5,size=(5, 2)))
df.columns = ['A','B']
If I wanted to replace the values in B with the string YYYY for the rows where the value in A is less than 2, I would do it this way:
df.loc[df['A'] < 2, 'B'] = 'YYYY'
This is the current output of original df:
A B
0 3 4
1 0 1
2 3 0
3 0 1
4 4 4
Of replaced df:
A B
0 3 4
1 0 YYYY
2 3 0
3 0 YYYY
4 4 4
What I instead want is:
A B
0 3 4
1 0 1_1
2 3 0
3 0 1_2
4 4 4
You need to generate a list of the same size as the number of True values (using range and sum), convert it to strings, and join it to column B:
m = df['A'] < 2
df.loc[m, 'B'] = df.loc[m, 'B'].astype(str) + '_' + list(map(str, range(1, m.sum() + 1)))
print (df)
A B
0 3 4
1 0 1_1
2 3 0
3 0 1_2
4 4 4
Or you can use an f-string list comprehension to generate the new values:
m = df['A'] < 2
df.loc[m, 'B'] = [f'{b}_{a}' for a, b in zip(range(1, m.sum() + 1), df.loc[m, 'B'])]
EDIT1: If the counter should restart for each distinct B value, use groupby with cumcount:
m = df['A'] < 4
df.loc[m, 'B'] = df.loc[m, 'B'].astype(str) + '_' + df[m].groupby('B').cumcount().add(1).astype(str)
print (df)
A B
0 3 4_1
1 0 1_1
2 3 0_1
3 0 1_2
4 4 4
Say this is my dataframe:
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
If a particular value in column B does not occur a certain number of times, I want to duplicate all rows that have that value in B.
For the df above, say this threshold is 3. If a value in column B occurs fewer than three times, then all rows with that value are duplicated. So rows with B values 0, 1, and 2 are duplicated, but rows with a B value of 5 are not.
Desired result:
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
10 b 2
11 m 2
12 g 3
13 p 3
14 c 0
15 c 0
Here is my approach:
n = 3  # threshold

# Wide layout: one row per B value, one column per occurrence, holding the A values
df2 = (df.assign(columns = df.groupby('B').cumcount())
         .pivot_table(columns = 'columns',
                      index = 'B',
                      values = 'A',
                      aggfunc = 'first')
      )

# Make sure there are at least n occurrence columns
r = max(n, len(df2.columns))
df2 = df2.reindex(columns = range(r))

# How often each B value actually occurs
notNaN_count = df2.count(axis=1)
# Values so rare that doubling them would still fall short of n are padded by forward-filling
m_ffill = notNaN_count.mul(2).lt(n)
# The remaining below-threshold values are simply repeated once more
repeats = notNaN_count.lt(n).mul(~m_ffill).add(1)

new_df = (df2.ffill(axis = 1)
             .where(m_ffill, df2)
             .reindex(index = df2.index.repeat(repeats))
             .stack()
             .rename('A')
             .reset_index()
             .loc[:, df.columns]
         )
print(new_df)
Output
A B
0 c 0
1 c 0
2 c 0
3 q 1
4 z 1
5 q 1
6 z 1
7 b 2
8 m 2
9 b 2
10 m 2
11 g 3
12 p 3
13 g 3
14 p 3
15 a 5
16 d 5
17 u 5
If, instead of duplicating, we want to multiply by a factor d, we make the following modifications:
n = 3
d = 2
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
EDIT (generalized to any number of value columns besides B):
n=3 #threshold
d = 2
values = df.columns.difference(['B'])
df2 = (df.assign(columns = df.groupby('B').cumcount())
.pivot_table(columns = 'columns',
index = 'B',
values = values,
aggfunc = 'first'))
r = max(n,len(df2.columns.get_level_values('columns').unique()))
df2 = df2.reindex(columns = range(r),level = 'columns')
notNaN_count = df2.count(axis=1).div(len(values))
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
new_df = (df2.T
.groupby(level=0)
.ffill()
.T
.where(m_ffill,df2)
.reindex(index = df2.index.repeat(repeats))
.stack()
.reset_index()
.loc[:,df.columns]
)
I would like to delete all rows in the DataFrame for IDs that appear 10 times with Status = 1.
An example of DataFrame X is:
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
...
First I found all IDs with Status = 1 that occur 10 times:
exclude=X[X.Status == 1].groupby('ID')['Status'].value_counts().loc[lambda x: x==10].index
exclude is a MultiIndex:
MultiIndex([( 371391, 1),
( 383537, 1),
...
Is it possible to delete rows in DataFrame X based on the IDs from this index?
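One minimal sketch, assuming X and the exclude index built above: the offending IDs sit in the first level of that MultiIndex, so you can pull them out and filter on them.
bad_ids = exclude.get_level_values('ID')   # IDs with ten Status = 1 rows
X = X[~X['ID'].isin(bad_ids)]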
If your original DataFrame looks something like this:
print(df)
ID Status
0 366804 0
1 371391 1
2 383537 1
3 383538 0
4 383539 0
5 371391 1
6 371391 1
7 371391 1
8 371391 1
9 371391 1
10 371391 1
11 371391 1
12 371391 1
13 371391 1
And you group IDs and statuses together to find the IDs you want to exclude:
df2 = df.groupby(['ID', 'Status']).size().to_frame('size').reset_index()
print(df2)
ID Status size
0 366804 0 1
1 371391 1 10
2 383537 1 1
3 383538 0 1
4 383539 0 1
excludes = df2.loc[(df2['size'] == 10) & (df2['Status'] == 1), 'ID']
print(excludes)
1 371391
Name: ID, dtype: int64
Then you could use Series.isin and invert the resulting boolean Series with ~:
df = df[~df['ID'].isin(excludes)]
print(df)
ID Status
0 366804 0
2 383537 1
3 383538 0
4 383539 0
I have a dataset:
ID ID2 var1
1 p 10
1 r 5
1 p 9
2 p 7
2 r 6
2 r 7
I need to verify that, within each ID, the difference between (the sum of var1 for "p") and (the sum of var1 for "r") is more than 0. In other words, I need to group by ID and apply arithmetic operations between values grouped by ID2.
Thank you for any suggestions.
import pandas as pd
from io import StringIO
df = pd.read_fwf(StringIO(
"""ID ID2 var1
1 p 10
1 r 5
1 p 9
2 p 7
2 r 6
2 r 7""")).set_index("ID")
df2 = df.pivot_table(values = "var1", index="ID", columns="ID2", aggfunc='sum')
# Example operation: difference
df2['diff'] = df2['p'] - df2['r']
df2
Result
ID2 p r diff
ID
1 19 5 14
2 7 13 -6
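If the goal is just to certify, per ID, that this difference is positive, a small follow-up check on the df2 computed above (a sketch):
ok = df2['diff'] > 0      # True where the sum over "p" exceeds the sum over "r"
print(ok)
print(ok.all())           # single overall verdict across all IDs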
You can use .groupby and .diff() to calculate the difference after the groupby.
df.groupby(['ID', 'ID2']).var1.sum().diff()
Out[72]:
ID ID2
1 p NaN
r -14.0
2 p 2.0
r 6.0
Name: var1, dtype: float64
You can also add an indicator showing whether the difference is greater than 0 with np.where; before that, we use .reset_index to get the var1 column back:
import numpy as np
groupby = df.groupby(['ID', 'ID2']).var1.sum().diff().reset_index()
groupby['indicator'] = np.where(groupby.var1 > 0, 'yes', 'no')
print(groupby)
ID ID2 var1 indicator
0 1 p NaN no
1 1 r -14.0 no
2 2 p 2.0 yes
3 2 r 6.0 yes
I think you need
df.groupby(['ID','ID2']).sum().groupby(level=[0]).diff()
Out[174]:
var1
ID ID2
1 p NaN
r -14.0
2 p NaN
r 6.0
Your data:
import pandas as pd
df=pd.DataFrame([[1,'p',10], [1,'r',5], [1,'p',9 ],
[2,'p',7 ], [2,'r',6 ], [2,'r',7 ]],
columns=['ID', 'ID2', 'var1'])
You can make a cross tabulation:
ct = pd.crosstab(df.ID, [df.ID2, df.var1], margins=True)
>>>ct
ID2 p r All
var1 7 9 10 5 6 7
ID
1 0 1 1 1 0 0 3
2 1 0 0 0 1 1 3
All 1 1 1 1 1 1 6
With no margins:
pd.crosstab(df.ID, [df.ID2,df.var1])
ID2 p r
var1 7 9 10 5 6 7
ID
1 0 1 1 1 0 0
2 1 0 0 0 1 1
Thank you all very much for your suggestions! I'm almost there... :)
I was trying all the codes.
I think I was not clear when explaining the output I want. For the practical case I'm working on, it would be useful to add an additional variable or two to the original list, like this (below). This allows me to take decisions regarding IDs with negative differences in later steps.
output:
ID ID2 var1 var2(diff) var_control
1 p 10 14 0
1 r 5 14 0
1 p 9 14 0
2 p 7 -6 1
2 r 6 -6 1
2 r 7 -6 1
I think I did it, with all your help. Thank you so much! You are awesome.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [23, 23, 23, 43, 43],
'id2': ["r", "p", "p", "p", "r"],
'var1': [4, 6, 7, 1, 3]})
print(df)
df2 = df.pivot_table(values = "var1", index="id", columns="id2", aggfunc='sum')
df2['diff'] = df2['p'] - df2['r']
df["var_2"]=df['id'].map(df2["diff"])
df['control'] = np.where(df['var_2']<0, 1, 0)
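For reference, with the sample frame above, id 23 has p = 6 + 7 = 13 and r = 4 (diff 9, control 0), while id 43 has p = 1 and r = 3 (diff -2, control 1), so the final frame should look roughly like:
print(df)
#    id id2  var1  var_2  control
# 0  23   r     4      9        0
# 1  23   p     6      9        0
# 2  23   p     7      9        0
# 3  43   p     1     -2        1
# 4  43   r     3     -2        1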
My sample data set:
import pandas as pd
import numpy as np
df = {'ID': ['A',0,0,1,'A',1],
'ID1':['Yes','Yes','No','No','Yes','Yes']}
df = pd.DataFrame(df)
My real data set is read in from an Excel file; the column 'ID1' contains 'Yes' or 'No', and the column 'ID' contains 1, 0, and 'A'.
I want to:
For column 'ID1', I want to replace 'Yes' with 1 and 'No' with 0.
For column 'ID', I want to replace 'A' with 0.
I tried the following ways:
# The values didn't change
df['ID1']=df['ID1'].replace(['Yes', 'No'], [1, 0])
# Or, The values didn't change
df['ID1']=df['ID1'].replace(['Yes', 'No'], [1, 0],inplace='ignore')
# Or, it turns 'A' to 'nan'
df['ID'] = df['ID'].map({1: 1, 0: 0, 'A':0})
# OR, it turns 'A' to 'nan'
df['ID'] = df['ID'].map({1: 1, 0: 0, 'A':0}, na_action=None)
My code works perfectly if you run my sample data set code above to build the sample data set, but it doesn't work with my real data set, which I read in from an Excel file. I searched online but couldn't figure out why. These columns from my real data set are object type; I tried converting them to string but it still doesn't work.
Edit:
My code for reading my real data set:
path =os.chdir(r"S:\path")
df1 = pd.read_excel('data.xlsx',skiprows=[0])
df1['ID']=df1['ID'].str.strip()
df1['ID'] = df1['ID'].map({'1': 1, '0': 0, 'A':0}, na_action=None)
df1['ID1']=df1['ID1'].str.strip()
df1['ID1']=df1['ID1'].replace(['Yes', 'No'], [1, 0])
df1.head()
Out[55]:
ID1 ID
0 1 NaN
1 1 NaN
2 1 NaN
3 1 0.0
4 1 NaN
I have uploaded my file online; please check this link: https://filebin.ca/3UAh5051Psnv/test.xlsx
Try cleaning up the ID and ID1 columns:
df['ID'] = df['ID'].astype(str).str.strip().map({'1': 1, '0': 0, 'A':0}, na_action=None)
df['ID1'] = df['ID1'].str.strip().replace(['Yes', 'No'], [1, 0])
Result:
In [234]: df
Out[234]:
ID1 ID
0 1 1
1 1 1
2 1 1
3 1 0
4 1 1
5 1 1
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 1
13 1 0
14 1 1
15 1 1
16 1 0
17 1 1
18 1 1
19 1 1
20 1 1
21 1 1
22 1 1
23 1 1
24 1 1
25 1 1
26 1 1
27 1 1
28 1 1
29 1 1
30 1 1
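For what it's worth, the likely reason the original attempts produced NaN is that a column read from Excel can hold a mix of integers and strings (e.g. 1, 0 and 'A'); calling .str.strip() on such a column turns the non-string entries into NaN before the map is even applied. A minimal sketch of that assumption:
import pandas as pd

s = pd.Series([1, 0, 'A'])    # mixed ints and strings (object dtype), as read from Excel
print(s.str.strip())          # the numeric entries come back as NaN
print(s.astype(str).str.strip().map({'1': 1, '0': 0, 'A': 0}))   # 1, 0, 0 as intended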