finding rows with one difference in DataFrame - pandas

I have a data set where a number of rows are nearly identical, meaning they have the same values for all fields except column C.
A B C D ..... Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
2 40 'Kansas' 'Dem' 34 1
3 30 'Kansas' 'Dem' 45 2
4 55 'Texas' 'Rep' 2 7
....
38 55 'Texas' 'Dem' 2 7
I would like to identify all rows that are identical except for column C, but within column C I only want to find combinations of 'Rep' and 'Dem'. So I don't want two otherwise identical rows where column C is, for instance, 'Rep' and 'Rep'.
A B C D ......Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
4 55 'Texas' 'Rep' 2 7
38 55 'Texas' 'Dem' 2 7
I have used the duplicated method on all columns (but C), and that provides all the rows that are identical. However, it does not restrict the result to pairs where each duplicated row with 'Rep' has exactly one duplicated row with 'Dem'.

Get all columns except C as a list with Index.difference, then sort_values by column C and aggregate C to a tuple per group. Last, join back to the original, compare against ('Dem', 'Rep'), and filter by boolean indexing:
# all columns except C
cols = df.columns.difference(['C']).tolist()
# per group of otherwise-identical rows, collect the sorted C values into a tuple
s = df.sort_values('C').groupby(cols)['C'].apply(tuple).rename('m') == ('Dem','Rep')
# join the per-group boolean back to the original rows and filter
df = df[df.join(s, on=cols)['m']]
Another solution compares by sets. But because groups with repeated values like Rep, Dem, Dem are possible, chain the set comparison with a size condition:
g = df.groupby(cols)['C']
m1 = g.transform('size') == 2                               # exactly two rows per group
m2 = g.transform(lambda x: set(x) == set(['Rep','Dem']))    # one 'Rep' and one 'Dem'
df = df[m1 & m2]
print(df)
A B C D Z
0 50 'Ohio' Rep 3 45
1 50 'Ohio' Dem 3 45
4 55 'Texas' Rep 2 7
38 55 'Texas' Dem 2 7
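For reference, a minimal runnable sketch of the second approach, assuming a cut-down version of the sample frame above (columns D..Z trimmed to just D and Z, so the row indices differ from the question's):
import pandas as pd

df = pd.DataFrame({
    'A': [50, 50, 40, 30, 55, 55],
    'B': ['Ohio', 'Ohio', 'Kansas', 'Kansas', 'Texas', 'Texas'],
    'C': ['Rep', 'Dem', 'Dem', 'Dem', 'Rep', 'Dem'],
    'D': [3, 3, 34, 45, 2, 2],
    'Z': [45, 45, 1, 2, 7, 7],
})

cols = df.columns.difference(['C']).tolist()
g = df.groupby(cols)['C']
m1 = g.transform('size') == 2                         # exactly two rows per group
m2 = g.transform(lambda x: set(x) == {'Rep', 'Dem'})  # one 'Rep' and one 'Dem'
print(df[m1 & m2])                                    # keeps the Ohio and Texas pairs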

You can use duplicated with the argument keep set to False to create a mask for duplicated rows (having dropped column C), and use isin to filter the rows that have any of ['Rep', 'Dem'] in them:
mask = df.drop(['C'], axis=1).duplicated(keep=False)
df[mask & df['C'].isin(['Rep','Dem'])].drop_duplicates()
A B C D Z
0 50 'Ohio' 'Rep' 3 45
1 50 'Ohio' 'Dem' 3 45
4 55 'Texas' 'Rep' 2 7
5 55 'Texas' 'Dem' 2 7

Related

Pivot_table returns all NAN

I have a df that has 4 categorical columns and 2 numerical columns.
ID  Condition 1  Condition 2  Condition 3  Condition 4  Var 1  Var 2
1   A            X            one          foo          100    12
1   B            X            two          bar          90     13
1   A            Y            two          bar          80     14
2   B            Y            one          foo          20     156
2   A            X            two          foo          1      333
2   B            Y            one          bar          0      235
I would like to pivot it so that each user has one row, with a column for each combination of the conditions, once with Var 1 values and once with Var 2 values.
I tried running the following line
table = pd.pivot_table(Purchases, index='ID', columns=['Condition 1', 'Condition 2', 'Condition 3', 'Condition 4'], values=['Var1', 'Var2'], aggfunc='sum')
But the output was full of NaN. What am I doing wrong?
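A likely cause, offered here as an assumption rather than a confirmed answer: with four condition columns, most combinations of conditions never occur for a given ID, so every missing combination becomes NaN; pivot_table's fill_value can fill those holes. (A values/column name mismatch like 'Var1' vs 'Var 1' would raise a KeyError rather than produce NaN, so that is not the cause.) A minimal sketch reconstructing the frame from the table above:
import pandas as pd

# hypothetical reconstruction of the Purchases frame
Purchases = pd.DataFrame({
    'ID':          [1, 1, 1, 2, 2, 2],
    'Condition 1': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Condition 2': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
    'Condition 3': ['one', 'two', 'two', 'one', 'two', 'one'],
    'Condition 4': ['foo', 'bar', 'bar', 'foo', 'foo', 'bar'],
    'Var 1':       [100, 90, 80, 20, 1, 0],
    'Var 2':       [12, 13, 14, 156, 333, 235],
})

table = pd.pivot_table(
    Purchases,
    index='ID',
    columns=['Condition 1', 'Condition 2', 'Condition 3', 'Condition 4'],
    values=['Var 1', 'Var 2'],
    aggfunc='sum',
    fill_value=0,    # fill the condition combinations that never occur
)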

Pandas column merging on condition

This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to merge the Protein column with the matching food column based on "Category".
My desired output is
Id Protein_final
A 20
B 30
C 30
D 45
Ideally, I would show how I am approaching this, but I am frankly clueless!!
EDIT: Also, how do I handle rows where the category is blank or does not match one of the columns? (In that case, the final value should be the same as the initial value in the Protein column.)
Use DataFrame.lookup with some preprocessing: strip everything before the _ in the column names and lowercase them, then add the looked-up values to the Protein column:
# rename columns so they match the Category values, then pick each row's value
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print(df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If you only need the 2 columns at the end:
df = df[['Id','Protein']]
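Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; a sketch of an equivalent on current pandas, using the same renaming idea plus numpy integer indexing:
import numpy as np

# keep only the food columns, strip the 'A_'/'B_'/'C_' prefixes, lowercase
foods = (df.drop(columns=['Id', 'Protein', 'Category'])
           .rename(columns=lambda x: x.split('_')[-1].lower()))
# integer position of each row's category among the food columns
col_idx = foods.columns.get_indexer(df['Category'])
df['Protein'] += foods.to_numpy()[np.arange(len(df)), col_idx]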
You can melt the dataframe, filter for rows where Category equals the variable column, and add Protein and the matching value to get the final column:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30

To prepare a dataframe with elements being repeated from a list in python

I have a list primary = ['A', 'B', 'C', 'D']
and a DataFrame as
df2 = pd.DataFrame(data=dateRange, columns = ['Date'])
which contains 1 date column starting from 01-July-2020 till 31-Dec-2020.
I created another column 'DayNum' which contains the day number of the week; for example, 01-July-2020 is a Wednesday, so its 'DayNum' is 2, and so on.
Now, using the list, I want to create another column 'Primary'. In short, the elements of the list should repeat weekly: this is a roster showing the name of the person on duty each week, where Monday is the start (day 0) and Sunday is the end (day 6).
The output should be like this:
Date DayNum Primary
0 01-Jul-20 2 A
1 02-Jul-20 3 A
2 03-Jul-20 4 A
3 04-Jul-20 5 A
4 05-Jul-20 6 A
5 06-Jul-20 0 B
6 07-Jul-20 1 B
7 08-Jul-20 2 B
8 09-Jul-20 3 B
9 10-Jul-20 4 B
10 11-Jul-20 5 B
11 12-Jul-20 6 B
12 13-Jul-20 0 C
13 14-Jul-20 1 C
14 15-Jul-20 2 C
15 16-Jul-20 3 C
16 17-Jul-20 4 C
17 18-Jul-20 5 C
18 19-Jul-20 6 C
19 20-Jul-20 0 D
20 21-Jul-20 1 D
21 22-Jul-20 2 D
22 23-Jul-20 3 D
23 24-Jul-20 4 D
24 25-Jul-20 5 D
25 26-Jul-20 6 D
26 27-Jul-20 0 A
27 28-Jul-20 1 A
28 29-Jul-20 2 A
29 30-Jul-20 3 A
30 31-Jul-20 4 A
First compare the column against 0 with Series.eq and take the cumulative sum with Series.cumsum to get a group number for each week. Then take it modulo the number of values in the list with Series.mod, and finally map with Series.map using a dictionary created by enumerate:
primary = ['A','B','C','D']
d = dict(enumerate(primary))   # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
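For completeness, a runnable sketch that also builds the Date and DayNum columns described in the question, assuming pd.date_range and Series.dt.dayofweek (Monday=0) are how the frame was constructed:
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2020-07-01', '2020-12-31')})
df['DayNum'] = df['Date'].dt.dayofweek   # Monday=0 .. Sunday=6

primary = ['A', 'B', 'C', 'D']
d = dict(enumerate(primary))
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
print(df.head(7))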

pandas replace multiple values (that you do not know) on one column

What is the best way to change several values in a column ('Status') that differ from the only two values that you want to analyse?
As an example, my df is:
Id Status Email Product Age
1 ok g# A 20
5 not ok l# J 45
1 A a# A 27
2 B h# B 25
2 ok t# B 33
3 C b# E 23
4 not ok c# D 30
In the end, I want to have:
Id Status Email Product Age
1 ok g# A 20
5 not ok l# J 45
1 other a# A 27
2 other h# B 25
2 ok t# B 33
3 other b# E 23
4 not ok c# D 30
The greatest difficulty is that my df is huge, so I do not know all the other values besides 'ok' and 'not ok' (the values that I want to analyse).
Thanks in advance!
np.where + isin
import numpy as np

df.Status = np.where(df.Status.isin(['ok', 'not ok']), df.Status, 'Others')
df
Out[384]:
Id Status Email Product Age
0 1 ok g# A 20
1 5 not ok l# J 45
2 1 Others a# A 27
3 2 Others h# B 25
4 2 ok t# B 33
5 3 Others b# E 23
6 4 not ok c# D 30
Or use an apply:
df['Status'] = df.apply(lambda x: 'other' if x['Status'] not in ['ok', 'not ok'] else x['Status'], axis=1)
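Both of the above work; as a vectorized alternative to the row-wise apply, Series.where keeps values where the condition holds and replaces the rest (a sketch, using the question's 'other' label):
df['Status'] = df['Status'].where(df['Status'].isin(['ok', 'not ok']), 'other')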

How to subtract one dataframe from another?

First, let me set the stage.
I start with a pandas dataframe klmn, that looks like this:
In [15]: klmn
Out[15]:
K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97
Next I split klmn into two dataframes, klmn0 and klmn1, according to the value in the 'K' column:
In [16]: k0 = klmn.groupby(klmn['K'] == 0)
In [17]: klmn0, klmn1 = [klmn.iloc[k0.indices[tf]] for tf in (True, False)]
In [18]: klmn0, klmn1
Out[18]:
( K L M N
0 0 a -1.374201 35
1 0 b 1.415697 29
2 0 a 0.233841 18
3 0 b 1.550599 30
4 0 a -0.178370 63
5 0 b -1.235956 42
6 0 a 0.088046 2
7 0 b 0.074238 84,
K L M N
8 1 a 0.469924 44
9 1 b 1.231064 68
10 2 a -0.979462 73
11 2 b 0.322454 97)
Finally, I compute the mean of the M column in klmn0, grouped by the value in the L column:
In [19]: m0 = klmn0.groupby('L')['M'].mean(); m0
Out[19]:
L
a -0.307671
b 0.451144
Name: M
Now, my question is: how can I subtract m0 from the M column of the klmn1 sub-dataframe, respecting the value in the L column? (By this I mean that m0['a'] gets subtracted from the M column of each row in klmn1 that has 'a' in the L column, and likewise for m0['b'].)
One could imagine doing this in a way that replaces the values in the M column of klmn1 with the new values (after subtracting the value from m0). Alternatively, one could imagine doing it in a way that leaves klmn1 unchanged and instead produces a new dataframe klmn11 with an updated M column. I'm interested in both approaches.
If you set the index of your klmn1 dataframe to the L column, the dataframe will automatically align the indices with any series you subtract from it:
In [1]: klmn1.set_index('L')['M'] - m0
Out[1]:
L
a 0.777595
a -0.671791
b 0.779920
b -0.128690
Name: M
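For the second behaviour the question asks about, producing a new dataframe klmn11 while leaving klmn1 unchanged, one sketch (klmn11 is the name from the question; Series.map aligns m0 by the L values while keeping the original index):
klmn11 = klmn1.copy()
klmn11['M'] = klmn1['M'] - klmn1['L'].map(m0)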
Option #1, treat labels that are missing from one frame as 0, so every label gets a numeric result:
df1.subtract(df2, fill_value=0)
Option #2, leave labels missing from either frame as NaN (this is the default):
df1.subtract(df2, fill_value=None)