Subtract 1 from a column based on another DataFrame - pandas

Trying to figure out how to subtract a constant from a column based on the presence of a value in another DataFrame. For example, say I have the DataFrame a below, which contains columns person and count:
a = pd.DataFrame({
    "person": ["Bob", "Kate", "Joe", "Mark"],
    "count": [3, 4, 5, 4],
})
person count
0 Bob 3
1 Kate 4
2 Joe 5
3 Mark 4
And a second DataFrame b that contains person and whatever other arbitrary columns:
b = pd.DataFrame({
    "person": ["Bob", "Joe"],
    "foo": ['a', 'b'],
})
person foo
0 Bob a
1 Joe b
My hope is that I can change the first DataFrame to look like the below, specifically decreasing count by one for any instance of person within DataFrame b. It is safe to assume that DataFrame b will always be a subset of DataFrame a and that person will be unique.
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
Many thanks in advance!

a["count"] -= a.person.isin(b.person)
With isin we get a boolean Series that is True for each person that appears in the other frame. Because booleans behave as integers (True is 1, False is 0), we can subtract it directly from the count column to get
>>> a
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
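For reference, the intermediate mask looks like this (a quick sketch using the frames above):
mask = a.person.isin(b.person)
# 0     True
# 1    False
# 2     True
# 3    False
# Name: person, dtype: bool
# True behaves as 1 and False as 0 in arithmetic, so subtracting the
# mask decrements the count of exactly the matching rows.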

This answer assumes that df2 can have multiple instances of a name. If each name appears just once, you can subtract by simply iterating through and checking whether the person appears in the second data frame. In df2:
df2_counts = df2['person'].value_counts().rename('subtracts')
In df1, join this data over and then subtract the counts:
df1 = df1.join(df2_counts, on='person')
df1['count_new'] = df1['count'] - df1['subtracts'].fillna(0)
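To see why value_counts matters here, consider a hypothetical second frame in which a name repeats (not the data from the question):
import pandas as pd

df2 = pd.DataFrame({'person': ['Bob', 'Joe', 'Joe']})
print(df2['person'].value_counts())
# Joe    2
# Bob    1
# Joe's count would then be reduced by 2 and Bob's by 1.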

Create a set of person names from DataFrame b (membership tests against a raw Series check the index, not the values, so use a set):
listDFB = set(b['person'])
Then loop through a and decrement the count accordingly (rows yielded by iterrows are copies, so write back with .at):
for i, rw in a.iterrows():
    if rw['person'] in listDFB:
        a.at[i, 'count'] -= 1

Related

Lookup a pandas df for a column value by matching rows with another dataframe

Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
The number of rows in df2 is equal to the total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to each unique combination. But I don't want to store these values in df1; I just need them for some computation, and then I want to store the result in df2 only, by creating a new column.
To summarize, let's say I want to do some computation using df1.Dist[3], for which I also need the values from columns df2.ABC and df2.XYZ, so first find the row index in df2 where
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(in this case it's row #1),
so use df2.ABC[1] and df2.XYZ[1] and store results in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 NaN
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of a function but don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = ((df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day'])
              & (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID'])
              & (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID'])
              & (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID']))
You might want to include a check to verify that you found one and only one combination:
sum(bool_index) == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation if you have large datasets. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.
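Putting the pieces together, here is a minimal sketch that fills RESULT for every row of df1; the question doesn't specify the actual computation, so the formula below is just a placeholder:
import numpy as np

df2['RESULT'] = np.nan
for i in df1.index:
    bool_index = ((df1.loc[i, 'OpDay'] == df2['S_Day'])
                  & (df1.loc[i, 'Rid'] == df2['R_ID'])
                  & (df1.loc[i, 'Tid'] == df2['T_ID'])
                  & (df1.loc[i, 'Sid'] == df2['S_ID']))
    assert bool_index.sum() == 1  # one and only one matching combination
    # Placeholder computation using Dist, ABC and XYZ; substitute your own.
    # If several df1 rows map to the same df2 row, the last one wins.
    df2.loc[bool_index, 'RESULT'] = (df1.loc[i, 'Dist']
                                     * df2.loc[bool_index, 'ABC']
                                     / df2.loc[bool_index, 'XYZ'])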

pandas - get count of duplicate rows (matching across multiple columns)

I have a table like below - unique IDs and names. I want to return any duplicated names (based on matching First and Last).
Id First Last
1 Dave Davis
2 Dave Smith
3 Bob Smith
4 Dave Smith
I've managed to return a count of duplicates across all columns if I don't have an ID column, i.e.
import pandas as pd
dict2 = {'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df2 = pd.DataFrame(dict2)
print(df2.groupby(df2.columns.tolist()).size().reset_index().\
      rename(columns={0:'records'}))
Output:
First Last records
0 Bob Smith 1
1 Dave Davis 1
2 Dave Smith 2
I want to be able to return the duplicates (of first and last) when I also have an ID column, i.e.
import pandas as pd
dict1 = {'Id': pd.Series([1, 2, 3, 4]),
'First': pd.Series(["Dave", "Dave", "Bob", "Dave"]),
'Last': pd.Series(["Davis", "Smith", "Smith", "Smith"])}
df1 = pd.DataFrame(dict1)
print(df1.groupby(df1.columns.tolist()).size().reset_index().\
      rename(columns={0:'records'}))
gives:
Id First Last records
0 1 Dave Davis 1
1 2 Dave Smith 1
2 3 Bob Smith 1
3 4 Dave Smith 1
I want (ideally):
First Last records Ids
0 Dave Smith 2 2, 4
First filter only the duplicated rows with DataFrame.duplicated, passing the columns to check and keep=False so that all duplicates are returned, and select them with boolean indexing. Then aggregate with GroupBy.agg: the counts via GroupBy.size, and the joined ids after converting them to strings:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1[df1.duplicated(['First','Last'], keep=False)]
         .groupby(['First','Last'])['Id'].agg(tup)
         .reset_index())
print (df2)
First Last records Ids
0 Dave Smith 2 2,4
Another idea is to aggregate all values first and then filter with DataFrame.query:
tup = [('records','size'), ('Ids',lambda x: ','.join(x.astype(str)))]
df2 = (df1.groupby(['First','Last'])['Id'].agg(tup)
          .reset_index()
          .query('records != 1'))
print (df2)
First Last records Ids
2 Dave Smith 2 2,4
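On recent pandas versions the same result can also be written with named aggregation, which avoids the list of tuples (a sketch equivalent to the code above):
df2 = (df1[df1.duplicated(['First', 'Last'], keep=False)]
       .groupby(['First', 'Last'])
       .agg(records=('Id', 'size'),
            Ids=('Id', lambda x: ','.join(x.astype(str))))
       .reset_index())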

compare 2 data frames based on 3 columns and update the second data frame

Here is what my data frames look like. I need to compare them: if df1.mid == df2.mid & df1.name == df2.name & df1.pid != df2.pid, then update df2.pid with df1.pid.
df1
mid pid name
1 2 John
2 14 Peter
3 16 Emma
4 20 Adam
df2
mid pid name
1 2 John
2 16 Peter
3 16 Emma
expected result in df2 after update
mid pid name
1 2 John
2 14 Peter
3 16 Emma
A merge is what you want, but there are a few subtleties to take into account:
df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1')) \
   .assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2'])) \
   .drop(columns=['pid_1', 'pid_2'])
merge aligns df1 and df2 based on mid and name. The two pid columns are renamed pid_1 and pid_2.
assign creates a new pid column by combining the two previous pids: if pid_1 is available, use it; if not, keep the original pid_2.
drop drops pid_1 and pid_2, leaving one and only one pid column
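For reference, a self-contained version using the sample frames from the question:
import pandas as pd

df1 = pd.DataFrame({'mid': [1, 2, 3, 4], 'pid': [2, 14, 16, 20],
                    'name': ['John', 'Peter', 'Emma', 'Adam']})
df2 = pd.DataFrame({'mid': [1, 2, 3], 'pid': [2, 16, 16],
                    'name': ['John', 'Peter', 'Emma']})

df2 = (df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1'))
          .assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2']))
          .drop(columns=['pid_1', 'pid_2']))
print(df2)  # rows: (1, John, 2), (2, Peter, 14), (3, Emma, 16)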
You can try with:
# keep df2's rows and pull pid from df1 via the shared keys
df3 = df2[['mid', 'name']].merge(df1, on=['mid', 'name'], how='left')

Concatenating randomly selected values from df2 with df1

So I have a Student dataframe like this,
ID,STUDENT_ID
1,0123
2,9876
3,4567
4,2986
and a Courses dataframe like this,
ID,COURSE_ID
990,CourseA
991,CourseB
992,CourseC
What I'd like to do is to RANDOMLY SELECT ANY 2 COURSE_IDs from the Courses dataframe and append them to each individual STUDENT_ID in the following format.
ID,STUDENT_ID,COURSE_ID
1,0123,CourseA
2,0123,CourseB
3,9876,CourseB
4,9876,CourseC
5,4567,CourseA
6,4567,CourseC
7,2986,CourseA
8,2986,CourseC
Basically, I have to create one replica of each individual STUDENT_ID. Then, after selecting the 2 random COURSE_IDs, attach them to the STUDENT_ID one by one. I only have to make sure that the randomly selected COURSE_IDs for each STUDENT_ID are always unique, i.e., a student should not receive the same course twice.
I know I can use
df1 = pd.concat([df1, df1], ignore_index=True)
df1['ID'] = np.arange(1, len(df1) + 1)
df1.sort_values(['STUDENT_ID'], inplace=True)
to make a duplicate of my STUDENT_IDs.
I also know that I can use
df2.sample(2)
to randomly select 2 COURSE_IDs.
But I'm not sure how to combine these 2 to get the expected result. I'd really appreciate some help here. Thanks in advance.
You could try numpy.hstack in a list comprehension to create your array of random courses, then Index.repeat and DataFrame.assign to create the desired output:
import numpy as np
rand_courses = np.hstack([Courses['COURSE_ID'].sample(2).values for i in range(len(Student))])
Student.loc[Student.index.repeat(2)].assign(COURSE_ID=rand_courses, ID=np.arange(len(Student)*2) + 1)
[out]
ID STUDENT_ID COURSE_ID
0 1 123 CourseA
0 2 123 CourseC
1 3 9876 CourseB
1 4 9876 CourseA
2 5 4567 CourseA
2 6 4567 CourseB
3 7 2986 CourseB
3 8 2986 CourseA
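One usage note: DataFrame.sample draws without replacement by default, which is what guarantees that no student receives the same course twice. If you need reproducible draws, pass random_state; seeding per student, as below, is just one illustrative option:
rand_courses = np.hstack([Courses['COURSE_ID'].sample(2, random_state=i).values
                          for i in range(len(Student))])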

Looking up values from one dataframe in specific row of another dataframe

I'm struggling with a bit of a complex (to me) lookup-type problem.
I have a dataframe df1 that looks like this:
Grade Value
0 A 16
1 B 12
2 C 5
And another dataframe (df2) where the values in one of df1's columns ('Grade') form the index:
Tier 1 Tier 2 Tier 3
A 20 17 10
B 16 11 3
C 7 6 2
I've been trying to write a bit of code that, for each row in df1, looks up the row corresponding to 'Grade' in df2, finds the smallest value in df2 greater than 'Value', and returns the name of that column.
E.g. for the second row of df1, it looks up the row with index 'B' in df2: 16 is the smallest value greater than 12, so it returns 'Tier 1'. Ideal output would be:
Grade Value Tier
0 A 16 Tier 2
1 B 12 Tier 1
2 C 5 Tier 2
My novice, downloaded-Python-last-week attempt so far has been as follows, which is throwing up all manner of errors and doesn't even try returning the column name. Sorry also about the micro-ness of the question: any help appreciated!
for i, row in input_df1.iterrows():
    Tier = np.argmin(df1['Value'] < df2.loc[row, 0:df2.shape[1]])
    df2.loc[df1.Grade].eq(df1.Value, 0).idxmax(1)
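One way to approach this (a minimal sketch, assuming the df1 and df2 shown above): for each row of df1, select the matching row of df2, keep only the values greater than 'Value', and take idxmin to get the column name of the smallest remaining value.
def find_tier(row):
    # the df2 row for this grade, e.g. df2.loc['B'] -> [16, 11, 3]
    candidates = df2.loc[row['Grade']]
    # keep only the values strictly greater than this row's Value
    greater = candidates[candidates > row['Value']]
    # idxmin returns the column name of the smallest remaining value
    return greater.idxmin() if not greater.empty else None

df1['Tier'] = df1.apply(find_tier, axis=1)
# Grade Value Tier -> (A, 16, Tier 2), (B, 12, Tier 1), (C, 5, Tier 2)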