compare 2 data frames based on 3 columns and update second data - pandas

Here is what my data frames look like. I need to compare them: if df1.mid == df2.mid and df1.name == df2.name and df1.pid != df2.pid, then update df2.pid with df1.pid.
df1
mid pid name
1 2 John
2 14 Peter
3 16 Emma
4 20 Adam
df2
mid pid name
1 2 John
2 16 Peter
3 16 Emma
expected result in df2 after update
mid pid name
1 2 John
2 14 Peter
3 16 Emma

A merge is what you want, but there are some subtleties to take into account:
df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1')) \
.assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2'])) \
.drop(columns=['pid_1', 'pid_2'])
merge aligns df1 and df2 based on mid and name. The two pid columns are renamed pid_1 and pid_2.
assign creates a new pid column by combining the two previous pids: if pid_1 is available, use that, if not, keep the original pid_2
drop drops pid_1 and pid_2, leaving one and only one pid column
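For reference, here is a minimal, self-contained sketch of the same merge applied to the sample data from the question. It assumes each (mid, name) pair appears at most once in df1, since otherwise the merge would duplicate rows of df2. The names merged and updated, the cast back to int, and the final column reordering are additions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"mid": [1, 2, 3, 4],
                    "pid": [2, 14, 16, 20],
                    "name": ["John", "Peter", "Emma", "Adam"]})
df2 = pd.DataFrame({"mid": [1, 2, 3],
                    "pid": [2, 16, 16],
                    "name": ["John", "Peter", "Emma"]})

# Left merge keeps every row of df2; pid_1 is NaN wherever df1 has no match.
merged = df2.merge(df1, on=["mid", "name"], how="left", suffixes=("_2", "_1"))

# Prefer df1's pid, fall back to df2's own pid, then restore the column order.
updated = (
    merged.assign(pid=merged["pid_1"].combine_first(merged["pid_2"]).astype(int))
          .drop(columns=["pid_1", "pid_2"])
          [["mid", "pid", "name"]]
)
print(updated)
```

Only Peter's pid changes (16 becomes 14); John and Emma already agree with df1.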

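You can try a left merge on the key columns, which takes pid from df1:
df3 = df2[['mid', 'name']].merge(df1, on=['mid', 'name'], how='left')
Note that rows of df2 with no match in df1 end up with a missing pid.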

Related

Subtract 1 from column based on another DataFrame. Pandas

Trying to figure out how to subtract a constant from a column based on the presence of a value in another DataFrame. For example, if I have the DataFrame a below, with the columns person and count:
a = pd.DataFrame({
"person":["Bob", "Kate", "Joe", "Mark"],
"count":[3, 4, 5, 4],
})
person count
0 Bob 3
1 Kate 4
2 Joe 5
3 Mark 4
And a second DataFrame that contains Person and whatever other arbitrary columns:
b = pd.DataFrame({
"person":["Bob", "Joe"],
"foo":['a', 'b'],
})
person foo
0 Bob a
1 Joe b
My hope is that I can change the first DataFrame to look like the below. Specifically decreasing count by one for any instance of person within DataFrame b. It is safe to assume that DataFrame b will always be a subset of DataFrame a and person will be unique.
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
Many thanks in advance!
a["count"] -= a.person.isin(b.person)
With isin we get a boolean array holding True or False for each person, depending on whether that person appears in the other DataFrame. Treating those booleans as integers (True is 1, False is 0), we can subtract them from the count column to get
>>> a
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
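To make the boolean arithmetic explicit, here is a small runnable sketch. It uses the counts from the printed table in the question (3, 4, 5, 4); the intermediate name mask and the explicit astype(int) cast are additions for readability, not part of the one-liner above.

```python
import pandas as pd

a = pd.DataFrame({"person": ["Bob", "Kate", "Joe", "Mark"],
                  "count": [3, 4, 5, 4]})
b = pd.DataFrame({"person": ["Bob", "Joe"], "foo": ["a", "b"]})

# True where the person also appears in b, False otherwise.
mask = a["person"].isin(b["person"])

# Cast the booleans to 0/1 and subtract them from count.
a["count"] -= mask.astype(int)
print(a)
```

Because b.person is unique, each matching row is decremented exactly once.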
This answer assumes that df2 can have multiple instances of a name. If each name appears only once, you can simply iterate through and check whether the person appears in the second data frame. First count the occurrences in df2:
df2_counts = df2['person'].value_counts()
Then, in df1, map these counts onto each person (0 where there is no match) and subtract them:
df1['subtracts'] = df1['person'].map(df2_counts).fillna(0).astype(int)
df1['count_new'] = df1['count'] - df1['subtracts']
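A short sketch of the multiple-instance case may help here. The df2 below is hypothetical (Bob is listed twice, which the question itself rules out), and it uses Series.map to line the counts up with df1 by name; occurrences and subtracts are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({"person": ["Bob", "Kate", "Joe", "Mark"],
                    "count": [3, 4, 5, 4]})
# Hypothetical second frame in which Bob appears twice.
df2 = pd.DataFrame({"person": ["Bob", "Joe", "Bob"]})

# Number of times each person occurs in df2, indexed by name.
occurrences = df2["person"].value_counts()

# Look up each df1 person in those counts; people absent from df2 get 0.
subtracts = df1["person"].map(occurrences).fillna(0).astype(int)
df1["count_new"] = df1["count"] - subtracts
print(df1)
```

Here Bob's count drops by 2 and Joe's by 1, while Kate and Mark are unchanged.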
Create a list of person names from DataFrame b (DFB):
listDFB = DFB['person'].tolist()
Then loop through dfa and update count in place (assigning to rw would only change a copy of the row):
for i, rw in dfa.iterrows():
    if rw['person'] in listDFB:
        dfa.at[i, 'count'] = rw['count'] - 1

How to append/concat 2 Pandas Dataframe with different columns

How to concat/append based on common column values?
I'm creating some dfs from some files, and I want to compile them.
The columns don't always match, but there will always be some common columns (I only know a few columns guaranteed to match, but there's a lot of columns, and I'd like to retain as much info as possible)
df1:
Name Status
John 1
Jane 2
df2:
Extra1 Extra2 Name Status
a b Bob 2
c d Nancy 2
Desired output:
Either this (the order doesn't matter):
Extra1 Extra2 Name Status
a b Bob 2
c d Nancy 2
NULL NULL John 1
NULL NULL Jane 2
Or this (the order doesn't matter):
Name Status
John 1
Jane 2
Bob 2
Nancy 2
I've tried these, but they don't give the result I want:
df = pd.concat([df2, df], axis=0, ignore_index=True)
df = df.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
Thanks
Not sure why the tables aren't being formatted, it shows up fine in the preview
import pandas as pd
df1 = pd.DataFrame({'Name':['John', 'Jane'],'Status':[1,2]})
df2 = pd.DataFrame({'Extra1':['a','b'],'Extra2':['c','d'],'Name':['bob', 'nancy'],'Status':[2,2]})
df = pd.concat([df1,df2], axis=0, ignore_index=True)
Gives me
Name Status Extra1 Extra2
John 1 NaN NaN
Jane 2 NaN NaN
bob 2 a c
nancy 2 b d
Which looks to me like your desired output.
And your tables aren't formatted correctly because you need empty newlines between text and tables.
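Both desired outputs from the question can be produced with pd.concat by changing its join argument. The sketch below uses the sample data as written in the question; the variable names all_columns and common_only are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Jane"], "Status": [1, 2]})
df2 = pd.DataFrame({"Extra1": ["a", "c"], "Extra2": ["b", "d"],
                    "Name": ["Bob", "Nancy"], "Status": [2, 2]})

# join="outer" (the default) keeps the union of columns; gaps become NaN.
all_columns = pd.concat([df2, df1], ignore_index=True)

# join="inner" keeps only the columns that every frame has in common.
common_only = pd.concat([df1, df2], join="inner", ignore_index=True)

print(all_columns)
print(common_only)
```

The outer join retains as much information as possible, which matches the stated goal; the inner join is the option when only the guaranteed common columns are wanted.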

Concatenating randomly selected value from df2 with df1

So I have a Student dataframe like this,
ID,STUDENT_ID
1,0123
2,9876
3,4567
4,2986
and a Courses dataframe like this,
ID,COURSE_ID
990,CourseA
991,CourseB
992,CourseC
What I'd like to do is to RANDOMLY SELECT ANY 2 COURSE_IDs from the Courses dataframe and append them to each individual STUDENT_ID in the following format.
ID,STUDENT_ID,COURSE_ID
1,0123,CourseA
2,0123,CourseB
3,9876,CourseB
4,9876,CourseC
5,4567,CourseA
6,4567,CourseC
7,2986,CourseA
8,2986,CourseC
Basically, I have to create 1 replica of each individual STUDENT_ID. Then after selecting the 2 random COURSE_IDs, attach it to the STUDENT_ID one by one. I only have to make sure that the randomly selected COURSE_IDs for each STUDENT_ID is always unique i.e., a STUDENT should not receive the same course twice.
I know I can use
df1 = pd.concat([df1, df1], ignore_index=True)
df1['ID'] = np.arange(1, len(df1) + 1)
df1.sort_values(['STUDENT_ID'], inplace=True)
to make a duplicate of my STUDENT_IDs.
I also know that I can use
df2.sample(2)
to randomly select 2 COURSE_IDs.
But I'm not sure how to combine these 2 to get the expected result. I'd really appreciate some help here. Thanks in advance.
You could try numpy.hstack in a list comprehension to create your array of random courses, then Index.repeat and DataFrame.assign to create the desired output:
import numpy as np
rand_courses = np.hstack([Courses['COURSE_ID'].sample(2).values for i in range(len(Student))])
Student.loc[Student.index.repeat(2)].assign(COURSE_ID=rand_courses, ID=np.arange(len(Student)*2) + 1)
[out]
ID STUDENT_ID COURSE_ID
0 1 123 CourseA
0 2 123 CourseC
1 3 9876 CourseB
1 4 9876 CourseA
2 5 4567 CourseA
2 6 4567 CourseB
3 7 2986 CourseB
3 8 2986 CourseA
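A reproducible variant of this idea is sketched below. It swaps Series.sample for NumPy's Generator.choice with replace=False (which likewise never repeats a course within one draw) and fixes a seed; the seed value, the names rng, picks and result, and storing STUDENT_ID as strings to keep the leading zero are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

Student = pd.DataFrame({"ID": [1, 2, 3, 4],
                        "STUDENT_ID": ["0123", "9876", "4567", "2986"]})
Courses = pd.DataFrame({"ID": [990, 991, 992],
                        "COURSE_ID": ["CourseA", "CourseB", "CourseC"]})

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Draw 2 distinct courses per student; replace=False guarantees uniqueness
# within each student's pair.
picks = np.concatenate([
    rng.choice(Courses["COURSE_ID"].to_numpy(), size=2, replace=False)
    for _ in range(len(Student))
])

# Repeat every student row twice, attach the picks, and renumber ID from 1.
result = (
    Student.loc[Student.index.repeat(2), ["STUDENT_ID"]]
           .assign(COURSE_ID=picks)
           .reset_index(drop=True)
)
result.insert(0, "ID", np.arange(1, len(result) + 1))
print(result)
```

The courses chosen depend on the seed, but every student always receives two different ones.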

python pandas - set column value of column based on index and or ID of concatenated dataframes

I have a concatenated dataframe of at least two concatenated dataframes:
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1,df2], keys = (1,2))
df3.loc[(1), 'Type'] = 'C'
When you concat, you can assign keys to the DataFrames. This creates a MultiIndex in which the keys separate the concatenated DataFrames. Then, when you use .loc, you can pass a key (optionally wrapped in parentheses) to select that group. In the code above, we would change all the Types of df1 (which has a key of 1) to C.
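As a runnable illustration of the keys idea, the sketch below tags each source frame during the concat and then sets Type for both groups at once. Instead of .loc it reads the key level with get_level_values and uses np.where, which is a substitution made here to avoid any ambiguity in MultiIndex label lookup; the string keys "df1" and "df2" are arbitrary labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Name": ["Joe", "Fred", "Mike", "Frank"],
                    "Type": ["A", "B", "Both", "Both"],
                    "ID": [1, 2, 3, 4]})
df2 = pd.DataFrame({"Name": ["Bill", "Jill", "Mill", "Hill"],
                    "Type": ["Both", "Both", "B", "A"],
                    "ID": [1, 2, 3, 4]})

# keys adds an outer index level recording which frame each row came from.
df3 = pd.concat([df1, df2], keys=["df1", "df2"])

# Read that outer level and choose the new Type from it.
source = df3.index.get_level_values(0)
df3["Type"] = np.where(source == "df1", "C", "B")

# Drop the helper level to get back the original row labels.
df3 = df3.droplevel(0)
print(df3)
```

This works regardless of how many rows each frame has, since the key level is created by the concat itself.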
Use merge with indicator=True to find which rows belong to df1 and which to df2. Next, use np.where to assign B or C.
t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4
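The indicator approach can be tried end to end with the sample frames, as sketched below. It relies on two assumptions that hold for this data but not in general: no row of df2 is identical to a row of df1 across all columns, and df1 has no duplicate rows (otherwise the merge would mislabel or multiply rows).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Name": ["Joe", "Fred", "Mike", "Frank"],
                    "Type": ["A", "B", "Both", "Both"],
                    "ID": [1, 2, 3, 4]})
df2 = pd.DataFrame({"Name": ["Bill", "Jill", "Mill", "Hill"],
                    "Type": ["Both", "Both", "B", "A"],
                    "ID": [1, 2, 3, 4]})
concatdf = pd.concat([df1, df2])

# _merge is "both" for rows also found in df1 and "left_only" for the rest.
t = concatdf.merge(df1, how="left", on=concatdf.columns.tolist(), indicator=True)

# Rows found in df1 become C; everything else (here, df2's rows) becomes B.
concatdf["Type"] = np.where(t["_merge"].eq("both"), "C", "B")
print(concatdf)
```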

Sorting by maximum value and display a different column than the one used for sorting

I have data in a file that looks like :
id Name records
1 joe 3
1 james 4
1 jacky 4
2 mike 10
2 mat 8
2 peter 10
3 bob 3
3 alice 1
3 wis 1
All records with the same id belong to one person, but the names may differ. I need to find the id with the maximum total records. In the example above, id 2 has 10+8+10 = 28 records, which is the maximum across all ids.
So the result of my query should be any one of the names for that id, i.e. either mike, mat, or peter. I need to do this using awk.
I tried the following:
awk '{arr[$1]+=$3} END {for (i in arr){if(arr[i]>max) max=arr[i] ; name=i} } END {print name}'
A couple of issues:
you aren't ignoring the header row,
you aren't saving the name ($2) anywhere,
you have 2 END blocks.
I think you want this:
awk 'NR>1{count[$1]+=$3;name[$1]=$2;} END{for(i in count){if(count[i]>m){m=count[i]; n=name[i]}};print m,n}' file
28 peter