How to append/concat 2 Pandas Dataframe with different columns - pandas

How to concat/append based on common column values?
I'm creating some dfs from some files, and I want to compile them.
The columns don't always match, but there will always be some common columns (I only know a few columns guaranteed to match, but there's a lot of columns, and I'd like to retain as much info as possible)
df1:
Name
Status
John
1
Jane
2
df2:
Extra1
Extra2
Name
Status
a
b
Bob
2
c
d
Nancy
2
Desired output:
either this (doesn't matter of the order):
Extra1
Extra2
Name
Status
a
b
Bob
2
c
d
Nancy
2
NULL
NULL
John
1
NULL
NULL
Jane
2
Or this (doesn't matter of the order):
Name
Status
John
1
Jane
2
Bob
2
Nancy
2
I've tried these, but doesn't get the result I want:
df = pd.concat([df2, df], axis=0, ignore_index=True)
df = df.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
Thanks
Not sure why the tables aren't being formatted, it shows up fine in the preview

import pandas as pd
df1 = pd.DataFrame({'Name':['John', 'Jane'],'Status':[1,2]})
df2 = pd.DataFrame({'Extra1':['a','b'],'Extra2':['c','d'],'Name':['bob', 'nancy'],'Status':[2,2]})
df = pd.concat([df1,df2], axis=0, ignore_index=True)
Gives me
Name
Status
Extra1
Extra2
John
1
NaN
NaN
Jane
2
NaN
NaN
bob
2
a
c
nancy
2
b
d
Which looks to me like your desired output.
And your tables aren't formatted correctly because you need empty newlines between text and tables.

Related

Get value from another df based on condition

I have 2 df
df1:
ID X Y Cond
Johnson 2 3 fine
Sand NAN NAN sick
Cooper 1 2 fine
Nelson 1 2 fine
Peterson 4 5 fine
and df2 :
id2 X Y
Magic 2 3
Sand 2 3
Cooper 1 2
Dean 1 2
I want to update x value in df1, if Cond ="sick" and df["id"] = df["id2]
to get the new df1 :
ID X Y Cond
Johnson 2 3 fine
Sand 2 3 sick
Cooper 1 2 fine
Nelson 1 2 fine
Peterson 4 5 fine
I tried :
df1["x"] = np.where((df["cond"]=="sick")& (df1["id"]==df2["id2"]),df2["x"],"")
But its not working. I get this ValueError :
ValueError: Can only compare identically-labeled Series objects
Thank you
First convert both ID columns to index values for possible match selected rows by DataFrame.loc:
df11 = df1.set_index('ID')
df22 = df2.set_index('id2')
df11.loc[df11["Cond"]=="sick", ['X','Y']] = df22[['X','Y']]
df = df11.reset_index()
print (df)
ID X Y Cond
0 Johnson 2 3 fine
1 Sand 2 3 sick
2 Cooper 1 2 fine
3 Nelson 1 2 fine
4 Peterson 4 5 fine
You can use the where() method of pandas dataframes instead of the wherefunction from numpy. The code looks like this :
df1.loc[:,["X", "Y"]] = df1.loc[:,["X", "Y"]].where(df1["Cond"]!="sick",df2.loc[:,["X", "Y"]])

Compare two data frames for different values in a column

I have two dataframe, please tell me how I can compare them by operator name, if it matches, then add the values ​​of quantity and time to the first data frame.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume time columns in both df1 and df2 are already in correct date/time format. If they are in string format, you need to convert them before running above commands as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)

compare 2 data frames based on 3 columns and update second data

Here are my data frames looks like. Need to compare based on if df1.mid = df2.mid & df1.name=df2.name & df1.pid != df2.pid then update df2.pid with df1.pid.
df1
mid pid name
1 2 John
2 14 Peter
3 16 Emma
4 20 Adam
df2
mid pid name
1 2 John
2 16 Peter
3 16 Emma
expected result in df2 after update
mid pid name
1 2 John
2 14 Peter
3 16 Emma
A merge is what you want but there are some finesse to take into account:
df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1')) \
.assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2'])) \
.drop(columns=['pid_1', 'pid_2'])
merge aligns df1 and df2 based on mid and name. The two pid columns are renamed pid_1 and pid_2.
assign creates a new pid column by combining the two previous pids: if pid_1 is available, use that, if not, keep the original pid_2
drop drops pid_1 and pid_2, leaving one and only one pid column
You can try with:
df3= df1.join(df2,on=['mid','name','pid'],how='right')

Concatinating randomly selected value from df2 with df1

So I have a Student dataframe like this,
ID,STUDENT_ID
1,0123
2,9876
3,4567
4,2986
and a Courses dataframe like this,
ID,COURSE_ID
990,CourseA
991,CourseB
992,CourseC
What I'd like to do is to RANDOMLY SELECT ANY 2 COURSE_IDs from the Courses dataframe and append it to each indiviual STUDENT_ID in the following format.
ID,STUDENT_ID,COURSE_ID
1,0123,CourseA
2,0123,CourseB
3,9876,CourseB
4,9876,CourseC
5,4567,CourseA
6,4567,CourseC
7,2986,CourseA
8,2986,CourseC
Basically, I have to create 1 replica of each individual STUDENT_ID. Then after selecting the 2 random COURSE_IDs, attach it to the STUDENT_ID one by one. I only have to make sure that the randomly selected COURSE_IDs for each STUDENT_ID is always unique i.e., a STUDENT should not receive the same course twice.
I know I can use
df1 = df1.append([df1]*1, ignore_index=True)
df1['ID'] = np.arange(1, len(df1) + 1)
df1.sort_values(['STUDENT_ID'], inplace=True)
to make a duplicate of my STUDENT_IDs.
I also know that I can use
df2.sample(2)
to randomly select 2 COURSE_IDs.
But I'm not sure how to combine these 2 to get the expected result. I'd really appreciate some help here. Thanks in advance.
You could try numpy.hstack in a list comprehension to create your array of random courses, then Index.repeat and DataFrame.assign to create the desired output:
import numpy as np
rand_courses = np.hstack([Courses['COURSE_ID'].sample(2).values for i in range(len(Student))])
Student.loc[Student.index.repeat(2)].assign(COURSE_ID=rand_courses, ID=np.arange(len(Student)*2) + 1)
[out]
ID STUDENT_ID COURSE_ID
0 1 123 CourseA
0 2 123 CourseC
1 3 9876 CourseB
1 4 9876 CourseA
2 5 4567 CourseA
2 6 4567 CourseB
3 7 2986 CourseB
3 8 2986 CourseA

python pandas - set column value of column based on index and or ID of concatenated dataframes

I have a concatenated dataframe of at least two concatenated dataframes:
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1,df2], keys = (1,2))
df3.loc[(1), 'Type'] == 'C'
When you concat you can assign the df's keys. This will create a multi-index with the keys separating the concatonated df's. Then when you use .loc with keys you can use( around the key to call the group. In the code above we would change all the Types of df1 (which has a key of 1) to C.
Use merge with indicator=True to find rows belong to df1 or df2. Next, use np.where to assign A or B.
t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4