Concatenating randomly selected values from df2 with df1 - pandas

So I have a Student dataframe like this,
ID,STUDENT_ID
1,0123
2,9876
3,4567
4,2986
and a Courses dataframe like this,
ID,COURSE_ID
990,CourseA
991,CourseB
992,CourseC
What I'd like to do is to RANDOMLY SELECT ANY 2 COURSE_IDs from the Courses dataframe and append them to each individual STUDENT_ID in the following format.
ID,STUDENT_ID,COURSE_ID
1,0123,CourseA
2,0123,CourseB
3,9876,CourseB
4,9876,CourseC
5,4567,CourseA
6,4567,CourseC
7,2986,CourseA
8,2986,CourseC
Basically, I have to create 1 replica of each individual STUDENT_ID. Then, after selecting the 2 random COURSE_IDs, I attach them to the STUDENT_ID one by one. I only have to make sure that the randomly selected COURSE_IDs for each STUDENT_ID are always unique, i.e., a student should not receive the same course twice.
I know I can use
df1 = pd.concat([df1, df1], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df1['ID'] = np.arange(1, len(df1) + 1)
df1.sort_values(['STUDENT_ID'], inplace=True)
to make a duplicate of my STUDENT_IDs.
I also know that I can use
df2.sample(2)
to randomly select 2 COURSE_IDs.
But I'm not sure how to combine these 2 to get the expected result. I'd really appreciate some help here. Thanks in advance.

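You could try numpy.hstack in a list comprehension to create your array of random courses, then Index.repeat and DataFrame.assign to create the desired output. Series.sample draws without replacement by default, so the 2 courses picked for each student are always different: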
import numpy as np
rand_courses = np.hstack([Courses['COURSE_ID'].sample(2).values for i in range(len(Student))])
Student.loc[Student.index.repeat(2)].assign(COURSE_ID=rand_courses, ID=np.arange(len(Student)*2) + 1)
[out]
ID STUDENT_ID COURSE_ID
0 1 123 CourseA
0 2 123 CourseC
1 3 9876 CourseB
1 4 9876 CourseA
2 5 4567 CourseA
2 6 4567 CourseB
3 7 2986 CourseB
3 8 2986 CourseA
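For reference, here is a minimal, self-contained sketch of the same idea with a seeded NumPy generator so the draw is reproducible. The names student, courses, rng and n_per_student are illustrative assumptions, not from the original post, and STUDENT_ID is kept as a string so the leading zero in 0123 survives.

```python
import numpy as np
import pandas as pd

# Sample data from the question; STUDENT_ID as strings to keep leading zeros.
student = pd.DataFrame({"ID": [1, 2, 3, 4],
                        "STUDENT_ID": ["0123", "9876", "4567", "2986"]})
courses = pd.DataFrame({"ID": [990, 991, 992],
                        "COURSE_ID": ["CourseA", "CourseB", "CourseC"]})

rng = np.random.default_rng(0)   # hypothetical seed, only for reproducibility
n_per_student = 2                # assumed number of courses per student

# One independent draw without replacement per student, flattened into one array.
picks = np.concatenate([
    rng.choice(courses["COURSE_ID"].to_numpy(), size=n_per_student, replace=False)
    for _ in range(len(student))
])

# Repeat each student row, attach the picks, then renumber ID from 1.
result = (
    student.loc[student.index.repeat(n_per_student), ["STUDENT_ID"]]
    .assign(COURSE_ID=picks)
    .reset_index(drop=True)
)
result.insert(0, "ID", np.arange(1, len(result) + 1))
print(result)
```

Because replace=False is applied per student, no student can receive the same course twice, while different students may still share courses.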

Related

Subtract 1 from column based on another DataFrame. Pandas

Trying to figure out how to subtract a constant from a column based on the presence of a value in another DataFrame. For example, if I have the below DataFrame a that contains a column called person name and count:
a = pd.DataFrame({
    "person": ["Bob", "Kate", "Joe", "Mark"],
    "count": [3, 4, 5, 4],
})
person count
0 Bob 3
1 Kate 4
2 Joe 5
3 Mark 4
And a second DataFrame that contains Person and whatever other arbitrary columns:
b = pd.DataFrame({
    "person": ["Bob", "Joe"],
    "foo": ['a', 'b'],
})
person foo
0 Bob a
1 Joe b
My hope is that I can change the first DataFrame to look like the below. Specifically decreasing count by one for any instance of person within DataFrame b. It is safe to assume that DataFrame b will always be a subset of DataFrame a and person will be unique.
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
Many thanks in advance!
a["count"] -= a.person.isin(b.person)
isin gives a boolean Series that is True for each person in a who also appears in b.person. Booleans behave as 0 and 1 in arithmetic, so subtracting the mask from the count column decreases exactly the matching rows by one, giving
>>> a
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
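A small runnable sketch of the isin approach, using the counts shown in the question's printed table (3, 4, 5, 4). The intermediate name mask is an assumption added for readability.

```python
import pandas as pd

a = pd.DataFrame({"person": ["Bob", "Kate", "Joe", "Mark"],
                  "count": [3, 4, 5, 4]})
b = pd.DataFrame({"person": ["Bob", "Joe"],
                  "foo": ["a", "b"]})

# True where the person in `a` also appears in `b`.
mask = a["person"].isin(b["person"])

# Cast explicitly to int so the subtraction of 0/1 is obvious to the reader.
a["count"] = a["count"] - mask.astype(int)
print(a)
```

The explicit astype(int) is not required, since pandas already treats True and False as 1 and 0 here, but it makes the intent clearer.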
This answer assumes that a name can appear several times in df2 (df1 and df2 here are a and b from the question). If each name appears at most once, a simple membership check is enough. First, count how often each person occurs in df2:
df2_counts = df2['person'].value_counts()
Then map those counts onto df1, treating people who are missing from df2 as 0, and subtract:
df1['subtracts'] = df1['person'].map(df2_counts).fillna(0).astype(int)
df1['count_new'] = df1['count'] - df1['subtracts']
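Create a set of person names from dataset B (a set, because the in operator on a Series checks the index rather than the values):
listDFB = set(DFB['person'])
Loop through dfa and update the count column accordingly. Write through dfa.at, since the rows yielded by iterrows are copies:
for i, rw in dfa.iterrows():
    if rw['person'] in listDFB:
        dfa.at[i, 'count'] = rw['count'] - 1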

How to append/concat 2 Pandas Dataframe with different columns

How to concat/append based on common column values?
I'm creating some dfs from some files, and I want to compile them.
The columns don't always match, but there will always be some common columns (I only know a few columns that are guaranteed to match, but there are a lot of columns, and I'd like to retain as much info as possible).
df1:
Name Status
John 1
Jane 2
df2:
Extra1 Extra2 Name Status
a b Bob 2
c d Nancy 2
Desired output:
either this (the row order doesn't matter):
Extra1 Extra2 Name Status
a b Bob 2
c d Nancy 2
NULL NULL John 1
NULL NULL Jane 2
Or this (the row order doesn't matter):
Name Status
John 1
Jane 2
Bob 2
Nancy 2
I've tried these, but they don't give the result I want:
df = pd.concat([df2, df], axis=0, ignore_index=True)
df = df.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
Thanks
Not sure why the tables aren't being formatted, it shows up fine in the preview
import pandas as pd
df1 = pd.DataFrame({'Name':['John', 'Jane'],'Status':[1,2]})
df2 = pd.DataFrame({'Extra1':['a','b'],'Extra2':['c','d'],'Name':['bob', 'nancy'],'Status':[2,2]})
df = pd.concat([df1,df2], axis=0, ignore_index=True)
Gives me
Name Status Extra1 Extra2
John 1 NaN NaN
Jane 2 NaN NaN
bob 2 a c
nancy 2 b d
Which looks to me like your desired output.
And your tables aren't formatted correctly because you need empty newlines between text and tables.
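Both desired outputs from the question can be sketched with pd.concat alone. The default join="outer" keeps every column and fills the gaps with NaN, while join="inner" keeps only the columns the frames share. The variable names all_cols and common are assumptions for illustration, and the data follows the tables in the question.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Jane"], "Status": [1, 2]})
df2 = pd.DataFrame({"Extra1": ["a", "c"], "Extra2": ["b", "d"],
                    "Name": ["Bob", "Nancy"], "Status": [2, 2]})

# First desired output: union of columns, NaN where a frame lacks a column.
all_cols = pd.concat([df2, df1], axis=0, ignore_index=True)

# Second desired output: only the columns present in both frames.
common = pd.concat([df1, df2], axis=0, join="inner", ignore_index=True)

print(all_cols)
print(common)
```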

Lookup a pandas df for a column value by matching rows with another dataframe

Say I have a pandas dataframe df1 as follows:
OpDay Rid Tid Sid Dist
0 18Sep 1 1 1 10
1 18Sep 1 1 1 15
2 18Sep 1 1 1 20
3 18Sep 1 5 4 5
4 18Sep 1 5 4 50
and df2 like:
S_Day R_ID T_ID S_ID ABC XYZ
0 18Sep 1 1 1 100 60
1 18Sep 1 5 4 125 100
Number of rows in df2 is equal to total number of unique combinations of OpDay+Rid+Tid+Sid in df1.
Now, I want the values of columns ABC and XYZ from df2 corresponding to each unique combination. But I don't want to store these values in df1. I just need them for a computation, and then I want to store the result only in df2, in a new column.
To summarize, let's say I want to do some computation using df1.Dist[3], for which I also need the values from columns df2.ABC and df2.XYZ. So first I find the row index in df2 where
S_Day = OpDay[3],
R_ID = Rid[3],
T_ID = Tid[3] and
S_ID = Sid[3]
(in this case it's row #1),
then use df2.ABC[1] and df2.XYZ[1] and store the result in df2.RESULT[1].
So now df2 will look something like:
S_Day R_ID T_ID S_ID ABC XYZ RESULT
0 18Sep 1 1 1 100 60 NaN
1 18Sep 1 5 4 125 100 some computed value
Basically I guess I need a lookup kind of a function but don't know how to proceed further.
Please help as I am new to the world of python and programming. Many thanks in advance.
You can use .loc and Boolean indices to do what you want. Let's say that you're after the ith row of df1:
i = 3
Next, you can use Boolean indexing to find the corresponding rows in df2:
bool_index = (
    (df1.loc[i, 'OpDay'] == df2.loc[:, 'S_Day'])
    & (df1.loc[i, 'Rid'] == df2.loc[:, 'R_ID'])
    & (df1.loc[i, 'Tid'] == df2.loc[:, 'T_ID'])
    & (df1.loc[i, 'Sid'] == df2.loc[:, 'S_ID'])
)
You might want to include a check to verify that you found one and only one combination:
assert bool_index.sum() == 1
And finally, you can use the boolean index to call the right values from df2:
ABC_for_computation = df2.loc[bool_index, 'ABC']
XYZ_for_computation = df2.loc[bool_index, 'XYZ']
Note that I'm not too sure about the speed of this operation on large datasets, since each lookup compares the key against every row of df2. In my experience, if speed is affected you should switch to numpy arrays instead of dataframes, particularly when writing data into your dataframe.
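If every row of df1 needs its matching ABC and XYZ, a single merge on the four key columns avoids repeating the boolean lookup row by row. The sketch below is only an illustration: the formula Dist * ABC / XYZ and the sum per group are hypothetical stand-ins for the unspecified computation, and the names keys1, keys2, merged and per_key are assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({"OpDay": ["18Sep"] * 5, "Rid": [1] * 5,
                    "Tid": [1, 1, 1, 5, 5], "Sid": [1, 1, 1, 4, 4],
                    "Dist": [10, 15, 20, 5, 50]})
df2 = pd.DataFrame({"S_Day": ["18Sep"] * 2, "R_ID": [1, 1],
                    "T_ID": [1, 5], "S_ID": [1, 4],
                    "ABC": [100, 125], "XYZ": [60, 100]})

keys1 = ["OpDay", "Rid", "Tid", "Sid"]   # key columns in df1
keys2 = ["S_Day", "R_ID", "T_ID", "S_ID"]  # matching key columns in df2

# Temporary frame with ABC/XYZ next to each df1 row; df1 itself is not modified.
# validate raises if a key combination occurs more than once in df2.
merged = df1.merge(df2, left_on=keys1, right_on=keys2,
                   how="left", validate="many_to_one")

# Hypothetical computation, standing in for whatever is really needed.
merged["value"] = merged["Dist"] * merged["ABC"] / merged["XYZ"]

# Collapse to one value per key combination and store it in df2 as RESULT.
per_key = merged.groupby(keys2)["value"].sum().rename("RESULT").reset_index()
df2 = df2.merge(per_key, on=keys2, how="left")
print(df2)
```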

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name and, if a name matches, add the values of quantity and time to the first data frame.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
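Use set_index on both frames and add them together. Names that appear in only one frame come out as NaN in the sum, and update skips NaN values, so only the matching rows of df1 change. Finally, reset the index.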
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume the time columns in both df1 and df2 are already timedeltas. If they are strings, you need to convert them before running the commands above, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
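An alternative sketch that keeps count as an integer column (the update approach above turns it into floats such as 133.0) looks up each df1 name in df2 with Series.map. It assumes names are unique in df2, and the helper name lookup is an assumption for illustration. The data is the three visible rows from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob", "Alice", "Sergei"],
                    "count": [123, 99, 78],
                    "time": ["4:12:10", "1:01:12", "0:18:01"]})
df2 = pd.DataFrame({"Name": ["Rick", "Jone", "Bob"],
                    "count": [9, 7, 10],
                    "time": ["0:13:00", "0:24:21", "0:15:13"]})

df1["time"] = pd.to_timedelta(df1["time"])
df2["time"] = pd.to_timedelta(df2["time"])

# Assumes each name occurs at most once in df2.
lookup = df2.set_index("Name")

# Names missing from df2 map to NaN/NaT and are replaced by a zero increment.
df1["count"] = df1["count"] + df1["Name"].map(lookup["count"]).fillna(0).astype(int)
df1["time"] = df1["time"] + df1["Name"].map(lookup["time"]).fillna(pd.Timedelta(0))
print(df1)
```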

compare 2 data frames based on 3 columns and update second data

Here is what my data frames look like. I need to compare them: if df1.mid == df2.mid and df1.name == df2.name and df1.pid != df2.pid, then update df2.pid with df1.pid.
df1
mid pid name
1 2 John
2 14 Peter
3 16 Emma
4 20 Adam
df2
mid pid name
1 2 John
2 16 Peter
3 16 Emma
expected result in df2 after update
mid pid name
1 2 John
2 14 Peter
3 16 Emma
A merge is what you want, but there are some subtleties to take into account:
df2.merge(df1, on=['mid', 'name'], how='left', suffixes=('_2', '_1')) \
.assign(pid=lambda x: x['pid_1'].combine_first(x['pid_2'])) \
.drop(columns=['pid_1', 'pid_2'])
merge aligns df1 and df2 based on mid and name. The two pid columns are renamed pid_1 and pid_2.
assign creates a new pid column by combining the two previous pids: if pid_1 is available, use that, if not, keep the original pid_2
drop drops pid_1 and pid_2, leaving one and only one pid column
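The merge chain above can be checked end to end against the question's sample data. In this sketch the name updated and the final column reordering are assumptions added for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"mid": [1, 2, 3, 4],
                    "pid": [2, 14, 16, 20],
                    "name": ["John", "Peter", "Emma", "Adam"]})
df2 = pd.DataFrame({"mid": [1, 2, 3],
                    "pid": [2, 16, 16],
                    "name": ["John", "Peter", "Emma"]})

updated = (
    df2.merge(df1, on=["mid", "name"], how="left", suffixes=("_2", "_1"))
       # Prefer df1's pid; fall back to df2's own pid when df1 has no match.
       .assign(pid=lambda x: x["pid_1"].combine_first(x["pid_2"]))
       .drop(columns=["pid_1", "pid_2"])
)

# Restore the original column order of df2.
updated = updated[["mid", "pid", "name"]]
print(updated)
```

Adam (mid 4) exists only in df1, so the left merge from df2 leaves him out, matching the expected result.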
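Alternatively, assuming every (mid, name) pair in df2 also appears in df1, you can take pid straight from df1 with a right merge on mid and name:
df3 = df1.merge(df2[['mid', 'name']], on=['mid', 'name'], how='right')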