Update a DF in Pandas based on a condition from a different DF

I have df1:
index  state  fin_salary  new_title
5      CA     1           Data Scientist
8      CA     1           Data Scientist
35     CA     150000      Deep Learning Engineer
36     CA     1           Data Analyst
39     CA     1           Data Engineer
43     CA     1           Data Scientist
56     CA     1           Data Scientist
And another dataframe, df2:
state  new_title                         fin_salary
CA     Artificial Intelligence Engineer  207500.0
CA     Data Analyst                      64729.0
CA     Data Engineer                     146000.0
CA     Data Scientist                    129092.75
CA     Deep Learning Engineer            162500.0
CA     Machine Learning Engineer         133120.0
CA     Python Developer                  96797.0
I want to update fin_salary in df1 with the value from df2, matching on state and new_title, but only where fin_salary == 1. So my desired output should be:
index  state  fin_salary  new_title
5      CA     129092.75   Data Scientist
8      CA     129092.75   Data Scientist
35     CA     150000      Deep Learning Engineer
36     CA     64729.0     Data Analyst
39     CA     146000.0    Data Engineer
43     CA     129092.75   Data Scientist
56     CA     129092.75   Data Scientist

You can merge the two frames and then replace the placeholder salaries. Note that both frames have a fin_salary column, so the merge needs suffixes to keep them apart:
DF = df1.merge(df2, on=['state', 'new_title'], how='left', suffixes=('', '_avg'))
DF['fin_salary'] = DF['fin_salary'].mask(DF['fin_salary'] == 1, DF['fin_salary_avg'])
df_final = DF.drop(columns='fin_salary_avg')
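Alternatively (my own sketch, not from the original answer; it assumes df1 and df2 exactly as shown above), you can build a (state, new_title) -> salary lookup from df2 and fill only the flagged rows in place, which avoids the suffix handling:
lookup = df2.set_index(['state', 'new_title'])['fin_salary']
mask = df1['fin_salary'] == 1
# join aligns the masked rows against the lookup's MultiIndex
df1.loc[mask, 'fin_salary'] = (
    df1.loc[mask, ['state', 'new_title']]
       .join(lookup, on=['state', 'new_title'])['fin_salary']
)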

Related

How to distribute clients exactly between employees of the same department?

I have 2 dataframes: one with clients and the departments they belong to, and one with departments and the managers located there. A department can have any number of clients (from 1 to 1000, say), and the number of departments and managers can also change every day. I need to join the dataframes so that there is one manager per client and the clients are split evenly between the managers of a department (if the split is not exact, the counts may differ by 1). For example, if a department has 5 clients and 2 managers, one manager gets 3 clients and the other gets 2. If there are more managers than clients (say 2 clients but 3 managers), assign them arbitrarily.
df_all = df_c.merge(df_d, how='left', on='dep')
When I do a regular merge, the data is duplicated.
df_all_new = df_all.groupby(['cli']).head(1)
I also tried groupby with head(1), but in this case each department kept only its first manager and the rest were lost.
First df, with clients and deps:
client  dep
001     34
002     34
003     34
004     34
005     34
006     42
007     42
Second df, with deps and managers:
dep  manager
34   847
34   147
42   472
42   198
42   374
And the result should be:
client  dep  manager
001     34   847
002     34   847
003     34   847
004     34   147
005     34   147
006     42   472
007     42   198
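One way to get this block-wise, near-equal split (my own sketch, not an answer from the thread; it assumes the df_c and df_d frames shown above, with rows already in the desired order):
import pandas as pd

df_c = pd.DataFrame({'client': ['001', '002', '003', '004', '005', '006', '007'],
                     'dep':    [34, 34, 34, 34, 34, 42, 42]})
df_d = pd.DataFrame({'dep':     [34, 34, 42, 42, 42],
                     'manager': [847, 147, 472, 198, 374]})

# Position of each client / manager within its department.
df_c['pos'] = df_c.groupby('dep').cumcount()
df_d['slot'] = df_d.groupby('dep').cumcount()

# Department sizes, used to cut the clients into contiguous blocks.
df_c['n_cli'] = df_c.groupby('dep')['client'].transform('size')
df_c = df_c.join(df_d.groupby('dep').size().rename('n_mgr'), on='dep')

# pos * n_mgr // n_cli maps clients to manager slots so that block sizes
# differ by at most 1; surplus managers simply get no rows in the merge.
df_c['slot'] = df_c['pos'] * df_c['n_mgr'] // df_c['n_cli']
result = df_c.merge(df_d, on=['dep', 'slot'])[['client', 'dep', 'manager']]
print(result)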

Find maximum value of each group within a Pandas Frame

I have a question and hope you can give me some support. I looked into the archive here and found a solution, but it takes a lot of time and is not "beautiful", since it works with loops.
Suppose you have the following frame:
System  Country_Key  Name      Bank_number_length  Check rule for bank acct no.
PEM     AD           Andorra   8                   2
PL1     AD           Andorra   15                  5
PPE     AD           Andorra   14                  5
P11     AD           Andorra   9                   5
P16     AD           Andorra   12                  4
PEM     AE           Emirates  3                   5
PL1     AE           Emirates  15                  4
PPE     AE           Emirates  15                  5
P11     AE           Emirates  15                  6
P16     AE           Emirates  13                  5
I found the following approach for two columns: "Get the max value from each group with pandas.DataFrame.groupby". However, in my case I really have many columns, and I need to set the index to the first three columns "System", "Country_Key" and "Name".
My desired output would be the following:
System  Country_Key  Name      Bank_number_length  Check rule for bank acct no.
PEM     AD           Andorra
PL1                            15                  5
PPE                                                5
P11                                                5
P16
PEM     AE           Emirates
PL1                            15
PPE                            15
P11                            15                  6
P16
So effectively, all values except each group's maximum are dropped. Any kind of hint would be really beneficial.
You can try masking the values that are not the group maximum with an empty string, and likewise masking the duplicated key values with an empty string:
keys = ['Country_Key', 'Name']
cols = ['Bank_number_length', 'Check rule for bank acct no.']
df[cols] = df[cols].mask(df[cols].ne(df.groupby(keys)[cols].transform('max')), '')
df.loc[df.duplicated(keys), keys] = ''
print(df)
   System  Country_Key  Name      Bank_number_length  Check rule for bank acct no.
0  PEM     AD           Andorra
1  PL1                            15                  5
2  PPE                                                5
3  P11                                                5
4  P16
5  PEM     AE           Emirates
6  PL1                            15
7  PPE                            15
8  P11                            15                  6
9  P16
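A side note (my addition, not from the original answer): masking with '' turns the numeric columns into object dtype. If you want to keep them numeric, mask to NaN instead with where, which keeps values only where the condition holds:
df[cols] = df[cols].where(df[cols].eq(df.groupby(keys)[cols].transform('max')))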

Pandas loc() function

I am trying to slice items from a CSV. Here is an example:
import pandas
df1 = pandas.read_csv("supermarkets.csv")
df1
   ID  Address          City           State             Country  Name         Employees
0  1   3666 21st St     San Francisco  CA 94114          USA      Madeira      8
1  2   735 Dolores St   San Francisco  CA 94119          USA      Bready Shop  15
2  3   332 Hill St      San Francisco  California 94114  USA      Super River  25
3  4   3995 23rd St     San Francisco  CA 94114          USA      Ben's Shop   10
4  5   1056 Sanchez St  San Francisco  California        USA      Sanchez      12
5  6   551 Alvarado St  San Francisco  CA 94114          USA      Richvalley   20
df2 = df1.loc["735 Dolores St":"332 Hill St","City":"Country"]
df2
In the output I only get the header row:
City State Country
How do I correct this?
As you can read in the pandas documentation, .loc[] can access a group of rows and columns by label(s) or a boolean array.
You cannot select rows directly by the values stored in a column.
In your example, df1.loc["735 Dolores St":"332 Hill St", "City":"Country"] returns an empty selection because only "City":"Country" is a valid accessor; "735 Dolores St":"332 Hill St" selects no rows, as those strings are not labels on the index.
If you want specific rows by position, use df1.iloc[[1, 2], 2:5] (.iloc takes integer positions for both rows and columns, not labels); with the default integer index, df1.loc[1:2, 'City':'Country'] works too.
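For instance (my own sketch, assuming the df1 from the question), a boolean mask lets you select rows by the Address values without touching the index:
mask = df1['Address'].isin(['735 Dolores St', '332 Hill St'])
df2 = df1.loc[mask, 'City':'Country']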
df.loc is primarily label-based and commonly slices rows by index label. In this case, you can use the numeric index, or set Address as the index:
print(df)
   ID  Address      City           State     Country  Name         Employees
0  1   3666 21st    San Francisco  CA 94114  USA      Madeira      8
1  2   735 Dolores  San Francisco  CA 94114  USA      Bready Shop  15
2  3   332 Hill     San Francisco  CA 94114  USA      Super River  25
df2 = df.loc[1:2, 'City':'Country']
print(df2)
   City           State     Country
1  San Francisco  CA 94114  USA
2  San Francisco  CA 94114  USA
Or:
df2 = df.set_index('Address').loc['735 Dolores':'332 Hill', 'City':'Country']
print(df2)
             City           State     Country
Address
735 Dolores  San Francisco  CA 94114  USA
332 Hill     San Francisco  CA 94114  USA

How do I drop a row from this data frame?

   ID  Address          City           State       Country  Name         Employees
0  1   3666 21st St     San Francisco  CA 94114    USA      Madeira      8
1  2   735 Dolores St   San Francisco  CA 94119    USA      Bready Shop  15
2  3   332 Hill St      San Francisco  Cal         USA      Super River  25
3  4   3995 23rd St     San Francisco  CA 94114    USA      Ben's Shop   10
4  5   1056 Sanchez St  San Francisco  California  USA      Sanchez      12
5  6   551 Alvarado St  San Francisco  CA 94114    USA      Richvalley   20
df=df.drop(['3666 21st St'], axis=1, inplace=True)
I am using this code, and it still shows an error stating:
KeyError: "['3666 21st St'] not found in axis"
Can anyone help me solve this?
The drop method only works on the index or column names. There are 2 ways to do what you want:
1. Make the Address column the index, then drop the value(s) you want to drop. You should use axis=0 for this, not axis=1 (axis=0 is the default). Do not use inplace=True if you are assigning the output.
2. Use a Boolean filter instead of drop.
The 1st method is preferred if the Address values are all distinct. The index of a data frame is effectively a sequence of row labels, so it doesn't make much sense to have duplicate row labels:
df.set_index('Address', inplace=True)
df.drop(['3666 21st St'], inplace=True)
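Equivalently, without inplace= (my own restatement of the same method):
df = df.set_index('Address').drop(['3666 21st St'])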
The 2nd method is therefore preferred if the Address column is not distinct:
is_bad_address = df['Address'] == '3666 21st St'
# Alternative if you have multiple bad addresses:
# is_bad_address = df['Address'].isin(['3666 21st St'])
df = df.loc[~is_bad_address]
You need to consult the Pandas documentation for the correct usage of the axis= and inplace= keyword arguments. You are using both of them incorrectly. DO NOT COPY AND PASTE CODE WITHOUT UNDERSTANDING HOW IT WORKS.

Object Model to identify customers with similar address

Let's say I have a table of customers, and each customer has an address. My task is to design an object model that allows grouping the customers by similar address. Example:
John 123 Main St, #A; Los Angeles, CA 90032
Jane 92 N. Portland Ave, #1; Pasadena, CA 91107
Peter 92 N. Portland Avenue, #2; Pasadena, CA 91107
Lester 92 N Portland Av #4; Pasadena, CA 91107
Mark 123 Main Street, #C; Los Angeles, CA 90032
The query should somehow return:
1  Similar_Address_Key1
5  Similar_Address_Key1
2  Similar_Address_Key2
3  Similar_Address_Key2
4  Similar_Address_Key2
What is the best way to accomplish this? Notice the addresses are NOT consistent (some addresses have "Avenue" while others have "Av", and the apartment numbers differ). The existing name/address data cannot be corrected, so doing a GROUP BY (Address) on the table itself is out of the question.
I was thinking of adding a SIMILAR_ADDRESSES table that takes an address, evaluates it, and gives it a key, something like:
cust_key  address                                        similar_addr_key
1         123 Main St, #A; Los Angeles, CA 90032         1
2         92 N. Portland Ave, #1; Pasadena, CA 91107     2
3         92 N. Portland Avenue, #2; Pasadena, CA 91107  2
4         92 N. Portland Av #4; Pasadena, CA 91107       2
5         123 Main Street, #C; Los Angeles, CA 90032     1
Then I would group by the similar address key. The question is how best to accomplish the "evaluation" part. One way would be to normalize the addresses in the SIMILAR_ADDRESSES table so that they are consistent, ignoring things like apt, #, or suite, and assign a "key" to each exact match. A different approach I thought about was to feed each address to a geolocation service, save the latitude/longitude values to a table, and use those values to generate a similar address key.
Any ideas?
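One way to prototype the normalization route (my own sketch; the suffix map and regexes are illustrative and deliberately incomplete, and a production system would usually lean on a dedicated address-standardization library or service):
import re

# Canonical forms for common street-suffix variants (deliberately incomplete).
SUFFIX = {'street': 'st', 'avenue': 'ave', 'av': 'ave', 'boulevard': 'blvd'}

def address_key(addr):
    """Reduce an address to a normalized key for grouping."""
    a = addr.lower()
    a = re.sub(r'#\s*\w+', ' ', a)   # drop unit markers like "#A" or "#4"
    a = re.sub(r'[.,;]', ' ', a)     # drop punctuation
    return ' '.join(SUFFIX.get(t, t) for t in a.split())

customers = {
    1: '123 Main St, #A; Los Angeles, CA 90032',
    2: '92 N. Portland Ave, #1; Pasadena, CA 91107',
    3: '92 N. Portland Avenue, #2; Pasadena, CA 91107',
    4: '92 N Portland Av #4; Pasadena, CA 91107',
    5: '123 Main Street, #C; Los Angeles, CA 90032',
}
keys = {}
for cust, addr in customers.items():
    # setdefault assigns the next key number the first time a normalized
    # address is seen, and reuses it afterwards.
    k = keys.setdefault(address_key(addr), len(keys) + 1)
    print(cust, 'Similar_Address_Key%d' % k)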