Create a new column in Pandas dataframe by arbitrary function over rows - pandas

I have a Pandas dataframe. I want to add a column to the dataframe, where the value in the new column is dependent on other values in the row.
What is an efficient way to go about this?
Example
Begin
Start with this dataframe (let's call it df), and a dictionary of people's roles.
  first     last
--------------------------
0 Jon       McSmith
1 Jennifer  Foobar
2 Dan       Raizman
3 Alden     Lowe
role_dict = {
    "Raizman": "sales",
    "McSmith": "analyst",
    "Foobar": "analyst",
    "Lowe": "designer"
}
End
We end up with a dataframe where, for each row, the value in last has been used to look up the person's role in role_dict, and that role has been added to the row as a new role column.
  first     last      role
--------------------------------------
0 Jon       McSmith   analyst
1 Jennifer  Foobar    analyst
2 Dan       Raizman   sales
3 Alden     Lowe      designer

One solution is to use the Series map method, since the roles are stored in a dictionary:
df['role'] = df.loc[:, 'last'].map(role_dict)
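As a self-contained sketch (the dataframe construction below is recreated from the example above), map returns NaN for last names missing from the dictionary, which you can backfill with fillna; the "unknown" fallback here is my own choice, not part of the question:

```python
import pandas as pd

df = pd.DataFrame(
    {"first": ["Jon", "Jennifer", "Dan", "Alden"],
     "last": ["McSmith", "Foobar", "Raizman", "Lowe"]}
)
role_dict = {"Raizman": "sales", "McSmith": "analyst",
             "Foobar": "analyst", "Lowe": "designer"}

# map looks up each 'last' value in role_dict; unmatched keys become NaN
df["role"] = df["last"].map(role_dict)

# Optionally backfill names that were missing from the dictionary
df["role"] = df["role"].fillna("unknown")
print(df)
```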

You can also try this using merge:
import pandas as pd

df = pd.DataFrame([["Jon", "McSmith"],
                   ["Jennifer", "Foobar"],
                   ["Dan", "Raizman"],
                   ["Alden", "Lowe"]], columns=["first", "last"])
role_dict = {
    "Raizman": "sales",
    "McSmith": "analyst",
    "Foobar": "analyst",
    "Lowe": "designer"
}
# Turn the dictionary into a two-column dataframe, then left-join on 'last'
df_2 = pd.DataFrame(role_dict.items(), columns=["last", "role"])
result = pd.merge(df, df_2, on=["last"], how="left")
Output:
first last role
0 Jon McSmith analyst
1 Jennifer Foobar analyst
2 Dan Raizman sales
3 Alden Lowe designer
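Since the question title asks about an arbitrary function over rows: when the lookup is not a simple dictionary, apply with axis=1 is the general (if slower) tool. A minimal sketch, with the "unknown" fallback being my own assumption:

```python
import pandas as pd

df = pd.DataFrame({"first": ["Jon", "Dan"],
                   "last": ["McSmith", "Raizman"]})
role_dict = {"McSmith": "analyst", "Raizman": "sales"}

# apply calls the function once per row (axis=1); each row arrives as a
# Series, so arbitrary multi-column logic is possible, at the cost of speed
df["role"] = df.apply(lambda row: role_dict.get(row["last"], "unknown"),
                      axis=1)
```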

Related

plotly - remove or ignore "Non-leaves rows" for sunburst diagram

I have a DataFrame with some "Non-leaves rows" in it. Is there any way to get plotly to ignore them, or a way to automatically remove them?
Here's a sample DataFrame:
       0      1      2      3
0  Alice    Bob    NaN    NaN
1  Alice    Bob  Carol  David
2  Alice    Bob  Chuck  Delia
3  Alice    Bob  Chuck   Ella
4  Alice    Bob  Frank    NaN
In this case, I get the error Non-leaves rows are not permitted in the dataframe because the 0th row is not a distinct leaf.
I've tried using df = df.replace(np.NaN, pd.NA).where(df.notnull(), None) to add the None to the empty values, but the error persists.
Is there any way to have the non-leaves ignored? If not, is there a simple way to prune them? My real dataset is several thousand rows.
One approach is to reshape your dataframe into unique parent-child relations:
import pandas as pd
import plotly.express as px

cols = df.columns
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: 'parents', j: 'childs'})
         for i, j in zip(cols[:-1], cols[1:])])
    .drop_duplicates()
)
fig = px.sunburst(data, names='childs', parents='parents')
fig.show()
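To see what the reshaping produces without plotly, here is a runnable sketch of just the pandas step, using a cut-down version of the sample data; the dropna call is my addition, to discard pairs whose child slot was empty:

```python
import pandas as pd

# Sample hierarchy where rows stop at different depths (None = missing)
df = pd.DataFrame(
    [["Alice", "Bob", None, None],
     ["Alice", "Bob", "Carol", "David"],
     ["Alice", "Bob", "Chuck", "Delia"]],
    columns=[0, 1, 2, 3],
)

cols = df.columns
data = (
    pd.concat(
        [df[[i, j]].rename(columns={i: "parents", j: "childs"})
         for i, j in zip(cols[:-1], cols[1:])]
    )
    .drop_duplicates()
    .dropna(subset=["childs"])  # prune pairs with no child value
)
print(data)
```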

Create new column based on date column Pandas

I am trying to create a new column of students' grade levels based on their DOB. The cut-off dates for 1st grade would be 2014/09/02 - 2015/09/01. Is there a simple solution for this besides a long if/elif chain? Thanks.
Name    DOB
Sally   2011/06/20
Mike    2009/02/19
Kevin   2012/12/22
You can use pd.cut(), which also supports custom bins.
import pandas as pd

dob = {
    'Sally': '2011/06/20',
    'Mike': '2009/02/19',
    'Kevin': '2012/12/22',
    'Ron': '2009/09/01',
}
# pd.to_datetime parses the date strings (astype('datetime64') without a
# unit no longer works in recent pandas)
dob = pd.to_datetime(pd.Series(dob)).rename("DOB").to_frame()
# Bin edges: each school year runs from Sep 2 of one year to Sep 1 of the next
grades = pd.to_datetime(pd.Series([
    '2008/9/1',
    '2009/9/1',
    '2010/9/1',
    '2011/9/1',
    '2012/9/1',
    '2013/9/1',
]))
dob['grade'] = pd.cut(dob['DOB'], grades, labels=[5, 4, 3, 2, 1])
print(dob.sort_values('DOB'))
DOB grade
Mike 2009-02-19 5
Ron 2009-09-01 5
Sally 2011-06-20 3
Kevin 2012-12-22 1
I sorted the dataframe by date of birth to show that the oldest students are in the highest grades.

How to get a set of values from a Pandas dataframe?

I have a Pandas dataframe whose rows and columns share the same labels, which are names of people: Alex, Bob, Cynthia. Each cell holds the number of times the two people have met, and -1 marks a diagonal cell.
         Alex  Bob  Cynthia
Alex       -1    2        3
Bob         1   -1        2
Cynthia     2    2       -1
Is there any elegant way to get a set of numeric values that are in the cells? So, for this table I want values = {1, 2, 3}. So far I can think of only iterating over the whole table in a nested-loop fashion and putting everything in a set.
Is there any other way of getting this set?
You can try:
set(df.values.flatten())
And to exclude -1:
set(df.values.flatten()).difference([-1])
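An equivalent sketch using NumPy's unique, which flattens and deduplicates in one step (the dataframe here is rebuilt from the table above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[-1, 2, 3], [1, -1, 2], [2, 2, -1]],
                  index=["Alex", "Bob", "Cynthia"],
                  columns=["Alex", "Bob", "Cynthia"])

# np.unique flattens the 2-D array and returns the sorted distinct values
values = set(np.unique(df.to_numpy())) - {-1}
print(values)
```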

Pandas Merge function only giving column headers - Update

What I want to achieve.
I have two data frames. DF1 and DF2. Both are being read from different excel file.
DF1 has 9 columns and 3000 rows, one of which is named "Code Group".
DF2 has 2 columns and 20 rows, one of which is also named "Code Group". In this same dataframe, another column, "Code Management Method", gives the explanation of the code group. For example, H001 is treated as recyclable, H002 is treated as landfill.
What happens
When I use the command data = pd.merge(DF1,DF2, on='Code Group') I only get 10 column names but no rows underneath.
What I expect
I want DF1 and DF2 merged so that, wherever the Code Group values match, the Code Management Method is added as an explanation.
Additional information
Following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestion from similar posts on SO that the datatypes on both sides should be the same; Code Group is an object in both frames, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

CH = "C:\Python\Waste\Shipment.xls"
Code = "C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name': 'Entity',
                      'generator_address': 'Address',
                      'Site_City': 'Site',
                      'final_disposal_facility_name': 'Disposal Facility',
                      'drum_wgt': 'Pounds',
                      'wst_dscrpn': 'Waste Description',
                      'genrtr_sgntr_dt': 'Shipment Date',
                      'generator_state': 'State',
                      'expected_disposal_management_methodcode': 'Code Group'},
             inplace=True)
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd

df1 = pd.DataFrame({'Region': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})
df1.merge(df2, on='Region')
This is how the example is set up; the two input dataframes are:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and the merge results in:
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do
After multiple tries, I found that the column had some garbage in it, so I used the code below and it worked perfectly. The funny thing is I never encountered this problem on two other datasets that I imported from Excel files.
data2['Code'] = data2['Code'].str.strip()
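A minimal sketch of this failure mode (the column values here are invented for illustration): trailing whitespace in the join key makes every key miss, so the merge returns only the headers, and stripping restores the matches:

```python
import pandas as pd

df1 = pd.DataFrame({"Code Group": ["H001 ", "H002 "],  # note trailing spaces
                    "Pounds": [10.5, 3.2]})
df2 = pd.DataFrame({"Code Group": ["H001", "H002"],
                    "Code Management Method": ["recycled", "landfill"]})

# With the stray whitespace, no keys match: the result has columns but no rows
broken = df1.merge(df2, on="Code Group")

# Stripping the key column restores the expected matches
df1["Code Group"] = df1["Code Group"].str.strip()
fixed = df1.merge(df2, on="Code Group")
print(fixed)
```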

Why is column name not going over actual column and creating new columns in dataframe?

I am assigning column names to a dataframe in pandas, but the column names are creating new columns instead of labeling the existing ones. How do I get around this issue?
What dataframe looks like now:
abs_subdv_cd abs_subdv_desc
0 A0001A ASHTON ... NaN
1 A0002A J. AYERS ... NaN
2 A0003A NEWTON ALLSUP ... NaN
3 A0004A M. AUSTIN ... NaN
4 A0005A RICHARD W. ALLEN ... NaN
What I want dataframe look like:
abs_subdv_cd abs_subdv_desc
0 A0001A ASHTON
1 A0002A J. AYERS
2 A0003A NEWTON ALLSUP
3 A0004A M. AUSTIN
4 A0005A RICHARD W. ALLEN
code so far:
import pandas as pd

### Declaring path ###
path = ('file_path')
### Calling file in folder ###
appraisal_abstract_subdv = pd.read_table(path + '/2015-07-28_003820_APPRAISAL_ABSTRACT_SUBDV.txt',
                                         encoding='iso-8859-1', error_bad_lines=False,
                                         names=['abs_subdv_cd', 'abs_subdv_desc'])
print(appraisal_abstract_subdv.head())
Edit:
When I check appraisal_abstract_subdv.shape, the dataframe shows shape (4000, 1), whereas the data has two columns.
This is an example of the data I am using:
A0001A ASHTON
A0002A J. AYERS
Thank you in advance.
It looks like your data file uses a different delimiter (not a TAB, which is the default separator for pd.read_table()), so try the sep='\s+' or delim_whitespace=True parameter.
In order to check your columns after reading your data file do the following:
print(df.columns.tolist())
You can inspect the current column names with
appraisal_abstract_subdv.columns.values
and then use pandas' rename method to rename them appropriately:
df.rename(columns={'OldColumn1': 'Newcolumn1', 'OldColumn2': 'Newcolumn2'}, inplace=True)
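A self-contained sketch of diagnosing this with an inline sample (tab-delimited here as an assumption; substitute whatever separator your file actually uses):

```python
import io
import pandas as pd

# Inline stand-in for the .txt file from the question
raw = "A0001A\tASHTON\nA0002A\tJ. AYERS\n"

# An explicit sep plus names assigns both columns correctly;
# with the wrong sep, everything lands in one column (shape (n, 1))
df = pd.read_csv(io.StringIO(raw), sep="\t",
                 names=["abs_subdv_cd", "abs_subdv_desc"])

print(df.shape)             # confirm two columns were parsed
print(df.columns.tolist())  # confirm the assigned names
```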