How to get the unique values - pandas

I have this df:
data = pd.read_csv('attacks.csv', encoding="latin-1")
new_data = data.loc[:,'Name':'Investigator or Source']
new_data.head(5)
Name Sex Age Injury Fatal (Y/N) Time Species Investigator or Source
0 Julie Wolfe F 57 No injury to occupant, outrigger canoe and pad... N 18h00 White shark R. Collier, GSAF
1 Adyson McNeely F 11 Minor injury to left thigh N 14h00 -15h00 NaN K.McMurray, TrackingSharks.com
2 John Denges M 48 Injury to left lower leg from surfboard skeg N 07h45 NaN K.McMurray, TrackingSharks.com
3 male M NaN Minor injury to lower leg N NaN 2 m shark B. Myatt, GSAF
4 Gustavo Ramos M NaN Lacerations to leg & hand shark PROVOKED INCIDENT N NaN Tiger shark, 3m A .Kipper
How can I get the unique values of the 'Species' column?
I'm trying with:
new_data["Species"].unique()
But it does not work.
Thank you!

You can also try:
uniqueSpecies = set(new_data["Species"])
In case you want to drop NaN:
uniqueSpecies = set(new_data["Species"].dropna())
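If new_data["Species"].unique() itself fails, a likely culprit is the column name not matching exactly (check new_data.columns for stray spaces in the header); this is a guess, since the post doesn't show the error. Staying with pandas methods, a minimal sketch:
unique_species = new_data["Species"].dropna().unique()  # unique() would otherwise include NaN
print(unique_species)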

Related

How to transpose a part of a column as rows in pandas

I have data on some patients in a configuration that can't be used for data analysis.
We have a number of patients, and each patient has multiple visits to our clinic. Therefore, our data contains one row per visit, and since every patient has multiple visits, a single patient spans multiple rows. I would like to have just one row per patient, with a separate set of variables for each visit.
For example (illustrated with screenshots in the original post): the input has multiple visit rows per patient, and I want to change it to a wide format with one row per patient.
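For reference, a small input frame consistent with the output printed in the answer below (a reconstruction, since the screenshots are unavailable; the column names AA, AB, AC, AE are taken from that output):
import pandas as pd

# One row per visit; Mr Lee (ID 1) has two visits, Mr James (ID 2) has one
df = pd.DataFrame({
    'Family Name': ['Mr Lee', 'Mr Lee', 'Mr James'],
    'ID': [1, 1, 2],
    'Sex': ['male', 'male', 'male'],
    'age': [62, 62, 64],
    'AA': [1, 5, 9],
    'AB': [2, 6, 10],
    'AC': [3, 7, 11],
    'AE': [4, 8, 12],
})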
You have to pivot your dataframe. groupby('ID').cumcount() numbers each patient's visits 1, 2, ..., and those numbers become the suffixes on the value columns (AA1, AA2, ...):
out = (df.pivot_table(index=['Family Name', 'ID', 'Sex', 'age'],
                      columns=df.groupby('ID').cumcount().add(1).astype(str))
         .reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)
# Output
Family Name ID Sex age AA1 AA2 AB1 AB2 AC1 AC2 AE1 AE2
0 Mr James 2 male 64 9.0 NaN 10.0 NaN 11.0 NaN 12.0 NaN
1 Mr Lee 1 male 62 1.0 5.0 2.0 6.0 3.0 7.0 4.0 8.0

A loop to fill null values in a column using the mode value is breaking. Still no working solution

The following is a sample from my data frame:
import pandas as pd
import numpy as np

d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'SPORTS', 'nan',
     'SEDAN', 'SEDAN', 'SEDAN', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.nan)
df.head()
Out[24]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
There are some null values in the 'body' column
df.body.isnull().sum()
Out[25]: 3
So I am trying to fill the null values in the body column using the mode of the body type for each make_model. For instance, 2 observations of SKODASUPERB have body 'SUV' and 1 observation has a null body, so the mode of body for SKODASUPERB is 'SUV' and I want 'SUV' to be filled in for the third observation too. For this I am using the following code:
make_model_list = df.make_model.unique().tolist()
for x in make_model_list:
    df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
        .fillna(df.loc[df['make_model'] == x, 'body'].mode())
Unfortunately, the loop fills nothing, as some observations don't have a mode value:
df.body.isnull().sum()
Out[30]: 3
How can I force the loop to run even if there is no mode body value for a particular make_model? I know that I can use a continue statement, but I am not sure how to write it.
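A side note (not from the original post) on why the loop silently fills nothing even where a mode exists: Series.fillna() with a Series argument aligns on index labels, and Series.mode() returns a fresh Series labelled 0, 1, ..., which rarely lines up with the rows being filled. A minimal sketch:
import pandas as pd
import numpy as np

row = pd.Series([np.nan], index=[2])  # the NaN cell sits at pandas label 2
mode = pd.Series(['SUV'])             # Series.mode() result is labelled from 0
print(row.fillna(mode))               # still NaN: label 2 never matches label 0
print(row.fillna(mode.iat[0]))        # 'SUV': a scalar fill ignores labels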
Assuming that each make_model maps to a single body value:
donor = df.dropna().groupby(by=['make_model']).agg(pd.Series.mode).reset_index()
df = df.merge(donor, how='left', on=['make_model'])
df['body_x'].fillna(df.body_y, inplace=True)
df.drop(columns=['body_y'], inplace=True)
df.columns = ['make_model', 'body']
df
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB SUV
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE SPORTS
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS SEDAN
11 TOYOTAAVENSIS SEDAN
12 FERRARI360 SPORT
13 FERRARILAFERRARI SPORT
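One caveat worth noting (not from the original answer): .agg(pd.Series.mode) returns an array rather than a scalar for groups with tied modes, which makes the subsequent fillna misbehave. A sketch of a donor that always reduces each group to one scalar, assuming the same df:
donor = (df.dropna()
           .groupby('make_model')['body']
           .agg(lambda s: s.mode().iat[0])  # first mode, always a scalar
           .reset_index())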
Finally, I have worked out a solution. It was just a matter of putting in a try/except: value_counts().index[0] is a scalar, so fillna() actually fills it in, and the except clause skips models whose body values are all missing. This solution works perfectly for the purposes of my project and has filled 95% of the missing values. I have slightly changed the data to show that this method is effective:
d = ['SKODASUPERB', 'SKODASUPERB', 'SKODASUPERB', 'MERCEDES-BENZE CLASS',
     'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE', 'ASTON MARTINVIRAGE',
     'TOYOTAHIACE', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS', 'TOYOTAAVENSIS',
     'TOYOTAAVENSIS', 'FERRARI360', 'FERRARILAFERRARI']
x = ['SUV', 'SUV', 'nan', 'nan', 'SPORTS', 'SPORTS', 'nan', 'nan',
     'SEDAN', 'SEDAN', 'nan', 'SEDAN', 'SPORT', 'SPORT']
df = pd.DataFrame({'make_model': d, 'body': x})
df.body = df.body.replace('nan', np.nan)
df
Out[6]:
make_model body
0 SKODASUPERB SUV
1 SKODASUPERB SUV
2 SKODASUPERB NaN
3 MERCEDES-BENZE CLASS NaN
4 ASTON MARTINVIRAGE SPORTS
5 ASTON MARTINVIRAGE SPORTS
6 ASTON MARTINVIRAGE NaN
7 TOYOTAHIACE NaN
8 TOYOTAAVENSIS SEDAN
9 TOYOTAAVENSIS SEDAN
10 TOYOTAAVENSIS NaN
11 TOYOTAAVENSIS SEDAN
df.body.isnull().sum()
Out[7]: 5
My solution:
for x in make_model_list:
    try:
        df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] = \
            df.loc[(df['make_model'] == x) & (df['body'].isnull()), 'body'] \
            .fillna(df.loc[df['make_model'] == x, 'body'].value_counts().index[0])
    except IndexError:
        # value_counts() is empty when every body for this model is NaN
        pass
df.body.isnull().sum()
Out[9]: 2 #null values have dropped from 5 to 2.
Those 2 null values couldn't be filled because every body value for their make_model was missing, so there was no most-frequent (mode) value at all.
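For what it's worth, the same idea can also be written without an explicit loop using groupby().transform(); a sketch under the same df as above:
# Fill each group's NaNs with that group's most frequent value;
# groups with no observed values at all are passed through unchanged.
df['body'] = df.groupby('make_model')['body'].transform(
    lambda s: s.fillna(s.mode().iat[0]) if s.notna().any() else s
)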

Passing values from one data frame columns to another data frame in Pandas

I have a couple of data frames. I want to use two columns from the first data frame to mark the rows that are present in the second data frame.
The first data frame (df1) looks like this:
Sup4 Seats Primary Seats Back up Seats
Pa 3 2 1
Ka 2 1 1
Ga 1 0 1
Gee 1 1 0
Re 2 2 0
The second data frame (df2) looks like this:
Sup4 First Last Primary Seats Backup Seats Rating
Pa Peter He NaN NaN 2.3
Ka Sonia Du NaN NaN 2.99
Ga Agnes Bla NaN NaN 3.24
Gee Jeffery Rus NaN NaN 3.5
Gee John Cro NaN NaN 1.3
Pa Pavol Rac NaN NaN 1.99
Pa Ciara Lee NaN NaN 1.88
Re David Wool NaN NaN 2.34
Re Stefan Rot NaN NaN 2
Re Franc Bor NaN NaN 1.34
Ka Tania Le NaN NaN 2.35
The output I require is grouped by Sup4 name and sorted by Rating from highest to lowest, with the seat columns marked according to the df1 columns Primary Seats and Back up Seats.
I did the grouping and sorting for the first Sup4 name, Pa, as a sample; I have to do it for all the names:
Sup4 First Last Primary Seats Backup Seats Rating
Pa Peter He M 2.3
Pa Pavol Rac M 1.99
Pa Ciara Lee M 1.88
Ka Sonia Du M 2.99
Ka Tania Le M 2.35
Ga Agnes Bla M 3.24
:
:
:
continues like this
I have got as far as the grouping and sorting:
sorted_df = df2.sort_values(['Sup4','Rating'],ascending=[True,False])
However, I need help passing the df1 column values to place the marks in the second dataframe.
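For reproducibility, a sketch of the two frames as DataFrames (reconstructed from the tables above; Stefan Rot's rating of 2 is assumed to mean 2.0):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'Sup4': ['Pa', 'Ka', 'Ga', 'Gee', 'Re'],
    'Seats': [3, 2, 1, 1, 2],
    'Primary Seats': [2, 1, 0, 1, 2],
    'Back up Seats': [1, 1, 1, 0, 0],
})
df2 = pd.DataFrame({
    'Sup4': ['Pa', 'Ka', 'Ga', 'Gee', 'Gee', 'Pa', 'Pa', 'Re', 'Re', 'Re', 'Ka'],
    'First': ['Peter', 'Sonia', 'Agnes', 'Jeffery', 'John', 'Pavol',
              'Ciara', 'David', 'Stefan', 'Franc', 'Tania'],
    'Last': ['He', 'Du', 'Bla', 'Rus', 'Cro', 'Rac', 'Lee',
             'Wool', 'Rot', 'Bor', 'Le'],
    'Primary Seats': np.nan,   # empty in the question's table
    'Backup Seats': np.nan,    # empty in the question's table
    'Rating': [2.3, 2.99, 3.24, 3.5, 1.3, 1.99, 1.88, 2.34, 2.0, 1.34, 2.35],
})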
Solution #1:
You can do a merge, but you need to include some logic to update your Seats columns. Also, it is important to mention that you need to decide what to do with groups of unequal length: `Gee` and `Re` have unequal lengths in the two dataframes (more information in Solution #2):
df3 = (pd.merge(df2[['Sup4', 'First', 'Last', 'Rating']], df1, on='Sup4')
         .sort_values(['Sup4', 'Rating'], ascending=[True, False]))
s = df3.groupby('Sup4', sort=False).cumcount() + 1
df3['Backup Seats'] = np.where(s - df3['Primary Seats'] > 0, 'M', '')
df3['Primary Seats'] = np.where(s <= df3['Primary Seats'], 'M', '')
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]:
Sup4 First Last Primary Seats Backup Seats Rating
5 Ga Agnes Bla M 3.24
6 Gee Jeffery Rus M 3.5
7 Gee John Cro M 1.3
3 Ka Sonia Du M 2.99
4 Ka Tania Le M 2.35
0 Pa Peter He M 2.3
1 Pa Pavol Rac M 1.99
2 Pa Ciara Lee M 1.88
8 Re David Wool M 2.34
9 Re Stefan Rot M 2.0
10 Re Franc Bor M 1.34
Solution #2:
After writing this solution I realized Solution #1 is much simpler, but I thought I might as well include it, since it gives some insight into handling values of unequal size in the two dataframes. You can reindex the first dataframe and use combine_first(), but you have to do some preparation. Again, you need to decide what to do with data of unequal lengths; in my answer I have simply excluded Sup4 groups of unequal length to guarantee that the indices align when finally calling combine_first():
# `mtch` checks whether the number of rows per Sup4 in the second dataframe
# equals the seat count in the first; Sup4 groups of unequal length in the
# two dataframes are excluded below
mtch = df1.groupby('Sup4')['Seats'].first().eq(df2.groupby('Sup4').size())
df1 = df1.sort_values('Sup4', ascending=True)[df1['Sup4'].isin(mtch[mtch].index)]
# `reindex` the dataframe to one row per seat, get the cumulative count,
# and mark the seats with `np.where`
df1 = df1.reindex(df1.index.repeat(df1['Seats'])).reset_index(drop=True)
s = df1.groupby('Sup4').cumcount() + 1
df1['Backup Seats'] = np.where(s - df1['Primary Seats'] > 0, 'M', '')
df1['Primary Seats'] = np.where(s <= df1['Primary Seats'], 'M', '')
# as with df1, exclude groups of uneven length from df2, and sort
df2 = (df2[df2['Sup4'].isin(mtch[mtch].index)]
       .sort_values(['Sup4', 'Rating'], ascending=[True, False])
       .reset_index(drop=True))
# `combine_first` is safe now that both frames are sorted and of equal length
df3 = df2.combine_first(df1)
# order columns and keep only the required ones
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
df3
Out[1]:
Sup4 First Last Primary Seats Backup Seats Rating
0 Ga Agnes Bla M 3.24
1 Ka Sonia Du M 2.99
2 Ka Tania Le M 2.35
3 Pa Peter He M 2.3
4 Pa Pavol Rac M 1.99
5 Pa Ciara Lee M 1.88

Creating a new DataFrame out of 2 existing Dataframes with Values coming from Dataframe 1?

I have 2 DataFrames.
DF1:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
DF2:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
My new DataFrame should look like this.
DF_new:
userID Toy Story Jumanji Grumpier Old Men Waiting to Exhale Father of the Bride Part II
1 4.0
2
3
4
The values will be the ratings of the individual users for each movie.
The movie titles are now the columns.
The userIds are now the rows.
I think it will work by joining via the movieId, but I'm not sure how to do this exactly so that I still have the movie names attached to the movieId.
Does anybody have an idea?
The problem consists of essentially 2 parts:
How to transpose df2, the sole table the user ratings come from, into the desired format. pd.DataFrame.pivot_table is the standard way to go.
The rest is about mapping the movieIds to their names. This can easily be done by direct substitution on df.columns.
In addition, if movies that received no ratings are to be listed as well, just insert the missing movieIds directly before the name substitution mentioned previously.
Code
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    data={
        "movieId": [1, 2, 3, 4, 5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],
    }
)
# to better demonstrate correctness, 2 distinct user ids are used
df2 = pd.DataFrame(
    data={
        "userId": [1, 1, 1, 2, 2],
        "movieId": [1, 2, 2, 3, 5],
        "rating": [4, 5, 4, 5, 4],
    }
)
# 1. Produce the main table.
# pivot_table's default aggfunc is 'mean', so duplicate ratings are averaged
# (user 1 rated movie 2 twice: (5 + 4) / 2 = 4.5 below)
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new)  # already pretty close
Out[17]:
movieId 1 2 3 5
userId
1 4.0 4.5 NaN NaN
2 NaN NaN 5.0 4.0
# 2. map movie IDs to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing the year
# (assumes consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
# (optional) add columns for unrated movies, then order columns as in df1
for movie_id in df_names.index.values:
    if movie_id not in df_new.columns.values:
        df_new[movie_id] = np.nan
df_new = df_new[df_names.index.values]
# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
Result
df_new
Out[16]:
toy story Jumanji Grumpier 33 Waiting 44 Father 55
userId
1 4.0 4.5 NaN NaN NaN
2 NaN NaN 5.0 NaN 4.0
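As an aside, the column-filling loop and the column ordering can be collapsed into a single reindex call; a sketch that is equivalent under the same assumptions:
# add missing movie columns as NaN and order them as in df1, in one step
df_new = df_new.reindex(columns=df_names.index)
df_new.columns = df_names.loc[df_new.columns, "title"].values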

Pandas showing KeyError with mean function

I am getting KeyError: 'BasePay' for the BasePay column: it is there in the DataFrame, but it goes missing when using the mean() function.
My pandas version is 0.23.3, on Python 3.6.3.
>>> import pandas as pd
>>> import numpy as np
>>> salDataF = pd.read_csv('Salaries.csv', low_memory=False)
>>> salDataF.head()
Id EmployeeName JobTitle BasePay OvertimePay OtherPay ... TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 ... 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 ... 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.6 ... 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.0 56120.71 198306.9 ... 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.6 9737.0 182234.59 ... 326373.19 326373.19 2011 NaN San Francisco NaN
[5 rows x 13 columns]
>>> EmpSal = salDataF.groupby('Year').mean()
KeyboardInterrupt
>>> salDataF.groupby('Year').mean()
Id TotalPay TotalPayBenefits Notes
Year
2011 18080.0 71744.103871 71744.103871 NaN
2012 54542.5 74113.262265 100553.229232 NaN
2013 91728.5 77611.443142 101440.519714 NaN
2014 129593.0 75463.918140 100250.918884 NaN
>>> EmpSal = salDataF.groupby('Year').mean()['BasePay']
Error: KeyError: 'BasePay'
Here is the problem: BasePay is not numeric, so salDataF.groupby('Year').mean() excludes all non-numeric columns by design.
The solution is to first try astype:
salDataF['BasePay'] = salDataF['BasePay'].astype(float)
...and if that fails because of some non-numeric data, use to_numeric with errors='coerce' to convert those values to NaN:
salDataF['BasePay'] = pd.to_numeric(salDataF['BasePay'], errors='coerce')
Then it is better to select the column before taking the mean:
EmpSal = salDataF.groupby('Year')['BasePay'].mean()
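A minimal sketch of the coercion step, assuming the column contains stray strings (e.g. a 'Not Provided' placeholder) that forced the whole column to object dtype:
import pandas as pd

base_pay = pd.Series(['167411.18', '155966.02', 'Not Provided'])  # object dtype
print(pd.to_numeric(base_pay, errors='coerce'))
# 0    167411.18
# 1    155966.02
# 2          NaN
# dtype: float64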