How to transpose part of a column as rows in pandas

I have patient data in a layout that can't be used for data analysis.
Each patient has multiple visits to our clinic, and our data has one row per visit, so a single patient spans multiple rows. I would like one row per patient, with a separate set of variables for each visit.
For example, the current data has multiple visits (rows) for a patient, and I want to change it to a format with one row per patient.

You have to pivot your dataframe:
out = (df.pivot_table(index=['Family Name', 'ID', 'Sex', 'age'],
                      columns=df.groupby('ID').cumcount().add(1).astype(str))
         .reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)
# Output
  Family Name  ID   Sex  age  AA1  AA2   AB1  AB2   AC1  AC2   AE1  AE2
0    Mr James   2  male   64  9.0  NaN  10.0  NaN  11.0  NaN  12.0  NaN
1      Mr Lee   1  male   62  1.0  5.0   2.0  6.0   3.0  7.0   4.0  8.0
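A minimal, self-contained sketch of the pivot above. The input frame is reconstructed from the printed output, so the value columns AA, AB, AC, AE and their numbers are assumptions, and the helper column name visit is hypothetical. It stores the per-patient visit counter in a named column first, which is a small variation on passing the Series directly to pivot_table:

```python
import pandas as pd

# Assumed input: one row per visit (reconstructed from the printed output)
df = pd.DataFrame({
    "Family Name": ["Mr Lee", "Mr Lee", "Mr James"],
    "ID": [1, 1, 2],
    "Sex": ["male", "male", "male"],
    "age": [62, 62, 64],
    "AA": [1, 5, 9],
    "AB": [2, 6, 10],
    "AC": [3, 7, 11],
    "AE": [4, 8, 12],
})

# Number each patient's visits 1, 2, ... in row order ("visit" is a hypothetical name)
df["visit"] = df.groupby("ID").cumcount().add(1).astype(str)

# One row per patient; each value column is spread across the visit numbers
out = df.pivot_table(index=["Family Name", "ID", "Sex", "age"],
                     columns="visit").reset_index()

# Flatten the two-level column labels: ("AA", "1") becomes "AA1"
out.columns = out.columns.to_flat_index().map("".join)
print(out)
```

A patient with fewer visits than the maximum gets NaN in the columns for the missing visits, as in the output above.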

Related

How do you copy data from a dataframe to another

I am having a difficult time getting the correct data from a reference csv file to the one I am working on.
I have a CSV file with over 6 million rows and 19 columns.
For each row there is a brand and a model of a car amongst other information.
I want to add to this file the fuel consumption per 100km traveled and the type of fuel that is used.
I have another CSV file that has the fuel consumption of every car model.
What I want to ultimately do is add the matching values of G,H, I and J columns from the second file to the first one.
Because of the size of the file I was wondering if there is another way to do it other than with a "for" or a "while" loop?
EDIT :
For example...
The first df would look something like this
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              NaN           NaN
2   Honda   Civic   b              NaN           NaN
3   GMC     Sierra  c              NaN           NaN
4   Toyota  Rav4    d              NaN           NaN
The second df would be something like this
ID  Brand   Model    Fuel_consu_1  Fuel_consu_2
1   Toyota  Corrola  100           120
2   Toyota  Rav4     80            84
3   GMC     Sierra   91            105
4   Honda   Civic    112           125
The output should be :
ID  Brand   Model   Other_columns  Fuel_consu_1  Fuel_consu_2
1   Toyota  Rav4    a              80            84
2   Honda   Civic   b              112           125
3   GMC     Sierra  c              91            105
4   Toyota  Rav4    d              80            84
The first df may contain the same brand and model many times under different IDs. The order is completely random.
Thank you for providing updates. I was able to put something together that should help you:
# Drop the two empty fuel columns from df; their values will come from df1 (your second table) in the merge
df.drop(['Fuel_consu_1', 'Fuel_consu_2'], axis=1, inplace=True)
# Join the two dataframes on the Brand and Model columns.
# how='left' keeps every row of df; df1's own ID column is dropped so it does not clash with df's ID
df_merge = pd.merge(df, df1.drop(columns='ID'), on=['Brand', 'Model'], how='left')
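As a self-contained sketch of that merge, here it is run on the example tables from the question. The left join and the validate check are assumptions about what is wanted: every row of the large file is kept, and the reference file is expected to hold each Brand/Model pair only once.

```python
import numpy as np
import pandas as pd

# First table from the question (fuel columns still empty)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Toyota", "Honda", "GMC", "Toyota"],
    "Model": ["Rav4", "Civic", "Sierra", "Rav4"],
    "Other_columns": ["a", "b", "c", "d"],
    "Fuel_consu_1": np.nan,
    "Fuel_consu_2": np.nan,
})

# Second (reference) table from the question
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Brand": ["Toyota", "Toyota", "GMC", "Honda"],
    "Model": ["Corrola", "Rav4", "Sierra", "Civic"],
    "Fuel_consu_1": [100, 80, 91, 112],
    "Fuel_consu_2": [120, 84, 105, 125],
})

df_merge = (
    df.drop(columns=["Fuel_consu_1", "Fuel_consu_2"])  # empty placeholders
      .merge(
          df1.drop(columns="ID"),      # the reference ID is unrelated to df's ID
          on=["Brand", "Model"],
          how="left",                  # keep all rows of df, in their original order
          validate="many_to_one",      # raise if a Brand/Model pair repeats in df1
      )
)
print(df_merge)
```

The merge is vectorized, so no explicit "for" or "while" loop over the 6 million rows is needed. Rows whose Brand/Model pair is missing from the reference table end up with NaN in the fuel columns.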

Passing values from one data frame columns to another data frame in Pandas

I have a couple of data frames. I want to get data from 2 columns from first data frame for marking the rows that are present in second data frame.
First data frame (df1) looks like this
Sup4  Seats  Primary Seats  Back up Seats
Pa    3      2              1
Ka    2      1              1
Ga    1      0              1
Gee   1      1              0
Re    2      2              0
(df2) looks like
Sup4  First    Last  Primary Seats  Backup Seats  Rating
Pa    Peter    He    NaN            NaN           2.3
Ka    Sonia    Du    NaN            NaN           2.99
Ga    Agnes    Bla   NaN            NaN           3.24
Gee   Jeffery  Rus   NaN            NaN           3.5
Gee   John     Cro   NaN            NaN           1.3
Pa    Pavol    Rac   NaN            NaN           1.99
Pa    Ciara    Lee   NaN            NaN           1.88
Re    David    Wool  NaN            NaN           2.34
Re    Stefan   Rot   NaN            NaN           2
Re    Franc    Bor   NaN            NaN           1.34
Ka    Tania    Le    NaN            NaN           2.35
The output I require groups the rows by Sup4 name, sorts each group by Rating from highest to lowest, and then marks the seat columns based on the Primary Seats and Backup Seats counts in df1.
I did the grouping and sorting by hand for the first few Sup4 names as a sample, and I have to do it for all the names:
Sup4  First  Last  Primary Seats  Backup Seats  Rating
Pa    Peter  He    M                            2.3
Pa    Pavol  Rac   M                            1.99
Pa    Ciara  Lee                  M             1.88
Ka    Sonia  Du    M                            2.99
Ka    Tania  Le                   M             2.35
Ga    Agnes  Bla                  M             3.24
...
(continues like this)
I have got as far as the grouping and sorting:
sorted_df = df2.sort_values(['Sup4', 'Rating'], ascending=[True, False])
However, I need help passing the df1 column values to mark the second dataframe.
Solution #1:
You can do a merge, but you need to include some logic to update your Seats columns. You also need to decide what to do with groups of unequal length: `Gee` and `Re` have more rows in df2 than seats in df1 (more on this in Solution #2):
import numpy as np
import pandas as pd

df3 = (pd.merge(df2[['Sup4', 'First', 'Last', 'Rating']], df1, on='Sup4')
         .sort_values(['Sup4', 'Rating'], ascending=[True, False]))
# position of each row within its Sup4 group (1 = highest rating)
s = df3.groupby('Sup4', sort=False).cumcount() + 1
df3['Backup Seats'] = np.where(s - df3['Primary Seats'] > 0, 'M', '')
df3['Primary Seats'] = np.where(s <= df3['Primary Seats'], 'M', '')
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]:
    Sup4  First    Last  Primary Seats  Backup Seats  Rating
5   Ga    Agnes    Bla                  M             3.24
6   Gee   Jeffery  Rus   M                            3.5
7   Gee   John     Cro                  M             1.3
3   Ka    Sonia    Du    M                            2.99
4   Ka    Tania    Le                   M             2.35
0   Pa    Peter    He    M                            2.3
1   Pa    Pavol    Rac   M                            1.99
2   Pa    Ciara    Lee                  M             1.88
8   Re    David    Wool  M                            2.34
9   Re    Stefan   Rot   M                            2.0
10  Re    Franc    Bor                  M             1.34
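One possible way to handle the groups with more people than seats is sketched below. It is a variation on Solution #1, not the same logic: it assumes that anyone ranked beyond the Primary plus Backup quota should get no mark at all. The names quota, n_primary and n_backup are hypothetical, and the data is retyped from the question.

```python
import numpy as np
import pandas as pd

# Seat quotas per Sup4 (df1 in the question), renamed to avoid clashing
# with the output column names. The new names are hypothetical.
quota = pd.DataFrame({
    "Sup4": ["Pa", "Ka", "Ga", "Gee", "Re"],
    "n_primary": [2, 1, 0, 1, 2],
    "n_backup": [1, 1, 1, 0, 0],
})

# People and ratings (df2 in the question, without the empty seat columns)
df2 = pd.DataFrame({
    "Sup4": ["Pa", "Ka", "Ga", "Gee", "Gee", "Pa", "Pa", "Re", "Re", "Re", "Ka"],
    "First": ["Peter", "Sonia", "Agnes", "Jeffery", "John", "Pavol",
              "Ciara", "David", "Stefan", "Franc", "Tania"],
    "Last": ["He", "Du", "Bla", "Rus", "Cro", "Rac", "Lee", "Wool", "Rot", "Bor", "Le"],
    "Rating": [2.3, 2.99, 3.24, 3.5, 1.3, 1.99, 1.88, 2.34, 2.0, 1.34, 2.35],
})

merged = (df2.merge(quota, on="Sup4", how="left")
             .sort_values(["Sup4", "Rating"], ascending=[True, False])
             .reset_index(drop=True))

# Rank within each Sup4 group: 1 is the highest rating
rank = merged.groupby("Sup4").cumcount() + 1

# The first n_primary people get a primary seat, the next n_backup a backup seat
merged["Primary Seats"] = np.where(rank <= merged["n_primary"], "M", "")
is_backup = (rank > merged["n_primary"]) & (rank <= merged["n_primary"] + merged["n_backup"])
merged["Backup Seats"] = np.where(is_backup, "M", "")

result = merged[["Sup4", "First", "Last", "Primary Seats", "Backup Seats", "Rating"]]
print(result)
```

With this rule, John (Gee) and Franc (Re) stay unmarked because their groups have no backup seats, whereas Solution #1 marks them as backup.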
Solution #2:
After writing this solution, I realized Solution #1 would be much simpler, but I thought I might as well include it. It also gives you some insight into what to do with groups of unequal size in the two dataframes. You can reindex the first dataframe and use combine_first(), but you have to do some preparation. Again, you need to decide what to do with data of unequal lengths. In my answer, I have simply excluded the Sup4 groups with unequal lengths to guarantee that the indices align when finally calling combine_first():
# Purpose of `mtch` is to check if rows in second dataframe are equal to the count of seats in first.
# If not, then I have excluded the `Sup4` with unequal lengths in both dataframes
mtch = df1.groupby('Sup4')['Seats'].first().eq(df2.groupby('Sup4').size())
df1 = df1[df1['Sup4'].isin(mtch[mtch].index)].sort_values('Sup4', ascending=True)
#`reindex` the dataframe once (one row per seat), get the cumulative count, and manipulate data with `np.where`
df1 = df1.reindex(df1.index.repeat(df1['Seats'])).reset_index(drop=True)
s = df1.groupby('Sup4').cumcount() + 1
df1['Backup Seats'] = np.where(s - df1['Primary Seats'] > 0, 'M', '')
df1['Primary Seats'] = np.where(s <= df1['Primary Seats'], 'M', '')
#like df1, in df2 we exclude groups with uneven lengths and sort
df2 = (df2[df2['Sup4'].isin(mtch[mtch].index)]
         .sort_values(['Sup4', 'Rating'], ascending=[True, False])
         .reset_index(drop=True))
#can use `combine_first` since we have ensured that the data is sorted and of equal lengths in both dataframes
df3 = df2.combine_first(df1)
#order columns and only include required columns
df3 = df3[['Sup4', 'First', 'Last', 'Primary Seats', 'Backup Seats', 'Rating']]
df3
Out[1]:
   Sup4  First  Last  Primary Seats  Backup Seats  Rating
0  Ga    Agnes  Bla                  M             3.24
1  Ka    Sonia  Du    M                            2.99
2  Ka    Tania  Le                   M             2.35
3  Pa    Peter  He    M                            2.3
4  Pa    Pavol  Rac   M                            1.99
5  Pa    Ciara  Lee                  M             1.88

Creating a new DataFrame out of 2 existing Dataframes with Values coming from Dataframe 1?

I have 2 DataFrames.
DF1:
   movieId  title                               genres
0  1        Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1  2        Jumanji (1995)                      Adventure|Children|Fantasy
2  3        Grumpier Old Men (1995)             Comedy|Romance
3  4        Waiting to Exhale (1995)            Comedy|Drama|Romance
4  5        Father of the Bride Part II (1995)  Comedy
DF2:
   userId  movieId  rating  timestamp
0  1       1        4.0     964982703
1  1       3        4.0     964981247
2  1       6        4.0     964982224
3  1       47       5.0     964983815
4  1       50       5.0     964982931
My new DataFrame should look like this.
DF_new:
userID  Toy Story  Jumanji  Grumpier Old Men  Waiting to Exhale  Father of the Bride Part II
1       4.0
2
3
4
The values will be the ratings each individual user gave to each movie.
The movie titles are now the columns.
The userIds are now the rows.
I think it will work by concatenating on the movieId, but I'm not sure exactly how to do this so that the movie names stay attached to the movieId.
Does anybody have an idea?
The problem essentially consists of 2 parts:
1. How to transpose df2, the sole table the user ratings come from, into the desired format. pd.DataFrame.pivot_table is the standard way to go.
2. The rest is about mapping the movieIds to their names. This can easily be done by direct substitution on df.columns.
In addition, if movies that received no ratings should be listed as well, just insert the missing movieIds before the name substitution in step 2.
Code
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    data={
        "movieId": [1, 2, 3, 4, 5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],
    }
)
# to better demonstrate the correctness, 2 distinct user ids were used.
df2 = pd.DataFrame(
    data={
        "userId": [1, 1, 1, 2, 2],
        "movieId": [1, 2, 2, 3, 5],
        "rating": [4, 5, 4, 5, 4]
    }
)
# 1. Produce the main table
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new) # already pretty close
Out[17]:
movieId    1    2    3    5
userId
1        4.0  4.5  NaN  NaN
2        NaN  NaN  5.0  4.0
# 2. map movie ID's to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing year
# (assume consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
# (optional) fill unrated columns and sort
# reindex adds a NaN column for every unrated movie and orders the columns as in df_names
df_new = df_new.reindex(columns=df_names.index)
# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
Result
df_new
Out[16]:
        toy story  Jumanji  Grumpier 33  Waiting 44  Father 55
userId
1             4.0      4.5          NaN         NaN        NaN
2             NaN      NaN          5.0         NaN        4.0
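An alternative sketch, not the method used above: attach the titles to the ratings with a merge first, then pivot directly on the title. The variable names movies, ratings and wide are hypothetical, the ratings are the demo values from the code above, and the regex assumes every title ends in a four-digit year in parentheses.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)",
              "Waiting to Exhale (1995)", "Father of the Bride Part II (1995)"],
})
ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "movieId": [1, 2, 2, 3, 5],
    "rating": [4, 5, 4, 5, 4],
})

# Strip the trailing " (YYYY)" from each title
movies["title"] = movies["title"].str.replace(r"\s*\(\d{4}\)$", "", regex=True)

wide = (
    ratings.merge(movies, on="movieId", how="left")   # attach titles to ratings
           .pivot_table(index="userId", columns="title",
                        values="rating", aggfunc="mean")
           .reindex(columns=movies["title"].tolist())  # keep unrated movies, in movieId order
)
print(wide)
```

Pivoting on the title avoids the separate renaming step, at the cost of assuming that titles are unique once the year is stripped.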

How to create the column in pandas based on values of another column

I create a new column by adding the values of one column (d_id_max) to the index and looking up the Age at the resulting position. The code works fine on a sample dataframe, but when I run it on my existing dataframe it throws the error "can only perform ops with scalar values". From what I found, the code expects a dict, and that is why it throws the error.
I tried converting the dataframe to a dictionary or to a list, but no luck.
df = pd.DataFrame({'Name': ['Sam', 'Andrea', 'Alex', 'Robin', 'Kia', 'Sia'], 'Age':[14,25,55,8,21,43], 'd_id_max':[2,1,1,2,0,0]})
df['Expected_new_col'] = df.loc[df.index + df['d_id_max'].to_list, 'Age'].to_numpy()
print(df)
error: can only perform ops with scalar values.
This is the dataframe I want to implement this code:
   Weight  Name    Age  1      2      abs_max  d_id_max
0  45      Sam     14   11.0   41.0   41.0     2
1  88      Andrea  25   30.0   -17.0  30.0     1
2  56      Alex    55   -47.0  -34.0  47.0     1
3  15      Robin   8    13.0   35.0   35.0     2
4  71      Kia     21   22.0   24.0   24.0     2
5  44      Sia     43   2.0    22.0   22.0     2
6  54      Ryan    45   20.0   0.0    20.0     1
Writing your new column like this will not return an error:
df.loc[df.index + df['d_id_max'], 'Age'].to_numpy()
EDIT:
You should first format d_id_max as int (or float):
df['d_id_max'] = df['d_id_max'].astype(int)
The solution was very simple. I was getting the error because the column d_id_max had the object data type, when it should be either float or integer, so I changed the data type and it worked fine.
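A small sketch of that fix on the sample frame from the question. Storing the offsets as strings is only an assumption used here to reproduce an object-dtype column; the real data may have become object for another reason.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sam", "Andrea", "Alex", "Robin", "Kia", "Sia"],
    "Age": [14, 25, 55, 8, 21, 43],
    # Assumption: the offsets arrive as strings, which gives the column object dtype
    "d_id_max": ["2", "1", "1", "2", "0", "0"],
})
print(df["d_id_max"].dtype)  # object

# Convert to a numeric dtype before doing arithmetic with the index
df["d_id_max"] = pd.to_numeric(df["d_id_max"]).astype(int)

# Row label to read Age from: the row's own index plus its offset
target = df.index.to_numpy() + df["d_id_max"].to_numpy()
df["Expected_new_col"] = df.loc[target, "Age"].to_numpy()
print(df)
```

Every target label must exist in the index. An offset that points past the last row raises a KeyError instead of producing NaN.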

How to get the unique values

I have this df:
data = pd.read_csv('attacks.csv', encoding="latin-1")
new_data = data.loc[:,'Name':'Investigator or Source']
new_data.head(5)
Name Sex Age Injury Fatal (Y/N) Time Species Investigator or Source
0 Julie Wolfe F 57 No injury to occupant, outrigger canoe and pad... N 18h00 White shark R. Collier, GSAF
1 Adyson McNeely F 11 Minor injury to left thigh N 14h00 -15h00 NaN K.McMurray, TrackingSharks.com
2 John Denges M 48 Injury to left lower leg from surfboard skeg N 07h45 NaN K.McMurray, TrackingSharks.com
3 male M NaN Minor injury to lower leg N NaN 2 m shark B. Myatt, GSAF
4 Gustavo Ramos M NaN Lacerations to leg & hand shark PROVOKED INCIDENT N NaN Tiger shark, 3m A .Kipper
How can I get the unique values of the 'Species' category?
I'm trying with:
new_data["Species"].unique()
But it does not work.
Thank you!
You can also try:
uniqueSpecies = set(new_data["Species"])
In case you want to drop NaN:
uniqueSpecies = set(new_data["Species"].dropna())
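For comparison, a minimal sketch that stays within pandas. The small frame is a stand-in for attacks.csv: the first five Species values are taken from the rows shown in the question, and the repeated last value is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for new_data; the last value is a made-up duplicate
new_data = pd.DataFrame({
    "Species": ["White shark", np.nan, np.nan, "2 m shark",
                "Tiger shark, 3m", "White shark"],
})

# unique() keeps the order of first appearance and reports NaN once
with_nan = new_data["Species"].unique()

# Drop missing values first to get only the real species labels
unique_species = new_data["Species"].dropna().unique()
print(unique_species)
```

Unlike set(), unique() keeps the values in order of first appearance.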