Creating a new DataFrame out of 2 existing DataFrames with values coming from DataFrame 1? - pandas

I have 2 DataFrames.
DF1:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
DF2:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
My new DataFrame should look like this.
DF_new:
userID Toy Story Jumanji Grumpier Old Men Waiting to Exhale Father of the Bride Part II
1 4.0
2
3
4
The values will be the ratings given by the individual user to each movie.
The movie titles are now the columns.
The userId are now the rows.
I think it will work by joining on the movieId, but I'm not sure how to do this exactly so that I still have the movie names attached to the movieId.
Does anybody have an idea?

The problem consists of essentially 2 parts:
How to reshape df2, the sole table the user ratings come from, into the desired format. pd.DataFrame.pivot_table is the standard way to go.
The rest is about mapping the movieIds to their names. This can easily be done by direct substitution on df.columns.
In addition, if movies that received no ratings should be listed as well, just insert the missing movieIds directly before the name substitution mentioned above.
Code
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    data={
        "movieId": [1, 2, 3, 4, 5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],
    }
)
# to better demonstrate the correctness, 2 distinct user ids were used.
df2 = pd.DataFrame(
    data={
        "userId": [1, 1, 1, 2, 2],
        "movieId": [1, 2, 2, 3, 5],
        "rating": [4, 5, 4, 5, 4]
    }
)
# 1. Produce the main table
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new) # already pretty close
Out[17]:
movieId 1 2 3 5
userId
1 4.0 4.5 NaN NaN
2 NaN NaN 5.0 4.0
# 2. map movie IDs to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing the year
# (assumes consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
# (optional) add movies that received no ratings as empty columns
for movie_id in df_names.index.values:
    if movie_id not in df_new.columns.values:
        df_new[movie_id] = np.nan
# sort the columns into movieId order (this must happen after the loop,
# once every movieId exists as a column)
df_new = df_new[df_names.index.values]
# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
Result
df_new
Out[16]:
toy story Jumanji Grumpier 33 Waiting 44 Father 55
userId
1 4.0 4.5 NaN NaN NaN
2 NaN NaN 5.0 NaN 4.0
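For completeness, the fill-and-sort-and-rename steps can also be collapsed into one chain with reindex and rename. This is a sketch of an equivalent variant, not part of the original answer, assuming the same df1 and df2 as above:
titles = df1.set_index("movieId")["title"].str[:-7]  # movieId -> title, year suffix stripped
df_new = (df2.pivot_table(index="userId", columns="movieId", values="rating")
             .reindex(columns=titles.index)   # add unrated movies as NaN columns
             .rename(columns=titles))         # relabel movieIds with their titles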

Related

how to transpose part of a column as rows in pandas

I have data on some patients, and it is in a configuration that can't be used for data analysis.
We have a couple of patients, and each patient has multiple visits to our clinic. Therefore, our data has one row per visit, and each row contains some data; since every patient has multiple visits, I have multiple rows for a single patient. I would like a way to get just one row per patient, with multiple variables for each visit.
For example:
[image: input table with one row per visit]
As you can see, we have multiple visits for a patient, and I want to change it to this format:
[image: desired output with one row per patient]
You have to pivot your dataframe:
out = (df.pivot_table(index=['Family Name', 'ID', 'Sex', 'age'],
                      columns=df.groupby('ID').cumcount().add(1).astype(str))
         .reset_index())
out.columns = out.columns.to_flat_index().map(''.join)
print(out)
# Output
Family Name ID Sex age AA1 AA2 AB1 AB2 AC1 AC2 AE1 AE2
0 Mr James 2 male 64 9.0 NaN 10.0 NaN 11.0 NaN 12.0 NaN
1 Mr Lee 1 male 62 1.0 5.0 2.0 6.0 3.0 7.0 4.0 8.0
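Since the input was only posted as an image, here is a hypothetical reconstruction of it that reproduces the output above; the column names and values are inferred from the result, so treat them as assumptions:
import pandas as pd

# hypothetical input inferred from the output table: one row per visit
df = pd.DataFrame({
    'Family Name': ['Mr Lee', 'Mr Lee', 'Mr James'],
    'ID': [1, 1, 2],
    'Sex': ['male', 'male', 'male'],
    'age': [62, 62, 64],
    'AA': [1, 5, 9],
    'AB': [2, 6, 10],
    'AC': [3, 7, 11],
    'AE': [4, 8, 12],
})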

Pandas data frame problem: the 1st image is the input and the 2nd image would be the output/result

Working with a pandas data frame: how can I separate each Scope and its value into new Type columns, with the values added in various rows? Please see the 1st image.
If all columns are ordered in pairs, it is possible to filter the even and odd columns and recreate the DataFrame:
df1 = pd.DataFrame({'Verified By': df.iloc[:, 1::2].stack(dropna=False).to_numpy(),
                    'tCO2': df.iloc[:, ::2].stack(dropna=False).to_numpy()})
print (df1)
Verified By tCO2
0 Cventure LLC 12.915
1 Cventure LLC 61.801
2 NaN 78.551
3 NaN 5.712
4 NaN 49.513
5 Cventure LLC 24.063
6 Carbon Trust 679.000
7 NaN 4.445
8 Cventure LLC 56290.000
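Since the question only posted images, both answers here assume an input of roughly the following form; this is a plausible reconstruction inferred from the outputs, and the alternating column layout is an assumption:
import numpy as np
import pandas as pd

# hypothetical input: tCO2 / Verified By columns alternate for each scope
df = pd.DataFrame({
    'Scope 1 tCO2': [12.915, 5.712, 679.0],
    'Scope 1 Verified By': ['Cventure LLC', np.nan, 'Carbon Trust'],
    'Scope 2 tCO2': [61.801, 49.513, 4.445],
    'Scope 2 Verified By': ['Cventure LLC', np.nan, np.nan],
    'Scope 3 tCO2': [78.551, 24.063, 56290.0],
    'Scope 3 Verified By': [np.nan, 'Cventure LLC', 'Cventure LLC'],
})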
Another idea is to split the column names by the first 2 spaces and reshape by DataFrame.stack:
df.columns = df.columns.str.split(n=2, expand=True)
df1 = df.stack([0,1]).droplevel(0)
df1.index = df1.index.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.rename_axis('Score').reset_index()
print (df1)
Score Verified By tCO2
0 Scope 1 Cventure LLC 12.915
1 Scope 2 Cventure LLC 61.801
2 Scope 3 NaN 78.551
3 Scope 1 NaN 5.712
4 Scope 2 NaN 49.513
5 Scope 3 Cventure LLC 24.063
6 Scope 1 Carbon Trust 679.000
7 Scope 2 NaN 4.445
8 Scope 3 Cventure LLC 56290.000

Pandas Groupby and Apply

I am performing a groupby and apply over a dataframe, and it is returning some strange results; I am using pandas 1.3.1.
Here is the code:
ddf = pd.DataFrame({
    "id": [1, 1, 1, 1, 2]
})

def do_something(df):
    return "x"
ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x", but instead I get this data:
id title
0 1 NaN
1 1 x
2 1 x
3 1 NaN
4 2 NaN
Is this expected?
The result is not strange, it's the right behavior: apply returns one value per group, here for the groups 1 and 2, and these group names become the index of the aggregation:
>>> list(ddf.groupby("id"))
[(1,        # the group name (the future index of the grouped df)
      id    # the subset dataframe of group 1
   0   1
   1   1
   2   1
   3   1),
 (2,        # the group name (the future index of the grouped df)
      id    # the subset dataframe of group 2
   4   2)]
Why do you get any values at all? Because the group labels (1 and 2) also appear in your dataframe's index:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
Yes it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem.
A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function to use it (mean, max or sum). If you run one of them, for example like this:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
After that, do_something is applied to index 1 and 2 only, and the result is then integrated into your original df. This is why only index 1 and 2 end up with x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway, and having a deeper look into the GroupBy object.
If you need a new column from an aggregate function, use GroupBy.transform; it is necessary to specify the column to process after the groupby, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign the new column inside the function:
def do_something(x):
    x['title'] = 'x'
    return x

ddf = ddf.groupby("id").apply(do_something)
An explanation of why the original code is not working is in the other answers.
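As a quick end-to-end check, a minimal sketch of the transform route on the question's toy data:
import pandas as pd

ddf = pd.DataFrame({"id": [1, 1, 1, 1, 2]})

def do_something(x):
    return "x"

# transform broadcasts the scalar returned per group back to every row,
# so the result aligns with ddf's original index
ddf["title"] = ddf.groupby("id")["id"].transform(do_something)
print(ddf)
#    id title
# 0   1     x
# 1   1     x
# 2   1     x
# 3   1     x
# 4   2     x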

How to count the distance in cells (e.g. in indices) between two repeating values in one column in Pandas dataframe?

I have the following dataset. It lists the words that were presented to a participant in a psycholinguistic experiment (I set the order of presentation of each word as the index):
data = {'Stimulus': ['sword', 'apple', 'tap', 'stick', 'elephant', 'boots',
                     'berry', 'apple', 'pear', 'apple', 'stick'],
        'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data, columns=['Stimulus', 'Order'])
df.set_index('Order', inplace=True)
Stimulus
Order
1 sword
2 apple
3 tap
4 stick
5 elephant
6 boots
7 berry
8 apple
9 pear
10 apple
11 stick
Some values in this dataset are repeated (e.g. apple), some are not. The problem is that I need to calculate the distance in cells, based on the Order column, between each occurrence of a repeated value and store it in a separate column, like this:
Stimulus Distance
Order
1 sword NA
2 apple NA
3 tap NA
4 stick NA
5 elephant NA
6 boots NA
7 berry NA
8 apple 6
9 pear NA
10 apple 2
11 stick 7
It shouldn't be hard to implement, but I've gotten stuck. Initially, I made a dictionary of duplicates where I store the items as keys and their indices as values:
{'apple': [2,8,10],'stick': [4, 11]}
And then I failed to find a solution for putting those values into a column. If there is a simpler way to do it in a loop, without using dictionaries, please let me know. I will appreciate any advice.
Use df.groupby on Stimulus, then transform the Order column using pd.Series.diff:
df = df.reset_index()
df['Distance'] = df.groupby('Stimulus').transform(pd.Series.diff)
df = df.set_index('Order')
# print(df)
Stimulus Distance
Order
1 sword NaN
2 apple NaN
3 tap NaN
4 stick NaN
5 elephant NaN
6 boots NaN
7 berry NaN
8 apple 6.0
9 pear NaN
10 apple 2.0
11 stick 7.0
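A variant that avoids the reset_index/set_index round trip is to diff the index itself within each group; this is a sketch of an alternative, not the answer's original code, assuming the df built in the question:
# group the Order index values by Stimulus and diff within each group
df['Distance'] = df.index.to_series().groupby(df['Stimulus']).diff()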

Return Value Based on Conditional Lookup on Different Pandas DataFrame

Objective: to look up values from one data frame (conditionally) and place the results in a different dataframe with a new column name.
df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and only fill "priority_rank" based on the conditional lookup against df_1, namely the following:
user_id must match between df_1 and df_2
I am only interested in rows where df_1['name'] == 'abc'; all else should be blank
df_2 should end up looking like this:
user_id  priority_rank
1        6
2
3
4        9
5        10
One way to do this:
In []:
df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id),
                                 df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example, df_1.name=='abc' is a sufficient condition because df_1.user_id equals df_2.user_id on every row where df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge:
df_2.merge(df_1.loc[df_1.name=='abc', :], how='left').drop(columns='name')
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
    df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
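If the literal blank cells from the desired output are needed instead of NaN, the mapped result can be filled afterwards; a small sketch building on the map approach above (note that fillna('') turns the column into object dtype):
lookup = df_1.loc[df_1['name'] == 'abc'].set_index('user_id')['rank']
df_2['priority_rank'] = df_2['user_id'].map(lookup).fillna('')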