Pandas DataFrame problem. The 1st image is the input and the 2nd image would be the output/result - pandas

Working with a pandas DataFrame: how can I separate "Scope" and its value into a new column, with the values added across multiple rows? Please see the 1st image.

If all columns are ordered in pairs, it is possible to filter the even and odd columns and recreate the DataFrame:
df1 = pd.DataFrame({'Verified By': df.iloc[:, 1::2].stack(dropna=False).to_numpy(),
                    'tCO2': df.iloc[:, ::2].stack(dropna=False).to_numpy()})
print (df1)
Verified By tCO2
0 Cventure LLC 12.915
1 Cventure LLC 61.801
2 NaN 78.551
3 NaN 5.712
4 NaN 49.513
5 Cventure LLC 24.063
6 Carbon Trust 679.000
7 NaN 4.445
8 Cventure LLC 56290.000
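Since the input table from the image is not available here, a minimal runnable sketch with assumed column names shows the same even/odd split; `.to_numpy().ravel()` flattens the column slices row by row, equivalently to `stack(dropna=False)`:

```python
import numpy as np
import pandas as pd

# Assumed input layout: alternating tCO2 / "Verified By" columns per scope
# (column names are hypothetical; the original image is unavailable)
df = pd.DataFrame({
    'Scope 1 tCO2': [12.915, 5.712],
    'Scope 1 Verified By': ['Cventure LLC', np.nan],
    'Scope 2 tCO2': [61.801, 49.513],
    'Scope 2 Verified By': ['Cventure LLC', np.nan],
})

# even columns (0, 2, ...) hold tCO2, odd columns (1, 3, ...) hold the verifier
df1 = pd.DataFrame({
    'Verified By': df.iloc[:, 1::2].to_numpy().ravel(),
    'tCO2': df.iloc[:, ::2].to_numpy().ravel(),
})
print(df1)
```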
Another idea is to split the column names by the first 2 spaces and reshape with DataFrame.stack:
df.columns = df.columns.str.split(n=2, expand=True)
df1 = df.stack([0,1]).droplevel(0)
df1.index = df1.index.map(lambda x: f'{x[0]} {x[1]}')
df1 = df1.rename_axis('Scope').reset_index()
print (df1)
Scope Verified By tCO2
0 Scope 1 Cventure LLC 12.915
1 Scope 2 Cventure LLC 61.801
2 Scope 3 NaN 78.551
3 Scope 1 NaN 5.712
4 Scope 2 NaN 49.513
5 Scope 3 Cventure LLC 24.063
6 Scope 1 Carbon Trust 679.000
7 Scope 2 NaN 4.445
8 Scope 3 Cventure LLC 56290.000
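To see what the `n=2` split produces on assumed column names (limiting the split to the first two spaces keeps 'Verified By' intact as the third part):

```python
import pandas as pd

# Hypothetical column names; the original input image is unavailable
cols = pd.Index(['Scope 1 Verified By', 'Scope 1 tCO2',
                 'Scope 2 Verified By', 'Scope 2 tCO2'])

# n=2 means "split at most twice", producing 3-part tuples
parts = cols.str.split(n=2, expand=True)
print(list(parts))
```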

Creating a new DataFrame out of 2 existing Dataframes with Values coming from Dataframe 1?

I have 2 DataFrames.
DF1:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
DF2:
userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
My new DataFrame should look like this.
DF_new:
userId Toy Story Jumanji Grumpier Old Men Waiting to Exhale Father of the Bride Part II
1 4.0
2
3
4
The values will be the ratings given by the individual user to each movie.
The movie titles are now the columns.
The userIds are now the rows.
I think it will work by joining via the movieId, but I'm not sure how to do this exactly so that I still have the movie names attached to the movieId.
Does anybody have an idea?
The problem consists of essentially 2 parts:
How to transpose df2, the sole table the user ratings come from, into the desired format. pd.DataFrame.pivot_table is the standard way to go.
The rest is about mapping the movieIds to their names. This can easily be done by direct substitution on df.columns.
In addition, if movies that received no ratings should be listed as well, just insert the missing movieIds directly before the name substitution mentioned previously.
Code
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    data={
        "movieId": [1, 2, 3, 4, 5],
        "title": ["toy story (1995)",
                  "Jumanji (1995)",
                  "Grumpier 33 (1995)",  # shortened for printing
                  "Waiting 44 (1995)",
                  "Father 55 (1995)"],
    }
)
# to better demonstrate the correctness, 2 distinct user ids were used.
df2 = pd.DataFrame(
    data={
        "userId": [1, 1, 1, 2, 2],
        "movieId": [1, 2, 2, 3, 5],
        "rating": [4, 5, 4, 5, 4],
    }
)
# 1. Produce the main table
df_new = df2.pivot_table(index=["userId"], columns=["movieId"], values="rating")
print(df_new) # already pretty close
Out[17]:
movieId 1 2 3 5
userId
1 4.0 4.5 NaN NaN
2 NaN NaN 5.0 4.0
# 2. map movie ID's to titles
# name lookup dataset
df_names = df1[["movieId", "title"]].set_index("movieId")
# strip the last 7 characters containing year
# (assume consistent formatting in df1)
df_names["title"] = df_names["title"].apply(lambda s: s[:-7])
# (optional) fill unrated columns and sort
for movie_id in df_names.index.values:
    if movie_id not in df_new.columns.values:
        df_new[movie_id] = np.nan
# reorder columns to match df1 (done once, after the loop)
df_new = df_new[df_names.index.values]
# replace IDs with titles
df_new.columns = df_names.loc[df_new.columns, "title"].values
Result
df_new
Out[16]:
toy story Jumanji Grumpier 33 Waiting 44 Father 55
userId
1 4.0 4.5 NaN NaN NaN
2 NaN NaN 5.0 NaN 4.0
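An alternative sketch, using the same df1/df2 as above: merge the titles in first, then pivot directly on the title column. Note the columns then come out in title order rather than movieId order:

```python
import pandas as pd

df1 = pd.DataFrame({"movieId": [1, 2, 3, 4, 5],
                    "title": ["toy story (1995)", "Jumanji (1995)",
                              "Grumpier 33 (1995)", "Waiting 44 (1995)",
                              "Father 55 (1995)"]})
df2 = pd.DataFrame({"userId": [1, 1, 1, 2, 2],
                    "movieId": [1, 2, 2, 3, 5],
                    "rating": [4, 5, 4, 5, 4]})

# attach titles via merge, then pivot on them directly
# (duplicate ratings are averaged, pivot_table's default aggfunc)
df_new = (df2.merge(df1, on="movieId")
             .pivot_table(index="userId", columns="title", values="rating"))
print(df_new)
```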

adding lists with different length to a new dataframe

I have two lists with different lengths, like a = [1, 2, 3] and b = [2, 3].
I would like to generate a pd.DataFrame from them by padding NaN at the beginning of the shorter list, like this:
a b
1 1 nan
2 2 2
3 3 3
I would appreciate a clean way of doing this.
Use itertools.zip_longest on the reversed lists:
from itertools import zip_longest
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 3]
L = [a, b]
iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a', 'b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
Alternatively, if b is the shorter list, prepend the missing NaNs directly:
df = pd.DataFrame(list(zip(a, ([np.nan]*(len(a)-len(b)))+b)), columns=['a','b'])
print (df)
a b
0 1 NaN
1 2 2.0
2 3 3.0
b.insert(0, np.nan)  # prepend NaN so b lines up with the end of a
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])
Note that appending NaN and then re-ordering through set() or dict.fromkeys() is not reliable: set iteration order is arbitrary, and dict.fromkeys() preserves insertion order, so the appended NaN stays at the end rather than moving to the front. Inserting at index 0 achieves the intended padding directly.
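A generic left-padding sketch that handles any number of lists of any lengths (the `lists` dict name is illustrative):

```python
import numpy as np
import pandas as pd

lists = {'a': [1, 2, 3], 'b': [2, 3]}
n = max(len(v) for v in lists.values())

# left-pad every list with NaN up to the longest length
df = pd.DataFrame({k: [np.nan] * (n - len(v)) + v for k, v in lists.items()})
print(df)
```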

Pandas columns headers split

I have a dataframe with column headers made up of 3 tags separated by '__'
E.g.
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and just use the first tag with this code: df.columns = df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example tags 1 and 3,
giving
A__66 B__45
0
1
2
3
4
5
I've tried the below but it's not working:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[2]
With specific regex substitution:
In [124]: df.columns.str.replace(r'__[^_]+__', '__')
Out[124]: Index(['A__66', 'B__45'], dtype='object')
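A runnable check of the substitution; note that recent pandas needs an explicit regex=True, since the default of str.replace changed to False in pandas 2.0:

```python
import pandas as pd

cols = pd.Index(['A__2__66', 'B__4__45'])

# drop the middle tag: match '__<middle>__' and collapse it to '__'
out = cols.str.replace(r'__[^_]+__', '__', regex=True)
print(list(out))
```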
Use Index.map with f-strings to select the first and third values of the split lists:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
A__66 B__45
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
You can also try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]
I found your own solution perfectly fine, and probably the most readable; it just needs a little adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or, for the sake of efficiency, avoid calling str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')

Return Value Based on Conditional Lookup on Different Pandas DataFrame

Objective: to lookup value from one data frame (conditionally) and place the results in a different dataframe with a new column name
df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and fill it only based on a conditional lookup against df_1, namely the following:
user_id must match between df_1 and df_2
I am only interested in rows where df_1['name'] == 'abc'; all else should be blank
df_2 should end up looking like this:
|user_id|priority_rank|
1 6
2
3
4 9
5 10
One way to do this:
In []:
df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id), df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example df_1.name=='abc' is a sufficient condition because all values for user_id are identical when df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge
df_2.merge(df_1.loc[df_1.name=='abc',:], how='left').drop(columns='name')
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
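To get the NaNs and the priority_rank column name in one go, a rename before the merge works too (a sketch based on the same df_1/df_2 as above):

```python
import pandas as pd

df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})

# keep only the 'abc' rows, rename up front, then left-merge on user_id
lookup = (df_1.loc[df_1['name'] == 'abc', ['user_id', 'rank']]
              .rename(columns={'rank': 'priority_rank'}))
out = df_2.merge(lookup, on='user_id', how='left')
print(out)
```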
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't in alphabetical order but makes logical sense, ex.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent the df.append() function from ordering the columns alphabetically?
It seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# select the columns in that order (assigning to .columns would only
# relabel them without moving the data)
merged = merged[cols]
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
from numpy.random import randn
df1 = pd.DataFrame(columns=['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1'], data=randn(5, 3))
df2 = pd.DataFrame(columns=['Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], data=randn(5, 3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged = merged[cols]
merged
Out[26]:
   Org_Goals_1  Calc_Goals_1  Diff_Goals_1  Org_Goals_2  Calc_Goals_2  \
0     1.528579      0.028935     -0.687143          NaN           NaN
1    -0.720132      0.943432     -2.055357          NaN           NaN
2     1.556319      0.035234      0.020756          NaN           NaN
3    -1.458852      1.447863      0.847496          NaN           NaN
4    -0.222660      0.132337     -0.255578          NaN           NaN
0          NaN           NaN           NaN     0.727619      0.131085
1          NaN           NaN           NaN     0.022209     -1.942110
2          NaN           NaN           NaN    -0.350757      0.944052
3          NaN           NaN           NaN     1.116637     -1.796448
4          NaN           NaN           NaN     1.947526      0.961545

   Diff_Goals_2
0           NaN
1           NaN
2           NaN
3           NaN
4           NaN
0      0.850022
1      0.672965
2      1.274509
3      0.130338
4     -0.741825
The same alphabetical sorting of the columns happens with concat as well, so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found label-based indexing to work quite well (.ix has since been removed from pandas; .loc is the replacement):
df2 = df.loc[:, 'order of columns']
As I see it, the order is lost, but when appending, the original data should have the correct order. To maintain that, assuming DataFrame 'alldata' and a DataFrame of data to be appended 'newdata', appending while keeping the column order of 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second'.)
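Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the modern equivalent is pd.concat, which with the default sort=False keeps columns in first-seen order, so no reordering step is needed:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 3),
                   columns=['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1'])
df2 = pd.DataFrame(np.random.randn(5, 3),
                   columns=['Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'])

# sort=False (the default) preserves first-seen column order
merged = pd.concat([df1, df2], sort=False)
print(list(merged.columns))
```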