How to merge similar rows and spread a column's values into new columns? - pandas

I have this data set for example:
   Name  Number Is true
0  Dani       2     yes
1  Dani       2      no
2  Jack       5      no
3  Jack       5   maybe
4  Dani       2   maybe
I want to create a new data set that combines similar rows and spreads the differing values of a column into new columns. This is the output I'm trying to get:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe
I couldn't get it working from example 10 here:
How to pivot a dataframe
Would you be able to provide a specific example for this use case please?
Thanks.
Edit in response to the answer - here is the follow-up output I'm after:
   Name  yes  no  maybe
0  Dani    2   2      2
1  Jack  NaN   5      5
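PS: for anyone reproducing this, the example frame can be rebuilt from the table above (my reconstruction, not code from the question):
import pandas as pd

df = pd.DataFrame({'Name': ['Dani', 'Dani', 'Jack', 'Jack', 'Dani'],
                   'Number': [2, 2, 5, 5, 2],
                   'Is true': ['yes', 'no', 'no', 'maybe', 'maybe']})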

With a combination of pivot_table(...) and apply(...):
(df.pivot_table(index=["Name", "Number"], values="Is true", aggfunc=list)
   .apply(lambda x: pd.Series({f"Is true{i+1}": el for i, el in enumerate(x.iloc[0])}), axis=1)
   .reset_index())
Output:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe      NaN
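An equivalent route that may read more easily (my sketch, not the answerer's code; the list-valued pivot index needs pandas >= 1.1): number the duplicates within each Name/Number group with groupby().cumcount(), then pivot on that counter:
tmp = df.assign(n=df.groupby(['Name', 'Number']).cumcount() + 1)
out = (tmp.pivot(index=['Name', 'Number'], columns='n', values='Is true')
          .add_prefix('Is true')       # columns 1, 2, 3 -> "Is true1", ...
          .rename_axis(columns=None)
          .reset_index())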
Edit
For your follow-up, something along these lines might be what you're looking for:
(df.pivot_table(index=["Name"], columns="Is true", values="Number", aggfunc=list)
   .fillna('')
   .apply(lambda x: pd.Series({f"{col}{i+1}": el for col in x.keys() for i, el in enumerate(x[col])}), axis=1)
   .reset_index())
Output:
   Name  maybe1  no1  yes1
0  Dani     2.0  2.0   2.0
1  Jack     5.0  5.0   NaN
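If, as in the sample, every Name carries at most a single Number per "Is true" value, the exact layout from the edit also falls out of a plain pivot_table (a sketch under that assumption):
(df.pivot_table(index='Name', columns='Is true', values='Number', aggfunc='first')
   .rename_axis(columns=None)   # drop the "Is true" label on the columns axis
   .reset_index())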

You can try this:
df2 = df.drop_duplicates(subset=['Name', 'Number']).reset_index(drop=True)
df2 = df2.assign(**{'Is true': df.groupby('Name')['Is true'].agg(list).reset_index(drop=True)})
temp = df2['Is true'].apply(pd.Series).T
temp.index = temp.index + 1
temp = temp.T
df2 = df2.drop(columns='Is true').join(temp.add_prefix('Is true')).fillna('')
output:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe

Related

Python Pandas: Rearrange data from vertical to horizontal

I would like to transform a data frame using pandas.
Old-Dataframe:
Person-ID  Reference-ID  Name
        1             1  Max
        2             1  Kevin
        3             1  Sara
        4             4  Chessi
        5             9  Fernando
into a new dataframe in the following format.
New-Dataframe:
Person-ID  Reference-ID  Member1  Member2  Member3
        1             1  Max      Kevin    Sara
        4             4  Chessi
        5             9  Fernando
My solution would be:
Write all the Reference-IDs from the old dataframe into the new dataframe
Write into the new dataframe all the Person-IDs whose Reference-ID does not exist as a Person-ID in the old dataframe (see Fernando)
Loop through the old dataframe and add each Name to the corresponding line in the new dataframe
Do you have any suggestions on how to make this faster/simpler?
PS: The old-dataframe can be made like this
person_id = [1,2,3,4,5]
reference_id = [1,1,1,4,9]
name = ['Max','Kevin','Sara',"Chessi","Fernando"]
list_tuples=list(zip(person_id,reference_id,name))
old_dataframe = pd.DataFrame(list_tuples,columns=['Person_ID','Reference_id','Name'])
You can use pivot_table() like this:
df1 = pd.pivot_table(df, index=['Reference-ID'], values=['Person-ID', 'Name'],
                     aggfunc={'Person-ID': 'min', 'Name': lambda x: list(x)})
df1.reset_index()[['Person-ID', 'Reference-ID']].join(pd.DataFrame(df1.Name.tolist()))
Output:
Person-ID  Reference-ID         0      1     2
        1             1       Max  Kevin  Sara
        4             4    Chessi   None  None
        5             9  Fernando   None  None
You can reassign column names like this:
df2 = df1.reset_index()[['Person-ID', 'Reference-ID']].join(pd.DataFrame(df1.Name.tolist()))
df2.columns = list(df2.columns[0:2]) + [f"Member{x+1}" for x in df2.columns[2:]]
Output:
Person-ID  Reference-ID   Member1  Member2  Member3
        1             1       Max    Kevin     Sara
        4             4    Chessi     None     None
        5             9  Fernando     None     None
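A variant without the lambda aggregation (my own sketch of the same idea, using the column names from the PS construction code): number the members within each reference group with cumcount() and pivot on that:
tmp = old_dataframe.assign(n=old_dataframe.groupby('Reference_id').cumcount() + 1)
wide = tmp.pivot(index='Reference_id', columns='n', values='Name').add_prefix('Member')
out = (tmp.groupby('Reference_id', as_index=False)['Person_ID'].min()   # first Person-ID per group
          .join(wide, on='Reference_id'))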

Get value from another df based on condition

I have 2 dataframes.
df1:
ID X Y Cond
Johnson 2 3 fine
Sand NAN NAN sick
Cooper 1 2 fine
Nelson 1 2 fine
Peterson 4 5 fine
and df2 :
id2 X Y
Magic 2 3
Sand 2 3
Cooper 1 2
Dean 1 2
I want to update the X value in df1 if Cond == "sick" and df1["ID"] == df2["id2"],
to get the new df1:
ID X Y Cond
Johnson 2 3 fine
Sand 2 3 sick
Cooper 1 2 fine
Nelson 1 2 fine
Peterson 4 5 fine
I tried :
df1["x"] = np.where((df["cond"]=="sick")& (df1["id"]==df2["id2"]),df2["x"],"")
But it's not working. I get this ValueError:
ValueError: Can only compare identically-labeled Series objects
Thank you
First convert both ID columns to the index so the rows align, then select and set the matching rows with DataFrame.loc:
df11 = df1.set_index('ID')
df22 = df2.set_index('id2')
df11.loc[df11["Cond"]=="sick", ['X','Y']] = df22[['X','Y']]
df = df11.reset_index()
print (df)
ID X Y Cond
0 Johnson 2 3 fine
1 Sand 2 3 sick
2 Cooper 1 2 fine
3 Nelson 1 2 fine
4 Peterson 4 5 fine
You can use the where() method of pandas DataFrames instead of the where function from NumPy. The code looks like this:
df1.loc[:, ["X", "Y"]] = df1.loc[:, ["X", "Y"]].where(df1["Cond"] != "sick", df2.loc[:, ["X", "Y"]])
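A further variation that avoids reindexing both frames (my sketch, assuming the id2 values in df2 are unique): map the sick rows' IDs into df2 and assign column by column:
mask = df1['Cond'].eq('sick')
for col in ['X', 'Y']:
    # look up each sick row's ID in df2 and pull the matching value
    df1.loc[mask, col] = df1.loc[mask, 'ID'].map(df2.set_index('id2')[col])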

python pandas - set column value based on index and/or ID of concatenated dataframes

I have a dataframe built by concatenating at least two dataframes, i.e.:
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[1, 'Type'] = 'C'
When you concat you can assign keys to the dataframes. This creates a MultiIndex whose outer level separates the concatenated dataframes. Then, using .loc with one of those keys selects that group. In the code above we change all the Types of df1 (which has the key 1) to C.
Use merge with indicator=True to find which rows belong to df1 or df2. Next, use np.where to assign C or B.
t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4
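If you control the concatenation step itself, the simplest route (a sketch of my own, not from either answer) is to stamp each frame before concatenating:
# overwrite Type per source frame, then concatenate
df3 = pd.concat([df1.assign(Type='C'), df2.assign(Type='B')])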

Return Value Based on Conditional Lookup on Different Pandas DataFrame

Objective: to lookup value from one data frame (conditionally) and place the results in a different dataframe with a new column name
df_1 = pd.DataFrame({'user_id': [1, 2, 1, 4, 5],
                     'name': ['abc', 'def', 'ghi', 'abc', 'abc'],
                     'rank': [6, 7, 8, 9, 10]})
df_2 = pd.DataFrame({'user_id': [1, 2, 3, 4, 5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and only fill "priority_rank" based on the conditional lookup against df_1, namely the following:
user_id must match between df_1 and df_2
I am only interested in rows where df_1['name'] == 'abc'; everything else should be blank
df_2 should end up looking like this:
user_id  priority_rank
      1              6
      2
      3
      4              9
      5             10
One way to do this:
In []:
df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id), df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example df_1.name=='abc' is a sufficient condition because all values for user_id are identical when df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge
df_2.merge(df_1.loc[df_1.name == 'abc', :], how='left').drop(columns='name')
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
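To match the requested column name on top of this, a possible follow-up (my extension, not part of the original answer):
out = (df_2.merge(df_1.loc[df_1['name'] == 'abc', ['user_id', 'rank']],
                  how='left', on='user_id')
           .rename(columns={'rank': 'priority_rank'}))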
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
    df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
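For the row_number() part specifically, groupby().cumcount() is the direct pandas analogue; a minimal sketch:
rn = df.groupby('id').cumcount()   # 0-based row number within each id
out = df[rn < 2]                   # keep the first 2 rows of each id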
Did you try
df.groupby('id').head(2)
Output generated:
      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1  2    3
   1    2
2  6    4
   5    3
3  7    1
4  8    1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
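For instance, a minimal illustration of that cleanup:
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)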
Sometimes sorting the whole dataframe ahead of time is very time consuming.
We can group first and take the top k rows within each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here sort_values with ascending=False behaves like nlargest, and ascending=True like nsmallest.
The value passed to head is the same as the value given to nlargest: the number of rows to keep per group.
reset_index is optional and not necessary.
Handling duplicated values
If there are duplicates among the top-n values and you want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, we get 110k, 100k and 100k as the top 3 salaries for the Audit department.
If we want non-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8,000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Also, instead of slicing, you can pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])