Pandas Dataframe and duplicate names [duplicate] - pandas

This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 4 years ago.
I have a Pandas dataframe with some numerical data about people.
I need to find the people that appear more than once in the dataframe and replace all of their rows with a single row whose numeric values are the sums of the original rows.
Example:
Names Column1 Column2
John 1 2
Bob 2 3
Pier 1 1
John 3 3
Bob 1 0
This has to become:
Names Column1 Column2
John 4 5
Bob 3 3
Pier 1 1
How can I do this?

Try this:
In [975]: df.groupby('Names')[['Column1','Column2']].sum()
Out[975]:
Column1 Column2
Names
Bob 3 3
John 4 5
Pier 1 1
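Note that this leaves Names in the index of the result. If you want Names back as a regular column, as_index=False (or a reset_index() afterwards) should do it; a small sketch:
df.groupby('Names', as_index=False)[['Column1', 'Column2']].sum()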

groupby and sum should do the job
df.groupby('Names').sum().sort_values('Column1', ascending=False)
Column1 Column2
Names
John 4 5
Bob 3 3
Pier 1 1

Create a new column pandas based on another column condition [duplicate]

This question already has an answer here:
increment a value each time the next row is different from the previous one
(1 answer)
Closed 3 months ago.
I want to create a new column, say "group id", built as follows:
compare the nth row with the (n-1)th row;
if the two records are equal, copy the previous "group id";
if they are not equal, add 1 to the previous "group id".
I want the result to look like the following.
The expected result:
Column A    Column B
6-Aug-10    0
30-Aug-11   1
31-Aug-11   2
31-Aug-11   2
6-Sep-12    3
30-Aug-13   4
I'm looking for a result similar to this Excel formula:
=IF(T3=T2, U2, U2+1)
you can use ngroup:
df['Group ID']=df.groupby('DOB').ngroup()
#according to your example
df['Group ID']=df.groupby('Column A').ngroup()
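One caveat worth noting (not part of the original answer): ngroup numbers the groups by the sorted order of the keys by default. That matches the expected output here only if Column A sorts in order of appearance (for example, as real datetimes); with plain date strings the sort order can differ. Passing sort=False numbers the groups in order of first appearance instead:
# Number groups in order of first appearance, independent of key sort order
df['Group ID'] = df.groupby('Column A', sort=False).ngroup()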
Use factorize. Unlike comparing shifted values with Series.cumsum and subtracting 1, it does not count consecutive runs separately: equal values get the same id even when they reappear later. The shift/cumsum version is the one that behaves like the Excel formula in the question, because it only compares each row with the previous one:
print (df)
Column A Column B
0 6-Aug-10 0
1 30-Aug-11 1
2 31-Aug-11 2
3 31-Aug-11 2
4 6-Sep-12 3
5 30-Aug-13 4
6 30-Aug-11 5 <- row added to show the difference
7 31-Aug-11 6 <- row added to show the difference
df['Group ID1'] = pd.factorize(df['Column A'])[0]
df['Group ID2'] = df['Column A'].ne(df['Column A'].shift()).cumsum().sub(1)
print (df)
Column A Column B Group ID1 Group ID2
0 6-Aug-10 0 0 0
1 30-Aug-11 1 1 1
2 31-Aug-11 2 2 2
3 31-Aug-11 2 2 2
4 6-Sep-12 3 3 3
5 30-Aug-13 4 4 4
6 30-Aug-11 5 1 5
7 31-Aug-11 6 2 6

Convert subset of rows to column pyspark dataframe

Suppose we have the following df
Id PlaceCod Val
1 1 0
1 2 3
2 2 4
2 1 5
3 1 6
How can I convert this DF to this one:
Id Store Warehouse
1 0 3
2 5 4
3 6 null
I've tried to use df.pivot(f.col("PlaceCod")) but got error message 'DataFrame has no pivot attribute'
As posted by @Emma in the comments:
df.groupby('Id').pivot('PlaceCod').agg(F.first('Val'))
Using the above solution, my problem was solved!
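For completeness, a minimal self-contained sketch of that approach. The final renaming assumes PlaceCod 1 corresponds to Store and 2 to Warehouse, which is only inferred from the expected output:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, 0), (1, 2, 3), (2, 2, 4), (2, 1, 5), (3, 1, 6)],
    ["Id", "PlaceCod", "Val"],
)

# Pivot PlaceCod values into columns, taking the first Val per (Id, PlaceCod) pair
out = df.groupBy("Id").pivot("PlaceCod").agg(F.first("Val"))

# Assumed mapping, inferred from the expected output: PlaceCod 1 -> Store, 2 -> Warehouse
out = out.withColumnRenamed("1", "Store").withColumnRenamed("2", "Warehouse")
out.show()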

python pandas - set column value based on index and/or ID of concatenated dataframes

I have a dataframe that is the concatenation of at least two dataframes:
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[1, 'Type'] = 'C'
When you concat you can assign keys to the dataframes. This creates a MultiIndex whose first level separates the concatenated dataframes. Then, when you use .loc with one of those keys, you select that whole group. In the code above we change all the Types of df1 (which has the key 1) to C.
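A short self-contained sketch of that idea, using the question's sample frames; the droplevel call at the end is optional, just to get back to a plain index:
import pandas as pd

df1 = pd.DataFrame({'Name': ['Joe', 'Fred', 'Mike', 'Frank'],
                    'Type': ['A', 'B', 'Both', 'Both'],
                    'ID': [1, 2, 3, 4]})
df2 = pd.DataFrame({'Name': ['Bill', 'Jill', 'Mill', 'Hill'],
                    'Type': ['Both', 'Both', 'B', 'A'],
                    'ID': [1, 2, 3, 4]})

# Concatenate with keys so each source frame keeps its own label in the index
df3 = pd.concat([df1, df2], keys=(1, 2))

# Set Type per source group via the first index level
df3.loc[1, 'Type'] = 'C'
df3.loc[2, 'Type'] = 'B'

# Optional: drop the key level to return to a plain index
df3 = df3.droplevel(0)
print(df3)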
Use merge with indicator=True to find which rows belong to df1 and which to df2, then use np.where to assign C or B.
t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4
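If the original frames are still available, a simpler alternative (a sketch, not the answer above) is to set Type on each frame while concatenating:
concat_df = pd.concat([df1.assign(Type='C'), df2.assign(Type='B')])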

Pandas: get the unique value with the biggest index [duplicate]

This question already has answers here:
group by pandas dataframe and select latest in each group
(6 answers)
Closed 4 years ago.
I have a df like this
Name Data
0 Mike 123
1 Mike 456
2 Mike 789
3 Fred 345
4 Fred 123
5 Ted 333
I need to get each unique Name, keeping the row with the largest index value.
output:
Name Data
0 Mike 789
1 Fred 123
2 Ted 333
Step 1: Import pandas.
import pandas as pd
Step 2: Copy the OP's df values to the clipboard.
Step 3: Run the following command to create a dataframe from the OP's sample.
df = pd.read_clipboard()
Step 4: Run the following code to remove duplicates, keeping the last value for each Name.
df.drop_duplicates(subset='Name', keep='last')
Output will be as follows.
Name Data
2 Mike 789
4 Fred 123
5 Ted 333
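A self-contained sketch of the same idea; the reset_index at the end just renumbers the rows 0, 1, 2 to match the output shown in the question:
import pandas as pd

df = pd.DataFrame({'Name': ['Mike', 'Mike', 'Mike', 'Fred', 'Fred', 'Ted'],
                   'Data': [123, 456, 789, 345, 123, 333]})

# Keep the last occurrence of each Name, then renumber the index
out = df.drop_duplicates(subset='Name', keep='last').reset_index(drop=True)
print(out)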

Update one dataframe from another dataframe [duplicate]

I have a situation like this: an original pandas dataframe, for example:
columnA columnB
1 2
1 3
Then, after an update, the table looks like this:
columnA columnB columnC
2 3 2
2 4 3
1 3 3
However, I want to keep the original table, so what I want is shown below: only the new things (the third column and the third row) are taken from the update.
columnA columnB columnC
1 2 2
1 3 3
1 3 3
So which kind of merge should I use to expand the original table with the new rows and columns? Thanks!
I think you need combine_first, which converts integer columns to floats, so astype is added:
df = df1.combine_first(df2).astype(int)
print (df)
columnA columnB columnC
0 1 2 2
1 1 3 3
2 1 3 3
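A self-contained sketch of that answer, using the question's sample data (df1 is the original table, df2 the updated one):
import pandas as pd

df1 = pd.DataFrame({'columnA': [1, 1], 'columnB': [2, 3]})
df2 = pd.DataFrame({'columnA': [2, 2, 1], 'columnB': [3, 4, 3], 'columnC': [2, 3, 3]})

# Values from df1 win wherever they exist; the new row and new column come from df2.
# combine_first upcasts ints to floats where NaNs appear, hence the astype back to int.
df = df1.combine_first(df2).astype(int)
print(df)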