Replace values of duplicated rows with first record in pandas? - pandas

Input
df
id label
a 1
b 2
a 3
a 4
b 2
b 3
c 1
c 2
d 2
d 3
Expected
df
id label
a 1
b 2
a 1
a 1
b 2
b 2
c 1
c 1
d 2
d 2
For id a, the label value is 1 and id b is 2 because 1 and 2 is the first record for a and b.
Try
I refer this post, but still not solve it.

Update with transform first
df['lb2']=df.groupby('id').label.transform('first')
df
Out[87]:
id label lb2
0 a 1 1
1 b 2 2
2 a 3 1
3 a 4 1
4 b 2 2
5 b 3 2
6 c 1 1
7 c 2 1
8 d 2 2
9 d 3 2

Related

How to create a rolling unique count by group using pandas

I have a dataframe like the following:
group value
1 a
1 a
1 b
1 b
1 b
1 b
1 c
2 d
2 d
2 d
2 d
2 e
I want to create a column with how many unique values there have been so far for the group. Like below:
group value group_value_id
1 a 1
1 a 1
1 b 2
1 b 2
1 b 2
1 b 2
1 c 3
2 d 1
2 d 1
2 d 1
2 d 1
2 e 2
Use custom lambda function with GroupBy.transform and factorize:
df['group_value_id']=df.groupby('group')['value'].transform(lambda x:pd.factorize(x)[0]) + 1
print (df)
group value group_value_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2
because:
df['group_value_id'] = df.groupby('group')['value'].rank('dense')
print (df)
DataError: No numeric types to aggregate
Also cab be solved as :
df['group_val_id'] = (df.groupby('group')['value'].
apply(lambda x:x.astype('category').cat.codes + 1))
df
group value group_val_id
0 1 a 1
1 1 a 1
2 1 b 2
3 1 b 2
4 1 b 2
5 1 b 2
6 1 c 3
7 2 d 1
8 2 d 1
9 2 d 1
10 2 d 1
11 2 e 2

Reorder pandas DataFrame based on repetitive set of integer in index

I have a pandas dataframe contains some columns, I didn't find a way to order rows as follows:
I need to order the dataframe by the field label but in sequential order (like groups)
Input
I category tags
1 A #25-74
1 B #26-170
0 C #29-106
2 A #18-109
3 B #26-86
2 A #26-108
2 C #30-125
1 B #28-145
0 B #29-93
0 D #21-102
1 F #26-108
2 F #30-125
3 A #28-145
3 D #29-93
0 B #21-102
Needed Order:
I category tags
0 C #29-106
1 B #25-74
2 F #18-109
3 C #26-86
0 B #29-93
1 D #26-170
2 B #26-108
3 B #28-145
0 C #21-102
1 D #28-145
2 A #30-125
3 A #29-93
0 B #21-102
1 A #26-108
2 C #30-125
I have searched for different ways to sort but couldn't find a way to sort using only pandas.
I appreciate every help!
One idea with helper column by GroupBy.cumcount and DataFrame.sort_values:
df['a'] = df.groupby('I').cumcount()
df = df.sort_values(['a','I'])
print (df)
I category tags a
2 0 C #29-106 0
0 1 A #25-74 0
3 2 A #18-109 0
4 3 B #26-86 0
8 0 B #29-93 1
1 1 B #26-170 1
5 2 A #26-108 1
12 3 A #28-145 1
9 0 D #21-102 2
7 1 B #28-145 2
6 2 C #30-125 2
13 3 D #29-93 2
14 0 B #21-102 3
10 1 F #26-108 3
11 2 F #30-125 3
Or first sorting by column | and then change order with Series.argsort and DataFrame.iloc:
df = df.sort_values('I')
df = df.iloc[df.groupby('I').cumcount().argsort()]
print (df)
I category tags
2 0 C #29-106
0 1 A #25-74
3 2 A #18-109
4 3 B #26-86
8 0 B #29-93
1 1 B #26-170
5 2 A #26-108
12 3 A #28-145
9 0 D #21-102
7 1 B #28-145
6 2 C #30-125
13 3 D #29-93
14 0 B #21-102
10 1 F #26-108
11 2 F #30-125

How to remove one specific duplicate named column in columns of a dataframe?

I have a sample dataframe df with columns as:
a b c a a b b c c
0 2 2 1 2 2 1 1 2 2
1 2 2 2 2 2 1 2 1 2
. . .
. . .
I want to remove the duplicate columns named with only 'a' and keep other as same
The expected o/p is:
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
Here is a general solution to drop any duplicates of a column, no matter where these columns are in the dataframe and what the content of these columns is.
First we get all column indexes for the given column name and drop the first occurrence. Then we "substract" these indexes from all indexes and return the remaining columns:
to_drop = 'a'
dup = [i for i,v in enumerate(df.columns) if v==to_drop][1:]
df = df.iloc[:, list(set(range(len(df.columns))) - set(dup))]
Result:
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2
df = df.T.reset_index().drop_duplicates().set_index('index').T
del df.columns.name
Exp
since the column a has only dupe values, so we can simply transpose with reset index
df.T.reset_index()
index 0 1
0 a 2 2
1 b 2 2
2 c 1 2
3 b 1 1
4 b 1 2
5 c 2 1
6 c 2 2
Apply drop_duplicate on above df and only the dupes will get removed. It serves the purpose in those instances too where there are more than one column which has dupe value
Output
a b c b b c c
0 2 2 1 1 1 2 2
1 2 2 2 1 2 1 2

compare two column of two dataframe pandas

I have 2 data frames like :
df_out:
a b c d
1 1 2 1
2 1 2 3
3 1 3 5
df_fin:
a e f g
1 0 2 1
2 5 2 3
3 1 3 5
5 2 4 6
7 3 2 5
I want to get result as :
a b c d a e f g
1 1 2 1 1 0 2 1
2 1 2 3 2 5 2 3
3 1 3 5 3 1 3 5
in the other word I have two diffrent data frames that are common in one column(a), I want two compare this two columns(df_fin.a and df_out.a) and select the rows from df_fin that have the same value in column a and create new dataframe that has selected rows from df_fin and added columns from df_out ?
I think you need merge with left join:
df = pd.merge(df_out, df_fin, on='a', how='left')
print (df)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5
EDIT:
df1 = df_fin[df_fin['a'].isin(df_out['a'])]
df2 = df_out.join(df1.set_index('a'), on='a')
print (df2)
a b c d e f g
0 1 1 2 1 0 2 1
1 2 1 2 3 5 2 3
2 3 1 3 5 1 3 5

column names to column, pandas

What is an apposite function of pivot in Pandas?
For example I have
a b c
1 1 2
2 2 3
3 1 2
What I want
a newcol newcol2
1 b 1
1 c 2
2 b 2
2 c 3
3 b 1
3 c 2
use pd.melt http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
import pandas as pd
df = pd.DataFrame({'a':[1,2,3],'b':[1,2,1],'c':[2,3,2]})
pd.melt(df,id_vars=['a'])
Out[8]:
a variable value
0 1 b 1
1 2 b 2
2 3 b 1
3 1 c 2
4 2 c 3
5 3 c 2