Reset 'Id' value of appended Dataframe - pandas

I have appended multiple dataframes to form a single dataframe. Each dataframe had multiple rows assigned a specific ID. After appending, the big dataframe has multiple rows with the same ID. I would like to assign new IDs.
Current Dataframe:
Index name groupid
0 Abc 0
1 cvb 0
2 sdf 0
3 ksh 1
4 kjl 1
5 lmj 2
6 hyb 2
0 khf 0
1 uyt 0
2 tre 1
3 awe 1
4 uys 2
5 asq 2
6 lsx 2
Desired Output:
Index name groupid new_id
0 Abc 0 0
1 cvb 0 0
2 sdf 0 0
3 ksh 1 1
4 kjl 1 1
5 lmj 2 2
6 hyb 2 2
7 khf 0 3
8 uyt 0 3
9 tre 1 4
10 awe 1 4
11 uys 2 5
12 asq 2 5
13 lsx 2 5

You would have to use a slightly modified version of groupby, grouping on consecutive runs of groupid rather than on its raw values:
df['new_id'] = (df.groupby(df['groupid'].ne(df['groupid'].shift()).cumsum(), sort=False)
                  .ngroup())
Output is:
Index name groupid new_id
0 0 Abc 0 0
1 1 cvb 0 0
2 2 sdf 0 0
3 3 ksh 1 1
4 4 kjl 1 1
5 5 lmj 2 2
6 6 hyb 2 2
7 0 khf 0 3
8 1 uyt 0 3
9 2 tre 1 4
10 3 awe 1 4
11 4 uys 2 5
12 5 asq 2 5
13 6 lsx 2 5
See previous answer for reference.
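One caveat with the run-based grouping: if the last groupid of one appended frame equals the first groupid of the next, the two groups fall into the same run and share a new_id. A minimal sketch that avoids this, assuming the original frames are still available before appending and that each groupid is used for only one group within a frame. The frame names a and b and the helper column src are hypothetical, not part of the question:

```python
import pandas as pd

# Two small frames mirroring the question's data (names a, b are illustrative)
a = pd.DataFrame({"name": ["Abc", "cvb", "sdf", "ksh", "kjl", "lmj", "hyb"],
                  "groupid": [0, 0, 0, 1, 1, 2, 2]})
b = pd.DataFrame({"name": ["khf", "uyt", "tre", "awe", "uys", "asq", "lsx"],
                  "groupid": [0, 0, 1, 1, 2, 2, 2]})

frames = [a, b]
# Tag each row with the position of its source frame (hypothetical column "src"),
# and rebuild a clean 0..n-1 index while appending
big = pd.concat([f.assign(src=i) for i, f in enumerate(frames)], ignore_index=True)

# sort=False numbers the (src, groupid) pairs in order of first appearance
big["new_id"] = big.groupby(["src", "groupid"], sort=False).ngroup()
big = big.drop(columns="src")
print(big)
```

With ignore_index=True the index also runs from 0 to 13, as in the desired output.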

Related

How to compute column sum on the basis of other column value in pandas dataframe?

P T1 T2 T3
0 1 2 3
1 1 2 0
2 3 1 2
3 1 0 2
In the above pandas dataframe df,
I want to add columns on the basis of the value of column 'P'.
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to add from T1 to TN if df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values according to the P column, select the relevant columns with DataFrame.filter, build a mask by broadcasting np.arange over the number of filtered columns and comparing it with the P values, pass this mask to DataFrame.where, and finally sum per row:
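import numpy as np
import pandas as pd

np.random.seed(20)
cols = [f'{x}{i + 1}' for x in ['T','U','V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10,10)), columns=['P'] + cols)
# column vector of P values, so it broadcasts against the column positions
arrP = df['P'].to_numpy()[:, None]
for c in ['T','U','V']:
    # columns of the current family, e.g. T1, T2, T3
    df1 = df.filter(regex=rf'^{c}')
    # keep the first P columns of each row, zero out the rest, then sum per row
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print (df)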
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
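The same masking idea can be checked against the small table from the question. This is only a sketch: the frame is typed in by hand from the question, and the names t, keep and T_SUM are illustrative choices.

```python
import numpy as np
import pandas as pd

# The four-row table from the question
df = pd.DataFrame({"P":  [0, 1, 2, 3],
                   "T1": [1, 1, 3, 1],
                   "T2": [2, 2, 1, 0],
                   "T3": [3, 0, 2, 2]})

t = df[["T1", "T2", "T3"]]
# keep[i, j] is True when column position j is among the first P[i] columns
keep = np.arange(t.shape[1]) < df["P"].to_numpy()[:, None]
# Zero out the columns beyond P, then sum each row
df["T_SUM"] = t.where(keep, 0).sum(axis=1)
print(df["T_SUM"].tolist())  # [0, 1, 4, 3]
```

The result matches the values the question works out by hand (0, 1, 4 and 3).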

Dataframe within a Dataframe - to create new column

For the following dataframe:
import pandas as pd
df=pd.DataFrame({'list_A':[3,3,3,3,3,\
2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 2 1
5 2 1
6 2 0
7 2 0
8 4 1
9 4 1
10 4 1
11 4 1
12 4 0
13 4 0
14 4 0
15 4 0
16 4 0
As you can see, if list_A has the number 3, then the first 3 values of list_B are 1, after which list_B changes to 0 until list_A changes value again.
Use GroupBy.cumcount to number the rows within each group, then compare that counter with list_A:
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
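EDIT: if the same value can appear in more than one block, group by consecutive blocks instead of by value: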
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)
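The difference between the two variants only shows up when a value recurs in a later block. A small sketch with made-up data (the column names by_value and by_block are illustrative):

```python
import pandas as pd

# Hypothetical data where the value 2 appears in two separate blocks
df = pd.DataFrame({"list_A": [2, 2, 2, 3, 3, 3, 3, 2, 2, 2]})

# Grouping by value: the counter for 2 keeps running across both blocks
df["by_value"] = df["list_A"].gt(df.groupby("list_A").cumcount()).astype(int)

# Grouping by consecutive block: the counter restarts whenever list_A changes
blocks = df["list_A"].ne(df["list_A"].shift()).cumsum()
df["by_block"] = df["list_A"].gt(df.groupby(blocks).cumcount()).astype(int)

print(df)
```

In by_value the second block of 2s is all 0, because its counter continues from the first block; in by_block it starts again with two 1s.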

pandas aggregate based on continuous same rows

Suppose I have this data frame and I want to sum the values in column 'a' over each run of consecutive rows that share the same label.
a label
0 1 0
1 3 0
2 5 0
3 2 1
4 2 1
5 2 1
6 3 0
7 3 0
8 4 1
The desired result will be:
a label
0 9 0
1 6 1
2 6 0
3 4 1
and not this:
a label
0 15 0
1 10 1
IIUC
s=df.groupby(df.label.diff().ne(0).cumsum()).agg({'a':'sum','label':'first'})
s
Out[280]:
a label
label
1 9 0
2 6 1
3 6 0
4 4 1
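For reference, a self-contained sketch of the same idea that also produces the 0-based index from the desired result. It uses ne(shift()) rather than diff(), on the assumption that labels might not always be numeric; the names runs and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a":     [1, 3, 5, 2, 2, 2, 3, 3, 4],
                   "label": [0, 0, 0, 1, 1, 1, 0, 0, 1]})

# A new run starts wherever the label differs from the previous row
runs = df["label"].ne(df["label"].shift()).cumsum().rename("run")

# Sum 'a' per run and keep the run's label, then drop the run counter
out = (df.groupby(runs)
         .agg(a=("a", "sum"), label=("label", "first"))
         .reset_index(drop=True))
print(out)
```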

which rows are duplicates to each other

I have a data frame with a lot of columns. Some of the rows are duplicates (on a certain subset of columns).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to build a mask of all duplicates, then filter by boolean indexing, sort with DataFrame.sort_values, and join the two parts with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
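If the goal is to see which row duplicates which, rather than only to reorder the frame, one possible sketch is to label every row with a group number and list the ids per duplicated key. The column name dup_group and the variable ids_per_group are hypothetical choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({"id": [0, 1, 2, 3, 4, 5],
                   "A":  [1, 2, 1, 1, 2, 5],
                   "B":  [2, 3, 4, 2, 3, 6],
                   "C":  [0, 4, 8, 3, 5, 2]})
L = ["A", "B"]

# Rows with the same (A, B) pair get the same number, in order of first appearance
df["dup_group"] = df.groupby(L, sort=False).ngroup()

# For keys that occur more than once, collect the ids that share them
m = df.duplicated(L, keep=False)
ids_per_group = df[m].groupby(L)["id"].apply(list)
print(ids_per_group)
```

Here the ids 0 and 3 share the key (1, 2), and the ids 1 and 4 share the key (2, 3).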

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count how often each value occurs, to produce a df like this:
total
1 2
3 2
4 1
5 2
8 2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
Another alternative is to use stack followed by value_counts, convert the result to a frame, and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique with return_counts=True and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
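To compare the approaches on the frame from the question, here is a short self-contained sketch. The variable names and the column labels value and total are illustrative; naming the columns avoids the unlabeled 0 and 1 headers above.

```python
import numpy as np
import pandas as pd

# The frame from the question, including its 1-based index
df = pd.DataFrame({"A": [1, 5, 5], "B": [8, 4, 8], "C": [3, 3, 1]},
                  index=[1, 2, 3])

# pandas route: flatten to one Series, count, order by value
via_pandas = df.stack().value_counts().sort_index()

# NumPy route: unique values and their counts, with named columns
values, counts = np.unique(df.to_numpy(), return_counts=True)
via_numpy = pd.DataFrame({"value": values, "total": counts})

print(via_pandas)
print(via_numpy)
```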