Label the first element in each groupby - pandas

I have a data frame that looks like the following
df = pd.DataFrame({'group':[1,1,2,2,2],'time':[1,2,3,4,5],'C':[6,7,8,9,10]})
   group  time   C
0      1     1   6
1      1     2   7
2      2     3   8
3      2     4   9
4      2     5  10
and I'm looking to label the first element (in terms of time) in each group as True, i.e.:
   group  time   C  first_in_group
0      1     1   6            True
1      1     2   7           False
2      2     3   8            True
3      2     4   9           False
4      2     5  10           False
I tried several combinations of groupby and first but did not manage to achieve what I wanted.
Is there an elegant way to do it in Pandas?

Use duplicated:
df['first_in_group'] = ~df.group.duplicated()
OUTPUT:
   group  time   C  first_in_group
0      1     1   6            True
1      1     2   7           False
2      2     3   8            True
3      2     4   9           False
4      2     5  10           False
NOTE: Do the sorting first (if required):
df = df.sort_values(['group', 'time'])
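For reference, the groupby route the OP was attempting works too; a minimal sketch, assuming the frame from the question:

import pandas as pd

df = pd.DataFrame({'group': [1, 1, 2, 2, 2], 'time': [1, 2, 3, 4, 5], 'C': [6, 7, 8, 9, 10]})

# Sort so the earliest time comes first within each group, then flag the
# first row of each group: cumcount() numbers the rows of a group from 0.
df = df.sort_values(['group', 'time'])
df['first_in_group'] = df.groupby('group').cumcount() == 0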

Related

Allotting unique identifier to a group of groups in pandas dataframe

Given a frame like this
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 3, 7, 3, 2, 11, 13, 10, 1, 5],
                   'B': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4],
                   'C': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]})
I want to allot a unique identifier to multiple groups in column B: going from the top, every two consecutive B groups should share one identifier, so B values 1 and 2 map to identifier 1, B values 3 and 4 map to identifier 2, and so on.
Currently I am doing it like below, but it seems to be overkill; it's taking too much time to update even 70,000 rows:
b_unique_cnt = df['B'].nunique()
the_list = list(range(1, b_unique_cnt + 1))
slice_size = 2
list_of_slices = zip(*(iter(the_list),) * slice_size)
counter = 1
df['D'] = -1
for i in list_of_slices:
    df.loc[df['B'].isin(i), 'D'] = counter
    counter = counter + 1
df.head(15)
You could do
df['new'] = df.B.factorize()[0] // 2 + 1
# or equivalently: (df.groupby(['B'], sort=False).ngroup() // 2).add(1)
df
Out[153]:
     A  B  C  new
0    1  1  1    1
1    2  1  1    1
2    3  1  1    1
3    4  2  1    1
4    6  2  1    1
5    3  2  1    1
6    7  2  1    1
7    3  3  2    2
8    2  3  2    2
9   11  3  2    2
10  13  3  2    2
11  10  3  2    2
12   1  4  3    2
13   5  4  3    2
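Why this works, as a short sketch (using the df from the question): factorize assigns 0-based integer codes to the B groups in order of appearance, and integer division by 2 merges each pair of consecutive codes into one identifier:

codes = df.B.factorize()[0]   # array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3])
df['new'] = codes // 2 + 1    # codes 0 and 1 -> 1, codes 2 and 3 -> 2, ...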

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
   A  B
0  1  5
1  4  2
2  3  5
3  3  3
4  3  2
5  2  0
6  4  5
7  2  3
8  4  1
9  5  1
I want every unique value in column B to occur at least 3 times. So none of the rows with a B value of 5 are duplicated, the single row with a B value of 0 is duplicated twice, and the rest have one of their two rows duplicated at random.
Here is an example desired output
    A  B
0   1  5
1   4  2
2   3  5
3   3  3
4   3  2
5   2  0
6   4  5
7   2  3
8   4  1
9   5  1
10  4  2
11  2  3
12  2  0
13  2  0
14  4  1
Edit:
The row chosen to be duplicated should be selected at random
To randomly pick rows, I would use groupby-apply with sample on each group. The x in the lambda is one group of B, so repeats - x.shape[0] is the number of rows that still need to be created. Some groups may already have 3 or more rows, so I use np.clip to force negative values to 0; sampling 0 rows is the same as ignoring that group. Finally, reset_index and concatenate back onto df.
import numpy as np
import pandas as pd

repeats = 3
df1 = (df.groupby('B')
         .apply(lambda x: x.sample(n=int(np.clip(repeats - x.shape[0], 0, None)),
                                   replace=True))
         .reset_index(drop=True))
df_final = pd.concat([df, df1]).reset_index(drop=True)  # df.append was removed in pandas 2.x
Out[43]:
    A  B
0   1  5
1   4  2
2   3  5
3   3  3
4   3  2
5   2  0
6   4  5
7   2  3
8   4  1
9   5  1
10  2  0
11  2  0
12  5  1
13  4  2
14  2  3
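As a quick sanity check (a sketch, assuming df_final from the code above): every value in B should now occur at least repeats times.

print(df_final['B'].value_counts().min() >= repeats)   # True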

Comparing two dataframes and outputting the index of the duplicated row once

I need help with comparing two dataframes. For example:
The first dataframe is
df_1 =
   0  1  2  3  4  5
0  1  1  1  1  1  1
1  2  2  2  2  2  2
2  3  3  3  3  3  3
3  4  4  4  4  4  4
4  2  2  2  2  2  2
5  5  5  5  5  5  5
6  1  1  1  1  1  1
7  6  6  6  6  6  6
The second dataframe is
df_2 =
   0  1  2  3  4  5
0  1  1  1  1  1  1
1  2  2  2  2  2  2
2  3  3  3  3  3  3
3  4  4  4  4  4  4
4  5  5  5  5  5  5
5  6  6  6  6  6  6
May I know if there is a way (without using a for loop) to find the indices of the rows of df_1 that have the same row values as df_2? In the example above, my expected output is below:
index =
0
1
2
3
5
7
The "index" variable above should have as many entries as df_2 has rows.
If the same row of df_2 is repeated in df_1 more than once, I only need the index of its first appearance; that's why I don't need indices 4 and 6.
Please help. Thank you so much!
Tommy
Use DataFrame.merge with DataFrame.drop_duplicates, plus DataFrame.reset_index to convert the index to a column so the index values are not lost, and last select the column called index:
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print (s)
0    0
1    1
2    2
3    3
4    5
5    7
Name: index, dtype: int64
Detail:
print (df_2.merge(df_1.drop_duplicates().reset_index()))
   0  1  2  3  4  5  index
0  1  1  1  1  1  1      0
1  2  2  2  2  2  2      1
2  3  3  3  3  3  3      2
3  4  4  4  4  4  4      3
4  5  5  5  5  5  5      5
5  6  6  6  6  6  6      7
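The intermediate frame shows why indices 4 and 6 drop out: drop_duplicates keeps only the first appearance of each row, and reset_index stores the original positions in the index column before the merge consumes the row labels:

print (df_1.drop_duplicates().reset_index())
   index  0  1  2  3  4  5
0      0  1  1  1  1  1  1
1      1  2  2  2  2  2  2
2      2  3  3  3  3  3  3
3      3  4  4  4  4  4  4
4      5  5  5  5  5  5  5
5      7  6  6  6  6  6  6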
Check the solution (the second frame must be named df2, otherwise the last line does not run):
df1 = pd.DataFrame({'0': [1, 2, 3, 4, 2, 5, 1, 6],
                    '1': [1, 2, 3, 4, 2, 5, 1, 6],
                    '2': [1, 2, 3, 4, 2, 5, 1, 6],
                    '3': [1, 2, 3, 4, 2, 5, 1, 6],
                    '4': [1, 2, 3, 4, 2, 5, 1, 6],
                    '5': [1, 2, 3, 4, 2, 5, 1, 6]})
df2 = pd.DataFrame({'0': [1, 2, 3, 4, 5, 6],
                    '1': [1, 2, 3, 4, 5, 6],
                    '2': [1, 2, 3, 4, 5, 6],
                    '3': [1, 2, 3, 4, 5, 6],
                    '4': [1, 2, 3, 4, 5, 6],
                    '5': [1, 2, 3, 4, 5, 6]})
df1[df1.isin(df2)].index.values.tolist()
Output:
[0, 1, 2, 3, 4, 5, 6, 7]

Which rows are duplicates of each other

I have got a dataframe with a lot of columns. Some of the rows are duplicates (on a certain subset of columns).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
   id  A  B  C
0   0  1  2  0
1   1  2  3  4
2   2  1  4  8
3   3  1  2  3
4   4  2  3  5
5   5  5  6  2
and subset is
['A','B']
I expect something like this:
   id  A  B  C
0   0  1  2  0
1   3  1  2  3
2   1  2  3  4
3   4  2  3  5
4   2  1  4  8
5   5  5  6  2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False for a mask with all dupes, then filter by boolean indexing, sort by DataFrame.sort_values, and join together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
   id  A  B  C
0   0  1  2  0
1   3  1  2  3
2   1  2  3  4
3   4  2  3  5
4   2  1  4  8
5   5  5  6  2
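For intuition, here is what the mask m looks like for the example data (a sketch): rows 0 and 3 share (1, 2) in A and B, rows 1 and 4 share (2, 3), and rows 2 and 5 are unique.

print (m.tolist())
[True, True, False, True, True, False]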

Pandas count values inside dataframe

I have a dataframe that looks like this:
   A  B  C
1  1  8  3
2  5  4  3
3  5  8  1
and I want to count the values so as to make a df like this:
   total
1      2
3      2
4      1
5      2
8      2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
   A  B  C
1  1  8  3
2  5  4  3
3  5  8  1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
   total
1      2
3      2
4      1
5      2
8      2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1    2
3    2
4    1
5    2
8    2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
   A  B  C
0  1  5  5
1  8  4  8
2  3  3  1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
   level_1
0
1        2
3        2
4        1
5        2
8        2
Another alternative may be to use stack followed by value_counts; then the result is converted to a frame and finally the index is sorted:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
   total
1      2
3      2
4      1
5      2
8      2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
   0  1
0  1  2
1  3  2
2  4  1
3  5  2
4  8  2
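If readable column names are wanted instead of 0 and 1, the same one-liner can label them at construction time (a small variation on the answer above, not part of the original):

pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)),
             columns=['value', 'total'])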