Allotting a unique identifier to a group of groups in a pandas dataframe

Given a frame like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 3, 7, 3, 2, 11, 13, 10, 1, 5],
                   'B': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4],
                   'C': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3]})
I want to allot a unique identifier to groups in column B, two at a time, going from the top: B groups 1 and 2 get identifier 1, groups 3 and 4 get identifier 2, and so on.
Currently I am doing it like below, but it seems to be overkill; it takes too much time to update even 70,000 rows:
b_unique_cnt = df['B'].nunique()
the_list = list(range(1, b_unique_cnt+1))
slice_size = 2
list_of_slices = zip(*(iter(the_list),) * slice_size)
counter = 1
df['D'] = -1
for i in list_of_slices:
    df.loc[df['B'].isin(i), 'D'] = counter
    counter = counter + 1
df.head(15)

You could do
df['new'] = df.B.factorize()[0] // 2 + 1
# or: (df.groupby(['B'], sort=False).ngroup() // 2).add(1)
df
Out[153]:
A B C new
0 1 1 1 1
1 2 1 1 1
2 3 1 1 1
3 4 2 1 1
4 6 2 1 1
5 3 2 1 1
6 7 2 1 1
7 3 3 2 2
8 2 3 2 2
9 11 3 2 2
10 13 3 2 2
11 10 3 2 2
12 1 4 3 2
13 5 4 3 2
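Why this works: factorize assigns consecutive integers to the distinct values of B in order of first appearance, so integer division by 2 collapses each pair of consecutive B groups into one identifier. A minimal sketch with made-up data:
import pandas as pd

b = pd.Series([1, 1, 2, 2, 2, 3, 3, 4])
codes = b.factorize()[0]   # array([0, 0, 1, 1, 1, 2, 2, 3])
print(codes // 2 + 1)      # [1 1 1 1 1 2 2 2] -> one identifier per pair of B groups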

Related

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with the index kept (see example here), is there a way I can assign a count variable for each iteration in column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df] * 5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
import numpy as np

N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
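As a side note, the same counter can also come out of concat itself via its keys argument, skipping the separate np.repeat step. A sketch assuming the same df and N as above (note that c lands as the first column this way):
new_df = (pd.concat([df] * N, keys=range(1, N + 1), names=['c'])
            .reset_index(level='c'))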

Change all values in a dataframe with values from another dataframe

I just started learning pandas.
I have 2 dataframes.
The first one is
val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4
and the second one is
0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2
I want to change my second dataframe so that its values are compared with the val column of the first dataframe, and every value that matches is replaced with the corresponding value from the num column of dataframe 1. In the end I need to get the following dataframe:
0 1 2 3
0 0 1 2 3
1 4 2 1 1
2 1 4 2 1
How do I do that in pandas?
You can use DataFrame.replace() to do this:
df2.replace(df1.set_index('val')['num'])
Explanation:
The first step is to set the val column of the first DataFrame as the index. This will change how the matching is performed in the third step.
Convert the first DataFrame to a Series by subsetting to the num column (with val as the index). It looks like this:
val
1 0
2 1
3 2
4 3
5 4
Name: num, dtype: int64
Next, use DataFrame.replace() to do the replacement in the second DataFrame. It looks up each value from the second DataFrame, finds a matching index in the Series, and replaces it with the value from the Series.
Full reproducible example:
import pandas as pd
import io
s = """ val num
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4"""
df1 = pd.read_csv(io.StringIO(s), sep=r'\s+')
s = """ 0 1 2 3
0 1 2 3 4
1 5 3 2 2
2 2 5 3 2"""
df2 = pd.read_csv(io.StringIO(s), sep=r'\s+')
print(df2.replace(df1.set_index('val')['num']))
Create the mapping dict, then replace:
mpd = dict(zip(df1.val, df1.num))
df2.replace(mpd, inplace=True)
0 1 2 3
0 0 1 2 3
1 4 2 1 1
2 1 4 2 1
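One detail worth knowing about both variants: replace only touches values that appear in the mapping, so anything absent from val passes through unchanged. A quick check:
pd.DataFrame({'x': [1, 99]}).replace({1: 0})
#     x
# 0   0
# 1  99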

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
I want every unique value in column B to occur at least 3 times, so none of the rows with a B value of 5 are duplicated, the row with a B value of 0 is duplicated twice, and the rest have one of their two rows duplicated at random.
Here is an example desired output
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 4 2
11 2 3
12 2 0
13 2 0
14 4 1
Edit:
The row chosen to be duplicated should be selected at random
To pick rows at random, I would use groupby-apply with sample on each group. The x in the lambda is each group of B, so repeats - x.shape[0] gives the number of rows to create. A group of B may already have 3 or more rows, so np.clip forces negative values to 0; sampling 0 rows is the same as ignoring that group. Finally, reset_index and concatenate back onto df:
import numpy as np

repeats = 3
df1 = (df.groupby('B')
         .apply(lambda x: x.sample(n=int(np.clip(repeats - x.shape[0], 0, np.inf)),
                                   replace=True))
         .reset_index(drop=True))
df_final = pd.concat([df, df1]).reset_index(drop=True)  # df.append was removed in pandas 2.0
Out[43]:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 2 0
11 2 0
12 5 1
13 4 2
14 2 3
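If the groupby-apply feels opaque, here is a sketch of an equivalent approach that computes the per-value shortfall explicitly with value_counts (assuming, as above, that at least one value of B falls short of 3 rows):
need = (3 - df['B'].value_counts()).clip(lower=0)  # rows still missing per B value
extra = pd.concat(df[df['B'] == val].sample(n=int(n), replace=True)
                  for val, n in need.items() if n > 0)
df_final = pd.concat([df, extra], ignore_index=True)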

Which rows are duplicates of each other

I have got a dataframe with a lot of columns, and some of the rows are duplicates (on a certain subset of columns).
Now I want to find out which row duplicates which row, and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to get a mask of all duplicates, then filter by boolean indexing, sort with DataFrame.sort_values, and join back together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print(df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
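A variation, if you don't mind the unique rows also being ordered by the subset: use the inverted mask as a leading sort key, so a single sort_values call does both steps. A sketch reusing m and L from above:
out = (df.assign(is_unique=~m)      # False (the dupes) sorts before True
         .sort_values(['is_unique'] + L)
         .drop(columns='is_unique')
         .reset_index(drop=True))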

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values so as to make a df like this:
total
1 2
3 2
4 1
5 2
8 2
Is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=True)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
   level_1
0
1        2
3        2
4        1
5        2
8        2
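To line this result up with the total column the question asks for, the leftover labels can be renamed; a small follow-up sketch using df1 from above:
out = (df1.groupby(0).count()
          .rename(columns={'level_1': 'total'})
          .rename_axis(None))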
Another alternative is to use stack, followed by value_counts, then convert the result to a frame and finally sort the index:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
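To get the same labeled output as the earlier answers from this one-liner, you could name the columns and promote the values to the index; a sketch:
out = (pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)),
                    columns=['value', 'total'])
         .set_index('value'))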