How to shuffle blocks by column (id) but keep the descending order - pandas

I have a dataframe which is structured as follows:
>>>df
a b id
0 1 4 3
1 4 1 2
2 7 5 1
3 2 9 3
4 4 11 2
5 2 7 1
6 3 4 2
7 9 2 1
I have added line breaks in the code for readability.
Now I want to shuffle by id but keep the initial descending order within each id block intact. What is the best way?
A possible output would look like following:
>>>df
a b id
0 3 4 2
1 9 2 1
2 2 9 3
3 4 11 2
4 2 7 1
5 1 4 3
6 4 1 2
7 7 5 1
So in principle I just want the blocks to be mixed, i.e. randomly moved to other positions.

Create groups by the difference in id - each group starts where the difference is not -1 - then get the unique group ids, shuffle them and change the ordering with DataFrame.loc:
df['g'] = df['id'].diff().ne(-1).cumsum()
# if the difference is not always exactly -1, use this more general variant
df['g'] = df['id'].ge(df['id'].shift()).cumsum()
print (df)
a b id g
0 1 4 3 1
1 4 1 2 1
2 7 5 1 1
3 2 9 3 2
4 4 11 2 2
5 2 7 1 2
6 3 4 2 3
7 9 2 1 3
ids = df['g'].unique()
np.random.shuffle(ids)
df = df.set_index('g').loc[ids].reset_index(drop=True)
print (df)
a b id
0 1 4 3
1 4 1 2
2 7 5 1
3 3 4 2
4 9 2 1
5 2 9 3
6 4 11 2
7 2 7 1
If you need to keep the groups in a helper column for checking, change the last reset_index(drop=True) to reset_index():
ids = df['g'].unique()
np.random.shuffle(ids)
df = df.set_index('g').loc[ids].reset_index()
print (df)
g a b id
0 2 3 4 2
1 2 9 2 1
2 1 2 9 3
3 1 4 11 2
4 1 2 7 1
5 0 1 4 3
6 0 4 1 2
7 0 7 5 1
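Putting it together, a self-contained version of the first approach (a minimal sketch; the seed is only there to make the shuffle reproducible):
import numpy as np
import pandas as pd

np.random.seed(0)  # only for reproducibility of the shuffle

df = pd.DataFrame({'a': [1, 4, 7, 2, 4, 2, 3, 9],
                   'b': [4, 1, 5, 9, 11, 7, 4, 2],
                   'id': [3, 2, 1, 3, 2, 1, 2, 1]})

# each block starts where id stops decreasing by exactly 1
df['g'] = df['id'].diff().ne(-1).cumsum()

ids = df['g'].unique()
np.random.shuffle(ids)  # shuffle the block labels in place
df = df.set_index('g').loc[ids].reset_index(drop=True)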
Performance: On the sample data, I guess the repeated sorting is the reason for the slower performance of the other solution.
#4k rows
df = pd.concat([df] * 500, ignore_index=True)
print (df)
In [70]: %%timeit
...: out = df.assign(order=df['id'].ge(df['id'].shift()).cumsum()).sample(frac=1)
...: cat = pd.CategoricalDtype(out['order'].unique(), ordered=True)
...: out = out.astype({'order': cat}).sort_values(['order', 'id'], ascending=False)
...:
6.13 ms ± 845 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df['g'] = df['id'].diff().ne(-1).cumsum()
ids = df['g'].unique()
np.random.shuffle(ids)
df.set_index('g').loc[ids].reset_index(drop=True)
3.93 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Use an ordered categorical column to sort the values by block:
out = df.assign(order=df['id'].ge(df['id'].shift()).cumsum()).sample(frac=1)
cat = pd.CategoricalDtype(out['order'].unique(), ordered=True)
out = out.astype({'order': cat}).sort_values(['order', 'id'], ascending=False)
print(out)
# Output:
a b id order
0 1 4 3 0
1 4 1 2 0
2 7 5 1 0
6 3 4 2 2
7 9 2 1 2
3 2 9 3 1
4 4 11 2 1
5 2 7 1 1
Obviously, you can remove the order column by appending .drop(columns='order') after sort_values, but I keep it here for demonstration purposes.
The key here is to set ordered=True to your new categorical dtype.
>>> cat
CategoricalDtype(categories=[1, 2, 0], ordered=True)
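To verify that the within-block order survived the shuffle, a quick check (a sketch, assuming the out frame from above):
# every block should still be internally descending in id
print(out.groupby('order', observed=True)['id']
         .apply(lambda s: s.is_monotonic_decreasing).all())
# expected: True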

Related

Insert a level 0 in the existing data frame such that 4 columns are grouped as one

I want to do multi-indexing for my data frame such that MAE, MSE, RMSE, MPE are grouped together and given a new index level. Similarly, the rest of the four should be grouped together in the same level but under a different name.
mux3 = pd.MultiIndex.from_product([list('ABCD'), list('1234')],
                                  names=['one', 'two'])  # dummy data
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)  # dummy data frame
print(df3)  # intended output required for the data frame in the picture given below
Assuming the column groups are already in the appropriate order, we can simply create an np.arange over the length of the columns and floor divide by 4 to get the groups, then create a simple MultiIndex.from_arrays.
Sample Input and Output:
import numpy as np
import pandas as pd
initial_index = [1, 2, 3, 4] * 3
np.random.seed(5)
df3 = pd.DataFrame(
    np.random.choice(10, (3, len(initial_index))), columns=initial_index
)
1 2 3 4 1 2 3 4 1 2 3 4 # Column headers are in repeating order
0 3 6 6 0 9 8 4 7 0 0 7 1
1 5 7 0 1 4 6 2 9 9 9 9 1
2 2 7 0 5 0 0 4 4 9 3 2 4
# Create new columns
df3.columns = pd.MultiIndex.from_arrays([
    np.arange(len(df3.columns)) // 4,  # group each set of 4 columns together
    df3.columns                        # keep level 1 the same as the current columns
], names=['one', 'two'])               # set names (optional)
df3
one 0 1 2
two 1 2 3 4 1 2 3 4 1 2 3 4
0 3 6 6 0 9 8 4 7 0 0 7 1
1 5 7 0 1 4 6 2 9 9 9 9 1
2 2 7 0 5 0 0 4 4 9 3 2 4
If columns are in mixed order:
np.random.seed(5)
df3 = pd.DataFrame(
    np.random.choice(10, (3, 8)), columns=[1, 1, 3, 2, 4, 3, 2, 4]
)
df3
1 1 3 2 4 3 2 4 # Cannot select groups positionally
0 3 6 6 0 9 8 4 7
1 0 0 7 1 5 7 0 1
2 4 6 2 9 9 9 9 1
We can convert with Index.to_series, then enumerate the columns using groupby cumcount, then sort_index if needed to get them in order:
df3.columns = pd.MultiIndex.from_arrays([
    # enumerate groups to create the new level 0 index
    df3.columns.to_series().groupby(df3.columns).cumcount(),
    df3.columns
], names=['one', 'two'])  # set names (optional)
# Sort to order correctly
# (Do not sort before setting the columns; it will break alignment with the data)
df3 = df3.sort_index(axis=1)
df3
one 0 1
two 1 2 3 4 1 2 3 4 # Notice Data has moved with headers
0 3 0 6 9 6 4 8 7
1 0 1 7 5 0 0 7 1
2 4 9 2 9 6 9 9 1
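Once the MultiIndex is in place, a whole group of four columns can be selected by its level 0 key, for example:
print(df3[0])  # all columns under group 0
print(df3.xs(1, axis=1, level='one'))  # the same kind of selection via xs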

Comparing two dataframe and output the index of the duplicated row once

I need help with comparing two dataframes. For example:
The first dataframe is
df_1 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 2 2 2 2 2 2
5 5 5 5 5 5 5
6 1 1 1 1 1 1
7 6 6 6 6 6 6
The second dataframe is
df_2 =
0 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
3 4 4 4 4 4 4
4 5 5 5 5 5 5
5 6 6 6 6 6 6
May I know if there is a way (without using a for loop) to find the indices of the rows of df_1 that have the same row values as df_2? In the example above, my expected output is below:
index =
0
1
2
3
5
7
The "index" variable above should have the same length as the number of rows of df_2.
If the same row of df_2 is repeated in df_1 more than once, I only need the index of its first appearance; that's why I don't need the indices 4 and 6.
Please help. Thank you so much!
Tommy
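For reference, the two frames above can be reproduced like this (a minimal sketch):
import pandas as pd

df_1 = pd.DataFrame([[i] * 6 for i in [1, 2, 3, 4, 2, 5, 1, 6]])
df_2 = pd.DataFrame([[i] * 6 for i in [1, 2, 3, 4, 5, 6]])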
Use DataFrame.merge with DataFrame.drop_duplicates and DataFrame.reset_index to convert the index to a column (so the index values are not lost), and finally select the column called index:
s = df_2.merge(df_1.drop_duplicates().reset_index())['index']
print (s)
0 0
1 1
2 2
3 3
4 5
5 7
Name: index, dtype: int64
Detail:
print (df_2.merge(df_1.drop_duplicates().reset_index()))
0 1 2 3 4 5 index
0 1 1 1 1 1 1 0
1 2 2 2 2 2 2 1
2 3 3 3 3 3 3 2
3 4 4 4 4 4 4 3
4 5 5 5 5 5 5 5
5 6 6 6 6 6 6 7
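If df_2 might contain rows that never occur in df_1, a left merge keeps them visible as NaN in the index column (a hedged variant of the same idea):
s = df_2.merge(df_1.drop_duplicates().reset_index(), how='left')['index']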
Check this solution:
df1=pd.DataFrame({'0':[1,2,3,4,2,5,1,6],
'1':[1,2,3,4,2,5,1,6],
'2':[1,2,3,4,2,5,1,6],
'3':[1,2,3,4,2,5,1,6],
'4':[1,2,3,4,2,5,1,6],
'5':[1,2,3,4,2,5,1,6]})
df2=pd.DataFrame({'0':[1,2,3,4,5,6],
'1':[1,2,3,4,5,66],
'2':[1,2,3,4,5,6],
'3':[1,2,3,4,5,66],
'4':[1,2,3,4,5,6],
'5':[1,2,3,4,5,6]})
df1[df1.isin(df2)].index.values.tolist()
### Output
[0, 1, 2, 3, 4, 5, 6, 7]

which rows are duplicates to each other

I have got a database with a lot of columns. Some of the rows are duplicates (on a certain subset).
Now I want to find out which row duplicates which row and put them together.
For instance, let's suppose that the data frame is
id A B C
0 0 1 2 0
1 1 2 3 4
2 2 1 4 8
3 3 1 2 3
4 4 2 3 5
5 5 5 6 2
and subset is
['A','B']
I expect something like this:
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
Is there any function that can help me do this?
Thanks :)
Use DataFrame.duplicated with keep=False to get a mask of all dupes, then filter by boolean indexing, sort by DataFrame.sort_values and join back together with concat:
L = ['A','B']
m = df.duplicated(L, keep=False)
df = pd.concat([df[m].sort_values(L), df[~m]], ignore_index=True)
print (df)
id A B C
0 0 1 2 0
1 3 1 2 3
2 1 2 3 4
3 4 2 3 5
4 2 1 4 8
5 5 5 6 2
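If you also want an explicit label saying which rows duplicate each other, ngroup can enumerate the groups in order of first appearance (a sketch built on the result above):
# number the [A, B] groups: duplicated rows share the same label
df['dup_group'] = df.groupby(L, sort=False).ngroup()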

pandas: bin data into specific number of bins of specific size

I would like to bin a dataframe by the values in a single column into bins of a specific size and number.
Here is an example df:
df= pd.DataFrame(np.random.randint(0,10000,size=(10000, 4)), columns=list('ABCD'))
Say I want to bin by column D, I will first sort the data:
df.sort_values('D')
I would now wish to bin so that, if the bin size is 50 and the bin number is 100, the first 50 values go into bin 1, the next 50 into bin 2, and so on. Any remaining values after the last bin should all go into the final bin. Is there any way of doing this?
EDIT:
Here is a sample input:
x = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
And here is the expected output:
A B C D bin
0 6 8 6 5 3
1 5 4 9 1 1
2 5 1 7 4 3
3 6 3 3 3 2
4 2 5 9 3 2
5 2 5 1 3 2
6 0 1 1 0 1
7 3 9 5 8 3
8 2 4 0 1 1
9 6 4 5 6 3
As an extra aside, is it also possible to put any equal values in the same bin? So for example, say bin 1 contains the values 0, 1, 1 and bin 2 contains 1, 1, 2. Is there any way of putting those two 1 values in bin 2 into bin 1? This will create very uneven bin sizes, but that is not an issue.
It seems you need to floor divide an np.arange and then assign it to a new column:
idx = df['D'].sort_values().index
df['b'] = pd.Series(np.arange(len(df)) // 3 + 1, index = idx)
print (df)
A B C D bin b
0 6 8 6 5 3 3
1 5 4 9 1 1 1
2 5 1 7 4 3 3
3 6 3 3 3 2 2
4 2 5 9 3 2 2
5 2 5 1 3 2 2
6 0 1 1 0 1 1
7 3 9 5 8 3 4
8 2 4 0 1 1 1
9 6 4 5 6 3 3
Detail:
print (np.arange(len(df)) // 3 + 1)
[1 1 1 2 2 2 3 3 3 4]
EDIT:
I created another question about the problem with the last values here:
N = 3
idx = df['D'].sort_values().index

# one possible solution, thanks Divakar
def replace_irregular_groupings(a, N):
    # if the last group is smaller than N, merge it into the previous group
    n = len(a)
    m = N * (n // N)
    if m != n:
        a[m:] = a[m - 1]
    return a

arr = replace_irregular_groupings(np.arange(len(df)) // N + 1, N)
df['b'] = pd.Series(arr, index=idx)
print (df)
A B C D bin b
0 6 8 6 5 3 3
1 5 4 9 1 1 1
2 5 1 7 4 3 3
3 6 3 3 3 2 2
4 2 5 9 3 2 2
5 2 5 1 3 2 2
6 0 1 1 0 1 1
7 3 9 5 8 3 3
8 2 4 0 1 1 1
9 6 4 5 6 3 3
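As for the aside about ties: one way to force equal D values into the same bin is to collapse each tie group to the smallest bin it received (a sketch, assuming the positional bin column b from above):
# all rows sharing a D value take the lowest bin any of them got
df['b'] = df.groupby('D')['b'].transform('min')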

Pandas count values inside dataframe

I have a dataframe that looks like this:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
and I want to count the values so to make df like this:
total
1 2
3 2
4 1
5 2
8 2
is it possible with pandas?
With np.unique -
In [332]: df
Out[332]:
A B C
1 1 8 3
2 5 4 3
3 5 8 1
In [333]: ids, c = np.unique(df.values.ravel(), return_counts=1)
In [334]: pd.DataFrame({'total':c}, index=ids)
Out[334]:
total
1 2
3 2
4 1
5 2
8 2
With pandas-series -
In [357]: pd.Series(np.ravel(df)).value_counts().sort_index()
Out[357]:
1 2
3 2
4 1
5 2
8 2
dtype: int64
You can also use stack() and groupby()
df = pd.DataFrame({'A':[1,8,3],'B':[5,4,3],'C':[5,8,1]})
print(df)
A B C
0 1 5 5
1 8 4 8
2 3 3 1
df1 = df.stack().reset_index(1)
df1.groupby(0).count()
level_1
0
1 2
3 2
4 1
5 2
8 2
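To match the total column name from the question, a rename can be chained on (a small polish of the above):
print(df1.groupby(0).count().rename(columns={'level_1': 'total'}))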
Another alternative may be to use stack, followed by value_counts; then the result is converted to a frame and finally the index is sorted:
count_df = df.stack().value_counts().to_frame('total').sort_index()
count_df
Result:
total
1 2
3 2
4 1
5 2
8 2
Using np.unique(..., return_counts=True) and np.column_stack():
pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)))
returns:
0 1
0 1 2
1 3 2
2 4 1
3 5 2
4 8 2
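The same result with named columns and the values as index, if you want the exact total layout from the question (a minor polish of the line above):
out = pd.DataFrame(np.column_stack(np.unique(df, return_counts=True)),
                   columns=['value', 'total']).set_index('value')
print(out)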