How to pair consecutive values in a column and transpose them into rows within groups (groupby) - pandas

I have data that I have already grouped by user and sorted by time, via data.groupby('user').apply(lambda x: x.sort_values('time')):
user  time   point_id
1     00:00  1
1     00:01  3
1     00:02  4
1     00:03  2
2     00:00  1
2     00:05  3
2     00:15  1
3     00:00  1
3     01:00  2
3     02:00  3
From that, within each group, I need to pair each value with the next one and transpose those pairs into rows. It should look like this for the example above:
user  start_point  end_point
1     1            3
1     3            4
1     4            2
2     1            3
2     3            1
3     1            2
3     2            3
My final goal is to get a matrix which shows how many links come into each point:
point_id |  1 |  2 |  3 |  4
---------+----+----+----+----
       1 |  0 |  1 |  3 |  0
       2 |  1 |  0 |  0 |  1
       3 |  3 |  0 |  0 |  1
       4 |  0 |  1 |  1 |  0
So this matrix means that one link goes between point 2 and point 1, three links go between point 3 and point 1, etc.

First, you can use shift() to pair consecutive point_id values into rows.
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
print(df)
   user  start_point  end_point
0     1            1          3
1     1            3          4
2     1            4          2
4     2            1          3
5     2            3          1
7     3            1          2
8     3            2          3
Then you can use pd.crosstab to count the directed links.
u = pd.crosstab(df.start_point, df.end_point)
print(u)
end_point    1  2  3  4
start_point
1            0  1  2  0
2            0  0  1  0
3            1  0  0  1
4            0  1  0  0
According to your expected result, what you need is undirected link counting, so all we need to do is add the transpose.
result = u + u.T
print(result)
end_point    1  2  3  4
start_point
1            0  1  3  0
2            1  0  1  1
3            3  1  0  1
4            0  1  1  0
The final code is as follows:
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
u = pd.crosstab(df.start_point, df.end_point)
result = u + u.T
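For reference, here is a minimal end-to-end sketch of the above (one assumption: the time column has been dropped first, since .astype(int) would fail on the time strings):
import pandas as pd

# the example data, keeping only the columns the solution needs
df = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'point_id': [1, 3, 4, 2, 1, 3, 1, 1, 2, 3]})

df = (df.assign(end_point=df['point_id'].shift(-1))   # next point in the frame
        [df['user'] == df['user'].shift(-1)]          # drop pairs that cross users
        .rename(columns={'point_id': 'start_point'})
        .astype(int))

u = pd.crosstab(df.start_point, df.end_point)  # directed link counts
result = u + u.T                               # symmetrize for undirected counts
print(result)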

I believe this works for your example, taking df = data.groupby('user').apply(lambda x: x.sort_values('time')) (your starting point):
groups = [(k, df.loc[v, 'point_id'].values) for k, v in df.groupby('user').groups.items()]
res = []
for g in groups:
    # pair each point with its successor within the group, tagged with the user
    res.append([(g[0], i) for i in zip(g[1], g[1][1:])])
df1 = pd.DataFrame([item for sublist in res for item in sublist])
df2 = df1.copy()
df2.iloc[:, -1] = df2.iloc[:, -1].apply(lambda x: (x[1], x[0]))  # df2 swaps the points around
df_ = pd.concat([df1, df2]).sort_values(by=0)
df_['1'], df_['2'] = df_.iloc[:, -1].apply(lambda x: x[0]), df_.iloc[:, -1].apply(lambda x: x[1])
df_ = df_.drop(columns=1)
df_.columns = ['user', 'start_point', 'end_point']  # your intermediate table
df_.pivot_table(index='start_point', columns='end_point', aggfunc='count').fillna(0)
Output:
user
end_point 1 2 3 4
start_point
1 0.0 1.0 3.0 0.0
2 1.0 0.0 1.0 1.0
3 3.0 1.0 0.0 1.0
4 0.0 1.0 1.0 0.0
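As a side note, a more compact route to the same intermediate table is a per-group shift; this is only a sketch and is not part of either answer above:
# pair consecutive points within each user via a grouped shift;
# the last row of each group gets NaN and is dropped
df['end_point'] = df.groupby('user')['point_id'].shift(-1)
links = (df.dropna(subset=['end_point'])
           .rename(columns={'point_id': 'start_point'})
           .astype({'end_point': int})
           [['user', 'start_point', 'end_point']])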

Related

How to compute column sum on the basis of other column value in pandas dataframe?

P  T1  T2  T3
0   1   2   3
1   1   2   0
2   3   1   2
3   1   0   2
In the above pandas dataframe df, I want to add a column whose value depends on the value of column 'P':
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to sum T1 through TN when df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values by the P column, create a mask by broadcasting np.arange over the number of filtered columns (selected with DataFrame.filter), compare it against the P values, pass this mask to DataFrame.where, and finally sum per row:
import numpy as np
import pandas as pd

np.random.seed(20)
cols = [f'{x}{i + 1}' for x in ['T', 'U', 'V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10, 10)), columns=['P'] + cols)

arrP = df['P'].to_numpy()[:, None]           # P values as a column vector
for c in ['T', 'U', 'V']:
    df1 = df.filter(regex=rf'^{c}')          # all columns starting with this prefix
    # keep column j only where j < P, zero out the rest, then sum per row
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print(df)
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
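To tie this back to the original four-row example, here is a minimal sketch restricted to the T columns; the expected sums 0, 1, 4, 3 come straight from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'P': [0, 1, 2, 3],
                   'T1': [1, 1, 3, 1],
                   'T2': [2, 2, 1, 0],
                   'T3': [3, 0, 2, 2]})

t = df.filter(regex=r'^T')                                   # just T1..T3
mask = np.arange(t.shape[1]) < df['P'].to_numpy()[:, None]   # column j counts iff j < P
df['T_SUM'] = t.where(mask, 0).sum(axis=1)
print(df['T_SUM'].tolist())  # [0, 1, 4, 3]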

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with index (see the example below), is there a way to assign a count variable for each iteration in column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df] * 5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
import numpy as np

N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
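An equivalent route, sketched here as an aside rather than taken from the answer above, lets pd.concat label each copy via its keys parameter:
# each repetition gets its key as an extra index level named 'c';
# reset_index then turns that level into the count column
new_df = (pd.concat([df] * N, keys=range(1, N + 1), names=['c'])
            .reset_index(level='c'))
Note this puts c as the first column rather than the last.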

Turning a matrix to dummy matrix

I've generated a list of combinations and would like to turn it into a "dummies" matrix:
import pandas as pd
from itertools import combinations
comb = pd.DataFrame(list(combinations(range(1, 6), 4)))
   0  1  2  3
0  1  2  3  4
1  1  2  3  5
2  1  2  4  5
3  1  3  4  5
4  2  3  4  5
I would like to turn the above dataframe into one that looks like the dataframe below. Thanks.
   1  2  3  4  5
0  1  1  1  1  0
1  1  1  1  0  1
2  1  1  0  1  1
3  1  0  1  1  1
4  0  1  1  1  1
You can use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
df = pd.DataFrame(lb.fit_transform(comb.values), columns= lb.classes_)
print (df)
   1  2  3  4  5
0  1  1  1  1  0
1  1  1  1  0  1
2  1  1  0  1  1
3  1  0  1  1  1
4  0  1  1  1  1
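As an aside, a pure-pandas alternative sketch (not part of the answer above) one-hot encodes the stacked values and collapses them back per row:
# stack() reshapes each row into (row, position) -> value pairs,
# get_dummies one-hot encodes those values, and max() per original
# row merges the four indicator rows back into one
df = pd.get_dummies(comb.stack()).groupby(level=0).max()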

Pandas change each group into a single row

I have a dataframe like the follows.
>>> data
target user data
0 A 1 0
1 A 1 0
2 A 1 1
3 A 2 0
4 A 2 1
5 B 1 1
6 B 1 1
7 B 1 0
8 B 2 0
9 B 2 0
10 B 2 1
You can see that each user may contribute multiple claims about a target. I want to only store each user's most frequent data for each target. For example, for the dataframe shown above, I want the result like follows.
>>> result
target user data
0 A 1 0
1 A 2 0
2 B 1 1
3 B 2 0
How to do this? And, can I do this using groupby? (my real dataframe is not sorted)
Thanks!
Use groupby with a count transform to create the helper key, then use idxmax:
df['helperkey']=df.groupby(['target','user','data']).data.transform('count')
df.groupby(['target','user']).helperkey.idxmax()
Out[10]:
target user
A 1 0
2 3
B 1 5
2 8
Name: helperkey, dtype: int64
df.loc[df.groupby(['target','user']).helperkey.idxmax()]
Out[11]:
target user data helperkey
0 A 1 0 2
3 A 2 0 1
5 B 1 1 2
8 B 2 0 2
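A shorter variant (a sketch, not from the answer above) takes the modal data value per (target, user) group; note that on a tie Series.mode returns the smaller value first, which matches the expected output here:
# most frequent data value per group
result = (df.groupby(['target', 'user'])['data']
            .agg(lambda s: s.mode().iat[0])
            .reset_index())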

Pandas: The best way to create a new DataFrame by specific criteria

I have a DataFrame:
df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 0
10 4 1
11 4 1
I want to get a new DataFrame containing only the ids that have both sex values.
So I want to get something like this.
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 3 1
5 3 1
6 3 0
Use groupby and filter with the required condition:
In [2952]: df.groupby('id').filter(lambda x: set(x.sex) == set([0,1]))
Out[2952]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Also,
In [2953]: df.groupby('id').filter(lambda x: all([any(x.sex == v) for v in [0,1]]))
Out[2953]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use drop_duplicates on both columns, then count each id with value_counts first. Then filter all values by boolean indexing with isin:
s = df.drop_duplicates()['id'].value_counts()
print (s)
3 2
1 2
4 1
2 1
Name: id, dtype: int64
df = df[df['id'].isin(s.index[s == 2])]
print (df)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
One more:)
df.groupby('id').filter(lambda x: x['sex'].nunique()>1)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use isin()
Something like this:
df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
male = df[df['sex'] == 0]
male = male['id']
female = df[df['sex'] == 1]
female = female['id']
df = df[(df['id'].isin(male)) & (df['id'].isin(female))]
print(df)
Output:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Or you can try this
m=df.groupby('id')['sex'].nunique().eq(2)
df.loc[df.id.isin(m[m].index)]
Out[112]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
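As a final aside, a sketch along the same lines as the last answer: the per-id boolean can be mapped straight back onto the rows without building an index first:
# map each row's id to its group's has-both-sexes flag
df[df['id'].map(df.groupby('id')['sex'].nunique().eq(2))]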