How to pair consecutive values in a column and transpose them into rows within groups (groupby) - pandas

I have data that I have already grouped by user and sorted by time, via data.groupby('user').apply(lambda x: x.sort_values('time')):
user  time   point_id
1     00:00  1
1     00:01  3
1     00:02  4
1     00:03  2
2     00:00  1
2     00:05  3
2     00:15  1
3     00:00  1
3     01:00  2
3     02:00  3
From that, within each group, I need to pair each value with the next one and transpose those pairs into rows. It should look like this for the example above:
user  start_point  end_point
1     1            3
1     3            4
1     4            2
2     1            3
2     3            1
3     1            2
3     2            3
My final goal is to get a matrix which shows how many links come into each point:
point_id |  1 |  2 |  3 |  4
---------+----+----+----+----
       1 |  0 |  1 |  3 |  0
       2 |  1 |  0 |  0 |  1
       3 |  3 |  0 |  0 |  1
       4 |  0 |  1 |  1 |  0
So this matrix means that one link goes between point 2 and point 1, three links go between point 3 and point 1, etc.

First, you can use shift() to pair consecutive point_id values into rows.
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
print(df)
   user  start_point  end_point
0     1            1          3
1     1            3          4
2     1            4          2
4     2            1          3
5     2            3          1
7     3            1          2
8     3            2          3
Then you can use pd.crosstab to count the directed links.
u = pd.crosstab(df.start_point, df.end_point)
print(u)
end_point    1  2  3  4
start_point
1            0  1  2  0
2            0  0  1  0
3            1  0  0  1
4            0  1  0  0
According to your expected result, what you need is undirected link counting, so all we need to do is add the transpose.
result = u + u.T
print(result)
end_point    1  2  3  4
start_point
1            0  1  3  0
2            1  0  1  1
3            3  1  0  1
4            0  1  1  0
The final code is as follows:
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
u = pd.crosstab(df.start_point, df.end_point)
result = u + u.T
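For reference, here is a minimal end-to-end sketch of the above (one assumption: the time column has been dropped first, since .astype(int) would fail on the time strings):
import pandas as pd

# the example data, keeping only the columns the solution needs
df = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'point_id': [1, 3, 4, 2, 1, 3, 1, 1, 2, 3]})

df = (df.assign(end_point=df['point_id'].shift(-1))   # next point in the frame
        [df['user'] == df['user'].shift(-1)]          # drop pairs that cross users
        .rename(columns={'point_id': 'start_point'})
        .astype(int))

u = pd.crosstab(df.start_point, df.end_point)  # directed link counts
result = u + u.T                               # symmetrize for undirected counts
print(result)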

I believe this works for your example, taking df = data.groupby('user').apply(lambda x: x.sort_values('time')) (your starting point):
groups = [(k, df.loc[v, 'point_id'].values) for k, v in df.groupby('user').groups.items()]
res = []
for g in groups:
    # pair each point with its successor within the group, tagged with the user
    res.append([(g[0], i) for i in zip(g[1], g[1][1:])])
df1 = pd.DataFrame([item for sublist in res for item in sublist])
df2 = df1.copy()
df2.iloc[:, -1] = df2.iloc[:, -1].apply(lambda x: (x[1], x[0]))  # df2 swaps the points around
df_ = pd.concat([df1, df2]).sort_values(by=0)
df_['1'], df_['2'] = df_.iloc[:, -1].apply(lambda x: x[0]), df_.iloc[:, -1].apply(lambda x: x[1])
df_ = df_.drop(columns=1)
df_.columns = ['user', 'start_point', 'end_point']  # your intermediate table
df_.pivot_table(index='start_point', columns='end_point', aggfunc='count').fillna(0)
Output:
user
end_point 1 2 3 4
start_point
1 0.0 1.0 3.0 0.0
2 1.0 0.0 1.0 1.0
3 3.0 1.0 0.0 1.0
4 0.0 1.0 1.0 0.0
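As a side note, a more compact route to the same intermediate table is a per-group shift; this is only a sketch and is not part of either answer above:
# pair consecutive points within each user via a grouped shift;
# the last row of each group gets NaN and is dropped
df['end_point'] = df.groupby('user')['point_id'].shift(-1)
links = (df.dropna(subset=['end_point'])
           .rename(columns={'point_id': 'start_point'})
           .astype({'end_point': int})
           [['user', 'start_point', 'end_point']])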

Related

How to compute column sum on the basis of other column value in pandas dataframe?

P  T1  T2  T3
0   1   2   3
1   1   2   0
2   3   1   2
3   1   0   2
In the above pandas dataframe df, I want to add a column whose value depends on the value of column 'P':
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to sum T1 through TN when df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values by the P column, create a mask by broadcasting np.arange over the number of filtered columns (selected with DataFrame.filter), compare it against the P values, pass this mask to DataFrame.where, and finally sum per row:
import numpy as np
import pandas as pd

np.random.seed(20)
cols = [f'{x}{i + 1}' for x in ['T', 'U', 'V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10, 10)), columns=['P'] + cols)

arrP = df['P'].to_numpy()[:, None]           # P values as a column vector
for c in ['T', 'U', 'V']:
    df1 = df.filter(regex=rf'^{c}')          # all columns starting with this prefix
    # keep column j only where j < P, zero out the rest, then sum per row
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print(df)
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
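To tie this back to the original four-row example, here is a minimal sketch restricted to the T columns; the expected sums 0, 1, 4, 3 come straight from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'P': [0, 1, 2, 3],
                   'T1': [1, 1, 3, 1],
                   'T2': [2, 2, 1, 0],
                   'T3': [3, 0, 2, 2]})

t = df.filter(regex=r'^T')                                   # just T1..T3
mask = np.arange(t.shape[1]) < df['P'].to_numpy()[:, None]   # column j counts iff j < P
df['T_SUM'] = t.where(mask, 0).sum(axis=1)
print(df['T_SUM'].tolist())  # [0, 1, 4, 3]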

Pandas concat function with count assigned for each iteration

When replicating a dataframe using concat with index (see the example below), is there a way to assign a count variable for each iteration in column c (where column c is the count variable)?
Orig df:
   a  b
0  1  2
1  2  3
df replicated with pd.concat([df] * 5) and with an additional column c:
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
import numpy as np

N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
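An equivalent route, sketched here as an aside rather than taken from the answer above, lets pd.concat label each copy via its keys parameter:
# each repetition gets its key as an extra index level named 'c';
# reset_index then turns that level into the count column
new_df = (pd.concat([df] * N, keys=range(1, N + 1), names=['c'])
            .reset_index(level='c'))
Note this puts c as the first column rather than the last.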

Turning a matrix to dummy matrix

I've generated a list of combinations and would like to turn it into a "dummies" matrix:
import pandas as pd
from itertools import combinations
comb = pd.DataFrame(list(combinations(range(1, 6), 4)))
   0  1  2  3
0  1  2  3  4
1  1  2  3  5
2  1  2  4  5
3  1  3  4  5
4  2  3  4  5
I would like to turn the above dataframe into one that looks like the dataframe below. Thanks.
   1  2  3  4  5
0  1  1  1  1  0
1  1  1  1  0  1
2  1  1  0  1  1
3  1  0  1  1  1
4  0  1  1  1  1
You can use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
lb = MultiLabelBinarizer()
df = pd.DataFrame(lb.fit_transform(comb.values), columns= lb.classes_)
print (df)
   1  2  3  4  5
0  1  1  1  1  0
1  1  1  1  0  1
2  1  1  0  1  1
3  1  0  1  1  1
4  0  1  1  1  1
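As an aside, a pure-pandas alternative sketch (not part of the answer above) one-hot encodes the stacked values and collapses them back per row:
# stack() reshapes each row into (row, position) -> value pairs,
# get_dummies one-hot encodes those values, and max() per original
# row merges the four indicator rows back into one
df = pd.get_dummies(comb.stack()).groupby(level=0).max()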

Pandas change each group into a single row

I have a dataframe like the follows.
>>> data
target user data
0 A 1 0
1 A 1 0
2 A 1 1
3 A 2 0
4 A 2 1
5 B 1 1
6 B 1 1
7 B 1 0
8 B 2 0
9 B 2 0
10 B 2 1
You can see that each user may contribute multiple claims about a target. I want to only store each user's most frequent data for each target. For example, for the dataframe shown above, I want the result like follows.
>>> result
target user data
0 A 1 0
1 A 2 0
2 B 1 1
3 B 2 0
How to do this? And, can I do this using groupby? (my real dataframe is not sorted)
Thanks!
Use groupby with a count transform to create the helper key, then use idxmax:
df['helperkey']=df.groupby(['target','user','data']).data.transform('count')
df.groupby(['target','user']).helperkey.idxmax()
Out[10]:
target user
A 1 0
2 3
B 1 5
2 8
Name: helperkey, dtype: int64
df.loc[df.groupby(['target','user']).helperkey.idxmax()]
Out[11]:
target user data helperkey
0 A 1 0 2
3 A 2 0 1
5 B 1 1 2
8 B 2 0 2
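A shorter variant (a sketch, not from the answer above) takes the modal data value per (target, user) group; note that on a tie Series.mode returns the smaller value first, which matches the expected output here:
# most frequent data value per group
result = (df.groupby(['target', 'user'])['data']
            .agg(lambda s: s.mode().iat[0])
            .reset_index())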

Pandas: The best way to create a new DataFrame by specific criteria

I have a DataFrame:
df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 2 0
5 2 0
6 2 0
7 3 1
8 3 1
9 3 0
10 4 1
11 4 1
I want to get a new DataFrame containing only the ids that have both sex values.
So I want to get something like this.
id sex
0 1 0
1 1 0
2 1 0
3 1 1
4 3 1
5 3 1
6 3 0
Use groupby and filter with the required condition:
In [2952]: df.groupby('id').filter(lambda x: set(x.sex) == set([0,1]))
Out[2952]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Also,
In [2953]: df.groupby('id').filter(lambda x: all([any(x.sex == v) for v in [0,1]]))
Out[2953]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use drop_duplicates on both columns, then count each id with value_counts first. Then filter all values by boolean indexing with isin:
s = df.drop_duplicates()['id'].value_counts()
print (s)
3 2
1 2
4 1
2 1
Name: id, dtype: int64
df = df[df['id'].isin(s.index[s == 2])]
print (df)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
One more:)
df.groupby('id').filter(lambda x: x['sex'].nunique()>1)
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Use isin()
Something like this:
df = pd.DataFrame({'id': [1,1,1,1,2,2,2,3,3,3,4,4],
                   'sex': [0,0,0,1,0,0,0,1,1,0,1,1]})
male = df[df['sex'] == 0]
male = male['id']
female = df[df['sex'] == 1]
female = female['id']
df = df[(df['id'].isin(male)) & (df['id'].isin(female))]
print(df)
Output:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
Or you can try this
m=df.groupby('id')['sex'].nunique().eq(2)
df.loc[df.id.isin(m[m].index)]
Out[112]:
id sex
0 1 0
1 1 0
2 1 0
3 1 1
7 3 1
8 3 1
9 3 0
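As a final aside, a sketch along the same lines as the last answer: the per-id boolean can be mapped straight back onto the rows without building an index first:
# map each row's id to its group's has-both-sexes flag
df[df['id'].map(df.groupby('id')['sex'].nunique().eq(2))]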