How to pair consecutive values in a column and transpose them into rows within groups (groupby) - pandas
I have data on which I already applied a group by user and a sort by time (data.groupby('id').apply(lambda x: x.sort_values('time'))):
user time point_id
1 00:00 1
1 00:01 3
1 00:02 4
1 00:03 2
2 00:00 1
2 00:05 3
2 00:15 1
3 00:00 1
3 01:00 2
3 02:00 3
From that, inside each group, I need to link/transpose each pair of consecutive values into a row. For the example above it should look like this:
user start_point end_point
1 1 3
1 3 4
1 4 2
2 1 3
2 3 1
3 1 2
3 2 3
My final goal is to get a matrix which shows how many links come into each point:
point_id | 1 | 2 | 3 | 4 |
--------------------------------------------
1 0 1 3 0
2 1 0 0 1
3 3 0 0 1
4 0 1 1 0
So this matrix means that one link connects point 2 with point 1, three links connect point 3 with point 1, and so on.
First, you can use shift() to pair consecutive point_id values into rows.
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
print(df)
user start_point end_point
0 1 1 3
1 1 3 4
2 1 4 2
4 2 1 3
5 2 3 1
7 3 1 2
8 3 2 3
Then you can use pd.crosstab to count the directed links.
u = pd.crosstab(df.start_point, df.end_point)
print(u)
end_point 1 2 3 4
start_point
1 0 1 2 0
2 0 0 1 0
3 1 0 0 1
4 0 1 0 0
According to your expected result, what you need is an undirected link count, so all we need to do is transpose and add.
result = u + u.T
print(result)
end_point 1 2 3 4
start_point
1 0 1 3 0
2 1 0 1 1
3 3 1 0 1
4 0 1 1 0
The final code is as follows:
df = (df.assign(end_point=df['point_id'].shift(-1))
        [df['user'] == df['user'].shift(-1)]
        .rename(columns={'point_id': 'start_point'})
        .astype(int))
u = pd.crosstab(df.start_point, df.end_point)
result = u + u.T
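For reference, here is a self-contained sketch of the same approach, with the sample frame reconstructed from the question. Two details are my additions, not part of the answer above: it pairs points with groupby(...).shift(-1), which yields NaN at each group boundary and so replaces the user == user.shift(-1) filter, and it reindexes the crosstab before symmetrising in case the table is not square.

```python
import pandas as pd

# Sample data reconstructed from the question.
data = pd.DataFrame({
    'user':     [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'time':     ['00:00', '00:01', '00:02', '00:03',
                 '00:00', '00:05', '00:15',
                 '00:00', '01:00', '02:00'],
    'point_id': [1, 3, 4, 2, 1, 3, 1, 1, 2, 3],
})

df = data.sort_values(['user', 'time'])

# Pair each point with the next one inside the same user group;
# groupby(...).shift(-1) yields NaN at each group boundary, so no
# cross-user pairs are created.
df = df.assign(end_point=df.groupby('user')['point_id'].shift(-1))
df = df.dropna(subset=['end_point']).astype({'end_point': int})
df = df.rename(columns={'point_id': 'start_point'})

u = pd.crosstab(df['start_point'], df['end_point'])

# Align both axes before symmetrising, in case some point only ever
# appears as a start or only as an end.
labels = u.index.union(u.columns)
u = u.reindex(index=labels, columns=labels, fill_value=0)
result = u + u.T
print(result)
```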
I believe this works for your example, taking df = data.groupby('id').apply(lambda x: x.sort_values('time')) (your starting frame):
groups = [(k, df.loc[v, 'point_id'].values) for k, v in df.groupby('user').groups.items()]
res = []
for g in groups:
    res.append([(g[0], i) for i in zip(g[1], g[1][1:])])
df1 = pd.DataFrame([item for sublist in res for item in sublist])
df2 = df1.copy()
df2.iloc[:, -1] = df2.iloc[:, -1].apply(lambda x: (x[1], x[0]))  # df2 swaps the points around
df_ = pd.concat([df1, df2]).sort_values(by=0)
df_['1'], df_['2'] = df_.iloc[:, -1].apply(lambda x: x[0]), df_.iloc[:, -1].apply(lambda x: x[1])
df_ = df_.drop(columns=1)
df_.columns = ['user', 'start_point', 'end_point']  # your intermediate table
df_.pivot_table(index='start_point', columns='end_point', aggfunc='count').fillna(0)
Output:
user
end_point 1 2 3 4
start_point
1 0.0 1.0 3.0 0.0
2 1.0 0.0 1.0 1.0
3 3.0 1.0 0.0 1.0
4 0.0 1.0 1.0 0.0
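As a compact variant of the loop above (a sketch, not the answer's exact method), one can emit each consecutive pair in both directions inside a single comprehension and let pd.crosstab produce the symmetric matrix directly:

```python
import pandas as pd

data = pd.DataFrame({
    'user':     [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'point_id': [1, 3, 4, 2, 1, 3, 1, 1, 2, 3],
})  # already sorted by user and time

# For every consecutive pair inside a user, emit the link in both
# directions so the resulting crosstab is symmetric from the start.
pairs = [(u, a, b)
         for u, pts in data.groupby('user')['point_id']
         for x, y in zip(pts, pts.iloc[1:])
         for a, b in ((x, y), (y, x))]

links = pd.DataFrame(pairs, columns=['user', 'start_point', 'end_point'])
matrix = pd.crosstab(links['start_point'], links['end_point'])
print(matrix)
```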
Related
How to compute column sum on the basis of other column value in pandas dataframe?
I have the following pandas dataframe df:

P T1 T2 T3
0  1  2  3
1  1  2  0
2  3  1  2
3  1  0  2

I want to add a column on the basis of the value of column 'P':

if df['P'] == 0: 0
if df['P'] == 1: T1 (= 1)
if df['P'] == 2: T1 + T2 (= 3 + 1 = 4)
if df['P'] == 3: T1 + T2 + T3 (= 1 + 0 + 2 = 3)

In other words, I want to add up T1 through TN if df['P'] == N. How can I implement this with Python code?
EDIT: To sum values by the P column, create a mask by broadcasting np.arange over the number of filtered columns (DataFrame.filter), compare it against the P values, pass this mask to DataFrame.where, and finally sum per row:

np.random.seed(20)
c = [f'{x}{i + 1}' for x in ['T', 'U', 'V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10, 10)), columns=['P'] + c)

arrP = df['P'].to_numpy()[:, None]
for c in ['T', 'U', 'V']:
    df1 = df.filter(regex=rf'^{c}')
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print(df)

   P  T1  T2  T3  U1  U2  U3  V1  V2  V3  T_SUM  U_SUM  V_SUM
0  3   2   3   3   0   2   1   0   3   2      8      3      5
1  3   2   0   2   0   1   2   2   3   3      4      3      8
2  0   1   2   2   2   0   1   1   3   1      0      0      0
3  3   2   2   2   1   3   2   1   3   2      6      6      6
4  3   1   1   3   1   2   2   0   2   3      5      5      5
5  2   3   2   3   1   1   1   0   3   0      5      2      3
6  2   3   2   3   3   3   2   1   1   2      5      6      2
7  3   2   0   2   1   1   2   2   2   3      4      4      7
8  2   2   1   0   2   2   0   3   3   0      3      4      6
9  2   2   3   2   2   3   2   2   1   1      5      5      3
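Applied to the original four-row frame from the question (values transcribed by hand), the same where-mask idea reduces to a few lines; this is a sketch, assuming the frame above was parsed correctly:

```python
import numpy as np
import pandas as pd

# The question's frame, transcribed by hand.
df = pd.DataFrame({'P':  [0, 1, 2, 3],
                   'T1': [1, 1, 3, 1],
                   'T2': [2, 2, 1, 0],
                   'T3': [3, 0, 2, 2]})

# Keep T1..TN per row, where N is that row's P value, then sum across.
t = df.filter(regex='^T')
mask = np.arange(t.shape[1]) < df['P'].to_numpy()[:, None]
df['SUM'] = t.where(mask, 0).sum(axis=1)
print(df['SUM'].tolist())  # [0, 1, 4, 3]
```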
Pandas concat function with count assigned for each iteration
When replicating a dataframe using concat with index (see example here), is there a way I can assign a count variable for each iteration in column c (where column c is the count variable)?

Orig df:

   a  b
0  1  2
1  2  3

df replicated with pd.concat([df] * 5) and with an additional column c:

   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5

This is a multi-row dataframe where the count variable has to be applied to multiple rows. Thanks for your thoughts!
You could use np.arange and np.repeat:

N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1

Output:

>>> new_df
   a  b  c
0  1  2  1
1  2  3  1
0  1  2  2
1  2  3  2
0  1  2  3
1  2  3  3
0  1  2  4
1  2  3  4
0  1  2  5
1  2  3  5
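An alternative sketch, not from the answer above: pd.concat can label each repeated copy itself via its keys argument, after which reset_index turns that label into the count column:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})
N = 5

# keys= labels each repeated copy 1..N in the index; reset_index then
# turns that index level into the count column c.
new_df = (pd.concat([df] * N, keys=range(1, N + 1), names=['c'])
            .reset_index(level='c'))
print(new_df)
```

Note that reset_index inserts c as the first column; reorder with new_df[['a', 'b', 'c']] if the original column order matters.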
Turning a matrix to dummy matrix
I've generated a list of combinations and would like to turn it into a "dummies" matrix:

import pandas as pd
from itertools import combinations

comb = pd.DataFrame(list(combinations(range(1, 6), 4)))

   0  1  2  3
0  1  2  3  4
1  1  2  3  5
2  1  2  4  5
3  1  3  4  5
4  2  3  4  5

I would like to turn the above dataframe into one that looks like below. Thanks.

   1  2  3  4  5
0  1  1  1  1  0
1  1  1  1  0  1
2  1  1  0  1  1
3  1  0  1  1  1
4  0  1  1  1  1
You can use MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer()
df = pd.DataFrame(lb.fit_transform(comb.values), columns=lb.classes_)
print(df)

   1  2  3  4  5
0  1  1  1  1  0
1  1  1  1  0  1
2  1  1  0  1  1
3  1  0  1  1  1
4  0  1  1  1  1
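If you'd rather stay inside pandas, a hedged alternative is stack + get_dummies, collapsing back to one row per combination with a groupby max (the .astype(int) is there because newer pandas versions return boolean dummies):

```python
import pandas as pd
from itertools import combinations

comb = pd.DataFrame(list(combinations(range(1, 6), 4)))

# stack() turns each row into one value per line; get_dummies encodes
# the values; max() per original row collapses back to one indicator row.
dummies = pd.get_dummies(comb.stack()).groupby(level=0).max().astype(int)
print(dummies)
```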
Pandas change each group into a single row
I have a dataframe like the following:

>>> data
    target  user  data
0        A     1     0
1        A     1     0
2        A     1     1
3        A     2     0
4        A     2     1
5        B     1     1
6        B     1     1
7        B     1     0
8        B     2     0
9        B     2     0
10       B     2     1

You can see that each user may contribute multiple claims about a target. I want to keep only each user's most frequent data value for each target. For the dataframe shown above, I want a result like the following:

>>> result
   target  user  data
0       A     1     0
1       A     2     0
2       B     1     1
3       B     2     0

How to do this? And can I do this using groupby? (My real dataframe is not sorted.) Thanks!
Use groupby with count to create a helper key, then use idxmax:

df['helperkey'] = df.groupby(['target', 'user', 'data']).data.transform('count')
df.groupby(['target', 'user']).helperkey.idxmax()
Out[10]:
target  user
A       1       0
        2       3
B       1       5
        2       8
Name: helperkey, dtype: int64

df.loc[df.groupby(['target', 'user']).helperkey.idxmax()]
Out[11]:
   target  user  data  helperkey
0       A     1     0          2
3       A     2     0          1
5       B     1     1          2
8       B     2     0          2
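An alternative sketch using Series.mode instead of the helper key (note that on ties, such as user 2 for target A, mode() returns the candidates sorted ascending, so iloc[0] picks the smallest):

```python
import pandas as pd

data = pd.DataFrame({'target': list('AAAAABBBBBB'),
                     'user':   [1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
                     'data':   [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1]})

# mode() returns the most frequent value(s); iloc[0] takes the first
# (the smallest, on ties).
result = (data.groupby(['target', 'user'])['data']
              .agg(lambda s: s.mode().iloc[0])
              .reset_index())
print(result)
```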
Pandas: The best way to create new Frame by specific criteria
I have a DataFrame:

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'sex': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1]})

    id  sex
0    1    0
1    1    0
2    1    0
3    1    1
4    2    0
5    2    0
6    2    0
7    3    1
8    3    1
9    3    0
10   4    1
11   4    1

I want to get a new DataFrame with only the ids that have both sex values. So I want to get something like this:

   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
4   3    1
5   3    1
6   3    0
Use groupby and filter with the required condition:

In [2952]: df.groupby('id').filter(lambda x: set(x.sex) == set([0, 1]))
Out[2952]:
   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
7   3    1
8   3    1
9   3    0

Also:

In [2953]: df.groupby('id').filter(lambda x: all([any(x.sex == v) for v in [0, 1]]))
Out[2953]:
   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
7   3    1
8   3    1
9   3    0
Use drop_duplicates on both columns, then count the distinct sex values per id with value_counts. Then filter by boolean indexing with isin:

s = df.drop_duplicates()['id'].value_counts()
print(s)
3    2
1    2
4    1
2    1
Name: id, dtype: int64

df = df[df['id'].isin(s.index[s == 2])]
print(df)
   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
7   3    1
8   3    1
9   3    0
One more :)

df.groupby('id').filter(lambda x: x['sex'].nunique() > 1)
   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
7   3    1
8   3    1
9   3    0
Use isin(). Something like this:

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'sex': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1]})

male = df[df['sex'] == 0]['id']
female = df[df['sex'] == 1]['id']

df = df[df['id'].isin(male) & df['id'].isin(female)]
print(df)

Output:
   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
7   3    1
8   3    1
9   3    0
Or you can try this:

m = df.groupby('id')['sex'].nunique().eq(2)
df.loc[df.id.isin(m[m].index)]
Out[112]:
   id  sex
0   1    0
1   1    0
2   1    0
3   1    1
7   3    1
8   3    1
9   3    0
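One more vectorised variant, as a sketch: transform('nunique') broadcasts the per-id count back onto every row, which avoids the Python-level lambda in filter:

```python
import pandas as pd

df = pd.DataFrame({'id':  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
                   'sex': [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1]})

# transform('nunique') gives each row the number of distinct sex
# values seen for its id, so the boolean mask aligns with df directly.
out = df[df.groupby('id')['sex'].transform('nunique') == 2]
print(out)
```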