I have a dataset to which I want to apply an AI/ML algorithm to classify patterns, as shown below. Please suggest ways I can do this efficiently.
Example dataset:
Name Subset Value
A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565
The output would be:
Pattern No.  Instance  Pattern
1            1         A X_1 55
                       A X_A 89
                       A X_B 45
             2         C X_1 66
                       C X_A 656
                       C X_B 456
2            1         B B_1 55
                       B B_A 89
3            1         D D_1 545
                       D D_2 546
                       D D_3 895
                       D D_4 565
You need a clustering algorithm.
Define a distance function between instances.
Try various clustering algorithms.
I would recommend starting with Jaccard distance and agglomerative clustering.
Code example
import re
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

def read_data():
    text = '''A X_1 55
A X_A 89
A X_B 45
B B_1 55
B B_A 89
C X_1 66
C X_A 656
C X_B 456
D D_1 545
D D_2 546
D D_3 895
D D_4 565'''
    data = [re.split(r'\s+', line) for line in text.split('\n')]
    return pd.DataFrame(data, columns=['Name', 'Subset', 'Value'])

df = read_data()

# One instance per Name, described by its list of Subset labels
instances = []
for name, name_df in df.groupby('Name'):
    instances.append({'name': name, 'subsets': name_df['Subset'].tolist()})

def jaccard_distance(list_1, list_2):
    set_1 = set(list_1)
    set_2 = set(list_2)
    return 1 - len(set_1.intersection(set_2)) / len(set_1.union(set_2))

# metric= replaced affinity= in scikit-learn 1.2; on older versions
# pass affinity='precomputed' instead
clustering = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=1,
                                     metric='precomputed',
                                     linkage='average')

# Pairwise Jaccard distances between all instances
distance_matrix = [
    [
        jaccard_distance(instance_1['subsets'], instance_2['subsets'])
        for instance_2 in instances
    ] for instance_1 in instances
]

clusters = clustering.fit_predict(distance_matrix)
for instance, cluster in zip(instances, clusters):
    instance['cluster'] = cluster
print(instances)
Output
[{'name': 'A', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'B', 'subsets': ['B_1', 'B_A'], 'cluster': 2},
{'name': 'C', 'subsets': ['X_1', 'X_A', 'X_B'], 'cluster': 0},
{'name': 'D', 'subsets': ['D_1', 'D_2', 'D_3', 'D_4'], 'cluster': 1}]
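If you also need the tabular "Pattern No. / Instance" layout from the question, the cluster labels can be regrouped after fitting. A minimal sketch, assuming the instances, clusters and df objects from the snippet above:
from collections import defaultdict

# Collect the names belonging to each cluster label
patterns = defaultdict(list)
for instance, cluster in zip(instances, clusters):
    patterns[cluster].append(instance['name'])

# Print each pattern with a running instance number, pulling the
# original rows for each name back out of the dataframe
for pattern_no, names in enumerate(patterns.values(), start=1):
    for instance_no, name in enumerate(names, start=1):
        print(f'Pattern {pattern_no}, instance {instance_no}:')
        print(df[df['Name'] == name].to_string(index=False, header=False))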
Related
I have a dataset that contains NBA players' average statistics per game. Some players' statistics are repeated because they've been on different teams during the season.
For example:
Player Pos Age Tm G GS MP FG
8 Jarrett Allen C 22 TOT 28 10 26.2 4.4
9 Jarrett Allen C 22 BRK 12 5 26.7 3.7
10 Jarrett Allen C 22 CLE 16 5 25.9 4.9
I want to average Jarrett Allen's stats and put them into a single row. How can I do this?
You can groupby and use agg to get the mean. For the non-numeric columns, let's take the first value:
df.groupby('Player').agg({k: 'mean' if v in ('int64', 'float64') else 'first'
for k,v in df.dtypes[1:].items()})
output:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22 TOT 18.666667 6.666667 26.266667 4.333333
NB. content of the dictionary comprehension:
{'Pos': 'first',
'Age': 'mean',
'Tm': 'first',
'G': 'mean',
'GS': 'mean',
'MP': 'mean',
'FG': 'mean'}
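The same aggregation map can also be derived with select_dtypes instead of comparing dtype names, which is a bit more robust to other numeric dtypes (e.g. int32). A sketch, assuming as above that Player is the first column:
# Build the {column: aggregation} dict from the numeric columns
num_cols = df.select_dtypes(include='number').columns
agg_map = {c: 'mean' if c in num_cols else 'first' for c in df.columns[1:]}
df.groupby('Player').agg(agg_map)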
x = [['a', 12, 5],['a', 12, 7], ['b', 15, 10],['b', 15, 12],['c', 20, 1]]
import pandas as pd
df = pd.DataFrame(x, columns=['name', 'age', 'score'])
print(df)
print('-----------')
df2 = df.groupby(['name', 'age']).mean()
print(df2)
Output:
name age score
0 a 12 5
1 a 12 7
2 b 15 10
3 b 15 12
4 c 20 1
-----------
score
name age
a 12 6
b 15 11
c 20 1
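Note that the grouping keys end up in the index here. If you would rather keep name and age as ordinary columns, as_index=False preserves the flat layout; a small sketch:
# Same aggregation, but with the grouping keys kept as columns
df3 = df.groupby(['name', 'age'], as_index=False).mean()
print(df3)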
Option 1
If one considers the dataframe df that the OP shares in the question, the following will do the work
df_new = df.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Jarrett Allen C 22.0 TOT 18.666667 6.666667 26.266667 4.333333
This one uses:
pandas.DataFrame.groupby to group by the Player column
pandas.core.groupby.GroupBy.agg to aggregate the values based on a custom lambda function
pandas.api.types.is_string_dtype to check whether a column is of string type
Let's test it with a new dataframe, df2, with more elements in the Player column.
import numpy as np
df2 = pd.DataFrame({'Player': ['John Collins', 'John Collins', 'John Collins', 'Trae Young', 'Trae Young', 'Clint Capela', 'Jarrett Allen', 'Jarrett Allen', 'Jarrett Allen'],
'Pos': ['PF', 'PF', 'PF', 'PG', 'PG', 'C', 'C', 'C', 'C'],
'Age': np.random.randint(0, 100, 9),
'Tm': ['ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'ATL', 'TOT', 'BRK', 'CLE'],
'G': np.random.randint(0, 100, 9),
'GS': np.random.randint(0, 100, 9),
'MP': np.random.uniform(0, 100, 9),
'FG': np.random.uniform(0, 100, 9)})
[Out]:
Player Pos Age Tm G GS MP FG
0 John Collins PF 71 ATL 75 39 16.123225 77.949756
1 John Collins PF 60 ATL 49 49 30.308092 24.788401
2 John Collins PF 52 ATL 33 92 11.087317 58.488575
3 Trae Young PG 72 ATL 20 91 62.862313 60.169282
4 Trae Young PG 85 ATL 61 77 30.248551 85.169038
5 Clint Capela C 73 ATL 5 67 45.817690 21.966777
6 Jarrett Allen C 23 TOT 60 51 93.076624 34.160823
7 Jarrett Allen C 12 BRK 2 77 74.318568 78.755869
8 Jarrett Allen C 44 CLE 82 81 7.375631 40.930844
If one tests the operation on df2, one will get the following
df_new2 = df2.groupby('Player').agg(lambda x: x.iloc[0] if pd.api.types.is_string_dtype(x.dtype) else x.mean())
[Out]:
Pos Age Tm G GS MP FG
Player
Clint Capela C 95.000000 ATL 30.000000 98.000000 46.476398 17.987104
Jarrett Allen C 60.000000 TOT 48.666667 19.333333 70.050540 33.572896
John Collins PF 74.333333 ATL 50.333333 52.666667 78.181457 78.152235
Trae Young PG 57.500000 ATL 44.500000 47.500000 46.602543 53.835455
Option 2
Depending on the desired output, and assuming that one only wants to group by player (independently of Age or Tm), a simpler solution is to just group by and call .mean(), as follows
df_new3 = df.groupby('Player').mean(numeric_only=True)  # numeric_only is required on pandas >= 2.0
[Out]:
Age G GS MP FG
Player
Jarrett Allen 22.0 18.666667 6.666667 26.266667 4.333333
Notes:
The output of this previous operation won't display non-numerical columns (apart from the Player name).
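If the non-numerical columns should survive the aggregation, named aggregation makes the per-column behaviour explicit instead of relying on dtype checks. A sketch, assuming the same columns as the OP's df:
df_new4 = df.groupby('Player').agg(
    Pos=('Pos', 'first'),  # keep the first value of each string column
    Tm=('Tm', 'first'),
    Age=('Age', 'mean'),   # average the numeric columns
    G=('G', 'mean'),
    GS=('GS', 'mean'),
    MP=('MP', 'mean'),
    FG=('FG', 'mean'),
)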
Let's say I have a series, ser1, with a TimeSeriesIndex of length x. I also have another series, ser2, of length y. How do I multiply these so that I get a dataframe of shape (x, y), where the index is from ser1 and the columns are the indices from ser2? I want every element of ser2 to be multiplied by the value of each element in ser1.
import pandas as pd
ser1 = pd.Series([100, 105, 110, 114, 89],index=pd.date_range(start='2021-01-01', end='2021-01-05', freq='D'), name='test')
test_ser2 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
Perhaps this is more elegantly done with numpy.
Try this using np.outer with the pandas DataFrame constructor:
import numpy as np

pd.DataFrame(np.outer(ser1, test_ser2), index=ser1.index, columns=test_ser2.index)
Output:
a b c d e
2021-01-01 100 200 300 400 500
2021-01-02 105 210 315 420 525
2021-01-03 110 220 330 440 550
2021-01-04 114 228 342 456 570
2021-01-05 89 178 267 356 445
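The same table can also be produced with plain NumPy broadcasting, which makes the column-vector-times-row-vector structure explicit. A sketch with the same ser1 and test_ser2:
import numpy as np
import pandas as pd

# ser1 as a column vector multiplied by test_ser2 as a row vector
out = pd.DataFrame(ser1.to_numpy()[:, None] * test_ser2.to_numpy()[None, :],
                   index=ser1.index, columns=test_ser2.index)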
In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
'qty': np.random.randn(data_size)}).sort_values('symbol')
How do I get a dataframe with a variable number of rows per symbol, as given by this dictionary?
I tried [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)], but it gives a RuntimeWarning and the result looks very incorrect.
You can use concat with list comprehension:
print(pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k, v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Adding another method using groupby + cumcount and df.query:
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012
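A groupby/apply variant is another option: each group trims itself to its own limit, and symbols missing from max_rows default to zero rows. A sketch using the same df and max_rows:
# group_keys=False keeps the original flat index; g.name is the group's symbol
out = (df.groupby('symbol', group_keys=False)
         .apply(lambda g: g.head(max_rows.get(g.name, 0))))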
There are two tables whose entries may have different id types. I need to join the two tables based on the id_type of df1 and the corresponding column of df2. For background, the ids are security ids in the financial world; the id type may be CUSIP, ISIN, RIC, etc.
print(df1)
id id_type value
0 11 type_A 0.1
1 22 type_B 0.2
2 13 type_A 0.3
print(df2)
type_A type_B type_C
0 11 21 xx
1 12 22 yy
2 13 23 zz
The desired output is
type_A type_B type_C value
0 11 21 xx 0.1
1 12 22 yy 0.2
2 13 23 zz 0.3
Here is an alternative approach, which generalizes to many security types (CUSIP, ISIN, RIC, SEDOL, etc.).
First, create df1 and df2 along the lines of the original example:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'sec_id': [11, 22, 33],
'sec_id_type': ['CUSIP', 'ISIN', 'RIC'],
'value': [100, 200, 300]})
df2 = pd.DataFrame({'CUSIP': [11, 21, 31],
'ISIN': [21, 22, 23],
'RIC': [31, 32, 33],
'SEDOL': [41, 42, 43]})
Second, create an intermediate data frame x1. We will use the first column for one join, and the second and third columns for a different join:
index = [idx for idx in df2.index for _ in df2.columns]
sec_id_types = df2.columns.to_list() * df2.shape[0]
sec_ids = df2.values.ravel()
data = [
(idx, sec_id_type, sec_id)
for idx, sec_id_type, sec_id in zip(index, sec_id_types, sec_ids)
]
x1 = pd.DataFrame.from_records(data, columns=['index', 'sec_id_type', 'sec_id'])
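As an aside, x1 can be built in one step with melt, which is effectively what the index/ravel construction above does by hand (row order differs, but the subsequent merges do not depend on it). A sketch:
# Equivalent construction of x1 via melt
x1_alt = (df2.reset_index()
             .melt(id_vars='index', var_name='sec_id_type', value_name='sec_id'))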
Join df1 and x1 to extract values from df1:
x2 = (x1.merge(df1, on=['sec_id_type', 'sec_id'], how='left')
.dropna()
.set_index('index'))
Finally, join df2 and x2 (from the previous step) to get the final result:
print(df2.merge(x2, left_index=True, right_index=True, how='left'))
CUSIP ISIN RIC SEDOL sec_id_type sec_id value
0 11 21 31 41 CUSIP 11 100.0
1 21 22 32 42 ISIN 22 200.0
2 31 23 33 43 RIC 33 300.0
The columns sec_id_type and sec_id show the joins work as expected.
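If only the value column is wanted in the final output (without the diagnostic sec_id_type and sec_id columns), the last join can be restricted to it. A small sketch:
# Append just the looked-up values to df2
result = df2.merge(x2[['value']], left_index=True, right_index=True, how='left')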
NEW Solution 1: create a temporary column that determines the ID with np.where:
import numpy as np

df2['id'] = np.where(df2['type_A'] == df1['id'], df2['type_A'], df2['type_B'])
df = pd.merge(df2, df1[['id', 'value']], how='left', on='id').drop('id', axis=1)
NEW Solution 2: Can you simply merge on the index? If not, go with Solution 1.
df = pd.merge(df2, df1['value'], how='left', left_index=True, right_index=True)
output:
type_A type_B type_C value
0 11 21 xx 0.1
1 12 22 yy 0.2
2 13 23 zz 0.3
OLD Solution:
Through a combination of pd.merge, pd.melt and pd.concat, I found a solution, although I wonder if there is a shorter way (probably):
df_A_B = pd.merge(df2[['type_A']], df2[['type_B']], how='left', left_index=True, right_index=True) \
.melt(var_name = 'id_type', value_name='id')
df_C = pd.concat([df2[['type_C']]] * 2).reset_index(drop=True)
df_A_B_C = pd.merge(df_A_B, df_C, how='left', left_index=True, right_index=True)
df3 = pd.merge(df_A_B_C, df1, how='left', on=['id_type', 'id']).dropna().drop(['id_type', 'id'], axis=1)
df4 = pd.merge(df2, df3, how='left', on=['type_C'])
df4
output:
type_A type_B type_C value
0 11 21 xx 0.1
1 12 22 yy 0.2
2 13 23 zz 0.3
I have a data table df1 that looks like this (result of a df.groupby('id').agg(lambda x: x.tolist())):
df1:
id people
51 [125, 126, 127, 128, 129]
52 [302, 303, 128]
53 [312]
In another dataframe df2, I have mapped names and gender, according to a unique pid. The list entries in df1.people are in fact those pid items:
df2:
pid name gender
100 Jack Lumber m
125 Holly Polly f
126 Jeremy Owens m
127 Ron Bronco m
128 Natalia Berg f
129 Robyn Hill f
300 Crusty Clown m
302 Danny McKenny m
303 Tara Hill f
312 Glenn Dalough m
400 Fryda Beans f
Now I would like to replace or map the respective pid to the gender field from df2, and thereby create the following desired output, including per-gender list counts:
Outcome:
id gender count_m count_f
51 [f, m, m, f, f] 2 3
52 [m, f, f] 1 2
53 [m] 1 0
What's the best approach to create this table?
Solution:
from collections import Counter

d = dict(df2.drop(columns='name').values)  # pid -> gender mapping
m = df1.assign(gender=df1.people.apply(lambda x: [d.get(i) for i in x])).drop(columns='people')
n = pd.DataFrame([Counter(x) for x in m.gender], index=m.index).fillna(0).add_prefix('count_')
final = m.join(n)
You can use dict.get() to look up the corresponding dictionary values, then build a long dataframe by exploding the lists, apply crosstab, and merge:
import numpy as np

d = dict(df2.drop(columns='name').values)
m = df1.assign(gender=df1.people.apply(lambda x: [d.get(i) for i in x])).drop(columns='people')
n = pd.DataFrame({'id': m.loc[m.index.repeat(m.gender.str.len()), 'id'],
                  'gender': np.concatenate(m.gender)})
# for pandas >= 0.25 use: n = m.explode('gender')
final = m.merge(pd.crosstab(n.id, n.gender).add_prefix('count_'), left_on='id', right_index=True)
id gender count_f count_m
0 51 [f, m, m, f, f] 3 2
1 52 [m, f, f] 2 1
2 53 [m] 0 1
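On pandas >= 0.25 the whole pipeline compresses further, since explode replaces the repeat/concatenate step entirely. A sketch reusing the d mapping from above:
import pandas as pd

m = df1.assign(gender=df1.people.map(lambda x: [d.get(i) for i in x]))
n = m[['id', 'gender']].explode('gender')  # one row per (id, gender) pair
final = m.drop(columns='people').merge(
    pd.crosstab(n.id, n.gender).add_prefix('count_'),
    left_on='id', right_index=True)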