Is it possible to calculate a feature matrix only for test data? - pandas

I have more than 100,000 rows of training data with timestamps and would like to calculate a feature matrix for new test data, of which there are only 10 rows. Some of the features in the test data will end up aggregating some of the training data. I need the implementation to be fast since this is one step in a real-time inference pipeline.
I can think of two ways this can be implemented:
Concatenating the train and test entity sets, running DFS, and then keeping only the last 10 rows and throwing away the rest. This is very time consuming. Is there a way to calculate features for only a subset of an entity set while still using data from the entire entity set?
Using the steps outlined in the Calculating Feature Matrix for New Data section on the Featuretools Deployment page. However, as demonstrated below, this doesn't seem to work.
Create all/train/test entity sets:
import featuretools as ft
data = ft.demo.load_mock_customer(n_customers=3, n_sessions=15)
df_sessions = data['sessions']
# Create all/train/test entity sets.
all_es = ft.EntitySet(id='sessions')
train_es = ft.EntitySet(id='sessions')
test_es = ft.EntitySet(id='sessions')
all_es = all_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions,  # all sessions
    index='session_id',
    time_index='session_start',
)
train_es = train_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[:10],  # first 10 sessions
    index='session_id',
    time_index='session_start',
)
test_es = test_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[10:],  # last 5 sessions
    index='session_id',
    time_index='session_start',
)
# Normalise customer entities so we can group by customers.
all_es = all_es.normalize_entity(base_entity_id='sessions',
                                 new_entity_id='customers',
                                 index='customer_id')
train_es = train_es.normalize_entity(base_entity_id='sessions',
                                     new_entity_id='customers',
                                     index='customer_id')
test_es = test_es.normalize_entity(base_entity_id='sessions',
                                   new_entity_id='customers',
                                   index='customer_id')
Set cutoff_time since we are dealing with timestamped data:
cutoff_time = (df_sessions
               .filter(['session_id', 'session_start'])
               .rename(columns={'session_id': 'instance_id',
                                'session_start': 'time'}))
Calculate feature matrix for all data:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time,
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
1           3            1
2           3            2
3           1            1
4           2            1
5           2            2
6           2            3
7           2            4
8           1            2
9           2            5
10          1            3
11          1            4
12          2            6
13          3            3
14          1            5
15          3            4
Calculate feature matrix for train data:
feature_matrix, features_defs = ft.dfs(entityset=train_es,
                                       cutoff_time=cutoff_time.iloc[:10],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
1           3            1
2           3            2
3           1            1
4           2            1
5           2            2
6           2            3
7           2            4
8           1            2
9           2            5
10          1            3
Calculate feature matrix for test data (using the method shown in the "Calculating Feature Matrix for New Data" section of the Featuretools Deployment page):
feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=test_es,
                                             cutoff_time=cutoff_time.iloc[10:])
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
11          1            1
12          2            1
13          3            1
14          1            2
15          3            2
As you can see, the feature matrix generated from train_es matches the first 10 rows of the feature matrix generated from all_es. However, the feature matrix generated from test_es doesn't match the corresponding rows from the feature matrix generated from all_es.

You can control which instances you want to generate features for with the cutoff_time dataframe (or the instance_ids argument in DFS if the cutoff time is a single datetime). Featuretools will only generate features for instances whose IDs are in the cutoff time dataframe and will ignore all others:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time[10:],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
11          1            4
12          2            6
13          3            3
14          1            5
15          3            4
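The instance_ids argument mentioned above works the same way when a single cutoff datetime applies to every instance. A minimal sketch, assuming the same all_es and an arbitrary, hypothetical cutoff date:
import pandas as pd

# Only sessions 11-15 get feature rows; the single cutoff applies to all of them.
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       target_entity='sessions',
                                       instance_ids=[11, 12, 13, 14, 15],
                                       cutoff_time=pd.Timestamp('2014-01-01'))  # hypothetical date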
The method in "Feature Matrix for New Data" is useful when you want to calculate the same features but on entirely new data. All the same features will be created, but data isn't shared between the entitysets. That doesn't work in this case, since the goal is to use all the data but only generate features for certain instances.
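For a real-time pipeline, the feature definitions produced during training can also be reused with calculate_feature_matrix against an entityset that holds all of the data, with the cutoff_time dataframe restricted to the new rows. A sketch, assuming the features_defs and cutoff_time built earlier:
# Only the last 5 sessions get feature rows, but the aggregations still see
# every session because the full entityset (all_es) is passed in.
feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=all_es,
                                             cutoff_time=cutoff_time.iloc[10:])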

Related

How to speed up the loop in a dataframe

I would like to speed up my loop because I have to run it on 900,000 rows.
To simplify, here is a sample.
I would like to add a column named 'Count' which counts the number of times the score was under the target score for the same player. The target changes for each row.
Input :
index Nom player score target score
0 0 felix 3 10
1 1 felix 8 7
2 2 theo 4 5
3 3 patrick 12 6
4 4 sophie 7 6
5 5 sophie 3 6
6 6 felix 2 4
7 7 felix 2 2
8 8 felix 2 3
Result :
index Nom player score target score Count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
Below is the code I currently use. Is it possible to speed it up? I saw some articles about vectorization; can it be applied to my calculation? If yes, how?
df2 = df.copy()
df2['Count']= [np.count_nonzero((df.values[:,1] == row[2] )& ( df.values[:,2] < row[4]) ) for row in df.itertuples()]
print(df2)
Jérôme Richard's insights for an O(n log n) solution can be translated to pandas. The speed-up depends on the number and size of the groups in the dataframe.
df2 = df.copy()
gr = df2.groupby('Nom player')
lookup = gr.score.apply(np.sort).to_dict()
df2['count'] = gr.apply(
    lambda x: pd.Series(
        np.searchsorted(lookup[x.name], x['target score']),
        index=x.index)
).droplevel(0)
print(df2)
Output
index Nom player score target score count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
You can try:
df['Count'] = df.groupby("Nom player").apply(
    lambda x: pd.Series((sum(x["score"] < s) for s in x["target score"]), index=x.index)
).droplevel(0)
print(df)
Prints:
index Nom player score target score Count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
EDIT: Quick benchmark:
from timeit import timeit

def add_count1(df):
    df["Count"] = (
        df.groupby("Nom player")
        .apply(
            lambda x: pd.Series(
                ((x["score"] < s).sum() for s in x["target score"]), index=x.index
            )
        )
        .droplevel(0)
    )

def add_count2(df):
    df["Count"] = [
        np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
        for row in df.itertuples()
    ]

def add_count3(df):
    gr = df.groupby('Nom player')
    lookup = gr.score.apply(lambda x: np.sort(np.array(x))).to_dict()
    df['count'] = gr.apply(
        lambda x: pd.Series(
            np.searchsorted(lookup[x.name], x['target score']),
            index=x.index)
    ).droplevel(0)

df = pd.concat([df] * 1000).reset_index(drop=True)  # DataFrame of len=9000

t1 = timeit("add_count1(x)", "x=df.copy()", number=1, globals=globals())
t2 = timeit("add_count2(x)", "x=df.copy()", number=1, globals=globals())
t3 = timeit("add_count3(x)", "x=df.copy()", number=1, globals=globals())

print(t1)
print(t2)
print(t3)
Prints on my computer:
0.7540620159707032
6.63946107000811
0.004106967011466622
So my answer should be faster than the original, but Michael Szczesny's answer is the fastest.
There are two main issues in the current code. CPython string objects are slow, especially string comparisons. Moreover, the current algorithm has quadratic complexity: for each row, it compares against all rows that match the current one. The latter is the biggest issue for large dataframes.
Implementation
The first thing to do is to replace the string comparisons with something faster. String objects can be converted to native strings using np.array. Then the unique strings, as well as their locations, can be extracted using np.unique. This basically helps us replace the string-matching problem with an integer-matching problem. Comparing native integers is generally significantly faster, mainly because processors handle them well and Numpy can use efficient SIMD instructions to compare integers. Here is how to convert the string column to label indices:
# 0.065 ms
labels, labelIds = np.unique(np.array(df.values[:,1], dtype='U'), return_inverse=True)
Now we can group the scores by label (player name) efficiently. The catch is that Numpy does not provide any group-by function. While it is possible to do this efficiently using multiple np.argsort calls, a basic pure-Python dict-based approach turns out to be pretty fast in practice. Here is the code that groups scores by label and sorts the scores associated with each label (useful for the next step):
# 0.014 ms
from collections import defaultdict
scoreByGroups = defaultdict(lambda: [])
labelIdsList = labelIds.tolist()
scoresList = df['score'].tolist()
targetScoresList = df['target score'].tolist()
for labelId, score in zip(labelIdsList, scoresList):
    scoreByGroups[labelId].append(score)
for labelId, scoreGroup in scoreByGroups.items():
    scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))
scoreByGroups can now be used to efficiently find the number of scores smaller than a given one for a given label. One just needs to read scoreByGroups[label] (constant time) and then do a binary search on the resulting array (O(log n)). Here is how:
# 0.014 ms
counts = [np.searchsorted(scoreByGroups[labelId], score)
          for labelId, score in zip(labelIdsList, targetScoresList)]
# Copies are slow, but adding a new column seems even slower
# 0.212 ms
df2 = df.copy()
df2['Count'] = np.fromiter(counts, np.int32)
Results
The resulting code takes 0.305 ms on my machine on the example input, while the initial code takes 1.35 ms. This means this implementation is about 4.5 times faster. Two thirds of the time is unfortunately spent in the creation of the new dataframe with the new column. Note that the provided code should be much faster than the initial code on large dataframes thanks to an O(n log n) complexity instead of an O(n²) one.
Faster implementation for large dataframes
On large dataframes, calling np.searchsorted for each item is expensive due to Numpy's per-call overhead. One solution that easily removes this overhead is to use Numba. The computation can be optimized by using a list instead of a dictionary, since the labels are integers in the range 0..len(labelIds). The computation can also be partially done in parallel.
The string to int conversion can be made significantly faster using pd.factorize, though this is still an expensive process.
Here is the complete Numba-based solution:
import numba as nb

@nb.njit('(int64[:], int64[:], int64[:])', parallel=True)
def compute_counts(labelIds, scores, targetScores):
    groupSizes = np.bincount(labelIds)
    groupOffset = np.zeros(groupSizes.size, dtype=np.int64)
    scoreByGroups = [np.empty(e, dtype=np.int64) for e in groupSizes]
    counts = np.empty(labelIds.size, dtype=np.int64)
    for labelId, score in zip(labelIds, scores):
        offset = groupOffset[labelId]
        scoreByGroups[labelId][offset] = score
        groupOffset[labelId] = offset + 1
    for labelId in nb.prange(len(scoreByGroups)):
        scoreByGroups[labelId].sort()
    for i in nb.prange(labelIds.size):
        counts[i] = np.searchsorted(scoreByGroups[labelIds[i]], targetScores[i])
    return counts

df2 = df.copy()                                     # Slow part
labelIds, labels = pd.factorize(df['Nom player'])   # Slow part
counts = compute_counts(                            # Pretty fast part
    labelIds.astype(np.int64),
    df['score'].to_numpy().astype(np.int64),
    df['target score'].to_numpy().astype(np.int64)
)
df2['Count'] = counts                               # Slow part
On my 6-core machine, this code is much faster on large dataframes. In fact, it is the fastest of the proposed answers. It is only 2.5 times faster than @MichaelSzczesny's on a random dataframe with 9000 rows. The string-to-int conversion takes 40-45% of the time and the creation of the new Pandas dataframe (with the additional column) takes 25% of the time. The time taken by the Numba function itself is actually small in the end. Most of the time is ultimately lost in overheads.
Note that converting to categorical data can be done once (as a pre-computation) and can be useful for other computations, so it may actually not be so expensive.
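A minimal sketch of that pre-computation idea (an addition, not part of the benchmark above): convert the player column to a categorical dtype once and reuse the integer codes in place of the pd.factorize call:
df['Nom player'] = df['Nom player'].astype('category')              # one-time conversion
labelIds = df['Nom player'].cat.codes.to_numpy().astype(np.int64)   # plays the same role as the pd.factorize codes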

Labeling rows in pandas using multiple boolean conditions without chaining

I'm trying to label data in the original dataframe based on multiple boolean conditions. This is easy enough when labeling based on one or two conditions, but as I begin requiring multiple conditions the code becomes difficult to manage. The solution seems to be to break the code down into copies, but that causes chained-assignment errors. Here is one example of the issue...
This is a simplified version of what my data looks like:
df=pd.DataFrame(np.array([['ABC',1,3,3,4], ['std',0,0,2,4],['std',2,1,2,4],['std',4,4,2,4],['std',2,6,2,6]]), columns=['Note', 'Na','Mg','Si','S'])
df
Note Na Mg Si S
0 ABC 1 3 3 4
1 std 0 0 2 4
2 std 2 1 2 4
3 std 4 4 2 4
4 std 2 6 2 6
A standard (std) is located throughout the dataframe. I would like to create a label when the instrument fails. This occurs in the data when:
String condition met (Note = standard/std)
Na>0 & Mg>0
Doesn't fall outside of a calculated range for more than 2 elements.
For requirement 3 - Here is an example of a range:
maxMin=pd.DataFrame(np.array([['Max',3,3,3,7], ['Min',1,1,2,2]]), columns=['Note', 'Na','Mg','Si','S'])
maxMin
Note Na Mg Si S
0 Max 3 3 3 7
1 Min 1 1 2 2
Calculating out of bound standard:
elements=['Na','Mg','Si','S']
std=df[(df['Note'].str.contains('std|standard'))&(df['Na']>0)&(df['Mg']>0)]
df.loc[(std[elements].lt(maxMin.loc[1, :])|std[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>2]
Note Na Mg Si S
3 std 4 4 2 4
Now, I would like to label this datapoint within the original dataframe. Desired result:
Note Na Mg Si S Error
0 ABC 1 3 3 4 False
1 std 0 0 2 4 False
2 std 2 1 2 4 False
3 std 4 4 2 4 True
4 std 2 6 2 6 False
I've tried things like:
df['Error'].loc[std.loc[(std[elements].lt(maxMin.loc[1, :])|std[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>5].index.values.copy()]=True
That unfortunately causes a chained-assignment error.
How would you accomplish this without causing a chained-assignment error? Most books/tutorials revolve around creating one long expression, but as I dive deeper, I feel there might be a simpler solution. Any input would be appreciated.
I figured out a solution that works for me.
The solution was to use .index.values to create an array of the index labels that passed the boolean conditions. That array can then be used to edit the original dataframe.
##These two conditions can probably be combined
condition1=df[(df['Note'].str.contains('std|standard'))&(df['Na']>.01)&(df['Mg']>.01)]
##where condition1 is greater/less than the bounds of the known value.
##provides array where condition is true
OutofBounds=condition1.loc[(condition1[elements].lt(maxMin.loc[1, :])|condition1[elements].gt(maxMin.loc[0, :]).select_dtypes(include=['bool'])).sum(axis=1)>5].index.values
OutofBounds
out:array([ 3], dtype=int64)
Now I can pass the array into the original dataframe:
df.loc[OutofBounds, 'Error']=True
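If the rows that did not trip the check should read False rather than NaN, one option (reusing the df and OutofBounds from above) is to initialise the column before the assignment:
df['Error'] = False                   # default: no error
df.loc[OutofBounds, 'Error'] = True   # flag only the out-of-bounds standards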

Compute element overlap based on another column, pandas

If I have a dataframe of the form:
tag element_id
1 12
1 13
1 15
2 12
2 13
2 19
3 12
3 15
3 22
how can I compute the overlap of the tags in terms of element_id? The result, I guess, should be an overlap matrix of the form:
1 2 3
1 X 2 2
2 2 X 1
3 2 1 X
where I put X on the diagonal, since the overlap of a tag with itself is not relevant, and where the numbers in the matrix represent the total number of element_ids that the two tags share.
My attempts:
You can try to use a for loop like:
element_lst = []
for item in df.itertuples():
    element_lst += [item.element_id]
    element_tag = item.tag
    # then intersect the element_lst row by row.
    # This is extremely costly for large datasets.
The second thing I was thinking about was to use df.groupby('tag') and try to somehow intersect on element_id, but it is not clear to me how I can do that with grouped data.
merge + crosstab
# Find element overlap, remove same tag matches
res = df.merge(df, on='element_id').query('tag_x != tag_y')
pd.crosstab(res.tag_x, res.tag_y)
Output:
tag_y 1 2 3
tag_x
1 0 2 2
2 2 0 1
3 2 1 0
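If you prefer the diagonal masked out, as in the desired matrix, a small follow-up (assuming the crosstab result is stored in a variable) could be:
import numpy as np
overlap = pd.crosstab(res.tag_x, res.tag_y).astype(float)
overlap = overlap.mask(np.eye(len(overlap), dtype=bool))  # NaN on the diagonal, i.e. self-overlap ignored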

SQL table structure for store value against list of combination

I have a requirement from a client where I need to store a value against a list of combinations.
For example, I have the following LOBs, and against each combination I need to store a value.
Auto
WC
Personal
I proposed multiple solutions, but he is not satisfied with any of them.
Solution 1: create a single table and insert a value against each possible combination (as a string), something like:
LOB Value
Auto 1
WC 2
Personal 3
Auto,WC 4
Auto, personal 5
WC, Personal 6
Auto, WC, Personal 7
Solution 2: create lkp_lob, lob_group and lob_group_detail tables. Each LOB combination represents a group.
Lkp_lob
Lob_key Name
1 Auto
2 WC
3 Person
Lob_group (unique constraint on lob_group_key and lob_key)
Lob_group_key Lob_key
1 1
2 2
3 3
4 1
4 2
5 1
5 3
6 2
6 3
7 1
7 2
7 3
Lob_group_detail
Lob_group_key Value
1 1
2 2
3 3
4 4
5 5
6 6
7 7
Any suggestion would be highly appreciated.
First of all, I did not understand the terms you used.
But from a database perspective, it is always good to have separate tables for each module. You will face fewer difficulties when doing CRUD operations, and it will be faster.

Assigning one column to another column between pandas DataFrames (like vector to vector assignment)

I have a super strange problem which I spent the last hour trying to solve, but with no success. It is even more strange since I can't replicate it on a small scale.
I have a large DataFrame (150,000 entries). I took out a subset of it and did some manipulation. The subset was saved as a different variable, x.
x is smaller than the df, but its index is in the same range as the df. I'm now trying to assign x back to the DataFrame, replacing values in the same column:
rep_Callers['true_vpID'] = x.true_vpID
This inserts all the different values in x to the right place in df, but instead of keeping the df.true_vpID values that are not in x, it is filling them with NaNs. So I tried a different approach:
df.ix[x.index,'true_vpID'] = x.true_vpID
But instead of filling x values in the right place in df, the df.true_vpID gets filled with the first value of x, and only that value! I changed the first value of x several times to make sure this is indeed what is happening, and it is. I tried to replicate it on a small scale but it didn't work:
df = DataFrame({'a':ones(5),'b':range(5)})
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
z =Series([random() for i in range(5)],index = range(5))
0 0.812561
1 0.862109
2 0.031268
3 0.575634
4 0.760752
df.ix[z.index[[1,3]],'b'] = z[[1,3]]
a b
0 1 0.000000
1 1 0.812561
2 1 2.000000
3 1 0.575634
4 1 4.000000
5 1 5.000000
I really tried it all, need some new suggestions...
Try using df.update(updated_df_or_series)
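df.update aligns on the index and overwrites only the positions present (and non-NaN) in the object passed in, which is the "replace matching rows, keep the rest" behaviour asked for. A minimal sketch, assuming x shares index labels with df and has a true_vpID column:
# Only rows whose index labels appear in x are touched; every other row keeps
# its existing df.true_vpID value.
df.update(x[['true_vpID']])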
Also using a simple example, you can modify a DataFrame by doing an index query and modifying the resulting object.
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
df_2 = df_1.ix[3:5]
df_2.b = df_2.b + 2
df_2
a b
3 1 5
4 1 6
df_1
a b
0 1 0
1 1 1
2 1 2
3 1 5
4 1 6
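Note that .ix has since been removed from pandas. With current versions, the targeted assignment from the question would typically be written with .loc instead; a sketch, assuming the same df and x as in the question:
# Assign only at x's index labels; the remaining rows of df keep their values.
df.loc[x.index, 'true_vpID'] = x['true_vpID']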