Why does numpy.cross() support only 2 or 3 dimensions?

Why does numpy.cross() support only 2 or 3 dimensions, whereas Wikipedia says that a cross product also exists in 7 dimensions?
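For context, here is a minimal demonstration of the behaviour in question (the exact error wording varies by NumPy version, and newer versions deprecate the 2-component case):

import numpy as np

# 3-component vectors: the standard 3D cross product.
print(np.cross([1, 0, 0], [0, 1, 0]))   # [0 0 1]

# 2-component vectors: NumPy returns the scalar z-component of the
# implied 3D cross product (deprecated in newer NumPy versions).
print(np.cross([1, 0], [0, 1]))         # 1

# Any other length, e.g. 7 components, raises a ValueError, even though
# a (non-unique) 7-dimensional cross product exists mathematically.
try:
    np.cross(np.ones(7), np.arange(7))
except ValueError as err:
    print(err)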

Related

Is it possible to calculate a feature matrix only for test data?

I have more than 100,000 rows of training data with timestamps and would like to calculate a feature matrix for new test data, of which there are only 10 rows. Some of the features in the test data will end up aggregating some of the training data. I need the implementation to be fast since this is one step in a real-time inference pipeline.
I can think of two ways this could be implemented:
1. Concatenate the train and test entity sets, run DFS, keep only the last 10 rows and throw away the rest. This is very time consuming. Is there a way to calculate a feature matrix for only a subset of instances while using data from the entire entity set?
2. Use the steps outlined in the "Calculating Feature Matrix for New Data" section on the Featuretools Deployment page. However, as demonstrated below, this doesn't seem to work.
Create all/train/test entity sets:
import featuretools as ft
data = ft.demo.load_mock_customer(n_customers=3, n_sessions=15)
df_sessions = data['sessions']
# Create all/train/test entity sets.
all_es = ft.EntitySet(id='sessions')
train_es = ft.EntitySet(id='sessions')
test_es = ft.EntitySet(id='sessions')
all_es = all_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions,  # all sessions
    index='session_id',
    time_index='session_start',
)
train_es = train_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[:10],  # first 10 sessions
    index='session_id',
    time_index='session_start',
)
test_es = test_es.entity_from_dataframe(
    entity_id='sessions',
    dataframe=df_sessions.iloc[10:],  # last 5 sessions
    index='session_id',
    time_index='session_start',
)
# Normalise customer entities so we can group by customers.
all_es = all_es.normalize_entity(base_entity_id='sessions',
                                 new_entity_id='customers',
                                 index='customer_id')
train_es = train_es.normalize_entity(base_entity_id='sessions',
                                     new_entity_id='customers',
                                     index='customer_id')
test_es = test_es.normalize_entity(base_entity_id='sessions',
                                   new_entity_id='customers',
                                   index='customer_id')
Set cutoff_time, since we are dealing with timestamped data:
cutoff_time = (df_sessions
               .filter(['session_id', 'session_start'])
               .rename(columns={'session_id': 'instance_id',
                                'session_start': 'time'}))
Calculate feature matrix for all data:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time,
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
1           3            1
2           3            2
3           1            1
4           2            1
5           2            2
6           2            3
7           2            4
8           1            2
9           2            5
10          1            3
11          1            4
12          2            6
13          3            3
14          1            5
15          3            4
Calculate feature matrix for train data:
feature_matrix, features_defs = ft.dfs(entityset=train_es,
                                       cutoff_time=cutoff_time.iloc[:10],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
1           3            1
2           3            2
3           1            1
4           2            1
5           2            2
6           2            3
7           2            4
8           1            2
9           2            5
10          1            3
Calculate feature matrix for test data (using method shown in "Feature Matrix for New Data" on the Featuretools Deployment page):
feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=test_es,
                                             cutoff_time=cutoff_time.iloc[10:])
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
11          1            1
12          2            1
13          3            1
14          1            2
15          3            2
As you can see, the feature matrix generated from train_es matches the first 10 rows of the feature matrix generated from all_es. However, the feature matrix generated from test_es doesn't match the corresponding rows from the feature matrix generated from all_es.
You can control which instances you want to generate features for with the cutoff_time dataframe (or the instance_ids argument in DFS if the cutoff time is a single datetime). Featuretools will only generate features for instances whose IDs are in the cutoff time dataframe and will ignore all others:
feature_matrix, features_defs = ft.dfs(entityset=all_es,
                                       cutoff_time=cutoff_time[10:],
                                       target_entity='sessions')
display(feature_matrix.filter(['customer_id', 'customers.COUNT(sessions)']))
session_id  customer_id  customers.COUNT(sessions)
11          1            4
12          2            6
13          3            3
14          1            5
15          3            4
The method in "Feature Matrix for New Data" is useful when you want to calculate the same features on entirely new data. All the same features will be created, but data isn't shared between the entity sets. That doesn't work in this case, since the goal is to use all the data but only generate features for certain instances.
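If the full entity set is available at inference time, the same instance selection also works with saved feature definitions; here is a minimal sketch reusing the names from the example above:

# Calculate features for the test instances only, but against the full
# entity set so that aggregations see all of the historical data.
feature_matrix = ft.calculate_feature_matrix(features=features_defs,
                                             entityset=all_es,
                                             cutoff_time=cutoff_time.iloc[10:])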

Reduce two columns of IDs with a many to many relationship

I have a dataset with two columns of non-unique IDs (ID-A and ID-B respectively).
A single ID-A can have multiple ID-Bs and vice versa. I am trying to use transitivity to generate a third, set-level identifier (call it ID-C) that takes the same value for all records that share either an ID-A or an ID-B. Two records that share neither ID-A nor ID-B should only get the same ID-C if a transitive chain of records connects them.
To visualize: I have the first two columns, and want to generate the third column (ID-C):
ID-A ID-B ID-C
1 1 1
1 2 1
1 3 1
2 2 1
2 4 1
3 4 1
4 5 2
5 5 2
5 6 2
6 7 3
I am using Presto SQL inside of AWS Athena, so as far as I am aware, I cannot use any variables or loops.
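For reference, the grouping rule described above is a connected-components computation: treat each ID-A and each ID-B as a node, treat each record as an edge, and assign one ID-C per component. Athena/Presto has no loops or recursion, so here is a minimal Python sketch of that logic using union-find, purely to make the intended semantics concrete:

def assign_set_ids(pairs):
    """Assign an ID-C to each (ID-A, ID-B) record via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Prefix the IDs so that ID-A 1 and ID-B 1 are distinct nodes.
    for a, b in pairs:
        union(('A', a), ('B', b))

    labels, out = {}, []
    for a, b in pairs:
        root = find(('A', a))
        labels.setdefault(root, len(labels) + 1)
        out.append((a, b, labels[root]))
    return out

pairs = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 4), (3, 4),
         (4, 5), (5, 5), (5, 6), (6, 7)]
for row in assign_set_ids(pairs):
    print(row)

Run on the ten example rows, this reproduces the ID-C column shown above.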

SQL table structure for store value against list of combination

I have a requirement from a client where I need to store a value against each combination from a list of items.
For example, I have the following LOBs, and against each combination I need to store a value:
Auto
WC
Personal
I proposed multiple solutions, but he is not satisfied with any of them.
Solution 1: create a single table and insert a value against every possible combination (as a string), something like:
LOB Value
Auto 1
WC 2
Personal 3
Auto,WC 4
Auto, personal 5
WC, Personal 6
Auto, WC, Personal 7
Solution 2: create lkp_lob, lob_group and lob_group_detail tables, where each combination is represented as a group.
Lkp_lob
Lob_key Name
1 Auto
2 WC
3 Personal
Lob_group (unique key constraint on lob_group_key and lob_key)
Lob_group_key Lob_key
1 1
2 2
3 3
4 1
4 2
5 1
5 3
6 2
6 3
7 1
7 2
7 3
Lob_group_detail
Lob_group_key Value
1 1
2 2
3 3
4 4
5 5
6 6
7 7
Any suggestion would be highly appreciated.
First of all, I did not fully understand the terms you used. But from a database perspective, it is generally better to have separate tables for each module, as in your second solution: you will face fewer difficulties when doing CRUD operations, and it will be faster.
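To make Solution 2 concrete, here is a minimal Python sketch of the lookup it implies, with the three tables modeled as in-memory collections (the table and column names follow the question; the dict representation is only for illustration):

# lkp_lob: lookup of LOB keys to names.
lkp_lob = {1: 'Auto', 2: 'WC', 3: 'Personal'}

# lob_group: one row per (group, member); each group is a set of LOB keys.
lob_group = {
    1: {1}, 2: {2}, 3: {3},
    4: {1, 2}, 5: {1, 3}, 6: {2, 3},
    7: {1, 2, 3},
}

# lob_group_detail: the value stored against each group.
lob_group_detail = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7}

def value_for(lob_names):
    """Return the stored value for an exact combination of LOB names."""
    wanted = {key for key, name in lkp_lob.items() if name in lob_names}
    for group_key, members in lob_group.items():
        if members == wanted:
            return lob_group_detail[group_key]
    return None

print(value_for({'Auto', 'WC'}))  # -> 4

In SQL the equivalent is a relational-division style query that joins lob_group to lob_group_detail and matches the requested set of lob_keys exactly.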

How to fit multiple equal references into table structure?

How could I fit multiple equal references into a table structure? For example, I have a list of classmates:
1 Peter
2 Jack
3 John
4 Mary
5 Birgit
6 Stella
7 Janus
8 Margo
9 Fred
Now I want to define fellowships. To begin, let's say that every kid may belong to only one fellowship. So we could have 3 fellowships:
[Peter, Jack]
[John, Mary, Birgit]
[Stella, Janus, Margo, Fred]
All members are equal, so each should reference all the other members. Is there a better way to define such relations than just having a table of pairs? Like:
1 2
3 4
3 5
4 5
4 3
5 3
5 4
6 7
6 8
6 9
7 6
7 8
7 9
8 6
8 7
8 9
9 6
9 7
9 8
If using a table of pairs, is it better to describe the relation both ways (like above), or is it enough to have a link in just one direction? What are the benefits of each approach?
A table of pairs also does not constrain any member to just one fellowship, but how would that be possible?
I was looking for an SQL table solution, but maybe there are better tools for handling such data structures, so I added the nosql tag too. I am looking for the right tools for such data, but I am eager to know how to fit it into SQL tables as well.
Yes, there is another way. If you have "fellowships", then you do not have pair-wise relationships. Start with a Fellowships table that has a FellowshipId.
Then you would have a FellowshipsKids table. This is called a junction table, and it would have one row for each member of each fellowship. It would have rows like this:
FellowshipId KidId
1 1
1 2
2 3
2 4
2 5
. . .
What you have is an m-n relationship between fellowships and kids: one fellowship can have multiple kids, and one kid can be in multiple fellowships. A junction table is the standard way of representing this in a relational database.
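Here is a minimal Python sketch of how the junction table serves both directions of the m-n relationship (the data mirrors the example rows above; in SQL each lookup would be a WHERE clause on FellowshipsKids):

# Junction table: one row per (FellowshipId, KidId) membership.
fellowship_kids = [
    (1, 1), (1, 2),                  # Peter, Jack
    (2, 3), (2, 4), (2, 5),          # John, Mary, Birgit
    (3, 6), (3, 7), (3, 8), (3, 9),  # Stella, Janus, Margo, Fred
]

# One direction: members of fellowship 2.
print([kid for fid, kid in fellowship_kids if fid == 2])  # [3, 4, 5]

# Other direction: fellowships that kid 5 belongs to.
print([fid for fid, kid in fellowship_kids if kid == 5])  # [2]

To enforce the "every kid belongs to at most one fellowship" variant from the question, a unique constraint on KidId in the junction table is enough.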

How to match already-calculated means to the original data set?

I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category  Average
A         a
B         b
C         c
etc...
Now I want to take these averages and attach them as another column on my original data.
So, I want it to look something like this:
Categories  Averages
B           b
A           a
B           b
C           c
B           b
C           c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- lapply(split(data$Value, data$Category), mean)
> avg
$A
[1] 2

$B
[1] 5.5

$C
[1] 8.5

> data$Averages <- unlist(avg[data$Category])
> data
Category Value Averages
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
For larger datasets, packages such as plyr or data.table can do this more efficiently.