How can I join features computed by ft.dfs on training data to a test set?

I know featuretools has the ft.calculate_feature_matrix method, but it calculates the features using the test data itself. I want to compute the features from the training data and then join them onto the test data, rather than recomputing the same features on the test data.
for example:
train data:
id sex score
1 f 100
2 f 200
3 m 10
4 m 20
after dfs, I get:
id sex score sex.mean(score)
1 f 100 150
2 f 200 150
3 m 10 15
4 m 20 15
I want to get this on the test set:
id sex score sex.mean(score)
5 f 30 150
6 f 40 150
7 m 50 15
8 m 60 15
not this:
id sex score sex.mean(score)
5 f 30 35
6 f 40 35
7 m 50 55
8 m 60 55
How can I achieve this? Thank you.

Featuretools works best with data that has been annotated directly with time information to handle cases like this. Then, when calculating your features, you specify a "cutoff time", and any data from after that time is filtered out. If we restructure your data and add in some time information, Featuretools can accomplish what you want.
First, let me create a DataFrame of people
import pandas as pd
people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})
which looks like this
id sex
0 1 f
1 2 f
2 3 m
3 4 m
4 5 f
5 6 f
6 7 m
7 8 m
Then, let's create a separate DataFrame of scores, where we annotate each score with the time it occurred. This can be either a datetime or an integer. For simplicity in this example, I'll use time 0 for the training data and time 1 for the test data.
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})
which looks like this
id person_id score time
0 1 1 100 0
1 2 2 200 0
2 3 3 10 0
3 4 4 20 0
4 5 5 30 1
5 6 6 40 1
6 7 7 50 1
7 8 8 60 1
Now, let's create an EntitySet in Featuretools specifying the "time index" in the scores entity
import featuretools as ft
es = ft.EntitySet('example')
es.entity_from_dataframe(dataframe=people,
                         entity_id='people',
                         index='id')
es.entity_from_dataframe(dataframe=scores,
                         entity_id='scores',
                         index='id',
                         time_index="time")

# create a sexes entity
es.normalize_entity(base_entity_id="people", new_entity_id="sexes", index="sex")

# add relationship for scores to person
scores_relationship = ft.Relationship(es["people"]["id"],
                                      es["scores"]["person_id"])
es = es.add_relationship(scores_relationship)
es
Here is our entity set
Entityset: example
Entities:
scores [Rows: 8, Columns: 4]
sexes [Rows: 2, Columns: 1]
people [Rows: 8, Columns: 2]
Relationships:
scores.person_id -> people.id
people.sex -> sexes.sex
Next, let's calculate the feature of interest. Notice that we use the cutoff_time argument to specify the last point in time at which data is allowed to be used for the calculation. This ensures none of our test data is made available during the calculation.
from featuretools.primitives import Mean
mean_by_sex = ft.Feature(Mean(es["scores"]["score"], es["sexes"]), es["people"])
ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)
The output is now
sexes.MEAN(scores.score)
id
1 150
2 150
3 15
4 15
5 150
6 150
7 15
8 15
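If you only need the feature values for the test people, you can slice this feature matrix and merge it back onto your raw test records with plain pandas. A minimal sketch, reusing the frames defined above (the fm, test_ids, and test_with_feature names are just illustrative):
fm = ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)

# keep only the test people (ids 5-8) and attach the training-based feature
test_ids = [5, 6, 7, 8]
test_rows = scores[scores["person_id"].isin(test_ids)]
test_with_feature = test_rows.merge(fm.loc[test_ids],
                                    left_on="person_id", right_index=True)
print(test_with_feature)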
This functionality is powerful because we can handle time in a more fine-grained manner than a single train/test split.
For information on how time indexes work in Featuretools, read the Handling Time page in the documentation.
EDIT
If you want to automatically define many features, you can use Deep Feature Synthesis by calling ft.dfs
feature_list = ft.dfs(target_entity="people",
                      entityset=es,
                      agg_primitives=["count", "std", "max"],
                      features_only=True)
feature_list
This returns feature definitions that you can pass to ft.calculate_feature_matrix:
[<Feature: sex>,
<Feature: MAX(scores.score)>,
<Feature: STD(scores.time)>,
<Feature: STD(scores.score)>,
<Feature: COUNT(scores)>,
<Feature: MAX(scores.time)>,
<Feature: sexes.STD(scores.score)>,
<Feature: sexes.COUNT(people)>,
<Feature: sexes.STD(scores.time)>,
<Feature: sexes.MAX(scores.score)>,
<Feature: sexes.MAX(scores.time)>,
<Feature: sexes.COUNT(scores)>]
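To turn those definitions into values while still respecting the training cutoff, you can pass them back to ft.calculate_feature_matrix the same way as before; a minimal sketch, assuming the entity set built above:
feature_matrix = ft.calculate_feature_matrix(features=feature_list,
                                             entityset=es,
                                             cutoff_time=0)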
Read more about DFS in this write-up

Related

Find common values within groupby in pandas Dataframe based on two columns

I have the following dataframe:
period symptoms recovery
1 4 2
1 5 2
1 6 2
2 3 1
2 5 2
2 8 4
2 12 6
3 4 2
3 5 2
3 6 3
3 8 5
4 5 2
4 8 4
4 12 6
I'm trying to find the common values of the df['period'] groups (1, 2, 3, 4) based on the values of the two columns 'symptoms' and 'recovery'.
Result should be :
symptoms recovery period
5 2 [1, 2, 3, 4]
8 4 [2, 4]
where each identical ('symptoms', 'recovery') pair has the periods it occurs in collected in a list or column.
Am I approaching the problem in the wrong way? I appreciate your help.
I tried turning each period into a dict and looping through to find the values, but that didn't work for me. I also tried groupby().apply(), but I'm not getting a meaningful data frame.
I tried sorting values based on the three columns, but couldn't get the common ones between each period section.
Last attempt :
df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
s = df.groupby(["symptoms", "recovery"]).size()
You were almost there:
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
period;symptoms;recovery
1;4;2
1;5;2
1;6;2
2;3;1
2;5;2
2;8;4
2;12;6
3;4;2
3;5;2
3;6;3
3;8;5
4;5;2
4;8;4
4;12;6
""")
df = pd.read_csv(data, sep=";")
# collect unique periods
df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()
This gives
symptoms recovery period
0 3 1 [2]
1 4 2 [1, 3]
2 5 2 [1, 2, 3, 4]
3 6 2 [1]
4 6 3 [3]
5 8 4 [2, 4]
6 8 5 [3]
7 12 6 [2, 4]
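If you only want the combinations that occur in more than one period, as in your expected result, you can filter the aggregated frame afterwards. A minimal sketch (out and common are just illustrative names):
out = df.groupby(['symptoms', 'recovery'])[['period']].agg(list).reset_index()

# keep only the (symptoms, recovery) pairs seen in more than one period
common = out[out['period'].str.len() > 1]
print(common)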

Combine two dataframes in Pandas to generate many to many relationship

I have two lists, say
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
I want to generate a Pandas dataframe such that:
All customers and accounts are used
There is a many-to-many relationship between customers and accounts (one customer 'may' have multiple accounts and an account 'may' be owned by multiple customers)
The many-to-many relationship is random. That is, some customers will have one account and some will have more than one. Similarly, some accounts will be owned by just one customer and others by more than one.
Something like,
Customer  Account
a         1
a         2
b         2
c         3
a         4
b         4
c         4
b         5
b         6
b         7
b         8
a         9
Since I am generating random data, in the worst-case scenario I can generate way too many accounts and discard the unused ones if that makes the code easier (essentially relaxing requirement 1 above).
I am using sample(n=20, replace=True) to generate 20 records in both dataframes and then merging them into one based on the index. Is there an out-of-the-box API or library to do this, or is my code the recommended way?
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

customers_df = pd.DataFrame(data=customers)
customers_df = customers_df.sample(n=20, replace=True)
customers_df['new_index'] = range(20)
customers_df.set_index('new_index', inplace=True)

accounts_df = pd.DataFrame(data=accounts)
accounts_df = accounts_df.sample(n=20, replace=True)
accounts_df['new_index'] = range(20)
accounts_df.set_index('new_index', inplace=True)

combined_df = pd.merge(customers_df, accounts_df, on='new_index')
print(combined_df)
Edit: Modified the question and added sample code I have tried.
One way to accomplish this is to collect the set of all possible relationships with a cartesian product, then select from that list before building your dataframe:
import itertools
import random
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

possible_associations = list(itertools.product(customers, accounts))
df = (pd.DataFrame.from_records(random.choices(possible_associations, k=20),
                                columns=['customers', 'accounts'])
        .sort_values(['customers', 'accounts']))
print(df)
Output
customers accounts
0 a 2
3 a 2
15 a 2
18 a 4
16 a 5
14 a 7
7 a 8
12 a 8
1 a 9
2 b 5
9 b 5
8 b 8
11 b 8
19 c 2
17 c 3
5 c 4
4 c 5
6 c 5
13 c 5
10 c 7
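If you also need to guarantee requirement 1 (every customer and every account appears at least once), one option is to seed the sample with one pair per account and one pair per customer before topping it up with random picks. A hedged sketch along those lines (seed_pairs and extra are illustrative names):
import itertools
import random
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# one pair per account and one per customer guarantees full coverage
seed_pairs = [(random.choice(customers), acc) for acc in accounts]
seed_pairs += [(cust, random.choice(accounts)) for cust in customers]

# top up with extra random pairs, then drop any duplicates
extra = random.choices(list(itertools.product(customers, accounts)), k=8)
df = (pd.DataFrame(seed_pairs + extra, columns=['customers', 'accounts'])
        .drop_duplicates()
        .sort_values(['customers', 'accounts'])
        .reset_index(drop=True))
print(df)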
To have a repeatable test result, start with np.random.seed(1) (drop it in the target version).
Then proceed as follows:
Create a list of probabilities for how many owners an account can have, e.g.:
prob = [0.5, 0.25, 0.15, 0.09, 0.01]
Generate a Series stating how many owners each account shall have:
import numpy as np
import pandas as pd
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts), p=prob),
                name='Customer')
Its name is Customer, because it will be the source for creating the Customer column.
For my sample probabilities and generator seeding, the result is:
0 1
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: Customer, dtype: int32
(the left column is the index, the right the actual values).
Because your data sample contains only 9 accounts, the result does not contain any of the larger numbers of owners. But in your target version, with more accounts, there will be accounts with larger numbers of owners.
Generate the result, the cust_acct DataFrame, which defines the assignment of customers to accounts:
cust_acct = (cnt.apply(lambda x: np.random.choice(customers, x, replace=False))
                .explode().to_frame()
                .join(pd.Series(accounts, name='Account'))
                .reset_index(drop=True))
The result, for your sample data and my seeding and probabilities, is:
Customer Account
0 b 1
1 a 2
2 b 2
3 b 3
4 b 4
5 c 5
6 b 6
7 c 7
8 a 8
9 b 9
Of course, you can assume different probabilities in prob.
You can also choose a different maximum number of owners (the number of entries in prob).
In that case no change to the code is needed, because the range of values passed to the first np.random.choice is derived from the length of prob.
Note: Because your sample data contains only 3 customers, a different generator seed can raise ValueError: Cannot take a larger sample than population when 'replace=False'.
The error occurs whenever the drawn number of owners for some account is greater than 3.
With your target data and its larger number of customers, this error will not occur.

How do I aggregate a pandas Dataframe while retaining all original data?

My goal is to aggregate a pandas DataFrame, grouping rows by an identity field. Notably, rather than just gathering summary statistics of the group, I want to retain all the information in the DataFrame in addition to summary statistics like mean, std, etc. I have performed this transformation via a lot of iteration, but I am looking for a cleaner/more pythonic approach. Notably, there may be more or less than 2 replicates per group, but all groups will always have the same number of replicates.
Example: I would like to translate the below format
import pandas as pd

df = pd.DataFrame([
    ["group1", 4, 10],
    ["group1", 8, 20],
    ["group2", 6, 30],
    ["group2", 12, 40],
    ["group3", 1, 50],
    ["group3", 3, 60]],
    columns=['group', 'timeA', 'timeB'])
print(df)
group timeA timeB
0 group1 4 10
1 group1 8 20
2 group2 6 30
3 group2 12 40
4 group3 1 50
5 group3 3 60
into a df of the following format:
target = pd.DataFrame([
    ["group1", 4, 8, 6, 10, 20, 15],
    ["group2", 6, 12, 9, 30, 40, 35],
    ["group3", 1, 3, 2, 50, 60, 55]],
    columns=["group", "timeA.1", "timeA.2", "timeA.mean",
             "timeB.1", "timeB.2", "timeB.mean"])
print(target)
group timeA.1 timeA.2 timeA.mean timeB.1 timeB.2 timeB.mean
0 group1 4 8 6 10 20 15
1 group2 6 12 9 30 40 35
2 group3 1 3 2 50 60 55
Finally, it doesn't really matter what the column names are; these are just to make the example clearer. Thanks!
EDIT: As suggested by a user in the comments, I tried the solution from the linked Q/A without success:
df.insert(0, 'count', df.groupby('group').cumcount())
df.pivot(*df)
TypeError: pivot() takes from 1 to 4 positional arguments but 5 were given
Try with pivot_table:
out = (df.assign(col=df.groupby('group').cumcount() + 1)
         .pivot_table(index='group', columns='col',
                      margins=True, margins_name='mean')
         .drop('mean'))
out.columns = [f'{x}.{y}' for x, y in out.columns]
Output:
timeA.1 timeA.2 timeA.mean timeB.1 timeB.2 timeB.mean
group
group1 4.0 8.0 6.0 10 20 15
group2 6.0 12.0 9.0 30 40 35
group3 1.0 3.0 2.0 50 60 55
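If you want group back as an ordinary column, matching the target frame exactly, a reset_index at the end is enough (the column labels are already flattened by the f-string rename above):
out = out.reset_index()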

Simplify Array Query with Range

I have a BigQuery table with 512 variables stored as arrays with quite long names (x__x_arrVal_arrSlices_0__arrValues to arrSlices_511). Each array holds 360 values. The BI tool cannot work with an array in this form, which is why I want to have each value as a separate output.
The query excerpt I use right now is:
SELECT
timestamp, x_stArrayTag_sArrayName, x_stArrayTag_sComission,
1 as row,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(1)] AS f001,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(10)] AS f010,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(20)] AS f020,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(30)] AS f030,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(40)] AS f040,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(50)] AS f050,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(60)] AS f060,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(70)] AS f070,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(80)] AS f080,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(90)] AS f090,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(100)] AS f100,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(110)] AS f110,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(120)] AS f120,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(130)] AS f130,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(140)] AS f140,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(150)] AS f150,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(160)] AS f160,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(170)] AS f170,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(180)] AS f180,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(190)] AS f190,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(200)] AS f200,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(210)] AS f210,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(220)] AS f220,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(230)] AS f230,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(240)] AS f240,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(250)] AS f250,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(260)] AS f260,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(270)] AS f270,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(280)] AS f280,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(290)] as f290,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(300)] AS f300,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(310)] AS f310,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(320)] AS f320,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(330)] AS f330,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(340)] AS f340,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(350)] AS f350,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(359)] AS f359
FROM
`project.table`
WHERE
_PARTITIONTIME >= "2017-01-01 00:00:00"
AND _PARTITIONTIME < "2018-02-16 00:00:00"
UNION ALL
Unfortunately, the output I get is only a fraction of all the values. Getting all 512*360 values with this query is not possible, because if I used this query for all slices I would hit BigQuery's limit.
Is there a possibility to rename the long names and to select a range?
Best regards
scotti
You can get 360 rows and 512 columns by using UNNEST. Here is a small example:
WITH data AS (
SELECT
[1, 2, 3, 4] as a,
[2, 3, 4, 5] as b,
[3, 4, 5, 6] as c
)
SELECT v1, b[OFFSET(off)] as v2, c[OFFSET(off)] as v3
FROM data, unnest(a) as v1 WITH OFFSET off
Output:
v1 v2 v3
1 2 3
2 3 4
3 4 5
4 5 6
Keeping in mind the somewhat messy table you are dealing with, the important aspect when deciding on a restructuring is how practical the query that implements that decision will be.
In your specific case I would recommend fully flattening the data as below: each row is transformed into roughly 180,000 rows, each representing one element of one of the arrays in the original row, where the slice field gives the array number and pos gives the element's position within that array. The query is generic enough to handle any number and naming of slices and any array sizes, and at the same time the result is flexible and generic enough to be used in any algorithm you can imagine.
#standardSQL
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
You can test / play with it using the dummy example below:
#standardSQL
WITH `project.dataset.messytable` AS (
SELECT 1 id,
[ 1, 2, 3, 4, 5] x__x_arrVal_arrSlices_0,
[11, 12, 13, 14, 15] x__x_arrVal_arrSlices_1,
[21, 22, 23, 24, 25] x__x_arrVal_arrSlices_2 UNION ALL
SELECT 2 id,
[ 6, 7, 8, 9, 10] x__x_arrVal_arrSlices_0,
[16, 17, 18, 19, 20] x__x_arrVal_arrSlices_1,
[26, 27, 28, 29, 30] x__x_arrVal_arrSlices_2
)
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
The result is as below:
Row id slice pos value
1 1 0 0 1
2 1 0 1 2
3 1 0 2 3
4 1 0 3 4
5 1 0 4 5
6 1 1 0 11
7 1 1 1 12
8 1 1 2 13
9 1 1 3 14
10 1 1 4 15
11 1 2 0 21
12 1 2 1 22
13 1 2 2 23
14 1 2 3 24
15 1 2 4 25
16 2 0 0 6
17 2 0 1 7
18 2 0 2 8
19 2 0 3 9
20 2 0 4 10
21 2 1 0 16
22 2 1 1 17
23 2 1 2 18
24 2 1 3 19
25 2 1 4 20
26 2 2 0 26
27 2 2 1 27
28 2 2 2 28
29 2 2 3 29
30 2 2 4 30
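With the data flattened this way, selecting a range is just an ordinary filter on pos (for example WHERE pos BETWEEN 0 AND 99), and the long column names no longer matter because slice and pos identify each value instead.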

How to multiply iteratively down a column?

I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
I would like to calculate, for each row, the running product of that row's value and the values in all rows above it. For example, at row 3, I would like to calculate 10 * 11 * 20, or 2,200.
How do I do this?
Use cumprod.
Example:
import pandas as pd

df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note, since your example is just a single column, a cumulative product can be done succinctly with a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to @bananafish for locating the built-in cumprod method.
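As a side note, if you ever want the product of only the rows strictly above the current one (excluding it), shifting the cumulative product by one does the trick, e.g. s.cumprod().shift(1), which for the sample above gives NaN, 10, 110, 2200, 11000.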