How do I aggregate a pandas Dataframe while retaining all original data? - pandas

My goal is to aggregate a pandas DataFrame, grouping rows by an identity field. Rather than just gathering summary statistics of each group, I want to retain all the information in the DataFrame in addition to summary statistics like mean, std, etc. I have performed this transformation via a lot of iteration, but I am looking for a cleaner/more pythonic approach. Note that there may be more or fewer than 2 replicates per group, but all groups will always have the same number of replicates.
Example: I would like to translate the below format
df = pd.DataFrame([
    ["group1", 4, 10],
    ["group1", 8, 20],
    ["group2", 6, 30],
    ["group2", 12, 40],
    ["group3", 1, 50],
    ["group3", 3, 60]],
    columns=['group', 'timeA', 'timeB'])
print(df)
group timeA timeB
0 group1 4 10
1 group1 8 20
2 group2 6 30
3 group2 12 40
4 group3 1 50
5 group3 3 60
into a df of the following format:
target = pd.DataFrame([
    ["group1", 4, 8, 6, 10, 20, 15],
    ["group2", 6, 12, 9, 30, 40, 35],
    ["group3", 1, 3, 2, 50, 60, 55]
], columns=["group", "timeA.1", "timeA.2", "timeA.mean", "timeB.1", "timeB.2", "timeB.mean"])
print(target)
group timeA.1 timeA.2 timeA.mean timeB.1 timeB.2 timeB.mean
0 group1 4 8 6 10 20 15
1 group2 6 12 9 30 40 35
2 group3 1 3 2 50 60 55
Finally, it doesn't really matter what the column names are; these are just to make the example clearer. Thanks!
EDIT: As suggested by a user in the comments, I tried the solution from the linked Q/A without success:
df.insert(0, 'count', df.groupby('group').cumcount())
df.pivot(*df)
TypeError: pivot() takes from 1 to 4 positional arguments but 5 were given

Try with pivot_table:
out = (df.assign(col=df.groupby('group').cumcount() + 1)
         .pivot_table(index='group', columns='col',
                      margins=True, margins_name='mean')  # aggfunc defaults to 'mean'
         .drop('mean')  # drop the extra 'mean' row added by margins
      )
out.columns = [f'{x}.{y}' for x,y in out.columns]
Output:
timeA.1 timeA.2 timeA.mean timeB.1 timeB.2 timeB.mean
group
group1 4.0 8.0 6.0 10 20 15
group2 6.0 12.0 9.0 30 40 35
group3 1.0 3.0 2.0 50 60 55
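Not part of the original answer, but for comparison, a minimal sketch of an alternative that pivots the replicates and joins separately computed group means (assuming, as the question states, every group has the same number of replicates; the names wide, means and out are just illustrative):
# Sketch: pivot the raw replicate values, then join per-group means.
wide = (df.assign(col=df.groupby('group').cumcount() + 1)
          .pivot(index='group', columns='col'))
wide.columns = [f'{a}.{b}' for a, b in wide.columns]

means = df.groupby('group').mean().add_suffix('.mean')
out = wide.join(means)
The column order differs slightly from the target, but as the question notes, the exact names and order do not matter.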

Related

rolling windows defined by backward cumulative sums

I have got a pandas DataFrame like this:
A B
0 3 ...
1 2
2 4
3 4
4 1
5 7
6 5
7 3
I would like to compute a rolling window along column A, summing its elements backwards until the sum reaches at least 10. The resulting windows should be:
A B window_indices
0 3 ... NA
1 2 NA
2 4 NA
3 4 --> [3,2,1]
4 1 [4,3,2,1]
5 7 [5,4,3]
6 5 [6,5]
7 3 [7,6,5]
Next, I want to compute some statistics on column B, something like this:
df.my_rolling(on='A', func='sum', threshold=10).B.mean()
I have got an idea: we could think of the elements of column A as seconds. Transform A into a datetime column and perform a standard rolling on it. But I don't know how to do that.
This is not doable with rolling, since the rolling window is not fixed:
import numpy as np

l = [[df.index[(df.A.loc[:x].iloc[::-1].cumsum() >= 10).idxmax():x + 1].tolist()[::-1]
      if df.A.loc[:x].sum() >= 10 else np.nan]
     for x in df.A.index]
l
Output:
[[nan],
[nan],
[nan],
[[3, 2, 1]],
[[4, 3, 2, 1]],
[[5, 4, 3]],
[[6, 5]],
[[7, 6, 5]]]
df['new'] = l
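From there, the statistic the question sketches as df.my_rolling(...).B.mean() can be taken per row. A minimal sketch, assuming column B holds numeric values (they are elided in the example) and using the hypothetical column name B_window_mean:
# Sketch: mean of B over each backward window; NaN where no window exists.
df['B_window_mean'] = [
    df.B.loc[cell[0]].mean() if isinstance(cell[0], list) else np.nan
    for cell in df['new']
]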

How to efficiently set default value for missing columns when extracting columns in pandas dataframe

I want to get values for "Banana", "Mango" and "Orange" columns in my pandas dataframe.
Suppose one of my pandas dataframes looks as follows. In other words, it does not have a column named "Mango".
Apple Orange Banana Pear
Basket1 10 20 30 40
Basket2 7 14 21 28
Basket3 55 15 8 12
In such situations, I want to set the value to "None". So, my output should be:
Banana Mango Orange
Basket1 30 None 20
Basket2 21 None 14
Basket3 8 None 15
I want to do this process in one line. Just wondering if it is possible to do so in pandas?
My code is as follows.
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3'])
df[df.columns & ["Banana", "Mango", "Orange"]]  # check which of the requested columns exist
My output so far looks follows.
Banana Orange
Basket1 30 20
Basket2 21 14
Basket3 8 15
I am happy to provide more details if needed.
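No answer is included in this excerpt, but one common one-liner (a sketch, not an authoritative solution) is DataFrame.reindex, which creates any missing columns and fills them with NaN, pandas' usual stand-in for a missing value:
# Sketch: reindex to the desired columns; "Mango" is created and filled with NaN.
out = df.reindex(columns=["Banana", "Mango", "Orange"])
print(out)
If a literal None is required rather than NaN, the NaN cells can be replaced afterwards (for example after casting the result to object dtype).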

Reshaping column values into rows with Identifier column at the end

I have measurements for Power related to different sensors i.e A1_Pin, A2_Pin and so on. These measurements are recorded in file as columns. The data is uniquely recorded with timestamps.
df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
                                 '12/15/2019', '12/16/2019'],
                    'A1_Pin': [2, 8, 8, 3, 9],
                    'A2_Pin': [1, 2, 3, 4, 5],
                    'A3_Pin': [85, 36, 78, 32, 75]})
I want to reshape the table so that each row corresponds to one sensor reading. The last column indicates the sensor ID to which the row data belongs.
The final table should look like:
df2 = pd.DataFrame({'DateTime': ['12/12/2019', '12/12/2019', '12/12/2019',
'12/13/2019', '12/13/2019','12/13/2019', '12/14/2019', '12/14/2019',
'12/14/2019', '12/15/2019','12/15/2019', '12/15/2019', '12/16/2019',
'12/16/2019', '12/16/2019'],
'Power': [2, 1, 85,8, 2, 36, 8,3,78, 3, 4, 32, 9, 5, 75],
'ModID': ['A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN','A1_PiN','A2_PiN','A3_PiN',
'A1_PiN','A2_PiN','A3_PiN']})
I have tried groupby, melt, reshape, stack and loops but could not get it to work. Could anyone help? Thanks
When you tried stack, you were on a good track. You need to set_index first and reset_index after, such as:
df2 = df1.set_index('DateTime').stack().reset_index(name='Power')\
         .rename(columns={'level_1': 'ModID'})  # to match the names in your expected output
And you get:
print (df2)
DateTime ModID Power
0 12/12/2019 A1_Pin 2
1 12/12/2019 A2_Pin 1
2 12/12/2019 A3_Pin 85
3 12/13/2019 A1_Pin 8
4 12/13/2019 A2_Pin 2
5 12/13/2019 A3_Pin 36
6 12/14/2019 A1_Pin 8
7 12/14/2019 A2_Pin 3
8 12/14/2019 A3_Pin 78
9 12/15/2019 A1_Pin 3
10 12/15/2019 A2_Pin 4
11 12/15/2019 A3_Pin 32
12 12/16/2019 A1_Pin 9
13 12/16/2019 A2_Pin 5
14 12/16/2019 A3_Pin 75
I'd try something like this:
df1.set_index('DateTime').unstack().reset_index()
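Since the question mentions having tried melt, a sketch of that route (not part of the answers above) reaches essentially the same long format in one call:
# Sketch: melt turns the sensor columns into (ModID, Power) pairs.
df2 = df1.melt(id_vars='DateTime', var_name='ModID', value_name='Power')
The rows come out grouped by sensor rather than by date; a sort_values('DateTime') afterwards restores the date-major ordering if that matters.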

How to join the ft.dfs result to a test set?

I know featuretools has the ft.calculate_feature_matrix method, but it calculates the features from the test data itself. I want to compute the features on the train data and join them to the test data, rather than recomputing the same features on the test data.
for example:
train data:
id sex score
1 f 100
2 f 200
3 m 10
4 m 20
after dfs, I get:
id sex score sex.mean(score)
1 f 100 150
2 f 200 150
3 m 10 15
4 m 20 15
I want to get something like this on the test set:
id sex score sex.mean(score)
5 f 30 150
6 f 40 150
7 m 50 15
8 m 60 15
not this:
id sex score sex.mean(score)
5 f 30 35
6 f 40 35
7 m 50 55
8 m 60 55
How can I achieve this? Thank you.
Featuretools works best with data that has been annotated directly with time information to handle cases like this. Then, when calculating your features, you specify a "cutoff time", and any data after that time is filtered out of the calculation. If we restructure your data and add in some time information, Featuretools can accomplish what you want.
First, let me create a DataFrame of people
import pandas as pd
people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})
which looks like this
id sex
0 1 f
1 2 f
2 3 m
3 4 m
4 5 f
5 6 f
6 7 m
7 8 m
Then, let's create a separate DataFrame of scores where we annotate each score with the time it occurred. This can be either a datetime or an integer. For simplicity in this example, I'll use time 0 for training data and time 1 for the test data.
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})
which looks like this
id person_id score time
0 1 1 100 0
1 2 2 200 0
2 3 3 10 0
3 4 4 20 0
4 5 5 30 1
5 6 6 40 1
6 7 7 50 1
7 8 8 60 1
Now, let's create an EntitySet in Featuretools specifying the "time index" in the scores entity
import featuretools as ft

es = ft.EntitySet('example')
es.entity_from_dataframe(dataframe=people,
                         entity_id='people',
                         index='id')
es.entity_from_dataframe(dataframe=scores,
                         entity_id='scores',
                         index='id',
                         time_index="time")

# create a sexes entity
es.normalize_entity(base_entity_id="people", new_entity_id="sexes", index="sex")

# add relationship for scores to person
scores_relationship = ft.Relationship(es["people"]["id"],
                                      es["scores"]["person_id"])
es = es.add_relationship(scores_relationship)
es
Here is our entity set
Entityset: example
Entities:
scores [Rows: 8, Columns: 4]
sexes [Rows: 2, Columns: 1]
people [Rows: 8, Columns: 2]
Relationships:
scores.person_id -> people.id
people.sex -> sexes.sex
Next, let's calculate the feature of interest. Notice how we use the cutoff_time argument to specify the latest time at which data is allowed to be used for the calculation. This ensures none of our testing data is made available during calculation.
from featuretools.primitives import Mean
mean_by_sex = ft.Feature(Mean(es["scores"]["score"], es["sexes"]), es["people"])
ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)
The output is now
sexes.MEAN(scores.score)
id
1 150
2 150
3 15
4 15
5 150
6 150
7 15
8 15
This functionality is powerful because we can handle time in a more fine-grained manner than a single train / test split.
For information on how time indexes work in Featuretools read the Handling Time page in the documentation.
EDIT
If you want to automatically define many features, you can use Deep Feature Synthesis by calling ft.dfs
feature_list = ft.dfs(target_entity="people",
                      entityset=es,
                      agg_primitives=["count", "std", "max"],
                      features_only=True)
feature_list
This returns feature definitions that you can pass to ft.calculate_feature_matrix (see the sketch after the list):
[<Feature: sex>,
<Feature: MAX(scores.score)>,
<Feature: STD(scores.time)>,
<Feature: STD(scores.score)>,
<Feature: COUNT(scores)>,
<Feature: MAX(scores.time)>,
<Feature: sexes.STD(scores.score)>,
<Feature: sexes.COUNT(people)>,
<Feature: sexes.STD(scores.time)>,
<Feature: sexes.MAX(scores.score)>,
<Feature: sexes.MAX(scores.time)>,
<Feature: sexes.COUNT(scores)>]
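For example, following the same pattern as the single-feature calculation above and again restricting the calculation to the training period (the name feature_matrix is just illustrative):
# Sketch: compute all DFS-defined features, using only data at or before time 0.
feature_matrix = ft.calculate_feature_matrix(entityset=es,
                                             features=feature_list,
                                             cutoff_time=0)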
Read more about DFS in this write-up

How to multiply iteratively down a column?

I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
I would like to calculate, for each row, the product of its value and the values of all rows above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200.
How do I do this?
Use cumprod.
Example:
df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note, since your example is just a single column, a cumulative product can be done succinctly with a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to @bananafish for locating the built-in cumprod method.