Python/Pandas: Transforming a column within a list of columns

I'd like to select a subset of columns from a DataFrame while applying a transformation to some of those columns at the same time. Is it possible to transform a column when that column is selected as one in a list of columns?
For example, I have a column StartDate of type np.datetime64 from which I'd like to extract the month.
When dealing with that Series on its own, I'd do something like
print(df['StartDate'].transform(lambda x: x.month))
to see the transformed data. Can I accomplish the same thing when the above expression is part of a list of columns? Something like:
print(df[['ColumnA', 'ColumnB', 'StartDate'.transform(lambda x: x.month)]])
Of course the above gives the error
AttributeError: 'str' object has no attribute 'month'
So, if my data looks like:
Metadata | Metadata | 2020-01-01
Metadata | Metadata | 2020-02-06
Metadata | Metadata | 2020-02-25
I'd like to see:
Metadata | Metadata | 1
Metadata | Metadata | 2
Metadata | Metadata | 2
Without appending a new separate "Month" column to the DataFrame. Is this possible?

If you have some data like below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.randint(10, size=366),
                   'col2': np.random.randint(10, size=366),
                   'StartDate': pd.date_range('2018', '2019')})
which looks like
col1 col2 StartDate
0 0 2 2018-01-01
1 8 0 2018-01-02
2 0 5 2018-01-03
3 3 4 2018-01-04
4 8 6 2018-01-05
... ... ... ...
361 8 8 2018-12-28
362 9 9 2018-12-29
363 4 1 2018-12-30
364 2 4 2018-12-31
365 0 9 2019-01-01
You could redefine the column in place, or you could use assign to get a transformed copy without touching the original, like:
df.assign(StartDate = df['StartDate'].dt.month)
which outputs:
col1 col2 StartDate
0 0 2 1
1 8 0 1
2 0 5 1
3 3 4 1
4 8 6 1
... ... ... ...
361 8 8 12
362 9 9 12
363 4 1 12
364 2 4 12
365 0 9 1
This also doesn't change the original dataframe. If you want to create a permanent version, then just reassign.
df = df.assign(StartDate = df['StartDate'].dt.month)
You could also take this further, for example:
df.assign(StartDate = df['StartDate'].dt.month, col1 = df['col1'] + 100)[['col1', 'StartDate']]
You can apply whatever transforms you need and then select whichever columns you want after assigning them; the above gives:
col1 StartDate
0 105 1
1 109 1
2 108 1
3 101 1
4 108 1
... ... ...
361 104 12
362 102 12
363 109 12
364 102 12
365 100 1

I guess you could use the name attribute of the Series.
Something like:
dt_to_month = lambda x: [d.month for d in x] if x.name == 'StartDate' else x
df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month)
will do the trick.
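A slightly more vectorized sketch of the same idea, using the .dt accessor instead of a per-element loop (ColumnA and ColumnB are the column names from the question):
dt_to_month = lambda x: x.dt.month if x.name == 'StartDate' else x
print(df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month))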

Related

Pandas split dataframe by sessions

I've got the following DataFrame:
id  sec
 1   45
 2    1
 3  176
 1   19
 1  876
 3  123
I want to split it into groups by id and then by session, or create multiple dataframes for these sessions. I want to get the sessions of each id (a new session starts when more than 30 seconds have passed between user actions).
For example:
sessions for id 1: [45, 19], [876]
I tried groupby and cat, but I have no idea how to implement this.
To identify the session you can use:
df['session'] = (df.sort_values(by=['id', 'sec'])
                   .groupby('id')['sec']
                   .apply(lambda s: s.diff().gt(30).cumsum().add(1))
                 )
Output:
id sec session
0 1 45 1
1 2 1 1
2 3 176 2
3 1 19 1
4 1 876 2
5 3 123 1
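From there, if you want the per-id lists from the question, one possible follow-up (a sketch, assuming the session column computed above) is to group on both keys:
sessions = df.groupby(['id', 'session'])['sec'].apply(list)
print(sessions.loc[1])   # session 1 -> [45, 19], session 2 -> [876]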

cumulative product for specific groups of observations in pandas

I have a dataset of the following type
Date ID window var
0 1998-01-28 X -5 8.500e-03
1 1998-01-28 Y -5 1.518e-02
2 1998-01-29 X -4 8.005e-03
3 1998-01-29 Y -4 7.905e-03
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
My aim is to calculate the cumulative product of the variable var according to the variable window. The idea is that for every date I have identified a window of 5 days around it (the variable window goes from -5 to 5). Now I want to calculate the cumulative product within the window that belongs to a specific date. For example, the first date (1998-01-28) has a window value of -5 and thus represents the starting point for the calculation of the cumprod. I want a new variable called cumprod which is exactly var on the date where window is -5, then the cumprod of var at -5 and -4, and so on until window equals 5. This defines cumprod for the first group of dates, where every group is a run of consecutive dates in which window starts at -5 and ends at 5. I then repeat this for every group of dates, obtaining something like
Date ID window var cumprod
0 1998-01-28 X -5 8.500e-03 8.500e-03
1 1998-01-28 Y -5 1.518e-02 1.518e-02
2 1998-01-29 X -4 8.005e-03 6.80425e-05
3 1998-01-29 Y -4 7.905e-03 0.00011999790000000002
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
where I gave an example of cumprod for the first two dates only.
How could I achieve this? I was thinking of attaching an identifier to every group of dates and then running some sort of cumprod() via .groupby(group_identifier), but I can't figure out how to do it. Would it be possible to simplify it by using a rolling function on window? Any other kind of approach is very welcome.
I suggest the following:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({"Date": pd.date_range("1998-01-28", freq="d", periods=22),
                   "window": np.concatenate([np.arange(-5, 6, 1), np.arange(-5, 6, 1)]),
                   "var": np.random.randint(1, 10, 22)})
My df is similar to yours:
Date window var
0 1998-01-28 -5 3
1 1998-01-29 -4 3
2 1998-01-30 -3 7
3 1998-01-31 -2 2
4 1998-02-01 -1 4
5 1998-02-02 0 7
6 1998-02-03 1 2
7 1998-02-04 2 1
8 1998-02-05 3 2
9 1998-02-06 4 1
10 1998-02-07 5 1
11 1998-02-08 -5 4
12 1998-02-09 -4 5
Then I create a grouping variable and transform var using cumprod:
df = df.sort_values("Date") # My df is already sorted by Date given the way
# I created it, but I add this to make sure yours is sorted by date
df["group"] = (df["window"] == -5).cumsum()
df = pd.concat([df, df.groupby("group")["var"].transform("cumprod")], axis=1)
And the result is:
Date window var group var
0 1998-01-28 -5 3 1 3
1 1998-01-29 -4 3 1 9
2 1998-01-30 -3 7 1 63
3 1998-01-31 -2 2 1 126
4 1998-02-01 -1 4 1 504
5 1998-02-02 0 7 1 3528
6 1998-02-03 1 2 1 7056
7 1998-02-04 2 1 1 7056
8 1998-02-05 3 2 1 14112
9 1998-02-06 4 1 1 14112
10 1998-02-07 5 1 1 14112
11 1998-02-08 -5 4 2 4
12 1998-02-09 -4 5 2 20
13 1998-02-10 -3 1 2 20
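Note that concat leaves you with two columns named var. As a small variant (a sketch, replacing the concat line above), you could write the cumulative product into a column with its own name:
df["cumprod"] = df.groupby("group")["var"].cumprod()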

How to assign the multiple values of an output to new multiple columns of a dataframe?

I have the following function:
def sum(x):
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')
    return [oneS, twoS, threeS, fourS, fiveS, sixS, sevenS, eightS, nineS, tenS]
How can I assign the outputs of this function to columns of a DataFrame that already exists?
The dataframe I am applying the function to is as below:
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The dataframe I want to add the columns to is something like the one below; the new columns OneS, TwoS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it is possible:
grouped_data['fm']= data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply( lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)

df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
                         .apply(f)
                         .unstack()
                         .add_prefix('new_')
                         .reset_index())
print(df1)
   Cycle  Type  new_0  new_1  new_2  new_3  new_4  new_5  new_6  new_7  new_8  new_9
0      1     1      0    101    102    205    207    209    315    211    211    106
1      9     1      0    101    102    205    207    209    315    211    211    106
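If you then want the column labels from the question (OneS, TwoS, ...) instead of the new_ prefix, one possible follow-up (a sketch, with the labels assumed from the question) is to rename the generated columns:
labels = ['OneS', 'TwoS', 'ThreeS', 'FourS', 'FiveS',
          'SixS', 'SevenS', 'EightS', 'NineS', 'TenS']
df1 = df1.rename(columns={f'new_{i}': lab for i, lab in enumerate(labels)})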

How to merge two pandas dataframes/series

I'm trying to merge multiple pandas data frames into one. I have 1 main frame with the locations of measurements. The other data frames contain multiple measurements for one location. Like below:
df 1: Location ID | X | Y | Z
      1           | 1 | 2 | 3
      2           | 3 | 2 | 1
      n           | ...
df 2: Location ID | Date              | Measurement
      1           | January 1 12:30   | 1
      1           | January 16 12:30  | 4
      1           | ...
df 3: Location ID | Date              | Measurement
      2           | January 1 12:30   | 3
      2           | January 16 12:30  | 9
      2           | ...
df n: Location ID | Date              | Measurement
      n           | January 1 12:30   | 4
      n           | January 16 12:30  | 6
      n           | January 20 11:30  | 7
      ...
I'm trying to create a data frame like this:
df_final: Location ID | X | Y | Z | January 1 12:30 | January 16 12:30 | January 20 11:30 | ...
          1           | 1 | 2 | 3 | 1               | 4                | NaN
          2           | 3 | 2 | 1 | 3               | 9                | NaN
          n           | 2 | 5 | 7 | 4               | 6                | 7
The dates are already datetime objects and the Location ID is the index of both dataframes.
I tried the append, merge and concat functions, both using two frames directly and converting a frame to a list with List = frame['measurements'] before adding it.
The problem is that either the rows are appended under the first data frame, while the measured values should instead go into new columns on an existing row (the row of that Location ID), or the dates end up as new rows while new columns with Location IDs are created.
I'm sorry my question lay-out is not so nice, but I'm new to this forum.
Found it myself.
I used frame.pivot to reshape df2-n and then used concat to add it to the locations df.
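A minimal sketch of that pivot-then-concat idea (df2 and df3 stand for the measurement frames, the column names Date and Measurement are taken from the example tables, and Location ID is assumed to be the index as stated above):
import pandas as pd

measurements = pd.concat([df2, df3])                              # stack all measurement frames
wide = measurements.pivot(columns='Date', values='Measurement')   # one column per date, Location ID stays the index
df_final = pd.concat([df1, wide], axis=1)                         # align on the Location ID index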

Merge two DataFrames on multiple columns

Hope you can help me.
I have two pretty big datasets.
DF1 Example:
|id| A_Workflow_Type_ID | B_Workflow_Type_ID | ...
1 123 456
2 789 222 ...
3 333 NULL ...
DF2 Example:
Workflow| Operation | Profile | Type | Name | ...
123 1 2 Low_Cost xyz ...
456 2 5 High_Cost z ...
I need to merge the two datasets without creating many NaNs and extra columns, so I want to merge A_Workflow_Type_ID and B_Workflow_Type_ID from DF1 onto Workflow from DF2.
I tried several join operations and merge in pandas, but it fails.
My last try:
all_Data = pd.merge(left=DF1,right=DF2, how='inner', left_on =['A_Workflow_Type_ID ','B_Workflow_Type_ID '], right_on=['Workflow'])
But that returns an error saying that left_on and right_on must be of equal length.
Thanks for the help!
You need to reshape first with melt and then merge:
# select all columns whose name does not contain 'Workflow'
cols = DF1.columns[~DF1.columns.str.contains('Workflow')]
print (cols)
Index(['id'], dtype='object')
df = DF1.melt(cols, value_name='Workflow', var_name='type')
print (df)
id type Workflow
0 1 A_Workflow_Type_ID 123.0
1 2 A_Workflow_Type_ID 789.0
2 3 A_Workflow_Type_ID 333.0
3 1 B_Workflow_Type_ID 456.0
4 2 B_Workflow_Type_ID 222.0
5 3 B_Workflow_Type_ID NaN
all_Data = pd.merge(left=df,right=DF2, on ='Workflow')
print (all_Data)
id type Workflow Operation Profile Type Name
0 1 A_Workflow_Type_ID 123 1 2 Low_Cost xyz
1 1 B_Workflow_Type_ID 456 2 5 High_Cost z
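If you also need to keep rows that have no matching Workflow (like id 3 with the NULL value), a left merge would preserve them, with NaNs in the DF2 columns, e.g.:
all_Data = pd.merge(left=df, right=DF2, on='Workflow', how='left')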