I've got a data frame that includes x and y variables, and the index levels are ID, date and time.
I want to create a new variable by applying a user-defined function.
For example, the function could be:
def some_function(x1, x2, y1, y2):
    z = x1*x2 + y1*y2
    return z
The real function is more complex.
Note: The function should be applied to each ID separately.
Data illustration:
ID  date        time   x   y
1   08/27/2019  18:00  1   2
                19:00  3   4
                20:00  ..  ..
                21:00  ..  ..
2   08/28/2019  18:00  ..  ..
                19:00  ..  ..
                19:31  ..  ..
                19:32  ..  ..
                19:34  ..  ..
So for example, the first row in the new variable should be 0, since there is no previous row, and the second row should be 3*1 + 4*2 = 11.
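For reference, a minimal sketch of the layout described above (the values and the exact index construction are assumed purely for illustration). The answers below treat ID as a regular column, so either reset the index first or group by the index level with groupby(level='ID'):
import pandas as pd

df = pd.DataFrame(
    {'x': [1, 3, 4, 7], 'y': [2, 4, 3, 2]},
    index=pd.MultiIndex.from_tuples(
        [(1, '08/27/2019', '18:00'), (1, '08/27/2019', '19:00'),
         (2, '08/28/2019', '18:00'), (2, '08/28/2019', '19:00')],
        names=['ID', 'date', 'time']),
)

df = df.reset_index()   # ID, date and time become ordinary columns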
You can try:
def myfunc(d):
    return d['x'].mul(d['x'].shift()) + d['y'].mul(d['y'].shift())

# group_keys=False keeps the original index so the result aligns with df
df['new_col'] = df.groupby('ID', group_keys=False).apply(myfunc)
Assuming the index is numeric (e.g. the default RangeIndex), you can join each row to the previous row of the same ID and apply the function row-wise:
(df.join(df.groupby('ID')[['x', 'y']].shift(), lsuffix='1', rsuffix='2')
   .apply(lambda r: some_function(r.x1, r.x2, r.y1, r.y2), axis=1))
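A self-contained sketch of this join pattern (the small frame and its values are assumed purely for illustration):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'x': [1, 3, 4, 7],
                   'y': [2, 4, 3, 2]})

def some_function(x1, x2, y1, y2):
    return x1*x2 + y1*y2

# join each row with the previous row of the same ID, then apply the function row-wise
joined = df.join(df.groupby('ID')[['x', 'y']].shift(), lsuffix='1', rsuffix='2')
df['new_col'] = joined.apply(lambda r: some_function(r.x1, r.x2, r.y1, r.y2), axis=1)
# the first row of each ID stays NaN (no previous row); use .fillna(0) if a 0 is preferred there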
You can do this with shift:
df_shifted= df[['x', 'y']].shift(1).fillna(0)
df['new_col']= df['x']*df_shifted['x']+df['y']*df_shifted['y']
The output looks like this on some example data:
df = pd.DataFrame(dict(
    ID=[1, 1, 2, 3, 3],
    time=['02:37', '05:28', '09:01', '10:05', '10:52'],
    x=[1, 3, 4, 7, 1],
    y=[2, 4, 3, 2, 6]
))
df_shifted= df.shift(1).fillna(0)
df['new_col']= df['x']*df_shifted['x']+df['y']*df_shifted['y']
df
Out[474]:
ID time x y new_col
0 1 02:37 1 2 0.0
1 1 05:28 3 4 11.0
2 2 09:01 4 3 24.0
3 3 10:05 7 2 34.0
4 3 10:52 1 6 19.0
This mixes the rows of different IDs: the value for ID 2 is calculated with the last row of ID 1. If you don't want that, you need to work with groupby like this:
# make sure the dataframe is sorted
df.sort_values(['ID', 'time'], inplace=True)

# define a function that gets the sub-dataframes
# which belong to the same ID
def calculate(sub_df):
    df_shifted = sub_df.shift(1).fillna(0)
    sub_df['new_col'] = sub_df['x']*df_shifted['x'] + sub_df['y']*df_shifted['y']
    return sub_df

df.groupby('ID').apply(calculate)
The output looks like this on the same data as above:
Out[472]:
ID time x y new_col
0 1 02:37 1 2 0.0
1 1 05:28 3 4 11.0
2 2 09:01 4 3 0.0
3 3 10:05 7 2 0.0
4 3 10:52 1 6 19.0
You can see that the first entry of each group is now 0.0; mixing across IDs doesn't happen anymore.
I have a DataFrame where, for column 2, I need to add 0.004 throughout the column so that row 1 of column 2 becomes 0. Similarly, for column 3 I need to subtract 0.4637 from the entire column so that row 1 of column 3 becomes 0. How do I do this efficiently?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)

for (i, j) in zip(range(0, 5999), range(1, len(df.columns))):
    if j == 1:
        df2.values[i, j] = df.values[i, j] + df.values[0, 1]
    elif j > 1:
        df2.iloc[i, j] = df.iloc[i, j] - df.iloc[0, j]

print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
   0   1   2   3   4
0  0   1   2   3   4
1  5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (the column names of the dataframe), and the second column holds the actual values of the first row of the dataframe.
We can convert it to a list to better see its values
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, each of these first-row values is subtracted from the whole column it came from.
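As a quick check on the example above (worked out by hand), the broadcasted subtraction gives:
df - df.iloc[0]
    0   1   2   3   4
0   0   0   0   0   0
1   5   5   5   5   5
2  10  10  10  10  10
3  15  15  15  15  15
Row 0 becomes all zeros, which is exactly the behaviour asked for in the question.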
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
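As noted in the list above, the same pattern extends to the n smallest rows per group; for example, keeping the two smallest B values per A on the example df (a sketch; pandas also provides GroupBy.head, which returns the same rows in overall sort order):
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=2).reset_index(drop=True)

# equivalent rows via the built-in GroupBy.head
df.sort_values('B').groupby('A').head(2)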
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First, we get the minimum values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
If you then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so I used dropna() and then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to keep the rows where column B equals the per-group minimum value:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
I have a pandas dataframe with some very extreme values (more than 5 std away).
I want to replace, per column, each value that is more than 5 std away with the maximum of the other values in that column.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
# mask where the condition applies
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))
# fill with the max per column and sort the frame by index
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do:
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B, as shown below.
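For completeness, the analogous line for column B (same assumed threshold q):
df.loc[df['B'] > q, 'B'] = max(df.loc[df['B'] < q, 'B'])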
Calculate a column-wise z-score (if you consider a value an outlier when it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
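For instance, to match the question (replace each flagged value with the largest remaining value in its column), a minimal sketch; DataFrame.fillna with a Series fills each column by its name:
masked = df.mask(outlier_mask)             # flagged values become NaN
df_filled = masked.fillna(masked.max())    # per-column max of the remaining values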
In the pandas dataframe I need to shift all the values in every seventh row (every Saturday) by one place, so that all the 10s line up vertically.
So 2020-11-07
should go from
10 1 10 3 10 9 10 7 10 3 10 5 10 5 10 10
to
10 10 1 10 3 10 9 10 7 10 3 10 5 10 5 10
And likewise for 2020-11-14, 2020-11-21 etc.
For every 7th row in the dataframe, use the shift method to move everything one place to the right, then concatenate the last value with everything from the shifted row except the first element (which will be null):
for i in range(len(df)):
    if i % 7 == 6:   # every 7th row, assuming the first Saturday sits at positional index 6
        df.iloc[i, :] = [df.iloc[i, -1]] + df.iloc[i, :].shift(1).tolist()[1:]
If needed, this solution can be generalized to shift every kth row by r places:
for i in range(len(df)):
    if i % k == k - 1:   # every kth row
        df.iloc[i, :] = df.iloc[i, -r:].tolist() + df.iloc[i, :].shift(r).tolist()[r:]
EDIT:
You can also achieve this without using the shift method.
For every 7th row, by 1 place:
for i in range(len(df)):
    if i % 7 == 6:
        df.iloc[i, :] = [df.iloc[i, -1]] + df.iloc[i, :-1].tolist()
For the general case of every kth row by r places:
for i in range(len(df)):
    if i % k == k - 1:
        df.iloc[i, :] = df.iloc[i, -r:].tolist() + df.iloc[i, :-r].tolist()
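As a side note, if the dates are the index and all columns are numeric, the same rotation can be done without an explicit Python loop; a sketch assuming the first Saturday sits at positional index 6:
import numpy as np

saturdays = df.iloc[6::7]                                  # every 7th row, starting at position 6
df.iloc[6::7] = np.roll(saturdays.to_numpy(), 1, axis=1)   # rotate each of those rows one place to the right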
Do you want something like this?
import pandas as pd
import numpy as np

# Build a little example (with only 2 value columns)
rng = pd.date_range('2021-11-01', periods=15, freq='D')
df = pd.DataFrame({'Date': rng, '1': range(15), '2': range(15)})
df["2"] = df["2"] * 100

# Algorithm:
# create a temporary extra column
df["3"] = np.nan
for i in [2, 1]:
    # shift each column one place to the right every Sunday (weekday 6, because Monday is 0)
    df[str(i+1)] = df[str(i)].where(df["Date"].dt.weekday == 6, df[str(i+1)])
# the last column becomes the first column every Sunday
df[str(1)] = df[str(3)].where(df["Date"].dt.weekday == 6, df[str(1)])

# Keep only the original columns
df = df[["Date", "1", "2"]]
Result:
Date 1 2
0 2021-11-01 0.0 0
1 2021-11-02 1.0 100
2 2021-11-03 2.0 200
3 2021-11-04 3.0 300
4 2021-11-05 4.0 400
5 2021-11-06 5.0 500
6 2021-11-07 600.0 6
7 2021-11-08 7.0 700
8 2021-11-09 8.0 800
9 2021-11-10 9.0 900
10 2021-11-11 10.0 1000
11 2021-11-12 11.0 1100
12 2021-11-13 12.0 1200
13 2021-11-14 1300.0 13
14 2021-11-15 14.0 1400
This is not the best way to get the result, but it is one way.
I want to select all the previous 6 months records for a customer whenever a particular transaction is done by the customer.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to look for the Description "Moved" and then select the previous 6 months of records for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in python. Please help.
The idea is to create a Series of datetimes filtered to the Moved rows and shifted back by a 6-month offset, then keep the rows whose Transaction_Date is later than the mapped offset.
EDIT: Get all datetimes for each Moved value (a separate 6-month lookback per Moved row):
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID', 'Transaction_Date'])

df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
s = (df[df['Description'].eq('Moved')]
       .set_index(['Cust_ID', 'g'])['Transaction_Date'] - pd.DateOffset(months=6))
mask = df.join(s.rename('a'), on=['Cust_ID', 'g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Use only the earliest Moved datetime per group; any other Moved rows in a group are removed:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
# convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
# mask to filter the Moved rows
mask = df['Description'].eq('Moved')
# filter and sort these rows
df1 = df[mask].sort_values(['Cust_ID', 'Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
# flag all but the first Moved row per Cust_ID
mask = df1.duplicated('Cust_ID')
# create a Series for mapping: the earliest Moved date per customer, minus 6 months
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.DateOffset(months=6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
# create a mask to filter out the other Moved rows (keep only the first per group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2: Select the rows within 6 months after the last Moved per customer:
# flag all but the last Moved row per Cust_ID
mask = df1.duplicated('Cust_ID', keep='last')
# create a Series for mapping: the last Moved date per customer
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
# keep rows between the Moved date and 6 months after it
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.DateOffset(months=6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved