Create series based on two pandas dataframe bool columns - pandas

How do I create a series based on two pandas dataframe bool columns?
round_up round_down is_round_up is_round_down High Low
0 0.75 0.7 False True 0.70532 0.69818
1 0.75 0.7 False True 0.70196 0.67268
2 0.75 0.7 False True 0.71243 0.69938
3 0.75 0.7 False True 0.70226 0.69884
4 0.75 0.7 False True 0.70292 0.69952
5 0.75 0.7 True True 0.75100 0.69000
The desired output is a series containing round_up where is_round_up is True, round_down where is_round_down is True, and round_up where both are True.
0 0.70
1 0.70
2 0.70
3 0.70
4 0.70
5 0.75
Test data
df = pd.DataFrame({'round_up': {0: 0.75,
1: 0.75,
2: 0.75,
3: 0.75,
4: 0.75,
5: 0.75},
'round_down': {0: 0.70,
1: 0.70,
2: 0.70,
3: 0.70,
4: 0.70,
5: 0.70},
'is_round_up': {0: False, 1: False, 2: False, 3: False, 4: False, 5:True},
'is_round_down': {0: True, 1: True, 2: True, 3: True, 4: True, 5:True},
'High': {0: 0.70532, 1: 0.70196, 2: 0.71243, 3: 0.70226, 4: 0.70292, 5:0.751},
'Low': {0: 0.69818, 1: 0.67268, 2: 0.69938, 3: 0.69884, 4: 0.69952, 5:0.69}})
Edit: To clarify, if both columns are False I would expect to see NaN for that row.

Select the columns in order of choice (up is preferred to down when both are True), mask the False values, bfill, and take the first column.
Advantage: the solution is scalable to any number of columns.
(df[['round_up', 'round_down']]
 .where(df[['is_round_up', 'is_round_down']].values)
 .bfill(axis=1)
 .iloc[:, 0]
 .rename('result')
)
Output:
0 0.70
1 0.70
2 0.70
3 0.70
4 0.70
5 0.75
Name: result, dtype: float64
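If there will only ever be two flag columns, an equivalent sketch using numpy.select (my addition, not part of the original answer) gives the same preference order and also yields NaN when both flags are False:
import numpy as np
import pandas as pd

# Conditions are checked in order, so is_round_up wins when both flags are True;
# rows where neither flag is True fall back to NaN.
result = pd.Series(
    np.select(
        [df['is_round_up'], df['is_round_down']],
        [df['round_up'], df['round_down']],
        default=np.nan,
    ),
    index=df.index,
    name='result',
)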

Related

Multiple nested groupby in pandas

Here is my pandas dataframe:
df = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12',6: '2016-10-12',7: '2016-10-12',8: '2016-10-12',9: '2016-10-12'}, 'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H',8: 'I', 9:'J'}, 'Sector': {0: 0,1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6:0, 7:0, 8:1, 9:1}, 'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6:2,7:2,8:3,9:3}, 'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6:0, 7:23, 8:5, 9:5}})
I want to add the following columns:
'Date_Range_Avg': average of 'Range' grouped by Date
'Date_Sector_Range_Avg': average of 'Range' grouped by Date and Sector
'Date_Segment_Range_Avg': average of 'Range' grouped by Date and Segment
This would be the output:
res = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-12',6: '2016-10-12',7: '2016-10-12',8: '2016-10-12',9: '2016-10-12'}, 'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H',8: 'I', 9:'J'}, 'Sector': {0: 0,1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6:0, 7:0, 8:1, 9:1}, 'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6:2,7:2,8:3,9:3}, 'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6:0, 7:23, 8:5, 9:5}, 'Date_Range_Avg':{0: 1.6, 1: 1.6, 2: 1.6, 3: 1.6, 4: 1.6, 5: 7.8, 6: 7.8, 7: 7.8, 8:7.8, 9: 7.8}, 'Date_Sector_Range_Avg':{0: 2.5, 1: 2.5, 2: 1, 3: 1, 4: 1, 5: 9.67, 6: 9.67, 7: 9.67, 8: 9.67, 9: 9.67}, 'Date_Segment_Range_Avg':{0: 5, 1: 0.75, 2: 0.75, 3: 0.75, 4: 0.75, 5: 6, 6: 11.5, 7: 11.5, 8: 5, 9: 5}})
Note that I have rounded some of the values, but the rounding is not essential to the question (feel free not to round).
I'm aware that I can do each of these groupings separately, but that strikes me as inefficient (my dataset contains millions of rows).
Essentially, I would like to first do a grouping by Date and then re-use it to do the two more fine-grained groupings by Date and Segment and by Date and Sector.
How to do this?
My initial hunch is to go like this:
day_groups = df.groupby("Date")
df['Date_Range_Avg'] = day_groups['Range'].transform('mean')
and then to re-use day_groups to do the 2 more fine-grained groupbys like this:
df['Date_Sector_Range_Avg'] = day_groups.groupby('Segment')['Range'].transform('mean')
Which doesn't work as you get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
groupby runs really fast when the aggregate function is vectorized. If you are worried about performance, try it out first to see if it's the real bottleneck in your program.
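For instance, a quick timing check along these lines (a small sketch; the column names are taken from the question) will tell you whether the grouped mean is actually slow on your data:
import time

start = time.perf_counter()
_ = df.groupby(['Date', 'Sector'])['Range'].transform('mean')
print(f"grouped mean took {time.perf_counter() - start:.4f} s")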
You can create temporary data frames holding the result of each groupby, then successively merge them with df:
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"],
}
tmp = [
    df.groupby(columns)["Range"].mean().to_frame(key)
    for key, columns in group_bys.items()
]
result = df
for t in tmp:
    result = result.merge(t, left_on=t.index.names, right_index=True)
Result:
Date Stock Sector Segment Range Date_Range_Avg Date_Sector_Range_Avg Date_Segment_Range_Avg
0 2016-10-11 A 0 0 5 1.6 2.500000 5.00
1 2016-10-11 B 0 1 0 1.6 2.500000 0.75
2 2016-10-11 C 1 1 1 1.6 1.000000 0.75
3 2016-10-11 D 1 1 0 1.6 1.000000 0.75
4 2016-10-11 E 1 1 2 1.6 1.000000 0.75
5 2016-10-12 F 0 1 6 7.8 9.666667 6.00
6 2016-10-12 G 0 2 0 7.8 9.666667 11.50
7 2016-10-12 H 0 2 23 7.8 9.666667 11.50
8 2016-10-12 I 1 3 5 7.8 5.000000 5.00
9 2016-10-12 J 1 3 5 7.8 5.000000 5.00
Another option is to use transform, and avoid the multiple merges:
# reusing your code
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"],
}
tmp = {
    key: df.groupby(columns)["Range"].transform('mean')
    for key, columns in group_bys.items()
}
df.assign(**tmp)
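Note that assign returns a new DataFrame rather than modifying df in place, so capture the result if you want to keep the new columns:
df = df.assign(**tmp)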

Calculate median of column with multiple values per cell (ranges)

I have this code:
df = pd.DataFrame({'R': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7'}, 'a': {0: 1.0, 1: 1.0, 2: 2.0, 3: 3.0, 4: 3.0, 5: 2.0, 6: 3.0}, 'nv1': {0: [-1.0], 1: [-1.0], 2: [], 3: [], 4: [-2.0], 5: [-2.0, -1.0, -3.0, -1.0], 6: [-2.0, -1.0, -2.0, -1.0]}})
yielding the following dataframe:
R a nv1
0 1 1.0 [-1.0]
1 2 1.0 [-1.0]
2 3 2.0 []
3 4 3.0 []
4 5 3.0 [-2.0]
5 6 2.0 [-2.0, -1.0, -3.0, -1.0]
6 7 3.0 [-2.0, -1.0, -2.0, -1.0]
I need to calculate the median of each list in df['nv1'] and store it in a new column:
df['med'] = median of df['nv1']
Desired output as follows
R a nv1 med
1 1.0 [-1.0] -1
2 1.0 [-1.0] -1
3 2.0 []
4 3.0 []
5 3.0 [-2.0] -2
6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
I tried both lines of code below independently, but I ran into errors:
df['nv1'] = pd.to_numeric(df['nv1'],errors = 'coerce')
df['med'] = df['nv1'].median()
Use np.median:
df['med'] = df['nv1'].apply(np.median)
Output:
>>> df
R a nv1 med
0 1 1.0 [-1.0] -1.0
1 2 1.0 [-1.0] -1.0
2 3 2.0 [] NaN
3 4 3.0 [] NaN
4 5 3.0 [-2.0] -2.0
5 6 2.0 [-2.0, -1.0, -3.0, -1.0] -1.5
6 7 3.0 [-2.0, -1.0, -2.0, -1.0] -1.5
Or:
df['med'] = df['nv1'].explode().dropna().groupby(level=0).median()
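As a side note, np.median on an empty list returns NaN but also emits a RuntimeWarning; if that is a nuisance, a small guard (a sketch of mine, not part of the original answer) avoids it:
import numpy as np

df['med'] = df['nv1'].apply(lambda lst: np.median(lst) if len(lst) else np.nan)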

Multiple vectorized condition - Length of values between two data frames not matching

I am trying to perform a rather simple task using vectorized conditions. The sizes of the two dataframes differ, but I still do not understand why that would be an issue.
df1_data = {'In-Person Status': {0: 'No', 1: 'Yes', 2: 'No', 3: 'Yes', 4: 'No', 5: 'Yes'},
'ID': {0: 5, 1: 45, 2: 22, 3: 34, 4: 46, 5: 184}}
df1 = pd.DataFrame(df1_data)
df2_data = {'Age': {0: 22, 1: 34, 2: 51, 3: 8}, 'ID': {0: 5, 1: 2145, 2: 5022, 3: 34}}
df2 = pd.DataFrame(df2_data)
I am using the following code:
conditions = [
    (df2['ID'].isin(df1['ID'])) & (df1['In-Person Status'] == 'No')
]
value = ['True']
df2['Result'] = np.nan
df2['Result'] = np.select(conditions, value, 'False')
Desired output:
Age ID Result
22 0005 True
34 2145 False
51 5022 False
8 0034 False
Although the task might be very simple, I am getting the following error message:
ValueError: Length of values (72610) does not match length of index (1634)
I would very much appreciate any suggestions.
We can join the two dfs as suggested in the comments, then drop the rows with NaN values in the Age column. The last couple of lines are optional, to get the format to match your desired output.
dfj = df1.join(df2, rsuffix='_left')  # overlapping columns from df2 get the '_left' suffix
conditions = [(dfj['ID'].isin(dfj['ID_left'])) & (dfj['In-Person Status'] == 'No')]
value = [True]
dfj['Result'] = np.select(conditions, value, False)
dfj = dfj.dropna(axis=0, how='any', subset=['Age'])  # keep only rows present in df2
dfj = dfj[['Age', 'ID_left', 'Result']]
dfj.columns = ['Age', 'ID', 'Result']
dfj['ID'] = dfj['ID'].apply(lambda x: str(x).zfill(6)[0:4])  # zero-pad IDs for display
dfj['Age'] = dfj['Age'].astype(int)
Output:
Age ID Result
0 22 0005 True
1 34 2145 False
2 51 5022 False
3 8 0034 False
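For what it's worth, the length mismatch comes from mixing df1 and df2 columns in one condition. A simpler sketch (my reading of the intent, not the accepted approach above) builds the condition entirely from df2 by looking up the 'No' IDs in df1:
import pandas as pd

# IDs that appear in df1 with an in-person status of 'No'
no_ids = df1.loc[df1['In-Person Status'] == 'No', 'ID']
df2['Result'] = df2['ID'].isin(no_ids)
# optional: zero-pad the IDs for display, as in the desired output
df2['ID'] = df2['ID'].astype(str).str.zfill(4)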

Pandas Dataframe: change columns, index and plot

Hi, I generated a table using Counter from collections to count the combinations of three variables from a dataframe: Jessica, Mike, and Dog. I got the combinations and their counts.
Any help to make that table a bit more prettier? I would like to rename the index as grp1, grp2, etc and the column as well with something else than 0.
Also, what would be the best plot to use for plotting the different groups?
Thanks for your help!!
I used this code to produce the table:
df = np.random.choice(["Mike", "Jessica", "Dog"], size=(20, 3))
Z = pd.DataFrame(df, columns=['a', 'b', 'c'])
import collections
from collections import Counter
LL = Z.apply(Counter, axis="columns").value_counts()
H = pd.DataFrame(LL)
print(H)
Quite an unusual technique...
You can change the dict index to a MultiIndex, then plot() as barh so that the labels make sense:
df = np.random.choice(["Mike", "Jessica", "Dog"], size=(20, 3))
Z = pd.DataFrame(df, columns=['a', 'b', 'c'])
import collections
from collections import Counter
LL = Z.apply(Counter, axis="columns").value_counts()
H = pd.DataFrame(LL)
I = pd.Series(H.index).apply(pd.Series)
H = H.set_index(pd.MultiIndex.from_arrays(I.T.values, names=I.columns))
H.plot(kind="barh")
H after setting the MultiIndex (blank index cells repeat the value from the row above):
                   0
Mike Dog  Jessica
2.0  1.0  NaN      5
     NaN  1.0      4
NaN  1.0  2.0      3
1.0  NaN  2.0      3
     1.0  1.0      2
NaN  NaN  3.0      1
     2.0  1.0      1
3.0  NaN  NaN      1
Instead of using Counter, you can apply value_counts directly to each row:
import pandas as pd
from matplotlib import pyplot as plt
# Hard Coded For Reproducibility
df = pd.DataFrame({'a': {0: 'Dog', 1: 'Jessica', 2: 'Mike',
3: 'Dog', 4: 'Dog', 5: 'Dog',
6: 'Jessica', 7: 'Jessica',
8: 'Dog', 9: 'Dog', 10: 'Jessica',
11: 'Mike', 12: 'Dog',
13: 'Jessica', 14: 'Mike',
15: 'Mike',
16: 'Mike', 17: 'Dog',
18: 'Jessica', 19: 'Mike'},
'b': {0: 'Mike', 1: 'Mike', 2: 'Jessica',
3: 'Jessica', 4: 'Dog', 5: 'Jessica',
6: 'Mike', 7: 'Dog', 8: 'Mike',
9: 'Dog', 10: 'Dog', 11: 'Dog',
12: 'Dog', 13: 'Jessica',
14: 'Jessica', 15: 'Dog',
16: 'Dog', 17: 'Dog', 18: 'Jessica', 19: 'Jessica'},
'c': {0: 'Mike', 1: 'Dog', 2: 'Jessica',
3: 'Dog', 4: 'Dog', 5: 'Dog', 6: 'Dog',
7: 'Jessica', 8: 'Mike', 9: 'Dog',
10: 'Dog', 11: 'Mike', 12: 'Jessica',
13: 'Jessica', 14: 'Jessica',
15: 'Jessica', 16: 'Jessica',
17: 'Dog', 18: 'Mike', 19: 'Dog'}})
# Apply value_counts across each row
df = df.apply(pd.value_counts, axis=1) \
       .fillna(0)
# Group by all columns and get the duplicate count from the group size
df = pd.DataFrame(df
                  .groupby(df.columns.values.tolist())
                  .size()
                  .sort_values())
# Plot
plt.figure()
df.plot(kind="barh")
plt.tight_layout()
plt.show()
df after groupby, size, and sort:
0
Dog Jessica Mike
0.0 3.0 0.0 1
1.0 2.0 0.0 1
0.0 2.0 1.0 3
1.0 0.0 2.0 3
3.0 0.0 0.0 3
2.0 1.0 0.0 4
1.0 1.0 1.0 5
Plot: a horizontal bar chart of the group counts.
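As an aside, the top-level pd.value_counts used above has been deprecated in recent pandas releases (worth verifying against your version). A hedged equivalent calls value_counts on each row instead; here orig stands for the original a/b/c frame from before it was overwritten:
counts = orig.apply(lambda row: row.value_counts(), axis=1).fillna(0)
groups = pd.DataFrame(counts.groupby(counts.columns.tolist()).size().sort_values())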

Pandas Calculating minimum timedelta by group

My input data frames are as below:
Input DataFrames:
Input1 = pd.DataFrame({'LOT': {0: 'A1', 1: 'A2', 2: 'A3', 3: 'A4', 4: 'A5'},
'OPERATION': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0},
'TXN_DATE': {0: '12/6/2016',
1: '12/5/2016',
2: '11/30/2016',
3: '11/27/2016',
4: '11/22/2016'}})
Input2 = pd.DataFrame({'LOT': {0: 'B1', 1: 'B2', 2: 'B3', 3: 'B4', 4: 'B5', 5: 'B6'},
'OPERATION': {0: 500, 1: 500, 2: 500, 3: 500, 4: 500, 5: 500},
'TXN_DATE': {0: '12/7/2016',
1: '12/3/2016',
2: '11/17/2016',
3: '11/22/2016',
4: '12/4/2016',
5: '12/3/2016'}})
I am interested in finding, for each lot in Input1, the companion lot from Input2 based on the delta between their TXN_DATE values (the time delta should be minimal):
Final DataFrame:
Expected_out = pd.DataFrame({'COMPANION_LOT': {0: 'B5', 1: 'B5', 2: 'B4', 3: 'B4', 4: 'B4'},
'COMPANION_LOT TXN_DATE': {0: '12/4/2016',
1: '12/4/2016',
2: '11/22/2016',
3: '11/22/2016',
4: '11/22/2016'},
'LOT': {0: 'A1', 1: 'A2', 2: 'A3', 3: 'A4', 4: 'A5'},
'OPERATION': {0: 100, 1: 100, 2: 100, 3: 100, 4: 100},
'TXN_DATE': {0: '12/6/2016',
1: '12/5/2016',
2: '11/30/2016',
3: '11/27/2016',
4: '11/22/2016'}})
Thank you
You can use pandas.merge_asof and then add the new column with map:
Input1.TXN_DATE = pd.to_datetime(Input1.TXN_DATE)
Input2.TXN_DATE = pd.to_datetime(Input2.TXN_DATE)
Input1 = Input1.sort_values('TXN_DATE')
Input2 = Input2.sort_values('TXN_DATE')
df = pd.merge_asof(Input1, Input2, on='TXN_DATE', suffixes=('', '_COMPANION')) \
       .sort_values('LOT') \
       .drop('OPERATION_COMPANION', axis=1)
df['LOT_TXN_DATE'] = df.LOT_COMPANION.map(Input2.set_index('LOT')['TXN_DATE'])
print(df)
LOT OPERATION TXN_DATE LOT_COMPANION LOT_TXN_DATE
4 A1 100.0 2016-12-06 B5 2016-12-04
3 A2 100.0 2016-12-05 B5 2016-12-04
2 A3 100.0 2016-11-30 B4 2016-11-22
1 A4 100.0 2016-11-27 B4 2016-11-22
0 A5 100.0 2016-11-22 B4 2016-11-22
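One caveat: merge_asof matches the nearest earlier TXN_DATE by default (direction='backward'). If you want the smallest absolute time delta in either direction, the same call also accepts direction='nearest'; a sketch of that variant:
df = (pd.merge_asof(Input1, Input2, on='TXN_DATE',
                    suffixes=('', '_COMPANION'), direction='nearest')
        .sort_values('LOT')
        .drop('OPERATION_COMPANION', axis=1))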