Multiple vectorized conditions - Length of values between two data frames not matching - pandas

I am trying to perform a rather simple task using vectorized conditions. The sizes of the two dataframes differ, but I still do not understand why that would be an issue.
import pandas as pd
import numpy as np

df1_data = {'In-Person Status': {0: 'No', 1: 'Yes', 2: 'No', 3: 'Yes', 4: 'No', 5: 'Yes'},
            'ID': {0: 5, 1: 45, 2: 22, 3: 34, 4: 46, 5: 184}}
df1 = pd.DataFrame(df1_data)
df2_data = {'Age': {0: 22, 1: 34, 2: 51, 3: 8}, 'ID': {0: 5, 1: 2145, 2: 5022, 3: 34}}
df2 = pd.DataFrame(df2_data)
I am using the following code:
conditions = [
    (df2['ID'].isin(df1['ID'])) & (df1['In-Person Status'] == 'No')
]
value = ['True']
df2['Result'] = np.nan
df2['Result'] = np.select(conditions, value, 'False')
Desired output:
Age ID Result
22 0005 True
34 2145 False
51 5022 False
8 0034 False
Although the task might be very simple, I am getting the following error message:
ValueError: Length of values (72610) does not match length of index (1634)
I would very much appreciate any suggestions.

The error happens because the two masks in the condition come from frames with different numbers of rows, so the element-wise & cannot align them and the resulting values no longer match df2's index. We can instead join the two dfs as suggested in the comments, then drop the NaN rows in the Age column. The last couple of lines are optional, to get the format to match your desired output.
dfj = df1.join(df2, rsuffix='_left')
conditions = [(dfj['ID'].isin(dfj['ID_left'])) & (dfj['In-Person Status'] == 'No')]
value = [True]
dfj['Result'] = np.select(conditions, value, False)
# keep only the rows that exist in df2
dfj = dfj.dropna(axis=0, how='any', subset=['Age'])
dfj = dfj[['Age', 'ID_left', 'Result']]
dfj.columns = ['Age', 'ID', 'Result']
# zero-pad the IDs to match the desired output (the join made them floats, e.g. 5.0)
dfj['ID'] = dfj['ID'].apply(lambda x: str(x).zfill(6)[0:4])
dfj['Age'] = dfj['Age'].astype(int)
Output:
Age ID Result
0 22 0005 True
1 34 2145 False
2 51 5022 False
3 8 0034 False
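
If the zero-padded ID formatting is not needed, a minimal merge-based sketch (reusing the question's df1 and df2, and assuming the IDs in df1 are unique) sidesteps the alignment problem by bringing the status over to df2 first:
import pandas as pd
import numpy as np

# bring df1's status onto df2 by ID; IDs missing from df1 get NaN
merged = df2.merge(df1[['ID', 'In-Person Status']], on='ID', how='left')
# 'True' only where the ID was found in df1 and its status is 'No'
df2['Result'] = np.where(merged['In-Person Status'].eq('No'), 'True', 'False')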

Related

Replace frame order according to values in row

If I have the following dataframe:
import pandas as pd
data = {'Status': {0: 'Available', 1: 'Collect', 2: 'Failed', 3: 'Delivered',
                   4: 'Totaal', 5: 'sent out', 6: 'received',
                   7: 'Not yet executed', 8: 'received', 9: 'Approved'},
        'Aantal': {0: 5, 1: 25, 2: 35, 3: 55, 4: 105,
                   5: 65, 6: 75, 7: 95, 8: 55, 9: 505}}
df = pd.DataFrame(data)
And I would like to re-arrange the order of the rows, so that instead of 'Available', the first row is 'Collect'.
How can I do this?
Thank you in advance.
A robust way might be to sort using inequality to "Collect" as the key, together with a stable sort:
out = df.sort_values('Status', key=lambda s: s.ne('Collect'), kind='stable')
Other option, using slicing and concat:
m = df['Status'].eq('Collect')
out = pd.concat([df[m], df[~m]])
Output:
Status Aantal
1 Collect 25
0 Available 5
2 Failed 35
3 Delivered 55
4 Totaal 105
5 sent out 65
6 received 75
7 Not yet executed 95
8 received 55
9 Approved 505
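
If you need a full custom ordering rather than moving a single value to the top, one sketch is to sort on an ordered categorical key; the category list below is only an assumption based on the sample data:
# the desired order is illustrative; adjust to taste
order = ['Collect', 'Available', 'Failed', 'Delivered', 'Totaal',
         'sent out', 'received', 'Not yet executed', 'Approved']
cat = pd.CategoricalDtype(categories=order, ordered=True)
# cast the sort key to the ordered categorical so sort_values follows it
out = df.sort_values('Status', key=lambda s: s.astype(cat), kind='stable')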

Multiple nested groupby in pandas

Here is my pandas dataframe:
df = pd.DataFrame({
    'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',
             5: '2016-10-12', 6: '2016-10-12', 7: '2016-10-12', 8: '2016-10-12', 9: '2016-10-12'},
    'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'},
    'Sector': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1},
    'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3},
    'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6: 0, 7: 23, 8: 5, 9: 5}})
I want to add the following columns:
'Date_Range_Avg': average of 'Range' grouped by Date
'Date_Sector_Range_Avg': average of 'Range' grouped by Date and Sector
'Date_Segment_Range_Avg': average of 'Range' grouped by Date and Segment
This would be the output:
res = pd.DataFrame({
    'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',
             5: '2016-10-12', 6: '2016-10-12', 7: '2016-10-12', 8: '2016-10-12', 9: '2016-10-12'},
    'Stock': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'},
    'Sector': {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 1, 9: 1},
    'Segment': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 3, 9: 3},
    'Range': {0: 5, 1: 0, 2: 1, 3: 0, 4: 2, 5: 6, 6: 0, 7: 23, 8: 5, 9: 5},
    'Date_Range_Avg': {0: 1.6, 1: 1.6, 2: 1.6, 3: 1.6, 4: 1.6,
                       5: 7.8, 6: 7.8, 7: 7.8, 8: 7.8, 9: 7.8},
    'Date_Sector_Range_Avg': {0: 2.5, 1: 2.5, 2: 1, 3: 1, 4: 1,
                              5: 9.67, 6: 9.67, 7: 9.67, 8: 9.67, 9: 9.67},
    'Date_Segment_Range_Avg': {0: 5, 1: 0.75, 2: 0.75, 3: 0.75, 4: 0.75,
                               5: 6, 6: 11.5, 7: 11.5, 8: 5, 9: 5}})
Note I have rounded some of the values - but this rounding is not essential for the question I have (please feel free to not round)
I'm aware that I can do each of these groupings separately, but it strikes me as inefficient (my dataset contains millions of rows).
Essentially, I would like to first do a grouping by Date and then re-use it to do the two more fine-grained groupings by Date and Segment and by Date and Sector.
How to do this?
My initial hunch is to go like this:
day_groups = df.groupby("Date")
df['Date_Range_Avg'] = day_groups['Range'].transform('mean')
and then to re-use day_groups to do the 2 more fine-grained groupbys like this:
df['Date_Sector_Range_Avg'] = day_groups.groupby('Segment')['Range'].transform('mean')
Which doesn't work as you get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
groupby runs really fast when the aggregate function is vectorized. If you are worried about performance, try it out first to see if it's the real bottleneck in your program.
You can create temporary data frames holding the result of each groupby, then successively merge them with df:
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"],
}
tmp = [
    df.groupby(columns)["Range"].mean().to_frame(key)
    for key, columns in group_bys.items()
]
result = df
for t in tmp:
    result = result.merge(t, left_on=t.index.names, right_index=True)
Result:
Date Stock Sector Segment Range Date_Range_Avg Date_Sector_Range_Avg Date_Segment_Range_Avg
0 2016-10-11 A 0 0 5 1.6 2.500000 5.00
1 2016-10-11 B 0 1 0 1.6 2.500000 0.75
2 2016-10-11 C 1 1 1 1.6 1.000000 0.75
3 2016-10-11 D 1 1 0 1.6 1.000000 0.75
4 2016-10-11 E 1 1 2 1.6 1.000000 0.75
5 2016-10-12 F 0 1 6 7.8 9.666667 6.00
6 2016-10-12 G 0 2 0 7.8 9.666667 11.50
7 2016-10-12 H 0 2 23 7.8 9.666667 11.50
8 2016-10-12 I 1 3 5 7.8 5.000000 5.00
9 2016-10-12 J 1 3 5 7.8 5.000000 5.00
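Each merge aligns on the original key columns: t.index.names recovers the column names that were grouped on, and right_index=True matches them against the temporary frame's (Multi)Index.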
Another option is to use transform and avoid the multiple merges:
# reusing your code
group_bys = {
    "Date_Range_Avg": ["Date"],
    "Date_Sector_Range_Avg": ["Date", "Sector"],
    "Date_Segment_Range_Avg": ["Date", "Segment"],
}
tmp = {
    key: df.groupby(columns)["Range"].transform('mean')
    for key, columns in group_bys.items()
}
# assign returns a new DataFrame, so capture the result
df = df.assign(**tmp)

Split dataframe column by content

How can I split this data by the column with the values 'A', 'B', ...?
The first column must be retained as the index.
df = pd.DataFrame(data)
df = df[['seconds', 'marker', 'data1', 'data2', 'data3']]
seconds,marker,data1,data2,data3
00001,A,3,3,0,42,0
00002,B,3,3,0,34556,0
00003,C,3,3,0,42,0
00004,A,3,3,0,1833,0
00004,B,3,3,0,6569,0
00005,C,3,3,0,2454,0
00006,C,3,3,0,3256,0
00007,C,3,3,0,5423,0
00008,A,3,3,0,569,0
You can just get the unique values in the letter column (that's what I called it), and then filter the full DataFrame using these unique values.
I am storing the newly created DataFrames in a dictionary here, but you could also store them in a list or whatever. I've used the input you provided, but have given the first 2 columns the names index and letter, as they were unnamed in your .csv.
import pandas as pd

df = pd.DataFrame({
    'index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8},
    'letter': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'C', 7: 'C', 8: 'A'},
    'seconds': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
    'marker': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3, 8: 3},
    'data1': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
    'data2': {0: 42, 1: 34556, 2: 42, 3: 1833, 4: 6569, 5: 2454, 6: 3256, 7: 5423, 8: 569},
    'data3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0}
})
# get unique values
unique_values = df["letter"].unique()
# filter "big" dataframe using one of the unique value at a time
split_dfs = {value: df[df["letter"] == value] for value in unique_values}
print(split_dfs["A"])
print(split_dfs["B"])
print(split_dfs["C"])
Expected output:
index letter seconds marker data1 data2 data3
0 1 A 3 3 0 42 0
3 4 A 3 3 0 1833 0
8 8 A 3 3 0 569 0
index letter seconds marker data1 data2 data3
1 2 B 3 3 0 34556 0
4 4 B 3 3 0 6569 0
index letter seconds marker data1 data2 data3
2 3 C 3 3 0 42 0
5 5 C 3 3 0 2454 0
6 6 C 3 3 0 3256 0
7 7 C 3 3 0 5423 0
As you can see, the index is preserved.
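
A hedged alternative sketch: a single groupby pass yields the same per-letter frames without computing the unique values first, since iterating a groupby produces (key, sub-frame) pairs:
# each iteration yields (letter, sub-frame); the index is preserved here too
split_dfs = {letter: group for letter, group in df.groupby('letter')}
print(split_dfs['A'])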

How to do a Pandas comparison with keep_shape=False, but maintain the relationship with the username column

I'm trying to run a Pandas dataframe comparison, df.compare(df2), that returns only the differences between two dataframes, while keeping the relationship between the first column (with users' names) and the output. With keep_shape=False, only the rows with differences and their indexes are displayed, but the relationship with the username column is lost.
How do I keep the name column (which is the first column) and use the argument keep_shape=False at the same time, so I can identify the username and the changes together?
Example:
import pandas as pd
df_1 = pd.read_excel('../output/spreadsheet_Jan_1.xlsx')
df_2 = pd.read_excel('../output/spreadsheet_Feb_1.xlsx')
df_compare = df_1.compare(df_2, keep_equal=True, keep_shape=False)
I guess the image isn't showing... it's a spreadsheet with the df.compare() result showing the averages columns and the 'self' and 'other' columns split below the averages. The index is on the left-hand side in the order of the keep_shape=False format (e.g. 1, 6, 7, 8, 9, 11, etc.).
How do I match the usernames, which are in the first column along the left side, with the associated indexes?
Thanks in advance.
Here is an example of one simple way to do it:
import pandas as pd
df_1 = pd.DataFrame(
{
"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
"quantity": {0: 22, 1: 8, 2: 7, 3: 10},
}
)
df_2 = pd.DataFrame(
{
"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
"quantity": {0: 27, 1: 8, 2: 8, 3: 10},
}
)
In df_compare, we want to show the fruit names for which values are different in df_1 and df_2 (that is to say 'banana' and 'apple'):
df_compare = (
df_1
.compare(df_2, keep_equal=True, keep_shape=False)
.pipe(lambda df_: df_.set_index(df_1.loc[df_.index, "fruit"]))
.reset_index()
)
print(df_compare)
# Output
fruit quantity
self other
0 banana 22 27
1 apple 7 8
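The pipe step looks up, in df_1, the fruit names at the row indexes that compare kept, and sets them as the index; this works regardless of which rows survive keep_shape=False.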
Thanks Laurent for the dataset example:
df_1 = pd.DataFrame({"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
"quantity": {0: 22, 1: 8, 2: 7, 3: 10}})
df_2 = pd.DataFrame({"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
"quantity": {0: 27, 1: 8, 2: 8, 3: 10}})
df_compare = pd.concat([df_1['fruit'],
df_1.compare(df_2, keep_equal=True, keep_shape=False)],1).dropna()
print(df_compare)
fruit (quantity, self) (quantity, other)
0 banana 22.0 27.0
2 apple 7.0 8.0
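Note that the quantities come back as floats: the concat aligns on the full index and fills the unchanged rows with NaN before dropna removes them, which upcasts the integer columns. If the original dtype matters, cast those columns back with .astype(int).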

How to subtract one row from other rows in a grouped dataframe?

I've got this data frame with some 'init' values ('value', 'value2') that I want to subtract from the mid-term ('mid') and final ('final') values once I've grouped by ID.
import pandas as pd
df = pd.DataFrame({
    'value': [100, 120, 130, 200, 190, 210],
    'value2': [2100, 2120, 2130, 2200, 2190, 2210],
    'ID': [1, 1, 1, 2, 2, 2],
    'state': ['init', 'mid', 'final', 'init', 'mid', 'final'],
})
My attempt was to extract the indexes where I find 'init', 'mid' and 'final', and to subtract the 'init' value from 'mid' and 'final' after grouping by 'ID':
group = df.groupby('ID')
group['diff_1_f'] = group['value'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]]
group['diff_2_f'] = group['value2'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_1_m'] = group['value'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_2_m'] = group['value2'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
But of course it doesn't work. How can I obtain the following result:
df = pd.DataFrame({
    'diff_value': [20, 30, -10, 10],
    'diff_value2': [20, 30, -10, 10],
    'ID': [1, 1, 2, 2],
    'state': ['mid', 'final', 'mid', 'final'],
})
Also in its grouped form.
Use:
# column names to subtract
cols = ['value', 'value2']
# new column names created by the join suffix
new = [c + '_diff' for c in cols]
# mask of rows that are not 'init'
m = df['state'].ne('init')
# join the 'init' rows' values by ID, then keep only the non-'init' rows
df1 = df.join(df[~m].set_index('ID')[cols], lsuffix='_diff', on='ID')[m]
# subtract with a numpy array (.values) to prevent index alignment
df1[new] = df1[new].sub(df1[cols].values)
# remove helper columns
df1 = df1.drop(cols, axis=1)
print(df1)
value_diff value2_diff ID state
1 20 20 1 mid
2 30 30 1 final
4 -10 -10 2 mid
5 10 10 2 final
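
A hedged alternative sketch, assuming exactly one 'init' row per ID: index the init values by ID and subtract them positionally from the remaining rows.
cols = ['value', 'value2']
# one row of 'init' values per ID, indexed by ID
init = df.loc[df['state'].eq('init')].set_index('ID')[cols]
# keep the non-'init' rows and subtract each row's own ID's init values
out = df[df['state'].ne('init')].copy()
out[['diff_value', 'diff_value2']] = out[cols].to_numpy() - init.loc[out['ID']].to_numpy()
out = out.drop(columns=cols)
print(out)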