How to extract a dataframe based on a condition in pandas?

Please help me. Here is the problem:
Write an expression to extract a new dataframe containing those days where the temperature reached at least 70 degrees, and assign that to the variable at_least_70. (You might need to think some about what the different columns in the full dataframe represent to decide how to extract the subset of interest.)
After that, write another expression that computes how many days reached at least 70 degrees, and assign that to the variable num_at_least_70.
This is the original DataFrame:
Date Maximum Temperature Minimum Temperature \
0 2018-01-01 5 0
1 2018-01-02 13 1
2 2018-01-03 19 -2
3 2018-01-04 22 1
4 2018-01-05 18 -2
.. ... ... ...
360 2018-12-27 33 23
361 2018-12-28 40 21
362 2018-12-29 50 37
363 2018-12-30 37 24
364 2018-12-31 35 25
Average Temperature Precipitation Snowfall Snow Depth
0 2.5 0.04 1.0 3.0
1 7.0 0.03 0.6 4.0
2 8.5 0.00 0.0 4.0
3 11.5 0.00 0.0 3.0
4 8.0 0.09 1.2 4.0
.. ... ... ... ...
360 28.0 0.00 0.0 1.0
361 30.5 0.07 0.0 0.0
362 43.5 0.04 0.0 0.0
363 30.5 0.02 0.7 1.0
364 30.0 0.00 0.0 0.0
[365 rows x 7 columns]
The code I wrote for the problem is:
at_least_70 = dfc.loc[dfc['Minimum Temperature']>=70,['Date']]
print(at_least_70)
num_at_least_70 = at_least_70.count()
print(num_at_least_70)
The result it shows:
Date
204 2018-07-24
240 2018-08-29
245 2018-09-03
Date 3
dtype: int64
But when I run the test case it shows:
Incorrect!
You are not correctly extracting the subset.

As suggested by @HenryYik, use the Maximum Temperature column and remove the column selector so the whole rows are kept:
at_least_70 = dfc.loc[dfc['Maximum Temperature'] >= 70]
num_at_least_70 = len(at_least_70)
Note that len gives a single integer, whereas DataFrame.count (as in the original attempt) returns a per-column Series, which is why the output showed Date 3.

Use boolean indexing, and to count the Trues in the mask use sum:
mask = dfc['Maximum Temperature'] >= 70
at_least_70 = dfc[mask]
num_at_least_70 = mask.sum()
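For reference, a minimal runnable sketch of the boolean-indexing approach, using a tiny made-up frame in place of the full 365-row dfc:
import pandas as pd

# Tiny stand-in for the full 365-row dfc from the question.
dfc = pd.DataFrame({
    'Date': ['2018-07-24', '2018-08-29', '2018-09-03', '2018-12-31'],
    'Maximum Temperature': [85, 74, 71, 35],
})

mask = dfc['Maximum Temperature'] >= 70  # boolean Series, one entry per row
at_least_70 = dfc[mask]                  # keeps whole rows, all columns
num_at_least_70 = mask.sum()             # each True counts as 1, so this is 3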

Related

group/merge/pivot data by varied weight ranges in Pandas

Is there a way in Pandas to fill in the value according to weight ranges when pivoting the dataframe? I see some answers that set bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd

df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A"],
                   'weight_start': [1,61,161,201,1,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output: (screenshot of pivot_df)
Desired Output: (screenshot of the desired frame)
Since location 2 has a 0 percent discount covering the whole range 1 to 500, I want it to populate 0 across the ranges prescribed for tier 1 service A instead of having its own row.
Edit: Mozway's answer works when there is one service, but when I added a second service the rows no longer grouped as desired.
Here's the new dataset with service B.
import pandas as pd

df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"],
                   'weight_start': [1,61,161,201,1,1,61,161,201,1,1,81,101,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500,500,80,100,200,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3,1,2,2,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0,50,70,50,10,65,55,45,5]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
B 1 60 NaN NaN 65.0
80 NaN 70.0 NaN
500 50.0 NaN NaN
61 160 NaN NaN 55.0
81 100 NaN 50.0 NaN
101 200 NaN 10.0 NaN
161 200 NaN NaN 45.0
201 500 NaN NaN 5.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 60 50 70 65.0
80 50 70.0 55
61 160 50 NaN 55.0
81 100 50 50.0 55
101 200 50 10.0 NaN
161 200 50 10 45.0
201 500 50 NaN 5.0
This will work:
data = (df.set_index(['tier','services','weight_start','weight_end'])
          .pivot(columns='location')['discount']
          .reset_index()
          .rename_axis(None, axis=1)
       )
IIUC, you can (temporarily) exclude the columns with 0/nan and check if all remaining values are only NaNs per row. If so, drop those rows:
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(1)  # True for rows that have data outside the all-0/NaN columns
out = pivot_df[mask].fillna(0)
output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
per group:
def drop(d):
    mask = ~d.loc[:, d.any()].isna().all(1)
    return d[mask].fillna(0)

out = pivot_df.groupby(['services']).apply(drop)
output:
location 1 2 3
services tier services weight_start weight_end
A 1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 B 1 60 0.0 0.0 65.0
80 0.0 70.0 0.0
500 50.0 0.0 0.0
61 160 0.0 0.0 55.0
81 100 0.0 50.0 0.0
101 200 0.0 10.0 0.0
161 200 0.0 0.0 45.0
201 500 0.0 0.0 5.0
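As a side note, the grouped output above repeats services as an extra index level, because groupby.apply prepends the group key. If that is unwanted, passing group_keys=False should keep the original MultiIndex (a small variant of the code above):
# group_keys=False stops groupby.apply from prepending the 'services'
# label, so the original four-level MultiIndex is preserved.
out = pivot_df.groupby('services', group_keys=False).apply(drop)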

how to get the difference between a column from two dataframes by getting their index from another dataframe?

I have two dataframes for groundtruth and predicted trajectories, and one dataframe that matches the groundtruth and predicted trajectories at each frame. The groundtruth and predicted tracks look as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to look at each row of matched_gt_pred, get the corresponding CENTER_X from df_pred_batch and df_gt_batch, and calculate the error.
For instance, the first row of matched_gt_pred says that at FrameId == 0, OId == 1 and HId == 0 are matched. So I should get gt_center_x = df_gt_batch.loc[(0, 1), 'CENTER_X'] and pred_center_x = df_pred_batch.loc[(0, 0), 'CENTER_X'], and compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup:
gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
(This assumes import numpy as np. Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; see the sketch after the next option for a replacement.)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with a pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy()
    - df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy()
)
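Since DataFrame.lookup is gone in pandas 2.0, here is a rough equivalent for newer versions; a sketch assuming the frames above, with .astype(int) added because the OId/HId columns are floats while the index levels are integers:
import numpy as np
import pandas as pd

# Build (FrameId, OId) and (FrameId, HId) keys from the match table,
# then align each batch's CENTER_X on those keys.
gt_keys = pd.MultiIndex.from_frame(matched_gt_pred[['FrameId', 'OId']].astype(int))
pred_keys = pd.MultiIndex.from_frame(matched_gt_pred[['FrameId', 'HId']].astype(int))

gt_x = df_gt_batch['CENTER_X'].reindex(gt_keys).to_numpy()
pred_x = df_pred_batch['CENTER_X'].reindex(pred_keys).to_numpy()
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)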

How to add up with a variable instead a number in a dataframe?

Hi guys, I am trying to select the 2nd value and then add it to the rest of the array except the 1st value.
This is what I have so far:
Xloc = X.iloc[1]  # selecting the second value
X = X[1:-1] + Xloc  # this doesn't work, but if I do + 1.25 it works...
The DataFrame:
X
0
1.25
2.57
4.5
6.9
7.3
Expected Result
X
0
2.5
3.82
5.75
8.15
8.55
Given that this is your original df
N
0 0.00
1 1.25
2 2.57
3 4.50
4 6.90
5 7.30
you can add the second value to the rest of the column and use a simple concat to keep the original first value in place:
df['M'] = pd.concat([df["N"].iloc[:1], (df["N"].iloc[1:] + df["N"].iloc[1])])
print(df)
N M
0 0.00 0.00
1 1.25 2.50
2 2.57 3.82
3 4.50 5.75
4 6.90 8.15
5 7.30 8.55
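For completeness, a sketch that fixes the original attempt directly, assuming X is the single column from the question held as a Series; .iloc needs square brackets, and the scalar it returns broadcasts cleanly over the slice:
import pandas as pd

X = pd.Series([0.00, 1.25, 2.57, 4.50, 6.90, 7.30])

second = X.iloc[1]                      # a plain scalar (1.25)
result = X.copy()
result.iloc[1:] = X.iloc[1:] + second   # add to every value except the first
print(result)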

How to prepare paneldata to machine learning in Python?

I have a panel data set/time series. I want to prepare the dataset for machine learning to predict next year's gcp. My data looks like this:
ID,year,age,area,debt_ratio,gcp
654001,2013,49,East,0.14,0
654001,2014,50,East,0.17,0
654001,2015,51,East,0.23,1
654001,2016,52,East,0.18,0
112089,2013,39,West,0.13,0
112089,2014,40,West,0.15,0
112089,2015,41,West,0.18,1
112089,2016,42,West,0.21,1
What I want is something like this:
ID,year,age,area,debt_ratio,gcp,gcp-1,gcp-2,gcp-3
654001,2013,49,East,0.14,0,NA,NA,NA
654001,2014,50,East,0.17,0,0,NA,NA
654001,2015,51,East,0.23,1,0,0,NA
654001,2016,52,East,0.18,0,1,0,0
112089,2013,39,West,0.13,0,NA,NA,NA
112089,2014,40,West,0.15,0,0,NA,NA
112089,2015,41,West,0.18,1,0,0,NA
112089,2016,42,West,0.21,1,1,0,0
I've tried Pandas' melt function, but it didn't work out. I searched online and found this post, which does exactly what I want but is done in R:
https://stackoverflow.com/questions/19813077/prepare-time-series-for-machine-learning-long-to-wide-format
Does anybody know how to do this in Python Pandas? Any suggestion would be appreciated!
Use DataFrameGroupBy.shift in a loop:
for i in range(1, 4):
    df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)
print(df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0
A more dynamic solution is to get the maximum number of rows per group and pass it to range:
N = df['ID'].value_counts().max()
for i in range(1, N):
    df[f'gcp-{i}'] = df.groupby('ID')['gcp'].shift(i)
print(df)
ID year age area debt_ratio gcp gcp-1 gcp-2 gcp-3
0 654001 2013 49 East 0.14 0 NaN NaN NaN
1 654001 2014 50 East 0.17 0 0.0 NaN NaN
2 654001 2015 51 East 0.23 1 0.0 0.0 NaN
3 654001 2016 52 East 0.18 0 1.0 0.0 0.0
4 112089 2013 39 West 0.13 0 NaN NaN NaN
5 112089 2014 40 West 0.15 0 0.0 NaN NaN
6 112089 2015 41 West 0.18 1 0.0 0.0 NaN
7 112089 2016 42 West 0.21 1 1.0 0.0 0.0
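A possible variant: the same lag columns can be built in a single pd.concat call and joined on, which avoids inserting columns one at a time (a sketch assuming the df above):
# Build all lag columns at once, then attach them to the original frame.
lags = pd.concat(
    {f'gcp-{i}': df.groupby('ID')['gcp'].shift(i) for i in range(1, 4)},
    axis=1,
)
df = df.join(lags)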

pandas aggregating frames by largest common column denominator and filling missing values

I have been struggling with this issue for a bit, and even though I assume there are some workarounds, I would love to know if there is an elegant way to achieve this result:
import pandas as pd
import numpy as np

data = np.array([
    [1, 10],
    [2, 12],
    [4, 13],
    [5, 14],
    [8, 15]])
df1 = pd.DataFrame(data=data, index=range(0, 5), columns=['x', 'a'])

data = np.array([
    [2, 100, 101],
    [3, 120, 122],
    [4, 130, 132],
    [7, 140, 142],
    [9, 150, 151],
    [12, 160, 152]])
df2 = pd.DataFrame(data=data, index=range(0, 6), columns=['x', 'b', 'c'])
Now I would like a data frame that concatenates those two and fills each missing value with the previous value in its column, or with the first value otherwise. Both data frames can have different sizes; what we are interested in here is the shared column x.
This would be my desired output frame df_result, where x is the aggregated set of unique x values from the 2 frames:
x a b c
0 1 10 100 101
1 2 12 100 101
2 3 12 120 122
3 4 13 130 132
4 5 14 130 132
5 7 14 140 142
6 8 15 140 142
7 9 15 150 151
8 12 15 160 152
Any help or hint would be much appreciated, thank you very much
You can simply use a merge operation on the 2 dataframes; after that, apply a sort, a forward fill, and a backward fill for the null values:
df1.merge(df2,on='x',how='outer').sort_values('x').ffill().bfill()
Out:
x a b c
0 1 10.0 100.0 101.0
1 2 12.0 100.0 101.0
5 3 12.0 120.0 122.0
2 4 13.0 130.0 132.0
3 5 14.0 130.0 132.0
6 7 14.0 140.0 142.0
4 8 15.0 140.0 142.0
7 9 15.0 150.0 151.0
8 12 15.0 160.0 152.0
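If you also want the clean 0..8 index shown in the desired output, a reset_index(drop=True) at the end of the same chain should do it:
df_result = (df1.merge(df2, on='x', how='outer')
                .sort_values('x')
                .ffill()
                .bfill()
                .reset_index(drop=True))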