Find if column value reached upper limit or lower limit first in pandas

I have a pandas dataframe given below
id val ulim llim
1 100.25 101 98
2 97.30 99 95
3 104.22 106 100
4 105.00 107 102
5 95.00 99 91
.. .... .. ..
100000 105.92 107 103
For each row, I need to find whether the upper limit (ulim) or the lower limit (llim) is reached first by the subsequent values.
For example:
For the first row, the value (val) is 100.25, the upper limit is 101 and the lower limit is 98.
The value of the second row, 97.30, is less than the first row's lower limit (98). Hence, I will mark the first row as -1.
For the second row, the value (val) is 97.30, the upper limit is 99 and the lower limit is 95.
The value of the third row, 104.22, is higher than the upper limit. Hence, this row will be marked as 1.
For the third row, the value (val) is 104.22, the upper limit is 106 and the lower limit is 100. The value in the fourth row (105) is in between the upper and lower limits, so we move on to the fifth row, where the value is 95, which is below the lower limit (100). Hence, this row will be marked as -1.
Target df would be as follows
id val ulim llim result
1 100.25 101 98 -1
2 97.30 99 95 1
3 104.22 106 100 -1
4 105.00 107 102 -1
5 95.00 99 91 1
.. .... .. ..
100000 105.92 107 103 NaN
I have more than a million rows like this. Is it possible to have a solution without iteration?
The current iterative solution I tried is very slow; it looks roughly like the sketch below:
1. Loop through each row in the data frame.
2. Find the row index and slice the data frame as df.iloc[current_index:].
3. Take the val column from the sliced data frame and convert it to a list (sliced_df["val"].tolist()).
4. Use a list comprehension to check whether the upper limit or the lower limit is reached first.
5. Mark the result column based on the result of step 4.
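A minimal sketch of that loop, for reference (the breach_direction helper is just illustrative; it scans from the following row onward, which matches the target result above):
import numpy as np

def breach_direction(future_vals, ulim, llim):
    # walk forward until a value breaches either limit
    for v in future_vals:
        if v > ulim:
            return 1
        if v < llim:
            return -1
    return np.nan

results = []
for i in range(len(df)):
    future_vals = df["val"].iloc[i + 1:].tolist()
    results.append(breach_direction(future_vals, df["ulim"].iloc[i], df["llim"].iloc[i]))
df["result"] = results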

You need several things to complete this task: shift to shift the values up, np.select to map the values, then bfill to back fill the values when needed:
import numpy as np
import pandas as pd

# first we shift the values up by 1 row
shifted = df['val'].shift(-1)

# temporarily compare the shifted values to the limits;
# `nan` marks the rows where the next value is still within the limits
df['result'] = np.select((shifted > df['ulim'], shifted < df['llim']), (1, -1), np.nan)
Here, we almost have what we want:
id val ulim llim result
0 1 100.25 101 98 -1.0
1 2 97.30 99 95 1.0
2 3 104.22 106 100 NaN
3 4 105.00 107 102 -1.0
4 5 95.00 99 91 1.0
5 100000 105.92 107 103 NaN
except for the NaN value at id==3 (the NaN in the last row is expected). To resolve it, we mask the shifted values where result is still NaN, back fill them, then compare again:
shifted = shifted.mask(df['result'].isna()).bfill()
Now the shifted series is (notice that the 95 is shifted to the row 2 as well):
0 97.30
1 104.22
2 95.00
3 95.00
4 105.92
5 NaN
Name: val, dtype: float64
And we can repeat the assignment; fillna only fills where result is still missing (np.select returns a plain array, so we wrap it in a Series to keep the index aligned for fillna):
fill = np.select((shifted > df['ulim'], shifted < df['llim']), (1, -1), np.nan)
df['result'] = df['result'].fillna(pd.Series(fill, index=df.index))
Output:
id val ulim llim result
0 1 100.25 101 98 -1.0
1 2 97.30 99 95 1.0
2 3 104.22 106 100 -1.0
3 4 105.00 107 102 -1.0
4 5 95.00 99 91 1.0
5 100000 105.92 107 103 NaN
Note: this works for the sample data, where at most two rows ahead need to be checked. In general, you may want to repeat the mask/bfill/compare step until the result column (or the shifted series) stops changing.
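A minimal sketch of what that repetition could look like, reusing the same names and simply reapplying the mask/bfill/compare step until result stops changing:
shifted = df['val'].shift(-1)
df['result'] = np.select((shifted > df['ulim'], shifted < df['llim']), (1, -1), np.nan)

while df['result'].isna().any():
    # discard shifted values that did not resolve their row, then pull up the next candidate
    shifted = shifted.mask(df['result'].isna()).bfill()
    fill = np.select((shifted > df['ulim'], shifted < df['llim']), (1, -1), np.nan)
    new_result = df['result'].fillna(pd.Series(fill, index=df.index))
    if new_result.equals(df['result']):
        break  # no further progress, e.g. trailing rows that never breach a limit
    df['result'] = new_result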

You can simply create a new column with shift(-1) and then compare it to the upper and lower limits:
df['valnew'] = df['val'].shift(-1)
df['check'] = np.where(df['valnew'] >= df['ulim'], 1, -1)
df

Related

Remove related row from pandas dataframe

I have the following dataframe:
id relatedId coordinate
123 125 55
125 123 45
128 130 60
132 135 50
130 128 40
135 132 50
So I have 6 rows in this dataframe, but I would like to get rid of the related rows, ending up with 3 rows. The coordinate values of two related rows always sum to 100, and I would like to keep the one with the lower value (the one less than 50; if both are 50, simply one of them). The resulting dataframe would thus be:
id relatedId coordinate
125 123 45
132 135 50
130 128 40
Hopefully someone has a good solution for this problem.
Thanks
You can sort the values and get the first value per group using a frozenset of the 2 ids as grouper:
(df
.sort_values(by='coordinate')
.groupby(df[['id', 'relatedId']].agg(frozenset, axis=1), as_index=False)
.first()
)
output:
id relatedId coordinate
0 130 128 40
1 125 123 45
2 132 135 50
Alternatively, to keep the original order, and original indices, use idxmin per group:
group = df[['id', 'relatedId']].agg(frozenset, axis=1)
idx = df['coordinate'].groupby(group).idxmin()
df.loc[sorted(idx)]
output:
id relatedId coordinate
1 125 123 45
3 132 135 50
4 130 128 40

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on that quantile percentage. So if one group has q=0.8, I want the lowest 80% of values given 1, and the upper 20% given 0.
So, given data like this:
I want objects 1, 2 and 5 to get result 1 and the other 3 result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but for that I need a constant quantile value, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000

qua = np.around(np.random.uniform(size=grp_num), 2)

df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
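If the groupby.apply turns out to be too slow, one possible variant (a sketch reusing df and qua from the setup above, not a benchmarked claim) is to compute a single threshold per group and map it back:
# one quantile threshold per group, then a single vectorized comparison
thr = {g: v.quantile(qua[g]) for g, v in df.groupby("Group")["Value"]}
df["result"] = (df["Value"] < df["Group"].map(thr)).astype(int)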

Assigning Score based on Order Sequence in pandas

Following are the dataframes I have
score_df
col1_id col2_id score
1 2 10
5 6 20
records_df
date col_id
D1 6
D2 4
D3 1
D4 2
D5 5
D6 7
I would like to compute a score based on the following criteria:
When 2 occurs after 1, the score should be 10; likewise, when 1 occurs after 2, the score should be 10.
i.e. when (1,2) gives a score of 10, (2,1) also gets the same score of 10.
Considering (1,2): when 1 occurs for the first time we don't assign a score. We flag the row and wait for 2 to occur. When 2 occurs in the column, we give the score 10.
Considering (2,1): when 2 comes first, we assign 0 and wait for 1 to occur. When 1 occurs, we give the score 10.
So, for the first occurrence, don't assign the score; wait for the corresponding event to occur and then assign the score.
So, my result dataframe should look something like this:
result
date col_id score
D1 6 0 -- even though 6 is in the score list, it occurred for the first time, so 0
D2 4 0 -- 4 is not even in the list
D3 1 0 -- 1 occurred for the first time, so 0
D4 2 10 -- 1 occurred previously and 2 occurred now, so we can assign 10
D5 5 20 -- 6 occurred previously, so we can assign 20
D6 7 0 -- 7 is not in the list
I have around 100k rows in both score_df and records_df. Looping and assigning the score takes too much time. Can someone help with the logic without looping through the entire dataframe?
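For illustration, a hedged loop-based sketch of the rule itself (slow, just to make the logic concrete; the pair_score and seen names are hypothetical):
# map each id to (partner_id, score); both directions get the same score
pair_score = {}
for _, r in score_df.iterrows():
    pair_score[r.col1_id] = (r.col2_id, r.score)
    pair_score[r.col2_id] = (r.col1_id, r.score)

seen = set()
scores = []
for cid in records_df["col_id"]:
    partner, s = pair_score.get(cid, (None, 0))
    # assign the score only if the partner id has already occurred
    scores.append(s if partner in seen else 0)
    seen.add(cid)
records_df["score"] = scores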
From what I understand, you can try melt for unpivoting and then merge. Keeping the index from the melted df, we check where the index is duplicated, and then return the score from the merge, else 0.
m = score_df.reset_index().melt(['index', 'uid', 'score'],
                                var_name='col_name', value_name='col_id')
final = records_df.merge(m.drop(columns='col_name'), on=['uid', 'col_id'], how='left')
c = final.duplicated(['index']) & final['index'].notna()
final = final.drop(columns='index').assign(score=lambda x: x['score'].where(c, 0))
print(final)
uid date col_id score
0 123 D1 6 0.0
1 123 D2 4 0.0
2 123 D3 1 0.0
3 123 D4 2 10.0
4 123 D5 5 20.0
5 123 D6 7 0.0

Divide dataframe in different bins based on condition

I have a pandas dataframe
id no_of_rows
1 2689
2 1515
3 3826
4 814
5 1650
6 2292
7 1867
8 2096
9 1618
10 923
11 766
12 191
I want to divide the ids into 5 different bins based on their number of rows, such that every bin has approximately the same total number of rows, and assign the bin as a new column bin.
One approach I thought of:
df.no_of_rows.sum() == 20247
div_factor = 20247 // 5 == 4049
If we add the 1st and 2nd rows, their sum is 2689 + 1515 = 4204 > div_factor.
Therefore assign bin = 1 where id = 1.
Now look for the next ones:
id no_of_rows bin
1 2689 1
2 1515 2
3 3826 3
4 814 4
5 1650 4
6 2292 5
7 1867
8 2096
9 1618
10 923
11 766
12 191
But this method proved wrong.
Is there a way to have 5 bins such that every bin gets approximately the same total number of rows?
You can use an approach based on percentiles.
n_bins = 5
dfa = df.sort_values(by='no_of_rows').cumsum()
# clamp to n_bins - 1 so the largest cumulative value does not spill into an extra bin
df['bin'] = dfa.no_of_rows.apply(lambda x: min(int(n_bins * x / dfa.no_of_rows.max()), n_bins - 1))
And then you can check with
df.groupby('bin').sum()
The more records you have, the fairer the split will be in terms of dispersion.

How to modify element lengths of a pandas dataframe?

I want to change each element of a pandas dataframe to a specified length and number of decimal digits. Length means the number of characters. For example, the element -23.5556
is 8 characters long (including the minus sign and the decimal point). I want to modify it to a total length of 6 characters with 2 decimal digits, such as -23.56. If it is shorter than 6 characters, pad with spaces. There is no separation between the elements of the new df.
name x y elev m1 m2
136 5210580.00000 5846400.000000 43.3 -28.2 -24.2
246 5373860.00000 5809680.000000 36.19 -25 -22.3
349 5361120.00000 5735330.000000 49.46 -24.7 -21.2
353 5521370.00000 5770740.000000 17.74 -26 -20.5
425 5095630.00000 5528200.000000 58.14 -30.3 -26.1
434 5198630.00000 5570740.000000 73.26 -30.2 -26
442 5373170.00000 5593290.000000 37.17 -22.9 -18.3
each column's requested format:
characters decimal digits
name 3 0
x 14 2
y 14 2
elev 4 1
m1 6 2
m2 6 2
the new df format I wanted:
1365210580.00 5846400.00 43.3-28.2 -24.2
2465373860.00 5809680.00 36.1-25.0 -22.3
3495361120.00 5735330.00 49.4-24.7 -21.2
3535521370.00 5770740.00 17.7-26.0 -20.5
4255095630.00 5528200.00 58.1-30.3 -26.1
4345198630.00 5570740.00 73.2-30.2 -26.0
4425373170.00 5593290.00 37.1-22.9 -18.3
Lastly, I want to save the new df as a fixed-width ASCII .dat file.
Which tool in pandas could do this?
You can use string formatting
sf = '{name:3.0f}{x:<14.2f}{y:<14.2f}{elev:<4.1f}{m1:<6.1f}{m2:6.1f}'.format
df.apply(lambda r: sf(**r), 1)
0 1365210580.00 5846400.00 43.3-28.2 -24.2
1 2465373860.00 5809680.00 36.2-25.0 -22.3
2 3495361120.00 5735330.00 49.5-24.7 -21.2
3 3535521370.00 5770740.00 17.7-26.0 -20.5
4 4255095630.00 5528200.00 58.1-30.3 -26.1
5 4345198630.00 5570740.00 73.3-30.2 -26.0
6 4425373170.00 5593290.00 37.2-22.9 -18.3
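To cover the last part of the question (saving as fixed-width ASCII), one simple possibility reusing the sf formatter above (the output.dat filename is just an example):
# write the fixed-width strings to a plain ASCII .dat file, one record per line
lines = df.apply(lambda r: sf(**r), axis=1)
with open("output.dat", "w") as fh:
    fh.write("\n".join(lines) + "\n")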
You need
df.round(2)
The resulting df
name x y elev m1 m2
0 136 5210580 5846400 43.30 -28.2 -24.2
1 246 5373860 5809680 36.19 -25.0 -22.3
2 349 5361120 5735330 49.46 -24.7 -21.2
3 353 5521370 5770740 17.74 -26.0 -20.5
4 425 5095630 5528200 58.14 -30.3 -26.1
5 434 5198630 5570740 73.26 -30.2 -26.0
6 442 5373170 5593290 37.17 -22.9 -18.3