Pandas: slice by named index using loc, but don't include the first index

I have a dataframe with a named index and need to select all rows after a particular index label, not including that label.
For example:
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
I need to select the rows of df after cobra. In pseudocode: df.loc['cobra' + 1:]

There are several ways to go about this:
>>> df.iloc[df.index.tolist().index('cobra')+1:]
max_speed shield
viper 4 5
sidewinder 7 8
>>> df.drop('cobra', axis=0)
max_speed shield
viper 4 5
sidewinder 7 8
>>> df[df.index != 'cobra']
max_speed shield
viper 4 5
sidewinder 7 8
An additional method, proposed by @Quang Hoang:
>>> df.iloc[df.index.get_indexer(['cobra'])[0]+1:]
max_speed shield
viper 4 5
sidewinder 7 8

Selecting without including cobra:
df.iloc[df.index.get_indexer(['cobra'])[0] + 1:, :]

Try:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
print(df.loc[df.index > 'cobra'])
Output:
max_speed shield
viper 4 5
sidewinder 7 8
Note that this is a label comparison, not a positional slice: it works here only because the remaining labels happen to sort after 'cobra' lexicographically.
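Another position-based option, not shown in the answers above, is Index.get_loc, which returns the integer position of a unique label; slicing from the next position excludes the label itself. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

# get_loc gives the integer position of 'cobra'; slicing from the
# next position onwards excludes it
result = df.iloc[df.index.get_loc('cobra') + 1:]
print(result)
```

Unlike df.index > 'cobra', this does not depend on the lexicographic order of the labels.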

How to do a conditional rolling mean in Pandas?

I have this data frame available. It has a timestamp for start, a timestamp for end and a duration column.
start  end  duration
1      5    4
2      5    3
3      4    1
4      6    2
5      9    4
6      7    1
7      10   3
I'd like to add a column 'rolling_mean' to the dataframe that calculates a rolling average over all previous rows (ordered by start), with one condition: a previous row only counts towards the mean if its event has already ended, i.e. its end must be less than or equal to the start of the row for which the rolling mean is being calculated. So for row number 4 the rolling_mean is 1, because of all the previous rows only row 3 satisfies the condition of the event having ended.
This is the dataframe I'd like to get with a Pandas rolling mean:
start  end  duration  rolling_mean
1      5    4         NaN
2      5    3         NaN
3      4    1         NaN
4      6    2         1.000000
5      9    4         2.666667
6      7    1         2.500000
7      10   3         2.200000
Here is the code to reproduce my example:
import pandas as pd

d = [[1, 5],
     [2, 5],
     [3, 4],
     [4, 6],
     [5, 9],
     [6, 7],
     [7, 10]]
df = pd.DataFrame(d, columns=['start_time', 'end_time'])
df['duration'] = df.end_time - df.start_time
I've tried to merge the dataframe with itself to then filter out the irrelevant rows, but the data frame is too big to take this approach.
So I'm looking for a rolling mean but where I can specify the extra condition.
Does anyone have any ideas for this one?
A for loop will do the job:
import numpy as np

rolling_mean = np.repeat(np.nan, len(df))
start, end, duration = df[["start_time", "end_time", "duration"]].to_numpy().T
for i in range(len(df)):
    # previous rows whose event has ended by the time row i starts
    matches = duration[:i][end[:i] <= start[i]]
    if matches.size:  # .size, not .any(): a zero duration would otherwise be skipped
        rolling_mean[i] = matches.mean()
df["rolling_mean"] = rolling_mean
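The loop above is clear but runs in Python per row. A vectorised variant (a sketch; note it materialises an n×n boolean mask, so it trades memory for speed and is only suitable for moderate n) builds the "has ended before row i starts" condition for all pairs at once:

```python
import numpy as np
import pandas as pd

d = [[1, 5], [2, 5], [3, 4], [4, 6], [5, 9], [6, 7], [7, 10]]
df = pd.DataFrame(d, columns=['start_time', 'end_time'])
df['duration'] = df.end_time - df.start_time

start = df.start_time.to_numpy()
end = df.end_time.to_numpy()
dur = df.duration.to_numpy().astype(float)
n = len(df)

# mask[i, j] is True when row j lies strictly before row i
# and has ended by the time row i starts
mask = (end[None, :] <= start[:, None]) & \
       (np.arange(n)[None, :] < np.arange(n)[:, None])
counts = mask.sum(axis=1)
# mean of matching durations per row; NaN where nothing matches
means = np.where(counts > 0, mask @ dur / np.maximum(counts, 1), np.nan)
df['rolling_mean'] = means
```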

rolling windows defined by backward cumulative sums

I have got a pandas DataFrame like this:
   A    B
0  3  ...
1  2
2  4
3  4
4  1
5  7
6  5
7  3
I would like to compute a rolling window along column A, summing its elements backwards until the sum reaches at least 10. The resulting windows should be:
   A    B   window_indices
0  3  ...   NA
1  2        NA
2  4        NA
3  4        [3, 2, 1]
4  1        [4, 3, 2, 1]
5  7        [5, 4, 3]
6  5        [6, 5]
7  3        [7, 6, 5]
Next, I want to compute some statistics on column B, something like this:
df.my_rolling(on='A', func='sum', threshold=10).B.mean()
I have got an idea: we could think of the elements of column A as seconds, transform A into a datetime column, and perform a standard time-based rolling on it. But I don't know how to do that.
This is not doable with rolling, since the rolling window size is not fixed:
l = [[df.index[(df.A.loc[:x].iloc[::-1].cumsum() >= 10).idxmax():x + 1].tolist()[::-1]
      if df.A.loc[:x].sum() >= 10 else np.nan]
     for x in df.A.index]
Out[46]:
[[nan],
[nan],
[nan],
[[3, 2, 1]],
[[4, 3, 2, 1]],
[[5, 4, 3]],
[[6, 5]],
[[7, 6, 5]]]
df['new'] = l
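The one-liner above is dense; an equivalent explicit loop may be easier to adapt, for instance to aggregate column B per window afterwards. A sketch, using the same example data (the threshold variable is my own name for the "at least 10" target):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 4, 4, 1, 7, 5, 3]})
threshold = 10

windows = []
for i in range(len(df)):
    total, j = 0, i
    # walk backwards from row i until the running sum reaches the threshold
    while j >= 0 and total < threshold:
        total += df.A.iloc[j]
        j -= 1
    # record the window indices (newest first), or NaN if the sum never got there
    windows.append(list(range(i, j, -1)) if total >= threshold else np.nan)
df['window_indices'] = windows
```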

Find pattern in pandas dataframe, reorder it row-wise, and reset index

This is a multipart problem. I have found solutions for each separate part, but when I try to combine these solutions, I don't get the outcome I want.
Let's say this is my dataframe:
df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])), columns = ['Values', 'Vals'])
df
Values Vals
0 1 6
1 3 7
2 6 7
3 7 9
4 7 5
5 8 3
6 4 1
Let's say I want to find the pattern [6, 7, 7] in the 'Values' column.
I can use a modified version of the second solution given here:
Pandas: How to find a particular pattern in a dataframe column?
pattern = [6, 7, 7]
pat_i = [df[i - len(pattern):i]                        # get the matching rows
         for i in range(len(pattern), len(df) + 1)     # for each window of 3 consecutive elements
         if all(df['Values'][i - len(pattern):i] == pattern)]  # if the pattern matched
pat_i
[ Values Vals
2 6 7
3 7 9
4 7 5]
The only way I've found to narrow this down to just index values is the following:
pat_i = [df.index[i - len(pattern):i]                  # get the index
         for i in range(len(pattern), len(df) + 1)     # for each window of 3 consecutive elements
         if all(df['Values'][i - len(pattern):i] == pattern)]  # if the pattern matched
pat_i
[RangeIndex(start=2, stop=5, step=1)]
Once I've found the pattern, what I want to do, within the original dataframe, is reorder the pattern to [7, 7, 6], moving the entire associated rows as I do this. In other words, going by the index, I want to get output that looks like this:
df.reindex([0, 1, 3, 4, 2, 5, 6])
Values Vals
0 1 6
1 3 7
3 7 9
4 7 5
2 6 7
5 8 3
6 4 1
Then, finally, I want to reset the index so that the values in all the columns stay in the new re-ordered place;
Values Vals
0 1 6
1 3 7
2 7 9
3 7 5
4 6 7
5 8 3
6 4 1
In order to use pat_i as a basis for re-ordering, I've tried to modify the second solution given here:
Python Pandas: How to move one row to the first row of a Dataframe?
target_row = 2
# Move target row to first element of list.
idx = [target_row] + [i for i in range(len(df)) if i != target_row]
However, I can't figure out how to exploit the pat_i RangeIndex object to use it with this code. The solution, when I find it, will be applied to hundreds of dataframes, each one of which will contain the [6, 7, 7] pattern that needs to be re-ordered in one place, but not the same place in each dataframe.
Any help appreciated...and I'm sure there must be an elegant, pythonic way of doing this, as it seems like it should be a common enough challenge. Thank you.
I just sort of rewrote your code. I held the first and last indexes to the side, reordered the indexes of interest, and put everything together in a new index. Then I just use the new index to reorder the data.
import pandas as pd

df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])),
                  columns=['Values', 'Vals'])
pattern = [6, 7, 7]
new_order = [1, 2, 0]  # new order of the pattern
for i in list(df[df['Values'] == pattern[0]].index):
    if all(df['Values'][i:i + len(pattern)] == pattern):
        pat_i = df[i:i + len(pattern)]
        front_ind = list(range(0, pat_i.index[0]))
        back_ind = list(range(pat_i.index[-1] + 1, len(df)))
        pat_ind = [pat_i.index[j] for j in new_order]
        new_ind = front_ind + pat_ind + back_ind
        df = df.loc[new_ind].reset_index(drop=True)
df
Out[82]:
Values Vals
0 1 6
1 3 7
2 7 9
3 7 5
4 6 7
5 8 3
6 4 1
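To answer the narrower question of how to exploit the pat_i RangeIndex directly: a RangeIndex can be materialised with list() and spliced back between the untouched positions. A sketch, hard-coding the RangeIndex found in the question:

```python
import pandas as pd

df = pd.DataFrame(list(zip([1, 3, 6, 7, 7, 8, 4], [6, 7, 7, 9, 5, 3, 1])),
                  columns=['Values', 'Vals'])
pat_i = [pd.RangeIndex(start=2, stop=5, step=1)]  # as found in the question

window = list(pat_i[0])
# [6, 7, 7] -> [7, 7, 6]: rotate the window left by one position
reordered = window[1:] + window[:1]
new_ind = ([i for i in df.index if i < window[0]]
           + reordered
           + [i for i in df.index if i > window[-1]])
df = df.loc[new_ind].reset_index(drop=True)
```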

reindex (1,N) dimension dataframe

import pandas

A = pandas.DataFrame({"A": [1, 4], "Output1": [6, 8]}).set_index(["A"]).fillna(0)
new_A = A.reindex(pandas.MultiIndex.from_tuples([("Output1", "-")]), axis="columns")
I'm expecting to get:
  Output1
        -
A
1       6
4       8
But instead I get:
  Output1
        -
A
1     NaN
4     NaN
Is anything wrong with my code?
Don't use reindex, which aligns the columns by name. Just reassign the columns:
A.columns = pd.MultiIndex.from_tuples([("Output1", "-")])
Output:
  Output1
        -
A
1       6
4       8
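If you prefer not to mutate .columns in place, DataFrame.set_axis does the same relabelling and returns a new frame. A sketch:

```python
import pandas as pd

A = pd.DataFrame({"A": [1, 4], "Output1": [6, 8]}).set_index("A")
# set_axis replaces the column labels wholesale, without any alignment by name
new_A = A.set_axis(pd.MultiIndex.from_tuples([("Output1", "-")]), axis="columns")
```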

Pandas using iloc, apply and lambda with df column as part of condition

So I have this kind of code:
import pandas as pd
import numpy as np

myData = {'Price': [30000, 199, 30000, 199, 199],
          'Length': [7, 7, 7, 7, 6]}
df = pd.DataFrame(myData, columns=['Price', 'Length'])
print(df)
df.iloc[:, np.r_[0]] = df.iloc[:, np.r_[0]].apply(
    lambda x: [y if y >= 30000 else round(y / 2, 0) for y in x])
print(df)
What it does: it takes the value from the "Price" column and, if it is equal to or above 30000, leaves it unchanged; otherwise it divides it by 2 and rounds to a whole number.
This works great, but my problem is how to change this code so that it divides by the value in the "Length" column instead.
I need to use iloc since I don't know the names of the columns (they may change, but their positions won't), and I would like to have it solved using apply and lambda.
Another question is how to do the same thing for two columns at once (let's say divide "Price" and "Age" by the values in the "Length" column).
Thanks for any help on this issue.
EDIT:
Based on the answer below from jezrael, I managed to solve my second question by using a loop:
import pandas as pd
import numpy as np

myData = {'Price': [30000, 199, 30000, 199, 199],
          'Age': [7, 14, 21, 28, 30000],
          'Length': [7, 7, 7, 7, 7]}
df = pd.DataFrame(myData, columns=['Price', 'Age', 'Length'])
for column in df.columns[np.r_[0, 1]]:
    df[column] = np.where(df[column] >= 30000, df[column],
                          (df[column] / df.iloc[:, 2]).round())
    print(df[column])
print(df)
I wonder if it can be done without using loops, though?
Use numpy.where with the condition; apply is not recommended here because it is slow:
df.iloc[:, 0] = np.where(df.iloc[:, 0] >= 30000,
                         df.iloc[:, 0],
                         (df.iloc[:, 0] / df.iloc[:, 1]).round())
print(df)
Price Length
0 30000.0 7
1 28.0 7
2 30000.0 7
3 28.0 7
4 33.0 6
EDIT:
For working with multiple columns use DataFrame.iloc and divide values by DataFrame.div with axis=0:
df.iloc[:, [0, 1]] = np.where(df.iloc[:, [0, 1]] >= 30000,
                              df.iloc[:, [0, 1]],
                              df.iloc[:, [0, 1]].div(df.iloc[:, 2], axis=0).round())
print (df)
Price Age Length
0 30000.0 1.0 7
1 28.0 2.0 7
2 30000.0 3.0 7
3 28.0 4.0 7
4 28.0 30000.0 7
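For completeness, the same loop-free update can also be written with DataFrame.mask, which replaces values wherever its condition holds, keeping everything else. A sketch on the same data:

```python
import pandas as pd

myData = {'Price': [30000, 199, 30000, 199, 199],
          'Age': [7, 14, 21, 28, 30000],
          'Length': [7, 7, 7, 7, 7]}
df = pd.DataFrame(myData)

cols = df.iloc[:, [0, 1]]
# mask swaps in the divided-and-rounded values where the entry is
# under 30000; entries at or above the threshold pass through unchanged
df[cols.columns] = cols.mask(cols < 30000,
                             cols.div(df.iloc[:, 2], axis=0).round())
```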
One way is to find all indexes where the column is less than 30000, using .loc and .iloc. With this filter, apply the division to the desired data:
mask = df.loc[df.iloc[:,0] < 30000].index
df.iloc[mask, 0] = (df.iloc[mask, 0] / df.iloc[mask, 1]).round()
# output
Price Length
0 30000.0 7
1 28.0 7
2 30000.0 7
3 28.0 7
4 33.0 6