Filling previous value by field - Pandas apply function filling None - pandas

I am trying to fill each row of a new column (Previous time) with the value from the previous row of the same subset (the same runner). The odd thing is that if I interrupt the kernel and check the values, they are correct, but if the code runs to the end, every row in the new column is filled with None. If no previous row exists, I want to fill it with the runner's first value.
Name      First round  Previous time
Runner 1  2            2
Runner 2  5            5
Runner 3  5            5
Runner 1  6            2
Runner 2  8            5
Runner 3  4            5
Runner 1  2            6
Runner 2  5            8
Runner 3  5            4
What I tried:
df.insert(column = "Previous time", value = 999)
def fce(arg):
    runner = arg[0]
    stat = arg[1]
    if stat == 999:
        # I used this to avoid filling all rows in a new column again for the same runner
        first = df.loc[df['Name'] == runner,"First round"].iloc[0]
        df.loc[df['Name'] == runner,"Previous time"] = df.loc[df['Name'] == runner]["First round"].shift(1, fill_value = first)
df["Previous time"] = df[['Name', "Previous time"]].apply(fce, axis=1)

Perform a groupby shift for each Name and fill the missing values with the original series.
df['Previous time'] = (df.groupby('Name')['First round']
                         .shift()
                         .fillna(df['First round'], downcast='infer'))

The problem is that your function fce returns None for every row, so the Series produced by df[['Name', "Previous time"]].apply(fce, axis=1) is a Series of None, and assigning it to df["Previous time"] overwrites the values that fce wrote through df.loc. That is why the values look correct when you interrupt the kernel.
That is, instead of overwriting the DataFrame with df.loc inside the function, you would need to return the value to fill for that row. Unfortunately, that is impractical with a row-wise apply, because each call sees only one row and would have to know which indices have already been computed.
A better way is to use groupby. It is more natural here, since you want to perform an action on each group. If you use apply after groupby and return a Series from it, you in fact define a value for each row. Just remember to remove the extra index level "Name" that groupby adds.
def fce(g):
    first = g["First round"].iloc[0]
    return g["First round"].shift(1, fill_value=first)
df["Previous time"] = df.groupby("Name").apply(fce).reset_index("Name", drop=True)
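For reference, the groupby-and-shift idea from the answers above can be checked against the sample table from the question. This is only a minimal sketch: the DataFrame is rebuilt by hand from that table, and the intermediate name previous is introduced here for illustration.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Name": ["Runner 1", "Runner 2", "Runner 3"] * 3,
    "First round": [2, 5, 5, 6, 8, 4, 2, 5, 5],
})

# Previous value within each runner's rows; NaN on a runner's first row.
previous = df.groupby("Name")["First round"].shift()

# Where no previous row exists, fall back to the row's own value.
df["Previous time"] = previous.fillna(df["First round"]).astype(int)
print(df)
```

The result should match the "Previous time" column of the expected table, without any row-wise apply.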

Thank you very much. Could you answer one more question, please? How does this work with a groupby on multiple columns, if I want to return the mean of all rounds for a specific runner and sleeping time before the race?
Expected output:
Name      First round  Sleep before race  Mean
Runner 1  2            8                  4
Runner 2  5            7                  6
Runner 3  5            8                  5
Runner 1  6            8                  4
Runner 2  8            7                  6
Runner 3  4            9                  4.5
Runner 1  2            9                  2
Runner 2  5            7                  6
Runner 3  5            9                  4.5
This does not work for me.
def last_season(g):
    aa = g["First round"].mean()
df["Mean"] = df.groupby(["Name", "Sleep before race"]).apply(g).reset_index(["Name", "Sleep before race"], drop=True)
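Two things seem to go wrong in the snippet above: last_season computes a mean but never returns it, and apply is given g, which is not defined outside the function. One possible way to get a per-row group mean is groupby(...).transform("mean"), which returns a result aligned with the original rows, so no reset_index is needed. The sketch below assumes the DataFrame looks like the expected-output table.

```python
import pandas as pd

# Sample data rebuilt from the expected-output table in the question.
df = pd.DataFrame({
    "Name": ["Runner 1", "Runner 2", "Runner 3"] * 3,
    "First round": [2, 5, 5, 6, 8, 4, 2, 5, 5],
    "Sleep before race": [8, 7, 8, 8, 7, 9, 9, 7, 9],
})

# transform broadcasts each group's mean back onto that group's rows.
df["Mean"] = (
    df.groupby(["Name", "Sleep before race"])["First round"]
      .transform("mean")
)
print(df)
```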

Related

pandas create Cross-Validation based on specific columns

I have a dataframe of a few hundred rows that can be grouped by ID as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that ensures that all the rows with the same ID always end up in the same set.
So either all the rows of a are in the test set, or all of them are in the train set, and likewise for all the other IDs.
I want to have 5 folds, so 80% of the IDs go to the train set and 20% to the test set.
I understand that this can't guarantee that all folds have exactly the same number of rows, since one ID might have more rows than another.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can put the result within GridSearchCV() for the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation in future first.
As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. when doing multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection and is a K-fold variant that takes group labels into account.
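A minimal sketch of the GroupKFold suggestion, under stated assumptions: the data is synthetic (10 made-up IDs with 6 rows each), the target y is random, and DecisionTreeClassifier with a small max_depth grid is only a placeholder estimator. The groups are passed to GridSearchCV.fit through the groups argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the question's data (hypothetical values).
rng = np.random.default_rng(0)
ids = np.repeat(list("abcdefghij"), 6)
df = pd.DataFrame({
    "Val1": rng.integers(0, 10, len(ids)),
    "Val2": rng.integers(0, 10, len(ids)),
    "Val3": rng.integers(0, 10, len(ids)),
    "Id": ids,
})
X = df[["Val1", "Val2", "Val3"]]
y = rng.integers(0, 2, len(ids))  # hypothetical binary target

# Five folds; every ID lands in exactly one test fold.
cv = GroupKFold(n_splits=5)
folds = list(cv.split(X, y, groups=df["Id"]))

# Placeholder estimator and grid; the groups are forwarded to the splitter.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3]},
    cv=cv,
)
search.fit(X, y, groups=df["Id"])
print(search.best_params_)
```

With IDs of unequal size the folds will not contain the same number of rows, as the question already anticipates.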

Fill Empty Panda Dataframe Using Loop Method

I am currently working with some telematics data in which the trip ID is missing. Each trip ID is unique. One trip ID covers multiple rows of data, e.g. GPS coordinates, temperature, voltage, RPM, timestamp and engine status (on or off). The data pattern suggests that the rows between an engine status of ON and the next OFF can be clustered under a unique trip ID. However, I have difficulty translating this logic into code that generates these trip IDs.
I tried a few pandas loop methods, but they keep failing.
import pandas as pd
inp = [{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''}]
test = pd.DataFrame(inp)
print (test)
Approach Taken
n=1
for index, row in test.iterrows():
    test['tripID']=np.where(test['Ignition_Status']=='ON',n,n)
    n=n+1
Expected Result
Use series.eq() to check for OFF and series.shift() with series.cumsum():
test=test.assign(tripID=test.Ignition_Status.eq('OFF')
.shift(fill_value=False).cumsum().add(1))
Ignition_Status tripID
0 ON 1
1 ON 1
2 ON 1
3 OFF 1
4 ON 2
5 ON 2
6 ON 2
7 ON 2
8 ON 2
9 OFF 2
10 ON 3
11 OFF 3
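The one-liner above can be unpacked into its intermediate steps, which may make the logic easier to follow. This is a sketch on the same sample data; the names is_off and starts_new_trip are introduced here for illustration only.

```python
import pandas as pd

test = pd.DataFrame({
    "Ignition_Status": ["ON", "ON", "ON", "OFF", "ON", "ON",
                        "ON", "ON", "ON", "OFF", "ON", "OFF"],
})

# True on the rows where the engine is switched off (end of a trip).
is_off = test["Ignition_Status"].eq("OFF")

# Shift by one row so the flag marks the first row AFTER each OFF,
# i.e. the row where the next trip begins.
starts_new_trip = is_off.shift(fill_value=False)

# A running count of trip starts gives 0, 0, ..., 1, 1, ...; add 1 so IDs start at 1.
test["tripID"] = starts_new_trip.cumsum() + 1
print(test)
```

Without the shift, the OFF row itself would already be counted as part of the next trip.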

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract the parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks is important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround as adapted to the above example
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 >10 using 'loc[]'
subset = df.loc[df['col2']>10]
block_start = 0
block_end = None
#loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i]-subset.index[i-1] > 1:
        # ... this is the current blocks end
        next_block_start = i
        # extract the according block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        #the next_block_start index is now the new block's starting index
        block_start = next_block_start
#close and add last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum generates a step function, which we set to zero (via the mask column) wherever a row does not belong to a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames
import numpy as np
grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
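The same idea can also be written with groupby, which collects the blocks in one pass. This is a sketch on the example frame from the question; block_id and blocks are names chosen here, and the result is a list of DataFrames rather than a dict.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [3, 7, 9, 11, 13, 16, 19, 23, 27, 32],
    "col2": [11, 15, 1, 2, 2, 16, 17, 13, 4, 3],
})

mask = df["col2"] > 10

# A block starts where the mask is True and the previous row's mask was False.
block_start = mask & ~mask.shift(fill_value=False)

# Running count of block starts: every row of the same block shares one number.
block_id = block_start.cumsum()

# Keep only the rows that satisfy the condition and split them by block number.
blocks = [block for _, block in df[mask].groupby(block_id[mask])]

for block in blocks:
    # The first and last index labels give the start and end of each block.
    print(block.index[0], block.index[-1])
    print(block)
```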

Python Pandas groupby and join

I am fairly new to python pandas and cannot find the answer to my problem in any older posts.
I have a simple dataframe that looks something like that:
dfA ={'stop':[1,2,3,4,5,1610,1611,1612,1613,1614,2915,...]
'seq':[B, B, D, A, C, C, A, B, A, C, A,...] }
Now I want to merge the 'seq' values from each group, where the difference between the next and previous value in 'stop' is equal to 1. When the difference is high like 5 and 1610, that is where the next cluster begins and so on.
What I need is to write all values from each cluster into separate rows:
0 BBDAC #join'stop' cluster 1-5
1 CABAC #join'stop' cluster 1610-1614
2 A.... #join'stop' cluster 2915 - ...
etc...
What I am getting with my current code is like:
True BDACABAC...
False BCA...
for the entire huge dataframe.
I understand the logic behind the way it merges them, which is by meeting the condition I specified (not perfect, losing cluster edges), but I am running out of ideas on how to get the values joined and split properly into clusters, rather than across all rows of the dataframe.
Please see my code below:
dfB = dfA.groupby((dfA.stop - dfA.stop.shift(1) == 1))['seq'].apply(lambda x: ''.join(x)).reset_index()
Please help.
P.S. I have also tried various combinations with diff() but that didn't help either. I am not sure if groupby is any good for this solution as well. Please advise!
dfC = dfA.groupby((dfA['stop'].diff(periods=1)))['seq'].apply(lambda x: ''.join(x)).reset_index()
This somehow split the dataframe into smaller, cluster-like chunks, but I do not understand the logic behind the way it did so, and I know the result makes no sense and is not what I intended to get.
I think you need to create a helper Series for grouping:
g = dfA['stop'].diff().ne(1).cumsum()
dfC = dfA.groupby(g)['seq'].apply(''.join).reset_index()
print (dfC)
stop seq
0 1 BBDAC
1 2 CABAC
2 3 A
Details:
First get differences by diff:
print (dfA['stop'].diff())
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
5 1605.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1301.0
Name: stop, dtype: float64
Compare by ne (!=) for first values of groups:
print (dfA['stop'].diff().ne(1))
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 True
Name: stop, dtype: bool
And last, create groups by cumsum:
print (dfA['stop'].diff().ne(1).cumsum())
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
10 3
Name: stop, dtype: int32
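Putting the three steps together on the first eleven rows of the question's data gives a small runnable check. This is only a sketch: the letters in seq are written as strings, the elided remainder of the data is left out, and the helper Series is renamed to "cluster" here purely for readability. The first and last stop of each cluster are included as well.

```python
import pandas as pd

# First eleven rows of the question's data; the rest is elided there.
dfA = pd.DataFrame({
    "stop": [1, 2, 3, 4, 5, 1610, 1611, 1612, 1613, 1614, 2915],
    "seq": list("BBDACCABACA"),
})

# A new cluster starts wherever the step from the previous stop is not 1.
g = dfA["stop"].diff().ne(1).cumsum().rename("cluster")

# Join the letters per cluster and record where each cluster begins and ends.
clusters = dfA.groupby(g).agg(
    first_stop=("stop", "first"),
    last_stop=("stop", "last"),
    seq=("seq", "".join),
)
print(clusters)
```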
I just figured it out.
I managed to round the values of 'stop' to a nearest 100 and assigned it as a new column.
Then my previous code is working....
Thank you so much for quick answer though.
dfA['new_val'] = (dfA['stop'] / 100).astype(int) *100

How to delete "1" followed by trailing zeros from Data Frame row values ?

From my "Id" column I want to remove the one and the zeros from the left.
That is
1000003 becomes 3
1000005 becomes 5
1000011 becomes 11 and so on
Ignore -1, 10 and 1000000; they will be handled as special cases. But from the remaining rows I want to remove the "1" followed by zeros.
Well you can use modulus to get the end of the numbers (they will be the remainder). So just exclude the rows with ids of [-1,10,1000000] and then compute the modulus of 1000000:
print(df)
Id
0 -1
1 10
2 1000000
3 1000003
4 1000005
5 1000007
6 1000009
7 1000011
keep = df.Id.isin([-1,10,1000000])
df.loc[~keep, 'Id'] = df.loc[~keep, 'Id'] % 1000000
print(df)
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
Edit: Here is a fully vectorized string slice version as an alternative (Like Alex' method but takes advantage of pandas' vectorized string methods):
keep = df.Id.isin([-1,10,1000000])
df.loc[~keep, 'Id'] = df.loc[~keep, 'Id'].astype(str).str[1:].astype(int)
print(df)
Id
0 -1
1 10
2 1000000
3 3
4 5
5 7
6 9
7 11
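As a further variant of the modulus approach, Series.where expresses "keep these rows, replace the rest" in a single expression, without assigning into a selection. This is a sketch on the sample Ids shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [-1, 10, 1000000, 1000003, 1000005, 1000007, 1000009, 1000011],
})

# Rows that are special cases and must stay untouched.
keep = df["Id"].isin([-1, 10, 1000000])

# where() keeps the original value where `keep` is True
# and takes the remainder everywhere else.
df["Id"] = df["Id"].where(keep, df["Id"] % 1000000)
print(df)
```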
Here is another way you could try to do it:
def f(x):
    """convert the value to a string, then select only the characters
    after the first one in the string, which is 1. For example,
    100005 would be 00005 and I believe it's returning 00005.0 from
    dataframe, which is why the float() is there. Then just convert
    it to an int, and you'll have 5, etc.
    """
    return int(float(str(x)[1:]))
# apply the function "f" to the dataframe and pass in the column 'Id'
df.apply(lambda row: f(row['Id']), axis=1)
I get that this question has been satisfactorily answered. But for future visitors: what I like about Alex's answer is that it does not depend on there being exactly four zeros. The accepted answer will fail if you sometimes have 10005, sometimes 1000005, and so on.
However, to add another way of thinking about it: if you know it's always going to be 10000, you can do
# back up all values
foo = df.id.copy()
# now, some will be negative or zero
df.id = df.id - 10000
# put back those that are negative or zero (here, first three rows)
df.loc[df.id <= 0, 'id'] = foo[df.id <= 0]
It gives you the same as Karl's answer, but I typically prefer this kind of method for its readability.