Fill Empty Panda Dataframe Using Loop Method - pandas

I am currently working with some telematics data where the trip id is missing. Trip id is unique. 1 trip id contains multiple of rows of data consisting i.e gps coordinate, temp, voltage, rpm, timestamp, engine status (on or off). The data pattern indicate time of engine status on and off, can be cluster as a unique trip id. Though, I have difficulty to translate the above logic in order to generate these tripId.
Tried to use few pandas loop methods but keep failing.
import pandas as pd
inp = [{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''}]
test = pd.DataFrame(inp)
print (test)
Approach Taken
n=1
for index, row in test.iterrows():
test['tripID']=np.where(test['Ignition_Status']=='ON',n,n)
n=n+1
Expected Result

Use series.eq() to check for OFF and series.shift() with series.cumsum():
test=test.assign(tripID=test.Ignition_Status.eq('OFF')
.shift(fill_value=False).cumsum().add(1))
Ignition_Status tripID
0 ON 1
1 ON 1
2 ON 1
3 OFF 1
4 ON 2
5 ON 2
6 ON 2
7 ON 2
8 ON 2
9 OFF 2
10 ON 3
11 OFF 3

Related

counting unique values in column using sub-id

I have a df containing sub-trajectories (segments) of users, with mode of travel indicated by 0,1,2... which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id is for the segments but a full trajectory maybe covered by different modes (i.e. contains multiple segments).
So the first 4-digits of id is the unique trajectories, and the last 3-digits, unique segment with that trajectory.
I know that I can count the number of unique segments in the dfusing:
df.groupby('id').['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use indexing for get first 4 values with str, if necessary first convert values to strings by Series.astype:
df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
If need processing values after first 4 ids:
s = df['id'].astype(str)
df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
Another idea is use lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()

Filling previous value by field - Pandas apply function filling None

I am trying to fill each row in a new column (Previous time) with a value from previous row of the specific subset (when condition is met). The thing is, that if I interrupt kernel and check values, it is ok. But if it runs to the end, then all rows in new column are filled with None. If previous row doesnt exist, than I will fill it with first value.
Name First round Previous time
Runner 1 2 2
Runner 2 5 5
Runner 3 5 5
Runner 1 6 2
Runner 2 8 5
Runner 3 4 5
Runner 1 2 6
Runner 2 5 8
Runner 3 5 4
What I tried:
df.insert(column = "Previous time", value = 999)
def fce(arg):
runner= arg[0]
stat = arg[1]
if stat == 999:
# I used this to avoid filling all rows in a new column again for the same runner
first = df.loc[df['Name'] == runner,"First round"].iloc[0]
df.loc[df['Name'] == runner,"Previous time"] = df.loc[df['Name'] == runner]["First round"].shift(1, fill_value = first)
df["Previous time"] = df[['Name', "Previous time"]].apply(fce, axis=1)
Condut gruopby shift for each Name and fill the missing values with the original series.
df['Previous time'] = (df.groupby('Name')['First round']
.shift()
.fillna(df['First round'], downcast='infer'))
The problem is that your function fce returns None for every row, so the Series produced by the term df[['Name', "Previous time"]].apply(fce, axis=1) is a Series of None.
That is, instead of overriding the Dataframe with df.loc inside the function, you need to return the value to fill for this position. Unfortunately, this is impossible since then you need to know which indices you already calculated.
A better way to do it would be to use groupby. This is a more natural way, since you want to perform an action on each group. If you use apply after groupby and you to return a series, you, in fact, define a value for each row. Just remember to remove the extra index "Name" that groupby adds.
def fce(g):
first = g["First round"].iloc[0]
return g["First round"].shift(1, fill_value=first)
df["Previous time"] == df.groupby("Name").apply(fce).reset_index("Name", drop=True)
Thank you very much. Please can you answer me one more question? How does it work with group by on multiple columns if I want to return mean of all rounds based on specific runner a sleeping time before race.
Expected output:
Name First round Sleep before race Mean
Runner 1 2 8 4
Runner 2 5 7 6
Runner 3 5 8 5
Runner 1 6 8 4
Runner 2 8 7 6
Runner 3 4 9 4,5
Runner 1 2 9 2
Runner 2 5 7 6
Runner 3 5 9 4,5
This does not work for me.
def last_season(g):
aa = g["First round"].mean()
df["Mean"] = df.groupby(["Name", "Sleep before race"]).apply(g).reset_index(["Name", "Sleep before race"], drop=True)

pandas create Cross-Validation based on specific columns

I have a dataframe of few hundreds rows , that can be grouped to ids as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV , but with a custom CV that will assure that all the rows from the same ID will always be on the same set.
So either all the rows if a are in the test set , or all of them are in the train set - and so for all the different IDs.
I want to have 5 folds - so 80% of the ids will go to the train and 20% to the test.
I understand that it can't guarentee that all folds will have the exact same amount of rows - since one ID might have more rows than the other.
What is the best way to do so?
As stated, you can provide cv with an iterator. You can use GroupShuffleSplit(). For example, once you use it to split your dataset, you can put the result within GridSearchCV() for the cv parameter.
As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check out the documentation in future first.
As mentioned previously, GroupShuffleSplit() splits data based on group lables. However, the test sets aren't necessarily disjoint (i.e. doing multiple splits, an ID may appear in multiple test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in Sklearn.model_selection, and directly extends KFold to take into account group lables.

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs in to an array and use a for loop to pull the information out of the original dataframe into a new dataframe but this has not worked. I was looking at the melt documentation which would suggest that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work but I cannot see how to get the corresponding row in to positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
.unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))
df_out
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508

Splitting data frame in to test and train data sets

Use pandas to create two data frames: train_df and test_df, where
train_df has 80% of the data chosen uniformly at random without
replacement.
Here, what does "data chosen uniformly at random without replacement" mean?
Also, How can i do it?
Thanks
"chosen uniformly at random" means that each row has an equal probability of being selected into the 80%
"without replacement" means that each row is only considered once. Once it is assigned to a training or test set it is not
For example, consider the data below:
A B
0 5
1 6
2 7
3 8
4 9
If this dataset is being split into an 80% training set and 20% test set, then we will end up with a training set of 4 rows (80% of the data) and a test set of 1 row (20% of the data)
Without Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
When the next row is assigned to training or test, it will be selected from the remaining rows:
A B
1 6
2 7
3 8
4 9
With Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
But the next row will be assigned using the entire dataset (i.e. The first row has been placed back in the original dataset)
A B
0 5
1 6
2 7
3 8
4 9
How can you can do this:
You can use the train_test_split function from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Or you could do this using pandas and Numpy:
df['random_number'] = np.random.randn(length_of_df)
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]