pandas create Cross-Validation based on specific columns - pandas

I have a dataframe of a few hundred rows that can be grouped by ids as follows:
df = Val1 Val2 Val3 Id
2 2 8 b
1 2 3 a
5 7 8 z
5 1 4 a
0 9 0 c
3 1 3 b
2 7 5 z
7 2 8 c
6 5 5 d
...
5 1 8 a
4 9 0 z
1 8 2 z
I want to use GridSearchCV, but with a custom CV that will ensure that all the rows from the same ID will always be in the same set.
So either all the rows of a are in the test set, or all of them are in the train set - and so on for all the different IDs.
I want to have 5 folds - so 80% of the ids will go to the train set and 20% to the test set.
I understand that it can't guarantee that all folds will have the exact same number of rows - since one ID might have more rows than another.
What is the best way to do so?

As stated, you can provide cv with an iterator. You can use GroupShuffleSplit() for this: once you use it to split your dataset, you can pass the result to GridSearchCV() via the cv parameter.

As mentioned in the sklearn documentation, there's a parameter called "cv" where you can provide "An iterable yielding (train, test) splits as arrays of indices."
Do check the documentation first in future.
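For instance, a minimal sketch of that approach, assuming the features come from the Val columns, a hypothetical label column y (not shown in the question), and an arbitrary estimator such as a random forest:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit

X = df[['Val1', 'Val2', 'Val3']]
y = df['y']                                # hypothetical label column
groups = df['Id']

# 5 splits, each sending ~20% of the Ids (not rows) to the test side
splitter = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cv = splitter.split(X, y, groups=groups)   # iterable of (train, test) index arrays

grid = GridSearchCV(RandomForestClassifier(),
                    param_grid={'n_estimators': [50, 100]},
                    cv=cv)
grid.fit(X, y)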

As mentioned previously, GroupShuffleSplit() splits data based on group labels. However, the test sets aren't necessarily disjoint (i.e. over multiple splits, an ID may appear in several test sets). If you want each ID to appear in exactly one test fold, you could use GroupKFold(). This is also available in sklearn.model_selection, and directly extends KFold to take group labels into account.
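A sketch of the GroupKFold variant, reusing the same assumed X, y and estimator as above; a group-aware splitter can also be passed directly as cv, with the groups handed to fit():
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

gkf = GroupKFold(n_splits=5)               # every Id lands in exactly one test fold
grid = GridSearchCV(RandomForestClassifier(),
                    param_grid={'n_estimators': [50, 100]},
                    cv=gkf)
grid.fit(X, y, groups=df['Id'])            # groups are forwarded to the splitter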

Related

Fill Empty Panda Dataframe Using Loop Method

I am currently working with some telematics data where the trip id is missing. Trip id is unique: one trip id covers multiple rows of data consisting of e.g. GPS coordinates, temperature, voltage, RPM, timestamp, and engine status (on or off). The data pattern indicates when the engine status switches on and off, and each such on/off cycle can be clustered into a unique trip id. However, I have difficulty translating this logic into code to generate these trip IDs.
I tried a few pandas loop methods but keep failing.
import pandas as pd
inp = [{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''}]
test = pd.DataFrame(inp)
print (test)
Approach Taken
import numpy as np

n = 1
for index, row in test.iterrows():
    test['tripID'] = np.where(test['Ignition_Status'] == 'ON', n, n)
    n = n + 1
Expected Result
Use series.eq() to check for OFF and series.shift() with series.cumsum():
test = test.assign(tripID=test.Ignition_Status.eq('OFF')
                              .shift(fill_value=False)
                              .cumsum()
                              .add(1))
Ignition_Status tripID
0 ON 1
1 ON 1
2 ON 1
3 OFF 1
4 ON 2
5 ON 2
6 ON 2
7 ON 2
8 ON 2
9 OFF 2
10 ON 3
11 OFF 3
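Broken out step by step, the same computation with the intermediate series named (a quick sketch over the test frame above):
off = test.Ignition_Status.eq('OFF')    # True on rows where the engine switches off
starts = off.shift(fill_value=False)    # shift down one row: a new trip begins after each OFF
test['tripID'] = starts.cumsum() + 1    # running count of completed trips, 1-based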

Get coherent subsets from pandas series

I'm rather new to pandas and recently ran into a problem. I have a pandas DataFrame that I need to process. I need to extract parts of the DataFrame where specific conditions are met. However, I want these parts to be coherent blocks, not one big set.
Example:
Consider the following pandas DataFrame
col1 col2
0 3 11
1 7 15
2 9 1
3 11 2
4 13 2
5 16 16
6 19 17
7 23 13
8 27 4
9 32 3
I want to extract the subframes where the values of col2 >= 10, resulting maybe in a list of DataFrames in the form of (in this case):
col1 col2
0 3 11
1 7 15
col1 col2
5 16 16
6 19 17
7 23 13
Ultimately, I need to do further analysis on the values in col1 within the resulting parts. However, the start and end of each of these blocks is important to me, so simply creating a subset using pandas.DataFrame.loc isn't going to work for me, I think.
What I have tried:
Right now I have a workaround that gets the subset using pandas.DataFrame.loc and then extracts the start and end index of each coherent block afterwards, by iterating through the subset and checking whether there is a jump in the indices. However, it feels rather clumsy, and I feel that I'm missing a basic pandas function here that would make my code more efficient and clean.
This is code representing my current workaround as adapted to the above example
# here the blocks will be collected for further computations
blocks = []
# get all the items where col2 > 10 using 'loc[]'
subset = df.loc[df['col2'] > 10]
block_start = 0
block_end = None
# loop through all items in subset
for i in range(1, len(subset)):
    # if the difference between the current index and the last is greater than 1 ...
    if subset.index[i] - subset.index[i-1] > 1:
        # ... this is the current block's end
        next_block_start = i
        # extract the corresponding block and add it to the list of all blocks
        block = subset[block_start:next_block_start]
        blocks.append(block)
        # the next_block_start index is now the new block's starting index
        block_start = next_block_start
# close and add the last block
blocks.append(subset[block_start:])
Edit: I was by mistake previously referring to 'pandas.DataFrame.where' instead of 'pandas.DataFrame.loc'. I seem to be a bit confused by my recent research.
You can split your problem into parts. First, check the condition:
df['mask'] = (df['col2']>10)
We use this to see where a new subset starts:
df['new'] = df['mask'].gt(df['mask'].shift(fill_value=False))
Now you can combine this information into a group number. The cumsum generates a step function, which we set to zero (via the mask column) wherever the row is not in a group we are interested in.
df['grp'] = (df.new + 0).cumsum() * df['mask']
EDIT
You don't have to do the group calculation in your df:
s = (df['col2']>10)
s = (s.gt(s.shift(fill_value=False)) + 0).cumsum() * s
After that you can split this into a dict of separate DataFrames
import numpy as np   # np is used below for np.unique

grp = {}
for i in np.unique(s)[1:]:
    grp[i] = df.loc[s == i, ['col1', 'col2']]
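For the ten-row example frame above, this leaves two entries in grp. A quick sketch of what you would see:
for key, block in grp.items():
    print(key)
    print(block)

# 1
#    col1  col2
# 0     3    11
# 1     7    15
# 2
#    col1  col2
# 5    16    16
# 6    19    17
# 7    23    13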

Splitting data frame in to test and train data sets

Use pandas to create two data frames: train_df and test_df, where
train_df has 80% of the data chosen uniformly at random without
replacement.
Here, what does "data chosen uniformly at random without replacement" mean?
Also, how can I do it?
Thanks
"chosen uniformly at random" means that each row has an equal probability of being selected into the 80%.
"without replacement" means that each row is only considered once: once it is assigned to the training or test set, it is not considered again.
For example, consider the data below:
A B
0 5
1 6
2 7
3 8
4 9
If this dataset is being split into an 80% training set and 20% test set, then we will end up with a training set of 4 rows (80% of the data) and a test set of 1 row (20% of the data)
Without Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
When the next row is assigned to training or test, it will be selected from the remaining rows:
A B
1 6
2 7
3 8
4 9
With Replacement
Assume the first row is assigned to the training set. Now the training set is:
A B
0 5
But the next row will be assigned using the entire dataset (i.e. the first row has been placed back in the original dataset):
A B
0 5
1 6
2 7
3 8
4 9
How you can do this:
You can use the train_test_split function from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
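A minimal sketch of the scikit-learn route, splitting the whole frame in one call (assuming df already holds all your rows):
from sklearn.model_selection import train_test_split

# 80% train / 20% test, sampled uniformly at random without replacement
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)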
Or you could do this using pandas and NumPy:
import numpy as np

# assign each row a uniform random number in [0, 1)
df['random_number'] = np.random.rand(len(df))
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]
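Note that the random-number approach gives a roughly 80/20 split rather than an exact one; pandas' own sampling gives an exact 80% of the rows without replacement (a sketch):
train = df.sample(frac=0.8, random_state=42)  # 80% of the rows, no replacement
test = df.drop(train.index)                   # the remaining 20%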

SQL: How to sort overlapping groups efficiently

I'm trying to make groups on a database with 10,000+ rows.
I need to be fast and efficient, so I'm using a binary variable for each cluster.
One, Two, Four, Five and Six are in Group1.
But 'Two' might also be in Group 2, because of errors I cannot overcome since my dataset comes from a web scrape. I try to sort everything in a unique way, but it's basically impossible not to make errors if I want to be efficient and fast.
ID Title Group1 Group2 Group3 Ungrouped
1 One 1 0 0 0
2 Two 1 1 0 0
3 Three 0 1 1 0
4 Four 1 0 1 0
5 Five 1 0 0 0
6 Six 1 1 1 0
7 Seven 0 0 0 1
My idea for a solution:
Assign groups (ones) until everything is grouped one or more times.
Make a query for everything that has more than one group assigned (2, 3, 4, 6).
Manually decide which 1's to remove, until each row has only one group assigned.
(It's actually a good idea to do the 3rd part manually, because it requires content analysis of the documents.)
My question:
How do I specify, that I need to see everything with more than one group? Does it have something to do with constraints and unique values, or is there a more simple and obvious way that I'm not seeing?
If your group flags are stored as integers, you can just do:
select c.*
from clusters c
where (group1 + group2 + group3) > 1;
I don't know what a "binary variable" is in SQLite. Some databases do support binary flags, and you would need to convert the values to integers for the where clause.

SAS INPUT COLUMN

I have a problem in SAS: I would like to know how I can input several columns into only one column (i.e. put everything in a single variable)?
For example, I have 3 columns, but I would like to put these 3 columns into only one column,
like this:
1 2 3
1 3 1
3 4 4
output:
1
1
3
2
3
4
3
1
4
I'm assuming you're reading from a file, so use the double trailing @@ to keep reading values past the end of the line:
data want;
input a @@;
cards;
1 2 3
1 3 1
3 4 4
;
run;
If the dataset is not big, just split it into several small data sets with one variable each, then rename all the variables to one name and concatenate vertically using a simple SET statement. I am sure there are more elegant solutions than this one; if your data set is big, let me know and I will write the actual code needed to perform this action with optimal coding.