LGBMClassifier + Unbalanced data + GridSearchCV() - pandas

The dependent variable is binary, the class imbalance is roughly 1:10, the dataset has 70k rows, and the scoring metric is the area under the ROC curve (ROC AUC). I'm trying to use LGBM + GridSearchCV to get a model. However, I'm struggling with the parameters, as sometimes they aren't recognized even when I use them exactly as the documentation shows:
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

params = {'num_leaves': [10, 12, 14, 16],
          'max_depth': [4, 5, 6, 8, 10],
          'n_estimators': [50, 60, 70, 80],
          'is_unbalance': [True]}
best_classifier = GridSearchCV(LGBMClassifier(), params, cv=3, scoring="roc_auc")
best_classifier.fit(X_train, y_train)
So:
What is the difference between putting the parameters in GridSearchCV() and putting them in params?
Since the data is unbalanced, I'm trying to use ROC AUC as the scoring metric, because it accounts for the imbalance. Should I pass it as the scoring="roc_auc" argument, or put it in params?

The difference between putting the parameters in GridSearchCV() and in params is explained in the GridSearchCV docs. The params dictionary you build corresponds to the param_grid argument:
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
In other words, params holds the estimator's own hyperparameters, each with a list of values to try, and GridSearchCV fits one model per combination. Arguments passed directly to GridSearchCV (such as cv and scoring) configure the search itself, not the model. So keep scoring="roc_auc" where you already have it, as a GridSearchCV argument; it is not an LGBMClassifier hyperparameter, so putting it in params would not set the evaluation metric.
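To make the distinction concrete, here is a minimal sketch (reusing the X_train / y_train from your snippet): fixed settings go to the LGBMClassifier constructor, values you want to search over go in params, and scoring stays an argument of GridSearchCV.
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Fixed setting: applied to every model the grid search fits.
estimator = LGBMClassifier(is_unbalance=True)

# Searched settings: one model is fit (per CV fold) for every combination.
param_grid = {
    'num_leaves': [10, 12, 14, 16],
    'max_depth': [4, 5, 6, 8, 10],
    'n_estimators': [50, 60, 70, 80],
}

# scoring configures the search itself, so it is passed to GridSearchCV.
best_classifier = GridSearchCV(estimator, param_grid, cv=3, scoring='roc_auc')
best_classifier.fit(X_train, y_train)  # X_train / y_train as in your snippet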

Related

How to remove objects from a list using a list of indexes on python?

I have a DataFrame from which I wanted to randomly select 20% of the data to use as test data. However, I need to remove that data from my original set to use the rest as training data.
I have a list of the indexes the random sample is made up from (indexes of the original DataFrame). When I use a for loop and .pop(), the indexes shift, so the elements removed after the first iteration are not the ones that are in my test DataFrame. I need help removing the data from the first DataFrame, but no function I've found will take a list of indexes as an argument. What can I do about this? Is there a way to subtract one DataFrame from another?
Regarding your question,
Is there a way to subtract one DataFrame from another?
You can simply drop the indexes belonging to Test from the primary DataFrame to get your Train. Try this -
train = df.drop(test.index, axis=0)
#Where df is the main dataset from which test data has been sampled.
#train, test, df are all pd.DataFrames
However, if you are preparing data for a machine learning problem, I would recommend some better methods, as discussed in the next part of my answer.
1. Using Sklearn API (Recommended)
You could try using the sklearn.model_selection.train_test_split API to save yourself a lot of time doing such train/test splits.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(np.random.random((100, 10)))
train, test = train_test_split(df, test_size=0.2)
train.shape, test.shape
((80, 10), (20, 10))
2. Using pandas methods
Another way is to sample 20% data from df and then filter the rest for train.
test = df.sample(frac=0.2)
train = df.loc[~df.index.isin(test.index)]
train.shape, test.shape
((80, 10), (20, 10))
3. Starting with a list of indexes
Let's say you already have a list of indexes (test_idx), as you mention in your question. In that case, you can still work with pandas methods to do this without any for loops or pop()
test_idx = np.random.choice(range(100), 20, replace=False) #approx 20% random indexes
test = df.loc[df.index.isin(test_idx)]
train = df.loc[~df.index.isin(test_idx)]
train.shape, test.shape
((80, 10), (20, 10))
There are a few solutions to this problem. You could...
Iterate in reverse (see the sketch at the end of this answer)
Create another array to store the values
Use a list comprehension
An example of the third method is as follows.
Let's say that you want to remove all 2's from an array:
data = [1, 2, 3, 2, 2, 1]
new_data = [n for n in data if n != 2]
# new_data = [1, 3, 1]
In my experience, this is the method I usually reach for when cleaning or reconstructing lists.
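For completeness, a small sketch of the first approach (iterating over the indexes in reverse), so that pop() never shifts an element you still have to inspect:
data = [1, 2, 3, 2, 2, 1]

# Walk the indexes from last to first; popping an element only shifts
# positions after it, which have already been visited.
for i in range(len(data) - 1, -1, -1):
    if data[i] == 2:
        data.pop(i)

print(data)  # [1, 3, 1]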

Confused by output of keras.text.preprocessing.one_hot

I have some text data that I'd like to convert to one hot vectors:
from keras.preprocessing import text
s = 'wow this is such a thing'
vocab = set(s.split())
text.one_hot(s, round(len(vocab)*1.3))
This returns [2, 6, 6, 7, 6, 7] but my string does not contain any repeated words. Does anyone know what's going on here?
The source code of the function explains what is going on. It clearly states:
This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed.
Because hashing can map two different words to the same index, collisions like the ones in your example can occur. You can try increasing the vocabulary size argument if you want more uniqueness.
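As a rough illustration with the same string, enlarging the hashing space makes collisions less likely, though uniqueness is still not guaranteed:
from keras.preprocessing import text

s = 'wow this is such a thing'
vocab = set(s.split())

# round(len(vocab) * 1.3) = 8 leaves little room, so words may still collide;
# a larger n reduces the chance that two different words hash to the same index.
print(text.one_hot(s, round(len(vocab) * 1.3)))
print(text.one_hot(s, len(vocab) * 5))

# hashing_trick with 'md5' is another option: it is stable across runs,
# but collisions are still possible.
print(text.hashing_trick(s, len(vocab) * 5, hash_function='md5'))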

keras.preprocessing.text.Tokenizer equivalent in Pytorch?

Basically the title; is there any equivalent to keras.preprocessing.text.Tokenizer in PyTorch? I have yet to find one that gives all the utilities without handcrafting things.
I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()
BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
It comes with other types of encoders, such as spaCy's tokenizer, subword encoder, etc.
PyTorch itself does not provide a function like this; you need to do it manually, which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices, as sketched below.
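A rough sketch of that manual approach, assuming plain whitespace tokenization and a hand-rolled word-to-index dictionary (the reserved <pad>/<unk> indices are just a common convention, not something PyTorch mandates):
import torch

corpus = ["now this ain't funny", "so don't you dare laugh"]

# Build the vocabulary: index 0 reserved for padding, 1 for unknown words.
vocab = {'<pad>': 0, '<unk>': 1}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

def encode(sentence):
    # Tokenize and look up each token, falling back to <unk>.
    return torch.tensor([vocab.get(tok, vocab['<unk>']) for tok in sentence.split()])

print([encode(s) for s in corpus])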
Alternatively, you can use Torchtext, which provides basic abstractions for text processing. All you need to do is create a Field object. You can use string.split, SpaCy, or a custom function for tokenization. You can provide a vocabulary or create it directly from the data. Then you just call the process method, which tokenizes the text and does the vocabulary lookup.
If you want something more complex, you might also consider AllenNLP, where the tokenization and the vocabulary lookup are done as separate steps.

Xlsxwriter: Grouping using set_column after setting width with set_column

I'm using xlsxwriter to write data and afterwards autofit the columns to the maximum string length of every column.
For that, I'm using something like the following for every single column (as each column has a different maximum string length):
Sheet1.set_column(0, 0, 15)
At the end of my script I want to group a few columns together, so I'm using something like this from the docs:
Sheet1.set_column(0, 10, None, None, {'level': 1})
The grouping shows up, but not for the desired columns. Am I doing something wrong? Interestingly, the formatting (i.e. the column width) of one of the grouped columns went away; it somehow seems to get overwritten. I also tried something like set_column('A:D', None, None, {'level': 1}), but that doesn't work either.
When grouping an empty sheet, i.e. without writing any data and hence without applying any styles, it works. Isn't it possible to use consecutive set_column() calls on the same columns?
Thanks a lot in advance
Isn't it possible to use consecutive set_column() calls on the same columns?
No. Any call to set_column() will overwrite previous calls in the same range.
So you will need to group together all the options that you want to set, such as width, format or grouping, and apply them in one go.
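For example, a column width, a cell format and an outline level for the same range all go into one call, instead of three separate set_column() calls that would overwrite each other (the file name, width and format below are just illustrative):
import xlsxwriter

workbook = xlsxwriter.Workbook('grouped.xlsx')
worksheet = workbook.add_worksheet('Sheet1')

bold = workbook.add_format({'bold': True})

# Width (15), format (bold) and grouping level for columns A:D in a single call.
worksheet.set_column(0, 3, 15, bold, {'level': 1})

workbook.close()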
Also, you will need to set overlapping ranges separately. Like this:
# Not like this!
worksheet.set_column(0, 9, 20, None, {'level': 1})
worksheet.set_column(4, 5, 30, None, {'level': 1})
# Use separate non-overlapping ranges.
worksheet.set_column(0, 3, 20, None, {'level': 1})
worksheet.set_column(4, 5, 30, None, {'level': 1})
worksheet.set_column(6, 9, 20, None, {'level': 1})

If I pass a ndarray view to a function I can find its base but how can I find the slice?

NumPy slicing, e.g. S = np.s_[1:-1]; V = A[1:-1], produces a view of the underlying array. I can find this underlying array via V.base. If I pass such a view to a function, e.g.
def f(x):
    return x.base
then f(V) == A. But how can I find the slice information S? I am looking for an attribute, something like base, containing information on the slice that created this view. I would like to be able to write a function to which I can pass a view of an array and which returns another view of the same array, computed from the first. E.g. I would like to be able to shift the view of a one-dimensional array to the right or left.
As far as I know the slicing information is not stored anywhere, but you might be able to deduce it from attributes of the view and base.
For example:
In [156]: x=np.arange(10)
In [157]: y=x[3:]
In [159]: y.base
Out[159]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [160]: y.data
Out[160]: <memory at 0xb1a16b8c>
In [161]: y.base.data
Out[161]: <memory at 0xb1a16bf4>
I like the __array_interface__ value better:
In [162]: y.__array_interface__['data']
Out[162]: (163056924, False)
In [163]: y.base.__array_interface__['data']
Out[163]: (163056912, False)
So y's data buffer starts 12 bytes beyond x's. And since y.itemsize is 4, this means that the slice start is 3.
In [164]: y.shape
Out[164]: (7,)
In [165]: x.shape
Out[165]: (10,)
And comparing the shapes, I deduce that the slice stop is None (the end).
For 2d arrays, or stepped slicing you'd have to look at the strides as well.
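Under those caveats, here is a sketch of how the deduction could look for the simple case of a one-dimensional, step-1 view (infer_1d_slice is a hypothetical helper, not a NumPy function):
import numpy as np

def infer_1d_slice(view):
    # Reconstruct the slice of a 1-D, unit-step view from its base array.
    base = view.base
    byte_offset = (view.__array_interface__['data'][0]
                   - base.__array_interface__['data'][0])
    start = byte_offset // view.itemsize   # assumes step == 1
    return slice(start, start + view.shape[0])

x = np.arange(10)
y = x[3:]
print(infer_1d_slice(y))   # slice(3, 10, None)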
But in practice it is probably easier, and safer, to pass the slicing object (tuple, slice, etc.) to your function, rather than deduce it from the results.
In [173]: S=np.s_[1:-1]
In [174]: S
Out[174]: slice(1, -1, None)
In [175]: x[S]
Out[175]: array([1, 2, 3, 4, 5, 6, 7, 8])
That is, pass S itself rather than deducing it. I've never seen the deduction approach used in practice.