getting important features in dataframe - pandas

I would like to ask how to get the most important features into a dataframe:
from xgboost import XGBClassifier
# fit model to training data
xgb_model = XGBClassifier(random_state=0)
xgb_model.fit(X_train, y_train)
print("Feature Importances : ", xgb_model.feature_importances_)
I know how to plot it, but I want to know how to put the 20 most important features in a dataframe or a list.
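One possible approach, as a minimal sketch: pair each column name with its importance in a pandas DataFrame, sort by importance, and keep the top rows. The helper name top_features and the toy values are hypothetical. With the fitted model from the question you would pass X_train.columns and xgb_model.feature_importances_ instead, assuming X_train is a DataFrame.

```python
import numpy as np
import pandas as pd


def top_features(feature_names, importances, n=20):
    """Return the n most important features as a DataFrame, best first."""
    table = pd.DataFrame({
        "feature": list(feature_names),
        "importance": np.asarray(importances, dtype=float),
    })
    # nlargest sorts by importance in descending order and keeps n rows
    return table.nlargest(n, "importance").reset_index(drop=True)


# Toy stand-ins for X_train.columns and xgb_model.feature_importances_
names = ["f0", "f1", "f2", "f3"]
scores = [0.10, 0.55, 0.05, 0.30]

top2 = top_features(names, scores, n=2)
top2_list = top2["feature"].tolist()  # plain list, if a DataFrame is not needed
```

Calling top_features(X_train.columns, xgb_model.feature_importances_) with the default n=20 would then give the 20 most important features.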

Related

What is the sequence for preprocessing text df with tensorflow?

I have a pandas data frame, containing two columns: sentences and annotations:
Col 0   Sentence                    Annotation
1       [This, is, sentence]        [l1, l2, l3]
2       [This, is, sentence, too]   [l1, l2, l3, l4]
There are several things I need to do:
split to features and labels
split into train-val-test data
vectorize train data using:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=maxlen,
    standardize='lower',
    split='whitespace',
    ngrams=(1,3),
    output_mode='tf-idf',
    pad_to_max_tokens=True,)
I haven't worked with tensors before, so I am a little confused about how to order the steps above and how to access the information in the tensors. Specifically, at what point do I have to split into features and labels, and how do I access one or the other? And should I split into features and labels before splitting into train-val-test, or is it the other way around? (I want to do it properly and not use sklearn's train_test_split when I work with TensorFlow.)
You can split your dataset before creating the model. After splitting, tokenize your sentences using
tensorflow.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token=oov_tok)
After tokenizing, pad the sequences using
training_padded = pad_sequences(training_sequences, maxlen=max_length, truncating=trunc_type)
Then you can train your model on the data. For more details, please refer to this working code example.
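On the ordering question, a common sequence is: separate features and labels first, then split both with the same shuffled indices, and only then fit the vectorizer on the training sentences, so that no validation or test vocabulary leaks into it. Below is a minimal sketch of the first two steps using NumPy only, as one alternative to train_test_split. The function name, the 15%/15% validation and test fractions, and the toy data are assumptions for illustration.

```python
import numpy as np


def train_val_test_split(features, labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut features and labels at the same positions."""
    n = len(features)
    order = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]

    def pick(idx):
        # Select the same rows from features and labels so they stay aligned
        return [features[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)


# Toy data shaped like the question: token lists and matching annotation lists
sentences = [["This", "is", "sentence", str(i)] for i in range(20)]
annotations = [["l1", "l2", "l3", "l4"] for _ in range(20)]

(train_x, train_y), (val_x, val_y), (test_x, test_y) = train_val_test_split(
    sentences, annotations
)
```

After this, the TextVectorization layer would be fitted with vectorize_layer.adapt(...) on the training sentences only. With split='whitespace' the layer works on strings, so token lists like the ones above would probably need to be joined with spaces first.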

How can I find the optimal number of topics in LDA with scikit-learn?

I'm computing topic models through scikit-learn with this script (I'm starting with a dataset "df" which has one document per row in the column "Text"):
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
# Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)), token_pattern=r'\w+|\$[\d\.]+|\S+')
# apply transformation
tf = vectorizer.fit_transform(df.Text).toarray()
# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
I'm interested in comparing models with different numbers of topics (roughly from 2 to 20) through a coherence measure. How can I do it?
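scikit-learn's LatentDirichletAllocation does not provide a coherence score itself, so one option is to compute one from the fitted topics and the document-term matrix. The sketch below implements the UMass coherence measure in plain NumPy. The function name and the toy matrices are hypothetical, and the code assumes every vocabulary word occurs in at least one document (which holds for a CountVectorizer vocabulary).

```python
import numpy as np


def umass_coherence(topic_word, doc_term, top_n=10):
    """Mean UMass coherence over topics (higher means more coherent).

    topic_word: (n_topics, n_words) weights, e.g. model.components_
    doc_term:   (n_docs, n_words) counts, e.g. the matrix passed to fit()
    """
    present = (np.asarray(doc_term) > 0).astype(float)
    scores = []
    for weights in np.asarray(topic_word):
        top = np.argsort(weights)[::-1][:top_n]   # highest-weight words first
        sub = present[:, top]
        co = sub.T @ sub        # co[i, j]: documents containing both words
        doc_freq = np.diag(co)  # documents containing each top word
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                score += np.log((co[m, l] + 1.0) / doc_freq[l])
        scores.append(score)
    return float(np.mean(scores))


# Toy check: words 0 and 1 always co-occur, words 0 and 2 never do
doc_term = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 0, 1]])
coherent_score = umass_coherence(np.array([[0.5, 0.4, 0.05, 0.05]]), doc_term, top_n=2)
mixed_score = umass_coherence(np.array([[0.5, 0.05, 0.4, 0.05]]), doc_term, top_n=2)
```

To compare models, you could loop over the candidate topic counts (for example range(2, 21)), fit a LatentDirichletAllocation for each, call umass_coherence(model.components_, tf), and keep the count with the highest score. gensim's CoherenceModel is another route if adding that dependency is acceptable.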

How to import a CSV file, split it 70/30 and then use first column as my 'y' value?

I am having an issue at the moment, and I think I'm making it far more complicated than it needs to be. My CSV file is 31 rows by 500. I need to import this, split it in a 70/30 ratio, and then use the first column as my 'y' value for a neural network, with the remaining 30 columns as my 'x' value.
I've implemented the code below to do this, but when I run it through my basic sigmoid and testing functions, it produces results in a weird format, e.g. [6.54694655e-06].
I believe this is due to my splitting/importing of the data, which I think I have done wrong. I need to import the data into arrays that are readable by my functions, and be able to separate my first column specifically as the 'y' value. How do I go about this?
df = pd.read_csv(r'data.csv', header=None)
df.to_numpy()
#splitting data 70/30
trainingdata= df[:329]
testingdata= df[:141]
#converting data to separate arrays for training and testing
training_features= trainingdata.loc[:, trainingdata.columns != 0].values.reshape(329,30)
training_labels = trainingdata[0]
training_labels = training_labels.values.reshape(329,1)
testing_features = testingdata[0]
testing_labels = testingdata.loc[:, testingdata.columns != 0]
To split a dataframe into train and test data, I usually use sklearn.model_selection.train_test_split. Documentation here.
Some other methods are described here. Hope this helps!
Make your train/test split easy by using sklearn.model_selection.train_test_split.
If you don't have sklearn installed, first install it by running pip install -U scikit-learn.
Then:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv(r'data.csv', header=None)
# X is your features, y is your target column
X = df.loc[:,1:]
y = df.loc[:,0]
# Use train_test_split function with test size of 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
df = pd.read_csv(r'data.csv')
data = df.to_numpy()  # to_numpy() returns a new array; it does not change df in place
print(data)
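For reference, the manual slicing in the question has two slips. First, testingdata = df[:141] takes the first 141 rows again, so it overlaps the training rows. Second, testing_features and testing_labels are assigned the wrong way round. Below is a hedged sketch of a corrected positional split, using a synthetic frame in place of data.csv (the 470 rows are an assumption taken from 329 + 141). Note also that a value such as [6.54694655e-06] is ordinary NumPy scientific notation for a number close to zero, not a sign of corrupted data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pd.read_csv('data.csv', header=None): 470 rows, 31 columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((470, 31)))

n_train = round(len(df) * 0.7)         # 329 rows for training
train = df.iloc[:n_train]
test = df.iloc[n_train:]               # the remaining 141 rows, no overlap

# Column 0 is the target; columns 1..30 are the features
training_features = train.iloc[:, 1:].to_numpy()
training_labels = train.iloc[:, [0]].to_numpy()    # shape (329, 1)
testing_features = test.iloc[:, 1:].to_numpy()
testing_labels = test.iloc[:, [0]].to_numpy()      # shape (141, 1)
```

This keeps the row order of the file. If the rows are sorted in some meaningful way, a shuffled split such as train_test_split is usually the safer choice.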

Tensorflow/Keras, How to convert tf.feature_column into input tensors?

I have the following code to average embeddings for a list of item ids.
(The embedding is trained on review_meta_id_input, and used as a lookup for priors_input and for getting the average embedding.)
review_meta_id_input = tf.keras.layers.Input(shape=(1,), dtype='int32', name='review_meta_id')
priors_input = tf.keras.layers.Input(shape=(None,), dtype='int32', name='priors') # array of ids
item_embedding_layer = tf.keras.layers.Embedding(
    input_dim=100, # max number
    output_dim=self.item_embedding_size,
    name='item')
review_meta_id_embedding = item_embedding_layer(review_meta_id_input)
selected = tf.nn.embedding_lookup(review_meta_id_embedding, priors_input)
non_zero_count = tf.cast(tf.math.count_nonzero(priors_input, axis=1), tf.float32)
embedding_sum = tf.reduce_sum(selected, axis=1)
item_average = tf.math.divide(embedding_sum, non_zero_count)
I also have some feature columns, such as:
(I just thought feature_column looked cool, but there is not much documentation to look at.)
kid_youngest_month = feature_column.numeric_column("kid_youngest_month")
kid_age_youngest_buckets = feature_column.bucketized_column(kid_youngest_month, boundaries=[12, 24, 36, 72, 96])
I'd like to define [review_meta_id_input, priors_input, (tensors from feature_columns)] as the input to a keras Model.
Something like:
inputs = [review_meta_id_input, priors_input] + feature_layer
model = tf.keras.models.Model(inputs=inputs, outputs=o)
In order to get tensors from feature columns, the closest lead I have now is
fc_to_tensor = {fc: input_layer(features, [fc]) for fc in feature_columns}
from https://github.com/tensorflow/tensorflow/issues/17170
However, I'm not sure what features is in that code.
There's no clear example on https://www.tensorflow.org/api_docs/python/tf/feature_column/input_layer either.
How should I construct the features variable for fc_to_tensor?
Or is there a way to use keras.layers.Input and feature_column at the same time?
Or is there an alternative to tf.feature_column for doing the bucketing above? Then I'll just drop the feature_column for now.
The behavior you want can be achieved through the following steps.
This works in TF 2.0.0-beta1, but may be changed or even simplified in future releases.
Please check out the issue "Unable to use FeatureColumn with Keras Functional API" (#27416) in the TensorFlow GitHub repository. There you will find a more general example and useful comments about tf.feature_column and the Keras Functional API.
Meanwhile, based on the code in your question, the input tensor for the feature_column can be obtained like this:
# The feature columns you have already defined
kid_youngest_month = feature_column.numeric_column("kid_youngest_month")
kid_age_youngest_buckets = feature_column.bucketized_column(kid_youngest_month, boundaries=[12, 24, 36, 72, 96])
# Then define the layer
feature_layer = tf.keras.layers.DenseFeatures([kid_age_youngest_buckets])
# The inputs for the DenseFeatures layer are defined per original feature column, as a dictionary where
# keys - names of feature columns
# values - tf.keras.Input with shape=(1,), name='name_of_feature_column', dtype - actual type of the original column
feature_layer_inputs = {}
feature_layer_inputs['kid_youngest_month'] = tf.keras.Input(shape=(1,), name='kid_youngest_month', dtype=tf.int8)
# Then collect the inputs of the other layers and feature_layer_inputs into one flat list
inputs = [review_meta_id_input, priors_input] + list(feature_layer_inputs.values())
# Then define the outputs of the DenseFeatures layer
feature_layer_outputs = feature_layer(feature_layer_inputs)
# And pass them into another layer like any other tensor
x = tf.keras.layers.Dense(256, activation='relu')(feature_layer_outputs)
# Or maybe concatenate them with outputs from your other layers
combined = tf.keras.layers.concatenate([x, feature_layer_outputs])
# And you will probably finish with a final output layer, for example for classification
o = tf.keras.layers.Dense(classes_number, activation='softmax', name='sequential_output')(combined)
# So you pass to the model:
model_combined = tf.keras.models.Model(inputs=inputs, outputs=o)
Also note that in the model's fit() method you must specify which data is used for each input.
One way: if you use tf.data.Dataset, make sure the feature names in the Dataset match the keys of the feature_layer_inputs dictionary.
Another way is to use explicit notation, keyed by the names of the Input layers:
model_combined.fit({'review_meta_id': review_meta_id_data, 'priors': priors_data, 'kid_youngest_month': kid_youngest_month_data},
    {'sequential_output': labels},  # labels: placeholder name for your target array
    ...
)
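On the last part of the question (an alternative to tf.feature_column for bucketing): one option is to bucketize during preprocessing, before the data reaches the model, and feed the result through an ordinary keras Input. Below is a minimal NumPy sketch, assuming the same boundaries as above. The function name and the one-hot output format are illustrative choices, not part of the original answer.

```python
import numpy as np

BOUNDARIES = [12, 24, 36, 72, 96]


def bucketize_one_hot(months, boundaries=BOUNDARIES):
    """One-hot encode values into len(boundaries) + 1 buckets.

    Each bucket includes its left boundary and excludes its right one,
    e.g. 12 falls in [12, 24), which is bucket 1.
    """
    idx = np.digitize(np.asarray(months), boundaries)   # bucket index per value
    return np.eye(len(boundaries) + 1, dtype=np.float32)[idx]


encoded = bucketize_one_hot([3, 12, 40, 100])
bucket_ids = encoded.argmax(axis=1)
```

The resulting array has one row per example and six columns, so it could be passed to the model through something like tf.keras.Input(shape=(6,)) alongside the other inputs.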

Reading EMNIST dataset

I am building a CNN using TensorFlow in Python, but I am having a problem loading the data from the EMNIST dataset. Can anyone please show me sample code for retrieving each image in a batch and passing it in during the training session?
There are a couple of formats of the EMNIST dataset. The one I've found easiest to understand is the CSV version on Kaggle: https://www.kaggle.com/crawford/emnist. Each row is a separate image with 785 columns: the first column is the class label, and each column after it represents one pixel value (784 in total for a 28 x 28 image).
You can check out one of my implementations of an EMNIST CNN using Keras, where your dataset loading can be similar:
import pandas as pd
from sklearn.model_selection import train_test_split
raw_data = pd.read_csv("data/emnist-balanced-train.csv")
train, validate = train_test_split(raw_data, test_size=0.1) # change this split however you want
x_train = train.values[:,1:]
y_train = train.values[:,0]
x_validate = validate.values[:,1:]
y_validate = validate.values[:,0]
from https://github.com/Josh-Payne/cs230/blob/master/Alphanumeric-Augmented-CNN/augmented-cnn.py
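To go from the flat rows above to image batches, the pixel columns can be reshaped to 28 x 28 and iterated in mini-batches. Below is a rough sketch in NumPy. The helper names, the scaling to [0, 1], and the random stand-in data are assumptions for illustration and are not taken from the linked repository.

```python
import numpy as np


def to_images(pixel_rows, height=28, width=28):
    """Reshape flat pixel rows to (N, height, width, 1) floats in [0, 1]."""
    rows = np.asarray(pixel_rows, dtype=np.float32)
    return rows.reshape(-1, height, width, 1) / 255.0


def iterate_batches(images, labels, batch_size=32, seed=0):
    """Yield shuffled (images, labels) mini-batches covering the data once."""
    order = np.random.default_rng(seed).permutation(len(images))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]


# Random stand-in for x_train / y_train from the CSV (10 images, 47 classes)
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(10, 784))
y_train = rng.integers(0, 47, size=10)

batches = list(iterate_batches(to_images(x_train), y_train, batch_size=4))
```

Each yielded pair can be fed to one training step. With Keras, passing the reshaped arrays directly to model.fit with a batch_size argument would do the batching for you.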