I'm trying to train a binary classification model using DeepFM for the first time. The dataset consists of anonymized ids mapped to a list of segments, with a boolean 1 or 0 indicating whether the id has that segment.
The data is one-hot encoded, so it looks like:
id   SEGMENT1  SEGMENT2  SEGMENT3  Label
id1  0         1         0         0
id2  1         1         1         1
id2  1         0         1         1
I am training by following the deepctr documentation, but it distinguishes between dense (numeric) features and sparse (categorical) features. I would assume my features are dense, since they are just 0s and 1s and I don't need to transform anything with a LabelEncoder the way I would for categorical data. Do I still need to use both dnn_feature_columns and linear_feature_columns? I don't have both kinds of features in my data.
linear_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
train_model_input = {name: train[name] for name in feature_names}
test_model_input = {name: test[name] for name in feature_names}
model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
model.compile("adam", "binary_crossentropy",
metrics=['binary_crossentropy'], )
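For context, here is a minimal sketch of what I have in mind for the feature columns, assuming every segment field is a 0/1 flag and treating them all as dense features via deepctr's DenseFeat (I'm not sure this is the intended usage):

# Sketch (unverified): treat every 0/1 segment column as a dense feature and
# reuse the same column list for both the linear and the DNN part of DeepFM.
from deepctr.feature_column import DenseFeat, get_feature_names
from deepctr.models import DeepFM

dense_features = [c for c in train.columns if c.startswith('SEGMENT')]
fixlen_feature_columns = [DenseFeat(name, 1) for name in dense_features]

linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns   # same columns feed both parts

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
train_model_input = {name: train[name] for name in feature_names}

model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
model.compile("adam", "binary_crossentropy", metrics=['binary_crossentropy'])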
Thank you in advance!
I have a dataframe like this:
id  text             feat_1  feat_2  feat_3  feat_n
1   random coments   0       0       1       0
2   random coments2  1       0       1       0
1   random coments3  1       1       1       1
The feat columns go from 1 to 100 and they are the labels of a multilabel dataset; the values are 1 and 0 (boolean).
The dataset has over 50k records and the labels are unbalanced. I am looking for a way to balance it, and I was working on this approach:
Sum the values in each feat column and then use the lowest of these sums as a threshold to filter the dataset.
I need to keep all feature columns, so the only thing I can exclude is rows (comments) to achieve this.
The main idea boils down to: I need a balanced dataset to use in a multilabel classification problem, i.e. roughly the same amount of data for each feat column, since those columns are my labels.
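For illustration, here is a rough, untested sketch of that thresholding idea (it assumes a pandas DataFrame df whose label columns are named feat_1 ... feat_n; because a row can carry several labels at once, this only balances the counts approximately):

import pandas as pd

# Sketch: undersample rows so every label column ends up with roughly the
# same number of positive (1) examples, using the rarest label as the threshold.
feat_cols = [c for c in df.columns if c.startswith('feat_')]

label_counts = df[feat_cols].sum()      # positives per label
threshold = int(label_counts.min())     # size of the rarest label

balanced_parts = []
for col in feat_cols:
    positives = df[df[col] == 1]
    balanced_parts.append(positives.sample(n=threshold, random_state=1))

balanced_df = pd.concat(balanced_parts).drop_duplicates()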
I am trying to create a sequential model which would classify random groups of vectors into classes. The model consistently classifies all groups to the same class.
creating data:
Each news item has 200 random vectors with a dimension of 300.
I want the model to classify each news group into a class.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, SimpleRNN

allnews = []
for j in range(50):
    news = []
    for i in range(200):
        news.append(np.random.random(300))
    allnews.append(np.array(news))

# allnews = tf.convert_to_tensor(allnews)
allnews = np.array(allnews)
print(np.shape(allnews))
allnews = allnews.reshape((allnews.shape[0], allnews.shape[1], 300))
print(np.shape(allnews))
lables = []
for j in range(20):
    lables.append(0)
for j in range(20):
    lables.append(1)
for d in range(10):
    lables.append(2)
lables = tf.convert_to_tensor(lables)
print(lables)
creating the model:
The model I am trying to create:
YourSequenceLenght=200
model = tf.keras.Sequential()
model.add(Input(shape=(YourSequenceLenght,300)))
model.add(Dense(300,use_bias=False,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l1(0.01),activation="linear"))
model.add(SimpleRNN(1, return_sequences=False,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l1(0.01),use_bias=False,recurrent_regularizer=tf.keras.regularizers.l1(0.01),activation="sigmoid"))
model.add(Dense(3,use_bias=False,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l1(0.01),activation="softmax"))
model.summary()
METRICS = [
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall'),
    keras.metrics.AUC(name='auc'),
]
model.compile(optimizer='sgd',loss='categorical_crossentropy',metrics=METRICS)
training and predicting:
print(lables)
lables = keras.utils.to_categorical(y=lables,num_classes= 3)
# y_train = np_utils.to_categorical(y=y_train, num_classes=10)
print(lables)
history = model.fit(allnews,lables,epochs=10)
res= model.predict(allnews)
print(np.shape(res))
import operator

for r in res:
    index, value = max(enumerate(r), key=operator.itemgetter(1))
    print(index)
    print(value)

for r in res:
    print(r)
The outputs from the first for loop's prints:
2
0.34069243
2
0.34070647
2
0.33907583
2
0.34005642
2
0.34013948
2
0.34007362
2
0.34028214
2
0.33997294
2
0.34018084
2
0.33995336
2
0.33998552
2
0.33882195
2
0.3401062
2
0.3418465
2
0.33978543
2
0.3396516
2
0.34062216
2
0.3419327
2
0.34114555
2
0.34119973
2
0.3404259
2
0.33981207
2
0.34035686
2
0.34139898
2
0.3398025
2
0.3391234
2
0.34051093
2
0.34120804
2
0.34140897
2
0.34064025
2
0.34133258
2
0.34019342
2
0.3404882
2
0.33930022
2
0.3416659
2
0.3406455
2
0.34054703
2
0.34057957
2
0.3391579
2
0.3395657
2
0.34069654
2
0.3400011
2
0.338789
2
0.34008256
2
0.34080264
2
0.34000066
2
0.340322
2
0.341806
2
0.34178147
2
0.34078327
EDIT:
clarification
I am trying to use a model which works as follows:
a sigmoid hidden layer (with recurrence) and a softmax projection
You are trying to learn something from random data. Your model is (randomly) initialized in such a way that it always predicts class 2, and the gradient updates don't steer the weights in any particular direction because the input is random, so they stay there. Try having your input data be structured instead of random (e.g. random.random()*tf.one_hot(1, depth=200) for class 1, random.random()*tf.one_hot(2, depth=200) for class 2 and random.random()*tf.one_hot(3, depth=200) for class 3). Your values will still be random, but they will adhere to a structure.
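As a rough, illustrative sketch of what such structured data could look like (I use depth=300 here so the vectors keep the dimensionality of your original data, and class indices 0/1/2 to match your labels; this is an assumption, not your exact setup):

import numpy as np
import tensorflow as tf

# Sketch: each class gets its own fixed one-hot "direction", scaled by random
# noise, so the inputs carry a signal the network can actually pick up.
def make_group(class_index, seq_len=200, dim=300):
    basis = tf.one_hot(class_index, depth=dim).numpy()
    return np.array([np.random.random() * basis for _ in range(seq_len)])

allnews = np.array(
    [make_group(0) for _ in range(20)] +
    [make_group(1) for _ in range(20)] +
    [make_group(2) for _ in range(10)]
)
lables = np.array([0] * 20 + [1] * 20 + [2] * 10)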
EDIT:
I took a look at your colab:
1) you can speed up the dataset construction by adding .numpy() after the tf.one_hot call: tf.one_hot(1, depth=200).numpy().
2) When I changed the model to:
model = tf.keras.Sequential()
model.add(Input(shape=(YourSequenceLenght,300)))
model.add(tf.keras.layers.Flatten())
model.add(Dense(300,use_bias=False,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l1(0.01),activation="linear"))
# model.add(SimpleRNN(1, return_sequences=False,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l1(0.01),use_bias=False,recurrent_regularizer=tf.keras.regularizers.l1(0.01),activation="sigmoid"))
model.add(Dense(3,use_bias=False,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l1(0.01),activation="softmax"))
model.summary()
the accuracy quickly reached 100% after 4 epochs. I think that because you have only 1 output neuron in the SimpleRNN, you can't encode enough information about which class it should be, at least not with just 1 Dense layer afterwards.
3) You are using BinaryAccuracy in your metrics, which doesn't make a lot of sense here. You can just use the normal accuracy (as a string) for the accuracy metric (metrics = ["accuracy", tf.keras.metrics.TruePositives(...), ...]).
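For example, the compile call could look something like this (just a sketch; keep whichever of your other metrics you still want):

# Sketch of the suggested change: use the plain "accuracy" string instead of
# BinaryAccuracy for a 3-class softmax output.
model.compile(
    optimizer='sgd',
    loss='categorical_crossentropy',
    metrics=['accuracy',
             tf.keras.metrics.Precision(name='precision'),
             tf.keras.metrics.Recall(name='recall')],
)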
My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, depending on the booking status). There are many more no rows than yes rows, so I would like to take a sample with all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do the sampling like this?
If our entire dataset looks like this:
print(df)
   c1  c2
0   1   1
1   0   2
2   0   3
3   0   4
4   0   5
5   0   6
6   0   7
7   1   8
8   0   9
9   0  10
We may decide to sample from it using the DataFrame.sample function. By default, this function will sample without replacement. Meaning, you'll receive an error by specifying a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will never pick rows whose value in this column is zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
   c1  c2
0   1   1
7   1   8
0   1   1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
   c1  c2
7   1   8
0   1   1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to do this in the end; here is how I did it:
import pandas as pd

bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')

# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()

# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]

# Undersample the majority class and recombine
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone
I'm trying to train a neural network for a path-finding problem on a 10x10 grid map, but it doesn't seem to work. Here are the details:
My input to the neural network is a 10x10x2 matrix where the first 10x10 slice represents obstacles on the map, and the second 10x10 slice contains only two points, the initial and final points.
My desired output is the shortest path found by the A* algorithm. I've written code that produces the desired number of cases, and the optimal route is found by A* right after producing each case. I want to teach the neural network to find these paths. As an example, the general structure for a 4x4 case is shown below.
obstacles matrix(input):
0 0 0 0
0 1 1 1
0 1 1 0
0 0 0 0
initial and final point matrix(input):
0 1 0 0
0 0 0 0
0 0 0 1
0 0 0 0
route(desired output):
1 1 0 0
1 0 0 0
1 0 0 1
1 1 1 1
Also, I'm adding the pictures of a case and the output of the neural network.
obstacles
start and target points
desired route
combined image
Up to now, I've described the inputs and output of the neural network. I'm trying to train the network using 3 fully connected layers, but it seems it does not learn the pattern. Here is my network:
x = tf.placeholder(dtype=tf.float32, shape=[None,10,10,2])
y = tf.placeholder(dtype=tf.float32, shape=[None,10,10])
rate = tf.placeholder(dtype=tf.float32)
# flatten the input
x_flatten = tf.contrib.layers.flatten(x)
y_flatten = tf.contrib.layers.flatten(y)
# fully connected layer
fc = tf.layers.dense(inputs=x_flatten, units=1000, activation=tf.nn.tanh)
fc = tf.layers.dropout(fc, rate=rate, training=True) # rate = 0.3
fc = tf.layers.dense(inputs=fc, units=500, activation=tf.nn.tanh)
logits = tf.layers.dense(inputs=fc, units=100, activation=None)
cost = tf.reduce_mean(tf.abs(logits - y_flatten))
optimizer = tf.train.AdamOptimizer().minimize(cost)
Finally, I'm adding the outcome of the NN after training with 1000 cases and 20 epochs, and the ground truth together.
training outcome
test outcome
I have also tried a CNN, but it did not work either. Any suggestions are welcome, thanks in advance.
I have train / test input files in this format (filename label):
...\000881.JPG 2
...\000961.JPG 1
...\001700.JPG 1
...\001291.JPG 1
The input file above will be used with the ImageDeserializer. Since I have been unable to retrieve a row ID and the label from my code after the model has been trained, I created a second test file in this format:
|index 881 |piece_type 0 0 1 0 0 0
|index 961 |piece_type 0 1 0 0 0 0
|index 1700 |piece_type 0 1 0 0 0 0
|index 1291 |piece_type 0 1 0 0 0 0
The second file contains the same information as the first file, just formatted differently. The index is the row number and piece_type is the label encoded in one-hot format. I need the file in the second format in order to be able to get at the row number and the label. The second file is used with the CTFDeserializer to create a composite reader like this:
image_source = ImageDeserializer(map_file, StreamDefs(
    features = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    labels = StreamDef(field='label', shape=num_classes)         # and second as 'label'
))
text_source = CTFDeserializer("test_map2.txt")
text_source.map_input('index', dim=1, format="dense")
text_source.map_input('piece_type', dim=6, format="dense")
# define a composite reader
reader_config = ReaderConfig([image_source, text_source])
minibatch_source = reader_config.minibatch_source()
The reason I have added the second file is to be able to create a confusion matrix, for which I need both the true labels and the predicted labels for a given minibatch that I test with. The row numbers are nice to have in order to get a pointer back to the input images.
Would it be possible to do this somehow with just one input file? It's a bit of a hassle to deal with multiple files and formats.
You could load the test images without using a reader, as described in this wiki page. Admittedly this puts the burden of all the transformations (cropping/mean subtraction etc.) on the user, but at least the PIL package makes these easy. This CNTK tutorial uses PIL to crop and scale the input images before feeding them to CNTK.
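For illustration, a rough sketch of that approach (this assumes a trained CNTK model function z and a placeholder image path; the crop size and channel handling depend on how your network was trained):

import numpy as np
from PIL import Image

# Sketch: load and preprocess one test image without a reader, then evaluate
# the trained model on it directly.
img = Image.open("path/to/000881.JPG")   # placeholder path
img = img.resize((224, 224))             # scale to the network's input size (assumed 224x224)

# HWC RGB -> CHW BGR, the layout CNTK image models typically expect
arr = np.asarray(img, dtype=np.float32)[..., ::-1]
arr = np.ascontiguousarray(np.rollaxis(arr, 2))

# z is the trained model function; its first argument is the image input
output = z.eval({z.arguments[0]: [arr]})
predicted_label = int(np.argmax(output))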