Combine image data, row id and labels in one input file? - cntk

I have train / test input files in this format (filename label):
...\000881.JPG 2
...\000961.JPG 1
...\001700.JPG 1
...\001291.JPG 1
The input file above is used with the ImageDeserializer. Since I have been unable to retrieve a row ID and the label from my code after the model has been trained, I created a second test file in this format:
|index 881 |piece_type 0 0 1 0 0 0
|index 961 |piece_type 0 1 0 0 0 0
|index 1700 |piece_type 0 1 0 0 0 0
|index 1291 |piece_type 0 1 0 0 0 0
The second file contains the same information as the first, just formatted differently. The index is the row number and the |piece_type is the label encoded in one-hot format. I need the second format in order to get at the row number and the label. The second file is used with the CTFDeserializer to create a composite reader like this:
image_source = ImageDeserializer(map_file, StreamDefs(
    features = StreamDef(field='image', transforms=transforms),  # first column in map file is referred to as 'image'
    labels   = StreamDef(field='label', shape=num_classes)       # and second as 'label'
))

text_source = CTFDeserializer("test_map2.txt")
text_source.map_input('index', dim=1, format="dense")
text_source.map_input('piece_type', dim=6, format="dense")

# define a composite reader
reader_config = ReaderConfig([image_source, text_source])
minibatch_source = reader_config.minibatch_source()
The reason I added the second file is to be able to create a confusion matrix, for which I need both the true labels and the predicted labels for each minibatch I test with. The row numbers are also nice to have as a pointer back to the input images.
Would it somehow be possible to do this with just one input file? It's a bit of a hassle to deal with multiple files and formats.

You could load the test images without using a reader, as described in this wiki page. Admittedly this puts the burden of all the transformations (cropping, mean subtraction, etc.) on the user, but at least the PIL package makes these easy. This CNTK tutorial uses PIL to crop and scale the input images before feeding them to CNTK.
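For reference, a minimal sketch of that reader-free approach (not the exact code from the wiki page; it assumes an already-trained model whose input expects a float32 CHW image, and the resize and mean values are placeholders that must match whatever the network was trained with):

import numpy as np
from PIL import Image

def load_image(path, width=224, height=224, mean=114.0):
    img = Image.open(path).convert('RGB')
    img = img.resize((width, height), Image.BILINEAR)   # scale, as the ImageDeserializer transforms would
    arr = np.asarray(img, dtype=np.float32)             # H x W x C
    arr = np.transpose(arr, (2, 0, 1)) - mean           # C x H x W with mean subtraction
    return arr

x = load_image(r"...\000881.JPG")                       # row id and true label are known from the map file line
predictions = model.eval({model.arguments[0]: [x]})     # forward pass without any reader
predicted_label = np.argmax(predictions)

From there the true label (taken from the map file line) and predicted_label can be collected row by row to build the confusion matrix.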

Related

DeepFM Features

I'm trying to train a binary classification model using DeepFM for the first time. The dataset consists of anonymized ids mapped to a list of segments, with a boolean 1 or 0 indicating whether they have the segment.
The data is one-hot encoded, so it looks like:
id    SEGMENT1    SEGMENT2    SEGMENT3    Label
id1   0           1           0           0
id2   1           1           1           1
id2   1           0           1           1
I am training by following the deepctr documentation, but it calls for dense (numeric) features and sparse (categorical) features. I would assume my features are dense, since they are just 0 and 1 and I don't need to transform anything with a label encoder for categorical values. Do I still need to use both dnn_feature_columns and linear_feature_columns? I don't have both kinds of features in my data.
linear_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

train_model_input = {name: train[name] for name in feature_names}
test_model_input = {name: test[name] for name in feature_names}

model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
model.compile("adam", "binary_crossentropy", metrics=['binary_crossentropy'])
Thank you in advance!
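For context, the feature-column variables referenced in the snippet above are normally built from the column names beforehand. A minimal sketch, assuming deepctr's SparseFeat / get_feature_names helpers and the segment columns from the table (treating each 0/1 column as a categorical feature with a vocabulary of size 2 is one option, not necessarily the right choice here):

from deepctr.feature_column import SparseFeat, get_feature_names

segment_cols = ['SEGMENT1', 'SEGMENT2', 'SEGMENT3']

# each 0/1 column treated as a sparse feature with two possible values
fixlen_feature_columns = [SparseFeat(col, vocabulary_size=2, embedding_dim=4)
                          for col in segment_cols]

# DeepFM takes two lists; passing the same columns to the linear (wide) part
# and the DNN (deep) part is common when there is only one kind of feature
linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)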

Best approach to balance a multilabel dataset

I am working on a multilabel dataset that is quite unbalanced, with almost 100 labels. A row can have one to several labels, like this:
text labels
some text ["earth"]
another text ["earth","car"]
text again ["sun","earth","truck"]
From here I can get a dataframe with all possible labels and their frequencies:
labels_frequency = df.labels.map(ast.literal_eval).explode().value_counts()
out_labels = pd.DataFrame(labels_frequency).reset_index()
out_labels
And I can see that the label with the highest count has 10k records and the label with the lowest has 1k records.
I am creating my dataset using sklearn MultiLabelBinarizer to get this:
text label1 label2 ... label100
some text 0 0 1
another text 1 1 0
text again 0 1 0
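For reference, that binarization step can be sketched like this (using the toy rows from the example above; the resulting columns are named after mlb.classes_ rather than label1 ... label100):

import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "text": ["some text", "another text", "text again"],
    "labels": ['["earth"]', '["earth","car"]', '["sun","earth","truck"]'],
})

mlb = MultiLabelBinarizer()
onehot = mlb.fit_transform(df["labels"].map(ast.literal_eval))
wide = pd.concat(
    [df[["text"]], pd.DataFrame(onehot, columns=mlb.classes_, index=df.index)],
    axis=1,
)   # one 0/1 column per label, as in the table above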
What I need from here:
I want to undersample this dataset in such a way that every label is reduced to the lowest label count; in this example, that would be 1k records of each label. But as I said above, a record can have more than one label per row.
So, what's the best way to tackle this problem?

Preprocess Data for Tensorflow 2.0

I have a .csv file with hundreds of thousands of lines. The information was collected in order, user by user.
For example, one user's inputs may span 20-400 rows, and the corresponding target is a single value on the row where that user's first input row starts.
inputs | Targets
0      | 7
1      |
2      |
3      |
4      |
So there is one set of targets per x input rows.
Some of my columns contain '-'. I feel like this will mess up my model during training, since it isn't a float or an int; what should I do?
Also, should I shuffle my data if it is chunked like this?
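On the '-' values, one common option (a sketch, not necessarily the right choice for this dataset; the file name and fill strategy are placeholders) is to have pandas treat them as missing values at read time and then impute or drop them before handing the data to TensorFlow:

import pandas as pd

df = pd.read_csv("data.csv", na_values=["-"])  # '-' cells become NaN instead of strings
df = df.fillna(0.0)                            # or df.dropna(), or a per-column mean
features = df.astype("float32")                # assumes the remaining columns are numeric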

Pandas not consistently skipping input number of rows for skiprows argument?

I am using Pandas to organize CSV files to later plot with matplotlib. First I create a Pandas dataframe and search for the line containing 'Pt', which is what I want to use as my header line:
[screenshot: header line]
Then I save the index of this line and pass it to the skiprows argument when creating the new dataframe that I will actually use.
Oddly, depending on the file format, even though the correct index is found, the wrong line shows up as the header. For example, note how in the Pandas output line 54 has 'Pt' right after the tab:
[screenshot: correct index on first file]
The dataframe comes out correctly here:
[screenshot: correct dataframe on first file]
For another file, line 44 is correctly recognized as having 'Pt':
[screenshot: correct index on second file]
But the dataframe includes line 43 as the header!
[screenshot: incorrect dataframe on second file]
I have tried setting header=0 and header=None. Am I missing something?
Here is the code
entire_df = pd.read_csv(file_path, header=None)
print(entire_df.head(60))

header_idx = -1
for index, row in entire_df.iterrows():  # find line with desired header
    if any(row.str.contains('Pt')):
        print("Yes! I have pt!")
        print("Header index is: " + str(index))
        print("row contains:")
        print(entire_df.loc[[index]])
    header_idx = index  # correct index obtained!
    break

df = pd.read_csv(file_path, delimiter='\t', skiprows=header_idx, header=0)  # use line index to exclude extra information above
print(df.head())
Here are sections of the two files that give different results. They are saved as .dta files. I cannot share the entire files.
file1 (properly made dataframe)
FRAMEWORKVERSION QUANT 7.07 Framework Version
INSTRUMENTVERSION LABEL 4.32 Instrument Version
CURVE TABLE 16875
Pt T Vf Im Vu Pwr Sig Ach Temp IERange Over
# s V A V W V V deg C # bits
0 0.1 3.49916E+000 -1.40364E-002 0.00000E+000 -4.91157E-002 -4.22328E-001 0.00000E+000 1.41995E+003 11 ...........
1 0.2 3.49439E+000 -1.40305E-002 0.00000E+000 -4.90282E-002 -4.22322E-001 0.00000E+000 1.41995E+003 11 ...........
2 0.3 3.49147E+000 -1.40258E-002 0.00000E+000 -4.89705E-002 -4.22322E-001
file2 (dataframe with wrong header)
FRAMEWORKVERSION QUANT 7.07 Framework Version
INSTRUMENTVERSION LABEL 4.32 Instrument Version
CURVE TABLE 18
Pt T Vf Vm Ach Over Temp
# s V vs. Ref. V V bits deg C
0 2.00833 3.69429E+000 3.69429E+000 0.00000E+000 ........... 1419.95
1 4.01667 3.69428E+000 3.69352E+000 0.00000E+000 ........... 1419.95
2 6.025 3.69419E+000 3.69284E+000 0.00000E+000 ........... 1419.95
3 8.03333 3.69394E+000 3.69211E+000 0.00000E+000 ........... 1419.95
Help would be much appreciated.
You should pay attention to your indentation levels. The code block in which you set header_idx depending on your if any(row.str.contains('Pt')) condition has the same indentation level as the if statement, which means it is executed on every iteration of the for loop, and not just when the condition is met.
for index, row in entire_df.iterrows():
    if any(row.str.contains('Pt')):
        [...]
    header_idx = index
Adjust the indentation as follows to put the assignment under the control of the if statement:
for index, row in entire_df.iterrows():
    if any(row.str.contains('Pt')):
        [...]
        header_idx = index
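Putting it together (a sketch reusing the variable names from the question), the search loop with both the assignment and the break under the if becomes:

header_idx = -1
for index, row in entire_df.iterrows():          # find line with desired header
    if any(row.str.contains('Pt')):
        header_idx = index                       # only set when 'Pt' is actually found
        break

# skip the metadata lines above the header line that was found
df = pd.read_csv(file_path, delimiter='\t', skiprows=header_idx, header=0)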

Dendrograms with SciPy

I have a dataset that I shaped according to my needs; the dataframe looks like this:
Index A B C D ..... Z
Date/Time 1 0 0 0,35 ... 1
Date/Time 0,75 1 1 1 1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (example: the whole A column is compared to the whole B column over the whole time range).
I am expecting an output like this: [example dendrogram image, source: rsc.org]
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward')
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every individual time point with every other and plot that, but then the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
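For what it's worth, one minimal sketch of treating each column as a whole time series (an assumption about the intent; it also presumes the decimal-comma values have already been converted to floats) is to transpose the dataframe so every column becomes a single observation before building the linkage:

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

X = df.T.values                      # shape: (n_columns, n_timepoints), one row per column A..Z
Z = hierarchy.linkage(X, 'ward')     # one leaf per original column

hierarchy.dendrogram(Z, labels=df.columns.tolist())
plt.show()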