Assume one has JSON data containing instructions for generating the following 10x5 cell patterns, and that each cell can contain one of the following characters: _ 0 x y z
Also assume that each character can be displayed in various colors.
pattern 1:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
pattern 2:
xx0z00yy_z
zzx_0000_x
_yxy0y__yx
_xz0z__0_y
y__x0_0_y_
pattern 3:
yx0x_xz0_z
xz_x0_xxxz
_yy0x_0z00
zyy0__0zyx
z_xy0_0xz0
These were randomly generated and are shown all in black, but assume they were devised according to some set of rules, and in color.
The JSON for the first pattern would look something like:
{
  "width": 10,
  "height": 5,
  "cells": [
    {
      "value": "_",
      "color": "red"
    },
    {
      "value": "y",
      "color": "blue"
    },
    ...
  ]
}
If one wanted to train on this data in order to generate new yet similar patterns (again, assuming these were not randomly generated), what is the recommended approach for:
reading the data in (I'd imagine putting the JSON into an Example protobuf, serializing the proto to a string, writing that to TFRecord files, and later parsing it back with tf.parse_example; see the rough sketch after this list)
training on that data
generating new patterns based on the trained model
supplying seed data for the generated patterns, e.g. the first cell is the character "x" with the color blue.
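For the first item, a rough sketch of the serialization step I have in mind (the feature names and the flattening of cells into parallel lists are just assumptions for illustration):

import json
import tensorflow as tf

# Rough sketch: flatten one pattern's cells into parallel lists and write a
# single tf.train.Example per pattern to a TFRecord file.
with open("pattern1.json") as f:
    pattern = json.load(f)

example = tf.train.Example(features=tf.train.Features(feature={
    "width": tf.train.Feature(int64_list=tf.train.Int64List(value=[pattern["width"]])),
    "height": tf.train.Feature(int64_list=tf.train.Int64List(value=[pattern["height"]])),
    "values": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[c["value"].encode() for c in pattern["cells"]])),
    "colors": tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[c["color"].encode() for c in pattern["cells"]])),
}))

with tf.io.TFRecordWriter("patterns.tfrecord") as writer:
    writer.write(example.SerializeToString())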
I want to achieve something similar to what I've seen in style transfer with art/photos and with music/MIDI data (see: Google Magenta). In those cases, the model is trained on a distinctive set of artwork or a melodic style, and a seed in the form of a photograph or primer melody is supplied in order to generate content similar to the data used in training.
Thanks!
I dislike preprocessing the dataset into new forms; it makes the pipeline difficult to change later on and slows future development. It's like technical debt, in my opinion.
My approach would be to keep your JSON as-is and write some simple Python code (specifically a generator, meaning you use yield instead of return statements) to read the JSON file and spit out samples in sequence.
Then use the TensorFlow Dataset input pipeline with Dataset.from_generator(...) to take data from your input function.
https://www.tensorflow.org/programmers_guide/datasets
The Dataset pipeline provides everything you need to manage the various transformations you'll want to apply: you can buffer, shuffle, batch, prefetch, and map functions onto your data trivially, in a nice modular, testable framework that feeds naturally into your TensorFlow model.
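A minimal sketch of that approach, assuming a JSON file of patterns shaped like the one above and made-up character/color vocabularies (TF 2 API):

import json
import tensorflow as tf

CHARS = ['_', '0', 'x', 'y', 'z']           # assumed cell vocabulary
COLORS = ['red', 'blue', 'green', 'black']  # assumed color vocabulary

def sample_generator(path="patterns.json"):
    # Yield one (values, colors) pair of integer ids per pattern.
    with open(path) as f:
        patterns = json.load(f)
    for p in patterns:
        yield ([CHARS.index(c["value"]) for c in p["cells"]],
               [COLORS.index(c["color"]) for c in p["cells"]])

dataset = (tf.data.Dataset.from_generator(
               sample_generator,
               output_signature=(tf.TensorSpec(shape=(None,), dtype=tf.int32),
                                 tf.TensorSpec(shape=(None,), dtype=tf.int32)))
           .shuffle(buffer_size=100)
           .padded_batch(8)
           .prefetch(tf.data.AUTOTUNE))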
Let's say that I have a dataset with multiple input features and one single output. For the sake of simplicity, let's say the output is binary. Either zero or one.
I want to split this dataset into k parts and use a k-fold cross-validation model to learn the mapping from the input features to the output one. If the dataset is imbalanced, the ratio between the number of records with output 0 and 1 is not going to be one. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.
I think it's important that within each of the k folds we see the same ratio of 0s and 1s for successful training (the same 9-to-1 ratio). I know how to do this in Pandas, but my question is how to do it in TFX.
Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:
output = tfx.proto.Output(
    split_config=tfx.proto.SplitConfig(splits=[
        tfx.proto.SplitConfig.Split(name='fold_1', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_2', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_3', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_4', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_5', hash_buckets=1)
    ]))
example_gen = CsvExampleGen(input_base=input_dir, output_config=output)
But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?
BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless there's a feature holding the ID of the fold for each example, which I think is not practical since I might want to change the number of folds as part of the experiment without changing the dataset.
I'm going to answer my own question but only as a workaround. I'll be happy to see someone develop a real solution to this question.
What I could come up with at this point was to split the dataset into a number of TFRecord files. I've chosen a "composite" number of files so I can split them into (almost) any number of folds I want. For this, I've settled on 60, since it is divisible by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want k-fold with k higher than 12). Then, at the time of loading them, I have to somehow select which files will go into each split. There are two things to consider here.
First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:
input = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
    tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input)
Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:
fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...
The file pattern works like this: between each pair of underscores I include a divisor, a ., and then the remainder of that file's index for that divisor. And I'll have as many of these divisor/remainder pairs as I want "split possibilities" later, at the time of loading the dataset.
In the example above, I'll have the option to load them into 3, 4, 5, 6, or 10 folds. The first file will be loaded as part of the 0th split no matter how many folds I choose, while the second file will land in the 1st split of a 5-fold setup and the 6th split of a 10-fold setup.
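A small sketch of how these folder names can be generated for each of the 60 files (the helper name is made up):

def fold_dir_name(i, divisors=(3, 4, 5, 6, 10)):
    # Record file index i's remainder for every fold count we may use later.
    return "fold_" + "_".join(f"{d}.{i % d}" for d in divisors)

fold_dir_name(0)   # 'fold_3.0_4.0_5.0_6.0_10.0'
fold_dir_name(36)  # 'fold_3.0_4.0_5.1_6.0_10.6'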
And this is how I'll load them:
NUM_FOLDS = 5
input = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name=f'fold_{index + 1}',
                          pattern=f"fold_*{str(NUM_FOLDS)+'.'+str(index)}*/*")
    for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input)
I can change NUM_FOLDS to any of the options 3, 4, 5, 6, or 10, and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I made sure of the ratio of the samples within each file at the time of creating them, so any combination of files will also have the same ratio.
Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is that you have to split the dataset manually yourself. I did so, in this case, using pandas, which meant loading the whole dataset into memory; that might not be possible for all datasets.
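And a rough sketch of that manual split itself (column and file names are hypothetical): shuffle once, then deal the records of each class round-robin into the 60 shards, so every shard keeps roughly the same 0/1 ratio.

import pandas as pd

df = pd.read_csv("dataset.csv")
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)  # shuffle
df["shard"] = df.groupby("label").cumcount() % 60  # round-robin within each class
for shard_id, shard in df.groupby("shard"):
    out_dir = fold_dir_name(shard_id)  # naming helper sketched above
    # ...serialize `shard` as part-###.tfrecords.gz files under `out_dir`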
I'm having some trouble finding the right way to annotate my data. I'm dealing with laboratory-test-related texts and I am using the following labels:
1) Test specification (e.g. voltage, length,...)
2) Test object (e.g. battery, steel beam, ...)
3) Test value (e.g. 5 V; 5 m...)
Let's take this example sentence:
The battery voltage should be 5 V.
I would annotate this sentence like this:
The
battery voltage (Test specification)
should
be
5 V (Test value)
.
However, if the sentence looks like this:
The voltage of the battery should be 5 V.
I would use the following annotation:
The
voltage (Test specification)
of
the
battery (Test object)
should
be
5 V (Test value)
.
Is anyone experienced in annotating data who can explain whether this is the right way? Or should I use the Test object label for battery in the first example as well? Or should I combine the labels in the second example and annotate "voltage of the battery" as Test specification?
I am annotating the data to perform information extraction.
Thanks for any help!
All of your examples are unusual annotation formats. The typical way to tag NER data (in text) is to use an IOB/BILOU format, where each token is on one line, the file is a TSV, and one of the columns is a label. So for your data it would look like:
The
voltage U-SPEC
of
the
battery U-OBJ
should
be
5 B-VALUE
V L-VALUE
.
Pretend that is TSV; I have omitted the O tags, which are used for "other" (non-entity) tokens.
You can find documentation of these schemas in the spaCy docs.
If you already have data in the format you provided, or you find it easier to produce it that way, it should at least be easy to convert. For training NER, spaCy requires the data to be provided in a particular format (see the docs for details), but basically you need the input text, character spans, and the labels of those spans. Here's example data:
TRAIN_DATA = [
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
This format is trickier to produce manually than the TSV-type format above, so generally you would produce the TSV-like format, possibly using a tool, and then convert it.
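A rough sketch of such a conversion (this is not spaCy's own converter; it assumes whitespace-joined tokens and the BILOU tags from the example above):

# Convert BILOU-tagged tokens into spaCy's (text, {"entities": [...]}) format.
def bilou_to_spans(tagged_tokens):
    text, entities, start, label = "", [], None, None
    for token, tag in tagged_tokens:
        if text:
            text += " "
        tok_start = len(text)
        text += token
        tok_end = len(text)
        if tag.startswith("U-"):
            entities.append((tok_start, tok_end, tag[2:]))
        elif tag.startswith("B-"):
            start, label = tok_start, tag[2:]
        elif tag.startswith("L-") and start is not None:
            entities.append((start, tok_end, label))
            start, label = None, None
    return text, {"entities": entities}

bilou_to_spans([
    ("The", "O"), ("voltage", "U-SPEC"), ("of", "O"), ("the", "O"),
    ("battery", "U-OBJ"), ("should", "O"), ("be", "O"),
    ("5", "B-VALUE"), ("V", "L-VALUE"), (".", "O"),
])
# -> ('The voltage of the battery should be 5 V .',
#     {'entities': [(4, 11, 'SPEC'), (19, 26, 'OBJ'), (37, 40, 'VALUE')]})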
The main rule to correctly annotate entities is to be consistent (i.e. you always apply the same rules when deciding which entity is what). I can see you already have some rules in terms of when battery voltage should be considered a test object or a test specification.
Apply those rules consistently and you'll be ok.
Have a look at the spacy-annotator.
It is a library that helps you annotate data in the way you want.
Example:
import pandas as pd
import re
from spacy_annotator.pandas_annotations import annotate as pd_annotate

# Data
df = pd.DataFrame.from_dict({'full_text': ['The battery voltage should be 5 V.',
                                           'The voltage of the battery should be 5 V.']})

# Annotations
pd_dd = pd_annotate(df,
                    col_text='full_text',       # Column in pandas dataframe containing text to be labelled
                    labels=['test_specification', 'test_object', 'test_value'],  # List of labels
                    sample_size=1,              # Size of the sample to be labelled
                    delimiter=',',              # Delimiter to separate entities in GUI
                    model=None,                 # spaCy model for noisy pre-labelling
                    regex_flags=re.IGNORECASE)  # Regex flags applied when searching for entities in text

# Example output
pd_dd['annotations'][0]
The code will show you a user interface you can use to annotate each relevant entity.
I have some labeled data (down to 1,000 examples; shape: text, category) and up to 10k unlabeled examples. I want to use spaCy's rule-based matching to define a pattern for every category. After this, I would like to train a new model using the rules and the data I've labeled. Is this possible? I've seen a tutorial on YouTube (linked below) that does something similar, but it uses the labeled data to determine whether a sentence contains some entity; I, on the other hand, want to put a label on an entire paragraph.
https://www.youtube.com/watch?v=IqOJU1-_Fi0
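A rough sketch of the rule-based part I have in mind (the label name and token pattern are made up; this uses spaCy's Matcher and is not the training step itself):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# One token pattern per category; a match pre-labels the whole paragraph.
matcher.add("BATTERY_TEST", [[{"LOWER": "battery"}, {"LOWER": "voltage"}]])

doc = nlp("The battery voltage should be 5 V.")
if matcher(doc):
    weak_label = "BATTERY_TEST"  # becomes (noisy) training data for a text classifier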
Imagine I have hundreds of rectangular patterns that look like the following:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
Say the only variables across the different patterns are the dimensions (width by height in characters) and the value at each cell within the rectangle, with possible characters _ y x z 0. So another pattern might look like this:
yx0x_x
xz_x0_
_yy0x_
zyy0__
and another like this:
xx0z00yy_z0x000
zzx_0000_xzzyxx
_yxy0y__yx0yy_z
_xz0z__0_y_xz0z
y__x0_0_y__x000
xz_x0_z0z__0_x0
These simplified examples were randomly generated, but imagine there is a deeper structure and relation between dimensions and layout of characters.
I want to train on this dataset in an unsupervised fashion (no labels) in order to generate similar output. Assuming I have created my dataset appropriately with tf.data.Dataset and categorical identity columns:
what is a good general purpose model for unsupervised training (no labels)?
is there a Tensorflow premade estimator that would represent such a model well enough?
once I've trained the model, what is a general approach to using it to generate patterns based on what it has learned? I have in mind Google Magenta, which can be used to train on a dataset of musical melodies in order to generate similar ones from a kind of seed/primer melody.
I'm not looking for a full implementation (that's the fun part!), just some suggested tutorials and next steps to follow. Thanks!
Suppose I have a data file that has entries that look like this
0.00,2015-10-21,1,Y,798.78,323793701,6684,0.00,Q,H2512,PE0,1,0000
I would like to use this as an input to an mxnet model (a basic feed-forward multi-layer perceptron). A single input record has these data types, in the order shown above:
float,date,int,categorical,float,int,int,float,categorical,categorical,categorical,int, float
Each record is a meaningful representation of a specific entity. How do I represent this sort of data to mxnet? Also, to complicate things a bit, suppose I want to one-hot encode the categorical columns. And what if each record has these fields, in the order shown, but repeated multiple times in some cases, such that each record may have a different length?
The docs are great for the basic cases where the input data is all of the same type and can all be loaded into the same input without any transformation, but how do I handle this case?
Update: adding some additional details. To keep this as simple as possible, let's say I just want to feed this into a simple network, something like:
my $data = mx->symbol->Variable("data");
my $fc = mx->symbol->FullyConnected($data, num_hidden => 1);
my $softmax=mx->symbol->SoftmaxOutput(data => $fc, name => "softmax");
my $module = mx->mod->new(symbol => $softmax);
In the simple case of the data being all one type and not requiring much in the way of pre-processing, I could then just do something along the lines of
$module->fit(
$train_iter,
eval_data => $eval_iter,
optimizer => "adam",
optimizer_params=>{learning_rate=>0.001},
eval_metric => "mse",
num_epoch => 25
);
where $train_iter is a simple NDArray iterator over the training data. (Well, with the Perl API it's not exactly an NDArray, but it has complete parity with that interface, so it is conceptually identical.)
NDArrayIter also supports multiple inputs. You can use it as follows:
import numpy as np
import mxnet as mx

data = {'data1': np.zeros(shape=(10, 2, 2)), 'data2': np.zeros(shape=(20, 2, 2))}
label = {'label1': np.zeros(shape=(10, 1)), 'label2': np.zeros(shape=(20, 1))}
dataiter = mx.io.NDArrayIter(data, label, 3, True, last_batch_handle='discard')
Before that, you will have to convert your non-numeric data into numerical data. This could be in the form of a one-hot vector or some other encoding, depending on the meaning of that variable.
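For instance, a rough sketch of one-hot encoding one categorical column with NumPy (the category values are taken from your sample record; everything else is made up):

import numpy as np

categories = ['Q', 'H2512', 'PE0']                 # known category values
raw_column = np.array(['Q', 'PE0', 'Q', 'H2512'])  # the raw categorical column
index = {c: i for i, c in enumerate(categories)}

# One row per record, one column per category, 1.0 marking the record's value.
one_hot = np.zeros((len(raw_column), len(categories)), dtype=np.float32)
one_hot[np.arange(len(raw_column)), [index[v] for v in raw_column]] = 1.0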
As for the question regarding samples having different lengths, the easiest way would be to bring them all to a common length by padding the shorter ones with 0s.
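Something along these lines, assuming each record has already been converted to a flat numeric vector:

import numpy as np

# Zero-pad variable-length numeric records to the length of the longest one.
records = [np.array([0.0, 1.0, 798.78]), np.array([0.0, 2.0])]
max_len = max(len(r) for r in records)
padded = np.stack([np.pad(r, (0, max_len - len(r))) for r in records])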