How to serialize data in example-in-example format for tensorflow-ranking? - tensorflow

I'm building a ranking model with tensorflow-ranking. I'm trying to serialize a data set in the TFRecord format and read it back at training time.
The tutorial doesn't show how to do this. There is some documentation here on an example-in-example data format but it's hard for me to understand: I'm not sure what the serialized_context or serialized_examples fields are or how they fit into examples and I'm not sure what the Serialize() function in the code block is.
Concretely, how can I write and read data in example-in-example format?

The context is a map from feature name to tf.train.Feature. The examples list is a list of maps from feature name to tf.train.Feature. Once you have these, the following code will create an "example-in-example":
import tensorflow as tf

context = {...}
examples = [{...}, {...}, ...]
serialized_context = tf.train.Example(features=tf.train.Features(feature=context)).SerializeToString()
serialized_examples = tf.train.BytesList()
for example in examples:
    # Serialize each item's features as its own tf.train.Example.
    tf_example = tf.train.Example(features=tf.train.Features(feature=example))
    serialized_examples.value.append(tf_example.SerializeToString())
# Wrap the serialized context and the serialized items in a parent tf.train.Example.
example_in_example = tf.train.Example(features=tf.train.Features(feature={
    'serialized_context': tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized_context])),
    'serialized_examples': tf.train.Feature(bytes_list=serialized_examples)
}))
To read the examples back, you may call
tfr.data.parse_from_example_in_example(
    example_pb,
    context_feature_spec=context_feature_spec,
    example_feature_spec=example_feature_spec)
where context_feature_spec and example_feature_spec are maps from feature name to tf.io.FixedLenFeature or tf.io.VarLenFeature.

First of all, I recommend reading this article to ensure that you know how to create a tf.Example as well as tf.SequenceExample (which by the way, is the other data format supported by TF-Ranking):
Tensorflow Records? What they are and how to use them
In the second part of this article, you will see that a tf.SequenceExample has two components: 1) Context and 2) Sequence (or examples). Example-in-Example implements the same idea. The context is the set of features that are independent of the items you want to rank (a search query in the case of search, or user features in the case of a recommendation system), and the sequence part is a list of items (also called examples). This could be a list of documents (in search) or movies (in recommendation).
Once you are comfortable with tf.Example, Example-in-Example will be easier to understand. Take a look at this piece of code for how to create an EIE instance:
https://www.gitmemory.com/issue/tensorflow/ranking/95/518480361
1) Bundle the context features together in a tf.Example object and serialize it.
2) Bundle the features of each sequence item (example), each of which could contain a list of values, in its own tf.Example object and serialize it too.
3) Wrap these inside a parent tf.Example.
4) (If you're writing to TFRecords) serialize the parent tf.Example object and write it to your TFRecord file.

Related

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary & end results were previously stored in an SQL database using sqlalchemy, but we need to transform it to delta.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes,FileDelta.getSchema())
Where the _parseBytes() method takes a binary stream and outputs a dictionary of variables
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict : vars(Spectrum(inputDict)),SpectrumDelta.getSchema())
However the Spectrum() init method generates multiple Pandas Dataframes as fields.
I'm getting errors as soon as the Executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easy way to make these work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package methods themselves. Is that maybe the solution: to rewrite the entire package and its parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in serialization when the result is output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()) but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to convert it, the UDF works.
import pandas as pd
from pyspark.sql import functions as F

def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    result = {}
    for key, value in vars(spectrum).items():
        # Convert pandas objects to plain Python containers that Spark can serialize.
        if isinstance(value, pd.Series):
            result[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            result[key] = value.to_dict("list")
        else:
            result[key] = value
    return result

spectrumparser = F.udf(getSpectrumDict, SpectrumDelta.getSchema())

Confused about Tensorflow Algorithm function

Colab notebook
Under the section on Feature Columns, there is this specific line of code
feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = dftrain[feature_name].unique()
I'm struggling to understand what this is doing. I don't really know what to search for either, as I'm still quite new to programming. Why is this line needed? I understand that it outputs all unique values of the specified feature_name, but I don't get how it's linked to the next line.
When you don't understand a function just google the module name (TensorFlow) and the function name. I found the documentation for tf.feature_column.categorical_column_with_vocabulary_list described here. To quote the documentation:
Use this when your inputs are in string or integer format, and you have an in-memory vocabulary mapping each value to an integer ID. By default, out-of-vocabulary values are ignored.
This section of code goes through each column and maps each unique string value to a unique integer (its position in the vocabulary list). Transforming a column with this type of mapping is common for categorical data. unique is needed because tf.feature_column.categorical_column_with_vocabulary_list requires a list without duplicates as its argument before it can work its magic.
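To make the mapping concrete, here is a minimal pure-Python sketch of what a vocabulary list does conceptually. It does not call TensorFlow: the helper names build_vocabulary and to_ids and the sample column values are made up for illustration, and returning -1 for out-of-vocabulary values is an assumption standing in for "ignored".

```python
def build_vocabulary(values):
    """Return the unique values in first-seen order."""
    seen = {}
    for v in values:
        seen.setdefault(v, len(seen))
    return list(seen)


def to_ids(values, vocabulary, oov_id=-1):
    """Map each value to its position in the vocabulary list."""
    index = {word: i for i, word in enumerate(vocabulary)}
    # Values missing from the vocabulary get oov_id (an assumed placeholder).
    return [index.get(v, oov_id) for v in values]


column = ["male", "female", "female", "male"]  # hypothetical column values
vocabulary = build_vocabulary(column)
ids = to_ids(column + ["unknown"], vocabulary)
```

Each distinct string ends up with exactly one integer ID, which is why the vocabulary passed in must not contain duplicates.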
In the future please put all necessary code in the question. It should not be required to visit another link to answer your question.

Create new column from existing column in Dataset - Apache Spark Java

I am new to Spark ML and got stuck on a task that requires some data normalization, and there is very little documentation available online for Spark ML in Java. Any help is much appreciated.
Problem Description :
I have a Dataset that contains an encoded URL in a column (ENCODED_URL), and I want to add a new column (DECODED_URL) to the existing Dataset that contains the decoded version of ENCODED_URL.
For example:
Current Dataset
ENCODED_URL
https%3A%2F%2Fmywebsite
New Dataset
ENCODED_URL | DECODED_URL
https%3A%2F%2Fmywebsite | https://mywebsite
I tried using withColumn but had no clue what I should pass as the second argument:
Dataset<Row> newDs = ds.withColumn("new_col",?);
After reading the Spark documentation, I got the idea that it may be possible with SQLTransformer, but I couldn't figure out how to customize it to decode the URL.
This is how I read the data from the CSV file:
Dataset<Row> urlDataset = s_spark.read().option("header", true).csv(CSV_FILE).persist(StorageLevel.MEMORY_ONLY());
A Spark primer
The first thing to know is that Spark Datasets are effectively immutable. Whenever you do a transformation, a new Dataset is created and returned. Another thing to keep in mind is the difference between actions and transformations: actions cause Spark to actually start crunching numbers and compute your DataFrame, while transformations add to the definition of a DataFrame but are not computed until an action is called. An example of an action is DataFrame#count, while an example of a transformation is DataFrame#withColumn. See the full list of actions and transformations in the Spark Scala documentation.
A solution
withColumn allows you to either create a new column or replace an existing column in a Dataset (if the first argument is an existing column's name). The docs for withColumn will tell you that the second argument is supposed to be a Column object. Unfortunately, the Column documentation only describes the methods available on Column objects and does not link to other ways to create them, so it's not your fault that you're at a loss for what to do next.
The thing you're looking for is org.apache.spark.sql.functions#regexp_replace. Putting it all together, your code should look something like this:
...
import org.apache.spark.sql.functions;
Dataset<Row> ds = ... // reading from your csv file
ds = ds.withColumn(
    "decoded_url",
    // "^" anchors the match at the start of the string
    functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://"));
regexp_replace requires that we pass a Column object as the first argument, but nothing requires that the column exist on any particular Dataset: Column objects are basically instructions for how to compute a column and don't contain any real data themselves. To illustrate this principle, we could write the above snippet as:
...
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;
Dataset<Row> ds = ... // reading from your csv file
Column myColExpression = functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://");
ds = ds.withColumn("decoded_url", myColExpression);
If you wanted, you could reuse myColExpression on other datasets that have an encoded_url column.
Suggestion
If you haven't already, you should familiarize yourself with the org.apache.spark.sql.functions class. It's a util class that's effectively the Spark standard lib for transformations.
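Note that the regexp_replace call above only rewrites the one prefix from the example. For comparison, here is a small sketch of full percent-decoding in plain Python using the standard library, outside Spark. It is not part of the answer above; the function name decode_url and the sample rows are hypothetical.

```python
from urllib.parse import unquote


def decode_url(encoded):
    """Percent-decode a URL string; None passes through unchanged."""
    if encoded is None:
        return None
    return unquote(encoded)


rows = ["https%3A%2F%2Fmywebsite", None]  # sample ENCODED_URL values
decoded = [decode_url(r) for r in rows]
```

In Spark, a function like this would typically be wrapped in a UDF, and the resulting Column expression passed as the second argument to withColumn.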

Find a suitable vocabulary database to build a C structure

Let's begin with the question's final purpose: my aim is to build a word-based neural network that takes a basic sentence and selects, for each individual word, the meaning it is supposed to carry in that sentence. It is then going to learn something about the language (for example, the possible correlation between two given words, the probability of finding both in a single sentence, and so on) and at the final stage (after the learning phase) try to build some very simple sentences of its own according to some input.
In order to do this I need some kind of database representing a vocabulary of a given language from which I could extract some information such as word list, definitions, synonyms et cetera. The database should be structured in a way such that I can build C data structures containing the needed information such as
typedef struct _dictEntry DictionaryEntry;
typedef struct _dict Dictionary;
struct _dictEntry {
    const char *word;            // Word string
    const char **definitions;    // Array of definition strings
    DictionaryEntry **synonyms;  // Array of pointers to synonym words
    Dictionary *dictionary;      // Pointer to parent dictionary
};
struct _dict {
    const char *language;        // Language identification string
    int count;                   // Number of elements in the dictionary
    float **correlations;        // Correlation matrix between i-th and j-th entries
    DictionaryEntry *entries;    // Array of dictionary entries
};
or equivalent Obj-C objects.
I know (from Searching the Mac OSX system dictionaries?) that Apple-provided dictionaries are licensed, so I cannot use them to create my data structures.
Basically what I want to do is the following: given an arbitrary word A I want to fetch all the dictionary entries which have a definition containing A and select such definition only. I will then implement some kind of intersection procedure to select the most appropriate definition and synonyms based on the rest of the sentence and build a correlation matrix.
Let me give a little example: let us suppose I type a sentence containing "play"; I want to fetch all the entries (such as "game", "instrument", "actor", etc.) the word "play" can be correlated to and for each of them select the corresponding definition (I don't want for example to extract the "instrument" definition which corresponds to the "tool" meaning since you cannot "play a tool"). I will then select the most appropriate of these definitions looking at the rest of the sentence: if it contains also the word "actor" then I will assign to "play" the meaning "drama" or another suitable definition.
The most basic way to do this is scanning every definition in the dictionary searching for the word "play" so I will need to access all definitions without restrictions and as I understand this cannot be done using the dictionaries located under /Library/Dictionaries. Sadly this work MUST be done offline.
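The scan described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy DICTIONARY contents and the function name definitions_containing are made up, and a real resource would be loaded from a file instead.

```python
import re

# Hypothetical toy dictionary: word -> list of definition strings.
DICTIONARY = {
    "game": ["an activity that one can play for amusement"],
    "instrument": [
        "a device used to play music",
        "a tool used for delicate work",
    ],
    "actor": ["a person who performs in a play or film"],
    "tool": ["a device held in the hand to do a task"],
}


def definitions_containing(word, dictionary):
    """Return {entry: [definitions that mention `word` as a whole word]}."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    matches = {}
    for entry, definitions in dictionary.items():
        hits = [d for d in definitions if pattern.search(d)]
        if hits:
            matches[entry] = hits
    return matches


result = definitions_containing("play", DICTIONARY)
```

Here the "tool" sense of "instrument" is dropped because that definition does not mention "play", which matches the selection described in the example.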
Is there any resource I can download that lets me get my hands on all the definitions and fetch this information? Currently I'm not interested in any particular file format (it could be a database, XML, or anything else), but it must be something I can decompose and put into a data structure. I tried to google it but, whatever keywords I use, if I include the word "vocabulary" or "dictionary" I (pretty obviously) only get pages with the definitions of the other words on some online dictionary site! I guess this is not the best thing to search for...
I hope the question is clear... If it is not I'll try to explain it in a different way! Anyway, thanks in advance to all of you for any helpful information.
A free ontology, such as http://www.eat.rl.ac.uk, would probably help you. Several are available in the university sector.

cacti: Display how much % one data source item has of an other datasource item

I want to create a graph template that displays what percentage one data source item is of another data source item.
I assumed I'd need to use CDEF functions for that, and according to that question, CDEF Function to find % value in Cacti, it isn't even a difficult one.
However, I have no idea how to actually use the given CDEF function within the graph template web interface, how to choose which data source items should serve as input for the CDEF function, or how to use the CDEF function's output as input for drawing a graph item (e.g. of type LINE1).
The documentation doesn't seem to mention such things anywhere, or if it does, I didn't find or understand it.
To find out which data source corresponds to which letter, go to Console -> Graph Management, pick the graph you are working on, and turn on Debug Mode.
What you are looking for are the lines that start with DEF a=, b=, etc.
From there you build the CDEF function using reverse Polish notation, as shown in my question that you referenced above.
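For readers unfamiliar with reverse Polish notation, the following Python sketch shows how such an expression is read. The expression a,b,/,100,* is an assumed example of a percentage CDEF, with a and b standing for the letters found in the DEF lines. The evaluator only illustrates the notation; it is not how Cacti or RRDtool are implemented, and the name eval_rpn and the sample values are hypothetical.

```python
def eval_rpn(expression, variables):
    """Evaluate a comma-separated RPN expression such as 'a,b,/,100,*'."""
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    stack = []
    for token in expression.split(","):
        if token in ops:
            y = stack.pop()  # the right operand was pushed last
            x = stack.pop()
            stack.append(ops[token](x, y))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))
    return stack.pop()


# Hypothetical sample values for the data sources bound to a and b.
percent = eval_rpn("a,b,/,100,*", {"a": 25, "b": 200})
```

Operands are pushed first and each operator consumes the two values before it, so the expression reads as (a / b) * 100.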
To use the value in a graph item (e.g. a LINE), add a new item to the graph template, leave the data source unselected, and select your prebuilt CDEF function.
That should do exactly what you are looking for. In my example I used an AREA, but that is just what was best suited for the graph in question.